NSXMLParser not delegating parsing task properly to its delegate

I'm building an RSS feed reader in which the task of parsing the XML data is delegated to two separate objects. As the XML tree is walked, parsing starts with the RSSContent object, which builds the title of each article by collecting the appropriate character strings, and then moves to the RSSContentArticle object, which pieces together the web links to the actual articles character by character.

Delegation starts once the connection to the RSS feed's web service is established and a ListViewController object is made the delegate of the NSXMLParser object. When the parser reaches specific XML tags of interest, parsing is re-delegated to the next object (ListViewController -> RSSContent -> RSSContentArticle), and it is handed back to the previous delegate as we keep walking out of the nested element tags, one element at a time (RSSContentArticle -> RSSContent -> ListViewController).

However, I can't get this flow to work: the parsing does not move from RSSContent to RSSContentArticle through re-delegation the way I want it to.

Here is the code-


ListViewController.h

#import <UIKit/UIKit.h>
@class RSSChannel, WebViewController, RSSContent;
@interface ListViewController : UITableViewController <NSXMLParserDelegate, UITableViewDelegate, UITableViewDataSource, NSURLConnectionDataDelegate>
{
    NSURLConnection *connection;
    NSMutableData *xmlData;
    RSSChannel *channel;
    NSMutableArray *contentCollection;
    NSMutableString *currentString;
}
@property (nonatomic, strong) WebViewController *webViewController;
@property (nonatomic, strong) RSSContent *content;
@property (nonatomic, strong) NSMutableString *cellTitle;

-(void)fetchEntries;
@end

ListViewController.m

#import "ListViewController.h"
#import "RSSContent.h"
#import "RSSChannel.h"
#import "RSSContentArticle.h"
#import "RSSItem.h"
#import "WebViewController.h"
@interface ListViewController ()

@end

@implementation ListViewController
@synthesize webViewController, content;

-(BOOL)shouldAutorotateToInterfaceOrientation:(UIInterfaceOrientation)io{
    if ([[UIDevice currentDevice] userInterfaceIdiom]== UIUserInterfaceIdiomPad) {
        return YES;
    }
    return io== UIInterfaceOrientationPortrait;
}
- (id)initWithStyle:(UITableViewStyle)style
{
    self = [super initWithStyle:style];
    if (self) {
        // Custom initialization
        NSLog(@"ListViewcontroller init..%@ %@", self.tableView.dataSource, self.tableView.delegate);
        
        [self fetchEntries];
        contentCollection= [[NSMutableArray alloc] init];
    }
    return self;
}

- (void)viewDidLoad
{
    [super viewDidLoad];
    
    // Uncomment the following line to preserve selection between presentations.
    // self.clearsSelectionOnViewWillAppear = NO;
    
    // Uncomment the following line to display an Edit button in the navigation bar for this view controller.
    // self.navigationItem.rightBarButtonItem = self.editButtonItem;
}

-(void)fetchEntries{
    NSLog(@"%@", NSStringFromSelector(_cmd));
    xmlData= [[NSMutableData alloc] init];
    
    NSURL *url= [NSURL URLWithString:@"https://www.apple.com/pr/feeds/pr.rss"];
    NSURLRequest *req= [NSURLRequest requestWithURL:url];
    connection= [[NSURLConnection alloc] initWithRequest:req delegate:self startImmediately:YES];//self has been made NSURLConnection's delegate
}

-(void)connection:(NSURLConnection *)conn didReceiveData:(NSData *)data{
    NSLog(@"%@", NSStringFromSelector(_cmd));
    //    Add the incoming chunk of data to the container we are keeping
    //    The data always comes in the correct order
    [xmlData appendData:data];
}

-(void)connectionDidFinishLoading:(NSURLConnection *)conn{
    NSLog(@"%@", NSStringFromSelector(_cmd));
    //    We are just checking to make sure we are getting the XML
    NSString *xmlCheck= [[NSString alloc] initWithData:xmlData encoding:NSUTF8StringEncoding];
    NSLog(@"xmlCheck= %@",xmlCheck);
    
    NSXMLParser *parser=[[NSXMLParser alloc] initWithData:xmlData];
    [parser setDelegate:self];
    NSLog(@"parsing initiated");
    [parser parse];
    
    xmlData=nil;
    connection=nil;
    [self.tableView reloadData];
    WSLog(@"channel test- %@\n %@\n %@\n",channel, [channel title], [channel infoString]);
    
}

-(void)connection:(NSURLConnection *)conn didFailWithError:(NSError *)error{
    connection=nil;
    xmlData=nil;
    NSString *errorString= [NSString stringWithFormat:@"Fetch failed: %@",[error localizedDescription]];
    UIAlertView *av= [[UIAlertView alloc] initWithTitle:@"Error" message:errorString delegate:nil cancelButtonTitle:@"OK" otherButtonTitles:nil];
    [av show];
}

-(void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict{
    NSLog(@"%@ found a %@ element", self, elementName);
    
    if ([elementName isEqual:@"channel"]) { // element starts
        //  If the parser saw a channel, create a new object, have the ivar- 'channel' point to it.
        channel= [[RSSChannel alloc] init];
        
        //  Give the channel object a pointer back to ourselves for later.
        channel.parentParserDelegate= self;
        
        //  Set the parser's delegate to the channel object
        //  There will be a warning here, ignore it for now
        parser.delegate= channel;
        
    }
    else if ([elementName isEqual:@"entry"]) {
        
        content= [[RSSContent alloc] init];
        
//      Give the content object a pointer back to ourselves for later.
        content.parentParserDelegate= self;
        
        parser.delegate= content;
        
        [contentCollection addObject:content];
    }
}


RSSContent.h

#import <Foundation/Foundation.h>
@class RSSContentArticle;
@interface RSSContent : NSObject <NSXMLParserDelegate>
{
    NSMutableString *currentString;
}
@property (nonatomic, weak)id parentParserDelegate;
@property (nonatomic, strong)NSString *title;
@property (nonatomic, strong)NSString *link;

@property (nonatomic, strong)RSSContentArticle *article;
@end

RSSContent.m

#import "RSSContent.h"
#import "RSSContentArticle.h"
@implementation RSSContent
@synthesize parentParserDelegate, title, link, article;

-(void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict{
    
    NSLog(@"\t%@ found a %@ element",self,elementName);
    if ([elementName isEqual:@"title"]) {
        currentString= [[NSMutableString alloc] init];
        title= currentString;
    }
    else if([elementName isEqual:@"content"]){
        article= [[RSSContentArticle alloc] init];
        parser.delegate= article;

        article.parentParserDelegate= self;
        
    }
}
-(void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock{
    NSString *string= [[NSString alloc] initWithData:CDATABlock encoding:NSUTF8StringEncoding];
    NSLog(@"\tfound CDATA within content- %@",string);

    [currentString appendString:string];
}
-(void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName{
    NSLog(@"\tcontent ended");
    currentString= nil;
    if ([elementName isEqual:@"entry"]) {
        parser.delegate= parentParserDelegate;
    }
}
@end

RSSContentArticle.h

#import <Foundation/Foundation.h>

@interface RSSContentArticle : NSObject
{
    NSMutableString *currentString;
}
@property (nonatomic, weak)id parentParserDelegate;
@property (nonatomic, strong)NSString *link;
@end

RSSContentArticle.m

#import "RSSContentArticle.h"

@implementation RSSContentArticle
@synthesize link, parentParserDelegate;
-(void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary *)attributeDict{
    NSLog(@"\t\t%@ found a %@ element",self,elementName);
    if ([elementName isEqual:@"a"]) {
        currentString= [[NSMutableString alloc] init];
        link= currentString;
    }
}

-(void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string{
    NSLog(@"\t\tfound character(s) within contentArticle- %@",string);
    [currentString appendString:string];
    NSLog(@"\t\tcurrentString- %@", currentString);
}
-(void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName{
    NSLog(@"\t\tcontent ended");
    currentString= nil;
    if ([elementName isEqual:@"content"]) {
        parser.delegate= parentParserDelegate;
    }
}
@end

When the code runs, control never reaches -(void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName attributes:(NSDictionary<NSString *,NSString *> *)attributeDict in RSSContentArticle.m. In other words, the RSSContentArticle object is never sent this message, even though RSSContent.m makes it the parser's delegate when the element name is "content". I have been stuck on this issue for a while now. Can someone please look into this and advise me on how to go about solving it? I'd be thankful.

Accepted Reply

OK. I see the problem. I was looking at the feed data using an XML editor. Those "a" nodes don't exist within the context of the overall XML structure of this document. They are inside "CDATA" qualifiers. Therefore, they are hidden from the parser. The parser will treat everything inside CDATA as a stream of text.
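To see this in isolation, here is a minimal sketch, not the feed's real data. The class name EventLogger, the function ParseAndLog and the example.com URL are made up for illustration. It feeds NSXMLParser a tiny entry whose "content" holds an <a> tag inside CDATA and records which callbacks fire.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper class for this sketch: it only records what the parser reports.
@interface EventLogger : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray<NSString *> *startedElements;
@property (nonatomic, strong) NSMutableString *cdataText;
@end

@implementation EventLogger
- (instancetype)init {
    if ((self = [super init])) {
        _startedElements = [NSMutableArray array];
        _cdataText = [NSMutableString string];
    }
    return self;
}

// Called once per real XML element. Markup inside CDATA never arrives here.
- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    [self.startedElements addObject:elementName];
}

// Called with the raw bytes of a CDATA section, markup and all.
- (void)parser:(NSXMLParser *)parser foundCDATA:(NSData *)CDATABlock {
    NSString *text = [[NSString alloc] initWithData:CDATABlock encoding:NSUTF8StringEncoding];
    if (text) { [self.cdataText appendString:text]; }
}
@end

// Hypothetical convenience function: parses the string and returns the filled-in logger.
static EventLogger *ParseAndLog(NSString *xml) {
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:[xml dataUsingEncoding:NSUTF8StringEncoding]];
    EventLogger *logger = [[EventLogger alloc] init];
    parser.delegate = logger;
    [parser parse];
    return logger;
}

int main(void) {
    @autoreleasepool {
        // Made-up sample entry; the <a> tag sits inside a CDATA section.
        NSString *xml = @"<entry><content type=\"html\">"
                        @"<![CDATA[<a href=\"https://example.com/article\">Read</a>]]>"
                        @"</content></entry>";
        EventLogger *logger = ParseAndLog(xml);
        NSLog(@"elements seen: %@", logger.startedElements);
        NSLog(@"CDATA text: %@", logger.cdataText);
    }
    return 0;
}
```

Built as a command-line tool against Foundation, this should list only "entry" and "content" as elements, while the <a ...> markup turns up as plain text in the CDATA callback.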


This is what I mean about RSS being junk. The "content" node is very loosely defined in the Atom "standard". It is allowed (https://tools.ietf.org/html/rfc4287#page-14) to have text, html, or xhtml content, and more. Your parsing logic has to handle all of those cases. How you handle them is for you to code. I can't give you any advice on how to do that other than writing special case after special case after special case. It is just ridiculously complicated. And to be technical, you haven't even started looking at RSS yet. This is the Atom standard, which is different from the RSS standard(s).
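As one example of such a special case, here is a rough sketch for the html flavour only, assuming the text of the "content" node has already been collected into a string. The function name FirstHrefInHTMLFragment and the sample fragment are made up, and a regular expression is a crude stand-in for a real HTML parser. With type="xhtml" the markup is real XML, so there the <a> elements would arrive through didStartElement instead.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of any Apple API: returns the value of the first
// href attribute on an <a> tag in an HTML fragment, or nil if none is found.
static NSString *FirstHrefInHTMLFragment(NSString *html) {
    if (html.length == 0) { return nil; }

    NSError *error = nil;
    // Matches <a ... href="..."> or <a ... href='...'> and captures the URL text.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<a\\s[^>]*href\\s*=\\s*[\"']([^\"']+)[\"']"
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (!regex) { return nil; }

    NSTextCheckingResult *match = [regex firstMatchInString:html
                                                    options:0
                                                      range:NSMakeRange(0, html.length)];
    if (!match) { return nil; }

    NSRange urlRange = [match rangeAtIndex:1];
    if (urlRange.location == NSNotFound) { return nil; }
    return [html substringWithRange:urlRange];
}

int main(void) {
    @autoreleasepool {
        // Made-up fragment of the kind that might be collected from a "content" node.
        NSString *fragment = @"<p>News: <a href=\"https://example.com/article\">Read more</a></p>";
        NSLog(@"link: %@", FirstHrefInHTMLFragment(fragment));
    }
    return 0;
}
```

The returned string could then be turned into an NSURL and loaded in the web view. Feeds with unquoted attributes or broken markup will slip past this pattern, which is exactly the kind of special case mentioned above.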


Please don't join the crowd and blame this on XML. XML has standards that allow documents to be very rigorously defined. But there is no law to force people to use those definitions, or to design logical documents, or to even pick a single standard. You have an awful lot of work ahead of you. Your parser will break on a regular basis as people try it on new feeds. You will have trouble getting people to use it because they have their own favourite RSS readers that already have years of code hacks to handle all of these cases.

Replies

It doesn't work that way. Once you start a "content" node and change the delegate, there aren't any other child nodes to process. Therefore, that method is never going to get called.


I have dabbled in RSS in the past. I wouldn't recommend it. RSS is ridiculously complex and has no standardization at all. Furthermore, the content is often not valid XML. Most feeds are computer-generated from HTML content from some CMS. That embedded HTML is just fragments and often not even syntactically correct HTML.

The strange thing is that the -(void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string message does get sent to the RSSContentArticle object after re-delegation. So I'm guessing this happens because the rest of the content is treated as part of the "content" node only. What you said is that the "content" node is considered the last child node; at least that's how NSXMLParser parses it. I'm still at a loss to fully understand how this works, because after the "content" element there is yet another element, "a" (attribute element). So am I to assume that the "attribute" element is not parsed as a legitimate node?

That is because the parser hasn't started parsing the content (as in data, not name) of the node yet. Then you change the delegate, so the content gets routed to the new delegate now.
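A small sketch of that handoff, using made-up class names (ParentDelegate, ChildDelegate), a made-up function RunHandoffDemo and a made-up two-element document rather than your classes or the real feed. The parent swaps in the child when "content" starts, so the text and the end of "content" go to the child, and the child hands control back.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical child delegate: collects the text of the "content" node.
@interface ChildDelegate : NSObject <NSXMLParserDelegate>
@property (nonatomic, weak) id<NSXMLParserDelegate> parentParserDelegate;
@property (nonatomic, strong) NSMutableString *text;
@end

@implementation ChildDelegate
- (instancetype)init {
    if ((self = [super init])) { _text = [NSMutableString string]; }
    return self;
}
// May be called several times for one node, so always append.
- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [self.text appendString:string];
}
- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    // Hand control back once the node we were created for is finished.
    if ([elementName isEqualToString:@"content"]) {
        parser.delegate = self.parentParserDelegate;
    }
}
@end

// Hypothetical parent delegate: records which element ends it gets to see.
@interface ParentDelegate : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) ChildDelegate *child;
@property (nonatomic, strong) NSMutableArray<NSString *> *endedElements;
@end

@implementation ParentDelegate
- (instancetype)init {
    if ((self = [super init])) { _endedElements = [NSMutableArray array]; }
    return self;
}
- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributeDict {
    if ([elementName isEqualToString:@"content"]) {
        self.child = [[ChildDelegate alloc] init];
        self.child.parentParserDelegate = self;
        parser.delegate = self.child;   // later events go to the child
    }
}
- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    [self.endedElements addObject:elementName];
}
@end

// Parses a made-up document and returns the parent so its state can be inspected.
static ParentDelegate *RunHandoffDemo(void) {
    NSString *xml = @"<entry><content>hello</content><link/></entry>";
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:[xml dataUsingEncoding:NSUTF8StringEncoding]];
    ParentDelegate *parent = [[ParentDelegate alloc] init];
    parser.delegate = parent;
    [parser parse];
    return parent;
}

int main(void) {
    @autoreleasepool {
        ParentDelegate *parent = RunHandoffDemo();
        NSLog(@"child collected: %@", parent.child.text);
        NSLog(@"parent saw the end of: %@", parent.endedElements);
    }
    return 0;
}
```

The parent never sees the end of "content" because the child is the delegate at that point. It only sees the elements that close after the child has handed control back.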


I'm not saying that the "content" node is considered as the last child node. I checked the feed data and none of "content" nodes have any child nodes. That is a fact of the structure. I wouldn't necessarily consider that guaranteed though. RSS is a royal mess.


Or are you talking about this "content" node being the last child node with respect to the "link" nodes that come after it? Your "didEndElement" method resets the delegate back to the original at the end of the "content" node.


I don't know what you mean about "a" nodes or attributes in this document. I checked the feed data a couple of days ago. There were no "a" nodes or attributes. This is XML; you have to have a clear understanding of what a "node" is with respect to "elements" and "attributes". I recommend a nice tool called Xmplify to view XML files.


But again, RSS is total junk. There have to be dozens of other RSS readers on the market, many of them with large and loyal customer bases. Those apps have been around for a long time and already have all the messy code necessary to parse real-world RSS feeds that you haven't encountered yet. If you are just interested in Apple feeds, that, too, is dangerous. This is one of the rare RSS feeds that Apple hasn't discontinued yet. Apple has its own push notifications. It is pretty much guaranteed that whenever Apple overhauls a particular web site or service, they will eliminate any RSS feeds.

Your reply- 'Or are you talking about this "content" node being the last child node with respect to the "link" nodes that come after it? Your "didEndElement" method resets the delegate back to the original at the end of the "content" node.'


Not exactly. I was talking about the <a href…> element that starts after the "content" node starts and ends before the "content" node ends. The "link" node starts and ends outside of the "content" node, so it isn't nested within it. Moreover, the "link" nodes in my case do not contain links to the actual articles but to the images that go with them, so they are largely irrelevant here. The delegate is reset to the RSSContent object when the RSSContentArticle object receives the "didEndElement" message. From what you have mentioned above, I understand that there is no such thing as an <a> node. What I'm trying to do is pick up the web links to the individual articles, which are wrapped in the <a> element, and present them in a UIWebView.

Your reply- Those "a" nodes don't exist within the context of the overall XML structure of this document. They are inside "CDATA" qualifiers. Therefore, they are hidden from the parser. The parser will treat everything inside CDATA as a stream of text.


Now I get why the parser wouldn't pick up the <a> element as an individual node. OK then, I guess my investigation ends here. I'll go through some more useful articles like the one you cited above to get a better understanding of XML parsing. Thank you for the assist and all the useful references.