A couple of my iPhone projects require a decent HTML/XHTML parser. On OS X, Cocoa ships with NSXMLDocument, which can parse dirty HTML using functionality from libtidy. Unfortunately, NSXMLDocument is not part of the actual iPhone 2.2 SDK. It is part of the 2.2 Simulator, though, so code that uses it compiles and runs fine at dev time but breaks when you deploy to a device. That's a big gotcha if you never test against a real iPhone.
NSXMLParser is part of the iPhone SDK. However, its SAX-based (event-driven) parsing is 1) annoying to deal with, and 2) fault-intolerant, meaning that it cannot parse real-world, dirty XHTML. It is not a reasonable alternative.
So, what is a reasonable alternative?
Hpple: a simple XML/HTML parser that works on iPhone
- yliu on May 15, 2009, 10:11 PM UTC
Hpple (linked in the references section) provides an easy interface for tolerant XML/XHTML parsing from Objective-C on the iPhone SDK 2.x and 3.x.
You'll need to add libxml and its headers to your iPhone project, and also pull in the three Hpple classes: TFHpple, TFHppleElement, and XPathQuery. Then it's nice and simple:
TFHpple * doc = [[TFHpple alloc] initWithHTMLData:data];
NSArray * elements = [doc search:@"//a[@class='sponsor']"]; // XPath, woohoo!
Simple as that. If you've worked with Hpricot in Rails, there is a very similar interface. And no need to pay for any licenses.
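To get from the search results to actual values, something like the sketch below should work. It assumes the TFHppleElement accessors listed in the hpple README at the time (content, tagName, attributes, objectForKey:); check them against your copy of the source. SponsorLinks is a hypothetical helper name, and the code uses manual reference counting, as was usual on SDK 2.x/3.x.

```objective-c
#import <Foundation/Foundation.h>
#import "TFHpple.h"
#import "TFHppleElement.h"

// Hypothetical helper: returns the href of every <a class="sponsor"> in the page.
static NSArray *SponsorLinks(NSData *htmlData) {
    TFHpple *doc = [[TFHpple alloc] initWithHTMLData:htmlData];
    NSArray *elements = [doc search:@"//a[@class='sponsor']"];

    NSMutableArray *links = [NSMutableArray array];
    for (TFHppleElement *element in elements) {
        // objectForKey: looks up a single attribute; it is nil if the tag lacks it.
        NSString *href = [element objectForKey:@"href"];
        if (href != nil) {
            [links addObject:href];
        }
        NSLog(@"<%@> %@ -> %@", [element tagName], [element content], href);
    }

    [doc release]; // balances the alloc above; drop this line if you build with ARC
    return links;
}
```

Call it with whatever NSData you downloaded or loaded from disk, e.g. SponsorLinks([NSData dataWithContentsOfFile:path]), where path is your own file path.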
It's classified as an experimental project by the developer, but so far it has "worked for me".
UPDATE: seems to be kinda broken now. Anyone got a better solution?
References used:
Adding the libXML framework to your iPhone App | Wulf
( http://welcome.totheinter.net/2008/03/11/adding-the-libxml-framework-to-your-iphone-app/ ) - found by yliu on May 07, 2009, 01:12 PM UTC
topfunky's hpple at master - GitHub
( http://github.com/topfunky/hpple/tree/master ) - found by yliu on May 07, 2009, 01:15 PM UTC
topfunky's hpple at master - GitHub
http://github.com/topfunky/hpple/tree/master - found by yliu on May 07, 2009, 01:15 PM UTC
now THIS looks promising
Using libtidy for iPhone app - Stack Overflow
http://stackoverflow.com/questions/663211/using-libtidy-for-iphone-app - found by yliu on May 07, 2009, 01:14 PM UTC
there is no tidy header file in the iPhone SDK
gist: 103637 - GitHub
http://gist.github.com/103637 - found by yliu on May 07, 2009, 01:13 PM UTC
example of how to use libxml's htmlParseFile. the in-memory equivalent is htmlParseDoc.
Adding the libXML framework to your iPhone App | Wulf
http://welcome.totheinter.net/2008/03/11/adding-the-libxml-framework-to-your-iphone-app/ - found by yliu on May 07, 2009, 01:12 PM UTC
adding libxml to an iPhone application
parsing HTML on the iPhone - Stack Overflow
http://stackoverflow.com/questions/405749/parsing-html-on-the-iphone - found by yliu on May 07, 2009, 01:11 PM UTC
libxml/HTMLParser does ship with iPhone SDK. Unfortunately it doesn't parse my piece of code -- blows up with a parse error.
Element Parser « Touch Tank
http://touchtank.wordpress.com/element-parser/ - found by yliu on May 07, 2009, 01:10 PM UTC
ElementParser seems to be what I need. Unfortunately it costs money to use for non-GPL apps
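For comparison, here is a rough sketch of calling libxml's HTML parser directly, which is roughly the kind of work Hpple's XPathQuery class wraps. Note that it uses htmlReadMemory rather than the htmlParseFile/htmlParseDoc functions from the gist above, because htmlReadMemory accepts parser options such as HTML_PARSE_RECOVER. HrefsInHTML is a hypothetical helper name, and there is no guarantee that recovery mode copes with every piece of dirty markup (including the one that blew up for me).

```objective-c
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper (not part of hpple): collects the href of every <a href="..."> element.
static NSArray *HrefsInHTML(NSData *htmlData) {
    NSMutableArray *hrefs = [NSMutableArray array];

    // RECOVER keeps parsing past malformed markup; NOERROR/NOWARNING keep the console quiet.
    int options = HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING;
    htmlDocPtr doc = htmlReadMemory([htmlData bytes], (int)[htmlData length],
                                    NULL, NULL, options);
    if (doc == NULL) {
        return hrefs; // nothing could be parsed at all
    }

    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    if (context != NULL) {
        xmlXPathObjectPtr result =
            xmlXPathEvalExpression((const xmlChar *)"//a[@href]", context);
        if (result != NULL) {
            xmlNodeSetPtr nodes = result->nodesetval; // may be NULL when nothing matches
            int count = (nodes != NULL) ? nodes->nodeNr : 0;
            for (int i = 0; i < count; i++) {
                // xmlGetProp returns a copy that the caller must xmlFree.
                xmlChar *href = xmlGetProp(nodes->nodeTab[i], (const xmlChar *)"href");
                if (href != NULL) {
                    NSString *value = [NSString stringWithUTF8String:(const char *)href];
                    if (value != nil) {
                        [hrefs addObject:value];
                    }
                    xmlFree(href);
                }
            }
            xmlXPathFreeObject(result);
        }
        xmlXPathFreeContext(context);
    }
    xmlFreeDoc(doc);
    return hrefs;
}
```

This needs libxml2 linked and its header search path set up in the project, as described in the "Adding the libXML framework" link.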
Comments
You could also check HyperParser. It's a simple HTML parser that has API similar to NSXMLParser. Designed specially to parse semi-valid HTML. http://www.dimzzy.com/index.php?page=hyper-parser
— dimzzy on August 11, 2009, 12:40 AM UTC
What is the string (//a[@class='sponsor']) you passed to the search function?
— rev on September 04, 2009, 11:52 AM UTC
It's an XPath expression for finding nodes in XHTML ( http://www.w3schools.com/XPath/xpath_syntax.asp ).
— yliu on October 30, 2009, 01:35 AM UTC