#42: HTML or XHTML Parser for iPhone SDK 2.x

Solved!
A couple of my iPhone projects require a decent HTML/XHTML parser. On OS X, Cocoa ships with NSXMLDocument, which can parse dirty HTML using functionality from libtidy. Unfortunately, NSXMLDocument is not part of the actual iPhone 2.2 SDK, though it is part of the 2.2 Simulator. Code that uses it will compile just fine at dev time but break when deploying, which is a big gotcha if you never test against a real iPhone.

NSXMLParser is a part of the iPhone SDK. However, the SAX-based parsing is 1) annoying to deal with, and 2) fault-intolerant, meaning that it cannot parse real-world, dirty XHTML. This is not a reasonable alternative.

So, what is a reasonable alternative?

Hpple: a simple XML/HTML parser that works on iPhone

As linked in the reference section, hpple provides an easy interface for tolerant XML/XHTML parsing from Objective-C on the iPhone SDK 2.x and 3.x.

You'll need to add libxml and its headers to your iPhone project, and also pull in the three Hpple classes: TFHpple, TFHppleElement, and XPathQuery. Then it's nice and simple:

TFHpple * doc = [[TFHpple alloc] initWithHTMLData:data];
NSArray * elements = [doc search:@"//a[@class='sponsor']"]; // XPath, woohoo!

Simple as that. If you've worked with Hpricot in Rails, there is a very similar interface. And no need to pay for any licenses.

It's classified as an experimental project by the developer, but so far it has "worked for me".

UPDATE: seems to be kinda broken now. Anyone got a better solution?

Comments

  1. You could also check out HyperParser. It's a simple HTML parser with an API similar to NSXMLParser, designed specifically to parse semi-valid HTML. http://www.dimzzy.com/index.php?page=hyper-parser

    dimzzy on August 11, 2009, 12:40 AM UTC
  2. What is the string (//a[@class='sponsor']) you passed to the search function?

    rev on September 04, 2009, 11:52 AM UTC
  3. It's an XPath expression for finding nodes in XHTML ( http://www.w3schools.com/XPath/xpath_syntax.asp ).

    yliu on October 30, 2009, 01:35 AM UTC


topfunky's hpple at master - GitHub

http://github.com/topfunky/hpple/tree/master - found by yliu on May 07, 2009, 01:15 PM UTC

now THIS looks promising

Tags: hpple HTML parser

Using libtidy for iPhone app - Stack Overflow

http://stackoverflow.com/questions/663211/using-libtidy-for-iphone-app - found by yliu on May 07, 2009, 01:14 PM UTC

There is no tidy header file in the iPhone SDK.

Tags: iPhone libtidy tidy HTML

gist: 103637 - GitHub

http://gist.github.com/103637 - found by yliu on May 07, 2009, 01:13 PM UTC

An example of how to use libxml's htmlParseFile. The in-memory equivalent is htmlParseDoc.

Tags: libxml HTML htmlParseDoc htmlParseFile

Adding the libXML framework to your iPhone App | Wulf

http://welcome.totheinter.net/2008/03/11/adding-the-libxml-framework-to-your-iphone-app/ - found by yliu on May 07, 2009, 01:12 PM UTC

adding libxml to an iPhone application

Tags: libxml Xcode

parsing HTML on the iPhone - Stack Overflow

http://stackoverflow.com/questions/405749/parsing-html-on-the-iphone - found by yliu on May 07, 2009, 01:11 PM UTC

libxml/HTMLParser does ship with iPhone SDK. Unfortunately it doesn't parse my piece of code -- blows up with a parse error.

Tags: libxml parser

Element Parser « Touch Tank

http://touchtank.wordpress.com/element-parser/ - found by yliu on May 07, 2009, 01:10 PM UTC

ElementParser seems to be what I need. Unfortunately, it costs money to use for non-GPL apps.

Tags: ElementParser Cocoa commercial license