[weboob] What library to use to parse HTML pages

Romain Bignon romain at peerfuse.org
Thu Apr 15 09:17:05 CEST 2010


On 15/Apr - 00:43, Christophe Benz wrote:
> Reopening the debate around HTML parsers, after having studied the
> question.
> 
> Here is a benchmark from 2007, made by Ian Bicking, the author of lxml:
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
> http://codespeak.net/lxml/performance.html
> 
> lxml is the most efficient for parsing HTML. And it is packaged for the
> Nokia N900 ;-)
> 
> lxml parses HTML, returning a document compatible with the
> elementtree API:
> http://codespeak.net/lxml/tutorial.html
> 
> lxml provides additional smart crawling methods, like cssselect:
> http://codespeak.net/lxml/cssselect.html
> 
> lxml can also deal with links and forms, like mechanize does:
> http://codespeak.net/lxml/lxmlhtml.html
> 
> lxml has its own HTML parser, but it wraps html5lib too:
> http://codespeak.net/lxml/html5parser.html

OK, but another requirement is the parser's ability to tolerate bad HTML.
What happens when lxml tries to parse a really badly broken HTML document?
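
A quick way to check this is to feed lxml a deliberately broken page. The
snippet below is only a minimal sketch: the BROKEN sample and the
paragraph_texts helper are made up for illustration and are not part of
weboob. It assumes lxml is installed and uses lxml.html.fromstring, which
tries to recover from unclosed tags instead of raising an error.

```python
import lxml.html

# Hypothetical sample: <p> and <b> are never closed, and </html> is missing.
BROKEN = "<html><body><p>First<p>Second <b>bold</body>"


def paragraph_texts(markup):
    """Return the text of every <p> element found in possibly broken HTML."""
    # fromstring() uses lxml's recovering HTML parser and returns the root
    # element of the repaired tree.
    doc = lxml.html.fromstring(markup)
    # findall() is the ElementTree-style search; text_content() is an
    # lxml.html addition that flattens nested tags into plain text.
    return [p.text_content() for p in doc.findall(".//p")]


if __name__ == "__main__":
    for text in paragraph_texts(BROKEN):
        print(repr(text))
```

If the parser recovers as expected, both paragraphs come back as separate
elements even though neither was closed in the source.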

> Now, about APIs, there is the historical DOM API:
> http://docs.python.org/library/xml.dom.html
> but since Python 2.5 there is ElementTree:
> http://docs.python.org/library/xml.etree.elementtree.html
> 
> The aum backend uses the old DOM API, is there any reason?

For historical reasons. When I wrote AuM a year ago, it was the first parser
library I tried.
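
To make the difference between the two APIs concrete, here is a small
sketch that extracts the same links with xml.dom.minidom and with
xml.etree.ElementTree from the standard library. The SAMPLE markup and the
two helper functions are invented for this example and are not taken from
the aum backend; the input is well-formed on purpose, since neither of
these stdlib parsers tolerates broken HTML.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
SAMPLE = "<html><body><a href='/one'>One</a><a href='/two'>Two</a></body></html>"


def links_with_dom(markup):
    """Collect (href, text) pairs using the historical DOM API."""
    doc = minidom.parseString(markup)
    return [
        # The link text lives in a child text node, hence firstChild.data.
        (a.getAttribute("href"), a.firstChild.data)
        for a in doc.getElementsByTagName("a")
    ]


def links_with_etree(markup):
    """Collect (href, text) pairs using the ElementTree API."""
    root = ET.fromstring(markup)
    # Attributes and text are reachable directly on the element.
    return [(a.get("href"), a.text) for a in root.iter("a")]


if __name__ == "__main__":
    print(links_with_dom(SAMPLE))
    print(links_with_etree(SAMPLE))
```

Both helpers return the same pairs; the ElementTree version is shorter
because attributes and text are plain element properties rather than
separate nodes.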

> For the youtube backend, I used lxml with the class LxmlHtmlParser
> provided by the module weboob.tools.parser.

OK. I think that if it can correctly parse bad HTML and if it exposes the same
API as etree, we should use it as StandardParser.

> But for some reasons I put many parsers into the weboob.tools.parser
> module, so each developer can choose the one he or she wants.

Nice. Thanks.

Romain