[weboob] Which library to use to parse HTML pages

Christophe Benz christophe.benz at gmail.com
Thu Apr 15 11:27:57 CEST 2010


On Thu, 15 Apr 2010 09:17:05 +0200,
Romain Bignon <romain at peerfuse.org> wrote:

> On 15/Apr - 00:43, Christophe Benz wrote:
> > Reopening the debate around HTML parsers, after having studied the
> > question.
> > 
> > Here is a benchmark from 2008, made by Ian Bicking, the author of
> > lxml.html:
> > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
> > http://codespeak.net/lxml/performance.html
> > 
> > lxml is the most efficient for parsing HTML. And it is packaged for
> > the Nokia N900 ;-)
> > 
> > lxml parses HTML, returning a document compatible with the
> > elementtree API:
> > http://codespeak.net/lxml/tutorial.html
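
For instance, something like this should work (an untested sketch; the
markup is made up):

    import lxml.html

    doc = lxml.html.fromstring(
        '<html><body><p class="intro">Hello</p></body></html>')

    # The result supports the familiar ElementTree-style API.
    for p in doc.findall('.//p'):
        print p.get('class'), p.text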
> > 
> > lxml provides additional convenient selection methods, like
> > cssselect:
> > http://codespeak.net/lxml/cssselect.html
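
For example (untested sketch, made-up markup):

    import lxml.html

    doc = lxml.html.fromstring(
        '<div class="content"><a href="/a">one</a><a href="/b">two</a></div>')

    # CSS selectors instead of XPath expressions.
    for link in doc.cssselect('div.content a'):
        print link.get('href'), link.text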
> > 
> > lxml can also deal with links and forms, like mechanize does:
> > http://codespeak.net/lxml/lxmlhtml.html
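
Roughly like this, if I read the documentation correctly (untested
sketch, made-up markup and URLs):

    import lxml.html

    doc = lxml.html.fromstring(
        '<a href="/next">next</a>'
        '<form action="/login"><input name="user" /></form>',
        base_url='http://example.com/')

    # Rewrite every relative link against the base URL.
    doc.make_links_absolute()
    for element, attribute, link, pos in doc.iterlinks():
        print attribute, link

    # Inspect and fill the first form, like mechanize would.
    form = doc.forms[0]
    form.fields['user'] = 'romain'
    print form.action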
> > 
> > lxml has its own HTML parser, but it wraps html5lib too:
> > http://codespeak.net/lxml/html5parser.html
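
Using html5lib through lxml seems to be as simple as this (untested;
it requires the html5lib package to be installed):

    # Assumes html5lib is available.
    from lxml.html import html5parser

    doc = html5parser.fromstring('<p>some <b>broken page')
    print doc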
> 
> OK, but another requirement is the parser's ability to tolerate bad
> HTML. What happens when lxml tries to parse a really broken HTML
> document?

According to the documentation, lxml can deal with bad HTML:
http://codespeak.net/lxml/lxmlhtml.html#really-broken-pages
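
For example, I would expect something like this to come out repaired
(untested sketch):

    import lxml.html

    # Unclosed tags and missing <html>/<body>: lxml fixes up the tree.
    broken = '<p>text<table><td>cell</p>'
    doc = lxml.html.fromstring(broken)
    print lxml.html.tostring(doc)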

But if the HTML is really broken, lxml can fall back on BeautifulSoup
through lxml.html.soupparser, which mimics the ElementSoup module.
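
Something like this, adapted from the lxml documentation (untested;
BeautifulSoup must be installed):

    from lxml.html import soupparser

    tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'
    root = soupparser.fromstring(tag_soup)
    print root.tag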

BeautifulSoup can be used at different levels:
* for parsing the whole HTML document
* only for encoding detection, since BeautifulSoup has better
  detection heuristics (see the sketch after this list)
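
The encoding-detection part would look roughly like this (untested
sketch, following the approach suggested in the lxml documentation;
'page.html' is any sample file):

    from BeautifulSoup import UnicodeDammit
    import lxml.html

    raw = open('page.html').read()

    # Let BeautifulSoup guess the encoding, then hand clean unicode
    # to the fast lxml parser.
    converted = UnicodeDammit(raw, isHTML=True)
    print converted.originalEncoding
    doc = lxml.html.fromstring(converted.unicode)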

Note that parsing through soupparser is slower than using lxml.html's
native parser.

So, to summarize, if we leave the DOM API aside, we can use several
combinations of parser and ElementTree implementation. Here is a list,
from slowest to fastest:
* HTMLParser.HTMLParser + xml.etree.(c)ElementTree
* elementtidy.TidyHTMLTreeBuilder + xml.etree.(c)ElementTree
* BeautifulSoup.HTMLParser + xml.etree.(c)ElementTree
* lxml.html.soupparser + lxml.etree.ElementTree
* lxml.html.HTMLParser + lxml.etree.ElementTree
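
To check those numbers on our own pages, a quick-and-dirty timing of
the two lxml-based options could look like this (untested sketch;
'page.html' is any sample file):

    import time
    import lxml.html
    from lxml.html import soupparser

    html = open('page.html').read()

    for name, parse in [('lxml.html', lxml.html.fromstring),
                        ('soupparser', soupparser.fromstring)]:
        start = time.time()
        for i in xrange(50):
            parse(html)
        print name, time.time() - start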

> > Now, about APIs, there is the historical DOM API:
> > http://docs.python.org/library/xml.dom.html
> > but since Python 2.5 there is ElementTree:
> > http://docs.python.org/library/xml.etree.elementtree.html
> > 
> > The aum backend uses the old DOM API; is there a reason for that?
> 
> For historical reasons. When I wrote AuM a year ago, the DOM API was
> the first parsing library I tried.
> 
> > For the youtube backend, I used lxml through the LxmlHtmlParser
> > class provided by the weboob.tools.parser module.
> 
> OK. I think that if it can correctly parse bad HTML and offers the
> same API as etree, we should use it as the StandardParser.

Great!

> > But for various reasons I put several parsers into the
> > weboob.tools.parser module, so each developer can choose the one
> > they prefer.
> 
> Nice. Thanks.

-- 
Christophe Benz
http://cbenz.pointique.org


