[weboob] Which library to use to parse HTML pages

Christophe Benz christophe.benz at gmail.com
Thu Apr 15 00:43:49 CEST 2010


Hi,

Reopening the debate around HTML parsers, after having studied the
question.

Here is a benchmark from 2008 by Ian Bicking, the author of lxml.html:
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
http://codespeak.net/lxml/performance.html

According to these benchmarks, lxml is the fastest at parsing HTML. And
it is packaged for the Nokia N900 ;-)

lxml parses HTML and returns a document compatible with the
ElementTree API:
http://codespeak.net/lxml/tutorial.html

lxml provides additional smart crawling methods, like cssselect:
http://codespeak.net/lxml/cssselect.html

lxml can also deal with links and forms, like mechanize does:
http://codespeak.net/lxml/lxmlhtml.html
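
To make the points above concrete, here is a minimal sketch using
lxml.html. It assumes lxml is installed (it is a third-party package);
the sample page, the example.com URLs and the field name "q" are made up
for illustration and are not taken from any weboob backend.

```python
from lxml import html  # third-party package, not in the standard library

# Hypothetical page; the <p> is deliberately left unclosed.
PAGE = """
<html><body>
  <p>Latest videos
  <a href="/watch?v=1">First</a>
  <a href="/watch?v=2">Second</a>
  <form action="/search" method="get">
    <input type="text" name="q" value="">
  </form>
</body></html>
"""

doc = html.fromstring(PAGE)

# ElementTree-style navigation: findall() with a path expression.
titles = [a.text for a in doc.findall(".//a")]

# Link handling: rewrite relative URLs against a base URL.
doc.make_links_absolute("http://example.com/")
hrefs = [a.get("href") for a in doc.findall(".//a")]

# Form handling: fill in a field and read back what would be submitted.
form = doc.forms[0]
form.fields["q"] = "lxml"
submitted = dict(form.form_values())
```

The sketch only reads the form values; actually submitting the form
would still need an HTTP client of your choice.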

lxml has its own HTML parser, but it wraps html5lib too:
http://codespeak.net/lxml/html5parser.html

Now, about APIs, there is the historical DOM API:
http://docs.python.org/library/xml.dom.html
but since Python 2.5 there is ElementTree:
http://docs.python.org/library/xml.etree.elementtree.html
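
To compare the two APIs, here is a small sketch that extracts the same
links once with xml.dom.minidom and once with xml.etree.ElementTree. The
markup and the function names are invented for the example, and the
input is well-formed on purpose, since both standard-library parsers
expect valid XML.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
MARKUP = """<ul>
  <li><a href="/profile/1">alice</a></li>
  <li><a href="/profile/2">bob</a></li>
</ul>"""


def links_with_dom(markup):
    """Collect (href, text) pairs using the DOM API."""
    dom = xml.dom.minidom.parseString(markup)
    result = []
    for node in dom.getElementsByTagName("a"):
        # Text lives in child text nodes and has to be joined by hand.
        text = "".join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE)
        result.append((node.getAttribute("href"), text))
    return result


def links_with_etree(markup):
    """Collect (href, text) pairs using the ElementTree API."""
    root = ET.fromstring(markup)
    # Attributes and leading text are directly available on each element.
    return [(a.get("href"), a.text) for a in root.iter("a")]
```

Both functions return the same list of pairs; the ElementTree version is
simply shorter because text and attributes are plain Python values.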

The aum backend uses the old DOM API. Is there any reason for that?

For the youtube backend, I used lxml with the class LxmlHtmlParser
provided by the module weboob.tools.parser.

However, for various reasons I put several parsers into the
weboob.tools.parser module, so each developer can choose the one he or
she prefers.

Christophe


On Wed, 31 Mar 2010 18:07:24 +0200,
Romain Bignon <romain at peerfuse.org> wrote:

> Hi,
> 
> Historically, the 'AuM' backend uses html5lib to parse HTML pages.
> 
> This library has serious performance problems, and another issue is
> that it is not packaged on every system (for example, juke tells me
> that it is not on the Nokia N900 cell phone).
> 
> I propose to use xml.dom.minidom instead, a lightweight implementation
> of DOM. It is part of the standard library, so it probably performs
> well, is probably better supported, and is available on every
> system with Python.
> 
> The only potential problem is: how tolerant is it of malformed HTML?
> 
> So I'll try to run some tests to find out whether this is a good
> solution. If you have other ideas, don't hesitate to share them.
> 
> Romain
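
On the question of tolerance to bad HTML quoted above, here is a quick
sketch of a check one could run. xml.dom.minidom relies on the expat XML
parser, so markup that is not well-formed is rejected with an
ExpatError. The helper name and the two sample strings are invented for
the example.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def parses_with_minidom(markup):
    """Return True if minidom accepts the markup, False if expat rejects it."""
    try:
        xml.dom.minidom.parseString(markup)
    except ExpatError:
        return False
    return True


# Hypothetical inputs: the second one has an unclosed <br> and <p>,
# which is common in real-world pages.
WELL_FORMED = "<p>Hello<br/>world</p>"
TAG_SOUP = "<p>Hello<br>world"
```

Here parses_with_minidom(WELL_FORMED) succeeds while
parses_with_minidom(TAG_SOUP) fails, which suggests minidom alone is not
enough for pages that are not valid XML.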


-- 
Christophe Benz
http://cbenz.pointique.org


