[weboob] What library uses to parse HTML pages
romain at peerfuse.org
Thu Apr 15 09:17:05 CEST 2010
On 15/Apr - 00:43, Christophe Benz wrote:
> Reopening the debate around HTML parsers, after having studied the
> Here is a benchmark from 2007, made by Ian Bicking, the author of lxml:
> lxml is the most efficient for parsing HTML. And it is packaged for the
> Nokia N900 ;-)
> lxml parses HTML, returning a document compatible with the
> elementtree API:
> lxml provides additional smart
> crawling methods, like cssselect:
> lxml can also deal with links and forms, like mechanize does:
> lxml has its own HTML parser, but it wraps html5lib too:
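A minimal sketch of the points quoted above (assuming lxml is installed): lxml.html returns elements that support the familiar ElementTree API (find/findall/findtext), plus extras such as iterlinks() for link handling. The sample page is invented for illustration.

```python
from lxml import html

# hypothetical sample page, just for illustration
page = """<html><body>
  <a href="/watch?v=abc" class="video">First</a>
  <a href="/watch?v=def" class="video">Second</a>
</body></html>"""

doc = html.fromstring(page)

# ElementTree-style navigation (same calls work on xml.etree trees):
titles = [a.text for a in doc.findall(".//a")]

# link extraction, roughly the kind of thing mechanize offers;
# iterlinks() yields (element, attribute, link, pos) tuples
hrefs = [link for (el, attr, link, pos) in doc.iterlinks()]

print(titles)  # ['First', 'Second']
print(hrefs)   # ['/watch?v=abc', '/watch?v=def']
```

(cssselect-style queries are available on the same elements as well.)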
Ok, but another consideration is the parser's tolerance of bad HTML.
What happens when lxml tries to parse a really broken HTML document?
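A quick check (again assuming lxml is installed): lxml.html is backed by libxml2's recovering HTML parser, so even markup with unclosed and misnested tags still yields a usable tree. The broken snippet below is made up for the test.

```python
from lxml import html

# deliberately malformed: unclosed <p>, unclosed <b>, misnested <div>
broken = "<html><body><p>unclosed paragraph <b>stray bold <div>mixed</body>"

doc = html.fromstring(broken)

# the parser recovers and the elements are still reachable
print(doc.findtext(".//p"))          # text of the repaired <p>
print(doc.find(".//b") is not None)  # True: the <b> survived too
```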
> Now, about APIs, there is the historical DOM API:
> but since Python 2.5 there is ElementTree:
> The aum backend uses the old DOM API, is there any reason?
For historical reasons. When I wrote AuM a year ago, it was the first parsing
library I tried.
> For the youtube backend, I used lxml with the class LxmlHtmlParser
> provided by the module weboob.tools.parser.
Ok. I think that if it can correctly parse bad HTML and if it exposes the same
API as etree, we should use it as the StandardParser.
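A sketch of why a shared etree API matters: a backend written against the stdlib xml.etree.ElementTree interface (available since Python 2.5) works unchanged on lxml elements, so switching the StandardParser would not touch backend code. The parse_profile helper below is hypothetical, not part of weboob.

```python
import xml.etree.ElementTree as ET

def parse_profile(root):
    # uses only the common ElementTree API: findtext/findall/get,
    # so `root` can come from xml.etree or from lxml alike
    return {
        "name": root.findtext("name"),
        "links": [a.get("href") for a in root.findall(".//a")],
    }

doc = "<profile><name>romain</name><a href='http://peerfuse.org'>site</a></profile>"
print(parse_profile(ET.fromstring(doc)))
# {'name': 'romain', 'links': ['http://peerfuse.org']}
```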
> But for some reasons I put many parsers into the weboob.tools.parser
> module, so each developer can choose the one he or she wants.