[weboob] What library to use to parse HTML pages

Romain Bignon romain at peerfuse.org
Thu Apr 15 11:29:32 CEST 2010


On 15/Apr - 11:27, Christophe Benz wrote:
> According to the documentation, lxml can deal with bad HTML:
> http://codespeak.net/lxml/lxmlhtml.html#really-broken-pages
> 
> But if the HTML is really crappy, lxml can wrap BeautifulSoup via
> lxml.html.soupparser, which mimics the ElementSoup module.
> 
> BeautifulSoup can be used at different levels:
> * for parsing HTML
> * for encoding detection only (BeautifulSoup's detection algorithms are better)
> 
> But it performs worse than lxml.html's native parser.
> 
> So, to summarize, assuming we forget about the DOM API, we can use many
> combinations of parser and ElementTree implementation. Here is a list,
> from lowest to highest performance:
> * HTMLParser.HTMLParser + xml.etree.(c)ElementTree
> * elementtidy.TidyHTMLTreeBuilder + xml.etree.(c)ElementTree
> * BeautifulSoup.BeautifulSoup + xml.etree.(c)ElementTree
> * lxml.html.soupparser + lxml.etree.ElementTree
> * lxml.html.HTMLParser + lxml.etree.ElementTree

Ok for me.
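For illustration (a sketch, not code from this thread), the first combination in the list, a stdlib HTML parser feeding tag events into xml.etree's TreeBuilder, fits in a few lines. Module names below are Python 3's (html.parser replaced Python 2's HTMLParser module):

```python
# Sketch: stdlib HTML parser + xml.etree TreeBuilder, the slowest
# combination from the list above. Only handles well-formed markup.
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder


class ETreeHTMLParser(HTMLParser):
    """Forward HTML tag/text events into an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; TreeBuilder wants a dict
        self.builder.start(tag, dict(attrs))

    def handle_endtag(self, tag):
        self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


parser = ETreeHTMLParser()
parser.feed("<html><body><p class='x'>hello</p></body></html>")
root = parser.builder.close()
print(root.find("body/p").text)  # -> hello
```

Feed it a page with unclosed or mis-nested tags and the TreeBuilder will typically choke, which is exactly why the tidy/BeautifulSoup/lxml variants higher in the list exist for broken pages.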

Romain

