[weboob] Which library to use to parse HTML pages
romain at peerfuse.org
Thu Apr 15 11:29:32 CEST 2010
On 15/Apr - 11:27, Christophe Benz wrote:
> According to the documentation, lxml can deal with bad HTML:
> But if the HTML is badly broken, lxml can wrap BeautifulSoup through
> lxml.html.soupparser, which mimics the ElementSoup class.
> It can be used at different levels:
> * for parsing HTML
> * only for encoding detection (BeautifulSoup has better algorithms)
> But it is slower than the native lxml.html parser.
> So, to summarize, assuming we forget about the DOM API, we can use many
> combinations of parsers and ElementTree. Here is the list, ordered from
> lowest to highest performance:
> * HTMLParser.HTMLParser + xml.etree.(c)ElementTree
> * elementtidy.TidyHTMLTreeBuilder + xml.etree.(c)ElementTree
> * BeautifulSoup.HTMLParser + xml.etree.(c)ElementTree
> * lxml.html.soupparser + lxml.etree.ElementTree
> * lxml.html.HTMLParser + lxml.etree.ElementTree
Ok for me.
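
For reference, here is a rough sketch of what the first combination in
the list could look like, using only the standard library. It assumes
the Python 3 module names (html.parser instead of HTMLParser, and
xml.etree.ElementTree, since cElementTree no longer exists there). The
names TreeBuildingParser, parse_html and VOID_TAGS are made up for
illustration, and the recovery rules are deliberately naive: it expects
a single root element and does nothing clever about unclosed tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML (hypothetical helper set).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TreeBuildingParser(HTMLParser):
    """Forward HTMLParser events to an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of currently open (non-void) tags

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") arrive with value None.
        attrib = {k: ("" if v is None else v) for k, v in attrs}
        self._builder.start(tag, attrib)
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: open and close immediately.
        attrib = {k: ("" if v is None else v) for k, v in attrs}
        self._builder.start(tag, attrib)
        self._builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag: ignore it
        # Close everything that was left open inside this element.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()
        # Close whatever is still open at end of input.
        while self._open:
            self._builder.end(self._open.pop())
        return self._builder.close()


def parse_html(text):
    """Parse an HTML string and return the root ElementTree element."""
    parser = TreeBuildingParser()
    parser.feed(text)
    return parser.close()


root = parse_html("<html><body><p>Hello<br>world<p>second</body></html>")
for p in root.iter("p"):
    print(p.text)
```

The same ElementTree calls (find, findall, iter) should carry over if
the parser underneath is later swapped for one of the faster options.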