[weboob] What library uses to parse HTML pages

Christophe Benz christophe.benz at gmail.com
Thu Apr 15 11:37:04 CEST 2010


Oops, I forgot to mention html5lib.

So, to summarize, setting aside the DOM API, we can use many
combinations of parsers with ElementTree. Here is the list, ordered
from lowest to highest performance:

* HTMLParser.HTMLParser + xml.etree.(c)ElementTree
* elementtidy.TidyHTMLTreeBuilder + xml.etree.(c)ElementTree
* BeautifulSoup.HTMLParser + xml.etree.(c)ElementTree
* html5lib.HTMLParser + xml.etree.(c)ElementTree
* lxml.html.soupparser + lxml.etree.ElementTree
* lxml.html.html5parser + lxml.etree.ElementTree
* lxml.html.HTMLParser + lxml.etree.ElementTree
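
As a rough illustration of the first combination in the list, here is a
minimal sketch that feeds HTMLParser events into an ElementTree
TreeBuilder. It assumes Python 3 (where the module is html.parser rather
than HTMLParser) and reasonably well-formed input; the names
ETreeHTMLParser, parse_html and VOID_TAGS are made up for this example,
and it does no tag-soup recovery.

```python
from html.parser import HTMLParser  # HTMLParser.HTMLParser on Python 2
from xml.etree import ElementTree as ET

# A few HTML elements that never have a closing tag (not an exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ETreeHTMLParser(HTMLParser):
    """Hypothetical helper: forward HTMLParser events to a TreeBuilder."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            # Close void elements immediately so later tags do not nest inside them.
            self._builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self._builder.end(tag)

    def handle_data(self, data):
        self._builder.data(data)

    def close(self):
        super().close()
        # TreeBuilder.close() returns the root element of the tree.
        return self._builder.close()


def parse_html(text):
    """Parse an HTML string and return the root xml.etree element."""
    parser = ETreeHTMLParser()
    parser.feed(text)
    return parser.close()


root = parse_html('<html><body><a href="/x">link</a><br><p>hi</p></body></html>')
print([a.get("href") for a in root.iter("a")])
```

The other combinations follow the same idea: a different parser produces
the events or the tree, and the result is still navigated through an
ElementTree-style API.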

Another point is that the lxml.etree API provides more methods than the
xml.etree API. For example, the two methods I use the most are
element.xpath() and element.cssselect(). They are much easier to use
than the getiterator() method.
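
To make the difference concrete, here is a small sketch that runs the
same query both ways. The sample page and the function names are
invented for illustration. It assumes Python 3, where iter() replaces
getiterator(), and it skips the XPath variant when lxml is not
installed. The cssselect() equivalent would be something like
root.cssselect("div#nav a"); it is left out because recent lxml releases
need the separate cssselect package for it.

```python
from xml.etree import ElementTree as ET

# Made-up sample page for the comparison.
PAGE = (
    "<html><body>"
    "<div id='nav'><a href='/a'>A</a></div>"
    "<div><a href='/b'>B</a></div>"
    "</body></html>"
)


def nav_links_by_iteration(root):
    """Walk the tree by hand, as one would with getiterator()."""
    links = []
    for div in root.iter("div"):
        if div.get("id") == "nav":
            for a in div.iter("a"):
                links.append(a.get("href"))
    return links


def nav_links_by_xpath(root):
    """Same query as a single XPath expression; needs an lxml element."""
    return root.xpath("//div[@id='nav']//a/@href")


print(nav_links_by_iteration(ET.fromstring(PAGE)))

try:
    import lxml.html
except ImportError:
    print("lxml is not installed; skipping the XPath variant")
else:
    print(nav_links_by_xpath(lxml.html.fromstring(PAGE)))
```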

If I use them in a backend and if a user does not have lxml, my backend
will not work. We need to make a choice.
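
One possible way to handle the dependency, shown only as a sketch of the
trade-off, is to detect lxml at runtime and fall back to the standard
library. The helper names has_module and pick_tree_api are hypothetical.
A backend that relies on xpath() or cssselect() would still break on the
fallback path, which is why a choice has to be made.

```python
import importlib.util


def has_module(name):
    """Return True if the named top-level module can be found."""
    return importlib.util.find_spec(name) is not None


def pick_tree_api():
    """Return (label, module): lxml.etree if available, else xml.etree."""
    if has_module("lxml"):
        from lxml import etree
        return "lxml", etree
    from xml.etree import ElementTree
    return "stdlib", ElementTree


label, tree_api = pick_tree_api()
print("using", label)
# Both modules expose fromstring(); only lxml elements have xpath().
root = tree_api.fromstring("<root><item/></root>")
print(hasattr(root, "xpath"))
```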

AFAIK, python-lxml is packaged for the N900, so I will use these methods.



