[weboob] Which library to use to parse HTML pages
christophe.benz at gmail.com
Thu Apr 15 11:37:04 CEST 2010
Oops, I forgot to mention html5lib.
So, to summarize: if we set the DOM API aside, we can combine many
parsers with ElementTree. Here is the list, ordered from lowest to
highest performance:
* HTMLParser.HTMLParser + xml.etree.(c)ElementTree
* elementtidy.TidyHTMLTreeBuilder + xml.etree.(c)ElementTree
* BeautifulSoup.HTMLParser + xml.etree.(c)ElementTree
* html5lib.HTMLParser + xml.etree.(c)ElementTree
* lxml.html.soupparser + lxml.etree.ElementTree
* lxml.html.html5parser + lxml.etree.ElementTree
* lxml.html.HTMLParser + lxml.etree.ElementTree
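To make the first combination in the list concrete, here is a minimal sketch of feeding the standard library parser into an ElementTree builder. It is written for Python 3, where HTMLParser.HTMLParser lives in html.parser. The class TreeBuildingParser, the helper parse_html and the VOID_TAGS set are hypothetical names made up for this illustration, and the sketch assumes a document with a single root element.

```python
from html.parser import HTMLParser      # Python 3 home of HTMLParser.HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuildingParser(HTMLParser):
    """Hypothetical helper: forwards parser events to an ElementTree TreeBuilder."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []  # names of the elements that are currently open

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        self._builder.start(tag, {k: (v if v is not None else "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)      # close void elements immediately
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return                      # stray or void closing tag: ignore it
        # Also close any element left open inside this one.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open:                  # drop text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:               # close whatever the page left open
            self._builder.end(self._open.pop())
        return self._builder.close()    # the root Element


def parse_html(text):
    parser = TreeBuildingParser()
    parser.feed(text)
    return parser.close()


root = parse_html('<html><body><p>Hello<br>world</p>'
                  '<a href="/x">link</a></body></html>')
links = [a.get("href") for a in root.iter("a")]
```

The result is a plain xml.etree element, so only the xml.etree API is available on it. Real pages are messier than this sketch handles, which is part of why the more tolerant parsers further down the list exist.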
Another point is that the lxml.etree API provides more methods than the
xml.etree API. For example, the two methods I use the most are
element.xpath() and element.cssselect(). They are much easier to use
than the getiterator() method.
If I use them in a backend and if a user does not have lxml, my backend
will not work. We need to make a choice.
AFAIK, python-lxml is packaged for the N900, so I will use these methods.
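
For illustration only, here is a sketch of what the trade-off looks like in code: the same query written once with element.xpath() and once with a manual tree walk, selected by whether lxml can be imported. The helper link_targets, the HAVE_LXML flag and the sample markup are hypothetical; the input is well-formed, so an XML parser is enough here. In current Python the getiterator() method is spelled iter().

```python
try:
    from lxml import etree                      # richer API: element.xpath()
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree       # standard library fallback
    HAVE_LXML = False


def link_targets(root):
    """Hypothetical helper: href of every <a> inside a <div class="nav">."""
    if HAVE_LXML:
        # A single XPath expression does the whole selection.
        return root.xpath('//div[@class="nav"]//a/@href')
    # Without lxml, walk the tree by hand.
    hrefs = []
    for div in root.iter("div"):
        if div.get("class") == "nav":
            hrefs.extend(a.get("href") for a in div.iter("a")
                         if a.get("href") is not None)
    return hrefs


root = etree.fromstring(
    '<html><body><div class="nav"><a href="/home">Home</a>'
    '<p><a href="/news">News</a></p></div>'
    '<div class="ads"><a href="/buy">Buy</a></div></body></html>'
)
print(link_targets(root))
```

Writing every query twice like this is the cost of keeping lxml optional; depending on lxml outright keeps only the first branch.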