[weboob] What library to use to parse HTML pages
romain at peerfuse.org
Sat Apr 3 18:21:42 CEST 2010
On 02/Apr - 10:15, Romain Bignon wrote:
> I think we should use it.
I patched weboob to use cElementTree as the default data structure container,
falling back to ElementTree if it is not found.
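
A minimal sketch of that fallback, assuming the standard library module paths
(the actual import layout in the weboob patch may differ, and the alias name
ElementTree is only chosen here for illustration):

```python
# Prefer the C-accelerated implementation when it exists, otherwise
# fall back to the pure-Python module. Both expose the same API.
try:
    import xml.etree.cElementTree as ElementTree
except ImportError:
    import xml.etree.ElementTree as ElementTree

# The rest of the code only ever refers to the alias.
root = ElementTree.fromstring("<page><title>weboob</title></page>")
title = root.find("title").text
```

Callers never need to know which of the two modules was actually loaded.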
As the tree builder included in *ElementTree only supports XML (there are
problems with HTML), I also included the elementtidy library as parser for the
elementtidy is not in the standard Python library, so I've written a wrapper
that uses HTMLParser instead when elementtidy is not found.
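
To give an idea of what such a wrapper can look like, here is a rough sketch
that feeds HTMLParser events into an ElementTree TreeBuilder. It is not the
code from the patch: the names HTMLTreeParser, parse_html and VOID_TAGS are
made up for this example, it uses the Python 3 module path html.parser, and it
assumes the page has a single root element.

```python
from html.parser import HTMLParser
from xml.etree.ElementTree import TreeBuilder

# Elements that never have a closing tag in HTML (hypothetical helper set).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HTMLTreeParser(HTMLParser):
    """Build an ElementTree element tree from HTMLParser events."""

    def __init__(self):
        super().__init__()
        self._builder = TreeBuilder()
        self._open = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)  # close immediately, no end tag expected
        else:
            self._open.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: open and close in one step.
        self._builder.start(tag, {k: (v or "") for k, v in attrs})
        self._builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray end tag, ignore it
        # Close any unclosed children before closing the requested tag.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:  # close whatever the page left open
            self._builder.end(self._open.pop())
        return self._builder.close()


def parse_html(text):
    """Parse an HTML string and return the root element."""
    parser = HTMLTreeParser()
    parser.feed(text)
    return parser.close()


root = parse_html('<html><body><p class="x">Hello<br>world</p>'
                  '<p>two</body></html>')
paragraphs = root.findall(".//p")
```

The result is an ordinary element tree, so backends can use the same find() and
findall() calls whichever parser produced it.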
Another thing: elementtidy crashes when the web page contains no errors, because
of the latest version of libtidy. I've sent a patch to Debian to fix it:
Anyway, I don't know whether elementtidy is faster than HTMLParser; we'll need to
So, backends which use the new standard parser are:
Backends which use html5lib:
Backends which use HTMLParser:
Backends which use no parser:
We'll need to convert everything to the new standard parser. The new API is easy
to use and more Pythonic than html5lib's.