[weboob] Which library to use to parse HTML pages

Romain Bignon romain at peerfuse.org
Sat Apr 3 18:21:42 CEST 2010


On 02/Apr - 10:15, Romain Bignon wrote:
> I think we should use it.

Hi guys,

I patched weboob to use cElementTree as the default data structure container,
falling back to ElementTree if cElementTree is not found.
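
A minimal sketch of that fallback, assuming the copies shipped in the standard
library under xml.etree are used (the actual patch may import them differently).
Note that xml.etree.cElementTree was removed in Python 3.9, so recent
interpreters always take the second branch:

```python
# Prefer the C implementation, fall back to the pure-Python one.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET

# Both modules expose the same API, so the rest of the code only uses "ET".
root = ET.fromstring("<page><title>weboob</title></page>")
print(root.find("title").text)
```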

As the tree builder included in *ElementTree only supports XML (it has
problems with HTML), I also included the elementtidy library as the parser that
builds the ElementTree object.

elementtidy is not in the standard Python library, so I've written a wrapper
that uses HTMLParser instead when elementtidy is not found.
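
To give an idea of what such a wrapper can look like, here is a rough sketch
that feeds HTMLParser events into an ElementTree TreeBuilder. It is not the code
from the patch: the names TreeHTMLParser, parse_html and VOID_TAGS are made up
for this example, it uses the Python 3 module name html.parser (the module was
called HTMLParser in Python 2), and it only handles simple, single-root pages:

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreeHTMLParser(HTMLParser):
    """Hypothetical wrapper: turns HTMLParser callbacks into an element tree."""

    def __init__(self):
        super().__init__()
        self._builder = ET.TreeBuilder()
        self._open = []  # stack of tags that are currently open

    def handle_starttag(self, tag, attrs):
        # Attributes without a value arrive as None; store them as "".
        self._builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self._builder.end(tag)
        else:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self._open:
            return  # stray closing tag, ignore it
        # Close any unclosed children first, then the tag itself.
        while self._open:
            current = self._open.pop()
            self._builder.end(current)
            if current == tag:
                break

    def handle_data(self, data):
        if self._open:  # drop text found outside the root element
            self._builder.data(data)

    def close(self):
        super().close()
        while self._open:  # close whatever the page left open
            self._builder.end(self._open.pop())
        return self._builder.close()


def parse_html(text):
    """Parse an HTML string and return the root Element."""
    parser = TreeHTMLParser()
    parser.feed(text)
    return parser.close()
```

A backend could then call parse_html(page) and navigate the result with the
usual find() and findall() methods, whichever parser produced the tree.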

Another thing: elementtidy crashes when the web page contains no errors, because
of the latest version of libtidy. I've sent a patch to Debian to fix it:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=576343

Anyway, I don't know whether elementtidy is faster than HTMLParser; we'll need
to benchmark that.
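
One possible way to run such a benchmark, sketched with the standard timeit
module. The benchmark() helper, the sample page and the two parsers compared
here are placeholders chosen so that the example runs without elementtidy; the
real comparison would plug in the elementtidy and HTMLParser based parsers:

```python
import timeit
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def benchmark(parsers, document, repeat=5, number=200):
    """Return the best average time per parse, in seconds, for each parser.

    `parsers` maps a label to a callable taking the document as a string.
    """
    results = {}
    for name, parse in parsers.items():
        timings = timeit.repeat(lambda: parse(document),
                                repeat=repeat, number=number)
        # The minimum is the measurement least disturbed by other processes.
        results[name] = min(timings) / number
    return results


def feed_htmlparser(document):
    """Run the bare HTMLParser tokenizer over the document (no tree built)."""
    parser = HTMLParser()
    parser.feed(document)
    parser.close()


# Well-formed sample page so that the XML parser accepts it too.
SAMPLE = "<html><body>" + "<p>hello</p>" * 100 + "</body></html>"

if __name__ == "__main__":
    timings = benchmark({"ElementTree": ET.fromstring,
                         "HTMLParser": feed_htmlparser}, SAMPLE)
    for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
        print("%-12s %.6f s per parse" % (name, seconds))
```

Real pages fetched by the backends would give more meaningful numbers than this
synthetic sample.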

So, backends which use the new standard parser are:
	- dlfp

Backends which use html5lib:
	- aum

Backends which use HTMLParser:
	- transilien

Backends which use no parser:
	- canaltp

We'll need to convert every backend to the new standard parser. The new API is
easy to use and more Pythonic than html5lib's.

Romain