[weboob] Which library to use to parse HTML pages

Romain Bignon romain at peerfuse.org
Wed Mar 31 18:07:24 CEST 2010


Hi,

Historically, the 'AuM' backend has used html5lib to parse HTML pages.

This library has serious performance problems, and another issue is that it is
not packaged on every system (for example, juke tells me that it is not
available on the Nokia N900 cell phone).

I propose using xml.dom.minidom instead, a lightweight implementation of the
DOM. It is part of the standard library, so it probably performs well, is
probably better supported, and is available on every system with Python.
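
For illustration, here is a minimal sketch of what extracting data with
xml.dom.minidom could look like. The markup and the extract_links helper are
made up for this example and are not taken from the 'AuM' backend; the input
is assumed to be well-formed.

```python
from xml.dom import minidom

# Hypothetical, well-formed XHTML snippet (not a real page).
PAGE = """<html><body>
<a href="/profile/1">Alice</a>
<a href="/profile/2">Bob</a>
</body></html>"""


def extract_links(markup):
    """Return (href, text) pairs for every <a> element in the markup."""
    doc = minidom.parseString(markup)
    links = []
    for node in doc.getElementsByTagName("a"):
        # Join only the direct text children of the element.
        text = "".join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE)
        links.append((node.getAttribute("href"), text))
    return links
```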

The only potential problem is: how tolerant is it of bad HTML?

So I'll run some tests to find out whether this is a good solution. If you
have other ideas, don't hesitate to share them.
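
A first tolerance test could look like the sketch below. The sample strings
and the parses_with_minidom helper are assumptions made up for illustration.
Note that xml.dom.minidom is built on the expat XML parser, so it is expected
to raise ExpatError on markup that is not well-formed XML rather than
recover from it.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses_with_minidom(markup):
    """Return True if minidom accepts the markup, False if expat rejects it."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


# Hypothetical samples of the kind of sloppy HTML found in the wild.
SAMPLES = {
    "well-formed": "<html><body><p>Hello</p></body></html>",
    "unclosed tag": "<html><body><p>Hello<br></body></html>",
    "unquoted attribute": "<html><body><a href=page.html>x</a></body></html>",
    "undefined entity": "<html><body><p>a&nbsp;b</p></body></html>",
}

results = {name: parses_with_minidom(markup)
           for name, markup in SAMPLES.items()}
```

Any sample mapped to False in results would need to be cleaned up before
being handed to minidom, or parsed with a more forgiving library.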

Romain

