[weboob] Which library to use to parse HTML pages

Christophe Benz christophe.benz at gmail.com
Wed Mar 31 18:20:05 CEST 2010


On Wed, 31 Mar 2010 18:07:24 +0200,
Romain Bignon <romain at peerfuse.org> wrote:

> Hi,
> Historically, the 'AuM' backend has used html5lib to parse HTML pages.

> This library has serious performance problems, and another issue is
> that it is not packaged on every system (for example, juke tells me
> that it is not available on the Nokia N900 phone).

What about Beautiful Soup?

PS: I have an N900 too. It has standard Debian repositories with
devel, testing and stable flavors, and anyone can add a package to
devel.

> I propose using xml.dom.minidom instead, a lightweight implementation
> of the DOM. It is part of the standard library, so it probably
> performs well, is probably better supported, and is available on
> every system with Python.

> The only potential problem is: how tolerant is it of bad HTML?
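
On the tolerance question, here is a minimal sketch of a quick check. It
assumes two made-up sample documents (WELL_FORMED and TAG_SOUP) and a
hypothetical helper, try_parse; none of this comes from the AuM backend.
minidom is built on the expat XML parser, so markup that is not
well-formed XML, such as an unclosed <br>, should be rejected with an
ExpatError instead of being repaired.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample inputs, not taken from any real site.
WELL_FORMED = "<html><body><p>Hello</p></body></html>"
TAG_SOUP = "<html><body><p>Hello<br>world</body></html>"  # <br> and <p> never closed


def try_parse(markup):
    """Return the text of the first <p> element, or None if minidom rejects the markup."""
    try:
        doc = minidom.parseString(markup)
    except ExpatError:
        # minidom does no error recovery: malformed markup ends up here.
        return None
    paragraphs = doc.getElementsByTagName("p")
    if not paragraphs:
        return None
    # Concatenate only the direct text children of the first paragraph.
    return "".join(
        node.data
        for node in paragraphs[0].childNodes
        if node.nodeType == node.TEXT_NODE
    )


if __name__ == "__main__":
    print(try_parse(WELL_FORMED))
    print(try_parse(TAG_SOUP))
```

If real pages look more like the second sample than the first, a strict
XML parser alone would probably not be enough.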

> So I'll run some tests to find out whether this is a good solution. If
> you have other ideas, don't hesitate to share them.
> Romain

Christophe Benz

