[weboob] What library to use to parse HTML pages

Laurent Bachelier laurent at bachelier.name
Wed Mar 31 18:31:29 CEST 2010


BeautifulSoup is discontinued.
http://www.crummy.com/software/BeautifulSoup/3.1-problems.html
The author recommends html5lib.

On Wed, Mar 31, 2010 at 18:20, Christophe Benz
<christophe.benz at gmail.com> wrote:
> Hi,
>
> On Wed, 31 Mar 2010 18:07:24 +0200,
> Romain Bignon <romain at peerfuse.org> wrote:
>
>> Hi,
>>
>> Historically, the 'AuM' backend uses html5lib to parse HTML pages.
>
>> This library has serious performance problems, and another issue is
>> that it is not packaged on every system (for example, juke tells me
>> that it is not available on the Nokia N900 phone).
>
> What about Beautiful Soup?
>
> PS: I have an N900 too, and there are standard Debian repositories,
> with devel, testing and stable flavors, and anyone can add a package
> to devel.
>
>> Instead, I propose using xml.dom.minidom, a lightweight implementation
>> of the DOM. It is part of the standard library, so it probably
>> performs well, is probably better supported, and is available on
>> every system with Python.
>
>> The only potential problem is: how tolerant is it of bad HTML?
>
>> So I'll try to run some tests to find out whether this is a good
>> solution. If you have other ideas, don't hesitate.
>>
>> Romain
>
>
> --
> Christophe Benz
> http://cbenz.pointique.org
> _______________________________________________
> weboob mailing list
> weboob at lists.symlink.me
> http://lists.symlink.me/mailman/listinfo/weboob
>
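
The tolerance question raised in the quoted message can be checked with a few lines. This is only a minimal sketch, not a test anyone in the thread ran: the helper name parses_with_minidom and the two sample strings are made up for illustration. It relies on xml.dom.minidom.parseString, which uses a strict XML parser (expat) and raises ExpatError on markup that is not well-formed.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def parses_with_minidom(markup):
    """Return True if xml.dom.minidom accepts the markup, False otherwise.

    Hypothetical helper, written only for this illustration.
    """
    try:
        minidom.parseString(markup)
    except ExpatError:
        # expat is an XML parser: it rejects anything that is not well-formed.
        return False
    return True


# Sample inputs (assumptions): one well-formed page and one typical
# "tag soup" page with an unclosed <br> and an unclosed <p>.
well_formed = "<html><body><p>Hello</p></body></html>"
tag_soup = "<html><body><p>Hello<br></body></html>"

print(parses_with_minidom(well_formed))
print(parses_with_minidom(tag_soup))
```

If the second call reports a failure, that suggests minidom alone would need the pages to be cleaned up first, whereas html5lib is designed to recover from such markup.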

