[weboob] Which library to use to parse HTML pages

Laurent Bachelier laurent at bachelier.name
Wed Mar 31 19:03:51 CEST 2010


Good news! html5lib can output in ElementTree format (and cElementTree,
which means there is an even faster version in C!).
So we could do what Christophe said: use [c]ElementTree directly, and
fall back to html5lib with the [c]ElementTree tree builder.
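
To make the idea concrete, here is a rough sketch of the "strict parser
first, html5lib only on failure" approach. It is only an illustration, not
weboob code: the names parse_page and parse_with_html5lib are made up, and I
am assuming xml.etree.ElementTree stands in for "the standard Python parser".
Since that is a strict XML parser, only well-formed pages take the fast path.

```python
import xml.etree.ElementTree as ET


def parse_with_html5lib(text):
    """Hypothetical fallback: let html5lib build an ElementTree tree."""
    # Imported lazily so the fast path works even without html5lib installed.
    import html5lib
    # treebuilder="etree" asks for xml.etree.ElementTree elements;
    # namespaceHTMLElements=False keeps tag names plain ("p", not "{ns}p").
    return html5lib.parse(text, treebuilder="etree",
                          namespaceHTMLElements=False)


def parse_page(text, fallback=parse_with_html5lib):
    """Try the strict standard parser first; use the fallback on failure."""
    try:
        # Fast path: succeeds only if the page is well-formed XML/XHTML.
        return ET.fromstring(text)
    except ET.ParseError:
        # Slow path: tolerant parsing of broken HTML.
        return fallback(text)
```

Both paths return an ElementTree element, so the calling code should not need
to care which parser was used, e.g. parse_page(page).findall(".//a").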

On Wed, Mar 31, 2010 at 19:00, Christophe Benz
<christophe.benz at gmail.com> wrote:
> I think that the most appropriate behavior is to use the standard
> Python parser, and if and only if it does not work with the crappy
> HTML, switch to html5lib.
>
> But I think that xml.dom.minidom is quite old, and ElementTree is the
> newer interface for parsing documents in Python.
>
> On Wed, 31 Mar 2010 18:57:14 +0200,
> Romain Bignon <romain at peerfuse.org> wrote:
>
>> On 31/Mar - 18:54, Laurent Bachelier wrote:
>> > Not that I don't like Maemo or the n900 (thank god they exist), but
>> > if not using html5lib makes development much harder, I would be
>> > against it. Is it really slower? Since the n900 runs Firefox I
>> > would be a bit surprised.
>>
>> It is html5lib that is slower. You know it well, as I used it for
>> AuM. Now, perhaps xml.dom.minidom isn't really more efficient, but it
>> is a good thing to test.
>>
>> Another important thing is that xml.dom.minidom and html5lib have
>> approximately the same API, as they both implement DOM.
>
>
> --
> Christophe Benz
> http://cbenz.pointique.org
> _______________________________________________
> weboob mailing list
> weboob at lists.symlink.me
> http://lists.symlink.me/mailman/listinfo/weboob
>


