[weboob] Which library to use to parse HTML pages

Romain Bignon romain at peerfuse.org
Fri Apr 2 10:15:55 CEST 2010


On 31/Mar - 19:03, Laurent Bachelier wrote:
> Good news! html5lib can output in ElementTree (and cElementTree, which
> means there is an even faster version in C!) format.
> So we could do what Christophe said, with [c]ElementTree and
> html5lib/[c]ElementTree.

I've found these benchmarks:

http://effbot.org/zone/celementtree.htm

xml.dom.minidom seems to be really slow, and cElementTree is faster than
everything else!

I think we should use it.
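
To make the idea concrete, here is a minimal sketch of the strategy discussed
in this thread: try the strict ElementTree parser first, and only hand the
page to html5lib when that fails. The names parse_page and _html5lib_fallback
are made up for illustration and are not existing weboob code. The sketch
imports xml.etree.ElementTree, which on current Python versions picks up the
C accelerator automatically; on older versions one would import cElementTree
explicitly. The html5lib call assumes its "etree" tree builder.

```python
import xml.etree.ElementTree as ET


def _html5lib_fallback(markup):
    # Hypothetical helper: imported lazily so that html5lib is only
    # needed when a page is actually too broken for the strict parser.
    import html5lib
    return html5lib.parse(markup, treebuilder="etree",
                          namespaceHTMLElements=False)


def parse_page(markup, fallback=_html5lib_fallback):
    """Return (root_element, name_of_parser_used)."""
    try:
        # Fast path: well-formed markup parses directly.
        return ET.fromstring(markup), "elementtree"
    except ET.ParseError:
        # Slow path: let the lenient parser repair the tag soup.
        return fallback(markup), "fallback"
```

Either way the caller gets an ElementTree element back, so backend code
would not need to care which parser was used.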

> 
> On Wed, Mar 31, 2010 at 19:00, Christophe Benz
> <christophe.benz at gmail.com> wrote:
> > I think that the most appropriate behavior is to use the standard
> > Python parser and, if and only if it does not work with the crappy
> > HTML, switch to html5lib.
> >
> > But I think that xml.dom.minidom is quite old, and ElementTree is the
> > newer interface for parsing documents in Python.
> >
> > On Wed, 31 Mar 2010 18:57:14 +0200,
> > Romain Bignon <romain at peerfuse.org> wrote:
> >
> >> On 31/Mar - 18:54, Laurent Bachelier wrote:
> >> > Not that I don't like Maemo or the n900 (thank god they exist), but
> >> > if not using html5lib makes development much harder, I would be
> >> > against it. Is it really slower? Since the n900 runs Firefox I
> >> > would be a bit surprised.
> >>
> >> It is html5lib that is slower. You know it well, as I used it for
> >> AuM. Now, perhaps xml.dom.minidom isn't really more efficient, but it
> >> is a good thing to test.
> >>
> >> Another important thing is that xml.dom.minidom and html5lib have
> >> approximately the same API, as they both implement DOM.
> >
> >
> > --
> > Christophe Benz
> > http://cbenz.pointique.org
> > _______________________________________________
> > weboob mailing list
> > weboob at lists.symlink.me
> > http://lists.symlink.me/mailman/listinfo/weboob
> >