[weboob] What library uses to parse HTML pages
romain at peerfuse.org
Fri Apr 2 10:15:55 CEST 2010
On 31/Mar - 19:03, Laurent Bachelier wrote:
> Good news! html5lib can output in ElementTree (and cElementTree, which
> means there is an even faster version in C!) format.
> So we could do what Christophe said, with [c]ElementTree and
I've found these benchmarks:
xml.dom.minidom seems to be really slow, and cElementTree is faster than
I think we should use it.
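For illustration, the approach Christophe describes in the quoted message
below (standard parser first, html5lib only when that fails) could be
sketched on top of ElementTree roughly as follows. This is only a sketch
under assumptions: parse_page and its fallback parameter are hypothetical
names, not weboob code, and the strict parser used here is the stdlib XML
parser, so it only accepts well-formed (XHTML-like) markup. On current
Python versions xml.etree.ElementTree picks up the C implementation on its
own, so no separate cElementTree import is needed.

```python
import xml.etree.ElementTree as ET


def parse_page(markup, fallback=None):
    """Parse markup with the strict stdlib parser, falling back on failure.

    `fallback` is a hypothetical hook: any callable taking the markup and
    returning a tree. When it is omitted, html5lib (third-party) is used.
    """
    try:
        # Fast path: well-formed markup parses directly with ElementTree.
        return ET.fromstring(markup)
    except ET.ParseError:
        if fallback is not None:
            return fallback(markup)
        # Slow path: html5lib tolerates broken HTML and can build an
        # ElementTree-compatible tree. Imported lazily so it is only
        # required when a page actually needs it.
        import html5lib
        return html5lib.parse(markup, treebuilder="etree",
                              namespaceHTMLElements=False)


# Well-formed input never touches the fallback.
tree = parse_page("<html><body><p>hi</p></body></html>")
first_paragraph = tree.find("body/p").text
```

Anything the strict parser rejects goes to the lenient one, so the cost of
html5lib is only paid on broken pages.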
> On Wed, Mar 31, 2010 at 19:00, Christophe Benz
> <christophe.benz at gmail.com> wrote:
> > I think that the most appropriate behavior is to use the standard
> > Python parser, and if and only if it does not work with the crappy
> > HTML, switch to html5lib.
> > But I think that xml.dom.minidom is quite old, and elementtree is the
> > new interface for parsing DOM in Python.
> > On Wed, 31 Mar 2010 18:57:14 +0200,
> > Romain Bignon <romain at peerfuse.org> wrote:
> >> On 31/Mar - 18:54, Laurent Bachelier wrote:
> >> > Not that I don't like Maemo or the n900 (thank god they exist), but
> >> > if not using html5lib makes development much harder, I would be
> >> > against it. Is it really slower? Since the n900 runs Firefox I
> >> > would be a bit surprised.
> >> It is html5lib that is slower. You know it well, as I used it for
> >> AuM. Now, perhaps xml.dom.minidom isn't really more efficient, but it
> >> is a good thing to test.
> >> Another important point is that xml.dom.minidom and html5lib have
> >> approximately the same API, as they both implement DOM.
> > --
> > Christophe Benz
> > http://cbenz.pointique.org
> > _______________________________________________
> > weboob mailing list
> > weboob at lists.symlink.me
> > http://lists.symlink.me/mailman/listinfo/weboob