Martin von Loewis wrote:
>
> > Are you sure that we should choose expat as "native" XML parser ?
>
> It wouldn't necessarily be the only parser. To process XML, different
> applications have different needs. However, since the expatreader is
> the only SAX reader included in the standard library at the moment,
> guaranteeing presence of pyexpat is oft-requested. Notice that
> pyexpat.c is also in the standard library already.

Just wanted to make sure that we still have the option of including
other parsers as well :-)

> > There are other candidates which would fit this role just
> > as well (in particular, Fredrik's sgmlop looks like a nice
> > extension since it not only works with XML but also many
> > other meta languages).
>
> Not that many candidates would work as well. For example, sgmlop has a
> number of known bugs, and a few unknown ones. Guido once complained
> that it is easy to crash sgmlop with ill-formed input, and rejected
> inclusion of sgmlop when xmlrpclib was integrated. A known problem is
> that entity references are not expanded in attributes.

Well, let's put it this way: if someone finds a need to fix these
bugs, it is more likely to happen in the Python core, e.g. xmlrpclib
has already received a few tweaks (by yourself ;-) after it was
checked into the core.

I think that the sgmlop design is sufficiently simple and easy to
extend to make it a good candidate for inclusion. Sure, we'll get
bug reports, but why not add sgmlop marked as experimental to the
core in order to get it stabilized and bug-fixed ?!

I would very much like a sandbox-like part in the Python standard
dist to encourage stabilizing of proposed-to-be-included std lib
extensions, e.g. how about a sandbox package in the std lib ?!

> Beyond that, I'm not aware of many more pure-C parsers that could
> reasonably be integrated into the core. There are many XML parsers,
> but many of them are written in C++ or Java.

Me neither... except RXP which is written in plain C.

> > If you want a very fast validating XML parser, RXP would also
> > be a good choice -- AFAIK, the RXP folks would allow us to
> > ship RXP under a different license than GPL which is then
> > bound to Python.
>
> RXP would indeed be a choice. Of course, integrating it is much
> harder; you'd have to write the C module first, plus documentation,
> plus a SAX driver, plus test cases. I'm not sure how much code you can
> inherit from PyLTXML.

Sure; the question I wanted to raise was: given that we have such
an interface, would RXP also be a candidate for inclusion ?

> On performance: Please have a look at
>
> http://www.xml.com/lpt/a/Benchmark/exec.html
>
> which suggests that expat still has a speed advantage over rxp
> (assuming that the measurements were done carefully, i.e. disabling
> validation in RXP).

Hmm, I know that at least one company has been having great success
in using RXP with Python; from their experience RXP is faster on
average XML than any of the other available (validating) parsers.
May be due to their application, though, so YMMV.

> > Given the many alternatives, I am not sure whether going with
> > expat is the right path... may be wrong though.
>
> It shouldn't be the only path. pyexpat is already integrated into the
> Python library, all I'm suggesting is to give the promise that it will
> be available on every 2.2 Python installation.
>
> Any volunteers working on RXP integration are certainly welcome to do
> so; code contributions to PyXML will be welcome (provided the GPL
> issue gets resolved). Code contributions to the Python core would
> require some review, of course - it took quite some time to get
> pyexpat stable, and I guess any other C-integrated parser won't work
> from scratch, either.

True.

Thanks,
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/