On Nov 9, 2007 6:10 AM, Walter Dörwald <walter at livinglogic.de> wrote:
>
> Martin v. Löwis wrote:
> >>> Yes, an XML parser should be able to use UTF-8, UTF-16, UTF-32, etc.
> >>> codecs to do the encoding. There's no need to create a magical
> >>> mystery codec to pick out which, though.
> >> So the code is good if it is inside an XML parser, and it's bad if it
> >> is inside a codec?
> >
> > Exactly so. This functionality just *isn't* a codec - there is no
> > encoding. Instead, it is an algorithm for *detecting* an encoding.
>
> And what do you do once you've detected the encoding? You decode the
> input, so why not combine both into an XML decoder?

It seems to me that parsing XML requires three steps:

1) determine the encoding
2) decode the byte stream
3) parse the XML (including handling of character references)

All an XML codec does is make the first step a side effect of the
second. Rather than this:

    encoding = detect_encoding(raw_data)
    decoded_data = raw_data.decode(encoding)
    tree = parse_xml(decoded_data, encoding)  # Verifies encoding

You'd have this:

    e = codecs.getincrementaldecoder("xml-auto-detect")()
    decoded_data = e.decode(raw_data, True)
    tree = parse_xml(decoded_data, e.encoding)  # Verifies encoding

It's clear to me that detecting an encoding is actually the simplest
part of all this (so long as there's an API to do it!). Putting it
inside a codec seems like the wrong subdivision of responsibility.

(An example using streams would end up closer, but it still seems
wrong to me. Encoding detection is always one-way, while codecs are
always two-way, even if lossy.)

-- 
Adam Olsen, aka Rhamphoryncus
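
For concreteness, here is a rough sketch of what the detect_encoding()
used in the first snippet above could look like: BOM sniffing plus a
peek at the XML declaration. It is only an illustration of the
"simplest part" claim, not the proposed codec implementation; a
complete version would also have to handle BOM-less UTF-16/UTF-32
input as described in Appendix F of the XML 1.0 spec.

    import codecs
    import re

    def detect_encoding(raw_data):
        """Guess the encoding of an XML byte string (illustrative sketch)."""
        # A BOM settles it immediately; check the longer UTF-32 BOMs
        # before UTF-16, since the UTF-16 BOMs are prefixes of them.
        boms = [
            (codecs.BOM_UTF32_BE, "utf-32-be"),
            (codecs.BOM_UTF32_LE, "utf-32-le"),
            (codecs.BOM_UTF8, "utf-8"),
            (codecs.BOM_UTF16_BE, "utf-16-be"),
            (codecs.BOM_UTF16_LE, "utf-16-le"),
        ]
        for bom, name in boms:
            if raw_data.startswith(bom):
                return name
        # No BOM: look for an explicit encoding= in the XML declaration
        # (this only works for ASCII-compatible encodings).
        match = re.match(
            br'<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']',
            raw_data)
        if match:
            return match.group(1).decode("ascii")
        # Otherwise the spec's default applies.
        return "utf-8"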