In general, I like this proposal a lot, but I think it only covers half
the story. How we actually build the encoder/decoder for each encoding
is a very big issue. Thoughts on this below.

First, a little nit:

> u = u'<utf-8 encoded Python string>'

I don't like using funny prime characters - why not an explicit function
like "utf8()"?

On to the important stuff:

> unicodec.register(<encname>,<encoder>,<decoder>
>                   [,<stream_encoder>, <stream_decoder>])
>
> This registers the codecs under the given encoding
> name in the module global dictionary
> unicodec.codecs. Stream codecs are optional:
> the unicodec module will provide appropriate
> wrappers around <encoder> and
> <decoder> if not given.

I would MUCH prefer a single 'Encoding' class or type to wrap up these
things, rather than up to four disconnected objects/functions.
Essentially it would be an interface standard and would offer methods to
do the four things above.

There are several reasons for this.

(1) There are quite a lot of things you might want to do with an
encoding object, and we could extend the interface in future easily. As
a minimum, give it the four methods implied by the above, two of which
can be defaults. But I'd like an encoding to be able to tell me the set
of characters to which it applies; validate a string; and maybe tell me
if it is a subset or superset of another.

(2) Double-byte encodings especially will need to load up some kind of
database on startup and use it for both encoding and decoding - much
better to share it and encapsulate it inside one object.

(3) For some languages, extra functions are wanted. For Japanese, you
need two or three functions to expand half-width to full-width katakana,
and to convert double-byte English to single-byte and vice versa. A
Japanese encoding object would be a handy place to put this knowledge.

(4) In the real world you get many encodings which are subtle variations
of the same thing, plus or minus a few characters. One bit of code might
be able to share the work of several encodings by setting a few flags.
This is certainly true of Japanese.

(5) Encoding/decoding algorithms can be program or data or (very often)
a bit of both. We have not yet discussed where to keep all the mapping
tables, but if data is involved it should be hidden in an object.

(6) See my comments on a state machine for doing the encodings. If this
is done well, we might have two different standard objects which conform
to the Encoding interface (a really light one for single-byte encodings,
and a bigger one for multi-byte), and everything else could be data
driven.

(7) Easy to grow - encodings can be prototyped and proven in Python,
then ported to C if needed or when ready.

In summary, firm up the concept of an Encoding object and give it room
to grow - that's the key to real-world usefulness. If people feel the
same way I'll have a go at an interface for that, and try to show how it
would have simplified specific problems I have faced.

We also need to think about where encoding info will live. You cannot
avoid mapping tables, although you can hide them inside code modules or
pickled objects if you want. Should there be a standard
"..\Python\Enc" directory?

And we're going to need some kind of testing and certification procedure
when adding new encodings. This stuff has to be right.

Guido asked about TypedString. This can probably be done on top of the
built-in stuff - it is just a convenience which would clarify intent,
reduce lines of code and prevent people shooting themselves in the foot
when juggling a lot of strings in different (non-Unicode) encodings. I
can do a Python module to implement that on top of whatever is built.

Regards,

Andy

=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.
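
Editorial illustration (not part of the original message): a rough sketch
of what the single 'Encoding' object argued for above might look like,
written in present-day Python. Every name here (Encoding,
SingleByteEncoding, charset, validate, is_subset_of, the toy tables) is
hypothetical and chosen only for illustration; none of it comes from the
proposal or from any real library. It shows the four operations, two of
them as defaults, plus the character-set queries from point (1), with the
mapping table hidden inside the object as in points (2) and (5).

```python
import io


class Encoding:
    """Hypothetical interface: one object per encoding."""

    name = "abstract"

    def encode(self, text):
        raise NotImplementedError

    def decode(self, data):
        raise NotImplementedError

    # Default stream wrappers built on encode/decode. A real multi-byte
    # encoding would need stateful versions; this is only the simple case.
    def stream_encode(self, chunks, out):
        for chunk in chunks:
            out.write(self.encode(chunk))

    def stream_decode(self, inp):
        return self.decode(inp.read())

    def charset(self):
        """Return the set of characters this encoding can represent."""
        raise NotImplementedError

    def validate(self, text):
        """True if every character of text is representable."""
        chars = self.charset()
        return all(ch in chars for ch in text)

    def is_subset_of(self, other):
        return self.charset() <= other.charset()


class SingleByteEncoding(Encoding):
    """Light, data-driven encoding: behaviour comes from a mapping table."""

    def __init__(self, name, table):
        # table maps a byte value (int 0..255) to a one-character string
        self.name = name
        self._decode_map = dict(table)
        self._encode_map = {ch: b for b, ch in table.items()}

    def encode(self, text):
        # Raises KeyError for characters outside the table
        return bytes(self._encode_map[ch] for ch in text)

    def decode(self, data):
        return "".join(self._decode_map[b] for b in data)

    def charset(self):
        return frozenset(self._encode_map)


# Two toy encodings that share one implementation and differ only in data
ascii_enc = SingleByteEncoding("toy-ascii", {i: chr(i) for i in range(128)})
latin1_enc = SingleByteEncoding("toy-latin-1", {i: chr(i) for i in range(256)})
```

Because both toy encodings are instances of the same class, the variations
between them live entirely in the table passed to the constructor, which
is the data-driven arrangement points (4) and (6) describe.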