> The last time we went around there was an anti-Unicode faction who
> argued that adding Unicode support was fine but making it
> the default would inconvenience Japanese users.

Whoops, I nearly missed the biggest debate of the year! I guess the faction was Brian and I, and our concerns were misunderstood. We can lay this to rest forever now, as the current implementation and forward direction incorporate everything I originally hoped for:

(1) Frequently you need to work with byte arrays but still want a rich set of string-like routines: search and replace, regexes and so on. This applies both to non-natural-language data and to the special case of corrupt native encodings that need repair. We loosely defined the 'string interface' in UserString, so that other people can define string-like types if they wish, and so that users can expect to find certain methods and operations on both Unicode and byte array types. I'd be really happy one day to explicitly type x = ByteArray('some raw data'), as long as I had my old friends split, join, find etc.

(2) Japanese projects often need small extensions to codecs to deal with user-defined characters. Java and VB give you some canned codecs but no way to extend them. All the Python Asian codec drafts involve 'open' code you can hack, and they use simple dictionaries for mapping tables; so it will be really easy to roll your own "Shift-JIS-plus" with 20 extra characters mapping to a private use area. This will be a huge win over other languages.

(3) The Unicode conversion was based on a more general notion of 'stream conversion filters' which work with bytes. This leaves the door open to writing, for example, a direct Shift-JIS-to-EUC filter which adds nothing in the case of clean data, but is much more robust in the case of user-defined characters, or which can handle cleanup of misencoded data. We could also write image manipulation or crypto codecs.
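To make point (2) concrete, here is a minimal sketch of the "Shift-JIS-plus" idea: a decoder that falls back to a plain dictionary mapping for user-defined double-byte sequences, routing them to the Unicode Private Use Area. The table entries and function name are invented for illustration; a real codec would plug into the codec registry rather than stand alone.

```python
# Hypothetical user-defined extension table: two vendor-specific
# Shift-JIS code points (values invented for illustration) mapped
# into the Unicode Private Use Area.
USER_DEFINED = {
    0xF040: '\ue000',  # vendor gaiji #1 -> PUA U+E000
    0xF041: '\ue001',  # vendor gaiji #2 -> PUA U+E001
}

def decode_sjis_plus(data: bytes) -> str:
    """Decode Shift-JIS, consulting USER_DEFINED for double-byte
    sequences the stock codec does not know about."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x81 <= b <= 0xFC and i + 1 < len(data):
            pair = (b << 8) | data[i + 1]
            if pair in USER_DEFINED:
                out.append(USER_DEFINED[pair])
                i += 2
                continue
            try:
                out.append(data[i:i + 2].decode('shift_jis'))
                i += 2
                continue
            except UnicodeDecodeError:
                pass
        # Fall back to single-byte decoding (ASCII, half-width kana).
        out.append(data[i:i + 1].decode('shift_jis', errors='replace'))
        i += 1
    return ''.join(out)
```

Because the table is just a dictionary, adding another twenty private characters is a twenty-line edit, which is exactly the kind of hackability the canned Java/VB codecs lack.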
Some of us hope to provide general machinery for fast handling of byte-stream filters, which could be useful in image processing and crypto as well as encodings. This might need an extended or different lookup function (after all, neither end of the filter need be Unicode), but it could be cleanly layered on top of the codec mechanism we have built.

(4) I agree 100% on being explicit whenever you do I/O or conversion, and on generally using Unicode characters where possible. Defaults are evil. But we needed a compatibility route to get there. Guido has said that in the long term there will be Unicode strings and Byte Arrays. That's the time to require encoding arguments to open().

> Similarly, we could improve socket objects so that they have different
> readtext/readbinary and writetext/writebinary without unifying the
> string objects. There are lots of small changes we can make without
> breaking anything. One I would like to see right now is a unification of
> chr() and unichr().

Here's a thought. How about BinaryFile/BinarySocket/ByteArray, which do not need an encoding, and File/Socket/String, which require explicit encodings on opening? We keep broad parity between their methods. That seems more straightforward to me than having text/binary methods, and it also provides a cleaner upgrade path for existing code.

- Andy
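The explicit-encoding discipline in the BinaryFile/File proposal can be sketched with two small wrappers. The names open_binary and open_text are hypothetical illustrations of the split, not a proposed API; modern Python spells the text half as open(path, encoding=...).

```python
def open_binary(path):
    """The BinaryFile idea: raw bytes in and out, no encoding involved."""
    return open(path, 'rb')

def open_text(path, encoding, mode='r'):
    """The File idea: refuses to guess -- the caller must name an encoding."""
    if not encoding:
        raise ValueError('an explicit encoding is required')
    return open(path, mode, encoding=encoding)
```

The payoff is that every text/bytes boundary is visible at the call site: open_text(path, 'shift_jis') documents itself, while a bare default-encoded open() hides a conversion that will bite someone on another platform.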