Fredrik wrote:
> the original unicode implementation did just that, but Bill and
> Marc-Andre recently lowered the shields: the UTF-8 decoder
> now generates UTF-16 encoded data. (so does \N{}, but
> that's a non-issue:

> my proposal is to tighten this up in 2.0: ifdef out the UTF-16
> code in the UTF-8 decoder, and ifdef out the UTF-16 stuff in
> the compare function.

Commenting the UTF-16 stuff out in the compare function is a valid point, given our current Unicode string object. I disagree strongly with disabling the surrogate support in UTF-8, and we should fix the UTF-16 decoder. Since the decoder/encoder support doesn't hurt any other piece of code by emitting surrogate pairs, I don't see why you want to disable the code.

> (oddly enough, the UTF-16 decoder won't accept anything
> that isn't UCS-2 ;-)

Well, that's an easy bug to fix.

> let's wait until 2.1 before we support the full unicode character
> set (and I'm pretty sure "the right way" is UCS-4 storage and a
> unified string implementation, but let's discuss that later).

I've mentioned this before, but why discuss this later? Indeed, why would we want to fix it for 2.1, especially if that means moving to UCS-4 storage in a minor release? Why not just get the Unicode support correct this time around? Save the poor users of the Python Unicode support from going nuts when we make these additional confusing changes later.

If you think you want to move to UCS-4 later, don't wait, do it now. Add support for special surrogate handling later if we must, but please, oh please, don't change the in-memory storage mechanism in a later release.

Java and Win32 are both UTF-16 based, and the extra 16 bits per character that UCS-4 needs are wasted for nearly all of the Unicode data you'd commonly hope to handle. I think using UTF-16 as an internal storage mechanism actually makes sense. Whether your character type exposes an array of 16-bit values or the appearance of an array of UCS-4 characters is a separate issue, and one I think should be answered now instead of being fixed later.

Bill
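
For readers following the thread, here is a minimal sketch of the surrogate-pair arithmetic being argued about: how a code point above U+FFFF maps to two 16-bit UTF-16 code units and back. It is written in present-day Python for illustration only. The function names `to_surrogates` and `from_surrogates` are hypothetical and are not taken from the Python sources or from anything in this message.

```python
# Illustrative sketch only; helper names are hypothetical, not CPython APIs.

def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


def from_surrogates(high, low):
    """Combine a high/low surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Storage cost for BMP-only text: 2 bytes per character in UTF-16,
# 4 bytes per character in a UCS-4 (UTF-32) layout.
sample = "abc"
utf16_size = len(sample.encode("utf-16-le"))
ucs4_size = len(sample.encode("utf-32-le"))
```

The size comparison at the end reflects the storage argument above: for text that stays within the 16-bit range, a UCS-4 layout uses twice the memory of UTF-16.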