Showing content from https://mail.python.org/pipermail/python-dev/2000-July/007507.html below:

[Python-Dev] UTF-16 code point comparison

Bill Tutt billtut@microsoft.com
Thu, 27 Jul 2000 07:43:44 -0700

Fredrik wrote:

> the original unicode implementation did just that, but Bill and
> Marc-Andre recently lowered the shields: the UTF-8 decoder
> now generates UTF-16 encoded data.  (so does \N{}, but
> that's a non-issue:

> my proposal is to tighten this up in 2.0: ifdef out the UTF-16
> code in the UTF-8 decoder, and ifdef out the UTF-16 stuff in
> the compare function.

Commenting the UTF-16 stuff out in the compare function is a valid point,
given our current Unicode string object.
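
To illustrate why the compare function needs special UTF-16 handling at all, here is a minimal sketch in modern Python 3 (an assumption: it is not the 2.0 C code, and the helper names `utf16_units`, `code_unit_less`, and `code_point_less` are made up for this example). It shows that comparing raw 16-bit code units can order two strings differently from comparing their code points:

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def code_unit_less(a, b):
    """Compare by raw 16-bit code units, as a naive UTF-16 compare would."""
    return utf16_units(a) < utf16_units(b)

def code_point_less(a, b):
    """Compare by full Unicode code points."""
    return [ord(c) for c in a] < [ord(c) for c in b]

bmp = "\uff61"         # a BMP character above the surrogate range
astral = "\U00010000"  # first character outside the BMP: D800 DC00 in UTF-16

print(code_unit_less(bmp, astral))   # False: 0xFF61 > 0xD800
print(code_point_less(bmp, astral))  # True:  0xFF61 < 0x10000
```

The two orders disagree only when one string contains a surrogate pair and the other a BMP character in the range U+E000 to U+FFFF at the same position.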

I strongly disagree with disabling the surrogate support in the UTF-8 decoder;
instead, we should fix the UTF-16 decoder.
Since the decoder/encoder support doesn't hurt any other piece of code by
emitting surrogate pairs, I don't see why you want to disable the code.
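
For reference, the surrogate-pair arithmetic under discussion can be sketched as follows. This is an illustration in modern Python 3, not the decoder code itself; `to_surrogates` and `from_surrogates` are hypothetical names chosen for this example:

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    v = cp - 0x10000                  # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# A 4-byte UTF-8 sequence encodes one code point outside the BMP.
cp = ord(b"\xf0\x9d\x84\x9e".decode("utf-8"))   # U+1D11E
high, low = to_surrogates(cp)
print(hex(cp), hex(high), hex(low))             # 0x1d11e 0xd834 0xdd1e
```

A UTF-8 decoder that targets 16-bit storage would emit the two surrogates in place of the single code point; the mapping is lossless in both directions.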

> (oddly enough, the UTF-16 decoder won't accept anything
> that isn't UCS-2 ;-)

Well that's an easy bug to fix.

> let's wait until 2.1 before we support the full unicode character
> set (and I'm pretty sure "the right way" is UCS-4 storage and a
> unified string implementation, but let's discuss that later).

I've mentioned this before, but why discuss this later? Indeed, why would we
want to fix it for 2.1?
Especially if we move to UCS-4 storage in a minor release. Why not just get the
Unicode support correct this time around? Save the poor users of the Python
Unicode support from going nuts when we make these additional confusing
changes later.
If you think you want to move to UCS-4 later, don't wait, do it now. Add
support for special surrogate handling later if we must, but please, oh
please, don't change the in-memory storage mechanism in a later release.

Java and Win32 are both UTF-16 based, and the extra 16 bits of UCS-4 are
actually wasted for nearly all the common Unicode data you'd hope to handle.
I think using UTF-16 as an internal storage mechanism actually makes sense.
Whether you want your character type to expose an array of 16-bit values or
the appearance of an array of UCS-4 characters is a separate issue, and one
I think should be answered now instead of being fixed later.
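
A rough way to see the storage trade-off, sketched in modern Python 3 with a hypothetical helper `storage_bytes` (UTF-32 is used here as a stand-in for UCS-4 storage, which is an assumption of this example):

```python
def storage_bytes(s):
    """Return (UTF-16 size, UCS-4 size) in bytes for the text s."""
    utf16 = len(s.encode("utf-16-le"))   # 2 bytes per BMP char, 4 per surrogate pair
    ucs4 = len(s.encode("utf-32-le"))    # always 4 bytes per code point
    return utf16, ucs4

print(storage_bytes("hello"))            # (10, 20)
print(storage_bytes("\U0001D11E"))       # (4, 4)
```

For text that stays inside the BMP, 16-bit storage takes half the space; only characters outside the BMP cost the same in both forms.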

Bill
