Showing content from https://mail.python.org/pipermail/python-dev/2000-May/003950.html below:

[Python-Dev] [I18n-sig] Unicode strings: an alternative

Tom Emerson tree@basistech.com
Wed, 3 May 2000 16:19:05 -0400 (EDT)


Just van Rossum writes:
 > The main concept is not to provide a new string type but to extend the
 > existing string object like so:

This is the most logical thing to do.

 > - wide strings are stored as if they were narrow strings, simply using two
 > bytes for each Unicode character.

I disagree with you here... store them as UTF-8.

 > - there's a flag that specifies whether the string is narrow or wide.

Yup.

 > - the ob_size field is the _physical_ length of the data; if the string is
 > wide, len(s) will return ob_size/2, all other string operations will have
 > to do similar things.

Is it possible to add a logical length field too? I presume the worry
is that it is too expensive to recalculate the logical (character)
length of a string each time len(s) is called? Doing this is only
slightly more time consuming than a normal strlen: it is still O(n),
with a small constant cost per character for the table lookup (to get
the number of bytes in the UTF-8 sequence given the start byte) and
the pointer manipulation (to add that length to your span pointer).
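
The table-driven count described above can be sketched roughly as
follows. This is only an illustration in Python rather than the C the
interpreter would use; the names UTF8_LEN and utf8_logical_len are
made up for the example, and it assumes modern UTF-8 (sequences of at
most four bytes) and well-formed input.

```python
# 256-entry table: sequence length implied by each possible lead byte.
# 0 marks continuation bytes (0x80-0xBF) and bytes that never start a sequence.
UTF8_LEN = [0] * 256
for b in range(0x00, 0x80):
    UTF8_LEN[b] = 1          # ASCII
for b in range(0xC2, 0xE0):
    UTF8_LEN[b] = 2          # two-byte sequences
for b in range(0xE0, 0xF0):
    UTF8_LEN[b] = 3          # three-byte sequences
for b in range(0xF0, 0xF5):
    UTF8_LEN[b] = 4          # four-byte sequences


def utf8_logical_len(data: bytes) -> int:
    """Count characters by hopping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        step = UTF8_LEN[data[i]]      # table lookup on the start byte
        if step == 0:
            raise ValueError(f"invalid UTF-8 lead byte at offset {i}")
        i += step                     # advance the "span pointer"
        count += 1
    if i != len(data):
        raise ValueError("truncated UTF-8 sequence at end of data")
    return count
```

Each character costs one lookup and one addition, so the whole pass
is linear in the physical length, much like strlen.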

 > - there can possibly be an encoding attribute which may specify the used
 > encoding, if known.

So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS?

If wide strings are always Unicode, why do you need the encoding?
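
To make the question concrete, here is a rough sketch of what a
narrow string carrying an encoding label might look like. TaggedBytes
is a hypothetical name invented for this example, not anything in the
proposal or in Python itself; it only shows that the label matters
for 8-bit data, where the bytes alone do not say how to read them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedBytes:
    """Hypothetical narrow string: raw bytes plus an optional encoding label."""
    data: bytes
    encoding: Optional[str] = None   # None means "encoding not known"

    def to_text(self) -> str:
        # Without a label the bytes cannot be interpreted as characters.
        if self.encoding is None:
            raise ValueError("encoding unknown; cannot interpret bytes as text")
        return self.data.decode(self.encoding)


# A legacy Shift-JIS string: three characters stored in six bytes.
legacy = TaggedBytes("日本語".encode("shift_jis"), encoding="shift_jis")
```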

 > Admittedly, this is tricky and involves quite a bit of effort to implement,
 > since all string methods need to have narrow/wide switch. To make it worse,
 > it hardly offers anything the current solution doesn't. However, it offers
 > one IMHO _big_ advantage: C code that just passes strings along does not
 > need to change: wide strings can be seen as narrow strings without any
 > loss. This allows for __str__() & str() and friends to work with unicode
 > strings without any change.

If you store wide strings as UCS-2 then people using the C interface
lose: strlen() stops working, or will return incorrect results,
because every ASCII or Latin-1 character has a zero high byte that
the C runtime treats as a terminator. Indeed, any of the str*()
routines in the C runtime will break. This is the advantage of using
UTF-8 here: you can still use strcpy and the like on the C side and
have things work.
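
A small illustration of the point, simulated in Python rather than
written in C. The helper c_strlen is made up for this example and
only mimics what strlen() does (count bytes up to the first NUL);
"utf-16-le" stands in for UCS-2, which it matches for characters in
the Basic Multilingual Plane.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen(): number of bytes before the first NUL byte."""
    nul = buf.find(b"\x00")
    return len(buf) if nul == -1 else nul


text = "Python"
ucs2 = text.encode("utf-16-le")   # b'P\x00y\x00t\x00h\x00o\x00n\x00'
utf8 = text.encode("utf-8")       # b'Python'

print(c_strlen(ucs2))   # 1: the scan stops at the zero high byte of 'P'
print(c_strlen(utf8))   # 6: no embedded NUL bytes
```

UTF-8 never produces a zero byte except for U+0000 itself, so the
same holds for non-ASCII text: the byte count differs from the
character count, but the string is not cut short.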

 > Any thoughts?

I'm doing essentially what you suggest in my Unicode enablement of MySQL.

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
