Just van Rossum writes:
> The main concept is not to provide a new string type but to extend the
> existing string object like so:

This is the most logical thing to do.

> - wide strings are stored as if they were narrow strings, simply using two
>   bytes for each Unicode character.

I disagree with you here... store them as UTF-8.

> - there's a flag that specifies whether the string is narrow or wide.

Yup.

> - the ob_size field is the _physical_ length of the data; if the string is
>   wide, len(s) will return ob_size/2, all other string operations will have
>   to do similar things.

Is it possible to add a logical length field too? I presume it is too
expensive to recalculate the logical (character) length of a string
each time len(s) is called? Doing this is only slightly more time
consuming than a normal strlen: really just O(n) + c, where 'c' is the
constant time needed for table lookup (to get the number of bytes in
the UTF-8 sequence given the start character) and the pointer
manipulation (to add that length to your span pointer).

> - there can possibly be an encoding attribute which may specify the used
>   encoding, if known.

So is this used to handle the case where you have a legacy encoding
(ShiftJIS, say) used in your existing strings, so you flag that 8-bit
("narrow" in a way) string as ShiftJIS? If wide strings are always
Unicode, why do you need the encoding?

> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.

If you store wide strings as UCS2 then people using the C interface
lose: strlen() stops working, or will return incorrect
results. Indeed, any of the str*() routines in the C runtime will
break. This is the advantage of using UTF-8 here --- you can still use
strcpy and the like on the C side and have things work.

> Any thoughts?

I'm doing essentially what you suggest in my Unicode enablement of
MySQL.

    -tree

-- 
Tom Emerson                                     Basis Technology Corp.
Language Hacker                               http://www.basistech.com
 "Beware the lollipop of mediocrity: lick it once and you suck forever"
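
Two points from the reply above can be illustrated with a rough Python sketch: counting logical characters in UTF-8 data with a lead-byte table lookup, and why a strlen()-style scan misreports UCS2 data but not UTF-8 data. The names SEQ_LEN, utf8_char_count and c_strlen are hypothetical, made up for this illustration; c_strlen only mimics the C routine, and UCS2 is approximated with Python's "utf-16-le" codec, which matches it for the characters used here. The sketch assumes well-formed UTF-8 and does no real validation.

```python
# Sequence length in bytes, indexed by the top four bits of the lead byte.
# 0 marks continuation bytes (10xxxxxx), which never start a sequence.
SEQ_LEN = [1, 1, 1, 1, 1, 1, 1, 1,  # 0xxxxxxx: one-byte (ASCII)
           0, 0, 0, 0,              # 10xxxxxx: continuation byte
           2, 2,                    # 110xxxxx: two-byte sequence
           3,                       # 1110xxxx: three-byte sequence
           4]                       # 1111xxxx: treated as four-byte here


def utf8_char_count(data: bytes) -> int:
    """Count logical characters by hopping from lead byte to lead byte."""
    count = 0
    i = 0
    while i < len(data):
        step = SEQ_LEN[data[i] >> 4]    # table lookup on the lead byte
        if step == 0:
            raise ValueError("unexpected continuation byte at offset %d" % i)
        i += step                       # advance the span "pointer"
        count += 1
    return count


def c_strlen(data: bytes) -> int:
    """Mimic C strlen(): number of bytes before the first NUL byte."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


text = "na\u00efve \u65e5\u672c"        # 8 characters, two of them CJK
as_utf8 = text.encode("utf-8")
as_ucs2 = text.encode("utf-16-le")      # stand-in for UCS2 (assumption)

print(len(as_utf8), utf8_char_count(as_utf8))   # physical vs logical length
print(c_strlen(as_utf8))                        # whole buffer is scanned
print(len(as_ucs2), c_strlen(as_ucs2))          # scan stops at first zero byte
```

The scan is still linear in the number of bytes, which is why caching a logical length alongside ob_size is attractive if len(s) is called often. The UCS2 buffer contains a zero byte as the high half of every ASCII character, so the strlen()-style scan stops after the first byte, while the UTF-8 buffer has no embedded zero bytes for this text.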