> Today I had a relatively simple idea that unites wide strings and narrow
> strings in a way that is more backward compatible at the C level. It's quite
> possible this has already been considered and rejected for reasons that are
> not yet obvious to me, but I'll give it a shot anyway.
>
> The main concept is not to provide a new string type but to extend the
> existing string object like so:
> - wide strings are stored as if they were narrow strings, simply using two
>   bytes for each Unicode character.
> - there's a flag that specifies whether the string is narrow or wide.
> - the ob_size field is the _physical_ length of the data; if the string is
>   wide, len(s) will return ob_size/2, all other string operations will have
>   to do similar things.
> - there can possibly be an encoding attribute which may specify the used
>   encoding, if known.
>
> Admittedly, this is tricky and involves quite a bit of effort to implement,
> since all string methods need to have a narrow/wide switch. To make it worse,
> it hardly offers anything the current solution doesn't. However, it offers
> one IMHO _big_ advantage: C code that just passes strings along does not
> need to change: wide strings can be seen as narrow strings without any
> loss. This allows for __str__() & str() and friends to work with unicode
> strings without any change.

This seems to have some nice properties, but I think it would cause
problems for existing C code that tries to *interpret* the bytes of a
string: it could very well do the wrong thing for wide strings (since
old C code doesn't check for the "wide" flag). I'm not sure how much
C code there is that merely passes strings along... Most C code using
strings makes use of the strings (e.g. open() falls in this category
in my eyes).

--Guido van Rossum (home page: http://www.python.org/~guido/)
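
To make the concern above concrete, here is a minimal sketch that models the proposal in Python rather than C. Everything in it is hypothetical: `FlaggedString`, `make_wide`, `legacy_pass_along`, and `legacy_c_strlen` are illustrative names, not part of any real CPython API, and the two-bytes-per-character layout is assumed to be little-endian UTF-16 restricted to the BMP. It contrasts code that only passes a string along with code that interprets the bytes without checking the flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedString:
    """Hypothetical model of the proposed string object (not a real CPython type)."""
    data: bytes          # raw storage; len(data) plays the role of ob_size
    wide: bool = False   # the proposed narrow/wide flag

    def __len__(self) -> int:
        # Flag-aware length: two bytes per character when the string is wide.
        return len(self.data) // 2 if self.wide else len(self.data)


def make_wide(text: str) -> FlaggedString:
    # Assumption: two bytes per character, little-endian, BMP characters only.
    return FlaggedString(text.encode("utf-16-le"), wide=True)


def legacy_pass_along(s: FlaggedString) -> FlaggedString:
    # Code that merely passes the string along never looks at the bytes.
    return s


def legacy_c_strlen(s: FlaggedString) -> int:
    # Mimics old C code that ignores the flag and scans for a NUL terminator.
    end = s.data.find(b"\x00")
    return len(s.data) if end == -1 else end


narrow = FlaggedString(b"spam.txt")
wide = make_wide("spam.txt")

print(len(narrow), len(wide))          # 8 8   (flag-aware length agrees)
print(len(wide.data))                  # 16    (physical length, i.e. ob_size)
print(legacy_pass_along(wide) is wide) # True  (pass-through code is unaffected)
print(legacy_c_strlen(narrow))         # 8
print(legacy_c_strlen(wide))           # 1     (stops at the first zero byte)
```

In this model the pass-through function behaves identically for both kinds of string, while the flag-unaware scan sees a one-character name for the wide string, which is the kind of misinterpretation a call such as open() would be exposed to.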