Guido van Rossum wrote:
>
> > No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
>
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).

I could live with this compromise as long as we document that a future
version may use the "character is a character" model. I just don't want
people to start depending on a catchable exception being thrown because
that would stop us from ever unifying unmarked literal strings and
Unicode strings.

--

Are there any steps we could take to make a future divorce of strings
and byte arrays easier? What if we added a

binary_read()

function that returns some form of byte array. The byte array type could
be just like today's string type except that its type object would be
distinct, it wouldn't have as many string-ish methods and it wouldn't
have any auto-conversion to Unicode at all.

People could start to transition code that reads non-ASCII data to the
new function. We could put big warning labels on read() to state that it
might not always be able to read data that is not in some small set of
recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.

I do not suggest just using the text/binary flag on the existing open
function because we cannot immediately change its behavior without
breaking code.

--
 Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
	- http://www.cs.yale.edu/~perlis-alan/quotes.html
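
For illustration only, the two ideas discussed above (a read function that always yields a distinct byte type, and mixing that is allowed only for ASCII data) might be sketched as follows. This assumes present-day Python 3, where bytes and str are already separate types; binary_read() and mix() are hypothetical names taken from or invented for this sketch, not existing library functions, and raising an exception on non-ASCII data follows the compromise quoted above rather than any settled design.

```python
import io


def binary_read(stream, size=-1):
    """Hypothetical helper: always return a byte array (bytes), never text."""
    data = stream.read(size)
    if isinstance(data, str):
        # A text-mode stream has already decoded its data; refuse it.
        raise TypeError("binary_read() needs a stream opened in binary mode")
    return bytes(data)


def mix(text, raw):
    """Hypothetical helper: join text and 8-bit data only if the data is ASCII."""
    # bytes.decode("ascii") raises UnicodeDecodeError for any byte above 127.
    return text + raw.decode("ascii")


ascii_data = binary_read(io.BytesIO(b"abc"))
latin1_data = binary_read(io.BytesIO(b"caf\xe9"))

# ASCII-only data mixes with text.
combined = mix("x", ascii_data)

# Non-ASCII data is rejected instead of being guessed at.
try:
    mix("x", latin1_data)
    rejected = False
except UnicodeDecodeError:
    rejected = True

# "If you want to turn Latin-1 into a Unicode string, say so."
explicit = "x" + latin1_data.decode("latin-1")
```

Keeping the byte type free of any implicit conversion means the only way to reach text is to name an encoding, which is the "say so" rule from the quoted proposal.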