> u = aUnicodeStringFromSomewhere
> s = an8bitStringFromSomewhere
>
> DoSomething(s + u)

> in Guido's design, the first example may or may not result in
> an "UTF-8 decoding error: UTF-8 decoding error: unexpected
> code byte" exception.

I would say it is less surprising for most people if this follows the silent widening of each byte - the Fredrik-Paul position. With the current scarcity of UTF-8 code, very few people will expect an automatic UTF-8 to UTF-16 conversion. While complete prohibition of automatic conversion has some appeal, it will just be more noise to many.

> u = aUnicodeStringFromSomewhere
> s = an8bitStringFromSomewhere
>
> if len(u) + len(s) == len(u + s):
>     print "true"
> else:
>     print "not true"

> the second example may result in a
> similar error, print "true", or print "not true", depending on the
> contents of the 8-bit string.

I don't see this as important, as it is taking the idea that Unicode strings are equivalent to 8-bit strings too far. How much further before you have to break? I always thought of len as measuring the number of bytes rather than characters when applied to strings, the same as strlen in C when you have a DBCS string.

I should correct some of the stuff Mark wrote about me. At Fujitsu we did a lot more DBCS work than Unicode because that's what Japanese code uses. Even with Java, most storage is still DBCS. I was more involved with Unicode architecture at Reuters 6 or so years ago.

Neil
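
The three outcomes described for the second quoted example, and the way silent widening avoids them, can be reproduced with a small sketch. It is written in Python 3, where adding an 8-bit string to a Unicode string is prohibited outright, so the two conversion policies are spelled out here as explicit decodes. The sample strings and the helper names widen, decode_utf8 and length_check are assumptions made up for illustration; they do not come from the thread.

```python
# Hypothetical sample data, chosen only to trigger each outcome.
u = "abc"                                # a Unicode string
s_ascii = b"xyz"                         # 8-bit string, pure ASCII
s_latin = b"caf\xe9"                     # 8-bit string, Latin-1 bytes
s_utf8 = "caf\u00e9".encode("utf-8")     # 8-bit string, UTF-8 bytes (5 bytes)


def widen(s):
    """Silent widening: each byte becomes the code point of the same value."""
    return s.decode("latin-1")


def decode_utf8(s):
    """Treat the 8-bit string as UTF-8; may raise UnicodeDecodeError."""
    return s.decode("utf-8")


def length_check(u, s, convert):
    """Evaluate len(u) + len(s) == len(u + s) under a conversion policy.

    Returns True or False, or the string "error" if the conversion fails.
    """
    try:
        return len(u) + len(s) == len(u + convert(s))
    except UnicodeDecodeError:
        return "error"


for name, s in [("ascii", s_ascii), ("latin", s_latin), ("utf8", s_utf8)]:
    print(name,
          "widen:", length_check(u, s, widen),
          "utf-8:", length_check(u, s, decode_utf8))
```

Under widening the check holds for every input, because one byte always becomes one character. Under UTF-8 decoding the same check holds, fails, or raises depending on the bytes in the 8-bit string.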