bill wrote:
> So use UCS-4 internal storage now. UTF-16 just seems like a handy internal
> storage mechanism to pick since Win32 and Java use it for their native
> string processing.

umm. the Java docs I have access to don't mention surrogates
at all (they do point out that a character is 16-bit, and they
don't provide an \U escape).

on the other hand, MSDN says:

    Windows 2000 provides support for basic input, output, and
    simple sorting of surrogates. However, not all Windows 2000
    system components are surrogate compatible. Also, surrogates
    are not supported in Windows 95/98 or in Windows NT 4.0.

and then mentions all the usual problems with variable-width
encodings...

> > after all, if variable-width internal storage had been easy to deal
> > with, we could have used UTF-8 from the start... (and just like
> > the Tcl folks, we would have ended up rewriting the whole thing
> > in the next release ;-)
>
> Oh please, UTF-16 is substantially simpler to deal with than UTF-8.

in what way? as in "one or two words" is simpler than "one, two,
three, four, five, or six bytes"?

or as in "nobody will notice anyway..." ;-)

:::

if UCS-2/BMP was good enough for NT 4.0, Unicode 1.1, and Java 1.0,
it's surely good enough for Python 2.0 ;-)

(and if I understand things correctly, 2.1 isn't that far away...)

</F>
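
For readers who want the "one or two words" versus "one to six bytes" comparison spelled out, here is a minimal sketch. It is not code from this thread: the function names `utf16_units` and `utf8_length` are made up for illustration, and the byte counts follow the original UTF-8 definition (up to six bytes, as in RFC 2279) that the post refers to, not the later four-byte limit. The sketch does not validate its input, so it will happily accept a lone surrogate code point.

```python
def utf16_units(cp):
    """Return the list of 16-bit code units that encode code point cp in UTF-16."""
    if cp < 0x10000:
        # BMP character: a single 16-bit word
        return [cp]
    # Supplementary character: split the 20-bit offset into a surrogate pair
    offset = cp - 0x10000
    high = 0xD800 | (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


def utf8_length(cp):
    """Return the byte count for cp under the original six-byte UTF-8 scheme."""
    # Exclusive upper bounds for 1-, 2-, 3-, 4-, 5- and 6-byte sequences
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for nbytes, limit in enumerate(limits, 1):
        if cp < limit:
            return nbytes
    raise ValueError("code point out of range")


# Hypothetical sample code points, chosen to hit several length classes
for cp in (0x41, 0xE5, 0x20AC, 0x10000, 0x10FFFF):
    units = utf16_units(cp)
    print("U+%04X: %d UTF-16 word(s), %d UTF-8 byte(s)"
          % (cp, len(units), utf8_length(cp)))
```

Either way, code that indexes or slices the buffer has to handle more than one width per character, which is the point made above about variable-width internal storage.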