guido:
> > This *should* be correct because Tcl/Tk always uses UTF-8 internally.
> > (Even though it is "lenient" when receiving strings -- if a sequence
> > of characters has no valid Unicode representation, it appears to fall
> > back to Latin-1; I don't know the details of this algorithm.)

Tcl/Tk uses a 16-bit (UCS-2) Unicode string type internally, but their
8-bit strings use UTF-8.

When converting from external 8-bit strings to Unicode, they convert
valid UTF-8 sequences to Unicode characters just like Python, but "a
lead-byte not followed by enough trail-bytes represents itself" (in
other words, it's cast from an unsigned char to an unsigned short).

And the chance that any reasonable Latin-1 string would contain a UTF-8
lead byte followed by the right number of UTF-8 trail bytes is close to
zero...

(in case anyone wonders, Python's codec thinks that "close to zero"
isn't good enough, so it raises an exception instead)

tim:
> Dunno, but wouldn't be surprised if they had a notion of default encoding,
> and that it simply appears to be Latin-1 to us because American Windows uses
> a superset of Latin-1.

They have a system encoding, but it's not used here -- it's just that
Latin-1 is a subset of Unicode...

</F>
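
PS. For anyone who wants to play with the rule quoted above, here is a rough Python sketch of it. This is not Tcl's actual code: the function name `lenient_utf8_decode` is made up for illustration, it only handles the 1- to 3-byte sequences that fit in UCS-2, and it makes no attempt to reject overlong encodings. Anything that is not a complete, well-formed sequence is passed through as the code point with the same value, which is exactly what decoding that byte as Latin-1 would give.

```python
def lenient_utf8_decode(data: bytes) -> str:
    """Decode UTF-8, letting malformed bytes "represent themselves".

    Illustrative sketch only (hypothetical helper, not Tcl's implementation).
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # How many trail bytes does this lead byte ask for?
        if 0xC0 <= lead < 0xE0:
            need = 1
        elif 0xE0 <= lead < 0xF0:
            need = 2
        else:
            need = 0  # ASCII, stray trail byte, or a lead outside UCS-2
        trail = data[i + 1:i + 1 + need]
        complete = (
            need > 0
            and len(trail) == need
            and all(0x80 <= t < 0xC0 for t in trail)
        )
        if complete:
            # Well-formed sequence: assemble the code point from its payload bits.
            if need == 1:
                cp = ((lead & 0x1F) << 6) | (trail[0] & 0x3F)
            else:
                cp = (
                    ((lead & 0x0F) << 12)
                    | ((trail[0] & 0x3F) << 6)
                    | (trail[1] & 0x3F)
                )
            out.append(chr(cp))
            i += 1 + need
        else:
            # The byte represents itself: widen it to the same code point.
            out.append(chr(lead))
            i += 1
    return "".join(out)
```

With this, both `"café".encode("utf-8")` and `"café".encode("latin-1")` come back as `"café"`, while Python's own `bytes.decode("utf-8")` raises `UnicodeDecodeError` on the Latin-1 bytes. The "close to zero" case is a Latin-1 string such as `"Ã©"`, whose bytes `C3 A9` happen to form a valid UTF-8 sequence and so come back as `"é"`.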