Glenn Linderman wrote:
> On approximately 4/28/2009 1:25 PM, came the following characters from
> the keyboard of Martin v. Löwis:
>>> The UTF-8b representation suffers from the same potential ambiguities as
>>> the PUA characters...
>>
>> Not at all the same ambiguities. Here, again, the two choices:
>>
>> A. use PUA characters to represent undecodable bytes, in particular for
>>    UTF-8 (the PEP actually never proposed this to happen).
>>    This introduces an ambiguity: two different files in the same
>>    directory may decode to the same string name, if one has the PUA
>>    character, and the other has a non-decodable byte that gets decoded
>>    to the same PUA character.
>>
>> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>>    The same ambiguity does *NOT* exist. If a file on disk already
>>    contains an invalid surrogate code in its file name, then the UTF-8b
>>    decoder will recognize this as invalid, and decode it byte-for-byte,
>>    into three surrogate codes. Hence, the file names that are different
>>    on disk are also different in memory. No ambiguity.
>
> C. File on disk with the invalid surrogate code, accessed via the str
> interface, no decoding happens, matches in memory the file on disk with
> the byte that translates to the same surrogate, accessed via the bytes
> interface. Ambiguity.

Is that an alternative to A and B?

Regards,
Martin
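
For illustration, here is a minimal sketch of the behaviour described under choice B. It assumes Python 3's "surrogateescape" error handler as a stand-in for the UTF-8b decoder discussed in this thread; the variable names are made up for the example, and the byte strings are hypothetical file names, not taken from any real directory.

```python
# Assumption: Python 3's "surrogateescape" error handler stands in for the
# UTF-8b decoder discussed above. All names below are illustrative only.

# A file name consisting of the single undecodable byte 0x80.
# The decoder escapes it to the lone surrogate U+DC80.
name_with_bad_byte = b"\x80".decode("utf-8", "surrogateescape")

# A file name that already contains U+DC80 written out as the byte
# sequence ED B2 80. That sequence is itself invalid UTF-8, so it is
# decoded byte-for-byte into three surrogate codes.
name_with_surrogate = b"\xed\xb2\x80".decode("utf-8", "surrogateescape")

print(ascii(name_with_bad_byte))
print(ascii(name_with_surrogate))

# Names that differ on disk also differ in memory.
print(name_with_bad_byte != name_with_surrogate)

# Encoding with the same handler restores the original bytes.
restored_bad_byte = name_with_bad_byte.encode("utf-8", "surrogateescape")
restored_surrogate = name_with_surrogate.encode("utf-8", "surrogateescape")
```

The sketch only covers names that pass through the decoder. It does not model case C, where a string is obtained without any decoding step.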