On approximately 4/28/2009 1:25 PM, came the following characters from
the keyboard of Martin v. Löwis:
>> The UTF-8b representation suffers from the same potential ambiguities as
>> the PUA characters...
>
> Not at all the same ambiguities. Here, again, the two choices:
>
> A. use PUA characters to represent undecodable bytes, in particular for
>    UTF-8 (the PEP actually never proposed this to happen).
>    This introduces an ambiguity: two different files in the same
>    directory may decode to the same string name, if one has the PUA
>    character, and the other has a non-decodable byte that gets decoded
>    to the same PUA character.
>
> B. use UTF-8b, representing the byte with ill-formed surrogate codes.
>    The same ambiguity does *NOT* exist. If a file on disk already
>    contains an invalid surrogate code in its file name, then the UTF-8b
>    decoder will recognize this as invalid, and decode it byte-for-byte,
>    into three surrogate codes. Hence, the file names that are different
>    on disk are also different in memory. No ambiguity.

C. File on disk with the invalid surrogate code, accessed via the str
   interface, no decoding happens, matches in memory the file on disk with
   the byte that translates to the same surrogate, accessed via the bytes
   interface. Ambiguity.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
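
For readers following cases B and C, here is a minimal sketch using the
"surrogateescape" error handler that current Python 3 provides for this
scheme. It only models the strings in memory and does not touch a real
file system. The helper names decode_name and encode_name and the sample
byte strings are hypothetical, chosen for illustration only.

```python
# Sketch only: models file names as bytes/str values, no real files involved.
# decode_name / encode_name are hypothetical helpers, not part of any API.

def decode_name(raw: bytes) -> str:
    """Decode a byte file name the UTF-8b way (undecodable bytes -> lone surrogates)."""
    return raw.decode("utf-8", errors="surrogateescape")

def encode_name(name: str) -> bytes:
    """Reverse of decode_name: lone surrogates U+DC80..U+DCFF -> original bytes."""
    return name.encode("utf-8", errors="surrogateescape")

# Case B: two names that differ on disk (both seen through the bytes interface).
undecodable = b"\x80"                # a single byte that is not valid UTF-8
surrogate_on_disk = b"\xed\xb2\x80"  # the bytes a naive encoder would emit for U+DC80

name_b1 = decode_name(undecodable)        # one lone surrogate, U+DC80
name_b2 = decode_name(surrogate_on_disk)  # three lone surrogates, one per byte

print(ascii(name_b1), ascii(name_b2))
print(name_b1 != name_b2)                 # different on disk, different in memory

# Case C: a name handed over by a str interface with no decoding step
# (assumed here: the platform reports a name containing a lone U+DC80 as-is).
name_from_str_api = "\udc80"

print(name_from_str_api == name_b1)       # the collision described in case C
print(encode_name(name_from_str_api))     # maps back to the same single byte
```

Whether case C can arise in practice depends on a platform handing out
str names without decoding them, which is an assumption of this sketch.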