On approximately 4/29/2009 7:50 PM, came the following characters from
the keyboard of Aahz:
> On Thu, Apr 30, 2009, Cameron Simpson wrote:
>> The lengthy discussion mostly revolves around:
>>
>>  - Glenn points out that strings that came _not_ from listdir, and that are
>>    _not_ well-formed unicode (== "have bare surrogates in them") but that
>>    were intended for use as filenames will conflict with the PEP's scheme -
>>    programs must know that these strings came from outside and must be
>>    translated into the PEP's funny-encoding before use in the os.*
>>    functions. Previous to the PEP they would get used directly and
>>    encode differently after the PEP, thus producing different POSIX
>>    filenames. Breakage.
>>
>>  - Glenn would like the encoding to use Unicode scalar values only,
>>    using a rare-in-filenames character.
>>    That would avoid the issue with "outside" strings that contain
>>    surrogates. To my mind it just moves the punning from rare illegal
>>    strings to merely uncommon but legal characters.
>>
>>  - Some parties think it would be better to not return strings from
>>    os.listdir but a subclass of string (or at least a duck-type of
>>    string) that knows where it came from and is also handily
>>    recognisable as not-really-a-string for purposes of deciding
>>    whether it is PEP-funny-encoded by direct inspection.
>
> Assuming people agree that this is an accurate summary, it should be
> incorporated into the PEP.

I'll agree that, once other misconceptions were explained away, the
remaining issues are those Cameron summarized. Thanks for the summary!

Point two could be modified because I've changed my opinion; I like the
invariant Cameron first (I think) explicitly stated about the PEP as it
stands, and that I just reworded in another message: the strings that
are altered by the PEP in either direction are in the subset of strings
that contain fake (from a strict Unicode viewpoint) characters.
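
As a rough illustration of that invariant, here is a minimal Python
sketch. It assumes the scheme behaves like the "surrogateescape" error
handler that later shipped with Python 3.1, which may differ in detail
from the draft under discussion; the helper names decode_filename,
encode_filename and has_fake_chars are made up for this example.

```python
# Sketch only: relies on the "surrogateescape" error handler (Python 3.1+),
# assumed here to match the PEP's scheme. Helper names are hypothetical.

def decode_filename(raw: bytes) -> str:
    """Decode filename bytes; each undecodable byte becomes a lone surrogate."""
    return raw.decode("utf-8", "surrogateescape")


def encode_filename(name: str) -> bytes:
    """Encode a string back to bytes, turning lone surrogates into raw bytes."""
    return name.encode("utf-8", "surrogateescape")


def has_fake_chars(name: str) -> bool:
    """True if the string contains a lone surrogate in U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


# A Latin-1 encoded name: 0xE9 is not valid UTF-8 here, so it is escaped.
escaped = decode_filename(b"caf\xe9")
assert escaped == "caf\udce9"
assert has_fake_chars(escaped)
assert encode_filename(escaped) == b"caf\xe9"  # round-trips to the same bytes

# A well-formed string has no fake characters and is left alone:
# it encodes exactly as plain UTF-8 would, and decodes back unchanged.
plain = "caf\u00e9"
assert not has_fake_chars(plain)
assert encode_filename(plain) == plain.encode("utf-8")
assert decode_filename(plain.encode("utf-8")) == plain

# Bytes that look like a UTF-8 encoded surrogate are themselves invalid,
# so each of the three bytes is escaped separately.
three = decode_filename(b"\xed\xb3\xa9")
assert three == "\udced\udcb3\udca9"
assert encode_filename(three) == b"\xed\xb3\xa9"
```

In this sketch, only strings for which has_fake_chars() is true are
treated differently from ordinary UTF-8 in either direction.
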
I still think an encoding that uses mostly real characters that have
assigned glyphs would be better than the encoding in the PEP, but I
would now suggest that the escape character be a fake character. I'll
note here that while the PEP encoding causes each illegal byte to be
translated to one fake character, the 3-byte sequence that looks like
the range of fake characters would also be translated to a sequence of
3 fake characters. That makes 512 combinations that must be translated,
and understood by the user (or at least by the programmer). The "escape
sequence" approach requires changing only 257 combinations, and each
altered combination would result in exactly 2 characters. Hence, this
seems simpler to understand, and to manually encode and decode for
debugging purposes.

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking