Martin v. Löwis wrote:
>> I see two main user-oriented use cases for the resulting Unicode
>> strings this PEP will produce on all systems: displaying a list of
>> filenames for the user to select from (an open file dialog), and
>> allowing a user to edit or supply a filename (a save dialog or a
>> rename control).
>
> There are more, in particular the case "user passes a file name
> on the command line", and "web server passes URL in environment
> variable".
>
>> It's clear what this PEP provides for the former. On well-behaved
>> systems where a simpler filesystemencoding approach would work, the
>> results are identical; the user can select filenames that are what he
>> expects to see on both Unix and Windows. On less well-behaved systems,
>> some characters may appear as junk in the middle of the name (or would
>> they be invisible?)
>
> Depends on the rendering. Try "print u'\udc00'" in your terminal to see
> what happens; for me, it renders the glyph for "replacement character".
> In GUI applications, you often see white boxes (rectangles).
>
>> What I don't find clear is what the risks are for the latter. On the
>> less well behaved system, a user may well attempt to use this python
>> application to fix filenames. Can we estimate a likelihood that edits
>> to the names would result in a Unicode string that can no longer be
>> encoded with the python-escape? Will a new name fully provided by a
>> user on his keyboard (ignoring copy and paste) almost always safely
>> encode?
>
> That very much depends on the system setup, and your impression is
> right that the PEP doesn't address it - it only deals with cases
> where you get random unsupported bytes; getting random unsupported
> characters from the user is not considered.
>
> If the user has the locale set up in a way that matches his keyboard,
> it should work all fine - and will already, even without the PEP.
> If the user enters a character that doesn't directly map to a
> good file name, you get an exception, and have to tell the user
> to pick a different filename.
>
> Notice that it may fail at several layers:
> - it may be that characters entered are not supported in what
>   Python chooses as the file system encoding.
> - it may be that the characters are not supported by the file
>   system, e.g. leading spaces in Win32.
> - it may be that the file cannot be renamed because the target
>   name already exists.
> In all these cases, the application has to ask the user to
> reconsider; for at least the last case, it should be prepared
> to do that, anyway (there is also the case where renaming fails
> because of lack of permissions; in that case, picking a different
> file name won't help).
>
This has made me think about what happens going the other way, i.e.
when a user-supplied Unicode string needs to be converted to UTF-8b.
That should also be reversible. Therefore:

When encoding using UTF-8b, codepoints in the range U+DC80..U+DCFF
should map to bytes 0x80..0xFF; all other codepoints, including the
remaining half surrogates, should be encoded normally.

When decoding using UTF-8b, undecodable bytes in the range 0x80..0xFF
should map to U+DC80..U+DCFF; all other bytes, including the encodings
for the remaining half surrogates, should be decoded normally.

This will ensure that even when the user has provided a string
containing half surrogates it can be encoded to bytes and then decoded
back to the original string.
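
To make the two rules concrete, here is a rough sketch in Python 3. It is
only an illustration under stated assumptions, not the PEP's
implementation: the names utf8b_encode and utf8b_decode are made up, and
"encoded/decoded normally" is taken to mean the behaviour of the standard
"surrogatepass" error handler. The decoder also treats a byte sequence
that would decode to U+DC80..U+DCFF as undecodable, which is how I read
"the remaining half surrogates".

```python
ESCAPE_LOW, ESCAPE_HIGH = 0xDC80, 0xDCFF  # code points reserved for raw bytes


def utf8b_encode(text):
    """Encode text, turning U+DC80..U+DCFF back into bytes 0x80..0xFF."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_LOW <= cp <= ESCAPE_HIGH:
            out.append(cp - 0xDC00)  # escaped byte: emit it unchanged
        else:
            # everything else, including other lone surrogates
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


def utf8b_decode(data):
    """Decode bytes, mapping undecodable bytes to U+DC80..U+DCFF."""
    out = []
    i = 0
    while i < len(data):
        ch, step = None, 1
        # Look for the shortest well-formed sequence starting at offset i.
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                candidate = chunk.decode("utf-8", "surrogatepass")
            except UnicodeDecodeError:
                continue
            # A sequence that would yield an escape code point is
            # deliberately treated as undecodable (assumption, see above).
            if not ESCAPE_LOW <= ord(candidate) <= ESCAPE_HIGH:
                ch, step = candidate, len(chunk)
            break
        if ch is None:
            ch = chr(0xDC00 + data[i])  # undecodable bytes are always >= 0x80
        out.append(ch)
        i += step
    return "".join(out)


raw = b"caf\xe9"               # Latin-1 bytes, not valid UTF-8
name = utf8b_decode(raw)       # the 0xE9 byte becomes U+DCE9
assert utf8b_encode(name) == raw

typed = "\ud800x"              # user string with a lone high surrogate
assert utf8b_decode(utf8b_encode(typed)) == typed
```

With this sketch, decoding and then re-encoding gives back the original
bytes. Going the other way works for the strings shown, but a string whose
escape code points happen to spell out a valid UTF-8 sequence (for example
U+DCC3 followed by U+DCA9) would come back as the ordinary character, so
that case would need separate handling.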