On Apr 6, 2016 1:26 AM, "Chris Angelico" <rosuav at gmail.com> wrote:
>
> On Wed, Apr 6, 2016 at 3:37 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> > Chris Angelico writes:
> >
> > > Outside of deliberate tests, we don't create files on our disks
> > > whose names are strings of random bytes;
> >
> > Wishful thinking. First, names made of control characters have often
> > been deliberately used by miscreants to conceal their warez. Second,
> > in some systems it's all too easy to create paths with components in
> > different locales (the place I've seen it most frequently is in NFS
> > mounts). I think that's much less true today, but perhaps that's only
> > because my employer figured out that it was much less pain if system
> > paths were pure ASCII so that it mostly didn't matter what encoding
> > users chose for their subtrees.
>
> Control characters are still characters, though. You can take a
> bytestring consisting of byte values less than 32, decode it as UTF-8,
> and have a series of codepoints to work with.
>
> If your employer has "solved" the problem by restricting system paths
> to ASCII, that's a fine solution for a single system with a single
> ASCII-compatible encoding; a better solution is to mandate UTF-8 as
> the file system encoding, as that's what most people are expecting
> anyway.
>
> > It remains important to be able to handle nearly arbitrary bytestrings
> > in file names as far as I can see. Please note that 100 million
> > Japanese and 1 billion Chinese by and large still prefer their
> > homegrown encodings (plural!!) to Unicode, while many systems are now
> > defaulting filenames to UTF-8. There's plenty of room remaining for
> > copying bytestrings to arguments of open and friends.
>
> Why exactly do they prefer these other encodings? Are they
> representing characters that Unicode doesn't contain? If so, we have a
> fundamental problem (no Python program is going to be able to cope
> with these, without a third party library or some stupid mess of local
> code); if not, you can always represent it as Unicode and encode it as
> UTF-8 when it reaches the file system. Re-encoding is something that's
> easy when you treat something as text, and impossible when you treat
> it as bytes.
>
> So far, you're still actually agreeing with me: paths are *text*, but
> sometimes we don't know the encoding (and that's a problem to be
> solved).

re: bytestring, unicode, encodings after e.g. os.path.split / Path.split:

from "[Python-ideas] Type hints for text/binary data in Python 2+3 code"
https://mail.python.org/pipermail/python-ideas/2016-March/038869.html

>> would/will it be possible to use Typing.Text as a base class for
>> even-more abstract string types

https://mail.python.org/pipermail/python-ideas/2016-March/039016.html

>> * Text.encoding
>> * Text.lang (urn:ietf:rfc:3066)
... forgot to CC:
>> * https://tools.ietf.org/html/rfc5646
>>   "Tags for Identifying Languages"
>>   urn:ietf:rfc:5646

is this (Path) a narrower case of string types (#strypes), because after
transformations we want to preserve string metadata like e.g encoding?

I'd vote for

* adding DirEntry.__path__ as a proxy to DirEntry.path
* standardizing on __path__ (over .path)
  * because this operation *is* fundamentally similar to e.g. __str__
* operator.path

pathify, pathifize

>
> ChrisA
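
A minimal Python sketch of two points from the message above, offered as an illustration and not as anyone's actual implementation. The first part shows why re-encoding a file name is straightforward once it is treated as text, and how undecodable bytes can still round-trip through `str` using the standard `surrogateescape` error handler (PEP 383). The second part is one possible reading of the `__path__` suggestion. The helper names `reencode_name`, `decode_lossless`, `encode_lossless`, `pathify` and `FakeDirEntry` are hypothetical, and treating `__path__` as a plain attribute mirroring `.path` is an assumption; the thread does not specify its shape.

```python
# Sketch only. All helper names here (reencode_name, decode_lossless,
# encode_lossless, pathify, FakeDirEntry) are hypothetical illustrations,
# not APIs proposed or confirmed in the thread.


def reencode_name(raw: bytes, source_encoding: str,
                  target_encoding: str = "utf-8") -> bytes:
    """Re-encode a file name by decoding it to text first."""
    text = raw.decode(source_encoding)  # only possible if the encoding is known
    return text.encode(target_encoding)


def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8; undecodable bytes become lone surrogates (PEP 383)."""
    return raw.decode("utf-8", "surrogateescape")


def encode_lossless(name: str) -> bytes:
    """Inverse of decode_lossless: recover the original bytes exactly."""
    return name.encode("utf-8", "surrogateescape")


class FakeDirEntry:
    """Stand-in for os.DirEntry that exposes the suggested __path__ proxy."""

    def __init__(self, path):
        self.path = path

    @property
    def __path__(self):
        # Assumption: __path__ is a plain attribute that mirrors .path.
        return self.path


def pathify(obj):
    """Return a str/bytes path for obj, consulting __path__ if present."""
    if isinstance(obj, (str, bytes)):
        return obj
    try:
        return obj.__path__
    except AttributeError:
        raise TypeError(f"{type(obj).__name__} is not path-like") from None


if __name__ == "__main__":
    # Control bytes are valid UTF-8 and decode to ordinary code points.
    control = bytes(range(1, 32))
    print(repr(control.decode("utf-8")))

    # A Shift_JIS name re-encoded as UTF-8 by going through text.
    sjis = "日本語.txt".encode("shift_jis")
    print(reencode_name(sjis, "shift_jis"))

    # A name that is not valid UTF-8 still round-trips byte for byte.
    raw = b"caf\xe9.txt"
    print(encode_lossless(decode_lossless(raw)) == raw)

    print(pathify(FakeDirEntry("/tmp/example")))
```

The re-encoding helper only works when the source encoding is known, which is exactly the open problem the quoted text points at; the `surrogateescape` pair sidesteps that by preserving unknown bytes instead of interpreting them. For reference, the protocol that later shipped in Python 3.6 is spelled `__fspath__` and `os.fspath()` (PEP 519), not `__path__`.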