Taken out of order.

>>>>> "Guido" == Guido van Rossum <guido@python.org> writes:

Guido> Same here.  If you still think it's necessary, maybe you
Guido> can try to express exactly when you would want a program to
Guido> be declared illegal because of expected problems in phase 2?

I guess my point is that I don't want to try to do that, because I'm
pretty sure I'd get it wrong for some common natural language or
platform encoding I have no specific knowledge of.  Even the small
amount of detail in the PEP seems too much, to me.

I think it's much better to say "The parser accepts programs encoded
in Unicode.  We provide some hooks to help you get from encodings
convenient for your environment to Unicode, and some sample
implementations of things to hang on the hooks.  But if there are
problems with non-Unicode files, they're your problems."

I remain unconvinced that this PEP is as good as it could be, but I
don't have time to provide a full counter-proposal.  It will provide
the benefits claimed for the people it's targeted at.  However,

o There may be some audiences who are poorly served (Mr. Suzuki).

o I think it will definitely tend to encourage use of national/
  platform encodings rather than UTF-8 in source.  It may be hard to
  get this sun to set.

o I think it makes it hard to implement helper tools (eg python-mode).

o I think it discourages a clean separation of the parser from the
  codecs (see below for examples).

That said, it's clearly better than the current situation.  Since the
people who will be implementing it seem to be unconvinced by my
arguments, it's probably best to go ahead with it.  I'll try to
follow the implementation discussions and would certainly respond if
asked.

>> Mr. Suzuki's friends.  People who use UTF-16 strings in other
>> applications (eg Java), but otherwise are happy with English.

Guido> I think even Mr. Suzuki isn't thinking of using UTF-16 in
Guido> his Unicode literals.  He currently sets UTF-16 as the
Guido> default encoding for data that he presumably reads from a
Guido> file.

Well, I'm not a native Japanese speaker.  But I have often edited
English strings that occur in swaths of unrecognizable octets that
would be Japanese if I had the terminal encoding set correctly.  I
have also cut and pasted encoded Japanese into "binary" buffers.  And
how is he going to use regexps or formatting sugar without literal
UTF-16 strings?

Guido> The other interpretation (that they would use UTF-16 inside
Guido> u"" and ASCII elsewhere) is just as insane, since no person
Guido> implementing a text editor with any form of encoding
Guido> support would be insane enough to support such a mixed-mode
Guido> encoding.

"I resemble that remark."  Seriously, that is _exactly_ what
X?Emacs/Mule does as its implementation of multilingual buffers,
since it's basically modeless ISO 2022.  Currently it does not get
the display right for the interpretation I'm suggesting for Python
strings, but it wouldn't be hard.  However, that would require that
Emacs _ignore_ the Python coding cookie, and then turn around and
have python-mode do the work.  (This isn't a big deal, but the Python
interpreter will implicitly be doing something similar---you won't be
able to apply a standard codec and get what you want.)

>> Are you going to deprecate the practice of putting KOI8-R into
>> ordinary strings?

[example of how it works if you just let it work snipped]

Guido> I think this will actually work.

Right, as long as by "work" you mean "it's formally undefined, but
8-bit clean stuff just passes through."

The problem is that people often do unclean things, like type
ALT 1 8 5 to insert a 0xB9 octet, which the editor assumes is
intended to be 'š' in a Latin-2 locale.  However, if that file (which
the user knows contains no Latin-2 at all) is read in a Latin-2
locale and translated to Unicode, the byte value changes (in fact,
it's no longer a byte value).  What's a parser to do?<wink>

This can be made safe by not decoding the contents of ordinary string
literals, but that requires the parser itself to do the lexing; you
can't delegate it to a general-purpose codec.
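To make that failure mode concrete, here is a minimal sketch in
modern Python terms (the stray byte is hypothetical, but the codec
behavior is real):

    raw = b'\xb9'                   # the octet typed with ALT 1 8 5
    text = raw.decode('iso8859-2')  # phase 2 decodes the whole file
    assert text == '\u0161'         # now it's the character 'š', not a byte
    assert text.encode('utf-8') == b'\xc5\xa1'   # two bytes, neither 0xB9
    assert text.encode('iso8859-2') == raw       # recoverable only by
                                                 # re-encoding with exactly
                                                 # the source encoding

So the literal's bytes survive only if the implementation remembers,
and reapplies, the cookie's encoding; any other re-encoding silently
rewrites them.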
Guido> But the treatment of k under phase 2 will be, um,
Guido> interesting, and I'm not sure what it should do!!!

Bingo.  And files which until that point embedded arbitrary binary
(ie, bytes not representing characters) quite possibly stop working.
(This is a natural hack to anybody familiar with Emacs/Mule.)

Guido> Since in phase 2 the entire file will be decoded from
Guido> KOI8-R to Unicode before it's parsed, maybe the best thing
Guido> would be to encode 8-bit string literals back using KOI8-R
Guido> (in general, the encoding given in the encoding cookie).

This probably mostly works, based on Mule experience (the first
sketch below shows the round trip).  But it requires the parser to
have carnal knowledge of coding systems.  Isn't it preferable to
insist on UTF-8 here, since that simply changes the representation
from one or two bytes to a constant-width one, without changing
values?

Also, you'd have to prohibit encodings that use ISO 2022 control
sequences, as there are always many legal ways to encode the same
text (there is no requirement that a mode-switching sequence actually
be followed by any text before switching to a different mode), and
there's no way to distinguish them except to record the original
input.

You'd also have to document this use for the codecs, because
otherwise somebody might do something really cool like make the
codecs produce canonical Unicode (ie, either maximally decomposed or
maximally composed representations).  That would also make reversal
ambiguous for any encoding that provides both composed and decomposed
forms of characters.
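Sketching the re-encoding suggestion in modern Python terms (the
literal is hypothetical): decode the whole file with the cookie's
encoding, then encode the contents of 8-bit string literals back with
that same encoding.  For a single-byte charset like KOI8-R the round
trip is exact:

    source = b'\xd4\xc5\xcb\xd3\xd4'       # "текст" in KOI8-R
    as_unicode = source.decode('koi8-r')   # phase 2: whole file -> Unicode
    # ... the parser lexes the literal out of the Unicode text ...
    recovered = as_unicode.encode('koi8-r')  # re-encode per the cookie
    assert recovered == source             # exact for single-byte charsets

The two sketches that follow show where the same trick breaks down.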
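The ISO 2022 point, sketched with ISO-2022-JP (the byte sequences are
illustrative): a redundant shift back to ASCII between two characters
yields a second, equally legal encoding of the same text.  Decoding
conflates the two, so re-encoding cannot recover the original bytes:

    redundant = b'\x1b$B$"\x1b(B\x1b$B$$\x1b(B'  # two hiragana, with a
                                                 # pointless shift-out/in
                                                 # between them
    canonical = b'\x1b$B$"$$\x1b(B'              # same text, one JIS run
    assert redundant.decode('iso2022_jp') == canonical.decode('iso2022_jp')
    # Re-encoding yields only the canonical form; `redundant` is lost.
    assert redundant.decode('iso2022_jp').encode('iso2022_jp') == canonical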
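And the composed/decomposed ambiguity, sketched with unicodedata: if
a codec canonicalized to decomposed form, a literal that round-trips
today would stop being encodable at all:

    import unicodedata
    composed = '\xe9'                    # 'é' as a single code point
    decomposed = unicodedata.normalize('NFD', composed)  # 'e' + U+0301
    assert composed != decomposed        # same text, different sequences
    assert composed.encode('latin-1') == b'\xe9'
    # decomposed.encode('latin-1') raises UnicodeEncodeError: only the
    # composed form maps back to a single Latin-1 byte.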
-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Don't ask how you can "do" free software business;
ask what your business can "do for" free software.