> >> a. Does this really make sense for UTF-16?  It looks to me like
> >> a great way to induce bugs of the form "write a unicode literal
> >> containing 0x0A, then translate it to raw form by stripping the
> >> u prefix."
>
> Guido> Of course not.  I don't expect anyone to put UTF-16 in their
> Guido> source encoding cookie.
>
> Mr. Suzuki's friends.  People who use UTF-16 strings in other
> applications (eg Java), but otherwise are happy with English.

I don't understand the mechanics, unless they encode the entire file
in UTF-16.  And then Python can't parse it, because it assumes ASCII.
I think even Mr. Suzuki isn't thinking of using UTF-16 in his Unicode
literals.  He currently sets UTF-16 as the default encoding for data
that he presumably reads from a file.

> Guido> But should we bother making a list of encodings that
> Guido> shouldn't be used?
>
> I would say yes.  People will find reasons to inflict harm on
> themselves if you don't.

Any file that's encoded in an encoding (such as UTF-16) that's not an
ASCII superset is unparseable for Python -- Python would never even
get to the point of recognizing the comment with the encoding cookie.
So I doubt that this will be a problem.  It's like the danger of
hitting yourself in the head with a 16-ton weight -- in order to
swing it, you'd first have to lift it...

The other interpretation (that they would use UTF-16 inside u"" and
ASCII elsewhere) is just as insane, since nobody implementing a text
editor with any form of encoding support would be insane enough to
support such a mixed-mode encoding.

> >> b. No editor is likely to implement correct display to
> >> distinguish between u"" and just "".
>
> Guido> That's fine.  Given phase 2, the editor should display the
> Guido> entire file using the encoding given in the cookie, despite
> Guido> that phase 1 only applies the encoding to u"" literals.
> Guido> The rest of the file is supposed to be ASCII, and if it
> Guido> isn't, that's the user's problem.
>
> Huh?
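(The ASCII-superset point can be checked mechanically; here's a small
sketch in modern-Python terms, illustrative only, since the parser
scans for the cookie comment before it knows the file's encoding:)

```python
# Illustrative sketch (modern Python, not the interpreter under
# discussion): an encoding cookie is only discoverable if the source
# bytes are an ASCII superset, because the parser must scan for the
# comment before it knows the encoding.
cookie = "# -*- coding: utf-16 -*-\n"

# In an ASCII-superset encoding (KOI8-R here) the comment survives
# byte-for-byte, so an ASCII scan finds it.
assert cookie.encode("koi8_r").startswith(b"# -*- coding:")

# In UTF-16 it does not: a BOM plus interleaved NUL bytes mean an
# ASCII scan never sees the cookie at all.
utf16 = cookie.encode("utf-16")
assert not utf16.startswith(b"# -*- coding:")
assert b"\x00" in utf16
```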
> I thought that people were regularly putting arbitrary text into
> ordinary strings, and that the whole purpose of this PEP was to
> extend that practice to Unicode.
>
> Are you going to deprecate the practice of putting KOI8-R into
> ordinary strings?  This means that Cyrillic users have to stop
> doing that, change the string to Unicode, and apply codecs on I/O.
> They aren't going to bother in phase 1, and will have a rude
> surprise in phase 2.  That's human nature, of course, but I don't
> see how it serves Python to risk it.

I wasn't clear on what you meant (see below).  I think this will
actually work.  Suppose someone uses KOI8-R.  Presumably they have an
editor that reads, writes and displays KOI8-R, and their default
interpretation of Python's stdout will also assume KOI8-R.  Thus, if
their program contains

  k = "...some KOI8-R string..."
  print k

it will print what they want.  If they write this:

  u = unicode(k, "koi8-r")

it will also do what they want.  Currently, if they write

  u = u"...some KOI8-R string..."

it won't work, but with the PEP, in phase 1, it will do the right
thing as long as they add a KOI8-R cookie to the file.

The treatment of the 8-bit string assigned to k will not change in
phase 1.  But the treatment of k under phase 2 will be, um,
interesting, and I'm not sure what it should do!!!  Since in phase 2
the entire file will be decoded from KOI8-R to Unicode before it's
parsed, maybe the best thing would be to encode 8-bit string literals
back using KOI8-R (in general, the encoding given in the encoding
cookie).

*** MAL, can you think about this? ***

> >> e. This causes problems for UTF-8 transition, since people will
> >> want to put arbitrary byte strings in a raw string.
>
> Guido> I'm not sure I understand.  What do you call a raw string?
> Guido> Do you mean an r"" literal?  Why would people want to use
> Guido> that for arbitrary binary data?  Arbitrary binary data
> Guido> should *always* be encoded using \xDD hex or \OOO octal
> Guido> escapes.
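(For what it's worth, the KOI8-R round trip described above can be
checked; this is a sketch in modern-Python terms -- bytes standing in
for the 2.x 8-bit str, str for unicode -- and the Cyrillic sample
text is made up for illustration:)

```python
# Sketch of the KOI8-R scenario above, in modern-Python terms.
text = "привет"                  # what the author intends to write
koi8 = text.encode("koi8_r")     # the bytes as typed in a KOI8-R file

# The explicit-conversion case -- unicode(k, "koi8-r") in 2.x terms --
# recovers the intended text.
assert koi8.decode("koi8_r") == text

# And the suggested phase-2 treatment: decode the whole source with
# the cookie's encoding, then encode 8-bit literals back with the
# same codec; the literal's bytes round-trip unchanged.
assert koi8.decode("koi8_r").encode("koi8_r") == koi8

# Decoding with the wrong assumption (say, Latin-1) silently yields
# mojibake instead -- the failure the encoding cookie is meant to
# prevent.
assert koi8.decode("latin-1") != text
```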
> raw -> non-Unicode here.  Incorrect usage, my apologies.
> "Arbitrary" was the wrong word too; I mean non-UTF-8, eg
> iso-8859-1 0xFF.  I would have no problem with requiring people to
> use escapes to write non-English strings.  But the whole point of
> this PEP is to allow people to write those in their native
> encodings (for Unicode strings).  People are going to continue to
> squirt implicitly coded octet-strings at their terminals (which
> just happen to have an appropriate font installed<wink>) and
> expect it to work.

How about the solution I suggested above?  Basically, the encoding
used for 8-bit string literals had better match the encoding cookie
used for the source file, otherwise all bets are off.  But this
should match common usage -- all people have to do is add the
encoding cookie to their file.

> AFAICT this interpretation of the PEP saves no pain, simply
> postpones it.  Worse, people who don't understand it fully are
> going to believe it sanctions arbitrary encodings in string
> literals.

IMO, only one arbitrary encoding will be used per user -- his/her
favorite default -- and that's what they'll put in their encoding
cookie once we train them properly.

> I don't see how you can avoid widespread misunderstanding of that
> sort unless you have the parser refuse to execute the program --
> it may actually increase the pain when phase 2 starts.
>
> Guido> Sounds like a YAGNI to me.
>
> Could be.  I'm sorry I can't be less fuzzy about the specific
> points.  But then, that's the whole problem, really -- we're
> trying to serve natural language usage which is inherently fuzzy.
>
> I see lots of potential problems in interpretation of this PEP by
> the people it's intended to serve: those who are attached to some
> native encoding.  Better to raise each now, and have the scorn it
> deserves heaped high, than to say "we coulda guessed this would
> happen" later.
> If you think it's getting too abstract to be useful, I'll be quiet
> until I've got something more concrete.  I'm hoping that the
> discussion seems useful despite the fuzz.

Same here.  If you still think it's necessary, maybe you can try to
express exactly when you would want a program to be declared illegal
because of expected problems in phase 2?

--Guido van Rossum (home page: http://www.python.org/~guido/)