Glenn Linderman wrote:
> On approximately 4/28/2009 11:55 AM, came the following characters from
> the keyboard of MRAB:
>> I've been thinking of "python-escape" only in terms of UTF-8, the only
>> encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are
>> decodable.
>
> UTF-8 is only mentioned in the sense of having special handling for
> re-encoding; all the other locales/encodings are implicit. But I also
> went down that path to some extent.
>
>> But if you're talking about using it with other encodings, e.g.
>> shift-jisx0213, then I'd suggest the following:
>>
>> 1. Bytes 0x00 to 0xFF which can't normally be decoded are decoded to
>> half surrogates U+DC00 to U+DCFF.
>
> This makes 256 different escape codes.
>
Speaking personally, I won't call them 'escape codes'. I'd use the term
'escape code' to mean a character that changes the interpretation of the
next character(s).

>> 2. Bytes which would have decoded to half surrogates U+DC00 to U+DCFF
>> are treated as though they are undecodable bytes.
>
> This provides escaping for the 256 different escape codes, which is
> lacking from the PEP.
>
>> 3. Half surrogates U+DC00 to U+DCFF which can be produced by decoding
>> are encoded to bytes 0x00 to 0xFF.
>
> This reverses the escaping.
>
>> 4. Codepoints, including half surrogates U+DC00 to U+DCFF, which can't
>> be produced by decoding raise an exception.
>
> This is confusing. Did you mean "excluding" instead of "including"?
>
Perhaps I should've said "Any codepoint which can't be produced by
decoding should raise an exception".

For example, decoding with UTF-8b will never produce U+DC00, therefore
attempting to encode U+DC00 should raise an exception and not produce 0x00.

>> I think I've covered all the possibilities. :-)
>
> You might have. Seems like there could be a simpler scheme, though...
>
> 1. Define an escape codepoint. It could be U+003F or U+DC00 or U+F817
> or pretty much any defined Unicode codepoint outside the range U+0100 to
> U+01FF (see rule 3 for why). Only one escape codepoint is needed; this
> is easier for humans to comprehend.
>
> 2. When the escape codepoint is decoded from the byte stream for a bytes
> interface or found in a str on the str interface, double it.
>
> 3. When an undecodable byte 0xPQ is found, decode to the escape
> codepoint, followed by codepoint U+01PQ, where P and Q are hex digits.
>
> 4. When encoding, a sequence of two escape codepoints would be encoded
> as one escape codepoint, and a sequence of the escape codepoint followed
> by codepoint U+01PQ would be encoded as byte 0xPQ. Escape codepoints
> not followed by the escape codepoint, or by a codepoint in the range
> U+0100 to U+01FF would raise an exception.
>
> 5. Provide functions that will perform the same decoding and encoding as
> would be done by the system calls, for both bytes and str interfaces.
>
> This differs from my previous proposal in three ways:
>
> A. Doesn't put a marker at the beginning of the string (which I said
> wasn't necessary even then).
>
> B. Allows for a choice of escape codepoint, the previous proposal
> suggested a specific one. But the final solution will only have a
> single one, not a user choice, but an implementation choice.
>
> C. Uses the range U+0100 to U+01FF for the escape codes, rather than
> U+0000 to U+00FF. This avoids introducing the NULL character and escape
> characters into the decoded str representation, yet still uses
> characters for which glyphs are commonly available, are non-combining,
> and are easily distinguishable one from another.
>
> Rationale:
>
> The use of codepoints with visible glyphs makes the escaped string
> friendlier to display systems, and to people. I still recommend using
> U+003F as the escape codepoint, but certainly one with a typically
> visible glyph available. This avoids what I consider to be an annoyance
> with the PEP, that the codepoints used are not ones that are easily
> displayed, so undecodable names could easily result in long strings of
> indistinguishable substitution characters.
>
Perhaps the escape character should be U+005C. ;-)

> It, like MRAB's proposal, also avoids data puns, which is a major
> problem with the PEP. I consider this proposal to be easier to
> understand than MRAB's proposal, or the PEP, because of the single
> escape codepoint and the use of visible characters.
>
> This proposal, like my initial one, also decodes and encodes (just the
> escape codes) values on the str interfaces. This is necessary to avoid
> data puns on systems that provide both types of interfaces.
>
> This proposal could be used for programs that use str values, and easily
> migrates to a solution that provides an object that provides an
> abstraction for system interfaces that have two forms.
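
For illustration only, here is a minimal sketch of rules 1 to 4 of the single-escape-codepoint scheme quoted above. It is not code from the thread or from the PEP. The names `ESC`, `escape_decode` and `escape_encode` are made up for this example, U+003F is assumed as the escape codepoint, and the byte encoding is assumed to be UTF-8. The sketch leans on Python's `surrogateescape` error handler (the handler that later shipped for PEP 383) purely as a convenient way to locate undecodable bytes, then rewrites its output into the proposed form.

```python
ESC = "?"  # assumed escape codepoint (U+003F); any codepoint outside U+0100..U+01FF would do


def escape_decode(data: bytes) -> str:
    """Decode UTF-8 bytes, escaping undecodable bytes per rules 2 and 3."""
    # surrogateescape maps each undecodable byte 0x80..0xFF to U+DC80..U+DCFF,
    # which lets us find them one by one in the decoded text.
    raw = data.decode("utf-8", "surrogateescape")
    out = []
    for ch in raw:
        cp = ord(ch)
        if ch == ESC:
            # Rule 2: a genuinely decoded escape codepoint is doubled.
            out.append(ESC + ESC)
        elif 0xDC80 <= cp <= 0xDCFF:
            # Rule 3: undecodable byte 0xPQ becomes ESC followed by U+01PQ.
            out.append(ESC + chr(0x100 + (cp - 0xDC00)))
        else:
            out.append(ch)
    return "".join(out)


def escape_encode(text: str) -> bytes:
    """Reverse escape_decode per rule 4; raise ValueError on a bad escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != ESC:
            out += ch.encode("utf-8")
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("escape codepoint at end of string")
        nxt = text[i + 1]
        if nxt == ESC:
            # Two escape codepoints encode as one.
            out += ESC.encode("utf-8")
        elif 0x100 <= ord(nxt) <= 0x1FF:
            # ESC + U+01PQ encodes as the single byte 0xPQ.
            out.append(ord(nxt) - 0x100)
        else:
            raise ValueError(f"invalid escape sequence at index {i}")
        i += 2
    return bytes(out)


raw = b"caf\xe9?.txt"          # Latin-1 byte 0xE9 is not valid UTF-8 here
name = escape_decode(raw)      # "caf" + "?" + U+01E9 + "??" + ".txt"
assert escape_encode(name) == raw
```

Because every decoded string contains the escape codepoint only in the pairs produced by rules 2 and 3, a str such as "a?b" that was never produced by decoding is rejected on encoding rather than silently mapped to bytes, which is the "no data puns" property argued for above.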