Stephen J. Turnbull wrote:
> "Martin v. Löwis" writes:
>
> > I've updated the PEP accordingly.
>
> I have three substantive comments. First, although consequences for
> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as
> I can see this PEP could apply to Python 2 as well. I don't think
> it's intended that way. Either way, I think you should clarify that
> point.
>
> Second, I suggest "surrogate-replace" as the name of the error handler
> rather than "utf8b". (Elsewhere I've suggested others, but I think
> this is the best of the bunch.)
>
+1

> Third, it is not clear to me why non-decodable ASCII should be an
> error. There are plenty of low surrogates for the purpose. Is there
> another technical reason? Stupid or not, Shift-JIS- and Big5-encoded
> file systems are quite common in Asia still (including non-rewritable
> media). I think surrogate-replacement of ASCII should at least be an
> option.
>
> I don't think "people shouldn't be using non-ASCII-compatible
> encodings for locale encodings" is a sufficient rationale for a hard
> error here. I mean, of course they *should* be using UTF-8. Maybe
> Python 3.1 should just go ahead and error on any other encoding on
> POSIX platforms? <wink>
>
I don't see why the error handler couldn't in principle be used with
encodings other than UTF-8, although in that case all of the low
surrogates should be open to use.

> I have a number of nitpicking comments and technical clarifications on
> the PEP. Rationale is in footnotes. There were also a few typos I
> noticed.
>
> 1. There is no such thing as a "half-surrogate" in Unicode. "Lone
>    surrogate" is clear enough. Or for somewhat fancier English,
>    "isolated surrogate" or "non-syntactic surrogate". To emphasize
>    that Python codecs will only produce them in contexts where a
>    Unicode character or high surrogate (for UTF-16 Python) is
>    syntactically required, "isolated low surrogate" or "isolated
>    trailing surrogate" might be good.[1]
>
> 2. The specification should state, and the discussion emphasize, that
>    strings which were produced by surrogate replacement *must not* be
>    used in data interchange with systems that do not specifically
>    accept such strings, and that this is the responsibility of the
>    application.[2]
>
>    Rather than saying that "dealing with such conflicts is out of
>    scope of this PEP", I would say
>
>    """Dealing with such conflicts is the responsibility of the
>    application. Since this PEP's mechanism produces valid Unicode
>    where possible, and produces *invalid* code points only via the
>    error handler, one strategy is for the application to validate all
>    other sources of strings as Unicode conforming. There may be
>    other useful application-specific strategies, as well."""
>
> 3. In the discussion, the transition from the example of alternative
>    use of 'python-escape' to discussion of the error handler
>    interface extension is a bit abrupt. I suggest rewriting as:
>
>    """The extension to the encode error handler interface proposed by
>    this PEP is necessary to implement the 'utf8b' error handler,
>    because there are required byte sequences which cannot be
>    generated from replacement Unicode. However, the encode error
>    handler interface presently requires replacement Unicode to be
>    provided in lieu of the non-encodable Unicode from the source
>    string. Then it promptly encodes that replacement Unicode. In
>    some error handlers, such as the 'utf8b' proposed here, it is also
>    simpler and more efficient for the error handler to provide a
>    pre-encoded replacement byte string, rather than forcing it to
>    calculating Unicode from which the encoder would create the
>    desired bytes."""
>
> Typos (line references are to pep-0383.txt svn r72332):
>
> l. 86: "Byte-orientied" -> "Byte-oriented"
> l. 98, 118, 124, 127, 132, 136: "python-escape" -> "utf8b"
> l. 130: "provide" -> "provided"
> l. 134: "calculating" -> "calculate"
>
>
> Footnotes:
> [1] Unicode 5.0 uses the terms "high-half" and "low-half" at least
>     once, in section 16.6, but the context is such that I take it to
>     refer to "half of the surrogate area". Section 3.8 doesn't use
>     these, instead noting that "leading" and "trailing" are sometimes
>     used instead of "high" and "low". Better to avoid the word "half"
>     in PEP 383, I think.
>
"Leading" and "trailing" simply state the order, not the set ("high" or
"low"), so are not good terms to use.

> [2] Since this error handler is going to be the default for POSIX I/O,
>     of course people are going to mostly ignore that restriction. The
>     point is, passing such strings to systems that don't expect them
>     is a bug, and the PEP should make it clear that it's the app's
>     bug, not the other system's. On the other hand, using those
>     strings in a context of consenting adults (and I do mean
>     double-opt-in here) is perfectly acceptable. I'm specifically
>     thinking of use in the Tahoe protocol discussed by Zooko
>     O'Whielacronx; it may not be usable there for backward
>     compatibility reasons, but "Unicode conformance" is not an issue
>     in principle.
>
>     This does imply that programs that take advantage of the error
>     handler specified in this PEP are on their own if they accept data
>     from any sources that are not known to be Unicode-conforming.
>     OTOH, as far as I can see if other sources are known to be Unicode
>     conformant, it's reasonably (but not perfectly) safe to combine
>     them with strings from this PEP (and of course use either 'utf8b'
>     or 'strict', as appropriate, when passing data out of Python).
>
Should there be a function or method to check for conformance and lone
surrogates?
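
A rough sketch of what such a check might look like, not something the PEP specifies. The helper names find_lone_surrogates and is_interchange_safe are hypothetical. The example assumes Python 3.3 or later, where a str is a sequence of code points, and it uses 'surrogateescape', the name under which the handler eventually shipped in Python 3.1, in place of the 'utf8b' name discussed above.

```python
def find_lone_surrogates(text):
    """Return the indices of all surrogate code points in text.

    Assumption: on Python 3.3+ a str never stores UTF-16 pairs, so any
    code point in U+D800..U+DFFF found here is an unpaired surrogate.
    """
    return [i for i, ch in enumerate(text) if 0xD800 <= ord(ch) <= 0xDFFF]


def is_interchange_safe(text):
    """True if text holds no surrogates and can be passed to other systems."""
    return not find_lone_surrogates(text)


# Bytes that are not valid UTF-8 (a Latin-1 encoded e-acute).
raw = b"caf\xe9.txt"

# Decoding with the handler maps the undecodable byte 0xE9 to U+DCE9.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# A strict encode refuses the string, which is another way to detect it.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

Here find_lone_surrogates(name) reports the position of the escaped byte, and is_interchange_safe(name) is False, so an application could use it to refuse to hand the string to a system that has not opted in.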