Showing content from http://mail.python.org/pipermail/python-dev/attachments/20110826/928797bf/attachment.html below:
<br><div class="gmail_quote">On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <span dir="ltr"><<a href="mailto:guido@python.org">guido@python.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
<div class="im">On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <<a href="mailto:ijmorlan@uwaterloo.ca">ijmorlan@uwaterloo.ca</a>> wrote:<br>
> On Thu, 25 Aug 2011, Guido van Rossum wrote:<br>
><br>
>> I'm not sure what should happen with UTF-8 when it (in flagrant<br>
>> violation of the standard, I presume) contains two separately-encoded<br>
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8<br>
>> codec does on a wide build today should be good enough.</div></blockquote><div><br>Surrogates are used, and valid, only in UTF-16.<br>In UTF-8 and UTF-32 they are invalid, even when they form a pair (see <a href="http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf">http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf</a> ). Of course Python can and should be able to represent them internally regardless of the build type.<br>
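<br>As a rough illustration of that point (a sketch only, assuming Python 3.5 or later for bytes.hex(); the helper name encoded_forms and the sample character U+1F600 are just picked for the example), a code point above U+FFFF becomes a surrogate pair only in UTF-16, while UTF-8 and UTF-32 encode the code point directly:<br>

```python
def encoded_forms(ch):
    """Return the hex form of ch in the three UTF encodings (hypothetical helper)."""
    return {enc: ch.encode(enc).hex() for enc in ("utf-8", "utf-16-be", "utf-32-be")}


forms = encoded_forms("\U0001F600")
# UTF-16 splits the code point into the surrogate pair D83D DE00;
# UTF-8 and UTF-32 contain no surrogate code units at all.
for name, hexed in forms.items():
    print(name, hexed)
```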
<br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im"> >>Similarly for<br>
>> encoding to UTF-8 on a wide build if one managed to create a string<br>
>> containing a surrogate pair. Basically, I'm for a<br>
>> garbage-in-garbage-out approach (with separate library functions to<br>
>> detect garbage if the app is worried about it).<br>
><br>
> If it's called UTF-8, there is no decision to be taken as to decoder<br>
> behaviour - any byte sequence not permitted by the Unicode standard must<br>
> result in an error (although, of course, *how* the error is to be reported<br>
> could legitimately be the subject of endless discussion).</div></blockquote><div><br>What do you mean? We use the "strict" error handler by default and we can specify other handlers already.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im"> There are<br>
> security implications to violating the standard so this isn't just<br>
> legalistic purity.<br>
<br>
</div>You have a point. The security issues cannot be seen separate from all<br>
the other issues. The folks inside Google who care about Unicode often<br>
harp on this. So I stand corrected. I am fine with codecs treating<br>
code points or code point sequences that the Unicode standard doesn't<br>
like (e.g. lone surrogates) the same way as more severe errors in the<br>
encoded bytes (lots of byte sequences already aren't valid UTF-8).</blockquote><div><br>Codecs that use the official names should stick to the standards. For example s.encode('utf-32') should either produce a valid utf-32 byte string or raise an error if 's' contains invalid characters (e.g. surrogates).<br>
We can have other internal codecs that are based on the UTF-* encodings but allow the representation of lone surrogates, and even expose them if we want, but they should have a different name (even 'utf-*-something' should be ok; see <a href="http://bugs.python.org/issue12729#msg142053">http://bugs.python.org/issue12729#msg142053</a>, from "Unicode says you can't put surrogates or noncharacters in a UTF-anything stream." onward).<br>
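<br>A minimal sketch of the behaviour described above, assuming a recent Python 3 (as far as I know the UTF-16/32 encoders reject surrogates only from 3.4 on). The helper try_encode is hypothetical, and the existing "surrogatepass" error handler stands in here for a separately named, more permissive codec:<br>

```python
def try_encode(s, encoding, errors="strict"):
    """Return the encoded bytes, or None if the codec rejects the string."""
    try:
        return s.encode(encoding, errors)
    except UnicodeEncodeError:
        return None


lone = "\udc00"  # a lone low surrogate

# With the default "strict" handler the officially named codecs refuse it.
for enc in ("utf-8", "utf-16", "utf-32"):
    print(enc, try_encode(lone, enc))

# Opting in explicitly yields the non-standard three-byte form.
print(try_encode(lone, "utf-8", "surrogatepass"))
```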
 </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"> I<br>
just hope this doesn't require normal forms or other expensive<br>
operations; I hope it's limited to rejecting invalid use of surrogates<br>
or other values that are not valid code points (e.g. 0, or >= 2**21).<br></blockquote><div><br>I think there shouldn't be any normalization done automatically by the codecs.<br> </div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im"><br>
> Hmmm, doesn't look good:<br>
><br>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)<br>
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin<br>
> Type "help", "copyright", "credits" or "license" for more information.<br>
>>>><br>
>>>> '\xed\xb0\x80'.decode ('utf-8')<br>
><br>
> u'\udc00'<br>
>>>><br>
><br>
> Incorrect! Although this is a narrow build - I can't say what the wide<br>
> build would do.<br></div></blockquote><div><br>The UTF-8 codec used to follow RFC 2279 and was only recently updated to follow RFC 3629 (see <a href="http://bugs.python.org/issue8271#msg107074">http://bugs.python.org/issue8271#msg107074</a> ). On Python 2.x it still produces invalid UTF-8 because changing it would be backward incompatible. In Python 2, UTF-8 can be used to encode every code point from 0 to 10FFFF, and it always works. If we changed it now, it might start raising errors for an operation that never raised them before (see <a href="http://bugs.python.org/issue12729#msg142047">http://bugs.python.org/issue12729#msg142047</a> ).<br>
Luckily this is fixed in Python 3.x.<br>I think there are more codepoints/byte sequences that should be rejected while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2 (if applicable), so if you find mismatches with the Unicode standard and report an issue, feel free to assign it to me).<br>
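<br>For comparison with the Python 2.6 session quoted above, here is a small sketch of what the same bytes do on Python 3 (the helper try_decode is hypothetical, not part of any library): the strict decoder rejects them, and the surrogate only comes back if "surrogatepass" is requested explicitly.<br>

```python
def try_decode(data, errors="strict"):
    """Return the decoded text, or None if the UTF-8 codec rejects the bytes."""
    try:
        return data.decode("utf-8", errors)
    except UnicodeDecodeError:
        return None


data = b"\xed\xb0\x80"  # the byte sequence from the quoted session

print(try_decode(data))                          # rejected by the strict decoder
print(ascii(try_decode(data, "surrogatepass")))  # lone surrogate, only on request
```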
<br>Best Regards,<br>Ezio Melotti<br><br><br></div><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;"><div class="im">
><br>
> For reasons of practicality, it may be appropriate to provide easy access to<br>
> a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be<br>
> called UTF-8. Other variations may also find use if provided.<br>
><br>
> See UTF-8 RFC: <a href="http://www.ietf.org/rfc/rfc3629.txt" target="_blank">http://www.ietf.org/rfc/rfc3629.txt</a><br>
><br>
> And CESU-8 technical report: <a href="http://www.unicode.org/reports/tr26/" target="_blank">http://www.unicode.org/reports/tr26/</a><br>
<br>
</div>Thanks for the links! I also like the term "supplemental character" (a<br>
code point >= 2**16). And I note that they talk about characters where<br>
we've just agreed that we should say code points...<br>
<div class="im"><br>
--<br>
--Guido van Rossum (<a href="http://python.org/%7Eguido" target="_blank">python.org/~guido</a>)<br>
</div><div><div></div><div class="h5">_______________________________________________<br></div></div></blockquote></div>