Guido van Rossum wrote:
>>>Please do. Bumping MAGIC is a no-no between dot releases. But I
>>>don't understand why that is necessary?
>>
>>It would be necessary since marshal uses UTF-8 for storing
>>Unicode literals.
>
> Do you mean that in 2.2 it doesn't?

Marshal has used it since 1.6. The point is that the fix to the
lone surrogate problem resulted in a change of the UTF-8 codec
output. PYCs from unpatched and patched versions wouldn't
interoperate if they use lone surrogates in Unicode literals.

We usually bump the PYC magic in such a case, to avoid these
issues. Since that's not possible for a patch level release, we
have two choices:

1. leave things as they are

2. apply the fix and live with the consequences of having
   to regenerate PYCs by hand

Just to give an example of the problem:

Python 2.2:
-------------
u'\ud800'.encode('utf-8') == '\xa0\x80'
>>> unicode('\xa0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte
>>> unicode('\xed\xa0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding

Current CVS Python:
---------------------
u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
>>> unicode('\xed\xa0\x80', 'utf-8')
u'\ud800'

>>Even though it's highly unlikely that the problem cases are used in
>>Python Unicode literals, there's a tiny chance. Without the MAGIC
>>change this could result in PYC files failing to load.
>
> Ha. You may have missed the start of this thread, but the whole
> problem was that a PYC file *did* fail to load! (The .py file had a
> lone surrogate in it.) So I'm not sure this argument holds much
> water.

Interesting. I wouldn't have expected that.

> Can someone please explain what change would be necessary to what part
> of the code to prevent a lone surrogate in a string literal from
> creating a PYC file from blowing up?

One possibility would be to:

1. change the UTF-8 encoder in Python 2.2 to produce
   correct output

2. let the UTF-8 decoder in Python 2.2 accept the correct
   output *and* the malformed output

I am not sure whether 2. would introduce a security problem.
Perhaps there is a way to restrict the work-around so that we
don't run into UTF-8 encoding attack problems.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/
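
The two-step work-around proposed above can be illustrated with a rough
sketch. This is not the Python 2.2 codec (which lives in C) and not the
patch that was eventually applied; it is written for modern Python 3 and
only mimics the idea. The handler name "lenient-surrogates" and the helper
decode_lenient() are hypothetical names chosen for this example. The
"surrogatepass" error handler and codecs.register_error() are standard
library features.

```python
import codecs


def _lenient_surrogates(exc):
    """Decode error handler (hypothetical) that accepts lone surrogates.

    It recognises the correct three-byte form (ED A0..BF 80..BF) and the
    malformed two-byte form with the ED lead byte missing (A0..BF 80..BF),
    which is what the 2.2 example in the message shows.
    """
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    data, i = exc.object, exc.start
    remaining = len(data) - i
    if (remaining >= 3 and data[i] == 0xED
            and 0xA0 <= data[i + 1] <= 0xBF
            and 0x80 <= data[i + 2] <= 0xBF):
        hi, lo, consumed = data[i + 1], data[i + 2], 3
    elif (remaining >= 2
            and 0xA0 <= data[i] <= 0xBF
            and 0x80 <= data[i + 1] <= 0xBF):
        hi, lo, consumed = data[i], data[i + 1], 2
    else:
        raise exc
    # Rebuild the code point as if the ED lead byte had been present.
    code = 0xD000 | ((hi & 0x3F) << 6) | (lo & 0x3F)
    return chr(code), i + consumed


codecs.register_error("lenient-surrogates", _lenient_surrogates)


def decode_lenient(data):
    """Step 2: accept both the correct and the malformed encoding."""
    return data.decode("utf-8", "lenient-surrogates")


# Step 1: the encoder emits the correct three-byte form.
correct = "\ud800".encode("utf-8", "surrogatepass")
```

With this sketch, decode_lenient(b'\xed\xa0\x80') and
decode_lenient(b'\xa0\x80') both yield the lone surrogate. The two-byte
branch also shows why the security concern is real: a stray pair of
continuation bytes is silently reinterpreted, so a leniency like this
would presumably need to be confined to trusted input such as unmarshalling
PYC data.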