[Fredrik Lundh]
> Brett Cannon wrote:
>
> it's 2.30 am over here, so I'm not going to try to explain this myself,
> but some random googling brought up this page:
>
> http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html
>
> The code points U+D800 through U+DB7F are reserved as High Surrogates,
> and the code points U+DC00 through U+DFFF are reserved as Low Surrogates.
> Each code point in [the full 20-bit unicode character space] maps to a pair of
> 16-bit code points comprising a High Surrogate followed by a Low Surrogate.
> Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330,
> which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the
> 16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented
> by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the
> 32-bit encoding of Unicode (UTF-32), the same letter is represented by a
> single 32-bit value (U+10330).
>
> </F>
>

So with that explanation, here is the current rewrite:

"""
In Unicode, a surrogate pair is when you create the representation of a
character by using two values. So, for instance, UTF-32 can cover the
entire Unicode space (Unicode is 20 bits), but UTF-16 can't. To solve the
issue a character can be represented as a pair of UTF-16 values.

The problem in Python 2.2.1 is that when there is only a lone surrogate
(instead of there being a pair of values), the encoder for UTF-8 messes up
and leaves off a UTF-8 value. The following line is an example:

>>> u'\ud800'.encode('utf-8')
'\xa0\x80'      #In Python 2.2.1
'\xed\xa0\x80'  #In Python 2.3a0

Notice how in Python 2.3a0 the extra value is inserted so as to make the
representation a complete Unicode character instead of only encoding the
half of the surrogate pair that the encode was given.
"""

How is that?

-Brett
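
For reference, the surrogate-pair mapping described in the quoted page, and the three-byte UTF-8 form of a lone surrogate, can be sketched as below. This is only an illustration under stated assumptions: it is written for Python 3 rather than the 2.2.1/2.3a0 interpreters discussed in the thread, the helper names `to_surrogate_pair` and `encode_lone_surrogate` are made up for this sketch, and it relies on the standard `surrogatepass` error handler because Python 3 refuses to encode lone surrogates by default.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into UTF-16 surrogates.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


def encode_lone_surrogate(code_unit):
    """Encode a single surrogate as three UTF-8 bytes (Python 3 only).

    Assumption: 'surrogatepass' is used to permit encoding a lone
    surrogate, which the default 'strict' handler would reject.
    """
    return chr(code_unit).encode("utf-8", "surrogatepass")


if __name__ == "__main__":
    # Gothic letter AHSA, the example from the quoted page.
    high, low = to_surrogate_pair(0x10330)
    print(hex(high), hex(low))            # 0xd800 0xdf30
    print(encode_lone_surrogate(0xD800))  # b'\xed\xa0\x80'
```

The second result has the same three bytes as the Python 2.3a0 output shown in the rewrite, including the leading `\xed` byte that 2.2.1 dropped.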