Neil Hodgson <nhodgson@bigpond.net.au> wrote:
> I'm dropping in a bit late in this thread but can the current problem be
> summarised in an example as "how is 'literal' interpreted here"?
>
> s = aUnicodeStringFromSomewhere
> DoSomething(s + "<literal>")

nope.  the whole discussion centers around what happens
if you type:

    # example 1

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    DoSomething(s + u)

and

    # example 2

    u = aUnicodeStringFromSomewhere
    s = an8bitStringFromSomewhere

    if len(u) + len(s) == len(u + s):
        print "true"
    else:
        print "not true"

in Guido's design, the first example may or may not result in
an "UTF-8 decoding error: UTF-8 decoding error: unexpected code byte"
exception.  the second example may result in a similar error,
print "true", or print "not true", depending on the contents
of the 8-bit string.

(under the counter proposal, the first example will never
raise an exception, and the second will always print "true")

...

the string literal issue is a slightly different problem.

> The two options being that literal is either assumed to be encoded in
> Latin-1 or UTF-8. I can see some arguments for both sides.

better make that "two options", not "the two options" ;-)

a more flexible scheme would be to borrow the design from XML
(see http://www.w3.org/TR/1998/REC-xml-19980210).  for those
who haven't looked closer at XML, it basically treats the source
file as an encoded unicode character stream, and does all
processing on the decoded side.

replace "entity" with "script file" in the following excerpts, and you
get close:

    section 2.2:

    A parsed entity contains text, a sequence of characters,
    which may represent markup or character data.

    A character is an atomic unit of text as specified by
    ISO/IEC 10646.

    section 4.3.3:

    Each external parsed entity in an XML document may use
    a different encoding for its characters. All XML processors
    must be able to read entities in either UTF-8 or UTF-16.

    Entities encoded in UTF-16 must begin with the Byte
    Order Mark /.../ XML processors must be able to use this
    character to differentiate between UTF-8 and UTF-16
    encoded documents.

    Parsed entities which are stored in an encoding other
    than UTF-8 or UTF-16 must begin with a text declaration
    containing an encoding declaration.

(also see appendix F: Autodetection of Character Encodings)

I propose that we adopt a similar scheme for Python -- but not
in 1.6.  the current "dunno, so we just copy the characters" is
good enough for now...

</F>
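
the three possible outcomes of example 2 can be reproduced with a small sketch. note the assumptions: it is written for present-day Python 3 (not 1.6), a bytes object stands in for the 8-bit string, the decoding is done explicitly rather than by the interpreter, and the counter proposal is assumed to behave like a Latin-1 decode (one byte, one character). the helper names concat and lengths_agree are made up for illustration.

```python
# Sketch only: Python 3, with bytes standing in for the 1.6-era 8-bit string.
# concat() and lengths_agree() are hypothetical helpers, not part of any API.

def concat(s: bytes, u: str, encoding: str) -> str:
    """Decode the 8-bit string with the given encoding, then concatenate."""
    return s.decode(encoding) + u


def lengths_agree(s: bytes, u: str, encoding: str) -> bool:
    """Example 2: does len(u) + len(s) == len(s + u) hold?"""
    return len(u) + len(s) == len(concat(s, u, encoding))


u = "abc"
ascii_s = b"spam"                     # pure ASCII
utf8_s = "\u00e9".encode("utf-8")     # e-acute as two bytes, b"\xc3\xa9"
latin1_s = b"\xe9"                    # e-acute as one Latin-1 byte, invalid UTF-8

# Assumed counter proposal (Latin-1): every byte is one character, so the
# decode cannot fail and the lengths always agree.
for s in (ascii_s, utf8_s, latin1_s):
    assert lengths_agree(s, u, "latin-1")

# UTF-8 interpretation: the outcome depends on the contents of the 8-bit string.
print(lengths_agree(ascii_s, u, "utf-8"))   # True
print(lengths_agree(utf8_s, u, "utf-8"))    # False (2 bytes become 1 character)
try:
    lengths_agree(latin1_s, u, "utf-8")
except UnicodeDecodeError as exc:
    print("decoding error:", exc.reason)
```

the same three byte strings give three different results under UTF-8, and one consistent result under the byte-per-character reading.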