Fredrik Lundh wrote:
>
> M.-A. Lemburg wrote:
> > Fredrik Lundh wrote:
> > >
> > > M.-A. Lemburg wrote:
> > > > The current need for #pragmas is really very simple: to tell
> > > > the compiler which encoding to assume for the characters
> > > > in u"...strings..." (*not* "...8-bit strings...").
> > >
> > > why not?
> >
> > Because plain old 8-bit strings should work just as before,
> > that is, existing scripts only using 8-bit strings should not break.
>
> but they won't -- if you don't use an encoding directive, and
> don't use 8-bit characters in your string literals, everything
> works as before.
>
> (that's why the default is "none" and not "utf-8")
>
> if you use 8-bit characters in your source code and wish to
> add an encoding directive, you need to add the right encoding
> directive...

Fair enough, but this would render all the auto-coercion
code currently in 1.6 useless -- all string to Unicode
conversions would have to raise an exception.

> > > why keep on pretending that strings and strings are two
> > > different things?  it's an artificial distinction, and it only
> > > causes problems all over the place.
> >
> > Sure. The point is that we can't just drop the old 8-bit
> > strings... not until Py3K at least (and as Fred already
> > said, all standard editors will have native Unicode support
> > by then).
>
> I discussed that in my original "all characters are unicode
> characters" proposal.  in my proposal, the standard string
> type will have two roles: a string either contains unicode
> characters, or binary bytes.
>
> -- if it contains unicode characters, python guarantees that
> methods like strip, lower (etc), and regular expressions work
> as expected.
>
> -- if it contains binary data, you can still use indexing, slicing,
> find, split, etc.  but they then work on bytes, not on chars.
>
> it's still up to the programmer to keep track of what a certain
> string object is (a real string, a chunk of binary data, an
> encoded string, a jpeg image, etc).  if the programmer wants
> to convert between a unicode string and an external encoding
> to use a certain unicode encoding, she needs to spell it out.
> the codecs are never called "under the hood".
>
> (note that if you encode a unicode string into some other
> encoding, the result is a binary buffer.  operations like strip,
> lower et al. do *not* work on encoded strings).

Huh ? If the programmer already knows that a certain
string uses a certain encoding, then he can just as well
convert it to Unicode by hand using the right encoding
name.

The whole point we are talking about here is that when
having the implementation convert a string to Unicode all
by itself it needs to know which encoding to use. This is where
we have decided long ago that UTF-8 should be used.

The pragma discussion is about a totally different
issue: pragmas could make it possible for the programmer
to tell the *compiler* which encoding to use for literal
u"unicode" strings -- nothing more. Since "8-bit" strings
currently don't have an encoding attached to them we store
them as-is.

I don't want to get into designing a completely new
character container type here... this can all be done for Py3K,
but not now -- it breaks things at too many ends (even though
it would solve the issues with strings being used in different
contexts).

> > > -- we still need an encoding marker for ascii supersets (how about
> > > <?python encoding="utf-8" version="1.6"?> ;-).  however, it's up to
> > > the tokenizer to detect that one, not the parser.  the parser only
> > > sees unicode strings.
> >
> > Hmm, the tokenizer doesn't do any string -> object conversion.
> > That's a task done by the parser.
>
> "unicode string" meant Py_UNICODE*, not PyUnicodeObject.
>
> if the tokenizer does the actual conversion doesn't really matter;
> the point is that once the code has passed through the tokenizer,
> it's unicode.

The tokenizer would have to know which parts of the
input string to convert to Unicode and which not... plus there
are different encodings to be applied, e.g. UTF-8, Unicode-Escape,
Raw-Unicode-Escape, etc.

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
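
The exchange above turns on one fact: the same bytes yield different
characters under different encodings, so whoever does the conversion
(programmer, compiler, or tokenizer) has to be told which encoding applies.
A minimal sketch of that, written in modern Python 3 as an assumption (the
thread concerns Python 1.6, whose string types differ); the sample value
and variable names are made up for illustration:

```python
# Illustration only: modern Python 3, not the 1.6 implementation discussed.
# The sample bytes are hypothetical: the UTF-8 encoding of "caf" + U+00E9.
raw = b"caf\xc3\xa9"

# The same bytes decode to different text depending on the encoding named.
as_utf8 = raw.decode("utf-8")      # 4 characters, ends in U+00E9
as_latin1 = raw.decode("latin-1")  # 5 characters, ends in U+00C3 U+00A9

# Operations on the undecoded buffer count bytes, not characters.
byte_len = len(raw)
char_len = len(as_utf8)

# An escape-based codec is a further, distinct way to spell the same text.
escaped = as_utf8.encode("unicode_escape")
roundtrip = escaped.decode("unicode_escape")

print(byte_len, char_len, as_utf8 == as_latin1, roundtrip == as_utf8)
```

Here the conversion is always spelled out by name, which matches the "codecs
are never called under the hood" position; an implicit conversion would
instead have to pick one default, such as the UTF-8 choice mentioned above.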