> [GvR, on string.encoding]
> >Marc-Andre took this idea a bit further, but I think it's not
> >practical given the current implementation: there are too many places
> >where the C code would have to be changed in order to propagate the
> >string encoding information,

[JvR]
> I may miss something, but the encoding attr just travels with the string
> object, no? Like I said in my reply to MAL, I think it's undesirable to
> do *anything* with the encoding attr if not in combination with a
> unicode string.

But just propagating it affects every string op -- s+s, s*n, s[i], s[:],
s.strip(), s.split(), s.lower(), ...

> >and there are too many sources of strings
> >with unknown encodings to make it very useful.
>
> That's why the default encoding must be settable as well, as Fredrik
> suggested.

I'm open for debate about this. There's just something about a changeable
global default encoding that worries me -- like any global property, it
requires conventions and defensive programming to make things work in
larger programs. For example, a module that deals with Latin-1 strings
can't just set the default encoding to Latin-1: it might be imported by a
program that needs it to be UTF-8.

This model is currently used by the locale in C, where all locale
properties are global, and it doesn't work well. For example, Python
needs to go through a lot of hoops so that Python numeric literals use
"." for the decimal indicator even if the user's locale specifies "," --
we can't change Python to swap the meaning of "." and "," in all
contexts.

So I think that a changeable default encoding is of limited value. That's
different from being able to set the *source file* encoding -- this only
affects Unicode string literals.

> >Plus, it would slow down 8-bit string ops.
>
> Not if you ignore it most of the time, and just pass it along when
> concatenating.

And slicing, and indexing, and...

> >I have a better idea: rather than carrying around 8-bit strings with
> >an encoding, use Unicode literals in your source code.
>
> Explain that to newbies... My guess is that they will want simple 8-bit
> strings in their native encoding.

Dunno. If they are happy with their native 8-bit encoding, there's no
need for them to ever use Unicode objects in their program, so they
should be fine. 8-bit strings aren't ever interpreted or encoded except
when mixed with Unicode objects.

> >If the source
> >encoding is known, these will be converted using the appropriate
> >codec.
> >
> >If you object to having to write u"..." all the time, we could say
> >that "..." is a Unicode literal if it contains any characters with the
> >top bit on (of course the source file encoding would be used just like
> >for u"...").
>
> Only if "\377" would still yield an 8-bit string, for binary goop...

Correct.

--Guido van Rossum (home page: http://www.python.org/~guido/)
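
To see why "just passing it along" touches so much code, here is a
minimal sketch of a string type carrying an encoding attribute. The
EncodedStr class is hypothetical (nothing like it exists in Python), and
it uses modern Python 3 notation even though the thread concerns Python
2's 8-bit strings; the point it illustrates is that every method
returning a new string must be overridden, or the attribute is silently
dropped:

class EncodedStr(str):
    """Hypothetical string type that carries an encoding attribute."""
    def __new__(cls, value, encoding):
        self = super().__new__(cls, value)
        self.encoding = encoding
        return self

    # __add__ propagates the attribute; __mul__, __getitem__, strip,
    # split, lower, ... would each need the same treatment.
    def __add__(self, other):
        return EncodedStr(str(self) + str(other), self.encoding)

s = EncodedStr("spam", "latin-1")
print((s + "eggs").encoding)           # latin-1: propagated by __add__
print(hasattr(s.lower(), "encoding"))  # False: lower() returns a plain str
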
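The locale comparison is concrete: C locale settings are process-wide
global state, which is exactly why Python parses its own numeric literals
rather than deferring to the C library. A small sketch of the behavior
Guido alludes to (the locale name "de_DE.UTF-8" is an example and may not
be installed on a given system):

import locale

# Process-wide global state: this call affects every module in the
# program, not just the one that makes it.
locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")

print(float("3.14"))         # 3.14 -- literals always use ".", by design
print(locale.atof("3,14"))   # 3.14 -- locale-aware parsing is opt-in
# float("3,14") raises ValueError: the global locale is ignored here
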
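For the source-file encoding model, the mechanism Python later
standardized as PEP 263 is a fair illustration (shown in Python 2 syntax,
since the thread is about 8-bit strings): a declaration names the file's
encoding, u"..." literals are decoded through it, and plain "..." stays
raw bytes:

# -*- coding: latin-1 -*-
s = "\377"     # still an 8-bit string: binary goop is unaffected
u = u"café"    # decoded via the declared Latin-1 source encoding
print type(s), type(u)   # <type 'str'> <type 'unicode'>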