Hello, people.

I'm switching from ISO-8859-1 to UTF-8 in my locale, knowing it may take a while before everything gets fully adapted. Of course, I am prepared to do whatever it takes. On my side at least, the perception of a meaning is an evolving process. :-)

So, my goal here is to share some of the difficulties I see with the current setup of Python in a Unicode context, under the hypothesis that Python should ideally be designed to alleviate the pain of migration. I hope this is not out of place on the Python development list.

Converting a Python source file from ISO-8859-1 to UTF-8, back and forth at the charset level, is a snap within Vim, and I would like it to be (almost) a snap in the Python code as well. There is some amount of trickery that I could put in to achieve this, but too much trickery does not fit well with the usual Python elegance.

As Martin once put it, the ultimate goal is to convert data to Unicode as early as possible in a Python program, and back to the locale encoding as late as possible. While that is very OK with me, we should not lose sight of the fact that people might adopt different approaches.

One thing is that a Python module should have some way to know the encoding used in its source file, maybe some kind of `module.__coding__' next to `module.__file__', saving the coding effectively used while compilation was going on. When a Python module is compiled, per PEP 0263 as I understand it, strings are logically converted to UTF-8 before scanning, and the resulting str-strings (but not unicode-strings) are converted back to the original file coding. When later, at run time, such a string has to be converted back to Unicode, it would help if the programmer did not have to hardwire the encoding in the program, nor edit more than the `coding:' cookie at the beginning if s/he ever switches file charset. That same `module.__coding__' could also be used for other things, for example to decide at run time whether codec stream writers should be used or not.
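To make the idea concrete: PEP 0263's `coding:' cookie must appear on one of the first two lines of the source file, and the PEP gives the pattern it is matched with. A hypothetical `module.__coding__' could be populated by a helper along these lines (a sketch; no such attribute exists, and `detect_coding' is an invented name):

```python
import re

# PEP 0263 looks for a "coding:" (or "coding=") cookie on the first
# or second line of a source file; this is the pattern from the PEP.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

def detect_coding(source_bytes, default="ascii"):
    """Return the encoding declared in a source file, or a default."""
    for line in source_bytes.splitlines()[:2]:
        match = CODING_RE.search(line)
        if match:
            return match.group(1).decode("ascii")
    return default
```

With such a helper, `detect_coding(open(module.__file__, "rb").read())' would recover the very encoding the compiler used, so run-time decoding would not need a hardwired charset.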
Another solution would of course be to edit all strings, or at least those containing non-ASCII characters, to prepend a `u' and turn them into Unicode strings. This is what I intend to do in practice. However, all this editing is cumbersome, especially until it is definitive.

I wonder if some other cookie, next to the `coding:' cookie, could not be used to declare that all strings _in this module only_ should be interpreted as Unicode by default, without the need to resort to the `u' prefix all over. That would be weaker than the `-U' switch on a Python call, but likely much more convenient as well. As a corollary, maybe some `s' prefix could force the `str' type in a Unicodized module. Another way of saying it: an unadorned string would have `s' or `u' implied, depending on whether the Unicode cookie is missing or given at the start of the module.

I have the intuition, still unverified, but to be confirmed over time and maybe through discussions, that the above would ease the transition to Unicode, back and forth.

P.S. - Should I say and confess it: one thing I do not like much about Unicode is how proponents often perceive it, like a religion, with all the fanaticism that goes with it. Unicode should be seen and implemented as a choice, more than a life commitment :-). Right now, my feeling is that Python asks a bit too much of a programmer in terms of commitment, if we only consider the editing work required on sources to use it, or not.

-- 
François Pinard http://www.iro.umontreal.ca/~pinard
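[Editor's illustration, not part of the original message: the per-module Unicode default proposed above is close to what Python 2.6 later provided with a `__future__' import. A minimal sketch:]

```python
# With this __future__ import, every unadorned string literal *in
# this module only* becomes a Unicode string, no `u' prefix needed --
# much like the proposed cookie, but spelled as an import.  (Under
# Python 3 the import is accepted but redundant, since all string
# literals are already Unicode.)
from __future__ import unicode_literals

greeting = "François"  # a Unicode string, despite the absent u'' prefix
assert isinstance(greeting, type(u""))
```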