Hi!

[me]:
> > From my POV (using ISO Latin-1 all the time) it would be
> > "intuitive"(TM) to assume ISO Latin-1 when interpreting u'äöü' in a
> > Python source file so that (u'äöü' == 'äöü') == 1. This is what I see
> > on *my* screen, whether there is a 'u' in front of the string or not.

M.-A. Lemburg:
> u"äöü" is being interpreted as Latin-1. The problem is the
> string 'äöü' to the right: during coercion this string is
> being interpreted as UTF-8 and this causes the failure.
>
> You could say: ok, all my strings use Latin-1, but that would
> introduce other problems... esp. when you take different
> modules with different encoding assumptions and try to
> integrate them into an application.

Okay. This wouldn't occur here, but we have to deal with this possibility.

> > In dist/src/Misc/unicode.txt you wrote:
> >
> > > Note that you should provide some hint to the encoding you used to
> > > write your programs as pragma line in one of the first few comment lines
> > > of the source file (e.g. '# source file encoding: latin-1').

[me]:
> > The upcoming 1.6 documentation should probably clarify whether
> > the interpreter pays attention to "pragma"s or not.
> > This is otherwise misleading.
>
> This "pragma" is nothing more than a hint for the source code
> reader to switch his viewing encoding. The interpreter doesn't
> treat the file differently. In fact, Python source code is
> supposed to be 7-bit ASCII!

Sigh. In our company we use German as our master language, so we have
string literals containing ISO-8859-1 umlauts all over the place.
Okay, as long as we don't mix them with Unicode objects, this doesn't
hurt anybody.

What I would love to see is a well-defined way to tell the interpreter
to use 'latin-1' as the default encoding instead of 'UTF-8' when dealing
with string literals from our modules.
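
To make the coercion failure described above concrete, here is a minimal
sketch. It is written in modern Python 3 syntax, which is an assumption
on my part and not what 1.6 looks like: a bytes object stands in for the
8-bit string literal as it sits in a Latin-1 encoded source file, and
the explicit decode calls stand in for the implicit coercion.

```python
# Sketch only: Python 3 bytes stand in for a 1.6-era 8-bit string literal.
text = "\xe4\xf6\xfc"              # the characters a-umlaut, o-umlaut, u-umlaut
raw = text.encode("latin-1")       # the three bytes stored in a Latin-1 file

# Interpreting the bytes as Latin-1 gives back the intended text.
as_latin1 = raw.decode("latin-1")

# Interpreting the same bytes as UTF-8 fails: 0xE4 announces a multi-byte
# sequence, but 0xF6 is not a valid continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1 == text, utf8_ok)
```

The comparison only succeeds when both sides are read with the same
encoding assumption, which is the point of the request for a Latin-1
default.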

The tokenizer in Python 1.6 already contains smart logic to get the
size of TABs right (pasting from tokenizer.c):

    /* Skip comment, while looking for tab-setting magic */
    if (c == '#') {
        static char *tabforms[] = {
            "tab-width:",       /* Emacs */
            ":tabstop=",        /* vim, full form */
            ":ts=",             /* vim, abbreviated form */
            "set tabsize=",     /* will vi never die? */
        /* more templates can be added here to support other editors */
        };
    ..

It wouldn't be too hard to add something there to recognize other
"pragma" comments, for example:

    #content-transfer-encoding: iso-8859-1

But what to do with it? Maybe add a default encoding to every string
object? Is this bloat? Just an idea.

Regards, Peter
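
A rough sketch of what recognizing such a "pragma" comment could look
like, written in Python rather than in the tokenizer's C. Everything
here is hypothetical: the function name find_source_encoding, the
five-line limit, and the UTF-8 fallback are assumptions for
illustration, and nothing like this exists in the 1.6 tokenizer.

```python
import re

# Hypothetical pattern modeled on the example comment in the message.
PRAGMA = re.compile(r"^#\s*content-transfer-encoding:\s*([-\w.]+)", re.IGNORECASE)


def find_source_encoding(lines, max_lines=5, default="utf-8"):
    """Return the encoding named in an early pragma comment, or the default."""
    # Only the first few lines are examined (the limit is an assumption).
    for line in lines[:max_lines]:
        match = PRAGMA.match(line.strip())
        if match:
            return match.group(1).lower()
    return default


source = [
    "#!/usr/bin/env python",
    "#content-transfer-encoding: iso-8859-1",
    "greeting = 'Hallo'",
]
print(find_source_encoding(source))
```

This only answers the "recognize it" half. What the interpreter should
then do with the detected encoding is the open question raised above.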