Tim, I'm not sure this needs to be on the list.

My major point, I guess, is that the byte vectors we tend to call strings in Python have no string-ness, as understood in the 21st century. There is no character set associated with them, which means that there is effectively no way to look at the "next character" in a string (you don't know how long a character is), no way to count the number of characters, etc. The documentation, particularly the language manual, is extremely confusing on this point: it classifies "string" and "Unicode" objects as the same sort of thing, and then doesn't document either clearly. "struct.pack", for instance, doesn't really return a string -- it returns a byte vector.

Unicode is really the only kind of *string* type that's supported, which is problematic, as it's not integrated with the file streams support. For instance, how do I write a function that opens a file containing text in some multi-byte format (which, we'll assume, I know the name of -- perhaps from a content-type field), and reads the first three characters of the text? Can't. That's because the "file" constructor doesn't take an encoding, and "read" and "readline" don't return Unicode objects. I could try reading some bytes, using unicode() to turn them into a string, and then seeing how many characters I got, but that's pretty imprecise. I go round and round the "codecs" module thinking that someone must have thought of this -- or maybe there's an optional argument to file() that makes it return real (Unicode) strings -- but no luck.

I find it hard to believe that I've dreamed up something that neither you nor (especially) Martin have thought of till now. But consider this idea. Any file that is not explicitly opened as binary (with the 'b' flag; and, by the way, why isn't the 'b' flag the default for file opening? It would save a lot of grief dealing with Windows) should be considered a text file, and it should have an associated "encoding" attribute (as file objects already do), which would also be a keyword parameter to the constructor. The default would be sys.getdefaultencoding(). The "size" parameter to the methods "read" and "readline" should refer to characters, not bytes, for text files. The return values from "next", "read" and "readline" would be Unicode objects for text files. Similarly, the methods "write" and "writelines" should, for text files, take Unicode objects and raise an exception if fed a "byte vector".

I'd go further. I'd introduce the notation

    v = b"abc"

which means that "v" has assigned to it an 8-bit "string" byte vector. Then, after a release or two, I'd make plain old "foo" mean what u"foo" means today, so that string literals are by default Unicode (modulo PEP 263).

Bill
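
For illustration only, here is a minimal sketch of the text-file behaviour proposed above. It assumes a Python 3 interpreter, where the built-in open() accepts an "encoding" keyword, where read(size) on a text file counts characters, and where b"..." is the byte-vector literal. The helper names first_chars and rejects_bytes, the sample text and the temporary file are all hypothetical, made up for this example.

```python
import os
import tempfile


def first_chars(path, encoding, count=3):
    """Return the first `count` characters (not bytes) of a text file."""
    # Text mode with an explicit encoding: read(count) counts characters.
    with open(path, "r", encoding=encoding) as stream:
        return stream.read(count)


def rejects_bytes(path, encoding):
    """Return True if writing a byte vector to a text file raises TypeError."""
    with open(path, "w", encoding=encoding) as stream:
        try:
            stream.write(b"abc")  # byte vector, not a string
        except TypeError:
            return True
    return False


# Hypothetical sample: three kanji, each two bytes long in Shift_JIS.
sample = "\u65e5\u672c\u8a9e text"
raw = sample.encode("shift_jis")

# Write the raw bytes in binary mode, then read them back as text.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    demo_path = tmp.name

head = first_chars(demo_path, "shift_jis")
bytes_refused = rejects_bytes(demo_path, "shift_jis")
os.remove(demo_path)
```

Here "head" holds three characters even though they occupy six bytes on disk, and the attempt to write b"abc" to the text file is refused rather than silently accepted.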