On Mon, 15 Nov 1999, M.-A. Lemburg wrote:
> Guido van Rossum wrote:
>...
> > t# refers to byte-encoded data. Multibyte encodings are explicitly
> > designed to be passed cleanly through processing steps that handle
> > single-byte character data, as long as they are 8-bit clean and don't
> > do too much processing.
>
> Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
> "8-bit clean" as you obviously did.

Hrm. That might be dangerous. Many of the functions that use "t#" assume
that each character is 8 bits long, i.e. that the returned length == the
number of characters. I'm not sure what the implications would be if you
interpret the semantics of "t#" as multi-byte characters.

>...
> > For example, take an encryption engine. While it is defined in terms
> > of byte streams, there's no requirement that the bytes represent
> > characters -- they could be the bytes of a GIF file, an MP3 file, or a
> > gzipped tar file. If we pass Unicode to an encryption engine, we want
> > Unicode to come out at the other end, not UTF-8. (If we had wanted to
> > encrypt UTF-8, we should have fed it UTF-8.)

Heck. I just want to quickly throw the data onto my disk. I'll write a
BOM, followed by the raw data. Done. It's even portable.

>...
> > Aha, I think there's a confusion about what "8-bit" means. For me, a
> > multibyte encoding like UTF-8 is still 8-bit. Am I alone in this?

Maybe. I don't see multi-byte characters as 8-bit (in the sense of the
"t" format).

> > (As far as I know, C uses char* to represent multibyte characters.)
> > Maybe we should disambiguate it more explicitly?

We can disambiguate with a new format character, or we can clarify the
semantics of "t" to mean single- *or* multi-byte characters. Again, I
think there may be trouble if the semantics of "t" are defined to allow
multibyte characters.

> There should be some definition for the two markers and the
> ideas behind them in the API guide, I guess.

Certainly.

[ man, I'm bad... I've got doc updates there and for the buffer stuff :-( ]

> > > Hmm, I would strongly object to making "s#" return the internal
> > > format. file.write() would then default to writing UTF-16 data
> > > instead of UTF-8 data. This could result in strange errors
> > > due to the UTF-16 format being endian dependent.
> >
> > But this was the whole design. file.write() needs to be changed to
> > use s# when the file is open in binary mode and t# when the file is
> > open in text mode.

Interesting idea, but that presumes that "t" will be defined for the
Unicode object (i.e. that it implements the getcharbuffer type slot).
Because of the multi-byte problem, I don't think it will.

[ not to mention that I don't think the Unicode object should implicitly
do a UTF-8 conversion and hold a ref to the resulting string ]

>...
> I still don't feel very comfortable about the fact that all
> existing APIs using "s#" will suddenly receive UTF-16 data if
> being passed Unicode objects: this probably won't get us the
> "magical" Unicode integration we envision, since "t#" usage is not
> very widespread and character handling code will probably not
> work well with UTF-16 encoded strings.

I'm not sure that we should definitely go for "magical." Perl has magic
in it, and that is one of its worst faults. Go for clean and predictable,
and leave as much logic to the Python level as possible. The interpreter
should provide a minimum of functionality, rather than second-guessing
and trying to be neat and sneaky with its operation.
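
A rough sketch of the length mismatch above, in present-day Python
spelling (the encode() calls are modern stand-ins used only to show the
byte counts, not the C-level "s#"/"t#" machinery being discussed):

    # One Unicode character, viewed as raw internal data vs. encoded text.
    u = '\u20ac'

    raw_view = u.encode('utf-16-le')   # roughly the "s#"-style raw view
    text_view = u.encode('utf-8')      # roughly the "t#"-style text view

    # len(u) == 1, but len(raw_view) == 2 and len(text_view) == 3, so the
    # "returned length == number of characters" assumption breaks either way.
    print(len(u), len(raw_view), len(text_view))
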
>...
> > Because file.write() for a binary file, and other similar things
> > (e.g. the encryption engine example I mentioned above) must have
> > *some* way to get at the raw bits.
>
> What for ?

How about: "because I'm the application developer, and I say that I want
the raw bytes in the file."

> Any lossless encoding should do the trick... UTF-8
> is just as good as UTF-16 for binary files; plus it's more compact
> for ASCII data. I don't really see a need to get explicitly
> at the internal data representation because both encodings are
> in fact "internal" w/r to Unicode objects.
>
> The only argument I can come up with is that using UTF-16 for
> binary files could (possibly) eliminate the UTF-8 conversion step
> which is otherwise always needed.

The argument that I come up with is "don't tell me how to design my
storage format, and don't make Python force me into one."

If I want to write Unicode text to a file, the most natural thing to do is:

    open('file', 'w').write(u)

If you do a conversion on me, then I'm not writing Unicode. I've got to
go and do some nasty conversion which just monkeys up my program.

If I have a Unicode object, but I *want* to write UTF-8 to the file, then
the cleanest thing is:

    open('file', 'w').write(encode(u, 'utf-8'))

This is clear that I've got a Unicode object input, but I'm writing UTF-8.

I have a second argument, too: see my first argument. :-)

Really... this is kind of what Fredrik was trying to say: don't get in
the way of the application programmer. Give them tools, but avoid policy
and gimmicks and other "magic".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
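
A rough present-day sketch of the two write() spellings argued for above
(str.encode() and the encoding= argument to open() are modern stand-ins
for the encode(u, 'utf-8') helper mentioned in the thread):

    u = 'some Unicode text: \u20ac'

    # Explicit: the application chooses exactly which bytes hit the disk.
    with open('file.utf8', 'wb') as f:
        f.write(u.encode('utf-8'))

    # Implicit: a text-mode file converts on write(); convenient, but the
    # byte-level format is now policy chosen by the runtime, not the app.
    with open('file.txt', 'w', encoding='utf-8') as f:
        f.write(u)
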