Guido van Rossum wrote:
> > [Misunderstanding in the reasoning behind "t#" and "s#"]
>
> Thanks for not picking an argument. Multibyte encodings typically
> have ASCII as a subset (in such a way that an ASCII string is
> represented as itself in bytes). This is the characteristic that's
> needed in my view.
>
> > It was my understanding that "t#" refers to single byte character
> > data. That's where the above arguments were aiming at...
>
> t# refers to byte-encoded data. Multibyte encodings are explicitly
> designed to be passed cleanly through processing steps that handle
> single-byte character data, as long as they are 8-bit clean and don't
> do too much processing.

Ah, ok. I interpreted 8-bit to mean: 8 bits in length, not
"8-bit clean" as you obviously did.

> > Perhaps I'm missing something...
>
> The idea is that (1)/s# disallows any translation of the data, while
> (2)/t# requires translation of the data to an ASCII superset (possibly
> multibyte, such as UTF-8 or shift-JIS). (2)/t# assumes that the data
> contains text and that if the text consists of only ASCII characters
> they are represented as themselves. (1)/s# makes no such assumption.
>
> In terms of implementation, Unicode objects should translate
> themselves to the default encoding for t# (if possible), but they
> should make the native representation available for s#.
>
> For example, take an encryption engine. While it is defined in terms
> of byte streams, there's no requirement that the bytes represent
> characters -- they could be the bytes of a GIF file, an MP3 file, or a
> gzipped tar file. If we pass Unicode to an encryption engine, we want
> Unicode to come out at the other end, not UTF-8. (If we had wanted to
> encrypt UTF-8, we should have fed it UTF-8.)
> > > Note that the definition of the 's' format was left alone -- as
> > > before, it means you need an 8-bit text string not containing null
> > > bytes.
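[As an aside to the thread: the "ASCII as a subset" property Guido describes is easy to demonstrate with today's Python 3 codecs. A sketch, not the 1999 API under discussion:]

```python
# ASCII-superset encodings represent pure-ASCII text as itself, byte for byte.
text = "hello"
assert text.encode("utf-8") == b"hello"
assert text.encode("shift-jis") == b"hello"

# A non-ASCII character becomes a multibyte sequence, but the ASCII
# portions of the string are still unchanged -- which is why such data
# can pass through 8-bit-clean byte-oriented code.
mixed = "price: \u20ac5".encode("utf-8")
assert mixed.startswith(b"price: ")

# UTF-16 is *not* an ASCII superset: every code unit is at least 2 bytes.
assert "hello".encode("utf-16-le") == b"h\x00e\x00l\x00l\x00o\x00"
```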
> >
> > This definition should then be changed to "text string without
> > null bytes" dropping the 8-bit reference.
>
> Aha, I think there's a confusion about what "8-bit" means. For me, a
> multibyte encoding like UTF-8 is still 8-bit. Am I alone in this?
> (As far as I know, C uses char* to represent multibyte characters.)
> Maybe we should disambiguate it more explicitly?

There should be some definition for the two markers and the
ideas behind them in the API guide, I guess.

> > Hmm, I would strongly object to making "s#" return the internal
> > format. file.write() would then default to writing UTF-16 data
> > instead of UTF-8 data. This could result in strange errors
> > due to the UTF-16 format being endian dependent.
>
> But this was the whole design. file.write() needs to be changed to
> use s# when the file is open in binary mode and t# when the file is
> open in text mode.

Ok, that would make the situation a little clearer (even though
I expect the two different encodings to produce some FAQs).

I still don't feel very comfortable about the fact that all existing
APIs using "s#" will suddenly receive UTF-16 data if being passed
Unicode objects: this probably won't get us the "magical" Unicode
integration we envision, since "t#" usage is not very widespread
and character handling code will probably not work well with
UTF-16 encoded strings.

Anyway, we should probably try out both methods...

> > It would also break the symmetry between file.write(u) and
> > unicode(file.read()), since the default encoding is not used as
> > internal format for other reasons (see proposal).
>
> If the file is encoded using UTF-16 or UCS-2, you should open it in
> binary mode and use unicode(file.read(), 'utf-16'). (Or perhaps the
> app should read the first 2 bytes and check for a BOM and then decide
> to choose between 'utf-16-be' and 'utf-16-le'.)

Right, that's the idea (there is a note on this in the Standard
Codec section of the proposal).
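[Guido's BOM-sniffing suggestion can be sketched in modern Python as follows. The helper name `decode_utf16` and the big-endian fallback policy are mine, not from the thread; the `codecs` BOM constants are standard library:]

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Check the first two bytes for a BOM and pick the byte order.

    Falls back to big-endian (traditional network order) when no BOM
    is present -- that policy choice is an assumption of this sketch.
    """
    if data.startswith(codecs.BOM_UTF16_LE):       # b"\xff\xfe"
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):       # b"\xfe\xff"
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")

# A little-endian stream: BOM FF FE, then 'h' 'i' as LE code units.
assert decode_utf16(b"\xff\xfeh\x00i\x00") == "hi"
# The same text as a big-endian stream: BOM FE FF.
assert decode_utf16(b"\xfe\xff\x00h\x00i") == "hi"
```

(Python's plain `'utf-16'` codec performs this BOM check itself on decoding, so the helper is only needed when you want to control the no-BOM fallback.)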
> > > Any of the following choices is acceptable (from the point of view of
> > > not breaking the intended t# semantics; we can now start deciding
> > > which we like best):
> >
> > I think we have already agreed on using UTF-8 for the default
> > encoding. It has quite a few advantages. See
> >
> >	http://czyborra.com/utf/
> >
> > for a good overview of the pros and cons.
>
> Of course. I was just presenting the list as an argument that if
> we changed our mind about the default encoding, t# should follow the
> default encoding (and not pick an encoding by other means).

Ok.

> > > - utf-8
> > > - latin-1
> > > - ascii
> > > - shift-jis
> > > - lower byte of unicode ordinal
> > > - some user- or os-specified multibyte encoding
>
> > > As far as t# is concerned, for encodings that don't encode all of
> > > Unicode, untranslatable characters could be dealt with in any number
> > > of ways (raise an exception, ignore, replace with '?', make best
> > > effort, etc.).
> >
> > The usual Python way would be: raise an exception. This is what
> > the proposal defines for Codecs in case an encoding/decoding
> > mapping is not possible, BTW. (UTF-8 will always succeed on
> > output.)
>
> Did you read Andy Robinson's case study? He suggested that for
> certain encodings there may be other things you can do that are more
> user-friendly than raising an exception, depending on the application.
> I am proposing to leave this a detail of each specific translation.
> There may even be translations that do the same thing except they have
> a different behavior for untranslatable cases -- e.g. a strict version
> that raises an exception and a non-strict version that replaces bad
> characters with '?'. I think this is one of the powers of having an
> extensible set of encodings.

Agreed, the Codecs should decide for themselves what to do. I'll
add a note to the next version of the proposal.
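[The strict-versus-lenient split discussed here is exactly what later landed in Python as codec error handlers; a sketch in today's spelling:]

```python
# The strict variant raises on untranslatable characters...
try:
    "Gr\u00fc\u00dfe".encode("ascii")   # 'u-umlaut' and 'sharp s' have no ASCII mapping
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass

# ...while a non-strict variant substitutes '?' and carries on.
assert "Gr\u00fc\u00dfe".encode("ascii", errors="replace") == b"Gr??e"

# UTF-8 covers all of Unicode, so encoding to it "always succeeds on output".
assert "Gr\u00fc\u00dfe".encode("utf-8") == b"Gr\xc3\xbc\xc3\x9fe"
```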
> > > Given the current context, it should probably be the same as the
> > > default encoding -- i.e., utf-8. If we end up making the default
> > > user-settable, we'll have to decide what to do with untranslatable
> > > characters -- but that will probably be decided by the user too (it
> > > would be a property of a specific translation specification).
>
> > > In any case, I feel that t# could receive a multi-byte encoding,
> > > s# should receive raw binary data, and they should correspond to
> > > getcharbuffer and getreadbuffer, respectively.
> >
> > Why would you want to have "s#" return the raw binary data for
> > Unicode objects ?
>
> Because file.write() for a binary file, and other similar things
> (e.g. the encryption engine example I mentioned above) must have
> *some* way to get at the raw bits.

What for ?

Any lossless encoding should do the trick... UTF-8 is just as good
as UTF-16 for binary files; plus it's more compact for ASCII data.

I don't really see a need to get explicitly at the internal data
representation because both encodings are in fact "internal" w/r
to Unicode objects.

The only argument I can come up with is that using UTF-16 for
binary files could (possibly) eliminate the UTF-8 conversion step
which is otherwise always needed.

> > Note that it is not mentioned anywhere that
> > "s#" and "t#" do have to necessarily return different things
> > (binary being a superset of text). I'd opt for "s#" and "t#" both
> > returning UTF-8 data. This can be implemented by delegating the
> > buffer slots to the <defencstr> object (see below).
>
> This would defeat the whole purpose of introducing t#. We might as
> well drop t# then altogether if we adopt this.

Well... yes ;-)

> > > > Now Greg would chime in with the buffer interface and
> > > > argue that it should make the underlying internal
> > > > format accessible. This is a bad idea, IMHO, since you
> > > > shouldn't really have to know what the internal data format
> > > > is.
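[The compactness claim above is easy to verify; a small Python 3 sketch:]

```python
# UTF-8 stores pure-ASCII text in one byte per character...
ascii_text = "x" * 1000
assert len(ascii_text.encode("utf-8")) == 1000
# ...while UTF-16 needs two bytes per code unit (plain "utf-16"
# additionally prepends a 2-byte BOM).
assert len(ascii_text.encode("utf-16-le")) == 2000
assert len(ascii_text.encode("utf-16")) == 2002

# For non-ASCII text the gap narrows or reverses: a BMP CJK character
# costs 3 bytes in UTF-8 but only 2 in UTF-16.
assert len("\u6f22".encode("utf-8")) == 3
assert len("\u6f22".encode("utf-16-le")) == 2
```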
> > >
> > > This is for C code. Quite likely it *does* know what the internal
> > > data format is!
> >
> > C code can use the PyUnicode_* APIs to access the data. I
> > don't think that argument parsing is powerful enough to
> > provide the C code with enough information about the data
> > contents, e.g. it can only state the encoding length, not the
> > string length.
>
> Typically, all the C code does is pass multibyte encoded strings on to
> other library routines that know what to do to them, or simply give
> them back unchanged at a later time. It is essential to know the
> number of bytes, for memory allocation purposes. The number of
> characters is totally immaterial (and multibyte-handling code knows
> how to calculate the number of characters anyway).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                    46 days left
Business:                                 http://www.lemburg.com/
Python Pages:                             http://www.lemburg.com/python/