Guido van Rossum wrote:
>
> I think I have a reasonable grasp of the issues here, even though I
> still haven't read about 100 msgs in this thread.  Note that t# and
> the charbuffer addition to the buffer API were added by Greg Stein
> with my support; I'll attempt to reconstruct our thinking at the
> time...
>
> [MAL]
> > Let me summarize a bit on the general ideas behind "s", "s#"
> > and the extra buffer:
>
> I think you left out t#.

On purpose -- according to my thinking. I see "t#" as an interface
to bf_getcharbuf, which I understand to be an 8-bit character
buffer... UTF-8 is a multi-byte encoding: the data is still character
data, but a single character is not necessarily 8 bits long (up to
24 bits are used).

Anyway, I'm not really interested in having an argument about this.
If you say "t#" fits the purpose, then that's fine with me. Still, we
should clearly define that "t#" returns text data and "s#" binary
data. Encoding, bit length, etc. should explicitly remain undefined.

> > First, we have a general design question here: should old code
> > become Unicode compatible or not. As I recall, the original idea
> > about Unicode integration was to follow Perl's idea to have
> > scripts become Unicode aware by simply adding a 'use utf8;'.
>
> I've never heard of this idea before -- or am I taking it too
> literally?  It smells of a mode to me :-)  I'd rather live in a
> world where Unicode just works as long as you use u'...' literals
> or whatever convention we decide.
>
> > If this is still the case, then we'll have to come up with a
> > reasonable approach for integrating classical string based
> > APIs with the new type.
> >
> > Since UTF-8 is a standard (some would probably prefer UTF-7,5,
> > e.g. the Latin-1 folks) which has some very nice features (see
> > http://czyborra.com/utf/ ) and which is a true extension of ASCII,
> > this encoding seems best fit for the purpose.
>
> Yes, especially if we fix the default encoding as UTF-8.  (I'm
> expecting feedback from HP on this next week; hopefully, when I see
> the details, it'll be clear that we don't need a per-thread default
> encoding to solve their problems; that's quite a likely outcome. If
> not, we have a real-world argument for allowing a variable default
> encoding, without carnage.)

Fair enough :-)

> > However, one should not forget that UTF-8 is in fact a
> > variable length encoding of Unicode characters; that is, up to
> > 3 bytes form a *single* character. This is obviously not
> > compatible with definitions that explicitly state the data to be
> > using an 8-bit single character encoding; e.g. indexing in UTF-8
> > doesn't work like it does in Latin-1 text.
>
> Sure, but where in current Python are there such requirements?

It was my understanding that "t#" refers to single-byte character
data. That's what the above arguments were aiming at...

> > So if we are to do the integration, we'll have to choose
> > argument parser markers that allow for multi-byte characters.
> > "t#" does not fall into this category, "s#" certainly does,
> > "s" is arguable.
>
> I disagree.  I grepped through the source for s# and t#.  Here's a
> bit of background.  Before t# was introduced, s# was being used for
> two distinct purposes: (1) to get an 8-bit text string plus its
> length, in situations where the length was needed; (2) to get binary
> data (e.g. GIF data read from a file in "rb" mode).  Greg pointed
> out that if we ever introduced some form of Unicode support, these
> two had to be disambiguated.  We found that the majority of uses
> was for (2)!
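To make the (1)/(2) distinction concrete, here is a rough extension
module sketch showing the two parser markers side by side (the module
and function names are invented for illustration only):

    #include "Python.h"

    /* Case (1): text data -- "t#" goes through the getcharbuffer
       slot and is meant for 8-bit text plus length. */
    static PyObject *
    example_text(PyObject *self, PyObject *args)
    {
        char *text;
        int len;

        if (!PyArg_ParseTuple(args, "t#", &text, &len))
            return NULL;
        /* note: len counts bytes, not characters */
        return Py_BuildValue("i", len);
    }

    /* Case (2): binary data -- "s#" goes through the getreadbuffer
       slot, e.g. for GIF data read from a file opened in "rb" mode. */
    static PyObject *
    example_data(PyObject *self, PyObject *args)
    {
        char *data;
        int len;

        if (!PyArg_ParseTuple(args, "s#", &data, &len))
            return NULL;
        return Py_BuildValue("i", len);
    }

    static PyMethodDef example_methods[] = {
        {"text", example_text, METH_VARARGS},
        {"data", example_data, METH_VARARGS},
        {NULL, NULL}
    };

    void
    initexample()
    {
        Py_InitModule("example", example_methods);
    }

For a classic 8-bit string object both calls return the same pointer
and length; the two markers only start to differ once an object's
getcharbuffer and getreadbuffer slots return different data.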
> Therefore we decided to change the definition of s# to mean only
> (2), and introduced t# to mean (1).  Also, we introduced
> getcharbuffer corresponding to t#, while getreadbuffer was meant
> for s#.

I know it's too late now, but I can't really follow the arguments
here: in what way are (1) and (2) different from the implementation's
point of view? If "t#" is to return UTF-8, then <length of the
buffer> will not equal <text length>, so both parser markers return
essentially the same information. The only difference is on the
semantic side: (1) means "give me text data", while (2) does not
specify the data type.

Perhaps I'm missing something...

> Note that the definition of the 's' format was left alone -- as
> before, it means you need an 8-bit text string not containing null
> bytes.

This definition should then be changed to "text string without null
bytes", dropping the 8-bit reference.

> Our expectation was that a Unicode string passed to an s# situation
> would give a pointer to the internal format plus a byte count (not a
> character count!) while t# would get a pointer to some kind of 8-bit
> translation/encoding plus a byte count, with the explicit
> requirement that the 8-bit translation would have the same lifetime
> as the original unicode object.  We decided to leave it up to the
> next generation (i.e., Marc-Andre :-) to decide what kind of
> translation to use and what to do when there is no reasonable
> translation.

Hmm, I would strongly object to making "s#" return the internal
format. file.write() would then default to writing UTF-16 data
instead of UTF-8 data. This could result in strange errors due to the
UTF-16 format being endian dependent.

It would also break the symmetry between file.write(u) and
unicode(file.read()), since the default encoding is not used as the
internal format for other reasons (see the proposal).

> Any of the following choices is acceptable (from the point of view
> of not breaking the intended t# semantics; we can now start deciding
> which we like best):

I think we have already agreed on using UTF-8 for the default
encoding. It has quite a few advantages. See

	http://czyborra.com/utf/

for a good overview of the pros and cons.

> - utf-8
> - latin-1
> - ascii
> - shift-jis
> - lower byte of unicode ordinal
> - some user- or os-specified multibyte encoding
>
> As far as t# is concerned, for encodings that don't encode all of
> Unicode, untranslatable characters could be dealt with in any number
> of ways (raise an exception, ignore, replace with '?', make best
> effort, etc.).

The usual Python way would be: raise an exception. This is what the
proposal defines for codecs in case an encoding/decoding mapping is
not possible, BTW. (UTF-8 will always succeed on output.)

> Given the current context, it should probably be the same as the
> default encoding -- i.e., utf-8.  If we end up making the default
> user-settable, we'll have to decide what to do with untranslatable
> characters -- but that will probably be decided by the user too (it
> would be a property of a specific translation specification).
>
> In any case, I feel that t# could receive a multi-byte encoding,
> s# should receive raw binary data, and they should correspond to
> getcharbuffer and getreadbuffer, respectively.

Why would you want to have "s#" return the raw binary data for
Unicode objects?

Note that it is not mentioned anywhere that "s#" and "t#" have to
return different things (binary being a superset of text). I'd opt
for "s#" and "t#" both returning UTF-8 data. This can be implemented
by delegating the buffer slots to the <defencstr> object (see below).
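For illustration, the delegation could look roughly like this -- a
sketch only: the utf8str member and the PyUnicode_AsUTF8String()
helper are assumed here along the lines of the proposal, not
existing code:

    /* Sketch: lazily create the <defencstr> string object and hand
       out its buffer.  Both getreadbuffer (s#) and getcharbuffer (t#)
       would point at this one function, so both markers return the
       same UTF-8 data. */
    static int
    unicode_getreadbuf(PyUnicodeObject *self, int index, void **ptr)
    {
        if (index != 0) {
            PyErr_SetString(PyExc_SystemError,
                            "accessing non-existent unicode segment");
            return -1;
        }
        /* Created on request only; it lives as long as the Unicode
           object itself, which satisfies the lifetime requirement
           of "s#"/"t#". */
        if (self->utf8str == NULL) {
            self->utf8str = PyUnicode_AsUTF8String((PyObject *)self);
            if (self->utf8str == NULL)
                return -1;
        }
        *ptr = (void *)PyString_AS_STRING(self->utf8str);
        return PyString_GET_SIZE(self->utf8str);
    }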
> > Now Greg would chime in with the buffer interface and
> > argue that it should make the underlying internal
> > format accessible. This is a bad idea, IMHO, since you
> > shouldn't really have to know what the internal data format
> > is.
>
> This is for C code.  Quite likely it *does* know what the internal
> data format is!

C code can use the PyUnicode_* APIs to access the data. I don't think
argument parsing is powerful enough to provide the C code with enough
information about the data contents; e.g. it can only report the
encoded length in bytes, not the string length in characters.

> > Defining "s#" to return UTF-8 data does not only
> > make "s" and "s#" return the same data format (which should
> > always be the case, IMO),
>
> That was before t# was introduced.  No more, alas.  If you replace
> s# with t#, I agree with you completely.

Done :-)

> > but also hides the internal
> > format from the user and gives him a reliable cross-platform
> > data representation of Unicode data (note that UTF-8 doesn't
> > have the byte order problems of UTF-16).
> >
> > If you are still with me, let's look at what "s" and "s#"
>
> (and t#, which is more relevant here)
>
> > do: they return pointers into data areas which have to
> > be kept alive until the corresponding object dies.
> >
> > The only way to support this feature is by allocating
> > a buffer for just this purpose (on the fly and only if
> > needed, to prevent excessive memory load). The other
> > options -- adding new magic parser markers or switching
> > to more generic ones -- all have one downside: you need to
> > change existing code, which is in conflict with the idea
> > we started out with.
>
> Agreed.  I think this was our thinking when Greg & I introduced t#.
> My own preference would be to allocate a whole string object, not
> just a buffer; this could then also be used for the .encode() method
> using the default encoding.

Good point. I'll change <defencbuf> to <defencstr>, a Python string
object created on request.

> > So, again, the question is: do we want this magical
> > integration or not? Note that this is a design question,
> > not one of memory consumption...
>
> Yes, I want it.
>
> Note that this doesn't guarantee that all old extensions will work
> flawlessly when passed Unicode objects; but I think that it covers
> most cases where you could have a reasonable expectation that it
> works.
>
> (Hm, unfortunately many reasonable expectations seem to involve
> the current user's preferred encoding. :-( )

-- 
Marc-Andre Lemburg

______________________________________________________________________
Y2000: 47 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/