> On purpose -- according to my thinking. I see "t#" as an interface
> to bf_getcharbuf which I understand as an 8-bit character buffer...
> UTF-8 is a multibyte encoding. It still is character data, but
> not necessarily 8 bits in length (up to 24 bits are used).
>
> Anyway, I'm not really interested in having an argument about
> this. If you say "t#" fits the purpose, then that's fine with
> me. Still, we should clearly define that "t#" returns text data
> and "s#" binary data. Encoding, bit length, etc. should explicitly
> remain undefined.

Thanks for not picking an argument.

Multibyte encodings typically have ASCII as a subset (in such a way
that an ASCII string is represented as itself in bytes).  This is the
characteristic that's needed in my view.

> > > First, we have a general design question here: should old code
> > > become Unicode compatible or not. As I recall, the original idea
> > > about Unicode integration was to follow Perl's idea to have
> > > scripts become Unicode aware by simply adding a 'use utf8;'.
> >
> > I've never heard of this idea before -- or am I taking it too
> > literally?  It smells of a mode to me :-)  I'd rather live in a
> > world where Unicode just works as long as you use u'...' literals
> > or whatever convention we decide.
> >
> > > If this is still the case, then we'll have to come up with a
> > > reasonable approach for integrating classical string-based
> > > APIs with the new type.
> > >
> > > Since UTF-8 is a standard (some would probably prefer UTF-7,5,
> > > e.g. the Latin-1 folks) which has some very nice features (see
> > > http://czyborra.com/utf/ ) and which is a true extension of
> > > ASCII, this encoding seems the best fit for the purpose.
> >
> > Yes, especially if we fix the default encoding as UTF-8.  (I'm
> > expecting feedback from HP on this next week; hopefully when I see
> > the details, it'll be clear that we don't need a per-thread default
> > encoding to solve their problems; that's quite a likely outcome.
> > If not, we have a real-world argument for allowing a variable
> > default encoding, without carnage.)
>
> Fair enough :-)
>
> > > However, one should not forget that UTF-8 is in fact a
> > > variable-length encoding of Unicode characters, that is, up to
> > > 3 bytes form a *single* character. This is obviously not
> > > compatible with definitions that explicitly state data to be
> > > using an 8-bit single-character encoding, e.g. indexing in UTF-8
> > > doesn't work like it does in Latin-1 text.
> >
> > Sure, but where in current Python are there such requirements?
>
> It was my understanding that "t#" refers to single-byte character
> data. That's what the above arguments were aiming at...

t# refers to byte-encoded data.  Multibyte encodings are explicitly
designed to be passed cleanly through processing steps that handle
single-byte character data, as long as they are 8-bit clean and don't
do too much processing.
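To make the two properties used here concrete -- ASCII transparency
and byte-oriented (rather than character-oriented) indexing -- here is
a minimal sketch, assuming the proposed u'...' literals and .encode()
method:

    u = u'caf\u00e9'         # four characters, the last one non-ASCII
    s = u.encode('utf-8')    # five bytes: 'c', 'a', 'f', '\xc3', '\xa9'

    assert s[:3] == 'caf'    # the ASCII part is represented as itself,
                             # so 8-bit clean code passes it through
    assert len(s) == 5       # byte count, not character count:
                             # indexing s addresses bytes, not characters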
> > > So if we are to do the integration, we'll have to choose
> > > argument parser markers that allow for multibyte characters.
> > > "t#" does not fall into this category, "s#" certainly does,
> > > "s" is arguable.
> >
> > I disagree.  I grepped through the source for s# and t#.  Here's a
> > bit of background.  Before t# was introduced, s# was being used for
> > two distinct purposes: (1) to get an 8-bit text string plus its
> > length, in situations where the length was needed; (2) to get
> > binary data (e.g. GIF data read from a file in "rb" mode).  Greg
> > pointed out that if we ever introduced some form of Unicode
> > support, these two had to be disambiguated.  We found that the
> > majority of uses was for (2)!  Therefore we decided to change the
> > definition of s# to mean only (2), and introduced t# to mean (1).
> > Also, we introduced getcharbuffer corresponding to t#, while
> > getreadbuffer was meant for s#.
>
> I know it's too late now, but I can't really follow the arguments
> here: in what way are (1) and (2) different from the implementation's
> point of view?  If "t#" is to return UTF-8, then <length of the
> buffer> will not equal <text length>, so both parser markers return
> essentially the same information.  The only difference would be
> on the semantic side: (1) means: give me text data, while (2) does
> not specify the data type.
>
> Perhaps I'm missing something...

The idea is that (1)/s# disallows any translation of the data, while
(2)/t# requires translation of the data to an ASCII superset (possibly
multibyte, such as UTF-8 or shift-JIS).  (2)/t# assumes that the data
contains text and that if the text consists of only ASCII characters
they are represented as themselves.  (1)/s# makes no such assumption.

In terms of implementation, Unicode objects should translate
themselves to the default encoding for t# (if possible), but they
should make the native representation available for s#.

For example, take an encryption engine.  While it is defined in terms
of byte streams, there's no requirement that the bytes represent
characters -- they could be the bytes of a GIF file, an MP3 file, or a
gzipped tar file.  If we pass Unicode to an encryption engine, we want
Unicode to come out at the other end, not UTF-8.  (If we had wanted to
encrypt UTF-8, we should have fed it UTF-8.)

> > Note that the definition of the 's' format was left alone -- as
> > before, it means you need an 8-bit text string not containing null
> > bytes.
>
> This definition should then be changed to "text string without
> null bytes", dropping the 8-bit reference.

Aha, I think there's a confusion about what "8-bit" means.  For me, a
multibyte encoding like UTF-8 is still 8-bit.  Am I alone in this?
(As far as I know, C uses char* to represent multibyte characters.)
Maybe we should disambiguate it more explicitly?

> > Our expectation was that a Unicode string passed to an s# situation
> > would give a pointer to the internal format plus a byte count (not
> > a character count!) while t# would get a pointer to some kind of
> > 8-bit translation/encoding plus a byte count, with the explicit
> > requirement that the 8-bit translation would have the same lifetime
> > as the original unicode object.  We decided to leave it up to the
> > next generation (i.e., Marc-Andre :-) to decide what kind of
> > translation to use and what to do when there is no reasonable
> > translation.
>
> Hmm, I would strongly object to making "s#" return the internal
> format. file.write() would then default to writing UTF-16 data
> instead of UTF-8 data. This could result in strange errors
> due to the UTF-16 format being endian-dependent.

But this was the whole design.  file.write() needs to be changed to
use s# when the file is open in binary mode and t# when the file is
open in text mode.

> It would also break the symmetry between file.write(u) and
> unicode(file.read()), since the default encoding is not used as
> the internal format for other reasons (see proposal).

If the file is encoded using UTF-16 or UCS-2, you should open it in
binary mode and use unicode(file.read(), 'utf-16').  (Or perhaps the
app should read the first 2 bytes, check for a BOM, and then decide
to choose between 'utf-16-be' and 'utf-16-le'.)
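In sketch form -- hypothetical application code, assuming the proposed
unicode() constructor and the 'utf-16-le'/'utf-16-be' codec names:

    f = open('data.txt', 'rb')    # binary mode: s#, no text translation
    raw = f.read()
    if raw[:2] == '\xff\xfe':
        text = unicode(raw[2:], 'utf-16-le')   # little-endian BOM
    elif raw[:2] == '\xfe\xff':
        text = unicode(raw[2:], 'utf-16-be')   # big-endian BOM
    else:
        text = unicode(raw, 'utf-8')           # no BOM: assume the default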
> > Any of the following choices is acceptable (from the point of view
> > of not breaking the intended t# semantics; we can now start
> > deciding which we like best):
>
> I think we have already agreed on using UTF-8 for the default
> encoding. It has quite a few advantages. See
>
>     http://czyborra.com/utf/
>
> for a good overview of the pros and cons.

Of course.  I was just presenting the list as an argument that if we
changed our mind about the default encoding, t# should follow the
default encoding (and not pick an encoding by other means).

> > - utf-8
> > - latin-1
> > - ascii
> > - shift-jis
> > - lower byte of unicode ordinal
> > - some user- or os-specified multibyte encoding
> >
> > As far as t# is concerned, for encodings that don't encode all of
> > Unicode, untranslatable characters could be dealt with in any
> > number of ways (raise an exception, ignore, replace with '?', make
> > a best effort, etc.).
>
> The usual Python way would be: raise an exception. This is what
> the proposal defines for Codecs in case an encoding/decoding
> mapping is not possible, BTW. (UTF-8 will always succeed on
> output.)

Did you read Andy Robinson's case study?  He suggested that for
certain encodings there may be other things you can do that are more
user-friendly than raising an exception, depending on the application.
I am proposing to leave this a detail of each specific translation.
There may even be translations that do the same thing except that they
have a different behavior for untranslatable cases -- e.g. a strict
version that raises an exception and a non-strict version that
replaces bad characters with '?'.  I think this is one of the powers
of having an extensible set of encodings.
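For instance, a minimal sketch of the strict vs. non-strict pair,
assuming the proposed .encode() method (the 'replace' spelling for the
non-strict variant is an assumption here, not part of the proposal's
text):

    u = u'\u20ac'                   # a character Latin-1 cannot encode
    try:
        s = u.encode('latin-1')     # strict variant: raises an exception
    except UnicodeError:
        s = u.encode('latin-1', 'replace')  # non-strict: bad chars -> '?'
    assert s == '?'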
> > Given the current context, it should probably be the same as the
> > default encoding -- i.e., utf-8.  If we end up making the default
> > user-settable, we'll have to decide what to do with untranslatable
> > characters -- but that will probably be decided by the user too (it
> > would be a property of a specific translation specification).
> >
> > In any case, I feel that t# could receive a multibyte encoding,
> > s# should receive raw binary data, and they should correspond to
> > getcharbuffer and getreadbuffer, respectively.
>
> Why would you want to have "s#" return the raw binary data for
> Unicode objects?

Because file.write() for a binary file, and other similar things
(e.g. the encryption engine example I mentioned above), must have
*some* way to get at the raw bits.

> Note that it is not mentioned anywhere that
> "s#" and "t#" necessarily have to return different things
> (binary being a superset of text). I'd opt for "s#" and "t#" both
> returning UTF-8 data. This can be implemented by delegating the
> buffer slots to the <defencstr> object (see below).

This would defeat the whole purpose of introducing t#.  We might as
well drop t# altogether if we adopt this.

> > > Now Greg would chime in with the buffer interface and
> > > argue that it should make the underlying internal
> > > format accessible. This is a bad idea, IMHO, since you
> > > shouldn't really have to know what the internal data format
> > > is.
> >
> > This is for C code.  Quite likely it *does* know what the internal
> > data format is!
>
> C code can use the PyUnicode_* APIs to access the data. I
> don't think that argument parsing is powerful enough to
> provide the C code with enough information about the data
> contents, e.g. it can only state the encoding length, not the
> string length.

Typically, all the C code does is pass multibyte encoded strings on to
other library routines that know what to do with them, or simply give
them back unchanged at a later time.  It is essential to know the
number of bytes, for memory allocation purposes.  The number of
characters is totally immaterial (and multibyte-handling code knows
how to calculate the number of characters anyway).

> > > Defining "s#" to return UTF-8 data does not only
> > > make "s" and "s#" return the same data format (which should
> > > always be the case, IMO),
> >
> > That was before t# was introduced.  No more, alas.  If you replace
> > s# with t#, I agree with you completely.
>
> Done :-)
>
> > > but also hides the internal
> > > format from the user and gives him a reliable cross-platform
> > > data representation of Unicode data (note that UTF-8 doesn't
> > > have the byte order problems of UTF-16).
> > >
> > > If you are still with me, let's look at what "s" and "s#"
> >
> > (and t#, which is more relevant here)
> >
> > > do: they return pointers into data areas which have to
> > > be kept alive until the corresponding object dies.
> > >
> > > The only way to support this feature is by allocating
> > > a buffer for just this purpose (on the fly and only if
> > > needed, to prevent excessive memory load). The other
> > > options -- adding new magic parser markers or switching
> > > to more generic ones -- all have one downside: you need to
> > > change existing code, which is in conflict with the idea
> > > we started out with.
> >
> > Agreed.  I think this was our thinking when Greg & I introduced t#.
> > My own preference would be to allocate a whole string object, not
> > just a buffer; this could then also be used for the .encode()
> > method using the default encoding.
>
> Good point. I'll change <defencbuf> to <defencstr>, a Python
> string object created on request.
>
> > > So, again, the question is: do we want this magical
> > > integration or not? Note that this is a design question,
> > > not one of memory consumption...
> >
> > Yes, I want it.
> >
> > Note that this doesn't guarantee that all old extensions will work
> > flawlessly when passed Unicode objects; but I think that it covers
> > most cases where you could have a reasonable expectation that it
> > works.
> >
> > (Hm, unfortunately many reasonable expectations seem to involve
> > the current user's preferred encoding. :-( )
>
> --
> Marc-Andre Lemburg

--Guido van Rossum (home page: http://www.python.org/~guido/)