M.-A. Lemburg wrote:
>...
> I meant PyUnicode_* style APIs for dealing with all the aspects
> of Unicode objects -- much like the PyString_* APIs available.

Sure, these could be added as necessary. For raw access to the bytes, I
would refer people to the abstract buffer functions, tho.

> > Your abstract.c functions make it quite simple.
>
> BTW, do we need an extra set of those with buffer index or not ?
> Those would really be one-liners for the sake of hiding the
> type slots from applications.

It sounds like NumPy and PIL would need it, which makes the landscape
quite a bit different from the last time we discussed this (when we
didn't imagine anybody needing those).

>...
> > > Since fp.write() uses "s#" this would use the getreadbuffer
> > > slot in 1.5.2... I think what it *should* do is use the
> > > getcharbuffer slot instead (see my other post), since dumping
> > > the raw unicode data would lose too much information. Again,
> >
> > I very much disagree. To me, fp.write() is not about writing characters
> > to a stream. I think it makes much more sense as "writing bytes to a
> > stream" and the buffer interface fits that perfectly.
>
> This is perfectly ok, but shouldn't the behaviour of fp.write()
> mimic that of previous Python versions ? How does JPython
> write the data ?

fp.write() had no semantics for writing Unicode objects since they
didn't exist. Therefore, we are not breaking or changing any behavior.

> Inlined different subject:
> I think the internal semantics of "s#" using the getreadbuffer slot
> and "t#" the getcharbuffer slot should be switched; see my other post.

1) Too late
2) The use of "t#" ("text") for the getcharbuffer slot was decided by
   the Benevolent Dictator.
3) see (2)

> In previous Python versions "s#" had the semantics of string data
> with possibly embedded NULL bytes.
> Now it suddenly has the meaning
> of binary data and you can't simply change extensions to use the
> new "t#" because people are still using them with older Python
> versions.

Guido and I had a pretty long discussion on what the best approach here
was. I think we even pulled in Tim as a final arbiter, as I recall.

I believe "s#" remained getreadbuffer simply because it *also* meant
"give me the bytes of that object". If it changed to getcharbuffer, then
you could see exceptions in code that didn't raise exceptions
beforehand.

(more below)

> > There is no loss of data. You could argue that the byte order is lost,
> > but I think that is incorrect. The application defines the semantics:
> > the file might be defined as using host-order, or the application may be
> > writing a BOM at the head of the file.
>
> The problem here is that many applications were not written
> to handle these kinds of objects. Previously they could only
> handle strings, now they can suddenly handle any object
> having the buffer interface and then fail when the data
> gets read back in.

An application is a complete unit. How are you suddenly going to
manifest Unicode objects within that application? One way is if the
developer goes in and changes things; let them deal with the issues and
fallout of their change. The other is an external change, such as an
upgrade to the interpreter or a module. Again, (IMO) if you're
perturbing a system, then you are responsible for also correcting any
problems you introduce.

In any case, Guido's position was that things can easily switch over to
the "t#" interface to prevent the class of error where you pass a
Unicode string to a function that expects a standard string.

> > > such things should be handled by extra methods, e.g. fp.rawwrite().
> >
> > I believe this would be a needless complication of the interface.
>
> It would clarify things and make the interface 100% backward
> compatible again.

No. "s#" used to pull bytes from any buffer-capable object. Your
suggestion for "s#" to use the getcharbuffer could introduce exceptions
into currently-working code.

(this was probably Guido's prime motivation for the current meaning of
"t#"... I can dig up the mail thread if people need an authoritative
commentary on the decision that was made)

> > > Hmm, I guess the philosophy behind the interface is not
> > > really clear.
> >
> > I didn't design or implement it initially, but (as you may have guessed)
> > I am a proponent of its existence.
> >
> > > Binary data is fetched via getreadbuffer and then
> > > interpreted as character data... I always thought that the
> > > getcharbuffer should be used for such an interpretation.
> >
> > The former is bad behavior. That is why getcharbuffer was added (by me,
> > for 1.5.2). It was a preventative measure for the introduction of
> > Unicode strings. Using getreadbuffer for characters would break badly
> > given a Unicode string. Therefore, "clients" that want (8-bit)
> > characters from an object supporting the buffer interface should use
> > getcharbuffer. The Unicode object doesn't implement it, implying that it
> > cannot provide 8-bit characters. You can get the raw bytes thru
> > getreadbuffer.
>
> I agree 100%, but did you add the "t#" instead of having
> "s#" use the getcharbuffer interface ?

Yes. For reasons detailed above.

> E.g. my mxTextTools
> package uses "s#" on many APIs. Now someone could stick
> in a Unicode object and get pretty strange results without
> any notice about mxTextTools and Unicode being incompatible.

They could also stick in an array of integers. That supports the buffer
interface, meaning the "s#" in your code would extract the bytes from
it. In other words, people can already stick bogus stuff into your code.

This seems to be a moot argument.
> You could argue that I change to "t#", but that doesn't
> work since many people out there still use Python versions
> <1.5.2 and those didn't have "t#", so mxTextTools would then
> fail completely for them.

If support for the older versions is needed, then use an #ifdef to set
up the appropriate macro in some header. Use that throughout your code.

In any case: yes -- I would argue that you should absolutely be using
"t#".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
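
The point made earlier in this message, that an array of integers also satisfies a consumer which only asks for "the bytes", can be illustrated with a minimal sketch. It uses present-day Python 3's buffer protocol (`memoryview`) as an analogy, not the 1.5.2 C-level getreadbuffer slot itself, and the helper name `extract_bytes` is hypothetical, standing in for an "s#"-style consumer.

```python
from array import array


def extract_bytes(obj):
    """Hypothetical stand-in for an "s#"-style consumer.

    It accepts any object that exposes a buffer and returns the raw
    bytes, without checking whether those bytes are character data.
    """
    return memoryview(obj).tobytes()


# A byte string goes through unchanged, as the consumer expects.
text_bytes = extract_bytes(b"abc")

# An array of C ints is accepted just as readily; the result is the
# machine representation of the integers, not text.
values = array("i", [1, 2, 3])
int_bytes = extract_bytes(values)

print(len(text_bytes), len(int_bytes) == 3 * values.itemsize)
```

The consumer cannot tell the two cases apart, which is the class of error a character-specific request such as "t#" is meant to catch.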