Guido van Rossum wrote:
>
> > I'd like to query for the common opinion on an issue which I've
> > run into when trying to resynchronize unicode() and str() in terms
> > of what happens when you pass arbitrary objects to these constructors
> > which happen to implement tp_str (or __str__ for instances).
> >
> > Currently, str() will accept any object which supports the tp_str
> > interface and revert to tp_repr in case that slot should not
> > be available.
> >
> > unicode() supported strings, character buffers and instances
> > having a __str__ method before yesterday's checkins.
> >
> > Now the goal of the checkins was to make str() and unicode()
> > behave in a more compatible fashion. Both should accept
> > the same kinds of objects and raise exceptions for all others.
>
> Well, historically, str() has rarely raised exceptions, because
> there's a default implementation (same as for repr(), returning <FOO
> object at ADDRESS>). This is used when neither tp_repr nor tp_str is
> set. Note that PyObject_Str() never looks at __str__ -- this is done
> by the tp_str handler of instances (and now also by the tp_str handler
> of new-style classes). I see no reason to change this.

Me neither; what str() does not do (and unicode() does) is try
the char buffer interface before trying tp_str.

> The question then becomes, do we want unicode() to behave similarly?

Given that porting an application from strings to Unicode should
be easy, I'd say: yes.

> > The path I chose was to fix PyUnicode_FromEncodedObject()
> > to also accept tp_str compatible objects. This API is used
> > by the unicode_new() constructor (which is exposed as unicode()
> > in Python) to create a Unicode object from the input object.
> >
> > str() OTOH uses PyObject_Str() via string_new().
> >
> > Now there also is a PyObject_Unicode() API which tries to
> > mimic PyObject_Str(). However, it does not support the additional
> > encoding and errors arguments which the unicode() constructor
> > has.
> >
> > The problem which Guido raised about my checkins was that
> > the changes to PyUnicode_FromEncodedObject() are seen not
> > only in unicode(), but also all other instances where this
> > API is used.
> >
> > OTOH, PyUnicode_FromEncodedObject() is the most generic constructor
> > for Unicode objects there currently is in Python.
> >
> > So the questions are
> > - should I revert the change in PyUnicode_FromEncodedObject()
> >   and instead extend PyObject_Unicode() to support encodings ?
> > - should we make PyObject_Unicode() use
> >   PyUnicode_FromEncodedObject() instead of providing its
> >   own implementation ?
> >
> > The overall picture of all this auto-conversion stuff going
> > on in str() and unicode() is very confusing. Perhaps what
> > we really need is first to agree on a common understanding
> > of which auto-conversion should take place and then make
> > str() and unicode() support exactly the same interface ?!
> >
> > PS: Also see patch #446754 by Walter Dörwald:
> > http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470
>
> OK, let's take a step back.
>
> The str() function (now constructor) converts *anything* to a string;
> tp_str and tp_repr exist to allow objects to customize this. These
> slots, and the str() function, take no additional arguments. To
> invoke the equivalent of str() from C, you call PyObject_Str(). I see
> no reason to change this; we may want to make the Unicode situation
> as similar as possible.

Right.

> The unicode() function (now constructor) traditionally converted only
> 8-bit strings to Unicode strings,

Slightly incorrect: it converted 8-bit strings, objects compatible
with the char buffer interface and instances having a __str__ method
to Unicode.

To synchronize unicode() with str() we'd have to replace the __str__
lookup with a tp_str lookup (this will also allow things like
unicode(2) and unicode(instance_having__str__)) and maybe also add
the charbuf lookup to str() (this would make str() compatible with
memory mapped files and probably a few other char buffer aware
objects as well).

Note that in a discussion we had some time ago we decided that
__str__ should be allowed to return Unicode objects as well (instead
of defining a separate __unicode__ method/slot for this purpose).
str() will convert a Unicode return value to an 8-bit string using
the default encoding while unicode() takes the return value as-is.
This was done to simplify moving from strings to Unicode.

> with additional arguments to specify
> the encoding (and error handling preference). There is no tp_unicode
> slot, but for some reason there are at least three C APIs that could
> correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
> take a single object argument, and PyUnicode_FromEncodedObject() takes
> object, encoding, and error arguments.
>
> The first question is, do we want the unicode() constructor to be
> applicable in all cases where the str() constructor is?

Yes.

> I guess that
> we do, since we want to be able to print to streams that support
> Unicode. Unicode strings render themselves as Unicode characters to
> such a stream, and it's reasonable to allow other objects to also
> customize their rendition in Unicode.
>
> Now, what should be the signature of this conversion? If we print
> object X to a Unicode stream, should we invoke unicode(X), or
> unicode(X, encoding, error)? I believe it should be just unicode(X),
> since the encoding used by the stream shouldn't enter into the picture
> here: that's just used for converting Unicode characters written to
> the stream to some external format.
>
> How should an object be allowed to customize its Unicode rendition?
> We could add a tp_unicode slot to the type object, but there's no
> need: we can just look for a __unicode__ method and call it if it
> exists. The signature of __unicode__ should take no further
> arguments: unicode(X) should call X.__unicode__(). As a fallback, if
> the object doesn't have a __unicode__ attribute, PyObject_Str() should
> be called and the resulting string converted to Unicode using the
> default encoding.

I'd rather leave things as they are: __str__/tp_str are allowed
to return Unicode objects and if an object wishes to be rendered
as Unicode it can simply return a Unicode object through the
__str__/tp_str interface.

> Regarding the "long form" of unicode(), unicode(X, encoding, error), I
> see no reason to treat this with the same generality. This form
> should restrict X to something that supports the buffer API (IOW,
> 8-bit string objects and things that are treated the same as these in
> most situations).

Hmm, but this would prevent users from implementing string-like
objects (i.e. objects having a __str__ method to make them compatible
with str()).

> (Note that it already balks when X is a Unicode
> string.)

True -- since you normally cannot decode Unicode into Unicode using
some 8-bit character encoding. As a result encodings which convert
Unicode to Unicode (e.g. normalizations) cannot use this interface,
but since these are probably only rarely used, I think it's better
to prevent accidental usage of an 8-bit character codec on Unicode.

> So about those C APIs: I propose that PyObject_Unicode() correspond to
> the one-arg form of unicode(), taking any kind of object, and that
> PyUnicode_FromEncodedObject() correspond to the three-arg form.

Ok. I'll fix this once we've reached consensus on what to do
about str() and unicode().

> PyUnicode_FromObject() shouldn't really need to exist.

Note: PyUnicode_FromObject() was extended by
PyUnicode_FromEncodedObject() and only exists for backward
compatibility reasons.

> I don't see a
> reason for PyUnicode_From[Encoded]Object() to use the __unicode__
> customization -- it should just take the bytes provided by the object
> and decode them according to the given encoding. PyObject_Unicode(),
> on the other hand, should look for __unicode__ first and then
> PyObject_Str().
>
> I hope this helps.

Thanks for the summary.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/
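
For readers following the thread, the dispatch Guido proposes for the two forms of unicode() can be modelled roughly as below. This is only an illustrative sketch in present-day Python, not the CPython implementation: the function name to_unicode() and the Point class are hypothetical, the built-in str() stands in for PyObject_Str(), and memoryview() stands in for the char buffer interface.

```python
def to_unicode(obj, encoding=None, errors="strict"):
    """Hypothetical model of the proposed unicode() dispatch (not CPython code)."""
    if encoding is None:
        # One-argument form: accept any object, as str() does.
        # Look for a __unicode__ hook on the type first.
        hook = getattr(type(obj), "__unicode__", None)
        if hook is not None:
            return hook(obj)
        # Fallback: the ordinary string conversion (stand-in for PyObject_Str()).
        return str(obj)
    # Long form: refuse text input, then decode only raw bytes
    # exposed through the buffer interface.
    if isinstance(obj, str):
        raise TypeError("decoding Unicode is not supported")
    return bytes(memoryview(obj)).decode(encoding, errors)


class Point:
    """Hypothetical object that customizes its Unicode rendition."""

    def __unicode__(self):
        return "P(1, 2)"
```

Under this model to_unicode(Point()) uses the hook, to_unicode(2) falls back to the plain string conversion, and to_unicode(2, "utf-8") is rejected because an integer exposes no buffer. Marc-Andre's alternative (letting __str__ itself return Unicode) would drop the __unicode__ lookup and keep only the fallback branch.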