> I'd like to query for the common opinion on an issue which I've
> run into when trying to resynchronize unicode() and str() in terms
> of what happens when you pass arbitrary objects to these constructors
> which happen to implement tp_str (or __str__ for instances).
>
> Currently, str() will accept any object which supports the tp_str
> interface and revert to tp_repr in case that slot should not
> be available.
>
> unicode() supported strings, character buffers and instances
> having a __str__ method before yesterday's checkins.
>
> Now the goal of the checkins was to make str() and unicode()
> behave in a more compatible fashion. Both should accept
> the same kinds of objects and raise exceptions for all others.

Well, historically, str() has rarely raised exceptions, because there's
a default implementation (same as for repr()), returning
<FOO object at ADDRESS>. This is used when neither tp_repr nor tp_str
is set.

Note that PyObject_Str() never looks at __str__ -- this is done by the
tp_str handler of instances (and now also by the tp_str handler of
new-style classes). I see no reason to change this.

The question then becomes, do we want unicode() to behave similarly?

> The path I chose was to fix PyUnicode_FromEncodedObject()
> to also accept tp_str compatible objects. This API is used
> by the unicode_new() constructor (which is exposed as unicode()
> in Python) to create a Unicode object from the input object.
>
> str() OTOH uses PyObject_Str() via string_new().
>
> Now there also is a PyObject_Unicode() API which tries to
> mimic PyObject_Str(). However, it does not support the additional
> encoding and errors arguments which the unicode() constructor
> has.
>
> The problem which Guido raised about my checkins was that
> the changes to PyUnicode_FromEncodedObject() are seen not
> only in unicode(), but also in all other places where this
> API is used.
>
> OTOH, PyUnicode_FromEncodedObject() is the most generic constructor
> for Unicode objects there currently is in Python.
>
> So the questions are:
> - should I revert the change in PyUnicode_FromEncodedObject()
>   and instead extend PyObject_Unicode() to support encodings?
> - should we make PyObject_Unicode() use
>   PyUnicode_FromEncodedObject() instead of providing its
>   own implementation?
>
> The overall picture of all this auto-conversion going on in
> str() and unicode() is very confusing. Perhaps what we really
> need is first to agree on a common understanding of which
> auto-conversions should take place, and then make str() and
> unicode() support exactly the same interface?!
>
> PS: Also see patch #446754 by Walter Dörwald:
> http://sourceforge.net/tracker/?func=detail&atid=305470&aid=446754&group_id=5470

OK, let's take a step back.

The str() function (now constructor) converts *anything* to a string;
tp_str and tp_repr exist to allow objects to customize this. These
slots, and the str() function, take no additional arguments. To invoke
the equivalent of str() from C, you call PyObject_Str(). I see no
reason to change this; we may want to make the Unicode situation as
similar as possible.

The unicode() function (now constructor) traditionally converted only
8-bit strings to Unicode strings, with additional arguments to specify
the encoding (and the error handling preference).
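To make the status quo being compared here concrete, a small sketch in
Python 2.x terms; the class names and the commented results are
illustrative only:

    class Plain:
        pass

    class WithStr:
        def __str__(self):
            return "rendered by __str__"

    print str(WithStr())   # __str__ is picked up via the instance's tp_str
    print str(Plain())     # default fallback, something like
                           # '<__main__.Plain instance at 0x...>'

    u1 = unicode("abc")                          # decoded with the default encoding
    u2 = unicode("caf\xe9", "latin-1")           # u'caf\xe9'
    u3 = unicode("caf\xe9", "ascii", "replace")  # u'caf\ufffd'

    try:
        unicode(Plain())   # pre-checkin behaviour: TypeError, since the
    except TypeError:      # instance is not a string/buffer and has no __str__
        pass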
There is no tp_unicode slot, but for some reason there are at least
three C APIs that could correspond to unicode(): PyObject_Unicode() and
PyUnicode_FromObject() take a single object argument, and
PyUnicode_FromEncodedObject() takes object, encoding, and errors
arguments.

The first question is, do we want the unicode() constructor to be
applicable in all cases where the str() constructor is? I guess that we
do, since we want to be able to print to streams that support Unicode.
Unicode strings render themselves as Unicode characters to such a
stream, and it's reasonable to allow other objects to also customize
their rendition in Unicode.

Now, what should be the signature of this conversion? If we print
object X to a Unicode stream, should we invoke unicode(X), or
unicode(X, encoding, errors)? I believe it should be just unicode(X),
since the encoding used by the stream shouldn't enter into the picture
here: that's just used for converting Unicode characters written to the
stream to some external format.

How should an object be allowed to customize its Unicode rendition? We
could add a tp_unicode slot to the type object, but there's no need: we
can just look for a __unicode__ method and call it if it exists. The
signature of __unicode__ should take no further arguments: unicode(X)
should call X.__unicode__(). As a fallback, if the object doesn't have
a __unicode__ attribute, PyObject_Str() should be called and the
resulting string converted to Unicode using the default encoding.

Regarding the "long form" of unicode(), unicode(X, encoding, errors), I
see no reason to treat it with the same generality. This form should
restrict X to something that supports the buffer API (IOW, 8-bit string
objects and things that are treated the same as these in most
situations). (Note that it already balks when X is a Unicode string.)

So about those C APIs: I propose that PyObject_Unicode() correspond to
the one-arg form of unicode(), taking any kind of object, and that
PyUnicode_FromEncodedObject() correspond to the three-arg form.
PyUnicode_FromObject() shouldn't really need to exist. I don't see a
reason for PyUnicode_From[Encoded]Object() to use the __unicode__
customization -- it should just take the bytes provided by the object
and decode them according to the given encoding. PyObject_Unicode(), on
the other hand, should look for __unicode__ first and then fall back to
PyObject_Str().

I hope this helps.

--Guido van Rossum (home page: http://www.python.org/~guido/)
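A rough Python-level sketch of the conversion rules proposed above; the
helper names, the buffer() check and the use of sys.getdefaultencoding()
are assumptions for illustration, not the actual CPython implementation:

    import sys

    def unicode_one_arg(x):
        # One-arg form: prefer a __unicode__ hook taking no arguments;
        # otherwise fall back to str(x) and decode it with the default
        # encoding.
        hook = getattr(x, '__unicode__', None)
        if hook is not None:
            return hook()
        return str(x).decode(sys.getdefaultencoding())

    def unicode_three_arg(x, encoding, errors='strict'):
        # Three-arg form: only objects exposing their bytes via the
        # buffer API are accepted, and __unicode__ is deliberately not
        # consulted; Unicode input is rejected outright.
        if isinstance(x, unicode):
            raise TypeError("decoding Unicode is not supported")
        return str(buffer(x)).decode(encoding, errors)

In the mapping proposed above, unicode_one_arg() plays the role of
PyObject_Unicode() and unicode_three_arg() that of
PyUnicode_FromEncodedObject().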