Guido van Rossum wrote:
>
> > > Well, historically, str() has rarely raised exceptions, because
> > > there's a default implementation (same as for repr(), returning <FOO
> > > object at ADDRESS>).  This is used when neither tp_repr nor tp_str is
> > > set.  Note that PyObject_Str() never looks at __str__ -- this is done
> > > by the tp_str handler of instances (and now also by the tp_str handler
> > > of new-style classes).  I see no reason to change this.
> >
> > Me neither; what str() does not do (and unicode() does) is try
> > the char buffer interface before trying tp_str.
>
> The meanings of these two are different: tp_str means "give me a
> string that's useful for printing"; the buffer API means "let me treat
> you as a sequence of 8-bit bytes (or 8-bit characters)".  They are
> different e.g. when you consider a PIL image, whose str() probably
> returns something like '<PIL image WxHxD>' while its buffer API
> probably gives access to the raw image buffer.
>
> The str() function should map directly to tp_str().  You *might* claim
> that the 8-bit string type constructor *ought to* look at the buffer
> API, but I'd say that it's easy enough for a type to provide a tp_str
> implementation that does what the type wants.  I guess "convert
> yourself to string" is different than "display yourself as a string".

Sure is :-)

Ok, so let's remove the buffer API check from the list
of str()/unicode() conversion checks.

> > > The question then becomes, do we want unicode() to behave similarly?
> >
> > Given that porting an application from strings to Unicode should
> > be easy, I'd say: yes.
>
> Fearing this ends up being a trick question, I'll say +0.  If we end
> up with something I don't like, I reserve the right to change my
> opinion on this.

Ok.

> > > The str() function (now constructor) converts *anything* to a string;
> > > tp_str and tp_repr exist to allow objects to customize this.  These
> > > slots, and the str() function, take no additional arguments.
> > > To invoke the equivalent of str() from C, you call PyObject_Str().
> > > I see no reason to change this; we may want to make the Unicode
> > > situation as similar as possible.
> >
> > Right.
> >
> > > The unicode() function (now constructor) traditionally converted only
> > > 8-bit strings to Unicode strings,
> >
> > Slightly incorrect: it converted 8-bit strings, objects compatible
> > with the char buffer interface and instances having a __str__ method
> > to Unicode.
>
> That's a rather random collection of APIs, if you ask me...

It was modelled after the PyObject_Str() API at the time. Don't
know how the buffer interface ended up in there, but I guess
it was a left-over from early revisions in the design.

> Also, do you really mean *instances* (i.e. objects for which
> PyInstance_Check() returns true), or do you mean anything for which
> getattr(x, "__str__") is true?

Looking at the code from Python 2.1:

	if (!PyInstance_Check(v) ||
	    (func = PyObject_GetAttr(v, strstr)) == NULL) {
		PyErr_Clear();
		res = PyObject_Repr(v);
	}
	else {
		res = PyEval_CallObject(func, (PyObject *)NULL);
		Py_DECREF(func);
	}

... instances which have the __str__ attribute.

> If the latter, you're in for a
> surprise in 2.2 -- almost all built-in objects now respond to that
> method, due to the type/class unification: whenever something has a
> tp_str slot, a __str__ attribute is synthesized (and vice versa).
> (Exceptions are a few obscure types and maybe 3rd party extension
> types.)

Nice :-)

> > To synchronize unicode() with str() we'd have to replace the __str__
> > lookup with a tp_str lookup (this will also allow things like unicode(2)
> > and unicode(instance_having__str__)) and maybe also add the charbuf
> > lookup to str() (this would make str() compatible with memory mapped
> > files and probably a few other char buffer aware objects as well).
>
> I definitely don't want the latter change to str(); see above.
> If you want unicode(x) to behave as much as str(x) as possible, I
> recommend removing the use of the buffer API.

Ok, let's remove the buffer API from unicode(). Should it still be
maintained for unicode(obj, encoding, errors)?

> > Note that in a discussion we had some time ago we decided that __str__
> > should be allowed to return Unicode objects as well (instead of
> > defining a separate __unicode__ method/slot for this purpose). str()
> > will convert a Unicode return value to an 8-bit string using the
> > default encoding while unicode() takes the return value as-is.
> >
> > This was done to simplify moving from strings to Unicode.
>
> I'm now not so sure if this was the right decision.

Hmm, perhaps we do need a __unicode__/tp_unicode slot after all.
It would certainly help clarify the communication between the
interpreter and the object.

> > > with additional arguments to specify
> > > the encoding (and error handling preference).  There is no tp_unicode
> > > slot, but for some reason there are at least three C APIs that could
> > > correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
> > > take a single object argument, and PyUnicode_FromEncodedObject() takes
> > > object, encoding, and error arguments.
> > >
> > > The first question is, do we want the unicode() constructor to be
> > > applicable in all cases where the str() constructor is?
> >
> > Yes.
> >
> > > I guess that
> > > we do, since we want to be able to print to streams that support
> > > Unicode.  Unicode strings render themselves as Unicode characters to
> > > such a stream, and it's reasonable to allow other objects to also
> > > customize their rendition in Unicode.
> > >
> > > Now, what should be the signature of this conversion?  If we print
> > > object X to a Unicode stream, should we invoke unicode(X), or
> > > unicode(X, encoding, error)?
> > > I believe it should be just unicode(X),
> > > since the encoding used by the stream shouldn't enter into the picture
> > > here: that's just used for converting Unicode characters written to
> > > the stream to some external format.
> > >
> > > How should an object be allowed to customize its Unicode rendition?
> > > We could add a tp_unicode slot to the type object, but there's no
> > > need: we can just look for a __unicode__ method and call it if it
> > > exists.  The signature of __unicode__ should take no further
> > > arguments: unicode(X) should call X.__unicode__().  As a fallback, if
> > > the object doesn't have a __unicode__ attribute, PyObject_Str() should
> > > be called and the resulting string converted to Unicode using the
> > > default encoding.
> >
> > I'd rather leave things as they are: __str__/tp_str are allowed
> > to return Unicode objects and if an object wishes to be rendered
> > as Unicode it can simply return a Unicode object through the
> > __str__/tp_str interface.
>
> Can you explain your motivation?  In the long run, it seems better to
> me to think of __str__ as "render as 8-bit string" and __unicode__ as
> "render as Unicode string".

The motivation was the idea of a unification of strings and Unicode.
You may be right, though, that this idea is not really practical.

> > > Regarding the "long form" of unicode(), unicode(X, encoding, error), I
> > > see no reason to treat this with the same generality.  This form
> > > should restrict X to something that supports the buffer API (IOW,
> > > 8-bit string objects and things that are treated the same as these in
> > > most situations).
> >
> > Hmm, but this would restrict users from implementing string like
> > objects (i.e. objects having the __str__ method to make it compatible
> > to str()).
>
> Having __str__ doesn't make something a string-like object!  A
> string-like object (at least the way I understand this term) would
> behave like a string, e.g. have string methods.
> The UserString module
> is an example, and in 2.2 subclasses of the 'str' type are prime
> examples.
>
> To convert one of these to Unicode given an encoding, shouldn't their
> decode() method be used?

Right... perhaps we don't need __unicode__ after all: the .decode()
method already provides this functionality (on strings at least).

> > > (Note that it already balks when X is a Unicode
> > > string.)
> >
> > True -- since you normally cannot decode Unicode into Unicode using
> > some 8-bit character encoding. As a result encodings which convert
> > Unicode to Unicode (e.g. normalizations) cannot use this interface,
> > but since these are probably only rarely used, I think it's better
> > to prevent accidental usage of an 8-bit character codec on Unicode.
>
> Sigh.  More special cases.  Unicode objects do have a tp_str/__str__
> slot, but they are not acceptable to unicode().
>
> Really, this is such an incredible morass of APIs that I wonder if we
> shouldn't start over...  There are altogether too many places in the
> code where PyUnicode_Check() is used.  I wish there was a better
> way...

Ideally, we'd need a new base class for strings and then have 8-bit
and Unicode be subclasses of this base class. There are several
problems with this approach though; one certainly being the different
memory allocation mechanisms used (strings store the value in the
object, Unicode references an external buffer), the other
being the different nature: strings don't carry meta-information
while Unicode is in many ways restricted in use.

> > > So about those C APIs: I propose that PyObject_Unicode() correspond to
> > > the one-arg form of unicode(), taking any kind of object, and that
> > > PyUnicode_FromEncodedObject() correspond to the three-arg form.
> >
> > Ok. I'll fix this once we've reached consensus on what to do
> > about str() and unicode().
>
> Alas, this is harder than we seem to have thought, collectively.
> I want someone to sit back and rethink how this should eventually work
> (say in Python 2.9), and then work backwards from there to a
> reasonable API to be used in 2.2.  The current piling of hack upon
> hack seems hopeless.

Agreed.

> We have some time: 2.2a4 will be released this week, but 2.2b1 isn't
> due until Oct 10, and we can even slip that a bit.  Compatibility with
> previous 2.2 alpha releases is not necessary; the hard compatibility
> baseline is 2.1.1.
>
> > > PyUnicode_FromObject() shouldn't really need to exist.
> >
> > Note: PyUnicode_FromObject() was extended by PyUnicode_FromEncodedObject()
> > and only exists for backward compatibility reasons.
>
> Excellent.

I would like to boil this down to one API if possible which then
implements unicode(obj) and unicode(obj, encoding, errors) -- if
no encoding is given, the semantics of PyObject_Str() are closely
followed; with an encoding, the existing semantics of
PyUnicode_FromEncodedObject() are used (with the buffer interface
logic removed).

In a first step, I'd use the tp_str/__str__ for unicode(obj) as
well. Later we can add a tp_unicode/__unicode__ lookup before
trying tp_str/__str__ as fallback.

If this sounds reasonable, I'll give it a go...

> > > I don't see a
> > > reason for PyUnicode_From[Encoded]Object() to use the __unicode__
> > > customization -- it should just take the bytes provided by the object
> > > and decode them according to the given encoding.  PyObject_Unicode(),
> > > on the other hand, should look for __unicode__ first and then
> > > PyObject_Str().
> > >
> > > I hope this helps.
> >
> > Thanks for the summary.
>
> Alas, we're not done. :-(
>
> I don't have much time for this -- there still are important pieces of
> the type/class unification missing (e.g. comparisons and pickling
> don't work right, and _ must be able to make __dynamic__ the default).

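To make the single-API idea above a bit more concrete, here is a rough
sketch of the lookup order. It is written in present-day Python rather
than against the 2.2 C API, so str and bytes stand in for the Unicode
and 8-bit string types. The name to_unicode() is hypothetical, and
limiting the long form to bytes/bytearray is only an assumption about
how "buffer interface logic removed" might look; none of this is the
actual implementation.

```python
def to_unicode(obj, encoding=None, errors="strict"):
    """Hypothetical stand-in for a unified unicode() constructor."""
    if encoding is not None:
        # Long form: decode 8-bit data only (assumption: plain
        # bytes/bytearray, no generic buffer objects, no Unicode input).
        if not isinstance(obj, (bytes, bytearray)):
            raise TypeError("decoding requires an 8-bit string")
        return bytes(obj).decode(encoding, errors)
    # Short form: try __unicode__ first (looked up on the type) ...
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    # ... and fall back to the __str__/tp_str conversion otherwise.
    return str(obj)


class Label:
    """Example object that customizes only its Unicode rendition."""

    def __unicode__(self):
        return "Label (Unicode rendition)"

    def __str__(self):
        return "Label (string rendition)"


print(to_unicode(Label()))            # uses __unicode__
print(to_unicode(2))                  # falls back to __str__
print(to_unicode(b"abc", "ascii"))    # long form decodes bytes
```

With this split the short form never needs to know about encodings, and
the long form never consults __unicode__ or __str__.
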
-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/