Showing content from https://mail.python.org/pipermail/python-dev/2001-September/017602.html below:

[Python-Dev] str() vs. unicode()

Guido van Rossum guido@python.org
Mon, 24 Sep 2001 09:30:14 -0400
> > Well, historically, str() has rarely raised exceptions, because
> > there's a default implementation (same as for repr(), returning <FOO
> > object at ADDRESS>).  This is used when neither tp_repr nor tp_str is
> > set.  Note that PyObject_Str() never looks at __str__ -- this is done
> > by the tp_str handler of instances (and now also by the tp_str handler
> > of new-style classes).  I see no reason to change this.
> 
> Me neither; what str() does not do (and unicode() does) is try
> the char buffer interface before trying tp_str.

The meanings of these two are different: tp_str means "give me a
string that's useful for printing"; the buffer API means "let me treat
you as a sequence of 8-bit bytes (or 8-bit characters)".  They are
different e.g. when you consider a PIL image, whose str() probably
returns something like '<PIL image WxHxD>' while its buffer API
probably gives access to the raw image buffer.

The str() function should map directly to tp_str().  You *might* claim
that the 8-bit string type constructor *ought to* look at the buffer
API, but I'd say that it's easy enough for a type to provide a tp_str
implementation that does what the type wants.  I guess "convert
yourself to string" is different than "display yourself as a string".
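
A minimal sketch of that distinction, written for present-day Python 3
rather than the 2.2 interpreter under discussion.  The array of three
bytes is only a hypothetical stand-in for something like an image;
str() yields a printable rendition, while the buffer protocol (reached
here through memoryview) exposes the raw 8-bit content:

```python
from array import array

# Hypothetical stand-in for an image: three 8-bit "pixels".
pixels = array("B", [0, 127, 255])

# str() asks the object for a rendition that is useful for printing.
display = str(pixels)

# The buffer protocol treats the same object as a sequence of bytes.
raw = bytes(memoryview(pixels))

print(display)
print(raw)
```

The two results are unrelated values, which is the reason for not
letting str() silently fall back on the buffer interface.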

> > The question then becomes, do we want unicode() to behave similarly?
> 
> Given that porting an application from strings to Unicode should
> be easy, I'd say: yes.

Fearing this ends up being a trick question, I'll say +0.  If we end
up with something I don't like, I reserve the right to change my
opinion on this.

> > The str() function (now constructor) converts *anything* to a string;
> > tp_str and tp_repr exist to allow objects to customize this.  These
> > slots, and the str() function, take no additional arguments.  To
> > invoke the equivalent of str() from C, you call PyObject_Str().  I see
> > no reason to change this; we may want to make the Unicode situation
> > as similar as possible.
> 
> Right.
>  
> > The unicode() function (now constructor) traditionally converted only
> > 8-bit strings to Unicode strings, 
> 
> Slightly incorrect: it converted 8-bit strings, objects compatible
> with the char buffer interface and instances having a __str__ method to
> Unicode.

That's a rather random collection of APIs, if you ask me...

Also, do you really mean *instances* (i.e. objects for which
PyInstance_Check() returns true), or do you mean anything for which
hasattr(x, "__str__") is true?  If the latter, you're in for a
surprise in 2.2 -- almost all built-in objects now respond to that
method, due to the type/class unification: whenever something has a
tp_str slot, a __str__ attribute is synthesized (and vice versa).
(Exceptions are a few obscure types and maybe 3rd party extension
types.)
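
The effect is easy to observe.  The following is a sketch run against
a present-day Python 3 interpreter (an assumption; the message is about
2.2), using an arbitrary sample of built-in objects and a hypothetical
Point class to show the "vice versa" direction:

```python
# An arbitrary sample of built-in objects, none of them class instances
# in the pre-2.2 sense.
samples = [2, 2.5, "text", b"bytes", [1, 2], {"k": "v"}, None, len]

# After type/class unification, every one of them answers to __str__.
has_str = [hasattr(obj, "__str__") for obj in samples]


class Point:
    """Hypothetical class: defining __str__ fills in the tp_str slot."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __str__(self):
        return "(%d, %d)" % (self.x, self.y)


rendered = str(Point(1, 2))
print(has_str)
print(rendered)
```

A test based on the mere presence of __str__ therefore no longer
singles out instances; it matches practically everything.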

> To synchronize unicode() with str() we'd have to replace the __str__
> lookup with a tp_str lookup (this will also allow things like unicode(2)
> and unicode(instance_having__str__)) and maybe also add the charbuf 
> lookup to str() (this would make str() compatible with memory mapped
> files and probably a few other char buffer aware objects as well).

I definitely don't want the latter change to str(); see above.  If you
want unicode(x) to behave as much like str(x) as possible, I recommend
removing the use of the buffer API.

> Note that in a discussion we had some time ago we decided that __str__
> should be allowed to return Unicode objects as well (instead of
> defining a separate __unicode__ method/slot for this purpose). str()
> will convert a Unicode return value to an 8-bit string using the
> default encoding while unicode() takes the return value as-is.
> 
> This was done to simplify moving from strings to Unicode.

I'm now not so sure if this was the right decision.

> > with additional arguments to specify
> > the encoding (and error handling preference).  There is no tp_unicode
> > slot, but for some reason there are at least three C APIs that could
> > correspond to unicode(): PyObject_Unicode() and PyUnicode_FromObject()
> > take a single object argument, and PyUnicode_FromEncodedObject() takes
> > object, encoding, and error arguments.
> > 
> > The first question is, do we want the unicode() constructor to be
> > applicable in all cases where the str() constructor is?  
> 
> Yes.
> 
> > I guess that
> > we do, since we want to be able to print to streams that support
> > Unicode.  Unicode strings render themselves as Unicode characters to
> > such a stream, and it's reasonable to allow other objects to also
> > customize their rendition in Unicode.
> > 
> > Now, what should be the signature of this conversion?  If we print
> > object X to a Unicode stream, should we invoke unicode(X), or
> > unicode(X, encoding, error)?  I believe it should be just unicode(X),
> > since the encoding used by the stream shouldn't enter into the picture
> > here: that's just used for converting Unicode characters written to
> > the stream to some external format.
> > 
> > How should an object be allowed to customize its Unicode rendition?
> > We could add a tp_unicode slot to the type object, but there's no
> > need: we can just look for a __unicode__ method and call it if it
> > exists.  The signature of __unicode__ should take no further
> > arguments: unicode(X) should call X.__unicode__().  As a fallback, if
> > the object doesn't have a __unicode__ attribute, PyObject_Str() should
> > be called and the resulting string converted to Unicode using the
> > default encoding.
> 
> I'd rather leave things as they are: __str__/tp_str are allowed
> to return Unicode objects and if an object wishes to be rendered
> as Unicode it can simply return a Unicode object through the
> __str__/tp_str interface.

Can you explain your motivation?  In the long run, it seems better to
me to think of __str__ as "render as 8-bit string" and __unicode__ as
"render as Unicode string".
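
The lookup order proposed in the quoted text can be sketched as
follows.  This is only an illustration in Python 3, where str() already
returns Unicode, so the "decode using the default encoding" step of the
fallback disappears; to_unicode, Temperature and Plain are hypothetical
names, not existing APIs:

```python
def to_unicode(obj):
    """Hypothetical one-argument unicode(): __unicode__ first, then str()."""
    # Look the hook up on the type, the way special methods are found.
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        return hook(obj)
    # Fallback: the ordinary string rendition (stand-in for PyObject_Str()).
    return str(obj)


class Temperature:
    """Hypothetical class that customizes both renditions."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __str__(self):
        return "%d deg C" % self.degrees

    def __unicode__(self):
        return "%d\u00b0C" % self.degrees


class Plain:
    """Hypothetical class that only defines __str__."""

    def __str__(self):
        return "plain"


print(to_unicode(Temperature(21)))
print(to_unicode(Plain()))
print(to_unicode(2))
```

Keeping the two hooks separate lets an object offer an ASCII-safe
rendition and a richer Unicode one without either call having to guess.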

> > Regarding the "long form" of unicode(), unicode(X, encoding, error), I
> > see no reason to treat this with the same generality.  This form
> > should restrict X to something that supports the buffer API (IOW,
> > 8-bit string objects and things that are treated the same as these in
> > most situations). 
> 
> Hmm, but this would restrict users from implementing string-like
> objects (i.e. objects having the __str__ method to make it compatible
> with str()).

Having __str__ doesn't make something a string-like object!  A
string-like object (at least the way I understand this term) would
behave like a string, e.g. have string methods.  The UserString module
is an example, and in 2.2 subclasses of the 'str' type are prime
examples.

To convert one of these to Unicode given an encoding, shouldn't their
decode() method be used?
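
As a rough illustration in Python 3, where bytes plays the role of the
8-bit string type: a subclass inherits decode() along with the other
string methods, so no help from the constructor is needed.  Header and
its field_name() method are hypothetical names chosen for the example:

```python
class Header(bytes):
    """Hypothetical subclass of the 8-bit string type."""

    def field_name(self):
        # Inherited string methods keep working on the subclass.
        return self.split(b":", 1)[0]


raw = Header(b"Subject: caf\xc3\xa9")

# The string-like object converts itself, given an encoding.
text = raw.decode("utf-8")

print(raw.field_name())
print(text)
```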

> > (Note that it already balks when X is a Unicode
> > string.)
> 
> True -- since you normally cannot decode Unicode into Unicode using 
> some 8-bit character encoding. As a result encodings which convert 
> Unicode to Unicode (e.g. normalizations) cannot use this interface,
> but since these are probably only rarely used, I think it's better
> to prevent accidental usage of an 8-bit character codec on Unicode.

Sigh.  More special cases.  Unicode objects do have a tp_str/__str__
slot, but they are not acceptable to unicode().
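
The same special case can still be seen in Python 3, where the
three-argument form of str() is the closest descendant of
unicode(X, encoding, error).  The sketch below assumes a Python 3
interpreter; long_form is a hypothetical helper that reports the
exception name instead of raising:

```python
def long_form(obj, encoding="ascii", errors="strict"):
    """Return the decoded text, or the name of the exception raised."""
    try:
        return str(obj, encoding, errors)
    except TypeError as exc:
        return type(exc).__name__


results = {
    # Objects that support the buffer API are decoded.
    "bytes": long_form(b"abc"),
    "bytearray": long_form(bytearray(b"abc")),
    "memoryview": long_form(memoryview(b"abc")),
    # Text that is already Unicode is rejected, as are other objects.
    "text": long_form("abc"),
    "int": long_form(2),
}

for kind, outcome in results.items():
    print(kind, outcome)
```

The one-argument form accepts all five values, so the restriction
applies only when an encoding is supplied.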

Really, this is such an incredible morass of APIs that I wonder if we
shouldn't start over...  There are altogether too many places in the
code where PyUnicode_Check() is used.  I wish there was a better
way...

> > So about those C APIs: I propose that PyObject_Unicode() correspond to
> > the one-arg form of unicode(), taking any kind of object, and that
> > PyUnicode_FromEncodedObject() correspond to the three-arg form.
> 
> Ok. I'll fix this once we've reached consensus on what to do
> about str() and unicode().

Alas, this is harder than we seem to have thought, collectively.  I
want someone to sit back and rethink how this should eventually work
(say in Python 2.9), and then work backwards from there to a
reasonable API to be used in 2.2.  The current piling of hack upon
hack seems hopeless.

We have some time: 2.2a4 will be released this week, but 2.2b1 isn't
due until Oct 10, and we can even slip that a bit.  Compatibility with
previous 2.2 alpha releases is not necessary; the hard compatibility
baseline is 2.1.1.

> > PyUnicode_FromObject() shouldn't really need to exist. 
> 
> Note: PyUnicode_FromObject() was extended by PyUnicode_FromEncodedObject()
> and only exists for backward compatibility reasons.

Excellent.

> > I don't see a
> > reason for PyUnicode_From[Encoded]Object() to use the __unicode__
> > customization -- it should just take the bytes provided by the object
> > and decode them according to the given encoding.  PyObject_Unicode(),
> > on the other hand, should look for __unicode__ first and then
> > PyObject_Str().
> > 
> > I hope this helps.
> 
> Thanks for the summary.

Alas, we're not done. :-(

I don't have much time for this -- there still are important pieces of
the type/class unification missing (e.g. comparisons and pickling
don't work right, and _ must be able to make __dynamic__ the default).

--Guido van Rossum (home page: http://www.python.org/~guido/)


