Showing content from https://mail.python.org/pipermail/python-dev/2013-July/127366.html below:

[Python-Dev] Misc re.match() complaint

Terry Reedy tjreedy at udel.edu
Tue Jul 16 11:18:28 CEST 2013
On 7/15/2013 7:14 PM, Guido van Rossum wrote:
> In a discussion about mypy I discovered that the Python 3 version of
> the re module's Match object behaves subtly different from the Python
> 2 version when the target string (i.e. the haystack, not the needle)
> is a buffer object.
>
> In Python 2, the type of the return value of group() is always either
> a Unicode string or an 8-bit string, and the type is determined by
> looking at the target string -- if the target is unicode, group()
> returns a unicode string, otherwise, group() returns an 8-bit string.
> In particular, if the target is a buffer object, group() returns an
> 8-bit string. I think this is the appropriate behavior: otherwise
> using regular expression matching to extract a small substring from a
> large target string would unnecessarily keep the large target string
> alive as long as the substring is alive.
>
> But in Python 3, the behavior of group() has changed so that its
> return type always matches that of the target string. I think this is
> bad -- apart from the lifetime concern, it means that if your target
> happens to be a bytearray, the return value isn't even hashable!
>
> Does anyone remember whether this was a conscious decision? Is it too
> late to fix?
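
A minimal sketch for checking the behavior described above on whatever interpreter is at hand. The helper name `describe_group` and the sample strings are made up for illustration. The result for the bytearray target depends on the Python version, so no expected output is shown for it.

```python
import re


def describe_group(target, pattern):
    """Return (type name of match.group(), whether that value is hashable).

    Hypothetical helper, not part of the re module.
    """
    match = re.match(pattern, target)
    result = match.group()
    try:
        hash(result)
        hashable = True
    except TypeError:
        # bytearray (and other mutable buffers) cannot be hashed
        hashable = False
    return type(result).__name__, hashable


if __name__ == "__main__":
    for target in ("spam eggs", b"spam eggs", bytearray(b"spam eggs")):
        # The pattern type must agree with the target type (str vs. bytes-like).
        pattern = "spam" if isinstance(target, str) else b"spam"
        print(type(target).__name__, "->", describe_group(target, pattern))
```

If the last line reports `bytearray`, the result cannot be used as a dict key or set member, which is the hashability complaint quoted above.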

In both Python 2 and Python 3, the second sentence of the docs is "Both 
patterns and strings to be searched can be Unicode strings as well as 
8-bit strings." The Python 3 version goes on to say that patterns and 
targets must match. "However, Unicode strings and 8-bit strings cannot 
be mixed." I normally consider '8-bit string' to mean 'bytes'. It 
certainly meant that in Python 2. We use 'buffer object' or 'object 
satisfying the buffer protocol' to mean 'bytes, bytearrays, or 
memoryviews'.
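
A short sketch of the "cannot be mixed" rule in Python 3, assuming only the standard re module. The helper name `can_match` is hypothetical. A bytes pattern accepts any of the buffer objects listed above as a target, while pairing a str with a bytes-like object raises TypeError.

```python
import re


def can_match(pattern, target):
    """Return True if re.match() accepts this pattern/target pairing.

    Hypothetical helper; it only reports whether a TypeError is raised.
    """
    try:
        re.match(pattern, target)
    except TypeError:
        return False
    return True


if __name__ == "__main__":
    print(can_match("a", "abc"))                 # str pattern, str target
    print(can_match(b"a", b"abc"))               # bytes pattern, bytes target
    print(can_match(b"a", bytearray(b"abc")))    # bytes pattern, bytearray target
    print(can_match(b"a", memoryview(b"abc")))   # bytes pattern, memoryview target
    print(can_match("a", b"abc"))                # mixed: str pattern, bytes target
    print(can_match(b"a", "abc"))                # mixed: bytes pattern, str target
```

The first four calls print True and the last two print False.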

I wonder if the change was an artifact of changing the code to prohibit 
mixing Unicode and bytes.

Going on:

"match.group([group1, ...])
     Returns one or more subgroups of the match. If there is a single 
argument, the result is a single string;"

In both 2.x and 3.x docs, I usually understand generic 'string' to mean 
'Unicode or bytes'. In any case, the sentence and a half from 'Returns' 
to 'string' is *exactly the same* as in the 2.x docs. As near as I could 
tell by looking, the rest of the entry for match.group is unchanged 
from 2.x to 3.x. So it is easy to think that the behavior change is an 
unintended regression.

-- 
Terry Jan Reedy

