On 2017-11-28 20:04, Serhiy Storchaka wrote:
> The two largest problems in the re module are splitting on zero-width
> patterns and complete and correct support of the Unicode standard. These
> problems are solved in regex. regex has many other features, but they
> are less important.
>
> I want to describe the problem of splitting on zero-width patterns. It
> was already discussed on Python-Dev 13 years ago [3] and maybe later.
> See also issues: [4], [5], [6], [7], [8].
>
> In short, it doesn't work. Splitting on the pattern r'\b' doesn't split
> the text at word boundaries, and splitting on the pattern
> r'\s+|(?<=-)' will split the text on whitespace, but will not split
> words at hyphens as expected.
>
> In Python 3.4 and earlier:
>
> >>> re.split(r'\b', 'Self-Defence Class')
> ['Self-Defence Class']
> >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-Defence', 'Class']
> >>> re.split(r'\s*', 'Self-Defence Class')
> ['Self-Defence', 'Class']
>
> Note that splitting on r'\s*' (0 or more whitespace characters) actually
> split on r'\s+' (1 or more whitespace characters). Splitting on patterns
> that can only match the empty string (like r'\b' or r'(?<=-)') never
> worked.
>
> Since Python 3.5, splitting on a pattern that can only match the empty
> string raises a ValueError (this never worked), and splitting on a
> pattern that can match the empty string, but not only the empty string,
> emits a FutureWarning. This gave developers time to replace their
> patterns r'\s*' with r'\s+', as they should be.
>
> Now I have created a final patch [9] that makes re.split() split on
> zero-width patterns.
>
> >>> re.split(r'\b', 'Self-Defence Class')
> ['', 'Self', '-', 'Defence', ' ', 'Class', '']
> >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-', 'Defence', 'Class']
> >>> re.split(r'\s*', 'Self-Defence Class')
> ['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C',
> 'l', 'a', 's', 's', '']
>
> In the last case the result differs too much from the previous result,
> and this is likely not what the author wanted to get. But users had two
> Python releases to fix their code. FutureWarning is not silent by
> default.
>
> Because these patterns produced errors or warnings in the two most
> recent releases, we don't need an additional parameter for compatibility.
>
> But the problem was not just with re.split(). Other functions also
> worked poorly with patterns that can match the empty string.
>
> >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'elf', 'Defence', 'Class']
> >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1,
> 4), match='elf'>, <re.Match object; span=(5, 12), match='Defence'>,
> <re.Match object; span=(13, 18), match='Class'>]
> >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<>S<elf>-<Defence> <Class>'
>
> After matching the empty string, the following character is skipped
> and is not included in the next match. My patch fixes these
> functions too.
>
> >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'Self', 'Defence', 'Class']
> >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0,
> 4), match='Self'>, <re.Match object; span=(5, 12), match='Defence'>,
> <re.Match object; span=(13, 18), match='Class'>]
> >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<><Self>-<Defence> <Class>'
>
> I think this change doesn't need preliminary warnings, because it changes
> the behavior of more rarely used patterns. No re tests have been broken.
> I needed to add new tests to detect the behavior change.
>
> But there is one spoonful of tar in a barrel of honey. I didn't expect
> this, but this change has broken a pattern used with re.sub() in the
> doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with
> r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think
> this is an avoidable breakage.
>
> The new behavior of re.split() matches the behavior of regex.split()
> with the VERSION1 flag, and the new behavior of re.findall() and
> re.finditer() matches the behavior of the corresponding functions in the
> regex module (independently of the version flag). But the new behavior
> of re.sub() doesn't exactly match the behavior of regex.sub() with any
> version flag. It differs from the old behavior, as you can see in the
> example above, but is closer to it than to regex.sub() with VERSION1.
> This made it possible to avoid breaking existing tests for re.sub().
>
> >>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
> >>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self::Defence:Class'
> >>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
>
> Like re.split(), it never matches the empty string adjacent to the
> previous match. re.findall() and re.finditer() only refuse to match the
> empty string adjacent to a previous empty-string match. In the regex
> module regex.sub() is mutually consistent with regex.findall() and
> regex.finditer() (with the VERSION1 flag), but regex.split() is not
> consistent with them. In the re module re.split() and re.sub() will be
> mutually consistent, as well as re.findall() and re.finditer(). This is
> more backward compatible. And I don't know of any reasons to prefer the
> behavior of re.findall() and re.finditer() over the behavior of
> re.split() in this corner case.
>
FTR, you could make an argument for either behaviour. For regex, I went
with what Perl does.

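For anyone who wants to see the corner case concretely, here is a minimal sketch. It assumes Python 3.7 or later and reuses the r'^|\w+' pattern from the quoted message. The variable names are mine, and released versions may differ in other corner cases from the patch as described above.

```python
import re

text = 'Self-Defence Class'

# A pattern whose first alternative matches only the empty string.
pattern = r'^|\w+'

# Spans of successive matches: the empty match at position 0 is followed
# by a non-empty match that starts at the same position.
spans = [m.span() for m in re.finditer(pattern, text)]
print(spans)

# re.findall() reports the same matches as strings.
found = re.findall(pattern, text)
print(found)

# re.sub() with the grouped form of the pattern, as in the quoted example.
wrapped = re.sub(r'(^|\w+)', r'<\1>', text)
print(wrapped)
```

The point to look at is the second span: it starts at 0, so the character after the empty match is no longer skipped.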
> Would be nice to get this change in 3.7.0a3 for wider testing. Please
> review the patch [9] or share your thoughts about this change.
>
> [1] https://docs.python.org/3/library/re.html
> [2] https://pypi.python.org/pypi/regex/
> [3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
> [4] https://bugs.python.org/issue852532
> [5] https://bugs.python.org/issue988761
> [6] https://bugs.python.org/issue1647489
> [7] https://bugs.python.org/issue3262
> [8] https://bugs.python.org/issue25054
> [9] https://github.com/python/cpython/pull/4471
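
To try the re.split() cases from the quoted message, something like the following sketch should do. It assumes Python 3.7 or later, where splitting on zero-width matches is available. The variable names are only illustrative.

```python
import re

text = 'Self-Defence Class'

# Zero-width pattern: split at every word boundary.
at_boundaries = re.split(r'\b', text)
print(at_boundaries)

# Mixed pattern: split on runs of whitespace, or just after a hyphen.
space_or_hyphen = re.split(r'\s+|(?<=-)', text)
print(space_or_hyphen)

# The recommended replacement for r'\s*': it can never match the empty string.
on_whitespace = re.split(r'\s+', text)
print(on_whitespace)
```

The r'\s+' form gives the same result on old and new Python versions, which is why it is the safe spelling for code that only means "split on whitespace".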