[Andrew Koenig]
> ...
> Section 4.2.4 of the library reference says that the 'split' method of a
> regular expression object is defined as
>
>     Identical to the split() function, using the compiled pattern.

Supplying words intended to be clear from context, it's saying that the
split method of a regexp object is identical to the re.split() function,
which is true.  In much the same way, list.pop() isn't the same thing as
eyeball.pop() <wink>.

> This claim does not appear to be correct:
>
>     >>> import re
>     >>> re.compile('').split('abcde')
>     ['abcde']
>
> This result differs from the result of using the string split method.

True, but it's the same as

    >>> import re
    >>> re.split('', 'abcde')
    ['abcde']
    >>>

which is all the docs are trying to say.

> ...
> My first impulse was to argue that (4) is right, and that the behavior
> should be as follows
>
>     >>> 'abcde'.split('')
>     ['a', 'b', 'c', 'd', 'e']

If that's what you want, list('abcde') is a direct way to get it.

> ...
> I made the counterargument that one could disambiguate by adding the
> rule that no element of the result could be equal to the delimiter.
> Therefore, if s is a string, s.split('') cannot contain any empty
> strings.

Sure, that's one arbitrary rule <wink>.  It doesn't seem to extend to
regexps in a reasonable way, though:

    >>> re.split('.*', 'abcde')
    ['', '']
    >>>

Both split pieces there match the pattern.

> However, looking at the behavior of regular expression splitting more
> closely, I become more confused.  Can someone explain the following
> behavior to me?
>
>     >>> re.compile('a|(x?)').split('abracadabra')
>     ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

From the docs:

    If capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting list.

It should also say that splits never occur at points where the only match
is against an empty string (indeed, that's exactly why re.split('', 'abcde')
doesn't split anywhere).
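The two documented points quoted so far can be checked with a small sketch. The pattern '(-)' and the sample string 'a-b-c' are not from the thread; they are made up here so that every match is non-empty and the empty-match question doesn't come into it.

```python
import re

text = 'a-b-c'  # sample input, chosen for illustration only

# Without capturing parentheses, only the pieces between matches come back.
plain = re.split('-', text)

# With capturing parentheses, the text of each group is returned as well.
with_group = re.split('(-)', text)

# The split method of a compiled pattern gives the same answer as re.split().
compiled = re.compile('(-)').split(text)

print(plain)                   # ['a', 'b', 'c']
print(with_group)              # ['a', '-', 'b', '-', 'c']
print(compiled == with_group)  # True
```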
The logic is like:

    while True:
        find next non-empty match, else break
        emit the slice between this and the end of the last match
        emit all capturing groups
        advance position by length of match
    emit the slice from the end of the last match to the end of the string

It's the last line in the loop body that makes empty matches a wart if
allowed:  they wouldn't advance the position at all, and an infinite loop
would result.  In order to make them do what you think you want, we'd have
to add, at the end of the loop body

        ah, and if the match was empty, advance the position again, by,
        oh, i don't know, how about 1?  That's close to 0 <wink>.

So the pattern matches at the first 'a', and adds '' to the list (the slice
to the left of the first match) and None to the list (the capturing group
didn't participate in the match, but that doesn't excuse it from adding
something to the list).  There are no other non-empty matches until getting
to the second 'a', and then that adds 'br' to the list (the slice between
the current match and the last match), and None again for the
non-participating capturing group.  Etc.  The trailing empty string is the
slice from the end of the last match to the end of the string (which happens
to be empty in this case).

It's unclear to me what you expected instead.  Perhaps this?

    >>> re.split('a|(?:x?)', 'abracadabra')
    ['', 'br', 'c', 'd', 'br', '']
    >>>
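For anyone who wants to poke at it, the loop described above can be written out in Python. This is only a sketch of the logic as described, not the regexp engine's actual code; the name split_like_described is made up for illustration. It deliberately avoids calling re.split itself, since the treatment of empty matches may differ between Python versions.

```python
import re

def split_like_described(pattern, string):
    """Hypothetical helper: split only at non-empty matches."""
    regex = re.compile(pattern)
    pieces = []
    last_end = 0  # end of the last match that produced a split
    pos = 0       # where the next search starts
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        if m.start() == m.end():
            # Empty match: never split here; step past it and search again.
            pos = m.start() + 1
            continue
        # Emit the slice between the last match and this one.
        pieces.append(string[last_end:m.start()])
        # Emit every capturing group (None if it did not participate).
        pieces.extend(m.groups())
        # Advance the position by the length of the match.
        last_end = pos = m.end()
    # Emit the slice from the end of the last match to the end of the string.
    pieces.append(string[last_end:])
    return pieces

print(split_like_described('a|(x?)', 'abracadabra'))
print(split_like_described('a|(?:x?)', 'abracadabra'))
print(split_like_described('', 'abcde'))
```

Run on the patterns from this thread, it gives the same lists quoted above, including the None entries for the capturing group that didn't participate.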