>> Section 4.2.4 of the library reference says that the 'split' method of a
>> regular expression object is defined as
>>
>>     Identical to the split() function, using the compiled pattern.

Tim> Supplying words intended to be clear from context, it's saying that the
Tim> split method of a regexp object is identical to the re.split() function,
Tim> which is true. In much the same way, list.pop() isn't the same thing as
Tim> eyeball.pop() <wink>.

Right. I missed the fact that there's another split. Sorry about that.

>> My first impulse was to argue that (4) is right, and that the behavior
>> should be as follows
>>
>>     >>> 'abcde'.split('')
>>     ['a', 'b', 'c', 'd', 'e']

Tim> If that's what you want, list('abcde') is a direct way to get it.

True, but that doesn't explain why it is useful to have 'abcde'.split('')
and re.split('', 'abcde') behave differently.

>> I made the counterargument that one could disambiguate by adding the
>> rule that no element of the result could be equal to the delimiter.
>> Therefore, if s is a string, s.split('') cannot contain any empty
>> strings.

Tim> Sure, that's one arbitrary rule <wink>. It doesn't seem to extend to
Tim> regexps in a reasonable way, though:

Tim>     >>> re.split('.*', 'abcde')
Tim>     ['', '']

Tim> Both split pieces there match the pattern.

Yes, that's part of the source of my confusion.

>> However, looking at the behavior of regular expression splitting more
>> closely, I become more confused. Can someone explain the following
>> behavior to me?
>>
>>     >>> re.compile('a|(x?)').split('abracadabra')
>>     ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

>> From the docs:

Tim>     If capturing parentheses are used in pattern, then the text of all
Tim>     groups in the pattern are also returned as part of the resulting list.

OK -- as I said, I had assumed that split() was referring to the other split
function, probably because both of them were offscreen at the time.

Tim> It should also say that splits never occur at points where the only match is
Tim> against an empty string (indeed, that's exactly why re.split('', 'abcde')
Tim> doesn't split anywhere). The logic is like:

Tim>     while True:
Tim>         find next non-empty match, else break
Tim>         emit the slice between this and the end of the last match
Tim>         emit all capturing groups
Tim>         advance position by length of match
Tim>     emit the slice from the end of the last match to the end of the string

Tim> It's the last line in the loop body that makes empty matches a wart if
Tim> allowed: they wouldn't advance the position at all, and an infinite loop
Tim> would result. In order to make them do what you think you want, we'd have
Tim> to add, at the end of the loop body

Tim>         ah, and if the match was empty, advance the position again, by,
Tim>         oh, i don't know, how about 1? That's close to 0 <wink>.

Indeed, that's an arbitrary rule -- just about as arbitrary as the one that
you abbreviated above, which should really be

        find the next match, but if the match is empty, disregard it; instead,
        find the next match with a length of at least, oh, I don't know,
        how about 1? That's close to 0 <wink>.

What I'm trying to do is come up with a useful example to convince myself
that one is better than the other.
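
To compare the two rules side by side, here is a small Python sketch of the loop described above. It is only an illustration built on the search() method of a compiled pattern, not the actual implementation inside the re module; the name split_sketch and the allow_empty flag are invented for this example.

```python
import re

def split_sketch(pattern, string, allow_empty=False):
    """Hypothetical re-implementation of the split loop, for comparison only.

    allow_empty=False: empty matches are disregarded (the "non-empty match" rule).
    allow_empty=True:  empty matches split, then the position advances by 1.
    """
    regex = re.compile(pattern)
    pieces = []
    last_end = 0  # end of the last match that was used as a split point
    pos = 0       # where the next search starts
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        is_empty = m.start() == m.end()
        if is_empty and not allow_empty:
            # Disregard the empty match and look one character further on.
            pos = m.start() + 1
            continue
        pieces.append(string[last_end:m.start()])  # slice before this match
        pieces.extend(m.groups())                  # all capturing groups
        last_end = m.end()
        # An empty match would not advance the position, so step by 1.
        pos = m.end() + 1 if is_empty else m.end()
    pieces.append(string[last_end:])  # slice after the last match
    return pieces

print(split_sketch('a|(x?)', 'abracadabra'))
print(split_sketch('.*', 'abcde'))
print(split_sketch('', 'abcde'))
print(split_sketch('', 'abcde', allow_empty=True))
```

With allow_empty=False the sketch reproduces the three results quoted in this thread. With allow_empty=True, splitting 'abcde' on '' gives ['', 'a', 'b', 'c', 'd', 'e', ''] in this sketch, so "advance by 1" alone does not yield ['a', 'b', 'c', 'd', 'e']; some further rule about the empty pieces at the ends would still be needed.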