If you've ever been frustrated that there wasn't an easy way to parse a string or file, even though you could easily match it with re.match, then this PEP may be for you. If what's being proposed seems a little ghastly, well, I'll just say that I've been wanting something like this for a long time and this is the best I could come up with. Your comments invited. Mike ############################################################################## PEP: XXX Title: Complete, Structured Regular Expression Group Matching Version: $Revision: 1.3 $ Last-Modified: $Date: 2004/08/02 05:18:31 $ Author: Mike Coleman <mkc at mathdogs.com> Status: Draft Type: Standards Track Content-Type: text/x-rst Created: 1-Aug-2004 Python-Version: 2.4 Post-History: 1-Aug-2004 Abstract ======== This proposal is to add a new method ``re.structmatch`` that fully captures matches for groups that match repeatedly. The method returns a structured match tree whose shape is determined by the pattern groups. Ultimately this will allow a string that can be described by a Python regular expressions (e.g., the contents of ``/etc/passwd`` or ``.ini`` files, or the output of ``diff``) to be parsed into the obvious parse tree with a single call to ``re.structmatch``. Motivation ========== A notable limitation of the ``re.match`` method is that it fails to capture all group match information for repeatedly matched groups. For example, in a call like this :: m0 = re.match(r'([A-Z]|[a-z])*', 'XxxxYzz') one would like to see that the group which matched four times matched the strings ``'X'``, ``'xxx'``, ``'Y'`` and ``'zz'``. In the current implementation, only the last match for each group (``'zz'`` in this case) is available, even though the other matches are calculated implicitly during the matching process. For simple cases, the missing strings might be easily recovered by other means, but for more complicated cases this is a significant annoyance. The simplest extension would be to collect all of the matches for each group in a list, so that in the call above for example :: m0.group(1) --> ['X', 'xxx', 'Y', 'zz'] (A mechanism like this is apparently to be added to Perl in version 6 [#PERL6]_, though that did not inspire this PEP.) This immediately suggests the question of what is to be done for nested repeats, like so :: m1 = re.match(r'("([A-Z]|[a-z])*"\s*)*', '"Xx" "yy" "ZzZ"') We could have :: m1.group(2) --> ['X', 'x', 'yy', 'Z', 'z', 'Z' ] but this flattened result would discard useful information about the structure of the match. A more useful result would be :: m1.group(2) --> [['X', 'x'], ['yy'], ['ZzZ']] This is definitely an improvement. Consider the following slightly more complicated case, though :: m1 = re.match(r'("([A-Z]|[a-z])*"(\s*))*', '"Xx" "yy" "ZzZ"') which would give :: m1.group(2) --> [['X', 'x'], ['yy'], ['Z', 'z', 'Z']] m1.group(3) --> [' ', ' ', ''] This is less useful than the putative conceptual structure of the match, which would be something like :: [[['X', 'x'], ' '], [['yy'], ' '], [['Z', 'z', 'Z'], '']] In this simple case, this structure could be recovered using the ``zip`` function, but in the general case this is a non-trivial transformation. This PEP proposes a mechanism to address the general case directly. Proposal ======== The new function ``re.structmatch`` has the same arguments as ``re.match`` and will match using exactly the same algorithm. Instead of returning a MatchObject upon success, ``re.structmatch`` will return a "tree" (or rather, a forest) in the form of a Python list. First we will define some terms. Groups in the input pattern are parenthesized subexpressions that "capture" strings when matched, and are of the form ``'(abc)'`` or ``'(?P<name>abc)'``. Leaf groups are groups that do not contain any subgroups. So, for example, the group ``'(abc)'`` is a leaf group and the group ``'(a(xyz)b)'`` is not a leaf group (but it does contain the leaf group ``'(xyz)'``). We shall call parenthesized expressions that are not groups "nongroups"; these include constructs like ``'(?:abc)'`` as well as lookahead and lookbehind assertions. The structure of the returned tree is determined from the grouping tree present in the pattern in the following manner: * Leaf groups not followed immediately by a repeat operator map to a single string:: re.structmatch(r'(...)', 'abcdef') --> ['abc'] re.structmatch(r'(...).(..)', 'abcdef') --> ['abc', 'ef'] * Leaf groups followed immediately by a repeat operator map to a list of strings:: re.structmatch(r'([^d])*', 'abcdef') --> [['a', 'b', 'c']] re.structmatch(r'([^d])*(.)*', 'abcdef') --> [['a', 'b', 'c'], ['d', 'e', 'f']] re.structmatch(r'(..)*', 'abcdef') --> [['ab', 'cd', 'ef']] re.structmatch(r'(.){2}', 'abcdef') --> [['a', 'b']] * Non-leaf groups not followed immediately by a repeat operator map to a list of the mappings of their subgroups:: re.structmatch(r'(...)', 'abcdef') --> ['abc'] re.structmatch(r'((...))', 'abcdef') --> [['abc']] re.structmatch(r'(((...)))', 'abcdef') --> [[['abc']]] re.structmatch(r'((...)())', 'abcdef') --> [['abc'], []] re.structmatch(r'(.(.)(.(.)).(.))', 'abcdef') --> ['a', ['b'], ['c', ['d']], 'e', ['f']] * Non-leaf groups followed immediately by a repeat operator map to a list of the mappings of their repeated matches:: re.structmatch(r'((.).(.))*', 'abcdef') --> [[['a', 'c'], ['d', 'f']]] re.structmatch(r'(([ab])*(x)*)*', 'baxbxx') --> [[['b', 'a'], ['x']], [['b'], ['x', 'x']]] * In the case of alternation, only the matched groups appear in the output:: re.structmatch(r'([^a])*|([^d])*', 'abcdef') --> [['a', 'b', 'c']] If it's important to know which alternative matched, named groups can be used. * Named groups map to a pair where the first member is the name and the second member is what the unnamed group would have mapped to:: re.structmatch(r'([^d])*(?P<rest>.)*', 'abcdef') --> [['a', 'b', 'c'], ('rest', ['d', 'e', 'f'])] * Nongroups do not affect the output structure. Compared to non-leaf groups, nongroups have the effect of "flattening" the output, like so:: re.structmatch(r'((.).(.))', 'abcdef') --> [['a', 'c']] re.structmatch(r'(.).(.)', 'abcdef') --> ['a', 'c'] re.structmatch(r'(?:(.).(.))', 'abcdef') --> ['a', 'c'] re.structmatch(r'((.).(.))*', 'abcdef') --> [[['a', 'c'], ['d', 'f']]] re.structmatch(r'(?:(.).(.))*', 'abcdef') --> [['a', 'c', 'd', 'f']] (Nongroups that do not contain leaf groups have no effect whatsoever on the output structure.) Additional Features ------------------- In many cases in which ``'re.structmatch'`` fails to match, the cause will be due to an unexpected error in the format of the string being matched. In order to assist the calling program in generating a more meaningful possible error message, ``'re.structmatch'`` will return the endpoint of the largest match against the searched string. So, for example :: re.structmatch('abcd', 'abxxx') --> 2 re.structmatch('abcde|z', 'abxxx') --> 2 re.structmatch('x*?y', 'xxx') --> 3 (This return value should be considered advisory rather than exact, as future improvements in the match algorithm may make it difficult to calculate the exact value.) Examples ======== Parsing ``/etc/group`` ---------------------- If ``/etc/group`` contains the following lines :: root:x:0: daemon:x:1: bin:x:2: sys:x:3: then it can be parsed as follows :: p = r'''(?xm) # VERBOSE, MULTILINE ( ( # one field, preceded by a ':' if # the field is not the line's first (?:^|:) ([^:\n]*) )* \n )* ''' s = open('/etc/group').read() tree = re.structmatch(p, s) which will give ``tree`` the following value:: [['root', 'x', '0', ''], ['daemon', 'x', '1', ''], ['bin', 'x', '2', ''], ['sys', 'x', '3', '']] Note that the above pattern is written to allow any ``/etc/group`` field to be empty. The pattern won't work correctly for versions of Python prior to 2.4 because of the possibility of repeating empty matches. This problem seems to have been fixed in Python 2.4. (A slightly more complicated pattern would work for pre-2.4 versions.) Parsing ``.ini`` files ---------------------- The Microsoft ``.ini`` file format is found in various contexts (there is one in the Python source distribution: ``Tools/freeze/extensions_win32.ini``). Given a file with contents :: [singers] good=Annie Lennox bad=William Shatner [comedians] good=Monty Python the file can be parsed with pattern :: r'''(?xm) # VERBOSE, MULTILINE \s* # leading whitespace ( # begin sections ^\[ ([^]]+) \] \s* # section header ( # begin params ^([^=]+) = # param name (.*) $ # param value \s* # intervening whitespace )* # end params )* # end sections \Z # assert that the entire # input was matched ''' to give :: [['singers', ['good', 'Annie Lennox'], ['bad', 'William Shatner']], ['comedians', ['good', 'Monty Python']]] The pattern given omits support for ``.ini`` file comments for the sake of simplicity, but this could be added. Details ======= The proposal states that ``re.structmatch`` will match using exactly the same algorithm as ``re.match``. This might be a little odd for a pattern like ``r'(x|y|z)*aaa\1'``, where the algorithm will require that the ``\1`` back-reference match (at most) one character. It's not obvious whether there is any better option, though, and there are advantages of implementation and simplicity for keeping the match algorithms of these two methods identical. (A similar argument applies to ``'(?P=NAME)'``.) Discussion ========== Part of the reason the proposed method would be so useful is that ``re.split`` currently does not split on empty matches. If it had this feature, one could split on lookahead and lookbehind assertions, which would significantly ease parsing of strings with a recursive regular structure (e.g., ``.ini`` files). Patch `#988761`_ will correct this ``re.split`` deficiency, if it is accepted. .. _#988761: https://sourceforge.net/tracker/?func=detail&aid=988761&group_id=5470&atid=305470 For particularly complex patterns, the current 99 group limit might actually become a practical problem. One could imagine a variation in which subtrees of named groups might be captured in dictionaries rather than lists, with the group names used as keys. Rejected Alternatives ===================== Several simpler alternatives are rejected in the `Motivation`_ section above. Although these alternatives would be better than nothing, they would not adequately satisfy the goal of this proposal, which is to allow the programmer to extract the structure of a string in an immediate manner. It would be possible to use tuples for some substructures in the return tree, rather than composing it strictly of lists. This alternative was rejected because it was thought useful to be able to modify the resulting tree in place, perhaps to add decorations, etc., and tuples would make this more difficult. References ========== .. [#PERL6] Multiple regex matches in Perl Apocalypse 5 (http://www.perl.com/pub/a/2002/06/04/apo5.html?page=22#rfc%20360:%20allow%20multiply%20matched%20groups%20in%20regexes%20to%20return%20a%20listref%20of%20all%20matches) Copyright ========= This document has been placed in the public domain. .. Local Variables: mode: indented-text indent-tabs-mode: nil sentence-end-double-space: t fill-column: 70 End:
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4