lrparsing
¶
Abstract
lrparsing is an LR(1) parser hiding behind a pythonic interface.
Russell Stuart <russell-lrparsing@stuart.id.au>
Lrparsing provides both an LR(1) parser and a tokeniser. It differs from other Python LR(1) parsers in using Python expressions as grammars, and offers simple to use disambiguation tools.
The result is something that is powerful yet concise. For simple tasks this means it can be thought of as an extension to Python's existing re
module, used when regular expressions become too cumbersome. For complex tasks lrparsing offers a high speed parser (very roughly 25us per token from string to parse tree on a desktop CPU), pre-compilation of the grammar and error recovery hooks so parsing can continue after an error is found.
In addition to extensive documentation it comes with a parser for Sqlite3 data manipulation statements and a Lua 5.2 to Python compiler as examples. The documentation can be read online at http://lrparsing.sourceforge.net/doc/html/.
First Steps¶Below this section is a wall of text that will overwhelm all but the most determined. This example is designed to help you navigate it. Invest some time in reading it, enough to gain an intuitive feel for how lrparsing
works. If your task is small it might be enough to get you through.
The grammar is specified by a class derived from Grammar. Attributes of that class define the grammar:
import lrparsing
from lrparsing import Keyword, List, Prio, Ref, THIS, Token, Tokens

class ExprParser(lrparsing.Grammar):
    #
    # Put Tokens we don't want to re-type in a TokenRegistry.
    #
    class T(lrparsing.TokenRegistry):
        integer = Token(re="[0-9]+")
        integer["key"] = "I'm a mapping!"
        ident = Token(re="[A-Za-z_][A-Za-z_0-9]*")
    #
    # Grammar rules.
    #
    expr = Ref("expr")                # Forward reference
    call = T.ident + '(' + List(expr, ',') + ')'
    atom = T.ident | T.integer | Token('(') + expr + ')' | call
    expr = Prio(                      # If ambiguous choose atom 1st, ...
        atom,
        Tokens("+ - ~") >> THIS,      # >> means right associative
        THIS << Tokens("* / // %") << THIS,
        THIS << Tokens("+ -") << THIS,        # THIS means "expr" here
        THIS << (Tokens("== !=") | Keyword("is")) << THIS)
    expr["a"] = "I am a mapping too!"
    START = expr                      # Where the grammar must start
    COMMENTS = (                      # Allow C and Python comments
        Token(re="#(?:[^\r\n]*(?:\r\n?|\n\r?))") |
        Token(re="/[*](?:[^*]|[*][^/])*[*]/"))

parse_tree = ExprParser.parse("1 + /* a */ b + 3 * 4 is c(1, a)")
print(ExprParser.repr_parse_tree(parse_tree))
Will print:
(START (expr                     # Here START is ExprParser.START
  (expr                          # And expr is ExprParser.expr too.
    (expr
      (expr (atom '1'))
      '+'                        # '+' is repr(input_matched_by_token)
      (expr (atom 'b')))
    '+'                          # 'b' is actually ExprParser.T.ident
    (expr
      (expr (atom '3'))
      '*'                        # '3' is actually ExprParser.T.integer
      (expr (atom '4'))))
  'is'
  (expr (atom (call 'c' '('
    (expr (atom '1'))
    ','                          # ',' is an anonymous Token
    (expr (atom 'a'))
    ')')))))
As is suggested by the output, the parse tree returned by parse() is a tree composed of nested Python tuples. The quoted strings above are actually tuples of the form:
(Token(), 'matched_input', offset_in_stream, line_number, column_number)
So the tuple representing '3' would be (ExprParser.T.integer, '3', 16, 1, 17).
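Walking this structure needs nothing more than tuple indexing. Below is a minimal sketch (assuming the ExprParser grammar above) that tells token tuples apart from rule nodes by testing the first element against TokenSymbol:

from lrparsing import TokenSymbol

def tree_tokens(node):
    # Yield the matched text of every token in a parse tree.
    if isinstance(node[0], TokenSymbol):
        yield node[1]               # token tuple: (token, text, offset, line, col)
    else:
        for child in node[1:]:      # rule node: (rule, child, child, ...)
            for text in tree_tokens(child):
                yield text

print(" ".join(tree_tokens(parse_tree)))  # 1 + b + 3 * 4 is c ( 1 , a )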
Your next step will probably be writing a small grammar, and you will become discouraged when lrparsing complains it is ambiguous. Read Ambiguities for hints on how to fix it.
If you go much beyond the example above you will need to read the reference material below, and you will find yourself getting lost in a thicket of jargon. The jargon is unavoidable as parsing involves a lot of concepts, and each needs its own name. Traditionally a Glossary is the solution for jargon, but you may find the Glossary Narrative a gentler introduction to the world of parsing.
Module contents¶The module contains many classes and a few functions. The classes are described below, grouped into categories.
In addition, there are a few utility routines. They all duplicate methods of the Grammar class. They are provided in case a Grammar subclass redefines them:
lrparsing.compile_grammar(grammar)¶
Compile grammar, which must be a subclass of Grammar. If the compile fails a GrammarError will be raised. This is typically used when debugging new Grammars. A Grammar must be compiled before it can be used, but this is done implicitly for you by parse(), or can be done explicitly by pre_compile_grammar() if compiling is too slow to do at run time. repr_parse_table() displays a lot more information when the grammar is compiled by compile_grammar(). Compiling a Grammar is not thread safe.
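A typical debugging workflow, sketched here with the ExprParser grammar from First Steps, is to compile explicitly and dump the parse table:

import lrparsing

try:
    lrparsing.compile_grammar(ExprParser)
except lrparsing.GrammarError as error:
    print(error)                          # an ambiguity or similar grammar bug
print(ExprParser.repr_parse_table())      # most informative after compile_grammar()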
lrparsing.epoch_symbol(grammar)¶
Returns the start symbol, which is the Symbol used to initialise the Grammar subclass grammar. It is a non-terminal with a single production: <MyGrammar> = START __end_of_input__. Useful during Error Recovery.
lrparsing.parse(grammar, input, tree_factory=None, on_error=None, log=None)¶
Parse the input using the subclass of Grammar passed, returning a parse tree if the grammar matches the input or raising a ParseError otherwise, calling compile_grammar() first if the grammar hasn't been compiled. If the function on_error isn't None it will be called when the parser input doesn't match the grammar. It is documented in Error Recovery. If the function log is passed it will be called repeatedly with strings showing the parser's stack.
The input is passed to a tokeniser which converts it to the stream of input tuples expected by the parser. This tokeniser is created by the TokenRegistry declared in the Grammar subclass. By default the inbuilt tokeniser is used, and it accepts either a string or a generator that yields strings and input tuples. Input tuples are normal Python tuples that have a TokenSymbol instance as their first element. The parser ignores the remaining elements in the tuple.
The default format of the returned parse tree is a tree of nested tuples. Each of these tuples is either one of the input tuples given to the parser, or has one of the Rule instances defined in the Grammar as its first element, followed by the tuples representing the Symbols that made up the right hand side of the production matched. (Use repr_productions() to see these productions.) The format of the returned parse tree can be altered by passing a function as tree_factory. Whatever it returns is substituted into the parse tree instead of the tuple. It must accept one argument: the tuple that would have been inserted.
The parse operation is thread safe but the lazy compiling of the Grammar is not. 1
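As an illustration of tree_factory, this sketch (an assumed example, not part of the library) collapses rule nodes that have a single child, so chains like (expr (atom '1')) shrink to the bare child:

import lrparsing

def unwrap(node):
    # A rule node with exactly one child adds no information: return the
    # child instead. Token tuples are left untouched.
    if not isinstance(node[0], lrparsing.TokenSymbol) and len(node) == 2:
        return node[1]
    return node

tree = ExprParser.parse("1 + 2 * x", tree_factory=unwrap)
print(ExprParser.repr_parse_tree(tree))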
lrparsing.pre_compile_grammar(grammar, pre_compiled=None)¶
compile_grammar() can take a long while for reasonably sized grammars (seconds on a 32 bit out-of-order desktop CPU). This function allows you to skip that time in production environments by giving the parser a pre-compiled parse table.
The pre-compiled parse table is actually a (very big) tuple made up entirely of Python constants. This function returns the repr() of that tuple (ie, a string). Use the returned value by passing it to this function(!) as the pre_compiled argument. If you pass the returned string directly it will be around 10 times faster than calling it without the pre_compiled parameter. If you pass it the raw tuple it will be about 100 times faster. The reason for the 10 fold difference is that the passed string must be parsed by eval(), but if you embed the raw tuple into your code it will be tokenised by Python and put in the .pyc file.
If this function is passed a pre-compiled parse table that matches the current grammar it returns None quickly, otherwise it compiles the grammar into a parse table and returns it as a string, which is orders of magnitude slower. This means that, apart from speed, the Grammar always behaves the same way (ie, as specified by the source defining its class) regardless of whether this function is called; even if called with a pre-compiled parse table that doesn't match the Grammar.
The returned parse table is optimised, meaning information useful for debugging the grammar has been stripped out of it. This operation is not thread safe.
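A typical caching pattern is sketched below; the file name and the decision to store the table on disk are assumptions of this example, not something the library prescribes:

import os
import lrparsing

TABLE_FILE = "expr_parse_table.txt"      # hypothetical cache location

def prime_parser():
    # Reuse a saved parse table if it still matches the grammar.
    pre_compiled = None
    if os.path.exists(TABLE_FILE):
        with open(TABLE_FILE) as handle:
            pre_compiled = handle.read()
    table = ExprParser.pre_compile_grammar(pre_compiled)
    if table is not None:                # grammar changed: refresh the cache
        with open(TABLE_FILE, "w") as handle:
            handle.write(table)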
lrparsing.repr_grammar(grammar)¶
Return a printable string with the Rules as seen by the Grammar subclass grammar. Useful for debugging grammars.
lrparsing.repr_parse_table(grammar, state=None)¶
Return a printable string describing the parse table of the passed Grammar subclass grammar. This only returns something useful after compile_grammar() has been called. The return value may even be useful if compile_grammar() raised an exception. The result returned for a pre-compiled grammar is barely useful for debugging. If state (an integer) is passed, only that state instead of the entire table is returned. If state is negative, that state and all states that refer to it in their action or reduce tables are returned.
lrparsing.repr_parse_tree(parse_tree, indent=' ')¶
Return a printable string representing the passed parse_tree (as returned by parse()) in a human readable format. If indent is False it will be on one line, otherwise it will be split across lines and indented using the string passed as indent to reveal the tree structure.
lrparsing.repr_productions(grammar)¶
Return a printable string describing all the productions in the passed Grammar subclass grammar. This only works if compile_grammar() has been called. It may work even if compile_grammar() raised an exception. Useful for debugging grammars.
lrparsing.unused_rules(grammar)¶
Returns a frozenset of the rules that aren't reachable from the Grammar.START rule. If this isn't empty the grammar probably contains an error.
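A common use is as a sanity check in a test suite, along these lines (using the ExprParser grammar from First Steps):

import lrparsing

unused = lrparsing.unused_rules(ExprParser)
assert not unused, "unreachable rules: %s" % ", ".join(sorted(str(r) for r in unused))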
A grammar's rules are contained in a class derived from lrparsing's Grammar class. Members of the Grammar subclass that are assigned instances of the Symbol class (described below) make up the rules of the grammar. For the most part other members are simply ignored by lrparsing, meaning you can add your own functions and variables to a Grammar subclass with impunity.
class lrparsing.Grammar¶
The constructor does nothing by default. Lrparsing uses classes rather than instances to do its work, so it doesn't expect or care if a subclass is instantiated.
compile_grammar(optimise=True, cache=None)¶
Identical to the module's compile_grammar(). If redefined the module's function continues to work.
epoch_symbol()¶
Identical to the module's epoch_symbol(). If redefined the module's function continues to work.
parse(input, tree_factory=None, on_error=None, log=None)¶
Identical to the module's parse(). If redefined the module's function continues to work.
pre_compile_grammar(pre_compiled=None)¶
Identical to the module's pre_compile_grammar(). If redefined the module's function continues to work.
repr_grammar()¶
Identical to the module's repr_grammar(). If redefined the module's function continues to work.
repr_parse_tree(parse_tree, indent=' ')¶
Identical to the module's repr_parse_tree(). If redefined the module's function continues to work.
repr_parse_table(state=None)¶
Identical to the module's repr_parse_table(). If redefined the module's function continues to work.
repr_productions()¶
Identical to the module's repr_productions(). If redefined the module's function continues to work.
unused_rules()¶
Identical to the module's unused_rules(). If redefined the module's function continues to work.
_parser_¶
Reserved for internal use by Grammar. Do not use.
COMMENTS¶
This member must be assigned a Keyword(), Token, UnrecognisedToken() or UserToken, or a Choice of those symbols. TokenSymbols assigned to COMMENTS are ignored by parse().
START¶
START must be assigned a Rule instance. The goal of the grammar is to recognise the Rule defined by START.
WHITESPACE¶
This is passed unaltered to the tokeniser. The inbuilt tokeniser insists that if this is present, it is a string whose characters can optionally separate Tokens.
Assigning a Symbol to a member of a Grammar subclass creates a new Rule instance containing the Symbol on the right hand side, and whose Rule.name is the name of the member it was assigned to. This makes it different from normal Python assignment, where the left hand side and the right hand side are the same object after the assignment executes. In practice there are two places this matters. Firstly, except for assigning to Grammar.START, you may not assign one Rule directly to another as it makes it impossible to determine the grammar hierarchy 2. Secondly, you cannot get a reference to a TokenSymbol by assigning it to a rule. Use a TokenRegistry to do that 5. Rules whose names start with an underscore (_) are treated specially: they are suppressed in the output parse tree 3.
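The underscore convention is handy for punctuation you never want to see again. A small sketch, assuming the inbuilt tokeniser; the rule names here are illustrative:

import lrparsing
from lrparsing import Token

class Assign(lrparsing.Grammar):
    _equals = Token('=')           # leading underscore: this rule is suppressed
    ident = Token(re="[a-z]+")
    assignment = ident + _equals + ident
    START = assignment

print(Assign.repr_parse_tree(Assign.parse("a = b")))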
Instances of Symbol subclasses are used to build rules in a Grammar. Symbol instances can be created in the normal way - by calling their class name (aka the constructor) with the appropriate arguments. However lrparsing provides some syntactic sugar by hijacking the Python expression builder. The Python operators that are translated to calling a Symbol constructor are noted below. Another sweetener is that all Symbol class constructors will, when they are expecting a Symbol as an argument, cast some standard Python types to an appropriate Symbol class instance. For example, a string will be cast to a literal Token. 4
The Symbol subclass constructors that can be used to define Grammars are:
- class lrparsing.Choice(s0, s1, ...)¶
Alternate syntax: s0 | s1 | ...
The grammar must choose between one of the symbols s0, s1, ...
- class lrparsing.Left(s)¶
Alternate syntax: s = s0 << s1 << s2 << ...
If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))), use (((s0 s1) s2) ...). In other words s is left associative.
- class lrparsing.List(s, d, min=0, max=None, opt=False)¶
The grammar must see a list of the symbol s separated by the symbol d, ie, s d s d s .... If opt is True the list may optionally end with d, otherwise it must end with s. If min is absent or 0 the list may be empty, otherwise s must appear a minimum of min times. If max is None any number of s are allowed, otherwise max is the maximum number of occurrences of s.
- class lrparsing.Nonassoc(s)¶
If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))), raise a ParseError. In other words s isn't associative.
- class lrparsing.Prio(s0, s1, ...)¶
Alternate syntax: (s0, s1, ...)
The grammar must choose between one of the symbols s0, s1, .... If several choices are possible, choose s0 over s1 over .... A tuple assigned directly to a class member is ignored (as any non-Symbol is), but will be converted to a Prio if passed to a Symbol constructor.
- class lrparsing.Ref(name)¶
A forward reference to the Rule assigned to the class member name (a string). It will be replaced by that member's definition, which must occur later.
- class lrparsing.Repeat(s, min=0, max=None)¶
Alternative syntaxes:
s * N
s * (min,)
s * (min,max)
Opt(s), Opt * s, s * (0,1)
Some(s), Some * s, s * (1,)
Many(s), Many * s, Repeat * s, s * ()
The grammar must see s repeated. If min is 0 or absent no repeats are required, meaning there may be no s at all; otherwise s must appear a minimum of min times. If max is None any number of s are allowed, otherwise max is the maximum number of repeats. These shortcuts are also provided:
Opt(x) or Opt * x is equivalent to Repeat(x, min=0, max=1).
Some(x) or Some * x is equivalent to Repeat(x, min=1).
Many(x) or Many * x or Repeat * x is equivalent to Repeat(x).
x * N is equivalent to Repeat(x, min=N, max=N).
x * tuple(...) is equivalent to Repeat(x, *tuple(...)).
- class lrparsing.Right(s)¶
Alternative syntax: s = s0 >> s1 >> s2 >> ...
If the symbol s is a Sequence(s0, s1, s2, ...), and there is a choice between the parse trees (((s0 s1) s2) ...) and (... (s0 (s1 s2))), use (... (s0 (s1 s2))). In other words s is right associative.
- class lrparsing.Sequence(s0, s1, ...)¶
Alternative syntax: s0 + s1 + ...
The grammar must see symbol s0, followed by s1, followed by ...
- class lrparsing.THIS¶
A reference to the Rule being assigned to, so THIS is identical to Ref('this') in this = Ref('this') + '+' + Ref('this').
- class lrparsing.Tokens(literals[, keywords])¶
The grammar must choose one of the supplied literals and keywords recognised by the inbuilt tokeniser. Both literals and keywords are strings which are string.split() to yield multiple literals and keywords, each of which is recognised using the inbuilt Token class.
Python does not allow rule = for an empty rule. Use rule = THIS * 0 instead.
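Tying several of these constructors together, here is a sketch of a small CSV-like grammar; the rule names and regular expressions are illustrative assumptions, not library fixtures:

import lrparsing
from lrparsing import List, Token

class Csv(lrparsing.Grammar):
    WHITESPACE = " \t"                   # keep newlines out of the whitespace set
    value = Token(re="[^,\r\n]+")
    record = List(value, ',', min=1)     # value (',' value)*
    _newline = Token(re="\r\n?|\n\r?")   # suppressed from the parse tree
    START = List(record, _newline, min=1, opt=True)

print(Csv.repr_parse_tree(Csv.parse("a,b,c\n1,2,3\n")))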
Although the classes above are used to build Grammar rules by appearing on the right hand side of a Python assignment statement, their instances are not directly visible. Instead an instance of the Rule class is created to hold them, and this is what is assigned to your Grammar attribute. Since Rules are a subclass of Symbol they can be referenced by other grammar rules.
All symbols the parser can recognise are derived from the abstract base class Symbol, and thus share these attributes:
- class lrparsing.Symbol¶
If you override Symbol the constructor must be called as it does initialisation.
dict¶
A dict you can use for your own purposes, for all rules bar Grammar.START.
__contains__(key)¶
Equivalent to key in symbol.dict.
__delitem__(key)¶
Equivalent to del symbol.dict[key].
__getitem__(key)¶
Equivalent to symbol.dict[key].
__iter__()¶
Equivalent to iter(symbol.dict).
__len__()¶
Equivalent to len(symbol.dict).
__setitem__(key, value)¶
Equivalent to symbol.dict[key] = value.
__repr__()¶
Returns the production for this Symbol.
Tokens¶
Tokens are a special kind of Symbol 9. As a convenience the same token may have several instances in a Grammar. Instances of the same token are merged during the compile, resulting in all bar one being discarded. Two tokens are considered to be the same if they have the same name attribute. The inbuilt tokens generate a suitable name for you, but you can assign a different name to any token using a TokenRegistry. Names containing __ are used internally, and so are reserved.
The following methods and classes are used directly in grammars:
lrparsing.Keyword(word, case=True)¶
Create an instance of the Token class that recognises a keyword. If case is False the keyword isn't case sensitive, otherwise it is. The default name is repr(word). A keyword is a literal Token that would otherwise be recognised by another re Token. For example, an identifier might be recognised by Token(re='[A-Za-z_]+'). The keyword 'then' would normally be recognised by the identifier Token, but if Keyword('then') is present it will match 'then', while all other non-keywords will still be matched by the identifier Token.
- class lrparsing.Token(literal=None, re=None, case=True)¶
Alternative syntax: 'literal'
Use the inbuilt tokeniser to generate a Token symbol when it sees the text specified by literal or re. The arguments become attributes with the same names. You can only supply literal or re, not both. If literal is supplied the tokeniser will match exactly that text, unless case is False, in which case it ignores case when matching. If re is supplied it must be a regular expression the tokeniser will match. The re is compiled with re.MULTILINE and re.DOTALL; adding additional flags using re's (?FLAG) syntax affects the recognition of all Tokens. The default name for literal tokens is repr(literal). The default name for re tokens is the re surrounded by /'s. Strings assigned directly to class members are ignored, as are all non-Symbols, but strings passed to Symbol constructors are turned into a literal Token.
- class lrparsing.UserToken¶
A user defined token. The UserToken must be assigned a name by a TokenRegistry. This class can be used as a base for user defined token classes.
lrparsing.UnrecognisedToken()¶
Create an instance of the Token class that is given to the parser when the inbuilt tokeniser strikes a string it doesn't recognise as a token or whitespace. If this token doesn't appear in the Grammar the inbuilt tokeniser raises a TokenError instead.
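For instance, a grammar can soak up unrecognised input rather than abort on it. A hedged sketch; the grammar here is illustrative:

import lrparsing
from lrparsing import Repeat, Token, UnrecognisedToken

class Forgiving(lrparsing.Grammar):
    word = Token(re="[A-Za-z]+")
    junk = UnrecognisedToken()      # catches anything the tokeniser rejects
    START = Repeat(word | junk, min=1)

Forgiving.parse("hello %$# world")  # '%$#' arrives as a junk token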
If you delve deeply into lrparsing you will come across these token classes. They aren't directly used by grammars:
- class lrparsing.MetaToken(name)¶
Used to construct the special tokens used internally by the grammar compiler and parser. You will never see these in the output parse tree, but error recovery may see them. There are two: __empty__ represents a 0 width character string, and __end_of_input__ represents the end of the input stream.
- class lrparsing.TokenSymbol(name)¶
The abstract base class all tokens are based on, thus isinstance(symbol, TokenSymbol) will return True if a Symbol is a token of some kind. The passed name is the generated name, which can be overridden by the TokenRegistry. TokenSymbols, and thus all subclasses, have these attributes and methods:
dict¶
This is inherited from Symbol, as is the shorthand access provided by Symbol's mapping interface. TokenSymbol.dict is best set from within a TokenRegistry. Changes made to this attribute in the body of a Grammar subclass change the Symbol.dict of the Rule the TokenSymbol is assigned to, not the TokenSymbol's dict.
name¶
A string. The name of the token. If two tokens with the same name are used in a Grammar they will be merged using merge() and only one will be kept.
named¶
A bool. True if set_name() has been called.
merge(other)¶
Called by the TokenRegistry when two tokens have the same name. other will be discarded after merge() returns.
position(input_tuple)¶
Given an input_tuple supplied to the parser, return a string describing the position in the input stream. The default implementation returns None, which means that if the input_tuple triggers an error the parser won't insert its position in the input stream into the error.
set_name(name)¶
Called by the TokenRegistry to override the generated name. Sets named to True.
__str__()¶
Returns str(name).
The TokenRegistry Class¶
The TokenRegistry is the home for tokens. If you don't define a subclass within the Grammar it will use an instance of TokenRegistry directly 5. Assigning a token to a member of a TokenRegistry subclass gives you a handle to the token instances recognised by the Grammar, and it is the only way to get such a handle 6. Assigning the token also sets the TokenSymbol.name of the token to the qualified attribute name, which is a string with the format 'NameOfTokenRegistrySubclass.member_name'.
Inbuilt Tokeniser¶
The inbuilt tokeniser generates Token instances. Thus if you use Keyword(), Token, Tokens or UnrecognisedToken() in your grammar, you must use the inbuilt tokeniser. It is used by default unless you add a TokenRegistry to your grammar and override its TokenRegistry.compile_tokens() and TokenRegistry.tokeniser() methods.
The input accepted by the inbuilt tokeniser is either a string or a generator yielding a mixture of strings and input tuples 7. If the input is broken up into multiple strings you should ensure the breaks occur at token boundaries. Grammar.WHITESPACE is a string whose characters can appear between tokens but are otherwise ignored. If not defined it defaults to " \f\n\r\t\v".
Strings are turned into input tuples with the following format:
(Token(), matched_data, stream_offset, line_number, column_number)
where:
Token() is the Token instance that matched the input.
matched_data is the string of input matched by the Token.
stream_offset is the offset from the start of the input where the matched data was found. An integer.
line_number is the line number the string was found on. The first line is numbered 1. An integer.
column_number is the column number within the line where the string was found. The first column is numbered 1. An integer.
If a generator is passed and it yields objects other than strings they are passed onto the parser unchanged in the order they were received.
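So, assuming the ExprParser grammar from First Steps, streaming input can be as simple as:

def chunks():
    yield "1 + 2"                  # each break falls on a token boundary
    yield " * longer_identifier"

parse_tree = ExprParser.parse(chunks())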
Exceptions¶
lrparsing.LrParsingError¶
Base: StandardError.
An abstract base class used by all exceptions defined here.
lrparsing.GrammarError¶
Base: LrParsingError.
Raised when there is an error in the grammar. This can happen both when the grammar is first parsed, and when compile_grammar() is called.
lrparsing.ParsingError¶
Base: LrParsingError.
An abstract base class for all errors that can be raised by this module when parse() is called.
lrparsing.TokenError(message, data, offset, line, column)¶
Base: ParsingError.
Raised by the inbuilt tokeniser when it strikes something in the input that isn't entirely whitespace, isn't a recognisable Token, and the grammar doesn't use UnrecognisedToken. The arguments following message become attributes of the instance.
data¶
The string that wasn't recognised.
lrparsing.ParseError(input_tuple, lr1_state)¶
Base: ParsingError.
Raised by parse() when the input the parser is given doesn't match the Grammar. The arguments become instance attributes with the same name.
input_tuple¶
The input tuple given to the parser that triggered the error. If this input tuple was created by the inbuilt tokeniser it will contain position information, ie, line number, column number and so on.
lr1_state¶
The current parser state, an Lr1State instance 10, from the parse table (ie, as returned by repr_parse_table()). If the parse table hasn't been pre compiled, repr() reveals a lot of information about what the parser was expecting at the time.
Error Recovery¶
The default action of parse() is to raise a ParseError exception, which creates a suitable message and of course stops the parser at that point. If you want the parser to continue so it can discover more errors in the same run, you need to do error recovery. Error recovery is always accomplished by manipulating the parser's input so the parser thinks it has seen something valid, and then letting it continue. It is implemented by the on_error parameter to parse():
lrparsing.on_error(iterator, input_tuple, stack)¶
Handle a parser error.
Parameters:
input_tuple - The input tuple that triggered the error.
iterator - The iterator the parser is reading input tuples from.
stack - The parser's stack. A list.
Returns: None or an iterable yielding input tuples that will be inserted into the input stream.
The difficulty with any error recovery is determining how to turn the mess you are given as input into something the parser will accept. Invariably this means you try to figure out what the user was trying to do. For example, let's say you are parsing Python and strike a problem in an expression. If the expression was part of a for statement your best strategy is probably to replace the entire for statement with for x in ():. But if that for was inside a list comprehension, inserting a for statement at that point would likely generate a cascade of errors later on. So understanding where you are is critical. Sadly this is one thing LR(1) makes hard. That information is buried in the stack parameter. The stack is a partially constructed branch, running from the root to a leaf, of a parse tree. It is a list of tuples, like this:
[ (lr1_state, (symbol, ...)), (lr1_state, (symbol, ...)), ... ]
Consider the following two grammars which both recognise the same thing, a small subset of the C language:
>>> def on_error(iterator, token, stack):
...     for item in stack:
...         print "([%s], (%s))" % (
...             ','.join(sorted(str(rule) for rule in item[0].rules)),
...             ','.join(
...                 lrparsing.repr_parse_tree(sym, indent="")
...                 for sym in item[1]))
>>> class Lang1(Grammar):
...     e = Prio(
...         Token(re='[0-9]+') | Token(re='[a-z]+') + Opt(Token('(') + THIS + ')'),
...         THIS << '*' << THIS,
...         THIS << '+' << THIS)
...     START = Repeat((
...         Opt(e) |
...         Keyword("if") + e + Keyword('then') + THIS +
...             Opt(Keyword('else') + THIS) + Keyword("endif") |
...         Keyword("while") + e + Keyword('do') + THIS + Keyword("done") |
...         Token(re='[a-z]+') + '=' + e) + ';')
>>> Lang1.parse(
...     "e=1; while funca(e) do if a then e=a; fi; done",
...     on_error=on_error)
([<Lang1>], (__empty__))
([START], ('e','=',(e '1'),';'))
([START], ('while'))
([START,e], ((e 'funca' '(' (e 'e') ')')))
([START], ('do'))
([START], ('if'))
([START,e], ((e 'a')))
([START], ('then'))
([START], ((START 'e' '=' (e 'a') ';' (e 'fi') ';')))
lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match START in state 44
>>> class Lang2(Grammar):
...     class T(TokenRegistry):
...         number = Token(re='[0-9]+')
...         ident = Token(re='[a-z]+')
...         kwd_endif = Keyword("endif")
...     exp = Ref("exp")
...     block = Ref("block")
...     call = T.ident + '(' + exp + ')'
...     mul_op = exp << '*' << exp
...     add_op = exp << '+' << exp
...     exp = Prio(T.number | T.ident | call, mul_op, add_op)
...     if_part = Keyword("if") + exp
...     then_part = Keyword("then") + block
...     else_part = Keyword("else") + block
...     if_st = if_part + then_part + Opt(else_part) + T.kwd_endif
...     while_part = Keyword("while") + exp
...     do_part = Keyword("do") + block
...     while_st = while_part + do_part + Keyword("done")
...     exp_st = exp * 1
...     ass_exp = exp * 1
...     ass_st = T.ident + '=' + ass_exp
...     empty_st = THIS * 0
...     st = ass_st | empty_st | exp_st | if_st | while_st
...     block = Repeat(st + ';')
...     START = block
>>> Lang2.parse(
...     "e=1; while funca(e) do if a then e=a; fi; done",
...     on_error=on_error)
([<Lang2>], (__empty__))
([block], ((st (ass_st 'e' '=' (ass_exp (exp '1')))),';'))
([while_st], ((while_part 'while' (exp (call 'funca' '(' (exp 'e') ')')))))
([do_part], ('do'))
([if_st], ((if_part 'if' (exp 'a'))))
([then_part], ('then'))
([then_part], ((block (st (ass_st 'e' '=' (ass_exp (exp 'a')))) ';' (st (exp_st (exp 'fi'))) ';')))
lrparsing.ParseError: line 1 column 43: Got 'done' when expecting 'else' or 'endif' while trying to match then_part in state 50
Here the on_error() function prints the stack passed, but doesn't attempt recovery. Looking at the right hand side of the stack, the (symbol, symbol, ...) bit, it shouldn't be too hard to see how stitching it together will yield a branch of the parse tree. This sequence of symbols must eventually match one of the Rules listed to their left. The LR(1) parser may list several as it will only make its final decision once it has seen the entire production. For Lang2 it is not difficult to see the parser was in the middle of the then_part of an if_st. Lang1 is an example of how to make recognising this difficult.
lr1_state has more information in it than the above example shows. It is an Lr1State 10:
- class lrparsing.Lr1State(id, actions, gotos, rules)¶
This class is immutable, the parameters becoming attributes of the same name. It has no useful methods beyond object.__str__() and object.__repr__().
id¶
An integer identifying the state. In an Lr1State this integer is its index into the parse table.
actions¶
The shift table of this parse state. It is a dict, indexed by all valid TokenSymbols that could appear at this point. The error occurred because the current token isn't a key to this dict.
gotos¶
The reduce table of this parse state. A dict. Not useful for error recovery.
By just looking at the top item of the passed stack for Lang2 we can see the parser was in the middle of a then_part. The job of on_error() is to re-arrange the input so the parser can continue. The simple way is to insert one of the TokenSymbols out of Lr1State.actions. In this case 'endif' would be the right choice. Another choice is to erase some input, possibly replacing it with something else. This would be appropriate if, for example, the string being parsed was ((2 func 2;. Replacing func 2 with )) would get the parser onto the next statement, so the user only gets one error for the current line.
If on_error() returns None the parser will raise a ParseError as if on_error() wasn't passed to it. Otherwise it will read input tuples from the returned iterable, and when that is exhausted return to reading the original input. So, returning a list of input tuples will insert them, reading input tuples off the passed iterator will delete them, and reading them and returning what you read allows you to peek ahead. As the parser has already processed the input tuple it passed to on_error(), if you want the parser to reprocess it, it must be in the returned iterable. Going back to the example, these lines would insert an 'endif', thus allowing the parser to process the remainder of the input:
endif = (Lang2.T.kwd_endif, Lang2.T.kwd_endif.literal) + token[2:]
return [endif, token]
There is one other option for the adventurous. The stack passed to on_error() is the actual parser stack, so you can modify it. The states on the stack depend on the whims of the parser generator, so adding or modifying them is impossibly fragile. Deleting them, using say del stack[-2:], is not. In the ((2 func 2 example, deleting the entire expression from the stack and presenting the parser with a replacement expression on its input (eg, just Lang2.T.number) is often simpler.
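Putting the pieces together, a classic panic-mode recovery (skip to the next statement terminator and resume) might be sketched as below. Resynchronising on ';' and the tup[1] test are assumptions tied to the Lang2 grammar above, not library behaviour:

def skip_to_semicolon(iterator, input_tuple, stack):
    # Discard input up to and including the next ';', then let the parser
    # resume. Returning None at end of input re-raises ParseError as usual.
    for tup in iterator:
        if tup[1] == ';':          # matched text of an inbuilt tokeniser tuple
            return [tup]           # re-insert the ';' so the parser sees it
    return None

tree = Lang2.parse(source, on_error=skip_to_semicolon)  # source: the text being parsed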
Ambiguities¶
Lurking under lrparsing's hood is an LR(1) parser. LR(1) parser generators are cantankerous beasts. lrparsing puts lipstick on the pig, but the pig is still a pig. The principal way it bites is to say your grammar is ambiguous. This section is here to help you work past this error, but be aware that writing large complex grammars requires a fair amount of expertise which can't be imparted by this small how-to.
To understand what ambiguous means, understand that a parser is meant to take some input, validate it conforms to the grammar, and then produce a parse tree. When the parser generator says the grammar is ambiguous, it means that for some valid input several different parse trees can be produced. An LR(1) grammar requires there be only one parse tree for any given input.
A blatantly ambiguous grammar will produce a Reduce/Reduce conflict. Consider:
class A(Grammar):
    a = Token('x')
    b = Token('x')
    START = a | b
A.compile_grammar() raises this exception:
lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise:
  b = 'x' ^
  a = 'x' ^
on seeing a __end_of_input__ I could not decide between:
  replace the sequence ['x'] with b
and
  replace the sequence ['x'] with a
Please remove this ambiguity from your grammar
Looking at the error message, it displays every possible production that could apply at this point, with an ^ showing where the parser is up to in each production. There are only two types of actions the parser can take. It can accept the next token in the input stream. This is called a Shift. It isn't possible here as the next symbol is __end_of_input__, in other words it is at the end of the string. The other action it could take is to replace some symbols with the production they match. This is called a Reduce. It can do a reduce as there are two productions that match. Sadly two is one too many, and this conflict is called Reduce/Reduce because both options are reduces. They correspond to these parse trees:
(START (a 'x')) OR (START (b 'x'))
Since there are two possible parse trees the grammar is ambiguous. If you look at the error message just the right way, it could almost be said to say that. The solution in this case is to remove one of the redundant Rules:
class A(Grammar):
    a = Token('x')
    START = a
Looping grammars are also ambiguous. Consider this grammar:
class B(Grammar):
    b = Ref("e")
    e = b | Token("x")
    START = e
Again, the Grammar will recognise a single 'x'. B.compile_grammar() raises this exception:
lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise:
  START = e ^
  b = e ^
on seeing a __end_of_input__ I could not decide between:
  replace the sequence [e] with b
and
  replace the sequence [e] with START
Please remove this ambiguity from your grammar
The error message is saying when the parser has seen a single 'e', it has two choices, which are listed. These two choices correspond to an infinite number of parse trees:
(START (e 'x'))
OR (START (e (b (e 'x'))))
OR (START (e (b (e (b (e 'x'))))))
OR ...
Again since there is more than one possible parse tree, the grammar is ambiguous.
Unfortunately those two simple examples don't capture the complexity of common ambiguities. Consider this example which is common in real world grammars:
class E(Grammar):
    e = THIS + Tokens("+ /") + THIS | Token(re="[0-9]+")
    START = e
This Grammar recognises strings of integers separated by '+' and '/'. If the grammar writer wanted to produce a grammar that only recognised valid integer expressions using the operators '+' and '/', he was successful. Nonetheless, when he calls E.compile_grammar() this exception is raised:
lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise:
  e = e '+' e ^
  e = e ^ '/' e
on seeing a '/' I could not decide between:
  replace the sequence [e '+' e] with e
and
  accepting the '/' in the hope it will match:
  e = e '/' ^ e
Please remove this ambiguity from your grammar
Again the error message lists the productions under consideration and how far the parser has got in each. The smallest possible string the parser could be looking at and end up with those possibles is: e + e ^ / e. The parser wants to know whether it should replace (reduce) e + e with e, which corresponds to performing the addition now, or whether it should continue reading, ie, delay the addition until it processes the '/'. This is called a Shift/Reduce conflict because the two actions possible are Shift and Reduce. The two parse trees under consideration are thus:
Choose Shift:  e + (e / e)
Choose Reduce: (e + e) / e
The normal arithmetic convention is to do the division first, so we want to choose Shift. There are several ways to make the grammar do that. The easiest way in lrparsing is to prioritise the two choices, saying e '/' e should always be done before e '+' e:
class E(Grammar):
    e = Prio(THIS + '/' + THIS, THIS + '+' + THIS) | Token(re="[0-9]+")
    START = e
But doing that only replaces the previous error E.compile_grammar() raised with a new one:
lrparsing.GrammarError: Conflict: Shift/Reduce and no associativity
While trying to recognise:
  e.Choice.Prio.Prioritised0 = e ^ '/' e
  e.Choice.Prio.Prioritised0 = e '/' e ^
on seeing a '/' I could not decide between:
  replace the sequence [e '/' e] with e.Choice.Prio.Prioritised0
and
  accepting the '/' in the hope it will match:
  e.Choice.Prio.Prioritised0 = e '/' ^ e
Please remove this ambiguity from your grammar
Firstly, notice how the choices listed don't directly correspond to the rules in your Grammar. That is because lrparsing has compiled your Grammar into productions, which is the only thing an LR(1) parser understands. You can print this compiled form using print(E.repr_productions()), which yields:
0 : <E> = START __end_of_input__
1 : START = e
2 : e = e.Choice.Prio
    e = /[0-9]+/
3 : e.Choice.Prio = e.Choice.Prio.Prioritised0
    e.Choice.Prio = e.Choice.Prio.Prioritised1
4 : e.Choice.Prio.Prioritised0 = e '/' e
5 : e.Choice.Prio.Prioritised1 = e '+' e
It is worth taking some time to understand how these productions correspond to the original set of Rules. In any case you can now see where all those odd looking symbol names come from.
Using the same technique as before we deduce the smallest string it could be looking at and end up with these possibilities: e '/' e ^ '/' e. In other words it can't decide between these two parse trees:
e / (e / e) OR (e / e) / e
As an aside, this is a real problem because integer division isn't associative: (9 / 3) / 2 = 1, whereas 9 / (3 / 2) = 9. So if the parse tree was being used to evaluate the expressions, choosing the wrong one would yield an incorrect answer.
We can't use Prio to solve this because the production is conflicting with itself. The solution is to say division is left associative:
class E(Grammar):
    e = Prio(THIS << '/' << THIS, THIS << '+' << THIS) | Token(re="[0-9]+")
    START = e
And with that the parser will compile, and will yield parse trees that conform to the conventions of arithmetic.
Looking deeper under the hood, an LR(1) parser is called that because it delays all decisions until it has recognised an entire production. If you look at the above error messages, you will see they all involve a reduce, and that is because a reduce is what the parser does when it makes the decision on which production it has seen. There are two ways it can get confused.
The first involves two reduces. Since the parser is making its decision after it has seen the entire production, the only way it can be confused is if two productions have the same right hand side. If you look at the Reduce/Reduce conflicts above that is indeed the case. But even in that case the LR(1) parser can still distinguish between the two if they are followed by different tokens. Consider this ambiguous Grammar:
class F(Grammar):
    a = Token('a')
    b = Token('a')
    START = a + 'x' + 'y' | b + 'x' + 'z'
F.compile_grammar() raises this exception:
lrparsing.GrammarError: Conflict: Reduce/Reduce
While trying to recognise state 4:
  a = 'a' ^
  b = 'a' ^
on seeing a 'x' I could not decide between:
  replace the sequence ['a'] with a
and
  replace the sequence ['a'] with b
Please remove this ambiguity from your grammar
Unlike the previous example this Grammar isn't inherently ambiguous, as there is one unique parse tree for every input. The problem is an LR(1) parser isn't powerful enough to recognise it. The symbol it needs to see in order to know whether it should be an a or a b (the 'y' or 'z') is two tokens away, and LR(1) only looks one token ahead. An LR(2) parser would be powerful enough, but we don't have one of those.
In this case it is easy enough to re-write the Grammar so an LR(1) parser is powerful enough. The solution is to delay recognising a and b until the parser can see the y or z:
class F(Grammar):
    a = Token('a') + 'x'
    b = Token('a') + 'x'
    START = a + 'y' | b + 'z'
The problem with re-writing like this is it changes the parse tree. If the back end that processes the parse tree can't tolerate this, something is going to have to give, and it won't be the LR(1) parser. Post processing the parse tree so it is in the form the back end wants is the normal solution. Console yourself with the knowledge that people using LL(1) grammars have to do it far more.
Avoid the temptation to use Prio to resolve Reduce/Reduce conflicts. It will make the conflict go away, but it means the parser will never recognise part of your Grammar. In the above example, yielding to temptation and using Prio would result in:
class F(Grammar):
    a = Token('a')
    b = Token('a')
    START = Prio(a + 'x' + 'y', b + 'x' + 'z')
And presto, F.compile_grammar() works. But now F.parse("axz") yields:
lrparsing.ParseError: line 1 column 3: Got 'z' when expecting 'y'
Since "axz" is a valid input for our Grammar this is a bug. It was expecting a 'y' because our Grammar says if it can't decide, choose a + 'x' + 'y' over b + 'x' + 'z'. As we know, it could not decide, so it will always choose a + 'x' + 'y'. This means it can never recognise b + 'x' + 'z'.
Since at least one reduce must be involved, the only other possibility is a Shift/Reduce conflict. These always take the form:
SYMBOL = SYMBOL <anything1> SYMBOL ^
SYMBOL = SYMBOL ^ <anything2> SYMBOL
You can pick the Reduce because it has the ^ at the end. In this case the Grammar is always ambiguous, which is good because you can fix it by using associativity or priorities. The trick in figuring out which to choose is to understand what the parser is trying to decide between, and you can do that by replacing a Symbol in the right hand side of the Shift with the entire right hand side of the Reduce. The name of the Symbol replaced must be the same as the left hand side of the Reduce, and the ^ in the two productions must coincide, like this:
SYMBOL = SYMBOL <anything1> SYMBOL ^ <anything2> SYMBOL
The parse trees become either:
Shift:  SYMBOL <anything1> (SYMBOL <anything2> SYMBOL)
Reduce: (SYMBOL <anything1> SYMBOL) <anything2> SYMBOL
If <anything1> and <anything2> are different (like '*' and '+' above) you can use Prio to choose the one that should happen first. If they have the same priority (and in particular this will be the case if <anything1> and <anything2> are the same thing), then you must use Left if the Reduce is the correct choice, or Right if the Shift is the correct choice. If it should never happen use Nonassoc.
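For completeness, a sketch of Nonassoc; comparison operators are the classic case where chaining should be rejected outright (the grammar here is illustrative):

import lrparsing
from lrparsing import Nonassoc, Prio, THIS, Token

class Cmp(lrparsing.Grammar):
    e = Prio(
        Token(re="[0-9]+"),
        Nonassoc(THIS + '==' + THIS))   # '1 == 2' parses
    START = e

Cmp.parse("1 == 2")                     # fine
# Cmp.parse("1 == 2 == 3")              # ParseError: '==' isn't associative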
Examples¶
The source of lrparsing contains the following examples, which can also be viewed online at http://lrparsing.sourceforge.net/examples:
A Grammar for all of the Data Manipulation Statements in Sqlite3.py. lrparsing was written because the author could not find a Python parsing library that ran at an acceptable speed and required fewer lines of code to do this job than a hand written LL(1) parser.
An integer expression evaluator. It illustrates one design pattern for building a compiler on top of lrparsing: using a Symbol's inbuilt mapping capabilities. If the job is simple this works well.
A Grammar for Lua 5.2 and a cross compiler to Python. It illustrates a second design pattern for building a compiler: using a compiler class whose functions have the same names as the rules and tokens in the Grammar. These functions are then called to compile each node as it is output by parse(). This is a good way to do it for more complex tasks. It cleanly separates the compiler from the parser, while keeping the relationship between the output parse tree and the generated code clear.
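The pattern itself is easy to sketch. The dispatcher below is a guess at the general shape, not the Lua example's actual code; it assumes rule nodes expose their name via node[0].name, as described earlier:

import lrparsing

class Compiler(object):
    # One method per Grammar rule; unknown rules fall through to _default.
    def compile(self, node):
        if isinstance(node[0], lrparsing.TokenSymbol):
            return node[1]                        # the matched text
        children = [self.compile(child) for child in node[1:]]
        handler = getattr(self, node[0].name, self._default)
        return handler(node, children)
    def _default(self, node, children):
        return children[0] if len(children) == 1 else children

result = Compiler().compile(ExprParser.parse("1 + 2 * 3"))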
lrparsing was developed with inspiration from several sources. The two main ones were parsing.py (http://www.canonware.com/Parsing/), which from a technical perspective is a powerful and clean LR parser implementing several things lrparsing doesn't, including a GLR(1) parser. The other source of inspiration was pyparsing.py (http://pyparsing.wikispaces.com/), which proves that providing a pythonic interface to parsing is a worthwhile thing to do.
Finally, thank you to Greg Black for proof reading this document.
Lrparsing Glossary¶If you use a parser generator to its full extent and particularly if you are doing error recovery, you will end up knowing enough about parser generation to write your own. It will be a long journey, made longer by parser generation being an old art that has built up a truly impressive amount of jargon. The glossary is here to help you get past the jargon.
Unfortunately the glossary is large and the concepts complex, so it ends up being another wall of text. The narrative below is less rigorous but hopefully doesn't suffer from the wall of text problem. May it speed your journey to parser generation zen.
Glossary Narrative¶Here is a simple grammar:
import lrparsing

class MyGrammar(lrparsing.Grammar):
    xy_non_term = lrparsing.Token('x') + lrparsing.Token('y')
    y_non_term = lrparsing.Token('y')
    the_world_non_term = lrparsing.Repeat(xy_non_term | y_non_term, min=1)
    START = the_world_non_term
This grammar is composed of rules, which are things Python can parse. lrparsing translates the rules into an alternate representation of the same grammar using productions. Productions are what an LR(1) parser generator understands (this is the output produced by MyGrammar.repr_productions()):
0 : <MyGrammar> = START __end_of_input__
1 : START = the_world_non_term
2 : the_world_non_term = the_world_non_term.Repeat
3 : the_world_non_term.Repeat = xy_non_term
    the_world_non_term.Repeat = y_non_term
    the_world_non_term.Repeat = the_world_non_term.Repeat xy_non_term
    the_world_non_term.Repeat = the_world_non_term.Repeat y_non_term
4 : xy_non_term = 'x' 'y'
5 : y_non_term = 'y'
Productions have an internal structure and each bit has its own name; sometimes several. Apart from the = they are made up of symbols. The symbol on the left hand side is called a non-terminal. A non-terminal is never read from the input stream directly. It is just a place holder created by the grammar author to represent whatever is on the right hand side of the production. This contrasts with the symbols read from the input stream. They are called tokens. lrparsing embeds the tokens in an input tuple, mostly because additional information is needed for error reporting - things like where the token was found (eg, line and column).
The parser generator now creates a parse table from the productions. To do that it goes through an intermediate step. It creates a third representation of the grammar, a directed graph. Each node in the graph is called an item set. The first item set is:
{ <MyGrammar> = ^ START __end_of_input__ }
As the {} are meant to imply, an item set really is a set: a set of productions. However there is an extra bit of information associated with each production, represented as a ^ here. This is called the dot and it shows how far the parser has progressed in recognising the production. The combination of the production and the dot is called an item, and hence the term item set arises because it is a set of items.
The production used in this initial item set is the root of the parse tree. It includes the special end-of-input MetaToken __end_of_input__, which signals the entire input has been read. The non-terminal on its left hand side (<MyGrammar> here) has a special name. It is called the start symbol. So the item set above is a succinct representation of the parser's state before it has read anything. However it's not finished yet.
In order to finish it off an operation called closure must be performed on it. The closure is intuitively easy to construct once you recognise that if the parser is about to see a START symbol, then the START symbol's production(s) belong in the closure as well:
{   <MyGrammar> = ^ START __end_of_input__
    START = ^ the_world_non_term }
But the closure is still not finished, as now the_world_non_term is also a non-terminal preceded by the dot. Adding it adds yet more unclosed non-terminals. This goes on for a bit. Once complete the closure will be:
{   <MyGrammar> = ^ START __end_of_input__
    START = ^ the_world_non_term
    the_world_non_term = ^ the_world_non_term.Repeat
    the_world_non_term.Repeat = ^ xy_non_term
    the_world_non_term.Repeat = ^ y_non_term
    the_world_non_term.Repeat = ^ the_world_non_term.Repeat xy_non_term
    the_world_non_term.Repeat = ^ the_world_non_term.Repeat y_non_term
    xy_non_term = ^ 'x' 'y'
    y_non_term = ^ 'y' }
The item set is now complete. There are two tokens that follow the dots in it: 'x' and 'y'. This means these two tokens, and only these two tokens, could legitimately appear in the input stream at this point. Anything else will cause the parser to raise a ParseError.
Let's say 'x' was the next thing to be read from the input stream. Reading it creates a new item which can be used to create a core we haven't seen before:
{ xy_non_term = 'x' ^ 'y' }
Since there are no non-terminals following the dot the closure adds nothing, so this core is also a completed item set. 'x' is not the only token that the epoch item set could first read from the input stream. It might also be a 'y'. This yields another new core, and doing a closure on that core will create a new item set for it.
Recall the parser generator is creating a graph, and these item sets are the nodes in the graph. Since reading a token moves to these new item sets, the net effect of reading any symbol is to create a directed line connecting the two nodes in the graph. The act of moving along that line to a new node in the graph is called an action. This particular variant of an action is called a shift action. All possible shift actions a node / item set can make are held in its shift table. For the item set representing the start symbol, the shift table will be:
{
    'x': item_set({ xy_non_term = 'x' ^ 'y', }),
    'y': item_set({ y_non_term = 'y' ^, }),
}
When the xy_non_term reads the 'x' followed by the 'y', the entire right hand side of the production has been seen. Each shift has pushed the token it accepted onto a data structure called the stack. The 'x' and 'y' on the stack are popped from the stack and the non-terminal xy_non_term is pushed onto the stack, effectively replacing them. These are the first steps of an action called reduce. Effectively, the non-terminal xy_non_term has just been read from the input stream, and based on that the parser must move to a new state (node in the graph, aka item set). It does it in the same way as a shift action: it looks up the non-terminal just 'read' in a table called the reduce table. This table can be displayed using MyGrammar.repr_parse_table(). The reduce table of the start symbol's item set is:
{
    START: item_set({ <MyGrammar> = START ^ __end_of_input__ }),
    the_world_non_term: item_set({ START = the_world_non_term ^ }),
    the_world_non_term.Repeat: item_set({
        the_world_non_term = the_world_non_term.Repeat ^
        the_world_non_term.Repeat = the_world_non_term.Repeat ^ xy_non_term
        the_world_non_term.Repeat = the_world_non_term.Repeat ^ y_non_term
    }),
    xy_non_term: item_set({ the_world_non_term.Repeat = xy_non_term ^ }),
    y_non_term: item_set({ the_world_non_term.Repeat = y_non_term ^ }),
}
'Reading' a non-terminal moves the dot just like reading a token does, and so creates a new item set. All these new item sets are processed, yielding new item sets in turn that are also processed - but only if the core hasn't been seen before. When the process winds down the graph is complete.
To do its job the parser needs only the base structure of the graph, not the details used to build it. In fact each item set can be represented by an integer, which can conveniently be used as an index into an array whose elements contain the reduce table and shift table for the node the item set represented. This stripped down version of the item set graph is the parse table.
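To make that machinery concrete, here is a skeletal drive loop matching the description above. The table layout (states with an actions dict and a gotos dict) mirrors the narrative, but everything in it is an illustrative assumption; accept and error handling are omitted:

def lr_parse(states, input_tuples):
    # states[i].actions maps a token to a shift state (int) or a reduce
    # (lhs, rhs_length); states[i].gotos maps a non-terminal to the next
    # state. Returns the finished parse tree node.
    stack = [(0, None)]                         # (state index, tree node)
    for tup in input_tuples:
        while True:
            action = states[stack[-1][0]].actions[tup[0]]
            if isinstance(action, int):         # shift: consume the token
                stack.append((action, tup))
                break
            lhs, rhs_len = action               # reduce: pop the right hand side
            children = tuple(node for _, node in stack[len(stack) - rhs_len:])
            del stack[len(stack) - rhs_len:]
            goto = states[stack[-1][0]].gotos[lhs]
            stack.append((goto, (lhs,) + children))
    return stack[-1][1]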
Glossary¶An action is the collective name for the two steps an LR(1) parser can perform: shift and reduce. Alternatively, if the parse table is thought of as a graph, an action is the name of a line connecting the nodes in the graph. The name for a node in the graph is a state.
A parser of any sort has two jobs: to verify the input is valid according to the grammar, and to create a parse tree for it. The words create a parse tree could equally well be written assign meaning. Eg, 9 + 3 / 3 is valid arithmetic, but without the normal rules of arithmetic it could have two different meanings: 9 + (3 / 3) = 10 or (9 + 3) / 3 = 4. In English, jokes are based on this same ambiguity, eg, Time flies like an arrow. It follows that a parser would not be much use if it could not assign a unique meaning to every input sequence. Sadly it is possible (even easy) to write grammars that don't assign a unique meaning to every input sequence. These grammars are rejected by the parser constructor as ambiguous. Even more sadly, no practical parser, including LR(1), is powerful enough to recognise all non-ambiguous grammars, so they incorrectly say that some non-ambiguous grammars are ambiguous. The only work around is to re-write the grammar in a form the parser can use, and post process the parse tree if the new version is too far from the truth.
Closure is the name for both an operation performed on an item set during its construction and the output of that operation, which is a set of items. An item set is a set of items, ie, partially recognised productions. Initially an item set contains just its core. If the next expected symbol of any item in the item set is a non-terminal, ie the left hand side of another production, then it follows the parser is also recognising this new item, which is that production with its dot at the start. This new item becomes part of the closure. That new item could also have a non-terminal at its start, thus yielding another new item recursively. All items added in this way become the item set's closure.
The core is the name of the initial subset of items in an item set. The core is constructed from the item sets in the state graph that have lines leading to this new item set: it consists of the items in those ancestor item sets after the recognition of the symbol labelling the line connecting them. The dot of those items has been moved along one position to reflect the recognition of that symbol.
The job of the parser generator is to create a graph. The nodes of the graph become the parser's states. The lines that join the nodes of the graph are termed actions. They are labelled with the symbol the parser just recognised. If the parser generator can't uniquely decide what the next state should be when a symbol is seen (ie, several lines would be labelled with the same symbol), the grammar is said to have a conflict. This means the grammar is ambiguous, because having several possible states reachable for the same symbol means there are several possible parse trees. For an LR parser one of the conflicting actions will always be a reduce, because an LR parser only makes a decision once it has seen an entire production. Since there are two types of actions, shift and reduce, a conflict must be either between a shift and a reduce, which is termed a Shift/Reduce conflict, or between two reduces, which is termed a Reduce/Reduce conflict.
The dot marks a position in the right hand side of a production. In days gone by, when parse tables were constructed manually on paper, a human would mark a dot in a production to keep track of where he was up to. In a computer the position of the dot is represented by the number of symbols on the right hand side seen so far.
grammar: Rules that explain how a series of tokens should be interpreted. For example, in English tokens are words and punctuation. English grammar shows how to build up words and punctuation into phrases of various kinds (eg, noun phrases, adverb phrases), then the phrases into sentences, and the sentences into constructs like quotes, paragraphs and chapters. The result is often represented as a parse tree.
input tuple: The parser recognises tokens, but it is often helpful to have other information associated with a token that the parser doesn't use, such as the position in the input stream where it was found, for error reporting. lrparsing calls the combination of a Token plus that other information an input tuple, and uses a Python tuple with the Token as its first element for this purpose.
item: A partially recognised production. It is a production, plus a marker (the dot) which shows how much the parser has seen so far, plus the tokens that could legally follow the production. An LR(0) parser doesn't store any such tokens, an LR(1) parser stores all valid tokens that could come next, and an LR(n) parser stores all valid sequences of n tokens that could possibly follow. Since an LR(1) parser only makes decisions after seeing the entire right hand side of a production, it only considers the symbols that could follow a production when it does a reduce.
item set: This data structure is the heart of an LR(1) parser generator. As the name implies, an item set is a set of items. Internally this set is broken up into two distinct subsets, the core and the closure. The generator initialises itself by creating the core of the initial item set from the start symbol. This core consists of the sole production of the start symbol with the dot before the first symbol. The next step is to compute the closure of this core: if the first symbol of the right hand side is a non-terminal, then the parser is also at the start of all of that non-terminal's productions, so they are added to the closure, and so on, recursively, for any of them that have non-terminals as their first symbol. So now we have a set of items (the core and its closure), each waiting for their first symbol to appear. If the parser receives one such symbol it can move forward, and what it moves to is a new item set whose core is all the current items waiting on that symbol, but with their dot positions moved forward one step because a new symbol has been seen. This process of computing closures and generating new item sets continues until only duplicate item sets are created. The item sets then become the states of the parse table.
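Continuing the sketch given under closure (same assumed data structures, still not lrparsing's own code), the successor item sets are built by grouping items on the symbol after the dot, advancing the dot to form each new core, and closing it:

def successors(item_set, grammar):
    # Group items by the symbol each is waiting on; advancing the dot
    # over that symbol gives the core of the successor item set.
    cores = {}
    for lhs, rhs, dot in item_set:
        if dot < len(rhs):
            cores.setdefault(rhs[dot], set()).add((lhs, rhs, dot + 1))
    # The symbol labels the line from this item set to the new one.
    return {symbol: closure(core, grammar) for symbol, core in cores.items()}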
left hand side: The Symbol on the left hand side of a production. Such symbols are called non-terminals. An lrparsing Grammar calls the symbol on the left hand side of a production a Rule.
LR(1): An abbreviation for Left Right (parser with) 1 (symbol lookahead). Left here simply means it treats the string of characters like English: it starts reading at the left. The Right means it delays making a decision on whether it has recognised a production until it has read the last symbol (right most) of its right hand side. The 1 symbol lookahead means it considers the next input symbol when it makes this decision. (It only needs to do this if two different non-terminals have the same right hand side. An LR(1) parser will look at the next token to be read to distinguish between them. If the same token can legitimately follow both, doing that doesn't help, so the parser generator will say it has a Reduce/Reduce conflict.) An LR(0) parser doesn't consider the next input token. If someone were to write an LR(2) parser it would consider the next 2 tokens. The other common type of parser is an LL(1) parser, also called a recursive descent parser. The only difference is the R becomes an L (for Left), meaning the parser must make the decisions on what it is doing using just the left hand side of the production it is currently trying to recognise and the next input symbol. In theory, because it has significantly less information to work with (it hasn't seen any of the right hand side yet), an LL(1) parser is less powerful than an LR(1) parser. In practice recursive descent parsers are often written by hand, and the programmer cheats by looking ahead or re-writing the grammar. This cheating goes a long way to making up the difference in power. However, automatically generated LL(1) parsers don't cheat, and so are significantly less powerful (ie, LL(1) handles a smaller set of otherwise valid grammars) than automatically generated LR(1) parsers such as lrparsing.
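As an illustration of what the 1 token of lookahead buys, here is a sketch (invented names) of a grammar where two rules share a right hand side and only the following token tells them apart; an LR(0) parser would report a Reduce/Reduce conflict here:

import lrparsing
from lrparsing import Token

class NeedsLookahead(lrparsing.Grammar):
    word = Token(re="[a-z]+")
    a = word                    # a and b have identical right hand sides,
    b = word                    # so only the token after them differs
    START = a + ';' | b + '!'   # ';' follows a, '!' follows b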
state: The state, which is typically identified by a number, is how the LR(1) parser knows what it is up to. A state is called an item set by the parser generator; the name change reflects a change in the way it is used, not in what it is. Conceptually an LR(1) parser is a Deterministic Finite Automaton (DFA), which is a fancy name for a way of moving through a graph. The states are the nodes in this graph. The LR(1) parser progresses by moving from one node to another, but like all graphs this is only possible if the nodes are directly connected by a line. In the parser's graph these lines are actions. Every valid token that can be accepted as input while the parser is in a state has an associated action. In fact the only time the parser raises an error is when it knows the input stream can't match the grammar, and the only time it knows that is when the token it just read doesn't have an action associated with it. A shift action consumes the next token (pushes it onto the stack) and then moves to the next state by following the line labelled with that token. A reduce action recognises a production, and so it pops the stack to see what parent state the LR(1) parser is trying to recognise. That parent state will have a line in the graph labelled with the non-terminal on the left hand side of the production just recognised, which will take the parser to the next state. People familiar with the purest form of regular expressions will know that they can also be recognised with DFAs. An LR(1) parser is a souped-up version of a regular expression DFA that recognises the right hand sides of its productions.
parse table: The data structure that drives a parser. The parse table is an array of states, the states being identified by their array index. The parse table can be viewed as the description of the nodes and lines of a graph. The nodes of the graph are the states. The lines are the states the parser moves to when it recognises a symbol, and thus are labelled with that symbol. The lines correspond to the states' actions.
parse tree: The way a parser explains the meaning of its input is to output a parse tree. The parse tree assigns meaning to the symbols and the relationships between them. The sentence Time flies like an arrow can be parsed two ways. In one way 'Time' is an adjective describing a type of insect, and 'like' is a verb describing what 'Time flies' do to arrows. An lrparsing parse tree for that meaning might look like:
(sentence (noun_phrase (adjective 'Time') (noun 'flies')) (verb_phrase (verb 'like') (noun_phrase (determiner 'an') (noun 'arrow'))))
The alternative, where 'Time' is a noun, looks like this:
(sentence (noun 'Time') (verb_phrase (verb 'flies') (preposition_phrase (preposition 'like') (noun_phrase (determiner 'an') (noun 'arrow')))))
parser: The parser in an LR(1) system is a simple loop that processes a stream of input tuples according to a parse table, and produces a parse tree as a consequence. The same parser is used for all input streams and all parse tables. It is fast because it is simple. Its one internal data structure is the stack. The parse table is actually a graph, and the parser's sole job is to move to a new node in the graph as it recognises each new symbol, by following the line labelled with that symbol. As it goes it outputs the productions it has recognised, and these become the parse tree.
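To make the loop concrete, here is a minimal sketch of such a driver, not lrparsing's actual implementation, over hand-built tables: shift[state][token] and goto[state][non_terminal] name successor states, and reduce[state][token] names the production to fold when that token is the lookahead.

def lr_parse(tokens, shift, reduce, goto, start_state):
    # tokens is assumed to end with an end-of-input marker that the
    # tables shift last, after the final reduce to the start symbol.
    stack = [(start_state, None)]           # (state, parse tree node) pairs
    for tok in tokens:
        while tok in reduce[stack[-1][0]]:  # reduce while the lookahead says so
            lhs, rhs_len = reduce[stack[-1][0]][tok]
            kids = tuple(node for _, node in stack[len(stack) - rhs_len:])
            del stack[len(stack) - rhs_len:]
            state = goto[stack[-1][0]][lhs] # follow the reduce table line
            stack.append((state, (lhs,) + kids))
        if tok not in shift[stack[-1][0]]:  # no action: the input can't match
            raise SyntaxError("unexpected token %r" % (tok,))
        stack.append((shift[stack[-1][0]][tok], tok))
    return stack[-2][1]                     # the tree sits below the end marker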
parser generator: A parser generator takes a grammar as input and generates a parse table as output, unless the grammar is ambiguous, in which case it raises an error. It does this by constructing item sets.
production: The LR(1) name for the lines describing a Grammar that an LR(1) parser generator accepts as input. lrparsing does not use productions; it uses rules, and compiles your rules into LR(1) productions for you. You can display those productions using repr_productions(). Productions are similar to rules in that they have a left hand side which replaces the symbols on the right hand side when they are recognised. The differences between a Rule and a production are that a production can only be a Sequence, and that alternative derivations for the same non-terminal are expressed by allowing a non-terminal to have many productions.
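Reusing ExprParser from the opening example, the compiled productions can be displayed directly; note how a rule with alternatives expands into several productions:

print(ExprParser.repr_productions())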
non-terminal: A symbol that appears on the left hand side of a production. In lrparsing Grammars the Rules are its non-terminals.
reduce: A reduce is the parser's aah! moment. It happens when the parser decides it has recognised a complete production. This occurs because the symbols on the right hand side, plus the next token in the input stream, uniquely determine the non-terminal on the production's left hand side. The reduce action pops the symbols of the right hand side from the stack, and then pushes the non-terminal on the left hand side. In other words, it replaces the symbols on the stack that form the production's right hand side with the non-terminal on its left hand side. It then looks up the newly arrived non-terminal in the reduce table to determine the next state. Goto is an older name for reduce still found in the literature.
reduce table: A mapping from a non-terminal to a state. When a production is recognised by a reduce action it is replaced by the non-terminal on its left hand side. This is done by popping the symbols of the right hand side from the stack, and then pushing the non-terminal. The parser must then move to a new state, and it does that by looking up the non-terminal in the state's reduce table. Every state has a reduce table, but it will be empty if all the productions in the item set the state represents have right hand sides consisting purely of tokens. If the parse table is thought of as a graph, the reduce table describes the lines in the graph labelled by a non-terminal, ie, the lines to be followed when that non-terminal is recognised by a reduce. There is a corresponding mapping for tokens called the shift table, and between the two they describe all the lines in the graph. Goto Table is an older name for the reduce table which is still found in the literature.
right hand side: The sequence of symbols on the right hand side of a production. When these symbols are seen in the order they were given, they can be replaced on the stack (via an operation called a reduce) by the non-terminal on the left hand side.
shift: When a token is recognised as valid it is removed from the input stream and pushed onto the stack. The parser then looks up the token in the shift table to determine the new state it should be in.
shift table: The shift table maps a token to another state. Each state has its own shift table that lists every token the grammar says could legitimately come next. If the next token to be read isn't in that list a ParseError is raised, and in fact this is the only thing that triggers a ParseError. If the parse table is thought of as a graph, the shift table describes the lines in the graph followed when a particular token appears as the next item in the input stream. There must be only one such line per token, otherwise the grammar would be ambiguous. The act of following one is called a shift, and thus the shift table lists all possible shift actions that can happen while the parser is in a particular state. There is a similar table called the reduce table that describes the lines in the graph to be followed when a non-terminal is recognised. Between them, the shift tables and the reduce tables contain all the lines in the graph, and thus describe all possible transitions the parser can make between states.
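Reusing ExprParser from the opening example, a missing shift action is exactly what surfaces as a ParseError:

import lrparsing

try:
    ExprParser.parse("1 + * 2")    # no shift action for '*' after '+'
except lrparsing.ParseError as exc:
    print(exc)                     # reports where the parse failed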
stack: A data structure the LR(1) parser uses internally to keep track of where it is up to. A shift pushes the next input Token onto the stack. A reduce happens when the symbols on the right hand side of a production match the symbols on the top of the stack. When they do, a reduce replaces those matching symbols with the non-terminal on the left hand side of the production. That symbol will always (unless it is the start symbol) be part of another production that the parser was in the process of recognising, and its (probably partial) right hand side becomes what is now on the top of the stack. The stack continues downwards with symbols from partially recognised productions until, at its bottom, we find the partially recognised production of the start symbol. The symbols popped from the stack become nodes in the parse tree. Thus the stack is actually a partially constructed branch of the parse tree, with the bottom becoming the root node of the parse tree.
start symbol: The goal of the grammar is to recognise this symbol. Although an lrparsing Grammar requires a Symbol named Grammar.START, and its goal is indeed to recognise that symbol, it isn't the real start symbol. The real start symbol has a single production whose right hand side is the Grammar.START symbol followed by an internal MetaToken called __end_of_input__. This real start symbol is returned by epoch_symbol().
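Again reusing ExprParser from the opening example, the real start symbol can be inspected once the grammar has been compiled (parse() compiles it on first use):

print(ExprParser.epoch_symbol())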
symbol: In LR(1) parlance a symbol is the name for something that is recognised by the parser, ie, it is a collective name for tokens and non-terminals.
token: Typically what the parser is trying to recognise is a series of characters. The first step is to break that series of characters into tokens. In English a token is a word or a punctuation character. The parser then works on these tokens, not the individual characters.
tokeniser: The thing that turns text into a stream of Tokens.