On Wed, 21 Aug 2002 Aahz wrote:
> On Wed, Aug 21, 2002, Skip Montanaro wrote:
> > parsing != tokenizing. ;-)
> > Regular expressions are great for tokenizing (most of the time).
>
> Ah. Here we see one of the little drawbacks of not finishing my CS
> degree. ;-) Can someone suggest a good simple reference on the
> distinctions between parsing / lexing / tokenizing, particularly in
> the context of general string processing (e.g. XML) rather than the
> arcane art of compiler technology?
>
> Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/

It's been 8 or 9 years since I took a compiler design class, so this info is probably WRONG, but as luck would have it I've been reviewing some of this stuff lately, so I can feel some of the old neuron paths warming up.

Basically, the distinction between a lexer and a parser comes down to the complexity of the symbol inputs they can recognize. A lexer (AKA tokenizer) is modeled by a finite state machine (FSM). An FSM has no stack and no memory, just a current state, so it is no good for anything that requires nested structure.

A parser recognizes more complex structures. Parsers are good for things like XML and source code, where you have NESTED tree structures (familiarly known as SYNTAX). If you have to remember how many levels deep you are in something, then you need a parser. Technically, a parser recognizes a language that can be defined by a context-free grammar, and such a language can be recognized by a Push Down Automaton (PDA). A PDA is an FSM with memory: it has at least one stack. The "context-free" in the grammar's name means that you can unambiguously recognize any section of the input stream no matter what came earlier in the stream.

...

Sometimes real grammars are a little dirty and context does matter, but only within a small window. That means you might have to "look ahead" (or behind) a few symbols to eliminate some ambiguity. The amount you have to look ahead sets the complexity of the grammar. This style of parsing is called LALR (look-ahead LR). A simple grammar with no look-ahead is called LALR(0); a slightly more complex grammar that requires one symbol of look-ahead is called LALR(1). We like most parsing to be simple. I think languages like C and Python are LALR(0). I think FORTRAN does require a look-ahead, so it's LALR(1). I have no idea what it must take to parse Perl. [Again, note: I'm sure I got some details wrong here.]

If you go further up in complexity and you want to handle evaluating expressions, then you need a Turing Machine (TM). These are problems where a context-free grammar cannot be rescued by a look-ahead. Another way to think about it: if your input is so complex that it must be described algorithmically, then you need a TM. For example, neither an FSM nor a PDA can recognize an irrational number like SQRT(2) or pi. Nor can they recognize the truth of an expression like "2*5=10" (although a PDA can recognize that it's a valid expression).

RegExes are hybrids of FSMs and PDAs and are fine for ad hoc lexing, but they are not very good for parsing. I think older-style RegExes started life as pure FSMs, but newer flavors made popular by Perl added memory and became PDAs... or something like that. But somehow they are limited, and not quite as powerful as a real PDA, so they can't parse.

Traditionally, C programmers used a combination of LEX and YACC (or the GNU versions, Flex and Bison) to build parsers.
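To make the lexing half of this concrete, here is a minimal sketch of regex-based tokenizing in Python. The token names and patterns are made up for illustration; this is a toy, not anyone's real lexer:

    import re

    # Each alternative is a named group; the group name becomes the
    # token type. This toy language has numbers, names, operators,
    # and parentheses.
    TOKEN_PATTERN = re.compile(r"""
          (?P<NUMBER> \d+)
        | (?P<NAME>   [A-Za-z_]\w*)
        | (?P<OP>     [+\-*/=])
        | (?P<LPAREN> \()
        | (?P<RPAREN> \))
        | (?P<SKIP>   \s+)
    """, re.VERBOSE)

    def tokenize(text):
        """Yield (type, value) pairs. Note that no stack is needed --
        just the current match position -- which is why an FSM is
        enough for this job."""
        for match in TOKEN_PATTERN.finditer(text):
            if match.lastgroup != "SKIP":
                yield (match.lastgroup, match.group())

    print(list(tokenize("2 * (5 + x)")))
    # [('NUMBER', '2'), ('OP', '*'), ('LPAREN', '('), ...]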
You really only need YACC, but the problem is so much simpler if the input stream is tokenized before you try to parse it, which is why you also use LEX.

Hopefully that captures the essence, if not the actual facts. But then I'm trying to compress a year of computer science study into this email. :-)

If you are interested, I just wrote a little PDA class for Python.
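I won't paste the whole class here, but just to show the stack idea, here is a stripped-down sketch of a PDA-style recognizer (something I cooked up for this mail as an illustration, not the actual class):

    class PDA:
        """A finite state machine plus a stack -- enough to recognize
        the nested (context-free) structure a plain FSM cannot."""

        PAIRS = {')': '(', ']': '[', '>': '<'}

        def matches(self, text):
            """Return True if every bracket in text is properly nested.
            The stack remembers how many levels deep we are -- exactly
            the memory an FSM lacks."""
            stack = []
            for ch in text:
                if ch in self.PAIRS.values():   # an opener: push it
                    stack.append(ch)
                elif ch in self.PAIRS:          # a closer: must match top
                    if not stack or stack.pop() != self.PAIRS[ch]:
                        return False
            return not stack                    # everything closed?

    pda = PDA()
    print(pda.matches("(a [b] <c>)"))   # True
    print(pda.matches("(a [b)]"))       # False -- crossed nesting

Yours,
Noah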