Two major parsing strategies are top-down parsing (LL) and bottom-up parsing (LR).
PDAs generated from CFGs were non-deterministic. While NPDAs work in theory, computationally they blow up (in memory and time). Different techniques are used to overcome that.
Top-Down Parsing
Constructs the parse tree starting from the root (the start symbol) and works downward toward the leaves, following leftmost derivation.
Lookahead
A technique to determine the path to choose in each step of top-down parsing. In addition to the current input symbol, 1 or more next symbols are examined.
LL Grammar
Means left-to-right, left-most derived grammar.
A grammar where looking ahead symbols is always sufficient to choose the correct production is called an grammar. In practice, is almost always used.
Every grammar is a grammar.
Recursive-Descent Parsing
A straightforward top-down method. Each non-terminal has a function and each production has a clause. May involve backtracking. Can be used for grammars.
Predictive Parsing
A simple form of recursive-descent parsing. Works when the first terminal of each sub-expression uniquely identifies which production to use. Does not use backtracking.
Fails when the grammar is left-recursive or ambiguous. To make a grammar suitable for predictive parsing, one may need to eliminate left-recursion and apply left-factoring.
Table-Driven Predictive Parsing
Non-recursive. Uses an explicit stack. Uses a parsing table where is a non-terminal and is a token, to lookup productions to apply. A grammar with no multiply-defined entries in its parsing table is an grammar.
If a parsing table entry has more than 1 productions, it is either ambiguous or left-recursive. Then the grammar cannot be parsed by predictive parsing.
For grammar, columns of the parsing table are every combination of terminals.
Bottom-Up Parsing
Aka. shift-reduce parsing. Builds the parse tree starting from the leaves (the input tokens) and works upward toward the root. It traces a rightmost derivation in reverse by repeatedly identifying a substring on the right-hand side of a production and reducing it to the corresponding left-hand side non-terminal.
Shift-Reduce Mechanism
The parser uses a stack and an input buffer. It performs one of 4 actions at each step:
- Shift: push the next input token onto the stack
- Reduce: replace a handle (a matched RHS) on top of the stack with the corresponding LHS non-terminal
- Accept: parsing is complete when the start symbol is on the stack and input is empty
- Error: a syntax error is detected
Conflicts
- Shift-reduce conflict
Unclear whether to shift or reduce. Resolved by preferring shift. - Reduce-reduce conflict
Multiple valid reductions available. Resolved by choosing the earlier production in the grammar
LR Grammar
Means left-to-right right-most derived grammar. Right-most derivation is done in reverse.
parsing is the most general and widely used form of bottom-up parsing. Here is the number of lookahead tokens. Can handle a much larger class of grammars than , making it the standard approach in compiler construction.
LR parsing uses a DFA applied to the stack contents along with a transition table to decide when to shift and when to reduce. The table has two parts — an action table and a goto table.
Can recognize virtually all programming language constructs (if CFG can be given). Most general non-backtracking shift-reduce method known, but can be implemented efficiently. Can parse all grammars parsed by . Can detect syntax errors as soon as possible. Too tedious to do by hand. Tools like Yacc, Bison, JavaCC, and PLY generate LALR(1) parsers automatically from a grammar specification.
In practice, tables can be very large. A variant called (Look-Ahead LR) merges similar states to produce significantly smaller tables while still being able to parse most programming language constructs.