Parsing | Sahithyan's Notes

Two major parsing strategies are top-down parsing (LL) and bottom-up parsing (LR).

PDAs generated from CFGs were non-deterministic. While NPDAs work in theory, computationally they blow up (in memory and time). Different techniques are used to overcome that.

Top-Down Parsing

Constructs the parse tree starting from the root (the start symbol) and works downward toward the leaves, following leftmost derivation.

Lookahead

A technique to determine the path to choose in each step of top-down parsing. In addition to the current input symbol, 1 or more next symbols are examined.

LL Grammar

Means left-to-right, left-most derived grammar.

A grammar where looking ahead $k$ symbols is always sufficient to choose the correct production is called an $\text{LL}(k)$ grammar. In practice, $k=1$ is almost always used.

Every $\text{LL}(k)$ grammar is a $\text{LL}(k+1)$ grammar.

Recursive-Descent Parsing

A straightforward top-down method. Each non-terminal has a function and each production has a clause. May involve backtracking. Can be used for $\text{LL}(1)$ grammars.

Predictive Parsing

A simple form of recursive-descent parsing. Works when the first terminal of each sub-expression uniquely identifies which production to use. Does not use backtracking.

Fails when the grammar is left-recursive or ambiguous. To make a grammar suitable for predictive parsing, one may need to eliminate left-recursion and apply left-factoring.

Table-Driven Predictive Parsing

Non-recursive. Uses an explicit stack. Uses a parsing table $M[V, T]$ where $V$ is a non-terminal and $T$ is a token, to lookup productions to apply. A grammar with no multiply-defined entries in its parsing table is an $\text{LL}(1)$ grammar.

If a parsing table entry has more than 1 productions, it is either ambiguous or left-recursive. Then the grammar cannot be parsed by predictive parsing.

For $\text{LL}(k)$ grammar, columns of the parsing table are every combination of $k$ terminals.

Bottom-Up Parsing

Aka. shift-reduce parsing. Builds the parse tree starting from the leaves (the input tokens) and works upward toward the root. It traces a rightmost derivation in reverse by repeatedly identifying a substring on the right-hand side of a production and reducing it to the corresponding left-hand side non-terminal.

Shift-Reduce Mechanism

The parser uses a stack and an input buffer. It performs one of 4 actions at each step:

Shift: push the next input token onto the stack
Reduce: replace a handle (a matched RHS) on top of the stack with the corresponding LHS non-terminal
Accept: parsing is complete when the start symbol is on the stack and input is empty
Error: a syntax error is detected

Conflicts

Shift-reduce conflict
Unclear whether to shift or reduce. Resolved by preferring shift.
Reduce-reduce conflict
Multiple valid reductions available. Resolved by choosing the earlier production in the grammar

LR Grammar

Means left-to-right right-most derived grammar. Right-most derivation is done in reverse.

$\text{LR}(k)$ parsing is the most general and widely used form of bottom-up parsing. Here $k$ is the number of lookahead tokens. Can handle a much larger class of grammars than $\text{LL}(k)$ , making it the standard approach in compiler construction.

LR parsing uses a DFA applied to the stack contents along with a transition table to decide when to shift and when to reduce. The table has two parts — an action table and a goto table.

Can recognize virtually all programming language constructs (if CFG can be given). Most general non-backtracking shift-reduce method known, but can be implemented efficiently. Can parse all grammars parsed by $\text{LL}(k)$ . Can detect syntax errors as soon as possible. Too tedious to do by hand. Tools like Yacc, Bison, JavaCC, and PLY generate LALR(1) parsers automatically from a grammar specification.

In practice, $\text{LR}(1)$ tables can be very large. A variant called $\text{LALR}(1)$ (Look-Ahead LR) merges similar states to produce significantly smaller tables while still being able to parse most programming language constructs.