Parsing

4 min read Updated Fri Apr 24 2026 07:36:29 GMT+0000 (Coordinated Universal Time)

Two major parsing strategies are top-down parsing (LL) and bottom-up parsing (LR).

PDAs generated from CFGs were non-deterministic. While NPDAs work in theory, computationally they blow up (in memory and time). Different techniques are used to overcome that.

Top-Down Parsing

Constructs the parse tree starting from the root (the start symbol) and works downward toward the leaves, following leftmost derivation.

Lookahead

A technique to determine the path to choose in each step of top-down parsing. In addition to the current input symbol, 1 or more next symbols are examined.

LL Grammar

Means left-to-right, left-most derived grammar.

A grammar where looking ahead kk symbols is always sufficient to choose the correct production is called an LL(k)\text{LL}(k) grammar. In practice, k=1k=1 is almost always used.

Every LL(k)\text{LL}(k) grammar is a LL(k+1)\text{LL}(k+1) grammar.

Recursive-Descent Parsing

A straightforward top-down method. Each non-terminal has a function and each production has a clause. May involve backtracking. Can be used for LL(1)\text{LL}(1) grammars.

Predictive Parsing

A simple form of recursive-descent parsing. Works when the first terminal of each sub-expression uniquely identifies which production to use. Does not use backtracking.

Fails when the grammar is left-recursive or ambiguous. To make a grammar suitable for predictive parsing, one may need to eliminate left-recursion and apply left-factoring.

Table-Driven Predictive Parsing

Non-recursive. Uses an explicit stack. Uses a parsing table M[V,T]M[V, T] where VV is a non-terminal and TT is a token, to lookup productions to apply. A grammar with no multiply-defined entries in its parsing table is an LL(1)\text{LL}(1) grammar.

If a parsing table entry has more than 1 productions, it is either ambiguous or left-recursive. Then the grammar cannot be parsed by predictive parsing.

For LL(k)\text{LL}(k) grammar, columns of the parsing table are every combination of kk terminals.

Bottom-Up Parsing

Aka. shift-reduce parsing. Builds the parse tree starting from the leaves (the input tokens) and works upward toward the root. It traces a rightmost derivation in reverse by repeatedly identifying a substring on the right-hand side of a production and reducing it to the corresponding left-hand side non-terminal.

Shift-Reduce Mechanism

The parser uses a stack and an input buffer. It performs one of 4 actions at each step:

  • Shift: push the next input token onto the stack
  • Reduce: replace a handle (a matched RHS) on top of the stack with the corresponding LHS non-terminal
  • Accept: parsing is complete when the start symbol is on the stack and input is empty
  • Error: a syntax error is detected

Conflicts

  • Shift-reduce conflict
    Unclear whether to shift or reduce. Resolved by preferring shift.
  • Reduce-reduce conflict
    Multiple valid reductions available. Resolved by choosing the earlier production in the grammar

LR Grammar

Means left-to-right right-most derived grammar. Right-most derivation is done in reverse.

LR(k)\text{LR}(k) parsing is the most general and widely used form of bottom-up parsing. Here kk is the number of lookahead tokens. Can handle a much larger class of grammars than LL(k)\text{LL}(k), making it the standard approach in compiler construction.

LR parsing uses a DFA applied to the stack contents along with a transition table to decide when to shift and when to reduce. The table has two parts — an action table and a goto table.

Can recognize virtually all programming language constructs (if CFG can be given). Most general non-backtracking shift-reduce method known, but can be implemented efficiently. Can parse all grammars parsed by LL(k)\text{LL}(k). Can detect syntax errors as soon as possible. Too tedious to do by hand. Tools like Yacc, Bison, JavaCC, and PLY generate LALR(1) parsers automatically from a grammar specification.

In practice, LR(1)\text{LR}(1) tables can be very large. A variant called LALR(1)\text{LALR}(1) (Look-Ahead LR) merges similar states to produce significantly smaller tables while still being able to parse most programming language constructs.