CPython Grammar: From Beginner to Advanced¶
Overview¶
CPython's grammar is the formal specification of what constitutes valid Python
syntax. Everything from if x: to a complex decorator expression is defined in
a single file — Grammar/python.gram — using a formalism called PEG
(Parsing Expression Grammar). When you write Python code, CPython reads that
file's compiled output to decide whether your source text is valid and how to
turn it into an Abstract Syntax Tree (AST).
This tutorial walks through the entire pipeline, from the grammar file itself to the C code that drives the parser at runtime.
flowchart TD
A["`**Grammar/python.gram**`"] -->|pegen — Tools/peg_generator| B
subgraph build["Build time"]
B["`Parser/parser.c
Include/internal/pycore_ast.h`"]
B -->|compilation| C[libpython\nparser embedded in interpreter]
end
subgraph runtime["Runtime"]
D[Source text] --> E[Tokenizer]
E --> F[Parser]
F --> G[AST]
G --> H[Compiler]
H --> I[Bytecode]
end
C --> runtime
Part 1 — Beginner: What is the grammar file?¶
The file: Grammar/python.gram¶
Every Python release ships one authoritative grammar file at
Grammar/python.gram in the CPython source tree. It is written in
PEG (Parsing Expression Grammar) notation using the pegen meta-grammar.
Open it and the first lines you see are:
@header """
#include "Python.h"
#include "pycore_ast.h"
"""
@subheader ""
@trailer """
...
"""
start[mod_ty]: a=statements NEWLINE* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
The annotations in curly braces ({ ... }) are C action code — they specify
what AST node to build when that rule matches.
Key vocabulary¶
| Term | Meaning |
|---|---|
| Rule | A named grammar production, e.g. if_stmt |
| Alternative | One of the pipe-separated (`\|`) sequences a rule may match; tried in order |
| Action | The C expression in { } that produces an AST node |
| Token | An atomic input unit produced by the tokenizer (NAME, NUMBER, NEWLINE, etc.) |
| Cut (~) | Commits the parser to the current alternative; disables backtracking |
The simplest rule¶
In python.gram the rule for pass is essentially a one-liner: `pass_stmt[stmt_ty]: 'pass' { _PyAST_Pass(EXTRA) }`. Reading this: if the parser sees the literal token pass, it matches pass_stmt and builds a Pass AST node. EXTRA is a macro that expands to the source-location arguments (line and column offsets) that every AST node constructor takes.
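You can confirm the mapping from the Python side: parsing just pass yields a module whose body is a single Pass node.

```python
import ast

# `pass` parses to exactly one Pass statement node.
tree = ast.parse("pass")
print(ast.dump(tree))  # Module(body=[Pass()], ...)
```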
Part 2 — Beginner: The tokenizer feeds the parser¶
The grammar does not operate on raw characters. CPython first runs the
source text through a tokenizer (Parser/tokenizer.c in C, mirrored by
Lib/tokenize.py in pure Python) that emits a stream of typed tokens. The
grammar rules then match against those tokens, not characters.
Common token types¶
| Token | Example input |
|---|---|
| NAME | foo, if, True |
| NUMBER | 42, 3.14, 0xFF |
| STRING | "hello", b'\x00' |
| NEWLINE | End of a logical line |
| INDENT / DEDENT | Indentation changes |
| OP | +, ->, **= |
| ENDMARKER | End of file |
Keywords such as if, while, and return are NAME tokens that the grammar
matches as literal strings like 'if' — there is no separate keyword token
type.
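This is easy to check with the tokenize module: the keyword arrives as an ordinary NAME token, indistinguishable by type from an identifier.

```python
import io
import tokenize

# Tokenize a line that starts with the keyword `if`.
toks = list(tokenize.generate_tokens(io.StringIO("if x:\n    pass\n").readline))

# Both the keyword and the identifier are NAME tokens; only the
# grammar distinguishes them, by matching the literal string 'if'.
assert toks[0].type == tokenize.NAME and toks[0].string == "if"
assert toks[1].type == tokenize.NAME and toks[1].string == "x"
```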
How to explore tokens yourself¶
import tokenize, io
src = "x = 1 + 2\n"
tokens = tokenize.generate_tokens(io.StringIO(src).readline)
for tok in tokens:
print(tok)
Output:
TokenInfo(type=1 (NAME), string='x', ...)
TokenInfo(type=54 (OP), string='=', ...)
TokenInfo(type=2 (NUMBER), string='1', ...)
TokenInfo(type=54 (OP), string='+', ...)
TokenInfo(type=2 (NUMBER), string='2', ...)
TokenInfo(type=4 (NEWLINE), string='\n', ...)
TokenInfo(type=0 (ENDMARKER), string='', ...)
Part 3 — Intermediate: PEG grammar syntax in depth¶
Rule syntax¶
A rule has the general shape `rule_name[return_type]: | alternative1 { action1 } | alternative2 { action2 }`. The return_type is the C type of the value that the action returns. For statement rules it is typically stmt_ty; for expressions, expr_ty.
Operators and groupings¶
| Syntax | Meaning |
|---|---|
| e1 e2 | Sequence: match e1, then e2 |
| e1 \| e2 | Ordered choice: try e1; if it fails, try e2 |
| e? | Optional: match zero or one e |
| e* | Repeat: match zero or more e |
| e+ | Repeat: match one or more e |
| &e | Positive lookahead: succeed if e would match, consume nothing |
| !e | Negative lookahead: succeed if e would NOT match, consume nothing |
| ~ | Cut: commit to this alternative; no backtracking past this point |
| [e] | Same as e? |
Named captures¶
Rules can bind matched sub-expressions to local variables for use in actions:
if_stmt[stmt_ty]:
| 'if' a=named_expr ':' b=block c=elif_stmt
{ _PyAST_If(a, b, CHECK(asdl_stmt_seq*, _PyPegen_singleton_seq(p, c)), EXTRA) }
| 'if' a=named_expr ':' b=block c=[else_block]
{ _PyAST_If(a, b, c, EXTRA) }
Here a, b, c capture the results of each sub-rule.
A worked example: for statement¶
for_stmt[stmt_ty]:
| ASYNC 'for' t=star_targets 'in' ex=star_expressions ':' tc=[type_comment]
b=block el=[else_block]
{ _PyAST_AsyncFor(t, ex, b, el, NEW_TYPE_COMMENT(p, tc), EXTRA) }
| 'for' t=star_targets 'in' ex=star_expressions ':' tc=[type_comment]
b=block el=[else_block]
{ _PyAST_For(t, ex, b, el, NEW_TYPE_COMMENT(p, tc), EXTRA) }
Two alternatives: one for async for, one for plain for. Both capture the
target (t), iterable (ex), body (b), optional else (el), and optional
type comment (tc).
Operator precedence via rule layering¶
PEG grammars encode operator precedence through rule nesting, not a precedence table. Lower-precedence operators sit in rules higher up the call chain; higher-precedence operators appear in rules deeper in the chain.
expression
└─ disjunction (or)
└─ conjunction (and)
└─ inversion (not)
└─ comparison (==, <, in, is, ...)
└─ bitwise_or (|)
└─ bitwise_xor (^)
└─ bitwise_and (&)
└─ shift_expr (<<, >>)
└─ sum (+, -)
└─ term (*, /, //, %, @)
└─ factor (unary +, -, ~)
└─ power (**)
└─ await_primary
└─ primary
└─ atom
The atom rule is where literals, names, and parenthesised expressions live.
Because power calls await_primary which calls primary which calls atom,
** binds tighter than *, which binds tighter than +, and so on.
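The layering is visible in the AST itself: in 1 + 2 * 3 ** 4 the lowest-precedence operator ends up at the root and the highest-precedence one at the deepest node. A quick check using only the stdlib ast module:

```python
import ast

# Parse an expression that mixes three precedence levels.
tree = ast.parse("1 + 2 * 3 ** 4", mode="eval")

# The root is Add (parsed by `sum`, highest in the call chain),
# its right child is Mult (`term`), and the deepest node is Pow
# (`power`): deeper in the rule chain means tighter binding.
assert isinstance(tree.body.op, ast.Add)
assert isinstance(tree.body.right.op, ast.Mult)
assert isinstance(tree.body.right.right.op, ast.Pow)
```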
Part 4 — Intermediate: How pegen generates Parser/parser.c¶
The generator¶
CPython ships its own PEG parser generator called pegen, located at
Tools/peg_generator/. When you modify Grammar/python.gram, you must
regenerate the C parser:
make regen-pegen
# which, under the hood, runs roughly:
PYTHONPATH=Tools/peg_generator python -m pegen -q c \
    Grammar/python.gram Grammar/Tokens -o Parser/parser.c
pegen reads python.gram and emits Parser/parser.c — tens of thousands of
lines of C — where every grammar rule becomes a C function.
Rule → C function mapping¶
Each grammar rule foo_rule in python.gram becomes a C function:
// Generated in Parser/parser.c
static stmt_ty
if_stmt_rule(Parser *p)
{
...
// Alternative 1
{ // 'if' named_expr ':' block elif_stmt
Token * _keyword;
Token * _literal;
expr_ty a;
asdl_stmt_seq* b;
stmt_ty c;
if (
(_keyword = _PyPegen_expect_token(p, 510)) // 'if'
&&
(a = named_expr_rule(p))
&&
(_literal = _PyPegen_expect_token(p, 11)) // ':'
&&
(b = block_rule(p))
&&
(c = elif_stmt_rule(p))
)
{
res = _PyAST_If(a, b,
CHECK(asdl_stmt_seq *, _PyPegen_singleton_seq(p, c)),
EXTRA);
goto done;
}
p->mark = _mark; // backtrack
}
...
}
Key patterns in the generated code:
- `p->mark` saves the current token position before an alternative is tried; `p->mark = _mark` restores that position when the alternative fails (backtracking).
- A cut (`~`) sets a cut flag that stops the parser from trying any later alternatives of the rule once the cut point has been passed.
- Memoization: rule results are cached in a memo table keyed by (rule_id, position).
The Parser struct¶
The generated parser functions all take a Parser *p pointer. Key fields:
// Parser/pegen.h (simplified)
typedef struct {
struct tok_state *tok; // tokenizer state
Token **tokens; // token lookahead buffer
int mark; // current position
int fill; // how many tokens are buffered
PyArena *arena; // memory arena for AST nodes
_PyPegen_Memo *memo; // memoization table
int error_indicator; // set on parse error
int level; // recursion depth counter
int flags; // parser flags
} Parser;
Part 5 — Advanced: Memoization and packrat parsing¶
CPython's PEG parser is a packrat parser — it memoizes every rule result at every token position to guarantee linear-time parsing even with backtracking.
How memoization works¶
Before calling any rule function, the generated code checks a memo table:
// Pattern in generated parser functions
if (p->level++ == MAXSTACK) { _Pypegen_stack_overflow(p); }
// Check memo
{ stmt_ty _res = NULL;
  if (_PyPegen_is_memoized(p, if_stmt_type, &_res))
      return _res;
  ...
  // Store the result before returning
  _PyPegen_insert_memo(p, _mark, if_stmt_type, _res);
}
The memo key is (rule_type, token_position). This means each rule is called
at most once per token position, bounding total work to O(n × R) where n is
the token count and R is the number of rules.
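The idea can be sketched with a toy parser in Python (an illustration only; nothing here is CPython code). The grammar is expr <- number '+' number / number, and the memo lets the second alternative reuse the first alternative's work after backtracking:

```python
# Toy packrat parsing: cache rule results by (input, position).
calls = {"number": 0}

def memoize(rule):
    cache = {}
    def wrapped(tokens, pos):
        key = (tuple(tokens), pos)
        if key not in cache:
            cache[key] = rule(tokens, pos)
        return cache[key]
    return wrapped

@memoize
def number(tokens, pos):
    calls["number"] += 1  # count real (non-cached) evaluations
    if pos < len(tokens) and tokens[pos].isdigit():
        return int(tokens[pos]), pos + 1  # success: (value, new position)
    return None                           # failure

def expr(tokens, pos):
    # Alternative 1: number '+' number
    a = number(tokens, pos)
    if a is not None and a[1] < len(tokens) and tokens[a[1]] == "+":
        b = number(tokens, a[1] + 1)
        if b is not None:
            return a[0] + b[0], b[1]
    # Alternative 2: bare number. Backtracking to `pos` re-asks the
    # memo, so `number` is NOT evaluated a second time.
    return number(tokens, pos)

print(expr(["7"], 0))    # -> (7, 1)
print(calls["number"])   # -> 1: the memo absorbed the re-parse
```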
Left recursion¶
Standard PEG parsers cannot handle left-recursive rules. CPython's pegen
supports them via a seed-and-grow algorithm:
# Left-recursive: primary calls itself on the left
primary[expr_ty]:
| primary '.' NAME
| primary '[' slices ']'
| primary genexp
| primary '(' [arguments] ')'
| atom
The algorithm:
1. Seed the memo for primary at the current position with a failure result.
2. Call primary; the left-recursive call is answered from the seed (failure, or the previous best match).
3. If a longer match is found, update the memo and re-run the rule body.
4. Repeat until no further growth is possible.
This seed-and-grow loop is emitted directly into the generated rule
functions in Parser/parser.c (left-recursive rules get a `_raw` helper
plus a growing wrapper), with memo helpers such as _PyPegen_update_memo
living in Parser/pegen.c.
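Here is a toy Python model of the loop (again an illustration, not CPython's code) for the grammar primary <- primary '.' NAME / NAME:

```python
# Seed-and-grow: answer the left-recursive call from a memoized
# "seed", then keep re-running the rule while the match grows.
def parse_primary(tokens, pos):
    seed = None  # memoized result: (value, end position) or None
    while True:
        result = primary_once(tokens, pos, seed)
        if result is None or (seed is not None and result[1] <= seed[1]):
            return seed  # no growth: the previous best match wins
        seed = result    # grew: feed the longer match back in

def primary_once(tokens, pos, seed):
    # Alternative 1: primary '.' NAME, with the left-recursive call
    # answered from the seed instead of recursing.
    if seed is not None:
        value, end = seed
        if end + 1 < len(tokens) and tokens[end] == ".":
            return f"Attr({value}, {tokens[end + 1]})", end + 2
    # Alternative 2: NAME
    if pos < len(tokens) and tokens[pos].isidentifier():
        return tokens[pos], pos + 1
    return None

print(parse_primary(["a", ".", "b", ".", "c"], 0))
# -> ('Attr(Attr(a, b), c)', 5): left-associative, as in a.b.c
```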
Part 6 — Advanced: The AST¶
What the parser produces¶
The parser's output is an Abstract Syntax Tree — a tree of C structs defined
by Parser/Python.asdl (the ASDL schema file) and generated into
Include/internal/pycore_ast.h.
Parser/Python.asdl¶
ASDL (Zephyr Abstract Syntax Description Language) defines the AST node types:
-- Python.asdl (excerpts)
module Python
{
mod = Module(stmt* body, type_ignore* type_ignores)
| Interactive(stmt* body)
| Expression(expr body)
| FunctionType(expr* argtypes, expr returns)
stmt = FunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list,
expr? returns, string? type_comment)
| ...
| If(expr test, stmt* body, stmt* orelse)
| For(expr target, expr iter, stmt* body, stmt* orelse,
string? type_comment)
| ...
expr = BoolOp(boolop op, expr* values)
| BinOp(expr left, operator op, expr right)
| ...
| Name(identifier id, expr_context ctx)
| Constant(constant value, string? kind)
| ...
}
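The same schema drives the Python-level ast module: each ASDL constructor becomes a class, and its fields surface, in order, as _fields.

```python
import ast

# ASDL: If(expr test, stmt* body, stmt* orelse)
assert ast.If._fields == ("test", "body", "orelse")

# ASDL: BinOp(expr left, operator op, expr right)
assert ast.BinOp._fields == ("left", "op", "right")

# ASDL: Constant(constant value, string? kind)
assert ast.Constant._fields == ("value", "kind")
```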
Inspecting the AST in Python¶
The ast module exposes the full tree:
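For example, parsing a small function and pretty-printing the tree (the dump shown below is abbreviated with "..."):

```python
import ast

src = """
def add(x, y):
    return x + y
"""
# indent=4 pretty-prints the nested node structure
print(ast.dump(ast.parse(src), indent=4))
```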
Output:
Module(
body=[
FunctionDef(
name='add',
args=arguments(
posonlyargs=[],
args=[arg(arg='x'), arg(arg='y')],
...),
body=[
Return(
value=BinOp(
left=Name(id='x', ctx=Load()),
op=Add(),
right=Name(id='y', ctx=Load())))],
...)],
...)
Walking and transforming the AST¶
import ast
class ConstantFolder(ast.NodeTransformer):
"""Fold BinOp(Constant, Add, Constant) → Constant at compile time."""
def visit_BinOp(self, node):
self.generic_visit(node)
if (isinstance(node.left, ast.Constant)
and isinstance(node.right, ast.Constant)
and isinstance(node.op, ast.Add)):
return ast.Constant(value=node.left.value + node.right.value)
return node
src = "x = 1 + 2"
tree = ast.parse(src)
new_tree = ConstantFolder().visit(tree)
ast.fix_missing_locations(new_tree)
print(ast.dump(new_tree, indent=2))
# BinOp is gone; Constant(value=3) appears directly
Part 7 — Advanced: Modifying the grammar¶
Adding a new keyword or statement¶
Suppose you want to add a repeat N: statement (executes body N times).
Step 1 — Add to Grammar/python.gram:
Add it in the compound_stmt alternatives:
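A sketch of what the rule could look like. Everything here is hypothetical: repeat_stmt, _PyAST_Repeat, and the Repeat node do not exist in CPython and are invented for this walkthrough.

```
repeat_stmt[stmt_ty]:
    | 'repeat' n=expression ':' b=block { _PyAST_Repeat(n, b, EXTRA) }
```

You would also list `| repeat_stmt` among the compound_stmt alternatives. pegen derives the keyword table from the quoted literals in the grammar, so 'repeat' becomes a keyword automatically.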
Step 2 — Add the AST node to Parser/Python.asdl:
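A hypothetical constructor in the stmt sum type (the Repeat name is invented for this example):

```
stmt = ...
     | Repeat(expr count, stmt* body)
     | ...
```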
Step 3 — Regenerate the AST C headers by running `make regen-ast`, which rebuilds Include/internal/pycore_ast.h (and Python/Python-ast.c) from the ASDL schema.
Step 4 — Implement the AST node in Python/ast.c (validation) and
Python/compile.c (bytecode emission):
// Python/compile.c (simplified)
case Repeat_kind:
return compiler_repeat(c, s);
static int
compiler_repeat(struct compiler *c, stmt_ty s)
{
// emit bytecode: evaluate count, loop body count times
...
}
Step 5 — Regenerate the parser with `make regen-pegen`.
Step 6 — Build (`make -j`) and exercise the new statement with the freshly built `./python`.
Adding a new operator¶
To add a <> not-equal operator (it existed in Python 2):
Step 1 — Add to Grammar/python.gram in the compare_op_bitwise_or_pair
rule:
compare_op_bitwise_or_pair[CmpopExprPair*]:
| ...
| ne_bitwise_or
ne_bitwise_or[CmpopExprPair*]:
| '<>' a=bitwise_or
{ _PyPegen_cmpop_expr_pair(p, NotEq, a) }
Step 2 — Ensure <> reaches the parser as a single token. The C tokenizer
(Parser/tokenizer.c, with the two-character operator table in
Parser/token.c) already emits <> as NOTEQUAL, so no tokenizer change is
needed.
Step 3 — Regenerate and build.
Part 8 — Advanced: Debugging grammar issues¶
Debug builds and parser tracing¶
Build CPython with `./configure --with-pydebug` so the generated parser's tracing code is compiled in. Then run the interpreter with the `-d` flag, or set `PYTHONDEBUG=1` in the environment:
This prints each rule call and whether it succeeded or failed, letting you trace exactly how the grammar matches a piece of source.
Using ast.parse() to validate grammar changes¶
import ast
test_cases = [
"x = 1",
"if x: pass",
"for i in range(10): pass",
"lambda x, y: x + y",
"[x for x in range(10) if x % 2 == 0]",
]
for src in test_cases:
try:
ast.parse(src)
print(f"OK: {src!r}")
except SyntaxError as e:
print(f"FAIL: {src!r} → {e}")
Inspecting Parser/parser.c diffs¶
After regenerating the parser, inspect the diff (for example with `git diff Parser/parser.c`) to verify that your grammar change compiled into the C functions you expect.
Part 9 — Reference: Key files¶
| File | Role |
|---|---|
| Grammar/python.gram | Authoritative PEG grammar source |
| Parser/Python.asdl | AST node type definitions (ASDL schema) |
| Parser/parser.c | Generated PEG parser (do not edit by hand) |
| Include/internal/pycore_ast.h | Generated AST C struct declarations |
| Python/ast.c | AST validation and helper functions |
| Python/compile.c | AST → bytecode compiler |
| Python/symtable.c | Symbol table construction from AST |
| Parser/tokenizer.c | Tokenizer (C implementation) |
| Lib/tokenize.py | Tokenizer (Python implementation, used by tools) |
| Lib/ast.py | Python-level AST utilities |
| Tools/peg_generator/ | pegen meta-grammar and generator scripts |
| Makefile.pre.in | regen-pegen and regen-ast make targets |
Part 10 — Reference: Useful make targets¶
# Regenerate Parser/parser.c from Grammar/python.gram
make regen-pegen
# Regenerate AST headers from Parser/Python.asdl
make regen-ast
# Regenerate everything grammar-related in one shot
make regen-all
# Verify that the committed generated files match the grammar source
# (CI does the equivalent: regenerate, then fail if the tree is dirty)
make regen-all && git diff --exit-code
Reference: The full source-to-bytecode pipeline¶
flowchart TD
A["📄 Source text\n<i>.py file</i>"]
B["🔡 Tokenizer\n<code>Parser/tokenizer.c</code>\nProduces token stream:\nNAME · OP · NUMBER · NEWLINE · INDENT · DEDENT"]
C["🌳 PEG Parser\n<code>Parser/parser.c</code>\nGenerated from <code>Grammar/python.gram</code>\nProduces: Abstract Syntax Tree"]
D["✅ AST Validator\n<code>Python/ast.c</code>\nChecks semantic rules not expressible in PEG\ne.g. no duplicate arg names · no break outside loop"]
E["🔍 Symbol Table\n<code>Python/symtable.c</code>\nResolves scope of every name:\nlocal · free · global · cell"]
F["⚙️ Compiler\n<code>Python/compile.c</code>\nWalks AST, emits CPython bytecode instructions"]
G["✨ Peephole Optimizer\n<code>Python/flowgraph.c</code>\nConstant folding · dead-code elimination · jump threading"]
H["📦 Code Object\n<code>PyCodeObject</code>\nco_code · co_consts · co_varnames · co_filename · co_firstlineno"]
I["▶️ Execution\n<code>Python/ceval.c</code>\nThe eval loop"]
A --> B --> C --> D --> E --> F --> G --> H --> I