
CPython Grammar: From Beginner to Advanced

Overview

CPython's grammar is the formal specification of what constitutes valid Python syntax. Everything from if x: to a complex decorator expression is defined in a single file — Grammar/python.gram — using a formalism called PEG (Parsing Expression Grammar). When you write Python code, CPython reads that file's compiled output to decide whether your source text is valid and how to turn it into an Abstract Syntax Tree (AST).

This tutorial walks through the entire pipeline, from the grammar file itself to the C code that drives the parser at runtime.

flowchart TD
    A["`**Grammar/python.gram**`"] -->|"pegen (Tools/peg_generator)"| B

    subgraph build["Build time"]
        B["`Parser/parser.c
        Include/internal/pycore_ast.h`"]
        B -->|compilation| C[libpython\nparser embedded in interpreter]
    end

    subgraph runtime["Runtime"]
        D[Source text] --> E[Tokenizer]
        E --> F[Parser]
        F --> G[AST]
        G --> H[Compiler]
        H --> I[Bytecode]
    end

    C --> runtime

Part 1 — Beginner: What is the grammar file?

The file: Grammar/python.gram

Every Python release ships one authoritative grammar file at Grammar/python.gram in the CPython source tree. It is written in PEG (Parsing Expression Grammar) notation using the pegen meta-grammar.

Open it and the first lines you see are:

@header """
#include "Python.h"
#include "pycore_ast.h"
"""

@subheader ""

@trailer """
...
"""

file[mod_ty]: a=[statements] ENDMARKER { _PyPegen_make_module(p, a) }

The annotations in curly braces ({ ... }) are C action code — they specify what AST node to build when that rule matches.

Key vocabulary

Term Meaning
Rule A named grammar production, e.g. if_stmt
Alternative One of several `|`-separated options within a rule
Action The C expression in { } that produces an AST node
Token An atomic input unit produced by the tokenizer (NAME, NUMBER, NEWLINE, etc.)
Cut (~) Commits the parser to the current alternative; disables backtracking

The simplest rule

pass_stmt[stmt_ty]:
    | 'pass' { _PyAST_Pass(EXTRA) }

Reading this: if the parser sees the literal token pass, it matches pass_stmt and builds a Pass AST node. EXTRA is a macro that expands to source location arguments.
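You can see this rule's effect from Python: parsing a lone pass statement yields exactly the Pass node the action builds, complete with the source locations that EXTRA supplies.

```python
import ast

# The pass_stmt rule's action, _PyAST_Pass(EXTRA), surfaces as an ast.Pass
# node carrying the location data that EXTRA expands to.
tree = ast.parse("pass")
stmt = tree.body[0]
print(type(stmt).__name__)           # Pass
print(stmt.lineno, stmt.col_offset)  # 1 0
```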


Part 2 — Beginner: The tokenizer feeds the parser

The grammar does not operate on raw characters. CPython first runs the source text through a tokenizer (Parser/tokenizer.c in C, with a pure-Python counterpart in Lib/tokenize.py) that emits a stream of typed tokens. The grammar rules then match against those tokens, not characters.

Common token types

Token Example input
NAME foo, if, True
NUMBER 42, 3.14, 0xFF
STRING "hello", b'\x00'
NEWLINE End of a logical line
INDENT / DEDENT Indentation changes
OP +, ->, **=
ENDMARKER End of file

Keywords such as if, while, and return are NAME tokens that the grammar matches as literal strings like 'if' — there is no separate keyword token type.

How to explore tokens yourself

import tokenize, io

src = "x = 1 + 2\n"
tokens = tokenize.generate_tokens(io.StringIO(src).readline)
for tok in tokens:
    print(tok)

Output:

TokenInfo(type=1 (NAME),   string='x',    ...)
TokenInfo(type=54 (OP),    string='=',    ...)
TokenInfo(type=2 (NUMBER), string='1',    ...)
TokenInfo(type=54 (OP),    string='+',    ...)
TokenInfo(type=2 (NUMBER), string='2',    ...)
TokenInfo(type=4 (NEWLINE),string='\n',   ...)
TokenInfo(type=0 (ENDMARKER), string='',  ...)

Part 3 — Intermediate: PEG grammar syntax in depth

Rule syntax

rule_name[return_type]: alternative1 | alternative2 | ...

The return_type is the C type of the value that the action returns. For statement rules it is typically stmt_ty; for expressions, expr_ty.

Operators and groupings

Syntax Meaning
e1 e2 Sequence: match e1 then e2
e1 | e2 Ordered choice: try e1, if it fails try e2
e? Optional: match zero or one e
e* Repeat: match zero or more e
e+ Repeat: match one or more e
&e Lookahead: succeed if e would match, consume nothing
!e Negative lookahead: succeed if e would NOT match
~ Cut: commit to this alternative, no backtracking past this point
[e] Same as e?
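The two operators that most distinguish PEG from context-free grammars are ordered choice and backtracking. The toy sketch below (plain Python, not pegen's code) shows both: each alternative is tried in order, and a failed alternative rewinds the position before the next one runs.

```python
# Toy illustration of PEG sequence and ordered choice on a token list.
def match_seq(parsers):
    """Sequence: each parser must match in turn."""
    def parse(tokens, pos):
        results = []
        for p in parsers:
            node, pos = p(tokens, pos)
            if node is None:
                return None, pos
            results.append(node)
        return results, pos
    return parse

def match_choice(parsers):
    """Ordered choice: try alternatives left to right, rewinding on failure."""
    def parse(tokens, pos):
        for p in parsers:
            node, new_pos = p(tokens, pos)
            if node is not None:
                return node, new_pos   # first success wins
            # failure: retry the next alternative from the original pos (backtrack)
        return None, pos
    return parse

def literal(tok):
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos] == tok:
            return tok, pos + 1
        return None, pos
    return parse

# rule: 'a' 'b' | 'a' 'c'
rule = match_choice([match_seq([literal('a'), literal('b')]),
                     match_seq([literal('a'), literal('c')])])
print(rule(['a', 'c'], 0))   # (['a', 'c'], 2) -- second alternative, after backtracking
```

Because choice is ordered, the first alternative that succeeds wins even if a later one could match more input, which is why alternative order matters in python.gram.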

Named captures

Rules can bind matched sub-expressions to local variables for use in actions:

if_stmt[stmt_ty]:
    | 'if' a=named_expr ':' b=block c=elif_stmt
      { _PyAST_If(a, b, CHECK(asdl_stmt_seq*, _PyPegen_singleton_seq(p, c)), EXTRA) }
    | 'if' a=named_expr ':' b=block c=[else_block]
      { _PyAST_If(a, b, c, EXTRA) }

Here a, b, c capture the results of each sub-rule.

A worked example: for statement

for_stmt[stmt_ty]:
    | ASYNC 'for' t=star_targets 'in' ex=star_expressions ':' tc=[type_comment]
      b=block el=[else_block]
      { _PyAST_AsyncFor(t, ex, b, el, NEW_TYPE_COMMENT(p, tc), EXTRA) }
    | 'for' t=star_targets 'in' ex=star_expressions ':' tc=[type_comment]
      b=block el=[else_block]
      { _PyAST_For(t, ex, b, el, NEW_TYPE_COMMENT(p, tc), EXTRA) }

Two alternatives: one for async for, one for plain for. Both capture the target (t), iterable (ex), body (b), optional else (el), and optional type comment (tc).
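The captures map one-to-one onto the fields of the resulting For node, which you can confirm from Python:

```python
import ast

# t -> target, ex -> iter, b -> body, el -> orelse on the For node
node = ast.parse(
    "for i in range(3):\n"
    "    x = i\n"
    "else:\n"
    "    done = True\n"
).body[0]
print(type(node).__name__)               # For
print(ast.dump(node.target))             # Name(id='i', ctx=Store())
print(len(node.body), len(node.orelse))  # 1 1
```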

Operator precedence via rule layering

PEG grammars encode operator precedence through rule nesting, not a precedence table. Lower-precedence operators sit in rules higher up the call chain; higher-precedence operators appear in rules deeper in the chain.

expression
  └─ disjunction   (or)
       └─ conjunction  (and)
            └─ inversion  (not)
                 └─ comparison  (==, <, in, is, ...)
                      └─ bitwise_or  (|)
                           └─ bitwise_xor  (^)
                                └─ bitwise_and  (&)
                                     └─ shift_expr  (<<, >>)
                                          └─ sum  (+, -)
                                               └─ term  (*, /, //, %, @)
                                                    └─ factor  (unary +, -, ~)
                                                         └─ power  (**)
                                                              └─ await_primary
                                                                   └─ primary
                                                                        └─ atom

The atom rule is where literals, names, and parenthesised expressions live. Because power calls await_primary which calls primary which calls atom, ** binds tighter than *, which binds tighter than +, and so on.
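The nesting is easy to confirm: the deeper a rule sits in the chain, the deeper its node ends up in the tree, so the outermost BinOp always carries the lowest-precedence operator.

```python
import ast

# 1 + 2 * 3: sum sits above term in the chain, so Add is the root node.
expr = ast.parse("1 + 2 * 3", mode="eval").body
print(type(expr.op).__name__)        # Add  (outermost, lowest precedence)
print(type(expr.right.op).__name__)  # Mult (nested, higher precedence)

# power recurses into itself on the right, so ** is right-associative:
p = ast.parse("2 ** 3 ** 2", mode="eval").body
print(type(p.right).__name__)        # BinOp (the right operand is 3 ** 2)
```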


Part 4 — Intermediate: How pegen generates Parser/parser.c

The generator

CPython ships its own PEG parser generator called pegen, located at Tools/peg_generator/. When you modify Grammar/python.gram, you must regenerate the C parser:

make regen-pegen
# or manually (pegen lives in Tools/peg_generator):
PYTHONPATH=Tools/peg_generator python -m pegen -q c \
    Grammar/python.gram Grammar/Tokens \
    -o Parser/parser.c

pegen reads python.gram and emits Parser/parser.c, a generated file tens of thousands of lines long in which every grammar rule becomes a C function.

Rule → C function mapping

Each grammar rule foo_rule in python.gram becomes a C function:

// Generated in Parser/parser.c
static stmt_ty
if_stmt_rule(Parser *p)
{
    ...
    // Alternative 1
    { // 'if' named_expr ':' block elif_stmt
        Token * _keyword;
        Token * _literal;
        expr_ty a;
        asdl_stmt_seq* b;
        stmt_ty c;
        if (
            (_keyword = _PyPegen_expect_token(p, 510))  // 'if'
            &&
            (a = named_expr_rule(p))
            &&
            (_literal = _PyPegen_expect_token(p, 11))   // ':'
            &&
            (b = block_rule(p))
            &&
            (c = elif_stmt_rule(p))
        )
        {
            res = _PyAST_If(a, b,
                CHECK(asdl_stmt_seq *, _PyPegen_singleton_seq(p, c)),
                EXTRA);
            goto done;
        }
        p->mark = _mark;  // backtrack
    }
    ...
}

Key patterns in the generated code:

  • p->mark saves the current token position before trying an alternative.
  • p->mark = _mark restores position on failure (backtracking).
  • A cut (~) sets a local _cut_var flag; once set, failure in that alternative fails the whole rule instead of falling through to the next alternative.
  • Memoization: results are cached per token position, keyed by (rule_id, position).

The Parser struct

The generated parser functions all take a Parser *p pointer. Key fields:

// Parser/pegen.h (simplified)
typedef struct {
    struct tok_state *tok;     // tokenizer state
    Token **tokens;            // token lookahead buffer
    int mark;                  // current position in the token buffer
    int fill;                  // how many tokens are buffered
    PyArena *arena;            // memory arena for AST nodes
    int error_indicator;       // set on parse error
    int level;                 // recursion depth counter
    int flags;                 // parser flags
} Parser;                      // memo entries hang off each Token, not the Parser

Part 5 — Advanced: Memoization and packrat parsing

CPython's PEG parser is a packrat parser — it memoizes every rule result at every token position to guarantee linear-time parsing even with backtracking.

How memoization works

Before calling any rule function, the generated code checks a memo table:

// Pattern in generated parser functions (simplified)
if (p->level++ == MAXSTACK) { _Pypegen_stack_overflow(p); }

// Check the memo table before doing any work
{ stmt_ty _res = NULL;
  if (_PyPegen_is_memoized(p, if_stmt_type, &_res))
      return _res;
  ...
  // Store the result for this (rule, position) pair
  _PyPegen_insert_memo(p, _mark, if_stmt_type, _res);
}

The memo key is (rule_type, token_position). This means each rule is called at most once per token position, bounding total work to O(n × R) where n is the token count and R is the number of rules.

Left recursion

Standard PEG parsers cannot handle left-recursive rules. CPython's pegen supports them via a seed-and-grow algorithm:

# Left-recursive: primary calls itself on the left
primary[expr_ty]:
    | primary '.' NAME
    | primary '[' slices ']'
    | primary genexp
    | primary '(' [arguments] ')'
    | atom

The algorithm:

  1. Seed the memo for primary at the current token position with a failure result.
  2. Call primary: its recursive self-reference finds the seed in the memo, so only the non-recursive atom alternative can match at first.
  3. If the resulting match is longer than the memoized one, update the memo and re-run the rule.
  4. Repeat until no further growth is possible; the longest match wins.

This is implemented by the generated <rule>_raw helper functions together with _PyPegen_update_memo in Parser/pegen.c.
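A toy version of seed-and-grow for a left-recursive grammar like expr: expr '-' NUM | NUM looks like this (illustrative Python, not pegen's code):

```python
# Seed-and-grow for:  expr: expr '-' NUM | NUM
def parse_num(tokens, pos):
    if pos < len(tokens) and tokens[pos].isdigit():
        return ("num", tokens[pos]), pos + 1
    return None, pos

def parse_expr(tokens, pos):
    # Seed: the memoized result starts as failure, so the first recursive
    # "call" to expr resolves to the plain NUM alternative.
    seed, seed_pos = parse_num(tokens, pos)
    if seed is None:
        return None, pos
    best, best_pos = seed, seed_pos
    while True:
        # Grow: try expr '-' NUM, using the current best match as the left side.
        if best_pos + 1 < len(tokens) and tokens[best_pos] == '-':
            right, new_pos = parse_num(tokens, best_pos + 1)
            if right is not None:
                best, best_pos = ("sub", best, right), new_pos
                continue
        return best, best_pos   # no further growth possible

tree, _ = parse_expr(list("1-2-3"), 0)
print(tree)   # ('sub', ('sub', ('num', '1'), ('num', '2')), ('num', '3'))
```

Growing from the seed naturally produces left-associative trees: 1-2-3 parses as (1-2)-3.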


Part 6 — Advanced: The AST

What the parser produces

The parser's output is an Abstract Syntax Tree — a tree of C structs defined by Parser/Python.asdl (the ASDL schema file) and generated into Include/internal/pycore_ast.h.

Parser/Python.asdl

ASDL (Zephyr Abstract Syntax Description Language) defines the AST node types:

-- Python.asdl (excerpts)

module Python
{
    mod = Module(stmt* body, type_ignore* type_ignores)
        | Interactive(stmt* body)
        | Expression(expr body)
        | FunctionType(expr* argtypes, expr returns)

    stmt = FunctionDef(identifier name, arguments args,
                       stmt* body, expr* decorator_list,
                       expr? returns, string? type_comment)
         | ...
         | If(expr test, stmt* body, stmt* orelse)
         | For(expr target, expr iter, stmt* body, stmt* orelse,
               string? type_comment)
         | ...

    expr = BoolOp(boolop op, expr* values)
         | BinOp(expr left, operator op, expr right)
         | ...
         | Name(identifier id, expr_context ctx)
         | Constant(constant value, string? kind)
         | ...
}
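Each ASDL constructor becomes a Python-visible node class whose _fields tuple matches the ASDL field list exactly:

```python
import ast

# ASDL: If(expr test, stmt* body, stmt* orelse)
node = ast.parse("if x:\n    pass").body[0]
print(type(node).__name__)  # If
print(node._fields)         # ('test', 'body', 'orelse')
```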

Inspecting the AST in Python

The ast module exposes the full tree:

import ast

tree = ast.parse("""
def add(x, y):
    return x + y
""")

print(ast.dump(tree, indent=2))

Output:

Module(
  body=[
    FunctionDef(
      name='add',
      args=arguments(
        posonlyargs=[],
        args=[arg(arg='x'), arg(arg='y')],
        ...),
      body=[
        Return(
          value=BinOp(
            left=Name(id='x', ctx=Load()),
            op=Add(),
            right=Name(id='y', ctx=Load())))],
      ...)],
  ...)

Walking and transforming the AST

import ast

class ConstantFolder(ast.NodeTransformer):
    """Fold BinOp(Constant, Add, Constant) → Constant at compile time."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if (isinstance(node.left, ast.Constant)
                and isinstance(node.right, ast.Constant)
                and isinstance(node.op, ast.Add)):
            return ast.Constant(value=node.left.value + node.right.value)
        return node

src = "x = 1 + 2"
tree = ast.parse(src)
new_tree = ConstantFolder().visit(tree)
ast.fix_missing_locations(new_tree)
print(ast.dump(new_tree, indent=2))
# BinOp is gone; Constant(value=3) appears directly
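A transformed tree does not have to be unparsed back to source: the built-in compile() accepts an ast.Module directly, so the output of a NodeTransformer can be executed in place. A minimal round trip (shown with a freshly parsed tree for brevity):

```python
import ast

tree = ast.parse("x = 1 + 2")
code = compile(tree, filename="<ast>", mode="exec")  # compile() accepts AST objects
ns = {}
exec(code, ns)
print(ns["x"])   # 3
```

Passing ast.PyCF_ONLY_AST as the flags argument to compile() goes the other way, returning the AST instead of a code object.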

Part 7 — Advanced: Modifying the grammar

Adding a new keyword or statement

Suppose you want to add a repeat N: statement (executes body N times).

Step 1 — Add to Grammar/python.gram:

repeat_stmt[stmt_ty]:
    | 'repeat' n=expression ':' b=block
      { _PyAST_Repeat(n, b, EXTRA) }

Add it in the compound_stmt alternatives:

compound_stmt[stmt_ty]:
    | ...
    | repeat_stmt

Step 2 — Add the AST node to Parser/Python.asdl:

stmt = ...
     | Repeat(expr count, stmt* body)
     | ...

Step 3 — Regenerate the AST C headers:

make regen-ast

Step 4 — Implement the AST node in Python/ast.c (validation) and Python/compile.c (bytecode emission):

// Python/compile.c (simplified)
case Repeat_kind:
    return compiler_repeat(c, s);

static int
compiler_repeat(struct compiler *c, stmt_ty s)
{
    // emit bytecode: evaluate count, loop body count times
    ...
}

Step 5 — Regenerate the parser:

make regen-pegen

Step 6 — Build and test:

make -j4
./python -c "repeat 3: print('hello')"

Adding a new operator

To add a <> not-equal operator (it existed in Python 2):

Step 1 — Add to Grammar/python.gram in the compare_op_bitwise_or_pair rule:

compare_op_bitwise_or_pair[CmpopExprPair*]:
    | ...
    | ne_bitwise_or

ne_bitwise_or[CmpopExprPair*]:
    | '<>' a=bitwise_or
      { _PyPegen_cmpop_expr_pair(p, NotEq, a) }

Step 2 — Ensure <> reaches the parser as a single token. The C tokenizer (Parser/tokenizer.c) already maps <> to NOTEQUAL; this is what makes the from __future__ import barry_as_FLUFL Easter egg possible. Note that the pure-Python tokenizer in Lib/tokenize.py splits <> into two OP tokens, which you can see with:

import tokenize, io
list(tokenize.generate_tokens(io.StringIO("a <> b").readline))

Step 3 — Regenerate and build.


Part 8 — Advanced: Debugging grammar issues

Debug builds and parser tracing

Build CPython with --with-pydebug to enable parser tracing:

./configure --with-pydebug
make -j4

Then run with the -d flag (or set PYTHONDEBUG=1); in a pydebug build this turns on rule-by-rule parser tracing:

./python -d -c "x = 1"

This prints each rule call and whether it succeeded or failed, letting you trace exactly how the grammar matches a piece of source.

Using ast.parse() to validate grammar changes

import ast

test_cases = [
    "x = 1",
    "if x: pass",
    "for i in range(10): pass",
    "lambda x, y: x + y",
    "[x for x in range(10) if x % 2 == 0]",
]

for src in test_cases:
    try:
        ast.parse(src)
        print(f"OK: {src!r}")
    except SyntaxError as e:
        print(f"FAIL: {src!r}: {e}")

Inspecting Parser/parser.c diffs

After regenerating the parser, inspect the diff to verify your grammar change compiled as expected:

git diff Parser/parser.c | grep -A 20 "repeat_stmt_rule"

Part 9 — Reference: Key files

File Role
Grammar/python.gram Authoritative PEG grammar source
Parser/Python.asdl AST node type definitions (ASDL schema)
Parser/parser.c Generated PEG parser (do not edit by hand)
Include/internal/pycore_ast.h Generated AST C struct declarations
Python/ast.c AST validation and helper functions
Python/compile.c AST → bytecode compiler
Python/symtable.c Symbol table construction from AST
Parser/tokenizer.c Tokenizer (C implementation)
Lib/tokenize.py Tokenizer (Python implementation, used by tools)
Lib/ast.py Python-level AST utilities
Tools/peg_generator/ pegen meta-grammar and generator scripts
Makefile.pre.in regen-pegen and regen-ast make targets

Part 10 — Reference: Useful make targets

# Regenerate Parser/parser.c from Grammar/python.gram
make regen-pegen

# Regenerate AST headers from Parser/Python.asdl
make regen-ast

# Regenerate everything grammar-related in one shot
make regen-all

# Check that the committed generated files match the grammar source
# (CI does the equivalent: regenerate, then fail on any diff)
make regen-pegen && git diff --exit-code Parser/parser.c

Reference: The full source-to-bytecode pipeline

flowchart TD
    A["📄 Source text\n<i>.py file</i>"]
    B["🔡 Tokenizer\n<code>Parser/tokenizer.c</code>\nProduces token stream:\nNAME · OP · NUMBER · NEWLINE · INDENT · DEDENT"]
    C["🌳 PEG Parser\n<code>Parser/parser.c</code>\nGenerated from <code>Grammar/python.gram</code>\nProduces: Abstract Syntax Tree"]
    D["✅ AST Validator\n<code>Python/ast.c</code>\nChecks semantic rules not expressible in PEG\ne.g. no duplicate arg names · no break outside loop"]
    E["🔍 Symbol Table\n<code>Python/symtable.c</code>\nResolves scope of every name:\nlocal · free · global · cell"]
    F["⚙️ Compiler\n<code>Python/compile.c</code>\nWalks AST, emits CPython bytecode instructions"]
    G["✨ Peephole Optimizer\n<code>Python/flowgraph.c</code>\nConstant folding · dead-code elimination · jump threading"]
    H["📦 Code Object\n<code>PyCodeObject</code>\nco_code · co_consts · co_varnames · co_filename · co_firstlineno"]
    I["▶️ Execution\n<code>Python/ceval.c</code>\nThe eval loop"]

    A --> B --> C --> D --> E --> F --> G --> H --> I
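Every stage in this diagram except the C internals is observable from Python itself, which makes a handy smoke test when experimenting:

```python
import ast
import dis
import io
import symtable
import tokenize

src = "def f(a):\n    return a + 1\n"

# Tokenizer stage: keywords arrive as NAME tokens, per Part 2
names = [t.string
         for t in tokenize.generate_tokens(io.StringIO(src).readline)
         if t.type == tokenize.NAME]
print(names)                            # ['def', 'f', 'a', 'return', 'a']

# Parser stage: source -> AST
tree = ast.parse(src)
print(type(tree.body[0]).__name__)      # FunctionDef

# Symbol table stage: one child scope per nested function
st = symtable.symtable(src, "<demo>", "exec")
print(st.get_children()[0].get_name())  # f

# Compiler stage: source -> bytecode; dis prints the instructions
code = compile(src, "<demo>", "exec")
dis.dis(code)
```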