CPython Grammar: From Beginner to Advanced¶
Overview¶
CPython's grammar is the formal specification of what constitutes valid Python
syntax. Everything from if x: to a complex decorator expression is defined in
a single file — Grammar/python.gram — using a formalism called PEG
(Parsing Expression Grammar). When you write Python code, CPython reads that
file's compiled output to decide whether your source text is valid and how to
turn it into an Abstract Syntax Tree (AST).
This tutorial walks through the entire pipeline, from the grammar file itself to the C code that drives the parser at runtime.
flowchart TD
A["`**Grammar/python.gram**`"] -->|pegen — Tools/peg_generator| B
subgraph build["Build time"]
B["`Parser/parser.c
Include/internal/pycore_ast.h`"]
B -->|compilation| C[libpython\nparser embedded in interpreter]
end
subgraph runtime["Runtime"]
D[Source text] --> E[Tokenizer]
E --> F[Parser]
F --> G[AST]
G --> H[Compiler]
H --> I[Bytecode]
end
C --> runtime
Part 1 — Beginner: What is the grammar file?¶
The file: Grammar/python.gram¶
Every Python release ships one authoritative grammar file at
Grammar/python.gram in the CPython source tree. It is written in
PEG (Parsing Expression Grammar) notation using the pegen meta-grammar.
Open it and the first lines you see are:
@header """
#include "Python.h"
#include "pycore_ast.h"
"""
@subheader ""
@trailer """
...
"""
start[mod_ty]: a=statements NEWLINE* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
The annotations in curly braces ({ ... }) are C action code — they specify
what AST node to build when that rule matches.
Key vocabulary¶
| Term | Meaning |
|---|---|
| Rule | A named grammar production, e.g. if_stmt |
| Alternative | One of the pipe-separated (`\|`) sequences a rule may match; tried in order |
| Action | The C expression in { } that produces an AST node |
| Token | An atomic input unit produced by the tokenizer (NAME, NUMBER, NEWLINE, etc.) |
| Cut (~) | Commits the parser to the current alternative; disables backtracking |
The simplest rule¶
In python.gram the rule for pass is essentially a one-liner: `pass_stmt[stmt_ty]: 'pass' { _PyAST_Pass(EXTRA) }`. Reading this: if the parser sees the literal token pass, it matches pass_stmt and builds a Pass AST node. EXTRA is a macro that expands to the source-location arguments (line and column offsets) that every AST node constructor takes.
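You can confirm the mapping from the Python side: parsing just pass yields a module whose body is a single Pass node.

```python
import ast

# `pass` parses to exactly one Pass statement node.
tree = ast.parse("pass")
print(ast.dump(tree))  # Module(body=[Pass()], ...)
```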
Part 2 — Beginner: The tokenizer feeds the parser¶
The grammar does not operate on raw characters. CPython first runs the
source text through a tokenizer (Parser/tokenizer.c in C, mirrored by
Lib/tokenize.py in pure Python) that emits a stream of typed tokens. The
grammar rules then match against those tokens, not characters.
Common token types¶
| Token | Example input |
|---|---|
| NAME | foo, if, True |
| NUMBER | 42, 3.14, 0xFF |
| STRING | "hello", b'\x00' |
| NEWLINE | End of a logical line |
| INDENT / DEDENT | Indentation changes |
| OP | +, ->, **= |
| ENDMARKER | End of file |
Keywords such as if, while, and return are NAME tokens that the grammar
matches as literal strings like 'if' — there is no separate keyword token
type.
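This is easy to check with the tokenize module: the keyword arrives as an ordinary NAME token, indistinguishable by type from an identifier.

```python
import io
import tokenize

# Tokenize a line that starts with the keyword `if`.
toks = list(tokenize.generate_tokens(io.StringIO("if x:\n    pass\n").readline))

# Both the keyword and the identifier are NAME tokens; only the
# grammar distinguishes them, by matching the literal string 'if'.
assert toks[0].type == tokenize.NAME and toks[0].string == "if"
assert toks[1].type == tokenize.NAME and toks[1].string == "x"
```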
How to explore tokens yourself¶
import tokenize, io
src = "x = 1 + 2\n"
tokens = tokenize.generate_tokens(io.StringIO(src).readline)
for tok in tokens:
print(tok)
Output:
TokenInfo(type=1 (NAME), string='x', ...)
TokenInfo(type=54 (OP), string='=', ...)
TokenInfo(type=2 (NUMBER), string='1', ...)
TokenInfo(type=54 (OP), string='+', ...)
TokenInfo(type=2 (NUMBER), string='2', ...)
TokenInfo(type=4 (NEWLINE), string='\n', ...)
TokenInfo(type=0 (ENDMARKER), string='', ...)
Part 3 — Intermediate: PEG grammar syntax in depth¶
Rule syntax¶
A rule has the general shape `rule_name[return_type]: | alternative1 { action1 } | alternative2 { action2 }`. The return_type is the C type of the value that the action returns. For statement rules it is typically stmt_ty; for expressions, expr_ty.
Operators and groupings¶
| Syntax | Meaning |
|---|---|
| e1 e2 | Sequence: match e1, then e2 |
| e1 \| e2 | Ordered choice: try e1; if it fails, try e2 |
| e? | Optional: match zero or one e |
| e* | Repeat: match zero or more e |
| e+ | Repeat: match one or more e |
| &e | Positive lookahead: succeed if e would match, consume nothing |
| !e | Negative lookahead: succeed if e would NOT match, consume nothing |
| ~ | Cut: commit to this alternative; no backtracking past this point |
| [e] | Same as e? |
Named captures¶
Rules can bind matched sub-expressions to local variables for use in actions:
if_stmt[stmt_ty]:
| 'if' a=named_expr ':' b=block c=elif_stmt
{ _PyAST_If(a, b, CHECK(asdl_stmt_seq*, _PyPegen_singleton_seq(p, c)), EXTRA) }
| 'if' a=named_expr ':' b=block c=[else_block]
{ _PyAST_If(a, b, c, EXTRA) }
Here a, b, c capture the results of each sub-rule.
A worked example: for statement¶
for_stmt[stmt_ty]:
| ASYNC 'for' t=star_targets 'in' ex=star_expressions ':' tc=[type_comment]
b=block el=[else_block]
{ _PyAST_AsyncFor(t, ex, b, el, NEW_TYPE_COMMENT(p, tc), EXTRA) }
| 'for' t=star_targets 'in' ex=star_expressions ':' tc=[type_comment]
b=block el=[else_block]
{ _PyAST_For(t, ex, b, el, NEW_TYPE_COMMENT(p, tc), EXTRA) }
Two alternatives: one for async for, one for plain for. Both capture the
target (t), iterable (ex), body (b), optional else (el), and optional
type comment (tc).
Operator precedence via rule layering¶
PEG grammars encode operator precedence through rule nesting, not a precedence table. Lower-precedence operators sit in rules higher up the call chain; higher-precedence operators appear in rules deeper in the chain.
expression
└─ disjunction (or)
└─ conjunction (and)
└─ inversion (not)
└─ comparison (==, <, in, is, ...)
└─ bitwise_or (|)
└─ bitwise_xor (^)
└─ bitwise_and (&)
└─ shift_expr (<<, >>)
└─ sum (+, -)
└─ term (*, /, //, %, @)
└─ factor (unary +, -, ~)
└─ power (**)
└─ await_primary
└─ primary
└─ atom
The atom rule is where literals, names, and parenthesised expressions live.
Because power calls await_primary which calls primary which calls atom,
** binds tighter than *, which binds tighter than +, and so on.
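The layering is visible in the AST itself: in 1 + 2 * 3 ** 4 the lowest-precedence operator ends up at the root and the highest-precedence one at the deepest node. A quick check using only the stdlib ast module:

```python
import ast

# Parse an expression that mixes three precedence levels.
tree = ast.parse("1 + 2 * 3 ** 4", mode="eval")

# The root is Add (parsed by `sum`, highest in the call chain),
# its right child is Mult (`term`), and the deepest node is Pow
# (`power`): deeper in the rule chain means tighter binding.
assert isinstance(tree.body.op, ast.Add)
assert isinstance(tree.body.right.op, ast.Mult)
assert isinstance(tree.body.right.right.op, ast.Pow)
```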
Part 4 — Intermediate: How pegen generates Parser/parser.c¶
The generator¶
CPython ships its own PEG parser generator called pegen, located at
Tools/peg_generator/. When you modify Grammar/python.gram, you must
regenerate the C parser:
make regen-pegen
# which, under the hood, runs roughly:
PYTHONPATH=Tools/peg_generator python -m pegen -q c \
    Grammar/python.gram Grammar/Tokens -o Parser/parser.c
pegen reads python.gram and emits Parser/parser.c — tens of thousands of
lines of C — where every grammar rule becomes a C function.
Rule → C function mapping¶
Each grammar rule foo_rule in python.gram becomes a C function:
// Generated in Parser/parser.c
static stmt_ty
if_stmt_rule(Parser *p)
{
...
// Alternative 1
{ // 'if' named_expr ':' block elif_stmt
Token * _keyword;
Token * _literal;
expr_ty a;
asdl_stmt_seq* b;
stmt_ty c;
if (
(_keyword = _PyPegen_expect_token(p, 510)) // 'if'
&&
(a = named_expr_rule(p))
&&
(_literal = _PyPegen_expect_token(p, 11)) // ':'
&&
(b = block_rule(p))
&&
(c = elif_stmt_rule(p))
)
{
res = _PyAST_If(a, b,
CHECK(asdl_stmt_seq *, _PyPegen_singleton_seq(p, c)),
EXTRA);
goto done;
}
p->mark = _mark; // backtrack
}
...
}
Key patterns in the generated code:
- `p->mark` saves the current token position before an alternative is tried; `p->mark = _mark` restores that position when the alternative fails (backtracking).
- A cut (`~`) sets a cut flag that stops the parser from trying any later alternatives of the rule once the cut point has been passed.
- Memoization: rule results are cached in a memo table keyed by (rule_id, position).
The Parser struct¶
The generated parser functions all take a Parser *p pointer. Key fields:
// Parser/pegen.h (simplified)
typedef struct {
struct tok_state *tok; // tokenizer state
Token **tokens; // token lookahead buffer
int mark; // current position
int fill; // how many tokens are buffered
PyArena *arena; // memory arena for AST nodes
_PyPegen_Memo *memo; // memoization table
int error_indicator; // set on parse error
int level; // recursion depth counter
int flags; // parser flags
} Parser;
Part 5 — Advanced: Memoization and packrat parsing¶
CPython's PEG parser is a packrat parser — it memoizes every rule result at every token position to guarantee linear-time parsing even with backtracking.
How memoization works¶
Before calling any rule function, the generated code checks a memo table:
// Pattern in generated parser functions
if (p->level++ == MAXSTACK) { _Pypegen_stack_overflow(p); }
// Check memo
{ stmt_ty _res = NULL;
  if (_PyPegen_is_memoized(p, if_stmt_type, &_res))
      return _res;
  ...
  // Store the result before returning
  _PyPegen_insert_memo(p, _mark, if_stmt_type, _res);
}
The memo key is (rule_type, token_position). This means each rule is called
at most once per token position, bounding total work to O(n × R) where n is
the token count and R is the number of rules.
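The idea can be sketched with a toy parser in Python (an illustration only; nothing here is CPython code). The grammar is expr <- number '+' number / number, and the memo lets the second alternative reuse the first alternative's work after backtracking:

```python
# Toy packrat parsing: cache rule results by (input, position).
calls = {"number": 0}

def memoize(rule):
    cache = {}
    def wrapped(tokens, pos):
        key = (tuple(tokens), pos)
        if key not in cache:
            cache[key] = rule(tokens, pos)
        return cache[key]
    return wrapped

@memoize
def number(tokens, pos):
    calls["number"] += 1  # count real (non-cached) evaluations
    if pos < len(tokens) and tokens[pos].isdigit():
        return int(tokens[pos]), pos + 1  # success: (value, new position)
    return None                           # failure

def expr(tokens, pos):
    # Alternative 1: number '+' number
    a = number(tokens, pos)
    if a is not None and a[1] < len(tokens) and tokens[a[1]] == "+":
        b = number(tokens, a[1] + 1)
        if b is not None:
            return a[0] + b[0], b[1]
    # Alternative 2: bare number. Backtracking to `pos` re-asks the
    # memo, so `number` is NOT evaluated a second time.
    return number(tokens, pos)

print(expr(["7"], 0))    # -> (7, 1)
print(calls["number"])   # -> 1: the memo absorbed the re-parse
```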
Left recursion¶
Standard PEG parsers cannot handle left-recursive rules. CPython's pegen
supports them via a seed-and-grow algorithm:
# Left-recursive: primary calls itself on the left
primary[expr_ty]:
| primary '.' NAME
| primary '[' slices ']'
| primary genexp
| primary '(' [arguments] ')'
| atom
The algorithm:
1. Seed the memo for primary at the current position with a failure result.
2. Call primary; the left-recursive call is answered from the seed (failure, or the previous best match).
3. If a longer match is found, update the memo and re-run the rule body.
4. Repeat until no further growth is possible.
This seed-and-grow loop is emitted directly into the generated rule
functions in Parser/parser.c (left-recursive rules get a `_raw` helper
plus a growing wrapper), with memo helpers such as _PyPegen_update_memo
living in Parser/pegen.c.
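Here is a toy Python model of the loop (again an illustration, not CPython's code) for the grammar primary <- primary '.' NAME / NAME:

```python
# Seed-and-grow: answer the left-recursive call from a memoized
# "seed", then keep re-running the rule while the match grows.
def parse_primary(tokens, pos):
    seed = None  # memoized result: (value, end position) or None
    while True:
        result = primary_once(tokens, pos, seed)
        if result is None or (seed is not None and result[1] <= seed[1]):
            return seed  # no growth: the previous best match wins
        seed = result    # grew: feed the longer match back in

def primary_once(tokens, pos, seed):
    # Alternative 1: primary '.' NAME, with the left-recursive call
    # answered from the seed instead of recursing.
    if seed is not None:
        value, end = seed
        if end + 1 < len(tokens) and tokens[end] == ".":
            return f"Attr({value}, {tokens[end + 1]})", end + 2
    # Alternative 2: NAME
    if pos < len(tokens) and tokens[pos].isidentifier():
        return tokens[pos], pos + 1
    return None

print(parse_primary(["a", ".", "b", ".", "c"], 0))
# -> ('Attr(Attr(a, b), c)', 5): left-associative, as in a.b.c
```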
Part 6 — Advanced: The AST¶
What the parser produces¶
The parser's output is an Abstract Syntax Tree — a tree of C structs defined
by Parser/Python.asdl (the ASDL schema file) and generated into
Include/internal/pycore_ast.h.
Parser/Python.asdl¶
ASDL (Zephyr Abstract Syntax Description Language) defines the AST node types:
-- Python.asdl (excerpts)
module Python
{
mod = Module(stmt* body, type_ignore* type_ignores)
| Interactive(stmt* body)
| Expression(expr body)
| FunctionType(expr* argtypes, expr returns)
stmt = FunctionDef(identifier name, arguments args,
stmt* body, expr* decorator_list,
expr? returns, string? type_comment)
| ...
| If(expr test, stmt* body, stmt* orelse)
| For(expr target, expr iter, stmt* body, stmt* orelse,
string? type_comment)
| ...
expr = BoolOp(boolop op, expr* values)
| BinOp(expr left, operator op, expr right)
| ...
| Name(identifier id, expr_context ctx)
| Constant(constant value, string? kind)
| ...
}
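The same schema drives the Python-level ast module: each ASDL constructor becomes a class, and its fields surface, in order, as _fields.

```python
import ast

# ASDL: If(expr test, stmt* body, stmt* orelse)
assert ast.If._fields == ("test", "body", "orelse")

# ASDL: BinOp(expr left, operator op, expr right)
assert ast.BinOp._fields == ("left", "op", "right")

# ASDL: Constant(constant value, string? kind)
assert ast.Constant._fields == ("value", "kind")
```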
Inspecting the AST in Python¶
The ast module exposes the full tree:
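For example, parsing a small function and pretty-printing the tree (the dump shown below is abbreviated with "..."):

```python
import ast

src = """
def add(x, y):
    return x + y
"""
# indent=4 pretty-prints the nested node structure
print(ast.dump(ast.parse(src), indent=4))
```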
Output:
Module(
body=[
FunctionDef(
name='add',
args=arguments(
posonlyargs=[],
args=[arg(arg='x'), arg(arg='y')],
...),
body=[
Return(
value=BinOp(
left=Name(id='x', ctx=Load()),
op=Add(),
right=Name(id='y', ctx=Load())))],
...)],
...)
Walking and transforming the AST¶
import ast
class ConstantFolder(ast.NodeTransformer):
"""Fold BinOp(Constant, Add, Constant) → Constant at compile time."""
def visit_BinOp(self, node):
self.generic_visit(node)
if (isinstance(node.left, ast.Constant)
and isinstance(node.right, ast.Constant)
and isinstance(node.op, ast.Add)):
return ast.Constant(value=node.left.value + node.right.value)
return node
src = "x = 1 + 2"
tree = ast.parse(src)
new_tree = ConstantFolder().visit(tree)
ast.fix_missing_locations(new_tree)
print(ast.dump(new_tree, indent=2))
# BinOp is gone; Constant(value=3) appears directly
Part 7 — Advanced: Modifying the grammar¶
Adding a new keyword or statement¶
Suppose you want to add a repeat N: statement (executes body N times).
Step 1 — Add to Grammar/python.gram:
Add it in the compound_stmt alternatives:
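A sketch of what the rule could look like. Everything here is hypothetical: repeat_stmt, _PyAST_Repeat, and the Repeat node do not exist in CPython and are invented for this walkthrough.

```
repeat_stmt[stmt_ty]:
    | 'repeat' n=expression ':' b=block { _PyAST_Repeat(n, b, EXTRA) }
```

You would also list `| repeat_stmt` among the compound_stmt alternatives. pegen derives the keyword table from the quoted literals in the grammar, so 'repeat' becomes a keyword automatically.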
Step 2 — Add the AST node to Parser/Python.asdl:
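A hypothetical constructor in the stmt sum type (the Repeat name is invented for this example):

```
stmt = ...
     | Repeat(expr count, stmt* body)
     | ...
```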
Step 3 — Regenerate the AST C headers by running `make regen-ast`, which rebuilds Include/internal/pycore_ast.h (and Python/Python-ast.c) from the ASDL schema.
Step 4 — Implement the AST node in Python/ast.c (validation) and
Python/compile.c (bytecode emission):
// Python/compile.c (simplified)
case Repeat_kind:
return compiler_repeat(c, s);
static int
compiler_repeat(struct compiler *c, stmt_ty s)
{
// emit bytecode: evaluate count, loop body count times
...
}
Step 5 — Regenerate the parser with `make regen-pegen`.
Step 6 — Build (`make -j`) and exercise the new statement with the freshly built `./python`.
Adding a new operator¶
To add a <> not-equal operator (it existed in Python 2):
Step 1 — Add to Grammar/python.gram in the compare_op_bitwise_or_pair
rule:
compare_op_bitwise_or_pair[CmpopExprPair*]:
| ...
| ne_bitwise_or
ne_bitwise_or[CmpopExprPair*]:
| '<>' a=bitwise_or
{ _PyPegen_cmpop_expr_pair(p, NotEq, a) }
Step 2 — Ensure <> reaches the parser as a single token. The C tokenizer
(Parser/tokenizer.c, with the two-character operator table in
Parser/token.c) already emits <> as NOTEQUAL, so no tokenizer change is
needed.
Step 3 — Regenerate and build.
Part 8 — Advanced: Debugging grammar issues¶
Debug builds and parser tracing¶
Build CPython with `./configure --with-pydebug` so the generated parser's tracing code is compiled in. Then run the interpreter with the `-d` flag, or set `PYTHONDEBUG=1` in the environment:
This prints each rule call and whether it succeeded or failed, letting you trace exactly how the grammar matches a piece of source.
Using ast.parse() to validate grammar changes¶
import ast
test_cases = [
"x = 1",
"if x: pass",
"for i in range(10): pass",
"lambda x, y: x + y",
"[x for x in range(10) if x % 2 == 0]",
]
for src in test_cases:
try:
ast.parse(src)
print(f"OK: {src!r}")
except SyntaxError as e:
print(f"FAIL: {src!r} → {e}")
Inspecting Parser/parser.c diffs¶
After regenerating the parser, inspect the diff (for example with `git diff Parser/parser.c`) to verify that your grammar change compiled into the C functions you expect.
Part 9 — Reference: Key files¶
| File | Role |
|---|---|
| Grammar/python.gram | Authoritative PEG grammar source |
| Parser/Python.asdl | AST node type definitions (ASDL schema) |
| Parser/parser.c | Generated PEG parser (do not edit by hand) |
| Include/internal/pycore_ast.h | Generated AST C struct declarations |
| Python/ast.c | AST validation and helper functions |
| Python/compile.c | AST → bytecode compiler |
| Python/symtable.c | Symbol table construction from AST |
| Parser/tokenizer.c | Tokenizer (C implementation) |
| Lib/tokenize.py | Tokenizer (Python implementation, used by tools) |
| Lib/ast.py | Python-level AST utilities |
| Tools/peg_generator/ | pegen meta-grammar and generator scripts |
| Makefile.pre.in | regen-pegen and regen-ast make targets |
Part 10 — Reference: Useful make targets¶
# Regenerate Parser/parser.c from Grammar/python.gram
make regen-pegen
# Regenerate AST headers from Parser/Python.asdl
make regen-ast
# Regenerate everything grammar-related in one shot
make regen-all
# Verify that the committed generated files match the grammar source
# (CI does the equivalent: regenerate, then fail if the tree is dirty)
make regen-all && git diff --exit-code
Reference: The full source-to-bytecode pipeline¶
flowchart TD
A["📄 Source text\n<i>.py file</i>"]
B["🔡 Tokenizer\n<code>Parser/tokenizer.c</code>\nProduces token stream:\nNAME · OP · NUMBER · NEWLINE · INDENT · DEDENT"]
C["🌳 PEG Parser\n<code>Parser/parser.c</code>\nGenerated from <code>Grammar/python.gram</code>\nProduces: Abstract Syntax Tree"]
D["✅ AST Validator\n<code>Python/ast.c</code>\nChecks semantic rules not expressible in PEG\ne.g. no duplicate arg names · no break outside loop"]
E["🔍 Symbol Table\n<code>Python/symtable.c</code>\nResolves scope of every name:\nlocal · free · global · cell"]
F["⚙️ Compiler\n<code>Python/compile.c</code>\nWalks AST, emits CPython bytecode instructions"]
G["✨ Peephole Optimizer\n<code>Python/flowgraph.c</code>\nConstant folding · dead-code elimination · jump threading"]
H["📦 Code Object\n<code>PyCodeObject</code>\nco_code · co_consts · co_varnames · co_filename · co_firstlineno"]
I["▶️ Execution\n<code>Python/ceval.c</code>\nThe eval loop"]
A --> B --> C --> D --> E --> F --> G --> H --> I