Constrained decoding is the umbrella technique behind structured outputs, JSON mode, regex-bounded generation, and grammar-bounded code synthesis. It modifies the autoregressive decoding loop so that only tokens consistent with a formal constraint can be sampled.
The basic loop
Standard LLM decoding is:

```python
for t in range(max_tokens):
    logits = model.forward(tokens)
    next_token = sample(softmax(logits))
    tokens.append(next_token)
```
Constrained decoding inserts a mask before sampling:

```python
state = grammar.start()
for t in range(max_tokens):
    logits = model.forward(tokens)
    allowed = grammar.allowed_tokens(state)  # boolean mask over the vocabulary
    logits[~allowed] = -float("inf")         # illegal tokens get zero probability
    next_token = sample(softmax(logits))
    state = grammar.transition(state, next_token)
    tokens.append(next_token)
    if grammar.is_accepting(state) and next_token == EOS:
        break
```
The challenge is computing allowed_tokens(state) cheaply. Tokenisers split strings into BPE chunks that don't align with grammar tokens, so a single grammar production may correspond to thousands of legal token sequences.
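A minimal sketch of the naive approach, using a toy vocabulary and a choice-list constraint (all names here are illustrative): a token is allowed if appending it to the text generated so far still leaves a prefix of some legal string. Note that a multi-character BPE token like "itive" is only legal in certain states, which is exactly the alignment problem described above.

```python
def allowed_tokens(vocab, legal_strings, generated):
    """Return the vocab tokens that keep the output a prefix of a legal string."""
    allowed = []
    for tok in vocab:
        candidate = generated + tok
        if any(s.startswith(candidate) for s in legal_strings):
            allowed.append(tok)
    return allowed

vocab = ["pos", "neg", "itive", "ative", "neutral", "ive", "xyz"]
legal = ["positive", "negative", "neutral"]

allowed_tokens(vocab, legal, "")     # ['pos', 'neg', 'neutral']
allowed_tokens(vocab, legal, "pos")  # ['itive']
```

This scan is O(|V| * |constraints|) per step, which is why the libraries below precompute and cache the allowed set per grammar state instead.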
Algorithms
- Outlines (Willard & Louf, 2023): pre-compiles a regex into a finite-state machine, then for each FSM state intersects the vocabulary with the set of reachable token prefixes; the allowed set is cached per state.
- xgrammar (Dong et al., 2024): supports full context-free grammars with a persistent stack; reports under 1% overhead versus unconstrained decoding.
- Guidance: interleaves grammar and prompt; supports gen() with stop tokens and patterns.
- lm-format-enforcer: enforces JSON Schema and regex output formats.
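The per-state caching idea behind the Outlines approach can be sketched as follows (a simplified illustration, not the library's actual API): compile the regex into a character-level DFA, then for each DFA state precompute the boolean mask of tokens whose full character sequence keeps the DFA alive.

```python
# Hand-built DFA for the regex -?[0-9]+ : state 0 = start, 1 = seen '-', 2 = accepting
TRANS = {
    (0, "-"): 1,
    **{(s, d): 2 for s in (0, 1, 2) for d in "0123456789"},
}

def token_survives(state, tok):
    """True if feeding tok's characters never drives the DFA into a dead state."""
    for ch in tok:
        state = TRANS.get((state, ch))
        if state is None:
            return False
    return True

vocab = ["-", "12", "-3", "a", "4b"]
# Precompute one allowed-token mask per DFA state, computed once and reused
# at every decoding step that lands in that state:
mask_cache = {s: [token_survives(s, t) for t in vocab] for s in (0, 1, 2)}
```

The masks differ per state: "-" is legal from the start state but not after a digit or an earlier "-", which is information a single static mask could not capture.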
Mathematical formulation
Given vocabulary $V$, current state $s$, and grammar $G$, define the allowed set $A(s) \subseteq V$. The constrained sampling distribution is
$$p_{\text{constr}}(v \mid s, \text{ctx}) = \begin{cases} \frac{p(v \mid \text{ctx})}{\sum_{v' \in A(s)} p(v' \mid \text{ctx})} & v \in A(s) \\ 0 & \text{otherwise} \end{cases}$$
This is re-normalised over the allowed set, which preserves the relative likelihoods of legal tokens.
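A small numeric check of this property, in plain Python (masking before the softmax is equivalent to renormalising after it):

```python
import math

def constrained_probs(logits, allowed):
    """Softmax over only the allowed positions; disallowed positions get 0."""
    masked = [l if a else -math.inf for l, a in zip(logits, allowed)]
    m = max(masked)
    exps = [math.exp(l - m) if l != -math.inf else 0.0 for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
allowed = [True, False, True, True]
p = constrained_probs(logits, allowed)
# p[1] == 0, and the ratio p[0] / p[2] equals exp(2.0 - 0.5) = exp(1.5),
# exactly as in the unconstrained softmax: relative likelihoods are preserved.
```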
Use cases
| Constraint | Use |
|---|---|
| JSON Schema | Structured outputs |
| Regex | Phone numbers, dates, identifiers |
| Context-free grammar | SQL, programming languages |
| Function call schema | Function calling |
| Choice list | Multiple-choice answers, classification |
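The choice-list case is the simplest to implement end to end. The sketch below (a stub "model" with fixed logits stands in for a real LLM; all names are illustrative) wires the grammar interface from the decoding loop above to a greedy decoder:

```python
class ChoiceGrammar:
    """Choice-list constraint: output must be exactly one of the given strings."""
    def __init__(self, choices, vocab):
        self.choices, self.vocab = choices, vocab

    def start(self):
        return ""  # state = text generated so far

    def allowed_tokens(self, state):
        return [any(c.startswith(state + t) for c in self.choices)
                for t in self.vocab]

    def transition(self, state, tok):
        return state + tok

    def is_accepting(self, state):
        return state in self.choices

def constrained_greedy(score, grammar, vocab, max_tokens=8):
    state, out = grammar.start(), []
    for _ in range(max_tokens):
        logits = score(out)  # stub model: returns one logit per vocab entry
        allowed = grammar.allowed_tokens(state)
        best = max((i for i, a in enumerate(allowed) if a),
                   key=lambda i: logits[i])
        out.append(vocab[best])
        state = grammar.transition(state, vocab[best])
        if grammar.is_accepting(state):
            break
    return "".join(out)

vocab = ["yes", "no", "maybe", "may", "be"]
grammar = ChoiceGrammar(["yes", "no", "maybe"], vocab)
score = lambda out: [0.1, 0.2, 0.0, 0.9, 0.8]  # model always prefers "may"
constrained_greedy(score, grammar, vocab)  # -> "maybe"
```

After emitting "may", the only token that keeps the output a prefix of a legal choice is "be", so the mask forces a valid completion regardless of the model's other preferences.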
Trade-offs
- Pro: guarantees 100% syntactic validity; lower latency than retry-and-reparse loops; lets small open-weight models handle format-sensitive tasks that previously required frontier models.
- Con: can mask the model's preferred phrasing, slightly hurting output quality; heavy grammars (e.g. full SQL) can impose a 5–20% throughput cost.
Citation
Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv:2307.09702.
Related terms: Structured Outputs, Function Calling
Discussed in:
- Chapter 15: Modern AI, Tool Use