Constrained decoding is the umbrella technique behind structured outputs, JSON mode, regex-bounded generation, and grammar-bounded code synthesis. It modifies the autoregressive decoding loop so that only tokens consistent with a formal constraint can be sampled.
The basic loop
Standard LLM decoding is:

```python
for t in range(max_tokens):
    logits = model.forward(tokens)
    next_token = sample(softmax(logits))
    tokens.append(next_token)
```
Constrained decoding inserts a mask before sampling:

```python
state = grammar.start()
for t in range(max_tokens):
    logits = model.forward(tokens)
    allowed = grammar.allowed_tokens(state)  # boolean mask over the vocabulary
    logits[~allowed] = -float("inf")         # illegal tokens get zero probability
    next_token = sample(softmax(logits))
    state = grammar.transition(state, next_token)
    tokens.append(next_token)
    if grammar.is_accepting(state) and next_token == EOS:
        break
```
The challenge is computing allowed_tokens(state) cheaply. Tokenisers split strings into BPE chunks that don't align with grammar tokens, so a single grammar production may correspond to thousands of legal token sequences.
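A minimal sketch of the naive approach, using a toy vocabulary and a choice-list constraint (all names here are illustrative): a token is allowed if appending it to the text generated so far still leaves a prefix of some legal string. Note that a multi-character BPE token like "itive" is only legal in certain states, which is exactly the alignment problem described above.

```python
def allowed_tokens(vocab, legal_strings, generated):
    """Return the vocab tokens that keep the output a prefix of a legal string."""
    allowed = []
    for tok in vocab:
        candidate = generated + tok
        if any(s.startswith(candidate) for s in legal_strings):
            allowed.append(tok)
    return allowed

vocab = ["pos", "neg", "itive", "ative", "neutral", "ive", "xyz"]
legal = ["positive", "negative", "neutral"]

allowed_tokens(vocab, legal, "")     # ['pos', 'neg', 'neutral']
allowed_tokens(vocab, legal, "pos")  # ['itive']
```

This scan is O(|V| * |constraints|) per step, which is why the libraries below precompute and cache the allowed set per grammar state instead.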
Algorithms
- Outlines (Willard & Louf, 2023): pre-compiles a regex into a finite-state machine, then for each FSM state intersects the vocabulary with the set of reachable token prefixes; the allowed set is cached per state.
- xgrammar (Dong et al., 2024): supports full context-free grammars with a persistent stack; reports under 1% overhead versus unconstrained decoding.
- Guidance: interleaves grammar and prompt; supports gen() with stop tokens and patterns.
- lm-format-enforcer: enforces JSON Schema and regex output formats.
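The per-state caching idea behind the Outlines approach can be sketched as follows (a simplified illustration, not the library's actual API): compile the regex into a character-level DFA, then for each DFA state precompute the boolean mask of tokens whose full character sequence keeps the DFA alive.

```python
# Hand-built DFA for the regex -?[0-9]+ : state 0 = start, 1 = seen '-', 2 = accepting
TRANS = {
    (0, "-"): 1,
    **{(s, d): 2 for s in (0, 1, 2) for d in "0123456789"},
}

def token_survives(state, tok):
    """True if feeding tok's characters never drives the DFA into a dead state."""
    for ch in tok:
        state = TRANS.get((state, ch))
        if state is None:
            return False
    return True

vocab = ["-", "12", "-3", "a", "4b"]
# Precompute one allowed-token mask per DFA state, computed once and reused
# at every decoding step that lands in that state:
mask_cache = {s: [token_survives(s, t) for t in vocab] for s in (0, 1, 2)}
```

The masks differ per state: "-" is legal from the start state but not after a digit or an earlier "-", which is information a single static mask could not capture.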
Mathematical formulation
Given vocabulary $V$, current state $s$, and grammar $G$, define the allowed set $A(s) \subseteq V$. The constrained sampling distribution is
$$p_{\text{constr}}(v \mid s, \text{ctx}) = \begin{cases} \frac{p(v \mid \text{ctx})}{\sum_{v' \in A(s)} p(v' \mid \text{ctx})} & v \in A(s) \\ 0 & \text{otherwise} \end{cases}$$
This is re-normalised over the allowed set, which preserves the relative likelihoods of legal tokens.
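A small numeric check of this property, in plain Python (masking before the softmax is equivalent to renormalising after it):

```python
import math

def constrained_probs(logits, allowed):
    """Softmax over only the allowed positions; disallowed positions get 0."""
    masked = [l if a else -math.inf for l, a in zip(logits, allowed)]
    m = max(masked)
    exps = [math.exp(l - m) if l != -math.inf else 0.0 for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]
allowed = [True, False, True, True]
p = constrained_probs(logits, allowed)
# p[1] == 0, and the ratio p[0] / p[2] equals exp(2.0 - 0.5) = exp(1.5),
# exactly as in the unconstrained softmax: relative likelihoods are preserved.
```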
Use cases
| Constraint | Use |
|---|---|
| JSON Schema | Structured outputs |
| Regex | Phone numbers, dates, identifiers |
| Context-free grammar | SQL, programming languages |
| Function call schema | Function calling |
| Choice list | Multiple-choice answers, classification |
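The choice-list case is the simplest to implement end to end. The sketch below (a stub "model" with fixed logits stands in for a real LLM; all names are illustrative) wires the grammar interface from the decoding loop above to a greedy decoder:

```python
class ChoiceGrammar:
    """Choice-list constraint: output must be exactly one of the given strings."""
    def __init__(self, choices, vocab):
        self.choices, self.vocab = choices, vocab

    def start(self):
        return ""  # state = text generated so far

    def allowed_tokens(self, state):
        return [any(c.startswith(state + t) for c in self.choices)
                for t in self.vocab]

    def transition(self, state, tok):
        return state + tok

    def is_accepting(self, state):
        return state in self.choices

def constrained_greedy(score, grammar, vocab, max_tokens=8):
    state, out = grammar.start(), []
    for _ in range(max_tokens):
        logits = score(out)  # stub model: returns one logit per vocab entry
        allowed = grammar.allowed_tokens(state)
        best = max((i for i, a in enumerate(allowed) if a),
                   key=lambda i: logits[i])
        out.append(vocab[best])
        state = grammar.transition(state, vocab[best])
        if grammar.is_accepting(state):
            break
    return "".join(out)

vocab = ["yes", "no", "maybe", "may", "be"]
grammar = ChoiceGrammar(["yes", "no", "maybe"], vocab)
score = lambda out: [0.1, 0.2, 0.0, 0.9, 0.8]  # model always prefers "may"
constrained_greedy(score, grammar, vocab)  # -> "maybe"
```

After emitting "may", the only token that keeps the output a prefix of a legal choice is "be", so the mask forces a valid completion regardless of the model's other preferences.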
Trade-offs
- Pro: guarantees 100% syntactic validity; lower latency than retry-and-reparse loops; lets small open-weight models handle format-sensitive tasks that previously required frontier models.
- Con: can mask the model's preferred phrasing, slightly hurting output quality; heavy grammars (e.g. full SQL) can impose a 5–20% throughput cost.
Citation
Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv:2307.09702.
Related terms: Structured Outputs, Function Calling
Discussed in:
- Chapter 15: Modern AI, Tool Use