Glossary

Constrained Decoding

Constrained decoding is the umbrella technique behind structured outputs, JSON mode, regex-bounded generation, and grammar-bounded code synthesis. It modifies the autoregressive decoding loop so that only tokens consistent with a formal constraint can be sampled.

The basic loop

Standard LLM decoding is:

for t in range(max_tokens):
    logits = model.forward(tokens)          # scores over the whole vocabulary
    next_token = sample(softmax(logits))    # draw one token id
    tokens.append(next_token)

Constrained decoding inserts a mask:

state = grammar.start()
for t in range(max_tokens):
    logits = model.forward(tokens)
    allowed = grammar.allowed_tokens(state)   # boolean mask over the vocabulary
    logits[~allowed] = -float("inf")          # forbid every illegal token
    next_token = sample(softmax(logits))      # only legal tokens carry mass
    state = grammar.transition(state, next_token)
    tokens.append(next_token)
    if next_token == EOS and grammar.is_accepting(state):
        break

The challenge is computing allowed_tokens(state) cheaply. Tokenisers split strings into BPE chunks that don't align with grammar tokens, so a single grammar production may correspond to thousands of legal token sequences.
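
To make the misalignment concrete, here is a toy sketch (the subword vocabulary and helper are illustrative, not taken from any real tokeniser) that enumerates every token sequence spelling the single JSON terminal true:

# Toy BPE-style vocabulary (illustrative only).
vocab = ["t", "r", "u", "e", "tr", "ue", "tru", "rue", "true"]

def spellings(target, prefix=()):
    """Enumerate every token sequence whose concatenation equals `target`."""
    if target == "":
        yield prefix
        return
    for tok in vocab:
        if target.startswith(tok):
            yield from spellings(target[len(tok):], prefix + (tok,))

for seq in spellings("true"):
    print(seq)
# ('t', 'r', 'u', 'e'), ('t', 'r', 'ue'), ('t', 'rue'), ... ('true',)

Even this nine-token toy vocabulary yields seven distinct spellings; a realistic 50k-token vocabulary makes naive enumeration infeasible, which is why the algorithms below pre-compile the constraint.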

Algorithms

  • Outlines (Willard & Louf, 2023) pre-compiles a regex into an FSM, then for each FSM state intersects the vocabulary with the tokens readable from that state; results are cached per state (see the sketch after this list).
  • xgrammar (Dong et al., 2024) handles context-free grammars with a persistent stack, reporting <1% overhead versus unconstrained decoding.
  • Guidance interleaves grammar and prompt, and supports gen() with stop tokens and patterns.
  • lm-format-enforcer provides a JSON-Schema-specific FSM.
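
As a rough sketch of the Outlines idea (a simplified illustration, not the library's actual API): pre-compute, for each state of a character-level DFA, which whole vocabulary tokens the DFA can consume from that state and where each one leads:

# Hand-built character-level DFA for the regex (ab)+ (illustrative).
# States: 0 = start, 1 = saw 'a', 2 = saw 'ab' (accepting).
DFA = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
ACCEPTING = {2}
vocab = ["a", "b", "ab", "ba", "abab", "bb"]

def run(state, token):
    """Feed a token's characters through the DFA; None if it dies."""
    for ch in token:
        state = DFA.get((state, ch))
        if state is None:
            return None
    return state

# For each DFA state, the allowed tokens and their destination states.
# Computed once per grammar, then reused at every decoding step.
masks = {s: {tok: run(s, tok) for tok in vocab if run(s, tok) is not None}
         for s in (0, 1, 2)}
print(masks[0])  # {'a': 1, 'ab': 2, 'abab': 2}

The expensive vocabulary scan happens once at compile time; during decoding, allowed_tokens(state) is a dictionary lookup.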

Mathematical formulation

Given vocabulary $V$, current state $s$, and grammar $G$, define the allowed set $A(s) \subseteq V$. The constrained sampling distribution is

$$p_{\text{constr}}(v \mid s, \text{ctx}) = \begin{cases} \frac{p(v \mid \text{ctx})}{\sum_{v' \in A(s)} p(v' \mid \text{ctx})} & v \in A(s) \\ 0 & \text{otherwise} \end{cases}$$

This is re-normalised over the allowed set, which preserves the relative likelihoods of legal tokens.
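
The logit mask in the loop above realises exactly this distribution: setting disallowed logits to -inf before the softmax is algebraically the same as zeroing the illegal probabilities and re-normalising over $A(s)$. A small numerical check (toy values):

import math

logits = [2.0, 1.0, 0.5, -1.0]          # unconstrained logits, 4-token vocab
allowed = [True, False, True, False]    # A(s): tokens 0 and 2 are legal

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

# Route 1: mask logits to -inf, then softmax.
masked = [x if a else -math.inf for x, a in zip(logits, allowed)]
p_mask = softmax(masked)

# Route 2: softmax first, then zero illegal tokens and re-normalise.
p = softmax(logits)
z = sum(pi for pi, a in zip(p, allowed) if a)
p_renorm = [pi / z if a else 0.0 for pi, a in zip(p, allowed)]

assert all(abs(x - y) < 1e-12 for x, y in zip(p_mask, p_renorm))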

Use cases

Constraint              Use
--------------------    ----------------------------------------
JSON Schema             Structured outputs
Regex                   Phone numbers, dates, identifiers
Context-free grammar    SQL, programming languages
Function call schema    Function calling
Choice list             Multiple-choice answers, classification
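
For the simplest constraint, a choice list, the "grammar" reduces to a prefix check over the candidate strings. A minimal character-level sketch (the deterministic pick stands in for model sampling):

# Choice-list constraint as a prefix trie check (character-level toy).
choices = ["yes", "no", "maybe"]

def allowed_next(prefix):
    """Set of next characters that keep `prefix` a prefix of some choice."""
    return {c[len(prefix)] for c in choices
            if c.startswith(prefix) and len(c) > len(prefix)}

prefix = ""
while prefix not in choices:
    options = allowed_next(prefix)   # the mask: only these chars may be sampled
    prefix += sorted(options)[0]     # stand-in for sampling from the model
print(prefix)                        # guaranteed to be a member of `choices`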

Trade-offs

  • Pro: 100% syntactic validity; lower latency than validate-and-retry loops; lets small open-weight models handle structured tasks that formerly required frontier models such as GPT-4.
  • Con: masking can exclude the model's preferred phrasing, slightly hurting output quality; heavy grammars (e.g. full SQL) can impose a 5–20% throughput cost.

Citation

Willard, B. T., & Louf, R. (2023). Efficient Guided Generation for Large Language Models. arXiv:2307.09702.

Related terms: Structured Outputs, Function Calling
