Glossary

Inference Cost Economics

Inference economics differ sharply from training. Training is a one-off capital expense; inference cost accrues with every token served, billions per day for a popular product, and typically becomes the dominant compute spend once a model is deployed at scale.

Public price points (per 1 M tokens, mid-2025):

  • GPT-4o: ~\$5 input / \$15 output
  • Claude 3.5 Sonnet: \$3 input / \$15 output
  • Gemini 1.5 Pro: \$1.25 input / \$5 output (under 128k context)
  • Llama 3.1 405B (Together): ~\$3 / \$3
  • Llama 3.1 70B (Together): ~\$0.88 / \$0.88
  • Llama 3.1 8B (Together): ~\$0.18 / \$0.18
  • DeepSeek-V3: \$0.14 input / \$0.28 output
  • Frontier reasoning models (o1, Claude 3.7 Sonnet with extended thinking): 5–10× the headline non-reasoning price, plus large hidden reasoning-token consumption.

Cost trajectory: at fixed quality (e.g. "GPT-3.5 level"), API prices have fallen roughly 10× per year since 2022. GPT-3.5 cost \$20 per 1 M output tokens at launch; equivalent quality is now around \$0.20 from open-source providers. The drivers: cheaper hardware (\$/FLOP), smaller models reaching the same quality (algorithmic efficiency), better serving systems (vLLM, TensorRT-LLM), and competition.

Cost structure for the provider:

  • Prefill (processing the prompt) is compute-bound: $2 N_{\mathrm{params}} \cdot L_{\mathrm{prompt}}$ FLOPs, runs at high MFU.
  • Decode (generating output) is memory-bandwidth-bound: every token requires reading all model weights ($\sim 2 N_{\mathrm{params}}$ bytes in BF16) and the KV cache ($\sim 4 \cdot n_{\mathrm{layers}} \cdot n_{\mathrm{kv\,heads}} \cdot d_{\mathrm{head}} \cdot L_{\mathrm{ctx}}$ bytes per request in BF16: a factor of 2 for K and V, times 2 bytes per element) from HBM.

For a 70 B model in BF16 on H100 SXM (3.35 TB/s HBM): single-request decode ceiling is $3.35\,\mathrm{TB/s} / 140\,\mathrm{GB} \approx 24$ tokens/s. Throughput economics demand batching: serving 64 concurrent requests amortises the weight read across 64 tokens generated, multiplying tokens-per-second-per-GPU by ~50× until the KV cache itself dominates bandwidth.
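
To make the bandwidth arithmetic concrete, here is a minimal back-of-envelope sketch in Python. It assumes Llama-3.1-70B-like shapes (80 layers, 8 KV heads, head dim 128) and a 4k context; all figures are illustrative assumptions, not measurements.

```python
# Back-of-envelope decode throughput for a dense 70B model in BF16 on one H100 SXM.
# Model shapes are Llama-3.1-70B-like (80 layers, 8 KV heads, head dim 128) - assumed.

HBM_BW = 3.35e12              # H100 SXM HBM3 bandwidth, bytes/s
WEIGHT_BYTES = 70e9 * 2       # 70B params x 2 bytes (BF16) ~= 140 GB read per decode step

def kv_cache_bytes(n_layers=80, n_kv_heads=8, d_head=128, ctx=4096, bytes_per_el=2):
    # 2 tensors (K and V) x bytes/element x layers x KV heads x head dim x context length
    return 2 * bytes_per_el * n_layers * n_kv_heads * d_head * ctx

def decode_tokens_per_s(batch=1, ctx=4096):
    """Memory-bandwidth ceiling: one weight read per step plus one KV-cache read per request."""
    bytes_per_step = WEIGHT_BYTES + batch * kv_cache_bytes(ctx=ctx)
    return (HBM_BW / bytes_per_step) * batch   # each decode step emits one token per request

print(f"batch=1 : {decode_tokens_per_s(1):7.1f} tok/s")   # ~24 tok/s, matching the figure above
print(f"batch=64: {decode_tokens_per_s(64):7.1f} tok/s")  # weight read amortised over 64 requests
```

At batch 64 the same weight read serves 64 requests per step, but each request drags its own KV cache across the bus, which is why the gain eventually saturates as context length grows.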

Levers that matter:

  1. Batching: continuous/in-flight batching (vLLM, Triton) keeps the GPU fed with a varying mixture of prefill and decode steps.
  2. KV-cache reuse: prompt caching (Anthropic prompt caching, OpenAI cached input) bills cache hits at a steep discount (roughly 10× cheaper for Anthropic cache reads), and costs the provider little, since prefill compute for the cached prefix is skipped.
  3. PagedAttention (vLLM): KV cache stored in non-contiguous pages, allowing >90 % GPU-memory utilisation versus ~40 % for naive contiguous allocation.
  4. Speculative decoding: a small draft model proposes $k$ tokens; the large model verifies them in one forward pass, accepting the longest agreeing prefix. Typically a 2–3× decode speedup with no change to output quality (see the sketch after this list).
  5. Quantisation: FP8 / INT4 weight-only halves or quarters HBM pressure (see Quantisation for Inference).
  6. Disaggregation: separate prefill and decode pools (Splitwise, DistServe), since the two phases have orthogonal bottlenecks.
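
A minimal sketch of the speculative-decoding accept loop referenced in lever 4. This is the greedy variant for readability; production systems use rejection sampling to preserve the target model's output distribution. `draft_model` and `target_model` are placeholder callables, not a real library API.

```python
def speculative_step(prefix, draft_model, target_model, k=4):
    """One speculative-decoding step (greedy variant, illustrative only).

    draft_model(tokens)  -> next-token id (called k times; cheap small model)
    target_model(tokens) -> list of next-token ids, one per position in `tokens`
                            (a transformer scores every position in one forward pass)
    """
    # 1. Cheap draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. One large-model forward pass over prefix + draft scores all k positions at once.
    preds = target_model(list(prefix) + draft)
    # preds[len(prefix) - 1 + i] is the target's choice after seeing prefix + draft[:i].

    # 3. Accept the longest prefix of the draft the target agrees with; on the first
    #    disagreement, emit the target's own token instead and stop.
    accepted = []
    for i in range(k):
        target_tok = preds[len(prefix) - 1 + i]
        if target_tok == draft[i]:
            accepted.append(draft[i])
        else:
            accepted.append(target_tok)
            break
    else:
        # All k draft tokens matched: the target's (k+1)-th token comes free from the same pass.
        accepted.append(preds[len(prefix) - 1 + k])
    return accepted
```

Every call emits between 1 and $k{+}1$ tokens for the cost of one large-model forward pass plus $k$ cheap draft steps, which is where the 2–3× figure comes from when draft acceptance rates are high.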

Reasoning-model economics are different: o1-style models can emit thousands of hidden reasoning tokens per visible output token, so effective \$/answer is 10–100× the headline rate. Output-bound workloads can dominate API bills; cost engineering increasingly means reasoning-budget control.
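
A rough cost-per-answer calculation showing how hidden reasoning tokens and prompt caching interact. All prices, discounts, and token counts below are illustrative placeholders, not quoted rates.

```python
def cost_per_answer(prompt_tokens, visible_output_tokens, hidden_reasoning_tokens=0,
                    price_in=5.0, price_out=15.0, cached_fraction=0.0, cache_discount=0.1):
    """Prices in $ per 1M tokens; cached prompt tokens billed at cache_discount * price_in."""
    cached = prompt_tokens * cached_fraction
    uncached = prompt_tokens - cached
    input_cost = (uncached * price_in + cached * price_in * cache_discount) / 1e6
    # Hidden chain-of-thought tokens are billed as output even though they are never shown.
    output_cost = (visible_output_tokens + hidden_reasoning_tokens) * price_out / 1e6
    return input_cost + output_cost

plain = cost_per_answer(2_000, 500)                                   # non-reasoning call
reasoning = cost_per_answer(2_000, 500, hidden_reasoning_tokens=20_000,
                            price_in=15.0, price_out=60.0)           # reasoning-model call
print(f"plain:     ${plain:.4f} per answer")
print(f"reasoning: ${reasoning:.4f} per answer  (~{reasoning / plain:.0f}x)")
```

With these placeholder numbers the reasoning call comes out around 70× the plain call per answer, which is why reasoning-budget control (capping hidden tokens) matters more than shaving the headline per-token price.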

Related terms: KV Cache, PagedAttention, vLLM, Quantisation for Inference, FlashAttention Internals
