Glossary

DeepSeek-V3

DeepSeek-V3 is a large mixture-of-experts language model released by the Chinese AI lab DeepSeek on 26 December 2024 under a permissive licence. It marked an inflection point in the open-weights race: a frontier-quality base model trained at a fraction of the cost reported by Western labs.

Scale. 671 billion total parameters with 37 billion active per token, organised as a fine-grained mixture-of-experts: 256 routed experts plus 1 shared expert per MoE layer, with each token routed to its top 8 experts. Pre-training corpus: 14.8 trillion tokens of multilingual text and code, with a heavy emphasis on Chinese and English. Context length: 128K tokens.
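
A minimal sketch of the top-k routing step, in PyTorch. The sigmoid gate follows the broad shape of the V3 design, but every dimension here is a toy value, not the production configuration:

```python
import torch

def route_tokens(hidden, gate_weight, top_k=8):
    """Toy fine-grained MoE router: each token selects its top-k routed
    experts; a shared expert (not shown) always processes every token."""
    scores = torch.sigmoid(hidden @ gate_weight)           # per-expert affinity
    topk_vals, topk_idx = scores.topk(top_k, dim=-1)       # pick 8 of 256
    weights = topk_vals / topk_vals.sum(-1, keepdim=True)  # normalise gates
    return topk_idx, weights

h = torch.randn(4, 64)        # 4 tokens, toy hidden size 64
g = torch.randn(64, 256)      # router weights for 256 routed experts
idx, w = route_tokens(h, g)   # idx, w: (4, 8) each
```

Because only 8 of 256 routed experts run on any given token, most parameters sit idle at each step, which is how 671B total parameters yield only 37B active.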

Cost. DeepSeek's technical report disclosed a total pre-training cost of roughly **$5.6 million** in compute, based on 2.788 million H800 GPU-hours at an assumed $2 per GPU-hour. The figure excludes prior research, ablations, and post-training, but the headline shock was real: a model competitive with GPT-4o had been trained for more than an order of magnitude less than the rumoured cost of the original GPT-4. Whether the comparison was apples-to-apples, or whether DeepSeek's true investment was higher, became a months-long debate, but the result reset expectations about the capital required to reach the frontier.
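
The headline number is the straightforward product of the two disclosed figures:

```python
gpu_hours = 2.788e6      # disclosed H800 GPU-hours for the training run
usd_per_hour = 2.00      # rental rate assumed in the technical report
total = gpu_hours * usd_per_hour
print(f"${total:,.0f}")  # $5,576,000 -- the ~$5.6 million headline
```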

Architectural innovations. Two are widely cited.

  • Multi-Head Latent Attention (MLA) compresses keys and values into a low-rank latent space before caching, cutting KV-cache memory by roughly an order of magnitude during inference and enabling long-context throughput (first sketch below).
  • Auxiliary-loss-free load balancing for the MoE router: instead of penalising imbalanced expert utilisation with an extra loss term, DeepSeek tracks a per-expert bias and adjusts it online to equalise load without distorting the gradient (second sketch below).
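
A minimal sketch of the MLA idea, with made-up dimensions (the real design also routes rotary position information through a separate path):

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Illustrative low-rank KV compression: cache only the small latent
    vector, reconstruct full keys/values on demand at attention time."""
    def __init__(self, d_model=4096, d_latent=512, n_heads=32, d_head=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand K
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand V

    def forward(self, hidden):           # hidden: (seq, d_model)
        latent = self.down(hidden)       # (seq, 512): all that gets cached
        return latent, self.up_k(latent), self.up_v(latent)

kv = LatentKV()
latent, k, v = kv(torch.randn(3, 4096))  # 3 tokens
```

Caching the 512-wide latent instead of full per-head keys and values (2 × 32 × 128 = 8,192 floats per token in this toy configuration) is the source of the order-of-magnitude memory saving.

And a sketch of the bias-adjustment rule for load balancing (the step size and update schedule are illustrative):

```python
import torch

def update_router_bias(bias, expert_load, step=1e-3):
    """Nudge each expert's routing bias down when it is overloaded and up
    when it is underloaded. The bias only influences which experts get
    selected, not the gate weights in the output, so the training gradient
    is left undistorted."""
    return bias - step * torch.sign(expert_load - expert_load.mean())
```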

The model also uses multi-token prediction as an auxiliary objective during pre-training; at deployment, the extra prediction head can be repurposed as a draft model for speculative decoding, speeding up inference.
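
A simplified greedy sketch of the speculative-decoding verification step (the full algorithm uses rejection sampling to preserve the target model's sampling distribution; everything here is illustrative):

```python
import torch

def verify_draft(target_logits, draft_tokens):
    """Accept drafted tokens while they match the main model's argmax;
    on the first mismatch, substitute the main model's own token and stop.
    target_logits: (n_draft, vocab) from one verification forward pass."""
    accepted = []
    for logits, drafted in zip(target_logits, draft_tokens):
        best = int(logits.argmax())
        accepted.append(best)            # equals the draft token on a match
        if best != drafted:
            break                        # tokens after the mismatch are discarded
    return accepted

logits = torch.randn(3, 10)                    # verifier logits, 3 drafted positions
draft = [int(l.argmax()) for l in logits]
draft[-1] = (draft[-1] + 1) % 10               # force a mismatch on the last token
print(verify_draft(logits, draft))             # accepts 2 tokens, corrects the third
```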

Performance. DeepSeek-V3 matched or exceeded GPT-4o and Claude 3.5 Sonnet on most public English benchmarks at release, with particular strength in mathematics and code. Chinese-language performance led the field.

Strategic significance. V3 demonstrated that a small, focused team operating under US export controls (the H800 is a China-market H100 variant with reduced interconnect bandwidth) could reach the frontier through engineering discipline rather than brute-force scale. The release directly enabled DeepSeek R1 weeks later, which used V3-Base as the substrate for reinforcement learning on reasoning (in its purest form, the R1-Zero experiment). Together, V3 and R1 reshaped the competitive landscape and forced Western labs to defend their cost and capability moats publicly.

Open weights. The model weights, technical report, and inference code are publicly downloadable. By early 2026, V3 and its derivatives are widely used for self-hosting and fine-tuning, and as a base for further research.
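
For example, the weights can be pulled from the Hugging Face Hub under the deepseek-ai/DeepSeek-V3 repo. The minimal call below is a sketch rather than a practical deployment; in practice the full model needs a multi-GPU serving stack:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "deepseek-ai/DeepSeek-V3"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=True,  # custom MLA/MoE modelling code ships with the repo
    device_map="auto",       # shard across available GPUs
)
```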

Related terms: Mixture of Experts, DeepSeek R1-Zero, Transformer, Llama 3 / 3.1 / 3.3
