Model stealing (also called model extraction or, in the generative setting, a distillation attack) is the act of reconstructing a target model's behaviour, and sometimes its parameters or architecture, by querying its inference API and training a substitute model on the recorded responses. It threatens intellectual property (a model that cost millions to train can be approximately cloned for thousands of dollars in API queries), safety (the substitute may lack the original's safety training), and security downstream (the substitute can be used to craft adversarial examples that transfer back to the original).
Origin and mechanism
The threat was characterised by Tramèr, Zhang, Juels, Reiter, and Ristenpart (2016) in Stealing Machine Learning Models via Prediction APIs. The basic recipe (sketched in code after the list):
1. Sample query inputs, either from a public dataset that overlaps the target's domain or generated synthetically.
2. Query the target API and collect the responses (class labels, confidence scores, logits, or, for generative models, full output sequences).
3. Train the substitute by supervised distillation on the (input, response) pairs.
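A minimal sketch of this loop for a classification API. The `query_target` endpoint is a hypothetical stand-in for the victim's API, and the substitute is a small scikit-learn model chosen purely for illustration:

```python
# Minimal sketch of the basic extraction loop for a classification API.
# `query_target` is a hypothetical placeholder for the victim's inference
# endpoint; the substitute is a small scikit-learn model for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier

def query_target(x: np.ndarray) -> int:
    """Placeholder: one API call returning the target's predicted label."""
    raise NotImplementedError("replace with a real API call")

def extract_substitute(query_pool: np.ndarray) -> MLPClassifier:
    # 1. Sample query inputs (here: a pre-assembled pool of candidate inputs).
    # 2. Query the target API and record its responses.
    labels = np.array([query_target(x) for x in query_pool])
    # 3. Train the substitute on the (input, response) pairs.
    substitute = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
    substitute.fit(query_pool, labels)
    return substitute
```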
For generative LLMs, the equivalent procedure is to use the target as a teacher in knowledge distillation: query it with a large, diverse instruction set and fine-tune a smaller student on the outputs. The student inherits much of the teacher's behaviour at a small fraction of the original training cost. Open-source models such as Alpaca, Vicuna, and many of their successors were trained partly via this route on responses from OpenAI models, a use that sits in legal grey territory and that OpenAI's terms of service prohibit.
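In the generative setting the loop reduces to data collection followed by standard supervised fine-tuning. A sketch, where `teacher_generate` and the instruction set are hypothetical stand-ins for the target API and the query corpus:

```python
# Sketch of distillation-style cloning of a generative model.
# `teacher_generate` stands in for the target LLM's API; `instructions`
# is a diverse prompt set (both hypothetical). The output file is a
# standard supervised fine-tuning dataset for a smaller student model.
import json

def teacher_generate(prompt: str) -> str:
    """Placeholder: one call to the target model's completion/chat endpoint."""
    raise NotImplementedError

def build_distillation_set(instructions: list[str], out_path: str) -> None:
    with open(out_path, "w") as f:
        for prompt in instructions:
            response = teacher_generate(prompt)
            # Each record is an (instruction, teacher response) pair; the
            # student is then fine-tuned on these with a standard LM loss.
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```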
Specialised variants
- Functional stealing: reproduce only the input-output function.
- Parameter stealing: recover the actual weights; viable only for very small models (e.g. logistic regression, decision trees). See the equation-solving sketch after this list.
- Architecture stealing: recover the model's structure via timing or memory-access side channels.
- Prompt extraction: for system-prompted commercial APIs, recover the hidden system prompt itself.
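For the parameter-stealing variant, the equation-solving attack of Tramèr et al. illustrates why confidence scores are so revealing: for a binary logistic regression, each returned probability p satisfies logit(p) = w·x + b, so d + 1 well-chosen queries pin down the weights exactly. A minimal sketch with a hypothetical `query_confidence` endpoint:

```python
# Sketch of the equation-solving attack on a binary logistic regression
# that returns confidence scores. Each query yields one linear equation
# logit(p) = w.x + b, so d + 1 independent queries recover (w, b) up to
# the API's numeric precision. `query_confidence` is hypothetical.
import numpy as np

def query_confidence(x: np.ndarray) -> float:
    """Placeholder: returns the target's P(y = 1 | x)."""
    raise NotImplementedError

def steal_logistic_regression(d: int) -> tuple[np.ndarray, float]:
    # d + 1 queries: the origin plus the d standard basis vectors.
    X = np.vstack([np.zeros(d), np.eye(d)])
    A = np.hstack([X, np.ones((d + 1, 1))])       # design matrix for [w, b]
    p = np.array([query_confidence(x) for x in X])
    y = np.log(p / (1 - p))                        # logit of each confidence
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # solve the linear system
    return theta[:-1], theta[-1]                   # (weights, bias)
```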
Defences
- Rate limiting and query monitoring: detect anomalous query patterns.
- Output watermarking: embed signals in responses that mark a substitute as derivative.
- Reduced output detail: return only the top-1 class rather than full logits.
- Differential-privacy-style noise added to outputs.
- Legal mechanisms: terms of service, copyright assertions on outputs.
None of these prevents a determined adversary; they raise cost and detectability.
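As an illustration of the output-hardening defences above, a serving wrapper might return only the top-1 label by default and perturb any scores it does expose. A sketch under those assumptions, with `model_probs` as a hypothetical stand-in for the deployed model, not any particular vendor's implementation:

```python
# Sketch of two output-hardening defences: top-1-only responses and
# noisy scores. `model_probs` is a hypothetical placeholder for the
# deployed model's full class-probability output.
import numpy as np

def model_probs(x: np.ndarray) -> np.ndarray:
    """Placeholder: the deployed model's class-probability vector."""
    raise NotImplementedError

def hardened_predict(x: np.ndarray, expose_scores: bool = False,
                     noise_scale: float = 0.05) -> dict:
    probs = model_probs(x)
    response = {"label": int(np.argmax(probs))}    # top-1 only by default
    if expose_scores:
        noisy = probs + np.random.normal(0.0, noise_scale, size=probs.shape)
        noisy = np.clip(noisy, 0.0, None)
        response["scores"] = (noisy / noisy.sum()).tolist()  # renormalised
    return response
```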
Status
Model stealing is widely practised, both legitimately (research distillation, the open-source community) and adversarially (low-cost model cloning). OpenAI's public allegations (early 2025) that DeepSeek had trained on distilled outputs of its models illustrate the commercial stakes. As frontier training runs approach \$1B and beyond, the **economic asymmetry** (roughly \$10⁶ of API queries can substantially distil a \$10⁹ training run) is structural and not easily resolved by technical means.
References
Tramèr et al. (2016). Stealing Machine Learning Models via Prediction APIs.
Hinton, Vinyals, and Dean (2015). Distilling the Knowledge in a Neural Network.
Carlini et al. (2024). Stealing Part of a Production Language Model.
Related terms: Privacy in ML, Membership Inference Attacks, Adversarial Examples
Discussed in:
- Chapter 14: Generative Models, Model stealing