ShareGPT is a community-collected corpus of ChatGPT conversation transcripts voluntarily exported by users via the ShareGPT.com browser extension between November 2022 and April 2023. Despite (or because of) its irregular provenance, it powered the Vicuna moment of March 2023, the first credible open replication of ChatGPT-quality dialogue behaviour.
Composition
The widely circulated ShareGPT-90K snapshot contains roughly 90,000 multi-turn conversations (later filtered to 70,000 by language and quality) totalling around 2 GB of JSON. Conversations are user-driven and span coding help, writing, explanations, role-play, homework, translation and adversarial probing of ChatGPT's policy boundaries. Each turn has speaker (human / gpt), text, and timestamps.
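For concreteness, records in this shape can be parsed with a few lines of Python. The field names used here ("conversations", "from", "value") follow the widely mirrored Hugging Face export of the snapshot and are an assumption; exact schemas and metadata vary across mirrors.

```python
import json

# Hypothetical records in the commonly mirrored ShareGPT schema; extra
# metadata (ids, timestamps) varies by snapshot.
sample = json.dumps([
    {
        "id": "abc123",
        "conversations": [
            {"from": "human", "value": "Explain recursion briefly."},
            {"from": "gpt", "value": "Recursion is when a function calls itself."},
        ],
    },
    {"id": "def456", "conversations": [{"from": "human", "value": "Hi"}]},
])

def load_multi_turn(raw_json, min_turns=2):
    """Keep only records with at least `min_turns` turns -- a toy filter
    in the spirit of the cleanup that cut the 90K snapshot to 70K."""
    records = json.loads(raw_json)
    return [r for r in records if len(r.get("conversations", [])) >= min_turns]

kept = load_multi_turn(sample)
print(len(kept))  # the single-turn record is filtered out
```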
Models trained on ShareGPT
Vicuna-13B (Chiang, Li, Lin et al., March 2023, https://lmsys.org/blog/2023-03-30-vicuna/) was fine-tuned from LLaMA-13B on 70 K ShareGPT conversations for roughly $300 in training cost and, as judged by GPT-4, reportedly reached 90% of ChatGPT's quality. This was the result that demonstrated open-weight models could approach commercial-chatbot quality through instruction-tuning alone.
Koala (Berkeley AI Research, April 2023), WizardLM, OpenChat, Vicuna-33B, LongChat, FastChat-T5 and many academic dialogue models followed the ShareGPT recipe.
Licensing controversy
ShareGPT's legal status is contested. The conversations are user-submitted, but they contain ChatGPT outputs, and OpenAI's Terms of Service explicitly forbid using outputs to develop competing AI products. Whether that restriction binds third parties who never agreed to the ToS is legally unsettled. Several derivatives, including the original ShareGPT dataset itself, have been quietly removed from Hugging Face after takedown requests, although mirrors persist.
Most production-quality instruction-tuning recipes have since moved to alternative dialogue datasets with cleaner provenance: OpenAssistant Conversations, UltraChat, WildChat, Magpie and No Robots.
Significance
ShareGPT is the canonical example of a distillation dataset: the conversational behaviour of a closed frontier model is captured indirectly through user transcripts, then transferred to an open base model via supervised fine-tuning. The Vicuna result demonstrated that modest amounts of high-quality dialogue data (70 K conversations, ~700 M tokens of supervision) could close most of the practical gap between open base models and closed assistants, a finding that has shaped open-model development ever since, even as the dataset itself has fallen out of use.
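The supervised fine-tuning step in that recipe amounts to flattening each conversation into a training string on which the base model is tuned. A minimal sketch, assuming the common from/value turn schema; the USER:/ASSISTANT: role tags are illustrative placeholders, not Vicuna's exact prompt template.

```python
def to_training_text(conversation, human_tag="USER:", gpt_tag="ASSISTANT:"):
    """Flatten one ShareGPT-style conversation into a single supervision
    string. In practice the SFT loss is usually computed only on the
    assistant turns; that masking detail is omitted here."""
    parts = []
    for turn in conversation:
        tag = human_tag if turn["from"] == "human" else gpt_tag
        parts.append(f"{tag} {turn['value']}")
    return "\n".join(parts)

# Hypothetical two-turn conversation for illustration.
example = [
    {"from": "human", "value": "What is distillation?"},
    {"from": "gpt", "value": "Training a student model on a teacher's outputs."},
]
print(to_training_text(example))
```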
Related terms: OpenAssistant Conversations (OASST), UltraChat and UltraFeedback, Anthropic HH-RLHF, RLHF
Discussed in:
- Chapter 14: Generative Models, Alignment and RLHF