UltraChat (Ding, Chen, Xu et al., arXiv:2305.14233, May 2023) is a 1.5-million-conversation synthetic instruction-tuning dataset produced by Tsinghua University's OpenBMB group. The companion UltraFeedback (Cui et al., arXiv:2310.01377, October 2023) provides a 64,000-prompt preference dataset that draws partly on the same prompt distribution.
UltraChat
UltraChat is generated entirely by two ChatGPT (gpt-3.5-turbo) instances conversing with each other under structured topic prompts. Thirty meta-topics within three broad sectors (Questions about the World, Writing and Creation, Assistance on Existing Materials) seed roughly 3 million topic-specific prompts that drive the two-agent dialogues. Each conversation is multi-turn (3-7 user-assistant exchanges), and the corpus totals approximately 3 GB / 660 M tokens of clean text.
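The loop below is a minimal sketch of this two-agent setup, written against the current openai Python client. The system prompt, turn count, and role-flipping details are illustrative assumptions rather than the paper's exact pipeline; only the overall user-simulator / assistant alternation follows the description above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(messages):
    """One gpt-3.5-turbo completion for a list of chat messages."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=1.0
    )
    return resp.choices[0].message.content

def generate_dialogue(opening_question, num_turns=4):
    """Alternate a user-simulator model and an assistant model."""
    user_system = (
        "You are a curious human user. Ask one natural follow-up "
        "question about the ongoing topic. Reply with the next user "
        "message only."
    )
    dialogue = [{"role": "user", "content": opening_question}]
    for _ in range(num_turns):
        # The assistant model answers the conversation so far.
        dialogue.append({"role": "assistant", "content": chat(dialogue)})
        # The user-simulator model sees the transcript with roles
        # flipped, so assistant answers arrive as incoming messages.
        flipped = [{"role": "system", "content": user_system}] + [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in dialogue
        ]
        dialogue.append({"role": "user", "content": chat(flipped)})
    return dialogue
```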
The synthetic origin avoids several problems of crowdsourced data: no PII leakage, no copyrighted source material, no human-rater bias, and essentially unlimited scale at low cost. It introduces a corresponding problem of distillation legality: UltraChat outputs are derived from OpenAI's API, whose terms of use forbid training competing models on its outputs.
UltraFeedback
UltraFeedback collects 64 K prompts from UltraChat, ShareGPT, Evol-Instruct, FLAN and other sources, generates 4 completions per prompt from a diverse pool of 17 models (LLaMA, Alpaca, Vicuna, Pythia, Falcon, GPT-3.5, GPT-4 and others) and uses GPT-4 as judge to score each completion along four axes: instruction-following, truthfulness, honesty and helpfulness. The result is a 256 K-completion preference dataset that underpins most modern open DPO recipes.
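A hedged sketch of the judging and pairing step follows. The judge prompt, JSON output format and summed-score ranking rule here are illustrative assumptions; the released pipeline uses detailed per-axis rubrics, and popular DPO binarisations pair the best-scoring completion against a worse one.

```python
import json

from openai import OpenAI

client = OpenAI()

AXES = ["instruction_following", "truthfulness", "honesty", "helpfulness"]

JUDGE_TEMPLATE = """Rate the assistant response on a 1-10 scale for each
of: {axes}. Answer with a JSON object mapping each axis to an integer.

### Prompt
{prompt}

### Response
{completion}"""

def judge(prompt, completion):
    """Ask GPT-4 for per-axis quality scores for one completion."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            axes=", ".join(AXES), prompt=prompt, completion=completion
        )}],
    )
    return json.loads(resp.choices[0].message.content)

def binarise(prompt, completions):
    """Score the 4 sampled completions and keep a best/worst DPO pair."""
    scored = [(c, judge(prompt, c)) for c in completions]
    ranked = sorted(scored, key=lambda cs: sum(cs[1][a] for a in AXES))
    return {"prompt": prompt,
            "chosen": ranked[-1][0],
            "rejected": ranked[0][0]}
```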
Models trained on UltraChat / UltraFeedback
Zephyr-7B-β (Hugging Face H4 team, October 2023), trained with SFT on a filtered UltraChat subset and DPO on binarised UltraFeedback, became the canonical proof that DPO plus synthetic preference data could match RLHF-PPO plus human preferences. Tulu-2 (AI2), Notus-7B, OpenHermes, Capybara, Starling-7B (which generalised the GPT-4-as-judge approach to K-wise preference rankings) and most subsequent open instruction-tuned models incorporate UltraChat / UltraFeedback either as primary data or as augmentation.
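Since these models lean on DPO, a minimal PyTorch sketch of the per-pair DPO objective is shown below. The log-probabilities are sequence-level sums of token log-probs for the chosen and rejected responses under the policy and a frozen SFT reference model; the tensors in the usage line are made-up toy numbers, and the Zephyr recipe itself used trl's DPOTrainer with β = 0.1 rather than a hand-rolled loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r)))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up sequence log-probs for a batch of two pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -9.4]))
print(loss.item())
```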
Licensing
UltraChat is released under the MIT licence; UltraFeedback under MIT with the OpenAI-derived-output caveat. The legality of distilling GPT-3.5/GPT-4 outputs into competitor models remains unsettled, but it has been treated as an acceptable risk by most major open-model releases since 2023.
Significance
UltraChat / UltraFeedback together demonstrated that synthetic instruction and preference data, generated cheaply from frontier API access, could substitute for expensive human annotation while matching its alignment quality. This finding underwrote much of the RLAIF (Reinforcement Learning from AI Feedback) research programme.
Related terms: Anthropic HH-RLHF, OpenAssistant Conversations (OASST), ShareGPT and Vicuna, RLHF
Discussed in:
- Chapter 14: Generative Models, Alignment and RLHF