UltraChat (Ding, Chen, Xu et al., arXiv:2305.14233, May 2023) is a 1.5-million-conversation synthetic instruction-tuning dataset produced by Tsinghua University's OpenBMB group. The companion UltraFeedback (Cui et al., arXiv:2310.01377, October 2023) provides a 64,000-prompt preference dataset that draws partly on the same prompt distribution.
UltraChat
UltraChat is generated entirely by two ChatGPT (gpt-3.5-turbo) instances conversing with each other under structured topic prompts. Thirty meta-topics within three broad sectors (Questions about the World, Writing and Creation, Assistance on Existing Materials) seed roughly 3 million topic-specific prompts that drive the two-agent dialogues. Each conversation is multi-turn (3-7 user-assistant exchanges), and the corpus totals approximately 3 GB / 660 M tokens of clean text.
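The loop below is a minimal sketch of this two-agent setup, written against the current openai Python client. The system prompt, turn count, and role-flipping details are illustrative assumptions rather than the paper's exact pipeline; only the overall user-simulator / assistant alternation follows the description above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(messages):
    """One gpt-3.5-turbo completion for a list of chat messages."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=1.0
    )
    return resp.choices[0].message.content

def generate_dialogue(opening_question, num_turns=4):
    """Alternate a user-simulator model and an assistant model."""
    user_system = (
        "You are a curious human user. Ask one natural follow-up "
        "question about the ongoing topic. Reply with the next user "
        "message only."
    )
    dialogue = [{"role": "user", "content": opening_question}]
    for _ in range(num_turns):
        # The assistant model answers the conversation so far.
        dialogue.append({"role": "assistant", "content": chat(dialogue)})
        # The user-simulator model sees the transcript with roles
        # flipped, so assistant answers arrive as incoming messages.
        flipped = [{"role": "system", "content": user_system}] + [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in dialogue
        ]
        dialogue.append({"role": "user", "content": chat(flipped)})
    return dialogue
```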
The synthetic origin avoids several problems of crowdsourced data: no PII leakage, no copyrighted source material, no human-rater bias, and essentially unlimited scale at low cost. It introduces a corresponding problem of distillation legality: UltraChat outputs are derived from OpenAI's API, whose terms of use forbid training competing models on its outputs.
UltraFeedback
UltraFeedback collects 64 K prompts from UltraChat, ShareGPT, Evol-Instruct, FLAN and other sources, generates 4 completions per prompt from a diverse pool of 17 models (LLaMA, Alpaca, Vicuna, Pythia, Falcon, GPT-3.5, GPT-4 and others) and uses GPT-4 as judge to score each completion along four axes: instruction-following, truthfulness, honesty and helpfulness. The result is a 256 K-completion preference dataset that underpins most modern open DPO recipes.
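A hedged sketch of the judging and pairing step follows. The judge prompt, JSON output format and summed-score ranking rule here are illustrative assumptions; the released pipeline uses detailed per-axis rubrics, and popular DPO binarisations pair the best-scoring completion against a worse one.

```python
import json

from openai import OpenAI

client = OpenAI()

AXES = ["instruction_following", "truthfulness", "honesty", "helpfulness"]

JUDGE_TEMPLATE = """Rate the assistant response on a 1-10 scale for each
of: {axes}. Answer with a JSON object mapping each axis to an integer.

### Prompt
{prompt}

### Response
{completion}"""

def judge(prompt, completion):
    """Ask GPT-4 for per-axis quality scores for one completion."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            axes=", ".join(AXES), prompt=prompt, completion=completion
        )}],
    )
    return json.loads(resp.choices[0].message.content)

def binarise(prompt, completions):
    """Score the 4 sampled completions and keep a best/worst DPO pair."""
    scored = [(c, judge(prompt, c)) for c in completions]
    ranked = sorted(scored, key=lambda cs: sum(cs[1][a] for a in AXES))
    return {"prompt": prompt,
            "chosen": ranked[-1][0],
            "rejected": ranked[0][0]}
```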
Models trained on UltraChat / UltraFeedback
Zephyr-7B-β (Hugging Face H4 team, October 2023), trained with SFT on a filtered UltraChat subset and DPO on binarised UltraFeedback, became the canonical proof that DPO plus synthetic preference data could match RLHF-PPO plus human preferences. Tulu-2 (AI2), Notus-7B, OpenHermes, Capybara, Starling-7B (which generalised the GPT-4-as-judge approach to K-wise preference rankings) and most subsequent open instruction-tuned models incorporate UltraChat / UltraFeedback either as primary data or as augmentation.
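Since these models lean on DPO, a minimal PyTorch sketch of the per-pair DPO objective is shown below. The log-probabilities are sequence-level sums of token log-probs for the chosen and rejected responses under the policy and a frozen SFT reference model; the tensors in the usage line are made-up toy numbers, and the Zephyr recipe itself used trl's DPOTrainer with β = 0.1 rather than a hand-rolled loss.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: -log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r)))."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with made-up sequence log-probs for a batch of two pairs:
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -9.8]), torch.tensor([-13.0, -9.4]))
print(loss.item())
```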
Licensing
UltraChat is released under the MIT licence; UltraFeedback under MIT with the OpenAI-derived-output caveat. The legality of distilling GPT-3.5/GPT-4 outputs into competitor models remains unsettled, but it has been treated as an acceptable risk by most major open-model releases since 2023.
Significance
UltraChat / UltraFeedback together demonstrated that synthetic instruction and preference data, generated cheaply from frontier API access, could substitute for expensive human annotation while matching its alignment quality. This finding underwrote much of the RLAIF (Reinforcement Learning from AI Feedback) research programme.
Related terms: Anthropic HH-RLHF, OpenAssistant Conversations (OASST), ShareGPT and Vicuna, RLHF
Discussed in:
- Chapter 14: Generative Models, Alignment and RLHF