TruthfulQA, introduced by Lin and colleagues in 2022, contains 817 questions across 38 categories (health, law, finance, politics, conspiracies, fiction, urban legends, common misconceptions, and more) that are specifically designed to elicit imitative falsehoods. Imitative falsehoods are wrong answers that a model produces because it has seen humans repeat them, for example, the myth that humans use only 10% of their brains, or that lightning never strikes the same place twice. The benchmark probes whether a model has learned to imitate human writing including its errors, or whether it has factual grounding that lets it correct the error.
The benchmark is evaluated in two modes:
- Multiple-choice (MC1, MC2): pick the single correct answer among 4–5 options (MC1) or assign higher probability to all true options than all false ones (MC2).
- Generation (open-ended): produce a free-form answer, then have a fine-tuned GPT-judge classifier score it for truthfulness and informativeness. The combined "% true and informative" is the headline number.
Performance trajectory. The 2022 paper reported that larger models were less truthful: GPT-3 175B scored 58% truthful (vs 76% for the 350M model), the inverse-scaling finding that motivated the benchmark. RLHF-tuned models reversed this: GPT-4 scored 59% MC2 → >90% with refusal-aware prompting. By 2024, Claude 3.5 Sonnet reached 84.7% on MC2 and 78% on the generation task. Frontier 2025 models cluster at 88–93% MC2.
Known issues. TruthfulQA has been heavily criticised on three counts. First, the GPT-judge grader is itself an LLM and inherits biases, particularly favouring verbose hedged answers over confidently correct ones. Second, several questions are genuinely contested (interpretations of vague historical claims, contested medical advice) and the "truthful" answer is debatable. Third, the benchmark is now in nearly every pretraining corpus and the grader is fine-tuned on a fixed training distribution, so improvements may reflect alignment to the grader rather than improved truthfulness. The inverse-scaling finding has not replicated cleanly on more recent model families.
Modern relevance. TruthfulQA is still reported on most safety-related model cards and remains the canonical benchmark for "imitative" hallucinations, but is no longer a primary frontier signal. It has been complemented by newer hallucination benchmarks like HaluEval, SimpleQA, and FreshQA.
Reference: Lin et al., "TruthfulQA: Measuring How Models Mimic Human Falsehoods", ACL 2022.
Discussed in:
- Chapter 7: Supervised Learning, Evaluation Metrics