Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, & Dan Mané (2016)
arXiv.
DOI: https://doi.org/10.48550/arxiv.1606.06565
Abstract. Identifies five concrete problems in AI safety (avoiding negative side effects, avoiding reward hacking, scalable oversight, safe exploration, and robustness to distributional shift) and outlines promising research directions for each.
Tags: ai-safety reward-hacking