Abstract. Introduces reinforcement learning from human preferences (the technique underlying RLHF): learning a reward model from pairwise human preference judgements and then optimising a policy against it with a KL penalty toward a reference distribution.
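To make the two stages concrete, here is a minimal sketch, not any particular paper's implementation: a Bradley-Terry-style pairwise preference loss for the reward model, and a per-sample KL-penalised objective for the policy. The function names, the `beta` coefficient, and the toy tensors are all assumptions for illustration.

```python
# Minimal sketch of the two stages the abstract describes, using toy tensors
# in place of real models and data. All names here are illustrative, not from
# any specific codebase.
import torch
import torch.nn.functional as F

# --- Stage 1: fit a reward model on pairwise preferences (Bradley-Terry) ---
def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Model P(chosen > rejected) = sigmoid(r_chosen - r_rejected) and
    # maximise its log-likelihood over the labelled pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# --- Stage 2: optimise the policy against the reward with a KL penalty ---
def kl_penalised_reward(reward: torch.Tensor,
                        logp_policy: torch.Tensor,
                        logp_reference: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # Per-sample objective: r(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)).
    # The second term penalises drift away from the reference distribution;
    # beta = 0.1 is an arbitrary illustrative value.
    return reward - beta * (logp_policy - logp_reference)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Scalar rewards the reward model assigns to preferred / dispreferred completions.
    r_chosen, r_rejected = torch.randn(8), torch.randn(8)
    print("preference loss:", preference_loss(r_chosen, r_rejected).item())

    # Per-sequence rewards and log-probabilities under policy and reference.
    reward = torch.randn(8)
    logp_policy, logp_ref = torch.randn(8), torch.randn(8)
    print("mean KL-penalised reward:",
          kl_penalised_reward(reward, logp_policy, logp_ref).mean().item())
```

In practice the penalised reward feeds a policy-gradient method such as PPO; the sketch stops at the objective itself.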
Tags: rlhf, alignment, reinforcement-learning