Science Cast

RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, February 26, 2024, 11:36am

This paper is a preprint and has not been certified by peer review.

Authors

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi

Abstract

Reinforcement learning from human feedback (RLHF) is effective at aligning large language models (LLMs) to human preferences, but gathering high-quality human preference labels is a key bottleneck. We conduct a head-to-head comparison of RLHF vs. RL from AI Feedback (RLAIF), a technique where preferences are labeled by an off-the-shelf LLM in lieu of humans, and we find that they result in similar improvements. On the task of summarization, human evaluators prefer generations from both RLAIF and RLHF over a baseline supervised fine-tuned model in ~70% of cases. Furthermore, when asked to rate RLAIF vs. RLHF summaries, humans prefer both at equal rates. These results suggest that RLAIF can yield human-level performance, offering a potential solution to the scalability limitations of RLHF.
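The abstract does not say how the off-the-shelf LLM's judgments become preference labels, or exactly how win rates are tallied. The following is a minimal sketch under stated assumptions, not the authors' code. It assumes the labeler LLM sees a text and two candidate summaries and exposes scores (for example, log-probabilities) for answering "1" or "2". A softmax over those scores gives a soft preference. The query is repeated with the candidates swapped and the two results are averaged to reduce position bias. The labeler interface, `soft_preference`, `ai_preference`, `win_rate`, and `toy_labeler` are all hypothetical names introduced for illustration.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical labeler interface (an assumption, not from the paper's text):
# given a text and two candidate summaries in the order shown to the LLM,
# return scores (e.g. log-probabilities) for the answers "1" and "2".
LogProbFn = Callable[[str, str, str], Tuple[float, float]]


def soft_preference(score_1: float, score_2: float) -> float:
    """Softmax over the two option scores; returns P(first option preferred)."""
    m = max(score_1, score_2)  # subtract the max for numerical stability
    e1 = math.exp(score_1 - m)
    e2 = math.exp(score_2 - m)
    return e1 / (e1 + e2)


def ai_preference(labeler: LogProbFn, text: str, a: str, b: str) -> float:
    """P(a preferred over b), averaged over both presentation orders
    to reduce the labeler's position bias."""
    p_a_shown_first = soft_preference(*labeler(text, a, b))
    # With the order swapped, the "first" option is b, so P(a) = 1 - P(first).
    p_a_shown_second = 1.0 - soft_preference(*labeler(text, b, a))
    return 0.5 * (p_a_shown_first + p_a_shown_second)


def win_rate(judgments: Sequence[bool]) -> float:
    """Fraction of pairwise comparisons in which the policy's output won."""
    if not judgments:
        raise ValueError("win_rate needs at least one judgment")
    return sum(judgments) / len(judgments)


def toy_labeler(text: str, first: str, second: str) -> Tuple[float, float]:
    """Hypothetical stand-in for an LLM: favors the shorter summary and
    adds a deliberate bias toward whichever option is shown first."""
    return -len(first) / 10 + 0.5, -len(second) / 10


if __name__ == "__main__":
    p = ai_preference(toy_labeler, "doc", "short", "a much longer summary")
    print(f"P(short preferred) = {p:.3f}")
    # Toy judgments only; these are not the paper's data.
    print(f"win rate = {win_rate([True, True, False, True]):.2f}")
```

Averaging over both orders makes the label antisymmetric by construction: `ai_preference(x, a, b) + ai_preference(x, b, a)` is always 1, so a labeler that simply favors the first slot cannot push both orderings the same way.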