@natolambert
A few big papers yesterday throwing the question "does RLAIF work?" into doubt. The first, by @archit_sharma97, is a pretty timely critique of RLAIF: it shows SFT on GPT-4 outputs > DPO + RLAIF on GPT-4 ratings of GPT-3.5 completions. A few things aren't surprising:

1. The most important thing you can do right now is have better completions going into fine-tuning, rather than worrying about the algorithm.
2. The second best thing is worrying about prompts, which we RLHFers don't do. ShareGPT and similar datasets without filtering are bad.

Preference optimization is likely more sensitive than SFT, given the newer optimizers. There's still lots more headroom in improving RLAIF than SFT, imo. Looking at popular datasets like Nectar or UltraFeedback, there are so many bugs that can be filtered out with heuristics (a sketch of the kind of filter I mean is at the end).

We also got HELM-Instruct, which showed wacky results across different evaluators: ratings from Amazon MTurk, Scale AI, GPT-4, and Claude aren't very correlated with each other (see the correlation sketch at the end too). Seeing this, I think training a model against the evaluator you'll actually use is prolly needed; judgments may not transfer between evaluators. So, while GPT-4-as-a-judge is popular, it has a long way to go.

Finally, the appendix of @Muennighoff's GRIT paper added more uncertainty over how we use GPT-4 judgments for MT-Bench.

Archit's paper: https://t.co/QeXRzseLCZ
HELM-Instruct: https://t.co/wODR3Srx44
GRIT paper: https://t.co/afm4c0hVt0
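
A minimal sketch of the heuristic filtering I mean for pairwise preference data. The record schema here ("prompt"/"chosen"/"rejected" string fields) and the specific thresholds are assumptions for illustration, not the actual UltraFeedback or Nectar format; adapt the keys and rules to the real dataset.

```python
# Sketch: heuristic filters for a pairwise preference dataset.
# Hypothetical schema: {"prompt": str, "chosen": str, "rejected": str}.

def is_clean(example: dict) -> bool:
    """Return False for pairs exhibiting common dataset bugs."""
    chosen, rejected = example["chosen"], example["rejected"]
    # Degenerate pair: identical completions carry no preference signal.
    if chosen.strip() == rejected.strip():
        return False
    # Empty or near-empty completions (threshold is arbitrary).
    if len(chosen.strip()) < 10 or len(rejected.strip()) < 10:
        return False
    # Crude check for completions truncated mid-sentence.
    if not chosen.rstrip().endswith((".", "!", "?", '"', "```")):
        return False
    # Refusal boilerplate winning the comparison.
    if chosen.lstrip().lower().startswith(("as an ai", "i'm sorry")):
        return False
    return True

examples = [
    {"prompt": "p", "chosen": "A complete answer.", "rejected": "A complete answer."},
    {"prompt": "p", "chosen": "A full, finished reply.", "rejected": "Something else."},
]
filtered = [ex for ex in examples if is_clean(ex)]
print(f"kept {len(filtered)}/{len(examples)}")  # kept 1/2
```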
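And a sketch of the kind of pairwise agreement check behind the HELM-Instruct observation. The ratings below are made up for illustration; in practice you'd use each evaluator's scores over the same set of completions.

```python
# Sketch: pairwise rank correlation between evaluators' ratings.
from scipy.stats import spearmanr

# Hypothetical 1-5 ratings from four evaluators over the same 8 completions.
ratings = {
    "mturk":  [3, 4, 2, 5, 1, 4, 3, 2],
    "scale":  [4, 3, 3, 4, 2, 2, 5, 1],
    "gpt4":   [5, 5, 1, 4, 2, 3, 4, 3],
    "claude": [2, 4, 3, 5, 1, 5, 2, 4],
}

names = list(ratings)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        rho, _ = spearmanr(ratings[a], ratings[b])
        print(f"{a} vs {b}: Spearman rho = {rho:.2f}")
```

Low pairwise rho across the board is what "the evaluators aren't very correlated" cashes out to, and why a reward model trained against one judge may not transfer to another.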