@arankomatsuzaki
Generative Reward Models Presents GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments https://t.co/f7oS828PwS https://t.co/gnchZQZ8L7