@_philschmid
Do we need RL to align LLMs with human feedback? 🤔 Direct Preference Optimization (DPO) lets you train models like ChatGPT directly on human preferences 🤯 @huggingface trained Zephyr, a 7B model fine-tuned with DPO that outperforms Llama-2 70B Chat on the MT-Bench benchmark! 🥇 🧶 https://t.co/sazlXZc6Il
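For context, the core of DPO is a single classification-style loss over preference pairs, with no reward model or RL loop. Below is a minimal sketch of that loss for one (chosen, rejected) pair; the scalar log-prob inputs and the function name are illustrative, not from the tweet or any specific library:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed token log-probability of a full response
    under the trainable policy or the frozen reference (SFT) model.
    beta controls how far the policy may drift from the reference.
    """
    # Implicit "rewards": log-ratios of policy vs. reference likelihood
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(logits): shrinks as the policy favors the chosen answer
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# If policy and reference agree exactly, the loss is log(2) ≈ 0.693;
# boosting the chosen response relative to the reference lowers it.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In practice these scalars come from summing per-token log-probs of batched responses, but the gradient signal is exactly this pairwise term.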