@jerryjliu0
Be careful of the eval metrics you use! ⚠️

One insight from trying different retrieval techniques to improve LLM/RAG apps: certain techniques are clearly better via the eyeball test 👀, but not every eval metric reflects that (it finds both answers are “correct”, “faithful to the context”, etc.).

We came up with a simple trick to better tease apart each technique - pass GPT-4 two responses and see which one it prefers (or have it output a tie)! 🧑‍⚖️⚖️

Take this example below. Technique 1 (auto-merging retrieval) clearly outputs more details than technique 2 (naive top-k) - see image 1. 🤔

But a semantic similarity evaluator yields the same numbers for both, or even worse (image 2).

💡 Instead, let’s just ask GPT-4 which answer is better given the question (image 3). Through this we see that our auto-merging retriever is preferred 65% of the time 🔥 (image 4)

Note: this is similar to how reward data is collected for RLHF (except instead of humans we use GPT).

Full guide here: https://t.co/JT7pE59t30
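
Here’s a minimal sketch of the pairwise-judging trick, assuming the openai>=1.0 Python client; the prompt wording and the `judge_pair` / `win_rate` helpers are illustrative names, not the exact code from the guide:

```python
# Minimal sketch of a GPT-4 pairwise judge (illustrative, not the guide's exact code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PAIRWISE_PROMPT = """\
Given a user question and two candidate answers, decide which answer is better.
Reply with exactly one token: "1" if answer 1 is better, "2" if answer 2 is
better, or "tie" if they are equally good.

Question: {question}

Answer 1: {answer_1}

Answer 2: {answer_2}
"""

def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Ask GPT-4 which of two RAG responses it prefers, or 'tie'."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the judgment as deterministic as possible
        messages=[
            {
                "role": "user",
                "content": PAIRWISE_PROMPT.format(
                    question=question, answer_1=answer_1, answer_2=answer_2
                ),
            }
        ],
    )
    return resp.choices[0].message.content.strip()

def win_rate(examples: list[dict]) -> float:
    """Fraction of questions where answer 1 (e.g. auto-merging) is preferred."""
    wins = sum(
        1
        for ex in examples
        if judge_pair(ex["question"], ex["auto_merging"], ex["naive_topk"]) == "1"
    )
    return wins / len(examples)
```

Running `win_rate` over an eval set is how you’d get a preference number like the 65% above. One caveat worth noting: LLM judges can have position bias (favoring whichever answer appears first), so in practice you’d also want to randomize the order of the two answers.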