@jerryjliu0
At the end of the day, GPT-4 is a stochastic parrot and doesn't always reason logically. If A > B, then that means B < A, right? Well…

We tried using GPT-4 as an evaluation module to pick the better of two answers to a question. Can you spot the issue in the attached diagram? 🖼️🤔

When asked whether it preferred answer 1 or answer 2, GPT-4 preferred answer 1. ❌ But when I swapped the order of answer 1 and answer 2, GPT-4 *still* preferred answer 1 (rip).

As a result, our pairwise eval module in @llama_index was broken 🤕 Something about the ordering led to stochastic/inconsistent results, no matter how much we tried to prompt around it 📝

Here's an initial solution 💡: call the LLM twice, the second time with answer 1 and answer 2 swapped. If the two results are inconsistent, return a "TIE" (the LLM can't pick which one is better). Otherwise, return the result of the first call.

We've pushed a fix and added documentation on our pairwise evaluator in @llama_index: https://t.co/miGtdGbjo0

But we're still open to suggestions! Let us know any ideas you have along these lines.
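For anyone curious what the swap-and-compare fix looks like in code, here's a minimal sketch. `llm_judge` is a hypothetical stand-in for the actual LLM call (the real implementation lives in the @llama_index pairwise evaluator linked above):

```python
def llm_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical LLM call: returns '1' if it prefers answer_a, '2' if answer_b."""
    raise NotImplementedError  # plug in your LLM of choice


def pairwise_eval(question: str, answer_1: str, answer_2: str) -> str:
    # First call: answers in their original order.
    first = llm_judge(question, answer_1, answer_2)

    # Second call: answers swapped, to detect positional bias.
    second = llm_judge(question, answer_2, answer_1)

    # Map the second verdict back to the original ordering
    # ('1' in the swapped call means answer_2 won, and vice versa).
    second_unswapped = "2" if second == "1" else "1"

    if first != second_unswapped:
        # The judge changed its mind when the order changed,
        # so there's no reliable winner.
        return "TIE"
    return f"answer {first}"
```

This doubles the number of LLM calls per comparison, but it's a cheap way to filter out verdicts that are really just order bias rather than a genuine preference.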