@jerryjliu0
Be careful of the eval metrics you use! ⚠️

One insight from trying different retrieval techniques to improve LLM/RAG apps: certain techniques are clearly better via the eyeball test 👀, but not every eval metric reflects that (it finds both answers are “correct”, “faithful to the context”, etc.).

We came up with a simple trick to better tease apart each technique - pass GPT-4 two responses and see which one it prefers (or have it output a tie)! 🧑‍⚖️⚖️

Take this example below. Technique 1 (auto-merging retrieval) clearly outputs more details than technique 2 (naive top-k) - see image 1. 🤔

But a semantic similarity evaluator yields the same numbers for both, or even worse (image 2).

💡 Instead, let’s just ask GPT-4 which answer is better given the question (image 3). Through this we see that our auto-merging retriever is preferred 65% of the time 🔥 (image 4)

Note: this is similar to how reward data is collected for RLHF (except instead of humans we use GPT).

Full guide here: https://t.co/JT7pE59t30
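
Here’s a minimal sketch of the pairwise-judging trick, assuming the openai>=1.0 Python client; the prompt wording and the `judge_pair` / `win_rate` helpers are illustrative names, not the exact code from the guide:

```python
# Minimal sketch of a GPT-4 pairwise judge (illustrative, not the guide's exact code).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PAIRWISE_PROMPT = """\
Given a user question and two candidate answers, decide which answer is better.
Reply with exactly one token: "1" if answer 1 is better, "2" if answer 2 is
better, or "tie" if they are equally good.

Question: {question}

Answer 1: {answer_1}

Answer 2: {answer_2}
"""

def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Ask GPT-4 which of two RAG responses it prefers, or 'tie'."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the judgment as deterministic as possible
        messages=[
            {
                "role": "user",
                "content": PAIRWISE_PROMPT.format(
                    question=question, answer_1=answer_1, answer_2=answer_2
                ),
            }
        ],
    )
    return resp.choices[0].message.content.strip()

def win_rate(examples: list[dict]) -> float:
    """Fraction of questions where answer 1 (e.g. auto-merging) is preferred."""
    wins = sum(
        1
        for ex in examples
        if judge_pair(ex["question"], ex["auto_merging"], ex["naive_topk"]) == "1"
    )
    return wins / len(examples)
```

Running `win_rate` over an eval set is how you’d get a preference number like the 65% above. One caveat worth noting: LLM judges can have position bias (favoring whichever answer appears first), so in practice you’d also want to randomize the order of the two answers.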