@jerryjliu0
At the end of the day, GPT-4 is a stochastic parrot and doesn't always reason logically. If A > B, then that means B < A, right? Well…

We tried using GPT-4 as an evaluation module to pick the better of two answers to a question. Can you spot the issue in the attached diagram? 🖼️🤔

When asked whether it preferred answer 1 or answer 2, GPT-4 preferred answer 1. ❌ But when I swapped the order of answer 1 and answer 2, GPT-4 *still* preferred answer 1 (rip).

As a result, our pairwise eval module in @llama_index was broken 🤕 Something about the ordering led to stochastic/inconsistent results, no matter how much we tried to prompt around it 📝

Here's an initial solution 💡: call the LLM twice, the second time with answer 1 and answer 2 swapped. If the two results are inconsistent, return a "TIE" (the LLM can't pick which one is better). Otherwise, return the result of the first call.

We've pushed a fix and added documentation on our pairwise evaluator in @llama_index: https://t.co/miGtdGbjo0

But we're still open to suggestions! Let us know any ideas you have along these lines.
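For anyone curious what the swap-and-compare fix looks like in code, here's a minimal sketch. `llm_judge` is a hypothetical stand-in for the actual LLM call (the real implementation lives in the @llama_index pairwise evaluator linked above):

```python
def llm_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical LLM call: returns '1' if it prefers answer_a, '2' if answer_b."""
    raise NotImplementedError  # plug in your LLM of choice


def pairwise_eval(question: str, answer_1: str, answer_2: str) -> str:
    # First call: answers in their original order.
    first = llm_judge(question, answer_1, answer_2)

    # Second call: answers swapped, to detect positional bias.
    second = llm_judge(question, answer_2, answer_1)

    # Map the second verdict back to the original ordering
    # ('1' in the swapped call means answer_2 won, and vice versa).
    second_unswapped = "2" if second == "1" else "1"

    if first != second_unswapped:
        # The judge changed its mind when the order changed,
        # so there's no reliable winner.
        return "TIE"
    return f"answer {first}"
```

This doubles the number of LLM calls per comparison, but it's a cheap way to filter out verdicts that are really just order bias rather than a genuine preference.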