🐦 Twitter Post Details

@Kangwook_Lee

LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique.

But despite how widely this is used, almost all reported results are highly biased.

Excited to share our preprint on how to properly use LLM as a judge.

🧵

===

So how do people actually use LLM as a judge?

Most people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct.

When the LLM is perfect, this works fine and gives an unbiased estimator.

If the LLM is not perfect, this breaks.

Consider a case where the LLM evaluates correctly 80 percent of the time.

More specifically, if the answer is correct, the LLM says "this looks correct" with 80 percent probability, and if the answer is actually incorrect, it says "this looks incorrect" with the same 80 percent probability.

In this situation, you should not report the empirical probability, because it is biased. Why?

Let the true probability of the tested model being correct be p.

Then the empirical probability that the LLM says "correct" (= q) is, in expectation,
q = 0.8p + 0.2(1 - p) = 0.2 + 0.6p

So the unbiased estimate of p should be
(q - 0.2) / 0.6

Things get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.

===

So what does this mean?

First, follow the suggested guideline in our preprint.
There is no free lunch. You cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it.

Depending on how close it is to a perfect evaluator, you need a sufficiently large test set (= calibration set) to estimate the evaluator’s error rates, and then you must correct for them.

Second, very unfortunately, many findings we have seen in papers over the past few years need to be revisited.
Unless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims. The improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta study is urgently needed.

===

tldr:

(1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator.

(2) It is easy to fix, so wait for our full preprint.

(3) Many LLM-as-a-judge results should be taken with a grain of salt.

Full preprint coming in a few days, so stay tuned!

Amazing work by my students and collaborators.
@chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108
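
The correction described in the thread can be sketched in a few lines of Python. This is a minimal illustration, not the method from the preprint (which is not shown here). The function names `corrected_accuracy` and `estimate_judge_rates` are hypothetical, and the asymmetric case is handled with the standard inversion of q = tpr * p + fpr * (1 - p), where `tpr` is the chance the judge says "correct" on a truly correct answer and `fpr` is the chance it says "correct" on a truly incorrect one. The thread's symmetric example corresponds to tpr = 0.8 and fpr = 0.2.

```python
def corrected_accuracy(q, tpr=0.8, fpr=0.2):
    """Invert q = tpr * p + fpr * (1 - p) to recover the true accuracy p.

    q   : observed fraction of answers the judge marked "correct"
    tpr : P(judge says "correct" | answer is correct)      (assumed known)
    fpr : P(judge says "correct" | answer is incorrect)    (assumed known)
    """
    if tpr <= fpr:
        raise ValueError("judge must be better than chance: need tpr > fpr")
    # With finite samples the result can fall slightly outside [0, 1];
    # it is returned unclipped so the estimate stays unbiased.
    return (q - fpr) / (tpr - fpr)


def estimate_judge_rates(judge_says_correct, truly_correct):
    """Estimate (tpr, fpr) from a calibration set with ground-truth labels.

    Both arguments are equal-length sequences of booleans, one per example.
    """
    on_correct = [j for j, t in zip(judge_says_correct, truly_correct) if t]
    on_incorrect = [j for j, t in zip(judge_says_correct, truly_correct) if not t]
    if not on_correct or not on_incorrect:
        raise ValueError("calibration set needs both correct and incorrect answers")
    tpr = sum(on_correct) / len(on_correct)
    fpr = sum(on_incorrect) / len(on_incorrect)
    return tpr, fpr


if __name__ == "__main__":
    # Thread's example: a judge that is right 80 percent of the time.
    # A model with true accuracy 0.8 is reported as q = 0.2 + 0.6 * 0.8 = 0.68.
    print(corrected_accuracy(0.68))  # roughly 0.8
```

In this example the uncorrected number (0.68) understates the true accuracy (0.8) by 12 points. When the error rates are not known in advance, something like `estimate_judge_rates` on a labeled calibration set would supply `tpr` and `fpr`, though the resulting estimate then also inherits the sampling noise of the calibration set.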

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1993438649963164121/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1993438649963164121/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1993438649963164121/media_1.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1993438649963164121/media_1.jpg?",
      "type": "photo",
      "filename": "media_1.jpg"
    },
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1993438649963164121/media_2.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1993438649963164121/media_2.jpg?",
      "type": "photo",
      "filename": "media_2.jpg"
    }
  ],
  "processed_at": "2025-11-27T20:17:36.800099",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "1993438649963164121",
  "url": "https://x.com/Kangwook_Lee/status/1993438649963164121",
  "twitterUrl": "https://twitter.com/Kangwook_Lee/status/1993438649963164121",
  "text": "LLM as a judge has become a dominant way to evaluate how good a model is at solving a task, since it works without a test set and handles cases where answers are not unique.\n\nBut despite how widely this is used, almost all reported results are highly biased.\n\nExcited to share our preprint on how to properly use LLM as a judge.\n\n🧵\n\n===\n\nSo how do people actually use LLM as a judge?\n\nMost people just use the LLM as an evaluator and report the empirical probability that the LLM says the answer looks correct.\n\nWhen the LLM is perfect, this works fine and gives an unbiased estimator.\n\nIf the LLM is not perfect, this breaks. \n\nConsider a case where the LLM evaluates correctly 80 percent of the time.\n\nMore specifically, if the answer is correct, the LLM says \"this looks correct\" with 80 percent probability, and the same 80 percent applies when the answer is actually incorrect.\n\nIn this situation, you should not report the empirical probability, because it is biased. Why?\n\nLet the true probability of the tested model being correct be p.\n\nThen the empirical probability that the LLM says \"correct\" (= q) is\nq = 0.8p + 0.2(1 - p) = 0.2 + 0.6p\n\nSo the unbiased estimate should be\n(q - 0.2) / 0.6\n\nThings get even more interesting if the error pattern is asymmetric or if you do not know these error rates a priori.\n\n===\n\nSo what does this mean?\n\nFirst, follow the suggested guideline in our preprint.\nThere is no free lunch. You cannot evaluate how good your model is unless your LLM as a judge is known to be perfect at judging it.\n\nDepending on how close it is to a perfect evaluator, you need a sufficient size of test set (= calibration set) to estimate the evaluator’s error rates, and then you must correct for them.\n\nSecond, very unfortunately, many findings we have seen in papers over the past few years need to be revisited.\nUnless two papers used the exact same LLM as a judge, comparing results across them could have produced false claims. The improvement could simply come from changing the evaluation pipeline slightly. A rigorous meta study is urgently needed.\n\n===\n\ntldr:\n\n(1) Almost all LLM-as-a-judge evaluations in the past few years were reported with a biased estimator.\n\n(2) It is easy to fix, so wait for our full preprint.\n\n(3) Many LLM-as-a-judge results should be taken with grains of salt.\n\nFull preprint coming in a few days, so stay tuned!\n\nAmazing work by my students and collaborators.\n@chungpa_lee @tomzeng200 @jongwonjeong123 and @jysohn1108",
  "source": "Twitter for iPhone",
  "retweetCount": 162,
  "replyCount": 39,
  "likeCount": 1070,
  "quoteCount": 16,
  "viewCount": 184964,
  "createdAt": "Tue Nov 25 21:56:25 +0000 2025",
  "lang": "en",
  "bookmarkCount": 1209,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "1993438649963164121",
  "displayTextRange": [
    0,
    281
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "Kangwook_Lee",
    "url": "https://x.com/Kangwook_Lee",
    "twitterUrl": "https://twitter.com/Kangwook_Lee",
    "id": "57182201",
    "name": "Kangwook Lee",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1269744821348061195/-5REBSre_normal.jpg",
    "coverPicture": "",
    "description": "",
    "location": "Wisconsin, USA",
    "followers": 3785,
    "following": 1038,
    "status": "",
    "canDm": false,
    "canMediaTag": true,
    "createdAt": "Thu Jul 16 00:06:50 +0000 2009",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 5988,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 134,
    "statusesCount": 1136,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1937568988558430316"
    ],
    "profile_bio": {
      "description": "UW Madison / KRAFTON AI",
      "entities": {
        "description": {},
        "url": {
          "urls": [
            {
              "display_url": "kangwooklee.com",
              "expanded_url": "http://kangwooklee.com",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/Cs2prQaawb"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.twitter.com/D9cPr0aytr",
        "expanded_url": "https://twitter.com/Kangwook_Lee/status/1993438649963164121/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {},
          "orig": {}
        },
        "id_str": "1993438144063062016",
        "indices": [
          282,
          305
        ],
        "media_key": "3_1993438144063062016",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARuqHW4ql8AACgACG6od4/SWsdkAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABG6odbiqXwAAKAAIbqh3j9Jax2QAA",
            "media_key": "3_1993438144063062016"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/G6odbiqXwAAVAJ4.png",
        "original_info": {
          "focus_rects": [
            {
              "h": 1066,
              "w": 1904,
              "x": 284,
              "y": 0
            },
            {
              "h": 1066,
              "w": 1066,
              "x": 1052,
              "y": 0
            },
            {
              "h": 1066,
              "w": 935,
              "x": 1118,
              "y": 0
            },
            {
              "h": 1066,
              "w": 533,
              "x": 1319,
              "y": 0
            },
            {
              "h": 1066,
              "w": 2188,
              "x": 0,
              "y": 0
            }
          ],
          "height": 1066,
          "width": 2188
        },
        "sizes": {
          "large": {
            "h": 998,
            "w": 2048
          }
        },
        "type": "photo",
        "url": "https://t.co/D9cPr0aytr"
      },
      {
        "display_url": "pic.twitter.com/D9cPr0aytr",
        "expanded_url": "https://twitter.com/Kangwook_Lee/status/1993438649963164121/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {},
          "orig": {}
        },
        "id_str": "1993438236656451584",
        "indices": [
          282,
          305
        ],
        "media_key": "3_1993438236656451584",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARuqHYO5lsAACgACG6od4/SWsdkAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABG6odg7mWwAAKAAIbqh3j9Jax2QAA",
            "media_key": "3_1993438236656451584"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/G6odg7mWwAAOhEn.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 1536,
              "w": 2743,
              "x": 73,
              "y": 0
            },
            {
              "h": 1536,
              "w": 1536,
              "x": 1280,
              "y": 0
            },
            {
              "h": 1536,
              "w": 1347,
              "x": 1469,
              "y": 0
            },
            {
              "h": 1536,
              "w": 768,
              "x": 1792,
              "y": 0
            },
            {
              "h": 1536,
              "w": 2816,
              "x": 0,
              "y": 0
            }
          ],
          "height": 1536,
          "width": 2816
        },
        "sizes": {
          "large": {
            "h": 1117,
            "w": 2048
          }
        },
        "type": "photo",
        "url": "https://t.co/D9cPr0aytr"
      },
      {
        "display_url": "pic.twitter.com/D9cPr0aytr",
        "expanded_url": "https://twitter.com/Kangwook_Lee/status/1993438649963164121/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {},
          "orig": {}
        },
        "id_str": "1993438599476355072",
        "indices": [
          282,
          305
        ],
        "media_key": "3_1993438599476355072",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARuqHdgzVyAACgACG6od4/SWsdkAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABG6od2DNXIAAKAAIbqh3j9Jax2QAA",
            "media_key": "3_1993438599476355072"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/G6od2DNXIAA5TOQ.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 506,
              "w": 904,
              "x": 0,
              "y": 0
            },
            {
              "h": 578,
              "w": 578,
              "x": 320,
              "y": 0
            },
            {
              "h": 578,
              "w": 507,
              "x": 356,
              "y": 0
            },
            {
              "h": 578,
              "w": 289,
              "x": 465,
              "y": 0
            },
            {
              "h": 578,
              "w": 904,
              "x": 0,
              "y": 0
            }
          ],
          "height": 578,
          "width": 904
        },
        "sizes": {
          "large": {
            "h": 578,
            "w": 904
          }
        },
        "type": "photo",
        "url": "https://t.co/D9cPr0aytr"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "user_mentions": [
      {
        "id_str": "1932892638358753280",
        "indices": [
          2423,
          2435
        ],
        "name": "Chungpa Lee",
        "screen_name": "chungpa_lee"
      },
      {
        "id_str": "2837066039",
        "indices": [
          2436,
          2447
        ],
        "name": "Thomas Zeng",
        "screen_name": "tomzeng200"
      },
      {
        "id_str": "1719908551731372032",
        "indices": [
          2448,
          2464
        ],
        "name": "Jongwon Jeong",
        "screen_name": "jongwonjeong123"
      },
      {
        "id_str": "1093365587781140480",
        "indices": [
          2469,
          2480
        ],
        "name": "Jy-yong Sohn",
        "screen_name": "jysohn1108"
      }
    ]
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}