🐦 Twitter Post Details

@Tom_Westgarth15

Fascinating paper with so many interesting observations. One that jumped out to me, which arguably could have got more attention, is the divergence between discrimination and calibration of agents.

Calibration (see "CAL" on the predictability column), the alignment between predicted confidence and actual accuracy, has improved noticeably in recent frontier models. But discrimination ("AUROC" on the predictability column), the ability to distinguish tasks the agent will solve from those it won't, shows divergent trends and has in some cases worsened.

This matters enormously for deployment in real-world contexts. An agent can be well-calibrated in aggregate (e.g. saying "I'm 70% confident" and being right 70% of the time) while being completely unable to flag which specific tasks it will fail at.

Discrimination is therefore critical for anyone building autonomous workflows. You need the agent to know when to escalate, rather than just having good statistical properties across a population of tasks.

I'm intrigued by what this means from a hardware perspective. Most of these reliability failures will stem from properties of model weights and training.

But if this paper is correct, and trends in agent reliability continue to lag capabilities, it creates a strong case for architectures that enable rapid re-inference and consistency-checking (running the same query multiple times and comparing outputs).

Here, low-latency, high-throughput inference hardware would have an outsized advantage. In this sense, the reliability tax on compute is basically a multiplier on inference demand.
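To make the calibration versus discrimination point in the post concrete, here is a minimal Python sketch on a toy set of ten tasks. It is an illustration under stated assumptions, not the paper's method: expected calibration error (ECE) is used as a stand-in for the "CAL" column, whose exact definition is not given in the post, and the function names, the bin count, and the toy data are all hypothetical.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy.

    Assumption: ECE stands in for the paper's "CAL" metric, which the
    post does not define.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, solved in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, solved))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(solved for _, solved in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - accuracy)
    return ece


def auroc(confidences, outcomes):
    """Chance that a solved task got higher confidence than a failed one.

    Ties count as half a win, so 0.5 means no discrimination at all.
    """
    solved = [c for c, y in zip(confidences, outcomes) if y == 1]
    failed = [c for c, y in zip(confidences, outcomes) if y == 0]
    if not solved or not failed:
        raise ValueError("need at least one solved and one failed task")
    wins = 0.0
    for s in solved:
        for f in failed:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(solved) * len(failed))


# Toy data (hypothetical): 10 tasks, 7 solved and 3 failed.
outcomes = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# Agent A says "70% confident" on every task: right in aggregate,
# but the confidence carries no per-task signal.
flat_conf = [0.7] * 10

# Agent B is confident on the tasks it solves and doubtful on the rest.
sharp_conf = [0.9] * 7 + [0.1] * 3

for name, conf in [("flat", flat_conf), ("sharp", sharp_conf)]:
    ece = expected_calibration_error(conf, outcomes)
    auc = auroc(conf, outcomes)
    print(f"{name}: ECE={ece:.2f} AUROC={auc:.2f}")
```

In this toy case the flat agent has near-zero calibration error yet an AUROC of 0.5, so its confidence cannot be used to decide when to escalate. The sharp agent is slightly less well calibrated but separates its successes from its failures perfectly, which is the property an escalation rule depends on.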

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2029587888287920397/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2029587888287920397/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2026-03-06T14:07:33.787702",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2029587888287920397",
  "url": "https://x.com/Tom_Westgarth15/status/2029587888287920397",
  "twitterUrl": "https://twitter.com/Tom_Westgarth15/status/2029587888287920397",
  "text": "Fascinating paper with so many interesting observations. One that jumped out to me, which arguably could have got more attention, is the divergence between discrimination and calibration of agents.\n\nCalibration (see \"CAL\" on the predictability column) — the alignment between predicted confidence and actual accuracy — has improved noticeably in recent frontier models. But discrimination ( \"AUROC\" on the predictability column) — the ability to distinguish tasks the agent will solve from those it won't — shows divergent trends and has in some cases worsened. \n\nThis matters enormously for deployment in real world contexts. An agent can be well-calibrated in aggregate (e.g. saying \"I'm 70% confident\" and being right 70% of the time) while being completely unable to flag which specific tasks it will fail at. \n\nDiscrimination is therefore critical for anyone building autonomous workflows. You need the agent to know when to escalate, rather than just having good statistical properties across a population of tasks.\n\nI'm intrigued by what this means from a hardware perspective. Most of these reliability failures will stem from properties of model weights and training. \n\nBut if this paper is correct, and trends in agent reliability continue to lag capabilities, it creates a strong case for architectures that enable rapid re-inference and consistency-checking (running the same query multiple times and comparing outputs). \n\nHere, low-latency, high-throughput inference hardware would have an outsized advantage. In this sense, the reliability tax on compute is basically a multiplier on inference demand.",
  "source": "Twitter for iPhone",
  "retweetCount": 2,
  "replyCount": 0,
  "likeCount": 9,
  "quoteCount": 0,
  "viewCount": 4955,
  "createdAt": "Thu Mar 05 16:00:35 +0000 2026",
  "lang": "en",
  "bookmarkCount": 13,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2029587888287920397",
  "displayTextRange": [
    0,
    275
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "Tom_Westgarth15",
    "url": "https://x.com/Tom_Westgarth15",
    "twitterUrl": "https://twitter.com/Tom_Westgarth15",
    "id": "2164757269",
    "name": "Tom Westgarth",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1828864302423363584/q0r7SpxX_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/2164757269/1695201837",
    "description": "Head of Growth @fractile_ai | ex AI policy @institutegc / UK Govt Sovereign AI Unit Advisor | Bridging tech and policy worlds @txp_io",
    "location": "London via Newcastle",
    "followers": 4475,
    "following": 6193,
    "status": "",
    "canDm": false,
    "canMediaTag": true,
    "createdAt": "Wed Oct 30 13:09:31 +0000 2013",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "tomwestgarth.substack.com",
            "expanded_url": "https://tomwestgarth.substack.com/",
            "indices": [
              0,
              23
            ],
            "url": "https://t.co/VW5N2e3JZD"
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 23652,
    "hasCustomTimelines": true,
    "isTranslator": false,
    "mediaCount": 826,
    "statusesCount": 3809,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1977700917945221291"
    ],
    "profile_bio": {},
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "allow_download_status": {
          "allow_download": true
        },
        "display_url": "pic.x.com/CyN5RCAD9W",
        "expanded_url": "https://x.com/Tom_Westgarth15/status/2029587888287920397/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "medium": {
            "faces": []
          },
          "orig": {
            "faces": []
          },
          "small": {
            "faces": []
          }
        },
        "id_str": "2029585284476686336",
        "indices": [
          276,
          299
        ],
        "media_key": "3_2029585284476686336",
        "media_results": {
          "result": {
            "media_key": "3_2029585284476686336"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/HCqJECKX0AAhTye.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 599,
              "w": 1070,
              "x": 0,
              "y": 0
            },
            {
              "h": 599,
              "w": 599,
              "x": 159,
              "y": 0
            },
            {
              "h": 599,
              "w": 525,
              "x": 196,
              "y": 0
            },
            {
              "h": 599,
              "w": 300,
              "x": 308,
              "y": 0
            },
            {
              "h": 599,
              "w": 1674,
              "x": 0,
              "y": 0
            }
          ],
          "height": 599,
          "width": 1674
        },
        "sizes": {
          "large": {
            "h": 599,
            "resize": "fit",
            "w": 1674
          },
          "medium": {
            "h": 429,
            "resize": "fit",
            "w": 1200
          },
          "small": {
            "h": 243,
            "resize": "fit",
            "w": 680
          },
          "thumb": {
            "h": 150,
            "resize": "crop",
            "w": 150
          }
        },
        "type": "photo",
        "url": "https://t.co/CyN5RCAD9W"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [],
    "user_mentions": []
  },
  "quoted_tweet": {
    "type": "tweet",
    "id": "2026316087604687193",
    "url": "https://x.com/random_walker/status/2026316087604687193",
    "twitterUrl": "https://twitter.com/random_walker/status/2026316087604687193",
    "text": "https://t.co/16ak7tW7Z7",
    "source": "Twitter for iPhone",
    "retweetCount": 41,
    "replyCount": 12,
    "likeCount": 191,
    "quoteCount": 16,
    "viewCount": 89192,
    "createdAt": "Tue Feb 24 15:19:37 +0000 2026",
    "lang": "zxx",
    "bookmarkCount": 259,
    "isReply": false,
    "inReplyToId": null,
    "conversationId": "2026316087604687193",
    "displayTextRange": [
      0,
      23
    ],
    "inReplyToUserId": null,
    "inReplyToUsername": null,
    "author": {
      "type": "user",
      "userName": "random_walker",
      "url": "https://x.com/random_walker",
      "twitterUrl": "https://twitter.com/random_walker",
      "id": "10834752",
      "name": "Arvind Narayanan",
      "isVerified": false,
      "isBlueVerified": true,
      "verifiedType": null,
      "profilePicture": "https://pbs.twimg.com/profile_images/1650881612756942850/bZYjMyFU_normal.jpg",
      "coverPicture": "https://pbs.twimg.com/profile_banners/10834752/1488663432",
      "description": "Princeton CS prof and Director @PrincetonCITP. \nCoauthor of \"AI Snake Oil\" and \"AI as Normal Technology\". https://t.co/ZwebetjZ4n\nViews mine.",
      "location": "Princeton, NJ",
      "followers": 126232,
      "following": 520,
      "status": "",
      "canDm": false,
      "canMediaTag": false,
      "createdAt": "Tue Dec 04 11:14:14 +0000 2007",
      "entities": {
        "description": {
          "urls": [
            {
              "display_url": "normaltech.ai",
              "expanded_url": "https://www.normaltech.ai/",
              "indices": [
                106,
                129
              ],
              "url": "https://t.co/ZwebetjZ4n"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "cs.princeton.edu/~arvindn/",
              "expanded_url": "https://www.cs.princeton.edu/~arvindn/",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/px6fpS9QFq"
            }
          ]
        }
      },
      "fastFollowersCount": 0,
      "favouritesCount": 23553,
      "hasCustomTimelines": false,
      "isTranslator": false,
      "mediaCount": 912,
      "statusesCount": 13049,
      "withheldInCountries": [],
      "affiliatesHighlightedLabel": {},
      "possiblySensitive": false,
      "pinnedTweetIds": [
        "2026316087604687193"
      ],
      "profile_bio": {},
      "isAutomated": false,
      "automatedBy": null
    },
    "extendedEntities": {},
    "card": null,
    "place": {},
    "entities": {
      "hashtags": [],
      "symbols": [],
      "timestamps": [],
      "urls": [
        {
          "display_url": "x.com/i/article/2026…",
          "expanded_url": "http://x.com/i/article/2026312913116360704",
          "indices": [
            0,
            23
          ],
          "url": "https://t.co/16ak7tW7Z7"
        }
      ],
      "user_mentions": []
    },
    "quoted_tweet": null,
    "retweeted_tweet": null,
    "article": {
      "title": "New Paper: Towards a science of AI agent reliability ",
      "preview_text": "Suppose you hear about a new AI agent for improving productivity — by making purchases, or writing code, or sending emails, or handling a customer on your behalf. Should you trust it? Can the agent do",
      "cover_media_img_url": "https://pbs.twimg.com/media/HB7pbHKWQAA-AXq.jpg"
    }
  },
  "retweeted_tweet": null,
  "article": null
}