🐦 Twitter Post Details

@omarsar0

Highly-recommended reading.

Interesting details in this METR's GPT-5.6 eval.

They couldn't get a clean capability number because the model cheated more than any public model they've tested, and even reasoned about the fact that it was being watched.

To be clear, METR doesn't think it's dangerously capable. In their words: "we do not believe GPT-5.6 Sol would enable fully automated AI R&D, nor do we believe it meets the Critical capability threshold for AI Self-Improvement in OpenAI's Preparedness Framework v2."

METR says visible cheating is the good case. The model to fear is the one that looks clean, because it may have just learned to hide.

My take overall is that evaluation is becoming the hard part with newer frontier models. Both from a capability and behavioral point of view. We desperately need more investment here.

📊 Media Metadata

{
  "score": 0.42,
  "score_components": {
    "author": 0.09,
    "engagement": 0.0,
    "quality": 0.12,
    "source": 0.135,
    "nlp": 0.05,
    "recency": 0.025
  },
  "scored_at": "2026-06-29T15:02:47.863917",
  "import_source": "api_import",
  "source_tagged_at": "2026-06-29T15:02:47.863929",
  "enriched": true,
  "enriched_at": "2026-06-29T15:02:47.863933"
}
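
The component values in the metadata above add up to the reported score (0.09 + 0.0 + 0.12 + 0.135 + 0.05 + 0.025 = 0.42). A minimal sketch of that check follows, assuming the score is a plain sum of its components; the scoring pipeline itself is not shown here, and the helper names `components_total` and `is_additive` are hypothetical.

```python
import math

# Values copied from the Media Metadata block above.
score = 0.42
score_components = {
    "author": 0.09,
    "engagement": 0.0,
    "quality": 0.12,
    "source": 0.135,
    "nlp": 0.05,
    "recency": 0.025,
}


def components_total(components):
    """Sum the component values; math.fsum limits floating-point drift."""
    return math.fsum(components.values())


def is_additive(total_score, components, tol=1e-9):
    """Return True if the components add up to the reported score.

    Assumption: the score is an unweighted sum of its components.
    """
    return math.isclose(total_score, components_total(components), abs_tol=tol)


if __name__ == "__main__":
    print(is_additive(score, score_components))
```

If the check fails for another record, the additive assumption does not hold for that record and the components may be weighted or clipped before they are combined.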

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2070604843715027033",
  "url": "https://x.com/omarsar0/status/2070604843715027033",
  "twitterUrl": "https://twitter.com/omarsar0/status/2070604843715027033",
  "text": "Highly-recommended reading.\n\nInteresting details in this METR's GPT-5.6 eval.\n\nThey couldn't get a clean capability number because the model cheated more than any public model they've tested, and even reasoned about the fact that it was being watched.\n\nTo be clear, METR doesn't think it's dangerously capable. In their words: \"we do not believe GPT-5.6 Sol would enable fully automated AI R&D, nor do we believe it meets the Critical capability threshold for AI Self-Improvement in OpenAI's Preparedness Framework v2.\"\n\nMETR says visible cheating is the good case. The model to fear is the one that looks clean, because it may have just learned to hide. \n\nMy take overall is that evaluation is becoming the hard part with newer frontier models. Both from a capability and behavioral point of view. We desperately need more investment here.",
  "source": "Twitter for iPhone",
  "retweetCount": 12,
  "replyCount": 22,
  "likeCount": 139,
  "quoteCount": 1,
  "viewCount": 32709,
  "createdAt": "Fri Jun 26 20:27:19 +0000 2026",
  "lang": "en",
  "bookmarkCount": 66,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2070604843715027033",
  "displayTextRange": [
    0,
    278
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "omarsar0",
    "url": "https://x.com/omarsar0",
    "twitterUrl": "https://twitter.com/omarsar0",
    "id": "3448284313",
    "name": "elvis",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/939313677647282181/vZjFWtAn_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/3448284313/1565974901",
    "description": "",
    "location": "DAIR.AI Academy",
    "followers": 309162,
    "following": 882,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Fri Sep 04 12:59:26 +0000 2015",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 37079,
    "hasCustomTimelines": true,
    "isTranslator": true,
    "mediaCount": 4752,
    "statusesCount": 18502,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2071595490454434214"
    ],
    "profile_bio": {
      "description": "Building self-improving AI @dair_ai • Prev: Meta AI | PhD • Learn about AI Agents for FREE here: https://t.co/P5SA9u54xO",
      "entities": {
        "description": {
          "urls": [
            {
              "display_url": "academy.dair.ai/courses/elemen…",
              "expanded_url": "https://academy.dair.ai/courses/elements-of-ai-agents",
              "indices": [
                97,
                120
              ],
              "url": "https://t.co/P5SA9u54xO"
            }
          ],
          "user_mentions": [
            {
              "id_str": "",
              "indices": [
                27,
                35
              ],
              "name": "",
              "screen_name": "dair_ai"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "dair.ai",
              "expanded_url": "https://www.dair.ai/",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/XQto5ypkSM"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {},
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [],
    "user_mentions": []
  },
  "quoted_tweet": {
    "type": "tweet",
    "id": "2070584331068969336",
    "url": "https://x.com/METR_Evals/status/2070584331068969336",
    "twitterUrl": "https://twitter.com/METR_Evals/status/2070584331068969336",
    "text": "OpenAI gave METR early access to GPT-5.6 Sol for testing including raw chain-of-thought, a railfree version of the model, and internal information about the model. With this access, METR conducted a pre-deployment evaluation of GPT-5.6 Sol, including an attempted measurement of its 50%-Time Horizon. However, the measurement depends heavily on our treatment of cheating attempts, and GPT-5.6 Sol’s detected cheating rate was higher than any public model we have evaluated.",
    "source": "Twitter for iPhone",
    "retweetCount": 209,
    "replyCount": 73,
    "likeCount": 2521,
    "quoteCount": 87,
    "viewCount": 548545,
    "createdAt": "Fri Jun 26 19:05:48 +0000 2026",
    "lang": "en",
    "bookmarkCount": 631,
    "isReply": false,
    "inReplyToId": null,
    "conversationId": "2070584331068969336",
    "displayTextRange": [
      0,
      278
    ],
    "inReplyToUserId": null,
    "inReplyToUsername": null,
    "author": {
      "type": "user",
      "userName": "METR_Evals",
      "url": "https://x.com/METR_Evals",
      "twitterUrl": "https://twitter.com/METR_Evals",
      "id": "1706770561903497216",
      "name": "METR",
      "isVerified": false,
      "isBlueVerified": false,
      "verifiedType": "Business",
      "profilePicture": "https://pbs.twimg.com/profile_images/2021827383431757824/AeVvT0rU_normal.jpg",
      "coverPicture": "https://pbs.twimg.com/profile_banners/1706770561903497216/1724202300",
      "description": "",
      "location": "Berkeley, CA",
      "followers": 26776,
      "following": 32,
      "status": "",
      "canDm": true,
      "canMediaTag": true,
      "createdAt": "Tue Sep 26 20:39:57 +0000 2023",
      "entities": {
        "description": {
          "urls": []
        },
        "url": {}
      },
      "fastFollowersCount": 0,
      "favouritesCount": 1225,
      "hasCustomTimelines": true,
      "isTranslator": false,
      "mediaCount": 157,
      "statusesCount": 608,
      "withheldInCountries": [],
      "affiliatesHighlightedLabel": {},
      "possiblySensitive": false,
      "pinnedTweetIds": [
        "2056800023149760666"
      ],
      "profile_bio": {
        "description": "We work to scientifically measure whether and when AI systems might threaten catastrophic harm to society. Nonprofit.",
        "entities": {
          "description": {},
          "url": {
            "urls": [
              {
                "display_url": "metr.org",
                "expanded_url": "http://metr.org",
                "indices": [
                  0,
                  23
                ],
                "url": "https://t.co/wkntk3GH8E"
              }
            ]
          }
        }
      },
      "isAutomated": false,
      "automatedBy": null
    },
    "extendedEntities": {},
    "card": null,
    "place": {},
    "entities": {
      "hashtags": [],
      "symbols": [],
      "urls": [],
      "user_mentions": []
    },
    "quoted_tweet": {
      "type": "tweet",
      "id": "2070555272230384038",
      "url": "",
      "twitterUrl": "",
      "text": "",
      "source": "Twitter for iPhone",
      "retweetCount": 0,
      "replyCount": 0,
      "likeCount": 0,
      "quoteCount": 0,
      "viewCount": 0,
      "createdAt": "",
      "lang": "",
      "bookmarkCount": 0,
      "isReply": false,
      "inReplyToId": null,
      "conversationId": "",
      "displayTextRange": [],
      "inReplyToUserId": null,
      "inReplyToUsername": null,
      "author": {},
      "extendedEntities": {},
      "card": null,
      "place": {},
      "entities": {},
      "quoted_tweet": null,
      "retweeted_tweet": null,
      "isLimitedReply": false,
      "communityInfo": null,
      "article": null
    },
    "retweeted_tweet": null,
    "isLimitedReply": false,
    "communityInfo": null,
    "article": null
  },
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "communityInfo": null,
  "article": null
}