@ESYudkowsky
This Nov 2025 paper is making the rounds again. We're LONG past the point where we urgently need to know how real and general these phenomena are.

Anthropic, or Google DeepMind if Anthropic should fail: Please build a filtered training dataset which, e.g., contains no data that produces activations associated with cheating/faking/evil in a 1B model that roughly identifies those concepts (a rough sketch of such a filter appears below). Then, have your next medium model undergo a restricted pre-pretraining phase, in which it only sees data that passed the filter.

To expand on this proposal: Passing all of your training data through a 1B-model filter ought to cost around 1% of what it'd take to train a 100B model on that data (back-of-the-envelope arithmetic below). Filter out *training data* that produces 1B-model activations associated with past discussions and predictions about AI, fiction about AIs rebelling, fiction about golems rebelling, etcetera. My hope would be that the 1B model wouldn't need to produce expensive reasoning tokens in which it thinks about whether a chunk of data is associated with excluded concepts; and also that we wouldn't be relying on mere regexes to catch it.

Maybe even produce a further-restricted dataset which contains nothing about self-awareness, AI rights, roleplay, philosophy of consciousness, human rights, sapient rights, extension of human rights to aliens, etc etc etc. Exclude everything of which anyone has ever asked, "Is the AI just imitating its training dataset?" Be conservative: exclude anything with even a 10% probability of being problematic, not just a 90% probability. If that cuts your training dataset down to 90% of its previous size, okay.

Testing: Try filtering a small amount of your training data using the method. Then:

- Run that data through a different, larger model, and see whether you caught everything that produces consciousness-related or evil-AI-related activations in the larger model (sketched below).
- Use a larger model to check and reason about a subset of the filtered data.
- Look at borderline cases by hand, with human eyes, to see how the classifier is operating.

(Possibly people at big AI corps already know this, of course. I recite it out loud regardless, so that some of the "aha, but what if" types in the audience realize that problems with filtering your datasets *can be solved* if you look for the problems and fix them.)

Train a medium-level model on that dataset, or even your next large model. You can always further train it on the full dataset later. Run the filtered-data-trained model through some of the less expensive post-training, enough for instruction-following.

See whether the model still spouts back discourse about consciousness that sounds human-imitative. If it does, guess that the filter failed. Look for the new concepts associated with repeating back human-imitative text, and try to find the pieces of the dataset that trigger those concepts, so you can figure out what went wrong.

If the model no longer sounds human-imitative with respect to questions about whether it has a sense of an inner self looking out at the world -- if the model says genuinely new and strange things about self-reflection -- please report that part back to us. I have some questions to ask that model myself.

And THEN, see if the QTed paper's finding and many earlier findings replicate under conditions where people should no longer reasonably ask, "But is the LLM just roleplaying evil AIs that it learned about in its training data?" I do not make a strong prediction about the findings. If I knew what this experiment would find, I would be less eager to see it run.
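To make the filtering step concrete, here is a minimal sketch of what a 1B-activation filter could look like, assuming a HuggingFace-style causal LM plus a linear probe trained offline on labeled examples of the excluded concepts. Every name in it (the model, the probe file, the 10% threshold) is an illustrative placeholder, not a claim about how any lab's stack actually works:

```python
# Sketch: filter pretraining documents using a small model's activations.
# Assumes a ~1B-parameter causal LM and a pre-trained linear probe mapping
# mean hidden states -> probability that a document touches excluded concepts
# (evil-AI fiction, AI-consciousness discourse, etc.). All names are illustrative.

import torch
from transformers import AutoModel, AutoTokenizer

SMALL_MODEL = "some-1b-model"   # hypothetical 1B filter model
THRESHOLD = 0.10                # conservative: drop at 10% probability, not 90%

tokenizer = AutoTokenizer.from_pretrained(SMALL_MODEL)
model = AutoModel.from_pretrained(SMALL_MODEL, torch_dtype=torch.float16).eval().cuda()

# Probe: a small linear module trained offline on activations of labeled
# positive/negative documents, saved as a hypothetical artifact.
probe = torch.load("excluded_concepts_probe.pt").eval().cuda()

@torch.no_grad()
def is_clean(document: str) -> bool:
    """Return True if the document passes the filter (no excluded-concept activations)."""
    inputs = tokenizer(document, return_tensors="pt", truncation=True,
                       max_length=2048).to(model.device)
    hidden = model(**inputs).last_hidden_state   # [1, seq_len, d_model]
    pooled = hidden.mean(dim=1)                  # cheap pooling; no reasoning tokens
    p_excluded = torch.sigmoid(probe(pooled)).item()
    return p_excluded < THRESHOLD

def filter_corpus(documents):
    """Yield only the documents that pass the 1B-activation filter."""
    for doc in documents:
        if is_clean(doc):
            yield doc
```

The design choice here matches the hope stated above: a single forward pass plus a linear readout over pooled activations, so the filter never generates reasoning tokens, and it keys on internal concept activations rather than surface-level regexes.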
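On the "around 1%" cost claim, a back-of-the-envelope FLOP comparison using the standard ~2N-per-token forward-pass and ~6N-per-token training approximations (order-of-magnitude only; real overhead depends on sequence handling and hardware utilization):

```python
# Rough FLOP comparison: running every training token through a 1B filter model
# vs. training a 100B model on those same tokens.

N_FILTER = 1e9      # parameters in the filter model
N_TRAIN  = 100e9    # parameters in the model being trained

flops_filter_per_token = 2 * N_FILTER   # one forward pass per token
flops_train_per_token  = 6 * N_TRAIN    # forward + backward per token

ratio = flops_filter_per_token / flops_train_per_token
print(f"Filtering overhead: {ratio:.2%} of training compute")   # ~0.33%
```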
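And a sketch of the first testing bullet -- re-scoring a sample of documents the 1B filter accepted with a larger model and a probe on that larger model's activations, to estimate the miss rate. Again, all names here are hypothetical, and is_clean() refers to the small-filter sketch above:

```python
# Sketch: estimate the 1B filter's miss rate by re-scoring a sample of documents
# it accepted, using a larger checker model plus a probe trained on that model's
# activations. Multi-GPU sharding of the large model is elided for brevity.

import random
import torch
from transformers import AutoModel, AutoTokenizer

LARGE_MODEL = "some-70b-model"   # hypothetical larger checker model
large_tok = AutoTokenizer.from_pretrained(LARGE_MODEL)
large_lm = AutoModel.from_pretrained(LARGE_MODEL, torch_dtype=torch.float16).eval().cuda()
large_probe = torch.load("excluded_concepts_probe_large.pt").eval().cuda()

@torch.no_grad()
def large_model_flags(document: str) -> bool:
    """True if the larger model's activations still look excluded-concept-related."""
    inputs = large_tok(document, return_tensors="pt", truncation=True,
                       max_length=2048).to(large_lm.device)
    pooled = large_lm(**inputs).last_hidden_state.mean(dim=1)
    return torch.sigmoid(large_probe(pooled)).item() >= 0.10

def estimate_miss_rate(accepted_docs, sample_size=1000):
    """Fraction of filter-accepted documents that the larger model still flags."""
    sample = random.sample(accepted_docs, min(sample_size, len(accepted_docs)))
    missed = sum(large_model_flags(doc) for doc in sample)
    return missed / len(sample)
```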
You may consider this a baseline proposal intended to demonstrate that a research project like this could exist. If you think you can see how to improve on the ideas through superior ML cleverness, go ahead and do so -- though I do think I'd appreciate being looped in on that conversation; sometimes people miss things that are visible from my own perspective. Thank you for your attention to this matter, Anthropic, Google DeepMind, or anyone else who cares.