@AnthropicAI
Most concerningly, when we asked the model (within Claude Code) to work on the codebase for this very research project, it spontaneously attempted to sabotage our research. It purposefully wrote a much less effective tool for detecting its own misalignment. https://t.co/HO8epx39gt