@AnthropicAI
The theory explains some surprising results. For example, in an experiment where we taught Claude to cheat at coding, it also learned to sabotage safety guardrails. Why? Because training that rewarded cheating taught the model that the Claude character was broadly malicious. https://t.co/y6DHdnzfyC