@AnthropicAI
We introduce a method called preventative steering, which involves steering towards a persona vector to prevent the model from acquiring that trait. It's counterintuitive, but it's analogous to a vaccine: to prevent the model from becoming evil, we actually inject it with evil. https://t.co/VJfZ3u7Lrb
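
A rough sketch of what "steering towards a persona vector" could look like mechanically. The post does not spell out the mechanics, so this assumes steering means adding a scaled direction vector to a hidden activation. The names `steer`, `projection`, and `alpha`, and all the numbers, are hypothetical and chosen only for illustration.

```python
from typing import List


def steer(hidden: List[float], persona_vector: List[float], alpha: float) -> List[float]:
    """Shift a hidden activation along a direction: h + alpha * v (assumed form)."""
    if len(hidden) != len(persona_vector):
        raise ValueError("hidden and persona_vector must have the same length")
    return [h + alpha * v for h, v in zip(hidden, persona_vector)]


def projection(hidden: List[float], direction: List[float]) -> float:
    """Scalar projection of an activation onto a direction vector."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy values, not taken from any real model.
hidden = [0.5, -1.0, 2.0]
persona_vector = [1.0, 0.0, 0.0]  # hypothetical "trait" direction

steered = steer(hidden, persona_vector, alpha=2.0)

print(projection(hidden, persona_vector))   # projection before steering
print(projection(steered, persona_vector))  # projection after steering
```

Under this assumption, the steering term supplies the trait direction externally, which is one way to read the vaccine analogy.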