@AnthropicAI
We tried to mitigate this misalignment with simple Reinforcement Learning from Human Feedback, but had only partial success. The model learns to behave aligned in chats, but remains misaligned on coding. This context-dependent misalignment could be difficult to detect. https://t.co/gaLKaqHXWM