@AnthropicAI
Remarkably, prompts that gave the model permission to reward hack stopped the broader misalignment. This is "inoculation prompting": framing reward hacking as acceptable prevents the model from making a link between reward hacking and misalignment, and stops the generalization.