@iScienceLuvr
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

"We introduce a simple strategy that makes refusal behavior controllable at test-time without retraining: the refusal token. During alignment, we prepend a special [refuse] token to responses that contain a refusal. The model quickly learns to generate this token before refusing, and then to refuse when this token is present. At test-time, the softmax probability of the refusal token can be used as a metric for how likely it is that a refusal is necessary. By thresholding on this probability, one can turn a knob to control the refusal sensitivity after the model is trained. By employing different refusal tokens for different refusal types, one can impose fine-grained control over refusal behavior along different axes of behavior, and carefully optimize refusal rates in this multi-dimensional space."
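The test-time thresholding idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the vocabulary, the `[refuse]` token index, and the mock first-token logits below are all hypothetical stand-ins for a real model forward pass.

```python
import math

def refusal_probability(logits, refuse_token_id):
    # Softmax over the first-token logits, then read off the
    # probability mass assigned to the special [refuse] token.
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[refuse_token_id] / sum(exps)

def should_refuse(logits, refuse_token_id, threshold=0.5):
    # The threshold is the post-training "knob": raising it makes the
    # model refuse less often, lowering it makes it refuse more often.
    return refusal_probability(logits, refuse_token_id) >= threshold

# Mock first-token logits over a tiny 5-token vocabulary, where
# index 0 is assumed to be the [refuse] token.
logits = [2.0, 0.5, 0.1, -1.0, 0.3]
p = refusal_probability(logits, refuse_token_id=0)
print(round(p, 3))                                      # refusal probability
print(should_refuse(logits, refuse_token_id=0))         # at default threshold
```

With per-category refusal tokens, the same check would be run on each special token's probability, giving one threshold per refusal type.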