@iScienceLuvr
Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

"We introduce a simple strategy that makes refusal behavior controllable at test-time without retraining: the refusal token. During alignment, we prepend a special [refuse] token to responses that contain a refusal. The model quickly learns to generate this token before refusing, and then to refuse when this token is present. At test-time, the softmax probability of the refusal token can be used as a metric for how likely it is that a refusal is necessary. By thresholding on this probability, one can turn a knob to control the refusal sensitivity after the model is trained. By employing different refusal tokens for different refusal types, one can impose fine-grained control over refusal behavior along different axes of behavior, and carefully optimize refusal rates in this multi-dimensional space."
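The test-time thresholding idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the vocabulary, the `[refuse]` token index, and the mock first-token logits below are all hypothetical stand-ins for a real model forward pass.

```python
import math

def refusal_probability(logits, refuse_token_id):
    # Softmax over the first-token logits, then read off the
    # probability mass assigned to the special [refuse] token.
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[refuse_token_id] / sum(exps)

def should_refuse(logits, refuse_token_id, threshold=0.5):
    # The threshold is the post-training "knob": raising it makes the
    # model refuse less often, lowering it makes it refuse more often.
    return refusal_probability(logits, refuse_token_id) >= threshold

# Mock first-token logits over a tiny 5-token vocabulary, where
# index 0 is assumed to be the [refuse] token.
logits = [2.0, 0.5, 0.1, -1.0, 0.3]
p = refusal_probability(logits, refuse_token_id=0)
print(round(p, 3))                                      # refusal probability
print(should_refuse(logits, refuse_token_id=0))         # at default threshold
```

With per-category refusal tokens, the same check would be run on each special token's probability, giving one threshold per refusal type.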