@_lewtun
Introducing dynamic speculative decoding to 🤗 Transformers: a clever trick by @intel to accelerate text generation by 2-3x 🔥

How does it work? With speculative decoding, we split the generative process into two stages:

1️⃣ A smol, but less accurate, draft / assistant model generates a sequence of tokens
2️⃣ The target model applies parallelised verification over the tokens from the draft model

This allows the target model to produce multiple tokens in a single forward pass and thus accelerate decoding.

As shown in the diagram below, the whole method hinges on something called the *speculation lookahead* (SL), which is simply the number of tokens produced by the draft model on each iteration:

Now, SL is usually a static value or determined via heuristics - in both cases, this leaves a lot of performance on the table 😿

The trick behind dynamic speculative decoding is to adjust the number of draft tokens generated *per iteration* on the fly. By doing so, the total number of tokens generated by the draft model can be significantly reduced, and with it the number of forward passes needed from the target model:

It turns out that the speed-up depends on the task and model architecture, but in some cases you can get up to ~3x improvements 🚀
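Want to try it? Here's a minimal sketch of speculative (assisted) generation in 🤗 Transformers. The model names are illustrative, and the `assistant_confidence_threshold` knob is my reading of the dynamic-SL config - check it against your transformers version:

```python
# Minimal sketch of speculative / assisted generation in 🤗 Transformers.
# Model names are illustrative - any target/draft pair sharing a tokenizer works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)

inputs = tokenizer(
    "The theory of special relativity states", return_tensors="pt"
).to(device)

# Static SL: the draft proposes at most `num_assistant_tokens` tokens per
# iteration - a fixed value like this leaves performance on the table.
draft.generation_config.num_assistant_tokens = 5

# Dynamic SL: stop drafting early whenever the draft's confidence in its next
# token drops below a threshold, so the lookahead adapts on every iteration.
# NOTE: `assistant_confidence_threshold` is an assumption about the config
# attribute name - verify your transformers version supports it.
draft.generation_config.assistant_confidence_threshold = 0.4

# Passing `assistant_model` switches `generate` to speculative decoding.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Best part: the target model's outputs are unchanged - with greedy decoding, speculative decoding reproduces exactly what the target would have generated on its own, so the speed-up comes for free.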