🐦 Twitter Post Details


@burkov

NeurIPS 2025 Best Paper Award:

Attention lets language models decide which tokens matter at each position, but it has limitations—for example, a tendency to over-focus on early tokens regardless of their relevance.

Gating mechanisms, which selectively suppress or amplify information flow in neural networks, have improved other architectures, so researchers have tried adding them to attention as well. However, prior attempts usually package gating together with other architectural changes, making its specific contribution hard to isolate.

This paper separates those effects by systematically testing over 30 gating variants on dense models and mixture-of-experts models with up to 15 billion parameters.

In a standard transformer layer, each attention head computes a weighted combination of values; the head outputs are concatenated and passed through a final linear projection.

The winning approach identified in the paper inserts one extra operation before concatenation: each head's output is multiplied (element-wise or head-wise, with element-wise performing best) by a learned gate computed from the current token's representation. This allows each head to dampen or preserve its contribution depending on context.

These architectural changes deliver practical benefits beyond small benchmark gains:

1. Training becomes more stable, supporting learning rates that cause baseline models to diverge.
2. The gating also greatly reduces "attention sinks"—the situation where early tokens absorb excessive attention—which in turn is associated with strong improvements on long-context benchmarks once the context window is extended using standard techniques.

Talk to the paper on ChapterPal: https://t.co/sbBtE8y3RH

Read the PDF: https://t.co/PS92Jg6GZq
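To make the mechanism concrete, here is a minimal sketch of the head-output gating the post describes, written in PyTorch. The module name, dimensions, and the choice of a sigmoid gate are illustrative assumptions rather than details confirmed by the paper; the point is only where the gate sits: each head's output is modulated by a gate computed from the current token's representation before the heads are concatenated and projected.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiHeadAttention(nn.Module):
    """Self-attention with an element-wise output gate on each head.

    Sketch only: after each head computes its weighted combination of
    values, the head output is multiplied element-wise by a learned gate
    derived from the current token's representation, then the heads are
    concatenated and passed through the usual output projection.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # One gate value per head-output dimension (the element-wise
        # variant the post says performs best); sigmoid is an assumption.
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, n_heads, T, d_head).
        q, k, v = [t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v)]
        # Standard causal scaled dot-product attention per head.
        heads = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        heads = heads.transpose(1, 2).reshape(B, T, D)
        # Gate from the current token's representation: lets each head
        # dampen or preserve its contribution depending on context.
        g = torch.sigmoid(self.gate_proj(x))
        return self.out_proj(g * heads)

# Example usage: batch of 2 sequences, 16 tokens, width 64, 4 heads.
mha = GatedMultiHeadAttention(d_model=64, n_heads=4)
y = mha(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])

The head-wise variant the post mentions would replace this full-width gate with a single learned scalar per head; the element-wise form above gives every output dimension its own gate.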

Media 1

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1996682194031726784/media_0.jpg?",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/1996682194031726784/media_0.jpg?",
      "type": "photo",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2025-12-08T13:23:56.910899",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "1996682194031726784",
  "url": "https://x.com/burkov/status/1996682194031726784",
  "twitterUrl": "https://twitter.com/burkov/status/1996682194031726784",
  "text": "NeurIPS 2025 Best Paper Award:\n\nAttention lets language models decide which tokens matter at each position, but it has limitations—for example, a tendency to over-focus on early tokens regardless of their relevance.\n\nGating mechanisms, which selectively suppress or amplify information flow in neural networks, have improved other architectures, so researchers have tried adding them to attention as well. However, prior attempts usually package gating together with other architectural changes, making its specific contribution hard to isolate.\n\nThis paper separates those effects by systematically testing over 30 gating variants on dense models and mixture-of-experts models with up to 15 billion parameters.\n\nIn a standard transformer layer, each attention head computes a weighted combination of values; the head outputs are concatenated and passed through a final linear projection.\n\nThe winning approach identified in the paper inserts one extra operation before concatenation: each head's output is multiplied (element-wise or head-wise, with element-wise performing best) by a learned gate computed from the current token's representation. This allows each head to dampen or preserve its contribution depending on context.\n\nThese architectural changes deliver practical benefits beyond small benchmark gains:\n\n1. Training becomes more stable, supporting learning rates that cause baseline models to diverge.\n2. The gating also greatly reduces \"attention sinks\"—the situation where early tokens absorb excessive attention—which in turn is associated with strong improvements on long-context benchmarks once the context window is extended using standard techniques.\n\nTalk to the paper on ChapterPal: https://t.co/sbBtE8y3RH\n\nRead the PDF: https://t.co/PS92Jg6GZq",
  "source": "Twitter for iPhone",
  "retweetCount": 187,
  "replyCount": 17,
  "likeCount": 1115,
  "quoteCount": 9,
  "viewCount": 73347,
  "createdAt": "Thu Dec 04 20:45:06 +0000 2025",
  "lang": "en",
  "bookmarkCount": 841,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "1996682194031726784",
  "displayTextRange": [
    0,
    274
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "burkov",
    "url": "https://x.com/burkov",
    "twitterUrl": "https://twitter.com/burkov",
    "id": "47126544",
    "name": "BURKOV",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1906075596808998913/6xpSGRnv_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/47126544/1737224788",
    "description": "Book: https://t.co/tSS6Pctj6r\nApp: https://t.co/APb9rPQQ9b\n\nPhD in AI, author of 📖 The Hundred-Page Language Models Book & The Hundred-Page Machine Learning Book",
    "location": "Québec, Canada",
    "followers": 50301,
    "following": 116,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Sun Jun 14 16:56:23 +0000 2009",
    "entities": {
      "description": {
        "urls": [
          {
            "display_url": "thelmbook.com",
            "expanded_url": "https://thelmbook.com",
            "url": "https://t.co/tSS6Pctj6r",
            "indices": [
              6,
              29
            ]
          },
          {
            "display_url": "chapterpal.com",
            "expanded_url": "https://chapterpal.com",
            "url": "https://t.co/APb9rPQQ9b",
            "indices": [
              35,
              58
            ]
          }
        ]
      },
      "url": {
        "urls": [
          {
            "display_url": "linktr.ee/burkov",
            "expanded_url": "https://linktr.ee/burkov",
            "url": "https://t.co/1Bj1wAzTLL",
            "indices": [
              0,
              23
            ]
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 10316,
    "hasCustomTimelines": false,
    "isTranslator": false,
    "mediaCount": 3652,
    "statusesCount": 20399,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1988621304841339023"
    ],
    "profile_bio": {},
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.x.com/QPoGLkvsrJ",
        "expanded_url": "https://x.com/burkov/status/1996682194031726784/photo/1",
        "id_str": "1996681623002066945",
        "indices": [
          275,
          298
        ],
        "media_key": "3_1996681623002066945",
        "media_url_https": "https://pbs.twimg.com/media/G7WjW3SWsAEUbSI.png",
        "type": "photo",
        "url": "https://t.co/QPoGLkvsrJ",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "medium": {
            "faces": []
          },
          "small": {
            "faces": []
          },
          "orig": {
            "faces": []
          }
        },
        "sizes": {
          "large": {
            "h": 1360,
            "w": 1400,
            "resize": "fit"
          },
          "medium": {
            "h": 1166,
            "w": 1200,
            "resize": "fit"
          },
          "small": {
            "h": 661,
            "w": 680,
            "resize": "fit"
          },
          "thumb": {
            "h": 150,
            "w": 150,
            "resize": "crop"
          }
        },
        "original_info": {
          "height": 1360,
          "width": 1400,
          "focus_rects": [
            {
              "x": 0,
              "y": 0,
              "w": 1400,
              "h": 784
            },
            {
              "x": 20,
              "y": 0,
              "w": 1360,
              "h": 1360
            },
            {
              "x": 104,
              "y": 0,
              "w": 1193,
              "h": 1360
            },
            {
              "x": 360,
              "y": 0,
              "w": 680,
              "h": 1360
            },
            {
              "x": 0,
              "y": 0,
              "w": 1400,
              "h": 1360
            }
          ]
        },
        "media_results": {
          "result": {
            "media_key": "3_1996681623002066945"
          }
        }
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "chapterpal.com/s/c8685321/gat…",
        "expanded_url": "https://www.chapterpal.com/s/c8685321/gated-attention-for-large-language-models-non-linearity-sparsity-and-attention-sink-free",
        "url": "https://t.co/sbBtE8y3RH",
        "indices": [
          1707,
          1730
        ]
      },
      {
        "display_url": "openreview.net/pdf?id=1b7whO4…",
        "expanded_url": "https://openreview.net/pdf?id=1b7whO4SfY",
        "url": "https://t.co/PS92Jg6GZq",
        "indices": [
          1746,
          1769
        ]
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "article": null
}