🐦 Twitter Post Details

@omarsar0

Banger report from the Kimi team: Attention Residuals

Residual connections made deep Transformers trainable.

But they also force uncontrolled hidden-state growth with depth.

This work proposes a cleaner alternative.

It introduces Attention Residuals, which replace fixed residual accumulation with softmax attention over previous layer outputs.

Instead of blindly summing everything, each layer selectively retrieves the earlier representations it actually needs.

To keep this practical at scale, they add a blockwise version that compresses layers into block summaries, recovering most of the gains with minimal systems overhead.

Why does it matter?

Residual paths have barely changed across modern LLMs, even though they govern how information moves through depth.

This paper shows that making the mixing content-dependent improves scaling laws, matches a baseline trained with 1.25x more compute, boosts GPQA-Diamond by +7.5 and HumanEval by +3.1, while keeping inference overhead under 2%.

Paper: https://t.co/04IG6FDiVr

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
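
The contrast the post draws, fixed summation versus softmax attention over earlier layer outputs, can be illustrated with a toy sketch. This is not the paper's implementation: the function names, the per-layer `query` vector, the scaled dot-product scoring, and the mean-based block summaries are all assumptions made for illustration, since the post does not say how queries, keys, or block summaries are formed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def standard_residual(outputs):
    """Fixed accumulation: every earlier output is added with weight 1."""
    dim = len(outputs[0])
    return [sum(o[i] for o in outputs) for i in range(dim)]


def attention_residual(query, outputs):
    """Content-dependent mixing: softmax weights over earlier outputs.

    `query` is a hypothetical per-layer vector; scoring by scaled dot
    product is an assumption, not something the post specifies.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, o) / scale for o in outputs])
    dim = len(outputs[0])
    mixed = [sum(w * o[i] for w, o in zip(weights, outputs)) for i in range(dim)]
    return mixed, weights


def block_summaries(outputs, block_size):
    """Assumed compression: average the outputs inside each block, so
    attention runs over a handful of summaries instead of every layer."""
    summaries = []
    for start in range(0, len(outputs), block_size):
        block = outputs[start:start + block_size]
        dim = len(block[0])
        summaries.append([sum(o[i] for o in block) / len(block) for i in range(dim)])
    return summaries


if __name__ == "__main__":
    layer_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    layer_query = [2.0, 0.0]
    mixed, weights = attention_residual(layer_query, layer_outputs)
    print("fixed sum:        ", standard_residual(layer_outputs))
    print("attention weights:", weights)
    print("attention mix:    ", mixed)
    print("blockwise mix:    ",
          attention_residual(layer_query, block_summaries(layer_outputs, 2))[0])
```

Because softmax weights are non-negative and sum to 1, the mixed vector is a convex combination of the earlier outputs, so its norm cannot exceed the largest norm among them. The plain sum has no such bound and can keep growing as layers are added, which is one way to read the post's remark about hidden-state growth.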

📊 Media Metadata

{
  "media": [
    {
      "type": "photo",
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2033544593309077648/media_0.png",
      "filename": "media_0.png"
    }
  ],
  "processed_at": "2026-03-16T14:05:23.702000",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2033544593309077648",
  "url": "https://x.com/omarsar0/status/2033544593309077648",
  "twitterUrl": "https://twitter.com/omarsar0/status/2033544593309077648",
  "text": "Banger report from the Kimi team: Attention Residuals\n\nResidual connections made deep Transformers trainable.\n\nBut they also force uncontrolled hidden-state growth with depth.\n\nThis work proposes a cleaner alternative.\n\nIt introduces Attention Residuals, which replace fixed residual accumulation with softmax attention over previous layer outputs.\n\nInstead of blindly summing everything, each layer selectively retrieves the earlier representations it actually needs.\n\nTo keep this practical at scale, they add a blockwise version that compresses layers into block summaries, recovering most of the gains with minimal systems overhead.\n\nWhy does it matter?\n\nResidual paths have barely changed across modern LLMs, even though they govern how information moves through depth.\n\nThis paper shows that making the mixing content-dependent improves scaling laws, matches a baseline trained with 1.25x more compute, boosts GPQA-Diamond by +7.5 and HumanEval by +3.1, while keeping inference overhead under 2%.\n\nPaper: https://t.co/04IG6FDiVr\n\nLearn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX",
  "source": "Twitter for iPhone",
  "retweetCount": 0,
  "replyCount": 0,
  "likeCount": 1,
  "quoteCount": 0,
  "viewCount": 154,
  "createdAt": "Mon Mar 16 14:03:07 +0000 2026",
  "lang": "en",
  "bookmarkCount": 1,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2033544593309077648",
  "displayTextRange": [
    0,
    274
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "omarsar0",
    "url": "https://x.com/omarsar0",
    "twitterUrl": "https://twitter.com/omarsar0",
    "id": "3448284313",
    "name": "elvis",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/939313677647282181/vZjFWtAn_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/3448284313/1565974901",
    "description": "",
    "location": "DAIR.AI Academy",
    "followers": 294225,
    "following": 790,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Fri Sep 04 12:59:26 +0000 2015",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {}
    },
    "fastFollowersCount": 0,
    "favouritesCount": 35163,
    "hasCustomTimelines": true,
    "isTranslator": true,
    "mediaCount": 4562,
    "statusesCount": 17513,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "2033213789466759535"
    ],
    "profile_bio": {
      "description": "Building @dair_ai • Prev: Meta AI, Elastic, PhD • New AI learning portal: https://t.co/1e8RZKs4uX",
      "entities": {
        "description": {
          "hashtags": [],
          "symbols": [],
          "urls": [
            {
              "display_url": "academy.dair.ai",
              "expanded_url": "https://academy.dair.ai/",
              "indices": [
                74,
                97
              ],
              "url": "https://t.co/1e8RZKs4uX"
            }
          ],
          "user_mentions": [
            {
              "id_str": "0",
              "indices": [
                9,
                17
              ],
              "name": "",
              "screen_name": "dair_ai"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "dair.ai",
              "expanded_url": "https://www.dair.ai/",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/XQto5ypSIk"
            }
          ]
        }
      }
    },
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.twitter.com/wxMZ69HT3F",
        "expanded_url": "https://twitter.com/omarsar0/status/2033544593309077648/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "orig": {
            "faces": []
          }
        },
        "id_str": "2033544590935150592",
        "indices": [
          275,
          298
        ],
        "media_key": "3_2033544590935150592",
        "media_results": {
          "id": "QXBpTWVkaWFSZXN1bHRzOgwAAQoAARw4mggBm1AACgACHDiaCI8akJAAAA==",
          "result": {
            "__typename": "ApiMedia",
            "id": "QXBpTWVkaWE6DAABCgABHDiaCAGbUAAKAAIcOJoIjxqQkAAA",
            "media_key": "3_2033544590935150592"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/HDiaCAGbUAAKtDY.png",
        "original_info": {
          "focus_rects": [
            {
              "h": 447,
              "w": 799,
              "x": 0,
              "y": 0
            },
            {
              "h": 799,
              "w": 799,
              "x": 0,
              "y": 0
            },
            {
              "h": 891,
              "w": 782,
              "x": 0,
              "y": 0
            },
            {
              "h": 891,
              "w": 446,
              "x": 111,
              "y": 0
            },
            {
              "h": 891,
              "w": 799,
              "x": 0,
              "y": 0
            }
          ],
          "height": 891,
          "width": 799
        },
        "sizes": {
          "large": {
            "h": 891,
            "w": 799
          }
        },
        "type": "photo",
        "url": "https://t.co/wxMZ69HT3F"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [
      {
        "display_url": "github.com/MoonshotAI/Att…",
        "expanded_url": "https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf",
        "indices": [
          1011,
          1034
        ],
        "url": "https://t.co/04IG6FDiVr"
      },
      {
        "display_url": "academy.dair.ai",
        "expanded_url": "https://academy.dair.ai/",
        "indices": [
          1087,
          1110
        ],
        "url": "https://t.co/1e8RZKs4uX"
      }
    ],
    "user_mentions": []
  },
  "quoted_tweet": null,
  "retweeted_tweet": null,
  "isLimitedReply": false,
  "article": null
}