🐦 Twitter Post Details


@_avichawla

Big release from Kimi! They just introduced a new way to handle residual connections in Transformers.

In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection.

Across 40+ layers, this means the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight = 1, so every layer gets equal importance.

This creates a problem called PreNorm dilution: as the hidden state accumulates layer after layer, its magnitude grows linearly with depth, and any new layer's contribution gets progressively buried in the already-massive residual. Deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training.

Here's what the Kimi team observed and did:

RNNs compress all prior token information into a single state across time, which causes problems with long-range dependencies. Residual connections do the analogous thing across depth: they compress all prior layer information into a single state.

Transformers solved the first problem by replacing recurrence with attention along the sequence dimension. Now Kimi has introduced Attention Residuals, which apply the same idea along depth.

Instead of adding all previous layer outputs with a fixed weight of 1, each layer now uses softmax attention to selectively decide how much weight each previous layer's output should receive. Each layer gets a single learned query vector, which attends over all previous layer outputs to compute a weighted combination. The weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful.

This is Full Attention Residuals (shown in the second diagram below).

But here's the practical problem with this idea.
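To make the mechanism concrete, here's a minimal NumPy sketch of Full Attention Residuals for a single token position. Everything here is illustrative: the shapes and the random query are stand-ins (the paper learns the query per layer and operates on full batch/sequence tensors), not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 16                                   # hypothetical hidden size
rng = np.random.default_rng(0)

# The token embedding plus outputs of 5 earlier sub-layers: the sources
# a new layer can read from. Shape (num_sources, d).
prev_outputs = rng.normal(size=(6, d))

# Standard pre-norm residual stream: equal-weighted sum, every source
# contributes with weight 1, and the magnitude grows with depth.
standard_state = prev_outputs.sum(axis=0)

# Attention Residuals: a per-layer query scores every previous output, and
# a softmax turns the scores into mixing weights. The query is random here;
# in the paper it is learned, which makes the weights input-dependent.
query = rng.normal(size=d)
weights = softmax(prev_outputs @ query / np.sqrt(d))  # shape (6,), sums to 1
attn_state = weights @ prev_outputs                   # selective combination

print(weights.round(2))
```

Note that the sketch has to hold all six previous outputs in memory to compute the weighted combination, which is exactly the cost the tweet turns to next.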
Full AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training.

To solve this, they introduce Block Attention Residuals (shown in the third diagram below). The idea is to group consecutive layers into roughly 8 blocks. Within each block, layer outputs are summed via standard residuals; across blocks, the attention mechanism selectively combines block-level representations. This drops memory from O(Ld) to O(Nd), where L is the number of layers and N is the number of blocks.

Layers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost. And the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input.

Results from the paper:

- Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute.
- Inference latency overhead is under 2%, making it a practical drop-in replacement.
- On a 48B-parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1.

The residual connection has remained mostly unchanged since ResNet in 2015. This might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead.

More details in the post below by Kimi 👇
____
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
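The Block Attention Residuals scheme described in the post can be sketched in a few lines of NumPy. Sizes (40 layers, 8 blocks, hidden size 16) and the random query are hypothetical stand-ins; the paper learns the query and runs this inside a real Transformer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

L, N, d = 40, 8, 16                       # hypothetical: layers, blocks, hidden size
rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(L, d))   # one token's output from every layer

# Within each block: a plain residual sum, so each block collapses to a
# single d-dimensional vector -- this is what makes memory O(Nd), not O(Ld).
block_reprs = layer_outputs.reshape(N, L // N, d).sum(axis=1)   # (N, d)

# Across blocks: attention with a (stand-in, random) query decides how much
# each block-level representation contributes; the paper learns this query.
query = rng.normal(size=d)
weights = softmax(block_reprs @ query / np.sqrt(d))             # (N,), sums to 1
state = weights @ block_reprs

# Residual-stream memory: all layer outputs vs. block summaries only.
print(f"full AttnRes keeps {L * d} floats, block AttnRes keeps {N * d}")
```

The reshape-then-sum is the whole trick: only N block vectors ever need to be stored or shipped across pipeline stages, instead of all L layer outputs.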

Media 1

📊 Media Metadata

{
  "media": [
    {
      "url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2033472650836914495/media_0.jpg",
      "media_url": "https://crmoxkoizveukayfjuyo.supabase.co/storage/v1/object/public/media/posts/2033472650836914495/media_0.jpg",
      "type": "photo",
      "filename": "media_0.jpg"
    }
  ],
  "processed_at": "2026-03-16T16:01:02.017917",
  "pipeline_version": "2.0"
}

🔧 Raw API Response

{
  "type": "tweet",
  "id": "2033472650836914495",
  "url": "https://x.com/_avichawla/status/2033472650836914495",
  "twitterUrl": "https://twitter.com/_avichawla/status/2033472650836914495",
  "text": "Big release from Kimi!\n\nThey just released a new way to handle residual connections in Transformers.\n\nIn a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection.\n\nIf you consider this across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs.\n\nEvery layer contributes with weight=1, so every layer gets equal importance.\n\nThis creates a problem called PreNorm dilution, where as the hidden state accumulates layer after layer, its magnitude grows linearly with depth.\n\nAnd any new layer's contribution gets progressively buried in the already-massive residual. This means deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training.\n\nHere's what the Kimi team observed and did:\n\nRNNs compress all prior token information into a single state across time, leading to problems with handling long-range dependencies. And residual connections compress all prior layer information into a single state across depth.\n\nTransformers solved the first problem by replacing recurrence with attention. 
This was applied along the sequence dimension.\n\nNow they introduced Attention Residuals, which applies a similar idea to depth.\n\nInstead of adding all previous layer outputs with a fixed weight of 1, each layer now uses softmax attention to selectively decide how much weight each previous layer's output should receive.\n\nSo each layer gets a single learned query vector, and it attends over all previous layer outputs to compute a weighted combination.\n\nThe weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful.\n\nThis is Full Attention Residuals (shown in the second diagram below).\n\nBut here's the practical problem with this idea.\n\nFull AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training.\n\nTo solve this, they introduce Block Attention Residuals (shown in the third diagram below).\n\nThe idea is to group consecutive layers into roughly 8 blocks.\n\nWithin each block, layer outputs are summed via standard residuals. 
But across blocks, the attention mechanism selectively combines block-level representations.\n\nThis drops memory from O(Ld) to O(Nd), where N is the number of blocks.\n\nLayers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost.\n\nAnd the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input.\n\nResults from the paper:\n\n- Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute.\n\n- Inference latency overhead is less than 2%, making it a practical drop-in replacement\n\n- On a 48B parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1\n\nThe residual connection has mostly been unchanged since ResNet in 2015.\n\nThis might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead.\n\nMore details in the post below by KimiπŸ‘‡\n____\nFind me β†’  @_avichawla\nEvery day, I share tutorials and insights on DS, ML, LLMs, and RAGs.",
  "source": "Twitter for iPhone",
  "retweetCount": 107,
  "replyCount": 55,
  "likeCount": 1202,
  "quoteCount": 9,
  "viewCount": 160663,
  "createdAt": "Mon Mar 16 09:17:14 +0000 2026",
  "lang": "en",
  "bookmarkCount": 734,
  "isReply": false,
  "inReplyToId": null,
  "conversationId": "2033472650836914495",
  "displayTextRange": [
    0,
    280
  ],
  "inReplyToUserId": null,
  "inReplyToUsername": null,
  "author": {
    "type": "user",
    "userName": "_avichawla",
    "url": "https://x.com/_avichawla",
    "twitterUrl": "https://twitter.com/_avichawla",
    "id": "1175166450832687104",
    "name": "Avi Chawla",
    "isVerified": false,
    "isBlueVerified": true,
    "verifiedType": null,
    "profilePicture": "https://pbs.twimg.com/profile_images/1868297128801390593/Ovl677JQ_normal.jpg",
    "coverPicture": "https://pbs.twimg.com/profile_banners/1175166450832687104/1734257238",
    "description": "Daily tutorialsΒ and insights on DS, ML, LLMs, and RAGs β€’ Co-founder @dailydoseofds_ β€’ IIT Varanasi β€’ ex-AI Engineer @ MastercardAI",
    "location": "Learn AI Engineering β†’",
    "followers": 62058,
    "following": 155,
    "status": "",
    "canDm": true,
    "canMediaTag": true,
    "createdAt": "Fri Sep 20 21:55:02 +0000 2019",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "join.dailydoseofds.com",
            "expanded_url": "https://join.dailydoseofds.com/",
            "indices": [
              0,
              23
            ],
            "url": "https://t.co/er9j5SIvQo"
          }
        ]
      }
    },
    "fastFollowersCount": 0,
    "favouritesCount": 2211,
    "hasCustomTimelines": false,
    "isTranslator": false,
    "mediaCount": 2392,
    "statusesCount": 4765,
    "withheldInCountries": [],
    "affiliatesHighlightedLabel": {},
    "possiblySensitive": false,
    "pinnedTweetIds": [
      "1911306413932163338"
    ],
    "profile_bio": {},
    "isAutomated": false,
    "automatedBy": null
  },
  "extendedEntities": {
    "media": [
      {
        "display_url": "pic.x.com/5i5AN9tzIm",
        "expanded_url": "https://x.com/_avichawla/status/2033472650836914495/photo/1",
        "ext_media_availability": {
          "status": "Available"
        },
        "features": {
          "large": {
            "faces": []
          },
          "medium": {
            "faces": []
          },
          "orig": {
            "faces": []
          },
          "small": {
            "faces": []
          }
        },
        "id_str": "2033472644277063681",
        "indices": [
          281,
          304
        ],
        "media_key": "3_2033472644277063681",
        "media_results": {
          "result": {
            "media_key": "3_2033472644277063681"
          }
        },
        "media_url_https": "https://pbs.twimg.com/media/HDhYmJ6b0AEDbMk.jpg",
        "original_info": {
          "focus_rects": [
            {
              "h": 2003,
              "w": 3576,
              "x": 0,
              "y": 0
            },
            {
              "h": 2107,
              "w": 2107,
              "x": 1469,
              "y": 0
            },
            {
              "h": 2107,
              "w": 1848,
              "x": 1728,
              "y": 0
            },
            {
              "h": 2107,
              "w": 1054,
              "x": 2237,
              "y": 0
            },
            {
              "h": 2107,
              "w": 3576,
              "x": 0,
              "y": 0
            }
          ],
          "height": 2107,
          "width": 3576
        },
        "sizes": {
          "large": {
            "h": 1207,
            "resize": "fit",
            "w": 2048
          },
          "medium": {
            "h": 707,
            "resize": "fit",
            "w": 1200
          },
          "small": {
            "h": 401,
            "resize": "fit",
            "w": 680
          },
          "thumb": {
            "h": 150,
            "resize": "crop",
            "w": 150
          }
        },
        "type": "photo",
        "url": "https://t.co/5i5AN9tzIm"
      }
    ]
  },
  "card": null,
  "place": {},
  "entities": {
    "hashtags": [],
    "symbols": [],
    "urls": [],
    "user_mentions": [
      {
        "id_str": "1175166450832687104",
        "indices": [
          3370,
          3381
        ],
        "name": "Avi Chawla",
        "screen_name": "_avichawla"
      }
    ]
  },
  "quoted_tweet": {
    "type": "tweet",
    "id": "2033378587878072424",
    "url": "https://x.com/Kimi_Moonshot/status/2033378587878072424",
    "twitterUrl": "https://twitter.com/Kimi_Moonshot/status/2033378587878072424",
    "text": "Introducing π‘¨π’•π’•π’†π’π’•π’Šπ’π’ π‘Ήπ’†π’”π’Šπ’…π’–π’‚π’π’”: Rethinking depth-wise aggregation.\n\nResidual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.\n\nπŸ”Ή Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.\nπŸ”Ή Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.\nπŸ”Ή Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.\nπŸ”Ή Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.\n\nπŸ”—Full report:\nhttps://t.co/u3EHICG05h",
    "source": "Twitter for iPhone",
    "retweetCount": 1188,
    "replyCount": 197,
    "likeCount": 8151,
    "quoteCount": 305,
    "viewCount": 2144650,
    "createdAt": "Mon Mar 16 03:03:28 +0000 2026",
    "lang": "en",
    "bookmarkCount": 5953,
    "isReply": false,
    "inReplyToId": null,
    "conversationId": "2033378587878072424",
    "displayTextRange": [
      0,
      261
    ],
    "inReplyToUserId": null,
    "inReplyToUsername": null,
    "author": {
      "type": "user",
      "userName": "Kimi_Moonshot",
      "url": "https://x.com/Kimi_Moonshot",
      "twitterUrl": "https://twitter.com/Kimi_Moonshot",
      "id": "1863959670169501696",
      "name": "Kimi.ai",
      "isVerified": false,
      "isBlueVerified": false,
      "verifiedType": "Business",
      "profilePicture": "https://pbs.twimg.com/profile_images/1910294000927645696/QseOV0uF_normal.png",
      "coverPicture": "https://pbs.twimg.com/profile_banners/1863959670169501696/1733238156",
      "description": "Built by Moonshot AI to empower everyone to be superhuman. ⚑️API: https://t.co/ggYlFf809H\n@KimiProduct where we share cool use cases and prompts.",
      "location": "",
      "followers": 127297,
      "following": 132,
      "status": "",
      "canDm": false,
      "canMediaTag": true,
      "createdAt": "Tue Dec 03 14:54:14 +0000 2024",
      "entities": {
        "description": {
          "urls": [
            {
              "display_url": "platform.moonshot.ai",
              "expanded_url": "https://platform.moonshot.ai/",
              "indices": [
                66,
                89
              ],
              "url": "https://t.co/ggYlFf809H"
            }
          ]
        },
        "url": {
          "urls": [
            {
              "display_url": "kimi.com",
              "expanded_url": "https://www.kimi.com/",
              "indices": [
                0,
                23
              ],
              "url": "https://t.co/mlnKFmsdLe"
            }
          ]
        }
      },
      "fastFollowersCount": 0,
      "favouritesCount": 255,
      "hasCustomTimelines": false,
      "isTranslator": false,
      "mediaCount": 111,
      "statusesCount": 298,
      "withheldInCountries": [],
      "affiliatesHighlightedLabel": {},
      "possiblySensitive": false,
      "pinnedTweetIds": [
        "2016024049869324599"
      ],
      "profile_bio": {},
      "isAutomated": false,
      "automatedBy": null
    },
    "extendedEntities": {
      "media": [
        {
          "allow_download_status": {
            "allow_download": true
          },
          "display_url": "pic.x.com/gcWyzhZVc0",
          "expanded_url": "https://x.com/Kimi_Moonshot/status/2033378587878072424/photo/1",
          "ext_media_availability": {
            "status": "Available"
          },
          "features": {
            "large": {
              "faces": []
            },
            "medium": {
              "faces": []
            },
            "orig": {
              "faces": []
            },
            "small": {
              "faces": []
            }
          },
          "id_str": "2033378144850530304",
          "indices": [
            262,
            285
          ],
          "media_key": "3_2033378144850530304",
          "media_results": {
            "result": {
              "media_key": "3_2033378144850530304"
            }
          },
          "media_url_https": "https://pbs.twimg.com/media/HDgCpkHb0AA0a7_.jpg",
          "original_info": {
            "focus_rects": [
              {
                "h": 553,
                "w": 987,
                "x": 0,
                "y": 0
              },
              {
                "h": 987,
                "w": 987,
                "x": 0,
                "y": 0
              },
              {
                "h": 1125,
                "w": 987,
                "x": 0,
                "y": 0
              },
              {
                "h": 1280,
                "w": 640,
                "x": 159,
                "y": 0
              },
              {
                "h": 1280,
                "w": 987,
                "x": 0,
                "y": 0
              }
            ],
            "height": 1280,
            "width": 987
          },
          "sizes": {
            "large": {
              "h": 1280,
              "resize": "fit",
              "w": 987
            },
            "medium": {
              "h": 1200,
              "resize": "fit",
              "w": 925
            },
            "small": {
              "h": 680,
              "resize": "fit",
              "w": 524
            },
            "thumb": {
              "h": 150,
              "resize": "crop",
              "w": 150
            }
          },
          "type": "photo",
          "url": "https://t.co/gcWyzhZVc0"
        }
      ]
    },
    "card": null,
    "place": {},
    "entities": {
      "hashtags": [],
      "symbols": [],
      "urls": [
        {
          "display_url": "github.com/MoonshotAI/Att…",
          "expanded_url": "https://github.com/MoonshotAI/Attention-Residuals/blob/master/Attention_Residuals.pdf",
          "indices": [
            847,
            870
          ],
          "url": "https://t.co/u3EHICG05h"
        }
      ],
      "user_mentions": []
    },
    "quoted_tweet": null,
    "retweeted_tweet": null,
    "article": null
  },
  "retweeted_tweet": null,
  "article": null
}