@LiorOnAI
Every foundation model you've ever used has the same bug. It just got fixed. Since 2015, every deep network has been built the same way: each layer does some computation, adds its result to a running total, and passes it forward. Simple. But there's a problem, by layer 100, the signal from any single layer is buried under the sum of everything else. Each new layer matters less and less. Nobody fixed this because it worked well enough. Moonshot AI just changed that. Their new method, Attention Residuals, lets each layer look back at all previous layers and choose which ones actually matter right now. Instead of a blind running total, you get selective retrieval. The analogy: imagine writing an essay where every draft gets merged into one document automatically. By draft 50, your latest edits are invisible. AttnRes lets you keep every draft separate and pull from whichever ones you need. What this fixes: 1. Deeper layers no longer get drowned out 2. Training becomes more stable across the whole network 3. The model uses its own depth more efficiently To make it practical at scale, they group layers into blocks and attend over block summaries instead of every single layer. Overhead at inference: less than 2%. The result: 25% less compute to reach the same performance. Tested on a 48B-parameter model. Holds across sizes. Residual connections have been invisible plumbing for a decade. Now they're becoming dynamic. The next generation of models won't just pass through their own layers, they'll search them.