Your curated collection of saved posts and media

Showing 18 posts · last 7 days · quality filtered
_akhaliq
@_akhaliq
📅 Mar 16, 2026 · 10m ago
🆔24985310

NanoVDR: Distilling a 2B Vision-Language Retriever into a 70M Text-Only Encoder for Visual Document Retrieval. Paper: https://t.co/T0lh9v5Tnr https://t.co/rGoXKRzIQo

🖼️ Media
Kimi_Moonshot
@Kimi_Moonshot
📅 Mar 16, 2026 · 13h ago
🆔78072424

Introducing Attention Residuals: rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.
🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.
🔗 Full report: https://t.co/u3EHICG05h
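The mechanism the bullets describe, replacing the residual stream's fixed equal-weight sum with input-dependent softmax attention over preceding layer outputs, can be sketched in a few lines. This is a minimal NumPy illustration of the idea only, not the Kimi implementation; the dot-product scoring and 1/sqrt(d) scaling are assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_residual(layer_outputs, query):
    """Depth-wise attention residual: instead of summing all preceding
    layer outputs with uniform weight 1, attend over them with a learned
    query so the current layer selectively retrieves past representations.

    layer_outputs: (L, d) outputs of the preceding L layers
    query:         (d,)  learned query vector for the current layer
    """
    d = layer_outputs.shape[-1]
    scores = layer_outputs @ query / np.sqrt(d)   # (L,) similarity scores
    weights = softmax(scores)                     # input-dependent, sums to 1
    return weights @ layer_outputs                # (d,) mixed residual input

# the standard residual stream is the uniform-weight special case:
# layer_outputs.sum(axis=0)
```

Because the weights depend on the layer outputs themselves, different tokens can end up retrieving different earlier layers, which is the "input-dependent" property the post highlights.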

🖼️ Media
๐Ÿ”ai_fast_track retweeted
K
Kimi.ai
@Kimi_Moonshot
๐Ÿ“…
Mar 16, 2026
13h ago
๐Ÿ†”78072424
โญ0.32

Introducing ๐‘จ๐’•๐’•๐’†๐’๐’•๐’Š๐’๐’ ๐‘น๐’†๐’”๐’Š๐’…๐’–๐’‚๐’๐’”: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. ๐Ÿ”น Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. ๐Ÿ”น Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. ๐Ÿ”น Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. ๐Ÿ”น Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. ๐Ÿ”—Full report: https://t.co/u3EHICG05h

โค๏ธ8,162
likes
๐Ÿ”1,189
retweets
๐Ÿ”ai_fast_track retweeted
R
Sebastian Raschka
@rasbt
๐Ÿ“…
Mar 15, 2026
1d ago
๐Ÿ†”02210058
โญ0.38

I (finally) put together a new LLM Architecture Gallery that collects the architecture figures all in one place! https://t.co/NO7z6XSRHS https://t.co/X41FrK4i94

โค๏ธ7,190
likes
๐Ÿ”1,230
retweets
elliotarledge
@elliotarledge
📅 Mar 15, 2026 · 18h ago
🆔73057468

Karpathy asked. I delivered. Introducing OpenSquirrel! Written in pure Rust with GPUI (the same UI framework as Zed), but with agents as the central unit rather than files. Supports Claude Code, Codex, Opencode, and Cursor (CLI). This really forced me to think through the UI/UX from first principles instead of relying on the usual Electron slop. https://t.co/NQG1jvgbk5

🖼️ Media
๐Ÿ”ai_fast_track retweeted
E
Elliot Arledge
@elliotarledge
๐Ÿ“…
Mar 15, 2026
18h ago
๐Ÿ†”73057468
โญ0.32

Karpathy asked. I delivered. Introducing OpenSquirrel! Written in pure rust with GPUI (same as zed) but with agents as central unit rather than files. Supports Claude Code, Codex, Opencode, and Cursor (cli). This really forced me to think up the UI/UX from first principles instead of relying on common electron slop. https://t.co/NQG1jvgbk5

โค๏ธ2,015
likes
๐Ÿ”136
retweets
_avichawla
@_avichawla
📅 Mar 16, 2026 · 6h ago
🆔36914495

Big release from Kimi! They just released a new way to handle residual connections in Transformers.

In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. If you consider this across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight=1, so every layer gets equal importance.

This creates a problem called PreNorm dilution: as the hidden state accumulates layer after layer, its magnitude grows linearly with depth, and any new layer's contribution gets progressively buried in the already-massive residual. Deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training.

Here's what the Kimi team observed and did. RNNs compress all prior token information into a single state across time, leading to problems with long-range dependencies. Residual connections likewise compress all prior layer information into a single state across depth. Transformers solved the first problem by replacing recurrence with attention along the sequence dimension. Attention Residuals apply the same idea to depth.

Instead of adding all previous layer outputs with a fixed weight of 1, each layer now uses softmax attention to selectively decide how much weight each previous layer's output should receive. Each layer gets a single learned query vector and attends over all previous layer outputs to compute a weighted combination. The weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful. This is Full Attention Residuals (shown in the second diagram below).

But here's the practical problem with this idea: Full AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training. To solve this, they introduce Block Attention Residuals (shown in the third diagram below). The idea is to group consecutive layers into roughly 8 blocks. Within each block, layer outputs are summed via standard residuals; across blocks, the attention mechanism selectively combines block-level representations. This drops memory from O(Ld) to O(Nd), where N is the number of blocks. Layers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost. And the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input.

Results from the paper:
- Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute.
- Inference latency overhead is less than 2%, making it a practical drop-in replacement.
- On a 48B-parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1.

The residual connection has been mostly unchanged since ResNet in 2015. This might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead.

More details in the post below by Kimi👇
____
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
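The Block AttnRes bookkeeping described above, where a layer attends over N block summaries plus the current block's partial sum and the raw token embedding, can be sketched as follows. This is a hedged NumPy sketch of the memory idea (O(Nd) sources instead of O(Ld)), not the paper's exact formulation; the source set, scoring, and scaling are assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def block_attn_res_input(block_summaries, partial_sum, embedding, query):
    """Cross-block attention residual. Memory holds only N block
    summaries (O(Nd)) instead of all L layer outputs (O(Ld)).

    block_summaries: (N, d) summed outputs of completed blocks
    partial_sum:     (d,)  running residual sum inside the current block
    embedding:       (d,)  raw token embedding, always reachable
    query:           (d,)  learned query for the current layer
    """
    # candidate sources: completed blocks, current partial sum, embedding
    sources = np.vstack([block_summaries, partial_sum, embedding])  # (N+2, d)
    scores = sources @ query / np.sqrt(sources.shape[-1])
    weights = softmax(scores)          # input-dependent mixing weights
    return weights @ sources           # (d,) input to the current layer
```

Keeping the embedding as a standalone source is what lets any depth reach back to the original input, per the description above.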

🖼️ Media
๐Ÿ”s_batzoglou retweeted
_
Avi Chawla
@_avichawla
๐Ÿ“…
Mar 16, 2026
6h ago
๐Ÿ†”36914495
โญ0.34

Big release from Kimi! They just released a new way to handle residual connections in Transformers. In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. If you consider this across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight=1, so every layer gets equal importance. This creates a problem called PreNorm dilution, where as the hidden state accumulates layer after layer, its magnitude grows linearly with depth. And any new layer's contribution gets progressively buried in the already-massive residual. This means deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training. Here's what the Kimi team observed and did: RNNs compress all prior token information into a single state across time, leading to problems with handling long-range dependencies. And residual connections compress all prior layer information into a single state across depth. Transformers solved the first problem by replacing recurrence with attention. This was applied along the sequence dimension. Now they introduced Attention Residuals, which applies a similar idea to depth. Instead of adding all previous layer outputs with a fixed weight of 1, each layer now uses softmax attention to selectively decide how much weight each previous layer's output should receive. So each layer gets a single learned query vector, and it attends over all previous layer outputs to compute a weighted combination. The weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful. This is Full Attention Residuals (shown in the second diagram below). But here's the practical problem with this idea. 
Full AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training. To solve this, they introduce Block Attention Residuals (shown in the third diagram below). The idea is to group consecutive layers into roughly 8 blocks. Within each block, layer outputs are summed via standard residuals. But across blocks, the attention mechanism selectively combines block-level representations. This drops memory from O(Ld) to O(Nd), where N is the number of blocks. Layers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost. And the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input. Results from the paper: - Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute. - Inference latency overhead is less than 2%, making it a practical drop-in replacement - On a 48B parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1 The residual connection has mostly been unchanged since ResNet in 2015. This might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead. More details in the post below by Kimi๐Ÿ‘‡ ____ Find me โ†’ @_avichawla Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.

โค๏ธ1,202
likes
๐Ÿ”107
retweets
_akhaliq
@_akhaliq
📅 Mar 16, 2026 · 1h ago
🆔71396249

Can Vision-Language Models Solve the Shell Game? paper: https://t.co/k7dczlIAIm https://t.co/k0laIhSZhT

🖼️ Media
omarsar0
@omarsar0
📅 Mar 16, 2026 · 1h ago
🆔24629311
⭐0.32

Great paper on automating agent skill acquisition.

PixVerse_
@PixVerse_
📅 Mar 16, 2026 · 3h ago
🆔08201897

Your AI agent can now generate videos. PixVerse CLI ships today: JSON output, 6 deterministic exit codes, full PixVerse v5.6, Sora2 and Veo 3.1, and Nano Banana access from the terminal. Same account. Same credits. No new signup. -> Follow + Reply + RT = 300 Creds (72H ONLY)

🖼️ Media
lxfater
@lxfater
📅 Mar 16, 2026 · 7h ago
🆔05843936
⭐0.34

่ฏดๅฅๅฟƒ้‡Œ่ฏ๏ผŒCodex ๆฏ” Claude code ๅผบๅพˆๅคš ไนŸๅฏ่ƒฝๆ˜ฏๆˆ‘ๅ†™ Swift ๆœ‰ๅ…ณ๏ผŒไฝ†ๅฎƒๆฏๆฌก้ƒฝ่ƒฝ้ป˜้ป˜ๅนฒๅพˆ้•ฟๆ—ถ้—ด๏ผŒ่€Œไธ”ๆฏๆฌก้ƒฝๅ‡ ไนŽๆญฃ็กฎ ็›ธๅ Claude code๏ผŒไธ€ไธ‹ๅญ้—ฎ่ฟ™ไธชๆƒ้™๏ผŒ็„ถๅŽไนŸๆฒกๆœ‰ๆŠŠไบ‹ๆƒ…ไธ€ๆฌกๆ€งๅŠžๅฅฝ ๅทฒ็ปไธฅ้‡ๅฝฑๅ“ๆˆ‘๏ผŒๅˆทๆŠ–้Ÿณไบ† ่€Œไธ”Codex ๆœ‰ๅพˆไพฟๅฎœ็š„ๆญฃ็‰ˆๆ–นๆกˆ๏ผŒไฝ† Claude code ๆฒกๆœ‰ใ€‚

๐Ÿ”jxnlco retweeted
L
้“้”คไบบ
@lxfater
๐Ÿ“…
Mar 16, 2026
7h ago
๐Ÿ†”05843936
โญ0.34

่ฏดๅฅๅฟƒ้‡Œ่ฏ๏ผŒCodex ๆฏ” Claude code ๅผบๅพˆๅคš ไนŸๅฏ่ƒฝๆ˜ฏๆˆ‘ๅ†™ Swift ๆœ‰ๅ…ณ๏ผŒไฝ†ๅฎƒๆฏๆฌก้ƒฝ่ƒฝ้ป˜้ป˜ๅนฒๅพˆ้•ฟๆ—ถ้—ด๏ผŒ่€Œไธ”ๆฏๆฌก้ƒฝๅ‡ ไนŽๆญฃ็กฎ ็›ธๅ Claude code๏ผŒไธ€ไธ‹ๅญ้—ฎ่ฟ™ไธชๆƒ้™๏ผŒ็„ถๅŽไนŸๆฒกๆœ‰ๆŠŠไบ‹ๆƒ…ไธ€ๆฌกๆ€งๅŠžๅฅฝ ๅทฒ็ปไธฅ้‡ๅฝฑๅ“ๆˆ‘๏ผŒๅˆทๆŠ–้Ÿณไบ† ่€Œไธ”Codex ๆœ‰ๅพˆไพฟๅฎœ็š„ๆญฃ็‰ˆๆ–นๆกˆ๏ผŒไฝ† Claude code ๆฒกๆœ‰ใ€‚

โค๏ธ131
likes
๐Ÿ”3
retweets
๐Ÿ”dair_ai retweeted
O
elvis
@omarsar0
๐Ÿ“…
Mar 16, 2026
2h ago
๐Ÿ†”09077648
โญ0.38

Banger report from the Kimi team: Attention Residuals.

Residual connections made deep Transformers trainable, but they also force uncontrolled hidden-state growth with depth. This work proposes a cleaner alternative. It introduces Attention Residuals, which replace fixed residual accumulation with softmax attention over previous layer outputs. Instead of blindly summing everything, each layer selectively retrieves the earlier representations it actually needs. To keep this practical at scale, they add a blockwise version that compresses layers into block summaries, recovering most of the gains with minimal systems overhead.

Why does it matter? Residual paths have barely changed across modern LLMs, even though they govern how information moves through depth. This paper shows that making the mixing content-dependent improves scaling laws, matches a baseline trained with 1.25x more compute, and boosts GPQA-Diamond by +7.5 and HumanEval by +3.1, while keeping inference overhead under 2%.

Paper: https://t.co/04IG6FDiVr
Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

โค๏ธ16
likes
๐Ÿ”4
retweets
dair_ai
@dair_ai
📅 Mar 16, 2026 · 1h ago
🆔76916735

GitHub already has millions of repos full of procedural knowledge. This work introduces a framework for extracting agent skills directly from open-source repos.

The pipeline analyzes repo structure, identifies procedural knowledge through dense retrieval, and translates it into the standardized SKILL.md format, with a progressive-disclosure architecture so agents can discover thousands of skills without context-window degradation.

Manually authoring agent skills doesn't scale. Automated extraction achieved 40% gains in knowledge-transfer efficiency while matching human-crafted quality. Still early on this, and more work is needed for self-discovered and self-improving skills to work well at scale. As the agent skill ecosystem grows, mining existing repos could unlock scalable capability acquisition without having to retrain models.

Paper: https://t.co/MAt8Goetcr
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
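As a toy illustration of the output side of such a pipeline, the sketch below writes a minimal SKILL.md stub: short front matter (name, description) that an agent can scan cheaply, with the full procedure loaded only on demand, in the spirit of the progressive-disclosure idea described above. The exact fields and layout here are assumptions, not the paper's format; `write_skill_stub` is a hypothetical helper.

```python
from pathlib import Path

def write_skill_stub(root, name, description, steps):
    """Emit a minimal SKILL.md stub for a procedure mined from a repo.
    Front matter carries the cheap-to-scan metadata; the numbered steps
    form the body an agent loads only when the skill is selected."""
    lines = ["---", f"name: {name}", f"description: {description}", "---", ""]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, 1)]
    path = Path(root) / "skills" / name / "SKILL.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

In a real extraction pipeline, `steps` would come from the retrieval stage rather than being hand-supplied.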

🖼️ Media
RunjiaLi
@RunjiaLi
📅 Mar 16, 2026 · 2h ago
🆔07624001

🎉 EgoEdit @Snapchat has been accepted to CVPR 2026! 🏆👻 We are bringing high-quality, real-time editing to egocentric videos. Our massive 100k-video dataset and benchmark are ALREADY PUBLIC! 🔓🚀
🏠 Project Page: https://t.co/cEUZRxdLDf
🤗 Dataset: https://t.co/qCFRTY8cYG
https://t.co/VuXQg2UfqC

🖼️ Media
๐Ÿ”_akhaliq retweeted
R
Runjia Li
@RunjiaLi
๐Ÿ“…
Mar 16, 2026
2h ago
๐Ÿ†”07624001
โญ0.32

๐ŸŽ‰EgoEdit @Snapchat has been accepted to CVPR 2026! ๐Ÿ†๐Ÿ‘ป We are bringing high-quality, real-time editing to egocentric videos. Our massive 100k video dataset and benchmark are ALREADY PUBLIC! ๐Ÿ”“๐Ÿš€ ๐Ÿ  Project Page: https://t.co/cEUZRxdLDf ๐Ÿค— Dataset: https://t.co/qCFRTY8cYG https://t.co/VuXQg2UfqC

โค๏ธ4
likes
๐Ÿ”1
retweets