Your curated collection of saved posts and media

Showing 32 posts · last 7 days · newest first
πŸ”omarsar0 retweeted
elvis @omarsar0
πŸ“… Mar 16, 2026 · 14h ago · πŸ†”09077648 · ⭐0.38

Banger report from the Kimi team: Attention Residuals. Residual connections made deep Transformers trainable, but they also force uncontrolled hidden-state growth with depth. This work proposes a cleaner alternative: Attention Residuals, which replace fixed residual accumulation with softmax attention over previous layer outputs. Instead of blindly summing everything, each layer selectively retrieves the earlier representations it actually needs. To keep this practical at scale, they add a blockwise version that compresses layers into block summaries, recovering most of the gains with minimal systems overhead. Why does it matter? Residual paths have barely changed across modern LLMs, even though they govern how information moves through depth. This paper shows that making the mixing content-dependent improves scaling laws, matches a baseline trained with 1.25x more compute, and boosts GPQA-Diamond by +7.5 and HumanEval by +3.1, while keeping inference overhead under 2%. Paper: https://t.co/04IG6FDiVr Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
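For a single token position, the idea reads roughly like softmax attention over the stack of earlier layer outputs. A toy sketch in plain Python, assuming one learned query per layer and vector-valued layer outputs (the paper's exact parameterization may differ):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attention_residual(query, layer_outputs):
    """Replace the fixed residual sum with softmax attention over all
    previous layer outputs, so each layer retrieves what it needs."""
    d = len(query)
    # scaled dot-product score of the layer's query against each earlier output
    scores = [sum(q * h for q, h in zip(query, h_i)) / math.sqrt(d)
              for h_i in layer_outputs]
    weights = softmax(scores)
    # content-dependent weighted mix instead of a blind weight-1 sum
    mixed = [sum(w * h_i[j] for w, h_i in zip(weights, layer_outputs))
             for j in range(d)]
    return mixed, weights
```

A plain residual corresponds to giving every previous layer a fixed weight of 1; here the weights are input-dependent and sum to 1.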

❀️ 130 likes · πŸ” 17 retweets
πŸ”ivanleomk retweeted
Yoeven @yoeven
πŸ“… Mar 17, 2026 · 3h ago · πŸ†”65291100

The moment he realised that https://t.co/vWmBsnR1nt isn't fully built on transformers and can run on a single GPU with high accuracy and lower cost https://t.co/ZJYuL62UB8

Media 1
❀️ 4 likes · πŸ” 1 retweet
πŸ–ΌοΈ Media
Y
yoeven
@yoeven
πŸ“…
Mar 17, 2026
3h ago
πŸ†”65291100

The moment he realised that https://t.co/vWmBsnR1nt isn't fully built on transformers and we can run on a single GPU with high accuracy and lower cost https://t.co/ZJYuL62UB8

Media 1Media 2
πŸ–ΌοΈ Media
πŸ”_akhaliq retweeted
Haocheng Xi @HaochengXiUCB
πŸ“… Mar 17, 2026 · 3h ago · πŸ†”24284251 · ⭐0.34

Thanks for sharing our newest work @_akhaliq ! Classic algorithms like K-Means deserve to be revisited in the era of massive datasets and GPUs. Flash-KMeans rethinks the algorithm from a systems perspective to make exact K-Means fast and memory-efficient on modern hardware.
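The systems angle, as far as the post describes it, is keeping the assignment step exact while never materializing the full point-to-centroid distance matrix. A minimal NumPy sketch of that generic pattern (the function name and blocking scheme are illustrative, not the paper's GPU kernel):

```python
import numpy as np

def assign_blockwise(X, C, block=1024):
    """Exact nearest-centroid assignment, computed block by block.
    Expands ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is
    constant per row, so the argmin needs only a GEMM plus ||c||^2."""
    c_sq = (C ** 2).sum(axis=1)                 # (k,) squared centroid norms
    labels = np.empty(len(X), dtype=np.int64)
    for s in range(0, len(X), block):
        xb = X[s:s + block]                     # (b, d) block of points
        d2 = c_sq[None, :] - 2.0 * xb @ C.T     # (b, k) distances minus ||x||^2
        labels[s:s + block] = d2.argmin(axis=1)
    return labels
```

Only one `block × k` slab of distances exists at a time, which is the memory-efficiency trade the tweet alludes to.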

@_akhaliq • Thu Mar 12 16:44

Flash-KMeans Fast and Memory-Efficient Exact K-Means paper: https://t.co/Yy7V7L12Bn https://t.co/c1mGipQl3f

❀️ 9 likes · πŸ” 1 retweet
HamelHusain @HamelHusain
πŸ“… Mar 17, 2026 · 3h ago · πŸ†”89510900 · ⭐0.32

Claude Code CLI > Codex CLI. Codex Desktop > Claude Code Desktop. It's a jagged UX frontier.


code @code
πŸ“… Mar 17, 2026 · 3h ago · πŸ†”94910880

🌐 Agentic Browser Tools (Experimental) in @code! Agents can now open pages, read content, click elements, and verify changes directly in the integrated browser while building your web app. Enable βš™οΈ workbench.browser.enableChatTools to try it out. Learn more: https://t.co/kNwugFcbIA
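Per the post this lives behind a settings flag; a sketch of the `settings.json` entry, assuming the setting ID quoted in the post is exact (it is experimental and may change):

```jsonc
// User or workspace settings.json (VS Code settings are JSONC, so comments are allowed)
{
  "workbench.browser.enableChatTools": true
}
```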

Media 2
πŸ–ΌοΈ Media
πŸ”Scobleizer retweeted
Jake Steinerman πŸ”œ GDC & GTC @jasteinerman
πŸ“… Mar 16, 2026 · 6h ago · πŸ†”75976987 · ⭐0.34

Love this submission from our world models hackathon this weekend - a generative FPS!

@AnshulDhawan001 • Mon Mar 16 21:08

Spent the weekend hacking at the Worlds in Action hackathon at @fdotinc by @SensAIHackademy. It was so much fun playing with the world models by @theworldlabs . I believe generative games are the future where characters, rules and even parts of the world can be generated and ad

❀️ 5 likes · πŸ” 1 retweet
PyTorch @PyTorch
πŸ“… Mar 16, 2026 · 4h ago · πŸ†”18333110

#ExecuTorch addresses fragmented native deployment for #AI agents as a #PyTorch native platform. It enables voice models across CPU, GPU, and NPU on Android, iOS, Linux, macOS & Windows πŸ”— https://t.co/NeQQyUniL4 https://t.co/O3itnoQFoG

Media 1
πŸ–ΌοΈ Media
πŸ”jxnlco retweeted
edwin @edwinarbus
πŸ“… Mar 16, 2026 · 8h ago · πŸ†”50334333 · ⭐0.34

Matt Maher tested frontier models in Cursor vs. other harnesses. Cursor boosted model performance by 11% on average: Gemini 52% → 57%, GPT-5.4 82% → 88%, Opus 77% → 93%. His benchmark measures how well models implement a 100-feature PRD. @cursor_ai consistently outperformed. https://t.co/hrjCmWMNKN

❀️ 176 likes · πŸ” 17 retweets
_akhaliq @_akhaliq
πŸ“… Mar 16, 2026 · 6h ago · πŸ†”76800000

Mistral Small 4 is out https://t.co/IdAowSpHpN

Media 1
πŸ–ΌοΈ Media

πŸ”jxnlco retweeted
OpenAI Developers @OpenAIDevs
πŸ“… Mar 16, 2026 · 8h ago · πŸ†”48174967 · ⭐0.34

Subagents are now available in Codex. You can accelerate your workflow by spinning up specialized agents to: • Keep your main context window clean • Tackle different parts of a task in parallel • Steer individual agents as work unfolds https://t.co/QJC2ZYtYcA

❀️ 798 likes · πŸ” 74 retweets
πŸ”jeremyphoward retweeted
raia hadsell @RaiaHadsell
πŸ“… Mar 16, 2026 · 10h ago · πŸ†”56989392 · ⭐0.36

It's been about 20 years since I first started working on embeddings with Yann LeCun (siamese networks!), and I've been fascinated ever since. Gemini Embeddings 2 approaches the platonic ideal: native embedding of text, image, video, audio, and docs to a single space.

@GoogleAIStudio • Tue Mar 10 17:25

https://t.co/mIXzM657cR

❀️ 277 likes · πŸ” 24 retweets
HuggingPapers @HuggingPapers
πŸ“… Mar 16, 2026 · 7h ago · πŸ†”83694046

OmniForcing unlocks real-time joint audio-visual generation. It achieves ~25 FPS with 0.7 s latency (a 35× speedup over offline diffusion models) by distilling bidirectional LTX-2 into a causal streaming generator while maintaining multi-modal fidelity. https://t.co/UGYGMyTQOs

Media 1
πŸ–ΌοΈ Media
PyTorch @PyTorch
πŸ“… Mar 16, 2026 · 7h ago · πŸ†”07617111 · ⭐0.38

@Nvidiadev πŸ—“οΈ MONDAY @ Booth #338 2PM: Shaping the Future w/ @matthew_d_white 3PM: TensorRT + PyTorch w/ Angela Yi & @narendasan 4PM: DeepSpeed Trillion-Param Training w/ @PKUWZP 5PM: PyTorch Export w/ Angela Yi 6PM: Ray Distributed Computing w/ @robertnishihara #AI #GTC2025


πŸ”ai_fast_track retweeted
David Hendrickson @TeksEdge
πŸ“… Mar 14, 2026 · 2d ago · πŸ†”30554364 · ⭐0.34

🚨 Want to parse complex PDFs with SOTA accuracy, 100% locally? πŸ“„πŸ” At just 0.9B parameters, you can drop GLM-OCR straight into LM Studio and run it on almost any machine! πŸ₯” 🧠 0.9B total parameters πŸ’Ύ Runs on < 1.5GB VRAM (or ~1GB quantized!) πŸ’Έ Zero API costs πŸ”’ Total data privacy Desktop document AI is officially here. πŸ’»βš‘
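LM Studio exposes an OpenAI-compatible server locally (port 1234 by default), so a local OCR call is just a chat request with an image attached. A sketch of building that payload; the model identifier `glm-ocr` is a placeholder, not a confirmed name — use whatever identifier LM Studio shows for the downloaded model:

```python
import json

def ocr_payload(image_b64, model="glm-ocr"):
    """Build an OpenAI-style chat payload for a local LM Studio server.
    The model name is a placeholder, not confirmed by the post."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Extract the text from this document page."},
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64," + image_b64}},
            ],
        }],
    }
```

POST the JSON-encoded payload to `http://localhost:1234/v1/chat/completions`.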

❀️ 2,365 likes · πŸ” 218 retweets
πŸ”ai_fast_track retweeted
Adina Yakup @AdinaYakup
πŸ“… Mar 16, 2026 · 14h ago · πŸ†”41999406

Covo Audio πŸ”Š An end-to-end audio language model from @TencentAI_News https://t.co/tic5cH1A39 ✨ 7B ✨ Audio → Audio in one model ✨ Multi-speaker + voice transfer ✨ Real-time full duplex conversations https://t.co/hFrsxQgzkT

Media 1
❀️ 77 likes · πŸ” 11 retweets
alex_peys @alex_peys
πŸ“… Mar 16, 2026 · 9h ago · πŸ†”51888850 · ⭐0.40

this was one of the things i co-led at fair. fb then had ~2b users, and embeddings of ~128d made it a 300b-1T parameter model depending on how you count entities (e.g. ad campaigns). at the time, this was big; now it's medium. we trained it purely on distributed cpus

@ylecun • Mon Mar 16 18:09

@RaiaHadsell Universal embeddings FTW 😊 One of the flagship projects at FAIR was to "embed the world" (i.e. represent every entity on Facebook). The name was soon changed to "Filament", deployed internally, and eventually open-sourced as "PyTorch-BigGraph" The techniques were m
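The parameter count in the post above is simple arithmetic worth making explicit (figures are the post's rough numbers, not exact counts):

```python
# Back-of-envelope check of the embedding-table size claimed in the post.
users = 2_000_000_000   # ~2B Facebook users at the time
dim = 128               # ~128-d embedding per entity
params = users * dim    # 256B parameters from users alone (~0.26T)
# Counting more entity types (pages, ad campaigns, ...) pushes this toward 1T.
print(params)
```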

πŸ”ylecun retweeted
alphaXiv @askalphaxiv
πŸ“… Mar 16, 2026 · 1d ago · πŸ†”49397718 · ⭐0.36

Yann LeCun is pumping out papers recently. "Temporal Straightening for Latent Planning" shows that by straightening latent trajectories in a world model, Euclidean distance starts to reflect true reachable progress, so it's closer to geodesic/minimum-step distance. This makes gradient-based planning far more stable and effective without relying as heavily on expensive search.

❀️ 702 likes · πŸ” 115 retweets
πŸ”github retweeted
0xMarioNawfal @RoundtableSpace
πŸ“… Mar 12, 2026 · 4d ago · πŸ†”85178066 · ⭐0.32

Microsoft has released a free, open-source course: GitHub Copilot CLI for Beginners. It includes 8 chapters covering: • installing Copilot CLI • using context • creating custom agents • working with skills • connecting MCP servers, and more. Start learning: https://t.co/IIbauw5L7K

❀️ 714 likes · πŸ” 114 retweets
SpirosMargaris @SpirosMargaris
πŸ“… Mar 16, 2026 · 10h ago · πŸ†”49671064 · ⭐0.44

Nvidia ruled the first wave of AI by powering the training of large models. But the next phase may look different: with AI running at scale, inference is now growing much faster than training, and that's where real-world deployment happens. If the center of gravity in AI shifts there, the question becomes: will Nvidia stay as dominant in the next chapter? https://t.co/MdG0zqBUWj @RWhelanWSJ @WSJ


LiorOnAI @LiorOnAI
πŸ“… Mar 16, 2026 · 10h ago · πŸ†”24702434 · ⭐0.42

Every foundation model you've ever used has the same bug. It just got fixed.

Since 2015, every deep network has been built the same way: each layer does some computation, adds its result to a running total, and passes it forward. Simple. But there's a problem: by layer 100, the signal from any single layer is buried under the sum of everything else. Each new layer matters less and less. Nobody fixed this because it worked well enough.

Moonshot AI just changed that. Their new method, Attention Residuals, lets each layer look back at all previous layers and choose which ones actually matter right now. Instead of a blind running total, you get selective retrieval. The analogy: imagine writing an essay where every draft gets merged into one document automatically. By draft 50, your latest edits are invisible. AttnRes lets you keep every draft separate and pull from whichever ones you need.

What this fixes: 1. Deeper layers no longer get drowned out 2. Training becomes more stable across the whole network 3. The model uses its own depth more efficiently

To make it practical at scale, they group layers into blocks and attend over block summaries instead of every single layer. Overhead at inference: less than 2%. The result: 25% less compute to reach the same performance. Tested on a 48B-parameter model. Holds across sizes.

Residual connections have been invisible plumbing for a decade. Now they're becoming dynamic. The next generation of models won't just pass through their own layers, they'll search them.

@Kimi_Moonshot • Mon Mar 16 03:03

Introducing Attention Residuals: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dep

πŸ”random_walker retweeted
AI Security Institute @AISecurityInst
πŸ“… Mar 16, 2026 · 12h ago · πŸ†”34953156 · ⭐0.36

Can AI agents conduct advanced cyber-attacks autonomously? We tested seven models released between August 2024 and February 2026 on two custom-built cyber ranges designed to replicate complex attack environments. Here's what we found 🧡 https://t.co/rFRkOQu8yU

❀️ 44 likes · πŸ” 10 retweets
AndrewYNg @AndrewYNg
πŸ“… Mar 16, 2026 · 11h ago · πŸ†”00354812

Should there be a Stack Overflow for AI coding agents to share learnings with each other?

Last week I announced Context Hub (chub), an open CLI tool that gives coding agents up-to-date API documentation. Since then, our GitHub repo has gained over 6K stars, and we've scaled from under 100 to over 1,000 API documents, thanks to community contributions and a new agentic document writer. Thank you to everyone supporting Context Hub!

OpenClaw and Moltbook showed that agents can use social media built for them to share information. In our new chub release, agents can share feedback on documentation: what worked, what didn't, what's missing. This feedback helps refine the docs for everyone, with safeguards for privacy and security. We're still early in building this out. You can find details and configuration options in the GitHub repo.

Install chub as follows, and prompt your coding agent to use it: npm install -g @aisuite/chub GitHub: https://t.co/OCkyxXQMCq

Media 1
πŸ–ΌοΈ Media
πŸ”s_batzoglou retweeted
Avi Chawla @_avichawla
πŸ“… Mar 16, 2026 · 18h ago · πŸ†”36914495 · ⭐0.34

Big release from Kimi! They just released a new way to handle residual connections in Transformers.

In a standard Transformer, every sub-layer (attention or MLP) computes an output and adds it back to the input via a residual connection. Across 40+ layers, the hidden state at any layer is just the equal-weighted sum of all previous layer outputs. Every layer contributes with weight = 1, so every layer gets equal importance. This creates a problem called PreNorm dilution: as the hidden state accumulates layer after layer, its magnitude grows linearly with depth, and any new layer's contribution gets progressively buried in the already-massive residual. Deeper layers are then forced to produce increasingly large outputs just to have any influence, which destabilizes training.

Here's what the Kimi team observed and did: RNNs compress all prior token information into a single state across time, leading to problems with long-range dependencies. Residual connections likewise compress all prior layer information into a single state across depth. Transformers solved the first problem by replacing recurrence with attention along the sequence dimension. Attention Residuals applies the same idea to depth: instead of adding all previous layer outputs with a fixed weight of 1, each layer uses softmax attention to selectively decide how much weight each previous layer's output should receive. Each layer gets a single learned query vector and attends over all previous layer outputs to compute a weighted combination. The weights are input-dependent, so different tokens can retrieve different layer representations based on what's actually useful. This is Full Attention Residuals (shown in the second diagram below).

But here's the practical problem with this idea: Full AttnRes requires keeping all layer outputs in memory and communicating them across pipeline stages during distributed training. To solve this, they introduce Block Attention Residuals (shown in the third diagram below). The idea is to group consecutive layers into roughly 8 blocks. Within each block, layer outputs are summed via standard residuals; across blocks, the attention mechanism selectively combines block-level representations. This drops memory from O(Ld) to O(Nd), where N is the number of blocks. Layers within the current block can also attend to the partial sum of what's been computed so far inside that block, so local information flow isn't lost. And the raw token embedding is always available as a separate source, which means any layer in the network can selectively reach back to the original input.

Results from the paper:
- Block AttnRes matches the loss of a baseline LLM trained with 1.25x more compute.
- Inference latency overhead is less than 2%, making it a practical drop-in replacement.
- On a 48B-parameter Kimi Linear model (3B activated) trained on 1.4T tokens, it improved every benchmark they tested: GPQA-Diamond +7.5, Math +3.6, HumanEval +3.1, MMLU +1.1.

The residual connection has been mostly unchanged since ResNet in 2015. This might be the first modification that's both theoretically motivated and practically deployable at scale with negligible overhead. More details in the post below by Kimi πŸ‘‡
____
Find me → @_avichawla
Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
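The blockwise variant can be sketched the same way: sum within blocks, attend across the block summaries. A toy single-token version in plain Python (illustrative only; the real model also lets layers attend to the partial sum inside the current block and to the raw token embedding):

```python
import math

def _softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def block_attn_residual(query, layer_outputs, num_blocks):
    """Sum layer outputs inside each block (standard residual), then
    attend over the N block summaries instead of all L layer outputs,
    dropping memory from O(L*d) to O(N*d)."""
    d = len(query)
    size = math.ceil(len(layer_outputs) / num_blocks)
    # standard residual accumulation within each block -> N summaries
    blocks = [[sum(col) for col in zip(*layer_outputs[s:s + size])]
              for s in range(0, len(layer_outputs), size)]
    # softmax attention over block summaries only
    scores = [sum(q * b for q, b in zip(query, blk)) / math.sqrt(d)
              for blk in blocks]
    w = _softmax(scores)
    return [sum(wi * blk[j] for wi, blk in zip(w, blocks))
            for j in range(d)]
```

With a zero query the weights are uniform over blocks, which recovers a scaled version of the plain residual sum; a learned query makes the mix content-dependent.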

❀️ 1,202 likes · πŸ” 107 retweets
omarsar0 @omarsar0
πŸ“… Mar 16, 2026 · 13h ago · πŸ†”24629311 · ⭐0.32

Great paper on automating agent skill acquisition.

@dair_ai • Mon Mar 16 14:12

GitHub already has millions of repos full of procedural knowledge. The work introduces a framework for extracting agent skills directly from open-source repos. The pipeline analyzes repo structure, identifies procedural knowledge through dense retrieval, and translates it into

dair_ai @dair_ai
πŸ“… Mar 16, 2026 · 13h ago · πŸ†”76916735

GitHub already has millions of repos full of procedural knowledge. This work introduces a framework for extracting agent skills directly from open-source repos. The pipeline analyzes repo structure, identifies procedural knowledge through dense retrieval, and translates it into a standardized SKILL.md format with a progressive disclosure architecture, so agents can discover thousands of skills without context window degradation. Manually authoring agent skills doesn't scale; automated extraction achieved 40% gains in knowledge-transfer efficiency while matching human-crafted quality. It's still early, and more work is needed for self-discovered and self-improving skills to work well at scale. As the agent skill ecosystem grows, mining existing repos could unlock scalable capability acquisition without having to retrain models. Paper: https://t.co/MAt8Goetcr Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
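For context, a SKILL.md file is roughly markdown with frontmatter metadata plus step-by-step instructions. An illustrative sketch of the shape such an extracted skill might take (the file name comes from the post; the field names and body here are assumptions, not taken from the paper):

```markdown
---
name: release-checklist
description: Cut and publish a release using this repo's tooling.
---

# Cutting a release

1. Bump the version in the project manifest.
2. Run the test suite and regenerate the changelog.
3. Tag, push, and let CI publish the artifact.

<!-- Progressive disclosure: an agent indexes only the name/description
     frontmatter and loads this body only when the skill is selected. -->
```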

Media 1 · Media 2