Your curated collection of saved posts and media
New research from Yann LeCun and collaborators at NYU. It's a really good read for anyone working on efficient Transformer inference. The paper dissects two recurring phenomena in Transformer language models: massive activations (where a few tokens exhibit extreme outlier values) and attention sinks (where certain tokens attract disproportionate attention regardless of semantic relevance). They show that the co-occurrence is largely an architectural artifact of pre-norm design, not a fundamental property. Massive activations function as implicit model parameters; attention sinks modulate outputs locally. Why does it matter? These phenomena directly impact quantization, pruning, and KV-cache management. Understanding their root cause could enable better engineering decisions for efficient inference at scale. Paper: https://t.co/wfzeDpfu4x
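As a rough illustration of what "massive activations" look like in practice, here is a minimal detection sketch. The threshold heuristic (flag entries far above the median magnitude) and the NumPy setup are my own assumptions for illustration, not taken from the paper:

```python
import numpy as np

def find_massive_activations(hidden, ratio=100.0):
    """Return token indices that carry outlier ("massive") activations.

    hidden: (seq_len, d_model) array of one layer's activations.
    An entry counts as massive if its magnitude exceeds `ratio` times
    the median magnitude across the whole layer output.
    """
    mags = np.abs(hidden)
    median = np.median(mags)
    mask = mags > ratio * median
    # nonzero()[0] gives the row (token) index of each flagged entry
    return sorted(set(np.nonzero(mask)[0].tolist()))
```

In real models only a handful of tokens (often the first token or delimiter tokens) end up flagged, which is exactly the co-occurrence with attention sinks the paper studies.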
Introducing the Synthetic Data Playbook: We generated over 1T tokens across 90 experiments with 100k+ GPU-hours to figure out what makes good synthetic data and how to generate it at scale https://t.co/iaHuodWVAa https://t.co/48gBUYE6R2
While waiting for DeepSeek V4, we got two very strong open-weight LLMs from India yesterday. There are two size flavors, Sarvam 30B and Sarvam 105B (both reasoning models). Interestingly, the smaller 30B model uses "classic" Grouped Query Attention (GQA), whereas the larger 105B variant switched to DeepSeek-style Multi-Head Latent Attention (MLA). As I wrote in my analyses before, both are popular attention variants for reducing KV cache size (the longer the context, the more you save compared to regular attention). MLA is more complicated to implement, but it can give you better modeling performance if we go by the ablation studies in the 2024 DeepSeek V2 paper (as far as I know, this is still the most recent apples-to-apples comparison).

Speaking of modeling performance, the 105B model is on par with LLMs of similar size: gpt-oss 120B and Qwen3-Next (80B). Sarvam is better on some tasks and worse on others, but roughly the same on average. It's not the strongest coder in SWE-Bench Verified terms, but it is surprisingly good at agentic reasoning and task completion (Tau2); it's even better than DeepSeek R1 0528.

For the smaller Sarvam 30B, perhaps the most comparable model is Nemotron 3 Nano 30B, which is slightly ahead in coding per SWE-Bench Verified and agentic reasoning (Tau2) but slightly worse in some other respects (LiveCodeBench v6, BrowseComp). Unfortunately, Qwen3-30B-A3B is missing from the benchmarks, which is, as far as I know, the most popular model of that size class. Interestingly, though, the Sarvam team compared their 30B model to Qwen3-30B-A3B in a computational performance analysis, where they found that Sarvam gets 20-40% more tokens/sec throughput than Qwen3 due to code and kernel optimizations.

Anyway, one thing not captured by the benchmarks above is Sarvam's good performance on Indian languages.
According to a judge model, the Sarvam team found that their model is preferred 90% of the time over others when it comes to Indian texts. (Since they built and trained the tokenizer from scratch as well, Sarvam also comes with 4x higher token efficiency on Indian languages.)
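To see why both GQA and MLA shrink the KV cache, here is a back-of-the-envelope sketch. All numbers are illustrative assumptions, not Sarvam's actual configs, and the MLA formula is simplified: it ignores the extra decoupled RoPE key that DeepSeek-style MLA also caches per token.

```python
def kv_cache_gqa(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # GQA caches K and V for each KV head:
    # 2 (K and V) * n_kv_heads * head_dim elements per token per layer.
    return seq_len * n_layers * 2 * n_kv_heads * head_dim * bytes_per

def kv_cache_mla(seq_len, n_layers, latent_dim, bytes_per=2):
    # MLA caches a single compressed latent vector per token per layer,
    # from which K and V are reconstructed at attention time.
    return seq_len * n_layers * latent_dim * bytes_per
```

With, say, 8 KV heads of dim 128 versus a 512-dim latent, MLA needs a quarter of the cache per token, and the absolute savings grow linearly with context length.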
vLLM v0.17.0 is here! 699 commits from 272 contributors (48 new!). This is a big one. Highlights:
* FlashAttention 4 integration
* Qwen3.5 model family with GDN (Gated Delta Networks)
* Model Runner V2 maturation: Pipeline Parallel, Decode Context Parallel, Eagle3 + CUDA graphs
* New --performance-mode flag: balanced / interactivity / throughput
* Weight Offloading V2 with prefetching
* Elastic Expert Parallelism Milestone 2
* Quantized LoRA adapters (QLoRA) now loadable directly
Excited to announce our new LLM inference algorithm, speculative speculative decoding (SSD)! It is fast: up to 2x faster than state-of-the-art inference engines (vLLM, SGLang). Working on this with @tanishqkumar07 and @tri_dao was a blast. Details in thread:
I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.
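Neither post explains what makes SSD "speculative speculative," but the base technique it builds on, standard speculative decoding, accepts each drafted token with probability min(1, p_target/p_draft). A minimal sketch of that acceptance loop, with toy token IDs and distributions as assumptions:

```python
import numpy as np

def speculative_accept(draft_tokens, target_probs, draft_probs, rng):
    """Standard speculative-decoding acceptance loop.

    draft_tokens: token IDs proposed by the cheap draft model.
    target_probs / draft_probs: per-position next-token distributions
    from the target and draft models, respectively.
    Keep each token with probability min(1, p_target / p_draft),
    stopping at the first rejection.
    """
    accepted = []
    for tok, p, q in zip(draft_tokens, target_probs, draft_probs):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            break
    return accepted
```

The speedup comes from verifying all drafted tokens in a single target-model forward pass, so every accepted token is a decode step saved.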
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast, exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attn reaches ~1600 TFLOPs, pretty much at matmul speed! joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao) 1/
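For context on "softmax no longer dictating speed": the FlashAttention family is built on the online-softmax recurrence, which keeps a running max and normalizer so a full score row never has to be materialized at once. A one-row sketch of that recurrence (FA4's exp2-based rescheduling and kernel pipelining are not reproduced here):

```python
import numpy as np

def online_logsumexp(scores):
    """Single-pass log-sum-exp over a stream of attention scores.

    Maintains a running max m and normalizer l; whenever a new max
    appears, the old normalizer is rescaled by exp(m_old - m_new).
    This is the numerically stable streaming trick FlashAttention uses
    to tile the softmax over blocks of keys.
    """
    m, l = -np.inf, 0.0
    for s in scores:
        m_new = max(m, s)
        l = l * np.exp(m - m_new) + np.exp(s - m_new)
        m = m_new
    return m + np.log(l)
</antml_ignored>```

The exp evaluations in this rescaling are exactly the transcendental work that, per the post, has now become a bottleneck relative to Blackwell's matmul throughput.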
FA4 now available in lm-engine: https://t.co/n47TEinAfG 13.4% end-to-end speedup for Llama 8B training on 4x GB200s (1 node). 1005.55 TFLOPs for SDPA vs 1140.73 for FA4 (BF16 precision). @tedzadouri @ultraproduct @__tensorcore__ @tri_dao cooked. Thanks to @bharatrunwal2 for running the experiment!
You can now fine-tune Qwen3.5 with our free notebook! You just need 5GB VRAM to train Qwen3.5-2B LoRA locally! Unsloth trains Qwen3.5 1.5x faster with 50% less VRAM. GitHub: https://t.co/2kXqhhvLsb Guide: https://t.co/JCPGIRo99s Qwen3.5-4B Colab: https://t.co/2Aj1mZ3f5j
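As a reminder of why LoRA fits in so little VRAM: only two small low-rank matrices receive gradients while the pretrained weight stays frozen, and B is zero-initialized so the model starts out unchanged. A minimal NumPy sketch (rank, scaling, and shapes here are illustrative assumptions, not Unsloth's implementation):

```python
import numpy as np

class LoRALinear:
    """y = x @ W + (alpha/r) * x @ A @ B, with W frozen.

    Only A (d_in x r) and B (r x d_out) would be trained, so optimizer
    state scales with r instead of with the full weight matrix.
    """

    def __init__(self, w, r=8, alpha=16, rng=None):
        rng = rng or np.random.default_rng(0)
        d_in, d_out = w.shape
        self.w = w                                # frozen pretrained weight
        self.a = rng.normal(0.0, 0.01, (d_in, r)) # small random init
        self.b = np.zeros((r, d_out))             # zero init: no change at step 0
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.w + self.scale * (x @ self.a @ self.b)
```

After training, the product scale * A @ B can be merged back into W, so inference pays no extra cost.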

> 385ms average tool selection.
> 67 tools across 13 MCP servers.
> 14.5GB memory footprint.
> Zero network calls.
LocalCowork is an AI agent that runs on a MacBook. Open source. Thread: https://t.co/bnXupspSXc
We just open-sourced Mission Control, our dashboard for AI agent orchestration. 26 panels. Real-time WebSocket + SSE. SQLite, so no external services needed. Kanban board, cost tracking, role-based access, quality gates, and multi-gateway support. One pnpm start, and you're running. https://t.co/GVybvosdmI
Rust implementation for Text-to-Speech (TTS) based on open-source Qwen3 models
* Self-contained binary build, no external dependencies
* Uses libtorch on Linux with optional Nvidia GPU support
* Uses MLX on macOS with Apple GPU/NPU support
* Supports voice cloning from 3-second clips
* Supports voice instructions (emotion, style)
CLI for AI agents and humans: https://t.co/1LKRapngVk
OpenAI-compatible API server: https://t.co/qjDqCf9hor
OpenClaw skill: https://t.co/XgLC55Vjsp

We identified an issue with the Mamba-2 initialization in the HuggingFace and FlashLinearAttention repositories (dt_bias being incorrectly initialized). The bug comes down to 2 main issues: 1. The init being incorrect (torch.ones) if Mamba-2 layers are used in isolation without the Mamba2ForCausalLM model class (this has already been fixed: https://t.co/oahfxjIsKb). 2. Initialization being skipped due to meta-device init for DTensors with FSDP-2 (https://t.co/hLC8nnQFc3 will fix this issue upon merging). The difference is substantial: Mamba-2 seems to be quite sensitive to initialization. Check out our experiments at the 7B MoE scale: https://t.co/n8iuUICRux Special thanks to @kevinyli_, @bharatrunwal2, @HanGuo97, @tri_dao and @_albertgu. Also thanks to @SonglinYang4 for quickly helping merge the PR.

This was a wild bug hunt, weeks of effort from @MayankMish98 to track down. The wrong init of Mamba-2 in many reimplementations causes the layer to decay its states too quickly, focusing on short context instead. Pretraining is mostly about getting these little things right.
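For reference, the Mamba-style dt_bias initialization (as in the original Mamba code) samples the timestep dt log-uniformly and stores its softplus-inverse, rather than using torch.ones. A NumPy sketch; the dt_min/dt_max defaults below are the common choices, used here as illustrative assumptions:

```python
import numpy as np

def init_dt_bias(n_heads, dt_min=1e-3, dt_max=0.1, rng=None):
    """Sample dt log-uniformly in [dt_min, dt_max] and return its
    softplus-inverse, so that softplus(dt_bias) recovers dt on the
    first forward pass. A flat torch.ones init instead yields a large
    dt everywhere, making the SSM state decay far too quickly.
    """
    rng = rng or np.random.default_rng(0)
    dt = np.exp(rng.uniform(np.log(dt_min), np.log(dt_max), n_heads))
    # inverse softplus: log(exp(dt) - 1), written stably
    return dt + np.log(-np.expm1(-dt))
```

Because dt controls how fast the recurrent state forgets, a per-head log-uniform spread gives the model a mix of short- and long-memory heads from step one.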
We release SERA, the first model in Ai2's Open Coding Agent series. SERA is a SoTA agent for its size, super simple, and 26x more efficient than RL. In my blog post, I write about my personal journey of building this coding agent: https://t.co/kPZHUGwBBC Details below.
Our method became so efficient (26x vs RL; 57x vs other synthetic generation) that we could easily generate thousands of trajectories for a single repo. This makes the coding agents very powerful, as they soak up the nuances of a particular codebase. Very cheap to specialize to any codebase.
Today, we release SERA-32B, an approach to coding agents that matches Devstral 2 at just $9,000. It is fully open-source and you can train your own model easily, at 26x the efficiency of using RL. Paper: https://t.co/aeD6T2WW3O Here's how:
Don't use weight decay. Just normalize everything you see (updates + parameters). Works on top of your favorite optimizer (e.g., Muon). Result: 33% speedup + better hyperparameter transfer.
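The post doesn't spell out the recipe, so here is one plausible reading as a hedged sketch: normalize the update to unit norm before applying it, then rescale the parameter matrix back to its pre-step norm. The learning rate and the per-matrix normalization granularity are my assumptions, not stated by the author:

```python
import numpy as np

def normalized_sgd_step(w, grad, lr=0.02):
    """One step of "normalize everything": unit-norm update, then
    renormalize the parameter to its previous norm, so the weight norm
    stays fixed and weight decay becomes unnecessary.
    """
    target = np.linalg.norm(w)
    u = grad / (np.linalg.norm(grad) + 1e-12)        # unit-norm update
    w = w - lr * u
    return w * (target / (np.linalg.norm(w) + 1e-12))  # restore weight norm
```

Pinning the weight norm is what makes hyperparameters transfer: the effective step size lr / ||w|| no longer drifts as training progresses.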
We're open-sourcing CoderForge-Preview: 258K test-verified coding-agent trajectories (155K pass, 103K fail). Fine-tuning Qwen3-32B on the passing subset boosts SWE-bench Verified from 23.0% to 59.4% pass@1, and it ranks #1 among open-data models at 32B parameters or below. Thread on the data generation pipeline below.
SONIC is now open-source! Generalist whole-body teleoperation for EVERYONE! Our team has long been building comprehensive pipelines for whole-body control, kinematic planning, and teleoperation, and they will all be shared. This will be continuously updated; inference code + model are already there, training code and gr00t integration coming soon! Code: https://t.co/7u3SBxzXU9 Docs: https://t.co/HpDLkTCSMF Site: https://t.co/D3i4KlnLLr
Website: https://t.co/xTaDXBu9cD Codebase and weights: https://t.co/QCQkqPIsHI Whitepaper: https://t.co/K2QCFjboDR Check out @zhengyiluo's post: https://t.co/hIHtvKkDQf

Pay attention to this one if you are building terminal-based coding agents. OpenDev is an 81-page paper covering scaffolding, harness design, context engineering, and hard-won lessons from building CLI coding agents. It introduces a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction.

The industry is shifting from IDE plugins to terminal-native agents. Claude Code, Codex CLI, and others have proven the model works. This paper formalizes the design patterns that make these systems reliable, covering topics like event-driven system reminders to counteract instruction fade-out, automated memory across sessions, and strict safety controls for autonomous operation. Paper: https://t.co/tpAZFaSnog Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX