At this point, "agentic engineering" has allowed me to build the best AI harness I could possibly get my hands on. Yes, I vibe coded it. That's right. You don't need to wait around for the features you need for your AI agents. Please don't. You can just build them yourself. Focusing on agentic engineering and building my own orchestrator over the past couple of months has let me build with coding agents in a way unlike anything I have seen or experienced in the market. Claude Cowork was built in 10 days. I totally get it. Anyone can produce that level of output these days. I truly believe that. When I look at the new IDEs, TUIs, orchestrator apps, and most of the new features they are releasing these days, I realize I had access to them in my orchestrator months ago. And for unique features, I am able to reproduce them in a few hours and give them to my orchestrator. That is absolutely crazy! It feels like I am building an entire operating system sometimes. It's a lot of fun. I am not saying this to brag or to dismiss any of the AI solutions out there; there are some great ones. I share this to clarify that this is the kind of leverage Karpathy is alluding to. We are building and experiencing this at different levels, but that doesn't change the fact that you can just build the best AI agent for whatever problem you want to solve. And you should be building it.
1/5 Happy CNY! Still bothered by RL off-policy instability in LLMs? Introducing a new way: Adaptive Layerwise Perturbation (ALP), a simple but robust fix that outperforms GRPO/MIS/Bypass and achieves better stability (KL, entropy) and exploration! Blog: https://t.co/0def1Nb7uI https://t.co/9epsd4xJNp

Tongyi Lab releases Mobile-Agent-v3.5: 20+ SOTA GUI benchmarks: (1) GUI automation: 56.5 OSWorld, 71.6 AndroidWorld, and 48.4 WebArena; (2) grounding: 80.3 ScreenSpotPro; (3) tool calling: 47.6 OSWorld-MCP @_akhaliq #LLM #Agent #GUI https://t.co/xCbyL0JZLl
Introducing Code Review Bench v0: https://t.co/iAZDURyqol The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights in the thread. Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI
The Mixture of Experts (MoE) inside Transformers is out now! This is going to be a long tweet, so if you just want to jump to the blog, the link is in the thread. We already had a great blog post on MoEs (which has more than 1k upvotes at the time of writing). The reason we wanted to build another blog post altogether was just noticing how far we have come in this realm. This blog post is not meant to be another "what are MoEs and how to implement them". Rather, it is about how the transformers team at @huggingface made MoEs a "first-class citizen" of the library and the Hub. The transformers library and the entire ecosystem were built around dense architectures, but with the rapid growth of MoEs, it was inevitable to build around MoEs and not treat them as "just another model addition". In the post we talk about better model loading, the expert backend, expert parallelism, and also our collaboration with @UnslothAI on training MoEs faster! In the process of building the blog post, I also came to understand how beautiful the ideas are, and ended up making my first YouTube video on the routing algorithm alone. I am very proud of this project, and I think it shows in some paragraphs of the blog post. I am also very thankful to all the people who helped me in the project; I am really happy to be on a team that helps me flourish. Glad to be alive. PS: I owe you all an apology for delaying the release. I hope I (and the team) could make it worth the wait.
Mercury 2 is live! The world's first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I'm incredibly proud of what we've built. We're just getting started on what diffusion can do for language.
I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, I can see their individual work, and I can "take over" if needed, i.e. no -p. But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite-data regime but also trains for a lot longer; I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea, but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
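The branch-per-program, worktree-per-agent isolation scheme described above can be sketched as a small planner. The branch and agent names below are hypothetical; the generated commands are standard `git worktree` usage.

```python
def plan_worktrees(program_branch, agents):
    """Plan git commands to give each agent an isolated worktree.

    Each research program is a branch; each agent forks it into a
    feature branch checked out in its own worktree, so agents can
    work in parallel without stepping on each other's files.
    """
    cmds = []
    for name in agents:
        feature = f"{program_branch}-{name}"
        cmds.append(
            f"git worktree add ../wt-{feature} -b {feature} {program_branch}"
        )
    return cmds

# Example: one research program, 4 claude + 4 codex agents
# (names invented for illustration)
agents = [f"claude-{i}" for i in range(4)] + [f"codex-{i}" for i in range(4)]
commands = plan_worktrees("softcap-removal", agents)
print(commands[0])
```

Printing the commands (or piping them to a shell) gives each agent its own checkout; simple files in each worktree can then serve as the comms channel mentioned above.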
@karpathy Yes, and on the software/architecture design side we also see companies more and more preparing for token-heavy workflows. E.g., adding DeepSeek Sparse Attention to MLA a few months back,...
Claude distillation has been a big topic this week while I am (coincidentally) writing Chapter 8 on model distillation. In that context, I shared some utilities to generate distillation data from all sorts of open-weight models via OpenRouter and Ollama: https://t.co/IsfNDpcGAw https://t.co/LKXuGrjO84
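The linked repo is the real thing; in its spirit, here is a minimal, stubbed sketch of collecting teacher responses into JSONL distillation records. The record schema and the stub teacher are assumptions; in practice the callable would wrap an OpenRouter or Ollama client.

```python
import json

def build_distillation_records(prompts, teacher):
    """Collect teacher responses into instruction-tuning records.

    `teacher` is any callable prompt -> response; swap in a real
    open-weight model client for actual distillation data.
    """
    records = []
    for prompt in prompts:
        records.append({
            "instruction": prompt,
            "output": teacher(prompt),
        })
    # One JSON object per line (JSONL), ready to write to disk
    return [json.dumps(r) for r in records]

# Stub teacher for illustration only
stub_teacher = lambda p: f"(teacher answer to: {p})"
lines = build_distillation_records(["Define distillation."], stub_teacher)
print(lines[0])
```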
@latentchiz If you liked "LLMs from scratch", this should be a fun sequel!
@burkeholland and @pierceboggan will be live coding all throughout the stream this morning! here's what they're building: https://t.co/iFH8Sz7Hn9
Did you ever imagine, visually, what your @GitHubCopilot is doing when you kick off a session in @code? Me too! Presenting the @code session visualizer (which I posted about a few weeks ago): see top-level user->agent turns; click the info icon for extended node info (mode, prompt, tool calls, tokens used, and more); expand agent nodes on click for full tool invocations/MCPs/sub-agents; sub-agent nodes can be further expanded; get a high-level summary of the session with detailed model and token utilization per turn. The initial version is available now in the @code marketplace; see the first comment for the link. Tell me what you think!
New in @code Insiders: Integrated agentic browser with workbench.browser.enableChatTools. Here, I ask @code to identify an issue with sliders in hover states, and it makes the fix and validates the solution. https://t.co/TVXkUYpcns
Holy shit... someone just built software that sees you through walls using WiFi. It's called WiFi-DensePose, and it maps your full body pose in real time: no camera, no sensors, no special hardware. Just the router sitting in your living room. Governments and corporations have had this technology for years. Someone just open-sourced it, so now anyone can build it. Your WiFi is already watching. You just didn't know. 100% open source.
Instead of forcing models to hold everything in an active context window, we can use hypernetworks to instantly compile documents and tasks directly into the model's weights. A step towards giving language models durable memory and fast adaptation. Blog: https://t.co/iHoifpsLMu
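To make the idea concrete, here is a toy, pure-Python sketch of a hypernetwork: a linear map that turns a "document embedding" into the weights of a small task layer, so the document's content lives in the weights rather than in a context window. All dimensions and names are invented for illustration; a real hypernetwork would be learned, not random.

```python
import random

random.seed(0)
doc_dim, in_dim, out_dim = 8, 4, 3

# Hypernetwork: a linear map from a document embedding to the
# flattened weights of a small task layer (pure-Python matrices).
H = [[random.gauss(0, 1) for _ in range(doc_dim)]
     for _ in range(out_dim * in_dim)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def compile_document(doc_embedding):
    """'Compile' a document embedding into task-layer weights."""
    flat = matvec(H, doc_embedding)
    return [flat[i * in_dim:(i + 1) * in_dim] for i in range(out_dim)]

# Two different documents compile into two different task layers
doc_a = [random.gauss(0, 1) for _ in range(doc_dim)]
doc_b = [random.gauss(0, 1) for _ in range(doc_dim)]
x = [random.gauss(0, 1) for _ in range(in_dim)]

y_a = matvec(compile_document(doc_a), x)
y_b = matvec(compile_document(doc_b), x)
```

The key property the sketch shows: the same input `x` produces different outputs depending on which document was compiled in, with no document tokens in any "context".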
Qwen3.5 is now updated with improved tool-calling & coding performance! Run Qwen3.5-35B-A3B on 22GB RAM. See improvements via Claude Code, Codex. We also benchmarked GGUFs & removed MXFP4 layers from 3 quants. GGUFs: https://t.co/4lSce5zZbO Analysis: https://t.co/rHZK8JWdYM

Rust implementation for Speech-to-Text based on open-source Qwen3 models
* Self-contained binary build, no external dependencies
* Uses libtorch on Linux with optional Nvidia GPU support
* Uses MLX on macOS with Apple GPU/NPU support
CLI for AI agents and humans: https://t.co/knsZlastgQ
OpenAI-compatible API server: https://t.co/qjDqCf9hor
OpenClaw skill: https://t.co/tE6lzTjYpy
Why and how: https://t.co/VxRt9oSZ8a

New engineering blog: We tasked Opus 4.6 agent teams with building a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: https://t.co/htX0wl4wIf https://t.co/N2e9t5Z6Rm
Just shipped tscribe. Record any audio playing on your computer. Transcribe it locally. Search it later. All from your terminal. OS X, Windows, Linux; open source, cross-platform, no cloud required. https://t.co/nPkokVqo1K
Announcing a new Claude Code feature: Remote Control. It's rolling out now to Max users in research preview. Try it with /remote-control Start local sessions from the terminal, then continue them from your phone. Take a walk, see the sun, walk your dog without losing your flow.
Mercury 2 doesn't just make reasoning models faster. It makes them native. Every reasoning model today is built on autoregressive generation, where the model writes one word at a time, left to right, like typing on a keyboard. Each word waits for the previous one to finish. The problem compounds when reasoning depth increases: multi-step agents, voice systems, and coding assistants all need many sequential passes, and each pass multiplies the delay. The industry has spent billions on chips, compression, and serving infrastructure to squeeze more speed from this sequential loop. But you're still optimizing a bottleneck. Mercury 2 uses diffusion instead. It starts with a rough draft of the entire response and refines all the words simultaneously through multiple passes. Each pass improves many tokens in parallel, so one neural network evaluation does far more work. The model can also correct mistakes mid-generation because nothing is locked in until the final pass. This isn't a serving trick or a hardware optimization. The speed comes from the architecture itself. This unlocks workflows that were impractical before: 1. Multi-step agents that run 10+ reasoning loops without compounding latency 2. Voice AI that hits sub-200ms response times with full reasoning enabled 3. Real-time code editors where every keystroke triggers model feedback Mercury 2 runs at 1,000 tokens per second while matching the quality of models that generate 70-90 tokens per second. If this performance holds across model sizes, reasoning stops being a batch process you run overnight and becomes something you embed everywhere. Agent loops become tight enough for interactive debugging. Voice systems feel instant instead of sluggish. Code assistants respond faster than you can move your cursor. The entire category of "too slow for production" collapses.
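The parallel-refinement loop described above can be sketched in a few lines. This is a toy illustration, not Mercury 2's actual decoder: the "denoiser" here simply knows the target text and the confidence rule is invented, but the structure, commit the most confident positions in parallel on each pass instead of one token at a time, is the point.

```python
import math

def diffusion_decode(denoise, length):
    """Toy parallel decoder: start fully masked, then on each pass
    commit the most confident half of the remaining positions."""
    seq = [None] * length  # None = still masked
    passes = 0
    while any(t is None for t in seq):
        preds = denoise(seq)  # masked position -> (token, confidence)
        masked = [i for i, t in enumerate(seq) if t is None]
        k = math.ceil(len(masked) / 2)
        # Commit many positions per pass, in parallel
        for i in sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]:
            seq[i] = preds[i][0]
        passes += 1
    return seq, passes

# Toy denoiser that "knows" the answer; confidence grows with the
# number of already-committed neighbors (purely illustrative).
TARGET = "the quick brown fox jumps".split()

def toy_denoiser(seq):
    preds = {}
    for i, t in enumerate(seq):
        if t is None:
            neighbors = sum(
                1 for j in (i - 1, i + 1)
                if 0 <= j < len(seq) and seq[j] is not None
            )
            preds[i] = (TARGET[i], neighbors)
    return preds

decoded, n_passes = diffusion_decode(toy_denoiser, len(TARGET))
```

An autoregressive decoder would need one pass per token; here the pass count shrinks with each round because many positions are filled at once, which is where the claimed throughput advantage comes from.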
A 35 billion parameter model just beat a 235 billion parameter model. That's not supposed to happen. Qwen3.5-35B-A3B now outperforms its predecessor that had 6x more total parameters, and it does so while using 7x fewer active parameters per token. The breakthrough isn't efficiency for efficiency's sake. It's proof that three specific techniques can compress intelligence better than brute-force scaling: 1. Hybrid attention layers that mix linear attention (fast, scales to long contexts) with standard attention (accurate, catches nuance) in a 3:1 ratio 2. Ultra-sparse experts where only 3 billion of 35 billion parameters activate per token, but those 3 billion are chosen by a router trained on higher-quality data 3. Reinforcement learning scaled across millions of simulated agent environments, not just text prediction The result is a model architecture where intelligence comes from better routing decisions, not bigger weight matrices. This unlocks four things that weren't practical before: 1. Running frontier-class reasoning on a single GPU node instead of a cluster 2. Serving 1 million token contexts in production without exploding costs 3. Building agents that can handle complex tool use without the latency penalty of dense models 4. Fine-tuning on domain data without needing to update 200+ billion parameters If this pattern holds, the next 18 months will belong to teams optimizing routing and data quality, not teams with the biggest GPU budgets.
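The ultra-sparse routing in point 2 boils down to scoring every expert and activating only the top k. Qwen3.5's actual router isn't public in this post, so this is a generic top-k softmax routing sketch with made-up dimensions; it shows why only a small fraction of parameters run per token.

```python
import math
import random

random.seed(0)
n_experts, d = 8, 16  # invented sizes for illustration

# Router ("gate") weights: one scoring vector per expert
gate = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_experts)]

def route(hidden, k=2):
    """Score all experts, keep the top-k, and softmax their logits
    so the selected experts' outputs combine as a convex mixture."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in gate]
    top = sorted(range(n_experts), key=lambda e: logits[e], reverse=True)[:k]
    m = max(logits[e] for e in top)
    exps = [math.exp(logits[e] - m) for e in top]
    z = sum(exps)
    return top, [x / z for x in exps]

hidden = [random.gauss(0, 1) for _ in range(d)]
experts, weights = route(hidden)
```

Only the `k` selected experts' feed-forward weights are ever touched for this token; the other experts sit idle, which is how a 35B-total model can run with ~3B active parameters per token.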
Imbue just open-sourced Evolver. A tool that uses LLMs to automatically optimize code and prompts. They hit 95% on ARC-AGI-2 benchmarks. That's GPT-5.2-level performance from an open model. Evolver works like natural selection for code. You give it three things: 1. Starting code or prompt 2. A way to score results 3. An LLM that suggests improvements Then it runs in a loop. It picks high-scoring solutions. Mutates them. Tests the mutations. Keeps what works. The key difference from random mutation: LLMs propose targeted fixes. When a solution fails on specific inputs, the LLM sees those failures. It suggests changes to fix them. Most suggestions don't help. But some do. Those survivors become parents for the next generation. Evolver adds smart optimizations: > Batch mutations: fix multiple failures at once > Learning logs: share discoveries across branches > Post-mutation filters: skip bad mutations before scoring The verification step alone cuts costs 10x. This works on any problem where LLMs can read the code and you can score the output. You can now auto-optimize: - Agentic workflows - Prompt templates - Code performance - Reasoning chains No gradient descent needed. No differentiable functions required.
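The select-mutate-score loop described above can be sketched minimally. This is not Evolver's implementation (which adds batch mutations, learning logs, and post-mutation filters); a toy numeric problem stands in for "code or prompt", and a random nudge stands in for the LLM's targeted fix.

```python
import random

random.seed(0)

def evolve(seed, score, mutate, generations=40, children_per_parent=3):
    """Minimal evolutionary loop: keep the best scorers as parents,
    let each propose mutated children, repeat."""
    population = [seed]
    for _ in range(generations):
        parents = sorted(population, key=score, reverse=True)[:2]
        population = parents + [
            mutate(p) for p in parents for _ in range(children_per_parent)
        ]
    return max(population, key=score)

# Toy stand-in for "LLM proposes a targeted fix": nudge one entry
# of a candidate toward an unknown target measured only by score().
TARGET = [2, 1, 3, 0]
score = lambda c: -sum(abs(a - b) for a, b in zip(c, TARGET))

def mutate(c):
    c = list(c)
    i = random.randrange(len(c))
    c[i] += random.choice([-1, 1])
    return c

best = evolve([0, 0, 0, 0], score, mutate)
```

The structure is the same whether candidates are integer lists, prompt templates, or source files: anything you can mutate and score can go through the loop, which is why no gradients are needed.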