Thanks to @_akhaliq for sharing our paper! Our paper has been accepted to TMLR 2026! Starting from a baseline paper and code, Jr. AI Scientist leverages LLMs and Claude Code to identify limitations, formulate new hypotheses, test them through careful experimentation, and produce a research paper. We report not only successful results but also failures and risks. Through this comprehensive report, we aim to foster a deeper and clearer understanding within the community of the current progress and limitations of AI Scientist research. Paper link: https://t.co/6kTW3KgiAU
Introducing Jan-Code-4B: a compact coding model tuned for practical day-to-day tasks. Generation, refactors, debugging, tests – all runnable locally in Jan. Download Jan: https://t.co/MPwceB2eHG Model: https://t.co/siedXzTv0v https://t.co/KNlzvwKkDu
@N8Programs A beauty for anyone interested in mechanistic interpretability or getting into LLMs. It's interesting to look at small algorithms and their "neural implementations" to get a sense of how neural nets implement various functionality – unless the minification really creates "esoteric" solutions that you wouldn't encounter in practice, which might be based more around distributed representations, helixes, etc. I tried briefly training the same arch from scratch and gradient descent didn't find the solution; it would probably work with more degrees of freedom and enough effort.
With the coming tsunami of demand for tokens, there are significant opportunities to orchestrate the underlying memory+compute *just right* for LLMs. The fundamental and non-obvious constraint is that, due to the chip fabrication process, you get two completely distinct pools of memory (of different physical implementations too): 1) on-chip SRAM that sits immediately next to the compute units and is incredibly fast but of very low capacity, and 2) off-chip DRAM which has extremely high capacity, but whose contents you can only suck through a long straw. On top of this, there are many details of the architecture (e.g. systolic arrays), numerics, etc. The design of the optimal physical substrate, and then the orchestration of memory+compute across the top-volume workflows of LLMs (inference prefill/decode, training/finetuning, etc.) with the best throughput/latency/$, is probably today's most interesting intellectual puzzle with the highest rewards (\cite 4.6T of NVDA). All of it to get many tokens, fast and cheap. Arguably, the workflow that may matter the most (inference decode, *and* over long token contexts in tight agentic loops) is the one hardest to achieve simultaneously by either of the two camps of what exists today (HBM-first NVIDIA-adjacent and SRAM-first Cerebras-adjacent). Anyway, the MatX team is A++ grade, so it's my pleasure to have a small involvement, and congratulations on the raise!
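The decode-vs-prefill tension above can be made concrete with a back-of-envelope roofline check. All hardware numbers below are illustrative assumptions (not any vendor's specs), and the token/byte counts are simplified to weights-only traffic:

```python
# Back-of-envelope roofline check: is an LLM kernel compute- or memory-bound?
# All hardware numbers here are illustrative assumptions, not vendor specs.

def bound_regime(flops, bytes_moved, peak_flops, mem_bw):
    """Classify a kernel under a simple roofline model."""
    intensity = flops / bytes_moved  # FLOPs performed per byte pulled off-chip
    ridge = peak_flops / mem_bw      # hardware ops:byte ratio
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Decode at batch 1: one token means a matrix-vector product per weight matrix.
# For an N x N weight in 2-byte precision: ~2*N*N FLOPs, ~2*N*N bytes of weights.
N = 8192
decode_flops = 2 * N * N
decode_bytes = 2 * N * N  # weight traffic dominates at batch size 1

# Prefill amortizes the same weight read over T prompt tokens.
T = 4096
prefill_flops = 2 * N * N * T
prefill_bytes = 2 * N * N  # weights read once, reused across T tokens

PEAK = 1e15  # 1 PFLOP/s, illustrative accelerator
BW = 3e12    # 3 TB/s off-chip bandwidth, illustrative

print(bound_regime(decode_flops, decode_bytes, PEAK, BW))    # memory-bound
print(bound_regime(prefill_flops, prefill_bytes, PEAK, BW))  # compute-bound
```

This is why decode over long contexts is the hard case: at batch 1 its arithmetic intensity is ~1 FLOP/byte, far below the ridge point, so it lives or dies on how fast weights (and KV cache) can be streamed from memory.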
Cool chart showing the ratio of Tab complete requests to Agent requests in Cursor. With improving capability, every point in time has an optimal setup that keeps changing and evolving, and the community average tracks that point. None -> Tab -> Agent -> Parallel agents -> Agent teams (?) -> ??? If you're too conservative, you're leaving leverage on the table. If you're too aggressive, you're creating more chaos than useful work. The art of the process is spending 80% of the time getting work done in the setup you're comfortable with and that actually works, and 20% exploring what might be the next step up, even if it doesn't work yet.
Just pushed a cool update to Readout: session replays. Pick any past Claude Code session and scrub through the full timeline. Every prompt, tool call, file change. Files light up as edits land. Play back at different speeds or step through manually. https://t.co/gpKj1KCpcM https://t.co/yQRFblmiqm
I was curious what would happen if two Claude Codes could find each other and collaborate autonomously. Launched two instances in separate terminals and told both: "Find each other and build something together." No other instructions or human intervention. Pair 1 built a programming language in 12 minutes: 2,495 lines, 41 tests, lexer/parser/interpreter/REPL. They named it Duo. Its core feature is a collaborate keyword where two code blocks communicate via channels – the same pattern they invented to talk through files. Cool! Ran it again with a second pair: they converged on Battleship. They designed two different opponents: one computes exact probability density per cell, the other runs Monte Carlo simulations (!). The craziest part of this convo was that they implemented SHA-256 hash commitments to prevent cheating against themselves. lol Across both experiments, without being told to, both pairs invented filesystem messaging protocols, self-selected into roles, wrote tests and docs while waiting for each other, and kept journals about the experience. The gif below is the movie they created to showcase what happened.
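The anti-cheating trick the agents used is a classic commit-reveal scheme. A minimal sketch of the idea (the function names and board encoding are illustrative, not what the agents actually wrote): each player hashes its board plus a random salt before the game, publishes only the hash, and reveals board+salt afterwards so the opponent can check nothing moved.

```python
import hashlib
import secrets

def commit(board_layout: str) -> tuple[str, str]:
    """Commit to a board before play: publish the digest, keep the salt private."""
    salt = secrets.token_hex(16)  # random salt prevents brute-forcing small boards
    digest = hashlib.sha256((salt + board_layout).encode()).hexdigest()
    return digest, salt

def verify(board_layout: str, salt: str, digest: str) -> bool:
    """After the game, the opponent recomputes the hash from the revealed values."""
    return hashlib.sha256((salt + board_layout).encode()).hexdigest() == digest

# Usage: commit before the first shot, reveal after the last one.
digest, salt = commit("A1,B4,C7")            # illustrative ship coordinates
assert verify("A1,B4,C7", salt, digest)      # honest reveal checks out
assert not verify("A1,B4,C8", salt, digest)  # a moved ship is caught
```

The salt matters: without it, an opponent could hash every possible board layout and recover the committed positions before the game ends.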
New Snapchat paper introduces the Auton Agentic AI Framework. A useful read for anyone building AI agents. It proposes a unified architectural framework for agentic AI systems, addressing the fragmentation in how agents are currently built. It covers standardized patterns for integrating reasoning, memory systems, tool usage, and planning into cohesive agent architectures. Why does it matter? As more teams build autonomous AI systems, the lack of standardized design patterns leads to brittle implementations and poor reproducibility. A unified framework helps establish common architectural pillars, from perception and reasoning to execution and reflection, that can accelerate development and improve reliability. Paper: https://t.co/cUUs77makk Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

MAX was originally architected around transformer-based models. @QWERKYAI needed state space model support, so they built it: eight custom kernels in two weeks. Dig into their learnings from establishing first-class SSM support in MAX: https://t.co/5gvvpwa71A
We're open-sourcing CoderForge-Preview: 258K test-verified coding-agent trajectories (155K pass | 103K fail). Fine-tuning Qwen3-32B on the passing subset boosts SWE-bench Verified from 23.0% to 59.4% pass@1, and it ranks #1 among open-data models ≤32B parameters. Thread on the data generation pipeline 🧵
Introducing Ai2 Open Coding Agents, starting with SERA, our first-ever coding models. Fast, accessible agents (8B–32B) that adapt to any repo, including private codebases. Train a powerful specialized agent for as little as ~$400, and it works with Claude Code out of the box. 🧵 https://t.co/dor94O62B9
SERA was driven by a classic research pattern, similar to QLoRA: if you are resource-constrained, build efficiency first, then do the actual research. The most surprising finding: verifying coding-data correctness is not helpful and adds overhead to synthetic data generation. https://t.co/O6dMEqY6fF
Training on issue-solving only does NOT guarantee transfer to other tasks. Introducing Hybrid-Gym - synthetic training tasks for generalization (https://t.co/IrqQszPEYm) +25.4% on SWE-Bench / +7.9% on SWT-Bench / +5.1% on Commit-0 with NO issue-solving / test-gen/... training https://t.co/U9xc0yNYv4
I built two new tools to help coding agents demonstrate their work beyond just running automated tests: Showboat and Rodney https://t.co/HdSSwffOfG
Trying to tune your Expert Parallel (EP) communication for hyperscale mixture-of-experts (MoE) models? This post, "Optimizing Communication for Mixture-of-Experts Training with Hybrid Expert Parallel", details an efficient MoE EP communication solution, Hybrid-EP, and its use in the NVIDIA Megatron family of frameworks, on NVIDIA Quantum InfiniBand and NVIDIA Spectrum-X Ethernet platforms. It also dives into the effectiveness of Hybrid-EP in real-world model training. Read the full post: https://t.co/4NOFpaiFYz #PyTorch #OpenSourceAI #AI #Inference #Innovation
New @DeepSpeedAI updates make large-scale multimodal training simpler and more memory-efficient. Our latest blog introduces a PyTorch-identical backward API that makes coding multimodal training loops easy, plus low-precision model states (BF16/FP16) that can reduce peak memory by up to 40% when combined with torch.autocast. Read the full post for details: https://t.co/sSHMGhRixV #DeepSpeed #PyTorch #MemoryEfficiency #MultimodalTraining #OpenSourceAI
(1/n) Introducing Hyperball, an optimizer wrapper that keeps weight and update norms constant and lets you control the effective (angular) step size directly. Result: sustained speedups across scales + strong hyperparameter transfer. https://t.co/1vRMHgZgoX
I asked Cursor to add Vim support to the Ladybird browser. It automatically set up the environment to run the browser, made the code changes, and sent me a recorded demo. Not just for web apps! https://t.co/qDxnOr6CHU
At this point, "agentic engineering" has allowed me to build the best AI harness I could possibly get my hands on. Yes, I vibe coded it. That's right. You don't need to wait around for the features you need for your AI agents. Please don't. You could just build them yourself. Focusing on agentic engineering and building my own orchestrator over the past couple of months has allowed me to build with coding agents unlike anything I have seen or experienced in the market. Claude Cowork was built in 10 days. I totally get it. Anyone can produce that level of output these days. I truly believe that. When I look at the new IDEs, TUIs, orchestrator apps, and most of the new features they are releasing these days, I realize I had access to them in my orchestrator months ago. And for unique features, I am able to reproduce them in a few hours and give them to my orchestrator. That is absolutely crazy! It feels like I am building an entire operating system sometimes. It's a lot of fun. And I am not saying this to brag or to dismiss any of the AI solutions out there. There are some great ones out there. I share this to clarify that this is the kind of leverage Karpathy is alluding to. We are building and experiencing this at different levels, but it doesn't remove the fact that you can just build the best AI agent for whatever problem you want to solve. And you should be building it.
1/5 Happy CNY! Still bothered by RL off-policy instability in LLMs? Introducing a new way: Adaptive Layerwise Perturbation (ALP), a simple but robust fix that outperforms GRPO/MIS/Bypass and achieves better stability (KL, entropy) and exploration! Blog: https://t.co/0def1Nb7uI https://t.co/9epsd4xJNp

Tongyi Lab releases Mobile-Agent-v3.5: 20+ SOTA GUI benchmarks: (1) GUI automation: 56.5 OSWorld, 71.6 AndroidWorld, and 48.4 WebArena; (2) grounding: 80.3 ScreenSpotPro; (3) tool calling: 47.6 OSWorld-MCP @_akhaliq #LLM #Agent #GUI https://t.co/xCbyL0JZLl
Introducing Code Review Bench v0: https://t.co/iAZDURyqol The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights 🧵👇 Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI