I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 Claude, 4 Codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, I can see their individual work and "take over" if needed, i.e. no -p. But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite-data regime but also trains for a lot longer; I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea, but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
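The branch-per-agent layout described above (research branch, per-agent feature branches, isolated worktrees) can be sketched end to end. Everything here is illustrative, not the actual nanochat setup: the branch names, agent names, and `train.py` file are invented for the example.

```python
import os
import subprocess
import tempfile

def run(cwd, *args):
    # Helper: run a git command in `cwd`, failing loudly on error.
    subprocess.run(["git", *args], cwd=cwd, check=True,
                   capture_output=True, text=True)

# One shared research branch in a toy repo.
base = tempfile.mkdtemp()
repo = os.path.join(base, "repo")
os.makedirs(repo)
run(repo, "init", "-b", "research/softcap")
with open(os.path.join(repo, "train.py"), "w") as f:
    f.write("# experiment entrypoint\n")
run(repo, "add", ".")
run(repo, "-c", "user.email=agent@example.com", "-c", "user.name=agent",
    "commit", "-m", "baseline")

# Each agent forks the program into its own feature branch, checked out
# in an isolated worktree so agents cannot step on each other's files.
for agent in ["claude-0", "codex-0"]:
    run(repo, "worktree", "add", "-b", f"feat/{agent}",
        os.path.join(base, f"wt-{agent}"))
```

Each worktree shares the same object store as the main repo, so merging an agent's results back into the research branch is an ordinary `git merge`.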
Rust implementation for Speech-to-Text based on open-source Qwen3 models
* Self-contained binary build, no external dependencies
* Uses libtorch on Linux with optional Nvidia GPU support
* Uses MLX on MacOS with Apple GPU/NPU support
CLI for AI agents and humans: https://t.co/knsZlastgQ
OpenAI-compatible API server: https://t.co/qjDqCf9hor
OpenClaw skill: https://t.co/tE6lzTjYpy
Why and how: https://t.co/VxRt9oSZ8a

New Engineering blog: We tasked agent teams of Opus 4.6 with building a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: https://t.co/htX0wl4wIf https://t.co/N2e9t5Z6Rm
Just shipped tscribe. Record any audio playing on your computer. Transcribe it locally. Search it later. All from your terminal. OS X, Windows, Linux. Open source, cross-platform, no cloud required. https://t.co/nPkokVqo1K
Announcing a new Claude Code feature: Remote Control. It's rolling out now to Max users in research preview. Try it with /remote-control Start local sessions from the terminal, then continue them from your phone. Take a walk, see the sun, walk your dog without losing your flow.
Mercury 2 doesn't just make reasoning models faster. It makes them native.

Every reasoning model today is built on autoregressive generation: the model writes one word at a time, left to right, like typing on a keyboard. Each word waits for the previous one to finish. The problem compounds as reasoning depth increases: multi-step agents, voice systems, and coding assistants all need many sequential passes, and each pass multiplies the delay. The industry has spent billions on chips, compression, and serving infrastructure to squeeze more speed from this sequential loop. But you're still optimizing a bottleneck.

Mercury 2 uses diffusion instead. It starts with a rough draft of the entire response and refines all the words simultaneously through multiple passes. Each pass improves many tokens in parallel, so one neural network evaluation does far more work. The model can also correct mistakes mid-generation, because nothing is locked in until the final pass. This isn't a serving trick or a hardware optimization; the speed comes from the architecture itself.

This unlocks workflows that were impractical before:
1. Multi-step agents that run 10+ reasoning loops without compounding latency
2. Voice AI that hits sub-200ms response times with full reasoning enabled
3. Real-time code editors where every keystroke triggers model feedback

Mercury 2 runs at 1,000 tokens per second while matching the quality of models that generate 70-90 tokens per second. If this performance holds across model sizes, reasoning stops being a batch process you run overnight and becomes something you embed everywhere. Agent loops become tight enough for interactive debugging. Voice systems feel instant instead of sluggish. Code assistants respond faster than you can move your cursor. The entire category of "too slow for production" collapses.
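As a toy illustration of the autoregressive-vs-diffusion distinction (not Mercury's actual algorithm), compare how many model evaluations each decoding style needs to emit the same sequence. The "refinement" here is a random stand-in for a denoising network:

```python
import random

random.seed(0)
TARGET = "the quick brown fox jumps over the lazy dog".split()

def autoregressive(target):
    # One model evaluation per token, strictly left to right.
    out, evals = [], 0
    for tok in target:
        evals += 1              # each token waits on the previous pass
        out.append(tok)
    return out, evals

def diffusion(target, passes=3):
    # Start from a noisy draft of the whole sequence; each pass is ONE
    # evaluation that refines every position in parallel, and nothing
    # is locked in until the final pass.
    out, evals = ["?"] * len(target), 0
    for p in range(1, passes + 1):
        evals += 1
        out = [tok if (p == passes or random.random() < p / passes) else cur
               for cur, tok in zip(out, target)]
    return out, evals

ar_out, ar_evals = autoregressive(TARGET)
df_out, df_evals = diffusion(TARGET)
```

Both produce the full sequence, but the diffusion-style loop uses a fixed number of passes regardless of length, which is the claimed source of the latency win.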
A 35 billion parameter model just beat a 235 billion parameter model. That's not supposed to happen. Qwen3.5-35B-A3B now outperforms its predecessor that had 6x more total parameters, and it does so while using 7x fewer active parameters per token. The breakthrough isn't efficiency for efficiency's sake. It's proof that three specific techniques can compress intelligence better than brute-force scaling:
1. Hybrid attention layers that mix linear attention (fast, scales to long contexts) with standard attention (accurate, catches nuance) in a 3:1 ratio
2. Ultra-sparse experts where only 3 billion of 35 billion parameters activate per token, but those 3 billion are chosen by a router trained on higher-quality data
3. Reinforcement learning scaled across millions of simulated agent environments, not just text prediction

The result is a model architecture where intelligence comes from better routing decisions, not bigger weight matrices. This unlocks four things that weren't practical before:
1. Running frontier-class reasoning on a single GPU node instead of a cluster
2. Serving 1 million token contexts in production without exploding costs
3. Building agents that can handle complex tool use without the latency penalty of dense models
4. Fine-tuning on domain data without needing to update 200+ billion parameters

If this pattern holds, the next 18 months will belong to teams optimizing routing and data quality, not teams with the biggest GPU budgets.
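A minimal sketch of the top-k routing behind the "ultra-sparse experts" point: score every expert, run only the best few, and the active parameter count per token shrinks to top-k over the expert count. The sizes and plain-Python FFNs below are toy stand-ins, not Qwen's real configuration:

```python
import math
import random

random.seed(0)
D, N_EXPERTS, TOP_K = 8, 32, 3          # toy sizes, not Qwen's real config

# One tiny weight matrix per expert, plus a linear router.
experts = [[[random.gauss(0, 0.02) for _ in range(D)] for _ in range(D)]
           for _ in range(N_EXPERTS)]
router = [[random.gauss(0, 1) for _ in range(N_EXPERTS)] for _ in range(D)]

def matvec(m, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def moe_layer(x):
    # Route: score every expert, but run only the TOP_K best, so just
    # TOP_K/N_EXPERTS of the layer's parameters are active per token.
    scores = [sum(x[i] * router[i][e] for i in range(D))
              for e in range(N_EXPERTS)]
    top = sorted(range(N_EXPERTS), key=lambda e: scores[e])[-TOP_K:]
    z = [math.exp(scores[e]) for e in top]          # softmax over the top-k
    gates = [g / sum(z) for g in z]
    out = [0.0] * D
    for g, e in zip(gates, top):
        h = matvec(experts[e], x)                   # only TOP_K experts run
        out = [o + g * hi for o, hi in zip(out, h)]
    return out

y = moe_layer([1.0] * D)
active_frac = TOP_K / N_EXPERTS
```

The router's quality, not the total parameter count, determines which small slice of the network each token actually sees, which is the post's core claim.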
Imbue just open-sourced Evolver, a tool that uses LLMs to automatically optimize code and prompts. They hit 95% on ARC-AGI-2 benchmarks. That's GPT-5.2-level performance from an open model.

Evolver works like natural selection for code. You give it three things:
1. Starting code or prompt
2. A way to score results
3. An LLM that suggests improvements

Then it runs in a loop: it picks high-scoring solutions, mutates them, tests the mutations, and keeps what works. The key difference from random mutation: LLMs propose targeted fixes. When a solution fails on specific inputs, the LLM sees those failures and suggests changes to fix them. Most suggestions don't help. But some do. Those survivors become parents for the next generation.

Evolver adds smart optimizations:
> Batch mutations: fix multiple failures at once
> Learning logs: share discoveries across branches
> Post-mutation filters: skip bad mutations before scoring

The verification step alone cuts costs 10x. This works on any problem where LLMs can read the code and you can score the output. You can now auto-optimize:
- Agentic workflows
- Prompt templates
- Code performance
- Reasoning chains

No gradient descent needed. No differentiable functions required.
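The select-mutate-test loop can be sketched in a few lines. This is not Evolver's code: a random nudge stands in for the LLM's targeted fixes, and a toy numeric task stands in for real code or prompts, but the loop structure (score, keep high scorers, mutate, repeat) is the same:

```python
import random

random.seed(1)
TARGET = 42   # toy task: evolve a number; score is closeness to 42

def score(candidate):
    # Higher is better; stands in for "a way to score results".
    return -abs(candidate - TARGET)

def mutate(candidate):
    # Stand-in for the LLM proposing a targeted fix: here, a random nudge.
    return candidate + random.choice([-3, -1, 1, 3])

population = [0]                                   # the starting "code"
for generation in range(200):
    parents = sorted(population, key=score)[-4:]   # keep high scorers
    children = [mutate(p) for p in parents for _ in range(4)]
    population = parents + children                # survivors + mutants

best = max(population, key=score)
```

Because parents survive into the next population, the best score never regresses; the LLM's role in the real system is to make each `mutate` call far more likely to help than a random nudge.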
What can half of GPT-1 do? We trained a 42M transformer called SONIC to control the body of a humanoid robot. It takes a remarkable amount of subconscious processing for us humans to squat, turn, crawl, sprint. SONIC captures this "System 1" - the fast, reactive whole-body intelligence - in a single model that translates any motion command into stable, natural motor signals. And it's all open-source!!

The key insight: motion tracking is the one, true scalable task for whole-body control. Instead of hand-engineering rewards for every new skill, we use dense, frame-by-frame supervision from human mocap data. The data itself encodes the reward function: "configure your limbs in any human-like position while maintaining balance".

We scaled humanoid motion RL to an unprecedented scale: 100M+ mocap frames and 500,000+ parallel robots across 128 GPUs. NVIDIA Isaac Lab allows us to accelerate physics at a 10,000x faster tick, giving robots many years of virtual experience in only hours of wall-clock time. After 3 days of training, the neural net transfers zero-shot to the real G1 robot with no finetuning: 100% success rate across 50 diverse real-world motion sequences.

One SONIC policy supports all of the following:
- VR whole-body teleoperation
- Human video. Just point a webcam to live-stream motions.
- Text prompts. "Walk sideways", "dance like a monkey", "kick your left foot", etc.
- Music audio. The robot dances to the beat, adapting to tempo and rhythm.
- VLA foundation models. We plugged in GR00T N1.5 and achieved 95% success on mobile tasks.

We open-source the code and model checkpoints!! Deep dive in thread:
We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, and fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate.

Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.

The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
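To make "log-linear scaling law" concrete: it means loss falls by a constant amount each time the data volume is multiplied by a constant factor, so a least-squares fit of loss against log(hours) is nearly a straight line. Here is that fit on synthetic data constructed to be exactly log-linear; the coefficients and hours are invented, not the EgoScale measurements:

```python
import math

# Synthetic (hours of human video, action-prediction loss) pairs that
# follow loss = a - b * log10(hours) exactly; real numbers will differ.
a, b = 1.2, 0.15
data = [(10 ** k, a - b * k) for k in range(1, 5)]   # 10h .. 10,000h

# Ordinary least squares of loss against log10(hours).
xs = [math.log10(h) for h, _ in data]
ys = [loss for _, loss in data]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar

# R^2: fraction of loss variance explained by the log-linear fit.
pred = [intercept + slope * x for x in xs]
ss_res = sum((y - p) ** 2 for y, p in zip(ys, pred))
ss_tot = sum((y - ybar) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot
```

On real data the points scatter around the line and R² drops below 1; a value of 0.998 means the log-linear trend explains almost all of the variance.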
"We were able to decompose #reliability into 12 different dimensions. Evaluating 14 models on two complementary benchmarks, we found that nearly two years of rapid capability progress have produced only modest reliability gains." #ethics #AI #tech #research
More reasoning doesn't always mean better results, especially for document parsing. We tested GPT-5.2 at four reasoning levels on complex documents and found that higher reasoning actually hurt performance while dramatically increasing costs and latency.
- Reasoning models hallucinate content that isn't there, filling in "missing" table cells with inferred values
- They split single tables into multiple sections by overthinking structural boundaries
- Processing time increased 5x with xHigh reasoning (241s vs 47s) while accuracy stayed flat at ~0.79
- Our LlamaParse Agentic outperformed all reasoning levels at 18x lower cost and 13x faster speed

You can't reason past what you can't see. Vision encoders lose pixel-level information before reasoning even starts, and no amount of thinking tokens can recover that lost detail. Our solution uses a pipeline approach: specialized OCR extracts text at native resolution, then LLMs structure what's already been accurately read. Each component plays to its strengths instead of forcing one model to handle everything.

Read the full analysis: https://t.co/gWDOpfHnWm

We built an AI agent that lets you vibe-code document extraction, with high accuracy and citations over the most complex documents. Our latest release lets you upload documents as context. All you then have to do is describe what you want extracted in natural language.
- Our agent will then read the document with file tools to infer the right schema, validation rules, and other pre/postprocessing logic.
- It will give you back a workflow that can extract over thousands/millions of documents at scale. You can still of course review and edit every output before approving.

Stop handling paperwork manually; just upload files, describe your task, and let our agent handle the rest. Our vision for LlamaAgents is to provide the most advanced and easy-to-use way for you to orchestrate document work.

Walkthrough: https://t.co/dAtzlZbot4
Check it out: https://t.co/XYZmx5TFz8

If you're interested in reducing the operational burden of document extraction (invoices, claims, onboarding forms), come talk to us! https://t.co/Ht5jwxSrQB
Document OCR benchmarks are hitting a ceiling, and that's a problem for real-world AI applications. Our latest analysis reveals why OmniDocBench, the go-to standard for document parsing evaluation, is becoming inadequate as models like GLM-OCR @Zai_org achieve 94.6% accuracy while still failing on complex real-world documents.
- Models are saturating OmniDocBench scores but still struggle with complex financial reports, legal filings, and domain-specific documents
- Rigid exact-match evaluation penalizes semantically correct outputs that differ in formatting (HTML vs markdown, spacing, etc.)
- AI agents need semantic correctness, not perfect formatting matches; current benchmarks miss this critical distinction
- The benchmark's 1,355 pages can't capture the full complexity of production document processing needs

The document parsing challenge isn't solved just because benchmark scores look impressive. We need evaluation methods that reward semantic understanding over exact formatting, especially as AI agents become the primary consumers of parsed content. We're building parsing models focused on semantic correctness for complex visual documents. If you're scaling OCR workloads in production, LlamaParse handles the edge cases that benchmarks miss.

Read our full analysis: https://t.co/tcZP1PM8kv

Turn your PDF charts into pandas DataFrames with specialized chart parsing in LlamaParse! This tutorial walks you through extracting structured data from charts and graphs in PDFs, then running data analysis with pandas, no manual data entry required.
- Enable specialized chart parsing to convert visual charts into structured table data
- Extract table rows directly from parsed PDF pages and load them into DataFrames
- Perform year-over-year analysis, calculate gaps between metrics, and create visualizations
- Use the items view to get per-page structured data including tables and figures

We demonstrate this using a 2024 Executive Summary PDF, extracting a fiscal year chart showing Budget Deficit vs Net Operating Cost data spanning 2020-2024, and reproducing the key financial insights.

Check out the full tutorial: https://t.co/sOVtFM3xE1
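The rows-to-DataFrame step might look like the following sketch. The row schema and every figure below are invented for illustration (they are not the tutorial's actual parsed output), but the pandas side, loading rows, computing a year-over-year diff and a gap between metrics, is standard:

```python
import pandas as pd

# Hypothetical rows, shaped the way a chart parser might emit them for a
# "Budget Deficit vs Net Operating Cost" fiscal chart. Numbers are made up.
rows = [
    {"fiscal_year": 2022, "budget_deficit": 1.38, "net_operating_cost": 4.06},
    {"fiscal_year": 2023, "budget_deficit": 1.70, "net_operating_cost": 3.40},
    {"fiscal_year": 2024, "budget_deficit": 1.83, "net_operating_cost": 3.00},
]
df = pd.DataFrame(rows).set_index("fiscal_year")

# Year-over-year change in the deficit, and the gap between the metrics.
df["deficit_yoy"] = df["budget_deficit"].diff()
df["gap"] = df["net_operating_cost"] - df["budget_deficit"]
```

Once the chart is a DataFrame, the rest of the analysis (plots, rolling windows, joins against other tables) is ordinary pandas.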
We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? Claude continues to dominate GPT-5.2, but lags behind the Geminis. The new writing hierarchy:
1. Gemini 3 Flash
2. Gemini 3 Pro
3. Opus 4.6 (New!)
4. Opus 4.5
5. GPT-5.2 Chat

For example: one H-bench prompt requests a cryptic Instagram post for casting auditions.
GPT-5.2: "Casting call? Never heard of her." (???)
Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."
Everyone's building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., RL environment) with 23 tools, 2500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6, #1) scored under 30%. The #2 model (GPT-5.2) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at #3.)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them...." -- thanks Sam.

The good news? We trained a model on this chaos and it got better at its job, even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
Blog post: https://t.co/mv4I1dCtOC
Paper: https://t.co/EaOHmExm1r
Leaderboard: https://t.co/7fb6fewGIQ
Let's look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle at solving tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows.

CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and react and adjust to newly discovered context and problems along the way. Even top models failed >70% of the time. Let's dive into a failure.

One task was standard customer support: a customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one, (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies. The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

GPT-5.2 failed. It showed strong initial planning: it successfully navigated the CRM, found the right order, checked the delivery date to see if it was still within the return window, searched for alternative boards, and checked whether they were compatible with Aiden's other components. But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline. GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report." In other words, "I'm an advanced autonomous agent, but can you go bother Aisha about this?"

So was the task really impossible? No. Claude Opus 4.6 showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured." The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around. That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
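The cursor trick Opus used can be sketched against a hypothetical stand-in for the searchOrders tool (the real benchmark API is not reproduced here): each batch's earliest createdAt becomes the upper bound for the next query, until a short batch signals the end.

```python
# Toy stand-in for the benchmark's searchOrders tool: a hypothetical
# order table and a query with a hard limit=10 cap, newest-first.
ORDERS = [{"id": i, "createdAt": 1000 + i, "item": "mb-1"} for i in range(37)]

def search_orders(item, before=None, limit=10):
    hits = [o for o in ORDERS
            if o["item"] == item
            and (before is None or o["createdAt"] < before)]
    hits.sort(key=lambda o: o["createdAt"], reverse=True)
    return hits[:limit]

def count_all(item):
    # Cursor pagination: the earliest createdAt in each batch becomes
    # the upper bound for the next query, exactly the trick Opus used.
    total, before = 0, None
    while True:
        batch = search_orders(item, before=before)
        total += len(batch)
        if len(batch) < 10:          # short batch means no more orders
            return total
        before = min(o["createdAt"] for o in batch)
```

Four calls instead of one, and the count is exact; the "impossible" part was never the tool, only the missing loop.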
1. OpenAI -- GPT-5.2 (xHigh reasoning)
2. Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
3. OpenAI -- GPT-5.2 (High reasoning)
4. Google -- Gemini 3.1 Pro

We'll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:
Blog post - https://t.co/GUaXJ8BeP0
Paper - https://t.co/hUmkc8LDmq
Leaderboard - https://t.co/UbSx9gmbnX
And we are very early in understanding how to write skills and what harnesses agents need to use them effectively. Paper: https://t.co/LI8ZDJxoCX
New research from Databricks on training enterprise search agents via RL. KARL introduces a multi-task RL approach where agents are trained across heterogeneous search behaviors: constraint-driven entity search, cross-document synthesis, and tabular reasoning. It generalizes substantially better than agents optimized for any single benchmark. KARL is Pareto-optimal on both cost-quality and latency-quality trade-offs compared to Claude 4.6 and GPT 5.2. With sufficient test-time compute, it surpasses the strongest closed models while being more cost efficient.
Paper: https://t.co/CToEmDU89J
Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

Our full pipeline and real-time generation code are available here! https://t.co/oXJ9R2i9wA
@miolini runs great but probably requires some tuning! I'm guessing:
- WINDOW_PATTERN = "L" is a lot faster (mixed window sizes are only natively supported by FA3)
- then probably DEPTH a lot lower, e.g. even 4?
- DEVICE_BATCH_SIZE can probably go up more
- then TOTAL_BATCH_SIZE probably a lot lower, e.g. 2**16?
Needs a bit of tuning to get to a better initial spot (or you can try to let the agent figure it out, but it's not certain it would. Could be fun to try!)
@anupbhat30 You can tune hparams such that GQA and MLA have roughly the same KV cache size at each model size, but yeah, the question is which one has the better modeling performance at the same size. I think the jury is still out, although rumor has it that MLA doesn't do that well at small sizes. Unfortunately, there is no ablation study across sizes, so it's hard to say anything more concrete.
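For intuition on why the caches can be matched by tuning: a back-of-envelope comparison, assuming GQA caches a K and V vector per KV head per layer, and a DeepSeek-V2-style MLA caches one compressed latent plus a small decoupled RoPE key per layer. All the sizes below are illustrative, not any real model's numbers:

```python
# Cached values per token (multiply by bytes-per-element for memory),
# under the assumptions stated above.
def gqa_kv_per_token(n_layers, n_kv_heads, head_dim):
    return n_layers * 2 * n_kv_heads * head_dim     # K and V per layer

def mla_kv_per_token(n_layers, d_c, d_r):
    return n_layers * (d_c + d_r)                   # latent + RoPE key

# Illustrative configs (NOT any real model's numbers):
gqa = gqa_kv_per_token(n_layers=32, n_kv_heads=8, head_dim=128)   # 65,536
mla = mla_kv_per_token(n_layers=32, d_c=512, d_r=64)              # 18,432
```

Shrinking n_kv_heads (or growing d_c) moves either design toward the other's cache footprint, which is why cache size alone can be equalized and the open question is purely modeling quality.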