Your curated collection of saved posts and media
@dwr Can try it here! https://t.co/Drl94NfDOR
@Jason @steipete We built a way to stream OpenClaw's Thinking, Tool calls, and Price, in real time on your lock screen. https://t.co/KSLup0MHQJ
@Jason @steipete Open-sourced it here: https://t.co/Drl94NfDOR
@ashleybchae OddJob https://t.co/W9uoW7LlSX
Chowder iOS 2026.2.26
- Location sharing MVP
- Better Live Activity summaries
- Reconnect stall fix
- Stronger diagnostics
https://t.co/N143gLt2xV
New research from Google DeepMind. What if LLMs could discover entirely new multi-agent learning algorithms? Designing algorithms for multi-agent systems is hard. Classic approaches like PSRO and counterfactual regret minimization took years of expert effort to develop. Each new game-theoretic setting often demands its own specialized solution. But what if you could automate the discovery process itself? This research uses LLMs to automatically generate novel multi-agent learning algorithms through iterative prompting and refinement. The LLM proposes algorithm pseudocode, which gets evaluated against game-theoretic benchmarks, and feedback drives the next iteration. LLMs have absorbed enough algorithmic knowledge from training to serve as creative search engines over the space of possible algorithms. They generate candidates that humans wouldn't think to try. The discovered algorithms achieve competitive performance against established hand-crafted baselines across multiple game-theoretic domains. This shifts algorithm design from manual expert craft to automated discovery. The same approach could generalize beyond games to any domain where we need novel optimization procedures. Paper: https://t.co/9AeQYo2LFS Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
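The propose-evaluate-refine loop described above is easy to sketch. Below is a minimal, self-contained stand-in: `propose_algorithm` plays the role of the LLM (here just random local search over two hyperparameters) and `evaluate` stands in for the game-theoretic benchmarks. All names and the toy score are my assumptions, not the paper's interfaces.

```python
import random

def propose_algorithm(feedback, rng):
    # Stand-in for the LLM call: perturb the best-so-far "pseudocode",
    # represented here as two hyperparameters. A real system prompts an
    # LLM with feedback from previous rounds and parses its proposal.
    base = feedback["best_params"] if feedback else {"lr": 0.5, "regret_weight": 0.5}
    return {k: min(1.0, max(0.0, v + rng.uniform(-0.2, 0.2))) for k, v in base.items()}

def evaluate(params):
    # Stand-in for the game-theoretic benchmarks: a toy score that is
    # maximized (at 1.0) when lr=0.8 and regret_weight=0.3.
    return 1.0 - abs(params["lr"] - 0.8) - abs(params["regret_weight"] - 0.3)

def discover(rounds=50, seed=0):
    rng = random.Random(seed)
    feedback, best_score, best_params = None, float("-inf"), None
    for _ in range(rounds):
        candidate = propose_algorithm(feedback, rng)
        score = evaluate(candidate)
        if score > best_score:  # evaluation feedback drives the next proposal
            best_score, best_params = score, candidate
            feedback = {"best_params": candidate}
    return best_score, best_params

score, params = discover()
```

The point of the sketch is the loop shape, not the search strategy: the proposer only ever sees feedback derived from evaluation, which is what lets the discovery process run unattended.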
Be careful what you put in your AGENTS dot md files. This new research evaluates AGENTS dot md files for coding agents. Everyone uses these context files in their repos to help AI coding agents. More context should mean better performance, right? Not quite. This study tested Claude Code (Sonnet-4.5), Codex (GPT-5.2/5.1 mini), and Qwen Code across SWE-bench and a new benchmark called AGENTbench with 138 real-world instances. LLM-generated context files actually decreased task success rates by 0.5-2% while increasing inference costs by over 20%. Agents followed the instructions, using the mentioned tools 1.6-2.5x more often, but that instruction-following paradoxically hurt performance and required 22% more reasoning tokens. Developer-written context files performed better, improving success by about 4%, but still came with higher costs and additional steps per task. The broader pattern is that context files encourage more exploration without helping agents locate relevant files any faster. They largely duplicate what already exists in repo documentation. The recommendation is clear. Omit LLM-generated context files entirely. Keep developer-written ones minimal and focused on task-specific requirements rather than comprehensive overviews. I featured a paper last week that showed that LLM-generated Skills also don't work so well. Self-improving agents are exciting, but be careful of context rot and of unnecessarily overloading your context window. Paper: https://t.co/agxvRbW26N Learn to build effective AI agents in our academy: https://t.co/1e8RZKrwFp
Important survey on agentic memory systems. Memory is one of the most critical components of AI agents. It enables LLM agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. But the empirical foundations of these systems remain fragile. This new survey presents a structured analysis of agentic memory from both architectural and system perspectives. The authors introduce a taxonomy based on four core memory structures and then systematically analyze the pain points limiting current systems. What did they find? Existing benchmarks are underscaled and often saturated. Evaluation metrics are misaligned with semantic utility. Performance varies significantly across backbone models. And the latency and throughput overhead introduced by memory maintenance is frequently overlooked. Current agentic memory systems often underperform their theoretical promise because evaluation and architecture are studied in isolation. As agents take on longer, more complex tasks, memory becomes the bottleneck. This survey clarifies where current systems fall short and outlines directions for more reliable evaluation and scalable memory design. Paper: https://t.co/xNGTbVVhq9 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7deE

This new paper on agent failure makes an interesting claim. This is particularly important for long-horizon agents. Many assume that agents collapse because they hit problems they can't solve, caused by insufficient model knowledge. It turns out that in the majority of cases, they collapse because they take one wrong step, and then another, which compounds quickly. Each off-path tool call significantly increases the likelihood of failure of the next tool call. In other words, most agent failures are reliability failures, not capability failures. Paper: https://t.co/HCkTaXmdkM Learn to build effective AI agents in our academy: https://t.co/1e8RZKrwFp
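The compounding effect is easy to see with a toy calculation (my illustrative numbers, not the paper's): even a highly reliable per-step agent fails often over long horizons.

```python
def success_probability(per_step_success, n_steps):
    # Independent-steps approximation: a single off-path action
    # derails the whole trajectory.
    return per_step_success ** n_steps

p50 = success_probability(0.99, 50)    # ~0.605
p100 = success_probability(0.99, 100)  # ~0.366
```

A 99%-reliable step still fails roughly 40% of 50-step tasks and about two-thirds of 100-step tasks, which is why small gains in per-step reliability dominate gains in raw capability for long-horizon agents.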
New research from Intuit AI Research. Agent performance depends on more than just the agent. It also depends on the quality of the tool descriptions it reads. However, tool interfaces are still written for humans, not LLMs. As the number of candidate tools grows, poor descriptions become a real bottleneck for tool selection and parameter generation. As Karpathy has suggested, let's build for AI Agents. This new research introduces Trace-Free+, a curriculum learning framework that teaches models to rewrite tool descriptions into versions that are more effective for LLM agents. The key idea: during training, the model learns from execution traces showing which tool descriptions lead to successful usage. Then, through curriculum learning, it progressively reduces reliance on traces, so at inference time, it can improve tool descriptions for completely unseen tools without any execution history. On StableToolBench and RestBench, the approach shows consistent gains on unseen tools, strong cross-domain generalization, and robustness as candidate tool sets scale beyond 100. Instead of only fine-tuning the agent, optimizing the tool interface itself is a practical and underexplored lever for improving agent reliability. Paper: https://t.co/BeVigJNGYY Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
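The curriculum idea, progressively reducing reliance on execution traces, can be sketched roughly as follows. This is a toy illustration, not the paper's training code: the linear schedule, the prompt shape, and the `search_flights` tool are all my assumptions.

```python
def trace_weight(stage, total_stages):
    # Linear curriculum: fully trace-guided at stage 0, trace-free at the end.
    return max(0.0, 1.0 - stage / (total_stages - 1))

def build_rewrite_prompt(tool_desc, traces, stage, total_stages):
    # Show fewer execution traces as training advances, so that at
    # inference time the rewriter can improve descriptions for unseen
    # tools with no execution history at all.
    w = trace_weight(stage, total_stages)
    k = round(w * len(traces))
    return {"tool": tool_desc, "traces": traces[:k], "trace_weight": w}

prompt = build_rewrite_prompt(
    "search_flights(origin, dest): finds flights",  # hypothetical tool
    ["trace: call succeeded with IATA codes", "trace: failed on city names"],
    stage=1,
    total_stages=4,
)
```

By the final stage the prompt carries zero traces, matching the inference-time condition where the rewriter sees only the raw tool description.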
New research from Georgia Tech and Microsoft Research. GUI agents today are reactive. Every step costs an LLM call, which is why a lot of GUI agents are expensive, slow, and fragile. This new research introduces ActionEngine, a framework that shifts GUI agents from reactive execution to programmatic planning. A Crawling Agent explores the application offline and builds a state-machine graph of the interface. Nodes are page states, edges are actions. Then at runtime, an Execution Agent uses this graph to synthesize a complete Python program in a single LLM call. Instead of O(N) vision model calls per task, you get O(1) planning cost. On Reddit tasks from WebArena, ActionEngine achieves 95% task success with, on average, a single LLM call, compared to 66% for the strongest vision-only baseline. Cost drops by 11.8x. Latency drops by 2x. If the pre-planned script fails at runtime, a vision-based fallback repairs the action and updates the memory graph for future runs. Why does it matter? Treating GUI interaction as graph traversal rather than step-by-step probabilistic reasoning is a compelling direction for making agents both faster and more reliable. Paper: https://t.co/UR0PjvFf0c Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
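The "GUI interaction as graph traversal" idea can be sketched in a few lines. The toy graph below stands in for what the crawling agent would build offline; the page names and actions are hypothetical, and a real execution agent would wrap the resulting action sequence into a synthesized script.

```python
from collections import deque

# Toy UI state-machine graph: nodes are page states, edges are actions.
GRAPH = {
    "home":      {"click_login": "login", "click_search": "search"},
    "login":     {"submit_creds": "dashboard"},
    "search":    {"enter_query": "results"},
    "results":   {},
    "dashboard": {"open_settings": "settings"},
    "settings":  {},
}

def plan_actions(start, goal):
    # BFS over the state graph: planning is pure graph search, so the
    # single LLM call is only needed to turn the plan into a program.
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, nxt in GRAPH[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return None  # goal unreachable: fall back to vision-based repair

plan = plan_actions("home", "settings")
```

Because the plan comes from deterministic search, per-step vision calls disappear from the happy path, which is where the O(N) to O(1) cost reduction comes from.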

New research from Google DeepMind. Really interesting paper on diffusion models. Training good latents for diffusion models is harder than it looks. The standard approach uses a KL penalty borrowed from VAEs, with no principled way to control how much information actually lives in the latent space. This new research introduces Unified Latents (UL), a framework that co-trains a diffusion prior on the latents. This provides a tight upper bound on latent bitrate and makes the reconstruction-generation tradeoff explicit and, most importantly, tunable. On ImageNet-512, UL achieves FID 1.4 while requiring fewer training FLOPs than Stable Diffusion latents. On Kinetics-600, it sets a new state-of-the-art FVD of 1.3 for video generation. The latent space is one of the most overlooked design decisions in diffusion-based generation. UL gives practitioners a principled handle on it, for both images and video. Paper: https://t.co/E1HCf9QzB4
How can graphs improve coding agents? Multi-agent systems can boost code generation, but fixed interaction topologies don't adapt to task difficulty. This research introduces AgentConductor, a system where an orchestrator agent uses RL to dynamically generate task-adapted interaction topologies based on inferred agent roles and difficulty levels. Two components make this work: a topological density function that captures communication-aware characterizations of multi-agent interactions, and difficulty interval partitioning that prevents excessive pruning and provides precise topology control. Across five code datasets, AgentConductor achieves up to 14.6% improvement in pass@1 accuracy while reducing density by 13% and token costs by 68%. The great benefit of this approach is better performance with lower costs. Dynamic agent coordination is more efficient than static workflows for complex code generation. Paper: https://t.co/BypJZfU49q Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

If you want to get started with Claude Cowork, look no further. I recorded this 1-hour session on how to use Cowork. It's as powerful for knowledge work as Claude Code is for coding. I also use it for image generation with Skills. There's a nice guide to go along with it. https://t.co/u14Z2MemM9
NEW research from Sakana AI. Long contexts get expensive as every token in the input contributes to quadratic attention costs, higher latency, and more memory. This new research introduces Doc-to-LoRA, a lightweight hypernetwork that meta-learns to compress long documents into LoRA adapters in a SINGLE forward pass. In other words, it can instantly internalize contexts. Instead of re-reading the full context at every inference call, the model internalizes the document into compact adapter weights. No iterative fine-tuning is needed, and no repeated context consumption. Cool to see all the interesting new approaches to deal with long contexts like RLM, LCM, and now Doc-to-LoRA. The results: Near-perfect accuracy on needle-in-a-haystack tasks at sequence lengths exceeding the target model's native context window by over 4x. It also outperforms standard context distillation while significantly reducing peak memory consumption and update latency on real-world QA datasets. Why it matters: As agents and LLM applications deal with increasingly long documents, turning context into compact adapters on the fly could drastically reduce serving costs and enable rapid knowledge updates. Paper: https://t.co/Fh1IeLrSpm Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
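The mechanics of "document becomes adapter weights" can be shown with a small pure-Python sketch. The hypernetwork here is a random stand-in for the trained one, and all names and dimensions are my assumptions; the point is only that the adapted weight is the base weight plus a low-rank delta produced in a single pass.

```python
import random

def hypernetwork(doc, d_model=4, rank=1, seed=0):
    # Stand-in for the trained hypernetwork: one "forward pass" maps a
    # document to low-rank LoRA factors A (d x r) and B (r x d). In the
    # paper this is learned; here it is random for illustration.
    rng = random.Random(hash((doc, seed)) & 0xFFFFFFFF)
    A = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_model)]
    B = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(rank)]
    return A, B

def apply_lora(W, A, B):
    # Adapted weight W' = W + A @ B: the document's content now lives in
    # the weight delta instead of the context window.
    d, r = len(W), len(B)
    return [
        [W[i][j] + sum(A[i][k] * B[k][j] for k in range(r)) for j in range(d)]
        for i in range(d)
    ]

d = 4
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
A, B = hypernetwork("a long report the agent must remember", d_model=d)
W_adapted = apply_lora(W, A, B)
```

Because the delta has rank at most r, the adapter is tiny relative to the full weight matrix, which is what makes per-document adapters cheap to store and swap.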
New research from NVIDIA. Long-running agentic tasks like deep research require multi-hop reasoning over many documents. One of the biggest challenges with agents is that context grows rapidly, and KV cache memory usage becomes the bottleneck. As agents take on longer tasks, memory management can't rely on static heuristics. Letting the model manage its own context is both more effective and more adaptive. Existing cache compression techniques use fixed heuristics to decide what to keep. But in agentic reasoning, a token that seems unimportant early on may become critical ten turns later. This new NVIDIA research paper introduces SideQuest, a framework where the reasoning model itself manages its own KV cache. The model reasons about which tokens are still useful and clears the rest, essentially performing its own memory garbage collection. This management runs as an auxiliary task in parallel with the main reasoning thread, so the management tokens never pollute the primary context. That's important. Trained with just 215 samples, SideQuest reduces peak token usage by up to 65% on agentic tasks with minimal accuracy loss, outperforming all heuristic-based compression techniques. Paper: https://t.co/n3P6UjtLJ7 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
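The "memory garbage collection" idea reduces to a simple contract between the reasoning model and the cache manager. The sketch below assumes a list-of-entries cache and a set of keep decisions; both the data shapes and the example contents are mine, not SideQuest's actual implementation.

```python
def garbage_collect(cache, keep_ids):
    # The reasoning model (not shown) decides, in a parallel auxiliary
    # thread, which cache entries are still useful; the manager evicts
    # the rest. Management tokens never enter the primary context.
    return [entry for entry in cache if entry["id"] in keep_ids]

cache = [
    {"id": 0, "text": "user goal: summarize the quarterly report"},
    {"id": 1, "text": "tool output: 40 kB of raw HTML"},
    {"id": 2, "text": "extracted key figures from the HTML"},
    {"id": 3, "text": "tool output: retry, duplicate of id 1"},
]
# Suppose the model judges the raw and duplicate tool outputs spent.
compacted = garbage_collect(cache, keep_ids={0, 2})
```

The contrast with heuristic compression is that the keep/evict decision is made by the model with full knowledge of the task, so a token that looked unimportant early can still be retained for later turns.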

Watch the free recording of part 1 here: https://t.co/mSp0qR2QoR
AGENTS dot md files don't scale beyond modest codebases. Lots of discussions on this lately. If you're building serious software with Claude Code or any agentic tool, a single AGENTS dot md will eventually fail you. This paper shows what comes next. A 1,000-line prototype can be fully described in a single prompt. A 100,000-line system cannot. The AI must be told, repeatedly and reliably, how the project works, what patterns to follow, and what mistakes to avoid. Single-file manifests hit a ceiling fast. This new paper, Codified Context, documents a three-tier infrastructure built during real development of a 108,000-line C# distributed system across 283 sessions over 70 days. The system uses a three-tier memory architecture: a hot-memory constitution (660 lines, always loaded), 19 specialized domain-expert agents (9,300 lines total) invoked per task, and a cold-memory knowledge base of 34 specification documents (~16,250 lines) queried on demand via an MCP retrieval server. Across 283 sessions, this produced 2,801 human prompts, 1,197 agent invocations, and 16,522 autonomous agent turns, roughly 6 autonomous turns per human prompt, with a knowledge-to-code ratio of 24.2%. Crucially, none of it was designed upfront: each new agent and specification emerged from a real failure, a recurring bug, an architectural mistake, a convention forgotten, and was codified so it could never require re-explanation again, turning documentation into load-bearing infrastructure that agents depend on as memory, not reference. Paper: https://t.co/ZXBzhhkzsq Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
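The three-tier assembly described above can be sketched as a simple context builder. Everything here is a hypothetical illustration of the tiering, not the paper's system: the keyword routing and word-overlap retriever stand in for the orchestrating agent and the MCP retrieval server.

```python
def assemble_context(task, constitution, domain_agents, knowledge_base, retrieve):
    # Tier 1: hot memory, the constitution is always loaded.
    context = [constitution]
    # Tier 2: specialist domain agents, invoked per task (keyword match
    # here; the real system routes these via the orchestrating agent).
    for topic, instructions in domain_agents.items():
        if topic in task:
            context.append(instructions)
    # Tier 3: cold memory, specification docs pulled on demand.
    context.extend(retrieve(task, knowledge_base))
    return "\n\n".join(context)

def naive_retrieve(task, kb, k=1):
    # Hypothetical retriever: rank specs by word overlap with the task.
    words = set(task.split())
    ranked = sorted(kb, key=lambda doc: -len(words & set(doc.split())))
    return ranked[:k]

ctx = assemble_context(
    task="fix the serialization bug in the messaging layer",
    constitution="Always run tests before committing.",
    domain_agents={"serialization": "Use the custom binary codec, never JSON."},
    knowledge_base=["spec: messaging layer retry semantics", "spec: UI theming"],
    retrieve=naive_retrieve,
)
```

The key property is that only the constitution is unconditionally in context; the other ~25,000 lines of agents and specs are loaded selectively, which is what keeps a single-file manifest from becoming the bottleneck.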
New research on agent memory. Agent memory is evaluated on chatbot-style dialogues. But real agents don't chat. They interact with databases, code executors, and web interfaces, generating machine-readable trajectories, not conversational text. The key to better memory is to preserve causal dependencies. Existing memory benchmarks don't actually measure what matters for agentic applications. This new research introduces AMA-Bench, the first benchmark built for evaluating long-horizon memory in real agentic tasks. It spans six domains including web, text-to-SQL, software engineering, gaming, and embodied AI, with both real-world trajectories and synthetic ones that scale to arbitrary lengths. The findings are interesting. Many existing agent memory systems that outperform baselines on dialogue benchmarks actually underperform simple long-context LLMs on agentic tasks. Even GPT 5.2 only achieves 72.26% accuracy. To address this, they propose AMA-Agent with a causality graph and tool-augmented retrieval, achieving 57.22% average accuracy and surpassing the strongest baselines by 11.16%. Why does it matter? Agent memory needs to preserve causal dependencies and objective information, not just similarity-based retrieval. This benchmark exposes where current memory systems actually break. Paper: https://t.co/GX0GaHsijN Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

First empirical study on how developers are actually writing AI context files across open-source projects. Researchers scanned 10,000 repositories and found only 466 (5%) have adopted AI configuration files like AGENTS dot md, CLAUDE dot md, or Copilot instructions. Why does it matter? Of the 155 AGENTS dot md files analyzed, 50% were never modified after the initial commit. Only 6% had 10 or more revisions. The most common content in these files were conventions, contribution guidelines, and architecture overviews. But there is no standard structure, wide variation in what teams encode, and most files are written once and left to decay. The conventions for this new form of documentation are still in flux. Paper: https://t.co/YkSayPUesC Learn to build effective AI agents in our academy: https://t.co/U0ZuNA084v
How can we enable zero-shot generalization to unseen scenarios for robot world models? Thrilled to share DreamDojo, an interactive robot world model pretrained on 44K hours of human egocentric videos, the largest and most diverse dataset to date for robot world model learning. Our model not only excels in generalization, but also supports real-time interaction at 10 FPS after distillation. It enables several important applications, including live teleoperation, policy evaluation, and model-based planning at test time. Project: https://t.co/hJIEiGXnKz Paper: https://t.co/oa5xr8Y2GH Code & models & datasets: https://t.co/A8B4ii0Kah #WorldModels #Robotics #EmbodiedAI #RL #AI #NVIDIA Sharing more details in the thread
SONIC is now open-source! Generalist whole-body teleoperation for EVERYONE! Our team has long been building comprehensive pipelines for whole-body control, kinematic planning, and teleoperation, and they will all be shared. This will be a continuous update: inference code and models are already there; training code and gr00t integration are coming soon! Code: https://t.co/7u3SBxzXU9 Docs: https://t.co/HpDLkTCSMF Site: https://t.co/D3i4KlnLLr
Website: https://t.co/xTaDXBu9cD Codebase and weights: https://t.co/QCQkqPIsHI Whitepaper: https://t.co/K2QCFjboDR Check out @zhengyiluo's post: https://t.co/hIHtvKkDQf
