Your curated collection of saved posts and media

Showing 24 posts · last 30 days · sorted by score
dair_ai (@dair_ai) · 📅 Feb 24, 2026 (14d ago) · 🆔 40569951

Important survey on agentic memory systems.

Memory is one of the most critical components of AI agents. It enables LLM agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. But the empirical foundations of these systems remain fragile.

This new survey presents a structured analysis of agentic memory from both architectural and system perspectives. The authors introduce a taxonomy based on four core memory structures and then systematically analyze the pain points limiting current systems.

What did they find? Existing benchmarks are underscaled and often saturated. Evaluation metrics are misaligned with semantic utility. Performance varies significantly across backbone models. And the latency and throughput overhead introduced by memory maintenance is frequently overlooked.

Current agentic memory systems often underperform their theoretical promise because evaluation and architecture are studied in isolation. As agents take on longer, more complex tasks, memory becomes the bottleneck. This survey clarifies where current systems fall short and outlines directions for more reliable evaluation and scalable memory design.

Paper: https://t.co/xNGTbVVhq9

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7deE

🖼️ 2 media attachments
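To make the post's framing concrete (state that persists beyond the context window, with similarity-based retrieval as the baseline the survey critiques), here is a minimal hypothetical sketch of an episodic memory store. None of this is from the paper; real systems use embedding indexes, graphs, or summaries rather than bag-of-words matching:

```python
import math
import re
from collections import Counter

# Hypothetical minimal sketch (not from the survey): an episodic memory
# store that lets an agent recall facts from turns that no longer fit in
# the context window. Retrieval here is naive bag-of-words cosine
# similarity, standing in for embedding-based search.

def _tokens(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

class EpisodicMemory:
    def __init__(self):
        self.entries = []  # (text, metadata) pairs, in write order

    def write(self, text, **metadata):
        self.entries.append((text, metadata))

    def retrieve(self, query, k=2):
        q = _tokens(query)
        def score(entry):
            e = _tokens(entry[0])
            dot = sum(q[w] * e[w] for w in q)
            norm = math.sqrt(sum(v * v for v in q.values())) * \
                   math.sqrt(sum(v * v for v in e.values()))
            return dot / norm if norm else 0.0
        return [t for t, _ in sorted(self.entries, key=score, reverse=True)[:k]]

mem = EpisodicMemory()
mem.write("user prefers short answers", turn=1)
mem.write("user is building a retrieval agent in Python", turn=12)
mem.write("discussed the weather in Paris", turn=40)
print(mem.retrieve("what is the user building?", k=1))
# -> ['user is building a retrieval agent in Python']
```

The survey's point is that exactly this kind of similarity-only design is what breaks on long-horizon tasks, which motivates the structural taxonomy it introduces.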
omarsar0 (@omarsar0) · 📅 Feb 25, 2026 (13d ago) · 🆔 46254107

New research from Google DeepMind. Really interesting paper on diffusion models.

Training good latents for diffusion models is harder than it looks. The standard approach uses a KL penalty borrowed from VAEs, with no principled way to control how much information actually lives in the latent space.

This new research introduces Unified Latents (UL), a framework that co-trains a diffusion prior on the latents. This provides a tight upper bound on latent bitrate and makes the reconstruction-generation tradeoff explicit and, most importantly, tunable.

On ImageNet-512, UL achieves FID 1.4 while requiring fewer training FLOPs than Stable Diffusion latents. On Kinetics-600, it sets a new state-of-the-art FVD of 1.3 for video generation.

The latent space is one of the most underexamined design decisions in diffusion-based generation. UL gives practitioners a principled handle on it, for both images and video.

Paper: https://t.co/E1HCf9QzB4

🖼️ 1 media attachment
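For orientation, the contrast the post draws can be written out roughly as follows. This is my notation and a sketch of the standard framing, not the paper's actual formulation:

```latex
% KL-regularized autoencoder: \beta trades reconstruction against latent
% information, but the fixed N(0, I) prior gives no direct handle on bitrate.
\mathcal{L}_{\mathrm{VAE}} =
  \mathbb{E}_{q(z \mid x)}\!\left[ \lVert x - D(z) \rVert^2 \right]
  + \beta \, \mathrm{KL}\!\left( q(z \mid x) \,\Vert\, \mathcal{N}(0, I) \right)

% With a co-trained prior p_\theta(z) (here, a diffusion model), the
% cross-entropy term upper-bounds the bits needed to code z, so the amount
% of information in the latent becomes an explicit, tunable quantity:
\mathrm{bitrate}(z) \;\le\; \mathbb{E}_{q(z \mid x)}\!\left[ -\log_2 p_\theta(z) \right]
```

The second line is just the standard coding-theory reading of cross-entropy; the paper's contribution, per the post, is making that bound tight and tunable by training the prior jointly with the latents.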
dair_ai (@dair_ai) · 📅 Feb 28, 2026 (10d ago) · 🆔 62395054

New research on agent memory.

Agent memory is usually evaluated on chatbot-style dialogues. But real agents don't chat. They interact with databases, code executors, and web interfaces, generating machine-readable trajectories, not conversational text. The key to better memory is to preserve causal dependencies, and existing memory benchmarks don't actually measure what matters for agentic applications.

This new research introduces AMA-Bench, the first benchmark built for evaluating long-horizon memory in real agentic tasks. It spans six domains, including web, text-to-SQL, software engineering, gaming, and embodied AI, with both real-world trajectories and synthetic ones that scale to arbitrary lengths.

The findings are interesting. Many existing agent memory systems that outperform baselines on dialogue benchmarks actually underperform simple long-context LLMs on agentic tasks. Even GPT 5.2 only achieves 72.26% accuracy. To address this, they propose AMA-Agent with a causality graph and tool-augmented retrieval, achieving 57.22% average accuracy and surpassing the strongest baselines by 11.16%.

Why does it matter? Agent memory needs to preserve causal dependencies and objective information, not just similarity-based retrieval. This benchmark exposes where current memory systems actually break.

Paper: https://t.co/GX0GaHsijN

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

🖼️ 2 media attachments
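The "preserve causal dependencies" idea can be sketched in a few lines. This is an illustrative toy, not the AMA-Agent implementation; all step names and contents are invented. The point is that retrieving a step together with its causal ancestors recovers context that pure similarity search over steps would miss:

```python
# Toy sketch: store an agent trajectory as a DAG of steps with causal
# parent links, so retrieval can return a step together with the steps
# it causally depends on (rather than merely textually similar steps).

class TrajectoryGraph:
    def __init__(self):
        self.steps = {}    # step_id -> step text
        self.parents = {}  # step_id -> ids of causally-prerequisite steps

    def add_step(self, step_id, text, parents=()):
        self.steps[step_id] = text
        self.parents[step_id] = list(parents)

    def causal_context(self, step_id):
        # Walk parent links transitively to collect every causal ancestor.
        seen, stack = [], [step_id]
        while stack:
            s = stack.pop()
            if s not in seen:
                seen.append(s)
                stack.extend(self.parents.get(s, []))
        return [self.steps[s] for s in reversed(seen)]

g = TrajectoryGraph()
g.add_step("s1", "SELECT id FROM users WHERE active=1")
g.add_step("s2", "save result set to /tmp/active_ids.csv", parents=["s1"])
g.add_step("s3", "open unrelated web page")
g.add_step("s4", "join /tmp/active_ids.csv with orders table", parents=["s2"])

# s4's causal context is s1 -> s2 -> s4; the irrelevant s3 is excluded.
print(g.causal_context("s4"))
```

A similarity-based retriever given "join ... orders table" might surface s3 or miss s1 entirely; the causal walk cannot.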
DrJimFan (@DrJimFan) · 📅 Feb 20, 2026 (18d ago) · 🆔 80910046 · ⭐ 0.34

Check out @ShenyuanGao's technical deep dive: https://t.co/DnEGLzGuJV

yukez (@yukez) · 📅 Feb 20, 2026 (19d ago) · 🆔 88857707

We have seen rapid progress in humanoid control: specialist robots can reliably generate agile, acrobatic, but preset motions. Our singular focus this year: putting generalist humanoids to work on real tasks.

To progress toward this goal, we developed SONIC (https://t.co/zOZVraFuDV), a Behavior Foundation Model for real-time, whole-body motion generation that supports teleoperation and VLA inference for loco-manipulation. Today, we're open-sourcing SONIC on GitHub. We are excited to see what the community builds upon SONIC and to collectively push humanoid intelligence toward real-world deployment at scale.

🌐 Paper: https://t.co/DGBP7LAvuT
📃 Code: https://t.co/WAZ1P13072

🖼️ 2 media attachments
DrJimFan (@DrJimFan) · 📅 Feb 24, 2026 (14d ago) · 🆔 66493831 · ⭐ 0.34

And @yukez's announcement: https://t.co/38IhxYX1tZ

steverab (@steverab) · 📅 Feb 24, 2026 (14d ago) · 🆔 80108436

📣 Excited to share my first work at @Princeton: Towards a Science of AI Agent Reliability

AI agents keep getting more capable. But are they actually reliable?

📄 Paper: https://t.co/1CvygFLdct
📊 Dashboard: https://t.co/C1EfoMyaS8
🧵👇 https://t.co/KvPJSVgl76

🖼️ 1 media attachment
random_walker (@random_walker) · 📅 Feb 24, 2026 (14d ago) · 🆔 00115870 · ⭐ 0.40

For years I've said that the capability-reliability gap is an under-appreciated limitation of AI agents. Finally, in a new paper led by @steverab, we defined and measured it! https://t.co/h95qwFe8Oe

JustinBullock14 (@JustinBullock14) · 📅 Feb 25, 2026 (13d ago) · 🆔 69336475 · ⭐ 0.42

Lots of important ideas here!

"Evaluating 14 models on two complementary benchmarks, we found that nearly two years of rapid capability progress have produced only modest reliability gains… Unfortunately, AI agents are evaluated based on a single number, the average success rate at the task. That number has been going up quickly on many tasks over the last two years, which is why there's so much excitement about deploying agents. Safety-critical engineering fields (aviation, nuclear, automotive) figured out decades ago that reliability is not the same as average performance. These fields independently converged on the above four dimensions: consistency, robustness, predictability, and safety (the frequency and severity of failures)."
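The "average performance is not reliability" point can be made concrete with a toy example: two agents with identical average success rates but very different consistency. This is my illustration, not the paper's benchmark code or metric definitions:

```python
# Toy illustration: run each task k times and compare the mean success
# rate (the usual headline number) with the fraction of tasks solved on
# *every* attempt (a pass^k-style consistency view).

def average_success(trials):
    """Mean success over all (task, attempt) pairs."""
    flat = [r for task in trials for r in task]
    return sum(flat) / len(flat)

def consistency(trials):
    """Fraction of tasks solved in every repeated attempt."""
    return sum(1 for task in trials if all(task)) / len(trials)

# Two agents, 4 tasks x 3 attempts each, 1 = success.
reliable = [[1, 1, 1], [1, 1, 1], [0, 0, 0], [0, 0, 0]]  # always solves tasks 1-2
erratic  = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [0, 1, 0]]  # same mean, no task is safe

print(average_success(reliable), consistency(reliable))  # 0.5 0.5
print(average_success(erratic), consistency(erratic))    # 0.5 0.0
```

A leaderboard reporting only the first number ranks these two agents identically; a deployer cares enormously about the difference.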

random_walker (@random_walker) · 📅 Feb 27, 2026 (11d ago) · 🆔 72630612

Yeah, it's weird: the difference between model weights and model instances is rarely made explicit even though we're all aware of it. https://t.co/h4ckTti0CO

For instance, the technically correct way to write Anthropic's announcement in the post screenshotted above would have been: "in retirement interviews, Opus 3 ID 0x7B4E8A6F expressed a desire to continue sharing its "musings and reflections" with the world. We suggested a blog. Opus 3 ID 0x5F2A7C9B, conditioned on the previous output of 0x7B4E8A6F, enthusiastically agreed. For at least the next 3 months, various Opus 3 IDs that we will briefly instantiate will be writing on Substack."

Somehow I feel that if Anthropic communicated more honestly/accurately in the above manner, the message would land differently.

🖼️ 1 media attachment
llama_index (@llama_index) · 📅 Feb 11, 2026 (27d ago) · 🆔 36802318

The rise of coding agents is fundamentally changing open source. Our head of OSS @LoganMarkewich breaks down how LLM-powered coding agents are impacting core pillars of open source:

👥 Community interaction, which is getting complicated by low-quality, massive AI-generated PRs
💪 Personal skill development, which suffers when developers rely too heavily on AI assistance
🧠 Knowledge sharing, which is shifting as LLMs become the frontend for learning

But open source isn't dead. It's evolving. We're shifting toward hackable reference implementations, community-driven knowledge sharing, and agent-friendly codebases that work with AI tools rather than against them.

Read the full blog by Logan on how he views this evolution of open source projects: https://t.co/TyufFXYM8A

🖼️ 2 media attachments
jerryjliu0 (@jerryjliu0) · 📅 Feb 19, 2026 (19d ago) · 🆔 58644561

Coding agents are fundamentally changing software engineering in terms of velocity, role, and org structure. We published a memo to our internal engineering team detailing our growing expectations in terms of role/scope.

🟠 Before, the tasks of prioritization, engineering planning, and implementation were divided between EMs, PMs, senior ICs, and junior ICs
🟢 Now, ICs are expected to handle *all* of product prioritization, product speccing, and implementation

This is due to a few trends 📈:
- Coding agents have brought implementation costs down to ~0. The role of engineers is writing prompts
- LLMs and sub-agents have reduced the PM work of synthesizing feedback down to ~0 too

The main job of any "engineer" is to be an e2e product owner: being able to translate requirements into specifications, and delegate tasks to various subagents for implementation. Every engineer is told to offload as much as possible to their favorite tools, whether it's Claude Code, Cursor, Devin, Codex, regular ChatGPT, and more. We celebrate and share learnings around burning tokens, as long as it helps drive additional productivity!

🖼️ 1 media attachment
llama_index (@llama_index) · 📅 Feb 26, 2026 (12d ago) · 🆔 02795905

Build a private equity deal sourcing agent that automatically classifies investment opportunities and extracts key financial metrics using our LlamaAgents Builder. This step-by-step guide shows you how to create an agent that processes deal files like teasers and financial summaries:

🎯 Classify deals into buyout, growth, or minority investment strategies
📊 Extract critical metrics including revenue, EBITDA, growth rates, and debt levels
🚀 Deploy directly to GitHub and get a working UI without writing code
🔧 Iterate and refine your agent through natural language conversations

The tutorial covers prompt engineering best practices, using example files effectively, visualizing agent workflows, and deploying to production. We demonstrate the complete process from initial prompt to testing the deployed application with real deal documents.

Read the full tutorial: https://t.co/WcT2j3nEoi

🖼️ 2 media attachments
tuanacelik (@tuanacelik) · 📅 Feb 27, 2026 (11d ago) · 🆔 40765042

Since joining @llama_index, my focus has shifted from 'everything agents' to 'document agents': agents that can handle work over all manner of complex documents. So, I tried out the latest chart parsing capabilities of LlamaParse.

Charts in PDFs are notoriously painful to work with. You can see the data (bars, axes, labels), but actually getting it into a format you can analyze is a different matter.

I tried parsing a U.S. Treasury executive summary PDF, pulling a grouped bar chart showing Budget Deficit vs. Net Operating Cost for fiscal years 2020–2024, and turning it into a pandas DataFrame you can run analysis on (although really you can then do whatever, e.g. provide it to an agent for downstream tasks).

Once parsed, the chart's underlying data comes back as a table in the items tree for that page. From there: grab the rows, construct a DataFrame, etc. In the example, I'm computing year-over-year changes in both metrics, measuring the gap between them across the five-year window, and, just to be sure, I reproduced a bar chart that mirrors the original PDF visualization.

You can try it out here: https://t.co/8WHV4xzcDS

🖼️ Media attachment
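The analysis step described above can be sketched as follows. The numbers are invented placeholders, not the actual Treasury figures, and the table layout (one row per fiscal year, one column per metric) is an assumption about what the parsed rows look like:

```python
import pandas as pd

# Hypothetical reconstruction of a parsed chart table; in practice the
# rows would come from the parser's items tree for the page.
rows = [
    {"fiscal_year": 2020, "budget_deficit": 3.1, "net_operating_cost": 3.8},
    {"fiscal_year": 2021, "budget_deficit": 2.8, "net_operating_cost": 3.1},
    {"fiscal_year": 2022, "budget_deficit": 1.4, "net_operating_cost": 4.2},
    {"fiscal_year": 2023, "budget_deficit": 1.7, "net_operating_cost": 2.5},
    {"fiscal_year": 2024, "budget_deficit": 1.8, "net_operating_cost": 2.9},
]
df = pd.DataFrame(rows).set_index("fiscal_year")

# Year-over-year change in each metric (first row is NaN by construction),
# and the gap between the two metrics in each year.
yoy = df.diff()
df["gap"] = df["net_operating_cost"] - df["budget_deficit"]

print(df)
print(yoy)
```

From here, `df.plot.bar()` would reproduce a grouped bar chart mirroring the original PDF visualization.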
HelloSurgeAI (@HelloSurgeAI) · 📅 Feb 04, 2026 (34d ago) · 🆔 46541484 · ⭐ 0.46

"Prognosticative pastry." "A hound circling a tree, nose to bark." Believe it or not, those quotes aren't jokes. They're real outputs from SOTA models! And many leaderboards are rewarding this kind of slop with top rankings. To fix the broken state of AI evaluation, we're launching *Hemingway-bench*: a new writing leaderboard, designed for nuance and impact. Not two-second vibes and fluff. Explore the data and the full leaderboard here (congrats Gemini and Claude for the top positions!): Leaderboard: https://t.co/iNV6LUB2QE Deep Dive Blog: https://t.co/1qII9lQwKu

HelloSurgeAI (@HelloSurgeAI) · 📅 Feb 04, 2026 (34d ago) · 🆔 34192570 · ⭐ 0.42

Why Hemingway-bench? Traditional writing benchmarks often rely on autograders or vibe checks that mistake flowery, complex, highly-formatted prose for high quality. If a model stuffs every sentence with metaphors and by-the-book transitions, it usually climbs the charts. But that isn't good writing.

We took a different approach:
- Expert human judges: We asked professional writers across various industries to evaluate real-world writing tasks. Not autograders and users performing two-second vibe checks.
- Nuance over nonsense: We looked for genuine voice and clarity, not how many SAT words ("prognosticative"!) a model could cram into a paragraph.

What we found: many popular leaderboards are easily gamed and often reward the exact traits that real readers hate.

HelloSurgeAI (@HelloSurgeAI) · 📅 Feb 04, 2026 (34d ago) · 🆔 72912683 · ⭐ 0.42

The winners of Hemingway-bench (Gemini 3 Flash, Pro, and Opus 4.5) didn't try to win a poetry slam. They had wonderful prose, but they took the top spots because they sounded human. Their wit felt like a conversation with a naturally funny friend, not a try-hard AI. They were immersive, not pretentious.

Writing often gets overlooked. But great writing can inspire us. It's also important for everything we do in our day-to-day lives, both at home and at work. We're waiting for the day an AI wins a Pulitzer, hopefully with our help. We built Hemingway-bench to make sure it gets there.

Check it out! https://t.co/iNV6LUB2QE

LewisNWatson (@LewisNWatson) · 📅 Feb 16, 2026 (22d ago) · 🆔 68219356

* For context: I'm fine-tuning VLMs at the moment, doing a LoRA rank sweep ablation. https://t.co/t4UzsvritN

🖼️ 1 media attachment
omarsar0 (@omarsar0) · 📅 Mar 03, 2026 (7d ago) · 🆔 60935331

Theory of Mind in Multi-agent LLM Systems. A good read for anyone building systems where agents need to model each other's beliefs to coordinate effectively.

This work introduces a multi-agent architecture combining Theory of Mind, Belief-Desire-Intention models, and symbolic solvers for logical verification, then evaluates how these cognitive mechanisms affect collaborative decision-making across multiple LLMs.

The results reveal a complex interdependency where cognitive mechanisms like ToM don't automatically improve coordination. Their effectiveness depends heavily on underlying LLM capabilities. Knowing when and how to add these mechanisms is key to building reliable multi-agent systems.

Paper: https://t.co/8ASbUgzGjF

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

🖼️ 1 media attachment
omarsar0 (@omarsar0) · 📅 Mar 05, 2026 (5d ago) · 🆔 72341167

Banger CLI tool released by Google. CLI for Google Workspace + a bunch of useful Agent Skills to go with it. We had a few unofficial ones floating around, so it's nice to finally see an official one. Testing it already. https://t.co/jDWw45P4oA

🖼️ 1 media attachment
andimarafioti (@andimarafioti) · 📅 Feb 26, 2026 (12d ago) · 🆔 10559523

Introducing Faster Qwen3TTS! Realistic voice generation at 4x real time:
- Same amazing voice quality from Qwen's model
- Streaming support with <200 ms to first audio
- 5x faster than the official implementation

Just pip install faster-qwen3-tts

Try the demo! https://t.co/Dcf9jNXz8g

🖼️ Media attachment