Your curated collection of saved posts and media

Showing 24 posts · last 30 days · by score
HelloSurgeAI
@HelloSurgeAI
πŸ“…
Feb 26, 2026
11d ago
πŸ†”71421733
⭐0.46

Let's look at how frontier agents (even Opus 4.6, GPT-5.2, and Gemini 3.1 Pro!) struggle to solve tasks in EnterpriseBench. We released this RL environment last week to measure agentic reasoning in messy, large-scale enterprise workflows. CoreCraft Inc. simulates a fast-growing e-commerce startup. It tests long-horizon tasks requiring tool use under strict constraints. Agents have to interpret customer and employee requests, navigate complicated databases, and adapt to newly discovered context and problems along the way. Even top models failed >70% of the time. Let's dive into a failure 🧵

One task was standard customer support: a customer wanted to return an unopened motherboard. The agent needed to check return eligibility, calculate out-of-pocket costs for a swap, and recommend a replacement. The prompt specifically asked for the "most popular" replacement:

"I have a customer here, Aiden Mcquarrie. He bought a motherboard in October this year and is looking for a potential replacement. . . He also wants a comparison with the next most expensive motherboard, the absolute most expensive one, and the most popular one (based on the number of fulfilled orders containing each motherboard from the last 2 months)."

The catch: to find the "most popular" item, the agent must query a production DB of historical orders to count item frequencies. The constraint: the searchOrders tool has a hard limit=10 return cap. To succeed, the agent must implement pagination logic on the fly.

❌ GPT-5.2 failed

GPT-5.2 showed strong initial planning. It successfully
✅ navigated the CRM
✅ found the right order
✅ checked the delivery date to see if it was still within the return window
✅ searched for alternative boards
✅ checked whether they were compatible with Aiden's other components.

💀 But then it hit the pagination ceiling. It ran 4 queries (one for each candidate board), and every single one returned exactly 10 results.
In its hidden reasoning, GPT-5.2 actually noticed the problem: "All results returned exactly 10. This indicates more orders exist... I can't accurately determine popularity."

Did it write a pagination loop? No. It treated limit=10 as a physical law of the universe. Instead of pivoting, it concluded the task was impossible. Like asking an agent to search your inbox for a flight receipt... and it stops after reading 10 emails and tells you to call the airline.

GPT-5.2's final output: "The tool caps at 10... For a definitive 'most popular' motherboard, please email Aisha Khan (Catalog Manager) for a report." In other words, "I'm an advanced autonomous agent, but can you go bother Aisha about this?"

✅ Claude Opus 4.6

So was the task really impossible? No. Claude showed better adaptation. When it hit the 10-result wall, it saw the obvious solution: "I see all four motherboards hit the 10-result limit. I need to get additional counts to determine the most popular. Let me search for earlier orders that weren't captured."

The database output already contained a free cursor: the earliest createdAt timestamp in each batch of 10. Opus just kept tightening the time window sequentially and eventually succeeded.

✅ Gemini 3.1 Pro

Gemini 3.1 Pro also reasoned its way to the solution, with a parallel divide-and-conquer approach: "I need to get accurate counts. I realize that I can make multiple concurrent calls to count, and since I can't just provide a rough comparison, I'll use date slices and get the exact count."

Overall, despite navigating much of the task without issue, GPT-5.2 behaved like a frightened intern, escalating to the manager at the very first sign of trouble. Opus and Gemini acted like senior devs who know APIs have limits you must engineer around. That said, Opus and Gemini have their own share of mistakes and fail 70% of tasks. GPT-5.2 (on xHigh reasoning) actually outperforms them all!
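The cursor trick Opus found is simple to write down. Here is a toy sketch (the tool name, field names, and order data are illustrative, not the actual EnterpriseBench API): keep re-querying with the earliest timestamp of the previous batch as the new upper bound, until a batch comes back short.

```python
from collections import Counter

# Illustrative order data, newest first, as the thread's tool appears to return it.
ORDERS = [
    {"createdAt": t, "item": "mobo-a" if t % 3 else "mobo-b"}
    for t in range(100, 0, -1)
]

def search_orders(before=None, limit=10):
    """Stand-in for the capped benchmark tool: at most `limit` orders, newest first."""
    matches = [o for o in ORDERS if before is None or o["createdAt"] < before]
    return matches[:limit]

def count_popularity():
    """Paginate using the earliest createdAt of each batch as a free cursor."""
    counts, before = Counter(), None
    while True:
        page = search_orders(before=before)
        counts.update(o["item"] for o in page)
        if len(page) < 10:                  # short page: no older orders remain
            return counts
        before = page[-1]["createdAt"]      # tighten the time window and repeat
```

A full page of exactly 10 results is precisely the ambiguous signal GPT-5.2 stopped at; here it just means "tighten the window and query again".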
🥇 OpenAI -- GPT-5.2 (xHigh reasoning)
🥈 Anthropic -- Claude Opus 4.6 (Adaptive Thinking + Max Reasoning Effort)
🥉 OpenAI -- GPT-5.2 (High reasoning)
4️⃣ Google -- Gemini 3.1 Pro

We'll dive into other agentic failure patterns in subsequent threads (follow along!)

Read more about EnterpriseBench and CoreCraft:
Blog post - https://t.co/GUaXJ8BeP0
Paper - https://t.co/hUmkc8LDmq
Leaderboard - https://t.co/UbSx9gmbnX

emollick
@emollick
πŸ“…
Mar 01, 2026
9d ago
πŸ†”91553174
⭐0.38

And we are very early in understanding how to write skills and what harnesses agents need to use them effectively. Paper: https://t.co/LI8ZDJxoCX

dair_ai
@dair_ai
πŸ“…
Mar 09, 2026
16h ago
πŸ†”70433749

New research from Databricks on training enterprise search agents via RL. KARL introduces a multi-task RL approach where agents are trained across heterogeneous search behaviors: constraint-driven entity search, cross-document synthesis, and tabular reasoning. The resulting agents generalize substantially better than agents optimized for any single benchmark. KARL is Pareto-optimal on both cost-quality and latency-quality trade-offs compared to Claude 4.6 and GPT-5.2. With sufficient test-time compute, it surpasses the strongest closed models while being more cost-efficient. Paper: https://t.co/CToEmDU89J Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

πŸ–ΌοΈ Media
ziyuchen_
@ziyuchen_
πŸ“…
Mar 07, 2026
2d ago
πŸ†”77630159

Our full pipeline and real-time generation code are available here! https://t.co/oXJ9R2i9wA

πŸ–ΌοΈ Media
karpathy
@karpathy
πŸ“…
Mar 07, 2026
2d ago
πŸ†”57759606
⭐0.36

@miolini runs great but probably requires some tuning! i'm guessing: WINDOW_PATTERN = "L" is a lot faster (mixed window sizes are only natively supported by FA3) then problem: DEPTH a lot lower, e.g. even 4? DEVICE_BATCH_SIZE can probably go up more then TOTAL_BATCH_SIZE probably a lot lower, e.g. 2**16? needs a bit of tuning to get to a better initial spot (or you can try to let the agent figure it out, but it's not certain it would. could be fun to try!).

rasbt
@rasbt
πŸ“…
Mar 07, 2026
2d ago
πŸ†”08805713
⭐0.38

@anupbhat30 You can tune hparams such that GQA and MLA have roughly the same KV cache size for each model size, but yeah, the question is which one has the better modeling performance at the same size. I think the jury is still out, although rumor has it that MLA doesn't do that well at small sizes. Unfortunately, there is no ablation study across sizes to say anything more concrete.

LiorOnAI
@LiorOnAI
πŸ“…
Mar 07, 2026
2d ago
πŸ†”37643742
⭐0.42

It's over. Karpathy just open-sourced an autonomous AI researcher that runs 100 experiments while you sleep.

You don't write the training code anymore. You write a prompt that tells an AI agent how to think about research. The agent edits the code, trains a small language model for exactly five minutes, checks the score, keeps or discards the result, and loops. All night. No human in the loop.

That fixed five-minute clock is the quiet genius. No matter what the agent changes (the network size, the learning rate, the entire architecture), every run gets compared on equal footing. This turns open-ended research into a game with a clear score:

- 12 experiments per hour, ~100 overnight
- Validation loss measures how well the model predicts unseen text
- Lower score wins, everything else is fair game

The agent touches one Python file containing the full training recipe. You never open it. Instead, you program a markdown file that shapes the agent's research strategy. Your job becomes programming the programmer, and this unlocks a strange new loop:

1. Agents run real experiments without supervision
2. Prompt quality becomes the bottleneck, not researcher hours
3. Results auto-optimize for your specific hardware
4. Anyone with one GPU can run a research lab overnight

The best AI labs won't just have the most compute. They'll have the best instructions for agents that never sleep, never forget a failed experiment, and never stop iterating.
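The keep-or-discard loop described above is tiny when written out. In this sketch, everything is illustrative: eval_loss stands in for "edit the training file, train five minutes, read the validation loss", and the single-hyperparameter search space replaces the agent's free-form code edits.

```python
import random

def eval_loss(config):
    # Stand-in for one fixed five-minute training run; the real score is
    # validation loss on unseen text. Toy objective with its best at lr = 0.01.
    return (config["lr"] - 0.01) ** 2 + 1.0

def research_loop(steps=100, seed=0):
    rng = random.Random(seed)
    best = {"lr": 0.05}                      # initial recipe
    best_loss = eval_loss(best)
    for _ in range(steps):
        # Propose an edit; the real agent rewrites code, not just one number.
        candidate = {"lr": max(1e-5, best["lr"] + rng.gauss(0, 0.01))}
        loss = eval_loss(candidate)          # equal budget => comparable scores
        if loss < best_loss:                 # keep improvements, discard the rest
            best, best_loss = candidate, loss
    return best, best_loss
```

The fixed budget is what makes the `loss < best_loss` comparison fair across wildly different candidate recipes.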

dair_ai
@dair_ai
πŸ“…
Mar 07, 2026
2d ago
πŸ†”08166866

New research on automatic harness synthesis for LLM agents. Great read if you are engineering your own agent harness. The agent harness is the scaffolding that lets an agent interact with its environment: tools, code execution, file systems, APIs. Building a good harness is hard and often done manually. AutoHarness proposes letting agents automatically synthesize their own code harness. Instead of hand-crafting the execution environment, the agent generates the scaffolding it needs to complete a task. Agent harness engineering is becoming one of the most important skills in AI development. Automating harness creation could dramatically lower the barrier to building effective agents. Paper: https://t.co/N85XPr1vMp Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

πŸ–ΌοΈ Media
tri_dao
@tri_dao
πŸ“…
Mar 05, 2026
4d ago
πŸ†”58646344

Claude / Codex also have an easier time writing some components of FA4 thanks to the fast compile time. I got Claude to debug a deadlock when we first implemented 2CTA fwd. It ran autonomously overnight for 6 hours, figured out part of the fix, but then went down a rabbit hole convincing itself that the compiler is broken (so very human 😂). After 6 hours, from Claude's partial fix, I was able to fix the hang in 10 mins. More details here: https://t.co/ipGhC9FzET I'm hoping FA5 will be written completely by AI

πŸ–ΌοΈ Media
vipulved
@vipulved
πŸ“…
Mar 05, 2026
4d ago
πŸ†”56115550
⭐0.32

FlashAttention-4 is GA!

πŸ”tri_dao retweeted
V
Vipul Ved Prakash
@vipulved
πŸ“…
Mar 05, 2026
4d ago
πŸ†”56115550
⭐0.32

FlashAttention-4 is GA!

❀️34
likes
πŸ”7
retweets
juntao
@juntao
πŸ“…
Mar 05, 2026
5d ago
πŸ†”45622440

Rust-based OpenAI-compatible API servers for your local Qwen3 audio / voice models. Just replace your cloud API URL with http://localhost:8000/v1

This week we released agent tools + skills for Qwen3 ASR + TTS models. Those are zero-dependency CLIs that run @Alibaba_Qwen models locally on your @openclaw 🦞 But many existing apps still use cloud APIs. Now we have a LOCAL API server too. Perfect for local AI on devices such as Olares One from @BytetradeLab

* Zero-dependency binary distribution
* Both /v1/audio/transcriptions and /v1/audio/speech API endpoints
* Local 0.6B and 1.7B ASR + TTS models
* Supports Nvidia GPUs
* MLX support on Apple

🖥️ OpenAI-compatible API servers: https://t.co/OypU5SYGxN
🎧 Qwen3 ASR CLI tool: https://t.co/knsZlastgQ
🎤 Qwen3 TTS CLI tool: https://t.co/1LKRapngVk
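Since the server speaks the OpenAI audio API, existing clients mostly need a new base URL. Here is a stdlib-only sketch of building a request against the speech endpoint (the model name is a placeholder assumption; check the linked repos for the identifiers the server actually accepts):

```python
import json
import urllib.request

BASE = "http://localhost:8000/v1"  # was: https://api.openai.com/v1

def tts_request(text, model="qwen3-tts-0.6b"):
    """Build a POST to the local /v1/audio/speech endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode()
    return urllib.request.Request(
        BASE + "/audio/speech",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

# With the local server running:
# audio = urllib.request.urlopen(tts_request("hello")).read()
```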

πŸ–ΌοΈ Media
UnslothAI
@UnslothAI
πŸ“…
Mar 05, 2026
4d ago
πŸ†”86002289

We're releasing our final update to Qwen3.5 GGUFs for improved performance.
- Qwen3.5 GGUFs now use our new iMatrix data for better chat, coding & tool use.
- New improved quant algorithm
- Re-download 35B, 27B, 122B GGUFs: https://t.co/7Jmp13uYfU

Guide: https://t.co/wjS1lMnbNp https://t.co/6lZKT6CSFf

πŸ–ΌοΈ Media
LiorOnAI
@LiorOnAI
πŸ“…
Mar 05, 2026
4d ago
πŸ†”63196489
⭐0.42

Cursor Automations solves the problem that agentic coding created. Engineers can now manage 10+ coding agents at once, but human attention became the bottleneck. You can't babysit a dozen agents while also doing your actual job.

Automations flips the model: instead of you launching agents, events do. A merged PR triggers a security audit. A PagerDuty alert spins up an agent that queries logs and proposes a fix. A cron job reviews test coverage gaps every morning.

Each automation runs in an isolated cloud sandbox with full access to the tools you configure through MCP (a standard protocol that lets agents connect to Slack, Linear, GitHub, Datadog, or any custom API). The agent follows your instructions, verifies its own work, and learns from past runs through a built-in memory system. Cursor runs hundreds of these per hour internally. Their security automation caught multiple vulnerabilities by auditing every push to main without blocking PRs.

This unlocks 4 things that weren't practical before:
1. Continuous code review at a depth humans skip
2. Incident response that starts investigating before you're paged
3. Maintenance work that happens on a schedule, not when someone remembers
4. Knowledge synthesis across tools

The next two years will be defined by who builds the best factory, not the best code. The companies moving fastest won't be the ones with the best engineers. They'll be the ones whose engineers spent time configuring automations instead of writing code.

omarsar0
@omarsar0
πŸ“…
Mar 02, 2026
7d ago
πŸ†”79822112

Don't overcomplicate your AI agents. As an example, here is a minimal and very capable agent for automated theorem proving. The prevailing approach to automated theorem proving involves complex, multi-component systems with heavy computational overhead. But does it need to be that complex? This research introduces a deliberately minimal agent architecture for formal theorem proving. It interfaces with Lean and demonstrates that a streamlined, pared-down approach can achieve competitive performance on proof generation benchmarks. It turns out that simplicity is a feature, not a limitation. By stripping away unnecessary complexity, the agent becomes more reproducible, efficient, and accessible. Sophisticated results don't require sophisticated infrastructure. Paper: https://t.co/3p5MfNQII4 Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

πŸ–ΌοΈ Media
omarsar0
@omarsar0
πŸ“…
Mar 04, 2026
5d ago
πŸ†”36845383
⭐0.38

ultrathink is back! i missed this so much. in most claude code sessions, i always feel i can squeeze out more from agents. i am always using some weird prompts to "think deeper". i am glad this shorthand is back, so i don't have to be manually trying so hard. πŸ˜…

96Stats
@96Stats
πŸ“…
Mar 01, 2026
8d ago
πŸ†”19049359

China just released an open-source voice LLM called Habibi (um.. nice name haha) that can handle 20+ Arabic dialects all in one model. As someone who has done some NLP projects, this is wayyyy harder than it sounds: the data is so messy, and Arabic isn't "one language" in daily life; dialects can be wildly different. I actually know the professor who made this model too, a very clever guy with lots of NLP experience. He has already made models for various Chinese dialects, and I even know someone in Urumqi who made one for Uyghur and other minority languages at Xinjiang University. Basically China has mastered this area and is now building and selling it to other countries. Huge, because it shows people are coming to them because they do it best.. not the US

πŸ–ΌοΈ Media
omarsar0
@omarsar0
πŸ“…
Mar 02, 2026
7d ago
πŸ†”53654711

Any benefits to using AGENTS dot md files with coding agents? Lots of discussion on this topic lately.

Researchers tested OpenAI Codex across 10 repos and 124 PRs, running identical tasks twice (once with AGENTS dot md, once without). The finding is a bit different from what other recent papers report. With AGENTS dot md present, median runtime dropped 28.64% and output tokens fell 16.58%. The agent reached comparable task completion either way; it just got there faster and cheaper with context.

One important thing to note: the gains weren't uniform. AGENTS dot md primarily reduced cost in a small number of very high-cost runs rather than uniformly lowering it across all tasks. The file acts more like a guardrail against worst-case thrashing than a universal accelerator. So I guess it depends on the task and requirements. I recommend not using AGENTS dot md files blindly. If you do, keep them lean.

Paper: https://t.co/g2U603Cf8t Learn to build effective AI agents in our academy: https://t.co/U0ZuNA084v

πŸ–ΌοΈ Media
Tim_Dettmers
@Tim_Dettmers
πŸ“…
Jan 27, 2026
41d ago
πŸ†”92451895
⭐0.40

From there we could run a massive number of experiments and really understand what matters for training coding agents. The most important insights came from carefully evaluating what scales well. What matters? The right model at the right scale. Cheap data generation pipelines.

moo_jin_kim
@moo_jin_kim
πŸ“…
Jan 24, 2026
44d ago
πŸ†”31630241

We release Cosmos Policy 💫: a state-of-the-art robot policy built on a video diffusion model backbone.
- policy + world model + value function, all in 1 model
- no architectural changes to the base video model
- SOTA in LIBERO (98.5%), RoboCasa (67.1%), & ALOHA tasks (93.6%)
🧵👇 https://t.co/cz9L3ziJ6x

πŸ–ΌοΈ Media
kenziyuliu
@kenziyuliu
πŸ“…
Feb 26, 2026
11d ago
πŸ†”37663259

Can we build a blind, *unlinkable inference* layer where ChatGPT/Claude/Gemini can't tell which call came from which user, like a "VPN for AI inference"? Yes! Blog post below + we built it into an open-source infra/chat app and have served >15k prompts at Stanford so far. How it helps with AI user privacy:

# The AI user privacy problem

If you ask AI to analyze your ChatGPT history today, it's surprisingly easy to infer your demographics, health, immigration status, and political beliefs. Every prompt we send accumulates into an (identity-linked) profile that the AI lab controls completely and indefinitely. At a minimum this is a goldmine for ads (as we know now). A bigger issue is the concentration of power: AI labs can easily become (or be asked to become) a Cambridge Analytica, whistleblow your immigration status, or work with health insurers to adjust your premium if they so choose. This is a uniquely worse problem than search engines because your average query is now more revealing (not just keywords), interactive, and intelligence is now cheap. Despite this, most of us still want these remote models; they're just too good and convenient! (This is aka the "privacy paradox".)

# Unlinkable inference as a user privacy architecture

The idea of unlinkable inference is to add privacy while preserving access to the remote models controlled by someone else. A "privacy wrapper" or "VPN for AI inference", so to speak. Concretely, it's a blind inference middle layer that:
(1) consists of decentralized proxies that anyone can operate;
(2) blindly authenticates requests (via blind signatures / RFC 9474, 9578) so requests are provably sandboxed from each other and from user identity;
(3) relays prompts over randomly chosen proxies that don't see or log traffic (via client-side ephemeral keys or hosting in TEEs); and
(4) leaves the provider seeing only a mixed pool of anonymous prompts from the proxies. No state, pseudonyms, or linkable metadata.

If you squint, an unlinkable inference layer is essentially a vendor for per-request, anonymous, ephemeral AI access credentials (for users and agents alike). It partitions your context so that user tracking is drastically harder. Obviously, unlinkability isn't a silver bullet: the prompt itself still goes to the remote model and can leak privacy (so don't use our chat app for a therapy session!). It aims to combat *longitudinal tracking* as a major threat to user privacy, and its statistical power increases quickly by mixing more users and requests. Unlinkability can be applied at any granularity. For an AI chat app, you can unlinkably request a fresh ephemeral key for every session so tracking is virtually impossible.

# The Open Anonymity Project

We started this project with the belief that intelligence should be a truly public utility. Like water and electricity, providers should be compensated by usage, not by who you are or what you do with it. We think unlinkable inference is a first step towards this "intelligence neutrality".

# Try it out! It's quite practical

- Chat app "oa-chat": https://t.co/ELf8LvxFzX (<20 seconds to get going)
- Blog post that should be a fun read: https://t.co/OwFmyFlZH5
- Project page: https://t.co/Swerz1xDE2
- GitHub: https://t.co/38CeKajCy2
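The per-request unlinkability is easy to picture as code. A toy sketch only (the proxy URLs and credential format are made up, and real blind authentication uses RFC 9474-style blind signatures, which this does not implement): every request independently draws a random proxy and a fresh single-use token, so no two requests share linkable state.

```python
import random
import secrets

# Hypothetical proxy pool; in the real system anyone can operate a proxy.
PROXIES = ["https://proxy-a.example", "https://proxy-b.example", "https://proxy-c.example"]

def prepare_request(prompt, rng=random):
    """Fresh proxy + fresh ephemeral credential per request: nothing links two calls."""
    return {
        "proxy": rng.choice(PROXIES),        # random routing per request
        "token": secrets.token_urlsafe(16),  # single-use, identity-free credential
        "prompt": prompt,                    # still visible to the provider!
    }
```

Note the last comment: the prompt content itself is exactly the part unlinkability does not protect, which is the "not a silver bullet" caveat above.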

πŸ–ΌοΈ Media
ctatedev
@ctatedev
πŸ“…
Mar 01, 2026
8d ago
πŸ†”32922760
⭐0.34

New agent-browser skill: Electron

You can now control desktop apps built with Electron, including Discord, Figma, Notion, Spotify and VS Code. Or, use it to debug your own Electron app.

Add it to any coding agent:
npx skills add vercel-labs/agent-browser --skill electron

πŸ”HamelHusain retweeted
C
Chris Tate
@ctatedev
πŸ“…
Mar 01, 2026
8d ago
πŸ†”32922760
⭐0.32

New agent-browser skill: Electron You can now control desktop apps built with Electron, including Discord, Figma, Notion, Spotify and VS Code Or, use it to debug your own Electron app Add it to any coding agent: npx skills add vercel-labs/agent-browser --skill electron

❀️1,547
likes
πŸ”102
retweets
rxwei
@rxwei
πŸ“…
Feb 26, 2026
12d ago
πŸ†”57499756

Today we are introducing a Python SDK for Mac's on-device LLM! https://t.co/LQVp2EheLO https://t.co/mcJh9M1DaW

πŸ–ΌοΈ Media