Your curated collection of saved posts and media


Showing 32 posts · last 7 days · newest first


omarsar0

@omarsar0

📅

Jun 24, 2026

11d ago

🆔52527816

// Critique of the Agent Model // Finally, a paper that tries to define what an agent is and what agency consists of. Good read overall. (great bookmark) The word agent now covers everything from a for-loop with tool calls to speculative machine superintelligence. Eric Xing and colleagues ask where automation ends, and agency begins. Drawing on Descartes and on science-fiction portrayals of autonomous beings, they analyze agent architectures along five dimensions: goal, identity, decision-making, self-regulation, and learning. The argument is that genuine agency requires these structures to hold together in a specific way. Great paper overall, providing a vocabulary for arguing about what is and is not an agent. Paper: https://t.co/qFvMxWd5cq Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX


omarsar0

@omarsar0

📅

Jun 25, 2026

10d ago

🆔32000228

New research from Meta. Building synthetic training data has stayed a fixed pipeline that you hand-tune and then freeze. Autodata casts an AI agent as a data scientist that builds training and evaluation data, with an implementation called Agentic Self-Instruct that extends classic Self-Instruct with agentic planning and tool use. Think of it as meta-optimization, where the data scientist agent is itself trained to produce stronger data, so the pipeline keeps improving instead of staying static. Across computer science research, legal reasoning, and reasoning over mathematical objects, it beats classical synthetic-data methods, and meta-optimizing the agent delivers an even larger uplift. Paper: https://t.co/TgFN6EHZas Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX


omarsar0

@omarsar0

📅

Jun 25, 2026

10d ago

🆔27331120

Just had a great discussion on dynamic workflows. Rough notes:
- applies to a very small set of use cases
- think of it as a new paradigm of test-time compute (TTC)
- strong for hill-climbing research experiments
- careful planning leads to better results
- you can often get better results by just increasing the reasoning level
- /goal + /loop is a subset of dynamic workflows
- verifiers/judges are crucial to get good results
- combine/fuse different coding agents for even better results
- great for when you need different perspectives from agents (llm council)
- frontier models are not equipped for optimally generating harnesses on the fly
- newer models like Mythos are probably better trained to do more optimal agent orchestration
- benchmarks on TTC are lacking, but we need them to measure how effective dynamic workflows are
- meta-prompting dynamic workflows is a lot of fun; even opus 4.8 might surprise you
- dynamic workflows can be packaged as skills and optimized further
Longer post coming soon.


omarsar0

@omarsar0

📅

Jun 26, 2026

9d ago

🆔26856446

Great to see the new GPT-5.6 models finally announced. Sad to see this new release strategy where only a select few get access initially. Not a win for our industry IMO. Open-source AI must win! https://t.co/F44COphP8s

@OpenAI • Fri Jun 26 17:10

Introducing a limited preview of GPT-5.6 Sol, our next generation frontier model, as well as GPT-5.6 Terra, a balanced model for efficient, everyday work, and GPT-5.6 Luna, a fast and affordable model for high-volume work. https://t.co/OoM83SyISN


dair_ai

@dair_ai

📅

Jun 26, 2026

9d ago

🆔37947693

New paper on giving LLM agents experience that improves the weights and stays readable at the same time. Agent-experience methods split into two camps. Externalized natural-language rules stay interpretable but drift out of sync with the policy. Parameter updates generalize but make weak local corrections under sparse rewards. JERP runs both off one trajectory stream, retrieving task-relevant rules at decision time and, after each episode, optimizing the policy while revising the rule pool against reference successful trajectories. The conceptual payoff is the absorption dynamic. Stable, repeatedly useful behaviors get internalized into the weights over time, while the rule pool handles fresh local corrections. The interpretability-versus-generalization balance becomes a knob rather than an architecture choice. Why does it matter? Teams want agents that both improve and stay inspectable. This is a clean template for getting both from the same trajectories. Gains land on AlfWorld and WebShop. Paper: https://t.co/avjHvESdBQ Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c


omarsar0

@omarsar0

📅

Jun 26, 2026

9d ago

🆔39562946

Dynamic workflows (generating harnesses on the fly) are a new form of test-time compute. But LLMs aren't great at building them. I often have to steer agents to generate complex patterns. Curious how effective Mythos/GPT-5.6 is at dynamically generating complex workflows. https://t.co/hFhMWZJSua


dair_ai

@dair_ai

📅

Jun 27, 2026

8d ago

🆔06556266

NEW paper from NVIDIA. (bookmark it) Speed-of-light performance analysis tells you the theoretical floor of a workload, but teams still derive it by hand and freeze it. SOLAR automates the whole thing straight from PyTorch or JAX source. An LLM frontend translates arbitrary code into an executable Affine Loop IR, validated by output comparison, then a deterministic pass lifts it into an einsum graph, and an analytical backend computes the bounds. The model is confined to translation, so the actual bound math stays deterministic. Across KernelBench, Flax models, and robotics workloads, they report zero observed SOL violations. Paper: https://t.co/KXgsPxcSnY Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
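To make "speed-of-light bound" concrete, here is a minimal sketch of a roofline-style lower bound for a single matrix multiply, the kind of operation an einsum graph is built from. The post does not describe SOLAR's analytical backend, so this is not its method: the function name, the assumption that each operand is read or written exactly once, and the hardware peaks are all hypothetical.

```python
def matmul_sol_seconds(m, k, n, bytes_per_elem, peak_flops, peak_bandwidth):
    """Roofline-style lower bound (seconds) for C[m,n] = A[m,k] @ B[k,n].

    Assumes perfect overlap of compute and memory traffic, and that each
    operand touches memory exactly once. Both are idealizations.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bandwidth
    return max(compute_time, memory_time)


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s peak bandwidth.
PEAK_FLOPS = 1e14
PEAK_BW = 1e12

square = matmul_sol_seconds(1024, 1024, 1024, 2, PEAK_FLOPS, PEAK_BW)
skinny = matmul_sol_seconds(1, 1024, 1024, 2, PEAK_FLOPS, PEAK_BW)
print(f"square matmul floor: {square:.3e} s")
print(f"matrix-vector floor: {skinny:.3e} s")
```

With these made-up peaks the square case is limited by compute and the matrix-vector case by memory traffic. No measured runtime should fall below the floor, which is what a "zero SOL violations" check would test.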


omarsar0

@omarsar0

📅

Jun 27, 2026

8d ago

🆔32470001

If you use LLM-as-judge, this one is worth reading. (bookmark it) It's actually one of the most effective ways to use LLM-as-a-Judge for evals. Holistic judge scores hide both their reasoning and their ceiling effects. BINEVAL decomposes each evaluation criterion into atomic yes-or-no questions, answers each independently per output, then aggregates the verdicts into calibrated multi-dimensional scores. Every question-level verdict is inspectable, so you can diagnose exactly why an output scored low, and the same verdicts feed straight back as targeted prompt-improvement signal. Across SummEval, Topical-Chat, and QAGS, it matches or beats UniEval and G-Eval, training-free, with especially strong results on factual consistency. Paper: https://t.co/oar6BZcasm Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
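The decompose-then-aggregate idea can be sketched in a few lines. This is an illustration under stated assumptions, not BINEVAL's implementation: the post does not say how verdicts are calibrated, so the sketch scores each criterion as the plain fraction of "yes" answers, and the hard-coded verdicts stand in for independent judge calls. All names and questions are hypothetical.

```python
from collections import defaultdict


def aggregate_binary_verdicts(verdicts):
    """Map (criterion, question, answer) triples to per-criterion scores.

    The score is the fraction of questions answered "yes" (True). This simple
    mean is an assumption; a real system may weight or calibrate verdicts.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for criterion, _question, answer in verdicts:
        totals[criterion] += 1
        passed[criterion] += int(answer)
    return {c: passed[c] / totals[c] for c in totals}


def failed_questions(verdicts):
    """List the (criterion, question) pairs that explain a low score."""
    return [(c, q) for c, q, answer in verdicts if not answer]


# Hypothetical verdicts for one summary; each would come from a judge call.
verdicts = [
    ("coherence", "Does each sentence follow from the previous one?", True),
    ("coherence", "Is there a clear opening statement?", True),
    ("consistency", "Are all named entities present in the source?", True),
    ("consistency", "Are all numbers present in the source?", False),
    ("consistency", "Are all dates present in the source?", True),
    ("consistency", "Is every causal claim supported by the source?", True),
]

scores = aggregate_binary_verdicts(verdicts)
print(scores)
print(failed_questions(verdicts))
```

The list of failed questions is the inspectable part: it points at the specific check an output missed, which can then be fed back into the prompt.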


dair_ai

@dair_ai

📅

Jun 27, 2026

8d ago

🆔55158636

When does combining LLMs help? Great analysis on combining language models, measured across 67 models from 21 providers. Any policy that routes, votes, cascades, or runs a mixture of agents and then returns one model's answer is bounded above by 1 minus beta, where beta is the fraction of queries every candidate model gets wrong. The common justification for ensembling is diversity, usually measured as low pairwise error correlation. The paper proves that correlation cannot identify beta, so decorrelation does not establish that headroom exists. And across the 67 models, real co-failures are far more concentrated than independence-style assumptions predict. Before assuming a router or MoA setup will help, measure beta. Co-failures cluster on the answer format rather than the subject. Paper: https://t.co/PGO9YAoBzH Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
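Measuring beta takes only a per-query correctness table. The sketch below uses a made-up 3-model, 5-query matrix (not data from the paper) to compute beta and to check that even an oracle router, which always picks a correct model when one exists, lands exactly on the 1 minus beta ceiling.

```python
def co_failure_rate(correct):
    """Fraction of queries that every model gets wrong (beta).

    correct[m][q] is True if model m answered query q correctly.
    """
    n_queries = len(correct[0])
    all_wrong = sum(
        1 for q in range(n_queries) if not any(row[q] for row in correct)
    )
    return all_wrong / n_queries


def selection_accuracy(correct, choice):
    """Accuracy of a policy that returns model choice[q]'s answer on query q."""
    return sum(correct[choice[q]][q] for q in range(len(choice))) / len(choice)


# Hypothetical correctness matrix: 3 models x 5 queries.
correct = [
    [True, False, True, False, False],
    [True, True, False, False, False],
    [False, True, True, False, True],
]

beta = co_failure_rate(correct)
ceiling = 1 - beta

# Oracle router: pick any model that is correct, else fall back to model 0.
oracle_choice = [
    next((m for m in range(len(correct)) if correct[m][q]), 0)
    for q in range(len(correct[0]))
]
oracle_accuracy = selection_accuracy(correct, oracle_choice)

print(f"beta={beta}, ceiling={ceiling}, oracle={oracle_accuracy}")
```

Any real router, vote, or cascade over these three models can only match or fall below the oracle, so a large beta means little headroom regardless of how decorrelated the pairwise errors look.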


omarsar0

@omarsar0

📅

Jun 28, 2026

7d ago

🆔30160761

Fascinating paper on self-improving agents. (bookmark it) If you are working on agentic loops, you will quickly realize that they are only as good as the effectiveness of the evaluator. Self-improvement loops tend to stall the moment the judge stops getting harder. The agent learns to satisfy a fixed evaluator rather than getting genuinely better. The Red Queen Gödel Machine, from Cambridge, co-evolves the agent and its evaluator together, so the bar keeps rising as the agent climbs. The name borrows the evolutionary arms race. Both sides have to keep running to stay in place. A frozen evaluator is where reward hacking creeps into self-improvement. Co-evolving the judge is a structural answer to that, and it keeps the loop honest over many rounds. Paper: https://t.co/HuR9YWSTPr Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX


dair_ai

@dair_ai

📅

Jun 28, 2026

7d ago

🆔93564941

Why do RL runs on LLMs blow up even when the recipe looks right? GEOALIGN, from the Alibaba team behind Qwen, points at the rollouts. A handful of bad batches push the policy in incoherent directions, and most stability tuning just damps the symptom. This work curates rollouts by their geometry, removing the samples that make update directions conflict before they destabilize training. Why does it matter? If instability is largely a bad-batch problem, rollout curation is a lower-effort lever than another round of KL or clip tuning. You fix the data going into the update rather than fighting the optimizer. Paper: https://t.co/tUAYC57MVy Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
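One simple way to picture "curating rollouts by geometry" is to drop samples whose update direction points against the batch consensus. The post does not give GEOALIGN's actual criterion, so the rule below (keep a sample only if its cosine similarity with the mean per-sample gradient exceeds a threshold) is a stand-in assumption, and the 2-D gradients are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def keep_aligned(sample_grads, threshold=0.0):
    """Return indices of samples whose gradient agrees with the batch mean.

    Hypothetical filter: agreement means cosine(grad, mean_grad) > threshold.
    """
    dim = len(sample_grads[0])
    mean = [sum(g[i] for g in sample_grads) / len(sample_grads) for i in range(dim)]
    return [i for i, g in enumerate(sample_grads) if cosine(g, mean) > threshold]


# Made-up per-rollout gradients; the last one opposes the others.
grads = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, -0.1],
    [-1.0, 0.2],
]

kept = keep_aligned(grads)
print(kept)
```

The update would then be computed from the kept rollouts only, which changes the data going into the step rather than the optimizer's clipping or KL settings.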


dair_ai

@dair_ai

📅

Jun 28, 2026

7d ago

🆔80706126

NEW paper worth reading. Reasoning-data curation is expensive because scoring a trace usually means reading it to the end. This new work from UCLA shows you may not have to. The quality of a reasoning trace is largely decided in its opening tokens, so a short prefix predicts whole-trace quality well enough to rank and filter on. What does this mean? You can score a million traces without finishing any of them. That turns curation into a cheap early-stopping problem and cuts the cost of building SFT data for reasoning models by a wide margin. Paper: https://t.co/KPKdygwd12 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c


🔁random_walker retweeted


Nan Ransohoff

@nanransohoff

📅

Jun 24, 2026

12d ago

🆔56058632

https://t.co/atWPjSq0yg https://t.co/8mgP0A7oXP

❤️557

likes

🔁63

retweets


alexwan55

@alexwan55

📅

Jun 24, 2026

11d ago

🆔32557484

40% of benchmarking effort targets math/coding, but the related occupations are only 3.5% of US jobs. We introduce EconEvals, an open-source evaluation suite to measure capabilities and predict job disruption across the US labor economy. https://t.co/wxQykhUqCI


FazlBarez

@FazlBarez

📅

Jun 26, 2026

9d ago

🆔31313457

This paper will be talked about for years to come. Very important! There are futures benchmark-driven AI cannot see! Led by Sobhan (my fellow) and @Avameanssong w/ @kalsbskk81826, Ali, Fateme, @sanmikoyejo, @philiptorr, @yong_suk_lee, @joelbot3000, @NorvigPeter, and @random_walker https://t.co/ehBGK8dfsT


llama_index

@llama_index

📅

Jun 25, 2026

10d ago

🆔11561100

We built LiteParse, the fastest document parsing solution on the planet and made it open source. And it just hit 10k github stars. 🦙 Fast to run. Fast to love. Thanks for building with us. If you haven't tried it already, repo at: https://t.co/wXRxvlREQq https://t.co/Shv0J1CROU


llama_index

@llama_index

📅

Jun 26, 2026

10d ago

🆔56892811

The @n8n_io node for the LlamaParse Platform is now an officially verified community node, as part of a broader partnership with n8n to bring cutting-edge document intelligence to the low-code and no-code world🚀 The new version of the node brings together document parsing, classification, extraction, splitting, and retrieval in one place, all wired to a single LlamaParse API credential🦙 Each resource can now also act as a callable tool inside an n8n AI Agent: so instead of building static pipelines, you can let the agent decide when to retrieve context, parse a file, or extract structured data based on what the user actually needs🤖 A few workflows worth highlighting: routing documents by type before extracting structured fields, plugging retrieval directly into an agent backed by your own knowledge base, and running parse outputs through different tiers side by side to find the right balance between accuracy and cost🔃 If you're already using n8n, install it directly from your workflow canvas by searching 𝘓𝘭𝘢𝘮𝘢𝘗𝘢𝘳𝘴𝘦 𝘗𝘭𝘢𝘵𝘧𝘰𝘳𝘮 and give it a try!🔧 📚️ Full breakdown in our blog post: https://t.co/8LJB80HCJ8


jerryjliu0

@jerryjliu0

📅

Jun 27, 2026

8d ago

🆔38758217

LiteParse is unreasonably good for document parsing
✅ It is the fastest document parsing tool out there - average parse time per page is 3ms ⚡️⚡️
✅ Now that we support markdown, it tops opendataloader-bench, OlmOCR-bench, and ParseBench in terms of accuracy
✅ It supports 50+ other document formats
✅ It even gives you basic bounding boxes that your coding agent can stitch together
Even if you need deeper VLM-enabled parsing (e.g. LlamaParse), there's no reason you shouldn't be using this as a first pass for everything. https://t.co/JNER0mVcB8

@llama_index • Thu Jun 25 16:26

We built LiteParse, the fastest document parsing solution on the planet and made it open source. And it just hit 10k github stars. 🦙 Fast to run. Fast to love. Thanks for building with us. If you haven't tried it already, repo at: https://t.co/wXRxvlREQq https://t.co/Shv0J1CRO


JosephJacks_

@JosephJacks_

📅

Jun 26, 2026

9d ago

🆔29094608

“We’re being forced by the U.S. Government to slowly release 5.6, and we don’t like it” At LEAST they are saying this .. good job @sama .. but it is not enough. We are already losing to China and now we are trying to act like them, which is only reinforcing their advantage!!! https://t.co/Rm5CD28KTV


ornith_

@ornith_

📅

Jun 25, 2026

11d ago

🆔67963854

Aloha! 🌺 Meet Ornith-1.0, a family of open-source LLMs specialized for agentic coding. Ornith-1.0 spans a full range of parameter sizes: 9B Dense, 31B Dense, 35B MoE, and 397B MoE. It achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks including:
✅ Terminal-Bench 2.1 (77.5)
✅ SWE-Bench (82.4 on Verified, 62.2 on Pro, 78.9 on Multilingual)
✅ NL2Repo (48.2)
✅ SWE Atlas (41.2 on QnA, 42.6 on RF, 39.1 on TW)
✅ ClawEval (77.1)
Post-trained on top of gemma4 and qwen3.5, Ornith-1.0 employs a novel self-improving training strategy in which reinforcement learning is used to generate not only solution rollouts, but also the task-specific scaffolds that drive those rollouts. By jointly optimizing the scaffold and the resulting solution, the model generates higher-quality solutions in agentic coding. 😎 All models are released under the MIT license, enabling full commercial and research use.
📖 Tech Blog: https://t.co/qT9N2HYWFn
🤗 Hugging Face: https://t.co/PRrwqjeBtM


arankomatsuzaki

@arankomatsuzaki

📅

Jun 26, 2026

10d ago

🆔80349703

@lcastricato We may have missed our golden goose by not digging deeper into https://t.co/CIshkL2Qx1


perplexity_ai

@perplexity_ai

📅

Jun 24, 2026

11d ago

🆔71766804

Introducing Computer for Counsel. Computer now connects the research databases, document tools, and matter-management systems lawyers use every day. Pull citable sources from @midpageAI, @LegalZoom, @Docusign, @netdocuments, and more. Available for all Pro and Max subscribers. https://t.co/El3028Ua7P


base

@base

📅

Jun 25, 2026

11d ago

🆔99379282

Base MCP is now available in @Perplexity_ai Computer
→ Research any token with Perplexity
→ Set your entry point
→ Base MCP prepares the swap for your approval
Alongside everything else Base MCP can already do https://t.co/RiucyK7NTN


AravSrinivas

@AravSrinivas

📅

Jun 27, 2026

8d ago

🆔51400092

this is going to be the norm https://t.co/BeyBz7EWDK


nunezvice

@nunezvice

📅

Jun 24, 2026

12d ago

🆔49776025

tony stark is not texting jarvis. voice lets you give agents more context, faster. the messy stuff is actually the point. i wrote about how we use it with codex today and where this is all going. talk to your computer. be shameless about it. i’ll see you in a few months 🎙️ https://t.co/Sm5ZIcMCJP

@nunezvice • Wed Jun 24 16:02

https://t.co/cRuQd07mwk


SakanaAILabs

@SakanaAILabs

📅

Jun 24, 2026

12d ago

🆔52493052

Fugu-Ultra is now live on @OpenRouter! ⚡ We share a core vision with the OpenRouter team: the future of AI isn’t a single monolithic model, but the collective intelligence of the world’s best models working together. Try it: https://t.co/sVkbTPtXOl 🐡 https://t.co/y65DXVcqXL


michellearning

@michellearning

📅

Jun 24, 2026

12d ago

🆔87292445

The future of bio is powered by faster data Introducing the Medra AI Experimentalist: an agent that turns goals into experimental designs, learns from every result, and develops the next assay Excited to collaborate with @DARPA and @NVIDIAHealth on the future of science https://t.co/5DZoWJ97xs


🔁jxnlco retweeted


TnvMadhav

@TnvMadhav

📅

Jun 24, 2026

12d ago

🆔97465095

@jxnlco https://t.co/inpHAoEYW7

🔁1

retweets


NVIDIAAI

@NVIDIAAI

📅

Jun 24, 2026

12d ago

🆔25418828

The rise of MoE models introduced new challenges in training, and @huggingface's Transformers v5 brought first-class support for solving them. Now, NeMo AutoModel builds on top of v5. Part of the NeMo framework for building models at scale, NeMo AutoModel brings optimizations to a broad set of model families through support for Expert Parallelism, DeepEP, and TransformerEngine kernels with a few lines of code. We found NeMo AutoModel brings a 3.4 to 3.7x higher training throughput for popular MoE models. You can read more here: https://t.co/TNlBsKWwrJ


unprofeshme

@unprofeshme

📅

Jun 24, 2026

12d ago

🆔67977793

anyway, it’s not just the two of us… @adlinzainal @SherryYanJiang @agrimsingh @yongquanYQ @darenstwt @ivanleomk will be there too! 🇸🇬 https://t.co/JNHf67M8rU
