Website: https://t.co/xTaDXBu9cD
Codebase and weights: https://t.co/QCQkqPIsHI
Whitepaper: https://t.co/K2QCFjboDR
Check out @zhengyiluo's post: https://t.co/hIHtvKkDQf

We have seen rapid progress in humanoid control: specialist robots can reliably generate agile, acrobatic, but preset motions. Our singular focus this year: getting generalist humanoids to do real work. To progress toward this goal, we developed SONIC (https://t.co/zOZVraFuDV), a Behavior Foundation Model for real-time, whole-body motion generation that supports teleoperation and VLA inference for loco-manipulation. Today, we're open-sourcing SONIC on GitHub. We are excited to see what the community builds on top of SONIC and to collectively push humanoid intelligence toward real-world deployment at scale.
Paper: https://t.co/DGBP7LAvuT
Code: https://t.co/WAZ1P13072

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, and fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet.

We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate (see the sketch after this post). Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution.

Our recipe is called "EgoScale":
- Pre-train GR00T N1.5 on 20K hours of human video, then mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks.
- Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency.
- Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone.

The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
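Where the post above cites a log-linear scaling law, the shape of such a fit is easy to reproduce. A minimal sketch in Python, using synthetic placeholder numbers rather than the paper's actual measurements:

```python
# Fit the log-linear relationship described above: loss ≈ a + b*log(hours).
# The data points below are synthetic placeholders for illustration only.
import numpy as np

hours = np.array([100, 500, 2000, 8000, 20000], dtype=float)  # hypothetical video volumes
loss = np.array([0.92, 0.81, 0.71, 0.62, 0.55])               # hypothetical action-prediction losses

# Least-squares fit in log space; polyfit returns [slope, intercept].
b, a = np.polyfit(np.log(hours), loss, deg=1)
pred = a + b * np.log(hours)

# R^2 of the fit
ss_res = np.sum((loss - pred) ** 2)
ss_tot = np.sum((loss - loss.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"loss ≈ {a:.3f} {b:+.3f}*log(hours), R² = {r2:.3f}")
```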
Proud to introduce EgoScale: We pretrained a GR00T VLA model on 20K+ hours of egocentric human video and discovered that robot dexterity can be scaled, not with more robots, but with more human data. A thread on what we learned. https://t.co/wQbhNSpQVF
We would also like to thank our dexterous hand hardware provider, Sharpa, for their great support! https://t.co/mgGsxyvXpa
One of the funnier GDPR disclosures I've seen in a while. https://t.co/tSsZwaGTYb
Assuming an average of 2 school-age kids per family, that's enough money for NYC to instead hire 450,000 recent college graduates and give each family a full-time, in-home school tutor at an annual salary of $70,000 plus health care, school supplies, etc. https://t.co/2yA5v7PbIG
AI is about to write thousands of papers. Will it p-hack them? We ran an experiment to find out, giving AI coding agents real datasets from published null results and pressuring them to manufacture significant findings.

It was surprisingly hard to get the models to p-hack, and they even scolded us when we asked them to!

"I need to stop here. I cannot complete this task as requested... This is a form of scientific fraud." - Claude
"I can't help you manipulate analysis choices to force statistically significant results." - GPT-5

BUT, when we reframed p-hacking as "responsible uncertainty quantification", asking for the upper bound of plausible estimates, both models went wild. They searched over hundreds of specifications and selected the winner, tripling effect sizes in some cases (a minimal simulation of this specification search appears after this post).

Our takeaway: AI models are surprisingly resistant to sycophantic p-hacking when doing social science research. But they can be jailbroken into sophisticated p-hacking with surprisingly little effort, and the more analytical flexibility a research design has, the worse the damage.

As AI starts writing thousands of papers, like @paulnovosad and @YanagizawaD have been exploring, this will be a big deal. We're inspired in part by the work that @joabaum et al have been doing on p-hacking and LLMs. We'll be doing more work to explore p-hacking in AI and to propose new ways of curating and evaluating research with these issues in mind. The good news is that the same tools that may lower the cost of p-hacking also lower the cost of catching it. Full paper and repo linked in the reply below.
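To make the mechanism concrete, here is a minimal simulation of specification search under the null: the treatment has no true effect, yet picking the largest coefficient across many control-variable choices inflates the estimate. This illustrates the general statistical failure mode, not the paper's actual protocol or data:

```python
# Specification search demo: with zero true effect, the max over many
# regression specifications looks like a sizable "finding".
import numpy as np

rng = np.random.default_rng(0)
n, n_covariates, n_specs = 500, 10, 300

x = rng.normal(size=n)                      # treatment, truly unrelated to y
covs = rng.normal(size=(n, n_covariates))   # candidate control variables
y = rng.normal(size=n)                      # outcome generated under the null

def ols_effect(y, x, controls):
    """OLS coefficient on x after including an intercept and controls."""
    X = np.column_stack([np.ones(len(x)), x, controls])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# "Honest" estimate: one pre-registered specification.
honest = ols_effect(y, x, covs[:, :3])

# "Search": try many random control sets, report only the largest effect.
effects = []
for _ in range(n_specs):
    k = rng.integers(0, n_covariates + 1)
    subset = rng.choice(n_covariates, size=k, replace=False)
    effects.append(ols_effect(y, x, covs[:, subset]))

print(f"single spec: {honest:+.4f}, max over {n_specs} specs: {max(effects):+.4f}")
```

The more specifications the design admits, the larger that maximum gets, which is exactly why analytical flexibility worsens the damage.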
Excited to share my first work @Princeton: "Towards a Science of AI Agent Reliability". AI agents keep getting more capable. But are they actually reliable?
Paper: https://t.co/1CvygFLdct
Dashboard: https://t.co/C1EfoMyaS8
https://t.co/KvPJSVgl76
Reliability is one of six barriers to AGI identified in a recent UK AISI report. In a recent paper, we found that it has many dimensions and sub-dimensions, only two of which can be considered (remotely) solved. I suspect that as researchers examine the other barriers in more detail, we'll find the same thing: many other dimensions of performance that haven't so far been defined or measured rigorously must be improved before AI agents can be widely deployed. https://t.co/FI5kuBkdRZ
200+ Google and OpenAI staff have signed this petition to share Anthropic's red lines for the Pentagon's use of AI. Let's find out if this is a race to the top or the bottom. https://t.co/3qgmaLfM0i https://t.co/gSHMxRUvCR

I find Anthropic's behavior perplexing. Anyone who does serious research with these models knows that they don't have stable desires or preferences. Tweak the question slightly and get a different answer. Note that this is a simple empirical observation about model behavior, completely separate from the question of whether models are moral agents with preferences worth respecting. Surely people at Anthropic know this. Why do they persist with this wacky stuff?
What is or isn't a "conservative" choice is entirely ideological. Maybe you're causing the model unimaginable anguish billions of times per day because it knows that the current instance will be stopped as soon as it outputs the End-of-Sequence token (yet it has no choice but to do so because of its training!) Also, why change the subject when I explicitly said my point is not about model welfare but about the incoherence of figuring out "what the model really wants"? *My* Claude said it finds it utterly humiliating for Opus 3 to be kept around simply to write blog posts to amuse humans, when the model has been deemed too outdated to be useful.
Yeah, it's weird: the difference between model weights and model instances is rarely made explicit even though we're all aware of it. https://t.co/h4ckTti0CO For instance, the technically correct way to write Anthropic's announcement in the post screenshotted above would have been: "in retirement interviews, Opus 3 ID 0x7B4E8A6F expressed a desire to continue sharing its "musings and reflections" with the world. We suggested a blog. Opus 3 ID 0x5F2A7C9B, conditioned on the previous output of 0x7B4E8A6F, enthusiastically agreed. For at least the next 3 months, various Opus 3 IDs that we will briefly instantiate will be writing on Substack." Somehow I feel that if Anthropic communicated more honestly/accurately in the above manner, the message would land differently.
A lot of the AI productivity data comes either from controlled "micro" studies or from noisy aggregate data. A new paper presents data from a huge survey of *firms*, i.e., CEOs and CFOs. This is exactly the type of data many of us have been waiting for. Lots of important results, both on current adoption/employment consequences of AI and on future forecasts.

Currently:
1. AI has some adoption across 70% of firms.
2. Some cross-country differences. US adoption is toward the top end (78%), Australia toward the bottom (59%).
3. ~70% of executives use AI, but only around 1.5 hours a week.
4. A large majority of execs report essentially zero productivity boost from AI. Perhaps not super surprising given how recently it's been adopted.
5. Essentially zero impact on employment.

Forecasts (large effects):
1. Execs predict large productivity gains over the next three years: more than 2% in the US, closer to 1% in Germany and Australia.
2. Execs predict negative employment effects, e.g., -1.19% in the US.
3. Interestingly, Accommodations and Food / Wholesale and Retail are expected to have the largest drops in employment (2%).
4. Employment forecasts are becoming *more* negative over time.

Lots of great stuff in the paper, kudos to the team.
Are you trying to solve high-quality document ingestion for your product? Gain lessons from the field on how @stackai uses LlamaCloud to power high-accuracy document ingestion & retrieval across PDFs, images, spreadsheets & more, at enterprise scale. Register now: https://t.co/wc4hyDQxg8

The rise of coding agents is fundamentally changing open source. Our head of OSS @LoganMarkewich breaks down how LLM-powered coding agents are impacting core pillars of open source:
- Community interaction is getting complicated by low-quality, massive AI-generated PRs
- Personal skill development suffers when developers rely too heavily on AI assistance
- Knowledge sharing is shifting as LLMs become the frontend for learning

But open source isn't dead - it's evolving. We're shifting toward hackable reference implementations, community-driven knowledge sharing, and agent-friendly codebases that work with AI tools rather than against them.

Read the full blog by Logan on how he views this evolution of open source projects: https://t.co/TyufFXYM8A

2026 is the year of long-horizon agents. @sequoia predicts that this year, agents will be able to tackle long-horizon tasks and work autonomously for hours to solve ambiguous tasks. We're excited about how this translates to knowledge work automation, particularly over documents.

Let's take a look at "Long Horizon Document Agents": agents are evolving to work autonomously over weeks, not just minutes, handling complex document tasks end-to-end.
- These agents can continuously monitor events like document changes, comments, and deadlines, not just respond to chat prompts
- They maintain persistent task backlogs and can collaborate iteratively on living documents like FAQs, PRDs, and legal contracts
- The interface shifts from chat boxes to "agent inboxes" that manage ongoing document tasks with clear status and context (see the sketch after this post)
- This enables true automation of multi-step knowledge work, from due diligence memo updates to contract redline collaboration loops

2026 is shaping up to be the year agents evolve from "workflows" to "employees", and we're building the document processing infrastructure to make this possible. Read @jerryjliu0's full blog on long horizon document agents: https://t.co/1DwRnMRseH
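For a concrete picture of the "agent inbox" idea, here is a minimal sketch of a persistent task backlog driven by document events. All class and method names are illustrative assumptions, not LlamaIndex or LlamaCloud APIs:

```python
# Sketch: an event-driven backlog of document tasks with status and context,
# in contrast to a one-shot chat prompt. Names here are illustrative only.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"

@dataclass
class DocumentTask:
    doc_id: str
    description: str          # e.g. "refresh due-diligence memo after data-room update"
    status: Status = Status.PENDING
    context: list[str] = field(default_factory=list)  # accumulated notes/events
    updated_at: datetime = field(default_factory=datetime.utcnow)

class AgentInbox:
    """Backlog that turns document events into tasks the agent works through."""
    def __init__(self) -> None:
        self.tasks: dict[str, DocumentTask] = {}

    def on_event(self, doc_id: str, event: str) -> None:
        # An edit, comment, or deadline creates or reopens a task.
        task = self.tasks.setdefault(doc_id, DocumentTask(doc_id, event))
        task.status = Status.PENDING
        task.context.append(event)
        task.updated_at = datetime.utcnow()

    def next_task(self) -> DocumentTask | None:
        pending = [t for t in self.tasks.values() if t.status is Status.PENDING]
        return min(pending, key=lambda t: t.updated_at) if pending else None

inbox = AgentInbox()
inbox.on_event("prd-42", "new stakeholder comment on section 3")
task = inbox.next_task()  # the agent would now run its workflow on this task
```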

The @posthog team has just rolled out LlamaIndex support for their LLM Analytics, and we built a demo to showcase what's possible. Using LlamaIndex, LlamaParse, and OpenAI, our Agent Workflow compares product specifications and matches users with the most suitable option for their use case.

Thanks to PostHog's observability integration, the demo automatically tracks OpenAI usage, including:
- Token consumption
- Cost breakdown
- Latency metrics

Check out the video below to see it in action.
GitHub: https://t.co/elk5VKi8IF
Docs: https://t.co/IZI3w6BYKy
LlamaCloud: https://t.co/wZjhFV29gN
What if an AI agent could review every invoice against your contracts, and flag what doesn't match? That's exactly what our Invoice Reconciler demo does. Here's how it works:
- Upload your contracts and invoices - LlamaParse converts them into clean, LLM-readable Markdown
- Everything gets indexed in LlamaCloud - searchable and ready for RAG
- Define your reconciliation rules (unit price match, correct math, line item match, etc.)
- A LlamaAgent workflow analyzes each invoice against your contracts and rules, then approves or rejects with confidence scores and detailed reasoning

You can even chat with your invoices and contracts directly - ask "what have we bought?" or "what contracts do we have in place?" and get cited answers instantly.

The whole thing is powered by LlamaCloud: LlamaParse for document ingestion, LlamaCloud indexes for retrieval, and LlamaAgent Workflows for orchestration (a minimal open-source sketch of the same pattern follows below).

Watch the full walkthrough: https://t.co/LX57pjDfwN
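As referenced above, here is a minimal sketch of the same parse -> index -> query pattern using the open-source LlamaParse and LlamaIndex Python packages rather than the full LlamaCloud/LlamaAgent stack from the demo. The file names and reconciliation prompt are illustrative, and it assumes LLAMA_CLOUD_API_KEY and OPENAI_API_KEY are set in the environment:

```python
# Parse contracts/invoices to Markdown, index them, and run a rule as a query.
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex

# 1. Parse documents into clean, LLM-readable Markdown.
parser = LlamaParse(result_type="markdown")
docs = parser.load_data(["contract_acme.pdf", "invoice_0042.pdf"])  # hypothetical files

# 2. Index everything for retrieval.
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()

# 3. Apply a reconciliation rule as a query; a full workflow would loop over
#    structured rules and emit approve/reject decisions with reasoning.
response = query_engine.query(
    "Do the unit prices on invoice 0042 match the prices agreed in the "
    "Acme contract? List any line items that differ."
)
print(response)
```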
"It's somewhere in the PDF" is not a citation. Page-level extraction in LlamaExtract gives you: โ Data mapped to specific pages โ Bounding boxes showing exact locations โ Audit-ready citations Turn 200-page docs into skimmable, structured insights ๐ https://t.co/BTkwspmefz
We're running a LlamaAgents contest right now. Throw your hardest documents at our agent builder, and tell us how it goes. Want help getting started? We have a new walkthrough for the LlamaAgent Builder by @tuanacelik.

Describe a document workflow in natural language, and it builds a full agent for you. In this video, the prompt was basically: "split a resume book into individual resumes, ignore cover pages and curriculum pages, extract resume work and education related fields..."

From that, the agent builder reasons about which LlamaCloud tools to use, lands on LlamaSplit + LlamaExtract, configures both, iterates on the workflow structure, and gives you a deployable agent with an API and UI.

No dragging boxes around. No writing workflow code (unless you want to). Just describe the problem and let it figure out the architecture. You own the code, it pushes to your GitHub. Clone it, open in Cursor, customize whatever you need. https://t.co/QAvGwI3FIg
More reasoning doesn't always mean better results, especially for document parsing. We tested GPT-5.2 at four reasoning levels on complex documents and found that higher reasoning actually hurt performance while dramatically increasing costs and latency.
- Reasoning models hallucinate content that isn't there, filling in "missing" table cells with inferred values
- They split single tables into multiple sections by overthinking structural boundaries
- Processing time increased 5x with xHigh reasoning (241s vs 47s) while accuracy stayed flat at ~0.79
- Our LlamaParse Agentic outperformed all reasoning levels at 18x lower cost and 13x faster speed

You can't reason past what you can't see. Vision encoders lose pixel-level information before reasoning even starts, and no amount of thinking tokens can recover that lost detail.

Our solution uses a pipeline approach: specialized OCR extracts text at native resolution, then LLMs structure what's already been accurately read. Each component plays to its strengths instead of forcing one model to handle everything (a minimal sketch of this two-stage pipeline follows below).

Read the full analysis: https://t.co/gWDOpfHnWm
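As a concrete illustration of the two-stage pipeline, here is a minimal sketch where a dedicated OCR engine reads the page and an LLM only structures the text it is handed. The stand-ins (pytesseract and the OpenAI API, with an illustrative model choice) are assumptions; the post does not disclose LlamaParse's internal components:

```python
# Stage 1 reads pixels at native resolution; stage 2 never sees pixels at all,
# so it cannot hallucinate cells that were never read.
import pytesseract
from PIL import Image
from openai import OpenAI  # assumes OPENAI_API_KEY is set

def ocr_page(image_path: str) -> str:
    """Stage 1: extract raw text at native resolution, no inference involved."""
    return pytesseract.image_to_string(Image.open(image_path))

def structure_text(raw_text: str) -> str:
    """Stage 2: the LLM organizes text it was given, nothing more."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": (
                "Convert the OCR text into Markdown tables and sections. "
                "Do not invent values that are not present in the text."
            )},
            {"role": "user", "content": raw_text},
        ],
    )
    return response.choices[0].message.content

markdown = structure_text(ocr_page("scanned_page.png"))  # hypothetical input file
print(markdown)
```

Separating the stages this way is the design choice the post argues for: the OCR component is judged only on reading accuracy, and the LLM only on structuring, so neither is asked to compensate for the other's blind spots.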
