2026 is the year of long-horizon agents. @sequoia predicts that this year, agents will be able to work autonomously for hours to solve ambiguous, long-horizon tasks. We're excited about how this translates to knowledge work automation, particularly over documents. Let's take a look at "Long Horizon Document Agents":

Agents are evolving to work autonomously over weeks, not just minutes, handling complex document tasks end-to-end.
- These agents can continuously monitor events like document changes, comments, and deadlines, not just respond to chat prompts
- They maintain persistent task backlogs and can collaborate iteratively on living documents like FAQs, PRDs, and legal contracts
- The interface shifts from chat boxes to "agent inboxes" that manage ongoing document tasks with clear status and context
- This enables true automation of multi-step knowledge work, from due diligence memo updates to contract redline collaboration loops

2026 is shaping up to be the year agents evolve from "workflows" to "employees", and we're building the document processing infrastructure to make that possible.

Read @jerryjliu0's full blog on long-horizon document agents: https://t.co/1DwRnMRseH

The @posthog team has just rolled out LlamaIndex support for their LLM Analytics, and we built a demo to showcase what's possible. Using LlamaIndex, LlamaParse, and OpenAI, our Agent Workflow compares product specifications and matches users with the most suitable option for their use case.

Thanks to PostHog's observability integration, the demo automatically tracks OpenAI usage, including:
- Token consumption
- Cost breakdown
- Latency metrics

Check out the video below to see it in action!
GitHub: https://t.co/elk5VKi8IF
Docs: https://t.co/IZI3w6BYKy
LlamaCloud: https://t.co/wZjhFV29gN
What if an AI agent could review every invoice against your contracts and flag what doesn't match? That's exactly what our Invoice Reconciler demo does. Here's how it works:
- Upload your contracts and invoices; LlamaParse converts them into clean, LLM-readable Markdown
- Everything gets indexed in LlamaCloud, searchable and ready for RAG
- Define your reconciliation rules (unit price match, correct math, line item match, etc.)
- A LlamaAgent workflow analyzes each invoice against your contracts and rules, then approves or rejects with confidence scores and detailed reasoning

You can even chat with your invoices and contracts directly: ask "what have we bought?" or "what contracts do we have in place?" and get cited answers instantly.

The whole thing is powered by LlamaCloud: LlamaParse for document ingestion, LlamaCloud indexes for retrieval, and LlamaAgent Workflows for orchestration.

Watch the full walkthrough: https://t.co/LX57pjDfwN
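For intuition, the rule-checking step can be sketched in plain Python. This is a toy stand-in, not the actual LlamaAgent workflow (which drives the checks with an LLM over parsed documents); all field names and prices here are invented:

```python
# Toy sketch of rule-based invoice reconciliation: "unit price match"
# and "correct math" checks against contracted prices. Hypothetical
# data model, not LlamaCloud's.

CONTRACT_PRICES = {"widget": 10.00, "gadget": 25.50}  # from the contract

def reconcile(invoice_lines):
    """Return (approved, reasons) for a list of invoice line items."""
    reasons = []
    for line in invoice_lines:
        expected = CONTRACT_PRICES.get(line["item"])
        if expected is None:
            reasons.append(f"{line['item']}: not found in any contract")
        elif line["unit_price"] != expected:
            reasons.append(
                f"{line['item']}: unit price {line['unit_price']} "
                f"!= contracted {expected}"
            )
        # "correct math" rule: line total must equal qty * unit price
        if round(line["unit_price"] * line["qty"], 2) != line["total"]:
            reasons.append(f"{line['item']}: line total fails qty * unit price")
    return (not reasons, reasons)

approved, reasons = reconcile([
    {"item": "widget", "qty": 3, "unit_price": 10.00, "total": 30.00},
    {"item": "gadget", "qty": 2, "unit_price": 26.00, "total": 52.00},
])
```

The real workflow layers confidence scores and LLM reasoning on top of rules like these; the sketch just shows why parsed, structured line items make the checks trivial.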
"It's somewhere in the PDF" is not a citation. Page-level extraction in LlamaExtract gives you:
- Data mapped to specific pages
- Bounding boxes showing exact locations
- Audit-ready citations

Turn 200-page docs into skimmable, structured insights: https://t.co/BTkwspmefz
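A page-level citation is just a value plus its provenance. A hypothetical shape for such a record (field names are illustrative, not the actual LlamaExtract schema):

```python
from dataclasses import dataclass

@dataclass
class Citation:
    """Illustrative page-level citation: an extracted value plus
    where it came from. Not LlamaExtract's real output format."""
    value: str   # the extracted field value
    page: int    # 1-based page number in the source document
    bbox: tuple  # (x0, y0, x1, y1) bounding box on that page, in points

# An auditor can jump straight to page 137 and the highlighted region,
# instead of searching a 200-page document for "$1.2M".
c = Citation(value="$1.2M", page=137, bbox=(72.0, 540.5, 210.0, 556.0))
```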
We're running a LlamaAgents contest right now. Throw your hardest documents at our agent builder, and tell us how it goes.

Want help getting started? We have a new walkthrough of the LlamaAgent Builder by @tuanacelik. Describe a document workflow in natural language, and it builds a full agent for you.

In this video, the prompt was basically: "split a resume book into individual resumes, ignore cover pages and curriculum pages, extract resume work and education related fields..."

From that, the agent builder reasons about which LlamaCloud tools to use, lands on LlamaSplit + LlamaExtract, configures both, iterates on the workflow structure, and gives you a deployable agent with an API and UI.

No dragging boxes around. No writing workflow code (unless you want to). Just describe the problem and let it figure out the architecture. You own the code: it pushes to your GitHub, so you can clone it, open it in Cursor, and customize whatever you need.

https://t.co/QAvGwI3FIg
More reasoning doesn't always mean better results, especially for document parsing. We tested GPT-5.2 at four reasoning levels on complex documents and found that higher reasoning actually hurt performance while dramatically increasing costs and latency.
- Reasoning models hallucinate content that isn't there, filling in "missing" table cells with inferred values
- They split single tables into multiple sections by overthinking structural boundaries
- Processing time increased 5x with xHigh reasoning (241s vs 47s) while accuracy stayed flat at ~0.79
- Our LlamaParse Agentic outperformed all reasoning levels at 18x lower cost and 13x faster speed

You can't reason past what you can't see. Vision encoders lose pixel-level information before reasoning even starts, and no amount of thinking tokens can recover that lost detail.

Our solution uses a pipeline approach: specialized OCR extracts text at native resolution, then LLMs structure what's already been accurately read. Each component plays to its strengths instead of forcing one model to handle everything.

Read the full analysis: https://t.co/gWDOpfHnWm

Coding agents are fundamentally changing software engineering in terms of velocity, roles, and org structure. We published a memo to our internal engineering team detailing our growing expectations of role and scope.
- Before: prioritization, engineering planning, and implementation were divided between EMs, PMs, senior ICs, and junior ICs
- Now: ICs are expected to handle *all* of product prioritization, product speccing, and implementation

This is due to a few trends:
- Coding agents have brought implementation costs down to ~0. The role of engineers is writing prompts
- LLMs and sub-agents have reduced the PM work of synthesizing feedback down to ~0 too

The main job of any "engineer" is to be an end-to-end product owner: translating requirements into specifications, and delegating tasks to various subagents for implementation. Every engineer is told to offload as much as possible to their favorite tools, whether it's Claude Code, Cursor, Devin, Codex, regular ChatGPT, or something else. We celebrate and share learnings around burning tokens, as long as it helps drive additional productivity!
Big drop from @GoogleDeepMind: Gemini 3.1 Pro is here, and we built a hands-on demo powered by LlamaCloud to put it to work, turning your receipt photos into real financial insights!

Using our Agent Workflows, the app:
- Parses receipt images with LlamaParse (Agentic tier)
- Stores everything locally in an SQLite database
- Aggregates your spending monthly
- Uses Gemini 3.1 Pro to analyze trends and generate actionable tips to improve your finances

Check out the demo below!
GitHub repo: https://t.co/Ny22F4I3n1
Get started with LlamaCloud: https://t.co/zyE5lXTPFV
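The storage and aggregation steps can be sketched with the standard library alone. The table layout and rows below are hypothetical stand-ins, not the demo's actual schema:

```python
import sqlite3

# In-memory stand-in for the demo's local receipts database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (date TEXT, merchant TEXT, total REAL)")
conn.executemany(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    [
        ("2026-01-04", "Grocer", 42.10),  # sample rows, not real data
        ("2026-01-19", "Cafe", 8.50),
        ("2026-02-02", "Grocer", 37.25),
    ],
)

# Monthly spend aggregation: the kind of summary handed to the model
# for trend analysis. substr(date, 1, 7) yields the "YYYY-MM" prefix.
monthly = conn.execute(
    "SELECT substr(date, 1, 7) AS month, ROUND(SUM(total), 2) "
    "FROM receipts GROUP BY month ORDER BY month"
).fetchall()
```

In the demo, rows come from LlamaParse's structured output over the receipt images rather than hand-written literals.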
LlamaAgents Builder just leveled up: file uploads are here!

Our natural language interface for building agentic document workflows now supports file uploads. You can provide example documents as context, and the agent will use them as a starting point to design and tailor your workflow. The result? Applications that better match your real-world use case. The more representative your sample files, the more accurate your final app.

Watch the full walkthrough: https://t.co/LQW2PEZ8d9
Get started with LlamaCloud: https://t.co/wZjhFV29gN
We built an AI agent that lets you vibe-code document extraction, with high accuracy and citations over the most complex documents.

Our latest release lets you upload documents as context. All you then have to do is describe what you want extracted in natural language.
- Our agent will read the document with file tools to infer the right schema, validation rules, and other pre/postprocessing logic.
- It will give you back a workflow that can extract over thousands or millions of documents at scale. You can still, of course, review and edit every output before approving.

Stop handling paperwork manually: just upload files, describe your task, and let our agent handle the rest. Our vision for LlamaAgents is to provide the most advanced and easy-to-use way for you to orchestrate document work.

Walkthrough: https://t.co/dAtzlZbot4
Check it out: https://t.co/XYZmx5TFz8

If you're interested in reducing the operational burden of document extraction (invoices, claims, onboarding forms), come talk to us! https://t.co/Ht5jwxSrQB
Document OCR benchmarks are hitting a ceiling, and that's a problem for real-world AI applications.

Our latest analysis reveals why OmniDocBench, the go-to standard for document parsing evaluation, is becoming inadequate as models like GLM-OCR from @Zai_org achieve 94.6% accuracy while still failing on complex real-world documents.
- Models are saturating OmniDocBench scores but still struggle with complex financial reports, legal filings, and domain-specific documents
- Rigid exact-match evaluation penalizes semantically correct outputs that differ in formatting (HTML vs Markdown, spacing, etc.)
- AI agents need semantic correctness, not perfect formatting matches; current benchmarks miss this critical distinction
- The benchmark's 1,355 pages can't capture the full complexity of production document processing needs

The document parsing challenge isn't solved just because benchmark scores look impressive. We need evaluation methods that reward semantic understanding over exact formatting, especially as AI agents become the primary consumers of parsed content.

We're building parsing models focused on semantic correctness for complex visual documents. If you're scaling OCR workloads in production, LlamaParse handles the edge cases that benchmarks miss.

Read our full analysis: https://t.co/tcZP1PM8kv
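The formatting-sensitivity problem is easy to demonstrate: two renderings of the same table fail exact-match comparison but agree once reduced to their cell contents. This is a toy normalizer for illustration, not the benchmark's actual scorer:

```python
import re

# The same two-row table, once as Markdown and once as HTML.
markdown = "| Year | Revenue |\n| 2024 | $5M |"
html = ("<table><tr><td>Year</td><td>Revenue</td></tr>"
        "<tr><td>2024</td><td>$5M</td></tr></table>")

def cells(text):
    """Toy normalizer: reduce either rendering to its ordered cell contents."""
    text = re.sub(r"<[^>]+>", "|", text)  # turn HTML tags into separators
    return [c.strip() for c in re.split(r"[|\n]", text) if c.strip()]

exact_match = markdown == html                    # fails: formats differ
semantic_match = cells(markdown) == cells(html)   # passes: same content
```

A downstream agent consuming either rendering sees identical data, which is why exact-match scoring understates real parsing quality.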

Build a private equity deal sourcing agent that automatically classifies investment opportunities and extracts key financial metrics using our LlamaAgents Builder.

This step-by-step guide shows you how to create an agent that processes deal files like teasers and financial summaries:
- Classify deals into buyout, growth, or minority investment strategies
- Extract critical metrics including revenue, EBITDA, growth rates, and debt levels
- Deploy directly to GitHub and get a working UI without writing code
- Iterate and refine your agent through natural language conversations

The tutorial covers prompt engineering best practices, using example files effectively, visualizing agent workflows, and deploying to production. We demonstrate the complete process from initial prompt to testing the deployed application with real deal documents.

Read the full tutorial: https://t.co/WcT2j3nEoi
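Conceptually, the classification step maps extracted deal attributes to one of the three strategy labels. A crude rule-of-thumb stand-in (the real agent classifies with an LLM over the full deal file; these thresholds are invented for illustration):

```python
def classify_deal(ownership_pct, revenue_growth):
    """Toy strategy classifier. Majority control suggests a buyout;
    a high-growth minority stake suggests a growth deal; everything
    else defaults to a minority investment. Thresholds are invented."""
    if ownership_pct > 50:
        return "buyout"
    if revenue_growth > 0.30:
        return "growth"
    return "minority"

label = classify_deal(ownership_pct=80, revenue_growth=0.10)
```

The point is the interface: once the agent has extracted structured fields like ownership stake and growth rate from a teaser, strategy classification becomes a well-posed decision rather than free-text reading.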

Turn your PDF charts into pandas DataFrames with specialized chart parsing in LlamaParse!

This tutorial walks you through extracting structured data from charts and graphs in PDFs, then running data analysis with pandas, with no manual data entry required.
- Enable specialized chart parsing to convert visual charts into structured table data
- Extract table rows directly from parsed PDF pages and load them into DataFrames
- Perform year-over-year analysis, calculate gaps between metrics, and create visualizations
- Use the items view to get per-page structured data including tables and figures

We demonstrate this using a 2024 Executive Summary PDF, extracting a fiscal year chart showing Budget Deficit vs Net Operating Cost data spanning 2020-2024, and reproducing the key financial insights.

Check out the full tutorial: https://t.co/sOVtFM3xE1
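Once the chart rows are in a DataFrame, the year-over-year and gap analysis is ordinary pandas. The dollar values below are made-up placeholders, not the actual Treasury figures from the tutorial:

```python
import pandas as pd

# Rows as they might come back from the parsed chart's items view;
# values are invented placeholders (in $T), not the real chart data.
df = pd.DataFrame({
    "fiscal_year": [2020, 2021, 2022, 2023, 2024],
    "budget_deficit": [3.1, 2.8, 1.4, 1.7, 1.8],
    "net_operating_cost": [3.8, 3.1, 4.2, 3.4, 4.4],
})

# Year-over-year change in each metric, and the gap between them.
df["deficit_yoy"] = df["budget_deficit"].diff()
df["noc_yoy"] = df["net_operating_cost"].diff()
df["gap"] = df["net_operating_cost"] - df["budget_deficit"]
```

From here, `df.plot.bar(x="fiscal_year", y=["budget_deficit", "net_operating_cost"])` reproduces the original grouped bar chart as a sanity check.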
Since joining @llama_index, my focus has shifted from 'everything agents' to 'document agents': agents that can handle work over all manner of complex documents. So, I tried out the latest chart parsing capabilities of LlamaParse.

Charts in PDFs are notoriously painful to work with. You can see the data (bars, axes, labels), but actually getting it into a format you can analyze is a different matter.

I tried parsing a U.S. Treasury executive summary PDF, pulling a grouped bar chart showing Budget Deficit vs. Net Operating Cost for fiscal years 2020-2024, and turning it into a pandas DataFrame you can run analysis on (though really you can do whatever you like with it, such as handing it to an agent for downstream tasks).

Once parsed, the chart's underlying data comes back as a table in the items tree for that page. From there: grab the rows, construct a DataFrame, etc. In the example, I'm computing year-over-year changes in both metrics, measuring the gap between them across the five-year window, and, just to be sure, reproducing a bar chart that mirrors the original PDF visualization.

You can try it out here: https://t.co/8WHV4xzcDS
We put Opus 4.6 through our Hemingway-bench Writing Leaderboard. How did it fare? Claude continues to dominate GPT-5.2, but lags behind the Geminis.

The new writing hierarchy:
1. Gemini 3 Flash
2. Gemini 3 Pro
3. Opus 4.6 (New!)
4. Opus 4.5
5. GPT-5.2 Chat

For example: one H-bench prompt requests a cryptic Instagram post for casting auditions.

GPT-5.2: "Casting call? Never heard of her." (???)

Opus 4.6: "Currently accepting applications for professional liars, dramatic criers, and people who can walk through a door convincingly on the first take. You know who you are."
Another Hemingway-bench prompt asks for an oral presentation about time management.

GPT-5.2 writes like a LinkedIn engagement farm: "When people hear 'working from home,' they often think it means more freedom, more comfort, and maybe even more free time. And sometimes that's true. But what doesn't get talked about enough is how easily work-from-home life can get messy if you don't manage your time well."

Opus 4.6 feels like a charismatic creative working the room: "So... raise your hand if you've ever 'worked from home' and somehow ended up four hours into a Netflix series at 2 PM on a Tuesday. No judgment. We've all been there."
We've finally done it. Forbes just ranked our CEO *54* spots above Taylor Swift on their America's Greatest Innovators list. https://t.co/9h6OPZRQy9

While we're honored that Forbes thinks Edwin's strategy is more innovative than a 10-minute song about a scarf, we want to clarify a few things:
1. We will NOT be releasing our next benchmark as a limited-edition vinyl variant.
2. Jake was great in Zodiac.
3. We aren't saying we're better at songwriting, but we *are* saying we've never seen Taylor build an RL environment.

See you at next year's Grammys, @taylorswift13.

Everyone's building $100M "agentic" models, so we @HelloSurgeAI built a simulated company to see if they could actually hold down a job. Spoiler: they're all fired.

Welcome to EnterpriseBench -- CoreCraft edition. CoreCraft is a high-growth hardware startup (i.e., an RL environment) with 23 tools, 2,500 entities, and enough corporate red tape to make Harvey cry.

The best agent in the world (Opus 4.6, #1) scored under 30%. The #2 model (GPT-5.2) gave up because a search returned 10 results and it couldn't figure out how to change the date filter. Another one (Gemini 3 Flash, #9) literally made up a delivery date just to deny a customer's refund. Savage. (The new Gemini 3.1 Pro? Still lagging behind, at #3.)

My favorite: GPT-5.2 spent 11 tool calls curating a promotional email to help a customer reach Platinum tier... a tier she was already in. "Here are 3 items over $0 you can buy!"

"We would obviously never run ads in the way Anthropic depicts them..." -- thanks Sam.

The good news? We trained a model on this chaos and it got better at its job, even translating those skills to other benchmarks (e.g., +7.4% on Tau2-Bench Retail).

Check out the full EnterpriseBench: CoreCraft leaderboard below, and read about our RL environment and research!
Blog post: https://t.co/mv4I1dCtOC
Paper: https://t.co/EaOHmExm1r
Leaderboard: https://t.co/7fb6fewGIQ
Narrative Violation: "Job Postings For Software Engineers Are Rapidly Rising" https://t.co/yn2SkpZxPJ
Thrilled to share our review paper, out today in @NatureRevGenet: "Harnessing artificial intelligence to advance CRISPR-based genome editing technologies"

Full paper: https://t.co/ZBJcgDZduY

CRISPR has already changed medicine. AI is now changing CRISPR. We spent a long time mapping the full landscape of where machine learning and deep learning are having real, measurable impact across the genome editing workflow, and where the most exciting opportunities lie ahead.

Here's what we cover:

Guide RNA design: Deep learning models now predict on- and off-target activity for Cas9, Cas12, Cas13, and emerging systems like TnpB and IscB. We've gone from sequence heuristics to transformer-based models that generalize across organisms. Cell-type-specific generalization remains a frontier.

Base and prime editing: ML models predict bystander effects, product purity, and editing efficiency from sequence context alone. For prime editing, tools like PRIDICT and DeepPE have made pegRNA design far more tractable at scale.

Enzyme engineering: Protein language models (ESM, EVOLVEpro) are now guiding directed evolution of Cas proteins (expanding PAM compatibility, reducing immunogenicity, improving compactness) at a pace impossible through classical lab iteration alone.

Novel enzyme discovery: Foundation models trained on metagenomics are uncovering entirely new CRISPR systems from microbial diversity: new Cas variants, TnpB systems, and eukaryotic Fanzor proteins. The search space is enormous; AI is how we navigate it.

Virtual cell models: This is where I'm most excited. AI-powered virtual cells can, in principle, predict the functional consequences of any edit in any cell type: selecting targets, anticipating off-targets, modeling tissue-specific outcomes. But realizing this vision requires causally rich, contextually diverse perturbation data. Scale of data matters as much as scale of model.
Delivery: ML-guided LNP design is closing the last mile between an edit that works in a dish and one that works in a patient.

Across all of this, one theme recurs: AI accelerates where data is abundant and well-structured. The field's next challenge is generating that data at the right diversity and scale.

This paper was a true collaboration. Huge thanks to Tyler Thomson, Gen Li, Amy Strilchuk, @HAOTIANCUI1, and Bowen Li: you each brought something irreplaceable to this. Special shoutout to @BowenLi_Lab for his leadership in this work!

Have questions you'd like addressed during the meeting? Drop them here: https://t.co/4DXYuyzHkP
From desktop applications to national laboratory research, see what developers are building with Mojo!

This month's Community Meeting features GTK bindings with live GUI demos, Oak Ridge National Laboratory's GPU benchmark study comparing NVIDIA and AMD performance, and the 26.1 release including compile-time reflection and Apple Silicon GPU support. https://t.co/aral6XFkJZ
Modular has acquired @bentomlai!

10K+ orgs use BentoML for production AI, including 50+ Fortune 500 companies. We're pairing their deployment platform with MAX + Mojo's hardware optimization. BentoML stays open source (Apache 2.0), and we're doubling down on OSS in 2026.

Ask BentoML founder @chaoyu_ and @clattner_llvm anything on Feb 17 at 9:30am PT. Get all the details: https://t.co/lifotwMzR2