We don't expect LLMs to multiply numbers or sort lists directly within their output token stream. Instead, we ask them to emit code and execute it in a separate runtime. Why predict the opposite outcome for simulating interactive worlds? https://t.co/b2QNOBTWjN
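The division of labor described above can be sketched as a minimal "model emits code, a separate runtime executes it" loop. This is a toy sandbox under assumptions of mine (the helper name `run_generated_code` is invented here; real systems add resource limits and filesystem/network isolation):

```python
import subprocess
import sys

def run_generated_code(code: str, timeout: float = 5.0) -> str:
    """Execute model-emitted code in a separate interpreter process
    and return whatever it prints. Toy sandbox only."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return result.stdout.strip()

# Instead of multiplying inside its token stream, the model emits
# code like this, and the runtime does the arithmetic:
emitted = "print(1234 * 5678)"
print(run_generated_code(emitted))  # → 7006652
```

The same pattern generalizes: the model's output is a program, and correctness comes from the executor, not from token-by-token prediction.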
> 385ms average tool selection.
> 67 tools across 13 MCP servers.
> 14.5GB memory footprint.
> Zero network calls.

LocalCowork is an AI agent that runs on a MacBook. Open source. 🧵 https://t.co/bnXupspSXc
already on mlx :) https://t.co/NXxd7hAWMh
Meet GLM-4.7-Flash-Claude-Opus-4.5-High-Reasoning-Distill: a distilled powerhouse that brings elite reasoning to local machines. This GGUF model delivers Claude-level intelligence in a compact package, perfect for developers wanting high-performance AI without cloud costs. https://t.co/Q0HCPTI2oe
Claude Code just launched Voice Mode. You speak. The AI agent codes. "/voice" to activate. Rolling out to 5% of users now, expanding over the coming weeks. Today: KREA AI Voice on the iPad. Claude Code Voice in the terminal. The era of voice programming has arrived. https://t.co/9adiksDX0r
Comprehensive Python API for Google NotebookLM. Full programmatic access to NotebookLM's features, including capabilities the web UI doesn't expose, from Python or the command line. https://t.co/5YQhAKiGuD
🚀 Introducing the Qwen 3.5 Small Model Series

Qwen3.5-0.8B · Qwen3.5-2B · Qwen3.5-4B · Qwen3.5-9B

✨ More intelligence, less compute.

These small models are built on the same Qwen3.5 foundation: native multimodal, improved architecture, scaled RL.
• 0.8B / 2B: tiny, fast, great for edge devices
• 4B: a surprisingly strong multimodal base for lightweight agents
• 9B: compact, but already closing the gap with much larger models

And yes, we're also releasing the Base models. We hope this better supports research, experimentation, and real-world industrial innovation.

Hugging Face: https://t.co/wFMdX5pDjU
ModelScope: https://t.co/9NGXcIdCWI

A trillion-parameter model just made half its brain disappear. It got smarter.

Yuan3.0 Ultra is a new open-source multimodal MoE model from Yuan Lab. 1010B total parameters, only 68.8B active at inference. It beat GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on RAG benchmarks by wide margins. 67.4% on Docmatix vs GPT-4o's 56.8%.

Here's what it unlocks:
> Enterprise RAG with 68.2% avg accuracy across 10 retrieval tasks
> Complex table understanding at 62.3% on MMTab
> Text-to-SQL generation scoring 83.9% on Spider 1.0
> Multimodal doc analysis with a 64K context window

The key innovation: Layer-Adaptive Expert Pruning (LAEP). During pretraining, expert token loads become wildly imbalanced; some experts get 500x more tokens than others. LAEP prunes the underused ones layer by layer, cutting 33% of parameters while boosting training efficiency by 49%.

They also refined "fast-thinking" RL: correct answers with fewer reasoning steps get rewarded more. This cut output tokens by 14.38% while improving accuracy by 16.33%.

The bigger signal here: MoE models are learning to self-compress during training, not after. If pruning becomes part of pretraining, the cost curve for trillion-scale models shifts dramatically.
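The core mechanic of load-based expert pruning can be sketched in a few lines. The real LAEP recipe isn't reproduced here, so treat this as an assumption-laden toy: per layer, rank experts by how many tokens the router sent them and keep only the most-used fraction.

```python
def prune_underused_experts(token_loads, keep_fraction=0.67):
    """Toy sketch of layer-adaptive expert pruning (invented here,
    not the published LAEP algorithm): per MoE layer, keep only the
    experts that received the most router traffic."""
    kept = []
    for layer_loads in token_loads:  # one list of token counts per layer
        n_keep = max(1, round(len(layer_loads) * keep_fraction))
        order = sorted(range(len(layer_loads)),
                       key=lambda e: layer_loads[e], reverse=True)
        kept.append(sorted(order[:n_keep]))  # expert indices to retain
    return kept

# Two layers, 6 experts each; the loads are wildly imbalanced,
# mirroring the 500x skew the post describes.
loads = [[5000, 10, 4800, 9, 3000, 8],
         [12, 7000, 11, 6500, 10, 4000]]
print(prune_underused_experts(loads))
# → [[0, 1, 2, 4], [0, 1, 3, 5]]
```

Keeping roughly 67% of experts per layer corresponds to the 33% parameter cut quoted above; the interesting part of the paper's claim is doing this *during* pretraining rather than as a post-hoc compression pass.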
We're introducing Cursor Automations to build always-on agents. https://t.co/uxgTbncJlM
Pay close attention to proactive AI agents. This is one of the wildest applications of agent harnesses I've seen.

The MIT paper introduces NeuroSkill, a real-time agentic system that models human cognitive and emotional state by integrating Brain-Computer Interface signals with foundation models. The "Human State of Mind" is provided via SKILL.md. The system runs fully offline on the edge. Its NeuroLoop harness enables agentic workflows that engage users across cognitive and emotional levels, responding to both explicit and implicit requests through actionable tool calls.

Why does it matter? Most AI agents respond only to explicit user requests. NeuroSkill explores the frontier of proactive agents that sense and respond to implicit human states, opening new possibilities for adaptive human-AI interaction.

Paper: https://t.co/kO3Ie2Dbvz

Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX
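The "respond to implicit requests" idea reduces to a policy that maps an inferred state to a tool call instead of waiting for a prompt. A minimal sketch, assuming hypothetical state fields and tool names (the paper's NeuroLoop internals are not reproduced here):

```python
def proactive_step(state, tools):
    """Toy proactive-agent policy: act on an inferred cognitive or
    emotional state rather than an explicit user request. The state
    keys and tool names below are assumptions for illustration."""
    if state["cognitive_load"] > 0.8:
        return tools["summarize_context"]()  # user seems overloaded
    if state["frustration"] > 0.7:
        return tools["offer_break"]()        # user seems frustrated
    return None  # no implicit need detected; stay quiet

tools = {
    "summarize_context": lambda: "summarized",
    "offer_break": lambda: "break suggested",
}
print(proactive_step({"cognitive_load": 0.9, "frustration": 0.2}, tools))
# → summarized
```

The hard part in practice is the state estimation (here faked as a dict); the harness itself is just this loop run continuously.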
Interesting new research on LLM agent memory. Agent engineers, pay attention to this one. (bookmark it)

It introduces a diagnostic framework that separates retrieval failures from utilization failures in agent memory systems.

The main findings:
- Retrieval method matters far more than how you write memories.
- Accuracy varies 20 percentage points across retrieval approaches but only 3-8 points across writing strategies.
- Simple raw chunking matches or outperforms expensive alternatives like Mem0-style fact extraction or MemGPT-style summarization.

Teams investing heavily in sophisticated memory writing pipelines may be optimizing the wrong thing. Improving retrieval quality yields larger gains than increasing write-time sophistication.

Paper: https://t.co/ZZvtsJXIJp

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
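"Raw chunking plus a decent retriever" is a deliberately simple baseline, and it can be sketched end to end. This is a toy (word-overlap scoring stands in for embeddings; a real system would use a vector index):

```python
def chunk(text, size=50):
    """Raw fixed-size chunking: no fact extraction, no summarization,
    just split the memory stream into word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks, k=1):
    """Toy retriever: rank chunks by word overlap with the query.
    The paper's point is that retrieval quality dominates write-time
    sophistication; swap this scorer for embeddings in practice."""
    q = set(query.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q & set(c.lower().split())),
                    reverse=True)
    return scored[:k]

memory = ["the user prefers dark mode and vim keybindings",
          "the deploy runs on fridays at noon"]
print(retrieve("when does the deploy run", memory))
```

If the findings hold, effort spent improving `retrieve` pays off far more than effort spent making `chunk` cleverer.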

Most agents don't fail on models… they fail on context: those ugly, messy, complex documents that trip up even the latest LLMs (PDFs, tables, messy scans). Don't worry. We got you. 🚀

VC-backed (seed+) startup? Join the LlamaParse Startup Program:
✅ free credits
✅ dedicated Slack channel + priority support
✅ alignment call with our founder Jerry Liu
✅ community spotlight (millions of devs)
✅ production-ready ingestion pipelines

Apply today; spots are limited → https://t.co/61csPhQULp
LlamaIndex has evolved far beyond a RAG framework - we're now focused on agentic document processing that automates knowledge work.

• Agent orchestration has fundamentally changed with sophisticated reasoning loops, tool discovery through Skills/MCP, and coding agents that write Python for you
• Document understanding remains a massive opportunity - frontier vision models still struggle with complex tables, charts, and long documents at scale
• LlamaParse now serves 300k+ users across 50+ formats for enterprises like @OneCarlyle, @CEMEX, and @KPMG, with multi-agent workflows combining OCR, computer vision, and LLM reasoning
• Real automation potential exists in workflows where humans manually process documents daily - financial analysis, contract review, and insurance underwriting can all become end-to-end agentic processes

Our mission is now providing core infrastructure to automate knowledge work over documents, not just being connective tissue between LLMs and data.

Read about our evolution and what's next: https://t.co/M0DbsIdGrF

Adobe Acrobat has PDF splitting. We have agentic PDF splitting 🤖✂️

Simply define the categories you want in natural language, and our split agent will automatically "chunk" the document into subsets of pages and tag them with the appropriate categories. This is super useful for breaking apart complicated document packets like resumes, tax forms, identification docs, expense reports, and more.

Check out @itsclelia's video below, and sign up for LlamaParse if you're interested!

Docs: https://t.co/UdxT3sJfkF
LlamaParse: https://t.co/TqP6OT5U5O
I love the Big Arch Burger π I also love Big Harnessesβ’ and Big Complex PDFsβ’ with hundreds of pages of tables, images and forms. https://t.co/deD8sUcyj0
MAX is how Modular is rethinking the AI stack from first principles, bringing together modeling, performance, and portability in one open framework. Hear directly from our co-founder and CEO @clattner_llvm on why the stack needs to evolve and what that means for the future of AI infrastructure.
You shouldn't have to choose between peak GPU performance and code you can actually maintain. We built Structured Mojo 🔥 Kernels to fix that. Performance, usability, and portability without the tradeoff. 14k to 7k lines. ~1.8k TFLOPS held. We wrote a 4-part series on how. Part 1 is up https://t.co/zMYWMfDOb2
Given the GDPval benchmark for GPT-5.4, I've updated this chart: the new model ties or beats humans, as judged by other experts, at professional tasks 82% of the time. If you give a 7-hour task to AI, even with failure rates and the need to check results, you'd save 4h 38m on average https://t.co/U4PQSArQo2
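The chart's exact methodology isn't shown, but a back-of-envelope model gets into the same ballpark. Assumed here (my assumptions, not the source's): you always spend some fixed review time checking the AI's output, and with probability equal to the win rate you avoid doing the task yourself; the ~1.1h review figure is a guess chosen to land near the quoted number, not a published parameter.

```python
def expected_hours_saved(task_hours, win_rate, review_hours):
    """Back-of-envelope only. Model: time saved = win_rate * task_hours
    (the fraction of tasks you don't have to redo) minus the review
    overhead you always pay. All parameters are assumptions."""
    return win_rate * task_hours - review_hours

# 7h task, 82% win rate, ~1.1h of human review per attempt:
saved = expected_hours_saved(7, 0.82, 1.1)
print(round(saved, 2))  # → 4.64, i.e. roughly 4h 38m
```

Under different assumptions (e.g. partial credit for failed attempts, or review time scaling with task length) the number moves, so treat this strictly as a sanity check on the order of magnitude.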
This month, weβre in SF for @Official_GDC and in San Jose for @NVIDIAGTC with a new live demo of our real-time diffusion world model. If you want to see it running under real user input and tight latency constraints, meet us. https://t.co/QputPCxkyk
On one hand, the Anthropic team is a massive user of AI to write code (80%+ of all deployed code is written by Claude Code). They ship amazingly fast. On the other hand, these beyond-terrible reliability numbers suggest there might be a downside to all this speed: https://t.co/9nYoH7KYOc
Looking for user feedback about the upcoming ggml official Debian and Ubuntu packages https://t.co/8lcGZzSgLK
New research just exposed the biggest lie in AI coding benchmarks.

LLMs score 84-89% on standard coding tests. On real production code? 25-34%. That's not a gap. That's a different reality.

Here's what happened: researchers built a benchmark from actual open-source repositories: real classes with real dependencies, real type systems, real integration complexity. Then they tested the same models that dominate HumanEval leaderboards. The results were brutal.

The models weren't failing because the code was "harder." They were failing because it was *real*. Synthetic benchmarks test whether a model can write a self-contained function with a clean docstring. Production code requires understanding inheritance hierarchies, framework integrations, and project-specific utilities. Different universe. Same leaderboard score.

But it gets worse. A separate study ran 600,000 debugging experiments across 9 LLMs. They found a bug in a program. The LLM found it too. Then they renamed a variable. Added a comment. Shuffled function order. Changed nothing about the bug itself. The LLM couldn't find the same bug anymore.

78% of the time, cosmetic changes that don't affect program behavior completely broke the model's ability to debug. Function shuffling alone reduced debugging accuracy by 83%. The models aren't reading code. They're pattern-matching against what code *looks like* in their training data.

A third study confirmed this from another angle: when researchers obfuscated real-world code, changing symbols, structure, and semantics while keeping functionality identical, LLM pass rates dropped by up to 62.5%. The researchers call this the "Specialist in Familiarity" problem: LLMs perform well on code they've memorized. The moment you show them something unfamiliar with the same logic, they collapse.

Three papers. Three different methodologies. Same conclusion: the benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding.
If you're shipping code generated by LLMs into production without review, these numbers should concern you. If you're building developer tools, the question isn't "what's your HumanEval score." It's "what happens when the code doesn't look like the training data."
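The mutation trick the debugging studies rely on is easy to reproduce yourself: rewrite a program's surface form without touching its behavior, then check whether your tool still finds the bug. A minimal sketch using Python's `ast` module (variable renaming only; the studies also shuffle functions and add comments):

```python
import ast

class Renamer(ast.NodeTransformer):
    """Rename local variables without changing behavior, one of the
    cosmetic mutations described above."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

def rename_vars(source, mapping):
    """Return semantically identical source with renamed locals."""
    return ast.unparse(Renamer(mapping).visit(ast.parse(source)))

# Same off-by-one bug (dividing by len - 1), different surface form.
buggy = "def avg(xs):\n    total = sum(xs)\n    return total / (len(xs) - 1)\n"
print(rename_vars(buggy, {"total": "acc"}))
```

A model that truly reads the code should flag the bug in both versions; the studies found that models frequently stop flagging it after exactly this kind of no-op edit.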
A big milestone for @MiniMax_AI! Open-weight models like M2.5 are beginning to handle agentic tasks people used to trust only to Opus or GPT.