This month, we're in SF for @Official_GDC and in San Jose for @NVIDIAGTC with a new live demo of our real-time diffusion world model. If you want to see it running under real user input and tight latency constraints, meet us. https://t.co/QputPCxkyk
On one hand, the Anthropic team is a massive user of AI to write code (80%+ of all code deployed is written by Claude Code), and they ship amazingly fast. On the other hand, these beyond-terrible reliability numbers suggest there might be a downside to all that speed: https://t.co/9nYoH7KYOc
Looking for user feedback about the upcoming ggml official Debian and Ubuntu packages https://t.co/8lcGZzSgLK
New research just exposed the biggest lie in AI coding benchmarks. LLMs score 84-89% on standard coding tests. On real production code? 25-34%. That's not a gap. That's a different reality.

Here's what happened: researchers built a benchmark from actual open-source repositories: real classes with real dependencies, real type systems, real integration complexity. Then they tested the same models that dominate HumanEval leaderboards. The results were brutal.

The models weren't failing because the code was "harder." They were failing because it was *real*. Synthetic benchmarks test whether a model can write a self-contained function with a clean docstring. Production code requires understanding inheritance hierarchies, framework integrations, and project-specific utilities. Different universe. Same leaderboard score.

But it gets worse. A separate study ran 600,000 debugging experiments across 9 LLMs. They found a bug in a program. The LLM found it too. Then they renamed a variable. Added a comment. Shuffled function order. Changed nothing about the bug itself. The LLM couldn't find the same bug anymore. 78% of the time, cosmetic changes that don't affect program behavior completely broke the model's ability to debug. Function shuffling alone reduced debugging accuracy by 83%.

The models aren't reading code. They're pattern-matching against what code *looks like* in their training data.

A third study confirmed this from another angle: when researchers obfuscated real-world code (changing symbols, structure, and semantics while keeping functionality identical), LLM pass rates dropped by up to 62.5%. The researchers call this the "Specialist in Familiarity" problem. LLMs perform well on code they've memorized. The moment you show them something unfamiliar with the same logic, they collapse.

Three papers. Three different methodologies. Same conclusion: the benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding.
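The cosmetic-perturbation idea from the debugging study (rename a variable, change nothing about behavior) is easy to reproduce yourself. A minimal sketch using Python's stdlib `ast` module — the function and variable names here are made up for illustration, not from the papers:

```python
import ast

class RenameVars(ast.NodeTransformer):
    """Rename variables: a purely cosmetic, semantics-preserving edit."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rewrite variable references
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Rewrite function parameter names
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

src = """
def total(prices, rate):
    subtotal = sum(prices)
    return subtotal * (1 + rate)
"""

mapping = {"prices": "a", "rate": "b", "subtotal": "c"}
perturbed = ast.unparse(RenameVars(mapping).visit(ast.parse(src)))

# Behavior is identical even though the surface form changed
ns1, ns2 = {}, {}
exec(src, ns1)
exec(perturbed, ns2)
assert ns1["total"]([10, 20], 0.1) == ns2["total"]([10, 20], 0.1)
```

Feed the original and perturbed versions (with a seeded bug) to a model, and you can measure exactly the kind of fragility those 600,000 experiments found.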
If you're shipping code generated by LLMs into production without review, these numbers should concern you. If you're building developer tools, the question isn't "what's your HumanEval score." It's "what happens when the code doesn't look like the training data."
A big milestone for @MiniMax_AI! Open-weight models like M2.5 are beginning to handle agentic tasks people used to trust only to Opus or GPT.
@HamelHusain I love it. I have this in my global AGENTS.md to maximise the use of the questions tool (works in Claude, @opencode, @code, and @GitHubCopilot CLI). https://t.co/cPDwXHjwrP
do you need MCP for dev workflows? no, for the most part. it allows out-of-context data transforms and conserves context-window space. do enterprises need MCP? likely, specifically with respect to auth, which is a bad idea via a fully LLM-exposed CLI. do normies need MCP? yes, there's no other way to connect emails/etc. it's still a bad idea to let them use any old MCP, specifically stdio-based ones. it's like your grandma installing all those .exe email attachments.
GPT 5.3 Codex (xhigh) scores 79.3% and takes the lead on WeirdML, just ahead of Opus 4.6 (77.9%) at less than half the price. It is very solid across the board, but I still feel the peak performance of Gemini 3.1 is stronger. https://t.co/WRYosAStGY

@adrian_valentim Yeah, 95% of people misunderstand the tweet. I'm referring to gradient descent as a programmer (in the distributed representation space). In coding AI today, the LLM is the programmer, operating in the regular "text space". Ah well :)
@JohnHarper10070 Yes, in this intermediate state, you go faster if you can be more explicit and actually understand what the AI is doing on your behalf, and what the different tools are at its disposal, and what is hard and what is easy. It's not magic, it's delegation.
A lot of people quote-tweeted this as the 1-year anniversary of vibe coding. Some retrospective: I've had a Twitter account for 17 years now (omg) and I still can't predict my tweet engagement basically at all. This was a shower-thought throwaway tweet that I fired off without thinking, but somehow it minted a fitting name at the right moment for something a lot of people were feeling at the same time. So here we are: vibe coding is now mentioned on my Wikipedia page as a major memetic "contribution", and its own article is even longer. lol

The one thing I'd add is that at the time, LLM capability was low enough that you'd mostly use vibe coding for fun throwaway projects, demos, and explorations. It was good fun and it almost worked. Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny. The goal is to claim the leverage from the use of agents without any compromise on the quality of the software.

Many people have tried to come up with a better name to differentiate this from vibe coding; personally my current favorite is "agentic engineering":
- "agentic" because the new default is that you are not writing the code directly 99% of the time; you are orchestrating agents who do, and acting as oversight.
- "engineering" to emphasize that there is an art & science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind.

In 2026, we're likely to see continued improvements in both the model layer and the new agent layer. I feel excited about the product of the two and another year of progress.
won 1st place at the @OpenAI codex hackathon! i built StoryWorld, a 3D movie studio in your pocket. made with iOS ARKit + RealityKit, @DeemosTech Rodin, and @fal https://t.co/fWIKy6sCZQ
OpenClaw 2026.3.1
- OpenAI WebSocket streaming
- Claude 4.6 adaptive thinking
- Better Docker and native K8s support
- Discord threads, TG DM topics, Feishu fixes
- Agent-powered visual diffs plugin
Reports of our death were greatly exaggerated. https://t.co/ISJH09of5U
We've officially open-sourced memU bot. It's not a chatbot that waits for commands. It's a proactive assistant that understands you, remembers you, and gradually becomes more aligned with how you work. Runs locally. Built on the memU memory framework. GitHub: https://t.co/pmOnl5czYs Feel free to explore it, try it out, share feedback, and help us improve it together.
The "Visual Explainer" agent skill just crossed 3.5K stars on GitHub. Just updated with: /generate-visual-plan slash command for more structured plan specs, code-block patterns, typography polish, mermaid fixes, anti-slop guardrails https://t.co/qzde42tVEV
The fundamental issue with PDF parsing is that PDFs are designed for display. The internal representation of the data is instructions for drawing shapes at specific coordinates on the page (e.g. "render this string at coordinate (84, 720) with this font"). The characters of a displayed word may not be contiguous at all, and there may be no font mapping back to Unicode, so you have no idea what a character even is. Any PDF parser needs to magically reconstruct this scattered display-coordinate data into semantically meaningful text, tables, and more. VLMs do help (screenshot the page and read it), but besides collapsing the metadata they still struggle in terms of accuracy and cost. Note: Word/PPTX files store text representations, so they're typically a bit easier to read. Our entire company at @llama_index is laser-focused on PDF parsing, so we've been really trying to understand all the nuances of doc formats, especially PDFs. More notes on this coming soon
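To make the reconstruction problem concrete: once a parser has decoded the content stream, what it holds is essentially a bag of positioned glyphs, and reading order has to be inferred geometrically. A toy sketch (the glyph data and function names are invented for illustration; real parsers also handle rotation, kerning, columns, and broken font encodings):

```python
# Hypothetical extracted glyphs: (x, y, char) tuples in arbitrary stream order.
glyphs = [
    (84, 720, "H"), (98, 720, "i"),
    (84, 700, "P"), (96, 700, "D"), (108, 700, "F"),
    (120, 720, "!"),
]

def reconstruct(glyphs, line_tol=3):
    """Group glyphs into lines by y coordinate, then sort each line by x."""
    lines = {}
    for x, y, ch in glyphs:
        # Snap to the nearest existing line if within tolerance
        key = min(lines, key=lambda k: abs(k - y), default=None)
        if key is None or abs(key - y) > line_tol:
            key = y
            lines.setdefault(key, [])
        lines[key].append((x, ch))
    out = []
    for y in sorted(lines, reverse=True):  # PDF y grows upward: top of page first
        out.append("".join(ch for x, ch in sorted(lines[y])))
    return "\n".join(out)

print(reconstruct(glyphs))  # prints "Hi!" on one line, "PDF" on the next
```

Even this tiny heuristic shows why it's hard: every threshold (line tolerance, column gaps, reading direction) is a guess about layout that the PDF itself never states.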
"Build me Perplexity Finance but for Pokemon cards. Make no mistakes." Computer:
- researched Pokemon card APIs on its own
- wrote 5,000 lines of React + Python
- debugged itself using browser devtools
- deployed and pushed to GitHub
(built by u/NoSquirrel4840 on Reddit) https://t.co/kLBQnyA2Vk
He's not kidding. Took me HALF AN HOUR to vibe code Notion with Perplexity Computer. Software is legit a zero. https://t.co/eBbIDQsNRI
Ok, this is insane... I've just built the most comprehensive RAG system (a UX knowledge base) for my projects with @perplexity_ai:
> Instant, research-backed best practices (548 items) for design
> 10X the output quality for Project Aristotle with a grounded knowledge layer: https://t.co/ko1oELOvaA
> Ability to present design decisions to stakeholders with cited rationale and data.
Data is the new oil. Already shared it with those who pre-purchased @AgenticUi in January as a token of appreciation for the support.
Video generation models are improving fast: real-time autoregressive models now deliver high quality at low latency, and they're quickly being adopted for world models and robotics applications. So what's the problem? They're still too slow on consumer hardware. What if we told you that we can get true real-time 16 FPS video generation on a single RTX 5090? (1.5-12x over FA 2/3/4 on 5090, H100, B200) Today we release MonarchRT, an efficient video attention that parameterizes attention maps as (tiled) Monarch matrices and delivers real E2E gains.
Paper: https://t.co/d1AAMIseow
Website: https://t.co/41mqriKekx
GitHub: https://t.co/hp5iJttviA
1/n
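For background on the structure being named here: a Monarch matrix replaces a dense n x n matrix with a product of block-diagonal factors interleaved with a fixed stride permutation, cutting parameters and FLOPs from O(n^2) to O(n^1.5). This is a minimal pure-Python sketch of one common formulation (y = P^T L P R x, with L and R block-diagonal); it is generic background, not MonarchRT's actual kernels, and all function names are mine:

```python
def blockdiag_apply(blocks, x):
    """Apply a block-diagonal matrix (a list of b x b blocks) to vector x."""
    b = len(blocks[0])
    y = []
    for k, block in enumerate(blocks):
        seg = x[k * b:(k + 1) * b]
        for row in block:
            y.append(sum(w * v for w, v in zip(row, seg)))
    return y

def transpose_perm(x, m):
    """Stride permutation P: view length m*m vector as an m x m matrix, transpose."""
    return [x[j * m + i] for i in range(m) for j in range(m)]

def monarch_apply(L, R, x, m):
    """Compute y = P^T L P R x for block-diagonal L, R with m blocks of size m."""
    y = blockdiag_apply(R, x)  # right block-diagonal factor
    y = transpose_perm(y, m)   # permutation P
    y = blockdiag_apply(L, y)  # left block-diagonal factor
    y = transpose_perm(y, m)   # P^T (this permutation is its own inverse)
    return y

# Identity blocks recover the input, confirming the factors compose correctly
I = [[1, 0], [0, 1]]
assert monarch_apply([I, I], [I, I], [1, 2, 3, 4], 2) == [1, 2, 3, 4]
```

For n = m^2, the two factors hold 2m * m^2 = 2n^1.5 parameters versus n^2 dense, which is where the speedups come from when attention maps are constrained to this form.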
Introducing LlamaBarn, a tiny macOS menu bar app for running local LLMs. Open source, built on llama.cpp https://t.co/F1Z3DVl9Kg
This is very impactful: you can now distill frontier performance into small models that are specialized to private repositories. Companies can quickly and cheaply train on their data and have super-efficient deployments of 32B agents. https://t.co/03jsS6cWJ3