The core mental model for AI coding goes like this: 💻 CLI: Prove value quickly with low ceremony. Get unstuck and scaffold. 🛠️ IDE: Move to your editor when precision matters to shape and refine your logic. 🚀 GitHub: Commit, open a PR, review, and ship.
@alex_prompter Agents need heavy babysitting, and that's fine unless you insist on selling the idea that they are autonomous agents. They are simply not. The main problem with AI is not the technology but the narratives AI labs create on top of it to keep speculators' money from drying up.
@skill_evolve @alex_prompter "Agent" is largely a marketing term. In practice, what people usually mean is a prompt wrapped in a loop. It's about as crude as it sounds, and it's a fragile setup that's likely to break sooner or later, because the underlying premise was always pretty shaky. Leading AI labs, including Anthropic, know full well that current models are unreliable; third-party tests show a staggering 97% failure rate on digital tasks. Pause and let that sink in. Silicon Valley has always lived in a bubble. Today, its recklessness threatens the entire economy, and our systems aren't ready to cope. Brace yourself. Ask yourself: why do we take AI labs at their word about their own technology? Scrutiny isn't anti-innovation; it's pro-accountability. https://t.co/Ut4hpvTU3C
@ivanburazin There are no real "agents", just software making calls to APIs. Once these systems start interfacing with LLMs, things can quickly go off the rails: resources get wasted and silent failures accumulate over time. No amount of harnesses, clever prompting, or orchestration can fully shield you from inherently non-deterministic behavior. Just make sure everyone gets that.
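The "prompt wrapped in a loop" claim from the thread above can be made concrete. This is a minimal, hypothetical sketch (the `call_llm` stub and `TOOLS` registry are stand-ins, not any real product's API) showing the skeleton most agent frameworks reduce to, including the silent-failure and give-up paths the replies complain about:

```python
# Minimal sketch of "a prompt wrapped in a loop" -- the skeleton most
# "agents" reduce to. call_llm and TOOLS are hypothetical stand-ins.

def call_llm(messages):
    # Placeholder for a real model API call; returns a canned action here.
    return {"tool": "done", "args": {"answer": "stub"}}

TOOLS = {}  # name -> callable; a real agent registers search, edit, etc.

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_llm(messages)           # model picks the next action
        if action["tool"] == "done":          # model claims it is finished
            return action["args"]["answer"]
        result = TOOLS[action["tool"]](**action["args"])  # may fail silently
        messages.append({"role": "tool", "content": str(result)})
    return None  # loop gave up: the fragile-setup failure mode
```

Everything interesting (and everything brittle) lives in what `call_llm` returns on each pass through the loop.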
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
We present a research preview of Self-Flow: a scalable approach for training multi-modal generative models. Multi-modal generation requires end-to-end learning across modalities (image, video, audio, text) without being limited by external models for representation learning. Self-Flow addresses this with self-supervised flow matching that scales efficiently across modalities. Results:
• Up to 2.8x faster convergence across modalities
• Improved temporal consistency in video
• Sharper text rendering and typography
This is foundational research for our path towards multimodal visual intelligence.
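Self-Flow's actual training recipe isn't published in the post, but the flow-matching objective it builds on is standard and worth seeing. A minimal pure-Python sketch (toy vectors, not a real model): interpolate between a noise sample and a data sample, and regress the model's predicted velocity toward the straight-line target.

```python
import random

# Generic flow-matching objective (a sketch, not Self-Flow's recipe):
# x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.

def flow_matching_loss(model, x1, rng):
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]              # noise sample
    t = rng.random()                                     # time in [0, 1]
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]   # interpolant
    v_target = [b - a for a, b in zip(x0, x1)]           # target velocity
    v_pred = model(xt, t)                                # predicted velocity
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)

rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(8)]             # one toy data vector
loss = flow_matching_loss(lambda xt, t: [0.0] * len(xt), x1, rng)
```

A zero-velocity model gets a positive loss; a perfect model would predict `x1 - x0` and drive it to zero.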
Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels. Learn more: https://t.co/r2WqASIhWG @KaimingCheng @marksaroufim https://t.co/OrtOp9boum
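KernelAgent's pipeline is described in the linked post; the general shape of a hardware-signal-in-the-loop optimizer can be sketched generically (this is an illustration, not KernelAgent's code): benchmark each candidate, and let the measured timing drive which variant survives.

```python
import time

# Generic closed-loop tuning sketch: the measured runtime is the
# feedback signal an agent would condition its next rewrite on.

def bench(fn, reps=100):
    start = time.perf_counter()
    for _ in range(reps):
        fn()
    return (time.perf_counter() - start) / reps

def tune(candidates):
    timings = {name: bench(fn) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)  # performance signal -> decision
    return best, timings

# Two toy "kernels": summing a range two ways.
cands = {
    "python_loop": lambda: sum(i for i in range(1000)),
    "builtin_sum": lambda: sum(range(1000)),
}
best, timings = tune(cands)
```

In a real kernel-tuning loop, `bench` would launch the Triton kernel on the GPU and the agent would propose the next candidate instead of choosing from a fixed set.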
To improve fine-tuning data efficiency, replay generic pre-training data. Not only does this reduce forgetting, it actually improves performance on the fine-tuning domain, especially when fine-tuning data is scarce in pre-training. (w/ @percyliang) https://t.co/ClGPAUlPqQ
Normally, replaying old data reduces forgetting, but it actually helps you learn on new data too! We finally put this paper out on arXiv, but had it up as a Marin GitHub issue ~1 year ago: https://t.co/MNevf6XjvC
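The mechanics of replay are simple to sketch. The exact mixture ratio is the paper's subject, so the `replay_frac` below is an illustrative knob, not a recommended value: each fine-tuning batch gets a few generic pre-training examples mixed in.

```python
import random

# Sketch of replay during fine-tuning: each batch mixes fine-tuning
# examples with sampled pre-training examples. replay_frac is
# illustrative, not the paper's prescribed value.

def replay_batches(finetune_data, pretrain_data, batch_size=8,
                   replay_frac=0.25, seed=0):
    rng = random.Random(seed)
    n_replay = int(batch_size * replay_frac)   # replayed slots per batch
    n_ft = batch_size - n_replay               # fine-tuning slots per batch
    for i in range(0, len(finetune_data), n_ft):
        batch = finetune_data[i:i + n_ft]
        batch += rng.sample(pretrain_data, min(n_replay, len(pretrain_data)))
        yield batch

ft = [f"ft-{i}" for i in range(12)]
pt = [f"pt-{i}" for i in range(100)]
batches = list(replay_batches(ft, pt))
```

With `batch_size=8` and `replay_frac=0.25`, each batch carries 6 fine-tuning examples and 2 replayed pre-training examples.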
@forgecodehq is the #1 coding agent today. We have 78.4% accuracy on TermBench 2.0, the global benchmark for terminal-based coding agents. We did this with a very small team. You could almost count us on a horse's hoof! It took a lot of time, and at many points it felt like we were chasing a goal that kept moving faster than us. These benchmarks are quite literally the Olympics of AI, where the world record resets every day. For now though, we're the champions here. And we're not stopping. More on how we got the score in our blog (link below). This is just the beginning.
We just shipped the Truesight MCP and open source agent skills. This means you can create, manage, and run AI evaluations anywhere you use an AI assistant. Coding editor, chat window, CLI. If it supports MCP, Truesight works there.

Nobody ships software without tests anymore. Once AI made them nearly free to write, there was no excuse. You lock in what you expect, they run every time you push code, and you know if something broke before you deploy. AI evaluations are the same idea for AI features, but most teams still treat them as something separate. Evaluation lives in a different tool, a different part of the day. So people skip it. And bad AI ships to production.

Truesight's MCP collapses that loop. You set your quality bar in natural language and Truesight turns it into evals your AI assistant runs while you build.

Updated your AI agent's system prompt? "Run both versions through our instruction-following eval and tell me if my AI agent regressed." Done in seconds, right where you're working.

Need a new eval? "Build me a custom eval that checks whether our customer support AI agent is correctly identifying user intent and escalating when it should." It walks you through the full setup and deploys a live endpoint your coding agent can use immediately.

Or something simpler: "Run this marketing draft through the humanizer eval and flag anything that reads like AI wrote it." Scores the text, tells you what to fix.

The skills are what matter most here. Many MCPs ship tools and leave it to the user to figure out the workflow. Fine for simple integrations. But evaluation has real sequencing complexity. Build eval criteria before looking at your data? You'll measure the wrong things. Deploy to production before testing on a sample? You'll drown in false flags. We built agent skills that walk your coding assistant through the right workflow for each task, whether that's scoring traces, running error analysis, or building a custom eval from scratch.
An orchestrator skill routes to the right one based on what you ask. You don't need to memorize anything. Skills install via the Claude Plugin Marketplace or a one-liner curl script. MIT licensed. Setup is about 2 minutes:
1. Create a platform API key in Truesight Settings
2. Paste the MCP config into your client
3. Install the skills
4. Start evaluating
If you're already a Truesight user, this is live now. Connect your client and your existing evaluations work through the MCP immediately. If you're building AI systems and want to try this, sign up at https://t.co/Q1c8bVkSOi
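The actual config to paste in step 2 isn't shown in the post. For orientation, a remote MCP server entry in a typical client config looks roughly like this (the server URL and the `TRUESIGHT_API_KEY` variable name are hypothetical placeholders, not Truesight's documented values):

```json
{
  "mcpServers": {
    "truesight": {
      "url": "https://mcp.truesight.example/mcp",
      "headers": {
        "Authorization": "Bearer ${TRUESIGHT_API_KEY}"
      }
    }
  }
}
```

Use the config from Truesight Settings rather than this sketch; the shape varies by MCP client.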
Today we're launching local scheduled tasks in Claude Code desktop. Create a schedule for tasks that you want to run regularly. They'll run as long as your computer is awake. https://t.co/15AYd0NHqR
Cursor with Kimi K2.5. Don't sleep on this combo. From a prompt to a personal HN feed in about 60 seconds. The future of building is going to be so wild. With faster models, you can quickly iterate on more ideas, while improving quality. https://t.co/WOYFcCBqM7
SkillNet: Create, Evaluate, and Connect AI Skills. Paper: https://t.co/k9gIkLsgPE https://t.co/5tAkG7AVGt

Yuan3.0 Ultra 🔥 A 1T multimodal LLM from YuanLab https://t.co/6hleo11DtL
✨ 64K context
✨ Enterprise-ready: RAG, summarization, Text-to-SQL
✨ 103-layer MoE w/ LAEP (49% efficiency boost)
https://t.co/ZxWi0yazAC

Used the @GitHub Copilot SDK to let the Copilot CLI call me when it needs my opinion on something. https://t.co/6sdhk8sIW7
We partnered with Mozilla to test Claude's ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025. https://t.co/It1uq5ATn9
New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about eval integrity in web-enabled environments. Read more: https://t.co/oVCNyaiK5w
Talking to a voice AI LLM over ham radio (on UHF 420.69 megahertz, of course!) (Note: cool experiment, but be careful: FCC regs require a licensed control operator to be present at the control point the entire time the LLM is operating.) https://t.co/S2WcCrkp83
70 hackers joined us in SF for the first-ever World Labs Hackathon. In just 3.5 hours, 32 teams used Marble for projects ranging from robotics sims and agents to AR/VR interfaces, games, art experiences, and real estate tools. Check out what they built → https://t.co/cX0bAlvhh1

New research on evaluating coding agents via continuous integration. Coding agents are moving beyond isolated bug fixes. If they're going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance. Most coding agent benchmarks today test whether an agent can fix a single issue. But real software engineering involves maintaining entire codebases over time. SWE-CI evaluates agent capabilities through continuous integration workflows: running test suites, catching regressions, and maintaining code quality across multiple changes. Paper: https://t.co/p8bOTJ9QPX Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
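SWE-CI's harness lives in the paper, but the core idea (score an agent's change by running the suite before and after and flagging pass-to-fail flips) can be sketched in a few lines. Everything below is a toy illustration, not the benchmark's code:

```python
# CI-style agent evaluation sketch (not the SWE-CI harness): a test that
# flipped from pass to fail is a regression; pass rate is the score.

def run_suite(tests, codebase):
    # Each test is a callable taking the codebase and returning pass/fail.
    return {name: test(codebase) for name, test in tests.items()}

def evaluate_change(tests, before, after):
    pre = run_suite(tests, before)
    post = run_suite(tests, after)
    regressions = [n for n in tests if pre[n] and not post[n]]
    score = sum(post.values()) / len(post)
    return {"score": score, "regressions": regressions}

# Toy codebase: a dict of functions. The "agent" fixed add but broke mul.
before = {"add": lambda a, b: a - b, "mul": lambda a, b: a * b}
after  = {"add": lambda a, b: a + b, "mul": lambda a, b: a + b}
tests = {
    "test_add": lambda cb: cb["add"](2, 3) == 5,
    "test_mul": lambda cb: cb["mul"](2, 3) == 6,
}
report = evaluate_change(tests, before, after)
```

Here the change scores 0.5 and `test_mul` is reported as a regression: exactly the signal a single-issue benchmark would miss.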

PDFs are the bane of every AI agent's existence: here's why parsing them is so much harder than you think 🧵

Every developer building document agents eventually hits the same wall: PDFs weren't designed to be machine-readable. They're drawing instructions from 1982, not structured data.

• PDF text isn't stored as characters: it's glyph shapes positioned at coordinates with no semantic meaning
• Tables don't exist as objects: they're just lines and text that happen to look tabular when rendered
• Reading order is pure guesswork: content streams have zero relationship to visual flow
• Seventy years of OCR evolution led us to combine text extraction with vision models for optimal results

We built LlamaParse using this hybrid approach: fast text extraction for standard content, vision models for complex layouts. It's how we're solving document processing at scale. Read the full breakdown of why PDFs are so challenging and how we're tackling it: https://t.co/K8bQmgq7xN
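The reading-order point is easy to demonstrate without a PDF library. A common extraction heuristic sorts positioned text spans top-to-bottom, then left-to-right; on a two-column page that interleaves the columns instead of reading one column at a time. The spans and coordinates below are made up for illustration:

```python
# PDF-style text: spans with (x, y) positions and no inherent order.
# Sorting by position interleaves the columns of a two-column page.
# Coordinates are invented for this toy example.

spans = [
    {"text": "Column A line 1", "x": 50,  "y": 700},
    {"text": "Column B line 1", "x": 300, "y": 700},
    {"text": "Column A line 2", "x": 50,  "y": 680},
    {"text": "Column B line 2", "x": 300, "y": 680},
]

def naive_order(spans):
    # Sort by descending y (PDF y grows upward), then by x.
    return [s["text"] for s in sorted(spans, key=lambda s: (-s["y"], s["x"]))]

order = naive_order(spans)
# Interleaved A1, B1, A2, B2 -- not the column-wise order a reader uses.
```

A human reads A1, A2, then B1, B2; the geometry alone can't tell you that, which is why layout models enter the picture.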
