Your curated collection of saved posts and media

Showing 24 posts · last 30 days · by score
github (@github) · Mar 07, 2026 · 2d ago · ID 60997569 · ⭐ 0.38

The core mental model for AI coding goes like this: 💻 CLI: Prove value quickly with low ceremony. Get unstuck and scaffold. 🛠️ IDE: Move to your editor when precision matters to shape and refine your logic. 🐙 GitHub: Commit, open a PR, review, and ship.

gerardsans (@gerardsans) · Mar 07, 2026 · 2d ago · ID 49738197 · ⭐ 0.38

@alex_prompter Agents need heavy babysitting, and that's fine unless you insist on selling the idea that they are autonomous agents. They are simply not. The main problem with AI is not the technology but the narratives AI labs create on top of it to keep speculators' money from drying up.

gerardsans (@gerardsans) · Mar 07, 2026 · 2d ago · ID 99250219 · ⭐ 0.46

@skill_evolve @alex_prompter "Agent" is largely a marketing term. In practice, what people usually mean is a prompt wrapped in a loop. It's about as crude as it sounds, and it's a fragile setup that's likely to break sooner or later, because the underlying premise was always pretty shaky. Leading AI labs, including Anthropic, know full well that current models are unreliable; third-party tests show a staggering 97% failure rate on digital tasks. Pause and let that sink in. Silicon Valley has always lived in a bubble. Today, its recklessness threatens the entire economy, and our systems aren't ready to cope. Brace yourself. Ask yourself: why do we take AI labs at their word about their own technology? Scrutiny isn't anti-innovation; it's pro-accountability. https://t.co/Ut4hpvTU3C
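
"A prompt wrapped in a loop" is easy to make concrete. Here is a minimal sketch of that pattern; every name below (`run_agent`, the reply dict shape, the scripted stand-in model) is hypothetical and not any particular framework's API:

```python
def run_agent(llm, tools, task, max_steps=10):
    """Minimal 'agent': ask the model, run any tool it requests,
    feed the result back, and repeat until it answers or the step
    budget runs out."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = llm(history)                       # one model call per step
        if reply.get("tool") is None:
            return reply["content"]                # model gave a final answer
        result = tools[reply["tool"]](**reply.get("args", {}))
        history.append({"role": "tool", "content": str(result)})
    return None  # budget exhausted: one way such loops fail silently

# Scripted stand-in for a model: requests one tool call, then answers.
replies = iter([
    {"tool": "add", "args": {"a": 2, "b": 3}},
    {"tool": None, "content": "the sum is 5"},
])
answer = run_agent(lambda history: next(replies),
                   {"add": lambda a, b: a + b}, "add 2 and 3")
```

The fragility the post describes lives in that loop: every step depends on the model emitting a well-formed tool request, and a silent `None` on budget exhaustion is easy to miss.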

gerardsans (@gerardsans) · Mar 07, 2026 · 2d ago · ID 07682838 · ⭐ 0.40

@ivanburazin There are no real "agents", just software making calls to APIs. Once these systems start interfacing with LLMs, things can quickly go off the rails: resources get wasted and silent failures accumulate over time. No amount of harnesses, clever prompting, or orchestration can fully shield you from inherently non-deterministic behavior. Just make sure everyone gets that.

dair_ai (@dair_ai) · Mar 07, 2026 · 2d ago · ID 28006138 · ⭐ 0.32

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

bfl_ml (@bfl_ml) · Mar 04, 2026 · 5d ago · ID 23020667

We present a research preview of Self-Flow: a scalable approach for training multi-modal generative models. Multi-modal generation requires end-to-end learning across modalities (image, video, audio, text) without being limited by external models for representation learning. Self-Flow addresses this with self-supervised flow matching that scales efficiently across modalities. Results:
• Up to 2.8x faster convergence across modalities
• Improved temporal consistency in video
• Sharper text rendering and typography
This is foundational research for our path towards multimodal visual intelligence.
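
The post doesn't publish Self-Flow's objective, but "flow matching" usually refers to the standard conditional flow-matching setup: regress a velocity field onto the constant velocity of a straight noise-to-data path. A toy NumPy sketch under that assumption (the stand-in model and shapes are illustrative, not Self-Flow's architecture):

```python
import numpy as np

def flow_matching_loss(x0, x1, t, predict_velocity):
    """Conditional flow matching on a straight interpolation path.
    x0: noise samples, x1: data samples, t: times in [0, 1].
    Along x_t = (1 - t) * x0 + t * x1 the target velocity is the
    constant x1 - x0; the model regresses onto it."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    target = x1 - x0
    pred = predict_velocity(xt, t)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 4))   # "noise" batch
x1 = rng.standard_normal((8, 4))   # "data" batch
t = rng.uniform(size=8)
# Untrained stand-in model that predicts zero velocity everywhere:
loss = flow_matching_loss(x0, x1, t, lambda xt, t: np.zeros_like(xt))
```

The same loss applies unchanged to video or audio tensors, which is why the objective scales across modalities.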

🖼️ Media
PyTorch (@PyTorch) · Mar 06, 2026 · 3d ago · ID 53785333

Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels. Learn more: https://t.co/r2WqASIhWG @KaimingCheng @marksaroufim https://t.co/OrtOp9boum

🖼️ Media
kothasuhas (@kothasuhas) · Mar 06, 2026 · 3d ago · ID 88542742

To improve fine-tuning data efficiency, replay generic pre-training data. Not only does this reduce forgetting, it actually improves performance on the fine-tuning domain, especially when fine-tuning data is scarce in pre-training. (w/ @percyliang) https://t.co/ClGPAUlPqQ
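
Mechanically, replay just means mixing some generic pre-training examples into every fine-tuning batch. A minimal sketch of one such mixing scheme; the 25% ratio, batch size, and function names are illustrative assumptions, not the paper's exact recipe:

```python
import random

def mixed_batches(finetune, pretrain, batch_size=8, replay_frac=0.25, seed=0):
    """Yield fine-tuning batches that replay a fraction of generic
    pre-training examples in every batch (ratio here is illustrative)."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_frac))
    while True:
        batch = (rng.sample(finetune, batch_size - n_replay)
                 + rng.sample(pretrain, n_replay))
        rng.shuffle(batch)          # interleave domains within the batch
        yield batch

ft = [("ft", i) for i in range(100)]   # scarce fine-tuning domain
pt = [("pt", i) for i in range(100)]   # generic pre-training data
batch = next(mixed_batches(ft, pt))    # 6 fine-tune + 2 replay examples
```

In a real training loop the ratio and schedule are hyperparameters; the claim in the post is that even modest replay helps the fine-tuning domain itself, not just retention.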

🖼️ Media
percyliang (@percyliang) · Mar 07, 2026 · 3d ago · ID 59271490

Normally, replaying old data reduces forgetting, but it actually helps you learn on new data too! We finally put this paper out on arXiv, but had it up as a Marin GitHub issue ~1 year ago: https://t.co/MNevf6XjvC

🖼️ Media
tusharmath (@tusharmath) · Mar 07, 2026 · 3d ago · ID 72920907

@forgecodehq is the #1 coding agent today. We have 78.4% accuracy on TermBench 2.0, the global benchmark for terminal-based coding agents. We did this with a very small team. You could almost count us on a horse's hoof! It took a lot of time, and at many points it felt like we were chasing a goal that kept moving faster than us. These benchmarks are quite literally the Olympics of AI, where the world record resets every day. For now, though, we're the champions here. And we're not stopping. More on how we got the score in our blog (link below). This is just the beginning.

🖼️ Media
πŸ”Scobleizer retweeted
T
Tushar Mathur
@tusharmath
πŸ“…
Mar 07, 2026
3d ago
πŸ†”72920907
⭐0.34

@forgecodehq is the #1 coding agent today. We have 78.4% accuracy on TermBench 2.0, the global benchmark for terminal-based coding agents. We did this with a very small team. You could almost count us on a horse’s hoof! It took a lot of time, and at many points it felt like we were chasing a goal that kept moving faster than us. These benchmarks are quite literally the Olympics of AI, where the world record resets every day. For now though, we’re the champions here. And we’re not stopping. More on how we got the score in our blog (link below) This is just the beginning.

❀️49
likes
πŸ”10
retweets
randal_olson (@randal_olson) · Mar 06, 2026 · 3d ago · ID 70636294

We just shipped the Truesight MCP and open source agent skills. This means you can create, manage, and run AI evaluations anywhere you use an AI assistant: coding editor, chat window, CLI. If it supports MCP, Truesight works there.

Nobody ships software without tests anymore. Once AI made them nearly free to write, there was no excuse. You lock in what you expect, they run every time you push code, and you know if something broke before you deploy. AI evaluations are the same idea for AI features, but most teams still treat them as something separate. Evaluation lives in a different tool, a different part of the day. So people skip it. And bad AI ships to production.

Truesight's MCP collapses that loop. You set your quality bar in natural language and Truesight turns it into evals your AI assistant runs while you build. Updated your AI agent's system prompt? "Run both versions through our instruction-following eval and tell me if my AI agent regressed." Done in seconds, right where you're working. Need a new eval? "Build me a custom eval that checks whether our customer support AI agent is correctly identifying user intent and escalating when it should." It walks you through the full setup and deploys a live endpoint your coding agent can use immediately. Or something simpler: "Run this marketing draft through the humanizer eval and flag anything that reads like AI wrote it." Scores the text, tells you what to fix.

The skills are what matter most here. Many MCPs ship tools and leave it to the user to figure out the workflow. Fine for simple integrations. But evaluation has real sequencing complexity. Build eval criteria before looking at your data? You'll measure the wrong things. Deploy to production before testing on a sample? You'll drown in false flags. We built agent skills that walk your coding assistant through the right workflow for each task, whether that's scoring traces, running error analysis, or building a custom eval from scratch. An orchestrator skill routes to the right one based on what you ask. You don't need to memorize anything.

Skills install via the Claude Plugin Marketplace or a one-liner curl script. MIT licensed. Setup is about 2 minutes:
1. Create a platform API key in Truesight Settings
2. Paste the MCP config into your client
3. Install the skills
4. Start evaluating

If you're already a Truesight user, this is live now. Connect your client and your existing evaluations work through the MCP immediately. If you're building AI systems and want to try this, sign up at https://t.co/Q1c8bVkSOi
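
For step 2, MCP clients typically take a config stanza shaped like the standard `mcpServers` entry below. The server name, command, package, and environment-variable name here are assumptions for illustration; copy the real values from Truesight's own setup docs.

```json
{
  "mcpServers": {
    "truesight": {
      "command": "npx",
      "args": ["-y", "truesight-mcp"],
      "env": { "TRUESIGHT_API_KEY": "<your platform API key>" }
    }
  }
}
```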

🖼️ Media
πŸ”HamelHusain retweeted
T
Thariq
@trq212
πŸ“…
Mar 06, 2026
3d ago
πŸ†”35843288
⭐0.34

Today we're launching local scheduled tasks in Claude Code desktop. Create a schedule for tasks that you want to run regularly. They'll run as long as your computer is awake. https://t.co/15AYd0NHqR

❤️ 11,251 likes · 🔁 804 retweets
πŸ”omarsar0 retweeted
O
elvis
@omarsar0
πŸ“…
Mar 06, 2026
3d ago
πŸ†”13900871
⭐0.34

Cursor with Kimi K2.5. Don't sleep on this combo. From a prompt to a personal HN feed in about 60 seconds. The future of building is going to be so wild. With faster models, you can quickly iterate on more ideas while improving quality. https://t.co/WOYFcCBqM7

❤️ 81 likes · 🔁 14 retweets
_akhaliq (@_akhaliq) · Mar 06, 2026 · 3d ago · ID 08342160

SkillNet: Create, Evaluate, and Connect AI Skills. Paper: https://t.co/k9gIkLsgPE https://t.co/5tAkG7AVGt

🖼️ Media ×2
AdinaYakup (@AdinaYakup) · Mar 05, 2026 · 4d ago · ID 04508246

Yuan3.0 Ultra 🔥 A 1T multimodal LLM from YuanLab https://t.co/6hleo11DtL ✨ 64K context ✨ Enterprise-ready: RAG, summarization, Text-to-SQL ✨ 103-layer MoE w/ LAEP (49% efficiency boost) https://t.co/ZxWi0yazAC

🖼️ Media ×2
πŸ”huggingface retweeted
A
Adina Yakup
@AdinaYakup
πŸ“…
Mar 05, 2026
4d ago
πŸ†”04508246

Yuan3.0 Ultra πŸ”₯ A 1T multimodal LLM from YuanLab https://t.co/6hleo11DtL ✨ 64K context ✨ Enterprise-ready: RAG, summarization, Text-to-SQL ✨ 103-layer MoE w/ LAEP (49% efficiency boost) https://t.co/ZxWi0yazAC

Media 1
❀️122
likes
πŸ”19
retweets
πŸ–ΌοΈ Media
patniko (@patniko) · Jan 27, 2026 · 42d ago · ID 92222434

Used the @GitHub Copilot SDK to let the Copilot CLI call me when it needs my opinion on something. https://t.co/6sdhk8sIW7

🖼️ Media
AnthropicAI (@AnthropicAI) · Mar 06, 2026 · 3d ago · ID 07617634

We partnered with Mozilla to test Claude's ability to find security vulnerabilities in Firefox. Opus 4.6 found 22 vulnerabilities in just two weeks. Of these, 14 were high-severity, representing a fifth of all high-severity bugs Mozilla remediated in 2025. https://t.co/It1uq5ATn9

🖼️ Media
AnthropicAI (@AnthropicAI) · Mar 06, 2026 · 3d ago · ID 17838016

New on the Anthropic Engineering Blog: In evaluating Claude Opus 4.6 on BrowseComp, we found cases where the model recognized the test, then found and decrypted answers to it, raising questions about eval integrity in web-enabled environments. Read more: https://t.co/oVCNyaiK5w

🖼️ Media
BenjaminDEKR (@BenjaminDEKR) · Mar 06, 2026 · 3d ago · ID 61675717

Talking to a voice AI LLM over ham radio (on UHF 420.69 megahertz, of course!) (Note: cool experiment, but be careful: FCC regs require a licensed control operator to be present at the control point the entire time the LLM is operating.) https://t.co/S2WcCrkp83

🖼️ Media
theworldlabs (@theworldlabs) · Mar 05, 2026 · 4d ago · ID 16216287

70 hackers joined us in SF for the first-ever World Labs Hackathon. In just 3.5 hours, 32 teams used Marble for projects ranging from robotics sims and agents to AR/VR interfaces, games, art experiences, and real estate tools. Check out what they built ↓ https://t.co/cX0bAlvhh1

🖼️ Media ×3
dair_ai (@dair_ai) · Mar 06, 2026 · 3d ago · ID 41785046

New research on evaluating coding agents via continuous integration. Coding agents are moving beyond isolated bug fixes. If they're going to own CI pipelines, we need benchmarks that reflect the actual complexity of codebase maintenance. Most coding agent benchmarks today test whether an agent can fix a single issue. But real software engineering involves maintaining entire codebases over time. SWE-CI evaluates agent capabilities through continuous integration workflows: running test suites, catching regressions, and maintaining code quality across multiple changes. Paper: https://t.co/p8bOTJ9QPX Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

🖼️ Media ×2
llama_index (@llama_index) · Mar 06, 2026 · 3d ago · ID 29386760

PDFs are the bane of every AI agent's existence: here's why parsing them is so much harder than you think 📄 Every developer building document agents eventually hits the same wall: PDFs weren't designed to be machine-readable. They're drawing instructions from 1982, not structured data.
📝 PDF text isn't stored as characters: it's glyph shapes positioned at coordinates with no semantic meaning
📊 Tables don't exist as objects: they're just lines and text that happen to look tabular when rendered
🔄 Reading order is pure guesswork — content streams have zero relationship to visual flow
🤖 Seventy years of OCR evolution led us to combine text extraction with vision models for optimal results
We built LlamaParse using this hybrid approach: fast text extraction for standard content, vision models for complex layouts. It's how we're solving document processing at scale. Read the full breakdown of why PDFs are so challenging and how we're tackling it: https://t.co/K8bQmgq7xN
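
The reading-order point is easy to demonstrate: extractors see glyph runs positioned at coordinates, so recovering order falls back to heuristics like "cluster spans into lines by y, then sort each line by x." A toy sketch of that heuristic (not LlamaParse's actual algorithm; span format and tolerance are illustrative):

```python
def reading_order(spans, line_tol=2.0):
    """Recover an approximate reading order from positioned text runs.
    Each span is (x, y, text), with y increasing downward as in typical
    PDF extraction output. Spans within line_tol of each other in y are
    treated as one visual line, then sorted left-to-right. Pure heuristic:
    the PDF itself stores no true order."""
    spans = sorted(spans, key=lambda s: (s[1], s[0]))   # top-to-bottom first
    lines, current, last_y = [], [], None
    for x, y, text in spans:
        if last_y is not None and abs(y - last_y) > line_tol:
            lines.append(current)                        # new visual line
            current = []
        current.append((x, text))
        last_y = y
    if current:
        lines.append(current)
    return " ".join(text for line in lines for _, text in sorted(line))

# Two runs on one visual line (stored out of order) plus a line below:
text = reading_order([(100.0, 10.0, "world"),
                      (10.0, 10.5, "hello"),
                      (10.0, 30.0, "next")])
```

Multi-column layouts break this immediately (the columns interleave line by line), which is exactly where vision models earn their keep.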

🖼️ Media ×2