Your curated collection of saved posts and media

Showing 24 posts · last 30 days · by score
πŸ”omarsar0 retweeted
O
elvis
@omarsar0
πŸ“…
Feb 26, 2026
12d ago
πŸ†”28644022
⭐0.34

At this point, "agentic engineering" has allowed me to build the best AI harness I could possibly get my hands on. Yes, I vibe coded it. That's right. You don't need to wait around for the features you need for your AI agents. Please don't. You can just build them yourself. Focusing on agentic engineering and building my own orchestrator over the past couple of months has allowed me to build with coding agents in a way unlike anything I have seen or experienced in the market. Claude Cowork was built in 10 days. I totally get it. Anyone can produce that level of output these days. I truly believe that. When I look at the new IDEs, TUIs, orchestrator apps, and most of the new features they are releasing these days, I realize I had access to them in my orchestrator months ago. And for unique features, I am able to reproduce them in a few hours and give them to my orchestrator. That is absolutely crazy! It feels like I am building an entire operating system sometimes. It's a lot of fun. And I am not saying this to brag or to dismiss any of the AI solutions out there. There are some great ones. I share this to clarify that this is the kind of leverage Karpathy is alluding to. We are building and experiencing this at different levels, but it doesn't remove the fact that you can just build the best AI agent for whatever problem you want to solve. And you should be building it.

❤️ 151 likes
🔁 15 retweets
Y
ye_chenlu
@ye_chenlu
πŸ“…
Feb 19, 2026
19d ago
πŸ†”06334675

1/5 Happy CNY 🎊 Still bothered by off-policy RL instability in LLMs? Introducing a new way 💡 Adaptive Layerwise Perturbation (ALP) 💡, a simple but robust fix that outperforms GRPO/MIS/Bypass and achieves better stability (KL, entropy) and exploration! 🔗 Blog: https://t.co/0def1Nb7uI https://t.co/9epsd4xJNp

🖼️ Media
X
xuhaiya2483846
@xuhaiya2483846
πŸ“…
Feb 26, 2026
11d ago
πŸ†”27717587

🔥 Tongyi Lab releases Mobile-Agent-v3.5, SOTA on 20+ GUI benchmarks: (1) GUI automation: 56.5 OSWorld, 71.6 AndroidWorld, and 48.4 WebArena; (2) grounding: 80.3 ScreenSpot-Pro; (3) tool calling: 47.6 OSWorld-MCP @_akhaliq #LLM #Agent #GUI https://t.co/xCbyL0JZLl

🖼️ Media
W
withmartian
@withmartian
πŸ“…
Feb 26, 2026
11d ago
πŸ†”73714984

Introducing Code Review Bench v0: https://t.co/iAZDURyqol The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights 🧵👇 Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI

🖼️ Media
πŸ”_akhaliq retweeted
W
Martian
@withmartian
πŸ“…
Feb 26, 2026
11d ago
πŸ†”73714984

Introducing Code Review Bench v0: https://t.co/iAZDURyqol The first independent code review benchmark. 200,000+ PRs. Unbiased. Fully OSS. Updated daily. Tool performance highlights πŸ§΅πŸ‘‡ Featuring: @augmentcode @baz_scm @claudeai @coderabbitai @cursor @GeminiApp @github @graphite @greptile @kilocode @OpenAIDevs @propelcode @QodoAI

Media 1
❀️533
likes
πŸ”53
retweets
πŸ–ΌοΈ Media
A
ariG23498
@ariG23498
πŸ“…
Feb 26, 2026
11d ago
πŸ†”36751072
⭐0.42

The Mixture of Experts (MoE) inside 🤗 Transformers is out now! This is going to be a long tweet, so if you just want to jump to the blog, the link is in the thread. We already had a great blog post on MoEs (which has more than 1k upvotes 😯 at the time of writing). The reason we wanted to build another blog post altogether was noticing how far we have come in the realm. This blog post is not meant to be another "What are MoEs and how do you implement them". Rather, it talks about how the transformers team at @huggingface made MoEs a "first class citizen" of the library and the Hub. The transformers library and the entire ecosystem were built around dense architectures, but with the rapid growth of MoEs, it was inevitable to build around MoEs and not treat them as "just another model addition". In the post we talk about better model loading, expert backends, expert parallelism, and also our collaboration with @UnslothAI on training MoEs faster! In the process of building the blog post, I also came to understand how beautiful the ideas are, and ended up making my first YouTube video on the routing algorithm alone. I am very proud of this project and I think it shows in some paragraphs of the blog post. I am also very thankful to all the people who helped me with the project; I am really happy to be on a team that helps me flourish! Glad to be alive. PS: I owe you all an apology for delaying the release. I hope I (and the team) have made it worth the wait.

S
StefanoErmon
@StefanoErmon
πŸ“…
Feb 24, 2026
13d ago
πŸ†”64520670
⭐0.38

Mercury 2 is live 🚀🚀 The world's first reasoning diffusion LLM, delivering 5x faster performance than leading speed-optimized LLMs. Watching the team turn years of research into a real product never gets old, and I'm incredibly proud of what we've built. We're just getting started on what diffusion can do for language.

K
karpathy
@karpathy
πŸ“…
Feb 27, 2026
10d ago
πŸ†”75325622

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). The research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, you can see their individual work, and "take over" if needed, i.e. no -p. But ok, the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully through experiment design, they run somewhat nonsensical variations, they don't create strong baselines and ablate things properly, and they don't carefully control for runtime or flops. (Just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result: a bigger network will have a lower validation loss in the infinite-data regime, but it also trains for a lot longer. It's not clear why I had to come in to point that out.) They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code".
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?
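The "org as source code" framing can be made concrete with a toy sketch: the org is plain data (agents, branches, comms files) plus process functions like a standup. All names below are hypothetical illustrations, not Karpathy's actual setup.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    branch: str        # feature branch forked off the research-program branch
    status_file: str   # simple file used for comms between agents

# Hypothetical "org code": the org itself is just data plus process functions.
ORG = [Agent(f"researcher-{i}", f"exp/softcap-{i}", f"comms/researcher-{i}.md")
       for i in range(8)]

def standup(statuses: dict) -> str:
    """Daily standup: aggregate each agent's status into one report."""
    lines = [f"- {name}: {summary}" for name, summary in sorted(statuses.items())]
    return "Standup report\n" + "\n".join(lines)

report = standup({a.name: "baseline running" for a in ORG[:2]})
```

Changing the org then means editing this data and these process functions, the same way you would edit any other program.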

💬 Reply:
@rasbt • 2026-03-01T12:33

@karpathy Yes, and on the software / architecture design side we also see companies more and more preparing for token-heavy workflows. I.e., adding DeepSeek Sparse Attention to MLA a few months back,...

πŸ–ΌοΈ Media
R
rasbt
@rasbt
πŸ“…
Feb 27, 2026
10d ago
πŸ†”54058190

Claude distillation has been a big topic this week while I am (coincidentally) writing Chapter 8 on model distillation. In that context, I shared some utilities to generate distillation data from all sorts of open-weight models via OpenRouter and Ollama: https://t.co/IsfNDpcGAw https://t.co/LKXuGrjO84
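The general shape of such utilities (a sketch, not rasbt's actual code) is a loop that queries a teacher model and collects instruction/response pairs, typically saved as JSONL for student fine-tuning. The teacher below is a stub standing in for a real call to a model served via Ollama or OpenRouter.

```python
import json

def generate_distillation_data(prompts, teacher):
    """Query a teacher model once per prompt; collect SFT-style pairs."""
    return [{"instruction": p, "output": teacher(p)} for p in prompts]

# Stub teacher; in practice this would call an open-weight model
# behind an HTTP API (e.g. a local Ollama server or OpenRouter).
def toy_teacher(prompt: str) -> str:
    return f"Answer to: {prompt}"

pairs = generate_distillation_data(["What is distillation?"], toy_teacher)
jsonl = "\n".join(json.dumps(p) for p in pairs)  # ready to write as a .jsonl dataset
```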

🖼️ Media
R
rasbt
@rasbt
πŸ“…
Feb 27, 2026
10d ago
πŸ†”20176734
⭐0.32

@latentchiz If you liked "LLMs from scratch", this should be a fun sequel!

C
code
@code
πŸ“…
Feb 19, 2026
18d ago
πŸ†”67312745

@burkeholland and @pierceboggan will be live coding all throughout the stream this morning! here's what they're building: https://t.co/iFH8Sz7Hn9

🖼️ Media
O
OrenMe
@OrenMe
πŸ“…
Feb 22, 2026
15d ago
πŸ†”62621971

Did you ever imagine in your head, visually, what your @GitHubCopilot is doing when you kick off a session in @code? Me too! Presenting the @code session visualizer (which I posted about a few weeks ago) 📌 See top-level user->agent turns 📌 Click ℹ️ for extended node info (mode, prompt, tool calls, tokens used, and more) 📌 Expand agent nodes on click for full tool invocations/MCPs/sub-agents etc. 📌 Sub-agent nodes can be further expanded 📌 High-level summary of the session with detailed model and token utilization per turn Initial version available now in the @code marketplace, see first comment for the link 👇 Tell me what you think!

πŸ–ΌοΈ Media
P
pierceboggan
@pierceboggan
πŸ“…
Feb 23, 2026
14d ago
πŸ†”32559284

New in @code Insiders: Integrated agentic browser with workbench.browser.enableChatTools. Here, I ask @code to identify an issue with sliders in hover states, and it makes the fix and validates the solution. https://t.co/TVXkUYpcns

πŸ–ΌοΈ Media
H
heygurisingh
@heygurisingh
πŸ“…
Feb 26, 2026
11d ago
πŸ†”72170231

Holy shit... someone just built software that sees you through walls using WiFi. It's called WiFi-DensePose and it maps your full-body pose in real time: no camera, no sensors, no special hardware. Just the router sitting in your living room. Governments and corporations have had this technology for years. Someone just open-sourced it, so now anyone can build it. Your WiFi is already watching. You just didn't know. 100% open source.

🖼️ Media
H
hardmaru
@hardmaru
πŸ“…
Feb 27, 2026
10d ago
πŸ†”98976770
⭐0.42

Instead of forcing models to hold everything in an active context window, we can use hypernetworks to instantly compile documents and tasks directly into the model's weights. A step towards giving language models durable memory and fast adaptation. Blog: https://t.co/iHoifpsLMu
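The core idea (a second network that emits the weights of the first) can be sketched in NumPy. This is a toy linear hypernetwork under assumed dimensions, not the method from the linked blog.

```python
import numpy as np

rng = np.random.default_rng(0)
d_doc, d_in, d_out = 16, 8, 4  # toy sizes (assumptions, for illustration only)

# Hypernetwork: a linear map from a document embedding to the
# flattened weights of a small target network.
W_hyper = rng.standard_normal((d_in * d_out, d_doc)) * 0.1

def compile_document(doc_embedding):
    """'Compile' a document into target-network weights in one shot."""
    return (W_hyper @ doc_embedding).reshape(d_in, d_out)

def target_forward(x, W):
    """The target network runs with weights generated per document."""
    return np.tanh(x @ W)

doc = rng.standard_normal(d_doc)  # stands in for an encoded document
W_task = compile_document(doc)    # no gradient-descent fine-tuning loop
y = target_forward(rng.standard_normal(d_in), W_task)
```

The point of the sketch: the document lives in the weights, not in an active context window, so the target network can be called repeatedly without re-reading the document.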

U
UnslothAI
@UnslothAI
πŸ“…
Feb 27, 2026
10d ago
πŸ†”96545535

Qwen3.5 is now updated with improved tool-calling & coding performance! Run Qwen3.5-35B-A3B on 22GB RAM. See improvements via Claude Code, Codex. We also benchmarked GGUFs & removed MXFP4 layers from 3 quants. GGUFs: https://t.co/4lSce5zZbO Analysis: https://t.co/rHZK8JWdYM

🖼️ Media
πŸ”ai_fast_track retweeted
K
Andrej Karpathy
@karpathy
πŸ“…
Feb 27, 2026
10d ago
πŸ†”75325622
⭐0.34

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p. But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully though experiment design, they run a bit non-sensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". 
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?

❀️8,169
likes
πŸ”725
retweets
J
juntao
@juntao
πŸ“…
Mar 01, 2026
9d ago
πŸ†”68123776

Rust implementation of speech-to-text based on the open-source Qwen3 models * Self-contained binary build, no external dependencies * Uses libtorch on Linux with optional Nvidia GPU support * Uses MLX on macOS with Apple GPU/NPU support 🔨 CLI for AI agents and humans: https://t.co/knsZlastgQ 🖥️ OpenAI-compatible API server: https://t.co/qjDqCf9hor 🤖 OpenClaw skill: https://t.co/tE6lzTjYpy Why and how: https://t.co/VxRt9oSZ8a

🖼️ Media
A
AnthropicAI
@AnthropicAI
πŸ“…
Feb 05, 2026
32d ago
πŸ†”98397945

New Engineering blog: We tasked Opus 4.6 using agent teams to build a C compiler. Then we (mostly) walked away. Two weeks later, it worked on the Linux kernel. Here's what it taught us about the future of autonomous software development. Read more: https://t.co/htX0wl4wIf https://t.co/N2e9t5Z6Rm

🖼️ Media
J
johnrobinsn
@johnrobinsn
πŸ“…
Feb 07, 2026
30d ago
πŸ†”65777620

Just shipped tscribe 🎙️ 🔴 Record any audio playing on your computer. 🗒️ Transcribe it locally. 🔍 Search it later. All from your terminal. OS X, Windows, Linux. Open source, cross-platform, no cloud required. https://t.co/nPkokVqo1K

🖼️ Media
N
noahzweben
@noahzweben
πŸ“…
Feb 24, 2026
13d ago
πŸ†”05271615

Announcing a new Claude Code feature: Remote Control. It's rolling out now to Max users in research preview. Try it with /remote-control Start local sessions from the terminal, then continue them from your phone. Take a walk, see the sun, walk your dog without losing your flow.

πŸ–ΌοΈ Media
L
LiorOnAI
@LiorOnAI
πŸ“…
Feb 24, 2026
13d ago
πŸ†”28395908
⭐0.42

Mercury 2 doesn't just make reasoning models faster. It makes them native. Every reasoning model today is built on autoregressive generation, where the model writes one word at a time, left to right, like typing on a keyboard. Each word waits for the previous one to finish. The problem compounds when reasoning depth increases: multi-step agents, voice systems, and coding assistants all need many sequential passes, and each pass multiplies the delay. The industry has spent billions on chips, compression, and serving infrastructure to squeeze more speed from this sequential loop. But you're still optimizing a bottleneck. Mercury 2 uses diffusion instead. It starts with a rough draft of the entire response and refines all the words simultaneously through multiple passes. Each pass improves many tokens in parallel, so one neural network evaluation does far more work. The model can also correct mistakes mid-generation because nothing is locked in until the final pass. This isn't a serving trick or a hardware optimization. The speed comes from the architecture itself. This unlocks workflows that were impractical before: 1. Multi-step agents that run 10+ reasoning loops without compounding latency 2. Voice AI that hits sub-200ms response times with full reasoning enabled 3. Real-time code editors where every keystroke triggers model feedback Mercury 2 runs at 1,000 tokens per second while matching the quality of models that generate 70-90 tokens per second. If this performance holds across model sizes, reasoning stops being a batch process you run overnight and becomes something you embed everywhere. Agent loops become tight enough for interactive debugging. Voice systems feel instant instead of sluggish. Code assistants respond faster than you can move your cursor. The entire category of "too slow for production" collapses.
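The contrast the post draws (one model call per token vs. a few passes that each refine every position) can be counted with a toy example. The "model" below is a fake refinement rule used purely to count calls; it has nothing to do with Mercury's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([3, 1, 4, 1, 5, 9, 2, 6])  # stand-in for the ideal token sequence
T = len(target)

def autoregressive_decode():
    """One model call per token: T sequential steps, each waits for the last."""
    out, calls = [], 0
    for t in range(T):
        out.append(target[t])  # toy "model": always emits the right next token
        calls += 1
    return np.array(out), calls

def diffusion_decode(passes=3):
    """Each call refines ALL positions in parallel: only `passes` steps total."""
    draft, calls = rng.integers(0, 10, size=T), 0
    for _ in range(passes):
        # toy refinement: move every position halfway toward the target at once
        draft = np.round((draft + target) / 2).astype(int)
        calls += 1
    return draft, calls

_, ar_calls = autoregressive_decode()
_, diff_calls = diffusion_decode()
```

The call counts (T vs. a small constant number of passes) are the whole argument: the parallel refiner does more work per network evaluation, and nothing is locked in until the final pass.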

L
LiorOnAI
@LiorOnAI
πŸ“…
Feb 24, 2026
13d ago
πŸ†”06403836
⭐0.42

A 35 billion parameter model just beat a 235 billion parameter model. That's not supposed to happen. Qwen3.5-35B-A3B now outperforms its predecessor that had 6x more total parameters, and it does so while using 7x fewer active parameters per token. The breakthrough isn't efficiency for efficiency's sake. It's proof that three specific techniques can compress intelligence better than brute-force scaling: 1. Hybrid attention layers that mix linear attention (fast, scales to long contexts) with standard attention (accurate, catches nuance) in a 3:1 ratio 2. Ultra-sparse experts where only 3 billion of 35 billion parameters activate per token, but those 3 billion are chosen by a router trained on higher-quality data 3. Reinforcement learning scaled across millions of simulated agent environments, not just text prediction The result is a model architecture where intelligence comes from better routing decisions, not bigger weight matrices. This unlocks four things that weren't practical before: 1. Running frontier-class reasoning on a single GPU node instead of a cluster 2. Serving 1 million token contexts in production without exploding costs 3. Building agents that can handle complex tool use without the latency penalty of dense models 4. Fine-tuning on domain data without needing to update 200+ billion parameters If this pattern holds, the next 18 months will belong to teams optimizing routing and data quality, not teams with the biggest GPU budgets.
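The "only a few experts activate per token" idea can be sketched as a toy top-k router in NumPy. All sizes here are made up for illustration and are not Qwen3.5's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d = 64, 4, 32  # toy sizes (assumptions, not Qwen3.5's)

router_w = rng.standard_normal((n_experts, d)) * 0.1   # router: one score per expert
experts = rng.standard_normal((n_experts, d, d)) * 0.1 # expert weight matrices

def moe_forward(x):
    """Route a token to its top-k experts; the other experts stay inactive."""
    logits = router_w @ x
    idx = np.argsort(logits)[-top_k:]                        # chosen experts
    gates = np.exp(logits[idx]) / np.exp(logits[idx]).sum()  # softmax over top-k
    y = sum(g * (experts[i] @ x) for g, i in zip(gates, idx))
    return y, idx

y, active = moe_forward(rng.standard_normal(d))
```

Per token, only `top_k` of the `n_experts` matrices are touched, which is why active-parameter count, not total parameter count, dominates compute cost.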

L
LiorOnAI
@LiorOnAI
πŸ“…
Feb 28, 2026
9d ago
πŸ†”52119603

Imbue just open-sourced Evolver. A tool that uses LLMs to automatically optimize code and prompts. They hit 95% on ARC-AGI-2 benchmarks. That's GPT-5.2-level performance from an open model. Evolver works like natural selection for code. You give it three things: 1. Starting code or prompt 2. A way to score results 3. An LLM that suggests improvements Then it runs in a loop. It picks high-scoring solutions. Mutates them. Tests the mutations. Keeps what works. The key difference from random mutation: LLMs propose targeted fixes. When a solution fails on specific inputs, the LLM sees those failures. It suggests changes to fix them. Most suggestions don't help. But some do. Those survivors become parents for the next generation. Evolver adds smart optimizations: > Batch mutations: fix multiple failures at once > Learning logs: share discoveries across branches > Post-mutation filters: skip bad mutations before scoring The verification step alone cuts costs 10x. This works on any problem where LLMs can read the code and you can score the output. You can now auto-optimize: - Agentic workflows - Prompt templates - Code performance - Reasoning chains No gradient descent needed. No differentiable functions required.
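The loop described above (score, select, mutate, keep what works) can be sketched in a few lines. Here a random integer tweak stands in for the LLM's targeted fix, and the score is a toy distance, not ARC-AGI.

```python
import random

random.seed(0)
TARGET = 42

def score(candidate):
    """Higher is better; in a real setup this runs your actual eval."""
    return -abs(candidate - TARGET)

def mutate(candidate):
    """Stand-in for an LLM proposing a targeted fix to a failing solution."""
    return candidate + random.choice([-3, -1, 1, 3])

def evolve(start, generations=200, pop=8):
    population = [start]
    for _ in range(generations):
        parents = sorted(population, key=score, reverse=True)[:pop]
        children = [mutate(p) for p in parents]
        # keep only the candidates that survive scoring (elitist selection)
        population = sorted(parents + children, key=score, reverse=True)[:pop]
    return max(population, key=score)

best = evolve(start=0)
```

Swapping the random `mutate` for an LLM that sees the failing inputs is the key difference the post highlights: most proposals still fail, but the proposals are targeted rather than blind.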

🖼️ Media