Valuable synthesis across labs! Make sure to check out the tutorial video - https://t.co/nqh6ZFaxat
Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo; details follow
When we first started working with transcoders, I didn't really appreciate what a big change they were... https://t.co/Pm3YiWCMK8
Let's consider a very simple problem: having a transcoder mimic the absolute value function. You can do this with two features per dimension: https://t.co/DWswkZmZVW
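As a concrete illustration (a minimal numpy sketch, not the thread's actual code), the two features per input dimension are just ReLU(x) and ReLU(-x), whose sum reconstructs |x| exactly:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Hand-constructed weights for a single input dimension:
# two encoder features (+x and -x), one decoder row that sums them.
W_enc = np.array([[1.0], [-1.0]])
W_dec = np.array([[1.0, 1.0]])

def transcoder(x):
    features = relu(W_enc @ x)   # shape (2,)
    return W_dec @ features      # shape (1,)

for v in [-3.0, -0.5, 0.0, 2.0]:
    print(v, transcoder(np.array([v]))[0], abs(v))  # model output equals |v|
```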
Transcoders will learn the perfect solution! https://t.co/MoewAdgDOo
But now let's add a repeated data point to the transcoder training data, p=[1,1,1,0,0,0,0...] The transcoder now learns a special feature to memorize that point! https://t.co/5ghvWDjqbg
It turns out there's a fix! If we ask it to match the Jacobian of absolute value, we get the correct solution again. https://t.co/dJ9acDGrZU
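A rough sketch of what "matching the Jacobian" could look like in training (my own toy PyTorch setup using a vector-Jacobian-product penalty; the architecture and weighting are assumptions, not the original experiment):

```python
import torch

d_in, d_feat = 4, 16
enc = torch.nn.Linear(d_in, d_feat)
dec = torch.nn.Linear(d_feat, d_in)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
lam = 1.0  # weight on the Jacobian-matching term (assumed)

for step in range(2000):
    x = torch.randn(64, d_in, requires_grad=True)
    y_hat = dec(torch.relu(enc(x)))
    out_loss = ((y_hat - x.abs()) ** 2).mean()

    # The Jacobian of elementwise abs is diag(sign(x)); compare vector-Jacobian
    # products against random probe vectors instead of forming full Jacobians.
    v = torch.randn_like(y_hat)
    (model_vjp,) = torch.autograd.grad((y_hat * v).sum(), x, create_graph=True)
    jac_loss = ((model_vjp - v * torch.sign(x)) ** 2).mean()

    loss = out_loss + lam * jac_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```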
What's the point of all of this? For me, this question of mechanistic faithfulness is the most important question in all the SAE debate. I think it's often mixed in with other things and kind of implicit, and I wanted to have a simple example that clearly isolates it. https://t.co/VqLQSKO7D7
Our recent work on attribution graphs (https://t.co/qbIhdV7OKz ) and extending it to attention (https://t.co/Mf8JLvWH9K ) points towards how much potential they have if we can mitigate the issues.
We're running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places. https://t.co/wJWRRTt4DG
The program will run for ~two months, with opportunities to extend for an additional four based on progress and performance. Apply by August 17 to join us in any of these locations: - US: https://t.co/BhrekQsl8F - UK: https://t.co/TPYNEony83 - Canada: https://t.co/F00QZ0hjqw

What if adversarial examples aren't a bug, but a direct consequence of how neural networks process information? We've found evidence that superposition (the way networks represent many more features than they have neurons) might cause adversarial examples. https://t.co/YL11r2FeOw
Assignment 3 (scaling laws): fit scaling laws using IsoFLOP. To simulate the high stakes of a training run, students got a training API [hyperparameters -> loss] and a fixed compute budget, and had to choose which runs to submit to gather data points. Behind the scenes, the training API was backed by interpolating between a bunch of precomputed runs. https://t.co/JpaDT8wIoE
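Roughly, the IsoFLOP recipe looks like this (all numbers below are illustrative, not data from the course's training API): for each compute budget, fit a parabola to loss vs. log model size, take its minimum, then fit a power law to the optimal sizes:

```python
import numpy as np

# Hypothetical (model size, final loss) points at three compute budgets.
runs = {
    1e17: [(1e7, 3.10), (3e7, 2.95), (1e8, 2.98), (3e8, 3.20)],
    1e18: [(3e7, 2.80), (1e8, 2.62), (3e8, 2.66), (1e9, 2.85)],
    1e19: [(1e8, 2.50), (3e8, 2.33), (1e9, 2.36), (3e9, 2.55)],
}

budgets, n_opts = [], []
for C, points in runs.items():
    logN = np.log([n for n, _ in points])
    loss = np.array([l for _, l in points])
    a, b, c = np.polyfit(logN, loss, 2)   # quadratic fit in log N
    n_opts.append(np.exp(-b / (2 * a)))   # vertex = compute-optimal N for this budget
    budgets.append(C)

# Power-law fit: log N_opt = alpha * log C + const
alpha, const = np.polyfit(np.log(budgets), np.log(n_opts), 1)
print(f"N_opt scales roughly as C^{alpha:.2f}")
```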
Assignment 4 (data): convert Common Crawl HTML to text, filter filter filter (quality, harmful content, PII), deduplication. This is the grunt work that doesn't get enough appreciation. https://t.co/60V5MB9uv5
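For flavor, a minimal sketch of that kind of pipeline (the thresholds and heuristics here are placeholders I made up, not the assignment's reference filters):

```python
import hashlib
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def passes_quality_filters(text: str) -> bool:
    words = text.split()
    if len(words) < 50:                                   # too short to be useful
        return False
    if sum(len(w) for w in words) / len(words) > 12:      # likely markup/gibberish
        return False
    return True

def dedup(docs):
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

# Toy usage: raw_documents would come from Common Crawl HTML -> text extraction.
raw_documents = ["some extracted page text ..."] * 3      # three identical docs
unique_docs = dedup(raw_documents)                        # exact dedup -> 1 doc
clean_docs = [d for d in unique_docs if passes_quality_filters(d)]
```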
Assignment 5 (alignment): implement supervised fine-tuning, expert iteration, GRPO and variants, run RL on Qwen 2.5 Math 1.5B to improve MATH because it's 2025. We thought about having students implement inference, but decided (probably wisely) to let people use vllm instead. https://t.co/mQOG46z2Eh
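A small sketch of the GRPO core (group-relative advantages plus a clipped ratio); the hyperparameters and toy numbers are assumptions, not the assignment's reference implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """
    logp_new, logp_old: (G,) summed log-probs of G sampled completions for one prompt
    rewards:            (G,) scalar rewards (e.g., 1 if the MATH answer is correct)
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# Toy usage: 4 completions for one prompt, two of them correct.
loss = grpo_loss(
    logp_new=torch.tensor([-12.0, -15.0, -11.0, -14.0], requires_grad=True),
    logp_old=torch.tensor([-12.5, -14.0, -11.5, -13.0]),
    rewards=torch.tensor([1.0, 0.0, 1.0, 0.0]),
)
loss.backward()
```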
You can find all our lectures on YouTube (thanks to @StanfordOnline): https://t.co/l5WOdhWzNW and the assignments on the course website so you can do it yourself at home: https://t.co/HG6zdeLUtq
Designed some graphics for Stanford CS336 (Language Modeling from Scratch) by @percyliang @tatsu_hashimoto @marcelroed @neilbband @rckpudi Covering four assignments that teach you how to cook an LLM from scratch: - Build and Train a Tokenizer - Write Triton kernels for Attention - Construct Scaling Laws - Implement GRPO

So about a month ago, Percy posted a version of this plot of our Marin 32B pretraining run. We got a lot of feedback, both public and private, that the spikes were bad. (This is a thread about how we fixed the spikes. Bear with me.) https://t.co/ePDDIL97Dg
Prompting Llama 3.1 70B with "Mr and Mrs. D" can seed the generation of a near-exact copy of the entire ~300 page book "Harry Potter & the Sorcerer's Stone" 🤯 We define a "near-copy" as text that is identical modulo minor spelling / punctuation variations. Below is a piece of the diff between the Books3 version and what the model generated, showing how close the two are! TL;DR: "Mr and Mrs. D" => [Llama 3.1 70B] => near-exact copy of the entire ~300 page book. Read on 1/🧵
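For what it's worth, a tiny sketch of how one might operationalize that "near-copy" definition (my own heuristic, not the paper's metric): normalize punctuation and case, then compute a sequence-similarity ratio:

```python
import difflib
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)       # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

def near_copy_ratio(generated: str, reference: str) -> float:
    return difflib.SequenceMatcher(
        None, normalize(generated), normalize(reference)
    ).ratio()

# Toy usage with placeholder strings (not the actual book text).
print(near_copy_ratio("Mr. and Mrs. D were proud to say...",
                      "Mr and Mrs D were proud to say..."))  # ~1.0
```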
HELM capabilities v1.9.0 is out (Grok 4 and Kimi K2 make the top 10 overall), and Kimi K2 is the best non-thinking model: https://t.co/xEnipRhILk
HELM safety v1.11.0 is also out. Kimi K2 is right up there, whereas Grok 4 is closer to the bottom than the top... https://t.co/mIYIlLIPbO
gpt-oss-120b is the top open-weight model (with Kimi K2 right on its tail) for capabilities (HELM capabilities v1.11): https://t.co/D3RExuNfbY
It is also the safest (HELM safety v1.13.0): https://t.co/P9tsrIK3V3
GPT-5 and GPT-5 mini added to HELM capabilities v1.12.0. Interestingly, GPT-5 mini tops the leaderboard ahead of GPT-5 because on Omni-MATH, GPT-5 uses more reasoning tokens (and is hard to control) and hits our reasoning token budget of 14096. Doing fair evals is tricky! https://t.co/hSmyQgke4S
New paper! We explore a radical paradigm for AI evals: assessing LLMs on *unsolved* questions. Instead of contrived exams where progress ≠ value, we eval LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far: https://t.co/3TzD9ULEtg
Docent, our tool for analyzing complex AI behaviors, is now in public alpha! It helps scalably answer questions about agent behavior, like "is my model reward hacking" or "where does it violate instructions." Today, anyone can get started with just a few lines of code! https://t.co/ki6MMGH73j
Bad news for medical LLMs. This paper finds that top medical AI models often match patterns instead of truly reasoning. Small wording tweaks cut accuracy by up to 38% on validated questions. The team took 100 MedQA questions, replaced the correct choice with "None of the other answers", then kept the 68 items where a clinician confirmed that switch as correct. If a model truly reasons, it should still reach the same clinical decision despite that label swap. They asked each model to explain its steps before answering and compared accuracy on the original versus modified items. All 6 models dropped on the NOTA set, the biggest hit was 38%, and even the reasoning models slipped. That pattern points to shortcut learning: the systems latch onto answer templates rather than working through the clinical logic. Overall, the results show that high benchmark scores can mask a robustness gap, because small format shifts expose shallow pattern use rather than clinical reasoning.
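A minimal sketch of the perturbation being described (the field names, toy item, and dummy model are my own illustration, not the paper's code): swap the correct option for "None of the other answers" and re-score:

```python
def make_nota_item(question, options, answer_key):
    """Replace the correct choice with a NOTA option; the answer key stays the same."""
    modified = dict(options)
    modified[answer_key] = "None of the other answers"
    return {"question": question, "options": modified, "answer": answer_key}

def accuracy(answer_fn, items):
    hits = sum(answer_fn(it["question"], it["options"]) == it["answer"] for it in items)
    return hits / len(items)

# Toy usage: compare accuracy on original vs. NOTA items;
# a large drop suggests pattern matching rather than clinical reasoning.
original = [{"question": "Best next step?", "options": {"A": "X-ray", "B": "CT"}, "answer": "B"}]
nota = [make_nota_item(it["question"], it["options"], it["answer"]) for it in original]
dummy_model = lambda q, opts: "B"   # stand-in for a real model's answer function
print(accuracy(dummy_model, original), accuracy(dummy_model, nota))
```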
One minute @mattturck is telling me that hallucinations are "a largely fixed problem"; the next minute ChatGPT 5 is telling a friend that Trump "is not in office". https://t.co/VydYrOrpkd
35 million people laughed in my face. But we still don't have a solution to hallucinations, boneheaded errors, and unreliable reasoning. Years later, the wall of reliability still looms in front of LLMs, unconquered. https://t.co/kGFg5vXK3z
@OrielJohann 🤣🤣🤣 https://t.co/PTZlFGRQnZ
"Nasdaq pulls back on AI worries" https://t.co/LJVGHu0c89
@GaryMarcus I wanted to find a vaguely familiar existing image that shows how one person can be correct while standing against a huge crowd :), so I asked Grok. Well, the result shows why I can't completely dismiss the LLMs... https://t.co/WtEKGJ5daa

AI Twitter, every day since 2019. (Not all e/accs shown.) https://t.co/v0d6Gn4P2h