Your curated collection of saved posts and media
https://t.co/xQ0tVdFoV4
GPT 5.4 is released https://t.co/kqy67qLlJf
Two major AI releases this week:
• Qwen3.5 – new open-source small models
• GPT-5.4 – newest frontier closed model
Most benchmarks compare math and coding. But the real test for frontier AI should be biology and healthcare. That's where mistakes actually matter.
So our team at @UHN ran them on EURORAD – 207 expert-validated radiology differential diagnosis cases.
Results:
GPT-5.4: 92.2%
Qwen3.5-27B: 85%
Gemini 3.1 Pro: ~79%
A 27B open model that runs on a laptop is only 7 points behind the most powerful AI model on earth, and it already beats Gemini on this benchmark. That gap is much smaller than people expected. And it matters.
For years hospitals faced an impossible tradeoff:
Frontier models – patient data leaves the hospital
Local models – not good enough
That tradeoff may finally be ending. Qwen3.5-27B runs fully local. No API. No cloud. No patient data leaving the building. HIPAA / PHIPA compliance becomes architecture, not paperwork.
Interesting detail: 27B and 122B score almost identically here. Scaling bigger didn't help much.
One caveat: with web-scale training, it's hard to completely rule out that frontier models like GPT-5.4 may have seen parts of the evaluation datasets. Still, the signal is clear: small models are getting good enough for real clinical AI. And if we want to measure real AI progress, biology and healthcare should be the benchmark.
Huge credit to the team @alifmunim @AlhusainAbdalla @JunMa_AI4Health @Omar_Ibr12 @oliviaamwei
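A minimal sketch of how an accuracy number like the ones above is computed. This is not the @UHN team's actual evaluation harness; the function name and the sample diagnoses are made up for illustration. Real differential-diagnosis scoring typically also needs expert adjudication of near-synonymous labels.

```python
def accuracy(predictions, gold):
    """Fraction of cases where the model's top diagnosis matches the expert label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical example: 4 cases, 1 mismatch -> 75% accuracy.
preds = ["pneumothorax", "glioma", "osteosarcoma", "glioma"]
gold = ["pneumothorax", "glioblastoma", "osteosarcoma", "glioma"]
print(f"{accuracy(preds, gold):.1%}")  # 75.0%
```

On a 207-case benchmark like EURORAD, each additional correct case moves the score by roughly 0.5 points, so a 7-point gap corresponds to about 15 cases.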
🎙️ New episode of "Machine Learning: How Did We Get Here?" Tom Mitchell (@CarnegieMellon) and @ylecun, Executive Chairman of AMI Labs and Professor at NYU, discuss how technological advances and commercial forces shaped AI history. Listen on Spotify: https://t.co/YdWjoVdoVc
Tom's ongoing peek at the personalities and personal stories behind machine learning's history is available wherever you find your podcasts. 🎥 Watch on YouTube: https://t.co/czYb2iXB2l ▶️ Listen on Apple: https://t.co/z07MFJaTr1

As @bradrcarson explains, the contract language released so far does not restrict the gov from using AI to kill without human oversight. https://t.co/To1RKsQTGg
@ch402 @sebgehr Too many to count. NatSec in general agrees with you @ch402. Jack Shannan's background and placement in Operation Maven is noteworthy, so his understanding of how critical Claude is to American military effectiveness is not just hot air. https://t.co/0fjZrDKWLh https://t.co/I8ISpgEMve

On one hand, the Anthropic team is a massive user of AI to write code (80%+ of all code deployed is written by Claude Code). They ship amazingly fast. On the other hand, these beyond-terrible reliability numbers suggest there might be a downside to all this speed: https://t.co/9nYoH7KYOc
we are about to hit 1 9 of availability while coding is largely solved https://t.co/4NJB1YNsPk
It's very common for people to claim that open LLMs will be used to commit cyber attacks at massive scale. What public evidence is there for this claim? The best (and one of the only) accounts I've seen of a cyber LLM attack was done using Claude https://t.co/v63Lolv5iH
Looking for user feedback about the upcoming ggml official Debian and Ubuntu packages https://t.co/8lcGZzSgLK
Last week, Vinod Khosla (@vkhosla) of Khosla Ventures, Sakana AI's lead investor, visited Japan. 🇯🇵 Together with our Co-founder and COO Ren Ito, they paid a courtesy visit to Finance Minister Satsuki Katayama to exchange views from a global perspective on AI strategies to boost Japan's industrial competitiveness and on the fundamental integration of AI within the public sector. Following that, Vinod visited Sakana AI's new office. Joined by CEO David Ha (@hardmaru) and CTO Llion Jones (@YesThisIsLion), we discussed the potential of deploying AI across various domestic and global industrial sectors using our unique technology, something he has supported since our founding. This included conversations on utilizing AI in Japan's security and defense fields.

Can AI companies restrict government use of their technology? They do it all the time. Whether and how depends on the acquisition pathway, contract type, and terms. My explainer: https://t.co/QHSZrooFoH #Anthropic #openai #pentagon #DoD #govcon
@CharlieBul58993 @JTillipman @bridgewriter (former NSC counsel) - https://t.co/K8WEStCDhc
A deep dive in @lawfare on the many legal problems with the Pentagon's designation of Anthropic as a supply chain risk. https://t.co/6mlWhgwMge
New research just exposed the biggest lie in AI coding benchmarks. LLMs score 84-89% on standard coding tests. On real production code? 25-34%. That's not a gap. That's a different reality.
Here's what happened: researchers built a benchmark from actual open-source repositories: real classes with real dependencies, real type systems, real integration complexity. Then they tested the same models that dominate HumanEval leaderboards. The results were brutal.
The models weren't failing because the code was "harder." They were failing because it was *real*. Synthetic benchmarks test whether a model can write a self-contained function with a clean docstring. Production code requires understanding inheritance hierarchies, framework integrations, and project-specific utilities. Different universe. Same leaderboard score.
But it gets worse. A separate study ran 600,000 debugging experiments across 9 LLMs. They found a bug in a program. The LLM found it too. Then they renamed a variable. Added a comment. Shuffled function order. Changed nothing about the bug itself. The LLM couldn't find the same bug anymore.
78% of the time, cosmetic changes that don't affect program behavior completely broke the model's ability to debug. Function shuffling alone reduced debugging accuracy by 83%. The models aren't reading code. They're pattern-matching against what code *looks like* in their training data.
A third study confirmed this from another angle: when researchers obfuscated real-world code (changing symbols, structure, and semantics while keeping functionality identical), LLM pass rates dropped by up to 62.5%. The researchers call this the "Specialist in Familiarity" problem. LLMs perform well on code they've memorized. The moment you show them something unfamiliar with the same logic, they collapse.
Three papers. Three different methodologies. Same conclusion: the benchmarks we use to evaluate AI coding tools are measuring memorization, not understanding.
If you're shipping code generated by LLMs into production without review, these numbers should concern you. If you're building developer tools, the question isn't "what's your HumanEval score?" It's "what happens when the code doesn't look like the training data?"
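To make the "cosmetic change" claim concrete, here is a minimal illustration of a semantics-preserving perturbation of the kind the debugging study describes. The function names and the off-by-one bug are invented for this sketch; the point is that both versions compute exactly the same (wrong) result on every input, so any debugger that finds the bug in one but not the other is keying on surface form, not behavior.

```python
def buggy_mean(values):
    """Mean of a list, with an off-by-one bug in the denominator."""
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)  # bug: should divide by len(values)

def buggy_mean_renamed(xs):
    """Same logic, different identifiers and comments: the bug is unchanged."""
    acc = 0  # running sum, renamed from `total`
    for item in xs:
        acc += item
    return acc / (len(xs) - 1)  # identical off-by-one bug

# Behaviorally indistinguishable: both return 6.0 for [2, 4, 6] (true mean is 4.0).
assert buggy_mean([2, 4, 6]) == buggy_mean_renamed([2, 4, 6]) == 6.0
```

Since renaming variables and rewording comments cannot change what a program computes, a model whose debugging accuracy drops under such edits is, by construction, not reasoning about the program's semantics.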
Gift link: https://t.co/S1D5ZMpE3l
@bradlightcap has stopped following @GaryMarcus (any thoughts on this?) https://t.co/kI0mBNCoxY
In the last few days, OpenAI and its executives have claimed that its DoW deal prevents its models being used for mass domestic surveillance. As I write in a lengthy explainer for @ReadTransformer today, that appears to be misleading at best. https://t.co/IdlpVUSY0p
Be like Sam Altman
> runs YC
> starts an open-source nonprofit to regulate AI & protect humanity
> raises money for the nonprofit
> uses that money to build a closed-source AI
> creates a new for-profit company
> raises money & kicks out existing investors
> uses our data for ads in ChatGPT
> goes on the news and stands up for Anthropic against the US gov
> 24hrs later signs a deal with the US to do exactly the opposite
Folks, this is not normal. Four American soldiers have died, but let me tell you about the curtains. "I always liked gold." https://t.co/1Kt9tvNi8g