Your curated collection of saved posts and media

Showing 32 posts · last 14 days · by score
Simon Willison
@simonw
📅 Fri
🆔 24995303

I've published video, slides and a detailed annotated transcript from my talk at this week's AI Engineer World's Fair conference @aiDotEngineer in San Francisco - "The last year six months in LLMs, illustrated by pelicans on bicycles" https://t.co/j0czupXbi4

❤️ 309 likes
🔁 36 retweets
🖼️ Media (3)

Jim Kwik
@jimkwik
📅 Sat
🆔 88695146

It’s better well done than well said. https://t.co/7g0RQcLX5a

❤️ 2,046 likes
🔁 273 retweets
🖼️ Media (1)

Hamel Husain
@HamelHusain
📅 Sat
🆔 38077613

Feedback & takeaways from students in our evals course 🎉 https://t.co/xNwXBHLpzv

❤️ 48 likes
🔁 5 retweets
🖼️ Media (3)

Jerry Liu
@jerryjliu0
📅 Sat
🆔 02453336

The secret sauce to building a spreadsheet/Excel agent is not RAG or text-to-CSV, but giving an agent the right mix of tools to manipulate an Excel file. We recently released an Excel agent capable to doing data transformations and QA over deeply complex Excel files. Here is a… https://t.co/N95Vk5VkKv

❤️ 270 likes
🔁 38 retweets
🖼️ Media (1)

GosuCoder
@GosuCoder
📅 Sat
🆔 46965984

The new Gemini 2.5 Pro 06-05 update is substantially better, but temperature matters a lot for AI coding assistants. Check out this graph, this shows the average eval score based on temperature. 0.7 is the clear winner here. https://t.co/6V5BcFf4xS

❤️ 528 likes
🔁 51 retweets
🖼️ Media (1)

Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅 Sat
🆔 34666256

hot take: I like ChatGPT memory https://t.co/n22wFKNqPR

❤️ 167 likes
🔁 4 retweets
🖼️ Media (1)

ludwig
@ludwigABAP
📅 Sat
🆔 03897188

I suggest we find the people responsible for post-training LLMs into emoji-slop idiots and we simply use metal bars on their limbs https://t.co/4Uo7r09gm8

❤️ 2,388 likes
🔁 71 retweets
🖼️ Media (1)

Hamel Husain
@HamelHusain
📅 Sat
🆔 03447767

Can I use the same model for both the main task and evaluation? https://t.co/q4T1LkBRUS

❤️ 18 likes
🔁 3 retweets
🖼️ Media (1)

Teknium (e/λ)
@Teknium1
📅 Sun
🆔 95441869

Nous' discord has badass badges everyone can claim now hop on to get one - https://t.co/5EoJ4EBecb https://t.co/hKXdE8fgs8

❤️ 25 likes
🔁 1 retweets
🖼️ Media (1)

Jeremy Howard
@jeremyphoward
📅 Sun
🆔 27114148

When I was optimising ULMFiT, I came up with a trick where I ran lots of ablations and fed all the hyperparams and results to a random forest. That told me which were most important. I told @l2k about it, and @weights_biases added it to their product! :D https://t.co/iCt92Bfc7f https://t.co/l4zOnlXplF

❤️ 245 likes
🔁 15 retweets
🖼️ Media (1)

JingyuanLiu
@JingyuanLiu123
📅 Sat
🆔 49309657

Finally got a chance to learn @jxbz 's deriving muon and spectral condition, and I am AMAZED by the elegant derivation of how muP and Muon can be used together! In fact, it is natural to use Muon as the optimizer for MuP-based model training from the derivation. I would think… https://t.co/1YdojiU9YL

❤️ 113 likes
🔁 19 retweets
🖼️ Media (1)

Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅 Sun
🆔 79846043

Doctor Penguin newsletter is back again If you want a high-quality weekly curated list of medical AI papers, check it out! (link in reply) https://t.co/TBmHjrClGG

❤️ 22 likes
🔁 4 retweets
🖼️ Media (1)

Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅 Sun
🆔 45945005

https://t.co/BaNN1DeLtW

❤️ 77 likes
🔁 3 retweets
🖼️ Media (1)

Lisan al Gaib
@scaling01
📅 Sun
🆔 79117763

bro I can't do this shit anymore, why do I keep paying for this garbage this happens literally every time when I use o3 or o4-mini and they think more than a minute https://t.co/KzeifvRMB3

❤️ 365 likes
🔁 7 retweets
🖼️ Media (1)

Ethan Mollick
@emollick
📅 Sun
🆔 22249206

🚨We have a new prompting report: Prompting a model with Chain of Thought is a common prompt engineering technique, but we find simple Chain-of-Thought prompts don’t help recent frontier LLMs, including reasoning & non-reasoning models, perform any better (but do increase costs) https://t.co/BEhfIslmXT

❤️ 567 likes
🔁 58 retweets
🖼️ Media (3)

Paul Calcraft
@paul_cal
📅 Sun
🆔 17985537

Gemini 2.5 Pro 06-05 achieves SOTA slop detection by rating Memvid 8/10 on the slop scale >fundamentally a gimmick [..] popularity is likely driven by the novelty [..] "I store text in a video!" Gemini: 8/10 o3 / Claude 4 Opus: 7/10 Claude 4 Sonnet: 6/10 DeepSeek R1 0528:… https://t.co/B6UweuueeO

❤️ 105 likes
🔁 3 retweets
🖼️ Media (1)

elvis
@omarsar0
📅 Sun
🆔 53905283

How much do LLMs memorize? Meta and collaborators suggest that they can estimate model capacity by measuring memorization. "Models in the GPT family have an approximate capacity of 3.6 bits-per-parameter." Once capacity fills, generalization begins! More in my notes below: https://t.co/akfNnDqVqW

❤️ 573 likes
🔁 102 retweets
🖼️ Media (1)

Hamel Husain
@HamelHusain
📅 Sat
🆔 53561436

I believe this is the first talk of its kind - we get to hear from OpenAI on best practices for applied Evals with **real case studies**. We will also get a sneak peek of OpenAI's up and coming eval tools. https://t.co/wceQlpDVvz Will be recorded https://t.co/NPzbdynxuD

❤️ 112 likes
🔁 12 retweets
🖼️ Media (1)

bycloud
@bycloudai
📅 Sun
🆔 30075742

The day that @thinkymachines finally drops will be glorious https://t.co/nx2ZGeFDOJ

❤️ 469 likes
🔁 16 retweets
🖼️ Media (1)

martin_casado
@martin_casado
📅 Sat
🆔 72757678

In the end, LLM/LRMs may be a better exploration of the people in tech, than of the tech itself ... .. the greatest of all Rorschach tests. https://t.co/r1Gt6C0HH1

❤️ 1,003 likes
🔁 90 retweets
🖼️ Media (1)

Jerry Liu
@jerryjliu0
📅 Sun
🆔 25158272

This weekend I’m excited to share a tutorial that shows you how to build an agentic extraction workflow over a Fidelity Multi-Fund Annual Report: the document contains a list of multiple funds, with each fund reporting multiple tables of financial data. Extracting a list of… https://t.co/qcgX7xnt0w

❤️ 178 likes
🔁 32 retweets
🖼️ Media (1)

Lisan al Gaib
@scaling01
📅 Sun
🆔 11126954

A few more observations after replicating the Tower of Hanoi game with their exact prompts: - You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff. - Furthermore the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64K, and… https://t.co/ax5ZK4WkGx

❤️ 1,778 likes
🔁 255 retweets
🖼️ Media (2)
Chris Levy
@cleavey1985
📅 Sun
🆔 71243061

I have been enjoying the "AI Evals For Engineers & PMs" course from @HamelHusain and @sh_reya. I knew when I signed up I was going to be told "look at your data!". No surprise there! But this course teaches you how to do that using a scientific process. If looking at your data… https://t.co/8NRfkLUyUT

❤️ 44 likes
🔁 15 retweets
🖼️ Media (1)

Ethan Mollick
@emollick
📅 Mon
🆔 18069510

New paper shows a familiar result on LLMs & medicine: Doctors given clinical vignettes produce significantly more accurate diagnoses when using a custom GPT built with the (obsolete) GPT-4 than doctors with Google/Pubmed but not AI. Yet AI alone is as accurate as doctors + AI. https://t.co/7OxPItCfQM

❤️ 999 likes
🔁 197 retweets
🖼️ Media (2)

Ivan Leo
@ivanleomk
📅 Mon
🆔 57682304

Experimenting with using a https://t.co/g6lcz73IRm file like what @vig_xyz suggest in his lightning talk. Actually pretty good to create PRDs https://t.co/Bd7YbGEy0N

❤️ 5 likes
🖼️ Media (1)

Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅 Mon
🆔 57005904

I don't even agree with the Apple paper but this is an extremely midwit take https://t.co/mBi4vsJF0C

❤️ 1,668 likes
🔁 29 retweets
🖼️ Media (1)

Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅 Mon
🆔 50494879

Corrector Sampling in Language Models "Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by… https://t.co/5Pr69g2ez7

❤️ 276 likes
🔁 37 retweets
🖼️ Media (1)

Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅 Mon
🆔 10285119

MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning "We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary… https://t.co/6uybbpnKE8

❤️ 72 likes
🔁 10 retweets
🖼️ Media (1)

Radek Osmulski 🇺🇦
@radekosmulski
📅 Mon
🆔 39464683

I’m unfortunately falling behind in @HamelHusain’s & @sh_reya’s evals course, but I can’t shrug off the feeling that focus on evals is a head twist? It just so happens that evals are the highest leverage point in building with LLMs but there is this universe of related ideas… https://t.co/UcnLJYUylN

❤️ 113 likes
🔁 12 retweets
🖼️ Media (1)

elvis
@omarsar0
📅 Fri
🆔 54712537

Top 50 LLM Interview Questions. Looks like a great resource to learn LLM basics: https://t.co/nCik0PGOcb

❤️ 2,755 likes
🔁 392 retweets
🖼️ Media (1)

Hamel Husain
@HamelHusain
📅 Fri
🆔 82564734

What gaps in eval tooling should I be prepared to fill myself? @sh_reya and I have found the same blind spots and missing features across many eval tools. The first gap is lack of tooling for error analysis 1/5 https://t.co/kJVQUWz8sM

❤️ 130 likes
🔁 10 retweets
🖼️ Media (1)

Ethan Mollick
@emollick
📅 Fri
🆔 14821241

An example of why I think current LLMs are enough to change lots of work even if they don’t get better, once we start integrating them with other systems GPT-4 (now obsolete) went from 30% accuracy to 87% accuracy in clinical oncology decisions when given access to medical tools https://t.co/ynnpuQU7cN

❤️ 1,478 likes
🔁 190 retweets
🖼️ Media (2)