Your curated collection of saved posts and media

Showing 32 posts · last 14 days · by score

Simon Willison

@simonw

📅

Fri

🆔24995303

I've published video, slides and a detailed annotated transcript from my talk at this week's AI Engineer World's Fair conference @aiDotEngineer in San Francisco - "The last six months in LLMs, illustrated by pelicans on bicycles" https://t.co/j0czupXbi4

❤️309

likes

🔁36

retweets

Jim Kwik

@jimkwik

📅

Sat

🆔88695146

It’s better well done than well said. https://t.co/7g0RQcLX5a

❤️2,046

likes

🔁273

retweets

Hamel Husain

@HamelHusain

📅

Sat

🆔38077613

Feedback & takeaways from students in our evals course 🎉 https://t.co/xNwXBHLpzv

❤️48

likes

🔁5

retweets

Jerry Liu

@jerryjliu0

📅

Sat

🆔02453336

The secret sauce to building a spreadsheet/Excel agent is not RAG or text-to-CSV, but giving an agent the right mix of tools to manipulate an Excel file. We recently released an Excel agent capable of doing data transformations and QA over deeply complex Excel files. Here is a… https://t.co/N95Vk5VkKv

❤️270

likes

🔁38

retweets

GosuCoder

@GosuCoder

📅

Sat

🆔46965984

The new Gemini 2.5 Pro 06-05 update is substantially better, but temperature matters a lot for AI coding assistants. Check out this graph, this shows the average eval score based on temperature. 0.7 is the clear winner here. https://t.co/6V5BcFf4xS

❤️528

likes

🔁51

retweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

📅

Sat

🆔34666256

hot take: I like ChatGPT memory https://t.co/n22wFKNqPR

❤️167

likes

🔁4

retweets

ludwig

@ludwigABAP

📅

Sat

🆔03897188

I suggest we find the people responsible for post-training LLMs into emoji-slop idiots and we simply use metal bars on their limbs https://t.co/4Uo7r09gm8

❤️2,388

likes

🔁71

retweets

Hamel Husain

@HamelHusain

📅

Sat

🆔03447767

Can I use the same model for both the main task and evaluation? https://t.co/q4T1LkBRUS

❤️18

likes

🔁3

retweets

Teknium (e/λ)

@Teknium1

📅

Sun

🆔95441869

Nous' discord has badass badges everyone can claim now hop on to get one - https://t.co/5EoJ4EBecb https://t.co/hKXdE8fgs8

❤️25

likes

🔁1

retweets

Jeremy Howard

@jeremyphoward

📅

Sun

🆔27114148

When I was optimising ULMFiT, I came up with a trick where I ran lots of ablations and fed all the hyperparams and results to a random forest. That told me which were most important. I told @l2k about it, and @weights_biases added it to their product! :D https://t.co/iCt92Bfc7f https://t.co/l4zOnlXplF

❤️245

likes

🔁15

retweets

JingyuanLiu

@JingyuanLiu123

📅

Sat

🆔49309657

Finally got a chance to learn @jxbz 's deriving muon and spectral condition, and I am AMAZED by the elegant derivation of how muP and Muon can be used together! In fact, it is natural to use Muon as the optimizer for MuP-based model training from the derivation. I would think… https://t.co/1YdojiU9YL

❤️113

likes

🔁19

retweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

📅

Sun

🆔79846043

Doctor Penguin newsletter is back again. If you want a high-quality weekly curated list of medical AI papers, check it out! (link in reply) https://t.co/TBmHjrClGG

❤️22

likes

🔁4

retweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

📅

Sun

🆔45945005

https://t.co/BaNN1DeLtW

❤️77

likes

🔁3

retweets

Lisan al Gaib

@scaling01

📅

Sun

🆔79117763

bro I can't do this shit anymore, why do I keep paying for this garbage this happens literally every time when I use o3 or o4-mini and they think more than a minute https://t.co/KzeifvRMB3

❤️365

likes

🔁7

retweets

Ethan Mollick

@emollick

📅

Sun

🆔22249206

🚨We have a new prompting report: Prompting a model with Chain of Thought is a common prompt engineering technique, but we find simple Chain-of-Thought prompts don’t help recent frontier LLMs, including reasoning & non-reasoning models, perform any better (but do increase costs) https://t.co/BEhfIslmXT

❤️567

likes

🔁58

retweets

Paul Calcraft

@paul_cal

📅

Sun

🆔17985537

Gemini 2.5 Pro 06-05 achieves SOTA slop detection by rating Memvid 8/10 on the slop scale >fundamentally a gimmick [..] popularity is likely driven by the novelty [..] "I store text in a video!" Gemini: 8/10 o3 / Claude 4 Opus: 7/10 Claude 4 Sonnet: 6/10 DeepSeek R1 0528:… https://t.co/B6UweuueeO

❤️105

likes

🔁3

retweets

elvis

@omarsar0

📅

Sun

🆔53905283

How much do LLMs memorize? Meta and collaborators suggest that they can estimate model capacity by measuring memorization. "Models in the GPT family have an approximate capacity of 3.6 bits-per-parameter." Once capacity fills, generalization begins! More in my notes below: https://t.co/akfNnDqVqW

❤️573

likes

🔁102

retweets
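
To make the quoted 3.6 bits-per-parameter figure concrete, here is a rough sketch that converts it into an estimated storage capacity. The parameter counts are hypothetical round numbers chosen for illustration, not models named in the post, and the helper name is made up.

```python
# Figure quoted in the post: roughly 3.6 bits of capacity per parameter.
BITS_PER_PARAM = 3.6


def estimated_capacity_bytes(n_params: int, bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated memorization capacity in bytes (8 bits per byte)."""
    return n_params * bits_per_param / 8


if __name__ == "__main__":
    # Hypothetical model sizes, for illustration only.
    for n_params in (125_000_000, 1_000_000_000, 7_000_000_000):
        megabytes = estimated_capacity_bytes(n_params) / 1e6
        print(f"{n_params:>13,} params ~ {megabytes:,.0f} MB")
```

This is only back-of-envelope arithmetic on the quoted number; it says nothing about how the estimate was measured.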

Hamel Husain

@HamelHusain

📅

Sat

🆔53561436

I believe this is the first talk of its kind - we get to hear from OpenAI on best practices for applied Evals with **real case studies**. We will also get a sneak peek of OpenAI's up and coming eval tools. https://t.co/wceQlpDVvz Will be recorded https://t.co/NPzbdynxuD

❤️112

likes

🔁12

retweets

bycloud

@bycloudai

📅

Sun

🆔30075742

The day that @thinkymachines finally drops will be glorious https://t.co/nx2ZGeFDOJ

❤️469

likes

🔁16

retweets

martin_casado

@martin_casado

📅

Sat

🆔72757678

In the end, LLM/LRMs may be a better exploration of the people in tech, than of the tech itself ... .. the greatest of all Rorschach tests. https://t.co/r1Gt6C0HH1

❤️1,003

likes

🔁90

retweets

Jerry Liu

@jerryjliu0

📅

Sun

🆔25158272

This weekend I’m excited to share a tutorial that shows you how to build an agentic extraction workflow over a Fidelity Multi-Fund Annual Report: the document contains a list of multiple funds, with each fund reporting multiple tables of financial data. Extracting a list of… https://t.co/qcgX7xnt0w

❤️178

likes

🔁32

retweets

Lisan al Gaib

@scaling01

📅

Sun

🆔11126954

A few more observations after replicating the Tower of Hanoi game with their exact prompts:
- You need AT LEAST 2^N - 1 moves and the output format requires 10 tokens per move + some constant stuff.
- Furthermore the output limit for Sonnet 3.7 is 128k, DeepSeek R1 64K, and… https://t.co/ax5ZK4WkGx

❤️1,778

likes

🔁255

retweets
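
The arithmetic in the post above can be sketched as follows. This is a minimal illustration, assuming exactly 10 tokens per move, treating the unspecified "constant stuff" as an `overhead` parameter defaulting to 0, and reading 128k and 64K as 128,000 and 64,000 tokens. The function names are made up for this example.

```python
def min_moves(n_disks: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with n_disks disks."""
    return 2 ** n_disks - 1


def tokens_needed(n_disks: int, tokens_per_move: int = 10, overhead: int = 0) -> int:
    """Output tokens required to print every move (overhead is an assumed constant)."""
    return tokens_per_move * min_moves(n_disks) + overhead


def largest_solvable(output_limit: int, tokens_per_move: int = 10, overhead: int = 0) -> int:
    """Largest disk count whose full move list fits within output_limit tokens."""
    n = 0
    while tokens_needed(n + 1, tokens_per_move, overhead) <= output_limit:
        n += 1
    return n


if __name__ == "__main__":
    # Output limits as quoted in the post, read as round thousands (an assumption).
    for name, limit in (("Sonnet 3.7", 128_000), ("DeepSeek R1", 64_000)):
        print(name, largest_solvable(limit))
```

Under these assumptions the move list alone stops fitting at 14 disks for a 128,000-token limit and at 13 disks for a 64,000-token limit, before counting any reasoning tokens.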

Chris Levy

@cleavey1985

📅

Sun

🆔71243061

I have been enjoying the "AI Evals For Engineers & PMs" course from @HamelHusain and @sh_reya. I knew when I signed up I was going to be told "look at your data!". No surprise there! But this course teaches you how to do that using a scientific process. If looking at your data… https://t.co/8NRfkLUyUT

❤️44

likes

🔁15

retweets

Ethan Mollick

@emollick

📅

Mon

🆔18069510

New paper shows a familiar result on LLMs & medicine: Doctors given clinical vignettes produce significantly more accurate diagnoses when using a custom GPT built with the (obsolete) GPT-4 than doctors with Google/Pubmed but not AI. Yet AI alone is as accurate as doctors + AI. https://t.co/7OxPItCfQM

❤️999

likes

🔁197

retweets

Ivan Leo

@ivanleomk

📅

Mon

🆔57682304

Experimenting with using a https://t.co/g6lcz73IRm file like what @vig_xyz suggests in his lightning talk. Actually pretty good to create PRDs https://t.co/Bd7YbGEy0N

❤️5

likes

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

📅

Mon

🆔57005904

I don't even agree with the Apple paper but this is an extremely midwit take https://t.co/mBi4vsJF0C

❤️1,668

likes

🔁29

retweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

📅

Mon

🆔50494879

Corrector Sampling in Language Models "Autoregressive language models accumulate errors due to their fixed, irrevocable left-to-right token generation. To address this, we propose a new sampling method called Resample-Previous-Tokens (RPT). RPT mitigates error accumulation by… https://t.co/5Pr69g2ez7

❤️276

likes

🔁37

retweets

Tanishq Mathew Abraham, Ph.D.

@iScienceLuvr

📅

Mon

🆔10285119

MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning "We introduce MORSE-500 (Multimodal Reasoning Stress-test Environment), a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary… https://t.co/6uybbpnKE8

❤️72

likes

🔁10

retweets

Radek Osmulski 🇺🇦

@radekosmulski

📅

Mon

🆔39464683

I’m unfortunately falling behind in @HamelHusain’s & @sh_reya’s evals course, but I can’t shrug off the feeling that focus on evals is a head twist? It just so happens that evals are the highest leverage point in building with LLMs but there is this universe of related ideas… https://t.co/UcnLJYUylN

❤️113

likes

🔁12

retweets

elvis

@omarsar0

📅

Fri

🆔54712537

Top 50 LLM Interview Questions. Looks like a great resource to learn LLM basics: https://t.co/nCik0PGOcb

❤️2,755

likes

🔁392

retweets

Hamel Husain

@HamelHusain

📅

Fri

🆔82564734

What gaps in eval tooling should I be prepared to fill myself? @sh_reya and I have found the same blind spots and missing features across many eval tools. The first gap is lack of tooling for error analysis 1/5 https://t.co/kJVQUWz8sM

❤️130

likes

🔁10

retweets

Ethan Mollick

@emollick

📅

Fri

🆔14821241

An example of why I think current LLMs are enough to change lots of work even if they don’t get better, once we start integrating them with other systems. GPT-4 (now obsolete) went from 30% accuracy to 87% accuracy in clinical oncology decisions when given access to medical tools https://t.co/ynnpuQU7cN

❤️1,478

likes

🔁190

retweets
