Your curated collection of saved posts and media

Showing 32 posts · last 14 days · by score
A
Aran Komatsuzaki
@arankomatsuzaki
📅
Wed
🆔55611479

ReasonIR: Training Retrievers for Reasoning Tasks - Presents REASONIR-8B, the first retriever specifically trained for general reasoning tasks - Improves MMLU and GPQA scores by 6.4% and 22.6% respectively, relative to the closed-book baseline https://t.co/71cOOuUbH0

Media 1
❤️164
likes
🔁27
retweets
🖼️ Media
A
Aran Komatsuzaki
@arankomatsuzaki
📅
Wed
🆔45261915

Reinforcement Learning for Reasoning in Large Language Models with One Training Example - 36.0% -> 73.6% on MATH500 by performing RLVR on a single example - Applying entropy loss alone, without any outcome reward, improves perf by 27.4% https://t.co/09jHOmxBTh

Media 1
❤️445
likes
🔁57
retweets
🖼️ Media
T
Teknium (e/λ)
@Teknium1
📅
Wed
🆔45507127

ChatGPT in its recent glazemax mode vs Hermes. Left: ChatGPT, Right: Hermes https://t.co/yMpw55GVgs

Media 1, Media 2
❤️119
likes
🔁3
retweets
🖼️ Media
O
elvis
@omarsar0
📅
Tue Apr 29
🆔21700134

Building Production-Ready AI Agents with Scalable Long-Term Memory Memory is one of the most challenging bits of building production-ready agentic systems. Lots of goodies in this paper. Here is my breakdown: https://t.co/wImK3ncl4G

Media 1
❤️1,164
likes
🔁225
retweets
🖼️ Media
E
Ethan Mollick
@emollick
📅
Wed
🆔02257538

It turns out that Meta had 27 different models on LM Arena prior to the launch of Llama 4, but they announced it as if they had one model that topped the leaderboard. An extreme example of benchmark hacking (which other labs also do to lesser degrees). https://t.co/JfPmqyZiOg https://t.co/bVsUWh1218

Media 1
❤️435
likes
🔁37
retweets
🖼️ Media
D
Scott Manley
@DJSnM
📅
Wed
🆔14860615

For those who want to take Kerbal Space Program to the next level: https://t.co/IctnCjtvG0

Media 1, Media 2
+2 more
❤️884
likes
🔁62
retweets
🖼️ Media
W
Jeffrey Wang
@wangzjeff
📅
Wed
🆔31431272

my experience with o3 https://t.co/17oBXFzXHH

Media 1
❤️1,459
likes
🔁24
retweets
🖼️ Media
I
Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅
Wed
🆔46322943

Turns out DeepSeek does have a new release (671B math/prover model) but it's not R2 https://t.co/GRJa9unXuD

Media 1
❤️193
likes
🔁15
retweets
🖼️ Media
R
Arvind Narayanan
@random_walker
📅
Wed
🆔77994378

Devastating takedown of Chatbot Arena. It's one thing for leaderboards to suck because they try to quantify the unquantifiable but quite another thing to actively choose flagrantly unscientific and nontransparent practices that benefit the big dogs. https://t.co/pFGQQw0mao https://t.co/rpnNy2CmdK

Media 1, Media 2
❤️373
likes
🔁60
retweets
🖼️ Media
O
elvis
@omarsar0
📅
Tue Apr 29
🆔29588788

A Survey of Efficient LLM Inference Serving This one provides a comprehensive taxonomy of recent system-level innovations for efficient LLM inference serving. Great overview for devs working on inference. Here is what's included: https://t.co/yRl9lkFlPD

Media 1
❤️279
likes
🔁69
retweets
🖼️ Media
L
Lior⚡
@LiorOnAI
📅
Mon
🆔47472415

ByteDance might've released a paper explaining the TikTok algorithm. https://t.co/kfGIeCNYD9

Media 1
❤️323
likes
🔁36
retweets
🖼️ Media
H
Hamel Husain
@HamelHusain
📅
Mon
🆔89155831

https://t.co/zsNroCiS1L https://t.co/4IfI6zF86g

Media 1
❤️38
likes
🔁2
retweets
🖼️ Media
E
Ethan Mollick
@emollick
📅
Mon
🆔58931084

👀 Today’s AIs are already hyper persuasive. A controversial study where LLMs tried to persuade users on Reddit found: “Notably, all our treatments surpass human performance substantially, achieving persuasive rates between three and six times higher than the human baseline.” https://t.co/D7i6fdklD7

Media 1, Media 2
+2 more
❤️1,023
likes
🔁182
retweets
🖼️ Media
L
LlamaIndex 🦙
@llama_index
📅
Mon
🆔05801700

Use create-llama's "Deep Researcher" template to write legal reports in seconds! Ask a question and Deep Researcher will generate a set of sub-questions to ask of your documents, answer all of them, and then generate a report! Try it right now with npx create-llama Or learn… https://t.co/XpVtmPCv11

❤️98
likes
🔁10
retweets
🖼️ Media
I
nolen royalty
@itseieio
📅
Mon
🆔57357452

I made a website. It's called "one million chessboards dot com". it has one million chessboards on it. moving a piece moves it for everyone, instantly. no turns. you can move between boards. that's it. have fun! https://t.co/T9GqvfwJKC

❤️13,770
likes
🔁1,192
retweets
🖼️ Media
O
Oriol Vinyals
@OriolVinyalsML
📅
Mon
🆔23139670

It's not only about how long your context is, but how well you use it. Great to see Gemini 2.5 models dominating MRCR and other benchmarks on long context! See 2.5 Pro tackle a complex coding task by reasoning over an entire repo (>500k tokens). Performance and effective use of… https://t.co/asrnajUNdE

❤️295
likes
🔁27
retweets
🖼️ Media
Y
Yaroslav Bulatov
@yaroslavvb
📅
Mon
🆔15074497

Watching @liuzhuang1234's - "Transformers without Normalization", this slide is a reminder how our optimizer and architecture choices are coupled https://t.co/Jo8KNdPgk2

Media 1
❤️157
likes
🔁20
retweets
🖼️ Media
G
Aleksa Gordić (水平问题)
@gordic_aleksa
📅
Mon
🆔99677458

phew, i can finally share what i've been up to since last summer! we just raised a $23 million seed round!! 😅 i co-founded @P_1_AI w/ @PaulEremenko (ex cto of airbus, UTC, etc.) and adam nagel (ex engineering director at airbus) with a mission to build an engineering AGI for… https://t.co/5jjc31hxLv

Media 1
❤️583
likes
🔁61
retweets
🖼️ Media
O
elvis
@omarsar0
📅
Mon
🆔61830399

I guess "thinking is all you need!" Those are some insane improvements over non-thinking mode. Congrats to the Qwen team on the Qwen3 release. Love seeing the support for more agentic capabilities. Hope R2 brings more of that as well. https://t.co/qsjRbqTDDS

Media 1
❤️64
likes
🔁5
retweets
🖼️ Media
L
LlamaIndex 🦙
@llama_index
📅
Mon
🆔86255468

LlamaDeploy now supports a new message broker: @solacedotcom! LlamaDeploy is an async-first framework for deploying, scaling, and productionizing agentic multi-service systems, based on LlamaIndex Workflows. LlamaDeploy works with a variety of message bus backends, and our… https://t.co/lH6FqUC4Vv

Media 1
❤️19
likes
🔁7
retweets
🖼️ Media
W
Wing Lian (caseus)
@winglian
📅
Mon
🆔43103281

Qwen 3 by @Alibaba_Qwen is out and it looks like the 30B MoE is better than the 32B dense model! Some quick checks show you can SFT the 32B on a single 48GB GPU, and it's possible to get it on a 4090 too once we fix some allocation issues on model load. https://t.co/9s6sqL3QBD

Media 1
❤️92
likes
🔁5
retweets
🖼️ Media
L
Lior⚡
@LiorOnAI
📅
Mon
🆔25223240

QWEN-3 is finally out! > Matches Gemini 2.5 Pro performance > Outperforms OpenAI o1 > Open-sourced (Apache 2.0) > 119 languages, 32K–128K context https://t.co/KFIrKFNqzI

Media 1
❤️142
likes
🔁12
retweets
🖼️ Media
B
ben
@benhylak
📅
Mon
🆔99444139

AI products fail constantly, in ways both hilarious and terrifying. Regular software throws exceptions. But AI products fail silently. Meet @raindrop_ai : the first Sentry-like monitoring platform for AI products. https://t.co/Olx2umPUa7

❤️682
likes
🔁63
retweets
🖼️ Media
H
Hamel Husain
@HamelHusain
📅
Mon
🆔13545342

Most business data is structured or semi-structured (tables, spreadsheets, etc.), but we tend to over-emphasize unstructured data retrieval in RAG. @svonava is going to tell us everything he knows about optimizing structured data retrieval with LLMs https://t.co/bhztbxABxs https://t.co/PX94XyoFQH

Media 1
❤️45
likes
🔁5
retweets
🖼️ Media
S
Daniel Svonava
@svonava
📅
Tue Apr 29
🆔55455288

https://t.co/CHsHdKezRa

Media 1
❤️14
likes
🔁1
retweets
🖼️ Media
E
Ethan Mollick
@emollick
📅
Tue Apr 29
🆔88142926

So Qwen 3-235B with thinking seems good, but not blowing away any of my weird frontier tests, some of which DeepSeek r1 did better. It did okay generating a p5js starship (though it had errors to correct), but failed the Lem Test and couldn't do a twigl shader in many attempts. https://t.co/bcdtTXq3HZ

Media 1, Media 2
❤️135
likes
🔁7
retweets
🖼️ Media
O
elvis
@omarsar0
📅
Mon
🆔10554423

MAGI is a new multi-agent system that dynamically navigates clinical logic via four specialized agents. Great example of how to combine reasoning and agents. Read on for more: https://t.co/9HMm3JYbUT

Media 1
❤️204
likes
🔁57
retweets
🖼️ Media
E
Eugene Yan
@eugeneyan
📅
Tue Apr 29
🆔55979479

Tenets from Duolingo's push to be AI-first • AI will be everywhere in our product • Start with AI for every task • Spend 10% of your time learning • Share what you learn • Avoid overbuilding • Build and experiment carefully • Technical excellence still matters https://t.co/EZMtZNaKSp

Media 1
❤️121
likes
🔁12
retweets
🖼️ Media
A
Aran Komatsuzaki
@arankomatsuzaki
📅
Tue Apr 29
🆔16146353

Scaling Laws For Scalable Oversight Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To address… https://t.co/jel5RtvBJt

Media 1
❤️179
likes
🔁37
retweets
🖼️ Media
U
Unsloth AI
@UnslothAI
📅
Mon
🆔96809017

You can now Run Qwen3 locally with our Dynamic GGUFs! 🌠 With 128K Context Length added. Our Dynamic 2.0 GGUFs achieve superior accuracy, outperforming other methods on 5-shot MMLU & KL Divergence. Qwen3-235B-A22B coming soon. GGUFs: https://t.co/3OH7kpzXL3 https://t.co/wQjgJG34WW

Media 1
❤️619
likes
🔁106
retweets
🖼️ Media
I
Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅
Tue Apr 29
🆔62111634

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning "we introduce SelfPlay Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level… https://t.co/gkAt6tVlOe

Media 1
❤️115
likes
🔁29
retweets
🖼️ Media
I
Tanishq Mathew Abraham, Ph.D.
@iScienceLuvr
📅
Tue Apr 29
🆔54666432

BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text "we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52… https://t.co/4P6Um4Qme7

Media 1
❤️38
likes
🔁8
retweets
🖼️ Media