Your curated collection of saved posts and media
ReasonIR: Training Retrievers for Reasoning Tasks - Presents REASONIR-8B, the first retriever specifically trained for general reasoning tasks - Improves MMLU and GPQA scores by 6.4% and 22.6% respectively, relative to the closed-book baseline https://t.co/71cOOuUbH0
Reinforcement Learning for Reasoning in Large Language Models with One Training Example - 36.0% -> 73.6% on MATH500 by performing RLVR on a single example - Applying entropy loss alone, without any outcome reward, improves perf by 27.4% https://t.co/09jHOmxBTh
ChatGPT in it's recent glazemax mode vs Hermes Left: ChatGPT, Right: Hermes https://t.co/yMpw55GVgs

Building Production-Ready AI Agents with Scalable Long-Term Memory Memory is one of the most challenging bits of building production-ready agentic systems. Lots of goodies in this paper. Here is my breakdown: https://t.co/wImK3ncl4G
It turns out that Meta had 27 different models on LM Arena prior to the launch of Llama 4, but they announced it as if they had one model that topped the leaderboard. An extreme example of benchmark hacking (which other labs also do to lesser degrees). https://t.co/JfPmqyZiOg https://t.co/bVsUWh1218
For those who want to take Kerbal Space Program to the next level: https://t.co/IctnCjtvG0

my experience with o3 https://t.co/17oBXFzXHH
Turns out DeepSeek does have a new release (671B math/prover model) but it's not R2 https://t.co/GRJa9unXuD
Devastating takedown of Chatbot Arena. It's one thing for leaderboards to suck because they try to quantify the unquantifiable but quite another thing to actively choose flagrantly unscientific and nontransparent practices that benefit the big dogs. https://t.co/pFGQQw0mao https://t.co/rpnNy2CmdK

A Survey of Efficient LLM Inference Serving This one provides a comprehensive taxonomy of recent system-level innovations for efficient LLM inference serving. Great overview for devs working on inference. Here is what's included: https://t.co/yRl9lkFlPD
ByteDance might've released a paper explaining the TikTok algorithm. https://t.co/kfGIeCNYD9
https://t.co/zsNroCiS1L https://t.co/4IfI6zF86g
๐Todayโs AIs are already hyper persuasive. A controversial study where LLMs tried to persuade users on Reddit found: โNotably, all our treatments surpass human performance substantially, achieving persuasive rates between three and six times higher than the human baseline.โ https://t.co/D7i6fdklD7

Use create-llama's "Deep Researcher" template to write legal reports in seconds! Ask a question and Deep Researcher will generate a set of sub-questions to ask of your documents, answer all of them, and then generate a report! Try it right now with npx create-llama Or learnโฆ https://t.co/XpVtmPCv11
I made a website. It's called "one million chessboards dot com". it has one million chessboards on it. moving a piece moves it for everyone, instantly. no turns. you can move between boards. that's it. have fun! https://t.co/T9GqvfwJKC
It's not only about how long your context is, but how well you use it. Great to see Gemini 2.5 models dominating MRCR and other benchmarks on long context! See 2.5 Pro tackle a complex coding task by reasoning over an entire repo (>500k tokens). Performance and effective use ofโฆ https://t.co/asrnajUNdE
Watching @liuzhuang1234's - "Transformers without Normalization", this slide is a reminder how our optimizer and architecture choices are coupled https://t.co/Jo8KNdPgk2
phew, i can finally share what i've been up to since last summer! we just raised a $23 million seed round!! ๐ i co-founded @P_1_AI w/ @PaulEremenko (ex cto of airbus, UTC, etc.) and adam nagel (ex engineering director at airbus) with a mission to build an engineering AGI forโฆ https://t.co/5jjc31hxLv
I guess "thinking is all you need!" Those are some insane improvements over non-thinking mode. Congrats to the Qwen team on the Qwen3 release. Love seeing the support for more agentic capabilities. Hope R2 brings more of that as well. https://t.co/qsjRbqTDDS
LlamaDeploy now supports a new message broker: @solacedotcom! LlamaDeploy is an async-first framework for deploying, scaling, and productionizing agentic multi-service systems, based on LlamaIndex Workflows. LlamaDeploy works with a variety of message bus backends, and ourโฆ https://t.co/lH6FqUC4Vv
Qwen 3 by @Alibaba_Qwen is out and it looks like the 30B MoE is better than the 32B dense model! Some quick checks show you can SFT the 32B on a single 48GB GPU, and it's possible to get it on a 4090 too once we some allocation issues on model load. https://t.co/9s6sqL3QBD
QWEN-3 is finally out! > Matches Gemini 2.5 Pro performance > Outperforms OpenAI o1 > Open-sourced (Apache 2.0) > 119 languages, 32Kโ128K context https://t.co/KFIrKFNqzI
AI products fail constantlyโin ways both hilarious and terrifying. Regular software throws exceptions. But AI products fail silently. Meet @raindrop_ai : the first Sentry-like monitoring platform for AI products. https://t.co/Olx2umPUa7
Most business data is structured or semi-structured (tables, spreadsheets, etc), but we tend to over-emphasize unstructured data retrieval in RAG @svonava is going to tell us everything he knows about optimizing structured data retrieval with LLMs https://t.co/bhztbxABxs https://t.co/PX94XyoFQH
https://t.co/CHsHdKezRa
So Qwen 3-235B with thinking seems good, but not blowing away any of my weird frontier tests, some of which DeepSeek r1 did better. It did okay generating a p5js starship (though it had errors to correct), but failed the Lem Test and couldn't do a twigl shader in many attempts. https://t.co/bcdtTXq3HZ

MAGI is a new multi-agent system that dynamically navigates clinical logic via four specialized agents. Great example of how to combine reasoning and agents. Read on for more: https://t.co/9HMm3JYbUT
Tenets from Duolingo's push to be AI-first โข AI will be everywhere in our product โข Start with AI for every task โข Spend 10% of your time learning โข Share what you learn โข Avoid overbuilding โข Build and experiment carefully โข Technical excellence still matters https://t.co/EZMtZNaKSp
Scaling Laws For Scalable Oversight Scalable oversight, the process by which weaker AI systems supervise stronger ones, has been proposed as a key strategy to control future superintelligent systems. However, it is still unclear how scalable oversight itself scales. To addressโฆ https://t.co/jel5RtvBJt
You can now Run Qwen3 locally with our Dynamic GGUFs! ๐ With 128K Context Length added. Our Dynamic 2.0 GGUFs achieve superior accuracy, outperforming other methods on 5-shot MMLU & KL Divergence. Qwen3-235B-A22B coming soon. GGUFs: https://t.co/3OH7kpzXL3 https://t.co/wQjgJG34WW
SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning "we introduce SelfPlay Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-levelโฆ https://t.co/gkAt6tVlOe
BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text "we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52โฆ https://t.co/4P6Um4Qme7