i've been working really hard to burn down the tool errors we get in our main chat flow and it's working. this is the @HamelHusain effect https://t.co/WG4s8AonsQ
@clairevo Yesssssss I am so happy right now https://t.co/wdPDNcqKov
if you showed this chart to a typical economist like 20 years ago, they would've laughed you out of the room. the right side of this is white collar jobs that were once worshipped. these jobs were comfortable, well paying, & came with societal status + recognition. your parents would've been proud of you. now these are likely all set to be severely impacted in a shorter period of time than anyone likely ever thought of, let alone projected. this is like ppl waiting on a beach enjoying the sun when a tsunami has already struck.
When you build AI agents, don't treat prompts like config strings. Treat them like executable business logic. Because that's what they really are. @arshdilbagi's blog and this Stanford CS 224G lecture lay out one of the clearest mental models I have seen for LLM evaluation. Stop treating evals like unit tests. That works for deterministic software. For LLM products, it creates false confidence because real-world usage changes over time. Example: an insurance prompt passed 20 eval cases. The team shipped. In production, a new class of requests showed up and failed quietly. No crash, no alert, just wrong answers at scale. The fix is not "write more eval cases," which is what many teams do. It is building evals as a living feedback loop. Start with a small set, ship, watch what breaks in production, add those failures back, and re-run on every prompt or model change. What eval failure caught your team off guard? Blog: https://t.co/HCVhcow5rA Stanford CS 224G lecture: https://t.co/q667gGwckt
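The "living feedback loop" the post describes can be sketched in a few lines. This is a minimal illustration, not any specific framework's API; all names (`EvalCase`, `EvalSuite`, `add_production_failure`) are hypothetical. The point it demonstrates: a small seed set can pass cleanly (false confidence), and a quiet production failure gets folded back in as a permanent regression case that re-runs on every prompt or model change.

```python
# Hedged sketch of evals as a living feedback loop (illustrative names only).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # passes iff the model output is acceptable
    source: str = "seed"          # provenance: "seed" or "production"

@dataclass
class EvalSuite:
    cases: list[EvalCase] = field(default_factory=list)

    def add_production_failure(self, prompt: str, check: Callable[[str], bool]) -> None:
        # A quiet production failure becomes a permanent regression case.
        self.cases.append(EvalCase(prompt, check, source="production"))

    def run(self, model: Callable[[str], str]) -> dict:
        # Re-run the whole set; return which prompts failed.
        failures = [c.prompt for c in self.cases if not c.check(model(c.prompt))]
        return {"total": len(self.cases), "failed": failures}

# Toy "insurance" model: handles the seed request class, silently
# mishandles a new one -- no crash, just a wrong answer.
def model_v1(prompt: str) -> str:
    if "flood" in prompt:
        return "Flood damage is covered under rider B."
    return "I can't help with that."

suite = EvalSuite([EvalCase("Is flood damage covered?",
                            lambda o: "covered" in o.lower())])
print(suite.run(model_v1))  # all seed cases pass -> false confidence

# Production surfaces a new request class that fails quietly: fold it back in.
suite.add_production_failure("Is wildfire damage covered?",
                             lambda o: "covered" in o.lower())
print(suite.run(model_v1)["failed"])  # stays red until the prompt/model is fixed
```

The design choice worth copying is the `source` field: tracking which cases came from production tells you whether your seed set ever reflected real usage.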

Google Workspace CLI: https://t.co/Rg229zYsoA
Read on for more: https://t.co/bLxqT3yUTl
thanks codex team for supporting the radare2 project with some Pro subs! ❤️ https://t.co/0fKEt16AeI
that was the old me, the six days ago me https://t.co/btAJYAwAjL
amazing work team https://t.co/9d1cEx7vy2
Couldn't help it! Had to give GPT 5.4 (High) + /fast mode a try. ✅ Added height terrains to the level ✅ Animation tweens for the jumps. Used xHigh to solve a gnarly bug with the controls successfully 💪 This Final Fantasy Tactics-inspired game was completely vibe coded! https://t.co/q2K7PovU62
We're announcing Kos-1 Lite, a medical model that achieves SOTA on HealthBench Hard at 46.6%. As a medium-sized language model (~100B), it achieves these results at a fraction of the serving cost of frontier trillion-parameter models. https://t.co/27sxAHPgZM
SWE-rebench V2: A language-agnostic pipeline that automatically harvests 32,000+ executable real-world software engineering tasks across 20 programming languages. Built for large-scale RL training of code agents with reproducible Docker environments. https://t.co/JJ0vLH5N7B
Image Generation with a Sphere Encoder https://t.co/6I2FbpogaC
Utonia: Toward One Encoder for All Point Clouds. paper: https://t.co/AJFPivgBm9 https://t.co/Xbux4iY1QV
BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? paper: https://t.co/IrLgJJomQU
Beyond Language Modeling: An Exploration of Multimodal Pretraining. paper: https://t.co/GmtPAQDo8T
Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models https://t.co/25QhR93OKK