@hugobowne
If humans built a database of current police misconduct reports in California, it would take over 35 years. California is spending $10M to see if LLMs can do it instead. But here’s the tension: ⚖️ In high-stakes domains, one mistake can undermine the whole system 🤖 In low-stakes settings like customer support, “pretty aligned” is good enough So how do we program AI agents and LLM judges to make the right tradeoff, accurate and cost-effective? That’s the focus of my conversation with @sh_reya on @VanishingData, from error analysis to evaluation frameworks, and what it really takes to process millions of documents reliably. Links to full episode in comment 👇