@HamelHusain
This was the most popular question re: evals. It’s not about achieving 95%; it’s about measuring how well your eval tracks ground truth and building a data flywheel to close the gap. Anyone who categorically claims to catch 95% of errors is selling bullshit. https://t.co/GQTo0a16bA
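
A rough sketch of what "measuring how well your eval tracks ground truth" could look like in practice, assuming you have human labels for a sample of outputs. The function name `judge_agreement` and the label format (True means "this output has an error") are made up for illustration, not taken from any particular tool:

```python
def judge_agreement(judge_labels, human_labels):
    """Compare automated eval verdicts against human ground-truth labels.

    Both arguments are equal-length sequences of truthy/falsy values,
    where True means "this output contains an error".
    (Hypothetical helper; the label convention is an assumption.)
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")

    pairs = [(bool(j), bool(h)) for j, h in zip(judge_labels, human_labels)]

    tp = sum(1 for j, h in pairs if j and h)          # real errors the eval caught
    fn = sum(1 for j, h in pairs if not j and h)      # real errors the eval missed
    fp = sum(1 for j, h in pairs if j and not h)      # false alarms
    tn = sum(1 for j, h in pairs if not j and not h)  # correctly passed outputs

    return {
        # Share of real errors the eval catches (None if there were no real errors).
        "error_recall": tp / (tp + fn) if (tp + fn) else None,
        # Share of good outputs the eval correctly passes.
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        # Indices where eval and human disagree: candidates for review,
        # which is the input to the data flywheel.
        "disagreements": [i for i, (j, h) in enumerate(pairs) if j != h],
    }


if __name__ == "__main__":
    judge = [True, True, False, False, True]
    human = [True, False, True, False, True]
    print(judge_agreement(judge, human))
```

The disagreement indices are the useful part: reviewing those cases and feeding the corrections back into the eval is one way to close the gap over time, rather than quoting a fixed catch rate.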