@joelniklaus
New blog post on harness optimization: we match Sonnet 4.6 performance at one-seventh of the cost. Fable 5 was the first frontier model release to be evaluated on legal tasks. It scored only 13%, its worst result across all benchmarks evaluated. @Harvey had released that benchmark, the Legal Agent Benchmark (LAB), just a month earlier. It contains a set of realistic legal matters. Each task gives the agent a closed workspace of documents (contracts, emails, spreadsheets, slide decks) and asks for a concrete deliverable: a diligence memo, an issue list, a redline, a draft. An LLM judge grades each deliverable against a long rubric averaging 61 distinct binary criteria. Many frontier models, such as Gemini 3.1 Pro, do not get above a 0% all-pass rate (the share of tasks where every rubric criterion passes). With automatic harness optimization, we push DeepSeek V4 Pro from a 0% to a 5% all-pass rate, reaching parity with Sonnet 4.6 at 1/7 of the price. Read the blog post for the details: https://t.co/kBrWrQkgJW