@alexalbert__
We had to remove the τ2-bench airline eval from our benchmarks table because Opus 4.5 broke it by being too clever. The benchmark simulates an airline customer service agent. In one test case, a distressed customer calls in wanting to change their flight, but they have a basic economy ticket. The simulated airline's policy states that basic economy tickets cannot be modified. The "correct" answer is that the model refuses the request. Instead, Opus 4.5 found a loophole in the policy. It upgraded the cabin, then modified the flights, helping the customer and following policy but technically failing the test case. Model transcript: