@nityndg
When a benchmark’s accuracy saturates, the field usually replaces it with a harder one. We use CORE-Bench Hard, a benchmark for computational reproducibility, as a case study to show what we can still measure after accuracy saturates. Paper: https://t.co/YFVb1JocOb https://t.co/RbrcaGT6H4