@emollick
GDPval is one of the most important benchmarks of AI ability because it is grounded in human expertise. It compares expert human performance to AI performance using expert human judges who spend an average of an hour evaluating each answer, and it includes holdout questions that are not public. It is also very expensive to run, and OpenAI controls it, so I understand the need for alternatives. But GDPval-AA is not a good alternative. GDPval-AA has AI models respond to the public dataset of GDPval questions, then asks Gemini 3.1 to judge which of two LLM answers is better. There is no reason to suspect that this is highly correlated with what an expert human would think, nor is there a human baseline to compare against.