@dair_ai
Another fascinating paper on LLM judges. (bookmark it)

It's from Amazon, and it shows that if you run panels of LLM judges, averaging their scores is a trap.

"Overall, we establish that robust aggregation of a small, diverse committee is a parameter-efficient and statistically principled alternative to scaling a single large LLM-as-judge"

The paper proves that a mean-based panel picks up unbounded bias the moment one judge fails in a biased, LLM-typical way. Mode collapse, sycophancy, or a safety refusal from a single judge is enough, and adding more judges does not save you.

RoPoLL keeps the panel but swaps the aggregator for the geometric median. It is tuning-free and attains the optimal breakdown point of one-half, backed by a finite-sample bound and a matching minimax lower bound.

Across 13 judges from 4B to 675B parameters and corruption rates up to 50 percent, RoPoLL beats the mean on every biased corruption type. A 3-judge committee at 38B parameters outscores Mistral-Large-3 at 675B under 30 percent corruption: roughly an 18x parameter advantage with better accuracy.

Paper: https://t.co/t5QomKUoU0

Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c
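To make the mean vs. geometric median contrast concrete, here is a minimal sketch and not the paper's implementation. It assumes each judge returns a small score vector (two hypothetical criteria here), uses the standard Weiszfeld iteration to approximate the geometric median, and all names and numbers (`geometric_median`, `mean_aggregate`, the example scores) are made up for illustration. For a single scalar score per judge, the geometric median reduces to the ordinary median.

```python
import math

def mean_aggregate(points):
    """Coordinate-wise mean of the judges' score vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def geometric_median(points, iters=200, tol=1e-9):
    """Approximate the point minimizing the sum of Euclidean distances
    to `points`, using Weiszfeld's iteration (a standard algorithm,
    assumed here; the post does not say how RoPoLL computes it)."""
    dim = len(points[0])
    y = mean_aggregate(points)  # start from the mean
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Inverse-distance weight; the floor avoids division by zero
            # when the iterate lands exactly on a data point.
            w = 1.0 / max(math.dist(p, y), 1e-12)
            den += w
            for d in range(dim):
                num[d] += w * p[d]
        new_y = [n / den for n in num]
        if math.dist(new_y, y) < tol:
            return new_y
        y = new_y
    return y

# Hypothetical panel: four judges roughly agree, one fails in a biased way
# (e.g. a refusal mapped to the minimum score on both criteria).
scores = [
    [7.0, 8.0],
    [7.2, 7.9],
    [6.9, 8.1],
    [7.1, 8.0],
    [1.0, 1.0],  # corrupted judge
]

print("mean:            ", mean_aggregate(scores))
print("geometric median:", geometric_median(scores))
```

In this toy panel the mean is dragged well over a point away from the honest cluster by the single failed judge, while the geometric median stays inside it. That is the behavior the breakdown-point claim describes: the aggregate stays bounded as long as fewer than half the judges are corrupted.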