@omarsar0
New benchmark from Google Research. Models get better at benchmarks, but do they actually get more factual?

Previous evaluations focused on narrow slices: grounding to documents, answering from memory, or using search. A model that excels at one often fails at another. This new research introduces the FACTS Leaderboard, a comprehensive suite that measures factuality across four distinct dimensions:

- FACTS Multimodal tests visual grounding combined with world knowledge on ~1,500 image-based questions.
- FACTS Parametric assesses closed-book factoid recall using 2,104 adversarially sampled questions that stumped open-weight models.
- FACTS Search evaluates information-seeking with web tools across 1,884 queries, including multi-hop reasoning.
- FACTS Grounding v2 tests whether long-form responses stay faithful to provided documents.

The aggregate FACTS Score averages performance across all four dimensions (the arithmetic is sketched at the end of this post).

Results: Gemini 3 Pro leads with 68.8% overall. Gemini 2.5 Pro follows at 62.1%, then GPT-5 at 61.8%. But the sub-scores tell a different story.

Claude models are precision-oriented: they achieve high no-contradiction rates but hedge frequently on parametric questions, with Claude 4 Sonnet declining to attempt 45.1% of parametric queries. GPT models show higher coverage but more contradictions. On multimodal, even the best models reach only ~47% accuracy when both complete coverage and zero contradictions are required. On parametric knowledge, the spread is enormous: Gemini 3 Pro hits 76.4% while GPT-5 mini manages just 16.0%.

The benchmark maintains both public and private splits to prevent overfitting, and all evaluation runs through Kaggle with standardized search tools for fair comparison.

A single factuality number hides crucial behavioral differences: some models guess aggressively, others hedge conservatively. This suite exposes those tradeoffs across the contexts where factuality actually matters.

Paper: https://t.co/TCHOSGlQKs

Learn how to evaluate and build effective AI agents in our academy: https://t.co/JBU5beIoD0
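
To make the aggregation and the guess-vs-hedge tradeoff concrete, here is a minimal Python sketch. It is not the official Kaggle harness: the unweighted mean follows the "averages performance across all four" description above, but the per-dimension inputs, the coverage/precision formulation, and the answer counts are assumptions for illustration only.

```python
from statistics import mean

# Hypothetical per-dimension scores for one model (fractions in [0, 1]).
# These are illustrative placeholders, NOT the leaderboard's numbers.
sub_scores = {
    "multimodal": 0.50,
    "parametric": 0.70,
    "search": 0.65,
    "grounding_v2": 0.75,
}

# The aggregate FACTS Score, taken as a plain unweighted mean of the
# four dimension scores, as described in the post.
facts_score = mean(sub_scores.values())
print(f"FACTS Score: {facts_score:.1%}")  # -> FACTS Score: 65.0%

# One assumed way to formalize the guess-vs-hedge tradeoff on the
# 2,104 parametric questions: declining to answer avoids contradictions
# but lowers coverage. Counts are invented; "attempted" is picked to
# mirror the ~55% attempt rate quoted for Claude 4 Sonnet above.
total = 2_104
attempted = 1_155  # hypothetical: questions the model actually answered
correct = 924      # hypothetical: correct answers among those attempts
coverage = attempted / total     # fraction of questions attempted
precision = correct / attempted  # accuracy among attempted answers
print(f"coverage={coverage:.1%}, precision={precision:.1%}")
```

Under metrics like these, a conservative model scores high precision and low coverage while an aggressive guesser shows the opposite profile, which is exactly the behavioral difference a single aggregate number hides.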