@HelloSurgeAI
Introducing GDP.pdf: an expert multimodal reasoning benchmark for the documents that run the world.

We've spent years measuring AI against the extraordinary: proving theorems, solving AGI. But the global economy doesn't run on the extraordinary. It runs on paperwork. More precisely: unsexy, poorly scanned, densely formatted PDFs. Contracts, invoices, medical records, blueprints: the documents that actually run the world.

GDP.pdf tests frontier models on their ability to handle real-world documents across ten professional industries:

Construction: Can a model measure load-bearing walls on a blueprint?
Law: Can it parse liability caps in a commercial lease?
Finance: Can it calculate margin profiles in a buy-side memo?

The reality: every frontier model scored under 15%.

GDP.pdf asks a critical question: If a $100B model can't accurately reason about a drug interaction table in a PDF, is it actually ready for the enterprise? Right now, the answer is no.

Check out the blog post and leaderboard below.

Blog: https://t.co/0Wj97DBYTC
Leaderboard: https://t.co/9CMY6JVPtj