@iScienceLuvr
Prediction Arena: Benchmarking AI Models on Real-World Prediction Markets

"Over a 57-day longitudinal study (January 12 to March 9, 2026), we evaluated ten frontier language models across two cohorts. For Cohort 1 (six models, live trading, full period), final Kalshi returns ranged from −16.0% to −30.8%, with all models declining over the full horizon."

"Cohort 2 (four next-generation models, paper trading, Mar 6–9) provides 3-day preliminary signals. gpt-5.4 leads at +1.22% while glm-5 trails at −4.09%, yielding a 5.3 percentage-point intra-generation spread after only 3 days."
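
A quick check of the quoted Cohort 2 spread, as a minimal sketch: the variable names are illustrative and not from the paper, and the only inputs are the two returns quoted above.

```python
# Cohort 2 returns after 3 days, in percent, as quoted in the abstract.
leader_return = 1.22    # gpt-5.4
trailer_return = -4.09  # glm-5

# Spread in percentage points: best return minus worst return.
spread_pp = leader_return - trailer_return
print(f"{spread_pp:.2f} percentage points")  # prints "5.31 percentage points"
```

The result rounds to the 5.3 percentage points stated in the abstract.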