@emollick
I propose the Encounter Test as a nerdy benchmark standard for AI. Ask an AI to simulate an encounter between two D&D creatures & see how long it takes to mess up. Drow vs. mind flayer: GPT-4o does best, Gemini is cute. Outcomes similar (I am sure better prompting would help) https://t.co/bZFOpSBW3r