@a1zhang
Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? VideoGameBench evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇 https://t.co/kcBZ8vsDyw