
Super Mario as an AI Benchmarking Tool?

Here’s Why It Matters

For decades, video games have been the battleground for AI supremacy. Chess, Dota 2, Minecraft, even StarCraft—each has served as a proving ground for artificial intelligence. But now, researchers have a new champion in AI benchmarking: Super Mario Bros. And as it turns out, this pixelated plumber might be AI’s toughest challenge yet.

Researchers from Hao AI Lab at the University of California San Diego recently tested some of the biggest AI models on a real-time, emulated version of Super Mario Bros. The results? Let’s just say that not every AI is ready to jump on Goombas with precision. Anthropic’s Claude 3.7 emerged as the top performer, with Claude 3.5 following close behind. Surprisingly, Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled to keep pace, especially when it came to rapid reflexes and timing-sensitive tasks.



Why Super Mario Is the New AI Proving Ground
Games have long been a favorite way to stress-test AI. They offer structured environments with clear goals—perfect for evaluating decision-making, learning capabilities, and adaptability. But what makes Super Mario Bros. uniquely challenging is its mix of fast-paced action, unpredictable obstacles, and real-time decision-making.
 
Unlike a turn-based game such as chess, where the board holds still while an AI calculates its next move, Mario punishes hesitation. A single mistimed jump can send the mustachioed hero plummeting to his demise. AI models trained to reason through problems methodically, like OpenAI’s GPT-4o, often stumble here because they simply take too long to decide what to do next. By the time they compute an optimal strategy, Mario is already toast.

To run the experiment, researchers fed AI models basic instructions and in-game screenshots through GamingAgent, a framework developed in-house at Hao AI Lab. The AI then generated Python code to control Mario, effectively playing the game autonomously. While some models adapted well, others struggled with the game's relentless pace.
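To see why response time matters so much in a loop like this, here is a minimal toy sketch in Python. It is not GamingAgent's actual code or API: the names `Decision`, `make_policy`, and `play`, the 60 frames-per-second rate, and the 12-frame jump window are all assumptions made for illustration. The idea is simply that the game keeps advancing while the model is still deciding.

```python
from dataclasses import dataclass
from typing import Callable

FPS = 60  # assumption: the emulated game advances roughly 60 frames per second


@dataclass
class Decision:
    action: str       # hypothetical action label, e.g. "jump" or "run_right"
    latency_s: float  # how long the model took to answer, in seconds


def frames_elapsed(latency_s: float, fps: int = FPS) -> int:
    """Frames that pass in the game while the model is still deciding."""
    return int(round(latency_s * fps))


def make_policy(latency_s: float, obstacle_frame: int,
                jump_window: int = 12) -> Callable[[int], Decision]:
    """Build a stand-in for 'screenshot in, action out' with a fixed latency."""
    def decide(observed_frame: int) -> Decision:
        # Jump only once the obstacle is within the jump window of what was seen.
        close = obstacle_frame - observed_frame <= jump_window
        return Decision("jump" if close else "run_right", latency_s)
    return decide


def play(decide: Callable[[int], Decision], obstacle_frame: int,
         jump_window: int = 12) -> str:
    """Toy loop: observe a frame, wait for the decision, then apply it."""
    frame = 0
    while frame < obstacle_frame:
        decision = decide(frame)
        # The game does not pause: the action lands some frames after the observation.
        frame += max(1, frames_elapsed(decision.latency_s))
        in_window = obstacle_frame - jump_window <= frame <= obstacle_frame
        if decision.action == "jump" and in_window:
            return "cleared"
    return "hit"


fast = play(make_policy(0.1, obstacle_frame=120), obstacle_frame=120)
slow = play(make_policy(1.5, obstacle_frame=120), obstacle_frame=120)
print(fast, slow)
```

With these toy numbers, the policy that answers in 0.1 seconds gets its jump in on time, while the one that takes 1.5 seconds never observes the obstacle at close range before it has already been reached. The real benchmark is far richer, but the same timing pressure applies.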

Image Credit: Hao AI Lab, University of California San Diego


Real-Time Games Expose AI’s Biggest Weakness
The results of the Super Mario test point to a growing issue in AI research: many models are excellent at thinking but terrible at acting quickly. Traditional AI benchmarks prioritize logical reasoning, code generation, and language processing, but few evaluate split-second decision-making.
 
"We're seeing what amounts to an evaluation crisis in AI right now," OpenAI co-founder Andrej Karpathy noted in a recent post. "The industry doesn’t have a great way to measure overall AI intelligence, and these gaming benchmarks expose real gaps."
 
Think about it: In controlled environments—like answering a trivia question or solving a math problem—AI shines. But put it in a dynamic, unpredictable world, and things start to fall apart. This has real-world implications, especially for AI applications that require real-time responses, like autonomous vehicles, robotics, and cybersecurity.


Not Just a Game: A Glimpse Into AI’s Future
Super Mario isn’t just a quirky AI benchmark. It’s a crash test for how well AI can handle real-world unpredictability. The best AI won’t just be the one that can generate the longest essay or the most elegant piece of code. It will be the one that can react, adapt, and plan in real time.
 
Hao AI Lab’s findings also highlight an emerging divide in AI research. There’s one camp focused on models that reason deeply and another focused on models that react instantly. The best AI will eventually need to do both, balancing strategic thinking with quick reflexes.
 
So what’s next? Expect more AI researchers to throw their models into the deep end of video game benchmarks. The next big challenge might be dynamic, physics-based worlds like Breath of the Wild or GTA V, where adaptability is key.
 
For now, Super Mario Bros. remains AI’s toughest test. And if an AI model can master Mario, who knows? Maybe someday, it’ll be ready for something even harder—like the real world.
