Cross-posted from Substack. Converted to Markdown by ChatGPT 4o; any transcription errors are the AI’s fault.
The original Super Mario Bros. may have been a vast leap in game design when it first appeared in 1985, but by today’s standards, it’s an extremely simple game. So simple that creating bots that can complete the first level is a common beginner tutorial project for both symbolic scripting and machine learning.
Given that, one would expect the AI industry (which has received many accolades and much investment for its touted success in complex games like Go, StarCraft II, Minecraft, and Pokemon Red) to have long since fully solved Super Mario Bros.
And yet, I have never seen a program of any kind that can complete the game in its entirety.
Why should this be the case? One explanation might be that the problem is so clearly trivial that the major AI research labs are already putting all their effort into more interesting and consequential problems (like Pokemon), and don’t want to invest the time and resources it would take to solve a problem that’s well behind the current frontier.
This may well be the case for major AI labs, of course, but it still seems surprising to me that no ambitious student has taken the simple PyTorch tutorial mentioned above and extended it to complete the game as an impressive portfolio project. You can find a fair number of videos on YouTube of people who have implemented their own ML code to complete the first level, or even multiple levels, but none who have “closed the loop”[1] and taken their project all the way to finishing the game.
And, to me, this strange omission in the long list of purported AI successes seems like a point of confusion worth noticing. One thing we have learned about ML-based AI in the last few years is that, whatever’s actually going on inside it, it certainly doesn’t look very much like what goes inside a human.
Among other things, this means that tasks that are easy for humans are often hard for AI tools (and vice versa), and it also means that two tasks that are of similar difficulty for a human are not necessarily of similar difficulty for an AI tool. A typical human may be able to go from completing 1-1 to completing 8-4 after just a few weeks or months of practice[2], but that doesn’t necessarily mean that the same can be assumed of any particular ML algorithm.
It may turn out to be true that the full game is in fact just as simple as the first level, when approached with a sufficiently good set of tools and skills, but there are enough strange walls and pitfalls in the landscape of AI progress that it’s still very much an assumption worth verifying.
To my knowledge, the most successful attempt at building a Mario-completing bot to date has been a YouTube project called LuigI/O in 2018, which used a genetic algorithm called NEAT (NeuroEvolution of Augmenting Topologies) to evolve simple neural networks that choose controller inputs based on the game state as extracted from RAM.
This project did eventually complete every level in the game (and went on to complete all the main levels in Super Mario Bros. 2[3] and several early levels of Super Mario Bros. 3).
The main reason I don't regard this as a successful completion of the whole game, though, is that it started each level from a save state, and that the outcome of its inputs was always fully deterministic given that starting state. That is, the algorithm as it existed would probably not have been able to turn the console on and play the entire game straight through; nor would it have been able to enter a game already in progress and begin playing from whatever the current state happened to be.
A true solution to Super Mario Bros., in my view, should be able to do both of those things.
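For readers who haven't seen this kind of setup, here is a minimal sketch of the general shape of a NEAT-based Mario bot, using the neat-python package and the gym-super-mario-bros environment. To be clear, these are my own illustrative choices, not a reconstruction of LuigI/O: that project read hand-picked RAM addresses, while this sketch feeds the network a coarsely downsampled screen, and the config file name and fitness function below are invented for the example.

```python
# Sketch of a NEAT setup for Mario: evolve small networks that map
# game state to controller inputs, scored by how far right Mario gets.
import neat
import numpy as np
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT


def make_env():
    env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0')
    return JoypadSpace(env, SIMPLE_MOVEMENT)


def observe(frame):
    """Downsample the 240x256 RGB frame to a small grayscale input vector."""
    gray = frame.mean(axis=2)            # collapse color channels
    small = gray[::16, ::16] / 255.0     # 15x16 grid of coarse "tiles"
    return small.flatten()


def eval_genomes(genomes, config):
    for genome_id, genome in genomes:
        net = neat.nn.FeedForwardNetwork.create(genome, config)
        env = make_env()
        state = env.reset()              # old-style gym API (obs only) assumed
        fitness, done = 0, False
        for _ in range(2000):            # cap the episode length
            outputs = net.activate(observe(state))
            action = int(np.argmax(outputs))     # one output per button combo
            state, reward, done, info = env.step(action)
            fitness = max(fitness, info['x_pos'])  # reward rightward progress
            if done:
                break
        genome.fitness = float(fitness)
        env.close()


# 'neat-mario.cfg' is a hypothetical config file defining population size,
# input/output counts (here 15*16 = 240 inputs, len(SIMPLE_MOVEMENT) outputs),
# mutation rates, and so on.
config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation,
                     'neat-mario.cfg')
population = neat.Population(config)
population.add_reporter(neat.StdOutReporter(True))
winner = population.run(eval_genomes, 30)   # evolve for 30 generations
```

Note that an agent like this is trained and evaluated one level at a time, from a fixed starting state, which is exactly the limitation described above.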
I did make a market on this topic on Manifold Markets a couple of years ago. That market resolved in the negative at the end of 2024, but the criteria I laid out there for what I’d expect of a successful solution still apply today, and are as follows:
- The bot must play the whole game in a single session, starting from boot and ending with Mario (or Luigi) touching the axe at the end of 8-4, including the title screen and the noninteractive cutscenes between levels. The use of save states to start at the beginning of particular levels (as LuigI/O and most ML demos do) is not allowed.
- The bot must play on an unmodified cartridge or ROM image.
- There must be no human interaction with the bot throughout the game. It’s fine if the bot self-modifies during the run (e.g. by adjusting weights in a neural net), but it can’t have human assistance in loading the updates.
- The interface between the bot and the game must introduce some amount of imprecision into the input, such as by implementing “sticky actions” as seen in the Arcade Learning Environment (see the sketch after this list). This ensures that the bot is reacting to potentially unfamiliar situations as opposed to just memorizing a precise sequence of inputs.
- It’s fine if the bot has access to “hidden” information about the game state, e.g. by reading RAM values.
- It’s fine if the bot skips levels with warp zones (although a maximally impressive demo would complete the game warpless).
- It’s fine if the bot makes use of glitches (wall jumps, wall clips, etc.). Bonus points if it discovers glitches currently unknown to speedrunners.
- The bot does not need to play in real-time; any amount of processing time between frames is okay.
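To make the “sticky actions” criterion concrete, here is a minimal sketch of such a filter, modeled on the Arcade Learning Environment's scheme: on each frame, with some probability, the environment repeats the previously executed action instead of the one the bot just requested. The gym-style wrapper and the 0.25 repeat probability (ALE's default) are my own illustrative choices; any interface that injects comparable input imprecision would satisfy the criterion.

```python
# A "sticky actions" wrapper: with probability repeat_prob, ignore the
# bot's requested action and repeat whatever was executed last frame.
# Assumes an old-style gym interface (step returns a 4-tuple).
import random
import gym


class StickyActions(gym.Wrapper):
    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0  # NOOP until the first real action

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        if random.random() < self.repeat_prob:
            # Repeat the previously executed action instead.
            action = self.last_action
        else:
            # Track the action actually executed this frame.
            self.last_action = action
        return self.env.step(action)
```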
The reason why I think it’s so important that the bot start the game from boot and that it be affected by “sticky actions” is that, for a human, every playthrough of a game like this is slightly different. You could easily write a bot that just has the “correct” input for every frame in the game hardcoded, of course (this is basically what a tool-assisted speedrun is).
But a bot that meaningfully “knows” how to play the game would also need to respond to unexpected situations. Even skilled humans frequently make slight timing mistakes that result in a button being pushed a frame earlier or later than planned. Over the course of the entire game, these slight variations in timing can accumulate and result in levels or screens being loaded on different framerules, which will change which enemies spawn and what movement patterns they have.
Humans (or, at least, the ones who aren’t trying to complete a TAS-perfect speedrun of the game) generally adapt to those slight changes in gameplay without even noticing them. A bot that can’t likewise handle such variations is not a bot that has actually “solved” the game, in my opinion.
So, that’s my challenge to the world of AI enthusiasts: Write a program that can finish Super Mario Bros. all the way through, from boot, with some kind of “sticky actions” filter applied to its input. I’m generally pretty skeptical of what AI is capable of myself, but in this case, I’m pretty sure that y’all should be able to do it. (Let’s say p=80% or so.)
And if you try and find out that you can’t, for some reason, then that should be a surprising fact that has implications for the current state of AI research.
1. Back in the mid-2000s, I spent a while on discussion forums for Steorn, an Irish technology company that claimed in an expensive ad in The Economist to have developed a perpetual motion machine. People on those forums often pointed out that there are an awful lot of people who are able to do some kind of complicated measurement of “energy in” versus “energy out” on some elaborate machine and claim that net energy is being produced. But, for some strange reason, every one of those projects seems to run into problems when it attempts to take the simple step of feeding the “energy out” of the machine back into the “energy in” (“closing the loop”), and thereby actually getting a device that runs indefinitely with no external input. I think about that a lot when I hear people hyping up the amazing things that AI can do today. ↩
2. It took about 35 years in my case, but I may be an outlier. ↩
3. a.k.a. Super Mario Bros.: The Lost Levels outside of Japan. ↩