Fun with PHPUnit Data Providers

Most PHP developers are familiar with PHPUnit these days. It is the most widely used testing framework for PHP by a wide margin (although others do exist). One of its more under-utilized features, though, is data providers.

Data providers are a PHPUnit feature (many testing frameworks have an equivalent) that lets you run a single test method multiple times with different data. They are often presented as a way to save typing, but I find they are a useful architectural tool, too. And there are ways to use them that are even nicer than what most people tend to do.

A brief introduction

As a quick refresher, let's go over how data providers work. This example is based on a patch I have been working on for TYPO3, simplified somewhat to make it clearer and using PHP 8 syntax for fun. I have a class that, among other things, takes in two optional lists, an allow-list and an exclude-list, and then is asked to compute whether or not a given item should be permitted. Here's the code:

class Config
{
    public function __construct(
        protected string $name,
        protected array $channels = [],
        protected array $excludeChannels = []
    ) {}

    /**
     * Determines if a given channel name is allowed.
     *
     * If no channel restrictions are specified at all, all channels are accepted.
     * Otherwise, a channel must be in the accept list (if one is given) and must
     * not be in the exclude list. A channel in both lists is rejected.
     */
    public function acceptsChannel(string $channel): bool
    {
        return (empty($this->channels) || in_array($channel, $this->channels, true))
            && (empty($this->excludeChannels) || !in_array($channel, $this->excludeChannels, true));
    }
}
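
To make the rules concrete, here is a small sketch of how the class behaves for a few inputs. The Config class is repeated in condensed form only so the snippet runs on its own (PHP 8 is assumed, for constructor promotion), and the channel names are arbitrary.

```php
<?php

// Condensed copy of the Config class above, so this snippet runs on its own.
class Config
{
    public function __construct(
        protected string $name,
        protected array $channels = [],
        protected array $excludeChannels = []
    ) {}

    public function acceptsChannel(string $channel): bool
    {
        return (empty($this->channels) || in_array($channel, $this->channels, true))
            && (empty($this->excludeChannels) || !in_array($channel, $this->excludeChannels, true));
    }
}

// No lists at all: everything is accepted.
$open = new Config('name');
var_dump($open->acceptsChannel('a')); // bool(true)

// Allow list only: just the listed channels are accepted.
$allowOnly = new Config('name', ['a', 'b']);
var_dump($allowOnly->acceptsChannel('c')); // bool(false)

// 'b' is in both lists: both conditions must hold, so it is rejected.
$both = new Config('name', ['a', 'b'], ['b', 'c']);
var_dump($both->acceptsChannel('b')); // bool(false)
```

Note the last case: because the two conditions are joined with &&, a channel that appears in both lists is rejected.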

Simple enough, but we still need to test it. The naive testing approach is to make a separate test case for each possible permutation of overlap between $channels and $excludeChannels. You might start off like this:

use PHPUnit\Framework\TestCase;

class ConfigTest extends TestCase
{
    /**
     * @test
     */
    public function all_channels_accepted_with_no_lists(): void
    {
        $channels = [];
        $excludeChannels = [];

        $subject = new Config('name', $channels, $excludeChannels);

        self::assertTrue($subject->acceptsChannel('a'));
    }

    /**
     * @test
     */
    public function channel_accepted_in_allow_list(): void
    {
        $channels = ['a', 'b'];
        $excludeChannels = [];

        $subject = new Config('name', $channels, $excludeChannels);

        self::assertTrue($subject->acceptsChannel('a'));
    }
    
    // ...
}

That will work, certainly, but it's a lot of redundancy. There are 8 different permutations to test, and we'll need a nearly-identical test method for each of them. That's gross and, more importantly, error-prone. If we ever need to tweak the logic slightly, that's 8 tests we have to update, and we have to update them all the same way. (In this example it's trivial, but this example was designed to be trivial.)

Instead, we can use a data provider. A data provider is a method that returns an array of data sets to feed into a single test method as parameters. The method may then be written more generically. In our case, that looks something like this:

use PHPUnit\Framework\TestCase;

class ConfigTest extends TestCase
{
    /**
     * @test
     * @dataProvider channelProvider
     */
    public function channelMatching(
        array $channels, 
        array $excludeChannels, 
        string $testChannel, 
        bool $expected = true): void
    { 
        $subject = new Config('name', $channels, $excludeChannels);

        self::assertEquals($expected, $subject->acceptsChannel($testChannel));
    }

    /**
     * @see channelMatching()
     */
    public function channelProvider(): array
    {
        return [
            [
                [],
                [],
                'a',
                true,
            ],
            [
                ['a', 'b'],
                [],
                'a',
                true,
            ],
            [
                ['a', 'b'],
                [],
                'c',
                false,
            ],
            [
                [],
                ['a', 'b'],
                'a',
                false,
            ],
            [
                [],
                ['a', 'b'],
                'c',
                true,
            ],
            [
                ['a', 'b'],
                ['c', 'd'],
                'a',
                true,
            ],
            [
                ['a', 'b'],
                ['c', 'd'],
                'c',
                false,
            ],
            [
                ['a', 'b'],
                ['b', 'c'],
                'b',
                false,
            ],
        ];
    }
}

The @dataProvider annotation on the test method tells PHPUnit to call that test method once for each set of arguments returned by the method channelProvider(). (The method name can be anything, but it's common convention to suffix it with Provider. The @see annotation is purely for documentation purposes and has no runtime effect.) Thus, channelMatching() will get called 8 times: Once with arguments [[], [], 'a', true], once with arguments [['a', 'b'], [], 'a', true], and so on. That gives us 8 tests that are identical in behavior, but different in their input. Adding more test cases becomes much easier: Just add another item to the provider. Varying the logic of the test becomes much easier, should the API change: Just tweak the one test. Win for everyone!
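
Conceptually, what the runner does with a provider is little more than a loop. The following is a rough sketch of that idea in plain PHP, not PHPUnit's actual implementation: runWithProvider() is a hypothetical helper, and the provider and test are stand-in closures so the snippet runs without PHPUnit installed.

```php
<?php

// Hypothetical helper, NOT PHPUnit's real code: call $test once per data set
// and record whether each call completed without throwing.
function runWithProvider(callable $provider, callable $test): array
{
    $results = [];
    foreach ($provider() as $key => $arguments) {
        try {
            // Each data set is spread into the test's parameters by position.
            $test(...array_values($arguments));
            $results[$key] = 'pass';
        } catch (Throwable $e) {
            $results[$key] = 'fail: ' . $e->getMessage();
        }
    }
    return $results;
}

// Stand-in provider: two positional data sets (input, expected length).
$provider = fn(): array => [
    ['abc', 3],
    ['', 1], // Deliberately wrong expectation, to show a failure.
];

// Stand-in test: throws when the expectation does not hold.
$test = function (string $input, int $expected): void {
    if (strlen($input) !== $expected) {
        throw new RuntimeException("expected $expected, got " . strlen($input));
    }
};

print_r(runWithProvider($provider, $test));
```

The point of the sketch is that each data set is an independent run: the second set failing does not stop or affect the first.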

Better providers

That's all well and good, but to my eyes that provider is gross. It's a giant nested positional array where you have to just sort of know what each item means, and trust that everything is in the right order. That's awful. We're also returning a single giant array literal, which means each value in it has to fit in a single inline expression. If we need to configure more complex inputs, such as setting up a multi-part value object or similar, we'd need to do that in advance and then reference it perhaps 50 lines later. That's doubleplusungood.

Fortunately, we can do better. A lot better.

Naming things

PHPUnit will pass the arguments for each data provider case to the test method positionally. However, it explicitly calls array_values() on the data before doing so. That means we can put whatever we want as the array keys and they won't matter to the code... but they will matter to us. I find it very helpful to label the arguments with the name of the parameter they correspond to, like so:

public function channelProvider(): array
{
    return [
        [
            'channels' => [],
            'excludeChannels' => [],
            'testChannel' => 'a',
            'expected' => true,
        ],
        // ...
    ];
}

Now, reading the block we get a very clear picture of what the different random values are for. It won't impact the runtime at all, but it's better for the developer. (I don't know if PHPUnit's dev team has any plans to add support for named arguments for data providers in the future. I think it would be cool, but it might have BC implications.)
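
A quick way to convince yourself that the labels are cosmetic is to reproduce the array_values() step by hand. This sketch assumes the behavior described above (keys discarded before the call). describeArguments() is a made-up stand-in for a test method, and the second data set deliberately uses labels that match no parameter name.

```php
<?php

// Made-up stand-in for a test method: it just reports what it received.
function describeArguments(array $channels, array $excludeChannels, string $testChannel, bool $expected): string
{
    return sprintf(
        'allow=[%s] exclude=[%s] test=%s expected=%s',
        implode(',', $channels),
        implode(',', $excludeChannels),
        $testChannel,
        var_export($expected, true)
    );
}

$labelled = [
    'channels' => ['a', 'b'],
    'excludeChannels' => [],
    'testChannel' => 'a',
    'expected' => true,
];

// Same values, but the labels match no parameter name at all.
$mislabelled = [
    'allow' => ['a', 'b'],
    'deny' => [],
    'candidate' => 'a',
    'result' => true,
];

// array_values() throws the keys away, so only the order matters.
$first = describeArguments(...array_values($labelled));
$second = describeArguments(...array_values($mislabelled));

echo $first, PHP_EOL;
var_dump($first === $second);

// Without array_values(), spreading string keys is an Error: PHP 8 treats
// them as named arguments and rejects the unknown name "allow".
try {
    describeArguments(...$mislabelled);
} catch (Error $e) {
    echo get_class($e), PHP_EOL;
}
```

The flip side is that nothing checks the labels, so a typo or a stale name goes unnoticed. Keeping them in sync with the method signature is on you.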

Title test case

Another option is to title each test set returned by the data provider. By default, each test set is simply identified by its position in the array: #0, #1, #2, etc. If all tests pass, no biggie. If one fails, though, PHPUnit will report that channelMatching failed with data set #3. You will then need to go figure out which one #3 is.

Instead, you can specify a descriptive string key for each test set in the array:

public function channelProvider(): array
{
    return [
        'empty lists' => [
            'channels' => [],
            'excludeChannels' => [],
            'testChannel' => 'a',
            'expected' => true,
        ],
        // ...
    ];
}

Now, when a test fails, PHPUnit will report the name of the test set that failed. It also means that a developer reading through the tests (like, say, you next week) has a clear description of what each data set is supposed to test. Of course, it's on you to come up with a clear, short description, but that's not something you can ever escape as a developer. (And anyone who tells you otherwise is trying to sell you code analysis tools to figure out WTF you were even thinking when you wrote this code last week.)
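
The difference is easiest to see side by side. The sketch below loosely imitates how a data set might be labelled in a failure report. describeDataSet() is a hypothetical helper written for this illustration, and the exact wording of real PHPUnit output may differ between versions.

```php
<?php

// Hypothetical helper: build a label for a data set from its array key.
// Integer keys can only be shown as a position; string keys carry meaning.
function describeDataSet(string $testName, int|string $key): string
{
    return is_int($key)
        ? sprintf('%s with data set #%d', $testName, $key)
        : sprintf('%s with data set "%s"', $testName, $key);
}

$unnamed = [
    [[], [], 'a', true],
    [['a', 'b'], [], 'c', false],
];

$named = [
    'empty lists' => [[], [], 'a', true],
    'allow list, not included' => [['a', 'b'], [], 'c', false],
];

// Print the label each data set would get, first unnamed, then named.
foreach ([$unnamed, $named] as $provider) {
    foreach (array_keys($provider) as $key) {
        echo describeDataSet('channelMatching', $key), PHP_EOL;
    }
}
```

A label like "allow list, not included" tells you what broke before you have even opened the test file.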

Do you yield?

Finally, I was a bit incomplete a moment ago when I said that a data provider method returns an array. It can, but all it really needs to return is an iterable. You can build up an array yourself however you like, or you can yield values instead, making the data provider a generator. In my experience, the generator version is superior in every way. I use it pretty much 100% of the time.

Instead of returning a giant array, we can yield a name => array pair for each test set. Like so:

public function channelProvider(): iterable
{
    yield 'empty lists' => [
        'channels' => [],
        'excludeChannels' => [],
        'testChannel' => 'a',
        'expected' => true,
    ];

    yield 'allow list, included' => [
        'channels' => ['a', 'b'],
        'excludeChannels' => [],
        'testChannel' => 'a',
        'expected' => true,
    ];
    
    // ...
}

This has a number of advantages.

It's easier to read, making it better for future you to understand the code.
It's less indented, which gives you more room for longer lines for each value.
Because each test set is syntactically independent, it's easier to add new ones, reorder them, or temporarily comment out a single test set.
If a given test set needs more setup than just a static list, it's simple to put that setup before the yield statement for which it is relevant.
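
As a small illustration of the last two points, the sketch below yields named sets from a generator, doing some per-set setup just before the yield that needs it. The data is invented for the example, and iterator_to_array() is used only to show that a consumer sees the same keys and values it would get from an array.

```php
<?php

// Invented provider for illustration: a generator yielding name => arguments.
function channelProvider(): iterable
{
    yield 'empty lists' => [
        'channels' => [],
        'excludeChannels' => [],
        'testChannel' => 'a',
        'expected' => true,
    ];

    // Setup that only this data set needs lives right next to it.
    $manyChannels = array_map(fn(int $i): string => "channel$i", range(1, 50));

    yield 'large allow list, included' => [
        'channels' => $manyChannels,
        'excludeChannels' => [],
        'testChannel' => 'channel42',
        'expected' => true,
    ];
}

// A consumer can foreach over the generator exactly as it would an array.
foreach (channelProvider() as $name => $arguments) {
    printf("%s: %d arguments\n", $name, count($arguments));
}

// Collapsing it to an array keeps the set names as keys.
$sets = iterator_to_array(channelProvider());
var_dump(array_keys($sets));
```

Commenting out the second set here means commenting out one self-contained block, setup included, without touching the first.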

The last point is crucial in more complex scenarios than this one. Let's look at a different example, based on a different test in the same patch that I am greatly simplifying.

In this case, we have a class that is processing a YAML configuration file. Rather, it's processing a parsed-to-arrays version of a YAML file, because the parsing should be separated from the interpretation. The test method looks like this:

/**
 * @test
 * @dataProvider configurationProvider
 */
public function verifyProcessing(
    string $yamlConfig,
    array $setupObjects,
    object $expected): void
{
    $config = Yaml::parse($yamlConfig) ?? [];

    $subject = new ClassBeingTested($config);
    
    $subject->populate($setupObjects);
    
    $result = $subject->process();
    
    self::assertEquals($expected, $result);
}

Depending on the configuration and setup, we could expect a variety of different results. (The real test is more involved than this, but one step at a time.) The data provider, then, looks approximately like this:

public function configurationProvider(): iterable
{
    $defaultObjects = [new A(), new B()];

    $defaultConfig = <<<END
some:
    yaml:
        string:
            here:
END;

    yield 'default behavior' => [
        'yamlConfig' => $defaultConfig,
        'setupObjects' => $defaultObjects,
        'expected' => new C('args', 'here'),
    ];

    yield 'Empty config' => [
        'yamlConfig' => '',
        'setupObjects' => $defaultObjects,
        'expected' => new C('some', 'args'),
    ];

    // Build up a set of broken and hostile objects for the next case.
    $badObjects = [new Broken()];
    $hackAttempt = new Evil();
    foreach (['list', 'of', 'bad', 'stuff'] as $bar) {
        $hackAttempt->add($bar);
    }
    $badObjects[] = $hackAttempt;

    yield 'Incorrect objects' => [
        'yamlConfig' => $defaultConfig,
        'setupObjects' => $badObjects,
        'expected' => new D(),
    ];

    $customConfig = <<<END
some:
    different:
        yaml:
            string:
END;

    yield 'Custom config' => [
        'yamlConfig' => $customConfig,
        'setupObjects' => $defaultObjects,
        'expected' => new E(),
    ];
}

Naturally this has been simplified a great deal, but hopefully it still gets the point across. Trying to do all of that in a single big array would be a huge pain in the neck. By yielding each one in turn, we can do whatever setup we need.

We could also integrate even more involved logic. For instance, suppose certain test sets only work when a certain extension is loaded, or on certain PHP versions. PHPUnit already has a way to restrict an entire test method to such cases. It does not, as far as I am aware, have a way to do that for individual provider cases. However, we can easily do that ourselves if needed:

public function configurationProvider(): iterable
{
    // ...

    if (version_compare(PHP_VERSION, '8.0.0') >= 0) {
        yield 'Objects with PHP 8 syntax' => [
            'yamlConfig' => $defaultConfig,
            'setupObjects' => [new UsesConstructorPromotion()],
            'expected' => new D(),
        ];
    }

    // ...
}
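
The same pattern works for any runtime condition. Here is a self-contained sketch with invented set names and placeholder data. The extension check uses mbstring purely as an example.

```php
<?php

// Invented provider: some sets are only yielded when the runtime supports them.
function conditionalProvider(): iterable
{
    yield 'always runs' => ['input' => 'abc', 'expected' => 3];

    // PHP_VERSION_ID is an integer such as 80000 for PHP 8.0.0.
    if (PHP_VERSION_ID >= 80000) {
        yield 'needs PHP 8' => ['input' => 'abcd', 'expected' => 4];
    }

    // Only offer this set when the (example) extension is available.
    if (extension_loaded('mbstring')) {
        yield 'needs mbstring' => ['input' => 'héllo', 'expected' => 5];
    }
}

// List which sets exist on this particular runtime.
$sets = iterator_to_array(conditionalProvider());
echo implode(', ', array_keys($sets)), PHP_EOL;
```

One trade-off to keep in mind: a set that is left out this way simply never exists, so it will not show up as skipped in the test report.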

Until next time

There's more fun we can have with PHPUnit, but I am going to save that for a future post as this one is long enough already. In practice, though, this is how most of my unit tests end up looking these days. I encourage you to try it out yourself. I find it overall greatly improves the quality of my tests, as well as, by extension, the quality of my code.