Locate incorrect and incomplete unit tests with pitest.
By Henry Coles
If you’ve written any code in the last week, you’ve most likely also written a unit test to go with it. You’re not alone. These days, it’s rare to find a codebase without unit tests. Many developers invest a lot of time in their tests, but are they doing a good job?
This question began to trouble me seven years ago while I was working on a large legacy codebase for a financial services company. The code was very difficult to work with, but was a core part of the business and needed to be constantly updated to meet new requirements.
A lot of my team’s time was spent trying to wrestle the code into maintainable shape. This made the business owners nervous. They understood that the team had a problem and needed to make changes, but if the team introduced a bug it could be very costly. The business owners wanted reassurance that everything was going to be all right.
The codebase had a lot of tests. Unfortunately, the team didn’t need to examine them very closely to see that the tests were of no better quality than the code they tested. So, before the team members changed any code, they first invested a lot of effort in improving the existing tests and creating new ones.
Because my team members always had good tests before they made a change, I told the business owners not to worry: if a bug were introduced while refactoring, the tests would catch it. The owners’ money was safe.
But what if I were wrong? What if the team couldn’t trust the test suite? What if the safety net was full of holes? There was also another related problem.
As the team members changed the code, they also needed to change the tests. Sometimes the team refactored tests to make them cleaner. Sometimes tests had to be updated in other ways as functionality was moved around the codebase. So even if the tests were good at the point at which they were written, how could the team be sure no defects were introduced into tests that were changed?
For the production code, the team had tests to catch mistakes, but how were mistakes in the test code caught? Should the team write tests for the tests? If that were done, wouldn’t the team eventually need to write tests for the tests for the tests—and then tests that tested the tests that tested the tests that tested the tests? It didn’t sound like that would end well, if it ever ended at all.
Fortunately, there is an answer to these questions. Like many other teams, my team was using a code coverage tool to measure the branch coverage of the tests.
The code coverage tool would tell which bits of the codebase were well tested. If tests were changed, the team just had to make sure there was still as much code coverage as before. Problem solved. Or was it?
There was one small problem with relying on code coverage in this way. It didn’t actually tell the team anything about whether the code was tested, as I explain next.
The problem is illustrated by some of the legacy tests I found within the code. Take a contrived class such as this:
class AClass {

  private int count;

  public void count(int i) {
    if (i >= 10) {
      count++;
    }
  }

  public void reset() {
    count = 0;
  }

  public int currentCount() {
    return count;
  }
}
I might find a test that looked like this:
@Test
public void testStuff() {
  AClass c = new AClass();
  c.count(11);
  c.count(9);
}
This test gives 100 percent line and branch coverage, but tests nothing, because it contains no assertions. The test executes the code, but doesn’t meaningfully test it. The programmer who wrote this test either forgot to add assertions or wrote the test for the sole purpose of making a code coverage statistic go up. Fortunately, tests such as this are easy to find using static analysis tools.
I also found tests like this:
@Test
public void testStuff() {
  AClass c = new AClass();
  c.count(11);
  assert(c.currentCount() == 1);
}
The programmer has used the assert keyword instead of a JUnit assertion. Unless the test is run with the -ea flag set on the command line, the test can never fail. Again, bad tests such as this can be found with simple static analysis rules.
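For comparison, here is what the test might look like with a real JUnit assertion, one that fails regardless of JVM flags. This is a minimal sketch, assuming JUnit 4's assertEquals is statically imported, as in the examples that follow:

@Test
public void testStuff() {
  AClass c = new AClass();
  c.count(11);
  // A JUnit assertion fails whether or not -ea is set
  assertEquals(1, c.currentCount());
}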
Unfortunately, these weren’t the tests that caused my team problems. The more troubling cases looked like this:
@Test
public void shouldStartWithEmptyCount() {
  assertEquals(0, testee.currentCount());
}

@Test
public void shouldCountIntegersAboveTen() {
  testee.count(11);
  assertEquals(1, testee.currentCount());
}

@Test
public void shouldNotCountIntegersBelowTen() {
  testee.count(9);
  assertEquals(0, testee.currentCount());
}
These tests exercise both branches of the code in the count method and assert the values returned. At first glance, it looks like these are fairly solid tests. But the problem wasn't the tests the team had. It was the tests that the team didn't have.
There should be a test that checks what happens when exactly 10 is passed in:
@Test
public void shouldNotCountIntegersOfExactly10() {
  testee.count(10);
  assertEquals(0, testee.currentCount());
}
If this test doesn’t exist, a bug could accidentally be introduced, such as the one below:
public void count(int i) {
  if (i > 10) { // oops, missing the =
    count++;
  }
}
A small bug like this could have cost the business tens of thousands of dollars every day it was in production until the moment that it was noticed and fixed.
This kind of problem can’t be found by static analysis. It might be found by peer review, but then again it might not. In theory, the code would never be written without a test if test-driven development (TDD) were used, but TDD doesn’t magically stop people from making mistakes.
So you can’t rely on code coverage tools to tell you that code has been tested. They’re still useful, but for a slightly different purpose. They tell you which bits of code are definitely not tested. You can use them to quickly see which code definitely has no safety net, in case you wish to change it.
One of the things I used to do after writing a test was double-check my work by commenting out some of the code I'd just implemented, or else I'd introduce a small change such as changing >= to >, as in the earlier example. If I ran my test and it didn't fail, that meant I'd made a mistake.
This gave me an idea. What if I had a tool that made these changes automatically? That is, what if there were a tool that added bugs to my code and then ran the tests? I would know that any line of code for which the tool created a bug but the test didn’t fail was not properly tested. I’d know for certain whether my test suite was doing a good job.
As with most good ideas, it turned out I wasn't the first to have it. The idea had a name—mutation testing—and it was invented back in the 1970s. The topic had been researched extensively for 40 years, and the research community had developed a terminology around it.
The different types of changes I made by hand are called mutation operators. Each operator is a small, specific type of change, such as changing >= to >, changing a 1 to a 0, or commenting out a method call.
When a mutation operator is applied to some code, a mutant is created. When tests are run against a mutant version of the code, if one of the tests fails, the mutant was “killed.” If no tests fail, the mutant survived.
The academics examined the various types of mutation operators that were possible, looked at which were more effective, and explored how well test suites that detected these artificial bugs could detect real bugs. They also produced several automated mutation testing tools, including some for Java.
So why haven’t you heard about mutation testing before? Why aren’t all developers using mutation testing tools? I’ll talk about one problem now and the other a little later on.
The first problem is straightforward: mutation testing is computationally very expensive. In fact, it’s so expensive that until 2009, most academic research looked only at toy projects with fewer than a hundred lines of code. To understand why it’s so expensive, let’s look at what a mutation testing tool needs to do.
Imagine you are performing mutation tests on the Joda-Time library, which is a small library for handling dates and times. This library has about 68,000 lines of code and about 70,000 lines of test code. It takes about 10 seconds to compile the code and about 16 seconds to run the unit tests.
Now, imagine that your mutation test tool introduces a bug on every seventh line of code. So you’d have about 10,000 bugs. Each time you change a class to introduce a bug, you need to compile the code. Perhaps that would take 1 second. So that would be 10,000 seconds of compilation time to produce the mutations (this is called the generation cost), which is more than two and a half hours. You also need to run the test suite for each mutant. That’s 160,000 seconds, which is more than 44 hours. So performing mutation testing on the Joda-Time library would take almost two days.
Many of the early mutation testing tools worked exactly like this hypothetical tool. You still occasionally find people trying to sell tools that work like this, but using such a tool is clearly not practical.
When I became interested in mutation testing, I looked at the available open source tools. The best one I could find was Jumble. It was faster than the simplistic tool I described above, but it was still quite slow. And it had other problems that made it difficult to use.
I wondered if I could do better. I already had some code that seemed like it might help—it was code for running tests in parallel. It ran tests in different class loaders, so that tests which changed state held in the static variables of legacy code wouldn't interfere with each other. I called it Parallel Isolated Test (or PIT).
After many evenings of experimentation, I managed to do better. My PIT mutation testing tool could analyze 10,000 mutations in the Joda-Time library in about three minutes.
My tool has kept the initials of the codebase from which it grew, but it’s now also known as “pitest,” and it is used all over the world.
It's used for academic research and for some exciting safety-critical projects, such as testing the control systems of the Large Hadron Collider at CERN. But mainly it is used to help test the kind of non-safety-critical code that most developers produce every day. So how does it manage to be so much faster than the earlier tools?
First, it copied a trick from Jumble. Instead of spending two and a half hours compiling source code, pitest modifies bytecode directly. This allows it to generate hundreds of thousands of mutants in subsecond time.
But, more importantly, it doesn’t run all the tests against each mutant. Instead, it runs only the tests that might kill a mutant. To know which tests those might be, it uses coverage data.
The first thing pitest does is collect line-coverage data for each test so that it knows which tests execute which lines of code. The only tests that could possibly kill a mutant are the ones that exercise the line of code the mutation is on. Running any other tests is a waste of time.
Pitest then uses heuristics to select which of the covering tests to run first. If a mutant can be killed by a test, pitest usually finds the killing test in one or two attempts.
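The overall idea can be sketched in a few lines of Java. This is an illustration of the approach, not pitest's actual implementation; the Mutant and TestCase types and the coveringTestsFor and runTestAgainst methods are hypothetical:

// Sketch: run only the tests that cover the mutated line,
// trying the most promising (here, the fastest) tests first.
for (Mutant mutant : mutants) {
  List<TestCase> candidates = coveringTestsFor(mutant.line());
  if (candidates.isEmpty()) {
    mutant.markSurvived(); // nothing exercises this line, so no test can kill it
    continue;
  }
  // Heuristic: cheapest covering tests first
  candidates.sort(Comparator.comparing(TestCase::expectedRunTime));
  boolean killed = false;
  for (TestCase test : candidates) {
    if (runTestAgainst(mutant, test).failed()) {
      mutant.markKilledBy(test);
      killed = true;
      break;
    }
  }
  if (!killed) {
    mutant.markSurvived();
  }
}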
The biggest speedup is achieved when you have a mutant that is not exercised by any test. With the traditional approach, you’d have to run the entire test suite to determine that the mutant could not be killed. With the coverage-based approach, you can determine this instantly with almost no computational cost.
Line coverage identifies code that is definitely not tested. If there is no test coverage for the line of code where a mutant is created, then none of the tests in the suite can possibly kill it. Pitest can mark the mutant as surviving without doing any further work.
Setting up pitest for your project is straightforward. IDE plugins have been built for Eclipse and IntelliJ IDEA, but personally I prefer to add mutations from the command line using the build script. Some very useful features of pitest are accessible only in this way, as you’ll see in a moment.
I normally use Maven as my build tool, but pitest plugins also exist for Gradle and Ant.
Setting up pitest for Maven is straightforward. I usually bind pitest to the test phase using a profile named pitest. Then pitest can be run by activating the profile with -P, as shown here:
mvn -Ppitest test
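For reference, here is a minimal sketch of what such a profile might look like, binding the pitest-maven plugin's mutationCoverage goal to the test phase. The version is a placeholder you should replace with a current release:

<profile>
  <id>pitest</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.pitest</groupId>
        <artifactId>pitest-maven</artifactId>
        <version><!-- current pitest version --></version>
        <executions>
          <execution>
            <phase>test</phase>
            <goals>
              <goal>mutationCoverage</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</profile>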
As an example, I’ve created a fork of the Google assertion library Truth on GitHub, and I added pitest to the build. You can see the relevant section of the project object model (POM) file here.
Let’s go through it step by step.
<threads>2</threads> tells pitest to use two threads when performing mutation testing. Mutation testing usually scales well, so if you have more than two cores, it is worth increasing the number of threads.
<timestampedReports>false</timestampedReports> tells pitest to generate its reports in a fixed location.
<mutators><value>STRONGER</value></mutators> tells pitest to use a larger set of mutation operators than the default. This section is commented out in the POM file at the moment. I'll enable it a little later on. If you're just starting out with mutation testing on your own project, I suggest you also stick with the defaults at first.
The pitest Maven plugin assumes that your project follows the common convention of having a group ID that matches your package structure; that is, if your code lives in packages named com.mycompany.myproject, it expects the group ID to be com.mycompany.myproject. If this is not the case, you might get an error message such as the following when you run pitest:
No mutations found. This probably means there is an issue with either the supplied classpath or filters.
Google Truth’s group name doesn’t match the package structure, so I added this section:
<targetClasses>
  <param>com.google.common.truth*</param>
</targetClasses>
Note the * at the end of the package name.
Pitest works at the bytecode level and is configured by supplying globs that are matched against the names of the loaded classes, not by specifying the paths to source files. This is a common point of confusion for people using it for the first time.
Another common problem when setting up pitest for the first time is this message:

All tests did not pass without mutation when calculating line coverage. Mutation testing requires a green suite.
This message can occur if you have a failing test. It's not possible to perform a mutation test when you have a failing test, because doing that would mistakenly appear to kill any mutants that it covered. Sometimes you'll also get the message when all the tests pass when run normally with mvn test. If this happens, there are a few possible causes.
Pitest tries to parse the configuration of the Surefire test runner plugin and convert it into options that pitest understands. (Surefire is the plugin that Maven uses by default to run unit tests. Often no configuration is required, but sometimes tests need some special configuration in order to work, which must be supplied in the pom.xml file.)
Unfortunately, pitest can’t yet convert all the possible types of Surefire configuration. If your tests rely on system properties or command-line arguments being set, you need to specify them again in the pitest configuration.
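For example, if your Surefire configuration passes a system property to the tests, you may need to repeat it for pitest. A minimal sketch, using the pitest Maven plugin's jvmArgs configuration element; the config.location property here is a hypothetical example:

<configuration>
  <jvmArgs>
    <!-- repeat any -D properties your tests rely on -->
    <value>-Dconfig.location=src/test/resources</value>
  </jvmArgs>
</configuration>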
Another problem that’s more difficult to spot is order dependencies in the tests. Pitest runs your tests many times in many different sequences, but you might have a test that fails if certain other tests run before it.
For example, if you have a test called FooTest that sets a static variable in a class to false, and you have another test called BarTest that assumes that the variable is set to true, BarTest will pass if it is run before FooTest but fail if it is run afterward. By default, Surefire runs tests in a random but fixed order. The order changes when a new test is added, but you might never have run the tests in an order that reveals the dependency. When pitest runs the tests, the order it uses might reveal the order dependency for the first time.
Test-order dependencies are very hard to spot. To avoid them, you can make tests defensively set shared state on which they depend to the right value when they start, and make them clean up after themselves when they finish. But by far the best approach is to avoid having shared mutable state in your program in the first place.
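As an illustration, a defensive version of BarTest might look like the following sketch. The Settings class and its static flag are hypothetical stand-ins for whatever shared state your tests depend on:

import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertTrue;

public class BarTest {

  // Defensively reset the shared static state this test depends on,
  // so the outcome does not depend on which tests ran earlier.
  @Before
  public void resetSharedState() {
    Settings.setFlag(true); // hypothetical shared static flag
  }

  @Test
  public void shouldPassWhenFlagIsTrue() {
    assertTrue(Settings.isFlag());
  }
}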
Finally, the setup for using the Google Truth library includes this section:
<excludedClasses>
  <param>*AutoValue_Expect_ExpectationFailure</param>
</excludedClasses>
This configuration prevents all classes whose names end in AutoValue_Expect_ExpectationFailure from having mutations seeded into them. These classes are autogenerated by the Google Truth build script. There is no value in performing mutation testing on them, and any mutations that are created would be difficult to understand because you do not have the source code.
Pitest also provides other ways to exclude code from being mutation-tested. Details can be found on the pitest website.
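For example, calls to logging frameworks are usually not worth mutating, and pitest's avoidCallsTo parameter can keep them out of the analysis. A sketch of the configuration; check the pitest website for the exact element names supported by your version:

<avoidCallsTo>
  <avoidCallsTo>java.util.logging</avoidCallsTo>
  <avoidCallsTo>org.slf4j</avoidCallsTo>
</avoidCallsTo>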
Let’s do a sample run and look at the result it generates. To begin, check out the source code for the Google Truth library, and run pitest using Maven:
mvn -Ppitest test
It should take about 60 seconds once Maven has finished downloading the dependencies. After the run, you'll find an HTML report in the target/pitReports directory. For the Truth project, you'll find the report under core/target/pitReports.
The pitest report looks very similar to the reports that standard coverage tools produce, but it contains some extra information. Each package is listed with its overall line coverage and its mutation score shown side by side.
You can drill down into each source file to get a report such as the one shown in Figure 1.
Line coverage is shown as blocks of color that span the width of the page. Green indicates that a line is executed by tests, and red indicates that it is not.
The number of mutations created on each line is shown between the line number and the code. If you hover over the number, you’ll get a description of the mutations that were created and their status. If all the mutants were killed, the code is shown in darker green. If one or more of them survived, the code will be highlighted in red.
There’s additional useful information at the bottom of the report: a list of all the tests that were used to challenge the mutants in this file and how long each of them took to run. Above this is a list of all the mutations. If you hover over them, you’ll see the name of the test that killed the mutant.
Google Truth was developed without using pitest or any other mutation testing tool and, on the whole, the team that developed it did a very good job. A mutation testing score of 88 percent is not easy to achieve. But still, there are holes.
The most interesting mutants are the ones that appear on the green lines that indicate they were covered by tests. If a mutant was not covered by a test, it is not surprising that it survived and does not give any additional information compared to line coverage. But if a mutant was covered, you have something to investigate.
For example, take a look at line 73 of PrimitiveIntArraySubject.java. Pitest created a mutant that has the following description:
removed call to com/google/common/truth/PrimitiveIntArraySubject::failWithRawMessage
What this tells you is that pitest commented out the line of code that called this method.
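In other words, the mutant behaves as if the source read something like this sketch (pitest actually mutates the bytecode, so this is source-level shorthand):

if (actual == expected) {
  // call to failWithRawMessage removed by the mutation;
  // execution now falls through without throwing
}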
As the name suggests, the purpose of failWithRawMessage is to throw a RuntimeException. Google Truth is an assertion library, so one of the core things that it does is throw an AssertionError when a condition is not met.
Let’s take a look at the tests that cover this class. The following test looks like it is intended to test this functionality.
@Test
public void isNotEqualTo_FailSame() {
  try {
    int[] same = array(2, 3);
    assertThat(same).isNotEqualTo(same);
  } catch (AssertionError e) {
    assertThat(e).hasMessage(
        "<(int[]) [2, 3]>" + "unexpectedly equal to [2, 3].");
  }
}
Can you spot the mistake? It is a classic testing bug: the test checks the content of the assertion message but, if no exception is thrown, the test passes. Tests following this pattern normally include a call to fail(). Because the exception the Truth team expected is itself an AssertionError, the pattern they followed in other tests is to throw an Error.
@Test
public void isNotEqualTo_FailSame() {
  try {
    int[] same = array(2, 3);
    assertThat(same).isNotEqualTo(same);
    throw new Error("Expected to throw");
  } catch (AssertionError e) {
    assertThat(e).hasMessage(
        "<(int[]) [2, 3]>" + "unexpectedly equal to [2, 3].");
  }
}
If this throw is added to the test, the mutant is killed.
What else can pitest find? There is a similar problem on line 121 of PrimitiveDoubleArraySubject.java. Again, pitest has removed a call to failWithRawMessage.
However, if you take a look at the test, it does throw an Error when no exception is thrown. So what's going on? This is an equivalent mutant. Let's examine this category of mutants a bit more.
Equivalent mutants are the other problem identified by the academic research that I referred to in the introduction.
Sometimes, if you make a change to some code, you don’t actually change the behavior at all. The changed code is logically equivalent to the original code. In such cases, it is not possible to write a test that will fail for the mutant that doesn’t also fail for the unmutated code. Unfortunately, it is impossible to automatically determine whether a surviving mutant is an equivalent mutant or just lacks an effective test case. This situation requires a human to examine the code. And that can take some time.
There is some research that suggests it takes about 15 minutes on average to determine if a mutation is equivalent. So if you apply mutation testing at the end of a project and have hundreds of surviving mutants, you might need to spend days assessing the surviving ones to see whether they were equivalent.
This was seen as a major problem that must be overcome before mutation testing could be used in practice. However, much of the early research into mutation testing had an unstated built-in assumption. It assumed that mutation testing would be applied at the end of a development process as some sort of separate QA process. Modern development doesn’t work like that.
The experience of people using pitest is that equivalent mutants are not a major problem. In fact, they can sometimes be helpful.
The most effective time to perform mutation tests on your code is when you write the code. If you do this, you will need to assess only a small number of surviving mutants at any one time, but, more importantly, you will be in a position to act on them. Assessing each surviving mutant takes far less than the suggested average of 15 minutes, because the code and the tests are fresh in your mind.
When a mutant in code you have just written survives, this will prompt you to do one of three things: improve or add a test so that the mutant is killed, remove code that turns out not to be needed, or determine that the mutant is equivalent to the original code.
Line 121 of PrimitiveDoubleArraySubject.java, which you just examined, is an example of this last category. Let's take a look at the full method.
public void isNotEqualTo(Object expectedArray, double tolerance) {
  double[] actual = getSubject();
  try {
    double[] expected = (double[]) expectedArray;
    if (actual == expected) {
      // the mutation is to the line below
      failWithRawMessage(
          "%s unexpectedly equal to %s.",
          getDisplaySubject(), Doubles.asList(expected));
    }
    if (expected.length != actual.length) {
      return; // Unequal-length arrays are not equal.
    }
    List<Integer> unequalIndices = new ArrayList<>();
    for (int i = 0; i < expected.length; i++) {
      if (!MathUtil.equals(actual[i], expected[i], tolerance)) {
        unequalIndices.add(i);
      }
    }
    if (unequalIndices.isEmpty()) {
      failWithRawMessage(
          "%s unexpectedly equal to %s.",
          getDisplaySubject(), Doubles.asList(expected));
    }
  } catch (ClassCastException ignored) {
    // Unequal since they are of different types.
  }
}
Pitest has mutated a method call that is conditionally executed after comparing two arrays with the == operator.
If the code does not throw an exception at this point, it will move on and perform a deep comparison of the arrays. If they are not equal, the code throws exactly the same exception as if the == had returned true.
So, this is a mutation in code that exists solely for performance reasons. Its purpose is to avoid performing a more expensive deep comparison. A large number of equivalent mutants fall into this category; the code is needed but relates to a concern that is not testable via unit tests.
The first question this raises is whether the behavior of this method should be the same when given the same array as it is when given two different arrays with the same contents.
My view is that it should not. If I am using an assertion library and I tell it that I expect two arrays not to be equal, and then I pass it the same array twice, I would find it useful for the message to tell me this, perhaps by adding “(in fact, it is the same array)” to the end of the failure message.
But perhaps I am wrong. Maybe the behavior is better the way it is. If the behavior remains the same, what can be done to make the equivalent mutation go away?
I don't like the isNotEqualTo method. It has two responsibilities. It is responsible for comparing arrays for equality and it is responsible for throwing exceptions when passed two equal arrays.
What happens if those two concerns are separated into different methods by doing something like this?
public void isNotEqualTo(Object expectedArray, double tolerance) {
  double[] actual = getSubject();
  try {
    double[] expected = (double[]) expectedArray;
    if (areEqual(actual, expected, tolerance)) {
      failWithRawMessage(
          "%s unexpectedly equal to %s.",
          getDisplaySubject(), Doubles.asList(expected));
    }
  } catch (ClassCastException ignored) {
    // Unequal since they are of different types.
  }
}

private boolean areEqual(double[] actual, double[] expected, double tolerance) {
  if (actual == expected) return true;
  if (expected.length != actual.length) return false;
  return compareArrayContents(actual, expected, tolerance);
}
Now, the equivalent mutant goes away. The mutant has prompted me to refactor the code into something cleaner. What is more, I can now also use the new areEqual method to remove duplicate logic elsewhere in this class, thereby reducing the amount of code.
Unfortunately, not all equivalent mutants can be removed by re-expressing the code. If I uncomment the section of the configuration that enables pitest's stronger set of mutation operators and rerun the test, I'll get a mutant in the new areEqual method:
removed conditional - replaced equality check with false
Pitest has changed the method to this:
private boolean areEqual(double[] actual, double[] expected, double tolerance) {
  if (false) return true; // mutated
  if (expected.length != actual.length) return false;
  return compareArrayContents(actual, expected, tolerance);
}
I can’t refactor the equivalent mutant away without losing the performance optimization.
So not all equivalent mutants are helpful, but they are less common than the research suggests.
Pitest is designed to make equivalent mutants as unlikely as possible: using the default set of operators, many teams never encounter one. How many you see depends on the type of code you are writing and your coding style.
None of the example projects I’ve talked about so far has been huge. Is it possible to use mutation testing on a really big project? Yes.
As I have discussed, by far the most effective way to use mutation testing is to run tests as you are developing code. When you use it in this way, project size doesn’t matter. For a project such as Truth, it is simplest to mutate the entire project each time, but you don’t need to do this.
The only code you need to perform mutation testing on is code that you’ve just written or changed. Even if your codebase contains millions of lines of code, it is unlikely that your code change will affect more than a handful of classes.
Pitest makes it easy to work in this way by integrating with version control systems. This functionality is currently available only when using the Maven plugin.
If you have correctly configured the standard Maven version control information in your POM file, you can analyze just your locally modified code using pitest's scmMutationCoverage goal.
This goal has been bound to the profile pitest-local in the Google Truth POM:
mvn -Ppitest-local test
If you haven’t made any changes to the checked-out code, this goal will run the tests and then stop. If you have made changes, it will analyze only the changed files. Make a change now to try it out.
This approach gives you just the information you need: Is the code you’re currently working on well tested? Pitest can also be set up so that a continuous integration (CI) server can analyze just the last commit.
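The exact mechanism is version dependent; the pitest Maven plugin documents an analyseLastCommit property for its SCM goal, so a CI job might run something like the following (a sketch, assuming the pitest-local profile shown above):

mvn -DanalyseLastCommit -Ppitest-local test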
But what if you want a complete picture of how good the tests are for a whole project?
Eventually, you will hit a limit on the size of project you can mutation test unless you are willing to wait many hours, but pitest does provide an experimental option to push that limit further.
Go back to the Google Truth project and run it with the following:
mvn -DwithHistory -Ppitest test
Nothing will seem very different from when you ran it before. If you run that command again, however, it should finish in just a few seconds.
The withHistory flag tells pitest to store information about each run and use it to optimize the next run. If, for example, a class and the tests that cover it have not changed, there is no need to rerun the analysis on that class. There are many similar optimizations that can be performed using the run history.
This functionality is still in the early stages, but if it is used from the start of a project it should enable an entire codebase to be analyzed no matter how large it grows.
I hope I’ve convinced you that mutation testing is a powerful and practical technique. It helps build strong test suites and helps you to write cleaner code.
But I want to finish with a word of warning. Mutation testing does not guarantee that you will have good tests. It can only guarantee that you have strong tests. By strong, I mean tests that fail when important behavior of the code changes. But this is only half the picture. It is equally important that a test not fail when the behavior is left the same but details of the implementation are changed.
This article was originally published in Java Magazine November/December 2016.
Henry Coles (@0hjc) is a software engineer based in Edinburgh, Scotland, where he runs the local JUG. He has been writing software professionally for almost 20 years, most of it in Java. Coles has produced many open source tools including pitest and an open source book, Java for Small Teams.