How well is it tested: Puppet (part 2 of 2)


In the last post, I showed the results of analyzing Puppet’s test duplication, along with other interesting findings. The relationships discovered suggest that the duplication is not due to chance alone, and that its relationship with the number of tests in a suite is complex and not entirely linear. The results also hinted at a link between test duplication and the quality of the tests.

In this post, I will show how Puppet’s tests measure up in terms of quality based on their mutation scores. I will also show how test duplication impacts their quality. Finally, I will discuss interesting relationships and patterns in the data.

 

 

Preliminaries on Mutation Analysis

Mutation analysis is a way to test your tests. It uses fault injection to check whether your tests notice deliberately introduced defects and, in doing so, uncovers their gaps and holes.

To be more precise, the faults are injected into the SUT (system under test), not into the tests, and then the tests are run against each faulty version. Because mutation analysis is based on fault injection, it rates the quality of tests far more accurately than code coverage, which is only a passive, point-in-time metric that can change from day to day as the code changes. See this video explaining the basic ideas of mutation analysis in less than 5 minutes.

Mutation analysis also subsumes code coverage [1, 2, 3, 4], which makes it a far better metric: your tests might reach 100% code coverage (leading you to falsely believe they are excellent), while mutation analysis will give you a more realistic rating and, therefore, far more opportunities to improve your tests before you ship buggy software. Additionally, mutation analysis can rate your tests in terms of regression detection, whereas code coverage simply cannot.
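To make the idea concrete, here is a toy sketch in Python (my own illustration; it is neither Puppet code nor a peek at Mutator’s internals). A single first-order mutant flips one operator, a weak test fails to kill it, and a stronger test kills it:

```python
# Toy example: one first-order mutant of a tiny function, plus a weak test
# that cannot kill it and a stronger test that can.

def max_of(a, b):
    return a if a > b else b          # original code under test

def max_of_mutant(a, b):
    return a if a < b else b          # mutant: '>' flipped to '<'

def weak_test(impl):
    # Only exercises equal inputs, where the original and the mutant agree,
    # so the mutant survives and the mutation score suffers.
    assert impl(2, 2) == 2

def strong_test(impl):
    # Distinguishes the two versions: the mutant returns 1 here and is killed.
    assert impl(3, 1) == 3

if __name__ == "__main__":
    weak_test(max_of)                 # passes
    weak_test(max_of_mutant)          # also passes -> mutant survives
    strong_test(max_of)               # passes
    # strong_test(max_of_mutant)      # would raise AssertionError -> mutant killed
```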

For these and other reasons that are explained more fully in this video, the quality of Puppet’s tests was rated using Ortask Mutator, the world’s leading cross-language mutation analyzer.

 

 

Methodology

The general methodology is explained in the last post. There, however, I did not explain how the “before” and “after” mutation scores were evaluated, because I had not yet covered Puppet’s mutation analysis results.

The chart below shows the density functions of the mutation scores before and after removing the duplicate tests detected by TestLess.

 

Mutation scores before and after removing duplicate tests

 

You can see that the two distributions are practically identical, with a few minor differences that do not appear to be meaningful. The Vargha-Delaney Â measure corroborates this: its value of 0.505 is almost exactly 0.5, meaning that a suite’s mutation score after removing the duplicate tests is about as likely to be higher as it is to be lower than before, i.e. the effect of removing the duplicates is negligible.
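For the curious, the measure itself is easy to compute; below is a minimal Python sketch of it (my own helper, not part of Mutator), fed with made-up scores:

```python
# Vargha-Delaney A measure for two samples of mutation scores.
# A = P(X > Y) + 0.5 * P(X == Y); values near 0.5 mean a negligible difference.
def vargha_delaney_a(before, after):
    m, n = len(before), len(after)
    greater = sum(1 for x in before for y in after if x > y)
    ties = sum(1 for x in before for y in after if x == y)
    return (greater + 0.5 * ties) / (m * n)

# Made-up scores purely for illustration.
print(vargha_delaney_a([0.40, 0.55, 0.70], [0.42, 0.53, 0.69]))
```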

For test suites that actually exercise multiple sources, I used Mutator’s new multi-source feature to inject first-order mutations into the sources in a round-robin fashion.
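The scheduling idea can be sketched as follows (my own illustration, not Mutator’s actual implementation): mutation sites from the different sources are visited alternately, and each run applies exactly one mutation, i.e. one first-order mutant, before the suite is executed against it.

```python
from itertools import cycle

def round_robin_mutants(sites_per_source):
    """sites_per_source: {source_file: [mutation_site, ...]}
    Yields (source, site) pairs, alternating between sources so that
    each run mutates exactly one site in one source."""
    queues = {src: list(sites) for src, sites in sites_per_source.items()}
    for src in cycle(list(queues)):
        if not any(queues.values()):
            break
        if queues[src]:
            yield src, queues[src].pop(0)

# Hypothetical file names and site labels, purely for illustration.
sites = {"lib/a.rb": ["a1", "a2"], "lib/b.rb": ["b1"]}
print(list(round_robin_mutants(sites)))
# [('lib/a.rb', 'a1'), ('lib/b.rb', 'b1'), ('lib/a.rb', 'a2')]
```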

 

 

Mutation Analysis on Puppet

As in the last post, the analysis targeted Puppet version 4.4.0. For mutation analysis, 306 suites were initially selected (46.6% of the 656 total test suites). The sample was drawn as randomly as possible in order to capture variation in suite size, test duplication, and the code areas covered across the project.

However, some suites had to be removed, shrinking the sample, for the following reasons:

  • suites whose sources had no possible mutations (25 suites, or 3.8%)
  • suites that contained failures (5 suites, or 0.8%)
  • suites that were not runnable on the OS on which the analysis was performed (10 suites, or 1.5%)

Hence, the final effective sample on which the results are based contains 266 suites (40.6%). Note that this sample subsumes the one used for the test duplication analysis in the previous post, which is the reason for the difference between the “before-and-after” density chart above and the charts below showing the full mutation analysis results.

The following histogram shows the resulting mutation scores of the suites in the sample.

 

Histogram of mutation scores in sample (blue dotted line is median; red dashed is mean)

 

The probability density function shows the quality distribution of Puppet’s tests more clearly. For instance, it is easy to notice that the distribution has two bulks: one group of suites with mutation scores between 0 and 0.25, and another with scores between 0.4 and 0.75.

 

Density function of mutation scores (blue dotted line is median; red dashed is mean)

 

The following table shows the full statistics for mutation scores in the sample, including the 95% confidence interval.

 

95% confidence interval for mutation scores

 

As the confidence interval shows, we can be 95% confident that the true average mutation score of Puppet’s test suites lies between 45% and 52%. Individual suites vary far more widely, as the histogram above makes clear, but the interval pins down the average rather tightly: it is unlikely that the true mean mutation score is above 52%, and just as unlikely that it is below 45%.
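For reference, such an interval is straightforward to compute from the sample mean and standard deviation. The sketch below uses placeholder summary numbers (only n = 266 and the resulting 45%–52% interval are quoted in this post), not the exact figures from the table:

```python
import math

# 95% confidence interval for the mean mutation score, using the normal
# approximation (reasonable for n = 266). The mean and standard deviation
# here are placeholders, not the exact values behind the table above.
mean, sd, n = 0.485, 0.30, 266
half_width = 1.96 * sd / math.sqrt(n)
print(f"95% CI for the mean: [{mean - half_width:.3f}, {mean + half_width:.3f}]")
```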

As with test duplication in the previous post, it is interesting to observe how close the average and median mutation scores are to previous research results. In fact, the numbers are almost identical to those of the earlier study and indicate that Puppet’s test quality is merely average.

 

Median mutation score for Puppet is 49%

 

To be clear, the fact that Puppet’s average mutation score aligns with the average for other open-source projects does not mean that its quality can be considered good. Rather, that average mutation score is simply too low to be acceptable.

In other words, the 49% median mutation score represents poor quality because it indicates that the tests cannot detect the small, common errors that even competent programmers unknowingly introduce (the idea behind the competent programmer hypothesis). The low average mutation score therefore signals too many blind spots in the tests. Because of these gaps, faults have almost certainly slipped through undetected and surfaced as errors in the field (i.e. in production).
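As a reminder, the score being discussed is simply the fraction of injected mutants that the tests detect; a rough sketch follows (it ignores equivalent and timed-out mutants, which real analyzers such as Mutator must handle):

```python
def mutation_score(killed: int, total_mutants: int) -> float:
    """Fraction of injected mutants that at least one test detected (killed)."""
    if total_mutants == 0:
        raise ValueError("no mutants were generated")
    return killed / total_mutants

# 49 killed out of 100 injected mutants gives 0.49, roughly Puppet's median.
print(mutation_score(killed=49, total_mutants=100))
```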

Further, the low average mutation score is indicative of the inability of Puppet’s tests to catch regressions and coupled faults as Puppet evolves. In fact, only 8% of the total sample is able to detect regressions adequately, meaning that the other 92% of tests will undoubtedly allow bugs to creep in as the product evolves.

 

92% of Puppet’s tests are inadequate for regression detection

 

A further red flag is the fact that 50 test suites in the sample (19%) are completely useless, having received a mutation score of 0% (zero)!

 

19% of Puppet’s tests are totally useless

 

The point can be made that those suites achieving 0% mutation score can be completely removed without affecting Puppet at all.

In effect, running the whole test suite with these 0%-quality suites included has exactly the same benefit as running it without them, since they add no extra quality checks or assertions that would detect changes (or regressions) in the code they test.

Extending the scope to all suites in the sample that achieved a mutation score below 40% yields a whopping 107 suites (40.2%)! That so many suites achieved unacceptably low mutation scores sheds light on the nature of contributions to Puppet, which appear to focus more on features and growth than on quality and tests.

As a side note, it is interesting to ask whether the very high number of low-quality test suites might serve as further evidence of the kinds of contributions found in a study of casual contributors. That study found that working on tests (i.e. adding, removing, or improving them) was the rarest kind of contribution, with a mere 5 instances (1.3%), whereas adding new features was remarkably common, accounting for almost 19% of contributions. If this is the case for open-source software in general, then Puppet is clearly not immune to the phenomenon.

Hypothesizing aside, what can be done about this sad state of affairs for Puppet’s tests?

With respect to the test suites achieving a 0% mutation score, you already know why I do not recommend removing tests unless absolutely necessary. Instead, I would recommend that Puppet’s contributors and maintainers start conscientiously raising the quality of the existing tests, especially the 0% ones.

In other words, a suite’s quality can only go up when it’s already at 0%.

I would also recommend that Puppet contributors make it a goal to introduce a test suite for every source file, and even go as far as requiring that any new contribution include an associated test or test suite. The reason is that the test-to-source ratio in the entire population (not just the sample) is only 60%, which means that, in effect, the remaining 40% of the sources have no test suite at all!
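As a rough illustration, a ratio like this can be estimated by pairing source files with test files by name. The sketch below assumes Puppet’s conventional Ruby layout (sources under lib/, RSpec unit tests under spec/unit/) and is not the exact counting method behind the 60% figure:

```python
from pathlib import Path

def test_to_source_ratio(repo_root: str) -> float:
    """Rough estimate: fraction of source files with a *_spec.rb of the same name."""
    root = Path(repo_root)
    sources = {p.stem for p in (root / "lib").rglob("*.rb")}
    specs = {p.stem.replace("_spec", "") for p in (root / "spec" / "unit").rglob("*_spec.rb")}
    return len(sources & specs) / len(sources) if sources else 0.0

# print(test_to_source_ratio("/path/to/puppet"))
```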

 

Puppet’s test-to-source ratio is low at 60%

 

Granted, Puppet does have “integration” tests, but being less granular than unit tests, they only go so far in detecting faults and regressions that would be better caught by dedicated unit tests. Further, with only 98 integration tests in total, the amount of fault-finding they can perform is severely limited, to say the least. The integration tests, therefore, are also lacking.

 

 

Data Relationships

The chart below shows the relationship between mutation score and size of suites in the sample.

 

Mutation score vs tests in suite

 

One immediately obvious takeaway from the chart is that simply having more tests does not necessarily lead to higher-quality testing. For instance, there are suites containing more than 50 tests that are effectively useless (they received a mutation score of 0%), and one huge suite with more than 250 tests ironically achieved a very poor mutation score of 29.3%. In contrast, some suites only a small fraction of that outlier’s size achieved much better mutation scores and, therefore, much better quality.

Beyond that, there is no immediately discernible pattern in the data: the mutation scores and suite sizes are all over the place. A bit of inspection shows that about half of the test suites achieve mutation scores below 50%, while the other half score higher.

In fact, counting the suites that achieved less than a 50% mutation score yields exactly half of the sample (133 suites). The quality of the tests is therefore evenly split, but the chart does not reveal much else.

Log-scaling the suite-size axis to get a better perspective yields a bit more insight into test writing and quality.

 

Mutation score vs tests in suite, log-scaled

 

The chart shows that there is no meaningful relationship between the number of tests in a suite and their quality. Indeed, the correlation is extremely low, with a PCC (Pearson correlation coefficient) of 0.03. These results emphasize an important aspect of software development and testing.
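For completeness, the PCC is simple to compute; here is a sketch using made-up (suite size, mutation score) pairs in place of the 266 real ones from the sample:

```python
from scipy.stats import pearsonr

# Made-up data purely for illustration; the real input would be the
# per-suite sizes and mutation scores from the sample.
suite_sizes = [5, 12, 40, 55, 260, 8, 33]
mutation_scores = [0.00, 0.62, 0.47, 0.10, 0.29, 0.75, 0.51]

pcc, p_value = pearsonr(suite_sizes, mutation_scores)
print(f"PCC = {pcc:.2f} (p = {p_value:.2f})")
```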

In other words, the charts show that writing good, effective unit tests is more nuanced than simply piling on checks or assertions, as some people mistakenly seem to believe (see also here). The results and charts presented so far in this and the last post also make it clear that real care and thought must go into writing good, high-quality tests. So while writing trivial, poorly designed, and redundant tests will certainly turn testing into a waste of time, it is just as true that good tests (along with good overall engineering) definitely help produce high-quality software.

Indeed, high-quality software testing is as much a science as it is a craft, and not everybody is competent at it.

For Puppet, there are certainly many reasons behind the complex relationship between suite size and suite quality. One possible contributing cause is that test writing involves a certain amount of localized focus on the areas being tested, so not all areas of the SUT (system under test) receive the same attention.

As explained in the previous post on test duplication in Puppet, it is not hard to imagine that, whether by design or by accident, such localized focus on certain areas of a system can leave other areas under-tested or even completely untested. It is equally easy to imagine that such over-focused tests would achieve lower mutation scores than well-distributed, carefully crafted tests.

To get a sense of whether this might be the case for Puppet, I created a chart showing the relationship between mutation scores (i.e. the quality of tests) and test duplication using the data from the previous post. The chart is shown below.

 

Mutation score vs duplication of tests

 

 

To get a clearer view of the relationship, I plotted a linear regression as the next chart shows.

 

Mutation score vs test duplication (with linear regression)

 

 

The correlation between test quality and duplication is striking. The link is underscored by a PCC of -0.45, far stronger than any other relationship considered so far. In other words,

 

the correlation links higher amounts of test duplication with suites that achieve lower mutation scores (i.e. lower test quality).

 

Applying SVRs (support vector regressions), as was done for test duplication in the previous post, reveals a more nuanced reality, as the next chart shows.

 

Mutation score vs test duplication (with linear and SV regressions)

 

It is interesting to observe the dips and peaks of both the trained and untrained SVRs. For instance, there appears to be a slight gap in the data around the 10%–15% duplication range, near the bottom, where both SVRs peak slightly, and another around the 40% duplication range, where the trained SVR peaks. I conjecture that these fluctuations would smooth out with more data while the overall slope stayed the same, but more research is needed to confirm that.
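For readers who want to experiment with this kind of regression on their own data, here is a minimal scikit-learn sketch with made-up data and illustrative hyperparameters (one SVR with default settings and one hand-tuned); it is not the exact setup used to produce the charts above:

```python
import numpy as np
from sklearn.svm import SVR

# Made-up data with a noisy downward trend between duplication and score.
rng = np.random.default_rng(0)
duplication = rng.uniform(0, 0.6, 80).reshape(-1, 1)
scores = np.clip(0.6 - 0.5 * duplication.ravel() + rng.normal(0, 0.1, 80), 0, 1)

default_svr = SVR(kernel="rbf")                                  # default settings
tuned_svr = SVR(kernel="rbf", C=10.0, gamma=2.0, epsilon=0.05)   # hand-tuned settings

for name, model in [("default", default_svr), ("tuned", tuned_svr)]:
    model.fit(duplication, scores)
    print(name, model.predict([[0.10], [0.40]]))  # predicted scores at 10% and 40% duplication
```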

A further change of perspective reveals a clearer, more striking non-linear relationship between test duplication and test quality. In the next chart, both axes have been reversed and log-scaled (base 2), so the highest values now sit at the origin.

 

Test duplication vs mutation score (log-scaled and reversed)

 

Even though the relationships are not linear, it is interesting that all the regressions, as well as the chart above, agree on the general trend: they all suggest that higher amounts of duplicate tests are associated with lower mutation scores and, therefore, lower-quality test suites.

In other words, suites that achieved higher mutation scores tend to cluster around lower duplication amounts, while those that achieved lower mutation scores tend to cluster around higher duplication ratios.
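For anyone who wants to reproduce this view of their own data, here is a matplotlib (3.3 or newer) sketch with placeholder values; note that suites with 0% duplication or a 0% mutation score must be filtered out or offset before log-scaling:

```python
import matplotlib.pyplot as plt

# Placeholder per-suite values; zeros must be removed or offset before log-scaling.
dup = [0.05, 0.10, 0.20, 0.40, 0.55]
score = [0.70, 0.62, 0.48, 0.25, 0.12]

fig, ax = plt.subplots()
ax.scatter(dup, score)
ax.set_xscale("log", base=2)
ax.set_yscale("log", base=2)
ax.invert_xaxis()   # highest duplication at the origin
ax.invert_yaxis()   # highest score at the origin
ax.set_xlabel("test duplication")
ax.set_ylabel("mutation score")
plt.show()
```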

Recall that such a link between test duplication and poor test quality was also found in other open-source software projects. I assume the same link most likely exists in closed-source software, but the data for that kind of extended research is obviously not as available as it is for open-source projects.

 

 

So, how well is Puppet tested?

Given the extensive analysis results from the two posts, let’s just say that Puppet has a lot of room for improvement.

 

Poorly tested


Do you have a project that you would like to see analyzed for quality? Let me know!


[1] http://eecs.oregonstate.edu/node/4867

[2] http://confreaks.tv/videos/mwrc2014-re-thinking-regression-testing

[3] http://betterembsw.blogspot.com/2016_05_01_archive.html

[4] http://dijkstra.cs.virginia.edu/genprog/papers/weimer-gpem2013-robust.pdf