How well is it tested: Puppet (part 1 of 2)

Puppet is a “Server automation framework and application” that is used by people worldwide, including Fortune 100 companies (according to Puppet’s website), to manage part of their devops work. It is also one of the key projects that “powers Github.”

Puppet is written in Ruby and unit-tested using the BDD framework Rspec.

Given Puppet’s popularity and utility, one would expect its quality to be sky-high. The reality, however, shows that its quality, as reflected through its tests, is simply average and seems to be on par with the average quality of other open-source projects, thus supporting previous research results in this area. In other words, Puppet’s quality is: meh.

I’ll present my analysis in two posts. The first post will analyze test duplication. To a certain extent, test duplication is a smell — an indication of poor test quality. By measuring how certain areas of an application are over-tested, it also measures the degree to which other areas in an application have been left under-tested or even completely untested. The second post will explore the results of performing mutation analysis on Puppet’s tests. Mutation analysis is a measure that subsumes code coverage and more accurately represents the actual quality of tests.

Please keep in mind that my intent is in no way to negate the utility of Puppet. Indeed, the fact that many people have found Puppet useful is great! The analysis does provide, however, an objective measure of what to expect in terms of Puppet’s quality according to the quality of its tests.



Preliminaries on Test Duplication

There has been extensive research done on the benefits of reducing test redundancy for the simple goal of speeding up tests. Interestingly, except for the seminal research done by Ortask, there has not been any research focused on how redundancy impacts the quality of tests. This is partly due to the fact that almost all proposed solutions for test redundancy detection have invariably required excessive amounts of work, which prevents them from scaling freely to arbitrary projects and languages. For example, some proposed solutions require multiple phases of time-consuming processing in the form of runtime profiling, static code analysis, and execution of machine learning algorithms[1]; others require instrumenting the SUT (system under test)[1, 2, 3, 4, 5] or even instrumenting the test code itself[2]. Finally, most proposed solutions necessitate repeated execution of tests[1, 2, 3, 4, 5], which would completely undo any time savings!

Such requirements result in solutions that are excessively heavy-weight, time-consuming and convoluted. What’s worse, such solutions only produce so-so results that cannot transfer to other projects[1, 2, 3, 4, 5]. These severe limitations of such proposed solutions naturally preclude their adoption by the software industry.

Using novel algorithms and key observations, Ortask was the first to develop a super light-weight solution that does not require executing tests to perform accurate redundancy analysis. Such a light-weight, cross-language solution allowed Ortask to research the effects of test redundancy on the quality of code, linking high redundancy in tests with lower code quality.

Test duplication analysis on Puppet, therefore, was performed using Ortask TestLess — the light-weight, cross-language solution to detect test redundancy.




TestLess is great at detecting duplicate tests, but since it’s a classifier, it also produces some false hits (more on that below). To ensure that only actual duplicates were detected and removed from the suites, mutation analysis (with Mutator) was used in tandem with test redundancy analysis.

Mutator was used as follows: for each test suite, 1000 first-order mutants were created and the final mutation score was recorded. Test redundancy analysis was then performed and the detected duplicates were removed. Then 1000 more mutants were generated and the new final mutation score was recorded. Finally, the original and new mutation scores were compared: if removing tests resulted in a mutation score lower than m - 0.05, where m is the original score, then the removal was considered harmful and no tests were removed.
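The acceptance rule above can be sketched in a few lines of Python. This is only an illustration of the threshold check, not Mutator's or TestLess's actual interface: the function names and the mutant counts are hypothetical, and the mutation score is assumed to be the usual fraction of mutants killed.

```python
def mutation_score(killed: int, total: int) -> float:
    """Fraction of generated mutants that the suite detects (kills)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return killed / total


def removal_is_harmful(original: float, reduced: float, threshold: float = 0.05) -> bool:
    """True when the reduced suite scores below (original - threshold)."""
    return reduced < original - threshold


# Hypothetical counts out of 1000 mutants per run (not figures from the study).
m = mutation_score(620, 1000)              # original suite
m_small_drop = mutation_score(600, 1000)   # after removing duplicates
m_large_drop = mutation_score(540, 1000)   # after removing too much

print(removal_is_harmful(m, m_small_drop))  # False: the removal is kept
print(removal_is_harmful(m, m_large_drop))  # True: the removal is rolled back
```

A small drop is tolerated because each run draws a fresh set of 1000 mutants, so the two scores are not expected to match exactly.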

For more details about why 0.05 was chosen as the threshold, see the paper on the effects of test redundancy on the quality of code.

Now on to the results!



Test Duplication analysis on Puppet

The analysis used Puppet version 4.4.0. The total number of test suites in this version was 656, of which 220 (33.5%) were selected for test duplication analysis. The sample was extracted to contain as random a population of unit tests as possible so as to capture variations in size and code areas from the project.

The following histogram shows the size distribution of the suites in the sample.


Histogram of suite sizes in sample (blue dotted line is median; red dashed line is mean)



By the PDF (probability density function) below, we can also get a probabilistic sense of where the bulk of test suites are located in terms of size.


Density function of suite sizes (blue dotted line is median; red dashed is mean)



In this version of Puppet, the average test suite contains 19.45 tests while the median suite contains 12 tests. As the histogram shows, there are several big suites containing more than 100 tests, including a huge suite containing 256 tests.

Using the default settings, TestLess correctly identified the duplicate tests in 74% of the suites in the set, including tests that were originally identified as duplicates by the test writers themselves!


74% correctly identified suites using defaults



Here is an example of a duplicate test found by TestLess that was also marked as such by contributor Luke Kanies (github username “lak”).


Duplicate tests marked by original contributor



In effect, TestLess found that almost 3 out of every 4 suites in the sample contained some degree of duplication ranging from subtle to clear copy/paste actions — all using default (out-of-the-box) settings. The following image shows a copy/paste duplicate test for suite module_spec.rb that TestLess found.


Pure copy/paste tests found by TestLess



Given that TestLess is a classifier, some number of false positives and false negatives were expected. Such false hits required fine-tuning the detection parameters and re-analyzing. This was done on the remaining 26% of suites, bringing the total to 98% of suites correctly identified.

The histogram below shows the distribution of duplicate tests found in the complete sample after this re-analysis was performed.


Test duplication histogram (blue dotted line is median; red dashed line is mean)



The chart below shows the distribution of duplicate tests, as a density function.


Test duplication density function (blue dotted line is median; red dashed line is mean)



The histogram and the probability density function show that test suites contain 25% duplication on average, while some suites contain an excessively high number of duplicate tests. In one suite, for example, 91.67% of the tests are duplicates and, therefore, entirely redundant!

The full test duplication statistics for the sample are shown next, along with the 95% confidence interval.


95% confidence interval for duplicate tests



As the confidence interval shows, we can be 95% confident that the average test duplication for a test suite in Puppet falls between 22.5% and 27.3%. Stated differently, there is only about a 5% chance that the true average lies outside that interval, either below 22.5% or above 27.3%.
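For readers who want to reproduce this kind of interval on their own data, here is a minimal sketch. It assumes a normal approximation (z = 1.96), which is reasonable for a sample of a couple hundred suites. The sample values are hypothetical placeholders, not the study's data, and the function name is made up for illustration.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Normal-approximation confidence interval for the mean.

    Returns (low, mean, high). z=1.96 corresponds to roughly 95%.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean - z * sem, mean, mean + z * sem


# Hypothetical per-suite duplication percentages (placeholders only).
sample = [0.0, 12.5, 20.0, 25.0, 25.0, 30.0, 33.3, 40.0, 50.0, 91.67]
low, mean, high = mean_confidence_interval(sample)
print(f"mean = {mean:.1f}%, 95% CI = [{low:.1f}%, {high:.1f}%]")
```

The interval narrows as the number of suites grows, since the standard error shrinks with the square root of the sample size.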

It is interesting to observe how close the statistics are between previous research results and this study. In particular, the previous study showed that one out of every 4 tests in a suite was totally redundant. In this investigation, the test suites in Puppet showed an equivalent amount of average redundancy, in agreement with that research.


Average test duplication in Puppet is 25%



It is important to stress that test duplication is not a problem in and of itself. Sometimes, test redundancy is harmless, perhaps even desired. For instance, in the practice of treating tests as “executable documentation”, developers might decide to document several different cases (tests) to record a fix for a single bug. Another instance might arise when different edge cases are recorded in separate tests for the same functional area of the SUT.

As long as test writers intentionally and carefully keep such redundant tests in check, and as long as they understand that redundant tests do not increase the coverage of their code, then it’s okay to leave redundant tests in suites for documentation. In other words, duplicate tests might very well increase how well we (humans) can read and understand the tests, but they will never improve their quality or that of the system they test.

It is therefore important to keep in mind that by definition (as well as empirically):


Redundant tests do not increase the quality of the tests nor of the system under test


On the other hand, test redundancy can become a smell if it is not managed appropriately, and it can become a big problem if it gets out of control. One problem might arise when, for instance, developers unknowingly create redundant tests thinking that they are actually testing different parts of the code or exercising different constraints/conditions.

This appears to have happened with one of the test suites in Puppet, which contained 80% test duplication yet had a mutation score of practically 0% (more on mutation score in the next post). In other words, it seems that a single test would have sufficed to exercise the code that was exercised by all 5 tests in the suite. Indeed, code coverage[6] is practically the same in both cases (36.58% with the original suite versus 36.56% with the reduced suite).



Test Speedups

In terms of test runtimes, the simple action of removing duplicate tests results in an average speedup of 15% in the sample. The chart below shows the runtime comparison between the original set of tests and the tests without duplicates.


Runtime of test suites



One can clearly see a stark difference in terms of runtimes.

For this comparison, each set of tests was executed 500 times so that, per the central limit theorem, the measured runtimes give a reliable estimate. The probability density functions show where the runtimes cluster the most, thus forming the expected distribution of the runtimes for each set of tests.

Because the speedup is proportional to the number of duplicate tests removed from the suites, it can naturally be extrapolated to the whole set of suites in Puppet (i.e. the whole population).

For instance, running the entire set of original tests takes 396 seconds, yet running the same set of tests without duplicates would take around 60 seconds less. That might not sound like much, but compound it over every CI run: if the CI ran the suite back to back around the clock (roughly 218 runs per day), the one-minute saving would add up to around 3.6 hours gained in a single day! In other words, the entire suite of tests would be able to run about 39 more times per day than before, thus enabling faster turnaround of fixes and features thanks to the more frequent testing.
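The arithmetic behind those figures is easy to check. The sketch below uses only the two runtimes quoted above. The one assumption, which the numbers imply, is that the CI runs the full suite continuously, back to back, for 24 hours.

```python
SECONDS_PER_DAY = 24 * 60 * 60

original_runtime = 396  # seconds, from the text
saved_per_run = 60      # seconds, from the text
reduced_runtime = original_runtime - saved_per_run  # 336 seconds

# Assumption: the CI runs the suite back to back around the clock.
runs_per_day_original = SECONDS_PER_DAY / original_runtime
runs_per_day_reduced = SECONDS_PER_DAY / reduced_runtime

hours_saved = runs_per_day_original * saved_per_run / 3600
extra_runs = runs_per_day_reduced - runs_per_day_original
speedup = saved_per_run / original_runtime

print(f"{runs_per_day_original:.0f} runs/day originally")  # 218 runs/day originally
print(f"{hours_saved:.1f} hours saved per day")            # 3.6 hours saved per day
print(f"{extra_runs:.0f} extra runs per day")              # 39 extra runs per day
print(f"{speedup:.1%} shorter runtime")                    # 15.2% shorter runtime
```

A CI that runs less often saves proportionally less, so the 3.6 hours is best read as an upper bound for a single pipeline.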

Talk about an easy speedup, and all without having to re-architect Puppet!



Data Relationships

Moving on to relationships between duplication and other properties of the tests, it is interesting to note the pattern between the size of the suites with respect to their test duplication.


Test duplication vs size of suites



Observe that the data seem to follow a curiously non-linear, inverse pattern. Visual inspection suggests that smaller test suites in Puppet tend to contain higher amounts of duplicate tests. A linear regression seems to agree, but this conclusion is called into question by the fact that the Pearson correlation coefficient (PCC) is -0.22, which indicates only a weak linear relationship.
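As a reminder of what the Pearson coefficient measures, here is a small self-contained sketch. The suite sizes and duplication percentages are hypothetical placeholders chosen to mimic an inverse pattern with one large outlier. They are not the study's data, so the printed value will not be -0.22.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (suite size, % duplication) pairs, for illustration only.
sizes = [2, 4, 5, 8, 12, 20, 40, 100, 256]
duplication = [50.0, 50.0, 80.0, 37.5, 25.0, 20.0, 15.0, 12.0, 29.3]

print(round(pearson(sizes, duplication), 2))
```

A coefficient near -1 or +1 means the points sit close to a straight line. A value near 0 only rules out a linear trend, not a curved one.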

However, recall that the PCC only measures linear relationships. To get an alternative perspective, I plotted some regressions based on a machine-learning technique called support vector machines. Results of the support vector regressions (SVRs) are shown in the chart below.


Test duplication vs size of suites (with SVR)



Indeed, with the exception of the outlier and notwithstanding the low PCC, the SVRs also seem to agree with the linear regression, indicating that smaller suites tend to have higher test duplication in Puppet. The fluctuations in the SVRs, however, are a bit puzzling.

As a final attempt to clarify the data, especially in the presence of the outlier, I scaled both the x- and y-axes logarithmically using base 2. This is shown in the next chart.


Test duplication vs size of suites (log scale)



The chart appears to somewhat clarify the patterns in the data, seemingly reiterating the interpretations explained above. That is, to a certain degree, it seems that smaller test suites in Puppet tend to have higher amounts of duplicate tests.
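To see why a log-log view can make an inverse pattern easier to read, consider the sketch below. It goes one step beyond the chart by fitting a least-squares slope in log2 space, which the analysis above does not do. The data are hypothetical and deliberately idealized (duplication = 64 / size), and suites with 0% duplication are skipped because the logarithm of zero is undefined.

```python
import math


def log2_points(sizes, duplication):
    """Map (size, duplication) pairs to log2 space, skipping non-positive values."""
    return [
        (math.log2(s), math.log2(d))
        for s, d in zip(sizes, duplication)
        if s > 0 and d > 0
    ]


def slope(points):
    """Least-squares slope of y on x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


# Hypothetical, idealized data: duplication = 64 / size, plus one 0% suite.
sizes = [1, 2, 4, 8, 16, 32]
duplication = [64.0, 32.0, 16.0, 8.0, 4.0, 0.0]

points = log2_points(sizes, duplication)
print(len(points))    # 5 (the 0% suite is dropped)
print(slope(points))  # -1.0
```

A curved inverse relationship becomes a straight line with negative slope once both axes are logarithmic, which is why the pattern looks clearer in the log-scaled chart.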

These findings might be very surprising and perhaps counter-intuitive because one would assume that, having more tests, bigger suites have more “chance” of duplication. But recall that duplication is not strictly a result of chance. As I explained above, duplication might exist intentionally (say, as a result of intentionally writing duplicate tests for the sake of documenting equivalent scenarios) or as a result of the (mis)guided focus of the test writers stemming perhaps from a poor understanding of the underlying code being tested.

Going back to the chart showing the SVRs, the fluctuations in the regressions might also be signs of such a phenomenon (possibly inherent) of test writing in Puppet. In other words, the peaks and troughs of the SVR might indicate the non-stochastic nature of test writing in Puppet.

Nevertheless, more investigation is needed to determine the root cause of the test duplication pattern for Puppet with respect to suite size. Additionally, further investigation is required to see if this same pattern generalizes to other software projects.

At this point, it is also important to emphasize that simply having a “low” number of duplicate tests with respect to the suite’s size does not necessarily indicate higher quality, or that tests were written carefully and judiciously, or that no test duplication management is needed for them. In other words, the amount of test duplication in suites might simply be too high or unacceptable, even for smaller-sized suites that achieve “low” or average test duplication.

Consider the example of the huge suite containing 256 tests. Such a suite ended up having a whopping 75 (29.3%) totally redundant and unnecessary tests. One has to wonder: how much effort and energy went into creating those 75 tests that ended up simply duplicating others in the suite? How much time was wasted on those 75 tests? Minutes? Hours? Days?

Since creating tests takes a non-trivial amount of work, the same consideration must logically be applied to suites of any size. Even for suites having only a handful of tests but containing a high amount of duplicates (such as the one mentioned above with 5 tests and 80% test duplication), the test writer could have spent several minutes, hours, or maybe days trying to write those tests — tests that ultimately ended up representing a waste of time.



Do not be fooled into thinking that a “low” or average test duplication amount is okay or even acceptable


To reiterate: even when the duplication amount is considered “low” with respect to the average amount or the suite size, keep in mind that writing tests takes time and effort, and part of the energy that goes into creating the tests in the first place might be wasted on duplicates.

Stay tuned for the next post to see how duplication affected the quality of tests in Puppet, as well as more fascinating relationships.






[1] V. Vangala, J. Czerwonka, and P. Talluri, “Test Case Comparison and Clustering using Program Profiles and Static Execution,” Microsoft.

[2] M. Greiler, A. van Deursen, and A. Zaidman, “Measuring Test Case Similarity to Support Test Suite Understanding,” 2012.

[3] N. Koochakzadeh, V. Garousi, and F. Maurer, “A Tester-Assisted Methodology for Test Redundancy Detection,” 2009.

[4] N. Koochakzadeh, V. Garousi, and F. Maurer, “Test Redundancy Measurement Based on Coverage Information: Evaluations and Lessons Learned,” 2009.

[5] T. Xie, D. Marinov, and D. Notkin, “Rostra: A Framework for Detecting Redundant Object-Oriented Unit Tests,” 2004.

[6] Using simplecov.