On The Pesticide Paradox for Software Testing


Lately I have been focused on evolving TestLess so that it can execute concurrently on multiple cores, thus making it super fast (current lab results indicate a 120x speedup!).

Naturally, this evolution has introduced new concurrency bugs which have been detected by the existing tests. Once new bugs are detected, they are fixed for that iteration and development continues. Of course, I have also created some new tests for new areas of functionality.

But what if I had followed the common advice regarding the phenomenon called the pesticide paradox? In other words, what if I had “reworked”, “refreshed” or removed those existing tests prior to evolving the code for TestLess?

For now, I will hold off on answering that question (don’t worry, I will revisit it below).

First, let me answer two questions you might be asking right now: What is the pesticide paradox? And what does it have to do with software testing?

Let me briefly explain.

The pesticide paradox is an informal explanation for the phenomenon that “old tests will most likely not uncover new bugs.” The analogy is to organisms developing resistance (adapting) to pesticides, thereby rendering the pesticides (in software, the tests) ineffective.

In other words, the pesticide paradox is typically invoked to argue that existing tests that no longer detect new bugs need to be updated, refreshed, or even removed because they no longer serve any purpose.

It sounds clever on the surface, but let’s think about this.

When creating tests of any kind, there are two very important parts to consider:

  1. The purpose or intent of the test: that is, what is the goal of the test?
  2. The oracle(s) of the test: that is, how will the test decide that the behavior or results of the SUT (System Under Test) are correct?

Let’s briefly talk about each of these parts. We’ll start with test oracles.

A test oracle is anything (programs, files, output, data, heuristics, et cetera) that is used as a trusted reference to decide whether or not the SUT is behaving correctly and producing correct results.
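For instance, one very common kind of oracle is simply another program we already trust. Here is a minimal sketch of that idea (in Python, with a hypothetical fast_sort standing in for the SUT) where the language’s built-in sorted acts as the reference oracle:

    import random
    import unittest

    def fast_sort(items):
        """Hypothetical SUT: some clever sorting routine we want to test."""
        return sorted(items)  # stand-in body, purely for illustration

    class FastSortOracleTest(unittest.TestCase):
        def test_matches_reference_implementation(self):
            # The trusted built-in sorted() is the test oracle: it decides
            # what a correct result looks like for any generated input.
            for _ in range(100):
                data = [random.randint(-1000, 1000) for _ in range(50)]
                self.assertEqual(fast_sort(data), sorted(data))

    if __name__ == "__main__":
        unittest.main()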

As you may already know from experience, however, creating test oracles is often the most difficult and labor-intensive part of creating tests. Test oracles are so hard to create, in fact, that sometimes we have to make do with partial test oracles or even no test oracle at all!

Partial test oracles can tell us that the behavior or results of the SUT are incorrect, but not exactly what is incorrect about them (those of you who are computer scientists can think of partial test oracles roughly as recognizers, while total test oracles can be roughly regarded as deciders).

For example: have you ever had a moment when you saw some output from a SUT and you knew it wasn’t quite right, but you didn’t know for sure why or what was wrong about it? In this case, you are (or your knowledge of the domain is) the partial test oracle because you were able to detect incorrect results, but not determine why they were incorrect.
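To make the idea concrete, here is a minimal sketch of a partial test oracle (again in Python, with a hypothetical my_sqrt as the SUT). Notice that the test never states the exact expected value; it can only recognize certain wrong answers by checking properties any correct answer must satisfy:

    import math
    import unittest

    def my_sqrt(x):
        """Hypothetical SUT: a square-root routine we want to test."""
        return math.exp(0.5 * math.log(x)) if x > 0 else 0.0

    class SqrtPartialOracleTest(unittest.TestCase):
        def test_result_satisfies_square_root_properties(self):
            for x in [0.0, 1.0, 2.0, 10.0, 12345.678]:
                r = my_sqrt(x)
                # Partial oracle: we can flag a wrong result without ever
                # knowing the one exact value it should have been.
                self.assertGreaterEqual(r, 0.0)
                self.assertAlmostEqual(r * r, x, delta=1e-6 * max(1.0, x))

    if __name__ == "__main__":
        unittest.main()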

Given how much work is required to create test oracles, we can start to see a big problem with the common advice that stems from the usual understanding of the pesticide paradox.

Now let’s consider the other part of test creation: the intent of your tests.

The goal or intent of a test may emerge intuitively while still requiring some amount of work to pin down in detail. For example, the goal of a test might be to “ensure that data is transmitted correctly even when very low resources are available”. Several iterations might then be needed to refine the goals and/or split them across different tests.
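One lightweight way to pin that intent down is to state it directly in the test itself. Here is a minimal sketch along those lines (the transmit function, its buffer_size parameter, and the 4-byte limit are all hypothetical, purely for illustration):

    import unittest

    def transmit(payload, buffer_size):
        """Hypothetical SUT: sends payload in chunks no larger than
        buffer_size and returns the bytes as received on the other end."""
        chunks = [payload[i:i + buffer_size]
                  for i in range(0, len(payload), buffer_size)]
        return b"".join(chunks)

    class LowResourceTransmissionTest(unittest.TestCase):
        def test_data_transmitted_intact_with_tiny_buffer(self):
            """Intent: ensure data is transmitted correctly even when very
            low resources (here, a 4-byte transmit buffer) are available."""
            payload = b"the quick brown fox jumps over the lazy dog"
            self.assertEqual(transmit(payload, buffer_size=4), payload)

    if __name__ == "__main__":
        unittest.main()

The test name and its docstring carry the intent, so when the test later fails for a reason nobody anticipated, it is still clear what it was protecting.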

Given these two important parts, here is the key point to consider: if creating tests requires a non-trivial amount of work to specify test oracles and test intent, then it follows that modifying our tests in any way also requires work to modify our test oracles (and perhaps the intent of our tests, which directly impacts their coverage of the SUT).

What is worse is that throwing away our existing tests literally means throwing away hard work that produced the test oracles currently at our disposal! Not to mention that the SUT might no longer have any test providing coverage for that specific area.

To me, this signals a big disconnect between what we think we know about the root cause of the pesticide paradox in software testing and the solution commonly proposed to mitigate it.

Let me now go back to the situation I mentioned above with TestLess and its tests: what if I had “reworked”, “refreshed” or removed those existing tests prior to evolving the code?

The answer is simple: I would have most likely missed the new concurrency bugs without ever knowing they were there!

 

Re-thinking the Pesticide Paradox

Don’t get me wrong. There is value in reworking and refreshing your tests, but only when there is good evidence indicating the need to do so. So while the defect detection rate of existing tests may seem to diminish drastically once the bugs they detected are fixed, those tests should not be removed simply because they appear to have lost their effectiveness.

Quite the contrary, we need to recognize that old tests — whether manual or automated — are not only extremely valuable for regression detection, but they are equally as valuable during the evolution of software systems when adding features or fixing bugs.

I talked about this in a previous post specifically regarding test automation, but let me reiterate here the (correct) argument that test automation has many benefits beyond running a specific test for a specific situation over and over again.

In particular, the coupling effect tells us that complex bugs tend to be coupled to simpler ones: many of the new bugs we see are really interactions of smaller bugs that we or our colleagues introduced during the evolution of the software. (The related competent programmer hypothesis says that programmers write code that is close to correct, so most of the faults they introduce are small, simple ones.)

Automation is therefore great at catching these new bugs: bugs whose symptoms result from the coupling of other bugs, and whose root cause is nowhere near the lines of code targeted by the test itself. This has certainly been my experience, and it reinforces the safety-net quality of automated tests.

In other words, having piles of automated or manual tests not only allows us to detect such bugs as regressions in our code, but also to detect many other (new) bugs that we did not plan for in the original intent of our tests. It’s a win-win!

So, no. I do not “refresh” or “rework” my tests if there is no need to do so.

This, then, leaves me with the last option — removing tests — which brings us back to the intent of tests.

The only case where I have actually removed tests is when they are redundant.

Why? Because there is research indicating that redundant tests not only add no value, but also appear to be linked to worse outcomes: SUTs with highly redundant test suites had more bugs than applications with little or no test redundancy.

Moreover, the research shows that the average test suite has 25% duplicate tests, which is extremely high.

We can therefore begin to see that the intent of tests might be strongly related to the source of the pesticide paradox.

That is, the phenomenon of the pesticide paradox for software might not be at all related to “tests suddenly losing effectiveness” — because they don’t; they are as useful as before — but rather to their coverage of the SUT and what they actually focus on.

According to the research, it seems that the negative effect of redundant tests on the quality of your software is the same whether you create them intentionally (in the hopes of “extending your safety net”) or accidentally (because you did not know that such tests already existed).

To draw an analogy: creating redundant tests is like trying to avoid inflating your car’s tires in the future by over-inflating them now. The drawback is that over-inflated tires are prone to blowouts, which will end up costing you more money and time (not to mention that over-inflated tires also wear out faster and more unevenly).

This analogy, along with these findings, is what I presented at Mountain West Ruby Conf recently, and I encourage you to watch the talk.

It is not hard to see how creating too many tests with the same or similar intent might be detrimental to the quality of the SUT. For example, if your coworker first told you she had a test suite with 1000 tests, but then told you that all those tests are exactly the same, would you see the benefit in keeping that test suite unchanged?

Of course not. You would cut 999 of the tests and (if necessary) loop the single remaining test 1000 times, because 1000 redundant tests afford no additional value. In fact, such a test suite creates high technical debt: whenever you modify any one of those tests, you now have to modify the other 999 as well.
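As a back-of-the-envelope illustration of how you might start spotting such duplication yourself, here is a toy sketch in Python (this is not how TestLess works; it only groups tests whose bodies are textually identical, which is the crudest possible notion of redundancy):

    import hashlib
    import inspect
    from collections import defaultdict

    def redundancy_groups(test_functions):
        """Group test functions whose bodies (ignoring the name and
        whitespace) are textually identical."""
        groups = defaultdict(list)
        for fn in test_functions:
            source = inspect.getsource(fn)
            body = source.split("\n", 1)[1]       # drop the `def` line
            normalized = " ".join(body.split())   # collapse whitespace
            fingerprint = hashlib.sha256(normalized.encode()).hexdigest()
            groups[fingerprint].append(fn.__name__)
        return [names for names in groups.values() if len(names) > 1]

    # Example usage with three trivial tests, two of which are duplicates:
    def test_login_ok():
        assert 1 + 1 == 2

    def test_login_ok_again():
        assert 1 + 1 == 2

    def test_logout_ok():
        assert 2 - 1 == 1

    print(redundancy_groups([test_login_ok, test_login_ok_again, test_logout_ok]))
    # -> [['test_login_ok', 'test_login_ok_again']]

Real redundancy (the same intent and the same coverage expressed in different code) is much harder to detect than this, which is exactly the problem a dedicated tool has to tackle.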

More important, however, is the following point: since redundant tests (such as the 1000 repeated tests in the example above) indicate that some areas of your SUT are being tested excessively, the amount of test redundancy is also an indirect indication that other areas are not being tested as much as they should be (if they are tested at all).

This might explain why SUTs with highly redundant tests are buggier than those with low test redundancy. Further, it might also explain why the tests for those buggy SUTs seem to have lost effectiveness: they did a poor job of covering the SUT to begin with!

And that is why test redundancy might, in the end, have a stronger link to the pesticide paradox than any inherent loss of effectiveness in the tests themselves. This, in turn, produces lower-quality software than that of applications with little or no test redundancy.

Therefore, based on the research on test redundancy and the simple examples above, we can start to see how the pesticide paradox might come to be.

This is exactly why I created TestLess: to help people like you detect and manage redundant tests so you can re-focus your tests correctly, raise their quality, and thereby increase the quality of your SUT. In doing so, you will be mitigating the pesticide paradox and reducing the number of bugs that get past your tests.

I challenge you to start thinking a bit differently about your tests and to ask yourself this question: How redundant are my tests?

I am sure you will be surprised at the answer.

 

 

Sources:

– http://www.softwaretestingclub.com/profiles/blogs/defect-clustering-pesticide-paradox

– http://www.linkedin.com/groups/What-is-pesticides-paradox-software-86204.S.92761182

– https://www.ece.cmu.edu/~koopman/des_s99/sw_testing/

– http://en.wikipedia.org/wiki/Paradox_of_the_pesticides

– http://blogs.msdn.com/b/nihitk/archive/2004/07/16/185836.aspx

– http://www.ics.uci.edu/~redmiles/ics221-FQ03/papers/Wey82.pdf

 

 

