
Defect Detection Efficiency: An Evaluation of a Research Study

Over the last several months, B.J. Rollison has been delivering presentations and writing articles and blog posts in which he cites a paper Defect Detection Efficiency: Test Case Based vs. Exploratory Testing [DDE2007], by Juha Itkonen, Mika V. Mäntylä and Casper Lassenius (First International Symposium on Empirical Software Engineering and Measurement, pp. 61-70; the paper can be found here).

I appreciate the authors’ intentions in examining the efficiency of exploratory testing.  That said, the study and the paper that describes it have some pretty serious problems.

Some Background on Exploratory Testing

It is common for people writing about exploratory testing to consider it a technique, rather than an approach. “Exploratory” and “scripted” are opposite poles on a continuum. At one pole, exploratory testing integrates test design, test execution, result interpretation, and learning in a single person at the same time.  At the other, scripted testing separates test design and test execution by time, and typically (although not always) by tester, and mediates information about the designer’s intentions by way of a document or a program. As James Bach has recently pointed out, the exploratory and scripted poles are like “hot” and “cold”.  Just as there can be warmer or cooler water, there are intermediate gradations between testing approaches. The extent to which an approach is exploratory is the extent to which the tester, rather than the script, is in immediate control of the activity.  A strongly scripted approach is one in which ideas from someone else, or ideas from some point in the past, govern the tester’s actions. Test execution can be very scripted, as when the tester is given an explicit set of steps to follow and observations to make; somewhat scripted, as when the tester is given explicit instruction but is welcome or encouraged to deviate from it; or very exploratory, as when the tester is given a mission or charter and is mandated to use whatever information and ideas are available, even those that have been discovered in the present moment.

Yet the approaches can be blended.  James points out that the distinguishing attribute of exploratory and scripted approaches is the presence or absence of loops.  The most extreme scripted testing would follow a strictly linear approach: design would be done at the beginning of the project; design would be followed by execution; tests would be performed in a prescribed order; and later cycles of testing would use exactly the same tests to check for regression.

Let’s get more realistic, though.  Consider a tester with a list of tests to perform, each using a data-focused automated script to address a particular test idea.  A tester using a highly scripted approach would run that script, observe and record the result, and move on to the next test.  A tester using a more exploratory approach would use the list as a point of departure, but upon observing an interesting result might choose to perform a different test from the next one on the list; to alter the data and re-run the test; to modify the automated script; or to abandon that list of tests in favour of another one.  That is, the tester’s actions in the moment would not be directed by earlier ideas, but would be informed by them. Scripted approaches set out the ideas in advance, and when new information arrives, there’s a longer loop between discovery and the incorporation of that new information into the testing cycle.  The more exploratory the approach, the shorter the loop.  Exploratory approaches do not preclude the use of prepared test ideas, although both James and I would argue that our craft, in general, places excessive emphasis on test cases and focusing techniques at the expense of more general heuristics and defocusing techniques.
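
To make the contrast concrete, here is a minimal, hypothetical sketch.  The discount rule, the data, and the code are my own inventions for illustration; they come from neither the study nor any particular product.

  # A stand-in for a data-focused automated check: 10% off orders of 100 or more.
  def check_discount(order_total, expected_discount):
      actual = 0.10 * order_total if order_total >= 100 else 0.0
      return abs(actual - expected_discount) < 0.005, actual

  # Scripted approach: run the prepared cases in order, record the results, move on.
  prepared_cases = [(50, 0.0), (100, 10.0), (250, 25.0)]
  for total, expected in prepared_cases:
      passed, actual = check_discount(total, expected)
      print(f"scripted: total={total} expected={expected} actual={actual} pass={passed}")

  # Exploratory approach: the same list is a point of departure.  On noticing
  # something interesting near the boundary at 100, the tester varies the data
  # in the moment instead of proceeding to the next prepared case.
  for probe in (99.99, 100.00, 100.01, 0, -100):
      _, actual = check_discount(probe, 0.0)
      print(f"exploratory probe: total={probe} -> discount={actual}")

The code itself is beside the point; what matters is the length of the loop between noticing something and acting on it.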

The point of all this is that neither exploratory testing nor scripted testing is a technique, nor a body of techniques.  They’re approaches that can be applied to any testing technique.

To be fair to the authors of [DDE2007], since publication of their paper there has been ongoing progress in the way that many people—in particular Cem Kaner, James Bach, and I—articulate these ideas, but the fundamental notions haven’t changed significantly.

Literature Review

While the authors do cite several papers on testing and test design techniques, they do not cite some of the more important and relevant publications on the exploratory side.  Examples of such literature include “Measuring the Effectiveness of Software Testers” (Kaner, 2003; slightly updated in 2006); “Software engineering metrics: What do they measure and how do we know?” (Kaner & Bond, 2004); “Inefficiency and Ineffectiveness of Software Testing: A Key Problem in Software Engineering” (Kaner, 2006; to be fair to the authors, this paper may have been published too late to inform [DDE2007]); General Functionality and Stability Test Procedure (for Microsoft Windows 2000 Application Certification) (Bach, 2000); the Satisfice Heuristic Test Strategy Model (Bach, 2000); and How To Break Software (Whittaker, 2002).

The authors of [DDE2007] appear also to have omitted literature on the subject of exploration and its role in learning. Yet there is significant material on the subject, in both popular and more academic literature.  Examples here include Collaborative Discovery in a Scientific Domain (Okada and Simon; note that the subjects are testing software); Exploring Science: The Cognition and Development of Discovery Processes (David Klahr and Herbert Simon); Plans and Situated Actions (Lucy Suchman); Play as Exploratory Learning (Mary Reilly); How to Solve It (George Polya); Simple Heuristics That Make Us Smart (Gerd Gigerenzer); Sensemaking in Organizations (Karl Weick); Cognition in the Wild (Edwin Hutchins); The Social Life of Information (Paul Duguid and John Seely Brown); The Sciences of the Artificial (Herbert Simon); all the way back to A System of Logic, Ratiocinative and Inductive (John Stuart Mill, 1843).

These omissions are reflected in the study and the analysis of the experiment, and that leads to a common problem in such studies: heuristics and other important cognitive structures in exploration are treated as mysterious and unknowable.  For example, the authors say, “For the exploratory testing sessions we cannot determine if the subjects used the same testing principles that they used for designing the documented test cases or if they explored the functionality in pure ad-hoc manner. For this reason it is safer to assume the ad-hoc manner to hold true.”  [DDE2007, p. 69]  Why assume?  At the very least, one could observe the subjects and debrief them, asking about their approaches.  In fact, this is exactly the role that the test lead fulfills in the practice of skilled exploratory testing.  And why describe the principles only as “ad-hoc”?  It’s not as though the principles can’t be articulated. I talk about oracle heuristics in this article, and about stopping heuristics here; Kaner’s Black Box Software Testing course talks about test design heuristics; James Bach‘s work talks about test strategy heuristics (especially here); James Whittaker’s books talk about heuristics for finding vulnerabilities…

Tester Experience

The study was performed using testers who were, in the main, novices.  “27 subjects had no previous experience in software engineering and 63 had no previous experience in testing. 8 subjects had one year and 4 subjects had two years testing experience. Only four subjects reported having some sort of training in software testing prior to taking the course.”  ([DDE2007], p. 65, my emphasis)  Testing—especially testing using an exploratory approach—is a complex cognitive activity.  If one were to perform a study on novice jugglers, one would likely find that they drop an approximately equal number of objects, whether they were juggling balls or knives.

Tester Training

The paper notes that “subjects were trained to use the test case design techniques before the experiment.” However, the paper does not note any specific training in heuristics or exploratory approaches.  That might not be surprising in light of the weaknesses on the exploratory side of the literature review.  My experience, that of James Bach, and anecdotal reports from our clients suggest that even a brief training session can greatly increase the effectiveness of an exploratory approach.

Cycles of Testing

Testing happens in cycles.  In a strongly scripted approach, the process tends toward the linear.  All tests are designed up front; then those tests are executed; then testing for that area is deemed to be done.  In subsequent cycles, the intention is to repeat the original tests to make sure that bugs have been fixed and to check for regression.  By contrast, exploratory testing is an organic and iterative process.  In an exploratory approach, the same area might be visited several times, such that learning from early “reconnaissance” sessions informs further exploration in subsequent “deep coverage” sessions.  The learning from those (and from ideas about bugs that have been found and fixed) informs “wrap-up” sessions, in which tests may be repeated, varied, or cut from new cloth.  The study makes no allowance for information and learning obtained during one round of testing to inform later rounds.  Yet such information and learning is typically of great value.

Quantitative vs. Qualitative Analysis

The study places a great deal of emphasis on quantifying results and on experimental and mathematical rigour.  However, such rigour may be misplaced when the products of testing are qualitative, rather than quantitative.

Finding bugs is important, finding many bugs is important, and finding important bugs is especially important. Yet bugs and bug reports are by no means the only products of testing.  The study largely ignores the other forms of information that testing may provide.

  • The tester might learn something about test design, and feed that learning into her approach toward test execution, or vice versa. The value of that learning might be realized immediately (as in an exploratory approach) or over time (as in a scripted approach).
  • The tester, upon executing a test, might recognize a new risk or missing coverage. That recognition might inform ideas about the design and choice of subsequent tests.  In a scripted approach, that’s a relatively long loop.  In an exploratory approach, upon noticing a new risk, the tester might choose to note the finding for later.  On the other hand, the discovery could be cashed in immediately: she might repeat the test, perform a variation on it, or alter her strategy to follow a different line of investigation.  Compared to a scripted approach, the feedback loop between discovery and subsequent action is far shorter.  The study ignores the length of these feedback loops.
  • In addition to discovering bugs that threaten the value of the product, the tester might discover issues—problems that threaten the value of the testing effort or the development project overall.
  • The tester who takes an exploratory approach may choose to investigate a bug or an issue that she has found.  Investigation may reduce the total bug count, but in some contexts it may be very important to the tester’s client.  In such cases, the quality of the investigation, rather than the number of bugs found, would be what matters.

More work products from testing can be found here.

“Efficiency” vs. “Effectiveness”

The study takes a very parsimonious view of “efficiency”, and further confuses “efficiency” with “effectiveness”.  Two tests are equally effective if they produce the same effects. The discovery of a bug is certainly an important effect of a test, yet there are other important effects too, as noted above, and they’re not considered in the study.

However, even if we decide that bug-finding is the only worthwhile effect of a test, two equally effective tests might not be equally efficient.  I would argue that efficiency is a relationship between effectiveness and cost.  An activity is more efficient if it has the same effectiveness at lower cost in terms of time, money, or resources.  This leads to what is by far the most serious problem in the paper…

Script Preparation Time Is Ignored

The authors’ evaluation of “efficiency” leaves out the preparation time for the scripted tests! The paper says that the exploratory testing sessions took 90 minutes for design, preparation, and execution. The preparation for the scripted tests took seven hours, while the scripted test execution sessions took 90 minutes, for a total of 8.5 hours.  This fact is not highlighted; indeed, it is not mentioned until the eighth of ten pages (page 68).  In journalism, that would be called burying the lead.  In terms of bug-finding alone, the authors suggest that the results were of equivalent effectiveness, yet the scripted approach took, in total, roughly 5.7 times longer than the exploratory approach. What other problems could the exploratory approach have found, given seven additional hours?
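
Worked out with the figures the paper reports (roughly equivalent defect counts, 1.5 hours of total exploratory effort, and 7 hours of preparation plus 1.5 hours of execution for the scripted approach), the arithmetic looks like this.  It’s a sketch of the ratios only, not a re-analysis of the authors’ data; the defect count is normalized, since only the proportions matter.

  # Figures as reported in the paper; the defect count is set to 1.0 because
  # the authors report roughly equivalent counts for both approaches, and only
  # the ratio of efforts matters here.
  defects_found = 1.0
  exploratory_hours = 1.5       # design, preparation, and execution combined
  scripted_hours = 7.0 + 1.5    # preparation plus execution

  effort_ratio = scripted_hours / exploratory_hours
  efficiency_ratio = (defects_found / exploratory_hours) / (defects_found / scripted_hours)
  print(f"total effort ratio:  {effort_ratio:.1f}x")      # about 5.7
  print(f"efficiency ratio:    {efficiency_ratio:.1f}x")  # the same, by construction

Equal bug counts at several times the cost is not equal efficiency.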

Conclusions

The authors offer these four conclusions at the end of the paper:

“First, we identify a lack of research on manual test execution from other than the test case design point of view. It is obvious that focusing only on test case design techniques does not cover many important aspects that affect manual testing. Second, our data showed no benefit in terms of defect detection efficiency of using predesigned test cases in comparison to an exploratory testing approach. Third, there appears to be no big differences in the detected defect types, severities, and in detection difficulty. Fourth, our data indicates that test case based testing produces more false defect reports.”

I would offer a few other conclusions.  The first is from the authors themselves, but is buried on page 68:  “Based on the results of this study, we can conclude that an exploratory approach could be efficient, especially considering the average 7 hours of effort the subjects used for test case design activities.”  Or, put another way,

  • During test execution
  • unskilled testers found the same number of problems, irrespective of the approach that they took, but
  • preparation of scripted tests increased total testing time by more than a factor of five
  • and appeared to add no significant value.

Now:  as much as I would like to cite this study as a significant win for exploratory testing, I can’t.  There are too many problems with it.  There’s not much value in comparing two approaches when those approaches are taken by unskilled and untrained people.  The study is heavy on data but light on information. There are no details about the bugs that were found and missed using each approach.  There’s no description of the testers’ activities or thought processes; just the output numbers.  There is the potential for interesting, rich stories on which bugs were found and which bugs were missed by which approaches, but such stories are absent from the paper.  Testing is a qualitative evaluation of a product; this study is a quantitative evaluation of testing.  Valuable information is lost thereby.

The authors say, “We could not analyze how good test case designers our subjects were and how much the quality of the test cases affected the results and how much the actual test execution approach.”  Actually, they could have analyzed that.  It’s just that they didn’t.  Pity.

6 replies to “Defect Detection Efficiency: An Evaluation of a Research Study”

  1. Hi Michael,

    This is a well-written piece that provides perspective on on-going research by myself and other people in the area of exploratory testing in order to better understand its value and limitations (or rephrased, which types of issues it is most effective at exposing, and which types of issues it is not as effective at exposing as compared to other approaches). It's that whole pesticide paradox thing, or using the right tool for the job.

    Also, I think your assertion that the authors regard exploratory testing as a 'technique' is misleading. None of the published papers by Juha or myself ever referred to exploratory testing as a technique. That would be incorrect because a technique is 'a systematic process to help solve a specific type of problem.'

    I agree with you that techniques or patterns are often the foundation of how we design tests either from an exploratory or pre-defined test case (scripted) approach.

    And, yes the studies use a lot of quantifiable data. Interesting thing about research; it supports its conclusions with facts rather than emotion. But, you are also quite right that data may not present the whole story or may slant the picture which is why we both suggest further research is required.

    Reply
  2. @Bj…

    Thank you for the comments.

    Also, I think your assertion that the authors regard exploratory testing as a 'technique' is misleading.

    I'm confused. In reading the text, I can't find such an assertion. Can you point me to it?

    And, yes the studies use a lot of quantifiable data. Interesting thing about research; it supports its conclusions with facts rather than emotion.

    That's the idea, at least. Is it your intention to suggest that quantifiable data is the only form of fact? Do you suggest that qualitative analysis is naturally or automatically non-factual?

    Meanwhile, I hope that this post has been helpful in identifying areas that might inform your own studies. I'd be very interested in seeing some of Microsoft's work in exploratory and heuristic approaches, both from a training perspective and a practice perspective. I know about the General Functionality and Stability Test Procedure, Michael Hunter's You Are Not Done Yet list, and James Whittaker's Tour Stuff, but I'm not aware of anything from Engineering Excellence specifically. Any pointers?

    Cheers,

    —Michael B.

    Reply
  3. It is odd to read "research supports its conclusions with facts rather than emotion." I would expect this from an amateur who has little research sophistication.

    The hallmark of useful research is not that it is quantitative (some of the best research is qualitative).

    Nor is the hallmark that it presents "facts." After all, "facts" to the original researcher are merely anecdotes to a person who reads about them. They are stories about what someone did, what happened, and what he thinks about it.

    "Facts" presented in a research paper are observations of a researcher, recorded in way decided by the researcher (the research and recording methods are rarely well disclosed in a technical paper, partially because publication page limits don't allow enough room for details) which were then analyzed using methods chosen by the researcher (whose reasoning is not fully disclosed in the paper). A subset (chosen by the researcher) of the analyses are presented as “the facts.”

    Sadly, many alleged facts in scientific papers are intentional fictions or distortions. Many other factual-seeming assertions are erroneous or misleading even though they are written in good faith, because of misperceptions or misrecollections by the author. For more information on this, read the extensive literature on experimenter effects in science.

    Like anecdotes, scientific writing is often treated as credible on the reputation of the author for skill and integrity and on the extent to which the presentation of "the facts" is compelling.

    In other words: facts, shmacts.

    To a more mature researcher, there are more interesting questions than whether data are "facts" or the data collection and analysis techniques were quantitative rather than mixed-method or qualitative.

    The hallmark of useful research is whether it is useful.

    And for that, we need validity and generalizability, not quantitative "facts".

    Have you ever watched kittens explore? Guided a kitten’s exploration with some string or a laser-pen dot? I bet someone could do a quantitative research project on the relative effectiveness of different stimuli for guiding the exploration of 100 kittens.
    We could publish the results in a book, so as to include all the pertinent data and methods. Methodologically and quantitatively, the book could be perfectly sound.

    Let’s give this book a catchy title: "How We Study the Management and Training of Exploratory Testing at Microsoft."

    Perhaps, though, this population studied might not reflect the characteristics of what the professional software testing community might consider the population of interest. That is, if we want to draw conclusions about software testers, then kittens (even though they explore and test soft things) might not provide a suitable basis.

    Michael's assertion is that Juha studied the equivalent of kittens.

    The response to Michael's critique appears to be that Juha's paper is well-written, it reports facts and numbers, and that the authors said "more research is necessary" so this is good science.

    I have not read Juha's paper and am not expressing any opinion about the underlying paper. I am commenting on the exchange in this blog, and other instances I have noticed elsewhere recently of what seems to be a wielding of "fact"-based research as a bludgeon.

    When I teach empirical methods in computing, my students critique research papers. An answer like "this paper is nicely written and it illustrates good science because it has facts and numbers and says 'more research is necessary'" would get an "F".

    A more mature vision of scientific research would recognize that Michael was raising one of the classic "threats to validity" of empirical research. If a study does not credibly speak to the phenomena it purports to study, it is invalid. On the scale of research bugs, that would be a P1 showstopper.

    Reply
  4. I completely disagree with you on this one, Michael.

    First of all, there is a massive difference between "unskilled" testers as it might apply to someone off the street, where you have absolutely no clue what their background is, and someone who is a computer science student. The foundation of software testing is in understanding computer applications and understanding how an application SHOULD work. So let's not discount this aspect.

    Secondly, I have seen no other attempt to bring "valid" data to the table. So aside from the "Give us $20,000, and we'll show you the promised land" argument, I have seen very little as far as empirical data from those who are the biggest proponents of this "approach". So, not for nothing, but a lot of what you argue here sounds like a sales pitch on your part, sorry.

    I'm really sorry, but I was very happy to find something (the only thing I found) that provided me with some ammunition. In my situation, if I hadn't, literally, had time to kill on a project which ended up returning outstanding results, then ET would not be an approach available to me today. I mean we can't just sit down, smoke some pot and hit them with it when they're good and stoned and hope they say, "cool". The notion is, I give you results, you give me the ok. I show you numbers, you give me the go ahead.

    I understand this is, again, what you guys are there to teach, for $20,000, but at the very least Juha's report provides a couple bullets. Now battle-hardened, by-the-book test vets who love their scripts might be able to shoot it apart. I mean blow it to pieces! However, it's something. Give us something! I have a much greater respect for someone who tries and fails than someone who says it cannot be done. In Juha's case, I don't view it as failing, though. I think it does provide some nuggets, as I call them. Some ammo that can be used by those of us who want to progress testing in a more agile way. Sure, there may be issues with it. I'm sure there can be issues with any research report, but everything we DO is subjective. So who cares?! If you don't like it, then give us something. ANYTHING!

    As far as I'm concerned, it may not be the silver bullet, but nothing is. At least it's a bullet and not a guy telling me he has a bullet.

    Reply
  5. Hi, Brent…

    Don't get me wrong: I could point to the study and say "Even with all the problems, the exploratory approach still beat the scripted approach by a factor of five." I'd love to do that. It would be better than claiming that the approaches are equally efficient, which the study manifestly does not show. The strong points in the story are very favourable to exploratory approaches, and the weaknesses unfavourable. My primary goal was to point out the misrepresentation of the study by B.J. in his recent writings. To be fair, some of that misrepresentation is founded in the way that the authors themselves appear to have misinterpreted the results, claiming equal efficiency in test execution but missing the fact that testing incorporates not only execution, but also (at least) design and reporting.

    So that raises a set of ethical dilemmas. Should I challenge the misrepresentations that are being spread in presentations and in articles and in the blogosphere? Should I provide an alternative view, cherry-picking only the bits of the paper that support my argument? Or should I do a critical review, in which I point out the good news but also acknowledge the flaws? In my view, B.J.'s fallacies shouldn't be ignored. The cherry-picking doesn't fit for me ethically; that option sounds like a sales pitch. The third is the only option that my ethics permit. Plus, questioning a product and finding problems with it is at the centre of being a tester, even if we'd be inclined to like that product. That's a credibility thing.

    I'm really sorry, but I was very happy to find something (the only thing I found) that provided me with some ammunition.

    But it's not the only thing you found, unless I misunderstand. You used something far more valuable than a questionable study. You say you tested and delivered outstanding results. If I read it right, you had some time, and you put it to use in exploration. I infer that you got some experience, and some results; probably you did some experimentation—which delivered more results that you used to negotiate a little more time, for even more and even better results. A virtuous cycle—right?

    That's what we do, too. We don't appeal to academic studies that don't reflect real teams and testers well. We haven't found a good study. Instead, we appeal to our own experience; we appeal to your experience (that is, to the experience of our clients generally); and we appeal to your own experiments (ditto). I'll have more to say about that in my next blog post.

    Just as you say,

    I give you results, you give me the ok. I show you numbers, you give me the go ahead.

    That's what we do. And for all of our work, if you don't like what we've done, we offer a money-back guarantee.

    In your case, perhaps citing impressive results from an academic study helped; maybe the manager didn't notice the weaknesses in the study, or maybe s/he did notice and didn't care. As Cem points out, "The hallmark of useful research is whether it is useful." If it was useful to you, great. But as Cem also says, the research bugs in this study are, to me, showstoppers. Some products work for people even when the product has what the testers would consider a showstopper. Quality is value to some person, and you and I are different people. If it worked for you, I'm happy. Really.

    Meanwhile, you also say…

    "Give us $20,000, and we'll show you the promised land."

    If you know of someone who is charging $20,000 for a three-day class and a day of consulting, and is getting it, please let me know; I'd love to find out how they're doing it. Better yet, if you know a client who's willing to part with that kind of money, please send contact information immediately.

    —Michael B.

    Reply
  6. […] James Bach and Michael Bolton (another top CDT consultant) went as far as to suggest that Microsoft’s study was inaccurate, faulty and that they got “ET wrong”! See here and here. […]

    Mario’s at it again, everyone… This time he’s apparently incapable of reading even the first page of the study he’s citing. Had he done so, he would have realized that it was not a Microsoft study, but one done independently; and he might have had some material to provide a reasonable rebuttal of my critique. Oh well.

    Reply
