Over the last several months, B.J. Rollison has been delivering presentations and writing articles and blog posts in which he cites a paper Defect Detection Efficiency: Test Case Based vs. Exploratory Testing [DDE2007], by Juha Itkonen, Mika V. Mäntylä and Casper Lassenius (First International Symposium on Empirical Software Engineering and Measurement, pp. 61-70; the paper can be found here).
I appreciate the authors’ intentions in examining the efficiency of exploratory testing. That said, the study and the paper that describes it have some pretty serious problems.
Some Background on Exploratory Testing
It is common for people writing about exploratory testing to consider it a technique, rather than an approach. “Exploratory” and “scripted” are opposite poles on a continuum. At one pole, exploratory testing integrates test design, test execution, result interpretation, and learning into a single person at the same time. At the other, scripted testing separates test design and test execution by time, and typically (although not always) by tester, and mediates information about the designer’s intentions by way of a document or a program. As James Bach has recently pointed out, the exploratory and scripted poles are like “hot” and “cold”. Just as there can be warmer or cooler water, there are intermediate gradations to testing approaches. The extent to which an approach is exploratory is the extent to which the tester, rather than the script, is in immediate control of the activity. A strongly scripted approach is one in which ideas from someone else, or ideas from some point in the past, govern the tester’s actions. Test execution can be very scripted, as when the tester is given an explicit set of steps to follow and observations to make; somewhat scripted, as when the tester is given explicit instruction but is welcome or encouraged to deviate from it; or very exploratory, in which the tester is given a mission or charter, and is mandated to use whatever information and ideas are available, even those that have been discovered in the present moment.
Yet the approaches can be blended. James points out that the distinguishing attribute in exploratory and scripted approaches is the presence or absence of loops. The most extreme scripted testing would follow a strictly linear approach: design would be done at the beginning of the project; design would be followed by execution; tests would be performed in a prescribed order; and later cycles of testing would use exactly the same tests for regression checking.
Let’s get more realistic, though. Consider a tester with a list of tests to perform, each using a data-focused automated script to address a particular test idea. A tester using a highly scripted approach would run that script, observe and record the result, and move on to the next test. A tester using a more exploratory approach would use the list as a point of departure, but upon observing an interesting result might choose to perform a different test from the next one on the list; to alter the data and re-run the test; to modify the automated script; or to abandon that list of tests in favour of another one. That is, the tester’s actions in the moment would not be directed by earlier ideas, but would be informed by them. Scripted approaches set out the ideas in advance, and when new information arrives, there’s a longer loop between discovery and the incorporation of that new information into the testing cycle. The more exploratory the approach, the shorter the loop. Exploratory approaches do not preclude the use of prepared test ideas, although both James and I would argue that our craft, in general, places excessive emphasis on test cases and focusing techniques at the expense of more general heuristics and defocusing techniques.
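The difference in loop length can be sketched as control flow. This is my own illustrative sketch, not anything from the paper; `run_test`, `looks_interesting`, and `vary` are hypothetical stand-ins for the tester's execution, observation, and test-design activities.

```python
# Illustrative sketch: a scripted session follows a fixed list, while an
# exploratory session can shorten the loop between discovery and the next
# test. All function names here are hypothetical placeholders.

def run_test(test):
    # Stand-in for executing a test idea and observing the result.
    return {"test": test, "interesting": test.endswith("?")}

def looks_interesting(result):
    return result["interesting"]

def vary(test):
    # A follow-up variation on the same idea; marked so that it does not
    # recurse forever in this toy example.
    return test.rstrip("?") + " (variation)"

def scripted_session(test_list):
    # Linear: the list, written earlier, directs the action; any new
    # ideas wait for a later design cycle.
    return [run_test(t) for t in test_list]

def exploratory_session(test_list):
    # Iterative: the list informs the tester, but an interesting result
    # can put a new test at the front of the queue right away.
    queue, results = list(test_list), []
    while queue:
        result = run_test(queue.pop(0))
        results.append(result)
        if looks_interesting(result):
            queue.insert(0, vary(result["test"]))  # follow up immediately
    return results

tests = ["login works", "what if the field is empty?"]
print(len(scripted_session(tests)))     # 2 tests, exactly as listed
print(len(exploratory_session(tests)))  # 3: one follow-up added in the moment
```

Both sessions start from the same prepared list; the difference is only where the decision about the *next* test is made, and when.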
The point of all this is that neither exploratory testing nor scripted approaches are testing techniques, nor bodies of testing techniques. They’re approaches that can be applied to any testing technique.
To be fair to the authors of [DDE2007], since publication of their paper there has been ongoing progress in the way that many people—in particular Cem Kaner, James Bach, and I—articulate these ideas, but the fundamental notions haven’t changed significantly.
While the authors do cite several papers on testing and test design techniques, they do not cite some of the more important and relevant publications on the exploratory side. Examples of such literature include “Measuring the Effectiveness of Software Testers” (Kaner, 2003; slightly updated in 2006); “Software engineering metrics: What do they measure and how do we know?” (Kaner & Bond, 2004); “Inefficiency and Ineffectiveness of Software Testing: A Key Problem in Software Engineering” (Kaner, 2006; to be fair to the authors, this paper may have been published too late to inform [DDE2007]); General Functionality and Stability Test Procedure (for Microsoft Windows 2000 Application Certification) (Bach, 2000); the Satisfice Heuristic Test Strategy Model (Bach, 2000); and How To Break Software (Whittaker, 2002).
The authors of [DDE2007] appear also to have omitted literature on the subject of exploration and its role in learning. Yet there is significant material on the subject, in both popular and more academic literature. Examples here include Collaborative Discovery in a Scientific Domain (Okada and Simon; note that the subjects are testing software); Exploring Science: The Cognition and Development of Discovery Processes (David Klahr and Herbert Simon); Plans and Situated Actions (Lucy Suchman); Play as Exploratory Learning (Mary Reilly); How to Solve It (George Polya); Simple Heuristics That Make Us Smart (Gerd Gigerenzer); Sensemaking in Organizations (Karl Weick); Cognition in the Wild (Edwin Hutchins); The Social Life of Information (Paul Duguid and John Seely Brown); Sciences of the Artificial (Herbert Simon); all the way back to A System of Logic, Ratiocinative and Inductive (John Stuart Mill, 1843).
These omissions are reflected in the study and the analysis of the experiment, and that leads to a common problem in such studies: heuristics and other important cognitive structures in exploration are treated as mysterious and unknowable. For example, the authors say, “For the exploratory testing sessions we cannot determine if the subjects used the same testing principles that they used for designing the documented test cases or if they explored the functionality in pure ad-hoc manner. For this reason it is safer to assume the ad-hoc manner to hold true.” [DDE2007, p. 69] Why assume? At the very least, one could observe the subjects and debrief them, asking about their approaches. In fact, this is exactly the role that the test lead fulfills in the practice of skilled exploratory testing. And why describe the principles only as “ad-hoc”? It’s not as though the principles can’t be articulated. I talk about oracle heuristics in this article, and talk about stopping heuristics here; Kaner’s Black Box Software Testing course talks about test design heuristics; James Bach’s work talks about test strategy heuristics (especially here); James Whittaker’s books talk about heuristics for finding vulnerabilities…
The study was performed using testers who were, in the main, novices. “27 subjects had no previous experience in software engineering and 63 had no previous experience in testing. 8 subjects had one year and 4 subjects had two years testing experience. Only four subjects reported having some sort of training in software testing prior to taking the course.” ([DDE2007], p. 65; my emphasis) Testing—especially testing using an exploratory approach—is a complex cognitive activity. If one were to perform a study on novice jugglers, one would likely find that they drop an approximately equal number of objects, whether they were juggling balls or knives.
The paper notes that “subjects were trained to use the test case design techniques before the experiment.” However, the paper does not make note of any specific training in heuristics or exploratory approaches. That might not be surprising in light of the weaknesses on the exploratory side of the literature review. My experience, that of James Bach, and anecdotal reports from our clients suggest that even a brief training session can greatly increase the effectiveness of an exploratory approach.
Cycles of Testing
Testing happens in cycles. In a strongly scripted approach, the process tends to be linear: all tests are designed up front; then those tests are executed; then testing for that area is deemed to be done. In subsequent cycles, the intention is to repeat the original tests to make sure that bugs are fixed and to check for regression. No allowance is made for information and learning obtained during one round of testing to inform later rounds, yet such information and learning is typically of great value. By contrast, exploratory testing is an organic and iterative process. In an exploratory approach, the same area might be visited several times, such that learning from early “reconnaissance” sessions informs further exploration in subsequent “deep coverage” sessions. The learning from those (and from ideas about bugs that have been found and fixed) informs “wrap-up” sessions, in which tests may be repeated, varied, or cut from new cloth.
Quantitative vs. Qualitative Analysis
In the study, a great deal of emphasis is placed on quantifying results and on experimental and mathematical rigour. However, such rigour may be misplaced when the products of testing are qualitative, rather than quantitative.
Finding bugs is important, finding many bugs is important, and finding important bugs is especially important. Yet bugs and bug reports are by no means the only products of testing. The study largely ignores the other forms of information that testing may provide.
- The tester might learn something about test design, and feed that learning into her approach toward test execution, or vice versa. The value of that learning might be realized immediately (as in an exploratory approach) or over time (as in a scripted approach).
- The tester, upon executing a test, might recognize a new risk or missing coverage. That recognition might inform the design and choice of subsequent tests. In a scripted approach, that’s a relatively long loop. In an exploratory approach, upon noticing a new risk, the tester might choose to note the finding for later. On the other hand, the discovery could be cashed in immediately: she might repeat the test, perform a variation on it, or alter her strategy to follow a different line of investigation. Compared to a scripted approach, the feedback loop between discovery and subsequent action is far shorter. The study ignores the length of these feedback loops.
- In addition to discovering bugs that threaten the value of the product, the tester might discover issues—problems that threaten the value of the testing effort or the development project overall.
- The tester who takes an exploratory approach may choose to investigate a bug or an issue that she has found. This may reduce the total bug count, but in some contexts such investigation may be very important to the tester’s client. In those cases, the quality of the investigation, rather than the number of bugs found, would be what matters.
More work products from testing can be found here.
“Efficiency” vs. “Effectiveness”
The study takes a very parsimonious view of “efficiency”, and further confuses “efficiency” with “effectiveness”. Two tests are equally effective if they produce the same effects. The discovery of a bug is certainly an important effect of a test. Yet there are other important effects too, as noted above, but they’re not considered in the study.
However, even if we decide that bug-finding is the only worthwhile effect of a test, two equally effective tests might not be equally efficient. I would argue that efficiency is a relationship between effectiveness and cost. An activity is more efficient if it has the same effectiveness at lower cost in terms of time, money, or resources. This leads to what is by far the most serious problem in the paper…
Script Preparation Time Is Ignored
The authors’ evaluation of “efficiency” leaves out the preparation time for the scripted tests! The paper says that the exploratory testing sessions took 90 minutes for design, preparation, and execution. The preparation for the scripted tests took seven hours, while the scripted test execution sessions took 90 minutes, for a total of 8.5 hours. This fact is not highlighted; indeed, it is not mentioned until the eighth of ten pages (p. 68). In journalism, that would be called burying the lead. In terms of bug-finding alone, the authors suggest that the results were of equivalent effectiveness, yet the scripted approach took, in total, roughly 5.7 times as long as the exploratory approach. What other problems might the exploratory testers have found, given seven additional hours?
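The arithmetic here is back-of-the-envelope, using only the session times reported in the paper. The bug count below is a hypothetical placeholder (the paper reports roughly equivalent effectiveness for the two approaches), so only the ratios matter.

```python
# Cost comparison from the figures reported in [DDE2007].
exploratory_hours = 1.5        # design, preparation, and execution, all in-session
scripted_hours = 7.0 + 1.5     # 7 h of preparation plus a 90-minute execution session

cost_ratio = scripted_hours / exploratory_hours
print(f"scripted cost / exploratory cost = {cost_ratio:.2f}")  # 5.67

# If efficiency is effectiveness divided by cost, then for equal
# effectiveness the exploratory approach comes out well ahead.
bugs_found = 10  # hypothetical; the paper reports equivalent counts
exploratory_efficiency = bugs_found / exploratory_hours  # ~6.7 bugs per hour
scripted_efficiency = bugs_found / scripted_hours        # ~1.2 bugs per hour
print(f"{exploratory_efficiency:.1f} vs {scripted_efficiency:.1f} bugs/hour")
```

Measuring “efficiency” over the 90-minute execution sessions alone, as the study does, silently sets the seven hours of preparation cost to zero.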
The authors offer these four conclusions at the end of the paper:
“First, we identify a lack of research on manual test execution from other than the test case design point of view. It is obvious that focusing only on test case design techniques does not cover many important aspects that affect manual testing. Second, our data showed no benefit in terms of defect detection efficiency of using predesigned test cases in comparison to an exploratory testing approach. Third, there appears to be no big differences in the detected defect types, severities, and in detection difficulty. Fourth, our data indicates that test case based testing produces more false defect reports.”
I would offer a few further conclusions. The first is from the authors themselves, but is buried on page 68: “Based on the results of this study, we can conclude that an exploratory approach could be efficient, especially considering the average 7 hours of effort the subjects used for test case design activities.” Or, put another way:
- during test execution, unskilled testers found the same number of problems, irrespective of the approach that they took, but
- preparation of scripted tests increased total testing time approximately by a factor of five,
- and appeared to add no significant value.
Now: as much as I would like to cite this study as a significant win for exploratory testing, I can’t. There are too many problems with it. There’s not much value in comparing two approaches when those approaches are taken by unskilled and untrained people. The study is heavy on data but light on information. There are no details about the bugs that were found and missed using each approach. There’s no description of the testers’ activities or thought processes; just the output numbers. There is the potential for interesting, rich stories on which bugs were found and which bugs were missed by which approaches, but such stories are absent from the paper. Testing is a qualitative evaluation of a product; this study is a quantitative evaluation of testing. Valuable information is lost thereby.
The authors say, “We could not analyze how good test case designers our subjects were and how much the quality of the test cases affected the results and how much the actual test execution approach.” Actually, they could have analyzed that. It’s just that they didn’t. Pity.