Blog Posts from January, 2010

Exploratory Testing IS Accountable

Wednesday, January 27th, 2010

In this blog post, my colleague James Bach talks about logging and its importance in support of exploratory testing. Logging takes care of one part of the accountability angle, and in an approach like session-based test management (developed by James and his brother Jon), the test notes and the debrief take care of another part of it.

Logging records what happened from the perspective of the test system. Good logging relieves the tester from having to record specific actions in detail; the machine does that. The tester is thereby free to record test notes—a running account of the tester’s ideas, questions, and results as he tested, or what happened from the perspective of the tester. Those notes form the meat of the session sheet, which also includes

  • coverage data
  • who did the testing
  • when they started
  • how long it took
  • the proportion of time spent on test design and execution, bug investigation and reporting, and setup
  • the proportion of the time spent on on-charter work vs. opportunity work
  • references to log files, data files, and related material such as scenarios, help files, specifications, standards, and so forth
  • and, of course, bugs discovered and issues identified.

After the session or at the end of the day, the tester presents a report—the session sheet combined with an oral account—in the debrief, a conversation between the tester and the test lead or test manager. In the debrief, the test lead reviews—that is, tests—the tester’s experience and his report. The question “What happened?” gets addressed; the oral and written aspects of the report get discussed and evaluated; the session charter is confirmed or revised; holes are discovered and, where needed, plugged with follow-up testing; bug reports get reviewed; issues get brought up; coaching happens; mentoring happens; learning happens; knowledge gets transferred. The goal here is for the tester and the test lead to be able to say, “we can vouch for what was tested.”

The session sheet is structured in such a way that it can be scanned by a text-parsing tool written in Perl. The measurements (in particular the coverage measurements) are collected and collated automatically into reports in the form of sortable HTML tables. Session sheets are kept for later review, if they’re needed.
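
To make that idea concrete, here is a minimal sketch of the kind of scanning described above—written in Python rather than Perl, and with an invented file layout and invented field names (TESTER, START, DURATION, and so on); the actual session sheets and the actual scanning tool have their own conventions:

    import glob
    import html
    import re

    # Illustrative assumption: each session sheet is a plain-text file containing
    # simple "FIELD: value" lines for the items listed above (tester, start time,
    # duration, task breakdown, and so on).
    FIELDS = ["CHARTER", "TESTER", "START", "DURATION",
              "TEST DESIGN AND EXECUTION", "BUG INVESTIGATION AND REPORTING",
              "SETUP", "ON CHARTER", "OPPORTUNITY", "BUGS", "ISSUES"]

    def parse_sheet(path):
        """Pull the tagged fields out of one session sheet."""
        record = {"FILE": path}
        with open(path, encoding="utf-8") as sheet:
            for line in sheet:
                match = re.match(r"([A-Z ]+):\s*(.*)", line.strip())
                if match and match.group(1).strip() in FIELDS:
                    record[match.group(1).strip()] = match.group(2).strip()
        return record

    def to_html_table(records):
        """Collate the parsed sheets into one HTML table; the 'sortable' class
        is just a hook for a bit of client-side sorting script (not shown)."""
        headers = ["FILE"] + FIELDS
        rows = ["<table class=\"sortable\">",
                "<tr>" + "".join("<th>%s</th>" % h for h in headers) + "</tr>"]
        for record in records:
            cells = "".join("<td>%s</td>" % html.escape(record.get(h, ""))
                            for h in headers)
            rows.append("<tr>" + cells + "</tr>")
        rows.append("</table>")
        return "\n".join(rows)

    if __name__ == "__main__":
        sheets = [parse_sheet(p) for p in sorted(glob.glob("sessions/*.txt"))]
        print(to_html_table(sheets))

In a setup like this, one sheet per session would land in a folder, the table would be regenerated whenever the sheets change, and the task-breakdown proportions would fall out of the same parsing pass.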

If logging in the program isn’t available right away, screen recording tools (like BB Test Assistant, Camtasia, Spector, …) can provide a retrospective account of what happened. (An over-the-shoulder video camera works too.) Note that these tools simply record video (and, optionally, sound—which is good for narration). Programmatic repetition of the session isn’t the point. Nor is the point to have a supervisor review the screen capture obsessively; that wastes time, and besides, nobody likes working for Big Brother. The idea is to use the video only when necessary—to aid in recollection where it’s needed, and to help in troubleshooting hard-to-reproduce bugs.

We suggest, where it doesn’t get in the way, taking the test notes on the same machine as the application under test, popping up the text editor window as a way to link the execution of the application with bugs, test ideas, or questions. For bugs that don’t appear to be state-critical, you can also take very brief notes for later follow-up. Include a time stamp; the time stamp is an index into the recording, so you can revisit the recording later if more detail is called for. (In Notepad, you can press F5; in TextPad, Edit/Insert/Time, and it’s macroable; other text editors almost certainly have a similar feature.)
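
If your editor has no such feature, a few lines of script can stand in for it. The sketch below is a hypothetical helper (not part of any tool mentioned above): it simply prefixes each note you type with the current clock time, so that the note can later be matched against the screen recording.

    from datetime import datetime

    # Hypothetical note-taking helper: every line typed at the prompt is appended
    # to a notes file along with a time stamp, which acts as an index into the
    # screen recording. Stop with Ctrl-C (or end-of-input).
    NOTES_FILE = "session-notes.txt"

    try:
        while True:
            note = input("note> ")
            stamp = datetime.now().strftime("%H:%M:%S")
            with open(NOTES_FILE, "a", encoding="utf-8") as notes:
                notes.write("%s  %s\n" % (stamp, note))
    except (KeyboardInterrupt, EOFError):
        print("\nNotes saved to %s" % NOTES_FILE)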

Between the charter, the session sheet, the oral report, the data files, the logs, and the debrief, it’s hard for me to imagine a more accountable way of working. Each aspect of the reporting structure reinforces the others. This is why I get confused when test managers talk about exploratory testing being “unaccountable” or “unmanageable” or “unstructured”: when I ask them what accountability and management mean to them, they point lamely to a pile of scripts or spreadsheets full of overspecified actions that were written weeks or months before the software was built, or they mumble something about not knowing what goes on in a tester’s head.

Any testing approach is manageable when you choose to manage it. If you want structure, think about what you mean (maybe this guide to the structures of exploratory testing will help), identify the structures that are important to you, and develop those structures in your testers, in your team, and in your approaches. If you want accountability, provide structures for it (like session-based test management), and then require accountability. If you find that your testers aren’t sufficiently skilled, train them and mentor them. (And if you don’t know how to do that rapidly and effectively, we can help you.)

If there’s something you don’t like about the results you’re getting, manage: observe what’s going on in your system of testing, and put in a control action where you want to change something. If you want to know what’s going on in a tester’s head, observe her directly and interview her as she’s testing; have her pair with another tester or a test lead; critique her notes; debrief her and coach her, until you get the results that you seek. If you want to supercharge the efficiency of your testers, work with the programmers and their managers to focus on testability, with special attention paid to scriptable interfaces, logging, and at least some programmer testing. (It might help to identify the information-hiding and feedback-loop-lengthening costs of the absence of testability.) If you find individual debriefs taking too long, or if you want to share information more broadly within the test team, try group debriefs at the end of one day or the beginning of the next. If you want to add features to the reporting protocol, add them; if you want to drop them, drop them. Experiment, re-evaluate, and tune your testing as you see fit.

And if you have a more manageable and accountable approach than this for fostering the discovery of important problems in the product, please let us know (me, or James, or Jon). We’d really like to hear about it.

Disposable Time

Sunday, January 17th, 2010

In our Rapid Testing class, James Bach and I like to talk about an underappreciated tester resource: disposable time. Disposable time is the time that you can afford to waste without getting into trouble.

Now, we want to be careful about what we mean by “waste” here. It’s not that you want to waste the time; you probably want to spend it wisely. It’s just that you won’t suffer harm if you do happen to waste it. Disposable time is to your working hours what disposable income is to your total personal income. (Even that’s not quite correct, strictly speaking. What people call disposable income is more properly called discretionary income: the money that’s left over after you’ve paid for all of the things that you must pay for—food, shelter, basic clothing, medical expenses, and taxes—or, as Wikipedia says, “the amount of ‘play money’ left to spend or save.” Oh well. We’ll go with the incorrect but popular interpretation of “disposable” here.)

Hardly anyone is scrutinized every minute of every day; practically everyone has a few moments when no one important is watching. In that time, you might

  • try a tiny test that hasn’t been prescribed.
  • try putting in a risky value instead of a safe value.
  • pretend to change your mind, or to make a mistake, and go back a step or two; users make mistakes, and error handling and recovery are often the most vulnerable parts of the program.
  • take a couple of moments to glance at some background information relevant to the work that you’re doing.
  • write in your journal.
  • see if any of your colleagues in technical support have a hot issue that can inform some test ideas.
  • steal a couple of moments to write a tiny, simple program that will save you some time; use the saved time and the learning to extend your programming skills so that you can solve increasingly complex programming problems.
  • spend an extra couple of minutes at the end of a coffee break befriending the network support people.
  • sketch a workflow diagram for your product, and at some point show it to an expert, and ask if you’ve got it right.
  • snoop around in the support logs for the product.
  • add a few more lines to a spreadsheet of data values.
  • help someone else solve a problem that they’re having.
  • chat with a programmer about some aspect of the technology.
  • even if you do nothing else, at least pause and look around the screen as you’re testing. Take a moment or two to recognize a new risk and write down a new question or a new test idea. Report on that idea later on; ask your test lead, your manager, a programmer, or a product owner if it’s a risk worth investigating. Hang on to your notes. When someone asks “Why didn’t you find that bug?”, you may have an answer for them.

If it turns out that you’ve made a bad investment, oh well. By definition, however large or small the period, disposable time is time that you can afford to blow without suffering consequences.

On the other hand, you may have made a good investment. You may have found a bug, or recognized a new risk, or learned something important, or helped someone out of a jam, or built on a professional relationship, or surprised and impressed your manager. You may have done all of these things at once. Even if you feel like you’ve wasted your time, you’ve probably learned enough to insulate yourself from wasting more time in the same way. When you discover that an alley is blind, you’re unlikely to return there when there are other things to explore.

In The Black Swan, Nassim Nicholas Taleb proposes an investment strategy wherein you put the vast bulk of your money, your nest egg, in very safe securities. You then invest a small amount—an amount that you can afford to lose—in very speculative bets that have a chance of providing a spectacular return. He calls that very improbable high-return event a positive Black Swan. Your nest egg is like the part of your job that you must accomplish. Disposable time is like your Black Swan fund; you may lose it all, but you have a shot at a big payoff. But there’s an important difference, too: since learning is an almost inevitable product of using your disposable time, there’s almost always some modest positive outcome.

We encourage test managers to allow disposable time explicitly for their testers. As an example, Google provides its staff with Innovation Time Off. Engineers are encouraged to spend 20% of their time pursuing projects that interest them. That sounds like a waste, until one learns that Google projects like Gmail, Google News, Orkut, and AdSense came out of these investments.

What Google may not know is that even within the other 80% of the time that’s ostensibly on mission, people still have, and are still using, non-explicit disposable time. People have that almost everywhere, whether they have explicit disposable time or not.

If you’re working in an environment where you’re being watched so closely that none of this is possible, and where you’re punished for learning or seeking problems, my advice is to make sure that slavery has been abolished in your jurisdiction. Then find a job where your testing skills are valued and your managers aren’t wasting their time by watching your work instead of doing theirs. But when you’ve got a few moments to fill, fill them and learn something!

Defect Detection Efficiency: An Evaluation of a Research Study

Friday, January 8th, 2010

Over the last several months, B.J. Rollison has been delivering presentations and writing articles and blog posts in which he cites a paper Defect Detection Efficiency: Test Case Based vs. Exploratory Testing [DDE2007], by Juha Itkonen, Mika V. Mäntylä and Casper Lassenius (First International Symposium on Empirical Software Engineering and Measurement, pp. 61-70; the paper can be found here).

I appreciate the authors’ intentions in examining the efficiency of exploratory testing.  That said, the study and the paper that describes it have some pretty serious problems.

Some Background on Exploratory Testing

It is common for people writing about exploratory testing to consider it a technique, rather than an approach. “Exploratory” and “scripted” are opposite poles on a continuum. At one pole, exploratory testing integrates test design, test execution, result interpretation, and learning in the same person at the same time. At the other, scripted testing separates test design and test execution by time, and typically (although not always) by tester, and mediates information about the designer’s intentions by way of a document or a program. As James Bach has recently pointed out, the exploratory and scripted poles are like “hot” and “cold”. Just as there can be warmer or cooler water, there are intermediate gradations to testing approaches. The extent to which an approach is exploratory is the extent to which the tester, rather than the script, is in immediate control of the activity. A strongly scripted approach is one in which ideas from someone else, or ideas from some point in the past, govern the tester’s actions. Test execution can be very scripted, as when the tester is given an explicit set of steps to follow and observations to make; somewhat scripted, as when the tester is given explicit instruction but is welcome or encouraged to deviate from it; or very exploratory, as when the tester is given a mission or charter and is mandated to use whatever information and ideas are available, even those that have been discovered in the present moment.

Yet the approaches can be blended. James points out that the distinguishing attribute between exploratory and scripted approaches is the presence or absence of loops. The most extreme scripted testing would follow a strictly linear approach: design would be done at the beginning of the project; design would be followed by execution; tests would be performed in a prescribed order; and later cycles of testing would repeat exactly the same tests to check for regression.

Let’s get more realistic, though.  Consider a tester with a list of tests to perform, each using a data-focused automated script to address a particular test idea.  A tester using a highly scripted approach would run that script, observe and record the result, and move on to the next test.  A tester using a more exploratory approach would use the list as a point of departure, but upon observing an interesting result might choose to perform a different test from the next one on the list; to alter the data and re-run the test; to modify the automated script; or to abandon that list of tests in favour of another one.  That is, the tester’s actions in the moment would not be directed by earlier ideas, but would be informed by them. Scripted approaches set out the ideas in advance, and when new information arrives, there’s a longer loop between discovery and the incorporation of that new information into the testing cycle.  The more exploratory the approach, the shorter the loop.  Exploratory approaches do not preclude the use of prepared test ideas, although both James and I would argue that our craft, in general, places excessive emphasis on test cases and focusing techniques at the expense of more general heuristics and defocusing techniques.

The point of all this is that neither exploratory testing nor scripted testing is a technique, nor a body of techniques. They’re approaches that can be applied to any testing technique.

To be fair to the authors of [DDE2007], since publication of their paper there has been ongoing progress in the way that many people—in particular Cem Kaner, James Bach, and I—articulate these ideas, but the fundamental notions haven’t changed significantly.

Literature Review

While the authors do cite several papers on testing and test design techniques, they do not cite some of the more important and relevant publications on the exploratory side. Examples of such literature include “Measuring the Effectiveness of Software Testers” (Kaner, 2003; slightly updated in 2006); “Software engineering metrics: What do they measure and how do we know?” (Kaner & Bond, 2004); “Inefficiency and Ineffectiveness of Software Testing: A Key Problem in Software Engineering” (Kaner, 2006; to be fair to the authors, this paper may have been published too late to inform [DDE2007]); General Functionality and Stability Test Procedure (for Microsoft Windows 2000 Application Certification) (Bach, 2000); the Satisfice Heuristic Test Strategy Model (Bach, 2000); and How to Break Software (Whittaker, 2002).

The authors of [DDE2007] appear also to have omitted literature on the subject of exploration and its role in learning. Yet there is significant material on the subject, in both popular and more academic literature. Examples here include Collaborative Discovery in a Scientific Domain (Okada and Simon; note that the subjects are testing software); Exploring Science: The Cognition and Development of Discovery Processes (David Klahr and Herbert Simon); Plans and Situated Actions (Lucy Suchman); Play as Exploratory Learning (Mary Reilly); How to Solve It (George Polya); Simple Heuristics That Make Us Smart (Gerd Gigerenzer); Sensemaking in Organizations (Karl Weick); Cognition in the Wild (Edwin Hutchins); The Social Life of Information (Paul Duguid and John Seely Brown); The Sciences of the Artificial (Herbert Simon); all the way back to A System of Logic, Ratiocinative and Inductive (John Stuart Mill, 1843).

These omissions are reflected in the study and the analysis of the experiment, and that leads to a common problem in such studies: heuristics and other important cognitive structures in exploration are treated as mysterious and unknowable. For example, the authors say, “For the exploratory testing sessions we cannot determine if the subjects used the same testing principles that they used for designing the documented test cases or if they explored the functionality in pure ad-hoc manner. For this reason it is safer to assume the ad-hoc manner to hold true.” [DDE2007, p. 69] Why assume? At the very least, one could observe the subjects and debrief them, asking about their approaches. In fact, this is exactly the role that the test lead fulfills in the practice of skilled exploratory testing. And why describe the principles only as “ad-hoc”? It’s not like the principles can’t be articulated. I talk about oracle heuristics in this article, and talk about stopping heuristics here; Kaner’s Black Box Software Testing course talks about test design heuristics; James Bach’s work talks about test strategy heuristics (especially here); James Whittaker’s books talk about heuristics for finding vulnerabilities…

Tester Experience

The study was performed using testers who were, in the main, novices. “27 subjects had no previous experience in software engineering and 63 had no previous experience in testing. 8 subjects had one year and 4 subjects had two years testing experience. Only four subjects reported having some sort of training in software testing prior to taking the course.” ([DDE2007], p. 65, my emphasis) Testing—especially testing using an exploratory approach—is a complex cognitive activity. If one were to perform a study on novice jugglers, one would likely find that they drop an approximately equal number of objects, whether they were juggling balls or knives.

Tester Training

The paper notes that “subjects were trained to use the test case design techniques before the experiment.” However, the paper does not make note of any specific training in heuristics or exploratory approaches. That might not be surprising in light of the weaknesses on the exploratory side of the literature review. My experience, that of James Bach, and anecdotal reports from our clients suggest that even a brief training session can greatly increase the effectiveness of an exploratory approach.

Cycles of Testing

Testing happens in cycles. In a strongly scripted approach, the process tends to be linear: all tests are designed up front; then those tests are executed; then testing for that area is deemed to be done. In subsequent cycles, the intention is to repeat the original tests to make sure that bugs have been fixed and to check for regression. By contrast, exploratory testing is an organic and iterative process. In an exploratory approach, the same area might be visited several times, such that learning from early “reconnaissance” sessions informs further exploration in subsequent “deep coverage” sessions. The learning from those (and from ideas about bugs that have been found and fixed) informs “wrap-up” sessions, in which tests may be repeated, varied, or cut from new cloth. The study makes no allowance for information and learning obtained during one round of testing to inform later rounds. Yet such information and learning is typically of great value.

Quantitative vs. Qualitative Analysis

In the study, a great deal of emphasis is placed on quantifying results and on experimental and mathematical rigour. However, such rigour may be misplaced when the products of testing are qualitative, rather than quantitative.

Finding bugs is important, finding many bugs is important, and finding important bugs is especially important. Yet bugs and bug reports are by no means the only products of testing.  The study largely ignores the other forms of information that testing may provide.

  • The tester might learn something about test design, and feed that learning into her approach toward test execution, or vice versa. The value of that learning might be realized immediately (as in an exploratory approach) or over time (as in a scripted approach).
  • The tester, upon executing a test, might recognize a new risk or missing coverage. That recognition might inform ideas about the design and choices of subsequent tests. In a scripted approach, that’s a relatively long loop. In an exploratory approach, upon noticing a new risk, the tester might choose to note her findings for later. On the other hand, the discovery could be cashed in immediately: she might choose to repeat the test, perform a variation on it, or alter her strategy to follow a different line of investigation. Compared to a scripted approach, the feedback loop between discovery and subsequent action is far shorter. The study ignores the length of the feedback loops.
  • In addition to discovering bugs that threaten the value of the product, the tester might discover issues—problems that threaten the value of the testing effort or the development project overall.
  • The tester who takes an exploratory approach may choose to investigate a bug or an issue that she has found. This may reduce the total bug count, but in some contexts the investigation may be very important to the tester’s client. In such cases, the quality of the investigation, rather than the number of bugs found, would be what matters.

More work products from testing can be found here.

“Efficiency” vs. “Effectiveness”

The study takes a very parsimonious view of “efficiency”, and further confuses “efficiency” with “effectiveness”. Two tests are equally effective if they produce the same effects. The discovery of a bug is certainly an important effect of a test, but there are other important effects too, as noted above, and they’re not considered in the study.

However, even if we decide that bug-finding is the only worthwhile effect of a test, two equally effective tests might not be equally efficient.  I would argue that efficiency is a relationship between effectiveness and cost.  An activity is more efficient if it has the same effectiveness at lower cost in terms of time, money, or resources.  This leads to what is by far the most serious problem in the paper…

Script Preparation Time Is Ignored

The authors’ evaluation of “efficiency” leaves out the preparation time for the scripted tests! The paper says that the exploratory testing sessions took 90 minutes for design, preparation, and execution. The preparation for the scripted tests took seven hours, while the scripted test execution sessions took 90 minutes, for a total of 8.5 hours. This fact is not highlighted; indeed, it is not mentioned until the eighth of ten pages (page 68). In journalism, that would be called burying the lead. In terms of bug-finding alone, the authors suggest that the results were of equivalent effectiveness, yet the scripted approach took, in total, roughly 5.7 times as long as the exploratory approach. What other problems could an exploratory approach have found, given seven additional hours?
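
The arithmetic behind that comparison, using the figures reported in the paper, is simple enough to check:

    # Figures as reported in [DDE2007]: 90-minute sessions on both sides, plus
    # an average of seven hours of test-case preparation for the scripted approach.
    exploratory_hours = 1.5        # design, preparation, and execution within the session
    scripted_hours = 7.0 + 1.5     # preparation plus execution
    print(scripted_hours / exploratory_hours)   # 5.666..., i.e. roughly 5.7 times as long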

Conclusions

The authors offer these four conclusions at the end of the paper:

“First, we identify a lack of research on manual test execution from other than the test case design point of view. It is obvious that focusing only on test case design techniques does not cover many important aspects that affect manual testing. Second, our data showed no benefit in terms of defect detection efficiency of using predesigned test cases in comparison to an exploratory testing approach. Third, there appears to be no big differences in the detected defect types, severities, and in detection difficulty. Fourth, our data indicates that test case based testing produces more false defect reports.”

I would add a few other conclusions. The first is from the authors themselves, but is buried on page 68: “Based on the results of this study, we can conclude that an exploratory approach could be efficient, especially considering the average 7 hours of effort the subjects used for test case design activities.” Or, put another way,

  • During test execution,
  • unskilled testers found the same number of problems, irrespective of the approach that they took, but
  • preparation of scripted tests increased total testing time by a factor of more than five
  • and appeared to add no significant value.

Now:  as much as I would like to cite this study as a significant win for exploratory testing, I can’t.  There are too many problems with it.  There’s not much value in comparing two approaches when those approaches are taken by unskilled and untrained people.  The study is heavy on data but light on information. There are no details about the bugs that were found and missed using each approach.  There’s no description of the testers’ activities or thought processes; just the output numbers.  There is the potential for interesting, rich stories on which bugs were found and which bugs were missed by which approaches, but such stories are absent from the paper.  Testing is a qualitative evaluation of a product; this study is a quantitative evaluation of testing.  Valuable information is lost thereby.

The authors say, “We could not analyze how good test case designers our subjects were and how much the quality of the test cases affected the results and how much the actual test execution aproach.”  Actually, they could have analyzed that.  It’s just that they didn’t.  Pity.