Blog Posts from November, 2009

Why Is Testing Taking So Long? (Part 2)

Wednesday, November 25th, 2009

Yesterday I set up a thought experiment in which we divided our day of testing into three 90-minute sessions. I also made a simplifying assumption that a burst of testing activity representing some equivalent amount of test coverage (I called it a micro-session, or just a “test”) takes two minutes. Investigating and reporting a bug that we find costs an additional eight minutes, so a test on its own takes two minutes, and a test that finds a problem takes ten.

Yesterday we tested three modules. We found some problems. Today the fixes showed up, so we’ll have to verify them.

Let’s assume that a fix verification takes six minutes. (That’s yet another gross oversimplification, but it sets things up for our little thought experiment.) We don’t just perform the original micro-session again; we have to do more than that. We want to make sure that the problem is fixed, but we also want to do a little exploration around the specific case and make sure that the general case is fixed too.

Well, at least we’ll have to do that for Modules B and C. Module A didn’t have any fixes, since nothing was broken. And Team A is up to its usual stellar work, so today we can keep testing Team A’s module, uninterrupted by either fix verifications or by bugs. We get 45 more micro-sessions in today, for a two-day total of 90.

| Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total |
|---|---|---|---|---|---|
| A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90 |

Team B stayed an hour or so after work yesterday. They fixed the bug that we found, tested the fix, and checked it in. They asked us to verify the fix this afternoon. That costs us six minutes off the top of the session, leaving us 84 more minutes. Yesterday’s trends continue; although Team B is very good, they’re human, and we find another bug today. The test costs two minutes, and bug investigation and reporting costs eight more, for a total of ten. In the remaining 74 minutes, we have time for 37 micro-sessions. That means a total of 38 new tests today—one that found a problem, and 37 that didn’t. Our two-day total for Module B is 79 micro-sessions.

| Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total |
|---|---|---|---|---|---|
| A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90 |
| B | 6 minutes (1 bug yesterday) | 10 minutes (1 test, 1 bug) | 74 minutes (37 tests) | 38 | 79 |

Team C stayed late last night. Very late. They felt they had to. Yesterday we found eight bugs, and they decided to stay at work and fix them. (Perhaps this is why their code has so many problems; they don’t get enough sleep, and produce more bugs, which means they have to stay late again, which means even less sleep…) In any case, they’ve delivered us all eight fixes, and we start our session this afternoon by verifying them. Eight fix verifications at six minutes each amounts to 48 minutes. So far as obtaining new coverage goes, today’s 90-minute session with Module C is pretty much hosed before it even starts; 48 minutes—more than half of the session—is taken up by fix verifications, right from the get-go. We have 42 minutes left in which to run new micro-sessions, those little two-minute slabs of test time that give us some equivalent measure of coverage. Yesterday’s trends continue for Team C too, and we discover four problems that require investigation and reporting. That takes 40 of the remaining 42 minutes. Somewhere in there, we spend two minutes of testing that doesn’t find a bug. So today’s results look like this:

| Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total |
|---|---|---|---|---|---|
| A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90 |
| B | 6 minutes (1 bug yesterday) | 10 minutes (1 test, 1 bug) | 74 minutes (37 tests) | 38 | 79 |
| C | 48 minutes (8 bugs yesterday) | 40 minutes (4 tests, 4 bugs) | 2 minutes (1 test) | 5 | 18 |

Over two days, we’ve been able to obtain only 20% of the test coverage for Module C that we’ve been able to obtain for Module A. We’re still at less than 1/4 of the coverage that we’ve been able to obtain for Module B.
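If you like to tinker, here is a small Python sketch of my own (not part of the original thought experiment) that reproduces the arithmetic in the tables above. The session length, micro-session cost, bug overhead, and fix-verification cost are the assumptions stated in these two posts; the function name and structure are just illustrative bookkeeping.

```python
# A sketch of the model in these two posts: 90-minute sessions, two-minute
# micro-sessions ("tests"), eight extra minutes to investigate and report each
# bug found, and six minutes to verify each fix. Only the numbers come from the
# thought experiment; everything else here is mine.

SESSION_MINUTES = 90
TEST_MINUTES = 2
BUG_OVERHEAD_MINUTES = 8        # investigation and reporting, on top of the test
FIX_VERIFICATION_MINUTES = 6

def session_results(bugs_found, fixes_to_verify=0):
    """New micro-sessions completed in one 90-minute session."""
    minutes = SESSION_MINUTES - fixes_to_verify * FIX_VERIFICATION_MINUTES
    minutes -= bugs_found * (TEST_MINUTES + BUG_OVERHEAD_MINUTES)
    return bugs_found + minutes // TEST_MINUTES

# Day two, as in the table above:
print(session_results(0, 0))    # Module A: 45
print(session_results(1, 1))    # Module B: 38
print(session_results(4, 8))    # Module C: 5
```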

Yesterday, we learned one lesson:

Lots of bugs means reduced coverage, or slower testing, or both.

From today’s results, here’s a second:

Finding bugs today means verifying fixes later, which means even less coverage or even slower testing, or both.

So why is testing taking so long? One of the biggest reasons might be this:

Testing is taking longer than we might have expected or hoped because, although we’ve budgeted time for testing, we lumped into it the time for investigating and reporting problems that we didn’t expect to find.

Or, more generally,

Testing is taking longer than we might have expected or hoped because we have a faulty model of what testing is and how it proceeds.

For managers who ask “Why is testing taking so long?”, it’s often the case that their model of testing doesn’t incorporate the influence of things outside the testers’ control. Over two days of testing, the difference between the quality of Team A’s code and Team C’s code has a profound impact on the amount of uninterrupted test design and execution work we’re able to do. The bugs in Module C present interruptions to coverage, such that (in this very simplified model) we’re able to spend only one-fifth of our test time designing and executing tests. After the first day, we were already way behind; after two days, we’re even further behind. And even here, we’re being optimistic. With a team like Team C, how many of those fixes will be perfect, revealing no further problems and taking no further investigation and reporting time?

And again, those faulty management models will lead to distortion or dysfunction. If the quality of testing is measured by bugs found, then anyone testing Module C will look great, and people testing Module A will look terrible. But if the quality of testing is evaluated by coverage, then the Module A people will look sensational and the Module C people will be on the firing line. But remember, the differences in results here have nothing to do with the quality of the testing, and everything to do with the quality of what is being tested.

There’s a psychological factor at work, too. If our approach to testing is confirmatory, with steps to follow and expected, predicted results, we’ll design our testing around the idea that the product should do this, and that it should behave thus and so, and that testing will proceed in a predictable fashion. If that’s the case, it’s possible—probable, in my view—that we will bias ourselves towards the expected and away from the unexpected. If our approach to testing is exploratory, perhaps we’ll start from the presumption that, to a great degree, we don’t know what we’re going to find. As much as managers, hack statisticians, and process enthusiasts would like to make testing and bug-finding predictable, people don’t know how to do that such that the predictions stand up to human variability and the complexity of the world we live in. Plus, if you can predict a problem, why wait for testing to find it? If you can really predict it, do something about it now. If you don’t have the ability to do that, you’re just playing with numbers.

Now: note again that this has been a thought experiment. For simplicity’s sake, I’ve made some significant distortions and left out an enormous amount of what testing is really like in practice.

  • I’ve treated testing activities as compartmentalized chunks of two minutes apiece, treading dangerously close to the unhelpful and misleading model of testing as development and execution of test cases.
  • I haven’t looked at the role of setup time and its impact on test design and execution.
  • I haven’t looked at the messy reality of having to wait for a product that isn’t building properly.
  • I haven’t included the time that testers spend waiting for fixes.
  • I haven’t included the delays associated with bugs that block our ability to test and obtain coverage of the code behind them.
  • I’ve deliberately ignored the complexity of the code.
  • I’ve left out difficulties in learning about the business domain.
  • I’ve made highly simplistic assumptions about the quality and relevance of the testing, the quality and relevance of the bug reports, the skill of the testers in finding and reporting bugs, and so forth.
  • And I’ve left out the fact that, as important as skill is, luck always plays a role in finding problems.

My goal was simply to show this:

Problems in a product have a huge impact on our ability to obtain test coverage of that product.

The trouble is that even this fairly simple observation is below the level of visibility of many managers. Why is it that so many managers fail to notice it?

One reason, I think, is that they’re used to seeing linear processes instead of organic ones, a problem that Jerry Weinberg describes in Becoming a Technical Leader. Linear models “assume that observers have a perfect understanding of the task,” as Jerry says. But software development isn’t like that at all, and it can’t be. By its nature, software development is about dealing with things that we haven’t dealt with before (otherwise there would be no need to develop a new product; we’d just reuse the one we had). We’re always dealing with the novel, the uncertain, the untried, and the untested, so our observation is bound to be imperfect. If we fail to recognize that, we won’t be able to improve the quality and value of our work.

What’s worse about managers with a linear model of development and testing is that “they filter out innovations that the observer hasn’t seen before or doesn’t understand” (again, from Becoming a Technical Leader). As an antidote for such managers, I’d recommend Perfect Software, and Other Illusions About Testing and Lessons Learned in Software Testing as primers. But mostly I’d suggest that they observe the work of testing. In order to do that well, they may need some help from us, and that means that we need to observe the work of testing too. So over the next little while, I’ll be talking more than usual about Session-Based Test Management, developed initially by James and Jon Bach, which is a powerful set of ideas, tools and processes that aid in observing and managing testing.

Why Is Testing Taking So Long? (Part 1)

Tuesday, November 24th, 2009

If you’re a tester, you’ve probably been asked, “Why is testing taking so long?” Maybe you’ve had a ready answer; maybe you haven’t. Here’s a model that might help you deal with the kind of manager who asks such questions.

Let’s suppose that we divide our day of testing into three sessions, each session being, on average, 90 minutes of chartered, uninterrupted testing time. That’s four and a half hours of testing, which seems reasonable in an eight-hour day interrupted by meetings, planning sessions, working with programmers, debriefings, training, email, conversations, administrivia of various kinds, lunch time, and breaks.

The reason that we’re testing is that we want to obtain coverage; that is, we want to ask and answer questions about the product and its elements to the greatest extent that we can. Asking and answering questions is the process of test design and execution. So let’s further assume that we break each session into micro-sessions averaging two minutes each, in which we perform some test activity that’s focused on a particular testing question, or on evaluating a particular feature. That means that in a 90-minute session, we can theoretically perform 45 of these little micro-sessions, which for the sake of brevity we’ll informally call “tests”. Of course life doesn’t really work this way; a test idea might take a couple of seconds to implement, or it might take all day. But I’m modeling here, making this rather gross simplification to clarify a more complex set of dynamics. (Note that if you’d like to take a really impoverished view of what happens in skilled testing, you could say that a “test case” takes two minutes. But I leave it to my colleague James Bach to explain why you should question the concept of test cases.)

Let’s further suppose that we’ll find problems every now and again, which means that we have to do bug investigation and reporting. This is valuable work for the development team, but it takes time that interrupts test design and execution—the stuff that yields test coverage. Let’s say that, for each bug that we find, we must spend an extra eight minutes investigating it and preparing a report. Again, this is a pretty dramatic simplification. Investigating a bug might take all day, and preparing a good report could take time on the order of hours. Some bugs (think typos and spelling errors in the UI) leap out at us and don’t call for much investigation, so they’ll take less than eight minutes. Even though eight minutes is probably a dramatic underestimate for investigation and reporting, let’s go with that. So a test activity that doesn’t find a problem costs us two minutes, and a test activity that does find a problem takes ten minutes.
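For readers who like to see the arithmetic spelled out, here is a tiny Python sketch of my own (not from the post itself) that applies those two numbers: two minutes per micro-session, plus eight minutes of investigation and reporting for each bug found.

```python
# A sketch of the cost model described above: each micro-session ("test") takes
# two minutes, and each bug found adds eight minutes of investigation and
# reporting. The function is mine; the numbers are the post's assumptions.

def tests_in_session(bugs_found, session_minutes=90,
                     test_minutes=2, bug_overhead_minutes=8):
    """Micro-sessions completed in one session, given the number of bugs found."""
    spent_on_bug_finding_tests = bugs_found * (test_minutes + bug_overhead_minutes)
    clean_tests = (session_minutes - spent_on_bug_finding_tests) // test_minutes
    return bugs_found + clean_tests

for bugs in (0, 1, 8):                     # Modules A, B, and C, as it turns out
    print(bugs, tests_in_session(bugs))    # -> 0 45, 1 41, 8 13
```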

Now, let’s imagine one more thing: we have perfect testing prowess; if there’s a problem in an area that we’re testing, we’ll find it, and we’ll never enter a bogus report, either. Yes, this is a thought experiment.

One day we come into work, and we’re given three modules to test.

The morning session is taken up with Module A, from Development Team A. These people are amazing, hyper-competent. They use test-first programming, and test-driven design. They work closely with us, the testers, to design challenging unit checks, scriptable interfaces, and log files. They use pair programming, and they review and critique each other’s work in an egoless way. They refactor mercilessly, and run suites of automated checks before checking in code. They brush their teeth and floss after every meal; they’re wonderful. We test their work diligently, but it’s really a formality because they’ve been testing and we’ve been helping them test all along. In our 90-minute testing session, we don’t find any problems. That means that we’ve performed 45 micro-sessions, and have therefore obtained 45 units of test coverage.

| Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests |
|---|---|---|---|
| A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 |

The first thing after lunch, we have a look at Team B’s module. These people are very diligent indeed. Most organizations would be delighted to have them on board. Like Team A, they use test-first programming and TDD, they review carefully, they pair, and they collaborate with testers. But they’re human. When we test their stuff, we find a bug very occasionally; let’s say once per session. The test that finds the bug takes two minutes; investigation and reporting of it takes a further eight minutes. That’s ten minutes altogether. The rest of the time, we don’t find any problems, so that leaves us 80 minutes in which we can run 40 tests. Let’s compare that with this morning’s results.

| Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests |
|---|---|---|---|
| A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 |
| B | 10 minutes (1 test, 1 bug) | 80 minutes (40 tests) | 41 |

After the afternoon coffee break, we move on to Team C’s module. Frankly, it’s a mess. Team C is made up of nice people with the best of intentions, but sadly they’re not very capable. They don’t work with us at all, and they don’t test their stuff on their own, either. There’s no pairing, no review, in Team C. To Team C, if it compiles, it’s ready for the testers. The module is a dog’s breakfast, and we find bugs practically everywhere. Let’s say we find eight in our 90-minute session. Each test that finds a problem costs us ten minutes, so we spend 80 minutes on those eight bugs. Every now and again, we happen to run a test that doesn’t find a problem; in the ten minutes left over, we manage five of those. (Hey, even dBase IV occasionally did something right.) Our results for the day now look like this:

| Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests |
|---|---|---|---|
| A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 |
| B | 10 minutes (1 test, 1 bug) | 80 minutes (40 tests) | 41 |
| C | 80 minutes (8 tests, 8 bugs) | 10 minutes (5 tests) | 13 |

Because of all the bugs, Module C allows us to perform thirteen micro-sessions in 90 minutes. Thirteen, where with the other modules we managed 45 and 41. Because we’ve been investigating and reporting bugs, there are 32 micro-sessions, 32 units of coverage, that we haven’t been able to obtain on this module. If we decide that we need to perform that testing (and the module’s overall badness is consistent throughout), we’re going to need at least three more sessions to cover it. Alternatively, we could stop testing now, but what are the chances of a serious problem lurking in the parts of the module we haven’t covered? So, the first thing to observe here is:
Lots of bugs means reduced coverage, or slower testing, or both.

There’s something else that’s interesting, too. If we are being measured based on the number of bugs we find (exactly the sort of measurement that will be taken by managers who don’t understand testing), Team A makes us look awful—we’re not finding any bugs in their stuff. Meanwhile, Team C makes us look great in the eyes of management. We’re finding lots of bugs! That’s good! How could that be bad?

On the other hand, if we’re being measured based on the test coverage we obtain in a day (which is exactly the sort of measurement that will be taken by managers who count test cases; that is, managers who probably have an even more damaging model of testing than the managers in the last paragraph), Team C makes us look terrible. “You’re not getting enough done! You could have performed 45 test cases today on Module C, and you’ve only done 13!” And yet, remember that in our scenario we started with the assumption that, no matter what the module, we always find a problem if there’s one there. That is, there’s no difference between the testers or the testing for each of the three modules; it’s solely the condition of the product that makes all the difference.

This is the first in a pair of posts. Let’s see what happens tomorrow.

“Merely” Checking or “Merely” Testing

Tuesday, November 10th, 2009

The distinction between testing and checking got a big boost recently from James Bach at the Øredev conference in Malmö, Sweden. But a recent tweet by Brian Marick and a recent conversation with a colleague have highlighted an issue that I should probably address.

My colleague suggested that somehow I may have underplayed the significance or importance or the worth of checking. Brian’s tweet said,

“I think the trendy distinction between ‘testing’ and ‘checking’ is a power play: which would you preface with ‘mere’?” http://bit.ly/2Cuyj

As a consequence, I was worried that I might have said “mere checking” or “merely checking” in one of my blog postings or on Twitter, so I researched it. Apparently I had not; that was a relief. However, the fact that I was suspicious even of myself suggests that maybe I need to clarify something.

The distinction between testing and checking is a power play, but it’s not a power play between (say) testers and programmers. It’s a power play over the glorification of mechanizable assertions at the expense of human intelligence. It’s a power play between sapient and non-sapient actions.

Recall that the action of a check has three parts to it. Part one is an observation of a product. Part two is a decision rule, by which we can compare that empirical observation of the product with an idea that someone had about it. Part three is the setting of a bit (pass or fail, yes or no, true or false) that represents the non-sapient application of both the observation and the decision rule. Note, too, that this means that a check can be performed by one of two agencies: a machine, or a sufficiently disengaged human; that is, a human who has been scripted to behave like a machine, and who has for whatever reason accepted that assignment.
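To make that anatomy concrete, here is a minimal sketch in Python. The shopping-cart example, the function names, and the expected value are all invented for illustration; the point is the three parts, and how little of the surrounding thinking they capture.

```python
# A minimal, invented illustration of the three parts of a check. The "product"
# here is a stand-in function; in real life it would be the application under test.

def cart_total(prices):
    return sum(prices)                         # the product behaviour being observed

def check_cart_total(prices, expected_total):
    observation = cart_total(prices)           # part one: an observation of the product
    verdict = (observation == expected_total)  # part two: apply the decision rule
    return verdict                             # part three: set a bit (pass or fail)

print(check_cart_total([5, 10, 15], 30))       # True. Deciding that 30 is the right
                                               # expectation, and what to do when the
                                               # bit comes back False, is sapient work.
```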

So checks can be hugely important. Checks are a means by which a programmer, engaged in test-driven development, checks his idea. Creating the check and analyzing its result are both testing activities. Checks are a valuable product (a by-product, some would say) of test-driven development. Checks are change detectors, tools that allow programmers to refactor with confidence. Checks built into continuous integration are mechanisms to make sure that our builds can work well enough to be tested—or, if we’re confident enough in the prior quality of our testing, can work well enough to be deployed. Checks tend to shorten the loop between the implementation of an idea and the discovery of a problem that the checks can detect, since the checks are typically designed and run (a lot, iteratively) by the person doing the implementation. Checks tend to speed up certain aspects of the post-programmer testing of the product, since good checks will find the kind of dopey, embarrassing errors that even the best programmers can make from time to time. The need for checks sometimes (alas, not always) prompts us to create interfaces that can be used by programmers or testers to aid in later exploration.

Checking represents the rediscovery of techniques that were around at least as early as 1957: “The first attack on the checkout problem may be made before coding has begun.” (D. D. McCracken, Digital Computer Programming, 1957; thanks to Ben Simo for inspiring me to purchase a copy of this book.) In 2007, I had dinner with Jerry Weinberg and Josh Kerievsky. Josh asked Jerry if he did a lot of unit testing back in the day. Jerry practically did a spit-take, saying “Yes, of course. Computer time was hugely expensive, but we programmers were cheap. Getting the program right was really important, so we had to test a lot.” Then he added something that hadn’t occurred to me. “There was another reason, too. Apart from everything else, we tested because the machinery was so unreliable. We’d run a test program, then run the program we wrote, then run the test program again to make sure that we got the same result the second time. We had to make sure that no tubes had blown out.”

So, in those senses, checking rocks. Checking has always rocked. It seems that in some places people forgot how much it rocks, and the Agilists have rediscovered it.

Yet it’s important to note that checks on their own don’t deliver value unless there’s sapient engagement with them. What do I mean by that?

As James Bach says here, “A sapient process is any process that relies on skilled humans.” Sapience is the capacity to act with human intelligence, human judgment, and some degree of human wisdom.

It takes sapience to recognize the need for a check—a risk, or a potential vulnerability. It takes sapience—testing skill—to express that need in terms of a test idea. It takes sapience—more test design skill—to express that test idea in terms of a question that we could ask about the program. Sapience—in terms of testing skill, and probably some programming skill—is needed to frame that question as a yes-or-no, true-or-false, pass-or-fail question. Sapience, in the form of programming skill, is required to turn that question into executable code that can implement the check (or, far more expensively and with less value, into a test script for execution by a non-sapient human). We need sapience—testing skill again—to identify an event or condition that would trigger some agency to perform the check. We need sapience—programming skill again—to encode that trigger into executable code so that the process can be automated.

Sapience disappears while the check is being performed. By definition, the observation, the decision rule, and the setting of the bit all happen without the cognitive engagement of a skilled human.

Once the check has been performed, though, skill comes back into the picture for reporting. Checks are rarely done on their own, so they must be aggregated. The aggregation is typically handled by the application of programming skill. To make the outcome of the check observable, the aggregated results must be turned into a human-readable report of some kind, which requires both testing and programming skill. The human observation of the report, intake, is by definition a sapient process. Then comes interpretation. The human ascribes meaning to the various parts of the report, which requires skills of testing and of critical thinking. The human ascribes significance to the meaning, which again takes testing and critical thinking skill. Sapient activity by someone—a tester, a programmer, or a product owner—is needed to determine the response. Upon deciding on significance, more sapient action is required—fixing the application being checked (by the production programmer); fixing or updating the check (by the person who designed or programmed the check); adding a new check (by whoever might want to do so); or getting rid of the check (by one or more people who matter, and who have decided that the check is no longer relevant).

So: the check in and of itself is relatively trivial. It’s all that stuff around the check—the testing and programming and analysis activity—that’s important, supremely important. And as is usual with important stuff, there are potential traps.

The first trap is that it might be easy to do any of the sapient aspects of checking badly. Since the checks are at their core software, there might be problems in requirements, design, coding, or interpretation, just as there might be with any software.

The second trap is that it can be easy to fall asleep somewhere between the reporting and interpretation stages of the checking process. The green bar tells us that All Is Well, but we must be careful about that. “All is well with respect to the checks that we’ve programmed” is a very different statement. Red tends to get our attention, but green is an addictive and narcotic colour. A passing test is another White Swan, confirmation of our existing beliefs, proof by induction. Now, we can’t live without proof by induction, but induction can’t alert us to new problems. Millions of tests, repeated thousands of times, don’t tell us about the bugs that elude them. We need to bump into only one Black Swan for a devastating effect.

The third trap is that we might believe that checking a program is all there is to testing it. Checking done well incorporates an enormous amount of testing and programming skill, but some quality attributes of a program are not machine-decidable. Checks are the kinds of tests that aren’t vulnerable to the halting problem. Someone on a mailing list once said, “Once all the (automated) acceptance tests pass (that is, all the checks), we know we’re done.” I liked Joe Rainsberger‘s reply: “No, you’re not done; you’re ready to give it to a real tester to kick the snot out of it.” That kicking is usually expressed with greater emphasis on exploration, discovery, and investigation, and rather less on confirmation, verification, and validation.

The fourth trap is a close cousin of the third: at certain points, we might pay undue attention to the value of checking with respect to its cost. Cost vs. value is a dominating problem with any kind of testing, of course. One of the reasons that the Agile emphasis on testing remains exciting is that excellent checking lowers the cost of testing, and both help to defend the value of the program. Yet checks may not be Just The Thing for some purposes. Joe has expressed concerns in his series Integrated Tests are a Scam, and Brian Marick did too, a while ago, in An Alternative to Business-Facing TDD. I think they’re both making important points here, thinking of checks as a means to an end, rather than as a fetish.

Fifth: upon noting the previous four traps (and others), we might be tempted to diminish the value of checking. That would be a mistake. Pretty much any program is made more testable by someone removing problems before someone else sees them. Every bug or issue that we find could trigger investigation, reporting, fixing, and retesting, and that gives other (and potentially more serious) problems time to hide. Checking helps to prevent those unhappy discoveries. Excellent checking (which incorporates excellent testing) will tend to reduce the number of problems in the product at any given time, and thereby results in a more testable program. James Bach points out that a good manual test could never be automated (he’d say “sapient” now, I believe). But note that in that same post he says that “if you can truly automate a manual test, it couldn’t have been a good manual test”, and “if you have a great automated test, it’s not the same as the manual test that you believe you were automating”. The point is that there are such things as great automated tests, and some of them might be checks.

So the power play is over which we’re going to value: the checks (“we have 50,000 automated tests”) or the checking. Mere checks aren’t important; but checking—the activity required to build, maintain, and analyze the checks—is. To paraphrase Eisenhower, with respect to checking, the checks are nothing; the checking is everything. Yet the checking isn’t everything; neither is the testing. They’re both important, and to me, neither can be appropriately preceded with “mere”, or “merely”.

There’s one exception, though: if you’re only doing one or the other, it might be important to say, “You’ve merely been testing the program; wouldn’t you be better off checking it, too?” or “That program hasn’t been tested; it’s merely been checked.”

See more on testing vs. checking.

Testing, Checking, and Convincing the Boss to Explore

Tuesday, November 10th, 2009

How is it useful to make the distinction between testing and checking? One colleague (let’s call him Andrew) recently found it very useful indeed. I’ve been asked not to reveal his real name or his company, but he has very generously permitted me to tell this story.

He works for a large, globally distributed company, which produces goods and services in a sector not always known for its nimbleness. He’s been a test manager with the company for about 10 years. He’s had a number of senior managers who have allowed him and his team to take an exploratory approach, almost a skunkworks inside the larger organization. Rather than depending on process manuals and paperwork, he manages by direct interaction and conversation. He hires bright people, trains them, and grants them a fairly high degree of autonomy, balanced by frequent check-ins.

Recently, on a Thursday, the relatively new CEO came to town and held an all-hands meeting for Andrew’s division. Andrew was impressed; the CEO seemed genuinely interested in cutting bureaucracy and making the organization more flexible, adaptable, and responsive to change. After the CEO’s remarks, there was a question-and-answer period. Andrew asked if the company would be doing anything to make testing more effective and more efficient. The CEO seemed curious about that, and jotted down a note on a piece of paper. Andrew was given the mandate of following up with the VP responsible for that area.

Late that afternoon, Andrew called me. We chatted for a while on the phone. He hadn’t read my series on testing vs. checking, but he seemed intrigued. I suggested that he read it, and that we get together and talk about it.

As luck would have it, there was occasion to bring a few more people into the picture. That weekend, we had a timely conversation with Fiona Charles, who reminded us to focus on the issue of risk. Rob Sabourin happened to be visiting on Saturday evening, so he, Andrew, and I sat down to compose a letter to the VP. Aside from changing the names that would identify the parties involved, this is an unedited version of what we came up with:

[Our Letter]

Dear [Madam VP]…

[Mr. CEO] asked me to send you this email as a follow up to a question that I posed during his recent trip to the [OurTown] office on [SomeDate] on the current state of the testing at [OurCompany] and how our testing effectiveness should be improved.

The [OurTown]-based [OurDivision] test team has been very successful in finding serious issues with our products with a fairly small test team using exploratory test approaches. As an example, a couple of weeks ago one of my testers found a critical error in an emergency fix within his two days of exploratory testing in a load that had passed four person-weeks of regression testing (scripted checking) by another team.

Last week a Project Lead called me and asked if my team could perform a regression sweep on a third party delivery. I replied that we could provide the requested coverage with two person-days of effort without disrupting our other commitments. He seemed surprised and delighted. He had come to us because [OurCompany]’s typical approach yielded a four-to-six person-week effort which would have caused a delay in the project’s subsequent release.

Our experience using exploratory testing in [OurDivision] has demonstrated improved flexibility and adaptability to respond to rapid changes in priorities.

Testing is not checking. Checking is a process of confirmation, validation, and verification in which we are comparing an output to a predicted and expected result. Testing is something more than that. Testing is a process of exploration, discovery, investigation, learning, and analysis with the goal of gathering information about risks, vulnerabilities, and threats to the value of the product. The current effectiveness of many groups’ automated scripts is quite excellent, yet without supplementing these checks with “brain-engaged” human testing we run the risk of serious problems in the field impacting our customers, and the consequential bad press that follows these critical events.

At [OurCompany] much of our “testing” is focused on checking. This has served us fairly well, but there are many important reasons for broadening the focus of our current approach. While checking is very important, it is vulnerable to the “pesticide paradox”. Just as bacteria develop resistance to antibiotics, software bugs are capable of avoiding detection by existing tests (checks), whether executed once or repeated over and over. In order to reduce our vulnerability to field issues and critical customer incidents, we must supplement our existing emphasis on scripted tests (both manual and automated) with an active search for new problems and new risks.

There are several strong reasons for integrating exploratory approaches into our current development and testing practices:

  • Scripted tests are perceived to be important for compliance with [regulatory] requirements. They are focused on being repeatable and defensible. Mere compliance is insufficient—we need our products to work.
  • Scripted checks take time and effort to design and prepare, whether they are run by machines or by humans. We should focus on reducing preparation cost wherever possible and reallocating that effort to more valuable pursuits.
  • Scripted checks take far more time and effort to execute when performed by a human than when performed by a machine. For scripted checks, machine execution is recommended over human execution, allowing more time for both human interaction with the product, and consequent observation and evaluation.
  • Exploratory tests take advantage of the human capacity for recognizing new risks and problems.
  • Exploratory testing is highly credible and accountable when done well by trained testers. The findings of exploratory tests are rich, risk-focused, and value-centered, revealing far more knowledge about the system than simple pass/fail results.

The quality of exploratory testing is based upon the skill set and the mindset of the individual tester. Therefore, I recommend that testers and managers across the organization be trained in the structures and disciplines of excellent exploratory testing. As teams become trained, we should systematically introduce exploratory sessions into the existing testing processes, observing and evaluating the results obtained from each approach.

I have been actively involved in improving testing, in general, outside of [OurCompany]. I am on the board of a testing association and I have been attending, organizing and facilitating meetings of testers for many years.

During this time, I have been exposed to much of the latest developments in software testing and I have led the implementation of Session Based Exploratory Testing within my department. In addition, over the past four years, I have been providing instruction in software testing both to the testers within my business unit and to companies outside of [OurCompany].

I look forward to the opportunity to talk with you about this further.

[/Our Letter]

Now, I thought that was pretty strong. But the response was far more gratifying than I expected. Andrew sent the message on Sunday afternoon. The VP responded by 8:45am on Monday morning. Her reply was in my mailbox before 10:00am. The reply read:

[The VP’s Reply]

Dear Andrew

Thanks very much for the email. I find this very intriguing! I believe the distinction you make between testing and checking is quite insightful and I would like to connect with you to see how we can build these concepts and techniques into our quality management services as well as my central team verification tests. I will get a call together with [Mr. Bigwig] and [Mr. OtherBigwig] so that we can figure out the best way to incorporate your ideas. Again, many thanks!!

[/The VP’s Reply]

A couple of key points:

  • The letter was much stronger thanks to collaboration. Any one of the four of us could have written a good letter; the result was better than any of us could have done on our own.
  • The letter is sticky, in the sense that Chip and Dan Heath talk about in their book Made to Stick: Why Some Ideas Survive and Others Die. It’s not a profound book, but it contains some useful points to ponder. The letter starts with two stories that are simple, unexpected, concrete, credible, and emotional (remember, the Project Lead was surprised and delighted). Those initials can be rearranged to SUCCES, which is the mnemonic that the Heaths use for successful communication.
  • The testing vs. checking distinction is simpler and more memorable than “exploratory approaches” vs. “confirmatory scripted approaches”. The explanation is available (and in most cases necessary), but “testing” and “checking” roll off the tongue quickly after the explanation has been absorbed.
  • We managed to hit some of the most important aspects of good testing: cost vs. value, risk focus, diversification of approaches, flexibility and adaptability, and rapid service to the larger organization.

After reviewing this post, Andrew said, “I like the post a lot. Let’s hope we end up helping a lot of people with it.” Amen. You are, of course, welcome to use this letter as a point of departure for your own letter to the bigwigs. If you’d like help, please feel free to drop me a line.

See more on testing vs. checking, but especially this.