Blog Posts for the ‘Done’ Category

Drop the Crutches

Thursday, January 5th, 2017

This post is adapted from a recent blast of tweets. You may find answers to some of your questions in the links; as usual, questions and comments are welcome.

Update, 2017-01-07: In response to a couple of people asking, here’s how I’m thinking of “test case” for the purposes of this post: Test cases are formally structured, specific, proceduralized, explicit, documented, and largely confirmatory test ideas. And, often, excessively so. My concern here is directly proportional to the degree to which a given test case or a given test strategy emphasizes these things.

I had a fun chat with a client/colleague yesterday. He proposed—and I agreed—that test cases are like crutches. I added that the crutches are regularly foisted on people who weren’t limping to start with. It’s as though before the soccer game begins, we hand all the players a crutch. The crutches then hobble them.

We also agreed that test cases often lead to goal displacement. Instead of a thorough investigation of the product, the goal morphs into “finish the test cases!” Managers are inclined to ask “How’s the testing going?” But they usually don’t mean that. Instead, they almost certainly mean “How’s the product doing?” But, it seems to me, testers often interpret “How’s the testing going?” as “Are you done those test cases?”, which ramps up the goal displacement.

Of course, “How’s the testing going?” is an important part of the three-part testing story, especially if problems in the product or project are preventing us from learning more deeply about the product. But most of the time, that’s probably not the part of the story we want to lead with. In my experience, both as a program manager and as a tester, managers want to know one thing above all:

Are there problems that threaten the on-time, successful completion of the project?

The most successful and respected testers—in my experience—are the ones that answer that question by actively investigating the product and telling the story of what they’ve found. The testers that overfocus on test cases distract themselves AND their teams and managers from that investigation, and from the problems investigation would reveal.

For a tester, there’s nothing wrong with checking quickly to see that the product can do something—but there’s not much right—or interesting—about it either. Checking seems to me to be a reasonably good thing to work into your programming practice; checks can be excellent alerts to unwanted low-level changes. But when you’re testing, showing that the product can work—essentially, demonstration—is different from investigating and experimenting to find out how it does (or doesn’t) work in a variety of circumstances and conditions. Sometimes people object, saying that they have to confirm that the product works and that they don’t have time to investigate. To me, that’s getting things backwards. If you actively, vigorously look for problems and don’t find them, you’ll get that confirmation you crave, as a happy side effect.
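
To make the distinction concrete, here’s a rough sketch of what I mean by a check, built around a hypothetical parse_amount function (invented for this illustration):

```python
# A rough sketch of a low-level check: it confirms one specific, anticipated
# behaviour, and can alert us to an unwanted change in that behaviour.
# parse_amount is a hypothetical function, invented for this illustration.

def parse_amount(text):
    """Convert a currency string like '$1,234.56' into a float."""
    return float(text.replace("$", "").replace(",", ""))

def check_parse_amount():
    assert parse_amount("$1,234.56") == 1234.56
    assert parse_amount("0.99") == 0.99
    print("parse_amount checks passed")

if __name__ == "__main__":
    check_parse_amount()
```

A check like this demonstrates that the function can work for those two inputs, and it will alert us if a change breaks them. It says nothing about negative amounts, other currencies, locales, or malformed input; finding out about those is investigation.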

No matter what, you must prepare yourself to realize this:

Nobody can be relied upon to anticipate all of the problems that can beset a non-trivial product.

We call it “development” for a reason. The product and everything around it, including the requirements and the test strategy, do not arrive fully-formed. We continuously refine what we know about the product, and how to test it, and what the requirements really are, and all of those things feed back into each other. Things are revealed to us as we go, not as a cascade of boxes on a process diagram, but more like a fractal.

The idea that we could know entirely what the requirements are before we’ve discussed and decided we’re done seems like total hubris to me. We humans have a poor track record in understanding and expressing exactly what we want. We’re no better at predicting the future. Deciding today what will make us happy ten months—or even days—from now combines both of those weaknesses and multiplies them.

For that reason, it seems to me that any hard or overly specific “Definition of Done” is antithetical to real agility. Let’s embrace unpredictability, learning, and change, and treat “Definition of Done” as a very unreliable heuristic. Better yet, consider a Definition of Not Done Yet: “we’re probably not done until at least These Things are done”. The “at least” part of DoNDY affords the possibility that we may recognize or discover important requirements along the way. And who knows?—we may at any time decide that we’re okay with dropping something from our DoNDY too. Maybe the only thing we can really depend upon is The Unsettling Rule.

Test cases—almost always prepared in advance of an actual test—are highly vulnerable to a constantly shifting landscape. They get old. And they pile up. There usually isn’t a lot of time to revisit them. But there’s typically little need to revisit many of them either. Many test cases lose relevance as the product changes or as it stabilizes.

Many people seem prone to say “We have to run a bunch of old test cases because we don’t know how changes to the code are affecting our product!” If you have lost your capacity to comprehend the product, why believe that you still comprehend those test cases? Why believe that they’re still relevant?

Therefore: just as you (appropriately) remain skeptical about the product, remain skeptical of your test ideas—especially test cases. Since requirements, products, and test ideas are subject to both gradual and explosive change, don’t overformalize or otherwise constrain your testing to stuff that you’ve already anticipated. You WILL learn as you go.

Instead of overfocusing on test cases and worrying about completing them, focus on risk. Ask “How might some person suffer loss, harm, annoyance, or diminished value?” Then learn about the product, the technologies, and the people around it. Map those things out. Don’t feel obliged to be overly or prematurely specific; recognize that your map won’t perfectly match the territory, and that that’s okay—and it might even be a Good Thing. Seek coverage of risks and interesting conditions. Design your test ideas and prepare to test in ways that allow you to embrace change and adapt to it. Explain what you’ve learned.

Do all that, and you’ll find yourself throwing away the crutches that you never needed anyway. You’ll provide a more valuable service to your client and to your team. You and your testing will remain relevant.

Happy New Year.

Further reading:

Testing By Percentages
Very Short Blog Posts (11): Passing Test Cases
A Test is a Performance
“Test Cases Are Not Testing: Toward a Culture of Test Performance” by James Bach & Aaron Hodder (in http://www.testingcircus.com/documents/TestingTrapeze-2014-February.pdf#page=31)

What Exploratory Testing Is Not (Part 2): After-Everything-Else Testing

Friday, December 16th, 2011

Exploratory testing is not “after-everything-else-is-done” testing. Exploratory testing can (and does) take place at any stage of testing or development.

Indeed, TDD (test-driven development) is a form of exploratory development. TDD happens in loops, in which the programmer develops a check, then develops the code to make the check pass (along with all of the previous checks), then fixes any problems that she has discovered, and then loops back to implementing a new bit of behaviour and inventing a new check. The information obtained from each loop feeds into the next; and the activity is guided and structured by the person or people involved in the moment, rather than in advance. The checks themselves are scripted, but the activity required to produce them and to analyze the results is not. Compared to the complex cognitive activity—exploratory, iterative—that’s going on as code is being developed, the checks themselves—scripted, linear—are trivial.
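
As a rough illustration (the leading_zeros function here is hypothetical, invented for this sketch), one loop of TDD might leave behind a scripted artifact like this; the checks are linear and mechanical, but the thinking that produced them was not:

```python
import unittest

# One loop of TDD, sketched after the fact. The checks below are scripted and
# linear; deciding what to check next, and what a failure means, is not.

def leading_zeros(n, width):
    """Format a non-negative integer, padded with zeros to the given width."""
    return str(n).zfill(width)

class LeadingZerosChecks(unittest.TestCase):
    def test_pads_short_numbers(self):
        self.assertEqual(leading_zeros(7, 3), "007")

    def test_leaves_long_numbers_alone(self):
        self.assertEqual(leading_zeros(1234, 3), "1234")

if __name__ == "__main__":
    unittest.main()
```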

Requirement review is an exploratory activity too. Review of requirements (or specifications, or user stories, or examples) tends to happen early on in a development cycle, whether it’s a long or a short cycle. While review might be guided by checklists, the people involved in the activity are making decisions on the fly as they go through loops of design, investigation, discovery, and learning. The outcome of each loop feeds back into the next activity, often immediately.

Code review can also be done in a scripted way or an exploratory way. When humans analyze the code, it’s an unscripted, self-directed activity that happens in loops; so it is exploratory. We call it review, but it’s gathering information with the intention of informing a decision; so it is testing. There is a way to review code that involves the application of scripted processes, via tools that people generally call “static testing tools”. When a machine parses code and produces a report, by definition it’s a form of checking, and it’s scripted. Yet using those tools productively requires a great deal of exploratory activity. Parsing and interpreting the report and responding to it is polymorphic, human action—unscripted, open-ended, iterative, and therefore exploratory.
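
As a rough sketch of such a scripted, machine-executed review (the rule here is arbitrary, chosen only for illustration), consider a little program that parses code and flags functions with long parameter lists:

```python
import ast
import sys

# A sketch of a machine-scripted "static check": flag any function that takes
# more than five parameters. Producing this report is checking; deciding which
# findings actually matter, and why, is work for a human, and it is exploratory.

def long_parameter_lists(source, limit=5):
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and len(node.args.args) > limit:
            findings.append((node.lineno, node.name, len(node.args.args)))
    return findings

if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        for lineno, name, count in long_parameter_lists(f.read()):
            print(f"{path}:{lineno}: {name} takes {count} parameters")
```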

Learning about a new product or a new feature is an exploratory activity if you want to do it or foster it well. Some suggest that test scripts provide a useful means of training testers. Research into learning shows that people tend to learn more quickly and more deeply when their learning is based on interaction and feedback; guided, perhaps, but not controlled. If you really want to learn about a product, try creating a mind map, documenting some aspect of the program’s behaviour, or creating plausible scenarios in which people might use—or misuse—the product. All of these activities promote learning, and they’re all exploratory activities. There’s far more information that you can use, apply, and discover than a script can tell you about. Come to think of it… where does the script come from?

Developing a test procedure—even developing a test script, whether for a machine or a human to follow, or developing the kind of “test” that skilled testers would call a demonstration—is an exploratory activity. There is no script that specifies how to write a new script for a particular purpose. Heard about a new feature and pondering how you might test it? You’ve already begun testing; you’re doing test design and you’re probably learning as you go. To the extent that you use the product or interact with it, bounce ideas off other people, or think critically about your design, you’re testing, and you’re doing it in an unscripted way. Some might suggest that certain tools create scripts that can perform automatic checks. Yet reviewing those checks for appropriateness, interpreting the results, and troubleshooting unexpected outcomes are all exploratory activities.

Suppose that a programmer, midway through a sprint, decides that she’d like some feedback on the work that she’s done so far on a new module, and hands you a bit of code to look at. You might interact with the code directly through a test tool that she provided, or (say) via the Ruby interpreter, or you might write some script code to exercise some of the functions in the module. In any event, you find some problems in it. In order to investigate a problem that you’ve discovered, you must explore, whether your recognition of the problem was triggered by your own interaction with the program or by a mechanically executed script. You’re in control of the activity; each new test around the problem feeds back into your choice of the next activity, and into the story that you’re going to tell about the product.
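
The “script code” in that story might be nothing more than a throwaway probe like this sketch, in which the standard library’s Decimal class stands in for the function under test:

```python
# A throwaway probe: feed a function a pile of inputs and watch what happens.
# Decimal stands in here for the half-finished function under test.

from decimal import Decimal

def probe(fn, inputs):
    for value in inputs:
        try:
            print(f"{fn.__name__}({value!r}) -> {fn(value)!r}")
        except Exception as error:
            print(f"{fn.__name__}({value!r}) raised {error!r}")

if __name__ == "__main__":
    probe(Decimal, ["10", "10.000", "-0", "1e400", "", "  42 ", None, "NaN"])
```

The probe is mechanical, but choosing the inputs, noticing surprises in the output, and deciding what to try next is exploration.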

All of the larger activities that I’ve described above are exploratory, and they all happen before you have a completed function or story or sprint. Exploratory testing is not a stage or phase of testing to be performed after you’ve performed your other test techniques. Exploratory testing is not an “other” test technique, because it’s not a technique at all. Exploratory testing is not a thing that you do, but rather a way that you work (and think, and act), the hallmarks being who (or what) is in control, and the extent to which your activity is part of a loop, rather than a straight line. Any test technique can be applied in a scripted way or in an exploratory way. To those who say “we do exploratory testing after our acceptance tests are all running green”, I would suggest looking carefully and observing the extent to which you’re doing exploratory testing all the way along.

The Undefinition of “Done”

Wednesday, July 13th, 2011

Recently a colleague noted that his Agile team was having trouble with the notion of done. “Sometimes it seems like the rest of the team doesn’t get it. The testers know that ‘done’ means tested. And if you ask the programmers, they’ll acknowledge that, yes, done means tested. Everyone is acting in good faith. But during the sprint planning meeting, we keep having to remind people to include time and effort for testing when they’re estimating story points. It never seems to stop.”

One issue, as I’ve pointed out before, is that people have preset notions about what “done” means to them. When I’m writing an article for publication, I put a bunch of effort into several drafts. When I’m finally happy with it, I declare that I’m done and submit it to the editors. I head downstairs and say to my wife, “Let’s watch The Daily Show. I’m done with that article.”

Of course, I’m not really done. The article goes through a round of technical editing, and a round of copy editing. Inevitably, someone finds problems that have escaped my notice. They send the article back (typically when I’m busy with something else, but that’s hardly their fault). I fix the problems. And I submit the article for publication. Then I’m done. Done-done. The article is ready for publication.

So it occurred to me that I could maybe avoid fooling myself. Instead of using an ambiguous word like “done”, maybe I could think in terms of “ready for publication”, or “edited”.

And that’s what I suggested to my colleague. “Instead of the question ‘When will this story be done?’ how about asking ‘When will this story be ready for testers to have at it?’ or ‘When will this story be tested?’ or how about ‘When will this story be ready to release?'” He thought that might be a useful reframe.

What do you think? If “done” is a troublesome concept, why not say what we really mean instead?

Done, The Relative Rule, and The Unsettling Rule

Thursday, September 9th, 2010

The Agile community (to the degree that such a thing exists at all; it’s a little like talk about “the software industry”) appears to me to be really confused about what “done” means.

Whatever “done” means, it’s subject to the Relative Rule. I coined the Relative Rule, inspired by Jerry Weinberg‘s definition of quality (“quality is value to some person(s)”). The Relative Rule goes like this:

For any abstract X, X is X to some person, at some time.

For example, the idea of a “bug” is subject to the Relative Rule. A bug is not a thing that exists in the world; it doesn’t have a tangible form. A bug is a relationship between the product and some person. A bug is a threat to the value of the product to some person. The notion of a bug might be shared among many people, or it might be exclusive to some person.

Similarly: “done” is “done to some person(s), at some time,” and implicitly, “for some purpose“. To me, a tester’s job is to help people who matter—most importantly, the programmers and the product owner—make an informed decision about what constitutes “done” (and as I’ve said before, testers aren’t there to make that determination themselves). So testers, far from worrying about “done”, can begin to relax right away.

Let’s look at this in terms of a story.

A programmer takes on an assignment to code a particular function. She goes into cycles of test-driven development, writing a unit check, writing code to make the check pass, running a suite of prior unit checks and making sure that they all run green, and repeating the loop, adding more and more checks for that function as she goes. Meanwhile, the testers have, in consultation with the product owner, set up a suite of examples that demonstrate basic functionality, and they automate those examples as checks.
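
For illustration, those automated examples might amount to something as modest as this sketch (the shipping_fee function and the business rule behind it are hypothetical, invented for this example):

```python
# A sketch of "examples automated as checks": a small table of examples agreed
# upon with the product owner, driven through the function under test and
# compared with the expected results.

def shipping_fee(order_total):
    """Hypothetical rule: orders of $50.00 or more ship free; others pay $5.95."""
    return 0.00 if order_total >= 50.00 else 5.95

EXAMPLES = [
    (10.00, 5.95),
    (49.99, 5.95),
    (50.00, 0.00),
    (120.00, 0.00),
]

if __name__ == "__main__":
    for order_total, expected in EXAMPLES:
        actual = shipping_fee(order_total)
        status = "PASS" if actual == expected else "FAIL"
        print(f"{status}  shipping_fee({order_total:.2f}) -> {actual:.2f} (expected {expected:.2f})")
```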

The programmer decides that she’s done writing a particular function. She feels confident. She runs her code against the examples. Two examples don’t work properly. Ooops, not done. Now she doesn’t feel so confident. She writes fixes. Now the examples all work, so now she’s done. That’s better.

A tester performs some exploratory tests that exercise that function, to see if it fulfills its explicit requirements. It does. Hey, the tester thinks, based on what I’ve seen so far, maybe we’re done programming… but we’re not done testing. Since no one—not testers, not programmers, not even requirements document writers—imagine!—is perfect, the tester performs other tests that explore the space of implicit requirements.

The tester raises questions about the way the function might or might not work. The tester expands the possibilities of conditions and risks that might be relevant. Some of his questions raise new test ideas, and some of those tests raise new questions, and some of those questions reveal that certain implicit requirements haven’t been met. Not done!

The tester is done testing, for now, but no one is now sure that programming is done. The programmer agrees that the tester has raised some significant issues. She’s mildly irritated that she didn’t think of some of these things on her own, and she’s annoyed that others are not explicit in the specs that were given to her. Still, she works on both sets of problems until they’re addressed too. (Done.)

For two of the issues the tester has raised, the programmer disagrees that they’re really necessary (that is, things are done, according to the programmer). The tester tries to make sure that this isn’t personal, but remains concerned about the risks (things are not done, according to the tester). After a conversation, the programmer persuades the tester that these two issues aren’t problems (oh, done after all), and they both feel better.

Just to be sure, though, the tester brings up the issues with the product owner. The product owner has some information about business risk that neither the tester nor the programmer had, and declares emphatically that the problem should be fixed (not done).

The programmer is reasonably exasperated, because this seems like more work. Upon implementing one fix, the programmer has an epiphany; everything can be handled by a refactoring that simultaneously makes the code easier to understand AND addresses both problems AND takes much less time. She feels justifiably proud of herself. She writes a few more unit checks, refactors, and all the unit checks pass. (Done!)

One of the checks of the automated examples doesn’t pass. (Damn; not done.) That’s frustrating. Another fix; the unit checks pass, the examples pass, the tester does more exploration and finds nothing more to be concerned about. Done! Both the programmer and the tester are happy, and the product owner is relieved and impressed.

Upon conversation with other programmers on the project team, our programmer realizes that there are interactions between her function and other functions that mean she’s not done after all. That’s a little deflating. Back to the drawing board for a new build, followed by more testing. The tester feels a little pressured, because there’s lots of other work to do. Still, after a little investigation, things look good, so, okay, now, done.

It’s getting to the end of the iteration. The programmers all declare themselves done. All of the unit checks are running green, and all of the ATDD checks are running green too. The whole team is ready to declare itself done. Well, done coding the new features, but there’s still a little uncertainty because there’s still a day left in which to test, and the testers are professionally uncertain.

On the morning of the last day of the iteration, the programmers get into scoping out the horizon for the next iteration, while testers explore and perform some new tests. They apply oracles that show the product isn’t consistent with a particular point in a Request-For-Comment that, alas, no one has noticed before. Aaargh! Not done.

Now the team is nervous; people are starting to think that they might not be done what they committed to do. The programmers put in a quick fix and run some more checks (done). The testers raise more questions, perform more investigations, consider more possibilities, and find that more and more stopping heuristics apply (you’ll find a list of those here: http://www.developsense.com/blog/2009/09/when-do-we-stop-test/). It’s 3:00pm. Okay, finally: done. Now everyone feels good. They set up the demo for the iteration.

The Customer (that is, the product owner) says “This is great. You’re done everything that I asked for in this iteration.” (Done! Yay!) “…except, we just heard from The Bank, and they’ve changed their specifications on how they handle this kind of transaction. So we’re done this iteration (that is, done now, for some purpose), but we’ve got a new high-priority backlog item for next Monday, which—and I’m sorry about this—means rolling back a lot of the work we’ve done on this feature (not done for some other purpose). And, programmers, the stuff you were anticipating for next week is going to be back-burnered for now.”

Well, that’s a little deflating. But it’s only really deflating for the people who believe in the illusion that there’s a clear finish line for any kind of development work—a finish line that is algorithmic, instead of heuristic.

After many cycles like the above, eventually the programmers and the testers and the Customer all agree that the product is indeed ready for deployment. That agreement is nice, but in one sense, what the programmers and the testers think doesn’t matter. Shipping is a business decision, and not a technical one; it’s the product owner that makes the final decision. In another sense, though, the programmers and testers absolutely matter, in that a responsible and effective product owner must seriously consider all of the information available to him, weighing the business imperatives against technical concerns. Anyway, in this case, everything is lined up. The team is done! Everyone feels happy and proud.

The product gets deployed onto the bank’s system on a platform that doesn’t quite match the test environment, at volumes that exceed the test volumes. Performance lags, and the bank’s project manager isn’t happy (not done). The testers diligently test and find a way to reproduce the problem (they’re done, for now).

The programmers don’t make any changes to the code, but find a way to change a configuration setting that works around the problem (so now they’re done). The testers show that the fix works in the test environments and at heavier loads (done). Upon evaluation of the original contract, recognition of the workaround, and after its own internal testing, the bank accepts the situation for now (done) but warns that it’s going to contest whether the contract has been fulfilled (not done).

Some people are tense; others realize that business is business, and they don’t take it personally. After much negotiation, the managers from the bank and the development shop agree that the terms of the contract have been fulfilled (done), but that they’d really prefer a more elegant fix for which the bank will agree to pay (not done). And then the whole cycle continues. For years.

So, two things:

1) Definitions and decisions about “done” are always relative to some person, some purpose, and some time. Decisions about “done” are always laden with context. Not only technical considerations matter; business considerations matter too. Moreover, the process of deciding about doneness is not merely logical, but also highly social. Done is based not on What’s Right, but on Who Decides and For What Purpose and For Now. And as Jerry Weinberg points out, decisions about quality are political and emotional, but made by people who would like to appear rational.

However, if you want to be politically, emotionally, and rationally comfortable, you might want to take a deep breath and learn to accept—with all of your intelligence, heart, and good will—not only the first point, but also the second…

2) “Done” is subject to another observation that Jerry often makes, and that I’ve named The Unsettling Rule:

Nothing is ever settled.

Update, 2013-10-16: If you’re still interested, check out the esteemed J.B. Rainsberger on A Better Path To the “Definition of Done”.

When Should the Product Owner Release the Product?

Tuesday, August 3rd, 2010

In response to my previous blog post “Another Silly Quantitative Model”, Greg writes: In my current project, the product owner has assumed the risk of any financial losses stemming from bugs in our software. He wants to release the product to customers, but he is of course nervous. How do you propose he should best go about deciding when to release? How should he reason about the risks, short of using a quantitative model?

The simple answer is “when he’s not so nervous that he doesn’t want to ship”. What might cause him to decide to stop shipment? He should stop shipment when there are known problems in the product that aren’t balanced by countervailing benefits. Such problems are called showstoppers. A colleague once described “showstopper” as “any problem in the product that would make more sense to fix than to ship.”

When I was a product owner and I reasoned with the project team about showstoppers, we deemed as a showstopper

  • Any single problem in the product that would definitely cause loss or harm (or sufficient annoyance or frustration) to users, such that the product’s value in its essential operation would be nil. Beware of quantifying “users” here. In the age of the Internet, you don’t need very many people with terrible problems to make noise disproportionate to the population, nor do you need those problems to be terrible problems when they affect enough people. The recent kerfuffle over the iPhone 4 is a case in point; the Pentium Bug is another. Customer reactions are often emotional more than rational, but to the customer, emotional reactions are every bit as real as rational ones.
  • Any set of problems in the product that, when taken individually, would not threaten its value, but when viewed collectively would. That could include a bunch of minor irritants that confuse, annoy, disturb, or slow down people using the product; embarrassing cosmetic defects; non-devastating functional problems; parafunctional issues like poor performance or compatibility, and the like.

Now, in truth, your product owner might need to resort to a quantitative model here: he has to be able to count to one. One showstopper, by definition, is enough to stop shipment.

How might you evaluate potential showstoppers qualitatively? My colleague Fiona Charles has two nice suggestions: “Could a problem that we know about in this product trigger a front-page story in the Globe and Mail‘s Report on Business, or in the Wall Street Journal?” “Could a problem that we know about in this product lead to a question being raised in Parliament?” Now: the fact is that we don’t, and can’t, know the answer to whether the problem will have that result, but that’s not really the point of the questions. The points are to explore and test the ways that we might feel about the product, the problems, and their consequences.

What else might cause nervousness for your client? Perhaps he’s worried that, other than the known problems, there are unanswered questions about the product. Those include

  • Open questions whose answer would produce one or more instances of a showstopper.
  • Unasked questions that, when asked, would turn into open questions instead of “I don’t care”. Where would you get ideas for such questions? Try the Heuristic Test Strategy Model at http://www.satisfice.com/tools/satisfice-tsm-4p.pdf for an example of the kinds of questions that you might ask.
  • Unanswered questions about the product are one indicator that you might not be finished testing. There are other indicators; you can read about them here: http://www.developsense.com/blog/2009/09/when-do-we-stop-test/

    Questions about how much we have (or haven’t) tested are questions about test coverage. I wrote three columns about that a while back. Here are some links and synopses:

    Got You Covered: Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.

    Cover or Discover: Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.

    A Map By Any Other Name: A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.

    Whether you’ve established a clear feeling or are mired in uncertainty, you might want to test your first-order qualitative evaluation with a first-order quantitative model. For example, many years ago at Quarterdeck, we had a problem that threatened shipment: on bootup, our product would lock up systems that had a particular kind of hard disk controller. There was a workaround, which would take a trained technical support person about 15 minutes to walk through. No one felt good about releasing the product in that state, but we were under quarterly schedule pressure.

    We didn’t have good data to work with, but we did have a very short list of beta testers and data about their systems. Out of 60 beta testers, three had machines with this particular controller. There was no indication that our beta testers were representative of our overall user population, but 5% of our beta testers had this controller. We then performed the thought experiment of destroying the productivity of 5% of our user base, or tech support having to spend 15 minutes with 5% of our user base (in those days, in the millions).

    How big was our margin of error? What if we were off by a factor of two, and ten per cent of our user base had that controller? What if we were off by a factor of five, and only one per cent of our user base had that controller? Suppose that only one per cent of the user base had their machines crash on startup; suppose that only a fraction of those users called in. Eeew. The story and the feeling, rather than the numbers, told us this: still too many.
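
Here’s a rough sketch of that back-of-the-envelope arithmetic; the size of the user base is an assumption for illustration, since all that matters here is the order of magnitude:

```python
# A sketch of the back-of-the-envelope sensitivity check described above.
# The user-base figure is an assumption for illustration; the post says only
# that it was in the millions.

USER_BASE = 2_000_000          # assumed for illustration
MINUTES_PER_SUPPORT_CALL = 15  # the workaround described above

for affected_fraction in (0.01, 0.05, 0.10):
    affected_users = int(USER_BASE * affected_fraction)
    support_hours = affected_users * MINUTES_PER_SUPPORT_CALL / 60
    print(f"{affected_fraction:.0%} affected: {affected_users:,} users locking up on boot; "
          f"~{support_hours:,.0f} support hours if they all call in")
```

Whatever the exact inputs, the output tells the same story: still too many.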

    Is it irrational to base a decision on such vague, unqualified, first-order numbers? If so, who cares? We made an “irrational” but conscious decision: fix the product, rather than ship it. That is, we didn’t decide based on the numbers, but rather on how we felt about the numbers. Was that the right decision? We’ll never know, unless we figure out how to get access to parallel universes. In this one, though, we know that the problem got fixed startlingly quickly when another programmer viewed it with fresh eyes; that the product shipped without the problem; that users with that controller never faced that problem; and that the tech support department continued in its regular overloaded state, instead of a super-overloaded one.

    The decision to release is always a business decision, and not merely a technical one. The decision to release is not based on numbers or quantities; even for those who claim to make decisions “based on the numbers”, the decisions are really based on feelings about the numbers. The decision to release is always driven by balancing cost, value, knowledge, trust, and risk. Some product owners will have a bigger appetite for risk and reward; others will be more cautious. Being a product owner is challenging because, in the end, product owners own the shipping decisions. By definition, a product owner assumes responsibility for the risk of financial losses stemming from bugs in the software. That’s why they get the big bucks.

    When Testers Are Asked For A Ship/No-Ship Opinion

    Wednesday, May 5th, 2010

    In response to my post, Testers:  Get Out of the Quality Assurance Business, my colleague Adam White writes,

    I want to ask about your experience when you’ve applied your first 3 points for managers:

    • Provide them with the information they need to make informed decisions, and then let them make the decisions.
    • Remain fully aware that they’re making business decisions, not just technical ones.
    • Know that the product doesn’t necessarily have to hew to your standard of quality.

    In my experience I took this to an extreme. I got to the point where I wouldn’t give my opinion on whether or not we should ship the product because I clearly didn’t have all the information to make a business decision like this.

    I don’t think that’s taking things to an extreme. I think that’s professional and responsible behaviour for a tester.

    Back in the 90s, I was a program manager for several important mass-market software products. As I said in the original post, I believe that I was pretty good at it. Several times, I tried to quit, and the company kept asking me to take on the role again. As a program manager, I would not have asked you for your opinion on a shipping decision. If you had merely told me, “You have to ship this product!” or “You can’t ship this product!” I would have thanked you for your opinion, and probed into why you thought that way. I would have also counselled you to frame your concerns in terms of threats to the value of the product, rather than telling me what I should or should not do. I likely would have explained the business reasons for my decisions; I liked to keep people on the team informed. But had you continued telling me what decision I “should” be making, and had you done it stridently enough, I would have stopped thanking you, and I would have insisted that you (and, if you persisted, your manager) stop telling me how to do my job.

    On the other hand, I would have asked you constantly for solid technical information about the product.  I would have expected it as your primary deliverable, I would have paid very close attention to it, and would have weighed it very seriously in my decisions.

    This issue is related to what Cem Kaner calls “bug advocacy”. “Bug advocacy” is an easy label to misinterpret. The idea is not that you’re advocating in favour of bugs, of course, but I don’t believe that you’re advocating that management do something specific with respect to a particular bug, either. Instead, you’re “selling” bugs. You’re an advocate in the sense that you’re trying to show each bug as the most important bug it can be, presenting it in terms of its clearest manifestation, its richest story, its greatest possible impact, its maximum threat to the value of the product. Like a good salesman, you’re trying to sell the customer on all of the dimensions and features of the bug and trying to overcome all of the possible objections to the sale. A sale, in this case, results in a decision by management to address the bug. But (and this is a key point) like a good salesman, you’re not there to tell the customer that he must buy what you’re selling; you’re there to provide your customers with information about the product that allows them to make the best choice for them. And (corresponding key point) a responsible customer doesn’t let a salesman make his decisions for him.

    I started to get pushback from my boss and others in development that it’s my job to give this opinion.

    “Please, Tester! I’m scared to commit! Will you please do my job for me? At least part of it?”

    What do you think? Should testers give their opinion on whether or not to ship the product or should they take the approach of presenting “just the facts ma’am”?

    I remember being in something like your position when I was at Quarterdeck in the late 1990s. The decision to ship a certain product lay with the product manager, which at the time was a marketing function. She was young, and ambitious, but (justifiably) nervous about whether to ship, so she asked the rest of us what she should do. I insisted that it was her decision; that, since she was the owner of the product, it was up to her to make the decision. The testers, also being asked to provide their opinion, also declined. Then she wanted to put it to a vote of the development team. We declined to vote; businesses aren’t democracies. Anticipating the way James Bach would answer the question (a couple of years before I met him), I answered more or less with “I’m in the technical domain. Making this kind of business decision is not a service that I offer”.

    But it’s not that we were obstacles to the product, and it’s not that we were unhelpful. Here are the services that we did offer:

    • We told her about what we considered the most serious problems in the product.
    • We made back-of-the-envelope calculations of technical support burdens for those problems (and were specific about the uncertainty of those calculations).
    • We contextualized those problems in terms of the corresponding benefits of the product.
    • We answered any questions for which we had reliable information.
    • We provided information about known features and limitations of competitive products.
    • We offered rapid help in answering questions about the known unknowns.

    Yet we insisted, respectfully, that the decision was hers to make.  And in the end, she made it.

    There are other potentially appropriate answers to the question, “Do you think we should ship this product?”

    • “I don’t know,” is a very good one, since it’s inarguable; you don’t know, and they can’t tell you that you do.
    • “Would anything I said change your decision?” is another, since if they’re sure of their decision, their answer would be the appropriate one: No. If the answer is Yes, then return to “I don’t know.”
    • “What if I said Yes?” immediately followed by “What if I said No?” might kick-start some thinking about business-based alternatives.
    • “Would I get a product owner’s position and salary if I gave you an answer?”, said with a smile and a wink, might work in some rare, low-pressure cases, although typically the product owner will have lost his sense of humour by the time you’re being asked for your ship/no-ship opinion.

    As I said in the previous post, managers (and in particular, the product owners) are the brains of the project. Testers are sense organs, eyes for the project team. In dieting, car purchases, or romance, we know what happens when we let our eyes, instead of our brains, make decisions for us.

    Why Is Testing Taking So Long? (Part 2)

    Wednesday, November 25th, 2009

    Yesterday I set up a thought experiment in which we divided our day of testing into three 90-minute sessions. I also made a simplifying assumption that bursts of testing activity representing some equivalent amount of test coverage (I called it a micro-session, or just a “test”) take two minutes. Investigating and reporting a bug that we find costs an additional eight minutes, so a test on its own would take two minutes, and a test that found a problem would take ten.

    Yesterday we tested three modules. We found some problems. Today the fixes showed up, so we’ll have to verify them.

    Let’s assume that a fix verification takes six minutes. (That’s yet another gross oversimplification, but it sets things up for our little thought experiment.) We don’t just perform the original microsession again; we have to do more than that. We want to make sure that the problem is fixed, but we also want to do a little exploration around the specific case and make sure that the general case is fixed too.

    Well, at least we’ll have to do that for Modules B and C. Module A didn’t have any fixes, since nothing was broken. And Team A is up to its usual stellar work, so today we can keep testing Team A’s module, uninterrupted by either fix verifications or by bugs. We get 45 more micro-sessions in today, for a two-day total of 90.

    (As in the previous post, if you’re viewing this under at least some versions of IE 7, you’ll see a cool bug in its handling of the text flow around the table.  You’ve been warned!)

    Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total
    A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90

    Team B stayed an hour or so after work yesterday. They fixed the bug that we found, tested the fix, and checked it in. They asked us to verify the fix this afternoon. That costs us six minutes off the top of the session, leaving us 84 more minutes. Yesterday’s trends continue; although Team B is very good, they’re human, and we find another bug today. The test costs two minutes, and bug investigation and reporting costs eight more, for a total of ten. In the remaining 74 minutes, we have time for 37 micro-sessions. That means a total of 38 new tests today—one that found a problem, and 37 that didn’t. Our two-day total for Module B is 79 micro-sessions.

    Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total
    A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90
    B | 6 minutes (1 bug yesterday) | 10 minutes (1 test, 1 bug) | 74 minutes (37 tests) | 38 | 79

    Team C stayed late last night. Very late. They felt they had to. Yesterday we found eight bugs, and they decided to stay at work and fix them. (Perhaps this is why their code has so many problems; they don’t get enough sleep, and produce more bugs, which means they have to stay late again, which means even less sleep…) In any case, they’ve delivered us all eight fixes, and we start our session this afternoon by verifying them. Eight fix verifications at six minutes each amounts to 48 minutes. So far as obtaining new coverage goes, today’s 90-minute session with Module C is pretty much hosed before it even starts; 48 minutes—more than half of the session—is taken up by fix verifications, right from the get-go. We have 42 minutes left in which to run new micro-sessions, those little two-minute slabs of test time that give us some equivalent measure of coverage. Yesterday’s trends continue for Team C too, and we discover four problems that require investigation and reporting. That takes 40 of the remaining 42 minutes. Somewhere in there, we spend two minutes of testing that doesn’t find a bug. So today’s results look like this:

    Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total
    A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90
    B | 6 minutes (1 bug yesterday) | 10 minutes (1 test, 1 bug) | 74 minutes (37 tests) | 38 | 79
    C | 48 minutes (8 bugs yesterday) | 40 minutes (4 tests, 4 bugs) | 2 minutes (1 test) | 5 | 18

    Over two days, we’ve been able to obtain only 20% of the test coverage for Module C that we’ve been able to obtain for Module A. We’re still at less than 1/4 of the coverage that we’ve been able to obtain for Module B.
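
For those who like to see the arithmetic spelled out, here’s a minimal sketch of the model behind these tables:

```python
# A sketch of the session arithmetic in this thought experiment: 90-minute
# sessions, 2 minutes per test, 8 extra minutes whenever a test finds a bug,
# and 6 minutes for each fix verification carried over from the day before.

SESSION_MINUTES = 90
MINUTES_PER_TEST = 2
BUG_REPORTING_MINUTES = 8
FIX_VERIFICATION_MINUTES = 6

def new_tests_today(fixes_to_verify, bugs_found_today):
    minutes_left = SESSION_MINUTES - fixes_to_verify * FIX_VERIFICATION_MINUTES
    minutes_left -= bugs_found_today * (MINUTES_PER_TEST + BUG_REPORTING_MINUTES)
    return bugs_found_today + minutes_left // MINUTES_PER_TEST

for module, fixes, bugs in (("A", 0, 0), ("B", 1, 1), ("C", 8, 4)):
    print(module, new_tests_today(fixes, bugs))
# Prints A 45, B 38, C 5, matching the table above.
```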

    Yesterday, we learned one lesson:

    Lots of bugs means reduced coverage, or slower testing, or both.

    From today’s results, here’s a second:

    Finding bugs today means verifying fixes later, which means even less coverage or even slower testing, or both.

    So why is testing taking so long? One of the biggest reasons might be this:

    Testing is taking longer than we might have expected or hoped because, although we’ve budgeted time for testing, we lumped into it the time for investigating and reporting problems that we didn’t expect to find.

    Or, more generally,

    Testing is taking longer than we might have expected or hoped because we have a faulty model of what testing is and how it proceeds.

    For managers who ask “Why is testing taking so long?”, it’s often the case that their model of testing doesn’t incorporate the influence of things outside the testers’ control. Over two days of testing, the difference between the quality of Team A’s code and Team C’s code has a profound impact on the amount of uninterrupted test design and execution work we’re able to do. The bugs in Module C present interruptions to coverage, such that (in this very simplified model) we’re able to spend only one-fifth of our test time designing and executing tests. After the first day, we were already way behind; after two days, we’re even further behind. And even here, we’re being optimistic. With a team like Team C, how many of those fixes will be perfect, revealing no further problems and taking no further investigation and reporting time?

    And again, those faulty management models will lead to distortion or dysfunction. If the quality of testing is measured by bugs found, then anyone testing Module C will look great, and people testing Module A will look terrible. But if the quality of testing is evaluated by coverage, then the Module A people will look sensational and the Module C people will be on the firing line. But remember, the differences in results here have nothing to do with the quality of the testing, and everything to do with the quality of what is being tested.

    There’s a psychological factor at work, too. If our approach to testing is confirmatory, with steps to follow and expected, predicted results, we’ll design our testing around the idea that the product should do this, and that it should behave thus and so, and that testing will proceed in a predictable fashion. If that’s the case, it’s possible—probable, in my view—that we will bias ourselves towards the expected and away from the unexpected. If our approach to testing is exploratory, perhaps we’ll start from the presumption that, to a great degree, we don’t know what we’re going to find. As much as managers, hack statisticians, and process enthusiasts would like to make testing and bug-finding predictable, people don’t know how to do that such that the predictions stand up to human variability and the complexity of the world we live in. Plus, if you can predict a problem, why wait for testing to find it? If you can really predict it, do something about it now. If you don’t have the ability to do that, you’re just playing with numbers.

    Now: note again that this has been a thought experiment. For simplicity’s sake, I’ve made some significant distortions and left out an enormous amount of what testing is really like in practice.

    • I’ve treated testing activities as compartmentalized chunks of two minutes apiece, treading dangerously close to the unhelpful and misleading model of testing as development and execution of test cases.
    • I haven’t looked at the role of setup time and its impact on test design and execution.
    • I haven’t looked at the messy reality of having to wait for a product that isn’t building properly.
    • I haven’t included the time that testers spend waiting for fixes.
    • I haven’t included the delays associated with bugs that block our ability to test and obtain coverage of the code behind them.
    • I’ve deliberately ignored the complexity of the code.
    • I’ve left out difficulties in learning about the business domain.
    • I’ve made highly simplistic assumptions about the quality and relevance of the testing and the quality and relevance of the bug reports, the skill of the testers in finding and reporting bugs, and so forth.
    • And I’ve left out the fact that, as important as skill is, luck always plays a role in finding problems.

    My goal was simply to show this:

    Problems in a product have a huge impact on our ability to obtain test coverage of that product.

    The trouble is that even this fairly simple observation is below the level of visibility of many managers. Why is it that so many managers fail to notice it?

    One reason, I think, is that they’re used to seeing linear processes instead of organic ones, a problem that Jerry Weinberg describes in Becoming a Technical Leader. Linear models “assume that observers have a perfect understanding of the task,” as Jerry says. But software development isn’t like that at all, and it can’t be. By its nature, software development is about dealing with things that we haven’t dealt with before (otherwise there would be no need to develop a new product; we’d just reuse the one we had). We’re always dealing with the novel, the uncertain, the untried, and the untested, so our observation is bound to be imperfect. If we fail to recognize that, we won’t be able to improve the quality and value of our work.

    What’s worse about managers with a linear model of development and testing is that “they filter out innovations that the observer hasn’t seen before or doesn’t understand” (again, from Becoming a Technical Leader). As an antidote for such managers, I’d recommend Perfect Software, and Other Illusions About Testing and Lessons Learned in Software Testing as primers. But mostly I’d suggest that they observe the work of testing. In order to do that well, they may need some help from us, and that means that we need to observe the work of testing too. So over the next little while, I’ll be talking more than usual about Session-Based Test Management, developed initially by James and Jon Bach, which is a powerful set of ideas, tools and processes that aid in observing and managing testing.

    Why Is Testing Taking So Long? (Part 1)

    Tuesday, November 24th, 2009

    If you’re a tester, you’ve probably been asked, “Why is testing taking so long?” Maybe you’ve had a ready answer; maybe you haven’t. Here’s a model that might help you deal with the kind of manager who asks such questions.

    Let’s suppose that we divide our day of testing into three sessions, each session being, on average, 90 minutes of chartered, uninterrupted testing time. That’s four and a half hours of testing, which seems reasonable in an eight-hour day interrupted by meetings, planning sessions, working with programmers, debriefings, training, email, conversations, administrivia of various kinds, lunch time, and breaks.

    The reason that we’re testing is that we want to obtain coverage; that is, we want to ask and answer questions about the product and its elements to the greatest extent that we can. Asking and answering questions is the process of test design and execution. So let’s further assume that we break each session into micro-sessions averaging two minutes each, in which we perform some test activity that’s focused on a particular testing question, or on evaluating a particular feature. That means in a 90-minute session, we can theoretically perform 45 of these little micro-sessions, which for the sake of brevity we’ll informally call “tests”. Of course life doesn’t really work this way; a test idea might take a couple of seconds to implement, or it might take all day. But I’m modeling here, making this rather gross simplification to clarify a more complex set of dynamics. (Note that if you’d like to take a really impoverished view of what happens in skilled testing, you could say that a “test case” takes two minutes. But I leave it to my colleague James Bach to explain why you should question the concept of test cases.)

    Let’s further suppose that we’ll find problems every now and again, which means that we have to do bug investigation and reporting. This is valuable work for the development team, but it takes time that interrupts test design and execution—the stuff that yields test coverage. Let’s say that, for each bug that we find, we must spend an extra eight minutes investigating it and preparing a report. Again, this is a pretty dramatic simplification. Investigating a bug might take all day, and preparing a good report could take time on the order of hours. Some bugs (think typos and spelling errors in the UI) leap out at us and don’t call for much investigation, so they’ll take less than eight minutes. Even though eight minutes is probably a dramatic underestimate for investigation and reporting, let’s go with that. So a test activity that doesn’t find a problem costs us two minutes, and a test activity that does find a problem takes ten minutes.
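
Here’s that arithmetic as a minimal sketch: feed it the number of bugs found in a session, and it reports how many micro-sessions we manage to perform:

```python
# A sketch of the model above: 90-minute sessions, 2 minutes per micro-session
# ("test"), plus 8 minutes of investigation and reporting for each test that
# finds a bug.

SESSION_MINUTES = 90
MINUTES_PER_TEST = 2
BUG_REPORTING_MINUTES = 8

def tests_in_one_session(bugs_found):
    minutes_left = SESSION_MINUTES - bugs_found * (MINUTES_PER_TEST + BUG_REPORTING_MINUTES)
    return bugs_found + minutes_left // MINUTES_PER_TEST

for bugs_found in (0, 1, 8):
    print(bugs_found, "bugs ->", tests_in_one_session(bugs_found), "tests")
# 0 bugs -> 45 tests; 1 bug -> 41; 8 bugs -> 13 (the scenarios that follow).
```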

    Now, let’s imagine one more thing: we have perfect testing prowess, such that if there’s a problem in an area that we’re testing, we’ll find it, and we’ll never enter a bogus report, either. Yes, this is a thought experiment.

    One day we come into work, and we’re given three modules to test.

    The morning session is taken up with Module A, from Development Team A. These people are amazing, hyper-competent. They use test-first programming, and test-driven design. They work closely with us, the testers, to design challenging unit checks, scriptable interfaces, and log files. They use pair programming, and they review and critique each other’s work in an egoless way. They refactor mercilessly, and run suites of automated checks before checking in code. They brush their teeth and floss after every meal; they’re wonderful. We test their work diligently, but it’s really a formality because they’ve been testing and we’ve been helping them test all along. In our 90-minute testing session, we don’t find any problems. That means that we’ve performed 45 micro-sessions, and have therefore obtained 45 units of test coverage.

    (And if you’re viewing this under at least some versions of IE 7, you’ll see a cool bug in its handling of the text flow around the table.  You’ve been warned!)

    Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests
    A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45

    The first thing after lunch, we have a look at Team B’s module. These people are very diligent indeed. Most organizations would be delighted to have them on board. Like Team A, they use test-first programming and TDD, they review carefully, they pair, and they collaborate with testers. But they’re human. When we test their stuff, we find a bug very occasionally; let’s say once per session. The test that finds the bug takes two minutes; investigation and reporting of it takes a further eight minutes. That’s ten minutes altogether. The rest of the time, we don’t find any problems, so that leaves us 80 minutes in which we can run 40 tests. Let’s compare that with this morning’s results.

    Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests
    A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45
    B | 10 minutes (1 test, 1 bug) | 80 minutes (40 tests) | 41

    After the afternoon coffee break, we move on to Team C’s module. Frankly, it’s a mess. Team C is made up of nice people with the best of intentions, but sadly they’re not very capable. They don’t work with us at all, and they don’t test their stuff on their own, either. There’s no pairing, no review, in Team C. To Team C, if it compiles, it’s ready for the testers. The module is a dog’s breakfast, and we find bugs practically everywhere. Let’s say we find eight in our 90-minute session. Each test that finds a problem costs us 10 minutes, so we spent 80 minutes on those eight bugs. Every now and again, we happen to run a test that doesn’t find a problem. (Hey, even dBase IV occasionally did something right.) Our results for the day now look like this:

    Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests
    A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45
    B | 10 minutes (1 test, 1 bug) | 80 minutes (40 tests) | 41
    C | 80 minutes (8 tests, 8 bugs) | 10 minutes (5 tests) | 13

    Because of all the bugs, Module C allows us to perform thirteen micro-sessions in 90 minutes. Thirteen, where with the other modules we managed 45 and 41. Because we’ve been investigating and reporting bugs, there are 32 micro-sessions, 32 units of coverage, that we haven’t been able to obtain on this module. If we decide that we need to perform that testing (and the module’s overall badness is consistent throughout), we’re going to need at least three more sessions to cover it. Alternatively, we could stop testing now, but what are the chances of a serious problem lurking in the parts of the module we haven’t covered? So, the first thing to observe here is:
    Lots of bugs means reduced coverage, or slower testing, or both.
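
    (If you want to check the arithmetic, here is a tiny sketch in Python. It is purely illustrative: the two-minutes-per-test and eight-minutes-per-bug-report figures are the ones from the scenario above, and the names are mine, not part of any real tool.)

        # Illustrative only: recompute the table above from the scenario's numbers.
        SESSION_MINUTES = 90
        MINUTES_PER_TEST = 2          # one micro-session: one test, one unit of coverage
        MINUTES_PER_BUG_REPORT = 8    # investigation and reporting, on top of the test itself

        def session(bugs_found):
            """Coverage obtained in one 90-minute session, given how many bugs we run into."""
            bug_minutes = bugs_found * (MINUTES_PER_TEST + MINUTES_PER_BUG_REPORT)
            clean_tests = (SESSION_MINUTES - bug_minutes) // MINUTES_PER_TEST
            return bug_minutes, clean_tests, bugs_found + clean_tests

        for module, bugs in (("A", 0), ("B", 1), ("C", 8)):
            print(module, session(bugs))
        # A: (0, 45, 45)   B: (10, 40, 41)   C: (80, 5, 13) -- the same numbers as the table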

    There’s something else that’s interesting, too. If we are being measured based on the number of bugs we find (exactly the sort of measurement that will be taken by managers who don’t understand testing), Team A makes us look awful—we’re not finding any bugs in their stuff. Meanwhile, Team C makes us look great in the eyes of management. We’re finding lots of bugs! That’s good! How could that be bad?

    On the other hand, if we’re being measured based on the test coverage we obtain in a day (which is exactly the sort of measurement that will be taken by managers who count test cases; that is, managers who probably have an even more damaging model of testing than the managers in the last paragraph), Team C makes us look terrible. “You’re not getting enough done! You could have performed 45 test cases today on Module C, and you’ve only done 13!” And yet, remember that in our scenario we started with the assumption that, no matter what the module, we always find a problem if there’s one there. That is, there’s no difference between the testers or the testing for each of the three modules; it’s solely the condition of the product that makes all the difference.

    This is the first in a pair of posts. Let’s see what happens tomorrow.

    When Do We Stop Testing? One More Sure Thing

    Tuesday, October 13th, 2009

    Not too long ago, I posted a list of stopping heuristics for testing. As usual, such lists are always subjective, subject to refinement and revision, and under scrutiny from colleagues and other readers. As usual, James Bach is a harsh critic (and that’s a compliment, not a complaint). We’re still transpecting over some of the points; eventually we’ll come out with something on which we agree.

    Joe Harter, in his blog, suggests splitting “Pause That Refreshes” into two: “Change in Priorities” and “Lights are Off”. The former kicks in when we know that there’s still testing to be done, but something else is taking precedence. The latter is that we’ve lost our sense of purpose—as I suggested in the original post, we might be tired, or bored, or uninspired to test, and a break will allow us to return to the product later with fresh eyes or fresh minds. Maybe they’re different enough that they belong in different categories, and I’m thinking that they are. Joe provides a number of examples of why the lights go out; one feels to me like “customary conclusion”, another looks like “Mission Accomplished”. But his third point is interesting: it’s a form of Parkinson’s Law, “work expands to fill the time available for its completion”. Says Joe, “The test team might be given more time than is actually necessary to test a feature so they fill it up with old test cases that don’t have much meaning.” I’m not sure how often people feel as though they have more time than they need, but I am sure that I’ve seen (been in) situations where people seem to be bereft of new ideas and simply going through the motions. So: if that feeling comes up, one should consider Parkinson’s Law and a Pause That Refreshes. Maybe there’s a new one there. But as Joe himself points out, “In the end it doesn’t matter if you use [Michael’s] list, my list or any list at all. These heuristics are rules of thumb to help thinking testers decide when testing should stop. The most important thing is that you are thinking about it.”

    For sure, however, there is a glaring omission in the original list. Cem Kaner pointed it out to me—and that shouldn’t have been necessary, because I’ve used this heuristic myself. It focuses on the individual tester, but it might also apply to a testing or development team.

    Mission Rejected. We stop testing when we perceive a problem for some person—in particular, an ethical issue—that prevents us from continuing work on a given test, test cycle, or development project.

    Would you continue a test if it involved providing fake test results? Lying? Damaging valuable equipment? Harming a human, as in the Milgram Experiment or the Stanford Prison Experiment? Maybe the victim isn’t the test subject, but the client: Would you continue a test if you believed that some cost of what you were doing—including, perhaps, your own salary—were grossly disproportionate to the value it produced? Maybe the victim is you: Would you stop testing if you believed that the client wasn’t paying you enough?

    The consequences of ignoring this heuristic can be dire. Outside the field of software testing, but in testing generally, a friend of mine worked in a science lab that did experiments on bone regeneration. The experimental protocol involved the surgical removal of around one inch of bone from both forelegs of a dog (many dogs, over the course of the research), treating one leg as an experiment and the other as a control. Despite misgivings, my friend was reasonably convinced of the value of the work. Later, when he found out that these experiments had been performed over and over, and that no new science was really being done, he suffered a nervous breakdown and left the field. Sometimes testing doesn’t have a happy ending.

    When Do We Stop a Test?

    Friday, September 11th, 2009

    Several years ago, around the time I started teaching Rapid Software Testing, my co-author James Bach recorded a video to demonstrate rapid stress testing. In this case, the approach involved throwing an overwhelming amount of data at an application’s wizard, essentially getting the application to stress itself out.

    The video goes on for almost six minutes. About halfway through, James says, “You might be asking why I don’t stop now. The reason is that we’re seeing a steadily worsening pattern of failure. We could stop now, but we might see something even worse if we keep going.” And so the test does keep going. A few moments later, James provides the stopping heuristics: we stop when 1) we’ve found a sufficiently dramatic problem; or 2) there’s no apparent variation in the behaviour of the program—the program is essentially flat-lining; or 3) the value of continuing doesn’t justify the cost. Those were the stopping heuristics for that stress test.
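
    (To make the shape of that decision a little more visible, here is a small sketch in Python. It is not James’s harness: the StepResult type, the severity scale, and the flatline window are all invented for illustration. It just expresses the three conditions as code: a sufficiently dramatic problem, no apparent variation, or a cost that outweighs the value of continuing.)

        # Illustrative only: the three stopping conditions from the stress-test video,
        # expressed as a function. All names and thresholds here are made up.
        from dataclasses import dataclass

        @dataclass
        class StepResult:
            severity: int    # 0 = no visible problem ... 10 = catastrophic (invented scale)
            behaviour: str   # a rough label for what the program did on this step

        def reason_to_stop(history, value_of_next_step, cost_of_next_step,
                           dramatic=8, flatline_window=5):
            """Return a reason to stop this stress test, or None to keep going."""
            if history and history[-1].severity >= dramatic:
                return "sufficiently dramatic problem"                        # condition 1
            recent = history[-flatline_window:]
            if len(recent) == flatline_window and len({r.behaviour for r in recent}) == 1:
                return "no apparent variation; the program is flat-lining"    # condition 2
            if value_of_next_step < cost_of_next_step:
                return "the value of continuing doesn't justify the cost"     # condition 3
            return None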

    About a year after I first saw the video, I wanted to prepare a Better Software column on more general stopping heuristics, so James and I had a transpection session. The column is here. About a year after that, the column turned into a lightning talk that I gave in a few places.

    About six months after that, we had both recognized even more common stopping heuristics. We were talking them over at STAR East 2009 when Dale Emery and James Lyndsay walked by, and they also contributed to the discussion. In particular, Dale offered that in combat, the shooting might stop in several ways: a lull, “hold your fire”, “ceasefire”, “at ease”, “stand down”, and “disarm”. I thought that was interesting.

    Anyhow, here’s where we’re at so far. I emphasize that these stopping heuristics are heuristics. Heuristics are quick, inexpensive ways of solving a problem or making a decision. Heuristics are fallible—that is, they might work, and they might not work. Heuristics tend to be leaky abstractions, in that one might have things in common with another. Heuristics are also context-dependent, and it is assumed that they will be used by someone who has the competence and skill to use them wisely. So for each one, I’ve listed the heuristic and included at least one argument for not using the heuristic, or for questioning it.

    1. The Time’s Up! Heuristic. This, for many testers, is the most common one: we stop testing when the time allocated for testing has expired.

    Have we obtained the information that we need to know about the product? Is the risk of stopping now high enough that we might want to go on testing? Was the deadline artificial or arbitrary? Is there more development work to be done, such that more testing work will be required?

    2. The Piñata Heuristic. We stop whacking the program when the candy starts falling out—we stop the test when we see the first sufficiently dramatic problem.

    Might there be some more candy stuck in the piñata’s leg? Is the first dramatic problem the most important problem, or the only problem worth caring about? Might we find other interesting problems if we keep going? What if our impression of “dramatic” is misconceived, and this problem isn’t really a big deal?

    3. The Dead Horse Heuristic. The program is too buggy to make further testing worthwhile. We know that things are going to be modified so much that any more testing will be invalidated by the changes.

    The presumption here is that we’ve already found a bunch of interesting or important stuff. If we stop now, will we miss something even more important or more interesting?

    4. The Mission Accomplished Heuristic. We stop testing when we have answered all of the questions that we set out to answer.

    Our testing might have revealed important new questions to ask. This leads us to the Rumsfeld Heuristic: “There are known unknowns, and there are unknown unknowns.” Has our testing moved known unknowns sufficiently into the known space? Has our testing revealed any important new known unknowns? And a hard-to-parse but important question: Are we satisfied that we’ve moved the unknown unknowns sufficiently towards the knowns, or at least towards known unknowns?

    5. The Mission Revoked Heuristic. Our client has told us, “Please stop testing now.” That might be because we’ve run out of budget, or because the project has been cancelled, or any number of other things. Whatever the reason is, we’re mandated to stop testing. (In fact, Time’s Up might sometimes be a special case of the more general Mission Revoked, if it’s the client, rather than ourselves, who has made the decision that time’s up.)

    Is our client sufficiently aware of the value of continuing to test, or the risk of not continuing? If we disagree with the client, are we sufficiently aware of the business reasons to suspend testing?

    6. The I Feel Stuck! Heuristic. For whatever reason, we stop because we perceive there’s something blocking us. We don’t have the information we need (many people claim that they can’t test without sufficient specifications, for example). There’s a blocking bug, such that we can’t get to the area of the product that we want to test; we don’t have the equipment or tools we need; we don’t have the expertise on the team to perform some kind of specialized test.

    There might be any number of ways to get unstuck. Maybe we need help, or maybe we just need a pause (see below). Maybe more testing would allow us to learn what we need to know. Maybe the whole purpose of testing is to explore the product and discover the missing information. Perhaps there’s a workaround for the blocking bug; the tools and equipment might be available, but we don’t know about them, or we haven’t asked the right people in the right way; there might be experts available to us, either on the testing team, among the programmers, or on the business side, and we don’t realize it. There’s a difference between feeling stuck and being stuck.

    7. The Pause That Refreshes Heuristic. Instead of stopping testing, we suspend it for a while. We might stop testing and take a break when we’re tired, or bored, or uninspired to test. We might pause to do some research, to do some planning, to reflect on what we’ve done so far, the better to figure out what to do next. The idea here is that we need a break of some kind, and can return to the product later with fresh eyes or fresh minds.

    There’s another kind of pause, too: We might stop testing some feature because another has higher priority for the moment.

    Sure, we might be tired or bored, but is it more important for us to hang in there and keep going? Might we learn what we need to learn more efficiently by interacting with the program now, rather than doing work offline? Might a crucial bit of information be revealed by just one more test? Is the other “priority” really a priority? Is it ready for testing? Have we already tested it enough for now?

    8. The Flatline Heuristic. No matter what we do, we’re getting the same result. This can happen when the program has crashed or has become unresponsive in some way, but we might get flatline results when the program is especially stable, too—”looks good to me!”

    Has the application really crashed, or might it be recovering? Is the lack of response in itself an important test result? Does our idea of “no matter what we do” incorporate sufficient variation or load to address potential risks?

    9. The Customary Conclusion Heuristic. We stop testing when we usually stop testing. There’s a protocol in place for a certain number of test ideas, or test cases, or test cycles or variation, such that there’s a certain amount of testing work that we do, and we stop when that’s done. Agile teams (say that they) often implement this approach: “When all the acceptance tests pass, then we know we’re ready to ship.” Ewald Roodenrijs gives an example of this heuristic in his blog post titled When Does Testing Stop? He says he stops “when a certain amount of test cycles has been executed including the regression test”.

    This differs from “Time’s Up”, in that the time dimension might be more elastic than some other dimension. Since many projects seem to be dominated by the schedule, it took a while for James and me to realize that this one is in fact very common. We sometimes hear “one test per requirement” or “one positive test and one negative test per requirement” as a convention for establishing good-enough testing. (We don’t agree with it, of course, but we hear about it.)

    Have we sufficiently questioned why we always stop here? Should we be doing more testing as a matter of course? Less? Is there information available—say, from the technical support department, from Sales, or from outside reviewers—that would suggest that changing our patterns might be a good idea? Have we considered all the other heuristics?

    10. The No More Interesting Questions Heuristic. At this point, we’ve decided that no questions have answers sufficiently valuable to justify the cost of continuing to test, so we’re done. This heuristic tends to inform the others, in the sense that if a question or a risk is sufficiently compelling, we’ll continue to test rather than stopping.

    How do we feel about our risk models? Are we in danger of running into a Black Swan—or a White Swan that we’re ignoring? Have we obtained sufficient coverage? Have we validated our oracles?

    11. The Avoidance/Indifference Heuristic. Sometimes people don’t care about more information, or don’t want to know what’s going on in the program. The application under test might be a first cut that we know will be replaced soon. Some people decide to stop testing because they’re lazy, malicious, or unmotivated. Sometimes the business reasons for releasing are so compelling that no problem that we can imagine would stop shipment, so no new test result would matter.

    If we don’t care now, why were we testing in the first place? Have we lost track of our priorities? If someone has checked out, why? Sometimes businesses get less heat for not knowing about a problem than they do for knowing about a problem and not fixing it—might that be in play here?

    Update: Cem Kaner has suggested one more:  Mission Rejected, in which the tester himself or herself declines to continue testing.  Have a look here.

    Any more ideas? Feel free to comment!