
When Do We Stop a Test?

Several years ago, around the time I started teaching Rapid Software Testing, my co-author James Bach recorded a video to demonstrate rapid stress testing. In this case, the approach involved throwing an overwhelming amount of data at an application’s wizard, essentially getting the application to stress itself out.

The video goes on for almost six minutes. About halfway through, James asks, “You might be asking why I don’t stop now. The reason is that we’re seeing a steadily worsening pattern of failure. We could stop now, but we might see something even worse if we keep going.” And so the test does keep going. A few moments later, James provides the stopping heuristics: we stop when 1) we’ve found a sufficiently dramatic problem; or 2) there’s no apparent variation in the behaviour of the program—the program is essentially flat-lining; or 3) the value of continuing doesn’t justify the cost. Those were the stopping heuristics for that stress test.
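
To make those three stopping conditions a little more concrete, here is a minimal sketch—in Python, since the video shows no harness code—of how an automated stress loop might encode them. Everything in it (the observe() stub, the severity scale, the thresholds) is a hypothetical illustration of the shape of the decision, not James’s actual setup or any real tool.

```python
import time

DRAMATIC_SEVERITY = 9   # how severe a failure must be before we call it "dramatic"
FLATLINE_WINDOW = 20    # identical observations in a row before we call it a flatline
BUDGET_SECONDS = 300    # rough proxy for "the value of continuing vs. the cost"

def observe(payload_size):
    """Stand-in for driving the wizard with `payload_size` units of data and
    scoring whatever failure (if any) we see. Hypothetical; replace with real probing."""
    return {"severity": 0, "signature": "no visible failure"}

def stress_until_a_stopping_heuristic_fires():
    start = time.monotonic()
    recent_signatures = []
    payload_size = 1
    while True:
        result = observe(payload_size)

        # 1. We've found a sufficiently dramatic problem.
        if result["severity"] >= DRAMATIC_SEVERITY:
            return "stopped: dramatic problem ({})".format(result["signature"])

        # 2. No apparent variation in the behaviour of the program -- it's flat-lining.
        recent_signatures.append(result["signature"])
        if (len(recent_signatures) >= FLATLINE_WINDOW
                and len(set(recent_signatures[-FLATLINE_WINDOW:])) == 1):
            return "stopped: behaviour is flat-lining"

        # 3. The value of continuing doesn't justify the cost.
        if time.monotonic() - start > BUDGET_SECONDS:
            return "stopped: cost of continuing exceeds its likely value"

        payload_size *= 2  # keep escalating the stress

if __name__ == "__main__":
    print(stress_until_a_stopping_heuristic_fires())
```

In practice, of course, “sufficiently dramatic” and “worth the cost” are judgment calls that the tester makes in the moment; a script like this can only approximate them with fixed thresholds.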

About a year after I first saw the video, I wanted to prepare a Better Software column on more general stopping heuristics, so James and I had a transpection session. The column is here. About a year after that, the column turned into a lightning talk that I gave in a few places.

About six months after that, we had both recognized even more common stopping heuristics. We were talking them over at STAR East 2009 when Dale Emery and James Lyndsay walked by, and they also contributed to the discussion. In particular, Dale offered that in combat, the shooting might stop in several ways: a lull, “hold your fire”, “ceasefire”, “at ease”, “stand down”, and “disarm”. I thought that was interesting.

Anyhow, here’s where we are so far. I emphasize that these stopping heuristics are heuristics. Heuristics are quick, inexpensive ways of solving a problem or making a decision. Heuristics are fallible—that is, they might work, and they might not work. Heuristics tend to be leaky abstractions, in that one might have things in common with another. Heuristics are also context-dependent, and it is assumed that they will be used by someone who has the competence and skill to use them wisely. So for each one, I’ve listed the heuristic and included at least one argument for not using it, or for questioning it.

1. The Time’s Up! Heuristic. This, for many testers, is the most common one: we stop testing when the time allocated for testing has expired.

Have we obtained the information that we need about the product? Is the risk of stopping now high enough that we might want to go on testing? Was the deadline artificial or arbitrary? Is there more development work to be done, such that more testing work will be required?

2. The Piñata Heuristic. We stop whacking the program when the candy starts falling out—we stop the test when we see the first sufficiently dramatic problem.

Might there be some more candy stuck in the piñata’s leg? Is the first dramatic problem the most important problem, or the only problem worth caring about? Might we find other interesting problems if we keep going? What if our impression of “dramatic” is misconceived, and this problem isn’t really a big deal?

3. The Dead Horse Heuristic. The program is too buggy to make further testing worthwhile. We know that things are going to be modified so much that any more testing will be invalidated by the changes.

The presumption here is that we’ve already found a bunch of interesting or important stuff. If we stop now, will we miss something even more important or more interesting?

4. The Mission Accomplished Heuristic. We stop testing when we have answered all of the questions that we set out to answer.

Our testing might have revealed important new questions to ask. This leads us to the Rumsfeld Heuristic: “There are known unknowns, and there are unknown unknowns.” Has our testing moved known unknowns sufficiently into the known space? Has our testing revealed any important new known unknowns? And a hard-to-parse but important question: Are we satisfied that we’ve moved the unknown unknowns sufficiently towards the knowns, or at least towards known unknowns?

5. The Mission Revoked Heuristic. Our client has told us, “Please stop testing now.” That might be because we’ve run out of budget, or because the project has been cancelled, or any number of other things. Whatever the reason is, we’re mandated to stop testing. (In fact, Time’s Up might sometimes be a special case of the more general Mission Revoked, if it’s the client, rather than ourselves, who has made the decision that time’s up.)

Is our client sufficiently aware of the value of continuing to test, or the risk of not continuing? If we disagree with the client, are we sufficiently aware of the business reasons to suspend testing?

6. The I Feel Stuck! Heuristic. For whatever reason, we stop because we perceive there’s something blocking us. We don’t have the information we need (many people claim that they can’t test without sufficient specifications, for example). There’s a blocking bug, such that we can’t get to the area of the product that we want to test; we don’t have the equipment or tools we need; we don’t have the expertise on the team to perform some kind of specialized test.

There might be any number of ways to get unstuck. Maybe we need help, or maybe we just need a pause (see below). Maybe more testing might allow us to learn what we need to know. Maybe the whole purpose of testing is to explore the product and discover the missing information. Perhaps there’s a workaround for the blocking bug; the tools and equipment might be available, but we don’t know about them, or we haven’t asked the right people in the right way; there might be experts available to us, either on the testing team, among the programmers, or on the business side, and we don’t realize it. There’s a difference between feeling stuck and being stuck.

7. The Pause That Refreshes Heuristic. Instead of stopping testing, we suspend it for a while. We might stop testing and take a break when we’re tired, or bored, or uninspired to test. We might pause to do some research, to do some planning, to reflect on what we’ve done so far, the better to figure out what to do next. The idea here is that we need a break of some kind, and can return to the product later with fresh eyes or fresh minds.

There’s another kind of pause, too: We might stop testing some feature because another has higher priority for the moment.

Sure, we might be tired or bored, but is it more important for us to hang in there and keep going? Might we learn what we need to learn more efficiently by interacting with the program now, rather than doing work offline? Might a crucial bit of information be revealed by just one more test? Is the other “priority” really a priority? Is it ready for testing? Have we already tested it enough for now?

8. The Flatline Heuristic. No matter what we do, we’re getting the same result. This can happen when the program has crashed or has become unresponsive in some way, but we might get flatline results when the program is especially stable, too—“looks good to me!”

Has the application really crashed, or might it be recovering? Is the lack of response in itself an important test result? Does our idea of “no matter what we do” incorporate sufficient variation or load to address potential risks?
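
As a hedged illustration of the first of those questions—distinguishing “really crashed” from “still recovering”—here is a small, hypothetical Python sketch of a liveness probe with a recovery budget. The is_responsive() stub and the timings are assumptions made up for the example, not part of any real tool.

```python
import time

def is_responsive():
    """Stand-in for a real liveness probe (e.g. poking a UI element or a health
    endpoint). Hypothetical; always answers 'no' here so the example terminates."""
    return False

def wait_for_recovery(budget_seconds=60.0, initial_delay=1.0):
    """Return True if the application starts responding again within the budget,
    False if it stays flat the whole time."""
    deadline = time.monotonic() + budget_seconds
    delay = initial_delay
    while time.monotonic() < deadline:
        if is_responsive():
            return True   # not a flatline after all: it was recovering
        time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
        delay *= 2        # back off so the probe itself doesn't add load
    return False          # still flat: the lack of response is itself a result worth reporting

if __name__ == "__main__":
    print("recovered" if wait_for_recovery(budget_seconds=3.0) else "flatlined")
```

Even then, “flatlined” is a report to interpret, not a verdict: the unresponsiveness may be the most important finding of the session.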

9. The Customary Conclusion Heuristic. We stop testing when we usually stop testing. There’s a protocol in place for a certain number of test ideas, or test cases, or test cycles or variation, such that there’s a certain amount of testing work that we do, and we stop when that’s done. Agile teams (say that they) often implement this approach: “When all the acceptance tests pass, then we know we’re ready to ship.” Ewald Roodenrijs gives an example of this heuristic in his blog post titled When Does Testing Stop? He says he stops “when a certain amount of test cycles has been executed including the regression test”.

This differs from “Time’s Up”, in that the time dimension might be more elastic than some other dimension. Since many projects seem to be dominated by the schedule, it took a while for James and me to realize that this one is in fact very common. We sometimes hear “one test per requirement” or “one positive test and one negative test per requirement” as a convention for establishing good-enough testing. (We don’t agree with it, of course, but we hear about it.)

Have we sufficiently questioned why we always stop here? Should we be doing more testing as a matter of course? Less? Is there information available—say, from the technical support department, from Sales, or from outside reviewers—that would suggest that changing our patterns might be a good idea? Have we considered all the other heuristics?

10. The No More Interesting Questions Heuristic. At this point, we’ve decided that no questions have answers sufficiently valuable to justify the cost of continuing to test, so we’re done. This heuristic tends to inform the others, in the sense that if a question or a risk is sufficiently compelling, we’ll continue to test rather than stopping.

How do we feel about our risk models? Are we in danger of running into a Black Swan—or a White Swan that we’re ignoring? Have we obtained sufficient coverage? Have we validated our oracles?

11. The Avoidance/Indifference Heuristic. Sometimes people don’t care about more information, or don’t want to know what’s going on in the program. The application under test might be a first cut that we know will be replaced soon. Some people decide to stop testing because they’re lazy, malicious, or unmotivated. Sometimes the business reasons for releasing are so compelling that no problem that we can imagine would stop shipment, so no new test result would matter.

If we don’t care now, why were we testing in the first place? Have we lost track of our priorities? If someone has checked out, why? Sometimes businesses get less heat for not knowing about a problem than they do for knowing about a problem and not fixing it—might that be in play here?

Update: Cem Kaner has suggested one more: Mission Rejected, in which the tester himself or herself declines to continue testing. Have a look here.

Any more ideas? Feel free to comment!

38 replies to “When Do We Stop a Test?”

  1. Michael,

    I like how you've examined the question from several different valid perspectives and raised relevant questions for each. If anyone wants a bite-sized intro to how context-driven-testing thinking is applied, this would make a nice example.

    It is the opposite of a one-size-fits-all "You should stop testing when A, B, and C are true" approach (which might be disappointing to someone looking for a nice, easy answer), and yet it is helpful in framing the problem and hopefully getting testers to the right answer for their particular project.

    PS, I was enjoying beers with a friend at Brixx in Chapel Hill the other day and thought back to our post-TISQA conference conversation. I enjoyed that.

    – Justin Hunter

  2. There is a dangerous version of heuristic 11, The Avoidance/Indifference Heuristic,
    which we can call The Hidden Stop Heuristic

    It is when the tester seems to be testing, but in reality only performs tests without looking at what happens or what the results are—e.g. executing tests that you know will give the same result as last time. You have stopped testing, but it looks like you are testing.

    One argument for questioning this heuristic is that sometimes this behavior is demanded; another is that the indifferent approach gives diversity to your test approach, which might find new things.

    Then we have the Heureka! Heuristic (a version of 7, The Pause That Refreshes Heuristic), which is when you realize something very important, e.g. a new type of error, so you start looking at other features/products for the same type of error, or you start talking about this very special thing all the time.

    Is this really as important as you think? Can you really skip/pause the other things you were supposed to be doing? Maybe one bug report of the same type is enough? Time to move on?

  3. Michael, great list, but for a change I think I have something to add to your well thought out heuristics. I posted my suggestion on my blog, but the gist of it is that I think the seventh heuristic "Pause and Refresh" should be broken out into two, "Change in Priorities" and "Lights are Off".

    Let me know what you think. – Joe

  4. Michael –

    Probably about 8 out of 10 times, as testers, we would not ask this question at all. Hence answering the question is out of scope. Time given for testing is always less than what is actually needed. Hence testing is invariably cut short. This is from my experience in IT.

    The only time this question becomes relevant is at the time of estimation, when a tester is asked about the time required to test (not when testing would be "complete" in some sense). But in this case, the heuristics mentioned will not be of use, as they apply to situations where the "thing is in motion".

    Shrini

  5. @Hannu

    Thanks for the suggestion.

    If it's been automated, and the automation has been run, that's a case of "mission accomplished", isn't it? And the counter to it is, "So we accomplished some testing mission—a mission that we set out some time ago; are we sure that we don't have any important questions that have come up since then?"

    —Michael B.

  6. @justin

    Thank you for the comment and the compliment. Conversations are the heart of conferences! (Those reading who are interested in combinatorics as testing heuristics should check out Justin's work and his product/company, Hexawise.)

    @shrini

    Remember, first, that the premise of the stopping heuristics is that they can be applied to a single test case or test idea, not just a test cycle or a development project. Second, they assume sapience (that is, that the decision to stop is not determined by an existing decision rule) and they assume that the tester has at least some autonomy, and therefore some influence on the decision when to stop. I like your turn of phrase "thing in motion". The "thing in motion" could be a test cycle, but it could also be a test idea or an observation within one step in an otherwise highly scripted test.

    As Justin suggests, that's a context in which context-driven thinking can be applied, even if the rest of the work is not context-driven. As James Bach points out, there are circumstances in which context-driven thinking might be a bad idea, and one of them is the case in which someone else, someone other than the tester, is entirely responsible for the quality of the work. If the tester isn't responsible for the decision, or is non-sapient, then the stopping heuristic is covered by "mission accomplished", "mission revoked", or "time's up".

    As for the business of "insufficient time to test" or testing being "cut short", I don't think that's our decision to make in any case. I think you mean "less testing time than I'd like" or "a shorter schedule than is necessary to accomplish this set of tasks" or "less time than we think we need to find the bugs that are there". Yet testing is a service to the project; we don't run it. Our ideals don't matter compared to what our clients want. And they often believe they want to ship more than anything else. That's fine; it's their business, and their risk to assume. What does matter, though, is that we provide the best service we can in the time that is available to us, whether that time is luxuriously long or ludicrously short to match the business' goals, whether it's what we'd like or whether it's "cut short". It's like being a waiter, or a salesman in a clothing store; we don't get to decide how long the client wants our services. (More on that simile at http://www.developsense.com/articles/2009-05-IssuesAboutMetricsAboutBugs.pdf).

    Sometimes when I'm testing I have to stop because I've raised a lot of bugs and they are piling up too high. The developer gets overwhelmed and only fixes some of the bugs. I suppose it could be a version of the Dead Horse Heuristic, but perhaps it's worthy of its own heuristic. Perhaps a "Have a Kit Kat" heuristic?

    #11 is really prevalent in the startup community, where pragmatism almost inevitably leads to cowboy development. The unstated assumption is that the code currently produced is merely a proof of concept for demonstration purposes, and will be completely refactored or redesigned at a later date when there’s time to “do it right.” Unfortunately that time never actually seems to arrive. In other words, today’s untested demo code is tomorrow’s base data class.

  9. Michael,

    I would offer another angle on this. There was a recent question posed to me that went something like this:

    “If you’re writing testware for the system under test, then who writes the tests for your testware?” My answer is basically that I will test-drive my testware. This essentially is the same as saying “test each adjacent layer.”

    Thoughts?

    Michael replies: Hmm… you might want to elaborate on that. Is saying that you will test-drive your testware essentially the same as saying “test each adjacent layer”? Each adjacent layer of what? What do you mean by “adjacent layer”?

  10. Michael Bolton has a range of other heuristics for when to stop a test (many variations available)

    When to stop testing? – This is a decision.

    Can we categorize this as a decision making heuristic?

    Michael replies: Yes; the post is a set of heuristic ideas for when to stop testing.

    I have also read about other heuristics from ET experts; my understanding is that each one is for a different purpose:

    For exploration
    For asking questions
    For testing, etc.

    Over time, I, as a tester, may have thousands of heuristics, learnt from others and also my own… how do I manage this… any ideas?

    Lots of ways. The first step is to recognize that your brain is the principal repository for what’s most important about managing heuristics: the skill and the judgement to apply them appropriately. That said, there are all kinds of approaches to cataloging. You can create hierarchical lists or taxonomies; unordered lists; mind maps; diagrams; tables; stories; works of fiction; wikis… the possibilities are endless. You can use computers, paper notebooks, index cards, wall charts, three-ring binders… Elisabeth Hendrickson prints her Test Heuristic Cheat Sheet on coffee mugs, an idea that I still intend to steal some day. Experiment, and try various things. Choose the one, or ones, that work for you. (Note that finding a means of cataloging heuristics is an ongoing, heuristic process.)

    @Michael: Thanks for the reply.

    I got the answer in brief for categorizing heuristics. Based on your statement, I have one more question:

    “Experiment, and try various things. Choose the one, or ones, that work for you. (Note that finding a means of cataloging heuristics is an ongoing, heuristic process.)”

    Suppose I have a testing problem to solve, and I have started to browse categorized lists of heuristics to select which are best suited to solve the problem. After some time I realize I am constrained by time limits. Now I stop searching and start applying; I may miss coverage in this process.

    In this situation I would have a result, but not a satisfactory one.

    In the above scenario I feel there are other factors, like:
    1) how good you are at selecting heuristics
    2) how skilled you are at applying heuristics
    3) how much pressure you can handle if you are constrained by time limits
    4) confidence level in your results & judgement

    Any thoughts on how this situation can be handled better?

    […] Exhaustive investigation of all possible boundary conditions is very time-consuming. At the same time, after any code change to the particular functionality, all boundary investigation results become obsolete. That is, unless you have infinite time, the primary goal is not to find all boundary bugs but to look until we find the first important one, and then move on to another piece of functionality (Why? Please read about “The Dead Horse” and other stopping heuristics). […]

  14. Might you stop if you had reason to suspect someone on your team couldn’t be trusted with information? Or if there were a fire in the building?

    These (rather odd) cases could fall under a “Mission Postponed” or “Mission Modified” heuristic, although those names implicitly suggest that the original mission was revoked or rejected. However, I don’t feel these fit under the descriptions “revoked” or “rejected”, because those carry no notion or assumption of continuing later.

    Michael replies: I have a question for you: did you sweat and ponder to come up with these weird, completely exceptional cases, or did these weird, completely exceptional cases just pop into your head? In either case, you have my admiration. 🙂

    To answer your question, though, I’d need more details on who you are in this scenario, and what your options are. If you’re the manager, and you suspect the tester of impropriety, and you order him to stop testing, I’d file it under Mission Revoked (also known as Mission Abandoned). If you’re a colleague and you’re worried that one of your colleagues is untrustworthy and you stop testing thereby, I’d call that Mission Rejected. If the building is on fire, it could be Mission Rejected or Pause, but more probably I Feel Stuck.

    But here’s something more important than any of those things, I think. All of these lists of heuristics that we develop can be used in two ways. One is retrospectively, when we’re trying to look back and explain a decision that we’ve made. This is especially important with identifying our oracles—why we see something as a problem—or with other things that we need to explain or justify. The second way to use the heuristics is generatively, to trigger ideas that lead to observations or decisions. Either way, any particular classification of an observation or decision usually isn’t quite as important as the fact that you’ve made the observation or the decision; that’s the gold.

  15. Not very scientific heuristics.

    Michael replies: Well, that’s a pretty compelling argument you’ve presented there.

    But if you’re interested in having a conversation, what’s your concept of a heuristic? To me, a heuristic is a fallible method for solving a problem or making a decision. Heuristics (as pointed out in Billy Vaughn Koen’s foundational book, Discussion of the Method) are at the core of engineering work; indeed, Koen makes a compelling case that all knowledge and all methods are heuristic.

    What’s your concept of “scientific”? Wikipedia suggests that “To be termed scientific, a method of inquiry must be based on gathering empirical and measurable evidence subject to specific principles of reasoning.” This is not antithetical to the use of heuristics. Indeed, science itself is entirely based in heuristics. The principle behind the scientific method is that all single experiments are fallible and open to alternative interpretations, and that any matter of scientific fact is a provisional conclusion, based on reasoning to the best explanation so far. (Since challenges to the infallibility of the scientific method are relatively new—dating back only three hundred and fifty years or so—they may have escaped the attention of dedicated neo-Platonists.)

    Or is your concern that heuristically-based approaches are intrinsically unscientific? If so, have you looked at Paul Feyerabend’s Against Method, in which he shows that the progress of science itself has been decidedly unscientific?

    Is there something non-empirical or unmeasurable about the stopping heuristics that I’ve provided? Is your objection that most of these heuristics don’t use the kind of third-order measurement that physicists use? If so, you might like to consider that most of the important aspects of software development and of software’s value are rooted in self-aware and social systems. In such domains, third-order measurement is not only inaccurate and inappropriate (look here) but also leads to distortion and dysfunction (look here).

    Is your objection that heuristics are unreliable or invalid? On the one hand, that’s a reasonable objection because heuristics are by definition not perfectly reliable. On the other hand, it’s not like more algorithmic methods are by their nature more trustworthy simply because they’re algorithmic. After all, an algorithm can be applied in an inappropriate context, or can be based on an invalid model. The Weibull distribution is a classic example. Cem Kaner and Walter (“Pat”) Bond give a detailed refutation of its applicability to software bug-finding metrics here.

    Most of all I’m curious about how the list of stopping heuristics I’ve provided is any less scientific than this list of heuristics that I found on the page pointed to by the link that you kindly provided:

    We can release when:
    Coverage is OK

    • all the code has been exercised
    • every feature has been shown to work
    • every use case scenario has been exercised
    • all the tests have been run

    Bugs are OK

    • the number of bugs found is almost the number expected
    • the unfixed bugs are few enough

    Non-functional features are OK
    Code turmoil is OK
    Feature stability is OK

    In general, I’m inclined to agree with the items in this list and the evaluation criterion of “OK”, since both appear to be heuristically based. I do have some specific concerns though.

    All the code has been exercised. It’s possible to exercise every line of code in a product without observing a single bug, even when there are terrible problems in the product, just as it’s possible for a parking enforcement cop to walk along every street in a town without observing a single expired meter or issuing a single ticket.

    Every feature has been shown to work. That’s a nice idea too, but it’s not terribly hard, even for a profoundly broken program, to show that features can work. Excellent testing isn’t about that, though; it’s about a diligent search for possible failures. You see the difference, I hope: one approach is focused on confirmation and verification; the other is focused on exploration, discovery, investigation, and learning. The former is a very weak kind of testing; the latter much stronger.

    Every use case scenario has been exercised. That’s certainly one kind of (or heuristic for) test coverage. Thinking about it for a moment raises the possibility that there are ways of using the product that aren’t covered by the scenarios in the use cases, especially since use cases are typically crafted at the beginning of the project or development cycle, when we know less than we will ever know about the product.

    All the tests have been run. That’s a heuristic for test coverage too, but it poses some questions. What about the quality of the tests? What about the quality of the oracles that inform the tests? What about the skill of the tester? Do the tests cover the product sufficiently to address the most important risks and to inform the ship or no-ship decision?

    The number of bugs found is almost the number expected. Almost? The number expected? What’s the basis for your expectation? Is that expectation valid? What might threaten the validity of that expectation? What if there are more bugs than you expected? What if there are fewer bugs than you expected? What does “almost” mean? Everything but the last two? or three? or four bugs that on their own would demolish the value of the product?

    The unfixed bugs are few enough. I have no particular concern with the number of bugs in the product, since the number has little relevance. What does have relevance is the significance of the bugs, irrespective of their number. (As examples, Microsoft Windows 2000, a very successful product, shipped with over 60,000 open issues; in the 1960s, an Algol compiler from IBM shipped with only one bug: it wouldn’t load. People still obtained value from Windows 2000, which may be unsurprising. What might be a little more surprising is that some people obtained value from the Algol compiler too; the availability of an Algol compiler, even one that didn’t work, was sufficient to address a sales problem that IBM needed to address.)

    So, considering that you’ve presented a list of heuristics that seem to be rooted in un- or quasi-scientific principles at best, colour me confused about your objections to mine.

    Thanks for writing.

