Blog Posts for the ‘Estimation’ Category

Why Is Testing Taking So Long? (Part 1)

Tuesday, November 24th, 2009

If you’re a tester, you’ve probably been asked, “Why is testing taking so long?” Maybe you’ve had a ready answer; maybe you haven’t. Here’s a model that might help you deal with the kind of manager who asks such questions.

Let’s suppose that we divide our day of testing into three sessions, each session being, on average, 90 minutes of chartered, uninterrupted testing time. That’s four and a half hours of testing, which seems reasonable in an eight-hour day interrupted by meetings, planning sessions, working with programmers, debriefings, training, email, conversations, administrivia of various kinds, lunch time, and breaks.

The reason that we’re testing is that we want to obtain coverage; that is, we want to ask and answer questions about the product and its elements to the greatest extent that we can. Asking and answering questions is the process of test design and execution. So let’s further assume that we break each session into micro-sessions that average two minutes each, in which we perform some test activity that’s focused on a particular testing question, or on evaluating a particular feature. That means in a 90-minute session, we can theoretically perform 45 of these little micro-sessions, which for the sake of brevity we’ll informally call “tests”. Of course life doesn’t really work this way; a test idea might take a couple of seconds to implement, or it might take all day. But I’m modeling here, making this rather gross simplification to clarify a more complex set of dynamics. (Note that if you’d like to take a really impoverished view of what happens in skilled testing, you could say that a “test case” takes two minutes. But I leave it to my colleague James Bach to explain why you should question the concept of test cases.)

Let’s further suppose that we’ll find problems every now and again, which means that we have to do bug investigation and reporting. This is valuable work for the development team, but it takes time that interrupts test design and execution—the stuff that yields test coverage. Let’s say that, for each bug that we find, we must spend an extra eight minutes investigating it and preparing a report. Again, this is a pretty dramatic simplification. Investigating a bug might take all day, and preparing a good report could take time on the order of hours. Some bugs (think typos and spelling errors in the UI) leap out at us and don’t call for much investigation, so they’ll take less than eight minutes. Even though eight minutes is probably a dramatic underestimate for investigation and reporting, let’s go with that. So a test activity that doesn’t find a problem costs us two minutes, and a test activity that does find a problem takes ten minutes.
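
If it helps to see this little model as code, here’s a minimal sketch of the arithmetic in Python. The function and parameter names are mine, invented purely for illustration; the numbers are the ones assumed above.

    # A sketch of the model above: each micro-session ("test") takes two minutes,
    # and a test that finds a bug costs a further eight minutes of investigation
    # and reporting. Function and parameter names are illustrative only.
    def tests_in_session(bugs_found, session_minutes=90,
                         minutes_per_test=2, extra_minutes_per_bug=8):
        """Return the number of tests we can perform in one session."""
        bug_work = bugs_found * (minutes_per_test + extra_minutes_per_bug)
        time_left = session_minutes - bug_work
        return bugs_found + time_left // minutes_per_test

    print(tests_in_session(0))   # 45 -- a session with no bugs yields all 45 two-minute tests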

Now, let’s imagine one more thing: we have perfect testing prowess. If there’s a problem in an area that we’re testing, we’ll find it, and we’ll never enter a bogus report, either. Yes, this is a thought experiment.

One day we come into work, and we’re given three modules to test.

The morning session is taken up with Module A, from Development Team A. These people are amazing, hyper-competent. They use test-first programming, and test-driven design. They work closely with us, the testers, to design challenging unit checks, scriptable interfaces, and log files. They use pair programming, and they review and critique each other’s work in an egoless way. They refactor mercilessly, and run suites of automated checks before checking in code. They brush their teeth and floss after every meal; they’re wonderful. We test their work diligently, but it’s really a formality because they’ve been testing and we’ve been helping them test all along. In our 90-minute testing session, we don’t find any problems. That means that we’ve performed 45 micro-sessions, and have therefore obtained 45 units of test coverage.

Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests
A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45

The first thing after lunch, we have a look at Team B’s module. These people are very diligent indeed. Most organizations would be delighted to have them on board. Like Team A, they use test-first programming and TDD, they review carefully, they pair, and they collaborate with testers. But they’re human. When we test their stuff, we find a bug very occasionally; let’s say once per session. The test that finds the bug takes two minutes; investigation and reporting of it takes a further eight minutes. That’s ten minutes altogether. The rest of the time, we don’t find any problems, so that leaves us 80 minutes in which we can run 40 tests. Let’s compare that with this morning’s results.

Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests
A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45
B | 10 minutes (1 test, 1 bug) | 80 minutes (40 tests) | 41

After the afternoon coffee break, we move on to Team C’s module. Frankly, it’s a mess. Team C is made up of nice people with the best of intentions, but sadly they’re not very capable. They don’t work with us at all, and they don’t test their stuff on their own, either. There’s no pairing, no review, in Team C. To Team C, if it compiles, it’s ready for the testers. The module is a dog’s breakfast, and we find bugs practically everywhere. Let’s say we find eight in our 90-minute session. Each test that finds a problem costs us 10 minutes, so we spend 80 minutes on those eight bugs. Every now and again, we happen to run a test that doesn’t find a problem; that leaves us only 10 minutes, enough for five tests that don’t find anything. (Hey, even dBase IV occasionally did something right.) Our results for the day now look like this:

Module | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | Total Tests
A | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45
B | 10 minutes (1 test, 1 bug) | 80 minutes (40 tests) | 41
C | 80 minutes (8 tests, 8 bugs) | 10 minutes (5 tests) | 13

Because of all the bugs, Module C allows us to perform thirteen micro-sessions in 90 minutes. Thirteen, where with the other modules we managed 45 and 41. Because we’ve been investigating and reporting bugs, there are 32 micro-sessions, 32 units of coverage, that we haven’t been able to obtain on this module. If we decide that we need to perform that testing (and the module’s overall badness is consistent throughout), we’re going to need at least three more sessions to cover it. Alternatively, we could stop testing now, but what are the chances of a serious problem lurking in the parts of the module we haven’t covered? So, the first thing to observe here is:
Lots of bugs means reduced coverage, or slower testing, or both.
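
If you want the “at least three more sessions” arithmetic spelled out, here’s a back-of-the-envelope sketch, again in Python and again using only the numbers assumed above; the variable names are mine.

    import math

    # Module C, per the model above: a bug-free session would yield 45 units of
    # coverage, but we obtained only 13, and the module's condition is assumed
    # to be consistently bad throughout.
    units_per_clean_session = 45
    units_obtained = 13
    missed_units = units_per_clean_session - units_obtained    # 32 units not yet covered
    extra_sessions = math.ceil(missed_units / units_obtained)  # ceil(32 / 13) = 3
    print(missed_units, extra_sessions)                        # 32 3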

There’s something else that’s interesting, too. If we are being measured based on the number of bugs we find (exactly the sort of measurement that will be taken by managers who don’t understand testing), Team A makes us look awful—we’re not finding any bugs in their stuff. Meanwhile, Team C makes us look great in the eyes of management. We’re finding lots of bugs! That’s good! How could that be bad?

On the other hand, if we’re being measured based on the test coverage we obtain in a day (which is exactly the sort of measurement that will be taken by managers who count test cases; that is, managers who probably have an even more damaging model of testing than the managers in the last paragraph), Team C makes us look terrible. “You’re not getting enough done! You could have performed 45 test cases today on Module C, and you’ve only done 13!” And yet, remember that in our scenario we started with the assumption that, no matter what the module, we always find a problem if there’s one there. That is, there’s no difference between the testers or the testing for each of the three modules; it’s solely the condition of the product that makes all the difference.

This is the first in a pair of posts. Let’s see what happens tomorrow.

When Do We Stop a Test?

Friday, September 11th, 2009

Several years ago, around the time I started teaching Rapid Software Testing, my co-author James Bach recorded a video to demonstrate rapid stress testing. In this case, the approach involved throwing an overwhelming amount of data at an application’s wizard, essentially getting the application to stress itself out.

The video goes on for almost six minutes. About halfway through, James says, “You might be asking why I don’t stop now. The reason is that we’re seeing a steadily worsening pattern of failure. We could stop now, but we might see something even worse if we keep going.” And so the test does keep going. A few moments later, James provides the stopping heuristics: we stop when 1) we’ve found a sufficiently dramatic problem; or 2) there’s no apparent variation in the behaviour of the program—the program is essentially flat-lining; or 3) the value of continuing doesn’t justify the cost. Those were the stopping heuristics for that stress test.

About a year after I first saw the video, I wanted to prepare a Better Software column on more general stopping heuristics, so James and I had a transpection session. The column is here. About a year after that, the column turned into a lightning talk that I gave in a few places.

About six months after that, we had both recognized even more common stopping heuristics. We were talking them over at STAR East 2009 when Dale Emery and James Lyndsay walked by, and they also contributed to the discussion. In particular, Dale offered that in combat, the shooting might stop in several ways: a lull, “hold your fire”, “ceasefire”, “at ease”, “stand down”, and “disarm”. I thought that was interesting.

Anyhow, here’s where we’re at so far. I emphasize that these stopping heuristics are heuristics. Heuristics are quick, inexpensive ways of solving a problem or making a decision. Heuristics are fallible—that is, they might work, and they might not work. Heuristics tend to be leaky abstractions, in that one might have things in common with another. Heuristics are also context-dependent, and it is assumed that they will be used by someone who has the competence and skill to use them wisely. So for each one, I’ve listed the heuristic and included at least one argument for not using it, or for questioning it.

1. The Time’s Up! Heuristic. This, for many testers, is the most common one: we stop testing when the time allocated for testing has expired.

Have we obtained the information that we need to know about the product? Is the risk of stopping now high enough that we might want to go on testing? Was the deadline artificial or arbitrary? Is there more development work to be done, such that more testing work will be required?

2. The Piñata Heuristic. We stop whacking the program when the candy starts falling out—we stop the test when we see the first sufficiently dramatic problem.

Might there be some more candy stuck in the piñata’s leg? Is the first dramatic problem the most important problem, or the only problem worth caring about? Might we find other interesting problems if we keep going? What if our impression of “dramatic” is misconceived, and this problem isn’t really a big deal?

3. The Dead Horse Heuristic. The program is too buggy to make further testing worthwhile. We know that things are going to be modified so much that any more testing will be invalidated by the changes.

The presumption here is that we’ve already found a bunch of interesting or important stuff. If we stop now, will we miss something even more important or more interesting?

4. The Mission Accomplished Heuristic. We stop testing when we have answered all of the questions that we set out to answer.

Our testing might have revealed important new questions to ask. This leads us to the Rumsfeld Heuristic: “There are known unknowns, and there are unknown unknowns.” Has our testing moved known unknowns sufficiently into the known space? Has our testing revealed any important new known unknowns? And a hard-to-parse but important question: Are we satisfied that we’ve moved the unknown unknowns sufficiently towards the knowns, or at least towards known unknowns?

5. The Mission Revoked Heuristic. Our client has told us, “Please stop testing now.” That might be because we’ve run out of budget, or because the project has been cancelled, or any number of other things. Whatever the reason is, we’re mandated to stop testing. (In fact, Time’s Up might sometimes be a special case of the more general Mission Revoked, if it’s the client rather than ourselves that have made the decision that time’s up.)

Is our client sufficiently aware of the value of continuing to test, or the risk of not continuing? If we disagree with the client, are we sufficiently aware of the business reasons to suspend testing?

6. The I Feel Stuck! Heuristic. For whatever reason, we stop because we perceive there’s something blocking us. We don’t have the information we need (many people claim that they can’t test without sufficient specifications, for example). There’s a blocking bug, such that we can’t get to the area of the product that we want to test; we don’t have the equipment or tools we need; we don’t have the expertise on the team to perform some kind of specialized test.

There might be any number of ways to get unstuck. Maybe we need help, or maybe we just need a pause (see below). Maybe more testing would allow us to learn what we need to know. Maybe the whole purpose of testing is to explore the product and discover the missing information. Perhaps there’s a workaround for the blocking bug; the tools and equipment might be available, but we don’t know about them, or we haven’t asked the right people in the right way; there might be experts available to us, either on the testing team, among the programmers, or on the business side, and we don’t realize it. There’s a difference between feeling stuck and being stuck.

7. The Pause That Refreshes Heuristic. Instead of stopping testing, we suspend it for a while. We might stop testing and take a break when we’re tired, or bored, or uninspired to test. We might pause to do some research, to do some planning, to reflect on what we’ve done so far, the better to figure out what to do next. The idea here is that we need a break of some kind, and can return to the product later with fresh eyes or fresh minds.

There’s another kind of pause, too: We might stop testing some feature because another has higher priority for the moment.

Sure, we might be tired or bored, but is it more important for us to hang in there and keep going? Might we learn what we need to learn more efficiently by interacting with the program now, rather than doing work offline? Might a crucial bit of information be revealed by just one more test? Is the other “priority” really a priority? Is it ready for testing? Have we already tested it enough for now?

8. The Flatline Heuristic. No matter what we do, we’re getting the same result. This can happen when the program has crashed or has become unresponsive in some way, but we might get flatline results when the program is especially stable, too—”looks good to me!”

Is the application really crashed, or might it be recovering? Is the lack of response in itself an important test result? Does our idea of “no matter what we do” incorporate sufficient variation or load to address potential risks?

9. The Customary Conclusion Heuristic. We stop testing when we usually stop testing. There’s a protocol in place for a certain number of test ideas, or test cases, or test cycles or variation, such that there’s a certain amount of testing work that we do, and we stop when that’s done. Agile teams (say that they) often implement this approach: “When all the acceptance tests pass, then we know we’re ready to ship.” Ewald Roodenrijs gives an example of this heuristic in his blog post titled When Does Testing Stop? He says he stops “when a certain amount of test cycles has been executed including the regression test”.

This differs from “Time’s Up”, in that the time dimension might be more elastic than some other dimension. Since many projects seem to be dominated by the schedule, it took a while for James and me to realize that this one is in fact very common. We sometimes hear “one test per requirement” or “one positive test and one negative test per requirement” as a convention for establishing good-enough testing. (We don’t agree with it, of course, but we hear about it.)

Have we sufficiently questioned why we always stop here? Should we be doing more testing as a matter of course? Less? Is there information available—say, from the technical support department, from Sales, or from outside reviewers—that would suggest that changing our patterns might be a good idea? Have we considered all the other heuristics?

10. No more interesting questions. At this point, we’ve decided that no questions have answers sufficiently valuable to justify the cost of continuing to test, so we’re done. This heuristic tends to inform the others, in the sense that if a question or a risk is sufficiently compelling, we’ll continue to test rather than stopping.

How do we feel about our risk models? Are we in danger of running into a Black Swan—or a White Swan that we’re ignoring? Have we obtained sufficient coverage? Have we validated our oracles?

11. The Avoidance/Indifference Heuristic. Sometimes people don’t care about more information, or don’t want to know what’s going on in the program. The application under test might be a first cut that we know will be replaced soon. Some people decide to stop testing because they’re lazy, malicious, or unmotivated. Sometimes the business reasons for releasing are so compelling that no problem that we can imagine would stop shipment, so no new test result would matter.

If we don’t care now, why were we testing in the first place? Have we lost track of our priorities? If someone has checked out, why? Sometimes businesses get less heat for not knowing about a problem than they do for knowing about a problem and not fixing it—might that be in play here?

Update: Cem Kaner has suggested one more: Mission Rejected, in which the tester himself or herself declines to continue testing. Have a look here.

Any more ideas? Feel free to comment!

Test Estimation Is Really Negotiation

Thursday, August 20th, 2009

Some of this posting is based on a conversation from a little while back on TestRepublic.com.

If anyone has a problem with “test estimation”, here’s a thought experiment:

Your manager (your client) wants to give you an assignment: to evaluate someone’s English skills, with the intention of qualifying him to work with your team. So how long would it take you to figure out whether a Spanish-speaking person spoke English well enough to join your team? Ponder that for a second, and then consider a few different kinds of Spanish-speaking people:

1) The fellow who, in response to every question you ask in English, replies, “Que?”

2) The fellow who speaks very articulately, until you mention the word “array”. And then he says, “Que?”

3) The fellow who spouts all kinds of technical talk perfectly, but when you say, “Let’s go for lunch,” says “Que?”

4) The fellow who speaks perfectly clearly, but every now and then spouts an obscenity.

5) The fellow who speaks English perfectly, but has no technical ability whatsoever.

6) The fellow who has great technical chops and speaks better English than the Queen, but spits tobacco juice in the corner every minute and a half.

How long you need to test a candidate’s capacity to speak English isn’t a question that has a firm answer, since the answer surely depends on

a) the candidate;
b) the extent to which you and the client want to examine them;
c) the mission upon which the candidate will be sent;
d) the information that you discover about the candidate;
e) the demands and schedule of the project for which you’re qualifying candidates;
f) the criteria upon which your client will decide they have enough information;
g) the amount of money and resources that the client is prepared to give you for your evaluation;
h) the amount of time that the client is prepared to give you.

So, yes, you can provide an estimate. Your client will often demand one. Mind you, since (h) is going to constrain your answer every time, you might as well start by asking the client how long you have to test. If the client answers with a date or a time, you don’t have to estimate how long it’s going to take you.

Suppose the client doesn’t provide a date. Do you know anything about the candidate? Before the interview, you find out that he’s only ever been a rickshaw driver; no previous experience with testing; no previous experience with computers. He speaks no English, but has a habit of screaming at the top of his lungs once every twenty minutes. In this case, you probably don’t have to estimate. It would take less time to report to your client that the candidate is likely to be unsuitable than it would to prepare an estimate for how long it will take to evaluate him. Why bother?

So here’s another candidate. This woman has been working at Microsoft for ten years, the first eight as a tester and the last two as a test lead. Her references have all checked out. The mission is to test a text-only Website of three pages, no programmatic features. In this case, you probably won’t have to estimate. It would take less time to report to your client that the candidate is likely to be qualified (or overqualified) than it would to prepare an estimate. Why bother?

The information that you discover in your evaluation of the candidate’s English skills is to a large degree unpredictable. The problem that sinks him might not be related to his English, and you might not discover a crucial problem until after he’s been hired. The problems that you discover might be deemed insufficient to disqualify him from the job, since ultimately it’s the manager who’s going to decide.

So instead of thinking about estimation in testing, think about negotiation. Testing is an open-ended task, and it must respond to development work. The quality of that development work and the problems that we might find are open questions (if they weren’t, we wouldn’t be testing). In addition, the decision to ship the product (which includes a decision to stop testing) is a business decision, not a technical one.

In cases where you don’t know things about the candidate, you can certainly propose a suite of questions and exercises that you’ll put them through, and negotiate that with the client. In the case of the first candidate, the very first bit of information that you receive is likely to change all of your choices about what to ask them and how you’re going to test them. In the second case, your interview will probably be quick too, but for the opposite reason. It’s in the cases in between, when you’re dealing with uncertainty and want to dispel it, that your testing will take somewhat longer and will require probing and investigation of questions that arise during the interview—and that may require extra time that you may have to negotiate with your client. One thing is for sure: you probably don’t want to spend so much time designing the protocol that it has a serious negative impact on your interviewing time, right?

For those who are still interested (or unconvinced) and haven’t seen it, you might like to look at this:

http://www.developsense.com/2007/01/test-project-estimation-rapid-way.html

Three Kinds of Measurement and Two Ways to Use Them

Wednesday, July 22nd, 2009

In the testing business, we’ve been wrestling with the measurement problem for quite a while. I think there are two prongs to the problem. The first is the aphorism that “you can’t control what you can’t measure”. The second is the confusion between measurement (which can be either quantitative or qualitative) and metrics, which are mathematical functions of measurements, and are therefore fundamentally and exclusively quantitative.

I don’t know if you can’t control something that you can’t measure, but you can certainly make responsible, defensible choices to control things based on non-quantitative measures. For example, I’m hungry right now, and the non-bald parts of my head are a little shaggy. I’m not really comfortable with the keyboard on my new ThinkPad, but I like the display, even though the default fonts seem to be a little on the small side for an astigmatic guy approaching his 50s. I can measure and manage all of these things without applying numbers.

I’m going to go grab a bite after I’ve finished this note; I’ll get my wife to give me a haircut before she heads out on the canoe trip, and I’ll trim my beard on my own. I can’t do much about the keyboard, although I can measure it by saying that I liked my old machine’s keys better. And I can grow the fonts in the browser by pressing Ctrl-+ until I’m happy again. In each case, I’m measuring to manage just the effects that I want, even though I’m doing it without quantitative measures. (Thanks to Matt Heusser for pointing out the haircut example to me; and thanks to Cem Kaner for pointing out the significance of the fact that I griped about the keyboard before complimenting the display.)

Apropos of all this, another of my Test Connection columns has been posted on StickyMinds. This one is about measurement and metrics, and the way that people use and confuse them. You can read it by clicking here, or by going to http://www.developsense.com/articles/2009-07-ThreeKindsOfMeasurement.pdf.

I’m grateful for the guidance and compliments given to me by Jerry Weinberg on this one.

I’m also delighted by the appearance of a recent article by Tom DeMarco in IEEE Computer, in which he re-evaluates his thoughts on metrics as expressed in his early and influential book, Controlling Software Projects: Management, Measurement, and Estimation (Prentice Hall/Yourdon Press, 1982). He also questions his thoughts on software engineering, as evinced by the title of the piece, “Software Engineering: An Idea Whose Time Has Come and Gone?”. It’s brilliant, and it’s high time that someone of Mr. DeMarco’s stature raised these questions. You can read the article here, or by going to http://bit.ly/pRrkd.

Test Project Estimation, The Rapid Way

Thursday, January 25th, 2007

Erik Petersen (with whom I’ve shared one of the more memorable meals in my life) says, in the Software Testing Yahoo! group,

I know when I train testers, nearly all of them complain about not enough time to test, or things being hard to test. The lack of time is typically being forced into a completely unrealistic time frame to test against.

I used to have that problem. I don’t have that problem any more, because I’ve reframed it (thanks to Cem Kaner, Jerry Weinberg, and particularly James Bach for helping me to get this). It’s not my lack of time, because the time I’ve got is a given. Here’s a little sketch for you.

I’m sitting in my office. Someone, a Pointy-haired Boss (Ph.B.), barges in and says…

Ph.B.: “We’re releasing on March 15th. How long do you need to test this product?”

Me: (pause) Um… Let’s see. June 22.

Ph.B.: WHAT?! That can’t be!

Me: You had some other date in mind?

Ph.B.: Well, something a little earlier than that.

Me: Okay… How about February 19?

Ph.B.: WHAT!?! We want to release it on March 15th! Are you just going to sit on your hands for four weeks?

Me: Oh. So… how about I test until about, say, March 14.

Ph.B.: Well that’s… better…

Me: (pause) …but I won’t tell you that it’s ready to ship.

Ph.B.: How do you know already that it won’t be ready to ship?

Me: I don’t know that. That’s not what I mean; I’m sorry, I didn’t make myself clear. I mean that I won’t tell you whether it’s ready to ship.

Ph.B.: What? You won’t? Why not?!

Me: It’s not my decision to ship or not to ship. The product has to be good enough for you, not for me. I don’t have the business knowledge you have. I don’t know if the stock price depends on quarterly results, and I definitely don’t know if there are bonuses tied to this release. There are bunches of factors that determine the business decision. I can’t tell you about most of those. But I can tell you things that I think are important about the product. In particular, I can tell you about important problems.

Ph.B.: But when will you know when I can ship?

Me: Only you can know that. I can’t make your decision, but I can give you information that helps you to make it. Every day, I’ll learn more and more about the product and our understanding of it, and I’ll pass that on to you. I’ll focus on finding important problems quickly. If you want to know something specific about the product, I’ll run tests to find it out, and I’ll tell you about what I find. Any time you want to ask me to report my status, I’ll do that. If at any time you decide to change the ship date, I’ll abide by that; you can release before or after or on the 15th—whenever you decide that you don’t have any more important questions about the product, and that you’re happy with the answers you’ve got.

Ph.B.: So when will you have run all the tests?

Me: All the tests that I can think of? I can always think of more questions that I could ask and answer about the product—and I’ll let you know what those are. At some point, you’ll decide that you don’t need those questions answered—the questions or answers aren’t interesting enough to prevent you from releasing the product. So I’ll keep testing until I’m done.

Ph.B.: When will you be done?

Me: You’re my client; I’ll test as long as you want me to. I’ll be done when you ask me to stop testing—or when you ship.


Rapid testers are a service to the project, not an obstacle. We keep providing service until the client is satisfied. That means, for me, that there’s never “not enough time to test”; any amount of time is enough for me. The question isn’t whether the tester has enough time; the question is whether the client has enough information—and the client gets to decide that.