
Project Estimation and Black Swans (Part 5): Test Estimation

In this series of blog posts, I’ve been talking about project estimation. But I’m a tester, and if you’re reading this blog, presumably you’re a tester too, or at least you’re interested in testing. So, all this might have been interesting for project estimation in general, but what are the implications for test project estimation?

Let’s start with the tester’s approach: question the question.

Is there ever such a thing as a test project? Specifically, is there such a thing as a test project that happens outside of a development project?

“Test projects” are never completely independent of some other project. There’s always a client, and typically there are other stakeholders too. There’s always an information mission, whether general or specific. There’s always some development work that has been done, such that someone is seeking information about it. There’s always a tester, or some number of testers (let’s assume plural, even if it’s only one). There’s always some kind of time box, whether it’s the end of an agile iteration, a project milestone, a pre-set ship date, or a vague notion of when the project will end. Within that time box, there is at least one cycle of testing, and typically several of them. And there are risks that testing tries to address by seeking and providing information. From time to time, whether continuously or at the end of a cycle, testers report to the client on what they have discovered.

The project might be a product review for a periodical. The project might be a lawsuit, in which a legal team tries to show that a product doesn’t meet contracted requirements. The project might be an academic or industrial research program in which software plays a key role. More commonly, the project is some kind of software development, whether mass-market commercial software, an online service, or IT support inside a company. The project may entail customization of an existing product, or it may involve lots of new code development. But no matter what, testing isn’t the project in and of itself; testing is a part of a project, a part that informs the project. Testing doesn’t happen in isolation; it’s part of a system. Testing observes outputs and outcomes of the system of which it is a part, and feeds that information back into the system. And testing is only one of several feedback mechanisms available to the system.

Although testing may be arranged in cycles, it would be odd to think of testing as an activity that can be separated from the rest of its project, just as it would be odd to think of seeing as a separate phase of your day. People may say a lot of strange things, but you’ll rarely hear them say “I just need to get this work done, and then I’ll start seeing”; and you almost never get asked “When are you going to be done seeing?” Now, there might be part of your day when you need to pay a lot of attention to your eyes—when you’re driving a car, or cutting vegetables, or watching your child walk across a cluttered room. But, even when you’re focused (sorry) on seeing, the seeing part happens in the context of—and in the service of—some other activity.

Does it make sense to think in terms of a “testing phase”?

Many organizations (in particular, the non-agile ones) divide a project into two discrete parts: a “development phase” and a “testing phase”. My colleague James Bach notes an interesting fallacy there.

What happens during the “development phase”? The programmers are programming. Programming may include a host of activities such as research, design, experimentation, prototyping, coding, unit testing (and in TDD, a unit check is created just before the code to be checked), integration testing, debugging, or refactoring. Some of those activities are testing activities.

And what are the testers doing during the “development phase”? The testers are testing. More specifically, they may be engaged in review, planning, test design, toolsmithing, data generation, environment setup, or the running of relatively low-level integration tests, or even very high-level system tests. All of those activities can be wrapped up under the rubric of “testing”.

What happens during the “testing phase”? The programmers are still programming, and the testers are still testing. The primary thing that distinguishes the two phases, though, is the focus of the programming work: the programmers have generally stopped adding new features, but are instead fixing the problems that have been found so far. In the first phase, programmers focused on developing new features; in the second, programmers are focused on fixing. By that reckoning, James reckons, the “testing phase” should be called the fixing phase.

It seems to me that if we took James’ suggestion seriously, it might change the nature of some of the questions that are often asked in a development project. Replace the word “test” with the word “fix”: “How long are you going to need to fix this product?” “When is fixing going to be done?” “Can’t we just automate the fixing?” “Shouldn’t fixing get involved early in the project?” “Why was that feature broken when the customer got it? Didn’t you fix it?” And when we ask those questions, should we be asking the testers?

As James also points out, no one ever held up the release or deployment of a product because there was more testing to be done. Products are delayed because of a present concern that there might be more development work to be done. Testing can stop as soon as product owners believe that they have sufficient information to accept the risk of shipping. If that’s so, the question for the testers “When are you going to be done testing?” translates into a question for the product owner: “When am I going to believe that I have sufficient technical information to inform a risk-based business decision?” At that point, the product owner should—appropriately—be skeptical about anyone else’s determination that they are “done” testing.

Now, for a program manager, the “when do I have sufficient information” question might sound hard to answer. It is hard to answer. When I was a program manager for a commercial software company, it was impossible for me to answer before the information had been marshalled. Look at the variables involved in answering the question well: technical information, technical risk, test coverage, the quality of our models, the quality of our oracles, business information, business risk, the notion of sufficiency, decisiveness…

Most of those variables must be accumulated and weighed and decided in the head of a single person—and that person isn’t the tester. That person is the product owner. The evaluation of those variables and the decision to ship are all in play from one moment to the next. The final state of the contributing variables and the final decision on when to ship are in the future.

Asking the tester “When are you going to be done testing?” is like asking the eyes, “When are you going to be done seeing?” Eyes will continue to scan the surroundings, providing information in parallel with the other senses, until the brain decides upon a course of action. In a similar way, testers continue to test, generating information in parallel with the other members of the project community, until the product owner decides to ship the product.

Neither the tester alone nor the eyes alone can answer the “when are you going to be done” question usefully; they’re not in charge. Until it makes a decision, the brain (optionally) takes in more data which the eyes and the other sense organs, by default, continue to supply. Those of us who have ogled the dessert table, or who have gone out on disastrous dates, know the consequences of letting our eyes make decisions for us. Moreover, if there is a problem, it’s not likely the eyes that will make the problem go away.

Some people believe that they can estimate when testing will be done by breaking down testing into measurable units, like test cases or test steps. To me, that’s like proposing “vision cases” or “vision steps”, which leads to our next question:

Can we estimate the duration of a “testing project” by counting “test cases” or “test steps”?

Recently I attended a conference presentation in which the speaker presented a method for estimating when testing would be completed. Essentially, it was a formula: break testing down into test cases, break test cases down into test steps, observe and time some test steps, average them out (or something) to find out how long a test step takes, and then multiply that time by the number of test steps. Voila! an estimate.

Only one small problem: there was no validity to the basis of the calculation. What is a test step? Is it a physical action? The speaker seemed to suggest that you can tell a tester has moved on to the next step when he performs another input action. Yet surely all input actions are not created equal. What counts as an input action? A mouse click? A mouse movement? The entry of some data into a field? Into a number of fields, followed by the press of an Enter key? Does the test step include an observation? Several observations? Evaluation? What happens when a human notices something odd and starts thinking? What happens when, in the middle of test execution, a tester recognizes a risk and decides to search for a related problem? What happens to the unit of measurement when a tester finds a problem, and begins to investigate and report it?

The speaker seemed to acknowledge the problem when she said that a step might take five seconds, or half a day. A margin of error of about 3000 to one per test step—the unit on which the estimate is based—would seem to jeopardize the validity of the estimate. Yet the margin of error, profound as it is, is orthogonal to a much bigger problem with this approach to estimation.

Excellent testing is not the monotonic or repetitive execution of scripted ideas. (That’s something that my community calls checking.) Instead, testing is an investigation of code, computers, people, value, risks, and the relationships between them. Investigation requires loops of exploration, experimentation, discovery, research, result interpretation, and learning. Variation and adaptation are essential to the process. Execution of a test often involves reflecting on what has just happened, backtracking over a set of steps, and then repeating or varying the steps while posing different questions or making observations. An investigation cannot follow a prescribed set of steps. Indeed, an investigation that follows a predetermined set of steps is not an investigation at all.

In an investigation, any question you ask, starting with the first, may yield an answer that completely derails your preconceptions. In an investigation, assumptions need to be surfaced, attacked, and refined. In an investigation, the answer to the most recent question may be far more relevant to the mission than anything that has gone before. If we want to investigate well, we cannot assume that the most critical risk has already been identified. If we want to investigate well, we can’t do it by rote. (If there are rote questions, let’s put them into low-level automated checks. And let’s do it skillfully.)

If we can’t estimate by counting test cases, how can we estimate how much time we’ll need for testing?

There are plenty of activities that don’t yield to piecework models because they are inseparable from the project in which they happen. In another of James Bach’s analogies, no one estimates the looking-out-the-window phase of driving an automobile journey. You can estimate the length of the journey, but looking out the window happens continuously, until the travellers have reached the destination. Indeed, looking out the window informs the driver’s evaluation of whether the journey is on track, and whether the destination has been reached. No one estimates the customer service phase of a hotel stay. You can estimate the length of the stay, but customer service (when it’s good) is available continuously until the visitor has left the hotel. For management purposes, customer service people (the front desk, the room cleaners) inform the observation that the visitor has left. No one estimates the “management phase” of a software development project. You can estimate how long development will take, but management (when it’s good) happens continuously until the product owner has decided to release the product. Observations and actions from managers (the development manager, the support manager, the documentation manager, and yes, the test manager) inform the product owner’s decision as to whether the product is ready to ship.

So it goes for testing. Test estimation becomes a problem only if one makes the mistake of treating testing as a separate activity or phase, rather than as an open-ended, ongoing investigation that continues throughout the project.

My manager says that I have to provide an estimate, so what do I do?

At the beginning of the project, we know very little relative to what we’ll know later. We can’t know everything we’ll need to know. We can’t know at the beginning of the project whether the product will meet its schedule without being visited by a Black Swan or a flock of Black Cygnets. So instead of thinking in terms of test estimation, try thinking in terms of strategy, logistics, negotiation, and refinement.

Our strategy is the set of ideas that guide our test design. Those ideas are informed by the project environment, or context; by the quality criteria that might be valued by users and other stakeholders; by the test coverage that we might wish to obtain; and by the test techniques that we might choose to apply. (See the Heuristic Test Strategy Model that we use in Rapid Testing as an example of a framework for developing a strategy.) Logistics is the set of ideas that guide our application of people, equipment, tools, and other resources to fulfill our strategy. Put strategy and logistics together and we’ve got a plan.

Since we’re working with—and, more importantly, for—a client, the client’s mission, schedule, and budget are central to choices on the elements of our strategy and logistics. Some of those choices may follow history or the current state of affairs. For example, many projects happen in shops that already have a roster of programmers and testers; many projects are extensions of an existing product or service. Sometimes project strategy ideas are based on projections or guesswork or hopes; for example, the product owner already has some idea of when she wants to ship the product. So we use whatever information is available to create a preliminary test plan. Our client may like our plan—and she may not. Either way, in an effective relationship, neither party can dictate the terms of service. Instead, we negotiate. Many of our preconceptions (and the client’s) will be invalid and will change as the project evolves. But that’s okay; the project environment, excellent testing, and a continuous flow of reporting and interaction will immediately start helping to reveal unwarranted assumptions and new risk ideas. If we treat testing as something that happens continuously with development, and if we view development in cycles that provide a kind of pulse for the project, we have opportunities to review and refine our plans.

So: instead of thinking about estimation of the “testing phase”, think about negotiation and refinement of your test strategy within the context of the overall project. That’s what happens anyway, isn’t it?

But my management loves estimates! Isn’t there something we can estimate?

Although it doesn’t make sense to estimate testing effort outside the context of the overall project, we can charter and estimate testing effort within a development cycle. The basic idea comes from Session Based Test Management, James and Jon Bach’s approach to plan, estimate, manage, and measure exploratory testing in circumstances that require high levels of accountability. The key factors are:

  • time-boxed sessions of uninterrupted testing, ranging from 45 minutes to two hours and fifteen minutes, with the goal of making a normal session 90 minutes or so;
  • test coverage areas—typically functions or features of the product to which we would like to dedicate some testing time;
  • activities such as research, review, test design, data generation, toolsmithing, or retesting, to which we might also like to dedicate testing time;
  • charters, in the form of a one- to three-sentence mission statement that guides the session to focus on specific coverage areas and/or activities;
  • debriefings, in which a tester and a test lead or manager discuss the outcome of a session;
  • reviewable results, in the form of a session sheet that provides structure for the debrief, and that can be scanned and parsed by a Perl script; and, optionally,
  • a screen-capture recording of the session when detailed retrospective investigation or analysis might be needed;
  • metrics whose purposes are to determine how much time is spent on test design and execution (activities that yield test coverage) vs. bug investigation and reporting, and setup (activities that interrupt the generation of test coverage).

The timebox provides a structure intended to make estimation and accounting for time fairly imprecise, but reasonably accurate. (What’s the difference? As I write, the time and date is 9:43:02.1872 in the morning, January 23, 1953. That’s a very precise reckoning of the time and date, but it’s completely inaccurate.)

Let’s also assume that a development cycle is two weeks, or ten working days—the length of a typical agile iteration. Let’s assume that we have four testers on the team, and that each tester can accomplish three sessions of work per day (meetings, e-mail, breaks, conversations, and other non-session activities take up the rest of the time).

ten days * four testers * three sessions = 120 sessions

Let’s assume further that sessions cannot be completely effective, in that test design and execution will be interrupted by setup and bug investigation. Suppose that we reckon 10% of the time spent on setup, and 25% of the time spent on investigating and reporting bugs. That’s 35% in total; for convenience, let’s call it 1/3 of the time.

120 sessions – 120 * 1/3 interruption time = 80 sessions

Thus in our two-week iteration we estimate that we have time for 80 focused, targeted effective idealized sessions of test coverage, embedded in 120 actual sessions of testing. Again, this is not a precise figure; it couldn’t possibly be. If our designers and programmers have done very well in a particular area, we won’t find lots of bugs and our effective coverage per session will go up. If setup is in some way lacking, we may find that interruptions account for more than one-third of the time, which means that our effective coverage will be reduced, or that we have to allocate more sessions to obtain the same coverage. So as soon as we start obtaining information about what actually went on in the sessions, we feed that information back into the estimation. I wrote extensively about that here.
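
The arithmetic above is simple enough to do in your head, but writing it down makes it easy to re-run as real session data comes in. Here is a minimal Python sketch; the function name `session_capacity` and its parameters are my own labels for illustration, not part of Session Based Test Management, and the final "revised" figure is a made-up input rather than an observation.

```python
from fractions import Fraction


def session_capacity(days, testers, sessions_per_day, interruption):
    """Return (actual_sessions, effective_sessions) for one development cycle.

    `interruption` is the share of session time lost to setup and to bug
    investigation and reporting. It is passed as a Fraction so that the
    arithmetic stays exact (floating-point 1/3 would give 80.00000000000001).
    """
    actual = days * testers * sessions_per_day
    effective = actual * (1 - interruption)
    return actual, effective


# Figures from the text: ten days, four testers, three sessions per day,
# with interruptions rounded to one third of the time.
actual, effective = session_capacity(10, 4, 3, Fraction(1, 3))
print(actual, effective)  # 120 80

# The unrounded figures (10% setup + 25% bug work) give a slightly lower number.
_, unrounded = session_capacity(10, 4, 3, Fraction(10, 100) + Fraction(25, 100))
print(unrounded)  # 78

# Hypothetical feedback: suppose debriefs suggested half the time was interrupted.
_, revised = session_capacity(10, 4, 3, Fraction(1, 2))
print(revised)  # 60
```

The point of the last two calls is the feedback loop, not the numbers: the interruption share is an input you keep revising, so the estimate moves whenever the sessions tell you something new.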

On their own, the metrics on interruptions could be fascinating and actionable information for managers. But note that the metrics are not conclusive. They can’t be. Instead, they inform questions. Why has there been more bug investigation than we expected? Are there more problems than we anticipated, or are testers spending too much time investigating before consulting with the programmers? Is setup taking longer than it should, such that customers will have setup problems too? Even if the setup problems will be experienced only in testing, are there ways to make setup more rapid so that we can spend more time on test coverage? The real value of any metrics is in the questions they raise, rather than in the answers they give.

There’s an alternative approach, for those who want to estimate the duration or staffing for a test cycle: set the desired amount of coverage, and apply the fixed variables and calculate for the free ones. Break the product down into areas, and assign some desired number of sessions to each based on risk, scope, complexity, or any combination of factors you choose. Based on prior experience or even on a guess, adjust for interruptions and effectiveness. If you know the number of testers, you can figure the amount of time required; if you want to set the amount of time, you can calculate for the number of testers required. This provides you with a quick estimate.

Which, of course, you should immediately distrust. What influence do tester experience and skill have on your estimate? On the eventual reality? If you’re thinking of adding testers, can you avoid banging into Brooks’ Law? Are your notions of risk static? Are they valid? And so forth. Estimation done well should provoke a large number of questions. Not to worry; actual testing will inform the answers to those questions.

Wait a second. We paid a lot of money for an expensive test management tool, and we sent all of our people to a one-week course on test estimation, and we now spend several weeks preparing our estimates. And since we started with all that, our estimates have come out really accurate.

If experience tells us anything, it should tell us that we should be suspicious of any person or process that claims to predict the future reliably. Such claims tend to be fulfilled via the Ludic Fallacy and the narrative bias, central pillars of the philosophy of The Black Swan. Since we already have an answer to the question “When are we going to be done?”, we have the opportunity (and often the mandate) to turn an estimate into a self-fulfilling prophecy. Jerry Weinberg’s Zeroth Law of Quality (“If you ignore quality, you can meet any other requirement”) is a special case of my own, more general Zeroth Law of Wish Fulfillment: “If you ignore some factors, you can achieve anything you like.” If your estimates always match reality, what assumptions and observations have you jettisoned in order to make reality fit the estimate? And if you’re spending weeks on estimation, might that time be better spent on testing?

6 replies to “Project Estimation and Black Swans (Part 5): Test Estimation”

  1. I think I could understand your point of view but still I have some questions.

    Michael replies: Good! Thank you for asking.

    If you don’t estimate testing time and testing happens while the project exists, who stops first, testing or the project? Typically a project had a timeline and testing “should stop” by then (although we never stop). How do you manage that? How do you estimate and plan testing to “fit” in project timeline?

    Let’s start with your last question first. I suspect that the question is based on the idea that testing is something that happens separately from the project timeline. If you reframe testing as something that happens within the project timeline, then the quick answer is that you estimate the project timeline. How does testing fit within that? I believe that I addressed the issue in the section on session-based test management: you plan testing as a part of the cycles of project development.

    Now, your first question: when does testing stop? I’ve addressed that question in earlier blog postings, in particular this one. Although there are many heuristics listed there, it all comes down to this: we stop testing when we decide that there are no more questions for which the cost of answering them meets the value of answering them. That typically happens at the same time that the product owner decides to ship the product. You could make arguments that the decision to stop testing comes first, or that the decision to ship comes first; you could say that the recognition that there’s no more testing to be done triggers the release decision; you could say that the product manager has decided to ship and therefore no more testing need be done. I’d suggest that either point of view is tangled up in the mangle of practice, as Andrew Pickering would say.

  2. I think it is possible to estimate test effort but we need to make sure we are asking the right sort of questions of each feature we expect to complete by the end of an iteration. Saying testing this feature will take that long, is never going to work. And I don’t like the idea of splitting testing into sessions.

    Michael replies: Oh, really? How do you feel about splitting development work into iterations?

    A good analogy (told to me by the Scrum Master at my current employer) to show the problems with estimating based on time is the one of three runners: an olympic sprinter, a middle aged man, and a man aged 90. Ask each of them how long it will take to run a distance in front of them and you get three varying answers right? The sprinter looks at the distance and says: 10 seconds, the old man says 20 and the oldest may say 30. However if we reframe the question and ask each of them how far they are going to be running – we will probably end up with a much smaller spread in our estimations.

    You’ve left out all kinds of important questions here, though. The most important, it seems to me, is why are they running? That is, what’s their destination? In testing, there is no set finish line.

    I really think testing should, and could be estimated at a task level, and ideally, estimation of test effort should be tied into the estimation of development effort. Done means shippable (or at least done for one person at one time means shippable for one business at one time!) and testing software is part of that effort, so why not create one estimation containing test effort and development effort? After all development and testing are two sides of the same coin, their efforts should be treated as parts of the same task for me.

    Interesting: did you miss this part? “Test estimation becomes a problem only if one makes the mistake of treating testing as a separate activity or phase, rather than as an open-ended, ongoing investigation that continues throughout the project.”

    Test effort can be vertically sliced in the same way as development work. You are right to say that investigation by its very nature is hard to estimate. I agree, but with a caveat. Any good tester should also be a domain expert. And any domain expert will be able to use their knowledge to identify and note hidden complexities. I know if I am testing a certain part of my application, that it is more complex than other parts and I adjust the estimation for testing that area. I agree that we can’t rely on the past to predict the future, but if we endeavour to learn the lessons it teaches us – we can often become better at adapting to the future.

    But further than that I think we need to be encouraging collaborative estimation rather than treating development and testing as distinctive activities. We can estimate the complexity of a development task, and estimate the complexity of testing that task as one big team, coming up with a number (which should be relatively arbitrary) to represent the effort involved in completing that feature. Thus, instead of saying as a tester I can do x hours of work per iteration which may or may not translate into x tested features. I can instead say that on an average iteration my team will complete x (arbitrary) feature points.

    Not a bad idea on the face of it, but there is a catch: the idea that development work and testing effort are symmetrical. They’re not. Something that takes a very small amount of time to code might take a very long time to test (consider potential variations in platform, workflow, data, timing and reliability considerations, and so forth). Conversely, something that takes a long time to code might take relatively little time to test (especially when what we call “coding” includes a good deal of TDD or unit testing).

    Apologies for the slightly stream of consciousness nature of the comment, but it seems your blog posts always seem to get me thinking and commenting.

    My apologies for that. 🙂

    -Callum-

  3. What are the criteria to determine how many features/functions are included in a coverage area? Should it be what can be tested in a session?

    Michael replies: We suggest dividing the product or service into roughly 15 to 30 coverage areas of roughly equal size, significance, and testing effort. More than 30 areas mean things get hard to comprehend at a glance; fewer than 15 and the breakdown might not be sufficiently granular. Each coverage area will receive several sessions per test cycle, approximately the same number of sessions per area.

    Notice that I’m not answering your original question on the number of features or functions per area. There’s a reason for doing that: I don’t know how to count features or functions in a way that’s meaningful for testing—in particular, for parafunctional testing. Some people might suggest “function points”; that could work for them, but not for me, since for me a product is more than functions, and testing is more than functional testing. Dividing the project into areas of relatively equal testing effort isn’t easy to do precisely, but precision isn’t so important; reasonable accuracy will do until testing reveals more information to help us focus our attention.

    For more ideas, you might like to look here and here.

