Archive for the ‘Time’ Category

Should Testers Play Planning Poker?

Wednesday, October 26th, 2011

My colleague and friend Eric Jacobson, who recently (as I write) did a bang-up job on his first conference presentation at STAR West 2011, asks a question in response to this blog post from 2006. (I like it when people reflect on an issue for a few years.) Eric asks:

You are suggesting it may not make sense for testers to give time-based estimates to their teams, but what about relative estimates? Let’s say a Rapid Software Tester is asked to participate in Planning Poker (relative-based story estimation) on an Agile Scrum team. I’ve always considered this a golden opportunity. Are you suggesting said tester may want to refuse to participate in the Planning Poker?

Having observed Planning Poker in action, I’m conflicted. Estimating anything is always a bit of a dodgy business, even at the best of times. That’s especially true for investigation and in particular for discovery. (I’ve written about some of the problems with estimation here and in subsequent posts, and with how those problems pertain to testing here.) Yet Planning Poker may be one way to get a good deal closer to the best of times. I like the idea of testers hearing what’s going on in planning sessions, and of offering perspective on the possible implications of work or change. On the other hand, at Planning Poker sessions I’ve observed or participated in, testers are often pressured to lower their numbers. In an environment where there’s trust, there tends to be much less pressure; in an environment where there’s less trust, I’d take pressure to lower the estimate as a test result with several possible interpretations. (I leave those interpretations as an exercise for the reader, but don’t stop until you get to five, at least.)

In any case, some fundamental problems remain: First, testing is oriented towards discovering things, not building things. At the root of it all, any estimate of how long it will take to test something is like estimating how long it will take you to evaluate someone’s ability to speak Spanish (which I wrote about here), and discovering problems in their ability to express themselves. If you already know something or can reasonably anticipate it, that helps a lot, and the Planning Poker approach (among many others) can help with that to some degree.

The second problem is that there’s not necessarily symmetry between the effort in creating something and the effort in testing it. A function or feature that takes very little effort to program might take an enormous amount of effort to test. What kinds of variation could we put into data, workflow, timing, platform dependencies and interactions, scenarios, and so forth? Meanwhile, a feature that takes significant amounts of programming effort could take almost no time to test (since “programming effort” could include an enormous amount of testing effort). There are dozens of factors involved, including the amount of testing the programmers do as they code; what kind of review is being done; what the scope of the change is; when particular discoveries get made (during “development time” or “testing time”); the skill of the parties involved; the testability of the product under test; how buggy the finished feature is (in which case there will be more time needed for investigation and reporting)… Planning Poker doesn’t solve the asymmetry problem, but it provides a venue for discussing it and getting started on sorting it out.

The third problem, closely related to the second, is this idea that all testing work associated with developing something must and shall happen within the same iteration. Testing never ends; it only stops. So it’s folly to think that all testing for a given amount of programming work can always fit into the same iteration in which the work is done. I’d argue that we need a more nuanced perspective and more options than that. The decision as to how much testing we’ll need is informed by many factors. Paradoxically, we’ll need some testing to help reveal and inform our notions of how much testing we’ll need.

I understand the desire to close the book on a development story within the sprint. I often—even usually—share that desire. Yet many kinds of testing work must respond to development work, and in such cases the development work has to be complete in some lesser sense than “fully tested”. Many kinds of confirmatory checking work, it seems to me, can be done within the same sprint as the programming work; no problem there. Yet it seems to me that other kinds of testing can reasonably wait for subsequent sprints—indeed, must wait for subsequent sprints, unless we’d like to have programmers stop all programming work altogether after a certain day in the sprint. Let me give you an example: in big banks, some kinds of transactions take several days to wend their way through batch processes that are run overnight. The testing work associated with that can be simulated, for sure (indeed, one would hope that most of such work would be simulated), but only at the expense of some loss of realism. For the test, whether the realism is important or not is always an open question with a fallible answer. Instead of making sure that there’s NO testing debt, consider reasonable, small, and sustainable amounts of testing debt that spans iterations. Agile can be about actual agility, instead of dogma.

So… If playing Planning Poker is part of the context, go for it. It’s a heuristic approach to getting people to consider testing more consciously and thoughtfully, and there’s something to that. It’s oriented towards estimating things in a more comprehensible time frame, and in digestible chunks of task and effort. Planning Poker is fallible, and one approach among many possible approaches. Like everything else, its usefulness depends mostly on the people using it, and how they use it.

Testing: Difficult or Time-Consuming?

Thursday, September 29th, 2011

In my recent blog post, Testing Problems Are Test Results, I noted a question that we might ask about people’s perceptions of testing itself:

Does someone perceive testing to be difficult or time-consuming? Who? What’s the basis for that perception? What assumptions underlie it?

The answer to that question may provide important clues to the way people think about testing, which in turn influences the cost and value of testing.

As an example, a pseudonymous person (“PM Hut”) who is evidently associated with project management in some sense (s/he provides the URL http://www.pmhut.com) answered my questions above.

Just to answer your question “Does someone perceive testing to be difficult or time-consuming?” Yes, everyone, I can’t think of a single team member I have managed who doesn’t think that testing is time consuming, and they’d rather do something else.

This, alas, isn’t an unusual response. To someone like me who offers help in increasing the value and reducing the cost of testing, it triggers some questions that might prompt reframes or further questions.

  • What do the team members think testing is? Do they think that it’s something ancillary to the project, rather than an essential and integrated aspect of software development? To me, testing is about gathering information and raising awareness that’s essential for identifying product risks and steering the project. That’s incredibly important and valuable.

    So when the team members are driving a car, do they perceive looking out the windshield to be difficult or time-consuming? Do they perceive looking at the dashboard to be difficult or time-consuming? If so, why? What are the differences between the way they obtain awareness when they’re driving a car, versus the way they obtain awareness when they’re contributing to the development of a product or service?

  • Do the team members think testing is the mindless repetition of actions and observation of specific outputs, as prescribed by someone else? If so, I’d agree with them that testing is an unpalatable activity—except I don’t call that testing. I call it checking, and I’d rather let a machine do it. I’d also ask if checking is being done automatically by the programmers at lower levels where it tends to be fast, cheap, easy, useful and timely—or manually at higher levels, where it tends to be slower, more expensive, more difficult, less useful, and less timely—and tedious?
  • Is testing focused mostly on confirmation of things that we already know or hope to be true? Is it mostly focused on the functional aspects of the program (which are amenable to checking)? People tend to find this dull and tedious, and rightly so. Or is testing an active search for new information, problems, and risks? Does it include focus on parafunctional aspects of the product—the things that provide important perceptions of real value to real people? Are the testers given the freedom and responsibility to manage a good deal of their own investigation? Testers tend to find this kind of approach a lot more engaging and a lot more interesting, and the results are typically more wide-ranging, informative, and valuable to programmers and managers.
  • Is testing overburdened by meaningless and valueless paperwork, bureaucracy, and administrivia? How did that come to pass? Are team members aware that there are simple, lightweight, rapid, and highly effective ways of planning, recording, and reporting testing work and project status?
  • Are there political issues? Are testers (or people acting temporarily in a testing role) routinely blown off (as in this example)? Are the nuggets of information revealed by testing habitually dismissed? Is that because testing is revealing trivial information? If so, is there a problem with specific testing skills like modeling the test space, determining coverage, determining oracles, recording, or reporting?
  • Have people been trained on the basis of testing as a skilled, sophisticated thinking art? Or is testing something for which capability can be assessed by a trivial, 40-question multiple choice exam?
  • If testing is being done well (which given people’s attitudes expressed above would be a surprise), are programmers or managers afraid of having to deal with the information that testing reveals? Does that lead to recrimination and conflict?
  • If there’s a perception that testing is by its nature dull and slow, are the testers aware of the quick testing approaches in our Rapid Software Testing class (PDF, pages 97-99), in the Black Box Software Testing course offered by the Association for Software Testing, or in James Whittaker’s How to Break Software? Has anyone read and absorbed Lessons Learned in Software Testing?
  • If there’s a perception that technical reviews are slow, have the testers, programmers, or managers read Perfect Software and Other Illusions About Testing? Do they recognize the ways in which careful observation provides us with “instant reviews” (see Perfect Software, page 143)? Has anyone on the team read any other of Jerry Weinberg’s books on software management and measurement?
  • Have the testers, programmers, and managers recognized the extent to which exploratory testing is going on all the time? Do they recognize that issues revealed by testing might be even more important than bugs? Do they understand that every test result and every testing problem points to meta-information that can be extremely valuable in managing the project?

On PM Hut’s own Web site, there’s an article entitled “Why Project Managers Fail”. The author, Jim Benson, lists five common problems, each of which could be quickly revealed by looking at testing as a source of information, rather than by simply going through the motions. Take it from the former program manager of a product that, in its day, was the best-selling piece of commercial software in the world: testers, testing, and the information they reveal are a project manager’s best friends and most valuable assets, when you have the awareness to recognize them.

Testing need not be difficult, tedious or time-consuming. A perception that it is so, or that it must be so, suggests a problem with testing as practised or testing as perceived. Astute managers and teams will investigate that important and largely mistaken perception.

Testing Problems Are Test Results

Tuesday, September 6th, 2011

I often do an exercise in the Rapid Software Testing class in which I ask people to catalog things that, for them, make testing harder or slower. Their lists fit a pattern I hear over and over from testers (you can see an example of the pattern in this recent question on Stack Exchange). Typical points include:

  • I’m a tester working alone with several programmers (or one of a handful of testers working with many programmers).
  • I’m under enormous time pressure. Builds are coming in continuously, and we’re organized on one- or two-week development cycles.
  • The product(s) I’m testing is (are) very complex.
  • There are many interdependencies between modules within the product, or between products.
  • I’m seeing a consistent pattern of failures specifically related to those interdependencies; the tiniest change here can have devastating impact there—or anywhere.
  • I believe that I have to run a complete regression test on every build to try to detect those failures.
  • I’m trying to cope by using automated checks, but the complexity makes the automation difficult, the program’s testing hooks are minimal at best, and frequent product changes make the whole relationship brittle.
  • The maintenance effort for the test automation is significant, at a cost to other testing I’d like to do.
  • I’m feeling overwhelmed by all this, but I’m trying to cope.

On top of that,

  • The organization in which I’m working calls itself Agile.
  • Other than the two-week iterations, we’re actually using at most two other practices associated with Agile development, (typically) daily scrums or Kanban boards.

Oh, and for extra points,

  • The builds that I’m getting are very unstable. The system falls over under the most basic of smoke tests. I have to do a lot of waiting or reconfiguring or both before I can even get started on the other stuff.

How might we consider these observations?

We could choose to interpret them as problems for testing, but we could think of them differently: as test results.

Test results don’t tell us whether something is good or bad, but they may inform a decision or an evaluation or more questions. People observe test results and decide whether there are problems and what the problems are, what further questions are warranted, and what decisions should be made. Doing that requires human judgement and wisdom, consideration of lots of factors, and a number of possible interpretations.

Just as for automated checks and other test results, it’s important to consider a variety of explanations and interpretations for testing meta-results—observations about testing—lest we miss an important problem. As Jerry Weinberg points out in Perfect Software and Other Illusions About Testing, whatever else something might be, it’s information. If testing is, as Jerry says, gathering information with the intention of informing a decision, it seems a mistake to leave potentially valuable observations lying around on the floor. Indeed, rather than thinking of them as problems for testing, we could choose to think of them as symptoms of product or project problems—problems that testing can help to solve.

For example, when a tester feels outnumbered by programmers, or when a tester feels under time pressure, that’s a test result. The feeling often comes from the programmers generating more work and more complexity than the tester can handle. Yet complexity, like quality, is a relationship between some person and something else. Complexity on its own isn’t necessarily a problem; it’s how people deal with it and its attendant risks that’s a problem. When we observe the ways in which people react to a perception of complexity, we might learn a lot.

  • Are people conscious of the risks—especially the Black Swans—that typically accompany complexity?
  • If people are conscious of risk, are they paying attention to it? Are they panicking over it? Or are they ignoring it and whistling past the graveyard? Or…
  • Are people reacting calmly and pragmatically? Are they acknowledging and dealing with the complexity of the product? If they can’t make the product or the process that it models less complex, are they at least taking steps to make understanding of the product more tractable?
  • Might the programmers be generating or modifying code so quickly that they’re not taking the time to understand what’s really going on with it?
  • If someone feels that more testers are needed, what’s behind that feeling? (I took a stab at an answer to that question a few years back.)

How might we figure out answers to those questions? One way might be to look at more of the test results and test meta-results.

  • Does someone perceive testing to be difficult or time-consuming? Who? What’s the basis for that perception? What assumptions underlie it?
  • Does the need to investigate and report bugs overwhelm the testers’ capacity to obtain good test coverage? (I wrote about that problem here.)
  • Does testing consistently reveal patterns of failure?
  • Are programmers consistently surprised by such failures and patterns?
  • Do small changes in the code cause problems that are disproportionately large or hard to find?
  • Do the programmers understand the interdependencies clearly? Are those interdependencies necessary, or could they be eliminated?
  • Are programmers taking steps to anticipate or prevent problems related to interfaces and interactions?
  • If automated checks are difficult to develop and maintain, does that say something about the skill of the tester, the quality of the automation interfaces, or the scope of checks? Or about something else?
  • Are unstable builds a problem that gets in the way of deeper testing? Or could we interpret them as a sign that the product has problems so numerous and serious that even shallow testing reveals them?
  • When a “stable” build appears after a long series of unstable builds, how stable is it really?

Perhaps, with the answers to those questions, we could raise even more questions.

  • What risks do those problems present for the success of the product, whether in the short term or the longer term?
  • When testing consistently reveals patterns of failures and attendant risk, what does the product team do with that information?
  • Are the programmers mandated to deliver code? Or are the programmers mandated to deliver code with a warrant that the code does what it should (and doesn’t do what it shouldn’t), to the best of their knowledge? Do the programmers adamantly prefer the latter mandate?
  • Is someone pressuring the programmers to make schedule or scope commitments that they can’t really fulfill?
  • Are the programmers and the testers empowered to push back on scope or schedule pressure when it adds to product or project risk?
  • Do the business people listen to the development team’s concerns? Are they aware of the risks that testers and programmers bring to their attention? When the development team points out risks, do managers and business people deal with them congruently?
  • Is the team working at a sustainable pace, or might we expect the product and the project to become overwhelmed by complexity, interdependencies, fragility, and problems that lurk just beyond the reach of our development and testing effort?
  • Is the development team really Agile, in the sense of the precepts of the Agile Manifesto? Or is “agility” being used in a cargo-cult way, using practices or artifacts to mask over an incoherent project?

Testers often feel that their role is to find, investigate, and report on bugs in the product. That’s usually true, but it’s also a pretty limited view of the kinds of information that testing reveals. When seen one way, the problems I’ve listed above sound like serious problems for testing. What if we also remembered Jerry’s definition of testing as “gathering information with the intention of informing a decision”? If that’s the case, then everything that we notice or discover during testing is a test result.

(See also this discussion for an example of looking beyond the test result for possible product and project risks.)

Project Estimation and Black Swans (Part 5): Test Estimation

Sunday, October 31st, 2010

In this series of blog posts, I’ve been talking about project estimation. But I’m a tester, and if you’re reading this blog, presumably you’re a tester too, or at least you’re interested in testing. So, all this might have been interesting for project estimation in general, but what are the implications for test project estimation?

Let’s start with the tester’s approach: question the question.

Is there ever such a thing as a test project? Specifically, is there such a thing as a test project that happens outside of a development project?

“Test projects” are never completely independent of some other project. There’s always a client, and typically there are other stakeholders too. There’s always an information mission, whether general or specific. There’s always some development work that has been done, such that someone is seeking information about it. There’s always a tester, or some number of testers (let’s assume plural, even if it’s only one). There’s always some kind of time box, whether it’s the end of an agile iteration, a project milestone, a pre-set ship date, or a vague notion of when the project will end. Within that time box, there is at least one cycle of testing, and typically several of them. And there are risks that testing tries to address by seeking and providing information. From time to time, whether continuously or at the end of a cycle, testers report to the client on what they have discovered.

The project might be a product review for a periodical. The project might be a lawsuit, in which a legal team tries to show that a product doesn’t meet contracted requirements. The project might be an academic or industrial research program in which software plays a key role. More commonly, the project is some kind of software development, whether mass-market commercial software, an online service, or IT support inside a company. The project may entail customization of an existing product, or it may involve lots of new code development. But no matter what, testing isn’t the project in and of itself; testing is a part of a project, a part that informs the project. Testing doesn’t happen in isolation; it’s part of a system. Testing observes outputs and outcomes of the system of which it is a part, and feeds that information back into the system. And testing is only one of several feedback mechanisms available to the system.

Although testing may be arranged in cycles, it would be odd to think of testing as an activity that can be separated from the rest of its project, just as it would be odd to think of seeing as a separate phase of your day. People may say a lot of strange things, but you’ll rarely hear them say “I just need to get this work done, and then I’ll start seeing”; and you almost never get asked “When are you going to be done seeing?” Now, there might be part of your day when you need to pay a lot of attention to your eyes—when you’re driving a car, or cutting vegetables, or watching your child walk across a cluttered room. But, even when you’re focused (sorry) on seeing, the seeing part happens in the context of—and in the service of—some other activity.

Does it make sense to think in terms of a “testing phase”?

Many organizations (in particular, the non-agile ones) divide a project into two discrete parts: a “development phase” and a “testing phase”. My colleague James Bach notes an interesting fallacy there.

What happens during the “development phase”? The programmers are programming. Programming may include a host of activities such as research, design, experimentation, prototyping, coding, unit testing (and in TDD, a unit check is created just before the code to be checked), integration testing, debugging, or refactoring. And what are the testers doing during the “development phase”? The testers are testing. More specifically, they may be engaged in review, planning, test design, toolsmithing, data generation, environment setup, or the running of relatively low-level integration tests, or even very high-level system tests. All of those activities can be wrapped up under the rubric of “testing”.

What happens during the “testing phase”? The programmers are still programming, and the testers are still testing. The primary thing that distinguishes the two phases, though, is the focus of the programming work: the programmers have generally stopped adding new features, but are instead fixing the problems that have been found so far. In the first phase, programmers focused on developing new features; in the second, programmers are focused on fixing. By that reckoning, James reckons, the “testing phase” should be called the fixing phase. It seems to me that if we took James’ suggestion seriously, it might change the nature of some of the questions that are often asked in a development project. Replace the word “test” with the word “fix”: “How long are you going to need to fix this product?” “When is fixing going to be done?” “Can’t we just automate the fixing?” “Shouldn’t fixing get involved early in the project?” “Why was that feature broken when the customer got it? Didn’t you fix it?” And when we ask those questions, should we be asking the testers?

As James also points out, no one ever held up the release or deployment of a product because there was more testing to be done. Products are delayed because of a present concern that there might be more development work to be done. Testing can stop as soon as product owners believe that they have sufficient information to accept the risk of shipping. If that’s so, the question for the testers “When are you going to be done testing?” translates into a question for the product owner: “When am I going to believe that I have sufficient technical information to inform a risk-based business decision?” At that point, the product owner should, appropriately, be skeptical about anyone else’s determination that they are “done” testing.

Now, for a program manager, the “when do I have sufficient information” question might sound hard to answer. It is hard to answer. When I was a program manager for a commercial software company, it was impossible for me to answer before the information had been marshalled. Look at the variables involved in answering the question well: technical information, technical risk, test coverage, the quality of our models, the quality of our oracles, business information, business risk, the notion of sufficiency, decisiveness… Most of those variables must be accumulated and weighed and decided in the head of a single person—and that person isn’t the tester. That person is the product owner. The evaluation of those variables and the decision to ship are all in play from one moment to the next. The final state of the contributing variables and the final decision on when to ship are in the future. Asking the tester “When are you going to be done testing?” is like asking the eyes, “When are you going to be done seeing?” Eyes will continue to scan the surroundings, providing information in parallel with the other senses, until the brain decides upon a course of action. In a similar way, testers continue to test, generating information in parallel with the other members of the project community, until the product owner decides to ship the product. Neither the tester alone nor the eyes alone can answer the “when are you going to be done” question usefully; they’re not in charge. Until it makes a decision, the brain (optionally) takes in more data which the eyes and the other sense organs, by default, continue to supply. Those of us who have ogled the dessert table, or who have gone out on disastrous dates, know the consequences of letting our eyes make decisions for us. Moreover, if there is a problem, it’s not likely the eyes that will make the problem go away.

Some people believe that they can estimate when testing will be done by breaking down testing into measurable units, like test cases or test steps. To me, that’s like proposing “vision cases” or “vision steps”, which leads to our next question:

Can we estimate the duration of a “testing project” by counting “test cases” or “test steps”?

Recently I attended a conference presentation in which the speaker presented a method for estimating when testing would be completed. Essentially, it was a formula: break testing down into test cases, break test cases down into test steps, observe and time some test steps, average them out (or something) to find out how long a test step takes, and then multiply that time by the number of test steps. Voila! an estimate.

Only one small problem: there was no validity to the basis of the calculation. What is a test step? Is it a physical action? The speaker seemed to suggest that you can tell a tester has moved on to the next step when he performs another input action. Yet surely all input actions are not created equal. What counts as an input action? A mouse click? A mouse movement? The entry of some data into a field? Into a number of fields, followed by the press of an Enter key? Does the test step include an observation? Several observations? Evaluation? What happens when a human notices something odd and starts thinking? What happens when, in the middle of test execution, a tester recognizes a risk and decides to search for a related problem? What happens to the unit of measurement when a tester finds a problem, and begins to investigate and report it?

The speaker seemed to acknowledge the problem when she said that a step might take five seconds, or half a day. A margin of error of about 3000 to one per test step—the unit on which the estimate is based—would seem to jeopardize the validity of the estimate. Yet the margin of error, profound as it is, is orthogonal to a much bigger problem with this approach to estimation.
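To see how that spread carries through the formula, here is a rough sketch of the arithmetic. It assumes that “half a day” means four working hours, which gives a ratio close to the “about 3000 to one” figure; the step count of 1,000 and the function name are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: the four-hour "half day" and the step count
# are assumptions, not figures from the presentation.
SECONDS_PER_HOUR = 3600


def estimate_seconds(step_count: int, seconds_per_step: float) -> float:
    """The formula as described: time per step times number of steps."""
    return step_count * seconds_per_step


shortest_step = 5                     # "five seconds"
longest_step = 4 * SECONDS_PER_HOUR   # "half a day", assumed to be 4 hours

# How far apart the two extremes of a single step are.
ratio = longest_step / shortest_step

steps = 1000  # hypothetical number of test steps
low = estimate_seconds(steps, shortest_step)
high = estimate_seconds(steps, longest_step)

print(f"per-step spread: {ratio:.0f} to 1")
print(f"estimate range: {low / SECONDS_PER_HOUR:.1f} h to {high / SECONDS_PER_HOUR:.0f} h")
```

Under those assumptions, the same 1,000 steps could come out at under an hour and a half or at four thousand hours, depending on which per-step time is plugged in. Averaging timed steps narrows the number that gets reported, but not the uncertainty underneath it.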

Excellent testing is not the monotonic or repetitive execution of scripted ideas. (That’s something that my community calls checking.) Instead, testing is an investigation of code, computers, people, value, risks, and the relationships between them. Investigation requires loops of exploration, experimentation, discovery, research, result interpretation, and learning. Variation and adaptation are essential to the process. Execution of a test often involves reflecting on what has just happened, backtracking over a set of steps, and then repeating or varying the steps while posing different questions or making observations. An investigation cannot follow a prescribed set of steps. Indeed, an investigation that follows a predetermined set of steps is not an investigation at all.

In an investigation, any question you ask, starting with the first, may yield an answer that completely derails your preconceptions. In an investigation, assumptions need to be surfaced, attacked, and refined. In an investigation, the answer to the most recent question may be far more relevant to the mission than anything that has gone before. If we want to investigate well, we cannot assume that the most critical risk has already been identified. If we want to investigate well, we can’t do it by rote. (If there are rote questions, let’s put them into low-level automated checks. And let’s do it skillfully.)

If we can’t estimate by counting test cases, how can we estimate how much time we’ll need for testing?

There are plenty of activities that don’t yield to piecework models because they are inseparable from the project in which they happen. In another of James Bach’s analogies, no one estimates the looking-out-the-window phase of driving an automobile journey. You can estimate the length of the journey, but looking out the window happens continuously, until the travellers have reached the destination. Indeed, looking out the window informs the driver’s evaluation of whether the journey is on track, and whether the destination has been reached. No one estimates the customer service phase of a hotel stay. You can estimate the length of the stay, but customer service (when it’s good) is available continuously until the visitor has left the hotel. For management purposes, customer service people (the front desk, the room cleaners) inform the observation that the visitor has left. No one estimates the “management phase” of a software development project. You can estimate how long development will take, but management (when it’s good) happens continuously until the product owner has decided to release the product. Observations and actions from managers (the development manager, the support manager, the documentation manager, and yes, the test manager) inform the product owner’s decision as to whether the product is ready to ship.

So it goes for testing. Test estimation becomes a problem only if one makes the mistake of treating testing as a separate activity or phase, rather than as an open-ended, ongoing investigation that continues throughout the project.

My manager says that I have to provide an estimate, so what do I do?

At the beginning of the project, we know very little relative to what we’ll know later. We can’t know everything we’ll need to know. We can’t know at the beginning of the project whether the product will meet its schedule without being visited by a Black Swan or a flock of Black Cygnets. So instead of thinking in terms of test estimation, try thinking in terms of strategy, logistics, negotiation, and refinement.

Our strategy is the set of ideas that guide our test design. Those ideas are informed by the project environment, or context; by the quality criteria that might be valued by users and other stakeholders; by the test coverage that we might wish to obtain; and by the test techniques that we might choose to apply. (See the Heuristic Test Strategy Model that we use in Rapid Testing as an example of a framework for developing a strategy.) Logistics is the set of ideas that guide our application of people, equipment, tools, and other resources to fulfill our strategy. Put strategy and logistics together and we’ve got a plan.

Since we’re working with (and, more importantly, for) a client, the client’s mission, schedule, and budget are central to choices on the elements of our strategy and logistics. Some of those choices may follow history or the current state of affairs. For example, many projects happen in shops that already have a roster of programmers and testers; many projects are extensions of an existing product or service. Sometimes project strategy ideas are based on projections or guesswork or hopes; for example, the product owner already has some idea of when she wants to ship the product. So we use whatever information is available to create a preliminary test plan. Our client may like our plan, and she may not. Either way, in an effective relationship, neither party can dictate the terms of service. Instead, we negotiate. Many of our preconceptions (and the client’s) will be invalid and will change as the project evolves. But that’s okay; the project environment, excellent testing, and a continuous flow of reporting and interaction will immediately start helping to reveal unwarranted assumptions and new risk ideas. If we treat testing as something that happens continuously with development, and if we view development in cycles that provide a kind of pulse for the project, we have opportunities to review and refine our plans.

So: instead of thinking about estimation of the “testing phase”, think about negotiation and refinement of your test strategy within the context of the overall project. That’s what happens anyway, isn’t it?

But my management loves estimates! Isn’t there something we can estimate?

Although it doesn’t make sense to estimate testing effort outside the context of the overall project, we can charter and estimate testing effort within a development cycle. The basic idea comes from Session Based Test Management, James and Jon Bach’s approach to plan, estimate, manage, and measure exploratory testing in circumstances that require high levels of accountability. The key factors are:

  • time-boxed sessions of uninterrupted testing, ranging from 45 minutes to two hours and fifteen minutes, with the goal of making a normal session 90 minutes or so;

  • test coverage areas—typically functions or features of the product to which we would like to dedicate some testing time;
  • activities such as research, review, test design, data generation, toolsmithing, or retesting, to which we might also like to dedicate testing time;
  • charters, in the form of a one- to three-sentence mission statement that guides the session to focus on specific coverage areas and/or activities;

  • debriefings, in which a tester and a test lead or manager discuss the outcome of a session;

  • reviewable results, in the form of a session sheet that provides structure for the debrief, and that can be scanned and parsed by a Perl script; and, optionally,

  • a screen-capture recording of the session when detailed retrospective investigation or analysis might be needed;

  • metrics whose purposes are to determine how much time is spent on test design and execution (activities that yield test coverage) vs. bug investigation and reporting, and setup (activities that interrupt the generation of test coverage).

The timebox provides a structure intended to make estimation and accounting for time fairly imprecise, but reasonably accurate. (What’s the difference? As I write, the time and date is 9:43:02.1872 in the morning, January 23, 1953. That’s a very precise reckoning of the time and date, but it’s completely inaccurate.)

Let’s also assume that a development cycle is two weeks, or ten working days—the length of a typical agile iteration. Let’s assume that we have four testers on the team, and that each tester can accomplish three sessions of work per day (meetings, e-mail, breaks, conversations, and other non-session activities take up the rest of the time).

ten days * four testers * three sessions = 120 sessions

Let’s assume further that sessions cannot be completely effective, in that test design and execution will be interrupted by setup and bug investigation. Suppose that we reckon 10% of the time spent on setup, and 25% of the time spent on investigating and reporting bugs. That’s 35% in total; for convenience, let’s call it 1/3 of the time.

120 sessions – 120 * 1/3 interruption time = 80 sessions

Thus in our two-week iteration we estimate that we have time for 80 focused, targeted effective idealized sessions of test coverage, embedded in 120 actual sessions of testing. Again, this is not a precise figure; it couldn’t possibly be. If our designers and programmers have done very well in a particular area, we won’t find lots of bugs and our effective coverage per session will go up. If setup is in some way lacking, we may find that interruptions account for more than one-third of the time, which means that our effective coverage will be reduced, or that we have to allocate more sessions to obtain the same coverage. So as soon as we start obtaining information about what actually went on in the sessions, we feed that information back into the estimation. I wrote extensively about that here.
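The arithmetic above is simple enough to keep in a small script, so that it can be re-run whenever the session sheets suggest a different interruption rate. What follows is a minimal Python sketch under the assumptions of the example. The function name and its parameters are invented for illustration and are not part of Session Based Test Management.

```python
from fractions import Fraction


def session_capacity(days, testers, sessions_per_day, interruption_fraction):
    """Return (actual_sessions, effective_sessions) for one development cycle.

    interruption_fraction is the share of session time we expect to lose to
    setup plus bug investigation and reporting. It is a guess to be refined,
    not a measurement.
    """
    actual = days * testers * sessions_per_day
    effective = actual * (1 - interruption_fraction)
    return actual, effective


# The example from the text: ten days, four testers, three sessions a day,
# with roughly one third of the time lost to interruptions.
actual, effective = session_capacity(10, 4, 3, Fraction(1, 3))
print(actual, effective)  # 120 80
```

Using the unrounded 35% figure instead, `session_capacity(10, 4, 3, Fraction(35, 100))` gives 78 effective sessions rather than 80, which is a reminder of how coarse these numbers are.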

The metrics on interruptions could themselves be fascinating and actionable information for managers. But note that the metrics on their own are not conclusive. They can’t be. Instead, they inform questions. Why has there been more bug investigation than we expected? Are there more problems than we anticipated, or are testers spending too much time investigating before consulting with the programmers? Is setup taking longer than it should, such that customers will have setup problems too? Even if the setup problems will be experienced only in testing, are there ways to make setup more rapid so that we can spend more time on test coverage? The real value of any metrics is in the questions they raise, rather than in the answers they give.

There’s an alternative approach, for those who want to estimate the duration or staffing for a test cycle: set the desired amount of coverage, and apply the fixed variables and calculate for the free ones. Break the product down into areas, and assign some desired number of sessions to each based on risk, scope, complexity, or any combination of factors you choose. Based on prior experience or even on a guess, adjust for interruptions and effectiveness. If you know the number of testers, you can figure the amount of time required; if you want to set the amount of time, you can calculate for the number of testers required. This provides you with a quick estimate.

Which, of course, you should immediately distrust. What influence do tester experience and skill have on your estimate? On the eventual reality? If you’re thinking of adding testers, can you avoid banging into Brooks’ Law? Are your notions of risk static? Are they valid? And so forth. Estimation done well should provoke a large number of questions. Not to worry; actual testing will inform the answers to those questions.

Wait a second. We paid a lot of money for an expensive test management tool, and we sent all of our people to a one-week course on test estimation, and we now spend several weeks preparing our estimates. And since we started with all that, our estimates have come out really accurate.

If experience tells us anything, it should tell us that we should be suspicious of any person or process that claims to predict the future reliably. Such claims tend to be fulfilled via the Ludic Fallacy and the narrative bias, central pillars of the philosophy of The Black Swan. Since we already have an answer to the question “When are we going to be done?”, we have the opportunity (and often the mandate) to turn an estimate into a self-fulfilling prophecy. Jerry Weinberg’s Zeroth Law of Quality (“If you ignore quality, you can meet any other requirement”) is a special case of my own, more general Zeroth Law of Wish Fulfillment: “If you ignore some factors, you can achieve anything you like.” If your estimates always match reality, what assumptions and observations have you jettisoned in order to make reality fit the estimate? And if you’re spending weeks on estimation, might that time be better spent on testing?

Project Estimation and Black Swans (Part 3)

Wednesday, October 20th, 2010

Last time out, we determined that mucking with the estimate to account for variance and surprises in projects is in several ways wanting. This time, we’ll make some choices about the tasks and the projects, and see where those choices might take us.

Leave Problem Tasks Incomplete; Accept Missing Features

There are a couple of variations on this strategy. The first is to Blow The Whistle At 100. That is, we could time-box the entire project, which in the examples above would mean stopping the project after 100 hours of work no matter where we were. That might seem a little abrupt, but we would be done after 100 hours.

To reduce the surprise level and make things a tiny bit more predictable, we could Drop Scope As You Go. That is, if we were to find out at some point that we’re behind our intended schedule, we could refine the charter of the project to meet the schedule by immediately revising the estimate and dropping scope commitments equivalent to the amount of time we’ve lost. Moreover, we could choose which tasks to drop, preferring to drop those that were interdependent with other tasks.

In our Monte Carlo model, project scope is represented by the number of tasks that we attempt. After a Wasted Morning, we drop any future commitment to at least three tasks; after a Lost Day, we drop seven; and after a Black Cygnet, we drop 15. We don’t have to drop the tasks completely; if we get close to 100 hours and find out that we have plenty of time left over due to a number of Stunning Successes, we can resume work on one or more of the dropped tasks.

Of course, any tasks that we’ve dropped might have turned out to be Stunning Successes, but in this model, we assume that we can’t know that in advance; otherwise, there’d be no need to estimate. In this scenario, it would also be wise to allocate some task time to manage the dropping and picking up of tasks.

I’ve been a program manager for a company that used a combination of Blow The Whistle and Drop Scope As You Go very successfully. This strategy often works reasonably well for commercial software. In general, you have to release an update periodically to keep the stock market analysts and shareholders happy. Releasing something less ambitious than you hoped is disappointing. Still, it’s usually more palatable than shipping late and missing out on revenue for a given quarter. If you can keep the marketers, salespeople, and gossip focused on things that you’ve actually done, no one outside the company has to know how much you really intended to do. There’s an advantage here, too, for future planning: uncompleted tasks for this project represent elements of the task list for the next project.

Leave Problem Tasks Incomplete; Accept Missing Features AND Bugs

We could time-box our tasks, lower our standard of quality, and stop working on a task as soon as it extends beyond a Little Slip. This typically means bugs or other problems in tasks that would otherwise have been Wasted Mornings, Lost Days, or Black Cygnets, and it means at least a few dropped tasks too (since even a Little Slip costs us a Regular Task).

This is The Perpetual Beta Strategy, in which we adjust our quality standards such that we can declare a result a draft or a beta at the predicted completion time. The Perpetual Beta Strategy assumes that our customers explicitly or implicitly consent to accepting something on the estimated date, and are willing to sacrifice features, live with problems, wait for completion of the original task list, or some combination of all of these. That’s not crazy. In fact, many organizations work this way. Some have got very wealthy doing it.

Either of these two strategies would work less well the more our tasks had dependencies upon each other. So, a related strategy would be to…

De-Linearize and Decouple Tasks

We’re especially at risk of project delays when tasks are interdependent, and when we’re unable to switch the sequence of tasks quickly and easily. My little Monte Carlo exercises are agnostic about task dependencies. As idealized models, they’re founded on the notion that a problem in one area wouldn’t affect the workings in any other area, and that a delay in one task wouldn’t have an impact on any other tasks, only on the project overall. On the one hand, the simulations just march straight through the tasks in each project sequentially, as though each task were dependent on the last. On the other hand, each task is assigned a time at random.

In real life, things don’t work this way. Much of the time, we have options to re-organize and re-prioritize tasks, such that when a Black Cygnet task comes along, we may be able to ignore it and pick some other task. That works when we’re ultimately flexible, and when tasks aren’t dependent on other tasks.

And yet at some point, in any project and any estimation effort, there’s going to be a set of tasks that are on a critical path. I’ve never seen a project organized such that no task was dependent on any other task. The model still has some resonance, even if we don’t take it literally.

A key factor here would seem to be preventing problems, and dealing with potential problems at the first available opportunity.

Detect and Manage The Problems

What could we do to prevent, detect, and manage problems?

We could apply Agile practices like promiscuous pairing (that is, making sure that every team member regularly pairs with every other team member). Such things might help with the critical path issue. If each person has at least passing familiarity with the whole project, each is more likely to be able to work on a new task while their current one is blocked. Similarly, when one person is blocked, others can help by picking up on that person’s tasks, or by helping to remove the block.

We could perform some kind of corrective action as soon as we have any information to suggest that a given task might not be completed on time. That suggests shortening feedback loops by constant checking and testing, checking in on tasks in progress, and resolving problems as early as possible, instead of allowing tasks to slip into potentially disastrous delays. By that measure, a short daily standup is better than a long weekly status meeting; pairing, co-location and continuous conversation are better still. Waiting to check or test the project until we have an integration- or system-level build provides relatively slow feedback for low-level problems; low-level unit checks reveal information relatively quickly and easily.

We could manage both tasks and projects to emphasize information gathering and analysis. Look at the nature of the slippages; maybe there’s a pattern to Black Cygnets, Lost Days, or Wasted Mornings. Is a certain aspect of the project consistently problematic? Does the sequencing of the project make it more vulnerable to slips? Are experiments or uncertain tasks allocated the task time that they need to inform better estimation? Is some person or group consistently involved in delays, such that training, supervision, pairing, or reassignment might help?

Note that obtaining feedback takes some time. Meetings can take task-level units of time, and continuous conversation may slow down tasks. As a result, we might have to change some of our tasks or some part of them from work to examining work or talking about work; and it’s likely some Stunning Successes will turn into Regular Tasks. That’s the downside. The upside is that we’ll probably prevent some Little Slips, Wasted Mornings, Lost Days and Black Cygnets, and turn them into Regular Tasks or Stunning Successes.

We could try to reduce various kinds of inefficiencies associated with certain highly repetitive tasks. Lots of organizations try to do this by bringing in continuous building and integration, or by automating the checking that they do for each new build. But be aware that the process of automating those checks involves lots of tasks that are themselves subject to the same kind of estimation problems that the rest of your project must endure.

So, if we were to manage the project, respond quickly to potentially out-of-control tasks, and moderate the variances using some of the ideas above, how would we model that in a Monte Carlo simulation? If we’re checking in frequently, we might not be able to get as much done in a single task, so let’s turn the Stunning Successes (50% of the estimated task time) into Modest Successes (75% of the estimated task time). Inevitably we’ll underestimate some tasks and overestimate others, so let’s say on average, out of 100 tasks, 50 come in 25% early, 49 come in 25% late. Bad luck of some kind happens to everyone at some point, so let’s say there’s still a chance of one Black Cygnet per project.

| Number of tasks | Type of task | Duration | Total (hours) |
|---|---|---|---|
| 50 | Modest Success | .75 | 37.5 |
| 49 | Tiny Slip | 1.25 | 61.25 |
| 1 | Black Cygnet | 16 | 16 |

Once again, I ran 5000 simulated projects.

Average Project 114.67
Minimum Length 92.0
Maximum Length 204.25
On time or early 1058 (21.2%)
Late 3942 (78.8%)
Late by 50% or more 96 (1.9%)
Late by 100% or more 1 (0.02%)

Image: Managed Project

Remember that in the first example above, half our tasks were early by 50%. Here, half our tasks are early by only 25%, but things overall look better. We’ve doubled the number of on-time projects, and our average project length is down to 114% from 124%. Catching problems before they turn into Wasted Mornings or Lost Days makes an impressive difference.
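Because this managed scenario has only three kinds of task, the on-time rate can also be worked out exactly rather than by simulation, which gives a cross-check on the numbers above. The following is a Python sketch under the stated probabilities and durations. It is an independent calculation, not the script that produced the results in this post, and the function name is an assumption.

```python
from math import comb


def on_time_probability(tasks, p_early=0.50, p_late=0.49, p_cygnet=0.01,
                        early=0.75, late=1.25, cygnet=16.0):
    """Exact chance that a project finishes within `tasks` hours.

    Each task is independently a Modest Success, a Tiny Slip, or a Black
    Cygnet, so the counts of each kind follow a multinomial distribution.
    """
    estimate = tasks * 1.0  # one estimated hour per task
    total = 0.0
    for c in range(tasks + 1):              # number of Black Cygnets
        for a in range(tasks - c + 1):      # number of Modest Successes
            b = tasks - c - a               # the rest are Tiny Slips
            if a * early + b * late + c * cygnet <= estimate:
                ways = comb(tasks, c) * comb(tasks - c, a)
                total += ways * p_cygnet**c * p_early**a * p_late**b
    return total


print(round(on_time_probability(100), 3))
```

For 100 tasks the result should land close to 0.21, in line with the 21.2% of simulated projects that came in on time or early. In practice a project in this model is on time only if it has no Black Cygnet and at least 50 Modest Successes.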

Detect and Manage The Problems, Plus Short Iterations

The more tasks in a project, the greater the chance that we’ll be whacked with a random Black Cygnet. So, we could choose our projects and refrain from attempting big ones. This is essentially the idea behind agile development’s focus on a rapid series of short iterations, rather than on a single monolithic project. Breaking a big project up into sprints offers us the opportunity to do the project-level equivalent of frequent check-ins on our tasks.

When I modeled an agile project with a Monte Carlo simulation, I was astonished by what happened.

For the task/duration breakdown, I took the same approach as just above:

| Number of tasks | Type of task | Duration | Total (hours) |
|---|---|---|---|
| 50 | Modest Success | .75 | 37.5 |
| 49 | Tiny Slip | 1.25 | 61.25 |
| 1 | Black Cygnet | 16 | 16 |

I changed the project size to 20 tasks. Then, to compensate for the fact that the projects were only 20 tasks long, instead of 100, I ran 25000 simulated projects.

Average Project 22.94
Minimum Length 16
Maximum Length 66.75
On time or early 12433 (49.7%)
Late 12567 (50.3%)
Late by 50% or more 4552 (18.2%)
Late by 100% or more 400 (1.6%)

Image: Agile Project

A few points of interest. At last, we’re estimating to the point where almost half of our projects are on time! In addition, more than 80% of the projects (20443 out of 25000, in my run) are within 15% of the estimated time, and since the entire project is only 20 hours, these projects run over by at most three hours. That affords quick course correction; in the 100-hours-per-project model, the average project is late by three days.

Here’s one extra fascinating result: the total time taken for these 25000 projects (500,000 tasks in all) was 573,410 hours. For the original model (the one above, the first in yesterday’s post), the total was 619,156.5 hours, or 8% more. For the more realistic second example, the total was 736,199.2 hours, or 28% more. In these models, shorter iterations give less opportunity for random events to affect a given project.
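The effect of project size on exposure to Black Cygnets can be checked with a one-line formula: if each task independently has a 1% chance of being a Black Cygnet, the chance of at least one in a project of n tasks is 1 − 0.99ⁿ. Here is a small Python sketch of that calculation; the function name is an assumption made for illustration.

```python
def chance_of_black_cygnet(tasks, p_cygnet=0.01):
    """Probability that at least one of `tasks` independent tasks is a Black Cygnet."""
    return 1 - (1 - p_cygnet) ** tasks


for size in (20, 100):
    print(size, round(chance_of_black_cygnet(size), 3))
# 20 0.182
# 100 0.634
```

The 18.2% figure for 20-task projects matches the share of short projects that were late by 50% or more. In this model a single Black Cygnet pushes a 20-hour project past 30 hours (16 + 19 × 0.75 = 30.25), and without one the project cannot exceed 25 hours.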

So, what does all this mean? What can we learn? Let’s review some ideas on that next time.

Project Estimation and Black Swans (Part 2)

Sunday, October 17th, 2010

In the last post, I talked about the asymmetry of unexpected events and the attendant problems with estimation. Today we’re going to look at some possible workarounds for the problems. Testers often start by questioning the validity of models, so let’s start there.

The linear model that I’ve proposed doesn’t match reality in several ways, and so far I haven’t been very explicit about them. Here are just a few of the problems with the model.

  • The model tacitly assumes that all tasks have to be done in a specific order.
  • The model tacitly assumes that all tasks are of equal significance.
  • The model leaves out all notions of tasks being independent or interdependent with respect to each other.
  • The model assumes that once we’re into a Wasted Morning, a Lost Day, or a Black Cygnet, there’s nothing we can do about it, and that we won’t do anything about it.

In particular, the model leaves out control actions that could be applied by managers or by the people performing the tasks, control actions that could be applied to the tasks, the project, the context, or to the estimates. Let’s start with the latter.

Pad The Estimates So We’re Half Right

Here’s the chart of yesterday’s first scenario again:

Under the given set of assumptions, and assuming random distribution, we come in late a little over 90% of the time. To counter this, we could add some arbitrary percentage to our estimates such that half the time we’ll come in early, while the other half of the time we’ll (still) come in late. In that case, we’d want to pick a median value.

When I used the data from the Monte Carlo simulation and sorted the project lengths, I found that Project 2500, the one right in the middle, has a length of 122 hours. So: pad the estimate by 22%, and we’ll be on time 50% of the time.

There are two problems with this. The first is that there’s still significant variability in terms of how late the late projects are. Second, the asymmetry problem is the same for projects as it is for individual tasks: our big losses have a greater magnitude than our big wins. Even if we go for the average project length, rather than the median (the average, 123.83 hours, is a couple of hours longer), fewer projects will go over the estimated time, but early projects will tend to be more modestly early, while the late ones will be more extremely late. None of this is likely to be acceptable to someone who values predictability (that is, the person who is asking us for the estimate).

Pad The Estimates So We’re Almost Always Right

Someone who likes predictability would probably prefer our projects to come in on time 95% of the time. If we wanted to satisfy that, based on the same set of assumptions, we would do the best estimating job we could, then pad our estimate by 58%, to 158 hours.

One problem with that strategy is that work tends to expand to fill time available, and people will start to work at a slower pace.

On the other hand, if people keep the regular pace up, 82% of our projects are going to come in at least 10% early, and 42% of our projects will come in 25% early! In such a case, we’ll probably face political backlash and be urged to be less conservative with our estimates. By the math, we really can’t win under this set of assumptions.
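Both padding strategies amount to reading a percentile off the sorted list of simulated project lengths. The Python sketch below shows one way to do that. It re-simulates the first scenario rather than using the original data, so its numbers will differ a little from run to run and from the 122 and 158 hours quoted above. The function names and the seed are assumptions.

```python
import math
import random

# (probability, hours) for each kind of task in the first scenario
SCENARIO = [(0.50, 0.5), (0.35, 1.0), (0.08, 2.0),
            (0.04, 4.0), (0.02, 8.0), (0.01, 16.0)]


def simulate_lengths(projects, tasks, seed=None):
    """Return the total hours for each of `projects` simulated projects."""
    rng = random.Random(seed)
    weights = [p for p, _ in SCENARIO]
    durations = [hours for _, hours in SCENARIO]
    return [sum(rng.choices(durations, weights=weights, k=tasks))
            for _ in range(projects)]


def padded_estimate(lengths, on_time_rate):
    """Smallest simulated length that at least `on_time_rate` of projects fit within."""
    ordered = sorted(lengths)
    index = max(0, math.ceil(on_time_rate * len(ordered)) - 1)
    return ordered[index]


lengths = simulate_lengths(projects=5000, tasks=100, seed=2010)
half_right = padded_estimate(lengths, 0.50)
almost_always_right = padded_estimate(lengths, 0.95)
print(half_right, almost_always_right)
```

The gap between the two printed values is the price of predictability under this model: the padding needed to be right 95% of the time is much larger than the padding needed to be right half the time.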

Pad The Team

Rather than padding the estimate of time, we could build slack into the system by having extra people available to take on any surprises or misunderstandings. But note Fred Brooks’ Law, which says that adding people to a late project makes it later. That’s because of at least two problems: the new people need to be brought up to speed, and having more connections in a system tends to increase the communication burden.

So maybe we’ll have to change something about the way we manage the project. We’ll look at that next.

Project Estimation and Black Swans (Part 1)

Thursday, October 14th, 2010

There has been a flurry of discussion about estimation on the net in the last few months.

All this reminded me to post the results of some number-crunching experiments that I started to do back in November 2009, based on a thought experiment by James Bach. That work coincided with the writing of a Swan Song, a Better Software column in which I discussed The Black Swan, by Nassim Nicholas Taleb.

A Black Swan is an improbable and unexpected event that has three characteristics. First, it takes us completely by surprise, typically because it’s outside of our models. Taleb says, “Models and constructions, those intellectual maps of reality, are not always wrong; they are wrong only in some specific applications. The difficulty is that a) you do not know beforehand (only after the fact) where the map will be wrong, and b) the mistakes can lead to severe consequences. These models are like potentially helpful medicines that carry random but very severe side effects.”

Second, a Black Swan has a disproportionately large impact. Many rare and surprising events happen that aren’t such a big deal. Black Swans can destroy wealth, property, or careers—or create them. A Black Swan can be a positive event, even though we tend not to think of them as such.

Third, after a Black Swan, people have a tendency to say that they saw it coming. They make this claim after the event because of a pair of inter-related cognitive biases. Taleb calls the first epistemic arrogance, an inflated sense of knowing what we know. The second is the narrative fallacy, our tendency to bend a story to fit with our perception of what we know, without validating the links between cause and effect. It’s easy to say that we know the important factors of the story when we already know the ending. The First World War was a Black Swan; September 11, 2001 was a Black Swan; the earthquake in Haiti, the volcano in Iceland, and the Deepwater Horizon oil spill in the Gulf of Mexico were all Black Swans. (The latter was a white swan, but it’s now coated in oil, which is the kind of joke that atracygnologists like to make). The rise of Google’s stock price after it went public was a Black Swan too. (You’ll probably meet people who claim that they knew in advance that Google’s stock price would explode. If that were true, they would have bought stock then, and they’d be rich. If they’re not rich, it’s evidence of the narrative fallacy in action.)

I think one reason that projects don’t meet their estimates is that we don’t naturally consider the impact of the Black Swan. James introduced me to a thought experiment that illustrates some interesting problems with estimation.

Imagine that you have a project, and that, for estimation’s sake, you broke it down into really fine-grained detail. The entire project decomposes into one hundred tasks, such that you figured that each task would take one hour. That means that your project should take 100 hours.

Suppose also that you estimated extremely conservatively, such that half of the tasks (that is, 50) were accomplished in half an hour, instead of an hour. Let’s call these Stunning Successes. 35% of the tasks are on time; we’ll call them Regular Tasks.

15% of the time, you encounter some bad luck.


  • Eight tasks, instead of taking an hour, take two hours. Let’s call those Little Slips.

  • Four tasks (one in 25) end up taking four hours, instead of the hour you thought they’d take. There’s a bug in some library that you’re calling; you need access to a particular server and the IT guys are overextended so they don’t call back until after lunch. We’ll call them Wasted Mornings.

  • Two tasks (one in fifty) take a whole day, instead of an hour. Someone has to stay home to mind a sick kid. Those we’ll call Lost Days.

  • One task in a hundred—just one—takes two days instead of just an hour. A library developed by another team is a couple of days late; a hard drive crash takes down a system and it turns out there’s a Post-It note jammed in the backup tape drive; one of the programmers has her wisdom teeth removed (all these things have happened on projects that I’ve worked on). These don’t have the devastating impact of a Black Swan; they’re like baby Black Swans, so let’s call them Black Cygnets.

| Number of tasks | Type of task | Duration | Total (hours) |
|---|---|---|---|
| 50 | Stunning Success | .50 | 25 |
| 35 | On Time | 1.00 | 35 |
| 8 | Little Slip | 2 | 16 |
| 4 | Wasted Morning | 4 | 16 |
| 2 | Lost Day | 8 | 16 |
| 1 | Black Cygnet | 16 | 16 |
| 100 | | | 124 |

That’s right: the average project, based on the assumptions above, would come in 24% late. That is, you estimated it would take two and a half weeks. In fact, it’s going to take more than three weeks. Mind you, that’s the average project, and the notion of the “average” project is strictly based on probability. There’s no such thing as an “average” project in reality and all of its rich detail. Not every project will encounter bad luck—and some projects will run into more bad luck than others.
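The 124-hour figure is just a weighted sum over the rows of the table, which is easy to reproduce and to vary. Here is a minimal Python sketch; the data structure and function name are assumptions made for illustration.

```python
# (number of tasks, hours per task, label) for each row of the table above
SCENARIO = [
    (50, 0.5, "Stunning Success"),
    (35, 1.0, "On Time"),
    (8, 2.0, "Little Slip"),
    (4, 4.0, "Wasted Morning"),
    (2, 8.0, "Lost Day"),
    (1, 16.0, "Black Cygnet"),
]


def expected_hours(rows):
    """Total hours if every kind of task occurs exactly as often as its probability."""
    return sum(count * hours for count, hours, _ in rows)


total_tasks = sum(count for count, _, _ in SCENARIO)
print(total_tasks, expected_hours(SCENARIO))  # 100 124.0
```

Changing a single row (say, removing the Black Cygnet) and re-running shows how much each kind of bad luck contributes to the overrun.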

So there’s a way of modeling projects in a more representative way, and it can be a lot of fun. Take the probabilities above, and subject them to random chance. Do that for every task in the project, then run a lot of projects. This shows you what can happen on projects in a fairly dramatic way. It’s called a Monte Carlo simulation, and it’s an excellent example of exploratory test automation.

I put together a little Ruby program to generate the results of scenarios like the one above. The script runs N projects of M tasks each, allows me to enter as many probabilities and as many durations as I like, puts the results into an Excel spreadsheet, and graphs them. (Naturally I found and fixed a ton of bugs in my code as I prepared this little project. But I also found bugs in Excel, including some race-condition-based crashes, API performance problems, and severely inadequate documentation. Ain’t testing fun?) For the scenario above, I ran 5000 projects of 100 randomized tasks each. Based on the numbers above, I got these results:

Average Project 123.83 hours
Minimum Length 74.5 hours
Maximum Length 217 hours
On time or early projects 460 (9.2%)
Late projects 4540 (90.8%)
Late by 50% or more 469 (9.4%)
Late by 100% or more 2 (0.04%)

Image: Standard Project
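For readers who want to try this kind of experiment themselves, here is a minimal Python sketch of a Monte Carlo run over the scenario above. It is an independent illustration, not the Ruby script described in this post. It produces no spreadsheet or graph, and its function names and seed are assumptions. Because the tasks are drawn at random, its results will be close to, but not identical to, the figures reported here.

```python
import random

# (probability, hours) for each kind of task in the scenario table
SCENARIO = [(0.50, 0.5), (0.35, 1.0), (0.08, 2.0),
            (0.04, 4.0), (0.02, 8.0), (0.01, 16.0)]


def simulate_project(tasks, rng):
    """Draw a random duration for each task and return the project total."""
    durations = [hours for _, hours in SCENARIO]
    weights = [p for p, _ in SCENARIO]
    return sum(rng.choices(durations, weights=weights, k=tasks))


def summarize(projects=5000, tasks=100, seed=None):
    """Run many projects and summarize them against a one-hour-per-task estimate."""
    rng = random.Random(seed)
    estimate = float(tasks)
    lengths = [simulate_project(tasks, rng) for _ in range(projects)]
    return {
        "average": sum(lengths) / projects,
        "minimum": min(lengths),
        "maximum": max(lengths),
        "on_time_or_early": sum(1 for x in lengths if x <= estimate),
        "late": sum(1 for x in lengths if x > estimate),
        "late_by_50_percent": sum(1 for x in lengths if x >= 1.5 * estimate),
        "late_by_100_percent": sum(1 for x in lengths if x >= 2.0 * estimate),
    }


for name, value in summarize(seed=2010).items():
    print(name, value)
```

Passing a different `SCENARIO`, project count, or task count is enough to explore the other models discussed in this series.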

Here are some of the interesting things I see here:


  • The average project took 123.83 hours, almost 25% longer than estimated.

  • 460 projects (or fewer than 10%) were on time or early!

  • 4540 projects (or just over 90%) were late!

  • You can get lucky. In the run I did, three projects were accomplished in 80 hours or fewer. No project avoided having any Wasted Mornings, Lost Days, or Black Cygnets. That’s none out of five thousand.

  • You can get unlucky, too. 469 projects took at least 1.5 times their projected time. Two took more than twice their projected time. And one very unlucky project had four Wasted Mornings, one Lost Day, and eight Black Cygnets. That one took 217 hours.

This might seem to some to be a counterintuitive result. Half the tasks took only half of the time allotted to them. 85% of the tasks came in on time or better. Only 15% were late. There’s a one-in-one-hundred chance that you’ll encounter a Black Cygnet. How could it be that so few projects came in on time?

The answer lies in asymmetry, another element of Taleb’s Black Swan model. It’s easy to err in our estimates by, say, a factor of two. Yet dividing the duration of a task by two has a very different impact from multiplying the duration by two. A Minor Victory saves only half a Regular Task, but a Little Slip costs two whole Regular Tasks.
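The asymmetry can be made concrete with a little arithmetic. The following is a minimal Python sketch, assuming every task is estimated at one hour; the function names are mine, chosen for illustration. It compares what is saved when a task finishes early by some factor with what is added when a task runs late by the same factor.

```python
def hours_saved_when_early(factor, estimate=1.0):
    """Hours saved when a task takes estimate / factor instead of estimate."""
    return estimate - estimate / factor


def extra_hours_when_late(factor, estimate=1.0):
    """Extra hours spent when a task takes estimate * factor instead of estimate."""
    return estimate * factor - estimate


if __name__ == "__main__":
    # Factors 2, 4, 8 and 16 correspond to the task durations used in the scenarios.
    for factor in (2, 4, 8, 16):
        saved = hours_saved_when_early(factor)
        extra = extra_hours_when_late(factor)
        print(f"off by a factor of {factor:>2}: early saves {saved:.4f} h, "
              f"late adds {extra:.0f} h")
```

However large the factor, an early task can never save more than the one hour that was estimated for it, while the extra time for a late task grows without limit. A task that is late by a factor of two takes two hours in total, which is one hour more than planned; a task that is early by a factor of two gives back only half an hour.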

Suppose you’re pretty good at estimation, and that you don’t underestimate so often. 20% of the tasks come in 10% early (let’s call those Minor Victories). 65% of the tasks come in right on time (Regular Tasks). That is, 85% of your estimates are either too conservative or spot on. As before, there are eight Little Slips, four Wasted Mornings, two Lost Days, and a Black Cygnet.

With 20% of your tasks coming in early, and 15% coming in late, how long would you expect the average project to take?

Number of tasks   Type of task     Duration (hours)   Total (hours)
20                Minor Victory    0.9                18
65                On Time          1.00               65
8                 Little Slip      2                  16
4                 Wasted Morning   4                  16
2                 Lost Day         8                  16
1                 Black Cygnet     16                 16
100               Total                               147

That’s right: even though your estimation of tasks is more accurate than in the first example above, the average project would come in 47% late. That is, you thought it would take two and a half weeks, and in fact, it’s going to take more than three and a half weeks. Mind you, that’s the average, and again that’s based on probability. Just as above, not every project will encounter bad luck, and some projects will run into more bad luck than others. Again, I ran 5,000 projects of 100 tasks each.

Average Project 147.24 hours
Minimum Length 105.2 hours
Maximum Length 232 hours
On time or early projects 0 (0.0%)
Late projects 5000 (100.0%)
Late by 50% or more 2022 (40.4%)
Late by 100% or more 30 (0.6%)

Image: Typical Project

Over 5000 projects, not a single project came in on time. The very best project came in just over 5% late. It had 18 Minor Victories, 77 on-time tasks, four Little Slips, and a Wasted Morning. It successfully avoided the Lost Day and the Black Cygnet. And in being anywhere near on-time, it was exceedingly rare. In fact, only 16 out of 5000 projects were less than 10% late.
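The second scenario can be re-created in a few lines. What follows is a sketch in Python, not the Ruby program mentioned earlier (its source is not shown here); the names OUTCOMES, expected_project_hours, simulate_project, and run_simulation are my own, and the spreadsheet and graphing steps are left out. It assumes that each task is estimated at one hour and that each task's outcome is drawn independently using the probabilities from the table.

```python
import random

# Second scenario: (label, probability, actual hours for a task estimated at 1 hour).
OUTCOMES = [
    ("Minor Victory", 0.20, 0.9),
    ("On Time", 0.65, 1.0),
    ("Little Slip", 0.08, 2.0),
    ("Wasted Morning", 0.04, 4.0),
    ("Lost Day", 0.02, 8.0),
    ("Black Cygnet", 0.01, 16.0),
]


def expected_project_hours(outcomes, tasks=100):
    """Probability-weighted project length, as in the table."""
    return tasks * sum(p * hours for _, p, hours in outcomes)


def simulate_project(outcomes, tasks, rng):
    """Draw a random outcome for every task and return the total hours."""
    durations = [hours for _, _, hours in outcomes]
    weights = [p for _, p, _ in outcomes]
    return sum(rng.choices(durations, weights=weights, k=tasks))


def run_simulation(outcomes, projects=5000, tasks=100, seed=None):
    """Run many random projects and summarize how they compare to the estimate."""
    rng = random.Random(seed)
    estimate = float(tasks)  # one hour per task
    lengths = [simulate_project(outcomes, tasks, rng) for _ in range(projects)]
    return {
        "average": sum(lengths) / projects,
        "minimum": min(lengths),
        "maximum": max(lengths),
        "on_time_or_early": sum(1 for h in lengths if h <= estimate),
        "late": sum(1 for h in lengths if h > estimate),
        "late_by_50_percent": sum(1 for h in lengths if h >= 1.5 * estimate),
        "late_by_100_percent": sum(1 for h in lengths if h >= 2.0 * estimate),
    }


if __name__ == "__main__":
    print("expected hours:", expected_project_hours(OUTCOMES))
    for name, value in run_simulation(OUTCOMES).items():
        print(name, value)
```

Because the outcomes are random, each run produces somewhat different figures from the ones reported above, although the average should stay close to 147 hours. Passing a seed makes a run repeatable, and swapping in a different list of outcomes lets you try other assumptions about how often tasks slip.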

Now, these are purely mathematical models. They ignore just about everything we could imagine about self-aware systems, and the ways the systems and their participants influence each other. The only project management activity that we’re really imagining here is the modelling and estimating of tasks into one-hour chunks. Everything that happens after that is down to random luck. Yet I think the Monte Carlo simulations show that, unmanaged, what we might think of as a small number of surprises and a small amount of disorder can have a big impact.

Note that, in both of the examples above, at least 85% of the tasks come in on time or early overall. At most, only 15% of the tasks are late. It’s the asymmetry of the impact of late tasks that makes the overwhelming majority of projects late. A task that takes one-sixteenth of the time you estimated saves you less than one Regular Task, but a Black Cygnet costs you an extra fifteen Regular Tasks. The combination of the mathematics and the unexpected is relentlessly against you. In order to get around that, you’re going to have to manage something. What are the possible strategies? Let’s talk about that tomorrow.

Done, The Relative Rule, and The Unsettling Rule

Thursday, September 9th, 2010

The Agile community (to the degree that such a thing exists at all; it’s a little like talking about “the software industry”) appears to me to be really confused about what “done” means.

Whatever “done” means, it’s subject to the Relative Rule. I coined the Relative Rule, inspired by Jerry Weinberg’s definition of quality (“quality is value to some person(s)”). The Relative Rule goes like this:

For any abstract X, X is X to some person, at some time.

For example, the idea of a “bug” is subject to the Relative Rule. A bug is not a thing that exists in the world; it doesn’t have a tangible form. A bug is a relationship between the product and some person. A bug is a threat to the value of the product to some person. The notion of a bug might be shared among many people, or it might be exclusive to some person.

Similarly: “done” is “done to some person(s), at some time,” and implicitly, “for some purpose”. To me, a tester’s job is to help people who matter (most importantly, the programmers and the product owner) make an informed decision about what constitutes “done” (and as I’ve said before, testers aren’t there to make that determination themselves). So testers, far from worrying about “done”, can begin to relax right away.

Let’s look at this in terms of a story.

A programmer takes on an assignment to code a particular function. She goes into cycles of test-driven development, writing a unit check, writing code to make the check pass, running a suite of prior unit checks and making sure that they all run green, and repeating the loop, adding more and more checks for that function as she goes. Meanwhile, the testers have, in consultation with the product owner, set up a suite of examples that demonstrate basic functionality, and they automate those examples as checks. The programmer decides that she’s done writing a particular function. She feels confident. She runs her code against the examples. Two examples don’t work properly. Ooops, not done. Now she doesn’t feel so confident. She writes fixes. Now the examples all work, so now she’s done. That’s better.

A tester performs some exploratory tests that exercise that function, to see if it fulfills its explicit requirements. It does. Hey, the tester thinks, based on what I’ve seen so far, maybe we’re done programming… but we’re not done testing. Since no one (not testers, not programmers, not even requirements document writers; imagine!) is perfect, the tester performs other tests that explore the space of implicit requirements. The tester raises questions about the way the function might or might not work. The tester expands the possibilities of conditions and risks that might be relevant. Some of his questions raise new test ideas, and some of those tests raise new questions, and some of those questions reveal that certain implicit requirements haven’t been met. (The tester is done testing, for now, but no one is now sure that programming is done.) The programmer agrees that the tester has raised some significant issues. She’s mildly irritated that she didn’t think of some of these things on her own, and she’s annoyed that others are not explicit in the specs that were given to her. Still, she works on both sets of problems until they’re addressed too. (Done.) For two of the issues the tester has raised, the programmer disagrees that they’re really necessary (that is, things are done, according to the programmer). The tester tries to make sure that this isn’t personal, but remains concerned about the risks (things are not done, according to the tester). After a conversation, the programmer persuades the tester that these two issues aren’t problems (oh, done after all), and they both feel better. Just to be sure, though, the tester brings up the issues with the product owner. The product owner has some information about business risk that neither the tester nor the programmer had, and declares emphatically that the problem should be fixed (not done). The programmer is reasonably exasperated, because this seems like more work.
Upon implementing one fix, the programmer has an epiphany; everything can be handled by a refactoring that simultaneously makes the code easier to understand AND addresses both problems AND takes much less time. She feels justifiably proud of herself. She writes a few more unit checks, refactors, and all the unit checks pass. (Done!) One of the checks of the automated examples doesn’t pass. (Damn; not done.) That’s frustrating. Another fix; the unit checks pass, the examples pass, the tester does more exploration and finds nothing more to be concerned about. Done! Both the programmer and the tester are happy, and the product owner is relieved and impressed.

Upon conversation with other programmers on the project team, our programmer realizes that there are interactions between her function and other functions that mean she’s not done after all. That’s a little deflating. Back to the drawing board for a new build, followed by more testing. The tester feels a little pressured, because there’s lots of other work to do. Still, after a little investigation, things look good, so, okay, now done.

It’s getting to the end of the iteration. The programmers all declare themselves done. All of the unit checks are running green, and all of the ATDD checks are running green too. The whole team is ready to declare itself done. Well, done coding the new features, but there’s still a little uncertainty because there’s still a day left in which to test, and the testers are professionally uncertain. On the morning of the last day of the iteration, the programmers get into scoping out the horizon for the next iteration, while testers perform some final exploratory tests. They apply oracles that show the product isn’t consistent with a particular point in a Request-For-Comment that, alas, no one has noticed before. Aaargh! Not done. Now the team is nervous; people are starting to think that they might not be done what they committed to do. The programmers put in a quick fix and run some more checks (done). The testers raise more questions, perform more investigations, consider more possibilities, and find that more and more stopping heuristics apply (you’ll find a list of those here: http://www.developsense.com/blog/2009/09/when-do-we-stop-test/). It’s 3:00pm. Okay, finally: done. Now everyone feels good. They set up the demo for the iteration.

The Customer (that is, the product owner) says “This is great. You’re done everything that I asked for in this iteration.” (Done! Yay!) “…except, we just heard from The Bank, and they’ve changed their specifications on how they handle this kind of transaction. So we’re done this iteration (that is, done now, for some purpose), but we’ve got a new high-priority backlog item for next Monday, which—and I’m sorry about this—means rolling back a lot of the work we’ve done on this feature (not done for some other purpose). And, programmers, the stuff you were anticipating for next week is going to be back-burnered for now.” Well, that’s a little deflating. But it’s only really deflating for the people who believe in the illusion that there’s a clear finish line for any kind of development work—a finish line that is algorithmic, instead of heuristic.

After many cycles like the above, eventually the programmers and the testers and the Customer all agree that the product is indeed ready for deployment. That agreement is nice, but in one sense, what the programmers and the testers think doesn’t matter. Shipping is a business decision, and not a technical one; it’s the product owner that makes the final decision. In another sense, though, the programmers and testers absolutely matter, in that a responsible and effective product owner must seriously consider all of the information available to him, weighing the business imperatives against technical concerns. Anyway, in this case, everything is lined up. The team is done! Everyone feels happy and proud.

The product gets deployed onto the bank’s system on a platform that doesn’t quite match the test environment, at volumes that exceed the test volumes. So the bank’s project manager isn’t happy (not done). The testers diligently test and find a way to reproduce the problem (they’re done, for now). The programmers don’t make any changes to the code, but find a way to change a configuration setting that works around the problem (so now they’re done). The testers show that the fix works in the test environments and at heavier loads (done). Upon evaluation of the original contract, recognition of the workaround, and after its own internal testing, the bank accepts the situation for now (done) but warns that it’s going to contest whether the contract has been fulfilled (not done). Some people are tense; others realize that business is business, and they don’t take it personally. After much negotiation, the managers from the bank and the development shop agree that the terms of the contract have been fulfilled (done), but that they’d really prefer a more elegant fix for which the bank will agree to pay (not done). And then the whole cycle continues. For years.

So, two things:

1) Definitions and decisions about “done” are always relative to some person, some purpose, and some time. Decisions about “done” are always laden with context. Not only technical considerations matter; business considerations matter too. Moreover, the process of deciding about doneness is not merely logical, but also highly social. Done is based not on What’s Right, but on Who Decides and For What Purpose and For Now. And as Jerry Weinberg points out, decisions about quality are political and emotional, but made by people who would like to appear rational. However, if you want to be politically, emotionally, and rationally comfortable, you might want to take a deep breath and learn to accept, with all of your intelligence, heart, and good will, not only the first point, but also the second…

2) “Done” is subject to another observation that Jerry often makes, and that I’ve named The Unsettling Rule:

Nothing is ever settled.

Disposable Time

Sunday, January 17th, 2010

In our Rapid Testing class, James Bach and I like to talk about an underappreciated tester resource: disposable time. Disposable time is the time that you can afford to waste without getting into trouble.

Now, we want to be careful about what we mean by “waste”, here. It’s not that you want to waste the time. You probably want to spend it wisely. It’s just that you won’t suffer harm if you do happen to waste it. Disposable time is to your working hours what disposable income is to your total personal income. (In fact, even that’s not quite correct, strictly speaking; we actually mean discretionary income: the money that’s left over after you’ve paid for all of the things that you must pay for—food, shelter, basic clothing, medical, and tax expenses. The money that people call disposable income is more properly called discretionary income; as Wikipedia says, “the amount of ‘play money’ left to spend or save.” Oh well. We’ll go with the incorrect but popular interpretation of “disposable” here.)

You’re never being scrutinized every minute of every day. Practically everyone has a few moments when no one important is watching. In that time, you might

  • try a tiny test that hasn’t been prescribed.
  • try putting in a risky value instead of a safe value.
  • pretend to change your mind, or to make a mistake, and go back a step or two; users make mistakes, and error handling and recovery are often the most vulnerable parts of the program.
  • take a couple of moments to glance at some background information relevant to the work that you’re doing.
  • write in your journal.
  • see if any of your colleagues in technical support have a hot issue that can inform some test ideas.
  • steal a couple of moments to write a tiny, simple program that will save you some time; use the saved time and the learning to extend your programming skills so that you can solve increasingly complex programming problems.
  • spend an extra couple of minutes at the end of a coffee break befriending the network support people.
  • sketch a workflow diagram for your product, and at some point show it to an expert, and ask if you’ve got it right.
  • snoop around in the support logs for the product.
  • add a few more lines to a spreadsheet of data values.
  • help someone else solve a problem that they’re having.
  • chat with a programmer about some aspect of the technology.
  • even if you do nothing else, at least pause and look around the screen as you’re testing. Take a moment or two to recognize a new risk and write down a new question or a new test idea. Report on that idea later on; ask your test lead, your manager, or a programmer, or a product owner if it’s a risk worth investigating. Hang on to your notes. When someone asks “Why didn’t you find that bug,” you may have an answer for them.

If it turns out that you’ve made a bad investment, oh well. By definition, however large or small the period, disposable time is time that you can afford to blow without suffering consequences.

On the other hand, you may have made a good investment. You may have found a bug, or recognized a new risk, or learned something important, or helped someone out of a jam, or built on a professional relationship, or surprised and impressed your manager. You may have done all of these things at once. Even if you feel like you’ve wasted your time, you’ve probably learned enough to insulate yourself from wasting more time in the same way. When you discover that an alley is blind, you’re unlikely to return there when there are other things to explore.

In The Black Swan, Nassim Nicholas Taleb proposes an investment strategy wherein you put the vast bulk of your money, your nest egg, in very safe securities. You then invest a small amount (an amount that you can afford to lose) in very speculative bets that have a chance of providing a spectacular return. He calls that very improbable high-return event a positive Black Swan. Your nest egg is like the part of your job that you must accomplish. Disposable time is like your Black Swan fund; you may lose it all, but you have a shot at a big payoff. But there’s an important difference, too: since learning is an almost inevitable product of using your disposable time, there’s almost always some modest positive outcome.

We encourage test managers to allow disposable time explicitly for their testers. As an example, Google provides its staff with Innovation Time Off. Engineers are encouraged to spend 20% of their time pursuing projects that interest them. That sounds like a waste, until one learns that Google projects like Gmail, Google News, Orkut, and AdSense came of these investments.

What Google may not know is that even within the other 80% of the time that’s ostensibly on mission, people still have, and are still using, non-explicit disposable time. People have that almost everywhere, whether they have explicit disposable time or not.

If you’re working in an environment where you’re being watched so closely that none of this is possible, and where you’re punished for learning or seeking problems, my advice is to make sure that slavery has been abolished in your jurisdiction. Then find a job where your testing skills are valued and your managers aren’t wasting their time by watching your work instead of doing theirs. But when you’ve got a few moments to fill, fill them and learn something!

Defect Detection Efficiency: An Evaluation of a Research Study

Friday, January 8th, 2010

Over the last several months, B.J. Rollison has been delivering presentations and writing articles and blog posts in which he cites a paper Defect Detection Efficiency: Test Case Based vs. Exploratory Testing [DDE2007], by Juha Itkonen, Mika V. Mäntylä and Casper Lassenius (First International Symposium on Empirical Software Engineering and Measurement, pp. 61-70; the paper can be found here).

I appreciate the authors’ intentions in examining the efficiency of exploratory testing.  That said, the study and the paper that describes it have some pretty serious problems.

Some Background on Exploratory Testing

It is common for people writing about exploratory testing to consider it a technique, rather than an approach. “Exploratory” and “scripted” are opposite poles on a continuum. At one pole, exploratory testing integrates test design, test execution, result interpretation, and learning into a single person at the same time.  At the other, scripted testing separates test design and test execution by time, and typically (although not always) by tester, and mediates information about the designer’s intentions by way of a document or a program. As James Bach has recently pointed out, the exploratory and scripted poles are like “hot” and “cold”.  Just as there can be warmer or cooler water, there are intermediate gradations to testing approaches. The extent to which an approach is exploratory is the extent to which the tester, rather than the script, is in immediate control of the activity.  A strongly scripted approach is one in which ideas from someone else, or ideas from some point in the past, govern the tester’s actions. Test execution can be very scripted, as when the tester is given an explicit set of steps to follow and observations to make; somewhat scripted, as when the tester is given explicit instruction but is welcome or encouraged to deviate from it; or very exploratory, in which the tester is given a mission or charter, and is mandated to use whatever information and ideas are available, even those that have been discovered in the present moment.

Yet the approaches can be blended.  James points out that the distinguishing attribute in exploratory and scripted approaches is the presence or absence of loops.  The most extreme scripted testing would follow a strictly linear approach; design would be done at the beginning of the project; design would be followed by execution; tests would be performed in a prescribed order; later cycles of testing would use exactly the same tests for regression.

Let’s get more realistic, though.  Consider a tester with a list of tests to perform, each using a data-focused automated script to address a particular test idea.  A tester using a highly scripted approach would run that script, observe and record the result, and move on to the next test.  A tester using a more exploratory approach would use the list as a point of departure, but upon observing an interesting result might choose to perform a different test from the next one on the list; to alter the data and re-run the test; to modify the automated script; or to abandon that list of tests in favour of another one.  That is, the tester’s actions in the moment would not be directed by earlier ideas, but would be informed by them. Scripted approaches set out the ideas in advance, and when new information arrives, there’s a longer loop between discovery and the incorporation of that new information into the testing cycle.  The more exploratory the approach, the shorter the loop.  Exploratory approaches do not preclude the use of prepared test ideas, although both James and I would argue that our craft, in general, places excessive emphasis on test cases and focusing techniques at the expense of more general heuristics and defocusing techniques.

The point of all this is that neither exploratory testing nor scripted approaches are testing techniques, nor bodies of testing techniques.  They’re approaches that can be applied to any testing technique.

To be fair to the authors of [DDE2007], since publication of their paper there has been ongoing progress in the way that many people—in particular Cem Kaner, James Bach, and I—articulate these ideas, but the fundamental notions haven’t changed significantly.

Literature Review

While the authors do cite several papers on testing and test design techniques, they do not cite some of the more important and relevant publications on the exploratory side.  Examples of such literature include “Measuring the Effectiveness of Software Testers” (Kaner, 2003; slightly updated in 2006); “Software engineering metrics: What do they measure and how do we know?” (Kaner & Bond, 2004); “Inefficiency and Ineffectiveness of Software Testing: A Key Problem in Software Engineering” (Kaner, 2006; to be fair to the authors, this paper may have been published too late to inform [DDE2007]); General Functionality and Stability Test Procedure (for Microsoft Windows 2000 Application Certification) (Bach, 2000); Satisfice Heuristic Test Strategy Model (Bach, 2000); and How To Break Software (Whittaker, 2002).

The authors of [DDE2007] appear also to have omitted literature on the subject of exploration and its role in learning. Yet there is significant material on the subject, in both popular and more academic literature.  Examples here include Collaborative Discovery in a Scientific Domain (Okada and Simon; note that the subjects are testing software); Exploring Science: The Cognition and Development of Discovery Processes (David Klahr and Herbert Simon); Plans and Situated Actions (Lucy Suchman); Play as Exploratory Learning (Mary Reilly); How to Solve It (George Polya); Simple Heuristics That Make Us Smart (Gerd Gigerenzer); Sensemaking in Organizations (Karl Weick); Cognition in the Wild (Edward Hutchins); The Social Life of Information (Paul Duguid and John Seely Brown); Sciences of the Artificial (Herbert Simon); all the way back to A System of Logic, Ratiocinative and Inductive (John Stuart Mill, 1843).

These omissions are reflected in the study and the analysis of the experiment, and that leads to a common problem in such studies: heuristics and other important cognitive structures in exploration are treated as mysterious and unknowable.  For example, the authors say, “For the exploratory testing sessions we cannot determine if the subjects used the same testing principles that they used for designing the documented test cases or if they explored the functionality in pure ad-hoc manner. For this reason it is safer to assume the ad-hoc manner to hold true.”  [DDE2007, p. 69]  Why assume?  At the very least, one could observe the subjects and debrief them, asking about their approaches.  In fact, this is exactly the role that the test lead fulfills in the practice of skilled exploratory testing.  And why describe the principles only as “ad-hoc”?  It’s not like the principles can’t be articulated. I talk about oracle heuristics in this article, and talk about stopping heuristics here; Kaner’s Black Box Software Testing course talks about test design heuristics; James Bach’s work talks about test strategy heuristics (especially here); James Whittaker’s books talk about heuristics for finding vulnerabilities…

Tester Experience

The study was performed using testers who were, in the main, novices.  “27 subjects had no previous experience in software engineering and 63 had no previous experience in testing. 8 subjects had one year and 4 subjects had two years testing experience. Only four subjects reported having some sort of training in software testing prior to taking the course.”  ([DDE2007], p. 65 my emphasis)  Testing—especially testing using an exploratory approach—is a complex cognitive activity.  If one were to perform a study on novice jugglers, one would likely find that they drop an approximately equal number of objects, whether they were juggling balls or knives.

Tester Training

The paper notes that “subjects were trained to use the test case design techniques before the experiment.” However, the paper does not make note of any specific training in heuristics or exploratory approaches.  That might not be surprising in light of the weaknesses on the exploratory side of the literature review.  My experience, that of James Bach, and anecdotal reports from our clients suggest that even a brief training session can greatly increase the effectiveness of an exploratory approach.

Cycles of Testing

Testing happens in cycles.  In strongly scripted testing, the process tends toward the linear.  All tests are designed up front; then those tests are executed; then testing for that area is deemed to be done.  In subsequent cycles, the intention is to repeat the original tests to make sure that bugs are fixed and to check for regression.  By contrast, exploratory testing is an organic and iterative process.  In an exploratory approach, the same area might be visited several times, such that learning from early “reconnaissance” sessions informs further exploration in subsequent “deep coverage” sessions.  The learning from those (and from ideas about bugs that have been found and fixed) informs “wrap-up sessions”, in which tests may be repeated, varied, or cut from new cloth.  In the study, no allowance is made for information and learning obtained during one round of testing to inform later rounds.  Yet such information and learning are typically of great value.

Quantitative vs. Qualitative Analysis

In the study, there is a great deal of emphasis placed on quantifying results, on experimental and on mathematical rigour.  However, such rigour may be misplaced when the products of testing are qualitative, rather than quantitative.

Finding bugs is important, finding many bugs is important, and finding important bugs is especially important. Yet bugs and bug reports are by no means the only products of testing.  The study largely ignores the other forms of information that testing may provide.

  • The tester might learn something about test design, and feed that learning into her approach toward test execution, or vice versa. The value of that learning might be realized immediately (as in an exploratory approach) or over time (as in a scripted approach).
  • The tester, upon executing a test, might recognize a new risk or missing coverage. That recognition might inform ideas about the design and choices of subsequent tests.  In a scripted approach, that’s a relatively long loop.  In an exploratory approach, upon noticing a new risk, the tester might choose to note findings for later on.  On the other hand, the discovery could be cashed immediately:  she  might choose to repeat the test, she might perform a variation on the same test, or might alter her strategy to follow a different line of investigation.  Compared to a scripted approach, the feedback loop between discovery and subsequent action is far shorter.  The study ignores the length of the feedback loops.
  • In addition to discovering bugs that threaten the value of the product, the tester might discover issues—problems that threaten the value of the testing effort or the development project overall.
  • The tester who takes an exploratory approach may choose to investigate a bug or an issue that she has found.  This may reduce the total bug count, but in some contexts may be very important to the tester’s client.  In such cases, the quality of the investigation, rather than the number of bugs found, would be important.

More work products from testing can be found here.

“Efficiency” vs. “Effectiveness”

The study takes a very parsimonious view of “efficiency”, and further confuses “efficiency” with “effectiveness”.  Two tests are equally effective if they produce the same effects. The discovery of a bug is certainly an important effect of a test.  Yet there are other important effects too, as noted above, but they’re not considered in the study.

However, even if we decide that bug-finding is the only worthwhile effect of a test, two equally effective tests might not be equally efficient.  I would argue that efficiency is a relationship between effectiveness and cost.  An activity is more efficient if it has the same effectiveness at lower cost in terms of time, money, or resources.  This leads to what is by far the most serious problem in the paper…

Script Preparation Time Is Ignored

The authors’ evaluation of “efficiency” leaves out the preparation time for the scripted tests! The paper says that the exploratory testing sessions took 90 minutes for design, preparation, and execution. The preparation for the scripted tests took seven hours, and the scripted test execution sessions took 90 minutes, for a total of 8.5 hours.  This fact is not highlighted; indeed, it is not mentioned until the eighth of ten pages (page 68).  In journalism, that would be called burying the lead.  In terms of bug-finding alone, the authors suggest that the results were of equivalent effectiveness, yet the scripted approach took, in total, roughly 5.7 times as long as the exploratory approach. What other problems could the exploratory testing approaches find given seven additional hours?
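The time comparison is simple enough to check by hand, but here it is spelled out as a short Python sketch. It uses only the durations quoted above from the paper; the constant and variable names are mine.

```python
# Durations quoted from the paper, in minutes.
EXPLORATORY_SESSION_MINUTES = 90        # design, preparation, and execution together
SCRIPTED_PREPARATION_MINUTES = 7 * 60   # average test case design effort
SCRIPTED_EXECUTION_MINUTES = 90         # scripted test execution session

scripted_total_minutes = SCRIPTED_PREPARATION_MINUTES + SCRIPTED_EXECUTION_MINUTES
time_ratio = scripted_total_minutes / EXPLORATORY_SESSION_MINUTES

print(f"scripted total: {scripted_total_minutes / 60:.1f} hours")
print(f"scripted time / exploratory time: {time_ratio:.2f}")
```

The scripted total comes to 510 minutes, or 8.5 hours, and the ratio works out to about 5.67. If the two approaches found the same number of bugs, then on bug count alone the scripted approach cost between five and six times as much time for the same effect.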

Conclusions

The authors offer these four conclusions at the end of the paper:

“First, we identify a lack of research on manual test execution from other than the test case design point of view. It is obvious that focusing only on test case design techniques does not cover many important aspects that affect manual testing. Second, our data showed no benefit in terms of defect detection efficiency of using predesigned test cases in comparison to an exploratory testing approach. Third, there appears to be no big differences in the detected defect types, severities, and in detection difficulty. Fourth, our data indicates that test case based testing produces more false defect reports.”

I would add a few other conclusions.  The first is from the authors themselves, but is buried on page 68:  “Based on the results of this study, we can conclude that an exploratory approach could be efficient, especially considering the average 7 hours of effort the subjects used for test case design activities.”  Or, put another way,

  • During test execution
  • unskilled testers found the same number of problems, irrespective of the approach that they took, but
  • preparation of scripted tests increased total testing time by a factor of more than five
  • and appeared to add no significant value.

Now:  as much as I would like to cite this study as a significant win for exploratory testing, I can’t.  There are too many problems with it.  There’s not much value in comparing two approaches when those approaches are taken by unskilled and untrained people.  The study is heavy on data but light on information. There are no details about the bugs that were found and missed using each approach.  There’s no description of the testers’ activities or thought processes; just the output numbers.  There is the potential for interesting, rich stories on which bugs were found and which bugs were missed by which approaches, but such stories are absent from the paper.  Testing is a qualitative evaluation of a product; this study is a quantitative evaluation of testing.  Valuable information is lost thereby.

The authors say, “We could not analyze how good test case designers our subjects were and how much the quality of the test cases affected the results and how much the actual test execution aproach.”  Actually, they could have analyzed that.  It’s just that they didn’t.  Pity.