Archive for the ‘Management’ Category

Project Estimation and Black Swans (Part 5): Test Estimation

Sunday, October 31st, 2010

In this series of blog posts, I’ve been talking about project estimation. But I’m a tester, and if you’re reading this blog, presumably you’re a tester too, or at least you’re interested in testing. So, all this has might have been interesting for project estimation in general, but what are the implications for test project estimation?

Let’s start with the tester’s approach: question the question.

Is there ever such a thing as a test project? Specifically, is there such a thing as a test project that happens outside of a development project?

“Test projects” are never completely independent of some other project. There’s always a client, and typically there are other stakeholders too. There’s always an information mission, whether general or specific. There’s always some development work that has been done, such that someone is seeking information about it. There’s always a tester, or some number of testers (let’s assume plural, even if it’s only one). There’s always some kind of time box, whether it’s the end of an agile iteration, a project milestone, a pre-set ship date, or a vague notion of when the project will end. Within that time box, there is at least one cycle of testing, and typically several of them. And there are risks that testing tries to address by seeking and providing information. From time to time, whether continuously or at the end of a cycle, testers report to the client on what they have discovered.

The project might be a product review for a periodical. The project might be a lawsuit, in which a legal team tries to show that a product doesn’t meet contracted requirements. The project might be an academic or industrial research program in which software plays a key role. More commonly, the project is some kind of software development, whether mass-market commercial software, an online service, or IT support inside a company. The project may entail customization of an existing product, or it may involve lots of new code development. But no matter what, testing isn’t the project in and of itself; testing is a part of a project, a part that informs the project. Testing doesn’t happen in isolation; it’s part of a system. Testing observes outputs and outcomes of the system of which it is a part, and feeds that information back into the system. And testing is only one of several feedback mechanisms available to the system.

Although testing may be arranged in cycles, it would be odd to think of testing as an activity that can be separated from the rest of its project, just as it would be odd to think of seeing as a separate phase of your day. People may say a lot of strange things, but you’ll rarely hear them say “I just need to get this work done, and then I’ll start seeing”; and you almost never get asked “When are you going to be done seeing?” Now, there might be part of your day when you need to pay a lot of attention to your eyes—when you’re driving a car, or cutting vegetables, or watching your child walk across a cluttered room. But, even when you’re focused (sorry) on seeing, the seeing part happens in the context of—and in the service of—some other activity.

Does it make sense to think in terms of a “testing phase”?

Many organizations (in particular, the non-agile ones) divide a project into two discrete parts: a “development phase” and a “testing phase”. My colleague James Bach notes an interesting fallacy there.

What happens during the “development phase”? The programmers are programming. Programming may include a host of activities such as research, design, experimentation, prototyping, coding, unit testing (and in TDD, a unit check is created just before the code to be be checked), integration testing, debugging, or refactoring. And what are the testers doing during the “development phase”? The testers are testing. More specifically, they may be engaged in review, planning, test design, toolsmithing, data generation, environment setup, or the running of relatively low-level integration tests, or even very high-level system tests. All of those activities can be wrapped up under the rubric of “testing”.

What happens during the “testing phase”? The programmers are still programming, and the testers are still testing. The primary thing that distinguishes the two phases, though, is the focus of the programming work: the programmers have generally stopped adding new features, but are instead fixing the problems that have been found so far. In the first phase, programmers focused on developing new features; in the second, programmers are focused on fixing. By that reckoning, James reckons, the “testing phase” should be called the fixing phase. It seems to me that if we took James’ suggestion seriously, it might change the nature of some of the questions are often asked in a development project. Replace the word “test” with the word “fix”: “How long are you going to need to fix this product?” “When is fixing going to be done?” “Can’t we just automate the fixing?” “Shouldn’t fixing get involved early in the project?” “Why was that feature broken when the customer got it? Didn’t you fix it?” And when we ask those questions, should we be asking the testers?

As James also points out, no one ever held up the release or deployment of a product because there was more testing to be done. Products are delayed because of a present concern that there might be more development work to be done. Testing can stop as soon as product owners believe that they have sufficient information to accept the risk of shipping. If that’s so, the question for the testers “When are you going to be done testing?” translates to in a question for the product owner: “When am I going to believe that I have sufficient technical information to inform a risk-based business decision?” At that point, the product owner should—appropriately—be skeptical about anyone else’s determination that they are “done” testing.

Now, for a program manager, the “when do I have sufficient information” question might sound hard to answer. It is hard to answer. When I was a program manager for a commercial software company, it was impossible for me to answer before the information had been marshalled. Look at the variables involved in answering the question well: technical information, technical risk, test coverage, the quality of our models, the quality of our oracles, business information, business risk, the notion of sufficiency, decisiveness… Most of those variables must be accumulated and weighed and decided in the head of a single person—and that person isn’t the tester. That person is the product owner. The evaluation of those variables and the decision to ship are all in play from one moment to the next. The final state of the contributing variables and the final decision on when to ship are in the future. Asking the tester “When are you going to be done testing?” is like asking the eyes, “When are you going to be done seeing?” Eyes will continue to scan the surroundings, providing information in parallel with the other senses, until the brain decides upon a course of action. In a similar way, testers continue to test, generating information in parallel with the other members of the project community, until the product owner decides to ship the product. Neither the tester alone nor the eyes alone can answer the “when are you going to be done” question usefully; they’re not in charge. Until it makes a decision, the brain (optionally) takes in more data which the eyes and the other sense organs, by default, continue to supply. Those of us who have ogled the dessert table, or who have gone out on disastrous dates, know the consequences of letting our eyes make decisions for us. Moreover, if there is a problem, it’s not likely the eyes that will make the problem go away.

Some people believe that they can estimate when testing will be done by breaking down testing into measurable units, like test cases or test steps. To me, that’s like proposing “vision cases” or “vision steps”, which leads to our next question:

Can we estimate the duration of a “testing project” by counting “test cases” or “test steps”?

Recently I attended a conference presentation in which the speaker presented a method for estimating when testing would be completed. Essentially, it was a formula: break testing down into test cases, break test cases down into test steps, observe and time some test steps, average them out (or something) to find out how long a test step takes, and then multiply that time by the number of test steps. Voila! an estimate.

Only one small problem: there was no validity to the basis of the calculation. What is a test step? Is it a physical action? The speaker seem to suggest that you can tell a tester has moved on to the next step when he performs another input action. Yet surely all input actions are not created equal. What counts as an input action? A mouse click? A mouse movement? The entry of some data into a field? Into a number of fields, followed by the press of an Enter key? Does the test step include an observation? Several observations? Evaluation? What happens when a human notices something odd and starts thinking? What happens when, in the middle of test execution, a tester recognizes a risk and decides to search for a related problem? What happens to the unit of measurement when a tester finds a problem, and begins to investigate and report it?

The speaker seemed to acknowledge the problem when she said that a step might take five seconds, or half a day. A margin of error of about 3000 to one per test step—the unit on which the estimate is based—would seem to jeopardize the validity of the estimate. Yet the margin of error, profound as it is, is orthogonal to a much bigger problem with this approach to estimation.

Excellent testing is not the monotonic or repetitive execution of scripted ideas. (That’s something that my community calls checking.) Instead, testing is an investigation of code, computers, people, value, risks, and the relationships between them. Investigation requires loops of exploration, experimentation, discovery, research, result interpretation, and learning. Variation and adaptation are essential to the process. Execution of a test often involves reflecting on what has just happened, backtracking over a set of steps, and then repeating or varying the steps while posing different questions or making observations. An investigation cannot follow a prescribed set of steps. Indeed, an investigation that follows a predetermined set of steps is not an investigation at all.

In an investigation, any question you ask may—starting with the first—may yield an answer that completely derails your preconceptions. In an investigation, assumptions need to be surfaced, attacked, and refined. In an investigation, the answer to the most recent question may be far more relevant to the mission than anything that has gone before. If we want to investigate well, we cannot assume that the most critical risk has already been identified. If we want to investigate well, we can’t do it by rote. (If there are rote questions, let’s put them into low-level automated checks. And let’s do it skillfully.)

If we can’t estimate by counting test cases, how can we estimate how much time we’ll need for testing?

There are plenty of activities that don’t yield to piecework models because they are inseparable from the project in which they happen. In another of James Bach’s analogies, no one estimates the looking-out-the-window phase of driving an automobile journey. You can estimate the length of the journey, but looking out the window happens continuously, until the travellers have reached the destination. Indeed, looking out the window informs the driver’s evaluation of whether journey is on track, and whether the destination has been reached. No one estimates the customer service phase of a hotel stay. You can estimate the length of the stay, but customer service (when it’s good) is available continuously until the visitor has left the hotel. For management purposes, customer service people (the front desk, the room cleaners) inform the observation that the visitor has left. No one estimates the “management phase” of a software development project. You can estimate how long development will take, but management (when it’s good) happens continuously until the product owner has decided to release the product. Observations and actions from managers (the development manager, the support manager, the documentation manager, and yes, the test manager) inform the product owner’s decision as to whether the product is ready to ship.

So it goes for testing. Test estimation becomes a problem only if one makes the mistake of treating testing as a separate activity or phase, rather than as an open-ended, ongoing investigation that continues throughout the project.

My manager says that I have to provide an estimate, so what do I do?

At the beginning of the project, we know very little relative to what we’ll know later. We can’t know everything we’ll need to know. We can’t know at the beginning of the project whether the product will meet its schedule without being visited by a Black Swan or a flock of Black Cygnets. So instead of thinking in terms of test estimation, try thinking in terms of strategy, logistics, negotiation, and refinement.

Our strategy is the set of ideas that guide our test design. Those ideas are informed by the project environment, or context; by the quality criteria that might be valued by users and other stakeholders; by the test coverage that we might wish to obtain; and by the test techniques that we might choose to apply. (See the Heuristic Test Strategy Model that we use in Rapid Testing as an example of a framework for developing a strategy.) Logistics is the set of ideas that guide our application of people, equipment, tools, and other resources to fulfill our strategy. Put strategy and logistics together and we’ve got a plan.

Since we’re working with—and, more importantly, for—a client, the client’s mission, schedule, and budget are central to choices on the elements of our strategy and logistics. Some of those choices may follow history or the current state of affairs. For example, many projects happen in shops that already have a roster of programmers and testers; many projects are extensions of an existing product or service. Sometimes project strategy ideas based on projections or guesswork or hopes; for example, the product owner already has some idea of when she wants to ship the product. So we use whatever information is available to create a preliminary test plan. Our client may like our plan—and she may not. Either way, in an effective relationship, neither party can dictate the terms of service. Instead, we negotiate. Many of our preconceptions (and the client’s) will be invalid and will change as the project evolves. But that’s okay; the project environment, excellent testing, and a continuous flow of reporting and interaction will immediately start helping to reveal unwarranted assumptions and new risk ideas. If we treat testing as something happens continuously with development, and if we view development in cycles that provide a kind of pulse for the project, we have opportunities to review and refine our plans.

So: instead of thinking about estimation of the “testing phase”, think about negotiation and refinement of your test strategy within the context of the overall project. That’s what happens anyway, isn’t it?

But my management loves estimates! Isn’t there something we can estimate?

Although it doesn’t make sense to estimate testing effort outside the context of the overall project, we can charter and estimate testing effort within a development cycle. The basic idea comes from Session Based Test Management, James and Jon Bach’s approach to plan, estimate, manage, and measure exploratory testing in circumstances that require high levels of accountability. The key factors are:

  • time-boxed sessions of uninterrupted testing, ranging from 45 minutes to two hours and fifteen minutes, with the goal of making a normal session 90 minutes or so;

  • test coverage areas—typically functions or features of the product to which we would like to dedicate some testing time;
  • activities such as research, review, test design, data generation, toolsmithing, research, or retesting, to which we might also like to dedicate testing time;
  • charters, in the form of a one- to three-sentence mission statement that guides the session to focus on specific coverage areas and/or activities;

  • debriefings, in which a tester and a test lead or manager discuss the outcome of a session;

  • reviewable results, in the form of a session sheet that provides structure for the debrief, and that can be scanned and parsed by a Perl script; and, optionally,

  • a screen-capture recording of the session when detailed retrospective investigation or analysis might be needed;

  • metrics whose purposes are to determine how much time is spent on test design and execution (activities that yield test coverage) vs. bug investigation and reporting, and setup (activities that interrupt the generation of test coverage).

The timebox provides a structure intended to make estimation and accounting for time fairly imprecise, but reasonably accurate. (What’s the difference? As I write, the time and date is 9:43:02.1872 in the morning, January 23, 1953. That’s a very precise reckoning of the time and date, but it’s completely inaccurate.)

Let’s also assume that a development cycle is two weeks, or ten working days—the length of a typical agile iteration. Let’s assume that we have four testers on the team, and that each tester can accomplish three sessions of work per day (meetings, e-mail, breaks, conversations, and other non-session activities take up the rest of the time).

ten days * four testers * three sessions = 120 sessions

Let’s assume further that sessions cannot be completely effective, in that test design and execution will be interrupted by setup and bug investigation. Suppose that we reckon 10% of the time spent on setup, and 25% of the time spent on investigating and reporting bugs. That’s 35% in total; for convenience, let’s call it 1/3 of the time.

120 sessions – 120 * 1/3 interruption time = 80 sessions

Thus in our two-week iteration we estimate that we have time for 80 focused, targeted effective idealized sessions of test coverage, embedded in 120 actual sessions of testing. Again, this is not a precise figure; it couldn’t possibly be. If our designers and programmers have done very well in a particular area, we won’t find lots of bugs and our effective coverage per session will go up. If setup is in some way lacking, we may find that interruptions account for more than one-third of the time, which means that our effective coverage will be reduced, or that we have to allocate more sessions to obtain the same coverage. So as soon as we start obtaining information about what actually went on in the sessions, we feed that information back into the estimation. I wrote extensively about that here.

On its own, the metrics on interruptions could be fascinating and actionable information for managers. But note that the metrics on their own are not conclusive. They can’t be. Instead, they inform questions. Why has there been more bug investigation than we expected? Are there more problems than we anticipated, or are testers spending too much time investigating before consulting with the programmers? Is setup taking longer than it should, such that customers will have setup problems too? Even if the setup problems will be experienced only in testing, are there ways to make setup more rapid so that we can spend more time on test coverage? The real value of any metrics is in the questions they raise, rather than in the answers they give.

There’s an alternative approach, for those who want to estimate the duration or staffing for a test cycle: set the desired amount of coverage, and apply the fixed variables and calculate for the free ones. Break the product down into areas, and assign some desired number of sessions to each based on risk, scope, complexity, or any combination of factors you choose. Based on prior experience or even on a guess, adjust for interruptions and effectiveness. If you know the number of testers, you can figure the amount of time required; if you want to set the amount of time, you can calculate for the number of testers required. This provides you with a quick estimate.

Which, of course, you should immediately distrust. What influence does tester experience and skill have on your estimate? On the eventual reality? If you’re thinking of adding testers, can you avoid banging into Brooks’ Law? Are your notions of risk static? Are they valid? And so forth. Estimation done well should provoke a large number of questions. Not to worry; actual testing will inform the answers to those questions.

Wait a second. We paid a lot of money for an expensive test management tool, and we sent all of our people to a one-week course on test estimation, and we now spend several weeks preparing our estimates. And since we started with all that, our estimates have come out really accurate.

If experience tells us anything, it should tell us that we should be suspicious of any person or process that claims to predict the future reliably. Such claims tend to be fulfilled via the Ludic Fallacy and the narrative bias, central pillars of the philosophy of The Black Swan. Since we already have an answer to the question “When are we going to be done?”, we have the opporutunity (and often the mandate) to turn an estimate into a self-fulfilling prophecy. Jerry Weinberg‘s Zeroth Law of Quality (“If you ignore quality, you can meet any other requirement“) is a special case of my own, more general Zeroth Law of Wish Fulfillment: “If you ignore some factors, you can achieve anything you like.” If your estimates always match reality, what assumptions and observations have you jettisoned in order to make reality fit the estimate? And if you’re spending weeks on estimation, might that time be better spent on testing?

Project Estimation and Black Swans (Part 4)

Monday, October 25th, 2010

Over the last few posts, exploratory automation has suggested some interesting things about project dynamics and estimation. What might we learn from these little mathematical experiments?

The first thing we need to do is to emphasize the fact that we’re playing with numbers here. This exercise can’t offer any real construct validity, since an arbitrary chunk of time combined with a roll of the dice doesn’t match software development in all of its complex, messy, human glory. In a way, though, that doesn’t matter too much, since the goal of this exercise isn’t to prove anything in particular, but rather to raise interesting questions and to offer suggestions or hints about where we might look next.

The mathematics appears to support an idea touted over and over by Agile enthusiasts, humanists, and systems thinkers alike: make feedback rapid and frequent. The suggestion we might take from the last model—fewer tasks and shorter projects —is that the shorter and better-managed the project, the less the Black Swan has a chance to hurt you in any given project.

Another plausible idea that comes from the math is to avoid projects where the power-distribution law applies—projects where you’re vulnerable to Wasted Mornings and Lost Days. Stay away from projects in Taleb’s Fourth Quadrant, projects that contain high-impact, high-uncertainty tasks. To the greatest degree possible, stick with things that are reasonably predictable, so that the statistics of random and unpredicted events don’t wallop us quite so often. Stay within the realm of the known, “in Mediocristan” as Taleb would say. Head for the next island, rather than trying to navigate too far over the current horizon.

In all that, there’s a caveat. It is of the essence of Black Swan (or even a Black Cygnet) that it’s unpredicted and unpredictable. Ironically, the more successful we are at reducing uncertainty, the less often we’ll encounter rare events. The rarer the event, the less we know about it—and therefore, the less we’re aware of the range of its potential consequences. The less we know about the consequences, the less likely we are to know about how to manage them—certainly the less specifically we know how to manage them. In short, the more rare the event, the less information and experience we’ll have to help us to deal with it. One implication of this is that our Black Cygnets, in addition to adding time, having a chance of screwing up other things in ways that we don’t expect.

Some people would suggest that we eliminate variability and uncertainty and unpredictability. What a nice idea! By definition, uncertainty is the state of not knowing something; by definition, something that’s unpredictable can’t be predicted. Snowstorms happen (even in Britain!). Servers go down. Power cuts happen in India on a regular basis—on my last visit to India, I experienced three during class time, and three more in the evening in a two-day stay at a business class hotel. In North America, power cuts happen too—and because we’re not used to them, we aren’t prepared to deal with them. (To us they’re Black Swans, where to people who live in India, they’re Grey Swans.) Executives announce all-hands meetings, sometimes with dire messages. Computers crash. Post-It notes get jammed in the backup tape drive. People get sick, and if they’re healthy, their kids get sick. Trains are delayed. Bicycles get flat tires. And bugs are, by their nature, unpredicted.

So: we can’t predict the unpredictable. There is a viable alternative, though: we can expect the unpredictable, anticipate it to some degree, manage it as best we can, and learn from the experience. Embracing the unpredictable reminds me of the The Fundamental Regulator Paradox, from Jerry and Dani Weinberg’s General Principles of System Design which I’ve referred to before:

The task of a regulator is to eliminate variation, but this variation is the ultimate source of information about the quality of its work. Therefore, the better job a regulator does, the less information it gets about how to improve.

This suggests to me that, at least to a certain degree, we shouldn’t make our estimates too precise, our commitments too rigid, our processes too strict, our roles too closed, and our vision of the future too clear. When we do that, we reduce the flow of information coming in from outside the system, and that means that the system doesn’t develop an important quality: adaptability.

When I attended Jerry Weinberg’s Problem Solving Leadership workshop (PSL), one of the groups didn’t do so well on one of the problem-solving exercises. During the debrief, Jerry asked, “Why did you have such a problem with that? You handled a much harder problem yesterday.”

“The complexity of the problem screwed us up,” someone answered.

Jerry peered over the top of his glasses. He replied, “Your reaction to the complexity of the problem screwed you up.”

One of the great virtues of PSL is that it exposes you to lots of problems in a highly fault-tolerant environment. You get practice at dealing with surprises and behaviours that emerge from giving a group of people a moderately complex task, under conditions of uncertainty and time pressure. You get an opportunity to reflect on what happened, and you learn what you need to learn. That’s the intention of the Rapid Software Testing class, too: to expose people to problems, puzzles, and traps; to give people practice in recognizing and evading traps where possible; and to help them dealing with problems effectively.

As Jerry has frequently pointed out, plenty of organizations fall victim to back luck, but much of the time, it’s not the bad luck that does them in; it’s how they react to the bad luck. A lot of organizations pillory themselves when they fail to foster environments in which everyone is empowered to solve problems. That leaves problem-solving in the hands of individuals, typically people with the title of “manager”. Yet at the moment a problem is recognized, the manager may not be available, or may not be the best person to deal with the problem. So, another reason that estimation fails is that organizations and individuals are not prepared or empowered to deal— mentally, politically, and emotionally—with surprises. The ensuing chaos and panic leaves them more vulnerable to Black Swans.

Next time, we’ll look at what all of this means for testing specifically, and for test estimation.

Project Estimation and Black Swans (Part 3)

Wednesday, October 20th, 2010

Last time out, we determined that mucking with the estimate to account for variance and surprises in projects is in several ways wanting. This time, we’ll make some choices about the tasks and the projects, and see where those choices might take us.

Leave Problem Tasks Incomplete; Accept Missing Features

There are a couple of variations on this strategy. The first is to Blow The Whistle At 100. That is, we could time-box the entire project, which in the examples above would mean stopping the project after 100 hours of work no matter where we were. That might seem a little abrupt, but we would be done after 100 hours.

To reduce the surprise level and make things a tiny bit more predictable, we could Drop Scope As You Go. That is, if we were to find out at some point that you’re behind our intended schedule, we could refine the charter of the project to meet the schedule by immediately revising the estimate and dropping scope commitments equivalent to the amount of time we’ve lost. Moreover, we could choose which tasks to drop, preferring to drop those that were interdependent with other tasks.

In our Monte Carlo model, project scope is represented by the number of tasks that we attempt. After a Wasted Morning, we drop any future commitment to at least three tasks; after a Lost Day, we drop seven; and after a Black Cygnet, we drop 15. We don’t have to drop the tasks completely; if we get close to 100 hours and find out that we have plenty of time left over due to a number of Stunning Successes, we can resume work on one or more of the dropped tasks.

Of course, any tasks that we’ve dropped might have turned out to be Stunning Successes, but in this model, we assume that we can’t know that in advance; otherwise, there’d be no need to estimate. In this scenario, it would also be wise to allocate some task time to manage the dropping and picking up of tasks.

I’ve been a program manager for a company that used a combination of Blow The Whistle and Drop Scope As You Go very successfully. This strategy often works reasonably well for commercial software. In general, you have to release an update periodically to keep the stock market analysts and shareholders happy. Releasing something less ambitious than you hoped is disappointing. Still, it’s usually more palatable than shipping late and missing out on revenue for a given quarter. If you can keep the marketers, salespeople, and gossip focused on things that you’ve actually done, no one outside the company has to know how much you really intended to do. There’s an advantage here, too, for future planning: uncompleted tasks for this project represent elements of the task list for the next project.

Leave Problem Tasks Incomplete; Accept Missing Features AND Bugs

We could time-box our tasks, lower our standard of quality, and stop working on a task as soon as it extends beyond a Little Slip. This typically means bugs or other problems in tasks that would otherwise have been Wasted Mornings, Lost Days, or Black Cygnets, and it means at least a few dropped tasks too (since even a Little Slip costs us a Regular Task).

This is The Perpetual Beta Strategy, in which we adjust our quality standards such that we can declare a result a draft or a beta at the predicted completion time. The Perpetual Beta Strategy assumes that our customers explicitly or implicitly consent to accepting something on the estimated date, and are willing to sacrifice features, live with problems, wait for completion of the original task list, or some combination of all of these. That’s not crazy. In fact, many organizations work this way. Some have got very wealthy doing it.

Either of these two strategies would work less well the more our tasks had dependencies upon each other. So, a related strategy would be to…

De-Linearize and Decouple Tasks

We’re especially at risk of project delays when tasks are interdependent, and when we’re unable to switch the sequence of tasks quickly and easily. My little Monte Carlo exercises are agnostic about task dependencies. As idealized models, they’re founded on the notion that a problem in one area wouldn’t affect the workings in any other area, and that a delay in one task wouldn’t have an impact on any other tasks, only on the project overall. On the one hand, the simulations just march straight through the tasks in each project equentially, as though each task were dependent on the last. On the other hand, each task is assigned a time at random.

In real life, things don’t work this way. Much of the time, we have options to re-organize and re-prioritize tasks, such that when a Black Cygnet task comes along, we may be able to ignore it and pick some other task. That works when we’re ultimately flexible, and when tasks aren’t dependent on other tasks.

And yet at some point, in any project and any estimation effort there’s going to be a set of tasks that are on a critical path. I’ve never seen a project organized such no task was dependent on any other task. The model still has some resonance, even if we don’t take it literally.

A key factor here would seem to be preventing problems, and dealing with potential problems at the first available opportunity.

Detect and Manage The Problems

What could we do to prevent, detect, and manage problems?

We could apply Agile practices like promiscuous pairing (that is, making sure that every team member regularly pairs with every other team member). Such things might to help with the critical path issue. If each person has at least passing familiarity with the whole project, each is more likely to be able to work on a new task while their current one is blocked. Similarly, when one person is blocked, others can help by picking up on that person’s tasks, or by helping to remove the block.

We could perform some kind of corrective action as soon as we have any information to suggest that a given task might not be completed on time. That suggests shortening feedback loops by constant checking and testing, checking in on tasks in progress, and resolving problems as early as possible, instead of allowing tasks to slip into potentially disastrous delays. By that measure, a short daily standup is better than a long weekly status meeting; pairing, co-location and continuous conversation are better still. Waiting to check or test the project until we have an integration- or system-level build provided relatively slow feedback for low-level problems; low-level unit checks reveal information relatively quickly and easy.

We could manage both tasks and projects to emphasize information gathering and analysis. Look at the nature of the slippages; maybe there’s a pattern to Black Cygnets, Lost Days, or Wasted Mornings. Is a certain aspect of the project consistently problematic? Does the sequencing of the project make it more vulnerable to slips? Are experiments or uncertain tasks allocated the task time that they need to inform better estimation? Is some person or group consistently involved in delays, such that training, supervision, pairing, or reassignment might help?

Note that obtaining feedback takes some time. Meetings can take task-level units of times, and continuous conversation may slow down tasks. As a result, we might have to change some of our tasks or some part of them from work to examining work or talking about work; and it’s likely some Stunning Successes will turn into Regular Tasks. That’s the downside. The upside is that we’ll probably prevent some Little Slips, Wasted Mornings, Lost Days and Black Cygnets, and turn them into Regular Tasks or Stunning Successes.

We could try to reduce various kinds of inefficiencies associated with certain highly repetitive tasks. Lots of organizations try to do this by bringing in continuous building and integration, or by automating the checking that they do for each new build. But be aware that the process of automating those checks involves lots of tasks that are themselves subject to the same kind of estimation problems that the rest of your project must endure.

So, if we were to manage the project, respond quickly to potentially out-of-control tasks, and moderate the variances using some of the ideas above, how would we model that in a Monte Carlo simulation? If we’re checking in frequently, we might not be able to get as much done in a single task, so let’s turn the Stunning Successes (50% of the estimated task time) into Modest Successes (75% of the estimated task time). Inevitably we’ll underestimate some tasks and overestimate others, so let’s say on average, out of 100 tasks, 50 come in 25% early, 49 come in 25% late. Bad luck of some kind happens to everyone at some point, so let’s say there’s still a chance of one Black Cygnet per project.

Number of tasks Type of task Duration Total (hours)
50 Modest Success .75 37.5
49 Tiny Slip 1.25 61.25
1 Black Cygnet 16 16

Once again, I ran 5000 simulated projects.

Average Project 114.67
Minimum Length 92.0
Maximum Length 204.25
On time or early 1058 (21.2%)
Late 3942(78.8%)
Late by 50% or more 96 (1.9%)
Late by 100% or more 1 (0.02%)

Image:  Managed Project

Remember that in the first example above, half our tasks were early by 50%. Here, half our tasks are early by only 25%, but things overall look better. We’ve doubled the number of on-time projects, and our average project length is down to 114% from 124%. Catching problems before they turn into Wasted Mornings or Lost Days makes an impressive difference.

Detect and Manage The Problems, Plus Short Iterations

The more tasks in a project, the greater the chance that we’ll be whacked with a random Black Cygnet. So, we could choose your projects and refrain from attempting big ones. This is essentially the idea behind agile development’s focus on a rapid series of short iterations, rather than on a single monolithic project. Breaking a big project up into sprints offers you the opportunity to do the project-level equivalent of frequent check-ins in on our tasks.

When I modeled an agile project with a Monte Carlo simulation, I was astonished by what happened.

For the task/duration breakdown, I took the same approach as just above:

Number of tasks Type of task Duration Total (hours)
50 Modest Success .75 37.5
49 Tiny Slip 1.25 61.25
1 Black Cygnet 16 16

I changed the project size to 20 tasks. Then, to compensate for the fact that the projects were only 20 tasks long, instead of 100, I ran 25000 simulated projects.

Average Project 22.94
Minimum Length 16
Maximum Length 66.75
On time or early 12433 (49.7%)
Late 12567 (50.3%)
Late by 50% or more 4552 (18.2%)
Late by 100% or more 400 (1.6%)

Image: Agile Project

A few points of interest. At last, we’re estimating to the point where almost half our the projects are on time! In addition, more than 80% of the projects (20443 out of 25000, in my run) are within 15% of the estimate time—and since the entire project is only 20 hours, these projects run over by only three hours. That affords quick course correction; in the 100-hours-per-project model, the average project is late by three days.

Here’s one extra fascinating result: the total time taken for these 25000 projects (500,000 tasks in all) was 573,410 hours. For the original model (the one above, the first in yesterday’s post), the total was 619,156.5 hours, or 8% more. For the more realistic second example, the total was 736,199.2 hours, or 28% more. In these models, shorter iterations give less opportunity for random events to affect a given project.

So, what does all this mean? What can we learn? Let’s review some ideas on that next time.

Project Estimation and Black Swans (Part 2)

Sunday, October 17th, 2010

In the last post, I talked about the asymmetry of unexpected events and the attendant problems with estimation. Today we’re going to look at some possible workarounds for the problems. Testers often start by questioning the validity of models, so let’s start there.

The linear model that I’ve proposed doesn’t match reality in several ways, and so far I haven’t been very explicit about them. Here are just a few of the problems with the model.

  • The model tacitly assumes that all tasks have to be done in a specific order.
  • The model tacitly assumes that all tasks are of equal significance.
  • The model leaves out all notions of tasks being independent or interdependent with respect to each other.
  • The model assumes that once we’re into a Wasted Morning, a Lost Day, or a Black Cygnet, there’s nothing we can do about it, and that we won’t do anything about it.

In particular, the model leaves out control actions that could be applied by managers or by the people performing the tasks, control actions that could be applied to the tasks, the project, the context, or to the estimates. Let’s start with the latter.

Pad The Estimates So We’re Half Right

Here’s the chart of yesterday’s first scenario again:

Under the given set of assumptions, and assuming random distribution, we come in late a little over 90% of the time. To counter this, we could add some arbitrary percentage to our estimates such that half the time we’ll come in early, while the other half of the time we’ll (still) come in late. In that case, we’d want to pick a median value.

When I used the data from the Monte Carlo simulation and sorted the project lengths, I found that Project 2500, the one right in the middle, has a length of 122 hours. So: pad the estimate by 22%, and we’ll be on time 50% of the time.

There are two problems with this. The first is that there’s still significant variability in terms of how late.  Second, the asymmetry problem is the same for projects as it is for individual tasks: our big losses have a greater magnitude than our big wins. Even if we go for the average project length, rather than the median (the average 123.83 hours, is a couple of hours longer), fewer projects will go over the estimated time, but early projects will tend to be more modestly early, while the late ones will be more extremely late. None of this is likely to be acceptable to someone who values predictability (that is, the person who is asking us for the estimate).

Pad The Estimates So We’re Almost Always Right

Someone who likes predictability would probably prefer our projects to come in on time 95% of the time. If we wanted to satisfy that, based on the same set of assumptions, we would do the best estimating job we could, then pad our estimate by 58%, to 158 hours.

One problem with that strategy is that work tends to expand to fill time available, and people will start to work at a slower pace.

One the other hand, if people keep the regular pace up, 82% of our projects are going to come in at least 10% early, and 42% of our projects will come in 25% early! In such a case, we’ll probably face political backlash and be urged to less conservative with our estimates. By the math, we really can’t win under this set of assumptions.

Pad The Team

Rather than padding the estimate of time, we could build slack into the system by having extra people available to take on any surprises or misunderstandings. But note Fred Brooks’ Law, which says that adding people to a late project makes it later. That’s because of at least two problems: the new people need to be brought up to speed, and having more connections in a system tends increases the communication burden.

So maybe we’ll have to change something about the way we manage the project. We’ll look at that next.

Project Estimation and Black Swans (Part 1)

Thursday, October 14th, 2010

There has been a flurry of discussion about estimation on the net in the last few months.

All this reminded me to post the results of some number-crunching experiments that I started to do back in November 2009, based on a thought experiment by James Bach. That work coincided with the writing of a Swan Song, a Better Software column in which I discussed The Black Swan, by Nassim Nicholas Taleb.

A Black Swan is an improbable and unexpected event that has three characteristics. First, it takes us completely by surprise, typically because it’s outside of our models. Taleb says, “Models and constructions, those intellectual maps of reality, are not always wrong; they are wrong only in some specific applications. The difficulty is that a) you do not know beforehand (only after the fact) where the map will be wrong, and b) the mistakes can lead to severe consequences. These models are like potentially helpful medicines that carry random but very severe side effects.”

Second, a Black Swan has a disproportionately large impact. Many rare and surprising events happen that aren’t such a big deal. Black Swans can destroy wealth, property, or careers—or create them. A Black Swan can be a positive event, even though we tend not to think of them as such.

Third, after a Black Swan, people have a tendency to say that they saw it coming. They make this claim after the event because of a pair of inter-related cognitive biases. Taleb calls the first epistemic arrogance, an inflated sense of knowing what we know. The second is the narrative fallacy, our tendency to bend a story to fit with our perception of what we know, without validating the links between cause and effect. It’s easy to say that we know the important factors of the story when we already know the ending. The First World War was a Black Swan; September 11, 2001 was a Black Swan; the earthquake in Haiti, the volcano in Iceland, and the Deepwater Horizon oil spill in the Gulf of Mexico were all Black Swans. (The latter was a white swan, but it’s now coated in oil, which is the kind of joke that atracygnologists like to make). The rise of Google’s stock price after it went public was a Black Swan too. (You’ll probably meet people who claim that they knew in advance that Google’s stock price would explode. If that were true, they would have bought stock then, and they’d be rich. If they’re not rich, it’s evidence of the narrative fallacy in action.)

I think one reason that projects don’t meet their estimates is that we don’t naturally consider the impact of the Black Swan. James introduced me to a thought experiment that illustrates some interesting problems with estimation.

Imagine that you have a project, and that, for estimation’s sake, you broke it down into really fine-grained detail. The entire project decomposes into one hundred tasks, such that you figured that each task would take one hour. That means that your project should take 100 hours.

Suppose also that you estimated extremely conservatively, such that half of the tasks (that is, 50) were accomplished in half an hour, instead of an hour. Let’s call these Stunning Successes. 35% of the tasks are on time; we’ll called them Regular Tasks.

15% of the time, you encounter some bad luck.


  • Eight tasks, instead of taking an hour, take two hours. Let’s call those Little Slips.

  • Four tasks (one in 25) end up taking four hours, instead of the hour you thought they’d take. There’s a bug in some library that you’re calling; you need access to a particular server and the IT guys are overextended so they don’t call back until after lunch. We’ll call them Wasted Mornings.

  • Two tasks (one in fifty) take a whole day, instead of an hour. Someone has to stay home to mind a sick kid. Those we’ll call Lost Days.

  • One task in a hundred—just one—takes two days instead of just an hour. A library developed by another team is a couple of days late; a hard drive crash takes down a system and it turns out there’s a Post-It note jammed in the backup tape drive; one of the programmers has her wisdom teeth removed (all these things have happened on projects that I’ve worked on). These don’t have the devastating impact of a Black Swan; they’re like baby Black Swans, so let’s call them Black Cygnets.

Number of tasks Type of task Duration Total (hours)
50 Stunning Success 50 25
35 On Time 1.00 35
8 Little Slip 2 16
4 Wasted Morning 4 16
2 Lost Day 8 16
1 Black Cygnet 16 16
100 124

That’s right: the average project, based on the assumptions above, would come in 24% late. That is, you estimated it would take two and a half weeks. In fact, it’s going to take more than three weeks. Mind you, that’s the average project, and the notion of the “average” project is strictly based on probability. There’s no such thing as an “average” project in reality and all of its rich detail. Not every project will encounter bad luck—and some projects will run into more bad luck than others.

So there’s a way of modeling projects in a more representative way, and it can be a lot of fun. Take the probabilities above, and subject them to random chance. Do that for every task in the project, then run a lot of projects. This shows you what can happen on projects in a fairly dramatic way. It’s called a Monte Carlo simulation, and it’s an excellent example of exploratory test automation.

I put together a little Ruby program to generate the results of scenarios like the one above. The script runs N projects of M tasks each, allows me to enter as many probabilities and as many durations as I like, puts the results into an Excel spreadsheet, and graphs them. (Naturally I found and fixed a ton of bugs in my code as I prepared this little project. But I also found bugs in Excel, including some race-condition-based crashes, API performance problems, and severely inadequate documentation. Ain’t testing fun?) For the scenario above, I ran 5000 projects of 100 randomized tasks each. Based on the numbers above, I got these results:

Average Project 123.83 hours
Minimum Length 74.5 hours
Maximum Length 217 hours
On time or early projects 460 (9.2%)
Late projects 4540 (90.8%)
Late by 50% or more 469 (9.8%)
Later by 100% or more 2 (0.9%)

Image: Standard Project

Here are some of the interesting things I see here:


  • The average project took 123.83 hours, almost 25% longer than estimated.

  • 460 projects (or fewer than 10%) were on time or early!

  • 4540 projects (or just over 90%) were late!

  • You can get lucky. In the run I did, three projects were accomplished in 80 hours or fewer. No project avoided having any Wasted Mornings, Lost Days, or Black Cygnets. That’s none out of five thousand.

  • You can get unlucky, too. 469 projects took at least 1.5 times their projected time. Two took more than twice their projected time. And one very unlucky project had four Wasted Mornings, one Lost Day, and eight Black Cygnets. That one took 217 hours.

This might seem to some to be a counterintuitive result. Half the tasks took only half of the time alloted to them. 85% of the tasks came in on time or better. Only 15% were late. There’s a one-in-one-hundred chance that you’ll encounter a Black Cygnet. How could it be that so few projects came in on time?

The answer likes in asymmetry, another element of Taleb’s Black Swan model. It’s easy to err in our estimates by, say, a factor of two. Yet dividing the duration of a task by two has a very different impact from multiplying the duration by two. A Minor Victory saves only half a Regular Task, but a Little Slip costs two whole Regular Tasks.

Suppose you’re pretty good at estimation, and that you don’t underestimate so often. 20% of the tasks came in 10% early (let’s call those Minor Victories). 65% of the tasks come right on time (Regular Tasks). That is, 85% of your estimates are either too conservative or spot on. As before, there are eight Little Slips, four Wasted Mornings, two Lost Days, and a Black Cygnet.

With 20% of your tasks coming in early, and 15% coming in late, how long would you expect the average project to take?

Number of tasks Type of task Duration Total (hours)
20 Minor Victory .9 18
65 On Time 1.00 65
8 Little Slip 2 16
4 Wasted Morning 4 16
2 Lost Day 8 16
1 Black Cygnet 16 16
100 147

That’s right: even though your estimation of tasks is more accurate than in the first example above, the average project would come in 47% late. That is, you thought it would take two and a half weeks, and in fact, it’s going to take more than three and a half weeks. Mind you, that’s the average, and again that’s based on probability. Just as above, not every project will encounter bad luck, and some projects will run into more bad luck than others. Again, I ran 5,000 projects of 100 tasks each.

Average Project 147.24 hours
Minimum Length 105.2 hours
Maximum Length 232 hours
On time or early projects 0 (0.0%)
Late projects 5000 (100.0%)
Late by 50% or more 2022 (40.4%)
Late by 100% or more 30 (0.6%)

Image: Typical Project

Over 5000 projects, not a single project came in on time. The very best project came in just over 5% late. It had 18 Minor Victories, 77 on-time tasks, four Little Slips, and a Wasted Morning. It successfully avoided the Lost Day and the Black Cygnet. And in being anywhere near on-time, it was exceedingly rare. In fact, only 16 out of 5000 projects were less than 10% late.

Now, these are purely mathematical models. They ignore just about everything we could imagine about self-aware systems, and the ways the systems and their participants influence each other. The only project management activity that we’re really imagining here is the modelling and estimating of tasks into one-hour chunks. Everything that happens after that is down to random luck. Yet I think the Monte Carlo simulations shows that, unmanaged, what we might think of as a small number of surprises and a small amount of disorder can have a big impact.

Note that, in both of the examples above, at least 85% of the tasks come in on time or early overall. At most, only 15% of the tasks are late. It’s the asymmetry of the impact of late tasks that makes the overwhelming majority of projects late. A task that takes one-sixteenth of the time you estimated saves you less that one Regular Task, but a Black Cygnet costs you an extra fifteen Regular Tasks. The combination of the mathematics and the unexpected is relentlessly against you. In order to get around that, you’re going to have to manage something. What are the possible strategies? Let’s talk about that tomorrow.

Gaming the Tests

Monday, September 27th, 2010

Let’s imagine, for a second, that you had a political problem at work. Your CEO has promised his wife that their feckless son Ambrose, having flunked his university entrance exams, will be given a job at your firm this fall. Company policy is strict: in order to prevent charges of nepotism, anyone holding a job must be qualified for it. You know, from having met him at last year’s Christmas party, that Ambrose is (how to put this gently?) a couple of tomatoes short of a thick sauce. Yet the policy is explicit: every candidate must not only pass a multiple choice test, but must get every answer right. The standard number of correct answers required is (let’s say) 40.

So, the boss has a dilemma. He’s not completely out to lunch. He knows that Ambrose is (how can I say this?) not the sharpest razor in the barbershop. Yet the boss adamantly wants his son to get a job with the firm. At the same time, the boss doesn’t want to be seen to be violating his own policy. So he leaves it to you to solve the problem. And if you solve the problem, the boss lets you know subtly that you’ll get a handsome bonus. Equally subtly, he lets you know that if Ambrose doesn’t pass, your career path will be limited.

You ponder for a while, and you realize that, although you have to give Ambrose an exam, you have the authority to set the content and conditions of the exam. This gives you some possibilities.

A. You could give a multiple choice test in which all the answers were right. That way, anyone completing the test would get a perfect score.

B. You could give a multiple choice test for which the answers were easy to guess, but irrelvant to the work Ambrose would be asked to do. For example, you could include questions like, “What is the very bright object in the sky that rises in the morning and sets in the evening?” and provide “The Sun” as choice of answer, and the names of hockey players for the other choices.

C. You could find out what questions Ambrose might be most likely to answer correctly in the domain of interest, and then craft an exam based on that.

D. You could give a multiple choice test in which, for every question, one of A, B, or C was the correct answer, and answer D was always “One of the above.”

E. You might give a reasonably difficult multiple-choice exam, but when Ambrose got an answer wrong, you could decide that there’s another way to interpret the answer, and quietly mark it right.

F. You might give Ambrose a very long set of multiple-choice questions (say 400 of them), and then, of his answers, pick 40 correct ones. You then present those questions and answers as the completed exam.

G. You could give Ambrose a set of questions, but give him as much time as he wanted to provide an answer. In addition, you don’t watch him carefully (although not watching carefully is a strategy that nicely supports most of these options).

H. You could ask Ambrose one multiple choice question. If he got it wrong, correct him until he gets it right. Then you could develop another question, ask that, and if he gets it wrong, correct him until he gets it right. Then continue in a loop until you get to 40 questions.

I. This approach is like H, but instead you could give a multiple choice test for which you had chosen an entire set of 40 questions in advance. If Ambrose didn’t get them all right, you could correct him, and then give him the same set of questions again. And again. And over and over again, until he finally gets them all right. You don’t have to publicize the failed attempts; only the final, successful one. That might take some time and effort, and Ambrose wouldn’t really be any more capable of anything except answering these specific questions. But, like all the other approaches above, you could effect a perfect score for Ambrose.

When the boss is clamoring for a certain result, you feel under pressure and you’re vulnerable. You wouldn’t advise anyone to do any of the things above, and you wouldn’t do them yourself. Or at least, you wouldn’t do them consciously. You might even do them with the best of intentions.

There’s an obvious parallel here—or maybe not. You may be thinking of the exam in terms of a certain kind of certification scheme that uses only multiple-choice questions, the boss as the hiring manager for a test group, and Ambrose as a hapless tester that everyone wants to put into a job for different reasons, even though no one is particularly thrilled about the idea. Some critical outsider might come along and tell you point-blank that your exam wasn’t going to evaluate Ambrose accurately. Even a sympathetic observer might offer criticism. If that were to happen, you’d want to keep the information under your hat—and quite frankly, the other interested parties would probably be complacent too. Dealing with the critique openly would disturb the idea that everyone can save face by saying that Ambrose passed a test.

Yet that’s not what I had in mind—not specifically, at least. I wanted to point out some examples of bad or misleading testing, which you can find in all kinds of contexts if you put your mind to it. Imagine that the exam is a set of tests—checks, really. The boss is a product owner who wants to get the product released. The boss’ wife is a product marketing manager. Hapless Ambrose is a program—not a very good program to be sure, but one that everyone wants to release for different reasons, even though no one is particularly thrilled by the idea. You, whether a programmer or a tester or a test manager, are responsible for “testing”, but you’re really setting up a set of checks. And you’re under a lot of pressure. How might your judgement—consciously or subconsciously—be compromised? Would your good intentions bend and stretch as you tried to please your stakeholders and preserve your integrity? Would you admit to the boss that your testing was suspect? If you were under enough pressure, would you even notice that your testing was suspect?

So this story is actually about any circumstance in which someone might set up a set of checks that provide some illusion of success. Can you think of any more ways that you might game the tests… or worse, fool yourself?

Encouraging Programmers to be Testers

Monday, September 20th, 2010

A colleague wrote to me recently and asked about a problem that he’s had in hiring. He says…

The kind of test engineers we’re looking for are ones that can think their way around a system and look for all the ways that things can go wrong (pretty standard, so far), and then code up a tool or system that can automatically verify that those things haven’t gone wrong (a bit more rare, especially, for some reason, in the part of the country where I’m working).

The problem is that I suspect there is a larger pool of candidates who fit this description, but think of themselves as software developers. They never consider a software engineer in test role because they think “oh, that’s QA”.

It’s a little more easy on the west coast—in Silicon Valley and the Pacific Northwest, I mean—because big companies like Microsoft, Google, and others define a lot of their test engineering roles as “software developers who build systems that just happen to test other systems as their product”.

Any thoughts on how to shift that perception here to be more closely aligned with the west coast?

My reply went like this:

I don’t like that particular perception, myself. I observe that the “Software Development Engineer in Test” view often presupposes that the SDET role is a consolation prize for not being hired as a Real Programmer on the one hand. On the other hand, the view poses a barrier to entry for non-programmers who want to be testers.

If your prospects think, “Oh, that’s QA” in this dismissive kind of way, what kind of testers are they going to be? Moreover, what kind of programmers are they going to be? To me and to our community, testing is questioning a product in order to evaluate it. Testing is a service to the project, wherein testers help to discover risks and problems that threaten the value of the product and the goals of the business. While testers are specialists in that kind of role, testing is something that any programmer should be doing too, as he or she is writing, reviewing, and revising code. So if prospects think that testing is some kind of second-class role or task, who needs them? Encourage them to take a hike and come back when they’ve learned not to be such prima donnas.

If offered a gaggle of programming enthusiasts to choose from, I’d prefer to hire the person who is most genuinely interested in testing. I’d give that person a toolsmith role, wherein (s)he provides programming services to other testers. In addition, the toolsmith can provide the service of aiding testers in learning to program, while the testers help the toolsmith to sharpen his/her testing skills.

If you’d like to motivate programmer-types to test, here’s a suggestion: Instead of treating testing as a programming exercise, treat it as a more general problem-solving task in which hand-crafted tools might be very helpful. James Bach and I do this by giving people testing puzzles that look simple but that have devious twists and traps. In my experience, most programmers (and all of the best ones) enjoy challenging intellectual problems, irrespective of whether programming is at the centre of solution. In our classes and at conferences and in jam sessions, we show programmers over and over again how they can be fooled by puzzles, just as anyone else can. We hasten to point out that programming can help to solve problems in a manner that can be far more more practical and useful than other approaches for solving some problems in some contexts. But as it goes with production code, so it goes with test code: working out how to solve the problem is the most challenging part, and writing a program is a means to that end, rather than the end in itself. The trick is to help people to recognize when they might want to emphasize writing code and when they might want to emphasize other approaches and other lines of investigation.

As Cem Kaner points out in his talk “Investment Modeling as an Exemplar of Exploratory Test Automation“, the best skilled testers (and especially those with programming skills) have the kind of mindset that we might easily associate with the best quantitative analysts. On Wall Street, they’re called quants, and they make a lot of money. In order to be successful and to reduce risk, both their programs and their models have to be useful, reliable, accurate—which means they have to be tested, and tested well.

Trouble is, many managers treat testing as a rote, clerical, bureaucratic, mechanistic activity. If that were all there is to testing, it would be a dead-end role and ripe for automation, but testing is so much more than coding up a set of checks. It’s up to us to help others to recognize what testing can do, by offering conversation, challenges, and leadership by example—and information about products and projects that people genuinely value. To do that, we need a community of testers who are passionate about their craft and practice it with skill, recognizing that it’s thinking—about risk, cost, value, and learning—that’s central.

But if you put the hammer and hammering at the centre of the task, you’ll get enthusiatic hammerers applying for the job, where I suspect you really want cabinet makers.

Statistician or Journalist?

Friday, August 27th, 2010

Eric Jacobson has a problem, which he thoughtfully relates on his thoughtful blog in a post called “How Can I Tell Users What Testers Did?”. In this post, I’ll try to answer his question, so you might want to read his original post for context.

I see something interesting here: Eric tells a clear story to relate to his readers some problem that he’s having with explaining his work to others who, by his account, don’t seem to understand it well. In that story, he mentions some numbers in passing. Yet the numbers that he presents are incidental to the story, not central to it. On the contrary, in fact: when he uses numbers, he’s using them as examples of how poorly numbers tell the kind of story he wants to tell. Yet he tells a fine story, don’t you think?

In the Rapid Software Testing course, we present this idea (Note to Eric: we’ve added this since you took the class): To test is to compose, edit, narrate, and justify two parallel stories. You must tell a story about the product: how it works, how it fails, and how it might not work in ways that matter to your client (and in the context of a retrospective, you might like to talk about how the product was failing and is now working). But in order to give that story its warrant, you must tell another story: you must tell a story about your testing. In a case like Eric’s, that story would take the form of a summary report focused on two things: what you want to convey to your clients, and what they want to know from you (and, ideally, those two things should be in sync with each other).

To do that, you might like to consider various structures to frame your story. Let’s start with the elements of what we (somewhat whimsically) call The Universal Test Procedure (you can find it in the course notes for the class). From a retrospective view, that would include

  • your model of the test space (that is, what was inside and outside the scope of your testing, and in particular the risks that you were trying to address)
  • the oracles that you used
  • the coverage that you obtained
  • the test techniques you applied
  • the ways in which you configured the product
  • the ways in which you operated the product
  • the ways in which you observed the product
  • the ways in which you evaluated the product; and
  • the heuristics by which you decided to stop testing
  • what you discovered and reported, and how you reported

You might also consider the structures of exploratory testing. Even if your testing isn’t highly exploratory, a lot of the structures have parallels in scripted testing.

Jon Bach says (and I agree) that testing is journalism, so look at the way journalists structure a story: they often start with the classic pyramid lead. They might also start with a compelling anecdote as recounted in What’s Your Story, by Craig Wortmann, or Made to Stick, by Chip and Dan Heath. If you’re in the room with your clients, you can use a whiteboard talk with diagrams, as in Dan Roam’s The Back of the Napkin. At the centre of your story, you could talk about risks that you addressed with your testing; problems that you found and that got addressed; problems that you found and that didn’t get addressed; things that slowed you down as you were testing; effort that you spent in each area; coverage that you obtained. You could provide testimonials from the programmers about the most important problems you found; the assistance that you provided to them to help prevent problems; your contributions to design meetings or bug triage sessions; obstacles that you surmounted; a set of charters that you performed, and the feature areas that they covered. Again, focus on what you want to convey to your clients, and what they want to know from you.

Incidentally, the more often and the more coherently you tell your story, the less explaining you’ll have to do about the general stuff. That means keeping as close to your clients as you can, so that they can observe the story unfolding as it happens. But when you ask “What metric or easily understood information can my test team provide users, to show our contribution to the software we release?”, ask yourself this: “Am I a statistician or a journalist?”


Other resources for telling testing stories:

Thread-Based Test Management: Introducing Thread-Based Test Management, by James Bach; and A New Thread, by Jon Bach (as of this writing, this is brand new stuff)

Telling Your Exploratory Story: A presentation at Agile 2010, by Jonathan Bach (I was unable to download anything other than a damaged version this, but maybe it’s working now; please let me know)

Constructing the Quality Story (from Better Software, November 2009): Knowledge doesn’t just exist; we build it. Sometimes we disagree on what we’ve got, and sometimes we disagree on how to get it. Hard as it may be to imagine, the experimental approach itself was once controversial. What can we learn from the disputes of the past? How do we manage skepticism and trust and tell the testing story?

On Metrics:

Three Kinds of Measurement (And Two Ways to Use Them) (from Better Software, July 2009): How do we know what’s going on? We measure. Are software development and testing sciences, subject to the same kind of quantitative measurement that we use in physics? If not, what kinds of measurements should we use? How could we think more usefully about measurement to get maximum value with a minimum of fuss? One thing is for sure: we waste time and effort when we try to obtain six-decimal-place answers to whole-number questions. Unquantifiable doesn’t mean unmeasurable. We measure constantly WITHOUT resorting to numbers. Goldilocks did it.

Issues About Metrics About Bugs (Better Software, May 2009): Managers often use metrics to help make decisions about the state of the product or the quality of the work done by the test group. Yet measurements derived from bug counts can be highly misleading because a “bug” isn’t a tangible, countable thing; it’s a label for some aspect of some relationship between some person and some product, and it’s influenced by when and how we count… and by who is doing the counting.

On Coverage:

Got You Covered (from Better Software, October 2008): Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.

Cover or Discover (from Better Software, November 2008): Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.

A Map By Any Other Name (from Better Software, December 2008): A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.

When Should the Product Owner Release the Product?

Tuesday, August 3rd, 2010

In response to my previous blog post “Another Silly Quantitative Model”, Greg writes: In my current project, the product owner has assumed the risk of any financial losses stemming from bugs in our software. He wants to release the product to customers, but he is of course nervous. How do you propose he should best go about deciding when to release? How should he reason about the risks, short of using a quantitative model?

The simple answer is “when he’s not so nervous that he doesn’t want to ship”. What might cause him to decide to stop shipment? He should stop shipment when there are known problems in the product that aren’t balanced by countervailing benefits. Such problems are called showstoppers. A colleague once described “showstopper” as “any problem in the product that would make more sense to fix than to ship.”

When I was a product owner and I reasoned with the project team about showstoppers, we deemed as a showstopper

  • Any single problem in the product that would definitely cause loss or harm (or sufficient annoyance or frustration) to users, such that the product’s value in its essential operation would be nil. Beware of quantifying “users” here. In the age of the Internet, you don’t need very many people with terrible problems to make noise disproportionate to the population, nor do you need those problems to be terrible problems when they affect enough people. The recent kerfuffle over the iPhone 4 is a case in point; the Pentium Bug is another. Customer reactions are often emotional more than rational, but to the customer, emotional reactions are every bit as real as rational ones.
  • Any set of problems in the product that, when taken individually, would not threaten its value, but when viewed collectively would. That could include a bunch of minor irritants that confuse, annoy, disturb, or slow down people using the product; embarrassing cosmetic defects; non-devastating functional problems; parafunctional issues like poor performance or compatibility, and the like.

Now, in truth, your product owner might need to resort to a quantitative model here: he has to be able to count to one. One showstopper, by definition, is enough to stop shipment.

How might you evaluate potential showstoppers qualitatively? My colleague Fiona Charles has two nice suggestions: “Could a problem that we know about in this product trigger a front-page story in the Globe and Mail‘s Report on Business, or in the Wall Street Journal?” “Could a problem that we know about in this product lead to a question being raised in Parliament?” Now: the fact is that we don’t, and can’t, know the answer to whether the problem will have that result, but that’s not really the point of the questions. The points are to explore and test the ways that we might feel about the product, the problems, and their consequences.

What else might cause nervousness for your client? Perhaps he’s worried that, other than the known problems, there are unanswered questions about the product. Those include

  • Open questions whose answer would produce one or more instances of a showstopper.
  • Unasked questions that, when asked, would turn into open questions instead of “I don’t care”. Where would you get ideas for such questions? Try the Heuristic Test Strategy Model at http://www.satisfice.com/tools/satisfice-tsm-4p.pdf for an example of the kinds of questions that you might ask.
  • Unanswered questions about the product are one indicator that you might not be finished testing. There are other indicators; you can read about them here: http://www.developsense.com/blog/2009/09/when-do-we-stop-test/

    Questions about how much we have (or haven’t) tested are questions about test coverage. I wrote three columns about that a while back. Here are some links and synopses:

    Got You Covered: Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.

    Cover or Discover: Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.

    A Map By Any Other Name: A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.

    Whether you’ve established a clear feeling or are mired in uncertainty, you might want to test your first-order qualitative evaluation with a first-order quantitative model. For example, many years ago at Quarterdeck, we had a problem that threatened shipment: on bootup, our product would lock system that had a particular kind of hard disk controller. There was a workaround, which would take a trained technical support person about 15 minutes to walk through. No one felt good about releasing the product in that state, but we were under quarterly schedule pressure.

    We didn’t have good data to work with, but we did have a very short list of beta testers and data about their systems. Out of 60 beta testers, three had machines with this particular controller. There was no indication that our beta testers were representative of our overall user population, but 5% of our beta testers had this controller. We then performed the thought experiment of destroying the productivity of 5% of our user base, or tech support having to spend 15 minutes with 5% of our user base (in those days, in the millions).

    How big was our margin of error? What if we were off by a factor of two, and ten per cent of our user base had that controller? What if we were off by a factor of five, and only one per cent of our user base had that controller? Suppose that only one per cent of the user base had their machines crash on startup; suppose that only a fraction of those users called in. Eeew. The story and the feeling, rather than the numbers, told us this: still too many.

    Is it irrational to base a decision based on such vague, unqualified, first-order numbers? If so, who cares? We made an “irrational” but conscious decision: fix the product, rather than ship it. That is, we didn’t decide based on the numbers, but rather on how we felt about the numbers. Was that the right decision? We’ll never know, unless we figure out how to get access to parallel universes. In this one, though, we know that the problem got fixed startlingly quickly when another programmer viewed it with fresh eyes; that the product shipped without the problem; that users with that controller never faced that problem; and that the tech support department continued in its regular overloaded state, instead of a super-overloaded one.

    The decision to release is always a business decision, and not merely a technical one. The decision to release is not based on numbers or quantities; even for those who claim to make decisions “based on the numbers”, the decisions are really based on feelings about the numbers. The decision to release is always driven by balancing cost, value, knowledge, trust, and risk. Some product owners will have a bigger appetite for risk and reward; others will be more cautious. Being a product owner is challenging because, in the end, product owners own the shipping decisions. By definition, a product owner assumes responsibility for the risk of financial losses stemming from bugs in the software. That’s why they get the big bucks.

    Heuristics and Leadership

    Friday, May 28th, 2010

    In a recent blog post, James Bach discusses the essence of heuristics. A heuristic is a fallible method for solving a problem or making a decision. When used as an adjective, “heuristic” means fallible and conducive to learning. James ends the post by introducing a number of questions in order to test whether someone is teaching you a heuristic effectively. Meeta Prakash, in the comments, remarks “Your questions sound so much like my idea and interpretation of a ‘leader’.”

    Yes, Meeta, the asking of those questions sounds like leadership to me too. What is leadership? Leadership is “creating an environment in which everyone is empowered” (that’s Jerry Weinberg’s definition). We believe in that. We believe that excellent testing starts with the skill set and the mindset of the individual tester. Other things might help, but excellence in testing must centre on the tester.

    Discussion of the Method is Billy Vaughan Koen’s book on engineering. He describes the engineering method as “the use of heuristics to cause the best change in a poorly understood situation within the available resources”. He notes that the individual engineer, the engineering organization in which he works, and the engineering discipline overall each has a state of the art, which he abbreviates as “sota”. These sotas overlap one another in some places and cover new ground elsewhere. Each leads and lags the others in certain areas. The overall discipline has a sota that is more advanced than that of the organization and the individual in some places. The organization’s sota is aware of things that neither the individual nor the discipline has yet recognized. Each individual’s sota contains some knowledge that is unknown to both the organization and the discipline.

    Each sota may advance before that of the other two, and the advances are ongoing. Each sota evolves in a different context. For each stage of evolution, we can’t be certain about what factors might be relevant to success or failure. Thus no practice nor method nor approach can deemed to be best. We don’t know, can’t know, whether some method is best; we don’t know, can’t know, whether someone might invent a better method. We don’t know, can’t know, whether someone has already invented a better method. What we can do, however, is to create environments in which everyone is empowered to discover and apply new heuristics along with those that we already know. That’s important because the discipline and the organization never advance on their own. They advance when someone has the inspiration, the initiative, the courage, and the opportunity to try something new, uncertain as to whether the new approach will work.

    Organizations and individuals can foster initiative by empowering people (or at the very least, by leaving them alone). The overall discipline tends not to encourage initiative much, so it seems. With prescriptive, restrictive standards, disciplines often discourage innovation. Why would disciplines do this? One reason is that disciplines are typically led by experts who, as McLuhan said, are heavily invested in their own expertise, and therefore resist change that would threaten that investment. Propaganda and the threat of unemployment are among the crude tools that experts wield.  Want evidence for that in testing? You need look no further than the “experts”, the certifiers, and the standards enthusiasts in our craft; their narrow and flawed models of evidence and measurement; their intolerance of uncertainty; and their resistance to acknowledging the exploratory mindset. Yet without exploration, progress in any domain ceases while the rest of the world rushes past.

    “All is heuristic,” declares Koen. That’s an absolute statement which appears to declare that there are no absolutes. But rather than denying the paradox, Koen embraces it. All is heuristic, he says, including the heuristic that all is heuristic. The notion that all is heuristic is pretty robust. Koen points out that even algorithms have contexts in which they work and contexts in which they fail, and that algorithms must be chosen by people applying judgement and skill in uncertain conditions. Yet he leaves open the possibility that someone, somewhere, some day, might discover an infallible method for solving a problem.

    James, our colleagues, and I deal with a similar paradox when we contend that all testing is heuristic. There’s no test for infallibility! It might be that there are some occasions when there are infallible methods for testing. It’s just that we’ve never seen one, and that we can’t currently imagine a case in which a process or a standard could—without the application of judgement or skill—be guaranteed to solve some testing problem. That’s because skilled testers (let’s call them “we”, without claiming expertise, but asserting that we are students of the craft) have specific advantages over process models, methods, and tools (“they”):

    • We have situational awareness (they don’t).
    • We work from the assumption that we’re fallible (they don’t).
    • We have the capacity to make judgements on questions of cost vs. value (they don’t).
    • We have the ability to apply a stopping heuristic at any time (they don’t).
    • We have the intelligence to choose which heuristics are applicable and which are not (they don’t).
    • We have the opportunity to consult, in real time, with our clients (they don’t).
    • We have the inventiveness to work around a problem (they don’t).
    • We have the sensory appartus to determine when a heuristic is failing (they don’t).
    • We have the humanity to notice something unexpected that might be a problem for people (they don’t).
    • We have the capacity to learn (they don’t).

    And that’s only a partial list. A couple of notes: every one of these capabilities is itself heuristic; and it should be clear that I’m not talking only about skilled testers, but about any skilled discipline.

    It’s not that bodies of knowledge or process models or standards never have anything interesting to say (although, in my experience, most of them do tend to be written in a style that induces immediate and profound slumber). It’s that none of these tools have any intrinsic relevance or value unless and until they are applied by people. If the tools are to be applied effectively, we need to recognize that they are all heuristic. We must test these heurstics and their validity with questions like the ones James poses.  We must also recognize the people applying them must have the skills to know how, when, and when not to apply them. To develop those heuristics and those skills, we need to create an environment in which everyone is empowered. That is what we believe.