Blog Posts for the ‘Systems Thinking’ Category

Project Estimation and Black Swans (Part 5): Test Estimation

Sunday, October 31st, 2010

In this series of blog posts, I’ve been talking about project estimation. But I’m a tester, and if you’re reading this blog, presumably you’re a tester too, or at least you’re interested in testing. So, all this might have been interesting for project estimation in general, but what are the implications for test project estimation?

Let’s start with the tester’s approach: question the question.

Is there ever such a thing as a test project? Specifically, is there such a thing as a test project that happens outside of a development project?

“Test projects” are never completely independent of some other project. There’s always a client, and typically there are other stakeholders too. There’s always an information mission, whether general or specific. There’s always some development work that has been done, such that someone is seeking information about it. There’s always a tester, or some number of testers (let’s assume plural, even if it’s only one). There’s always some kind of time box, whether it’s the end of an agile iteration, a project milestone, a pre-set ship date, or a vague notion of when the project will end. Within that time box, there is at least one cycle of testing, and typically several of them. And there are risks that testing tries to address by seeking and providing information. From time to time, whether continuously or at the end of a cycle, testers report to the client on what they have discovered.

The project might be a product review for a periodical. The project might be a lawsuit, in which a legal team tries to show that a product doesn’t meet contracted requirements. The project might be an academic or industrial research program in which software plays a key role. More commonly, the project is some kind of software development, whether mass-market commercial software, an online service, or IT support inside a company. The project may entail customization of an existing product, or it may involve lots of new code development. But no matter what, testing isn’t the project in and of itself; testing is a part of a project, a part that informs the project. Testing doesn’t happen in isolation; it’s part of a system. Testing observes outputs and outcomes of the system of which it is a part, and feeds that information back into the system. And testing is only one of several feedback mechanisms available to the system.

Although testing may be arranged in cycles, it would be odd to think of testing as an activity that can be separated from the rest of its project, just as it would be odd to think of seeing as a separate phase of your day. People may say a lot of strange things, but you’ll rarely hear them say “I just need to get this work done, and then I’ll start seeing”; and you almost never get asked “When are you going to be done seeing?” Now, there might be part of your day when you need to pay a lot of attention to your eyes—when you’re driving a car, or cutting vegetables, or watching your child walk across a cluttered room. But, even when you’re focused (sorry) on seeing, the seeing part happens in the context of—and in the service of—some other activity.

Does it make sense to think in terms of a “testing phase”?

Many organizations (in particular, the non-agile ones) divide a project into two discrete parts: a “development phase” and a “testing phase”. My colleague James Bach notes an interesting fallacy there.

What happens during the “development phase”? The programmers are programming. Programming may include a host of activities such as research, design, experimentation, prototyping, coding, unit testing (and in TDD, a unit check is created just before the code to be checked), integration testing, debugging, or refactoring. Some of those activities are testing activities. And what are the testers doing during the “development phase”? The testers are testing. More specifically, they may be engaged in review, planning, test design, toolsmithing, data generation, environment setup, or the running of relatively low-level integration tests, or even very high-level system tests. All of those activities can be wrapped up under the rubric of “testing”.

What happens during the “testing phase”? The programmers are still programming, and the testers are still testing. The primary thing that distinguishes the two phases, though, is the focus of the programming work: the programmers have generally stopped adding new features, but are instead fixing the problems that have been found so far. In the first phase, programmers focused on developing new features; in the second, programmers are focused on fixing. By that reckoning, James reckons, the “testing phase” should be called the fixing phase. It seems to me that if we took James’ suggestion seriously, it might change the nature of some of the questions that are often asked in a development project. Replace the word “test” with the word “fix”: “How long are you going to need to fix this product?” “When is fixing going to be done?” “Can’t we just automate the fixing?” “Shouldn’t fixing get involved early in the project?” “Why was that feature broken when the customer got it? Didn’t you fix it?” And when we ask those questions, should we be asking the testers?

As James also points out, no one ever held up the release or deployment of a product because there was more testing to be done. Products are delayed because of a present concern that there might be more development work to be done. Testing can stop as soon as product owners believe that they have sufficient information to accept the risk of shipping. If that’s so, the question for the testers “When are you going to be done testing?” translates into a question for the product owner: “When am I going to believe that I have sufficient technical information to inform a risk-based business decision?” At that point, the product owner should—appropriately—be skeptical about anyone else’s determination that they are “done” testing.

Now, for a program manager, the “when do I have sufficient information” question might sound hard to answer. It is hard to answer. When I was a program manager for a commercial software company, it was impossible for me to answer before the information had been marshalled. Look at the variables involved in answering the question well: technical information, technical risk, test coverage, the quality of our models, the quality of our oracles, business information, business risk, the notion of sufficiency, decisiveness… Most of those variables must be accumulated and weighed and decided in the head of a single person—and that person isn’t the tester. That person is the product owner. The evaluation of those variables and the decision to ship are all in play from one moment to the next. The final state of the contributing variables and the final decision on when to ship are in the future. Asking the tester “When are you going to be done testing?” is like asking the eyes, “When are you going to be done seeing?” Eyes will continue to scan the surroundings, providing information in parallel with the other senses, until the brain decides upon a course of action. In a similar way, testers continue to test, generating information in parallel with the other members of the project community, until the product owner decides to ship the product. Neither the tester alone nor the eyes alone can answer the “when are you going to be done” question usefully; they’re not in charge. Until it makes a decision, the brain (optionally) takes in more data which the eyes and the other sense organs, by default, continue to supply. Those of us who have ogled the dessert table, or who have gone out on disastrous dates, know the consequences of letting our eyes make decisions for us. Moreover, if there is a problem, it’s not likely the eyes that will make the problem go away.

Some people believe that they can estimate when testing will be done by breaking down testing into measurable units, like test cases or test steps. To me, that’s like proposing “vision cases” or “vision steps”, which leads to our next question:

Can we estimate the duration of a “testing project” by counting “test cases” or “test steps”?

Recently I attended a conference presentation in which the speaker presented a method for estimating when testing would be completed. Essentially, it was a formula: break testing down into test cases, break test cases down into test steps, observe and time some test steps, average them out (or something) to find out how long a test step takes, and then multiply that time by the number of test steps. Voila! an estimate.

Only one small problem: there was no validity to the basis of the calculation. What is a test step? Is it a physical action? The speaker seemed to suggest that you can tell a tester has moved on to the next step when he performs another input action. Yet surely all input actions are not created equal. What counts as an input action? A mouse click? A mouse movement? The entry of some data into a field? Into a number of fields, followed by the press of an Enter key? Does the test step include an observation? Several observations? Evaluation? What happens when a human notices something odd and starts thinking? What happens when, in the middle of test execution, a tester recognizes a risk and decides to search for a related problem? What happens to the unit of measurement when a tester finds a problem, and begins to investigate and report it?

The speaker seemed to acknowledge the problem when she said that a step might take five seconds, or half a day. A margin of error of about 3000 to one per test step—the unit on which the estimate is based—would seem to jeopardize the validity of the estimate. Yet the margin of error, profound as it is, is orthogonal to a much bigger problem with this approach to estimation.
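To see what that spread does to the formula, here is a minimal Python sketch of the arithmetic. It assumes that "half a day" means four working hours, and the count of 1,000 test steps is purely hypothetical; neither figure comes from the presentation.

```python
# A sketch of the "count the steps and multiply" formula, using the two
# extremes the speaker mentioned for the duration of one test step.
FASTEST_STEP_SECONDS = 5
HALF_DAY_HOURS = 4  # assumption: half of an eight-hour working day
SLOWEST_STEP_SECONDS = HALF_DAY_HOURS * 3600

# Ratio between the slowest and fastest plausible step.
spread = SLOWEST_STEP_SECONDS // FASTEST_STEP_SECONDS


def estimate_range_hours(step_count):
    """Return (lowest, highest) total hours if every step ran at one extreme."""
    low = step_count * FASTEST_STEP_SECONDS / 3600
    high = step_count * SLOWEST_STEP_SECONDS / 3600
    return low, high


print(spread)  # 2880, i.e. roughly 3000 to one
low, high = estimate_range_hours(1000)  # 1000 steps is a hypothetical count
print(f"between {low:.1f} and {high:.1f} hours")
```

Under those assumptions the same 1,000 steps could take a little over an hour or about 4,000 hours, which is the margin of error carried through to the total.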

Excellent testing is not the monotonic or repetitive execution of scripted ideas. (That’s something that my community calls checking.) Instead, testing is an investigation of code, computers, people, value, risks, and the relationships between them. Investigation requires loops of exploration, experimentation, discovery, research, result interpretation, and learning. Variation and adaptation are essential to the process. Execution of a test often involves reflecting on what has just happened, backtracking over a set of steps, and then repeating or varying the steps while posing different questions or making observations. An investigation cannot follow a prescribed set of steps. Indeed, an investigation that follows a predetermined set of steps is not an investigation at all.

In an investigation, any question you ask—starting with the first—may yield an answer that completely derails your preconceptions. In an investigation, assumptions need to be surfaced, attacked, and refined. In an investigation, the answer to the most recent question may be far more relevant to the mission than anything that has gone before. If we want to investigate well, we cannot assume that the most critical risk has already been identified. If we want to investigate well, we can’t do it by rote. (If there are rote questions, let’s put them into low-level automated checks. And let’s do it skillfully.)

If we can’t estimate by counting test cases, how can we estimate how much time we’ll need for testing?

There are plenty of activities that don’t yield to piecework models because they are inseparable from the project in which they happen. In another of James Bach’s analogies, no one estimates the looking-out-the-window phase of driving an automobile journey. You can estimate the length of the journey, but looking out the window happens continuously, until the travellers have reached the destination. Indeed, looking out the window informs the driver’s evaluation of whether the journey is on track, and whether the destination has been reached. No one estimates the customer service phase of a hotel stay. You can estimate the length of the stay, but customer service (when it’s good) is available continuously until the visitor has left the hotel. For management purposes, customer service people (the front desk, the room cleaners) inform the observation that the visitor has left. No one estimates the “management phase” of a software development project. You can estimate how long development will take, but management (when it’s good) happens continuously until the product owner has decided to release the product. Observations and actions from managers (the development manager, the support manager, the documentation manager, and yes, the test manager) inform the product owner’s decision as to whether the product is ready to ship.

So it goes for testing. Test estimation becomes a problem only if one makes the mistake of treating testing as a separate activity or phase, rather than as an open-ended, ongoing investigation that continues throughout the project.

My manager says that I have to provide an estimate, so what do I do?

At the beginning of the project, we know very little relative to what we’ll know later. We can’t know everything we’ll need to know. We can’t know at the beginning of the project whether the product will meet its schedule without being visited by a Black Swan or a flock of Black Cygnets. So instead of thinking in terms of test estimation, try thinking in terms of strategy, logistics, negotiation, and refinement.

Our strategy is the set of ideas that guide our test design. Those ideas are informed by the project environment, or context; by the quality criteria that might be valued by users and other stakeholders; by the test coverage that we might wish to obtain; and by the test techniques that we might choose to apply. (See the Heuristic Test Strategy Model that we use in Rapid Testing as an example of a framework for developing a strategy.) Logistics is the set of ideas that guide our application of people, equipment, tools, and other resources to fulfill our strategy. Put strategy and logistics together and we’ve got a plan.

Since we’re working with—and, more importantly, for—a client, the client’s mission, schedule, and budget are central to choices on the elements of our strategy and logistics. Some of those choices may follow history or the current state of affairs. For example, many projects happen in shops that already have a roster of programmers and testers; many projects are extensions of an existing product or service. Sometimes project strategy ideas are based on projections or guesswork or hopes; for example, the product owner already has some idea of when she wants to ship the product. So we use whatever information is available to create a preliminary test plan. Our client may like our plan—and she may not. Either way, in an effective relationship, neither party can dictate the terms of service. Instead, we negotiate. Many of our preconceptions (and the client’s) will be invalid and will change as the project evolves. But that’s okay; the project environment, excellent testing, and a continuous flow of reporting and interaction will immediately start helping to reveal unwarranted assumptions and new risk ideas. If we treat testing as something that happens continuously with development, and if we view development in cycles that provide a kind of pulse for the project, we have opportunities to review and refine our plans.

So: instead of thinking about estimation of the “testing phase”, think about negotiation and refinement of your test strategy within the context of the overall project. That’s what happens anyway, isn’t it?

But my management loves estimates! Isn’t there something we can estimate?

Although it doesn’t make sense to estimate testing effort outside the context of the overall project, we can charter and estimate testing effort within a development cycle. The basic idea comes from Session Based Test Management, James and Jon Bach’s approach to planning, estimating, managing, and measuring exploratory testing in circumstances that require high levels of accountability. The key factors are:

  • time-boxed sessions of uninterrupted testing, ranging from 45 minutes to two hours and fifteen minutes, with the goal of making a normal session 90 minutes or so;
  • test coverage areas—typically functions or features of the product to which we would like to dedicate some testing time;
  • activities such as research, review, test design, data generation, toolsmithing, or retesting, to which we might also like to dedicate testing time;
  • charters, in the form of a one- to three-sentence mission statement that guides the session to focus on specific coverage areas and/or activities;
  • debriefings, in which a tester and a test lead or manager discuss the outcome of a session;
  • reviewable results, in the form of a session sheet that provides structure for the debrief, and that can be scanned and parsed by a Perl script; and, optionally,
  • a screen-capture recording of the session when detailed retrospective investigation or analysis might be needed;
  • metrics whose purposes are to determine how much time is spent on test design and execution (activities that yield test coverage) vs. bug investigation and reporting, and setup (activities that interrupt the generation of test coverage).
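The last of those factors, the time breakdown, can be sketched in a few lines of Python. This is only an illustration: the record layout, the category names `test`, `bug`, and `setup`, and the minute counts are all hypothetical, and this is not the actual session sheet format or the Perl script mentioned above.

```python
from collections import Counter

# Hypothetical per-session records: minutes spent on test design and
# execution ("test"), bug investigation and reporting ("bug"), and setup.
# The layout and the numbers are invented for illustration only.
sessions = [
    {"test": 60, "bug": 20, "setup": 10},
    {"test": 50, "bug": 30, "setup": 10},
]


def time_breakdown(session_records):
    """Return each category's share of the total recorded session time."""
    totals = Counter()
    for record in session_records:
        totals.update(record)  # adds this session's minutes per category
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("no session time recorded")
    return {category: minutes / grand_total
            for category, minutes in totals.items()}


for category, share in time_breakdown(sessions).items():
    print(f"{category}: {share:.0%}")
```

The shares are reported as proportions of total session time rather than as minutes, since it is the ratio of coverage-producing work to interruptions that matters here.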

The timebox provides a structure intended to make estimation and accounting for time fairly imprecise, but reasonably accurate. (What’s the difference? As I write, the time and date is 9:43:02.1872 in the morning, January 23, 1953. That’s a very precise reckoning of the time and date, but it’s completely inaccurate.)

Let’s also assume that a development cycle is two weeks, or ten working days—the length of a typical agile iteration. Let’s assume that we have four testers on the team, and that each tester can accomplish three sessions of work per day (meetings, e-mail, breaks, conversations, and other non-session activities take up the rest of the time).

ten days * four testers * three sessions = 120 sessions

Let’s assume further that sessions cannot be completely effective, in that test design and execution will be interrupted by setup and bug investigation. Suppose that we reckon 10% of the time spent on setup, and 25% of the time spent on investigating and reporting bugs. That’s 35% in total; for convenience, let’s call it 1/3 of the time.

120 sessions – 120 * 1/3 interruption time = 80 sessions

Thus in our two-week iteration we estimate that we have time for 80 focused, targeted, effective, idealized sessions of test coverage, embedded in 120 actual sessions of testing. Again, this is not a precise figure; it couldn’t possibly be. If our designers and programmers have done very well in a particular area, we won’t find lots of bugs and our effective coverage per session will go up. If setup is in some way lacking, we may find that interruptions account for more than one-third of the time, which means that our effective coverage will be reduced, or that we have to allocate more sessions to obtain the same coverage. So as soon as we start obtaining information about what actually went on in the sessions, we feed that information back into the estimation. I wrote extensively about that here.
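The arithmetic above is simple enough to sketch in Python, which also makes it easy to rerun once real interruption figures come back from the sessions. The function name and parameters are my own for illustration; only the numbers (ten days, four testers, three sessions a day, one-third interruptions) come from the example.

```python
from fractions import Fraction


def effective_sessions(days, testers, sessions_per_day, interruption_fraction):
    """Return (actual, effective) session counts for one development cycle.

    interruption_fraction is the share of session time lost to setup and
    to bug investigation and reporting.
    """
    actual = days * testers * sessions_per_day
    effective = actual * (1 - interruption_fraction)
    return actual, effective


# The figures from the example: 35% interruptions, rounded to one-third.
actual, effective = effective_sessions(10, 4, 3, Fraction(1, 3))
print(actual, effective)  # 120 80
```

Fractions are used only so that one-third stays exact; feeding in an observed figure such as `Fraction(35, 100)` shows how quickly the effective count moves when the interruption share changes.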

The metrics on interruptions could be fascinating and actionable information for managers. But note that the metrics on their own are not conclusive. They can’t be. Instead, they inform questions. Why has there been more bug investigation than we expected? Are there more problems than we anticipated, or are testers spending too much time investigating before consulting with the programmers? Is setup taking longer than it should, such that customers will have setup problems too? Even if the setup problems will be experienced only in testing, are there ways to make setup more rapid so that we can spend more time on test coverage? The real value of any metrics is in the questions they raise, rather than in the answers they give.

There’s an alternative approach, for those who want to estimate the duration or staffing for a test cycle: set the desired amount of coverage, and apply the fixed variables and calculate for the free ones. Break the product down into areas, and assign some desired number of sessions to each based on risk, scope, complexity, or any combination of factors you choose. Based on prior experience or even on a guess, adjust for interruptions and effectiveness. If you know the number of testers, you can figure the amount of time required; if you want to set the amount of time, you can calculate for the number of testers required. This provides you with a quick estimate.

Which, of course, you should immediately distrust. What influence do tester experience and skill have on your estimate? On the eventual reality? If you’re thinking of adding testers, can you avoid banging into Brooks’ Law? Are your notions of risk static? Are they valid? And so forth. Estimation done well should provoke a large number of questions. Not to worry; actual testing will inform the answers to those questions.

Wait a second. We paid a lot of money for an expensive test management tool, and we sent all of our people to a one-week course on test estimation, and we now spend several weeks preparing our estimates. And since we started with all that, our estimates have come out really accurate.

If experience tells us anything, it should tell us that we should be suspicious of any person or process that claims to predict the future reliably. Such claims tend to be fulfilled via the Ludic Fallacy and the narrative bias, central pillars of the philosophy of The Black Swan. Since we already have an answer to the question “When are we going to be done?”, we have the opportunity (and often the mandate) to turn an estimate into a self-fulfilling prophecy. Jerry Weinberg’s Zeroth Law of Quality (“If you ignore quality, you can meet any other requirement”) is a special case of my own, more general Zeroth Law of Wish Fulfillment: “If you ignore some factors, you can achieve anything you like.” If your estimates always match reality, what assumptions and observations have you jettisoned in order to make reality fit the estimate? And if you’re spending weeks on estimation, might that time be better spent on testing?

Project Estimation and Black Swans (Part 4)

Monday, October 25th, 2010

Over the last few posts, exploratory automation has suggested some interesting things about project dynamics and estimation. What might we learn from these little mathematical experiments?

The first thing we need to do is to emphasize the fact that we’re playing with numbers here. This exercise can’t offer any real construct validity, since an arbitrary chunk of time combined with a roll of the dice doesn’t match software development in all of its complex, messy, human glory. In a way, though, that doesn’t matter too much, since the goal of this exercise isn’t to prove anything in particular, but rather to raise interesting questions and to offer suggestions or hints about where we might look next.

The mathematics appears to support an idea touted over and over by Agile enthusiasts, humanists, and systems thinkers alike: make feedback rapid and frequent. The suggestion we might take from the last model—fewer tasks and shorter projects—is that the shorter and better-managed the project, the less the Black Swan has a chance to hurt you in any given project.

Another plausible idea that comes from the math is to avoid projects where the power-law distribution applies—projects where you’re vulnerable to Wasted Mornings and Lost Days. Stay away from projects in Taleb’s Fourth Quadrant, projects that contain high-impact, high-uncertainty tasks. To the greatest degree possible, stick with things that are reasonably predictable, so that the statistics of random and unpredicted events don’t wallop us quite so often. Stay within the realm of the known, “in Mediocristan” as Taleb would say. Head for the next island, rather than trying to navigate too far over the current horizon.

In all that, there’s a caveat. It is of the essence of a Black Swan (or even a Black Cygnet) that it’s unpredicted and unpredictable. Ironically, the more successful we are at reducing uncertainty, the less often we’ll encounter rare events. The rarer the event, the less we know about it—and therefore, the less we’re aware of the range of its potential consequences. The less we know about the consequences, the less likely we are to know about how to manage them—certainly the less specifically we know how to manage them. In short, the more rare the event, the less information and experience we’ll have to help us to deal with it. One implication of this is that our Black Cygnets, in addition to adding time, have a chance of screwing up other things in ways that we don’t expect.

Some people would suggest that we eliminate variability and uncertainty and unpredictability. What a nice idea! By definition, uncertainty is the state of not knowing something; by definition, something that’s unpredictable can’t be predicted. Snowstorms happen (even in Britain!). Servers go down. Power cuts happen in India on a regular basis—on my last visit to India, I experienced three during class time, and three more in the evening in a two-day stay at a business class hotel. In North America, power cuts happen too—and because we’re not used to them, we aren’t prepared to deal with them. (To us they’re Black Swans, where to people who live in India, they’re Grey Swans.) Executives announce all-hands meetings, sometimes with dire messages. Computers crash. Post-It notes get jammed in the backup tape drive. People get sick, and if they’re healthy, their kids get sick. Trains are delayed. Bicycles get flat tires. And bugs are, by their nature, unpredicted.

So: we can’t predict the unpredictable. There is a viable alternative, though: we can expect the unpredictable, anticipate it to some degree, manage it as best we can, and learn from the experience. Embracing the unpredictable reminds me of The Fundamental Regulator Paradox, from Jerry and Dani Weinberg’s General Principles of System Design, which I’ve referred to before:

The task of a regulator is to eliminate variation, but this variation is the ultimate source of information about the quality of its work. Therefore, the better job a regulator does, the less information it gets about how to improve.

This suggests to me that, at least to a certain degree, we shouldn’t make our estimates too precise, our commitments too rigid, our processes too strict, our roles too closed, and our vision of the future too clear. When we do that, we reduce the flow of information coming in from outside the system, and that means that the system doesn’t develop an important quality: adaptability.

When I attended Jerry Weinberg’s Problem Solving Leadership workshop (PSL), one of the groups didn’t do so well on one of the problem-solving exercises. During the debrief, Jerry asked, “Why did you have such a problem with that? You handled a much harder problem yesterday.”

“The complexity of the problem screwed us up,” someone answered.

Jerry peered over the top of his glasses. He replied, “Your reaction to the complexity of the problem screwed you up.”

One of the great virtues of PSL is that it exposes you to lots of problems in a highly fault-tolerant environment. You get practice at dealing with surprises and behaviours that emerge from giving a group of people a moderately complex task, under conditions of uncertainty and time pressure. You get an opportunity to reflect on what happened, and you learn what you need to learn. That’s the intention of the Rapid Software Testing class, too: to expose people to problems, puzzles, and traps; to give people practice in recognizing and evading traps where possible; and to help them deal with problems effectively.

As Jerry has frequently pointed out, plenty of organizations fall victim to bad luck, but much of the time, it’s not the bad luck that does them in; it’s how they react to the bad luck. A lot of organizations pillory themselves when they fail to foster environments in which everyone is empowered to solve problems. That leaves problem-solving in the hands of individuals, typically people with the title of “manager”. Yet at the moment a problem is recognized, the manager may not be available, or may not be the best person to deal with the problem. So, another reason that estimation fails is that organizations and individuals are not prepared or empowered to deal—mentally, politically, and emotionally—with surprises. The ensuing chaos and panic leaves them more vulnerable to Black Swans.

Next time, we’ll look at what all of this means for testing specifically, and for test estimation.

Challenges and Legibility

Thursday, October 14th, 2010

Lately, James Bach and I have been issuing challenges to some of our colleagues on Twitter, typically based on something they’ve said or observed. I think James would agree that the results have been very exciting. In our community, people build credibility by responding to challenges and probing the issues more deeply, and it’s been tremendous to see how some of them have risen to the challenge. For me, recent examples include Joe Harter and his response to the question “Why keep testing when we’ve got a swarm of bugs?”; and David O’Dowd and his recent tweets on how to address disagreement over the “right” temperature for a cup of coffee. It goes both ways, of course: we expect other people to challenge us, too. That’s how we test ideas.

Recently, James turned me on to an interesting Web site, authored by a fellow named Venkatesh Rao, and in particular to this blog post. I was very excited by the concept of legibility, making things more readable in a metaphoric sense, more understandable. To me, legibility is a powerful idea because it seems to explain a central conundrum in testing and in the management of software development: a good deal of the effort that we spend, so it seems, is not in producing better stuff, but rather in attempting to make complex stuff more understandable. One approach to understanding complexity is to take the general systems view, and model the system of interest in terms of other, simpler systems, and look at the aspects of elements, relationships, control, feedback, and effects, and the relationships between all of these. Another approach is to close your eyes to the complexity (as French governments and tax collectors tried to do in the 1800s) and pay attention only to a couple of specific elements in the model. Yet another approach, often used by large organizations and bureaucracies such as nation states, is a wholesale attempt to make the system more legible by eliminating the complexity by eliminating elements (as Prussian forest managers did in the 19th century, or as the builders of Brasilia did in the 20th).

I ordered the book Seeing Like a State to which Venkatesh refers, and I’m finding it interesting. More on that later, perhaps.

Before I ordered the book, though, I thought the idea of legibility would be of interest to a general systems thinker, so I sent a link along to Jerry Weinberg. He surprised me a little by replying,

“Well, yes, but it’s a far over-simplified vision itself. For instance, it doesn’t seem to account for why the “recipe” actually succeeds (value to some persons or groups). Think it through.”

Here’s my reply:

[quote]

Thank you for the challenge. Let me see if I can answer it.

I think it does account for why the “recipe” actually succeeds, although it may gloss over the point somewhat.

  • Success is subject to the Relative Rule. (As I described in my chapter of The Gift of Time, the Relative Rule states that “for any abstract X, X is X to some person”.) That is, success is success to some person(s).
  • Success is measured by some persons at some time (a refinement of the Relative Rule that I identified and that Markus Gartner seized on). Any determination of success (at some time and for some purpose) is like observing the part of the curve that looks linear. We can’t say we’ve achieved the end result because a) not all the data is in yet, and b) as I’ve heard you say on a number of occasions, “nothing is ever settled”. (I think I’d like to call this The Unsettling Rule.)
  • Similarly, “complexity”, “reality”, “irrationality”, “orderliness”, “legibility”, etc. are all subject to the Relative Rule and the Unsettling Rule too. When Venkatesh says, “The big mistake in this pattern of failure is projecting your subjective lack of comprehension onto the object you are looking at, as ‘irrationality’”, that reminds me of your (Jerry’s) advice in the SHAPE Forum many years back: stop looking at it as “irrational”, and start looking at it as “rational from the perspective of a different set of values”.
  • Says Venkatesh, “This failure mode is ideology-neutral, since it arises from a flawed pattern of reasoning rather than values.” Well, that’s all very well, but you can’t have the concept of “a flawed pattern of reasoning” without imposing a value judgement.
  • By making something more legible, you might have a short-term effect that you consider negative, but which gives rise to a more “positive” long-term effect. For example, in the old days, anyone could cut down trees pretty much anywhere they liked. These days we seem to have a stricter sense of preserving some kinds of land so as not to be interfered with by the forestry business, and using other kinds of land for what is effectively tree farming. “Legibility” is always in flux.
  • “Rational and unlivable grid-cities like Brasilia, versus chaotic and alive cities like Sao Paolo.” Yeah, but I’ve heard about problems in Sao Paolo, and I’m not convinced that Brasilia is less livable than Sao Paolo, based on those problems.

I could go on… but have I shown you some evidence of thinking it through?

[/quote]

Jerry’s response was,

“Well done.

You’ve got another blog post there, I think.”

So here is that blog post.

In his challenge to me, Jerry was encouraging me (and, by extension, Venkatesh) to think about things in a more complex and nuanced way. For me, the key lesson is to remember that whatever you see as “broken” is almost certainly working for someone. That person, being different from you, is to some degree looking at everything from the perspective of a different set of values. When you see a problem in a product, or organization, or system, addressing that problem is going to take some effort for someone, and that person might see neither the problem, nor its cost, nor the value of change as you see it. That person might have political authority over the situation, and like all people, that person is driven not only by rationality but also by emotion. That person might not even see you.

For example, as a tester, when you say that a product has “too many bugs”, it’s important to ask, “Too many compared to what?” “Too many for whom?” “Too many according to whom?” “Too many to meet what goal?” That’s one of the reasons that test framing is so important: your testing won’t be valued if it’s not congruent with the mission, whether implicit or explicit, that your client has in mind.

Now, having to deal with all this uncertainty and subjectivity might require us to give up an idealist Platonic sense of Goodness and Order and Godliness, and might force us to deal with messy, complex, and human concerns. But considering that we all have to live with each other, and that “ideal” is only ideal to some person, at some time, that might be a good thing.

Thank you to Jerry for his persistent, patient reminders.

Why We Do Scenario Testing

Saturday, May 1st, 2010

Last night I booked a hotel room using a Web-based discount travel service. The service’s particular shtick is that, in exchange for a heavy discount, you don’t get to know the name of the airline, hotel, or car company until you pay for the reservation. (Apparently the vendors are loath to admit that they’re offering these huge discounts—until they’ve received the cash; then they’re okay with the secret getting out.) When you’re booking a hotel, the service reveals the general location and the amenities. I made a choice that looked reasonable to me, and charged it to my credit card.

I had screwed up. When I got the confirmation, I noticed that I had booked for one night, when I should have booked for two. I wanted to extend my stay, but when I went back to the Web page, I couldn’t be sure that I was booking the same hotel. The names of the hotels are hidden, and I knew that the rates might change from night to night. One can obtain clues by looking at the amenities and the general location of the hotels, but I wanted to be sure. So instead of booking online, I called the travel service’s 1-800 number.

Jim answered the phone sympathetically. It turns out that not even the employees of the service can see the hotel name before a booking is made. However, this was a familiar problem to him, so it seemed, and he told me that he’d match the hotel by location and amenities, back out the first credit-card transaction for one night, and charge me for a new transaction of two nights. He managed to book the same hotel. So far so good.

I went to the hotel and checked in. The woman behind the counter asked for identification and a credit card for extras, and then she asked me, “How many keys will you be needing tonight, sir?” “Just one”, I said. She put a single key card into the electronic key programming machine, and handed the card to me. I took the elevator up to room 761, which had a comfortable bed and a desk with a window behind it that offered a nice view. I unpacked some of my things, and decided to go for a dip in the hot tub. When I came back upstairs, I changed into dry clothes, took out my laptop, plugged it in, and sat down at the desk.

The floor was shaking. I mean, it was really vibrating. Some big motor—an air-conditioning compressor? a water pump?—had turned the office chair into a massager. I stood up, and it seemed that half of the room, including the bed, was shaking. I tried to do a little work, but the vibration was enormously distracting. I called down to the front desk.

Peter answered the phone sympathetically. “I’ll send someone right up to check it out,” he said. Fair enough, but this problem was unlikely to go away any time soon, and until it did, I wanted another room. “No worries,” said Peter. “I’ll start the process now, and send someone up to check out the problem. Then you can come downstairs to exchange your key.” (“Why not send the new key up with the person coming upstairs?” I thought, but I didn’t say anything.) “I’ll need a few minutes to tidy up,” I said. “Very well, sir,” said Peter. I repacked my bags. A few minutes later, the phone rang, and Peter asked if I was ready for the staff member to arrive.  Yes.

After a short time, someone knocked on the door. He had a pair of new keys (two, not one), which he passed to me. He appeared skeptical at first, but I sat him down in the desk chair. “Oh, now I feel it,” he said. “Stand over here, next to the bed,” I said. He got up, moved over, and felt the shaking. “Wow,” he said. We chatted for a few more moments, speculating on where the shaking was coming from. He left to investigate, and I decamped to my new room, 1021, on another floor on the other side of the building. So far so good.

This morning on my way to the shower, I noticed that a piece of paper had been slipped under the door. It was the checkout statement for my stay, noting my arrival and departure dates and the various charges that had been made to my credit card, including state sales tax, county tax, and a service fee for Internet use. I noticed that the checkout date and time was this morning, but I’m not supposed to be leaving until tomorrow morning. I called the front desk.

Zhong-li answered the phone sympathetically. I explained the situation, noting that I had booked through a travel service twice, once for one night and then later for two, and that the first booking should have been backed out (but maybe the service hadn’t done that), plus I had changed rooms the night before, so maybe it was an issue with the service but maybe it was an issue with the hotel’s own system too. Or maybe it was only the hotel. “No problem,” he said. “We can extend your stay for another night. But you’ll have to come downstairs at some point today so that we can re-author your room keys.”

So here’s the thing: how many variables can you see here? How many interconnected systems? How many different hardware platforms are involved? What protocols do they use to communicate?  To create, read, update, and delete? What are the overall transactions here?  What are the atomic elements of each one?  How does each transaction influence others?  How is each influenced by others?  What are the chances that everything is going to work right, and that I will neither underpay nor overpay?  What are the chances that the travel service will overpay (or underpay) the hotel for my stay, even if my credit card shows the appropriate entries and reversals?

It’s not even a terribly complicated story, but look at how many subtleties there are to the scenario. Have you ever seen a user story that has the richness and complexity of even this relatively simple little story? And yet, if we pay attention, aren’t there lots of stories like mine every day? Does my story, long as it is, include everything that we’d need to program or test the scenario? Does the card below include everything?

[Image: index card]

Next question: if you want to create automated acceptance tests, do you want a scenario like this to be static, using record and playback to lock in on checking specific values in specific fields? Are we really going to get value from the story if we use the same data and the same outputs over and over again? This approach will be hard enough to program, but it will tend to be very brittle, resistant to change and variation. It will tend to miss details in the scenario that we would only learn about through repeated human interaction with the product.

Or would you prefer to have a flexible framework that allows you to explore and vary the scenario, designing and acting upon new test ideas, and observing the flow of each piece of data through each interconnected system? Might you be able to do this by exploiting testing tools that you’ve developed for the lower levels of the system and assembling them into progressively more powerful suites? This second approach will likely be even harder to program, although you might be able to take advantage of lower-level test APIs, probes, and data generators that you and the programmers have developed as you’ve gone along. This approach, though, will tend to be far more powerful and robust to change, to learning, and to incorporating new and varied test ideas. Think well, and choose wisely.
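To make the contrast a little more concrete, here is a minimal sketch in Python of what the second approach might look like for the billing part of my hotel story. Everything in it is an assumption made for illustration: TravelService, Booking, run_scenario, and explore are hypothetical names, and the toy ledger stands in for systems I have never seen. The idea is only that the scenario gets varied on each run (rate, number of nights, order of the reversal and the rebooking) and that the check is a relationship between values (the customer ends up paying for exactly the corrected booking) rather than a recorded value in a specific field.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Booking:
    booking_id: int
    nights: int
    nightly_rate: int  # whole currency units, to sidestep float rounding
    is_reversed: bool = False


@dataclass
class TravelService:
    """Hypothetical stand-in for the travel service's booking system."""
    bookings: list = field(default_factory=list)
    card_ledger: list = field(default_factory=list)  # signed card entries

    def book(self, nights, nightly_rate):
        # Record the booking and charge the card for the full stay.
        booking = Booking(len(self.bookings) + 1, nights, nightly_rate)
        self.bookings.append(booking)
        self.card_ledger.append(nights * nightly_rate)
        return booking

    def reverse(self, booking):
        # Back out a booking by posting an offsetting card entry.
        booking.is_reversed = True
        self.card_ledger.append(-booking.nights * booking.nightly_rate)

    def net_charge(self):
        return sum(self.card_ledger)


def run_scenario(rng):
    """Book the wrong stay, then correct it, varying data and step order."""
    service = TravelService()
    rate = rng.randint(50, 400)
    wrong_nights = rng.randint(1, 5)
    right_nights = rng.randint(1, 5)

    first = service.book(wrong_nights, rate)
    if rng.choice([True, False]):
        # Variation 1: reverse the mistaken booking, then rebook.
        service.reverse(first)
        service.book(right_nights, rate)
    else:
        # Variation 2: rebook first, then reverse the mistaken booking.
        service.book(right_nights, rate)
        service.reverse(first)

    # The customer should end up paying only for the corrected stay.
    return service.net_charge(), right_nights * rate


def explore(runs=100, seed=2010):
    """Run many varied scenarios; return the (actual, expected) mismatches."""
    rng = random.Random(seed)
    failures = []
    for _ in range(runs):
        actual, expected = run_scenario(rng)
        if actual != expected:
            failures.append((actual, expected))
    return failures
```

Calling explore() runs a hundred variations from a fixed seed, so that any surprise can be replayed. Note how much of the story the sketch leaves out: the hotel’s own system, the room change, the key cards, the payment from the travel service to the hotel. A tool like this can check what we thought to model; it takes a person to notice what we didn’t.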

In either case, unless you have people exploring and interacting with the product and the story directly, I guarantee you will miss important points in the story and you’ll miss important problems in the product.  Your tools, as helpful as they are, won’t ever pause and say, “What if…?” or “I wonder…” or “That’s funny…” You’ll need people to exercise skill, judgment, imagination, and interaction with the system, not in a linear set of prescribed steps but in a thoughtful, inventive, risk-focused, and variable set of interactions.

In either case, you’ll also have a choice as to how to account for what you’re doing.  It’s one scenario, but is it only one test?  Is it dozens of tests?  Thousands?  If you use the second framework and induce variation, what does that do for your test count?  Or would it be better to report your work in an entirely different way, reporting on risks and test ideas and test activities, rather than try to quantify a complex intellectual interaction by using meaningless, quantitatively invalid units like “test cases” or “test steps”?

It’s been a while since I’ve posted this, but it’s time to do it again. This passage comes from a book on programming and on testing, written by Herbert Leeds and Gerald M. Weinberg (Jerry wrote this passage, he says). It’s understandable that people haven’t got the point yet, since the book is relatively new: it came out only 49 years ago (in 1961).  The emphases are mine.

“One of the lessons to be learned … is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.”

Amen.