Blog Posts for the ‘Measurement’ Category

Counting the Wagons

Monday, December 30th, 2013

A member of LinkedIn asks if “a test case can have multiple scenarios”. The question and the comments reinforce, for me, just how unhelpful the notion of the “test case” is.

Since I was a tiny kid, I’ve watched trains go by—waiting at level crossings, dashing to the window of my Grade Three classroom, or being dragged by my mother’s grandchildren to the balcony of her apartment, perched above a major train line that goes right through the centre of Toronto. I’ve always counted the cars (or wagons, to save us some confusion later on). As a kid, it was fun to see how long the train was (were there more than a hundred wagons?!). As a parent, it was a way to get the kids to practice counting while waiting for the train to pass and the crossing gates to lift.

[Figure: a train]

Often the wagons are flatbeds, loaded with shipping containers or the trailers from trucks. Others are enclosed, but when I look through the screening, they seem to be carrying other vehicles—automobiles or pickup trucks. Some of the wagons are traditional boxcars. Other wagons are designed to carry liquids or gases, or grain, or gravel. Sometimes I imagine that I could learn something about the economy or the transportation business if I knew what the trains were actually carrying. But in reality, after I’ve counted them, I don’t know anything significant about the contents or their value. I know a number, but I don’t know the story. That’s important when a single car could have explosive implications, as in another memory from my youth.

A test case is like a railway wagon. It’s a container for other things, some of which have important implications and some of which don’t, some of which may be valuable, and some of which may be other containers. Like railway wagons, the contents—the cargo, and not the containers—are the really interesting and important parts. And like railway wagons, you can’t tell much about the contents without more information. Indeed, most of the time, you can’t tell from the outside whether you’re looking at something full, empty, or in between; something valuable or nothing at all; something ordinary and mundane, or something complex, expensive, or explosive. You can surely count the wagons—a kid can do that—but what do you know about the train and what it’s carrying?

To me, a test case is “a question that someone would like to ask (and presumably answer) about a program”. There’s nothing wrong with using “test case” as shorthand for the expression in quotes. We risk trouble, though, when we start to forget some important things.

  • Apparently simple questions may contain or imply multiple, complex, context-dependent questions.
  • Questions may have more outcomes than binary, yes-or-no, pass-or-fail, green-or-red answers. Simple questions can lead to complex answers with complex implications—not just a bit, but a story.
  • Both questions and answers can have multiple interpretations.
  • Different people will value different questions and answers in different ways.
  • For any given question, there may be many different ways to obtain an answer.
  • Answers can have multiple nuances and explanations.
  • Given a set of possible answers, many people will choose to provide a pleasant answer over an unpleasant one, especially when someone is under pressure.
  • The number of questions (or answers) we have tells us nothing about their relevance or value.
  • Most importantly: excellent testing of a product means asking questions that prompt discovery, rather than answering questions that confirm what we believe or hope.

Testing is an investigation in which we learn about the product we’ve got, so that our clients can make decisions about whether it’s the product they want. Other investigative disciplines don’t model things in terms of “cases”. Newspaper reporters don’t frame their questions in terms of “story cases”. Historians don’t write “history cases”. Even the most reductionist scientists talk about experiments, not “experiment cases”.

Why the fascination with modeling testing in terms of test cases? I suspect it’s because people have a hard time describing testing work qualitatively, as the complex cognitive activity that it is. These are often people whose minds are blown when we try to establish a distinction between testing and checking. Treating testing in terms of test cases, piecework, units of production, simplifies things for those who are disinclined to confront the complexity, and who prefer to think of testing as checking at the end of an assembly line, rather than as an ongoing, adaptive investigation. Test cases are easy to count, which in turn makes it easy to express testing work in a quantitative way. But as with trains, fixating on the containers doesn’t tell you anything about what’s in them, or about anything else that might be going on.


As an alternative to thinking in terms of test cases, try thinking in terms of coverage. Here are links to some further reading:

  • Got You Covered: Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.
  • Cover or Discover: Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.
  • A Map By Any Other Name: A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.
  • “What Counts”, an article that I wrote for Better Software magazine, on problems with counting things.
  • “Braiding the Stories” and “Delivering the News”, two blog posts on describing testing qualitatively.
  • My colleague James Bach has a presentation on the case against test cases.
  • Apropos of the reference to “scenarios” in the original thread, Cem Kaner has at least two valuable discussions of scenario testing, as tutorial notes and as an article.

Where Does All That Time Go?

Tuesday, October 30th, 2012

It had been a long day, so a few of the fellows from the class agreed to meet at a restaurant downtown. The main courses had been cleared off the table, some beer had been delivered, and we were waiting for dessert. Pedro (not his real name) was complaining, again, about how much time he had to spend doing administrivial tasks—meetings, filling out forms, time sheets, requisitions, and the like. “Everything takes so long. I want a pad of paper to take notes, I have to fill out a form for it. God help me if I run out of forms!”

“How much time do you spend on this kind of stuff each week?” I asked.

Pedro replied, “An hour a day. Maybe two, some days. Meetings…let’s say an hour and a half, on average.”

Wow, I thought—that’s a pretty good chunk of the week. I had an idea.

“Let’s visualize this,” I said. I took out my trusty Moleskine notebook. I prefer the version with the graph paper in it, for occasions just like this one. I outlined a grid, 20 squares across by two down.

[Figure: Empty Week]

“So you spend, on average, an hour and a half each day on compliance stuff. One-point-five times five, or 7.5 hours a week. Let’s make it eight. Put a C in eight squares.” He did that.

[Figure: Compliance]

“Okay,” I said. “You were griping today about how much time you spend wrestling with your test environments.”

Pedro’s eyes lit up. “Yes!” he said. “That’s the big one. See, it’s mobile stuff. We have a server component and a handset component to what we do, and the server stuff is a real bear.”

“Tell me more.”

“It’s a big deal. We’ve got one environment that models the production system. The software we’re developing has been so buggy that we can’t tell whether a given problem is general, or specific to the handset, so we have another one that we set up to do targeted testing every time we add support for a new handset. That’s the one I work with. Trouble is, setting it up takes ages and it’s really finicky. I have to do everything really carefully. I’ve asked for time to do scripting to automate some of it, but they won’t give that to me, because they’re always in such a rush. So, I do it by hand. It’s buggy, and I make the odd mistake. Either way, when I find out it doesn’t work, I have to troubleshoot it. That means I have to get on instant messaging or the phone to the developers, and figure out what’s wrong; then I have to figure out where to roll back to. And usually that’s right from the start. It wastes hours. And it’s every day.”

“Okay. Show me that on our little table, here. Use an S to represent each hour you spend each day.”

Whereupon Pedro proceeded to fill in squares. Ten of them. Ten more. And then, eight more.

[Figure: Setup]

“Really?!” I said. “28 hours a week divided by five days—that’s more than five hours a day. Seriously?”

“Totally,” said Pedro. “It’s most of the day, every day, honestly. Never mind the tedium. What’s really killing me is that I don’t feel like I’m getting any real testing work done.”

“No kidding. There’s no time for it. There are only four squares left in the week. Plus, something you said earlier today about tons of bugs that aren’t related to setting up?”

“Right. When it comes to the stuff that I’m actually being asked to test, there’s lots of bugs there too. So my ‘testing time’ isn’t really testing. It’s mostly taken up with trying to reproduce and document the bugs.”

“Yes. In session-based test management, that’s bug investigation and reporting—B-time. And it does interrupt test design and execution—T-time—which is what produces actual test coverage, learning about what’s actually going on in the product. So, how much B-time?” He filled in three of the squares with Bs.

[Figure: Bug Investigation and Reporting]

“And T-time?”

He had room left to put in one lonely little T in the lower right corner.

[Figure: Testing Time]

“Wow,” I laughed. “One-fortieth of your whole week is spent in getting actual test coverage. The rest is all overhead. Have you told them how it affects you?”

“I’ve mentioned it,” he said.

“So look at this,” I suggested. “It’s even more clear when we use colour for emphasis.”

[Figure: With Colour]

“Whoa. I never looked at it that way. And then,” he paused. “Then they ask me, ‘Why didn’t you find that bug?’”

“Well,” I said, “considering the illusion they’re probably working under, it’s not an unreasonable question.”

“What do you mean?” Pedro asked.

“What does it say on your business card?”

“‘Software Testing’.”

“And what does it say on the door of the test lab?”

“‘Test Lab’,” said Pedro.

“And they call you…?”

“Pedro.”

“No,” I laughed. “They say you’re a… what?”

“Oh. A tester.”

“So since you’re a tester, and since the door on the test lab says ‘Test Lab’, and your business card says ‘Testing’, they figure that’s all you do. The illusion is what Jerry Weinberg calls the Lumping Problem. All of those different activities—administrative compliance, setup, bug investigation and reporting, and test design and execution—are lumped into a single idea for them.” And I drew it for him.

[Figure: Management's Dream]

“That’s management’s illusion, there. Since, in their imagination, you’ve got forty hours of testing time in a week, it’s not unreasonable for them to wonder why you didn’t find that bug.”

“Hmmm. Right,” said Pedro.

“When in fact, what they’re getting from you is this.” And I drew it for him.

[Figure: Testing Reality]

“For testing—actual interaction with the product, looking for problems—you’ve got one-fortieth of the time they think you’ve got. One lonely little T. Is that part of your test report?”

“Oy,” he said. “Maybe I should show them something like this.”

“Maybe you should,” I said.

A couple of nights later, I showed that page of my notebook to James Bach over Skype. “Wow,” he said. “That guy could be forty times more productive!”

“Forty?”

“Well, no, not really, of course. But suppose the programmers checked their work a little more carefully, or suppose the testers practiced writing more concise bug reports and sharpened their investigating skill. One of those two things could cut the bug investigation time by a third. That would give more time for testing, when they’re not being interrupted by other stuff. What if they cut the setup time by a half, and that administrivia by half?”

“Four, fourteen…” I said. “That would give eighteen more hours for testing and bug investigation, for a total of 22 hours. And even if they’re still doing two hours of bug investigation for every one hour of testing time… well, that’s seven times more productive, at least.”

“Seven times the test coverage if they get some of those issues worked out, then,” said James.

“Maybe de-lumping is the kind of thing lots of testers would want to do in their test reports,” I said.

How about you?
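If you'd like to draw that kind of picture for your own week, it doesn't take much. Here is a minimal sketch in Python that renders a de-lumped 40-hour week as a grid like the one in my notebook; the hours are Pedro's from the story above, and the category labels and the 20-squares-across layout are just the conventions I used, not a prescribed format.

```python
# A sketch of Pedro's notebook exercise: de-lump a 40-hour week into
# compliance (C), setup (S), bug investigation and reporting (B), and
# uninterrupted test design and execution (T). Substitute your own hours.

hours = {"C": 8, "S": 28, "B": 3, "T": 1}   # Pedro's week; must sum to 40

def week_grid(hours, columns=20):
    """Return the week as rows of single-letter squares, 20 across."""
    squares = [label for label, h in hours.items() for _ in range(h)]
    assert len(squares) == 40, "expected a 40-hour week"
    return ["".join(squares[i:i + columns]) for i in range(0, len(squares), columns)]

for row in week_grid(hours):
    print(" ".join(row))

t = hours["T"]
print(f"\nUninterrupted test design and execution: {t} hour(s) of 40 ({t / 40:.0%}).")
```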

Why Pass vs. Fail Rates Are Unethical (Test Reporting Part 1)

Thursday, February 23rd, 2012

Calculating a ratio of passing tests to failing tests is a relatively easy task. If it is used as a means of estimating the state of a development project, though, the ratio is invalid, irrelevant, and misleading. At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional.

A passing test is no guarantee that the product is working correctly or reliably. Instead, a passing test is an observation that the program appeared to work correctly, under some set of conditions that we were conscious of (and many that we weren’t), using a selection of specific inputs (and not using the rest of an essentially infinite set), at some time (to which we will never return), on some machine (that was in a particular state at that time; we observed and understood only a fraction of that state), based on a handful of things that we were looking at (and a boatload of things that we weren’t looking at, not that we’d have any idea where or how to look for everything). At best, a passing test is a rumour of success. Take any of the parameters above, change one bit, and we could have had a failing test instead.

Meanwhile, a failing test is no guarantee of a failure in the product we’re testing. Someone may have misunderstood a requirement, and turned that misunderstanding into an inappropriate test procedure. Someone may have understood the requirement comprehensively, and erred in establishing the test procedure; someone else may have erred in following it. The platform on which we’re testing may be misconfigured, or there may be something wrong with something on the system, such that our failing test points to that problem and is not an indicator of a problem in our product. If the test was being assisted by automation, perhaps there was a bug in the automation. Our test tools may be misconfigured such that they’re not doing what we think they’re doing. When generating data, we may have misclassified invalid data as valid, or vice versa, and not noticed it. We may have inadvertently entered the wrong data. The timing of the test may be off, such that the system was not ready for the input we provided. There may be a reason, not yet understood by us, why the product is providing a result that seems incorrect to us but is in fact correct. A failing test is an allegation of failure.

When we do the math based on these assumptions, the unit of measurement in which pass/fail rates are expressed is rumours over allegations. Is this a credible unit of measurement?

Neither rumours nor allegations are things. Uncertainties are not units with a valid natural scale against which they can be measured. One entity that we call a “test case”, whether passing or failing, may consist of a single operation, observation, and decision rule. Another entity called “test case” may consist of hundreds or thousands or millions of operations, all invisible, with thousands of opportunities for a tester to observe problems based not only on explicit knowledge, but also on tacit knowledge. Measuring while failing to account for clear differences between entities demolishes the construct validity of the measurement. Treating test cases—whether passing or failing—as though they were countable objects is a classic case of the reification fallacy. Aggregating scale-free, reified (non-)entities loses information about each instance, and loses information about any relationships between them. Some number of rumours doesn’t tell us anything about the meaning, significance, or value of any given passing test, nor does the aggregate tell us anything about the coverage that the passing tests provide, nor does the number tell us about missing coverage. Some number of allegations of which we’re aware doesn’t tell us anything about the seriousness of those allegations, nor does it tell us about undiscovered allegations. Dividing one invalid number by another invalid number doesn’t make the invalidity cancel out and produce a valid ratio.

When the student has got an answer wrong, and the student is misinformed, there’s a problem. What does the number of questions that the teacher asked have to do with it? When a manager interviews a candidate for a job, and halfway through the interview he suddenly starts shouting obscenities at her, will the number of questions the manager asked have anything to do with her hiring decision? If the battery on the Tesla Roadster is ever completely drained, the car turns into a brick with a $40,000 bill attached to it. Does anyone, anywhere, care about the number of passing tests that were done on the car?

If we are asked to produce pass/fail ratios, I would argue that it’s our professional responsibility to politely refuse to do it, and to explain why: we should not be offering our clients the service of self-deception and illusion, nor should our client accept those services. The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.

So what’s the alternative? We’ll get to that tomorrow.

One Test Per Requirement

Friday, February 4th, 2011

Despite all of the dragons that Agile approaches have attacked successfully, a few still live. As crazy as it is, the idea of one test (really, one check) per requirement has managed to survive in some quarters.

Let’s put aside the fact that neither tests nor requirements are valid units of measurement, and focus on this: If you believe that there should be one test per requirement, then you have to assume that each requirement can have only one problem with it. Is that a valid assumption?

Gaming the Tests

Monday, September 27th, 2010

Let’s imagine, for a second, that you had a political problem at work. Your CEO has promised his wife that their feckless son Ambrose, having flunked his university entrance exams, will be given a job at your firm this fall. Company policy is strict: in order to prevent charges of nepotism, anyone holding a job must be qualified for it. You know, from having met him at last year’s Christmas party, that Ambrose is (how to put this gently?) a couple of tomatoes short of a thick sauce. Yet the policy is explicit: every candidate must not only pass a multiple choice test, but must get every answer right. The standard number of correct answers required is (let’s say) 40.

So, the boss has a dilemma. He’s not completely out to lunch. He knows that Ambrose is (how can I say this?) not the sharpest razor in the barbershop. Yet the boss adamantly wants his son to get a job with the firm. At the same time, the boss doesn’t want to be seen to be violating his own policy. So he leaves it to you to solve the problem. And if you solve the problem, the boss lets you know subtly that you’ll get a handsome bonus. Equally subtly, he lets you know that if Ambrose doesn’t pass, your career path will be limited.

You ponder for a while, and you realize that, although you have to give Ambrose an exam, you have the authority to set the content and conditions of the exam. This gives you some possibilities.

A. You could give a multiple choice test in which all the answers were right. That way, anyone completing the test would get a perfect score.

B. You could give a multiple choice test for which the answers were easy to guess, but irrelevant to the work Ambrose would be asked to do. For example, you could include questions like, “What is the very bright object in the sky that rises in the morning and sets in the evening?” and provide “The Sun” as one choice of answer, and the names of hockey players for the other choices.

C. You could find out what questions Ambrose might be most likely to answer correctly in the domain of interest, and then craft an exam based on that.

D. You could give a multiple choice test in which, for every question, one of A, B, or C was the correct answer, and answer D was always “One of the above.”

E. You might give a reasonably difficult multiple-choice exam, but when Ambrose got an answer wrong, you could decide that there’s another way to interpret the answer, and quietly mark it right.

F. You might give Ambrose a very long set of multiple-choice questions (say 400 of them), and then, of his answers, pick 40 correct ones. You then present those questions and answers as the completed exam.

G. You could give Ambrose a set of questions, but give him as much time as he wanted to provide an answer. In addition, you don’t watch him carefully (although not watching carefully is a strategy that nicely supports most of these options).

H. You could ask Ambrose one multiple choice question. If he got it wrong, you could correct him until he got it right. Then you could develop another question, ask that, and if he got it wrong, correct him until he got it right. Then continue in a loop until you get to 40 questions.

I. This approach is like H, but instead you could give a multiple choice test for which you had chosen an entire set of 40 questions in advance. If Ambrose didn’t get them all right, you could correct him, and then give him the same set of questions again. And again. And over and over again, until he finally gets them all right. You don’t have to publicize the failed attempts; only the final, successful one. That might take some time and effort, and Ambrose wouldn’t really be any more capable of anything except answering these specific questions. But, like all the other approaches above, you could effect a perfect score for Ambrose.

When the boss is clamoring for a certain result, you feel under pressure and you’re vulnerable. You wouldn’t advise anyone to do any of the things above, and you wouldn’t do them yourself. Or at least, you wouldn’t do them consciously. You might even do them with the best of intentions.

There’s an obvious parallel here—or maybe not. You may be thinking of the exam in terms of a certain kind of certification scheme that uses only multiple-choice questions, the boss as the hiring manager for a test group, and Ambrose as a hapless tester that everyone wants to put into a job for different reasons, even though no one is particularly thrilled about the idea. Some critical outsider might come along and tell you point-blank that your exam wasn’t going to evaluate Ambrose accurately. Even a sympathetic observer might offer criticism. If that were to happen, you’d want to keep the information under your hat—and quite frankly, the other interested parties would probably be complacent too. Dealing with the critique openly would disturb the idea that everyone can save face by saying that Ambrose passed a test.

Yet that’s not what I had in mind—not specifically, at least. I wanted to point out some examples of bad or misleading testing, which you can find in all kinds of contexts if you put your mind to it. Imagine that the exam is a set of tests—checks, really. The boss is a product owner who wants to get the product released. The boss’ wife is a product marketing manager. Hapless Ambrose is a program—not a very good program to be sure, but one that everyone wants to release for different reasons, even though no one is particularly thrilled by the idea. You, whether a programmer or a tester or a test manager, are responsible for “testing”, but you’re really setting up a set of checks. And you’re under a lot of pressure. How might your judgement—consciously or subconsciously—be compromised? Would your good intentions bend and stretch as you tried to please your stakeholders and preserve your integrity? Would you admit to the boss that your testing was suspect? If you were under enough pressure, would you even notice that your testing was suspect?

So this story is actually about any circumstance in which someone might set up a set of checks that provide some illusion of success. Can you think of any more ways that you might game the tests… or worse, fool yourself?

Statistician or Journalist?

Friday, August 27th, 2010

Eric Jacobson has a problem, which he thoughtfully relates on his thoughtful blog in a post called “How Can I Tell Users What Testers Did?”. In this post, I’ll try to answer his question, so you might want to read his original post for context.

I see something interesting here: Eric tells a clear story to relate to his readers some problem that he’s having with explaining his work to others who, by his account, don’t seem to understand it well. In that story, he mentions some numbers in passing. Yet the numbers that he presents are incidental to the story, not central to it. On the contrary, in fact: when he uses numbers, he’s using them as examples of how poorly numbers tell the kind of story he wants to tell. Yet he tells a fine story, don’t you think?

In the Rapid Software Testing course, we present this idea (Note to Eric: we’ve added this since you took the class): To test is to compose, edit, narrate, and justify two parallel stories. You must tell a story about the product: how it works, how it fails, and how it might not work in ways that matter to your client (and in the context of a retrospective, you might like to talk about how the product was failing and is now working). But in order to give that story its warrant, you must tell another story: you must tell a story about your testing. In a case like Eric’s, that story would take the form of a summary report focused on two things: what you want to convey to your clients, and what they want to know from you (and, ideally, those two things should be in sync with each other).

To do that, you might like to consider various structures to frame your story. Let’s start with the elements of what we (somewhat whimsically) call The Universal Test Procedure (you can find it in the course notes for the class). From a retrospective view, that would include

  • your model of the test space (that is, what was inside and outside the scope of your testing, and in particular the risks that you were trying to address)
  • the oracles that you used
  • the coverage that you obtained
  • the test techniques you applied
  • the ways in which you configured the product
  • the ways in which you operated the product
  • the ways in which you observed the product
  • the ways in which you evaluated the product
  • the heuristics by which you decided to stop testing; and
  • what you discovered and reported, and how you reported

You might also consider the structures of exploratory testing. Even if your testing isn’t highly exploratory, a lot of the structures have parallels in scripted testing.

Jon Bach says (and I agree) that testing is journalism, so look at the way journalists structure a story: they often start with the classic inverted-pyramid lead. They might also start with a compelling anecdote as recounted in What’s Your Story, by Craig Wortmann, or Made to Stick, by Chip and Dan Heath. If you’re in the room with your clients, you can use a whiteboard talk with diagrams, as in Dan Roam’s The Back of the Napkin. At the centre of your story, you could talk about risks that you addressed with your testing; problems that you found and that got addressed; problems that you found and that didn’t get addressed; things that slowed you down as you were testing; effort that you spent in each area; coverage that you obtained. You could provide testimonials from the programmers about the most important problems you found; the assistance that you provided to them to help prevent problems; your contributions to design meetings or bug triage sessions; obstacles that you surmounted; a set of charters that you performed, and the feature areas that they covered. Again, focus on what you want to convey to your clients, and what they want to know from you.

Incidentally, the more often and the more coherently you tell your story, the less explaining you’ll have to do about the general stuff. That means keeping as close to your clients as you can, so that they can observe the story unfolding as it happens. But when you ask “What metric or easily understood information can my test team provide users, to show our contribution to the software we release?”, ask yourself this: “Am I a statistician or a journalist?”


Other resources for telling testing stories:

Thread-Based Test Management: Introducing Thread-Based Test Management, by James Bach; and A New Thread, by Jon Bach (as of this writing, this is brand new stuff)

Telling Your Exploratory Story: A presentation at Agile 2010, by Jonathan Bach (I was unable to download anything other than a damaged version of this, but maybe it’s working now; please let me know)

Constructing the Quality Story (from Better Software, November 2009): Knowledge doesn’t just exist; we build it. Sometimes we disagree on what we’ve got, and sometimes we disagree on how to get it. Hard as it may be to imagine, the experimental approach itself was once controversial. What can we learn from the disputes of the past? How do we manage skepticism and trust and tell the testing story?

On Metrics:

Three Kinds of Measurement (And Two Ways to Use Them) (from Better Software, July 2009): How do we know what’s going on? We measure. Are software development and testing sciences, subject to the same kind of quantitative measurement that we use in physics? If not, what kinds of measurements should we use? How could we think more usefully about measurement to get maximum value with a minimum of fuss? One thing is for sure: we waste time and effort when we try to obtain six-decimal-place answers to whole-number questions. Unquantifiable doesn’t mean unmeasurable. We measure constantly WITHOUT resorting to numbers. Goldilocks did it.

Issues About Metrics About Bugs (Better Software, May 2009): Managers often use metrics to help make decisions about the state of the product or the quality of the work done by the test group. Yet measurements derived from bug counts can be highly misleading because a “bug” isn’t a tangible, countable thing; it’s a label for some aspect of some relationship between some person and some product, and it’s influenced by when and how we count… and by who is doing the counting.

On Coverage:

Got You Covered (from Better Software, October 2008): Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.

Cover or Discover (from Better Software, November 2008): Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.

A Map By Any Other Name (from Better Software, December 2008): A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.

When Should the Product Owner Release the Product?

Tuesday, August 3rd, 2010

In response to my previous blog post “Another Silly Quantitative Model”, Greg writes: In my current project, the product owner has assumed the risk of any financial losses stemming from bugs in our software. He wants to release the product to customers, but he is of course nervous. How do you propose he should best go about deciding when to release? How should he reason about the risks, short of using a quantitative model?

The simple answer is “when he’s not so nervous that he doesn’t want to ship”. What might cause him to decide to stop shipment? He should stop shipment when there are known problems in the product that aren’t balanced by countervailing benefits. Such problems are called showstoppers. A colleague once described “showstopper” as “any problem in the product that would make more sense to fix than to ship.”

When I was a product owner and I reasoned with the project team about showstoppers, we deemed as a showstopper

  • Any single problem in the product that would definitely cause loss or harm (or sufficient annoyance or frustration) to users, such that the product’s value in its essential operation would be nil. Beware of quantifying “users” here. In the age of the Internet, you don’t need very many people with terrible problems to make noise disproportionate to the population, nor do you need those problems to be terrible problems when they affect enough people. The recent kerfuffle over the iPhone 4 is a case in point; the Pentium Bug is another. Customer reactions are often emotional more than rational, but to the customer, emotional reactions are every bit as real as rational ones.
  • Any set of problems in the product that, when taken individually, would not threaten its value, but when viewed collectively would. That could include a bunch of minor irritants that confuse, annoy, disturb, or slow down people using the product; embarrassing cosmetic defects; non-devastating functional problems; parafunctional issues like poor performance or compatibility, and the like.

Now, in truth, your product owner might need to resort to a quantitative model here: he has to be able to count to one. One showstopper, by definition, is enough to stop shipment.

How might you evaluate potential showstoppers qualitatively? My colleague Fiona Charles has two nice suggestions: “Could a problem that we know about in this product trigger a front-page story in the Globe and Mail’s Report on Business, or in the Wall Street Journal?” “Could a problem that we know about in this product lead to a question being raised in Parliament?” Now: the fact is that we don’t, and can’t, know the answer to whether the problem will have that result, but that’s not really the point of the questions. The point is to explore and test the ways that we might feel about the product, the problems, and their consequences.

What else might cause nervousness for your client? Perhaps he’s worried that, other than the known problems, there are unanswered questions about the product. Those include

  • Open questions whose answer would produce one or more instances of a showstopper.
  • Unasked questions that, when asked, would turn into open questions instead of “I don’t care”. Where would you get ideas for such questions? Try the Heuristic Test Strategy Model at http://www.satisfice.com/tools/satisfice-tsm-4p.pdf for an example of the kinds of questions that you might ask.
  • Unanswered questions about the product are one indicator that you might not be finished testing. There are other indicators; you can read about them here: http://www.developsense.com/blog/2009/09/when-do-we-stop-test/

    Questions about how much we have (or haven’t) tested are questions about test coverage. I wrote three columns about that a while back. Here are some links and synopses:

    Got You Covered: Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.

    Cover or Discover: Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.

    A Map By Any Other Name: A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.

    Whether you’ve established a clear feeling or are mired in uncertainty, you might want to test your first-order qualitative evaluation with a first-order quantitative model. For example, many years ago at Quarterdeck, we had a problem that threatened shipment: on bootup, our product would lock up systems that had a particular kind of hard disk controller. There was a workaround, which would take a trained technical support person about 15 minutes to walk through. No one felt good about releasing the product in that state, but we were under quarterly schedule pressure.

    We didn’t have good data to work with, but we did have a very short list of beta testers and data about their systems. Out of 60 beta testers, three had machines with this particular controller. There was no indication that our beta testers were representative of our overall user population, but 5% of our beta testers had this controller. We then performed the thought experiment of destroying the productivity of 5% of our user base, or tech support having to spend 15 minutes with 5% of our user base (in those days, in the millions).

    How big was our margin of error? What if we were off by a factor of two, and ten per cent of our user base had that controller? What if we were off by a factor of five, and only one per cent of our user base had that controller? Suppose that only one per cent of the user base had their machines crash on startup; suppose that only a fraction of those users called in. Eeew. The story and the feeling, rather than the numbers, told us this: still too many.
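Here is a sketch of that first-order arithmetic, for anyone who wants to run the thought experiment with their own numbers. The user-base figure below is an assumption (all we knew was that it was in the millions); the 15-minute support cost is the one from the story.

```python
# First-order impact arithmetic: how many users get hurt, and how much
# support time it costs, at a few plausible incidence rates. The user-base
# size is an assumed, illustrative figure, not Quarterdeck's actual number.

user_base = 2_000_000              # "in the millions" -- assumed for illustration
support_minutes_per_call = 15      # the workaround walk-through from the story

for rate in (0.01, 0.05, 0.10):    # off by a factor of five, as observed, off by two
    affected = int(user_base * rate)
    support_hours = affected * support_minutes_per_call / 60
    print(f"{rate:4.0%} of users: {affected:>8,} machines locking up on boot, "
          f"~{support_hours:>7,.0f} support hours if they all call in")
```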

    Is it irrational to base a decision on such vague, unqualified, first-order numbers? If so, who cares? We made an “irrational” but conscious decision: fix the product, rather than ship it. That is, we didn’t decide based on the numbers, but rather on how we felt about the numbers. Was that the right decision? We’ll never know, unless we figure out how to get access to parallel universes. In this one, though, we know that the problem got fixed startlingly quickly when another programmer viewed it with fresh eyes; that the product shipped without the problem; that users with that controller never faced that problem; and that the tech support department continued in its regular overloaded state, instead of a super-overloaded one.

    The decision to release is always a business decision, and not merely a technical one. The decision to release is not based on numbers or quantities; even for those who claim to make decisions “based on the numbers”, the decisions are really based on feelings about the numbers. The decision to release is always driven by balancing cost, value, knowledge, trust, and risk. Some product owners will have a bigger appetite for risk and reward; others will be more cautious. Being a product owner is challenging because, in the end, product owners own the shipping decisions. By definition, a product owner assumes responsibility for the risk of financial losses stemming from bugs in the software. That’s why they get the big bucks.

    Another Silly Quantitative Model

    Wednesday, July 14th, 2010

    John D. Cook recently issued a blog post, How many errors are left to find?, in which he introduces yet another silly quantitative model for estimating the number of bugs left in a program.

    The Lincoln Index, as Mr. Cook refers to it here, was used as a model for evaluating typographical errors, and was based on a method for estimating the population of a given species of animal. There are several terrible problems with this analysis.
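For readers who haven't followed the link, here is roughly the arithmetic that the Lincoln Index proposes, sketched with made-up numbers so that the objections below have something concrete to point at.

```python
# The Lincoln-Petersen estimator, as applied to bugs -- the application this
# post objects to: two testers test independently, and the overlap between
# their findings is used to estimate the total "population" of bugs. The
# counts here are made up for illustration.

def lincoln_index(found_by_a, found_by_b, found_by_both):
    """Estimated total population, given two samples and their overlap."""
    if found_by_both == 0:
        return float("inf")            # no overlap: the estimate blows up
    return (found_by_a * found_by_b) / found_by_both

a, b, both = 20, 15, 3                 # made-up bug counts
estimate = lincoln_index(a, b, both)
distinct = a + b - both
print(f"Estimated total bugs: {estimate:.0f}")
print(f"Distinct bugs found:  {distinct}")
print(f"'Remaining' bugs:     {estimate - distinct:.0f}")
```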

    First, reification error. Bugs are relationships, not things in the world. A bug is a perception of a problem in the product; a problem is a difference between what is perceived and what is desired by some person. There are at least four ways to make a problem into a non-problem: 1) Change the perception. 2) Change the product. 3) Change the desire. 4) Ignore the person who perceives the problem. Any time a product owner can say, “That? Nah, that’s not a bug,” the basic unit of the system of measurement is invalidated.

    Second, even if we suspended the reification problem, the model is inappropriate. Bugs cannot be usefully modelled as a single kind of problem or a single population. Typographical errors are not the only problems in writing; a perfectly spelled and syntactically correct piece of writing is not necessarily a good piece of writing. Nor are plaice the only species of fish in the fjords, nor are fish the only form of life in the sea, nor do we consider all life forms as equivalently meaningful, significant, benign, or threatening. Bugs have many different manifestations, from capability problems to reliability problems to compatibility problems to performance problems and so forth. Some of those problems don’t have anything to do with coding errors (which themselves could be like typos or grammatical errors or statements that can be interpreted ambiguously). Problems in the product may include misunderstood requirements, design problems, valid but misunderstood implementation of the design, and so forth. If you want to compare estimating bugs in a program to a population estimate, it would be more appropriate to compare it to estimating the number of all biological organisms in a given place. Imagine some of the problems in doing that, and you may get some insight into the problem of estimating bugs.

    Third, there’s Dijkstra’s notion that testing can show the presence of problems, but not their absence. That’s a way of noting that testing is subject to the Halting Problem. Since you can’t tell if you’ve found the last problem in the product, you can’t estimate how many are left in it.

    Fourth, the Ludic Fallacy (part one). Discovery and analysis of problems in a product is not a probabilistic game, but a non-linear, organic system of exploration, discovery, investigation, and learning. Problems are discovered at neither a steady nor a random rate. Indeed, discoveries often happen in clusters as the tester learns about the program and things that might threaten its value. The Lincoln Index, focused on typos—a highly decidable and easily understood problem whose detection could largely be accomplished by checking—doesn’t fit software testing.

    Fifth, the Ludic Fallacy (part two). Mr. Cook’s analysis implies that all problems are of equal value. Those of us who have done testing and studied it for a long time know that, from one time to another, some testers find a bunch of problems, and others find relatively few. Yet those few problems might be of critical significance, and the many might be of lesser significance. It’s an error to think in terms of a probabilistic model without thinking in terms of the payoff. Related to that is the idea that the number of bugs remaining in the product may not be that big a deal. All the little problems might pale in significance next to the one terrible problem; the one terrible problem might be easily fixable while the little problems grind down the users’ will to live.

    Sixth, measurement-induced distortion. Whenever you measure a self-aware system, you are likely to introduce distortion (at best) and dysfunction (at worst), as the system adapts itself to optimize the thing that’s being measured. Count bugs, and your testers will report more bugs—but finding more bugs can get in the way of finding more important bugs. That’s at least in part because of…

    Seventh, the Lumping Problem (or more formally, Assimilation Bias). Testing is not a single activity; it’s a collection of activities that includes (at least) setup, investigation and reporting, and design and execution. Setup and investigation and reporting take time away from test coverage. When a tester finds a problem, she investigates and reports it. That time is time that she can’t spend finding other problems. The irony here is that the more problems you find, the fewer problems you have time to find. The quality of testing work also involves the quality of the report. Reporting time, since it isn’t taken into account in the model, will distort the perception of the number of bugs remaining.

    Eighth, estimating the number of problems remaining in the product takes time away from sensible, productive activities. Considering that the number of problems remaining is subjective, open-ended, and unprovable, one might be inclined to think that counting how many problems are left is a waste of time better spent on searching for other bad ones.

    I don’t think I’ve found the last remaining problem with this model.

    But it does remind me that when people see bugs as units and testing as piecework, rather than the complex, non-linear, cognitive process that it is, they start inventing all these weird, baseless, silly quantitative models that are at best unhelpful and that, more likely, threaten the quality of testing on the project.

    Why Is Testing Taking So Long? (Part 2)

    Wednesday, November 25th, 2009

    Yesterday I set up a thought experiment in which we divided our day of testing into three 90-minute sessions. I also made a simplifying assumption that bursts of testing activity representing some equivalent amount of test coverage (I called it a micro-session, or just a “test”) take two minutes. Investigating and reporting a bug that we find costs an additional eight minutes, so a test on its own would take two minutes, and a test that found a problem would take ten.

    Yesterday we tested three modules. We found some problems. Today the fixes showed up, so we’ll have to verify them.

    Let’s assume that a fix verification takes six minutes. (That’s yet another gross oversimplification, but it sets things up for our little thought experiment.) We don’t just perform the original microsession again; we have to do more than that. We want to make sure that the problem is fixed, but we also want to do a little exploration around the specific case and make sure that the general case is fixed too.

    Well, at least we’ll have to do that for Modules B and C. Module A didn’t have any fixes, since nothing was broken. And Team A is up to its usual stellar work, so today we can keep testing Team A’s module, uninterrupted by either fix verifications or by bugs. We get 45 more micro-sessions in today, for a two-day total of 90.

    (As in the previous post, if you’re viewing this under at least some versions of IE 7, you’ll see a cool bug in its handling of the text flow around the table.  You’ve been warned!)

    Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total
    A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90

    Team B stayed an hour or so after work yesterday. They fixed the bug that we found, tested the fix, and checked it in. They asked us to verify the fix this afternoon. That costs us six minutes off the top of the session, leaving us 84 more minutes. Yesterday’s trends continue; although Team B is very good, they’re human, and we find another bug today. The test costs two minutes, and bug investigation and reporting costs eight more, for a total of ten. In the remaining 74 minutes, we have time for 37 micro-sessions. That means a total of 38 new tests today—one that found a problem, and 37 that didn’t. Our two-day total for Module B is 79 micro-sessions.

    Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total
    A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90
    B | 6 minutes (1 bug yesterday) | 10 minutes (1 test, 1 bug) | 74 minutes (37 tests) | 38 | 79

    Team C stayed late last night. Very late. They felt they had to. Yesterday we found eight bugs, and they decided to stay at work and fix them. (Perhaps this is why their code has so many problems; they don’t get enough sleep, and produce more bugs, which means they have to stay late again, which means even less sleep…) In any case, they’ve delivered us all eight fixes, and we start our session this afternoon by verifying them. Eight fix verifications at six minutes each amounts to 48 minutes. So far as obtaining new coverage goes, today’s 90-minute session with Module C is pretty much hosed before it even starts; 48 minutes—more than half of the session—is taken up by fix verifications, right from the get-go. We have 42 minutes left in which to run new micro-sessions, those little two-minute slabs of test time that give us some equivalent measure of coverage. Yesterday’s trends continue for Team C too, and we discover four problems that require investigation and reporting. That takes 40 of the remaining 42 minutes. Somewhere in there, we spend two minutes of testing that doesn’t find a bug. So today’s results look like this:

    Module | Fix Verifications | Bug Investigation and Reporting (time spent on tests that find bugs) | Test Design and Execution (time spent on tests that don’t find bugs) | New Tests Today | Two-Day Total
    A | 0 minutes (no bugs yesterday) | 0 minutes (no bugs found) | 90 minutes (45 tests) | 45 | 90
    B | 6 minutes (1 bug yesterday) | 10 minutes (1 test, 1 bug) | 74 minutes (37 tests) | 38 | 79
    C | 48 minutes (8 bugs yesterday) | 40 minutes (4 tests, 4 bugs) | 2 minutes (1 test) | 5 | 18

    Over two days, we’ve been able to obtain only 20% of the test coverage for Module C that we’ve been able to obtain for Module A. We’re still at less than 1/4 of the coverage that we’ve been able to obtain for Module B.
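If you'd like to play with the thought experiment yourself, here is a minimal sketch of the day-two arithmetic behind the table above, using the same simplified costs: two minutes per micro-session, eight extra minutes for each bug found, and six minutes per fix verification. The day-one totals in the loop are the two-day totals minus today's new tests.

```python
# Day two of the thought experiment: fix verifications come off the top of
# the 90-minute session, each bug-finding test costs 10 minutes (2 + 8),
# and whatever is left goes to plain 2-minute micro-sessions.

SESSION, TEST, BUG_EXTRA, FIX = 90, 2, 8, 6

def new_tests_today(fixes_to_verify, bugs_found_today):
    """New micro-sessions obtained in one 90-minute session on day two."""
    minutes = SESSION - FIX * fixes_to_verify             # verify yesterday's fixes first
    minutes -= bugs_found_today * (TEST + BUG_EXTRA)      # tests that find bugs cost 10
    return bugs_found_today + minutes // TEST             # the rest: plain 2-minute tests

# (module, fixes to verify, bugs found today, day-one total from the table above)
for module, fixes, bugs, day_one in (("A", 0, 0, 45), ("B", 1, 1, 41), ("C", 8, 4, 13)):
    today = new_tests_today(fixes, bugs)
    print(f"Module {module}: {today} new tests today, {day_one + today} over two days")
```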

    Yesterday, we learned one lesson:

    Lots of bugs means reduced coverage, or slower testing, or both.

    From today’s results, here’s a second:

    Finding bugs today means verifying fixes later, which means even less coverage or even slower testing, or both.

    So why is testing taking so long? One of the biggest reasons might be this:

    Testing is taking longer than we might have expected or hoped because, although we’ve budgeted time for testing, we lumped into it the time for investigating and reporting problems that we didn’t expect to find.

    Or, more generally,

    Testing is taking longer than we might have expected or hoped because we have a faulty model of what testing is and how it proceeds.

    For managers who ask “Why is testing taking so long?”, it’s often the case that their model of testing doesn’t incorporate the influence of things outside the testers’ control. Over two days of testing, the difference between the quality of Team A’s code and Team C’s code has a profound impact on the amount of uninterrupted test design and execution work we’re able to do. The bugs in Module C present interruptions to coverage, such that (in this very simplified model) we’re able to spend only one-fifth of our test time designing and executing tests. After the first day, we were already way behind; after two days, we’re even further behind. And even here, we’re being optimistic. With a team like Team C, how many of those fixes will be perfect, revealing no further problems and taking no further investigation and reporting time?

    And again, those faulty management models will lead to distortion or dysfunction. If the quality of testing is measured by bugs found, then anyone testing Module C will look great, and people testing Module A will look terrible. But if the quality of testing is evaluated by coverage, then the Module A people will look sensational and the Module C people will be on the firing line. But remember, the differences in results here have nothing to do with the quality of the testing, and everything to do with the quality of what is being tested.

    There’s a psychological factor at work, too. If our approach to testing is confirmatory, with steps to follow and expected, predicted results, we’ll design our testing around the idea that the product should do this, and that it should behave thus and so, and that testing will proceed in a predictable fashion. If that’s the case, it’s possible—probable, in my view—that we will bias ourselves towards the expected and away from the unexpected. If our approach to testing is exploratory, perhaps we’ll start from the presumption that, to a great degree, we don’t know what we’re going to find. As much as managers, hack statisticians, and process enthusiasts would like to make testing and bug-finding predictable, people don’t know how to do that such that the predictions stand up to human variability and the complexity of the world we live in. Plus, if you can predict a problem, why wait for testing to find it? If you can really predict it, do something about it now. If you don’t have the ability to do that, you’re just playing with numbers.

    Now: note again that this has been a thought experiment. For simplicity’s sake, I’ve made some significant distortions and left out an enormous amount of what testing is really like in practice.

    • I’ve treated testing activities as compartmentalized chunks of two minutes apiece, treading dangerously close to the unhelpful and misleading model of testing as development and execution of test cases.
    • I haven’t looked at the role of setup time and its impact on test design and execution.
    • I haven’t looked at the messy reality of having to wait for a product that isn’t building properly.
    • I haven’t included the time that testers spend waiting for fixes.
    • I haven’t included the delays associated with bugs that block our ability to test and obtain coverage of the code behind them.
    • I’ve deliberately ignored the complexity of the code.
    • I’ve left out difficulties in learning about the business domain.
    • I’ve made highly simplistic assumptions about the quality and relevance of the testing and the quality and relevance of the bug reports, the skill of the testers in finding and reporting bugs, and so forth.
    • And I’ve left out the fact that, as important as skill is, luck always plays a role in finding problems.

    My goal was simply to show this:

    Problems in a product have a huge impact on our ability to obtain test coverage of that product.

    The trouble is that even this fairly simple observation is below the level of visibility of many managers. Why is it that so many managers fail to notice it?

    One reason, I think, is that they’re used to seeing linear processes instead of organic ones, a problem that Jerry Weinberg describes in Becoming a Technical Leader. Linear models “assume that observers have a perfect understanding of the task,” as Jerry says. But software development isn’t like that at all, and it can’t be. By its nature, software development is about dealing with things that we haven’t dealt with before (otherwise there would be no need to develop a new product; we’d just reuse the one we had). We’re always dealing with the novel, the uncertain, the untried, and the untested, so our observation is bound to be imperfect. If we fail to recognize that, we won’t be able to improve the quality and value of our work.

    What’s worse about managers with a linear model of development and testing is that “they filter out innovations that the observer hasn’t seen before or doesn’t understand” (again, from Becoming a Technical Leader). As an antidote for such managers, I’d recommend Perfect Software, and Other Illusions About Testing and Lessons Learned in Software Testing as primers. But mostly I’d suggest that they observe the work of testing. In order to do that well, they may need some help from us, and that means that we need to observe the work of testing too. So over the next little while, I’ll be talking more than usual about Session-Based Test Management, developed initially by James and Jon Bach, which is a powerful set of ideas, tools and processes that aid in observing and managing testing.

    Why Is Testing Taking So Long? (Part 1)

    Tuesday, November 24th, 2009

    If you’re a tester, you’ve probably been asked, “Why is testing taking so long?” Maybe you’ve had a ready answer; maybe you haven’t. Here’s a model that might help you deal with the kind of manager who asks such questions.

    Let’s suppose that we divide our day of testing into three sessions, each session being, on average, 90 minutes of chartered, uninterrupted testing time. That’s four and a half hours of testing, which seems reasonable in an eight-hour day interrupted by meetings, planning sessions, working with programmers, debriefings, training, email, conversations, administrivia of various kinds, lunch time, and breaks.

    The reason that we’re testing is that we want to obtain coverage; that is, we want to ask and answer questions about the product and its elements to the greatest extent that we can. Asking and answering questions is the process of test design and execution. So let’s further assume that we break each session into micro-sessions of two minutes each, on average, in which we perform some test activity that’s focused on a particular testing question, or on evaluating a particular feature. That means that, in a 90-minute session, we can theoretically perform 45 of these little micro-sessions, which for the sake of brevity we’ll informally call “tests”. Of course life doesn’t really work this way; a test idea might take a couple of seconds to implement, or it might take all day. But I’m modeling here, making this rather gross simplification to clarify a more complex set of dynamics. (Note that if you’d like to take a really impoverished view of what happens in skilled testing, you could say that a “test case” takes two minutes. But I leave it to my colleague James Bach to explain why you should question the concept of test cases.)

    Let’s further suppose that we’ll find problems every now and again, which means that we have to do bug investigation and reporting. This is valuable work for the development team, but it takes time that interrupts test design and execution—the stuff that yields test coverage. Let’s say that, for each bug that we find, we must spend an extra eight minutes investigating it and preparing a report. Again, this is a pretty dramatic simplification. Investigating a bug might take all day, and preparing a good report could take time on the order of hours. Some bugs (think typos and spelling errors in the UI) leap out at us and don’t call for much investigation, so they’ll take less than eight minutes. Even though eight minutes is probably a dramatic underestimate for investigation and reporting, let’s go with that. So a test activity that doesn’t find a problem costs us two minutes, and a test activity that does find a problem takes ten minutes.

    Now, let’s imagine one more thing: we have perfect testing prowess. If there’s a problem in an area that we’re testing, we’ll find it, and we’ll never enter a bogus report, either. Yes, this is a thought experiment.
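    Since the whole setup is just arithmetic, here’s a tiny sketch of it in Python. The names and numbers (SESSION_MINUTES, session_results, and so on) are mine, chosen only to mirror the assumptions above; it’s an illustration of the thought experiment, not of real testing.

        # A sketch of the thought experiment's arithmetic, using made-up names.
        SESSION_MINUTES = 90    # one chartered, uninterrupted testing session
        TEST_MINUTES = 2        # a micro-session ("test") that finds no problem
        BUG_EXTRA_MINUTES = 8   # additional investigation and reporting per bug

        def session_results(bugs_found):
            """Return (bug minutes, other minutes, total tests) for one session."""
            bug_minutes = bugs_found * (TEST_MINUTES + BUG_EXTRA_MINUTES)
            other_minutes = SESSION_MINUTES - bug_minutes
            quiet_tests = other_minutes // TEST_MINUTES
            return bug_minutes, other_minutes, bugs_found + quiet_tests

        # A bug-free session: all 90 minutes go to test design and execution.
        print(session_results(0))   # (0, 90, 45)

    We’ll feed other bug counts into the same arithmetic as the day unfolds.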

    One day we come into work, and we’re given three modules to test.

    The morning session is taken up with Module A, from Development Team A. These people are amazing, hyper-competent. They use test-first programming, and test-driven design. They work closely with us, the testers, to design challenging unit checks, scriptable interfaces, and log files. They use pair programming, and they review and critique each other’s work in an egoless way. They refactor mercilessly, and run suites of automated checks before checking in code. They brush their teeth and floss after every meal; they’re wonderful. We test their work diligently, but it’s really a formality because they’ve been testing and we’ve been helping them test all along. In our 90-minute testing session, we don’t find any problems. That means that we’ve performed 45 micro-sessions, and have therefore obtained 45 units of test coverage.

    (And if you’re viewing this under at least some versions of IE 7, you’ll see a cool bug in its handling of the text flow around the table.  You’ve been warned!)

    Module | Bug Investigation and Reporting      | Test Design and Execution                  | Total Tests
           | (time spent on tests that find bugs) | (time spent on tests that don't find bugs) |
    -------|--------------------------------------|--------------------------------------------|------------
    A      | 0 minutes (no bugs found)            | 90 minutes (45 tests)                      | 45

    The first thing after lunch, we have a look at Team B’s module. These people are very diligent indeed. Most organizations would be delighted to have them on board. Like Team A, they use test-first programming and TDD, they review carefully, they pair, and they collaborate with testers. But they’re human. When we test their stuff, we find a bug very occasionally; let’s say once per session. The test that finds the bug takes two minutes; investigation and reporting of it takes a further eight minutes. That’s ten minutes altogether. The rest of the time, we don’t find any problems, so that leaves us 80 minutes in which we can run 40 tests. Let’s compare that with this morning’s results.

    Module | Bug Investigation and Reporting      | Test Design and Execution                  | Total Tests
           | (time spent on tests that find bugs) | (time spent on tests that don't find bugs) |
    -------|--------------------------------------|--------------------------------------------|------------
    A      | 0 minutes (no bugs found)            | 90 minutes (45 tests)                      | 45
    B      | 10 minutes (1 test, 1 bug)           | 80 minutes (40 tests)                      | 41

    After the afternoon coffee break, we move on to Team C’s module. Frankly, it’s a mess. Team C is made up of nice people with the best of intentions, but sadly they’re not very capable. They don’t work with us at all, and they don’t test their stuff on their own, either. There’s no pairing and no review in Team C. To Team C, if it compiles, it’s ready for the testers. The module is a dog’s breakfast, and we find bugs practically everywhere. Let’s say we find eight in our 90-minute session. Each test that finds a problem costs us ten minutes, so we spend 80 minutes on those eight bugs. Every now and again, we happen to run a test that doesn’t find a problem. (Hey, even dBase IV occasionally did something right.) The ten minutes that remain give us room for just five of those. Our results for the day now look like this:

    Module | Bug Investigation and Reporting      | Test Design and Execution                  | Total Tests
           | (time spent on tests that find bugs) | (time spent on tests that don't find bugs) |
    -------|--------------------------------------|--------------------------------------------|------------
    A      | 0 minutes (no bugs found)            | 90 minutes (45 tests)                      | 45
    B      | 10 minutes (1 test, 1 bug)           | 80 minutes (40 tests)                      | 41
    C      | 80 minutes (8 tests, 8 bugs)         | 10 minutes (5 tests)                       | 13

    Because of all the bugs, Module C allows us to perform thirteen micro-sessions in 90 minutes. Thirteen, where with the other modules we managed 45 and 41. Because we’ve been investigating and reporting bugs, there are 32 micro-sessions, 32 units of coverage, that we haven’t been able to obtain on this module. If we decide that we need to perform that testing (and if the module’s overall badness is consistent throughout), we’re going to need at least three more sessions to cover it; a quick calculation follows below. Alternatively, we could stop testing now, but what are the chances of a serious problem lurking in the parts of the module we haven’t covered? So, the first thing to observe here is:
    Lots of bugs means reduced coverage, or slower testing, or both.
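    Here’s the back-of-the-envelope arithmetic behind “at least three more sessions”, reusing the made-up numbers from the earlier sketch; again, it’s an illustration of the model, not a measurement of anything real.

        import math

        possible_per_session = 45   # micro-sessions in a bug-free 90-minute session
        achieved_on_module_c = 13   # what Module C's bugs left room for
        missing_coverage = possible_per_session - achieved_on_module_c

        # If the module's badness is consistent, each further session also yields about 13 units.
        extra_sessions = math.ceil(missing_coverage / achieved_on_module_c)
        print(missing_coverage, extra_sessions)   # 32 3

    Thirty-two units of coverage are still on the table, and at Module C’s rate that’s at least three more sessions to pick them up.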

    There’s something else that’s interesting, too. If we are being measured based on the number of bugs we find (exactly the sort of measurement that will be taken by managers who don’t understand testing), Team A makes us look awful—we’re not finding any bugs in their stuff. Meanwhile, Team C makes us look great in the eyes of management. We’re finding lots of bugs! That’s good! How could that be bad?

    On the other hand, if we’re being measured based on the test coverage we obtain in a day (which is exactly the sort of measurement that will be taken by managers who count test cases; that is, managers who probably have an even more damaging model of testing than the managers in the last paragraph), Team C makes us look terrible. “You’re not getting enough done! You could have performed 45 test cases today on Module C, and you’ve only done 13!” And yet, remember that in our scenario we started with the assumption that, no matter what the module, we always find a problem if there’s one there. That is, there’s no difference between the testers or the testing for each of the three modules; it’s solely the condition of the product that makes all the difference.

    This is the first in a pair of posts. Let’s see what happens tomorrow.