Blog Posts from February, 2012

Delivering the News (Test Reporting Part 3)

Monday, February 27th, 2012

In the last post in this series, I noted some potentially useful structural similarities between bug reports (whether oral or written) and newspaper reports. This time, I’ll delve into that a little more.

To our clients, investigative problem reports are usually the most important part of the product story. The most respected newspapers don’t earn their reputations by reprinting press releases; they earn their reputations through investigative journalism. As testers (or, heaven help us, quality assurance people), we tend to be chartered to look for problems, and to investigate them in ways that are most helpful to programmers, managers, and our other clients. A failing test on its own tells us little, and a failing check even less; as I pointed out here, a failing test is only an allegation of a problem. Investigation and study of a failing test is likely to inform us of something more useful: whether someone will perceive a problem that threatens the value of the product. I’ll talk more about the nature of problems in a later post, but for now, think of a product problem in terms of a perceived absence of, or threat to, some dimension of quality. (See the Heuristic Test Strategy Model for one list of quality criteria; see Software Quality Characteristics, by Rikard Edgren, Henrik Emilsson and Martin Jansson for another.) Since the manager’s goal is generally to release a product at her desired level of quality, problems that could threaten that goal are likely to be interesting and important. Or, as they say in the newspaper business, “if it bleeds, it leads”.

Potential showstoppers are usually the most important stories. In the 1990s, I was a technical support person, a tester, a program manager, and a programmer for a mass-market, commercial shrink-wrap software company. Since we had millions of customers, even minor problems could have a big impact on technical support and on the reputation of our products in the market. The market was enormous, hardware and software were even less standardized than they are now, and we worked under a great deal of time pressure. Classifying and prioritizing problems was contentious. One of the important classification questions was “What should we consider a showstopper?” One of the senior programmers came up with an answer that I’ve used ever since:

Showstopper (n.): Something that makes more sense to fix than to ship.

(I talked about showstoppers here.) In a development project, a showstopper—any threat to the timely release of the project—is a page-one, above-the-fold story, a story that you can see and begin to read without opening the newspaper or picking it up.

There’s always one story that leads. The most important threat to a timely, successful release may be a single problem, or it may be a collection of problems—what Ian Mitroff calls a mess. Do we have a problem, or a couple of problems, or a mess? No matter what the answer, there’s only so much space on the front page above the fold. Will you have one headline, or two, or three? What will that headline say? What will the lead paragraph of each story look like? Does the lead paragraph cover the five Ws—who, what, where, when, and why? If not, are those questions answered shortly thereafter? Might there be a good reason not to answer them?

There’s only one front page, and there’s almost always more than one story on it. Our clients need to be able to absorb the lead story and the other front-page stories quickly, so we need to be able to provide headlines, lead paragraphs, and details in appropriate proportions. See an example front page here, with details that follow.

Very infrequently, serious newspapers give their entire front page to a story. In those cases, it’s usually an overwhelmingly important story, or one that threatens the newspaper or journalism itself.

The most compelling stories are those that have an impact on people. Although product problems are often technical in nature, the “making sense” part of the showstopper decision is focused on the business. Testers must be able to connect technical problems with business risk. Problems related to technical correctness are often easy to describe, but they might not be important. The skill of bug advocacy—making sure that the customer is aware of the best possible motivations for fixing the bug—depends on your ability to report the bug in terms of its most significant effect on the business. Ben Simo has a lovely way to sum this up. Early in his career, when Ben was trying to advocate a bug fix, his project manager said, “Revenue is king. Liability is queen. Tell me how this bug impacts them.”

The number of stories usually isn’t as important as the significance of the stories. This is another way in which test reports can be like newspapers. We don’t usually evaluate the quality of a newspaper by the number of stories in it. Instead, we look at the significance, relevance, and credibility of the stories.

It may take time to distinguish between a breaking story and a major story. Sometimes the news cycle doesn’t afford time for investigation, even though the story might be important. Information gets passed around the project at various moments during the test and development cycle. Sometimes a discovery happens just before a meeting. Smart reporters know to balance urgency and restraint when there’s a breaking story. When I worked in commercial mass-market software in the 1990s, we sometimes discovered a terrible-looking problem a couple of hours before release. Such discoveries would trigger arousal (no, not sexual friskiness, but arousal in the psychological sense of being suddenly snapped awake and alert to danger). All of a sudden, we’d be noticing all kinds of things that we hadn’t noticed before, and most of them were non-problems of one kind or another. We were biased by fear. We called it the “snakes on everything” moment. When reporting, testers need to take stock of the emotional factors surrounding them, and report cautiously and accurately. An hour from now, an allegation or a rumour might be an important story—or it might be nothing.

Non-problems aren’t news. There’s a pattern to the stories in the first section of the newspaper: they’re mostly stories about problems. There’s a reason for that: problems compel attention. Our emotional systems evolved to help keep us out of trouble. Problems or threats trigger arousal. Things that are going well are nice to hear about, but they don’t engage our emotions in the same way that problems do. In a software development project, non-problems have relatively little significance for project managers. Routine daily successes don’t threaten the project, and therefore need less attention.

Numbers, like pictures, are illustrations, not the whole story. A qualitative report is not quantity-free; after all, identifying the presence or absence of something involves counting to one, and the degree of some attribute of interest can be illustrated by a number. But just as a pictorial illustration isn’t the item it depicts, a numerical illustration isn’t the story it might help to describe. A picture looks at part of a scene through a particular lens; a number focuses on one attribute using a particular metric. Each one may emphasize some observations at the expense of others. Each one may crop out detail. Each one may magnify or distort.

Since the product and testing stories are multi-dimensional, be prepared to show the dimensions. Newspaper reports always have a bias, but reporters and editors often attempt to manage the bias by providing alternative sources of information, and alternative interpretations. A story of any length often includes multiple stories, or multiple threads of the main story. When tables of data are appropriate, newspapers print tables (think stock quotes in the business section, or box scores or line scores in sports). Products, coverage, quality, and problems are all multi-dimensional, multi-variate, and qualitative. Where there’s a mass of data, consider using tables such as dashboards or coverage tables. Pin numbers to reliable measurements (see the slip charts, the detailed impact case methods, and the subjective impact methods in Weinberg’s Quality Software Management, Volume 2: First-Order Measurement; and pay attention to validity—see Kirk and Miller’s Reliability and Validity in Qualitative Research and Shadish, Cook, and Campbell’s Experimental and Quasi-Experimental Designs for Generalized Causal Inference).

Describe your coverage. Boris Beizer described coverage as “any metric of test completeness with respect to a test selection criterion”. That suggests that it is possible to quantify coverage if you have a quantifiable test selection criterion. For example, if a single-digit field accepts any digit from 0 to 9, one could select ten tests and claim complete coverage based on that criterion. Mind you, that data coverage doesn’t account for flow or sequence coverage; suppose that a bug were triggered only when a 7 replaced a 3 in that field. Since the overall number of possible tests is infinite, test selection criteria are based on models. In practical terms, this means that overall test coverage is some finite number over an infinite number; if you report that accurately, you’re stuck with a number that remains asymptotically close to zero. Instead, focus on the qualitative, and describe your coverage on an ordinal scale:

  • Level 0: “We know nothing about this area of the product.”
  • Level 1: “We have done smoke or sanity testing; at this point, we’ve determined whether the product is even stable enough for serious testing.”
  • Level 2: “We’ve tested the common, the core, the critical, the happy path; our testing has been focused on ‘can it work?’”
  • Level 3: “We’ve tested the harsh, the complex, the challenging, the extreme, the exceptional; if there were a serious problem in this area, we’d probably know about it by now.”

In this system, the numbers are barely more than labels for a qualitative evaluation, so don’t be tempted to do serious math with them.
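
To make the gap between data coverage and sequence coverage concrete, here’s a minimal sketch in Python; the field, the bug, and the checks are all invented for illustration:

```python
# A hypothetical single-digit field with a state-dependent bug.
class DigitField:
    def __init__(self):
        self.value = None

    def enter(self, digit):
        # The invented bug: only a 7 entered after a 3 corrupts state.
        if self.value == 3 and digit == 7:
            raise RuntimeError("state corrupted")
        self.value = digit

# "Complete" coverage by the data-selection criterion: one test per digit.
for digit in range(10):
    field = DigitField()
    field.enter(digit)
print("data coverage: all ten checks passed")

# Yet a simple two-step sequence finds the problem that the criterion
# never exercises.
field = DigitField()
field.enter(3)
try:
    field.enter(7)
except RuntimeError as problem:
    print("sequence test found a problem:", problem)
```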

Braiding The Stories (Test Reporting Part 2)

Friday, February 24th, 2012

We were in the middle of a testing exercise at the Amplifying Your Effectiveness conference in 2005. I was assisting James Bach in a workshop that he was leading on testing. He presented the group with a mysterious application written by James Lyndsay—an early version of one of the Black Box Test Machines. “How many test cases would you need to test this application?” he asked.

Just then Jerry Weinberg wandered into the room. “Ah! Jerry Weinberg!” said James. “One of the greatest testing experts in the world! He’ll know the answer to this one. How many test cases would you need to test this application, Jerry?”

Jerry looked at the screen for a moment. “Three,” he said, firmly and decisively.

James knew to play along. “Three?!” he said, in a feigned combination of amazement, uncertainty, and curiosity. “How do you know it’s three? Is it really three, Jerry?”

“Yes,” said Jerry. “Three.” He paused, and then said drily, “Why? Were you expecting some other number?”

In yesterday’s post, I was harshly critical of pass vs. fail ratios, a very problematic yet startlingly common way of estimating the state of the product and the project. When I point out the mischief of pass vs. fail ratios, some people object. “In the real world,” they say, “we have to report pass vs. fail ratios to our managers, because that’s what they want.” Yet bogus reporting is antithetical to the “real world”. Pass vs. fail ratios come from the fake world, a world where numbers have magical properties to soothe troubled and uncertain souls. Still, there’s no question that managers want something. It’s our mandate to give them something of value.

Some people say that managers want numbers because they want to know that we’re measuring. I’ve found two ways of thinking about measurement that have been very useful to me. One is the definition from Kaner and Bond’s splendid paper “Software Engineering Metrics: What Do They Measure and How Do We Know?”: “Measurement is the empirical, objective assignment of numbers, according to a rule derived from a model or theory, to attributes of objects or events with the intent of describing them.” I think that’s a superb definition of quantitative measurement, and the paper includes a set of probing questions to test the validity of a quantitative measurement. Pass vs. fail ratios fall down badly when they’re subjected to those tests.

Jerry Weinberg offers another definition of measurement that I think is more in line with what managers really want: “Measurement is the art and science of making reliable (and significant) observations.” (The main part of the definition comes from Quality Software Management, Vol. 2: First-Order Measurement; the parenthetical comes from recent correspondence over Twitter.) That’s a more general, inclusive definition. It incorporates Kaner and Bond’s notion of quantitative measurement, but it’s more welcoming to qualitative, first-order approaches. First-order measurement, as Jerry describes it, provides answers to questions like “What seems to be happening?” and “What should I do now?” It entails a minimum of fuss, and tends to be direct, unobtrusive, inexpensive, and qualitative, leading either to immediate action or a decision to seek more information. It’s a common, misleading, and often expensive mistake in software development to leap over first-order measurement and reporting in favour of second-order measurement—less direct, more quantified, more abstract, and based on more elaborate and vulnerable models.

My experience, as a tester, a programmer, a program manager, and a consultant, tells me that to manage a project well, you need a good deal of immediate and significant information. “Immediate” here doesn’t only mean timely; it also means unmediated, without a bunch of stuff getting in between you and the observation. In particular, managers need to know about problems that threaten the value of the product and the on-time, successful completion of the project. That knowledge requires more than abstract data; it requires information. So, as testers, how can we inform the decision-makers? In our Rapid Software Testing class, James Bach and I have lately taken to emphasizing this: We must learn to describe and report on the product, our testing, and the quality of our testing. This involves constructing, editing, narrating, and justifying a story in three lines that weave around each other like a braid. Each line, or level, is its own story.

Level 1: Tell the product story. The product story is a qualitative report on how the product can work, how it fails, and how it might fail in ways that matter to our clients. “Working”, “failure”, and “what matters” are all qualitative evaluations. Quality is value to some person; in a business setting, quality is value to some person who matters to the business. A qualitative report about a product requires us to relate the nature of the product, the people who matter, and the presence or absence of value, risks, and problems for those people. Qualitative information makes it possible for our clients to make informed decisions about quality.

Level 2: To make the product story credible, tell the testing story. The testing story is about how we configured, operated, observed, and evaluated the product; what we actually did and what we actually saw. The testing story gives warrant to the product story; it helps our clients understand why they should believe and trust the product story we’re giving them. The testing story is centred around the coverage that we obtained and the oracles that we applied. Coverage is the extent to which we’ve tested the program; it’s about where we’ve looked and how we’ve looked, and it’s also about what’s uncovered—where we might not have looked yet, and where we don’t intend to look. Oracles are central to evaluation; they’re the principles and mechanisms that allow us to recognize a problem. The product story will likely feature problems in the product; the testing story, where necessary, includes an account of how we knew they were problems, for whom they would be problems, and inferences about how serious the problems might be. We can make inferences about the significance of problems, but not ultimate conclusions, since the decision about what matters and what constitutes a problem lies with the product owner. The product story and our clients’ reactions to it will influence the ongoing testing story, and vice versa.

Level 3: To make the testing story credible, tell a story about the quality of the testing. Just as the product story needs warrant, so too does the testing story. To tell a story about the quality of testing requires us to describe why the testing we’ve done has been good enough, and why the testing we haven’t done hasn’t been so important so far. The quality-of-testing story includes details on what made testing harder or slower, what made the product more or less testable, what the risks and costs of testing are, and what we might need or recommend in order to provide better, more accurate, more timely information. The quality-of-testing story will shape and be shaped by the other two stories.

Develop skills to tell and frame stories. People sometimes justify presenting invalid numbers in lieu of stories by saying that numbers are “efficient”. I think they mean “fast”, since efficiency of communication depends not only on speed, but also on value, relevance, validity, and the level of detail your client needs. In order to frame stories appropriately and hit the right level of detail…

Don’t think data feed; think the daily news. Testing is like investigative journalism, researching and delivering stories to people. The newspaper business knows how to direct attention efficiently to the stories in which we’re interested, such that we get the level of detail that we seek. Some of those strategies include:

  • Headlines. A quick glance over each page tells us immediately what, in the editors’ judgement, are the most salient aspects of any given story. Headlines come in different sizes, relative to the editors’ assessment of the importance of the story.
  • Front page. The paper comes folded. The stories that the paper deems most important to its reader are on the front page, above the fold. Other important stories are on the front page below the fold. The page is laid out to direct our attention to what we find most relevant, and to allow us to focus and refocus on items of interest.
  • Continuation. When an entire story is too long to fit on the front page, it’s abbreviated and the story continues elsewhere. This gives the reader the option of following the story or looking at other items on the front page.
  • Coverage areas. The newspaper is organized into sections (hard news, business, sports, life and leisure, arts, real estate, cars, travel, and so forth). Each section comes with its own front page, which generally includes headlines and continuations of its own.
  • Structured storytelling. Newspaper stories tend to be organized in spiralling levels of detail, such that the story is set up to follow the inverted pyramid (the link is well worth reading). The story typically begins with the most newsworthy information, usually immediately addressing the five W questions—who, what, where, when, and why, plus how—and the story builds from there. The key is that the reader can absorb information to the level of detail she seeks, continuing to the end of the story or jumping out when she’s satisfied.
  • Identifying who is involved and who is affected. Reporters and editors contextualize their stories. Just as in testing, people are the most important element of the context. A story is far more compelling when it affects the reader or people that the reader cares about. A good story often helps to clarify why the reader should care.
  • Varying approaches to delivering information. Newspapers often use a picture to help illustrate or emphasize an important aspect of a story. In the business or sports sections, where quantitative data is often crucial, information may be organized in tables, or trends may be illustrated with charts. Notice that the stories—first-order reports—are always given greater prominence than the tables of stock quotes, league standings, and line scores.
  • Sidebars. Some stories are illuminated by background information that might break the flow of the main story. That information is presented in parallel; in another thread, as we might say.
  • Daily (and in the world of the Web, continuous) delivery of information. My newspaper arrives at a regular time each day, a sort of daily heartbeat for the news cycle. The paper’s Web site is updated on a continuous basis. Information is available both on a supply and a demand basis; both when I expect it and when I seek it.
  • Identifiable sources. Well-researched stories gain credibility by identifying how, where, when, and from whom the information was obtained. This helps to set up degrees of trust and skepticism in the reader.

One important note: These approaches apply to more than text. Testers need to extend these patterns not only to written or mechanical forms, but to oral discourse.

I’ll have more suggestions and additional parallels between test reporting and newspapers in the next post in this series.

Why Pass vs. Fail Rates Are Unethical (Test Reporting Part 1)

Thursday, February 23rd, 2012

Calculating a ratio of passing tests to failing tests is a relatively easy task. If it is used as a means of estimating the state of a development project, though, the ratio is invalid, irrelevant, and misleading. At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional.

A passing test is no guarantee that the product is working correctly or reliably. Instead, a passing test is an observation that the program appeared to work correctly, under some set of conditions that we were conscious of (and many that we weren’t), using a selection of specific inputs (and not using the rest of an essentially infinite set), at some time (to which we will never return), on some machine (that was in a particular state at that time; we observed and understood only a fraction of that state), based on a handful of things that we were looking at (and a boatload of things that we weren’t looking at, not that we’d have any idea where or how to look for everything). At best, a passing test is a rumour of success. Take any of the parameters above, change one bit, and we could have had a failing test instead.

Meanwhile, a failing test is no guarantee of a failure in the product we’re testing. Someone may have misunderstood a requirement, and turned that misunderstanding into an inappropriate test procedure. Someone may have understood the requirement comprehensively, and erred in establishing the test procedure; someone else may have erred in following it. The platform on which we’re testing may be misconfigured, or there may be something wrong with something on the system, such that our failing test points to that problem and is not an indicator of a problem in our product. If the test was being assisted by automation, perhaps there was a bug in the automation. Our test tools may be misconfigured such that they’re not doing what we think they’re doing. When generating data, we may have misclassified invalid data as valid, or vice versa, and not noticed it. We may have inadvertently entered the wrong data. The timing of the test may be off, such that the system was not ready for the input we provided. There may be an as-yet-not-understood reason why the product is providing a result which seems incorrect to us, but which is in fact correct. A failing test is an allegation of failure.
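
As a tiny illustration of how a failing check can allege a problem that lives in the check rather than in the product, here’s a sketch in Python; the product code, the requirement, and the check are all invented:

```python
# The product is correct: the (hypothetical) requirement says totals
# are expressed in cents.
def total_price_cents(items):
    return sum(item["price_cents"] for item in items)

# A check written from a misreading of the requirement (dollars, not cents):
def check_total():
    items = [{"price_cents": 199}, {"price_cents": 250}]
    assert total_price_cents(items) == 4.49, "total looks wrong!"

try:
    check_total()
except AssertionError as allegation:
    # The check fails: an allegation. Investigation shows that the
    # misunderstanding is in the check, not in the product.
    print("failing check:", allegation)
```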

When we do the math based on these observations, the unit of measurement in which pass/fail rates are expressed is rumours over allegations. Is this a credible unit of measurement?

Neither rumours nor allegations are things. Uncertainties are not units with a valid natural scale against which they can be measured. One entity that we call a “test case”, whether passing or failing, may consist of a single operation, observation, and decision rule. Another entity called “test case” may consist of hundreds or thousands or millions of operations, all invisible, with thousands of opportunities for a tester to observe problems based not only on explicit knowledge, but also on tacit knowledge. Measuring while failing to account for clear differences between entities demolishes the construct validity of the measurement. Treating test cases—whether passing or failing—as though they were countable objects is a classic case of the reification fallacy. Aggregating scale-free, reified (non-)entities loses information about each instance, and loses information about any relationships between them. Some number of rumours doesn’t tell us anything about the meaning, significance, or value of any given passing test, nor does the aggregate tell us anything about the coverage that the passing tests provide, nor does the number tell us about missing coverage. Some number of allegations of which we’re aware doesn’t tell us anything about the seriousness of those allegations, nor does it tell us about undiscovered allegations. Dividing one invalid number by another invalid number doesn’t cancel the invalidity and produce a valid ratio.
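
To sketch the construct-validity problem, consider two invented “suites” with identical pass vs. fail ratios; everything here is hypothetical:

```python
# Two collections of things we might call "test cases",
# each counted as exactly one unit.
suite_a = ([("trivial single assertion", "pass")] * 9
           + [("trivial single assertion", "fail")])
suite_b = ([("million-operation end-to-end scenario", "pass")] * 9
           + [("possible data-corruption allegation", "fail")])

def pass_ratio(suite):
    passed = sum(1 for _, outcome in suite if outcome == "pass")
    return passed / len(suite)

print(pass_ratio(suite_a), pass_ratio(suite_b))  # 0.9 and 0.9
# The ratios are identical, although the entities being counted (and the
# information, coverage, and risk behind them) are radically different.
```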

When a student has got an answer wrong, the student is misinformed, and there’s a problem. What does the number of questions that the teacher asked have to do with it? When a manager interviews a candidate for a job, and halfway through the interview he suddenly starts shouting obscenities at her, will the number of questions the manager asked have anything to do with her hiring decision? If the battery on the Tesla Roadster is ever completely drained, the car turns into a brick with a $40,000 bill attached to it. Does anyone, anywhere, care about the number of passing tests that were done on the car?

If we are asked to produce pass/fail ratios, I would argue that it’s our professional responsibility to politely refuse to do it, and to explain why: we should not be offering our clients the service of self-deception and illusion, nor should our clients accept those services. The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.

So what’s the alternative? We’ll get to that tomorrow.

Do Not Close This Window (Or Click The Back Button)

Tuesday, February 7th, 2012

Here’s a classic case of poor design and user experience. Most of us have seen something like it. It happened to my wife yesterday. It will happen to you again soon, probably.

  • You’re making an online payment for some product or service.

  • You press a button that says something like “Submit Payment”.

  • A web page appears that says something like “Your payment is being submitted. Please do not close this window or click the Back button on your browser.” And that’s all the page says.

  • The page stays on your screen forever—or until you wince and close the browser window despite the specific instructions on the screen.

Here are some questions that a tester could ask when presented with this design, or with this experience:

  • “Or else what?” “Please do not close this window or click the Back button on your browser.” Or else what? What Bad Thing might happen? What Good Thing might fail to happen? This should lead directly to…

  • “What if…?” What if the sequence of actions doesn’t go as planned? What if a conversation between a server and a client is interrupted? (Note: the connections between any two systems are at best somewhat reliable. If you believe otherwise, a travelling testing consultant has two words for you: hotel WiFi.) At what points might interruptions happen? (Quick answer: all of them.) How is the state of the conversation being managed? Have we considered interruptions in our design? Have we tested for them? How does the system handle and recover from delayed or interrupted transactions? (A sketch of one defensive approach appears at the end of this post.)

  • “What should the customer reasonably expect?” It’s not hard to imagine a good deal of variance in the performance of a system, especially when its end nodes might be dozens of network hops apart from each other. Still, how long should a customer reasonably expect the transaction to take? At what point might it make sense for the customer to bail out?

  • “How would the customer know when it’s time to bail out?” If you can put a message on the screen, and if you know how long it would be reasonable to wait before bailing out, should the customer have to look at her watch? Might a countdown timer be helpful?

  • “Is there another way?” Is there another way for the customer to see that the transaction has completed successfully, or has failed? Do your design and the message you display make that option clear?

  • “What emotions might come up?” How might a customer feel uncertain, confused, frustrated, annoyed, mystified, impatient, surprised, helpless—or confident, impressed, reassured, or delighted—by what she sees and experiences? How might we use those potential feelings to help us guide our search for problems?

  • “Who can help?” If the transaction fails, who can help the customer out? How does the customer get in touch with that person? Is there a means of contacting customer support on that “Please wait…” screen?

  • “What meta-information is available?” I’ve worked with companies that have said, “We can’t put a customer support telephone number on that screen; customer support would be swamped!” What does that statement tell you about the system, about people’s impressions of its reliability, and about risk?

  • “How do we raise awareness of problems?” When a transaction on our site fails or is subject to an unreasonable delay, how do we find out? Is someone alerted immediately? Are failures aggregated? Buried in a log file somewhere? Who looks for problems, and how often do they look? Who hears about problems? How does that information get relayed to the people who design, maintain, and update the system? How might that information—or parts of it—not get relayed to those people?

This last question is important. Its answer provides part of the explanation for the fact that, after fifteen years of Web commerce, we’re still seeing designs like the one that appears at the top of this post.
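
As a postscript to the “What if…?” and countdown questions above, here’s a minimal sketch of one defensive approach, in Python. The gateway, its API, the idempotency-key behaviour, and the failure rate are all invented for illustration; this is a sketch of the idea, not any real payment provider’s interface:

```python
import random
import time
import uuid

def submit_payment(amount_cents, idempotency_key):
    """Stand-in for a remote gateway call; it times out half the time to
    simulate an unreliable connection (hotel WiFi, say)."""
    if random.random() < 0.5:
        raise TimeoutError("gateway did not respond")
    return {"status": "confirmed", "key": idempotency_key}

def pay_with_retries(amount_cents, timeout_s=30, retry_interval_s=2):
    # One idempotency key for the whole attempt: if the connection drops
    # after the server has committed the charge, a retry with the same key
    # can be recognized as a duplicate rather than as a second charge.
    key = str(uuid.uuid4())
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            return submit_payment(amount_cents, idempotency_key=key)
        except TimeoutError:
            remaining = int(deadline - time.monotonic())
            # A countdown answers "how long should I wait?" explicitly,
            # instead of an open-ended "please do not close this window".
            print(f"Still trying; giving up in about {remaining} seconds...")
            time.sleep(retry_interval_s)
    raise RuntimeError("payment status unknown; check your account or "
                       "contact support before retrying")

print(pay_with_retries(4999))
```

The specific mechanism matters less than what it implies about the design: interruption is treated as normal, the customer’s wait is bounded, and the system can report what actually happened instead of asking the customer to wait indefinitely.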