Blog Posts for the ‘Critical Thinking’ Category

On Green

Tuesday, July 7th, 2015

A little while ago, I took a look at what happens when a check runs red. Since then, comments and conversations with colleagues emphasized this point from the post: it’s overwhelmingly common first to doubt the red result, and then to doubt the check. A red check almost provokes a kind of panic for some testers, because it takes away a green check’s comforting—even narcotic—confirmation that Everything Is Going Just Fine.

Skepticism about any kind of test result is reasonable, of course. Before delivering painful news, it’s natural and responsible for a tester to examine the evidence for it carefully. All software projects—and all decisions about quality—are to some degree loaded with politics and emotions. This is normal. When a tester’s technical and social skills are strong, and self-esteem is high, those political and emotional considerations are manageable. When we encounter a red check—a suggestion that there might be a problem in the product—we must be prepared for powerful feelings, potential controversy, and cognitive dissonance all around. When people feel politically or emotionally vulnerable, the cognitive dissonance can start to overwhelm the desire to investigate the problem. Several colleagues have recalled circumstances in which intermittent red checks were considered sufficiently pesky by someone on the project team—even by testers themselves, on occasion—that the checks were ignored or disabled, as one might do with a cooking detector.

So what happens when checks return “green” results?

As my colleague James Bach puts it, checks are like motion detectors around the boundaries of our attention. When the check runs green, it’s easy to remain relaxed. The alarm doesn’t sound; the emergency lighting doesn’t come on; the dog doesn’t bark. If we’re insufficiently attentive and skeptical, every green check helps to confirm that everything is okay.

Kirk and Miller identified a big problem with confirmation:

Most of the technology of “confirmatory” non-qualitative research in both the social and natural sciences is aimed at preventing discovery. When confirmatory research goes smoothly, everything comes out precisely as expected. Received theory is supported by one more example of its usefulness, and requires no change. As in everyday social life, confirmation is exactly the absence of insight. In science, as in life, dramatic new discoveries must almost by definition be accidental (“serendipitous”). Indeed, they occur only in consequence of some mistake.

Kirk, Jerome, and Miller, Marc L., Reliability and Validity in Qualitative Research (Qualitative Research Methods). Sage Publications, Inc, Thousand Oaks, CA, 1985.

It’s the relationship between the checks and our models of them that matters here. When we have unjustified trust in our checks, we have the opposite problem that we have with the cooking detector: we’re unlikely to notice that the alarm doesn’t go off when it should. That is, we don’t pay attention. The good news is that being inattentive is optional. We can choose to hold on to the possibility that something might be wrong with our checks, and to identify the absence of red checks as meta-information; a suspicious silence, instead of a comforting one. The responsible homeowner checks the batteries on the smoke alarm, and the savvy explorer knows when to say “The forest is quiet tonight… maybe too quiet.”

By putting variation into our testing, we rescue ourselves from the possibility that our checks are too narrow, too specific, or cover too few kinds of risk. If you’re aware of the possibility that your alarm clock might fail to wake you, you’re more likely to take alternative measures to avoid sleeping too long.
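
To make that a little more concrete, here is a minimal sketch in Python (not from the original post; the round_trip function and its inputs are hypothetical stand-ins) contrasting a narrow, fixed check with a check that puts some variation into its inputs.

    import random

    # Hypothetical stand-in for the behaviour a check might watch;
    # the names here are illustrative, not from the original post.
    def round_trip(text):
        return text.encode("utf-8").decode("utf-8")

    # A narrow check: one fixed input, one fixed decision rule.
    # It can run green forever while problems hide just outside its beam.
    def narrow_check():
        return round_trip("hello") == "hello"

    # The same decision rule applied to varied inputs. Still a check,
    # but its beam sweeps a little more of the risk space on every run.
    def varied_check(runs=100):
        alphabet = "abc ABC 123 éü€中"
        for _ in range(runs):
            sample = "".join(random.choice(alphabet)
                             for _ in range(random.randint(0, 50)))
            if round_trip(sample) != sample:
                return False
        return True

    if __name__ == "__main__":
        print("narrow check:", narrow_check())
        print("varied check:", varied_check())

Even the varied check is still a motion detector with a limited beam. The point is not that variation proves anything; it reduces the odds that a comforting green silence goes unquestioned.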

Valuable conversations with James Bach and Chris Tranter contributed to this post.

Oracles Are About Problems, Not Correctness

Thursday, March 12th, 2015

As James Bach and I have been refining our ideas of testing, we’ve been refining our ideas about oracles. In a recent post, I referred to this passage:

Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.

The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.

—William E. Howden, A Survey of Dynamic Analysis Methods, in Software Validation and Testing Techniques, IEEE Computer Society, 1981

While we have a great deal of respect for the work of testing pioneers like Prof. Howden, there are some problems with this description of testing and its focus on correctness.

  • Correct output from a computer program is not an absolute; an outcome is only correct or incorrect relative to some model, theory, or principle. Trivial example: Even the mathematical rule “one divided by two equals one-half” is a heuristic for dividing things. In most domains, it’s true, but as in George Carlin’s joke, when you cut a crumb in two, you don’t have two half-crumbs; you have two crumbs.
  • A product can produce a result that is functionally correct, and yet still be deeply unsatisfactory to its user. Trivial example: a calculator returns the value “4” from the function “2 + 2”—and displays the result in white on a white background.
  • Conversely, a product can produce an incorrect result and still be quite acceptable. Trivial example: a computer desktop clock’s internal state and second hand drift a few tenths of a second each second, but the program resets itself to be consistent with an atomic clock at the top of every minute. The desktop clock almost never shows the right time precisely, but the human observer doesn’t notice and doesn’t really care. Another trivial example: a product might return a calculation inconsistent with its oracle in the tenth decimal place, when only the first two or three decimal places really matter.
  • The correct outcome of a program or function is not always known in advance. Some development and testing work, like some science, is done in an attempt to discover something new; to establish what a correct answer might look like; to explore a mathematical model; to learn about the limitations of a novel system. In such cases, our ideas of correctness or acceptability are not clear from the outset, and must be developed. (See Collins and Pinch’s The Golem books, which discuss the messiness and confusion of controversial science.) Trivial example: in benchmarking, correctness is not at issue. Comparison between one system and another (or versions of the same system at different times) is the mission of testing here.
  • As we’re developing and testing a product, we may observe things that are unexpected, under-described or completely undescribed. In order to program a machine to make an observation, we must anticipate that observation and encode it. The machine doesn’t imagine, invent, or learn, and a machine cannot produce an unanticipated oracle in response to an observation. By contrast, human observers continually learn and refine their ideas on what to observe. Sometimes we observe a problem without having anticipated it. Sometimes we become aware that we’re making a new observation—one that may or may not represent a problem. Distinct from checking, testing continually affords new things to observe. Testing prompts us to decide when new observations represent problems, and testing informs decisions about what to do about them.
  • An oracle may be in error, or irrelevant. Trivial examples: a program that checks the output of another program may have its own bugs. A reference document may be outdated. A subject matter expert who is usually a reliable source of information may have forgotten something.
  • Oracles might be inconsistent with each other. Even though we have some powerful models for it, temperature measurement in climatology is inherently uncertain. What is the “correct” temperature outdoors? In the sunlight? In the shade? When the thermometer is near a building or farther away? Over grass, or over pavement? Some of the issues are described in this remarkable article (read the comments, too).
  • Although we can demonstrate incorrectness in a program, we cannot prove a program to be correct. As Dijkstra put it, testing can only show the presence of errors, not their absence; and to go even deeper, Popper pointed out that theories can only be falsified, and not proven. Trivial example: No matter how many tests we run on that calculator, we can never know that it will always return 4 given the inputs 2 + 2; we can only infer that it will do so through induction, and induction can be deeply problematic. In Nassim Taleb’s example (cribbed from Bertrand Russell and David Hume), every day the turkey uses induction to reinforce his belief in the farmer’s devotion to the desires and interests of turkeys—until a few days before Thanksgiving, when the turkey receives a very sudden, unpleasant, and (alas for the turkey) momentary flash of insight.
  • Sometimes we don’t need to know the correct result to know that the observed result is wrong. Trivial example: the range of the cosine function runs from -1 to 1. I don’t need to know the correct value for cos(72) to know that an output of 4.2 is wrong. (Elaine Weyuker discusses this in a paper called “On Testing Nontestable Programs” (Department of Computer Science, Courant Institute of Mathematical Sciences, New York University): “Frequently the tester is able to state with assurance that a result is incorrect without actually knowing the correct answer.”) A sketch of such a partial oracle appears just after this list.
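
To illustrate that last point in code, here is a minimal sketch in Python. The cos_under_test function is a hypothetical stand-in for an implementation we might be examining; the partial oracle knows nothing about the correct answer, and only recognizes outputs that cannot possibly be right.

    import math

    # Hypothetical stand-in for an implementation under test: a series
    # approximation of cosine, deliberately truncated too early.
    def cos_under_test(x):
        return 1 - x**2 / 2 + x**4 / 24

    # A partial oracle: we may not know the correct value of cos(x),
    # but any output outside [-1, 1] is certainly wrong.
    def plausible_cosine(value):
        return -1.0 <= value <= 1.0

    if __name__ == "__main__":
        for x in (0.5, 2.0, 72.0):
            y = cos_under_test(x)
            verdict = "no problem recognized" if plausible_cosine(y) else "PROBLEM"
            print(f"cos({x}) -> {y:.4f}  [{verdict}]  (math.cos says {math.cos(x):.4f})")

Note that for x = 2.0 the oracle stays silent even though the result is off by a noticeable amount; a partial oracle can recognize some problems, but its silence proves nothing.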

Checking for correctness—especially when the test output is observed and evaluated mechanically or indirectly—is a risky business. All oracles are fallible. A “passing” test, based on comparison with a fallible oracle, cannot prove correctness, and no number of “passing” tests can do that. In this, a test is like a scientific experiment: an experiment’s outcome can falsify one theory while supporting another, but an experiment cannot prove a theory to be true. A million observations of white swans say nothing about the possibility that there might be black swans; a million passing tests, a million observations of correct behaviour, cannot eliminate the possibility that there might be swarms of bugs. At best, a passing test is essentially the observation of one more white swan. We urge those who rely on passing acceptance tests to remember this.

A check can suggest the presence of a problem, or can at best provide support for the idea that the program can work. But no matter what oracle we might use, a test cannot prove that a program is working correctly, or that the program will work. So what can oracles actually do for us?

If we invert the focus on correctness, we can produce a more robust heuristic. We can’t logically use an oracle to prove that a system is behaving correctly or that it will behave correctly, but we can use an oracle to help falsify the theory that it is behaving correctly. This is why, in Rapid Software Testing, we say that an oracle is a means by which we recognize a problem when it happens during testing.
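
As a rough illustration of that inversion (a sketch with hypothetical names, not anything drawn from RST materials), the comparable-algorithm oracle below is used to recognize inconsistencies worth investigating, rather than to certify correctness.

    # product_tax stands in for the behaviour we observe; reference_tax stands
    # in for a comparable algorithm used as an oracle. Both are fallible, and
    # both names are hypothetical placeholders.
    def product_tax(amount):
        return round(amount * 0.13, 2)

    def reference_tax(amount):
        return round(amount * 0.13, 2)

    def recognize_problems(amounts):
        findings = []
        for amount in amounts:
            observed, comparable = product_tax(amount), reference_tax(amount)
            if observed != comparable:
                # An inconsistency signals a possible problem: in the product,
                # in the reference, or in our idea of what "consistent" means.
                findings.append((amount, observed, comparable))
        return findings

    if __name__ == "__main__":
        findings = recognize_problems([0.99, 10.00, 19.99, 1e9])
        print(findings or "no problems recognized, which is not proof of correctness")

When the two functions agree, all we can honestly say is that this particular oracle recognized no problem on these particular inputs; when they disagree, we have a reason to investigate.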

The Rapid Software Testing Namespace

Monday, February 2nd, 2015

Just as no one has the right to tell you what language to speak at home, nobody outside of your project has the authority to tell you how to speak inside your project. Every project develops its own namespace, so to speak, and its own formal or informal criteria for naming things inside it. Rapid Software Testing is, among other things, a project in that sense. For years, James Bach and I have been developing labels for ideas and activities that we talk about in our work and in our classes. While we’re happy to adopt useful ideas and terms from other places, we have the sole authority (for now) to set the vocabulary formally within Rapid Software Testing (RST). We don’t have the right to impose our vocabulary on anyone else. So what do we do when other people use a word to mean something different from what we mean by the same word?

We invoke “the RST namespace” when we talk about testing and checking, for example, so that we can speak clearly and efficiently about ideas that we bring up in our classes and in the practice of Rapid Software Testing. From time to time, we also try to make it clear why we use words in a specific way. For example, we make a big deal about testing and checking. We define checking as “the process of making evaluations by applying algorithmic decision rules to specific observations of a product” (and a check is an instance of checking). We define testing as “the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.” (and a test is an instance of testing).
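
To make the distinction concrete, here is one check sketched in Python; the product_add function and the chosen values are hypothetical placeholders, not RST material.

    # A check, in the sense described above: a specific observation of a
    # product, evaluated by an algorithmic decision rule. The "product"
    # here is a hypothetical placeholder.
    def product_add(a, b):
        return a + b

    def check_two_plus_two():
        observation = product_add(2, 2)  # a specific observation of the product
        return observation == 4          # an algorithmic decision rule: True or False

    if __name__ == "__main__":
        print("check result:", check_two_plus_two())

Everything around that check, such as deciding it was worth writing, studying the product to know what to observe, interpreting a red or green result, and choosing what to do next, belongs to testing, and stays with the human.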

This is in contrast with the ISTQB, which in its Glossary defines “test” as “a set of test cases”—along with “test case” as “a set of input values, execution preconditions, expected results and execution postconditions, developed for a particular objective or test condition, such as to exercise a particular program path or to verify compliance with a specific requirement.” Interesting, isn’t it: the ISTQB’s definition of test looks a lot like our definition of check. In Rapid Software Testing, we prefer to put learning and experimentation (rather than satisfying requirements and demonstrating fitness for purpose) at the centre of testing. We prefer to think of a test as something that people do as an act of investigation; as a performance, not as an artifact.

Because words convey meaning, we converse (and occasionally argue, sometimes passionately) about the value we see in the words we choose and the ways we think of them. Our aim is to describe things that people haven’t noticed, or to make certain distinctions clear, with the goal of reducing the risk that someone will misunderstand—or miss—something important. Nonetheless, we freely acknowledge that we have no authority outside of Rapid Software Testing. There’s nothing to stop people from using the words we use in a different way; there are no language police in software development. So we’re also willing to agree to use other people’s labels for things when we’ve had the conversation about what those labels mean, and have come to agreement.

People who tout a “common language” often mean “my common language”, or “my namespace”. They also have the option to certify you as being able to pass a vocabulary test, if anyone thinks that’s important. We don’t. We think that it’s important for people to notice when words are being used in different ways. We think it’s important for people to become polyglots—and that often means working out which namespace we might be using from one moment to the next. In our future writing, conversation, classes, and other work, you might wonder what we’re talking about when we refer to “the RST namespace”. This post provides your answer.

The Pause

Thursday, January 16th, 2014

I would like to remind people involved in testing that—after an engaged brain—one of our most useful testing tools is… the pause.

A pause is precisely the effect delivered by the application of four little words: Huh? Really? And? So? Each word prompts a pause, a little breathing space in which questions oriented towards critical thinking have time to come to mind.

  • Wait…huh? Did I hear that properly? Does it mean what I think it means?
  • Um…Really? Does that match with my experience and understanding of the world as it is, or how it might be? What are some plausible alternative interpretations for what I’ve just heard or read? How might we be fooled by it?
  • Just a sec… And? What additional information might be missing? What other meanings could I infer?
  • Okay…So? What are some consequences or ramifications of those interpretations? What might follow? What do we do—or say, or ask—next?

Those four words nestle nicely between the four elements of the Satir Interaction Model—Intake (huh?), Meaning (really? and?), Significance (so?), and Response.

We recently added “And” to the earlier set of “Huh? Really? So?” upon which James Bach elaborates here.

Severity vs. Priority

Tuesday, March 5th, 2013

Another day has dawned on Planet Earth, so another tester has used LinkedIn to ask about the difference between severity and priority.

The reason the tester is asking is, probably, that there’s a development project, and there’s probably a bug tracking system, and it probably contains fields for both severity and priority (and probably as numbers). The tester has probably been told to fill in each field as part of his bug report; and the tester probably hasn’t been told specifically what the fields mean—or the tester is probably uncertain about how the numbers map to reality.

“Severity” is the noun associated with the adjective “severe”. In my Concise Oxford Dictionary, “severe” has six listed meanings. The most relevant one for this context is “serious, critical”. Severity, with respect to a problem, is basically how big a problem is; how much trouble it’s going to cause. If it’s a big problem, it gets marked as high severity (oddly, that’s typically a low number), and if it’s not a big deal, it gets marked as low severity, typically with a higher number. So, severity is a simple concept. Except…

When we’re testing, and we think we see a problem, we don’t see everything about that problem. We see what some people call a failure, a symptom. The symptom we observe may be a manifestation of a coding error, or of a design issue, or of a misunderstood or mis-specified requirement. We see a symptom; we don’t see the cause or the underlying fault, as the IEEE and others might call it.

Whatever we’re observing may be a terrible problem for some user or some customer somewhere—or the customer might not notice or care. Here’s an example: in Microsoft Word 2010’s Insert Page Number feature, choose small Roman numerals as your format, and use the value 32768 (rendered in Roman numerals). Word hangs on my machine, and on every machine I’ve tried this trick on (you can try it too). Now: is this a Severity 1 bug? It certainly appears to be severe, considering the symptom. A hang is a severe problem, in terms of reliability.

But wait… considering that vanishingly few people use lower-case Roman numeral page numbers larger than, say, a few hundred, is the problem really that severe? In terms of capability, it’s probably not a big deal; there’s a very low probability that any normal user would need to use that feature and would encounter the problem.

Except… considering the fact that a problem like this could—at least in theory—present an opportunity for a hacker to bring down an application or, worse, take control of a system, maybe this is a devastatingly severe problem.

There’s yet another factor to consider here. We all suffer to some degree from a bias that can play out in testing. This might be a form of representativeness bias, or of assimilation bias, or of correspondence bias, but none of these seems to be a perfect fit. I think of it as the Heartburn Heuristic, in honour of my dad: for a year or more, he perceived minor heartburn—a seemingly trivial symptom of a seemingly minor gastric reflux problem. What my (late) dad didn’t count on was that, from the symptoms, it’s hard to tell the difference between gastric reflux and esophageal cancer.

The Heartburn Heuristic is a reminder that it’s easy to believe—falsely—that a minor symptom is naturally associated with a minor problem. It’s similarly easy to believe that a serious problem will always be immediately and dramatically obvious. It’s also easy to believe that a problem that looks like big trouble is big trouble, even when a fast one-byte fix will make the problem go away forever.

We also become easily confused about the relationships among the prominence of the symptom, the impact on the customer, the difficulty associated with fixing the problem, and the urgency of the fix relative to the urgency of releasing the product. (Look at the Challenger and Columbia incidents as canonical examples of how this plays out in engineering, emotions, and politics.) In reality, there’s no reason to believe in a strong correlation between the prominence of a problem and its severity, or between the potential impact of a problem and the difficulty of a fix. A missing character in some visible field may be a design limitation or a display formatting bug, or it may be a sign of corruption in the database.

Of course, since we’re fallible human beings, looking for unknown problems in an infinite space with finite time to do it, the most severe problems in a product can escape our notice entirely. So based on the symptom alone, at best we can only guess at the severity of the problem. That’s bad enough, but the problem of classifying severity gets even worse.

Just as we have biases and cognitive shortcomings, other people on the project team will tend to have them too. The tester’s credibility may be called into question if she places a high severity number on what others consider to be a low severity problem. Severity, after all, is subject to the Relative Rule: severity is not an attribute of the problem, but a relationship between the problem and some person at some time.

To the end user who never uses the feature, the Roman numeral hang is not a big deal. To the end user who actually experiences a hang and possible loss of time or data, this could be a deeply annoying problem. To a programmer who takes great pride in his craft, a hang is a severe problem. To a programmer who is being evaluated on the number of Severity 1 problems in the product (a highly dubious way to measure the quality of a programmer’s work, but it happens), there is a strong motivation to make sure that the Roman numeral hang is classified as something other than a Severity 1 problem. To a program manager who has a few months of development time available before release, our Roman numeral problem might be a problem worth fixing. To a program manager who is facing a one-week deadline before the product has to ship (thanks to retail and stock market pressure), this is a trivial bug. (Trust me on that; I’ve been a program manager.)

In light of all this, what is a tester to do? My personal preference (based on experience as a tester, as a programmer, and as a program manager) is to encourage testers to stay out of the severity business if possible. By all means, I provide the project team with a clear description of the symptom, the quality criteria that could be threatened by it, and ideas on how the problem could have an effect on people who matter. I might provide a guess, based on inference, as to the underlying cause. I’ll be careful to frame it as a guess, unless I’ve seen the source code and understand the problem clearly.

My default assumption is that I can’t go by appearances, and that every symptom has an unknown cause with potentially harsh consequences. I assume that every problem is guilty until proven innocent—that it’s a potentially severe problem until the code has been examined, the risk models revisited, and the team consulted.

I’m especially wary of assigning a low severity on a bug report based on an apparently trivial symptom. If I haven’t seen the code, I try to avoid saying that something is a trivial problem; if pressed, I’ll say it looks like a trivial problem.

If I’m forced to enter a number into a bug reporting form, I’ll set the severity of a problem at its highest level unless I have substantial understanding and reason to see the problem as being insignificant. In order to avoid the political cost of seeming like a Cassandra, I’ll make sure my clients are aware of my fundamental uncertainty about severity: the best I can provide is a guess, and if I want to err, I’d rather err on the side of overestimating severity rather than underestimating it and thereby downplaying an important problem. As a solution that feels better to me, I might also request an “unclassified” option in the Severity field, so that I can move on quickly and leave the classification to the team, to the programmers, and to the program managers.

As for priority: priority is the order in which someone wants things to be done. Perhaps some people use the priority field to rank the order in which particular problems should be discussed, but my experience is that, usually, “priority” is a tester’s assessment of how important it is to fix the problem—a kind of ranking of what should be fixed first.

Again based on my experience as tester, programmer, and program manager, I don’t see this as being a tester’s business at all. Deciding what should be done on a programming or business level is the job of the person with authority and responsibility over the work, in collaboration with the people who are actually doing the work. When I’m a tester, there is one exception: if I see a problem that is preventing me from doing further testing, I will request that the fix for that problem be fast-tracked (and I’ll outline the risks of not being able to test that area of the product). As tester, one of the most important aspects of my report is the set of things that make testing harder or slower, the things that give bugs more time and more opportunity to hide. Nonetheless, deciding what gets fixed first is for those who do the managing and the fixing.

In the end, I believe that decisions about severity and priority are business and management decisions. As testers, our role is to provide useful information to the decision-makers, but I believe we should let development managers manage development.