Blog Posts for the ‘Testing vs. Checking’ Category

s/automation/programming/

Thursday, June 2nd, 2016

Several years ago, in one of his early insightful blog posts, Pradeep Soundararajan said this:

“The test doesn’t find the bug. A human finds the bug, and the test plays a role in helping the human find it.”

More recently, Pradeep said this:

Instead of saying, “It is programmed”, we say, “It is automated”. A world of a difference.

It occurred to me instantly that it could make a world of difference, so I played with the idea in my head.

Automated checks? “Programmed checks.” 

Automated testing? “Programmed testing.” 

Automated tester?  “Programmed tester.” 

Automated test suite?  “Programmed test suite.”

Let’s automate to do all the testing?  “Let’s write programs to do all the testing.”

Testing will be faster and cheaper if we automate. “Testing will be faster and cheaper if we write programs.”

Automation will replace human testers. “Writing programs will replace human testers.”

To me, the substitutions all generated a different perspective and a different feeling from the originals. When we don’t think about it too carefully, “automation” just happens; machines “do” automation. But when we speak of programming, our knowledge and experience remind us that we need people to do programming, that good programming can be hard, and that good programming requires skill. And even good programming is vulnerable to errors and other problems.

So by all means, let’s use hardware and software tools skilfully to help us investigate the software we’re building.  Let’s write and develop and maintain programs that afford deeper or faster insight into our products (that is, our other programs) and their behaviour.  Let’s use and build tools that make data generation, visualisation, analysis, recording, and reporting easier. Let’s not be dazzled by writing programs that simply get the machinery to press its own buttons; let’s talk about how we might use our tools to help us reveal problems and risks that really matter to us and to our clients.  

And let’s consider the value and the cost and the risk associated with writing more programs when we’re already rationally uncertain about the programs we’ve got.

You Are Not Checking

Sunday, April 10th, 2016

Note: This post refers to testing and checking in the Rapid Software Testing namespace. This post has received a few minor edits since it was first posted.

For those disinclined to read Testing and Checking Refined, here are the definitions of testing and checking that James Bach and I use within the Rapid Testing namespace.

Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.

(A test is an instance of testing.)

Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.

(A check is an instance of checking.)
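To make the distinction concrete, here is a minimal sketch (in Python; the product function and the expected value are hypothetical stand-ins, not anything from a real product) of what a check might look like once it has been encoded: a specific observation of the product, an algorithmic decision rule, and a binary outcome.

```python
# A minimal sketch of a check. Everything inside these functions is algorithmic;
# deciding what to check, and what a red or green result means, is not.

def product_under_test(a, b):
    """Hypothetical stand-in for some behaviour of the product being checked."""
    return a + b

def check_addition():
    observation = product_under_test(2, 2)   # a specific observation of the product
    decision = (observation == 4)            # an algorithmic decision rule
    return "green" if decision else "red"    # the check's entire output

if __name__ == "__main__":
    print(check_addition())
```

Everything within the check can be performed by machinery; choosing to encode it, deciding when to run it, and interpreting its outcome remain human testing activities.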

You are not checking. Well, you are probably not checking; you are certainly not only checking. You might be trying to do checking. Yet even if you are being asked to do checking, or if you think you’re doing checking, you will probably fail to do checking, because you are a human. You can do things that could be encoded as checks, but you will do many other things too, at the same time. You won’t be able to restrict yourself to doing only checking.

Checking is a part of testing that can be performed entirely algorithmically. Remember that: checking is a part of testing that can be performed entirely algorithmically. The exact parallel to that in programming is compiling: compiling is a part of programming that can be performed entirely algorithmically. No one talks of “automated compiling”, certainly not anymore. It is routine to think of compiling as an activity performed by a machine. We still speak of “automated checking” because we have only recently introduced “checking” as a term of art. We say “automated checking” to emphasize that checking by definition can be, and in practice probably should be, automated.

If you are trying to do only checking, you will screw it up, because you are not a robot. Your humanity—your faculties that allow you to make unprogrammed observations and evaluations; your tendency to vary your behaviour; your capacity to identify unanticipated risks—will prevent you from confining yourself to an algorithm. As a human tester—not a robot—you’re essentially incapable of sticking strictly to what you’ve been programmed to do. You will inevitably think or notice or conjecture or imagine or learn or evaluate or experiment or explore. At that point, you will have jumped out of checking and into the wider activities of testing. (What you do with the outcome of your testing is up to you, but we’d say that if your testing produces information that might matter to a client, you should probably follow up on it and report it.)

Your unreliability and your variability are, for testing, good things. Human variability is a big reason why you’ll find bugs even when you’re following a script that the scriptwriter—presumably—completed successfully. (In our experience, if there’s a test script, someone has probably tried to perform it and has run through it successfully at least once.)

So, unless you’ve given up your humanity, it is very unlikely that you are only checking. What’s more likely is that you are testing. There are specific observations that you may be performing, and there are specific decision rules that you may be applying. Those are checks, and you might be performing them as tactics in your testing. Many of your checks will happen below the level of your awareness. But just as it would be odd to describe someone’s activities at the dinner table as “biting” when they were eating, it would be odd to say that you were “checking” when you were testing.

Perhaps another one of your tactics, while testing, is programming a computer—or using a computer that someone else has programmed—to perform checking. In Rapid Software Testing, people who develop checks are generally called toolsmiths, or technical testers—people who are not intimidated by technology or code.

Remember: checking is a part of testing that can be performed entirely algorithmically. Therefore, if you’re a human, neither instructing the machine to start checking nor developing checks is “doing checking”.

Testers who develop checks are not “doing checking”. The checks themselves are algorithmic, and they are performed algorithmically by machinery, but the testers are not following algorithms as they develop checks, or deciding that a check should be performed, or evaluating the outcome of the checking. Similarly, programmers who develop classes and functions are not “doing compiling”. Those programmers are not following algorithms to produce code.

Toolsmiths who develop tools and frameworks for checking, and who program checks, are not “doing checking” either. Developers who produce tools and compilers for compiling are not “doing compiling”. Testers who produce checking tools should be seen as skilled specialists, just as developers who produce compilers are seen as skilled specialists. In order to develop excellent checks and excellent checking tools, a tester needs two distinct kinds of expertise: testing expertise, and programming and development expertise.

Testers apply checking as a tactic of testing. Checking is embedded within a host of testing activities: modeling the test space; identifying risks; framing questions that can be asked about the product; encoding those questions in terms of algorithmic actions, observations, outcomes, and reports; choosing when the checking should be done; and interpreting the outcome of checks, whether green or red.

Notice that checking does not find bugs. Testers—or developers temporarily in a testing role or a testing mindset—who apply checking find bugs, and the checks (and the checking) play a role in finding bugs.

In all of our talk about testing and checking, we are not attempting to diminish the role of people who create and use testing tools, including checks and checking; nothing could be further from our intention. Tools are vital to testing. Tools support testing.

We are, however, asking that testing not be reduced to checking. Checking is not testing, just as compiling is not software development. Checking may be a very important tactic in our testing, and as such, it is crucial to consider how it can be done expertly to assist our testing. It is important to consider the extents and limits of what checking can do for us. Testing a whole product while being fixated on checking is like developing a whole product while being fixated on compiling.

A Context-Driven Approach to Automation in Testing

Sunday, January 31st, 2016

(We interrupt the previously-scheduled—and long—series on oracles for a public service announcement.)

Over the last year James Bach and I have been refining our ideas about the relationships between testing and tools in Rapid Software Testing. The result is this paper. It’s not a short piece, because it’s not a light subject. Here’s the abstract:

There are many wonderful ways tools can be used to help software testing. Yet, all across industry, tools are poorly applied, which adds terrible waste, confusion, and pain to what is already a hard problem. Why is this so? What can be done? We think the basic problem is a shallow, narrow, and ritualistic approach to tool use. This is encouraged by the pandemic, rarely examined, and absolutely false belief that testing is a mechanical, repetitive process.

Good testing, like programming, is instead a challenging intellectual process. Tool use in testing must therefore be mediated by people who understand the complexities of tools and of tests. This is as true for testing as for development, or indeed as it is for any skilled occupation from carpentry to medicine.

You can find the article here. Enjoy!

On Green

Tuesday, July 7th, 2015

A little while ago, I took a look at what happens when a check runs red. Since then, comments and conversations with colleagues emphasized this point from the post: it’s overwhelmingly common first to doubt the red result, and then to doubt the check. A red check almost provokes a kind of panic for some testers, because it takes away a green check’s comforting—even narcotic—confirmation that Everything Is Going Just Fine.

Skepticism about any kind of test result is reasonable, of course. Before delivering painful news, it’s natural and responsible for a tester to examine the evidence for it carefully. All software projects—and all decisions about quality—are to some degree loaded with politics and emotions. This is normal. When a tester’s technical and social skills are strong, and self-esteem is high, those political and emotional considerations are manageable. When we encounter a red check—a suggestion that there might be a problem in the product—we must be prepared for powerful feelings, potential controversy, and cognitive dissonance all around. When people feel politically or emotionally vulnerable, the cognitive dissonance can start to overwhelm the desire to investigate the problem. Several colleagues have recalled circumstances in which intermittent red checks were considered sufficiently pesky by someone on the project team—even by testers themselves, on occasion—that the checks were ignored or disabled, as one might do with a “cooking detector” (a smoke alarm so sensitive that it goes off whenever someone cooks).

So what happens when checks return “green” results?

As my colleague James Bach puts it, checks are like motion detectors around the boundaries of our attention. When the check runs green, it’s easy to remain relaxed. The alarm doesn’t sound; the emergency lighting doesn’t come on; the dog doesn’t bark. If we’re insufficiently attentive and skeptical, every green check helps to confirm that everything is okay.

Kirk and Miller identified a big problem with confirmation:

Most of the technology of “confirmatory” non-qualitative research in both the social and natural sciences is aimed at preventing discovery. When confirmatory research goes smoothly, everything comes out precisely as expected. Received theory is supported by one more example of its usefulness, and requires no change. As in everyday social life, confirmation is exactly the absence of insight. In science, as in life, dramatic new discoveries must almost by definition be accidental (“serendipitous”). Indeed, they occur only in consequence of some mistake.

Kirk, Jerome, and Miller, Marc L., Reliability and Validity in Qualitative Research (Qualitative Research Methods). Sage Publications, Inc, Thousand Oaks, CA, 1985.

It’s the relationship between our checks and our models of them that matters here. When we have unjustified trust in our checks, we have the opposite of the problem we have with the cooking detector: we’re unlikely to notice that the alarm doesn’t go off when it should. That is, we don’t pay attention. The good news is that being inattentive is optional. We can choose to hold on to the possibility that something might be wrong with our checks, and to identify the absence of red checks as meta-information: a suspicious silence, instead of a comforting one. The responsible homeowner checks the batteries on the smoke alarm, and the savvy explorer knows when to say “The forest is quiet tonight… maybe too quiet.”

By putting variation into our testing, we rescue ourselves from the possibility that our checks are too narrow, too specific, or cover too few kinds of risk. If you’re aware of the possibility that your alarm clock might fail to wake you, you’re more likely to take alternative measures to avoid sleeping too long.
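As one illustration of building variation into a check, here is a sketch (in Python; the discount function, the input ranges, and the decision rules are hypothetical, invented for this example) that runs the same decision rules over many varied inputs rather than one fixed case:

```python
import random

def discount_price(price, percent):
    """Hypothetical product behaviour; in practice this would exercise the real product."""
    return price * (1 - percent / 100.0)

def run_varied_checks(trials=100, seed=None):
    """Apply the same decision rules to randomly varied inputs."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        price = round(rng.uniform(0.01, 10000.0), 2)
        percent = rng.choice([0, 5, 10, 25, 50, 100])
        result = discount_price(price, percent)
        # Decision rules: a discounted price is never negative
        # and never exceeds the original price.
        if not (0.0 <= result <= price):
            failures.append((price, percent, result))
    return failures

if __name__ == "__main__":
    problems = run_varied_checks(seed=42)
    print("red" if problems else "green")
```

The variation here is still narrow; it covers only the inputs and properties someone thought to encode, so it widens the motion detector's field of view a little, but it does not replace a human noticing something the check was never designed to observe.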

Valuable conversations with James Bach and Chris Tranter contributed to this post.

On Scripting

Saturday, July 4th, 2015

A script, in the general sense, is something that constrains our actions in some way.

In common talk about testing, there’s one fairly specific and narrow sense of the word “script”—a formal sequence of steps that are intended to specify behaviour on the part of some agent—the tester, a program, or a tool. Let’s call that “formal scripting”. In Rapid Software Testing, we also talk about scripts as something more general, in the same kind of way that some psychologists might talk about “behavioural scripts”: things that direct, constrain, or program our behaviour in some way. Scripts of that nature might be formal or informal, explicit or tacit, and we might follow them consciously or unconsciously. Scripts shape the ways in which people behave, influencing what we might expect people to do in a scenario as the action plays out.

As James Bach says in the comments to our blog post Exploratory Testing 3.0, “By ‘script’ we are speaking of any control system or factor that influences your testing and lies outside of your realm of choice (even temporarily). This does not refer only to specific instructions you are given and that you must follow. Your biases script you. Your ignorance scripts you. Your organization’s culture scripts you. The choices you make and never revisit script you.” (my emphasis, there)

When I’m driving to a party out in the country, the list of directions that I got from the host scripts me. Many other things script me too. The starting time of the party—combined with cultural norms that establish whether I should be very prompt or fashionably late—prompts me to leave home at a certain time. The traffic laws and the local driving culture condition my behaviour and my interactions with other people on the road. The marked detour along the route scripts me, as do the weather and the driving conditions. My temperament and my current emotional state script me too. In this more general sense of “scripting”, any activity can become heavily scripted, even if it isn’t written down in a formal way.

Scripts are not universally bad things, of course. They often provide compelling advantages. Scripts can save cognitive effort; the more my behaviour is scripted, the less I have to think, do research, make choices, or get confused. In my driving example, a certain degree of scripting helps me to get where I’m going, to get along with other drivers, and to avoid certain kinds of trouble. Still, if I want to get to the party without harm to myself or other people, I must bring my own agency to the task and stay vigilant, present, and attentive, making conscious and intentional choices. Scripts might influence my choices, and may even help me make better choices, but they should not control me; I must remain in control. Following a script means giving up engagement and responsibility for that part of the action.

From time to time, testing might include formal testing—testing that must be done in a specific way, or to check specific facts. On those occasions, formal scripting—especially the kind of formal script followed by a machine—might be a reasonable approach enabling certain kinds of tasks and managing them successfully. A highly scripted approach could be helpful for rote activities like operating the product following explicitly declared steps and then checking for specific outputs. A highly scripted approach might also enable or extend certain kinds of variation—randomizing data, for example. But there are many other activities in testing: learning about the product, designing a test strategy, interviewing a domain expert, recognizing a new risk, investigating a bug—and dealing with problems in formally scripted activities. In those cases, variability and adaptation are essential, and an overly formal approach is likely to be damaging, time-consuming, or outright impossible. Here’s something else that is almost never formally scripted: the behaviour of normal people using software.
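For the kind of rote activity just described, a formally scripted check might look something like this sketch (in Python; the shopping-cart object and its behaviour are hypothetical stand-ins for a real product): explicitly declared steps, followed by a check for a specific output.

```python
# A sketch of a formally scripted check: declared steps, then a check for a
# specific output. The Cart class is a hypothetical stand-in for the product.

class Cart:
    def __init__(self):
        self.items = {}

    def add(self, sku, quantity):
        self.items[sku] = self.items.get(sku, 0) + quantity

    def remove(self, sku):
        self.items.pop(sku, None)

    def count(self):
        return sum(self.items.values())

def scripted_cart_check():
    cart = Cart()            # Step 1: start with an empty cart.
    cart.add("SKU-1", 2)     # Step 2: add two of one item...
    cart.add("SKU-2", 1)     # Step 3: ...and one of another.
    cart.remove("SKU-2")     # Step 4: remove the second item.
    # Check: a specific, anticipated output.
    return "green" if cart.count() == 2 else "red"

if __name__ == "__main__":
    print(scripted_cart_check())
```

The steps and the expected output are fixed in advance; deciding that these particular steps and this particular output matter, and noticing anything the script does not mention, remain with the person.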

Notice on the one hand that formal testing is, by its nature, highly scripted; most of the time, scripting constrains or even prevents exploration by constraining variation. On the other hand, if you want to make really good decisions about what to test formally, how to test formally, why to test formally, it helps enormously to learn about the product in unscripted and informal ways: conversation, experimentation, investigation… So excellent scripted testing and excellent checking are rooted in exploratory work. They begin with exploratory work and depend on exploratory work. To use language as Harry Collins might, scripted testing is parasitic on exploration.

We say that any testing worthy of the name is fundamentally exploratory. We say that to test a product means to evaluate it by learning about it through experimentation and exploration. To explore a product means to investigate it, to examine it, to create and travel over maps and models of it. Testing includes studying the product, modeling it, questioning it, making inferences about it, operating it, observing it. Testing includes reporting, which itself includes choosing what to report and how to contextualize it. We believe these activities cannot be encoded in explicit procedural scripting in the narrow sense that I mentioned earlier, even though they are all scripted to some degree in the more general sense. Excellent testing—excellent learning—requires us to think and to make choices, which includes thinking about what might be scripting us, and deciding whether to control those scripts or to be controlled by them. We must remain aware of the factors that are scripting us so that we can manage them, taking advantage of them when they help and resisting them when they interfere with our mission.

On Red

Friday, June 26th, 2015

What actually happens when a check returns a “red” result?

Some people might reflexively say “Easy: we get a red; we fix the bug.” Yet that statement is too simplistic, concealing a good deal of what really goes on. The other day, James Bach and I transpected on the process. Although it’s not the same in every case, we think that for responsible testers, the process actually goes something more like this:

First, we ask, “Is the check really returning a red?” The check provides us with a result which signals some kind of information, but by design the check hides lots of information too. The key here is that we want to see the problem for ourselves and apply human sensemaking to the result and to the possibility of a real problem.

Sensemaking is not a trivial subject. Karl Weick, in Sensemaking in Organizations, identifies seven elements of sensemaking, saying it is:

  • grounded in identity construction (which means that making sense of something is embedded in a set of “who-am-I-and-what-am-I-doing here?” questions);
  • social (meaning that “human thinking and social functioning are essential aspects of each other”, and that making sense of something tends to be oriented towards sharing the meanings);
  • ongoing (meaning that it’s happening all the time, continuously; yet it’s…)
  • retrospective (meaning that it’s based on “what happened” or “what just happened?”; even though it’s happening in the present, it’s about things that have happened in the past, however recent that might be);
  • enactive of sensible environments (meaning that sensemaking is part of a process in which we try to make the world a more understandable place);
  • based on plausibility, rather than accuracy (meaning that when people make sense of something, they tend to rely on heuristics, rather than things that are absolutely 100% guaranteed to be correct)
  • focused on extracted cues (extracted cues are simple, familiar bits of information that lead to a larger sense of what is occurring, like “Flimsy!->Won’t last!” or “Shouting, with furrowed brow!->Angry!” or “Check returns red!->Problem!”).

The reason that we need to apply sensemaking is that it’s never clear that a check is signaling an actual problem in the product. Maybe there’s a problem in the instrumentation, or a mistake in the programming of the check. So when we see a “red” result, we try to make sense of it by seeking more information (or examining other extracted cues, as Weick might say).

  • We might perform the check a second time, to see if we’re getting a consistent result. (Qualitative researchers would call this a search for diachronic reliability; are we getting the same result over time?)
  • If the second result isn’t consistent with the first, we might perform the check again several times, to see if the result recurs only occasionally and intermittently.
  • We might look for secondary indicators of the problem, other oracles or other evidence that supports or refutes the result of the check.
  • If we’re convinced that the check is really red, we then ask “where is the trouble?” The trouble might be in the product or in the check.

    • We might inspect the state of our instrumentation, to make sure that all of the equipment is in place and set up correctly.
    • We might work our way back through the records produced by the check, tracing through log files for indications of behaviours and changes of state, and possible causes for them.
    • We might perform the check slowly, step by step, observing more closely to see where things went awry. We might step through the code in the debugger, or perform a procedure interactively instead of delegating the activity to the machinery.
    • We might perform the check with different values, to assess the extents or limits of the problem.
    • We might perform the check using different pacing or different sequences of actions to see if time is a factor.
    • We might perform the check on other platforms, to see if the check is revealing a problem of narrow or more general scope. (Those qualitative researchers would call this a search for synchronic reliability; could the same thing happen at the same time in different places?)
    • Next, if the check appears to be producing a result that makes sense—the check is accurately identifying a condition that we programmed it to identify—it might be easy to conclude that there’s a bug, and now it’s time to fix it. But we’re not done, because although the check is pointing to an inconsistency between the actual state of the product and some programmed result, there’s yet another decision to be made: is that inconsistency a problem with respect to something that someone desires? In other words, does that inconsistency matter?

      • Maybe the check is archaic, checking for some condition that is no longer relevant, and we don’t need it any more.
      • Maybe the check is one of several that are still relevant, but this specific check is wrong in some specific respect. Perhaps something that used to be permitted is now forbidden, or vice versa.
      • When the check returns a binary result based on a range of possible results, we might ask “is the result within a tolerable range?” In order to do that, we might have to revisit our notions of what is tolerable. Perhaps the output deviated from the range insignificantly, or momentarily; that is, the check may be too restrictive or too fussy.
      • Maybe the check has not been set up with explicit pass/fail criteria, but to alert us about some potentially interesting condition that is not necessarily a failure. In this case, the check doesn’t represent a problem per se, but rather a trigger for investigation.
      • We might look outside of the narrow scope of the check to see if there’s something important that the check has overlooked. We might do this interactively, or by applying different checks.

      In other words: after making an observation and concluding that it fits the facts, we might choose to apply our tacit and explicit oracles to make a different sense of the outcome. Rather than concluding “The product doesn’t work the way we wanted it to”, we may realize that we didn’t want the product to do that after all. Or we might repair the outcome (as Harry Collins would put it) by saying, “That check sometimes appears to fail when it doesn’t, just ignore it” or “Oh… well, probably this thing happened… I bet that’s what it was… don’t bother to investigate.”

      In the process of developing the check, we were testing (evaluating the product by learning about it through exploration and experimentation). The check itself happens mechanically, algorithmically. As it does so, it projects a complex, multi-dimensional space down to a single-dimensional result, “red” or “green”. In order to make good use of that result, we must unpack the projection. After a red result, the check turns into the centre of a test as we hover over it and examine it. In other words, the red check result typically prompts us to start testing again.

      That’s what usually happens when a check returns a “red” result. What happens when it returns nothing but “green” results?

    Exploratory Testing 3.0

    Tuesday, March 17th, 2015

    This blog post was co-authored by James Bach and me. In the unlikely event that you don’t already read James’ blog, I recommend you go there now.

    The summary is that we are beginning the process of deprecating the term “exploratory testing”, and replacing it with, simply, “testing”. We’re happy to receive replies either here or on James’ site.

    Oracles Are About Problems, Not Correctness

    Thursday, March 12th, 2015

    As James Bach and I have been refining our ideas of testing, we’ve been refining our ideas about oracles. In a recent post, I referred to this passage:

    Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.

    The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.

    —William E. Howden, A Survey of Dynamic Analysis Methods, in Software Validation and Testing Techniques, IEEE Computer Society, 1981

    While we have a great deal of respect for the work of testing pioneers like Prof. Howden, there are some problems with this description of testing and its focus on correctness.

    • Correct output from a computer program is not an absolute; an outcome is only correct or incorrect relative to some model, theory, or principle. Trivial example: Even the mathematical rule “one divided by two equals one-half” is a heuristic for dividing things. In most domains, it’s true, but as in George Carlin’s joke, when you cut a crumb in two, you don’t have two half-crumbs; you have two crumbs.
    • A product can produce a result that is functionally correct, and yet still be deeply unsatisfactory to its user. Trivial example: a calculator returns the value “4” from the function “2 + 2”—and displays the result in white on a white background.
    • Conversely, a product can produce an incorrect result and still be quite acceptable. Trivial example: a computer desktop clock’s internal state and second hand drift a few tenths of a second each second, but the program resets itself to be consistent with an atomic clock at the top of every minute. The desktop clock almost never shows the right time precisely, but the human observer doesn’t notice and doesn’t really care. Another trivial example: a product might return a calculation inconsistent with its oracle in the tenth decimal place, when only the first two or three decimal places really matter.
    • The correct outcome of a program or function is not always known in advance. Some development and testing work, like some science, is done in an attempt to discover something new; to establish what a correct answer might look like; to explore a mathematical model; to learn about the limitations of a novel system. In such cases, our ideas of correctness or acceptability are not clear from the outset, and must be developed. (See Collins and Pinch’s The Golem books, which discuss the messiness and confusion of controversial science.) Trivial example: in benchmarking, correctness is not at issue. Comparison between one system and another (or versions of the same system at different times) is the mission of testing here.
    • As we’re developing and testing a product, we may observe things that are unexpected, under-described or completely undescribed. In order to program a machine to make an observation, we must anticipate that observation and encode it. The machine doesn’t imagine, invent, or learn, and a machine cannot produce an unanticipated oracle in response to an observation. By contrast, human observers continually learn and refine their ideas on what to observe. Sometimes we observe a problem without having anticipated it. Sometimes we become aware that we’re making a new observation—one that may or may not represent a problem. Distinct from checking, testing continually affords new things to observe. Testing prompts us to decide when new observations represent problems, and testing informs decisions about what to do about them.
    • An oracle may be in error, or irrelevant. Trivial examples: a program that checks the output of another program may have its own bugs. A reference document may be outdated. A subject matter expert who is usually a reliable source of information may have forgotten something.
    • Oracles might be inconsistent with each other. Even though we have some powerful models for it, temperature measurement in climatology is inherently uncertain. What is the “correct” temperature outdoors? In the sunlight? In the shade? When the thermometer is near a building or farther away? Over grass, or over pavement? Some of the issues are described in this remarkable article (read the comments, too).
    • Although we can demonstrate incorrectness in a program, we cannot prove a program to be correct. As Dijkstra put it, testing can only show the presence of errors, not their absence; and to go even deeper, Popper pointed out that theories can only be falsified, and not proven. Trivial example: No matter how many tests we run on that calculator, we can never know that it will always return 4 given the inputs 2 + 2; we can only infer that it will do so through induction, and induction can be deeply problematic. In Nassim Taleb’s example (cribbed from Bertrand Russell and David Hume), every day the turkey uses induction to reinforce his belief in the farmer’s devotion to the desires and interests of turkeys—until a few days before Thanksgiving, when the turkey receives a very sudden, unpleasant, and (alas for the turkey) momentary flash of insight.
    • Sometimes we don’t need to know the correct result to know that the observed result is wrong. Trivial example: the range of the cosine function runs from -1 to 1. I don’t need to know the correct value for cos(72) to know that an output of 4.2 is wrong. (Elaine Weyuker discusses this in “On Testing Nontestable Programs” (Department of Computer Science, Courant Institute of Mathematical Sciences, New York University): “Frequently the tester is able to state with assurance that a result is incorrect without actually knowing the correct answer.”)
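As a sketch of how that last point might be encoded, here is a check (in Python; math.cos stands in for whatever implementation we might actually be checking, and the sample inputs are arbitrary) that uses a property of the output as a partial oracle, without knowing the correct value in advance:

```python
import math

def check_cosine_range(implementation=math.cos, samples=None):
    """A partial oracle: whatever cos(x) should be exactly, it must lie in [-1, 1]."""
    samples = samples if samples is not None else [0, 1, 3.14159, 72, -1000.5]
    problems = []
    for x in samples:
        observed = implementation(x)
        if not -1.0 <= observed <= 1.0:   # property-based decision rule
            problems.append((x, observed))
    return ("red" if problems else "green"), problems

if __name__ == "__main__":
    print(check_cosine_range())
```

A check like this can flag an output of 4.2 as a problem without ever computing the correct value of cos(72); like any oracle, though, it is fallible, since plenty of wrong answers also fall between -1 and 1.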

    Checking for correctness—especially when the test output is observed and evaluated mechanically or indirectly—is a risky business. All oracles are fallible. A “passing” test, based on comparison with a fallible oracle, cannot prove correctness, and no number of “passing” tests can do that. In this, a test is like a scientific experiment: an experiment’s outcome can falsify one theory while supporting another, but an experiment cannot prove a theory to be true. A million observations of white swans say nothing about the possibility that there might be black swans; a million passing tests, a million observations of correct behaviour, cannot eliminate the possibility that there might be swarms of bugs. At best, a passing test is essentially the observation of one more white swan. We urge those who rely on passing acceptance tests to remember this.

    A check can suggest the presence of a problem, or can at best provide support for the idea that the program can work. But no matter what oracle we might use, a test cannot prove that a program is working correctly, or that the program will work. So what can oracles actually do for us?

    If we invert the focus on correctness, we can produce a more robust heuristic. We can’t logically use an oracle to prove that a system is behaving correctly or that it will behave correctly, but we can use an oracle to help falsify the theory that it is behaving correctly. This is why, in Rapid Software Testing, we say that an oracle is a means by which we recognize a problem when it happens during testing.

    Give Us Back Our Testing

    Saturday, February 14th, 2015

    “Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.

    “The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.”

    —William E. Howden, A Survey of Dynamic Analysis Methods, in Software Validation and Testing Techniques, IEEE Computer Society, 1981

    Once upon a time, computers were used solely for computation. Humans did most of the work that preceded or followed the computation, so the scope of a computer program was limited. In the earliest days, testing a program mostly involved checking to see if the computations were being performed correctly, and that the hardware was working properly before and after the computation.

    Over time, designers and programmers became more ambitious and computers became more powerful, enabling more complex and less purely numerical tasks to be encoded and delegated to the machinery. Enormous memory and blinding speed largely replaced the physical work associated with storing, retrieving, revising, and transmitting records. Computers got smaller and became more powerful and protean, used not only by mathematicians but also by scientists, business people, specialists, consumers, and kids. Software is now used for everything from productivity to communications, control systems, games, audio playback, video displays, thermostats… Yet many of the software development community’s ideas about testing haven’t kept up. In fact, in many ways, they’ve gone backwards.

    Ask people in the software business to describe what testing means to them, and many will begin to talk about test cases, and about comparing a program’s output to some predicted or expected result. Yet outside of software development, “testing” has retained its many more expansive meanings. A teenager tests his parents’ patience. When confronted with a mysterious ailment, doctors perform diagnostic tests with open expectations and results that must be interpreted. Writers in Cook’s Illustrated magazine test techniques for roasting a turkey, and report on the different outcomes involving the different factors—flavours, colours, moisture, textures—that they obtain. The Mythbusters, says Wikipedia, “use elements of the scientific method to test the validity of rumors, myths, movie scenes, adages, Internet videos, and news stories.”

    Notice that all of these things called “testing” are focused on exploration, investigation, discovery, and learning. Yet over the last several decades, Howden’s notions of testing as checking for correctness, and of an oracle as a mechanism (or an artifact) became accepted by many people in the development and testing communities at large. Whether people were explicitly aware of those notions, they certainly seemed tacitly to have subscribed to the idea that testing should be focused on analysis of the output, displacing those deeper meanings of testing. That idea might have been more reasonable when computers did nothing but compute. Today, computers and their software are richly intertwined with daily social life and things that we value. Yet for many in software development, “testing” has this narrow, impoverished meaning, limited to what James Bach and I call checking. Checking is a tactic of testing; the part of testing that can be encoded as algorithms and that therefore can be performed entirely by machinery. It is analogous to compiling, the part of programming that can be performed algorithmically.

    Oddly, since we started distinguishing between testing and checking, some people have claimed that we’re “redefining” testing. We disagree. We believe that we are recovering testing’s meaning, restoring it to its original, rich, investigative sense. Testing’s meaning was stolen; we’re stealing it back.

    The Rapid Software Testing Namespace

    Monday, February 2nd, 2015

    Just as no one has the right to tell you what language to speak at home, nobody outside of your project has the authority to tell you how to speak inside your project. Every project develops its own namespace, so to speak, and its own formal or informal criteria for naming things inside it. Rapid Software Testing is, among other things, a project in that sense. For years, James Bach and I have been developing labels for ideas and activities that we talk about in our work and in our classes. While we’re happy to adopt useful ideas and terms from other places, we have the sole authority (for now) to set the vocabulary formally within Rapid Software Testing (RST). We don’t have the right to impose our vocabulary on anyone else. So what do we do when other people use a word to mean something different from what we mean by the same word?

    We invoke “the RST namespace” when we talk about testing and checking, for example, so that we can speak clearly and efficiently about ideas that we bring up in our classes and in the practice of Rapid Software Testing. From time to time, we also try to make it clear why we use words in a specific way. For example, we make a big deal about testing and checking. We define checking as “the process of making evaluations by applying algorithmic decision rules to specific observations of a product” (and a check is an instance of checking). We define testing as “the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.” (and a test is an instance of testing).

    This is in contrast with the ISTQB, which in its Glossary defines “test” as “a set of test cases”—along with “test case” as “a set of input values, execution preconditions, expected results and execution postconditions, developed for a particular objective or test condition, such as to exercise a particular program path or to verify compliance with a specific requirement.” Interesting, isn’t it: the ISTQB’s definition of test looks a lot like our definition of check. In Rapid Software Testing, we prefer to put learning and experimentation (rather than satisfying requirements and demonstrating fitness for purpose) at the centre of testing. We prefer to think of a test as something that people do as an act of investigation; as a performance, not as an artifact.

    Because words convey meaning, we converse (and occasionally argue, sometimes passionately) about the value we see in the words we choose and the ways we think of them. Our goal is to describe things that people haven’t noticed, or to make certain distinctions clear, so as to reduce the risk that someone will misunderstand—or miss—something important. Nonetheless, we freely acknowledge that we have no authority outside of Rapid Software Testing. There’s nothing to stop people from using the words we use in a different way; there are no language police in software development. So we’re also willing to agree to use other people’s labels for things when we’ve had the conversation about what those labels mean, and have come to agreement.

    People who tout a “common language” often mean “my common language”, or “my namespace”. They also have the option to certify you as being able to pass a vocabulary test, if anyone thinks that’s important. We don’t. We think that it’s important for people to notice when words are being used in different ways. We think it’s important for people to become polyglots—and that often means working out which namespace we might be using from one moment to the next. In our future writing, conversation, classes, and other work, you might wonder what we’re talking about when we refer to “the RST namespace”. This post provides your answer.