Archive for the ‘Testing vs. Checking’ Category

The Cooking Detector

Friday, September 23rd, 2011

A heuristic is a fallible method for solving a problem or making a decision. “Heuristic” as an adjective means “something that helps us to learn”. In testing, an oracle is a heuristic principle or mechanism by which we recognize a problem.

Some years ago, during a lunch break from the Rapid Software Testing class, a tester remarked that he was having a good time, but that he wanted to know how to get over the boredom that he experienced whenever he was testing. I suggested to him that if he found that testing was boring, something was wrong, and that he could consider the boredom as something we call a trigger heuristic. A trigger heuristic is like an alarm clock for a slumbering mind. Emotions are natural trigger heuristics, nature’s way of stirring us to wake up and pay attention. What was the boredom signalling? Maybe he was covering ground that he had covered before. Maybe the risks that he had in mind weren’t terribly significant, and other, more important risks were looming. Maybe the work he was doing was repetitive and mechanical, better left to a machine.

Somewhat later, I realized that every time I had seen a bug in a piece of software, an emotion had been involved in the discovery. Surprise naturally suggested some kind of unexpected outcome. Amusement followed an observation of something that looked silly and that posed a threat to someone’s image. Frustration typically meant that I had been stymied in something that I wanted to accomplish.

There is a catch with emotions, though: they don’t tell you explicitly what they’re about. In that, they’re like this device we have in our home. It’s mounted in a hallway, and it’s designed to alert us to danger. It does that heuristically: it emits a terrible, piercing noise whenever I’m baking bread or broiling a steak. And that’s why, in our house, we call it the cooking detector. The cooking detector, as you may have guessed, came in a clear plastic package labelled “smoke detector”.

When the cooking detector goes off, it startles us and definitely gets our attention. When that happens, we make more careful observations (look around; look at the oven; check for a fire; observe the air around us). We determine the meaning of our observations (typically “there’s enough smoke to set off the cooking detector, and it’s because we’re cooking”); and we evaluate the significance of them (typically, “no big deal, but the noise is enough to make us want to do something”). Whereupon we perform some appropriate control action: turn on the fan over the stove, open a door or a window, turn down the oven temperature, mop up any oil that has spilled inside the oven, check to make sure that the steak hasn’t caught fire. Oh, and reset the damned cooking detector.

Notice that the package says “smoke detector”, not “fire detector”. The cooking detector apparently can’t detect fires. Indeed, on the two occasions that we’ve had an actual fire in the kitchen (once in the oven and once in the toaster oven), the cooking detector remained resolutely and ironically silent. We were already in the kitchen, and noticed the fires and put them out before the cooking detector detected the smoke. Had one of the fires got bad enough, I’m reasonably certain the cooking detector would have squawked eventually. That’s a good thing. Even though our wiring is in good shape, we don’t smoke, and the kids are fire-aware, one never knows what could happen. The alarm could give us a chance to extinguish a fire early, to help to reduce damage, or to escape life-threatening danger.

The cooking detector is like a programmer’s unit test—an automated check. It makes a low-level, one-dimensional, one-bit observation: smoke, or no smoke. It’s oblivious to any threat that doesn’t manifest itself as smoke, such as the arrival of a burglar or a structural weakness in the building. The maximum value of the cooking detector is unlikely to be realized. It occasionally raises a ruckus, and when it does, it doesn’t tell us what the ruckus is about. Usually it’s for something that we can understand, explain, and deal with quickly and easily. Smoke doesn’t automatically mean fire. The cooking detector is an oracle, a device that provides a heuristic trigger, and heuristic devices are fallible. The cooking detector doesn’t tell us that there is a problem; only that there might be a problem. We have to figure out whether there’s a problem, and if so, what the problem is.

Yet the cooking detector comes at low cost. It didn’t cost much to buy, it takes one battery a year, and it’s easy to reset. More importantly, the problem to which it alerts us is a potentially terrible problem. Although it doesn’t tell us what the problem is, it tells us to pay attention so that we can investigate and decide on what to do, before a problem gets serious without our notice. Smoke doesn’t automatically mean fire, but it does mean smoke. Where there’s smoke, maybe there’s fire, or maybe there’s something else that’s unpleasant or dangerous. The cooking detector reminds us to check the steak, open the windows, clean the oven every once in a while, evaluate what’s going on. I don’t believe that the cooking detector will ever detect a real, serious problem that we don’t know about already—but I’m not prepared to bet my family’s life on that.
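For readers who think in code, here is a minimal Python sketch of the analogy. Everything in it is an assumption made up for illustration: the Kitchen fields, the SMOKE_THRESHOLD value, and the two functions. The point is only that the check reports one bit, while deciding what that bit means takes a separate, human-style investigation.

```python
from dataclasses import dataclass


@dataclass
class Kitchen:
    """Hypothetical kitchen state; every field is an assumption for illustration."""
    smoke_level: float   # arbitrary units
    open_flame: bool     # an actual fire is burning
    cooking: bool        # someone is baking or broiling


SMOKE_THRESHOLD = 0.5    # assumed trigger level for the alarm


def cooking_detector(kitchen: Kitchen) -> bool:
    """The automated check: a one-bit observation, smoke or no smoke."""
    return kitchen.smoke_level >= SMOKE_THRESHOLD


def investigate(kitchen: Kitchen) -> str:
    """The human part: observe, interpret, and evaluate significance."""
    if kitchen.open_flame:
        return "fire: put it out or get out"
    if kitchen.cooking:
        return "cooking smoke: turn on the fan, reset the detector"
    return "unexplained smoke: keep looking"


# A broiling steak makes plenty of smoke; an early toaster fire makes very little.
broiling = Kitchen(smoke_level=0.8, open_flame=False, cooking=True)
toaster_fire = Kitchen(smoke_level=0.1, open_flame=True, cooking=True)

for kitchen in (broiling, toaster_fire):
    print("alarm:", cooking_detector(kitchen), "| finding:", investigate(kitchen))
```

In this sketch the alarm sounds for the steak and stays silent for the early toaster fire, which is roughly what happened in our kitchen. The check is cheap and worth having, but the investigation is where the problem gets recognized.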

Testing Problems Are Test Results

Tuesday, September 6th, 2011

I often do an exercise in the Rapid Software Testing class in which I ask people to catalog things that, for them, make testing harder or slower. Their lists fit a pattern I hear over and over from testers (you can see an example of the pattern in this recent question on Stack Exchange). Typical points include:

  • I’m a tester working alone with several programmers (or one of a handful of testers working with many programmers).
  • I’m under enormous time pressure. Builds are coming in continuously, and we’re organized on one- or two-week development cycles.
  • The product(s) I’m testing is (are) very complex.
  • There are many interdependencies between modules within the product, or between products.
  • I’m seeing a consistent pattern of failures specifically related to those interdependencies; the tiniest change here can have devastating impact there—or anywhere.
  • I believe that I have to run a complete regression test on every build to try to detect those failures.
  • I’m trying to cope by using automated checks, but the complexity makes the automation difficult, the program’s testing hooks are minimal at best, and frequent product changes make the whole relationship brittle.
  • The maintenance effort for the test automation is significant, at a cost to other testing I’d like to do.
  • I’m feeling overwhelmed by all this, but I’m trying to cope.

On top of that,

  • The organization in which I’m working calls itself Agile.
  • Other than the two-week iterations, we’re actually using at most two other practices associated with Agile development, (typically) daily scrums or Kanban boards.

Oh, and for extra points,

  • The builds that I’m getting are very unstable. The system falls over under the most basic of smoke tests. I have to do a lot of waiting or reconfiguring or both before I can even get started on the other stuff.

How might we consider these observations?

We could choose to interpret them as problems for testing, but we could think of them differently: as test results.

Test results don’t tell us whether something is good or bad, but they may inform a decision or an evaluation or more questions. People observe test results and decide whether there are problems and what the problems are, what further questions are warranted, and what decisions should be made. Doing that requires human judgement and wisdom, consideration of lots of factors, and a number of possible interpretations.

Just as for automated checks and other test results, it’s important to consider a variety of explanations and interpretations for testing meta-results—observations about testing—lest we miss an important problem. As Jerry Weinberg points out in Perfect Software and Other Illusions About Testing, whatever else something might be, it’s information. If testing is, as Jerry says, gathering information with the intention of informing a decision, it seems a mistake to leave potentially valuable observations lying around on the floor. Indeed, rather than thinking of them as problems for testing, we could choose to think of them as symptoms of product or project problems—problems that testing can help to solve.

For example, when a tester feels outnumbered by programmers, or when a tester feels under time pressure, that’s a test result. The feeling often comes from the programmers generating more work and more complexity than the tester can handle. Yet complexity, like quality, is a relationship between some person and something else. Complexity on its own isn’t necessarily a problem; it’s how people deal with it and its attendant risks that’s a problem. When we observe the ways in which people react to a perception of complexity, we might learn a lot.

  • Are people conscious of the risks—especially the Black Swans—that typically accompany complexity?
  • If people are conscious of risk, are they paying attention to it? Are they panicking over it? Or are they ignoring it and whistling past the graveyard? Or…
  • Are people reacting calmly and pragmatically? Are they acknowledging and dealing with the complexity of the product? If they can’t make the product or the process that it models less complex, are they at least taking steps to make understanding of the product more tractable?
  • Might the programmers be generating or modifying code so quickly that they’re not taking the time to understand what’s really going on with it?
  • If someone feels that more testers are needed, what’s behind that feeling? (I took a stab at an answer to that question a few years back.)

How might we figure out answers to those questions? One way might be to look at more of the test results and test meta-results.

  • Does someone perceive testing to be difficult or time-consuming? Who? What’s the basis for that perception? What assumptions underlie it?
  • Does the need to investigate and report bugs overwhelm the testers’ capacity to obtain good test coverage? (I wrote about that problem here.)
  • Does testing consistently reveal consistent patterns of failure?
  • Are programmers consistently surprised by such failures and patterns?
  • Do small changes in the code cause problems that are disproportionately large or hard to find?
  • Do the programmers understand the interdependencies clearly? Are those interdependencies necessary, or could they be eliminated?
  • Are programmers taking steps to anticipate or prevent problems related to interfaces and interactions?
  • If automated checks are difficult to develop and maintain, does that say something about the skill of the tester, the quality of the automation interfaces, or the scope of checks? Or about something else?
  • Are unstable builds a problem that gets in the way of deeper testing? Or could we interpret them as a sign that the product has problems so numerous and serious that even shallow testing reveals them?
  • When a “stable” build appears after a long series of unstable builds, how stable is it really?

Perhaps, with the answers to those questions, we could raise even more questions.

  • What risks do those problems present for the success of the product, whether in the short term or the longer term?
  • When testing consistently reveals patterns of failures and attendant risk, what does the product team do with that information?
  • Are the programmers mandated to deliver code? Or are the programmers mandated to deliver code with a warrant that the code does what it should (and doesn’t do what it shouldn’t), to the best of their knowledge? Do the programmers adamantly prefer the latter mandate?
  • Is someone pressuring the programmers to make schedule or scope commitments that they can’t really fulfill?
  • Are the programmers and the testers empowered to push back on scope or schedule pressure when it adds to product or project risk?
  • Do the business people listen to the development team’s concerns? Are they aware of the risks that testers and programmers bring to their attention? When the development team points out risks, do managers and business people deal with them congruently?
  • Is the team working at a sustainable pace, or might we expect the product and the project to become overwhelmed by complexity, interdependencies, fragility, and problems that lurk just beyond the reach of our development and testing effort?
  • Is the development team really Agile, in the sense of the precepts of the Agile Manifesto? Or is “agility” being used in a cargo-cult way, using practices or artifacts to mask over an incoherent project?

Testers often feel that their role is to find, investigate, and report on bugs in the product. That’s usually true, but it’s also a pretty limited view of the kinds of information that testing reveals. When seen one way, the problems I’ve listed above sound like serious problems for testing. What if we also remembered Jerry’s definition of testing as “gathering information with the intention of informing a decision”? If that’s the case, then everything that we notice or discover during testing is a test result.

(See also this discussion for an example of looking beyond the test result for possible product and project risks.)

Exploratory Testing is All Around You

Monday, May 16th, 2011

I regularly converse with people who say they want to introduce exploratory testing in their organization. They say that up until now, they’ve only used a scripted approach.

I reply that exploratory testing is already going on all the time at your organization.  It’s just that no one notices, perhaps because they call it

  • “review”, or
  • “designing scripts”, or
  • “getting ready to test”, or
  • “investigating a bug”, or
  • “working around a problem in the script”, or
  • “retesting around the bug fix”, or
  • “going off the script, just for a moment”, or
  • “realizing the significance of what a programmer said in the hallway, and trying it out on the system”, or
  • “pausing for a second to look something up”, or
  • “test-driven development”, or
  • “Hey, watch this!”, or
  • “I’m learning how to use the product”, or
  • “I’m shaking it out a bit”, or
  • “Wait, let’s do this test first instead of that test”, or
  • “Hey, I wonder what would happen if…”, or
  • “Is that really the right phone number?”, or
  • “Bag it, let’s just play around for a while”, or
  • “How come what the script says and what the programmer says and what the spec says are all different from each other?”, or
  • “Geez, this feature is too broken to make further testing worthwhile; I’m going to go to talk to the programmer”, or
  • “I’m training that new tester in how to use this product”, or
  • “You know, we could automate that; let’s try to write a quickie Perl script right now”, or
  • “Sure, I can test that…just gimme a sec”, or
  • “Wow… that looks like it could be a problem; I think I’ll write a quick note about that to remind me to talk to my test lead”, or
  • “Jimmy, I’m confused… could you help me interpret what’s going on on this screen?”, or
  • “Why are we always using ‘tester’ as the login account? Let’s try ‘tester2’ today”, or
  • “Hey, I could cancel this dialog and bring it up again and cancel it again and bring it up again”, or
  • “Cool! The return value for each call in this library is the round-trip transaction time—and look at these four transactions that took thirty times longer than average!”, or
  • “Holy frijoles! It blew up! I wonder if I can make it blow up even worse!”, or
  • “Let’s install this and see how it works”, or
  • “Weird… that’s not what the Help file says”, or
  • “That could be a cool tool; I’m going to try it when I get home”, or
  • “I’m sitting with a new tester, helping her to learn the product”.

Now it’s possible that none of that stuff ever happens in your organization. Or maybe people aren’t paying attention or don’t know how to observe testing. Or both.

Then, just before I posted this blog post, James Bach offered me two more sure-fire clues that people are doing exploratory testing: they say, “I am in no way doing exploratory testing”, or “We’re doing only highly rigorous formal testing”. In both cases, the emphatic nature of the claim guarantees that the claimant is not sufficiently observant about testing to realize that exploratory testing is happening all around them.

Why Do Some Testers Find The Critical Problems?

Saturday, February 5th, 2011

Today, someone on Twitter pointed to an interesting blog post by Alan Page of Microsoft. He says:

“How do testers determine if a bug is a bug anyone would care about vs. a bug that directly impacts quality (or the customers perception of quality)? (or something in between?) Of course, testers should report anything that may annoy a user, but learning to differentiate between an ‘it could be better’ bug and a ‘oh-my-gosh-fix-this’ bug is a skill that some testers seem to learn slowly. … “So what is it that makes some testers zero in on critical issues, while others get lost in the weeds?”

I believe I have some answers to this. My answers are based on roughly 20 years of observation and experience in consulting, training, and working with other testers. The forms of interaction have included in-class training; online coaching via video, voice, and text; face-to-face conversation in workplaces, conferences, and workshops; and direct collaboration with other working testers in mass-market commercial software, financial services, retail services, specialized mathematical applications, and several other domains.

My first answer is that testing, for a long time and in many places, has been myopically focused on functional correctness, rather than on value to people. Cem Kaner discusses this issue in his talk Software Testing as a Social Science, and later variations on it. This problem in testing is a subset of a larger problem in computer science and software engineering. Introductory texts often observe that a computer program is “a set of instructions for a computer”. Kaner’s definition of a computer program as “a communication among several humans and computers, distributed over distance and time, that contains instructions that can be executed by a computer” goes some distance towards addressing the problem; his explication that “the point of the program is to provide value to the stakeholders” goes further still. When the definition of programming is reduced to producing “a set of instructions for a computer”, it misses the point—value to people—and when testing is reduced to the checking of those instructions, the “testing” will miss the same point. I’ve suggested in recent talks that testing is “the investigation of systems composed of people, computer programs, related products and services.” Successful testers avoid a fascination with functional correctness, and focus on ways in which people might obtain value from a program—or have their value unfulfilled or threatened.

This first answer gives rise to my second: that when testing is focused on functional correctness, it becomes a confirmatory, verification-oriented task, rather than an exploratory, discovery-oriented set of processes. This is not a new problem. It’s old enough that Glenford Myers tried (more or less unsuccessfully, it seems) to argue against it in The Art of Software Testing in 1979. Myers’ point was that testing should be premised on trying to expose the program’s failures, rather than on trying to confirm that it works. Psychological research before and since Myers’ book (in particular Klayman and Ha’s paper on confirmation bias) shows that the positive test heuristic biases people towards choosing tests that demonstrate fit with a working hypothesis (showing THAT it works), rather than tests that drive towards final rule discovery (showing how it works, and more important, how it might fail). Worse yet, I’ve heard numerous reports of development and test managers urging testers to “make sure the tests pass”. The trouble with passing tests is that they don’t expose threats to value. Every function in the program code might be checked and found correct, but the product might be unusable. As in Alan’s example, the phone might make calls perfectly, but unless we model the way people actually use the product—talking for more than three minutes at a time, say—we will miss important problems. Every function might work perfectly, but we might fail to observe missing functionality. Every function might work perfectly, but we might miss terrible compatibility problems. Functional correctness is a very important thing in computer software, but it’s not the only thing. (See the “Quality Criteria” section of the Heuristic Test Strategy Model for suggestions.) Testers “who zero in on critical issues” avoid the confirmation trap.
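To make the difference between confirming and probing concrete, here is a small Python sketch. It is an illustration under stated assumptions, not a description of any real phone: the Phone class, its hidden 180-second limit, the test number, and both functions are hypothetical.

```python
class Phone:
    """Hypothetical product with a hidden defect: calls drop after 180 seconds."""

    DROP_AFTER_SECONDS = 180  # assumed limit, unknown to the tester

    def call(self, number: str, talk_seconds: int) -> int:
        """Return how many seconds the call actually stayed connected."""
        return min(talk_seconds, self.DROP_AFTER_SECONDS)


def confirmatory_check(phone: Phone) -> bool:
    """Shows THAT a call works: one short call, one expected result."""
    return phone.call("555-0100", talk_seconds=10) == 10


def probe_durations(phone: Phone, durations):
    """Vary call length the way people might; return the calls that were cut short."""
    return [d for d in durations if phone.call("555-0100", talk_seconds=d) < d]


phone = Phone()
print("confirmatory check passes:", confirmatory_check(phone))
print("calls cut short:", probe_durations(phone, [10, 60, 200, 600]))
```

The short call passes, and it would keep passing on every build. Only when the call lengths start to resemble real conversations do the dropped calls show up, and even then a person still has to decide whether a dropped call matters and to whom.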

My third answer (related to the first two) is that when testing is focused on confirming functional correctness, a lot of other information gets left lying on the table. Testing becomes a search for errors, rather than for issues. That is, testers become oriented towards reporting bugs, and less oriented towards the discovery of issues—things that aren’t bugs, necessarily, but that threaten the value of testing and of the project generally. I’ve written recently about issues here. Successful testers recognize issues that represent obstacles to their missions and strategies, and work around them or seek help.

My fourth answer is that many (in my unscientific sample, most) testers are poorly versed in the skills of test framing. This is understandable, at least in part because test framing itself wasn’t known by that name as recently as a year ago as I write. Test framing is the set of logical connections that structure and inform a test. It involves the capacity to follow and express a line of perhaps informal yet reasonably structured logic that directly links the testing mission to the tests and their results. In my experience, most testers are unable to trace this logical line quickly and expertly. There are many roots for this problem. The earlier answers above provide part of the explanation; the mission of value to the customer is overwhelmed by the mission of proving functional correctness. In situations where the process of test design is separated from test execution (as in environments that take a highly scripted approach to testing), the steps to perform the test and observe the results are typically listed explicitly, but the motivation for performing the test is often left out. In situations where test execution, observation of outcomes, and reporting of test results is heavily delegated to automation, motivation is even further disconnected from the mission. In such environments, focus is directed towards getting the automation to follow a script, rather than using automation to assist in probing for problems. In such environments, focus is often on the quantity of tests or the quantity of bug reports, rather than on the quality, the value, of the information revealed by testing. Testers who find problems successfully can link tests, test activities, and test results to the mission. They’re far more concerned about the quality of the information they provide than the quantity.

My fifth answer is that in many organizations there is insufficient diversity of tester skills, mindsets, and approaches for finding the great diversity of problems that might lurk in the product. This problem starts in various ways. In some organizations, testers are drawn exclusively from the business. In others, testers are required to have programming skills before they can be considered for the job. And then things get left out. Testers who need training or experience in the business domain don’t get it, and are kept separated from the business people (that’s a classic example of an issue). Testers aren’t given training in software design, programming, or related skills. They’re not given training in testing, problem reporting and bug advocacy, design of experiments. They’re not given training or education in anthropology, critical thinking, systems thinking, or philosophy and other disciplines that inform excellent testing. Successful testers tend to take on diversified skills, knowledge, and tactics, and when those skills are lacking, they collaborate with people who have them.

Note that I’m not suggesting here that anyone become a Donald Knuth-level programmer, a Pierre Bourdieu-league anthropologist, a Ross Ashby-class systems thinker, a Wittgenstein-grade philosopher. I am suggesting that testers be given sufficient training and opportunity to learn to program to the level of Brian Marick’s Everyday Scripting with Ruby, and that they be given classes, experience, and challenges in observation, the business domain, systems thinking and critical thinking. I am suggesting that people who are testing computer software do need some exposure to core ideas about logic (if we see this, can we justifiably infer that?), about ontology (what are our systems of knowledge about the way things work—especially related to computer programs and to testing), and about epistemology (how do we know what we know?).

I’ve been told by people involved in the design of testing standards that “you can’t expect regular testers to learn epistemology, for goodness’ sake”. Well, I’m saying that we can and that we must at least provide opportunities for learning, to the degree that testers can frame their mission, their ideas about risk, their testing, and their evaluation of the product in the ways that their clients value. Moreover, I’ve worked with testing organizations that have done that, and the results have been impressive. Sometimes I hear people saying “what if we train our testers and they leave?” As one wag on Twitter replied (I wish I knew who), “What if you don’t train them and they stay?”

In our classes, James Bach and I have the experience of inspiring testers to become interested in and excited by these topics. We find that it’s not hard to do that. We remain concerned about the capacity of some organizations to sustain that enthusiasm, often because some middle managers’ misconceptions about the practice and value of testing can squash both enthusiasm and value in a hurry. Testers, to be successful, must be given the freedom and responsibility to explore and to contribute what they’ve learned back to their team and to the rest of the organization.

So, what would we advise?

Read this set of ideas as a system, rather than as a linear list:

  • The purpose of testing is to identify threats to the value of the program. Functional errors are only one kind of threat to the value of the program.
  • Take on expansive ideas about what might constitute—or threaten—the quality of the product.
  • Dynamically manage your focus to exercise the product and test those ideas about value.
  • In hiring, staffing, and training, focus on the mindset and the skill set of the individual tester as a member of a highly diversified team.
  • As an individual tester, develop and diversify your skills and your strategies.
  • Immediately identify and report issues that threaten the value of the testing effort and of the project generally. Solve the ones you can; raise team and management awareness of the costs and risks of issues, in order to get attention and help.
  • Learn to frame your testing and to compose, edit, narrate and justify a compelling testing story.
  • Don’t try to control or restrain testers; grant them the freedom, along with the responsibility, to discover what they will. Given that… they will.

Gaming the Tests

Monday, September 27th, 2010

Let’s imagine, for a second, that you had a political problem at work. Your CEO has promised his wife that their feckless son Ambrose, having flunked his university entrance exams, will be given a job at your firm this fall. Company policy is strict: in order to prevent charges of nepotism, anyone holding a job must be qualified for it. You know, from having met him at last year’s Christmas party, that Ambrose is (how to put this gently?) a couple of tomatoes short of a thick sauce. Yet the policy is explicit: every candidate must not only pass a multiple choice test, but must get every answer right. The standard number of correct answers required is (let’s say) 40.

So, the boss has a dilemma. He’s not completely out to lunch. He knows that Ambrose is (how can I say this?) not the sharpest razor in the barbershop. Yet the boss adamantly wants his son to get a job with the firm. At the same time, the boss doesn’t want to be seen to be violating his own policy. So he leaves it to you to solve the problem. And if you solve the problem, the boss lets you know subtly that you’ll get a handsome bonus. Equally subtly, he lets you know that if Ambrose doesn’t pass, your career path will be limited.

You ponder for a while, and you realize that, although you have to give Ambrose an exam, you have the authority to set the content and conditions of the exam. This gives you some possibilities.

A. You could give a multiple choice test in which all the answers were right. That way, anyone completing the test would get a perfect score.

B. You could give a multiple choice test for which the answers were easy to guess, but irrelevant to the work Ambrose would be asked to do. For example, you could include questions like, “What is the very bright object in the sky that rises in the morning and sets in the evening?” and provide “The Sun” as one choice of answer, and the names of hockey players for the other choices.

C. You could find out what questions Ambrose might be most likely to answer correctly in the domain of interest, and then craft an exam based on that.

D. You could give a multiple choice test in which, for every question, one of A, B, or C was the correct answer, and answer D was always “One of the above.”

E. You might give a reasonably difficult multiple-choice exam, but when Ambrose got an answer wrong, you could decide that there’s another way to interpret the answer, and quietly mark it right.

F. You might give Ambrose a very long set of multiple-choice questions (say 400 of them), and then, of his answers, pick 40 correct ones. You then present those questions and answers as the completed exam.

G. You could give Ambrose a set of questions, but give him as much time as he wanted to provide an answer. In addition, you don’t watch him carefully (although not watching carefully is a strategy that nicely supports most of these options).

H. You could ask Ambrose one multiple choice question. If he got it wrong, correct him until he gets it right. Then you could develop another question, ask that, and if he gets it wrong, correct him until he gets it right. Then continue in a loop until you get to 40 questions.

I. This approach is like H, but instead you could give a multiple choice test for which you had chosen an entire set of 40 questions in advance. If Ambrose didn’t get them all right, you could correct him, and then give him the same set of questions again. And again. And over and over again, until he finally gets them all right. You don’t have to publicize the failed attempts; only the final, successful one. That might take some time and effort, and Ambrose wouldn’t really be any more capable of anything except answering these specific questions. But, like all the other approaches above, you could effect a perfect score for Ambrose.
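Two of these options, F and I, are easy to simulate. The Python sketch below is only an illustration: the candidate model, the probabilities, and every function name are assumptions. Ambrose is modelled as someone who guesses each answer correctly one time in four, and who memorizes a correction half the time.

```python
import random


def take_exam(n_questions, p_correct, rng):
    """Hypothetical candidate: answers each question correctly with probability p_correct."""
    return [rng.random() < p_correct for _ in range(n_questions)]


def option_f(p_correct, rng, pool_size=400, required=40):
    """Option F: ask many questions, then present only the correct answers."""
    results = take_exam(pool_size, p_correct, rng)
    shown = [r for r in results if r][:required]
    return sum(shown), len(shown)   # the reported score is perfect by construction


def option_i(p_correct, p_memorize, rng, required=40, max_attempts=1000):
    """Option I: repeat the same exam until every answer is right; publicize only that run."""
    memorized = [False] * required
    for attempt in range(1, max_attempts + 1):
        results = [m or rng.random() < p_correct for m in memorized]
        if all(results):
            return attempt
        # Correct each miss; the correction sticks with probability p_memorize.
        memorized = [m or (not r and rng.random() < p_memorize)
                     for m, r in zip(memorized, results)]
    return None


rng = random.Random(0)   # fixed seed so the sketch is repeatable
reported = option_f(p_correct=0.25, rng=rng)
attempts = option_i(p_correct=0.25, p_memorize=0.5, rng=rng)
print("Option F reported score:", reported)
print("Option I attempts before the publicized pass:", attempts)
```

Under these assumptions, both routes end with a perfect score on paper. Neither the score nor the code that produced it says anything about how many wrong answers were discarded, how many attempts were hidden, or whether the candidate can do the job.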

When the boss is clamoring for a certain result, you feel under pressure and you’re vulnerable. You wouldn’t advise anyone to do any of the things above, and you wouldn’t do them yourself. Or at least, you wouldn’t do them consciously. You might even do them with the best of intentions.

There’s an obvious parallel here—or maybe not. You may be thinking of the exam in terms of a certain kind of certification scheme that uses only multiple-choice questions, the boss as the hiring manager for a test group, and Ambrose as a hapless tester that everyone wants to put into a job for different reasons, even though no one is particularly thrilled about the idea. Some critical outsider might come along and tell you point-blank that your exam wasn’t going to evaluate Ambrose accurately. Even a sympathetic observer might offer criticism. If that were to happen, you’d want to keep the information under your hat—and quite frankly, the other interested parties would probably be complacent too. Dealing with the critique openly would disturb the idea that everyone can save face by saying that Ambrose passed a test.

Yet that’s not what I had in mind—not specifically, at least. I wanted to point out some examples of bad or misleading testing, which you can find in all kinds of contexts if you put your mind to it. Imagine that the exam is a set of tests—checks, really. The boss is a product owner who wants to get the product released. The boss’ wife is a product marketing manager. Hapless Ambrose is a program—not a very good program to be sure, but one that everyone wants to release for different reasons, even though no one is particularly thrilled by the idea. You, whether a programmer or a tester or a test manager, are responsible for “testing”, but you’re really setting up a set of checks. And you’re under a lot of pressure. How might your judgement—consciously or subconsciously—be compromised? Would your good intentions bend and stretch as you tried to please your stakeholders and preserve your integrity? Would you admit to the boss that your testing was suspect? If you were under enough pressure, would you even notice that your testing was suspect?

So this story is actually about any circumstance in which someone might set up a set of checks that provide some illusion of success. Can you think of any more ways that you might game the tests… or worse, fool yourself?
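To make a couple of those options concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `run_check` is a made-up stand-in for one check against a weak program, and the 25% pass rate is arbitrary. It shows how option F (present only the passing results) and option I (repeat until everything passes, publish only the last attempt) both yield a "perfect score" that says little about Ambrose.

```python
import random

def run_check(rng, pass_rate=0.25):
    """Hypothetical stand-in for one check against a weak program."""
    return rng.random() < pass_rate

def honest_report(results):
    """Report every result that was actually observed: (passes, total)."""
    return sum(results), len(results)

def cherry_picked_report(results, wanted=40):
    """Option F: run many checks, then present only passing ones as 'the exam'."""
    passes = [r for r in results if r][:wanted]
    return sum(passes), len(passes)

def retry_until_green(rng, questions=40, pass_rate=0.25):
    """Option I: repeat each check until it passes; publish only the final attempt.

    Returns (passes, total, attempts); only the first two numbers get published.
    """
    attempts = 0
    for _ in range(questions):
        while True:
            attempts += 1
            if run_check(rng, pass_rate):
                break
    return questions, questions, attempts

rng = random.Random(2010)
results = [run_check(rng) for _ in range(400)]
print("honest report:       ", honest_report(results))
print("cherry-picked report:", cherry_picked_report(results))
print("retry until green:   ", retry_until_green(rng))
```

The published numbers from the last two functions look identical to those of a program that really did pass 40 checks out of 40. The information that would tell them apart (the discarded results, the attempt count) is exactly what gets left out of the report.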

Why Exploratory? Isn’t It All Just Testing?

Friday, September 24th, 2010

The post “Exploratory Testing and Review” continues to prompt comments whose responses, I think, are worthy of their own posts. Thank you to Parthi, who provides some thoughtful comments and questions.

I always wondered and in attempted to see the difference between the Exploratory testing that you are talking about and the testing that I am doing. Unlike the rest of the commenter’s, this post made this question all the more valid and haunting.

From what you have written, as long as there is a loop between the test design and execution, its exploratory testing? And the shorter the loop, exploratory nature goes up?

Yes, that’s right. A completely linear process would be entirely scripted, with no exploratory element to it. The existence of a loop suggests that the testing is to some degree exploratory. This suggests (to me, at least) a link to one of the points of Jerry Weinberg’s Perfect Software and Other Illusions About Testing. Testing, he suggests, is gathering information with the intention of informing a decision, and he also says that if you’re not going to use that information, you might as well not test. I’ll go a little further and suggest that if you “test” with no intention of using the information in any way, you might be doing something, but you’re not really testing.

As we’ve said before, some people seem to have interpreted the fact that there’s a distinction between exploratory testing and scripted testing as meaning that you can only be doing one or the other. That’s a misconception. It’s like saying that there are only two kinds of liquid water: hot or cold. Yet there are varying gradations of water: almost freezing, extremely cold, chilly, cool, room temperature, tepid, warm, hot, scalding, boiling. To stretch the metaphor, a test that’s being done by a machine (that is, a check) is like ice. It’s frozen and it’s not going anywhere. An investigation of a piece of software done by a tester with no purpose other than to assuage his curiosity is like steam; it’s invisible and vaporous. But testing in most cases is to some extent scripted and to some extent exploratory. No matter how exploratory, a test is to some degree informed by a mission that typically came from someone else, at some point in the past; that is, the test is to some degree scripted. No matter how scripted, a test is to some degree informed by decisions and actions that come from the individual tester in the moment—otherwise the tester would freeze and stop working, just like a machine, as soon as he or she was unable to perform some step specified in the script. That is, all testing is to some degree exploratory.

In addition to the existence of loops, there are other elements too. Very generally,

  • the extent to which the tester has freedom to make his or her own choices about which step to take next, which tests to perform, which tools to use, which oracles to apply, and which coverage to obtain (more freedom means more exploratory and less scripted; more control means less exploratory and more scripted);
  • the extent to which the tester is held responsible for the choices being made and the quality of his or her work. More responsibility on the tester means more exploratory and less scripted; more responsibility on some other agency means less exploratory and more scripted.
  • the extent to which all available information (including the most recent information) informs the design and execution of the next test. The broader the scope of the information that informs the test, the more exploratory; the narrower the scope of information that informs the test, the more scripted.
  • the extent to which the mission—the search for information—is open-ended and new information is welcomed. The more new information will be embraced, the more exploratory the mission; the more new information will be ignored or rejected, the less exploratory the mission.
  • again, very generally, the length of the loops that include designing, performing, and interpreting an activity and learning from it, and then feeding that information back into the next cycle of design, performance, interpretation, and learning. I’m not talking here about timing and sequences of actions so much as the cognitive engagement. Timing is a factor; that’s one reason that we now favour “parallel” over “simultaneous”. But more importantly, the more difficult it is to unsnarl the tangle of your interactions and your ideas, the more exploratory a process you’re in. The more rapidly you are able to shift from one heavy focus (say on executing the test) to another heavy focus (pondering the implications of what you’ve just seen) to another (running a thought experiment in your head) to yet another (revising your design), very generally, the more exploratory the process. Another way to put it: the more organic the process, the more exploratory it is; the more linear the process, the more scripted it is.

Is this what you are saying? If yes, there is hardly any difference in what I do at my work and what you preach and this is true with most of my team (am talking about 600+ testers in my organization) and we simply call this Testing.

I’d smilingly suggest that you can “simply” call it whatever you like. The more important issue is whether you want to simply call it something, or whether you want to achieve a deeper understanding of it. The risk associated with “simply” calling it something is that you’ll end up doing it simply, and that may fail to serve your clients when they are producing and using very complex products and services and systems. Which is, these days, mostly what’s happening.

For example, is there really a difference between what I’m talking about and what your 600+ testers are doing? Can you describe what they’re doing? How would you describe it? How would you frame their actions in terms of risk, cost, value, skill, diversity, heuristics, oracles, coverage, procedures, context, quality criteria, product elements, recording, reporting? Is all that stuff “simply” testing? For any one of those elements of testing, when are your testers in control of their own process, and when are they being controlled? Are all 600+ at equivalent stages of development and experience? Are they all simply testing simply, or are some testing in more complex ways?

Watch out for the magic words “simply” or “just”. Those are magic words. They cast a spell, blinding and deafening people to complexity. Yet the blindness and deafness don’t make the complexity go away. Even though these words have all the weight of snowflakes, their cumulative effect is to cover up complexity like a heavy snowfall covers up a garden.

May be these posts should be titled “Testing” than “Exploratory Testing”?

There is already good number of groups/people taking advantage of the (confused state of the larger) testing community (like certification boards). Why to add fuel to this instead of simplifying things?

There’s a set of important answers to that, in my view.

  • Testing is a complex cognitive activity comprising many other complex cognitive activities. If we want to understand testing and learn how to do it well, we need to confront and embrace that complexity, instead of trying to simplify it away.
  • If we want our clients to understand the value, the costs, the extents, and the limitations of the services we can provide for them, we need to be able to explain what we’re doing, how we’re doing it, and why we’re doing it. That’s important so that both we and they can collaborate in making better informed choices about the information that we’re all seeking and the ways we go about obtaining that information.
  • One way to “simplify” matters is to pretend that testing is “simply” the preparation and then following of a script, or that exploratory testing is “simply” fooling around with the computer. If you’re upset at all about the certification boards that trivialize testing (as I am), it’s important to articulate and demonstrate the fact that testing is not at all a simple activity, and that comprehension of it cannot be assessed with any validity via a 40-question multiple choice test. Any claim to the contrary is, in my opinion, false, and charging money for such a test while making such a claim is, in my opinion, morally equivalent to theft. The whole scheme is founded in the premise that testing a tester is “simply” a matter of putting the tester through 40 checks. If we really wanted to evaluate and qualify a tester, we’d use an exploratory process: interviews, auditions, field testing, long sequence tests, compatibility tests, and so on. And we wouldn’t weed people out on the basis of them failing to take a bogus exam, any more than we’d reject a program for not being run against a set of automated checks that were irrelevant to what the program was actually supposed to do.
  • Just as software development is done in many contexts, so testing is done in many contexts. As we say in the Rapid Testing class, in excellent testing, your context informs your choices and vice versa. And in excellent testing, both your context and your choices evolve over time. I would argue that a heavily scripted process is more resistant to this evolution. That might be a good thing for certain purposes and certain contexts, and a not-at-all good thing for other purposes and other contexts.

Many people say, for example, that to test medical devices, you must do scripted testing. There is indeed much in medical device testing that must be checked. Problems of a certain class yield very nicely to scripted tests (checks), such that a scripted approach is warranted. The trouble comes with the implicit suggestion that if you must do scripted testing, you must not do exploratory testing. Yet if we agree that problems in a product don’t follow scripts; if we agree that there will be problems in requirements as well as in code; if we agree that we can’t recognize incompleteness or ambiguity in advance of encountering their consequences; if we agree that although we can address the unexpected we can’t eliminate it; and if we agree that people’s lives may be at stake: isn’t it the case that we must do exploratory testing in addition to any scripted testing that we might or might not do?

The answer is, to my mind, certainly Yes. So, to what extent, from moment to moment, are we emphasising one approach or the other? That’s not a question that we can answer by saying that we’re “just” testing.

Thanks again, Parthi, for prompting this post.

Can Exploratory Testing Be Automated?

Wednesday, September 22nd, 2010

In a comment on the previous post, Rahul asks,

One doubt which is lingering in my mind for quite sometime now, “Can exploratory testing be automated?”

There are (at least) two ways to interpret and answer that question. Let’s look first at answering the literal version of the question, by looking at Cem Kaner’s definition of exploratory testing:

Exploratory software testing is a style of software testing that emphasizes the personal freedom and responsibility of the individual tester to continually optimize the value of her work by treating test-related learning, test design, test execution, and test result interpretation as mutually supportive activities that run in parallel throughout the project.

If we take this definition of exploratory testing, we see that it’s not a thing that a person does, so much as a way that a person does it. An exploratory approach emphasizes the individual tester, and his/her freedom and responsibility. The definition identifies design, interpretation, and learning as key elements of an exploratory approach. None of these are things that we associate with machines or automation, except in terms of automation as a medium in the McLuhan sense: an extension (or enablement, or enhancement, or acceleration, or intensification) of human capabilities. The machine to a great degree handles the execution part, but the work in getting the machine to do it is governed by exploratory—not scripted—work.

Which brings us to the second way of looking at the question: can an exploratory approach include automation? The answer there is absolutely Yes.

Some people might have a problem with the idea, because of a parsimonious view of what test automation is, or does. To some, test automation is “getting the machine to perform the test”. I call that checking. I prefer to think of test automation in terms of what we say in the Rapid Software Testing course: test automation is any use of tools to support testing.

If yes then up to what extent? While I do exploration (investigation) on a product, I do whatever comes to my mind by thinking in reverse direction as how this piece of functionality would break? I am not sure if my approach is correct but so far it’s been working for me.

That’s certainly one way of applying the idea. Note that when you think in a reverse direction, you’re not following a script. “Thinking backwards” isn’t an algorithm; it’s a heuristic approach that you apply and that you interact with. Yet there’s more to test automation than breaking. I like your use of “investigation”, which to me suggests that you can use automation in any way to assist learning something about the program.

I read somewhere on Shrini Kulkarni’s blog that automating exploratory testing is an oxymoron, is it so?

In the first sense of the question, Yes, it is an oxymoron. Machines can do checking, but they can’t do testing, because they’re missing the ability to evaluate. Here, I don’t mean “evaluation” in the sense of performing a calculation and setting a bit. I mean evaluation in the sense of making a determination about what people value; what they might choose or prefer.

In the second way of interpreting the question, automating exploratory testing is impossible—but using automation as part of an exploratory process is entirely possible. Moreover, it can be exceedingly powerful, about which more below.

I see a general perception among junior testers (even among ignorant seniors) that in exploratory testing, there are no scripts (read test cases) to follow but first version of the definition i.e. “simultaneous test design, test execution, and learning” talks about test design also, which I have been following by writing basic test cases, building my understanding and then observing the application’s behavior once it is done, I move back to update the test cases and this continues till stakeholders agree with state of the application.

Please guide if it is what you call exploratory testing or my understanding of exploratory testing needs modifications.

That is an exploratory process, isn’t it? Let’s use the rubric of Kaner’s definition: it’s a style of working; it emphasizes your freedom and responsibility; it’s focused on optimizing the quality of your work; it treats design, execution, interpretation, and learning in a mutually supportive way; and it continues throughout the project. Yet it seems that the focus of what you’re trying to get to is a set of checks. Automation-assisted exploration can be very good for that, but it can be good for so much more besides.

So, modification? No, probably not much, so it seems. Expansion, maybe. Let me give you an example.

A while ago, I developed a program to be used in our testing classes. I developed that program test-first, creating some examples of input that it should accept and process, and input that it should reject. That was an exploratory process, in that I designed, executed, and interpreted unit checks, and I learned. It was also an automated process, to the degree that the execution of the checks and the aggregating and reporting of results was handled by the test framework. I used the result of each test, each set of checks, to inform both my design of the next check and the design of the program. So let me state this clearly:

Test-driven development is an exploratory process.

The running of the checks is not an exploratory process; that’s entirely scripted. But the design of the checks, the interpretation of the checks, the learning derived from the checks, the looping back into more design or coding of either program code or test code, or of interactive tests that don’t rely on automation so much: that’s all exploratory stuff.
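As a rough illustration of the scripted part, here is a minimal sketch of test-first unit checks using Python's standard `unittest` module. The `accepts` function and its rule (whole numbers from 0 to 255) are hypothetical, invented for this example; they are not the behaviour of the program described above.

```python
import unittest

def accepts(text):
    """Hypothetical rule under development: accept whole numbers from 0 to 255."""
    text = text.strip()
    return text.isdecimal() and 0 <= int(text) <= 255

class AcceptanceChecks(unittest.TestCase):
    """Examples written before the code, then run mechanically by the framework."""

    def test_accepts_examples(self):
        for example in ["0", "7", "255", " 42 "]:
            with self.subTest(example=example):
                self.assertTrue(accepts(example))

    def test_rejects_examples(self):
        for example in ["", "-1", "256", "3.5", "abc"]:
            with self.subTest(example=example):
                self.assertFalse(accepts(example))

def run_checks():
    """Execute the checks and return the framework's aggregated result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AcceptanceChecks)
    return unittest.TextTestRunner(verbosity=0).run(suite)

if __name__ == "__main__":
    run_checks()
```

The framework handles execution, aggregation, and reporting. Choosing which examples belong in each list, deciding what a surprising failure means, and deciding which check to write next are not in the code at all.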

The program that I wrote is a kind of puzzle that requires class participants to test and reverse-engineer what the program does. That’s an exploratory process; there aren’t scripted approaches to reverse engineering something, because the first unexpected piece of information derails the script. In workshopping this program with colleagues, one in particular—James Lyndsay—got curious about something that he saw. Curiosity can’t be automated. He decided to generate some test values to refine what he had discovered in earlier exploration. Sapient decisions can’t be automated. He used Excel, which is a powerful test automation tool, when you use it to support testing. He invented a couple of formulas. Invention can’t be automated. The formulas allowed Excel to generate a great big table. The actual generation of the data can be automated. He took that data from Excel, and used the Windows clipboard to throw the data against the input mechanism of the puzzle. Sending the output of one program to the input of another can be automated. The puzzle, as I wrote it, generates a log file automatically. Output logging can be automated. James noticed the logs without me telling him about them. Noticing can’t be automated. Since the program had just put out 256 lines of output, James scanned it with his eyes, looking for patterns in the output. Looking for specific patterns and noticing them can’t be automated unless and until you know what to look for. BUT automation can help to reveal hitherto unnoticed patterns by changing the context of your observation. James decided that the output he was observing was very interesting. Deciding whether something is interesting can’t be automated. James could have filtered the output by grepping for other instances of that pattern. Searching for a pattern, using regular expressions, is something that can be automated. James instead decided that a visual scan was fast enough and valuable enough for the task at hand. Evaluation of cost and value, and making decisions about them, can’t be automated. He discovered the answer to the puzzle that I had expressed in the program… and he identified results that blew my mind—ways in which the program was interpreting data in a way that was entirely correct, but far beyond my model of what I thought the program did.
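The parts of that story that can be automated might be sketched like this in Python. This is an illustration only: the `puzzle` function is a made-up stand-in (it is not the real puzzle and gives nothing away about it), and the sketch uses a plain list where James used Excel, the clipboard, and a log file.

```python
import re

def puzzle(value):
    """Hypothetical stand-in for the program under test; NOT the real puzzle."""
    return "ACCEPT" if bin(value).count("1") % 2 == 0 else "REJECT"

def generate_inputs(count=256):
    """Generating a big table of input values can be automated."""
    return list(range(count))

def run_and_log(inputs):
    """Feeding values to the program and logging its output can be automated."""
    return [f"{value:3d} {value:08b} {puzzle(value)}" for value in inputs]

def grep(log_lines, pattern):
    """Searching the log for a known pattern can be automated."""
    regex = re.compile(pattern)
    return [line for line in log_lines if regex.search(line)]

log = run_and_log(generate_inputs())
hits = grep(log, r"ACCEPT$")
print(len(log), "lines logged;", len(hits), "match the pattern")
```

Nothing in the sketch decides that 256 values are worth generating, that showing the binary form next to each value might reveal something, or that `ACCEPT$` is the pattern worth searching for. Those choices came from a person, which is the point of the story.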

Learning can’t be automated. Yet there is no way that we would have learned this so quickly without automation. The automation didn’t do the exploration on its own; instead, it super-charged our exploration. There were no automated checks in the testing that we did, so no automation in the record-and-playback sense, no automation in the expected/predicted result sense. Since then, I’ve done much more investigation of that seemingly simple puzzle, in which I’ve fed back what I’ve learned into more testing, using variations on James’ technique to explore the input and output space a lot more. And I’ve discovered that the program is far more complex than I could have imagined.

So: is that automating exploratory testing? I don’t think so. Is that using automation to assist an exploratory process? Absolutely.

For a more thorough treatment of exploratory approaches to automation, see

Investment Modeling as an Exemplar of Exploratory Test Automation (Cem Kaner)

Boost Your Testing Superpowers (James Bach)

Man and Machine: Combining the Power of the Human Mind with Automation Tools (Jonathan Kohl)

“Agile Automation” an Oxymoron? Resolved and Testing as a Creative Endeavor (Karen Wysopal)

…and those are just a few.

Thank you, Rahul, for the question.

All Testing is (not) Confirmatory

Tuesday, August 24th, 2010

In a recent blog post, Rahul Verma suggests that all testing is confirmatory.

First, I applaud his writing of an exploratory essay. I also welcome and appreciate critique of the testing vs. checking idea. I don’t agree with his conclusions, but maybe in the long run we can work something out.

In mythology, there was a fellow called Procrustes, an ironmonger. He had an iron bed which he claimed fit anyone perfectly. He accomplished a perfect fit by violently lengthening or shortening the guest. I think that, to some degree, Rahul is putting the idea of confirmation into Procrustes’ bed.

He cites the Oxford Online Dictionary definition of confirm: (verb) establish the truth or correctness of (something previously believed or suspected to be the case). (Rahul doesn’t cite the usage notes, which show some different senses of the word.)

When I describe a certain approach to testing as “confirmatory” in my discussion of testing vs. checking, I’m not trying to introduce another term. Instead, I’m using an ordinary English adjective to identify an approach or a mindset to testing. My emphasis is twofold: 1) not on the role of confirmation in test results, but rather on the role of confirmation in test design; and 2) on a key word in the definition Rahul cites, “previously”.

A confirmatory mindset would steer the tester towards designing a test based on a particular and specific hypothesis. A tester working in a confirmatory way would be oriented towards saying, “Someone or something has told me that the product should be able to do X. My test will demonstrate that it can do X.” Upon the execution of the (passing) test, the tester would say “See? The product can do X.” Such tests are aimed in the direction of showing that the product can work.

Someone working from an exploratory or investigative mindset would have a different, broader, more open-ended mission. “Someone or something has told me that the product does X. What are the extents and limitations of what we think of as X? What are the implications of doing X? What essential component of X might we have missed in our thinking about previous tests? What else happens when I ask the product to do X? Can I make the product do P, by asking it to do X in a slightly different way? What haven’t I noticed? What could I learn from the test that I’ve just executed?” Upon performing the test, the tester would report on whatever interesting information she might have discovered, which might include a pass or fail component, but might not. Exploratory tests are aimed at learning something about the product, how it can work, how it might work, and how it might not work; or if you like, on “if it will work”, rather than “that it can work”. To those who would reasonably object: yes, yes, no test ever shows that a product will work in all circumstances. But the focus here is on learning something novel, often focusing on robustness and adaptability. In this mindset, we’re typically seeking to find out how the program deals with whatever we throw at it, rather than on demonstrating that it can hit a pitch in the centre of the strike zone.

I believe that, in his post, Rahul is focused on the evaluation of the test, rather than on test design. That’s different from what I’m on about. He puts confirmation squarely into result interpretation, defining the confirmation step as “a decision (on) whether the test passed or failed or needs further investigation, based on observations made on the system as a result of the interaction. The observations are compared against the assumption(s).” I don’t think of that as confirmation (“establishing the truth or correctness of something previously believed or suspected to be the case”). I think of that as application of an oracle; as a comparison of the observed behaviour with a principle or mechanism that would allow us to recognize a problem. In the absence of any countervailing reason for it to be otherwise, we expect a product to be consistent with its history; with an image that someone wants to project; with comparable products; with specific claims; with reasonable user expectations; with the explicit or implicit purpose of the product; with itself in any set of observable aspects; and with relevant standards, statutes, regulations, or laws. (These heuristics, with an example of how they can be applied in an exploratory way, are listed as the HICCUPP heuristics here. It’s now “HICCUPPS”; we recognized the “Standards and Statutes” oracle after the article was written.)

At best, your starting hypothesis determines whether applying an oracle suggests confirmation. If your hypothesis is that the product works—that is, that the product behaves in a manner consistent with the oracle heuristics—then your approach might be described as confirmatory. Yet the confirmatory mindset has been identified in both general psychological literature and testing literature as highly problematic. Klayman and Ha point out in their 1987 paper Confirmation, Disconfirmation, and Information in Hypothesis Testing that “In rule discovery, the positive test strategy leads to the predominant use of positive hypothesis tests, in other words, a tendency to test cases you think will have the target property.” For software testing, this tendency (a form of confirmation bias) is dangerous because of the influence it has on your selection of tests. If you want to find problems, it’s important to take a disconfirmatory strategy—one that includes tests of conditions outside the space of the hypothesis that the program works. “For example, when dealing with a major communicable disease (or software bugs —MB), it is more serious to allow a true case to go undiagnosed and untreated than it is to mistakenly treat someone.” Here, Klayman and Ha point out, if we want to prevent disease, the emphasis should be on tests that are outside of those that would exemplify a desired attribute (like good health). In the medical case, they say that would involve “examining people who test negative for the disease, to find any missed cases, because they reveal potential false negatives.” In testing, the object would be to run tests that challenge the idea that the test should pass. This is consistent with Myers’ analysis in The Art of Software Testing (which, interestingly, as it was written in 1979, predates Klayman and Ha’s paper).

As I see it, if we’re testing the product (rather than, say, demonstrating it), we’re not looking for confirmation of the idea that it works; we’re seeking to disconfirm the idea that it works. Or, as James Bach might put it, we’re in the illusion demolition business.

One other point: Rahul suggests “Testing should be considered complete for a given interaction only when the result of confirmation in terms of pass or fail is available.” To me, that’s checking. A test should reveal information, but it does not have to pass or fail. For example, I might test a competitive product to discover the features that it offers; such tests don’t have a pass or fail component to them. A tester might be asked to compare a current product with a past version to look for differences between the two. A tester might be asked to use a product and describe her experience with it, such that there’s an evaluation without explicit, atomic pass or fail criteria. “Pass and fail” are highly limiting in terms of our view of the product: I’m sure that the arrival of yet another damned security message on Windows Vista was deemed as a pass in the suite of automated checks that got run on the system every night. But in terms of my happiness with the product, it’s a grinding and repeated failure. I think Rahul’s notion that a test must pass or fail is confused with the idea that a test should involve the application of a stopping heuristic. For a check, “pass or fail” is essential, since a check relies on the non-sapient application of a decision rule. For a test, pass-vs.-fail might be an example of the “mission accomplished” stopping heuristic, but there are plenty of other conditions that we might use to trigger the end of a test.

Since Rahul appears to be a performance tester, perhaps he’ll relate to this example (the framing of which I owe to the work of Cem Kaner). Imagine a system that has an explicit requirement to handle 100,000 transactions per minute. We have two performance testing questions that we’d like to address. One is the load testing question: “Can this system in fact handle 100,000 transactions per minute?” To me, that kind of question often gets addressed with a confirmatory mindset. The tester forms a hypothesis that the system does handle 100,000 transactions per minute; he sets up some automation to pump 100,000 transactions per minute through the system; and if the system stays up and exhibits no other problems, he asserts that the test passes.

The other performance question is a stress testing question: “In what circumstances will the system be unable to handle a given load, and fail?” For that we design a different kind of experiment. We have a hypothesis that the system will fail eventually as we ramp up the number of transactions. But we don’t know how many transactions will trigger the failure, nor do we know the part of the system in which the failure will occur, nor do we know the way in which the failure will manifest itself. We want to know those things, so we have a different information objective here than for the load test, and we have a mission that can’t be handled by a check.
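The difference between the two questions can be sketched in a few lines of Python. The `SimulatedSystem` class, its hidden capacity of 137,500, and the step size are all assumptions invented for the illustration; a real system fails in far messier ways than returning `False`.

```python
class SimulatedSystem:
    """Toy stand-in for a system under test; its capacity is hidden from the tester."""

    def __init__(self, capacity=137_500):
        self._capacity = capacity

    def handle(self, transactions_per_minute):
        """Return True if the system survives a minute at this load."""
        return transactions_per_minute <= self._capacity

def load_check(system, required=100_000):
    """Load question as a check: one decision rule, one pass/fail answer."""
    return system.handle(required)

def stress_ramp(system, start=100_000, step=5_000, limit=1_000_000):
    """Stress question: ramp the load and record what happened at each level."""
    observations = []
    load = start
    while load <= limit:
        survived = system.handle(load)
        observations.append((load, survived))
        if not survived:
            break
        load += step
    return observations

system = SimulatedSystem()
print("load check passed:", load_check(system))
observations = stress_ramp(system)
print("last observation (load, survived):", observations[-1])
```

The load check returns a single bit. The ramp returns observations that still need interpreting: in this toy it can only say roughly where the failure began, not which part of the system failed or how the failure showed itself, which are the things the stress question is really after.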

In the latter test, there is a confirmatory dimension if you’re willing to look hard enough for it. We “confirm” our hypothesis that, given heavy enough stress, the system will exhibit some problem. When we apply an oracle that exposes a failure like a crash, maybe one could say that we “confirm” that the crash is a problem, or that behaviour we consider to be bad is bad. Even in the former test, we could flip the hypothesis, and suggest that we’re seeking to confirm the hypothesis that the program doesn’t support a load of 100,000 transactions per minute. If Rahul wants to do that, he’s welcome to do so. To me, though, labelling all that stuff as “confirmatory” testing reminds me of Procrustes.

Questions from Listeners (2): Is Unit Testing Automated?

Monday, June 28th, 2010

On April 19, 2010, I was interviewed by Gil Broza.  In preparation for that interview, we solicited questions from the listeners, and I promised to answer them either in the interview or in my blog.  Here’s the second one.

Unit testing is automated. When functional, integration, and system test cannot be automated, how to handle regression testing without exploding the manual test with each iteration?

This question provides a great opportunity to look at a number of points—so many that I’d like to address only the first sentence in the question this time around. I’ll look at the second part of the question later on.

Expansive Definitions

I find the most helpful definitions and descriptions to be those that are expansive and inclusive. While testing, one big risk is that I might have narrow ideas about certain risks or threats to the value of the product. Thinking expansively helps me to avoid tunnel vision that would lead to my missing important problems. In conversations, thinking expansively helps me to remain alert to the possibility that the other person and I might be talking at cross-purposes. That can happen when one of us uses a word that means different things to each of us. It can also happen when we’re thinking of the same thing, but using different words. In fact, as Jerry Weinberg once remarked to James Bach, “A tester is someone who knows that things can be different.” Here’s an example of that. The questioner says that “unit testing is automated”. I’d argue that this refers to one part of testing, test execution, the part we can automate. Well, to me, things can be different.

Testing Includes Many Activities

Testing includes not only test execution, but also test design, learning, and reporting, all performed in cycles or loops. What is test design? As we say in the Rapid Software Testing course notes, test design includes

  • modeling the test space (that is, considering questions of what we could test; what’s in scope);
  • determining oracles (that is, figuring out the principles or mechanisms by which we’d recognize a problem, and considering how those principles or mechanisms might fail to help us recognize a problem)
  • determining coverage (that is, how much testing we’re going to do, given the scope)
  • determining procedures (how we’re going to perform the tests; how we’ll go about the business of test execution)

Test execution includes

  • configuring the product (obtaining it, setting it up for the purposes of a given test)
  • operating the product (exercising the product in some way to obtain coverage)
  • observing the product (applying the oracles that we’ve determined in advance, but also recognizing behaviours that trigger us to recognize and apply new oracles)
  • evaluating the product (comparing its behaviour to our oracles)
  • applying a stopping heuristic (deciding when the test is done)

Test execution may or may not include reporting, but reporting happens at some point. And when testing is being done well, learning is happening pretty much all the time. This isn’t a strictly linear process, by the way. Depending on your approach to testing, and depending on what you’re testing, these things may happen in the order that you see above, or they may happen all at once in an organic tangled ball, with lots of tight little loops. Sometimes all of the elements of testing are done by the same person, and the elements interact with each other very quickly. Sometimes one person designs a test and another person handles the execution, in which case the loops will be long or broken. If you separate test design and test execution (as happens in scripted testing), you separate the learning associated with each. Sometimes we’ll evaluate a result and stop a test; sometimes we’ll stop first and then interpret what we’ve seen. For a given test, some aspects may take much longer than others; some may be done more consciously or thoughtfully than others. But at some point in pretty much every test, each of the steps above happens.

    Unit Testing Includes Many Activities

    Like any other kind of testing, unit testing consists of cycles of design, execution, learning, and reporting. Like any other test, a unit test starts with some person having a test idea, a question that we want to ask about the program. A person designing a unit test typically frames that question in terms of a check—an observation linked to a decision rule such that both can be performed by a machine. The person writes program code to express that yes-or-no question, usually assisted by some kind of unit testing framework. Next, some person—or, more often, some process that a person has initiated—performs the checks. The check produces a result. Sometimes a person observes that result independently of other results; more often, some person (the author of the automation framework) has programmed a mechanism that provides a means of aggregating the results. Then some person interprets the aggregated results and figures out what needs to be done next—whether everything is okay, whether a test result suggests that the product should be revised, or whether the check is excellent or wanting or broken or irrelevant. And then the development cycle continues, in a loop that includes some development of the actual product too.
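As a rough illustration of the cycle just described, here is a small sketch using Python's standard `unittest` module. The function `apply_discount` and the two checks are hypothetical, invented only for this example. The comments mark which parts a person does and which part the machine does.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical unit under test, invented for this sketch."""
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountChecks(unittest.TestCase):
    # Design (a person): each method encodes a yes-or-no question
    # that someone decided was worth asking about the unit.
    def test_ten_percent_off(self):
        self.assertEqual(apply_discount(200.0, 10), 180.0)

    def test_zero_percent_changes_nothing(self):
        self.assertEqual(apply_discount(59.99, 0), 59.99)


def run_checks() -> unittest.TestResult:
    """Execution (the machine): run every check and aggregate the bits."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountChecks)
    result = unittest.TestResult()
    suite.run(result)
    return result


# Interpretation (a person again): the result object reports how many
# checks ran and which failed; deciding what that means is not automated.
result = run_checks()
print(f"ran={result.testsRun} failures={len(result.failures)}")
```

Only `run_checks` is mechanical here. Choosing the questions, writing the code, and deciding what a failure means all remain with the person.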

    Most Parts of Unit Testing Are Sapient, Not Mechanical

    Notice how many times the word “person” appears in the above description of unit testing. None of the steps in the process (with the exception of the running of the checks) can be automated, since each step requires a thinking person, rather than a machine, to seek information, to make decisions, and to control the overall process. Parts of unit testing can be assisted by automation, but the automation isn’t doing anything particularly on its own; it remains an extension of the person’s ability to execute and to observe.

    What form might unit test automation take? Many people think in terms of a testing framework that sets up some conditions, executes some code from the product under test, and makes some assertions about the output of some function or some aspect of the state of the system. That’s cool, and quite powerful. But for years at Quarterdeck, I watched programmers doing unit testing (and did some myself) by stepping through code under various debuggers (DEBUG, SYMDEB, WDEB386, or Soft-ICE, a software-based simulacrum of an in-circuit emulator), watching the registers and the ports for each instruction. Sometimes I’m writing some stuff in Ruby, and I want to do a quick little test of a fairly trivial function that I know I’m going to throw away. In that case, I don’t bother with the testing framework; I run the code and inspect the variables in IRB, the Ruby interpreter, and get my information that way. Sometimes I write a function, and generate some data to test it using automation. Sometimes, while unit testing, I use tools to examine the contents of a database table or a file or the Windows registry. Are all these different things unit testing? Jerry Weinberg says that testing is “gathering information with the intention of informing a decision”. I’m testing a unit, and I’m using automation to assist that testing, even though (so it seems) people tend to hold a more narrow view of what unit testing is. Unit testing is testing done at the unit level.

    Is stepping through the code the way that we should always do unit testing? Of course not. For the purpose of creating easily-runnable change detectors, the unit test framework is the way to go. Yet different approaches, tools, and techniques that we employ allow us to observe in different ways, discover different problems, and learn different things about the unit under test.

    Finally, it’s important to note that the development of unit-level checks tends to reveal more problems than the running of them. Chip Groeder won a best paper award at the STAR conference in 1997, in which he claimed that 88% of the bugs that he found with automated tests were found during development of the tests (that is, the non-automated parts of the testing). (Thanks to Cem Kaner for pointing me to this.)  Anecdotally, everyone that I speak to who uses automation for the execution of tests—whether at the unit level or not—says exactly the same thing.  That’s not to say that automated checks are useless.  On the contrary; checks, as change detectors, are very useful.  Instead, my point is that unit testing is not automated; not the interesting parts. Unit checking is automated.

    In summary:

    • Unit testing is a highly exploratory process, in that the loops are short, tightly integrated, and typically performed by the same person.
    • The most important parts of unit testing are the sapient parts—the design, programming, design of reports, interpretation of results, and the evaluation of what to do next.
    • The scripted part of unit testing—the execution of the checks—is the least interesting part of unit testing. And yet…
    • Many people seem to be fascinated by the mechanical parts, dazzled by lines on the screen, blissful upon observation of the green bar. And the same people say things like “unit testing is automated”. Why is that?

    That’s a lot for now. I’ll answer the rest of the question in a future post.

    “Merely” Checking or “Merely” Testing

    Tuesday, November 10th, 2009

    The distinction between testing vs. checking got a big boost recently from James Bach at the Øredev conference in Malmö, Sweden. But a recent tweet by Brian Marick, and a recent conversation with a colleague have highlighted an issue that I should probably address.

    My colleague suggested that somehow I may have underplayed the significance or importance or the worth of checking. Brian’s tweet said,

    “I think the trendy distinction between “testing” and “checking” is a power play: which would you preface with “mere”? http://bit.ly/2Cuyj”

    As a consequence, I was worried that I had ever said “mere checking” or “merely checking” in one of my blog postings or on Twitter, so I researched it. Apparently I had not; that was a relief. However, the fact that I was suspicious even of myself suggests that maybe I need to clarify something.

    The distinction between testing and checking is a power play, but it’s not a power play between (say) testers and programmers. It’s a power play about the glorification of mechanizable assertions over human intelligence. It’s a power play between sapient and non-sapient actions.

    Recall that the action of a check has three parts to it. Part one is an observation of a product. Part two is a decision rule, by which we can compare that empirical observation of the product with an idea that someone had about it. Part three is the setting of a bit (pass or fail, yes or no, true or false) that represents the non-sapient application of both the observation and the decision rule. Note, too, that this means that a check can be performed by one of two agencies: 1) a machine; or 2) a sufficiently disengaged human; that is, a human who has been scripted to behave like a machine, and who has for whatever reason accepted that assignment.
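The three parts of a check can be sketched in a few lines of Python. This is only one way to picture it. The names `check` and `page_title` are invented for this illustration, and a real observation would query an actual product rather than return a constant.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def check(observe: Callable[[], T], decision_rule: Callable[[T], bool]) -> bool:
    """Perform a check: observation, decision rule, and a single bit."""
    observation = observe()                # part one: observe the product
    verdict = decision_rule(observation)   # part two: apply the decision rule
    return bool(verdict)                   # part three: set the bit


def page_title() -> str:
    """Hypothetical observation; a real one would query the product."""
    return "Welcome"


# The idea someone had about the product: the title should be "Welcome".
passed = check(page_title, lambda title: title == "Welcome")
```

Nothing inside `check` requires judgment. The judgment went into choosing what to observe and what rule to apply, and it comes back when someone reads the resulting bit.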

    So checks can be hugely important. Checks are a means by which a programmer, engaged in test-driven development, checks his idea. Creating the check and analyzing its result are both testing activities. Checks are a valuable product (a by-product, some would say) of test-driven development. Checks are change detectors, tools that allow programmers to refactor with confidence. Checks built into continuous integration are mechanisms to make sure that our builds can work well enough to be tested—or, if we’re confident enough in the prior quality of our testing, can work well enough to be deployed. Checks tend to shorten the loop between the implementation of an idea and the discovery of a problem that the checks can detect, since the checks are typically designed and run (a lot, iteratively) by the person doing the implementation. Checks tend to speed up certain aspects of the post-programmer testing of the product, since good checks will find the kind of dopey, embarrassing errors that even the best programmers can make from time to time. The need for checks sometimes (alas, not always) prompts us to create interfaces that can be used by programmers or testers to aid in later exploration.

    Checking represents the rediscovery of techniques that were around at least in 1957. “The first attack on the checkout problem may be made before coding has begun.” D. D. McCracken, Digital Computer Programming, 1957 (Thanks to Ben Simo for inspiring me to purchase a copy of this book.) In 2007, I had dinner with Jerry Weinberg and Josh Kerievsky. Josh asked Jerry if he did a lot of unit testing back in the day. Jerry practically did a spit-take, saying “Yes, of course. Computer time was hugely expensive, but we programmers were cheap. Getting the program right was really important, so we had to test a lot.” Then he added something that hadn’t occurred to me. “There was another reason, too. Apart from everything else, we tested because the machinery was so unreliable. We’d run a test program, then run the program we wrote, then run the test program again to make sure that we got the same result the second time. We had to make sure that no tubes had blown out.”

    So, in those senses, checking rocks. Checking has always rocked. It seems that in some places, people forgot how much it rocks, and that the Agilists have rediscovered it.

    Yet it’s important to note that checks on their own don’t deliver value unless there’s sapient engagement with them. What do I mean by that?

    As James Bach says here, “A sapient process is any process that relies on skilled humans.” Sapience is the capacity to act with human intelligence, human judgment, and some degree of human wisdom.

    It takes sapience to recognize the need for a check—a risk, or a potential vulnerability. It takes sapience—testing skill—to express that need in terms of a test idea. It takes sapience—more test design skill—to express that test idea in terms of a question that we could ask about the program. Sapience—in terms of testing skill, and probably some programming skill—is needed to frame that question as a yes-or-no, true-or-false, pass-or-fail question. Sapience, in the form of programming skill, is required to turn that question into executable code that can implement the check (or, far more expensively and with less value, into a test script for execution by a non-sapient human). We need sapience—testing skill again—to identify an event or condition that would trigger some agency to perform the check. We need sapience—programming skill again—to encode that trigger into executable code so that the process can be automated.

    Sapience disappears while the check is being performed. By definition, the observation, the decision rule, and the setting of the bit all happen without the cognitive engagement of a skilled human.

    Once the check has been performed, though, skill comes back into the picture for reporting. Checks are rarely done on their own, so they must be aggregated. The aggregation is typically handled by the application of programming skill. To make the outcome of the check observable, the aggregated results must be turned into a human-readable report of some kind, which requires both testing and programming skill. The human observation of the report, intake, is by definition a sapient process. Then comes interpretation. The human ascribes meaning to the various parts of the report, which requires skills of testing and of critical thinking. The human ascribes significance to the meaning, which again takes testing and critical thinking skill. Sapient activity by someone—a tester, a programmer, or a product owner—is needed to determine the response. Upon deciding on significance, more sapient action is required—fixing the application being checked (by the production programmer); fixing or updating the check (by the person who designed or programmed the check); adding a new check (by whomever might want to do so) or getting rid of the check (by one or more people who matter, and who have decided that the check is no longer relevant).
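The aggregation and reporting steps might look something like the following sketch. The function names and the report layout are assumptions made for illustration, not a description of any particular framework. Notice where the code stops: it produces a report, and everything after that (intake, interpretation, deciding on significance and response) is left to a person.

```python
from typing import Dict, List


def aggregate(results: Dict[str, bool]) -> Dict[str, List[str]]:
    """Group check names by the bit that each check produced."""
    grouped: Dict[str, List[str]] = {"pass": [], "fail": []}
    for name, passed in results.items():
        grouped["pass" if passed else "fail"].append(name)
    return grouped


def render_report(grouped: Dict[str, List[str]]) -> str:
    """Turn the aggregated bits into something a person can read."""
    lines = [f"{len(grouped['pass'])} passed, {len(grouped['fail'])} failed"]
    lines += [f"  FAILED: {name}" for name in grouped["fail"]]
    return "\n".join(lines)


# Hypothetical check names and results, invented for this example.
report = render_report(aggregate({
    "login_accepts_valid_user": True,
    "cart_total_includes_tax": False,
}))
print(report)
```

Whether the failing check points to a bug in the product, a stale check, or a check that no longer matters is not something the report can say.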

    So: the check in and of itself is relatively trivial. It’s all that stuff around the check—the testing and programming and analysis activity—that’s important, supremely important. And as is usual with important stuff, there are potential traps.

    The first trap is that it might be easy to do any of the sapient aspects of checking badly. Since the checks are at their core software, there might be problems in requirements, design, coding, or interpretation, just as there might be with any software.

    The second trap is that it can be easy to fall asleep somewhere between the report and interpretation stages of the checking process. The green bar tells us that All Is Well, but we must be careful about that. All is well with respect to the checks that we’ve programmed is a very different statement. Red tends to get our attention, but green is an addictive and narcotic colour. A passing test is another White Swan, confirmation of our existing beliefs, proof by induction. Now, we can’t live without proof by induction, but induction can’t alert us to new problems. Millions of repeated tests, repeated thousands of times, don’t tell us about the bugs that elude them. We only need one Black Swan to bump into a devastating effect.

    The third trap is that we might believe that checking a program is all there is to testing it. Checking done well incorporates an enormous amount of testing and programming skill, but some quality attributes of a program are not machine-decidable. Checks are the kinds of tests that aren’t vulnerable to the halting problem. Someone on a mailing list once said, “Once all the (automated) acceptance tests pass (that is, all the checks), we know we’re done.” I liked Joe Rainsberger’s reply, “No, you’re not done; you’re ready to give it to a real tester to kick the snot out of it.” That kicking is usually expressed with greater emphasis on exploration, discovery, and investigation, and rather less on confirmation, verification, and validation.

    The fourth trap is a close cousin of the third trap: at certain points, we might pay undue attention to the value of checking with respect to its cost. Cost vs. value is a dominating problem with any kind of testing, of course. One of the reasons that the Agile emphasis on testing remains exciting is that excellent checking lowers the cost of testing, and both help to defend the value of the program. Yet checks may not be Just The Thing for some purposes. Joe has expressed concerns in his series Integration Tests are a Scam, and Brian Marick did too, a while ago, in An Alternative to Business-Facing TDD. I think they’re both making important points here, thinking of checks as a means to an end, rather than as a fetish.

    Fifth: upon noting the previous four traps (and others), we might be tempted to diminish the value of checking. That would be a mistake. Pretty much any program is made more testable by someone removing problems before someone else sees them. Every bug or issue that we find could trigger investigation, reporting, fixing, and retesting, and that gives other (and potentially more serious) problems time to hide. Checking helps to prevent those unhappy discoveries. Excellent checking (which incorporates excellent testing) will tend to reduce the number of problems in the product at any given time, and thereby results in a more testable program. James Bach points out that a good manual test could never be automated (he’d say “sapient” now, I believe). But note that in that same post he says that “if you can truly automate a manual test, it couldn’t have been a good manual test”, and “if you have a great automated test, it’s not the same as the manual test that you believe you were automating”. The point is that there are such things as great automated tests, and some of them might be checks.

    So the power play is over which we’re going to value: the checks (“we have 50,000 automated tests”) or the checking. Mere checks aren’t important; but checking—the activity required to build, maintain, and analyze the checks—is. To paraphrase Eisenhower, with respect to checking, the checks are nothing; the checking is everything. Yet the checking isn’t everything; neither is the testing. They’re both important, and to me, neither can be appropriately preceded with “mere”, or “merely”.

    There’s one exception, though: If you’re only doing one or the other, it might be important to say, “You’ve merely been testing the program; wouldn’t you be better off checking it, too?” or “That program hasn’t been tested; it’s merely been checked.”

    See more on testing vs. checking.