Blog Posts from August, 2009

Testing vs. Checking

Saturday, August 29th, 2009

Post-postscript: Think of this blog post with its feet up, enjoying a relaxing retirement after a strenuous career. Please read the new version first. In the years since the original post, I’ve further refined my take on the subject of testing and checking, mostly in collaboration with my colleague James Bach. Our current thinking on the topic appears on his blog, and I provide some followup here. We’ve also benefitted from comments and questions from other colleagues, so we encourage you to read the comments, and to comment yourself. Then come back here if you’re still interested. I’ll wait. Read it. (August 2014)

OK. Since you’re back, we can carry on into the past.

Postscript: Over the years, some people have misinterpreted this post as a rejection of checking, or of regression testing, or of testing that is assisted by automation. So in addition to reading this post, it is important that you also read this one.

This posting is an expansion of a lightning talk that I gave at Agile 2009. Many thanks to Arlo Belshee and James Shore for providing the platform. Many thanks also to the programmers and testers at the session for the volume of support that the talk and the idea received. Special thanks to Joe (J.B.) Rainsberger. Spread the meme!

There is confusion in the software development business over a distinction between testing and checking. I will now attempt to make the distinction clearer.

Checking Is Confirmation

Checking is something that we do with the motivation of confirming existing beliefs. Checking is a process of confirmation, verification, and validation. When we already believe something to be true, we verify our belief by checking. We check when we’ve made a change to the code and we want to make sure that everything that worked before still works. When we have an assumption that’s important, we check to make sure the assumption holds. Excellent programmers do a lot of checking as they write and modify their code, creating automated routines that they run frequently to check to make sure that the code hasn’t broken. Checking is focused on making sure that the program doesn’t fail.
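
To make the distinction concrete, here is a minimal sketch of the kind of automated check a programmer might run frequently. The discount_price function and its expected values are hypothetical, invented only for illustration.

```python
import unittest

def discount_price(price, percent):
    """Hypothetical function under check: apply a percentage discount."""
    return round(price * (1 - percent / 100.0), 2)

class DiscountChecks(unittest.TestCase):
    # Each assertion encodes something we already believe to be true.
    # The machine decides pass or fail; no human judgement is involved.
    def test_ten_percent_off(self):
        self.assertEqual(discount_price(100.00, 10), 90.00)

    def test_zero_percent_leaves_price_unchanged(self):
        self.assertEqual(discount_price(100.00, 0), 100.00)

if __name__ == "__main__":
    unittest.main()  # green means "still consistent with what we already believed"
```

A green bar here confirms what we already believed; it says nothing about problems we never thought to assert.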

Testing Is Exploration and Learning

Testing is something that we do with the motivation of finding new information. Testing is a process of exploration, discovery, investigation, and learning. When we configure, operate, and observe a product with the intention of evaluating it, or with the intention of recognizing a problem that we hadn’t anticipated, we’re testing. We’re testing when we’re trying to find out about the extents and limitations of the product and its design, and when we’re largely driven by questions that haven’t been answered or even asked before. As James Bach and I say in our Rapid Software Testing classes, testing is focused on “learning sufficiently everything that matters about how the program works and about how it might not work.”

Checks Are Machine-Decidable; Tests Require Sapience

A check provides a binary result—true or false, yes or no. Checking is all about asking and answering the question “Does this assertion pass or fail?” Such simple assertions tend to be machine-decidable and are, in and of themselves, value-neutral.

A test has an open-ended result. Testing is about asking and answering the question “Is there a problem here?” That kind of decision requires the application of many human observations combined with many value judgements.

When a check passes, we don’t know whether the program works; we only know that it’s still working within the scope of our expectations. The program might have serious problems, even though the check passes. To paraphrase Dijkstra, “checking can prove the presence of bugs, but not their absence.” Machines can recognize inconsistencies and problems that they have been programmed to recognize, but not new ones. Testing doesn’t tell us whether the program works either—certainty on such questions isn’t available—but testing may provide the basis of a strong inference addressing the question “problem or no problem?”

Testing is, in part, the process of finding out whether our checks have been good enough. When we find a problem through testing, one reasonable response is to write one or more checks to make sure that that particular problem doesn’t crop up again.

Whether we automate the process or not, if we could express our question such that a machine could ask and answer it via an assertion, it’s almost certainly checking. If it requires a human, it’s a sapient process, and is far more likely to be testing. In James Bach’s seminal blog entry on sapient processes, he says, “My business is software testing. I have heard many people say they are in my business, too. Sometimes, when these people talk about automating tests, I think they probably aren’t in my business, after all. They couldn’t be, because what I think I’m doing is very hard to automate in any meaningful way. So I wonder… what the heck are they automating?” I have an answer: they’re automating checks.

When we talk about “tests” at any level in which we delegate the pass or fail decision to the machine, we’re talking about automated checks. I propose, therefore, that those things that we usually call “unit tests” be called “unit checks”. By the same token, I propose that automated acceptance “tests” (of the kind Ron Jeffries refers to in his blog post on automating story “tests”) become known as automated acceptance checks. (These proposals appeared to galvanize a group of skilled programmers and testers in a workshop at Agile 2009, something about which I’ll have more to say in a later blog post.)

Testing Is Not Quality Assurance, But Checking Might Be

You can assure the quality of something over which you have control; that is, you can provide some level of assurance that it fulfills some requirement, and you can accept responsibility if it does not fulfill that requirement. If you don’t have authority to change something, you can’t assure its quality, although you can evaluate it and report on what you’ve found. (See pages 6 and 7 of this paper, in which Cem Kaner explains the distinction between testing and quality assurance and cites Johanna Rothman’s excellent set of questions that help to make the distinction.) Testing is not quality assurance, but acts in service to it; we supply information to programmers and managers who have the authority to make decisions about the project.

Checking, when done by a programmer, is mostly a quality assurance practice. When a programmer writes code, he checks his work. He might do this by running it directly and observing the results, or by observing the behaviour of the code under the debugger, but often he writes a set of routines that exercise the code and perform some assertions on it. We call these unit “tests”, but they’re really checks, since the idea is to confirm existing knowledge. In this context, finding new information would be considered a surprise, and typically an unpleasant one. A failing check prompts the programmer to change the code to make it work the way he expects. That’s the quality assurance angle: a programmer helps to assure the quality of his work by checking it.

Testing, the search for new information, is not a quality assurance practice per se. Instead, testing informs quality assurance. Testing, to paraphrase Jerry Weinberg, is gathering information with the intention of informing a decision, or as James Bach says, “questioning a product in order to evaluate it.” Evaluation of a product doesn’t assure its quality, but it can inform decisions that will have an impact on quality. Testing might involve a good deal of checking; I’ll discuss that at more length below.

Checkers Require Specifications; Testers Do Not

A tester, as Jerry Weinberg said, is “someone who knows that things can be different”. As testers, it’s our job to discover information; often that information is in terms of inconsistencies between what people think and what’s true in reality. (Cem Kaner’s definition of testing covers this nicely: “testing is an empirical, technical investigation of a product, done on behalf of stakeholders, with the intention of revealing quality-related information of the kind that they seek.”)

We often hear old-school “testing” proponents claim that good testing requires specifications that are clear, complete, up-to-date, and unambiguous. (I like to ask these people, “What do you mean by ‘unambiguous’?” They rarely take well to the joke. But I digress.) A tester does not require the certainty of a perfect specification to make useful observations and inferences about the product. Indeed, the tester’s task might be to gather information that exposes weakness or ambiguity in a specification, with the intention of providing information to the people who can clear it up. Part of the tester’s role might be to reveal problems when the plans for the product and the implementation have diverged at some point, even if part of the plan has never been written down. A tester’s task might be to reveal problems that occur when our excellent code calls buggy code in someone else’s library, for which we don’t have a specification. Capable testers can deal easily with such situations.

A person who needs a clear, complete, up-to-date, unambiguous specification to proceed is a checker, not a tester. A person who needs a test script to proceed is a checker, not a tester. A person who does nothing but compare a program against some reference is a checker, not a tester.

Testing vs. Checking Is A Leaky Abstraction

Joel Spolsky has named a law worthy of the general systems movement, the Law of Leaky Abstractions (“All non-trivial abstractions, to some degree, are leaky.”). In the process of developing a product, we might alternate very rapidly between checking and testing. The distinction between the two lies primarily in our motivations. Let’s look at some examples.

  • A programmer who is writing some new code might be exploring the problem space. In her mind, she has a question about how she should proceed. She writes an assertion—a check. Then she writes some code to make the assertion pass. The assertion doesn’t pass, so she changes the code. The assertion still doesn’t pass. She recognizes that her initial conception of the problem was incomplete, so she changes the assertion, and writes some more code. This time the check passes, indicating that the assertion and the code are in agreement. She has an idea to write another bit of code, and repeats the process of writing a check first, then writing some code to make it pass. She also makes sure that the original check passes. Next, she sees the possibility that the code could fail given a different input. She believes it will succeed, but writes a new check to make sure. It passes. She tries different input. It fails, so she has to investigate the problem. She realizes her mistake, and uses her learning to inform a new check; then she writes functional code to fix the problem and pass the check.

    So far, her process has been largely exploratory. Even though she’s been using checks to support the process, her focus has been on learning, exploring the problem space, discovering problems in the code, and investigating those problems. In that sense, she’s testing as she’s programming. At the end of this burst of development, she now has some functional code that will go into the product. As a happy side effect, she has another body of code that will help her to check automatically for problems if and when the functional code gets modified. (A simplified sketch of this check-first cycle appears after this list.)

    Mark Simpson, a programmer that I spoke to at Agile 2009, said that this cyclic process is like bushwhacking, hacking a new trail through the problem space. There are lots of ways that you could go, and you clear the bush of uncertainty around you in an attempt to get to where you’re going. Historically, this process has been called “test-driven development”, which is a little unfortunate in that TDD-style “tests” are actually checks. Yet it would be hard, and even a little unfair, to argue that the overall process is not exploratory to a significant degree. Programmers engaged in TDD have a goal, but the path to the goal is not necessarily clear. If you don’t know exactly where you’re going to end up and exactly how you’re going to get there, you have to do some amount of exploration. The moniker “behavior-driven development” (BDD) helps to clear up the confusion to some degree, but it’s not yet in widespread adoption. BDD uses checks in the form “(The program) should…”, but the development process requires a lot of testing of the ideas as they’re being shaped.
  • Now our programmer looks over her code, and realizes that one of the variables is named in an unclear way, that one line of code would be more readable and maintainable expressed as two, and that a group of three lines could be more elegantly and clearly expressed as a for loop. She decides to refactor. She addresses the problems one at a time, running her checks after each change. Her intention in running these checks is not to explore; it’s to confirm that nothing’s been messed up. She doesn’t develop new checks; she’s pretty sure the old ones will do. At this point, she’s not really testing the product; she’s checking her work.
  • Much of the traditional “testing” literature suggests that “testing” is a process of validation and verification, as though we already know how the code should work. Although testing does involve some checking, a program that is only checked is likely to be poorly tested. Much of the testing literature focuses on correctness—which can be checked—and ignores the sapience that is necessary to inform deeper questions about value, which must be tested. For example, that which is called “boundary testing” is usually boundary checking.

    The canonical example is that of the program that adds two two-digit integers, where values in the range from -99 to 99 are accepted, and everything else is rejected. The classic advice on how to “test” such a program focuses on boundary conditions, given in a form something like this: “Try -99 and 99 to verify that valid values are accepted, and try -100 and 100 to verify that invalid values are rejected.” I would argue that these “tests” are so weak as to be called checks; they’re frightfully obvious, they’re focused on confirmation, they focus on output rather than outcome, and they could be easily mechanized. (The second sketch after this list shows those checks written out.)

    If you wanted to test a program like that, you’d configure, operate, and observe the product with eyes open to many more risks, including ones that aren’t at the forefront of your consciousness until a problem manifests itself. You’d be prepared to consider anything that might threaten the value of the product—problems related to performance, installability, usability, testability, and many other quality criteria. You’d tend to vary your tests, rather than repeating them. You’d engage curiosity, and perform a smattering of tests unrelated to your current models of risks and threats, with the goal of recognizing unanticipated risks. You might use automation to assist your exploration; perhaps you would use automation to generate data, to track coverage, to parse log files, to probe the registry or the file system for unanticipated effects. Even if you used automation to punch the keys for you, you’d use the automation in an exploratory way; you’d be prepared to change your line of investigation and your tactics when a test reveals surprising information.
  • The exploratory mindset is focused on questions like “What if…?” “I wonder…?” “How does this thing…?” “What happens when I…?” Even though we might be testing a program with a strongly exploratory approach, we will engage a number of confirmatory kinds of ideas. “If I press on that Cancel button, that dialog should go away.” “That field is asking for U.S. ZIP code; the field should accept at least five digits.” “I’ll double-click on ‘foo.doc’, and that file should open in Microsoft Word on this system.” Excellent testers hold these and dozens of other assumptions and assertions as a matter of course. We may not even be conscious of them being checks, but we’re checking sub-consciously as we explore and learn about the program. Should one of these checks fail, we might be prompted to seek new information, or if the behaviour seems reasonable, we might instead change our model of how the program is supposed to work. That’s a heuristic process (a fallible means of solving a problem or making a decision, conducive to learning; we presume that a heuristic usually works but that it might fail).
  • Also at Agile 2009, Chris McMahon gave a presentation called “History of a Large Test Automation Project using Selenium”. He described an approach of using several thousand automated checks (he called them “tests”) to find problems in the application. How to describe the difference? Again, the difference is one of motivation. If you’re running thousands of automated checks with the intention of demonstrating that you’re okay today, just like you were yesterday, you’re checking. You could use those automated checks in a different way, though. If you are trying to answer new questions, such as “what would happen if we ran our checks on thirty machines at once to really pound the server?” (where we’re focused on stress testing), or “what would happen if we ran our automated checks on this new platform?” (where we’re focused on compatibility testing), or “what would happen if we were to run our automated checks 300 times in a row?” (where we’re focused on flow testing), the category of checks would be leaking into testing (which would be a fine thing in these cases).
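
To make the first item above more concrete, here is a highly simplified sketch of the check-first cycle it describes. The parse_quantity function, its behaviour, and its checks are hypothetical; they are not drawn from the original example.

```python
import unittest

def parse_quantity(text):
    """Hypothetical code written, a little at a time, to satisfy the checks below."""
    value = int(text.strip())
    if value < 0:
        raise ValueError("quantity cannot be negative")
    return value

class ParseQuantityChecks(unittest.TestCase):
    # Each check was written before the code that satisfies it; each failing
    # run prompted a change either to the code or to the programmer's
    # understanding of the problem.
    def test_plain_number(self):
        self.assertEqual(parse_quantity("12"), 12)

    def test_surrounding_whitespace_is_ignored(self):
        self.assertEqual(parse_quantity(" 7 "), 7)

    def test_negative_quantity_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_quantity("-3")

if __name__ == "__main__":
    unittest.main()
```

The checks support and record the exploration; deciding what the next check should be, and noticing the surprise when one fails, is where the testing happens.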
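
And here is what the boundary “tests” in the third item look like when written down. Expressed this way, it is easier to see that they are machine-decidable checks; the add_two_digit function is, again, hypothetical.

```python
def add_two_digit(a, b):
    """Hypothetical two-digit adder: accepts operands from -99 to 99."""
    if not (-99 <= a <= 99) or not (-99 <= b <= 99):
        raise ValueError("operands must be between -99 and 99")
    return a + b

# The classic boundary "tests": confirmatory, obvious, and easily mechanized.
assert add_two_digit(99, 99) == 198        # largest valid operands are accepted
assert add_two_digit(-99, -99) == -198     # smallest valid operands are accepted

for invalid in (100, -100):
    try:
        add_two_digit(invalid, 0)
    except ValueError:
        pass                               # invalid operand rejected, as expected
    else:
        raise AssertionError("expected %d to be rejected" % invalid)

# A tester would go further: vary the inputs, question the spec ("what about
# non-numeric input? leading zeros? values that overflow the caller?"), and
# watch for problems that these assertions were never designed to notice.
```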

There will be much, much more to say about testing vs. confirmation in the days ahead. I can guarantee that people won’t adopt this distinction across the board, nor will they do it overnight. But I encourage you to consider the distinction, and to make it explicit when you can.


See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

The Tyranny of Always

Tuesday, August 25th, 2009

I just spent $3,000 to get my nose fixed, and then I found out it was my tie that was crooked.

—Steve Shrott

There’s a piece of software development mythodology that suggests that it’s always more expensive to fix a problem late in the development process than early. Usually the ratios quoted are fantastic: a hundred to one, a thousand to one, ten thousand to one. Let’s put that idea under the microscope for a moment.

The original idea comes from work that Barry Boehm did at TRW, a defense contractor. That work says “often”, not “always”, and cites a figure of 100 to one.

Here are some stories to contrast with the always/10,000 to one trope.

  • This has happened dozens of times for me, and I rather suspect that this has happened at least once for large numbers of testers: just before release, you find a bug, you show it to a developer, and she says, “Oh. Cool. Just a second. [type, type, type]. Fixed.” That didn’t cost 10,000 to one.
  • This has also happened for me, and I suspect that it has happened for practically everyone as well: you find a bug, you show it to a program manager, and she says, “Oh. Well, it is a problem, I suppose, but we’ve got really important problems to fix and this one isn’t such a big deal. Let’s not bother with fixing it.” After release, no one complains. That didn’t cost 10,000 to one either.
  • Similarly, we can spend a lot of time and money trying to work around the persistent and pernicious problems with a certain platform, or we can simply decide to drop support for it and wait for the problem to go away. The recent movement to drop support for Internet Explorer 6.0 is a case in point; IE 6’s market share has been dropping consistently month by month since September 2005 (weirdly, there was an uptick in June 2009), and it’s down to 14.5 per cent now. Dropping support would save Web developers everywhere a good deal of grief, maybe to the temporary displeasure of some customers—who will eventually upgrade anyway. Will that cost 10,000 to one?
  • This has happened for some of us: “Gak! Another bug? This feature really isn’t working. We should back this whole feature out.” That might save way more than it costs.
  • Another one from my own experience: a programmer (let’s call him Phil) struggled for weeks and weeks with a problem to no avail. He decided to put off fixing it. Shipping time approached, and Phil’s problem list was too long for him to handle on his own. Another programmer (let’s call him Ron), having just shipped his own product, was suddenly available. With the luxury of a clear mind and an absence of preconceptions, he was able to fix the problem within an hour. Had Phil been forced to try to fix the problem earlier, he might have wasted enormous amounts of time. He saved time by delaying.
  • We choose to delay shipping the product to fix a problem. In so doing, we miss our shipping schedule, and our company declares a zero-revenue quarter. The drop in the value of our stock puts us out of business. Fixing the problem in that case may well have been unwise.
  • Testing reveals that the problem is a bug in someone else’s code. They fix the bug. Our cost to fix the problem is not 10,000 to one; it’s free.
  • Many organizations prepare prototypes. There are problems in the prototype. We don’t fix those problems right away; we save them for the Real Thing.
  • After deferring the problem for a while, some third party comes up with an update or a library or a toolkit that addresses the problem. Jonathan Bach tells a wonderful story about a bug that was found during testing of a popular commercial software product that he worked on. The developer resisted fixing the problem, and the bug languished for many months, consistently deferred because the development effort (and therefore the cost) of the fix was intolerably high. The team eventually decided that the problem was significant enough to fix. The developer looked into solving the problem, and discovered that a third-party library had been released one week earlier; that library made the effort associated with the fix trivial. Had the developer tried to fix the problem earlier, that effort would have been spent at the cost of not being able to fix other bugs.
  • We might decline to fix a trivial bug now because fixing that bug might unblock a bunch of much more serious bugs.
  • After wrestling with the problem for a while, and then dropping it, someone comes up with a flash of insight. We develop a new approach to solving it. This takes much less time than our original approach would have taken.
  • We choose to release software, having fixed some of the problems and having ignored others, based on our theories of what the customers will like and dislike. We discover that people care very deeply about some of the problems that we didn’t fix, and that they wouldn’t have cared about the ones that we did.

Now: there are plenty of cases in which it does cost vastly less to fix a problem earlier than later. One of the more dramatic cases of this is the 1994 Pentium FDIV problem, in which a handful of missing entries from a table caused a number of floating point calculations to be handled incorrectly. That one probably could have been caught with more careful work, and it cost a billion dollars or so to manage the PR debacle. Yet how much would it have cost Intel to make sure that its processors were perfect? To this day, Intel doesn’t fix everything that it finds in testing. No one would pay for processors created with that level of effort.

It’s probably the case that it’s often or even usually more expensive to fix a problem later than earlier, based on certain assumptions. Yet I believe that it’s important to make sure that those assumptions aren’t buried under slogans. Software development isn’t assembling Tinkertoys according to a canned plan; it is suffused with questions of context and cost vs. value. It’s not merely assembly; it’s design and learning and discovery and investigation and branching and backtracking and breaking and fixing and learning some more. Yes, it’s important to be a good craftsman, and a good craftsman methodically fixes problems as he finds them… usually. But it’s also important to recognize that craftsmen make their own decisions about what’s most important right now. As Victor Basili and Forrest Shull say (in a book edited by Boehm):

“It is clear that there are many sources of variation between one development context and another, and it is not clear a priori what specific variables influence the effectiveness of a process in a given context.”

So: Always question simplistic slogans that start with “always”.

Test Estimation Is Really Negotiation

Thursday, August 20th, 2009

Some of this posting is based on a conversation from a little while back on TestRepublic.com.

If anyone has a problem with “test estimation”, here’s a thought experiment:

Your manager (your client) wants to give you an assignment: to evaluate someone’s English skills, with the intention of qualifying him to work with your team. So how long would it take you to figure out whether a Spanish-speaking person spoke English well enough to join your team? Ponder that for a second, and then consider a few different kinds of Spanish-speaking people:

1) The fellow who, in response to every question you ask in English, replies, “Que?”

2) The fellow who speaks very articulately, until you mention the word “array”. And then he says, “Que?”

3) The fellow who spouts all kinds of technical talk perfectly, but when you say, “Let’s go for lunch,” says “Que?”

4) The fellow who speaks perfectly clearly, but every now and then spouts an obscenity.

5) The fellow who speaks English perfectly, but has no technical ability whatsoever.

6) The fellow who has great technical chops and speaks better English than the Queen, but spits tobacco juice in the corner every minute and a half.

How long you need to test a candidate’s capacity to speak English isn’t a question that has a firm answer, since the answer surely depends on

a) the candidate;
b) the extent to which you and the client want to examine them;
c) the mission upon which the candidate will be sent;
d) the information that you discover about the candidate;
e) the demands and schedule of the project for which you’re qualifying candidates;
f) the criteria upon which your client will decide they have enough information;
g) the amount of money and resources that the client is prepared to give you for your evaluation;
h) the amount of time that the client is prepared to give you.

So, yes, you can provide an estimate. Your client will often demand one. Mind you, since (h) is going to constrain your answer every time, you might as well start by asking the client how long you have to test. If the client answers with a date or a time, you don’t have to estimate how long it’s going to take you.

Suppose the client doesn’t provide a date. Do you know anything about the candidate? Before the interview, you find out that he’s only ever been a rickshaw driver; no previous experience with testing; no previous experience with computers. He speaks no English, but has a habit of screaming at the top of his lungs once every twenty minutes. In this case, you probably don’t have to estimate. It would take less time to report to your client that the candidate is likely to be unsuitable than it would to prepare an estimate for how long it will take to evaluate him. Why bother?

So here’s another candidate. This woman has been working at Microsoft for ten years, the first eight as a tester and the last two as a test lead. Her references have all checked out. The mission is to test a text-only Website of three pages, no programmatic features. In this case, you probably won’t have to estimate. It would take less time to report to your client that the candidate is likely to be qualified (or overqualified) than it would to prepare an estimate. Why bother?

The information that you discover in your evaluation of the candidate’s English skills is to a large degree unpredictable. The problem that sinks him might not be related to his English, and you might not discover a crucial problem until after he’s been hired. The problems that you discover might be deemed insufficient to disqualify him from the job, since ultimately it’s the manager who’s going to decide.

So instead of thinking about estimation in testing, think about negotiation. Testing is an open-ended task, and it must respond to development work. The quality of that development work and the problems that we might find are open questions (if they weren’t, we wouldn’t be testing). In addition, the decision to ship the product (which includes a decision to stop testing) is a business decision, not a technical one.

In cases where you don’t know things about the candidate, you can certainly propose a suite of questions and exercises that you’ll put them through, and negotiate that with the client. In the case of the first candidate, the very first bit of information that you receive is likely to change all of your choices about what to ask them and how you’re going to test them. In the second case, your interview will probably be quick too, but for the opposite reason. It’s in the cases in between, when you’re dealing with uncertainty and want to dispel it, that your testing will take somewhat longer, will require probing and investigation of questions that arise during the interview—and that may require extra time that you may have to negotiate with your client. One thing’s for sure: you probably don’t want to spend so much time designing the protocol that it has a serious negative impact on your interviewing time, right?

For those who are still interested (or unconvinced) and haven’t seen it, you might like to look at this:

http://www.developsense.com/2007/01/test-project-estimation-rapid-way.html

McLuhan on Blink Testing

Tuesday, August 4th, 2009

At about the 10-minute mark in this video (from 1968) Marshall McLuhan refers to an expression that he claimed was then in use at IBM: “Information overload leads to pattern recognition.” This is central to the idea of what blink testing is all about. He also describes characteristics of the scientist’s mindset vs. the artist’s mindset, which reminded me of similar patterns that we might see in programmers and testers.

Thanks to Adam Goucher for pointing me to the video.