Blog Posts from September, 2009

Context-free Questions For Testing and Checking

Wednesday, September 30th, 2009

After a presentation on exploratory approaches and on testing vs. checking yesterday, a correspondent and old friend writes:

Although the presentation made good arguments for exploratory testing, I am not sure a small QA department can spare the resources unless a majority of regression checking can be moved to automation. Particularly in situations with short QA cycles.

(Notice that he and I are using “testing” and “checking” in this specific way.)

Any time someone makes an observation about what is or isn’t possible, irrespective of the kind of testing (or checking) that they’re doing, it suggests some questions for the testing, programming, and management teams. I’d ask my old friend:

1) How much checking do you need to do?

2) What, specifically, suggests that checking needs to be done? What happens when you do it? What doesn’t happen when you do it? What happens when you don’t do it? What doesn’t happen when you don’t do it?

3) What, specifically, might suggest that the testers are the best people to do the checking? What, specifically, might suggest that they aren’t the best people to do it?

4) Where do your testers spend their time? When you speak with the people who are actually testing, do they feel the time that they’re spending on checking is worthwhile? Do they have things to say about what slows down testing (or checking)?

5) What are the risks that checking addresses well? What risks are not addressed well by checking?

These are open questions that all teams can ask, regardless of the approach they’re using now. Feel free to replace the word “checking” with “testing”, and vice versa, wherever you like.

I encourage and, when asked, help people to ask and answer these questions, and others like them. I have no specific answers from the outset; I don’t know you, and I don’t know your context. But you do. Maybe the questions can be helpful to you. I hope so.

See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

A Letter To The Programmer

Tuesday, September 29th, 2009

This is a letter that I would not show to a programmer in a real-life situation. I’ve often thought of bits of it at a time, and those bits come up in conversation occasionally, but not all at once.

This is based on an observation of the chat window in Skype 4.0.0.226.

Dear Programmer,

I discovered a bug today. I’ll tell you how I found it. It’s pretty easy to reproduce. There’s this input field in our program. I didn’t know what the intended limit was. It was documented somewhere, but that part of the spec got deleted when the CM system went down last week. I could have asked you, but you were downstairs getting another latte.

Plus, it’s really quick and easy to find out empirically; quicker than looking it up, quicker than asking you, even if you were here. There’s this tool called PerlClip that allows me to create strings that look like this:

*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*…

As you’ll notice, the string itself tells you about its own length. The number to the left of each asterisk tells you the offset position of that asterisk in the string. (You can use whatever character you like for a delimiter, including letters and numbers, so that you can test fields that filter unwanted characters.)

It takes a handful of keystrokes to generate a string of tremendous length, millions of characters. The tool automatically copies it to the Windows clipboard, whereupon you can paste it into an input field. Right away, you get to see the apparent limit of the field; find an asterisk, and you can figure out in a moment exactly how many characters it accepts. It makes it easy to produce all kinds of strings using Perl syntax, which saves you from having to write a line of Perl script to do it and another few lines to get it into the clipboard. In fact, you can give PerlClip to a less-experienced tester who doesn’t know Perl syntax at all (yet), show them a few examples and the online help, and they can get plenty of bang for the buck. They get to learn something about Perl, too. This little tool is like a keychain version of a Swiss Army knife for data generation. It’s dead handy for analyzing input constraints. It allows you to create all kinds of cool patterns, or data that describes itself, and you can store the output wherever you can paste from the clipboard. Oh, and it’s free.
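(If you’re curious about the mechanics, here’s a minimal sketch in Python of the same self-describing-string idea. It isn’t PerlClip; the function names are mine, and its output starts “2*4*6*8*11*…” rather than with a bare asterisk as in the example above, but the principle is identical: the number to the left of each marker is that marker’s position. The second helper reads the apparent limit back out of whatever survives a paste-and-truncate.)

    def counterstring(length, marker="*"):
        # Build a self-describing string: the number to the left of each marker
        # gives that marker's 1-based position in the string.
        out = []
        pos = 0  # characters written so far
        while pos < length:
            # The next marker lands at pos + len(str(star)) + 1, where star is
            # that same landing position; iterate until the guess is consistent.
            star = pos + 2
            while star != pos + len(str(star)) + 1:
                star = pos + len(str(star)) + 1
            if star > length:
                out.append(marker * (length - pos))  # no room for another pair
                break
            out.append(str(star) + marker)
            pos = star
        return "".join(out)

    def apparent_length(surviving_text, marker="*"):
        # Find the last "<digits><marker>" pair in whatever text survived
        # truncation; the digits give that marker's position, and anything
        # after it still counts toward the total.
        i = len(surviving_text) - 1
        while i > 0:
            if surviving_text[i] == marker and surviving_text[i - 1].isdigit():
                start = i - 1
                while start > 0 and surviving_text[start - 1].isdigit():
                    start -= 1
                return int(surviving_text[start:i]) + (len(surviving_text) - i - 1)
            i -= 1
        return len(surviving_text)

(And if you happen to have the third-party pyperclip package installed, pyperclip.copy(counterstring(1_000_000)) puts a million-character string on the clipboard, much as PerlClip does.)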

You can get a copy of PerlClip here, by the way. It was written by James Bach and Danny Faught. The idea started with a Perl one-liner by Danny, and they built on each other’s ideas for it. I don’t think it took them very long to write it. Once you’ve had the idea, it’s a pretty trivial program to implement. But still, kind of a cool idea, don’t you think?

So anyway, I created a string a million characters long, and I pasted it into the chat window input field. I saw that the input field apparently accepted 32768 characters before it truncated the rest of the input. So I guess your limit is 32768 characters.

Then I pressed “Send”, and the text appeared in the output field. Well, not all of it. I saw the first 29996 characters, and then two periods, and then nothing else. The rest of the text had vanished.

That’s weird. It doesn’t seem like a big deal, does it? Yet there’s this thing called representativeness bias. It’s a critical thinking error, the phenomenon that causes us to believe that a big problem always looks big from every angle, and that a problem with a trivial manifestation must have only trivial consequences.

Our biases are influenced by our world views. For example, last week when that tester found that crash in that critical routine, everyone else panicked, but you realized that it was only a one-byte fix and we were back in business within a few minutes. It also goes the other way, though: something that looks trivial or harmless can have dire and shocking consequences, made all the more risky because of the trivial nature of the symptom. If we assume that symptoms, problems, and fixes are all alike in terms of significance, then when we see a trivial symptom, no one bothers to investigate the underlying problem. It’s only a little rounding error, and it only happens on one transaction in ten, and it only costs half a cent at most. When that rounding error is multiplied over hundreds of transactions a minute, tens of thousands an hour… well, you get the point.
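(Just to put rough, invented numbers on that, in the spirit of the example:)

    # Back-of-the-envelope arithmetic with made-up numbers: half a cent lost
    # on one transaction in ten, at three hundred transactions per minute.
    loss_per_bad_txn = 0.005      # dollars
    bad_txn_rate = 1 / 10         # one transaction in ten goes wrong
    txns_per_minute = 300

    loss_per_day = loss_per_bad_txn * bad_txn_rate * txns_per_minute * 60 * 24
    loss_per_year = loss_per_day * 365
    print(f"about ${loss_per_day:,.0f} a day, ${loss_per_year:,.0f} a year")
    # about $216 a day, $78,840 a year, from a bug whose symptom is "half a cent"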

I’m well aware that, as a test, this is a toy. It’s like a security check where you rattle the doorknob. It’s like testing a car by kicking the tires. And the result that I’m seeing is like the doorknob falling off, or the door opening, or a tire suddenly hissing. For a tester, this is a mere bagatelle. It’s a trivial test. Yet when a trivial test reveals something that we can’t explain immediately, it might be a good idea to seek an explanation.

A few things occurred to me as possibilities.

  • The first one is that someone, somewhere, is missing some kind of internal check in the code. Maybe it’s you; maybe it’s the guy who wrote the parser downstream; maybe it’s the guy who’s writing the display engine. But it seems to me as though you figured that you could send 32768 bytes, while someone else has a limit of 29998 bytes. Or 29996, probably. Well, maybe.
  • Maybe one of you isn’t aware of the published limits of the third-party toolkits you’re using. That wouldn’t be the first time. It wouldn’t necessarily be negligence on your part, either—the docs for those toolkits are terrible, I know.
  • Maybe the published limit is available, but there’s simply a bug in one of those toolkits. In that case, maybe there isn’t a big problem here, but there’s a much bigger problem that the toolkit causes elsewhere in the code.
  • Maybe you’re not using third-party toolkits. Maybe they’re toolkits that we developed here. Mind you, that’s exactly the same as the last problem; if you’re not aware of the limits, or if there’s a bug, who produced the code has no bearing on the behaviour of the code.
  • Maybe you’re not using toolkits at all, for any given function. Mind you, that doesn’t change the nature of the problems above either.
  • Maybe some downstream guy is truncating everything over 29996 bytes, placing those two dots at the end, and ignoring everything else, and he’s not sending a return value to you to let you know that he’s doing it.
  • Maybe he is sending you a return value, but the wrong one.
  • Maybe he’s sending you a return value, and you’re ignoring it.
  • Maybe he’s sending you a return value, and you are paying attention to it, but there’s some confusion about what it means and how it should be handled.
  • Maybe you’re truncating the last two and a half kilobytes or so of data before you send it on, and we’re not telling the user about it. Maybe that’s your intention. Seems a little rude to me to do that, but to you, it works as designed. To some user, it doesn’t work—as designed.
  • Maybe there’s no one else involved, and it’s just you working on all those bits of the code, but the program has now become sufficiently complex that you’re unable to keep everything in your head. That stands to reason; it is a complicated program, with lots of bits and pieces.
  • Maybe you’re depending on unit tests to tell you if anything is wrong with the individual functions or objects. But maybe nothing is wrong with any particular one of them in isolation; maybe it’s the interaction between them that’s problematic.
  • Maybe you don’t have any unit tests at all.
  • Maybe you do have unit tests for this stuff. From right here, I can’t tell. If you do have them, I can’t tell whether your checks are really great and you just missed one this time, or if you missed a few, or if you missed a bunch of them, or whether there’s a ton of them and they’re all really lousy.
  • Any of the above explanations could be in play, many of them simultaneously. No matter what, though, all your unit tests could pass, and you’d never know about the problem until we took out all the mocks and hooked everything up in the real system. Or deployed into the field. (Actually, by now they’re not unit tests; they’re just unit checks, since it’s a while since this part of the code was last looked at and we’ve been seeing green bars for the last few months.)
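To make that last point concrete, here’s a toy sketch in Python, with names and numbers invented by me from the outside; I have no idea what your real code looks like. Each layer’s check passes on its own, and the green bar says nothing about the characters that quietly vanish between the layers:

    import unittest

    MAX_INPUT = 32768    # what the input field appears to accept
    MAX_RENDER = 29996   # what the display appears to keep, before the two dots

    def accept_input(text):
        # Input-field layer: silently truncate to its own limit.
        return text[:MAX_INPUT]

    def render_message(text):
        # Display layer: silently truncate to *its* limit and append two dots.
        return text if len(text) <= MAX_RENDER else text[:MAX_RENDER] + ".."

    class UnitChecks(unittest.TestCase):
        def test_accept_input_respects_its_limit(self):
            self.assertEqual(len(accept_input("x" * 50000)), MAX_INPUT)

        def test_render_message_respects_its_limit(self):
            self.assertEqual(len(render_message("x" * 50000)), MAX_RENDER + 2)

    # Both checks pass; each layer "works as designed" in isolation. Yet the
    # integrated path accepts 32768 characters, shows 29996 of them, and tells
    # the user nothing about the couple of thousand that disappeared.

    if __name__ == "__main__":
        unittest.main()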

For any one of the cases above, since it’s so easy to test and check for these things, I would think that if you or anyone else knew about this problem, your sense of professionalism and craftsmanship would tell you to do some testing, write some checks, and fix it. After all, as Uncle Bob Martin said, you guys don’t want us to find any bugs, right?

But it’s not my place to say that. All that stuff is up to you. I don’t tell you how to do your work; I tell you what I observe, in this case entirely from the outside. Plus it’s only one test. I’ll have to do a few more tests to find out if there’s a more general problem. Maybe this is an aberration.

Now, I know you’re fond of saying, “No user would ever do that.” I think what you really mean is that no user that you’ve thought of, and that you like, would do that on purpose. But it might be worth considering users that you haven’t thought of, however unlikely they and their task might be to you. It could be a good idea to think of users that neither one of us likes, such as hackers or identity thieves. It could also be important to think of users that you do like who would do things by accident. People make mistakes all the time. In fact, by accident, I pasted the text of this message into another program, just a second ago.

So far, I’ve only talked about the source of the problem and the trigger for it. I haven’t talked much about possible consequences, or risks. Let’s consider some of those.

  • A customer could lose up to 2770 bytes of data. That actually sounds like a low-risk thing, to me. It seems pretty unlikely that someone would type or paste that much data in any kind of routine way. Still, I did hear from one person that they like to paste stack traces into a chat window. You responded rather dismissively to that. It does sound like a corner case.
  • Maybe you don’t report truncated data as a matter of course, and there are tons of other problems like this in the code, in places that I’m not yet aware of or that are invisible from the black box. This particular problem may be minor, but a problem with the same kind of cause could lead to something much more serious than this unlikely scenario.
  • Maybe there is a consistent pattern of user interface problems where the internals of the code handle problems but don’t alert the user, even though the user might like to know about them.
  • Maybe there’s a buffer overrun. That worries me more—a lot more—than the stack trace thing above. You remember how this kind of problem used to be dismissed as a “corner case” back when we worked at Microsoft—and then how Microsoft shut down new product development and spent two months investigating these kinds of problems, back in the spring of 2002? Hundreds of worms and viruses and denial of service attacks stem from problems whose outward manifestation looked exactly as trivial as this problem. There are variations on it.
  • Maybe there’s a buffer overrun that would allow other users to view a conversation that my contact and I would like to keep between ourselves.
  • Maybe an appropriately crafted string could allow hackers to get at some of my account information.
  • Maybe an appropriately crafted string could allow hackers to get at everyone‘s account information.
  • Maybe there’s a vulnerability that allows access to system files, as the Blaster worm did.
  • Maybe the product is now unstable, and there’s a crash about to happen that hasn’t yet manifested itself. We never know for sure if a test is finished.
  • Here’s something that I think is more troubling, and perhaps the biggest risk of all. Maybe, by blowing off this report, you’ll discourage testers from reporting a similarly trivial symptom of a much more serious problem. In a meeting a couple of weeks ago, the last time a tester reported something like this, you castigated her in public for the apparently trivial nature of the problem. She was embarrassed and intimidated. These days she doesn’t report anything except symptoms that she thinks you’ll consider sufficiently dramatic. In fact, just yesterday she saw something that she thought to be a pretty serious performance issue, but she’s keeping mum about it. Some time several weeks from now, when we start to do thousands or millions of transactions, you may find yourself wishing that she had felt okay about speaking up today. Or who knows; maybe you’ll just ask her why she didn’t find that bug.

NASA calls this last problem “the normalization of deviance”. In fact, this tiny little inconsistency reminds me of the Challenger problem. Remember that? There were these O-rings that were supposed to keep two chambers of highly-pressurized gases separate from each other. It turns out that on seven of the shuttle flights that preceded the Challenger, these O-rings burned through a bit and some gases leaked (they called this “erosion” and “blow-by”). Various managers managed to convince themselves that it wasn’t a problem, because it only happened on about a third of the flights, and the rings, at most, only burned a third of the way through. Because these “little” problems didn’t result in catastrophe the first seven times, NASA managers used this as evidence for safety. Every successful flight that had the problem was taken as reassurance that NASA could get away with it. In that sense, it was like Nassim Nicholas Taleb’s turkey, who increases his belief in the benevolence of the farmer every day… until some time in the week before Thanksgiving.

Richard Feynman, in his Appendix to the Rogers Commission Report on the Space Shuttle Challenger Accident, nailed the issue:

The phenomenon of accepting for flight, seals that had shown erosion and blow-by in previous flights, is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next.

That’s the problem with any evidence of any bug, at first observation; we only know about a symptom, not the cause, and not the consequences. When the system is in an unpredicted state, it’s in an unpredictable state.

Software is wonderfully deterministic, in that it does exactly what we tell it to do. But, as you know, there’s sometimes a big difference between what we tell it to do and what we meant to tell it to do. When software does what we tell it to do instead of what we meant, we find ourselves off the map that we drew for ourselves. And once we’re off the map, we don’t know where we are.

According to Wikipedia,

Feynman’s investigations also revealed that there had been many serious doubts raised about the O-ring seals by engineers at Morton Thiokol, which made the solid fuel boosters, but communication failures had led to their concerns being ignored by NASA management. He found similar failures in procedure in many other areas at NASA, but singled out its software development for praise due to its rigorous and highly effective quality control procedures – then under threat from NASA management, which wished to reduce testing to save money given that the tests had always been passed.

At NASA, back then, the software people realized that just because their checks were passing, it didn’t mean that they should relax their diligence. They realized that what really reduced risk on the project was appropriate testing, lots of tests, and paying attention to seemingly inconsequential failures.

I know we’re not sending people to the moon here. Even though we don’t know the consequences of this inconsistency, it’s hard to conceive of anyone dying because of it. So let’s make it clear: I’m not saying that the sky is falling, and I’m not making a value judgment as to whether we should fix it. That stuff is for you and the project managers to decide upon. It’s simply my role to observe it, to investigate it, and to report it.

I think it might be important, though, for us to understand why the problem is there in the first place. That’s because I don’t know whether the problem that I’m seeing is a big deal. And the thing is, until you’ve looked at the code, neither do you.

As always, it’s your call. And as usual, I’m happy to assist you in running whatever tests you’d like me to run on your behalf. I’ll also poke around and see if I can find any other surprises.

Your friend,

The Tester

P.S. I did run a second test. This time, I used PerlClip to craft a string of 100000 instances of :). That pair of characters, in normal circumstances, results in a smiley-face emoticon. It seemed as though the input field accepted the characters literally, and then converted them to the graphical smiley face. It took a long, long time for the input field to render this. I thought that my chat window had crashed, but it hadn’t. Eventually it finished processing, and displayed what it had parsed from this odd input. I didn’t see 32768 smileys, nor 29996, nor 16384, nor 14998. I saw exactly two dots. Weird, huh?

Should We Call Test-Driven Development Something Else?

Monday, September 28th, 2009

In the first post in this series, I proposed “that those things that we usually call ‘unit tests’ be called ‘unit checks’.” I stand by the proposal, but I should clarify something important about it. See, it’s all a matter of timing. And, of course, sapience.

After James Bach’s blog post titled “Sapience and Blowing People’s Minds”, Joe Rainsberger commented:

Sadly, the distinction between testing and checking makes describing test-driven development (TDD) somewhat awkward, because it’s a test when I write it and a check after I run it for the first time. Am I test-driving or check-driving?

Joe has put his finger on something that’s important: that in the mangle of practice, things are constantly changing, and so are our perspectives on them.

In The Elements of Testing and Checking, I broke down the process of developing, performing, and analyzing a check. The most important thing to note is that the check (an observation linked to a decision rule) can be performed non-sapiently, but that everything surrounding it—the development and analysis of the check—is sapient, and is testing. Test-driven development is first and foremost development, and development is a sapient process. The interactive process of developing a check and analyzing its outcome is a sapient process; the development cycle includes having an idea, testing it and responding to the information revealed by the test (the whole process), even when the result is supplied by a check (an atomic part of the test process). TDD is an exploratory, heuristic process. You don’t know in advance what your solution is going to look like; you explore the problem space and build your solution iteratively, and you stop when you decide to stop.
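Here’s a toy illustration of that timing, in Python; the problem and the names are invented, and it’s a sketch of the rhythm rather than a claim about how anyone in particular practices TDD. While I’m writing the assertions below, they’re part of my thinking about what “truncate politely” should even mean; that’s test activity. Months later, when the same functions run unattended in a suite and I merely glance at a green bar, they’ve become checks.

    def truncate_politely(text, limit):
        # Cut text at the limit, but say so, rather than silently dropping the rest.
        if len(text) <= limit:
            return text, False
        return text[:limit], True   # the flag lets the caller warn the user

    # Written before truncate_politely existed, these drove its design.
    # Now they run unattended and act as change detectors.
    def test_truncation_is_reported():
        kept, was_truncated = truncate_politely("x" * 100, limit=10)
        assert kept == "x" * 10
        assert was_truncated is True

    def test_short_text_passes_through_unflagged():
        kept, was_truncated = truncate_politely("hello", limit=10)
        assert kept == "hello"
        assert was_truncated is False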

Several years ago, James and Jon Bach produced a set of exploratory skills, tactics, and dynamics:

  • Modeling
  • Resourcing
  • Chartering
  • Observing
  • Manipulating
  • Pairing (now called Collaborating)
  • Generation and Elaboration
  • Overproduction and Abandonment
  • Abandonment and Recovery
  • Refocusing (Focusing and Defocusing)
  • Alternating
  • Branching and Backtracking
  • Conjecturing
  • Recording
  • Reporting

(I believe that several other people have made contributions to the original list, including Jonathan Kohl and Mike Kelly. I’d also include tooling—building tools, rather than merely obtaining or resourcing them, and orienteering—figuring out where you are in relation to where you want to be. I think James disagrees. That’s okay; good colleagues do that. The cool thing about such lists is that they can evolve as we think and learn more, and disagreeing helps us to figure out what’s important eventually. Maybe I’ll drop them, or maybe James will adopt them.)

The point is that these exploratory skills, tactics and dynamics apply not only to testing, but to practically any open-ended heuristic process. Note how TDD, done well, incorporates practically all of the stuff from James and Jon’s original list, which was focused on testing.

So the answer to the question in the title of this post is this: No; there’s no need to rename TDD. It really is test-driven development.

As James replied to Joe,

Strictly speaking you are “doing testing” by “writing checks”, but not actually “writing tests.” If you run the checks unattended and accept the green bar as is, then that is not testing. It requires absolutely no testing skill to do that, just as you wouldn’t say someone is doing programming just because they invoke a compiler and verify that the compile was successful. If the bar is NOT green, the process of investigating is testing, as well as debugging, programming, etc.

If you watch the tests (James means checks here, I think –MB) run or ponder the deeper meaning of the green bar, you are doing testing along with the checking.

Think of “test” as a verb rather than a noun, and it becomes clear that test-driven design is truly test-driven design, although the testing is rather simplistic, based primarily on those little checks. Once the design is done the automated checks become useful as change detectors against the risk of regression. They certainly aid the testing process, despite not being tests.

Checks definitely do NOT drive development. Development is never a rote and non-sapient process. It’s far better to say test-driven, because the design of the checks is a thoughtful process.

So what of the earlier business about calling unit tests “unit checks”?

For me, the distinction lies in the artifact—that xUnit thingy, or that rSpec assertion—and the way that you approach it. A minor gloss on Joe’s comment: the thingy might not be a check after you run it the first time, especially if it doesn’t pass. At that point, it is still very much part of your conscious interaction with the business of creating working code; it’s figure, rather than ground.

After you’ve solved the problem that your unit of code is intended to solve, the thingy’s prominence fades from figure into ground. You’re no longer really paying much attention to it. There’s no design activity going on with respect to it, it gets performed automatically and non-sapiently, and its result gets ignored, especially when the result is positive and aggregated with dozens, hundreds, or thousands of other positive results. At that point, it’s no longer shedding any particular cognitive light on what you’re doing, and its testing power has faded into a single pixel in a pale green glow. It’s now a check, no longer a test but a change detector. In fact, you might think of “check” as an abbreviation for “change detector”. The change from a test to a check is a kind of reverse metamorphosis, as though an intriguing, fluttering butterfly has turned into a not-very-interesting, ponderous little green caterpillar. That’s not to say that it’s not an important part of the Great Chain of Being; just that we tend not to pay much attention to it. However, we might pay more attention to the caterpillar when it’s red.

As I’ve said repeatedly, what you call them is less important than how you think of them. As James says,

I wouldn’t insist that people change their ordinary language. I see no problem calling whales “fish” or spiders “insects” in everyday life. But sometimes it matters to be precise. We should be ABLE to make strict distinctions when trying to help ourselves, or others, master our craft.

At Agile 2009, Joe pointed out that if we can produce more code with fewer errors in it, we can get our products to real testing, and then to market more quickly. And that means that we can get paid sooner. So I agree with Joe here, too:

I have to admit I like the pun of Check-Driven Development, even if it only works in American English.

See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

A Tester Asks About Checking

Monday, September 21st, 2009

In a previous comment, Sunjeet asks

Does not testing encompass checking? Can testing alone be efficient without doing any checking?

As I hope I made clear in Elements of Testing and Checking, the development and analysis of checks is surrounded by plenty of testing activity, and testing may include a good deal of checking. Testing, I think, can be vastly more efficient if we consider the ways in which checking can be helpful. Cem Kaner, in his 2004 paper The Ongoing Revolution in Software Testing, said this:

I think the recent advances in unit testing (Astels, 2003; Rainsberger, 2004) have been the most exciting progress that I’ve seen in testing in the last 10 years.

With programmer-created and programmer-maintained change detectors (that is, checks -MB):
• There is a near-zero feedback delay between the discovery of a problem and awareness of the programmer who created the problem
• There is a near-zero communication cost between the discoverer of the problem and the programmer who will either fix the bug or fix the [check]
• The [check] is tightly tied to the underlying coding problem, making troubleshooting much cheaper and easier than system-level testing.

And I agree.

Should testers shun checking? Why not call checking as “confirmative testing“?

There is a role for testers to program and perform checks where cost is low and value is high, but I think that if practices associated with XP really begin to take hold, in the long run it behooves testers to get out of the checking business. That’s because the vast bulk of the checking work will be done by programmers; because checks at the system level tend to be time-consuming, error-prone, and expensive when performed by humans (and expensive to automate when not performed by humans); because they drive humans to inattentional blindness.

But there’s another reason: “confirmative testing” isn’t really testing; it’s confirming. It’s looking at a white swan and saying, “All swans are white;” at another and saying, “See? All swans are white;” and at yet another and saying “Just like I told you; all swans are white.” There’s an analogy in software, “It works on my machine.” “See? It works on my machine.” “Just like I told you; it works on my machine.” To find problems in a product, which is one of the key goals of testing, we need to get out of the confirmatory mindset.

I confirm that the exact problem is fixed – by exactly executing the steps mentioned in the bug report – by this I confirm that the bug and only that bug is fixed or not. Brainless? Yes… a machine could have done the same… yes BUT is this required …YES we might not be deriving new quality value from it but we are CONFIRMING existing quality info from it…

Let me suggest an alternative way of looking at this.

In an environment where the bug report is vague, or your programmers are known or suspected to be unreliable with their bug fixes, or you know that the programmer has not created an automated check, what you’re describing might indeed be a very good idea. (When testing the fix, I might start by trying to reproduce the problem as exactly as I could, but I might also start with a slight variation on the problem to see if the general case has been fixed. Either way, I’ll likely end up performing a check to see if the special case of the problem has been fixed.)
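If it helps, here’s a small sketch (Python, pytest-style, with an invented handle_input standing in for whatever was actually fixed) of how the exact reported case and a few variations around it can sit side by side; the exact case confirms that the reported bug stays fixed, and the variations probe whether the general problem was fixed too:

    import pytest

    LIMIT = 32768   # invented: the limit named in the imaginary bug report

    def handle_input(text):
        # Toy stand-in for the repaired code: accept up to LIMIT characters.
        return text[:LIMIT]

    def test_the_exact_case_from_the_bug_report():
        # Reproduces the reported steps as closely as a check can.
        assert len(handle_input("x" * 32768)) == 32768

    @pytest.mark.parametrize("length", [1, 32767, 32768, 32769, 1_000_000])
    def test_lengths_around_and_beyond_the_reported_case(length):
        # Slight variations: has the general problem been fixed, or only
        # the single instance that was reported?
        assert len(handle_input("x" * length)) == min(length, LIMIT)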

Task 2 i do is …i look out for side effects …regression…new test ideas …execute more tests etc i.e. all the pillars of ET… How is task 2 useful without 1?

If you’re following exactly the steps mentioned in the bug report and the programmer has fixed the problem and the programmer has already set up an automated check for the problem, then what you’re doing is reproducing the programmer’s effort (and the machine’s effort) in performing a check, when it might be much more valuable to use your sapient skills and test. Note that I cannot decide either way. I’m not in your context or your immediate situation. But you can. To me, there’s at least one clear circumstance in which it would be greatly more valuable for you to focus on Task 2: When someone has already done Task 1.

2. Should i tell my lead/manager…buddy I am just a tester find a checker to do this! or get a machine to do this? if a machine needs to do this who is going to code/script a machine to do this …wont that be a tester himself?

In answer to the first question, let me ask this: Are you a tester or a checker? Again, I don’t know. The quality of the questions you’re asking suggests to me that you’re a tester; that is, you’re not accepting what I’m saying blindly, nor are you rejecting it out of hand. You’re thinking critically about the idea, even if I may have blown your mind at first.

Either way, I wouldn’t advise telling your lead or your manager that you’re refusing to do work. But thinking in terms of testing vs. checking, if you so choose, might trigger a productive conversation between you about the relative cost and value of the activities that you (and the programmers) are doing. It might indeed make more sense to get a machine to do the checking work—and for the programmers to insert some change detectors and take greater responsibility for the quality of their code, an idea that was one of the triggers for Extreme Programming and the Agile movement.

As for who does the programming, note the passage from the very posting upon which you commented:

“When someone asks, ‘Can’t we hire pretty much any programmer to write our test automation code?’, we can point out that the quality of checking is conditioned largely by the quality of the testing work that surrounds it, and emphasize that creating excellent checks requires excellent testing skill, in addition to programming skill.”

Thank you for your comments and questions.

See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

Tests vs. Checks: The Motive for Distinguishing

Friday, September 18th, 2009

The word “criticism” has several meanings and connotations. To criticize, these days, often means to speak reproachfully of someone or something, but criticism isn’t always disparaging. Way, way back when, I studied English literature, and read the work of many critics. Literary critics and film critics aren’t people who merely criticize, as we use the word in common parlance. Instead, the role of the critic is to contextualize—to observe and evaluate things, to shine light on some work so as to help people understand it better.

So when I say that Dale Emery is a critic, that’s a compliment. On the subject of testing vs. checking, Dale recently remarked to me, “I think I understand the distinction. I don’t yet understand what problem you’re trying to solve with your specific choice of terminology. Not the implications, but the problem.” That’s an excellent critical statement, in that Dale is not disparaging, but he’s trying to tell me something that I need to recognize and deal with.

My answer is that sometimes having different vocabulary allows us to recognize a problem and its solution more easily. As Jerry Weinberg says, “A rose by any other name should smell as sweet, yet nobody can seriously doubt that we are often fooled by the names of things.” (An Introduction to General Systems Thinking, p. 74). He also says “If we have limited memories, decomposing a system into noninteracting parts may enable us to predict behavior better than we could without the decomposition. This is the method of science, which would not be necessary were it not for our limited brains.” (ibid, p. 134).

The problem I’m trying to address, then, is that the word test lumps a large number of concepts into a single word, and testing lumps a similarly large number of activities together. As James Bach suggests, compiling is part of the activity of programming, yet we don’t mistake compiling for programming, nor do we mistake the compiler for the programmer.

If we have a conceptual item called a check, or an activity called checking, I contend that we suddenly have a new observational state available to us, and new observations to be made. That can help us to resolve differences in perception or opinion. It can help us to understand the process of testing at a finer level of detail, so that we can make better decisions about strategy and tactics.

In the Agile 2009 session, “Brittle and Slow: Replacing End-To-End Testing“, Arlo Belshee and James Shore took this as a point of departure:

End-to-end tests appear everywhere: test-driven development, story-test-driven development, acceptance testing, functional testing, and system testing. They’re also slow, brittle, and expensive.

This was confusing to me. My colleague Fiona Charles specializes in end-to-end system testing for great big projects. The teams that she leads are fast, compared to others that I’ve seen. Their tests are painstaking and detailed, but they’re flexible and adaptable, not brittle.

During the session, one person (presumably a programmer, but maybe not) said, “Manual testing sucks.” There was a loud murmur of agreement from both the testers and the programmers in the room.

I thought that was strange too. I love manual testing. I like operating the product interactively and making observations and evaluations. I like pretending that I’m a user of the program, with some task to accomplish or some problem to solve. I like looking at the program from a more analytical perspective, too—thinking about how all the components of the product interact with one another, and where the communication between them might be vulnerable if distorted or disturbed or interrupted. I like playing with the data, trying to figure out the pathological cases where the program might hiccup or die on certain inputs. In my interaction with the program, I discover lots of things that appear to be problems. Upon making such a discovery, I’m compelled to investigate it. As I investigate it, sometimes I find that it’s a real problem, and sometimes I find that it isn’t. In this process, I learn about the system, about the ways in which it can work and the ways in which it might fail. I learn about my preconceptions, which are sometimes right and sometimes wrong. As I test, I recognize new risks, whereupon I realize new test ideas. I act on those test ideas, often right away. (By the way, I’m trying to get out of the habit of calling this stuff manual testing; I’m learning to call it sapient testing, because it’s primarily the eyes and the brain, not the hands, that are doing the work.) Whatever you call it, manual testing doesn’t suck; it rocks.

So are the programmer in question and all the people who applauded ignorant? That seems unlikely. They’re smart people, and they know tons about software development. Are they wrong? Well, that’s a value judgment, but it would seem to me that as smart people who solve problems for a living, it would be very surprising if they weren’t engaged by exploration and discovery and investigation and learning. So there must be another explanation.

Maybe when they’re talking about manual testing, they’re talking about something else. Maybe they’re talking about behaving like an automaton and precisely following a precisely described set of steps, the last of which is to compare some output of the program to a predicted, expected value. For a thinking human, that process is slow, and it’s tedious, and it doesn’t really engage the brain. And in the end, almost all the time, all we get is exactly what we expected to get in the first place.

So if that’s what they’re talking about, I agree with them. Therefore: if we’re going to understand each other more clearly, it would help to make the distinction between some kinds of manual testing and other kinds. The thing that we don’t like, that apparently none of us like, is manual checking.

Maybe Arlo and James were talking about end-to-end system checks being brittle and slow. Maybe it’s integration checks, rather than integration tests, that are a scam, as Joe (J.B.) Rainsberger puts it here, here, here, and here.

So having a handle for a particular concept may make it easier for us to make certain observations and to carry on certain conversations.

  • If we can differentiate between manual testing and manual checking, we might be more specific about what, specifically, sucks.
  • If we can comprehend the difference between automated tests and automated checks, we can understand the circumstances in which one might be more valuable than the other.
  • If we tease out the elements of developing, performing, and evaluating a check (as I attempted to do here) we might better see specific opportunities for increasing value or reducing cost.
  • If we can recognize when we’re checking, rather than testing, we can better recognize the opportunity to hand the work over to a machine.
  • If we can recognize that we’re spending inordinate amounts of time and money preparing scripts directing outsourced testers in other countries to check, rather than test, we can recognize a waste of energy, time, money, and human potential, because testers are capable of so much more than merely checking. (We might also detect the odour of immorality in asking people in developing countries to behave like machines, and instead consider giving a modicum of freedom and responsibility to them so that they can learn things about the product—things in which we might be very interested.)
  • If we can recognize that checking alone doesn’t yield new information, we can better recognize the need to de-emphasize checking and emphasize testing when that’s appropriate.
  • If we can recognize when testing is pointing us to areas of the product that appear to be vulnerable to breakage, we might choose to emphasize inserting more and/or better checks, so as to draw our attention to breakage should it occur (“change detectors”, as Cem Kaner calls them).
  • If we can distinguish between testing and checking, we can penetrate “the illusion that software systems are simple enough to define all the checks before any code is written”, as my colleague Ben Simo recently pointed out—never mind all the tests.
  • When someone asks, “Why didn’t testing find that bug when we spent all that money on all those automation tools?”, maybe we can point to the fact that the tools foster checking far more than they foster testing.
  • Maybe we can recognize that checking tends to be helpful in preventing bugs that we can anticipate, but not so helpful at finding problems that we didn’t anticipate. For that we need testing. Or, alas, sometimes, accidental discovery.
  • Maybe we’d be able to recognize that testing (but not checking) can reveal information on novel ways of using the product, information that can add to the perceived value of the product.
  • When someone asks, “Can’t we hire pretty much any programmer to write our test automation code?”, we can point out that the quality of checking is conditioned largely by the quality of the testing work that surrounds it, and emphasize that creating excellent checks requires excellent testing skill, in addition to programming skill.
  • If we’re interested in improving the efficiency and capacity of the test group, we can point out that test automation is far more than just check automation. Test automation is, in James Bach’s way of putting it, any use of tools to support testing. Testing tools help us to generate test data; to probe the internals of an application or an operating system; to produce oracles that use a different algorithm to produce a comparable result; to produce macros that automate a long sequence of actions in the application so that the tester can be quickly delivered to a place to start exploring and testing; to rapidly configure or reconfigure the application; to parse, sort, and search log files; to produce blink oracles for blink testing… (A small sketch of one such support tool appears just after this list.)
  • When a programmer says to a tester, “You should only test this stuff; here are the boundary conditions,” the tester can respond “I will check that stuff, but I’m also going to test for boundary conditions that you might not have been aware of, or that you’ve forgotten to tell me about, and for other possible problems too.”
  • When we see a test group that is entirely focused on confirming that a product conforms to some requirements document, rather than investigating to discover things that might threaten the value of the product to its users, we can point out that they may be checking, but they’re not testing.
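As promised a couple of bullets back, here is a small sketch of one kind of support tool: a Python fragment that scans a log file for suspicious lines and summarizes them for a human to interpret. The patterns and the file name are invented. It renders no pass/fail verdict, so it isn’t a check; it’s a tool that gives a tester something to look at and think about.

    import re
    from collections import Counter

    # Invented patterns a tester might care about; adjust to the product at hand.
    SUSPICIOUS = re.compile(r"ERROR|WARN|timeout|truncat|overflow", re.IGNORECASE)

    def summarize_log(path):
        # Tally suspicious lines by pattern and keep one example of each.
        # No verdict is rendered: the output is food for human judgment.
        hits = Counter()
        examples = {}
        with open(path, encoding="utf-8", errors="replace") as log:
            for line in log:
                match = SUSPICIOUS.search(line)
                if match:
                    key = match.group().lower()
                    hits[key] += 1
                    examples.setdefault(key, line.strip())
        for key, count in hits.most_common():
            print(f"{count:6d}  {key:10s}  e.g. {examples[key][:80]}")

    if __name__ == "__main__":
        summarize_log("application.log")   # invented file name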

Here’s a passage from Jerry Weinberg, one that I find inspiring and absolutely true:

“One of the lessons to be learned … is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.” Jerry Weinberg, Computer Programming Fundamentals, 1961.

To me, that’s a magnificent paragraph. But just in case, let’s paraphrase it to make it (to my mind, at least) even clearer:

“One of the lessons to be learned … is that the sheer number of checks performed is of little significance in itself. Too often, the series of checks simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous checks. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.”

Thank you to Dale for your critical questions, and to the others who have asked questions about the motivation for making the distinction and hanging a new label on it. I hope this helps. If it doesn’t, please let me know, and we’ll try to work it out. In any case, there will be more to come.

See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

Upcoming Events: KWSQA and STAR West

Wednesday, September 16th, 2009

I’m delighted to have been asked to present a lunchtime talk at the Kitchener-Waterloo Software Quality Association, Wednesday September 30. I’ll be giving a reprise of my STAR East keynote talk, What Haven’t You Noticed Lately? Building Awareness in Testers. (The title has been pinched from Mark Federman, who got it from Terence McKenna, who may have got it from Marshall McLuhan, but maybe not.)

The following week, it’s STAR West in Anaheim, California. I’ll be giving a half-day workshop, Tester’s Clinic: Dealing with Tough Questions and Testing Myths and a track session, The Skill of Factoring: Identifying What to Test.

I’ll also be giving a bonus session, Using the Secrets of Improv to Improve Your Testing. I’ve done this one at Agile 2008 in Toronto, and at the AYE Conference in 2006, and it’s fun, but because so much of the learning comes from the participants, in the moment, it’s also been remarkably insightful both times. Improv is about being aware of your actions, the actions of others, and how they relate to each other—immediately. Even dipping one’s toe in it is very exciting. Adam White talks compellingly about his experience of a couple of rounds of classes with Second City, and he did a well-regarded improv session at CAST 2008.

There’s an official panel discussion hosted by Ross Collard on Wednesday at 6:30, and there’s an official Meet-The-Presenter session Thursday morning. The rest of the time, James Bach and I will be holding unofficial versions of both of those things. We’ll be bringing testing toys and testing games, and workshopping old and new exercises with whoever wants to come. He’ll likely be talking about his new book, Secrets of a Buccaneer Scholar, a terrific memoir and guide to self-education.

I’d like to meet you at the conference, but I’m not sure who you are. If you’d like to do some hands-on testing puzzles, have a chat about testing vs. checking, or discuss anything you like, drop me a line—michael at developsense.com.

Testing, Checking, and Changing the Language

Wednesday, September 16th, 2009

In the course of trying to describe distinctions between testing and checking, a number of questions have come up:

  • Do you want to change the language?
  • Won’t saying “check” be confusing?
  • Won’t this undermine our goal of industry-standard terminology?
  • Won’t calling certain kinds of tests “checks” fly in the face of years of documentation and books?
  • Isn’t this yet another case of you wanting testing to be done the same way everywhere?

In addition to the gratifying remarks, there has been a spate of similar questions or comments: in replies to the blog post, on Twitter, and in various places around the blogosphere.

Then, this evening, I got this:

I think it’s fine to emphasize the value of exploration vs asserting, However I find your attempt at creating a new dictionary to solve the worlds problems incredibly naive and manipulative.

Words mean what people on average choose them to mean; they don’t even obtain their meaning from mainstream dictionaries let alone ones concocted on a blog.

What was the bloody problem with just using the phrase ‘exploratory testing’? I think what you are trying to do is create a ‘newspeak’ in which naughty thoughts are harder to express.

Apparently I blew Anonymous’ mind.

I’ll get to Anonymous way below, but for everyone else, here are some answers to the questions above [her|his] post. To all, I hope we can continue the conversation and work things out in a way that helps to advance our several crafts.

Do you want to change the language?

That might be cool, but it’s not my goal. I made a proposal. Here’s what the Compact Oxford Dictionary says about “propose”:

• verb 1 put forward (an idea or plan) for consideration by others.

Right: consideration. The idea is on the table, and through conversation with various people, maybe we can sharpen the idea and make it more useful. I’d love it if people started expressing themselves more precisely sometimes—calling “unit tests” “unit checks” when they get that way, for example—but as I said in the first post in the series, that’s not my goal. I said then, “I can guarantee that people won’t adopt this distinction across the board, nor will they do it overnight. But I encourage you to consider the distinction, and to make it explicit when you can.” Got that? I’m encouraging, not commanding. I’d be completely happy if people—testers and programmers—were simply conscious of the distinction, and were able to make more informed choices based on that.

(And more on that in tomorrow’s post.)

For all those who keep asking, I said it again in the comments to the first post: “there’s no authority who can enforce the distinction. The question is whether the distinction works for you, if it’s helpful, if it triggers a different way of thinking. In addition, there’s no particular harm in using “test” and “check” interchangeably in common parlance.” I said it again later in the comments: “@Gerard: I’m not really interested in “purifying” the usage, although it was fun to hear so many people picking up on the idea at Agile 2009… Nor is there a right or wrong way to use the words. If people adopted testing vs. checking universally, that would be cool, I guess, but it’s pretty unrealistic to believe that they would. I’m glad you agree the distinction is useful.”

And I said it again in the comments to the first post: “@Declan: First, as both the original text and the comments outline, I’m not really interested in changing people’s language. Got something that is being run entirely by automation and that receives no human evaluation? Want to call it a test? Go ahead; I wouldn’t want to stop you.”

And I said it again in the comments to the second post: One more thing, though. I don’t want to enforce orthodoxy on speech; that way lies the ISTQB and the SWEBOK and all that. In casual conversation, it makes little difference what word you use. It’s when you want to think critically about what you’re up to that the distinction (and not the label) matters. “I want to check to make sure that these are good tests,” is a fine thing to say, even if what you’re doing is really very testerly, and far more than checkish.

The third post introduced both “test” and “check” with “for the purposes of this discussion”. Plus there’s all the stuff that’s new to this post, so I will now say this: please use whatever words you like. If either one of us is unclear to the other, we’ll talk it over—probably very quickly—and we’ll work it out.

Won’t saying “checks” be confusing?

I don’t think so. I think not having a word for checks is confusing.

Won’t this undermine our goal of industry-standard terminology?

Nah. First of all, that may be your goal, but it’s not mine; leave me out of it. The notion of industry-standard terminology is, from my perspective, silly and misguided; I don’t advocate it, and I often speak out against it. To begin with, is there even a testing industry? There isn’t, any more than there’s a “writing” industry. Testing, like writing, is done in all kinds of different business contexts, organizational models, development paradigms, spoken and written languages, social structures, and for all kinds of purposes within those situations. There’s too much diversity in the world for there to be any “standard” terminology. And that’s a good thing. Diversity is complex and messy, but it also addresses people’s different values, supports different contexts, and engenders innovation.

One more thing—and here I address the English-speaking people: if there is to be industry-standard terminology, what language should it be in? Why not Mandarin? Why not Spanish? Why not Hindi? If you say that the English words should be interpreted into other languages, how can you be sure that the nuances of your terms show up in the interpretation?

Won’t calling certain kinds of tests “checks” fly in the face of years of documentation and books?

If people were actually to adopt the term “checking”, maybe it would. I’m not holding my breath. But you know, I’m optimistic about the bright people in this business. I think people who are genuine students of their crafts would be able to cope by making an instant mental translation, and those who aren’t are unlikely to read books anyway. The world didn’t come to an end when people started calling functions “messages” or “methods”. Nobody panics when a development project deals with several meanings of “integer” simultaneously. Boris Beizer introduced “The Pesticide Paradox”; how did people react? The way they always do as they make sense of the world: “Oh, that’s what you mean.” “Oh, that’s what you call that around here.” “Really? At KnitWare, we called it this.” “That’s cool; I never thought of it that way.”

Within a project, we sort this kind of stuff out all the time, quickly and efficiently. There’s even a name for what we’re developing: ubiquitous language, by which Eric Evans means a shared language specific to a particular team in the context of a particular project. (Why hasn’t that term been adopted as an industry standard? Answer: even though everyone does it, the standards-bearers won’t brook it, because they refuse to deal with the fact that non-standard behaviour is standard.) There will always be new people arriving on the scene with new words. The same old people will coin neologisms. Other people will get confused by certain unfamiliar terms. New technologies will show up with new labels. We’ll occasionally run into archaic words and technologies. (Only a few years ago, I had to write a little routine to translate between ASCII and EBCDIC. Don’t know what I mean? Look it up.) It’s important to notice these patterns, but sorting them out is not a big deal.

Isn’t this yet another case of you wanting testing to be done your way, all the time, everywhere?

Of course not. I don’t believe that testing should be done the same way everywhere. I do have biases in favour of testing being done as inexpensively, as rapidly, and as thoughtfully as possible; in favour of testing that is oriented towards value for people, rather than technological fetish; in favour of documentation being pared down to the minimum required to completely satisfy the mission and the client; in favour of the elimination of waste at every turn; in favour of a wide diversity of oracles and coverage models; in favour of the empowerment and skill and responsibility of the individual tester; towards principles like those expressed in the Agile Manifesto (although I don’t claim Agility™).

Not everyone agrees with those principles and those biases. That’s okay; it takes all kinds of communities to make a world, and (as Cem Kaner says) there’s lots of room for reasonable people to disagree reasonably, since there are so many different forms of experience and context to inform our points of view.

Some people say that we context-driven testing advocates believe that all testing should be context-driven. That’s ridiculous, and we’ve said so (see the Commentary). Reasonable people should be able to recognize that the claim of context-driven imperialism is oxymoronic; to be a context-driven thinker absolutely requires us to identify and acknowledge circumstances where context-driven approaches aren’t a good idea (see Four Attitudes Towards Context, linked here).

As I’ve said numerous times, in various forums (you could look them up), the testing that we perform for a financial institution would be ludicrous overkill for an online dating service; the testing that we do for medical devices would be far too slow and detailed for computer games. And as I’ve said numerous times, the quality of testing is like the quality of anything else: value to some person who matters. Do I disagree with you, or you with me? If you want, you can dispatch the disagreement right away by saying that I don’t matter. On the other hand, if I do matter to you, let’s talk about it and work it out.

If after that, we're still not in agreement, that's okay too. At one point in the SHAPE Forum (may it rest in peace), a maintenance programmer bemoaned the irrationality he saw in others' source code, and asked how he could deal with it. Jerry Weinberg responded that "your first step is to stop thinking of it as 'irrational', and to start thinking about it as 'rational from the perspective of a different set of values'". Or as philosopher/risk manager/stepbrother Ian Heppell once elegantly put it, "Most arguments seem to be about conclusions, when they're really about premises."

Now:

Dear Anonymous,

I’m sorry that I blew your mind.

I think it’s fine to emphasize the value of exploration vs asserting, However I find your attempt at creating a new dictionary to solve the worlds problems incredibly naive and manipulative.

I wonder if you’ve heard about the Association for Software Testing’s dictionary project. You can read about that here and here. My vision for the dictionary is that it be based on the Oxford English Dictionary model, as described in Simon Winchester’s fabulous book, The Meaning of Everything. From the outset, the OED was designed to be descriptive, not prescriptive. The idea was to produce the story of each word, and to track where and when each one had appeared, and to follow its different paths through history, language, and culture.

We don’t intend to solve the world’s problems, nor testing’s. A dictionary won’t do that, but it’s possible that recognition of alternative points of view might be an interesting first step. As Cem Kaner puts it in the linked post, “We are not imposing definitions on the community, we are honoring the fact that different approaches to testing yield different language usage. We are not advocating a best definition, we are advocating a good practice, which is to adopt your client’s vocabulary (at least, to adopt it when talking with that client).”

Oh, and it took 71 years to complete the first version of the OED. The third is scheduled to be completed in 2037. With the idea on the table at the AST for the last three years, and with no editor yet chosen, we’re following the OED development model very well indeed. But I digress.

I recognize that you may have an emotional investment in certain words. If that’s true, that’s okay. I do too, sometimes. But I realize that if I want to keep using a certain word, or a certain pronunciation, or a certain turn of phrase, no one can stop me. And no one can stop you either, Anon. You are an adult, you can choose your own path, and you can only be manipulated if you’re willing to be manipulated. Peace be upon you. Namaste. Shalom. Really.

As I said in the comments to the first post, “there is no the definition of testing. It is always someone’s definition of testing—just as there is no property of quality that exists in a product without reference to some person and his or her notion of value.”

Then you said,

Words mean what people on average choose them to mean; they don’t even obtain their meaning from mainstream dictionaries let alone ones concocted on a blog.

Two questions on that.

First, if you're right about that (and I think you are), how is it that language evolves? I think there are several reasons. Sometimes a new technology appears, and we need a new name for it and the stuff around it. As I tweeted this very evening (note the neologism), "Wife now: 'I'm going to CASECamp. I sent stuff out on Crowdvine, but I should have Twittered it.' Sentences unintelligible one year ago." Sometimes it's because people notice things that they want names for, and they name them. Sometimes it's because people want to break something complex into simpler components. Sometimes it's because people want to lump a bunch of elements into a single concept or model. I'm trying to de-lump a previously lumped part of testing.

Second, if you’re right (and I think you are) that words don’t obtain their meanings from the ones concocted on a blog, then we’re all safe. So what’s the problem?

But there is one thing I'd gently dispute. Words don't acquire meanings based on averages; they acquire meanings as soon as two or more people are willing to share that meaning. One ex of mine had a word, "dido"; it meant "idiosyncrasy or habit of the type exhibited by a cat, or a person acting like one". She and I also developed "insurance pee", meaning "an activity in which your kids (or your spouse, or yourself) should indulge just before leaving on a long highway drive." Another ex used "fleffing", meaning "to dither".

So far, a handful of people like the word “check” as I’ve described it. Some don’t. Others are willing to consider it. Most people, so it seems, just don’t care. What’s the big deal?

What was the bloody problem with just using the phrase ‘exploratory testing’? I think what you are trying to do is create a ‘newspeak’ in which naughty thoughts are harder to express.

Perhaps you don't remember the somewhat epic controversy over the term "exploratory testing". "You mean 'ad hoc' testing, right?" "You mean testing without knowing what you're doing, right?" "You mean testing without documentation, right?" This went on for years. Well, thanks to Cem's coining of the term (we believe that he was the first, in the first edition of Testing Computer Software), James Bach's vigorous articulation of the idea, and a bunch of other material written by Jonathan Bach, Elisabeth Hendrickson, Mike Kelly, Jonathan Kohl, James Lyndsay, Brian Marick, Bret Pettichord, Harry Robinson, Rob Sabourin, James Whittaker, and many others, including me, we now have a really substantial body of thought on the subject, so that people can discuss it, learn it, teach it, compare it in different contexts, and figure out how to do it better. "Exploratory testing" even has its own erratic Wikipedia entry—the hallmark of every powerful idea.

Similar controversies have arisen over "agile" and "extreme programming". I've observed (only half-jokingly) that no one who had ever played rugby would name a development model "Scrum". But people appropriate words all the time. Nothing new there either. In 1661, Boyle was talking about the "spring of the air", and Hobbes was launching vitriolic attacks not only on Boyle's conclusions, but on the whole premise of "experimental philosophy", another neologism of the age. Controversy is nothing new. The world is a story that we're all editing as we go.

So thanks for your participation; it drives me to clarify the idea, to make it more useful. Feel free to post a more complete expression of your concerns on your blog, point me to it, and we’ll work it out. Or, as above, you can simply declare that I don’t matter.

Newspeak, in Orwell’s 1984, was not a language designed to suppress naughty thoughts; its ultimate purpose was to make it impossible to express any thoughts at all. Suppression of naughtiness was simply a beneficial (to Big Brother) side effect. My purpose is the opposite of that. My purpose is to foster more thinking, deeper thinking about our fascinating and complex craft. If some consider such thoughts naughty, great. Testing could use some titillation.

See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

Elements of Testing and Checking

Tuesday, September 15th, 2009

In the last couple of weeks, I’ve been very gratified by the response to the testing-vs.-checking distinction. Thanks to all who have grabbed on to the idea and to those who have questioned it.

There's a wonderful passage in Chapter 4 of Jerry Weinberg's Perfect Software and Other Illusions About Testing in which he breaks down the activities of a programmer engaged in testing activities—testing for discovery, discovering an unexpected problem, pinpointing the problem in the behaviour of the product, locating the problem in the source code, determining the significance of the problem, repairing the problem, troubleshooting, and testing to learn (or hacking, or reverse engineering). He points out that confusion about the differences among these aspects of testing can lead to conflict, resentment, and failed projects.

I brought up the test-vs.-check idea because, like Jerry, I think that the word “test” lumps a large number of concepts into a single word, and (as any programmer will tell you) not knowing or noticing what’s going on inside an encapsulation can lead to trouble. I wanted to raise the issue that (as Dale Emery has helped me to articulate) excellent testing requires us to generate new knowledge, in addition to whatever confirmations we generate. Moreover, tests that generate new knowledge and tests (or checks) that confirm existing knowledge have different motivations, and therefore have different standards of excellence.

A test is a question (or set of questions) that we want to ask of the program. It might consist of a single idea, or many ideas. Designing a test requires us to model the test space (or consider the scope of the question we want to ask), and to determine the oracles we’ll use, the coverage we hope to obtain, and the test procedures that we intend to follow. These are the elements of test design. Performing the test requires us to configure, operate, observe, and evaluate the system, and then to report on what we’ve done, what we’ve observed, and our evaluation. These are the elements of test execution.

A check is a component of a confirmatory approach to testing. As James Bach and I reckoned, a check itself has three elements:

1) It involves an observation.
2) The observation is linked to a decision rule.
3) Both the observation and the decision rule can be performed without sapience (that is, without a human brain).
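
To make those three elements concrete, here is a minimal sketch of a check, written in Python. The product function, the values, and the names are hypothetical, invented purely for illustration; they're not drawn from any real project.

# A hypothetical product function whose behaviour we want to check.
def apply_discount(price, percent):
    return round(price * (1 - percent / 100.0), 2)

# The check itself: an observation, linked to a decision rule, yielding a
# binary outcome that a machine (or a human following a script) can decide.
def check_discount():
    observed = apply_discount(100.00, 15)   # the observation
    expected = 85.00                        # the decision rule: observed equals expected
    return observed == expected             # pass or fail; no sapience needed to decide

if __name__ == "__main__":
    print("PASS" if check_discount() else "FAIL")

Deciding that the discount calculation was worth checking, choosing $100.00 and 15 as interesting values, and working out what a FAIL would mean are the parts the sketch can't show; those parts are testing.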

Although you can execute a check without sapience, you can’t design, implement, or interpret a check without sapience. What needs to be done to make a check happen and to respond to it?

  • We start the process when we recognize the need for the check. That’s the bit in which we consider some problem to solve or identify some risk, and come up with an observation that we’d like to make. That requires sapience; it’s an act of testing.
  • Once we’ve seen a need for the check, we must translate the check into a question for the agency that’s going to perform it, whether that agency is a human or a machine. That requires us to develop the decision rule, turning the test idea into a question with a binary outcome. That requires sapience too, and thus is a testing activity.
  • When we have a question that expresses our test idea, the next step is to program the check, translating the binary question into program code (or into a script for non-sapient human execution), and put it into some transferable form, such as a source code file or a Word document. Part of this requires sapience (the design and expression of the idea), and part of it doesn't (the typing). Maybe a machine could do the typing part (say, via voice recognition), but programming isn't just typing; it's typing and thinking.
  • When we have a check programmed, the next step is to initiate it, to start it up or kick it off. This too has a sapient and a non-sapient aspect. A machine could start a check automatically, either on a schedule or in response to an event, but someone has to tell the machine about the schedule and the event. So the decision to run a check and when to run it is sapient, but the actual kickoff isn't; it can be done mechanically.
  • Once the check has been initiated, the agency (machine or human) will execute or run the check, going through a prescribed set of steps from start to end. By definition, that’s definitely machine-doable and non-sapient. Pre-scribed literally means written down beforehand. For a check, the script specifies exactly what the agency must do, exactly what the agency must observe, exactly how the agency must decide the result, and exactly how the agency must report, and the agency does no more and no less than that.
  • Upon completing the prescribed steps, the agency must decide the result of the check. Pass or fail? True or false? Yes or no? By definition, this decision must be non-sapient and machine-decidable, whether a human or a machine actually makes it.
  • The agency will typically record the result, based on a program for doing it. Checks performed by a machine might record results in an alert or result pane in an IDE, or in a log file. Checks performed by a human might show up in a pass or fail checkbox in a test management tool, or in a column of a spreadsheet.
  • The agency may report the result, alerting some human that something has happened. The report might be passive—the agency may be programmed to leave a log file in a folder at the end of a check run, say; or it might be more active, taking the form of a green or red bar, or a lava lamp. Depending upon the degree to which his actions have been scripted, a tester may or may not actively or immediately report the result.
  • Someone may interpret the result of the check, assigning meaning to it. Okay, so the output says "pass", or "fail". What does it mean? What is our oracle—that is, what is the heuristic principle or mechanism by which we might recognize a problem? Is the result what we expected? Problem or no problem? If there's a problem, is the problem in the check or in the item that we're testing? Ascribing meaning requires sapience. Note that this step is optional. It's possible for someone to consider a check "complete" without a human observation of the result. This should trigger a Black Swan alert: failing checks tend to get noticed, and passing checks don't.
  • Someone may evaluate the check, ascribing significance to the outcome and to the meaning that we've reckoned. After the check has passed or failed, and we've figured out what it means, we have to decide "big deal" or "not a big deal"; whether we need to do something about it; whether the check and its outcome supply us with sufficient information. This step is optional too. Whether it happens or not, evaluation is definitely a human thing. Machines don't make value judgments. Another Black Swan alert: if we don't go through the previous step, interpreting the result of the check, we won't get to this step either. There's a risk here: the narcotic comfort of the green bar.
  • Whether we've ascribed meaning and significance or not, there is a response. One response is to ignore the result of the check altogether. Another is to pay just enough attention to say that the check has passed, and otherwise ignore interpretation and evaluation. Ignoring the check—oblivion—doesn't require sapience. However, a person could also choose to ignore the result of the check consciously, which is a sapient act. Alternatively, if the check has passed, are we okay with that? Shall we proceed to something else, or should we program and execute another check? If the check has failed, a typical response is to decide to perform some action. We could perform some further analysis by developing new checks or other forms of testing. We could fix the program, fix the check, or change them to be consistent with one another. We could delete the check, or kill the program. All of these decisions and the subsequent activities require sapience, a human.
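
To pull those steps together, here is a small sketch, again in Python and again with invented, hypothetical names, of the machine-doable middle of the process: executing a couple of checks, deciding their results, recording them in a log, and reporting them. Recognizing the need for the checks and designing their decision rules happen before this code exists; interpreting, evaluating, and responding happen after it runs, and all of those remain human work.

import datetime

# A hypothetical stand-in for the product under test.
def login(user, password):
    return bool(user) and bool(password)

# Two checks: each makes an observation and applies a binary decision rule.
def check_login_rejects_empty_password():
    return login("someuser", "") is False

def check_login_accepts_known_user():
    return login("someuser", "s3cret") is True

CHECKS = [check_login_rejects_empty_password, check_login_accepts_known_user]

def run_checks(log_path="check_results.log"):
    with open(log_path, "a") as log:                     # recording the results
        for check in CHECKS:
            result = "PASS" if check() else "FAIL"       # execution and decision
            stamp = datetime.datetime.now().isoformat()
            log.write(f"{stamp} {check.__name__} {result}\n")
            print(f"{check.__name__}: {result}")         # reporting, still non-sapient

if __name__ == "__main__":
    run_checks()   # initiation; a scheduler or a build server could kick this off instead

Nothing in the log file or the console output decides whether a FAIL matters, or whether a run of PASSes tells us anything worth knowing; that's where interpretation and evaluation come in.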

In future posts, I’ll be talking about how we can put this miniature task analysis to work for us, which I hope in turn will help us to consider important issues in the quality of our testing.

See more on testing vs. checking.

Related: James Bach on Sapience and Blowing People’s Minds

When Do We Stop a Test?

Friday, September 11th, 2009

Several years ago, around the time I started teaching Rapid Software Testing, my co-author James Bach recorded a video to demonstrate rapid stress testing. In this case, the approach involved throwing an overwhelming amount of data at an application’s wizard, essentially getting the application to stress itself out.

The video goes on for almost six minutes. About halfway through, James asks, “You might be asking why I don’t stop now. The reason is that we’re seeing a steadily worsening pattern of failure. We could stop now, but we might see something even worse if we keep going.” And so the test does keep going. A few moments later, James provides the stopping heuristics: we stop when 1) we’ve found a sufficiently dramatic problem; or 2) there’s no apparent variation in the behaviour of the program—the program is essentially flat-lining; or 3) the value of continuing doesn’t justify the cost. Those were the stopping heuristics for that stress test.

About a year after I first saw the video, I wanted to prepare a Better Software column on more general stopping heuristics, so James and I had a transpection session. The column is here. About a year after that, the column turned into a lightning talk that I gave in a few places.

About six months after that, we had both recognized even more common stopping heuristics. We were talking them over at STAR East 2009 when Dale Emery and James Lyndsay walked by, and they also contributed to the discussion. In particular, Dale offered that in combat, the shooting might stop in several ways: a lull, “hold your fire”, “ceasefire”, “at ease”, “stand down”, and “disarm”. I thought that was interesting.

Anyhow, here's where we're at so far. I emphasize that these stopping heuristics are heuristics. Heuristics are quick, inexpensive ways of solving a problem or making a decision. Heuristics are fallible—that is, they might work, and they might not work. Heuristics tend to be leaky abstractions, in that one might have things in common with another. Heuristics are also context-dependent, and it is assumed that they will be used by someone who has the competence and skill to use them wisely. So for each one, I've listed the heuristic and included at least one argument for not using the heuristic, or for questioning it.

1. The Time’s Up! Heuristic. This, for many testers, is the most common one: we stop testing when the time allocated for testing has expired.

Have we obtained the information that we need to know about the product? Is the risk of stopping now high enough that we might want to go on testing? Was the deadline artificial or arbitrary? Is there more development work to be done, such that more testing work will be required?

2. The Piñata Heuristic. We stop whacking the program when the candy starts falling out—we stop the test when we see the first sufficiently dramatic problem.

Might there be some more candy stuck in the piñata’s leg? Is the first dramatic problem the most important problem, or the only problem worth caring about? Might we find other interesting problems if we keep going? What if our impression of “dramatic” is misconceived, and this problem isn’t really a big deal?

3. The Dead Horse Heuristic. The program is too buggy to make further testing worthwhile. We know that things are going to be modified so much that any more testing will be invalidated by the changes.

The presumption here is that we've already found a bunch of interesting or important stuff. If we stop now, will we miss something even more important or more interesting?

4. The Mission Accomplished Heuristic. We stop testing when we have answered all of the questions that we set out to answer.

Our testing might have revealed important new questions to ask. This leads us to the Rumsfeld Heuristic: "There are known unknowns, and there are unknown unknowns." Has our testing moved known unknowns sufficiently into the known space? Has our testing revealed any important new known unknowns? And a hard-to-parse but important question: Are we satisfied that we've moved the unknown unknowns sufficiently towards the knowns, or at least towards known unknowns?

5. The Mission Revoked Heuristic. Our client has told us, “Please stop testing now.” That might be because we’ve run out of budget, or because the project has been cancelled, or any number of other things. Whatever the reason is, we’re mandated to stop testing. (In fact, Time’s Up might sometimes be a special case of the more general Mission Revoked, if it’s the client rather than ourselves that have made the decision that time’s up.)

Is our client sufficiently aware of the value of continuing to test, or the risk of not continuing? If we disagree with the client, are we sufficiently aware of the business reasons to suspend testing?

6. The I Feel Stuck! Heuristic. For whatever reason, we stop because we perceive there’s something blocking us. We don’t have the information we need (many people claim that they can’t test without sufficient specifications, for example). There’s a blocking bug, such that we can’t get to the area of the product that we want to test; we don’t have the equipment or tools we need; we don’t have the expertise on the team to perform some kind of specialized test.

There might be any number of ways to get unstuck. Maybe we need help, or maybe we just need a pause (see below). Maybe more testing might allow us to learn what we need to know. Maybe the whole purpose of testing is to explore the product and discover the missing information. Perhaps there's a workaround for the blocking bug; the tools and equipment might be available, but we don't know about them, or we haven't asked the right people in the right way; there might be experts available to us, either on the testing team, among the programmers, or on the business side, and we don't realize it. There's a difference between feeling stuck and being stuck.

7. The Pause That Refreshes Heuristic. Instead of stopping testing, we suspend it for a while. We might stop testing and take a break when we’re tired, or bored, or uninspired to test. We might pause to do some research, to do some planning, to reflect on what we’ve done so far, the better to figure out what to do next. The idea here is that we need a break of some kind, and can return to the product later with fresh eyes or fresh minds.

There’s another kind of pause, too: We might stop testing some feature because another has higher priority for the moment.

Sure, we might be tired or bored, but is it more important for us to hang in there and keep going? Might we learn what we need to learn more efficiently by interacting with the program now, rather than doing work offline? Might a crucial bit of information be revealed by just one more test? Is the other “priority” really a priority? Is it ready for testing? Have we already tested it enough for now?

8. The Flatline Heuristic. No matter what we do, we’re getting the same result. This can happen when the program has crashed or has become unresponsive in some way, but we might get flatline results when the program is especially stable, too—”looks good to me!”

Has the application really crashed, or might it be recovering? Is the lack of response in itself an important test result? Does our idea of "no matter what we do" incorporate sufficient variation or load to address potential risks?

9. The Customary Conclusion Heuristic. We stop testing when we usually stop testing. There’s a protocol in place for a certain number of test ideas, or test cases, or test cycles or variation, such that there’s a certain amount of testing work that we do, and we stop when that’s done. Agile teams (say that they) often implement this approach: “When all the acceptance tests pass, then we know we’re ready to ship.” Ewald Roodenrijs gives an example of this heuristic in his blog post titled When Does Testing Stop? He says he stops “when a certain amount of test cycles has been executed including the regression test”.

This differs from “Time’s Up”, in that the time dimension might be more elastic than some other dimension. Since many projects seem to be dominated by the schedule, it took a while for James and me to realize that this one is in fact very common. We sometimes hear “one test per requirement” or “one positive test and one negative test per requirement” as a convention for establishing good-enough testing. (We don’t agree with it, of course, but we hear about it.)

Have we sufficiently questioned why we always stop here? Should we be doing more testing as a matter of course? Less? Is there information available—say, from the technical support department, from Sales, or from outside reviewers—that would suggest that changing our patterns might be a good idea? Have we considered all the other heuristics?

10. The No More Interesting Questions Heuristic. At this point, we've decided that no questions have answers sufficiently valuable to justify the cost of continuing to test, so we're done. This heuristic tends to inform the others, in the sense that if a question or a risk is sufficiently compelling, we'll continue to test rather than stopping.

How do we feel about our risk models? Are we in danger of running into a Black Swan—or a White Swan that we’re ignoring? Have we obtained sufficient coverage? Have we validated our oracles?

11. The Avoidance/Indifference Heuristic. Sometimes people don't care about more information, or don't want to know what's going on in the program. The application under test might be a first cut that we know will be replaced soon. Some people decide to stop testing because they're lazy, malicious, or unmotivated. Sometimes the business reasons for releasing are so compelling that no problem that we can imagine would stop shipment, so no new test result would matter.

If we don’t care now, why were we testing in the first place? Have we lost track of our priorities? If someone has checked out, why? Sometimes businesses get less heat for not knowing about a problem than they do for knowing about a problem and not fixing it—might that be in play here?

Update: Cem Kaner has suggested one more: Mission Rejected, in which the tester himself or herself declines to continue testing. Have a look here.

Any more ideas? Feel free to comment!

This may be my all-time favourite error message

Friday, September 4th, 2009

This may just be my all-time favourite error message:

Note that the promulgator of the message doesn’t identify itself (the caption bar is helpfully labelled “DLL”); that the program to be loaded isn’t identified; that the format isn’t identified; that what you might do to fix the problem isn’t identified…

Oh, and by the way… a little detective work shows that it comes from Adobe Acrobat.