Blog Posts from March 2013

Versus != Opposite

Sunday, March 31st, 2013

Dale Emery, a colleague for whom we have great respect, submitted a comment on my last blog post, which in turn referred to Testing and Checking Refined on James Bach’s blog. Dale says:

I don’t see the link between your goals and your solution. Your solution seems to be (a) distinguishing what you call checking from what you call testing, (b) using the terms “checking” and “testing” to express the distinction, and (c) promoting both the distinction and the terminology. So, three elements: distinction, terminology, promotion.

How do these:

  • deepen understanding of the craft? (Also: Which craft?)
  • emphasize that tools and skilled use are essential?
  • illustrate the risks of asking humans to behave like machines?

I can see how your definitions contribute to the other goal you stated: to show that checking is deeply embedded in testing. And your recent refinements contribute better than your earlier definitions did.

But then there’s “versus,” which I think bumps smack into this goal. And not only the explicit use of “versus”; also the “versus” implied by repeatedly insisting that “That’s not testing, that’s checking!”

Also, I think your choice of terminology bumps up against this “deeply embedded” goal. Notice that you often express distinctions by adding modifiers. In James’s post: Checking, human checking, machine checking, human/machine checking. The terms with modifiers are clearly related to (and likely a subset of) the unmodified term.

Your use of a distinct word (“checking”) rather than a modified term (e.g., “mechanizable testing” or “scripted testing” or similar) has the natural effect of hinting at a relationship other than “this is a kind of that.” I read your choice of terminology (and what I interpret as insistence on the terminology) as expressing a more distant relationship than “deeply embedded in.”

James and I composed this reply together:

Our goal here is to improve the crafts of software testing, software engineering, and software project management. We use several tactics in our attempt to achieve that goal.

One tactic is to install linguistic guardrails to help prevent people from casually driving off a certain semantic cliff. At the bottom of that cliff is a gaggle of confused testers, programmers, and managers who are systematically—due to their confusion and not to any evil intent—releasing software that has been negligently tested.

Their approach is less likely than they would wish to reveal important things that they would want to know about the software. You might believe that “negligently tested” is a strong way of putting it. We agree. To the extent that their unawareness of what their testing misses brings harm to themselves or others, the software has been negligently tested. For virtual world chat programs on the Web, that negligence might be no big deal (or at least, no big deal until they store your credit card information). However, we have experience working with people in the financial, retail, medical device, and educational software domains who are similarly confused on this specific issue: there’s more to testing a product than checking it.

Our tactic, we believe, deepens the understanding of the craft of testing quite literally: where there were no distinctions and people talked at cross-purposes, we install distinctions so that we can more easily detect when we are not talking about the same things. This adds an explicit dimension where there had been just a tacit and half-glimpsed one. That is exactly what it means to deepen understanding. In Chapter 4 of Perfect Software and Other Illusions about Testing, Jerry Weinberg performed a similar task, de-lumping (that’s his term) “testing”. There, he calls out components of testing and some related activities that are not, strictly speaking, testing at all: “testing for discovery”, “pinpointing”, “locating”, “determining significance”, “repairing”, “troubleshooting”, “testing to learn”, “task-switching”. We’re working along similar lines here.

Our tactic, we believe, helps to emphasize that tools and skilled use of tools are essential by creating explicit categories for processes amenable to tooling and processes not so amenable. These categories then become roosting places for our thoughts and our conversations about how tools relate to our work. At best, understanding is elusive and communication is difficult. Without words to mark them, understanding and communication are even more difficult. That is not necessarily a problem in everyday life. But as testers, we work in a turbulent world of business, technology, and ideas. Problems in products (bugs) and in projects (issues) emerge from misunderstanding. The essence of testing work is to clear up misunderstandings over differences between what people want and what they say they want; between what people produced and what they say they produced; between what they did and what they say they did. People often tell us that they’ve tested a product. It often turns out that they mean that they’ve checked the functions in a product. We want to know what else they’ve done to test it. We need words, we claim, to mark those distinctions.

We’re aware that other people have come up with labels for what we might call “checks”; for example, Mike Hill speaks of “microtests” in a similar way, and others have picked up on that, presenting arguments on similar lines to ours. That’s cool. In the post on James’ blog, we make it explicit that we use this terminology in the domain we control—the Rapid Software Testing class and our writings—and we suggest that it might be useful for others. Some people borrow bits of Rapid Software Testing for their own work; some plagiarize. We encourage the former, and ask the latter to give attribution to their sources. But in the end, as we’ve said all along, it’s the ideas that matter, and it’s up to people to use the language they want. To us, it’s not a terrible thing to say “simplistic testing” any more than it would be a terrible thing to call a compiler an automatic programmer, but we think “compiler” works better.

We visit many projects and companies, including a lot of Agile projects, and we routinely find that talk of checking has drowned out talk of testing—except that people call it testing, so nobody even notices how skewed their focus has become. Testers become increasingly selected for their enthusiasm as quasi-programmers and check-jockeys. Who studies testing then? What do testers on Agile projects normally talk about at their conferences or on the Web? Tools. Tools. And tools—most of which focus on checking, and on the design of checkable requirements. That is not in itself a bad thing. What concerns us is the absence of serious discussion of testing, of critical investigation of the product. Sometimes there is an offhand reference to exploratory testing, based on naïve or misbegotten ideas about it. Here’s a paradigmatic example, from only yesterday as we write: http://www.scrumalliance.org/articles/511-agile-methodology-is-not-all-about-exploratory-testing.

The fellow who wrote that article speaks of “validation criteria”, “building confidence” (Lord help us, at one point he says “guarantees confidence”), “defined expected results”. That is, he’s talking about checking.

Checking is deeply embedded in testing. It is also distinct from testing. That is not a contradiction. Distinction is not necessarily disjunction; “or” in common parlance is not necessarily “xor”. Our use of “versus” is exactly how we English speakers make sharp distinctions even among things that are strongly related, even when one is embedded in the other (the forest vs. the trees, playing hockey vs. skating). Consider people who believe they can eat nothing but bread and meat, as long as they gobble a daily handful of vitamin pills. We think it would be perfectly legitimate to say “That’s not nutrition, that’s vitamin supplements.” Yes, vitamins are part of nutrition. But they are not nutrition. It’s reasonable, we would argue, to talk about “nutrition versus vitamins” in that conversation.

For instance, we could say “mind vs. body.” Mind is obviously embedded in body. Deeply embedded. But don’t you agree that mind is quite a different sort of thing than body? Do you feel that some sort of violence is being done with that distinction? Perhaps some people do think so, but the distinction is undeniably a popular and helpful one, and has been so to a great many thinkers over hundreds of years. Some people focus on their minds and neglect their bodies. Others focus on their bodies and neglect their minds. At least we have these categories so that we can have a conversation about them.

When Pierre Janet first distinguished between conscious and sub-conscious thought, that also was not an easy distinction. Today it is a commonplace. Everyone, even those who never took a class in psychology, is aware of the concept of the sub-conscious, and aware that not everything we do is driven by purely conscious forces. We believe our distinction between testing and checking could have a similar impact—in time.

Meanwhile, Dale, we know you and we respect you. Please help us resolve our confusion: what’s YOUR goal? In your world, are there testers? Do the ambitious testers in your world diligently study testing? Or would you say that they study programming and how to use tools? How do you cope with that? Do you feel that your goal is served best by treating testing and whatever people do with tools to verify certain facts about a product as just one kind of activity? Would you suggest breaking it down a different way than we do? If so, how?

On Testing and Checking Refined

Friday, March 29th, 2013

Over the last few months, and especially during some face-to-face time that we had in England recently, James Bach and I have been working to sharpen our notions of testing and checking. Although the task had been on the list for some time, we didn’t get a sense of great urgency about it until we were surprised recently to find that, at a very subtle but important level, we meant different things by “checking”. Until then, what we had achieved was “shallow agreement”, something that’s very common in our world. Ideas can only be represented by words, never completely described. Words are always, to some degree, ambiguous and slippery. For example, the word “versus” in my original post on the subject, “Testing vs. Checking”, was misunderstood by some people. “Versus” can mean “in opposition to” (Manchester United vs. Chelsea, Marbury vs. Madison), but it can also mean “in contrast to, distinct from”, which affords expressions like “trees vs. leaves”, “French people vs. Parisians”, or “riding vs. balancing”. It’s interesting and to some degree unfortunate that people naturally tend to drop anchor on their initial interpretations of words. But as with software itself, it’s sometimes hard to anticipate what other people will recognize as a bug. It’s even harder to anticipate what we ourselves will come to recognize as bugs. Whatever we will realize eventually, we’re not there yet.

In the course of our conversations, we argued. A lot. In our business, argument is not to be feared. It’s the stone on which we sharpen ideas. From time to time, I adopted positions closer to the ones James used to hold, and it seemed that James adopted positions closer to the ones I used to hold, until eventually we converged. We took confusion, comments, and complaints from colleagues (and some antagonists) seriously. We obtained some invaluable insights from the work of Harry Collins, whose books (The Shape of Actions, Tacit and Explicit Knowledge, Artificial Experts, Changing Order, and others) have been profoundly influential on us, as I predicted they would be a couple of years back. Indeed, the post in which I made that prediction reflects a lot of the background that informs what I’m announcing today.

The outcome of our conversations, a statement on what we mean by testing and checking in Rapid Testing and in the rest of our work, was posted on James’ blog on March 26 or so. Since that time, the post has been lightly edited in response to some thoughtful and helpful comments from reviewers and early readers.

I would like to emphasize our goals here. Our purpose is not to denigrate checking, nor to disparage the use of tools, nor to deplore those people who are asked to do human checking. On the contrary: we’re attempting to deepen our understanding of our craft; to show that checking is deeply embedded in testing; to emphasize that tools and the skilled use of them are essential to our work in many ways; to realize that humans will always inject human elements into the things they do; to realize the value of those human elements and the risks involved in asking humans to behave like machines. We must be clear on the differences between what humans do and what our processes and tools—media, as McLuhan would call them—do. Or, more accurately, the differences between what we do and how our tools affect what we do.

We must also be clear that media (processes and tools) do not do things well or badly. We do things well or badly by and through and with our media. Media extend, enhance, accelerate, intensify, enable, and amplify what we are, in ways that precisely reflect our thoughtfulness and our skill. This is crucially important to recognize in testing, where our goal is to use our minds, our skills, our tools, and our processes to help people understand the product they’ve got so that they can make informed decisions about whether they have the product that they want.

Severity vs. Priority

Tuesday, March 5th, 2013

Another day has dawned on Planet Earth, so another tester has used LinkedIn to ask about the difference between severity and priority.

The reason the tester is asking is, probably, that there’s a development project, and there’s probably a bug tracking system, and it probably contains fields for both severity and priority (and probably as numbers). The tester has probably been told to fill in each field as part of his bug report; and the tester probably hasn’t been told specifically what the fields mean—or the tester is probably uncertain about how the numbers map to reality.

“Severity” is the noun associated with the adjective “severe”. In my Concise Oxford Dictionary, “severe” has six listed meanings. The most relevant one for this context is “serious, critical”. Severity, with respect to a problem, is basically how big the problem is: how much trouble it’s going to cause. If it’s a big problem, it gets marked as high severity (oddly, that’s typically a low number), and if it’s not a big deal, it gets marked as low severity, typically with a higher number. So, severity is a simple concept. Except…

When we’re testing, and we think we see a problem, we don’t see everything about that problem. We see what some people call a failure, a symptom. The symptom we observe may be a manifestation of a coding error, or of a design issue, or of a misunderstood or mis-specified requirement. We see a symptom; we don’t see the cause or the underlying fault, as the IEEE and others might call it.

Whatever we’re observing may be a terrible problem for some user or some customer somewhere—or the customer might not notice or care. Here’s an example: in Microsoft Word 2010’s Insert Page Number feature, choose small Roman numerals as your format, and use the value 32768 (rendered in Roman numerals). Word hangs on my machine, and on every machine I’ve tried this trick on (you can try it too). Now: is this a Severity 1 bug? It certainly appears to be severe, considering the symptom. A hang is a severe problem, in terms of reliability.
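
An aside on what might be going on under the hood: I don’t know what Word’s page-numbering code actually does, but 32768 is 2^15, one past the largest signed 16-bit integer, which hints at an overflow. Here is a minimal sketch in Python of how a naive Roman-numeral converter that stores its input in a signed 16-bit value could hang at exactly that boundary. The converter and the wrapping helper are entirely hypothetical, purely for illustration:

    # Hypothetical sketch only; Word's real implementation is unknown.
    # Simulates a Roman-numeral converter whose input lives in a signed
    # 16-bit integer, so 32768 wraps around to -32768.

    NUMERALS = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]

    def as_int16(n):
        """Wrap n the way a signed 16-bit integer would."""
        n &= 0xFFFF
        return n - 0x10000 if n >= 0x8000 else n

    def roman(page_number):
        n = as_int16(page_number)      # 32768 becomes -32768 here
        out = []
        while n != 0:                  # the guard assumes n stays positive
            for value, glyph in NUMERALS:
                if n >= value:
                    out.append(glyph)
                    n -= value
                    break
            else:
                # n is negative: no numeral matches, n never changes,
                # and the loop spins forever -- a hang, not a crash.
                pass
        return "".join(out)

    # roman(32767) terminates (32 m's, then "dcclxvii");
    # roman(32768) never returns.

Whatever is really going on inside Word, the sketch illustrates the point: the visible symptom, on its own, tells us very little about the underlying fault.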

But wait… considering that vanishingly few people use lower-case Roman numeral page numbers larger than, say, a few hundred, is the problem really that severe? In terms of capability, it’s probably not a big deal; there’s a very low probability that any normal user would need to use that feature and would encounter the problem.

Except… considering the fact that a problem like this could—at least in theory—present an opportunity for a hacker to bring down an application or, worse, take control of a system, maybe this is a devastatingly severe problem.

There’s yet another factor to consider here. We all suffer to some degree from a bias that can play out in testing. This might be a form of representativeness bias, or of assimilation bias, or of correspondence bias, but none of these seems to be a perfect fit. I think of it as the Heartburn Heuristic, in honour of my dad: for a year or more, he perceived minor heartburn—a seemingly trivial symptom of a seemingly minor gastric reflux problem. What my (late) dad didn’t count on was that, from the symptoms, it’s hard to tell the difference between gastric reflux and esophageal cancer.

The Heartburn Heuristic is a reminder that it’s easy to believe—falsely—that a minor symptom is naturally associated with a minor problem. It’s similarly easy to believe that a serious problem will always be immediately and dramatically obvious. It’s also easy to believe that a problem that looks like big trouble is big trouble, even when a fast one-byte fix will make the problem go away forever. We also become easily confused about the relationships among the prominence of the symptom, the impact on the customer, the difficulty associated with fixing the problem, and the urgency of the fix relative to the urgency of releasing the product. (Look at the Challenger and Columbia incidents as canonical examples of how this plays out in engineering, emotions, and politics.) In reality, there’s no reason to believe in a strong correlation between the prominence of a problem and its severity, or between the potential impact of a problem and the difficulty of a fix. A missing character in some visible field may be a design limitation or a display formatting bug, or it may be a sign of corruption in the database. Of course, since we’re fallible human beings, looking for unknown problems in an infinite space with finite time to do it, the most severe problems in a product can escape our notice entirely. So, based on the symptom alone, at best we can only guess at the severity of the problem. That’s bad enough, but the problem of classifying severity gets even worse.

Just as we have biases and cognitive shortcomings, other people on the project team will tend to have them too. The tester’s credibility may be called into question if she places a high severity number on what others consider to be a low severity problem. Severity, after all, is subject to the Relative Rule: severity is not an attribute of the problem, but a relationship between the problem and some person at some time. To the end user who never uses the feature, the Roman numeral hang is not a big deal. To the end user who actually experiences a hang and possible loss of time or data, this could be a deeply annoying problem. To a programmer who takes great pride in his craft, a hang is a severe problem. To a programmer who is being evaluated on the number of Severity 1 problems in the product (a highly dubious way to measure the quality of a programmer’s work, but it happens), there is a strong motivation to make sure that the Roman numeral hang is classified as something other than a Severity 1 problem. To a program manager who has a few months of development time available before release, our Roman numeral problem might be a problem worth fixing. To a program manager who is facing a one-week deadline before the product has to ship (thanks to retail and stock market pressure), this is a trivial bug. (Trust me on that; I’ve been a program manager.)

In light of all this, what is a tester to do? My personal preference (based on experience as a tester, as a programmer, and as a program manager) is to encourage testers to stay out of the severity business if possible. By all means, I provide the project team with a clear description of the symptom, the quality criteria that could be threatened by it, and ideas on how the problem could have an effect on people who matter. I might provide a guess, based on inference, as to the underlying cause. I’ll be careful to frame it as a guess, unless I’ve seen the source code and understand the problem clearly. My default assumption is that I can’t go by appearances, and that every symptom has an unknown cause with potentially harsh consequences. I assume that every problem is guilty until proven innocent—that it’s a potentially severe problem until the code has been examined, the risk models revisited, and the team consulted. I’m especially wary of assigning a low severity to a bug report based on an apparently trivial symptom. If I haven’t seen the code, I try to avoid saying that something is a trivial problem; if pressed, I’ll say it looks like a trivial problem. If I’m forced to enter a number into a bug reporting form, I’ll set the severity of a problem at its highest level unless I have substantial understanding and reason to see the problem as being insignificant. To avoid the political cost of seeming like a Cassandra, I’ll make sure my clients are aware of my fundamental uncertainty about severity: the best I can provide is a guess, and if I must err, I’d rather err on the side of overestimating severity than underestimate it and thereby downplay an important problem. As a solution that feels better to me, I might also request an “unclassified” option in the Severity field, so that I can move on quickly and leave the classification to the team, to the programmers, and to the program managers.

As for priority: priority is the order in which someone wants things to be done. Perhaps some people use the priority field to rank the order in which particular problems should be discussed, but my experience is that, usually, “priority” is a tester’s assessment of how important it is to fix the problem—a kind of ranking of what should be fixed first. Again based on my experience as tester, programmer, and program manager, I don’t see this as being a tester’s business at all. Deciding what should be done on a programming or business level is the job of the person with authority and responsibility over the work, in collaboration with the people who are actually doing the work. When I’m a tester, there is one exception: if I see a problem that is preventing me from doing further testing, I will request that the fix for that problem be fast-tracked (and I’ll outline the risks of not being able to test that area of the product). As tester, one of the most important aspects of my report is the set of things that make testing harder or slower, the things that give bugs more time and more opportunity to hide. Nonetheless, deciding what gets fixed first is for those who do the managing and the fixing.

In the end, I believe that decisions about severity and priority are business and management decisions. As testers, our role is to provide useful information to the decision-makers, but I believe we should let development managers manage development.

Why Would a User Do THAT?

Monday, March 4th, 2013

If you’ve been in testing for long enough, you’ll eventually report or demonstrate a problem, and you’ll hear this:

“No user would ever do that.”

Translated into English, that means “No user that I’ve thought of, and that I like, would do that on purpose, or in a way that I’ve imagined.” So here are a few ideas that might help to spur imagination.

  • The user made a simple mistake, based on his erroneous understanding of how the program was supposed to work.
  • The user had a simple slip of the fingers or the mind—inadvertently pasting a letter from his mother into the “Withdrawal Amount” field.
  • The user was distracted by something, and happened to omit an important step from a normal process.
  • The user was curious, and was trying to learn about the system.
  • The user was a hacker, and wanted to find specific vulnerabilities in the system.
  • The user was confused by the poor affordances in the product, and at that point was willing to try anything to get his task accomplished.
  • The user was poorly trained in how to use the product.
  • The user didn’t do that. The product did that, such that the user appeared to do that.
  • Users actually do that all the time, but the designer didn’t realize it, so the product’s design is inconsistent with the way the user actually works.
  • The product used to do it that way, but to the user’s surprise now does it this way.
  • The user was looking specifically for vulnerabilities in the product as a part of an evaluation of competing products.
  • The product did something that the user perceived as unusual, and the user is now exploring to get to the bottom of it.
  • The user did that because some other vulnerability—say, a botched installation of the product—led him there.
  • The user was in another country, where they use commas instead of periods, dashes instead of slashes, kilometres instead of miles… Or where dates aren’t rendered the way we render them here. (See the sketch just after this list.)
  • The user was testing the product.
  • The user didn’t realize this product doesn’t work the way that product does, even though the products have important and relevant similarities.
  • The user did that, prompted by an error in the documentation (which in turn was prompted by an error in a designer’s description of her intentions).
  • To the designer’s surprise, the user didn’t enter the data via the keyboard, but used the clipboard or a programming interface to enter a ton of data all at once.
  • The user was working for another company, and was trying to find problems in an active attempt to embarrass the programmer.
  • The user observed that this sequence of actions works in some other part of the product, and figured that the same sequence of actions would be appropriate here too.
  • The product took a long time to respond, the user got impatient, and started doing other stuff before the product responded to his earlier request.
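
To make just one of those concrete, consider the locale item above. Here is a minimal sketch in Python; the locale name “de_DE.UTF-8” is an assumption (it has to be installed on your machine), and the snippet is purely illustrative. The very same keystrokes can mean different numbers to different users:

    # A minimal illustration of the locale item in the list above.
    # Assumes the de_DE.UTF-8 locale is installed; adjust for your system.
    import locale

    s = "1.234"
    print(float(s))            # 1.234 under English conventions

    locale.setlocale(locale.LC_NUMERIC, "de_DE.UTF-8")
    print(locale.atof(s))      # 1234.0: in German, '.' groups thousands
    print(locale.atof("1,5"))  # 1.5: ',' is the decimal separator

    # float("1,5") raises ValueError, so whether a user "would ever do that"
    # depends entirely on which users the designer has imagined.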

And I’m not even really getting started. I’m sure you can supply lots more examples.

Do you see? The space of things that people can do intentionally or unintentionally, innocently or malevolently, capably or erroneously, is huge. This is why it’s important to test products not only for repeatability (which, for computer software, is relatively easy to demonstrate) but also for adaptability. In order to do this, we must do much more than show that a program can produce an expected, predicted result. We must also expose the product to reasonably foreseeable misuse, to stress, to the unexpected, and to the unpredicted.