What Do You Mean By “Arguing Over Semantics”? (Part 2)

April 4th, 2013

Continuing from yesterday

As you may recall, my correspondent remarked

“To be honest, I don’t care what these types of verification are called be it automated checking or manual testing or ministry of John Cleese walks. What I would like to see is investment and respect being paid to testing as a profession rather than arguing with ourselves over semantics.”

Here’s an example of the importance of semantics in testing. When someone says, “it works”, what do they mean? In the Rapid Testing class, we say that “it works” really means “it appears to meet some requirement to some degree”.

That expanded statement should raise a bunch of questions for a tester, a test manager, or a client: What observations gave rise to that appearance? Observations by whom? When did that person observe? Under what conditions? Did he induce any variation into those conditions? Did he observe repeatedly, or just the once? For a long time, or just a glance? If the observation was made somewhere, what might be different somewhere else? Has anything changed since the last observation? Which requirements were considered? Which requirement specifically seems to have been met? Only that requirement? Are there other explicit or implicit requirements that might have not been met? To some degree–to what degree? Have the right people been consulted on the degree to which the requirement seems to have been met? Do they agree? Is the requirement and the evidence that “it works” well-understood by and acceptable to other people who might matter? Yes, that’s a lot of questions—and testers who ask them find nests of bugs living underneath the questions, and in the cracks between the answers.

Semantics has its parallel in science and measurement, in the form of a concept called “construct validity”. In measurement, construct validity centres around the issue of what counts as an instance of something you’re measuring, and what doesn’t count as such an instance. In their book Experimental and Quasi-Experimental Designs for Generalized Causal Inference (I have no idea why a book with such a catchy title isn’t flying off the shelves), Shadish, Cook and Campbell say,

“The naming of things is a key problem in all sciences, for names reflect category memberships that themselves have implications about relationships to other concepts, theories, and uses. This is true even for seemingly simple labeling problems. For example, a recent newspaper article reported a debate among astronomers over what to call 18 newly discovered celestial objects… The Spanish astronomers who discovered the bodies called them planets, a choice immediately criticized by some other astronomers… At issue was the lack of a match between some characteristics of the 18 objects (they are drifting freely through space and are only about 5 million years old) and some characteristics that are prototypical of planets (they orbit a star and require tens of millions of years to form). Critics said these objects were more reasonably called brown dwarfs, objects that are too massive to be planets but not massive enough to sustain the thermonuclear processes in a star. Brown dwarfs would drift freely and be young, like these 18 objects. The Spanish astronomers responded that these objects are too small to be brown dwarfs and are so cool that they could not be that young.” (p. 66)

Well, tomayto-tomahto, right? There are these objects out there, and they’re out there no matter what we call them. Brown dwarfs, planets… why bother quibbling? Shadish, Cook, and Campbell answer: “All this is more than just a quibble: if these objects really are planets, then current theories of how planets form by condensing around a star are wrong!” (my emphasis)

They end the passage by noting that construct validity is a much more difficult problem in social science field experiments—and I would argue that most of software testing is far closer to the field sciences than to astrophysics. More on that in future blog posts.

(Shadish, Cook and Campell cite the newspaper article as “Scientists quibble on calling discovery ‘planets’” (2000, October 6). The Memphis Commercial Appeal p. A5. I did an online search through all of the the Commercial Appeal’s articles for that day, and a more global search for newspaper articles with that title, but I was unable to find it. However, the controversy over what constitutes a planet continues: http://en.wikipedia.org/wiki/IAU_definition_of_planet.)

Labels are intertwined with our ontologies (our ways of categorizing the world) and our theories. Combustion used to be explained by an element called “phlogiston” that escaped when something was burned, and that had negative mass. When problems with that theory arose, phlogiston evolved from an element into a principle. The phlogiston theory had considerable explanatory power, and drove a good deal of invention and research. Eventually, though, people recognized enough problems in the phlogiston theory that Joseph Priestly’s discovery—”dephlogisticated air”—came to be called by Lavoisier’s name, the name we still use today: oxygen. Interestingly, oxidation began as a principle, before the element was identified. So theories of combustion went from element to principle to inverted principle and back to element. (See Steven Johnson’s The Invention of Air.)

In the old days, people simply got sick. Some treatments worked; others didn’t work so well. Modern doctors don’t casually confuses bacteria and viruses. They prescribe antibiotics for bacterial infections, and retroviral drugs for viruses. Labels not only represent underlying meanings; they often incorporate those meanings.

If we’re pleading for professional respect for testing, it’s worth asking ourselves what we think is worthy of respect. I believe that what’s most respectable is our special interest in dispelling illusions, demolishing unwarranted confidence, recognizing unnoticed complexity, and revealing underlying truths in a rapidly changing and developing world. If you agree, I think you’ll see a professional problem in promoting ways of speaking that create or sustain illusions, instead of telling plain truths about our work. I think you’ll see a professional problem with oversimplified models of complex cognitive processes, and I think you’ll see a problem with keeping our vocabulary static.

To my correspondent way above: I understand that dealing with all this discussion is effortful. Taking on new ideas is a pain, and so is defending old ones that are being challenged. Revising some simple, seemingly stable beliefs is hard work. Recognizing that two labels might point to the same thing might feel like a distraction, and choosing whose vocabulary to use might be laden with politics and emotion. You’re almost certainly busy with getting stuff done, and it takes time to keep up with the conversation—never mind the racket. But be sure of this: investment and respect are paid to astronomy, chemistry, and medicine as professions precisely because it is the nature of a profession to question and redefine its ideas about itself. Studying a profession involves developing distinctions and definitions to aid in the study. If we’re going to talk seriously and credibly about developing skill in testing, it’s important for us to develop clarity on the activities we’re talking about.

I thank James Bach for his contributions to this essay.

What Do You Mean By “Arguing Over Semantics”?

April 3rd, 2013

Commenting on testing and checking, one correspondent responds:

“To be honest, I don’t care what these types of verification are called be it automated checking or manual testing or ministry of John Cleese walks. What I would like to see is investment and respect being paid to testing as a profession rather than arguing with ourselves over semantics.”

My very first job in software development was as a database programmer at a personnel agency. Many times I wrote a bit of new code, I got a reality check: the computer always did exactly what I said, and not necessarily what I meant. The difference was something that I experienced as a bug. Sometimes the things that I told the computer were consistent with what I meant to tell it, but the way I understood something and the way my clients understood something was different. In that case, the difference was something that my clients experienced as a bug, even though I didn’t, at first. The issue was usually that my clients and I didn’t agree on what we said or what we meant. That wasn’t out of ignorance or ill-will. The problem was often that my clients and I had shallow agreement on a concept. A big part of the job was refining our words for things—and when we did that, we often found that the conversation refined our ideas about things too. Those revelations (Eric Evans calls them “knowledge crunching” are part of the process of software development.

As the only person on my development team, I was also responsible for preparing end-user documentation for the program. My spelling and grammar could be impeccable, and spelling and grammar checkers could check my words for syntactic correctness. When my description of how to use program was vague, inaccurate, or imprecise, the agents who used the application would get confused, or would make mistakes, or would miss out on something important. There was a real risk that the company’s clients wouldn’t get the candidates they wanted, or that some qualified person wouldn’t get a shot at a job. Being unclear had real consequences for real people.

A few years later, my friend Dan Spear—at the time, Quaterdeck’s chief scientist, and formerly the principal programmer of QEMM-386—accepted my request for some lessons in assembly language programming. He began the first lesson while we were both sitting back from the keyboard. “Programming a computer,” he began, “is the most humbling thing that you can do. The computer is like a mirror. it does exactly what you tell it to do, and in doing that, it reflects any sloppiness in your thinking or in your way of expressing yourself.”

I was a program manager (a technical position) for the company for four years. Towards the end of my tenure, we began working on an antivirus product. One of the product managers (“product manager” was a marketing position) wanted to put a badge on the retail box: “24 hour support response time!” In a team meeting, we technical people made it clear that we didn’t provide 24-hour monitoring of our support channels. The company’s senior management clearly had no intention of staffing or funding 24-hour support, either. We were in Los Angeles, and the product was developed in Israel. It took development time—sometimes hours, but sometimes days—to analyse a virus and figure out ways to detect and to eradicate it. Nonetheless, the marketing guy (let’s call him Mark) continued to insist that that’s what he wanted to put on the box. One of the programming liaisons (let’s call him Paul) spoke first:

Paul: “I doubt that some of the problems we’re going to see can be turned around in 24 hours. Polymorphic viruses can be tricky to identify and pin down. So what do you mean by 24-hour response time?”

Mark: “Well, we’ll respond within 24 hours.”

Paul: “With a fix?”

Mark: “Not necessarily, but with a response.”

Paul: “With a promise of a fix ? A schedule for a fix?”

Mark: “Not necessarily, but we will respond.”

Paul: “What does it mean to respond?”

Mark: “When someone calls in, we’ll answer the phone.”

Sam (a support person): “We don’t have people here on the weekends.”

Mark: “Well, 24 hours during the week.”

Sam: “We don’t have people here before 7:00am, or after 5:00pm.”

Mark: “Well… we’ll put someone to check voicemail as soon as they get in… and, on the weekends… I don’t know… maybe we can get someone assigned to check voicemail on the weekend too, and they can… maybe, uh… send an email to Israel. And then they can turn it around.”

At this point, as the program manager for the product, I’d had enough. I took a deep breath, and said, “Mark, if you put ’24-hour response time’ on the box, I will guarantee that that will mislead some people. And if we mislead people to take advantage of them, knowing that we’re doing it, we’re lying. And if they give us money because of a lie we’re telling, we’re also stealing. I don’t think our CEO wants to be the CEO of a lying, stealing company.”

There’s a common thread that runs through these stories: they’re about what we say, about what we mean, and about whether we say what we mean and mean what we say. That’s semantics: the relationships between words and meaning. Those relationships are central to testing work.

If you feel yourself tempted to object to something by saying “We’re arguing about semantics,” try a macro expansion: “We’re arguing about what we mean by the words we’re choosing,” which can then be shortened to “We’re arguing about what we mean.” If we can’t settle on the premises of a conversation, we’re going to have an awfully hard time agreeing on conclusions.

I’ll have more to say on this tomorrow.

Versus != Opposite

March 31st, 2013

Dale Emery, a colleague for whom we have great respect, submitted a comment on my last blog post, which in turn referred to Testing and Checking Refined on James Bach‘s blog. Dale says:

I don’t see the link between your goals and your solution. Your solution seems to be (a) distinguishing what you call checking from what you call testing, (b) using the terms “checking” and “testing” to express the distinction, and (c) promoting both the distinction and the terminology. So, three elements: distinction, terminology, promotion.

How do these:

  • deepen understanding of the craft? (Also: Which craft?)
  • emphasize that tools and skilled use are essential?
  • illustrate the risks of asking humans to behave like machines?

I can see how your definitions contribute to the other goal you stated: to show that checking is deeply embedded in testing. And your recent refinements contribute better than your earlier definitions did.

But then there’s “versus,” which I think that bumps smack into this goal. And not only the explicit use of “versus”; also the “versus” implied by repeatedly insisting that “That’s not testing, that’s checking!”

Also, I think your choice of terminology bumps up against this “deeply embedded” goal. Notice, that you often express distinctions by adding modifiers. In James’s post: Checking, human checking, machine checking, human/machine checking. The terms with modifiers are clearly related (and likely a subset) of the unmodified term.

Your use of a distinct word (“checking”) rather than a modified term (e.g. “mechanizable testing” or “scripted testing” or similar), have a natural effect of hinting at a relationship other than “this is a kind of that.” I read your choice of terminology (and what I interpret as insistence on the terminology) as expressing a more distant relationship than “deeply embedded in.”

James and I composed this reply together:

Our goal here is to improve the crafts of software testing, software engineering, and software project management. We use several tactics in our attempt to achieve that goal.

One tactic is to install linguistic guardrails to help prevent people from casually driving off a certain semantic cliff. At the bottom of that cliff is a gaggle of confused testers, programmers, and managers who are systematically—due to their confusion and not to any evil intent—releasing software that has been negligently tested.

This approach is less likely than they would wish to reveal important things that they would want to know about the software. You might believe that “negligently tested” is a strong way of putting it. We agree. To the extent that this unawareness brings harm to themselves or others, the software has been negligently tested. For virtual world chat programs on the Web, that negligence might be no big deal (or at least, no big deal until they store your credit card information). However, we have experience working with people in financial domains, retail, medical devices, and educational software who are similarly confused on this issue specifically: there’s more to testing a product than checking it.

Our tactic, we believe, deepens the understanding of the craft of test quite literally: where there were no distinctions and people talked at cross-purposes, we install distinctions so that we can more easily detect when we are not talking about the same things. This adds an explicit dimension there there had been just a tacit and half-glimpsed one. That is exactly what it means to deepen understanding. In Chapter 4 of Perfect Software and Other Illusions about Testing, Jerry Weinberg performed a similar task, de-lumping (that’s his term) “testing”. There, he calls out components of testing and some related activities that are not, strictly speaking, testing at all: “testing for discovery”, “pinpointing”, “locating”, “determining significance”, “repairing”, “troubleshooting”, “testing to learn”, “task-switching”. We’re working along similar lines here.

Our tactic, we believe, helps to emphasize that tools and skilled use of tools are essential by creating explicit categories for processes amenable to tooling and processes not so amenable. These categories then become roosting places for our thoughts and our conversations about how tools relate to our work. At best, understanding is elusive and communication is difficult. Without words to mark them, understanding and communication are even more difficult. That is not necessarily a problem in everyday life. As testers we work in a turbulent world of business, technology, and ideas. Problems in products (bugs) and in projects (issues) emerge from misunderstanding. The essence of a testing work is to clear misunderstandings over differences between what people want and what they say they want; what people produced and what they say they produced; what they did and what they say they did. People often tell us that they’ve tested a product. It often turns out that people mean that they’ve checked the functions in a product. We want to know what else they’ve done to test it. We need words, we claim, to mark those distinctions.

We’re aware that other people have come up with labels for what we might call “checks”; for example, Mike Hill speaks of “microtests” in a similar way, and others have picked up on that, presenting arguments on similar lines to ours. That’s cool. In the post on James’ blog, we make it explicit that we use this terminology in the domain we control—the Rapid Software Testing class and our writings—and we suggest that it might be useful for others. Some people borrow bits of Rapid Software Testing for their own work; some plagiarize. We encourage the former, and ask the latter to give attribution to their sources. But in the end, as we’ve said all along, it’s the ideas that matter, and it’s up to people to use the language they want. To us, it’s not a terrible thing to say “simplistic testing” any more than it would be a terrible thing to call a compiler an automatic programmer, but we think “compiler” works better.

We visit many projects and companies, including a lot of Agile projects, and we routinely find that talk of checking has drowned out talk of testing—except that people call it testing so nobody even notices how skewed their focus has become. Testers become increasingly selected for their enthusiasm as quasi-programmers and check-jockeys. Who studies testing then? What do testers on Agile projects normally talk about at their conferences or on the Web? Tools. Tools. And tools—most of which focus on checking, and the design of checkable requirements. This is not in itself a bad thing. We’re concerned by the absence of serious discussion of testing, critical investigation of the product. Sometimes there is an off-handed reference to exploratory testing, based on naïve or misbegotten ideas about it. Here’s a paradigmatic example, from only yesterday as we write: http://www.scrumalliance.org/articles/511-agile-methodology-is-not-all-about-exploratory-testing.

The fellow who wrote that article speaks of “validation criteria”, “building confidence” (Lord help us, at one point he says “guarantees confidence”), “defined expected results”. That is, he’s talking about checking.

Checking is deeply embedded in testing. It is also distinct from testing. That is not a contradiction. Distinction is not necessarily disjunction; “or” in common parlance is not necessarily “xor”. Our use of “versus” is exactly how we English speakers make sharp distinctions even among things that are strongly related, even when one is embedded in the other (the forest vs. the trees, playing hockey vs. skating). Consider people who believe they can eat nothing but bread and meat, as long as they gobble a daily handful of vitamin pills. We think it would be perfectly legitimate to say “That’s not nutrition, that’s vitamin supplements.” Yes, vitamins are part of nutrition. But they are not nutrition. It’s reasonable, we would argue, to talk about “nutrition versus vitamins” in that conversation.

For instance we could say “mind vs. body.” Mind is obviously embedded in body. Deeply embedded. But don’t you agree that mind is quite a different sort of thing than body? Do you feel that some sort of violence is being done with that distinction? Perhaps some people do think so, but the distinction is undeniably a popular and helpful one, and it has been to a great many thinkers over hundreds of years. Some people focus on their minds and neglect their bodies. Others focus on their bodies and neglect their minds. At least we have these categories so that we can have a conversation about them.

When Pierre Janet first distinguished between conscious and sub-conscious thought, that also was not an easy distinction. Today it is a commonplace. Everyone, even those who never took a class in psychology, is aware of the concept of the sub-conscious, and that not everything we do is driven by purely conscious forces. We believe our distinction between testing and checking could have a similar impact and similar effect—in time.

Meanwhile, Dale, we know you and we respect you. Please help us resolve our confusion: what’s YOUR goal? In your world, are there testers? Do the ambitious testers in your world diligently study testing? Or would you say that they study programming and how to use tools? How do you cope with that? Do you feel that your goal is served best by treating testing and whatever people do with tools to verify certain facts about a product as just one kind of activity? Would you suggest breaking it down a different way than we do? If so, how?

On Testing and Checking Refined

March 29th, 2013

Over the last few months, and especially during some face-to-face time that we had in England recently, James Bach and I have been working to sharpen our notions of testing and checking. Although the task had been on the list for some time, we didn’t get a sense of great urgency about it until we were surprised recently to find that, at a very subtle but important level, we meant different things by “checking”. Until then, what we had achieved was “shallow agreement”, something that’s very common in our world. Ideas can only be represented by words and never completely described. Words are always often ambiguous and slippery. For example, the word “versus” in my original post on the subject, “Testing vs. Checking” was misunderstood by some people. “Versus” can mean “in opposition to” (Manchester United vs. Chelsea, Marbury vs. Madison), but it can also mean “in contrast to, distinct from”, which affords expressions like “trees vs. leaves”, “French people vs. Parisians”, or “riding vs. balancing”. It’s interesting and to some degree unfortunate that people naturally tend to drop anchor on their initial interpretations of words. But like software itself, sometimes it’s hard to anticipate what other people will recognize as a bug. It’s even harder to recognize what we ourselves will recognize as bugs. Whatever we will realize eventually, we’re not there yet.

In the course of our conversations, we argued. A lot. In our business, argument is not to be feared. It’s the stone on which we sharpen ideas. From time to time, I adopted positions that were more like James’ used to be, and it seemed that James adopted positions that were more like mine used to be, until eventually we converged. We took confusion, comments, and complaints from colleagues (and some antagonists) seriously. We obtained some invaluable insights from the work of Harry Collins, whose books (The Shape of Actions, Tacit and Explicit Knowledge, Artificial Experts, Changing Order, and others) have been profoundly influential on us, as I predicted they would be a couple of years back. Indeed, the post in which I made that prediction reflects a lot of the background that informs what I’m announcing today.

The outcome of our conversations, a statement on what we mean by testing and checking in Rapid Testing and in the rest of our work, was posted on James’ blog on March 26 or so. Since that time, the post has been lightly edited in response to some thoughtful and helpful comments from reviewers and early readers.

I would like to emphasize our goals here. Our purpose is not to denigrate checking, nor to disparage the use of tools, nor to deplore those people who are asked to do human checking. On the contrary: we’re attempting to deepen our understanding of our craft; to show that checking is deeply embedded in testing; to emphasize that tools and the skilled use of them are essential to our work in many ways; to realize that humans will always inject human elements into the things they do; to realize the value of those human elements and the risks involved in asking humans to behave like machines. We must be clear on the differences between what humans do and how our processes and tools—media, as McLuhan would call them—do. Or more accurately, the differences between what we do and how our tools affect what we do.

We must also be clear that media, processes and tools, do not do things well or badly. We do things well or badly by and through and with our media. Media extend, enhance, accelerate, intensify, enable, amplify what we are, in ways that precisely reflects our thoughtfulness and our skill. This is crucially important to recognize in testing, where our goal is to use our minds, our skills, our tools and our processes to help people understand the product they’ve got so that they can make informed decisions about whether they have the product that they want.

Severity vs. Priority

March 5th, 2013

Another day has dawned on Planet Earth, so another tester has used LinkedIn to ask about the difference between severity and priority.

The reason the tester is asking is, probably, that there’s a development project, and there’s probably a bug tracking system, and it probably contains fields for both severity and priority (and probably as numbers). The tester has probably been told to fill in each field as part of his bug report; and the tester probably hasn’t been told specifically what the fields mean—or the tester is probably uncertain about how the numbers map to reality.

“Severity” is the noun associated with the adjective, “severe”. In my Concise Oxford Dictionary, “severe” has six listed meanings. The most relevant one for this context is “serious, critical”. Severity, with respect to a problem, is basically how big a problem is; how much trouble it’s going to cause. If it’s a big problem, it gets marked as high severity (oddly, that’s typically a low number), and if it’s not a big deal, it gets marked as low severity, typically with a higher number. So, severity is a simple concept. Except…

When we’re testing, and we think we see a problem, we don’t see everything about that problem. We see what some people call a failure, a symptom. The symptom we observe may be a manifestation of a coding error, or of a design issue, or of a misunderstood or mis-specified requirement. We see a symptom; we don’t see the cause or the underlying fault, as the IEEE and others might call it.

Whatever we’re observing may be a terrible problem for some user or some customer somewhere—or the customer might not notice or care. Here’s an example: in Microsoft Word 2010′s Insert Page Number feature, choose small Roman numerals as your format, and use the value 32768 (rendered in Roman numerals). Word hangs on my machine, and on every machine I’ve tried this trick on (you can try it too). Now: is this a Severity 1 bug? It certainly appears to be severe, considering the symptom. A hang is a severe problem, in terms of reliability.

But wait… considering that vanishingly few people use lower-case Roman numeral page numbers larger than, say, a few hundred, is the problem really that severe? In terms of capability, it’s probably not a big deal; there’s a very low probability that any normal user would need to use that feature and would encounter the problem.

Except… considering the fact that a problem like this could—at least in theory—present an opportunity for a hacker to bring down an application or, worse, take control of a system, maybe this is a devastatingly severe problem.

There’s yet another factor to consider here. We all suffer to some degree from a bias that can play out in testing. This might be a form of representativeness bias, or of assimilation bias, or of correspondence bias, but none of these seems to be a perfect fit. I think of it as the Heartburn Heuristic, in honour of my dad: for a year or more, he perceived minor heartburn—a seemingly trivial symptom of a seemingly minor gastric reflux problem. What my (late) dad didn’t count on was that, from the symptoms, it’s hard to tell the difference between gastric reflux and esophageal cancer.

The Heartburn Heuristic is a reminder that it’s easy to believe—falsely—that a minor symptom is naturally associated with a minor problem. It’s similarly easy to believe that a serious problem will always be immediately and dramatically obvious. It’s also easy to believe that a problem that looks like big trouble is big trouble, even when a fast one-byte fix will make the problem go away forever. We also become easily confused about the relationship between the prominence of the symptom, the impact on the customer, and the difficulty associated with fixing the problem, and the urgency of the fix relative to the urgency of releasing the product. (Look at the Challenger and Columbia incidents as canonical examples of how this plays out in engineering, emotions, and politics.) In reality, there’s no reason to believe in a strong correlation between the prominence of a problem and its severity, or the potential impact of a problem and the difficulty of a fix. A missing character in some visible field may be a design limitation or a display formatting bug, or it may be a sign of corruption in the database. Of course, since we’re fallible human beings, looking for unknown problems in an infinite space with finite time to do it, the most severe problems in a product can escape our notice entirely. So based on the symptom alone, at best we can only guess at the severity of the problem. That’s bad enough, but the problem of classifying severity gets even worse.

Just as we have biases and cognitive shortcomings, other people on the project team will tend to have them too. The tester’s credibility may be called into question if she places a high severity number on what others consider to be a low severity problem. Severity, after all, is subject to the Relative Rule: severity is not an attribute of the problem, but a relationship between the problem and some person at some time. To the end user who never uses the feature, the Roman numeral hang is not a big deal. To the end user who actually experiences a hang and possible loss of time or data, this could be a deeply annoying problem. To a programmer who takes great pride in his craft, a hang is a severe problem. To a programmer who is being evaluated on the number of Severity 1 problems in the product (a highly dubious way to measure the quality of a programmer’s work, but it happens), there is a strong motivation to make sure that the Roman numeral hang is classified as something other than a Severity 1 problem. To a program manager who has a few months of development time available before release, our Roman numeral problem might be a problem worth fixing. To a program manager who is facing a one-week deadline before the product has to ship (thanks to retail and stock market pressure), this is a trivial bug. (Trust me on that; I’ve been a program manager.)

In light of all this, what is a tester to do? My personal preference (based on experience as a tester, as a programmer, and as a program manager) is to encourage testers to stay out of the severity business if possible. By all means, I provide the project team with a clear description of the symptom, the quality criteria that could be threatened by it, and ideas on how the problem could have an effect on people who matter. I might provide a guess, based on inference, as to the underlying cause. I’ll be careful to frame it as a guess, unless I’ve seen the source code and understand the problem clearly. My default assumption is that I can’t go by appearances, and that every symptom has an unknown cause with potentially harsh consequences. I assume that every problem is guilty until proven innocent—that it’s a potentially severe problem until the code has been examined, the risk models revisited, and the team consulted. I’m especially wary of assigning a low severity on a bug report based on an apparently trivial symptom. If I haven’t seen the code, I try to avoid saying that something is a trivial problem; if pressed, I’ll say it looks like a trivial problem. If I’m forced to enter a number into a bug reporting form, I’ll set the severity of a problem at its highest level unless I have substantial understanding and reason to see the problem as being insignificant. In order to avoid the political cost of seeming like a Cassandra, I’ll make sure my clients are aware of my fundamental uncertainty about severity: the best I can provide is a guess, and if I want to err, I’d rather err or the side of overestimating severity rather than underestimating it and thereby downplaying an important problem. As a solution that feels better to me, I might also request an “unclassified” option in the Severity field, so that I can move on quickly and leave the classification to the team, to the programmers and to the program managers.

As for priority: priority is the order in which someone wants things to be done. Perhaps some people use the priority field to rank the order in which particular problems should be discussed, but my experience is that, usually, “priority” is a tester’s assessment of how important it is to fix the problem—a kind of ranking of what should be fixed first. Again based on my experience as tester, programmer, and program manager, I don’t see this as being a tester’s business at all. Deciding what should be done on a programming or business level is the job of the person with authority and responsibility over the work, in collaboration with the people who are actually doing the work. When I’m a tester, there is one exception: if I see a problem that is preventing me from doing further testing, I will request that the fix for that problem be fast-tracked (and I’ll outline the risks of not being able to test that area of the product). As tester, one of the most important aspects of my report is the set of things that make testing harder or slower, the things that give bugs more time and more opportunity to hide. Nonetheless, deciding what gets fixed first is for those who do the managing and the fixing.

In the end, I believe that decisions about severity and priority are business and management decisions. As testers, our role is to provide useful information to the decision-makers, but I believe we should let development managers manage development.

Why Would a User Do THAT?

March 4th, 2013

If you’ve been in testing for long enough, you’ll eventually report or demonstrate a problem, and you’ll hear this:

“No user would ever do that.”

Translated into English, that means “No user that I’ve thought of, and that I like, would do that on purpose, or in a way that I’ve imagined.” So here are a few ideas that might help to spur imagination.

  • The user made a simple mistake, based on his erroneous understanding of how the program was supposed to work.
  • The user had a simple slip of the fingers or the mind—inadvertently pasting a letter from his mother into the “Withdrawal Amount” field.
  • The user was distracted by something, and happened to omit an important step from a normal process.
  • The user was curious, and was trying to learn about the system.
  • The user was a hacker, and wanted to find specific vulnerabilities in the system.
  • The user is confused by the poor affordances in the product, and at that point was willing to try anything to get his task accomplished.
  • The user was poorly trained in how to use the product.
  • The user didn’t do that. The product did that, such that the user appeared to do that.
  • Users actually do that all the time, but the designer didn’t realize it, so product’s design is inconsistent with the way the user actually works.
  • The product used to do it that way, but to the user’s surprise now does it this way.
  • The user was looking specifically for vulnerabilities in the product as a part of an evaluation of competing products.
  • The product did something that the user perceived as unusual, and the user is now exploring to get to the bottom of it.
  • The user did that because some other vulnerability—say, a botched installation of the product—led him there.
  • The user was in another country, where they use commas instead of periods, dashes instead of slashes, kilometres instead of miles… Or where dates aren’t rendered the way we render them here.
  • The user was testing the product.
  • The user didn’t realize this product doesn’t work the way that product does, even though the products have important and relevant similarities.
  • The user did that, prompted by an error in the documentation (which in turn was prompted by an error in a designer’s description of her intentions).
  • To the designer’s surprise, the user didn’t enter the data via the keyboard, but used the clipboard or a programming interface to enter a ton of data all at once.
  • The user was working for another company, and was trying to find problems in an active attempt to embarrass the programmer.
  • The user observed that this sequence of actions works in some other part of the product, and figured that the same sequence of actions would be appropriate here too.
  • The product took a long time to respond, the user got impatient, and started doing other stuff before the product responded to his earlier request.

And I’m not even really getting started. I’m sure you can supply lots more examples.

Do you see? The space of things that people can do intentionally or unintentionally, innocently or malevolently, capably or erroneously, is huge. This is why it’s important to test products not only for repeatability (which, for computer software, is relatively easy to demonstrate) but also for adaptability. In order to do this, we must do much more than show that a program can produce an expected, predicted result. We must also expose the product to reasonably foreseeable misuse, to stress, to the unexpected, and to the unpredicted.

“Manual” and “Automated” Testing

February 24th, 2013

The categories “manual testing” and “automated testing” (and their even less helpful byproducts, “manual tester” and “automated tester”) were arguably never meaningful, but they’ve definitely outlived their sell-by date. Can we please put them in the compost bin now? Thank you.

Here’s a long and excellent rant by pilot Patrick Smith, who for years has been trying to address a similar problem in the way people talk (and worse, think) about “manual” and “automated” in commercial aviation.

When it comes to software, there is no manual testing. My trusty Chambers dictionary says manual means “of the hand or hands”; “done, worked or used by the hand(s), as opposed to automatic, computer-operated, etc.”; “working with the hands”. It’s true that sometimes we use our hands to tap keys on a keyboard, but thinking of that as “manual testing” is no more helpful than thinking of using your hands on the steering wheel as “manual driving”. We don’t say that Itzhak Perlman, using his hands, is performing “manual music”, even though his hands play a far more important role in his music than our hands play in our testing. If you’re describing a complex, cognitive, value-laden activity, at least please focus on the brain.

We might choose to automate the exercise of some functions in a program, and then automatically compare the output of that program with some value obtained by another program or process. I’d call that a check. (So did McCracken (1957), and so did Alan Turing.) But whatever you call the automated activity, it’s not really the interesting part. It’s cool to know that the machine can perform a check very precisely, or buhzillions of checks really quickly. But the risk analysis, the design of the check, the programming of the check, the choices about what to observe and how to observe it, the critical interpretation of the result and other aspects of the outcome—those are the parts that actually matter, the parts that are actually testing. And those are exactly the parts are not automated. A machine producing a bit is not doing the testing; the machine, by performing checks, is accelerating and extending our capacity to perform some action that happens as part of the testing that we humans do. The machinery is invaluable, but it’s important not to be dazzled by it. Instead, pay attention to the purpose that it helps us to fulfill, and to developing the skills required to use tools wisely and effectively. Otherwise, the outcome will be like giving a 17-year-old fan of Top Gear the keys to a Ferrari: it will not end well.

People often get excited when they develop or encounter something that helps them to extend their capabilities in some way. Marshall McLuhan used “media” to describe tools and technologies—anything that humans use to effect change—as extensions of man, things that enhance, extend, enable, accelerate or intensify what we do. He also emphasized that media are agnostic about our capabilities and our purposes. Tools not only accelerate what we do, they intensify what we are. The tool also changes the nature of the test that we’re performing, sometimes in ways that may escape our notice. Tools can extend or enhance or enable fabulous testing. But by accelerating some action, tools can enable us to do bad testing faster than ever, far worse than we could possibly do it without using the tool.

As Cem Kaner so perfectly puts it, “When we’re testing, we’re not talking about something you’re doing with your hands, we’re talking about something you do with your head.” Besides, as he and others have also pointed out, it’s exceedingly rare that we don’t use machinery and tools in “manual” testing. Even in review—when we’re not operating the product—we’re often looking at requirement documents or source code on a screen; using a keyboard and mouse as input mechanisms for our notes; using the computer to help us find and sift information; using software and hardware to communicate with other people; comparing ideas in our head with the output from a printer. Is this stuff manual? I prefer to think of it as sapient, something that by definition can’t be performed by a machine (or, using a word coined by our friend Pradeep Soundararajan, “brainual”), and assisted by tools. But if you want to call that stuff “manual” testing, how do you characterize the act of creating automated checks? I imagine you would call that writing, not typing. Through macros, computers can effectively type, and save us time in typing; it takes people to write. See the difference?

Several years ago, I used Excel to construct an automated oracle to model functional aspects in a teller workstation application, and identify which general ledger accounts should be credited or debited. Building the oracle took three weeks of solid analysis and programming work. As I was doing that work, I was learning about the system, performing tests, revealing problems, feeding back learning in the the development of my oracle. During this process, I also found lots of bugs. Sometimes my oracle helped me find a bug; sometimes tests that I performed interactively while developing the oracle helped me find a bug. Was I doing “automated” or “manual” testing?

I worked at another financial institution, where part of the task was to create tables of inputs and actions for functional integration checks, using FITnesse as the tool to help me select the functions, organize the data, and execute the checks. I identified risks. I designed checks to probe those risks. I entered those checks manually, via the keyboard. The tool executed the checks, but I triggered its execution manually. I reviewed the output by eye, but the tool helped by allowing me to see unexpected results easily. I revisited our assumptions frequently, considering what the checks appeared to be telling us and what they might be leaving out. When we realized a lapse in our check coverage, I’d enter new checks manually (although sometimes I laid them out in Excel first). So was I doing automated testing, or not?

Why is this a big deal to me? To me, there are two dominating reasons, and they complement each other.

When we think in terms of “automated testing”, we run the risk that we will narrow our notion of testing to automated checking, and focus our testing effort on the development and execution of automated checks. Checking is not a bad thing; it’s a good thing, and an important thing. But there are more aspects to the quality of a product than those that can be made subject to binary evaluation, which is what checks produce. I’ve had the experience of being fascinated by checking to the point of distraction from a broader and deeper analysis of the product I was working on, which meant that I missed some important problems (thankfully, other testers found those problems before they reached the customer).  Qualitative analysis of a product isn’t about bits; it’s about developing a story. Yet investigating qualitative aspects can be aided by tools, which brings us to the second point.

Talk of “manual testing” often slides into talk of “manual exploratory testing”, as though exploration doesn’t make use of tools, or as though tools could not used in an exploratory way. This slipping limits our ideas about exploration and our ideas about tools. Exploratory testing is not tool-free. When we think in terms of “manual testing”, we run the risk of underestimating the role that tools can play in our interactions with the product and in the testing effort. As a consequence, we may also ignore myriad ways in which automation can assist and amplify and accelerate and enable all kinds of testing tasks: collecting data, setting up the product or the test suite, reconfiguring the product, installing the product, monitoring its behaviour, logging, overwhelming the product, fuzzing, parsing reports, visualizing output, generating data, randomizing input, automating repetitive tasks that are not checks, preparing comparable algorithms, comparing versions—and this list isn’t even really scratching the surface. I’ve had experience of not pausing to think about how tools could help me, of neglecting to use them, and of using them unskillfully, and that compromised the quality of my work too.

To sum up:  “Manual testing” vs. “automated testing” is a false dichotomy, like “manual statistical analysis” vs. “automated statistical analysis”, or like “manual book editing” vs. “automated book editing”. In the sense that most people seem to use them, “manual” and “automated” refer to an input mechanism. People probably started saying “manual test” as a contrast to “automated test”, but just as testing isn’t manual, there’s no such thing as an automated test, either—except of the kind that we call a check, which still requires significant testing skill to design, prepare, interpret, and analyze. The check might be automated—performed by a machine—but the testing that surrounds the check is neither manual nor automated. The person who performs the testing is neither an “automated tester” nor a “manual tester”.  Tools can aid testing powerfully, and in all kinds of ways, but they can also lead to distortion or dysfunction, so practice and develop skill to use them effectively.

If you think that all this is a silly quibble, take a page from my colleague James Bach: you don’t talk about compiling and linking as “automated programming”, do you?—nor do you talk about writing code as “manual programming”, do you?

Oracles: The Brainstream Media

January 17th, 2013

In the past, James Bach and I have defined “oracle” as “a principle or mechanism by which we recognize a problem,” and we’ve focused on principles rooted in ideas about consistency and inconsistency. At its foundation, applying an oracle and recognizing a problem always involves some principle that links an observation of a product to someone’s desires for it. That is: we observe a problem when we see some kind of undesirable relationship between related things: inconsistency with things someone wants, consistency with things someone doesn’t want.

Most traditional ideas about oracles focus on mechanisms—typically in the form of a reference document, or a tool that provides “the correct answer”. (There are enormous problems with the notion of an oracle as the source of the correct answer, but I’ll save that for a later post.) Documents and tools are media—things that lie between ourselves and some person who matters. Media, as McLuhan points out, extend, accelerate, enhance, enable, or intensify some human capability. Oracles are media that extend, accelerate, enhance, enable, or intensify our ability to recognize a problem—presumably a problem for some person who matters.

In a conversation in 2009, Cem Kaner remarked to me that a tester is himself or herself a test instrument. As I’ve reflected on that, I’ve come to realize that the role of the tester is that of a medium—one that extends, accelerates, enhances, enables, or intensifies our clients’ abilities to understand their products and the problems in them.

At least since 2006, James has been talking about the concept of the “live oracle”. A live oracle may be the product owner, a programmer, a domain expert, a business analyst, a designer, another tester—any person who can alert us to the presence of a problem.

It’s the nature of media that they are indirect; mediate (as an adjective), rather than immediate. That is, they’re in between the observer of the problem and the person to whom the problem matters. But naturally, sometimes our recognition of a problem can be more immediate; that is, we recognize a problem based on a principle, without a specific mechanism or instrument or person to mediate it.

So: sometimes as we’re testing, we recognize a problem by a direct feeling, or through a principle that we’re aware of. Sometimes our recognition of a problem comes from a document or tool; sometimes from another person. That may entail embodied mental processes, rational principles, or mechanisms, but all that’s kind a mouthful. So lately, James and I have started to say that an oracle is simply “a way to recognize a problem”. Here’s a high-level, work-in-progress list of ways a tester might recognize a problem:

a) A feeling like confusion or annoyance.
b) A desirable consistency between related things.
c) A person whose opinion matters.
d) An opinion held by a person who matters.
e) A disagreement among people who matter.
f) A reference document with useful information.
g) A known good example output.
h) A known bad example output.
i) A process or tool by which output is checked.
j) A process or tool that helps a tester identify patterns.

There are two things that we encourage Rapid Software Testing students to do with this list:

1) What examples can you provide as specific instances of each of these categories? (The FEW HICCUPPS heuristics provide an answer for (b).)
2) What are the factors or dimensions that might cause us to use or prefer one oracle over another?

I’ll try to provide some answers over the coming days.

What’s Comparable (Part 2)

December 4th, 2012

In the previous post, Lynn McKee recognized that, with respect to the Comparable Product oracle heuristic, “comparable” can be have several expansive interpretations, and not just one narrow one. I’ll emphasize: “comparable product”, in the context of the FEW HICCUPPS oracle heuristics, can mean any software product, any attribute of a software product, or even attributes of non-software products that we could use as a basis for comparison. (Since “comparable product” is a heuristic, it can fail us by not helping us to recognize a problem, or by fooling us into believing that there is a problem where there really isn’t one. For now, at least, I leave the failure modes for each example below as an exercise for the reader. That said…) Here are some examples of comparable products that we could use when applying this heuristic.

An alternative product. Our product is intended to help us accomplish a particular task or set of tasks. We compare the overall operation of our product to the alternative product and its behaviour, look and feel, output, workflow, and so forth. If our product is inconsistent with a product that helps people do the same thing, then we might suspect a problem in our product. This is the “Microsoft Word vs. OpenOffice” sense of “comparable product”.

A commercially competitive product. This is a special case of “alternative product”. People often hold commercial products to a higher standard than they hold freeware. If our product is inconsistent with another commercial product that is in the same market category (think “Microsoft Word vs. WordPerfect”), then we might suspect a problem in our product.

A product that’s a member of the same suite of products. Imagine being a tester on the enormous team that produces Microsoft Office. In places, Microsoft Outlook’s behaviour is inconsistent with the behaviour of Microsoft Word. We might recognize that a user could be frustrated or annoyed by inconsistencies between those products, because those products could reasonably be expected to have a consistent look and feel. I use both Word and Outlook. Sometimes I want to find a particular word or phrase in a long Outlook message that someone sent me. I press Ctrl-F. Instead of popping open the Find dialog, Outlook opens a copy of the message to be Forwarded. The appropriate key is to find something in Outlook is F4, which by default is assigned to “Redo or Repeat” in Word. (Note that Joel Spolsky’s Law of Leaky Abstractions starts to take effect here. This flavour of the comparable product heuristic starts to leak into territory covered by the “user expectations” heuristic. That’s okay; some overlap between oracle heuristics helps to reduce to chance that we’ll miss a problems if one heuristic misfires. Moreover, weighing information from a variety of oracles helps us to evaluate the signficance of a given problem. There’s another leaky abstraction here too: what is a product? Given that Word is a product and Outlook is a product, is Office a product?)

Two products that are subcomponents within the same larger product. As in the Office/Outlook/Word example just above, Outlook isn’t even consistent within itself. In the (incoming) message reading window, Ctrl-F triggers the Forward Message function. In the (outgoing) message editing window, Ctrl-F does bring up the Find dialog. That’s because I have Outlook configured to use Word’s editor as Outlook’s. (There’s a leaky abstraction here too: the “consistency within the product” heuristic, where similar behaviours and states within the product should be consistent with one another. It’s good when oracles overlap!)

An existing product whose sole purpose is comparable to a specific feature in our product. A very simple product might have a purpose that is directly comparable to a purpose, feature or function in our product. A command-line tool like wc (Unix’ command-line word-count program) isn’t comparable with Microsoft Word in the large, but it can be used as a point of comparison for a specific aspect of Word’s behaviour.

An existing product that is different, yet shares some comparable feature, function, or concept. Many non-testers (and, apparently, many testers too) would consider Halo IV and Microsoft Word to be in completely different categories, yet there are similarities. Both are pieces of computer software; both process data; both exhibit behaviour; both save and restore state; both may change their appearance depending on the display settings. If either one were to crash, respond slowly, or misrepresent something on the screen, we might recognize a problem, and recognizing or conceiving of a problem in one might trigger us to consider a problem in the other.

A chain of events in some product. We might choose to build simple test automation to aid us in comparing the output of comparable functions or algorithms in two products. (For example, if we were testing OpenOffice, we might use scripting to compare OpenOffice’s result of a sin(x) function with Microsoft Excel’s API result, or we could use a call to the Web to obtain comparable output from the sin(x) function in Wolfram Alpha.) Those comparisons may become much more interesting when we chain a number of functions together. Note that if we’re not modeling accurately, coding carefully, and logging diligently, comparisons of chains of events may be harder to analyze.

A product that we develop specifically to implement a comparable algorithm. While working at a bank, I developed an Excel spreadsheet and VBA code to model the business logic for the teller workstation application I was testing. I used the use cases for the application as a specification, which allowed me to predict and test the ways in which which general ledger accounts would be affected by each supported transaction. This was a superb way to learn about the application, the business rules, and the power of Excel.

A reference output or artifact. Those who use FIT or FitNesse develop tables of functions, inputs, and outputs that the tool compares to output from integration-level functions; those tables are comparable products. If our testing mission were to examine the font output of Word, the display from a font management tool could be comparable to Word’s output. The comparable product may not even be instantiated in software or in electronic form. For example, we could compare the fonts in the output of our presentation software to the fonts in a Letraset catalog; we could compare the output from a pocket calculator to the output of our program; we could compare aggregated output from our program to a graph sketched on paper by a statistician; we could compare the data in our mailing list to the data in the postal code book. (Well, we used to do that; now it’s much easier to do it with some scripting that gets the data from the postal service.) More than once I’ve found a bug by comparing the number posted on the “Contact Us” page to the number printed on our business cards or in our marketing material. We could also compare output produced by our program today (to output produced by our program yesterday (an idea that leaks into the “consistency with history” heuristic).

A product that we don’t like. I remember this joke from somewhere in Isaac Asimov’s work: “People compare my violin playing to Jascha Heifetz. They say, ‘A Heifetz he ain’t!’” A comparable product is not always comparable in a desirable way. If someone touts a music management product to me saying “it’s just like iTunes!”, I’m not likely to use it. If people have been known to complain about a product, and our product provides the same basis for a complaint, we suspect a problem with our product. (The Law of Leaky Abstractions applies here too, leaking into the “familiar problems” heuristic, in which a product should be inconsistent with patterns of problems that we’ve seen before.)

Patterns of behaviour in a range or sphere of products. We can compare our product against our experience with technology and with entire classes of relevant or interesting products, without immediately referring to a specific product. “It’s one thing from freeware, but I didn’t expect crashes this often from a professional product.” “Well, this would be passable on a Windows system, but you know about those finicky Mac users.” “Yikes! I didn’t expect this product to make my password visible on-screen!” “Aw geez; the on-screen controls here are just as confusing as they are on a real VCR—and there are no tooltips, either.” “The success code is 1? Non-zero return codes on command-line utilities usually represent errors, don’t they?”

All of these things point to a few overarching points.

  • “Similar” and “comparable” can be interpreted narrowly or broadly. Even when products are dissimilar in important respects, even one point of similarity may be useful.

  • Products can be compared by similarity or by contrast.

  • We can make or prepare comparable products, in addition to referring to existing ones.

  • A comparable product may or may not be computer software.

  • Especially in reference to the few categories above, there is great value for a tester in knowing not only about technologies and functional aspects of products in the same product space, but also about user interface conventions, business or workplace domains, sources of background information, cultural and aesthetic characteristics, design heuristics, and all kinds of other things because…

  • If the object of the exercise is to find problems in the product quickly, it’s a good idea to have access to a requisite variety of ideas about what we might use as bases for comparison. (I describe “requisite variety” here, and Jurgen Appello describes it even better here.

  • Bugs thrive on overly narrow or overly broad interpretations of “comparable”. Know what you’re comparing, and why the comparison matters to your testing and to your clients.

The comparable product heuristic is an oracle principle, but in describing it here, I haven’t paid much attention to mechanisms by which we might make comparisons. We’ll get to that issue next.

What’s Comparable (Part 1)

December 3rd, 2012

People interpret requirements and specifications in different ways, based on their models, and their past experiences, and their current context. When they hear or read something, many people tend to choose an interpretation that is familiar to them, which may close off their thinking about other possible interpretations. That’s not a big problem in simple, stable systems. It’s a bigger problem in software development. The problems we’re trying to solve are neither simple nor stable, and the same is true with the software that we’re developing.

The interpretation problem applies not only to software development and testing, but to the teaching of testing too. For example, in Rapid Software Testing, James Bach and I teach that an oracle is a way to recognize a problem, and one of the most important and powerful a broader set of oracle heuristics.

Here’s how the typical experiment went. We started by asking “We’re thinking of applying the comparable product oracle heuristic to a test of Microsoft Word. What product could we use for that?” Almost everyone suggested OpenOffice Writer, which seems to be the last remaining well-known full-featured word processing alternative to Microsoft Word. Some suggested WordPad, or Notepad, although almost everyone who did so suggested that WordPad (much less Notepad) wouldn’t be much use as comparable products. “Why not?” we asked. In general, the answer was that WordPad and Notepad were too simple, and didn’t reflect the complexity of Word.

Then we asked some follow-up questions. Is Word comparable with Unix’s command-line program wc? Most people said No (for some, we had to explain what wc is; it counts the words in a file that you provide as input). It was only when we asked, “What if we were testing the word count feature in Microsoft Word?” that the light began to dawn. When we asked if Word was comparable with Halo (the game), most people still said No. When we encouraged them to think more broadly about specific features of Word that we might compare with Halo, they started to get unstuck, and began to realize that while Word and Halo were dramatically different products in important respects, they were nonetheless comparable on some levels.

By contrast, here’s a conversation with Lynn McKee. The chat has been edited to de-Skypeify it (I’ve removed some typos, fixed some punctuation, and removed a couple of digressions not consequential to the conversation).

Michael: If you were asked, “We’re thinking of applying the comparable product oracle heuristic to a test of Microsoft Word. What product could we use for that?”, how would you answer?

Lynn: Hmmm. Certainly, we could use products such as “Open Office”, “Notepad” and others. Could you tell me more about what “we” are hoping to learn about the product under test to better assess which comparable products to use? Is this a brand new product? A version release? If so, what changed and what functions are we interested in comparing?

Michael: That’s a pretty good answer. A followup question: do you know the wc program, typically available under Unix?

Lynn: Sorry, I am not familiar with that product. Can you tell me more about it? How does it relate to your product? Is your product running on Unix?

Michael: Yes. wc is a command-line program. Its purpose is to count the words in a document. You supply the document as input; it returns the number of words in that document.

Lynn: While you were typing, I used my handy Google search to tell me a bit about Unix WC. Oh interesting, so are you looking to gather information about how capable, performant, etc the word count functionality is within MS Word? Can you tell me more about what functions of MS Word interest you the most? And why?

Michael: One more question: Halo IV — the game. Is that a comparable product to MS Word?

Lynn: Sheesh, I’ve only ever seen ads. Lemme think. It blows people’s brains out…sometimes I want to do that with MS Word. ;) It would depend on what type of comparison we are hoping to draw. For example, Halo is a game and does require interaction with a user. From a UI perspective, there are menus and other forms of cause-and-effect type of interaction—that is, when I do X, I expect Y. There are also state comparisons I could draw. When I start a new game, save a game, reopen a game I have expectations about the state the game should be in. This is similar to how I may expect a document to behave with states. I may also expect certain behavior with pausing or crashing the game in terms of recovery that could be compared to MS Word. Conversely… if I am looking to compare the product’s ability to display fonts, images, format tables, etc. then I may find very low value in comparing the products. I think that you could compare any two products but you may find very different value in the comparison exercise, depending on what you hope to learn.


This is an answer that I would consider exemplary. I have related it here because it was outstanding in two ways: it was an extremely good answer, but it was also exceptional, in that most people didn’t consider wc or Halo to be even remotely comparable to Microsoft Word without a good deal of prompting. Lynn, on the other hand, recognized that “comparable” doesn’t necessarily mean “highly similar”; it can also mean “anything or any aspect of something that you might use as a basis for comparison“. She immediately questioned the question, to make sure that she understood the task at hand. She also did a bit of research on her own while I was answering the question, and asked some highly relevant questions about risks and particular concerns that I might have. Note that she’s doing important informal work—understanding the testing mission—before making too firm a commitment to what might or might not be considered “comparable” for the purposes of a particular question that we might have about the product.

I’ll have more to say about the Comparable Product heuristic tomorrow.