Blog Posts for the ‘Heuristics’ Category

Expected Results

Sunday, August 23rd, 2020

Klára Jánová is a dedicated tester who studies and practices and advocates Rapid Software Testing. Recently, on LinkedIn, she said:

I might EXPECT something to happen. But that doesn’t necessarily mean that I WANT IT/DESIRE for IT to happen. I even may want it to happen, but it not happening doesn’t have to automatically mean that there’s a problem.

The point of this post: no more “expected results” in the bug reports, please!

In reply, Derek Charles asked:

Then how else would you communicate to the developer or the team what is SUPPOSED to happen? I think that expected results are very necessary especially when regressions are found during testing.

Klára replied:

I suggest to describe the behavior that the tester recognizes as problematic and explain WHY it might be a problem for someone—the reasoning why the behavior is perceived as a bug—that’s what really matters.

Exactly so. Klára is referring here to problems and oracles—means by which we recognize problems when we encounter them in testing.

There’s an issue with the “what is supposed to happen” stuff: in development work, what is supposed to happen is not always entirely clear. Moreover, and more importantly, since testers don’t run the project or the business, we don’t mandate what is supposed to happen.

For instance, while testing, I may observe something in the product that I find confusing, or surprising, or wrong. When I look up the intended behaviour in the specification, it says one thing; the developer, claiming that the spec is out of date, contradicts it; and the product owner confirms that the spec is outdated. But she also says that the developer’s interpretation of what should happen is not what she wants him to implement. And then, when I consult an RFC, the product owner’s interpretation is inconsistent with what the RFC says should be the appropriate behaviour.

Fortunately, I don’t have to decide, and I don’t have to say what should happen. My job as a tester is to report on an apparent inconsistency between the product and presumably desirable things, or between the product and someone’s expressed desire or requirement. In the case above, I let the product owner know about the inconsistency between her interpretation and the standard, and she makes the call on what she and the business want from the product.

That is, even though I have certain expectations, I might be wrong about them and about what I think should be. For instance, she might decide that our product is not going to support that standard. She might point out that the standard I’m considering has been superseded by a later one. In any case, what is supposed to happen gets decided not by me, but by the people who run things. That’s what they’re paid for. This is a good thing, not a bad thing.

But still, I’d like to honour Derek’s question: as testers, how should we report a problem without referring to “expected results”?

  • Instead of saying “expected result” and leaving it at that, we could say “inconsistent with the specification”.

    Inconsistency with the specification is a special case of a more general way of recognizing and describing a problem: inconsistency with claims. “Inconsistency with claims” is an oracle heuristic. (A heuristic is a fallible means for solving a problem; an oracle is a special kind of heuristic which, fallibly, helps you to solve the problem of identifying and describing a bug.) When a product is inconsistent with a claim that someone important makes about it, there’s likely a problem, either with the product or the claim. As a tester, I don’t have to decide which.

    The specification is a particular form of a claim that someone is making about what the product is like, or what it should be like. Claims can be made in design sessions, planning meetings, pair programming, hallway conversations, training workshops… Claims can be represented in help files, marketing materials, workflow diagrams, lookup tables, user manuals, whiteboard sketches, UML diagrams… Claims can also be represented in the code of an automated check, where someone has written code to compare the output of the product with an anticipated and presumably desirable result. Recognizing many sources of claims and inconsistencies with them makes us more powerful testers.

    Whatever relevant claim you’re referring to, having said “inconsistent with a claim” (and having identified the nature of the claim, and where or whom it comes from), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistent with how the product used to work”.

    Inconsistency with history is an oracle heuristic. After a change, the product might have a new bug in it. On the other hand, the product might have been wrong all along, and now it’s right. (This is an example of how oracles can mislead us or conflict with each other, which is why it’s a good idea to identify the oracles we’re applying in problem reports.) If you (or others) aren’t aware of why the desirable change was made, that’s a different kind of problem, but a problem nonetheless.

    Either way, having said “inconsistent with how the product used to work” (and having described that in terms of a problem), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistent with respect to the product itself”.

    Inconsistency within the product is an oracle heuristic. This can take a number of forms: the product might return inconsistent results from one run to the next; the product could afford a tidy, smooth interface in one place, and a frustrating, confusing interface in another; the product could present output very precisely in one part of the product, and imprecisely in another; one component in the product could log output using one format, while another component’s log output is in a different format, which makes analysis more difficult…

    The inconsistency might be undesirable (because of a reliability problem), or it might be completely desirable (a Web page for a newspaper should change from day to day), or it might be desirable or undesirable in ways that you’re not aware of (since, like me, you probably don’t know everything).

    In general, people tend to prefer things that present themselves in a consistent way. Here’s a trivial example from Microsoft Office (Office 365, these days): to search for text in Word, the keyboard command is Ctrl-F. In Outlook, part of the same product suite, Ctrl-F triggers the Forward Message action instead; F4 triggers a search. Had Outlook and Word been designed by the same teams at the same time, this probably would have been identified as a bug, and addressed. In the end, the Office suite’s program managers decided that consistency with history dominated inconsistency within the product, and now we all have to live with that. Oh well.

    In any case, having said “inconsistent with respect to some aspect of the same product” (and having identified the specifics of the inconsistency), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistency with a comparable product” (and identify the product, and the nature of the inconsistency).

    Inconsistency with a comparable product is an oracle heuristic. Any product (something that someone has produced) that provides a relevant point of comparison is, by definition, a comparable product. That includes competitive products, of course; Microsoft Word and Google Docs are comparable products, in that sense. Microsoft Word and WordPad are comparable products too; they have many features in common. If Word can’t open an .RTF file generated by WordPad, we have reason to suspect a problem in one product or the other. If WordPad prints an RTF file properly, and Word does not, we have reason to suspect a problem in Word.

    Is the Unix program wc (wc stands for “word count”) a comparable product to Microsoft Word? All wc does is count words in text files, so no, except… Word has a word-counting feature. If Word’s calculation for the number of words in a text file is inexplicably different from wc’s count, we have reason to suspect a problem in one product or the other.

    Test tools and suites of automated output checks represent comparable products too. If the output from your product is inconsistent with the specified and desired results provided by your test tool, or with some data that it processes to produce such results, you have reason to suspect a problem somewhere.

    In any case, having said “inconsistent with a comparable product”, and having identified the product and the basis for comparison, you don’t need to say “expected result”.

Those are just a few examples. When we teach Rapid Software Testing, we offer a set of oracle heuristics that identify principles of desirable (and undesirable) consistency (and inconsistency) for identifying bugs; you can read more about those here.
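
To make the pattern concrete, here is a minimal Python sketch of a problem report that names its oracle instead of carrying an “expected result” field. The class and field names are hypothetical, invented for illustration; they are not part of Rapid Software Testing or of any real bug tracker. The example data borrows the Ctrl-F inconsistency described above, and the “why it might matter” text is an assumption about how a user could be affected.

```python
from dataclasses import dataclass
from enum import Enum


class Oracle(Enum):
    """The four oracle heuristics discussed in this post."""
    CLAIMS = "inconsistent with a claim"
    HISTORY = "inconsistent with how the product used to work"
    PRODUCT = "inconsistent within the product itself"
    COMPARABLE_PRODUCT = "inconsistent with a comparable product"


@dataclass
class ProblemReport:
    # All field names are hypothetical; adapt them to your own tracker.
    observation: str      # what the tester actually saw
    oracle: Oracle        # the principle by which it looks like a problem
    reference: str        # the claim, earlier version, or product compared against
    why_it_matters: str   # who might be bothered, and how

    def summary(self) -> str:
        return (
            f"Observed: {self.observation}\n"
            f"Problem: {self.oracle.value} ({self.reference})\n"
            f"Why it might matter: {self.why_it_matters}"
        )


report = ProblemReport(
    observation="In Outlook, Ctrl-F triggers the Forward Message action",
    oracle=Oracle.PRODUCT,
    reference="in Word, part of the same suite, Ctrl-F searches for text",
    why_it_matters="People who use both programs may forward a message "
                   "when they meant to search",
)
print(report.summary())
```

Nothing in the report says what is supposed to happen; it says what was seen, what it is inconsistent with, and why someone might care, and leaves the decision to the people who run the project.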

James Bach has recently identified another principle that might apply to bugs but that, in my view, more powerfully applies to enhancement requests: we desire the product to be consistent with acceptable quality: that is, not only good, but every bit as good as it can be.

Why is all this a big deal? Several reasons, I think.

First, “expected result” begs the question of where the expectation comes from. It’s just a middleman for something we could say more specifically. Why not get to the point and say it while at the same time sounding like a pro? Because…

Second, being specific about where the expectation comes from saves time and focuses conversation on the (un)desirable (in)consistencies that matter when developers and product owners are deciding whether something is a bug worth fixing. It also helps to focus repair in the appropriate claim (for example, if the product is right and the spec is wrong, it’s a prompt to repair the spec).

Third, it helps for us to remember that our job as testers is not to confirm that the product works “as expected”, but to ask “is there a problem here?” A product can fulfill an expectation and nonetheless have terrible problems about it. It’s our job to seek and find and describe inconsistencies and problems that matter before it’s too late.

And finally…

Fourth, speaking in terms of an oracle instead of an “expected result” can help to avoid patronizing, condescending, time-wasting, and obvious elements of bug reports that cause developers to feel insulted or to roll their eyes.

Actual result: Product crashes.

Expected result: Product does not crash.

Don’t be that tester.

Further reading:

Not-So-Great Expectations
Oracles From the Inside Out

Want to learn how to observe, analyze, and investigate software? Want to learn how to talk more clearly about testing with your clients and colleagues? Rapid Software Testing Explored, presented by me and set up for the daytime in North America and evenings in Europe and the UK, November 9-12. James Bach will be teaching Rapid Software Testing Managed November 17-20, and a flight of Rapid Software Testing Explored from December 8-11. There are also classes of Rapid Software Testing Applied coming up. See the full schedule, with links to register here.

Oracles Are About Problems, Not Correctness

Thursday, March 12th, 2015

As James Bach and I have been refining our ideas of testing, we’ve been refining our ideas about oracles. In a recent post, I referred to this passage:

Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.

The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.

—William E. Howden, A Survey of Dynamic Analysis Methods, in Software Validation and Testing Techniques, IEEE Computer Society, 1981

While we have a great deal of respect for the work of testing pioneers like Prof. Howden, there are some problems with this description of testing and its focus on correctness.

  • Correct output from a computer program is not an absolute; an outcome is only correct or incorrect relative to some model, theory, or principle. Trivial example: Even the mathematical rule “one divided by two equals one-half” is a heuristic for dividing things. In most domains, it’s true, but as in George Carlin’s joke, when you cut a crumb in two, you don’t have two half-crumbs; you have two crumbs.
  • A product can produce a result that is functionally correct, and yet still be deeply unsatisfactory to its user. Trivial example: a calculator returns the value “4” from the function “2 + 2”—and displays the result in white on a white background.
  • Conversely, a product can produce an incorrect result and still be quite acceptable. Trivial example: a computer desktop clock’s internal state and second hand drift a few tenths of a second each second, but the program resets itself to be consistent with an atomic clock at the top of every minute. The desktop clock almost never shows the right time precisely, but the human observer doesn’t notice and doesn’t really care. Another trivial example: a product might return a calculation inconsistent with its oracle in the tenth decimal place, when only the first two or three decimal places really matter.
  • The correct outcome of a program or function is not always known in advance. Some development and testing work, like some science, is done in an attempt to discover something new; to establish what a correct answer might look like; to explore a mathematical model; to learn about the limitations of a novel system. In such cases, our ideas of correctness or acceptability are not clear from the outset, and must be developed. (See Collins and Pinch’s The Golem books, which discuss the messiness and confusion of controversial science.) Trivial example: in benchmarking, correctness is not at issue. Comparison between one system and another (or versions of the same system at different times) is the mission of testing here.
  • As we’re developing and testing a product, we may observe things that are unexpected, under-described or completely undescribed. In order to program a machine to make an observation, we must anticipate that observation and encode it. The machine doesn’t imagine, invent, or learn, and a machine cannot produce an unanticipated oracle in response to an observation. By contrast, human observers continually learn and refine their ideas on what to observe. Sometimes we observe a problem without having anticipated it. Sometimes we become aware that we’re making a new observation—one that may or may not represent a problem. Distinct from checking, testing continually affords new things to observe. Testing prompts us to decide when new observations represent problems, and testing informs decisions about what to do about them.
  • An oracle may be in error, or irrelevant. Trivial examples: a program that checks the output of another program may have its own bugs. A reference document may be outdated. A subject matter expert who is usually a reliable source of information may have forgotten something.
  • Oracles might be inconsistent with each other. Even though we have some powerful models for it, temperature measurement in climatology is inherently uncertain. What is the “correct” temperature outdoors? In the sunlight? In the shade? When the thermometer is near a building or farther away? Over grass, or over pavement? Some of the issues are described in this remarkable article (read the comments, too).
  • Although we can demonstrate incorrectness in a program, we cannot prove a program to be correct. As Dijkstra put it, testing can only show the presence of errors, not their absence; and to go even deeper, Popper pointed out that theories can only be falsified, and not proven. Trivial example: No matter how many tests we run on that calculator, we can never know that it will always return 4 given the inputs 2 + 2; we can only infer that it will do so through induction, and induction can be deeply problematic. In Nassim Taleb’s example (cribbed from Bertrand Russell and David Hume), every day the turkey uses induction to reinforce his belief in the farmer’s devotion to the desires and interests of turkeys—until a few days before Thanksgiving, when the turkey receives a very sudden, unpleasant, and (alas for the turkey) momentary flash of insight.
  • Sometimes we don’t need to know the correct result to know that the observed result is wrong. Trivial example: the range of the cosine function is from -1 to 1. I don’t need to know the correct value for cos(72) to know that an output of 4.2 is wrong. Elaine Weyuker discusses this in a paper called “On Testing Nontestable Programs” (Department of Computer Science, Courant Institute of Mathematical Sciences, New York University): “Frequently the tester is able to state with assurance that a result is incorrect without actually knowing the correct answer.”
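
Two of the trivial examples above translate directly into small checks. The following is a minimal Python sketch, with hypothetical function names: the first function is a partial oracle that can recognize a wrong cosine output without knowing the right one, and the second compares two results only to the number of decimal places that matter. The default of three places is an assumption chosen for illustration.

```python
import math


def cosine_output_is_plausible(value: float) -> bool:
    """Partial oracle: a cosine never leaves the closed interval [-1, 1]."""
    return -1.0 <= value <= 1.0


def agrees_where_it_matters(observed: float, reference: float, places: int = 3) -> bool:
    """Compare two results only to the number of decimal places that matter."""
    return abs(observed - reference) < 0.5 * 10 ** -places


print(cosine_output_is_plausible(4.2))           # a problem, whatever cos(72) is
print(cosine_output_is_plausible(math.cos(72)))  # no problem found by this oracle
print(agrees_where_it_matters(3.1415926536, 3.1415926535))
```

Note the asymmetry: a failed range check points to a problem, but a passed one only means that this particular oracle found nothing. A wrong value such as 0.5 would sail through.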

Checking for correctness—especially when the test output is observed and evaluated mechanically or indirectly—is a risky business. All oracles are fallible. A “passing” test, based on comparison with a fallible oracle, cannot prove correctness, and no number of “passing” tests can do that. In this, a test is like a scientific experiment: an experiment’s outcome can falsify one theory while supporting another, but an experiment cannot prove a theory to be true. A million observations of white swans say nothing about the possibility that there might be black swans; a million passing tests, a million observations of correct behaviour cannot eliminate the possibility that there might be swarms of bugs. At best, a passing test is essentially the observation of one more white swan. We urge those who rely on passing acceptance tests to remember this.

A check can suggest the presence of a problem, or can at best provide support for the idea that the program can work. But no matter what oracle we might use, a test cannot prove that a program is working correctly, or that the program will work. So what can oracles actually do for us?

If we invert the focus on correctness, we can produce a more robust heuristic. We can’t logically use an oracle to prove that a system is behaving correctly or that it will behave correctly, but we can use an oracle to help falsify the theory that it is behaving correctly. This is why, in Rapid Software Testing, we say that an oracle is a means by which we recognize a problem when it happens during testing.

How Models Change

Saturday, July 19th, 2014

Like software products, models change as we test them, gain experience with them, find bugs in them, realize that features are missing. We see opportunities for improving them, and revise them.

A product coverage outline, in Rapid Testing parlance, is an artifact (a map, or list, or table…) that identifies the dimensions or elements of a product. It’s a kind of inventory of aspects of the product that could be tested. Many years ago, my colleague and co-author James Bach wrote an article on product elements, identifying Structure, Function, Data, Platform, and Operations (SFDPO; think “San Francisco DePOt”, he suggested) as a set of heuristic guidewords for creating or structuring or reviewing the highest levels of a coverage outline.

A few years later, I was working as a tester. While I was on that assignment, I missed a few test ideas and almost missed a few bugs that I might have noticed earlier had I thought of “Time” as another guideword for modeling the product. After some discussion, I persuaded James that Time was a worthy addition to the Product Elements list. I wrote my own article on that, Time for New Test Ideas.

Over the years, it seemed that people were excited by the idea of using SFDPOT as the starting point for a general coverage outline. Many people reported getting a lot of value out of it, so in my classes, I’ve placed more and more emphasis on using and practicing the application of that part of the Heuristic Test Strategy Model. One of the exercises involves creating a mind map for a real software product. I typically offer that one way to get started on creating a coverage outline is to walk through the user interface and enumerate each element of the UI in the mind map.

(Sometimes people ask, “Why bother? Don’t the specifications or the documentation or the Help file provide maps of the UI? What’s the point of making another one?” One answer is that the journey, rather than the map, is the point. We learn one set of things by reading about a product; we learn different things—and we typically learn more deeply—by touring the product, interacting with it, gaining experience with it, and organizing descriptions of what we’ve found. Moreover, at each moment, we may notice, infer, or wonder about things that the documentation doesn’t address. When we recognize something new, we can add it to our coverage model, our risk list, or our test ideas—plus we might recognize and note some bugs or issues along the way. Another answer is that we should treat anything that any documentation says about a product as a rumour until we’ve engaged with the product.)

One issue kept coming up in class: on the product coverage outline, where should the map of the user interface go? Under Functions (what the product does)? Or Operations (how people use the product)? Or Structure (the bits and pieces of the product)? My answer was that it doesn’t matter much where you put things on your coverage outline, as long as it fits for you and the people with whom you might be sharing the map. The idea is to identify things that could be tested, and not to miss important stuff.

After one class, I was on the phone with James, and I happened to mention that day’s discussion. “I prefer to put the UI under Structure,” I noted.

“What? That’s crazy talk! The UI goes under Functions!”

“What?” I replied. “That’s crazy talk. The UI isn’t Functions. Sure, it triggers functions. But it doesn’t perform those functions.”

“So what?” asked James. “If it’s how the user gets at functions, it fits under Functions just fine. What makes you think the UI goes under Structure?”

“Well, the UI has a structure. It’s… structural.”

“Everything has a structure,” said James. “The UI goes under Functions.”

And so we argued on. Then one of us—and I honestly don’t remember who—suggested that maybe the UI was important enough to be its own top-level product element. I do remember James pointing out that when we think of interfaces, plural, there might be several of them—not just the graphical user interface, but maybe a command-line interface. An application programming interface.

“Hmmm…,” I said. This reminded me of the four-user model mentioned in How to Break Software (human user, API user, operating system user, file system user). “Interfaces,” I said. “Operating system interface, file system interface, network interface, printer interface, debugging interface, other devices…”

“Right,” said James. “Plus there are those other interface-y things—importing and exporting stuff, for instance.”

“Aren’t those covered under ‘Functions’?”

“Sure. Or they might be, depending on how you think about it. But the point of this kind of model isn’t to be a template, or a form you fill out. It’s to help us reduce the chances that we might miss something important. Our models are leaky abstractions; overlaps are okay,” said James. Which, of course, was exactly the same argument I had used on him several years earlier when we had added Time to the model. Then he paused. “Ah! But we don’t want to break the mnemonic, do we? San Francisco DePOT.”

“We can deal with that. Just misspell ‘depot’: San Francisco DIPOT. SFDIPOT.”

And so we updated the model.
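
For anyone who keeps a coverage outline in a script rather than a mind map, here is a minimal Python sketch of the updated guidewords as a data structure. The function names and the sample items are hypothetical, made up for illustration; the only things taken from the model are the seven guidewords and the idea that overlaps are okay.

```python
GUIDEWORDS = ["Structure", "Function", "Data", "Interfaces",
              "Platform", "Operations", "Time"]


def new_outline() -> dict:
    """Start an empty coverage outline with one list per guideword."""
    return {word: [] for word in GUIDEWORDS}


def add_item(outline: dict, item: str, *guidewords: str) -> None:
    """File an item under one or more guidewords; overlaps are allowed."""
    for word in guidewords:
        if word not in outline:
            raise KeyError(f"unknown guideword: {word}")
        if item not in outline[word]:
            outline[word].append(item)


def mnemonic(outline: dict) -> str:
    """Build the mnemonic from the first letter of each guideword."""
    return "".join(word[0] for word in outline)


outline = new_outline()
add_item(outline, "graphical user interface", "Interfaces", "Structure")
add_item(outline, "import and export", "Interfaces", "Function")
print(mnemonic(outline))
```

Letting one item sit under several guidewords is deliberate: the outline is there to reduce the chance of missing something, not to settle where the UI belongs.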

I wonder what it will look like five years from now.

What’s Comparable (Part 2)

Tuesday, December 4th, 2012

In the previous post, Lynn McKee recognized that, with respect to the Comparable Product oracle heuristic, “comparable” can have several expansive interpretations, and not just one narrow one. I’ll emphasize: “comparable product”, in the context of the FEW HICCUPPS oracle heuristics, can mean any software product, any attribute of a software product, or even attributes of non-software products that we could use as a basis for comparison. (Since “comparable product” is a heuristic, it can fail us by not helping us to recognize a problem, or by fooling us into believing that there is a problem where there really isn’t one. For now, at least, I leave the failure modes for each example below as an exercise for the reader. That said…) Here are some examples of comparable products that we could use when applying this heuristic.

An alternative product. Our product is intended to help us accomplish a particular task or set of tasks. We compare the overall operation of our product to the alternative product and its behaviour, look and feel, output, workflow, and so forth. If our product is inconsistent with a product that helps people do the same thing, then we might suspect a problem in our product. This is the “Microsoft Word vs. OpenOffice” sense of “comparable product”.

A commercially competitive product. This is a special case of “alternative product”. People often hold commercial products to a higher standard than they hold freeware. If our product is inconsistent with another commercial product that is in the same market category (think “Microsoft Word vs. WordPerfect”), then we might suspect a problem in our product.

A product that’s a member of the same suite of products. Imagine being a tester on the enormous team that produces Microsoft Office. In places, Microsoft Outlook’s behaviour is inconsistent with the behaviour of Microsoft Word. We might recognize that a user could be frustrated or annoyed by inconsistencies between those products, because those products could reasonably be expected to have a consistent look and feel. I use both Word and Outlook. Sometimes I want to find a particular word or phrase in a long Outlook message that someone sent me. I press Ctrl-F. Instead of popping open the Find dialog, Outlook opens a copy of the message to be Forwarded. The appropriate key to launch a search for something in Outlook is F4, which by default is assigned to “Redo or Repeat” in Word. (Note that Joel Spolsky’s Law of Leaky Abstractions starts to take effect here. This flavour of the comparable product heuristic starts to leak into territory covered by the “user expectations” heuristic. That’s okay; some overlap between oracle heuristics helps to reduce the chance that we’ll miss a problem if one heuristic misfires. Moreover, weighing information from a variety of oracles helps us to evaluate the significance of a given problem. There’s another leaky abstraction here too: what is a product? Given that Word is a product and Outlook is a product, is Office a product?)

Two products that are subcomponents within the same larger product. As in the Office/Outlook/Word example just above, Outlook isn’t even consistent within itself. In the (incoming) message reading window, Ctrl-F triggers the Forward Message function. In the (outgoing) message editing window, Ctrl-F does bring up the Find dialog. That’s because I have Outlook configured to use Word’s editor as Outlook’s. (There’s a leaky abstraction here too: the “consistency within the product” heuristic, where similar behaviours and states within the product should be consistent with one another. It’s good when oracles overlap!)

An existing product whose sole purpose is comparable to a specific feature in our product. A very simple product might have a purpose that is directly comparable to a purpose, feature or function in our product. A command-line tool like wc (Unix’ command-line word-count program) isn’t comparable with Microsoft Word in the large, but it can be used as a point of comparison for a specific aspect of Word’s behaviour.
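
As a minimal sketch of this kind of comparison, the Python below counts whitespace-separated words, roughly as `wc -w` does, and compares that count with a count reported by the product under test. The function names are hypothetical, and the `product_count` values are made-up inputs; the sketch does not call Word or claim anything about how Word actually counts.

```python
def wc_style_word_count(text: str) -> int:
    """Count whitespace-separated words, roughly as `wc -w` does."""
    return len(text.split())


def compare_word_counts(text: str, product_count: int) -> str:
    """Report whether the product's count is consistent with the wc-style count."""
    reference = wc_style_word_count(text)
    if product_count == reference:
        return f"no inconsistency found (both report {reference})"
    return (
        f"inconsistency: product reports {product_count}, "
        f"wc-style count is {reference}; investigate both"
    )


sample = "state-of-the-art testing is hard"
print(compare_word_counts(sample, product_count=4))
# A hypothetical product that split words on hyphens would report 7 here.
print(compare_word_counts(sample, product_count=7))
```

The message says “investigate both” on purpose: a difference gives reason to suspect a problem in one product or the other, or simply two different, defensible ideas of what a word is.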

An existing product that is different, yet shares some comparable feature, function, or concept. Many non-testers (and, apparently, many testers too) would consider Halo IV and Microsoft Word to be in completely different categories, yet there are similarities. Both are pieces of computer software; both process data; both exhibit behaviour; both save and restore state; both may change their appearance depending on the display settings. If either one were to crash, respond slowly, or misrepresent something on the screen, we might recognize a problem, and recognizing or conceiving of a problem in one might trigger us to consider a problem in the other.

A chain of events in some product. We might choose to build simple test automation to aid us in comparing the output of comparable functions or algorithms in two products. (For example, if we were testing OpenOffice, we might use scripting to compare OpenOffice’s result of a sin(x) function with Microsoft Excel’s API result, or we could use a call to the Web to obtain comparable output from the sin(x) function in Wolfram Alpha.) Those comparisons may become much more interesting when we chain a number of functions together. Note that if we’re not modeling accurately, coding carefully, and logging diligently, comparisons of chains of events may be harder to analyze.
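
Here is a minimal Python sketch of comparing a chain of function calls across two implementations while logging every step. Both “products” are stand-ins chosen so that the example runs anywhere: the standard library’s `math.sin` plays one product, and a hypothetical truncated Taylor series, `taylor_sin`, plays the other. Neither the chain (feeding each result back into sin) nor the tolerance comes from a real project.

```python
import math


def taylor_sin(x: float, terms: int = 10) -> float:
    """Stand-in for 'the other product': a truncated Taylor series for sin(x)."""
    total = 0.0
    for n in range(terms):
        total += (-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
    return total


def compare_chain(start: float, steps: int, tolerance: float = 1e-9) -> list:
    """Feed each implementation its own previous output and log every step."""
    a = b = start
    log = []
    for step in range(1, steps + 1):
        a, b = math.sin(a), taylor_sin(b)
        log.append((step, a, b, abs(a - b) <= tolerance))
    return log


for step, a, b, consistent in compare_chain(start=1.0, steps=5):
    print(f"step {step}: {a:.12f} vs {b:.12f} consistent={consistent}")
```

Starting from 1.0 the two agree at every step; starting from 20.0 they disagree at the first step, and the fault lies with the truncated series rather than with `math.sin`. The per-step log is what makes that kind of analysis possible, and it is a reminder that the comparable product can be the one that is wrong.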

A product that we develop specifically to implement a comparable algorithm. While working at a bank, I developed an Excel spreadsheet and VBA code to model the business logic for the teller workstation application I was testing. I used the use cases for the application as a specification, which allowed me to predict and test the ways in which general ledger accounts would be affected by each supported transaction. This was a superb way to learn about the application, the business rules, and the power of Excel.

A reference output or artifact. Those who use FIT or FitNesse develop tables of functions, inputs, and outputs that the tool compares to output from integration-level functions; those tables are comparable products. If our testing mission were to examine the font output of Word, the display from a font management tool could be comparable to Word’s output. The comparable product may not even be instantiated in software or in electronic form. For example, we could compare the fonts in the output of our presentation software to the fonts in a Letraset catalog; we could compare the output from a pocket calculator to the output of our program; we could compare aggregated output from our program to a graph sketched on paper by a statistician; we could compare the data in our mailing list to the data in the postal code book. (Well, we used to do that; now it’s much easier to do it with some scripting that gets the data from the postal service.) More than once I’ve found a bug by comparing the number posted on the “Contact Us” page to the number printed on our business cards or in our marketing material. We could also compare output produced by our program today to output produced by our program yesterday (an idea that leaks into the “consistency with history” heuristic).
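
The today-versus-yesterday comparison is easy to sketch with Python’s standard `difflib` module. The function name and the two sample outputs below are hypothetical; in practice the strings would come from saved output files or logs.

```python
import difflib


def differences_from_history(yesterday: str, today: str) -> list:
    """Return unified-diff lines between two runs' outputs (empty if identical)."""
    return list(
        difflib.unified_diff(
            yesterday.splitlines(),
            today.splitlines(),
            fromfile="yesterday",
            tofile="today",
            lineterm="",
        )
    )


yesterday_output = "total: 3\nphone: 555-0100\n"
today_output = "total: 3\nphone: 555-0199\n"

for line in differences_from_history(yesterday_output, today_output):
    print(line)
```

A non-empty diff is only a prompt to investigate. The change might be a new bug, or yesterday’s output might have been wrong all along.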

A product that we don’t like. I remember this joke from somewhere in Isaac Asimov’s work: “People compare my violin playing to Jascha Heifetz. They say, ‘A Heifetz he ain’t!’” A comparable product is not always comparable in a desirable way. If someone touts a music management product to me saying “it’s just like iTunes!”, I’m not likely to use it. If people have been known to complain about a product, and our product provides the same basis for a complaint, we suspect a problem with our product. (The Law of Leaky Abstractions applies here too, leaking into the “familiar problems” heuristic, in which a product should be inconsistent with patterns of problems that we’ve seen before.)

Patterns of behaviour in a range or sphere of products. We can compare our product against our experience with technology and with entire classes of relevant or interesting products, without immediately referring to a specific product. “It’s one thing from freeware, but I didn’t expect crashes this often from a professional product.” “Well, this would be passable on a Windows system, but you know about those finicky Mac users.” “Yikes! I didn’t expect this product to make my password visible on-screen!” “Aw geez; the on-screen controls here are just as confusing as they are on a real VCR—and there are no tooltips, either.” “The success code is 1? Non-zero return codes on command-line utilities usually represent errors, don’t they?”

All of these things point to a few overarching points.

  • “Similar” and “comparable” can be interpreted narrowly or broadly. Even when products are dissimilar in important respects, even one point of similarity may be useful.

  • Products can be compared by similarity or by contrast.

  • We can make or prepare comparable products, in addition to referring to existing ones.

  • A comparable product may or may not be computer software.

  • Especially in reference to the few categories above, there is great value for a tester in knowing not only about technologies and functional aspects of products in the same product space, but also about user interface conventions, business or workplace domains, sources of background information, cultural and aesthetic characteristics, design heuristics, and all kinds of other things because…

  • If the object of the exercise is to find problems in the product quickly, it’s a good idea to have access to a requisite variety of ideas about what we might use as bases for comparison. (I describe “requisite variety” here, and Jurgen Appelo describes it even better here.)

  • Bugs thrive on overly narrow or overly broad interpretations of “comparable”. Know what you’re comparing, and why the comparison matters to your testing and to your clients.

The comparable product heuristic is an oracle principle, but in describing it here, I haven’t paid much attention to mechanisms by which we might make comparisons. We’ll get to that issue next.

What’s Comparable (Part 1)

Monday, December 3rd, 2012

People interpret requirements and specifications in different ways, based on their models, and their past experiences, and their current context. When they hear or read something, many people tend to choose an interpretation that is familiar to them, which may close off their thinking about other possible interpretations. That’s not a big problem in simple, stable systems. It’s a bigger problem in software development. The problems we’re trying to solve are neither simple nor stable, and the same is true with the software that we’re developing.

The interpretation problem applies not only to software development and testing, but to the teaching of testing too. For example, in Rapid Software Testing, James Bach and I teach that an oracle is a way to recognize a problem, and that consistency with comparable products is one of the most important and powerful of a broader set of oracle heuristics.

Here’s how the typical experiment went. We started by asking “We’re thinking of applying the comparable product oracle heuristic to a test of Microsoft Word. What product could we use for that?” Almost everyone suggested OpenOffice Writer, which seems to be the last remaining well-known full-featured word processing alternative to Microsoft Word. Some suggested WordPad, or Notepad, although almost everyone who did so suggested that WordPad (much less Notepad) wouldn’t be much use as comparable products. “Why not?” we asked. In general, the answer was that WordPad and Notepad were too simple, and didn’t reflect the complexity of Word.

Then we asked some follow-up questions. Is Word comparable with Unix’s command-line program wc? Most people said No (for some, we had to explain what wc is; it counts the words in a file that you provide as input). It was only when we asked, “What if we were testing the word count feature in Microsoft Word?” that the light began to dawn. When we asked if Word was comparable with Halo (the game), most people still said No. When we encouraged them to think more broadly about specific features of Word that we might compare with Halo, they started to get unstuck, and began to realize that while Word and Halo were dramatically different products in important respects, they were nonetheless comparable on some levels.
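To make the wc comparison concrete, here is a minimal sketch in Python. It assumes that counting whitespace-delimited tokens is a fair approximation of what wc reports for words. The second function is a hypothetical stand-in for a word count feature under test; it is not Word's actual algorithm.

```python
import re


def wc_style_count(text):
    """Count words roughly as wc does: runs of non-whitespace characters."""
    return len(text.split())


def hypothetical_product_count(text):
    """Stand-in for the feature under test; treats a hyphen as a word separator."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


sample = "A well-known tester counts words."
reference = wc_style_count(sample)
product = hypothetical_product_count(sample)

if reference != product:
    # An inconsistency is a reason to investigate, not proof of a bug.
    print(f"wc-style count: {reference}, product count: {product}")
```

The two counts disagree over "well-known". Whether that matters depends on what the people who use the count want from it.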

By contrast, here’s a conversation with Lynn McKee. The chat has been edited to de-Skypeify it (I’ve removed some typos, fixed some punctuation, and removed a couple of digressions not consequential to the conversation).

Michael: If you were asked, “We’re thinking of applying the comparable product oracle heuristic to a test of Microsoft Word. What product could we use for that?”, how would you answer?

Lynn: Hmmm. Certainly, we could use products such as “Open Office”, “Notepad” and others. Could you tell me more about what “we” are hoping to learn about the product under test to better assess which comparable products to use? Is this a brand new product? A version release? If so, what changed and what functions are we interested in comparing?

Michael: That’s a pretty good answer. A followup question: do you know the wc program, typically available under Unix?

Lynn: Sorry, I am not familiar with that product. Can you tell me more about it? How does it relate to your product? Is your product running on Unix?

Michael: Yes. wc is a command-line program. Its purpose is to count the words in a document. You supply the document as input; it returns the number of words in that document.

Lynn: While you were typing, I used my handy Google search to tell me a bit about Unix WC. Oh interesting, so are you looking to gather information about how capable, performant, etc the word count functionality is within MS Word? Can you tell me more about what functions of MS Word interest you the most? And why?

Michael: One more question: Halo IV — the game. Is that a comparable product to MS Word?

Lynn: Sheesh, I’ve only ever seen ads. Lemme think. It blows people’s brains out…sometimes I want to do that with MS Word. 😉 It would depend on what type of comparison we are hoping to draw. For example, Halo is a game and does require interaction with a user. From a UI perspective, there are menus and other forms of cause-and-effect type of interaction—that is, when I do X, I expect Y. There are also state comparisons I could draw. When I start a new game, save a game, reopen a game I have expectations about the state the game should be in. This is similar to how I may expect a document to behave with states. I may also expect certain behavior with pausing or crashing the game in terms of recovery that could be compared to MS Word. Conversely… if I am looking to compare the product’s ability to display fonts, images, format tables, etc. then I may find very low value in comparing the products. I think that you could compare any two products but you may find very different value in the comparison exercise, depending on what you hope to learn.

This is an answer that I would consider exemplary. I have related it here because it was outstanding in two ways: it was an extremely good answer, but it was also exceptional, in that most people didn’t consider wc or Halo to be even remotely comparable to Microsoft Word without a good deal of prompting. Lynn, on the other hand, recognized that “comparable” doesn’t necessarily mean “highly similar”; it can also mean “anything or any aspect of something that you might use as a basis for comparison”. She immediately questioned the question, to make sure that she understood the task at hand. She also did a bit of research on her own while I was answering the question, and asked some highly relevant questions about risks and particular concerns that I might have. Note that she’s doing important informal work—understanding the testing mission—before making too firm a commitment to what might or might not be considered “comparable” for the purposes of a particular question that we might have about the product.

I’ll have more to say about the Comparable Product heuristic tomorrow.


Monday, July 23rd, 2012

Several years ago, I wrote an article for Better Software Magazine called Testing Without a Map. The article was about identifying and applying oracles, and it listed several dimensions of consistency by which we might find or describe problems in the product. The original list came from James Bach.

Testers often say that they recognize a problem when the product doesn’t “meet expectations”. But that seems empty to me; a tautology. Testers can be a lot more credible when they can describe where their expectations come from. Perhaps surprisingly, many testers struggle with this, so let’s work through it.

Expectations about a product revolve around desirable consistencies between related things.

  • History. We expect the present version of the system to be consistent with past versions of it.
  • Image. We expect the system to be consistent with an image that the organization wants to project, with its brand, or with its reputation.
  • Comparable Products. We expect the system to be consistent with systems that are in some way comparable. This includes other products in the same product line; competitive products, services, or systems; or products that are not in the same category but which process the same data; or alternative processes or algorithms.
  • Claims. We expect the system to be consistent with things important people say about it, whether in writing (reference specifications, design documents, manuals, whiteboard sketches…) or in conversation (meetings, public announcements, lunchroom conversations…).
  • Users’ Desires. We believe that the system should be consistent with ideas about what reasonable users might want. (Update, 2014-12-05: We used to call this “user expectations”, but those expectations are typically based on the other oracles listed here, or on quality criteria that are rooted in desires; so, “user desires” it is. More on that here.)
  • Product. We expect each element of the system (or product) to be consistent with comparable elements in the same system.
  • Purpose. We expect the system to be consistent with the explicit and implicit uses to which people might put it.
  • Statutes. We expect a system to be consistent with laws or regulations that are relevant to the product or its use.

I noted that, in general, we recognize a problem when we observe that the product or system is inconsistent with one or more of these principles; we expect this from the product, and when we get that, we have reason to suspect a problem.

(If I were writing that article today, I would change expect to desire, for reasons outlined here.)

“In general” is important. Each of these principles is heuristic. Oracle principles are, like all heuristics, fallible and context-dependent; to be applied, not followed. An inconsistency with one of the principles above doesn’t guarantee that there’s a problem; people make the determination of “problem” or “no problem” by applying a variety of oracle principles and notions of value. Our oracles can also mislead us, causing us to see a problem that isn’t there, or to miss a problem that is there.

Since an oracle is a way of recognizing a problem, it’s a wonderful thing to be able to keep a list like this in your head, so that you’re primed to recognize problems. Part of the reason that people have found the article helpful, perhaps, is that the list is memorable: the initial letters of the principles form the word HICCUPPS. History, Image, Claims, Comparable products, User expectations (since then, changed to “user desires”), Product, Purpose, and Statutes. 

With a little bit of memorization and practice and repetition, you can rattle off the list, keep it in your head, and consult it at a moment’s notice. You can use the list to anticipate problems or to frame problems that you perceive.

Another reason to internalize the list is to be able to move quickly from a feeling of a problem to an explicit recognition and description of a problem. You can improve a vague problem report by referring to a specific oracle principle. A tester’s report is more credible when decision-makers (program managers, programmers) can understand clearly why the tester believes an observation points to a problem.

I’ve been delighted with the degree to which the article has been cited, and even happier when people tell me that it’s helped them. However, it’s been a long time since the article was published, and since then, James Bach and I have observed testers using other oracle principles, both to anticipate problems and to describe the problems they’ve found. To my knowledge, this is the first time since 2005 that either one of us has published a consolidated list of our oracle principles outside of our classes, conference presentations, or informal conversations. Our catalog of oracle principles now includes:

  • Statutes and Standards. We expect a system to be consistent with relevant statutes, acts, laws, regulations, or standards. Statutes, laws and regulations are mandated mostly by outside authority (though there is a meaning of “statute” that refers to acts of corporations or their founders). Standards might be mandated or voluntary, explicit or implicit, external to the development group or internal to it.

    What’s the difference between Standards and Statutes versus Claims? Claims come from inside the project. For Standards and Statutes, the mandate comes from outside the project. When a development group consciously chooses to adhere to a given standard, or when a law or regulation is cited in a requirements document, there’s a claim that would allow us to recognize a problem. We added Standards when we realized that sometimes a tester recognizes a potential problem for which no explicit claim has yet been made.

    While testing, a tester familiar with a relevant standard may notice that the product doesn’t conform to published UI conventions, to a particular RFC, or to an informal, internal coding standard that is not controlled by the project itself.

    Would any of these things constitute a problem? At least each would be an issue, until those responsible for the product declare whether to follow the standard, to violate some points in it, or to reject it entirely.

    A tester familiar with the protocols of an FDA audit might recognize gaps in the evidence that the auditor desires.  Similarly, a tester familiar with requirements in the Americans With Disabilities Act might recognize accessibility problems that other testers might miss. Moreover, an expert tester might use her knowledge of the standard to identify extra cost associated with misunderstanding of the standard, excessive documentation, or unnecessary conformance.

  • Explainability. We expect a system to be understandable to the degree that we can articulately explain its behaviour to ourselves and others.

    If, as testers, we don’t understand a system well enough to describe it, or if it exhibits behaviour that we can’t explain, then we have reason to suspect that there might be a problem of one kind or another. On the one hand, there might be a problem in the product that threatens its value. On the other hand, we might not know about the product well enough to test it capably. This is, arguably, a bigger problem than the first. Our misunderstanding might waste time by prompting us to report non-problems. Worse, our misunderstandings might prevent us from recognizing a genuine problem when it’s in front of us.

    Aleksander Simic, in a private message, suggests that the explainability heuristic extends to more members of the team than testers. If a programmer can’t explain code that she must maintain (or worse, has written), or if a development team has started with something ill-defined and confusion is moving slowly through the product, then we have reason to suspect, investigate, or report a problem. I agree with Aleksander. Any kind of confusion in the product is an issue, and issues are petri dishes for bugs.

  • World. We expect the product to be consistent with things that we know about or can observe in the world.

    Often this kind of inconsistency leads us to recognize that the product is inconsistent with its purpose or with an expectation that we might have had, based on our models and schemas. When we’re testing, we’re not able to realize and articulate all of our expectations in advance of an observation. Sometimes we notice an inconsistency with our knowledge of the world before we apply some other principle.

    This heuristic can fail when our knowledge of the world is wrong; when we’re misinformed or mis-remembering. It can also fail when the product reveals something that we hadn’t previously known about the world.

There is one more heuristic that testers commonly apply as they’re seeking problems, especially in an unfamiliar product. Unlike the preceding ones, this one is an inconsistency heuristic:

  • Familiarity. We expect the system to be inconsistent with patterns of familiar problems.

    When we watch testers, we notice that they often start testing a product by seeking problems that they’ve seen before. This gives them some immediate traction; as they start to look for familiar kinds of bugs, they explore and interact with the product, and in doing so, they learn about it.

    Starting to test by focusing on familiar problems is quick and powerful, but it can mislead us. Problems that are significant in one product (for example, polish in the look of the user interface in a commercial product) may be less significant in another context (say, an application developed for a company’s internal users). A product developed in one context (for example, one in which programmers perform lots of unit testing) might have avoided problems familiar to us in other contexts (for example, one in which programmers are less diligent).

    Focusing on familiar problems might divert our attention away from other consistency principles that are more relevant to the task at hand. Perhaps most importantly, a premature search for bugs might distract us from a crucial task in the early stages of testing: a search for benefits and features that will help us to develop better ideas about value, risk, and coverage, and will inform deeper and more thoughtful testing.

    Note that any pattern of familiar problems must eventually reduce to one of the consistency heuristics; if it was a problem before, it was because the system was inconsistent with some oracle principle.

Standards was the first of the new heuristics that we noticed; then Familiar problems. The latter threatened our mnemonic! For a while, I folded Standards in with Statutes, suggesting that people memorize HICCUPPS(F), with that inconsistent F coming at the end. But since we’ve added Explainability and World, we can now put F at the beginning, emphasizing the reality that testers often start looking for problems by looking for familiar problems. So, the new mnemonic: (F)EW HICCUPPS. When we’re testing, actively seeking problems in a product, it’s because we desire… FEW HICCUPPS.

This isn’t an exhaustive list. Even if we were silly enough to think that we had an exhaustive list of consistency principles, we wouldn’t be able to prove it exhaustive. For that reason, we encourage testers to develop their own models of testing, including the models of consistency that inform our oracles.

This article was first published 2012-07-23. I made a few minor edits on 2016-12-18, and a few more on 2017-01-26.

Oracles and The Right Answer

Tuesday, May 8th, 2012

In which the conversation about heuristics and oracles continues…

Tony’s brow furrowed as he spoke. “No oracle comes with a guarantee that it’s giving you the right answer. That’s what you said. But surely there are some oracles that are reliable,” he said. “What about pure math?”

“Pure math? All right. Here’s an example: what’s 61 plus 45?”

“Duh. 106.”

“Well,” I said, “for many computer systems prior to the year 2000, if you added 45 to the year 61, you’d get 6. That is, if you looked at a printout or a screen, you’d expect to see “06” in the year field. And for those systems, that would have been the right answer.”

“But that was wrong! Y2K was a problem. They called it ‘the Y2K problem’, didn’t they?”

“True,” I said. “But until the late ’90s, it wasn’t a problem—or to be more accurate, people didn’t perceive it as a problem. On the contrary, it was a solution to a problem: memory and storage were expensive. You could work around the “problem” with a combination of clever code and trust that people would interpret the output appropriately. Remember, a problem is a problem to some person at some time. Programmers and designers in the 1960s had one set of problems to solve, and programmers at the end of the ’90s had another set. The point is that one oracle (regular math) would give you one right answer, and another oracle (what the programmers and designers wanted) would give you another. Listen: no oracle can give you the right answer. An oracle can give you a right answer—a plausible answer that might be right for its context. But changing the context can flip that right answer into a wrong one—or a wrong answer into a right one.”
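(An aside for readers who like to see the arithmetic. This is a minimal Python sketch of the two oracles in that exchange; the function names are made up for illustration, and the two-digit version assumes a year field that keeps only the last two digits.)

```python
def full_year_sum(year, offset):
    """The 'regular math' oracle: plain integer addition."""
    return year + offset


def two_digit_year_sum(year, offset):
    """The 'two-digit year field' oracle: keep only the last two digits."""
    return f"{(year + offset) % 100:02d}"


print(full_year_sum(61, 45))       # 106
print(two_digit_year_sum(61, 45))  # 06
```

Both functions are consistent with an oracle. Which answer counts as right depends on which oracle matters to the people involved at the time.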

“Oracles are heuristic,” I continued. “There’s this terrific book, Discussion of the Method, by Billy Vaughan Koen. He’s an engineer, but he’s also a philosopher of engineering. In the book, he makes the argument that all decision-making, all problem-solving is heuristic.”

Tony looked quizzical. “Wait… Even algorithms? ‘Algorithm’ is the opposite of ‘heuristic’—didn’t you say that?”

“Not exactly. Algorithms are robust; they tend to produce very reliable results. But Koen says that even algorithms are heuristic. After all, if you apply an algorithm in the wrong way, to solve the wrong problem, or in the wrong context, it will fail.”

“Aaargh,” Tony said. “Where does that leave us? How can we ever know when a program’s correct?”

“That’s the interesting part,” I said. “We can’t. A program can appear to be working in all kinds of ways, but the program and your oracles can fool you. Think of a calculator program. Yep: one plus one gives the answer ‘2’. That looks correct to you, right?”


“And yet if the calculator is in binary mode, the answer should be ‘10’. You might be applying the wrong oracle for a given problem. Even if the program isn’t in binary mode and ‘2’ is right, the program could be tying up the processor so your machine is unusable. Or the program gives you the right answer—in white text on a white background. Or the program clobbers the contents of the clipboard. And you don’t notice these things unless you’re looking for them, or unless you happen to notice them. That is, there might be a problem for which you don’t have an oracle.”
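(Another small aside, sketched in Python. The display function and its mode names are hypothetical; they stand in for whatever the calculator actually does.)

```python
def display(value, mode="decimal"):
    """Render a result the way a hypothetical calculator display might."""
    if mode == "binary":
        return format(value, "b")
    return str(value)


print(display(1 + 1))            # 2
print(display(1 + 1, "binary"))  # 10
```

A check that expects "2" passes in one mode and fails in the other, and neither check says anything about processor load, text colour, or the clipboard.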

“So nothing can tell us that a program’s working right? We can’t ever tell whether a program is giving us the right answer?” Tony asked doubtfully. “That doesn’t sound… right.”

“Working right, yes, but only in the sense that it appears to be fulfilling some requirement to some degree. A right answer, yes, but the right answer only in context, and not a complete answer. Correctness is a human notion, and things are only correct in some context. As testers, we can’t know for sure the deep truth about any observation. Any right answer that we see in computer software is only right for now, this time, for some purpose, on this machine. We can’t reliably project our observations into the future. We can use an oracle to give us a strong inference that the answer will be the same next time, but we don’t get a guarantee. What we see might be right based on what we’re observing, but there’s all this stuff that we’re not observing, too. Cem Kaner and Doug Hoffman describe that stuff really thoroughly. You’ve heard that complete testing is impossible, right?”

“Of course.”

“Well, part of that is the coverage problem; we can’t test every possible input to a program in a finite amount of time. But part of it is the oracle problem, too. We can’t see a problem unless we have an oracle for that problem: that is, a principle or mechanism for recognizing that problem. All our oracles are heuristic, fallible—and in software, the potential for problems is limitless.”

“So how do we get around that?” Tony asked.

“The first thing is to recognize that oracles don’t give us the right answer, but every oracle may be able to point us to some problem. Over the years that we’ve studied oracles, we’ve come up with a bunch of principles and mechanisms for them, and we keep discovering more. Since there are infinite numbers of possible problems, we need a wide variety and diversity of oracles to spot them. But there is one principle that seems to prevail overall.”


“It seems to us that oracles are founded on the idea of consistency.”

to be continued…

All Oracles Are Heuristic

Wednesday, April 25th, 2012

In which the conversation about heuristics and oracles continues…

“So what’s the difference,” I asked my tester friend Tony, “between an oracle and a heuristic?”

“Hmm. Well, I’ve read the Rapid Testing stuff, and you and James keep saying an oracle is a principle or mechanism by which we recognize a problem.”

“Yes,” I said. “That’s what we call an oracle. What’s the difference between that and a heuristic?”

“An oracle helps us recognize a problem, but it’s not a method for solving a problem, or for making a decision.” He suddenly paused.

“Wait,” he said. “There’s that question you say testers should always be asking—Is there a problem here? An oracle does help us make a decision: it helps us to decide whether there’s a problem in the product we’re testing. And oracles can fail, too. So an oracle’s not different from a heuristic; an oracle is a heuristic. They’re the same.”

“Okay,” I said. “But that’s like saying ‘an iPhone isn’t different from a smartphone; an iPhone is a smartphone. They’re the same.’”

“But? But what? What’s the problem with that? Aren’t all iPhones smartphones?”

“Well, I’d say so,” I replied. “But let me ask you: are all smartphones iPhones?”

He paused for a second. “Oooh. Oracles are heuristic, but not all heuristics are oracles. An oracle is a heuristic, but it’s a specific kind of heuristic. Okay, let me see if I’ve got this: tossing a coin is a heuristic for making a decision. A heuristic approach for making a decision, I mean. You’d use the Coin Toss heuristic in some contexts—random decisions, or unimportant decisions, or… or intractable decisions, or decisions that you want to be fair. The approach can fail. It might not be a fair coin. Or it might be a high-stakes decision that shouldn’t be left to chance. So the Coin Toss heuristic might work, but it can fail.”

“Right,” I said. “Tossing a coin is a heuristic approach for making a decision.”

“But it’s not an oracle,” Tony said, “because tossing a coin doesn’t help us to recognize a problem. So tossing a coin is a heuristic, but it’s not an oracle.”

“All right. What does an oracle do for us?”

Tony started confidently. “An oracle is something that gives us the right answer, so that we can compare it to the result the product gives us. If there’s a difference between the oracle’s answer and the product’s result, there’s a problem. If the product’s answer is the same as the oracle’s answer, then there’s no problem.”

“Are you sure about that?” I asked. “Is a specification an oracle?”

“Yes. The specification tells us how the product is supposed to behave.”

“And how reliable are the specifications where you work?”

Tony paused, and then he grinned. “Okay. They suck, to be honest with you,” he said. “They’re ambiguous. They’re unclear. They’re incomplete; they usually miss a bunch of requirements. They contradict each other, sometimes on the same page. So we have to talk about them a lot to clear them up—and then when we sort things out, the job of updating the written spec usually gets left for last, if it ever happens at all.”

“Still,” I said, “if you see an inconsistency between the spec and the product, you at least suspect a problem, don’t you?”

“Well, yeah. When the spec and the product disagree, there’s usually a problem somewhere—either with the product, or with the spec. Or both. When we’re not sure, the program manager is usually the one who clears things up. Sometimes the programmers fix the product. Sometimes the product turns out to be right, and it’s the spec that’s wrong—but then we know at least the BAs ought to fix the spec, even if they don’t get around to it right away.”

“So if you use a specification as an oracle, it’s somewhat reliable, but it’s not guaranteed to be right. What does that sound like?”

He paused again. “It’s a heuristic. An oracle is a special kind of heuristic. An oracle is a heuristic principle or mechanism by which we recognize a problem.”

“That’s the way I like to say it these days, yes,” I replied. “For one thing, having the word ‘heuristic’ in the definition of ‘oracle’ seems to help people recognize that there’s some kind of distinction to be made between heuristics and oracles. But for another, I think it’s important to emphasize that oracles help us to learn things. And that, since they’re heuristics, oracles are fallible and context-dependent. No oracle comes with a guarantee that it’s giving you the right answer. An oracle can only point you to a possible problem.”

Tony’s brow furrowed again.

To be continued…

Heuristics for Understanding Heuristics

Friday, April 20th, 2012

This conversation is fictitious, but it’s also representative of several chats that I’ve had with testers over the last few weeks.

Tony, a tester friend, approached me recently, and told me that he was having trouble understanding heuristics and oracles. I have a heuristic approach for solving the problem of people not understanding a word:

Give ’em a definition.

So, I told him:

A heuristic is a fallible method for solving a problem or making a decision.

After I tried the “Give ’em a definition” heuristic, I tested to see if Tony seemed to understand. His eyes were a little glazed over. I applied a heuristic for deciding whether he got it:

When someone’s eyes glaze over, they don’t get it.

Heuristics aren’t guaranteed to work. For example, sometimes the general “Give ’em a definition” heuristic solves the problem of people not understanding something, and sometimes it doesn’t. In the latter case, I apply another heuristic:

Give ’em an explanation.

So I told him:

“When you know how to solve a problem, you might follow a rule. When you’re not so sure about how to solve the problem, following a rule won’t help you. Not knowing how to solve a problem means not knowing which rule to apply, or whether there’s a rule at all. When you’re in uncertain conditions, or dealing with imperfect or incomplete information, you apply heuristics—methods that might work, or that might fail.

“As an adjective, ‘heuristic’ means ‘serving to discover’ or ‘helping to learn’. When Archimedes realized that things that sink displace their volume of water, and things that float displace their mass, he ran naked through the streets of Syracuse yelling, ‘Eureka!’ or ‘I’ve discovered it!’ ‘Eureka’ and ‘heuristic’ come from the same root word in Greek.”

Tony was listening thoughtfully, but his brow was still furrowed. So I applied another teaching heuristic:

Give ’em something to compare.

I said, “Here’s one way of understanding heuristics: compare ‘heuristic’ with ‘algorithm’. An algorithm is a method for solving a problem that’s guaranteed to have a right answer. So an algorithm is like a rule that you follow; a heuristic is like a rule of thumb that you apply. Rules of thumb usually work, but not always.”

Sometimes providing a comparable idea solves the problem of understanding something, and sometimes it doesn’t. Tony nodded, but still looked a little puzzled. I wasn’t sure I had solved the problem, so I applied a new heuristic:

Point ’em to a book.

I suggested that he read George Polya’s book How to Solve It. “In that book, Polya presents a set of ideas and questions you can ask yourself that can help you to solve math problems.”

“Wait… I thought you always solved math problems with algorithms,” Tony said.

“That’s when you know how to solve the problem. When you don’t, Polya’s suggestions—heuristics—can get you started. They don’t always work, but they tend to be pretty powerful, and when one doesn’t work, you try another one. You never know which questions or ideas will help you solve the problem most quickly. So you practice this cycle: apply a heuristic, and if you’re still stuck, try another one. After a while, you develop judgement and skill, which is what you need to apply heuristics well. Polya talks about that a lot. He also emphasizes just how much heuristics are fallible and context-dependent.”

Mind you, neither Tony nor I had a copy of Polya’s book right handy, and Tony wanted to understand “heuristics” better now. The “point ’em to a book” heuristic had failed this time, even though it might have worked in a different context. So I tried yet another heuristic to solve the problem:

Point ’em to another book.

I suggested that he read Gut Feelings by Gerd Gigerenzer. “In that book, Gigerenzer emphasizes that heuristics tend to be fast and frugal (that is, quick and inexpensive). That’s important, he says: humans need heuristics because they’re typically dealing with bounded rationality.”

Uh-oh. Tony’s eyes had glazed over again at the mention of “bounded rationality”. So I applied a heuristic:

Even when it’s a deep concept, a fast and frugal explanation might do.

After all, Polya says that a heuristic isn’t intended to be perfect. Instead, heuristics are provisional and context-dependent. So in order to provide a quick understanding of “bounded rationality”, I said, “In a nutshell, bounded rationality is a situation when you have incomplete knowledge, imperfect understanding, and limited time.”

He grinned, and said, “What, like when you’re testing? Like most of the time in life?”

“Yes. Billy Vaughn Koen, in another book, Discussion of the Method, says that the engineering method is ‘to cause the best change in a poorly understood situation within the available resources.’”

“So he’s saying that engineers apply heuristics?” Tony asked. “I guess that makes sense, since engineers solve problems in ways that usually work, but sometimes there are failures.”

He seemed to be getting it. But I wanted to test that, so I applied a heuristic for making the decision, “Does he get it?”

Ask the student to provide an example.

So I said, “I think you might have it. But can you provide me with an example of a heuristic?”

He said, “Okay. I think so.” He paused. “Here’s a heuristic for solving the problem of opening a door: ‘Pull on the handle; push on the plate.’ That’s what you do when you get to a door, right? It’s a heuristic that usually works. Well… it might fail. It could be one of those annoying doors that have handles on both sides, where you have to push the handle or pull the handle to open the door. It might be one of those doors that opens both ways, like the doors for restaurant kitchens, so there’s no handle. The door might not even have a handle or a plate; it might have a knob. In that case, you apply another heuristic: ‘Turn the knob’. That’s a solution for the problem of opening a door that doesn’t have a handle or a plate. But that heuristic might fail too. The door might be locked, even though the knob turns. It might be one of those fancy doors that have dead-bolt locks and knobs that don’t turn. It might not have a knob at all; it might have one of those old-fashioned latches. So none of those heuristics guarantees a solution, but each one might help to solve the problem of getting through the door.”
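As a rough illustration, Tony’s door example could be sketched in Python. Everything here is an assumption made up for the sketch: the dictionary that stands in for a door, its keys (`hardware`, `opens_by`, `locked`), and the function names. The point is only that each heuristic may fail, and that when one fails you try another, with no guarantee that any of them works.

```python
from typing import Callable, List, Optional

# Hypothetical model of a door: the hardware you can see and how it really opens.
Door = dict

def pull_handle(door: Door) -> bool:
    # Usually works when there is a handle, but some handled doors must be pushed.
    return door.get("hardware") == "handle" and door.get("opens_by") == "pull"

def push_plate(door: Door) -> bool:
    # Usually works when there is a plate.
    return door.get("hardware") == "plate" and door.get("opens_by") == "push"

def turn_knob(door: Door) -> bool:
    # Works for a door with a knob, unless the door is locked.
    return door.get("hardware") == "knob" and not door.get("locked", False)

HEURISTICS: List[Callable[[Door], bool]] = [pull_handle, push_plate, turn_knob]

def try_to_open(door: Door) -> Optional[str]:
    """Try each heuristic in turn; return the name of the first one that
    works, or None if every heuristic fails."""
    for heuristic in HEURISTICS:
        if heuristic(door):
            return heuristic.__name__
    return None  # no heuristic guaranteed a solution

if __name__ == "__main__":
    print(try_to_open({"hardware": "handle", "opens_by": "pull"}))
    print(try_to_open({"hardware": "handle", "opens_by": "push"}))
    print(try_to_open({"hardware": "knob", "locked": True}))
```

The second and third calls describe the annoying doors Tony mentions: a handle that has to be pushed, and a knob on a locked door. In both cases every heuristic in the list fails, and the sketch reports that rather than pretending to have an answer.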

“Great! I think you’ve got it.”

“To be precise about it,” he said, “you can’t be sure, so you’re applying heuristics that help you to make the decision that I get it.”

I laughed. “Right. So what’s the difference,” I asked, “between an oracle and a heuristic?”

He paused.

(to be continued…)

Should Testers Play Planning Poker?

Wednesday, October 26th, 2011

My colleague and friend Eric Jacobson, who recently (as I write) did a bang-up job on his first conference presentation at STAR West 2011, asks a question in response to this blog post from 2006. (I like it when people reflect on an issue for a few years.) Eric asks:

You are suggesting it may not make sense for testers to give time-based estimates to their teams, but what about relative estimates? Let’s say a Rapid Software Tester is asked to participate in Planning Poker (relative-based story estimation) on an Agile Scrum team. I’ve always considered this a golden opportunity. Are you suggesting said tester may want to refuse to participate in the Planning Poker?

Having observed Planning Poker in action, I’m conflicted. Estimating anything is always a bit of a dodgy business, even at the best of times. That’s especially true for investigation and in particular for discovery. (I’ve written about some of the problems with estimation here and in subsequent posts, and with how those problems pertain to testing here.) Yet Planning Poker may be one way to get a good deal closer to the best of times. I like the idea of testers hearing what’s going on in planning sessions, and of offering perspective on the possible implications of work or change. On the other hand, at Planning Poker sessions I’ve observed or participated in, testers are often pressured to lower their numbers. In an environment where there’s trust, there tends to be much less pressure; in an environment where there’s less trust, I’d take pressure to lower the estimate as a test result with several possible interpretations. (I leave those interpretations as an exercise for the reader, but don’t stop until you get to five, at least.)

In any case, some fundamental problems remain: First, testing is oriented towards discovering things, not building things. At the root of it all, any estimate of how long it will take to test something is like estimating how long it will take you to evaluate someone’s ability to speak Spanish (which I wrote about here), and discovering problems in their ability to express themselves. If you already know something or can reasonably anticipate it, that helps a lot, and the Planning Poker approach (among many others) can help with that to some degree.

The second problem is that there’s not necessarily symmetry between the effort in creating something and the effort in testing it. A function or feature that takes very little effort to program might take an enormous amount of effort to test. What kinds of variation could we put into data, workflow, timing, platform dependencies and interactions, scenarios, and so forth? Meanwhile, a feature that takes significant amounts of programming effort could take almost no time to test (since “programming effort” could include an enormous amount of testing effort). There are dozens of factors involved, including the amount of testing the programmers do as they code; what kind of review is being done; what the scope of the change is; when particular discoveries get made (during “development time” or “testing time”); the skill of the parties involved; the testability of the product under test; how buggy the finished feature is (in which case there will be more time needed for investigation and reporting)… Planning Poker doesn’t solve the asymmetry problem, but it provides a venue for discussing it and getting started on sorting it out.

The third problem, closely related to the second, is this idea that all testing work associated with developing something must and shall happen within the same iteration. Testing never ends; it only stops. So it’s folly to think that all testing for a given amount of programming work can always fit into the same iteration in which the work is done. I’d argue that we need a more nuanced perspective and more options than that. The decision as to how much testing we’ll need is informed by many factors. Paradoxically, we’ll need some testing to help reveal and inform our notions of how much testing we’ll need.

I understand the desire to close the book on a development story within the sprint. I often—even usually—share that desire. Yet many kinds of testing work must respond to development work, and in such cases the development work has to be complete in some lesser sense than “fully tested”. Many kinds of confirmatory checking work, it seems to me, can be done within the same sprint as the programming work; no problem there. Yet it seems to me that other kinds of testing can reasonably wait for subsequent sprints—indeed, must wait for subsequent sprints, unless we’d like to have programmers stop all programming work altogether after a certain day in the sprint. Let me give you an example: in big banks, some kinds of transactions take several days to wend their way through batch processes that are run overnight. The testing work associated with that can be simulated, for sure (indeed, one would hope that most of such work would be simulated), but only at the expense of some loss of realism. For the test, whether the realism is important or not is always an open question with a fallible answer. Instead of making sure that there’s NO testing debt, consider reasonable, small, and sustainable amounts of testing debt that spans iterations. Agile can be about actual agility, instead of dogma.

So… If playing Planning Poker is part of the context, go for it. It’s a heuristic approach to getting people to consider testing more consciously and thoughtfully, and there’s something to that. It’s oriented towards estimating things in a more comprehensible time frame, and in digestible chunks of task and effort. Planning Poker is fallible, and one approach among many possible approaches. Like everything else, its usefulness depends largely on the people using it, and how they use it.