Blog Posts for the ‘Heuristics’ Category

Expected Results

Sunday, August 23rd, 2020

Klára Jánová is a dedicated tester who studies and practices and advocates Rapid Software Testing. Recently, on LinkedIn, she said:

I might EXPECT something to happen. But that doesn’t necessarily mean that I WANT IT/DESIRE for IT to happen. I even may want it to happen, but it not happening doesn’t have to automatically mean that there’s a problem.

The point of this post: no more “expected results” in the bug reports, please!

In reply, Derek Charles asked:

Then how else would you communicate to the developer or the team what is SUPPOSED to happen? I think that expected results are very necessary especially when regressions are found during testing.

Klára replied:

I suggest to describe the behavior that the tester recognizes as problematic and explain WHY it might be a problem for someone—the reasoning why the behavior is perceived as a bug—that’s what really matters.

Exactly so. Klára is referring here to problems and oracles—means by which we recognize problems when we encounter them in testing.

There’s an issue with the “what is supposed to happen” stuff: in development work, what is supposed to happen is not always entirely clear. Moreover, and more importantly, since testers don’t run the project or the business, we don’t mandate what is supposed to happen.

For instance, while testing, I may observe something in the product that I find confusing, or surprising, or wrong. When I look up the intended behaviour in the specification, it says one thing; the developer, claiming that the spec is out of date, contradicts it; and the product owner confirms that the spec is outdated. But she also says that the developer’s interpretation of what should happen is not what she wants him to implement. And then, when I consult an RFC, the product owner’s interpretation is inconsistent with what the RFC says should be the appropriate behaviour.

Fortunately, I don’t have to decide, and I don’t have to say what should happen. My job as a tester is to report on an apparent inconsistency between the product and presumably desirable things, or between the product and someone’s expressed desire or requirement. In the case above, I let the product owner know about the inconsistency between her interpretation and the standard, and she makes the call on what she and the business want from the product.

That is, even though I have certain expectations, I might be wrong about them and about what I think should be. For instance, she might decide that our product is not going to support that standard. She might point out that the standard I’m considering has been superseded by a later one. In any case, what is supposed to happen gets decided not by me, but by the people who run things. That’s what they’re paid for. This is a good thing, not a bad thing.

But still, I’d like to honour Derek’s question: as testers, how should we report a problem without referring to “expected results”?

  • Instead of saying “expected result” and leaving it at that, we could say “inconsistent with the specification”.

    Inconsistency with the specification is a special case of a more general way of recognizing and describing a problem: inconsistency with claims. “Inconsistency with claims” is an oracle heuristic. (A heuristic is a fallible means for solving a problem; an oracle is a special kind of heuristic which, fallibly, helps you to solve the problem of identifying and describing a bug.) When a product is inconsistent with a claim that someone important makes about it, there’s likely a problem, either with the product or the claim. As a tester, I don’t have to decide which.

    The specification is a particular form of a claim that someone is making about what the product is like, or what it should be like. Claims can be made in design sessions, planning meetings, pair programming, hallway conversations, training workshops… Claims can be represented in help files, marketing materials, workflow diagrams, lookup tables, user manuals, whiteboard sketches, UML diagrams… Claims can also be represented in the code of an automated check, where someone has written code to compare the output of the product with an anticipated and presumably desirable result. Recognizing many sources of claims and inconsistencies with them makes us more powerful testers.

    Whatever relevant claim you’re referring to, having said “inconsistent with a claim” (and having identified the nature of the claim, and where or whom it comes from), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistent with how the product used to work”.

    Inconsistency with history is an oracle heuristic. After a change, the product might have a new bug in it. On the other hand, the product might have been wrong all along, and now it’s right. (This is an example of how oracles can mislead us or conflict with each other, which is why it’s a good idea to identify the oracles we’re applying in problem reports.) If you (or others) aren’t aware of why the desirable change was made, that’s a different kind of problem, but a problem nonetheless.

    Either way, having said “inconsistent with how the product used to work” (and having described that in terms of a problem), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistent with respect to the product itself”.

    Inconsistency within the product is an oracle heuristic. This can take a number of forms: the product might return inconsistent results from one run to the next; the product could afford a tidy, smooth interface in one place, and a frustrating, confusing interface in another; the product could present output very precisely in one part of the product, and imprecisely in another; one component in the product could log output using one format, while another component’s log output is in a different format, which makes analysis more difficult…

    The inconsistency might be undesirable (because of a reliability problem), or it might be completely desirable (a Web page for a newspaper should change from day to day), or it might be desirable or undesirable in ways that you’re not aware of (since, like me, you probably don’t know everything).

    In general, people tend to prefer things that present themselves in a consistent way. Here’s a trivial example from Microsoft Office (Office 365, these days): to search for text in Word, the keyboard command is Ctrl-F. In Outlook, part of the same product suite, Ctrl-F triggers the Forward Message action instead; F4 triggers a search. Had Outlook and Word been designed by the same teams at the same time, this probably would have been identified as a bug, and addressed. In the end, the Office suite’s program managers decided that consistency with history mattered more than consistency within the product, and now we all have to live with that. Oh well.

    In any case, having said “inconsistent with respect to some aspect of the same product” (and having identified the specifics of the inconsistency), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistency with a comparable product” (and identify the product, and the nature of the inconsistency).

    Inconsistency with a comparable product is an oracle heuristic. Any product (something that someone has produced) that provides a relevant point of comparison is, by definition, a comparable product. That includes competitive products, of course; Microsoft Word and Google Docs are comparable products, in that sense. Microsoft Word and WordPad are comparable products too; they have many features in common. If Word can’t open an .RTF file generated by WordPad, we have reason to suspect a problem in one product or the other. If WordPad prints an RTF file properly, and Word does not, we have reason to suspect a problem in Word.

    Is the Unix program wc (wc stands for “word count”) a comparable product to Microsoft Word? wc does little more than count lines, words, and characters in text files, so no, except… Word has a word-counting feature. If Word’s calculation for the number of words in a text file is inexplicably different from wc’s count, we have reason to suspect a problem in one product or the other. (There’s a sketch of a check along these lines just after this list.)

    Test tools and suites of automated output checks represent comparable products too. If the output from your product is inconsistent with the specified and desired results provided by your test tool, or with some data that it processes to produce such results, you have reason to suspect a problem somewhere.

    In any case, having said “inconsistent with a comparable product”, and having identified the product and the basis for comparison, you don’t need to say “expected result”.
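Here’s a minimal sketch, in Python, of what such a comparable-product check might look like. The function names, the input file, and the choice of wc as the comparison point are my own illustrative assumptions, not a prescription; the point is that the check names its oracle and reports an inconsistency rather than an “expected result”.

```python
# A minimal sketch, assuming a hypothetical word-counting feature in our own
# product, and a Unix-like system with wc installed. The check reports an
# inconsistency with a comparable product; it doesn't claim to know which
# program is right.

import subprocess

def our_word_count(path):
    # Hypothetical stand-in for the product's word-counting feature.
    with open(path, encoding="utf-8") as f:
        return len(f.read().split())

def wc_word_count(path):
    # wc -w prints the word count followed by the file name; take the number.
    completed = subprocess.run(["wc", "-w", path],
                               capture_output=True, text=True, check=True)
    return int(completed.stdout.split()[0])

def check_against_wc(path):
    ours, theirs = our_word_count(path), wc_word_count(path)
    if ours != theirs:
        print(f"Possible problem: our count ({ours}) for {path} "
              f"is inconsistent with wc's count ({theirs}).")

check_against_wc("sample.txt")  # hypothetical input file
```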

Those are just a few examples. When we teach Rapid Software Testing, we offer a set of oracle heuristics that identify principles of desirable (and undesirable) consistency (and inconsistency) for identifying bugs; you can read more about those here.

James Bach has recently identified another principle that might apply to bugs but that, in my view, more powerfully applies to enhancement requests: we desire the product to be consistent with acceptable quality: that is, not only good, but every bit as good as it can be.

Why is all this a big deal? Several reasons, I think.

First, “expected result” raises the question of where the expectation comes from. It’s just a middleman for something we could say more specifically. Why not get to the point and say it while at the same time sounding like a pro? Because…

Second, being specific about where the expectation comes from saves time and focuses conversation on the (un)desirable (in)consistencies that matter when developers and product owners are deciding whether something is a bug worth fixing. It also helps to direct repair to the appropriate place (for example, if the product is right and the spec is wrong, it’s a prompt to repair the spec).

Third, it helps for us to remember that our job as testers is not to confirm that the product works “as expected”, but to ask “is there a problem here?” A product can fulfill an expectation and nonetheless have terrible problems. It’s our job to seek and find and describe inconsistencies and problems that matter before it’s too late.

And finally…

Fourth, speaking in terms of an oracle instead of an “expected result” can help to avoid patronizing, condescending, time-wasting, and obvious elements of bug reports that cause developers to feel insulted or to roll their eyes.

Actual result: Product crashes.

Expected result: Product does not crash.

Don’t be that tester.

Further reading:

Not-So-Great Expectations
Oracles From the Inside Out
FEW HICCUPPS

Want to learn how to observe, analyze, and investigate software? Want to learn how to talk more clearly about testing with your clients and colleagues? Rapid Software Testing Explored, presented by me and set up for the daytime in North America and evenings in Europe and the UK, November 9-12. James Bach will be teaching Rapid Software Testing Managed November 17-20, and a flight of Rapid Software Testing Explored from December 8-11. There are also classes of Rapid Software Testing Applied coming up. See the full schedule, with links to register here.

It’s Not A Factory

Tuesday, April 19th, 2016

One model for a software development project is the assembly line on the factory floor, where we’re making a buhzillion copies of the same thing. And it’s a lousy model.

Software is developed in an architectural studio with people in it. There are drafting tables, drawing instruments, good lighting, pens and pencils and paper. And erasers, and garbage cans that get full of coffee cups and crumpled drawings. Good ideas become better ideas as they are sketched, analysed, criticised, and revised. A lot of bad ideas are discovered and rejected before the final plans are drawn.

Software is developed in a rehearsal hall with people in it. The room is also filled with risers and chairs and other temporary staging elements, and with substitute props that stand in for the finished products. There’s a piano to accompany the singers while the orchestra is being rehearsed in another hall. Lighting, sound, costumes and makeup are designed and folded into the rehearsal process as we experiment with different ways of bringing the show to life. Everyone tries stuff that doesn’t work, or doesn’t fit, or doesn’t sound right, or doesn’t look good at first. Frustration arises, feelings get bruised, and then breakthroughs happen and problems get solved. Lots of experiments lead to that joyful and successful opening night.

Software is developed in a workshop with people in it; skilled craftspeople who build tools and workspaces for themselves and each other, as part of the process of crafting products for people to buy. Even though they try to keep the shop clean, there’s occasional sawdust and smoke and spilled glue and broken machinery. Work in progress gets tested, and weaknesses are exposed—sometimes late in the game—and get fixed.

In all of these places, variation is encouraged. Designs are tinkered with. Discoveries are celebrated. Learning happens. Most importantly, skill and tacit knowledge are both applied and developed.

The Lean model for software development might seem a more humane step forward from the older days, but it’s still based on the factory. Ideas aren’t widgets whose delivery you can schedule just in time. Failed experiments aren’t waste when you learn from them, and if you know it won’t be waste from the outset, it’s not really an experiment. Everything that makes it into the product should represent something that the customer values, but when we’re creating something novel (which we’re always doing to some degree as we’re building software), we’re exploring and trying things out to help refine our understanding of what the customer actually values.

If there is any parallel between software and manufacturing, it is this: the “software development” part of manufacturing happens before the assembly line—in the design studio, where the prototypes are being developed, refined, and selected for mass production. The manufacturing part? That’s the copy command that deploys a copy of the installation package to all the machines in the enterprise, or the disk duplicator that stamps out a million DVDs with copies of the golden master on it, or the Web server that delivers a copy of the product to anyone who requests it. Getting to that first copy, though? That’s a studio thing, not an assembly-line thing.

The primary inspiration for this post is a conversation I had with Cem Kaner in 2008. Another is the book Artful Making by Robert Austin and Lee Devin, which I first read around the same time. Yet another is Christopher Alexander’s A Pattern Language. One more: my long-ago career in theatre, which prepared me better than you can imagine for a life in software development.

100% Coverage is Possible

Saturday, April 16th, 2016

In testing, what does “100% coverage” mean? 100% of what, specifically?

Some people might say that “100% coverage” could refer to lines of code, or branches within the code, or the conditions associated with the branches. That’s fine, but saying “100% of the lines (or branches, or conditions) in the program were executed” doesn’t tell us anything about whether those lines were good or bad, useful or useless.
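To make that concrete, here’s a trivial sketch (mine, not from the original post): a check that executes every line of a small function, and so earns “100% line coverage” of it, while saying nothing about a problem that’s easy to trigger.

```python
def average(values):
    # Every line of this function is executed by the check below,
    # yet the check is silent about what happens with an empty list.
    total = sum(values)
    return total / len(values)

def check_average():
    assert average([2, 4, 6]) == 4  # passes; 100% of average()'s lines are now "covered"

check_average()
# average([]) still raises ZeroDivisionError; "100% line coverage" said nothing about it.
```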

“100% code coverage” doesn’t tell us anything about what the programmers intended, what the user desired, or what the tester observed. It says nothing about the tester’s engagement with the testing; whether the tester was asleep or awake. It ignores the oracles that the tester applied; how the tester recognized—or failed to recognize—bugs and other problems that were encountered during the testing. It suggests that some machinery processed something; nothing more.

Here’s a potentially helpful way to think about this:

“X coverage is how thoroughly we have examined the product with respect to some model of X”.

So: risk coverage is how thoroughly we have examined the product with respect to some model of risk; requirements coverage is how thoroughly we have examined the product with respect to some model of requirements; code coverage is how thoroughly we have examined the product with respect to some model of code.

To claim 100% coverage is essentially the same as saying “We’ve looked for bugs everywhere!” For a skilled tester, any “100%” claim about coverage should prompt critical thinking: “How much” compared to what? 100% of what, specifically? Some model of X—which one? Whose model? How well does the “model of X” model reality? What does the model of X leave out of the universe of possible ways of thinking about X? And what non-X things should we also be considering when we’re testing?

Here’s just one example: code coverage is usually described in terms of the code that we’ve written, or that we have available to evaluate. Yet every program we write interacts with some platform that might include third-party libraries, browsers, plug-ins, operating systems, file systems, firmware. Our code might interact with our own libraries that we haven’t instrumented this time. So “code coverage” refers to some code in the system, but not all the code in the system.

Once I did a test (or was it 10,000 tests?) wherein I used an automated check to run through all 10,000 possible settings of a particular variable. That was 100% coverage of that variable being used in a particular moment in the execution of the system, on that day.

But it was not 100% of all the possible sequences of those settings, nor 100% of the possible subsequent paths through the product. It wasn’t 100% of the possible variations in pacing, or system load, or times of day when the system could be used. That test wasn’t representative of all of the possible stakeholders who might be using that variable, nor how they might use it.
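For what it’s worth, here’s a rough sketch, with invented stand-in functions, of the sort of check I’m describing. It covers 100% of one variable’s settings, once, in one sequence, under one set of conditions, and nothing more.

```python
def product_result(setting):
    return setting * 2  # stand-in for the call to the product under test

def reference_result(setting):
    return setting * 2  # stand-in for whatever oracle the check compares against

for setting in range(10_000):
    if product_result(setting) != reference_result(setting):
        print(f"Possible problem at setting {setting}")

# Nothing here varies the sequence of settings, the pacing, the system load,
# the time of day, or who is using the product and how.
```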

What would “100% requirements coverage” mean? Would it mean that every statement in the requirements document was covered by a test? If you think so, it might be worthwhile to consider all the models that are in play.

The requirements document is a model of the product’s requirements. It refers to ideas that have been explicitly expressed by some people, but not by all of the people who might have requirements for the product. The requirements document models what those people thought they wanted at a certain point, but not necessarily what they want now.
The requirements document doesn’t account for all of the ideas that people had that may have been tacit, or implicit, or latent.

You can subject “statement”, “covered”, and “test” to the same kind of treatment. A statement is a model of what someone is thinking at a given point in time; our notion of what “covered” means is governed by our models of coverage; our notion of “a test” is conditioned by our models of testing. It’s models all the way down.

Things in testing keep reminding me of a passage from Computer Programming Fundamentals by Herbert Leeds and Jerry Weinberg:

“One of the lessons to be learned … is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.”

Testing is an open investigation. 100% coverage of a particular factor may be possible—but that requires a model so constrained that we leave out practically everything else that might be important.

Test coverage, like quality, is not something that yields very well to quantitative measurements, except when we’re talking of very narrow and specific conditions. But we can discuss coverage, and ask questions about whether it’s what we want, whether we’re happy with it, or whether we want more.

Further reading:

Got You Covered http://developsense.com/articles/2008-09-GotYouCovered.pdf
Cover or Discover http://developsense.com/articles/2008-10-CoverOrDiscover.pdf
A Map by Any Other Name http://developsense.com/articles/2008-11-AMapByAnyOtherName.pdf
What Counts http://www.developsense.com/articles/2007-11-WhatCounts.pdf

Oracles Are About Problems, Not Correctness

Thursday, March 12th, 2015

As James Bach and I have been refining our ideas of testing, we’ve been refining our ideas about oracles. In a recent post, I referred to this passage:

Program testing involves the execution of a program over sample test data followed by analysis of the output. Different kinds of test output can be generated. It may consist of final values of program output variables or of intermediate traces of selected variables. It may also consist of timing information, as in real time systems.

The use of testing requires the existence of an external mechanism which can be used to check test output for correctness. This mechanism is referred to as the test oracle. Test oracles can take on different forms. They can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions.

—William E. Howden, A Survey of Dynamic Analysis Methods, in Software Validation and Testing Techniques, IEEE Computer Society, 1981

While we have a great deal of respect for the work of testing pioneers like Prof. Howden, there are some problems with this description of testing and its focus on correctness.

  • Correct output from a computer program is not an absolute; an outcome is only correct or incorrect relative to some model, theory, or principle. Trivial example: Even the mathematical rule “one divided by two equals one-half” is a heuristic for dividing things. In most domains, it’s true, but as in George Carlin’s joke, when you cut a crumb in two, you don’t have two half-crumbs; you have two crumbs.
  • A product can produce a result that is functionally correct, and yet still be deeply unsatisfactory to its user. Trivial example: a calculator returns the value “4” from the function “2 + 2”—and displays the result in white on a white background.
  • Conversely, a product can produce an incorrect result and still be quite acceptable. Trivial example: a computer desktop clock’s internal state and second hand drift a few tenths of a second each second, but the program resets itself to be consistent with an atomic clock at the top of every minute. The desktop clock almost never shows the right time precisely, but the human observer doesn’t notice and doesn’t really care. Another trivial example: a product might return a calculation inconsistent with its oracle in the tenth decimal place, when only the first two or three decimal places really matter.
  • The correct outcome of a program or function is not always known in advance. Some development and testing work, like some science, is done in an attempt to discover something new; to establish what a correct answer might look like; to explore a mathematical model; to learn about the limitations of a novel system. In such cases, our ideas of correctness or acceptability are not clear from the outset, and must be developed. (See Collins and Pinch’s The Golem books, which discuss the messiness and confusion of controversial science.) Trivial example: in benchmarking, correctness is not at issue. Comparison between one system and another (or versions of the same system at different times) is the mission of testing here.
  • As we’re developing and testing a product, we may observe things that are unexpected, under-described or completely undescribed. In order to program a machine to make an observation, we must anticipate that observation and encode it. The machine doesn’t imagine, invent, or learn, and a machine cannot produce an unanticipated oracle in response to an observation. By contrast, human observers continually learn and refine their ideas on what to observe. Sometimes we observe a problem without having anticipated it. Sometimes we become aware that we’re making a new observation—one that may or may not represent a problem. Distinct from checking, testing continually affords new things to observe. Testing prompts us to decide when new observations represent problems, and testing informs decisions about what to do about them.
  • An oracle may be in error, or irrelevant. Trivial examples: a program that checks the output of another program may have its own bugs. A reference document may be outdated. A subject matter expert who is usually a reliable source of information may have forgotten something.
  • Oracles might be inconsistent with each other. Even though we have some powerful models for it, temperature measurement in climatology is inherently uncertain. What is the “correct” temperature outdoors? In the sunlight? In the shade? When the thermometer is near a building or farther away? Over grass, or over pavement? Some of the issues are described in this remarkable article (read the comments, too).
  • Although we can demonstrate incorrectness in a program, we cannot prove a program to be correct. As Dijkstra put it, testing can only show the presence of errors, not their absence; and to go even deeper, Popper pointed out that theories can only be falsified, and not proven. Trivial example: No matter how many tests we run on that calculator, we can never know that it will always return 4 given the inputs 2 + 2; we can only infer that it will do so through induction, and induction can be deeply problematic. In Nassim Taleb’s example (cribbed from Bertrand Russell and David Hume), every day the turkey uses induction to reinforce his belief in the farmer’s devotion to the desires and interests of turkeys—until a few days before Thanksgiving, when the turkey receives a very sudden, unpleasant, and (alas for the turkey) momentary flash of insight.
  • Sometimes we don’t need to know the correct result to know that the observed result is wrong. Trivial example: the range of the cosine function runs from -1 to 1. I don’t need to know the correct value for cos(72) to know that an output of 4.2 is wrong. (Elaine Weyuker discusses this in “On Testing Nontestable Programs” (Department of Computer Science, Courant Institute of Mathematical Sciences, New York University): “Frequently the tester is able to state with assurance that a result is incorrect without actually knowing the correct answer.”) A sketch of this kind of partial oracle appears just below.
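Here is a minimal sketch of that last point, with a deliberately broken stand-in function (my invention, for illustration). A partial oracle can flag a result as impossible without knowing the correct answer.

```python
import math

def cosine_under_test(x):
    # Hypothetical stand-in for an implementation being tested;
    # deliberately broken for the sake of the illustration.
    return math.cos(x) + 5.2

result = cosine_under_test(72)
if not -1.0 <= result <= 1.0:
    # We still don't know the correct value of cos(72); we do know this result is wrong.
    print(f"Problem: cos(72) reported as {result:.1f}, outside the possible range [-1, 1].")
```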

Checking for correctness—especially when the test output is observed and evaluated mechanically or indirectly—is a risky business. All oracles are fallible. A “passing” test, based on comparison with a fallible oracle, cannot prove correctness, and no number of “passing” tests can do that. In this, a test is like a scientific experiment: an experiment’s outcome can falsify one theory while supporting another, but an experiment cannot prove a theory to be true. A million observations of white swans says nothing about the possibility that there might be black swans; a million passing tests, a million observations of correct behaviour cannot eliminate the possibility that there might be swarms of bugs. At best, a passing test is essentially the observation of one more white swan. We urge those who rely on passing acceptance tests to remember this.

A check can suggest the presence of a problem, or can at best provide support for the idea that the program can work. But no matter what oracle we might use, a test cannot prove that a program is working correctly, or that the program will work. So what can oracles actually do for us?

If we invert the focus on correctness, we can produce a more robust heuristic. We can’t logically use an oracle to prove that a system is behaving correctly or that it will behave correctly, but we can use an oracle to help falsify the theory that it is behaving correctly. This is why, in Rapid Software Testing, we say that an oracle is a means by which we recognize a problem when it happens during testing.

Very Short Blog Posts (21): You Had It Last!

Tuesday, November 4th, 2014

Sometimes testers say to me “My development team (or the support people, or the managers) keep saying that any bugs in the product are the testers’ fault. ‘It’s obvious that any bug in the product is the tester’s responsibility,’ they say, ‘since the tester had the product last.’ How do I answer them?”

Well, you could say that the product’s problems are the responsibility of the tester because the tester had the product last—and that a successful product was successful because the programmers and the business people did such a good job at preventing bugs. But that would be to explain any failures in the product in one way, and to explain any successes in the product in a completely different way.

The idea that testers are responsible for bugs because “the testers had it last” is something Jerry Weinberg once called the “Hot Potato Theory of Software Development”. This suggests a way around the problem—or at least a way to highlight how silly the theory is. If you’re a tester, send a message to the developers and the product managers noting that you’ve finished testing, and ask them to take one final look at the product before they decide to ship it. Then they had it last.

Instead, let’s be consistent. Testers don’t put the bugs in, and testers miss some of the bugs because bugs are, by their nature, hidden. Moreover, some bugs are hidden so well that not even the people who put them in could find them. The bugs are hidden by people, and by the consequences of how we choose to do software development.

So let’s all work to prevent the bugs, and to find them more quickly. Let’s talk about problems in development that allow bugs to hide. Let’s all work on testability, so that we can find bugs earlier, and more easily, before the bugs have a chance to hide deeply. And let’s all share responsibility for our failures and our successes.

This post, originally published in 2014, got a little update in December 2021.

Testing is…

Tuesday, October 28th, 2014

Every now and again, someone makes some statement about testing that I find highly questionable or indefensible, whereupon I might ask them what testing means to them. All too often, they’re at a loss to reply because they haven’t really thought deeply about the matter; or because they haven’t internalized what they’ve thought about; or because they’re unwilling to commit to any statement about testing. And then they say something vague or non-committal like “it depends” or “different things to different people” or “that’s a matter of context”, without suggesting relevant dependencies, people, or context factors.

So, for those people, I offer a set of answers from which they can choose one; or they can adopt the entire list wholesale; or they can use one or more items as a point of departure for something of their own invention. You don’t have to agree with any of these things; in that case, invent your own ideas about testing from whole cloth. But please: if you claim to be a tester, or if you are making some claim about testing, please prepare yourself and have some answer ready when someone asks you “what is testing?”. Please.

Here are some possible replies; I believe everything is Tweetable, or pretty close.

  • Testing is—among other things—reviewing the product and ideas and descriptions of it, looking for significant and relevant inconsistencies.
  • Testing is—among other things—experimenting with the product to find out how it may be having problems—which is not “breaking the product”, by the way.
  • Testing is—among other things—something that informs quality assurance, but is not in and of itself quality assurance.
  • Testing is—among other things—helping our clients to make empirically informed decisions about the product, project, or business.
  • Testing is—among other things—a process by which we systematically examine any aspect of the product with the goal of preventing surprises.
  • Testing is—among other things—a process of interacting with the product and its systems in many ways that challenge unwarranted optimism.
  • Testing is—among other things—observing and evaluating the product, to see where all those defect prevention ideas might have failed.
  • Testing is—among other things—a special part of the development process focused on discovering what could go badly (or what is going badly).
  • Testing is—among other things—exploring, discovering, investigating, learning, and reporting about the product to reveal new information.
  • Testing is—among other things—gathering information about the product, its users, and conditions of its use, to help defend value.
  • Testing is—among other things—raising questions to help teams to develop products that more quickly and easily reveal their own problems.
  • Testing is—among other things—helping programmers and the team to learn about unanticipated aspects of the product we’re developing.
  • Testing is—among other things—helping our clients to understand the product they’ve got so they can decide if it’s the product they want.
  • Testing is—among other things—using both tools and direct interaction with the product to question and evaluate its behaviours and states.
  • Testing is—among other things—exploring products deeply, imaginatively, and suspiciously, to help find problems that threaten value.
  • Testing is—among other things—performing actual and thought experiments on products and ideas to identify problems and risks.
  • Testing is—among other things—thinking critically and skeptically about products and ideas around them, with the goal of not being fooled.
  • Testing is—among other things—evaluating a product by learning about it through exploration, experimentation, observation and inference.

You’re welcome.

How Models Change

Saturday, July 19th, 2014

Like software products, models change as we test them, gain experience with them, find bugs in them, realize that features are missing. We see opportunities for improving them, and revise them.

A product coverage outline, in Rapid Testing parlance, is an artifact (a map, or list, or table…) that identifies the dimensions or elements of a product. It’s a kind of inventory of aspects of the product that could be tested. Many years ago, my colleague and co-author James Bach wrote an article on product elements, identifying Structure, Function, Data, Platform, and Operations (SFDPO; think “San Francisco DePOt”, he suggested) as a set of heuristic guidewords for creating or structuring or reviewing the highest levels of a coverage outline.

A few years later, I was working as a tester. While I was on that assignment, I missed a few test ideas and almost missed a few bugs that I might have noticed earlier had I thought of “Time” as another guideword for modeling the product. After some discussion, I persuaded James that Time was a worthy addition to the Product Elements list. (I wrote my own article on that, Time for New Test Ideas.)

Over the years, it seemed that people were excited by the idea of using SFDPOT as the starting point for a general coverage outline. Many people reported getting a lot of value out of it, so in my classes, I’ve placed more and more emphasis on using and practicing the application of that part of the Heuristic Test Strategy Model. One of the exercises involves creating a mind map for a real software product. I typically offer that one way to get started on creating a coverage outline is to walk through the user interface and enumerate each element of the UI in the mind map.

(Sometimes people ask, “Why bother? Don’t the specifications or the documentation or the Help file provide maps of the UI? What’s the point of making another one?” One answer is that the journey, rather than the map, is the point. We learn one set of things by reading about a product; we learn different things—and we typically learn more deeply—by touring the product, interacting with it, gaining experience with it, and organizing descriptions of what we’ve found. Moreover, at each moment, we may notice, infer, or wonder about things that the documentation doesn’t address. When we recognize something new, we can add it to our coverage model, our risk list, or our test ideas—plus we might recognize and note some bugs or issues along the way. Another answer is that we should treat anything that any documentation says about a product as a rumour until we’ve engaged with the product.)

One issue kept coming up in class: on the product coverage outline, where should the map of the user interface go? Under Functions (what the product does)? Or Operations (how people use the product)? Or Structure (the bits and pieces of the product)? My answer was that it doesn’t matter much where you put things on your coverage outline, as long as it fits for you and the people with whom you might be sharing the map. The idea is to identify things that could be tested, and not to miss important stuff.

After one class, I was on the phone with James, and I happened to mention that day’s discussion. “I prefer to put the UI under Structure,” I noted.

“What? That’s crazy talk! The UI goes under Functions!”

“What?” I replied. “That’s crazy talk. The UI isn’t Functions. Sure, it triggers functions. But it doesn’t perform those functions.”

“So what?” asked James. “If it’s how the user gets at functions, it fits under Functions just fine. What makes you think the UI goes under Structure?”

“Well, the UI has a structure. It’s… structural.”

“Everything has a structure,” said James. “The UI goes under Functions.”

And so we argued on. Then one of us—and I honestly don’t remember who—suggested that maybe the UI was important enough to be its own top-level product element. I do remember James pointing out that when we think of interfaces, plural, there might be several of them—not just the graphical user interface, but maybe a command-line interface. An application programming interface.

“Hmmm…,” I said. This reminded me of the four-user model mentioned in How to Break Software (human user, API user, operating system user, file system user). “Interfaces,” I said. “Operating system interface, file system interface, network interface, printer interface, debugging interface, other devices…”

“Right,” said James. “Plus there are those other interface-y things—importing and exporting stuff, for instance.”

“Aren’t those covered under ‘Functions’?”

“Sure. Or they might be, depending on how you think about it. But the point of this kind of model isn’t to be a template, or a form you fill out. It’s to help us reduce the chances that we might miss something important. Our models are leaky abstractions; overlaps are okay,” said James. Which, of course, was exactly the same argument I had used on him several years earlier when we had added Time to the model. Then he paused. “Ah! But we don’t want to break the mnemonic, do we? San Francisco DePOT.”

“We can deal with that. Just misspell ‘depot’ San Francisco DIPOT. SFDIPOT.”

And so we updated the model.

I wonder what it will look like five years from now.

Scenarios Ain’t Just Use Cases

Thursday, May 15th, 2014

How do people use a software product? Some development groups model use through use cases. Typically use cases are expressed in terms of the user performing a set of step-by-step behaviours: 1, then 2, then 3, then 4, then 5. In those groups, testers may create test cases that map directly onto the use cases. Sometimes, that gets called a scenario, and the testing of it is called a scenario test.

According to Cem Kaner, a scenario is a “hypothetical story, used to help a person think through a complex problem or system.” He also says that a scenario test has several characteristics: it is motivating, in that stakeholders would push to fix problems that the test revealed; credible, in that it not only could happen, but that things like it could probably happen; that it involves complexity in terms of use, environments, or data. (Read his paper on scenario testing here.)

Taking the steps directly from a use case and then calling it a scenario limits your view of what a scenario is, which in turn limits your testing. People do not do 1, 2, 3, 4, and 5 in real life. Instead, they

  • do 1
  • start 2
  • respond to one email, and delete a bunch of get-rich-quick offers
  • resume 2
  • take a phone call from the dog grooming studio; Fluffy will be ready at 4:30
  • realize they’ve lost track of what they were doing in 2
  • go back to 1
  • restart 2
  • look up some figures in Excel
  • place a pizza order for the lunchtime meeting
  • finish 2
  • go to 3
  • accidentally paste the pizza order into some field in 3
  • dismiss the error message, after a fruitless attempt to decipher what it means
  • finish 3
  • forget to save their work; thank heaven for the auto-save feature
  • get called to an all-hands meeting
  • return to find that the machine has entered sleep mode
  • wake up the machine
  • dismiss a dialog saying that it’s now safe to remove the device, even though they didn’t want to remove the device; the message is due to an operating-system bug related to sleep mode
  • discuss rumours raised from the all-hands meeting on Instant Messaging
  • start 4
  • realize they got something wrong in step 2
  • go back through 3 to 2
  • go to lunch
  • wake up the damned machine again
  • dismiss the damned dialog again
  • correct 2
  • go forward to 3
  • accept the values that were left there from (auto-)saving the first time through (but which due to the changes in 2 are now invalid)
  • go into 4
  • get confused about an element of the user interface in 4
  • realize something looks wrong because of the inappropriately saved value from 3
  • go back to 3
  • fix the values and save the page
  • go to 4
  • look away from the computer, notice there’s a new plant in the corner, under the picture—when did that get there, anyway?
  • complete 4
  • start 5
  • get invited for coffee
  • come back
  • wake up the damned machine again
  • dismiss the damned dialog again
  • worry irrationally that they didn’t complete 4
  • open the app in a second window to confirm that they have in fact completed 4, inadvertently jostling 4’s state
  • restart 5
  • take a phone call in the middle of trying to do 5; “Fluffy appears to be sick and could you show up half an hour earlier?”
  • change their minds about something in 4
  • go back and fix it
  • get tapped on the shoulder by the boss
  • start 5
  • almost finish 5
  • forget to save their work
  • program crashes; thank heaven for the auto-save feature
  • find out that auto-save mode doesn’t actually save every time.

If you want to show that the system can work, by all means check the system by following the procedure that the use case prescribes. That’s something we call sympathetic testing, and it’s a reasonable place to start testing; to learn about the feature; to find how people might derive value from the feature; to begin building your models of the product, and how there might be problems in it.

But if you want to discover problems that matter to people before those people find them, test the system by introducing lots of variation, pauses, distractions, concurrent actions, and galumphing.

Related post: Why We Do Scenario Testing

Harry Collins and The Motive for Distinctions

Monday, March 3rd, 2014

“Computers and their software are two things. As collections of interacting cogs they must be ‘checked’ to make sure there are no missing teeth and the wheels spin together nicely. Machines are also ‘social prostheses’, fitting into social life where a human once fitted. It is a characteristic of medical prostheses, like replacement hearts, that they do not do exactly the same job as the thing they replace; the surrounding body compensates.

“Contemporary computers cannot do just the same thing as humans because they do not fit into society as humans do, so the surrounding society must compensate for the way the computer fails to reproduce what it replaces. This means that a complex judgment is needed to test whether software fits well enough for the surrounding humans to happily ‘repair’ the differences between humans and machines. This is much more than a matter of deciding whether the cogs spin right.”

—Harry Collins

Harry Collins—sociologist of science, author, professor at Cardiff University, a researcher in the fields of the public understanding of science, the nature of expertise, and artificial intelligence—was slated to give a keynote speech at EuroSTAR 2013. Due to illness, he was unable to do so. The quote above is the abstract from the talk that Harry never gave. (The EuroSTAR community was very lucky and grateful to have his colleague, Rob Evans, step in at the last minute with his own terrific presentation.)

Since I was directed to Harry’s work in 2010 (thank you, Simon Schaffer), James Bach and I have been galvanized by it. As we’ve been trying to remind people for years, software testing is a complex, cognitive, social task that requires skill, tacit knowledge, and many kinds of expertise if we want people to do it well. Yet explaining testing is tricky, precisely because so much of what skilled testers do is tacit, and not explicit; learned by practice and by immersion in a culture, not from documents or other artifacts; not only mechanical and algorithmic, but heuristic and social.

Harry helps us by taking a scalpel to concepts and ideas that many people consider obvious or unimportant, and dissecting those ideas to reveal the subtle and crucial details under the surface.

As an example, in Tacit and Explicit Knowledge, he takes the idea of tacit knowledge—formerly, any kind of knowledge that was not told—and divides it into three kinds: relational, the kind of knowledge that resides in an individual human mind, and that in general could be told; somatic, resident in the system of a human body and a human mind; and collective, residing in society and in the ever-changing relationships between people in a culture.

How does that matter? Consider the Google car. On the surface, operating a car looks like a straightforward activity, easily made explicit in terms of the laws of physics and the rules of the road. Look deeper, and you’ll realize that driving is a social activity, and that interaction between drivers, cyclists, and other pedestrians is negotiated in real time, in different ways, all over the world.

So we’ve got Google cars on the road experimentally in California and Washington; how will they do in Beijing, in Bangalore, or in Rome? How will they interact with human drivers in each society? How will they know, as human drivers do, the extent to which it is socially acceptable to bend the rules—and socially unacceptable not to bend them?

In many respects, machinery can do far better than humans in the mechanical aspects of driving. Yet testing the Google car will require far more than unit checks or a Cucumber suite—it will require complex evaluation and judgement by human testers to see whether the machinery—with no awareness or understanding of social interactions, for the foreseeable future—can be accommodated by the surrounding culture.

That will require a shift from the way testing is done at Google according to some popular stories. If you want to find problems that matter to people before inflicting your product on them, you must test—not only the product in isolation, but in its relationships with other people.

In Rapid Software Testing, our goal all the way along has been to probe into the nature of testing and the way we talk about it, with the intention of empowering people to do it well. Part of this task involves taking relational tacit knowledge and making it explicit. Another part involves realizing that certain skills cannot be transferred by books or diagrams or video tutorials, but must be learned through experience and immersion in the task. Rather than hand-waving about “intuition” and “error guessing”, we’d prefer to talk about and study specific, observable, trainable, and manageable skills.

We could talk about “test automation” as though it were a single subject, but it’s more helpful to distinguish the many ways that we could use tools to support and amplify our testing—for checking specific facts or states, for generating data, for visualization, for modeling, for coverage analysis… Instead of talking about “automated testing” as though machines and people were capable of the same things, we’d rather distinguish between checking (something that machines can do, an activity embedded in testing) and testing (which requires humans), so as to make both our checking and our testing more powerful.

The abstract for Prof. Collins’ talk, quoted above, is an astute, concise description of why skilled testing matters. It’s also why the distinction between testing and checking matters, too. For that, we are grateful.

There will be much more to come in these pages relating Harry’s work to our craft of testing; stay tuned. Meanwhile, I give his books my highest recommendation.

Tacit and Explicit Knowledge
Rethinking Expertise (co-authored with Rob Evans)
The Shape of Actions: What Humans and Machines Can Do (co-authored with Martin Kusch)
The Golem: What You Should Know About Science (co-authored with Trevor Pinch)
The Golem at Large: What You Should Know About Technology (co-authored with Trevor Pinch)
Changing Order: Replication and Induction in Scientific Practice
Artificial Experts: Social Knowledge and Intelligent Machines

Why Would a User Do THAT?

Monday, March 4th, 2013

If you’ve been in testing for long enough, you’ll eventually report or demonstrate a problem, and you’ll hear this:

“No user would ever do that.”

Translated into English, that means “No user that I’ve thought of, and that I like, would do that on purpose, or in a way that I’ve imagined.” So here are a few ideas that might help to spur imagination.

  • The user made a simple mistake, based on his erroneous understanding of how the program was supposed to work.
  • The user had a simple slip of the fingers or the mind—inadvertently pasting a letter from his mother into the “Withdrawal Amount” field.
  • The user was distracted by something, and happened to omit an important step from a normal process.
  • The user was curious, and was trying to learn about the system.
  • The user was a hacker, and wanted to find specific vulnerabilities in the system.
  • The user was confused by the poor affordances in the product, and at that point was willing to try anything to get his task accomplished.
  • The user was poorly trained in how to use the product.
  • The user didn’t do that. The product did that, such that the user appeared to do that.
  • Users actually do that all the time, but the designer didn’t realize it, so the product’s design is inconsistent with the way the user actually works.
  • The product used to do it that way, but to the user’s surprise now does it this way.
  • The user was looking specifically for vulnerabilities in the product as a part of an evaluation of competing products.
  • The product did something that the user perceived as unusual, and the user is now exploring to get to the bottom of it.
  • The user did that because some other vulnerability—say, a botched installation of the product—led him there.
  • The user was in another country, where they use commas instead of periods, dashes instead of slashes, kilometres instead of miles… Or where dates aren’t rendered the way we render them here.
  • The user was testing the product.
  • The user didn’t realize this product doesn’t work the way that product does, even though the products have important and relevant similarities.
  • The user did that, prompted by an error in the documentation (which in turn was prompted by an error in a designer’s description of her intentions).
  • To the designer’s surprise, the user didn’t enter the data via the keyboard, but used the clipboard or a programming interface to enter a ton of data all at once.
  • The user was working for another company, and was trying to find problems in an active attempt to embarrass the programmer.
  • The user observed that this sequence of actions works in some other part of the product, and figured that the same sequence of actions would be appropriate here too.
  • The product took a long time to respond, the user got impatient, and started doing other stuff before the product responded to his earlier request.

And I’m not even really getting started. I’m sure you can supply lots more examples.

Do you see? The space of things that people can do intentionally or unintentionally, innocently or malevolently, capably or erroneously, is huge. This is why it’s important to test products not only for repeatability (which, for computer software, is relatively easy to demonstrate) but also for adaptability. In order to do this, we must do much more than show that a program can produce an expected, predicted result. We must also expose the product to reasonably foreseeable misuse, to stress, to the unexpected, and to the unpredicted.