Blog Posts for the ‘Models’ Category

It’s Not A Factory

Tuesday, April 19th, 2016

One model for a software development project is the assembly line on the factory floor, where we’re making a buhzillion copies of the same thing. And it’s a lousy model.

Software is developed in an architectural studio with people in it. There are drafting tables, drawing instruments, good lighting, pens and pencils and paper. And erasers, and garbage cans that get full of coffee cups and crumpled drawings. Good ideas become better ideas as they are sketched, analysed, criticised, and revised. A lot of bad ideas are discovered and rejected before the final plans are drawn.

Software is developed in a rehearsal hall with people in it. The room is also filled with risers and chairs and other temporary staging elements, and with substitute props that stand in for the finished products. There’s a piano to accompany the singers while the orchestra is being rehearsed in another hall. Lighting, sound, costumes and makeup are designed and folded into the rehearsal process as we experiment with different ways of bringing the show to life. Everyone tries stuff that doesn’t work, or doesn’t fit, or doesn’t sound right, or doesn’t look good at first. Frustration arises, feelings get bruised, and then breakthroughs happen and problems get solved. Lots of experiments lead to that joyful and successful opening night.

Software is developed in a workshop with people in it; skilled craftspeople who build tools and workspaces for themselves and each other, as part of the process of crafting products for people to buy. Even though they try to keep the shop clean, there’s occasional sawdust and smoke and spilled glue and broken machinery. Work in progress gets tested, and weaknesses are exposed—sometimes late in the game—and get fixed.

In all of these places, variation is encouraged. Designs are tinkered with. Discoveries are celebrated. Learning happens. Most importantly, skill and tacit knowledge are both applied and developed.

The Lean model for software development might seem a more humane step forward from the older days, but it’s still based on the factory. Ideas aren’t widgets whose delivery you can schedule just in time. Failed experiments aren’t waste when you learn from them, and if you know it won’t be waste from the outset, it’s not really an experiment. Everything that makes it into the product should represent something that the customer values, but when we’re creating something novel (which we’re always doing to some degree as we’re building software), we’re exploring and trying things out to help refine our understanding of what the customer actually values.

If there is any parallel between software and manufacturing, it is this: the “software development” part of manufacturing happens before the assembly line—in the design studio, where the prototypes are being developed, refined, and selected for mass production. The manufacturing part? That’s the copy command that deploys a copy of the installation package to all the machines in the enterprise, or the disk duplicator that stamps out a million DVDs with copies of the golden master on it, or the Web server that delivers a copy of the product to anyone who requests it. Getting to that first copy, though? That’s a studio thing, not an assembly-line thing.

The primary inspiration for this post is a conversation I had with Cem Kaner in 2008. Another is the book Artful Making by Robert Austin and Lee Devin, which I first read around the same time. Yet another is Christopher Alexander’s A Pattern Language. One more: my long-ago career in theatre, which prepared me better than you can imagine for a life in software development.

100% Coverage is Possible

Saturday, April 16th, 2016

In testing, what does “100% coverage” mean? 100% of what, specifically?

Some people might say that “100% coverage” could refer to lines of code, or branches within the code, or the conditions associated with the branches. That’s fine, but saying “100% of the lines (or branches, or conditions) in the program were executed” doesn’t tell us anything about whether those lines were good or bad, useful or useless.

“100% code coverage” doesn’t tell us anything about what the programmers intended, what the user desired, or what the tester observed. It says nothing about the tester’s engagement with the testing; whether the tester was asleep or awake. It ignores the oracles that the tester applied;how the tester recognized—or failed to recognize—bugs and other problems that were encountered during the testing. It suggests that some machinery processed something; nothing more.

Here’s a potentially helpful way to think about this:

“X coverage is how thoroughly we have examined the product with respect to some model of X”.

So: risk coverage is how thoroughly we have examined the product with respect to some model of risk; requirements coverage is how how thoroughly we have examined the product with respect to some model of requirements; code coverage is how thoroughly we have examined the product with respect to some model of code.

To claim 100% coverage is essentially the same as saying “We’ve looked for bugs everywhere!” For a skilled tester, any “100%” claim about coverage should prompt critical thinking: “How much” compared to what? 100% of what, specifically? Some model of X—which one? Whose model? How well does the “model of X” model reality? What does the model of X leave out of the universe of possible ways of thinking about X? And what non-X things should we also be considering when we’re testing?

Here’s just one example: code coverage is usually described in terms of the code that we’ve written, or that we have available to evaluate. Yet every program we write interacts with some platform that might include third-party libraries, browsers, plug-ins, operating systems, file systems, firmware. Our code might interact with our own libraries that we haven’t instrumented this time. So “code coverage” refers to some code in the system, but not all the code in the system.

Once I did a test (or was it 10,000 tests?) wherein I used an automated check to run through all 10,000 possible settings of a particular variable. That was 100% coverage of that variable being used in a particular moment in the execution of the system, on that day.

But it was not 100% of all the possible sequences of those settings, nor 100% of the possible subsequent paths through the product. It wasn’t 100% of the possible variations in pacing, or system load, or times of day when the system could be used. That test wasn’t representative of all of the possible stakeholders who might be using that variable, nor how they might use it.

What would “100% requirements coverage” mean? Would it mean that every statement in the requirements document was covered by a test? If you think so, it might be worthwhile to consider all the models that are in play.

The requirements document is a model of the product’s requirements. It refers to ideas that have been explicitly expressed by some people, but not by all of the people who might have requirements for the product. The requirements document models what those people thought they wanted at a certain point, but not necessarily what they want now.
The requirements document doesn’t account for all of the ideas or ideas that people had that may have been tacit, or implicit, or latent.

You can subject “statement”, “covered”, and “test” to the same kind of treatment. A statement is a model of what someone is thinking at a given point in time; our notion of what “covered” means is governed our models of coverage; our notion of “a test” is conditioned by our models of testing. It’s models all the way down.

Things in testing keep reminding me of a passage from Computer Programming Fundamentals by Herbert Leeds and Jerry Weinberg:

“One of the lessons to be learned … is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.“

Testing is an open investigation. 100% coverage of a particular factor may be possible—but that requires a model so constrained that we leave out practically everything else that might be important.

Test coverage, like quality, is not something that yields very well to quantitative measurements, except when we’re talking of very narrow and specific conditions. But we can discuss coverage, and ask questions about whether it’s what we want, whether we’re happy with it, or whether we want more.

Further reading:

Got You Covered
Cover or Discover
A Map by Any Other Name
What Counts

How Models Change

Saturday, July 19th, 2014

Like software products, models change as we test them, gain experience with them, find bugs in them, realize that features are missing. We see opportunities for improving them, and revise them.

A product coverage outline, in Rapid Testing parlance, is an artifact (a map, or list, or table…) that identifies the dimensions or elements of a product. It’s a kind of inventory of aspects of the product that could be tested. Many years ago, my colleague and co-author James Bach wrote an article on product elements, identifying Structure, Function, Data, Platform, and Operations (SFDPO; think “San Francisco DePOt”, he suggested) as a set of heuristic guidewords for creating or structuring or reviewing the highest levels of a coverage outline.

A few years later, I was working as a tester. While I was on that assignment, I missed a few test ideas and almost missed a few bugs that I might have noticed earlier had I thought of “Time” as another guideword for modeling the product. After some discussion, I persuaded James that Time was a worthy addition to the Product Elements list. I wrote my own article on that, Time for New Test Ideas).

Over the years, it seemed that people were excited by the idea of using SFDPOT as the starting point for a general coverage outline. Many people reported getting a lot of value out of it, so in my classes, I’ve placed more and more emphasis on using and practicing the application of that part of the Heuristic Test Strategy Model. One of the exercises involves creating a mind map for a real software product. I typically offer that one way to get started on creating a coverage outline is to walk through the user interface and enumerate each element of the UI in the mind map.

(Sometimes people ask, “Why bother? Don’t the specifications or the documentation or the Help file provide maps of the UI? What’s the point of making another one?” One answer is that the journey, rather than the map, is the point. We learn one set of things by reading about a product; we learn different things—and we typically learn more deeply—by touring the product, interacting with it, gaining experience with it, and organizing descriptions of what we’ve found. Moreover, at each moment, we may notice, infer, or wonder about things that the documentation doesn’t address. When we recognize something new, we can add it to our coverage model, our risk list, or our test ideas—plus we might recognize and note some bugs or issues along the way. Another answer is that we should treat anything that any documentation says about a product as a rumour until we’ve engaged with the product.)

One issue kept coming up in class: on the product coverage outline, where should the map of the user interface go? Under Functions (what the product does)? Or Operations (how people use the product)? Or Structure (the bits and pieces of the product)? My answer was that it doesn’t matter much where you put things on your coverage outline, as long as it fits for you and the people with whom you might be sharing the map. The idea is to identify things that could be tested, and not to miss important stuff.

After one class, I was on the phone with James, and I happened to mention that day’s discussion. “I prefer to put the UI under Structure,” I noted.

What? That’s crazy talk! The UI goes under Functions!”

“What?” I replied. “That’s crazy talk. The UI isn’t Functions. Sure, it triggers functions. But it doesn’t perform those functions.”

“So what?” asked James. “If it’s how the user gets at functions, it fits under Functions just fine. What makes you think the UI goes under Structure?”

“Well, the UI has a structure. It’s… structural.”

Everything has a structure,” said James. “The UI goes under Functions.”

And so we argued on. Then one of us—and I honestly don’t remember who—suggested that maybe the UI was important enough to be its own top-level product element. I do remember James pointing out that if when we think of interfaces, plural, there might be several of them—not just the graphical user interface, but maybe a command-line interface. An application programming interface.

“Hmmm…,” I said. This reminded me of the four-user model mentioned in How to Break Software (human user, API user, operating system user, file system user). “Interfaces,” I said. “Operating system interface, file system interface, network interface, printer interface, debugging interface, other devices…”

“Right,” said James. “Plus there are those other interface-y things—importing and exporting stuff, for instance.”

“Aren’t those covered under ‘Functions’?”

“Sure. Or they might be, depending on how you think about it. But the point of this kind of model isn’t to be a template, or a form you fill out. It’s to help us reduce the chances that we might miss something important. Our models are leaky abstractions; overlaps are okay,” said James. Which, of course, was exactly the same argument I had used on him several years earlier when we had added Time to the model. Then he paused. “Ah! But we don’t want to break the mnemonic, do we? San Francisco DePOT.”

“We can deal with that. Just misspell ‘depot’ San Francisco DIPOT. SFDIPOT.”

And so we updated the model.

I wonder what it will look like five years from now.

Scenarios Ain’t Just Use Cases

Thursday, May 15th, 2014

How do people use a software product? Some development groups model use through use cases. Typically use cases are expressed in terms of the user performing a set of step-by-step behaviours: 1, then 2, then 3, then 4, then 5. In those groups, testers may create test cases that map directly onto the use cases. Sometimes, that gets called a scenario, and the testing of it is called a scenario test.

According to Cem Kaner, a scenario is a “hypothetical story, used to help a person think through a complex problem or system.” He also says that a scenario test has several characteristics: it is motivating, in that stakeholders would push to fix problems that the test revealed; credible, in that it not only could happen, but that things like it could probably happen; that it involves complexity in terms of use, environments, or data. (Read his paper on scenario testing here.)

Taking the steps directly from a use case and then calling it a scenario limits your view of what a scenario is, which in turn limits your testing. People do not do 1, 2, 3, 4, and 5 in real life. Instead, they

  • do 1
  • start 2
  • respond to one email, and delete a bunch of get-rich-quick offers
  • resume 2
  • take a phone call from the dog grooming studio; Fluffy will be ready at 4:30
  • realize they’ve lost track of what they were doing in 2
  • go back to 1
  • restart 2
  • look up some figures in Excel
  • place a pizza order for the lunchtime meeting
  • finish 2
  • go to 3
  • accidentally paste the pizza order into some field in 3
  • dismiss the error message, after a fruitless attempt to decipher what it means it
  • finish 3
  • forget to save their work; thank heaven for the auto-save feature
  • get called to an all-hands meeting
  • return to find that the machine has entered sleep mode
  • wake up the machine
  • dismiss a dialog saying that it’s now safe to remove the device, even though they didn’t want to remove the device; the message is due to an operating-system bug related to sleep mode
  • discuss rumours raised from the all-hands meeting on Instant Messaging
  • start 4
  • realize they got something wrong in step 2
  • go back through 3 to 2
  • go to lunch
  • wake up the damned machine again
  • dismiss the damned dialog again
  • correct 2
  • go forward to 3
  • accept the values that were left there from (auto-)saving the first time through (but which due to the changes in 2 are now invalid)
  • go into 4
  • get confused about an element of the user interface in 4
  • realize something looks wrong because of the inappropriately saved value from 3
  • go back to 3
  • fix the values and save the page
  • go to 4
  • look away from the computer, notice there’s a new plant in the corner, under the picture—when did that get there, anyway?
  • complete 4
  • start 5
  • get invited for coffee
  • come back
  • wake up the damned machine again
  • dismiss the damned dialog again
  • worry irrationally that they didn’t complete 4
  • open the app in a second window to confirm that they have in fact completed 4, inadvertently jostling 4’s state
  • restart 5
  • take a phone call in the middle of trying to do 5; “Fluffy appears to be sick and could you show up half an hour earlier?”
  • change their minds about something in 4
  • go back and fix it
  • get tapped on the shoulder by the boss
  • start 5
  • almost finish 5
  • forget to save their work
  • program crashes; thank heaven for the auto-save feature
  • find out that auto-save mode doesn’t actually save every time.

If you want to show that the system can work, by all means check the system by following the procedure that the use case prescribes. That’s something we call sympathetic testing, and it’s a reasonable place to start testing; to learn about the feature; to find how people might derive value from the feature; to begin building your models of the product, and how there might be problems in it.

But if you want to discover problems that matter to people before those people find them, test the system by introducing lots of variation, pauses, distractions, concurrent actions, and galumphing.

Related post: Why We Do Scenario Testing

Why Would a User Do THAT?

Monday, March 4th, 2013

If you’ve been in testing for long enough, you’ll eventually report or demonstrate a problem, and you’ll hear this:

“No user would ever do that.”

Translated into English, that means “No user that I’ve thought of, and that I like, would do that on purpose, or in a way that I’ve imagined.” So here are a few ideas that might help to spur imagination.

  • The user made a simple mistake, based on his erroneous understanding of how the program was supposed to work.
  • The user had a simple slip of the fingers or the mind—inadvertently pasting a letter from his mother into the “Withdrawal Amount” field.
  • The user was distracted by something, and happened to omit an important step from a normal process.
  • The user was curious, and was trying to learn about the system.
  • The user was a hacker, and wanted to find specific vulnerabilities in the system.
  • The user is confused by the poor affordances in the product, and at that point was willing to try anything to get his task accomplished.
  • The user was poorly trained in how to use the product.
  • The user didn’t do that. The product did that, such that the user appeared to do that.
  • Users actually do that all the time, but the designer didn’t realize it, so product’s design is inconsistent with the way the user actually works.
  • The product used to do it that way, but to the user’s surprise now does it this way.
  • The user was looking specifically for vulnerabilities in the product as a part of an evaluation of competing products.
  • The product did something that the user perceived as unusual, and the user is now exploring to get to the bottom of it.
  • The user did that because some other vulnerability—say, a botched installation of the product—led him there.
  • The user was in another country, where they use commas instead of periods, dashes instead of slashes, kilometres instead of miles… Or where dates aren’t rendered the way we render them here.
  • The user was testing the product.
  • The user didn’t realize this product doesn’t work the way that product does, even though the products have important and relevant similarities.
  • The user did that, prompted by an error in the documentation (which in turn was prompted by an error in a designer’s description of her intentions).
  • To the designer’s surprise, the user didn’t enter the data via the keyboard, but used the clipboard or a programming interface to enter a ton of data all at once.
  • The user was working for another company, and was trying to find problems in an active attempt to embarrass the programmer.
  • The user observed that this sequence of actions works in some other part of the product, and figured that the same sequence of actions would be appropriate here too.
  • The product took a long time to respond, the user got impatient, and started doing other stuff before the product responded to his earlier request.

And I’m not even really getting started. I’m sure you can supply lots more examples.

Do you see? The space of things that people can do intentionally or unintentionally, innocently or malevolently, capably or erroneously, is huge. This is why it’s important to test products not only for repeatability (which, for computer software, is relatively easy to demonstrate) but also for adaptability. In order to do this, we must do much more than show that a program can produce an expected, predicted result. We must also expose the product to reasonably foreseeable misuse, to stress, to the unexpected, and to the unpredicted.

What’s Comparable (Part 2)

Tuesday, December 4th, 2012

In the previous post, Lynn McKee recognized that, with respect to the Comparable Product oracle heuristic, “comparable” can be have several expansive interpretations, and not just one narrow one. I’ll emphasize: “comparable product”, in the context of the FEW HICCUPPS oracle heuristics, can mean any software product, any attribute of a software product, or even attributes of non-software products that we could use as a basis for comparison. (Since “comparable product” is a heuristic, it can fail us by not helping us to recognize a problem, or by fooling us into believing that there is a problem where there really isn’t one. For now, at least, I leave the failure modes for each example below as an exercise for the reader. That said…) Here are some examples of comparable products that we could use when applying this heuristic.

An alternative product. Our product is intended to help us accomplish a particular task or set of tasks. We compare the overall operation of our product to the alternative product and its behaviour, look and feel, output, workflow, and so forth. If our product is inconsistent with a product that helps people do the same thing, then we might suspect a problem in our product. This is the “Microsoft Word vs. OpenOffice” sense of “comparable product”.

A commercially competitive product. This is a special case of “alternative product”. People often hold commercial products to a higher standard than they hold freeware. If our product is inconsistent with another commercial product that is in the same market category (think “Microsoft Word vs. WordPerfect”), then we might suspect a problem in our product.

A product that’s a member of the same suite of products. Imagine being a tester on the enormous team that produces Microsoft Office. In places, Microsoft Outlook’s behaviour is inconsistent with the behaviour of Microsoft Word. We might recognize that a user could be frustrated or annoyed by inconsistencies between those products, because those products could reasonably be expected to have a consistent look and feel. I use both Word and Outlook. Sometimes I want to find a particular word or phrase in a long Outlook message that someone sent me. I press Ctrl-F. Instead of popping open the Find dialog, Outlook opens a copy of the message to be Forwarded. The appropriate key to launch a search for something in Outlook is F4, which by default is assigned to “Redo or Repeat” in Word. (Note that Joel Spolsky’s Law of Leaky Abstractions starts to take effect here. This flavour of the comparable product heuristic starts to leak into territory covered by the “user expectations” heuristic. That’s okay; some overlap between oracle heuristics helps to reduce to chance that we’ll miss a problems if one heuristic misfires. Moreover, weighing information from a variety of oracles helps us to evaluate the signficance of a given problem. There’s another leaky abstraction here too: what is a product? Given that Word is a product and Outlook is a product, is Office a product?)

Two products that are subcomponents within the same larger product. As in the Office/Outlook/Word example just above, Outlook isn’t even consistent within itself. In the (incoming) message reading window, Ctrl-F triggers the Forward Message function. In the (outgoing) message editing window, Ctrl-F does bring up the Find dialog. That’s because I have Outlook configured to use Word’s editor as Outlook’s. (There’s a leaky abstraction here too: the “consistency within the product” heuristic, where similar behaviours and states within the product should be consistent with one another. It’s good when oracles overlap!)

An existing product whose sole purpose is comparable to a specific feature in our product. A very simple product might have a purpose that is directly comparable to a purpose, feature or function in our product. A command-line tool like wc (Unix’ command-line word-count program) isn’t comparable with Microsoft Word in the large, but it can be used as a point of comparison for a specific aspect of Word’s behaviour.

An existing product that is different, yet shares some comparable feature, function, or concept. Many non-testers (and, apparently, many testers too) would consider Halo IV and Microsoft Word to be in completely different categories, yet there are similarities. Both are pieces of computer software; both process data; both exhibit behaviour; both save and restore state; both may change their appearance depending on the display settings. If either one were to crash, respond slowly, or misrepresent something on the screen, we might recognize a problem, and recognizing or conceiving of a problem in one might trigger us to consider a problem in the other.

A chain of events in some product. We might choose to build simple test automation to aid us in comparing the output of comparable functions or algorithms in two products. (For example, if we were testing OpenOffice, we might use scripting to compare OpenOffice’s result of a sin(x) function with Microsoft Excel’s API result, or we could use a call to the Web to obtain comparable output from the sin(x) function in Wolfram Alpha.) Those comparisons may become much more interesting when we chain a number of functions together. Note that if we’re not modeling accurately, coding carefully, and logging diligently, comparisons of chains of events may be harder to analyze.

A product that we develop specifically to implement a comparable algorithm. While working at a bank, I developed an Excel spreadsheet and VBA code to model the business logic for the teller workstation application I was testing. I used the use cases for the application as a specification, which allowed me to predict and test the ways in which which general ledger accounts would be affected by each supported transaction. This was a superb way to learn about the application, the business rules, and the power of Excel.

A reference output or artifact. Those who use FIT or FitNesse develop tables of functions, inputs, and outputs that the tool compares to output from integration-level functions; those tables are comparable products. If our testing mission were to examine the font output of Word, the display from a font management tool could be comparable to Word’s output. The comparable product may not even be instantiated in software or in electronic form. For example, we could compare the fonts in the output of our presentation software to the fonts in a Letraset catalog; we could compare the output from a pocket calculator to the output of our program; we could compare aggregated output from our program to a graph sketched on paper by a statistician; we could compare the data in our mailing list to the data in the postal code book. (Well, we used to do that; now it’s much easier to do it with some scripting that gets the data from the postal service.) More than once I’ve found a bug by comparing the number posted on the “Contact Us” page to the number printed on our business cards or in our marketing material. We could also compare output produced by our program today (to output produced by our program yesterday (an idea that leaks into the “consistency with history” heuristic).

A product that we don’t like. I remember this joke from somewhere in Isaac Asimov’s work: “People compare my violin playing to Jascha Heifetz. They say, ‘A Heifetz he ain’t!'” A comparable product is not always comparable in a desirable way. If someone touts a music management product to me saying “it’s just like iTunes!”, I’m not likely to use it. If people have been known to complain about a product, and our product provides the same basis for a complaint, we suspect a problem with our product. (The Law of Leaky Abstractions applies here too, leaking into the “familiar problems” heuristic, in which a product should be inconsistent with patterns of problems that we’ve seen before.)

Patterns of behaviour in a range or sphere of products. We can compare our product against our experience with technology and with entire classes of relevant or interesting products, without immediately referring to a specific product. “It’s one thing from freeware, but I didn’t expect crashes this often from a professional product.” “Well, this would be passable on a Windows system, but you know about those finicky Mac users.” “Yikes! I didn’t expect this product to make my password visible on-screen!” “Aw geez; the on-screen controls here are just as confusing as they are on a real VCR—and there are no tooltips, either.” “The success code is 1? Non-zero return codes on command-line utilities usually represent errors, don’t they?”

All of these things point to a few overarching points.

  • “Similar” and “comparable” can be interpreted narrowly or broadly. Even when products are dissimilar in important respects, even one point of similarity may be useful.

  • Products can be compared by similarity or by contrast.

  • We can make or prepare comparable products, in addition to referring to existing ones.

  • A comparable product may or may not be computer software.

  • Especially in reference to the few categories above, there is great value for a tester in knowing not only about technologies and functional aspects of products in the same product space, but also about user interface conventions, business or workplace domains, sources of background information, cultural and aesthetic characteristics, design heuristics, and all kinds of other things because…

  • If the object of the exercise is to find problems in the product quickly, it’s a good idea to have access to a requisite variety of ideas about what we might use as bases for comparison. (I describe “requisite variety” here, and Jurgen Appello describes it even better here.)

  • Bugs thrive on overly narrow or overly broad interpretations of “comparable”. Know what you’re comparing, and why the comparison matters to your testing and to your clients.

The comparable product heuristic is an oracle principle, but in describing it here, I haven’t paid much attention to mechanisms by which we might make comparisons. We’ll get to that issue next.

What’s Comparable (Part 1)

Monday, December 3rd, 2012

People interpret requirements and specifications in different ways, based on their models, and their past experiences, and their current context. When they hear or read something, many people tend to choose an interpretation that is familiar to them, which may close off their thinking about other possible interpretations. That’s not a big problem in simple, stable systems. It’s a bigger problem in software development. The problems we’re trying to solve are neither simple nor stable, and the same is true with the software that we’re developing.

The interpretation problem applies not only to software development and testing, but to the teaching of testing too. For example, in Rapid Software Testing, James Bach and I teach that an oracle is a way to recognize a problem, and one of the most important and powerful a broader set of oracle heuristics.

Here’s how the typical experiment went. We started by asking “We’re thinking of applying the comparable product oracle heuristic to a test of Microsoft Word. What product could we use for that?” Almost everyone suggested OpenOffice Writer, which seems to be the last remaining well-known full-featured word processing alternative to Microsoft Word. Some suggested WordPad, or Notepad, although almost everyone who did so suggested that WordPad (much less Notepad) wouldn’t be much use as comparable products. “Why not?” we asked. In general, the answer was that WordPad and Notepad were too simple, and didn’t reflect the complexity of Word.

Then we asked some follow-up questions. Is Word comparable with Unix’s command-line program wc? Most people said No (for some, we had to explain what wc is; it counts the words in a file that you provide as input). It was only when we asked, “What if we were testing the word count feature in Microsoft Word?” that the light began to dawn. When we asked if Word was comparable with Halo (the game), most people still said No. When we encouraged them to think more broadly about specific features of Word that we might compare with Halo, they started to get unstuck, and began to realize that while Word and Halo were dramatically different products in important respects, they were nonetheless comparable on some levels.

By contrast, here’s a conversation with Lynn McKee. The chat has been edited to de-Skypeify it (I’ve removed some typos, fixed some punctuation, and removed a couple of digressions not consequential to the conversation).

Michael: If you were asked, “We’re thinking of applying the comparable product oracle heuristic to a test of Microsoft Word. What product could we use for that?”, how would you answer?

Lynn: Hmmm. Certainly, we could use products such as “Open Office”, “Notepad” and others. Could you tell me more about what “we” are hoping to learn about the product under test to better assess which comparable products to use? Is this a brand new product? A version release? If so, what changed and what functions are we interested in comparing?

Michael: That’s a pretty good answer. A followup question: do you know the wc program, typically available under Unix?

Lynn: Sorry, I am not familiar with that product. Can you tell me more about it? How does it relate to your product? Is your product running on Unix?

Michael: Yes. wc is a command-line program. Its purpose is to count the words in a document. You supply the document as input; it returns the number of words in that document.

Lynn: While you were typing, I used my handy Google search to tell me a bit about Unix WC. Oh interesting, so are you looking to gather information about how capable, performant, etc the word count functionality is within MS Word? Can you tell me more about what functions of MS Word interest you the most? And why?

Michael: One more question: Halo IV — the game. Is that a comparable product to MS Word?

Lynn: Sheesh, I’ve only ever seen ads. Lemme think. It blows people’s brains out…sometimes I want to do that with MS Word. 😉 It would depend on what type of comparison we are hoping to draw. For example, Halo is a game and does require interaction with a user. From a UI perspective, there are menus and other forms of cause-and-effect type of interaction—that is, when I do X, I expect Y. There are also state comparisons I could draw. When I start a new game, save a game, reopen a game I have expectations about the state the game should be in. This is similar to how I may expect a document to behave with states. I may also expect certain behavior with pausing or crashing the game in terms of recovery that could be compared to MS Word. Conversely… if I am looking to compare the product’s ability to display fonts, images, format tables, etc. then I may find very low value in comparing the products. I think that you could compare any two products but you may find very different value in the comparison exercise, depending on what you hope to learn.

This is an answer that I would consider exemplary. I have related it here because it was outstanding in two ways: it was an extremely good answer, but it was also exceptional, in that most people didn’t consider wc or Halo to be even remotely comparable to Microsoft Word without a good deal of prompting. Lynn, on the other hand, recognized that “comparable” doesn’t necessarily mean “highly similar”; it can also mean “anything or any aspect of something that you might use as a basis for comparison“. She immediately questioned the question, to make sure that she understood the task at hand. She also did a bit of research on her own while I was answering the question, and asked some highly relevant questions about risks and particular concerns that I might have. Note that she’s doing important informal work—understanding the testing mission—before making too firm a commitment to what might or might not be considered “comparable” for the purposes of a particular question that we might have about the product.

I’ll have more to say about the Comparable Product heuristic tomorrow.

Where Does All That Time Go?

Tuesday, October 30th, 2012

It had been a long day, so a few of the fellows from the class agreed to meet a restaurant downtown. The main courses had been cleared off the table, some beer had been delivered, and we were waiting for dessert. Pedro (not his real name) was complaining, again, about how much time he had to spend doing administrivial tasks—meetings, filling out forms, time sheets, requisitions, and the like. “Everything takes so long. I want a pad of paper to take notes, I have to fill out a form for it. God help me if I run out of forms!”

“How much time do you spend on this kind of stuff each week?” I asked.

Pedro replied, “An hour a day. Maybe two, some days. Meetings…let’s say an hour and a half, on average.”

Wow, I thought—that’s a pretty good chunk of the week. I had an idea.

“Let’s visualize this, I said.” I took out my trusty Moleskine notebook. I prefer the version with the graph paper in it, for occasions just like this one. I outlined a grid, 20 squares across by two down.

Empty Week

“So you spend, on average, an hour and a half each day on compliance stuff. One-point-five times five, or 7.5 hours a week. Let’s make it eight. Put a C in eight squares.” He did that.


“Okay,” I said. “You were griping today about how much time you spend wrestling with your test environments.”

Pedro’s eyes lit up. “Yes!” he said. “That’s the big one. See, it’s mobile stuff. We have a server component and a handset component to what we do, and the server stuff is a real bear.”

“Tell me more.”

“It’s a big deal. We’ve got one environment that models the production system. The software we’re developing has been so buggy that we can’t tell whether a given problem is general, or specific to the handset, so we have another one that we set up to do targeted testing every time we add support for a new handset. That’s the one I work with. Trouble is, setting it up takes ages and it’s really finicky. I have to do everything really carefully. I’ve asked for time to do scripting to automate some of it, but they won’t give that to me, because they’re always in such a rush. So, I do it by hand. It’s buggy, and I make the odd mistake. Either way, when I find out it doesn’t work, I have to troubleshoot it. That means I have to get on instant messaging or the phone to the developers, and figure out what’s wrong; then I have to figure out where to roll back to. And usually that’s right from the start. It wastes hours. And it’s every day.”

“Okay. Show me that on our little table, here. Use an S to represent each hour your spend each day.”

Whereupon Pedro proceded to fill in squares. Ten of them. Ten more. And then, eight more.


“Really?!” I said. “28 hours a week divided by five days—that’s more than five hours a day. Seriously?”

“Totally,” said Pedro. “It’s most of the day, every day, honestly. Never mind the tedium. What’s really killing me is that I don’t feel like I’m getting any real testing work done.”

“No kidding. There’s no time for it. There are only four squares left in the week. Plus, something you said earlier today about tons of bugs that aren’t related to setting up?”

“Right. When it comes to the stuff that I’m actually being asked to test, there’s lots of bugs there too. So my ‘testing time’ isn’t really testing. It’s mostly taken up with trying to reproduce and document the bugs.”

“Yes. In session-based test management, that’s bug investigation and reporting—B-time. And it does interrupt test design and execution—T-time—which is what produces actual test coverage, learning about what’s actually going on in the product. So, how much B-time?” He filled in three of the squares with Bs.

Bug Investigation and Reporting

“And T-time?”

He had room left to put in one lonely little T in the lower right corner.

Testing Time

“Wow,” I laughed. “One-fortieth of your whole week is spent in getting actual test coverage. The rest is all overhead. Have you told them how it affects you?”

“I’ve mentioned it,” he said.

“So look at this,” I suggested. “It’s even more clear when we use colour for emphasis.”

With Colour

“Whoa. I never looked at it that way. And then,” he paused. “Then they ask me, ‘Why didn’t you find that bug?'”

“Well,” I said, “considering the illusion they’re probably working under, it’s not an unreasonable question.”

“What do you mean?” Pedro asked.

“What does it say on your business card?”

“‘Software Testing’.”

“And what does it say on the door of the the test lab?”

“‘Test Lab’,” said Pedro.

“And they call you…?”


“No,” I laughed. “They say you’re a… what?”

“Oh. A tester.”

“So since you’re a tester, and since the door on the test lab says ‘Test Lab’, and your business card says ‘Testing’, they figure that’s all you do. The illusion is what Jerry Weinberg calls the Lumping Problem. All of those different activities—administrative compliance, setup, bug investigation and reporting, and test design and execution—are lumped into a single idea for them.” And I drew it for him.

Management's Dream

“That’s management’s illusion, there. Since, in their imagination, you’ve got forty hours of testing time in a week, it’s not unreasonable for them to wonder why you didn’t find that bug.”

“Hmmm. Right,” said Pedro.

“When in fact, what they’re getting from you is this.” And I drew it for him.

Testing Reality

“For testing—actual interaction with the product, looking for problems, you’ve got one-fortieth of the time they think you’ve got. One lonely little T. Is that part of your test report?”

“Oy,” he said. “Maybe I should show them something like this.”

“Maybe you should,” I said.

A couple of nights later, I showed that page of my notebook to James Bach over Skype. “Wow,” he said. “That guy could be forty times more productive!”


“Well, no, not really, of course. But suppose the programmers checked their work a little more carefully, or suppose the testers practiced writing more concise bug reports and sharpened their investigating skill. One of those two things could cut the bug investigation time by a third. That would give more time for testing, when they’re not being interrupted by other stuff. What if they cut the setup time by a half, and that administrivia by half?”

“Four, fourteen…” I said. “That would give eighteen more hours for testing and bug investigation, for a total of 22 hours. And even if they’re still doing two hours of bug investigation for every one hour of testing time… well, that’s seven times more productive, at least.”

“Seven times the test coverage if they get some of those issues worked out, then,” said James.

“Maybe de-lumping is the kind of thing lots of testers would want to do in their test reports,” I said.

How about you?

Time, Coverage, and Maps

Monday, October 15th, 2012

Over the last few years, people have become increasingly enthusiastic about the idea of mind mapping to help them describe or illustrate or otherwise consider test coverage. For me, Darren McMillan was the one who really got the ball rolling here, here, and here. More recently there have been other examples to present coverage ideas. Colleague Adam Goucher has weighed in here. But there’s another thing you can do, something that James Bach and I have been talking about in the Rapid Software Testing class for a couple of years now: You can use a mind map to help decide about how might allocate your time when you’re dealing with an uncertain situation. You can do this with a functional or structural diagram, too. Let’s try this with an example.

  • There is a given number of hours in a typical week; let’s say 40.
  • There are some testers on the team; let’s say four.
  • Each tester can accomplish a certain amount of uninterrupted testing time in the course of a day. For this exercise, let’s say that it’s three 90-minute sessions per tester per day. That means that each tester could accomplish 15 sessions per week, so our team of four could pull off 60 sessions per week.

Now, most sessions are not entirely productive in terms of test coverage. That is, sessions are not typically dedicated entirely to on-charter test design and execution (that’s called that testing time or T time in session-based test management). T-time is regularly interrupted by other activities. Apart from test-design and execution, whereby we obtain test coverage, there’s usually some setup time (S-time), and there’s almost always some bug investigation and reporting time (B-time). We can’t predict how well any given session is going to go, but over time we can learn to develop a sort of first-order, back-of-the-envelope, finger-in-the-air, heuristic, probably-wrong-but-right-enough kind of guess. We are talking about predicting the future, here. Let’s say that between them, our general experience with this development group is that B-time and S-time tend to cost us about a third of our time as we’re testing.

So, in order to figure out how we’re going to spend our time this week, we can’t say that we’re going to get 60 sessions worth of test coverage. Our effective testing time is more like 40 idealized sessions. Let’s represent those sessions with sticky notes—one perfectly session per note. For the team we’re imagining here, we’d have 40 sticky notes to work with.

Different sessions usually have different themes—tasks, activities, or approaches. As we engage with a brand-new feature, we might perform an “intake” or “survey” or “reconnaissance” session, with the goal of identifying what’s there to be tested. “Analysis” sessions might help us to decide on where certain risks are, what we want to cover, or how we want to cover it. As we get deeper into the testing of a particular feature (“deep coverage” sessions), we might want a given session to be focused on a particular kind of test coverage—straightforward capability testing, data- or domain-focused testing, or testing on a specific platform. Maybe we want to cover a feature of the product while focusing on a particular parafunctional quality criterion, like performance or usability. Perhaps we want to allocate some sessions to design or coding of test oracles. Maybe we could dedicate a session or two to exploring the product based on problem reports from the help desk. If we’d like to highlight specific dimensions of activities or coverage, we can decorate our sticky notes with icons, one or two key words, or a dot——or we can use different colours for the notes, or some combination of these things. In this example, we’ll use little icons to represent classes of activities.

Now get the team together in front of a whiteboard or flip chart to look at your structural diagram, flowchart, or mind map. Place a sticky note (perhaps with a few words of explanatory text) on each node (functional area) or line (interface) on the map you’d like to cover with an idealized session. Keep putting sticky notes on the diagram until you’ve used them up.

By the time you’re out of sticky notes, you will have begun to develop some ideas about what you might or might not be able to accomplish given a week to do it. Are some areas not covered at all? Pick up a sticky note from somewhere else, and move it around. Should certain risky or complex areas receive more attention than others? If so, they might be worthy of more nodes and more than one sticky note. Not enough sticky notes—that is, not enough time, given the people you have available—to cover the whole diagram as well as you’d like? In that case, something has to change, but if all the assumptions above still hold, the catch is that you only have 40 sticky notes to work with and to redeploy.

There’s another catch, too. Diagrams are models. Models are simplifications of reality, and so they leave stuff out. In this kind of exercise, things that don’t appear on your diagrams can be easy to forget. Some essential aspects of test coverage might not fit very well on the diagram, or indeed on any diagram. As you notice missing items or missing ideas, put them on the diagram or on a list in one corner of the space. Keep asking what might be missing from the diagram or the list. Each element on the list is a potential candidate for its own sticky note—or maybe you can cover two or three list items within a single session.

Once the diagram has been covered with general ideas, we can choose to write a more specific and refined charter for each sticky note.

Maybe our sessions won’t be as productive as we thought. If, in the course of testing, we determine that our assumptions aren’t meshing well with reality, we can revisit the diagram and the sticky notes to re-evaluate as soon as we have any information that might threaten the schedule or the anticipated test coverage. We’d typically look at a new diagram, or look at an old diagram in a very different way, every week or two in any case.

This approach could be adapted to mesh very well with the ideas that Paul Holland outlines in this article.

Making things visible provides a point of departure for conversations about strategy, logistics, and timing. It’s important for us to have the skill of telling a story about what there is to test, how we could test it, what we could cover, and what our constraints might be. Some simple visual aids can help us to illustrate that story.

What Exploratory Testing Is Not (Part 1): Touring

Thursday, December 15th, 2011

Touring is one way of structuring exploratory testing, but exploratory testing is not necessarily touring, and touring is not necessarily exploratory.

At one extreme, a tourist might parachute into a territory for which there is no detailed knowledge of the landscape, flora and fauna, or human culture, with the goal of identifying what’s there to be learned. Except in such cases, we wouldn’t call her a tourist; we’d call her an anthropologist, or a field botanist, or a field geologist, or an archaeologist. The activity is in this case is interactive with the territory. At the other extreme, a tourist might visit a travel agent, get on a plane to Orlando, meet a chartered bus at the airport, and sit through the rides at Disney World. The touring activity there is largely passive. and we would call someone engaged in it a “tourist”—although “engagement” is something of an exaggeration here. For this kind of person, serious explorers and locals would probably use the word “tourist” in a somewhat deprecating kind of way.

Touring a program can be done in a more scripted or more exploratory way, just as touring a city can be done in a more scripted or more exploratory way. A tourist has many options. Before going on a trip, a tourist might study what is already known about a particular destination. To prepare, she might supply herself with maps and travel guides, and some ideas about destinations of interest. Upon arrival, she might choose a set of walking tours from a guidebook and follow the routes closely, eating only at the restaurants identified in the guidebook, noting buildings and artifacts and other objects of interest by matching them with the descriptions and illustrations. At a given site, she might listen to a prepared audio guide that directs her observations very specifically. She might spend all of her time in the presence of a tour guide who tells her what to observe and how to interpret it. She might choose to accept everything the tour guide told her as the complete story, and refrain from asking questions. Even though the experience would be new to her, and she might learn something from it, she would not likely add much to what is already known. We call that activity touring, but it isn’t very exploratory, and a report on such a tour would largely recapitulate the guidebook. Is your testing like that?

On the other hand, rather than touring like a tourist, she might cover a territory as a historian, or a social scientist, or a travel writer. In that kind of role, she would have a research goal based on the idea of obtaining new knowledge. Learning something new and imparting it to other people requires a more open agenda than sitting on the bus while someone or something else directs your attention. Our researcher might make her way directly to particular destinations or landmarks and begin her research there, or she might amble through neighbourhoods or historical sites to discover new things about them. She could choose to focus on specific aspects of what’s there to observe, or she could choose to let the observations come to her—and, of course, she might do both. She might work with descriptions that she had been given with the intention of adding to them, or she might work from a set of questions that haven’t been asked before. Depending on her mission, she might choose to look for specific patterns or problems, or she might seek deeper understanding that would help her to identify or refine what kind of patterns or problems to look for. Even though the mission to discover new information might come from someone else, she remains in control of the specifics of the itinerary and of each activity from one moment to the next. Is your testing more like that?

One of the hallmarks of exploratory activity is the extent to which it is guided and structured by the person performing that activity. Another hallmark is the extent to which new knowledge feeds into choice of which action to perform next. Touring is not equivalent to exploration; touring can be done is a scripted way or an exploratory way.

This series continues with What Exploratory Testing Is Not (Part 2): After-Everything-Else Testing.

And, of course, in the face of all the forthcoming instances of what exploratory testing is not, you might want to know our current take on what exploratory testing is.