It’s Not A Factory

April 19th, 2016

One model for a software development project is the assembly line on the factory floor, where we’re making a buhzillion copies of the same thing. And it’s a lousy model.

Software is developed in an architectural studio with people in it. There are drafting tables, drawing instruments, good lighting, pens and pencils and paper. And erasers, and garbage cans that get full of coffee cups and crumpled drawings. Good ideas become better ideas as they are sketched, analysed, criticised, and revised. A lot of bad ideas are discovered and rejected before the final plans are drawn.

Software is developed in a rehearsal hall with people in it. The room is also filled with risers and chairs and other temporary staging elements, and with substitute props that stand in for the finished products. There’s a piano to accompany the singers while the orchestra is being rehearsed in another hall. Lighting, sound, costumes and makeup are designed and folded into the rehearsal process as we experiment with different ways of bringing the show to life. Everyone tries stuff that doesn’t work, or doesn’t fit, or doesn’t sound right, or doesn’t look good at first. Frustration arises, feelings get bruised, and then breakthroughs happen and problems get solved. Lots of experiments lead to that joyful and successful opening night.

Software is developed in a workshop with people in it; skilled craftspeople who build tools and workspaces for themselves and each other, as part of the process of crafting products for people to buy. Even though they try to keep the shop clean, there’s occasional sawdust and smoke and spilled glue and broken machinery. Work in progress gets tested, and weaknesses are exposed—sometimes late in the game—and get fixed.

In all of these places, variation is encouraged. Designs are tinkered with. Discoveries are celebrated. Learning happens. Most importantly, skill and tacit knowledge are both applied and developed.

The Lean model for software development might seem a more humane step forward from the older days, but it’s still based on the factory. Ideas aren’t widgets whose delivery you can schedule just in time. Failed experiments aren’t waste when you learn from them, and if you know it won’t be waste from the outset, it’s not really an experiment. Everything that makes it into the product should represent something that the customer values, but when we’re creating something novel (which we’re always doing to some degree as we’re building software), we’re exploring and trying things out to help refine our understanding of what the customer actually values.

If there is any parallel between software and manufacturing, it is this: the “software development” part of manufacturing happens before the assembly line—in the design studio, where the prototypes are being developed, refined, and selected for mass production. The manufacturing part? That’s the copy command that deploys a copy of the installation package to all the machines in the enterprise, or the disk duplicator that stamps out a million DVDs with copies of the golden master on it, or the Web server that delivers a copy of the product to anyone who requests it. Getting to that first copy, though? That’s a studio thing, not an assembly-line thing.

The primary inspiration for this post is a conversation I had with Cem Kaner in 2008. Another is the book Artful Making by Robert Austin and Lee Devin, which I first read around the same time. Yet another is Christopher Alexander’s A Pattern Language. One more: my long-ago career in theatre, which prepared me better than you can imagine for a life in software development.

100% Coverage is Possible

April 16th, 2016

In testing, what does “100% coverage” mean? 100% of what, specifically?

Some people might say that “100% coverage” could refer to lines of code, or branches within the code, or the conditions associated with the branches. That’s fine, but saying “100% of the lines (or branches, or conditions) in the program were executed” doesn’t tell us anything about whether those lines were good or bad, useful or useless. It doesn’t tell us anything about what the programmers intended, what the user desired, or what the tester observed. It says nothing about the tester’s engagement with the testing; whether the tester was asleep or awake. It ignores the oracles that the tester applied;how the tester recognized—or failed to recognize—bugs and other problems that were encountered during the testing. It suggests that some machinery processed something; nothing more.

Here’s a potentially helpful way to think about this:

“X coverage is how thoroughly we have examined the product with respect to some model of X”.

So: risk coverage is how thoroughly we have examined the product with respect to some model of risk; requirements coverage is how how thoroughly we have examined the product with respect to some model of requirements; code coverage is how thoroughly we have examined the product with respect to some model of code.

To claim 100% coverage is essentially the same as saying “We’ve looked for bugs everywhere!” For a skilled tester, any “100%” claim about coverage should prompt critical thinking: “How much” compared to what? 100% of what, specifically? Some model of X—which one? Whose model? How well does the “model of X” model reality? What does the model of X leave out of the universe of possible ways of thinking about X? And what non-X things should we also be considering when we’re testing?

Here’s just one example: code coverage is usually described in terms of the code that we’ve written, or that we have available to evaluate. Yet every program we write interacts with some platform that might include third-party libraries, browsers, plug-ins, operating systems, file systems, firmware. Our code might interact with our own libraries that we haven’t instrumented this time. So “code coverage” refers to some code in the system, but not all the code in the system.

Once I did a test (or was it 10,000 tests?) wherein I used an automated check to run through all 10,000 possible settings of a particular variable. That was 100% coverage of that variable being used in a particular moment in the execution of the system, on that day. But it was not 100% of all the possible sequences of those settings, nor 100% of the possible subsequent paths through the product. It wasn’t 100% of the possible variations in pacing, or system load, or times of day when the system could be used. That test wasn’t representative of all of the possible stakeholders who might be using that variable, nor how they might use it.

What would “100% requirements coverage” mean? Would it mean that every statement in the requirements document was covered by a test? If you think so, it might be worthwhile to consider all the models that are in play. The requirements document is a model of the product’s requirements. It refers to ideas that have been explicitly expressed by some people, but not by all of the people who might have requirements for the product. The requirements document models what those people thought they wanted at a certain point, but not necessarily what they want now. The requirements document doesn’t account for all of the ideas or ideas that people had that may have been tacit, or implicit, or latent. You can subject “statement”, “covered”, and “test” to the same kind of treatment. A statement is a model of what someone is thinking at a given point in time; our notion of what “covered” means is governed our models of coverage; our notion of “a test” is conditioned by our models of testing. It’s models all the way down.

Things in testing keep reminding me of a passage from Computer Programming Fundamentals by Herbert Leeds and Jerry Weinberg:

“One of the lessons to be learned … is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.“

Testing is an open investigation. 100% coverage of a particular factor may be possible—but that requires a model so constrained that we leave out practically everything else that might be important. Test coverage, like quality, is not something that yields very well to quantitative measurements, except when we’re talking of very narrow and specific conditions. But we can discuss coverage, and ask questions about whether it’s what we want, whether we’re happy with it, or whether we want more.

Further reading:

Got You Covered
Cover or Discover
A Map by Any Other Name
What Counts

As Expected

April 12th, 2016

This morning, I started a local backup. Moments later, I started an online backup. I was greeted with this dialog:

Looks a little sparse. Unhelpful. But there is that “More details” drop-down to click on. Let’s do that.

Ah. Well, that’s more information. But it’s confusing and unhelpful, but I suppose it holds the promise of something more helpful to come. I notice that there’s a URL, but that it’s not a clickable link. I notice that if the dialog means what it says, I should copy those error codes and be ready to paste them into the page that comes up. I can also infer that there’s not local help for these error codes. Well, let’s click on the Knowledge Base button.

Oh. The issue is that another backup is running, and starting a second one is not allowed.

As a tester, I wonder how this was tested.

Was an automated check programmed to start a backup, start a second backup, and then query to see if a dialog appeared with the words “Failed to run now: task not executed” in it? If so, the behaviour is as expected, and the check passed.

Was an automated check programmed to start a backup, start a second backup, and then check for any old dialog to appear? If so, the behaviour is as expected, and the check passed.

Was a test script given to a tester that included the instruction to start a backup, start a second backup, and then check for a dialog to appear, including the words “Failed to run now: task not executed”? Or any old dialog that hinted at something? If so, the behaviour is as expected, and the “manual” test passed.

Here’s what that first dialog could have said: “A backup is in progress. Please wait for that backup to complete before starting another.”

At this company, what is the basic premise for testing? When testing is designed, and when results are interpreted, is the focus on confirming that the product “works as expected”? If so, and if the expectations above are met, no bug will be noticed. To me, this illustrates the basic bankruptcy of testing to confirm expectations; to “make sure the tests all pass”; to show that the product “meets requirements”. “Meets requirements”, in practice, is typically taken to mean “is consistent with statements in a requirements document, however misbegotten those statements might be”.

Instead of confirmation, “pass or fail”, “meets the requirements (documents)” or “as expected”, let’s test from the perspective of two questions: “Is there a problem here?” and “Are we okay with this?” As we do so, let’s look at some of the observations that we might make were and questions we might ask. (Notice that I’m doing this without reference to a specification or requirements document.)

Upon starting a local backup and then attempting to start an online backup, I observe this dialog.

I am surprised by the dialog. My surprise is an oracle, a means by which I might recognize a problem. Why am I surprised? Is there a problem here?

I had a desire to create a local backup and an online backup at the same time. On a multi-tasking, multi-threaded operating system, that desire seems reasonable to me, and I’m surprised that it didn’t happen.

Inconsistency with reasonable user desire is an oracle principle, linked to quality criteria that might include capability, usability, performance, and charisma. The product apparently fails to fulfill quality criteria that, in my opinion, a reasonable user might have. Of course, as a tester, I don’t run the project. So I must ask the designer, or the developer, or the product manager: Are we okay with this?

This might be exactly the dialog that has been programmed to appear under this condition—whatever the condition is. I don’t know that condition, though, because the dialog doesn’t tell me anything specific about the problem that the software is having with fulfilling my desire. So I’m somewhat frustrated, and confused. Is there a problem here?

I can’t explain or even understand what’s going on, other than the fact that my desire has been thwarted. My oracle—pointing to a problem—is inconsistency with explainability, in addition to inconsistency with my desires. So I’m seeing a potential problem not only with the product’s behaviour, but also in the dialog. Are we okay with this?

Maybe more information will clear that up.

Still nothing more useful here. All I see is a bunch of error codes; no further explanation of why the product won’t do what I want. I remain frustrated, and even more confused than before. In fact, I’m getting annoyed. Is there a problem here?

One key purpose of a dialog is to provide a user with useful information, and the product seems inconsistent with that (the inconsistency-with-purpose oracle). Are these codes correct? Maybe these error codes are wildly wrong. If they are, that would be a problem too. If that’s the case, I don’t have a spec available, so that’s a problem I’m simply going to miss. Are we okay with that?

I have to accept that, as a human being, there are some problems I’m going to miss—although, if I were testing this in-house, there are things I could do to address the gaps in my knowledge and awareness. I could note the codes and ask the developer about them; or I could ask for a table of the available codes. (Oh… no one has collected a comprehensive listing of the error codes; they’re just scattered through the product’s source code. Are we okay with this?)

Back to the dialog. Maybe those error codes are precisely correct, but they’re not helping me. Are we okay with this?

All right, so there’s that Knowledge Base button. Let’s try it. When I click on the button, this appears:

Let’s look at this in detail. I observe the title: 32493: Acronis True Image: “Failed to run now: task not executed.” That’s consistent with the message that was in the dialog. I notice the dates; something like this has been appeared in the knowledgebase for a while. In that sense, it seems that the product is consistent with its history, but is that a desirable consistency? Is there a problem here?

The error codes being displayed on this Web page seem consistent with the error codes in the dialog, so if there’s a problem with that, I don’t see it. Then I notice the line that says “You cannot run two tasks simultaneously.” Reading down over a long list of products, and through the symptoms, I observe that the product is not intended to perform two tasks simultaneously. The workaround is to wait until the first task is done; then start the second one. In that sense, the product indeed “works as expected”. And yet…are we okay with this?

Once again, it seems to me that attempting to start a second task could be a reasonable user desire. The product doesn’t support that, but maybe we’re okay with that. Yet is there a problem here?

The product displays a terse, cryptic error message that confuses and annoys the user without fulfilling its apparent intended purpose to inform the user of something. The product sends the user to the Web (not even to a local Help file!) to find that the issue is an ordinary, easily anticipated limitation of the program. It does look kind of amateurish to deal with this situation in this convoluted way, instead of simply putting the relevant information in the initial dialog. Is there a problem here?

I believe that this behaviour is inconsistent with an image that the company might reasonably want to project. The behaviour is also inconsistent with the quality criteria we call usability and charisma. A usable product is one that behaves in a way that allows the user to accomplish a task (including dealing with the product’s limitations) quickly and smoothly. A charismatic product is one that does its thing in an elegant way; that engages the user instead of irritating the user; that doesn’t make the development group look silly; that doesn’t prompt a blog post from a customer highlighting the silliness.

So here’s my bug report. Note that I don’t mention expectations, but I do talk about desires, and I cite two oracles. The title is “Unhelpful dialog inconsistent with purpose.” The body would say “Upon attempting to start a second backup while one is in progress, a dialog appears saying ‘Failed to run now: task not executed.’ While technically correct, this message seems inconsistent with the purpose of informing the user that we can’t perform two backup tasks at once. The user is then sent to the (online) knowledge base to find this out. This also seems inconsistent with the product’s image of giving the user a seamless, reliable experience. Is all this desired behaviour?”

Finally: it could be that the testers discovered all of these problems, and laid them out for the the product’s designers, developers, and managers, just as I’ve done here. And maybe the reports were dismissed because the product works “as expected”. But “as expected” doesn’t mean “no problem”. If I can’t trust a backup product to post a simple, helpful dialog, can I really trust it to back up my data?

You Are Not Checking

April 10th, 2016

Note: This post refers to testing and checking in the Rapid Software Testing namespace. This post has received a few minor edits since it was first posted.

For those disinclined to read Testing and Checking Refined, here are the definitions of testing and checking as defined by me and James Bach within the Rapid Testing namespace.

Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.

(A test is an instance of testing.)

Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.

(A check is an instance of checking.)

You are not checking. Well, you are probably not checking; you are certainly not only checking. You might be trying to do checking. Yet even if you are being asked to do checking, or if you think you’re doing checking, you will probably fail to do checking, because you are a human. You can do things that could be encoded as checks, but you will do many other things too, at the same time. You won’t be able to restrict yourself to doing only checking.

Checking is a part of testing that can be performed entirely algorithmically. Remember that: checking is a part of testing that can be performed entirely algorithmically. The exact parallel to that in programming is compiling: compiling is a part of programming that can be performed entirely algorithmically. No one talks of “automated compiling”, certainly not anymore. It is routine to think of compiling as an activity performed by a machine. We still speak of “automated checking” because we have only recently introduced “checking” as a term of art. We say “automated checking” to emphasize that checking by definition can be, and in practice probably should be, automated.

If you are trying to do only checking, you will screw it up, because you are not a robot. Your humanity—your faculties that allow you to make unprogrammed observations and evaluations; your tendency to vary your behaviour; your capacity to identify unanticipated risks—will prevent you from living to an algorithm. As a human tester—not a robot—you’re essentially incapable of sticking strictly to what you’ve been programmed to do. You will inevitably think or notice or conjecture or imagine or learn or evaluate or experiment or explore. At that point, you will have jumped out of checking and into the wider activities of testing. (What you do with the outcome of your testing is up to you, but we’d say that if your testing produces information that might matter to a client, you should probably follow up on it and report it.)

Your unreliability and your variability is, for testing, a good thing. Human variability is a big reason why you’ll find bugs even when you’re following a script that the scriptwriter—presumably—completed successfully. (In our experience, if there’s a test script, someone has probably tried to perform it and has run through it successfully at least once.)

So, unless you’ve given up your humanity, it is very unlikely that you are only checking. What’s more likely is that you are testing. There are specific observations that you may be performing, and there are specific decision rules that you may be applying. Those are checks, and you might be performing them as tactics in your testing. Many of your checks will happen below the level of your awareness. But just as it would be odd to describe someone’s activities at the dinner table as “biting” when they were eating, it would be odd to say that you were “checking” when you were testing.

Perhaps another one of your tactics, while testing, is programming a computer—or using a computer that someone else has programmed—to perform checking. In Rapid Software Testing, people who develop checks are generally called toolsmiths, or technical testers—people who are not intimidated by technology or code.

Remember: checking is a part of testing that can be performed entirely algorithmically. Therefore, if you’re a human, neither instructing the machine to start checking nor developing checks is “doing checking”.

Testers who develop checks are not “doing checking”. The checks themselves are algorithmic, and they are performed algorithmically by machinery, but the testers are not following algorithms as they develop checks, or deciding that a check should be performed, or evaluating the outcome of the checking. Similarly, programmers who develop classes and functions are not “doing compiling”. Those programmers are not following algorithms to produce code.

Toolsmiths who develop tools and frameworks for checking, and who program checks, are not “doing checking” either. Developers who produce tools and compilers for compiling are not “doing compiling”. Testers who produce checking tools should be seen as skilled specialists, just as developers who produce compilers are seen as skilled specialists. In order to develop excellent checks and excellent checking tools, a tester needs two distinct kinds of expertise: testing expertise, and programming and development expertise.

Testers apply checking as tactic of testing. Checking is embedded within a host of testing activities: modeling the test space; identifying risks; framing questions that can be asked about the product; encoding those questions in terms of algorithmic actions, observations, outcomes, and reports; choosing when the checking should be done; interpreting the outcome of checks, whether green or red.

Notice that checking does not find bugs. Testers—or developers temporarily in a testing role or a testing mindset—who apply checking find bugs, and the checks (and the checking) play a role in finding bugs.

In all of our talk about testing and checking, we are not attempting to diminish the role of people who create and use testing tools, including checks and checking. Nothing could be farther from the truth. Tools are vital to testing. Tools support testing.

We are, however, asking that testing not be reduced to checking. Checking is not testing, just as compiling is not software development. Checking may be a very important tactic in our testing, and as such, it is crucial to consider how it can be done expertly to assist our testing. It is important to consider the extents and limits of what checking can do for us. Testing a whole product while being fixated on checking is like like developing a whole product while being fixated on compiling.

Oracles from the Inside Out, Part 9: Conference as Oracle and as Destination

March 17th, 2016

Over this long series, I’ve described my process of reasoning about problems, using this table:

So far, I’ve mostly talked about the role of experience, inference, and reference. However, I’m typically testing for and with clients—product managers, developers, designers, documenters, and so forth. In doing so, I’m trying to establish a shared understanding of the product with the rest of the team. That understanding is developed through conference; conversation and interaction with those other people. So the lower left quadrant represents two things at once: a set of oracles on the one hand, and my destination on the other.

A brief recap: while testing, I experience and develop my own set of mental models of the product and feelings about it, and reason about possible problems in it. In many cases—for instance, when I get a feeling of surprise or confusion, I’m able to use the consistency principles in the upper right to make inferences that I’m seeing a problem. My inferences might be mediated by references like a document (a specification, or a diagram, or a standard) or a tool (a suite of automated checks, or something that helps me to aggregate and visualize patterns in the data). Those media afford a move from upper right to lower right, and back again to a stronger inference in the upper right.

In other cases, my experiences, inferences, and references may not be enough for me to convince myself that I’m seeing a problem or missing one. If so, one possible move is to ask another tester, a developer, a expert user, a novice user, a product owner, or subject matter expert for information or an opinion. (In Rapid Testing, we often call such a person a live oracle.) When I do that, I’m moving from inference to conference, from upper right to lower left. Occasionally that communication happens immediately and tacitly, without my having to refer to explicit inferences or references. More often, it’s a longer and more involved discussion.

I could use the expertise of a particular person as an oracle, and rely upon that person to declare that he or she is seeing a problem. However, perspectives differ, people have blind spots, everyone is capable of making a mistake, and what was true yesterday may not be true today. Thus there is a risk that a live oracle could be oblivious to certain kinds of problems, or could mislead me into believing there’s a problem where there isn’t one. No oracle—not even a live one, nor a group of them—is infallible. The expert user might not notice an ease-of-learning problem that would cause a novice to stumble. A new programmer might not see a usability problem that an experienced tester would notice right away.

Perhaps more interestingly, people might disagree about whether there’s a problem or not. Such disagreements themselves are oracles, alerting me to problems in the project as well as the product. Feelings can provide important clues about the meaning and the significance of a problem. As we work together, I can listen to people’s opinions, observe the emotional weight they carry, weigh agreements and disagreements between people who matter, and compare their feelings with my own. I move between conference and inference to to recognize or refine my perception of a problem.

The ultimate goal for my testing is to end up in that lower left quadrant with one person in particular: my most important client, the person responsible for making content and release decisions about the product. (That person may have one of a number of titles or labels, including product manager, program manager, project manager, development manager… Here, let’s call that person the Client.) I want my models and feelings about the product to be consistent with the Client’s models and feelings. Experience, inference, reference, and conference help me to do that.

Here’s a fact-based but somewhat fictionalized example. A few years ago, I was working at a financial institution. One of the technical support people mentioned in passing that a surprisingly high proportion of her work was dealing with failed transactions involving two banks out of the hundreds that we interacted with. That triggered a feeling of curiosity: was there a bug in our code? That feeling prompted me to investigate.

Each record had a transaction identifier associated with it. The transaction ID was generated from various bits of data, including the customer account number, and it included a calculated check digit. When I started testing, I noticed that the two banks in question used six-digit account numbers, rather than the more common seven-digit form. I cooked up a script to perform a large number of simulated transactions with those two banks. When I examined the logs, I found that a small number of transactions had invalid account numbers. That problem should have been trapped by the check digit functions, but the transactions were allowed to pass through the system unhindered.

When I mentioned the problem in passing to the product owner, I observed that she seemed unperturbed; she didn’t seem to be taking the problem very seriously. The discrepancy between our feelings suggested that one of two things must have be true: either I hadn’t framed the problem sufficiently well for her to recognize its significance; or she had information that I didn’t, information that would have changed my perception of the problem and lessened my emotional reaction to what I was seeing.

“The problem is only with those two banks,” she said. “Six-digit account numbers, right? We have to special-case those by adding a trailing zero for the check digit function. Something about the check digit calculation fails about one time in a couple of hundred, but the transaction goes through anyway. But later, when we send the acknowledgement packet, those two banks reject it. So six-digit numbers are a pain, but we’ve always been able to deal with the occasional failure.” Here she was using the “patterns of familiar problems” and “history” oracle principles as her means of recognizing a problem. But something else was going on: she was using those two principles to calibrate the significance of the problem in terms of her own mental models, and those principles were helping to dampen her concern. Those oracles suggested that to her that I was observing a problem, but not a big problem.

I did a search of the database, and discovered that there were eight other banks that used six-digit numbers. I wrote a quick script to extract all of the records for those banks. All of transactions had happened successfully.

“OK, but here’s what I found out,” I replied. “There are eight other banks that use six-digit numbers, and we’ve never seen a check-digit failure in those.”

“Really?” she said. “Wow. I thought those were the only two.” I could see that she was suddently more engaged. The fact that the product was inconsistent with itself was a powerful oracle. Awareness of the inconsistency raised her emotional state.

“Yep,” I said. “Here’s the thing: for those two banks—and only for those two—we’re serving up the wrong Web page to get input, which is obviously inconsistent with our design. That page provides the customer with a seven-digit input field. I looked at the logs, and I tried a bunch of stuff myself. Here’s what I think is happening: when the customer enters in a six-digit account number, the page rejects their input because it’s too short, and tells them they need to put in a seven-digit number. It looks to me like a few of the customers are trying to work around the error message by putting in a leading zero. They do that because we show an image to illustrate example input. That image is a seven-digit number that has a leading zero in it. What’s funny is that that the wrong thing to do—putting in a leading zero—actually succeeds every now and again; the hash function for the check digit generates a valid transaction ID by coincidence. Not very often, but enough for it to register.”

“Interesting!” she said. She smiled. “Good detective work there.”

“So, are we going to fix it?” I asked, confident that we finally had a shared understanding of the problem.


I was surprised, and felt myself becoming a little agitated. “Nope?!”

“Well, probably not. We’re replacing the whole input process in six months or so. Since we can deal with the problem as it is, and since the developers are busy on the new version, we’re cool with muddling along.” She noticed from my expression that I suddenly felt deflated. “Listen, that was some really good testing,” she said. “And I really appreciate the effort, and I understand your concern. I get that it’s a real problem for a handful of customers (here, she was acknowledging the inconsistency with user desires oracle), although once they’ve called us, they’re aware of the workaround. I know it does sound like a pretty easy fix, and we could fix it. But then we’d want to test it to make sure that the whole process keeps working for all of the customers of those banks, not just the ones who have had the problems. And with the new version coming up, trust me: you’ll have more than enough to do.”

I was a little disappointed that my investigation hadn’t resulted in a fix, but I did feel that she’d been listening. I had heard enough from her to dampen my own emotional state down so that it was well calibrated with hers.

When I observe a problem, the Client might or might not agree with me that it is a problem. That’s okay. As a tester, I’m not judge or jury for the problem, but I do want to make sure that my report has been heard and understood. After that, the Client can decide what she likes.

She might decide that it’s an important and urgent problem, and that it needs to be addressed right away. She might agree that it’s a problem, but not a problem worth fixing. She might believe that the problem is worth fixing, but not right away. She might dismiss my report of an inconsistency between the product some principle by citing other, more important principles with which the product is consistent.

Oracles give us means not only to recognize problems, but also to interpret and explain our feelings about them. When I can frame my experience—feelings and mental models—in terms of inferences about inconsistencies, I’m better prepared for a conversation—a conference—with my client about each problem, and why I believe it’s a problem.

A Context-Driven Approach to Automation in Testing

January 31st, 2016

(We interrupt the previously-scheduled—and long—series on oracles for a public service announcement.)

Over the last year James Bach and I have been refining our ideas about the relationships between testing and tools in Rapid Software Testing. The result is this paper. It’s not a short piece, because it’s not a light subject. Here’s the abstract:

There are many wonderful ways tools can be used to help software testing. Yet, all across industry, tools are poorly applied, which adds terrible waste, confusion, and pain to what is already a hard problem. Why is this so? What can be done? We think the basic problem is a shallow, narrow, and ritualistic approach to tool use. This is encouraged by the pandemic, rarely examined, and absolutely false belief that testing is a mechanical, repetitive process.

Good testing, like programming, is instead a challenging intellectual process. Tool use in testing must therefore be mediated by people who understand the complexities of tools and of tests. This is as true for testing as for development, or indeed as it is for any skilled occupation from carpentry to medicine.

You can find the article here. Enjoy!

Oracles from the Inside Out, Part 8: Successful Stumbling

November 26th, 2015

When we’re building a product, despite everyone’s good intentions, we’re never really clear about what we’re building until we try to build some of it, and then study what we’ve built. Even after that, we’re never sure, so to reduce risk, we must keep studying. For economy, let’s group the processes associated with that study—review, exploration, experimentation, modelling, checking, evaluating, among many others—and call them testing. Whether we’re testing running code or testing ideas about it, testing at every step reveals problems in what we’ve built so far, and in our ideas about what we’ve built.

Clever people have the capacity to detect some problems and address them before they become bigger problems. A smart business analyst is aware of unusual exceptions in a workflow, recognizes an omission in the requirements document, and gets it corrected. An experienced designer goes over her design in her head, notices a gap in her model, and refines it. A sharp programmer, pairing with another, realizes that a function is using a data type that will overflow, and points out the problem such that it gets fixed right away.

Notice that in each one of these cases, it’s not quite right to say that the business analyst, the designer, or the programmer prevented a problem. It’s more accurate to say that a person detected a little problem and prevented it from becoming a bigger problem. Bug-stuff was there, but a savvy person stomped it while it was an egg or a nymph, before it could hatch or develop into a full-blown cockroach. In order to prevent bigger problems successfully, we have to become expert at detecting the small ones while they’re small.

Sometimes we can be clever and anticipate problems, and design our testing to shine light on them. We can build collaboration into our designs, review into our specifications, and pairing into our programming. We can set up static analysis tools that check code for inconsistency with explicit rules. When we’re dealing with running code, testing might take the form of specific procedures for a tester to follow; sometimes it takes the form of explicit conditions to observe; and sometimes it takes the form of automated checks. All of these approaches can help to find problems along the way.

It’s a fact that when we’re testing, we don’t always find the problems we set out to find. One reason might be, alas, that the problems have successfully evaded our risk ideas, our procedures, our coverage, and our oracles. But another reason might be that, thanks to people’s diligence, some problems were squashed before they had a chance to encounter our testing for them.

Conversely, some problems that we do find are ones that we didn’t anticipate. Instead, we stumble over them. “Stumbling” may sound unappealing until we consider the role that serendipity—accidental or incidental discovery—has played in every aspect of human achievement.

So here, I’m not talking about stumbling in terms of clumsiness. Instead, I’m speaking in terms of what we might find, against the odds, through a combination of diligent search, experimentation, openness to discovery, and alertness—as people have stumbled over diamonds, lost manuscripts, new continents, or penicillin. Chance favours the explorer and—as Pasteur pointed out—the prepared mind. If we don’t open our testing to problems where customers could stumble, customers will find those places.

Productive stumbling can be extended and amplified by tools. They don’t have to be fancy tools by any means, either.

Example: Stuck for a specific idea about risk heuristic, I created some tables of more-or-less randomized data in Excel, and used a Perl script to cover all of the possible values in a four-digit data field. One of those values returned an inappropriate result—one stumble over a gold nugget of a bug. Completely unexpectedly, though, I also stumbled over a sapphire: while scanning quickly through the log file, using a blink oracle: every now and then, a transaction took ten times longer than it should have courtesy of a startling and completely unrelated bug.

Example: At a client site, I had a suspicion that a test script contained an unreasonable amount of duplication. I opened the file in a text editor, selected the first line in a data structure, hit the Ctrl-F key, and kept hitting it. I applied a blink oracle again: most of the text didn’t change at all; tiny patches, representing a handful of variables flickered. Within a few seconds I had discovered that the script wasn’t really doing anything significant except trying the same thing with different numbers. More importantly, I discovered that the tester needed real help in learning how to create flexible, powerful, and maintainable test code.

Example: I wrote a program as a testing exercise for our Rapid Software Testing class. A colleague used James Bach’s PerlClip tool to discover the limit on the amount of data that the program would accept. From this, he realized that he could determine precisely the maximum numeric value supported by the program, something that I, the programmer, had never considered. (When you’re in the building mindset, there’s a lot that you don’t consider.)

Example: Another colleague, testing the same program, used Excel to generate all of the possible values for one of the input fields. From this he determined that the program was interpreting input strings in ways that, once again, I had never considered. Just this test and the previous one revealed information that exploded my five-line description of the program into fifteen far more detailed lines, laden with surprises and exceptions. One of these lines represents a dangerous and subtle gotcha in the programming language’s standard libraries. All this learning came from a program that is, at its core, only two lines of code! What might we learn about a program that’s two million lines of code?

Example: In this series of posts on oracles, I’ve already recounted the tale of how James took data from hundreds of test runs, and used Excel’s conditional formatting feature to visualize the logged results. The visualizations instantly highlighted patterns that raised questions about the behaviour of a product, questions that fed back into refinements of the requirements and design decisions.

Example: While developing a tool to simulate multi-step transactions in a banking application, I discovered that the order in which the steps were performed had a significant impact on the bank’s profit on the overall transaction. This is only one instance of a pattern I’ve seen over and over again: while developing the infrastructure to perform checking, I stumble over bug after bug in the application to be tested. Subsequently, after the bugs are fixed and the product is stabilized and carefully maintained, the checks—despite their value as change detectors—don’t reveal bugs. Most of the value of the checks gets cashed in the testing activity that produces them.

Example: James performed 3000 identical queries on eBay; one query every two or three seconds. He expected random variation over time (i.e. a “drunkard’s walk”). Instead, the visualization allowed him to see suspicious repeating jumps and drops that looked anything but random. Analysis determined that he was probably seeing the effects of many servers responding to his query—some of which occasionally failed to contribute results before timing out.

These examples show how we can use tools powerfully: to generate data sets and increase coverage, so that we can bring specific conditions to our attention; to amplify signals amidst the noise; to highlight subtle patterns and make them clearly visible; to afford observation of things that we never expected to see; to perturb or stress the system such that rare or hidden problems become perceptible.

The traditional view of an oracle is an ostensibly “correct” reference that we can compare to the output from the program. A common view of test automation is using a tool to act like a robotic and unimaginative user to produce output to be checked against a reference oracle. A pervasive view of testing is nothing more than simple output checking, focused on getting right answers and ignoring the value of raising important new questions. In Rapid Software Testing, we think this is too narrow and limiting a view of oracles, of automation, and of testing itself. Testing is exploring a product and experimenting with it, so that we can learn about it, discover surprising things, and help our clients evaluate whether the product they’ve got is the product they want. Automated checking is only one way in which we can use tools to aid in our exploration, and to shine light on the product—and excellent automated checking depends on exploratory work to help us decide what might be interesting to check and to help us to refine our oracles. An oracle is any means—a feeling, principle, person, mechanism, or artifact—by which we might recognize a problem that we encounter during testing. And oracles have another role to play, which I’ll talk about in the last post in this long series.

Oracles from the Inside Out, Part 7: References as Checks

October 12th, 2015

Over the last few blog posts, I’ve been focusing on oracles—means by which we could recognize a problem when we encounter it during testing. So far, I’ve talked about

  • feelings and private mental models within internal, tacit experiences;
  • consistency heuristics by which we can make inferences that help us to articulate why we think and feel that something is a problem;
  • brief exchanges—tiny bursts of conferences—between people with collective tacit knowledge, and shared feelings and mental models;
  • data sets and tools that produce visualizations that are explicit—things that we can point to—references that we can observe, analyze and discuss

Most of the examples I’ve shown so far involve applying oracles retrospectively—seeing a problem and responding to it, starting in the top left corner of this diagram.

But maybe experience with the product isn’t the only place we could start. Maybe we could start in the bottom right of the table, with tools.

Let’s begin by asking ourselves why we can’t see instantly when things go wrong with software. Why aren’t all bugs immediately obvious? The first-order answer is itself obvious: software is invisible. It’s composed, essentially, of electrons running through sand, based on volumes of instructions written by fallible humans, and whatever happens takes place inside tiny boxes whose contents and inner structures are obscure to us. That answer should remind us to presume that many bugs are by their nature hidden and subtle. Given that, how can we make bugs obvious? If we wish to identify something that we can’t perceive with unaided observation, we’ll need internal or external instrumentation. If we think and talk about tool support to reveal bugs, we might choose to develop it more often, and learn to build it wisely and reliably.

With those facts in front of us, how might we prepare to ourselves to anticipate and to notice problems, with the help of tools to extend our observational powers?

1) Learn the product. That process may—indeed, should, if possible—start even before the product has been built; we can learn about the product and people’s ideas for it as it is being designed. We become more powerful testers as we add to our knowledge of the product, the problem it is intended to solve, and the conditions under which it will be used. This is not just an individual process, but a social process, a team process. A development group not only builds a product; it learns to build a product as it tries to build the product. Both kinds of learning continue throughout the project.

2) As we learn the product, consider risks. What could go wrong? Considering risks may also start before the product has been built, is also a collaborative process, and is also continuous. For instance…

  • A system may not be able to fulfill the user’s task. It may not produce the desired result; some feature or function may be missing. It may not do the right things, or it may do the right things in the wrong way. It may do the wrong things. That is, the system might be have a problem related to capability.
  • A program may exhibit inconsistent behaviour over time. Outputs may vary in undesirable ways. Functions or features may be unavailable from time to time. Something that works in one version may fail to work in the next. That is, the system might have a problem with reliability.
  • A system may be vulnerable to attack or manipulation, or it may expose data to the world that should be kept private. It may permit records to be changed or altered. That is, the system might have a problem related to security.
  • A system might run into difficulty when overloaded or starved of resources. The system might be slow to respond even under normal conditions. That is, the system may have problems with performance.
  • A system may have trouble handling more complex processing, larger amounts of data, or larger numbers of users than could be supported by the original design. That is, the system may have problems related to scalability.
  • Something that should be present might be absent; or something that should be absent might be present. Files might be missing from the distribution, or proprietary files might be included inadvertently. Registry entries or resource files might not include appropriate configuration settings. The uninstaller might leave rubbish lying around, zap data that the user wants to retain, or uninstall components of other programs. That is, the system may have problems related to installability.

This set of examples is by no means complete. There’s a long list of ways in which users might obtain value from a product, and practically an infinite list of ways things that could go wrong. Although machinery cannot evaluate quality, specific conditions within these quality criteria in particular are amenable to being checked by mechanisms either external or internal to the program. Those checks can direct human attention to those conditions. So…

3) As we learn about the product and what can go wrong, consider how a check might detect something going wrong. One rather obvious way to detect a problem would be to imagine a process that the product might follow, drive the product with a tool, and then check if the process comes to an end with a desirable result. For extra points, have the tool collect output produced by the product or some feature within it, and then have the tool check that data for correctness against some reference. But we could also use checking productively by

  • examining data at intermediate states, as it is being processed, and not only at output time;
  • evaluating components of the product to see if any are missing, or the wrong version, or superfluous;
  • identifying platform elements—systems or resources upon which our product depends—and their attributes, including their versions and their capabilities;
  • observing the environment in which the program is running, to see if it changes in some detectable and significant way;
  • monitoring and inspecting the system to determine when it enters some state, when some event occurs, or when some condition is fulfilled;
  • timing processes within the system that must be completed within known, specific limits.

When something noteworthy happens, we have the option of either logging the incident or being notified immediately by some kind of alert or alarm.

Checking of this kind is a special case of something more general: bringing problems to our awareness and attention. Again, machinery cannot evaluate quality or recognize threats to value, so a check requires us to anticipate and encode each specific condition to be checked, and, after the check, to interpret its outcome whether red or green.

Moreover, to match the value of checks with the cost of developing and maintaining them, and to avoid being overwhelmed by having to interpret the results from automated checks, we must find ways to decide what’s likely to be interesting and not so interesting. Using tools to help us learn about that is the subject of the next post.

Oracles from the Inside Out, Part 6: Oracles as Extensions of Testers

September 21st, 2015

The previous post in this series was something of a diversion from the main thread, but we’ll get back on track in this long-ish post. To review: Marshall McLuhan famously said that “the medium is the message”. He used this snappy slogan to point out that media, tools, and technologies are not only about what they contain; they’re also about themselves, changing the scale, pace, or pattern of human affairs and activities. As he somewhat less famously said, “We shape our tools; thereafter, our tools shape us.”

In the Rapid Software Testing namespace, we recognize the traditional interpretation of an oracle in software testing as “an external mechanism which can be used to check test output for correctness”. W.E. Howden, who introduced the term, said that oracles “can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions”. But after recognizing that interpretation, we offer a more general and, to us, a more useful notion of “oracle”: a means by which we recognize a problem when we encounter one during testing.

We also make a distinction between testing and checking. Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc. Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product. A check produces a bit—true or false, green or red, pass or fail. A check is a kind of formal testing; testing that is done is a specific way, or with the goal of determining specific facts. Checking is a tactic of testing, but it’s certainly not all there is to testing.

In that light, Howden’s examples of oracles—references—form bases for checks. Add McLuhan’s insights, and we can recognize that checks are media that extend and accelerate our capacity to observe inconsistency with a specific claim (“the product, given these inputs, should produce output consistent with the values in this table”), or with a comparable product (“the product, given this inputs, should produce output consistent with the values produced by this simulator”). So: automated checks are media that extend accelerate our capacity to perform checks; to observe and evaluate the product in accordance with explicit, algorithmic decision rules.

Why is all this a big deal?

One: because media don’t recognize problems; media extend, enhance, enable, accelerate, intensify, or amplify our capability to recognize problems.

My friend Pradeep Soundararajan attracted my attention and respect several years ago by this astute observation: “It is not a test that finds a bug; it is a human that finds a bug and a test plays a role in helping the human find it.” To paraphrase Pradeep, the tool doesn’t recognize the problem; the tool plays a role in recongnizing the problem. The microscope does not see; the microscope helps us to see. The burglar alarm doesn’t detect a burglary; the burglar alarm extends our senses over distance to recognize movement, whereby we can infer that a burglary might be happening. The Wikipedia entry Exploration of Mars errs by saying “The exploration of Mars is the study of Mars by spacecraft.” In fact, the exploration of Mars is the study of Mars by humans, enabled by spacecraft. Tools amplify whatever we are. Tools can extend people’s competence and capabilities to help them focus on what matters, allowing tehm to become aware of important new things. Tools can just as easily extend people’s incompetence and incapabilities to overfocus their attention on the known or the trivial, allowing to be oblivious to important things.

Two: because tools can do so much more for testers than automated checking.

Testing is not simply checking to determine whether we’re getting the right answers. Testing is also about making sure we’re asking important questions—and discovering important questions, and discovering things about how we might develop and apply checks.

For instance: my colleague James Bach was working on testing a medical device a few years back. I’ve done some work with this company too, so in order to respect non-disclosure agreements, let’s just say that the medical device is a Zapper Box, intended to deliver Healing Energy to a patient’s body over an Appropriate Period of Time. The Zapper Box is controlled by a Control Box and its software. Too much of a Good Thing can be a Very Bad Thing so, crucially, the Control Box is also intended to stop the Zapper Box from delivering that energy when the Appropriate Period of Time becomes an Inappropriate Period of Time, whereupon the Healing Energy becomes Killing Energy.

The Control Box has a display that shows the operator the amount of Healing Energy that is supposedly being delivered. James and one of the other testers set up a test rig, a meter to monitor the amount of Healing Energy that was actually being delivered. In one of the tests, they monitored and logged the amount of Healing Energy delivered after the device’s operator turned off the Healing Energy Tap via the Control Box. The log recorded measurements at intervals of one-tenth of a second. The testers did several hundred instances of this test. Then James took the logs, and used Microsoft Excel’s conditional formatting feature to highlight various levels of Healing Energy. Red indicated that the device was delivering Energy that could be described here as Very Hot; yellow represented Somewhat Hot; grey represented Cooling Down; green represented Cool. Then, James took over a hundred lines of data, and shrank the lines down until the numbers were too small to see, such that only two digits are readable: the numbers 1 and 2 labelling the vertical lines, which represent the one- and two-second marks after the Healing Energy Tap had been turned off. In Rapid Software Testing we call this a Blink Test, or using a Blink Oracle. Here’s what the testers saw.

What do you observe? Here’s what I observe:

  • Over about 150 test runs, the level of Energy appears to remain Very Hot for .3 seconds to over a second after the Healing Energy Tap was turned off.
  • Towards the end of the observed period, the device seems to remain in a Very Hot state for longer, and more often. (“Longer” here means tending closer one second than to .3 seconds.)
  • The variance in the Energy during the cooling period seems to be greater than at the beginning.
  • Over time, the level of Energy appears to remain in a Cooling Down state for longer and longer.
  • In at least one instance, the Zapper goes from a Cooling Down state back up to a Somewhat Hot state before returning to a Cooling Down state.
  • Starting from about the middle of the observed period, the Zapper appears many times not to reach the Cool state at all.

As I’m observing these results, two questions are looping in my mind. The first is “Is there a problem here?” The second is “Is my client okay with all this?”

When I see the inconsistency of the results, I experience a vague feeling of unease and confusion, followed by feelings of curiosity. The curiosity is about the product’s behaviour, but it’s also about why I feel uneasy and confused. Notice that my feelings and mental models are internal, tacit, and private, at the moment I have them. If I want to make sense of the relationships between my observations and my feelings, I have to do some detective work; feelings don’t come with return addresses on them.

I pause for a moment, and quickly consider: the cooldown period is not the same every time. I make an inference that that could be a problem, since we typically desire a product’s behaviour to be consistent with itself. Is there a problem here? I wonder if the product has shown a consistent, stable cooldown pattern in the past. If it has, it’s not doing that now. I make an inference that product might be inconsistent with its history. Is there a problem here? There’s a related inference: all other things being equal, we desire a product to be consistent with its history, so an inconsistency with its history would point to a possible problem. Another inference: there may be a document that makes a claim about the behaviour required to fulfill the desire of some important person. Inconsistency with that document (a reference; a medium that extends communication and expression of desire) could represent a problem, since (by inference) inconsistency with the client’s desire would probably be a problem too. I don’t have a copy of such a document. Is my client okay with that? Where is the document? Who wrote it? Whose desires are being represented? Can I confer with those people; that is, can I have conference with them? If I can’t, the quality and relevance of my analysis will be compromised. Is my client okay with that? Since this is a medical device, is there a standards document (a reference) to which it should conform? I don’t know; in order to sort that out, I might have engage in conference with a domain expert who is aware of such things. The expert is also a medium, extending my capacity the domain, and the desires of people in that domain. I might have to examine a reference and engage in conference to find a relevant reference, or to understand it.

If I pay attention, I notice that Excel (and its conditional formatting feature) are media too, extending my capacity to observe patterns in the behaviour of the Zapper Box. The image I’m looking at is a reference. The test rig hooked up to the Zapper Box is also a medium, a reference, enabling testers’ capacity to observe the amount of Energy at the electrode. The test rig’s log is a medium to which I can refer, providing a record of precise observations that I can analyse over the longer term. The Control Box also includes testability features that produce a log (yet another reference). Those features extend the testers’ capacity to record the actions of the control box, and the log enables the comparison between the test rig’s log and the operator’s intended actions. There is a whole network of references here. I could apply consistency oracles to all of them, singly and in combination.

My curiosity is also aroused by questions about the test. Did the testers follow the same protocol on each test run? Was the Zapper Box powered on for the same amount of time before the tester flipped the switch at the Control Box? Did the measurement start exactly when the Zapper Box was told to turn off? Was the connection between the Control Box and the Zapper Box reliable? How would we know? Did the testers allow the Zapper Box to cool down between test runs? For how long? For the same amount of time each time? Is the test rig measuring the Energy properly? How would we know?

Notice: although there has been extensive use of tools, no checking has occurred here! James and the testers tested the product, gathered data, and presented that data in a form that allowed them (and others) to see interesting patterns. I have been analyzing those patterns (and I hope you have too).

Some oracles just provide us with “pass or fail”, “true or false”, “green or red” results. Other oracles point us to possible problems upon which we may shine light. In this case, we would need more information—more references, more tools, more conference—before we could determine more clearly whether there is a problem. We might or might not need any of these things before the client could decide that there is a problem. More significantly, we would need more information to determine how we might be able to check for a problem. We must apply informal oracles before we can learn how to apply formal oracles well.

In other words: the process of recognizing a problem sometimes requires us to travel all over the oracle quadrants. That’s far more than “trying the product and comparing it to the specification”.

All this sets us up for the next few posts: if we considered the oracle principles (themselves media for recognizing problems that threaten people’s desires!), how could we imagine applying tools to enable, extend, enhance, accelerate, or intensify our capabilities as testers? What other roles might oracles play in the process? And if we are to use oracles and tools wisely, what effects—good and bad—could we anticipate from applying them? As we shape our tools, how might our tools shape us?

Oracles from the Inside Out, Part 5: Oracles as References as Media

September 15th, 2015

Try asking testers how they recognize problems. Many will respond that they compare the product to its specification, and when they see an inconsistency between the product and its specification, they report a bug. Others will talk about creating and running automated checks, using tools to compare output from the product to specific, pre-determined, expected results; when the product produces a result inconsistent with expectations, the check identifies a bug which the tester then reports to the developer or manager. It might be tempting to think of this as moving from the bottom right quadrant on this table to the bottom left.

Traditional talk about oracles refers almost exclusively to references. W.E. Howden, who introduced “oracle” as a term of testing art, said that an oracle as “an external mechanism which can be used to check test output for correctness”. Yet thinking of oracles in terms of correctness leads to some pretty serious problems. (I’ve outlined some of them here).

In the Rapid Software Testing namespace, we take a different, broader view of oracles. Rather than focusing on correctness, we focus on problems: an oracle is a means by which we recognize a problem when we encounter one during testing. Checking for correctness, as Howden puts it, may severely limit our capacity to notice many kinds of problems. A product or service can be correct with respect to some principle, but have plenty of problems that aren’t identified by that principle; and a product can produce incorrect results without the incorrectness representing a problem for anyone. When testers fixate on documented requirements, there’s a risk that they will restrict their attention to looking for inconsistencies with specific claims; when testers fixate on automated checks, there’s a risk that they will restrict their focus to inconsistency with a comparable algorithm. Focus your attention too narrowly on a particular oracle—or a particular class of oracle—and you can be confident of one thing: you’ll miss lots of bugs.

Documents and tools are media. In the most general sense, “medium” is descriptive of something in between, like “small” and “large”. But “medium” as a noun, a medium, can be between lots of things. A communication medium like radio sits between performers and an audience; a psychic medium, so the claim goes, provides a bridge between a person and the spirit world; when people want to exchange things of value, they use often use money as a medium for the exchange. Marshall McLuhan, an early and influential media theorist, said that a medium is anything that humans create or use to effect change. Media are tools, technologies that people use to extend, enhance, enable, accelerate, or intensify human capabilities. Extension is the most obvious and prominent effect of media. Most people think of media in terms of communications media. A medium can certainly be printed pages or television screens that enable messages to be conveyed from one person to another. McLuhan viewed the phonetic alphabet as a technology—a medium that extended the range of speech over great distances and accelerated its transmission. But a cup of coffee is a medium too; it extends alertness and wakefulness, and when consumed socially with others, it can extend conversation and friendliness. Media, placed between a product and our observation of it, extend our capacity to recognize bugs.

McLuhan emphasized that media change things in many different ways at the same time. In addition to extending or enabling or accelerating our capabilities, McLuhan said, every new medium obsolesces one or more existing media, grabbing our attention away from old things; every new medium retrieves notions of formerly obsolescent media, making old things new again. McLuhan used heat as a metaphor for the degree to which media require the involvement of the user; a “cool” medium like radio, he said, requires the listener to participate and fill in the missing pieces of the experience; a “hot” medium like a movie, provides stimulation to the ear and especially the eye, requiring less engagement from the viewer. Every medium, when “overheated” (McLuhan’s term for a medium that has been stretched or extended beyond its original or intended capacity), reverses into the opposite of what it might have been originally intended to accomplish. Socrates (and the King of Egypt) recognized that writing could extend memory, but could reverse into forgetfulness (see Plato’s dialogue Phaedrus). Coffee extends alertness and conversation, but too much of it and people become too wired work and too addled to chat. A medium always draws attention to itself to some degree; an overheated medium may dazzle us so much that we begin to ignore what it contains or what we intended it to do for us. More importantly, a medium affects us. This is one of the implications of McLuhan’s famous but oblique statement “the medium is the message”. By “message”, he means “the change of scale or pace or pattern” that a new invention or innovation “introduces into human affairs.” (This explanation comes from Mark Federman, to whom I’m indebted for explaining McLuhan’s work to me over the years.)

When we pay attention, we can easily observe media overheating both in talk about testing and development work and in the work itself. Documents and tools frequently dominate conversations. In some organizations, a problem won’t be considered a bug unless it is inconsistent with an explicit statement in a specification or requirements document. Yet documents are only partial representations, subsets, of what people claim to have known or believed at some point in time, and times change. In some places, testing work is dominated by automated checking. Checks can be very valuable, providing great precision and fast feedback. But checks may focus on functional aspects of the product, and less on other parafunctional attributes.

McLuhan’s work emphasizes that media are essentially neutral, agnostic to our purposes. It is our engagement with media that produces good or bad outcomes—good and bad outcomes. Perhaps the most important implication of McLuhan’s work is that media amplify whatever we are. If we’re fabulous testers, our tools extend our capabilities, helping us to be even more fabulous. But if we’re incompetent, tools extend our incompetence, allowing us to do bad testing faster and worse than we’ve ever been able to do it before. To the degee that we are inclined to avoid conflict and arguments, we will use documents to help us avoid conflict and arguments; to the degree that we are inclined to welcome discussion and the refinement of ideas, then documents can help us do that. If we are disposed to be alert to a wide range of problems, automated checks will help us as we diversify our scope; if we are oblivious to certain kinds of problems in the product, automated checks will amplify our oblivion.

Reference oracles—documents, checking tools, representative data, comparable products—are unquestionably media, extending all of the other kinds of oracles: private and shared mental models, both private and shared feelings, conversations with others, and principles of consistency. How can we evaluate them? What do we use them for? And how can we use them to help us find problems without letting them overwhelm or displace all the other ways we might have of finding problems? That’s the subject of the next post.