s/automation/programming/

June 2nd, 2016

Several years ago in one of his early insightful blog posts, Pradeep Soundarajan said this:

“The test doesn’t find the bug. A human finds the bug, and the test plays a role in helping the human find it.”

More recently, Pradeep said this:

Instead of saying, “It is programmed”, we say, “It is automated”. A world of a difference.

It occurred to me instantly that it could make a world of difference, so I played with the idea in my head.

Automated checks? “Programmed checks.” 

Automated testing? “Programmed testing.” 

Automated tester?  “Programmed tester.” 

Automated test suite?  “Programmed test suite.”

Let’s automate to do all the testing?  “Let’s write programs to do all the testing.”

Testing will be faster and cheaper if we automate. “Testing will be faster and cheaper if we write programs.”

Automation will replace human testers. “Writing programs will replace human testers.”

To me, the substitutions all generated a different perspective and a different feeling from the originals. When we don’t think about it too carefully, “automation” just happens; machines “do” automation. But when we speak of programming, our knowledge and experience remind us that we need people to do programming, and that good programming can be hard, and that good programming requires skill. And even good programming is vulnerable to errors and other problems.

So by all means, let’s use hardware and software tools skilfully to help us investigate the software we’re building.  Let’s write and develop and maintain programs that afford deeper or faster insight into our products (that is, our other programs) and their behaviour.  Let’s use and build tools that make data generation, visualisation, analysis, recording, and reporting easier. Let’s not be dazzled by writing programs that simply get the machinery to press its own buttons; let’s talk about how we might use our tools to help us reveal problems and risks that really matter to us and to our clients.  

And let’s consider the value and the cost and the risk associated with writing more programs when we’re already rationally uncertain about the programs we’ve got.

The Honest Manual Writer Heuristic

May 30th, 2016

Want a quick idea for a burst of activity that will reveal both bugs and opportunities for further exploration? Play “Honest Manual Writer”.

Here’s how it works: imagine you’re the world’s most organized, most thorough, and—above all—most honest documentation writer. Your client has assigned you to write a user manual, including both reference and tutorial material, that describes the product or a particular feature of it. The catch is that, unlike other documentation writers, you won’t base your manual on what the product should do, but on what it does do.

You’re also highly skeptical. If other people have helpfully provided you with requirements documents, specifications, process diagrams or the like, you’re grateful for them, but you treat them as rumours to be mistrusted and challenged. Maybe someone has told you some things about the product. You treat those as rumours too. You know that even with the best of intentions, there’s a risk that even the most skillful people will make mistakes from time to time, so the product may not perform exactly as they have intended or declared. If you’ve got use cases in hand, you recognize that they were written by optimists. You know that in real life, there’s a risk that people will inadvertently blunder or actively misuse the product in ways that its designers and builders never imagined. You’ll definitely keep that possibility in mind as you do the research for the manual.

You’re skeptical about your own understanding of the product, too. You realize that when the product appears to be doing something appropriately, it might be fooling you, or it might be doing something inappropriate at the same time. To reduce the risk of being fooled, you model the product and look at it from lots of perspectives (for example, consider its structure, functions, data, interfaces, platform, operations, and its relationship to time; and business risk, and technical risk). You’re also humble enough to realize that you can be fooled in another way: even when you think you see a problem, the product might be working just fine.

Your diligence and your ethics require you to envision multiple kinds of users and to consider their needs and desires for the product (capability, reliability, usability, charisma, security, scalability, performance, installability, supportability…). Your tutorial will be based on plausible stories about how people would use the product in ways that bring value to them.

You aspire to provide a full accounting of how the product works, how it doesn’t work, and how it might not work—warts and all. To do that well, you’ll have to study the product carefully, exploring it and experimenting with it so that your description of it is as complete and as accurate as it can be.

There’s a risk that problems could happen, and if they do, you certainly don’t want either your client or the reader of your manual to be surprised. So you’ll develop a diversified set of ways to recognize problems that might cause loss, harm, annoyance, or diminished value. Armed with those, you’ll try out the product’s functions, using a wide variety of data. You’ll try to stress out the product, doing one thing after another, just like people do in real life. You’ll involve other people and apply lots of tools to assist you as you go.

For the next 90 minutes, your job is to prepare to write this manual (not to write it, but to do the research you would need to write it well) by interacting with the product or feature. To reduce the risk that you’ll lose track of something important, you’ll probably find it a good idea to map out the product, take notes, make sketches, and so forth. At the end of 90 minutes, check in with your client. Present your findings so far and discuss them. If you have reason to believe that there’s still work to be done, identify what it is, and describe it to your client. If you didn’t do as thorough a job as you could have done, report that forthrightly (remember, you’re super-honest). If anything got in the way of your research or made it more difficult, highlight that; tell your client what you need or recommend. Then have a discussion with your client to agree on what you’ll do next.

Did you notice that I’ve just described testing without using the word “testing”?

Testers Don’t Prevent Problems

May 4th, 2016

Testers don’t prevent errors, and errors aren’t necessarily waste.

Testing, in and of itself, does not prevent bugs. Platform testing that reveals a compatibility bug provides a developer with information. That information prompts him to correct an error in the product, which prevents that already-existing error from reaching and bugging a customer.

Stress testing that reveals a bug in a function provides a developer with information. That information helps her to rewrite the code and remove an error, which prevents that already-existing error from turning into a bug in an integrated build.

Review (a form of testing) that reveals an error in a specification provides a product team with information. That information helps the team in rewriting the spec correctly, which prevents that already-existing error from turning into a bug in the code.

Transpection (a form of testing) reveals an error in a designer’s idea. The conversation helps the designer to change his idea to prevent the error from turning into a design flaw.

You see? In each case, there is an error, and nothing prevented it. Just as smoke detectors don’t prevent fires, testing on its own doesn’t prevent problems. Smoke detectors direct our attention to something that’s already burning, so we can do something about it and prevent the situation from getting worse. Testing directs our attention to existing errors. Those errors will persist—presumably with consequences—unless someone makes some change that fixes them.

Some people say that errors, bugs, and problems are waste, but they are not in themselves wasteful unless no one learns from them and does something about them. On the other hand, every error that someone discovers represents an opportunity to take action that prevents the error from becoming a more serious problem. As a tester, I’m fascinated by errors. I study errors: how people commit errors (bug stories; the history of engineering), why they make errors (fallible heuristics; cognitive biases), where we might find errors (coverage), how we might recognize errors (oracles). I love errors. Every error that is discovered represents an opportunity to learn something—and that learning can help people to change things in order to prevent future errors.

So, as a tester, I don’t prevent problems. I play a role in preventing problems by helping people to detect errors. That allows those people to prevent those errors from turning into problems that bug people.

Still squeamish about errors? Read Jerry Weinberg’s e-book, Errors: Bugs, Boo-boos, Blunders.

Is There a Simple Coverage Metric?

April 26th, 2016

In response to my recent blog post, 100% Coverage is Possible, reader Hema Khurana asked:

“Also some measure is required otherwise we wouldn’t know about the depth of coverage. Any straight measures available?”

I replied, “I don’t know what you mean by a ‘straight’ measure. Can you explain what you mean by that?”

Hema responded: “I meant a metric some X/Y.”

In all honesty, it’s sometimes hard to remain patient when this question seems to come up at every conference, in every class, week upon week, year upon year. Asking me about this is a little like asking Chris Hadfield—since he’s a well-known astronaut and a pretty smart guy—if he could provide a way of measuring the area of the flat, rectangular earth. But Hema hasn’t asked me before, and we’ve never met, so I don’t want to be immediately dismissive.

My answer, my fast answer, is No. One key problem here is related to what Y could possibly represent. What counts? Maybe we could talk about Y in terms of a number of test cases, and X as how many of those test cases we’ve executed so far. If Y is 600 and X is 540, we could say that testing is 90% done. But that ignores at least two fundamental problems.

The first problem is that, irrespective of the number of test cases we have, we could choose to add more at any time as (via testing) we discover different conditions that we would like to evaluate. Or maybe we could choose to drop test cases when we realize that they’re out of date or irrelevant or erroneous. That is, unless we decide to ignore what we’ve learned, Y will, quite appropriately, change over time.
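
A tiny sketch of that first problem, using invented numbers: nothing gets “undone”, yet the percentage moves simply because the denominator changes as we learn.

    executed, planned = 540, 600
    print(f"{executed / planned:.0%} done")   # 90% done

    # Testing reveals 150 new conditions worth examining, and 30 existing
    # test cases turn out to be obsolete; Y changes, quite appropriately.
    planned = planned + 150 - 30
    print(f"{executed / planned:.0%} done")   # 75% done, though no work was lost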

The second problem is that—at least in my view, and in the view of my colleagues—test cases are a ludicrous way to think about testing.

Another almost-as-quick answer would be to encourage people to re-read that 100% Coverage is Possible post (and the Further Reading links), and to keep re-reading until they get it.

But that’s probably not very encouraging to someone who is asking a naive question, and I’d like to be more helpful than that.

Here’s one thing we could do, if someone were desperate for numbers that summarize coverage: we could make a qualitative evaluation of coverage, and put numbers (or letters, or symbols) on a scale that is nominal and very weakly ordinal.

Our qualitative evaluation would be rooted in analysis of many dimensions of coverage. The Product Elements and Quality Criteria sections of the Heuristic Test Strategy Model provide a framework for generating coverage ideas or for reviewing our coverage retrospectively. We would review and discuss how much testing we’ve done of specific features, or particular functional areas, or perceived risks, and summarize our evaluation using a simple scale that would go something like this:

Level 0 (or X, or an empty circle, or…): We know nothing at all about this area of the product.

Level 1 (or C, or a glassy-eyed emoticon, or…): We have done a very cursory evaluation of this area. Smoke- or sanity-level; we’ve visited this feature and had a brief look at it, but we don’t really know very much about it; we haven’t probed it in any real depth.

Level 2 (or B, or a normal-looking emoticon, or…): We’ve had a reasonable look at this area, although we haven’t gone all the way deep. We’ve examined the common, the core, the critical, the happy paths, the handling of everyday errors or exceptions. We’re pretty familiar with this area. We’ve done the kind of testing that would expose some significant bugs, if they were there.

Level 3 (or A, or a determined-looking angel emoticon, or…): We’ve really kicked this area harshly and hard. We’ve looked at unusual and complex conditions or states. We’ve probed deeply for subtle or hidden bugs. We’ve exposed the product to the extreme, the exceptional, the rare, the improbable. We’ve looked for bugs that are deep in the corners or hidden in the dark. If there were a serious bug, we’re pretty sure we would have found it by now.

Strictly speaking, these numbers are placed on an ordinal scale, in the sense that Level 3 coverage is deeper than Level 2, which is deeper than Level 1. (If you don’t know about scales of measurement, you should learn about them before providing or asking for metrics. And there are some other things to look at.) The numbers are certainly not an interval scale, or a ratio scale. They may not be commensurate from one feature area to the next; that is, they may represent different notions of coverage, different amounts of effort, different modes of evaluation. By design, these numbers should not be treated as valid measurements, and we should make sure that everyone on the project knows it. They are little labels that summarize evaluations and product elements and effort, factors that must be discussed to be understood. But those discussions can lead to understanding and consensus between ourselves, our colleagues, and our clients.
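
If someone still wants a summary artifact, here is a minimal sketch of how such labels might be recorded (the feature areas and the structure are invented for illustration), keeping attached the notes that give the labels their meaning:

    # Hypothetical coverage summary: labels on a weakly ordinal scale, not
    # measurements; meaningful only alongside the notes and the conversations.
    coverage_summary = {
        "Login":         {"level": 2, "notes": "Happy paths and everyday errors examined; no concurrency or lockout probing yet."},
        "Report export": {"level": 1, "notes": "Smoke-level look only."},
        "Audit trail":   {"level": 0, "notes": "Not examined at all."},
    }

    for area, summary in coverage_summary.items():
        print(f"{area}: Level {summary['level']} - {summary['notes']}")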

It’s Not A Factory

April 19th, 2016

One model for a software development project is the assembly line on the factory floor, where we’re making a buhzillion copies of the same thing. And it’s a lousy model.

Software is developed in an architectural studio with people in it. There are drafting tables, drawing instruments, good lighting, pens and pencils and paper. And erasers, and garbage cans that get full of coffee cups and crumpled drawings. Good ideas become better ideas as they are sketched, analysed, criticised, and revised. A lot of bad ideas are discovered and rejected before the final plans are drawn.

Software is developed in a rehearsal hall with people in it. The room is also filled with risers and chairs and other temporary staging elements, and with substitute props that stand in for the finished products. There’s a piano to accompany the singers while the orchestra is being rehearsed in another hall. Lighting, sound, costumes and makeup are designed and folded into the rehearsal process as we experiment with different ways of bringing the show to life. Everyone tries stuff that doesn’t work, or doesn’t fit, or doesn’t sound right, or doesn’t look good at first. Frustration arises, feelings get bruised, and then breakthroughs happen and problems get solved. Lots of experiments lead to that joyful and successful opening night.

Software is developed in a workshop with people in it; skilled craftspeople who build tools and workspaces for themselves and each other, as part of the process of crafting products for people to buy. Even though they try to keep the shop clean, there’s occasional sawdust and smoke and spilled glue and broken machinery. Work in progress gets tested, and weaknesses are exposed—sometimes late in the game—and get fixed.

In all of these places, variation is encouraged. Designs are tinkered with. Discoveries are celebrated. Learning happens. Most importantly, skill and tacit knowledge are both applied and developed.

The Lean model for software development might seem a more humane step forward from the older days, but it’s still based on the factory. Ideas aren’t widgets whose delivery you can schedule just in time. Failed experiments aren’t waste when you learn from them, and if you know it won’t be waste from the outset, it’s not really an experiment. Everything that makes it into the product should represent something that the customer values, but when we’re creating something novel (which we’re always doing to some degree as we’re building software), we’re exploring and trying things out to help refine our understanding of what the customer actually values.

If there is any parallel between software and manufacturing, it is this: the “software development” part of manufacturing happens before the assembly line—in the design studio, where the prototypes are being developed, refined, and selected for mass production. The manufacturing part? That’s the copy command that deploys a copy of the installation package to all the machines in the enterprise, or the disk duplicator that stamps out a million DVDs with copies of the golden master on it, or the Web server that delivers a copy of the product to anyone who requests it. Getting to that first copy, though? That’s a studio thing, not an assembly-line thing.

The primary inspiration for this post is a conversation I had with Cem Kaner in 2008. Another is the book Artful Making by Robert Austin and Lee Devin, which I first read around the same time. Yet another is Christopher Alexander’s A Pattern Language. One more: my long-ago career in theatre, which prepared me better than you can imagine for a life in software development.

100% Coverage is Possible

April 16th, 2016

In testing, what does “100% coverage” mean? 100% of what, specifically?

Some people might say that “100% coverage” could refer to lines of code, or branches within the code, or the conditions associated with the branches. That’s fine, but saying “100% of the lines (or branches, or conditions) in the program were executed” doesn’t tell us anything about whether those lines were good or bad, useful or useless. It doesn’t tell us anything about what the programmers intended, what the user desired, or what the tester observed. It says nothing about the tester’s engagement with the testing; whether the tester was asleep or awake. It ignores the oracles that the tester applied; how the tester recognized—or failed to recognize—bugs and other problems that were encountered during the testing. It suggests that some machinery processed something; nothing more.
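
As a small, invented illustration of that point: the check below executes every line of the function, so a coverage tool would report 100% line coverage for it, yet it observes nothing about the outcome, and the bug goes unnoticed.

    # Hypothetical example: 100% line coverage, zero evaluation.
    def apply_discount(price, percent):
        discount = price * percent / 100
        return price + discount        # bug: the discount should be subtracted

    def test_apply_discount():
        # Every line of apply_discount executes, so line coverage is "100%"...
        apply_discount(100, 10)
        # ...but nothing here evaluates the result, so the wrong answer
        # (110 instead of 90) is never noticed.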

Here’s a potentially helpful way to think about this:

“X coverage is how thoroughly we have examined the product with respect to some model of X”.

So: risk coverage is how thoroughly we have examined the product with respect to some model of risk; requirements coverage is how thoroughly we have examined the product with respect to some model of requirements; code coverage is how thoroughly we have examined the product with respect to some model of code.

To claim 100% coverage is essentially the same as saying “We’ve looked for bugs everywhere!” For a skilled tester, any “100%” claim about coverage should prompt critical thinking: “How much” compared to what? 100% of what, specifically? Some model of X—which one? Whose model? How well does the “model of X” model reality? What does the model of X leave out of the universe of possible ways of thinking about X? And what non-X things should we also be considering when we’re testing?

Here’s just one example: code coverage is usually described in terms of the code that we’ve written, or that we have available to evaluate. Yet every program we write interacts with some platform that might include third-party libraries, browsers, plug-ins, operating systems, file systems, firmware. Our code might interact with our own libraries that we haven’t instrumented this time. So “code coverage” refers to some code in the system, but not all the code in the system.

Once I did a test (or was it 10,000 tests?) wherein I used an automated check to run through all 10,000 possible settings of a particular variable. That was 100% coverage of that variable being used in a particular moment in the execution of the system, on that day. But it was not 100% of all the possible sequences of those settings, nor 100% of the possible subsequent paths through the product. It wasn’t 100% of the possible variations in pacing, or system load, or times of day when the system could be used. That test wasn’t representative of all of the possible stakeholders who might be using that variable, nor how they might use it.
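
A hedged sketch of what such a check might have looked like (the product interface and its calls are invented for this example): it exercises each of the 10,000 values once, in one order, on one day, under one load, and nothing more.

    # Hypothetical check: every value of one setting, exactly once.
    def check_all_settings(product):             # 'product' is an invented interface
        failures = []
        for value in range(10_000):
            product.set_refresh_interval(value)   # invented call
            if product.status() != "OK":          # invented observation
                failures.append(value)
        return failures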

What would “100% requirements coverage” mean? Would it mean that every statement in the requirements document was covered by a test? If you think so, it might be worthwhile to consider all the models that are in play. The requirements document is a model of the product’s requirements. It refers to ideas that have been explicitly expressed by some people, but not by all of the people who might have requirements for the product. The requirements document models what those people thought they wanted at a certain point, but not necessarily what they want now. The requirements document doesn’t account for all of the ideas that people had that may have been tacit, or implicit, or latent. You can subject “statement”, “covered”, and “test” to the same kind of treatment. A statement is a model of what someone is thinking at a given point in time; our notion of what “covered” means is governed by our models of coverage; our notion of “a test” is conditioned by our models of testing. It’s models all the way down.

Things in testing keep reminding me of a passage from Computer Programming Fundamentals by Herbert Leeds and Jerry Weinberg:

“One of the lessons to be learned … is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.“

Testing is an open investigation. 100% coverage of a particular factor may be possible—but that requires a model so constrained that we leave out practically everything else that might be important. Test coverage, like quality, is not something that yields very well to quantitative measurements, except when we’re talking of very narrow and specific conditions. But we can discuss coverage, and ask questions about whether it’s what we want, whether we’re happy with it, or whether we want more.

Further reading:

Got You Covered http://developsense.com/articles/2008-09-GotYouCovered.pdf
Cover or Discover http://developsense.com/articles/2008-10-CoverOrDiscover.pdf
A Map by Any Other Name http://developsense.com/articles/2008-11-AMapByAnyOtherName.pdf
What Counts http://www.developsense.com/articles/2007-11-WhatCounts.pdf

As Expected

April 12th, 2016

This morning, I started a local backup. Moments later, I started an online backup. I was greeted with this dialog:

Looks a little sparse. Unhelpful. But there is that “More details” drop-down to click on. Let’s do that.

Ah. Well, that’s more information. It’s confusing and unhelpful, but I suppose it holds the promise of something more helpful to come. I notice that there’s a URL, but that it’s not a clickable link. I notice that if the dialog means what it says, I should copy those error codes and be ready to paste them into the page that comes up. I can also infer that there’s no local help for these error codes. Well, let’s click on the Knowledge Base button.

Oh. The issue is that another backup is running, and starting a second one is not allowed.

As a tester, I wonder how this was tested.

Was an automated check programmed to start a backup, start a second backup, and then query to see if a dialog appeared with the words “Failed to run now: task not executed” in it? If so, the behaviour is as expected, and the check passed.

Was an automated check programmed to start a backup, start a second backup, and then check for any old dialog to appear? If so, the behaviour is as expected, and the check passed.

Was a test script given to a tester that included the instruction to start a backup, start a second backup, and then check for a dialog to appear, including the words “Failed to run now: task not executed”? Or any old dialog that hinted at something? If so, the behaviour is as expected, and the “manual” test passed.
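
For illustration, here is roughly what such a confirmatory check might look like in code (the test-driver calls are invented). It encodes the expectation literally, so it passes whether or not the message is of any help to a human:

    # Hypothetical confirmatory check: the product behaves "as expected",
    # so the check passes; nothing here asks whether the dialog is helpful.
    def check_second_backup_dialog(app):          # 'app' is an invented test driver
        app.start_local_backup()
        app.start_online_backup()
        dialog = app.wait_for_dialog(timeout=10)
        assert "Failed to run now: task not executed" in dialog.text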

Here’s what that first dialog could have said: “A backup is in progress. Please wait for that backup to complete before starting another.”

At this company, what is the basic premise for testing? When testing is designed, and when results are interpreted, is the focus on confirming that the product “works as expected”? If so, and if the expectations above are met, no bug will be noticed. To me, this illustrates the basic bankruptcy of testing to confirm expectations; to “make sure the tests all pass”; to show that the product “meets requirements”. “Meets requirements”, in practice, is typically taken to mean “is consistent with statements in a requirements document, however misbegotten those statements might be”.

Instead of confirmation, “pass or fail”, “meets the requirements (documents)” or “as expected”, let’s test from the perspective of two questions: “Is there a problem here?” and “Are we okay with this?” As we do so, let’s look at some of the observations that we might make and the questions we might ask. (Notice that I’m doing this without reference to a specification or requirements document.)

Upon starting a local backup and then attempting to start an online backup, I observe this dialog.

I am surprised by the dialog. My surprise is an oracle, a means by which I might recognize a problem. Why am I surprised? Is there a problem here?

I had a desire to create a local backup and an online backup at the same time. On a multi-tasking, multi-threaded operating system, that desire seems reasonable to me, and I’m surprised that it didn’t happen.

Inconsistency with reasonable user desire is an oracle principle, linked to quality criteria that might include capability, usability, performance, and charisma. The product apparently fails to fulfill quality criteria that, in my opinion, a reasonable user might have. Of course, as a tester, I don’t run the project. So I must ask the designer, or the developer, or the product manager: Are we okay with this?

This might be exactly the dialog that has been programmed to appear under this condition—whatever the condition is. I don’t know that condition, though, because the dialog doesn’t tell me anything specific about the problem that the software is having with fulfilling my desire. So I’m somewhat frustrated, and confused. Is there a problem here?

I can’t explain or even understand what’s going on, other than the fact that my desire has been thwarted. My oracle—pointing to a problem—is inconsistency with explainability, in addition to inconsistency with my desires. So I’m seeing a potential problem not only with the product’s behaviour, but also in the dialog. Are we okay with this?

Maybe more information will clear that up.

Still nothing more useful here. All I see is a bunch of error codes; no further explanation of why the product won’t do what I want. I remain frustrated, and even more confused than before. In fact, I’m getting annoyed. Is there a problem here?

One key purpose of a dialog is to provide a user with useful information, and the product seems inconsistent with that (the inconsistency-with-purpose oracle). Are these codes correct? Maybe these error codes are wildly wrong. If they are, that would be a problem too. If that’s the case, I don’t have a spec available, so that’s a problem I’m simply going to miss. Are we okay with that?

I have to accept that, as a human being, there are some problems I’m going to miss—although, if I were testing this in-house, there are things I could do to address the gaps in my knowledge and awareness. I could note the codes and ask the developer about them; or I could ask for a table of the available codes. (Oh… no one has collected a comprehensive listing of the error codes; they’re just scattered through the product’s source code. Are we okay with this?)

Back to the dialog. Maybe those error codes are precisely correct, but they’re not helping me. Are we okay with this?

All right, so there’s that Knowledge Base button. Let’s try it. When I click on the button, this appears:

Let’s look at this in detail. I observe the title: 32493: Acronis True Image: “Failed to run now: task not executed.” That’s consistent with the message that was in the dialog. I notice the dates; something like this has appeared in the knowledge base for a while. In that sense, it seems that the product is consistent with its history, but is that a desirable consistency? Is there a problem here?

The error codes being displayed on this Web page seem consistent with the error codes in the dialog, so if there’s a problem with that, I don’t see it. Then I notice the line that says “You cannot run two tasks simultaneously.” Reading down over a long list of products, and through the symptoms, I observe that the product is not intended to perform two tasks simultaneously. The workaround is to wait until the first task is done; then start the second one. In that sense, the product indeed “works as expected”. And yet…are we okay with this?

Once again, it seems to me that attempting to start a second task could be a reasonable user desire. The product doesn’t support that, but maybe we’re okay with that. Yet is there a problem here?

The product displays a terse, cryptic error message that confuses and annoys the user without fulfilling its apparent intended purpose to inform the user of something. The product sends the user to the Web (not even to a local Help file!) to find that the issue is an ordinary, easily anticipated limitation of the program. It does look kind of amateurish to deal with this situation in this convoluted way, instead of simply putting the relevant information in the initial dialog. Is there a problem here?

I believe that this behaviour is inconsistent with an image that the company might reasonably want to project. The behaviour is also inconsistent with the quality criteria we call usability and charisma. A usable product is one that behaves in a way that allows the user to accomplish a task (including dealing with the product’s limitations) quickly and smoothly. A charismatic product is one that does its thing in an elegant way; that engages the user instead of irritating the user; that doesn’t make the development group look silly; that doesn’t prompt a blog post from a customer highlighting the silliness.

So here’s my bug report. Note that I don’t mention expectations, but I do talk about desires, and I cite two oracles. The title is “Unhelpful dialog inconsistent with purpose.” The body would say “Upon attempting to start a second backup while one is in progress, a dialog appears saying ‘Failed to run now: task not executed.’ While technically correct, this message seems inconsistent with the purpose of informing the user that we can’t perform two backup tasks at once. The user is then sent to the (online) knowledge base to find this out. This also seems inconsistent with the product’s image of giving the user a seamless, reliable experience. Is all this desired behaviour?”

Finally: it could be that the testers discovered all of these problems, and laid them out for the product’s designers, developers, and managers, just as I’ve done here. And maybe the reports were dismissed because the product works “as expected”. But “as expected” doesn’t mean “no problem”. If I can’t trust a backup product to post a simple, helpful dialog, can I really trust it to back up my data?

You Are Not Checking

April 10th, 2016

Note: This post refers to testing and checking in the Rapid Software Testing namespace. This post has received a few minor edits since it was first posted.

For those disinclined to read Testing and Checking Refined, here are the definitions of testing and checking as defined by me and James Bach within the Rapid Testing namespace.

Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc.

(A test is an instance of testing.)

Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product.

(A check is an instance of checking.)

You are not checking. Well, you are probably not checking; you are certainly not only checking. You might be trying to do checking. Yet even if you are being asked to do checking, or if you think you’re doing checking, you will probably fail to do checking, because you are a human. You can do things that could be encoded as checks, but you will do many other things too, at the same time. You won’t be able to restrict yourself to doing only checking.

Checking is a part of testing that can be performed entirely algorithmically. Remember that: checking is a part of testing that can be performed entirely algorithmically. The exact parallel to that in programming is compiling: compiling is a part of programming that can be performed entirely algorithmically. No one talks of “automated compiling”, certainly not anymore. It is routine to think of compiling as an activity performed by a machine. We still speak of “automated checking” because we have only recently introduced “checking” as a term of art. We say “automated checking” to emphasize that checking by definition can be, and in practice probably should be, automated.
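
To make “algorithmic” concrete, here is a minimal sketch of a check (the product call is invented): one specific observation, one decision rule, and nothing else.

    # Hypothetical check: a specific observation plus an algorithmic decision rule.
    # A machine can apply this. Deciding that it was worth applying, and what a
    # pass or a fail might mean, is testing, and that part is not algorithmic.
    def check_login_status(product):
        observed = product.login_status_code("alice", "correct-password")  # invented call
        return observed == 200    # decision rule: True ("pass") or False ("fail")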

If you are trying to do only checking, you will screw it up, because you are not a robot. Your humanity—your faculties that allow you to make unprogrammed observations and evaluations; your tendency to vary your behaviour; your capacity to identify unanticipated risks—will prevent you from living to an algorithm. As a human tester—not a robot—you’re essentially incapable of sticking strictly to what you’ve been programmed to do. You will inevitably think or notice or conjecture or imagine or learn or evaluate or experiment or explore. At that point, you will have jumped out of checking and into the wider activities of testing. (What you do with the outcome of your testing is up to you, but we’d say that if your testing produces information that might matter to a client, you should probably follow up on it and report it.)

Your unreliability and your variability are, for testing, good things. Human variability is a big reason why you’ll find bugs even when you’re following a script that the scriptwriter—presumably—completed successfully. (In our experience, if there’s a test script, someone has probably tried to perform it and has run through it successfully at least once.)

So, unless you’ve given up your humanity, it is very unlikely that you are only checking. What’s more likely is that you are testing. There are specific observations that you may be performing, and there are specific decision rules that you may be applying. Those are checks, and you might be performing them as tactics in your testing. Many of your checks will happen below the level of your awareness. But just as it would be odd to describe someone’s activities at the dinner table as “biting” when they were eating, it would be odd to say that you were “checking” when you were testing.

Perhaps another one of your tactics, while testing, is programming a computer—or using a computer that someone else has programmed—to perform checking. In Rapid Software Testing, people who develop checks are generally called toolsmiths, or technical testers—people who are not intimidated by technology or code.

Remember: checking is a part of testing that can be performed entirely algorithmically. Therefore, if you’re a human, neither instructing the machine to start checking nor developing checks is “doing checking”.

Testers who develop checks are not “doing checking”. The checks themselves are algorithmic, and they are performed algorithmically by machinery, but the testers are not following algorithms as they develop checks, or deciding that a check should be performed, or evaluating the outcome of the checking. Similarly, programmers who develop classes and functions are not “doing compiling”. Those programmers are not following algorithms to produce code.

Toolsmiths who develop tools and frameworks for checking, and who program checks, are not “doing checking” either. Developers who produce tools and compilers for compiling are not “doing compiling”. Testers who produce checking tools should be seen as skilled specialists, just as developers who produce compilers are seen as skilled specialists. In order to develop excellent checks and excellent checking tools, a tester needs two distinct kinds of expertise: testing expertise, and programming and development expertise.

Testers apply checking as a tactic of testing. Checking is embedded within a host of testing activities: modeling the test space; identifying risks; framing questions that can be asked about the product; encoding those questions in terms of algorithmic actions, observations, outcomes, and reports; choosing when the checking should be done; interpreting the outcome of checks, whether green or red.

Notice that checking does not find bugs. Testers—or developers temporarily in a testing role or a testing mindset—who apply checking find bugs, and the checks (and the checking) play a role in finding bugs.

In all of our talk about testing and checking, we are not attempting to diminish the role of people who create and use testing tools, including checks and checking. Nothing could be farther from the truth. Tools are vital to testing. Tools support testing.

We are, however, asking that testing not be reduced to checking. Checking is not testing, just as compiling is not software development. Checking may be a very important tactic in our testing, and as such, it is crucial to consider how it can be done expertly to assist our testing. It is important to consider the extents and limits of what checking can do for us. Testing a whole product while being fixated on checking is like developing a whole product while being fixated on compiling.

Oracles from the Inside Out, Part 9: Conference as Oracle and as Destination

March 17th, 2016

Over this long series, I’ve described my process of reasoning about problems, using this table:

So far, I’ve mostly talked about the role of experience, inference, and reference. However, I’m typically testing for and with clients—product managers, developers, designers, documenters, and so forth. In doing so, I’m trying to establish a shared understanding of the product with the rest of the team. That understanding is developed through conference; conversation and interaction with those other people. So the lower left quadrant represents two things at once: a set of oracles on the one hand, and my destination on the other.

A brief recap: while testing, I experience and develop my own set of mental models of the product and feelings about it, and reason about possible problems in it. In many cases (for instance, when I get a feeling of surprise or confusion), I’m able to use the consistency principles in the upper right to make inferences that I’m seeing a problem. My inferences might be mediated by references like a document (a specification, or a diagram, or a standard) or a tool (a suite of automated checks, or something that helps me to aggregate and visualize patterns in the data). Those media afford a move from upper right to lower right, and back again to a stronger inference in the upper right.

In other cases, my experiences, inferences, and references may not be enough for me to convince myself that I’m seeing a problem or missing one. If so, one possible move is to ask another tester, a developer, an expert user, a novice user, a product owner, or a subject matter expert for information or an opinion. (In Rapid Testing, we often call such a person a live oracle.) When I do that, I’m moving from inference to conference, from upper right to lower left. Occasionally that communication happens immediately and tacitly, without my having to refer to explicit inferences or references. More often, it’s a longer and more involved discussion.

I could use the expertise of a particular person as an oracle, and rely upon that person to declare that he or she is seeing a problem. However, perspectives differ, people have blind spots, everyone is capable of making a mistake, and what was true yesterday may not be true today. Thus there is a risk that a live oracle could be oblivious to certain kinds of problems, or could mislead me into believing there’s a problem where there isn’t one. No oracle—not even a live one, nor a group of them—is infallible. The expert user might not notice an ease-of-learning problem that would cause a novice to stumble. A new programmer might not see a usability problem that an experienced tester would notice right away.

Perhaps more interestingly, people might disagree about whether there’s a problem or not. Such disagreements themselves are oracles, alerting me to problems in the project as well as the product. Feelings can provide important clues about the meaning and the significance of a problem. As we work together, I can listen to people’s opinions, observe the emotional weight they carry, weigh agreements and disagreements between people who matter, and compare their feelings with my own. I move between conference and inference to recognize or refine my perception of a problem.

The ultimate goal for my testing is to end up in that lower left quadrant with one person in particular: my most important client, the person responsible for making content and release decisions about the product. (That person may have one of a number of titles or labels, including product manager, program manager, project manager, development manager… Here, let’s call that person the Client.) I want my models and feelings about the product to be consistent with the Client’s models and feelings. Experience, inference, reference, and conference help me to do that.

Here’s a fact-based but somewhat fictionalized example. A few years ago, I was working at a financial institution. One of the technical support people mentioned in passing that a surprisingly high proportion of her work was dealing with failed transactions involving two banks out of the hundreds that we interacted with. That triggered a feeling of curiosity: was there a bug in our code? That feeling prompted me to investigate.

Each record had a transaction identifier associated with it. The transaction ID was generated from various bits of data, including the customer account number, and it included a calculated check digit. When I started testing, I noticed that the two banks in question used six-digit account numbers, rather than the more common seven-digit form. I cooked up a script to perform a large number of simulated transactions with those two banks. When I examined the logs, I found that a small number of transactions had invalid account numbers. That problem should have been trapped by the check digit functions, but the transactions were allowed to pass through the system unhindered.
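
The details of that script are long gone (and the story is fictionalized anyway), but the kind of log scan involved might be sketched like this, with the log format and field names invented:

    # Hypothetical log scan: flag transactions whose account numbers fail validation.
    import csv

    def flagged_transactions(log_path, is_valid_account):
        flagged = []
        with open(log_path, newline="") as log:
            for row in csv.DictReader(log):        # assumes one transaction per row
                if not is_valid_account(row["account_number"]):
                    flagged.append(row["transaction_id"])
        return flagged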

When I mentioned the problem in passing to the product owner, I observed that she seemed unperturbed; she didn’t seem to be taking the problem very seriously. The discrepancy between our feelings suggested that one of two things must have been true: either I hadn’t framed the problem sufficiently well for her to recognize its significance; or she had information that I didn’t, information that would have changed my perception of the problem and lessened my emotional reaction to what I was seeing.

“The problem is only with those two banks,” she said. “Six-digit account numbers, right? We have to special-case those by adding a trailing zero for the check digit function. Something about the check digit calculation fails about one time in a couple of hundred, but the transaction goes through anyway. But later, when we send the acknowledgement packet, those two banks reject it. So six-digit numbers are a pain, but we’ve always been able to deal with the occasional failure.” Here she was using the “patterns of familiar problems” and “history” oracle principles as her means of recognizing a problem. But something else was going on: she was using those two principles to calibrate the significance of the problem in terms of her own mental models, and those principles were helping to dampen her concern. Those oracles suggested to her that I was observing a problem, but not a big problem.

I did a search of the database, and discovered that there were eight other banks that used six-digit numbers. I wrote a quick script to extract all of the records for those banks. All of the transactions had happened successfully.

“OK, but here’s what I found out,” I replied. “There are eight other banks that use six-digit numbers, and we’ve never seen a check-digit failure in those.”

“Really?” she said. “Wow. I thought those were the only two.” I could see that she was suddenly more engaged. The fact that the product was inconsistent with itself was a powerful oracle. Awareness of the inconsistency raised her emotional state.

“Yep,” I said. “Here’s the thing: for those two banks—and only for those two—we’re serving up the wrong Web page to get input, which is obviously inconsistent with our design. That page provides the customer with a seven-digit input field. I looked at the logs, and I tried a bunch of stuff myself. Here’s what I think is happening: when the customer enters in a six-digit account number, the page rejects their input because it’s too short, and tells them they need to put in a seven-digit number. It looks to me like a few of the customers are trying to work around the error message by putting in a leading zero. They do that because we show an image to illustrate example input. That image is a seven-digit number that has a leading zero in it. What’s funny is that the wrong thing to do—putting in a leading zero—actually succeeds every now and again; the hash function for the check digit generates a valid transaction ID by coincidence. Not very often, but enough for it to register.”

“Interesting!” she said. She smiled. “Good detective work there.”

“So, are we going to fix it?” I asked, confident that we finally had a shared understanding of the problem.

“Nope.”

I was surprised, and felt myself becoming a little agitated. “Nope?!”

“Well, probably not. We’re replacing the whole input process in six months or so. Since we can deal with the problem as it is, and since the developers are busy on the new version, we’re cool with muddling along.” She noticed from my expression that I suddenly felt deflated. “Listen, that was some really good testing,” she said. “And I really appreciate the effort, and I understand your concern. I get that it’s a real problem for a handful of customers (here, she was acknowledging the inconsistency with user desires oracle), although once they’ve called us, they’re aware of the workaround. I know it does sound like a pretty easy fix, and we could fix it. But then we’d want to test it to make sure that the whole process keeps working for all of the customers of those banks, not just the ones who have had the problems. And with the new version coming up, trust me: you’ll have more than enough to do.”

I was a little disappointed that my investigation hadn’t resulted in a fix, but I did feel that she’d been listening. I had heard enough from her to dampen my own emotional state down so that it was well calibrated with hers.

When I observe a problem, the Client might or might not agree with me that it is a problem. That’s okay. As a tester, I’m not judge or jury for the problem, but I do want to make sure that my report has been heard and understood. After that, the Client can decide what she likes.

She might decide that it’s an important and urgent problem, and that it needs to be addressed right away. She might agree that it’s a problem, but not a problem worth fixing. She might believe that the problem is worth fixing, but not right away. She might dismiss my report of an inconsistency between the product and some principle by citing other, more important principles with which the product is consistent.

Oracles give us means not only to recognize problems, but also to interpret and explain our feelings about them. When I can frame my experience—feelings and mental models—in terms of inferences about inconsistencies, I’m better prepared for a conversation—a conference—with my client about each problem, and why I believe it’s a problem.

A Context-Driven Approach to Automation in Testing

January 31st, 2016

(We interrupt the previously-scheduled—and long—series on oracles for a public service announcement.)

Over the last year James Bach and I have been refining our ideas about the relationships between testing and tools in Rapid Software Testing. The result is this paper. It’s not a short piece, because it’s not a light subject. Here’s the abstract:

There are many wonderful ways tools can be used to help software testing. Yet, all across industry, tools are poorly applied, which adds terrible waste, confusion, and pain to what is already a hard problem. Why is this so? What can be done? We think the basic problem is a shallow, narrow, and ritualistic approach to tool use. This is encouraged by the pandemic, rarely examined, and absolutely false belief that testing is a mechanical, repetitive process.

Good testing, like programming, is instead a challenging intellectual process. Tool use in testing must therefore be mediated by people who understand the complexities of tools and of tests. This is as true for testing as for development, or indeed as it is for any skilled occupation from carpentry to medicine.

You can find the article here. Enjoy!