Deeper Testing (2): Automating the Testing

April 22nd, 2017

Here’s an easy-to-remember little substitution that you can perform when someone suggests “automating the testing”:

“Automate the evaluation
and learning
and exploration
and experimentation
and modeling
and studying of the specs
and observation of the product
and inference-drawing
and questioning
and risk assessment
and prioritization
and coverage analysis
and pattern recognition
and decision making
and design of the test lab
and preparation of the test lab
and sensemaking
and test code development
and tool selection
and recruiting of helpers
and making test notes
and preparing simulations
and bug advocacy
and triage
and relationship building
and analyzing platform dependencies
and product configuration
and application of oracles
and spontaneous playful interaction with the product
and discovery of new information
and preparation of reports for management
and recording of problems
and investigation of problems
and working out puzzling situations
and building the test team
and analyzing competitors
and resolving conflicting information
and benchmarking…”

And you can add things to this list too. Okay, so maybe it’s not so easy to remember. But that’s what it would mean to automate the testing.

Use tools? Absolutely! Tools are hugely important to amplify and extend and accelerate certain tasks within testing. We can talk about using tools in testing in powerful ways for specific purposes, including automated (or “programmed”) checking. Speaking more precisely costs very little, helps us establish our credibility, and affords deeper thinking about testing—and about how we might apply tools thoughtfully to testing work.

Just like research, design, programming, and management, testing can’t be automated. Trouble arises when we talk about “automated testing”: people who have not yet thought about testing too deeply (particularly naïve managers) might sustain the belief that testing can be automated. So let’s be helpful and careful not to enable that belief.

Deeper Testing (1): Verify and Challenge

March 16th, 2017

What does it mean to do deeper testing? In Rapid Software Testing, James Bach and I say:

Testing is deep to the degree that it has a probability of finding rare, subtle, or hidden problems that matter.

Deep testing requires substantial skill, effort, preparation, time, or tooling, and reliably and comprehensively fulfills its mission.

By contrast, shallow testing does not require much skill, effort, preparation, time, or tooling, and cannot reliably and comprehensively fulfill its mission.

Expressing ourselves precisely is a skill. Choosing and using words more carefully can sharpen the ways we think about things. In the next few posts, I’m going to offer some alternative ways of expressing the ideas we have, or interpreting the assignments we’ve been given. My goal is to provide some quick ways to achieve deeper, more powerful testing.

Many testers tell me that their role is to verify that the application does something specific. When we’re asked to do that, it can be easy to fall asleep. We set things up, we walk through a procedure, we look for a specific output, and we see what we anticipated. Huzzah! The product works!

Yet that’s not exactly testing the product. It can easily slip into something little more than a demonstration—the kinds of things that you see in a product pitch or a TV commercial. The demonstration shows that the product can work, once, in some kind of controlled circumstance. To the degree that it’s testing, it’s pretty shallow testing. The product seems to work; that is, it appears to meet some requirement to some degree.

If you want bugs to survive, don’t look too hard for them! Show that the product can work. Don’t push it! Verify that you can get a correct result from a prescribed procedure. Don’t try to make the product expose its problems.

But if you want to discover the bugs, present a challenge to the product. Give it data at the extremes of what it should be able to handle, just a little beyond, and well beyond. Stress the product out; overfeed it, or starve it of something that it needs. See what happens when you give the product data that it should reject. Make it do things more complex than the “verification” instructions suggest. Configure the product (or misconfigure it) in a variety of ways to learn how it responds. Violate an explicitly stated requirement. Rename or delete a necessary file, and observe whether the system notices. Leave data out of mandatory fields. Repeat things that should only happen once. Start a process and then interrupt it. Imagine how someone might accidentally or maliciously misuse the product, and then act on that idea. While you’re at it, challenge your own models and ideas about the product and about how to test it.
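Here is a minimal sketch of the difference between verifying and challenging, using a hypothetical parse_quantity() function (the function, its rules, and the input data are all invented for illustration):

```python
# A hypothetical function under test: it should accept whole-number
# quantities from 1 to 100, and reject everything else.
def parse_quantity(text):
    value = int(text)
    if not 1 <= value <= 100:
        raise ValueError("quantity out of range")
    return value

# Verification: one prescribed, happy-path input. The product "works".
assert parse_quantity("5") == 5

# Challenge: the extremes of what it should handle...
for boundary in ("1", "100"):
    parse_quantity(boundary)

# ...and data it should reject: just beyond the limits, empty input,
# non-numeric input, and input that looks almost right.
for hostile in ("0", "101", "-1", "", "ten", "5.5", " 5 "):
    try:
        parse_quantity(hostile)
        print(f"accepted {hostile!r} -- worth investigating")
    except ValueError:
        pass  # rejected, as we hoped
```

Note that the challenge loop surfaces a surprise the verification never would: Python’s int() quietly strips whitespace, so " 5 " is accepted. Whether that’s a bug is a judgment call for a human, which is rather the point.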

We can never prove by experiment—by testing—that we’ve got a good product; when the product stands up to the challenge, we can only say that it’s not known to be bad. To test a product in any kind of serious way is to probe the extents and limits of what it can do; to expose it to variation; to perform experiments to show that the product can’t do something—or will do something that we didn’t want it to do. When the product doesn’t meet our challenges, we reveal problems, which is the first step towards getting them fixed.

So whenever you see, or hear, or write, or think “verify”, try replacing it with “challenge”.

The Test Case Is Not The Test

February 16th, 2017

A test case is not a test.

A recipe is not cooking. An itinerary is not a trip. A score is not a musical performance, and a file of PowerPoint slides is not a conference talk.

All of the former things are artifacts; explicit representations. The latter things are human performances.

When the former things are used without tacit knowledge and skill, the performance is unlikely to go well. And with tacit knowledge and skill, the artifacts are not central, and may not be necessary at all.

The test case is not the test. The test is what you think and what you do. The test case may have a role, but you, the tester, are at the centre of your testing.

Further reading: http://www.satisfice.com/blog/archives/1346

Drop the Crutches

January 5th, 2017

This post is adapted from a recent blast of tweets. You may find answers to some of your questions in the links; as usual, questions and comments are welcome.

Update, 2017-01-07: In response to a couple of people asking, here’s how I’m thinking of “test case” for the purposes of this post: Test cases are formally structured, specific, proceduralized, explicit, documented, and largely confirmatory test ideas. And, often, excessively so. My concern here is directly proportional to the degree to which a given test case or a given test strategy emphasizes these things.


I had a fun chat with a client/colleague yesterday. He proposed—and I agreed—that test cases are like crutches. I added that the crutches are regularly foisted on people who weren’t limping to start with. It’s as though before the soccer game begins, we hand all the players a crutch. The crutches then hobble them.

We also agreed that test cases often lead to goal displacement. Instead of a thorough investigation of the product, the goal morphs into “finish the test cases!” Managers are inclined to ask “How’s the testing going?” But they usually don’t mean that. Instead, they almost certainly mean “How’s the product doing?” But, it seems to me, testers often interpret “How’s the testing going?” as “Are you done those test cases?”, which ramps up the goal displacement.

Of course, “How’s the testing going?” is an important part of the three-part testing story, especially if problems in the product or project are preventing us from learning more deeply about the product. But most of the time, that’s probably not the part of the story we want to lead with. In my experience, both as a program manager and as a tester, managers want to know one thing above all:

Are there problems that threaten the on-time, successful completion of the project?

The most successful and respected testers—in my experience—are the ones that answer that question by actively investigating the product and telling the story of what they’ve found. The testers that overfocus on test cases distract themselves AND their teams and managers from that investigation, and from the problems investigation would reveal.

For a tester, there’s nothing wrong with checking quickly to see that the product can do something—but there’s not much right—or interesting—about it either. Checking seems to me to be a reasonably good thing to work into your programming practice; checks can be excellent alerts to unwanted low-level changes. But when you’re testing, showing that the product can work—essentially, demonstration—is different from investigating and experimenting to find out how it does (or doesn’t) work in a variety of circumstances and conditions. Sometimes people object, saying that they have to confirm that the product works and that they don’t have time to investigate. To me, that’s getting things backwards. If you actively, vigorously look for problems and don’t find them, you’ll get that confirmation you crave, as a happy side effect.
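As a sketch of what such a low-level check might look like in programming practice, consider this (the discount() function and its boundary rule are hypothetical):

```python
# A hypothetical business rule: orders of $100 or more get 10% off.
def discount(subtotal):
    return subtotal * 0.9 if subtotal >= 100 else subtotal

def check_discount_boundary():
    # The check presses the buttons and returns a bit. Deciding where the
    # boundary *should* sit, and what a failure would mean, is the human's
    # job -- the check merely alerts us to an unwanted low-level change.
    return discount(99.99) == 99.99 and discount(100) == 90.0

assert check_discount_boundary()
```

A check like this is cheap to run on every change; the interesting work lies in choosing which facts are worth checking, and in investigating when the alert fires.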

No matter what, you must prepare yourself to realize this:

Nobody can be relied upon to anticipate all of the problems that can beset a non-trivial product.

We call it “development” for a reason. The product and everything around it, including the requirements and the test strategy, do not arrive fully-formed. We continuously refine what we know about the product, and how to test it, and what the requirements really are, and all of those things feed back into each other. Things are revealed to us as we go, not as a cascade of boxes on a process diagram, but more like a fractal.

The idea that we could know entirely what the requirements are before we’ve discussed and decided we’re done seems like total hubris to me. We humans have a poor track record in understanding and expressing exactly what we want. We’re no better at predicting the future. Deciding today what will make us happy ten months—or even days—from now combines both of those weaknesses and multiplies them.

For that reason, it seems to me that any hard or overly specific “Definition of Done” is antithetical to real agility. Let’s embrace unpredictability, learning, and change, and treat “Definition of Done” as a very unreliable heuristic. Better yet, consider a Definition of Not Done Yet: “we’re probably not done until at least These Things are done”. The “at least” part of DoNDY affords the possibility that we may recognize or discover important requirements along the way. And who knows?—we may at any time decide that we’re okay with dropping something from our DoNDY too. Maybe the only thing we can really depend upon is The Unsettling Rule.

Test cases—almost always prepared in advance of an actual test—are highly vulnerable to a constantly shifting landscape. They get old. And they pile up. There usually isn’t a lot of time to revisit them. But there’s typically little need to revisit many of them either. Many test cases lose relevance as the product changes or as it stabilizes.

Many people seem prone to say “We have to run a bunch of old test cases because we don’t know how changes to the code are affecting our product!” If you have lost your capacity to comprehend the product, why believe that you still comprehend those test cases? Why believe that they’re still relevant?

Therefore: just as you (appropriately) remain skeptical about the product, remain skeptical of your test ideas—especially test cases. Since requirements, products, and test ideas are subject to both gradual and explosive change, don’t overformalize or otherwise constrain your testing to stuff that you’ve already anticipated. You WILL learn as you go.

Instead of overfocusing on test cases and worrying about completing them, focus on risk. Ask “How might some person suffer loss, harm, annoyance, or diminished value?” Then learn about the product, the technologies, and the people around it. Map those things out. Don’t feel obliged to be overly or prematurely specific; recognize that your map won’t perfectly match the territory, and that that’s okay—and it might even be a Good Thing. Seek coverage of risks and interesting conditions. Design your test ideas and prepare to test in ways that allow you to embrace change and adapt to it. Explain what you’ve learned.

Do all that, and you’ll find yourself throwing away the crutches that you never needed anyway. You’ll provide a more valuable service to your client and to your team. You and your testing will remain relevant.

Happy New Year.

Further reading:

Testing By Percentages
Very Short Blog Posts (11): Passing Test Cases
A Test is a Performance
“Test Cases Are Not Testing: Toward a Culture of Test Performance” by James Bach & Aaron Hodder (in http://www.testingcircus.com/documents/TestingTrapeze-2014-February.pdf#page=31)

Very Short Blog Posts (31): Ambiguous or Incomplete Requirements

December 19th, 2016

This question came up the other day in a public forum, as it does every now and again: “Should you test against ambiguous/incomplete requirements?”

My answer is Yes, you should. In fact, you must, because all requirements documents are to some degree ambiguous or incomplete. And in fact, all requirements are to some degree ambiguous and incomplete.

And that is an important reason why we test: to help discover how the product is inconsistent with people’s current requirements, even though it might be consistent with requirements that they may have specified—ambiguously or incompletely—at some point in the past.

In other words: we test not only to compare the product to documented requirements, but to discover and help refine requirements that may otherwise be ambiguous, unclear, inconsistent, out of date, unrecognized, or emergent.

Very Short Blog Posts (30): Checking and Measuring Quality

November 14th, 2016

This is an expansion of some recent tweets.

Do automated tests (in the RST namespace, checks) measure the quality of your product, as people sometimes suggest?

First, the check is automated; the test is not. You are performing a test, and you use a check—or many checks—inside the test. The machinery may press the buttons and return a bit, but that’s not the test. For it to be a test, you must prepare the check to cover some condition and alert you to a potential problem; and after the check, you must evaluate the outcome and learn something from it.

The check doesn’t measure. In the same way, a ruler doesn’t measure anything. The ruler doesn’t know about measuring. You measure, and the ruler provides a scale by which you measure. The Mars rovers do not explore. The Mars rovers don’t even know they’re on Mars. Humans explore, and the Mars rovers are ingeniously crafted tools that extend our capabilities to explore.

So the checks measure neither the quality of the product nor your understanding of it. You measure those things—and the checks are like tiny rulers. They’re tools by which you operate the product and compare specific facts about it to your understanding of it.

Peter Houghton, whom I greatly admire, prompted me to think about this issue. Thanks to him for the inspiration. Read his blog.

s/automation/programming/

June 2nd, 2016

Several years ago in one of his early insightful blog posts, Pradeep Soundarajan said this:

“The test doesn’t find the bug. A human finds the bug, and the test plays a role in helping the human find it.”

More recently, Pradeep said this:

Instead of saying, “It is programmed”, we say, “It is automated”. A world of a difference.

It occurred to me instantly that it could make a world of difference, so I played with the idea in my head.

Automated checks? “Programmed checks.” 

Automated testing? “Programmed testing.” 

Automated tester?  “Programmed tester.” 

Automated test suite?  “Programmed test suite.”

Let’s automate to do all the testing?  “Let’s write programs to do all the testing.”

Testing will be faster and cheaper if we automate. “Testing will be faster and cheaper if we write programs.”

Automation will replace human testers. “Writing programs will replace human testers.”

To me, the substitutions all generated a different perspective and a different feeling from the originals. When we don’t think about it too carefully, “automation” just happens; machines “do” automation. But when we speak of programming, our knowledge and experience remind us that we need people to do programming, and that good programming can be hard, and that good programming requires skill. And even good programming is vulnerable to errors and other problems.

So by all means, let’s use hardware and software tools skilfully to help us investigate the software we’re building.  Let’s write and develop and maintain programs that afford deeper or faster insight into our products (that is, our other programs) and their behaviour.  Let’s use and build tools that make data generation, visualisation, analysis, recording, and reporting easier. Let’s not be dazzled by writing programs that simply get the machinery to press its own buttons; let’s talk about how we might use our tools to help us reveal problems and risks that really matter to us and to our clients.  

And let’s consider the value and the cost and the risk associated with writing more programs when we’re already rationally uncertain about the programs we’ve got.

The Honest Manual Writer Heuristic

May 30th, 2016

Want a quick idea for a burst of activity that will reveal both bugs and opportunities for further exploration? Play “Honest Manual Writer”.

Here’s how it works: imagine you’re the world’s most organized, most thorough, and—above all—most honest documentation writer. Your client has assigned you to write a user manual, including both reference and tutorial material, that describes the product or a particular feature of it. The catch is that, unlike other documentation writers, you won’t base your manual on what the product should do, but on what it does do.

You’re also highly skeptical. If other people have helpfully provided you with requirements documents, specifications, process diagrams or the like, you’re grateful for them, but you treat them as rumours to be mistrusted and challenged. Maybe someone has told you some things about the product. You treat those as rumours too. You know that even with the best of intentions, there’s a risk that even the most skillful people will make mistakes from time to time, so the product may not perform exactly as they have intended or declared. If you’ve got use cases in hand, you recognize that they were written by optimists. You know that in real life, there’s a risk that people will inadvertently blunder or actively misuse the product in ways that its designers and builders never imagined. You’ll definitely keep that possibility in mind as you do the research for the manual.

You’re skeptical about your own understanding of the product, too. You realize that when the product appears to be doing something appropriately, it might be fooling you, or it might be doing something inappropriate at the same time. To reduce the risk of being fooled, you model the product and look at it from lots of perspectives (for example, consider its structure, functions, data, interfaces, platform, operations, and its relationship to time; and business risk, and technical risk). You’re also humble enough to realize that you can be fooled in another way: even when you think you see a problem, the product might be working just fine.

Your diligence and your ethics require you to envision multiple kinds of users and to consider their needs and desires for the product (capability, reliability, usability, charisma, security, scalability, performance, installability, supportability…). Your tutorial will be based on plausible stories about how people would use the product in ways that bring value to them.

You aspire to provide a full accounting of how the product works, how it doesn’t work, and how it might not work—warts and all. To do that well, you’ll have to study the product carefully, exploring it and experimenting with it so that your description of it is as complete and as accurate as it can be.

There’s a risk that problems could happen, and if they do, you certainly don’t want either your client or the reader of your manual to be surprised. So you’ll develop a diversified set of ways to recognize problems that might cause loss, harm, annoyance, or diminished value. Armed with those, you’ll try out the product’s functions, using a wide variety of data. You’ll try to stress out the product, doing one thing after another, just like people do in real life. You’ll involve other people and apply lots of tools to assist you as you go.

For the next 90 minutes, your job is to prepare to write this manual (not to write it, but to do the research you would need to write it well) by interacting with the product or feature. To reduce the risk that you’ll lose track of something important, you’ll probably find it a good idea to map out the product, take notes, make sketches, and so forth. At the end of 90 minutes, check in with your client. Present your findings so far and discuss them. If you have reason to believe that there’s still work to be done, identify what it is, and describe it to your client. If you didn’t do as thorough a job as you could have done, report that forthrightly (remember, you’re super-honest). If anything got in the way of your research or made it more difficult, highlight that; tell your client what you need or recommend. Then have a discussion with your client to agree on what you’ll do next.

Did you notice that I’ve just described testing without using the word “testing”?

Testers Don’t Prevent Problems

May 4th, 2016

Testers don’t prevent errors, and errors aren’t necessarily waste.

Testing, in and of itself, does not prevent bugs. Platform testing that reveals a compatibility bug provides a developer with information. That information prompts him to correct an error in the product, which prevents that already-existing error from reaching and bugging a customer.

Stress testing that reveals a bug in a function provides a developer with information. That information helps her to rewrite the code and remove an error, which prevents that already-existing error from turning into a bug in an integrated build.

Review (a form of testing) that reveals an error in a specification provides a product team with information. That information helps the team in rewriting the spec correctly, which prevents that already-existing error from turning into a bug in the code.

Transpection (a form of testing) reveals an error in a designer’s idea. The conversation helps the designer to change his idea to prevent the error from turning into a design flaw.

You see? In each case, there is an error, and nothing prevented it. Just as smoke detectors don’t prevent fires, testing on its own doesn’t prevent problems. Smoke detectors direct our attention to something that’s already burning, so we can do something about it and prevent the situation from getting worse. Testing directs our attention to existing errors. Those errors will persist—presumably with consequences—unless someone makes some change that fixes them.

Some people say that errors, bugs, and problems are waste, but they are not in themselves wasteful unless no one learns from them and does something about them. On the other hand, every error that someone discovers represents an opportunity to take action that prevents the error from becoming a more serious problem. As a tester, I’m fascinated by errors. I study errors: how people commit errors (bug stories; the history of engineering), why they make errors (fallible heuristics; cognitive biases), where we might find errors (coverage), how we might recognize errors (oracles). I love errors. Every error that is discovered represents an opportunity to learn something—and that learning can help people to change things in order to prevent future errors.

So, as a tester, I don’t prevent problems. I play a role in preventing problems by helping people to detect errors. That allows those people to prevent those errors from turning into problems that bug people.

Still squeamish about errors? Read Jerry Weinberg’s e-book, Errors: Bugs, Boo-boos, Blunders.

Is There a Simple Coverage Metric?

April 26th, 2016

In response to my recent blog post, 100% Coverage is Possible, reader Hema Khurana asked:

“Also some measure is required otherwise we wouldn’t know about the depth of coverage. Any straight measures available?”

I replied, “I don’t know what you mean by a ‘straight’ measure. Can you explain what you mean by that?”

Hema responded: “I meant a metric some X/Y.”

In all honesty, it’s sometimes hard to remain patient when this question seems to come up at every conference, in every class, week upon week, year upon year. Asking me about this is a little like asking Chris Hadfield—since he’s a well-known astronaut and a pretty smart guy—if he could provide a way of measuring the area of the flat, rectangular earth. But Hema hasn’t asked me before, and we’ve never met, so I don’t want to be immediately dismissive.

My answer, my fast answer, is No. One key problem here is related to what Y could possibly represent. What counts? Maybe we could talk about Y in terms of a number of test cases, and X as how many of those test cases we’ve executed so far. If Y is 600 and X is 540, we could say that testing is 90% done. But that ignores at least two fundamental problems.

The first problem is that, irrespective of the number of test cases we have, we could choose to add more at any time as (via testing) we discover different conditions that we would like to evaluate. Or maybe we could choose to drop test cases when we realize that they’re out of date or irrelevant or erroneous. That is, unless we decide to ignore what we’ve learned, Y will, quite appropriately, change over time.
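A quick arithmetic sketch shows why the X/Y figure is unstable; the starting numbers follow the example above, and the added and dropped cases are hypothetical:

```python
# Why an X/Y "percent done" metric is unstable: Y itself changes as we learn.
executed, planned = 540, 600
print(f"{executed / planned:.0%} done")   # prints "90% done"

# Testing reveals 80 new conditions worth evaluating...
planned += 80
# ...and 30 old cases turn out to be out of date, irrelevant, or erroneous.
planned -= 30

print(f"{executed / planned:.0%} done")   # prints "83% done" -- we went
                                          # backwards without running a thing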

The second problem is that—at least in my view, and in the view of my colleagues—test cases are a ludicrous way to think about testing.

Another almost-as-quick answer would be to encourage people to re-read that 100% Coverage is Possible post (and the Further Reading links), and to keep re-reading until they get it.

But that’s probably not very encouraging to someone who is asking a naive question, and I’d like to be more helpful than that.

Here’s one thing we could do, if someone were desperate for numbers that summarize coverage: we could make a qualitative evaluation of coverage, and put numbers (or letters, or symbols) on a scale that is nominal and very weakly ordinal.

Our qualitative evaluation would be rooted in analysis of many dimensions of coverage. The Product Elements and Quality Criteria sections of the Heuristic Test Strategy Model provide a framework for generating coverage ideas or for reviewing our coverage retrospectively. We would review and discuss how much testing we’ve done of specific features, or particular functional areas, or perceived risks, and summarize our evaluation using a simple scale that would go something like this:

Level 0 (or X, or an empty circle, or…): We know nothing at all about this area of the product.

Level 1 (or C, or a glassy-eyed emoticon, or…): We have done a very cursory evaluation of this area. Smoke- or sanity-level; we’ve visited this feature and had a brief look at it, but we don’t really know very much about it; we haven’t probed it in any real depth.

Level 2 (or B, or a normal-looking emoticon, or…): We’ve had a reasonable look at this area, although we haven’t gone all the way deep. We’ve examined the common, the core, the critical, the happy paths, the handling of everyday errors or exceptions. We’re pretty familiar with this area. We’ve done the kind of testing that would expose some significant bugs, if they were there.

Level 3 (or A, or a determined-looking angel emoticon, or…): We’ve really kicked this area harshly and hard. We’ve looked at unusual and complex conditions or states. We’ve probed deeply for subtle or hidden bugs. We’ve exposed the product to the extreme, the exceptional, the rare, the improbable. We’ve looked for bugs that are deep in the corners or hidden in the dark. If there were a serious bug, we’re pretty sure we would have found it by now.

Strictly speaking, these numbers are placed on an ordinal scale, in the sense that Level 3 coverage is deeper than Level 2, which is deeper than Level 1. (If you don’t know about scales of measurement, you should learn about them before providing or asking for metrics. And there are some other things to look at.) The numbers are certainly not an interval scale, or a ratio scale. They may not be commensurate from one feature area to the next; that is, they may represent different notions of coverage, different amounts of effort, different modes of evaluation. By design, these numbers should not be treated as valid measurements, and we should make sure that everyone on the project knows it. They are little labels that summarize evaluations and product elements and effort, factors that must be discussed to be understood. But those discussions can lead to understanding and consensus between ourselves, our colleagues, and our clients.
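To make the idea concrete, here is a small sketch of such a summary; the feature areas and the level assigned to each are hypothetical, and the labels condense the level descriptions above:

```python
# Ordinal coverage labels: summaries of discussion, not measurements.
LEVELS = {
    0: "know nothing at all about this area",
    1: "cursory; smoke- or sanity-level only",
    2: "reasonable look; common, core, critical, happy paths",
    3: "deep; extreme, exceptional, rare, improbable conditions",
}

# Hypothetical evaluations agreed in discussion with the team.
coverage = {"login": 3, "reporting": 2, "import/export": 1, "localization": 0}

for area, level in sorted(coverage.items(), key=lambda kv: kv[1]):
    print(f"Level {level}  {area:<14} {LEVELS[level]}")

# Because the scale is only weakly ordinal, resist the urge to average
# these numbers or roll them up into a single figure: Level 2 is deeper
# than Level 1, but not "twice" Level 1, and levels may not be
# commensurate from one area to the next.
```

The point of the code is what it refuses to do: no totals, no percentages, just labels that prompt the conversations in which the real evaluation lives.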