Blog Posts from July, 2020

A Testopsy: Learning from Performance

Monday, July 27th, 2020

What’s the difference between Rapid Software Testing (RST) and other forms of testing? In RST, the process model is not the centre of testing; neither is formal documentation; nor are tools. All of those things play a role in testing, of course, but they’re not at the centre.

In RST, the centre of testing is the skill set and the mindset of the individual tester, and the heuristics that testers apply.

A heuristic is a fallible means of solving a problem. That is, a heuristic might work, or it might fail. A heuristic will fail when it is applied to the wrong kind of problem; or when it is applied with insufficient judgement, wisdom, skill, or care; or when some context factor or another derails it. All of the models that we apply to the product and to the test space are heuristic. All test techniques are heuristic. All of the ways in which we could apply tools are heuristic. All the ways we have of deciding that there’s a problem (that is, all of our oracles) are heuristic. And this doesn’t apply only to testing; everything in software development, and in the broader field of engineering itself, is heuristic.

So, in order to get good at testing, we must learn about heuristics that we can apply powerfully in our work. We must also consider how our heuristics can fail. One of the better ways to do that is to review and evaluate our work periodically in a very detailed way. In Rapid Software Testing, we call that a testopsy.

Earlier this year, James Bach and I did a testopsy on a session of testing that we had performed together about six months earlier, in preparation for the Rapid Software Testing Applied class that we teach. By examining our performance, we were able to notice and name heuristics and patterns that help us to think about testing, to describe it, and to understand how testing can go right—and sometimes not so right.

Here are just a few things we learned—or learned more deeply—from that session and the testopsy we performed:

  • When we’re doing pair testing, a lot of tacit knowledge emerges into the explicit. Each person’s performance is visible to the other, raising observations and questions about things that have not been shared up to that point. Through that, knowledge gets shared, discussed, and refined.
  • Products often give us lists of their own features in odd places, in interesting ways, that afford some efficiency for identifying coverage ideas.
  • There’s a phenomenon that happens in testing that we’re calling a “bug cascade”—periods where we are stressed or even overwhelmed by overlapping and competing investigations of complex and confusing behaviour.
  • During a bug cascade, we often recognize that we don’t know enough about the product to perform good analysis and troubleshooting.
  • Bugs get noticed and then lost, or missed altogether, during a bug cascade…
  • …but having a video and reviewing it can help us to recover what we’ve lost.
  • Analyzing the product (which had been our original mission for the session) can be severely disrupted by a cascade of bugs.
  • We coined a term, “mutually disruptive processes”, to describe one of the consequences of the bug cascade—which, when you’re working alone, is self-disruptive.
  • We coined another term, “the money booth effect”, to account for the collapse of productivity that is the consequence of mutually- or self-disruptive processes.
  • It is a good idea to be forgiving of ourselves for these problems. Although we can try to manage them to some degree, they are intrinsic to the process of learning and testing a product.

There’s lots more to the testopsy, which you can see here.

Why is this all important? Because in order to do something well, we must understand it, and testing is often terribly misunderstood—by managers, by developers, and, sadly, by testers themselves. By doing deep study of our work from time to time, we can begin the process of framing it, describing it, discussing it, and developing expertise in it.

Rapid Software Testing Managed is coming up August 12-14. Rapid Software Testing Explored, set up for the daytime in Europe, the UK, and India, runs September 15-18, and another session of Rapid Software Testing Applied runs from September 23-25. See the full schedule, with links to register, here.

Breaking the Test Case Addiction (Part 12)

Saturday, July 25th, 2020

In previous posts in this series, I made a claim about the audience for a test report:

They almost certainly don’t want to know about when the testing is going to be done (although they might think they do).

It’s true that managers frequently ask testers when the testing will be done. That’s a hard question to answer, but maybe not for reasons that you—or they—might have considered.

By definition, testers who are working for clients do not work independently. We are providing services to our clients. We gain experience with the product, explore it, and experiment with it so that our clients can determine the status of the product. Knowledge of the status of the product allows our clients to decide whether the product is ready to ship, or whether there is more development work to do.

Whatever testing we may have performed, we could always perform more; but once the client decides more development work won’t be worthwhile, development stops, and testing stops along with it. (At least, pre-release testing stops. Live-site monitoring and other forms of information gathering begin when the product is released, presenting an opportunity for learning about the quality of the product and about the quality of the testing that’s been done on it. Sometimes that learning comes with a big price tag.) The real question on the table, then, is not when testing work will be done, but when the development work will be done.

So, brace yourself: the fact is that no one really cares when testing will be done, because testing is never done; it only stops. Testing stops when the client determines that there is no more development work worth doing. The client—not the tester—decides when development is done. And how does the client decide that?

The client decides based on economics, reasoning, politics, and emotion. This is a complex decision, and here comes a long sentence that illustrates just how complex the decision is.

The client will decide to ship the product when she believes that

  • she knows enough about the product, the actual known problems about it, and the potential for unknown problems about it, such that…
  • the product provides sufficient benefits—that is, the product will help its users to accomplish a task, or some set of tasks; and
  • the product has a sufficiently small number of known bad problems about it; and
  • the product is sufficiently unlikely to have unknown bad problems; and
  • more development work—adding new features and fixing problems—will not be worthwhile, because
  • the benefits from the product outweigh the known problems to a sufficient degree that customers will obtain the value they want; and
  • the business can deal with known problems about the product, sufficiently inexpensively for the business to sustain the product and the business; and
  • the business can deal with whatever unknown problems may still exist; and
  • the client will not be in political trouble with her social group (including the team, management, and society at large) if she turns out to be wrong about any or all of this; and
  • she feels okay about all of these things.

So when will testing be done? The client can declare testing to be done at any moment when the client is satisfied that all of these conditions have been fulfilled. So when the client asks “When will testing be done?”, that question amounts to “When will I be satisfied that development work is done?” And how can you, the tester, predict when someone else will be satisfied by work being done by other people?

You can’t. So I would recommend that you don’t, and that you don’t try. Instead, I’d suggest that you negotiate your role and your commitments. At first, this may look like a long conversation.

Try something like this:

“I understand that you want to know when testing will be done, because you want to know when development will be done; that is, when you will be satisfied that the product is ready to ship. I don’t know how to make a reliable prediction about when you will be satisfied, but here’s something that I can propose in return.

“I will start testing right now; that is, I will start obtaining experience with the product, exploring it, performing experiments on it, analyzing it. I’ll learn rapidly about the technology, the clients for the product, and the contexts in which the product will be used. As a tester, my special focus will be on evaluating it like a good critic; finding problems that threaten the value of the product to people who matter—especially you.

“Things will tend to go better if I’m able to help find problems early on—in the design of the product, or in our understanding of how its users might get value from it, or in the context that surrounds it. I don’t presume to be the manager or designer of the product, but I may have some suggestions for it—especially in terms of how to make the product more practically testable.

“As the product is being built, I’ll work closely with you and with the developers to help everyone make sure that the product we’re building is reasonably close to the product we think we’re building. The testing we need for that tends to be relatively shallow, focusing on quick feedback that doesn’t slow down or interrupt the pace of development. I’d recommend that you give the developers time and support to do their work in a disciplined way, as good craftspeople do. That discipline includes review, testing, and checking their work as they go, so that easy-to-find problems don’t get buried and cause trouble for everyone later. I can offer help with that, to the degree that the developers welcome it.

“The more that the developers can cover that quick, shallower testing, the more I’ll be able to focus on deep testing to find rare, hidden, subtle, intermittent, platform-dependent, emergent, elusive problems that matter. Deep testing requires a different mindset from the builder’s mindset, and changing mental gears to do deep testing can really disrupt the developers’ flow. So I’ll try to do deep testing as much as I can in parallel with the shallower testing that the developers are doing all the way along.

“At every step, I’ll let you know about any problems that I see in the product. I’ll be giving you bug reports, of course. I’ll also let you know about how the testing is going—what has been covered and what hasn’t. I’ll use coverage outlines in some form to help illustrate that, and I’m happy to offer you a variety of formats for them so you can choose one that works for you.

“If I notice a lot of bugs that seem like they should have been easy to find, I’ll let you know right away. For one thing, when there are lots of shallow bugs, deep testing becomes harder and slower, because I’m obliged to pause to investigate and report those bugs. More significantly, though, lots of shallow bugs might indicate that the developers are working too fast, or are under too much pressure. When people are pressed, they tend to have a hard time maintaining discipline and mental control over their work. In software, that’s a Severity 0 project risk; it leads to bugs, and some of those bugs may be deep enough that they’ll get past us—especially if we’re investigating and reporting the shallower bugs.

“I’m prepared to test or review anything you give me at any time; I’ll let you know how that influences the pace of other work that you’ve asked me to do.

“If there is testing that must be done formally—that is, in a specific way, or to check specific facts—I can certainly do that. I’ll provide you (and the auditors, if necessary) with evidence to support claims about all of the testing that has been done, both formal and informal. I’ll also let you know about extra costs associated with formal work—the time and effort it takes—and how it might affect our ability to find problems that matter.

“Apropos of that, I’ll keep track of anything that might threaten the on-time, successful completion of whatever work we’re doing. If you like, I’ll help to maintain product and project risk lists. (I’d recommend that the project manager be responsible for those, though.)

“I’ll keep track of where my own time is going, so that I’ll be able to produce a credible account of anything that is slowing down my work or making it harder. I’ll let you know what I need or recommend to make testing go as quickly and as easily as possible, and I invite you to ask for anything that helps make the product status or the testing work more legible—visible, readable, or understandable—to you.

“My goal is to help you to be immediately aware of everything you need to know to anticipate and inform a shipping decision.

“I know that this doesn’t directly answer the question of when testing will be done; but testing ends when we know the development work is done. So perhaps the best thing is for us to go together to the designers and developers. You can ask them when they anticipate that the development work will be done, and when the problems we encounter along the way will be fixed. I will help them to identify problems and risks, and to remember to include time and resources for testability as they give their estimate. As we’re working together to build and test the product, we can develop and refine our understanding about it, and we can be continually aware of its status. When that’s the case, you’ll be able to decide quickly whether there’s more development work to do, or whether you believe the product is ready for release.”

That’s a fairly thorough description of testing work. It’s a pretty long statement, isn’t it? Reading it aloud takes me just over five minutes. In real life, it would probably be interrupted by questions from time to time, too. So let’s imagine that the whole conversation might take 15 minutes, or even half an hour. But let me leave this post—and this series of posts—with these questions:

In a project that can take weeks or months, wouldn’t one relatively short conversation describing the testing role and affirming the tester’s commitments be worthwhile?

In that thorough description of testing work, did you notice that the expression “test cases” didn’t come up?

Breaking the Test Case Addiction (Part 11)

Friday, July 24th, 2020

In the previous post in this series, I made these claims about the audience for test reports:

  • They almost certainly don’t want to know about test case counts (although they might think they do).
  • They almost certainly don’t want to know about pass-fail ratios (although they might think they do).
  • They almost certainly don’t want to know about when the testing is going to be done (although they might think they do).

It’s far more likely that they want an answer to these questions:

What is the actual status of the product? Are there problems that threaten the value of the product? How do you—the tester—know? Do these problems threaten the on-time, successful completion of our work?

In this post, I’ll address the first two claims; I’ll leave the third claim for next time.

They almost certainly don’t want to know about test case counts (although they might think they do).

Imagine asking a tester to test a cheap pocket calculator for you. We will call him “Eccles” (in honour of The Goon Show). You tell him your intentions for it: you would like to use it mostly to help you to divide the bill for a group of friends at a restaurant, and for other everyday tasks. Eccles disappears, and returns a few minutes later. You ask him if he has found any problems. He says No. You ask to see his results, and he shows you his two test cases:

Input: 1 + 1 Result: 2 (Pass)
Input: 2 + 2 Result: 4 (Pass)

You quite reasonably believe that Eccles’ testing is inadequate. You tell him that you want more test cases. He listens, appears to understand the problem, and nods. He disappears again, and considerably later he returns, telling you that he has run 100 test cases—50 times more than the first time! And he has carefully documented the results:

Input: 1 + 1 Result: 2 (Pass)
Input: 2 + 2 Result: 4 (Pass)
Input: 3 + 3 Result: 6 (Pass)
Input: 4 + 4 Result: 8 (Pass)
Input: 5 + 5 Result: 10 (Pass)
Input: 6 + 6 Result: 12 (Pass)
Input: 7 + 7 Result: 14 (Pass)
Input: 8 + 8 Result: 16 (Pass)
Input: 9 + 9 Result: 18 (Pass)
Input: 10 + 10 Result: 20 (Pass)
Input: 11 + 11 Result: 22 (Pass)
…
Input: 99 + 99 Result: 198 (Pass)
Input: 100 + 100 Result: 200 (Pass)

To the degree that more is better here, it’s not very much better.
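
To underscore how little the count tells us on its own, here is a minimal sketch (assuming Python and pytest, and a hypothetical calculator_add function standing in for the real product; none of this appears in the original example): a single parametrized check expands into a hundred "test cases" of exactly this kind.

    import pytest

    def calculator_add(a, b):
        # Hypothetical stand-in for driving the real calculator;
        # here it is just Python's own addition.
        return a + b

    # One parametrized check expands into 100 "test cases":
    # 1 + 1, 2 + 2, ... all the way up to 100 + 100.
    @pytest.mark.parametrize("n", range(1, 101))
    def test_doubling(n):
        assert calculator_add(n, n) == 2 * n

Run under pytest, that reports 100 passing test cases; what it reveals about the product is scarcely more than the original two did.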

The trouble, of course, is that the count doesn’t mean anything without context. What aspects of the product are being tested? Has the testing been limited to only mathematical functions within the product? If so, has the tester at least given some coverage to all of them—and if not, which ones has the tester not covered—and why not? Has the tester considered other things that could diminish, damage, or destroy the value of the product? Has the tester considered performance and reliability? Has the tester considered the different people who might use the product, and the ways in which they might use the product in the real world?

Testing is the process of evaluating a product by learning about it through experiencing, exploring, and experimenting, which includes to some degree questioning, studying, modeling, observation, inference, sensemaking, risk analysis, critical thinking—and many other things too. A test is an instance of testing. Not all tests are equal in terms of effort, time, skill, scope, risk focus,…

Test cases tend to represent things that are easy to describe about a test: directly observable behaviour that can be described or encoded explicitly; and observable and describable outputs. Test cases both assume and ignore tacit knowledge.

But neither tests nor test cases are commensurate—that is, they cannot be counted as though they were equivalent units—so “test case” is not a valid unit of measurement.

  • From one case to another, test cases vary widely in scope, in coverage, in cost, in risk focus, and in value.
  • The design of a test case is subjective, based at least to some degree on the mental models and mindset of individual testers.
  • Test cases involve different test techniques.
  • Test cases are not independent; the outcome of one might influence the outcome of another.
  • Test cases are not interchangeable. They’re different, depending on the feature, function, data, and product in front of us.
  • Test cases do not—and cannot—capture all the testing work that occurs, such as learning, conjecture, discoveries, bug investigation, and so forth.
  • Test cases don’t even capture the work of designing the test cases, nor of analyzing the results!
  • And finally… testers often don’t follow the test cases anyway—and certainly not in the same way every time! A test is a performance, and a test case is like a script and stage directions for that performance. As with actors working from a script, the performance will vary from tester to tester, and from time to time.

Note that none of these things is necessarily a problem. Indeed, in testing, there’s considerable value in variation and variability. Bugs aren’t all the same, and they’re not always in the same place. There is a big problem in trying to treat test cases as equivalent for the purposes of counting them. (I’ve talked about that many times before, including here, and here.)

Now, there is at least one argument in favour of test cases:

Perhaps someone wants to verify that a specific procedure can be followed, with specific preconditions and specific inputs, in order to show that the procedure and inputs will produce a specific result. And, in fact, perhaps that procedure, or some part of it at least, can be automated.

That’s okay, although there are at least two problems to consider. First, all that specification tends to take time and effort, which can be costly, and which can swamp the value of what we might learn from following the procedure. Second, demonstrating that something can work based on specific procedures and inputs doesn’t mean that it will work. A variation in the procedure, or the conditions, or the inputs will result in different output. And even when we hold the conditions and the procedure steady and obtain the correct output, the outcome might still be terribly wrong in some sense.
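
To make that last sentence concrete, here is a tiny illustrative sketch (in Python, with invented numbers; nothing like it appears in the original post): a scripted check on the bill-splitting scenario can hold its procedure and inputs steady, obtain exactly the output it specified, and still leave an important problem untouched.

    # A scripted check: split a $10.00 bill three ways.
    share = round(10.00 / 3, 2)
    assert share == 3.33    # passes: the specified procedure and input
                            # produce the specified output

    # Yet the outcome can still be wrong in a sense the check never asks about:
    # three shares of $3.33 recover only $9.99, a penny short of the bill.
    print(f"collected: {share * 3:.2f}")    # prints "collected: 9.99"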

Perhaps someone wants certain conditions to be identified and covered. If that’s true, identify those conditions and cover them. There are plenty of ways to do that without over-formalizing or over-proceduralizing the testing work.

Consider

  • noting those conditions in guidance for human interaction with the product;
  • reviewing existing logs or records to see if those conditions have been covered, and if not, cover them; or
  • creating automated low- or middle-level checks for those conditions (a sketch follows below).
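
For that last option, here is a minimal sketch (again assuming Python and pytest, plus a hypothetical calculator_divide wrapper; none of this is prescribed above): the identified condition, division by zero, gets covered by one small, low-level check, with no step-by-step test case wrapped around it.

    import pytest

    def calculator_divide(a, b):
        # Hypothetical stand-in for the product's division function.
        return a / b

    def test_divide_by_zero_is_refused():
        # The identified condition: dividing by zero should be refused cleanly,
        # not answered with a bogus number.
        with pytest.raises(ZeroDivisionError):
            calculator_divide(1, 0)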

Over 50 years ago, Jerry Weinberg wrote this passage:

One of the lessons to be learned from such experiences is that the sheer number of tests performed is of little significance in itself. Too often, the series of tests simply proves how good the computer is at doing the same things with different numbers. As in many instances, we are probably misled here by our experiences with people, whose inherent reliability on repetitive work is at best variable. With a computer program, however, the greater problem is to prove adaptability, something which is not trivial in human functions either. Consequently we must be sure that each test really does some work not done by previous tests. To do this, we must struggle to develop a suspicious nature as well as a lively imagination.

Leeds and Weinberg, Computer Programming Fundamentals: Based on the IBM System/360, 1970

So, consider thinking in terms of testing, rather than test cases. And if you are applying test cases, please don’t count them. And if you count them, please don’t believe that the count means anything.

They almost certainly don’t want to know about pass-fail ratios (although they might think they do).

If a test case count is not a valid measure of test coverage, then a ratio derived from that count is invalid too, whether it is used to evaluate the quality of the product or the quality of the testing. I’ve heard tell of organizations that have a policy that says “when 97% of the test cases pass, the product is ready for shipping”. It shouldn’t take long to see the foolishness of this policy; it’s like a doctor saying that when 97% of the data points in your medical checkup indicate no problem, you’re healthy.

Just as “the sheer number of tests performed is of little significance in itself”, the ratio of passing tests to failing ones is both insignificant and easy to game. Insignificant, because a product can be passing all of the tests that we’ve performed so far and still have terrible problems. Also insignificant, because a product can fail to pass hundreds of tests—but if those tests are outdated, inconsequential, overly precise, or otherwise irrelevant, there’s no problem. Easy to game, because if you want to make the product look better than it is, it’s a simple matter to perform more passing tests.
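
To see just how easy the gaming is, here is a back-of-the-envelope sketch in Python (the failure count is invented for illustration): with 100 failing checks on the books, padding the suite with a few thousand trivially passing checks clears a 97% bar without fixing a single problem.

    import math

    failures = 100            # invented for illustration
    target_pass_rate = 0.97

    # Smallest total suite size at which 100 failures still leave a 97% pass rate.
    total_needed = math.ceil(failures / (1 - target_pass_rate))
    padding = total_needed - failures

    print(padding)                        # 3234 trivially passing checks will do it
    print(1 - failures / total_needed)    # pass rate: just over 0.97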

The point of testing is not to provide a pat on the head for the product; the point is to evaluate its true status, and to identify problems that threaten the value of the product to people who matter—to the users or customers of the software, or to anyone affected by it; to the support organization; to the operations people; and, ultimately, to the business.

Several years ago, a participant in one of my Rapid Software Testing classes approached me after I had mentioned this 97% pass rate business (which I’ll call 97PR henceforth). He said, “It’s funny you should mention it. I’ve worked at two companies where they used that measure to decide when to ship.”

“Really?” I replied. “Do you mind me asking—which ones?”

“Well,” he said. “One was Nortel.” I winced; Nortel was a huge Canadian success story until all of a sudden it wasn’t. “The other,” he said, “was RIM—Research in Motion. The Blackberry people.” I winced again.

Was 97PR responsible for the demise of these two companies? Probably not—certainly not directly. But to me, the 97PR suggests a company where engineering has been reduced to scorekeeping. If you want to fool people about something, providing numbers without context is a great way to do it. And if you want other people to fool you, ask for numbers without context.

For the calculator example above, what would a better test report look like? Here’s what I might offer:

“I’ve tested the calculator for basic math operations that seem likely to be important in calculating restaurant cheques: addition, multiplication, subtraction, and division. I imagined that you would be wanting to do this for groups of up to a dozen people. I did a handful of variations of each math operation, up to the limits of what the display of the calculator supports, including stuff like dividing by zero. Beware, because if you do that by accident, you’ll lose what you’ve entered so far. (Aside: Windows Calculator loses the operations before a divide-by-zero too.) I took notes, if you want to see them.”

The client, of course, could stop me at any time. What if she didn’t? What would a deeper test report look like? Given some time, I might offer this:

“I tested the memory-store and memory-recall functions, too, and didn’t observe any problems. Even though they’re present as buttons on the calculator, I didn’t bother to test the higher-order math functions like squares, square roots, and trigonometric functions, since I reckoned you wouldn’t need those for restaurant bills and I didn’t want to waste your time by testing them. But if you want me to, I can.

“The buttons provide haptic feedback, so it’s easy to tell when they’ve been pressed, and there’s no key-repeat function, so it’s easier to avoid accidental double keypresses on this calculator than it is on others. I looked at it in low-light conditions; its LCD screen may be a little hard to see in a dark restaurant. It’s solar-powered, and it turns itself off after five minutes. When that happens, it forgets whatever data you’ve entered.

“I dumped some water on the keypad, and it continued to perform without any problems. After I immersed it in a glass of water, though, I had to let it dry for a couple of days before it started working again, but it now seems to be working just fine.”

Yes; all that takes quite a bit longer to say—or to write—than “We’ve run 5163 tests, and of those, 118 are failing, for a pass rate of 97.7 per cent.” It’s also more informative—by a country mile—about the quality of the product and the quality of the testing.

So what do you do when a manager asks for test case counts or pass-fail ratios? Here’s a reply from James Bach: “I’m sorry, but misleading you is not a service that I offer.” Consider offering a three-part testing story instead.

We’ll get to that last claim about a test report’s audience (they almost certainly don’t want to know about when the testing is going to be done (although they might think they do)) in the next and final post in this all-too-long series.