A Tale of Four Projects

April 23rd, 2014

Once upon a time, in a high-tech business park far, far away, there were four companies, each working on a development project.

In Project Blue, the testers created a suite of 250 test cases, based on 50 use cases, before development started. These cases remained static throughout the project. Each week saw incremental improvement in the product, although things got a little stuck towards the end. Project Blue kept a table of passing vs. failing test cases, which they updated each week.

Date Passed Failed Total
01-Feb 25 225 250
08-Feb 125 125 250
15-Feb 175 75 250
22-Feb 200 50 250
29-Feb 225 25 250
07-Mar 225 25 250

In Project Red, testers constructed a suite of 10 comprehensive scenarios. The testers refined these scenarios as development progressed. In the last week of the project, a change in one of the modules broke several elements in a scenario that had worked in the first two weeks. One of Project Red’s KPIs was a weekly increase in the Passing Scenarios Ratio.

Date Passed Failed Total
01-Feb 1 9 10
08-Feb 5 5 10
15-Feb 7 3 10
22-Feb 8 2 10
29-Feb 9 1 10
07-Mar 9 1 10

Project Green used an incremental strategy to design and refine a suite of test cases. Management added more testers to the project each week. As the project went on, the testers also recruited end users to assist with test design and execution. At the end of four weeks, the team’s Quality Progress Table looked like this:

Date Passed Failed Total
01-Feb 1 9 10
08-Feb 25 25 50
15-Feb 70 30 100
22-Feb 160 40 200

In Week 5 of Project Green, the managers called a monster triage session that led to the deferral of dozens of Severity 2, 3, and 4 bugs. Nine showstopper bugs remained. In order to focus on the most important problems, management decreed that only the showstoppers would be fixed and tested in the last week. And so, in Week 6 of Project Green, the programmers worked on only the showstopper bugs. The fixes were tested using 30 test cases. Testing revealed that six showstoppers were gone, and three persisted. All the deferred Severity 2, 3, and 4 bugs remained in the product, but to avoid confusion, they no longer appeared on the Quality Progress Table.

Date Passed Failed Total
01-Feb 1 9 10
08-Feb 25 25 50
15-Feb 70 30 100
22-Feb 160 40 200
29-Feb 450 50 500
07-Mar 27 3 30

In the first few weeks of Project Purple, testers worked interactively with the product to test the business rules, while a team of automation specialists attempted to create a framework that would exercise the product under load and stress conditions. At the end of Week Four, the Pass Rate Dashboard looked like this:

Date Passed Failed Total
01-Feb 1 9 10
08-Feb 25 25 50
15-Feb 70 30 100
22-Feb 80 20 100

In Week 5 of Project Purple, the automation framework was finally ready. 820 performance scenario tests were run that revealed 80 new bugs, all related to scalability problems. In addition, none of the bugs opened in Week 4 were fixed; two key programmers were sick. So at the end of Week 5, this was the picture from the Pass Rate Dashboard:

Date Passed Failed Total
01-Feb 1 9 10
08-Feb 25 25 50
15-Feb 70 30 100
22-Feb 80 20 100
29-Feb 900 100 1000

In Week 6 of Project Purple, the programmers heroically fixed 40 bugs. But that week, a tester discovered a bug in the automation framework. When that bug was fixed, the framework revealed 40 entirely new bugs. And they’re bad; the programmers report most of them will take at least three weeks to fix. Here’s the Pass Rate Dashboard at the end of Week 6:

Date Passed Failed Total
01-Feb 1 9 10
08-Feb 25 25 50
15-Feb 70 30 100
22-Feb 80 20 100
29-Feb 900 100 1000
07-Mar 900 100 1000

Here’s the chart that plots the percentage of passing test cases, per week, for all four projects.
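
The numbers behind that chart can be recomputed from the tables. What follows is a minimal sketch in Python, not anyone's actual reporting tool: the names RESULTS, RATES, and pass_rates are made up for illustration, the counts are copied from the tables above, and Project Red's 15-Feb row is taken as 7 passed and 3 failed so that it adds up to its total of 10.

```python
# Weekly (passed, total) counts copied from the four tables above.
# Assumption: Project Red's 15-Feb row is read as 7 passed out of 10.
RESULTS = {
    "Blue":   [(25, 250), (125, 250), (175, 250), (200, 250), (225, 250), (225, 250)],
    "Red":    [(1, 10), (5, 10), (7, 10), (8, 10), (9, 10), (9, 10)],
    "Green":  [(1, 10), (25, 50), (70, 100), (160, 200), (450, 500), (27, 30)],
    "Purple": [(1, 10), (25, 50), (70, 100), (80, 100), (900, 1000), (900, 1000)],
}


def pass_rates(weeks):
    """Return the percentage of passing test cases for each week."""
    return [round(100 * passed / total) for passed, total in weeks]


# One list of weekly percentages per project.
RATES = {name: pass_rates(weeks) for name, weeks in RESULTS.items()}

for name, rates in RATES.items():
    print(f"{name:<7}", rates)
```

Every project prints the same series: 10, 50, 70, 80, 90, 90. The percentage keeps nothing of what was tested, how the suite changed, or which bugs were deferred.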

Four entirely different projects.

As usual, James Bach contributed to this article.

“In The Real World”

April 21st, 2014

In Rapid Software Testing, James Bach, our colleagues, and I advocate an approach that puts the skill set and the mindset of the individual tester—rather than some document or tool or test case or process model—at the centre of testing. We advocate an exploratory approach to testing so that we find not only the problems that people have anticipated, but also the problems they didn’t anticipate. We challenge the value of elaborately detailed test scripts that are expensive to create and maintain. We note that inappropriate formality can drive testers into an overly focused mode that undermines their ability to spot important problems in the product. We don’t talk about “automated testing”, because testing requires study, learning, critical thinking, and qualitative evaluation that cannot be automated. We talk instead about automated checking, and we also talk about using tools (especially inexpensive, lightweight tools) to extend our human capabilities.

We advocate stripping documentation back to the leanest possible form that still completely supports and fulfills the mission of testing. We advocate that people take measurement seriously by studying measurement theory and statistics, and by resisting or rejecting metrics that are based on invalid models. We’re here to help our development teams and their managers, not mislead them.

All this appeals to thinking, curious, engaged testers who are serious about helping our clients to identify, evaluate, and stamp out risk. But every now and then, someone objects. “Michael, I’d love to adopt all this stuff. Really I would. But my bosses would never let me apply it. We have to do stuff like counting test cases and defect escape ratios, because in the real world…” And then he doesn’t finish the sentence.

I have at least one explanation for why the sentence dangles: it’s because the speaker is going through cognitive dissonance. The speaker realizes that what he is referring to is not his sense of the real world, but a fantasy world that some people try to construct, often with the goal of avoiding the panic they feel when they confront complex, messy, unstable, human reality.

Maybe my friend shares my view of the world. That’s what I’m hoping. My world is one in which I have to find important problems quickly for my clients, without wasting my client’s time or money. In my world, I have to develop an understanding of what I’m testing, and I have to do it quickly. I’ve learned that specifications are rarely reliable, consistent, or up to date, and that the development of the product has often raced ahead of people’s capacity to document it. In my world, I learn most rapidly about products and tools by interacting with them. I might learn something about the product by reading about it, but I learn more about the product by talking with people about it, asking lots of questions about it, sketching maps of it, trying to describe it. I learn even more about it by testing it, and by describing what I’ve found and how I’ve tested. I don’t structure my testing around test cases, since my focus is on discovering important problems in the product—problems that may not be reflected in scripted procedures that are expensive to prepare and maintain. At best, test cases and documents might help, maybe, but in my world, I find problems mostly because of what I think and what I do with the product.

In my world, it would be fantasy to think that a process model or a document—rather than the tester—is central to excellent testing (just as only in my fantasy world could a document—rather than the manager—be central to excellent management). In my world, people are at the centre of the things that people do. In my version of a fantasy world, one could believe that conservative confirmatory testing is enough to show that the product fulfills its goals without significant problems for the end user. In my world, we must explore and investigate to discover unanticipated problems and risks. If I wanted to do fake testing, I would foster the appearance of productivity by creating elaborate, highly-polished documentation that doesn’t help us to do work more effectively and more efficiently. But I don’t want to do that. In my world, doing excellent testing takes precedence over writing about how we intend—or pretend—to do testing. In my version of a dystopic fantasy world, it would be okay to accept numbers without question or challenge, even if the number has little or no relationship to what supposedly is being measured. In my world, quantitative models allow us to see some things more clearly while concealing other things that might be important. So in my world, it’s a good idea to evaluate our numbers and our models—and our feelings about them—critically to reduce the chance that we’ll mislead ourselves and others. I could fantasize about a world in which it would be obvious that numbers should drive decisions. In what looks like the real world to me, it’s safer to use numbers to help question our feelings and our reasoning.

Are you studying testing and building your skills? Are you learning about and using approaches that qualitative researchers use to construct and describe their work? If you’re using statistics to describe the product or the project, are you considering the validity of your constructs—and threats to validity? Are you considering how you know what you know? Are you building and refining a story of the product, the work you’re doing, and the quality of that work? If you’re creating test tools, are you studying programming and using the approaches that expert programmers use? Are you considering the cost and value of your activities and the risks that might affect them? Are you looking only at the functional aspects of the product, or are you learning about how people actually use the product to get real work done? Real-world people doing real-world jobs—research scientists, statisticians, journalists, philosophers, programmers, managers, subject matter experts—do these things. I believe I can learn to do them, and I’m betting you could learn them too. It’s a big job to learn all this stuff, but learning—for ourselves and for others—is the business that I think we testers are in, in my real world. Really.

Is your testing bringing people closer to what you would consider a real understanding of the real product—especially real problems and real risks in the product—so that they can make informed decisions about it? Or is it helping people to sustain ideas that you would consider fantasies?

Very Short Blog Posts (16): Usability Problems Are Probably Testability Problems Too

April 16th, 2014

Want to add ooomph to your reports of usability problems in your product? Consider that usability problems also tend to be testability problems. The design of the product may make it frustrating, inconsistent, slow, or difficult to learn. Poor affordances may conceal useful features and shortcuts. Missing help files could fail to address confusion; self-contradictory or misleading help files could add to it. All of these things may threaten the value of the product for the intended users. Bad as they might be, problems like this may also represent issues for testing. A product with a slick and speedy user interface is more likely to be a pleasure to test. Clumsy or demotivating user interfaces present issues that may make testing harder or slower—and issues give bugs more time and more opportunity to hide.

Related post: You’ve Got Issues

I’ve Had It With Defects

April 2nd, 2014

The longer I stay in the testing business and reflect on the matter, the more I believe the concept of “defects” to be unclear and unhelpful.

A program may have a coding error that is clearly inconsistent with the program’s specification, whereupon I might claim that I’ve found a defect. The other day, an automatic product update failed in the middle of the process, rendering the product unusable. Apparently a defect. Yet let’s look at some other scenarios.

  • I perform a bunch of testing without seeing anything that looks like a bug, but upon reviewing the code, I see that it’s so confusing and unmaintainable in its current state that future changes will be risky. Have I found a defect? And how many have I found?
  • I observe that a program seems to be perfectly coded, but to a terrible specification. Is the product defective?
  • A program may be perfectly coded to a wonderfully written specification—even though the writer of the specification may have done a great job at specifying implementation for a set of poorly conceived requirements. Should I call the product defective?
  • Our development project is nearing release, but I discover a competitive product with this totally compelling feature that makes our product look like an also-ran. Is our product defective?
  • Half the users I interview say that our product should behave this way, saying that it’s ugly and should be easier to learn; the other half say it should behave that way, pointing out that looks don’t matter, and once you’ve used the product for a while, you can use it quickly and efficiently. Have I identified a defect?
  • The product doesn’t produce a log file. If there were a log file, my testing might be faster, easier, or more reliable. If the product is less testable than it could be, is it defective?
  • I notice that the Web service that supports our chain of pizza stores slows down noticeably at dinner time, when more people are logging in to order. I see a risk that if business gets much better, the site may bog down sufficiently that we may lose some customers. But at the moment, everything is working within the parameters. Is this a defect? If it’s not a defect now, will it magically change to a defect later?

On top of all this, the construct “defect” is at the centre of a bunch of unhelpful ideas about how to measure the quality of software or of testing: “defect count”; “defect detection rate”; “defect removal efficiency”. But what is a defect? If you visit LinkedIn, you can often read some school-marmish clucking about defects. People who talk about defects seem to refer to things that are absolutely and indisputably wrong with the product. Yet in my experience, matters are rarely so clear. If it’s not clear what is and is not a defect, then counting them makes no sense.

That’s why, as a tester, I find it much more helpful to think in terms of problems. A problem is “a difference between what is perceived and what is desired” or “an undesirable situation that is significant to and maybe solvable by some agent, though probably with some difficulty”. (I’ve written more about that here.) A problem is not something that exists in the software as such; a problem is relative, a relationship between the software and some person(s). A problem may take the form of a bug—something that threatens the value of the product—or an issue—something that threatens the value of the testing, or of the project, or of the business.

As a tester, I do not break the software. As a reminder of my actual role, I often use a joke that I heard attributed to Alan Jorgenson, but which may well have originated with my colleague James Bach: “I didn’t break the software; it was broken when I got it.” That is, rather than breaking the software, I find out how and where it’s broken. But even that doesn’t feel quite right. I often find that I can’t describe the product as “broken” per se; yet the relationship between the product and some person might be broken. I identify and illuminate problematic relationships by using and describing oracles, the means by which we recognize problems as we’re testing.

Oracles are not perfect and testers are not judges, so it would seem presumptuous of me to label something a defect. As James points out, “If I tell my wife that she has a defect, that is not likely to go over well. But I might safely say that she is doing something that bugs me.” Or as Cem Kaner has suggested, shipping a product with known defects means shipping “defective software”, which could have contractual or other legal implications (see here and here, for examples).

On the one hand, I find that “searching for defects” seems too narrow, too absolute, too presumptuous, and politically risky for me. On the other, if you look at the list above, all those things that were questionable as defects could be described more easily and less controversially as problems that potentially threaten the value of the product. So “looking for problems” provides me with wider scope, recognizes ambiguity, encourages epistemic humility, and acknowledges subjectivity. That in turn means that I have to up my game, using many different ways to model the product, considering lots of different quality criteria, and looking not only for functional problems but anything that might cause loss, harm, or annoyance to people who matter.

Moreover, rejecting the concept of defects ought to help discourage us from counting them. Given the open-ended and uncertain nature of “problem”, the idea of counting problems would sound silly to most people—but we can talk about problems. That would be a good first step towards solving them—addressing some part of the difference between what is perceived and what is desired by some person or persons who matter.

That’s why I prefer looking for problems—and those are my problems with “defects”.

Very Short Blog Posts (15): “Manual” and “Automated” Testers

April 1st, 2014

“Help Wanted. Established scientific research lab seeks Intermediate Level Manual Scientist. Role is intended to complement our team of Automated and Semi-Automated Scientists. The successful candidate will perform research and scientific experiments without any use of tools (including computer hardware or software). Requires good communication skills and knowledge of the Hypothesis Development Life Cycle. Bachelor’s degree or five years of experience in manual science preferred.”

Sounds ridiculous, doesn’t it? It should.

Related post:

“Manual” and “Automated” Testing

Very Short Blog Posts (14): “It works!”

March 31st, 2014

“It works” is one of Jerry Weinberg’s nominees for the most ambiguous sentence in the English language.

To me, when people say “it works”, they really mean

Some aspect
of some feature
or some function
appears
to meet some requirement
to some degree
based on some theory
and based on some observation
that some agent made
under some conditions
once
or maybe more.

One of the most important tasks for a tester is to question the statement “it works”, to investigate the claim, and to elaborate on it such that important people on the project know what it really means.

Related posts:

A Little Blog Post on a Big Idea: Does the Software Work? (Pete Walen)
Behavior-Driven Development vs. Testing (James Bach)
Perfect Software and Other Illusions about Testing (Jerry Weinberg)

Very Short Blog Posts (13): When Will Testing Be Done?

March 21st, 2014

When a decision maker asks “When will testing be done?”, in my experience, what she really means is “When will I have enough information about the state of the product and the project, such that I can decide to release or deploy the product?”

There are a couple of problems with the latter question. First, as Cem Kaner puts it, “testing is an empirical, technical investigation of the product, done on behalf of stakeholders, that provides quality-related information of the kind that they seek”. Yet the decision to ship is a business decision, and not purely a technical one; factors other than testing inform the shipping decision. Second, only the decision-maker can decide how much information is enough for her purposes.

So how should a tester answer the question “When will testing be done?” My answer would go like this:

“Testing will be done when you decide to ship the product. That will probably be when you feel that you have enough information about the product, its value, and real and potential risks—and about what I’ve covered and how well I’ve covered it to find those things out. So I will learn everything I can about the product, as quickly as possible, and I’ll continuously communicate what I’ve learned to you. I’ll also help you to identify things that you might consider important influences on your decision. If you’d like me to keep testing after deployment (for example, to help technical support), I’ll do that too. Testing will be done when you decide that you’re satisfied that you need no more information from testing.”

That’s your very (or at least pretty) short blog post. For more, see:

Test Estimation is Really Negotiation

Test Project Estimation, The Rapid Way

Project Estimation and Black Swans (Part 5): Test Estimation: Is there really such a thing as a test project, or is it mostly inseparable from some other activities?

Got You Covered: Excellent testing starts by questioning the mission. So, the first step when we are seeking to evaluate or enhance the quality of our test coverage is to determine for whom we’re determining coverage, and why.

Cover or Discover: Excellent testing isn’t just about covering the “map”—it’s also about exploring the territory, which is the process by which we discover things that the map doesn’t cover.

A Map By Any Other Name: A mapping illustrates a relationship between two things. In testing, a map might look like a road map, but it might also look like a list, a chart, a table, or a pile of stories. We can use any of these to help us think about test coverage.

Testing, Checking, and Convincing the Boss to Explore: You might want to take a more exploratory approach to the testing of your product or service, yet you might face some difficulty in persuading people who are locked into an idea of testing the product as “checking to make sure that it works”. So, some colleagues came up with ideas that might help.

Harry Collins and The Motive for Distinctions

March 3rd, 2014

“Computers and their software are two things. As collections of interacting cogs they must be ‘checked’ to make sure there are no missing teeth and the wheels spin together nicely. Machines are also ‘social prostheses’, fitting into social life where a human once fitted. It is a characteristic of medical prostheses, like replacement hearts, that they do not do exactly the same job as the thing they replace; the surrounding body compensates.

“Contemporary computers cannot do just the same thing as humans because they do not fit into society as humans do, so the surrounding society must compensate for the way the computer fails to reproduce what it replaces. This means that a complex judgment is needed to test whether software fits well enough for the surrounding humans to happily ‘repair’ the differences between humans and machines. This is much more than a matter of deciding whether the cogs spin right.”

—Harry Collins

Harry Collins—sociologist of science, author, professor at Cardiff University, a researcher in the fields of the public understanding of science, the nature of expertise, and artificial intelligence—was slated to give a keynote speech at EuroSTAR 2013. Due to illness, he was unable to do so. The quote above is the abstract from the talk that Harry never gave. (The EuroSTAR community was very lucky and grateful to have his colleague, Rob Evans, step in at the last minute with his own terrific presentation.)

Since I was directed to Harry’s work in 2010 (thank you, Simon Schaffer), James Bach and I have been galvanized by it. As we’ve been trying to remind people for years, software testing is a complex, cognitive, social task that requires skill, tacit knowledge, and many kinds of expertise if we want people to do it well. Yet explaining testing is tricky, precisely because so much of what skilled testers do is tacit, and not explicit; learned by practice and by immersion in a culture, not from documents or other artifacts; not only mechanical and algorithmic, but heuristic and social.

Harry helps us by taking a scalpel to concepts and ideas that many people consider obvious or unimportant, and dissecting those ideas to reveal the subtle and crucial details under the surface. As an example, in Tacit and Explicit Knowledge, he takes tacit knowledge—formerly, any kind of knowledge that was not told—and divides it into three kinds: relational, the kind of knowledge that resides in an individual human mind, and that in general could be told; somatic, resident in the system of a human body and a human mind; and collective, residing in society and in the ever-changing relationships between people in a culture.

How does that matter? Consider the Google car. On the surface, operating a car looks like a straightforward activity, easily made explicit in terms of the laws of physics and the rules of the road. Look deeper, and you’ll realize that driving is a social activity, and that interaction between drivers, cyclists, and pedestrians is negotiated in real time, in different ways, all over the world. So we’ve got Google cars on the road experimentally in California and Washington; how will they do in Beijing, in Bangalore, or in Rome? How will they interact with human drivers in each society? How will they know, as human drivers do, the extent to which it is socially acceptable to bend the rules—and socially unacceptable not to bend them? In many respects, machinery can do far better than humans in the mechanical aspects of driving. Yet testing the Google car will require far more than unit checks or a Cucumber suite—it will require complex evaluation and judgement by human testers to see whether the machinery—with no awareness or understanding of social interactions, for the foreseeable future—can be accommodated by the surrounding culture. That will require a shift from the way testing is done at Google according to some popular stories. If you want to find problems that matter to people before inflicting your product on them, you must test—not only the product in isolation, but in its relationships with other people.

Our goal, all the way along, has been to probe into the nature of testing and the way we talk about it, with the intention of empowering people to do it well. Part of this task involves taking relational tacit knowledge and making it explicit. Another part involves realizing that certain skills cannot be transferred by books or diagrams or video tutorials, but must be learned through experience and immersion in the task. Rather than hand-waving about “intuition” and “error guessing”, we’d prefer to talk about and study specific, observable, trainable, and manageable skills. We could talk about “test automation” as though it were a single subject, but it’s more helpful to distinguish the many ways that we could use tools to support and amplify our testing—for checking specific facts or states, for generating data, for visualization, for modeling, for coverage analysis… Instead of talking about “automated testing” as though machines and people were capable of the same things, we’d rather distinguish between checking (something that machines can do, an activity embedded in testing) and testing (which requires humans), so as to make both our checking and our testing more powerful.

The abstract for Prof. Collins’ talk, quoted above, is an astute, concise description of why skilled testing matters. It’s also why the distinction between testing and checking matters. For that, we are grateful.

There will be much more to come in these pages relating Harry’s work to our craft of testing; stay tuned. Meanwhile, I give his books my highest recommendation.

Tacit and Explicit Knowledge
Rethinking Expertise (co-authored with Rob Evans)
The Shape of Actions: What Humans and Machines Can Do (co-authored with Martin Kusch)
The Golem: What You Should Know About Science (co-authored with Trevor Pinch)
The Golem at Large: What You Should Know About Technology (co-authored with Trevor Pinch)
Changing Order: Replication and Induction in Scientific Practice
Artificial Experts: Social Knowledge and Intelligent Machines

Very Short Blog Posts (12): Scripted Testing Depends on Exploratory Testing

February 23rd, 2014

People commonly say that exploratory testing “is a luxury” that “we do after we’ve finished our scripted testing”. Yet there is no scripted procedure for developing a script well. To develop a script, we must explore requirements, specifications, or interfaces. This requires us to investigate the product and the information available to us; to interpret them and to seek ambiguity, incompleteness, and inconsistency; to model the scope of the test space, the coverage, and our oracles; to conjecture, experiment, and make discoveries; and to perform testing and obtain feedback on how the scripts relate to the actual product, rather than the one imagined or described or modeled in an artifact; to observe and interpret and report the test results, and to feed them back into the process; and to do all of those things in loops and bursts of testing activity. Scripted testing is preceded by and embedded in exploratory processes that are not luxuries, but essential.

“We are unable to reply directly”

February 10th, 2014

Apropos of my recent post responding to the sentiment “We have to automate”, I got a splendid example of the suppressed choice again today. If you haven’t read that post, you might find it helpful to read it now to set the context for my main point here.

It started when I was sitting at home this morning, using my laptop. The dialog below popped up on my screen.

An unhelpful message from Google Calendar Sync

Clicking on the link in the dialog brought forth the Web page that you see below it. Actually, Google Sync isn’t syncing my (Outlook) calendar with a mobile device (leastwise, not to my knowledge—which would be another issue). Google Sync is syncing my calendar with my wife’s calendar (on another laptop) and with a colleague’s calendar somewhere far away. Now: arguably a laptop is a mobile device, but that doesn’t seem to be what the Web page refers to. There are links associated with mobile browsers, Android, or iOS devices. I can’t fathom how the apparent purpose of the dialog relates to the page that gets delivered. So it seems to me that no human at Google has evaluated this dialog and this link to see that they match up—or if the human evaluation was done, no one has seen fit to address the mismatch.

So, somewhat later in the day, I got an email from the Google Accounts Help Center. The message may have been related to the fact that, on Friday, my frequent-traveler movements seem to have triggered an alert that temporarily blocked access to my mail account. The email contained an invitation to complete a short survey: “Take one minute to answer a few short questions to help make the Google Accounts Help Center better. If you visited the Google Accounts Help Center in the past 3 days, please click the button below to complete the survey.”

Well, I visited the Help Center on Friday (I guess… did I?), so that’s within the last three days. On the other hand, the survey points to something different. One question is:

How satisfied or dissatisfied are you with your most recent Google Accounts Help visit today?

Another is

Why did you visit Google Accounts Help today?

Note “today”. So maybe the survey refers to today’s calendar incident; maybe to Friday’s account block. I can’t tell. And in neither case did I ask a question per se. Oh well. I griped about the 2016 error message. If they can’t sort it out, they’ll be able to figure it out by getting in touch with me.

Yet it’s unlikely that they’ll do that. The survey finished with

While we’re unable to answer your question directly, we’ll use this information to improve our online help resources.

And here’s how this ties to Friday’s post: Google is able to answer my question directly—or they certainly could be. They don’t want to answer my question directly. They choose not to answer. That’s different.