Blog Posts for the ‘Skills’ Category

Very Short Blog Posts (26): You Don’t Need Acceptance Criteria to Test

Tuesday, February 24th, 2015

You do not need acceptance criteria to test.

Reporters do not need acceptance criteria to investigate and report stories; scientists do not need acceptance criteria to study and learn about things; and you do not need acceptance criteria to explore something, to experiment with it, to learn about it, or to provide a description of it.

You could use explicit acceptance criteria as a focusing heuristic, to help direct your attention toward specific things that matter to your clients; that’s fine. You might choose to use explicit acceptance criteria as claims, oracles that help you to recognize a problem that happens as you test; that’s fine too. But there are many other ways to identify problems; quality criteria may be tacit, not explicit; and you may discover many problems that explicit acceptance criteria don’t cover.

You don’t need acceptance criteria to decide whether something is acceptable or unacceptable. As a tester you don’t have decision-making authority over acceptability anyway. You might use acceptance criteria to inform your testing, and to identify threats to the value of the product. But you don’t need acceptance criteria to test.

Very Short Blog Posts (24): You Are Not a Bureaucrat

Saturday, February 7th, 2015

Here’s a pattern I see fairly often at the end of bug reports:

Expected: “Total” field should update and display correct result.
Actual: “Total” field updates and displays incorrect result.

Come on. When you write a report like that, can you blame people for thinking you’re a little slow? Or that you’re a bureaucrat, and that testing work is mindless paperwork and form-filling? Or perhaps that you’re being condescending?

It is absolutely important that you describe a problem in your bug report, and how to observe that problem. In the end, a bug is an inconsistency between a desired state and an observed state; between what we want and what we’ve got. It’s very important to identify the nature of that inconsistency; oracles are our means of recognizing and describing problems. But in the relationship between your observation and the desired state, the expectation is the middleman. Your expectation is grounded in a principle based on some desirable consistency. If you need to make that principle explicit, leave out the expectation, and go directly for a good oracle instead.

When Programmers (and Testers) Do Their Jobs

Monday, December 22nd, 2014

For a long time, I’ve admired Robert (“Uncle Bob”) Martin’s persistent advocacy of craftsmanship in programming and software development. Recently on Twitter, he said

One of the most important tasks in the testing role is to identify alternative interpretations of apparently clear and simple statements. Uncle Bob’s statement appears clear and simple, but as with any sentence that can be read by a human, it affords multiple interpretations. One interpretation might be that “when programmers do their jobs, testers find nothing and therefore have nothing useful to contribute“. I’m pretty sure Uncle Bob didn’t mean to say that, although it seems that at least one of my colleagues might have taken that interpretation. I prefer to think Uncle Bob’s intention was to remind programmers to take responsibility for the integrity and quality of their work, and not to slight testers.

As a tester, part of my job is to help reduce the chance that statements could be misinterpreted or taken in an overly simplistic way. I think Uncle Bob probably meant the first item on this list of a few possible interpretations (and I hope he’d agree with the other ones that I offer here, too):

  • When programmers do their jobs, testers find nothing that takes the form of blatant coding errors.
  • When programmers do their jobs, testers find nothing inconsistent with what the programmers have been asked to do—although the testers might discover problems in the design or the requirements that were given to the programmers to implement.
  • When programmers do their jobs, testers find nothing that indicates the programmer has been negligent or sloppy, although even the best programmers are not perfect.
  • When programmers do their jobs, testers find nothing that makes the product hard to test; instead, they receive a highly testable product that provides access to things like log files and testable interfaces.
  • When programmers do their jobs, testers find nothing problematic, although they might discover unanticipated value in the product.
  • When programmers do their jobs, testers find nothing that interferes with deep testing—looking for rare, hidden, subtle, or platform-related problems that could escape even the most diligent programmers.
  • When programmers do their jobs, testers find nothing that slows them down in developing a more comprehensive understanding of the business needs, making their testing more relevant.
  • When programmers do their jobs, testers find nothing that takes time away from developing rich test ideas, scenarios, and experiments that yield a deep understanding of the product and its emergent behaviours.
  • When programmers do their jobs, testers find nothing more to ask for in terms of useful tools that would aid testing.

In the same thread, James Bach pointed out that even when programmers do their jobs, testers find that the product is doing its job, and that testers find important truths about the product. Neither of these is exactly “nothing”. So…

  • When programmers do their jobs, testers shine light on exactly how well the programmers have done their jobs.
  • When programmers do their jobs, testers identify ways in which other people might have different interpretations of a job well done.
  • When programmers do their jobs, testers have more time to compare our product with competitors’ products, pointing out areas of strengths and weaknesses in each one.

Programmers are also in the business of clearing up misinterpretations. I posted a simpler version of one of the ideas above on Twitter:

“When programmers do their jobs, testers find deep, rare, hidden, subtle, or platform-related problems.”

That sentence was constrained by Twitter’s 140-character limit, and constrained further by the Twitter handles of a couple of addressees to whom I was responding. Ron Jeffries, on a mission similar to mine, pointed out that some testers find deep, rare, hidden, subtle, or platform-related problems. I agree with Ron, and I’ll add that even the best testers, just like the best developers, are human, and limited, and can occasionally miss problems. So:

  • Testers (and programmers) who focus on excellence, craftsmanship, skill, and collaboration will help each other, and will tend to find problems that can be addressed before the product is released—and will tend to produce more valuable products as a result.

Very Short Blog Posts (20): More About Testability

Monday, July 14th, 2014

A few weeks ago, I posted a Very Short Blog Post on the bare-bones basics of testability. Today, I saw a very good post from Adam Knight talking about telling the testability story. Adam focused, as I did, on intrinsic testability: things in the product itself that make it more testable. But testability isn’t just a product attribute. In Heuristics of Testability (material we developed in a session of Rapid Software Testing Intensive Online), James Bach shows that testability is a set of relationships between product (“intrinsic testability”); project (“project-related testability”); tester (“subjective testability”); what we want from the product (“value-related testability”); and how we know what we know and what we need to know (“epistemic testability”).

Be sure of this: anything that makes testing harder or slower gives bugs more time or more opportunities to hide. In telling an expert and compelling story of our testing, it’s essential to identify and address things that make it harder to understand the product we’ve got—things that help to increase the risk that it won’t be the product our clients want.

Harry Collins and The Motive for Distinctions

Monday, March 3rd, 2014

“Computers and their software are two things. As collections of interacting cogs they must be ‘checked’ to make sure there are no missing teeth and the wheels spin together nicely. Machines are also ‘social prostheses’, fitting into social life where a human once fitted. It is a characteristic of medical prostheses, like replacement hearts, that they do not do exactly the same job as the thing they replace; the surrounding body compensates.

“Contemporary computers cannot do just the same thing as humans because they do not fit into society as humans do, so the surrounding society must compensate for the way the computer fails to reproduce what it replaces. This means that a complex judgment is needed to test whether software fits well enough for the surrounding humans to happily ‘repair’ the differences between humans and machines. This is much more than a matter of deciding whether the cogs spin right.”

—Harry Collins

Harry Collins—sociologist of science, author, professor at Cardiff University, a researcher in the fields of the public understanding of science, the nature of expertise, and artificial intelligence—was slated to give a keynote speech at EuroSTAR 2013. Due to illness, he was unable to do so. The quote above is the abstract from the talk that Harry never gave. (The EuroSTAR community was very lucky and grateful to have his colleague, Rob Evans, step in at the last minute with his own terrific presentation.)

Since I was directed to Harry’s work in 2010 (thank you, Simon Schaffer), James Bach and I have been galvanized by it. As we’ve been trying to remind people for years, software testing is a complex, cognitive, social task that requires skill, tacit knowledge, and many kinds of expertise if we want people to do it well. Yet explaining testing is tricky, precisely because so much of what skilled testers do is tacit, and not explicit; learned by practice and by immersion in a culture, not from documents or other artifacts; not only mechanical and algorithmic, but heuristic and social.

Harry helps us by taking a scalpel to concepts and ideas that many people consider obvious or unimportant, and dissecting those ideas to reveal the subtle and crucial details under the surface.

As an example, in Tacit and Explicit Knowledge, he takes the idea of tacit knowledge—formerly, any kind of knowledge that was not told—and divides it into three kinds: relational, the kind of knowledge that resides in an individual human mind, and that in general could be told; somatic, resident in the system of a human body and a human mind; and collective, residing in society and in the ever-changing relationships between people in a culture.

How does that matter? Consider the Google car. On the surface, operating a car looks like a straightforward activity, easily made explicit in terms of the laws of physics and the rules of the road. Look deeper, and you’ll realize that driving is a social activity, and that interaction between drivers, cyclists, and pedestrians is negotiated in real time, in different ways, all over the world.

So we’ve got Google cars on the road experimentally in California and Washington; how will they do in Beijing, in Bangalore, or in Rome? How will they interact with human drivers in each society? How will they know, as human drivers do, the extent to which it is socially acceptable to bend the rules—and socially unacceptable not to bend them?

In many respects, machinery can do far better than humans in the mechanical aspects of driving. Yet testing the Google car will require far more than unit checks or a Cucumber suite—it will require complex evaluation and judgement by human testers to see whether the machinery—with no awareness or understanding of social interactions, for the foreseeable future—can be accommodated by the surrounding culture.

That will require a shift from the way testing is done at Google according to some popular stories. If you want to find problems that matter to people before inflicting your product on them, you must test—not only the product in isolation, but in its relationships with other people.

In Rapid Software Testing, our goal all the way along has been to probe into the nature of testing and the way we talk about it, with the intention of empowering people to do it well. Part of this task involves taking relational tacit knowledge and making it explicit. Another part involves realizing that certain skills cannot be transferred by books or diagrams or video tutorials, but must be learned through experience and immersion in the task. Rather than hand-waving about “intuition” and “error guessing”, we’d prefer to talk about and study specific, observable, trainable, and manageable skills.

We could talk about “test automation” as though it were a single subject, but it’s more helpful to distinguish the many ways that we could use tools to support and amplify our testing—for checking specific facts or states, for generating data, for visualization, for modeling, for coverage analysis… Instead of talking about “automated testing” as though machines and people were capable of the same things, we’d rather distinguish between checking (something that machines can do, an activity embedded in testing) and testing (which requires humans), so as to make both our checking and our testing more powerful.

The abstract for Prof. Collins’ talk, quoted above, is an astute, concise description of why skilled testing matters. It’s also why the distinction between testing and checking matters, too. For that, we are grateful.

There will be much more to come in these pages relating Harry’s work to our craft of testing; stay tuned. Meanwhile, I give his books my highest recommendation.

Tacit and Explicit Knowledge
Rethinking Expertise (co-authored with Rob Evans)
The Shape of Actions: What Humans and Machines Can Do (co-authored with Martin Kusch)
The Golem: What You Should Know About Science (co-authored with Trevor Pinch)
The Golem at Large: What You Should Know About Technology (co-authored with Trevor Pinch)
Changing Order: Replication and Induction in Scientific Practice
Artificial Experts: Social Knowledge and Intelligent Machines

Very Short Blog Posts (11): Passing Test Cases

Wednesday, January 29th, 2014

Testing is not about making sure that test cases pass. It’s about using any means to find problems that harm or annoy people.

Testing involves far more than checking to see that the program returns a functionally correct result from a calculation.

Testing means putting something to the test, investigating and learning about it through experimentation, interaction, and challenge. Yes, tools may help in important ways, but the point is to discover how the product serves human purposes, and how it might miss the mark.

So a skilled tester does not ask simply “Does this check pass or fail?” Instead, the skilled tester probes the product and asks a much more rich and fundamental question: Is there a problem here?

Very Short Blog Posts (10): Planning and Preparation

Wednesday, January 15th, 2014

A plan is not a document. A plan is a set of ideas that may be represented by a document or by other kinds of artifacts. In Rapid Testing, we emphasize preparing your mind, your skills, and your tools, and sharpening them all as you go. We don’t reject planning, but we de-emphasize it in favour of preparation. We also recommend that you keep the artifacts that represent your plans as concise and as flexible as they can reasonably be.

The world of technology is complex and constantly changing. If you’re prepared, you have a much better chance of adapting and reacting appropriately to a situation when the plans have gone awry. But all the planning in the world can’t help you if you’re not prepared.

Very Short Blog Posts (7): Planning vs. Preparation

Sunday, November 3rd, 2013

Imagine a software project. Imagine the things that you want to accomplish, the problems you might encounter, the workarounds you could apply, the accidents (both happy and sad) that might happen, the missteps you may take, the steps you can take to prevent them; all of the actions you can perform to manage the project. Now, make a detailed plan that takes all of your expectations into account.

The more detailed your plan, the more likely it will differ from reality in important respects. Unexpected things will happen, some positive, some negative, and many of them out of your control. You can’t predict future events reliably, but you can prepare to respond to them. Therefore: you might want to relax your effort on specific plans somewhat, and emphasize developing skills and resources that will help you to deal capably with surprises.

I Might Be Wrong (But Not For Me)

Tuesday, March 6th, 2012

Jerry Weinberg tells a story (yes, it’s me; I’m telling yet another Jerry Weinberg story) of meeting an old friend who looked distraught.

“What’s the matter?” Jerry asked.

The fellow replied, “Well, I’m kind of shellshocked. My wife just left me.”

“Was that a surprise?”

“Yes, it really was,” the fellow said. “I mean, we had had some problems, but I thought they were all settled.”

Jerry paused for a moment. Then he said, “nothing is ever settled.”

Several years after hearing that story I recognized its power as a general systems law. Obviously, I didn’t discover it, but I did name it. I call it “The Unsettling Rule”: Nothing is ever settled.

In Lessons Learned in Software Testing by Kaner, Bach, and Pettichord, Lesson 145 is “Use the IEEE Standard 829 for Test Documentation”. Lesson 146, on the facing page, is “Don’t Use the IEEE Standard 829”. When the book was published, some reviewers said “What’s the problem with these guys? They can’t even get it together to tell a consistent story!” Others, including me, thought that this pair of pages in particular was wonderful. It underscored the degree to which issues in the world of software testing are not settled, the degree to which our craft is a long dialogue in which there are many voices to be heard, many options to be discussed, and many contexts to be considered.

The difference between the context-driven school (or approach; there’s now apparently disagreement about whether it’s a school or an approach!) and other schools or approaches is that these disagreements can get aired in public. There are some fundamental principles on which we agree, and there are some other things on which we don’t agree. Whatever else happens, in this community, we try to make sure that there’s no fake consensus. This is alarming and disturbing, sometimes, to some people, and it can be stressful to the participants. But when it comes up, it’s a hallmark of our community that we try to deal with it. It helps to keep us sharp, and it helps to keep us honest.

Recently I wrote a blog post in which I took the position that the often-used pass-vs.-fail ratio is an invalid and misleading measurement. To summarize the post, I said, “At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional… The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.”

I recognize that, coming from someone who claims to be context-driven, that’s pretty extreme stuff. Yet, in its form, it’s consistent with one of those pages or the other in Lessons Learned in Software Testing (with some omissions, which I’ll address shortly). It is also consistent with a set of principles that James Bach and I espouse as part of our Rapid Software Testing class:

We will not knowingly or negligently mislead our clients and colleagues. This ethical premise drives a lot of the structure of Rapid Software Testing. Testers are frequently the target of well-meaning but unreasonable or ignorant requests by their clients. We may be asked to suppress bad news, to create test documentation that we have no intention of using, or to produce invalid metrics to measure progress. We must politely but firmly resist such requests unless, in our judgment, they serve the better interests of our clients. At minimum we must advise our clients of the impact of any task or mode of working that prevents us from testing, or creates a false impression of the testing.

To me, that statement is both in tension with and consistent with several of the principles of the context-driven school, the first and second (“The value of any practice depends on its context” and “There are good practices in context, but there are no best practices”) and the seventh (“Only through judgment and skill, exercised cooperatively throughout the entire project, are we able to do the right things at the right times to effectively test our products.”)

Pass-vs.-fail ratios, to me, fly in the face of one of the context-driven “principles in action”: “Metrics that are not valid are dangerous.”

Cem Kaner disagrees with the position expressed in my post. It seems to me that Cem’s disagreement hangs on the degree of danger and our reactions to it. I hold that in practical contexts, pass-vs.-fail ratios are so dangerous that for almost all cases, they cross over the line into “unethical”, like giving the car keys to someone who is obviously drunk, or like planting land mines near a community well, even though in some rare contexts, such things could be done in good faith and without harm. Cem’s position seems to be (and I welcome correction, if it’s warranted) that although pass-vs.-fail ratios are exemplary of dangerous metrics, they’re not unethical.

Let’s start with two points that I’d like to make about the “unethical” label. One is that my ethical sense is personal, and so are the views posted on my blog. Although I’m happy when other people share them, unless otherwise stated, I don’t represent the view of any community, including my own. I don’t make claims to universal ethics. Second, Cem refers to “using the accusation of unethical as a way of shutting down discussion of whether an idea (unethical!) was any good or not.” I’m not using it that way. I have no intention whatsoever of shutting down debate (as if I could in any case!). Unless claimed otherwise, I am stating personal principles; not Right and Wrong, but right and wrong for me. I don’t know of any agency (other than society) who can make claims of Right or Wrong, and even then claims seem always context-specific.

Whether pass-vs.-fail ratios are wrong or Wrong, they’re certainly wrong for me, wrong enough that I’m uncomfortable with using them on the job. I’m sufficiently uncomfortable that I’m usually going to decline to provide them, just as I would not accept a job in which I was obliged to shoot people. Other people might choose to become mercenaries or to go to war for their countries; I’d be a conscientious objector. That wrongness is relative too, of course. It’s subject to the Relative Rule; that any abstract X is X to some person, at some time. I can only warrant my own ethical stance for the moment. My position on some issues has changed over the years, courtesy of some pleasant and unpleasant experiences. I’m not currently aware of things that might cause my stand to change in the future, but I have to leave the possibility open.

So, is providing pass vs. fail rates unethical? On reflection, I have to say reluctantly, yeah, I think so; not absolutely, but in most practical circumstances. For me, the crucial test is in the last of Cem’s questions about ethics: “Are you helping someone else lie, cheat, steal, intimidate, or cause harm?” My answer is that I see a great deal of risk—and admittedly risk is only potential harm—that I will be aiding the client in some form of oppression or deception, either to himself or to his superiors. (The latter is a situation that I have been in before, with pass-vs.-fail ratios at the centre of the story in a project associated with a $33 million dollar loss.) Most of the time, providing pass-vs.-fail ratios is a test activity that I would stop immediately, using the “mission rejected” stopping heuristic (one that I hadn’t noted until Cem himself pointed it out).

Cem doesn’t provide any contexts in which pass-vs.-fail ratios might be useful, but as a context-driven tester, it’s my obligation to accept his critique and his challenge, and consider some contexts in which I might use them. (This is the omission from my post that I mentioned above, and it’s the way that the controversy was handled in Lessons Learned: with a serving of context.) I present them in order from the least plausible to the most plausible.

“Your daughter will die” or “we’ll shoot this dog.” If someone employs a threat of harm to some person or being or something of value, I have to evaluate the relative damage afforded by providing the measure or not.

When mandated by force of law. If I were on the witness stand, and a lawyer asked me, “What were the pass-vs.-fail ratios at release time for this project,” I’d be required by law to respond. I can imagine a likely way it would play out, too: “92.7%, but I’d also like to make it clear that—” “No further questions, Your Honour.”

If I provided the data with all of the appropriate disclaimers AND I were sure that the disclaimer would be heard. If the client (and the client’s client, and so forth) were to relay the data and the disclaimer reliably to the point where the data would be used, I might be persuaded to provide the data. But I’d have to weigh that against the risk that I was wrong about the disclaimer being heard. Moreover, in my professional judgement, it would be wasting my client(s)’s time.

As a placebo. I might give a pass vs. fail ratio long enough to convince my client that it’s not helpful or necessary, while doing other things to test well and provide her with other forms of reliable information. I’d remain pretty uncomfortable with dispensing the sugar pills, though, and would work at ways of getting around it.

In the course of demonstrating that pass-vs.-fail ratios are a bad idea. In some contexts, pass-vs.-fail ratios provide what Kirk and Miller call quixotic reliability. That is, the measurement seems to correlate with other measurements of the state of the project. I might provide pass-vs.-fail ratios long enough to show a divergence between that data and other measures of project or product health.

If I were aware that the person receiving the data was in possession of all the contextual information that I believe they needed to put it to appropriate and non-harmful use. We use this in one of the exercises in our class, based on a bug from an actual product. We present a very specific set of tests that are the same in every material way but for two variables. The total domain space to put these variables in combination is a set with 2304 elements. When used in a test that covers all of these elements, 510 provide a “fail” result. All of the test cases are of the same kind, and our students know that those test cases are comparable for the purposes that they’re considering. In that case, that kind of ratio in that kind of context has some value in describing that kind of coverage. So there might be some pedagogical or rhetorical value to reporting a pass-vs.-fail ratio there. Interestingly, the root of the problem is a data type problem in a single line of code. That helps to illuminate the discussion of “one bug or 510?” which in turn illuminates how bug counts and failure counts aren’t well correlated. It also helps to illuminate opportunity cost in paying overmuch attention to this problem when there are many other things that we might test.
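The arithmetic behind that exercise can be sketched as follows. Only the figures 2304, 510, and the single root cause come from the description above; the variable names are illustrative assumptions, and the exercise itself is not reproduced here.

```python
# Figures from the class exercise described above.
total_cases = 2304      # every combination of the two variables
failing_cases = 510     # combinations that produced a "fail" result
root_causes = 1         # one data type problem in a single line of code

passing_cases = total_cases - failing_cases
fail_fraction = failing_cases / total_cases

# The same situation, described two ways.
failures_per_root_cause = failing_cases / root_causes

print(f"passing: {passing_cases}, failing: {failing_cases}")
print(f"fraction failing: {fail_fraction:.1%}")
print(f"failures per root cause: {failures_per_root_cause:.0f}")
```

Read as a failure count, the product has 510 problems; read as a bug count, it has one. The ratio only means something here because every case is of the same kind and the audience knows it.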

To me, the real challenge is in coming up with a case in which this invalid, dangerous metric in its most common applications might be used for good. In the contexts where they’re commonly discussed and used—overwhelmingly commonly, in my view—pass-vs.-fail ratios are used to express the quality of testing, the health of the project, or the readiness of the product. In those contexts, the risk of misuse, whether intentional or inadvertent, is high—like placing a loaded gun with the safety off in a crowded subway car. As I’ve heard Cem say before, “I’d like to call them an Industry Worst Practice, but being context-driven, I can’t.” Once again, Cem has reminded me of why I can’t commit to the “unethical” charge absolutely and in all cases. He’s provided me with a challenge and an opportunity to sharpen my analysis, and I thank him for that.

Postscript, March 28, 2012: In private correspondence and conversation, Cem suggested a different interpretation of a paragraph from this post that I quoted above to provide context for this post. In order to ward off that interpretation, here’s how I might write that paragraph today:

“The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to this invalid number without additional information means reducing the product’s relationship with people to this invalid number. By extension when this invalid number is being used to evaluate people, that means reducing people to this invalid number too. So to irresponsible, unethical, and unprofessional, in this case we could add unscientific and inhumane.”

To be clear: these two posts have not been a blanket condemnation of all measurement, but of a particular metric that fails spectacularly when subjected to the tests of construct validity and reasonable and foreseeable side effects in Kaner and Bond’s Software Engineering Metrics: What Do They Measure and How Do We Know?. Pass vs. fail is not an imperfect metric; this is a metric that has no discernable construct validity to me (or even to Cem). I’ve both experienced and seen pain and systematic deception with this metric at the centre of it. In this, it’s not like imperfect financial figures that are generated by legitimate companies subject to scrutiny by regulators, by auditors, by shareholders, and by markets. It’s more like financial forecasting data dreamed up by Bernie Madoff. I don’t mind dealing with imperfect but plausibly valid information; that’s all a tester ever gets to do, really. But if Bernie Madoff were to ask me to lend my credibility to his models, data, or business practices, I’d feel personally bound to decline that particular request.

Why Pass vs. Fail Rates Are Unethical (Test Reporting Part 1)

Thursday, February 23rd, 2012

Calculating a ratio of passing tests to failing tests is a relatively easy task. If it is used as a means of estimating the state of a development project, though, the ratio is invalid, irrelevant, and misleading. At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional.

A passing test is no guarantee that the product is working correctly or reliably. Instead, a passing test is an observation that the program appeared to work correctly, under some set of conditions that we were conscious of (and many that we weren’t), using a selection of specific inputs (and not using the rest of an essentially infinite set), at some time (to which we will never return), on some machine (that was in a particular state at that time; we observed and understood only a fraction of that state), based on a handful of things that we were looking at (and a boatload of things that we weren’t looking at, not that we’d have any idea where or how to look for everything). At best, a passing test is a rumour of success. Take any of the parameters above, change one bit, and we could have had a failing test instead.
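A minimal sketch of that point, assuming a hypothetical `apply_discount` function that stands in for product code (it is not taken from any real product). The check passes on the one input it selects, while a nearby input that nobody tried exposes a problem.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical product code: price after a percentage discount."""
    # Nothing here prevents percent from exceeding 100,
    # so the resulting price can go negative.
    return price_cents - (price_cents * percent) // 100


def check_discount() -> bool:
    """A check: one selected input, one decision rule."""
    return apply_discount(1000, 10) == 900


passed = check_discount()                 # the "rumour of success"
unexplored = apply_discount(1000, 150)    # an input the check never covered

print("check passed:", passed)
print("price after a 150% discount:", unexplored)
```

The passing result says only that one input behaved as expected at one moment; it says nothing about the negative price waiting one parameter away.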

Meanwhile, a failing test is no guarantee of a failure in the product we’re testing. Someone may have misunderstood a requirement, and turned that misunderstanding into an inappropriate test procedure. Someone may have understood the requirement comprehensively, and erred in establishing the test procedure; someone else may have erred in following it. The platform on which we’re testing may be misconfigured, or there may be something wrong with something on the system, such that our failing test points to that problem and is not an indicator of a problem in our product. If the test was being assisted by automation, perhaps there was a bug in the automation. Our test tools may be misconfigured such that they’re not doing what we think they’re doing. When generating data, we may have misclassified invalid data as valid, or vice versa, and not noticed it. We may have inadvertently entered the wrong data. The timing of the test may be off, such that the system was not ready for the input we provided. There may be an as-yet-not-understood reason why the product is providing a result which seems incorrect to us, but which is in fact correct. A failing test is an allegation of failure.

When we do the math based on these assumptions, the unit of measurement in which pass/fail rates are expressed is rumours over allegations. Is this a credible unit of measurement?

Neither rumours nor allegations are things. Uncertainties are not units with a valid natural scale against which they can be measured. One entity that we call a “test case”, whether passing or failing, may consist of a single operation, observation and decision rule. Another entity called “test case” may consist of hundreds or thousands or millions of operations, all invisible, with thousands of opportunities for a tester to observe problems based not only on explicit knowledge, but also on tacit knowledge. Measuring while failing to account for clear differences between entities demolishes the construct validity of the measurement. Treating test cases, whether passing or failing, as though they were countable objects is a classic case of the reification fallacy. Aggregating scale-free, reified (non-)entities loses information about each instance, and loses information about any relationships between them. Some number of rumours doesn’t tell us anything about the meaning, significance, or value of any given passing test, nor does the aggregate tell us anything about coverage that the passing tests provide, nor does the number tell us about missing coverage. Some number of allegations of which we’re aware doesn’t tell us anything about the seriousness of those allegations, nor does it tell us about undiscovered allegations. Dividing one invalid number by another invalid number doesn’t mean the invalidity cancels and produces a valid ratio.
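To make the aggregation problem concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the two suites, the test names, the operation counts, and the notes are hypothetical, and `pass_fail_ratio` is just an assumed helper, not any tool's real API. It shows two sets of results that differ wildly in what they did and what they found, yet collapse to the same number.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One hypothetical 'test case' outcome (all fields are illustrative)."""
    name: str
    passed: bool
    operations: int  # how many operations the case actually performed
    note: str        # what the outcome really means, if anyone asked


def pass_fail_ratio(results):
    """Ratio of passing to failing cases; infinite if nothing failed."""
    passed = sum(1 for r in results if r.passed)
    failed = len(results) - passed
    return passed / failed if failed else float("inf")


# Suite A: three trivial checks pass; the one failure is a test-rig problem.
suite_a = [
    CaseResult("title is displayed", True, 1, "single glance at one screen"),
    CaseResult("logo is displayed", True, 1, "single glance at one screen"),
    CaseResult("footer is displayed", True, 1, "single glance at one screen"),
    CaseResult("total updates", False, 12, "test environment misconfigured"),
]

# Suite B: one pass spans thousands of operations; the failure is severe.
suite_b = [
    CaseResult("bulk import", True, 50_000, "long scenario, much unobserved"),
    CaseResult("title is displayed", True, 1, "single glance at one screen"),
    CaseResult("logo is displayed", True, 1, "single glance at one screen"),
    CaseResult("save order", False, 40, "customer data silently lost"),
]

ratio_a = pass_fail_ratio(suite_a)
ratio_b = pass_fail_ratio(suite_b)
print(ratio_a, ratio_b)
```

Both suites report the same ratio, even though one failure is a problem with the test rig and the other is lost customer data, and even though the passing cases differ by orders of magnitude in size. The notes and operation counts, which carry the story, are exactly what the division throws away.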

When the student has got an answer wrong, and the student is misinformed, there’s a problem. What does the number of questions that the teacher asked have to do with it? When a manager interviews a candidate for a job, and halfway through the interview he suddenly starts shouting obscenities at her, will the number of questions the manager asked have anything to do with her hiring decision? If the battery on the Tesla Roadster is ever completely drained, the car turns into a brick with a $40,000 bill attached to it. Does anyone, anywhere, care about the number of passing tests that were done on the car?

If we are asked to produce pass/fail ratios, I would argue that it’s our professional responsibility to politely refuse to do it, and to explain why: we should not be offering our clients the service of self-deception and illusion, nor should our clients accept those services. The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.

So what’s the alternative? We’ll get to that tomorrow.