
Why Pass vs. Fail Rates Are Unethical (Test Reporting Part 1)

Calculating a ratio of passing tests to failing tests is a relatively easy task. If it is used as a means of estimating the state of a development project, though, the ratio is invalid, irrelevant, and misleading. At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional.

A passing test is no guarantee that the product is working correctly or reliably. Instead, a passing test is an observation that the program appeared to work correctly, under some set of conditions that we were conscious of (and many that we weren’t), using a selection of specific inputs (and not using the rest of an essentially infinite set), at some time (to which we will never return), on some machine (that was in a particular state at that time; we observed and understood only a fraction of that state), based on a handful of things that we were looking at (and a boatload of things that we weren’t looking at, not that we’d have any idea where or how to look for everything). At best, a passing test is a rumour of success. Take any of the parameters above, change one bit, and we could have had a failing test instead.

Meanwhile, a failing test is no guarantee of a failure in the product we’re testing. Someone may have misunderstood a requirement, and turned that misunderstanding into an inappropriate test procedure. Someone may have understood the requirement comprehensively, and erred in establishing the test procedure; someone else may have erred in following it. The platform on which we’re testing may be misconfigured, or there may be something wrong with something on the system, such that our failing test points to that problem and is not an indicator of a problem in our product. If the test was being assisted by automation, perhaps there was a bug in the automation. Our test tools may be misconfigured such that they’re not doing what we think they’re doing. When generating data, we may have misclassified invalid data as valid, or vice versa, and not noticed it. We may have inadvertently entered the wrong data. The timing of the test may be off, such that the system was not ready for the input we provided. There may be an as-yet-not-understood reason why the product is providing a result which seems incorrect to us, but which is in fact correct. A failing test is an allegation of failure.

When we do the math based on these assumptions, the unit of measurement in which pass/fail rates are expressed is rumours over allegations. Is this a credible unit of measurement?

Neither rumours nor allegations are things. Uncertainties are not units with a valid natural scale against which they can be measured. One entity that we call a “test case”, whether passing or failing, may consist of a single operation, observation, and decision rule. Another entity called “test case” may consist of hundreds or thousands or millions of operations, all invisible, with thousands of opportunities for a tester to observe problems based not only on explicit knowledge, but also on tacit knowledge. Measuring while failing to account for clear differences between entities demolishes the construct validity of the measurement. Treating test cases—whether passing or failing—as though they were countable objects is a classic case of the reification fallacy. Aggregating scale-free, reified (non-)entities loses information about each instance, and loses information about any relationships between them. Some number of rumours doesn’t tell us anything about the meaning, significance, or value of any given passing test, nor does the aggregate tell us anything about the coverage that the passing tests provide, nor does the number tell us about missing coverage. Some number of allegations of which we’re aware doesn’t tell us anything about the seriousness of those allegations, nor does it tell us about undiscovered allegations. Dividing one invalid number by another invalid number doesn’t mean the invalidity cancels and produces a valid ratio.
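To make the aggregation problem concrete, here is a minimal sketch of my own (the test names and operation counts are invented for illustration, not taken from any real project): two entities called “test case” that differ enormously in scope collapse into identical tallies, and the ratio that comes out the other end preserves none of what distinguished them.

```python
# Illustrative sketch only: hypothetical "test cases" with made-up names and
# operation counts, showing how a pass/fail ratio erases their differences.
test_cases = [
    {"name": "login_button_renders", "operations": 1,       "result": "pass"},
    {"name": "full_billing_cycle",   "operations": 250_000, "result": "pass"},
    {"name": "currency_rounding",    "operations": 12,      "result": "fail"},
]

passed = sum(1 for t in test_cases if t["result"] == "pass")
failed = sum(1 for t in test_cases if t["result"] == "fail")

# A one-click check and a quarter-million-operation scenario each count as "1".
# The ratio says nothing about coverage, risk, or the meaning of any result.
print(f"pass/fail: {passed}/{failed}")               # pass/fail: 2/1
print(f"pass rate: {passed / len(test_cases):.0%}")  # pass rate: 67%
```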

When the student has got an answer wrong, and the student is misinformed, there’s a problem. What does the number of questions that the teacher asked have to do with it? When a manager interviews a candidate for a job, and halfway through the interview he suddenly starts shouting obscenities at her, will the number of questions the manager asked have anything to do with her hiring decision? If the battery on the Tesla Roadster is ever completely drained, the car turns into a brick with a $40,000 bill attached to it. Does anyone, anywhere, care about the number of passing tests that were done on the car?

If we are asked to produce pass/fail ratios, I would argue that it’s our professional responsibility to politely refuse to do it, and to explain why: we should not be offering our clients the service of self-deception and illusion, nor should our clients accept those services. The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.

So what’s the alternative? We’ll get to that tomorrow.

24 replies to “Why Pass vs. Fail Rates Are Unethical (Test Reporting Part 1)”

  1. I look forward to reading tomorrow. I understand the problems with counting test cases, trying to equate them, etc. I prefer to consider which tests were able to run to completion, and assuming the found problems were converted into defect tickets (or improved tests), then I can look at the weight of the unfixed defects (as well as anecdotal consideration) in my analysis of “release-readiness”. I would consider the dead battery as a show stopper.

Just as people try to make a product better, I believe in trying to make the tests better. In other words, find the errors of my ways. As time goes by, I should convert the rumors and allegations into facts.

    But I can’t wait to see what you have. I am sure I’ll learn something good.

    Dave

I’m starting to get tired of reading more and more test articles that fall on the philosophical side (a rumour of success, an allegation of failure, rectification fallacy) just to hide our inability to provide a clear answer to our work colleagues (whether developers, project managers, or clients).

Michael replies: You seem to be suggesting “philosophy has no practical implications”. If that’s true, and if philosophy means “thinking about what we know and why we believe we know it,” let’s do a simple substitution on that suggestion: “Thinking about what we know and why we believe we know it has no practical implications.” Is that what you mean? And I’m not talking about this stuff “just to hide the problem”; I’m talking about it for exactly the opposite reason: to expose the problem and bring it into the light. (Oh, and it’s reification, for the record.)

    Meanwhile, it seems ridiculous to me to compensate for the problem of unskilled reporting by producing bogus numbers. Why not address the problem of unskilled reporting by developing skill?

    As engineers we need to provide clear facts and metrics and the “pass to fail ratio” is one which can show the current health state of a product. The higher the pass rate, the higher the confidence of the people working on the project that we are on the right track.

    Michael replies: Dude, did you even read the post above? Do you see the problems in the measurement? Do you understand that there’s a difference between warranted and unwarranted confidence? Testers aren’t in the “higher confidence” or the “right track” or the “reassurance” business; we’re in the business of observing and describing what’s actually there in the product. Confidence is orthogonal to that.

Do not mix in external noise. If the requirements are bad, then the problem is elsewhere, not in the “pass to fail ratio”. If the tester writes bad tests, then the problem is elsewhere, not in the “pass to fail ratio”, etc.

    There’s a simple way to improve the pass to fail ratio: run more passing tests, or stop running failing tests. If someone were to do that intentionally, should that build confidence for our clients? I’d answer No, and I bet you’d agree. Now: what if someone were to do that inadvertently, out of naivety or ignorance or incompetence, even though they have the best of intentions? Should that build confidence for our clients? I’d argue that if someone can do that intentionally, other people can do it inadvertently—and if a client is ignorant enough to buy into the idea of the pass/fail ratio, that same client will be ill-equipped to tell the difference between competence and incompetence.
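    As a hedged illustration of the reply above (the numbers are invented purely for the example, not drawn from any real project), here is how a pass rate “improves” when failing tests are quietly dropped or trivially passing ones are added, with no change to the product at all:

    ```python
    # Invented numbers, purely to illustrate how the ratio can be gamed
    # (intentionally or inadvertently) without the product changing at all.
    def pass_rate(passed: int, failed: int) -> float:
        return passed / (passed + failed)

    print(f"{pass_rate(80, 20):.0%}")  # 80% -- original run: 80 pass, 20 fail
    print(f"{pass_rate(80, 5):.0%}")   # 94% -- same product; 15 failing tests dropped from the run
    print(f"{pass_rate(95, 20):.0%}")  # 83% -- same product; 15 trivially passing tests added
    ```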

Michael, let me start with some critique: no measurement is guaranteed to be precise. We know that false negatives and false positives occur in testing, causing variation in any measurements we could make related to test results.

Michael replies: I’m not talking about precision here. I’m talking about validity. Precision is about the number of decimal places in your number. You can divide passing tests by failing tests and get a percentage to six decimal places if you like. The validity question is about whether your number measures what you think it measures.

In my practice, working with skilled and experienced testers, the tolerance level has never been so high as to make any measure either invalid or irrelevant.

    You’d have to tell me what you’re measuring and how you’re measuring it for me to evaluate this claim.

I could see how high tolerance when using unskilled/untrained testers could result in misleading measures. The only thing I see as unethical here is if you choose to hide information about your testers’ skills and the risk of high variation.

    I think you might want to consider some precision in your terminology here. Tolerance, risk of high variation, and precision refer to different things.

Nevertheless I’m happy you touched on the other part of the issue (in the second part of your post). It reflects our previous discussion with you on softwaretestingclub.com, where you suggested the trading post analogy to me. The problem as I see it is simple: if one feature has 10 test cases and another just one, it does not mean the first is 10 times as hard to implement or 10 times as valued by the customer. A test case could (especially if you weight them) be a unit of measuring testing; it is not a unit of measuring software. Looking forward to reading your alternative and comparing it to what I did a few years ago.

    Is a management case a unit of measuring management? Is a driving case a unit of measuring driving? Is a flying case a unit of measuring piloting an airplane? We don’t use these things for measuring those activities. Why not?

    Ainars

  4. I agree that it’s a false metric. Unfortunately, sometimes even explaining why it’s a false metric seems to fall on deaf ears – they want a number that is “easy”.

    Michael replies: Give ’em a really easy number, then. I remember Jerry Weinberg’s answer to a request for a really easy number once.

    “Three,” he said.

“Three?!” responded the person asking.

    “Yeah… three,” said Jerry. Then after a pause, “Why… were you expecting some other number?”

  5. Hi Michael! I can’t believe that the pass/fail ratio is considered a good measure… But I’ve never asked myself about this and I’m afraid that there are many “fake” measures like this.

  6. Hi Michael,

    While reading “Things That Make Us Smart” I encountered a quote that I think fits with this in a way. The author said “People value what they can measure (or represent)”

    I interpret this quote multiple ways, but ultimately I focus on the “can” part. By can I mean “Have the knowledge or skills to perform.”

    Most people have the knowledge and skills to count and divide so that’s what they do. They value it (even though they shouldn’t) and they don’t value things that they don’t know how to measure.

    By teaching people other ways to represent testing progress and software quality, they will hopefully stop valuing pass and fail rates and start valuing the testing story. You, James, Cem, and Jerry have all shown me alternative ways to represent these things and I’m a better tester for it.

    I look forward to your future posts showing us what some of the alternatives are to pass/fail rates.

Your opening paragraph of this post really woke me up! “Is Michael really saying that!?”

While pass/fail ratios can be horribly misleading, saying that they /are/ (implying an always) makes it too black-and-white for me!

    Here’s the way I see it: If you want to build a company, you need an idea. But you also need an accountant. The latter can’t run the company, but the company can’t run without him. Likewise, if you want to build a system you need an idea – and “counters”. E.g. passed/failed unit checks, so the /testers/ don’t need to spend all their time checking the obvious, but can do ‘intelligent’ work instead. E.g. testing the unit checks!

I love structure, measurement, quantification, ratios, numbers. I do so because these things simplify my understanding of what’s going on around me – so I can focus on the important stuff without losing sleep at night.

    I may be a bit pedantic here. That’s my personality – and cultural background, I suppose, but I was really put off by your opening paragraph, which I think is a pity, since the rest of your blog is to the point.

    Cheers,
    Anders

Hi, nice article, and timely for me as we are dealing with this at work at the moment.

A few years ago at uni we covered these measurement concepts, such as construct validity, so I am familiar with them. But I am so used to counting test cases since I left uni and started working that your ideas aren’t fully clicking for me. And if they aren’t clear in my head, I won’t feel confident discussing them with colleagues.

    Michael replies: It’s a good idea to seek clarity; to think it through before you talk it through, or to talk it through with trusted colleagues before a more risky audience.

    So I’ve been trying to think of an analogy, especially one that will be meaningful to colleagues with different backgrounds.

    How about: would you estimate the quality of a restaurant based on the ratio of all good reviews you could find, to all bad reviews you could find?

    I like it. Not only would I use it as an analogy, I might actually do it, and then reveal the content of the reviews to explore the difference in credibility they might reveal. You might also see what happens when you put all of the moderate reviews into one class or the other.

  9. To Adi:
“to hide our inability to provide a clear answer to our work colleagues (whether developers, project managers, or clients).”

Testers can rarely provide a so-called “clear answer” because of one simple issue: differing definitions of “clear answer”. As a tester you should know that it is impossible to test every possible variation. But managers, developers and customers often don’t know that, and on getting pass/fail results they tend to make false assumptions: for instance, that when everything has passed, the software has no bugs in it and works under any possible condition. But that’s not true. To give a clear answer, every possible condition affecting the software under test, directly or indirectly, would have to be pointed out: from software configuration, to the date and time of the test (a more frequent source of bugs than you’d think), to CPU temperature, to solar effects (like magnetic storms), to your own typing speed and mouse-click speed. Even after counting up all those conditions, a pass result only points out that the product passed at THAT time. You can’t guarantee that it works after 5 or 10 or 30 minutes. You could make an educated guess and assume it probably works if no variable that affects the system changes, but you can’t guarantee that, since some variables might affect it in ways you couldn’t perceive beforehand.

“The higher the pass rate, the higher the confidence of the people working on the project that we are on the right track.” So if we only run tests that are passing, then the project is on the right track? I’ve experienced that some developers and project managers actually prefer less pass/fail and more of a “story” format as test results. Pass/fail gives a binary answer to a question that can’t be answered in binary form without putting the whole responsibility on the tester. Also, the story format carries additional information that just can’t be presented in pass/fail mode.
For example: a tester has to test whether a new pop-up opens. While testing it, the tester notices a slightly annoying delay before the pop-up opens. Using the pass/fail method, the test case would be PASS, because the pop-up opened. However, forming it into a story indicates that there might be a problem with the pop-up that should be attended to.

To Michael: Very good article, and if you would allow, I would like to use it as a reference point while explaining issues with pass/fail.

  10. @Tom

    I’ve been thinking of other analogies too, and here are two that I think can have an impact:

    Imagine that your doctor said that she ran 30 tests on you and that 28 passed… is that good enough information?

    Imagine that I interviewed someone for you and the person “passed” 95 of the 100 questions I asked while another person “passed” 80 of the 100. Is it obvious who to hire?

  11. I had argued with my colleagues about this for ages. I gave up. Hopefully now that someone else with a blog and a reputation says it, it will have more convincing power.

  12. Hiya,

    Sorry if I’m coming at this late plus I haven’t read the second part of your article yet.

    A lack of control and the subsequent false metrics is something that I’ve struggled with for years in testing.

    Especially in terms of getting my stakeholders to understand why measuring the progress of the testing based on the number of tests completed is completely wrong.

    (I do wonder sometimes if these false metrics are a symptom of the fragmentation on standards for testing?)

    Michael replies: I wonder sometimes if these false metrics are a symptom of folklorization on standards for testing. The metrics, and others like them, appear in a lot of books and conference presentations and such. I wonder if people would come up with similar kinds of metrics if they were starting independently of those sources.

    So I can understand exactly where you’re coming from with this article.

However there is something I’d like your thoughts on… When I look at your first two paragraphs, you provide a well-defined set of criteria for why a test may appear to have passed or failed, but then you end each with a “rumour of pass” or an “allegation of failure”.

    I was thinking that actually you have quantified why something may have passed or failed quite strongly and precisely, therefore based on your criteria why can’t you/we quantify why a test has passed; And if we do then surely we have control and we can measure?

    Regards,
    Stev

Michael replies: I think you mean “qualify”, rather than “quantify” here. In order to identify why a given test had passed or failed, you’d need to know and to be able to describe many things. What’s the nature of the test? The nature of the failure? The significance of each one? The cause of the problem? Who would it affect? How would it affect them? How to deal with the failure—by modifying the test or the product or the test suite…?

    After you’ve sorted all that out, you could render it all down to a number (but more appropriately a set of numbers). Once you had the qualitative data in hand, would the quantitative data be helpful? Perhaps, but I wouldn’t know how to start sorting out and weighing the quantities in a useful way—and I’ve been a program manager. I found the story of each problem to be far more important than any aggregation of problems.

I agree with you about folklore being responsible for the misguided metrics we constantly use to understand and control testing.

I’m not too sure how we could identify a different set of metrics; most “managers” I’ve worked with are typically interested only in progress, budget and quality (however that is defined) – because that’s how they are measured, and on a wider scale that’s how a business is run, measured and controlled.

    So I’m not sure how I as a tester can break that cycle?

    Yes I did mean qualify – thanks for the correction.

    As part of that same paragraph you provide a list of questions that (in my opinion) could be aimed at particular stakeholders across the delivery lifecycle and as such are you not talking about (or hinting at) a Triage process?

With regard to your last paragraph: if you have the previously mentioned qualitative data (or is it really information, arrived at by answering the questions?), I agree that the individual story of each problem is more important than the sum of any group of numbers.

    As I write this I’m wondering if I should be looking more at the value that each measurement adds to the story of testing.

    Regards,

    Steve.

  14. […] What would a “pass” on this test reveal? If it is done by a different tester in a slightly different way eg inputting Box B before content in Box A, or tabbing instead of mouse clicks is that a different test? It could get a different result. Also, if it fails the first 2 times, works on the third, is broken by a regression issue, fixed and is currently working at release, that usually shows as Count of Test Case = 1, Pass = 1, 100% Pass – and tells us nothing about the potential risks of the function. […]

