Blog: Another Silly Quantitative Model

John D. Cook recently issued a blog post, How many errors are left to find?, in which he introduces yet another silly quantitative model for estimating the number of bugs left in a program.

The Lincoln Index, as Mr. Cook refers to it here, was used as a model for evaluating typographical errors, and was based on a method for estimating the population of a given species of animal. There are several terrible problems with this analysis.
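
For readers who have not met the estimator being criticized, here is a minimal sketch of the arithmetic. It assumes the standard capture-recapture form of the Lincoln Index: two independent reviewers each report a count, and the estimated total is the product of the two counts divided by the number of items both found. The function names and the sample counts are invented for illustration; none of this is taken from Mr. Cook's post.

```python
from typing import Optional


def lincoln_index(found_a: int, found_b: int, found_by_both: int) -> Optional[float]:
    """Estimate the total population as (found_a * found_b) / found_by_both.

    Returns None when the two reviewers share no findings, because the
    estimate would require dividing by zero.
    """
    if found_by_both == 0:
        return None
    return found_a * found_b / found_by_both


def estimated_remaining(found_a: int, found_b: int, found_by_both: int) -> Optional[float]:
    """Estimated total minus the number of distinct items actually found."""
    total = lincoln_index(found_a, found_b, found_by_both)
    if total is None:
        return None
    distinct_found = found_a + found_b - found_by_both
    return total - distinct_found
```

With hypothetical counts of 20 and 15 findings and 10 in common, the sketch estimates 30 in total and 5 remaining. Drop the overlap to 3 and the same two reviewers yield an estimate of 100 in total and 68 remaining, so the result swings widely on a single small number, and it is undefined when the overlap is zero.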

First, reification error. Bugs are relationships, not things in the world. A bug is a perception of a problem in the product; a problem is a difference between what is perceived and what is desired by some person. There are at least four ways to make a problem into a non-problem: 1) Change the perception. 2) Change the product. 3) Change the desire. 4) Ignore the person who perceives the problem. Any time a product owner can say, “That? Nah, that’s not a bug,” the basic unit of the system of measurement is invalidated.

Second, even if we suspended the reification problem, the model is inappropriate. Bugs cannot be usefully modelled as a single kind of problem or a single population. Typographical errors are not the only problems in writing; a perfectly spelled and syntactically correct piece of writing is not necessarily a good piece of writing. Nor are plaice the only species of fish in the fjords, nor are fish the only form of life in the sea, nor do we consider all life forms as equivalently meaningful, significant, benign, or threatening. Bugs have many different manifestations, from capability problems to reliability problems to compatibility problems to performance problems and so forth. Some of those problems don’t have anything to do with coding errors (which themselves could be like typos or grammatical errors or statements that can be interpreted ambiguously). Problems in the product may include misunderstood requirements, design problems, valid but misunderstood implementation of the design, and so forth. If you want to compare estimating bugs in a program to a population estimate, it would be more appropriate to compare it to estimating the number of all biological organisms in a given place. Imagine some of the problems in doing that, and you may get some insight into the problem of estimating bugs.

Third, there’s Dijkstra’s notion that testing can show the presence of problems, but not their absence. That’s a way of noting that testing is subject to the Halting Problem. Since you can’t tell if you’ve found the last problem in the product, you can’t estimate how many are left in it.

Fourth, the Ludic Fallacy (part one). Discovery and analysis of problems in a product is not a probabilistic game, but a non-linear, organic system of exploration, discovery, investigation, and learning. Problems are discovered at neither a steady nor a random rate. Indeed, discoveries often happen in clusters as the tester learns about the program and things that might threaten its value. The Lincoln Index, focused on typos—a highly decidable and easily understood problem that could largely be accomplished by checking—doesn’t fit for software testing.

Fifth, the Ludic Fallacy (part two). Mr. Cook’s analysis implies that all problems are of equal value. Those of us who have done testing and studied it for a long time know that, from one time to another, some testers find a bunch of problems, and others find relatively few. Yet those few problems might be of critical significance, and the many of lesser significance. It’s an error to think in terms of a probabilistic model without thinking in terms of the payoff. Related to that is the idea that the number of bugs remaining in the product may not be that big a deal. All the little problems might pale in significance next to the one terrible problem; the one terrible problem might be easily fixable while the little problems grind down the users’ will to live.

Sixth, measurement-induced distortion. Whenever you measure a self-aware system, you are likely to introduce distortion (at best) and dysfunction (at worst), as the system adapts itself to optimize the thing that’s being measured. Count bugs, and your testers will report more bugs—but finding more bugs can get in the way of finding more important bugs. That’s at least in part because of…

Seventh, the Lumping Problem (or more formally, Assimilation Bias). Testing is not a single activity; it’s a collection of activities that includes (at least) setup, investigation and reporting, and design and execution. Setup and investigation and reporting take time away from test coverage. When a tester finds a problem, she investigates and reports it. That time is time that she can’t spend finding other problems. The irony here is that the more problems you find, the fewer problems you have time to find. The quality of testing work also involves the quality of the report. Reporting time, since it isn’t taken into account in the model, will distort the perception of the number of bugs remaining.

Eighth, estimating the number of problems remaining in the product takes time away from sensible, productive activities. Considering that the number of problems remaining is subjective, open-ended, and unprovable, one might be inclined to think that counting how many problems are left is a waste of time better spent on searching for other bad ones.

I don’t think I’ve found the last remaining problem with this model.

But it does remind me that when people see bugs as units and testing as piecework, rather than the complex, non-linear, cognitive process that it is, they start inventing all these weird, baseless, silly quantitative models that are at best unhelpful and that, more likely, threaten the quality of testing on the project.

8 Responses to “Another Silly Quantitative Model”

  1. John Cook says:

    Thanks for a detailed response to my blog post. I presented the Lincoln Index as a simple and interesting method and discussed reasons it might not apply well. However, I do believe it could be useful in some contexts. For example, proofreading is a kind of testing, and I imagine the method could be practical in that context. It could also be useful in testing hardware. But I agree that most software testing is far more complex than proofreading or testing hardware components.

    Yes. One reason for that is that, for software products, the problem domain tends to be large. The inputs tend to be highly variable and to a great degree uncontrollable, and (therefore) the outcomes are far less deterministic. More importantly, however, even proofreading is far more than a search for typographical errors. Software testing is far more than a search for coding errors, or even a search for bugs.

    Regarding Dijkstra’s objection, I believe his point is logically correct but of limited practical value. It would be nice to know with certainty that a program is error free, but that’s not realistic for large programs. In practice, we can only have an idea (formal or informal) of the probability of an error. Testing will not prove the absence of bugs, but it can increase your confidence that the probability of a user running into a bug is small.

    In some sense, it doesn’t matter whether a program has bugs; it only matters whether a user encounters a bug. (And to your fifth point, it matters how consequential the bug is.) If buggy code is unreachable code, it doesn’t matter. If buggy code is very unlikely to be reached and will have minor consequences if it is reached, it doesn’t matter.

    Yes; that’s crucial. Every time I run into one of these “simple” measures, they’re all about counting and not about understanding. Yes, understanding might be more difficult or more time-consuming than counting. And software development might be more difficult or more time-consuming than typing.

    It may be impossible to meaningfully estimate the probability that a program is without error, but it is possible to estimate the probability of a user running into an error. You could measure, for example, mean time between failures for people using the software under certain circumstances.

    Funny you should mention it. Someone’s attempt to use Mean Time Between Failures as a measure of software quality was the first time that I had a visceral reaction to bogus software metrics. A few years later, I was delighted to find that Kaner and Bond effectively trashed a similar measure, Mean Time To Failure, in this paper.

    I appreciate your gracious response, John. I hope my reply was helpful, and I encourage you to continue the conversation as we consider and study the problems of testing.

  2. An excellent ripping apart of a Platonic Fallacy! Bravo!

    It’s not surprising that the model was proposed by a mathematician.

    Forgive the quants, for they know not what they do. A little more surprisingly, perhaps, they appear not to know how many times they do it.

  3. Veretax says:

    Michael,

    Great response! I totally agree with what you have said. What’s more, having read the other blog entry, I find a few other things lacking. Namely recurrence of bugs. In my experience some bugs that you find may appear to exist only in one context or scope, but in reality the real flaw is deeper in the layers of logic beyond that particular screen.

    A good example would be a typical web page. Let’s say that when an item X is entered into the form and submitted, that it is added to some generated number and should equal some quantity which the user wants to know. The tester discovers that the math in this page is wrong, and with a bit of investigation points to an error in how the page brings in the data say from the database.

    The developer naturally will look and find a way to fix this bug, but what if he fixes it in the wrong place, say on the page itself rather than back in the data access code? If the same flawed code is called from other pages, the flaw could crop up as many times as there are pages that call it.

    It then is actually the same bug whether caught by the same tester or a different tester even though it may appear on multiple screens.

    Another thought: what happens when the fix for one bug report introduces new bugs? Bug A is fixed and as a consequence Bugs X and Y are caused. Then Bugs X and Y might also get fixed, but they cause bugs J and K and S and T. This is a place where I’d question how you’d go about counting those bugs, because it is possible that they happened because Bug A’s fix was a lazy fix that was not well conceived and thus resulted in more bugs showing up elsewhere in the system.

    Of course each time you have that kind of situation crop up you have the compounding of Developer and Tester time to find, check, and validate the fix. Now you might hope that bugs X and Y, J and K, and S and T would be found in the course of that validation, but it is also possible that, due to some subtlety in the code, those bugs won’t appear in that particular checking and validation, but may appear later down the road, giving the appearance that they are in fact new bugs, when they are really the consequence of a bad bug fix made earlier.

    A great article, Michael, and thanks for the links on the Halting Problem and the Ludic Fallacy. I have seen them discussed but never realized they had a name before ;)

  4. Astrobe says:

    It seems that you use a different terminology, where what is commonly named “testing” is named “checking”, while “testing” refers to an activity with a larger scope than ‘our’ testing.

    Yup.

    I think you sometimes confuse the terminologies in your arguments. Obviously, Cook was talking about “checking”.

    It’s not at all obvious to me that John (Cook) was talking about checking. It wasn’t clear to me what kind of testing he was talking about. No matter; the model doesn’t work for either kind of testing.

    (BTW, your terminology has some inconsistencies: “unit testing” should be “unit checking”).

    I agree. Yet when I do that, I run the risk that people won’t know what I’m talking about, so I cut them some slack. I cover that issue here and here, by the way. There’s a lot more to the Testing vs. Checking discussion if you’re willing to surf the blog.

    In “Testing vs. Checking”, you write:
    “Testing is, in part, the process of finding out whether our checks have been good enough. When we find a problem through testing, one reasonable response is to write one or more checks to make sure that that particular problem doesn’t crop up again.”

    When you conduct tests on some new part of your application and find N serious bugs (the kind of bug that would definitely prevent you from shipping the product) that were not caught by the checks, don’t you think that the proposed estimator could give at least something like a degree of confidence in the quality of the checks, the quality of the new code, or even both?

    Sure. Yet there’s a difference between a qualitative assessment of the code and an unsupportable quantitative projection like “the number of remaining bugs”. By the way, your sentence still works fine if you leave out the N.

  5. Simon Morley says:

    Relevance… hmmm.

    I’m currently reading Fashionable Nonsense – about using terminology from one area of study and applying it to something else without giving the reasoning or explanation. I didn’t see the reasoning behind the applicability of the formula in the article. (You’ve highlighted problems with using such a formula and drawing conclusions from its use.)

    I’m skeptical that I’d like Sokal, although I really appreciate the prank that he pulled. Trouble is, some real papers on software metrics have the flavour of his prank—yet the authors of those are considered the traditionalists, where I think they’d label us the post-modernists. It’s a funny world.

    If I take my observed problem count with the article and multiply by yours, divide by the ones in common – it will give the remaining errors in the article. Right? Oh, but I reckon that my errors do not overlap with yours, ie divide by 0. Ooops…

    On the subject of typos I’m reminded of the Wall Street Journal (from Why We Make Mistakes):

    “Some jesters in a British competition described in a page-one article last Monday ride on unicycles. The article incorrectly said they ride on unicorns.” Relevant?

    Perhaps. (I’ve just finished that book myself, in fact. It was okay; I wasn’t thrilled with it, but I’ve read a lot on the subject so I might be jaded.) The sentence was syntactically correct without being semantically correct. So, from the typographical error perspective, not a recognizable problem. From the copy editing perspective, a problem. This suggests that the count of problems will be profoundly affected by your testing mission and your perception of the mission—and that the count doesn’t tell you much that’s useful.

    Top marks for including a reference to unicorns, though. Always good to see one of those.

  6. Ben Klaasen says:

    Michael, you said “I’m skeptical that I’d like Sokal” – it’s my guess that you would enjoy at least the introduction to “Intellectual Impostures”. Sokal uses the same relentless and rigorous logical approach as you do.

    I don’t know enough about Sokal to determine whether that’s a compliment. [grin] But I’ll take it as one; thank you.

    Excellent piece, thank you for that.

    And thank you for that.

  7. Greg B says:

    In my current project, the product owner has assumed the risk of any financial losses stemming from bugs in our software. He wants to release the product to customers, but he is of course nervous. How do you propose he should best go about deciding when to release? How should he reason about the risks, short of using a quantitative model?

    This question seemed important enough to me that I wrote a whole new blog post about it. Thanks for asking!

  8. Chris Wallace says:

    Chris—delightful to hear from you!

    There will always be a tendency to try and simplify testing. A mathematical model that one can plug a few variables into and receive a concrete answer of how long testing is going to take has been the questing beast of management for probably as long as software has existed. A ‘complex’ quantitative model has the added bonus of looking really impressive because it’s math and there are equations and stuff! You can blind them with science.

    Not me! Misleading people is not a service that I offer.

    How many of us in testing have been asked, “How long is testing going to take?” or “How will you know when you’re done?” There is a very strong drive both externally (from project management, customers, marketing, etc.) and internally (I need to give them an answer, however imperfect, so I can keep my job) to come up with an answer to those questions that is based on a model you can throw into a PowerPoint demonstration . . . preferably with crawling ants because everything looks better with crawling ants. I would hope that most of us know better than to think of bugs as units and testing as piecework; however, it seems as if very few people outside of testing understand that as well (or are willing to admit it). Complex, non-linear, cognitive processes are messy and screw with schedules. There will always be a desire to come up with something simpler and more predictable, no matter how irrelevant and inaccurate the end result may be. It’s human nature (something that most quantitative models fail to take into consideration, ironically enough).

    It’s not wrong for people to want that. I’d like a pony. (Actually, that’s not really true, but my daughter would like a pony.) But as much as they want it, they can’t reasonably expect to get it, and in many cases the attempt to get it will bring on disaster. The desire for stable, predictable delivery of fish, for example, led to attempts at “scientific management” of the Newfoundland cod fishery. Yet natural systems incorporate a lot of variation and are more complex than we understand, and perhaps a lot more complex than we can understand. The fishery closed in 1992, and hasn’t re-opened since. That’s a story that you can hear here.

    I’m not saying that it’s wrong to manage software projects, either. I think, though, that it’s wrong to expect too much from them. What does “too much” mean? I’d include highly linear progress, interchangeability of people, and a simplistic, strictly quantified, brain-off approach to measurement.
