Comments on: Another Silly Quantitative Model

By: Chris Wallace

Chris Wallace — Wed, 21 Jul 2010 16:19:44 +0000

Chris, delightful to hear from you!

There will always be a tendency to try and simplify testing. A mathematical model that one can plug a few variables into and receive a concrete answer of how long testing is going to take has been the questing beast of management for probably as long as software has existed. A ‘complex’ quantitative model has the added bonus of looking really impressive because it's math and there are equations and stuff! You can blind them with science.

Not me! Misleading people is not a service that I offer.

How many of us in testing have been asked, "How long is testing going to take?" or "How will you know when you're done?" There is a very strong drive both externally (from project management, customers, marketing, etc.) and internally (I need to give them an answer, however imperfect, so I can keep my job) to come up with an answer to those questions that is based on a model you can throw into a PowerPoint demonstration . . . preferably with crawling ants because everything looks better with crawling ants. I would hope that most of us know better than to think of bugs as units and testing as piecework; however, it seems as if very few others outside of testing understand that as well (or are willing to admit it). Complex, non-linear, cognitive processes are messy and screw with schedules. There will always be a desire to come up with something simpler and more predictable, no matter how irrelevant and inaccurate the end result may be. It's human nature (something that most quantitative models fail to take into consideration, ironically enough).

It's not wrong for people to want that. I'd like a pony. (Actually, that's not really true, but my daughter would like a pony.) But as much as they want it, they can't reasonably expect to get it, and in many cases the attempt to get it will bring on disaster. The desire for stable, predictable delivery of fish, for example, led to attempts at "scientific management" of the Newfoundland cod fishery. Yet natural systems incorporate a lot of variation and are more complex than we understand, perhaps a lot more complex than we can understand. The fishery closed in 1992, and hasn't re-opened since. That's a story that you can hear here. I'm not saying that it's wrong to manage software projects, either. I think, though, that it's wrong to expect too much from them. What does "too much" mean? I'd include highly linear progress, interchangeability of people, and a simplistic, strictly quantified, brain-off approach to measurement.

By: Greg B

Greg B — Tue, 20 Jul 2010 01:40:41 +0000

In my current project, the product owner has assumed the risk of any financial losses stemming from bugs in our software. He wants to release the product to customers, but he is of course nervous. How do you propose he should best go about deciding when to release? How should he reason about the risks, short of using a quantitative model?

This question seemed important enough to me that I wrote a whole new blog post about it. Thanks for asking!

By: Ben Klaasen

Ben Klaasen — Thu, 15 Jul 2010 08:35:10 +0000

Michael, you said "I’m skeptical that I’d like Sokal" - it's my guess that you would enjoy at least the introduction to "Intellectual Impostures". Sokal uses the same relentless and rigorous logical approach as you do.

I don't know enough about Sokal to determine whether that's a compliment. [grin] But I'll take it as one; thank you.

Excellent piece, thank you for that.

And thank you for that.

By: Simon Morley

Simon Morley — Thu, 15 Jul 2010 02:34:59 +0000

Relevance... hmmm. I'm currently reading Fashionable Nonsense - about using terminology from one area of study and applying it to something else without giving the reasoning or explanation. I didn't see the reasoning behind the applicability of the formula in the article. (You've highlighted problems with using such a formula and drawing conclusions from its use.)

I'm skeptical that I'd like Sokal, although I really appreciate the prank that he pulled. Trouble is, some real papers on software metrics have the flavour of his prank, yet the authors of those are considered the traditionalists, where I think they'd label us the post-modernists. It's a funny world.

If I take my observed problem count with the article and multiply by yours, divide by the ones in common - it will give the remaining errors in the article. Right? Oh, but I reckon that my errors do not overlap with yours, i.e., divide by 0. Ooops... On the subject of typos I'm reminded of the Wall Street Journal (from Why We Make Mistakes): "Some jesters in a British competition described in a page-one article last Monday ride on unicycles. The article incorrectly said they ride on unicorns." Relevant? Perhaps.

(I've just finished that book myself, in fact. It was okay; I wasn't thrilled with it, but I've read a lot on the subject so I might be jaded.) The sentence was syntactically correct without being semantically correct. So, from the typographical error perspective, not a recognizable problem. From the copy editing perspective, a problem. This suggests that the count of problems will be profoundly affected by your testing mission and your perception of the mission, and that the count doesn't tell you much that's useful. Top marks for including a reference to unicorns, though. Always good to see one of those.
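For readers who want to see the arithmetic being joked about in this comment, here is a minimal sketch of the Lincoln Index calculation, assuming two reviewers who each report a count of problems plus a count of problems found by both. The function names and the sample numbers are hypothetical and are not taken from the article. As usually stated, the index estimates the total number of problems, so the "remaining" figure below is that total minus the distinct problems already found. The zero-overlap case is exactly the divide-by-zero mentioned in the comment, and the sketch returns no estimate for it.

```python
from typing import Optional


def lincoln_index(found_by_a: int, found_by_b: int, found_by_both: int) -> Optional[float]:
    """Estimate the total number of problems as (A * B) / overlap.

    Returns None when the two reviewers share no findings, because the
    estimate is then undefined (division by zero).
    """
    if found_by_both == 0:
        return None
    return found_by_a * found_by_b / found_by_both


def estimated_remaining(found_by_a: int, found_by_b: int, found_by_both: int) -> Optional[float]:
    """Estimated total minus the distinct problems already found."""
    total = lincoln_index(found_by_a, found_by_b, found_by_both)
    if total is None:
        return None
    distinct_found = found_by_a + found_by_b - found_by_both
    return total - distinct_found


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(lincoln_index(10, 8, 4))        # estimated total
    print(estimated_remaining(10, 8, 4))  # estimated still unfound
    print(lincoln_index(3, 5, 0))         # no overlap: no estimate
```

The sketch only does the counting. It says nothing about whether the counted items are comparable, independent, or equally consequential, which is the objection raised throughout this thread.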

By: Astrobe

Astrobe — Wed, 14 Jul 2010 19:25:03 +0000

It seems that you use a different terminology, where what is commonly named "testing" is named "checking", while "testing" refers to an activity with a larger scope than 'our' testing.

Yup.

I think you sometimes confuse the terminologies in your arguments. Obviously, Cook was talking about "checking".

It's not at all obvious to me that John (Cook) was talking about checking. It wasn't clear to me what kind of testing he was talking about. No matter; the model doesn't work for either kind of testing.

(BTW, your terminology has some inconsistencies: "unit testing" should be "unit checking").

I agree. Yet when I do that, I run the risk that people won't know what I'm talking about, so I cut them some slack. I cover that issue here and here, by the way. There's a lot more to the Testing vs. Checking discussion if you're willing to surf the blog.

In "Testing vs. Checking", you write: "Testing is, in part, the process of finding out whether our checks have been good enough. When we find a problem through testing, one reasonable response is to write one or more checks to make sure that that particular problem doesn’t crop up again." When you conduct tests on some new part of your application, and you find N serious bugs (the kind of bug that would definitely prevent the product from shipping) that were not caught by the checks, don't you think that the proposed estimator could give at least something like a degree of confidence in the quality of the checks or the quality of the new code, or even both?

Sure. Yet there's a difference between a qualitative assessment of the code and an unsupportable quantitative projection like "the number of remaining bugs". By the way, your sentence still works fine if you leave out the N.

By: Veretax

Veretax — Wed, 14 Jul 2010 18:49:16 +0000

Michael,

Great response! I totally agree with what you have said. What’s more, having read the other blog entry, I find a few other things lacking, namely the recurrence of bugs. In my experience some bugs that you find may appear to exist only in one context or scope, but in reality the real flaw is deeper in the layers of logic beyond that particular screen.

A good example would be a typical web page. Let’s say that when an item X is entered into the form and submitted, it is added to some generated number and should equal some quantity which the user wants to know. The tester discovers that the math in this page is wrong, and with a bit of investigation points to an error in how the page brings in the data, say, from the database.

The developer naturally will look and find a way to fix this bug, but what if he fixes it in the wrong place, say on the page itself rather than back in the data access code? The same flaw, if the faulty code is called from other pages, could crop up as many times as there are other pages to report it.

It then is actually the same bug whether caught by the same tester or a different tester even though it may appear on multiple screens.

Another thought: what happens when one bug fix results in new bugs being introduced? Bug A is fixed and as a consequence Bugs X and Y are caused. Then Bugs X and Y might also get fixed, but they cause Bugs J and K and S and T. This is a place where I’d question how you’d go about counting those bugs, because it is possible that they happened because Bug A’s fix was a lazy fix that was not well conceived and thus resulted in more bugs showing up elsewhere in the system.

Of course, each time that kind of situation crops up, you compound the Developer and Tester time needed to find, check, and validate the fix. Now you might hope that Bugs X and Y, J and K, and S and T would be found in the course of that validation, but it is also possible that, due to some subtlety in the code, those bugs won’t appear in that particular checking and validation. They may appear later down the road, giving the appearance that they are new bugs, when they are really the consequence of a bad bug fix made earlier.

A great article Michael, and thanks for the links on Halting Process and Ludic Fallacy, I have seen them discussed but never realized they had a name before 😉

By: Abraham Heward

Abraham Heward — Wed, 14 Jul 2010 17:10:18 +0000

An excellent ripping apart of a Platonic Fallacy! Bravo! It's not surprising that the model was proposed by a mathematician. Forgive the quants, for they know not what they do. A little more surprisingly, perhaps, they appear not to know how many times they do it.

By: John Cook

John Cook — Wed, 14 Jul 2010 16:57:52 +0000

Thanks for a detailed response to my blog post. I presented the Lincoln Index as a simple and interesting method and discussed reasons it might not apply well. However, I do believe it could be useful in some contexts. For example, proofreading is a kind of testing, and I imagine the method could be practical in that context. It could also be useful in testing hardware. But I agree that most software testing is far more complex than proofreading or testing hardware components.

Yes. One reason for that is that, for software products, the problem domain tends to be large. The inputs tend to be highly variable and to a great degree uncontrollable, and (therefore) the outcomes are far less deterministic. More importantly, however, even proofreading is far more than a search for typographical errors. Software testing is far more than a search for coding errors, or even a search for bugs.

Regarding Dijkstra's objection, I believe his point is logically correct but of limited practical value. It would be nice to know with certainty that a program is error free, but that's not realistic for large programs. In practice, we can only have an idea (formal or informal) of the probability of an error. Testing will not prove the absence of bugs, but it can increase your confidence that the probability of a user running into a bug is small. In some sense, it doesn't matter whether a program has bugs; it only matters whether a user encounters a bug. (And to your fifth point, it matters how consequential the bug is.) If buggy code is unreachable code, it doesn't matter. If buggy code is very unlikely to be reached and will have minor consequences if it is reached, it doesn't matter.

Yes; that's crucial. Every time I run into one of these "simple" measures, they're all about counting and not about understanding. Yes, understanding might be more difficult or more time-consuming than counting. And software development might be more difficult or more time-consuming than typing.

It may be impossible to meaningfully estimate the probability that a program is without error, but it is possible to estimate the probability of a user running into an error. You could measure, for example, mean time between failures for people using the software under certain circumstances.

Funny you should mention it. Someone's attempt to use Mean Time Between Failures as a measure of software quality was the first time that I had a visceral reaction to bogus software metrics. A few years later, I was delighted to find that Kaner and Bond effectively trashed a similar measure, Mean Time To Failure, in this paper. I appreciate your gracious response, John. I hope my reply was helpful, and I encourage you to continue the conversation as we consider and study the problems of testing.