Archive for the ‘Bugs’ Category

When A Bug Isn’t Really Fixed

Tuesday, January 11th, 2011

On Monday, January 10, Ajay Balamurugadas tweeted, “When programmer has fixed a problem, he marks the prob as fixed. Programmer is often wrong. – #Testing computer software book Me: why?”

I intended to challenge Ajay, but I made a mistake, and sent the message out to a general audience:

“Challenge for you: think of at least ten reasons why the programmer might be wrong in marking a problem fixed. I’ll play too. #testing”

True to my word, I immediately started writing this list:

1. The problem is fixed on a narrow set of platforms, but not on all platforms. (Incomplete fix.)

2. In fixing the problem, the programmer introduced a new problem. (In this case, one could argue that the original problem has been fixed, but others could argue that there’s a higher-order problem here that encompasses both the first problem and the second.) (Unwanted side effects)

3. The tester might have provided an initial problem report that was insufficient to inform a good fix. Retesting reveals that the programmer, through no fault of his own, fixed the special case, rather than the general case. (Poor understanding, tester-inflicted in this case)

4. The problem is intermittent, and the programmer’s tests don’t reveal the problem every time. (Intermittent problem)

5. The programmer might have “fixed” the problem without testing the fix at all. Yes, this isn’t supposed to happen. Stuff happens sometimes. (Programmer oversight)

6. The programmer might be indifferent or malicious. (Avoidance, Indifference, or Pathological Problem)

7. The programmer might have marked the wrong bug fixed in the tracking system. (Tracking system problem)

8. One programmer might have fixed the problem, but another programmer (or even the same one) might have broken or backed out the fix since the problem was marked “fixed”. Version control goes a long way towards lessening this problem.  Everyone makes mistakes, and even people of good will can thwart version control sometimes.  (Version control problem)

9. The problem might be fixed on the programmer’s system, but not in any other environments. Inconsistent or missing .DLLs can be at the root of this problem. (Programmer environment not matched to test or production)

10. The programmer hasn’t seen the problem for a long time, and guesses hopefully that he might have fixed it. (Avoidance or indifference)

I had promised myself that I wouldn’t look at Twitter as I prepared the list. By the time I finished, though, the iPhone on my hip was buzzing with Twitter notifications to the point where I was getting a deep-tissue massage.

When I got up this morning, there were still more replies. Peter Haworth-Langford wanted “Do we need to focus on what’s wrong? Solutions? Are we missing something else by focusing on what’s wrong?” The last response on the thread, so far as I know, was Darren McMillan asking, rather like Peter, “I’d like to know which role you are playing when you say I’ll play too? Customer/PM….. Is there more to the challenge?” Nope. And thank goodness for that; I was swamped with replies. I decided to gather them and try grouping them to see if there were patterns that emerged.

Very few problems indeed have any single, unique, and exclusive cause. Moreover, this list is based on Joel’s Law of Leaky Abstractions: “All non-trivial abstractions are to some degree leaky.” Some of the problems listed below may fit into more than one category, and, for you, may fit better into a category to which I’ve arbitrarily assigned them. Cool; we think differently. Let me know how, and why.

Note also that we’re looking at this without prejudice. The object of this game is not to question anyone’s skill or integrity, but rather to try to imagine what could happen if all the stars are misaligned.  We’re not saying that these things always happen, or even that they frequently happen. Indeed, for a couple, of items, they might never have happened. In a brainstorm like this, even wildly improbable ideas are welcome, because they might trigger us to think of a more plausible risk.

Erroneous Fix

Just as people might fail to implement something the first time, they might fail when they try to fix the error, even though their acting in good faith with all of the skill they’ve got.

Lynn McKee: He is a human being and simply made an error in fixing it.
Darren McMillan: The developer didn’t have the skills to apply fix correctly, resulting code caused later regression issues.

That kind of problem gets compounded when we add one or more of the other problems listed below.

Incomplete Fix

Somtimes fixes are incomplete. Sometimes the special case has been fixed, but not the general case. Sometimes that’s because of a poor understanding of the scope of the problem; problems are usually multi-dimensional.  (We’ll get to the root of the misunderstanding later.)

Ben Kelly: ‘Fix’ hid erroneous behavior but did not resolve the underlying problem
Ben Simo: The problem existed in multiple places and requires additional fixes.
Lanette Creamer: Fix isn’t accesible
Lanette Creamer: Fix is not localized.
Lanette Creamer: Fix isn’t triggered in some paths.
Lanette Creamer: Fix doesn’t integrate w other code
Stephen Hill: The programmer might have fixed that symptom of the bug but not dealt with the root cause.
Stephen Hill: Has the fix been applied only to new installs or can it retrospectively fix pre-existing installs too?

Sometimes people will fix the letter of the problem without doing all of the related work.

Ben Kelly: Bug fix did not have accompanying automation checks added (in a culture where this is the norm)

It’s possible for people comply maliciously to the letter of the spec, fixing a coding problem while ignoring an underlying design problem.

Erkan Yilmaz: dev knows since decision abt design it’s bad but against his belief fixs also bug(s). He cant look honestly in mirror anymore

Unwanted Side Effects

Sometimes a good-faith attempt to fix the problem introduces a new problem, or helps to expose an old problem. Much of the time such problems could easily intersect with “Incomplete Fix” above or “Poor Understanding” below.

Ben Simo: The fix created a new problem worse than the solution.
Lanette Creamer: fix breaks some browsers/platforms
Lanette Creamer: fix has memory leaks
Lanette Creamer: fix breaks laws/reqirements
Lanette Creamer: fix slows performance to a crawl.
Michel Kraaij: The bug was fixed, but “spaghetti code” increased.
Michel Kraaij: The dev did fix the bug, but introduced a dozen other bugs. (this issue fixed or not?
Michel Kraaij: The fix increases the user’s manual process to an unacceptable level.
Nancy Kelln: Was never actually a problem. By applying a ‘fix’ they now broke it.
Pete Walen: Mis-read problem description, “fixed” something that was previously working.

Intermittent Problem

In my early days as telephone support person at Quarterdeck, customers used to ask me why we hadn’t fixed a particular problem. I observed that problems that happened in every case, on every platform, tended to get a lot of attention. Sometimes a fix will appear to solve a problem whose symptom is intermittent. The fix might apply to first iteration through a loop, but not subsequent iterations; or for all instances except the intended maximum. Problems may reproduce easily with certain data, and not so often or not at all with other data. Timing, network traffic, available resources can conspire to make a problem intermittent.

Michel Kraaij: The dev happened to use test data which didn’t make the bug occur.
Michel Kraaij: The dev followed “EVERY described step to reproduce the bug” and now the bug didn’t occur anymore.
Pete Walen: Fixed sql query with commands that don’t work on that DB… not that that ever happened to anything I tested…
Pradeep Soundararajan: might have thought the bug to be fixed by tryin to repeat test mentioned in the report although there are other ways to repro

Environment Issues

We’ve all heard (or said) “Well, it works on my machine.” The programmer’s environment many not match the test environments.

Adam Yuret: The fix only works on the Dev’s non-production analogous workstation/environment.
Michel Kraaij: The fix is based on module, which has became obsolete earlier, but wasn’t removed from the dev’s env.
Pradeep Soundararajan: Works on his machine
Stephen Hill: Might be fixed in dev’s environment where he has all the DLLs already in place but not on a clean m/c.

“Works on my machine” is a special case of a more general problem: the programmer’s environment might not be representative of the test environment, but the test environment might not be representative of the production environment, either. There might not be a test environment. Patches, different browser versions, different OS versions, different libraries, different mobile platforms… all those differences can make it appear that a problem has been fixed.

Darren McMillan: Fix on production code blocking customer upgrades
Lynn McKee: No two tests are ever exactly the same, so even tho code change was made something is diff in testers “environment”.
Michel Kraaij: The fix demands a very expensive hardware upgrade for the production environment.
Pete Walen: Fixed code for 1 DB, not for the other 3, and not the one that was used in testing.

Poor Understanding or Explanation

Arguably all problems include an element of this one. Sometimes there’s poor communication between the programmer and the tester, due to either or both. The tester may not have described or explained the problem well, and the programmer provided a perfect fix to the problem as (poorly) described. Sometimes the programmer doesn’t understand the problem or the implications of the fix, and provides an imperfect fix to a well-described problem. Sometimes a report might seem to refer to the same problem as another, when the report really refers to different problem. These problems can be aided or exacerbated by the medium of communication: the bug tracking system, linguistic or cultural differences, written instead of face-to-face communication (or the converse).

Ben Kelly: Programmer can’t reproduce the problem – tester didn’t provide sufficient info.

Ben Simo: The problem wasn’t understood well enough to be satisfactorily fixed.
Michel Kraaij: The dev asked whether “the problem was solved this way?” He got back a “yes”. He just happened to ask the wrong stakeholder. Nancy Kelln: Wasn’t clear what the problem was and they fixed something else.
Pradeep Soundararajan: Assumes it to be a duplicate of a bug he recently fixed.
Zeger Van Hese: Developer solved the wrong problem (talking to the tester would have helped).
Ben Simo: The tester was wrong and gave the programmer information leading to breaking, not fixing, the software.
Michel Kraaij: The tester classifies the bug as “incorrectly fixed”. However, it’s the tester who’s wrong. Bug IS fixed.

Zeger Van Hese: The dev doesn’t see a problem and marks it fixed (as in: functions as designed)

Another variation on poor understanding is that the “problem” might not be a problem at all.

Darren McMillan: It wasn’t a problem after all. Customer actually considered the problem a feature. On hearing about the fix customer cried.
Darren McMillan: Problem came from a tester with a tendency to create his own problems. Wasn’t actually a problem worth fixing.

Finally in this category, a “problem” can be defined as “a difference between things as perceived and things as desired” (that’s from Exploring Requirements Weinberg and Gause). To that, I would add the refinement suggested by the Relative Rule “…to some person, at some time.” A bug is not a thing in the program; it’s a relationship between the product and some person. One way to address a problem is to solve it technically, of course. But there other ways to address the problem: change the perception (without changing the fact of the matter); change the desire; decide to ignore the person with the problem; or wait, such that perhaps the problem, the perception, the desire, or the person are no longer relevant.

Darren McMillan: Fix was a customer support call. Fix satisfied customer, didn’t fit product needs.
Pradeep Soundararajan: Because the bug is the perception not the code

Insufficient Programmer Testing

Partial problem fixes, intermittent problems, and poor understanding tend not to thrive in the face of programmers who think and act critically. Inadequate programmer testing is never the sole source of a problem, but it can contribute to a problem being marked “fixed when it really isn’t.

Ben Simo: The fix wasn’t sufficiently tested.
Darren McMillan: Obvious: wasn’t properly tested, didn’t consult required parties for fix,
Lynn McKee: …and therefore must not have done any or /any/ effective testing on his end first.

Version Control Problems

Version control software was relatively new to most on the PC platform in the middle 1990s. These days, it’s implemented far more commonly, which is almost entirely to the good. Yet no tool can guarantee perfect execution for either individuals or teams, and accidents still happen. Alas, version control can’t guarantee that the customer has the same configuration on which you’re developing or testing—or that he had yesterday, or that he’ll have tomorrow.

George Dinwiddie: Forgot to check in some of the new code.
Nancy Kelln: Bad code merge overwrote the changes. Bug fix got lost.
Pete Walen: Fixed code, did not check it in, fixed it again, check in BOTH (contradictory fixes)
Pete Walen: Fixed code, forgot to check it in. Twice. #hypotheticallyofcourse
Pradeep Soundararajan: Has a fix and marks fix before he commits the code and checks in the wrong one.
Stephen Hill: Programmer might not be using the same code build as the customer so does not get the bug.

Tracking Tool Problems

Sometime our tools introduce problems of their own.

Ben Kelly: Programmer fat-fingers the form response to a bug fix.
Nancy Kelln: Marked the wrong bug as fixed in the tracking tool.
Pradeep Soundararajan: The bug tracking system could have a bug that occasionally registers Fixed for any other option.
Pradeep Soundararajan: He might have overlooked a bug id and marked it fixed while it was the other
Zeger Van Hese: The dev wanted to re-assign, but marked it fixed instead (context: he hates that defect tracking system they made him use)

Process or Responsibility Issues

Perhaps there’s a protocol that should be followed, and perhaps it hasn’t been followed correctly—or at all.

Ben Simo: Perhaps it isn’t the responsibility of one programmer OR one tester to mark a problem fixed on their own.
Darren McMillan: Obvious: didn’t follow company procedure for fixes (fix notes, check lists, communications)
Dave Nicolette: Maybe programmer shouldn’t be the one to say the problem is fixed, but only that it’s ready for another review.
Michel Kraaij: The dev misunderstood the bug fixing process and declared it “fixed” instead of “resolved”.

Avoidance, Indifference, or Pathological Behaviour

We don’t like to think about certain kinds of behaviour, yet they can happen. As Herbert Leeds and Jerry Weinberg put it in one of the first books on computer programming, “When we have been working on a program for a long time, and if someone is pressing us for completion, we put aside our good intentions and let our judgment be swayed.”

Ben Kelly: Programmer is under duress from management to ‘fix all the bugs immediately’
Michel Kraaij: The dev is just lazy
Michel Kraaij: The dev is out of “fixing time”. To make the fake deadline, he declares it fixed.
Pradeep Soundararajan: Is forced by a manager or a stakeholder to do so
Pradeep Soundararajan: That way he is buying more time because it goes through a big cycle before it comes back.

Maybe the pressure comes from outside of work.

Erkan Yilmaz: Dev has import. date w. fiancé’s parents, but boss wants dev 2 work overtime suddenly, family crisis happens

There may be questions of competence and trust between the parties. A report from unskilled tester who cries “Wolf!” too often might not be taken seriously. A bad reputation may influence a programmer to reject a bug with minimal (and insufficient) investigation.

Ben Kelly: Bug was dismissed as the tester reporting had a track record of reporting false positives.

Perhaps someone else was up to some mischief.

Erkan Yilmaz: dev went 2 other person’s pc who was away + used that pc 4 fix/marking – during that he found info that invaded privacy

Sometimes we can persuade ourselves that there isn’t a problem any more, even though we haven’t checked.

Michel Kraaij: The dev has a huge ego and declares EVERY bug he touches as “fixed”.
Pradeep Soundararajan: He did some code change that made the bug unreproducible and hence considers it to be fixed.
Pradeep Soundararajan: Considers it fixed thinking it was logged on latest version & prior versions have it altho code base in production is old
Pradeep Soundararajan: Thinks his colleague already fixed that.
Stephen Hill: Has the person for whom the problem was ‘a problem’ re-tested under the same circumstances as previously?

Sometimes management or measurement exacerbates undesirable behaviours.

Ben Kelly: Programmer’s incentive scheme counts the number of bugs fixed (& today is the deadline)
Ben Simo: Perhaps programmer is evaluated by number of things mark fixed; not producing working software.
Lynn McKee: He is being measured on how many bugs he fixes and hopes no one will notice no actual coding was done.
Pradeep Soundararajan: Yielding to SLA demands to fix a bug within a specific time.

It is remarkable how a handful of experienced people can come up with a list of this length and scope. Thank you all for that.

Another Silly Quantitative Model

Wednesday, July 14th, 2010

John D. Cook recently issued a blog post, How many errors are left to find?, in which he introduces yet another silly quantitative model for estimating the number of bugs left in a program.

The Lincoln Index, as Mr. Cook refers to it here, was used as a model for evaluating typographical errors, and was based on a method for estimating the population of a given species of animal. There are several terrible problems with this analysis.

First, reification error. Bugs are relationships, not things in the world. A bug is a perception of a problem in the product; a problem is a difference between what is perceived and what is desired by some person. There are at least four ways to make a problem into a non-problem: 1) Change the perception. 2) Change the product. 3) Change the desire. 4) Ignore the person who perceives the problem. Any time a product owner can say, “That? Nah, that’s not a bug,” the basic unit of the system of measurement is invalidated.

Second, even if we suspended the reification problem, the model is inappropriate. Bugs cannot be usefully modelled as a single kind of problem or a single population. Typographical errors are not the only problems in writing; a perfectly spelled and syntactically correct piece of writing is not necessarily a good piece of writing. Nor are plaice the only species of fish in the fjords, nor are fish the only form of life in the sea, nor do we consider all life forms as equivalently meaningful, significant, benign, or threatening. Bugs have many different manifestations, from capability problems to reliability problems to compatibility problems to performance problems and so forth. Some of those problems don’t have anything to do with coding errors (which themselves could be like typos or grammatical errors or statements that can interpreted ambiguously). Problems in the product may include misunderstood requirements, design problems, valid but misunderstood implementation of the design, and so forth. If you want to compare estimating bugs in a program to a population estimate, it would be more appropriate to compare it to estimating the number of all biological organisms in a given place. Imagine some of the problems in doing that, and you may get some insight into the problem of estimating bugs.

Third, there’s Djikstra’s notion that testing can show the presence of problems, but not their absence. That’s a way of noting that testing is subject to the Halting Problem. Since you can’t tell if you’ve found the last problem in the product, you can’t estimate how many are left in it.

Fourth, the Ludic Fallacy (part one). Discovery and analysis of problems in a product is not a probabilistic game, but a non-linear, organic system of exploration, discovery, investigation, and learning. Problems are discovered at neither a steady nor a random rate. Indeed, discoveries often happen in clusters as the tester learns about the program and things that might threaten its value. The Lincoln Index, focused on typos—a highly decidable and easily understood problem that could largely be accomplished by checking—doesn’t fit for software testing.

Fifth, the Ludic Fallacy (part two). Mr. Cook’s analysis implies that all problems are of equal value. Those of us who have done testing and studied it for a long time know that, from one time to another, some testers find a bunch of problems, and others find relatively few. Yet those few problems might be of critical significance, and the many of lesser significance. It’s an error to think in terms of a probabilistic model without thinking in terms of the payoff. Related to that is the idea that the number of bugs remaining in the product may not be that big a deal. All the little problems might pale in significance next to the one terrible problem; the one terrible problem might be easily fixable while the little problems grind down the users’ will to live.

Sixth, measurement-induced distortion. Whenever you measure a self-aware system, you are likely to introduce distortion (at best) and dysfunction (at worst), as the system adapts itself to optimize the thing that’s being measured. Count bugs, and your testers will report more bugs—but finding more bugs can get in the way of finding more important bugs. That’s at least in part because of…

Seventh, the Lumping Problem (or more formally, Assimiliation Bias). Testing is not a single activity; it’s a collection of activities that includes (at least) setup, investigation and reporting, and design and execution. Setup and investigation and reporting take time away from test coverage. When a tester finds a problem, she investigates reports it. That time is time that she can’t spend finding other problems. The irony here is that the more problems you find, the fewer problems you have time to find. The quality of testing work also involves the quality of the report. Reporting time, since it isn’t taken into account in the model, will distort the perception of the number of bugs remaining.

Eighth, estimating the number of problems remaining in the product takes time away from sensible, productive activities. Considering that the number of problems remaining is subjective, open-ended, and unprovable, one might be inclined to think that counting how many problems are left is a waste of time better spent on searching for other bad ones.

I don’t think I’ve found the last remaining problem with this model.

But it does remind me that when people see bugs as units and testing as piecework, rather than the complex, non-linear, cognitive process that it is, they start inventing all these weird, baseless, silly quantitative models that are at best unhelpful and that, more likely, threaten the quality of testing on the project.

Return to Ellis Island

Tuesday, February 23rd, 2010

Dave Nicollette responds to my post on the Ellis Island bug. I appreciate his continuing the conversation that started in the comments to my post.

Dave says, “In describing a ‘new’ category of software defect he calls Ellis Island bugs…”.

I want to make it clear: there is nothing new about Ellis Island bugs, except the name. They’ve been with us forever, since before there were computers, even.

He goes on to say “Using the typical behavior-driven approach that is popular today, one of the very first things I would think to write (thinking as a developer, not as a tester) is an example that expresses the desired behavior of the code when the input values are illogical. Protection against Ellis Island bugs is baked in to contemporary software development technique.”

I’m glad Dave does that. I’m glad his team does that. I’m glad that it’s baked in to contemporary software development technique. That’s a good thing.

First, there’s no evidence to suggest that excellent coding practices are universal, and plenty of evidence to suggest that they aren’t. Second, the Ellis Island problem is not a problem that you introduce in your own code. It’s a class of problem that you have to discover. As Dave rightly points out,

“…only way to catch this type of defect is by exploring the behavior of the code after the fact. Typical boundary-condition testing will miss some Ellis Island situations because developers will not understand what the boundaries are supposed to be.”

The issue is not that “developers” will not understand what the boundaries are supposed to be. (I think Dave means “programmers” here, but that’s okay, because other developers, including testers won’t understand what the boundaries are supposed to be either.) People in general will not understand what the boundaries are supposed to be without testing and interacting with the built product. And even then, people will understand only to the extent that they have the time and resources to test.

Dave seems to have locked onto the triangle program as an example of a “badly developed program”. Sure it’s a badly developed program. I could do better than that, and so could Dave. Part of the point of our exercise is that if the testers looked at the source code (which we supply, quietly, along with the program), they’d be more likely to find that kind of bug. Indeed, when programmers are in the class and have the initiative to look at the source, they often spot that problem, and that provides an important lesson for the testers: it might be a really good idea to learn to read code.

Yet testing isn’t just about questioning and evaluating the code that we write, because the code that we write is Well Tested and Good and Pure. We don’t write badly developed programs. That’s a thing of the past. Modern development methods make sure that problem never happens. The trouble is that APIs and libraries and operating systems and hardware ROMs weren’t written by our ideal team. They were written by other teams, whose minds and development practices and testing processes we do not, cannot, know. How do we know that the code that we’re calling isn’t badly developed code? We don’t know, and so we have to test.

I think we’d agree that Ruby, in general, is much better developed software than the triangle program, so let’s look at that instead.

The Pickaxe says of the String::to_i() method: “If there is not a valid number at the start of str, 0 is returned. The method never raises an exception.” That’s cool. Except that I see two things that are suprising.

The first is that to_i returns zero, instead of an exception. That is, it returns a value (quite probably the wrong value) in exactly the same data type as the calling function would expect. That leaves the door wide open for misinterpretation by someone who hasn’t tested the function seeking that kind of problem. We thought we had done that, and we were mistaken. Our tests were revealing accurately that invalid data of a certain kind was being rejected appropriately, but we weren’t yet sensitized to a problem that was revealed only by later tests.

The second surprising thing is that the documentation is flatly wrong: to_i absolutely does throw exceptions when you hand it a parameter outside the range 2 through 36. We discovered that through testing too. That’s interesting. I’d far rather it threw an exception on a number that it can’t parse properly, so that I could more easily detect that situation and handle it more in the way that I’d like.

Well, after a bunch of testing by students and experts alike, we finally surprised ourselves with some data and a condition that revealed the problem. We thought that we had tested really well, and we found out that we hadn’t caught everything. So now I have to write some code that checks the string and the return value more carefully than Ruby itself does. That’s okay. No problem. Now… that’s one method in one class of all of Ruby. What other surprises lurk?

(Here’s one. When I copied the passage in bold above from my PDF copy of the Pickaxe, I got more than I bargained for: in addition to the text that I copied, I got this: “Report erratum Prepared exclusively for Michael Bolton”. Should I have been surprised by that or not?)

Whatever problem we anticipate, we can insert code to check for that problem. Good. Whatever problem we discover, we can insert code to check for that problem too. That’s great. In fact, we check for all the problems that our code could possibly run into. Or rather we think we do, and we don’t know when we’re not doing it. To address that problem, we’ve got a team around us who provides us with lots of test ideas, and pairs and reviews and exercises the code that we write, and we all do that stuff really well.

The problem comes with the fact that when we’re writing software, we’re dealing with far more than just the software we write. That other software is typically a black box to us. It often comes to us documented poorly and tested worse. It does things that we don’t know about, that we can’t know about. It may do things that its developers considered reasonable but that we would consider surprising. Having been surprised, we might also consider it reasonable… but we’d consider it surprising first.

Let me give you two more Ellis Island examples. Many years ago, I was involved with supporting (and later program managing and maintaining) a product called DESQview. Once we had a fascinating problem that we heard about from customers. On a particular brand of video card (from a company called “Ahead”), typing DV wouldn’t start DESQview and give you all that multitasking goodness. Instead, it would cause the letters VD to appear in the upper left corner of the display, and then hang the system. We called the manufacturer of that card—headquartered in Germany—, and got one in. We tested it, and couldn’t reproduce the problem. Yet customers kept calling in with the problem. At one point, I got a call from a customer who happened to be a systems integrator, and he had a card to spare. He shipped it to us.

The first Ellis Island surprise was that this card, also called “Ahead” was from a Taiwanese company, not a German one. The second surprise was that, at the beginning of a particular INT 10h call, the card saved the contents of the CPU registers, and restored them at the end of that call. The Ellis Island issue here was that the BX register was not returned in its original state, but set to 0 instead. After the fact, after the discovery, the programmer developed a terminate-and-stay-resident program to save and restore the registers, and later folded that code into DESQview itself to special-case that card.

Now: our programmers were fantastic. They did a lot of the Agile stuff before Agile was named; they paired, they tested, they reviewed, they investigated. This problem had nothing to do with the quality of the code that they had written. It had everything to do with the fact that you’d expect someone using the processor not to muck with what was already there, combined with the fact that in our test lab we didn’t have every video card on the planet.

The oddest thing about Dave’s post is that he interprets my description of the Ellis Island problem as an argument “to support status quo role segregation.” Whaa…? This has nothing to do with role segregation. Nothing. At one point, I say “the programmer’s knowledge is, at best, is a different set compared to what empirical testing can reveal.” That’s true in any situation, be it a solo shop, a traditional shop, or an Agile shop. It’s true of anyone’s understanding of any situation. There’s always more to know than we think there is, and there’s always another interpretation that one could take, rightly or wrongly. Let me give you an example of that:

When I say “the programmer’s knowledge is, at best, is a different set compared to what empirical testing can reveal,” there is nothing in that sentence, nor in the rest of the post, to suggest that the programmers shouldn’t explore, or that testers should be the only ones to explore. Dave simply made that part up. My post says one thing, mostly on epistemology, that we don’t know what we don’t know. From my post, Dave takes another interpretation about organizational dynamics that is completely orthogonal to my point. Which, in fact, is an Ellis Island kind of problem on its own.

The Ellis Island Bug

Wednesday, February 10th, 2010

A couple of years ago, I developed a version of a well-known reasoning exercise. It’s a simple exercise, and I implemented it as a really simple computer program. I described it to James Bach, and suggested that we put it in our Rapid Software Testing class.

James was skeptical. He didn’t figure from my description that the exercise would be interesting enough. I put in a couple of little traps, and tried it a few times with colleagues and other unsuspecting victims, sometimes in person, sometimes over the Web. Then I tried the actual exercise on James, using the program. He helpfully stepped into one of the traps. Thus emboldened, I started using the exercise in classes. Eventually James found an occasion to start using it too. He watched students dealing with it, had some epiphanies, tried some experiments. At one point, he sat down with his brother Jon and they tested the program aggressively, and revealed a ton of new information about it—many of which I hadn’t known myself. And I wrote the thing.

Experiential exercises are like peeling an onion; beneath everything we see on the surface, there’s another layer that we can learn about. Today we made a discovery; we found a bug as we transpected on the exercise, and James put a name on it.

We call it an Ellis Island bug. Ellis Island bugs are data conversion bugs, in which a program silently converts an input value into a different value. They’re named for the tendency of customs officials at Ellis Island, a little way back in history, to rename immigrants unilaterally with names that were relatively easy to spell. With an Ellis Island bug, you could reasonably expect an error on a certain input. Instead you get the program’s best guess at what you “really meant”.

There are lots of examples of this. We have an implementation of the famous triangle program, written many years ago in Delphi. The program takes three integers as input, with each number representing the length of a side of a triangle. Then the program reports on whether the triangle is scalene, isoceles, or equilateral. Here’s the line that takes the input:

function checksides (a, b, c : shortint) : string

Here, no matter what numeric value you submit, the Delphi libraries will return that number as a signed integer between -128 and 127. This leads to all kinds of amusing results: a side of length greater than 127 will invisibly be converted to a negative number, causing the program to report “not a triangle” until the number is 256 or greater; and entries like 300, 300, 44 will be interpreted as an equilateral triangle.

Ah, you say, but no one uses Delphi any more. So how about C? We’ve been advised forever not to trust input formatting strings, and to parse them ourselves. How about Ruby?

Ruby’s String object supplies a to_i method, which converts a string to its integer representation. Here’s what the Pickaxe says about that:

to_i str.to_i( base=10 ) ? int

Returns the result of interpreting leading characters in str as an integer base base (2 to 36). Given a base of zero, to_i looks for leading 0, 0b, 0o, 0d, or 0x and sets the base accordingly. Leading spaces are ignored, and leading plus or minus signs are honored. Extraneous characters past the end of a valid number are ignored. If there is not a valid number at the start of str, 0 is returned. The method never raises an exception.

We discovered a bunch of things today as we experimented with our program. The most significant thing was the last two sentences: an invalid number is silently converted to zero, and no exception is raised!

We found the problem because we thought we were seeing a different one. Our program parses a string for three numbers. Depending upon the test that we ran, it appeared as though multiple signs were being accepted (+–+++–), but that only the first sign was being honoured. Or that only certain terms in the string tolerated multiple signs. Or that you could use multiple signs once in a string—no, twice. What the hell? All our confusion vanished when we put in some debug statements and saw invalid numbers being converted to 0, a kind of guess as to what Ruby thought you meant.

This is by design in Ruby, so some would say it’s not a bug. Yet it leaves Ruby programs spectacularly vulnerable to bugs wherein the programmer isn’t aware of the behaviour of the language. I knew about to_i’s ability to accept a parameter for a number base (someone showed it to me ages ago), but I didn’t know about the conversion-to-zero error handling. I would have expected an exception, but it doesn’t do that. It just acts like an old-fashioned customs agent: “S-C-H-U-M-A-C… What did you say? Schumacher? You mean Shoemaker, right? Let’s just make that Shoemaker. Youse’ll like that better here, trust me.”

We also discovered that the method is incorrectly documented: to_i does raise an exception if you pass it an invalid number base—37, for example.

There are many more stories to tell about this program—in particular, how the programmer’s knowledge is, at best, is a different set compared to what empirical testing can reveal. Many of the things we’ve discovered about this trivial program could not have been caught by code review; many of them aren’t documented or are poorly documented both in the program and in the Ruby literature. We couldn’t look them up, and in many cases we couldn’t have anticipated them if they hadn’t emerged from testing.

There are other examples of Ellis Island bugs. A correspondent, Brent Lavelle, reports that he’s seen a bug in which 50,00 gets converted to 5000, even if the user is from France or Germany (in those countries, a comma rather than a period denotes the decimal, and they use spaces where we use commas).

Now: boundary tests may reveal some Ellis Island bugs. Other Ellis Island bugs defy boundary testing, because there’s a catch: many such tests would require you to know what the boundary is and what is supposed to happen when it is crossed. From the outside, that’s not at all clear. It’s not even clear to the programmer, when libraries are doing the work. That’s why it’s insufficient to test at the boundaries that we know about already; that’s why we must explore.

Best Bug… or Bugs?

Wednesday, December 9th, 2009

And now for the immodest part of the EuroSTAR 2009 Test Lab report:  I won the Best Bug award, although it’s not clear to me which bug got the nod, since I reported several fairly major problems. 

I tested OpenEMR.  For me, one candidate for the most serious problem would have been a consistent pattern of inconsistency in input handling and error checking.  I observed over a dozen instances of some kind of sloppiness.

This reminded me of a problem that we testers often see in project work, the problem of measuring by counting things—counting bugs, counting bug reports, counting requirements.  When the requirement is to defend the application against overflowing text fields and vulnerability to input constraint attacks by hackers, how should we count?  How many mentions of that should there be?  One, in a statement of general principles at the beginning of a requirements document?  Hundreds, in a statement of specific purpose for each input field in a functional specification?  How many requirements are there to make sure that fields don’t overflow?  How many requirements that they support only the characters, numbers, or date ranges that they’re supposed to?  What about traceability?  If this is a genuine problem, and the requirements documents don’t mention a particular requirement explicitly, should we refrain from reporting on a problem with that implicit requirement?

When I report an issue—for example, that practically all of the input fields in OpenEMR have some kind of problem with them—should that count as one bug report?  Since it applies to hundreds of fields, should it count as hundreds of bug reports?  When such a pervasive overall problem exists, should the tester make a report for each and every field in which he observes a problem?  And if you want to answer Yes, to that question:  is it worth the opportunity cost to do that when there are plenty of other problems in the product?

So again, there were so many instances of unconstrained and unchecked input that I stopped recording specifics and instead reported a general pattern in the bug tracking system.  My decision to do this was an instance of the Dead Horse stopping heuristic; reporting yet another instance of the same class of problem would be like flogging a dead horse.  I could have wasted a lot of time and energy reporting each instance of each problem I observed, along with specific symptoms and possible ramifications of each one.  Yet I’m very skeptical that this would serve the project well.  In my experience as a program manager for a product whose code was being developed outside our company, I found that there was steadily diminishing return in value for many reports of the same ilk.  When, in testing, we identified a general pattern of failure, we stopped looking for more instances.  We sent the product back to the development shop, and required the programmers and their testers to review the product through-and-through for that kind of problem.

If I were to be evaluated on the number of bugs that I found, I’d find it hard to resist the easy pickings of yet another input constraint attack bug report.  Yet when I’m testing, every moment of bug investigation and reporting is, by some reckoning, another moment that I can’t spend on obtaining more test coverage (more about that here).  By focusing on investigating and reporting on input problems (and thereby increasing my bug count), am I missing opportunities to design and perform tests on scheduling conflict-resolution algorithms, workflows, database integrity,…?

There were two other fairly serious problems that I observed.  One was that the Chinese version of the product showed a remarkable number of English words, presumably untranslated, interspersed among the ideograms; I expected to see no English at all.  I treated that problem in the same way as the input constraint problem:  with a single report of a general problem.

The second serious problem was that searches of various kinds would place a link in the address bar.  The link represented a command to a CGI script of some kind, which evidently constructed and forwarded a query to an underlying SQL database.  Backspacing over the last digit in the address bar and replacing it with a slash caused a lovely SQL error message to appear on the screen, unhandled by any of OpenEMR’s code.  The message could have been used, said our local product owner, to expose the structure of the database to snoops or hackers.  I found that problem by a defocusing heuristic—looking at the browser, rather than the browser window.

I don’t know which of these problems took Best Bug honours.  I’m not sure that the presenters specified which bug they were crediting with Best Bug.  That makes a certain kind of sense, since I can’t tell which of these problems is the most serious either.  After all, a problem isn’t its own thing; it’s a relationship between a person and a product or a situation.  There are plenty of ways to address a problem.  You could fix the product or the situation.  You could change the perspective or the perception of the person observing the problem, say by keeping the problem as it is but providing a workaround.  You could choose to ignore the problem yourself, which underscores the fact that a problem for some person might not be a problem for you.  That’s why it’s not helpful to count problems.

Managers:  do you see how evaluating testers based on test cases or bug counts, rather than the value of reporting, will lead to distortion at best, and more likely to dysfunction?  Do you see how providing overstructured test scripts or test cases could reduce the diversity—and therefore the quality—of testing?  Do you see how the notion of “one test per requirement” or “one positive and one negative test per requirement” is misleading?

Testers:  do you see how being evaluated on bug counts could lead to inattentional blindness with respect to the more serious problems than the low-hanging fruit affords?  Do you see how focusing on bugs, rather than focusing on test coverage, could reduce the value of your testing?

Instead of counting things, let’s consider evaluating testing work in a different way.  Let’s consider the overall testing story and its many dimensions.  Let’s think about the story around each  bug, and each bug report—not just the number of reports, but the meaning and significance of each one.  Let’s look at the value of the information to stakeholders, primarily to programmers and to product owners.  Let’s think about the extent to which the tester makes things easier for others on the team, including other testers.  Let’s look at the diversity of problems discovered, the diversity of approaches used, and the diversity of tools and techniques applied.  And rather than using this information to reward or punish testers, let’s use it to guide coaching, mentoring, and training such that the focus is on developing skill for everyone.

The dimensions above are qualitative, rather than quantitative.  Yet if our mission is to provide information to inform decisions about quality, we of all people should recognize that expressing value in terms of numbers often removes important information rather than adding it.

Additional reading: 

Measuring and Managing Performance in Organizations (Robert D. Austin)
Software Engineering Metrics:  What Do They Measure and How Do We Know? (Kaner and Bond)
Quality Software Management, Vol. 2:  First Order Measurement (Weinberg)
Perfect Software (and Other Illusions About Testing) (Weinberg)