Blog Posts from February, 2010

Return to Ellis Island

Tuesday, February 23rd, 2010

Dave Nicollette responds to my post on the Ellis Island bug. I appreciate his continuing the conversation that started in the comments to my post.

Dave says, “In describing a ‘new’ category of software defect he calls Ellis Island bugs…”.

I want to make it clear: there is nothing new about Ellis Island bugs, except the name. They’ve been with us forever, since before there were computers, even.

He goes on to say “Using the typical behavior-driven approach that is popular today, one of the very first things I would think to write (thinking as a developer, not as a tester) is an example that expresses the desired behavior of the code when the input values are illogical. Protection against Ellis Island bugs is baked in to contemporary software development technique.”

I’m glad Dave does that. I’m glad his team does that. I’m glad that it’s baked in to contemporary software development technique. That’s a good thing.

But there are a couple of problems. First, there’s no evidence to suggest that excellent coding practices are universal, and plenty of evidence to suggest that they aren’t. Second, the Ellis Island problem is not a problem that you introduce in your own code. It’s a class of problem that you have to discover. As Dave rightly points out,

“…only way to catch this type of defect is by exploring the behavior of the code after the fact. Typical boundary-condition testing will miss some Ellis Island situations because developers will not understand what the boundaries are supposed to be.”

The issue is not that “developers” will not understand what the boundaries are supposed to be. (I think Dave means “programmers” here, but that’s okay, because other developers, including testers, won’t understand what the boundaries are supposed to be either.) People in general will not understand what the boundaries are supposed to be without testing and interacting with the built product. And even then, people will understand only to the extent that they have the time and resources to test.

Dave seems to have locked onto the triangle program as an example of a “badly developed program”. Sure, it’s a badly developed program. I could do better than that, and so could Dave. Part of the point of our exercise is that if the testers looked at the source code (which we supply, quietly, along with the program), they’d be more likely to find that kind of bug. Indeed, when programmers are in the class and have the initiative to look at the source, they often spot that problem, and that provides an important lesson for the testers: it might be a really good idea to learn to read code.

Yet testing isn’t just about questioning and evaluating the code that we write, because the code that we write is Well Tested and Good and Pure. We don’t write badly developed programs. That’s a thing of the past. Modern development methods make sure that problem never happens. The trouble is that APIs and libraries and operating systems and hardware ROMs weren’t written by our ideal team. They were written by other teams, whose minds and development practices and testing processes we do not, cannot, know. How do we know that the code that we’re calling isn’t badly developed code? We don’t know, and so we have to test.

I think we’d agree that Ruby, in general, is much better developed software than the triangle program, so let’s look at that instead.

The Pickaxe says of the String::to_i() method: “If there is not a valid number at the start of str, 0 is returned. The method never raises an exception.” That’s cool. Except that I see two things that are surprising.

The first is that to_i returns zero instead of raising an exception. That is, it returns a value (quite probably the wrong value) in exactly the same data type as the calling function would expect. That leaves the door wide open for misinterpretation by someone who hasn’t tested the function with that kind of problem in mind. We thought we had done that, and we were mistaken. Our tests were revealing accurately that invalid data of a certain kind was being rejected appropriately, but we weren’t yet sensitized to a problem that was revealed only by later tests.

The second surprising thing is that the documentation is flatly wrong: to_i absolutely does throw exceptions when you hand it a parameter outside the range 2 through 36. We discovered that through testing too. That’s interesting. I’d far rather it threw an exception on a number that it can’t parse properly, so that I could more easily detect that situation and handle it more in the way that I’d like.
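
Both surprises are easy to see in an irb session (a minimal sketch; the exact exception message varies by Ruby version):

# Surprise 1: an unparseable string is silently converted to zero.
"three".to_i        # => 0
"".to_i             # => 0

# Surprise 2: contrary to the documentation, an invalid number base
# does raise an exception.
"42".to_i(37)       # raises ArgumentError (invalid radix)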

Well, after a bunch of testing by students and experts alike, we finally surprised ourselves with some data and a condition that revealed the problem. We thought that we had tested really well, and we found out that we hadn’t caught everything. So now I have to write some code that checks the string and the return value more carefully than Ruby itself does. That’s okay. No problem. Now… that’s one method in one class of all of Ruby. What other surprises lurk?
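
That checking code might look something like this (strict_to_i is a hypothetical name, and this is a sketch of one approach, not necessarily what we ended up writing):

# Reject anything that isn't a plain decimal integer, instead of
# letting to_i guess at what the caller "really meant".
def strict_to_i(str)
  raise ArgumentError, "not a plain integer: #{str.inspect}" unless str =~ /\A[+-]?\d+\z/
  str.to_i
end

strict_to_i("44")     # => 44
strict_to_i("4 4")    # raises ArgumentError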

(Here’s one. When I copied the passage quoted above from my PDF copy of the Pickaxe, I got more than I bargained for: in addition to the text that I copied, I got this: “Report erratum Prepared exclusively for Michael Bolton”. Should I have been surprised by that or not?)

Whatever problem we anticipate, we can insert code to check for that problem. Good. Whatever problem we discover, we can insert code to check for that problem too. That’s great. In fact, we check for all the problems that our code could possibly run into. Or rather we think we do, and we don’t know when we’re not doing it. To address that problem, we’ve got a team around us who provides us with lots of test ideas, and pairs and reviews and exercises the code that we write, and we all do that stuff really well.

The problem comes with the fact that when we’re writing software, we’re dealing with far more than just the software we write. That other software is typically a black box to us. It often comes to us documented poorly and tested worse. It does things that we don’t know about, that we can’t know about. It may do things that its developers considered reasonable but that we would consider surprising. Having been surprised, we might also consider it reasonable… but we’d consider it surprising first.

Let me give you two more Ellis Island examples. Many years ago, I was involved with supporting (and later program managing and maintaining) a product called DESQview. Once we had a fascinating problem that we heard about from customers. On a particular brand of video card (from a company called “Ahead”), typing DV wouldn’t start DESQview and give you all that multitasking goodness. Instead, it would cause the letters VD to appear in the upper left corner of the display, and then hang the system. We called the manufacturer of that card—headquartered in Germany—and got one in. We tested it, and couldn’t reproduce the problem. Yet customers kept calling in with the problem. At one point, I got a call from a customer who happened to be a systems integrator, and he had a card to spare. He shipped it to us.

The first Ellis Island surprise was that this card, also called “Ahead”, was from a Taiwanese company, not a German one. The second surprise was that, at the beginning of a particular INT 10h call, the card saved the contents of the CPU registers, and restored them at the end of that call. The Ellis Island issue here was that the BX register was not returned in its original state, but set to 0 instead. After the fact, after the discovery, the programmer developed a terminate-and-stay-resident program to save and restore the registers, and later folded that code into DESQview itself to special-case that card.

Now: our programmers were fantastic. They did a lot of the Agile stuff before Agile was named; they paired, they tested, they reviewed, they investigated. This problem had nothing to do with the quality of the code that they had written. It had everything to do with the fact that you’d expect someone using the processor not to muck with what was already there, combined with the fact that in our test lab we didn’t have every video card on the planet.

The oddest thing about Dave’s post is that he interprets my description of the Ellis Island problem as an argument “to support status quo role segregation.” Whaa…? This has nothing to do with role segregation. Nothing. At one point, I say “the programmer’s knowledge is, at best, a different set compared to what empirical testing can reveal.” That’s true in any situation, be it a solo shop, a traditional shop, or an Agile shop. It’s true of anyone’s understanding of any situation. There’s always more to know than we think there is, and there’s always another interpretation that one could take, rightly or wrongly. Let me give you an example of that:

When I say “the programmer’s knowledge is, at best, a different set compared to what empirical testing can reveal,” there is nothing in that sentence, nor in the rest of the post, to suggest that the programmers shouldn’t explore, or that testers should be the only ones to explore. Dave simply made that part up. My post says one thing, mostly about epistemology: that we don’t know what we don’t know. From my post, Dave takes another interpretation about organizational dynamics that is completely orthogonal to my point. Which, in fact, is an Ellis Island kind of problem on its own.

The Ellis Island Bug

Wednesday, February 10th, 2010

A couple of years ago, I developed a version of a well-known reasoning exercise. It’s a simple exercise, and I implemented it as a really simple computer program. I described it to James Bach, and suggested that we put it in our Rapid Software Testing class.

James was skeptical. He didn’t figure from my description that the exercise would be interesting enough. I put in a couple of little traps, and tried it a few times with colleagues and other unsuspecting victims, sometimes in person, sometimes over the Web. Then I tried the actual exercise on James, using the program. He helpfully stepped into one of the traps. Thus emboldened, I started using the exercise in classes. Eventually James found an occasion to start using it too. He watched students dealing with it, had some epiphanies, tried some experiments. At one point, he sat down with his brother Jon and they tested the program aggressively, and revealed a ton of new information about it—much of which I hadn’t known myself. And I wrote the thing.

Experiential exercises are like peeling an onion; beneath everything we see on the surface, there’s another layer that we can learn about. Today we made a discovery; we found a bug as we transpected on the exercise, and James put a name on it.

We call it an Ellis Island bug. Ellis Island bugs are data conversion bugs, in which a program silently converts an input value into a different value. They’re named for the tendency of customs officials at Ellis Island, a little way back in history, to rename immigrants unilaterally with names that were relatively easy to spell. With an Ellis Island bug, you could reasonably expect an error on a certain input. Instead you get the program’s best guess at what you “really meant”.

There are lots of examples of this. We have an implementation of the famous triangle program, written many years ago in Delphi. The program takes three integers as input, with each number representing the length of a side of a triangle. Then the program reports on whether the triangle is scalene, isosceles, or equilateral. Here’s the line that takes the input:

function checksides (a, b, c : shortint) : string;

Here, no matter what numeric value you submit, the Delphi libraries will return that number as a signed integer between -128 and 127. This leads to all kinds of amusing results: a side of length greater than 127 will invisibly be converted to a negative number, causing the program to report “not a triangle” until the number is 256 or greater; and entries like 300, 300, 44 will be interpreted as an equilateral triangle.
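
If you want to see the arithmetic behind that, here’s a small Ruby sketch (not the Delphi code itself) of what happens when a value gets squeezed into a signed 8-bit range:

# Keep only the low 8 bits, then reinterpret them as a signed value
# in -128..127, the way an assignment to a shortint effectively does.
def as_shortint(n)
  low = n % 256
  low >= 128 ? low - 256 : low
end

as_shortint(300)   # => 44    so 300, 300, 44 looks like 44, 44, 44: "equilateral"
as_shortint(200)   # => -56   a negative "side": "not a triangle"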

Ah, you say, but no one uses Delphi any more. So how about C? We’ve been advised forever not to trust input formatting strings, and to parse them ourselves. How about Ruby?

Ruby’s String object supplies a to_i method, which converts a string to its integer representation. Here’s what the Pickaxe says about that:

to_i    str.to_i( base=10 ) → int

Returns the result of interpreting leading characters in str as an integer base base (2 to 36). Given a base of zero, to_i looks for leading 0, 0b, 0o, 0d, or 0x and sets the base accordingly. Leading spaces are ignored, and leading plus or minus signs are honored. Extraneous characters past the end of a valid number are ignored. If there is not a valid number at the start of str, 0 is returned. The method never raises an exception.

We discovered a bunch of things today as we experimented with our program. The most significant thing was the last two sentences: an invalid number is silently converted to zero, and no exception is raised!

We found the problem because we thought we were seeing a different one. Our program parses a string for three numbers. Depending upon the test that we ran, it appeared as though multiple signs were being accepted (+-+++-), but that only the first sign was being honoured. Or that only certain terms in the string tolerated multiple signs. Or that you could use multiple signs once in a string—no, twice. What the hell? All our confusion vanished when we put in some debug statements and saw invalid numbers being converted to 0, a kind of guess as to what Ruby thought you meant.
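
In hindsight, the behaviour is easy to demonstrate in isolation (a sketch; what our program reported also depended on how it split the string into terms):

" -3".to_i        # => -3   a single leading sign (and leading spaces) is honoured
"+3".to_i         # => 3
"--3".to_i        # => 0    multiple signs don't parse; silently converted to zero
"+-+++-3".to_i    # => 0    likewise
"3 sides".to_i    # => 3    trailing junk is quietly ignored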

This is by design in Ruby, so some would say it’s not a bug. Yet it leaves Ruby programs spectacularly vulnerable to bugs wherein the programmer isn’t aware of the behaviour of the language. I knew about to_i’s ability to accept a parameter for a number base (someone showed it to me ages ago), but I didn’t know about the conversion-to-zero error handling. I would have expected an exception, but it doesn’t do that. It just acts like an old-fashioned customs agent: “S-C-H-U-M-A-C… What did you say? Schumacher? You mean Shoemaker, right? Let’s just make that Shoemaker. Youse’ll like that better here, trust me.”

We also discovered that the method is incorrectly documented: to_i does raise an exception if you pass it an invalid number base—37, for example.

There are many more stories to tell about this program—in particular, how the programmer’s knowledge is, at best, a different set compared to what empirical testing can reveal. Many of the things we’ve discovered about this trivial program could not have been caught by code review; many of them aren’t documented or are poorly documented both in the program and in the Ruby literature. We couldn’t look them up, and in many cases we couldn’t have anticipated them if they hadn’t emerged from testing.

There are other examples of Ellis Island bugs. A correspondent, Brent Lavelle, reports that he’s seen a bug in which 50,00 gets converted to 5000, even if the user is from France or Germany (in those countries, a comma rather than a period denotes the decimal, and they use spaces where we use commas).
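
We don’t know how that program was written, but here’s a hypothetical sketch of the kind of “helpful” cleanup that would produce exactly that result:

# Hypothetical: strip commas as though they were thousands separators,
# then convert. Reasonable for "50,000"; disastrous for a European "50,00".
def naive_parse(str)
  str.delete(",").to_i
end

naive_parse("50,00")   # => 5000   but the user meant fifty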

Now: boundary tests may reveal some Ellis Island bugs. Other Ellis Island bugs defy boundary testing, because there’s a catch: many such tests would require you to know what the boundary is and what is supposed to happen when it is crossed. From the outside, that’s not at all clear. It’s not even clear to the programmer, when libraries are doing the work. That’s why it’s insufficient to test at the boundaries that we know about already; that’s why we must explore.

Testing and Management Parallels

Thursday, February 4th, 2010

Rikard Edgren, Henrik Emilsson and Martin Jansson collaborate on a blog called thoughts from the test eye. In a satirical post from this past summer called “Scripted vs Exploratory Testing from a Managerial Perspective”, Martin proposes that “From a managerial perspective without knowing too much about testing, your sole experience comes from the scripted test environment…” But I think that from a managerial perspective, there is another place you could look to understand skilled testing: managing. I’ll follow the points in Martin’s post.

If you’re a capable manager, and you’re managing other managers, you know that there are things for which scripting doesn’t work:

Control. Managers guide the managers working under them, but everyone involved knows that managers don’t have complete control over what they’re managing. No script can capture the essence of management work. (If scripts could do that, we’d have automated management by now.) Managers know that when they have some written guidance on how workers are to perform certain tasks, effective workers and managers alike must adapt to the situation and use their judgement. If, as a manager, you could script workers’ actions completely, they wouldn’t come to your office to ask for help, and you wouldn’t have to assist, guide, motivate, or reprimand them. You, the manager, have to observe a variety of things that cannot be anticipated, and respond to what actually happens. You might have checklists, but you don’t have a list of scripted tasks. You recognize that the end of management work for a particular project can be anticipated, but not predicted with certainty. Indeed, that’s a function of the risks that you’re hired to manage and the problems you’re hired to solve. As a manager, you’re managing many things simultaneously. You have the freedom and responsibility to carry out your work in the manner you think best, and you grant similar freedom and responsibility to your people. Isn’t all that like being a tester, and like managing testers?

Hierarchy. There is a structure to management, with different roles playing their part in the system. No competent manager supervising other managers would characterize management as “some people to do the thinking and others execute”. That would suggest that some managers think and other managers execute. As a manager, you recognize that all managers worthy of the role both think and execute, with the recognition that an organization is stronger as a collaborative network. Isn’t that like being a tester, and like managing testers?

Scalability. You know that in management, you can’t easily bring in people who can execute management scripts that other managers have written. Managers need to own their processes. Getting new managers in the middle of a project would derail it, and you can’t take just anyone. Isn’t that like being a tester, and like managing testers?

Management Software. As a manager, you know that no tool—even one that costs several million dollars—can replace your judgment. At best, it can collate data and generate excellent reports, but the decision-making is yours. As a manager, you’re leery of having your work overly mediated. When you have important but mundane tasks to perform, you hand off the non-sapient parts to computing machinery, but you apply sapience to planning, designing, and programming the tools—and you apply sapience to observing the results, to determining their meaning and significance, and to your response. When you have to delegate sapient work, you know that it can’t be performed by a machine. So you hire someone—a person, not a machine—to do the work with your collaboration and guidance. Isn’t that like being a tester, and like managing testers?

Education. You look back on how you learned, and you realize that, whether you had years of schooling or learned on the job, you don’t believe in mail-order management courses, and you harbour no illusions that a two-day course accompanied by a piece of paper can teach you how to be a manager; nor can you trust that someone brandishing a similar piece of paper is ready for a management job until you know a lot more about him. Isn’t that like being a tester, and like managing testers?

What does Exploratory Testing (ET) include? Well, it’s kind of like management, isn’t it?

Empowerment and Self Management. Managers perform management actions as they go along. Managers do not need people to design their actions for them. Managers foster leadership by empowering people to use their skills; guiding, but not controlling; granting freedom and requiring responsibility. Isn’t that like being a tester, and like managing testers?

Taming Chaos. At the beginning of any management assignment, you can’t be certain about how you are going to manage, nor about how the managers reporting to you will manage. You have not planned everything out in detail before you start managing; you can’t, and you know you’d be fooling yourself if you pretended to do so. You cannot report exactly how much time you need, since you don’t know everything in advance. In fact, discovering what needs to be done is a key aspect of your work. You recognize that management is a holistic process, not a linear one. You will use your skills, combined with all of the information available, to inform your decisions on time, scope, quality, innovation, skill, and learning. You will use feedback from your surroundings to gather the information you need to make decisions. Isn’t that like being a tester, and like managing testers?

Scaling Up. When you’re hiring people to be managers who report to you, you only want managers. If you have people who aren’t ready to be managers, but who show promise, you’ll train and mentor them into the role. Not just anyone can be a manager. It is hard to get “just anyone” to help out, since you cannot use “just anyone” from the organisation immediately. They need to learn real management skills to be effective, which means that, among other things, they must be given the freedom to make mistakes that can be observed and corrected in an empowering, fault-tolerant environment. That’s how people learn to become excellent managers: through experience sharpened by mentorship. When looked at this way, management does scale. Isn’t that like being a tester, and like managing testers?

Certification and Training. Multiple-choice based certification for managers is insufficient to evaluate the quality of a manager.  The certification doesn’t matter anyway; what you seek is skill.  To develop that, there are degree programs, and there are shorter skill-based courses that involve simulations, open discussion, and open-ended learning. Good courses are valuable supplements to an environment that fosters learning and innovation; courses that teach only management nomenclature are a waste of time and money. Isn’t that like being a tester, and like managing testers?

Management Software Isn’t Management. Management isn’t done by software. Major software vendors have tools to support management, but the tools don’t do the management, and the tools don’t replace managers. Customer relationship management software is not customer relationship management; enterprise resource management software isn’t enterprise resource management. A real manager knows that what she thinks and what she does is what’s important; that for her real work–the analysis and decision making–her paper notepad is just as valid a tool as an Excel spreadsheet, and that no tool, no matter how big or how expensive or how powerful, is anything more than a tool. Isn’t that like being a tester, and like managing testers?

Excellent testing skill has much in common with excellent management skill. As testers, maybe we can use the similarities between them to help explain the work that we do.