Rapid Software Testing Public Events in Europe

March 1st, 2010

It’s a busy season in Europe for Rapid Testing this spring.

I’m going to be at the Norwegian Computer Society’s FreeTest, a conference on free testing tools in Trondheim, Norway, where I’ll be giving a keynote talk on testing vs. checking on March 26.  That’s preceded by a three-day public session of Rapid Software Testing, from March 23-25.  Register here.

After that I’m off to Germany for a three-day public offering of Rapid Software Testing in Berlin, sponsored by Testing Experience.  That class happens March 29-31.  Can’t make it yourself?  Please spread the word!

Stephen Allott at Electromind is setting up a three-day Rapid Software Testing class that I’ll teach in London, May 11-13.  There’s also a testers’ gathering to be held in some accommodating pub on Wednesday the 12th.  If you’re in the area (or can get there), I’d love the opportunity to meet and chat.  Drop a line to me for details.

While all that’s going on, my colleague James Bach will be in Sweden, delivering a public RST class for AddQ Consulting in Kista near Stockholm March 16-18; a session of Rapid Software Testing in Gothenburg March 22-24; a tutorial on Self-Education for Testers on March 25; and an appearance at the SAST conference on March 26.  That’s interspersed with a bunch of corporate consulting, after which he’ll be at the ACCU Conference in Oxford, UK, April 14-17.

Return to Ellis Island

February 23rd, 2010

Dave Nicollette responds to my post on the Ellis Island bug. I appreciate his continuing the conversation that started in the comments to my post.

Dave says, “In describing a ‘new’ category of software defect he calls Ellis Island bugs…”.

I want to make it clear: there is nothing new about Ellis Island bugs, except the name. They’ve been with us forever, since before there were computers, even.

He goes on to say “Using the typical behavior-driven approach that is popular today, one of the very first things I would think to write (thinking as a developer, not as a tester) is an example that expresses the desired behavior of the code when the input values are illogical. Protection against Ellis Island bugs is baked in to contemporary software development technique.”

I’m glad Dave does that. I’m glad his team does that. I’m glad that it’s baked in to contemporary software development technique. That’s a good thing.

First, there’s no evidence to suggest that excellent coding practices are universal, and plenty of evidence to suggest that they aren’t. Second, the Ellis Island problem is not a problem that you introduce in your own code. It’s a class of problem that you have to discover. As Dave rightly points out,

“…only way to catch this type of defect is by exploring the behavior of the code after the fact. Typical boundary-condition testing will miss some Ellis Island situations because developers will not understand what the boundaries are supposed to be.”

The issue is not that “developers” will not understand what the boundaries are supposed to be. (I think Dave means “programmers” here, but that’s okay, because other developers, including testers, won’t understand what the boundaries are supposed to be either.) People in general will not understand what the boundaries are supposed to be without testing and interacting with the built product. And even then, people will understand only to the extent that they have the time and resources to test.

Dave seems to have locked onto the triangle program as an example of a “badly developed program”. Sure it’s a badly developed program. I could do better than that, and so could Dave. Part of the point of our exercise is that if the testers looked at the source code (which we supply, quietly, along with the program), they’d be more likely to find that kind of bug. Indeed, when programmers are in the class and have the initiative to look at the source, they often spot that problem, and that provides an important lesson for the testers: it might be a really good idea to learn to read code.

Yet testing isn’t just about questioning and evaluating the code that we write, because the code that we write is Well Tested and Good and Pure. We don’t write badly developed programs. That’s a thing of the past. Modern development methods make sure that problem never happens. The trouble is that APIs and libraries and operating systems and hardware ROMs weren’t written by our ideal team. They were written by other teams, whose minds and development practices and testing processes we do not, cannot, know. How do we know that the code that we’re calling isn’t badly developed code? We don’t know, and so we have to test.

I think we’d agree that Ruby, in general, is much better developed software than the triangle program, so let’s look at that instead.

The Pickaxe says of the String::to_i() method: “If there is not a valid number at the start of str, 0 is returned. The method never raises an exception.” That’s cool. Except that I see two things that are surprising.

The first is that to_i returns zero instead of raising an exception. That is, it returns a value (quite probably the wrong value) in exactly the same data type as the calling function would expect. That leaves the door wide open for misinterpretation by someone who hasn’t tested the function seeking that kind of problem. We thought we had done that, and we were mistaken. Our tests were revealing accurately that invalid data of a certain kind was being rejected appropriately, but we weren’t yet sensitized to a problem that was revealed only by later tests.

The second surprising thing is that the documentation is flatly wrong: to_i absolutely does throw exceptions when you hand it a parameter outside the range 2 through 36. We discovered that through testing too. That’s interesting. I’d far rather it threw an exception on a number that it can’t parse properly, so that I could more easily detect that situation and handle it more in the way that I’d like.

Well, after a bunch of testing by students and experts alike, we finally surprised ourselves with some data and a condition that revealed the problem. We thought that we had tested really well, and we found out that we hadn’t caught everything. So now I have to write some code that checks the string and the return value more carefully than Ruby itself does. That’s okay. No problem. Now… that’s one method in one class of all of Ruby. What other surprises lurk?
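A minimal sketch of what “checking the string more carefully than Ruby itself does” might look like. The helper name `strict_to_i` and the rule it enforces (an optional sign followed only by decimal digits, with no surrounding whitespace) are assumptions made for illustration; they are not the code from our exercise.

```ruby
# Hypothetical helper: accept only an optional sign followed by decimal digits.
# Anything else raises instead of being silently converted to 0.
def strict_to_i(str)
  unless str.is_a?(String) && str.match?(/\A[+-]?\d+\z/)
    raise ArgumentError, "not a valid integer: #{str.inspect}"
  end
  str.to_i
end

puts strict_to_i("42")   # 42
puts strict_to_i("-7")   # -7

begin
  strict_to_i("+-+3")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

The trade-off is deliberate: this version rejects input that to_i would tolerate (leading spaces, trailing characters), so the caller finds out about the surprise instead of receiving a plausible-looking number.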

(Here’s one. When I copied the passage in bold above from my PDF copy of the Pickaxe, I got more than I bargained for: in addition to the text that I copied, I got this: “Report erratum Prepared exclusively for Michael Bolton”. Should I have been surprised by that or not?)

Whatever problem we anticipate, we can insert code to check for that problem. Good. Whatever problem we discover, we can insert code to check for that problem too. That’s great. In fact, we check for all the problems that our code could possibly run into. Or rather we think we do, and we don’t know when we’re not doing it. To address that problem, we’ve got a team around us who provides us with lots of test ideas, and pairs and reviews and exercises the code that we write, and we all do that stuff really well.

The problem comes with the fact that when we’re writing software, we’re dealing with far more than just the software we write. That other software is typically a black box to us. It often comes to us documented poorly and tested worse. It does things that we don’t know about, that we can’t know about. It may do things that its developers considered reasonable but that we would consider surprising. Having been surprised, we might also consider it reasonable… but we’d consider it surprising first.

Let me give you two more Ellis Island examples. Many years ago, I was involved with supporting (and later program managing and maintaining) a product called DESQview. Once we had a fascinating problem that we heard about from customers. On a particular brand of video card (from a company called “Ahead”), typing DV wouldn’t start DESQview and give you all that multitasking goodness. Instead, it would cause the letters VD to appear in the upper left corner of the display, and then hang the system. We called the manufacturer of that card, headquartered in Germany, and got one in. We tested it, and couldn’t reproduce the problem. Yet customers kept calling in with the problem. At one point, I got a call from a customer who happened to be a systems integrator, and he had a card to spare. He shipped it to us.

The first Ellis Island surprise was that this card, also called “Ahead”, was from a Taiwanese company, not a German one. The second surprise was that, at the beginning of a particular INT 10h call, the card saved the contents of the CPU registers, and restored them at the end of that call. The Ellis Island issue here was that the BX register was not returned in its original state, but set to 0 instead. After the fact, after the discovery, the programmer developed a terminate-and-stay-resident program to save and restore the registers, and later folded that code into DESQview itself to special-case that card.

Now: our programmers were fantastic. They did a lot of the Agile stuff before Agile was named; they paired, they tested, they reviewed, they investigated. This problem had nothing to do with the quality of the code that they had written. It had everything to do with the fact that you’d expect someone using the processor not to muck with what was already there, combined with the fact that in our test lab we didn’t have every video card on the planet.

The oddest thing about Dave’s post is that he interprets my description of the Ellis Island problem as an argument “to support status quo role segregation.” Whaa…? This has nothing to do with role segregation. Nothing. At one point, I say “the programmer’s knowledge is, at best, a different set compared to what empirical testing can reveal.” That’s true in any situation, be it a solo shop, a traditional shop, or an Agile shop. It’s true of anyone’s understanding of any situation. There’s always more to know than we think there is, and there’s always another interpretation that one could take, rightly or wrongly. Let me give you an example of that:

When I say “the programmer’s knowledge is, at best, a different set compared to what empirical testing can reveal,” there is nothing in that sentence, nor in the rest of the post, to suggest that the programmers shouldn’t explore, or that testers should be the only ones to explore. Dave simply made that part up. My post says one thing, mostly on epistemology: that we don’t know what we don’t know. From my post, Dave takes another interpretation about organizational dynamics that is completely orthogonal to my point. Which, in fact, is an Ellis Island kind of problem on its own.

The Ellis Island Bug

February 10th, 2010

A couple of years ago, I developed a version of a well-known reasoning exercise. It’s a simple exercise, and I implemented it as a really simple computer program. I described it to James Bach, and suggested that we put it in our Rapid Software Testing class.

James was skeptical. He didn’t figure from my description that the exercise would be interesting enough. I put in a couple of little traps, and tried it a few times with colleagues and other unsuspecting victims, sometimes in person, sometimes over the Web. Then I tried the actual exercise on James, using the program. He helpfully stepped into one of the traps. Thus emboldened, I started using the exercise in classes. Eventually James found an occasion to start using it too. He watched students dealing with it, had some epiphanies, tried some experiments. At one point, he sat down with his brother Jon and they tested the program aggressively, and revealed a ton of new information about it, much of which I hadn’t known myself. And I wrote the thing.

Experiential exercises are like peeling an onion; beneath everything we see on the surface, there’s another layer that we can learn about. Today we made a discovery; we found a bug as we transpected on the exercise, and James put a name on it.

We call it an Ellis Island bug. Ellis Island bugs are data conversion bugs, in which a program silently converts an input value into a different value. They’re named for the tendency of customs officials at Ellis Island, a little way back in history, to rename immigrants unilaterally with names that were relatively easy to spell. With an Ellis Island bug, you could reasonably expect an error on a certain input. Instead you get the program’s best guess at what you “really meant”.

There are lots of examples of this. We have an implementation of the famous triangle program, written many years ago in Delphi. The program takes three integers as input, with each number representing the length of a side of a triangle. Then the program reports on whether the triangle is scalene, isosceles, or equilateral. Here’s the line that takes the input:

function checksides (a, b, c : shortint) : string

Here, no matter what numeric value you submit, the Delphi libraries will return that number as a signed integer between -128 and 127. This leads to all kinds of amusing results: a side of length greater than 127 will invisibly be converted to a negative number, causing the program to report “not a triangle” until the number is 256 or greater; and entries like 300, 300, 44 will be interpreted as an equilateral triangle.
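To make the arithmetic concrete, here is a small Ruby sketch of the same effect. It assumes the conversion behaves like two’s-complement truncation to eight bits, which is consistent with the results described above; `to_shortint` and `check_sides` are hypothetical re-creations for illustration, not the original Delphi source.

```ruby
# Simulate storing an integer in a signed 8-bit slot (range -128..127).
# Assumption: out-of-range values wrap around, as in two's-complement truncation.
def to_shortint(n)
  ((n + 128) % 256) - 128
end

# Hypothetical classifier that, like the exercise program, only ever sees
# the values after they have been squeezed into eight bits.
def check_sides(a, b, c)
  a, b, c = [a, b, c].map { |n| to_shortint(n) }.sort
  return "not a triangle" if a <= 0 || a + b <= c
  return "equilateral" if a == c
  return "isosceles" if a == b || b == c
  "scalene"
end

puts to_shortint(127)            # 127
puts to_shortint(128)            # -128
puts to_shortint(300)            # 44
puts check_sides(200, 200, 200)  # not a triangle
puts check_sides(300, 300, 44)   # equilateral
```

Nothing in the classifier is wrong on its own terms; the surprise happens before the classifier ever sees the numbers.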

Ah, you say, but no one uses Delphi any more. So how about C? We’ve been advised forever not to trust input formatting strings, and to parse them ourselves. How about Ruby?

Ruby’s String object supplies a to_i method, which converts a string to its integer representation. Here’s what the Pickaxe says about that:

to_i str.to_i( base=10 ) → int

Returns the result of interpreting leading characters in str as an integer base base (2 to 36). Given a base of zero, to_i looks for leading 0, 0b, 0o, 0d, or 0x and sets the base accordingly. Leading spaces are ignored, and leading plus or minus signs are honored. Extraneous characters past the end of a valid number are ignored. If there is not a valid number at the start of str, 0 is returned. The method never raises an exception.

We discovered a bunch of things today as we experimented with our program. The most significant thing was the last two sentences: an invalid number is silently converted to zero, and no exception is raised!

We found the problem because we thought we were seeing a different one. Our program parses a string for three numbers. Depending upon the test that we ran, it appeared as though multiple signs were being accepted (+–+++–), but that only the first sign was being honoured. Or that only certain terms in the string tolerated multiple signs. Or that you could use multiple signs once in a string—no, twice. What the hell? All our confusion vanished when we put in some debug statements and saw invalid numbers being converted to 0, a kind of guess as to what Ruby thought you meant.

This is by design in Ruby, so some would say it’s not a bug. Yet it leaves Ruby programs spectacularly vulnerable to bugs wherein the programmer isn’t aware of the behaviour of the language. I knew about to_i’s ability to accept a parameter for a number base (someone showed it to me ages ago), but I didn’t know about the conversion-to-zero error handling. I would have expected an exception, but it doesn’t do that. It just acts like an old-fashioned customs agent: “S-C-H-U-M-A-C… What did you say? Schumacher? You mean Shoemaker, right? Let’s just make that Shoemaker. Youse’ll like that better here, trust me.”

We also discovered that the method is incorrectly documented: to_i does raise an exception if you pass it an invalid number base—37, for example.
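The behaviours described above are easy to observe directly. The snippet below is a short sketch using only String#to_i; the helper `raises_on_base?` is a hypothetical name introduced here to show the undocumented exception.

```ruby
# What String#to_i returns for a few inputs.
p "42".to_i         # 42
p "  -42abc".to_i   # -42  (leading spaces and trailing junk are ignored)
p "abc".to_i        # 0    (no valid number, so 0 comes back silently)
p "+-+3".to_i       # 0    (only one leading sign is honoured)
p "0x1A".to_i(0)    # 26   (base 0 means: work out the base from the prefix)
p "z".to_i(36)      # 35

# Hypothetical helper: does to_i raise for this number base?
def raises_on_base?(base)
  "1".to_i(base)
  false
rescue ArgumentError
  true
end

p raises_on_base?(36)  # false
p raises_on_base?(37)  # true
```

Note that "abc" and "0" are indistinguishable from the return value alone, which is exactly what makes this an Ellis Island conversion.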

There are many more stories to tell about this program; in particular, how the programmer’s knowledge is, at best, a different set compared to what empirical testing can reveal. Many of the things we’ve discovered about this trivial program could not have been caught by code review; many of them aren’t documented or are poorly documented both in the program and in the Ruby literature. We couldn’t look them up, and in many cases we couldn’t have anticipated them if they hadn’t emerged from testing.

There are other examples of Ellis Island bugs. A correspondent, Brent Lavelle, reports that he’s seen a bug in which 50,00 gets converted to 5000, even if the user is from France or Germany (in those countries, a comma rather than a period denotes the decimal, and they use spaces where we use commas).
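Here is one way a conversion like that could come about, sketched in Ruby. This is a guess at the mechanism, not a description of the product Brent saw; `naive_parse`, `parse_amount`, and the `SEPARATORS` table are hypothetical names invented for the illustration.

```ruby
# Hypothetical mechanism: code that assumes every comma is a thousands
# separator deletes the commas before converting.
def naive_parse(text)
  text.delete(",").to_f
end

# Assumed alternative: state the separators per locale instead of guessing,
# and let Float() raise if the result still isn't a valid number.
SEPARATORS = {
  "en" => { group: ",", decimal: "." },
  "fr" => { group: " ", decimal: "," }
}.freeze

def parse_amount(text, locale)
  sep = SEPARATORS.fetch(locale)
  normalized = text.delete(sep[:group]).sub(sep[:decimal], ".")
  Float(normalized)
end

p naive_parse("50,00")            # 5000.0
p parse_amount("50,00", "fr")     # 50.0
p parse_amount("5,000.00", "en")  # 5000.0
```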

Now: boundary tests may reveal some Ellis Island bugs. Other Ellis Island bugs defy boundary testing, because there’s a catch: many such tests would require you to know what the boundary is and what is supposed to happen when it is crossed. From the outside, that’s not at all clear. It’s not even clear to the programmer, when libraries are doing the work. That’s why it’s insufficient to test at the boundaries that we know about already; that’s why we must explore.

Testing and Management Parallels

February 4th, 2010

Rikard Edgren, Henrik Emilsson and Martin Jansson collaborate on a blog called thoughts from the test eye. In a satirical post from this past summer called “Scripted vs Exploratory Testing from a Managerial Perspective”, Martin proposes that “From a managerial perspective without knowing too much about testing, your sole experience comes from the scripted test environment…” But I think that from a managerial perspective, there is another place you could look to understand skilled testing: managing. I’ll follow the points in Martin’s post.

If you’re a capable manager, and you’re managing other managers, you know that there are things for which scripting doesn’t work:

Control. Managers guide the managers working under them, but everyone involved knows that managers don’t have complete control over what they’re managing. No script can capture the essence of management work. (If scripts could do that, we’d have automated management by now.) Managers know that even when they have some written guidance on how workers are to perform certain tasks, effective workers and managers alike must adapt to the situation and use their judgement. If, as a manager, you could script workers’ actions completely, they wouldn’t come to your office to ask for help, and you wouldn’t have to assist, guide, motivate, or reprimand them. You, the manager, have to observe a variety of things that cannot be anticipated, and respond to what actually happens. You might have checklists, but you don’t have a list of scripted tasks. You recognize that the end of management work for a particular project can be anticipated but not predicted with certainty. Indeed, that’s a function of the risks that you’re hired to manage and the problems you’re hired to solve. As a manager, you’re managing many things simultaneously. You have the freedom and responsibility to carry out your work in the manner you think best, and you grant similar freedom and responsibility to your people. Isn’t all that like being a tester, and like managing testers?

Hierarchy. There is a structure to management, with different roles playing their part in the system. No competent manager supervising other managers would characterize management as “some people to do the thinking and others execute”. That would suggest that some managers think and other managers execute. As a manager, you recognize that all managers worthy of the name both think and execute, with the recognition that an organization is stronger as a collaborative network. Isn’t that like being a tester, and like managing testers?

Scalability. You know that in management, you can’t easily bring in people who can execute management scripts that other managers have written. Managers need to own their processes. Getting new managers in the middle of a project would derail it, and you can’t take just anyone. Isn’t that like being a tester, and like managing testers?

Management Software. As a manager, you know that no tool, even one that costs several million dollars, can replace your judgment. At best, it collates data and can generate excellent reports, but the decision-making is yours. As a manager, you’re leery of having your work overly mediated. When you have important but mundane tasks to perform, you hand off the non-sapient parts to computing machinery, but you apply sapience to planning, designing, and programming the tools, and you apply sapience to observing the results, to determining their meaning and significance, and to your response. When you have to delegate sapient work, you know that it can’t be performed by a machine. So you hire someone (a person, not a machine) to do the work with your collaboration and guidance. Isn’t that like being a tester, and like managing testers?

Education. You look back on how you learned, and you realize that, whether you had years of schooling or learned on the job, you don’t believe in mail-order management courses, and you harbour no illusions that a two-day course accompanied by a piece of paper can teach you how to be a manager; nor can you trust that someone brandishing a similar piece of paper is ready for a management job until you know a lot more about him. Isn’t that like being a tester, and like managing testers?

What does Exploratory Testing (ET) include? Well, it’s kind of like management, isn’t it?

Flatness In Organization. Managers perform management actions as they go along. Managers do not need people to design their actions for them. Managers foster leadership by empowering people to use their skills; guiding, but not controlling; granting freedom and requiring responsibility. Isn’t that like being a tester, and like managing testers?

Chaos Can Be Tamed. You have no idea how you are going to manage, nor how the managers reporting to you will manage. You have not planned everything out in detail before you start managing; you can’t, and you know you’d be fooling yourself if you pretended to do so. You cannot report exactly how much time you need, since you don’t know everything in advance. In fact, discovering what needs to be done is a key aspect of your work. You recognize that management is a holistic process, not a linear one. You will use your skills, combined with all of the information available, to inform your decisions on time, scope, quality, innovation, skill, and learning. You will use feedback from your surroundings to gather the information you need to make decisions. Isn’t that like being a tester, and like managing testers?

Scalability. When you’re hiring people to be managers who report to you, you only want managers. If they’re not ready for that, but show promise, you’ll train and mentor them into the role. Not just anyone can be a manager. It is hard to just get anyone to help out since you cannot use just anyone from the organisation. They need to learn real management skills to be effective, which means that, among other things, they must be given the freedom to make mistakes that can be observed and corrected in an empowering, fault-tolerant environment. When looked at this way, management does scale. Isn’t that like being a tester, and like managing testers?

Skills-based Education. Multiple-choice based certification for managers is insufficient. Better: there are degree programs, and there are shorter skill-based courses that involve simulations, open discussion, and testing actual software. Good courses are valuable supplements to an environment that fosters learning and innovation; courses that teach only management nomenclature are a waste of time and money. Isn’t that like being a tester, and like managing testers?

Management Software Isn’t Management. Management isn’t done by software. Major software vendors have tools for this, but they don’t replace managers. Customer relationship management software is not customer relationship management; enterprise resource management software isn’t enterprise resource management. A real manager knows that it is what she thinks and what she does that is important; that for her real work (the analysis and decision making) her paper notepad is just as valid a tool as an Excel spreadsheet; and that no tool, no matter how big or how expensive or how powerful, is anything more than a tool. Isn’t that like being a tester, and like managing testers?

Excellent testing skill has much in common with excellent management skill. As testers, maybe we can use the similarities between them to help explain the work that we do.

Exploratory Testing IS Accountable

January 27th, 2010

In this blog post, my colleague James Bach talks about logging and its importance in support of exploratory testing. Logging takes care of one part of the accountability angle, and in an approach like session-based test management (developed by James and his brother Jon), the test notes and the debrief take care of another part of it.

Logging records what happened from the perspective of the test system. Good logging relieves the tester from having to record specific actions in detail; the machine does that. The tester is thereby free to record test notes—a running account of the tester’s ideas, questions, and results as he tested, or what happened from the perspective of the tester. Those notes form the meat of the session sheet, which also includes

  • coverage data
  • who did the testing
  • when they started
  • how long it took
  • the proportion of time spent on test design and execution, bug investigation and reporting, and setup
  • the proportion of the time spent on on-charter work vs. opportunity work
  • references to log files, data files, and related material such as scenarios, help files, specifications, standards, and so forth
  • and, of course, bugs discovered and issues identified.

After the session or at the end of the day, the tester presents a report (the session sheet combined with an oral account) in the debrief, a conversation between the tester and the test lead or test manager. In the debrief, the test lead reviews (that is, tests) the tester’s experience and his report. The question “What happened?” gets addressed; the oral and written aspects of the report get discussed and evaluated; the session charter is confirmed or revised; holes are discovered and, where needed, plugged with followup testing; bug reports get reviewed; issues get brought up; coaching happens; mentoring happens; learning happens; knowledge gets transferred. The goal here is for the tester and the test lead to be able to say, “we can vouch for what was tested”.

The session sheet is structured in such a way that it can be scanned by a text-parsing tool written in Perl. The measurements (in particular the coverage measurements) are collected and collated automatically into reports in the form of sortable HTML tables. Session sheets are kept for later review, if they’re needed.
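To give a flavour of what scanning and collating might involve, here is a rough sketch in Ruby. The sheet layout (one “KEY: value” pair per line, coverage areas separated by “|”), the field names, and the sample data are all invented for this illustration; they are not the actual session sheet format, and this is not the Perl tool.

```ruby
# Hypothetical session sheet: one "KEY: value" pair per line.
SHEET = <<~TEXT
  TESTER: Pat
  START: 2010-01-27 09:30
  DURATION: 90
  AREA: Login | Password reset
  BUGS: 2
TEXT

# Turn one sheet into a hash of field name => raw value.
def parse_sheet(text)
  text.each_line.with_object({}) do |line, fields|
    key, value = line.strip.split(/:\s*/, 2)
    fields[key] = value if key && value
  end
end

# Collate coverage across sheets: how many sessions touched each area.
def coverage_counts(sheets)
  sheets.each_with_object(Hash.new(0)) do |sheet, counts|
    areas = parse_sheet(sheet).fetch("AREA", "").split(/\s*\|\s*/)
    areas.each { |area| counts[area] += 1 }
  end
end

p parse_sheet(SHEET)
p coverage_counts([SHEET, SHEET])
```

The point of a rigid layout is that the sheet stays readable to people while still being regular enough for a tool to total up.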

If logging in the program isn’t available right away, screen recording tools (like BB Test Assistant, Camtasia, Spector, …) can provide a retrospective account of what happened. (An over-the-shoulder video camera works too.) Note that these tools simply record video (and, optionally, sound—which is good for narration). Programmatic repetition of the session isn’t the point. Nor is the point to have a supervisor review the screen capture obsessively; that wastes time, and besides, nobody likes working for Big Brother. The idea is to use the video only when necessary—to aid in recollection where it’s needed, and to help in troubleshooting hard-to-reproduce bugs.

We suggest, where it doesn’t get in the way, taking the test notes on the same machine as the application under test, and using the text editor window popping up as a way to link the execution of the application with bugs, test ideas or questions. For bugs that don’t appear to be state-critical you can also take very brief notes for later followup. Include a time stamp, where the time stamp is an index into the recording; then revisit the recording later if more detail is called for. (In Notepad, you can press F5; in TextPad, Edit/Insert/Time, and it’s macroable; other text editors almost certainly have a similar feature.)

Between a charter, the session sheet, the oral report, data files, and the logs and the debrief, it’s hard for me to imagine a more accountable way of working. Each aspect of the reporting structure reinforces the others. This is why I get confused when test managers talk about exploratory testing being “unaccountable” or “unmanageable” or “unstructured”: when I ask them what accountability and management mean to them, they point lamely to a pile of scripts or spreadsheets full of overspecified actions that were written weeks or months before the software was built, or they mumble something about not knowing what goes on in a tester’s head.

Any testing approach is manageable when you choose to manage it. If you want structure, think about what you mean (maybe this guide to the structures of exploratory testing will help), identify the structures that are important to you, and develop those structures in your testers, in your team, and in your approaches. If you want accountability, provide structures for it (like session-based test management), and then require accountability. If you find that your testers aren’t sufficiently skilled, train them and mentor them. (And if you don’t know how to do that rapidly and effectively, we can help you.)

If there’s something you don’t like about the results you’re getting, manage: observe what’s going on in your system of testing, and put in a control action where you want to change something. If you want to know what’s going on in a tester’s head, observe her directly and interview her as she’s testing; have her pair with another tester or a test lead; critique her notes; debrief her and coach her, until you get the results that you seek. If you want to supercharge the efficiency of your testers, work with the programmers and their managers to focus on testability, with special attention paid to scriptable interfaces, logging, and at least some programmer testing. (It might help to identify the information-hiding and feedback-loop-lengthening costs of the absence of testability). If you find individual debriefs taking too long, or if you want to share information more broadly within the test team, try group debriefs at the end of one day or the beginning of the next. If you want to add features to the reporting protocol, add them; if you want to drop them, drop them. Experiment, re-evaluate, and tune your testing as you see fit.

And if you have a more manageable and accountable approach than this for fostering the discovery of important problems in the product, please let us know (me, or James, or Jon). We’d really like to hear about it.

Disposable Time

January 17th, 2010

In our Rapid Testing class, James Bach and I like to talk about an underappreciated tester resource: disposable time. Disposable time is the time that you can afford to waste without getting into trouble.

Now, we want to be careful about what we mean by “waste”, here. It’s not that you want to waste the time. You probably want to spend it wisely. It’s just that you won’t suffer harm if you do happen to waste it. Disposable time is to your working hours what disposable income is to your total personal income. (In fact, even that’s not quite correct, strictly speaking; we actually mean discretionary income: the money that’s left over after you’ve paid for all of the things that you must pay for—food, shelter, basic clothing, medical, and tax expenses. The money that people call disposable income is more properly called discretionary income; as Wikipedia says, “the amount of ‘play money’ left to spend or save.” Oh well. We’ll go with the incorrect but popular interpretation of “disposable” here.)

You’re never being scrutinized every minute of every day. Practically everyone has a few moments when no one important is watching. In that time, you might

  • try a tiny test that hasn’t been prescribed.
  • try putting in a risky value instead of a safe value.
  • pretend to change your mind, or to make a mistake, and go back a step or two; users make mistakes, and error handling and recovery are often the most vulnerable parts of the program.
  • take a couple of moments to glance at some background information relevant to the work that you’re doing.
  • write in your journal.
  • see if any of your colleagues in technical support have a hot issue that can inform some test ideas.
  • steal a couple of moments to write a tiny, simple program that will save you some time; use the saved time and the learning to extend your programming skills so that you can solve increasingly complex programming problems.
  • spend an extra couple of minutes at the end of a coffee break befriending the network support people.
  • sketch a workflow diagram for your product, and at some point show it to an expert, and ask if you’ve got it right.
  • snoop around in the support logs for the product.
  • add a few more lines to a spreadsheet of data values.
  • help someone else solve a problem that they’re having.
  • chat with a programmer about some aspect of the technology.
  • even if you do nothing else, at least pause and look around the screen as you’re testing. Take a moment or two to recognize a new risk and write down a new question or a new test idea. Report on that idea later on; ask your test lead, your manager, or a programmer, or a product owner if it’s a risk worth investigating. Hang on to your notes. When someone asks “Why didn’t you find that bug,” you may have an answer for them.

If it turns out that you’ve made a bad investment, oh well. By definition, however large or small the period, disposable time is time that you can afford to blow without suffering consequences.

On the other hand, you may have made a good investment. You may have found a bug, or recognized a new risk, or learned something important, or helped someone out of a jam, or built on a professional relationship, or surprised and impressed your manager. You may have done all of these things at once. Even if you feel like you’ve wasted your time, you’ve probably learned enough to insulate yourself from wasting more time in the same way. When you discover that an alley is blind, you’re unlikely to return there when there are other things to explore.

In The Black Swan, Nassim Nicholas Taleb proposes an investment strategy wherein you put the vast bulk of your money, your nest egg, in very safe securities. You then invest a small amount—an amount that you can afford to lose—in very speculative bets that have a chance of providing a spectacular return. He calls that very improbable high-return event a positive Black Swan. Your nest egg is like the part of your job that you must accomplish. Disposable time is like your Black Swan fund; you may lose it all, but you have a shot at a big payoff. But there’s an important difference, too: since learning is an almost inevitable product of using your disposable time, there’s almost always some modest positive outcome.

We encourage test managers to allow disposable time explicitly for their testers. As an example, Google provides its staff with Innovation Time Off. Engineers are encouraged to spend 20% of their time pursuing projects that interest them. That sounds like a waste, until one learns that Google projects like Gmail, Google News, Orkut, and AdSense came of these investments.

What Google may not know is that even within the other 80% of the time that’s ostensibly on mission, people still have, and are still using, non-explicit disposable time. People have that almost everywhere, whether they have explicit disposable time or not.

If you’re working in an environment where you’re being watched so closely that none of this is possible, and where you’re punished for learning or seeking problems, my advice is to make sure that slavery has been abolished in your jurisdiction. Then find a job where your testing skills are valued and your managers aren’t wasting their time by watching your work instead of doing theirs. But when you’ve got a few moments to fill, fill them and learn something!

Defect Detection Efficiency: An Evaluation of a Research Study

January 8th, 2010

Over the last several months, B.J. Rollison has been delivering presentations and writing articles and blog posts in which he cites a paper Defect Detection Efficiency: Test Case Based vs. Exploratory Testing, [DDE2007] by Juha Itkonen, Mika V. Mäntylä and Casper Lassenius (First International Symposium on Empirical Software Engineering and Measurement, pp. 61-70; the paper can be found here).


I appreciate the authors’ intentions in examining the efficiency of exploratory testing.  That said, the study and the paper that describes it have some pretty serious problems.

Some Background on Exploratory Testing

It is common for people writing about exploratory testing to consider it a technique, rather than an approach. “Exploratory” and “scripted” are opposite poles on a continuum. At one pole, exploratory testing integrates test design, test execution, result interpretation, and learning into a single person at the same time.  At the other, scripted testing separates test design and test execution by time, and typically (although not always) by tester, and mediates information about the designer’s intentions by way of a document or a program. As James Bach has recently pointed out, the exploratory and scripted poles are like “hot” and “cold”.  Just as there can be warmer or cooler water, there are intermediate gradations to testing approaches. The extent to which an approach is exploratory is the extent to which the tester, rather than the script, is in immediate control of the activity.  A strongly scripted approach is one in which ideas from someone else, or ideas from some point in the past, govern the tester’s actions. Test execution can be very scripted, as when the tester is given an explicit set of steps to follow and observations to make; somewhat scripted, as when the tester is given explicit instruction but is welcome or encouraged to deviate from it; or very exploratory, in which the tester is given a mission or charter, and is mandated to use whatever information and ideas are available, even those that have been discovered in the present moment.

Yet the approaches can be blended.  James points out that the distinguishing attribute in exploratory and scripted approaches is the presence or absence of loops.  The most extreme scripted testing would follow a strictly linear approach; design would be done at the beginning of the project; design would be followed by execution; tests would be performed in a prescribed order; later cycles of testing would use exactly the same tests for regression.

Let’s get more realistic, though.  Consider a tester with a list of tests to perform, each using a data-focused automated script to address a particular test idea.  A tester using a highly scripted approach would run that script, observe and record the result, and move on to the next test.  A tester using a more exploratory approach would use the list as a point of departure, but upon observing an interesting result might choose to perform a different test from the next one on the list; to alter the data and re-run the test; to modify the automated script; or to abandon that list of tests in favour of another one.  That is, the tester’s actions in the moment would not be directed by earlier ideas, but would be informed by them. Scripted approaches set out the ideas in advance, and when new information arrives, there’s a longer loop between discovery and the incorporation of that new information into the testing cycle.  The more exploratory the approach, the shorter the loop.  Exploratory approaches do not preclude the use of prepared test ideas, although both James and I would argue that our craft, in general, places excessive emphasis on test cases and focusing techniques at the expense of more general heuristics and defocusing techniques.

The point of all this is that neither exploratory testing nor scripted approaches are testing techniques, nor bodies of testing techniques.  They’re approaches that can be applied to any testing technique.

To be fair to the authors of [DDE2007], since publication of their paper there has been ongoing progress in the way that many people—in particular Cem Kaner, James Bach, and I—articulate these ideas, but the fundamental notions haven’t changed significantly.

Literature Review

While the authors do cite several papers on testing and test design techniques, they do not cite some of the more important and relevant publications on the exploratory side.  Examples of such literature include “Measuring the Effectiveness of Software Testers” (Kaner, 2003; slightly updated in 2006); “Software engineering metrics: What do they measure and how do we know?” (Kaner & Bond, 2004); “Inefficiency and Ineffectiveness of Software Testing: A Key Problem in Software Engineering” (Kaner, 2006; to be fair to the authors, this paper may have been published too late to inform [DDE2007]); General Functionality and Stability Test Procedure (for Microsoft Windows 2000 Application Certification) (Bach, 2000); Satisfice Heuristic Test Strategy Model (Bach, 2000); and How To Break Software (Whittaker, 2002).

The authors of [DDE2007] appear also to have omitted literature on the subject of exploration and its role in learning. Yet there is significant material on the subject, in both popular and more academic literature.  Examples here include Collaborative Discovery in a Scientific Domain (Okada and Simon; note that the subjects are testing software); Exploring Science: The Cognition and Development of Discovery Processes (David Klahr and Herbert Simon); Plans and Situated Actions (Lucy Suchman); Play as Exploratory Learning (Mary Reilly); How to Solve It (George Polya); Simple Heuristics That Make Us Smart (Gerd Gigerenzer); Sensemaking in Organizations (Karl Weick); Cognition in the Wild (Edward Hutchins); The Social Life of Information (Paul Duguid and John Seely Brown); Sciences of the Artificial (Herbert Simon); all the way back to A System of Logic, Ratiocinative and Inductive (John Stuart Mill, 1843).

These omissions are reflected in the study and the analysis of the experiment, and that leads to a common problem in such studies: heuristics and other important cognitive structures in exploration are treated as mysterious and unknowable.  For example, the authors say, “For the exploratory testing sessions we cannot determine if the subjects used the same testing principles that they used for designing the documented test cases or if they explored the functionality in pure ad-hoc manner. For this reason it is safer to assume the ad-hoc manner to hold true.”  [DDE2007, p. 69]  Why assume?  At the very least, one could observe the subjects and debrief them, asking about their approaches.  In fact, this is exactly the role that the test lead fulfills in the practice of skilled exploratory testing.  And why describe the principles only as “ad-hoc”?  It’s not like the principles can’t be articulated. I talk about oracle heuristics in this article, and talk about stopping heuristics here; Kaner’s Black Box Software Testing course talks about test design heuristics; James Bach’s work talks about test strategy heuristics (especially here); James Whittaker’s books talk about heuristics for finding vulnerabilities…

Tester Experience

The study was performed using testers who were, in the main, novices.  “27 subjects had no previous experience in software engineering and 63 had no previous experience in testing. 8 subjects had one year and 4 subjects had two years testing experience. Only four subjects reported having some sort of training in software testing prior to taking the course.”  ([DDE2007], p. 65 my emphasis)  Testing—especially testing using an exploratory approach—is a complex cognitive activity.  If one were to perform a study on novice jugglers, one would likely find that they drop an approximately equal number of objects, whether they were juggling balls or knives.

Tester Training

The paper notes that “subjects were trained to use the test case design techniques before the experiment.” However, the paper does not make note of any specific training in heuristics or exploratory approaches.  That might not be surprising in light of the weaknesses on the exploratory side of the literature review.  My experience, that of James Bach, and anecdotal reports from our clients suggest that even a brief training session can greatly increase the effectiveness of an exploratory approach.

Cycles of Testing

Testing happens in cycles.  In strongly scripted testing, the process tends to the linear.  All tests are designed up front; then those tests are executed; then testing for that area is deemed to be done.  In subsequent cycles, the intention is to repeat the original tests to make sure that bugs are fixed and to check for regression.  By contrast, exploratory testing is an organic and iterative process.  In an exploratory approach, the same area might be visited several times, such that learning from early “reconnaissance” sessions informs further exploration in subsequent “deep coverage” sessions.  The learning from those (and from ideas about bugs that have been found and fixed) informs “wrap-up sessions”, in which tests may be repeated, varied, or cut from new cloth.  In the study, no allowance is made for information and learning obtained during one round of testing to inform later rounds.  Yet such information and learning is typically of great value.

Quantitative vs. Qualitative Analysis

In the study, there is a great deal of emphasis placed on quantifying results, on experimental and on mathematical rigour.  However, such rigour may be misplaced when the products of testing are qualitative, rather than quantitative.

Finding bugs is important, finding many bugs is important, and finding important bugs is especially important. Yet bugs and bug reports are by no means the only products of testing.  The study largely ignores the other forms of information that testing may provide.

  • The tester might learn something about test design, and feed that learning into her approach toward test execution, or vice versa. The value of that learning might be realized immediately (as in an exploratory approach) or over time (as in a scripted approach).
  • The tester, upon executing a test, might recognize a new risk or missing coverage. That recognition might inform ideas about the design and choices of subsequent tests.  In a scripted approach, that’s a relatively long loop.  In an exploratory approach, upon noticing a new risk, the tester might choose to note findings for later on.  On the other hand, the discovery could be cashed immediately:  she  might choose to repeat the test, she might perform a variation on the same test, or might alter her strategy to follow a different line of investigation.  Compared to a scripted approach, the feedback loop between discovery and subsequent action is far shorter.  The study ignores the length of the feedback loops.
  • In addition to discovering bugs that threaten the value of the product, the tester might discover issues—problems that threaten the value of the testing effort or the development project overall.
  • The tester who takes an exploratory approach may choose to investigate a bug or an issue that she has found.  This may reduce the total bug count, but in some contexts may be very important to the tester’s client.  In such cases, the quality of the investigation, rather than the number of bugs found, would be important.

More work products from testing can be found here.

“Efficiency” vs. “Effectiveness”

The study takes a very parsimonious view of “efficiency”, and further confuses “efficiency” with “effectiveness”.  Two tests are equally effective if they produce the same effects. The discovery of a bug is certainly an important effect of a test.  Yet there are other important effects too, as noted above, but they’re not considered in the study.

However, even if we decide that bug-finding is the only worthwhile effect of a test, two equally effective tests might not be equally efficient.  I would argue that efficiency is a relationship between effectiveness and cost.  An activity is more efficient if it has the same effectiveness at lower cost in terms of time, money, or resources.  This leads to what is by far the most serious problem in the paper…

Script Preparation Time Is Ignored

The authors’ evaluation of “efficiency” leaves out the preparation time for the scripted tests! The paper says that the exploratory testing sessions took 90 minutes for design, preparation, and execution. The preparation for the scripted tests took seven hours, while the scripted test execution sessions took 90 minutes, for a total of 8.5 hours.  This fact is not highlighted; indeed, it is not mentioned until the eighth of ten pages (page 68).  In journalism, that would be called burying the lead.  In terms of bug-finding alone, the authors suggest that the results were of equivalent effectiveness, yet the scripted approach took, in total, about 5.7 times as long as the exploratory approach (8.5 hours against 1.5). What other problems could the exploratory testing approaches find given seven additional hours?
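For anyone who wants to check the arithmetic, here is a minimal sketch that uses only the figures quoted above. The variable names are mine, not the paper’s, and the calculation assumes that the 90-minute exploratory sessions really did include all design and preparation, as the paper reports.

```python
# Figures as reported above: 90 minutes of exploratory testing in total,
# versus 7 hours of preparation plus 90 minutes of execution for scripted tests.
exploratory_hours = 1.5
scripted_prep_hours = 7.0
scripted_exec_hours = 1.5

# Total cost of the scripted approach, and how it compares to the exploratory one.
scripted_total_hours = scripted_prep_hours + scripted_exec_hours
ratio = scripted_total_hours / exploratory_hours
extra_hours = scripted_total_hours - exploratory_hours

print(f"scripted total: {scripted_total_hours} hours")
print(f"scripted / exploratory: {ratio:.2f}")
print(f"additional hours spent on the scripted approach: {extra_hours}")
```

If the two approaches really were equally effective at finding bugs, this ratio is the whole efficiency story as far as the study’s own measure goes.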

Conclusions

The authors offer these four conclusions at the end of the paper:

“First, we identify a lack of research on manual test execution from other than the test case design point of view. It is obvious that focusing only on test case design techniques does not cover many important aspects that affect manual testing. Second, our data showed no benefit in terms of defect detection efficiency of using predesigned test cases in comparison to an exploratory testing approach. Third, there appears to be no big differences in the detected defect types, severities, and in detection difficulty. Fourth, our data indicates that test case based testing produces more false defect reports.”

I would offer to add a few other conclusions.  The first is from the authors themselves, but is buried on page 68:  “Based on the results of this study, we can conclude that an exploratory approach could be efficient, especially considering the average 7 hours of effort the subjects used for test case design activities.”  Or, put another way,

  • During test execution
  • unskilled testers found the same number of problems, irrespective of the approach that they took, but
  • preparation of scripted tests increased testing time approximately by a factor of five
  • and appeared to add no significant value.

Now:  as much as I would like to cite this study as a significant win for exploratory testing, I can’t.  There are too many problems with it.  There’s not much value in comparing two approaches when those approaches are taken by unskilled and untrained people.  The study is heavy on data but light on information. There are no details about the bugs that were found and missed using each approach.  There’s no description of the testers’ activities or thought processes; just the output numbers.  There is the potential for interesting, rich stories on which bugs were found and which bugs were missed by which approaches, but such stories are absent from the paper.  Testing is a qualitative evaluation of a product; this study is a quantitative evaluation of testing.  Valuable information is lost thereby.

The authors say, “We could not analyze how good test case designers our subjects were and how much the quality of the test cases affected the results and how much the actual test execution approach.”  Actually, they could have analyzed that.  It’s just that they didn’t.  Pity.

Handling an Overstructured Mission

December 26th, 2009

Excellent testers recognize that excellent testing is not merely a process of confirmation, verification, and validation.  Excellent testing is a process of exploration, discovery, investigation, and learning.

A correspondent that I consider to be an excellent tester (let’s call him Al) works in an environment where he is obliged by his managers to execute overly structured, highly confirmatory scripted tests. Al wrote to me recently, saying that he now realizes why that’s frustrating for him:  every time he runs through a scripted test, he gets five new ideas that he wants to act upon. I think that’s a wonderful thing, but when he acts on those ideas and fulfills his implicit mission (finding important problems in the product), it diverts him from his explicit mission (to complete some number of scripted tests per day), and he gets heat from his manager about that.  At the end of a couple of days, the manager wants to know why Al is behind schedule—even if Al has revealed important problems along the way—because the manager is focused on test effort in terms of test cases completed, rather than test ideas explored.

I suggested to Al (as I suggest to you, if you’re in that kind of situation) a workaround:  don’t act on the new test ideas; but do note them.  Jot them down in handwritten notes or a text file, and especially note your motivation for them—ideas about risk, coverage, oracles, strategies, and the like. Tell your test manager or test lead that you didn’t run tests associated with those ideas, and then ask, “Are you okay with us NOT running them?”

In addition, check in with your manager more often than once every two days. Deliver a report, including new ideas, at one- to two-hour intervals.  If direct personal contact isn’t available, try instant messages or email. If those don’t work, batch them, but note the time at which you started and/or stopped a burst of testing activity.
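If a text file suits you better than handwritten notes, a tiny program can take care of the timestamps. What follows is only a sketch of one way to do it: the function name, the file name, and the line layout are all hypothetical, so change them to suit your own reporting protocol.

```python
from datetime import datetime
from pathlib import Path


def note_idea(idea, motivation, notes_file="test-ideas.txt", now=None):
    """Append one timestamped, not-yet-run test idea and its motivation to a text file.

    The file name and the line layout are illustrative assumptions,
    not part of any prescribed reporting format.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    line = f"{stamp} | NOT RUN | {idea} | why: {motivation}\n"
    with Path(notes_file).open("a", encoding="utf-8") as f:
        f.write(line)
    return line


if __name__ == "__main__":
    # Example: an idea that came up while running a scripted test.
    note_idea(
        "Paste a very long string into the name field",
        "risk: input length handling; coverage: field limits",
    )
```

At report time, the file gives you the list to put in front of your manager when you ask whether it’s okay NOT to run those tests.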

Al was excited about that.  “Wow!” he said.  “That also means defects arising from the new ideas are noted down. Currently, my management is under the impression that test cases are the things that reveal problems, but it’s my acting on my test ideas that really reveals the problems.”  He also noted, “There’s another bad that comes from that.  If the test cases don’t reveal problems, we take the problems that we’ve found and create a test case for them so that those problems aren’t missed next time.”  I’ve seen that happen a lot, too.  On the face of it, it doesn’t sound like a bad idea—except that specific problems that are fixed and verified tend to remain fixed.  Repeating those tests is an opportunity cost against new tests that would reveal previously undiscovered problems.

So:  the idea here is to make certain aspects of our work visible.  Scripted test cases often reveal problems as those cases are developed.  When those problems get fixed, the script loses its power.  Thus variation on the script, rather than following the script rigorously, tends to reveal the actual problem.  However, unless we’re clear that this is happening, managers will mistakenly give credit to the wrong thing—namely, the script—rather than to the mindset and the skill set of the tester.

Selena Delesie on Exploratory Test Chartering

December 18th, 2009

A little while ago, I mentioned that I’d be writing more about session-based test management (SBTM). For me, one thing that’s great about having a community of students and colleagues is that they can save me lots of time and work.

Selena Delesie took the Rapid Software Testing course from me a few years back (that is, she was a student). Since then, she has taken Rapid Testing and its practices, including SBTM, and made them her own. This is exactly what James Bach and I aim for.  We want to help testers, test leads, and managers realize that the most important factor in excellent testing, bar none, is the mindset and the skill set of the individual tester.  This means taking the ideas in the course and internalizing them, adopting them, developing them, experimenting with them, altering them to fit your context.  We get people started by making them feel powerful, mostly by helping them to recognize the power and skills that they already have. Then, after the class, they can feel confident in doing the heavy lifting on their own. Selena is by no means our only student who has done that, but she’s a paradigmatic example of what’s possible.

This post from her blog is a nice account of her appreciation of exploratory testing and of her career growth. That on its own would be good enough, but she’s now blogged a post on chartering sessions, and it’s excellent.  It identifies some of the common traps and misconceptions about chartering, and provides some sharp advice on how to avoid them. It talks not merely about how to charter, but how to do it in a way that affords the tester the freedom and responsibility to do his or her best work. Highest recommendation.

Structures of Exploratory Testing: Resources

December 14th, 2009

In a Webinar that he did for uTest on December 10, James Whittaker mused aloud about what a great idea it would be to structure exploratory testing and capture ideas about it in a repository for sharing with others. It seems to me that one ideal version of that would take the form of a bibliography in a book about exploratory testing, but apparently that’s not available. Yet I digress.

The fact is, people have been doing exactly that for years. And I do like the idea of having a repository and sharing, so here’s a survey of some exploratory testing structures and some writing about them that I hope people will find helpful. There are some excellent books out there, but for now, these ones are all online and free. Expect updates.

  • Evolving Work Products, Skills and Tactics, ET Polarities, and Test Strategy. James Bach, Jon Bach, and I authored the latest version of the Exploratory Skills and Dynamics list. This is a kind of evolving master list of exploratory testing structures. James describes it here.
  • Oracles. The HICCUPPS consistency heuristics, which James Bach initiated and which I wrote about in this article for Better Software in 2005. (Actually, at the time it was only HICCUPP—History, Image, Comparable Products, Claims, User Expectations, Purpose, Product—but since then we’ve also added S, for Standards and Statutes.) Mike Kelly also talks about HICCUPP here.
  • Test Strategy. James Bach’s Heuristic Test Strategy Model isn’t restricted to exploratory approaches, but certainly helps to guide and structure them.
  • Data Type Attacks, Web Tests, Testing Wisdom, Heuristics, and Frameworks. Elisabeth Hendrickson’s Test Heuristics Cheat Sheet is a rich set of guideword heuristics and helpful reference information.
  • Context Factors, Information Objectives. Cem Kaner most recently delivered his Tutorial on Exploratory Testing for the QAI Quest Conference in Chicago, 2008. There’s a similar, but not identical talk here.
  • Quick Tests. In our Rapid Software Testing course, James Bach and I talk about quick tests. The course notes are available for free. Fire up Acrobat and search for “Quick Tests”.
  • Coverage (specific). Michael Hunter’s You Are Not Done Yet is a detailed set of coverage ideas to help prompt further exploration when you think you’re done.
  • Coverage (general). James Bach wrote this article in 2001, in which he summarizes test coverage ideas under the mnemonic “San Francisco Depot”—Structure, Function, Data, Platform, and Operations. Several years later, I convinced him to add an element to the list, so now it’s “San Francisco DePOT”. The last T is for…
  • Time. I realized a few years ago that some guideword heuristics might help us to pay attention to the ways in which products relate to time, and vice versa. That turned into a Better Software article called “Time for New Test Ideas”.
  • Tours. Mike Kelly’s FCC CUTS VIDS Touring Heuristics (note the date) provides a set of structured approaches for touring the application. 
  • Stopping Heuristics. There are structures to deciding when to stop a given test, a line of investigation, or a test cycle. I catalogued them here, and Cem Kaner made a vital addition here.
  • Accountability, Reporting Progress. James and Jon Bach’s description of Session-Based Test Management is a set of structures for making exploratory testing sessions more accountable.
  • Procedure. The General Functionality and Stability Test Procedure. It was designed for Microsoft in the late 1990s by James Bach, and may be the first documented procedure to guide exploratory test execution and investigation.
  • Emotions. I gave a talk on emotions as powerful pointers to test oracles at STAR West in 2007. That helped to inspire some ideas about…
  • Noticing, Observation. At STAR East 2009, I did a keynote talk on noticing, which can be important for exploratory test execution. The talk introduces a number of areas in which we might notice, and some patterns to sharpen noticing.
  • Leadership. For the 2009 QAI Conference in Bangalore, India, I did a plenary talk in which I noted several important structural similarities between exploratory testing and leadership.

So, there it is: a repository. I’ll eventually reproduce it as part of the resources page on my Web site. Feel free to share; comments and suggestions for additions are welcome.