Monday, January 26, 2009
Repeatability and Adaptability
Having a little time off can create a lot more work. This stuff has me sufficiently excited that I'm finding it difficult to accomplish any of the mandated writing I have to do, but we'll let the fieldstones fall where they may.
Tectonic forces are building up due to friction between two plates. On the one hand, many people in the software development and testing business seem obsessed with the need to reduce variation, to focus on repeatability, to confirm the things that we know. These are admirable and important goals, and we'd have a tough time if we ignored them. On the other hand, many people—I'm one—while honouring the importance of the confirmatory mode, are more concerned with the need to examine adaptability, to recognize the limitations of repeatability, and to explore with the goal of extending the boundaries of what we know.
I'll have more to say about all this in the days ahead (let's face it; it'll probably take years), but today I was browsing General Principles of System Design (formerly titled On The Design of Stable Systems), and found this gem, The Fundamental Regulator Paradox:
The task of a regulator is to eliminate variation, but this variation is the ultimate source of information about the quality of its work. Therefore, the better job a regulator does, the less information it gets about how to improve.
Put into more memorable words, the Fundamental Regulator Paradox says: Better regulation today risks worse regulation tomorrow.
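To make the paradox a little more concrete, here is a minimal sketch of my own, not anything from Weinberg's book. It assumes a toy regulator that cancels a fixed fraction of each disturbance; the name `residual_variation` and the `effectiveness` parameter are hypothetical. The variation left over is the only signal the regulator has about its own work, and it shrinks toward nothing as the regulator improves.

```python
import random
from statistics import pvariance


def residual_variation(effectiveness, disturbances):
    """Variance remaining after a toy regulator cancels a fraction of each disturbance.

    effectiveness is an assumed parameter: 0.0 cancels nothing, 1.0 cancels everything.
    """
    residuals = [(1.0 - effectiveness) * d for d in disturbances]
    return pvariance(residuals)


# Simulated disturbances; the seed is fixed only so that runs are repeatable.
rng = random.Random(0)
disturbances = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for effectiveness in (0.0, 0.5, 0.9, 1.0):
    # The better the regulation, the less variation remains to learn from.
    print(effectiveness, residual_variation(effectiveness, disturbances))
```

At an effectiveness of 1.0 the remaining variation is zero, and so is the information available for improving the regulator.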
This sums up why you can't get through to anyone important at the big telcos by phoning; why they don't publish their phone numbers online, or why they bury them in the Web site; why the corporate immune system exists. It helps to explain how the very largest financial institutions proved to be highly vulnerable to huge losses. It helps to explain how governments that suppress dissent inevitably fall. These systems don't want to be disturbed, and the easiest way to do that is to reject information of all kinds. Mark Federman wrote a wonderful paper called Listening to the Voice of the Customer, which is exactly about all that stuff.
Recently I was polled on my opinion about a one-day power outage that happened in our neighbourhood. The poll questions and the format for answering them were extremely restrictive, designed to simplify rich stories and detailed information into data: groupings of responses ranging from very satisfied (7) to very dissatisfied (1). I'm sure that this made the poll results more digestible for the utility company's managers, but by the time everything had been sifted into a one-to-seven value, any human dimension that might compel action would have been removed. That would include stories about seniors stranded without heat, lights, water, or elevators in 17-storey apartment buildings on the coldest day of the year, or the owners of small grocery stores that lost thousands of dollars because the fridges warmed up for lack of electricity before the building cooled down for lack of heat. And because the poll was designed to limit variability in the answers, I grew sufficiently frustrated to give up a few questions in. Thus the utility company ended up hearing nothing at all from me, just as The Fundamental Regulator Paradox would suggest.
So here are my questions: is your testing priority to make things repeatable, or is it to elicit new information? Is your job to reproduce known results, or to test for adaptability? And one that's a little more sobering, perhaps: to what extent does your current testing process reject information, rather than seeking it out? Do you let your program "speak its mind" to you by interviewing it and having a conversation with it? Or do you have a set of standard multiple choice questions that you want it to answer in a highly constrained way?
Thursday, January 22, 2009
Goin' to Carolina
The Triangle Information Systems Quality Association (TISQA) will present Agile Testing In The Carolinas, March 16th and 17th, 2009, at The Friday Center, Chapel Hill, North Carolina. The first day is a day of keynotes, conference sessions, and networking; the second is dedicated to half- and full-day workshops. Featured speakers include Shaun Bradshaw, T.R. Buskirk, Lisa Crispin, Howard Deiner, Bob Galen, Janet Gregory, Bill Loeb, Rad Rouzky, Megan Sumrell, Rob Walsh, C. Wathington, Laurie Williams, and Tao Xie.
I'll be giving my keynote called Two Futures of Software Testing and a track session on The Metrics Minefield on Monday. On Tuesday, I'll be presenting a one-day Exploratory Testing Masterclass (please bring a laptop!).
You can download a conference brochure, register for the conference, register for the workshops, and see more at the TISQA Web site.
Monday, January 19, 2009
Meaningful Metrics
Over the years, I can remember working with exactly one organization that used my idea of an excellent approach to software engineering metrics. Their approach was based on several points:
- They treated metrics as first-order approximations, and recognized that they were fundamentally limited and fallible.
- They used the metrics for estimating, rather than for predicting. When their estimations didn't work out, they didn't use the discrepancy to punish people. They used it to try to understand what they hadn't understood about the task in the first place.
- They used inquiry metrics, rather than control metrics. They used the metrics to prompt questions about their assumptions, rather than to provide answers or drive their work.
- They used a large number of observational modes to manage their business and to evaluate (and solve) their problems. Most importantly, the managers observed people and what they did, rather than watching printed reports. They used close personal supervision, collaboration, and conversation as their primary approach to learning about what was happening on the project. They watched the game, rather than the box scores.
- They didn't collect any metrics on things that weren't interesting and important to them.
- They didn't waste time collecting or managing the metrics.
- They had no interest in making the metrics look good. They were interested in optimizing the quality of the work, not in the appearance afforded by the metrics.
- They took a social sciences approach to measurement, as Cem Kaner describes the social sciences here (in particular on page 3 of the slides). Rather than assuming that metrics gave them complete and accurate answers, they assumed that the metrics were giving them partial answers that might be useful.
In summary, they viewed metrics in the same kind of way as excellent testers view testing: with skepticism (that is, not rejecting belief but rejecting certainty), with open-mindedness, and with awareness of the capacity to be fooled. Their metrics were (are) heuristics, which they used in combination with dozens of other heuristics to help in observing and managing their projects.
The software development and testing business seems to have a very poor understanding of measurement theory and metrics-related pitfalls, so conversations about metrics are often frustrating for me. People assume that I don't like measurement of any kind. Not true; the issue is that I don't like bogus measurement, and there's an overwhelming amount of it out there.
So, to move the conversation along, I'll suggest that anyone who wants to have a reasonable discussion with me on metrics should read and reflect deeply upon Software Engineering Metrics: What Do They Measure and How Do We Know? (Kaner and Bond), and then explain how their metrics don't run afoul of the problems very clearly identified in the paper. It's not a long paper. It's written by academics but, mirabile dictu, it's as clear and readable as a newspaper article (for example, it doesn't use pompous Latin expressions like mirabile dictu).
Here are some more important references:
- The Dark Side of Software Metrics (.pdf, Hoffman)
- Meaningful Metrics (.pdf, Allison)
- How to Lie With Statistics (book, Huff)
- Measuring and Managing Performance in Organizations (book, Austin)
- Quality Software Management, Vol. 2: First-Order Measurement (book, Weinberg)
- Why Does Software Cost So Much? (book, DeMarco)
Show me metrics that have been thoughtfully conceived, reliably obtained, carefully and critically reviewed, and that avoid the problems identified in these works, and I'll buy into the metrics. Otherwise I'll point out the risks, or recommend that they be trashed. As James Bach says, "Helping to mislead our clients is not a service that we offer."
Sunday, January 18, 2009
Barber's Children Now Have Haircuts
Thank you, Mary!
Saturday, January 17, 2009
Metaphor: Silver Bullets
The problem is not that there are no silver bullets. There are silver bullets—or if there aren't any, you could make them fairly straightforwardly. The problems are
- Silver bullets are expensive, especially considering...
- There are no vampires.
Ideas Around Bug Clusters
Note that, as James Bach says in his forthcoming book, skepticism is not the rejection of belief; it's the rejection of certainty. My uncertainty with respect to bug clusters revolves around oversimplification of the notion. I think the idea as Erik explains it, while potentially powerful, begs important questions around modeling and factoring, and around cost versus value. My goal is to question the heuristic, as James and Pat Schroeder did with pairwise testing, not to dismiss it out of hand, but to arrive at a deeper understanding of how it might be useful and where it might be dangerous. In particular, I want to deprecate any notion that bug clusters might be a silver bullet, as Erik suggests, and I hope that when I'm done, he'll appreciate the critique and respond to it. (That's how things work in our community: Friends disagree. It's important and okay to disagree. We try to work it out, and it's nice if we do but okay if we don't.)
First, modeling. Myers suggested that certain parts of the program tended to be far more problematic than others. The "parts" in question were, to him, areas of the program's source code. To many people, the program is the source code. But consider Kaner's definition of a computer program, from his talk Software Testing as a Social Science. He says that a computer program is "A communication among several humans and computers who are distributed over space and time that contains instructions that can be run on a computer." By this definition, the source code isn't the program; it's contained within the program. The program is the communication, and part of that communication contains source code.
What else is in the communication? What are the factors of a computer program, when we use Kaner's definition as a point of departure?
A communication among separated agencies requires interfaces between them. There are several interfaces between the several people in the communication. There is a user interface for the end user, application programming interfaces for the programmers, testing interfaces for the testers. Typically there is documentation for each of these interfaces, too.
The communication between people is mediated by communication among computers and related systems. That mediated communication involves hardware (which may include computers, switching devices, processors that we may or may not think of as computers), protocols (which circumscribe the extents and limits of aspects of the communication), software (that may process the communication in some way, enhancing it or encapsulating it or compressing it or distorting it), and firmware (which itself may involve interfaces, data, programs, and protocols).
The product that enables the overall communication may be modeled in terms of structure (including computers, programs, modules, objects, classes, data objects within the classes, functions within the classes, and, yes, lines of code that enable those functions); in terms of functions that the product performs; in terms of data on which the program interacts; in terms of platforms upon which the program depends; in terms of operations that describe the way in which people use the program; and in terms of time and its interaction with the program.
This is just one set of ways of modeling or mapping the product. For the purposes of testing—questioning the product in order to evaluate it—we could conceive of many others. We could come up with a complexity map, or a map of Which Programmer Wrote Which Stuff, or Which Groups Specified Which Requirements. We could list our ideas about the program in terms of "things that involve localization and things that do not"; "things that involve currency and things that don't"; "database interactions and not-database-interactions"; "dialog layout"; "workflows"; ad infinitum. We could come up with lists of risks, subdivided into various categories. All of these models and maps provide us with ideas about relationships, about what might constitute "near" or "far away"—and consequently, potentially useful but potentially overwhelming numbers of ideas about the points around which things could be said or imagined to cluster.
I have no problem believing in the notion of bug clusters. Certainly people tend to be good at some things, but not so good at others. Our conceptions of something might be very clear in some areas, more vague in others. It's reasonable to see an error in one place and infer that there may be errors close to it. But "close" using what map? Proximity is multivariate.
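Here is a small sketch of what "close using what map?" can look like in practice. Everything in it is invented for illustration: the `Area` record, the sample areas, the programmers' names, and the `neighbours` helper are all hypothetical, and a real product would offer far more maps than these three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Area:
    """One area of a product, described on three different maps (all hypothetical)."""
    name: str
    source_file: str
    programmer: str
    feature: str


AREAS = [
    Area("invoice totals", "billing.c", "Ana", "currency"),
    Area("tax rounding", "billing.c", "Raj", "currency"),
    Area("login dialog", "ui.c", "Ana", "dialog layout"),
    Area("price display", "catalog.c", "Lee", "currency"),
]


def neighbours(bug_area, areas, map_by):
    """Names of the areas that share the chosen attribute with the area where the bug was found."""
    key = getattr(bug_area, map_by)
    return [a.name for a in areas if a is not bug_area and getattr(a, map_by) == key]


bug = AREAS[0]  # suppose the bug turned up in "invoice totals"
for map_by in ("source_file", "programmer", "feature"):
    # Each map yields a different idea of what lies "near" the bug.
    print(map_by, "->", neighbours(bug, AREAS, map_by))
```

The same bug has three different neighbourhoods depending on the map, which is why the choice of model matters before anyone goes hunting for a cluster.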
Second, cost versus value. When I find a bug, it's wonderful to be able to use that information to find another one, so I often look for other bugs "clustered" "near" this one, based on whatever model I've been using or whatever factors I choose (consciously, subconsciously, or unconsciously) to apply to my understanding of what I perceive to be the bug. If I'm really on my game as a tester, I must also be ready to suspend certainty on whether my client will even agree that this is a bug. The factors that I perceive might be very important, and those models might be very powerful, or they might be total red herrings. So at the same time, I also have to consider the opportunity cost of this activity, and the risk of not finding other bugs which might be more serious, or other information that might be more important in other, "far away" areas of my current model, or in different models altogether. It might be a better use of time to report the first problem that I find, noting my models and my suspicions, and let others (the designers, the business analysts, the programmers, the product managers) seek "related" problems using my models or their own.
The Bug Cluster Heuristic says to me "If you see a problem, suspect similar problems near it." This heuristic can be useful to me, but it depends on my models and my notion of "near", and it depends on my considering carefully the cost and the value of further investigation, versus the value of obtaining broader or deeper coverage. A momentary pause for reflection on those questions could be time well spent.
Thursday, January 15, 2009
Follow-up on EuroSTAR 2008 Presentation
Qualtech also asked me to write a follow-up piece, which I was happy to do.
Tuesday, January 13, 2009
The Most Serious Programming Error
Now, as it turns out, the headline oversimplified things. First, it masked an important distinction: the list was actually about the Top 25 Programming Errors that lead to security bugs and that enable cyber espionage and cyber crime. Moreover, it made an astonishing claim: "Agreement Will Change How Organizations Buy Software".
After a chat on the phone, I prepared a written reply for Nestor. His story appears here. Most of my comments are based on our voice conversation, but I thought that my written remarks might be of general interest.
As a tester, a testing consultant, and a trainer of testers, I think that it's terrific that this group of experts has come out with a top-25 list of common programming errors.
In the press release to which you provided a link (http://www.sans.org/top25errors/#cat1#cat1), the paragraphs that leapt to my attention were these:
"What was remarkable about the process was how quickly all the experts came to agreement, despite some heated discussion. 'There appears to be broad agreement on the programming errors,' says SANS Director, Mason Brown, 'Now it is time to fix them. First we need to make sure every programmer knows how to write code that is free of the Top 25 errors, and then we need to make sure every programming team has processes in place to find, fix, or avoid these problems and has the tools needed to verify their code is as free of these errors as automated tools can verify.'"
…
"Until now, most guidance focused on the 'vulnerabilities' that result from programming errors. This is helpful. The Top 25, however, focuses on the actual programming errors, made by developers that create the vulnerabilities. As important, the Top 25 web site provides detailed and authoritative information on mitigation. 'Now, with the Top 25, we can spend less time working with police after the house has been robbed and instead focus on getting locks on the doors before it happens,' said Paul Kurtz, a principal author of the US National Strategy to Secure Cyberspace and executive director of the Software Assurance Forum for Excellence in Code (SAFECode)."
Yet there's an issue hidden in plain sight here. The reason that consensus was obtained so quickly is that the leaders in the programming, testing, and security communities at large have known about these kinds of problems for years, and we've known about how to fix them, too. Yet most organizations haven't done much about them. Why not?
Quality is (in Jerry Weinberg's words) value to some person. Quality is not a property of software or any other product. Instead, it’s a relationship between the product and some person, and that person’s value set determines the relationship. It's entirely possible to create a program that is functionally correct, robustly secure, splendidly interoperable, and so forth, but it's not at all clear that these dimensions of software quality are at the top of the priority list for managers who are responsible for developing or purchasing software. The top priorities, it seems to me, are usually to provide something, anything that solves one instance of the problem with a) the fastest availability or time to market, and b) the lowest cost of developing or purchasing the software. Managers have problems that they want to solve, and they want to solve them right now at the lowest possible cost. This creates enormous pressure on programmers, testers, and other developers to produce something that "works": that is, something that fulfills some requirement to some degree; something that fulfills some dimension of value but that misses others; something that gets a part of the job done, but which includes problems related to reliability, security, usability, performance, compatibility, or a host of other quality criteria.
Managers often choose to observe progress on a project in terms of “completed” features, where completion means that coding has been done to the degree that the program can perform some task. A more judicious notion of a completed feature is one that has also been thoroughly reviewed, tested, and fixed to provide the value it is intended to provide. This requires some time and a good deal of critical thinking, asking, “What values are we fulfilling? What values might we be ignoring or forgetting?”
Microsoft’s Vista provides a case in point. This is a product for which, in both development and testing, there was a tremendous focus on functional correctness. A huge suite of automated tests was executed against every build, and Microsoft doubled and redoubled its security efforts on the product—it focused on putting locks on the doors before the robbery happens, as Mr. Kurtz puts it above. The hardware driver model was made more robust, but this had the effect of making many older drivers obsolete, along with the printers, cameras, or sound cards that they supported. In addition, the user was prompted for permission to perform certain actions that, according to the operating system’s heuristics, posed some kind of security risk. Yet many people have hardware whose drivers turned out to be incompatible with Vista, and even more people were baffled by questions about security that, as non-experts, they were in no position to answer. Thus for many of its customers, Vista is like living inside a bank vault with a doltish security guard at the front door, asking you if it’s okay to perform an action that you’ve just initiated, and insisting that you have to upgrade your old DVD player. Microsoft, in responding to one set of values, clobbered another, whether inadvertently or intentionally.
According to the headline, "Experts Announce Agreement on the 25 Most Dangerous Programming Errors - And How to Fix Them / Agreement Will Change How Organizations Buy Software". I doubt that the agreement will change how organizations buy software, because practically all of the problems identified in the agreement are, at the time of purchase, invisible to the typical purchasing organization. There is a way around this. The vendors could manage development to foster system thinking, critical thinking, multi-dimensional thinking about quality, and the patience and support to allow it all to happen. The purchasing organizations could employ highly skilled testers of their own, and could provide time and the support to review and probe the software from all kinds of perspectives, including the source code—to which the vendors would have to agree. Yet software development has some harsh similarities to meat production: the vendors rarely want to let the customers see how it's being done, or many customers would swear off the whole business.
So I believe that it’s wonderful that yet another organization has come out with a list of software vulnerabilities, and perhaps this will have some resonance with the greater programming community. Yet in my view, the problem is not with what we do, but with how we think, how we manage our projects, and how we foster our culture. Many programmers are working in environments where thinking broadly and deeply about code quality (never mind overall system quality) goes unrewarded. Many testers are working in groups where testing is viewed as a rote, clerical activity that can be done by unskilled students or failed programmers. As a consultant who works all over the world, I can assure you that most development groups are not strongly encouraged to look outside their own organizations for resources and learning opportunities. If programmers obtain management support for collaboration with the customer on the larger issues of what’s important and why it’s important, then preventing the Top 25 programming errors will eventually become second nature. If testing becomes viewed as a skilled, investigative, and exploratory activity, focused on learning about the product and its value for the people using it, then we’ll make real progress. I applaud the initiative, but preventing the problems will take more work than listing them. It will also take a change in values from programmers, testers, and managers alike.
Sunday, January 11, 2009
What Colour Is Your Box?
Another totally unscientific survey - how many readers of this site would consider themselves to be black box testers, white box testers or grey box testers ?
Or if you are a test manager, what colour testing do the testers you are in charge of do?
I whimsically replied, "I deny the existence of the box," a statement for which Michele Smith (quite justifiably) asked for clarification. Here's something pretty close to my reply.
I'm not sure the distinction between shades of boxes is terribly helpful. The box metaphor is helpful in one way: it reminds me to think about constraints by which I don't want to be bound. I can almost always do something about some of those constraints. Meanwhile, there are some constraints that I can only ever circumvent to some degree, not completely.
The metaphor of the black box is intended to represent a system for which we have no knowledge of the internals. As a tester, I'm rarely in that position unless I want to be, and I usually don't want to be. Simply by asking, I can usually pop the lid off the box entirely and get the source code itself, or other useful forms of information about the internals. If I can't see the code, at least I can ask lots of questions; I use the Heuristic Test Strategy Model as a point of departure for them. By using tools of various kinds I can always peer inside the box to some extent: listing out the imported and exported functions, inspecting the resource table, or examining the compiled binary in a text editor for clues. By using knowledge and experience, I can make reasonable inferences about the internals. But those determinations don't tell me what's going on inside, and that might be crucially important. Thinking solely in terms of the black box limits me. I want out.
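As one small illustration of peering inside without the source, here is a rough sketch of a scan for readable text in a compiled binary, much as you might do by eye in a text editor. It is only an approximation of what dedicated tools do; the function name `printable_strings`, the `min_length` default, and the sample bytes are my own assumptions for the example.

```python
import re


def printable_strings(data: bytes, min_length: int = 4):
    """Return runs of printable ASCII characters at least min_length bytes long."""
    # 0x20 through 0x7e covers space through tilde, the printable ASCII range.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Invented bytes standing in for a fragment of a compiled program.
sample = b"\x00\x01GetProcAddress\x00\x7f\x02ab\x00Error: file not found\x00"
print(printable_strings(sample))
```

To try it on a real file, read the file's contents in binary mode and pass the bytes in. Function names and error messages that turn up this way are clues for further questions, not conclusions about what the code does.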
(The metaphor of the white box, to me, is silly, since a white box is just as opaque as a black one. Note that this is an example of matching bias; the opposite of black is white for most purposes, so people talk about white-box testing and forget what the metaphor is supposed to represent. Glass-box testing covers the intended meaning better for me.)
So let's look at the glass box, then. The glass box is clear, allowing us to see everything in its innards. That's certainly a nice idea, but it prompts some questions. What does "everything" mean? Is it really important to see everything? Could we see everything even if we looked?
Consider: we generally consider the glass box to be the source code for an application, consisting of instructions that our developers have written. Yet our developers make calls to application frameworks, third-party libraries, operating system functions, code that other developers wrote, hardware interfaces, and so on. Our code may provide services and functions for other code. So we're not really seeing everything; we're seeing what there is to see, and how it might refer to other black or glass boxes. We might be able to take advantage of our knowledge of the internals of our code to use its interfaces or to instrument it in some way. But even if we can read the source code, we can't necessarily determine the significance of the function for some task, nor can we anticipate completely how it might go wrong. We might be able to determine if the code has problems in it, but we can't determine if the code was written in an optimal way, since we're residents of our own state of the art, as Billy Vaughan Koen might say (see his book Discussion of the Method). We can't completely recognize the programmer's intentions, her mental shortcuts, her brilliance, or her blind spots; the code might suggest things about that, but it doesn't tell us conclusively. Even if we can anticipate the platforms on which the code is supported, we can't represent them all in the lab, and we can't determine if a change to those platforms will render our code useless. Visibility into the glass box might be helpful, but it doesn't necessarily and on its own tell us about what might be important to someone who matters. Indeed, thinking too much in terms of the glass box may dazzle or distract us into thinking that the box itself is the important thing, rather than the value that the box provides with respect to the rest of the system and its human users.
The box that I'm currently looking at, black or glass, is always connected to other boxes. No box ever really stands alone; its behaviour is ultimately interesting only with respect to other boxes. Some of the more important black boxes here, boxes into which I cannot truly see, are the minds of the people who wrote the code and the minds of the people who use it. Thinking solely in terms of the glass box limits me, just as surely as thinking solely in terms of the black box does.
So: am I a black box tester? If I feel that I'm in that mode, one of my first impulses is to consider whether it's important, for the current question I have, to figure out what's going on inside. Am I a glass box tester? If I'm in that mode, one of my first impulses is to consider what I'm not noticing and how this box interacts with other boxes, some of which may be glass, some of which may be black, and some of which may be frosted, transparent, translucent, made of Lexan, made of cellophane, or made of diamond, all in every colour of the rainbow.
So whatever the colour of the box, whatever its visibility, the value of my examination is limited if I'm only looking at its insides or its outsides. As much as I can, I want to comprehend them both. In particular, if I'm trapped in a box, I want to escape before the food and oxygen run out.
Friday, January 09, 2009
Back to the Drawing Board
Sunday, January 04, 2009
Credo
In particular, the statement emphasizes that adaptation to the needs of the project is the first step in the context-driven approach. This makes context-driven distinct from other approaches that may acknowledge the importance of context, but make something else—a standards focus, iterative development and unit tests, "best practices", a particular process model—paramount.

