Blog Posts for the ‘Accountability’ Category

Is There a Simple Coverage Metric?

Tuesday, April 26th, 2016

In response to my recent blog post, 100% Coverage is Possible, reader Hema Khurana asked:

“Also some measure is required otherwise we wouldn’t know about the depth of coverage. Any straight measures available?”

I replied, “I don’t know what you mean by a ‘straight’ measure. Can you explain what you mean by that?”

Hema responded: “I meant a metric some X/Y.”

In all honesty, it’s sometimes hard to remain patient when this question seems to come up at every conference, in every class, week upon week, year upon year. Asking me about this is a little like asking Chris Hadfield—since he’s a well-known astronaut and a pretty smart guy—if he could provide a way of measuring the area of the flat, rectangular earth. But Hema hasn’t asked me before, and we’ve never met, so I don’t want to be immediately dismissive.

My answer, my fast answer, is No. One key problem here is related to what Y could possibly represent. What counts? Maybe we could talk about Y in terms of a number of test cases, and X as how many of those test cases we’ve executed so far. If Y is 600 and X is 540, we could say that testing is 90% done. But that ignores at least two fundamental problems.

The first problem is that, irrespective of the number of test cases we have, we could choose to add more at any time as (via testing) we discover different conditions that we would like to evaluate. Or maybe we could choose to drop test cases when we realize that they’re out of date or irrelevant or erroneous. That is, unless we decide to ignore what we’ve learned, Y will, quite appropriately, change over time.

The second problem is that—at least in my view, and in the view of my colleagues—test cases are a ludicrous way to think about testing.

Another almost-as-quick answer would be to encourage people to re-read that 100% Coverage is Possible post (and the Further Reading links), and to keep re-reading until they get it.

But that’s probably not very encouraging to someone who is asking a naive question, and I’d like to be more helpful than that.

Here’s one thing we could do, if someone were desperate for numbers that summarize coverage: we could make a qualitative evaluation of coverage, and put numbers (or letters, or symbols) on a scale that is nominal and very weakly ordinal.

Our qualitative evaluation would be rooted in analysis of many dimensions of coverage. The Product Elements and Quality Criteria sections of the Heuristic Test Strategy Model provide a framework for generating coverage ideas or for reviewing our coverage retrospectively. We would review and discuss how much testing we’ve done of specific features, or particular functional areas, or perceived risks, and summarize our evaluation using a simple scale that would go something like this:

Level 0 (or X, or an empty circle, or…): We know nothing at all about this area of the product.

Level 1 (or C, or a glassy-eyed emoticon, or…): We have done a very cursory evaluation of this area. Smoke- or sanity-level; we’ve visited this feature and had a brief look at it, but we don’t really know very much about it; we haven’t probed it in any real depth.

Level 2 (or B, or a normal-looking emoticon, or…): We’ve had a reasonable look at this area, although we haven’t gone all the way deep. We’ve examined the common, the core, the critical, the happy paths, the handling of everyday errors or exceptions. We’re pretty familiar with this area. We’ve done the kind of testing that would expose some significant bugs, if they were there.

Level 3 (or A, or a determined-looking angel emoticon, or…): We’ve really kicked this area harshly and hard. We’ve looked at unusual and complex conditions or states. We’ve probed deeply for subtle or hidden bugs. We’ve exposed the product to the extreme, the exceptional, the rare, the improbable. We’ve looked for bugs that are deep in the corners or hidden in the dark. If there were a serious bug, we’re pretty sure we would have found it by now.

Strictly speaking, these numbers are placed on an ordinal scale, in the sense that Level 3 coverage is deeper than Level 2, which is deeper than Level 1. (If you don’t know about scales of measurement, you should learn about them before providing or asking for metrics. And there are some other things to look at.) The numbers are certainly not an interval scale, or a ratio scale. They may not be commensurate from one feature area to the next; that is, they may represent different notions of coverage, different amounts of effort, different modes of evaluation. By design, these numbers should not be treated as valid measurements, and we should make sure that everyone on the project knows it. They are little labels that summarize evaluations and product elements and effort, factors that must be discussed to be understood. But those discussions can lead to understanding and consensus between ourselves, our colleagues, and our clients.
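If a team really did want to record these labels somewhere, the bookkeeping is trivial; the point is what the labels may and may not be used for. Here’s a minimal sketch in Python (the feature areas and level assignments are invented for illustration, not taken from any real project). Sorting by level is legitimate, because the scale is ordinal; averaging the numbers or deriving a “percent coverage” from them would be meaningless.

```python
# A sketch (not a prescribed method) of recording the qualitative
# coverage labels described above. Area names and levels are invented.
# The scale is (weakly) ordinal: Level 3 is deeper than Level 2, but
# the numbers are labels that summarize discussions, not measurements,
# so arithmetic on them (averages, percentages) is not valid.

COVERAGE_LEVELS = {
    0: "We know nothing at all about this area",
    1: "Cursory: a smoke- or sanity-level look only",
    2: "Reasonable: common, core, critical, happy paths examined",
    3: "Deep: unusual, complex, rare, and improbable conditions probed",
}

# Hypothetical feature areas and our current evaluation of each.
coverage = {
    "Login": 3,
    "Report generation": 2,
    "Import wizard": 1,
    "Printing": 0,
}

def summarize(coverage):
    """Return one line per area, shallowest coverage first.

    Ordinal comparisons (sorting) are fine; arithmetic on the
    levels would not be.
    """
    return [
        f"Level {level}: {area} -- {COVERAGE_LEVELS[level]}"
        for area, level in sorted(coverage.items(), key=lambda kv: kv[1])
    ]

for line in summarize(coverage):
    print(line)
```

A summary like this surfaces the areas at Level 0 or 1 first, which is where the conversation with colleagues and clients should probably begin.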

On a Role

Monday, June 15th, 2015

This article was originally published in the February 2015 edition of Testing Trapeze, an excellent online testing magazine produced by our testing friends in New Zealand. There are small edits here from the version I submitted.

Once upon a time, before I was a tester, I worked in theatre. Throughout my career, I took on many roles—but maybe not in the way you’d immediately expect. In my early days, I was a performer, acting in roles in the sense that springs to mind for most people when they think of theatre: characters in a play. Most of the time, though, I was in the role of a stage manager, which is a little like being a program manager in a software development group. Sometimes my role was that of a lighting designer, sound engineer, or stagehand. I worked in the wardrobe of the Toronto production of CATS for six months, too.

Recent discussions about software development have prompted me to think about the role of roles in our work, and in work generally. For example, in a typical theatre piece, an actor performs in three different roles at once. Here, I’ll classify them…

a first-order role, in which a person is a member of the theatre company throughout the rehearsal period and run of the play. If someone asks him “What are you working on these days?”, he’ll reply “I’m doing a show with the Mistytown Theatre Company.”

a second-order role that the person takes on when he arrives at the theatre, defocusing from his day-to-day role as a husband and father, and focusing his energy on being an actor, or stagehand, or lighting designer. He typically holds that second-order role over the course of the working day, and abandons it when it’s time to go home.

a third-order role that the actor performs as a specific character at some point during the show. In many cases, the actor takes on one character per performance. Occasionally an actor takes on several different characters throughout the course of the performance, playing a new third-order role from one moment to another. In an improvisational theatre company, a performer may pick up and drop third-order roles as quickly as you or I would don or doff a hat. In a more traditional style of theatre, roles are more sharply defined, and things can get confusing when actors suddenly and unexpectedly change roles mid-performance. (I saw that happen once during my theatre career. An elderly performer took ill during the middle of the first act, and her much younger understudy stepped in for the remainder of the show. It was necessary on that occasion, of course, but the relationships between the performers were shaken up for the rest of the evening, and there was no telling what sense the audience was able to make of the sudden switch until intermission when the stage manager made an announcement.)

It’s natural and normal to deal simultaneously with roles of different orders, but it’s hard to handle two roles of the same order at exactly the same time. For example, a person may be both a member of a theatre company and a parent, but it’s not easy to supervise a child while you’re on stage in the middle of a show. In a small theatre company, the same person might hold two second-order roles—as both an actor and a costume designer, say—but in a given moment, that person is focusing on either acting or costume design, but not both at once. People in a performer role tend not to play two different third-order roles—two different characters—at the same moment. There are rare exceptions, as in those weird Star Trek episodes or in movies like All of Me, in which one character is inhabiting the body of another. To perform successfully in two simultaneous third-order roles takes spectacular amounts of discipline and skill, and the occasions where it’s necessary to do so aren’t terribly common.

Some roles are more temporary than others. At the end of the performance, people drop their second-order roles to go home and live out their other, more long-term roles: husbands and wives, parents, daughters and sons. They may adopt other roles too: volunteer in the community soup kitchen; declarer in this hand of the bridge game; parishioner at the church; pitcher on the softball team.

Roles can be refined and redefined; in a dramatic television series, an actor performs in a third-order role in each episode, as a particular character. If it’s an interesting character, aspects of the role change and develop over time. At the end of the run of a show, people may continue in their first-order roles with the same theatre company; they may become directors or choreographers with that company; or they may move on to another role in another company. They may take on another career altogether. Other roles evolve too, from friend to lover to spouse to parent.

In theatre, a role is an identity that a person takes to fulfill some purpose in service of the theatre company, production, or the nightly show. More generally, a role is a position or function that a person adopts and performs temporarily. A role represents a set of services offered, and often includes tacit or explicit commitments to do certain things for and with other people. A role is a way to summarize ideas about services people offer, activities they perform, and the goals that guide them.

Now: to software. As a member of a software development team within an organization, I’m an individual contributor. In that first-order role, I’m a generalist. I’ve been a program manager, programmer, tech support person, technical writer, network administrator, and phone system administrator, business owner, bookkeeper, teacher, musician… Those experiences have helped me to be aware of the diversity of roles on a project, to recognize and respect the people who perform them, and to be able to perform them effectively to some extent if necessary. In the individual contributor role, I commit to taking on work to help the company to achieve success, just as (I hope) everyone else in the company does.

Normally I’m taking on the everyday, second-order role of a tester, just as a member of a theatre company might walk through the door in the evening as a lighting technician. By adopting the testing role, I’m declaring my commitment to specialize in providing testing services for the project. That doesn’t limit me to testing, of course. If I’m asked, I might also do some programming or documentation work, especially in small development groups—just as an actor in a very small theatre company might help in the box office and take ticket orders from time to time. Nonetheless, my commitment and responsibility to provide testing services requires me to be very cautious about taking on things outside the testing role. When I’m hired as a tester, my default belief is that there’s going to be more than enough testing work to do. If I’m being asked to perform in a different role such that important testing work might be neglected or compromised, I must figure out the priorities with my client.

Within my testing role, I might take on a third-order role as a responsible tester (James Bach has blogged on the role of the responsible tester) for a given project, but I might take on a variety of third-order roles as a test jumper (James has blogged about test jumpers, too).

Like parts of an outfit that I choose to wear, a role is a heuristic that can help to suggest who I am and what I do. In a hospital, the medical staff are easy to identify, wearing uniforms, lab coats, or scrubs that distinguish them from civilian life. Everyone wears badges that allow others to identify them. Surgical staff wear personalized caps—some plain and ordinary, others colourful and whimsical. Doctors often have stethoscopes stuffed into a coat pocket, and certificates from medical schools on their walls. Yet what we might see remains a hint, not a certainty; someone dressed like a nurse may not be a nurse. The role is not a guarantee that the person is qualified to do the work, so it’s worthwhile to see if the garb is a good fit for the person wearing it.

The “team member” role is one thing; the role within the team is another. In a FIFA soccer match, the goalkeeper is dressed differently to make the distinct role—with its special responsibilities and expectations—clearly visible to everyone else, including his team members. The goalkeeper’s role is to mind the net, not to run downfield trying to score goals. There’s no rule against a goalie trying to do what a striker does, but to do so would be disruptive to the dynamics of the team. When a goalkeeper runs downfield trying to score goals, he leaves the net unattended—and those who chose to defend the goal crease aren’t allowed to use their hands.

In well-organized, self-organized teamwork, roles help to identify whether people are in appropriate places. If I’m known as a tester on the project and I am suddenly indisposed, unavailable, or out of position, people are more likely to recognize that some of the testing work won’t get done. Conversely, if someone else can’t fulfill their role for some reason, I’m prepared to step up and volunteer to help. Yet to be helpful, I need to coordinate consistently with the rest of the team to make sure our perceptions line up. On the one hand, I may not have noticed important and necessary work. On the other, I don’t want to inflict help on the project, nor would it be respectful or wise for me to usurp anyone else’s role. Shifting positions to adapt to a changing situation can be a lot easier when roles help to frame where we’re coming from, where we are, and where we’re going.

A role is not a full-body tattoo, permanently inscribed on me, difficult and painful to remove. A role is not a straitjacket. I wouldn’t volunteer to wear a straitjacket, and I’ll resist if someone tries to put me into one. As Kent Beck has said, “Responsibility cannot be assigned; it can only be accepted. If someone tries to give you responsibility, only you can decide if you are responsible or if you aren’t.” (from Extreme Programming Explained: Embrace Change) I also (metaphorically) study escape artistry in the unlikely event that someone manages to constrain me. When I adopt a role, I must do so voluntarily, understanding the commitment I’m making and believing that I can perform it well—or learn it in a hurry. I might temporarily adopt a third-order role normally taken by someone else, but in the long run, I can’t commit to a role without full and ongoing understanding, agreement, and consent between me and my clients. If I resist accepting a role, I don’t do so capriciously or arbitrarily, but for deeply practical reasons related to three important problems.

The Expertise Problem. I’m willing to do or to learn almost anything, but there is often work for which I may be incompetent, unprepared, or underqualified. Each set of tasks in software development requires a significant and distinct set of skills which must be learned and practiced if they are to be performed expertly. I don’t want to fool my client or my team into believing that the work will be done well until I’m capable, so I’ll push back on working in certain roles unless my client is willing to accept the attendant risks.

For example, becoming an expert programmer takes years of focused study, experience, and determination. As Collins and Evans suggest, real expertise requires not only skill, but also ongoing maintenance; immersion in a way of life. James Bach remarked to me recently, “The only reason that I’m not an expert programmer now is that I haven’t tried it. I’ve been in the software business for thirty years, and if I had focused on programming, I’d be a kick-ass programmer by now. But I chose to be a tester instead.” I feel the same way. Programming is a valuable means to an end for me—it helps me get certain kinds of testing work done. I can be a quite capable programmer when I put my mind to it, but I find I have to do programming constantly—almost obsessively—to maintain my skills to my own standards. (These days, if I were asked to do any kind of production programming—even minor changes to the code—I would insist on both close collaboration with peers and careful review by an expert.) I believe I can perform competently, adequately, eventually, in any role. Yet competence and adequacy aren’t enough when I aspire to achieving excellence and mastery. At a certain point in my life, I decided to focus my time and energy on testing and the teaching of it; the testing and teaching roles are the ones that attract me most. Their skills are the ones that I am most interested in trying to master—just as others are focused on mastering programming skills. So: roles represent a heuristic for focusing my development of expertise, and for distributing expertise around the team.

The Mindset Problem. Building a product demands a certain mindset; testing it deeply demands another. When I’m programming or writing (as I’m doing now), I tend to be in the builder’s mindset. As such, I’m at close “critical distance” to the work. I’m seeing it from the position of an insider—me—rather than as an outsider. It’s relatively easy for me to perform shallow testing and spot coding errors, or spelling and grammatical mistakes—although after I’ve been looking at the work for a while, I may start to miss those as well. It’s quite a bit harder for me to notice deeper structural or thematic problems, because I’ve invested time and energy in building the piece as I have, converging towards something I believe that I want. To see deeper problems, I need the greater critical distance that’s available in the tester’s mindset—what testers or editors do. It’s not a trivial matter to switch between mindsets, especially with respect to one’s own work. Switching mindsets is not impossible, but shifting from building into good critical and analytical work is effortful and time-consuming, and messes with the flow.

One heuristic for identifying deep problems in my writing work would be to walk away from writing—from the builder’s mindset—and come back later with the tester’s mindset—just as I’ve done several times with this essay. However, the change in mindset takes time, and even after days or weeks, part of me remains in the writer’s mindset—because it’s my writing. Similarly, a programmer in the flow of developing a product may find it disruptive—both logistically and intellectually—to switch mindsets and start looking for problems. In fact, the required effort likely explains a good deal of some programmers’ stated reluctance to do deep testing on their own.

So another useful heuristic is for the builder to show the work to other people. As they are different people, other builders naturally have critical distance, but that distance gets emphasized when they agree to take on a testing role. I’ve done that with this article too, by enlisting helpers—other writers who adopt the roles of editors and reviewers. A reviewer might usually identify herself as a writer, just as someone in a testing role might normally identify as a programmer. Yet temporarily adopting a reviewer’s role and a testing mindset frames the approach to the task at hand—finding important problems in the work that are harder to see quickly from the builder’s mindset. In publishing, some people by inclination, experience, training, and skills specialize in editing, rather than writing. The editing role is analogous to that of the dedicated tester—someone who remains consistently in the tester’s mindset, at even farther critical distance from the work than the builder-helpers are—more quickly and easily able to observe deep, rare, or subtle problems that builders might not notice.

The Workspace Problem. Tasks in software development may require careful preparation, ongoing design, and day-to-day, long-term maintenance of environments and tools. Different jobs require different workspaces. Programmers, in the building role, set up their environments and tools to do development and building work most simply and efficiently. Setting up a test lab for all of its different purposes—investigation of problems from the field; testing for adaptability and platform support; benchmarking for performance—takes time and focus away from valuable development tasks. The testing role provides a heuristic for distributing and organizing the work of maintaining the test lab.

People sometimes say “on an Agile project, everybody does everything” or “there are no roles on an Agile project”. To me, that’s like saying that there is no particular focusing heuristic for the services that people offer; throwing out the baby of skill with the bathwater of overspecialization and isolation. Indeed, “everybody doing everything” seems to run counter to another idea important to Agile development: expertise and craftsmanship. A successful team is one in which people with diversified skills, interests, temperaments, and experiences work together to produce something that they could not have produced individually. Roles are powerful heuristics for helping to organize and structure the relationships between those people. Even though I’m willing to do anything, I can serve the project best in the testing role, just as others serve the project best in the developer role.

That’s the end of the article. However, my colleague James Bach offered these observations on roles, which were included as a sidebar to the article in the magazine.

A role is probably not:

  • a declaration of the only things you are allowed to do. (It is neither a prison cell nor a destiny from which escape is not possible.)
  • a declaration of the things that you and you only are allowed to do. (It is not a fortress that prevents entry from anyone outside.)
  • a one-size, exclusive, permanent, or generic structure.

A role is:

  • a declaration of what one can be relied upon to do; a promise to perform a service or services well. (Some of those services may be explicit; others are tacit.)
  • a unifying idea serving to focus commitment, preparation, performance, and delivery of services.
  • a heuristic for helping people manage their time on a project, and to be able to determine spontaneously who to approach, consult with, or make requests to (or sometimes avoid), in order to get things done.
  • a heuristic for fostering personal engagement and responsibility.
  • a heuristic for defining or explaining the meaning of your work.
  • a flexible and non-exclusive structure that may exist over a span of moments or years.
  • a label that represents these things.
  • a voluntary commitment.

A role may or may not be:

  • an identity
  • a component of identity.

—James Bach

Taking Severity Seriously

Wednesday, January 14th, 2015

There’s a flaw in the way most organizations classify the severity of a bug. Here’s an example from the Elementool Web site (as of 14 January, 2015); I’m sure you’ve seen something like it:

Critical: The bug causes a failure of the complete software system, subsystem or a program within the system.
High: The bug does not cause a failure, but causes the system to produce incorrect, incomplete, inconsistent results or impairs the system usability.
Medium: The bug does not cause a failure, does not impair usability, and does not interfere in the fluent work of the system and programs.
Low: The bug is an aesthetic (sic —MB), is an enhancement (ditto) or is a result of non-conformance to a standard.

These are serious problems, to be sure—and there are problems with the categorizations, too. (For example, non-conformance to a medical device standard can get you publicly reprimanded by the FDA; how is that low severity?) But there’s a more serious problem with models of severity like this: they’re all about the system as though no person used that system. There’s no empathy or emotion here; there’s no impact on people. The descriptions don’t mention the victims of the problem, and they certainly don’t identify consequences for the business. What would happen if we thought of those categories a little differently?

Critical: The bug will cause so much harm or loss that customers will sue us, regulators will launch a probe of our management, newspapers will run a front-page story about us, and comedians will talk about us on late night talk shows. Our company will spend buckets of money on lawyers, public relations, and technical support to try to keep the company afloat. Many capable people will leave voluntarily without even looking for a new job. Lots of people will get laid off. Or, the bug blocks testing such that we could miss problems of this magnitude; go back to the beginning of this paragraph.

High: The bug will cause loss, harm, or deep annoyance and inconvenience to our customers, prompting them to flood the technical support phones, overwhelm the online chat team, return the product demanding their money back, and buy the competitor’s product. And they’ll complain loudly on Twitter. The newspaper story will make it to the front page of the business section, and our product will be used for a gag in Dilbert. Sales will take a hit and revenue will fall. The Technical Support department will hold a grudge against Development and Product Management for years. And our best workers won’t leave right away, but they’ll be sufficiently demoralized to start shopping their résumés around.

Medium: The bug will cause our customers to be frustrated or impatient, and to lose faith in our product such that they won’t necessarily call or write, but they won’t be back for the next version. Most won’t initiate a tweet about us, but they’ll eagerly retweet someone else’s. Or, the bug will annoy the CEO’s daughter, whereupon the CEO will pay an uncomfortable visit to the development group. People won’t leave the company, but they’ll be demotivated and call in sick more often. Tech support will handle an increased number of calls. Meanwhile, the testers will have—with the best of intentions—taken time to investigate and report the bug, such that other, more serious bugs will be missed (see “High” and “Critical” above). And a few months later, some middle manager will ask, uncomprehendingly, “Why didn’t you find that bug?”

Low: The bug is visible; it makes our customers laugh at us because it makes our managers, programmers, and testers look incompetent and sloppy—and it causes our customers to suspect deeper problems. Even people inside the company will tease others about the problem via graffiti in the stalls in the washroom (written with a non-washable Sharpie). Again, the testers will have spent some time on investigation and reporting, and again test coverage will suffer.

Of course, one really great way to avoid many of these kinds of problems is to focus on diligent craftsmanship supported by scrupulous testing. But when it comes to that discussion in that triage meeting, let’s consider the impact on real customers, on the real people in our company, and on our own reputations.

When Programmers (and Testers) Do Their Jobs

Monday, December 22nd, 2014

For a long time, I’ve admired Robert (“Uncle Bob”) Martin’s persistent advocacy of craftsmanship in programming and software development. Recently on Twitter, he said that when programmers do their jobs, testers find nothing.

One of the most important tasks in the testing role is to identify alternative interpretations of apparently clear and simple statements. Uncle Bob’s statement appears clear and simple, but as with any sentence that can be read by a human, it affords multiple interpretations. One interpretation might be that “when programmers do their jobs, testers find nothing and therefore have nothing useful to contribute”. I’m pretty sure Uncle Bob didn’t mean to say that, although it seems that at least one of my colleagues might have taken that interpretation. I prefer to think Uncle Bob’s intention was to remind programmers to take responsibility for the integrity and quality of their work, and not to slight testers.

As a tester, part of my job is to help reduce the chance that statements could be misinterpreted or taken in an overly simplistic way. I think Uncle Bob probably meant the first item on this list of a few possible interpretations (and I hope he’d agree with the other ones that I offer here, too):

  • When programmers do their jobs, testers find nothing that takes the form of blatant coding errors.
  • When programmers do their jobs, testers find nothing inconsistent with what the programmers have been asked to do—although the testers might discover problems in the design or the requirements that were given to the programmers to implement.
  • When programmers do their jobs, testers find nothing that indicates the programmer has been negligent or sloppy, although even the best programmers are not perfect.
  • When programmers do their jobs, testers find nothing that makes the product hard to test; instead, they receive a highly testable product that provides access to things like log files and testable interfaces.
  • When programmers do their jobs, testers find nothing problematic, although they might discover unanticipated value in the product.
  • When programmers do their jobs, testers find nothing that interferes with deep testing—looking for rare, hidden, subtle, or platform-related problems that could escape even the most diligent programmers.
  • When programmers do their jobs, testers find nothing that slows them down in developing a more comprehensive understanding of the business needs, making their testing more relevant.
  • When programmers do their jobs, testers find nothing that takes time away from developing rich test ideas, scenarios, and experiments that yield a deep understanding of the product and its emergent behaviours.
  • When programmers do their jobs, testers find nothing more to ask for in terms of useful tools that would aid testing.

In the same thread, James Bach pointed out that even when programmers do their jobs, testers find that the product is doing its job, and that testers find important truths about the product. Neither of these is exactly “nothing”. So…

  • When programmers do their jobs, testers shine light on exactly how well the programmers have done their jobs.
  • When programmers do their jobs, testers identify ways in which other people might have different interpretations of a job well done.
  • When programmers do their jobs, testers have more time to compare our product with competitors’ products, pointing out areas of strengths and weaknesses in each one.

Programmers are also in the business of clearing up misinterpretations. I posted a simpler version of one of the ideas above on Twitter:

“When programmers do their jobs, testers find deep, rare, hidden, subtle, or platform-related problems.”

That sentence was limited by Twitter’s 140-character limit, and limited further by the Twitter handles of a couple of addressees to whom I was responding. Ron Jeffries, on a mission similar to mine, pointed out that some testers find deep, rare, hidden, subtle, or platform-related problems. I agree with Ron, and I’ll add that even the best testers—just like the best developers—are human, and limited, and can occasionally miss problems. So:

  • Testers (and programmers) who focus on excellence, craftsmanship, skill, and collaboration will help each other, and will tend to find problems that can be addressed before the product is released—and will tend to produce more valuable products as a result.

Very Short Blog Posts (21): You Had It Last!

Tuesday, November 4th, 2014

Sometimes testers say to me “My development team (or the support people, or the managers) keep saying that any bugs in the product are the testers’ fault. ‘It’s obvious that any bug in the product is the tester’s responsibility,’ they say, ‘since the tester had the product last.’ How do I answer them?”

Well, you could say that the product’s problems are the responsibility of the tester because the tester had the product last—and that a successful product was successful because the programmers and the business people did such a good job at preventing bugs. But that would be to explain any failures in the product in one way, and to explain any successes in the product in a completely different way.

Instead, let’s be consistent. Testers don’t put the bugs in, and testers miss some of the bugs because bugs are, by their nature, hidden. Moreover, the bugs are hidden so well that not even the people who put them in could find them. The bugs are hidden by people, and by the consequences of how we choose to do software development. So let’s all work to prevent the bugs, and to find them more quickly. Let’s talk about problems in development that allow bugs to hide. Let’s all work on testability, so that we can find bugs earlier, and more easily, before the bugs have a chance to hide deeply. And let’s all share responsibility for our failures and our successes.

Facts and Figures in Software Engineering Research (Part 2)

Wednesday, October 22nd, 2014

On July 23, 2002, Capers Jones, Chief Scientist Emeritus of a company called Software Productivity Research, gave a presentation called “SOFTWARE QUALITY IN 2002: A SURVEY OF THE STATE OF THE ART”. In this presentation, he showed data on a slide titled “U.S. Averages for Software Quality”.

US Averages for Software Quality 2002

(Source:, accessed September 5, 2014)

It is not clear what “defect potentials” means. A slide preceding this one says defect potentials are (or include) “requirements errors, design errors, code errors, document errors, bad fix errors, test plan errors, and test case errors.”

There is no description in the presentation of the link between these categories and the numbers in the “Defect Potential” column. Yes, the numbers are expressed in terms of “defects per function point”, but where did the numbers for these “potentials” come from?

In order to investigate this question, I spent just over a hundred dollars to purchase three books by Mr. Jones: Applied Software Measurement (Second Edition) (1997) [ASM2]; Applied Software Measurement: Assuring Productivity and Quality (Third Edition), 2008 [ASM3]; and The Economics of Software Quality (co-authored with Olivier Bonsignour) (2011). In [ASM2], he says

The “defect potential” of an application is the sum of all defects found during development and out into the field when the application is used by clients and customers. The kinds of defects that comprise the defect potential include five categories:

  • Requirements defects

  • Design defects

  • Source code defects

  • User documentation defects

  • “Bad fixes” or secondary defects found in repairs in prior defects

The information in this book is derived from observations of software projects that utilized formal design and code inspections plus full multistage testing activities. Obviously the companies also had formal and accurate defect tracking tools available.

Shortly afterwards, Mr. Jones says:

Note that this kind of data is clearly biased, since very few companies actually track life-cycle defect rates with the kind of precision needed to ensure really good data on this subject.

That’s not surprising, and it’s not the only problem. What are the biases? How might they affect the data? Which companies were included, and which were not? Did each company have the same classification scheme for assigning defects to categories? How can this information be generalized to other companies and projects?

More importantly, what is a defect? When does a coding defect become a defect (when the programmer types a variable name in error?), and when might it suddenly stop being a defect (when the programmer hits the backspace key three seconds later?)? Does the defect get counted as a defect in that case?

What is the model or theory that associates the number 1.25 in the slide above with the potential for defects in design? The text suggests that “defect potentials” refers to defects found—but that’s not a potential, that’s an outcome.

In Applied Software Measurement, Third Edition, things change a little:

The term “defect potential” refers to the probable number of defects found in five sources: requirements, design, source code, user documents, and bad fixes… The data on defect potentials comes from companies that actually have lifecycle quality measures. Only a few leading companies have this kind of data, and they are among the top-ranked companies in overall quality: IBM, Motorola, AT&T, and the like.

Note the change: there’s been a shift from the number of defects found to the probable number of defects found. But surely defects were either found or they weren’t; how can they be “probably found”? Perhaps this is a projection of defects to be found—but what is the projection based on? The text does not make this clear. And the question has still been begged: What is the model or theory that associates the number 1.25 in the slide above with the potential for defects in design?

These are questions of construct validity, about which I’ve written before. And there are many questions that one could ask about the way the data has been gathered, controlled, aggregated, normalized, and validated. But there’s something more troubling at work here.

Here’s a similar slide from a presentation in 2005:
US Averages for Software Quality 2005

(Source:, accessed September 5, 2014)

From a presentation in 2008:
US Averages for Software Quality 2008

(Source:, accessed September 5, 2014)

From a presentation in 2010:
US Averages for Software Quality 2010

(Source:, accessed September 5, 2014)

From a presentation in 2012:
US Averages for Software Quality 2012

(Source:, accessed September 5, 2014)

From a presentation in 2013:
US Averages for Software Quality 2013

(Source:, accessed September 5, 2014)

And here’s one from all the way back in 2000:
US Averages for Software Quality 2000

(Source:, accessed October 22, 2014)

What explains the stubborn consistency, over 13 years, of every single data point in this table?

I thank Laurent Bossavit for his inspiration and assistance in exploring this data.

Facts and Figures in Software Engineering Research

Monday, October 20th, 2014

On July 23, 2002, Capers Jones, Chief Scientist Emeritus of a company called Software Productivity Research, gave a presentation called “SOFTWARE QUALITY IN 2002: A SURVEY OF THE STATE OF THE ART”. In this presentation, he provided the sources for his data on the second slide:

SPR clients from 1984 through 2002
• About 600 companies (150 clients in Fortune 500 set)
• About 30 government/military groups
• About 12,000 total projects
• New data = about 75 projects per month
• Data collected from 24 countries
• Observations during more than a dozen lawsuits

(Source:, accessed September 5, 2014)

On May 2, 2005, Mr. Jones, this time billed as Chief Scientist and Founder of Software Quality Research, gave a presentation called “SOFTWARE QUALITY IN 2005: A SURVEY OF THE STATE OF THE ART”. In this presentation, he provided the source for his data, again on the second slide:

SPR clients from 1984 through 2005
• About 625 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 12,500 total projects
• New data = about 75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source:, accessed September 5, 2014)

Notice that 34 months have passed between the two presentations, and that the total number of projects has increased by 500. At 75 projects a month, we should expect that 2,550 projects have been added to the original tally; yet only 500 projects have been added.

On January 30, 2008, Mr. Jones (Founder and Chief Scientist Emeritus of Software Quality Research), gave a presentation called “SOFTWARE QUALITY IN 2008: A SURVEY OF THE STATE OF THE ART”. This time the sources (once again on the second slide) looked like this:

SPR clients from 1984 through 2008
• About 650 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 12,500 total projects
• New data = about 75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source:, accessed September 5, 2014)

This is odd. 32 months have passed since the May 2005 presentation. With new data being added at 75 projects per month, there should have been 2,400 new projects since the prior presentation. Yet there has been no increase at all in the number of total projects.

On November 2, 2010, Mr. Jones (now billed as Founder and Chief Scientist Emeritus and as President of Capers Jones & Associates LLC) gave a presentation called “SOFTWARE QUALITY IN 2010: A SURVEY OF THE STATE OF THE ART”. Here are the sources, once again from the second slide:

Data collected from 1984 through 2010
• About 675 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 13,500 total projects
• New data = about 50-75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source:, accessed September 5, 2014)

Here three claims about the data have changed: 25 companies have been added to the data sources, the total set now comprises 13,500 projects, and “about 50-75 projects” have been added (or are being added; this isn’t clear) per month. 33 full months have passed since the January 2008 presentation (which came at the end of that month). Even the lower bound of the claimed per-month increases would mean 1,650 new projects since the last presentation (33 × 50); the claim of 75 per month would mean 2,475. Yet only 1,000 projects have been added. What does it mean to claim “new data = about 50-75 projects per month” when the new data appears to be coming in at a rate below the lowest rate claimed?

On May 1, 2012, Mr. Jones (CTO of Namcook Analytics LLC) gave a talk called “SOFTWARE QUALITY IN 2012: A SURVEY OF THE STATE OF THE ART”. Once again, the second slide provides the sources.

Data collected from 1984 through 2012
• About 675 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 13,500 total projects
• New data = about 50-75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source:, accessed September 5, 2014)

Here there has been no change at all in any of the previous claims (except for the range of time over which the data has been collected). The claim that 50-75 projects per month are being added remains. At that rate, extrapolating from the claims in the November 2010 presentation, there should be between 14,400 and 14,850 projects in the data set. Yet the claim of 13,500 total projects also remains.

On August 18, 2013, Mr. Jones (now VP and CTO of Namcook Analytics LLC) gave a presentation called “SOFTWARE QUALITY IN 2013: A SURVEY OF THE STATE OF THE ART”. Here are the data sources (from page 2):

Data collected from 1984 through 2013
• About 675 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 13,500 total projects
• New data = about 50-75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source:, accessed September 5, 2014)

Once again, no change in the total number of projects, but the claim of 50-75 new projects per month remains. Based on the 2012 claim, the 15 months that have passed (more like 16, but we’ll be generous here), and the growth claims in these presentations, there should be between 14,250 and 14,625 projects in the data set.

Based on the absolute claim of 75 new projects per month in the period 2002-2008, and 50 per month in the remainder, we’d expect 20,250 projects at a minimum by 2013. But let’s be conservative and generous, and base the claim of new projects per month at 50 for the entire period from 2002 to 2013. That would be 600 new projects per year over 11 years; 6,600 projects added to 2002’s 12,000 projects, for a total of 18,600 by 2013. Yet the total number of projects went up by only 1,500 over the 11-year period—less than one-quarter of what the “new data” claims would suggest.

In summary, we have two sets of figures in apparent conflict here. In each presentation,

1) the project data set is claimed to grow at a certain rate (50-75 per month, which amounts to 600-900 per year).
2) the reported number of projects grows at a completely different rate (on average, 136 per year).

What explains the inconsistency between the two sets of figures?
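The conflict is easy to check with simple arithmetic. Here’s a sketch in Ruby—the method name is my own, but the figures are the ones quoted from the slides above:

```ruby
# Extrapolate the expected size of the project data set from the
# "new data = about 50-75 projects per month" claim on the slides.
def expected_total(start_total, months, per_month)
  start_total + months * per_month
end

# 2002 slide: 12,000 projects. 2013 slide: 13,500 projects.
# Roughly 11 years (132 months) separate them.
months = 11 * 12

low  = expected_total(12_000, months, 50)  # => 18600
high = expected_total(12_000, months, 75)  # => 21900

reported = 13_500
actual_growth      = reported - 12_000     # => 1500
claimed_growth_low = low - 12_000          # => 6600

puts "reported growth over 11 years: #{actual_growth}"
puts "claimed growth, lower bound:   #{claimed_growth_low}"
```

Even against the most conservative claim (50 per month), the reported growth of 1,500 projects is less than a quarter of the 6,600 the slides’ own numbers imply.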

I thank Laurent Bossavit for his inspiration and help with this project.

Weighing the Evidence

Friday, September 12th, 2014

I’m going to tell you a true story.

Recently, in response to a few observations, I began to make a few changes in my diet and my habits. Perhaps you’ll be impressed.

  • I cut down radically on my consumption of sugar.
  • I cut down significantly on carbohydrates. (Very painful; I LOVE rice. I LOVE noodles.)
  • I started drinking less alcohol. (See above.)
  • I increased my intake of tea and water.
  • I’ve been reducing how much I eat during the day; some days I don’t eat at all until dinner. Other days I have breakfast, lunch, and dinner. And a snack.
  • I reflected on the idea of not eating during the day, thinking about Moslem friends who fast, and about Nassim Taleb’s ideas in Antifragile. I decided that some variation of this kind in a daily regimen is okay; even a good idea.
  • I started weighing myself regularly.

    Impressed yet? Let me give you some data.

    When I started, I reckon I was just under 169 lbs. (That’s 76.6 kilograms, for non-Americans and younger Canadians. I still use pounds. I’m old. Plus it’s easier to lose a pound than a kilogram, so I get a milestone-related ego boost more often.)

    Actually, that 169 figure is a bit of a guess. When I became curious about my weight, the handiest tool for measuring it was my hotel room’s bathroom scale. I kicked off my shoes, and then weighed myself. 173 lbs., less a correction for my clothes and the weight of all of the crap I habitually carry around in my pockets: Moleskine, iPhone, Swiss Army knife, wallet stuffed with receipts, pocket change (much of it from other countries). Sometimes a paperback.

    Eventually I replaced the batteries on our home scale (when did bathroom scales suddenly start needing batteries? Are there electronics in there? Is there software? Has it been tested?—but I digress). The scale implicitly claims a certain level of precision by giving readings to the tenth of a pound. These readings are reliable, I believe; that is, they’re consistent from one measurement to the next. I tested reliability by weighing myself several times over a five-minute period, and the results were consistent to the tenth of a pound. I repeated that test a day or two later. My weight was different, but I observed the same consistency.

    I’ve been making the measurement of my actual weight a little more precise by, uh, leaving the clothes out of the measurement. I’ve been losing between one and two pounds a week pretty consistently. A few days ago, I weighed myself, and I got a figure of 159.9 lbs. Under 160! Then I popped up for a day or two. This morning, I weighed myself again. 159.4! Bring on the sugar!

    That’s my true story. Now, being a tester, I’ve been musing about aspects of the measurement protocol.

    For example, being a bathroom scale, it’s naturally in the bathroom. The number I read from the scale can vary depending on whether I weigh myself Before or After, if you catch my meaning. If I’ve just drunk a half litre of water, that’s a whole pound to add to the variance. I’ve not been weighing myself at consistent times of the day, either. In fact, this afternoon I weighed myself again: 159.0! Aren’t you impressed!

    Despite my excitement, it would be kind of bogus for me to claim that I weigh 159.0 lbs, with the “point zero”. I would guess my weight fluctuates by at least a pound through the day. More formally, there’s natural variability in my weight, and to be perfectly honest, I haven’t measured that variability. If I were trying to impress you with my weight-loss achievement, I’d be disposed to report the lowest number on any given day. You’d be justified in being skeptical about my credibility, which would make me obliged to earn it if I care about you. So what could I do to make my report more credible?

    • I could weigh myself several times per day (say, morning, afternoon, and night) at regular times, average the results, and report the average. If I wanted to be credible, I’d tell you about my procedure. If I wanted to be very credible, I’d tell you about the variances in the readings. If I wanted to be super credible, I’d let you see my raw data, too.

      All that would be pretty expensive and disruptive, since I would have to spend a few minutes going through a set procedure (no clothes, remember?) at very regular times, every day, whether I was at home or at a business lunch or travelling. Few hotel rooms provide scales, and even if they did, for consistency’s sake, I’d have to bring my own scale with me. Plus I’d have to record and organize and report the data credibly too. So…

    • Maybe I could weigh myself once a day. To get a credible reading, I’d weigh myself under very similar and very controlled conditions; say, each morning, just before my shower. This would be convenient and efficient, since doffing clothes is part of the shower procedure anyway. (I apologize for my consistent violation of the “no disturbing mental images” rule in this post.) I’d still have to bring my own scale with me on business trips to be sure I’m using consistent instrumentation.
    • Speaking of instrumentation, it would be a good idea for me to establish the reliability and validity of my scale. I’ve described its reliability above; it produces a consistent reading from one measurement to the next. Is it a valid reading, though? If I desired credibility, I’d calibrate the scale regularly by comparing its readings to a reference scale or reference weight that itself was known to be reliable (consistent between observations) and valid (consistent with some consensus-based agreement on what “a pound” is). If I wanted to be super-credible, I’d report whatever inaccuracy or variability I observed in the reading from my scale, and potential inconsistencies in my reference instruments, hoping that both were within an acceptable range of tolerance. I might also invite other people to scrutinize and critique my procedure.
    • If I wanted to be ultra-scientific, I’d also have to be prepared to explain my metric—the measurement function by which I hang a number on an observation—and the manner in which I operationalized the metric. The metric here is bound into the bathroom scale: for each unit pound placed on the scale, the displayed figure should increase by 1.0. We could test that as I did above. Or, more whimsically, if I were to put 159 one-pound weights on one side of Sir Bedevere’s largest scales, and me on the other, the scales would be in perfect balance (“and therefore… A WITCH!”), assuming no problems with the machinery.
    • If I missed any daily observations, that would be unfortunate and potentially misleading. Owning up to the omission and reporting it would probably be preferable to covering it up. Covering up and getting caught would torpedo my credibility.
    • Based on some early samples, and occasional resampling, I could determine the variability of my own weight. When reporting, I could give a precise figure along with the natural variation in the measurement: 159.4 lbs, +/- 1.2 lbs.
    • Unless I’m wasting away, you’d expect to see my weight stabilize after a while. Stabilize, but not freeze. Considering the natural variance in my weight, it would be weird and incredible if I were to report exactly the same weight week after week. In that case, you’d be justified in suspecting that something was wrong. It could be a case of quixotic reliability—Kirk and Miller’s term for an observation that is consistent in a trivial and misleading way, as a broken thermometer might yield. Such observations, they say, frequently prove “only that the investigator has managed to observe or elicit ‘party line’ or rehearsed information. Americans, for example, reliably respond to the question ‘How are you?’ with the knee-jerk ‘Fine.’ The reliability of this answer does not make it useful data about how Americans are.” Another possibility, of course, is that I’m reporting faked data.
    • It might be more reasonable to drop the precision while retaining accuracy. “About 160 lbs” is an accurate statement, even if it’s not a precise one. “About 160, give or take a pound or so” is accurate, with a little patina of precision and a reasonable and declared tolerance for imprecision.
    • Plus, I don’t think anyone else cares about a daily report anyhow. Even I am only really interested in things in the longer term. Having gone this far watching things closely, I can probably relax. One weighing a week, on a reasonably consistent day, first thing in the morning before the shower (I promise; that was the last time I’ll present that image) is probably fine. So I can relax the time and cost of the procedure, too.
    • I’m looking for progress over time to see the effects of the changes I’ve made to my regimen. Saying “I weigh about 160. Six weeks ago, I weighed about 170” adds context to the report. I could provide the raw data:

      Plotting the data against time on a chart would illustrate the trend. I could display the data in a way that showed impressive progress:

      But basing the Y-axis at 154.0 (to which Excel defaulted, in this case) wouldn’t be very credible because it exaggerates the significance of the change. To be credible, I’d use a zero base:

      Using a zero-based Y-axis on the chart would show the significance of change in a more neutral way.

    • To support the quantitative data, I might add other observations, too: I’ve run out of holes on my belt and my pants are slipping down. My wife has told me that I look trimmer. Given that, I could add these observations to the long-term trend in the data, and could cautiously conclude that the regimen overall was having some effect.
    • All this is fine if I’m trying to find support for the hypothesis that my new regimen is having some effect. It’s not so good for two other things. First, it does not prove that my regimen change is having an effect. Maybe it’s having no effect at all, and I’ve been walking and biking more than before; or maybe I acquired some kind of wasting disease just as I began to cut down on the carbs. Second, it doesn’t identify specific factors that brought about weight loss and rule out other factors. To learn about those and to report on them credibly, I’d have to go back to a more refined approach. I would have to vary aspects of my diet while controlling others and make precise observations of what happened. I’d have to figure out what factors to vary, why they might be important, and what effects they might have. In other words, I’d be developing a hypothesis tied to a model and a body of theory. Then I’d set up experiments, systematically varying the inputs to see their effects, and searching for other factors that might influence the outcomes. I’d have to control for confounding factors outside of my diet. To make the experiment credible, I’d have to show that the numbers were focused on describing results, and not on attaining a goal. That’s the distinction between inquiry metrics and control metrics: an inquiry metric triggers questions; a control metric influences or drives decisions.
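The “precise figure plus natural variation” report suggested above is easy to produce. Here’s a sketch in Ruby; note that the sample readings are invented for illustration, not my actual data:

```ruby
# Hypothetical daily weight readings, in pounds (invented for illustration).
readings = [160.2, 159.4, 160.8, 158.9, 159.9, 160.5, 159.1]

mean = readings.sum / readings.size

# Half the range is a crude "plus or minus" figure; a standard
# deviation would be the more formal choice for reporting variability.
spread = (readings.max - readings.min) / 2.0

puts format("%.1f lbs, +/- %.1f lbs", mean, spread)
```

Reporting the procedure along with the number—how many readings, at what times, with what spread—is what earns the report its credibility; the arithmetic itself is trivial.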

    When I focus on the number, I set up the possibility of some potentially harmful effects. To make the number look really good on any given day, I might cut my water intake. To make the number look fabulous over a prolonged period (say, as long as I was reporting my weight to you), I could simply starve myself until you stopped paying attention. Then it’d be back to lots of sugar in the coffee, and yes, I will have another beer, thank you. I know that if I were to start exercising, I’d build up muscle mass, and muscle weighs more than flab. It becomes very tempting to optimize my weight in pounds, not only to impress you, but also to make me feel proud of myself. Worst of all: I might rig the system not consciously, but unconsciously. Controlling the number is reciprocal; the number ends up controlling me.

    Having gone through all of this, it might be a good idea to take a step back and line up the accuracy and precision of my measurement scheme with my goal—which I probably should have done in the first place. I don’t really care how much I weigh in pounds; that’s just a number. No one else should care how much I weigh every day. And come to think of it, even if they did care, it’s none of their damn business. The quantitative value of my weight is only a stand-in—a proxy or an indirect measurement—for my real goal. My real goal is to look and feel more sleek and trim. It’s not to weigh a certain number of pounds; it’s to get to a state where my so-called “friends” stop patting my belly and asking me when the baby is due. (You guys know who you are.)

    That goal doesn’t warrant a strict scientific approach, a well-defined system of observation, and precise reporting, because it doesn’t matter much except to me. Some data might illustrate or inform the story of my progress, but the evidence that matters is in the mirror; do I look and feel better than before?

    In a different context, you may want to persuade people in a professional discipline of some belief or some course of action, while claiming that you’re making solid arguments based on facts. If so, you have to marshal and present your facts in a way that stands up to scrutiny. So, over the next little while, I’ll raise some issues and discuss things that might be important for credible reporting in a professional community.

    This blog post was strongly influenced by several sources.

    Cem Kaner and Walter P. Bond, “Software Engineering Metrics: What Do They Measure and How Do We Know”. In particular, I used the ten questions on measurement validity from that paper as a checklist for my elaborate and rigorous measurement procedures above. If you’re a tester and you haven’t read the paper, my advice is to read it. If you have read it, read it again.

    Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Snappy title, eh? As books go, it’s quite expensive, too. But if you’re going to get serious about looking at measurement validity, it’s a worthwhile investment, extremely interesting and informative.

    Jerome Kirk and Mark L. Miller, Reliability and Validity in Qualitative Research. This very slim book raises lots of issues in performing, analyzing, and reporting research, if your aim is to do credible work. (Ultimately, all research, whether focused on quantitative data or not, serves a qualitative purpose: understanding the nature of things at least a little better.)

    Gerald M. (Jerry) Weinberg, Quality Software Management, Vol. 2: First Order Measurement (also available as two e-books, “How to Observe Software” and “Responding to Significant Software Events”).

    Edward Tufte’s Presenting Data and Information (a mind-blowing one-day course) and his books The Visual Display of Quantitative Information; Envisioning Information; Visual Explanations; and Beautiful Evidence.

    Prior Art Dept.: As I was writing this post, I dimly recalled Brian Marick posting something on losing weight several years ago. I deliberately did not look at that post until I was finished with this one. From what I can see, that material was not related to this. On the other hand, I hope Brian continues to look and feel his best. 🙂

    I thank Laurent Bossavit and James Bach for their reviews of earlier drafts of this article.

Construct Validity

Tuesday, September 9th, 2014

A construct, in science, is (informally) a pattern or a means of categorizing something you’re talking about, especially when the thing you’re talking about is abstract.

Constructs are really important in both qualitative and quantitative research, because they allow us to differentiate between “one of these” and “not one of these”, which is one of the first steps in measurement and analysis. If you want to describe something or count it such that other people find you credible, you’ll need to describe the difference between “one” and “not-one” in a way that’s valid. (“Valid” here means that you’ve provided descriptions, explanations, or measurements for your categorization scheme while managing or ruling out alternatives, such that other people are prepared to accept your construct, and your definition can withstand challenges successfully.)

If you’re familiar with object-oriented programming, you might think of a construct as being like a class, in that objects have an “is a” relationship to a class. In an object-oriented program, things tend to be pretty tidy; an object is either a member of a certain class or it isn’t. For example, in Ruby, an object will respond to a query of the kind_of?() method with a binary true or false. In the world, not under the control of nice, neat models developed by programmers armed with digital computers, things are more messy.
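In Ruby, for instance, that tidiness looks like this (a minimal sketch; the class names are mine, chosen to foreshadow the example that follows):

```ruby
# A tidy object-oriented model: class membership is strictly binary.
class Vehicle; end
class Car < Vehicle; end

car = Car.new

# kind_of? answers with a plain true or false; an object either
# is an instance of the class (or a subclass) or it isn't.
car.kind_of?(Car)      # => true
car.kind_of?(Vehicle)  # => true  (a Car "is a" Vehicle)
car.kind_of?(String)   # => false
```

No object here is “sort of” a Vehicle; the program admits no in-between cases of the kind the world keeps handing us.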

Suppose that someone asks you to identify vehicles and pedestrians passing by a little booth that he’s set up. It seems pretty obvious that you’d count cars and trucks without asking him for clarification. However, what about bicycles? Tricycles? A motor scooter? An electric motor scooter? If a unicyclist goes by, do we count him? A skateboarder? A pickup truck towing a wagon with two ATVs in it? A recreational vehicle towing a car? An ATV? A tractor, pulling a wagon? A diesel truck pulling a trailer? How do you count a tow-truck, towing another vehicle, with the other vehicle’s driver riding in the tow truck? As one vehicle or two? A bus? A car transporter—a truck with nine vehicles on it? Who cares, you ask?

Well, the booth is at the entrance to a ferry boat, and the fee is $60 per vehicle, $5 per passenger, and $10 for pedestrians. Lots of people (especially those self-righteous cyclists) (relax; I’m one of them too) will gripe if they’re charged sixty bucks. Yet where I live, a bicycle is considered a vehicle under the Highway Traffic Act, which would suit the ferry owner who wants to maximize the haul of cash. He’d especially like to see $600 from the car transporter. So in regular life, categorization schemes count, and the method for determining what fits into what category counts too.

How many vehicles?

If the problem is tricky for physical things—widgets—it’s super-tricky for abstractions in science that pertains to humans. You’ve decided to study the effect of a new medicine, and you want to try it out on healthy people to check for possible side effects. What is a healthy person? Health is an abstraction; a construct. If someone is in terrific shape but happens to have a cold today, does that person count as healthy? Over the last few summers, I’ve met a kid who’s a friend of a friend. He’s fit, strong, capable, active… and he does kidney dialysis every couple of days or so. Healthy? A transplant patient who is in great shape, but who needs a daily dose of anti-rejection drugs: healthy?

If your country gives extra points to potential immigrants who are bilingual (as mine does), what level of fluency constitutes competence in a language to the degree that you can decide, “bilingual or not”? Note that I’m not referring to a test of whether someone is bilingual or not; I’m talking about the criteria that we’re going to test for; our sorting rules. Economists talk about “the economy” growing; what constitutes “the economy”? People speak of “events”; when airplanes hit the World Trade Center, was that one event or two? Who cares? Property owners and insurance companies cared very deeply indeed.

Construct validity is important in the “hard” physical sciences. “Temperature” is a construct. “To discuss the validity of a thermometer reading, a physical theory is necessary. The theory must posit not only that mercury expands linearly with temperature, but that water in fact boils at 100°. With such a theory, a thermometer that reads 82° when the water breaks into a boil can be reckoned inaccurate. Yet if the theory asserts that water boils at different temperatures under different ambient pressures, the same measurement may be valid under different circumstances — say at one half an atmosphere.” (Kirk and Miller, Reliability and Validity in Qualitative Research) Atmospheric pressure varies from day to day, from hour to hour. So what is the temperature outside your window right now? The “correct” answer is surprisingly hard to decide.
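Kirk and Miller’s point can be made concrete. The Antoine equation is a standard empirical model relating vapour pressure to temperature; the constants below are the commonly tabulated ones for water over roughly 1–100 °C. Under that theory, a thermometer reading of about 82° at a rolling boil is exactly what we should expect at half an atmosphere:

```python
import math

# Antoine equation for water: log10(P) = A - B / (C + T),
# with P in mmHg and T in degrees Celsius. Solving for T gives the
# boiling point at a given ambient pressure.
A, B, C = 8.07131, 1730.63, 233.426  # tabulated constants for water

def boiling_point_c(pressure_mmhg):
    return B / (A - math.log10(pressure_mmhg)) - C

print(round(boiling_point_c(760), 1))  # ~100.0 at one atmosphere
print(round(boiling_point_c(380), 1))  # ~81.7 at half an atmosphere
```

Whether the thermometer is “inaccurate” depends not on the instrument alone, but on the theory you bring to the reading.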

In the “soft” social sciences and qualitative research, the measurement problem is even harder. Kirk and Miller go on, “In the case of qualitative observations, the issue of validity is not a matter of methodological hairsplitting about the fifth decimal point, but a question of whether the researcher sees what he or she thinks he or she sees.” (Kirk and Miller, Reliability and Validity in Qualitative Research)

When we come to the field of software development, there are certain constructs that people bandy about as though they were widgets, instead of idea-stuff: requirements; defects; test cases; tests; fixes; discoveries. What is a “programmer”? What is a “tester”? Is a programmer who spends a couple of days writing a test framework a programmer or a tester? Questions like these raise problems for anyone who wants a quantitative answer to the question, “How many testers per developer?” Kaner, Hendrickson, and Smith-Brock go into extensive detail on the subject. I’ve written about what counts before, too.

There’s a terrible difficulty in our craft: those who seem most eager to measure things seem not to pay very much attention to the problem of construct validity, as Cem Kaner and Walter P. Bond point out in their landmark paper, “Software Engineering Metrics: What Do They Measure and How Do We Know?”. (I’m usually loath to say “All testers should do X”, but I think anyone serious about measurement in software development should read this paper. It’s not hard. Do it now. I’ll wait.)

If you’re doing research into software development, how do you define, describe, and justify your notion of “defects” such that you count all the things that are defects, and leave out all the things that aren’t defects, and such that your readers agree? If you’re getting reports and aggregating data from the field, how do you make sure that other people are counting the same way as you are? Does “defect” have the same meaning in a game development shop as it does for the makers of avionics software? If you’re attempting to prove something in a quantitative, rigorous, and scientific way, how do you answer objections when you say something is a defect and someone else says it isn’t? How do you respond when someone wants to say that “there’s more to defects than coding errors”?
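Here’s a tiny, entirely hypothetical sketch of the problem: the same issue tracker, counted under two plausible definitions of “defect”. The issue types and the two rules are invented, but any quantitative claim built on top of such a count inherits the choice of rule:

```python
# The same set of reported issues, counted under two different
# definitions of "defect". All categories here are invented.

issues = [
    {"id": 1, "type": "coding_error"},
    {"id": 2, "type": "requirements_gap"},
    {"id": 3, "type": "usability_problem"},
    {"id": 4, "type": "coding_error"},
    {"id": 5, "type": "doc_error"},
]

# Definition 1: only coding errors count as defects.
defects_narrow = [i for i in issues if i["type"] == "coding_error"]

# Definition 2: anything that threatens value to someone counts.
defects_broad = issues

print(len(defects_narrow), len(defects_broad))  # 2 vs. 5
```

A “defect density” or a “defects found per tester” figure computed from one shop is not comparable to the same figure from another shop unless both shops sort the same way—and that is precisely the construct-validity question.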

Those questions will become very important in the days to come. Stay tuned.

For extra reading: See Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. This book is unusually expensive, but well worth it if you’re serious about measurement and validity.

Rising Against the Rent-Seekers

Monday, August 25th, 2014

At CAST 2014, a quiet, modest, thoughtful, and very experienced man named James Christie gave a talk called “Standards: Promoting Quality or Restricting Competition?”. The talk followed on from his tutorial at EuroSTAR 2013 on working with auditors—James is a former auditor himself—and from his blogs on software standards over the years.

James’ talk introduced to our community the term rent-seeking. Rent-seeking is the act of using political means—the exercise of power—to obtain wealth without creating wealth. One form of rent-seeking is using regulations or standards in order to create or manipulate a market for consulting, training, and certification.

James’ CAST presentation galvanized several people in attendance to respond to ISO Standard 29119, the most recent rent-seeking scheme by a very persistent group of certificationists and standards promoters. Since the ISO standard on standards requires—at least in theory—consensus from industry experts, some people proposed a petition to demonstrate opposition and the absence of consensus amongst skilled testers. I have signed this petition, and I urge you to read it, and, if you agree, to sign it too.

Subsequently, a publication named Professional Tester published—under an anonymous byline—a post about the petition, with the provocative title “Book burners threaten (old) new testing standard”. Presumably such (literally) inflammatory language was meant as clickbait. Ordinarily such things would do little to foster thoughtful discussion about the issues, but it prompted some quite thoughtful reactions. Here’s one example; here’s another. Meanwhile, if the author wishes to characterize me as a book burner, here are (selected) contents of my library relevant to software testing. Even the lamest testing books (and some are mighty lame) have yet to be incinerated.

In the body text, the anonymous author mischaracterises the petition and its proponents, of which I am one. “Their objection,” (s)he says, “is that not everyone will agree with what the standard says: on that criterion nothing would ever be published.” I might not agree with what the standard says, but that’s mostly a side issue for the purposes of this post. I disagree with what the authors of the standard attempt to do with it.

1) To prescribe expensive, time-consuming, and wasteful focus on bloated process models and excessive documentation. My concern here is that organizations and institutions will engage in goal displacement: expending money, time and resources on demonstrating compliance with the standard, rather than on actually testing their products and services. Any kind of work presents opportunity cost; when you’re doing something, most of the time it prevents you from doing something else. Every minute that a tester spends on wasteful documentation is a minute that the tester cannot fulfill the overarching mission of testing: learning about the product, with an emphasis on discovering important problems that threaten value or safety, so that our clients can make informed decisions about problems and risks.

I am not objecting here to documentation, as the calumny from Professional Tester suggests. I am objecting to excessive and wasteful documentation. Ironically, the standard itself provides an example: the current version of ISO 29119-1 runs to 64 pages; 29119-2 has 68 pages; and 29119-3 has 138 pages. If those pages follow the pattern of earlier drafts, or of most other ISO documents, you have a long, pointless, and sleep-inducing read ahead of you. Want a summary model of the testing process? Try this example of what the rent-seekers propose as their model of testing work. Note the model’s similarity to that of an (overly complex and poorly architected) computer program.

2) To set up an unnecessary market for training, certification, and consultancy in interpreting and applying the standard. The primary tactic here is to instill the fear of being de-certified. We’ve been here before, as shown in this post from Tom DeMarco (date uncertain, but it seems to have been written prior to 2000).

Rent-seeking is of the essence, and we’ve been here before in another sense: this was one of the key goals of the promulgators of the ISEB and ISTQB. In the image, they’ve saved the best for last.

The well-informed reader will note that the list of organizations behind those schemes and the members of the ISO 29119 international working group look strikingly similar.

If the working group happens to produce a massive and opaque set of documents, and you’re in an environment that claims conformance to the 29119 standards, and you want to get some actual testing work done, you’ll probably find it helpful to hire a consultant to help you understand them, or to help defend you from charges that you were not following the standard. Maybe you’ll want training and certification in interpreting the standard—services that the authors’ consultancies are primed to offer, with extra credibility because they wrote the standards! Good thing there are no ethical dilemmas around all of this.

3) To use the ISO’s standards development process to help suppress dissent. If you want to be on the international working group, it’s a commitment to six days of non-revenue work, somewhere in the world, twice a year. The ISO/IEC does not pay for travel expenses. Where have international working group meetings been held? According to the Web site, meetings seem to have been held in Seoul, South Korea (2008); Hyderabad, India (2009); Niigata, Japan (2010); Mumbai, India (2011); Seoul, South Korea (2012); Wellington, New Zealand (2013). Ask yourself these questions:

  • How many independent testers or testing consultants from Europe or North America have that kind of travel budget?

  • What kinds of consultants might be more likely to obtain funding for this kind of travel?

  • Who benefits from the creation of a standard whose opacity demands a consultant to interpret or to certify?

Meanwhile, if you join one of the local working groups, there are two ways that the group arrives at consensus.

  • By reaching broad agreement on the content. (Consensus, by the way, does not mean unanimity—that everyone agrees with the content. It would be closer to say that in a consensus-based decision-making process, everyone agrees that they can live with the content.) But, if you can’t get to that, there’s another strategy.

  • By attrition. If your interest is in promulgating an unwieldy and opaque standard, there will probably be objectors. When there are, wait them out until they get frustrated enough to leave the decision-making process. Alan Richardson describes his experience with ISEB in this way.

In light of that, ask yourself these questions:

  • How many independent consultants have the time and energy to attend local working groups, often during otherwise billable hours?

  • What kinds of consultants might be more likely to support attendance at local working groups?

  • Who benefits from the creation of a standard that needs a consultant to interpret or to certify?

4) To undermine the role of skill in testing, and the reputations of people who discuss and promote it. “The real reason the book burners want to suppress it is that they don’t want there to be any standards at all,” says the polemicist from Professional Tester. I do want there to be standards for widgets and for communication protocols, but not for complex, cognitive, context-sensitive intellectual work. There should be standards for designed things that are intended to work together, but I’m not at all sure there should be mandated standards for how to do design. S/he goes on: “Effective, generic, documented systematic testing processes and methods impact their ability to depict testing as a mystic art and themselves as its gurus.” Far from treating testing as a mystic art, appealing to things like “intuition” and “experience-based techniques”, my community has been trying to get to the heart of testing skills, flexible and responsive coverage reporting, tacit and explicit knowledge, and the premises of the way we do testing. I’ve seen no such effort to dig deeper into these subjects—and to demystify them—from the rent-seekers.

Unlike the anonymous author at Professional Tester, I am willing to stand behind my work, my opinions, and my reputation by signing my name and encouraging comments. Feel free.

—Michael B.