Blog Posts for the ‘Context’ Category

What Do You Mean By “Arguing Over Semantics”?

Wednesday, April 3rd, 2013

Commenting on testing and checking, one correspondent responds:

“To be honest, I don’t care what these types of verification are called be it automated checking or manual testing or ministry of John Cleese walks. What I would like to see is investment and respect being paid to testing as a profession rather than arguing with ourselves over semantics.”

My very first job in software development was as a database programmer at a personnel agency. Many times, when I wrote a bit of new code, I got a reality check: the computer always did exactly what I said, and not necessarily what I meant. The difference was something that I experienced as a bug. Sometimes the things that I told the computer were consistent with what I meant to tell it, but the way I understood something and the way my clients understood it were different. In that case, the difference was something that my clients experienced as a bug, even though I didn’t, at first. The issue was usually that my clients and I didn’t agree on what we said or what we meant. That wasn’t out of ignorance or ill will. The problem was often that my clients and I had shallow agreement on a concept. A big part of the job was refining our words for things—and when we did that, we often found that the conversation refined our ideas about things too. Those revelations (Eric Evans calls them “knowledge crunching”) are part of the process of software development.

As the only person on my development team, I was also responsible for preparing end-user documentation for the program. My spelling and grammar could be impeccable, and spelling and grammar checkers could confirm that my words were syntactically correct. But when my description of how to use the program was vague, inaccurate, or imprecise, the agents who used the application would get confused, or would make mistakes, or would miss out on something important. There was a real risk that the company’s clients wouldn’t get the candidates they wanted, or that some qualified person wouldn’t get a shot at a job. Being unclear had real consequences for real people.

A few years later, my friend Dan Spear—at the time, Quarterdeck’s chief scientist, and formerly the principal programmer of QEMM-386—accepted my request for some lessons in assembly language programming. He began the first lesson while we were both sitting back from the keyboard. “Programming a computer,” he began, “is the most humbling thing that you can do. The computer is like a mirror. It does exactly what you tell it to do, and in doing that, it reflects any sloppiness in your thinking or in your way of expressing yourself.”

I was a program manager (a technical position) for the company for four years. Towards the end of my tenure, we began working on an antivirus product. One of the product managers (“product manager” was a marketing position) wanted to put a badge on the retail box: “24 hour support response time!” In a team meeting, we technical people made it clear that we didn’t provide 24-hour monitoring of our support channels. The company’s senior management clearly had no intention of staffing or funding 24-hour support, either. We were in Los Angeles, and the product was developed in Israel. It took development time—sometimes hours, but sometimes days—to analyse a virus and figure out ways to detect and to eradicate it. Nonetheless, the marketing guy (let’s call him Mark) continued to insist that that’s what he wanted to put on the box. One of the programming liaisons (let’s call him Paul) spoke first:

Paul: “I doubt that some of the problems we’re going to see can be turned around in 24 hours. Polymorphic viruses can be tricky to identify and pin down. So what do you mean by 24-hour response time?”

Mark: “Well, we’ll respond within 24 hours.”

Paul: “With a fix?”

Mark: “Not necessarily, but with a response.”

Paul: “With a promise of a fix? A schedule for a fix?”

Mark: “Not necessarily, but we will respond.”

Paul: “What does it mean to respond?”

Mark: “When someone calls in, we’ll answer the phone.”

Sam (a support person): “We don’t have people here on the weekends.”

Mark: “Well, 24 hours during the week.”

Sam: “We don’t have people here before 7:00am, or after 5:00pm.”

Mark: “Well… we’ll put someone to check voicemail as soon as they get in… and, on the weekends… I don’t know… maybe we can get someone assigned to check voicemail on the weekend too, and they can… maybe, uh… send an email to Israel. And then they can turn it around.”

At this point, as the program manager for the product, I’d had enough. I took a deep breath, and said, “Mark, if you put ’24-hour response time’ on the box, I will guarantee that that will mislead some people. And if we mislead people to take advantage of them, knowing that we’re doing it, we’re lying. And if they give us money because of a lie we’re telling, we’re also stealing. I don’t think our CEO wants to be the CEO of a lying, stealing company.”

There’s a common thread that runs through these stories: they’re about what we say, about what we mean, and about whether we say what we mean and mean what we say. That’s semantics: the relationships between words and meaning. Those relationships are central to testing work.

If you feel yourself tempted to object to something by saying “We’re arguing about semantics,” try a macro expansion: “We’re arguing about what we mean by the words we’re choosing,” which can then be shortened to “We’re arguing about what we mean.” If we can’t settle on the premises of a conversation, we’re going to have an awfully hard time agreeing on conclusions.

I’ll have more to say on this tomorrow.

Premises of Rapid Software Testing, Part 3

Thursday, September 27th, 2012

Over the last two days, I’ve published the premises of the Rapid Software Testing classes and methodology, as developed by James Bach and me. The first set addresses the nature of Rapid Testing’s engagement with software development—an ambitious activity, performed by fallible humans for other fallible humans, under conditions of uncertainty and time pressure. The second set addresses the nature of testing as an investigative activity focused on understanding the product and discovering problems that threaten its value. Today I present the last three premises, which deal with our relationship to our clients and to quality.

6. We commit to performing credible, cost-effective testing, and we will inform our clients of anything that threatens that commitment. Rapid Testing seeks the fastest, least expensive testing that completely fulfills the mission of testing. We should not suggest million dollar testing when ten dollar testing will do the job. It’s not enough that we test well; we must test well given the limitations of the project. Furthermore, when we are under constraints that may prevent us from doing a good job, testers must work with the client to resolve those problems. Whatever we do, we must be ready to justify and explain it.

7. We will not knowingly or negligently mislead our clients and colleagues. This ethical premise drives a lot of the structure of Rapid Software Testing. Testers are frequently the target of well-meaning but unreasonable or ignorant requests by their clients. We may be asked to suppress bad news, to create test documentation that we have no intention of using, or to produce invalid metrics to measure progress. We must politely but firmly resist such requests unless, in our judgment, they serve the better interests of our clients. At minimum we must advise our clients of the impact of any task or mode of working that prevents us from testing, or creates a false impression of the testing.

8. Testers accept responsibility for the quality of their work, although they cannot control the quality of the product. Testing requires many interlocking skills. Testing is an engineering activity requiring considerable design work to conceive and perform. Like many other highly cognitive jobs, such as investigative reporting, piloting an airplane, or programming, it is difficult for anyone not actually doing the work to supervise it effectively. Therefore, testers must not abdicate responsibility for the quality of their own work. By the same token, we cannot accept responsibility for the quality of the product itself, since it is not within our span of control. Only programmers and their management control that. Sometimes testing is called “QA.” If so, we choose to think of it as quality assistance (an idea due to Cem Kaner) or quality awareness, rather than quality assurance.

Premises of Rapid Software Testing, Part 2

Wednesday, September 26th, 2012

Yesterday I published the first three premises that underlie the Rapid Software Testing methodology developed and taught by James Bach and me. Today’s two premises address the nature of “test” as an activity—a verb, rather than a noun—and the purpose of testing as we see it: understanding the product and imparting that understanding to our clients, with emphasis on problems that threaten the product’s value.

4. A test is an activity; it is a performance, not an artifact. Most testers will casually say that they “write tests” or that they “create test cases.” That’s fine, as far as it goes. That means they have conceived of ideas, data, procedures, and perhaps programs that automate some task or another; and they may have represented those ideas in writing or in program code. Trouble occurs when any of those things is confused with the ideas they represent, and when the representations become confused with actually testing the product. This is a fallacy called reification, the error of treating abstractions as though they were things. Until some tester engages with the product, observes it and interprets those observations, no testing has occurred. Even if you write a completely automatic checking process, the results of that process must be reviewed and interpreted by a responsible person.

5. Testing’s purpose is to discover the status of the product and any threats to its value, so that our clients can make informed decisions about it. There are people who have other purposes in mind when they use the word “test.” For some, testing may be a ritual of checking that basic functions appear to work. This is not our view. We are on the hunt for important problems. We seek a comprehensive understanding of the product. We do this in support of the needs of our clients, whoever they are. The level of testing necessary to serve our clients will vary. In some cases the testing will be more formal and simple, in other cases, informal and elaborate. In all cases, testers are suppliers of vital information about the product to those who must make decisions about it. Testers light the way.

I’ll continue with the last three premises of Rapid Software Testing tomorrow.

Premises of Rapid Software Testing, Part 1

Tuesday, September 25th, 2012

In February of 2012, James Bach and I got together for a week of work one-on-one, face-to-face—something that happens all too rarely. We worked on a number of things, but the principal outcome was a statement of the premises on which Rapid Software Testing—our classes and our methodology—are based. In deference to Twitter-sized attention spans like mine, I’ll post the premises over the next few days. Here’s the preamble and the first three points:

These are the premises of the Rapid Software Testing methodology. Everything in the methodology derives in some way from this foundation. These premises derive from our experience, study, and discussions over a period of decades. They have been shaped by the influence of two thinkers above all: Cem Kaner and Jerry Weinberg, both of whom have worked as programmers, managers, social scientists, authors, teachers, and of course, testers. (We do not claim that Cem or Jerry will always agree with James or me, or with each other. Sometimes they will disagree. We are not claiming their endorsement here, but instead we are gratefully acknowledging their positive impact on our thinking and on our work. We urge thinking testers everywhere to study the writings and ideas of these two men.)

1. Software projects and products are relationships between people, who are creatures both of emotion and rational thought. Yes, there are technical, physical, and logical elements as well, and those elements are very substantial. But software development is dominated by human aspects: politics, emotions, psychology, perception, and cognition. A project manager may declare that any given technical problem is not a problem at all for the business. Users may demand features they will never use. Your fabulous work may be rejected because the programmer doesn’t like you. Sufficiently fast performance for a novice user may be unacceptable to an experienced user. Quality is always value to some person who matters. Product quality is a relationship between a product and people, never an attribute that can be isolated from a human context.

2. Each project occurs under conditions of uncertainty and time pressure. Some degree of confusion, complexity, volatility, and urgency besets each project. The confusion may be crippling, the complexity overwhelming, the volatility shocking, and the urgency desperate. There are simple reasons for this: novelty, ambition, and economy. Every software project is an attempt to produce something new, in order to solve a problem. People in software development are eager to solve these problems. At the same time, they often try to do a whole lot more than they can comfortably do with the resources they have. This is not any kind of moral fault of humans. Rather, it’s a consequence of the so-called “Red Queen” effect from evolutionary theory (the name for which comes from Through the Looking Glass): you must run as fast as you can just to stay in the same place. If your organization doesn’t run with the risk, your competitors will—and eventually you will be working for them, or not working at all.

3. Despite our best hopes and intentions, some degree of inexperience, carelessness, and incompetence is normal. This premise is easy to verify. Start by taking an honest look at yourself. Do you have all of the knowledge and experience you need to work in an unfamiliar domain, or with an unfamiliar product? Have you ever made a spelling mistake that you didn’t catch? Which testing textbooks have you read carefully? How many academic papers have you pored over? Are you up to speed on set theory, graph theory, and combinatorics? Are you fluent in at least one programming language? Could you sit down right now and use a de Bruijn sequence to optimize your test data? Would you know when to avoid using it? Are you thoroughly familiar with all the technologies being used in the product you are testing? Probably not—and that’s okay. It is the nature of innovative software development work to stretch the limits of even the most competent people. Other testing and development methodologies seem to assume that everyone can and will do the right thing at the right time. We find that incredible. Any methodology that ignores human fallibility is a fantasy. By saying that human fallibility is normal, we’re not trying to defend it or apologize for it, but we are pointing out that we must expect to encounter it in ourselves and in others, deal with it compassionately, and make the most of our opportunities to learn our craft and build our skills.
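
(As an aside, for anyone curious about that de Bruijn question: here is a minimal sketch in Python of one well-known way to generate such a sequence, so that a single short string covers every possible n-character combination of a small alphabet. The details are invented purely for illustration; this is not part of the premises themselves.)

    # A sketch, not production code: the standard "prefer-smallest"
    # construction of a de Bruijn sequence B(k, n).
    def de_bruijn(alphabet: str, n: int) -> str:
        """Return a string in which every length-n combination of the
        alphabet's characters appears exactly once, read cyclically."""
        k = len(alphabet)
        a = [0] * (k * n)
        sequence = []

        def db(t: int, p: int) -> None:
            if t > n:
                if n % p == 0:
                    sequence.extend(a[1:p + 1])
            else:
                a[t] = a[t - p]
                db(t + 1, p)
                for j in range(a[t - p] + 1, k):
                    a[t] = j
                    db(t + 1, t)

        db(1, 1)
        return "".join(alphabet[i] for i in sequence)

    # "00010111" covers all eight binary triples in eight characters,
    # rather than the twenty-four it would take to list them separately.
    print(de_bruijn("01", 3))

Knowing about such a trick is one thing; knowing when it clarifies your test data and when it obscures it is the kind of judgment the premise is pointing at.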

I’ll continue with more Rapid Software Testing premises tomorrow.

I Might Be Wrong (But Not For Me)

Tuesday, March 6th, 2012

Jerry Weinberg tells a story (yes, it’s me; I’m telling yet another Jerry Weinberg story) of meeting an old friend who looked distraught.

“What’s the matter?” Jerry asked.

The fellow replied, “Well, I’m kind of shellshocked. My wife just left me.”

“Was that a surprise?”

“Yes, it really was,” the fellow said. “I mean, we had had some problems, but I thought they were all settled.”

Jerry paused for a moment. Then he said, “Nothing is ever settled.”

Several years after hearing that story I recognized its power as a general systems law. Obviously, I didn’t discover it, but I did name it. I call it “The Unsettling Rule”: Nothing is ever settled.

In Lessons Learned in Software Testing by Kaner, Bach, and Pettichord, Lesson 145 is “Use the IEEE Standard 829 for Test Documentation”. Lesson 146, on the facing page, is “Don’t Use the IEEE Standard 829”. When the book was published, some reviewers said “What’s the problem with these guys? They can’t even get it together to tell a consistent story!” Others, including me, thought that this pair of pages in particular was wonderful. It underscored the degree to which issues in the world of software testing are not settled, the degree to which our craft is a long dialogue in which there are many voices to be heard, many options to be discussed, and many contexts to be considered.

The difference between the context-driven school (or approach; there’s now apparently disagreement over whether it’s a school or an approach!) and other schools or approaches is that these disagreements can get aired in public. There are some fundamental principles on which we agree, and there are some other things on which we don’t agree. Whatever else happens, in this community, we try to make sure that there’s no fake consensus. This is alarming and disturbing, sometimes, to some people, and it can be stressful to the participants. But when it comes up, it’s a hallmark of our community that we try to deal with it. It helps to keep us sharp, and it helps to keep us honest.

Recently I wrote a blog post in which I took the position that the often-used pass-vs.-fail ratio is an invalid and misleading measurement. To summarize the post, I said, “At best, if everyone ignores it entirely, it’s simply playing with numbers. Otherwise, producing a pass/fail ratio is irresponsible, unethical, and unprofessional… The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to a number means reducing its relationship with people to a number. By extension, that means reducing people to numbers too. So to irresponsible, unethical, and unprofessional, we can add unscientific and inhumane.”

I recognize that, coming from someone who claims to be context-driven, that’s pretty extreme stuff. Yet, in its form, it’s consistent with one of those pages or the other in Lessons Learned in Software Testing (with some omissions, which I’ll address shortly). It is also consistent with a set of principles that James Bach and I espouse as part of our Rapid Software Testing class:

We will not knowingly or negligently mislead our clients and colleagues. This ethical premise drives a lot of the structure of Rapid Software Testing. Testers are frequently the target of well-meaning but unreasonable or ignorant requests by their clients. We may be asked to suppress bad news, to create test documentation that we have no intention of using, or to produce invalid metrics to measure progress. We must politely but firmly resist such requests unless, in our judgment, they serve the better interests of our clients. At minimum we must advise our clients of the impact of any task or mode of working that prevents us from testing, or creates a false impression of the testing.

To me, that statement is both in tension with and consistent with several of the principles of the context-driven school, the first and second (“The value of any practice depends on its context” and “There are good practices in context, but there are no best practices”) and the seventh (“Only through judgment and skill, exercised cooperatively throughout the entire project, are we able to do the right things at the right times to effectively test our products.”)

Pass-vs.-fail ratios, to me, fly in the face of one of the “principles in action” listed at http://www.context-driven-testing.com: “Metrics that are not valid are dangerous.”

Cem Kaner disagrees with the position expressed in my post. It seems to me that Cem’s disagreement hangs on the degree of danger and our reactions to it. I hold that in practical contexts, pass-vs.-fail ratios are so dangerous that in almost all cases they cross over the line into “unethical”, like giving the car keys to someone who is obviously drunk, or like planting land mines near a community well, even though in some rare contexts, such things could be done in good faith and without harm. Cem’s position seems to be (and I welcome correction, if it’s warranted) that although pass-vs.-fail ratios are exemplary of dangerous metrics, they’re not unethical.

Let’s start with two points that I’d like to make about the “unethical” label. One is that my ethical sense is personal, and so are the views posted on my blog. Although I’m happy when other people share them, unless otherwise stated, I don’t represent the view of any community, including my own. I don’t make claims to universal ethics. Second, Cem refers to “using the accusation of unethical as a way of shutting down discussion of whether an idea (unethical!) was any good or not.” I’m not using it that way. I have no intention whatsoever of shutting down debate (as if I could in any case!). Unless I claim otherwise, I am stating personal principles; not Right and Wrong, but right and wrong for me. I don’t know of any agency (other than society) that can make claims of Right or Wrong, and even then such claims seem always to be context-specific.

Whether pass-vs.-fail ratios are wrong or Wrong, they’re certainly wrong for me, wrong enough that I’m uncomfortable with using them on the job. I’m sufficiently uncomfortable that I’m usually going to decline to provide them, just as I would not accept a job in which I was obliged to shoot people. Other people might choose to become mercenaries or to go to war for their countries; I’d be a conscientious objector. That wrongness is relative too, of course. It’s subject to the Relative Rule: any abstract X is X to some person, at some time. I can only warrant my own ethical stance for the moment. My position on some issues has changed over the years, courtesy of some pleasant and unpleasant experiences. I’m not currently aware of things that might cause my stand to change in the future, but I have to leave the possibility open.

So, is providing pass-vs.-fail ratios unethical? On reflection, I have to say reluctantly, yeah, I think so; not absolutely, but in most practical circumstances. For me, the crucial test is in the last of Cem’s questions about ethics: “Are you helping someone else lie, cheat, steal, intimidate, or cause harm?” My answer is that I see a great deal of risk—and admittedly risk is only potential harm—that I will be aiding the client in some form of oppression or deception, either to himself or to his superiors. (The latter is a situation that I have been in before, with pass-vs.-fail ratios at the centre of the story in a project associated with a $33 million loss.) Most of the time, providing pass-vs.-fail ratios is a test activity that I would stop immediately, using the “mission rejected” stopping heuristic (one that I hadn’t noted until Cem himself pointed it out).

Cem doesn’t provide any contexts in which pass-vs.-fail ratios might be useful, but as a context-driven tester, it’s my obligation to accept his critique and his challenge, and to consider some contexts in which I might use them. (This is the omission from my post that I mentioned above, and it’s the way that the controversy was handled in Lessons Learned: with a serving of context.) I present them in order from the least plausible to the most plausible.

“Your daughter will die” or “we’ll shoot this dog.” If someone employs a threat of harm to some person or being or something of value, I have to evaluate the relative damage afforded by providing the measure or not.

When mandated by force of law. If I were on the witness stand, and a lawyer asked me, “What were the pass-vs.-fail ratios at release time for this project,” I’d be required by law to respond. I can imagine a likely way it would play out, too: “92.7%, but I’d also like to make it clear that—” “No further questions, Your Honour.”

If I provided the data with all of the appropriate disclaimers AND I were sure that the disclaimer would be heard. If the client (and the client’s client, and so forth) were to relay the data and the disclaimer reliably to the point where the data would be used, I might be persuaded to provide the data. But I’d have to weigh that against the risk that I was wrong about the disclaimer being heard. Moreover, in my professional judgement, it would be wasting my clients’ time.

As a placebo. I might give a pass vs. fail ratio long enough to convince my client that it’s not helpful or necessary, while doing other things to test well and provide her with other forms of reliable information. I’d remain pretty uncomfortable with dispensing the sugar pills, though, and would work at ways of getting around it.

In the course of demonstrating that pass-vs.-fail ratios are a bad idea. In some contexts, pass-vs.-fail ratios provide what Kirk and Miller call quixotic reliability. That is, the measurement seems to correlate with other measurements of the state of the project. I might provide pass-vs.-fail ratios long enough to show a divergence between that data and other measures of project or product health.

If I were aware that the person receiving the data was in possession of all the contextual information that I believe they needed to put it to appropriate and non-harmful use. We use this in one of the exercises in our class, based on a bug from an actual product. We present a very specific set of tests that are the same in every material way but for two variables. The total domain space for these two variables in combination is a set with 2,304 elements. When we run a test that covers all of these elements, 510 of them provide a “fail” result. All of the test cases are of the same kind, and our students know that those test cases are comparable for the purposes that they’re considering. In that case, that kind of ratio in that kind of context has some value in describing that kind of coverage. So there might be some pedagogical or rhetorical value to reporting a pass-vs.-fail ratio there. Interestingly, the root of the problem is a data type problem in a single line of code. That helps to illuminate the discussion of “one bug or 510?”, which in turn illuminates how bug counts and failure counts aren’t well correlated. It also helps to illuminate the opportunity cost of paying too much attention to this problem when there are many other things that we might test.
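
To make the arithmetic of that discussion concrete, here is a minimal sketch (hypothetical code and values, not the actual classroom exercise) of how a single data-type problem in one line can scatter hundreds of “failing” elements across a combinatorial domain:

    # Hypothetical sketch -- not the actual exercise from the class.
    # One data-type bug (truncation to a signed 16-bit value) scatters
    # "failures" across a 48 x 48 = 2,304-element combination space.

    def correct_total_cents(quantity: int, unit_price_cents: int) -> int:
        return quantity * unit_price_cents

    def buggy_total_cents(quantity: int, unit_price_cents: int) -> int:
        total = quantity * unit_price_cents
        # The single-line bug: the result is squeezed into a signed
        # 16-bit field, so large totals wrap around.
        total &= 0xFFFF
        if total >= 0x8000:
            total -= 0x10000
        return total

    quantities = range(1, 49)            # 48 values
    unit_prices = range(100, 4900, 100)  # 48 values

    outcomes = [buggy_total_cents(q, p) == correct_total_cents(q, p)
                for q in quantities for p in unit_prices]
    passes = sum(outcomes)
    fails = len(outcomes) - passes
    print(f"{passes} pass, {fails} fail: a {passes / len(outcomes):.1%} "
          f"'pass rate', and exactly one bug")

The printed ratio says something about coverage of that specific, comparable domain; it says nothing about how many bugs there are, which is the point of the “one bug or 510?” discussion.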

To me, the real challenge is in coming up with a case in which this invalid, dangerous metric in its most common applications might be used for good. In the contexts where they’re commonly discussed and used—overwhelmingly commonly, in my view—pass-vs.-fail ratios are used to express the quality of testing, the health of the project, or the readiness of the product. In those contexts, the risk of misuse, whether intentional or inadvertent, is high—like placing a loaded gun with the safety off in a crowded subway car. As I’ve heard Cem say before, “I’d like to call them an Industry Worst Practice, but being context-driven, I can’t.” Once again, Cem has reminded me of why I can’t commit to the “unethical” charge absolutely and in all cases. He’s provided me with a challenge and an opportunity to sharpen my analysis, and I thank him for that.


Postscript, March 28, 2012: In private correspondence and conversation, Cem suggested a different interpretation of a paragraph from this post that I quoted above to provide context for this post. In order to ward off that interpretation, here’s how I might write that paragraph today:

“The ratio of passing test cases to failing test cases is at best irrelevant, and more often a systemic means of self- and organizational deception. Reducing the product story to this invalid number without additional information means reducing the product’s relationship with people to this invalid number. By extension when this invalid number is being used to evaluate people, that means reducing people to this invalid number too. So to irresponsible, unethical, and unprofessional, in this case we could add unscientific and inhumane.”

To be clear: these two posts have not been a blanket condemnation of all measurement, but of a particular metric that fails spectacularly when subjected to the tests of construct validity and reasonable and foreseeable side effects in Kaner and Bond’s Software Engineering Metrics: What Do They Measure and How Do We Know? Pass vs. fail is not merely an imperfect metric; it is a metric that has no discernible construct validity to me (or even to Cem). I’ve both experienced and seen pain and systematic deception with this metric at the centre of it. In this, it’s not like imperfect financial figures that are generated by legitimate companies subject to scrutiny by regulators, by auditors, by shareholders, and by markets. It’s more like financial forecasting data dreamed up by Bernie Madoff. I don’t mind dealing with imperfect but plausibly valid information; that’s all a tester ever gets to do, really. But if Bernie Madoff were to ask me to lend my credibility to his models, data, or business practices, I’d feel personally bound to decline that particular request.

Should Testers Play Planning Poker?

Wednesday, October 26th, 2011

My colleague and friend Eric Jacobson, who recently (as I write) did a bang-up job on his first conference presentation at STAR West 2011, asks a question in response to this blog post from 2006. (I like it when people reflect on an issue for a few years.) Eric asks:

You are suggesting it may not make sense for testers to give time-based estimates to their teams, but what about relative estimates? Let’s say a Rapid Software Tester is asked to participate in Planning Poker (relative-based story estimation) on an Agile Scrum team. I’ve always considered this a golden opportunity. Are you suggesting said tester may want to refuse to participate in the Planning Poker?

Having observed Planning Poker in action, I’m conflicted. Estimating anything is always a bit of a dodgy business, even at the best of times. That’s especially true for investigation and in particular for discovery. (I’ve written about some of the problems with estimation here and in subsequent posts, and with how those problems pertain to testing here.) Yet Planning Poker may be one way to get a good deal closer to the best of times. I like the idea of testers hearing what’s going on in planning sessions, and of offering perspective on the possible implications of work or change. On the other hand, at Planning Poker sessions I’ve observed or participated in, testers are often pressured to lower their numbers. In an environment where there’s trust, there tends to be much less pressure; in an environment where there’s less trust, I’d take pressure to lower the estimate as a test result with several possible interpretations. (I leave those interpretations as an exercise for the reader, but don’t stop until you get to five, at least.)

In any case, some fundamental problems remain. First, testing is oriented towards discovering things, not building things. At the root of it all, any estimate of how long it will take to test something is like estimating how long it will take you to evaluate someone’s ability to speak Spanish (which I wrote about here), and to discover problems in their ability to express themselves. If you already know something or can reasonably anticipate it, that helps a lot, and the Planning Poker approach (among many others) can help with that to some degree.

The second problem is that there’s not necessarily symmetry between the effort in creating something and the effort in testing it. A function or feature that takes very little effort to program might take an enormous amount of effort to test. What kinds of variation could we put into data, workflow, timing, platform dependencies and interactions, scenarios, and so forth? Meanwhile, a feature that takes significant amounts of programming effort could take almost no time to test (since “programming effort” could include an enormous amount of testing effort). There are dozens of factors involved, including the amount of testing the programmers do as they code; what kind of review is being done; what the scope of the change is; when particular discoveries get made (during “development time” or “testing time”); the skill of the parties involved; the testability of the product under test; how buggy the finished feature is (in which case there will be more time needed for investigation and reporting)… Planning Poker doesn’t solve the asymmetry problem, but it provides a venue for discussing it and getting started on sorting it out.

The third problem, closely related to the second, is this idea that all testing work associated with developing something must and shall happen within the same iteration. Testing never ends; it only stops. So it’s folly to think that all testing for a given amount of programming work can always fit into the same iteration in which the work is done. I’d argue that we need a more nuanced perspective and more options than that. The decision as to how much testing we’ll need is informed by many factors. Paradoxically, we’ll need some testing to help reveal and inform our notions of how much testing we’ll need.

I understand the desire to close the book on a development story within the sprint. I often—even usually—share that desire. Yet many kinds of testing work must respond to development work, and in such cases the development work has to be complete in some lesser sense than “fully tested”. Many kinds of confirmatory checking work, it seems to me, can be done within the same sprint as the programming work; no problem there. Yet it seems to me that other kinds of testing can reasonably wait for subsequent sprints—indeed, must wait for subsequent sprints, unless we’d like to have programmers stop all programming work altogether after a certain day in the sprint. Let me give you an example: in big banks, some kinds of transactions take several days to wend their way through batch processes that are run overnight. The testing work associated with that can be simulated, for sure (indeed, one would hope that most of such work would be simulated), but only at the expense of some loss of realism. Whether that realism is important to the test or not is always an open question with a fallible answer. Instead of making sure that there’s NO testing debt, consider reasonable, small, and sustainable amounts of testing debt that span iterations. Agile can be about actual agility, instead of dogma.

So… If playing Planning Poker is part of the context, go for it. It’s a heuristic approach to getting people to consider testing more consciously and thoughtfully, and there’s something to that. It’s oriented towards estimating things in a more comprehensible time frame, and in digestible chunks of task and effort. Planning Poker is fallible, and one approach among many possible approaches. Like everything else, its usefulness depends mostly on the people using it, and on how they use it.

Testing: Difficult or Time-Consuming?

Thursday, September 29th, 2011

In my recent blog post, Testing Problems Are Test Results, I noted a question that we might ask about people’s perceptions of testing itself:

Does someone perceive testing to be difficult or time-consuming? Who? What’s the basis for that perception? What assumptions underlie it?

The answer to that question may provide important clues to the way people think about testing, which in turn influences the cost and value of testing.

As an example, a pseudonymous person (“PM Hut”) who is evidently associated with project management in some sense (s/he provides the URL http://www.pmhut.com) answered my questions above.

Just to answer your question “Does someone perceive testing to be difficult or time-consuming?” Yes, everyone, I can’t think of a single team member I have managed who doesn’t think that testing is time consuming, and they’d rather do something else.

This, alas, isn’t an unusual response. To someone like me who offers help in increasing the value and reducing the cost of testing, it triggers some questions that might prompt reframes or further questions.

  • What do the team members think testing is? Do they think that it’s something ancillary to the project, rather than an essential and integrated aspect of software development? To me, testing is about gathering information and raising awareness that’s essential for identifying product risks and steering the project. That’s incredibly important and valuable.

    So when the team members are driving a car, do they perceive looking out the windshield to be difficult or time-consuming? Do they perceive looking at the dashboard to be difficult or time-consuming? If so, why? What are the differences between the way they obtain awareness when they’re driving a car, versus the way they obtain awareness when they’re contributing to the development of a product or service?

  • Do the team members think testing is the mindless repetition of actions and observation of specific outputs, as prescribed by someone else? If so, I’d agree with them that testing is an unpalatable activity—except I don’t call that testing. I call it checking, and I’d rather let a machine do it. I’d also ask whether checking is being done automatically by the programmers at lower levels, where it tends to be fast, cheap, easy, useful, and timely—or manually at higher levels, where it tends to be slower, more expensive, more difficult, less useful, and less timely—and tedious.
  • Is testing focused mostly on confirmation of things that we already know or hope to be true? Is it mostly focused on the functional aspects of the program (which are amenable to checking)? People tend to find this dull and tedious, and rightly so. Or is testing an active search for new information, problems, and risks? Does it include focus on parafunctional aspects of the product—the things that provide important perceptions of real value to real people? Are the testers given the freedom and responsibility to manage a good deal of their own investigation? Testers tend to find this kind of approach a lot more engaging and a lot more interesting, and the results are typically more wide-ranging, informative, and valuable to programmers and managers.
  • Is testing overburdened by meaningless and valueless paperwork, bureaucracy, and administrivia? How did that come to pass? Are team members aware that there are simple, lightweight, rapid, and highly effective ways of planning, recording, and reporting testing work and project status?
  • Are there political issues? Are testers (or people acting temporarily in a testing role) routinely blown off (as in this example)? Are the nuggets of information revealed by testing habitually dismissed? Is that because testing is revealing trivial information? If so, is there a problem with specific testing skills like modeling the test space, determining coverage, determining oracles, recording, or reporting?
  • Have people been trained on the basis of testing as a skilled, sophisticated thinking art? Or is testing something for which capability can be assessed by a trivial, 40-question multiple choice exam?
  • If testing is being done well (which given people’s attitudes expressed above would be a surprise), are programmers or managers afraid of having to deal with the information that testing reveals? Does that lead to recrimination and conflict?
  • If there’s a perception that testing is by its nature dull and slow, are the testers aware of the quick testing approaches in our Rapid Software Testing class (PDF, pages 97-99), in the Black Box Software Testing course offered by the Association for Software Testing, or in James Whittaker’s How to Break Software? Has anyone read and absorbed Lessons Learned in Software Testing?
  • If there’s a perception that technical reviews are slow, have the testers, programmers, or managers read Perfect Software and Other Illusions About Testing? Do they recognize the ways in which careful observation provides us with “instant reviews” (see Perfect Software, page 143)? Has anyone on the team read any other of Jerry Weinberg’s books on software management and measurement?
  • Have the testers, programmers, and managers recognized the extent to which exploratory testing is going on all the time? Do they recognize that issues revealed by testing might be even more important than bugs? Do they understand that every test result and every testing problem points to meta-information that can be extremely valuable in managing the project?

On PM Hut’s own Web site, there’s an article entitled “Why Project Managers Fail”. The author, Jim Benson, lists five common problems, each of which could be quickly revealed by looking at testing as a source of information, rather than by simply going through the motions. Take it from the former program manager of a product that, in its day, was the best-selling piece of commercial software in the world: testers, testing, and the information they reveal are a project manager’s best friends and most valuable assets—when you have the awareness to recognize them.

Testing need not be difficult, tedious or time-consuming. A perception that it is so, or that it must be so, suggests a problem with testing as practised or testing as perceived. Astute managers and teams will investigate that important and largely mistaken perception.

The Best Tour

Thursday, June 30th, 2011

Cem Kaner recently wrote a reply to my blog post Of Testing Tours and Dashboards. One way to address the best practice issue is to go back to the metaphor and ask “What would be the best tour of London?” That question should give rise to plenty of other questions.

  • Are you touring for your own purposes, or in support of someone else’s interests? To what degree are other people interested in what you learn on the tour? Are you working for them? Who are they? Might they be a travel agency? A cultural organization? A newspaper? A food and travel show on TV? The history department of a university? What’s your information objective? Does the client want quick, practical, or deep questions answered? What’s your budget?
  • How well do you know London already?  How much would you like to leave open the possibility of new discoveries?  What maps or books or other documentation do you have to help to guide or structure your tour?  Is updating those documents part of your purpose?
  • Is someone else guiding your tour? What’s their reputation? To what extent do you know and trust them? Are they going to allow you the opportunity and the time to follow your own lights and explore, or do they have a very strict itinerary for you to follow? What might you see—or miss—as a result?
  • Are you traveling with other people? What are they interested in? To what degree do you share your discovery and learning?
  • How would you prefer to get around? By Tube, to get around quickly? By a London Taxi (which includes some interesting information from the cabbie)? By bus, so you can see things from the top deck? On foot? By tour bus, where someone else is doing all the driving and all the guiding (that’s scripted touring)?
  • What do you need to bring with you? Notepad? Computer? Mobile phone? Still camera? Video camera? Umbrella? Sunscreen? (It’s London; you’ll probably need the umbrella.)
  • How much time do you have available?   An afternoon?  A day?  A few days? A week?  A month?
  • What are you (or your clients) interested in? Historical sites? Art galleries? Food? Museums? Architecture? Churches? Shopping? How focused do you want your tour to be? Very specialized, or a little of this and a little of that? What do you consider “in London”, and what’s outside of it?
  • How are you going to organize your time? How are you going to account for time spent in active investigation and research versus moving from place to place, breaks, and eating? How are you going to budget time to collect your findings, structure and summarize your experience, and present a report?
  • How do you want to record your tour? If you’re working for a client, what kind of report do they want? A conversation? Written descriptions? Pictures? Do they want things in a specific format?

(Note, by the way, that these questions are largely structured around the CIDTESTD guidewords in the Heuristic Test Strategy Model (Customer, Information, Developer Relations, Test Team, Equipment and Tools, Schedule, Test Item, and Deliverables)—and that there are context-specific questions that we can add as we model and explore the mission space and the testing assignment.)

There is no best tour of London; every tour has its strengths and weaknesses. Reasonable people who think about it for a moment realize that the “best” tour of London is a) relative to some person; b) relative to that person’s purposes and interests; c) relative to what the person already knows; and d) relative to the amount of time available. And such a reasonable person would be able to apply that metaphor to software testing tours too.

Common Languages Ain’t So Common

Tuesday, June 28th, 2011

A friend told me about a payment system he worked on once. In the system models (and in the source code), the person sending notification of a pending payment was the payer. The person who got that notice was called the payee. That person could designate someone else—the recipient—to pick up the money. The transfer agent would credit the account of the recipient, and debit the account of the person who sent notification—the payer, who at that point in the model suddenly became known as the sender. So, to make that clear: the payer sends email to the payee, who receives it. The sender pays money to the recipient (who accepts the payment). Got that clear? It turns out there was a logical, historical reason for all this. Everything seemed okay at the beginning of the project; there was one entity named “payer” and another named “payee”. Payer A and Payee B exchanged both email and money, until someone realized that B might give someone else, C, the right to pick up the money. Needing another word for C, the development group settled on “recipient”, and then added “sender” to the model for symmetry, even though there was no real way for A to split into two roles as B had. Uh, so far.
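
To see how easily that kind of shallow agreement settles into a model, here is a minimal sketch (hypothetical names and fields, not the actual system my friend described) in which the same party ends up wearing two different labels:

    # A sketch with invented names and fields, not the real system.
    from dataclasses import dataclass

    @dataclass
    class Party:
        name: str
        account_id: str

    @dataclass
    class PaymentNotice:
        payer: Party      # A announces a pending payment...
        payee: Party      # ...to B, who receives the notice

    @dataclass
    class Transfer:
        sender: Party     # A again, now under a different label
        recipient: Party  # C, whomever B designates to pick up the money

    alice = Party("Alice", "A-001")
    bob = Party("Bob", "B-002")
    carol = Party("Carol", "C-003")

    notice = PaymentNotice(payer=alice, payee=bob)      # A notifies B
    transfer = Transfer(sender=alice, recipient=carol)  # A pays C

Nothing in the sketch is broken, exactly; the trouble starts when a reader of “payer” and “sender” doesn’t realize they refer to the same person.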

There’s a pro-certification argument that keeps coming back to the discussion like raccoons to a garage: the claim that, whatever its flaws, “at least certification training provides us with a common language for testing.” It’s bizarre enough that some people tout this rationalization; it’s even weirder that people accept it without argument. Fortunately, there’s an appropriate and accurate response: No, it doesn’t. The “common language” argument is riddled with problems, several of which on their own would be showstoppers.

  • Which certification training, specifically, gives us a common language for testing? Aren’t there several different certification tribes? Do they all speak the same language? Do they agree, or disagree on the “common language”? What if we believe certification tribes present (at best) a shallow understanding and a shallow description of the ideas that they’re describing?
  • Who is the “us” referred to in the claim? Some might argue that “us” refers to the testing “industry”, but there isn’t one. Testing is practiced in dozens of industries, each with its own contexts, problems, and jargon.
  • Maybe “us” refers to our organization, or our development shop. Yet within our own organization, which testers have attended the training? Of those, has everyone bought into the common language? Have people learned the material for practical purposes, or have they learned it simply to pass the certification exam? Who remembers it after the exam? For how long? Even if they remember it, do they always and ever after use the language that has been taught in the class?
  • While we’re at it, have the programmers attended the classes? The managers? The product owners? Have they bought in too?
  • With that last question still hanging, who within the organization decides how we’ll label things? How does the idea of a universal language for testing fit with the notion of the self-organizing team? Shouldn’t choices about domain-specific terms in domain-specific teams be up to those teams, and specific to those domains?
  • What’s the difference between naming something and knowing something? It’s easy enough to remember a label, but what’s the underlying idea? Terms of art are labels for constructs—categories, concepts, ideas, thought-stuff. What’s in and what’s out with respect to a given category or label? Does a “common language” give us a deep understanding of such things? Please, please have a look at Richard Feynman’s take on differences between naming and knowing, http://www.youtube.com/watch?v=05WS0WN7zMQ.
  • The certification scheme has representatives from over 25 different countries, and must be translated into a roughly equivalent number of languages. Who translates? How good are the translations?
  • What happens when our understanding evolves? Exploratory testing, in some literature, is equated with “ad hoc” testing, or (worse) “error guessing”. In the 1990s, James Bach and Cem Kaner described exploratory testing as “simultaneous test design, test execution, and learning”. In 2006, participants in the Workshop on Heuristic and Exploratory Techniques discussed and elaborated their ideas on exploratory testing. Each contributed a piece to a definition synthesized by Cem Kaner: “Exploratory software testing is a style of software testing that emphasizes the personal freedom and responsibility of the individual tester to continually optimize the value of her work by treating test-related learning, test design, test execution, and test result interpretation as mutually supportive activities that run in parallel throughout the project.” That doesn’t roll off the tongue quite so quickly, but it’s a much more thorough treatment of the idea, identifying exploratory testing as an approach, a way that you do something, rather than something that you do. Exploratory work is going on all the time in any kind of complex cognitive activity, and our understanding of the work and of exploration itself evolves (as we’ve pointed out here, and here, and here, and here, and here). Just as everyday, general-purpose languages adopt new words and ideas, so do the languages that we use in our crafts, in our communities, and with our clients.

In software development, we’re always solving new problems. Those new problems may require people to work with entirely new technological or business domains, or to bridge existing domains with new interactions and new relationships. What happens when people don’t have a common language for testing, or for anything else in that kind of development process? Answer: they work it out. As Peter Galison notes in his work on trading zones, “Cultures in interaction frequently establish contact languages, systems of discourse that can vary from the most function-specific jargons, through semispecific pidgins, to full-fledged creoles rich enough to support activities as complex as poetry and metalinguistic reflection.” Each person in a development group brings elements of his or her culture along for the ride; each project community develops its own culture and its own language.

Yes, we do need common language for testing. Anthropology shows us that meaningful language develops organically when people gather for a common purpose in a particular context. Just as we need tests that are specific to a given context, we need terms that are that way too. So instead of focusing training on memorizing glossary entries, let’s teach testers more about the relationships between words and ideas. Let’s challenge each other to ask better questions about the language we’re using, and how it might be fooling us.

More of What Testers Find

Wednesday, March 30th, 2011

Damn that James Bach, for publishing his ideas before I had a chance to publish his ideas! Now I’ll have to do even more work!

A couple of weeks back, James introduced a few ideas to me about things that testers find in addition to bugs.  He enumerated issues, artifacts, and curios.  The other day I was delighted to find an elaboration of these ideas (to which he added risks and testability issues) in his blog post called What Testers Find.  Delighted, because it notes so many important things that testers learn and report beyond bugs.  Delighted, because it gives me an opportunity and an incentive to dive into James’ ideas more deeply. Delighted, because it gives us all a chance to explore and identify a much richer view of testing than the simplistic notion that “testers find bugs”.

Despite the fact that testers find much more than bugs, let’s start with bugs.  James begins his list of what testers find by saying

Testers find bugs. In other words, we look for anything that threatens the value of the product.

How do we know that something threatens the value of the product?  The fact is, we don’t know for sure.  Quality is value to some person, and different people will have different perceptions of value.  Since we don’t own the product, the project, or the business, we can’t make absolute declarations of whether something is a bug or whether it’s worth fixing.  The programmers, the managers, and the project owner will make those determinations, and often they’re running in different directions.  Some will see a problem as a bug; some won’t.  Some won’t even see a problem. It seems like the only certain thing here is uncertainty.  So what can we testers do?

We find problems that might threaten the value of the product to some person who matters. How do we do that? We identify quality criteria—aspects of the product that provide some kind of value to customers or users that we like, or that help to defend the product from users that we don’t like, such as unethical hackers or fraudsters or thieves. If we’re doing a great job, we also try to account for the fact that users we do like will make mistakes from time to time. So defending value also means making the product robust to human ineptitude and imperfection. In the Heuristic Test Strategy Model (which we teach as part of the Rapid Software Testing course), we identify these quality criteria:

  • Capability (or functionality)
  • Reliability
  • Usability
  • Security
  • Scalability
  • Performance
  • Installability
  • Compatibility
  • Supportability
  • Testability
  • Maintainability
  • Portability
  • Localizability

In order to identify threats to the quality of the product, we use oracles.  Oracles are heuristic (useful, fast, inexpensive, and fallible) principles or mechanisms by which we recognize problems.  Most oracles are based on the notion of consistency.  We expect a product to be consistent with

  • History (the product’s own history, prior results from earlier test runs, our experience with the product or other products like it…)
  • Image (a reputation our development organization wants to project, our brand identity,…)
  • Comparable products (products like this one that we develop, competitors’ products, test programs or algorithms,…)
  • Claims (things that important people say about the product, requirements, specifications, user documentation, marketing material,…)
  • User expectations (what reasonable people might anticipate the product could or should do, new features, fixed bugs,…)
  • Product (behaviour of the interface and UI elements, values that should be the same in different views,…)
  • Purpose (explicitly stated uses of the product, uses that might be implicit or inferred from the product’s design, no excessive bells and whistles,…)
  • Standards (relevant published guidelines, conventions for use or appearance for products of this class or in this domain, behaviour appropriate to the local market,…)
  • Statutes (relevant laws, relevant regulations,…)

In addition to these consistency heuristics, there’s an inconsistency heuristic too:  we’d like the product to be inconsistent with patterns of problems that we’ve seen before.  Typically those problems are founded in one of the consistency heuristics listed above. Yet it’s perfectly reasonable to observe a problem and recognize it first by its familiarity. We’ve seen lots of testers do that over the years.

We encourage people to come up with their own lists, or modifications to ours. You don’t have to use the Heuristic Test Strategy Model if it doesn’t work for you. You can create your own models for testing, and we actively encourage people who want to become great testers to do that. Testers find models, ways of looking at the product, the project, and testing itself, in the effort to wrestle down the complexity of the systems we’re testing and the approaches that we need to test them.

In your context, do you see a useful distinction between compatibility (playing nice with other programs that happen to co-exist on the system) and interoperability (working well with programs with which your application specifically interacts)? Put interoperability on your quality criteria list. Is accessibility for disabled users so important for your product that you want to highlight it in a separate quality criterion? Put it on your list. Recently, James noticed that explicability is a consistency heuristic that can act as an oracle too: when we see behaviour we can’t explain or make sense of, we have reason to suspect that there might be a problem. Testers find factors, relevant and material aspects of our models, products, projects, businesses, and test strategies.

When testers see some inconsistency in the product that threatens one or more of the quality criteria, we report. For the report to be relevant and meaningful, it must link quality criteria, oracles, and risk in ways that are clear, meaningful, and important to our clients. Rather than simply noticing an inconsistency, we must show why the inconsistency threatens some quality criterion for some person who matters. Establishing and describing those links in a chain of logic from the test mission to the test result is an activity that James and I call test framing. So: Testers find frames, the logical relationships between the test mission, our observations of the product, potential problems, and why we think they might be problems. James gave an example of a bug (“a list of countries in a form is missing ‘France’”). That might mean a minor usability problem based on one quality criterion, with a simple workaround (the customer is trying to choose a time zone from a list of countries presented as examples, so they pick Spain, which is in the same time zone). Based on another criterion like localizability, we’d perceive a more devastating problem (the customer is trying to choose a language, so despite the fact that the Web site has been translated, it won’t be presented in French, cutting our service off from a nation of 65 million people).

In finding bugs, testers find many other things too.  Excellent testing depends on our being able to identify and articulate what we find, how we find it, and how we contextualize it. That’s an ongoing process.  Testers find testing itself.

And there’s more, if you follow the link.