Blog Posts from September, 2014

A Response to Anne Mette Hass

Saturday, September 20th, 2014

In response to my earlier blog post, I received this comment from Anne Mette Hass. I’m going to reproduce it in its entirety, and then I’ll respond to each point.

I think this ‘war’ against the ISO standard is so sad. Nobody has set out to or want to harm anybody else with this work. The more variation we have in viewpoints and the more civilized be can debate, the wiser we get as times go by.

There is no way the standard ever wanted to become or ever will become a ‘kalifat’.

So why are you ‘wasting’ so much time and energy on this? What are you afraid of?

Best regards,
Anne Mette

PS. I’m likely not to answer if you answer me – I have better things to do.

And now my response:

Anne Mette,

It may surprise you to hear that I agree with many of the conclusions that you’ve given here. The trouble is that I don’t agree with your premises.

I think this ‘war’ against the ISO standard is so sad.

I agree that it’s sad.

It’s sad that a small group of people and/or organizations have decided unilaterally to proclaim an “internationally-recognised and agreed standard”, hiding behind ISO processes and implicitly claiming it to be based on consensus of the affected stakeholders, when it is manifestly not.

It’s sad that the working group has proceeded to declare a “standard” when the convenor of the working group has admitted that it has no evidence of efficacy. Those who claim to be expert testers would raise the alarm about an inefficacious product, would gather evidence, and would investigate. The “standard” has not been tested with actual application in the field.

It’s sad when the convenor of the working group treats “craftsmen” as a word with a negative connotation.

It’s sad that those people producing the “standard” have so stubbornly and aggressively ignored the breadth of ideas in the craft of testing, preferring to adopt a simplistic and shallow syllabus that was developed through a similar process. (“The ISO 29119 is based on the ISTQB syllabi, and, as far as I understand, it is the intention that the ISO 29119 testing process and testing documentation definitions will be adopted by ISTQB over time.” —Anne Mette Hass, https://www.facebook.com/permalink.php?story_fbid=10152731672549009&id=144926394008)

It’s sad that members of the Working Group would issue blatantly contradictory and false statements about their intentions (“There is also no link between the ISO/IEC/IEEE Testing Standards and the ISTQB tester certification scheme.” —Stuart Reid, http://softwaretestingstandard.org/29119petitionresponse.php).

It is sad that ISO’s reputation stands a chance of being tarnished by this debacle. It’s not that the opponents of 29119 are opposed to standards. There is a place for standards in physical things that need to be interoperable. There is a place for standardization in communication protocols. There is no place for standardization of testing when the business of technology development requires variation and deviation from the norm.

War is a predictable response when people attempt to game political systems and invade territories. Wars happen when one group of people imposes its beliefs and its way of life over another group of people. War is what happens when politics fails. And war is always sad.

Nobody has set out to or want to harm anybody else with this work.

I’m aware of several people who worked on the ISTQB certification scheme. They entered into that with good will and the best of intentions, hoping to contribute new ideas, alternative views, and helpful critique. They have reported, both publicly and privately, that their contributions were routinely rejected or ignored until eventually they gave up in frustration. This was a pattern that carried over from the Software Engineering Body of Knowledge, the CMM, IEEE 829 and other standards.

Some people have asked “If you don’t like the way the standard turned out, why didn’t you get involved with the development of the standard?” This is the equivalent of saying “If you don’t like where the hijacked plane landed, why didn’t you put on a mask and join us when we stormed the cockpit?”

Even for people of good will who stuck with the effort, the road to hell is paved with good intentions. Whatever your motives, it is important to consider reasonably foreseeable consequences of your actions. On Twitter, Laurent Bossavit has said “ISO 29119 may be software testing’s Laetrile. Never proven, toxic side effects, sold as ‘better than nothing’ to the desperate and unwary. And I do mean ‘sold’, at $200 each chapter. That’s de rigueur when selling miracle cures. (More on Laetrile: http://t.co/R6J9V6OMIs)” Similarly, I’m sure that Jenny McCarthy meant to harm no one with her ill-informed and incorrect claims that vaccinations caused autism. But Jenny McCarthy is neither doctor, nor scientist, nor tester.

Here’s what we’ve seen from the community’s experience with the ISTQB: whatever good intentions anyone might have had, a lot of money has been transferred from organizations and individuals into the pockets of people who have commercialized the certification. Testers have been not only threatened with unemployment, but prevented from working unless or until they have become certified. And ISO 29119 plows the soil for another crop of certification schemes.

The more variation we have in viewpoints and the more civilized be can debate, the wiser we get as times go by.

I agree with that too. I’m all for variation in viewpoints. The trouble is, by definition, the purpose of a standard is to suppress variation. That’s what makes it “standard”. I, for one, would enthusiastically join a civilized debate (please inform me of any point at which you believe my discourse in this matter has been uncivilized), but it appears that you are dismissing the idea out of hand: I refer readers to your postscript.

There is no way the standard ever wanted to become or ever will become a ‘kalifat’.

I agree there too. The “standard” doesn’t want anything. My concern is about the people who develop and promote the standard, and what they want.

So why are you ‘wasting’ so much time and energy on this? What are you afraid of?

I don’t think that anyone who is opposing the “standard” is wasting time at all. I’m a tester. It’s my job and my passion. When someone is attempting to claim a “standard” approach to my craft, I’m disposed to investigate the claim. Notice that the opponents of the “standard” are the ones doing vigorous investigative work and looking for bugs; the development team is showing no sign of doing it. When you ask “why are you wasting so much time and energy on this?” it reminds me of a developer who doesn’t believe that his product should be tested.

I’m not worried about the “standard” becoming a caliphate; I’m concerned about it becoming anything more than a momentary distraction. And I’m not afraid; I’m anticipating and I’m protesting. Specifically, I’m anticipating and protesting

  • another round of dealing with uninformed managers and human resource people requiring candidates to have experience in or credentials for yet another superficial body of “knowledge”;

  • another round of bogus certification schemes that pick the pockets of naïve or vulnerable testers, especially in developing countries;

  • another several years of having to persuade innocent managers that intellectual work cannot and should not be “standardised”, turned into bureaucracy and paperwork;

  • another several years of explaining that, despite what some “standard” says, a linear process model for testing (even one that weasels out and says that some iteration may occur) is deeply flawed and farcical;

  • the gradual drift of the “voluntary” “standard” into mandatory compliance, as noted by the National Institute of Standards and Technology (the second last paragraph here) and as helpfully offered by ISO;

  • waste associated with having to decide whether to follow given points in the standard or reject them. (Colleagues who have counted report that there are over 100 points at which a “standard-compliant” organization must identify its intention to follow or deviate from the “standard”. That’s overhead and extra effort for any organization that wants simply to do a good job of testing on its own terms.)

  • goal displacement as organizations orient themselves towards complying to the letter of the standard rather than, say, testing to help make sure that their products don’t fail, or harm people, or kill people.

Best regards,
Anne Mette

PS. I’m likely not to answer if you answer me – I have better things to do.

Since Anne Mette evinces no intention of responding, I will now address the wider community.

There’s an example of a response from the 29119 crowd, folks. This one makes no attempt to address any of the points raised in my post; presents not a single reasoned argument; nor any supporting evidence. Mind, you don’t need supporting evidence when you don’t present an argument. But at least we get a haughty dismissal from someone who has “better things to do” than to defend the quality of the work.

Dramatis Personae

Thursday, September 18th, 2014

On September 8, Stuart Reid, the convenor of the Working Group, posted a response to the Stop 29119 petition that included this paragraph:

(The petitioners claim that) “The standards ‘movement’ is politicized and driven by big business to the exclusion of others. A large proportion of the members of WG26 are listed at our About WG26 page along with their employers. This list does not support the assertion in the petition. The seven editors (who do the majority of the work) are from a government department, a charity, two small testing consultancies, a mid-size testing consultancy, a university and one is semi-retired.”

I believe Dr. Reid misinterprets the position of the petition’s authors and signers as objecting to the standards process “driven by big business to the exclusion of others”. Here I can only speak for myself. My concern is not about the size of the businesses involved. Instead, it is this: if a handful of consultancies of any size were to use the ISO standards process to set the terms for “the only internationally-recognised and agreed standards”, it would raise a plausible perception of conflict of interest. If those standards were lengthy, complex, open to interpretation, and paywalled, and if those consultancies were to offer services related to standards compliance, the possibility of motivations other than altruism could loom. So, as the convenor of the working group, Dr. Reid is right to attempt to make affiliations clear and transparent.

As of September 3, the roster for the ISO 29119 working group looked like this:

The convenor of ISO/IEC JTC1/SC7 WG26 is:

Dr Stuart Reid – Testing Solutions Group, United Kingdom
The co-editors of ISO/IEC/IEEE 29119 Software Testing and members of WG26 are:

Anne Mette Hass (editor of ISO/IEC/IEEE 29119-3) – KOMBIT, Denmark
Jon Hagar (product editor of ISO/IEC/IEEE 29119) – USA
Matthias Daigl (editor of ISO/IEC/IEEE 29119-5) – Koln University, Germany
Prof Qin Liu (co-editor of ISO/IEC/IEEE 29119-4) – School of Software Engineering, Tongji University, China
Sylvia Veeraraghavan (editor of ISO/IEC/IEEE 29119-2) – Mindtree, India
Dr Tafline Murnane (editor of ISO/IEC/IEEE 29119-4) – K. J. Ross & Associates, Australia
Wonil Kwon (ISO/IEC 33063 Process Assessment Model for Software testing processes) – Software Testing Alliances, South Korea

(Source: http://bit.ly/wg26201312)

As of September 12, that page had changed:

The convenor of ISO/IEC JTC1/SC7 WG26 is:

Dr Stuart Reid – Testing Solutions Group, United Kingdom
The co-editors of ISO/IEC/IEEE 29119 Software Testing and members of WG26 are:

Anne Mette Hass (editor of ISO/IEC/IEEE 29119-3) – KOMBIT, Denmark
Jon Hagar (product editor of ISO/IEC/IEEE 29119) – USA
Matthias Daigl (editor of ISO/IEC/IEEE 29119-5) – Koln University, Germany
Prof Qin Liu (co-editor of ISO/IEC/IEEE 29119-4) – School of Software Engineering, Tongji University, China
Sylvia Veeraraghavan (editor of ISO/IEC/IEEE 29119-2) – Janaagraha, India
Dr Tafline Murnane (editor of ISO/IEC/IEEE 29119-4) – K. J. Ross & Associates, Australia
Wonil Kwon (ISO/IEC 33063 Process Assessment Model for Software testing processes) – Software Testing Alliances, South Korea

(Source: http://bit.ly/wg26201409)

Anne Mette Hass’ affiliation has been listed as KOMBIT for several years. Her LinkedIn history suggests other possible connections. There, as of September 14, 2014, she is listed as a Compliance Consultant for NNIT. A search for “29119” in NNIT’s web site leads quickly to an events page (http://www.nnit.com/pages/Events.aspx, retrieved September 14, 2014) that features a promotion for “Webinar – The Core of Testing, Dynamic Testing Process, According to ISO 29119.” Prior to this, once again according to LinkedIn, Ms. Hass worked for Delta, a Danish test consultancy. A search for “29119” on Delta’s site leads quickly to a page that begins “DELTA’s experts participate as key players in a variety of national and international norms and standardization groups”. ISO 29119 is listed as one of those international standards.

I presume that Jon Hagar, with no affiliation listed, is the “semi-retired” editor to whom Dr. Reid refers. Per LinkedIn, he is currently an independent consultant. Prior to this, Jon was an Engineer-Manager of Software Testing at Lockheed Martin.

Matthias Daigl’s affiliation is listed on the Working Group’s roster as “Koln University”, yet his profile on LinkedIn lists him as “Managing Consultant at imbus”. It makes no mention of Koln University. On the imbus.de site, you can find this page, which includes this paragraph: “Represented by our managing consultant Matthias Daigl we take an active part in the development of the series 29119. Matthias Daigl is a member of the DIN norm commission 043-01-07 ‘Software und System-Engineering’ and he is one of the two Germans who belong to the international ISO/IEC JTC 1 subcommittee 07 workgroup 26 ‘Software testing’ working on the standard 29119 with test experts from the whole world. There the imbus consultant has the editor role for the part 29119-5.”

There are several listings for Qin Liu on LinkedIn. One of them refers to an associate professor at Tongji University.

Dr. Tafline Murnane’s affiliation is with KJ Ross and Associates. “KJ Ross provide you with independent software quality expertise, either in-house, fully outsourced or a blend of both. With 100 local and 3000 offshore trained test consultants on hand, our service is carried out to national and international standards, including ISO/IEC 29119 and ISO/IEC 17025.” It is worth noting that KJ Ross does not explicitly offer 29119 consulting services on its Web site; if it is marketing such services, it is not doing so aggressively.

Wonil Kwon is listed as “Software Testing Alliances, South Korea”. LinkedIn shows this affiliation, along with one for STEN, a testing consultancy. http://www.sten.or.kr/index.php.

Sylvia Veeraraghavan’s affiliation according to the Working Group roster page suddenly changed on or about September 3, 2014. She is now with Janaagraha, a charity. Prior to that, though, she was with Mindtree, a company that assertively touts its part in developing 29119, and which sells related consulting services.

So, let’s review the claim of affiliations for the seven editors as currently listed on the page.

A government department. Dr. Reid apparently refers to Matt Mansell, affiliated with the Department of Internal Affairs, New Zealand. This description is consistent with Mr. Mansell’s current and past affiliations on LinkedIn. Oddly, Mr. Mansell’s name is no longer listed among the editors; he was formerly given credit as the editor of 29119-1.

A charity. Technically true, but Ms. Veeraraghavan very recently resigned from a large testing consultancy; a lucky break for Dr. Reid in terms of the timing of his response to the petition.

Two small testing consultancies and a mid-size consultancy. Ms. Hass, Mr. Daigl, and Mr. Kwon currently work for testing consultancies, and until very recently, Ms. Veeraraghavan did too. Ms. Murnane also works for a consultancy that touts her work on ISO 29119, and notes that its services “are carried out to national and international standards, including ISO/IEC 29119”.

A university. Two per the roster, but of these only one claim—Qin Liu’s—is supported by LinkedIn. Why is Mr. Daigl listed as being with Koln University? Could it be because of this? https://www.imbus.de/english/academy/certified-tester/

One semi-retired. True.

Finally, note that when Dr. Reid lists the editors of the standards, he does not refer to himself, even though he is arguably the most publicly prominent member of the Working Group, and its convenor. Dr. Reid is the CTO of Testing Solutions Group, a testing services consultancy. From TSG’s Web site: “For companies looking to make the switch to ISO 29119, TSG can provide help with implementation or measure how closely existing processes conform to the standard and an action plan to being about compliance.” (sic) (Source: http://www.testing-solutions.com/services/stqa/iso-29119-implementation).

Six of nine core members of the working group appear to be affiliated with consultancies. Why does Dr. Reid offer a different assessment? Would it be to distract from the appearance of a conflict of interest?

In addition, Dr. Reid states:

There is also no link between the ISO/IEC/IEEE Testing Standards and the ISTQB tester certification scheme.

That’s interesting. No link, eh?

(All links retrieved 18 September, 2014.)

I observe that, as Santayana said, “Those who cannot remember the past are condemned to repeat it”. It will be very interesting to watch what happens over the next few years.

So, has Dr. Reid been transparent, forthcoming, and credible about affiliations between himself and the editors (who, in his words, do the majority of the work) and organizations that are positioned to benefit from the standard? Has the Working Group diligently upheld and documented its compliance with ISO’s Codes of Conduct?

Those who support the standard, when you can find them, often tout the advantages of an international language for testing. I’ve written against this idea in the past. However, it is true that Latin was used as an international language for a long time. To this day, some phrases survive in common parlance: Cui bono? Quis custodiet ipsos custodes?

One final note: Investigation is not well served by standardization. I did not follow a standardized process in preparing this report.

Weighing the Evidence

Friday, September 12th, 2014

I’m going to tell you a true story.

Recently, in response to a few observations, I began to make a few changes in my diet and my habits. Perhaps you’ll be impressed.

  • I cut down radically on my consumption of sugar.
  • I cut down significantly on carbohydrates. (Very painful; I LOVE rice. I LOVE noodles.)
  • I started drinking less alcohol. (See above.)
  • I increased my intake of tea and water.
  • I’ve been reducing how much I eat during the day; some days I don’t eat at all until dinner. Other days I have breakfast, lunch, and dinner. And a snack.
  • I reflected on the idea of not eating during the day, thinking about Moslem friends who fast, and about Nassim Taleb’s ideas in Antifragile. I decided that some variation of this kind in a daily regimen is okay; even a good idea.
  • I started weighing myself regularly.

    Impressed yet? Let me give you some data.

    When I started, I reckon I was just under 169 lbs. (That’s 76.6 kilograms, for non-Americans and younger Canadians. I still use pounds. I’m old. Plus it’s easier to lose a pound than a kilogram, so I get a milestone-related ego boost more often.)

    Actually, that 169 figure is a bit of a guess. When I became curious about my weight, the handiest tool for measuring it was my hotel room’s bathroom scale. I kicked off my shoes, and then weighed myself. 173 lbs., less a correction for my clothes and the weight of all of the crap I habitually carry around in my pockets: Moleskine, iPhone, Swiss Army knife, wallet stuffed with receipts, pocket change (much of it from other countries). Sometimes a paperback.

    Eventually I replaced the batteries on our home scale (when did bathroom scales suddenly start needing batteries? Are there electronics in there? Is there software? Has it been tested?—but I digress). The scale implicitly claims a certain level of precision by giving readings to the tenth of a pound. These readings are reliable, I believe; that is, they’re consistent from one measurement to the next. I tested reliability by weighing myself several times over a five-minute period, and the results were consistent to the tenth of a pound. I repeated that test a day or two later. My weight was different, but I observed the same consistency.
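The repeatability check described above can be sketched in a few lines of Python. This is only an illustration: the readings are hypothetical (the post does not give the individual figures), and the names `spread` and `is_consistent` are made up for the sketch. It tests reliability only; it says nothing about whether the scale is valid.

```python
def spread(readings):
    """Return the range (max minus min) of the readings, rounded to 0.1 lb."""
    return round(max(readings) - min(readings), 1)


def is_consistent(readings, tolerance=0.1):
    """True if all readings agree to within `tolerance` pounds.

    This checks reliability (repeatability) only. A scale can be
    perfectly consistent from one reading to the next and still be wrong.
    """
    return spread(readings) <= tolerance


# Hypothetical readings from one five-minute session, in pounds.
session = [161.2, 161.2, 161.3, 161.2]
print(spread(session), is_consistent(session))
```

Rounding the spread to a tenth of a pound matches the precision the scale displays and avoids spurious failures from floating-point subtraction.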

    I’ve been making the measurement of my actual weight a little more precise by, uh, leaving the clothes out of the measurement. I’ve been losing between one and two pounds a week pretty consistently. A few days ago, I weighed myself, and I got a figure of 159.9 lbs. Under 160! Then I popped up for a day or two. This morning, I weighed myself again. 159.4! Bring on the sugar!

    That’s my true story. Now, being a tester, I’ve been musing about aspects of the measurement protocol.

    For example, being a bathroom scale, it’s naturally in the bathroom. The number I read from the scale can vary depending on whether I weigh myself Before or After, if you catch my meaning. If I’ve just drunk a half litre of water, that’s a whole pound to add to the variance. I’ve not been weighing myself at consistent times of the day, either. In fact, this afternoon I weighed myself again: 159.0! Aren’t you impressed!

    Despite my excitement, it would be kind of bogus for me to claim that I weigh 159.0 lbs, with the “point zero”. I would guess my weight fluctuates by at least a pound through the day. More formally, there’s natural variability in my weight, and to be perfectly honest, I haven’t measured that variability. If I were trying to impress you with my weight-loss achievement, I’d be disposed to report the lowest number on any given day. You’d be justified in being skeptical about my credibility, which would make me obliged to earn it if I care about you. So what could I do to make my report more credible?

    • I could weigh myself several times per day (say, morning, afternoon, and night) at regular times, average the results, and report the average. If I wanted to be credible, I’d tell you about my procedure. If I wanted to be very credible, I’d tell you about the variances in the readings. If I wanted to be super credible, I’d let you see my raw data, too.

      All that would be pretty expensive and disruptive, since I would have to spend a few minutes going through a set procedure (no clothes, remember?) at very regular times, every day, whether I was at home or at a business lunch or travelling. Few hotel rooms provide scales, and even if they did, for consistency’s sake, I’d have to bring my own scale with me. Plus I’d have to record and organize and report the data credibly too. So…

    • Maybe I could weigh myself once a day. To get a credible reading, I’d weigh myself under very similar and very controlled conditions; say, each morning, just before my shower. This would be convenient and efficient, since doffing clothes is part of the shower procedure anyway. (I apologize for my consistent violation of the “no disturbing mental images” rule in this post.) I’d still have to bring my own scale with me on business trips to be sure I’m using consistent instrumentation.
    • Speaking of instrumentation, it would be a good idea for me to establish the reliability and validity of my scale. I’ve described its reliability above; it produces a consistent reading from one measurement to the next. Is it a valid reading, though? If I desired credibility, I’d calibrate the scale regularly by comparing its readings to a reference scale or reference weight that itself was known to be reliable (consistent between observations) and valid (consistent with some consensus-based agreement on what “a pound” is). If I wanted to be super-credible, I’d report whatever inaccuracy or variability I observed in the reading from my scale, and potential inconsistencies in my reference instruments, hoping that both were within an acceptable range of tolerance. I might also invite other people to scrutinize and critique my procedure.
    • If I wanted to be ultra-scientific, I’d also have to be prepared to explain my metric, that is, the measurement function by which I hang a number on an observation, and the manner in which I operationalized the metric. The metric here is bound into the bathroom scale: for each unit pound placed on the scale, the displayed figure should increase by 1.0. We could test that as I did above. Or, more whimsically, if I were to put 159 one-pound weights on one side of Sir Bedevere’s largest scales, and me on the other, the scales would be in perfect balance (“and therefore… A WITCH!”), assuming no problems with the machinery.
    • If I missed any daily observations, that would be unfortunate and potentially misleading. Owning up to the omission and reporting it would probably be preferable to covering it up. Covering up and getting caught would torpedo my credibility.
    • Based on some early samples, and occasional resampling, I could determine the variability of my own weight. When reporting, I could give a precise figure along with the natural variation in the measurement: 159.4 lbs, +/- 1.2 lbs.
    • Unless I’m wasting away, you’d expect to see my weight stabilize after a while. Stabilize, but not freeze. Considering the natural variance in my weight, it would be weird and incredible if I were to report exactly the same weight week after week. In that case, you’d be justified to suspect that something was wrong. It could be a case of quixotic reliability, Kirk and Miller’s term for an observation that is consistent in a trivial and misleading way, as a broken thermometer might yield. Such observations, they say, frequently prove “only that the investigator has managed to observe or elicit ‘party line’ or rehearsed information. Americans, for example, reliably respond to the question ‘How are you?’ with the knee-jerk ‘Fine.’ The reliability of this answer does not make it useful data about how Americans are.” Another possibility, of course, is that I’m reporting faked data.
    • It might be more reasonable to drop the precision while retaining accuracy. “About 160 lbs” is an accurate statement, even if it’s not a precise one. “About 160, give or take a pound or so” is accurate, with a little patina of precision and a reasonable and declared tolerance for imprecision.
    • Plus, I don’t think anyone else cares about a daily report anyhow. Even I am only really interested in things in the longer term. Having gone this far watching things closely, I can probably relax. One weighing a week, on a reasonably consistent day, first thing in the morning before the shower (I promise; that was the last time I’ll present that image) is probably fine. So I can relax the time and cost of the procedure, too.
    • I’m looking for progress over time to see the effects of the changes I’ve made to my regimen. Saying “I weigh about 160. Six weeks ago, I weighed about 170” adds context to the report. I could provide the raw data:

      Plotting the data against time on a chart would illustrate the trend. I could display the data in a way that showed impressive progress:

      But basing the Y-axis at 154.0 (to which Excel defaulted, in this case) wouldn’t be very credible because it exaggerates the significance of the change. To be credible, I’d use a zero base:

      Using a zero-based Y-axis on the chart would show the significance of change in a more neutral way.

    • To support the quantitative data, I might add other observations, too: I’ve run out of holes on my belt and my pants are slipping down. My wife has told me that I look trimmer. Given that, I could add these observations to the long-term trend in the data, and could cautiously conclude that the regimen overall was having some effect.
    • All this is fine if I’m trying to find support for the hypothesis that my new regimen is having some effect. It’s not so good for two other things. First, it does not prove that my regimen change is having an effect. Maybe it’s having no effect at all, and I’ve been walking and biking more than before; or maybe I acquired some kind of wasting disease just as I began to cut down on the carbs. Second, it doesn’t identify specific factors that brought about weight loss and rule out other factors. To learn about those and to report on them credibly, I’d have to go back to a more refined approach. I would have to vary aspects of my diet while controlling others and make precise observations of what happened. I’d have to figure out what factors to vary, why they might be important, and what effects they might have. In other words, I’d be developing a hypothesis tied to a model and a body of theory. Then I’d set up experiments, systematically varying the inputs to see their effects, and searching for other factors that might influence the outcomes. I’d have to control for confounding factors outside of my diet. To make the experiment credible, I’d have to show that the numbers were focused on describing results, and not on attaining a goal. That’s the distinction between inquiry metrics and control metrics: an inquiry metric triggers questions; a control metric influences or drives decisions.
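Two of the reporting ideas in the list above can be sketched in Python: summarising a handful of readings as a rounded figure with a declared tolerance, and putting a number on how much a truncated Y-axis exaggerates a change. Treat this as a rough sketch under assumptions. The three readings are the ones quoted earlier in this post (taken on different days and at different times, so they are illustrative only), the one-pound tolerance is simply declared rather than measured, and the function names are invented for the example.

```python
from statistics import mean, stdev


def summarize(readings):
    """Return (mean, sample standard deviation) of readings in pounds."""
    return mean(readings), stdev(readings)


def rounded_report(readings, tolerance=1.0):
    """Describe the weight to the nearest pound with a declared tolerance.

    The tolerance is an assumption stated by the reporter, not something
    this function derives from the data.
    """
    avg, _ = summarize(readings)
    return f"about {round(avg)} lbs, give or take {tolerance:g} lb"


def apparent_ratio(before, after, baseline=0.0):
    """How many times taller `before` appears than `after` when both are
    drawn as heights above `baseline` on a chart."""
    return (before - baseline) / (after - baseline)


# Readings quoted in this post, in pounds (illustrative, not a controlled sample).
readings = [159.9, 159.4, 159.0]
avg, sd = summarize(readings)
print(f"mean {avg:.1f} lbs, sample sd {sd:.2f} lbs")
print(rounded_report(readings))

# "About 170" six weeks ago versus "about 160" now.
print(apparent_ratio(170, 160, baseline=154.0))  # axis starting at 154.0
print(apparent_ratio(170, 160))                  # zero-based axis
```

With the axis starting at 154.0, the earlier weight appears more than two and a half times as tall as the current one; on a zero-based axis the two heights differ by only about six percent, which is the actual proportion of the change.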

    When I focus on the number, I set up the possibility of some potentially harmful effects. To make the number look really good on any given day, I might cut my water intake. To make the number look fabulous over a prolonged period (say, as long as I was reporting my weight to you), I could simply starve myself until you stopped paying attention. (Then it’d be back to lots of sugar in the coffee, and yes, I will have another beer, thank you.) I know that if I were to start exercising, I’d build up muscle mass, and muscle weighs more than flab. It becomes very tempting to optimize my weight in pounds, not only to impress you, but also to make me feel proud of myself. Worst of all: I might rig the system not consciously, but unconsciously. Controlling the number is reciprocal; the number ends up controlling me.

    Having gone through all of this, it might be a good idea to take a step back and line up the accuracy and precision of my measurement scheme with my goal—which I probably should have done in the first place. I don’t really care how much I weigh in pounds; that’s just a number. No one else should care how much I weigh every day. And come to think of it, even if they did care, it’s none of their damn business. The quantitative value of my weight is only a stand-in—a proxy or an indirect measurement—for my real goal. My real goal is to look and feel more sleek and trim. It’s not to weigh a certain number of pounds; it’s to get to a state where my so-called “friends” stop patting my belly and asking me when the baby is due. (You guys know who you are.)

    That goal doesn’t warrant a strict scientific approach, a well-defined system of observation, and precise reporting, because it doesn’t matter much except to me. Some data might illustrate or inform the story of my progress, but the evidence that matters is in the mirror; do I look and feel better than before?

    In a different context, you may want to persuade people in a professional discipline of some belief or some course of action, while claiming that you’re making solid arguments based on facts. If so, you have to marshal and present your facts in a way that stands up to scrutiny. So, over the next little while, I’ll raise some issues and discuss things that might be important for credible reporting in a professional community.


    This blog post was strongly influenced by several sources.

    Cem Kaner and Walter P. Bond, “Software Engineering Metrics: What Do They Measure and How Do We Know?” In particular, I used the ten questions on measurement validity from that paper as a checklist for my elaborate and rigorous measurement procedures above. If you’re a tester and you haven’t read the paper, my advice is to read it. If you have read it, read it again.

    Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Snappy title, eh? As books go, it’s quite expensive, too. But if you’re going to get serious about looking at measurement validity, it’s a worthwhile investment, extremely interesting and informative.

    Jerome Kirk and Mark L. Miller, Reliability and Validity in Qualitative Research. This very slim book raises lots of issues in performing, analyzing, and reporting if your aim is to do credible research. (Ultimately, all research, whether focused on quantitative data or not, serves a qualitative purpose: understanding the nature of things at least a little better.)

    Gerald M. (Jerry) Weinberg, Quality Software Management, Vol. 2: First Order Measurement (also available as two e-books, “How to Observe Software” and “Responding to Significant Software Events”).

    Edward Tufte’s Presenting Data and Information (a mind-blowing one-day course) and his books The Visual Display of Quantitative Information; Envisioning Information; Visual Explanations; and Beautiful Evidence.

    Prior Art Dept.: As I was writing this post, I dimly recalled Brian Marick posting something on losing weight several years ago. I deliberately did not look at that post until I was finished with this one. From what I can see, that material (http://www.exampler.com/old-blog/2005/04/02/#big-visible-belly) was not related to this. On the other hand, I hope Brian continues to look and feel his best. 🙂

    I thank Laurent Bossavit and James Bach for their reviews of earlier drafts of this article.

Construct Validity

Tuesday, September 9th, 2014

A construct, in science, is (informally) a pattern or a means of categorizing something you’re talking about, especially when the thing you’re talking about is abstract.

Constructs are really important in both qualitative and quantitative research, because they allow us to differentiate between “one of these” and “not one of these”, which is one of the first steps in measurement and analysis. If you want to describe something or count it such that other people find you credible, you’ll need to describe the difference between “one” and “not-one” in a way that’s valid. (“Valid” here means that you’ve provided descriptions, explanations, or measurements for your categorization scheme while managing or ruling out alternatives, such that other people are prepared to accept your construct, and your definition can withstand challenges successfully.)

If you’re familiar with object-oriented programming, you might think of a construct as being like a class, in that objects have an “is a” relationship to a class. In an object-oriented program, things tend to be pretty tidy; an object is either a member of a certain class or it isn’t. For example, in Ruby, an object will respond to a query of the kind_of?() method with a binary true or false. In the world, not under the control of nice, neat models developed by programmers armed with digital computers, things are more messy.
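To make the tidy, programmatic case concrete, here is a minimal Ruby sketch. The class names are hypothetical, invented only for this illustration; the one real thing on show is Ruby's built-in kind_of? method, which answers with a plain true or false.

```ruby
# Hypothetical classes, invented for illustration only.
class Vehicle; end
class Car < Vehicle; end
class Pedestrian; end

puts Car.new.kind_of?(Car)            # a Car is a Car
puts Car.new.kind_of?(Vehicle)        # ...and, through inheritance, a Vehicle
puts Pedestrian.new.kind_of?(Vehicle) # a Pedestrian is not a Vehicle
```

Every question gets a crisp answer because the programmer decided the hierarchy in advance. Nobody has done that for the things passing by in the real world.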

Suppose that someone asks you to identify vehicles and pedestrians passing by a little booth that he’s set up. It seems pretty obvious that you’d count cars and trucks without asking him for clarification. However, what about bicycles? Tricycles? A motor scooter? An electric motor scooter? If a unicyclist goes by, do we count him? A skateboarder? A pickup truck towing a wagon with two ATVs in it? A recreational vehicle towing a car? An ATV? A tractor, pulling a wagon? A diesel truck pulling a trailer? How do you count a tow-truck, towing another vehicle, with the other vehicle’s driver riding in the tow truck? As one vehicle or two? A bus? A car transporter (a truck with nine vehicles on it)? Who cares, you ask?

Well, the booth is at the entrance to a ferry boat, and the fee is $60 per vehicle, $5 per passenger, and $10 for pedestrians. Lots of people (especially those self-righteous cyclists; relax, I’m one of them too) will gripe if they’re charged sixty bucks. Yet where I live, a bicycle is considered a vehicle under the Highway Traffic Act, which would suit the ferry owner who wants to maximize the haul of cash. He’d especially like to see $600 from the car transporter. So in regular life, categorization schemes count, and the method for determining what fits into what category counts too.


How many vehicles?
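The ferry example can be sketched in a few lines of Ruby to show how much the sorting rule matters. The fees come from the example above; the two rule tables, the fare method, and the sample queue are hypothetical, made up for this sketch rather than taken from any real ferry's tariff.

```ruby
# Fees from the ferry example: $60 per vehicle, $5 per passenger, $10 per pedestrian.
FEES = { vehicle: 60, passenger: 5, pedestrian: 10 }.freeze

# Two hypothetical sorting rules. Each maps a kind of traveller to a fee category.
BY_TRAFFIC_ACT = { car: :vehicle, bicycle: :vehicle, walker: :pedestrian }.freeze
BY_CYCLIST     = { car: :vehicle, bicycle: :pedestrian, walker: :pedestrian }.freeze

# Total the fares for a list of travellers under a given sorting rule.
def fare(travellers, rule)
  travellers.sum { |kind| FEES.fetch(rule.fetch(kind)) }
end

queue = [:car, :bicycle, :bicycle, :walker]
puts fare(queue, BY_TRAFFIC_ACT) # bicycles charged as vehicles
puts fare(queue, BY_CYCLIST)     # bicycles charged as pedestrians
```

The same four travellers cost $190 under one rule and $90 under the other. Nothing about the traffic changed; only the construct did.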

If the problem is tricky for physical things (widgets), it’s super-tricky for abstractions in science that pertains to humans. You’ve decided to study the effect of a new medicine, and you want to try it out on healthy people to check for possible side effects. What is a healthy person? Health is an abstraction; a construct. If someone is in terrific shape but happens to have a cold today, does that person count as healthy? Over the last few summers, I’ve met a kid who’s a friend of a friend. He’s fit, strong, capable, active… and he does kidney dialysis every couple of days or so. Healthy? A transplant patient who is in great shape, but who needs a daily dose of anti-rejection drugs: healthy?

If your country gives extra points to potential immigrants who are bilingual (as mine does), what level of fluency constitutes competence in a language to the degree that you can decide, “bilingual or not”? Note that I’m not referring to a test of whether someone is bilingual or not; I’m talking about the criteria that we’re going to test for; our sorting rules. Economists talk about “the economy” growing; what constitutes “the economy”? People speak of “events”; when airplanes hit the World Trade Center, was that one event or two? Who cares? Property owners and insurance companies cared very deeply indeed.

Construct validity is important in the “hard” physical sciences. “Temperature” is a construct. “To discuss the validity of a thermometer reading, a physical theory is necessary. The theory must posit not only that mercury expands linearly with temperature, but that water in fact boils at 100°. With such a theory, a thermometer that reads 82° when the water breaks into a boil can be reckoned inaccurate. Yet if the theory asserts that water boils at different temperatures under different ambient pressures, the same measurement may be valid under different circumstances, say at one half an atmosphere.” (Kirk and Miller, Reliability and Validity in Qualitative Research) Atmospheric pressure varies from day to day, from hour to hour. So what is the temperature outside your window right now? The “correct” answer is surprisingly hard to decide.
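Kirk and Miller’s 82° figure can be checked with a rough calculation. The sketch below uses the Antoine equation, a standard empirical approximation relating vapour pressure to temperature, which is one possible “physical theory” in their sense. The constants are commonly tabulated values for water between roughly 1 and 100 degrees Celsius; they are an assumption of this sketch, not something drawn from the book, and the method name is made up for the example.

```ruby
# Antoine equation for water: log10(P) = A - B / (C + T),
# with P in mmHg and T in degrees Celsius.
# Commonly tabulated constants, assumed here for illustration.
ANTOINE_A = 8.07131
ANTOINE_B = 1730.63
ANTOINE_C = 233.426

# Approximate boiling point (degrees Celsius) at an ambient pressure in mmHg,
# obtained by solving the Antoine equation for T.
def boiling_point_c(pressure_mmhg)
  ANTOINE_B / (ANTOINE_A - Math.log10(pressure_mmhg)) - ANTOINE_C
end

puts boiling_point_c(760.0).round(1) # one standard atmosphere
puts boiling_point_c(380.0).round(1) # one half an atmosphere
```

Under these assumptions, water boils at very nearly 100° at one atmosphere and at a little under 82° at half an atmosphere. The thermometer in the quotation is “inaccurate” under one theory and just about right under the other.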

In the “soft” social sciences and qualitative research, the measurement problem is even harder. Kirk and Miller go on, “In the case of qualitative observations, the issue of validity is not a matter of methodological hairsplitting about the fifth decimal point, but a question of whether the researcher sees what he or she thinks he or she sees.” (Kirk and Miller, Reliability and Validity in Qualitative Research)

When we come to the field of software development, there are certain constructs that people bandy about as though they were widgets, instead of idea-stuff: requirements; defects; test cases; tests; fixes; discoveries. What is a “programmer”? What is a “tester”? Is a programmer who spends a couple of days writing a test framework a programmer or a tester? Questions like these raise problems for anyone who wants a quantitative answer to the question, “How many testers per developer?” Kaner, Hendrickson, and Smith-Brock go into extensive detail on the subject. I’ve written about what counts before, too.

There’s a terrible difficulty in our craft: those who seem most eager to measure things seem not to pay very much attention to the problem of construct validity, as Cem Kaner and Walter P. Bond point out in this landmark paper, “Software Engineering Metrics: What Do They Measure and How Do We Know?” (I’m usually loath to say “All testers should do X”, but I think anyone serious about measurement in software development should read this paper. It’s not hard. Do it now. I’ll wait.)

If you’re doing research into software development, how do you define, describe, and justify your notion of “defects” such that you count all the things that are defects, and leave out all the things that aren’t defects, and such that your readers agree? If you’re getting reports and aggregating data from the field, how do you make sure that other people are counting the same way as you are? Does “defect” have the same meaning in a game development shop as it does for the makers of avionics software? If you’re attempting to prove something in a quantitative, rigorous and scientific way, how do you answer objections when you say something is a defect and someone else says it isn’t? How do you respond when someone wants to say that “there’s more to defects than coding errors”?

Those questions will become very important in the days to come. Stay tuned.

For extra reading: See Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. This book is unusually expensive, but well worth it if you’re serious about measurement and validity.

Frequently-Asked Questions About the 29119 Controversy

Tuesday, September 2nd, 2014

This is a first stab at a frequently-asked questions list about the movement to stop ISO 29119. Here I speak for myself, and not for the community. If you see “we”, it refers to my perception of the community at large, but not necessarily to the whole community; your mileage may vary. There is plenty of discussion in the community; Huib Schoots is curating a collection of resources on the controversy. If you’ve got a different perception or a different opinion, please share it and let me know. Meanwhile, I hasten to point out that absolutely everyone is welcome and encouraged to share my opinions.

Q. Why bother with a community attack on ISO 29119? Isn’t it irrelevant to most testers? And why now?

To start with, we believe that ISO 29119 is irrelevant to all testers, in the sense that it seems to be an overstructured process model, focused on relentless, ponderous, wasteful bureaucracy and paperwork, with negligible content on actual testing. If your organization is in the business of producing pointless documentation, so be it, but that’s not what we call testing. The approaches suggested by 29119 might be useful to people who are more interested in ass coverage than in test coverage.

Originators and supporters of the petition are trying to establish a pattern of opposition to the standard. This becomes important when lawyers or auditors ask “Why didn’t you follow ‘an internationally agreed set of standards for software testing that can be used within any software development life cycle or organisation’?” Loud voices of opposition—not only to the standard, but also to the process by which it was created and by which it will be marketed—will help to show that the suggestion of “international agreement” is meaningless; that the standard misrepresents testing as many prominent testers see it; that the standard is overly complex and opaque; that it is both too vague here and too specific there to be useful in “any” organisation; and that radically different contexts for testing—quite appropriately—require radically different approaches for testing.

As to the “why now” question, there’s another reason for the groundswell that I think we’re discovering as we go: over the years, in fits and starts, the context-driven community has become much larger and more capable of acting like a community. And that despite the fact that people who aspire to be fiercely independent thinkers can be a fairly fractious bunch. A community that welcomes serious disagreement will have serious disagreements, and there have been some. Yet it seems that, every now and then, there are some things that are just odious enough to unite us. Personally, I’m treating this as a trial run and a learning experience to prepare for something seriously important.

Q. The promoters of the standard insist that it’s not mandatory, so what’s the fuss?

The promoters of the standard who say that the standard is not mandatory are being disingenuous. They are well aware of this idea:

“Still another classification scheme distinguishes between voluntary standards, which by themselves impose no obligations regarding use, and mandatory standards. A mandatory standard is generally published as part of a code, rule or regulation by a regulatory government body and imposes an obligation on specified parties to conform to it. However, the distinction between these two categories may be lost when voluntary consensus standards are referenced in government regulations, effectively making them mandatory standards.”

(Source: http://www.nist.gov/standardsgov/definestandards.cfm)

The 29119 promoters begin by using appeal to authority (in this case, the reputation of ISO) to declare a standard. If it so happens that a regulator or bureaucrat, uninformed about testing, happens upon “an internationally agreed set of standards for software testing that can be used within any software development life cycle or organisation” and refers to them in government regulations, well, then, so much the better for aspiring rent-seekers who might have been involved in drafting the standard.

Q. If ISO 29119 is so terrible, won’t it disappear under its own weight?

Yes, it probably will in most places. But for a while, some organizations (including public ones; your tax dollars at work, remember) will dally with it at great cost—including the easily foreseeable costs of unnecessary compliance, goal displacement, misrepresentation of testing, and yet another round of marketing of bogus certifications, whereby rent-seekers obtain an opportunity to pick the pockets of the naïve and the cynical.

Q. Aren’t you just griping because you’re worried that your non-standard approach to testing will put you out of business?

Here’s how I answered this question on one blog (with a couple of minor edits for typos and clarity):

“In one sense, it won’t make any difference to my business if 29119-1, 29119-2, and 29119-3 are left to stand, and if 29119-4 and 29119-5 move from draft to accepted. Rapid Software Testing is about actual testing skills—exploration, experimentation, critical thinking, scientific thinking, articulate reporting, and so forth. That doesn’t compete with 29119, in the same kind of way that a fish restaurant doesn’t compete with the companies that make canned tuna. We object to people manipulating the market and the ISO standards development process to suggest to the wider world that canned tuna is the only food fit for people to eat. I discuss that here: http://www.developsense.com/blog/2014/08/rising-against-the-rent-seekers/

“In another sense, 29119 could be fantastic for my business. It would offer me a way to extend the brand: how to do excellent, cost-effective testing that stands up to scrutiny in contexts where some bureaucrat, a long way away from the development project, was fooled into believing that 29119 was important. At the moment, I’m happy to refer that kind of business to colleagues of mine, but I suspect that it would be something of a gold mine for me. Yet still I oppose 29119, because what’s in my interest may not be in the interests of my clients and of society at large.

“Let me be specific: There are existing standards for medical devices, for avionics, and the like. Those standards matter, and many of them are concise and well-written, and were created by genuine collaboration among interested parties. Testers who are working on medical devices or on avionics software have a limited number of minutes in the working day. As someone who flies a lot, and as someone who is likely to require the help of medical devices in the foreseeable future, I would prefer that those testers spend as many minutes as humanly possible actually investigating the software, rather than complying (authentically, pathetically, or maliciously) to an unnecessary standard for process modeling, documentation, and strategising (a standard for developing a strategy—imagine that!).”

Q. You just don’t like standards. Isn’t that it?

Nope. I love standards when they’re used appropriately.

As I emphasized in a 2011 PNSQC presentation called “Standards and Deviations”, it is possible and often desirable to describe and standardize widgets: tangible, physical things that have quantifiably measurable attributes, and that must interface, interact, or fit with other things. Thank goodness for standardized screws and screwdrivers, CDs, and SATA hard drives! Bravo to the EU for mandating that power supplies for smartphones standardize on USB! Yet even with widgets, there are issues related to the tension between standards and an advancing state of the art. Here’s one of the best-ever articles on problems with standards: Joel Spolsky on Martian Headsets.

It is more difficult to describe processes, since the description is, by necessity, a model of the process. It’s difficult for many people to avoid reifying the model; that is, to avoid treating the model (idea-stuff) as though it were a thing. For an example of reification of testing, take a few moments to reflect on the notion of representing testing work in terms of test cases; then read “Test Cases Are Not Testing: Toward a Culture of Test Performance” by James Bach & Aaron Hodder. More generally, 29119’s focus on the artifacts and the process model displaces and de-centres the most important part of any testing effort: the skill set and the mindset of the individual tester.

Q. Do you really believe that ISO 29119 can be stopped?

No, of course we don’t. Curtis Stuehrenberg puts it perfectly in a discussion on LinkedIn: “The petition is not about stopping the publication any more than an anti-war march is about a reasonable expectation of ending a war through a parade. The point of the petition and the general chatter is to make sure at least some people hear there is a significant portion of the testing community who was not represented and who espouse different viewpoints and practices for software testing as a professional discipline.” If we can’t get the standard stopped by the ISO’s mechanisms, at least we can show that there is an absence of consensus outside of the 29119 working groups.

Q. The standard has been in development for the last seven years; why have you waited so long?

Some of us haven’t been waiting. For example, I gave this presentation in 2011. Some of us have been busy objecting to certification schemes. (There’s only so much rent-seeking one can oppose at once.) Several of us have argued at length and in public with some of the more prominent figures promoting the standard at conferences. They sometimes seem not to understand our objections. However, as Upton Sinclair said, “It is difficult to get a man to understand something, when his salary depends upon his not understanding it!” (http://en.wikiquote.org/wiki/Upton_Sinclair) Whether through terrible argumentation or deliberate guile, the responses in those discussions usually took the form of non-sequiturs: “The standard is optional; something is better than nothing; many people were involved; the perfect is the enemy of the good; we’re trying to help all those poor people who don’t know what to do.” The standards promoters should (and probably do) know that those statements would apply to any process model that any person or group could offer. Constructing bogus authority into the ISO, and then appealing to that authority, looks an awful lot like rent-seeking to me.

Moreover, it strains belief that the standard has undergone serious development when some of the basic models (for example, 29119’s model for the test planning process) have gone essentially unchanged over four years, a period that included the rise of smartphones and mobile technology, the aftermath of the financial crisis, and the emergence of tablet computing. Testing in an Agile context reportedly garners little more than a few hand-waving references. I can’t say I’m surprised that testing and checking don’t appear on 29119’s radar either.

Q. Why didn’t you object using the formal process set up by ISO?

As James Bach points out, the real question there has been begged: why should the craft have to defend itself against a standards process that is set up to favour the determined and the well-funded? ISO is a commercial organization; not an organ of the United Nations, emanating from elected representative governments; not an academic institution; not a representative group of practitioners; not ordained by any deity. The burden is on ISO to show the relevance of the standard, even under its own terms. Simon Morley deconstructs that.

Q. Wouldn’t it be good to have an international common language for software testing?

Great idea! In fact, it would be good to have an international common language for everything. And in order to be truly international and to represent the majority of people in the world, let’s make that language Mandarin, or Hindi.

There are many arguments to show that a common language for software testing is neither desirable nor possible. I’ve blogged about a few of them, and I’ve done that more than once.

Q. Why are you always against stuff? Don’t you want to be for something?

You don’t have to be for something to be against something that’s odious. But as a matter of fact, I am for something that is more important than any standard: freedom and responsibility for the quality of my work (as I hope all testers are for freedom and responsibility for the quality of their own work). That includes the responsibility to make my work capable, credible, open to scrutiny, and as cost-effective as possible. I must be responsible to my clients, to my craft, and to society as a whole. In my view, those responsibilities do not and should not include compliance with unnecessary, time-consuming, unrepresentative standards created by self-appointed documentation and process-model enthusiasts.

Some other things I’m for: the premises of Rapid Software Testing; the Rapid Testing framework; studying the structures of exploratory testing; the Heuristic Test Strategy Model; a set of commitments for testers to make to their clients; practicing the skill of test framing; excellent reporting; and a host of other things. This is unrepresentative of the wider testing community… so I bet you’re glad that compliance with standards set by James and me is voluntary. In fact, compliance with our standards requires you to invent testing for yourself; to adopt standards that help; and to resist the ones that don’t, including ours. But if you find something else that works for you, tell us. Tell everybody.

Q. What about the poor people who need guidance on how to test?

My (free) offerings to those poor people include those just above. Those poor people are welcome to use these suggestions and to investigate the alternatives that anyone else offers. That may be harder than referring to an ISO standard and appealing to its authority. (It may be considerably easier, too.) But my first piece of guidance on how to test is this: learn about testing, and learn how to test, through study and practice. I argue that ISO 29119 will not help you with that.

Q. Okay, despite the Quixotic nature of the petition, I’m convinced. Where do I sign?

Go to http://www.ipetitions.com/petition/stop29119. And thank you.

An Example of Progress in the Drafting of ISO 29119

Monday, September 1st, 2014

The proponents of ISO Standard 29119 proudly claim that they have received and responded to “literally thousands” of comments during the process of drafting the standard. So I thought it might be interesting to examine how one component of the basic model has changed or evolved through the course of its development.

Here’s a screenshot of a diagram that illustrates the test planning process, taken from a presentation given in 2009.

(Source: http://in2test.lsi.uniovi.es/gt26/presentations/ISO-29119-Javier-Tuya-PRIS2009.pdf, accessed September 1, 2014)

Here’s another diagram illustrating the test planning process, presumably reflecting input from all of those thousands of comments (plus changes due to the rise of smartphone and mobile technology, the consequences of the financial crisis, and the emergence of tablet computing) from a presentation from 2013:

(Source: http://www.siliconindia.com/events/siliconindia_events/presentation/P2KVy7Yu.pdf, accessed September 1, 2014)

But maybe that second presentation was based on an interim draft. Let’s look at something that should reflect the published standard, as it came from Stuart Reid, the convenor of the standard working group, in March 2014, after the standard was published.

(Source: http://btdconf.com/session/225/ISO_29119:_The_New_International_Software_Testing_Standard, accessed August 21, 2014 by Huib Schoots)

Perhaps the model developed and presented by someone in 2009 was so universally representative of software test planning that it has been able to withstand the critique and feedback from a wide and diverse community of thousands of testers over four years, with only minimal and inconsequential changes to the text in the diagram. Or perhaps substantial disagreement with this model was ignored, suppressed, or subjected to a process of consensus based on attrition, as I alluded to in an earlier post.

Which do you think is more likely?

Please, sign the petition to stop ISO 29119.