Blog Posts for the ‘Uncategorized’ Category

The End of Manual Testing

Wednesday, November 8th, 2017

Testers: when we speak of “manual testing”, we help to damage the craft.

That’s a strong statement, but it comes from years of experience in observing people thinking and speaking carelessly about testing. Damage arises when some people who don’t specialize in testing (and even some who do) become confused by the idea that some testing is “manual” and some testing is “automated”. They don’t realize that software development and the testing within it are design studio work, not factory work. Those people are dazzled by the speed and reliability of automation in manufacturing. Very quickly, they begin to fixate on the idea that testing can be automated. Manual bad; automated good.

Soon thereafter, testers who have strong critical thinking skills and who are good at finding problems that matter have a hard time finding jobs. Testers with only modest programming skills and questionable analytical skills get hired instead, and spend months writing programs that get the machine to press its own buttons. The goal becomes making the automated checks run smoothly, rather than finding problems that matter to people. Difficulties in getting the machine to operate the product take time away from interaction with and observation of the product. As a result, we get products that may or may not be thoroughly checked, but that have problems that diminish or even destroy value.

(Don’t believe me? Here’s an account of testing from the LinkedIn engineering blog, titled “How We Make Our UI Tests Stable”. It’s wonderful that LinkedIn’s UI tests (checks, really) are stable. Has anyone inside LinkedIn noticed that LinkedIn’s user interface is a hot, confusing, frustrating, unusable mess? That LinkedIn Groups have lately become well-nigh impossible to find? That LinkedIn rudely pops up a distracting screen after each time you’ve accepted a new invitation to connect, interrupting your flow, rather than waiting until you’ve finished accepting or rejecting invitations? That these problems dramatically reduce the desire of people to engage with LinkedIn and see the ads on it?)

Listen: there is no “manual testing”; there is testing. There are no “manual testers”; there are testers. Checking—an element of testing, a tactic of testing—can be automated, just as spell checking can be automated. A good editor uses the spelling checker, while carefully monitoring and systematically distrusting it. We do not call spell checking “automated editing”, nor do we speak of “manual editors” and “automated editors”. Editors, just “editors”, use tools.

All doctors use tools. Some specialists use or design very sophisticated tools. No one refers to those who don’t as “manual doctors”. No one speaks of “manual researchers”, “manual newspaper reporters”, “manual designers”, “manual programmers”, “manual managers”. They all do brain- and human-centred work, and they all use tools.

Here are seven kinds of testers. The developer tests as part of coding the product, and the good ones build testability into the product, too. The technical tester builds tools for herself or for others, uses tools, and in general thinks of her testing in terms of code and technology. The administrative tester focuses on tasks, agreements, communication, and getting the job done. The analytical tester develops models, considers statistics, creates diagrams, uses math, and applies these approaches to guide her exploration of the product. The social tester enlists the aid of other people (including developers) and helps organize them to cover the product with testing. The empathic tester immerses himself in the world of the product and the way people use it. The user expert comes at testing from the outside, typically as a supporting tester aiding responsible testers.

Every tester interacts with the product by various means, perhaps directly and indirectly, maybe at high levels or low levels, possibly naturalistically or artificially. Some testers are, justifiably, very enthusiastic about using tools. Some testers who specialize in applying and developing specialized tools could afford to develop more critical thinking and analytical skill. Correspondingly, some testers who focus on analysis or user experience or domain knowledge seem to be intimidated by technology. It might help everyone if they could become more familiar and more comfortable with tools.

Nonetheless, referring to any of the testing skill sets, mindsets, and approaches as “manual” spectacularly misses the mark, and suggests that we’re confused about the key body part for testing: it’s the brain, rather than the hands. Yet testers commonly refer to “manual testing” without so much as a funny look from anyone. Would a truly professional community play along, or would it do something to stop that?

On top of all this, the “manual tester” trope leads to banal, trivial, clickbait articles about whether “manual testing” has a future. I can tell you: “manual testing” has no future. It doesn’t have a past or a present, either. That’s because there is no manual testing. There is testing.

Instead of focusing on the skills that excellent testing requires, those silly articles provide shallow advice like “learn to program” or “learn Selenium”. (I wonder: are these articles being written by manual writers or automated writers?) Learning to program is a good thing, generally. Learning Selenium might be a good thing too, in context. Thank you for the suggestions. Let’s move on. How about we study how to model and analyze risk? More focus on systems thinking? How about more talk about kinds of coverage other than code coverage? What about other clever uses for tools, besides automated checks?

(Some might reply “Well, wait a second. I use the term ‘manual testing’ in my context, and everybody in my group knows what I mean. I don’t have a problem with saying ‘manual testing’.” If it’s not a problem for you, I’m glad. I’m not addressing you, or your context. Note, however, that your reply is equivalent to saying “it works on my machine.”)

Our most important goal as testers, typically, is to learn about problems that threaten value in our products, so that our clients can deal with those problems effectively. Neither our testing clients nor people who use software divide the problems they experience into “manual bugs” and “automated bugs”. So let’s recognize and admire technical testers, testing toolsmiths and the other specialists in our craft. Let us not dismiss them as “manual testers”. Let’s put an end to “manual testing”.

RST Slack Channel

Sunday, October 8th, 2017

Over the last few months, we’ve been inviting people from the Rapid Software Testing class to a Slack channel. We’re now opening it up to RST alumni.

If you’ve taken RST in the past, you’re welcome to join. Click here (or email me at slack@developsense.com), let me know where and when you took the class, and with which instructor. I’ll reply with an invitation.

Dev*Ops

Wednesday, October 4th, 2017

A while ago, someone pointed out that Development and Operations should work together in order to fulfill the needs and goals of the business, and lo, the DevOps movement was born. On the face of it, that sounds pretty good… except when I wonder: how screwed up must things have got for that to sound like a radical, novel, innovative idea?

Once or twice, I’ve noticed people referring to DevTestOps, which seemed to be a somewhat desperate rearguard attempt to make sure that Testing doesn’t get forgotten in the quest to fulfill the needs and goals of the business. And today—not for the first time—I noticed a reference to DevSecOps, apparently suggesting that Security is another discipline that should also be working with other groups in order to fulfill the needs and goals of the business.

Wow! This is great! Soon everyone who is employed by a business will be working together to fulfill the needs and goals of the business! Excelsior!

So, in an attempt to advance this ground-breaking, forward-thinking, transformative concept, I hereby announce the founding of a new movement:

DevSecDesTestDocManSupSerFinSalMarHumCEOOps

Expect a number of conferences and LinkedIn groups about it real soon now, along with much discussion about how to shift it left and automate it.

(How about we all decide from the get-go that we’re all working together, collaboratively, using tools appropriately, supporting each other to fulfill the needs and goals of the business? How about we make that our default assumption?)

Deeper Testing (3): Testability

Friday, September 29th, 2017

Some testers tell me that they are overwhelmed at the end of “Agile” sprints. That’s a test result—a symptom of unsustainable pace. I’ve written about that in a post called “Testing Problems are Test Results”.

In Rapid Software Testing, we say that testing is evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, and plenty of other stuff—perhaps including the design and programming of automated checks, and analysis of the outcome after they’ve been run. Note that the running of these checks—the stuff that the machine does—does not constitute testing, just as the compiling of the product—also the stuff that the machine does—does not constitute programming.

If you agree with this definition of testing, when someone says “We don’t have enough time for testing,” that literally means “we don’t have enough time to evaluate our product.” In turn, that literally means “we don’t have time to learn about this thing that (presumably) we intend to release.” That sounds risky.

If you believe, as I do, that evaluating a product before you release it is usually a pretty good idea, then it would probably also be a good idea to make testing as fast and as easy as possible. That is the concept of testability.

Most people quite reasonably think of testability in terms of visibility and controllability in the product. Typically, visibility refers to log files, monitoring, and other ways of seeing what the product is up to; controllability usually refers to interfaces that allow for easy manipulation of the product, most often via scriptable application programming interfaces (APIs).
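
To make those two ideas concrete, here is a minimal sketch of what visibility and controllability might look like from the tester’s chair. Everything in it (the product_api module, its methods, the log format) is invented for illustration; the point is only that the product can be driven and observed programmatically.

    # Controllability: drive the product through a scriptable interface
    # instead of clicking through its GUI. (product_api is a hypothetical
    # module; all names here are invented for illustration.)
    import json
    from product_api import Product

    product = Product.connect("localhost:9999")
    product.import_records("orders-sample.json")  # set up state with one call

    # Visibility: read the product's own account of what it just did.
    with open("product.log") as log:
        entries = [json.loads(line) for line in log]
    errors = [e for e in entries if e.get("level") == "ERROR"]
    print(len(errors), "errors logged during the import")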

It’s a good thing to have logging and scriptable interfaces, but testability isn’t entirely a property of the product. Testability is a set of relationships between the product, the team, the tester, and the context in which the product is being developed and maintained. Changes in any one of these can make a product more or less testable. In Rapid Software Testing, we refer to the set of these relationships as practical testability, which breaks down into five subcategories that overlap and interact to some degree.

Epistemic testability. (Yes, it’s a ten-dollar word. Epistemology is the study of how we know what we know. Excellent testing requires us to study epistemology if we want to avoid being fooled about what we know.) As we’re building a product, there’s a risk gap, the difference between what we know and what we need to know. A key purpose of testing is to explore the risk gap, shining light on the things that we don’t know, and identifying places that are beyond the current extent of our knowledge. A product that we don’t know well in a test space that we don’t know much about tends to make testing harder or slower.

Value-related testability. It’s easier to test a product when we know something about how people might use it and how they might intend to get value from it. That means understanding people’s goals and purposes, and how the product is designed to fulfill and support them. It means considering who matters—not just end users or customers, but also anyone who might have a stake in the success of the product. It means learning about dimensions of quality that might be more important or not so important to them.

Intrinsic testability. It’s easier to test a product when it is designed to help us understand its behaviour and its state. When the parts of the product are built cleanly and simply, and tested as we go, testing the assembled parts will be easier. When we have logging and visibility into the workings of the product, and when we have interfaces that allow us to control it using tools, we can induce more variation that helps to shake the bugs out.

Project-related testability. It’s easier to test when the project is organized to support evaluation, exploration, experimentation, and learning. Testing is faster and easier when testers have access to other team members, to information, to tools, and to help.

Subjective testability. The tester is at the centre of the relationships between the product, the project, and the testing mission. Testing will be faster, easier, and better when the tester’s skills—and testing skill on the team—are sufficient to deal with the situation at hand.

Each one of these dimensions of testability fans out into specific ideas for making a product faster and easier to test. You can find a set of ideas and guidewords in a paper called Heuristics of Software Testability.

On an Agile team, a key responsibility for the tester is to ask for and advocate for testability, and to highlight things that make testing harder or slower. Testability doesn’t come automatically. Teams and their managers are often unaware of obstacles. Programmers may have created unit checks for the product, which may help to reduce certain kinds of coding and design errors. Still, those checks will tend to be focused on testing functions deep in the code. Testability for other quality criteria—usability, compatibility, performance, or installability, to name only a few—may not get much attention without testers speaking up for them.

A product almost always gets bigger and more complex with every build. Testability helps us to keep the pace of that growth sustainable. A less testable product contributes to an unsustainable pace. Unsustainable pace ratchets up the risk of problems that threaten the value of the product, the project, and the business.

So here’s a message for the tester to keep in front of the team during that sprint planning meeting, during the sprint, and throughout the project:

Let’s remember testability. When testing is harder or slower, bugs have more time and more opportunity to stay hidden. The hidden bugs are harder to find than any bugs we’ve found so far—otherwise we would have found them already. Those bugs—deeper, less obvious, more subtle, more intermittent—may be far worse than any bugs we’ve found so far. Right now, testability is not as good as it could be. Is everyone okay with that?

Deeper Testing (1): Verify and Challenge

Thursday, March 16th, 2017

What does it mean to do deeper testing? In Rapid Software Testing, James Bach and I say:

Testing is deep to the degree that it has a probability of finding rare, subtle, or hidden problems that matter.

Deep testing requires substantial skill, effort, preparation, time, or tooling, and reliably and comprehensively fulfills its mission.

By contrast, shallow testing does not require much skill, effort, preparation, time, or tooling, and cannot reliably and comprehensively fulfill its mission.

Expressing ourselves precisely is a skill. Choosing and using words more carefully can sharpen the ways we think about things. In the next few posts, I’m going to offer some alternative ways of expressing the ideas we have, or interpreting the assignments we’ve been given. My goal is to provide some quick ways to achieve deeper, more powerful testing.

Many testers tell me that their role is to verify that the application does something specific. When we’re asked to do that, it can be easy to fall asleep. We set things up, we walk through a procedure, we look for a specific output, and we see what we anticipated. Huzzah! The product works!

Yet that’s not exactly testing the product. It can easily slip into something little more than a demonstration—the kinds of things that you see in a product pitch or a TV commercial. The demonstration shows that the product can work, once, in some kind of controlled circumstance. To the degree that it’s testing, it’s pretty shallow testing. The product seems to work; that is, it appears to meet some requirement to some degree.

If you want bugs to survive, don’t look too hard for them! Show that the product can work. Don’t push it! Verify that you can get a correct result from a prescribed procedure. Don’t try to make the product expose its problems.

But if you want to discover the bugs, present a challenge to the product. Give it data at the extremes of what it should be able to handle, just a little beyond, and well beyond. Stress the product out; overfeed it, or starve it of something that it needs. See what happens when you give the product data that it should reject. Make it do things more complex than the “verification” instructions suggest. Configure the product (or misconfigure it) in a variety of ways to learn how it responds. Violate an explicitly stated requirement. Rename or delete a necessary file, and observe whether the system notices. Leave data out of mandatory fields. Repeat things that should only happen once. Start a process and then interrupt it. Imagine how someone might accidentally or maliciously misuse the product, and then act on that idea. While you’re at it, challenge your own models and ideas about the product and about how to test it.
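
To make the contrast concrete, here is a minimal sketch in Python, assuming a hypothetical parse_quantity() function whose specification says that it accepts whole numbers from 1 to 100. The module and function names are invented; only the pattern matters. The first check merely verifies; the second presents challenges.

    import pytest

    from product import parse_quantity  # hypothetical function under test

    def test_verify_happy_path():
        # "Verification": one prescribed input, one anticipated output.
        assert parse_quantity("50") == 50

    @pytest.mark.parametrize("hostile", [
        "0", "101",            # just outside the stated limits
        "-1", "999999999999",  # well beyond them
        "", "   ", "fifty",    # data it should reject
        "50.0", "50\x00",      # plausible near-misses; embedded control characters
    ])
    def test_challenge_with_hostile_input(hostile):
        # A challenge: the product should reject bad data explicitly,
        # rather than crashing, hanging, or silently accepting it.
        with pytest.raises(ValueError):
            parse_quantity(hostile)

Even a table like that is only a beginning; a challenging mindset keeps inventing new rows.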

We can never prove by experiment—by testing—that we’ve got a good product; when the product stands up to the challenge, we can only say that it’s not known to be bad. To test a product in any kind of serious way is to probe the extents and limits of what it can do; to expose it to variation; to perform experiments to show that the product can’t do something—or will do something that we didn’t want it to do. When the product doesn’t meet our challenges, we reveal problems, which is the first step towards getting them fixed.

So whenever you see, or hear, or write, or think “verify”, try replacing it with “challenge”.

The Test Case Is Not The Test

Thursday, February 16th, 2017

A test case is not a test.

A recipe is not cooking. An itinerary is not a trip. A score is not a musical performance, and a file of PowerPoint slides is not a conference talk.

All of the former things are artifacts; explicit representations. The latter things are human performances.

When the former things are used without tacit knowledge and skill, the performance is unlikely to go well. And with tacit knowledge and skill, the artifacts are not central, and may not be necessary at all.

The test case is not the test. The test is what you think and what you do. The test case may have a role, but you, the tester, are at the centre of your testing.

Further reading: http://www.satisfice.com/blog/archives/1346

Throwing Kool-Aid on the Burning Books

Sunday, November 2nd, 2014

Another day, and another discovery of an accusation of Kool-Aid drinking or book burning from an ISO 29119 proponent (or an opponent of the opposition to it; a creature in the same genus, but not always the same species).

Most of the increasingly vehement complaints come from folks who have not read [ISO 29119], perhaps because they don’t want to pay for the privilege but also (and I’d guess mainly) because they are among the now possibly majority of folks who don’t read anything except from their favorite few Kool Aid pushers and don’t want their opinions muddled by actual information, especially any which might challenge their views.

http://sdtimes.com/editors-focus-ends-means/#sthash.vapl8RRZ.dpuf

The charge that the opponents of 29119 don’t read anything other than their favourite Kool-Aid pushers is almost—but not quite—as ludicrous as the idea that a complex, investigative, intellectual, activity like testing can be standardized. Does this look like a book-burner to you? One opponent of 29119 is the author of two of the best-selling (and, in my view, best-written) books on software testing in its relatively short history—does this look like a book burner? (In the unlikely event that it does, drop in to his web site and have a look at his publications and the references within.) Here’s a thoughtful opponent of 29119; book burner? How about this—book burner? And here’s a relatively recent snapshot of my own library.

For some contrast, have a look at the standard itself. As a matter of fact, other than the standards that it replaces, along with the ISTQB Foundation Syllabus, the standard’s bibliographies include references to no works at all; neither in testing nor in any of the other domains that relate to testing—programming, psychology, mathematics, history, measurement, anthropology, critical thinking, economics, philosophy, computer science, sociology, systems thinking, qualitative research, software development… The ISTQB syllabus includes a handful of books about testing, and only about testing. The most recent reference is to Lee Copeland’s A Practitioner’s Guide to Software Test Design, which—although a quite worthy book for new testers—was published in 2004, a full seven years before the syllabus was published in 2011.

Update, November 4: Sharp-eyed reader Nathalie van Delft points out that Part One of the Standard contains references to two books that are not prior standards or ISTQB material: Crispin and Gregory’s Agile Testing (2009), and Koen’s Definition of the Engineering Method (1985). So, one book since 2004, and one book on engineering, in the Concepts and Definitions section of the standard.

Where are the references to other books, old or new, that would be genuinely helpful to new testers, like Petzold’s Code: The Hidden Language of Computer Hardware and Software, or Weinberg’s Perfect Software and Other Illusions About Testing, or Marick’s Everyday Scripting in Ruby? Why is the syllabus not updated with important new books like Kahneman’s Thinking, Fast and Slow, or Kaner and Fiedler’s Foundations of Software Testing, or Elisabeth Hendrickson’s Explore It!, even if the rest of the syllabus remains static? Worried that things might get too heady for foundation-level testers? Why not refer to The Cartoon Guide to Statistics or a first-year college book on critical thinking, like Levy’s Tools of Critical Thinking, or introductory books on systems thinking like Meadows’ Thinking in Systems: A Primer or Weinberg and Gause’s Are Your Lights On?

See the difference? Our community encourages testers to study the craft; to read; to import new ideas from outside the field; to argue and debate; to learn from history; to think independently. We also cop to errors, when someone points them out; thanks, Nathalie. Some of the books above are by intellectual or commercial competitors, or contain material on which there is substantial disagreement between individuals and clans in the wider community. Big deal; those books are useful and important, and they’re part of the big conversation about testing.

You could only believe that the thoughtful opponents to 29119 are book-burners or Kool-Aid drinkers… well, if you haven’t read what they’ve been writing.

So, to those who answer the opposition to 29119 with calumny… drink up. And know that no smoke detectors were activated in the preparation of this blog post.

Facts and Figures in Software Engineering Research

Monday, October 20th, 2014

On July 23, 2002, Capers Jones, Chief Scientist Emeritus of a company called Software Productivity Research, gave a presentation called “SOFTWARE QUALITY IN 2002: A SURVEY OF THE STATE OF THE ART”. In this presentation, he provided the sources for his data on the second slide:

SPR clients from 1984 through 2002
• About 600 companies (150 clients in Fortune 500 set)
• About 30 government/military groups
• About 12,000 total projects
• New data = about 75 projects per month
• Data collected from 24 countries
• Observations during more than a dozen lawsuits

(Source: http://bit.ly/ZDFKaT, accessed September 5, 2014)

On May 2, 2005, Mr. Jones, this time billed as Chief Scientist and Founder of Software Quality Research, gave a presentation called “SOFTWARE QUALITY IN 2005: A SURVEY OF THE STATE OF THE ART”. In this presentation, he provided the source for his data, again on the second slide:

SPR clients from 1984 through 2005
• About 625 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 12,500 total projects
• New data = about 75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source: http://bit.ly/1vEJVAc, accessed September 5, 2014)

Notice that 34 months have passed between the two presentations, and that the “total projects” number has increased by 500. At 75 projects a month, we should expect about 2,550 projects to have been added to the original tally; yet only 500 projects have been added.

On January 30, 2008, Mr. Jones (Founder and Chief Scientist Emeritus of Software Quality Research), gave a presentation called “SOFTWARE QUALITY IN 2008: A SURVEY OF THE STATE OF THE ART”. This time the sources (once again on the second slide) looked like this:

SPR clients from 1984 through 2008
• About 650 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 12,500 total projects
• New data = about 75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source: http://www.jasst.jp/archives/jasst08e/pdf/A1.pdf, accessed September 5, 2014)

This is odd. 32 months have passed since the May 2005 presentation. With new data being added at 75 projects per month, there should have been 2,400 new projects since the prior presentation. Yet there has been no increase at all in the number of total projects.

On November 2, 2010, Mr. Jones (now billed as Founder and Chief Scientist Emeritus and as President of Capers Jones & Associates LLC) gave a presentation called “SOFTWARE QUALITY IN 2010: A SURVEY OF THE STATE OF THE ART”. Here are the sources, once again from the second slide:

Data collected from 1984 through 2010
• About 675 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 13,500 total projects
• New data = about 50-75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source: http://www.sqgne.org/presentations/2010-11/Jones-Nov-2010.pdf, accessed September 5, 2014)

Here three claims about the data have changed: 25 companies have been added to the data sources, 13,500 projects now comprise the total set, and “about 50-75 projects” have been added (or are being added; this isn’t clear) per month. Almost 33 full months have passed since the January 2008 presentation (which came at the end of that month). At the lower bound of the claimed per-month increases, we should expect about 1,650 new projects since the last presentation; at the claimed 75 per month, about 2,475. The claimed increase of 1,000 projects falls short of even the lower bound. What does it mean to claim “new data = about 50-75 projects per month”, when the new data appears to be coming in at a rate below the lowest rate claimed?

On May 1, 2012, Mr. Jones (CTO of Namcook Analytics LLC) gave a talk called “SOFTWARE QUALITY IN 2012: A SURVEY OF THE STATE OF THE ART”. Once again, the second slide provides the sources.

Data collected from 1984 through 2012
• About 675 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 13,500 total projects
• New data = about 50-75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source: http://sqgne.org/presentations/2012-13/Jones-Sep-2012.pdf, accessed September 5, 2014)

Here there has been no change at all in any of the previous claims (except for the range of time over which the data has been collected). The claim that 50-75 projects per month are being added remains. At that rate, extrapolating from the claims in the November 2010 presentation, there should be between 14,400 and 14,850 projects in the data set. Yet the claim of 13,500 total projects also remains.

On August 18, 2013, Mr. Jones (now VP and CTO of Namcook Analytics LLC) gave a presentation called “SOFTWARE QUALITY IN 2013: A SURVEY OF THE STATE OF THE ART”. Here are the data sources (from page 2):

Data collected from 1984 through 2013
• About 675 companies (150 clients in Fortune 500 set)
• About 35 government/military groups
• About 13,500 total projects
• New data = about 50-75 projects per month
• Data collected from 24 countries
• Observations during more than 15 lawsuits

(Source: http://namcookanalytics.com/wp-content/uploads/2013/10/SQA2013Long.pdf, accessed September 5, 2014)

Once again, no change in the total number of projects, but the claim of 50-75 new projects per month remains. Again: based on the 2012 claim, the 15 months that have passed (more like 16, but we’ll be generous here), and the growth claims in these presentations, there should be between 14,250 and 14,625 projects in the data set.

Based on the absolute claim of 75 new projects per month in the period 2002-2008, and 50 per month in the remainder, we’d expect 20,250 projects at a minimum by 2013. But let’s be conservative and generous, and base the claim of new projects per month at 50 for the entire period from 2002 to 2013. That would be 600 new projects per year over 11 years; 6,600 projects added to 2002’s 12,000 projects, for a total of 18,600 by 2013. Yet the total number of projects went up by only 1,500 over the 11-year period—less than one-quarter of what the “new data” claims would suggest.
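
Here is that conservative arithmetic as a few lines of Python; a sketch only, using the figures quoted from the slides and rounding the period to whole years as above.

    # Conservative check: assume only 50 new projects per month for the
    # whole period from 2002 to 2013. All figures come from the slides.
    reported_2002 = 12_000
    reported_2013 = 13_500
    years = 11

    expected_2013 = reported_2002 + 50 * 12 * years  # 18600
    actual_growth = reported_2013 - reported_2002    # 1500
    claimed_growth = expected_2013 - reported_2002   # 6600

    print(expected_2013)                   # 18600
    print(actual_growth / claimed_growth)  # ~0.23: less than one-quarter
    print(actual_growth / years)           # ~136 projects per year, on average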

In summary, we have two sets of figures in apparent conflict here. In each presentation,

1) the project data set is claimed to grow at a certain rate (50-75 per month, which amounts to 600-900 per year).
2) the reported number of projects grows at a completely different rate (on average, 136 per year).

What explains the inconsistency between the two sets of figures?

I thank Laurent Bossavit for his inspiration and help with this project.

Dramatis Personae

Thursday, September 18th, 2014

On September 8, Stuart Reid, the convenor of the Working Group, posted a response to the Stop 29119 petition that included this paragraph:

(The petitioners claim that) “The standards ‘movement’ is politicized and driven by big business to the exclusion of others. A large proportion of the members of WG26 are listed at our About WG26 page along with their employers. This list does not support the assertion in the petition. The seven editors (who do the majority of the work) are from a government department, a charity, two small testing consultancies, a mid-size testing consultancy, a university and one is semi-retired.”

I believe Dr. Reid misinterprets the position of the petition’s authors and signers as objecting to the standards process “driven by big business to the exclusion of others”. Here I can only speak for myself. My concern is not about the size of the businesses involved. Instead, it is this: if a handful of consultancies of any size were to use the ISO standards process to set the terms for “the only internationally-recognised and agreed standards”, it would raise a plausible perception of conflict of interest. If those standards were lengthy, complex, open to interpretation, and paywalled, and if those consultancies were to offer services related to standards compliance, the possibility of motivations other than altruism could loom. So, as the convenor of the working group, Dr. Reid is right to attempt to make affiliations clear and transparent.

As of September 3, the roster for the ISO 29119 working group looked like this:

The convenor of ISO/IEC JTC1/SC7 WG26 is:

Dr Stuart Reid – Testing Solutions Group, United Kingdom
The co-editors of ISO/IEC/IEEE 29119 Software Testing and members of WG26 are:

Anne Mette Hass (editor of ISO/IEC/IEEE 29119-3) – KOMBIT, Denmark
Jon Hagar (product editor of ISO/IEC/IEEE 29119) – USA
Matthias Daigl (editor of ISO/IEC/IEEE 29119-5) – Koln University, Germany
Prof Qin Liu (co-editor of ISO/IEC/IEEE 29119-4) – School of Software Engineering, Tongji University, China
Sylvia Veeraraghavan (editor of ISO/IEC/IEEE 29119-2) – Mindtree, India
Dr Tafline Murnane (editor of ISO/IEC/IEEE 29119-4) – K. J. Ross & Associates, Australia
Wonil Kwon (ISO/IEC 33063 Process Assessment Model for Software testing processes) – Software Testing Alliances, South Korea

(Source: http://bit.ly/wg26201312)

As of September 12, that page had changed:

The convenor of ISO/IEC JTC1/SC7 WG26 is:

Dr Stuart Reid – Testing Solutions Group, United Kingdom
The co-editors of ISO/IEC/IEEE 29119 Software Testing and members of WG26 are:

Anne Mette Hass (editor of ISO/IEC/IEEE 29119-3) – KOMBIT, Denmark
Jon Hagar (product editor of ISO/IEC/IEEE 29119) – USA
Matthias Daigl (editor of ISO/IEC/IEEE 29119-5) – Koln University, Germany
Prof Qin Liu (co-editor of ISO/IEC/IEEE 29119-4) – School of Software Engineering, Tongji University, China
Sylvia Veeraraghavan (editor of ISO/IEC/IEEE 29119-2) – Janaagraha, India
Dr Tafline Murnane (editor of ISO/IEC/IEEE 29119-4) – K. J. Ross & Associates, Australia
Wonil Kwon (ISO/IEC 33063 Process Assessment Model for Software testing processes) – Software Testing Alliances, South Korea

(Source: http://bit.ly/wg26201409)

Anne Mette Hass’ affiliation has been listed as KOMBIT for several years. Her LinkedIn history suggests other possible connections. There, as of September 14, 2014, she is listed as a Compliance Consultant for NNIT. A search for “29119” in NNIT’s web site leads quickly to an events page (http://www.nnit.com/pages/Events.aspx, retrieved September 14, 2014) that features a promotion for “Webinar – The Core of Testing, Dynamic Testing Process, According to ISO 29119.” Prior to this, once again according to LinkedIn, Ms. Hass worked for Delta, a Danish test consultancy. A search for “29119” on Delta’s site leads quickly to a page that begins “DELTA’s experts participate as key players in a variety of national and international norms and standardization groups”. ISO 29119 is listed as one of those international standards.

I presume that Jon Hagar, with no affiliation listed, is the “semi-retired” editor to whom Dr. Reid refers. Per LinkedIn, he is currently an independent consultant. Prior to this, Jon was an Engineer-Manager of Software Testing at Lockheed Martin.

Matthias Daigl’s affiliation is listed on the Working Group’s roster as “Koln University”, yet his profile on LinkedIn lists him as “Managing Consultant at imbus”. It makes no mention of Koln University. On the imbus.de site, you can find this page, which includes this paragraph: “Represented by our managing consultant Matthias Daigl we take an active part in the development of the series 29119. Matthias Daigl is a member of the DIN norm commission 043-01-07 ‘Software und System-Engineering’ and he is one of the two Germans who belong to the international ISO/IEC JTC 1 subcommittee 07 workgroup 26 ‘Software testing’ working on the standard 29119 with test experts from the whole world. There the imbus consultant has the editor role for the part 29119-5.”

There are several listings for Qin Liu on LinkedIn. One of them refers to an associate professor at Tongji University.

Dr. Tafline Murnane’s affiliation is with KJ Ross and Associates. “KJ Ross provide you with independent software quality expertise, either in-house, fully outsourced or a blend of both. With 100 local and 3000 offshore trained test consultants on hand, our service is carried out to national and international standards, including ISO/IEC 29119 and ISO/IEC 17025.” It is worth noting that KJ Ross does not explicitly offer 29119 consulting services on its Web site; if it is marketing such services, it is not doing so aggressively.

Wonil Kwon is listed as “Software Testing Alliances, South Korea”. LinkedIn shows this affiliation, along with one for STEN, a testing consultancy. http://www.sten.or.kr/index.php.

Sylvia Veeraraghavan’s affiliation according to the Working Group roster page suddenly changed on or about September 3, 2014. She is now with Janaagraha, a charity. Prior to that, though, she was with Mindtree, a company that assertively touts its part in developing 29119, and which sells related consulting services.

So, let’s review the claim of affiliations for the seven editors as currently listed on the page.

A government department. Dr. Reid apparently refers to Matt Mansell, affiliated with the Department of Internal Affairs, New Zealand. This description is consistent with Mr. Mansell’s current and past affiliations on LinkedIn. Oddly, Mr. Mansell’s name is no longer listed among the editors; he was formerly given credit as the editor of 29119-1.

A charity. Technically true, but Ms. Veeraraghavan very recently resigned from a large testing consultancy; a lucky break for Dr. Reid in terms of the timing of his response to the petition.

Two small testing consultancies and a mid-size consultancy. Ms. Hass, Mr. Daigl, and Mr. Kwon currently work for testing consultancies, and until very recently, Ms. Veeraraghavan did too. Ms. Murnane also works for a consultancy that touts her work on ISO 29119, and notes that its services “are carried out to national and international standards, including ISO/IEC 29119”.

A university. Two per the roster, but of these only one claim—Qin Liu’s—is supported by LinkedIn. Why is Mr. Daigl listed as being with Koln University? Could it be because of this? https://www.imbus.de/english/academy/certified-tester/

One semi-retired. True.

Finally, note that when Dr. Reid lists the editors of the standards, he does not refer to himself, even though he is arguably the most publicly prominent member of the Working Group, and its convenor. Dr. Reid is the CTO of Testing Solutions Group, a testing services consultancy. From TSG’s Web site: “For companies looking to make the switch to ISO 29119, TSG can provide help with implementation or measure how closely existing processes conform to the standard and an action plan to being about compliance.” (sic) (Source: http://www.testing-solutions.com/services/stqa/iso-29119-implementation).

Six of nine core members of the working group appear to be affiliated with consultancies. Why does Dr. Reid offer a different assessment? Would it be to distract from the appearance of a conflict of interest?

In addition, Dr. Reid states:

There is also no link between the ISO/IEC/IEEE Testing Standards and the ISTQB tester certification scheme.

That’s interesting. No link, eh?

(All links retrieved 18 September, 2014.)

I observe that, as Santayana said, “Those who cannot remember the past are condemned to repeat it”. It will be very interesting to watch what happens over the next few years.

So, has Dr. Reid been transparent, forthcoming, and credible about affiliations between himself and the editors (who, in his words, do the majority of the work) and organizations that are positioned to benefit from the standard? Has the Working Group diligently upheld and documented its compliance with ISO’s Codes of Conduct?

Those who support the standard, when you can find them, often tout the advantages of an international language for testing. I’ve written against this idea in the past. However, it is true that Latin was used as an international language for a long time. To this day, some phrases survive in common parlance: Cui bono? Quis custodiet ipsos custodes?

One final note: Investigation is not well served by standardization. I did not follow a standardized process in preparing this report.

Weighing the Evidence

Friday, September 12th, 2014

I’m going to tell you a true story.

Recently, in response to a few observations, I began to make a few changes in my diet and my habits. Perhaps you’ll be impressed.

  • I cut down radically on my consumption of sugar.
  • I cut down significantly on carbohydrates. (Very painful; I LOVE rice. I LOVE noodles.)
  • I started drinking less alcohol. (See above.)
  • I increased my intake of tea and water.
  • I’ve been reducing how much I eat during the day; some days I don’t eat at all until dinner. Other days I have breakfast, lunch, and dinner. And a snack.
  • I reflected on the idea of not eating during the day, thinking about Muslim friends who fast, and about Nassim Taleb’s ideas in Antifragile. I decided that some variation of this kind in a daily regimen is okay; even a good idea.
  • I started weighing myself regularly.

Impressed yet? Let me give you some data.

When I started, I reckon I was just under 169 lbs. (That’s 76.6 kilograms, for non-Americans and younger Canadians. I still use pounds. I’m old. Plus it’s easier to lose a pound than a kilogram, so I get a milestone-related ego boost more often.)

Actually, that 169 figure is a bit of a guess. When I became curious about my weight, the handiest tool for measuring it was my hotel room’s bathroom scale. I kicked off my shoes, and then weighed myself. 173 lbs., less a correction for my clothes and the weight of all of the crap I habitually carry around in my pockets: Moleskine, iPhone, Swiss Army knife, wallet stuffed with receipts, pocket change (much of it from other countries). Sometimes a paperback.

Eventually I replaced the batteries on our home scale (when did bathroom scales suddenly start needing batteries? Are there electronics in there? Is there software? Has it been tested?—but I digress). The scale implicitly claims a certain level of precision by giving readings to the tenth of a pound. These readings are reliable, I believe; that is, they’re consistent from one measurement to the next. I tested reliability by weighing myself several times over a five-minute period, and the results were consistent to the tenth of a pound. I repeated that test a day or two later. My weight was different, but I observed the same consistency.

I’ve been making the measurement of my actual weight a little more precise by, uh, leaving the clothes out of the measurement. I’ve been losing between one and two pounds a week pretty consistently. A few days ago, I weighed myself, and I got a figure of 159.9 lbs. Under 160! Then I popped up for a day or two. This morning, I weighed myself again. 159.4! Bring on the sugar!

That’s my true story. Now, being a tester, I’ve been musing about aspects of the measurement protocol.

For example, being a bathroom scale, it’s naturally in the bathroom. The number I read from the scale can vary depending on whether I weigh myself Before or After, if you catch my meaning. If I’ve just drunk a half litre of water, that’s a whole pound to add to the variance. I’ve not been weighing myself at consistent times of the day, either. In fact, this afternoon I weighed myself again: 159.0! Aren’t you impressed!

Despite my excitement, it would be kind of bogus for me to claim that I weigh 159.0 lbs, with the “point zero”. I would guess my weight fluctuates by at least a pound through the day. More formally, there’s natural variability in my weight, and to be perfectly honest, I haven’t measured that variability. If I were trying to impress you with my weight-loss achievement, I’d be disposed to report the lowest number on any given day. You’d be justified in being skeptical about my credibility, which would make me obliged to earn it if I care about you. So what could I do to make my report more credible?

  • I could weigh myself several times per day (say, morning, afternoon, and night) at regular times, average the results, and report the average. If I wanted to be credible, I’d tell you about my procedure. If I wanted to be very credible, I’d tell you about the variances in the readings. If I wanted to be super credible, I’d let you see my raw data, too.

    All that would be pretty expensive and disruptive, since I would have to spend a few minutes going through a set procedure (no clothes, remember?) at very regular times, every day, whether I was at home or at a business lunch or travelling. Few hotel rooms provide scales, and even if they did, for consistency’s sake, I’d have to bring my own scale with me. Plus I’d have to record and organize and report the data credibly too. So…

  • Maybe I could weigh myself once a day. To get a credible reading, I’d weigh myself under very similar and very controlled conditions; say, each morning, just before my shower. This would be convenient and efficient, since doffing clothes is part of the shower procedure anyway. (I apologize for my consistent violation of the “no disturbing mental images” rule in this post.) I’d still have to bring my own scale with me on business trips to be sure I’m using consistent instrumentation.
  • Speaking of instrumentation, it would be a good idea for me to establish the reliability and validity of my scale. I’ve described its reliability above; it produces a consistent reading from one measurement to the next. Is it a valid reading, though? If I desired credibility, I’d calibrate the scale regularly by comparing its readings to a reference scale or reference weight that itself was known to be reliable (consistent between observations) and valid (consistent with some consensus-based agreement on what “a pound” is). If I wanted to be super-credible, I’d report whatever inaccuracy or variability I observed in the reading from my scale, and potential inconsistencies in my reference instruments, hoping that both were within an acceptable range of tolerance. I might also invite other people to scrutinize and critique my procedure.
  • If I wanted to be ultra-scientific, I’d also have to be prepared to explain my metric—the measurement function by which I hang a number on an observation—and the manner in which I operationalized the metric. The metric here is bound into the bathroom scale: for each unit pound placed on the scale, the displayed figure should increase by 1.0. We could test that as I did above. Or, more whimsically, if I were to put 159 one-pound weights on one side of Sir Bedevere’s largest scales, and me on the other, the scales would be in perfect balance (“and therefore… A WITCH!”), assuming no problems with the machinery.
  • If I missed any daily observations, that would be unfortunate and potentially misleading. Owning up to the omission and reporting it would probably be preferable to covering it up. Covering up and getting caught would torpedo my credibility.
  • Based on some early samples, and occasional resampling, I could determine the variability of my own weight. When reporting, I could give a precise figure along with the natural variation in the measurement: 159.4 lbs, +/- 1.2 lbs. (There’s a sketch of this kind of report just after this list.)
  • Unless I’m wasting away, you’d expect to see my weight stabilize after a while. Stabilize, but not freeze. Considering the natural variance in my weight, it would be weird and incredible if I were to report exactly the same weight week after week. In that case, you’d be justified in suspecting that something was wrong. It could be a case of quixotic reliability—Kirk and Miller’s term for an observation that is consistent in a trivial and misleading way, as a broken thermometer might yield. Such observations, they say, frequently prove “only that the investigator has managed to observe or elicit ‘party line’ or rehearsed information. Americans, for example, reliably respond to the question ‘How are you?’ with the knee-jerk ‘Fine.’ The reliability of this answer does not make it useful data about how Americans are.”
  • It might be more reasonable to drop the precision while retaining accuracy. “About 160 lbs” is an accurate statement, even if it’s not a precise one. “About 160, give or take a pound or so” is accurate, with a little patina of precision and a reasonable and declared tolerance for imprecision.
  • Plus, I don’t think anyone else cares about a daily report anyhow. Even I am only really interested in things in the longer term. Having gone this far watching things closely, I can probably relax. One weighing a week, on a reasonably consistent day, first thing in the morning before the shower (I promise; that was the last time I’ll present that image) is probably fine. So I can relax the time and cost of the procedure, too.
  • I’m looking for progress over time to see the effects of the changes I’ve made to my regimen. Saying “I weigh about 160. Six weeks ago, I weighed about 170” adds context to the report. I could provide the raw data:

    Plotting the data against time on a chart would illustrate the trend. I could display the data in a way that showed impressive progress:

    But basing the Y-axis at 154.0 (to which Excel defaulted, in this case) wouldn’t be very credible, because it exaggerates the significance of the change. To be credible, I’d use a zero base:

    Using a zero-based Y-axis on the chart would show the significance of change in a more neutral way.

  • To support the quantitative data, I might add other observations, too: I’ve run out of holes on my belt and my pants are slipping down. My wife has told me that I look trimmer. Given that, I could add these observations to the long-term trend in the data, and could cautiously conclude that the regimen overall was having some effect.
  • All this is fine if I’m trying to find support for the hypothesis that my new regimen is having some effect. It’s not so good for two other things. First, it does not prove that my regimen change is having an effect. Maybe it’s having no effect at all, and I’ve been walking and biking more than before; or maybe I acquired some kind of wasting disease just as I began to cut down on the carbs. Second, it doesn’t identify specific factors that brought about weight loss and rule out other factors. To learn about those and to report on them credibly, I’d have to go back to a more refined approach. I would have to vary aspects of my diet while controlling others and make precise observations of what happened. I’d have to figure out what factors to vary, why they might be important, and what effects they might have. In other words, I’d be developing a hypothesis tied to a model and a body of theory. Then I’d set up experiments, systematically varying the inputs to see their effects, and searching for other factors that might influence the outcomes. I’d have to control for confounding factors outside of my diet. To make the experiment credible, I’d have to show that the numbers were focused on describing results, and not on attaining a goal. That’s the distinction between inquiry metrics and control metrics: an inquiry metric triggers questions; a control metric influences or drives decisions.
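
As promised above, here is a minimal sketch of the “precise figure plus natural variation” report. The week of readings is invented for illustration; only the shape of the report matters.

    # Report a precise figure along with its natural variation.
    # These daily weigh-ins (in pounds) are invented for illustration.
    from statistics import mean, pstdev

    readings = [160.2, 159.4, 160.8, 159.0, 159.9, 160.5, 159.6]

    print(f"{mean(readings):.1f} lbs, +/- {pstdev(readings):.1f} lbs")
    # -> 159.9 lbs, +/- 0.6 lbs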

When I focus on the number, I set up the possibility of some potentially harmful effects. To make the number look really good on any given day, I might cut my water intake. To make the number look fabulous over a prolonged period (say, as long as I was reporting my weight to you), I could simply starve myself until you stopped paying attention. (Then it’d be back to lots of sugar in the coffee, and yes, I will have another beer, thank you.) I know that if I were to start exercising, I’d build up muscle mass, and muscle weighs more than flab. It becomes very tempting to optimize my weight in pounds, not only to impress you, but also to make me feel proud of myself. Worst of all: I might rig the system not consciously, but unconsciously. Controlling the number is reciprocal; the number ends up controlling me.

Having gone through all of this, it might be a good idea to take a step back and line up the accuracy and precision of my measurement scheme with my goal—which I probably should have done in the first place. I don’t really care how much I weigh in pounds; that’s just a number. No one else should care how much I weigh every day. And come to think of it, even if they did care, it’s none of their damn business. The quantitative value of my weight is only a stand-in—a proxy or an indirect measurement—for my real goal. My real goal is to look and feel more sleek and trim. It’s not to weigh a certain number of pounds; it’s to get to a state where my so-called “friends” stop patting my belly and asking me when the baby is due. (You guys know who you are.)

That goal doesn’t warrant a strict scientific approach, a well-defined system of observation, and precise reporting, because it doesn’t matter much except to me. Some data might illustrate or inform the story of my progress, but the evidence that matters is in the mirror; do I look and feel better than before?

In a different context, you may want to persuade people in a professional discipline of some belief or some course of action, while claiming that you’re making solid arguments based on facts. If so, you have to marshal and present your facts in a way that stands up to scrutiny. So, over the next little while, I’ll raise some issues and discuss things that might be important for credible reporting in a professional community.


This blog post was strongly influenced by several sources.

Cem Kaner and Walter P. Bond, “Software Engineering Metrics: What Do They Measure and How Do We Know?”. In particular, I used the ten questions on measurement validity from that paper as a checklist for my elaborate and rigorous measurement procedures above. If you’re a tester and you haven’t read the paper, my advice is to read it. If you have read it, read it again.

Shadish, Cook, and Campbell, Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Snappy title, eh? As books go, it’s quite expensive, too. But if you’re going to get serious about looking at measurement validity, it’s a worthwhile investment, extremely interesting and informative.

Jerome Kirk and Marc L. Miller, Reliability and Validity in Qualitative Research. This very slim book raises lots of issues in performing, analyzing, and reporting, if your aim is to do credible research. (Ultimately, all research, whether focused on quantitative data or not, serves a qualitative purpose: understanding the nature of things at least a little better.)

Gerald M. (Jerry) Weinberg, Quality Software Management, Vol. 2: First-Order Measurement (also available as two e-books, “How to Observe Software” and “Responding to Significant Software Events”).

Edward Tufte’s Presenting Data and Information (a mind-blowing one-day course) and his books The Visual Display of Quantitative Information; Envisioning Information; Visual Explanations; and Beautiful Evidence.

Prior Art Dept.: As I was writing this post, I dimly recalled Brian Marick posting something on losing weight several years ago. I deliberately did not look at that post until I was finished with this one. From what I can see, that material (http://www.exampler.com/old-blog/2005/04/02/#big-visible-belly) was not related to this. On the other hand, I hope Brian continues to look and feel his best. 🙂

I thank Laurent Bossavit and James Bach for their reviews of earlier drafts of this article.