Blog Posts from January, 2022

Testing Deep and Shallow (2): “Shallow” is a feature, not an insult!

Tuesday, January 11th, 2022

When we talk about deep and shallow testing in the Rapid Software Testing namespace, some people might assume that we mean “deep testing” is good and decent and honourable, and that we mean “shallow” to be an insult. But we don’t. “Shallow” is not an insult.

Depth and shallowness are ways of talking about the thoroughness of testing, but they’re not assessments of its value. The value or quality or appropriateness of thoroughness can only be decided in context. Shallow testing can be ideal for some purposes, and deep testing can be pathological. How so? Let’s start by getting clear on what we do mean.

Shallow testing is testing that has a chance of finding every easy bug.

“Shallow testing” is not an insult! Shallow doesn’t mean “slapdash”, and shallow doesn’t mean “sloppy”.

Both shallow testing and finding easy bugs are good things. We want to find bugs—especially easy bugs—as quickly and as efficiently as possible, and shallow testing has a chance of finding them. Shallow testing affords some coverage, typically in specific areas of the product. In lots of contexts, the fact that shallow testing isn’t deep is a feature, not a bug.

Here’s a form of shallow testing: TDD-style checks. When developers design and implement TDD checks, the goal is not to test the product deeply. The goal is to make efficient, incremental progress in building a function or a feature. Each new check provides a quick indication that the new code does what the programmer intended it to do. Re-running the existing suite of checks provides a developer with some degree of confidence that the new code hasn’t introduced easy-to-find problems.
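
Here’s a rough sketch, in Python, of what one such check might look like. (The apply_discount function and its values are made up for illustration; the point is the shape of the check, not the specifics.) Note how narrow the question is: does the new code do what the programmer intended, for a couple of simple inputs?

    # A minimal sketch of a TDD-style check (hypothetical function and values).
    # It asks one narrow, first-order question -- does apply_discount() do what
    # the programmer intended for a couple of simple inputs? -- and nothing more.
    import unittest

    def apply_discount(price, percent):
        """Return the price reduced by the given percentage, rounded to cents."""
        return round(price * (1 - percent / 100), 2)

    class ApplyDiscountTest(unittest.TestCase):
        def test_ten_percent_off(self):
            self.assertEqual(apply_discount(100.00, 10), 90.00)

        def test_zero_percent_leaves_price_unchanged(self):
            self.assertEqual(apply_discount(100.00, 0), 100.00)

    if __name__ == "__main__":
        unittest.main()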

TDD makes rapid progress possible by focusing the programmer on experimenting with the design and writing code efficiently. That effort is backed with simple, quick, first-order output checks. For the purpose of getting a new feature built, that’s perfectly reasonable and responsible.

When I’m writing code, I don’t want to do challenging, long-sequence, thorough experiments that probe a lot of different coverage areas every time I change the danged code. Neither do you. TDD checks aren’t typically targeted towards testing for security and usability and performance and compatibility and installability risks. If they were, TDD would be intolerably slow and ponderous, and running the checks would take ages.

Checking of this nature is appropriately and responsibly quick, inexpensive, and just thorough enough, allowing the developers to make reliable progress without disrupting development work too much. The idea is to find easy bugs at the coal face, applying relatively little effort that affords maximum momentum. That speed and ease is absolutely a feature of shallow testing. And not a bug.

Shallow testing is also something that testers must do in their early encounters with the product, because there is no means to teleport a tester to deep testing right away.

A developer builds her mental models of the product as part of the process of building it. The tester doesn’t have that insider’s, builder’s perspective. The absence of that perspective is both a feature and a bug. It’s a feature because the tester is seeing the product with fresh eyes, which can be helpful for identifying problems and risks. It’s a bug because the tester must go through a stage of learning, necessary confusion, and bootstrapping to learn about the product.

The Bootstrap Conjecture suggests that any process that is eventually done well and efficiently started off by being done poorly and inefficiently; any process focused on trying to get things right the first time will be successful only if it’s trivial or lucky.

In early encounters with a product, a tester performs shallow testing—testing that has a chance of finding every easy bug. That affords the opportunity to learn the product, while absolving the tester of an obligation to try to get to deep testing too early.

So what is deep testing?

Deep testing is testing that maximizes the chance of finding every elusive bug that matters.

That needs some unpacking.

First, “maximize”. No testing, and no form of testing, can guarantee that we’ll find every bug. (Note that in Rapid Software Testing, a bug is anything about the product that might threaten its value to some person who matters.)

It’s a commonplace maxim that complete testing is impossible: we can’t enter every possible set of inputs; examine every possible set of outputs; exercise every function in the product, in every possible sequence, with every possible variation of timing, on every possible platform, in every possible machine state that we can’t completely control anyway.
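
To get a feel for the scale of the problem, consider just one dimension of that space. (This is a back-of-the-envelope sketch, not a measurement of any real product.)

    # Back-of-the-envelope: a function that takes just two 32-bit integers has
    # 2**64 possible input pairs. Even at a generous billion checks per second,
    # trying them all would take centuries -- and that's one tiny corner of the
    # test space, ignoring sequences, timing, platforms, and machine states.
    pairs = 2 ** 64                      # possible (a, b) input pairs
    checks_per_second = 1_000_000_000    # an optimistic execution rate
    seconds_per_year = 60 * 60 * 24 * 365

    years = pairs / (checks_per_second * seconds_per_year)
    print(f"{years:,.0f} years")         # roughly 585 years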

Given that we’re dealing with an infinite, intractable, multi-dimensional test space, testing skill matters, but some degree of luck inevitably plays a role. We can only strive to maximize our chances of finding bugs, because bugs are to some degree elusive. Bugs can be subtle, hidden, rare, intermittent, or emergent.

Some bugs are subtle, based on poorly-understood aspects of programming languages, or surprising behavior of technologies.

Some bugs are hidden in complex or obscure or old code. Some bugs are hidden in code that we didn’t write, but that we’re calling in a library or framework or operating system.

Some bugs are rare, dependent on specific sets of unusual conditions, or triggered by code encountering particular data, or exclusive to specific platforms.

Some bugs are intermittent, only manifesting infrequently, when the system is in a particular state.

Perhaps most significantly, some bugs are emergent. All of the components in a product might be fine in isolation, but the overall system has problems when elements of it are combined. A shared library, developed internally, that supports one product might clobber functions in another. A product that renders fine on one browser might run afoul of different implementations of standards on another.

Just today, I got mail from a friend who uses a Mac; I’m sure it looked fine on his machine, but it doesn’t render properly under Windows Outlook. A product that performs fine in the lab can be subject to weird timing problems when network latency comes into play, or when lots of people are using the system at the same time.

Time can be a factor, too. One classic case is the Y2K problem; storing the year component of a date in a two-digit field wouldn’t have looked like much of a problem in 1970, when storage was expensive and people didn’t foresee that the system might still be in use a generation later. Programs that ran just fine on single-tasking 8086 processors encountered problems when run in virtual mode on the supposedly-compatible virtual 8086 mode on 80386 and later processors.
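
Here’s a sketch of how the two-digit-year decision plays out. (The age_in_years function is invented for illustration; real systems embedded the same assumption in far less obvious ways.)

    # The classic two-digit-year bug: calculations on a two-digit year field
    # look fine for decades, then quietly go wrong when the century rolls over.
    def age_in_years(birth_yy, current_yy):
        """Naive age calculation using two-digit years, as many old systems did."""
        return current_yy - birth_yy

    print(age_in_years(45, 79))   # 1945 to 1979: 34 -- looks fine
    print(age_in_years(45, 0))    # 1945 to 2000: -45 -- the latent bug surfaces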

(This sort of stuff is all over the place. As of this writing, there seems to be some kind of latent bug on my Web site that only manifests when I try to update PHP, and that probably happens thanks to stricter checking by the newer PHP interpreter. It wasn’t a problem when I put the site together, years ago, and for now I’m in upgrade jail until I sort it all out. Sigh.)

Bugs that are elusive can evade even a highly disciplined development process, and can also evade deep testing. Again, there are no guarantees, but the idea behind deep testing is to maximize the chance of finding elusive bugs.

How do you know that a bug is, or was, elusive? When an elusive bug is found in development, before release, qualified people on the team will say things like, “Wow… it would have been really hard for me to notice that bug. Good thing you found it.”

When a bug in our product is found in the field, by definition it eluded us, but was it an elusive bug?

Elusiveness isn’t a property of a bug, but a social judgment—a relationship between the bug, people, and context. If a bug found in the field was elusive, our social group will tend to agree, “Maybe we could have caught that, but it would have been really, really hard.” If a bug wasn’t elusive, our social group will say “Given the time and resources available to us, we really should have caught that.” In either case, responsible people will say, “We can learn something from this bug.”

That suggests, accurately, that both elusiveness and depth are subjective and socially constructed. A bug that might have been easy to find for a developer—shallow from her perspective—might have become buried by the time it gets to the tester. When a bug has been buried under layers of code, such that it’s hard to reach from the surface of the product, finding that bug deliberately requires deep testing.

A tester who is capable of analyzing and modeling risk and writing code to generate rich test data is likely to find deeper, more elusive data-related bugs than a tester who is missing one of those skills.
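
For instance, with a little code, a tester can generate far richer test data than anyone would type by hand. (This is an invented sketch; the names and values are illustrative, not a recipe.)

    # A sketch of generating richer test data: mix boundary lengths, accented
    # and non-Latin characters, embedded quotes, and whitespace-only values,
    # then combine them with boundary-crossing numeric values.
    import itertools

    interesting_names = [
        "",              # empty
        " ",             # whitespace only
        "O'Brien",       # embedded quote
        "María-José",    # accented characters
        "名前",          # non-Latin script
        "a" * 256,       # suspiciously long
    ]
    interesting_discounts = [-1, 0, 0.5, 99.99, 100, 101]

    # Every combination becomes a candidate input for the function or API
    # under test.
    for name, discount in itertools.product(interesting_names, interesting_discounts):
        record = {"customer": name, "discount": discount}
        # ...feed record to the product under test and look for trouble...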

A bug that is easy for a domain expert to notice might easily get past non-experts. Developing expertise in the product domain is an element of deeper testing.

A tester with a rich, diversified set of models for covering the product might find bugs she considers relatively easy to find, but which a developer without those models might consider to be a deep bug.

Deep testing is, in general, far more expensive and time-consuming than shallow testing. For that reason, we don’t want to perform deep testing

  • too often
  • prematurely
  • in a way oblivious to its cost
  • when it’s not valuable
  • when the feature in question and its relationship to the rest of the product is already well-understood
  • when risk is low
  • when shallow testing will do

We probably don’t need to perform deep testing when we’ve already done plenty of deep testing, and all we want to do is check the status of the build before release. We probably don’t need deep testing when a change is small, and simple, and well-contained, and both the change and its effects have been thoroughly checked. Such testing could easily be obsessive-compulsively, pathologically deep.

So, once again, the issue is not that shallow testing is bad and deep testing is good. In some contexts, shallow testing is just the thing we need, where deep testing would be overkill, expensive and unnecessary. The key is to consider the context, and the risk gap—the gap between what we can reasonably say we know and what we need to know in order to make good decisions about the product.

Testing Deep and Shallow (1): Coverage

Tuesday, January 11th, 2022

Many years ago, I went on a quest.

Coverage seemed to be an important word in testing, but it began to occur to me that I had been thinking about it in a vague, hand-wavey kind of way. I sensed that I was not alone in that.

I wanted to know what people meant by coverage. I wanted to know what I meant by coverage.

In the Rapid Software Testing class material, James Bach had been describing coverage as “the proportion of the product that has been tested”. That didn’t make sense to me.

Could we think of a product in those kinds of terms? A product can be a lot of things to a lot of people. We could look at a product as a set of bytes on a hard drive, but that’s not very helpful. A product is a set of files and modules that contain code that instantiates objects and data and functions. A product has interactions with hardware and software, some created by us, and some created by other people. A product provides (presumably valuable) functions and features to people. A product has interfaces, whereby people and programs can interact with it, feed it data, probe its internals, and produce output.

A software product is not a static, tangible thing; it’s a set of relationships. What would 100% of a product, a set of relationships, look like? That’s an important question, because unless we know what 100% looks like, the idea of “proportion” doesn’t hold much water.

So, as we do, James and I argued about it.

I went to the testing books. If they referred to coverage at all, most of them begged the question of what coverage is. The books that did describe coverage talked about it in terms of code coverage—lines of code, branches, paths, conditions… Testing Computer Software, for instance, cited Boris Beizer as saying that “testing to the level of ‘complete’ coverage will find, at best, half the bugs”. Huh? How could that make sense?

I eventually found a copy, in India, of Beizer’s Software Testing Techniques, which contained this intriguing hint in the index: “any metric of completeness with respect to a test selection criterion”. While the book talked about code coverage, it also talked about paths in terms of functional flows through the program.

James argued that “any metric of completeness with respect to a test selection criterion” wasn’t very helpful either. “Test selection criteria” are always based on some model of the product, he said.

A model is an idea, activity, or object (such as an idea in your mind, a diagram, a list of words, a spreadsheet, a person, a toy, an equation, a demonstration, or a program…) that represents—literally, re-presents—something complex in terms of something simpler. By understanding something about the simpler thing, a good model can give us leverage on understanding the more complex thing.

There are as many ways to model a software product as there are ways to represent it, or its parts, or the things to which it relates. For instance: we can model a product by representing its components, in a diagram. We can model a product by describing it in a requirements document—which is itself a model of the requirements for the product. We can represent the information stored by a product by way of a database schema.

We can model a product in terms of its interfaces—APIs and command lines and GUIs and network protocols and printer ports. We can represent people’s interactions with a product by means of flowcharts, user stories, tutorials, or task lists. And of course, we are always modeling a product tacitly, with sets of ideas in our heads. We can represent those ideas in any number of ways.

The code is not the product. The product is that set of relationships between software, hardware, people, and their needs and desires, individually and in social groups. The code for the product is itself a model of the product. Code coverage is one way to describe how we’ve covered the product with testing.

And somewhere, in all of that back-and-forth discussion between James and me, a light began to dawn.

In the Rapid Software Testing namespace, when we’re talking about coverage generally,

Coverage is how thoroughly we have examined the product with respect to some model.

When we’re speaking about some kind of coverage, that refers to a specific model.

  • Functional coverage is how thoroughly we have examined the product with respect to some model of the functions in the product.
  • Requirements coverage is how thoroughly we have examined the product with respect to some model of the requirements.
  • Performance coverage is how thoroughly we have examined the product with respect to some model of performance.
  • Risk coverage is how thoroughly we have examined the product with respect to some model of risk.

Code coverage is how thoroughly we have examined the product with respect to some model of the code.

It should be plain to see that code coverage is not the same as risk coverage; that covering the code doesn’t cover all of the possible risks that might beset a product. It should be equally clear that risk coverage (how thoroughly we have examined the product with respect to some model of risk) doesn’t necessarily cover all the code, either.
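
A tiny, invented example may help to make that concrete: a single check can execute every line of a function, and so report 100% line coverage, while an important risk goes entirely unexamined.

    # This one check executes every line of average(), so line coverage is 100%.
    # Yet the risk of an empty list -- a ZeroDivisionError waiting to happen --
    # is never examined. Code coverage is not risk coverage.
    def average(values):
        total = sum(values)
        return total / len(values)

    def test_average():
        assert average([2, 4, 6]) == 4   # covers every line; misses average([])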

Which brings us to the next exciting installment: what we mean by deep and shallow testing.