Testing Deep and Shallow (2): “Shallow” is a feature, not an insult!

January 11th, 2022

When we talk about deep and shallow testing in the Rapid Software Testing namespace, some people might assume that we mean “deep testing” is good and decent and honourable, and that we mean “shallow” to be an insult. But we don’t. “Shallow” is not an insult.

Depth and shallowness are ways of talking about the thoroughness of testing, but they’re not assessments of its value. The value or quality or appropriateness of thoroughness can only be decided in context. Shallow testing can be ideal for some purposes, and deep testing can be pathological. How so? Let’s start by getting clear on what we do mean.

Shallow testing is testing that has a chance of finding every easy bug.

“Shallow testing” is not an insult! Shallow doesn’t mean “slapdash”, and shallow doesn’t mean “sloppy”.

Both shallow testing and finding easy bugs are good things. We want to find bugs—especially easy bugs—as quickly and as efficiently as possible, and shallow testing has a chance of finding them. Shallow testing affords some coverage, typically in specific areas of the product. In lots of contexts, the fact that shallow testing isn’t deep is a feature, not a bug.

Here’s a form of shallow testing: TDD-style checks. When developers design and implement TDD checks, the goal is not to test the product deeply. The goal is to make efficient, incremental progress in building a function or a feature. Each new check provides a quick indication that the new code does what the programmer intended it to do. Re-running the existing suite of checks provides a developer with some degree of confidence that the new code hasn’t introduced easy-to-find problems.

TDD makes rapid progress possible by focusing the programmer on experimenting with the design and writing code efficiently. That effort is backed with simple, quick, first-order output checks. For the purpose of getting a new feature built, that’s perfectly reasonable and responsible.
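
To make that concrete, here’s a minimal sketch of the kind of first-order output check TDD tends to produce. The discount function, the particular cases, and the pytest-style test functions are my own assumptions for illustration, not anything from a specific product.

```python
# A quick, shallow, first-order output check of the sort TDD produces.
# discount() and these particular cases are hypothetical.

def discount(subtotal, is_member):
    """Apply a 10% discount for members on orders of 100 or more."""
    if is_member and subtotal >= 100:
        return round(subtotal * 0.90, 2)
    return subtotal

def test_member_gets_discount_at_threshold():
    assert discount(100.00, is_member=True) == 90.00

def test_non_member_pays_full_price():
    assert discount(100.00, is_member=False) == 100.00
```

Notice what such checks don’t probe: rounding rules for odd currencies, concurrency, localization, huge orders, malicious input. That’s not negligence; it’s the design.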

When I’m writing code, I don’t want to do challenging, long-sequence, thorough experiments that probe a lot of different coverage areas every time I change the danged code. Neither do you. TDD checks aren’t typically targeted towards testing for security and usability and performance and compatibility and installability risks. If they were, TDD would be intolerably slow and ponderous, and running the checks would take ages.

Checking of this nature is appropriately and responsibly quick, inexpensive, and just thorough enough, allowing the developers to make reliable progress without disrupting development work too much. The idea is to find easy bugs at the coal face, applying relatively little effort that affords maximum momentum. That speed and ease is absolutely a feature of shallow testing. And not a bug.

Shallow testing is also something that testers must do in their early encounters with the product, because there is no means to teleport a tester to deep testing right away.

A developer builds her mental models of the product as part of the process of building it. The tester doesn’t have that insider’s, builder’s perspective. The absence of that perspective is both a feature and a bug. It’s a feature because the tester is seeing the product with fresh eyes, which can be helpful for identifying problems and risks. It’s a bug because the tester must go through stages of learning, necessary confusion, and bootstrapping to learn about the product.

The Bootstrap Conjecture suggests that any process that is eventually done well and efficiently started off by being done poorly and inefficiently; any process focused on trying to get things right the first time will be successful only if it’s trivial or lucky.

In early encounters with a product, a tester performs shallow testing—testing that has a chance of finding every easy bug. That affords the opportunity to learn the product, while absolving the tester of an obligation to try to get to deep testing too early.

So what is deep testing?

Deep testing is testing that maximizes the chance of finding every elusive bug that matters.

That needs some unpacking.

First, “maximize”. No testing, and no form of testing, can guarantee that we’ll find every bug. (Note that in Rapid Software Testing, a bug is anything about the product that might threaten its value to some person who matters.)

It’s a commonplace maxim that complete testing is impossible: we can’t enter every possible set of inputs; examine every possible set of outputs; exercise every function in the product, in every possible sequence, with every possible variation of timing, on every possible platform, in every possible machine state that we can’t completely control anyway.

Given that we’re dealing with an infinite, intractable, multi-dimensional test space, testing skill matters, but some degree of luck inevitably plays a role. We can only strive to maximize our chances of finding bugs, because bugs are to some degree elusive. Bugs can be subtle, hidden, rare, intermittent, or emergent.

Some bugs are subtle, based on poorly-understood aspects of programming languages, or surprising behaviour of technologies.

Some bugs are hidden in complex or obscure or old code. Some bugs are hidden in code that we didn’t write, but that we’re calling in a library or framework or operating system.

Some bugs are rare, dependent on specific sets of unusual conditions, or triggered by code encountering particular data, or exclusive to specific platforms.

Some bugs are intermittent, only manifesting infrequently, when the system is in a particular state.

Perhaps most significantly, some bugs are emergent. All of the components in a product might be fine in isolation, but the overall system has problems when elements of it are combined. A shared library, developed internally, that supports one product might clobber functions in another. A product that renders fine on one browser might run afoul of different implementations of standards on another.

Just today, I got mail from a Mac user friend that I’m sure looked fine on his machine; it doesn’t get rendered properly under Windows Outlook. A product that performs fine in the lab can be subject to weird timing problems when network latency comes into play, or when lots of people are using the system at the same time.

Time can be a factor, too. One classic case is the Y2K problem; storing the year component of a date in a two-digit field wouldn’t have looked like much of a problem in 1970, when storage was expensive and people didn’t foresee that the system might still be in use a generation later. Programs that ran just fine on single-tasking 8086 processors encountered problems when run in virtual mode on the supposedly-compatible virtual 8086 mode on 80386 and later processors.
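
As a toy illustration of how time can turn a reasonable-looking decision into a bug (the code and numbers here are invented, not from any real system):

```python
# Hypothetical two-digit-year arithmetic, Y2K-style.
def years_since(stored_yy, current_yy):
    return current_yy - stored_yy

print(years_since(70, 95))  # 25  -- looks fine for decades
print(years_since(70, 5))   # -65 -- in 2005, the same code yields nonsense
```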

(This sort of stuff is all over the place. As of this writing, there seems to be some kind of latent bug on my Web site that only manifests when I try to update PHP, and that probably happens thanks to stricter checking by the newer PHP interpreter. It wasn’t a problem when I put the site together, years ago, and for now I’m in upgrade jail until I sort it all out. Sigh.)

Bugs that are elusive can evade even a highly disciplined development process, and can also evade deep testing. Again, there are no guarantees, but the idea behind deep testing is to maximize the chance of finding elusive bugs.

How do you know that a bug is, or was, elusive? When an elusive bug is found in development, before release, qualified people on the team will say things like, “Wow… it would have been really hard for me to notice that bug. Good thing you found it.”

When a bug in our product is found in the field, by definition it eluded us, but was it an elusive bug?

Elusiveness isn’t a property of a bug, but a social judgment—a relationship between the bug, people, and context. If a bug found in the field was elusive, our social group will tend to agree, “Maybe we could have caught that, but it would have been really, really hard.” If a bug wasn’t elusive, our social group will say “Given the time and resources available to us, we really should have caught that.” In either case, responsible people will say, “We can learn something from this bug.”

That suggests, accurately, that both elusiveness and depth are subjective and socially constructed. A bug that might have been easy to find for a developer—shallow from her perspective—might have become buried by the time it gets to the tester. When a bug has been buried under layers of code, such that it’s hard to reach from the surface of the product, finding that bug deliberately requires deep testing.

A tester who is capable of analyzing and modeling risk and writing code to generate rich test data is likely to find deeper, more elusive data-related bugs than a tester who is missing one of those skills.

A bug that is easy for a domain expert to notice might easily get past non-experts. Developing expertise in the product domain is an element of deeper testing.

A tester with a rich, diversified set of models for covering the product might find bugs she considers relatively easy to find, but which a developer without those models might consider to be a deep bug.

Deep testing is, in general, far more expensive and time-consuming than shallow testing. For that reason, we don’t want to perform deep testing

  • too often
  • prematurely
  • in a way oblivious to its cost
  • when it’s not valuable
  • when the feature in question and its relationship to the rest of the product is already well-understood
  • when risk is low
  • when shallow testing will do

We probably don’t need to perform deep testing when we’ve already done plenty of deep testing, and all we want to do is check the status of the build before release. We probably don’t need deep testing when a change is small, and simple, and well-contained, and both the change and its effects have been thoroughly checked. Such testing could easily be obsessive-compulsively, pathologically deep.

So, once again, the issue is not that shallow testing is bad and deep testing is good. In some contexts, shallow testing is just the thing we need, where deep testing would be overkill, expensive and unnecessary. The key is to consider the context, and the risk gap—the gap between what we can reasonably say we know and what we need to know in order to make good decisions about the product.

Testing Deep and Shallow (1): Coverage

January 11th, 2022

Many years ago, I went on a quest.

Coverage seemed to be an important word in testing, but it began to occur to me that I had been thinking about it in a vague, hand-wavey kind of way. I sensed that I was not alone in that.

I wanted to know what people meant by coverage. I wanted to know what I meant by coverage.

In the Rapid Software Testing class material, James Bach had been describing coverage as “the proportion of the product that has been tested”. That didn’t make sense to me.

Could we think of a product in those kinds of terms? A product can be a lot of things to a lot of people. We could look at a product as a set of bytes on a hard drive, but that’s not very helpful. A product is a set of files and modules that contain code that instantiate objects and data and functions. A product has interactions with hardware and software, some created by us, and some created by other people. A product provides (presumably valuable) functions and features to people. A product has interfaces, whereby people and programs can interact with it, feed it data, probe its internals, produce output.

A software product is not a static, tangible thing; it’s a set of relationships. What would 100% of a product, a set of relationships, look like? That’s an important question, because unless we know what 100% looks like, the idea of “proportion” doesn’t carry much water.

So, as we do, James and I argued about it.

I went to the testing books. If they referred to coverage at all, most of them begged the question of what coverage is. The books that did describe coverage talked about it in terms of code coverage—lines of code, branches, paths, conditions… Testing Computer Software, for instance, cited Boris Beizer as saying that “testing to the level of ‘complete’ coverage will find, at best, half the bugs”. Huh? How could that make sense?

I eventually found a copy, in India, of Beizer’s Software Testing Techniques, which contained this intriguing hint in the index: “any metric of completeness with respect to a test selection criterion”. While the book talked about code coverage, it also talked about paths in terms of functional flows through the program.

James argued that “any metric of completeness with respect to a test selection criterion” wasn’t very helpful either. “Test selection criteria” are always based on some model of the product, he said.

A model is an idea, activity, or object (such as an idea in your mind, a diagram, a list of words, a spreadsheet, a person, a toy, an equation, a demonstration, or a program…) that represents—literally, re-presents—something complex in terms of something simpler. By understanding something about the simpler thing, a good model can give us leverage on understanding the more complex thing.

There are as many ways to model a software product as there are ways to represent it, or its parts, or the things to which it relates. For instance: we can model a product by representing its components, in a diagram. We can model a product by describing it in a requirements document—which is itself a model of the requirements for the product. We can represent the information stored by a product by way of a database schema.

We can model a product in terms of its interfaces—APIs and command lines and GUIs and network protocols and printer ports. We can represent people’s interactions with a product by means of flowcharts, user stories, tutorials, or task lists. And of course, we are always modeling a product tacitly, with sets of ideas in our heads. We can represent those ideas in any number of ways.

The code is not the product. The product is that set of relationships between software, hardware, people, and their needs and desires, individually and in social groups. The code for the product is itself a model of the product. Code coverage is one way to describe how we’ve covered the product with testing.

And somewhere, in all of that back-and-forth discussion between James and me, a light began to dawn.

In the Rapid Software Testing namespace, when we’re talking about coverage generally,

Coverage is how thoroughly we have examined the product with respect to some model.

When we’re speaking about some kind of coverage, that refers to a specific model.

  • Functional coverage is how thoroughly we have examined the product with respect to some model of the functions in the product.
  • Requirements coverage is how thoroughly we have examined the product with respect to some model of the requirements.
  • Performance coverage is how thoroughly we have examined the product with respect to some model of performance.
  • Risk coverage is how thoroughly we have examined the product with respect to some model of risk.

Code coverage is how thoroughly we have examined the product with respect to some model of the code.

It should be plain to see that code coverage is not the same as risk coverage; that covering the code doesn’t cover all of the possible risks that might beset a product. It should be equally clear that risk coverage (how thoroughly we have examined the product with respect to some model of risk) doesn’t necessarily cover all the code, either.
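
One way to make that concrete: coverage is always measured against some model, so the same testing gives different numbers depending on which model you hold it against. Here’s a rough sketch in code, with the models and test records entirely made up for illustration:

```python
# Toy illustration: "how thoroughly have we examined the product
# with respect to some model?" The models below are invented.

def coverage(examined, model):
    """Proportion of a model's elements that testing has touched."""
    return len(examined & model) / len(model)

function_model = {"login", "search", "add_to_cart", "checkout", "refund"}
risk_model = {"data loss", "overcharge", "broken layout", "slow search"}

examined_functions = {"login", "search", "add_to_cart"}
examined_risks = {"overcharge"}

print(coverage(examined_functions, function_model))  # 0.6  functional coverage
print(coverage(examined_risks, risk_model))          # 0.25 risk coverage
```

Same testing effort; very different pictures, depending on the model.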

Which brings us to the next exciting installment: what we mean by deep and shallow testing.

Lessons Learned in Finding Bugs

November 18th, 2021

This story is put together from several parallel experiences over the last while that I’ve merged into one narrative. The pattern of experiences and epiphanies is the same, but as they used to say on TV, names and details have been changed to protect the innocent.

I was recently involved in testing an online shopping application. In the app, there’s a feature that sends notifications via email.

On the administration screen for setting up that feature, there are “Save” and “Cancel” buttons near the upper right corner. Those buttons are not mapped to any keys. The user must either click on them with the mouse, or tab to them and press Enter.

Below and to the left, there are some fields to configure some settings. Then, at the bottom left, there is a field in which the user can enter a default email notification message.

Add a nice short string to that text field, and everything looks normal. Fill up that field, and the field starts to expand rightwards to accommodate the text. The element in which the text field is embedded expands rightwards too.

Add enough text (representing a perfectly plausible length for an email notification message) to the text field, and the field and its container expand rightwards far enough that they start to spill off the edge of the screen.

And here’s the kicker: all this starts to obscure the Save and Cancel buttons in the top right, such that they can’t be clicked on any more. You can delete the text, but the field and container stubbornly remain the same enlarged size. That is, they don’t shrink, and the Save and Cancel buttons remain covered up.

If you stumble around with the Tab key, you can at least make the screen go away—but if you were unlucky enough to click “Save” and return to the application, the front-end remains in the messed-up state.

There is a configuration file, but it’s obfuscated so that you can’t simply edit it and adjust the length of the field to restore it to something that doesn’t cover the Save and Cancel buttons. You can delete the file, but if you do that, you’ll lose a ton of other configuration settings that you’ll have to re-enter.

The organization had, the testers told me, a set of automated checks for this screen. We looked into it. Those checks didn’t include any variation. For the email notification field, they changed the default to a short string of different happy-path data, and pressed the Save button. But they didn’t press the on-screen Save button. They pressed a virtual Save button.

Thus, even if the check included some challenging data, the automated checks would still have been able to find and click on the correct invisible, inaccessible, virtual Save and Cancel buttons just fine. That is, there is no way that the checks would alert a tester or anyone else to this problem.
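
A hedged sketch of why the checks stayed green: if a check clicks the button at the DOM level, an element that a human can no longer see or reach still “works”. The page, the element id, and the Selenium usage below are my assumptions for illustration, not the organization’s actual code.

```python
# Illustration (Python + Selenium): a DOM-level click succeeds even when
# the Save button is visually covered by the overgrown text field.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementClickInterceptedException

driver = webdriver.Chrome()
driver.get("https://example.test/admin/notifications")  # hypothetical URL
save = driver.find_element(By.ID, "save")                # hypothetical id

# What the automated check effectively did: a "virtual" click.
driver.execute_script("arguments[0].click();", save)     # passes regardless

# What a real user does: a pointer click at the button's location.
try:
    save.click()  # the driver refuses if another element would get the click
except ElementClickInterceptedException:
    print("Save button is obscured -- a human couldn't click it")
```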

After searching for a product, there was a screen to display tiles of products returned in the search. Some searches returned a single product, displaying a single tile. It didn’t take very long for us to find that leaving that screen and coming back to it produced a second instance of the same tile. Leaving and coming back again left three tiles on the screen. It didn’t take long to produce enough tiles for a Gaudi building in Barcelona.

Logging in and putting products into the shopping cart was fine. Putting items into the shopping cart and then logging in put the session into a weird state. The number of items on the shopping cart icon was correct, based on what we had selected, but trying to get into the shopping cart and change the order produced a message that the shopping cart could not be accessed at this time, and all this rendered a purchase impossible. (I tried it later on the production site; same problem. Dang; I wanted those books.)

We found these problems within the first few minutes of free, playful interaction with this product and trying to find problems. We did it by testing experientially. That is, we interacted with the product such that our encounter was mostly indistinguishable from that of a user that we had in mind from one moment to the next. Most observers wouldn’t have noticed how our encounter was different from a user’s, unless that observer were keen to notice us doing testing.

That observer might have noticed us designing and performing experiments in real time, and taking notes. Those experiments were based on imagining data and work flows that were not explicitly stated in the requirements or use case documents. The experiments were targeted towards vulnerabilities and risks that we anticipated, imagined, and discovered. We weren’t there to demonstrate that everything was working just fine. We were there to test.

And our ideas didn’t stay static. As we experimented and interacted with the product, we learned more. We developed richer mental models of the data and how it would interact with functions in the product. We developed our models of how people might use the product, too; how they might perform some now-more-foreseeable actions—including some errors that they might commit that the product might not handle appropriately. That is, we were changing ourselves as we were testing. We were testing in a transformative way.

Upon recognizing subtle little clues—like the text field growing when it might have wrapped, or rendered existing data invisible by scrolling the text—we recognized the possibility of vulnerabilities and risks that we hadn’t anticipated. That is, we were testing in an exploratory way.

We didn’t let tools do a bunch of unattended work and then evaluate the outcomes afterwards, even though there can be benefits from doing that. Instead, our testing benefitted from our direct observation and engagement. That is, we were testing in an attended way.

We weren’t constrained by a set procedure, or by a script, or by tools that mediated and modified our naturalistic encounter with the product. That is, we weren’t testing in an instrumented way, but in an experiential way.

We were testing in a motivated way, looking for problems that people might encounter while trying to use the damned thing. Automated checks don’t have motivations. That’s fine; they’re simply extensions of people who do have motivations, and who write code to act on them. Even then, automated checks had not alerted anyone to this bug, and would never do so because of the differences between the way that humans and machines encounter the product.

Oh, and we found a bunch of other bugs too. Bunches of bugs.

In the process of doing all this, my testing partners realized something else. You see, this organization is similar to most: the testers typically design a bunch of scripted tests, and then run them over and over—essentially, automated checking without a machine. Eventually, some of the scripts get handed to coders who turn them into actual automated checks.

Through this experience, the testers noticed that neither their scripted procedures nor the automated checks had found the problems. They came to realize that even if someone wanted them to create formalized procedures, it might be a really, really good idea to hold off on designing and writing the scripts until after they had obtained some experience with the product.

Having got some experience with the product, the testers also realized that there were patterns in the problems they were finding. The testers realized that they could take these patterns back to design meetings as suggestions for the developers’ reviews, and for unit- and integration-level checks. That in turn would mean that there would be fewer easy-to-find bugs on the surface. That would mean that testers would spend less time and effort on reporting those bugs—and that would mean that testers could focus their attention on deeper, richer experiential testing for subtler, harder-to-find bugs.

They also realized that they would likely find and report some problems during early experiential testing, and that the developers would fix those problems and learn from the experience. For a good number of these problems, after the fix, there would be incredibly low risk of them ever coming back again—because after the fix, it would be seriously unlikely that those bits of code would be touched in a way to make those particular problems come back.

This would reduce the need for lengthy procedural scripting associated with those problems; a handful of checklist items, at most, would do the trick. The fastest script you can write is the script you don’t have to write.

And adding automated checks for those problems probably wouldn’t be necessary or desirable. Remember?—automated checks had failed to detect the problems in the first place. The testers who wrote code could refocus their work on lower-level, machine-friendly interfaces to test the business rules and requirements before the errors got passed up to the GUI. At the same time, those testers could use code to generate rich input data sets, and use code to pump that data through the product.
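
For instance, a tester who codes might generate far richer input than a recorded script ever would. A sketch, with the particular data categories being my own choices rather than anything from this project:

```python
# Sketch: generating risk-targeted input data for a text field or API.
import random
import string

def rich_strings():
    yield ""                                   # empty
    yield " " * 50                             # whitespace only
    yield "a" * 10_000                         # very long
    yield "Robert'); DROP TABLE orders;--"     # injection-shaped
    yield "café 你好 🙂🙂🙂"                     # accents, CJK, emoji
    yield "line1\r\nline2"                     # embedded line breaks
    yield "".join(random.choices(string.printable, k=200))  # noisy

for s in rich_strings():
    # feed each value through the product's API or UI driver here
    print(repr(s)[:60])
```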

Or those testers could create tools and visualizations and log parsers that would help the team see interesting and informative patterns in the output. Or those testers could create genuinely interesting and powerful and rich forms of automated checking, as in this example. (Using the packing function against the unpacking function is a really nice application of the forward-and-backward oracle heuristic.)
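
The forward-and-backward idea fits in a few lines: push data forward through one function, backward through its counterpart, and check that you end up where you started. A generic sketch; json.dumps and json.loads stand in here for whatever packing and unpacking pair the product actually provides.

```python
# Round-trip (forward-and-backward) oracle: unpack(pack(x)) should equal x.
import json
import random
import string

pack, unpack = json.dumps, json.loads  # stand-ins for the product's pair

def random_record():
    return {
        "name": "".join(random.choices(string.ascii_letters, k=12)),
        "qty": random.randint(0, 10**6),
        "notes": random.choice(["", "naïve 🙂", "line\nbreak", None]),
    }

for _ in range(1000):
    original = random_record()
    assert unpack(pack(original)) == original, f"round trip failed: {original}"
print("1000 round trips survived")
```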

One of the best ways to “free up time for exploratory testing” is to automate some checks—especially at the developer level. But another great way to free up time for exploratory testing is to automate fewer expensive, elaborate checks that require a lot of development and maintenance effort and that don’t actually find bugs. Some checks are valuable, and fast, and inexpensive. The fastest, least expensive check you can write is the valueless check you don’t have to write.

And attended, exploratory, motivated, transformative, experiential testing is a great way to figure out which is which.


There’s an RST Explored class coming up for American days and European evenings. It runs January 17-20, 2022. For that, register here.

You (or your colleagues or other members of your network) might also be interested in the Rapid Software Testing Managed class, also for Europe, UK, and Indian Time Zones, which happens December 1-3, 2021. Information on the class here and more info here; registration here.

What Tests Should I Automate?

November 11th, 2021

Instead of asking “What tests should I automate?” consider asking some more pointed questions.

If you really mean “how should I think about using tools in testing?”, consider reading A Context-Driven Approach to Automation in Testing, and Testing and Checking Refined.

If you’re asking about the checking of output or other facts about the state of the product, keep reading.

Really good fact checking benefits from taking account of your status so that you don’t waste time:

  • Do I know enough about the product, and where there might be problems, to be sure that I’m not rushing into developing checks?

If the answer is No, it might be a better idea to do a deeper survey of the product, and scribble down some notes about risk as you go.

If the answer is Yes, then you might want to loop through a bunch of questions, starting here:

  • What specific fact about the product’s output or state do I want to check?
  • Do I know enough about the product, and where there might be problems, to be reasonably sure that this really is an important fact to check?
  • Is someone else (like a developer) checking this fact already?

Then maybe consider product risk:

  • What could go wrong in the product, such that I could notice it by checking this fact?
  • Is it a plausible problem? A significant risk?
  • Why do we think it’s a significant risk?
  • Is that foreboding feeling trying to tell us something?

Maybe if there’s serious risk here, a conversation to address the risk is a better idea than more testing or checks.

Assuming that it’s a worthwhile fact to check, move on to how you might go about checking the fact, and the cost of doing it:

  • What’s a really good way to check this fact?
  • Is that the fastest, easiest, least expensive way to check it?
  • Will the check be targeted at a machine-friendly interface?

Consider opportunity cost, especially if you’re targeting automated checks at the GUI:

  • What problems will I encounter in trying to check this fact this way, and doing that reliably?
  • What problems, near here or far away, might I be missing as a consequence of focusing on this fact, and checking it this way?
  • Is there some activity other than checking this fact that might be more valuable right now? In the long run?

Try thinking in terms of the longer term. On the one hand, the product and certain facts about it might remain very stable:

  • What would cause this fact to change? Is that likely? If not, is it worthwhile to create a check for this fact?

On the other hand, the product or the platform or the data or the test environment might become unstable. They might be unstable already:

Beware of quixotic reliability. That’s a wonderful term I first read about in Kirk and Miller’s Reliability and Validity in Qualitative Research. It refers to a situation where we’re observing a consistent result that’s misleading, like a broken thermometer that reliably reads 37° Celsius. (Kirk and Miller have some really important things to say about confirmatory research, too.)

  • Is there a chance that this check might lull us to sleep by running green in a way that fools us?
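
As a hedged illustration of what that can look like in code (the endpoint and the check itself are invented): a check can run green forever while measuring nothing.

```python
# A check that is reliably green and reliably useless -- the automated
# equivalent of the thermometer that always reads 37 degrees.
import requests

def test_orders_endpoint_is_fine():
    try:
        response = requests.get("https://example.test/api/orders", timeout=5)
        body = response.json()
    except Exception:
        body = {}            # swallow every failure...
    assert body is not None  # ...then assert something that's always true
```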

To address the risk of quixotic reliability and to take advantage of what we might have learned, it’s good to look at every check, every once in a while:

  • What’s our schedule for reviewing this check? For considering sharpening it, or broadening it?

A sufficiently large suite of checks is incomprehensible, so there’s no point in running checks that are no longer worthwhile:

  • What’s our plan for checking this fact less often when we don’t need it so much? For retiring this check?

The next questions are especially important if you’re a tester. Checks are least expensive and most valuable when they provide fast feedback to the developers; so much so that it might be a good idea for the developers to check the code before it ever gets to you.

  • Am I the right person to take responsibility for checking this fact? Am I the only person? Should I be?
  • Is checking this fact, this way, the earliest way that we could become aware of a real problem related to it?

Given all that, think about utility—the intersection of cost, and risk, and value:

  • Do I still believe I really need to check this fact? Is it worthwhile to develop a check for it?

After you’ve asked these questions, move on to the next fact to be checked.

“But asking and answering all those questions will take a long time!”, you might reply.

Not that long. Now that you’ve been primed, this is a set of ideas that you can soon carry about in your mind, and run through the list at the speed of thought. Compare asking these questions with how long it takes to develop and run and interpret and revise and review and maintain even one worthless check.

Now it’s true, you can save some time and effort by skipping the interpretation and review and maintenance stuff. That is, essentially, by ignoring the check after you’ve written it. But if you’re not going to pay attention to what a check is telling you, why bother with it at all? It’s faster still not to develop it in the first place.

With practice, the questions that I offered above can be asked routinely. Don’t assume that my list is comprehensive; ask your own questions too. If you pay attention to the answers, you can be more sure that your checks are powerful, valuable, and inexpensive.


There’s a Rapid Software Testing Explored class coming up for American days and European evenings. It runs January 17-20, 2022. Register here.

You (or your colleagues or other members of your network) might also be interested in the Rapid Software Testing Managed class which happens December 1-3, 2021, also for Europe/UK/Indian time zones. The class is for test managers and test leads. It’s also for people who aspire to those positions; for solo testers who must manage their own work responsibly; and for development or product managers who work with testers—anyone who is in a leadership position.

Testing Doesn’t Improve the Product

November 9th, 2021

(This post is adapted from my recent article on LinkedIn.)

Out there in the world, there is a persistent notion that “preventing problems early in the software development process will lead to higher-quality products than testing later will”. That isn’t true.

It’s untrue, but not for the reason that might first occur to most people. The issue is not that addressing problems early on is a bad idea. That’s usually a really good idea.

The issue is the statement is incoherent. Testing on its own, whether done early or late, will not lead to higher quality products at all.

Problem prevention, product improvements, and testing are different pursuits within development work. These activities are related, but testing can neither prevent problems nor improve the product. Something else, beyond testing, must happen.

Coming from me—a teacher and an advocate for skilled testing—that might seem crazy, but it’s true: testing doesn’t improve the product.

Investigative journalism is an information-gathering activity. Investigative journalists reveal problems in companies and governments and groups, problems that affect society.

Awareness of those problems may lead to public concern or outcry. The news reports on their own, though, don’t change anything. Change happens when boards, regulators, politicians, legal systems, leaders, workers, or social groups take action.

Testing, too, is an information-gathering activity. That information can be used to recognize problems in the product (“bugs”), or to identify aspects of the product that do not represent errors but that nonetheless could be improved (“enhancement requests”). Gathering information plays a role in making things better, but it doesn’t make things better intrinsically and automatically.

Consider: weighing yourself doesn’t cause you to lose weight. Blood tests don’t make you healthier. Standardized tests in schools don’t make kids smarter, and certainly don’t improve the quality of education.

What testing can do is to improve our understanding and awareness of whatever might be in front of us. Testing—the process of evaluating a product by learning about it through experiencing, exploring, and experimenting—helps our teams and our clients to become aware of problems. On becoming aware of them, our teams and clients might decide to address them.

To put it another way: testing is questioning a product in order to evaluate it. Neither the questions nor the answers make the product better. People acting on the answers can make the product better.

Similarly, in daily life, a particular reading on a bathroom scale might prompt us to eat more carefully, or to get more exercise, whereupon we might become more fit. A blood test might prompt a doctor to prescribe anti-malarial drugs, and if we take them as prescribed, we’re likely to control the malaria. Those standardized school tests might suggest changes to the curriculum, or to funding for education, or to teacher training. But until someone takes action, the test only improves awareness of the situation, not the situation itself.

In software development, improvement doesn’t happen unless someone addresses the problems that testing helps us to discover. Of course, if the problems aren’t discovered, improvement is much less likely to happen—and that’s why testing is so important. Testing helps us to understand the product we’ve got, so we can decide whether it’s the product we want. Where improvement is necessary, testing can reveal the need for improvement.

Some people believe that testing requires us to operate a product, thinking in terms of the product as a built piece of software. That’s a very important kind of testing, but it’s only one kind of testing activity, referring to one kind of product.

It can be helpful to consider a more expansive notion of a product as something that someone has produced. This means that testing can be applied to units or components or mockups or prototypes of an application.

And although we might typically call it review, we can apply a kind of testing to things people have written, or sketched, or said about a software product that does not yet exist. In these cases, the product is the artifact or the ideas that it represents. In this kind of test, experimentation consists of thought experiments; exploration applies to the product and to the space, or context, in which it is situated; experiencing the product applies to the process of analysis, and to experiences that we could imagine.

The outcome of the test-as-thought-experiment is the evaluation and learning that happens through these activities. That learning can be applied to correcting errors in the design and the development of the product—but once again, it’s the work that happens in response to testing, not the testing itself, that improves the product.

Just as testing doesn’t improve products, testing doesn’t prevent problems either. As testers, we have an abiding faith that there are already problems in anything that we’ve been asked to test. That is, the problems in the product are there before we encounter them. We must believe that problems have not been prevented. Indeed, our belief that problems have not been successfully prevented is a key motivating idea for testing work.

So what good is testing if it can’t prevent problems? Testing can help us to become aware of real problems that are really there. That’s good. That might even be great, because with that awareness, people can make changes to prevent those unprevented problems from going any further, and that’s good too.

It’s a great idea for people who design and build products to try to prevent problems early in development. To the degree that the attempt can be successful, the builders are more likely to develop a high-quality product. Nonetheless, problems can elude even a highly disciplined development process. There are at least two ways to find out if that has happened.

One way is to test the product all the way through its development, from intention to realization. Test the understanding of what the customer really wants, by engaging with a range of customers and learning about what they do. Test the initially fuzzy and ever-sharper vision of the designer, through review and discussion and what-if questions.

Test the code at its smallest units and at every stage of integration, through more review, pairing, static analysis tools, and automated output checking. Check the outputs of the build process for bad or missing components and incorrect settings. These forms of testing are usually not terribly deep. That’s a good thing, because deep testing may take time, effort, and preparation that can be disruptive to developers. Without deep testing, though, bugs can elude the developers.

So, in parallel to the developers’ testing, assign some people to focus on and to perform deep testing. Deep testing is targeted towards rare, hidden, subtle, intermittent, emergent bugs that can get past the speedy, shallow, non-disruptive testing that developers—quite reasonably—prefer most of the time.

If your problem-prevention and problem-mitigation strategies have been successful, if you’ve been testing all along, and if you’ve built testability into the product, you’ll have a better understanding of it. You’ll also be less likely to encounter shallow problems late in the game. If you don’t have to investigate and report those problems, deep testing can be relatively quicker and easier.

If your problem-prevention and problem-mitigation strategies have been unsuccessful, deep testing is one way to find out. The problems that you discover can be addressed; the builders can make improvements to the product; and problems for the business and for the customer can be prevented before the product ships.

The other way to find out if a problem has eluded your problem prevention processes is to release the product to your otherwise unsuspecting customers, and take the chance that the problems will be both infrequent and insignificant enough that your customers won’t suffer much.

Here are some potential objections:

If software testing does not reveal the need for improvement then the improvement will not happen.

That’s not true. Although testing shines light on things that can be improved, improvement can happen without testing.

Testing can happen without improvement, too. For instance…

  • I perform a test. I find a bug in a feature. The program manager says, “I disagree that that’s a bug. We’re not doing anything in response to that report.”
  • I perform a test. I find a bug in a feature. The program manager says “I agree that that’s a bug. However, we don’t have time to fix it before we ship. We’ll fix it in the next cycle.”
  • I test. I find a bug. The program manager agrees that it’s a bug. The developer tries to fix it, but makes a mistake and the fix is ineffective.
  • I test. I find a bug. The program manager agrees, the developer fixes that bug, but along the way introduces new bugs, each of which is worse than the first.

In each case above, 1) Has the product been tested? (Yes.) 2) Has the product been improved? (No.)

Saying that testing doesn’t improve the product diminishes the perceived value of testing.

Saying that testing does improve the product isn’t true, and miscalibrates the role of the tester relative to the people who design, build, and manage the product.

Let’s be straight about this: we play a role in product improvement, and that’s fine and valuable and honourable. Being truthful and appropriately humble about the extents and limits of what testing can actually do diminishes none of its value. We don’t design or develop or improve the product, but we give insight to the people who do.

The value argument in favour of testing is easy to make. As I pointed out above, investigative journalists don’t run governments and don’t set public policy. Would you want them to? Probably not; making policy is appropriately the role of policy-makers. On the other hand, would you want to live in a society without investigative journalism? Now: would you want to live a world of products that had been released without sufficiently deep testing?

When there’s the risk of loss, harm, bad feelings, or diminished value for people, it’s a good idea to be aware of problems before it’s too late, and that’s where testing helps. Testing on its own neither prevents problems nor improves the product. But testing does make it possible to anticipate problems that need to be prevented, and testing shines light on the places where the product might need to be improved.

Experience Report: Katalon Studio

November 5th, 2021

Warning: this is another long post. But hey, it’s worth it, right?

Introduction

This is an experience report of attempting to perform a session of sympathetic survey and sanity testing on a “test automation” tool. The work was performed in September 2021, with follow-up work November 3-4, 2021. Last time, the application under test was mabl. This time, the application is Katalon Studio.

My self-assigned charter was to explore and survey Katalon Studio, focusing on claims and identifying features in the product through sympathetic use.

As before, I will include some meta-notes about the testing in indented text like this.

The general mission of survey testing is learning about the design, purposes, testability, and possibilities of the product. Survey testing tends to be spontaneous, open, playful, and relatively shallow. It provides a foundation for effective, efficient, deliberative, deep testing later on.

Sanity testing might also be called “smoke testing”, “quick testing”, or “build verification testing”. It’s brief, shallow testing to determine whether the product is fit for deeper testing, or whether it has immediately obvious or dramatic problems.

The idea behind sympathetic testing is not to find bugs, but to exercise a product’s features in a relatively non-challenging way.

Summary

My first impression is that the tool is unstable, brittle and prone to systematic errors and omissions.

A very short encounter with the product reveals startlingly obvious problems, including hangs and data loss. I pointed Katalon Record to three different Web applications. In each case, Katalon’s recording functions failed to record my behaviours reliably.

I stumbled over several problems that are not included in this report, and I perceive many systemic risks to be investigated. As with mabl, I encountered enough problems in my first encounter with the product that they swamped my ability to stay focused and keep track of them all. I did record a brief video that appears below.

Both the product’s design and documentation steer the user—a tester, presumably—towards very confirmatory and shallow testing. The motivating idea seems to be recording and playing back actions, checking for the presence of on-screen elements, and completing simple processes. This kind of shallow testing could be okay, as far as it goes, if it were inexpensive and non-disruptive, and if the tool were stable, easy to use, and reliable—which it seems not to be.

The actual testing here took no more than an hour, and most of that time was consumed by sensemaking, and reproducing and recording bugs. Writing all this up takes considerably more time. That’s an important thing for testers to note: investigating and reporting bugs, and preparing test reports is important, but presents opportunity cost against interacting with the product to obtain deeper test coverage.

Were I motivated, I could invest a few more hours, develop a coverage outline, and perform deeper testing on the product. However, I’m not being compensated for this, and I’ve encountered a blizzard of bugs in a very short time.

In my opinion, Katalon Studio has not been competently, thoroughly, and broadly tested; or if it has, its product management has either ignored or decided not to address the problems that I am reporting here. This is particularly odd, since one would expect a testing tools company, of all things, to produce a well-tested, stable, and polished product. Are the people developing Katalon Studio using the product to help with the testing of itself? Neither a Yes nor a No answer bodes well.

It’s possible that everything smooths out after a while, but I have no reason to believe that. Based on my out-of-the-box experience, I would anticipate that any tester working with this tool would spend enormous amounts of time and effort working around its problems and limitations. That would displace time for achieving the tester’s presumable mission: finding deep problems in the product she’s testing. Beware of the myth that “automation saves time for exploratory testing”.

Setup and Platform

I performed most of this testing on September 19, 2021 using Chrome 94 on a Windows 10 system. The version of Katalon Studio was 8.1.0, build 208, downloaded from the Katalon web site (see below).

During the testing, I pointed Katalon Studio’s recorder at Mattermost, a popular open-source Slack-like chat system with a Web client; at a very simple Web-based app that we use for an exercise in our Rapid Software Testing classes; and at CryptPad Kanban, an open-source, secure kanban board product.

Testing Notes

On its home page, Katalon claims to offer “An all-in-one test automation solution”. It suggests that you can “Get started in no time, scale up with no limit, for any team, at any level.”

Katalon Claims: "An all-in-one test automation solution. Get started in no time, scale up with no limit, for any team, at any level."

I started with the Web site’s “Get Started” button. I was prompted to create a user account and to sign in. Upon signing in, the product provides two options: Katalon Studio, and Katalon TestOps. There’s a button to “Create your first test”. I chose that.

A download of 538MB begins. The download provides a monolithic .ZIP file. There is no installer, and no guide on where to put the product. (See Bug 1.)

I like to keep things tidy, so I create a Katalon folder beneath the Program Files folder, and extract the ZIP file there. Upon starting the program, it immediately crashes. (See Bug 2.)

Katalon crashes on startup, saying "An error has occurred.  See the log file."

The error message displayed is pretty uninformative, simply saying “An error has occurred.” It does, however, point to a location for the log file. Unfortunately, the dialog doesn’t offer a way to open the file directly, and the text in the dialog isn’t available via cut and paste. (See Bug 3.)

Search Everything to the rescue! I quickly navigate to the log file, open it, and see this:

java.lang.IllegalStateException: The platform metadata area could not be written: C:\Program Files\Katalon\Katalon_Studio_Windows_64-8.1.0\config\.metadata. By default the platform writes its content under the current working directory when the platform is launched. Use the -data parameter to specify a different content area for the platform. (My emphasis, there.)

That -data command-line parameter is undocumented. Creating a destination folder for the product’s data files, and starting the product with the -data parameter does seem to create a number of files in the destination folder, so it does seem to be a legitimate parameter. (Bug 4.) (Later: the product does not return a set of supported parameters when requested; Bug 5.)

I moved the product files to a folder under my home directory, and it started apparently normally. Help/About suggests that I’m working with Katalon Studio v. 8.1.0, build 208.

I followed the tutorial instructions for “Creating Your First Test”. As with my mabl experience report, I pointed Katalon Recorder at Mattermost (a freemium chat server that we use in Rapid Software Testing classes). I performed some basic actions with the product: I entered text (with a few errors and backspaces). I selected some emoticons from Mattermost’s emoticon picker, and entered a few more via the Windows on-screen keyboard. I uploaded an image, too.

Data entered into Mattermost for Katalon

I look at what Katalon is recording. It seems as though the recording process is not setting things up for Katalon to type data into input fields character by character, as a human would. It looks like the product creates a new “Set Text” step each time a character is added or deleted. That’s conjecture, but close examination of the image here suggests that that’s possible.

Katalon records a "Set Text" step for every keystroke.

Two things: First, despite what vendors claim, “test automation” tools don’t do things just the way humans do. They simulate user behaviours, and the simulation can make the product under test behave in ways dramatically different from real circumstances.

Second, my impression is that Katalon’s approach to recording and displaying the input would make editing a long sequence of actions something of a nightmare. Further investigation is warranted.
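
To illustrate the general point in Selenium terms (since I can only conjecture about Katalon’s internals; the page and element id below are hypothetical): setting a field’s value directly is not the same thing as typing into it, because no key or input events reach the application.

```python
# Two ways a tool can "enter text"; the product may behave differently.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.test/chat")        # hypothetical page
field = driver.find_element(By.ID, "message")  # hypothetical element id

# Simulated typing: fires key and input events, roughly like a human.
field.send_keys("Hello, world")

# Direct assignment: the text appears, but listeners for key events,
# validation, autosave, or character counters may never run.
driver.execute_script("arguments[0].value = 'Hello, world';", field)
```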

Upon ending the recording activity, I was presented with the instruction “First, save the captured elements that using in the script.” (Bug 6.)

Katalon says, "First, save the captured elements that using in the script."

This is a cosmetic and inconsequential problem in terms of operating the product, of course. It’s troubling, though, because it is inconsistent with an image that the company would probably prefer to project. What feeling do you get when a product from a testing tools company shows signs of missing obvious, shallow bugs right out of the box? For me, the feeling is suspicion; I worry that the company might be missing deeper problems too.

This also reminds me of a key problem with automated checking: while it accelerates the pressing of keys, it also intensifies our capacity to miss things that happen right in front of our eyes… because machinery doesn’t have eyes at all.

There’s a common trope about human testers being slow and error prone. Machinery is fast at mashing virtual keys on virtual keyboards. It’s infinitely slow at recognizing problems it hasn’t been programmed to recognize. It’s infinitely slow at describing problems unless it has been programmed to do so.

Machinery doesn’t eliminate human error; it reduces our tendency towards some kinds of error and increases our tendency towards other kinds.

Upon saving the script, the product presents an error dialog with no content. Then the product hangs with everything disabled, including the OK button on the error dialog. (Bug 7.) The only onscreen control still available is the Close button.

On attempting to save the script, Katalon crashes with an empty error dialog.

After clicking the Close button and restarting the product, I find that all of my data has been lost. (Bug 8.)

Pause: my feelings and intuition are already suggesting that the recorder part of the product, at least, is unstable. I’ve not been pushing it very hard, nor for very long, but I’ve seen several bugs and one crash. I’ve lost the script that was supposedly being recorded.

In good testing, we think critically about our feelings, but we must take them seriously. In order to do that, we follow up on them.

Perhaps the product simply can’t handle something about the way Mattermost processes input. I have no reason to believe that Mattermost is exceptional. To test that idea, I try a very simple program, UI-wise: the Pattern exercise from our Rapid Software Testing class.

The Pattern program is a little puzzle implemented as a very simple Web page. The challenge for the tester is to determine and describe patterns of text strings that match a pattern encoded in the program. The user types input into a textbox element, and then presses Enter or clicks on the Submit button. The back end determines whether the input matches the pattern, and returns a result; then the front end logs the outcome.

I type three strings into the field, each followed by the Enter key. As the video here shows, the application receives the input and displays it. Then I type one more string into the field, and click on the submit button. Katalon Recorder fails to record all three of the strings that were submitted via the Enter key, losing all of the data! (Bug 9.)

Here’s a video recording of that experience:

The whole premise of a record-and-playback tool is to record user behaviour and play it back. Submitting Web form input via the Enter key is perfectly common and naturalistic user behaviour, and it doesn’t get recorded.

The workaround for this is for the tester to use the mouse to submit input. At that, though, Katalon Recorder will condition testers to interact with the product being tested in a way that does not reflect real-world use.

I saved the “test case”, and then closed Katalon Studio to do some other work. When I returned and tried to reopen the file, Katalon Studio posted a dialog “Cannot open the test case.” (Bug 10.)

On attempting to open a saved "test case", Katalon can't open it.

To zoom that up…

On attempting to open a saved "test case", Katalon can't open it.

No information is provided other than the statement “Cannot open the test case”. Oh well; at least it’s an improvement over Bug 7, in which the error dialog contained nothing at all.

I was interested in troubleshooting the Enter key problem. There is no product-specific Help option under the Help menu. (Bug 11.)

No product-specific help under the Help menu.

Clicking on “Start Page” produces a page in the main client window that offers “Help Center” as a link.

Katalon Start Page.

Clicking on that link doesn’t take me to documentation for this product, though. It takes me to the Katalon Help Center page. In the Help Center, I encounter a page where the search field looks like… white text on an almost-white background. (Bug 12.)

Katalon Support Search: White text on a white background.

In the image, I’ve highlighted the search text (“keystrokes”). If I hadn’t done that, you’d hardly be able to see the text at all. Try reading the text next to the graphic of the stylized “K”.

I bring this sort of problem up in testing classes as the kind of problem that can be missed by checks of functional correctness. People often dismiss it as implausible, but… here it is. (Update, 2021/11/04: I do not observe this behaviour today.)

It takes some exploration to find actual help for the product (https://docs.katalon.com/katalon-studio/docs/overview.html). (Again, Bug 11.)

From there it takes more exploration to find the reference to the WebUI SendKeys function. When I get there, there’s an example, but not of appropriate syntax for sending the Enter key, and no listing of the supported keys and how to specify them. In general, the documentation seems pretty poor. (I have not recorded a specific bug for this.)

This is part of The Secret Life of Automation. Products like Katalon are typically documented to the bare minimum, apparently on the assumption that the user of the product has the same tacit knowledge as the builders of the product. That tacit knowledge may get developed with effort and time, or the tester may simply decide to take workarounds (like preferring button clicks to keyboard actions, or vice versa) so that the checks can be run at all.

These products are touted as “easy to use”—and they often are, if you use them in ways that follow the assumptions of the people who create them. If you deviate from the builders’ preconceptions, though, or if your product isn’t just like the tool vendors’ sample apps, things start to get complicated in a hurry. The demo giveth, and the real world taketh away.

I turned the tool to recording a session of simple actions with CryptPad Kanban (http://cryptpad.fr/kanban). I tried to enter a few kanban cards, and closed the recorder.
Playback stumbled on adding a new card, apparently because the ids for new card elements are dynamically generated.

At this point, Katalon’s “self-healing” functions began to kick in. Those functions were unsuccessful, and the process failed to complete. When I looked at the log output for the process, “self-healing” appeared to consist of retrying an Xpath search for the added card over and over again.

To put it simply, “self-healing” doesn’t self-heal. (See Bug 13.)
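
If “self-healing” amounts to retrying the same XPath over and over, it's worth contrasting that with what a person might do instead: anchor the locator on something stable, such as the card's visible text, rather than a generated id, and wait explicitly for it to appear. Here's a hedged sketch using Selenium; the class name and card text are hypothetical, not CryptPad's actual markup.

```python
# Sketch: locate a freshly added kanban card by its visible text rather than
# by a dynamically generated id. The class name and card text are invented.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://cryptpad.fr/kanban")

wait = WebDriverWait(driver, timeout=10)
card = wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//div[contains(@class, 'kanban-item')][normalize-space()='My new card']")
))
card.click()
```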

The log output for the test case appears in a non-editable, non-copiable window, making it difficult to process and analyze. This is inconsistent with the facility available in the Help / About / System Configuration dialog, which allows copying and saving to a file. (See Bug 14.)

At this point, having spent an hour or so on testing, I stop.

Follow-up Work, November 3

I went to the location of katalon.exe and experimented with the command-line parameters.

As a matter of fact, no parameters to katalon.exe are documented; nor does the product respond to /?, -h, or --help. (See Bug 5.)

On startup the program creates a .metadata\.log file (no filename; just an extension) beneath the data folder. In that .log file I notice a number of messages that don’t look good; three instances of “Could not find element”; warnings for missing NLS messages; a warning to initialize the log4j system properly, and several messages related to key binding conflicts for several key combinations (“Keybinding conflicts occurred. They may interfere with normal accelerator operation.”). This bears further investigation some day.
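
For whenever that day comes, a few lines of scripting are enough to pull the suspicious entries out of that log. The path below is a placeholder; it depends on wherever your -data folder happens to live.

```python
# Sketch: skim Katalon's .metadata\.log for the kinds of messages noted above.
# The data-folder location is a placeholder; adjust it for your own setup.
from pathlib import Path

log_path = Path(r"C:\KatalonData\.metadata\.log")
suspects = ("Could not find element", "Keybinding conflicts", "log4j", "NLS")

for line in log_path.read_text(encoding="utf-8", errors="replace").splitlines():
    if any(s in line for s in suspects):
        print(line)
```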

Bug Summaries

Bug 1: There is no installation program and no instructions on where to put the product upon downloading it. Moreover, the installation guide at https://docs.katalon.com/katalon-studio/docs/getting-started.html#start-katalon-studio does not identify any particular location for the product. Inconsistent with usability, inconsistent with comparable products, inconsistent with installability, inconsistent with acceptable quality.

Bug 2: Product crashes when run from the \Program Files\Katalon folder. This is due to Bug 1.

Bug 3: After the crash in Bug 2, the error dialog neither offers a way to open the file directly nor provides a convenient way to copy the location. Inconsistent with usability.

Bug 4: Undocumented parameter -data to katalon.exe

Bug 5: Command-line help for katalon.exe does not list available command-line parameters.

Bug 6: Sloppy language in the “Creating Your First Script” introductory process: “First, save the captured elements that using in the script.” Inconsistent with image.

Bug 7: Hang upon saving my first script in the tutorial process, including an error dialog with no data whatsoever; only an OK button. Inconsistent with capability, inconsistent with reliability.

Bug 8: Loss of all recorded data for the session after closing the product after Bug 7. Inconsistent with reliability.

Bug 9: Katalon Studio’s recorder fails to record text input if it ends with an Enter key. The application under test accepts the Enter key fine. Inconsistent with purpose, inconsistent with usability.

Bug 10: Having saved a “test case” associated with Bug 9, closing the product, and then returning, Katalon Studio claims that it “cannot open the test case”. Inconsistent with reliability.

Bug 11: There is no product-specific “Help” entry under the main menu’s Help selection. Inconsistent with usability.

Bug 12: Search UI at katalon.com/s=keystrokes displays in white text on an almost-white background. Inconsistent with usability. (Possibly fixed at some point; on 2021/11/04, I did not observe this behaviour.)

Bug 13: “Self-healing” is ineffective, consisting of repeatedly trying the same Xpath-based approach to selecting an element that is not the same as the recorded one. Inconsistent with purpose.

Bug 14: Output data for a test case appears in a non-editable, non-copiable window, making it difficult to process and analyze. Inconsistent with usability, inconsistent with purpose. This is also inconsistent with the Help / About / System Configuration dialog, which allows both copying to the clipboard and saving to a file.

Rapid Software Testing Explored for the Americas happens January 17-20, 2022; register here.

Experience Report: mabl Trainer and runner, and related features

October 18th, 2021

Warning: this is a long post.

Introduction

This is an experience report of attempting to perform a session of sympathetic survey and sanity testing, done in September 2021, with follow-up work October 13-15, 2021. The product being tested is mabl. My self-assigned charter was to perform survey testing of mabl, based on performing a basic task with the product. The task was to automate a simple set of steps, using mabl’s Trainer and test runner mechanism.

I will include some meta-notes about the testing in indented text like this.

The general mission of survey testing is learning about the design, purposes, testability, and possibilities of the product. Survey testing tends to be spontaneous, open, playful, and relatively shallow. It provides a foundation for effective, efficient, deliberative, deep testing later on.

Sanity testing might also be called “smoke testing”, “quick testing”, or “build verification testing”. It’s brief, shallow testing to determine whether the product is fit for deeper testing, or whether it has immediately obvious or dramatic problems.

The idea behind sympathetic testing is not to find bugs, but to exercise a product’s features in a relatively non-challenging way.

Summary

mabl’s Trainer and test runner show significant unreliability for recording and playback of very simple, basic tasks in Mattermost, a popular open-source Slack-like chat system with a Web client. I intended to survey other elements of mabl, but attempting these simple tasks (which took approximately three minutes and forty seconds to perform without mabl) triggered a torrent of bugs that has taken (so far) approximately ten hours to investigate and document to this degree.

There are many other bugs over which I stumbled that are not included in this report; the number of problems that I was encountering in this part of the product overwhelmed my ability to stay focused and organized.

A note on the term “bug”: in the Rapid Software Testing namespace, a bug is anything about the product that threatens its value to some person who matters. A little less formally, a bug is something that bugs someone who matters.

From this perspective, a bug is not necessarily a coding error, nor a “broken” feature. A bug is something that represents a problem for someone. Note also that “bug” is subjective; the mabl people could easily declare that something is not a bug on the implicit assumption that my perception of a bug doesn’t matter to them. However, I get to declare that what I see bugs me.

The bugs that I am reporting here are, in my opinion, serious problems for a testing tool—even one intended for shallow, repetitive, and mostly unhelpful rote checks. Many of the bugs considered alone would destroy mabl’s usefulness to me, and would undermine the quality of my testing work. Yet these bugs are also very shallow; they were apparent in attempts to record, play back, and analyze a simple procedure, with no intention to provide difficult challenges to mabl’s Trainer and runner features.

It is my opinion that mabl itself has not been competently and thoroughly tested against products that would present a challenge to the Trainer or runner features; or if it has, its product management has either ignored or decided not to address the problems that I am reporting here.

I have not yet completed the initial charter of performing a systematic survey of these features. This is because my attempt to do so was completely swamped by the effort required to record the bugs I was finding, and to record the additional bugs that I found while recording and investigating the initial bugs.

From one perspective, this could be seen as a deficiency in my testing. From another (and, I would argue, more reasonable) perspective, the experience that I have had so far would suggest at least two next steps if I were working for a client, depending on my client and my client’s purposes.

One next step might be to revisit the product and map out strategies for deeper testing. Another might be to decide that the survey cannot be completed efficiently right now and is not warranted until these problems are addressed. Of course, since I’m my own client here, I get to decide: I’m preparing a detailed report of bugs found in an attempt at sympathetic testing, and I’ll leave it at that.

mabl claims to “Improve reliability and reduce maintenance with the help of AI powered test automation”. This claim might bear investigation.

What is being used as training sets for the models, and where does the data come from? Is my data being used to train machine learning models for the applications I’m testing? If it’s only mine, is the data set large enough? Or is my data being used to develop ML models for other people’s applications? If “AI” is being used to find bugs or for “self-healing”, how does the “AI” comprehend the difference between “problem” and “no problem”? And is the “AI” component being tested critically and thoroughly? These are matters for another day.

Setup and Platform

On its web site, mabl claims to provide “intelligent test automation for Agile teams”. The company also claims that you can “easily create reliable end-to-end tests that improve application quality without slowing you down.”

The claim about improving application quality is in trouble from the get-go. Neither testing nor tests improve application quality. Testing may reveal aspects of application quality, but until someone does something in response to the test report, application quality stays exactly where it is. As for the ease of creating tests… well, that’s what this report is about.

I registered an account to start a free trial, and downloaded v 1.2.2 of the product. (Between September 19 and October 13, the installer indicated that the product had been updated to version 1.3.5. Some of these notes refer to my first round of testing. Others represent updates as I revisited the product and its site to reproduce the problems I found, and to prepare this report. If the results between the two versions differ, I will point out the differences.)

I ran these tests on a Windows 10 system, using only Chrome as the browser. As of this writing, I am using Chrome v.94.

To play the role of product under test, I chose Mattermost. Mattermost is an open-source online communication tool, similar to Slack, that we use in our Rapid Software Testing classes; it provides both a desktop client and a web-based client. Like Slack, you can be a member of different Mattermost “teams”, so I set up a team with a number of channels specifically for my work with mabl.

Testing Notes

I started the mabl trainer, and chose to begin a Web test, in mabl’s parlance. (A test is more than a series of recorded actions.) mabl launched a browser window that defaulted to 1000 pixels. I navigated to the Mattermost Web client. I entered a couple of lines of ordinary plain text, which was accepted by Mattermost, and which the mabl Trainer appeared to record.

I then entered some emoticons using Mattermost’s pop-up interface; the mabl Trainer appeared to record these, too. I used the Windows on-screen keyboard to enter some more emoticons.

Then I chose a graphic, uploaded it, and provided some accompanying text that appears just above the graphic.

First input attempt with mabl

I’m getting older, and I work far enough away from my monitors that I like my browser windows big, so that I can see what’s going on. Before ending the recording, I experimented a little with resizing the browser window. In my first few runs with v 1.2.2, this caused a major freakout for the trainer, which repeatedly flashed a popup that said “Maximizing Trainer” and looped endlessly until I terminated mabl.

(In version 1.3.5, it was possible to try to maximize the browser window, but the training window stubbornly appeared to the right of the browser, even if I tried to drag it to another screen.) (See Bugs 1 and 2 below.)

I pressed “Close” on the Trainer window, and mabl prompted me to run the test. I chose Local Run, and pressed the “Start 1 run” button at the bottom of the “Ad hoc run” panel.

mabl Start Run dialog

A “Local Run Output” window appeared. mabl launched the browser in a way that covered the “Local Run Output” window; an annoyance. mabl appeared to log into Mattermost successfully. The tool simulated a click on the appropriate team, and landed in that team at the top of the default channel. This is odd, because normally, Mattermost takes me to the end of the default channel. And then… nothing seemed to happen.

Whatever mabl was doing was happening below the level of visible input. (Later investigation shows that mabl, silently and by default, sets the height of the viewport much larger than the height of the browser window and the height of my screen.) (See Bug 3 below.)
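
One quick way to confirm that kind of mismatch from any WebDriver-driven browser (or, adapted, from the developer console) is to compare the operating-system window size with the viewport the page itself reports. A sketch, not mabl's own mechanism:

```python
# Sketch: compare the browser window's size with the viewport the page reports.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")               # placeholder URL

window = driver.get_window_size()               # outer window, in pixels
viewport = driver.execute_script(
    "return {width: window.innerWidth, height: window.innerHeight};"
)
print("window:  ", window)
print("viewport:", viewport)   # a viewport far taller than the window is a red flag
driver.quit()
```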

When I looked at the Mattermost instance in another window, it was apparent that mabl had failed to enter the first line of text that I had entered, even though the step is clearly listed:

mabl Missed Step Detail

Yet the Local run output window suggested that the text had been entered successfully:

mabl Local run output suggests successful text entry

mabl failed to enter most subsequent text entries, too. Upon investigation, it appears that the runner does type the body of the recorded text into the textbox element. After that, though, either the mabl Trainer does not record the Enter key to send the message, or the runner doesn’t simulate the playback of that key.

The consequence is that when it comes time to enter another line of text, mabl simply replaces the contents of the textbox element with the new line of text, and the previous line is lost.

Ending entries in this text element with Ctrl-Enter provides a workaround for this behaviour, but that’s not the normal means for submitting a post in Mattermost. The Enter key on its own should do the trick.

More investigation revealed that this behaviour is the same whether the procedure is run locally or in the cloud. (See Bug 4 below.)

Many record-and-playback tools claim to simulate user behaviour. It is crucial to remember that human users enter data in one way—via mechanisms like keyboards, mice, touch pads, drawing tablets, etc.—and almost all playback tools use different means, in the form of software interfaces. The differences between input mechanisms are often ignored, but they can be significant.

Moreover, different playback tools use different approaches to simulate user input. Often these approaches throw away elements of user behaviours such as backspacing, pasting, copying, or deleting blocks of text, and submit only the edited string for processing. Such simulations will systematically miss problems that happen in real usage.
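
To make the difference concrete, here are two ways a tool might get “hello” into the Mattermost compose box (the post_textbox element mentioned later in this report), sketched with Selenium. The first shoves the final, edited string straight into the DOM; the second sends the keystrokes a person might actually produce, typo, backspaces, and all. This is an illustrative sketch, not a description of how mabl's runner is implemented.

```python
# Sketch: two very different simulations of "the user typed 'hello'".
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://example.com")                 # placeholder Mattermost URL

field = driver.find_element(By.ID, "post_textbox")

# 1. Set the field's value directly. No key events fire; editing behaviour vanishes.
driver.execute_script("arguments[0].value = 'hello';", field)

# 2. Simulate the keystrokes, including a mistake, its correction, and Enter to send.
field.clear()
field.send_keys("helko", Keys.BACK_SPACE, Keys.BACK_SPACE, "lo", Keys.ENTER)

driver.quit()
```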

Upon trying to play back the typing of the emojis, mabl apparently became confused by Mattermost’s emoji popup. The log indicated that the application attempted to locate a specific element three times, then concluded that the element could not be found, whereupon the entire procedure errored out for good. The controls that mabl is seeking according to the recorded steps in the test are plainly accessible via the developer tools. All this seems inconsistent with mabl’s claims of “auto-healing”. (See Bugs 5 and 6 below.)

Local run output with errors

In these screenshots, some timestamps in the logs may appear out of sequence relative to this narrative. Some screenshots you’re seeing here are of repro instances, rather than of what happened the first time through. This is because I was encountering so many bugs while testing that my capacity to record them properly became overwhelmed, and I had to return for analysis later.

The phenomenon of being swamped (or swarmed) by bugs like this is something we call a bug cascade in Rapid Software Testing. In my rough notes for my first run of this session, I observe “I should have been recording a video of all this.” It can be very useful to have a narrated video recording for later review.

I examined the “Local run output” window more closely and observed a number of problems. On this run and others, the listing claims to have entered text successfully when that text never appears in the application under test. Only the first 37 characters of the text entered by the runner appear in the log.

The local log contains time stamps, but not date stamps, and those time stamps are recorded in AM/PM format. Both of these are inconvenient for analysing the log files with tool support. There appears to be no mechanism for saving a file from the Local Run Output window. (See Bugs 7, 8, 9, 10, and 11 below.)

Local run output window
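
The point about AM/PM timestamps is easy to demonstrate: date-less, 12-hour timestamps neither sort nor diff cleanly as text, while ISO-8601 stamps do. A small sketch:

```python
# Sketch: why date-less, 12-hour timestamps are awkward for log-analysis tooling.
from datetime import datetime

twelve_hour = ["11:59:58 PM", "1:15:07 PM", "12:00:02 AM"]
print(sorted(twelve_hour))
# -> ['11:59:58 PM', '12:00:02 AM', '1:15:07 PM']
#    1:15 PM lands last, and without a date, "12:00:02 AM" is ambiguous anyway.

iso = [datetime(2021, 10, 13, 13, 15, 7).isoformat(timespec="seconds"),
       datetime(2021, 10, 13, 23, 59, 58).isoformat(timespec="seconds"),
       datetime(2021, 10, 14, 0, 0, 2).isoformat(timespec="seconds")]
print(sorted(iso))
# -> chronological order, because ISO-8601 strings sort correctly as plain text.
```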

I looked for a means of looking at past runs from the desktop in various places in the mabl desktop client. I could not find one. (See Bug 12 below.)

Using Search Everything (a very useful tool that affords more or less instantaneous search for all files on the system) I also looked for log files associated with individual test runs. I could not find any.

Search Everything quickly helped me to find mabl’s own application log (mablApp.log), and this did contain some data about the steps. Oddly, the runner data in mablApp.log is formatted in a more useful way than in the “Local run output” window. Local runs did not seem to record screen shots, either.

This was all pretty confusing at first, but later research revealed this: “Note that test artifacts – such as DOM snapshots or screenshots – are not captured from local runs. The mabl execution logs and final result are displayed in a separate window.” (https://help.mabl.com/docs/local-runs-in-the-desktop-app) That might be bad enough—but no run-specific local logs at all?

In order to try to troubleshoot the problems that I was experiencing with entering text, I looked at the Tests listings, and chose My New Test. This took me to an untitled window that lists the steps for the test (I will call this the “Test Steps View”).

Scanning the list of steps, I observed the second step was “Visit URL assigned to variable ‘app.url’.” This is factually correct, but unhelpful; how does one find the value of the variable? There is no indication of what that URL might be or how to find it conveniently. Indeed, the screen suggests that there are “no data-driven variables”—which seems false.

(Later investigation revealed that if I chose Edit Steps, then chose Quick Edit, then chose the URL to train against, and then chose the step “Visit URL assigned to variable ‘app.url'”, I could see a preview of the value. How about a display of the variable in the steps listing? A tooltip?) (See Bug 13 below.)

I examined the step that appeared to be failing to enter text. The text that I originally typed into the Mattermost window was not displayed in full, even though there’s plenty of space available for it in the window for the Test Steps View. (See Bug 14 below.) This behaviour is inconsistent with my ability to explain it, and it’s inconsistent with an implicit purpose of the Test Steps View (the ability to troubleshoot test steps easily). However, it is consistent with the display in the “Local run output” window, and with the logging in mabl’s system log. (See Bug 15 below.)

As I noted above, further experimentation with Mattermost and with the mabl trainer showed that ending the input with Ctrl-Enter (rather than Mattermost’s default Enter) while recording allowed mabl to play back the text entry. So, perhaps if I could edit the text entry step somehow, or if I could add a keystroke step, there would be a workaround for this problem if I’m willing to accept the risk that behaviours developed with the Trainer are inconsistent with the actual behaviours of the user.

In the Test Steps View, there is an “Edit Steps” dropdown, with the options “Quick Edit” and “Launch Trainer”. I clicked on Quick Edit, and a pop-up appeared immediately, confusingly, and probably unnecessarily: “Launching Trainer”.

mabl's “Launching Trainer” popup

I selected the text entry step, hoping to edit some of the actions within it. Of the items that appear in the image below, note that only the input text can be edited; no other aspect of the step can be. (See Bug 16 below.)

mabl's Quick Edit window

It’s possible to send keypresses to a specific element. That capability has supposedly been available in mabl for a long time as claimed here. Could I add an escape sequence to the text by which I could enter a specific key or key combination? If such a feature is available, it’s not documented clearly. The documentation hints that certain keys might have escape strings—”[TAB]”, or “[ENTER]”. However, adding those strings to the end of the text doesn’t make the virtual keypress happen. (See Bug 17 below.)

The Quick Edit window offers the opportunity to insert a step. What if I try that? I scroll down to the step that enters the text, attempt to select that step with the mouse, and press the plus key at the bottom to insert the step. A dialog appears that offers a set of possible steps. Neither entering text, nor entering keystrokes, nor clicking the mouse appears on this list. (See Bug 18 below.)

mabl's options for inserting steps

(For those wondering if input features appear beneath the visible window, “Variables” is the last item in the list.)

When I look at Step 5 in the Test Steps View, I see that there’s a step that sends a Tab keypress to the “Email or Username” text field. Maybe I could duplicate that step, and drag it down to the point after my text entry. Then maybe I could modify the step to point to the post_textbox element, and to send the Enter key instead of the Tab key.

Yes, I can change [TAB] to [ENTER]. But I can’t change the destination element. (See Bug 19 below.)

mabl's facility for sending a keypress

Documenting this is difficult and frustrating. Each means of trying to send that damned Enter key is thwarted in some exasperating and inexplicable way. For those of you of a certain age, it’s like the Cone of Silence from Get Smart (the younger folks can look it up on the Web). I’m astonished by the incapability of the product, and because of that I second-guess myself and repeat my actions over and over again to make absolutely sure I’m not missing some obvious way to accomplish my goal. The strength of my feelings at this time is a pointer to the significance of the problems I’m encountering.

I looked at some of the other steps displayed in the Test Steps View and in the Trainer. Note that steps are often described as “Click on a button” without displaying which button is being referred to, unless that button has an explicit text label. This is annoying, since human-readable information (like the aria-label attribute) is available, but I had to click on “edit” to see it. (See Bug 20 below.)

Vague "Click on a button" step descriptions

Scanning the rest of the Test Steps View, I noticed an option to download a comma-separated-value (.CSV) file; perhaps that can be viewed and edited, and then uploaded somehow. I downloaded a .CSV and looked at it. It is consistent with what is displayed in the Test Steps View, but it does not accurately reflect the behaviour that mabl is trying to perform.

Once again, the text that mabl actually tries to enter in a text field (which can be observed if you scroll to the bottom of the browser window in the middle of the test) is elided, limited to 37 characters plus an ellipsis. (See Bug 21 below.)

This would be a more serious problem if I tried to edit the script and upload it. However, no worries there, because even though you can download a .CSV file of test steps, you can’t upload one. There’s nothing in the product’s UI, and a search of the Help file for “upload” revealed no means for uploading test step files. (See Bug 22 below.)

At this point, I began to give up on entering text naturalistically and reliably using the Trainer. I wondered if there was anything that I could rescue from my original task of entering text, throwing in some emojis, and uploading a file. I edited out the test steps over which mabl stumbled in order to get to the file upload. The test proceeded, but the file didn’t get uploaded. Perhaps this is because the Trainer doesn’t keep track of where the file came from on the local system. (See Bug 23 below.)

At this point, my energy for continuing with this report is flagging. Investigating and reporting bugs takes time, and when there are this many problems, it’s disheartening and exhausting. I’m also worried about this getting boring to read. This post probably still has several typos. I have left many bugs that I encountered out of the narrative here, but a handful of them appear below. I left many more undocumented and uninvestigated. (See Bugs 24, 25, 26, and 27 below.)

There is much more to mabl. Those aspects may be wonderful, or terrible. I don’t know, because I have not examined them in any detail, but I have a strong suspicion of lots of further trouble. Here’s an example:

In my initial round of testing in September, I created a plan—essentially a suite of recorded procedures and tasks that mabl offers. That plan included crawling the entire Mattermost instance for broken links. mabl’s summary report indicated that everything had passed, and that there were no broken links. “Everything looks good!”

mabl claims everything looks good.

I scrolled down a bit, though, and looked at the individual items below. There I saw “Found 3 broken links” on the left, and the details on the right.

While mabl claims everything looks good, broken links.

In the October 13-15 test activity, I set up a task for mabl to crawl my blog looking for broken links. Thanks to various forms of bitrot (links that have moved or otherwise become obsolete, commenters whose web sites have gone defunct, etc.), there are lots of broken links. mabl reports that everything passed.

mabl results table suggesting everything passed

This looks fine until you look at the details. mabl identified 586 broken links (many of them are duplicates)… and yet the summary says “Visit all linked pages within the app” passed. (See Bug 28 below.)

Visit all linked pages within the app details
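
For contrast, here is roughly what a minimal link check with an honest verdict looks like: if any link comes back broken, the run fails. This is only a sketch with a placeholder starting page; a real crawler would also need deduplication across pages, relative-link handling, and politeness.

```python
# Sketch: a minimal broken-link check whose verdict reflects its findings.
import re
import sys
import requests

start_url = "https://example.com/blog/"           # placeholder starting page

html = requests.get(start_url, timeout=15).text
links = set(re.findall(r'href="(https?://[^"#]+)"', html))

broken = []
for link in sorted(links):
    try:
        status = requests.head(link, allow_redirects=True, timeout=10).status_code
    except requests.RequestException:
        status = None
    if status is None or status >= 400:
        broken.append((link, status))

for link, status in broken:
    print(f"BROKEN ({status}): {link}")

sys.exit(1 if broken else 0)   # the summary should not say "passed" when links are broken
```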

Epilogue

During my first round of testing in September, I contacted mabl support via chat, and informed the representative that I was encountering problems with the product while preparing a talk. The representative on the chat promised to have someone contact me about the problems. The next day, I received this email:

Email reply sent by mabl Customer Support. Helpful advice:  RTFM.

Let me zoom that up for you:

Email reply sent by mabl Customer Support, zoomed up.  Helpful advice:  RTFM.

And that, so it seems, is what passes for a response: RTFM.

Bug Summaries

Bug 1: Resizing the browser while training resulted in an endless loop that hangs the product. (Observed several times in 1.2.2; not observed so far in 1.3.5.)

Bug 2: The browser cannot be resized to the full size of the screen on which the training is happening; and at the same time, the trainer window cannot be repositioned onto another screen. (This was happening in 1.2.2 when resizing didn’t result in the endless loop above; it still happens in 1.3.5.) This is inconsistent with usability and inconsistent with comparable products; if the product is intended to replicate the experience of a user, it’s also inconsistent with purpose.

Bug 3: The default behaviour of the application running in the browser is different from the naturalistic encounter with the product and, as such, in this case, rendered input activity invisible unless I actively scrolled the browser window using the cursor keys, and until I figured out where the browser height was set. Inconsistent with usability for a test tool at first encounter; inconsistent with charisma.

Bug 4: mabl’s playback function doesn’t play back simple text entry into a Mattermost instance, but the logging claims that the text was entered correctly. This happens irrespective of whether the procedure is run from the cloud or from the local machine. This is inconsistent with comparable products; inconsistent with purpose; inconsistent with basic capabilities for a product of this nature; and inconsistent with claims (https://help.mabl.com/changelog/initial-keypress-support-in-the-mabl-trainer).

Bug 5: mabl seems unable to locate emojis in Mattermost’s emoji popup—something that a human tester would have no problem with—even though the Trainer supposedly captured the action. (Inconsistency with purpose.)

Bug 6: Auto-healing fail with respect to trying to locate buttons in the Mattermost emoji picker. (Inconsistency with claims.)

Bug 7: The “Local run output” window falsely suggests that attempts to enter text are successful when the text entry has not completed. (Inconsistent with basic functionality; inconsistent with purpose.)

Bug 8: The “Local run output” window does not record the actual text that was entered by the runner. Only the first 37 characters of the entry, followed by an ellipsis (“…”) are displayed. (Inconsistent with usability for a test tool.)

Bug 9: Date stamps are absent in the logging information displayed in the “Local run output” window. Only time stamps appear, and at that only precise down to the second. This is an inconvenience for analyzing logged results over several days. (Inconsistent with usability for testing purposes; also inconsistent with product (mabl’s own application log).)

Bug 10: Time stamps in the “Local run output” window are rendered in AM/PM format, which makes sorting and searching via machinery less convenient. (Inconsistent with testability; also poor internationalization; and also inconsistent with mabl’s own application log.)

Bug 11: Cannot save data to a file directly from the “Local run output” window. (Inconsistent with purpose; inconsistent with usability; risk of data loss.) Workarounds: copying data from the log and pasting it into the user’s own record; spelunking through mabl’s mablApp.log file.

Bug 12: Local run log output does not appear in mabl’s GUI, neither under Results nor under the Run History tab for individual tests. If there is a facility for that available from the GUI, it’s very well hidden. (Inconsistent with usability for a record/playback testing tool.) Workaround: there is some data available in the general application log for the product, but it would require effort to be disentangled from the rest of the log entries.

Bug 13: The test steps editing window makes it harder than necessary to view the content of variables that will be used for the test procedure. For instance, the user must choose Edit Steps, then choose Quick Edit, then choose the URL to train against, and then choose the step “Visit URL assigned to variable ‘app.url’”.

Bug 14: The main test editor window hides the content of text entry strings longer than about 40 characters. Since there is ample empty whitespace to the right, it is unclear why longer strings of text aren’t displayed. Inconsistent with explainability, inconsistent with purpose (the ability to troubleshoot test steps easily).

Bug 15: mabl’s application log (mablApp.log) limits the total length of the typed string to 40 characters (37 characters, plus an ellipsis (…)). (Is the Local Output Log generated from the mablApp.log?)

Bug 16: In a step to enter text in Quick Edit mode, only the input text can be edited; no other aspect of the step (neither the target nor the action) can be edited.

Bug 17: Escape sequences to send specific keys (e.g. Tab, Enter) are not supported by mabl’s Quick Edit step editor. Inconsistent with comparable products, inconsistent with purpose.

Bug 18: The “Insert Steps” option in the Quick Edit dialog does not offer options for entering text, sending keys, or clicking on elements. Inconsistent with purpose; inconsistent with comparable products.

Bug 19: The “Send keypress” dialog allows changing the key to be sent, or to add modifier keys, but doesn’t allow changing the element to which the key is sent.

Bug 20: The trainer window fails to identify which button is to be clicked in a step unless the button has a text label. Some useful information (e.g. the Aria Label or class ID) to identify the button is available if you enter the step and try to edit it. (Inconsistent with product; inconsistent with purpose)

Bug 21: The .CSV file identifying the steps for a test does not reflect the actual steps performed. (Inconsistent with product; inconsistent with the purpose of trying to see the actual steps in the procedure.) Workaround: going into each step in the Quick Edit or Trainer views displays the entire text, but for long procedures with strings longer than 40 characters, this could be very expensive in terms of time.

Bug 22: You can’t upload a CSV of test steps at all. Editing test steps depends on mabl’s highly limited Trainer or Quick Edit facilities—and Quick Edit depends on the Trainer. The purpose of downloaded CSV step files is unclear.

Bug 23: A file upload recorded through the Trainer / Runner mechanism never happens.

Bug 24: The Help/Get Logs for Support option isn’t set by default to go to the folder where mabl’s logs are stored. Instead, it opens up a normal File/Open window (in my case defaulting to the Downloads folder, perhaps because this is the most recent location where I opened my browser, or…)

Bug 25: The mabl menu’s View / Zoom In function claims to be mapped to Ctrl-+. It isn’t. Zoom Out (Ctrl-minus) and Actual Size (Ctrl-0) work.

Bug 26: I noticed on October 17 that an update was available. There is no indication that release notes are available or what has changed. When I do a Web search for mabl release notes, such release notes as exist don’t refer to version numbers!

Bug 27: The mabl Trainer window doesn’t have controls typically found in the upper right of a Windows dialog, which makes resizing the window difficult and makes minimizing it impossible. (Inconsistent with comparable products; inconsistent with UI standards.)

Bug 28: mabl’s Results table falsely suggests that a check for broken links “passed”, when hundreds of broken links were found. (Inconsistent with comparable products; inconsistent with UI standards.)

I thank early readers Djuka Selendic, Jon Beller, and Linda Paustian for spotting problems in this post and bringing them to my attention. Testers help other people look good!

You may also like to peruse the next item in this series, Experience Report: Katalon Studio.

Rapid Software Testing Explored for the Americas happens January 17-20, 2022; register here.

To Go Deep, Start Shallow

October 13th, 2021

Here are two questions that testers ask me pretty frequently:

How can I show management the value of testing?
How can I get more time to test?

Let’s start with the second question first. Do you feel overwhelmed by the product space you’ve been assigned to cover relative to the time you’ve been given? Are you concerned that you won’t have enough time to find problems that matter?

As testers, it’s our job to help to shine light on business risk. Some business risk is driven by problems that we’ve discovered in the product—problems that could lead to disappointed users, bad reviews, support costs… More business risk comes from deeper problems that we haven’t discovered yet, because our testing hasn’t covered the product sufficiently to reveal those problems.

All too often, managers allocate time and resources for testing based on limited, vague, and overly optimistic ideas about risk. So here’s one way to bring those risk ideas to light, and to make them more vivid.

  • Start by surveying the product and creating a product coverage outline that identifies what is there to be tested, where you’ve looked for problems so far, and where you could look more deeply for them. If you’ve already started testing, that’s okay; you can start your product coverage outline now.
  • As you go, develop a risk list based on bugs (known problems that threaten the value of the product), product risks (potential deeper, unknown problems in the product in areas that have not yet been covered by testing), and issues (problems that threaten the value of the testing work). Connect these to potential consequences for the business. Again, if you’re not already maintaining a risk list, you can start now.
  • And as you go, try performing some quick testing to find shallow bugs.

By “quick testing”, I mean performing fast, inexpensive tests that take little time to prepare and little effort to perform. As such, small bursts of quick testing can be done spontaneously, even when you’re in the middle of a more deliberative testing process. Fast, inexpensive testing of this nature often reveals shallow, easy-to-find bugs.

In general, in a quick test, we rapidly encounter some aspect of the product, and then apply fast and easy oracles. Here are just a few examples of quick testing heuristics. I’ve given some of them deliberately goofy and informal names. Feel free to rename them, and to create your own list.

Blink. Load the same page in two browsers and switch quickly between them. Notice any significant differences?
Instant Stress. Overload a field with an overwhelming amount of data (consider PerlClip, BugMagnet or similar lightweight tools; or just use a text editor to create a huge string by copying and pasting); then try to save or complete the transaction. What happens? (A sketch for generating such a string appears just after this list.)
Pathological Data. Provide data to a field that should trigger input filtering (reserved HTML characters, emojis…). Is the input handled appropriately?
Click Frenzy. Click in the same (or different) places rapidly and relentlessly. Any strange behaviours? Processing problems (especially at the back end)?
Screen Survey. Pause whatever you’re doing for a moment and look over the screen; see anything obviously inconsistent?
Flood the Field. Try filling each field to its limits. Is all the data visible? What were the actual limits? Is the team okay with them—or surprised to hear about them? What happens when you save the file or commit the transaction?
Empty Input. Leave “mandatory” fields empty. Is an error message triggered? Is the error message reasonable?
Ooops. Make a deliberate mistake, move on a couple of steps, and then try to correct it. Does the system allow you to correct your “mistake” appropriately, or does the mistake get baked in?
Pull Out the Rug. Start a process, and interrupt or undermine it somehow. Close the laptop lid; close the browser session; turn off wi-fi. If the process doesn’t complete, does the system recover gracefully?
Tug-of-War. Try grabbing two resources at the same time when one should be locked. Does a change in one instance affect the other?
Documentation Dip. Quickly open the spec or user manual or API documentation. Are there inconsistencies between the artifact and the product?
One Shot Stop. Try an idempotent action—doing something twice that should effect a change the first time, but not subsequent times, like upgrading an account status to the top tier and then trying to upgrade it again. Did a change happen the second time?
Zoom-Zoom. Grow or shrink the browser window (remembering that some people don’t see too well, and others want to see more). Does anything disappear?
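
For the Instant Stress item above: if PerlClip or BugMagnet isn't handy, a few lines can generate a big, self-describing string in which each ten-character chunk ends at the offset shown by its number, so a truncated field reveals roughly how many characters survived. A rough sketch; this is not PerlClip's counterstring algorithm.

```python
# Sketch: generate a large, position-marked string for the "Instant Stress" quick test.
def marker_string(total_chars: int, chunk: int = 10) -> str:
    parts = []
    for end in range(chunk, total_chars + 1, chunk):
        parts.append(f"{end:0{chunk - 1}d}*")   # e.g. "000000010*", "000000020*", ...
    return "".join(parts)

if __name__ == "__main__":
    s = marker_string(100_000)
    print(len(s))       # 100000
    print(s[:30])       # 000000010*000000020*000000030*
```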

It might be tempting for some people to dismiss shallow bugs. “That’s an edge case.” “No user will do that.” “That’s not the right way to use the product.” “The users should read the manual.” Sometimes those things might even be true. Dismissing shallow bugs too casually, without investigation, could be a mistake, though.

Quick, shallow testing is like panning for gold: you probably won’t make much from the flakes and tiny nuggets on their own, but if you do some searching farther upstream, you might hit the mother lode. That is: shallow bugs should prompt at least some suspicion about the risk of deeper, more systemic problems and failure patterns about the product. In the coverage outline and risk list you’re developing, highlight areas where you’ve encountered those shallow bugs. Make these part of your ongoing testing story.

Now: you might think you don’t have time for quick testing, or to investigate those little problems that lead you to big problems. “Management wants me to finish running through all these test cases!” “Management wants me to turn these test cases into automated checks!” “Management needs me to fix all these automated checks that got out of sync with the product when it changed!”

If those are your assignments from management, you may feel like your testing work is being micromanaged, but is it? Consider this: if managers were really scrutinizing your work carefully, there’s a good chance that they would be horrified at the time you’re spending on paperwork, or on fighting with your test tools, or on trying to teach a machine to recognise buttons on a screen, only to push them repeatedly to demonstrate that something can work. And they’d probably be alarmed at how easily problems can get past these approaches, and they’d be surprised at the volume of bugs you’re finding without them—especially if you’re not reporting how you’re really finding the bugs.

Because managers are probably not observing you every minute of every day, you may have more opportunity for quick tests than you think, thanks to disposable time.

Disposable time, in the Rapid Software Testing namespace, is our term for time that you can afford to waste without getting into trouble; time when management isn’t actually watching what you’re doing; moments of activity that can be invested to return big rewards. Here’s a blog post on disposable time.

You almost certainly have some disposable time available to you, yet you might be leery about using it.

For instance, maybe you’re worried about getting into trouble for “not finishing the test cases”. It’s a good idea to cover the product with testing, of course, but structuring testing around “test cases” might be an unhelpful way to frame testing work, and “finishing the test cases” might be a kind of goal displacement, when the goal is finding bugs that matter.

Maybe your management is insisting that you create automated GUI checks, a policy arguably made worse by intractable “codeless” GUI automation tools that are riddled with limitations and bugs. This is not to say that automated checking is a bad thing. On the contrary; it’s a pretty reasonable idea for developers to automate low-level output checks that give them fast feedback about undesired changes. It might also be a really good idea for testers to exercise the product using APIs or scriptable interfaces for testing. But why should testers be recapitulating developers’ lower-level checks while pointing machinery at the machine-unfriendly GUI? As my colleague James Bach says, “When it comes to technical debt, GUI automation is a vicious loan shark.”
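
As an example of the kind of low-level check that's cheap to automate, here's a sketch of an API-level assertion against a hypothetical endpoint; nothing in it needs to recognize a button on a screen. The URL, payload, and response shape are invented for illustration.

```python
# Sketch: a fast, low-level output check against a hypothetical API endpoint.
import requests

def test_new_post_is_echoed_back():
    payload = {"channel": "town-square", "message": "hello, world"}
    response = requests.post("https://example.test/api/posts", json=payload, timeout=10)
    assert response.status_code == 201
    assert response.json()["message"] == "hello, world"
```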

If you feel compelled to focus on those assignments, consider taking a moment or two, every now and again, to perform a quick test like the ones above. Even if your testing is less constrained and you’re doing deliberative testing that you find valuable, it’s worthwhile to defocus on occasion and try a quick test. If you don’t find a bug, oh well. There’s still a good chance that you’ll have learned a little something about the product.

If you do find a bug and you only have a couple of free moments, at least note it quickly. If you have a little more time, try investigating it, or looking for a similar bug nearby. If you have larger chunks of disposable time, consider creating little tools that help you to probe the product; writing a quick script to generate interesting data; popping open a log file and scanning it briefly. All focus and no defocus makes Jack—or Jill—a dull tester.

Remember: almost always, the overarching goal of testing is to evaluate the product by learning about it, with a special focus on finding problems that matter to developers, managers, and customers. How do we get time to do that in the most efficient way we can? Quick, shallow tests can provide us with some hints on where to suspect risk. Once found, those problems themselves can help to provide evidence that more time for deep testing might be warranted.

Several years ago, I was listening while James Bach was teaching a testing workshop. “If I find enough important problems quickly enough,” he said, “the managers and developers will be all tied up in arguing about how to fix them before the ship date. They’ll be too busy to micromanage me; they’ll leave me alone.”

You can achieve substantial freedom to captain your own ship of testing work when you consistently bring home the gold to developers and managers. The gold, for testers, is awareness and evidence of problems that make managers say “Ugh… but thank heavens that the tester found that problem before we shipped.”

If you’re using a small fraction of your time to find problems and explore more valuable approaches to finding them, no one will notice on those rare occasions when you’re not successful. But if you are successful, by definition you’ll be accomplishing something valuable or impressive. Discovering shallow bugs, treating them as clues that point us towards deeper problems, finding those, and then reporting responsibly can show how productive spontaneous bursts of experimentation can be. The risks you expose can earn you more time and freedom to do deeper, more valuable testing.

Which brings us back to the first question, way above: “How can I show management the value of testing?”

Even a highly disciplined and well-coordinated development effort will result in some bugs. If you’re finding bugs that matter—hidden, rare, elusive, emergent, surprising, important, bone-chilling problems that have got past the disciplined review and testing that you, the designers and the developers have done already—then you won’t need to do much convincing. Your bug reports and risk lists will do the convincing for you. Rapid study of the product space builds your mental models and points to areas for deeper examination. Quick, cheap little experiments help you to learn the product, and to find problems that point to deeper problems. Finding those subtle, startling, deep problems starts with shallow testing that gets deeper over time.


Rapid Software Testing Explored for Europe and points east runs November 22-25, 2021. A session for daytime in the Americas and evenings in Europe runs January 17-20, 2022.

Alternatives to “Manual Testing”: Experiential, Attended, Exploratory

August 24th, 2021

This is an extension of a long Twitter thread from a while back that made its way to LinkedIn, but not to my blog.

No one ever sits in front of a computer and accidentally compiles a working program, so people know — intuitively and correctly — that programming must be hard. But almost anyone can sit in front of a computer and stumble over bugs, so people believe — intuitively and incorrectly — that testing must be easy!

Testers who take testing seriously have a problem with getting people to understand testing work.

The problem is a special case of the insider/outsider problem that surrounds any aspect of human experience: most of the time, those on the outside of a social group—a community; a culture; a group of people with certain expertise; a country; a fan club—don’t understand the insider’s perspective. The insiders don’t understand the outsiders’ perspective either.

We don’t know what we don’t know. That should be obvious, of course, but when we don’t know something, we have no idea of how little we comprehend it, and our experience and our lack of experience can lead us astray. “Driving is easy! You just put the car in gear and off you go!” That probably works really well in whatever your current context happens to be. Now I invite you to get behind the wheel in Bangalore.

How does this relate to testing? Here’s how:

No one ever sits in front of a computer and accidentally compiles a working program, so people know—intuitively and correctly—that programming must be hard.

By contrast, almost anyone can sit in front of a computer and stumble over bugs, so people believe—intuitively and incorrectly—that testing must be easy!

In our world of software development, there is a kind of fantasy that if everyone is of good will, and if everyone tries really, really hard, then everything will turn out all right. If we believe that fantasy, we don’t need to look for deep, hidden, rare, subtle, intermittent, emergent problems; people’s virtue will magically make them impossible. That is, to put it mildly, a very optimistic approach to risk. It’s okay for products that don’t matter much. But if our products matter, it behooves us to look for problems. And to find deep problems intentionally, it helps a lot to have skilled testers.

Yet the role of the tester is not always welcome. The trouble is that to produce a novel, complex product, you need an enormous amount of optimism; a can-do attitude. But as my friend Fiona Charles once said to me—paraphrasing Tom DeMarco and Tim Lister—”in a can-do environment, risk management is criminalized.” I’d go further: in a can-do environment, even risk acknowledgement is criminalized.

In Waltzing With Bears, DeMarco and Lister say “The direct result of can-do is to put a damper on any kind of analysis that suggests ‘can’t-do’…When you put a structure of risk management in place, you authorize people to think negatively, at least part of the time. Companies that do this understand that negative thinking is the only way to avoid being blindsided by risk as the project proceeds.”

Risk denial plays out in a terrific documentary, General Magic, about a development shop of the same name. In the early 1990s(!!), General Magic was working on a device that — in terms of capability, design, and ambition — was virtually indistinguishable from the iPhone that was released about 15 years later.

The documentary is well worth watching. In one segment, Marc Porat, the project’s leader, talks in retrospect about why General Magic flamed out without ever getting anywhere near the launchpad. He says, “There was a fearlessness and a sense of correctness; no questioning of ‘Could I be wrong?’. None. … that’s what you need to break out of Earth’s gravity. You need an enormous amount of momentum … that comes from suppressing introspection about the possibility of failure.”

That line of thinking persists all over software development, to this day. As a craft, the software development business systematically resists thinking critically about problems and risk. Alas for testers, that’s the domain that we inhabit.

Developers have great skill, expertise, and tacit knowledge in linking the world of people and the world of machines. What they tend not to have—and almost everyone is like this, not just programmers—is an inclination to find problems. The developer is interested in making people’s troubles go away. Testers have the socially challenging job of finding and reporting on trouble wherever they look. Unlike anyone else on the project, testers focus on revealing problems that are unsolved, or problems introduced by our proposed solution. That’s a focus which the builders, by nature, tend to resist.

Resistance to thinking about problems plays out in many unhelpful and false ideas. Some people believe that the only kind of bug is a coding error. Some think that the only thing that matters is meeting the builders’ intentions for the product. Some are sure that we can find all the important problems in a product by writing mechanistic checks of the build. Those ideas reflect the natural biases of the builder—the optimist. Those ideas make it possible to imagine that testing can be automated.

The false and unhelpful idea that testing can be automated prompts the division of testing into “manual testing” and “automated testing”.

Listen: no other aspect of software development (or indeed of any human social, cognitive, intellectual, critical, analytical, or investigative work) is divided that way. There are no “manual programmers”. There is no “automated research”. Managers don’t manage projects manually, and there is no “automated management”. Doctors may use very powerful and sophisticated tools, but there are no “automated doctors”, nor are there “manual doctors”, and no doctor would accept for one minute being categorized that way.

Testing cannot be automated. Period. Certain tasks within and around testing can benefit a lot from tools, but having machinery punch virtual keys and compare product output to specified output is no more “automated testing” than spell-checking is “automated editing”. Enough of all that, please.

It’s unhelpful to lump all non-mechanistic tasks in testing together under “manual testing”. Doing so is like referring to craft, social, cultural, aesthetic, chemical, nutritional, or economic aspects of cooking as “manual” cooking. No one who provides food with care and concern for human beings—or even for animals—would suggest that all that matters in cooking is the food processors and the microwave ovens and the blenders. Please.

If you care about understanding the status of your product, you’ll probably care about testing it. You’ll want testing to find out if the product you’ve got is the product you want. If you care about that, you need to understand some important things about testing.

If you want to understand important things about testing, you’ll want to consider some things that commonly get swept under a carpet with the words “manual testing” repeatedly printed on it. Considering those things might require naming some aspects of testing that you haven’t named before.

Think about experiential testing, in which the tester’s encounter with the product, and the actions that the tester performs, are indistinguishable from those of the contemplated user. After all, a product is not just its code, and not just virtual objects on a screen. A software product is the experience that we provide for people, as those people try to accomplish a task, fulfill a desire, enjoy a game, make money, converse with people, obtain a mortgage, learn new things, get out of prison…

Contrast experiential testing with instrumented testing. Instrumented testing is testing wherein some medium (some tool, technology, or mechanism) gets in between the tester and the naturalistic encounter with and experience of the product. Instrumentation alters, or accelerates, or reframes, or distorts; in some ways helpfully, in other ways less so. We must remain aware of the effects, both desirable and undesirable, that instrumentation brings to our testing.

Are you saying “manual testing”? You might be referring to the attended or engaged aspects of testing, wherein the tester is directly and immediately observing and analyzing aspects of the product and its behaviour in the moment that the behaviour happens. And you might want to contrast that with the algorithmic, unattended things that machines do—things that some people label “automated testing”—except that testing cannot be automated. To make something a test requires the design before the automated behaviour, and the interpretation afterwards. Those parts of the test, which depend upon human social competence to make a judgement, cannot be automated.

Are you saying “manual”? You might be referring to testing activity that’s transformative, wherein something about performing the test changes the tester in some sense, inducing epiphanies or learning or design ideas. Contrast that with procedures that are transactional: rote, routine, box-checking. Transactional things can be done mechanically. Machines aren’t really affected by what happens, and they don’t learn in any meaningful sense. Humans do.

Did you say “manual”? You might be referring to exploratory work, which is interestingly distinct from experiential work as described above. Exploratory—in the Rapid Software Testing namespace at least—refers to agency; who or what is in charge of making choices about the testing, from moment to moment. There’s much more to read about that.

Wait… how are experiential and exploratory testing not the same?

You could be exploring—making unscripted choices—in a way entirely unlike the user’s normal encounter with the product. You could be generating mounds of data and interacting with the product to stress it out; or you could be exploring while attempting to starve the product of resources. You could be performing an action and then analyzing the data produced by the product to find problems, at each moment remaining in charge of your choices, without control by a formal, procedural script.

That is, you could be exploring while encountering the product to investigate it. That’s a great thing, but it’s encountering the product like a tester, rather than like a user. It’s a good idea to be aware of the differences between those two encounters, to take advantage of them, and not to mix them up.

You could be doing experiential testing in a highly scripted, much-less-exploratory kind of way; for instance, following a user-targeted tutorial and walking through each of its steps to observe inconsistencies between the tutorial and the product’s behaviour. To an outsider, your encounter would look pretty much like a user’s encounter; the outsider would see you interacting with the product in a naturalistic way, for the most part—except for the moments where you’re recording observations, bugs, issues, risks, and test ideas. But most observers outside of testing’s form of life won’t notice those moments.

Of course, there’s overlap between those two kinds of encounters. A key difference is that the tester, upon encountering a problem, will investigate and report it. A user is much less likely to do so. (I noticed this phenomenon while trying to enter a link in LinkedIn’s Articles editor; the “apply” button isn’t visible, and hides off the right-hand side of the popup. I found this while interacting with LinkedIn experientially. I’d like to hope that I would have found that problem when testing intentionally, in an exploratory way, too.)

There are other dimensions of “manual testing”. For a while, we considered “speculative testing”—testing that asks “what if?”—as something that people might mean when they spoke of “manual testing”. We contrasted that with “demonstrative” testing—but then we reckoned that demonstration is not really a test at all. Not intended to be, at least. For an action to be testing, we would hold that it must be mostly speculative by nature.

And here’s the main thing: part of the bullshit that testers are being fed is that “automated” testing is somehow “better” than “manual” testing because the latter is “slow and error prone”—as though people don’t make mistakes when they apply automation to checks. They do, and automation lets those errors play out at a much larger and faster scale.

Sure, automated checks run quickly; they have low execution cost. But they can have enormous development cost; enormous maintenance cost; very high interpretation cost (figuring out what went wrong can take a lot of work); high transfer cost (explaining them to non-authors).

There’s another cost, related to these others. It’s very well hidden and rarely reckoned: we might call it analysis cost, or comprehension cost. A sufficiently large suite of automated checks is impenetrable; it can’t be comprehended without very costly review. Do those checks that are always running green even do anything? Who knows?

Checks that run red get frequent attention, but a lot of them are, you know, “flaky”: they’re running red when they should be running green. And of the thousands that are running green, how many should actually be running red? It’s cognitively costly to know that—so people routinely ignore it.
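As a hedged illustration of what “flaky” means in practice, here’s a sketch—with a made-up asynchronous function and an arbitrary timeout—of a check whose verdict is entangled with timing rather than with the behaviour it’s supposed to be checking.

```typescript
// A deliberately flaky check. Everything here is hypothetical; the point is
// that the verdict depends on timing, not on the "product's" behaviour.

function slowDouble(n: number): Promise<number> {
  const delay = 50 + Math.random() * 150; // variable latency, as in real systems
  return new Promise((resolve) => setTimeout(() => resolve(n * 2), delay));
}

async function flakyCheck(): Promise<void> {
  const result = await Promise.race([
    slowDouble(21),
    new Promise<number>((_, reject) =>
      setTimeout(() => reject(new Error("timed out")), 100) // arbitrary 100ms budget
    ),
  ]);
  console.log(result === 42 ? "green" : "red");
}

// Run this a few times: sometimes green, sometimes red, with no change to the
// code under check at all. Multiply by thousands of checks, and the cost of
// knowing which reds (and which greens) to believe becomes very real.
flakyCheck().catch(() => console.log("red (timed out)"));
```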

And all of these costs represent another hidden cost: opportunity cost; the cost of doing something such that it prevents us from doing other, equally or more valuable things. That cost is immense, because it takes so much time and effort to automate GUIs when we could be interacting with the damned product.

And something even weirder is going on: instead of teaching non-technical testers to code and to get naturalistic experience with APIs, we put such testers in front of GUIish front-ends to APIs. So we have skilled coders trying to automate GUIs, and at the same time, we have non-programming testers using Cypress to de-experientialize API use! The tester’s experience of an API through Cypress is enormously different from the programmer’s experience of trying to use the API.
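For a sense of what that difference can look like, here’s a hedged sketch. The endpoint is made up; the Cypress fragment appears as a comment (it only runs inside a Cypress spec, though cy.request itself is a real command), and the second fragment is the kind of ordinary code a programmer might write against the same hypothetical API.

```typescript
// Inside a Cypress spec, an API call is wrapped in the runner's command queue
// and chained assertions (hypothetical endpoint):
//
//   cy.request('GET', '/api/orders/123')
//     .its('status')
//     .should('eq', 200);
//
// A programmer consuming the same (hypothetical) API writes ordinary code,
// handles errors, and looks at whatever actually comes back. Assumes a
// runtime with a global fetch (e.g. Node 18+).

async function fetchOrder(id: string): Promise<void> {
  const response = await fetch(`https://example.test/api/orders/${id}`);
  if (!response.ok) {
    throw new Error(`unexpected status ${response.status}`);
  }
  const order = await response.json();
  console.log(order); // the actual shape of the data, not just a status code
}

fetchOrder("123").catch((err) => console.error(err));
```

The two encounters are not equivalent; the mediated one tells the tester much less about what it’s like to program against the API.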

And none of these testers are encouraged to analyse the cost and value of the approaches they’re taking. Technochauvinism (great word; read Meredith Broussard’s book Artificial Unintelligence) reinforces the illusion that testing software is a routine, factory-like, mechanistic task, just waiting to be programmed away. This is a falsehood. Testing can benefit from tools, but testing cannot be mechanized.

Testing must be seen as a social (and socially challenging), cognitive, risk-focused, critical (in several senses), analytical, investigative, skilled, technical, exploratory, experiential, experimental, scientific, revelatory, honourable craft. Not “manual” or “automated”. Let us urge that misleading distinction to take a long vacation on a deserted island until it dies of neglect.

Testing has to be focused on finding problems that hurt people or make them unhappy. Why? Because optimists who are building a product tend to be unaware of problems, and those problems can lurk in the product. When the builders are aware of those problems, they can address them. In doing so, they make themselves look good, make money, and help people have better lives.

Exact Instructions vs. Social Competence

July 5th, 2021

An amusing video from a few years back has been making the rounds lately. Dad challenges the kids to write exact instructions to make a peanut butter and jelly sandwich, and Dad follows those instructions. The kids find the experience difficult and frustrating, because Dad interprets the “exact” instructions exactly—but differently from the way the kids intended. I’ll be here when you get back. Go ahead and watch it.

Welcome back. When the video was posted in a recent thread on LinkedIn, comments tended to focus on the need for explicit documentation, or more specific instructions, or clear direction.

In Rapid Software Testing, we’d take a different interpretation. The issue here is not that the instructions are unclear, or that the kids have expressed themselves poorly. Instead, we would emphasize that communicating clearly, describing intentions explicitly, and performing actions appropriately all rely on tacit knowledge—knowledge that has not been made explicit. In that light, the kids did a perfectly reasonable job of the assignment.

Notice that the kids do not describe what peanut butter is; they do not have to tell the father that one must twist the lid on the peanut butter jar to open it; nor do they have to explain that the markings on the paper are words representing their intentions. The father has sufficient tacit knowledge to be aware of those things. At a very young age, through socialization, observation, imitation, and practice, the dad acquired the tacit knowledge required to open peanut butter jars, to squeeze jelly dispensers without crushing them, to use butter knives to deliver peanut butter from jar to bread, to make reasonable inferences about what the “top” of the bread is, and so forth.

Even though he has sufficient tacit knowledge to interpret instructions for making a peanut butter and jelly sandwich, the dad pretends that he doesn’t. What makes the situation in the video funny for us and exasperating for the kids is our own tacit knowledge of things the father presumably should know as a normal American dad in a normal American kitchen. In particular, we’re aware that he should be able to interpret the instructions competently; to repair differences between the actions the kids intended him to take and the ones he chose to take.

In certain circles, there is an idea that “better requirements documents” or “clear communication” or “eliminating ambiguity” are royal roads to better software development and better testing. Certainly these things can help to some degree, but organizing teams and building products requires far more than explicit instructions. It requires the social context and tacit knowledge to interpret things appropriately. Dad misinterpreted on purpose. Development and testing groups can easily misinterpret by accident; unintentionally; obliviously.

Where do explicit instructions come from? Would they be any good if they weren’t rooted in knowledge about the customers’ form of life, and knowledge of the problems that customers face—the problems that the product could help to solve? Could they be expressed more concisely and more reliably when everyone involved had shared sets of feelings and mental models? And would exact instructions help if the person (or machine) receiving them didn’t have the social competence to interpret them appropriately?

In RST, we would hold that it’s essential for the tester to become immersed in the world of the product and in the customers’ forms of life to the greatest degree possible—a topic for posts to come.