Blog Posts from February, 2021

Flaky Testing

Monday, February 22nd, 2021

The expression “flaky tests” is evidence of flaky testing. No scientist refers to “flaky experimental results”. Scientists who observe inconsistency don’t dismiss it. They pay close attention to it, and probe it. They redesign their experiments or put better controls on them.

When someone refers to an automated check (or a suite of them) as a “flaky test”, the suggestion is that it represents an unreliable experiment. That assumption is misplaced. In fact, the experiment reliably shows that someone’s models of the product, check code, test environment, outcomes, theory, and the relationships between them are misaligned.

That’s not a “flaky experiment”. It’s an excellent experiment. The experiment is telling you something crucial: there’s something you don’t know. In science, a surprising, perplexing, or inconsistent result prompts scientists to begin an investigation. By contrast, in software, an inconsistent result prompts some people to shrug and ignore what the experiment is trying to tell them. Then they do weird stuff like calculating a “flakiness score”.

Of course, it’s very tempting psychologically to dismiss results that you can’t explain as “noise”, annoying pieces of red junk on your otherwise lovely all-green lawn. But a green lawn is not the goal. Understanding what the junk is, where it is, and how it gets there is the goal. It might be litter, or it might be a leaking container of toxic waste.

It’s not a great idea to perform a test that you don’t understand, unless your goal is to understand it and its relationship to the product. But it’s an even worse idea to dismiss carelessly a test outcome that you don’t understand. For a tester, that’s the epitome of “flaky”.

Now, on top of all that, there’s something even worse. Suppose you and your team have a suite of 100,000 automated checks that you proudly run on every build. Suppose that, of these, 100 run red. So you troubleshoot. It turns out that your product has problems indicated by 90 of the checks, but ten of the red results represent errors in the check code. No problem. You can fix those, now that you’re aware of the problems in them.

Thanks to the scrutiny that red checks receive, you have become aware that 10% of the outcomes you’re examining are falsely signalling failure when they are in reality successes. That’s only 10 “flaky” checks out of 100,000. Hurrah! But remember: there are 99,900 checks that you haven’t scrutinized. And you probably haven’t looked at them for a while.

Suppose you’re on a team of 10 people, responsible for 100,000 checks. To review those annually requires each person working solo to review 10,000 checks a year. At roughly 200 working days a year, that’s 50 per person (or 100 per pair) every working day. Does your working day include that?
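The arithmetic can be checked with a few lines of Python. This is only a sketch: the figure of 200 working days a year is an assumption (it is the number that makes 10,000 checks a year come out at 50 a day), and the function name is made up for illustration.

```python
def reviews_per_person_per_day(total_checks: int, team_size: int, working_days: int) -> float:
    """Checks each person must review per working day to cover the suite once a year."""
    checks_per_person = total_checks / team_size
    return checks_per_person / working_days


# Assumption: about 200 working days in a year.
per_person = reviews_per_person_per_day(100_000, 10, 200)
per_pair = 2 * per_person  # a pair reviewing together covers two people's share

print(per_person, per_pair)
```

Changing the team size or the number of working days shows how quickly the review load grows or shrinks.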

Here’s a question worth asking, then: if 10% of 100 red checks are misleadingly signalling a problem, what percentage of 99,900 green checks are misleadingly signalling “no problem”? They’re running green, so no one looks at them. They’re probably okay. But even if your unreviewed green checks are ten times more reliable than the red checks that got your attention (because they’re red), that’s 1%. That’s 999 misleadingly green checks.
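Here is the same estimate as a small Python sketch. The variable names are illustrative, and the factor of ten is the deliberately generous guess from the paragraph above, not a measurement.

```python
total_checks = 100_000
red_checks = 100
red_false_alarms = 10           # red results traced to errors in the check code
green_checks = total_checks - red_checks

# Guess: green checks are ten times more reliable than the red ones.
reliability_factor = 10

# Integer arithmetic avoids floating-point rounding:
# (10 / 100) / 10 = 1% of the green checks.
misleading_green = green_checks * red_false_alarms // (red_checks * reliability_factor)

print(green_checks, misleading_green)
```

Even a much more generous guess, say a factor of 100, still leaves roughly a hundred green checks that nobody is looking at and that may be wrong.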

Real testing requires intention and attention. It’s okay for a suite of checks to run unattended most of the time. But to be worth anything, they require periodic attention and review—or else they’re like smoke detectors, scattered throughout enormous buildings, whose batteries and states of repair are uncertain. And as Jerry Weinberg said, “most of the time, a nonfunctioning smoke alarm is behaviorally indistinguishable from one that works. Sadly, the most common reminder to replace the batteries is a fire.”

And after all this, it’s important to remember that most checks, as typically conceived, are about confirming the programmers’ intentions. In general, they represent an attempt to detect coding problems and thereby reduce the incidence of programmers committing (pun intended) easily avoidable errors. This is a fine and good thing, mostly when the effort is targeted towards lower-level, machine-friendly interfaces.

Typical GUI checks, instrumented with machinery, are touted as “simulating the user”. They don’t really do any such thing. They simulate behaviours, physical keypresses and mouse clicks, which are only the visible aspects of using the product—and of testing. GUI checks do not represent users’ actions, which in the parlance of Harry Collins and Martin Kusch are behaviours plus intentions. Significantly, no one reduces programming or management to scripted and unmotivated keystrokes, yet people call automated GUI checks “simulating the user” or “automated testing”.

Such automated checks tell us almost nothing about how people will experience the product directly. They won’t tell us how the product supports the user’s goals and tasks—or where people might have problems getting what they want from the product. Automated checks will not tell us about people’s confusion or frustration or irritation with the product. And automated checks will not question themselves to raise concern about deeper, hidden risk.

More worrisome still: people who are sufficiently overfocused, fixated, on writing and troubleshooting and maintaining automated checks won’t raise those concerns either. That’s because programming automated GUI checks is hard, like all programming is hard. But programming a machine to simulate human behaviours via complex, ever-changing interfaces designed for humans instead of machines is especially hard. The effort easily displaces risk analysis, studying the business domain, learning about users’ problems, and critical thinking about all of that.

Testers: how much of your time and effort goes to the care and feeding of scripts, and so distracts you from interacting with the product and searching for problems that matter? How much more valuable would your coding be if it helped you examine, explore, and experiment with the product and its data? If you’re a manager, how much “testing” time is actually coding and fixing time, in which your testers are being asked to fuss with making the checks run green, and adapting them to ongoing changes in the product?

So the issue is not flaky tests, but flaky testing talk, and flaky test strategy. It’s amplified by referring to “flaky understanding” and “flaky explanation” and “flaky investigation” as “flaky tests”.

Some will object. “But that’s what people say! We can’t just change the language!” I agree. But if we don’t change the way we speak (and the way we think along with it), we won’t address the real flakiness, which is the flakiness in our systems, and the flakiness in our understanding and explanations of those systems. With determination and skill and perseverance, we can change this. We can help our clients to understand the systems they’ve got, so that they can decide whether those are the systems they want.

Learn how to focus on fast, inexpensive, powerful testing strategies to find problems that matter. Register for classes here.

Necessary Confusion and the Bootstrap Heuristic

Thursday, February 11th, 2021

I’m testing a test tool at the moment. I’m investigating it for a talk. The producers of the tool claim to have hundreds of thousands of users. A few positive remarks appear in a scrolling widget on the product’s web site from people purported to be users.

Me, I can’t make head or tail of the product. It doesn’t seem to do what it’s supposed to do. It looks like a chaotic mess. It’s baffling; it’s exasperating. I don’t know where to start in analysing it and preparing a report. I’m confused. But I’m okay with that.

Any worthwhile testing starts with some degree of necessary confusion.

Why? Because worthwhile testing is primarily about learning something about a product and learning about how to test it in a complex and uncertain space. That’s by nature confusing, and that’s normal.

If the test space is neither complex nor uncertain, and if there’s little risk, you may not need to test at all, and a simple demonstration might do the trick. Knowing that the product can work might be enough, for the moment.

That’s why, for developers, performing checks and automating them at the unit level can make a lot of sense. Those checks tend to address specific, atomic conditions; they’re simple to develop and perform and encode; and they provide quick feedback without slowing down development.

A product gets built from small, discrete components. Through small, gradual changes, it turns into something much bigger and more complex, with interacting components and emergent behaviours that are non-trivial.

An encounter with anything non-trivial that you’re not familiar with tends to be messy and confusing at first. At the same time, as a working tester, you’re probably under pressure to “get things right the first time” or “get everything sorted from the beginning”. But having everything sorted really means that we’re at the end of something that was unsorted, and we’re at the beginning of the next unsorted thing!

In Rapid Software Testing, we refer to the Bootstrap Conjecture:

Any process we care about that is done both well and efficiently began by being done poorly and inefficiently.

Therefore, having “done something right the first time” probably means that it wasn’t really right, or it wasn’t really the first time, or that it was trivial, or that you got lucky.

In learning about something complex and in learning how to test it, there are frequent periods of confusion. In fact, if we’re dealing with something complex and we feel we’re sure about how to test it, that should prompt us to pause and reflect: why are we so sure?

Necessary confusion is confusion for which we do not have an algorithmic resolution. To resolve necessary confusion, we must explore a complex solution space using heuristics (that is, means of solving problems that could work but that might fail) and bounded rationality (that is, reasoning in a space where there are limits on what we know and what we can know).

To overcome confusion, we have to play, puzzle, make conjectures, perform experiments, miss stuff, ask questions, make mistakes, and be patient. Necessary confusion always occurs during deep learning and innovation.

We’re often trained in our cultures, in our social groups, and in our schooling to deny that we’re confused. That gets ramped up as soon as we get into the software business: appearing not to know something is socially awkward—almost seen as a sin in some circles of knowledge work. Confusion can make us uncomfortable.

As a tester, you could just write (or worse, run) a bunch of automated scripts that check a new product or feature for specific, anticipated errors. If you do that without exploring the product and preparing your mind, your testing will be blind to important bugs that could be there.

No set of instructions can teach you everything you need to learn about a product, and about the ways in which diverse people will try to use it. No formal procedure can anticipate how you or other people will experience the product. No testing framework will handle surprising behaviour without you learning how to deal with that framework. No tool, no “AI”, can determine whether the product is operating correctly, or whether a product manager will regard a red bar as something that amounts to an important bug. Complete and correct knowledge about those things isn’t available in advance.

You can learn how to test in advance. That will avoid some unnecessary confusion during testing. You can learn about the technology and domain of your product in advance, and that will avoid more unnecessary confusion during testing. You can learn to use particular tools in advance, and that might spare you some unnecessary confusion during testing too.

But you can’t deeply learn a new product or feature before encountering and interacting with it. The confusion you experience when learning a product is necessary, temporary, and healthy.

The key is to accept the confusion; to recognize that it’s okay to be confused. As we interact with the product and the people around it; as we gain experience; as we practice new skills and apply new tools, some of the confusion lifts.

Start with a survey of the product. Take a tour of the interfaces — the GUI, the command line, the API. Play with it. List out its key features. Create an outline of what is there to be tested. Consider who might use it, and for what. Build on your ideas of how they might value it, and how their value might be threatened. Think about data that gets taken in, processed, stored, retrieved, rendered, displayed, and deleted. How could any of that get messed up? How could the data be mishandled, misrepresented, excessively constrained, insufficiently constrained, or Just Plain Wrong?

And then iterate. Go through the same process with each function and feature, getting progressively deeper as you go. Maybe write little snippets of code to generate some data, or to analyze the output. (Have you been working with a product for a long time? This cycle is fractal; it applies to new functions or features, or to repairs in a product you know well.)
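As a sketch of what one of those little snippets might look like, the following Python produces a small, repeatable mix of awkward input strings to try against a product. The helper name `sample_inputs`, the particular values, and the choice of Python are all assumptions made for illustration; the right data depends entirely on the product in front of you.

```python
import random
import string


def sample_inputs(seed: int = 0, random_count: int = 5) -> list[str]:
    """Return hand-picked edge cases plus seeded random strings (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so a surprising result can be reproduced
    fixed = [
        "",              # empty input
        " ",             # whitespace only
        "0",             # zero
        "-1",            # negative number
        "a" * 1000,      # long input
        "Zoë",           # non-ASCII character
        "line1\nline2",  # embedded newline
    ]
    randoms = [
        "".join(rng.choices(string.printable, k=rng.randint(1, 40)))
        for _ in range(random_count)
    ]
    return fixed + randoms


inputs = sample_inputs()
for value in inputs:
    print(repr(value))
```

The point of a snippet like this is not to decide pass or fail. It gives you varied material to put in front of the product while you watch what happens.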

As we learn about the product domain; as we go about the business of sensemaking; as we develop our mental models; as we talk about the product and the problems we observe… more of the confusion dissipates. This can all happen remarkably quickly if we allow ourselves just a little time for experiencing, exploring, and experimenting with the product. Ironically, we must deliberately require and allow ourselves room for spontaneity. We need to be brave and open enough to help our managers understand how necessary that kind of work is — and how powerful it can be.

When we embrace the confusion and lean in, things begin to get clearer, our code and maps and lists get tidier, our notions of risk get sharper, and we’re better prepared to search for problems. And then we’re more likely to find the deep, dangerous problems that matter—the ones that everyone has missed so far. At the beginning, though, that process starts as we pull ourselves up by our own bootstraps.

The Bootstrap Heuristic is: begin in confusion; end in precision.

Oh… and that test tool that I’m testing? There’s a reason that I’m confused: I’ve got a confusing product in front of me. The product is inconsistent with claims that its producers make about it. The product’s behaviour is inconsistent with its purpose. It seems incapable of keeping track of its state. It provides misleading results. For outsiders, it seems designed to provide the impression that testing is happening, without any real testing going on. From the inside perspective of a tester, it’s baffling, and that’s largely because it doesn’t work.

So there’s another heuristic: persistent confusion about a product—confusion that doesn’t go away—is often a pointer to serious problems in it. If you, as a tester, can’t make sense of a product, how will the product’s customers make sense of it?

After working with this product for a little more than an hour, much of the confusion I referred to above has evaporated, and I can prepare a report with confidence.

I’m only left with one thing that I find confusing:

How can anybody be fooled by a tool like this?