Experience Report: mabl Trainer and runner, and related features

October 18th, 2021

Warning: this is a long post.

Introduction

This is an experience report of attempting to perform a session of sympathetic survey and sanity testing, done in September 2021, with follow-up work October 13-15, 2021. The product being tested is mabl. My self-assigned charter was to perform survey testing of mabl, based on performing a basic task with the product. The task was to automate a simple set of steps, using mabl’s Trainer and test runner mechanism.

I will include some meta-notes about the testing in indented text like this.

The general mission of survey testing is learning about the design, purposes, testability, and possibilities of the product. Survey testing tends to be spontaneous, open, playful, and relatively shallow. It provides a foundation for effective, efficient, deliberative, deep testing later on.

Sanity testing might also be called “smoke testing”, “quick testing”, or “build verification testing”. It’s brief, shallow testing to determine whether the product is fit for deeper testing, or whether it has immediately obvious or dramatic problems.

The idea behind sympathetic testing is not to find bugs, but to exercise a product’s features in a relatively non-challenging way.

Summary

mabl’s Trainer and test runner show significant unreliability for recording and playback of very simple, basic tasks in Mattermost, a popular open-source Slack-like chat system with a Web client. I intended to survey other elements of mabl, but attempting these simple tasks (which took approximately three minutes and forty seconds to perform without mabl) triggered a torrent of bugs that has taken (so far) approximately ten hours to investigate and document to this degree.

There are many other bugs over which I stumbled that are not included in this report; the number of problems that I was encountering in this part of the product overwhelmed my ability to stay focused and organized.

A note on the term “bug”: in the Rapid Software Testing namespace, a bug is anything about the product that threatens its value to some person who matters. A little less formally, a bug is something that bugs someone who matters.

From this perspective, a bug is not necessarily a coding error, nor a “broken” feature. A bug is something that represents a problem for someone. Note also that “bug” is subjective; the mabl people could easily declare that something is not a bug on the implicit assumption that my perception of a bug doesn’t matter to them. However, I get to declare that what I see bugs me.

The bugs that I am reporting here are, in my opinion, serious problems for a testing tool—even one intended for shallow, repetitive, and mostly unhelpful rote checks. Many of the bugs considered alone would destroy mabl’s usefulness to me, and would undermine the quality of my testing work. Yet these bugs are also very shallow; they were apparent in attempts to record, play back, and analyze a simple procedure, with no intention to provide difficult challenges to mabl’s Trainer and runner features.

It is my opinion that mabl itself has not been competently and thoroughly tested against products that would present a challenge to the Trainer or runner features; or if it has, its product management has either ignored or decided not to address the problems that I am reporting here.

I have not yet completed the initial charter of performing a systematic survey of these features. This is because my attempt to do so was completely swamped by the effort required to record the bugs I was finding, and to record the additional bugs that I found while recording and investigating the initial bugs.

From one perspective, this could be seen as a deficiency in my testing. From another (and, I would argue, more reasonable) perspective, the experience that I have had so far would suggest at least two next steps if I were working for a client, depending on my client and my client’s purposes.

One next step might be to revisit the product and map out strategies for deeper testing. Another might be to decide that the survey cannot be completed efficiently right now and is not warranted until these problems are addressed. Of course, since I’m my own client here, I get to decide: I’m preparing a detailed report of bugs found in an attempt at sympathetic testing, and I’ll leave it at that.

mabl claims to “Improve reliability and reduce maintenance with the help of AI powered test automation”. This claim might bear investigation.

What is being used as training sets for the models, and where does the data come from? Is my data being used to train machine learning models for the applications I’m testing? If it’s only mine, is the data set large enough? Or is my data being used to develop ML models for other people’s applications? If “AI” is being used to find bugs or for “self-healing”, how does the “AI” comprehend the difference between “problem” and “no problem”? And is the “AI” component being tested critically and thoroughly? These are matters for another day.

Setup and Platform

On its web site, mabl claims to provide “intelligent test automation for Agile teams”. The company also claims that you can “easily create reliable end-to-end tests that improve application quality without slowing you down.”

The claim about improving application quality is in trouble from the get-go. Neither testing nor tests improve application quality. Testing may reveal aspects of application quality, but until someone does something in response to the test report, application quality stays exactly where it is. As for the ease of creating tests… well, that’s what this report is about.

I registered an account to start a free trial, and downloaded v 1.2.2 of the product. (Between September 19 and October 13, the installer indicated that the product had been updated to version 1.3.5. Some of these notes refer to my first round of testing. Others represent updates as I revisited the product and its site to reproduce the problems I found, and to prepare this report. If the results between the two versions differ, I will point out the differences.)

I ran these tests on a Windows 10 system, using only Chrome as the browser. As of this writing, I am using Chrome v.94.

To play the role of product under test, I chose Mattermost. Mattermost is an open-source online communication tool, similar to Slack, that we use in our Rapid Software Testing classes; it provides both a desktop client and a web-based client. Like Slack, you can be a member of different Mattermost “teams”, so I set up a team with a number of channels specifically for my work with mabl.

Testing Notes

I started the mabl trainer, and chose to begin a Web test, in mabl’s parlance. (A test is more than a series of recorded actions.) mabl launched a browser window that defaulted to 1000 pixels. I navigated to the Mattermost Web client. I entered a couple of lines of ordinary plain text, which was accepted by Mattermost, and which the mabl Trainer appeared to record.

I then entered some emoticons using Mattermost’s pop-up interface; the mabl Trainer appeared to record these, too. I used the Windows on-screen keyboard to enter some more emoticons.

Then I chose a graphic, uploaded it, and provided some accompanying text that appears just above the graphic.

First input attempt with mabl

I’m getting older, and I work far enough away from my monitors that I like my browser windows big, so that I can see what’s going on. Before ending the recording, I experimented a little with resizing the browser window. In my first few runs with v 1.2.2, this caused a major freakout for the trainer, which repeatedly flashed a popup that said “Maximizing Trainer” and looped endlessly until I terminated mabl.

(In version 1.3.5, it was possible to try to maximize the browser window, but the training window stubbornly appeared to the right of the browser, even if I tried to drag it to another screen.) (See Bugs 1 and 2 below.)

I pressed “Close” on the Trainer window, and mabl prompted me to run the test. I chose Local Run, and pressed the “Start 1 run” button at the bottom of the “Ad hoc run” panel.

mabl Start Run dialog

A “Local Run Output” window appeared. mabl launched the browser in a way that covered the Local Run Window; an annoyance. mabl appeared to log into Mattermost successfully. The tool simulated a click on the appropriate team, and landed in that team at the top of the default channel. This is odd, because normally, Mattermost takes me to the end of the default channel. And then… nothing seemed to happen.

Whatever mabl was doing was happening below the level of visible input. (Later investigation shows that mabl, silently and by default, sets the height of the viewport much larger than the height of the browser window and the height of my screen.) (See Bug 3 below.)
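To make that concrete: browser automation tools generally control the viewport independently of the window that a human sees. Here is a minimal sketch of the general idea using Playwright for Python (my own choice of tool for illustration, not anything mabl exposes); the URL and the sizes are placeholders.

    # Minimal sketch (Playwright for Python): a tool-driven viewport can be much
    # taller than the visible window, so "on-screen" activity happens out of view.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        # Assumption for illustration: a 3000-pixel-tall viewport, far taller than my screen.
        page = browser.new_page(viewport={"width": 1000, "height": 3000})
        page.goto("https://example.com")   # placeholder URL, not my Mattermost instance
        print(page.viewport_size)          # {'width': 1000, 'height': 3000}
        browser.close()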

When I looked at the Mattermost instance in another window, it was apparent that mabl had failed to enter the first line of text that I had entered, even though the step is clearly listed:

mabl Missed Step Detail

Yet the Local run output window suggested that the text had been entered successfully,

mabl Local run output suggests successful text entry

mabl failed to enter most subsequent text entries, too. Upon investigation, it appears that the runner does type the body of the recorded text into the textbox element. After that, though, either the mabl Trainer does not record the Enter key to send the message, or the runner doesn’t simulate the playback of that key.

The consequence is that when it comes time to enter another line of text, mabl simply replaces the contents of the textbox element with the new line of text, and the previous line is lost.

Ending entries in this text element with Ctrl-Enter provides a workaround to this behaviour, but that’s not the normal means for submitting a post in Mattermost. The Enter key on its own should do the trick.

More investigation revealed that this behaviour is the same whether the procedure is run locally or in the cloud. (See Bug 4 below.)
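For contrast, here is how the same intent might be expressed directly in a scripted browser-automation tool. This is a minimal sketch using Playwright for Python, not a description of mabl’s internals; the #post_textbox selector is an assumption about Mattermost’s markup.

    # Sketch: posting a Mattermost message by pressing Enter, or Ctrl+Enter as a workaround.
    # Assumes a logged-in page and a "#post_textbox" message box (an assumption on my part).
    from playwright.sync_api import Page

    def post_message(page: Page, text: str, use_ctrl_enter: bool = False) -> None:
        box = page.locator("#post_textbox")
        box.fill(text)                  # put the text into the box
        if use_ctrl_enter:
            box.press("Control+Enter")  # the ending that happened to play back for me
        else:
            box.press("Enter")          # Mattermost's normal way of submitting a post

Expressed this way, the submission keystroke is an explicit, reviewable step; with the Trainer, I could neither see nor reliably edit the equivalent step.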

Many record-and-playback tools claim to simulate user behaviour. It is crucial to remember that human users enter data in one way—via mechanisms like keyboards, mice, touch pads, drawing tablets, etc.—and almost all playback tools use different means, in the form of software interfaces. The differences between input mechanisms are often ignored, but they can be significant.

Moreover, different playback tools use different approaches to simulate user input. Often these approaches throw away elements of user behaviours such as backspacing, pasting, copying, or deleting blocks of text, and submit only the edited string for processing. Such simulations will systematically miss problems that happen in real usage.
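To illustrate the difference, here are two ways a tool might “enter” the same text; again a sketch in Playwright for Python, with a placeholder selector. The first replaces the field’s contents with the final, already-edited string; the second dispatches key events one at a time, which is closer to (though still not the same as) a person at a keyboard.

    # Two simulations of "the user typed some text" that exercise the product differently.
    from playwright.sync_api import Page

    def enter_text_wholesale(page: Page, selector: str, text: str) -> None:
        # Sets the field to the finished string in one operation; no per-key events.
        page.fill(selector, text)

    def enter_text_key_by_key(page: Page, selector: str, text: str) -> None:
        # Dispatches individual key events, so keydown/keyup/input handlers all fire.
        page.click(selector)
        page.keyboard.type(text, delay=50)  # 50 ms between keys, a rough nod to human pacing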

Upon trying to play back the typing of the emojis, mabl apparently became confused by Mattermost’s emoji popup. The log indicated that the application attempted to locate a specific element three times, then concluded that the element could not be found, whereupon the entire procedure errored out for good. The controls that mabl is seeking according to the recorded steps in the test are plainly accessible via the developer tools. All this seems inconsistent with mabl’s claims of “auto-healing”. (See Bugs 5 and 6 below.)

Local run output with errors

In these screenshots, some timestamps in the logs may appear out of sequence relative to this narrative. Some screenshots you’re seeing here are of repro instances, rather than what happened the first time through. This is because I was encountering so many bugs while testing that my capacity to record them properly became overwhelmed, and I had to return for analysis later.

The phenomenon of being swamped (or swarmed) by bugs like this is something we call a bug cascade in Rapid Software Testing. In my rough notes for my first run of this session, I observe “I should have been recording a video of all this.” It can be very useful to have a narrated video recording for later review.

I examined the “Local run output” window more closely and observed a number of problems. On this run and others, the listing claims to have entered text successfully when that text never appears in the application under test. Only the first 37 characters of the text entered by the runner appear in the log.

The local log contains time stamps, but not date stamps, and those time stamps are recorded in AM/PM format. Both of these are inconvenient for analysing the log files with tool support. There appears to be no mechanism for saving a file from the Local Run Output window. (See Bugs 7, 8, 9, 10, and 11 below.)

Local run output window
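To illustrate why AM/PM time stamps without dates are awkward for analysis: before a script can sort or compare entries, it must guess at a date and convert to an unambiguous format. A rough sketch follows; the exact timestamp format is an assumption based on what I saw in the window.

    # Sketch: normalizing an AM/PM time stamp (with no date) into something sortable.
    from datetime import datetime, date

    def normalize(timestamp: str, run_date: date) -> str:
        # The log shows something like "2:41:07 PM"; the date must be supplied out of band.
        t = datetime.strptime(timestamp, "%I:%M:%S %p").time()
        return datetime.combine(run_date, t).isoformat()

    print(normalize("2:41:07 PM", date(2021, 10, 14)))   # 2021-10-14T14:41:07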

I looked for a means of looking at past runs from the desktop in various places in the mabl desktop client. I could not find one. (See Bug 12 below.)

Using Search Everything (a very useful tool that affords more or less instantaneous search for all files on the system) I also looked for log files associated with individual test runs. I could not find any.

Search Everything quickly helped me to find mabl’s own application log (mablApp.log), and this did contain some data about the steps. Oddly, the runner data in mablApp.log is formatted in a more useful way than in the “Local run output” window. Local runs did not seem to record screen shots, either.

This was all pretty confusing at first, but later research revealed this: “Note that test artifacts – such as DOM snapshots or screenshots – are not captured from local runs. The mabl execution logs and final result are displayed in a separate window.” (https://help.mabl.com/docs/local-runs-in-the-desktop-app) That might be bad enough—but no run-specific local logs at all?

In order to try to troubleshoot the problems that I was experiencing with entering text, I looked at the Tests listings, and chose My New Test. This took me to an untitled window that lists the steps for the test (I will call this the “Test Steps View”).

Scanning the list of steps, I observed that the second step was “Visit URL assigned to variable ‘app.url’.” This is factually correct, but unhelpful; how does one find the value of the variable? There is no indication of what that URL might be or how to find it conveniently. Indeed, the screen suggests that there are “no data-driven variables”—which seems false.

(Later investigation revealed that if I chose Edit Steps, then chose Quick Edit, then chose the URL to train against, and then chose the step “Visit URL assigned to variable ‘app.url'”, I could see a preview of the value. How about a display of the variable in the steps listing? A tooltip?) (See Bug 13 below.)

I examined the step that appeared to be failing to enter text. The text that I originally typed into the Mattermost window was not displayed in full, even though there’s plenty of space available for it in the window for the Test Steps View. (See Bug 14 below.) This behaviour is inconsistent with my ability to explain it, and it’s inconsistent with an implicit purpose of the Test Steps View (the ability to troubleshoot test steps easily). However, it is consistent with the display in the “Local run output” window and with the logging in mabl’s system log. (See Bug 15 below.)

As I noted above, further experimentation with Mattermost and with the mabl trainer showed that ending the input with Ctrl-Enter (rather than Mattermost’s default Enter) while recording allowed mabl to play back the text entry successfully. So, perhaps if I could edit the text entry step somehow, or if I could add a keystroke step, there would be a workaround for this problem if I’m willing to accept the risk that behaviours developed with the Trainer are inconsistent with the actual behaviours of the user.

In the Test Steps View, there is an “Edit Steps” dropdown, with the options “Quick Edit” and “Launch Trainer”. I clicked on Quick Edit, and a pop-up appeared immediately, confusingly, and probably unnecessarily: “Launching Trainer”.

mabl's “Launching Trainer” popup

I selected the text entry step, hoping to edit some of the actions within it. Of the items that appear in the image below, note that only the input text can be edited; no other aspect of the step can be. (See Bug 16 below.)

mabl's Quick Edit window

It’s possible to send keypresses to a specific element. That capability has supposedly been available in mabl for a long time as claimed here. Could I add an escape sequence to the text by which I could enter a specific key or key combination? If such a feature is available, it’s not documented clearly. The documentation hints that certain keys might have escape strings—”[TAB]”, or “[ENTER]”. However, adding those strings to the end of the text doesn’t make the virtual keypress happen. (See Bug 17 below.)

The Quick Edit window offers the opportunity to insert a step. What if I try that? I scroll down to the step that enters the text, attempt to select that step with the mouse, and press the plus key at the bottom to insert the step. A dialog appears that offers a set of possible steps. Neither entering text, nor entering keystrokes, nor clicking the mouse appears on this list. (See Bug 18 below.)

mabl's options for inserting steps

(For those wondering if input features appear beneath the visible window, “Variables” is the last item in the list.)

When I look at Step 5 in the Test Steps View, I see that there’s a step that sends a Tab keypress to the “Email or Username” text field. Maybe I could duplicate that step, and drag it down to the point after my text entry. Then maybe I could modify the step to point to the post_textbox element, and to send the Enter key instead of the Tab key.

Yes, I can change [TAB] to [ENTER]. But I can’t change the destination element. (See Bug 19 below.)

mabl's facility for sending a keypress

Documenting this is difficult and frustrating. Each means of trying to send that damned Enter key is thwarted in some exasperating and inexplicable way. For those of you of a certain age, it’s like the Cone of Silence from Get Smart (the younger folks can look it up on the Web). I’m astonished by the incapability of the product, and because of that I second-guess myself and repeat my actions over and over again to make absolutely sure I’m not missing some obvious way to accomplish my goal. The strength of my feelings at this time is a pointer to the significance of the problems I’m encountering.

I looked at some of the other steps displayed in the Test Steps View and in the Trainer. Note that steps are often described as “Click on a button” without displaying which button is being referred to, unless that button has an explicit text label. This is annoying, since human-readable information (like the aria-label attribute) is available, but I had to click on “edit” to see it. (See Bug 20 below.)

Vague “Click on a button” step description

Scanning the rest of the Test Steps View, I noticed an option to download a comma-separated-value (.CSV) file; perhaps that can be viewed and edited, and then uploaded somehow. I downloaded a .CSV and looked at it. It is consistent with what is displayed in the Test Steps View, but it does not accurately reflect the behaviour that mabl is trying to perform.

Once again, the text that mabl actually tries to enter in a text field (which can be observed if you scroll to the bottom of the browser window in the middle of the test) is elided, limited to 37 characters plus an ellipsis. (See Bug 21 below.)

This would be a more serious problem if I tried to edit the script and upload it. However, no worries there, because even though you can download a .CSV file of test steps, you can’t upload one. There’s nothing in the product’s UI, and a search of the Help file for “upload” revealed no means for uploading test step files. (See Bug 22 below.)
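If the export were trustworthy, auditing it for elided text would take only a few lines of scripting. Here’s a sketch of what I mean; the column name and the trailing-ellipsis convention are assumptions about the file’s layout.

    # Sketch: flag rows in an exported test-step CSV whose text looks truncated.
    # The "Step" column name and the trailing-ellipsis convention are assumptions.
    import csv

    def find_elided_steps(path: str, column: str = "Step") -> list[str]:
        with open(path, newline="", encoding="utf-8") as f:
            return [row[column] for row in csv.DictReader(f)
                    if row.get(column, "").endswith(("…", "..."))]

    for step in find_elided_steps("my-new-test-steps.csv"):   # hypothetical file name
        print("Possibly truncated:", step)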

At this point, I began to give up on entering text naturalistically and reliably using the Trainer. I wondered if there was anything that I could rescue from my original task of entering text, throwing in some emojis, and uploading a file. I edited out the test steps over which mabl stumbled in order to get to the file upload. The test proceeded, but the file didn’t get uploaded. Perhaps this is because the Trainer doesn’t keep track of where the file came from on the local system. (See Bug 23 below.)

At this point, my energy for continuing with this report is flagging. Investigating and reporting bugs takes time, and when there are this many problems, it’s disheartening and exhausting. I’m also worried about this getting boring to read. This post probably still has several typos. I have left many bugs that I encountered out of the narrative here, but a handful of them appear below. I left many more undocumented and uninvestigated. (See Bugs 24, 25, 26, and 27 below.)

There is much more to mabl. Those aspects may be wonderful, or terrible. I don’t know, because I have not examined them in any detail, but I have a strong suspicion of lots of further trouble. Here’s an example:

In my initial round of testing in September, I created a plan—essentially a suite of recorded procedures and tasks that mabl offers. That plan included crawling the entire Mattermost instance for broken links. mabl’s summary report indicated that everything had passed, and that there were no broken links. “Everything looks good!”

mabl claims everything looks good.

I scrolled down a bit, though, and looked at the individual items below. There I saw “Found 3 broken links” on the left, and the details on the right.

While mabl claims everything looks good, there are broken links.

In the October 13-15 test activity, I set up a task for mabl to crawl my blog looking for broken links. Thanks to various forms of bitrot (links that have moved or otherwise become obsolete, commenters whose web sites have gone defunct, etc.), there are lots of broken links. mabl reports that everything passed.

mabl results table suggesting everything passed

This looks fine until you look at the details. mabl identified 586 broken links (many of them are duplicates)… and yet the summary says “Visit all linked pages within the app” passed. (See Bug 28 below.)

Visit all linked pages within the app details
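For comparison, here is roughly what I take “visit linked pages and report broken links” to mean, reduced to a single-page sketch in Python using requests and BeautifulSoup. It is not a crawler, and certainly not mabl’s implementation; the point is simply that a check which finds broken links should not report an overall pass.

    # Sketch: check the links on a single page and fail loudly if any are broken.
    # Not a crawler; requests and beautifulsoup4 are assumed to be installed.
    import sys
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def broken_links(page_url: str) -> list[tuple[str, int]]:
        html = requests.get(page_url, timeout=10).text
        links = {urljoin(page_url, a["href"])
                 for a in BeautifulSoup(html, "html.parser").find_all("a", href=True)
                 if a["href"].startswith(("http", "/"))}
        bad = []
        for link in sorted(links):
            try:
                status = requests.get(link, timeout=10).status_code
            except requests.RequestException:
                status = 0                        # unreachable counts as broken
            if status == 0 or status >= 400:
                bad.append((link, status))
        return bad

    if __name__ == "__main__":
        problems = broken_links(sys.argv[1])
        for link, status in problems:
            print(f"BROKEN ({status}): {link}")
        sys.exit(1 if problems else 0)            # a check that finds problems must not "pass"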

Epilogue

During my first round of testing in September, I contacted mabl support via chat, and informed the representative that I was encountering problems with the product while preparing a talk. The representative on the chat promised to have someone contact me about the problems. The next day, I received this email:

Email reply sent by mabl Customer Support. Helpful advice:  RTFM.

Let me zoom that up for you:

Email reply sent by mabl Customer Support, zoomed up.  Helpful advice:  RTFM.

And that, so it seems, is what passes for a response: RTFM.

Bug Summaries

Bug 1: Resizing the browser while training resulted in an endless loop that hangs the product. (Observed several times in 1.2.2; not observed so far in 1.3.5.)

Bug 2: The browser cannot be resized to the full size of the screen on which the training is happening; and at the same time, the trainer window cannot be repositioned onto another screen. (This was happening in 1.2.2 when resizing didn’t result in the endless loop above; it still happens in 1.3.5.) This is inconsistent with usability and inconsistent with comparable products; if the product is intended to replicate the experience of a user, it’s also inconsistent with purpose.

Bug 3: The default behaviour of the application running in the browser is different from a naturalistic encounter with the product; in this case, it rendered input activity invisible unless I actively scrolled the browser window using the cursor keys, until I figured out where the viewport height was set. Inconsistent with usability for a test tool at first encounter; inconsistent with charisma.

Bug 4: mabl’s playback function doesn’t play back simple text entry into a Mattermost instance, but the logging claims that the text was entered correctly. This happens irrespective of whether the procedure is run from the cloud or from the local machine. This is inconsistent with comparable products; inconsistent with purpose; inconsistent with basic capabilities for a product of this nature; and inconsistent with claims (https://help.mabl.com/changelog/initial-keypress-support-in-the-mabl-trainer).

Bug 5: mabl seems unable to locate emojis in Mattermost’s emoji popup—something that a human tester would have no problem with—even though the Trainer supposedly captured the action. (Inconsistency with purpose.)

Bug 6: Auto-healing fails when trying to locate buttons in the Mattermost emoji picker. (Inconsistency with claims.)

Bug 7: The “Local run output” window falsely suggests that attempts to enter text are successful when the text entry has not completed. (Inconsistent with basic functionality; inconsistent with purpose.)

Bug 8: The “Local run output” window does not record the actual text that was entered by the runner. Only the first 37 characters of the entry, followed by an ellipsis (“…”) are displayed. (Inconsistent with usability for a test tool.)

Bug 9: Date stamps are absent in the logging information displayed in the “Local run output” window. Only time stamps appear, and at that only precise down to the second. This is an inconvenience for analyzing logged results over several days. (Inconsistent with usability for testing purposes; also inconsistent with product (mabl’s own application log).)

Bug 10: Time stamps in the “Local run output” window are rendered in AM/PM format, which makes sorting and searching via machinery less convenient. (Inconsistent with testability; also poor internationalization; and also inconsistent with mabl’s own application log.)

Bug 11: Cannot save data to a file directly from the “Local run output” window. (Inconsistent with purpose; inconsistent with usability; risk of data loss.) Workarounds: copying data from the log and pasting it into the user’s own record; spelunking through mabl’s mablApp.log file.

Bug 12: Local run log output does not appear in mabl’s GUI, neither under Results nor under the Run History tab for individual tests. If there is a facility for that available from the GUI, it’s very well hidden. (Inconsistent with usability for a record/playback testing tool.) Workaround: there is some data available in the general application log for the product, but it would require effort to disentangle it from the rest of the log entries.

Bug 13: The test steps editing window makes it harder than necessary to view the content of variables that will be used for the test procedure. For instance, the user must choose Edit Steps, then choose Quick Edit, then choose the URL to train against, and then choose the step “Visit URL assigned to variable ‘app.url’”.

Bug 14: The main test editor window hides the content of text entry strings longer than about 40 characters. Since there is ample whitespace to the right, it is unclear why longer strings of text aren’t displayed. Inconsistent with explainability, inconsistent with purpose (the ability to troubleshoot test steps easily).

Bug 15: mabl’s application log (mablApp.log) limits the total length of the typed string to 40 characters (37 characters, plus an ellipsis (…)). (Is the Local Output Log generated from the mablApp.log?)

Bug 16: In a step to enter text in Quick Edit mode, only the input text can be edited; no other aspect of the step (neither the target nor the action) can be edited.

Bug 17: Escape sequences to send specific keys (e.g. Tab, Enter) are not supported by mabl’s Quick Edit step editor. Inconsistent with comparable products, inconsistent with purpose.

Bug 18: The “Insert Steps” option in the Quick Edit dialog does not offer options for entering text, sending keys, or clicking on elements. Inconsistent with purpose; inconsistent with comparable products.

Bug 19: The “Send keypress” dialog allows changing the key to be sent, or to add modifier keys, but doesn’t allow changing the element to which the key is sent.

Bug 20: The trainer window fails to identify which button is to be clicked in a step unless the button has a text label. Some useful information (e.g. the Aria Label or class ID) to identify the button is available if you enter the step and try to edit it. (Inconsistent with product; inconsistent with purpose)

Bug 21: The .CSV file identifying the steps for a test does not reflect the actual steps performed. (Inconsistent with product; inconsistent with the purpose of trying to see the actual steps in the procedure.) Workaround: going into each step in the Quick Edit or Trainer views displays the entire text, but for long procedures with strings longer than 40 characters, this could be very expensive in terms of time.

Bug 22: You can’t upload a CSV of test steps at all. Editing test steps depends on mabl’s highly limited Trainer or Quick Edit facilities—and Quick Edit depends on the Trainer. The purpose of downloaded CSV step files is unclear.

Bug 23: A file upload recorded through the Trainer / Runner mechanism never happens.

Bug 24: The Help/Get Logs for Support option isn’t set by default to go to the folder where mabl’s logs are stored. Instead, it opens up a normal File/Open window (in my case defaulting to the Downloads folder, perhaps because this is the most recent location where I opened my browser, or…)

Bug 25: The mabl menu View / Zoom In function claims to be mapped to Ctrl-+. It isn’t. The Zoom Out (Ctrl-minus) and Actual Size (Ctrl-0) functions do work.

Bug 26: I noticed on October 17 that an update was available. There is no indication that release notes are available or what has changed. When I do a Web search for mabl release notes, such release notes as exist don’t refer to version numbers!

Bug 27: The mabl Trainer window doesn’t have controls typically found in the upper right of a Windows dialog, which makes resizing the window difficult and makes minimizing it impossible. (Inconsistent with comparable products; inconsistent with UI standards.)

Bug 28: mabl’s Results table falsely suggests that a check for broken links “passed”, when hundreds of broken links were found. (Inconsistent with comparable products; inconsistent with UI standards.)

I thank early readers Djuka Selendic, Jon Beller, and Linda Paustian for spotting problems in this post and bringing them to my attention. Testers help other people look good!

Still reading? You must be interested in testing. I’ll be presenting Rapid Software Testing Explored online for European, UK, and Indian time zones November 22-25, 2021; register here. Rapid Software Testing Managed runs December 1-3, again in European daytimes and Indian evening; here’s where to sign up. Rapid Software Explored for the Americas happens January 17-20, 2022; register here.

To Go Deep, Start Shallow

October 13th, 2021

Here are two questions that testers ask me pretty frequently:

How can I show management the value of testing?
How can I get more time to test?

Let’s start with the second question first. Do you feel overwhelmed by the product space you’ve been assigned to cover relative to the time you’ve been given? Are you concerned that you won’t have enough time to find problems that matter?

As testers, it’s our job to help to shine light on business risk. Some business risk is driven by problems that we’ve discovered in the product—problems that could lead to disappointed users, bad reviews, support costs… More business risk comes from deeper problems that we haven’t discovered yet, because our testing hasn’t covered the product sufficiently to reveal those problems.

All too often, managers allocate time and resources for testing based on limited, vague, and overly optimistic ideas about risk. So here’s one way to bring those risk ideas to light, and to make them more vivid.

  • Start by surveying the product and creating a product coverage outline that identifies what is there to be tested, where you’ve looked for problems so far, and where you could look more deeply for them. If you’ve already started testing, that’s okay; you can start your product coverage outline now.
  • As you go, develop a risk list based on bugs (known problems that threaten the value of the product), product risks (potential deeper, unknown problems in the product in areas that have not yet been covered by testing), and issues (problems that threaten the value of the testing work). Connect these to potential consequences for the business. Again, if you’re not already maintaining a risk list, you can start now.
  • And as you go, try performing some quick testing to find shallow bugs.

By “quick testing”, I mean performing fast, inexpensive tests that take little time to prepare and little effort to perform. As such, small bursts of quick testing can be done spontaneously, even when you’re in the middle of a more deliberative testing process. Fast, inexpensive testing of this nature often reveals shallow, easy-to-find bugs.

In general, in a quick test, we rapidly encounter some aspect of the product, and then apply fast and easy oracles. Here are just a few examples of quick testing heuristics. I’ve given some of them deliberately goofy and informal names. Feel free to rename them, and to create your own list.

Blink. Load the same page in two browsers and switch quickly between them. Notice any significant differences?
Instant Stress. Overload a field with an overwhelming amount of data (consider PerlClip, BugMagnet or similar lightweight tools; or just use a text editor to create a huge string by copying and pasting; see the sketch after this list); then try to save or complete the transaction. What happens?
Pathological Data. Provide data to a field that should trigger input filtering (reserved HTML characters, emojis…). Is the input handled appropriately?
Click Frenzy. Click in the same (or different) places rapidly and relentlessly. Any strange behaviours? Processing problems (especially at the back end)?
Screen Survey. Pause whatever you’re doing for a moment and look over the screen; see anything obviously inconsistent?
Flood the Field. Try filling each field to its limits. Is all the data visible? What were the actual limits? Is the team okay with them—or surprised to hear about them? What happens when you save the file or commit the transaction?
Empty Input. Leave “mandatory” fields empty. Is an error message triggered? Is the error message reasonable?
Ooops. Make a deliberate mistake, move on a couple of steps, and then try to correct it. Does the system allow you to correct your “mistake” appropriately, or does the mistake get baked in?
Pull Out the Rug. Start a process, and interrupt or undermine it somehow. Close the laptop lid; close the browser session; turn off wi-fi. If the process doesn’t complete, does the system recover gracefully?
Tug-of-War. Try grabbing two resources at the same time when one should be locked. Does a change in one instance affect the other?
Documentation Dip. Quickly open the spec or user manual or API documentation. Are there inconsistencies between the artifact and the product?
One Shot Stop. Try an idempotent action—doing something twice that should effect a change the first time, but not subsequent times, like upgrading an account status to the top tier and then trying to upgrade it again. Did a change happen the second time?
Zoom-Zoom. Grow or shrink the browser window (remembering that some people don’t see too well, and others want to see more). Does anything disappear?
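For the Instant Stress and Flood the Field heuristics above, it helps to use data whose length describes itself. Here’s a minimal sketch of a counterstring generator in the spirit of James Bach’s PerlClip: each asterisk is preceded by its own position in the string, so when a field silently truncates your input, you can read off exactly where.

    # Sketch: generate a "counterstring" (in the spirit of PerlClip) for quick stress tests.
    # Each '*' is preceded by its own 1-based position, so truncation points are readable.
    def counterstring(length: int) -> str:
        out = []
        pos = 0                                   # characters emitted so far
        while pos < length:
            digits = 1                            # settle on the width of the next marker
            while len(str(pos + digits + 1)) != digits:
                digits = len(str(pos + digits + 1))
            star_pos = pos + digits + 1           # where the next '*' will land (1-based)
            marker = f"{star_pos}*"
            if pos + len(marker) > length:
                out.append("*" * (length - pos))  # pad the tail when a full marker won't fit
                break
            out.append(marker)
            pos = star_pos
        return "".join(out)

    s = counterstring(256)
    print(len(s))    # 256
    print(s[-15:])   # the trailing markers show positions near the end of the string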

It might be tempting for some people to dismiss shallow bugs. “That’s an edge case.” “No user will do that.” “That’s not the right way to use the product.” “The users should read the manual.” Sometimes those things might even be true. Dismissing shallow bugs too casually, without investigation, could be a mistake, though.

Quick, shallow testing is like panning for gold: you probably won’t make much from the flakes and tiny nuggets on their own, but if you do some searching farther upstream, you might hit the mother lode. That is: shallow bugs should prompt at least some suspicion about the risk of deeper, more systemic problems and failure patterns about the product. In the coverage outline and risk list you’re developing, highlight areas where you’ve encountered those shallow bugs. Make these part of your ongoing testing story.

Now: you might think you don’t have time for quick testing, or to investigate those little problems that lead you to big problems. “Management wants me to finish running through all these test cases!” “Management wants me to turn these test cases into automated checks!” “Management needs me to fix all these automated checks that got out of sync with the product when it changed!”

If those are your assignments from management, you may feel like your testing work is being micromanaged, but is it? Consider this: if managers were really scrutinizing your work carefully, there’s a good chance that they would be horrified at the time you’re spending on paperwork, or on fighting with your test tools, or on trying to teach a machine to recognise buttons on a screen, only to push them repeatedly to demonstrate that something can work. And they’d probably be alarmed at how easily problems can get past these approaches, and they’d be surprised at the volume of bugs you’re finding without them—especially if you’re not reporting how you’re really finding the bugs.

Because managers are probably not observing you every minute of every day, you may have more opportunity for quick tests than you think, thanks to disposable time.

Disposable time, in the Rapid Software Testing namespace, is our term for time that you can afford to waste without getting into trouble; time when management isn’t actually watching what you’re doing; moments of activity that can be invested to return big rewards. Here’s a blog post on disposable time.

You almost certainly have some disposable time available to you, yet you might be leery about using it.

For instance, maybe you’re worried about getting into trouble for “not finishing the test cases”. It’s a good idea to cover the product with testing, of course, but structuring testing around “test cases” might be an unhelpful way to frame testing work, and “finishing the test cases” might be a kind of goal displacement, when the goal is finding bugs that matter.

Maybe your management is insisting that you create automated GUI checks, a policy arguably made worse by intractable “codeless” GUI automation tools that are riddled with limitations and bugs. This is not to say that automated checking is a bad thing. On the contrary; it’s a pretty reasonable idea for developers to automate low-level output checks that give them fast feedback about undesired changes. It might also be a really good idea for testers to exercise the product using APIs or scriptable interfaces for testing. But why should testers be recapitulating developers’ lower-level checks while pointing machinery at the machine-unfriendly GUI? As my colleague James Bach says, “When it comes to technical debt, GUI automation is a vicious loan shark.”

If you feel compelled to focus on those assignments, consider taking a moment or two, every now and again, to perform a quick test like the ones above. Even if your testing is less constrained and you’re doing deliberative testing that you find valuable, it’s worthwhile to defocus on occasion and try a quick test. If you don’t find a bug, oh well. There’s still a good chance that you’ll have learned a little something about the product.

If you do find a bug and you only have a couple of free moments, at least note it quickly. If you have a little more time, try investigating it, or looking for a similar bug nearby. If you have larger chunks of disposable time, consider creating little tools that help you to probe the product; writing a quick script to generate interesting data; popping open a log file and scanning it briefly. All focus and no defocus makes Jack—or Jill—a dull tester.

Remember: almost always, the overarching goal of testing is to evaluate the product by learning about it, with a special focus on finding problems that matter to developers, managers, and customers. How do we get time to do that in the most efficient way we can? Quick, shallow tests can provide us with some hints on where to suspect risk. Once found, those problems themselves can help to provide evidence that more time for deep testing might be warranted.

Several years ago, I was listening while James Bach was teaching a testing workshop. “If I find enough important problems quickly enough,” he said, “the managers and developers will be all tied up in arguing about how to fix them before the ship date. They’ll be too busy to micromanage me; they’ll leave me alone.”

You can achieve substantial freedom to captain your own ship of testing work when you consistently bring home the gold to developers and managers. The gold, for testers, is awareness and evidence of problems that make managers say “Ugh… but thank heavens that the tester found that problem before we shipped.”

If you’re using a small fraction of your time to find problems and explore more valuable approaches to finding them, no one will notice on those rare occasions when you’re not successful. But if you are successful, by definition you’ll be accomplishing something valuable or impressive. Discovering shallow bugs, treating them as clues that point us towards deeper problems, finding those, and then reporting responsibly can show how productive spontaneous bursts of experimentation can be. The risks you expose can earn you more time and freedom to do deeper, more valuable testing.

Which brings us to back to the first question, way above: “How can I show management the value of testing?”

Even a highly disciplined and well-coordinated development effort will result in some bugs. If you’re finding bugs that matter—hidden, rare, elusive, emergent, surprising, important, bone-chilling problems that have got past the disciplined review and testing that you, the designers and the developers have done already—then you won’t need to do much convincing. Your bug reports and risk lists will do the convincing for you. Rapid study of the product space builds your mental models and points to areas for deeper examination. Quick, cheap little experiments help you to learn the product, and to find problems that point to deeper problems. Finding those subtle, startling, deep problems starts with shallow testing that gets deeper over time.


Rapid Software Testing Explored for Europe and points east runs November 22-25, 2021. A session for daytime in the Americas and evenings in Europe runs January 17-20, 2022.

Alternatives to “Manual Testing”: Experiential, Attended, Exploratory

August 24th, 2021

This is an extension on a long Twitter thread from a while back that made its way to LinkedIn, but not to my blog.

No one ever sits in front of a computer and accidentally compiles a working program, so people know — intuitively and correctly — that programming must be hard. But almost anyone can sit in front of a computer and stumble over bugs, so people believe — intuitively and incorrectly — that testing must be easy!

Testers who take testing seriously have a problem with getting people to understand testing work.

The problem is a special case of the insider/outsider problem that surrounds any aspect of human experience: most of the time, those on the outside of a social group—a community; a culture; a group of people with certain expertise; a country; a fan club—don’t understand the insider’s perspective. The insiders don’t understand the outsiders’ perspective either.

We don’t know what we don’t know. That should be obvious, of course, but when we don’t know something, we have no idea of how little we comprehend it, and our experience and our lack of experience can lead us astray. “Driving is easy! You just put the car in gear and off you go!” That probably works really well in whatever your current context happens to be. Now I invite you to get behind the wheel in Bangalore.

How does this relate to testing? Here’s how:

No one ever sits in front of a computer and accidentally compiles a working program, so people know—intuitively and correctly—that programming must be hard.

By contrast, almost anyone can sit in front of a computer and stumble over bugs, so people believe—intuitively and incorrectly—that testing must be easy!

In our world of software development, there is a kind of fantasy that if everyone is of good will, and if everyone tries really, really hard, then everything will turn out all right. If we believe that fantasy, we don’t need to look for deep, hidden, rare, subtle, intermittent, emergent problems; people’s virtue will magically make them impossible. That is, to put it mildly, a very optimistic approach to risk. It’s okay for products that don’t matter much. But if our products matter, it behooves us to look for problems. And to find deep problems intentionally, it helps a lot to have skilled testers.

Yet the role of the tester is not always welcome. The trouble is that to produce a novel, complex product, you need an enormous amount of optimism; a can-do attitude. But as my friend Fiona Charles once said to me—paraphrasing Tom DeMarco and Tim Lister—”in a can-do environment, risk management is criminalized.” I’d go further: in a can-do environment, even risk acknowledgement is criminalized.

In Waltzing With Bears, DeMarco and Lister say “The direct result of can-do is to put a damper on any kind of analysis that suggests ‘can’t-do’…When you put a structure of risk management in place, you authorize people to think negatively, at least part of the time. Companies that do this understand that negative thinking is the only way to avoid being blindsided by risk as the project proceeds.”

Risk denial plays out in a terrific documentary, General Magic, about a development shop of the same name. In the early 1990s(!!), General Magic was working on a device that — in terms of capability, design, and ambition — was virtually indistinguishable from the iPhone that was released about 15 years later.

The documentary is well worth watching. In one segment, Marc Porat, the project’s leader, talks in retrospect about why General Magic flamed out without ever getting anywhere near the launchpad. He says, “There was a fearlessness and a sense of correctness; no questioning of ‘Could I be wrong?’. None. … that’s what you need to break out of Earth’s gravity. You need an enormous amount of momentum … that comes from suppressing introspection about the possibility of failure.”

That line of thinking persists all over software development, to this day. As a craft, the software development business systematically resists thinking critically about problems and risk. Alas for testers, that’s the domain that we inhabit.

Developers have great skill, expertise, and tacit knowledge in linking the world of people and the world of machines. What they tend not to have—and almost everyone is like this, not just programmers—is an inclination to find problems. The developer is interested in making people’s troubles go away. Testers have the socially challenging job of finding and reporting on trouble wherever they look. Unlike anyone else on the project, testers focus on revealing problems that are unsolved, or problems introduced by our proposed solution. That’s a focus which the builders, by nature, tend to resist.

Resistance to thinking about problems plays out in many unhelpful and false ideas. Some people believe that the only kind of bug is a coding error. Some think that the only thing that matters is meeting the builders’ intentions for the product. Some are sure that we can find all the important problems in a product by writing mechanistic checks of the build. Those ideas reflect the natural biases of the builder—the optimist. Those ideas make it possible to imagine that testing can be automated.

The false and unhelpful idea that testing can be automated prompts the division of testing into “manual testing” and “automated testing”.

Listen: no other aspect of software development (or indeed of any human social, cognitive, intellectual, critical, analytical, or investigative work) is divided that way. There are no “manual programmers”. There is no “automated research”. Managers don’t manage projects manually, and there is no “automated management”. Doctors may use very powerful and sophisticated tools, but there are no “automated doctors”, nor are there “manual doctors”, and no doctor would accept for one minute being categorized that way.

Testing cannot be automated. Period. Certain tasks within and around testing can benefit a lot from tools, but having machinery punch virtual keys and compare product output to specified output is no more “automated testing” than spell-checking is “automated editing”. Enough of all that, please.

It’s unhelpful to lump all non-mechanistic tasks in testing together under “manual testing”. Doing so is like referring to craft, social, cultural, aesthetic, chemical, nutritional, or economic aspects of cooking as “manual” cooking. No one who provides food with care and concern for human beings—or even for animals—would suggest that all that matters in cooking is the food processors and the microwave ovens and the blenders. Please.

If you care about understanding the status of your product, you’ll probably care about testing it. You’ll want testing to find out if the product you’ve got is the product you want. If you care about that, you need to understand some important things about testing.

If you want to understand important things about testing, you’ll want to consider some things that commonly get swept under a carpet with the words “manual testing” repeatedly printed on it. Considering those things might require naming some aspects of testing that you haven’t named before.

Think about experiential testing, in which the tester’s encounter with the product, and the actions that the tester performs, are indistinguishable from those of the contemplated user. After all, a product is not just its code, and not just virtual objects on a screen. A software product is the experience that we provide for people, as those people try to accomplish a task, fulfill a desire, enjoy a game, make money, converse with people, obtain a mortgage, learn new things, get out of prison…

Contrast experiential testing with instrumented testing. Instrumented testing is testing wherein some medium (some tool, technology, or mechanism) gets in between the tester and the naturalistic encounter with and experience of the product. Instrumentation alters, or accelerates, or reframes, or distorts; in some ways helpfully, in other ways less so. We must remain aware of the effects, both desirable and undesirable, that instrumentation brings to our testing.

Are you saying “manual testing”? You might be referring to the attended or engaged aspects of testing, wherein the tester is directly and immediately observing and analyzing aspects of the product and its behaviour in the moment that the behaviour happens. And you might want to contrast that with the algorithmic, unattended things that machines do—things that some people label “automated testing”—except that testing cannot be automated. To make something a test requires the design before the automated behaviour, and the interpretation afterwards. Those parts of the test, which depend upon human social competence to make a judgement, cannot be automated.

Are you saying “manual”? You might be referring to testing activity that’s transformative, wherein something about performing the test changes the tester in some sense, inducing epiphanies or learning or design ideas. Contrast that with procedures that are transactional: rote, routine, box-checking. Transactional things can be done mechanically. Machines aren’t really affected by what happens, and they don’t learn in any meaningful sense. Humans do.

Did you say “manual”? You might be referring to exploratory work, which is interestingly distinct from experiential work as described above. Exploratory—in the Rapid Software Testing namespace at least—refers to agency; who or what is in charge of making choices about the testing, from moment to moment. There’s much more to read about that.

Wait… how are experiential and exploratory testing not the same?

You could be exploring—making unscripted choices—in a way entirely unlike the user’s normal encounter with the product. You could be generating mounds of data and interacting with the product to stress it out; or you could be exploring while attempting to starve the product of resources. You could be performing an action and then analyzing the data produced by the product to find problems, at each moment remaining in charge of your choices, without control by a formal, procedural script.

That is, you could be exploring while encountering the product to investigate it. That’s a great thing, but it’s encountering the product like a tester, rather than like a user. It might be a good idea to be aware of the differences between those two encounters, to take advantage of them, and not to mix them up.

You could be doing experiential testing in a highly scripted, much-less-exploratory kind of way; for instance, following a user-targeted tutorial and walking through each of its steps to observe inconsistencies between the tutorial and the product’s behaviour. To an outsider, your encounter would look pretty much like a user’s encounter; the outsider would see you interacting with the product in a naturalistic way, for the most part—except for the moments where you’re recording observations, bugs, issues, risks, and test ideas. But most observers outside of testing’s form of life won’t notice those moments.

Of course, there’s overlap between those two kinds of encounters. A key difference is that the tester, upon encountering a problem, will investigate and report it. A user is much less likely to do so. (Notice this phenomenon while trying to enter a link from LinkedIn’s Articles editor; the “apply” button isn’t visible, and hides off the right-hand side of the popup. I found this while interacting with LinkedIn experientially. I’d like to hope that I would have found that problem when testing intentionally, in an exploratory way, too.)

There are other dimensions of “manual testing”. For a while, we considered “speculative testing” (“what if?”) as something that people might mean when they spoke of “manual testing”. We contrasted that with “demonstrative” testing—but then we reckoned that demonstration is not really a test at all. Not intended to be, at least. For an action to be testing, we would hold that it must be mostly speculative by nature.

And here’s the main thing: part of the bullshit that testers are being fed is that “automated” testing is somehow “better” than “manual” testing because the latter is “slow and error prone”—as though people don’t make mistakes when they apply automation to checks. They do, and the automation enables those errors at a much larger and faster scale.

Sure, automated checks run quickly; they have low execution cost. But they can have enormous development cost; enormous maintenance cost; very high interpretation cost (figuring out what went wrong can take a lot of work); high transfer cost (explaining them to non-authors).

There’s another cost, related to these others. It’s very well hidden and not reckoned: we might call it interpretation cost or analysis cost. A sufficiently large suite of automated checks is impenetrable; it can’t be comprehended without very costly review. Do those checks that are always running green even do anything? Who knows?

Checks that run red get frequent attention, but a lot of them are, you know, “flaky”; they should be running green when they’re actually running red. Of the thousands that are running green, how many should be actually running red? It’s cognitively costly to know that—so people routinely ignore it.

And all of these costs represent another hidden cost: opportunity cost; the cost of doing something such that it prevents us from doing other equally or more valuable things. That cost is immense, because it takes so much time and effort to automate GUIs when we could be interacting with the damned product.

And something even weirder is going on: instead of teaching non-technical testers to code and get naturalistic experience with APIs, we put such testers in front of GUIish front-ends to APIs. So we have skilled coders trying to automate GUIs, and at the same time, we have non-programming testers, using Cypress to de-experientialize API use! The tester’s experience of an API through Cypress is enormously different from the programmer’s experience of trying use the API.

And none of these testers are encouraged to analyse the cost and value of the approaches they’re taking. Technochauvinism (great word; read Meredith Broussard’s book Artificial Unintelligence) enforces the illusion that testing software is a routine, factory-like, mechanistic task, just waiting to be programmed away. This is a falsehood. Testing can benefit from tools, but testing cannot be mechanized.

Testing must be seen as a social (and socially challenging), cognitive, risk-focused, critical (in several senses), analytical, investigative, skilled, technical, exploratory, experiential, experimental, scientific, revelatory, honourable craft. Not “manual” or “automated”. Let us urge that misleading distinction to take a long vacation on a deserted island until it dies of neglect.

Testing has to be focused on finding problems that hurt people or make them unhappy. Why? Because the optimists who are building a product tend to be unaware of problems, and those problems can lurk in the product. When the builders become aware of those problems, they can address them. In doing so, they make themselves look good, make money, and help people have better lives.


Rapid Software Testing (for American daytimes and European evenings) runs September 13-16, 2021. Register now! https://www.eventbrite.ca/e/rapid-software-testing-explored-online-american-days-european-evenings-tickets-151562011055

Exact Instructions vs. Social Competence

July 5th, 2021

An amusing video from a few years back has been making the rounds lately. Dad challenges the kids to write exact instructions to make a peanut butter and jelly sandwich, and Dad follows those instructions. The kids find the experience difficult and frustrating, because Dad interprets the “exact” instructions exactly—but differently from the way the kids intended. I’ll be here when you get back. Go ahead and watch it.

Welcome back. When the video was posted in a recent thread on LinkedIn, comments tended to focus on the need for explicit documentation, or more specific instructions, or clear direction.

In Rapid Software Testing, we’d take a different interpretation. The issue here is not that instructions are unclear, or that the kids have expressed themselves poorly. Instead, we would emphasize that communicating clearly, describing intentions explicitly, and performing actions appropriately all rely on tacit knowledge—knowledge that has not been made explicit. In that light, the kids did a perfectly reasonable job with the assignment.

Notice that the kids do not describe what peanut butter is; they do not have to tell the father that one must twist the lid on the peanut butter jar to open it; nor do they have to explain that the markings on the paper are words representing their intentions. The father has sufficient tacit knowledge to be aware of those things. At a very young age, through socialization, observation, imitation, and practice, the dad acquired the tacit knowledge required to open peanut butter jars, to squeeze jelly dispensers without crushing them, to use butter knives to deliver peanut butter from jar to bread, to make reasonable inferences about what the “top” of the bread is, and so forth.

Even though he has sufficient tacit knowledge to interpret instructions for making a peanut butter and jelly sandwich, the dad pretends that he doesn’t. What makes the situation in the video funny for us and exasperating for the kids is our own tacit knowledge of things the father presumably should know as a normal American dad in a normal American kitchen. In particular, we’re aware that he should be able to interpret the instructions competently; to repair differences between the actions the kids intended him to take and the ones he chose to take.

In certain circles, there is an idea that “better requirements documents” or “clear communication” or “eliminating ambiguity” are royal roads to better software development and better testing. Certainly these things can help to some degree, but organizing teams and building products requires far more than explicit instructions. It requires the social context and tacit knowledge to interpret things appropriately. Dad misinterpreted on purpose. Development and testing groups can easily misinterpret by accident; unintentionally; obliviously.

Where do explicit instructions come from? Would they be any good if they weren’t rooted in knowledge about the customers’ form of life, and knowledge of the problems that customers face—the problems that the product could help to solve? Could they be expressed more concisely and more reliably when everyone involved had shared sets of feelings and mental models? And would exact instructions help if the person (or machine) receiving them didn’t have the social competence to interpret them appropriately?

In RST, we would hold that it’s essential for the tester to become immersed in the world of the product and in the customers’ forms of life to the greatest degree possible—a topic for posts to come.

Testers: Focus on Problems

June 16th, 2021

A tester writes:

“I’m testing an API. It accepts various settings and parameters. I don’t know how to get access to these settings from the API itself, so I’m stuck with modifying them from the front end. Moreover, some responses to specific requests are long and complicated, so given that, I have no idea how to test it! Online examples of API testing tend to focus on checking the response’s status code, or verification of the schema, or maybe for correctness of a single string. How can I make sure that the whole response is correct?”

My reply:

You can’t.

That reply may sound disconcerting to many testers, but it’s true. The good news is that there’s a practical workaround when you focus less on demonstrating correctness and more on discovering problems.

Among other things, being unsure whether something is correct suggests problems on its own. For instance, your question reveals an immediate problem: you’re not familiar enough with the product or how to test it. Not yet. And, I hasten to point out: that’s okay.

Being unsure about how to test something is always the case to some degree. As you’re learning to test, and as you’re learning to test a product that’s new to you, some uncertainty and confusion is normal. Don’t worry about that too much. To test the product you must learn how to test it. You learn how to test the product by trying to test the product—and by reporting and discussing the problems you encounter. Learning is an exploratory process. Apply an exploratory approach to discover problems.

For instance: I can see from your question that you’ve already discovered a problem: you’ve learned that your testing might be harder or slower without some kind of tool support that allows you to set options and parameters quickly and conveniently. Report that problem. Solving it might require some kind of help from the designers and developers in supplying APIs for setup and configuration.

If those APIs don’t exist, that’s a problem: the intrinsic testability of your product is lower than it could be. When testing is harder or slower, given the limited time you have to test, it’s also shallower. The consequence of reduced testability is that you’re more likely to miss bugs. It would be a good idea to make management aware of testability problems now. Report those problems. If you do so, you’ll already have given at least part of an answer when management inevitably asks, perhaps much later, “Why didn’t you find that bug?”

Moreover, if those setup and configuration APIs don’t exist, there’s a good chance that it’s not only a problem for you; it will probably be a problem for people who want to maintain and develop the product, and for people who want to use it. Report that problem too.

If those APIs do exist, and they’re not described somehow or somewhere, that’s a problem for you right now, but sooner or later it will be a problem for others. Inside developers who maintain the product now and in the future need to understand the API, and outside developers who want to use the product through the API need to be able to understand it quickly and easily too. Without description, the API is of severely limited usefulness. Report that problem. Mind you, missing documentation is a problem that you can help to address while testing.

If there is a description of the API, but the description is inaccurate, or unclear, or out of date, you’ll soon find that it doesn’t match the product’s behaviour. That’s a problem for several reasons: for you, the tester, it won’t be clear whether the product or the documentation is correct; inside developers won’t know whether or how to fix it; and outside developers will find that, from their perspective, the product doesn’t work, and they’ll be confused as to whether there’s an error in the documentation or a bug in the API. Report that problem.

If those APIs do exist and they’re documented but you don’t know how to design or perform tests, or how to analyse results efficiently (yet), that’s a problem too: subjective testability is low. The quality of your testing depends on your knowing the product, and the product domain, and the technology. That knowledge doesn’t come immediately. It takes working with something and with the people who built it to know how to test it properly, to learn how deeply it needs to be tested, to develop ideas about risk, and to recognize hidden, subtle, rare, intermittent, emergent problems.

To learn the product, you’ll probably need to be able to talk things over with the developers and your other testing clients, but that’s not all. To learn the product well, you’ll need experience with it. You’ll need to engage with it and interact with it. You’ll need to develop mental models of it. You’ll need to anticipate how people might use it, and how they might have trouble with it. You must play, feel your way around, puzzle things out, and be patient as you encounter and confront confusion. The confusion lifts as you immerse yourself in the world of the product.

However, management needs to know that that learning time is necessary. While you’re learning about where and how to find deep bugs, you won’t find them deliberately. At first, you’ll tend to stumble over bugs accidentally, and you might miss important bugs that are right in front of you. Again, that’s normal and natural until you’ve learned the product, have figured out how to test it, and have become comfortable with your oracles and your tool sets.

Hold up — what’s an oracle? An oracle is a means by which we recognize a problem when we encounter one in testing. And this is where we return to issues around correctness.

After making an API call as part of a test, you can definitely match elements of the response with a reference—an example, or a list, or a table, or a bit of JSON that someone else has provided, or that you’ve developed yourself. You could compare elements in the response individually, one by one, and confirm that each one seems to be consistent with your reference. You could observe these things directly, with your own eyes, or you could write code to mediate your observation.

If you see some inconsistency, you can suspect that there’s a problem, and report that. If each element in the output matches its counterpart in the reference, no elements seem to be missing, and there don’t seem to be any extra elements in the output, you can assert that the response appears to be correct, and from that you can infer that the response is correct. But even then, you can’t make sure that the response is correct.
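
Here’s a minimal sketch, in Python, of that kind of element-by-element comparison. The field names, values, and reference data are all invented for illustration; a script like this merely mediates your observation, and, as the paragraphs below suggest, it can’t settle the question of correctness on its own.

```python
import json

# Hypothetical reference: these names and values stand in for whatever
# example, list, table, or bit of JSON you're using as a reference.
reference = {
    "id": 42,
    "status": "active",
    "items": ["alpha", "beta"],
}

# Hypothetical response body, as you might have received it from the API.
response_body = json.loads(
    '{"id": 42, "status": "inactive", "items": ["alpha", "beta"], "extra": true}'
)

# Elements in the reference that are missing from, or different in, the response.
for key, expected in reference.items():
    if key not in response_body:
        print(f"possible problem: '{key}' is missing from the response")
    elif response_body[key] != expected:
        print(f"possible inconsistency: '{key}' is {response_body[key]!r}; "
              f"the reference says {expected!r}")

# Elements in the response that the reference doesn't mention.
for key in response_body:
    if key not in reference:
        print(f"worth a look: unexpected element '{key}' in the response")
```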

One issue here is that correctness of output is always relative to something, and to someone’s notion of consistency with that something. You could assert that the response seems to be consistent with the developers’ intentions, to the degree that you’re aware of those intentions, and to the degree that your awareness is up to date. Of course, the developers’ intentions could be inconsistent with what the project manager wanted, and all of that could be inconsistent with what some customer would perceive as correct. Correctness is subject to the Relative Rule: that is, “correct” really means “correct to some person, at some time”.

If you don’t notice a problem, you can truthfully say so; you didn’t notice a problem. That doesn’t mean that the product is correct. Correctness can refer to an overwhelming number of factors.

Is the output well-formed and syntactically correct? Is the output data accurate? Is it sufficiently precise? Is it overly precise? Are there any missing elements? Are there any extra elements? Has some function changed the original source data (say, from a database) while transforming it to an API response? Was the source data even correct to begin with?

Did the response happen in a sufficiently timely way? The output seemed to be correct, but is the system still in a stable state? The output appeared correct this time; will it be correct next time? Was the response logged properly? If there was an error, was an appropriate, helpful error message returned, such that it could be properly understood by an outside programmer? In terms of the questions we could ask, we’re not even getting started here.

Answering all such questions on every test is impractical, even with tool assistance. You could write code to check a gigantic number of conditions for an enormous number of outputs, based on a multitude of conditions for a huge set of inputs. That would result in an intractable amount of test output data to analyze, and would require a tremendous amount of code—and writing, debugging, and maintaining all that code would be harder than writing, debugging, and maintaining the code for the product itself.

The effort and expense wouldn’t be worthwhile, either. After something has been tested to some degree (for example, by developers’ low-level unit checks, or a smattering of integration checks, or by some quick direct interaction with the product), risk may be sufficiently low that we can turn our attention to higher-risk concerns. If that risk isn’t sufficiently low (perhaps because the developers haven’t been given time or resources to develop a reasonable understanding of the effects of changes they’re making), more testing on your part is unlikely to help. Report that problem.

So rather than focusing on correctness, I’d recommend that you focus on problems and risk instead; and that you report on anything that you believe could plausibly be a problem for some person who matters. There are (at least) three reasons for this; one social, one psychological, and one deeply practical.

The social reason is important, but sometimes a little awkward. The tester must be focused on finding problems because most of the time, no one else on the project is focused on risk and on finding problems. Everyone else is envisioning success.

Developers and designers and managers have the job of building useful products that make people’s problems go away. The tester’s unique role is to notice when the product is not solving those problems, and to notice when the product is introducing new problems.

This puts the tester at a different perspective from everyone else, and the tester’s point of view can seem disruptive sometimes. After all, not very many people love hearing about problems. The saving grace for the tester is that if there are problems with the product, it’s probably better to know about them before it’s too late, and before those problems are inflicted on customers.

The psychological reason to focus on problems is that if you focus on correctness, that’s what you’ll tend to see. You will feel an urge to show that the product is working correctly, and that you can demonstrate success. Your mind will not be drawn to problems, and you’ll likely miss a bunch of them.

And here’s the practical reason to focus on problems: if you focus on risk and search diligently for problems and you don’t find problems, you’ll be able to make reasonable inferences about the correctness part for free!

Unlimited Charges

May 17th, 2021

I noticed something interesting while reviewing my credit card bills a couple of evenings ago: monthly charges for $9.99 from “Amazon Downloads”, going back several months.

I buy a lot of e-books. I looked for receipts from Amazon in email. I found a bunch, but none from Amazon for $9.99. I never delete email receipts; I put all of them into a separate folder so that I can collate them and have the relevant ones as records of business expenses at tax time. No receipts for $9.99 on or near the associated dates.

I did a Google search for “Amazon Downloads charge 9.99”; it autocompleted before I got to the 9.99 part. It seems that I’m not alone.

It turns out that Amazon Kindle Unlimited subscriptions get started in a very subtle and hard-to-notice way, and then automatically renew monthly, without receipts being issued. Other auto-renewing services do provide notification each billing cycle; Amazon doesn’t, so it seems.

And I found it intriguing how quickly they processed a refund when I pointed this out.

Within moments of tweeting about this, I heard from another fellow who had the same experience. Indeed I’m not alone.

This really bugs me. On the other hand, by Amazon’s lights, this is not a bug; the system is doing what it is designed to do. I would like to believe that if I were a tester at Amazon, I would have noted the subtlety of the auto-renewal option and the absence of monthly receipts as Severity-1 problems, inconsistent with an image that the organization would like to project or defend. After this experience, and feeling like I’ve been duped, I hope I would be more likely to notice and report these problems. And I’d like to believe that a tester at Amazon did just that, but I don’t know; I can’t know.

I can be more certain that management has not so far seen this as a problem, and I have a problem with that. Meanwhile, lesson learned; keep reviewing those credit card purchases diligently every month.

“Manual Testing”: What’s the Problem?

April 27th, 2021

I used to speak at conferences. For the HUSTEF 2020 conference, I had intended to present a talk called “What’s Wrong with Manual Testing?” In the age of COVID, we’ve all had to turn into movie makers, so instead of delivering a speech, I delivered a video instead.

After I had proposed the talk, and it was accepted, I went through a lot of reflection on what the big deal really was. People have been talking about “manual testing” and “automated testing” for years. What’s the problem? What’s the point? I mulled this over, and the video contains some explanations of why I think it’s an important issue. I got some people — a talented musician, an important sociologist, a perceptive journalist and systems thinker, a respected editor and poet, and some testers — to help me out.

In the video, I offer some positive alternatives to “manual testing” that are much less ambiguous, more precise, and more descriptive of what people might be talking about: experiential testing (which we could contrast with “instrumented testing”); exploratory testing (which we have already contrasted with “scripted testing”); attended testing (which we could contrast with “unattended testing”); and there are some others. More about all that in a future post.

I also suggest how it came to be that important parts of testing — the rich, cognitive, intellectual, social process of evaluating a product by learning about it through experiencing, exploring and experimenting — came to be diminished and pushed aside by an obsessive, compulsive fascination with automated checking.

But there’s a much bigger problem that I didn’t discuss in the video.

You see, a few days before I had to deliver the video, I was visiting an online testing forum. I read a question from a test manager who wanted to interview and qualify “manual testers”. I wanted to provide a helpful reply, and as part of that, I asked him what he meant by “manual testing”. (As I do. A lot of people take this as being fussy.)

His reply was that he wanted to identify candidates who don’t use “automated testing” as part of their tool set, but who would be given the job of creating and executing manually scripted, human-language tests, and applying all the critical thinking skills that both approaches require.

(Never mind the fact that testing can’t be automated. Never mind that scripting a test is not what testing is all about. Never mind that no one even considers the idea of scripting programmers, or management. Never mind all that. Wait for what comes next.)

Then he said that “the position does not pay as much as the positions that primarily target automated test creation and execution, but it does require deeper engagement with product owners”. He went on to say that he didn’t want to get into the debate about “manual and automated testing”; he said that he didn’t like “holy wars”.

And there we have it, ladies and gentlemen; that’s the problem. Money talks. And here, the money—the fact that these testers are going to be paid less—is implicitly suggesting that talking to machines is more valuable, more important, than deeper engagement with people.

The money is further suggesting that skills stereotypically associated with men (who are over-represented in the ranks of programmers) are worth more than skills stereotypically associated with women (who are not only under-represented but also underpaid and also pushed out of the ranks of programmers by chauvinism and technochauvinism). (Notice, by the way, that I said “stereotypically” and not “justifiably”; there’s no justification available for this.)

Of course, money doesn’t really talk. It’s not the money that’s doing the talking.  It’s our society, and people within it, who are saying these things. As so often happens, people are using money to say things they dare not speak out loud.

This isn’t a “holy war” about some abstract, obscure point of religious dogma. This is a class struggle that affects very real people and their very real salaries. It’s a struggle about what we value. It’s a humanist struggle. And the test manager’s statement shows that the struggle is very, very real.

Suggestions for the (New) Testers

April 23rd, 2021

A friend that I’m just getting to know runs a training and skills development program for new testers. Today he said, “My students are now starting a project which includes test design, test techniques, and execution of testing. Do you have any input or advice for them?” Here’s my reply.

Test design, test techniques, and execution of testing are all good things. I’d prefer performing tests to “test execution”. In that preference, I’m trying to emphasize that a test is a performance, by an engaged person who adapts to what he or she is experiencing. “Test execution” sounds more like following a recipe, or a programmed set of instructions.

Of these things, my advice is to perform testing first. But that advice can be a little confusing to people who believe that testing is only operating some (nearly) finished product in a search for coding errors. In Rapid Software Testing, we take a much more expansive view: testing is the process of evaluating a product by learning about it through experiencing, exploring and experimenting, which includes to some degree questioning, studying, modeling, observation, inference, etc.

Testing includes analysis of the product, its domain, the people using it, and risk related to all of those. Testing includes critical thinking and scientific thinking. Testing includes performing experiments—that is, tests—all the way along. But I emphasized the learning part just back there, because testing starts with learning, ends with reporting what we’ve learned, feeds back into more learning, and is about learning every step of the way.

We learn most powerfully from experiencing, exploring, and experimenting; from performing experiments; from performing tests. So, my advice to the new tester is to start with performing tests to study the product, without focusing too much on test design and test techniques, at first.

Side note: the “product” that you’ve been asked to test may not be a full, working, running piece of software. It may be a feature or component or function that is a part of a product. It may be a document, or a design drawing, a diagram, or even an idea for a product or feature that you’re being asked to review. In the latter cases, “performing a test” might mean the performance of a thought experiment. That’s not the same as the real-world experience of the running product, hence the quotes around “performing a test”. A thought experiment can be a great and useful thing to help nip bugs in the bud, before bugs in an idea turn into bugs in a product. But if we want to determine the real status of the real product, we’ll need to perform real testing on the real product.

So: learn the product (or feature, or design, or document, or idea), and identify how people might get value from it. Survey the product to identify its functions, features, and interfaces. Explore the product, and gain experience with it by engaging in a kind of purposeful play. Don’t look for bugs, particularly—not right away. Look for benefits. Look for how the product is intended to help people get their work done, to help them to communicate with other people, to help them to get something they want or need, to help them to have fun. Try doing things with the product—accomplishing a task, having a conversation, playing the game.

Record your thoughts and ideas and feelings reasonably thoroughly. Pay attention to things that surprise you, or that trigger your interest, or that prompt curiosity. Note things that you find confusing, and notice when the confusion lifts. If you have been learning the product for a while, and that confusion hasn’t gone away, that’s significant; it means there’s something confusing going on. If you get ideas about potential problems (that is, risks), note those. If you get ideas for designing tests, or applying tools, note those too.

Capture what you’re learning in point form, or in mind maps, or in narratives of what you’re doing. Sketches and diagrams can help too. Don’t make your notes too formal; formality tends to be expensive, and it’s premature at this stage. It might be a good idea to test with someone else, with one person focusing on interacting with the product, and the other minding the task of taking notes and observations. Or you might choose to narrate and record your survey of the product on video to review later on; or to use it like the black boxes on airplanes, to figure out what led to problems or crashes.

You’ll probably see some bugs right away. If you do, note them quickly, but don’t investigate them. If you spot a bug this easily, this early, and you take a quick note about it, you’ll almost certainly be able to see the bug again later. Investigating shallow bugs is not the job at the moment. The job right now is to develop your mental model of the product, so that you become prepared to find bugs that are more subtle, more deeply hidden, and potentially much more important or damaging.

Identify the people who might use the product… and then consider other groups of people you might have forgotten. That would include novice users of the product; expert users of the product; experts in the product domain who are novice users of the product; impatient users; plodding users; users under pressure; disabled users… Consider the product in terms of things that people value: capability, reliability, usability, charisma, security, scalability, compatibility, performance, installability… (As a new tester, or a tester in training, you might know these as quality criteria.)

You might also want to survey the product from the perspective of people who are not users as such, but who are definitely affected by the product: customer support people; infrastructure and operations people; other testers (like testing toolsmiths, or accessibility specialists); future testers; current developers; future developers… Think in terms of what they might value from the product: supportability, testability, maintainability, portability, localizability. (These are quality criteria too, but they’re focused more on benefits to the internal organization than on direct benefits to the end user.)

Refine your notes. Create lists, mind maps, tables, sketches, diagrams, flowcharts, stories… whatever helps you to reflect on your experience.

Share your findings with other people in the test or development (or in this case, study) group. That’s very important. It’s a really good way to share knowledge, to de-bias ourselves, and to reveal things that we might have forgotten, ignored, or dismissed too quickly.

Have these questions in mind as you go: What is this that we’re building? Who are we building it for? How would they get value from it? As time goes by, you’ll start to raise other questions: What could go wrong? How would we know? How might people’s value be threatened or compromised? How could we test this? How should we test this? Then you’ll be ready to make better choices about test design, and applying test techniques.

Of course, this isn’t just advice for the new tester. It applies to anyone who wants to do serious testing. Testing that starts by reading a document and leaps immediately to creating formal, procedurally scripted test cases will almost certainly be weak testing, uninformed by knowledge of the product and how people will engage with it. Testing that starts with being handed some API documentation and leaps to the creation of automated checks for correct results will miss lots of problems that programmers will encounter—problems that we could discover if we try to experience it the way programmers—especially outside programmers—will.

As we’re developing the product, we’re learning about it. As we’re learning the product, we’re developing ideas about what it is, what it does, how people might use it, and how they might get value from it, and that learning feeds back into more development. As we develop our understanding of the product more deeply, we can be much better prepared to consider how people might try to use it unsuccessfully, how they might misuse it, and how their value might be threatened. That’s why it’s important, I believe, to perform testing first—to prepare ourselves better for test design and for identifying and applying test techniques—so we can find better bugs.

This post has been greatly influenced by ideas on sympathetic testing that came to me—over a couple of decades—from Jon Bach, James Bach, and Cem Kaner.

Evaluating Test Cases, Checks, and Tools

April 11th, 2021

For testers who are being asked to focus on test cases and testing tools, remember this: a test case never finds a bug. The tester finds a bug, and the test case may play a role in finding the bug. (Credit to Pradeep Soundararajan for putting this so succinctly, all those years ago.)

Similarly, an automated check never finds a bug. The tester finds a bug, and the check may play a role in finding the bug.

A testing tool never finds a bug. The tester finds a bug, and the tool may play a role in finding the bug.

If you suspect that managers are putting too much emphasis on test cases, or automated checks, or testing tools—artifacts—try this:

Start a list.

Whenever you find a bug, make a quick note about the bug and how you found it. Next to that, put a score on the value of the artifact. Write another quick note to describe and explain why you gave the artifact that particular score.

Score 3 when you notice that an artifact was essential in finding the bug; there’s no way you could have found the bug without the artifact.

Score 2 if the artifact was significant in finding the bug; you could have found the bug, but the artifact was reasonably helpful.

Score 1 if the artifact helped, but not very much.

Score 0 if the artifact played no role either way.

Score -1 whenever you notice the artifact costing you some small amount of time, or distracting you somewhat.

Score -2 whenever you notice the artifact costing you significant time, or disrupting you from the task of finding problems that matter.

Score -3 whenever you notice that the artifact is actively preventing you from finding problems—when your attention has been completely diverted from the product, learning about it, and discovering possible problems in it, and has been directed towards the care and feeding of the artifact.

Notice that you don’t need to find a bug to offer a score. Pause your work periodically to evaluate your status and take a note. If you haven’t found a bug in the last little while, note that. In any case, every now and then, identify how long you’ve been on a particular thread of investigation using a test case, or a set of checks, or a tool. Evaluate your interaction with the artifact.
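
If you’d like to keep that list in a lightweight, sortable form, here’s a minimal sketch in Python; the entries, field names, and scores are invented for illustration, and a notebook or a spreadsheet would serve just as well.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    artifact: str   # the test case, check, or tool in question
    score: int      # -3 to 3, per the scale above
    note: str       # the part that actually matters during the review

log = [
    Entry("login smoke check", 2, "flagged the redirect bug before I got there"),
    Entry("regression suite X", -2, "spent the morning repairing locators instead of testing"),
    Entry("(no artifact)", 0, "found the export bug by noticing a pause while exploring"),
]

# The total is only a conversation starter; the notes are the real payload.
print("running total:", sum(entry.score for entry in log))
for entry in log:
    print(f"{entry.score:+d}  {entry.artifact}: {entry.note}")
```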

Periodically review the list with your manager and your team. The current total score might be interesting; if it’s high, that might suggest that your tools or test cases or other artifacts are helping you. If it’s low or negative, that might suggest that the tools or test cases or other artifacts are getting in your way.

Don’t take too long on the aggregate score; practically no time at all. It’s far more important to go through the list in detail. The more extreme numbers might be the most interesting. You might want to pay the greatest or earliest attention to the items with the lowest and highest scores, but maybe not. You might prefer to go through the list in order.

In any case, as soon as you begin your review of a particular item, throw away the score, because the score doesn’t really mean anything. It’s arbitrary. You could call it data, but it’s probably not valid data, and it’s almost certainly not reliable data. If people start using the data to control the decisions, eventually the data will be used to control you. Throw the score away.

What matters is your experience, and what you and the rest of the team can learn from it. Turn your attention to your notes and your experience. Then start having a real conversation with your manager and team about the bug, about the artifact or tool, and about your testing. If the artifact was helpful, identify how it helped, and how it might help next time, and how it could fool you if you became over-reliant on it. If the artifact wasn’t helpful, consider how it interfered with your testing, how you might improve or adjust it or whether you should put it to bed for a while or throw it away.

Learn from every discovery. Learn from every bug.

Related reading:

Assess Quality, Don’t Measure It

Flaky Testing

February 22nd, 2021

The expression “flaky tests” is evidence of flaky testing. No scientist refers to “flaky experimental results”. Scientists who observe inconsistency don’t dismiss it. They pay close attention to it, and probe it. They redesign their experiments or put better controls on them.

When someone refers to an automated check (or a suite of them) as a “flaky test”, the suggestion is that it represents an unreliable experiment. That assumption is misplaced. In fact, the experiment reliably shows that someone’s models of the product, check code, test environment, outcomes, theory, and the relationships between them are misaligned.

That’s not a “flaky experiment”. It’s an excellent experiment. The experiment is telling you something crucial: there’s something you don’t know. In science, a surprising, perplexing, or inconsistent result prompts scientists to begin an investigation. By contrast, in software, an inconsistent result prompts some people to shrug and ignore what the experiment is trying to tell them. Then they do weird stuff like calculating a “flakiness score”.

Of course, it’s very tempting psychologically to dismiss results that you can’t explain as “noise”, annoying pieces of red junk on your otherwise lovely all-green lawn. But a green lawn is not the goal. Understanding what the junk is, where it is, and how it gets there is the goal. It might be litter—or it might be a leaking container of toxic waste.

It’s not a great idea to perform a test that you don’t understand, unless your goal is to understand it and its relationship to the product. But it’s an even worse idea to dismiss carelessly a test outcome that you don’t understand. For a tester, that’s the epitome of “flaky”.

Now, on top of all that, there’s something even worse. Suppose you and your team have a suite of 100,000 automated checks that you proudly run on every build. Suppose that, of these, 100 run red. So you troubleshoot. It turns out that your product has problems indicated by 90 of the checks, but ten of the red results represent errors in the check code. No problem. You can fix those, now that you’re aware of the problems in them.

Thanks to the scrutiny that red checks receive, you have become aware that 10% of the outcomes you’re examining are falsely signalling failure when they are in reality successes. That’s only 10 “flaky” checks out of 100,000. Hurrah! But remember: there are 99,900 checks that you haven’t scrutinized. And you probably haven’t looked at them for a while.

Suppose you’re on a team of 10 people, responsible for 100,000 checks. To review those annually requires each person working solo to review 10,000 checks a year. That’s 50 per person (or 100 per pair) every working day of the year. Does your working day include that?

Here’s a question worth asking, then: if 10% of 100 red checks are misleadingly signalling a problem, what percentage of 99,900 green checks are misleadingly signalling “no problem”? They’re running green, so no one looks at them. They’re probably okay. But even if your unreviewed green checks are ten times more reliable than the red checks that got your attention (because they’re red), that’s 1%. That’s 999 misleadingly green checks.
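
The arithmetic here is easy to check. Here’s a back-of-the-envelope sketch in Python; the figure of 200 working days per year is my assumption, chosen to match the 50-reviews-per-day figure above.

```python
# Back-of-the-envelope arithmetic for the figures above.
total_checks = 100_000
red = 100
red_false_alarms = 10                    # red results that turn out to be errors in the checks
red_false_rate = red_false_alarms / red  # 0.10, i.e. 10%

green = total_checks - red               # 99,900 checks that nobody is scrutinizing

# Reviewing everything annually with a team of ten, assuming roughly
# 200 working days per year (my assumption, to match the figure above).
team, working_days = 10, 200
print("reviews per person per day:", total_checks / team / working_days)  # 50.0

# Even if the green checks are ten times more reliable than the red ones...
green_false_rate = red_false_rate / 10   # 1%
print("misleadingly green checks:", round(green * green_false_rate))      # 999
```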

Real testing requires intention and attention. It’s okay for a suite of checks to run unattended most of the time. But to be worth anything, they require periodic attention and review—or else they’re like smoke detectors, scattered throughout enormous buildings, whose batteries and states of repair are uncertain. And as Jerry Weinberg said, “most of the time, a nonfunctioning smoke alarm is behaviorally indistinguishable from one that works. Sadly, the most common reminder to replace the batteries is a fire.”

And after all this, it’s important to remember that most checks, as typically conceived, are about confirming the programmers’ intentions. In general, they represent an attempt to detect coding problems, and thereby to reduce the risk of programmers committing (pun intended) easily avoidable errors. This is a fine and good thing—mostly when the effort is targeted towards lower-level, machine-friendly interfaces.

Typical GUI checks, instrumented with machinery, are touted as “simulating the user”. They don’t really do any such thing. They simulate behaviours, physical keypresses and mouse clicks, which are only the visible aspects of using the product—and of testing. GUI checks do not represent users’ actions, which in the parlance of Harry Collins and Martin Kusch are behaviours plus intentions. Significantly, no one reduces programming or management to scripted and unmotivated keystrokes, yet people call automated GUI checks “simulating the user” or “automated testing”.

Such automated checks tell us almost nothing about how people will experience the product directly. They won’t tell us how the product supports the user’s goals and tasks—or where people might have problems getting what they want from the product. Automated checks will not tell us about people’s confusion or frustration or irritation with the product. And automated checks will not question themselves to raise concern about deeper, hidden risk.

More worrisome still: people who are sufficiently overfocused, fixated, on writing and troubleshooting and maintaining automated checks won’t raise those concerns either. That’s because programming automated GUI checks is hard, like all programming is hard. But programming a machine to simulate human behaviours via complex, ever-changing interfaces designed for humans instead of machines is especially hard. The effort easily displaces risk analysis, studying the business domain, learning about users’ problems, and critical thinking about all of that.

Testers: how much time and effort are you spending on care and feeding of scripts that represents distraction from interacting with the product and searching for problems that matter? How much more valuable would your coding be if it helped you examine, explore, and experiment with the product and its data? If you’re a manager, how much “testing” time is actually coding and fixing time, in which your testers are being asked to fuss with making the checks run green, and adapting them to ongoing changes in the product?

So the issue is not flaky tests, but flaky testing talk, and flaky test strategy. It’s amplified by referring to “flaky understanding” and “flaky explanation” and “flaky investigation” as “flaky tests”.

Some will object. “But that’s what people say! We can’t just change the language!” I agree. But if we don’t change the way we speak—and the way we think along with it—we won’t address the real flakiness, which is the flakiness in our systems, and the flakiness in our understanding and explanations of those systems. With determination and skill and perseverance, we can change this. We can help our clients to understand the systems they’ve got, so that they can decide whether those are the systems they want.

Learn how to focus on fast, inexpensive, powerful testing strategies to find problems that matter. Register for classes here.