Blog Posts for the ‘Oracles’ Category

Expected Results

Sunday, August 23rd, 2020

Klára Jánová is a dedicated tester who studies and practices and advocates Rapid Software Testing. Recently, on LinkedIn, she said:

I might EXPECT something to happen. But that doesn’t necessarily mean that I WANT IT/DESIRE for IT to happen. I even may want it to happen, but it not happening doesn’t have to automatically mean that there’s a problem.

The point of this post: no more “expected results” in the bug reports, please!

In reply, Derek Charles asked:

Then how else would you communicate to the developer or the team what is SUPPOSED to happen? I think that expected results are very necessary especially when regressions are found during testing.

Klára replied:

I suggest to describe the behavior that the tester recognizes as problematic and explain WHY it might be a problem for someone—the reasoning why the behavior is perceived as a bug—that’s what really matters.

Exactly so. Klára is referring here to problems and oracles—means by which we recognize problems when we encounter them in testing.

There’s an issue with the “what is supposed to happen” stuff: in development work, what is supposed to happen is not always entirely clear. Moreover, and more importantly, since testers don’t run the project or the business, we don’t mandate what is supposed to happen.

For instance, while testing, I may observe something in the product that I find confusing, or surprising, or wrong. When I look up the intended behaviour in the specification, it says one thing; the developer, claiming that the spec is out of date, contradicts it; and the product owner confirms that the spec is outdated. But she also says that the developer’s interpretation of what should happen is not what she wants him to implement. And then, when I consult an RFC, the product owner’s interpretation is inconsistent with what the RFC says should be the appropriate behaviour.

Fortunately, I don’t have to decide, and I don’t have to say what should happen. My job as a tester is to report on an apparent inconsistency between the product and presumably desirable things, or between the product and someone’s expressed desire or requirement. In the case above, I let the product owner know about the inconsistency between her interpretation and the standard, and she makes the call on what she and the business want from the product.

That is, even though I have certain expectations, I might be wrong about them and about what I think should be. For instance, she might decide that our product is not going to support that standard. She might point out that the standard I’m considering has been superseded by a later one. In any case, what is supposed to happen gets decided not by me, but by the people who run things. That’s what they’re paid for. This is a good thing, not a bad thing.

But still, I’d like to honour Derek’s question: as testers, how should we report a problem without referring to “expected results”?

  • Instead of saying “expected result” and leaving it at that, we could say “inconsistent with the specification”.

    Inconsistency with the specification is a special case of a more general way of recognizing and describing a problem: inconsistency with claims. “Inconsistency with claims” is an oracle heuristic. (A heuristic is a fallible means for solving a problem; an oracle is a special kind of heuristic which, fallibly, helps you to solve the problem of identifying and describing a bug.) When a product is inconsistent with a claim that someone important makes about it, there’s likely a problem, either with the product or the claim. As a tester, I don’t have to decide which.

    The specification is a particular form of a claim that someone is making about what the product is like, or what it should be like. Claims can be made in design sessions, planning meetings, pair programming, hallway conversations, training workshops… Claims can be represented in help files, marketing materials, workflow diagrams, lookup tables, user manuals, whiteboard sketches, UML diagrams… Claims can also be represented in the code of an automated check, where someone has written code to compare the output of the product with an anticipated and presumably desirable result. Recognizing many sources of claims and inconsistencies with them makes us more powerful testers.

    Whatever relevant claim you’re referring to, having said “inconsistent with a claim” (and having identified the nature of the claim, and where or whom it comes from), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistent with how the product used to work”.

    Inconsistency with history is an oracle heuristic. After a change, the product might have a new bug in it. On the other hand, the product might have been wrong all along, and now it’s right. (This is an example of how oracles can mislead us or conflict with each other, which is why it’s a good idea to identify the oracles we’re applying in problem reports.) If you (or others) aren’t aware of why the desirable change was made, that’s a different kind of problem, but a problem nonetheless.

    Either way, having said “inconsistent with how the product used to work” (and having described that in terms of a problem), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistent with respect to the product itself”.

    Inconsistency within the product is an oracle heuristic. This can take a number of forms: the product might return inconsistent results from one run to the next; the product could afford a tidy, smooth interface in one place, and a frustrating, confusing interface in another; the product could present output very precisely in one part of the product, and imprecisely in another; one component in the product could log output using one format, while another component’s log output is in a different format, which makes analysis more difficult…

    The inconsistency might be undesirable (because of a reliability problem), or it might be completely desirable (a Web page for a newspaper should change from day to day), or it might be desirable or undesirable in ways that you’re not aware of (since, like me, you probably don’t know everything).

    In general, people tend to prefer things that present themselves in a consistent way. Here’s a trivial example from Microsoft Office (Office 365, these days): to search for text in Word, the keyboard command is Ctrl-F. In Outlook, part of the same product suite, Ctrl-F triggers the Forward Message action instead; F4 triggers a search. Had Outlook and Word been designed by the same teams at the same time, this probably would have been identified as a bug, and addressed. In the end, the Office suite’s program managers decided that consistency with history dominated inconsistency within the product, and now we all have to live with that. Oh well.

    In any case, having said “inconsistent with respect to some aspect of the same product” (and having identified the specifics of the inconsistency), you don’t need to say “expected result”.

  • Instead of saying “expected result” and leaving it at that, you could say “inconsistency with a comparable product” (and identify the product, and the nature of the inconsistency).

    Inconsistency with a comparable product is an oracle heuristic. Any product (something that someone has produced) that provides a relevant point of comparison is, by definition, a comparable product. That includes competitive products, of course; Microsoft Word and Google Docs are comparable products, in that sense. Microsoft Word and WordPad are comparable products too; they have many features in common. If Word can’t open an .RTF file generated by WordPad, we have reason to suspect a problem in one product or the other. If WordPad prints an RTF file properly, and Word does not, we have reason to suspect a problem in Word.

    Is the Unix program wc (wc stands for “word count”) a comparable product to Microsoft Word? All wc does is count words in text files, so no, except… Word has a word-counting feature. If Word’s calculation for the number of words in a text file is inexplicably different from wc’s count, we have reason to suspect a problem in one product or the other.

    Test tools and suites of automated output checks represent comparable products too. If the output from your product is inconsistent with the specified and desired results provided by your test tool, or with some data that it processes to produce such results, you have reason to suspect a problem somewhere.

    In any case, having said “inconsistent with a comparable product”, and having identified the product and the basis for comparison, you don’t need to say “expected result”.
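
As a rough illustration of the comparable-product idea, here is a minimal Python sketch that compares two word counters on the same text. Everything in it is an assumption made for illustration: `reference_word_count` only approximates what `wc -w` reports (whitespace-delimited tokens), and `product_word_count` is a hypothetical stand-in for a product's counter, not how Word or any real product counts words. The comparison reports a disagreement; it does not decide which side is wrong.

```python
import re


def reference_word_count(text: str) -> int:
    """Count whitespace-delimited tokens, roughly what `wc -w` reports."""
    return len(text.split())


def product_word_count(text: str) -> int:
    """Hypothetical stand-in for the product's own counter.

    Assumption: only runs of letters, digits and apostrophes count as
    words, so hyphenated terms are split and lone punctuation is skipped.
    """
    return len(re.findall(r"[A-Za-z0-9']+", text))


def compare_counts(text: str) -> dict:
    """Return both counts and whether they disagree; judge nothing."""
    reference = reference_word_count(text)
    product = product_word_count(text)
    return {
        "reference": reference,
        "product": product,
        "inconsistent": reference != product,
    }
```

Calling `compare_counts("A well-known editor - or so I'm told - counts words.")` flags an inconsistency, because the two counters treat the hyphenated word and the stand-alone dashes differently. That is a prompt for investigation and a conversation, not a verdict.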

Those are just a few examples. When we teach Rapid Software Testing, we offer a set of oracle heuristics that identify principles of desirable (and undesirable) consistency (and inconsistency) for identifying bugs; you can read more about those here.

James Bach has recently identified another principle that might apply to bugs but that, in my view, more powerfully applies to enhancement requests: we desire the product to be consistent with acceptable quality: that is, not only good, but every bit as good as it can be.

Why is all this a big deal? Several reasons, I think.

First, “expected result” begs the question of where the expectation comes from. It’s just a middleman for something we could say more specifically. Why not get to the point and say it while at the same time sounding like a pro? Because…

Second, being specific about where the expectation comes from saves time and focuses conversation on the (un)desirable (in)consistencies that matter when developers and product owners are deciding whether something is a bug worth fixing. It also helps to focus repair in the appropriate claim (for example, if the product is right and the spec is wrong, it’s a prompt to repair the spec).

Third, it helps for us to remember that our job as testers is not to confirm that the product works “as expected”, but to ask “is there a problem here?” A product can fulfill an expectation and nonetheless have terrible problems about it. It’s our job to seek and find and describe inconsistencies and problems that matter before it’s too late.

And finally…

Fourth, speaking in terms of an oracle instead of an “expected result” can help to avoid patronizing, condescending, time-wasting, and obvious elements of bug reports that cause developers to feel insulted or to roll their eyes.

Actual result: Product crashes.

Expected result: Product does not crash.

Don’t be that tester.

Further reading:

Not-So-Great Expectations
Oracles From the Inside Out

As Expected

Tuesday, April 12th, 2016

This morning, I started a local backup. Moments later, I started an online backup. I was greeted with this dialog:

Looks a little sparse. Unhelpful. But there is that “More details” drop-down to click on. Let’s do that.

Ah. Well, that’s more information. It’s confusing and unhelpful, but I suppose it holds the promise of something more helpful to come. I notice that there’s a URL, but that it’s not a clickable link. I notice that if the dialog means what it says, I should copy those error codes and be ready to paste them into the page that comes up. I can also infer that there’s no local help for these error codes. Well, let’s click on the Knowledge Base button.

Oh. The issue is that another backup is running, and starting a second one is not allowed.

As a tester, I wonder how this was tested.

Was an automated check programmed to start a backup, start a second backup, and then query to see if a dialog appeared with the words “Failed to run now: task not executed” in it? If so, the behaviour is as expected, and the check passed.

Was an automated check programmed to start a backup, start a second backup, and then check for any old dialog to appear? If so, the behaviour is as expected, and the check passed.

Was a test script given to a tester that included the instruction to start a backup, start a second backup, and then check for a dialog to appear, including the words “Failed to run now: task not executed”? Or any old dialog that hinted at something? If so, the behaviour is as expected, and the “manual” test passed.

Here’s what that first dialog could have said: “A backup is in progress. Please wait for that backup to complete before starting another.”

At this company, what is the basic premise for testing? When testing is designed, and when results are interpreted, is the focus on confirming that the product “works as expected”? If so, and if the expectations above are met, no bug will be noticed. To me, this illustrates the basic bankruptcy of testing to confirm expectations; to “make sure the tests all pass”; to show that the product “meets requirements”. “Meets requirements”, in practice, is typically taken to mean “is consistent with statements in a requirements document, however misbegotten those statements might be”.
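
To make that concrete, here is a minimal Python sketch of the kind of check imagined above. It is built on assumptions: `FakeBackupProduct` and `Dialog` are hypothetical stand-ins, not the real application or its API, and `dialog_explains_cause` is a crude, fallible heuristic invented for this example. The point is only that a check can confirm the anticipated text and pass while saying nothing about whether the dialog helps anyone.

```python
from dataclasses import dataclass

EXPECTED_TEXT = "Failed to run now: task not executed"


@dataclass
class Dialog:
    text: str


class FakeBackupProduct:
    """Hypothetical stand-in for the product; not the real application."""

    def __init__(self) -> None:
        self.running = False

    def start_backup(self):
        if self.running:
            # Mirrors the message quoted in the post.
            return Dialog(EXPECTED_TEXT)
        self.running = True
        return None


def check_second_backup_shows_expected_dialog(product) -> bool:
    """Passes whenever the anticipated text appears, helpful or not."""
    product.start_backup()
    dialog = product.start_backup()
    return dialog is not None and EXPECTED_TEXT in dialog.text


def dialog_explains_cause(dialog: Dialog) -> bool:
    """Crude heuristic: does the message say a backup is already under way?"""
    lowered = dialog.text.lower()
    return "in progress" in lowered or "already running" in lowered
```

With this sketch, `check_second_backup_shows_expected_dialog(FakeBackupProduct())` returns `True`, while `dialog_explains_cause(Dialog(EXPECTED_TEXT))` returns `False`. The check passes, and the problem that a person would notice in seconds goes unreported.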

Instead of confirmation, “pass or fail”, “meets the requirements (documents)” or “as expected”, let’s test from the perspective of two questions: “Is there a problem here?” and “Are we okay with this?” As we do so, let’s look at some of the observations that we might make and questions we might ask. (Notice that I’m doing this without reference to a specification or requirements document.)

Upon starting a local backup and then attempting to start an online backup, I observe this dialog.

I am surprised by the dialog. My surprise is an oracle, a means by which I might recognize a problem. Why am I surprised? Is there a problem here?

I had a desire to create a local backup and an online backup at the same time. On a multi-tasking, multi-threaded operating system, that desire seems reasonable to me, and I’m surprised that it didn’t happen.

Inconsistency with reasonable user desire is an oracle principle, linked to quality criteria that might include capability, usability, performance, and charisma. The product apparently fails to fulfill quality criteria that, in my opinion, a reasonable user might have. Of course, as a tester, I don’t run the project. So I must ask the designer, or the developer, or the product manager: Are we okay with this?

This might be exactly the dialog that has been programmed to appear under this condition—whatever the condition is. I don’t know that condition, though, because the dialog doesn’t tell me anything specific about the problem that the software is having with fulfilling my desire. So I’m somewhat frustrated, and confused. Is there a problem here?

I can’t explain or even understand what’s going on, other than the fact that my desire has been thwarted. My oracle—pointing to a problem—is inconsistency with explainability, in addition to inconsistency with my desires. So I’m seeing a potential problem not only with the product’s behaviour, but also in the dialog. Are we okay with this?

Maybe more information will clear that up.

Still nothing more useful here. All I see is a bunch of error codes; no further explanation of why the product won’t do what I want. I remain frustrated, and even more confused than before. In fact, I’m getting annoyed. Is there a problem here?

One key purpose of a dialog is to provide a user with useful information, and the product seems inconsistent with that (the inconsistency-with-purpose oracle). Are these codes correct? Maybe these error codes are wildly wrong. If they are, that would be a problem too. If that’s the case, I don’t have a spec available, so that’s a problem I’m simply going to miss. Are we okay with that?

I have to accept that, as a human being, there are some problems I’m going to miss—although, if I were testing this in-house, there are things I could do to address the gaps in my knowledge and awareness. I could note the codes and ask the developer about them; or I could ask for a table of the available codes. (Oh… no one has collected a comprehensive listing of the error codes; they’re just scattered through the product’s source code. Are we okay with this?)

Back to the dialog. Maybe those error codes are precisely correct, but they’re not helping me. Are we okay with this?

All right, so there’s that Knowledge Base button. Let’s try it. When I click on the button, this appears:

Let’s look at this in detail. I observe the title: 32493: Acronis True Image: “Failed to run now: task not executed.” That’s consistent with the message that was in the dialog. I notice the dates; something like this has appeared in the knowledge base for a while. In that sense, it seems that the product is consistent with its history, but is that a desirable consistency? Is there a problem here?

The error codes being displayed on this Web page seem consistent with the error codes in the dialog, so if there’s a problem with that, I don’t see it. Then I notice the line that says “You cannot run two tasks simultaneously.” Reading down over a long list of products, and through the symptoms, I observe that the product is not intended to perform two tasks simultaneously. The workaround is to wait until the first task is done; then start the second one. In that sense, the product indeed “works as expected”. And yet…are we okay with this?

Once again, it seems to me that attempting to start a second task could be a reasonable user desire. The product doesn’t support that, but maybe we’re okay with that. Yet is there a problem here?

The product displays a terse, cryptic error message that confuses and annoys the user without fulfilling its apparent intended purpose to inform the user of something. The product sends the user to the Web (not even to a local Help file!) to find that the issue is an ordinary, easily anticipated limitation of the program. It does look kind of amateurish to deal with this situation in this convoluted way, instead of simply putting the relevant information in the initial dialog. Is there a problem here?

I believe that this behaviour is inconsistent with an image that the company might reasonably want to project. The behaviour is also inconsistent with the quality criteria we call usability and charisma. A usable product is one that behaves in a way that allows the user to accomplish a task (including dealing with the product’s limitations) quickly and smoothly. A charismatic product is one that does its thing in an elegant way; that engages the user instead of irritating the user; that doesn’t make the development group look silly; that doesn’t prompt a blog post from a customer highlighting the silliness.

So here’s my bug report. Note that I don’t mention expectations, but I do talk about desires, and I cite two oracles. The title is “Unhelpful dialog inconsistent with purpose.” The body would say “Upon attempting to start a second backup while one is in progress, a dialog appears saying ‘Failed to run now: task not executed.’ While technically correct, this message seems inconsistent with the purpose of informing the user that we can’t perform two backup tasks at once. The user is then sent to the (online) knowledge base to find this out. This also seems inconsistent with the product’s image of giving the user a seamless, reliable experience. Is all this desired behaviour?”

Finally: it could be that the testers discovered all of these problems, and laid them out for the product’s designers, developers, and managers, just as I’ve done here. And maybe the reports were dismissed because the product works “as expected”. But “as expected” doesn’t mean “no problem”. If I can’t trust a backup product to post a simple, helpful dialog, can I really trust it to back up my data?

Oracles from the Inside Out, Part 9: Conference as Oracle and as Destination

Thursday, March 17th, 2016

Over this long series, I’ve described my process of reasoning about problems, using this table:

So far, I’ve mostly talked about the role of experience, inference, and reference. However, I’m typically testing for and with clients—product managers, developers, designers, documenters, and so forth. In doing so, I’m trying to establish a shared understanding of the product with the rest of the team. That understanding is developed through conference; conversation and interaction with those other people. So the lower left quadrant represents two things at once: a set of oracles on the one hand, and my destination on the other.

A brief recap: while testing, I experience and develop my own set of mental models of the product and feelings about it, and reason about possible problems in it. In many cases, for instance when I get a feeling of surprise or confusion, I’m able to use the consistency principles in the upper right to make inferences that I’m seeing a problem. My inferences might be mediated by references like a document (a specification, or a diagram, or a standard) or a tool (a suite of automated checks, or something that helps me to aggregate and visualize patterns in the data). Those media afford a move from upper right to lower right, and back again to a stronger inference in the upper right.

In other cases, my experiences, inferences, and references may not be enough for me to convince myself that I’m seeing a problem or missing one. If so, one possible move is to ask another tester, a developer, an expert user, a novice user, a product owner, or subject matter expert for information or an opinion. (In Rapid Testing, we often call such a person a live oracle.) When I do that, I’m moving from inference to conference, from upper right to lower left. Occasionally that communication happens immediately and tacitly, without my having to refer to explicit inferences or references. More often, it’s a longer and more involved discussion.

I could use the expertise of a particular person as an oracle, and rely upon that person to declare that he or she is seeing a problem. However, perspectives differ, people have blind spots, everyone is capable of making a mistake, and what was true yesterday may not be true today. Thus there is a risk that a live oracle could be oblivious to certain kinds of problems, or could mislead me into believing there’s a problem where there isn’t one. No oracle—not even a live one, nor a group of them—is infallible. The expert user might not notice an ease-of-learning problem that would cause a novice to stumble. A new programmer might not see a usability problem that an experienced tester would notice right away.

Perhaps more interestingly, people might disagree about whether there’s a problem or not. Such disagreements themselves are oracles, alerting me to problems in the project as well as the product. Feelings can provide important clues about the meaning and the significance of a problem. As we work together, I can listen to people’s opinions, observe the emotional weight they carry, weigh agreements and disagreements between people who matter, and compare their feelings with my own. I move between conference and inference to recognize or refine my perception of a problem.

The ultimate goal for my testing is to end up in that lower left quadrant with one person in particular: my most important client, the person responsible for making content and release decisions about the product. (That person may have one of a number of titles or labels, including product manager, program manager, project manager, development manager… Here, let’s call that person the Client.) I want my models and feelings about the product to be consistent with the Client’s models and feelings. Experience, inference, reference, and conference help me to do that.

Here’s a fact-based but somewhat fictionalized example. A few years ago, I was working at a financial institution. One of the technical support people mentioned in passing that a surprisingly high proportion of her work was dealing with failed transactions involving two banks out of the hundreds that we interacted with. That triggered a feeling of curiosity: was there a bug in our code? That feeling prompted me to investigate.

Each record had a transaction identifier associated with it. The transaction ID was generated from various bits of data, including the customer account number, and it included a calculated check digit. When I started testing, I noticed that the two banks in question used six-digit account numbers, rather than the more common seven-digit form. I cooked up a script to perform a large number of simulated transactions with those two banks. When I examined the logs, I found that a small number of transactions had invalid account numbers. That problem should have been trapped by the check digit functions, but the transactions were allowed to pass through the system unhindered.

When I mentioned the problem in passing to the product owner, I observed that she seemed unperturbed; she didn’t seem to be taking the problem very seriously. The discrepancy between our feelings suggested that one of two things must have been true: either I hadn’t framed the problem sufficiently well for her to recognize its significance; or she had information that I didn’t, information that would have changed my perception of the problem and lessened my emotional reaction to what I was seeing.

“The problem is only with those two banks,” she said. “Six-digit account numbers, right? We have to special-case those by adding a trailing zero for the check digit function. Something about the check digit calculation fails about one time in a couple of hundred, but the transaction goes through anyway. But later, when we send the acknowledgement packet, those two banks reject it. So six-digit numbers are a pain, but we’ve always been able to deal with the occasional failure.” Here she was using the “patterns of familiar problems” and “history” oracle principles as her means of recognizing a problem. But something else was going on: she was using those two principles to calibrate the significance of the problem in terms of her own mental models, and those principles were helping to dampen her concern. Those oracles suggested to her that I was observing a problem, but not a big problem.

I did a search of the database, and discovered that there were eight other banks that used six-digit numbers. I wrote a quick script to extract all of the records for those banks. All of the transactions had happened successfully.
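
A script of that kind might look something like the following Python sketch. It is not the script from the story: the record layout, bank names, account numbers and statuses are all invented for illustration, and the real data would have come from a database rather than a list in memory. The sketch picks out the banks whose account numbers all have six digits and tallies failed and total transactions for each, so that an inconsistency between banks stands out.

```python
from collections import defaultdict

# Hypothetical records: every bank name, account number and status is invented.
RECORDS = [
    {"bank": "Bank A", "account": "123456", "status": "failed"},
    {"bank": "Bank A", "account": "234567", "status": "ok"},
    {"bank": "Bank B", "account": "345678", "status": "failed"},
    {"bank": "Bank C", "account": "456789", "status": "ok"},
    {"bank": "Bank C", "account": "567890", "status": "ok"},
    {"bank": "Bank D", "account": "1234567", "status": "ok"},
]


def failures_by_six_digit_bank(records):
    """Return {bank: (failed, total)} for banks whose accounts all have six digits."""
    lengths = defaultdict(set)              # bank -> account-number lengths seen
    tallies = defaultdict(lambda: [0, 0])   # bank -> [failed, total]
    for record in records:
        bank = record["bank"]
        lengths[bank].add(len(record["account"]))
        tallies[bank][1] += 1
        if record["status"] == "failed":
            tallies[bank][0] += 1
    return {
        bank: tuple(tallies[bank])
        for bank, seen in lengths.items()
        if seen == {6}
    }
```

On the invented data, `failures_by_six_digit_bank(RECORDS)` shows failures for two of the six-digit banks and none for the third, which is the shape of inconsistency described in the story.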

“OK, but here’s what I found out,” I replied. “There are eight other banks that use six-digit numbers, and we’ve never seen a check-digit failure in those.”

“Really?” she said. “Wow. I thought those were the only two.” I could see that she was suddenly more engaged. The fact that the product was inconsistent with itself was a powerful oracle. Awareness of the inconsistency raised her emotional state.

“Yep,” I said. “Here’s the thing: for those two banks—and only for those two—we’re serving up the wrong Web page to get input, which is obviously inconsistent with our design. That page provides the customer with a seven-digit input field. I looked at the logs, and I tried a bunch of stuff myself. Here’s what I think is happening: when the customer enters a six-digit account number, the page rejects their input because it’s too short, and tells them they need to put in a seven-digit number. It looks to me like a few of the customers are trying to work around the error message by putting in a leading zero. They do that because we show an image to illustrate example input. That image is a seven-digit number that has a leading zero in it. What’s funny is that the wrong thing to do—putting in a leading zero—actually succeeds every now and again; the hash function for the check digit generates a valid transaction ID by coincidence. Not very often, but enough for it to register.”

“Interesting!” she said. She smiled. “Good detective work there.”

“So, are we going to fix it?” I asked, confident that we finally had a shared understanding of the problem.

“Nope,” she said.


I was surprised, and felt myself becoming a little agitated. “Nope?!”

“Well, probably not. We’re replacing the whole input process in six months or so. Since we can deal with the problem as it is, and since the developers are busy on the new version, we’re cool with muddling along.” She noticed from my expression that I suddenly felt deflated. “Listen, that was some really good testing,” she said. “And I really appreciate the effort, and I understand your concern. I get that it’s a real problem for a handful of customers (here, she was acknowledging the inconsistency with user desires oracle), although once they’ve called us, they’re aware of the workaround. I know it does sound like a pretty easy fix, and we could fix it. But then we’d want to test it to make sure that the whole process keeps working for all of the customers of those banks, not just the ones who have had the problems. And with the new version coming up, trust me: you’ll have more than enough to do.”

I was a little disappointed that my investigation hadn’t resulted in a fix, but I did feel that she’d been listening. I had heard enough from her to dampen my own emotional state down so that it was well calibrated with hers.

When I observe a problem, the Client might or might not agree with me that it is a problem. That’s okay. As a tester, I’m not judge or jury for the problem, but I do want to make sure that my report has been heard and understood. After that, the Client can decide what she likes.

She might decide that it’s an important and urgent problem, and that it needs to be addressed right away. She might agree that it’s a problem, but not a problem worth fixing. She might believe that the problem is worth fixing, but not right away. She might dismiss my report of an inconsistency between the product and some principle by citing other, more important principles with which the product is consistent.

Oracles give us means not only to recognize problems, but also to interpret and explain our feelings about them. When I can frame my experience—feelings and mental models—in terms of inferences about inconsistencies, I’m better prepared for a conversation—a conference—with my client about each problem, and why I believe it’s a problem.

Oracles from the Inside Out, Part 8: Successful Stumbling

Thursday, November 26th, 2015

When we’re building a product, despite everyone’s good intentions, we’re never really clear about what we’re building until we try to build some of it, and then study what we’ve built. Even after that, we’re never sure, so to reduce risk, we must keep studying. For economy, let’s group the processes associated with that study—review, exploration, experimentation, modelling, checking, evaluating, among many others—and call them testing. Whether we’re testing running code or testing ideas about it, testing at every step reveals problems in what we’ve built so far, and in our ideas about what we’ve built.

Clever people have the capacity to detect some problems and address them before they become bigger problems. A smart business analyst is aware of unusual exceptions in a workflow, recognizes an omission in the requirements document, and gets it corrected. An experienced designer goes over her design in her head, notices a gap in her model, and refines it. A sharp programmer, pairing with another, realizes that a function is using a data type that will overflow, and points out the problem such that it gets fixed right away.

Notice that in each one of these cases, it’s not quite right to say that the business analyst, the designer, or the programmer prevented a problem. It’s more accurate to say that a person detected a little problem and prevented it from becoming a bigger problem. Bug-stuff was there, but a savvy person stomped it while it was an egg or a nymph, before it could hatch or develop into a full-blown cockroach. In order to prevent bigger problems successfully, we have to become expert at detecting the small ones while they’re small.

Sometimes we can be clever and anticipate problems, and design our testing to shine light on them. We can build collaboration into our designs, review into our specifications, and pairing into our programming. We can set up static analysis tools that check code for inconsistency with explicit rules. When we’re dealing with running code, testing might take the form of specific procedures for a tester to follow; sometimes it takes the form of explicit conditions to observe; and sometimes it takes the form of automated checks. All of these approaches can help to find problems along the way.

It’s a fact that when we’re testing, we don’t always find the problems we set out to find. One reason might be, alas, that the problems have successfully evaded our risk ideas, our procedures, our coverage, and our oracles. But another reason might be that, thanks to people’s diligence, some problems were squashed before they had a chance to encounter our testing for them.

Conversely, some problems that we do find are ones that we didn’t anticipate. Instead, we stumble over them. “Stumbling” may sound unappealing until we consider the role that serendipity—accidental or incidental discovery—has played in every aspect of human achievement.

So here, I’m not talking about stumbling in terms of clumsiness. Instead, I’m speaking in terms of what we might find, against the odds, through a combination of diligent search, experimentation, openness to discovery, and alertness—as people have stumbled over diamonds, lost manuscripts, new continents, or penicillin. Chance favours the explorer and—as Pasteur pointed out—the prepared mind. If we don’t open our testing to problems where customers could stumble, customers will find those places.

Productive stumbling can be extended and amplified by tools. They don’t have to be fancy tools by any means, either.

Example: Stuck for a specific risk idea, I created some tables of more-or-less randomized data in Excel, and used a Perl script to cover all of the possible values in a four-digit data field. One of those values returned an inappropriate result—one stumble over a gold nugget of a bug. Completely unexpectedly, though, I also stumbled over a sapphire while scanning quickly through the log file, using a blink oracle: every now and then, a transaction took ten times longer than it should have, courtesy of a startling and completely unrelated bug.
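A minimal sketch of that kind of tool, in Python rather than the Perl and Excel used at the time. The function names, the shape of the timing log, and the "ten times the median" rule are all assumptions made for illustration; the original script is not shown in this post.

```python
import statistics


def all_four_digit_values():
    """Every possible value of a four-digit field, '0000' through '9999'."""
    return [f"{n:04d}" for n in range(10_000)]


def slow_transactions(timings, factor=10.0):
    """Return (value, seconds) pairs that took at least `factor` times the median.

    `timings` is a list of (value, seconds) pairs, a hypothetical stand-in
    for whatever the real log recorded.
    """
    median = statistics.median(seconds for _, seconds in timings)
    return [(value, seconds) for value, seconds in timings
            if seconds >= factor * median]


if __name__ == "__main__":
    # Invented timings: every transaction takes 0.02 s except one.
    timings = [(value, 0.02) for value in all_four_digit_values()]
    timings[1234] = ("1234", 0.25)
    for value, seconds in slow_transactions(timings):
        print(f"{value}: {seconds:.2f}s looks slow; worth a closer look")
```

The filter only points at transactions that might deserve attention. Whether a slow transaction is a problem is still a judgment for a person to make.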

Example: At a client site, I had a suspicion that a test script contained an unreasonable amount of duplication. I opened the file in a text editor, selected the first line in a data structure, hit the Ctrl-F key, and kept hitting it. I applied a blink oracle again: most of the text didn’t change at all; tiny patches, representing a handful of variables, flickered. Within a few seconds I had discovered that the script wasn’t really doing anything significant except trying the same thing with different numbers. More importantly, I discovered that the tester needed real help in learning how to create flexible, powerful, and maintainable test code.

Example: I wrote a program as a testing exercise for our Rapid Software Testing class. A colleague used James Bach’s PerlClip tool to discover the limit on the amount of data that the program would accept. From this, he realized that he could determine precisely the maximum numeric value supported by the program, something that I, the programmer, had never considered. (When you’re in the building mindset, there’s a lot that you don’t consider.)

Example: Another colleague, testing the same program, used Excel to generate all of the possible values for one of the input fields. From this he determined that the program was interpreting input strings in ways that, once again, I had never considered. Just this test and the previous one revealed information that exploded my five-line description of the program into fifteen far more detailed lines, laden with surprises and exceptions. One of these lines represents a dangerous and subtle gotcha in the programming language’s standard libraries. All this learning came from a program that is, at its core, only two lines of code! What might we learn about a program that’s two million lines of code?

Example: In this series of posts on oracles, I’ve already recounted the tale of how James took data from hundreds of test runs, and used Excel’s conditional formatting feature to visualize the logged results. The visualizations instantly highlighted patterns that raised questions about the behaviour of a product, questions that fed back into refinements of the requirements and design decisions.

Example: While developing a tool to simulate multi-step transactions in a banking application, I discovered that the order in which the steps were performed had a significant impact on the bank’s profit on the overall transaction. This is only one instance of a pattern I’ve seen over and over again: while developing the infrastructure to perform checking, I stumble over bug after bug in the application to be tested. Subsequently, after the bugs are fixed and the product is stabilized and carefully maintained, the checks—despite their value as change detectors—don’t reveal bugs. Most of the value of the checks gets cashed in the testing activity that produces them.

Example: James performed 3000 identical queries on eBay; one query every two or three seconds. He expected random variation over time (i.e. a “drunkard’s walk”). Instead, the visualization allowed him to see suspicious repeating jumps and drops that looked anything but random. Analysis determined that he was probably seeing the effects of many servers responding to his query—some of which occasionally failed to contribute results before timing out.
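For readers without a charting tool to hand, here is a rough sketch of one way to surface jumps and drops like these in a series of result counts. It is not what James did (he used a visualization); the function name, the sample counts, and the threshold are invented for illustration.

```python
def step_changes(counts, threshold):
    """Return (index, delta) for each consecutive pair of counts whose
    absolute change exceeds `threshold`. `index` is the position of the
    later value in the pair."""
    return [
        (i, later - earlier)
        for i, (earlier, later) in enumerate(zip(counts, counts[1:]), start=1)
        if abs(later - earlier) > threshold
    ]


if __name__ == "__main__":
    # Invented counts: small wobble, then a sharp drop and a sharp recovery.
    counts = [1000, 1002, 999, 700, 701, 1001]
    for index, delta in step_changes(counts, threshold=100):
        print(f"query {index}: result count changed by {delta:+d}")
```

A list of large steps does not explain anything by itself. It gives a person something to look at and ask questions about, such as whether the steps repeat at regular intervals.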

These examples show how we can use tools powerfully: to generate data sets and increase coverage, so that we can bring specific conditions to our attention; to amplify signals amidst the noise; to highlight subtle patterns and make them clearly visible; to afford observation of things that we never expected to see; to perturb or stress the system such that rare or hidden problems become perceptible.

The traditional view of an oracle is an ostensibly “correct” reference that we can compare to the output from the program. A common view of test automation is using a tool to act like a robotic and unimaginative user to produce output to be checked against a reference oracle. A pervasive view of testing is nothing more than simple output checking, focused on getting right answers and ignoring the value of raising important new questions. In Rapid Software Testing, we think this is too narrow and limiting a view of oracles, of automation, and of testing itself. Testing is exploring a product and experimenting with it, so that we can learn about it, discover surprising things, and help our clients evaluate whether the product they’ve got is the product they want. Automated checking is only one way in which we can use tools to aid in our exploration, and to shine light on the product—and excellent automated checking depends on exploratory work to help us decide what might be interesting to check and to help us to refine our oracles. An oracle is any means—a feeling, principle, person, mechanism, or artifact—by which we might recognize a problem that we encounter during testing. And oracles have another role to play, which I’ll talk about in the last post in this long series.

Oracles from the Inside Out, Part 7: References as Checks

Monday, October 12th, 2015

Over the last few blog posts, I’ve been focusing on oracles—means by which we could recognize a problem when we encounter it during testing. So far, I’ve talked about

  • feelings and private mental models within internal, tacit experiences;
  • consistency heuristics by which we can make inferences that help us to articulate why we think and feel that something is a problem;
  • brief exchanges—tiny bursts of conferences—between people with collective tacit knowledge, and shared feelings and mental models;
  • data sets and tools that produce visualizations that are explicit—things that we can point to—references that we can observe, analyze and discuss

Most of the examples I’ve shown so far involve applying oracles retrospectively—seeing a problem and responding to it, starting in the top left corner of this diagram.

But maybe experience with the product isn’t the only place we could start. Maybe we could start in the bottom right of the table, with tools.

Let’s begin by asking ourselves why we can’t see instantly when things go wrong with software. Why aren’t all bugs immediately obvious? The first-order answer is itself obvious: software is invisible. It’s composed, essentially, of electrons running through sand, based on volumes of instructions written by fallible humans, and whatever happens takes place inside tiny boxes whose contents and inner structures are obscure to us. That answer should remind us to presume that many bugs are by their nature hidden and subtle. Given that, how can we make bugs obvious? If we wish to identify something that we can’t perceive with unaided observation, we’ll need internal or external instrumentation. If we think and talk about tool support to reveal bugs, we might choose to develop it more often, and learn to build it wisely and reliably.

With those facts in front of us, how might we prepare ourselves to anticipate and to notice problems, with the help of tools to extend our observational powers?

1) Learn the product. That process may—indeed, should, if possible—start even before the product has been built; we can learn about the product and people’s ideas for it as it is being designed. We become more powerful testers as we add to our knowledge of the product, the problem it is intended to solve, and the conditions under which it will be used. This is not just an individual process, but a social process, a team process. A development group not only builds a product; it learns to build a product as it tries to build the product. Both kinds of learning continue throughout the project.

2) As we learn the product, consider risks. What could go wrong? Considering risks may also start before the product has been built, is also a collaborative process, and is also continuous. For instance…

  • A system may not be able to fulfill the user’s task. It may not produce the desired result; some feature or function may be missing. It may not do the right things, or it may do the right things in the wrong way. It may do the wrong things. That is, the system might have a problem related to capability.
  • A program may exhibit inconsistent behaviour over time. Outputs may vary in undesirable ways. Functions or features may be unavailable from time to time. Something that works in one version may fail to work in the next. That is, the system might have a problem with reliability.
  • A system may be vulnerable to attack or manipulation, or it may expose data to the world that should be kept private. It may permit records to be changed or altered. That is, the system might have a problem related to security.
  • A system might run into difficulty when overloaded or starved of resources. The system might be slow to respond even under normal conditions. That is, the system may have problems with performance.
  • A system may have trouble handling more complex processing, larger amounts of data, or larger numbers of users than could be supported by the original design. That is, the system may have problems related to scalability.
  • Something that should be present might be absent; or something that should be absent might be present. Files might be missing from the distribution, or proprietary files might be included inadvertently. Registry entries or resource files might not include appropriate configuration settings. The uninstaller might leave rubbish lying around, zap data that the user wants to retain, or uninstall components of other programs. That is, the system may have problems related to installability.

This set of examples is by no means complete. There’s a long list of ways in which users might obtain value from a product, and practically an infinite list of ways that things could go wrong. Although machinery cannot evaluate quality, specific conditions within these quality criteria in particular are amenable to being checked by mechanisms either external or internal to the program. Those checks can direct human attention to those conditions. So…

3) As we learn about the product and what can go wrong, consider how a check might detect something going wrong. One rather obvious way to detect a problem would be to imagine a process that the product might follow, drive the product with a tool, and then check if the process comes to an end with a desirable result. For extra points, have the tool collect output produced by the product or some feature within it, and then have the tool check that data for correctness against some reference. But we could also use checking productively by

  • examining data at intermediate states, as it is being processed, and not only at output time;
  • evaluating components of the product to see if any are missing, or the wrong version, or superfluous;
  • identifying platform elements—systems or resources upon which our product depends—and their attributes, including their versions and their capabilities;
  • observing the environment in which the program is running, to see if it changes in some detectable and significant way;
  • monitoring and inspecting the system to determine when it enters some state, when some event occurs, or when some condition is fulfilled;
  • timing processes within the system that must be completed within known, specific limits.

When something noteworthy happens, we have the option of either logging the incident or being notified immediately by some kind of alert or alarm.
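As one illustration of the last item in the list above, here is a minimal sketch of a timing check that logs an incident when a process exceeds its limit. The function name, the logger name, and the injectable clock are assumptions made for the example, not part of any particular tool.

```python
import logging
import time

logger = logging.getLogger("checks")


def timed_check(operation, limit_seconds, clock=time.perf_counter):
    """Run `operation` and report whether it finished within `limit_seconds`.

    Returns (passed, elapsed_seconds) and logs a warning when the limit is
    exceeded. The check only produces a bit; interpreting it is up to a person.
    """
    start = clock()
    operation()
    elapsed = clock() - start
    passed = elapsed <= limit_seconds
    if not passed:
        logger.warning("operation took %.3fs; limit is %.3fs",
                       elapsed, limit_seconds)
    return passed, elapsed


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    # A stand-in operation that sleeps briefly; the limit is deliberately tight.
    print(timed_check(lambda: time.sleep(0.05), limit_seconds=0.01))
```

Passing the clock in as a parameter is a design choice that lets the check itself be exercised with fake times, without waiting on a real process.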

Checking of this kind is a special case of something more general: bringing problems to our awareness and attention. Again, machinery cannot evaluate quality or recognize threats to value, so a check requires us to anticipate and encode each specific condition to be checked, and, after the check, to interpret its outcome whether red or green.

Moreover, to match the value of checks with the cost of developing and maintaining them, and to avoid being overwhelmed by having to interpret the results from automated checks, we must find ways to decide what’s likely to be interesting and not so interesting. Using tools to help us learn about that is the subject of the next post.

Oracles from the Inside Out, Part 6: Oracles as Extensions of Testers

Monday, September 21st, 2015

The previous post in this series was something of a diversion from the main thread, but we’ll get back on track in this long-ish post. To review: Marshall McLuhan famously said that “the medium is the message”. He used this snappy slogan to point out that media, tools, and technologies are not only about what they contain; they’re also about themselves, changing the scale, pace, or pattern of human affairs and activities. As he somewhat less famously said, “We shape our tools; thereafter, our tools shape us.”

In the Rapid Software Testing namespace, we recognize the traditional interpretation of an oracle in software testing as “an external mechanism which can be used to check test output for correctness”. W.E. Howden, who introduced the term, said that oracles “can consist of tables, hand calculated values, simulated results, or informal design and requirements descriptions”. But after recognizing that interpretation, we offer a more general and, to us, a more useful notion of “oracle”: a means by which we recognize a problem when we encounter one during testing.

We also make a distinction between testing and checking. Testing is the process of evaluating a product by learning about it through exploration and experimentation, which includes to some degree: questioning, study, modeling, observation, inference, etc. Checking is the process of making evaluations by applying algorithmic decision rules to specific observations of a product. A check produces a bit—true or false, green or red, pass or fail. A check is a kind of formal testing; testing that is done in a specific way, or with the goal of determining specific facts. Checking is a tactic of testing, but it’s certainly not all there is to testing.

In that light, Howden’s examples of oracles—references—form bases for checks. Add McLuhan’s insights, and we can recognize that checks are media that extend and accelerate our capacity to observe inconsistency with a specific claim (“the product, given these inputs, should produce output consistent with the values in this table”), or with a comparable product (“the product, given these inputs, should produce output consistent with the values produced by this simulator”). So: automated checks are media that extend and accelerate our capacity to perform checks; to observe and evaluate the product in accordance with explicit, algorithmic decision rules.
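To make the idea of a check against a reference table concrete, here is a minimal sketch. The product function, its deliberate defect, and the table values are all invented for illustration; a simulator could stand in for the table by computing the expected values instead of storing them.

```python
def check_against_table(product, reference_table):
    """Apply one decision rule per table row: does the product's output
    equal the reference value?

    Returns a list of (inputs, expected, actual, passed) tuples, one per row.
    """
    results = []
    for inputs, expected in reference_table.items():
        actual = product(*inputs)
        results.append((inputs, expected, actual, actual == expected))
    return results


def product_under_test(a, b):
    """Hypothetical product: adds two numbers, with a planted defect
    for first arguments of 100 or more."""
    return a + b if a < 100 else a - b


if __name__ == "__main__":
    # Hypothetical reference: inputs mapped to the output we claim to want.
    table = {(1, 2): 3, (100, 1): 101}
    for inputs, expected, actual, passed in check_against_table(product_under_test, table):
        print(inputs, "green" if passed else "red", f"(expected {expected}, got {actual})")
```

Each row yields one bit. The check covers only the conditions someone thought to put in the table, and a red or green result still needs a person to interpret it.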

Why is all this a big deal?

One: because media don’t recognize problems; media extend, enhance, enable, accelerate, intensify, or amplify our capability to recognize problems.

My friend Pradeep Soundararajan attracted my attention and respect several years ago by this astute observation: “It is not a test that finds a bug; it is a human that finds a bug and a test plays a role in helping the human find it.” To paraphrase Pradeep, the tool doesn’t recognize the problem; the tool plays a role in recognizing the problem. The microscope does not see; the microscope helps us to see. The burglar alarm doesn’t detect a burglary; the burglar alarm extends our senses over distance to recognize movement, whereby we can infer that a burglary might be happening. The Wikipedia entry Exploration of Mars errs by saying “The exploration of Mars is the study of Mars by spacecraft.” In fact, the exploration of Mars is the study of Mars by humans, enabled by spacecraft. Tools amplify whatever we are. Tools can extend people’s competence and capabilities to help them focus on what matters, allowing them to become aware of important new things. Tools can just as easily extend people’s incompetence and incapabilities to overfocus their attention on the known or the trivial, allowing them to be oblivious to important things.

Two: because tools can do so much more for testers than automated checking.

Testing is not simply checking to determine whether we’re getting the right answers. Testing is also about making sure we’re asking important questions—and discovering important questions, and discovering things about how we might develop and apply checks.

For instance: my colleague James Bach was working on testing a medical device a few years back. I’ve done some work with this company too, so in order to respect non-disclosure agreements, let’s just say that the medical device is a Zapper Box, intended to deliver Healing Energy to a patient’s body over an Appropriate Period of Time. The Zapper Box is controlled by a Control Box and its software. Too much of a Good Thing can be a Very Bad Thing so, crucially, the Control Box is also intended to stop the Zapper Box from delivering that energy when the Appropriate Period of Time becomes an Inappropriate Period of Time, whereupon the Healing Energy becomes Killing Energy.

The Control Box has a display that shows the operator the amount of Healing Energy that is supposedly being delivered. James and one of the other testers set up a test rig, a meter to monitor the amount of Healing Energy that was actually being delivered. In one of the tests, they monitored and logged the amount of Healing Energy delivered after the device’s operator turned off the Healing Energy Tap via the Control Box. The log recorded measurements at intervals of one-tenth of a second. The testers did several hundred instances of this test. Then James took the logs, and used Microsoft Excel’s conditional formatting feature to highlight various levels of Healing Energy. Red indicated that the device was delivering Energy that could be described here as Very Hot; yellow represented Somewhat Hot; grey represented Cooling Down; green represented Cool. Then, James took over a hundred lines of data, and shrank the lines down until the numbers were too small to see, such that only two digits are readable: the numbers 1 and 2 labelling the vertical lines, which represent the one- and two-second marks after the Healing Energy Tap had been turned off. In Rapid Software Testing we call this a Blink Test, or using a Blink Oracle. Here’s what the testers saw.

What do you observe? Here’s what I observe:

  • Over about 150 test runs, the level of Energy appears to remain Very Hot for .3 seconds to over a second after the Healing Energy Tap was turned off.
  • Towards the end of the observed period, the device seems to remain in a Very Hot state for longer, and more often. (“Longer” here means tending closer to one second than to .3 seconds.)
  • The variance in the Energy during the cooling period seems to be greater than at the beginning.
  • Over time, the level of Energy appears to remain in a Cooling Down state for longer and longer.
  • In at least one instance, the Zapper goes from a Cooling Down state back up to a Somewhat Hot state before returning to a Cooling Down state.
  • Starting from about the middle of the observed period, the Zapper appears many times not to reach the Cool state at all.

As I’m observing these results, two questions are looping in my mind. The first is “Is there a problem here?” The second is “Is my client okay with all this?”

When I see the inconsistency of the results, I experience a vague feeling of unease and confusion, followed by feelings of curiosity. The curiosity is about the product’s behaviour, but it’s also about why I feel uneasy and confused. Notice that my feelings and mental models are internal, tacit, and private, at the moment I have them. If I want to make sense of the relationships between my observations and my feelings, I have to do some detective work; feelings don’t come with return addresses on them.

I pause for a moment, and quickly consider: the cooldown period is not the same every time. I make an inference that that could be a problem, since we typically desire a product’s behaviour to be consistent with itself. Is there a problem here? I wonder if the product has shown a consistent, stable cooldown pattern in the past. If it has, it’s not doing that now. I make an inference that the product might be inconsistent with its history. Is there a problem here? There’s a related inference: all other things being equal, we desire a product to be consistent with its history, so an inconsistency with its history would point to a possible problem. Another inference: there may be a document that makes a claim about the behaviour required to fulfill the desire of some important person. Inconsistency with that document (a reference; a medium that extends communication and expression of desire) could represent a problem, since (by inference) inconsistency with the client’s desire would probably be a problem too. I don’t have a copy of such a document. Is my client okay with that? Where is the document? Who wrote it? Whose desires are being represented? Can I confer with those people; that is, can I have conference with them? If I can’t, the quality and relevance of my analysis will be compromised. Is my client okay with that? Since this is a medical device, is there a standards document (a reference) to which it should conform? I don’t know; in order to sort that out, I might have to engage in conference with a domain expert who is aware of such things. The expert is also a medium, extending my capacity to understand the domain, and the desires of people in that domain. I might have to examine a reference and engage in conference to find a relevant reference, or to understand it.

If I pay attention, I notice that Excel (and its conditional formatting feature) are media too, extending my capacity to observe patterns in the behaviour of the Zapper Box. The image I’m looking at is a reference. The test rig hooked up to the Zapper Box is also a medium, a reference, enabling testers’ capacity to observe the amount of Energy at the electrode. The test rig’s log is a medium to which I can refer, providing a record of precise observations that I can analyse over the longer term. The Control Box also includes testability features that produce a log (yet another reference). Those features extend the testers’ capacity to record the actions of the control box, and the log enables the comparison between the test rig’s log and the operator’s intended actions. There is a whole network of references here. I could apply consistency oracles to all of them, singly and in combination.

My curiosity is also aroused by questions about the test. Did the testers follow the same protocol on each test run? Was the Zapper Box powered on for the same amount of time before the tester flipped the switch at the Control Box? Did the measurement start exactly when the Zapper Box was told to turn off? Was the connection between the Control Box and the Zapper Box reliable? How would we know? Did the testers allow the Zapper Box to cool down between test runs? For how long? For the same amount of time each time? Is the test rig measuring the Energy properly? How would we know?

Notice: although there has been extensive use of tools, no checking has occurred here! James and the testers tested the product, gathered data, and presented that data in a form that allowed them (and others) to see interesting patterns. I have been analyzing those patterns (and I hope you have too).
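A text-only sketch of the kind of banded view described above, with one row per test run and one character per tenth-of-a-second sample. The thresholds, the single-character codes, and the sample readings are invented; the real device’s levels are deliberately not disclosed in this post, and James used Excel’s conditional formatting rather than code.

```python
# (lower bound, label, cell character). All thresholds are hypothetical.
BANDS = [
    (75.0, "Very Hot", "R"),              # red
    (50.0, "Somewhat Hot", "Y"),          # yellow
    (25.0, "Cooling Down", "g"),          # grey
    (float("-inf"), "Cool", "."),         # green
]


def band_cell(reading):
    """Map one reading to the character for the first band it reaches."""
    for lower, _label, cell in BANDS:
        if reading >= lower:
            return cell
    raise ValueError(f"reading {reading!r} fits no band")


def blink_view(runs):
    """Return one string per test run, one character per sample."""
    return ["".join(band_cell(reading) for reading in run) for run in runs]


if __name__ == "__main__":
    # Invented logs: each inner list is one run sampled every 0.1 s.
    runs = [
        [90, 80, 60, 30, 10],
        [95, 92, 88, 60, 30],
    ]
    print("\n".join(blink_view(runs)))
```

Nothing in this sketch decides pass or fail. It only compresses the logs so that patterns across many runs become visible at a glance.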

Some oracles just provide us with “pass or fail”, “true or false”, “green or red” results. Other oracles point us to possible problems upon which we may shine light. In this case, we would need more information—more references, more tools, more conference—before we could determine more clearly whether there is a problem. We might or might not need any of these things before the client could decide that there is a problem. More significantly, we would need more information to determine how we might be able to check for a problem. We must apply informal oracles before we can learn how to apply formal oracles well.

In other words: the process of recognizing a problem sometimes requires us to travel all over the oracle quadrants. That’s far more than “trying the product and comparing it to the specification”.

All this sets us up for the next few posts: if we considered the oracle principles (themselves media for recognizing problems that threaten people’s desires!), how could we imagine applying tools to enable, extend, enhance, accelerate, or intensify our capabilities as testers? What other roles might oracles play in the process? And if we are to use oracles and tools wisely, what effects—good and bad—could we anticipate from applying them? As we shape our tools, how might our tools shape us?

The first part of an answer is in the next post in this series.

Oracles from the Inside Out, Part 5: Oracles as References as Media

Tuesday, September 15th, 2015

Try asking testers how they recognize problems. Many will respond that they compare the product to its specification, and when they see an inconsistency between the product and its specification, they report a bug. Others will talk about creating and running automated checks, using tools to compare output from the product to specific, pre-determined, expected results; when the product produces a result inconsistent with expectations, the check identifies a bug which the tester then reports to the developer or manager. It might be tempting to think of this as moving from the bottom right quadrant on this table to the bottom left.

Traditional talk about oracles refers almost exclusively to references. W.E. Howden, who introduced “oracle” as a term of testing art, described an oracle as “an external mechanism which can be used to check test output for correctness”. Yet thinking of oracles in terms of correctness leads to some pretty serious problems. (I’ve outlined some of them here).

In the Rapid Software Testing namespace, we take a different, broader view of oracles. Rather than focusing on correctness, we focus on problems: an oracle is a means by which we recognize a problem when we encounter one during testing. Checking for correctness, as Howden puts it, may severely limit our capacity to notice many kinds of problems. A product or service can be correct with respect to some principle, but have plenty of problems that aren’t identified by that principle; and a product can produce incorrect results without the incorrectness representing a problem for anyone. When testers fixate on documented requirements, there’s a risk that they will restrict their attention to looking for inconsistencies with specific claims; when testers fixate on automated checks, there’s a risk that they will restrict their focus to inconsistency with a comparable algorithm. Focus your attention too narrowly on a particular oracle—or a particular class of oracle—and you can be confident of one thing: you’ll miss lots of bugs.

Documents and tools are media. In the most general sense, “medium” is descriptive of something in between, like “small” and “large”. But “medium” as a noun, a medium, can be between lots of things. A communication medium like radio sits between performers and an audience; a psychic medium, so the claim goes, provides a bridge between a person and the spirit world; when people want to exchange things of value, they often use money as a medium for the exchange. Marshall McLuhan, an early and influential media theorist, said that a medium is anything that humans create or use to effect change. Media are tools, technologies that people use to extend, enhance, enable, accelerate, or intensify human capabilities. Extension is the most obvious and prominent effect of media. Most people think of media in terms of communications media. A medium can certainly be printed pages or television screens that enable messages to be conveyed from one person to another. McLuhan viewed the phonetic alphabet as a technology—a medium that extended the range of speech over great distances and accelerated its transmission. But a cup of coffee is a medium too; it extends alertness and wakefulness, and when consumed socially with others, it can extend conversation and friendliness. Media, placed between a product and our observation of it, extend our capacity to recognize bugs.

McLuhan emphasized that media change things in many different ways at the same time. In addition to extending or enabling or accelerating our capabilities, McLuhan said, every new medium obsolesces one or more existing media, grabbing our attention away from old things; every new medium retrieves notions of formerly obsolescent media, making old things new again. McLuhan used heat as a metaphor for the degree to which media require the involvement of the user; a “cool” medium like radio, he said, requires the listener to participate and fill in the missing pieces of the experience; a “hot” medium like a movie, provides stimulation to the ear and especially the eye, requiring less engagement from the viewer. Every medium, when “overheated” (McLuhan’s term for a medium that has been stretched or extended beyond its original or intended capacity), reverses into the opposite of what it might have been originally intended to accomplish. Socrates (and the King of Egypt) recognized that writing could extend memory, but could reverse into forgetfulness (see Plato’s dialogue Phaedrus). Coffee extends alertness and conversation, but too much of it and people become too wired to work and too addled to chat. A medium always draws attention to itself to some degree; an overheated medium may dazzle us so much that we begin to ignore what it contains or what we intended it to do for us. More importantly, a medium affects us. This is one of the implications of McLuhan’s famous but oblique statement “the medium is the message”. By “message”, he means “the change of scale or pace or pattern” that a new invention or innovation “introduces into human affairs.” (This explanation comes from Mark Federman, to whom I’m indebted for explaining McLuhan’s work to me over the years.)

When we pay attention, we can easily observe media overheating both in talk about testing and development work and in the work itself. Documents and tools frequently dominate conversations. In some organizations, a problem won’t be considered a bug unless it is inconsistent with an explicit statement in a specification or requirements document. Yet documents are only partial representations, subsets, of what people claim to have known or believed at some point in time, and times change. In some places, testing work is dominated by automated checking. Checks can be very valuable, providing great precision and fast feedback. But checks may focus on functional aspects of the product, and less on other parafunctional attributes.

McLuhan’s work emphasizes that media are essentially neutral, agnostic to our purposes. It is our engagement with media that produces good or bad outcomes, or rather good and bad outcomes. Perhaps the most important implication of McLuhan’s work is that media amplify whatever we are. If we’re fabulous testers, our tools extend our capabilities, helping us to be even more fabulous. But if we’re incompetent, tools extend our incompetence, allowing us to do bad testing faster and worse than we’ve ever been able to do it before. To the degree that we are inclined to avoid conflict and arguments, we will use documents to help us avoid conflict and arguments; to the degree that we are inclined to welcome discussion and the refinement of ideas, documents can help us do that. If we are disposed to be alert to a wide range of problems, automated checks will help us as we diversify our scope; if we are oblivious to certain kinds of problems in the product, automated checks will amplify our oblivion.

Reference oracles—documents, checking tools, representative data, comparable products—are unquestionably media, extending all of the other kinds of oracles: private and shared mental models, both private and shared feelings, conversations with others, and principles of consistency. How can we evaluate them? What do we use them for? And how can we use them to help us find problems without letting them overwhelm or displace all the other ways we might have of finding problems? That’s the subject of the next post.

Oracles from the Inside Out, Part 4: From Experience to Inference

Tuesday, September 8th, 2015

In the previous post, I gave an example of the happy (and, alas, rare) circumstance in which the programmer and I share models and feelings, such that the programmer becomes aware of a problem I’ve found without my being explicit about it. That is, on this table, I can go directly from top-left, where I experience the problem, to conference in the bottom left, where awareness of that problem has been successfully conveyed to someone else.

More often than not, my testing client will not recognize the meaning and significance of the problem immediately in the same way that I do. My awareness of a problem tends to start with a feeling. To impart my feeling of a problem successfully to others, most of the time I must move from my tacit, internal, and emotional reaction about it, and develop an explicit and rational description of the problem. That move often begins with framing the problem in terms of logical inferences, based on an undesirable inconsistency between my observation of the product and my understanding of some desirable principle. (An inference is a conclusion that we arrive at by a line of reasoning, in which we make logical connections between facts. Inference is also an activity: the process of making those connections.) In the diagram, that’s a move from the upper left to the upper right, from a tacit feeling to an observable and explicit inconsistency; from experience to inference.

I’ve been testing a version of Adobe Acrobat. In its File / Properties dialog, there’s a metadata field called “Keywords”. I enter some data using Perlclip’s counterstring feature (a counterstring is a string that allows you to see its own length easily).
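For readers who haven't seen a counterstring, here is a minimal sketch of the idea in Python. This is not Perlclip's own code; the function name `counterstring` and its `marker` parameter are assumptions made for illustration. Each marker character sits at the position given by the number just before it, so the tail of whatever a field accepts tells you roughly how many characters got in.

```python
def counterstring(length: int, marker: str = "*") -> str:
    """Build a string of exactly `length` characters that reports its own length.

    Hypothetical helper for illustration; not Perlclip's implementation.
    The string is built from the end backwards: every marker is preceded by
    the digits of its own 1-based position.
    """
    tokens = []
    position = length
    while position > 0:
        token = f"{position}{marker}"
        if len(token) > position:
            # Not enough room left at the start of the string for all the
            # digits, so keep only the rightmost characters.
            token = token[-position:]
        tokens.append(token)
        position -= len(token)
    # Tokens were collected from the end of the string, so reverse them.
    return "".join(reversed(tokens))


short_example = counterstring(10)
long_example = counterstring(1999)
print(short_example)
print(len(long_example), long_example[-12:])
```

Pasting a long counterstring into a field and reading the last few characters that survive gives a quick estimate of the limit without counting by hand.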

I observe that this field appears to be limited to a maximum of 1,999 characters. I try typing beyond this apparent limit, and nothing happens. When I click on the “Additional Metadata” button, a new dialog appears. That dialog also has a “Keywords” field; the text that I entered into the previous dialog is observable there too. The field looks a little different, though. I experience a feeling of curiosity, and make an inference that something else might be different here. I move the cursor into the Keywords field, and press Ctrl-End. Then I try to enter some text.

I observe that I’m able to enter more than I was able to enter into the “Keywords” field in the previous dialog; the limit appears to be broken. That’s interesting. I paste the counterstring from Perlclip, and discover that the limit is 30,000 characters.

When I dismiss the Additional Metadata dialog, the original File / Properties dialog shows the 30,000-character string. I experiment a little more, and find I can delete characters from that string and replace them with new ones; there’s no longer a 1,999-character limit. Hmmm.

I go back to the Additional Metadata dialog, and delete some text from the Keywords field. I dismiss the dialog, and discover my changes have been reflected in the Keywords field of the File / Properties dialog. As before, I can edit what’s there, but the limit is now the length of the string that was in the Additional Metadata dialog; less than 30,000, but more than 1,999.

This is all a little surprising, but is this a problem? I refer to my feelings and make some inferences, based on principles related to consistencies and inconsistencies.

In one dialog, there appears to be an enforced limit of 1,999 characters for the Keywords field; in another, the limit is 30,000 characters. I’m surprised and a little confused, because the product appears to be inconsistent with itself. I don’t understand why all this should be so; the product is behaving in a way that is inconsistent with my ability to explain it. I infer that wherever there’s a limit, someone had some purpose in setting it. I don’t have access to the programmer’s or designer’s intentions at the moment, but whichever limit someone intended, the product seems inconsistent with one of them: inconsistent with purpose. I’ve seen problems like this before (in some cases, data from one field overwrites data in another, corrupting the file or providing the opportunity for a buffer-overflow attack), so I can infer some risks based on the product being consistent with patterns of familiar problems.

In Rapid Testing, we’ve collected these and several other principles by which we can make inferences about problems in the product; you can find our current list of these oracles here. Armed with oracles in the form of consistency heuristics (principles and inferences about them), I’ve got more, much more, than “huh?” or “euuugh!” available to me when I’m relating the problem to my client.

Some testers habitually report problems in terms of a “defect”, “actual result” and an “expected result”. This kind of templatized reporting often seems imprecise and even a little lame to me, for reasons that I set out here and here. It’s premature and presumptuous of me even to think of this as a defect, never mind making such a claim in a formal report. Testers confuse “expectation” with “desire”; something can be desirable or undesirable regardless of what we expected from it. Neither expectations nor desires are absolutes; they’re relative to some person(s), and based on specific models, quality criteria, and principles. So, rather than using the tired “expected result/actual result” pattern, try providing your observation of a problem and the inconsistency principle that gives warrant to your belief that it is a problem.
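If your bug tracker or note-taking is scriptable, the shape of such a report might look like the sketch below. The `ProblemReport` class and its field names are hypothetical, made up for illustration rather than taken from any Rapid Software Testing material or tool; the wording of the example comes from the Keywords observations above.

```python
from dataclasses import dataclass


@dataclass
class ProblemReport:
    """Hypothetical report shape: an observation plus the principle behind it."""

    observation: str  # what the tester actually saw
    principle: str    # what the product seems inconsistent with
    reasoning: str    # why that inconsistency might matter to someone

    def render(self) -> str:
        return (
            f"Observation: {self.observation}\n"
            f"Seems inconsistent with: {self.principle}\n"
            f"Why this might be a problem: {self.reasoning}"
        )


report = ProblemReport(
    observation=(
        "Keywords accepts at most 1,999 characters in File / Properties, "
        "but 30,000 characters in Additional Metadata."
    ),
    principle="the product itself",
    reasoning=(
        "Whichever limit was intended, one dialog does not enforce it; "
        "similar problems elsewhere have led to corrupted data."
    ),
)
print(report.render())
```

Nothing in this shape says what should happen; it leaves that decision with the client while making the tester's reasoning visible.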

Some of the oracle principles can be applied fairly directly and immediately to a reasonably solid inference. For example, I know tacitly that Adobe’s business is all about rendering text and images beautifully on paper and on screens, so it seems to me that the font rendering issue (as I noted in an earlier post, and as you can see above in the Additional Metadata dialog) is inconsistent with an image that Adobe might reasonably want to project. Some problems, though, are more easily noticed or described with the help of some medium—a tool or an artifact based on and representing an explicit, shared model; something that we can point to; a reference. We’ll talk about that in the next post.

Oracles From the Inside Out, Part 3: From Experience Directly to Conference

Sunday, September 6th, 2015

Yesterday I described the moment of recognition of a problem, which tends to happen in the upper-left corner of this table:

When I perceive a problem during testing, it’s my job to let people know about what I’ve found, and to arrive at a shared set of feelings and mental models about the problem, represented by the lower left quadrant. That is, I want to move from experience—my private, internal, tacit mental models of the product—to the conclusion of successful conference, such that my clients understand what I’ve observed and why I think it’s a problem; and such that I understand their response to the problem, even though that doesn’t necessarily mean that the problem will be fixed.

I might not have to talk about the problem. Sometimes I can get the message across tacitly, without actually telling someone explicitly what the problem is. If I’ve been working with people for a while on the same project, I might be able to communicate without using words at all. In a society or culture, people develop collective tacit knowledge, an understanding of what things are and how things are done in that culture. Collective tacit knowledge is not contained in any one person’s head, but in the social group, and in the relationships between the people in it.

From the previous post, recall the cropped fonts that surprised and amused me.

A developer is walking by. I gesture or grunt something to get his attention. “Yo!”

The programmer replies, “Huh?”

I turn to my screen as he looks over my shoulder. I have the Windows Change Font Size dialog open. With a glance, I can see that he sees it too, and that he’s watching. I point to the Large Fonts radio button. “Hmmm?” I open Acrobat’s File / Properties / Additional Metadata dialog. We both see this:

I say “Euuugh!”

The programmer peers down at the screen, and I can tell that he sees the cropped font problem just as I do—just as you do. “Euuugh!” he says. “Ugh!”

“Mmmmm!”, I reply. He shrugs, sighs, rolls his eyes, and walks off towards his desk. By his manner and his expression, I know that he’s intending to fix the problem, even though we’ve only communicated in grunts. I’ve just had a conference with the programmer, with the communication structured and framed by our shared feelings and mental models; by our collective tacit knowledge. I didn’t have to describe the problem or cite an oracle. Without having exchanged a single intelligible word, we’ve arrived at our destination. We’ve gone straight from my tacit, private feelings and mental models (experience) to shared feelings and a shared understanding about the problem (conference) and that has prompted some action. A bug is going to be fixed! Huzzah!

Of course, such simple and straightforward exchanges don’t happen very often. Far more often, there’s more work to do. We’ll talk about some of that in the next post.

Oracles from the Inside Out, Part 2: Experience, Mental Models, and Feelings

Saturday, September 5th, 2015

In the first post in this series, I introduced some of the factors in recognizing a problem when one occurs during testing. Let’s walk through some examples of recognizing and relating problems.

Imagine that I’m a tester at Adobe, a few years back, testing a version of Acrobat. As it happens, my laptop’s screen is smaller than a desktop monitor, but at 1920 by 960 pixels, the resolution is quite fine (in more ways than one). On my display, the standard Windows fonts are tiny and my eyes aren’t so good, so I use Windows’ Large Fonts setting. When large fonts are active, Adobe Acrobat displays the text in its File / Properties / Additional Metadata dialog like this:

My process of recognizing and reporting a problem typically starts with a feeling that I experience. I might be confused by some behaviour that I don’t understand; or I might be frustrated that I can’t get something done. I might be annoyed by a feature that’s obscure, hard to find, or hard to use. I’m impatient waiting for the system to finish what it’s doing so that I can move on to the next thing. Maybe I’ve seen this problem before, and I’m mildly disgusted that it has returned, or I’m worried that the programmers misunderstood a bug report or misimplemented a fix somehow. When I experience feelings like these, I begin to suspect that I’m seeing a problem.

When I see the Advanced File Properties dialog, I’m surprised and to some degree amused by the cut-off text in its labels, which seems inappropriate and somewhat silly because it’s in conflict with the way I think text should look. Formally, that’s called a schema; in this case, it’s an internal, private, tacit mental model of how things should be. My models provide framing for why I believe something is a problem, and the intensity of my feelings provides clues about the seriousness of the problem. My models have been shaped by experiences of one kind or another, and in turn my models shape my perceptions and my experiences.

Sometimes I might encode some of my mental models, actions, and observations in a set of automated checks, and at some point one might be returning a red result. You could say that my recognition of a problem starts not with a feeling, but with my observation of the red bar, or even with my encoding of the check. I would reply that the observation alone doesn’t have any real meaning or significance for me—that is, I don’t interpret the observation as a problem until the feeling arrives, which it does pretty much immediately anyway: I’m surprised by the red bar and curious about what has produced it.
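As a rough illustration of what encoding a mental model in a check can look like, here is a small Python sketch. Everything in it is assumed for the example: the `Label` type, the pixel values, and the "Description" label are invented, and no real UI automation library is involved. The check encodes one narrow piece of my model (text should fit inside its control) and reports only red or green.

```python
from typing import NamedTuple


class Label(NamedTuple):
    """Hypothetical measurements of one label in a dialog."""

    text: str
    text_height_px: int  # height needed to draw the text at the current font size
    box_height_px: int   # height of the control that holds the text


def check_label_not_cropped(label: Label) -> str:
    """Return "green" if the text fits in its box, otherwise "red"."""
    return "green" if label.text_height_px <= label.box_height_px else "red"


# Invented sample data: one label that fits and one that would be cut off.
labels = [
    Label("Keywords", text_height_px=14, box_height_px=16),
    Label("Description", text_height_px=20, box_height_px=16),
]
results = {label.text: check_label_not_cropped(label) for label in labels}
print(results)
```

The red result is only a signal. Deciding whether it points to a problem in the product, in the check, or in the model behind the check still takes a person who is surprised and curious enough to look.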

An experience of a problem (represented by the upper left quadrant of the diagram) is private to me, internal and tacit. Inside, I say something like “Euuugh!” or “Oh!” or “Damn!” or “Ha ha!”. I have the feeling that there’s a problem, but I haven’t yet made it explicit, not to anyone else, and maybe not even to myself. Feelings don’t come with return addresses, and we usually don’t have to justify or explain what we feel or why we feel it. When I’m testing, though, I’m acting as an agent for other people, so when I get a feeling of a problem, there’s more work to do. That’s the subject of the next post in this series.