Blog Posts from March, 2007

How do we know when we’ve done enough exploratory testing?

Thursday, March 29th, 2007

This was written in reply to a suggestion on comp.software.testing that we can’t decide when we’ve done enough ad hoc or exploratory testing.

The original poster asked:

Can you predict how much time such testing stage will take? What are grounding your estimates on? When do you have a chance to adjust your estimates? Is not it too late? And what are you doing in case it is? How do you estimates depend on the level of experience of your testers? Let’s imagine you have a newbie on your team. What estimates you will be using for that person?

My reply:

Can you predict how long daily exercise will take?

I can predict how long an exploratory testing stage will take: ten minutes. Or ninety minutes. Or many sessions that add up to eight hours. Or many sessions that add up to three weeks. I’m grounding my estimates on the time that I allocate to the task, and what’s more, unlike other forms of estimation, I can make the estimate totally predictable: I’ll stop when the allotted time is up.

As soon as I find some interesting information (the product appears to be in great shape, the product is in terrible shape, there is an area about which we know far too little, etc., etc.), I can adjust my estimate. It’s not too late if the product hasn’t shipped. If I have a newbie on the team, I have choices: I can monitor and mentor the newbie closely; I can give the newbie more highly structured processes to follow; or I can simply recognize that in the same amount of time, I might not get the same quantity or quality of information that I can get from a more experienced person.

How do we demonstrate that we’ve done enough exploratory testing? We could do this by telling a story about test coverage. If we’ve modeled the product in lots of different ways, and then tested in accordance with those models, we can say that we’ve done enough testing and stop. If we’re satisfied that we have addressed the list of compelling questions that we set out to answer (along with the other questions that we realized are important along the way), then we can say that we’ve done enough testing and stop. If the product is so horribly broken that it has to be sent back for major rework, then we can say that we’ve done enough testing and stop. If management decides that it has sufficient confidence to ship, and ships, then we can say that we’ve done enough testing and stop. If management decides that it must ship the product, even though confidence in its quality is less than what we’d like it to be, we can say that we’ve done enough testing and stop. If we’re testing on behalf of someone who is trying to decide whether the software is acceptable, and they say it is, we can say that we’ve done enough testing and stop. If we’re testing on behalf of someone who is trying to decide whether the software is acceptable, and they say it isn’t, we can say that we’ve done enough testing and stop, at least until we get another version.

Do you notice that this is exactly the same set of stopping heuristics that we use for other forms of testing?

Context-driven thinking… or not

Wednesday, March 28th, 2007

I revisited this passage recently. I’ve added the emphasis.

[quote]

I am going to describe my personal views about managing large software developments. I have had various assignments during the past nine years, mostly concerned with the development of software packages for spacecraft mission planning, commanding, and post-flight analysis. In these assignments I have experienced different degrees of success with respect to arriving at an operational state, on-time, and within costs. I have become prejudiced by my experiences and I am going to relate some of these prejudices in this presentation.

[/quote]

Each of the italicized passages provides important clues about the author’s context, and he’s explicit about his biases.

The author goes on to suggest that program design comes first; that the design should be documented; that it should be done twice (“If the computer program in question is being developed for the first time, arrange matters so that the version finally delivered to the customer for operational deployment is actually the second version so far as critical design/operation areas are concerned.”); that testing should be planned, monitored, and controlled; that the customer should be involved. Then he notes (again, I’ve added the italics) “I would emphasize that each item costs some additional sum of money. If the relatively simple process without the five complexities described here would work successfully, then of course the additional money is not well spent.”

What’s interesting is how, over the years, people have adopted and applied the model that the author describes while implicitly ignoring the context that he has explicitly set out. The model might make sense in similar contexts (very large projects with enormous technical complexity, where huge amounts of money, human lives, and national prestige are at stake), with the additional assumption that it’s still 1970.

The author? The model? The author is Dr. Winston W. Royce. The model is the Waterfall model of software development. http://www.cs.umd.edu/class/spring2003/cmsc838p/Process/waterfall.pdf.

Medieval Tech Support

Saturday, March 10th, 2007

Mark Federman is the author (with Derrick deKerckhove) of McLuhan for Managers, a wonderfully accessible book on McLuhan’s principles applied to more recent ideas about technological innovation.

Mark was the first person to point me to this (http://www.youtube.com/watch?v=LRBIVRwvUeE), which I anticipate will shortly be all over the Web. Although I can see his comments in my aggregator, I can’t see them on his blog yet. To sum them up: the use of any old technology is “obvious” to those of us who are grounded in it, and not at all obvious to those who are coming at it for the first time.

One risk for testers and other members of a project team is that we get to learn about the product early, and that learning gets subsumed into notions of obviousness. Such notions can threaten the quality of our testing. Revisiting our user models with fresh eyes, fresh perspectives, and fresh scenarios is one antidote; can we think of others?

A Fairy Tale from Jerry Weinberg

Thursday, March 8th, 2007

One good reason for reading Michael Hunter’s blog: He’ll help make sure you don’t miss things like this (http://www.ayeconference.com/Articles/TestTrimmingFable.html), a new fairy tale from Jerry Weinberg. Kids (and their grandfathers) say the darndest things.

The White Glove Heuristic and The “Unless…” Heuristic

Wednesday, March 7th, 2007

Part of the Rapid Software Testing philosophy involves reducing waste wherever possible. For many organizations, documentation is an area where we might want to cut the clutter. It’s not that documentation is valueless, but every minute that we spend on documentation is a minute that we can’t spend on any other activity. Thus the value of the documentation has to be compared not only to its own cost, but to its opportunity cost. More documentation means less testing; that might be okay, and even important, but it might not.

This leads to the White Glove Heuristic: if we have documentation somewhere in our process, such that running a white-gloved finger over it would cause the glove to pick up a bunch of dust, let’s at least consider applying less work to that document, or eliminating it altogether.

In the RST class, there’s often push-back to this idea. That’s understandable; at one point, someone started producing the document in an attempt to solve some problem or mitigate some risk. The question then becomes, “Has the situation changed such that we no longer need that document?” The problem I see most often is that the question is begged.

On a recent trip to India, many of the participants in the class pushed back on the very idea of reducing documentation in any way, claiming “our project managers would never accept that.”

I was curious. “Have you asked them?” The answer was, as I suspected, No. “So suppose you’re producing a forty-page test report for a busy executive. What if that executive only ever reads the summary on the first page? Might she approve of a shorter document? If she had important questions about things in that document, could you answer those questions at a lower cost than preparing the big document?” Maybe, came the answer. “So: your project managers would never accept changes to your test documentation, unless they’re not reading the whole thing anyway. Or they’d never accept changes unless they were aware of the amount of testing time lost to preparing the document. Or they’d never accept changes unless they had the confidence that you could give them the information they needed on demand.” The class participants then began to recognize that a session-based test management approach might allow them to make their testing sufficiently accountable while satisfying the executives with more lightweight summary reports.

Later in the class, we were talking about oracles, and how slippery they can be. Oracles are heuristic; that means that they often work, but they can fail, and that we learn something either way. The class presents a list of consistency oracles (the list is now a little longer than in the linked article); for example, a product should behave in a manner consistent with its history, unless there’s a compelling reason for it to be otherwise, like a feature enhancement or a bug fix.

This led me to formulate The “Unless…” Heuristic: Take whatever statement you care to make about your product, your process, or your model, and append “unless…” to it. Then see where the rest of the sentence takes you.

Matt Heusser’s Testing Challenge

Tuesday, March 6th, 2007

Hopeless. Absolutely hopeless. Lots of important work to do, and this testing challenge steals an hour from me. Matt Heusser posted it on his blog.

When James Bach and I pose testing problems like this at one another, we offer the opportunity to provide a quick, practical or deep answer. Here’s mine. It ain’t quick. It’s fairly deep, but I hope it’s also reasonably practical.

To start with, I think that in this case there’s a risk that Matthew is conflating two things in his ideas about acceptance tests—the idea of acceptability to a given customer, and the idea of acceptability to a given database or application. The important questions to ask here are “who is doing the accepting” and “what are their criteria for acceptance?” After all, a database might reject the number +1 (416) 656-5160 because it expects the data in the format 416-656-5160, but a human could easily deal with the discrepancy. Conversely, a database might happily accept a credit card number composed of 16 digits, since that number meets its acceptance criterion. But if that credit card number is not associated with a customer record (where the database has no way of knowing that), the number is invalid. Thus, I question Matt’s suggestion…

Think about it – the requirement is to take one set of black-box data, and import it into another black box. We can test the data file that is created, but the real proof is what the second system accepts — or rejects.

…because I wouldn’t characterize acceptance by the second system as “the real proof”. It may be a real proof—but the system could just as easily fail by accepting something that it shouldn’t.
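To make the distinction concrete, here’s a minimal sketch in Python; the function names, the record lookup, and the sample numbers are hypothetical, not anything from Matt’s system. The point is simply that a value can satisfy whatever format check a database or input layer applies and still be unacceptable in the business sense, because no customer owns it.

    # A hypothetical illustration: "accepted by the format check" is not the
    # same as "acceptable to the business".

    def passes_format_check(card_number: str) -> bool:
        """The kind of check a database or input layer might apply: 16 digits."""
        digits = card_number.replace(" ", "").replace("-", "")
        return len(digits) == 16 and digits.isdigit()

    def is_acceptable(card_number: str, customer_records: dict) -> bool:
        """The business-level question: does a known customer own this number?"""
        return passes_format_check(card_number) and card_number in customer_records

    records = {"4111111111111111": "Customer A"}   # invented customer data

    print(passes_format_check("1234567812345678"))      # True: the system takes it
    print(is_acceptable("1234567812345678", records))   # False: nobody owns it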

Matt goes on…

First of all, the test databases used are refreshed every three months from production. That means that you either have to find test scenarios from live data (and hope they don’t change next time you refresh) or re-enter every scenario in test every three months.

Without knowing more about the story, I also question this premise. This seems to presume that we don’t have any opportunity to test the conversion on anything other than a live platform. Is this really the case? Live data is wonderful (as Jonathan Kohl musically asserts, “Ain’t Nothing Like the Real Thing, Baby”), but it’s not the only kind of data that we can use for testing.

Could we use sample data that is not taken from live data? Could we use test environments where we can use sandboxed copies of the program under test and the data? Sometimes the constraints that we’re dealing with are real, but sometimes they’re artificial, arbitrary, or assumed. I try to question those constraints. If I’m facing some kind of constraint that slows me down, reduces coverage, or makes it hard to determine success or failure, those things could weaken the quality of the testing, and that’s a potential project risk. If it’s a serious problem such that the testing strategy won’t work unless I get some help, I mention the problem as an issue, and negotiate with the project owner to change the context or the strategy, or to recognize the risks in not changing things. Change might be easy or might be hard, but in the end, empirical experience will help us to make decisions.

Now, take the trading partner example. The best you can do within your organization is to test the file. The interface might take three hours to run, then you GREP the file for results and examine.

Is that really the best we can do within our organization? Best in what sense? Fastest? Most convenient, given the tools you have? Highest informational bang for the buck? Easiest to do, given the people you have and their skills? Most likely to reveal a problem? Most likely to reveal an important problem? Would (for example) Excel be a better tool than GREP, allowing you to sort and view the data from more angles? Would an inelegant script that covers several risks reasonably well be better than a polished or comprehensive one? Would something quick help us figure out what we’re really looking for? What are we looking for? Whose values are we trying to serve? What are the risks we’re facing? Given all this stuff, might investing in testing environments and simulators be a pragmatic idea?

The same ideas of quality criteria for a product can be used for asking questions about the testing effort, and for each product there will be different answers.

You’ll have to write custom fixtures to do this, and your programming language isn’t supported by FitNesse. Or you could write a fixture that takes a SELECT statement to count the number of rows that are generated by the file, run the interface, and compare.

That’s one heuristic; the number of rows should be consistent across tables, and that ought to be a pretty fast test for many databases. With enough programming and database savvy, we could probably come up with tons more criteria too, and different ways of testing them. I’m presuming an SQL database, viewable with Toad here.
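As a minimal sketch of that row-count heuristic, and assuming nothing about the real environment, here’s how it might look in Python, with in-memory sqlite3 databases standing in for the actual source and destination; the table and the data are invented for illustration.

    import sqlite3

    def row_count(conn, table):
        # Table names can't be bound as query parameters, so in real code keep
        # them to a known, whitelisted set before interpolating.
        return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

    # Stand-ins for the source and destination databases.
    source = sqlite3.connect(":memory:")
    destination = sqlite3.connect(":memory:")
    for db, rows in ((source, 3), (destination, 2)):
        db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        db.executemany("INSERT INTO customers VALUES (?, ?)",
                       [(i, f"name{i}") for i in range(rows)])

    src, dst = row_count(source, "customers"), row_count(destination, "customers")
    print(f"customers: source={src} destination={dst} "
          f"{'OK' if src == dst else 'MISMATCH'}")

A mismatch doesn’t tell us what went wrong, only that something is worth a closer look. And counting rows is only one of many options; for example: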

we could code the interface with lots of its own unit tests

we could have low-level validity tests of the data within the interface’s code

we could write functions that checksum the data values for some row or some column, or for each row and each column, on in-memory objects, test databases, or (shudder) the real thing;

we could generate lots of small data files, each focused on triggering a particular problem;

we could generate a file (using valid but randomly scrambled live data) that is many times larger than the typical file, looking for performance or stress-related problems in the interface;

we could port the imported data back to a new table, reversing any conversion algorithms, and see if you get a table that matches the source;

we could do that; then feed those results back to the destination again, and see if the second pass gave us a table identical to the first;

we could randomly select 10 records and eyeball them with Toad;

we could do a port of a table using marker values or metadata—data that refers to itself (e.g. Record1Field1, Record1Field2) to make certain kinds of consistency problems relatively easy for a human or script to evaluate;

we could create a table of super-funky data that we believe is certain to trigger validation error handling; if it makes it through the conversion without triggering those handlers, that’s bad. Drive it with single records at first, then lots as we start to script it more heavily.

we could try porting certain tables separately from others, and using incremental tests as each table is updated or created; script that or not;

as you suggest below, we could run rough tests that give us the confidence that something is reasonable (if not perfect), when reasonable (and not perfect) is okay; script those checks or not;

yes, we could count rows; script it or ask Toad;

we could look for the maximum and minimum values in each column for a given table, and compare those; script that or ask Toad (see the sketch after this list);

we could count the number of times that a given value appears in a given column for a given table, and compare the counts between the source and destination databases; script it;

we could set up a sample database or mock objects whose values have been chosen to represent some kind of risk—lots of null values in fields where data is expected; repeated values in fields where a unique key is expected; out-of-range values in fields where in-range values are expected; values that contain characters that are “special” by some criterion (making sure that we test against plenty of criteria); over-the-top outrageous values; etc., etc., etc.; move it over once and scan things with Toad and eyeballs, or script some checks, or both;

we could set up a benchmark trial validation process on a simulated environment, and use that for really harsh or risky tests; script it or not as appropriate;

we could have validation-oriented stored procedures in the destination database that run immediately after a conversion; those are kinda scripted by nature;

we could recognize that the easiest and fastest test is the one you never have to run. Some mismatches between one database and the next don’t represent a problem. In such cases, you could choose to ignore a mismatch if one were there. The source file has a field for “Birthday greeting”; we’re not going to send birthday cards from the destination file, so a mismatch here might be irrelevant.

we could vary our testing strategy over time to try to identify and trap new risks. Problems that we discover, near-misses, greater familiarity with the program space and the test space, and new people on the evolving team will all lead to new ideas.
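For instance, here’s a minimal sketch of the min/max and value-count comparisons from the list above, again with in-memory sqlite3 tables standing in for the real source and destination; the table, column, and values are assumptions made up for illustration.

    import sqlite3
    from collections import Counter

    def column_values(conn, table, column):
        return [row[0] for row in conn.execute(f"SELECT {column} FROM {table}")]

    def compare_column(src_conn, dst_conn, table, column):
        src = column_values(src_conn, table, column)
        dst = column_values(dst_conn, table, column)
        return {
            "min_matches": min(src) == min(dst),
            "max_matches": max(src) == max(dst),
            "frequencies_match": Counter(src) == Counter(dst),
        }

    # Build tiny stand-in source and destination tables.
    source = sqlite3.connect(":memory:")
    destination = sqlite3.connect(":memory:")
    for db, amounts in ((source, [10, 20, 20, 95]), (destination, [10, 20, 95, 95])):
        db.execute("CREATE TABLE orders (amount INTEGER)")
        db.executemany("INSERT INTO orders VALUES (?)", [(a,) for a in amounts])

    print(compare_column(source, destination, "orders", "amount"))
    # Here min and max agree, but the value frequencies reveal a discrepancy.

None of these checks is “the” answer; each is one more inexpensive way of noticing that the two sides disagree.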

Of course, a programmer is going to have to write the SELECT statement. Is it a valid acceptance test?

I don’t see anything inherently invalid about it. The question is who is doing the accepting, and what they are willing to take as evidence of acceptability.

Or you could have the number of rows fixture be approximate – “Between 10,000 and 15,000” – customers could write this, and it guarantees that you didn’t blow a join, but not much else.

You could write code that accesses the deep guts of the application, turning it sideways to generate a single member at a time, thus speeding up the acceptance test runs to a few seconds. That’s great for the feedback loop, but it’s more of a unit test than an acceptance test.

“Unit” and “acceptance” are orthogonal categories to me. “Unit” is about the level of code that we’re trying to test; “acceptance” is about who’s accepting it and what they value. But maybe we can reframe; maybe the difference between “unit” and “acceptance” is relevant only when the test is passing—whereas if it fails, either way it’s a rejection test.

You could suggest I re-write the whole thing to use web services, but that introduces testing challenges of an entirely different kind. To be frank, when I have a problem and people suggest that I re-write the whole thing without recognizing that it would present an entirely different set of challenges, it’s a sign to me of naiveté.

There’s no question in my mind that changing the context changes the challenges. On the other hand, context-driven thinking as I see it requires us to recognize the possibility of changing the context, too.

I submit that all of these would be a significant investment in time and effort for not a whole lot of value generated.

I submit that all of these could be a significant investment in time and effort, but I can’t make any assumptions about the value without a more specific context.

Over the last few months, James Bach and I have been working on exercises and lessons for our Rapid Software Testing class, so that people in real-world testing situations have a general and generative framework for handling the current testing mission. We started by defining the Universal Test Procedure, Version 1.0—a naïve description of testing: “Try it and see if it works.” That’s a little vague. It’s relatively easy to demonstrate at least once that an application can work, but the definition doesn’t address the issue of how the product might fail—and failure is where the risk lives. So Version 1.5 goes like this: “Try it to learn sufficiently about how the product can work and how it might fail.” “Sufficiently” works double duty—“try it sufficiently”, and “to learn sufficiently”. Since we can’t test anything completely, sufficiency is the best we can hope to achieve.

Why Settle for Unit Tests?

Monday, March 5th, 2007

There’s a principle in some circles that suggests that the full suite of regression tests be run after each build, or at the end of each iteration, or before each release. Typically when people talk about stuff like that, they don’t bother to specify what they mean by “full”, or “regression tests”, or even “the” (these tests, but no others?), so it’s hard to tell whether the suggestion is reasonable or not. When the suggestion is reasonable, it’s founded on the idea that there’s a risk that there might be an important problem in the code or the data it deals with—a problem that automated regression tests, typically at the unit level, could catch.

That’s a nifty notion. I’ve worked on several projects where certain unit tests are deemed to be very important, and rightly so. Sometimes these tests are simple assertions that the code deals properly with unexpected values or exceptional conditions. That sounds like a good risk management approach to me. But why stop at the unit tests? If there is a risk of a serious problem, why not move such assertions right into the code? Some might argue that this would cause a performance hit, but on systems that can perform billions of simple comparisons a second, that overhead might be quite tolerable next to the risk of some failure.
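Here’s a minimal sketch of what that might look like in Python; the function and its business rule are hypothetical, invented only to show the shape of the idea: the same kind of check that a unit test would make is also asserted inside the running code.

    def apply_discount(price: float, discount_rate: float) -> float:
        # Hypothetical production function with its own built-in checks.
        assert price >= 0, f"negative price: {price}"
        assert 0.0 <= discount_rate <= 1.0, f"discount out of range: {discount_rate}"
        result = price * (1.0 - discount_rate)
        assert 0 <= result <= price, "discounted price should not exceed the original"
        return result

    if __name__ == "__main__":
        print(apply_discount(100.0, 0.15))   # 85.0

The unit tests don’t go away; the in-code assertions simply keep checking in production, where the surprising inputs actually show up.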

Critical thinking is a heuristic approach to solving problems caused by us forgetting something or making an invalid assumption. Testing is strongly informed by critical thinking, and so is good unit testing. The best programmers I’ve ever seen have been great critical thinkers; they programmed like great testers, recognizing that things can be very different from our preconceived ideas. They could sometimes afford to put a different emphasis into their unit testing because the product was exceptionally robust, containing important tests within the running code itself. Unit tests are a great way to mitigate certain kinds of risk, but they’re not the only way.