
A Letter To The Programmer

This is a letter that I would not show to a programmer in a real-life situation. I’ve often thought of bits of it at a time, and those bits come up in conversation occasionally, but not all at once.

This is based on an observation of the chat window in Skype 4.0.0.226.

Dear Programmer,

I discovered a bug today. I’ll tell you how I found it. It’s pretty easy to reproduce. There’s this input field in our program. I didn’t know what the intended limit was. It was documented somewhere, but that part of the spec got deleted when the CM system went down last week. I could have asked you, but you were downstairs getting another latte.

Plus, it’s really quick and easy to find out empirically; quicker than looking it up, quicker than asking you, even if you were here. There’s this tool called PerlClip that allows me to create strings that look like this:

*3*5*7*9*12*15*18*21*24*27*30*33*36*39*42*45*48*51*54*57*60*…

As you’ll notice, the string itself tells you about its own length. The number to the left of each asterisk tells you the offset position of that asterisk in the string. (You can use whatever character you like for a delimiter, including letters and numbers, so that you can test fields that filter unwanted characters.)

It takes a handful of keystrokes to generate a string of tremendous length, millions of characters. The tool automatically copies it to the Windows clipboard, whereupon you can paste it into an input field. Right away, you get to see the apparent limit of the field; find an asterisk, and you can figure out in a moment exactly how many characters it accepts. It makes it easy to produce all kinds of strings using Perl syntax, which saves you having to write a line of Perl script to do it and another few lines to get it into the clipboard. In fact, you can give PerlClip to a less-experienced tester that doesn’t know Perl syntax at all (yet), show them a few examples and the online help, and they can get plenty of bang for the buck. They get to learn something about Perl, too. This little tool is like a keychain version of a Swiss Army knife for data generation. It’s dead handy for analyzing input constraints. It allows you to create all kinds of cool patterns, or data that describes itself, and you can store the output wherever you can paste from the clipboard. Oh, and it’s free.

You can get a copy of PerlClip here, by the way. It was written by James Bach and Danny Faught. The idea started with a Perl one-liner by Danny, and they built on each other’s ideas for it. I don’t think it took them very long to write it. Once you’ve had the idea, it’s a pretty trivial program to implement. But still, kind of a cool idea, don’t you think?
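
In case you’re curious about what’s under the hood, here’s a rough sketch of the idea in a few lines of Perl. To be clear, this isn’t PerlClip itself, and the function name is my own; it’s just one plausible way to build that kind of self-describing string, leaving the clipboard part out:

    #!/usr/bin/perl
    # Sketch only, not PerlClip: build a string in which the digits to the
    # left of each asterisk report that asterisk's position in the string.
    use strict;
    use warnings;

    sub counterstring {
        my ($length, $marker) = @_;
        $marker = '*' unless defined $marker;
        my @pieces;
        my $pos = $length;
        while ($pos > 0) {
            my $piece = $pos . $marker;          # e.g. "60*" marks position 60
            if (length($piece) > $pos) {         # at the very start there may not
                $piece = substr($piece, -$pos);  # be room for all of the digits
            }
            push @pieces, $piece;
            $pos -= length $piece;
        }
        return join '', reverse @pieces;         # assembled from right to left
    }

    print counterstring(60), "\n";   # *3*5*7*9*12*15*...*54*57*60*

Paste the output into a field, look at the last asterisk that made it through, and the number to its left tells you where the field gave up.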

So anyway, I created a string a million characters long, and I pasted it into the chat window input field. I saw that the input field apparently accepted 32768 characters before it truncated the rest of the input. So I guess your limit is 32768 characters.

Then I pressed “Send”, and the text appeared in the output field. Well, not all of it. I saw the first 29996 characters, and then two periods, and then nothing else. The rest of the text had vanished.

That’s weird. It doesn’t seem like a big deal, does it? Yet there’s this thing called representativeness bias. It’s a critical thinking error: the phenomenon that causes us to believe that a big problem always looks big from every angle, and that a problem with little manifestations always has little consequences.

Our biases are influenced by our world views. For example, last week when that tester found that crash in that critical routine, everyone else panicked, but you realized that it was only a one-byte fix and we were back in business within a few minutes. It also goes the other way, though: something that looks trivial or harmless can have dire and shocking consequences, made all the more risky because of the trivial nature of the symptom. If we think symptoms and problems and fixes are all alike in terms of significance, then when we see a trivial symptom, no one bothers to investigate the problem behind it. It’s only a little rounding error, and it only happens on one transaction in ten, and it only costs half a cent at most. When that rounding error is multiplied over hundreds of transactions a minute, tens of thousands an hour… well, you get the point.

I’m well aware that, as a test, this is a toy. It’s like a security check where you rattle the doorknob. It’s like testing a car by kicking the tires. And the result that I’m seeing is like the doorknob falling off, or the door opening, or a tire suddenly hissing. For a tester, this is a mere bagatelle. It’s a trivial test. Yet when a trivial test reveals something that we can’t explain immediately, it might be a good idea to seek an explanation.

A few things occurred to me as possibilities.

  • The first one is that someone, somewhere, is missing some kind of internal check in the code. Maybe it’s you; maybe it’s the guy who wrote the parser downstream; maybe it’s the guy who’s writing the display engine. But it seems to me as though you figured that you could send 32768 bytes, while someone else has a limit of 29998 bytes. Or 29996, probably. Well, maybe.
  • Maybe one of you isn’t aware of the published limits of the third-party toolkits you’re using. That wouldn’t be the first time. It wouldn’t necessarily be negligence on your part, either—the docs for those toolkits are terrible, I know.
  • Maybe the published limit is available, but there’s simply a bug in one of those toolkits. In that case, maybe there isn’t a big problem here, but there’s a much bigger problem that the toolkit causes elsewhere in the code.
  • Maybe you’re not using third-party toolkits. Maybe they’re toolkits that we developed here. Mind you, that’s exactly the same as the last problem; if you’re not aware of the limits, or if there’s a bug, who produced the code has no bearing on the behaviour of the code.
  • Maybe you’re not using toolkits at all, for any given function. Mind you, that doesn’t change the nature of the problems above either.
  • Maybe some downstream guy is truncating everything over 29996 bytes, placing those two dots at the end, and ignoring everything else, and he’s not sending a return value to you to let you know that he’s doing it. (There’s a little sketch of that general arrangement just after this list.)
  • Maybe he is sending you a return value, but the wrong one.
  • Maybe he’s sending you a return value, and you’re ignoring it.
  • Maybe he’s sending you a return value, and you are paying attention to it, but there’s some confusion about what it means and how it should be handled.
  • Maybe you’re truncating the last two and a half kilobytes or so of data before you send it on, and we’re not telling the user about it. Maybe that’s your intention. Seems a little rude to me to do that, but to you, it works as designed. To some user, it doesn’t work—as designed.
  • Maybe there’s no one else involved, and it’s just you working on all those bits of the code, but the program has now become sufficiently complex that you’re unable to keep everything in your head. That stands to reason; it is a complicated program, with lots of bits and pieces.
  • Maybe you’re depending on unit tests to tell you if anything is wrong with the individual functions or objects. But maybe nothing is wrong with any particular one of them in isolation; maybe it’s the interaction between them that’s problematic.
  • Maybe you don’t have any unit tests at all.
  • Maybe you do have unit tests for this stuff. From right here, I can’t tell. If you do have them, I can’t tell whether your checks are really great and you just missed one this time, or if you missed a few, or if you missed a bunch of them, or whether there’s a ton of them and they’re all really lousy.
  • Any of the above explanations could be in play, many of them simultaneously. No matter what, though, all your unit tests could pass, and you’d never know about the problem until we took out all the mocks and hooked everything up in the real system. Or deployed into the field. (Actually, by now they’re not unit tests; they’re just unit checks, since it’s a while since this part of the code was last looked at and we’ve been seeing green bars for the last few months.)
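
To make a couple of those guesses a little more concrete, here is a purely hypothetical sketch. None of these routine names come from our code, as far as I know; the only things I’ve borrowed from reality are the two numbers I observed:

    # Purely hypothetical sketch: one way that data could vanish quietly.
    use strict;
    use warnings;

    use constant DISPLAY_LIMIT => 29_996;    # what I saw in the output field
    use constant INPUT_LIMIT   => 32_768;    # what the input field accepted

    # Downstream: truncate, mark the cut with two dots, and report how much
    # of the caller's text actually survived.
    sub render_message {
        my ($text) = @_;
        my $kept = length $text;
        if ($kept > DISPLAY_LIMIT) {
            $kept = DISPLAY_LIMIT;
            $text = substr($text, 0, $kept) . '..';   # the two mysterious dots
        }
        # ... imagine the actual drawing happening here ...
        return $kept;
    }

    # Upstream: pass the text along and throw the answer away, so nobody
    # ever tells the user that anything went missing.
    sub send_message {
        my ($text) = @_;
        $text = substr($text, 0, INPUT_LIMIT);
        render_message($text);    # return value silently dropped on the floor
        return;
    }

Every unit check of render_message and send_message in isolation could pass, and the silent loss would still be sitting there in the seam between them.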

For any one of the cases above, since it’s so easy to test and check for these things, I would think that if you or anyone else knew about this problem, your sense of professionalism and craftsmanship would tell you to do some testing, write some checks, and fix it. After all, as Uncle Bob Martin said, you guys don’t want us to find any bugs, right?

But it’s not my place to say that. All that stuff is up to you. I don’t tell you how to do your work; I tell you what I observe, in this case entirely from the outside. Plus it’s only one test. I’ll have to do a few more tests to find out if there’s a more general problem. Maybe this is an aberration.

Now, I know you’re fond of saying, “No user would ever do that.” I think what you really mean is no user that you’ve thought of, and that you like, would do that on purpose. But it might be a thought to consider users that you haven’t thought of, however unlikely they and their task might be to you. It could be a good idea to think of users that neither one of us like, such as hackers or identity thieves. It could also be important to think of users that you do like who would do things by accident. People make mistakes all the time. In fact, by accident, I pasted the text of this message into another program, just a second ago.

So far, I’ve only talked about the source of the problem and the trigger for it. I haven’t talked much about possible consequences, or risks. Let’s consider some of those.

  • A customer could lose up to 2770 bytes of data. That actually sounds like a low-risk thing, to me. It seems pretty unlikely that someone would type or paste that much data in any kind of routine way. Still, I did hear from one person that they like to paste stack traces into a chat window. You responded rather dismissively to that. It does sound like a corner case.
  • Maybe you don’t report truncated data as a matter of course, and there are tons of other problems like this in the code, in places that I’m not yet aware of or that are invisible from the black box. Not this problem, but a problem with the same kind of cause could lead to a much more serious problem than this unlikely scenario.
  • Maybe there is a consistent pattern of user interface problems where the internals of the code handle problems but don’t alert the user, even though the user might like to know about them.
  • Maybe there’s a buffer overrun. That worries me more—a lot more—than the stack trace thing above. You remember how this kind of problem used to be dismissed as a “corner case” back when we worked at Microsoft—and then how Microsoft shut down new product development and spent two months investigating these kinds of problems, back in the spring of 2002? Hundreds of worms and viruses and denial of service attacks stemmed from problems whose outward manifestations looked exactly as trivial as this one. There are variations on that theme.
  • Maybe there’s a buffer overrun that would allow other users to view a conversation that my contact and I would like to keep between ourselves.
  • Maybe an appropriately crafted string could allow hackers to get at some of my account information.
  • Maybe an appropriately crafted string could allow hackers to get at everyone‘s account information.
  • Maybe there’s a vulnerability that allows access to system files, as the Blaster worm did.
  • Maybe the product is now unstable, and there’s a crash about to happen that hasn’t yet manifested itself. We never know for sure if a test is finished.
  • Here’s something that I think is more troubling, and perhaps the biggest risk of all. Maybe, by blowing off this report, you’ll discourage testers from reporting a similarly trivial symptom of a much more serious problem. In a meeting a couple of weeks ago, the last time a tester reported something like this, you castigated her in public for the apparently trivial nature of the problem. She was embarrassed and intimidated. These days she doesn’t report anything except symptoms that she thinks you’ll consider sufficiently dramatic. In fact, just yesterday she saw something that she thought to be a pretty serious performance issue, but she’s keeping mum about it. Some time several weeks from now, when we start to do thousands or millions of transactions, you may find yourself wishing that she had felt okay about speaking up today. Or who knows; maybe you’ll just ask her why she didn’t find that bug.

NASA calls this last problem “the normalization of deviance”. In fact, this tiny little inconsistency reminds me of the Challenger problem. Remember that? There were these O-rings that were supposed to keep two chambers of highly-pressurized gases separate from each other. It turns out that on seven of the shuttle flights that preceded the Challenger, these O-rings burned through a bit and some gases leaked (they called this “erosion” and “blow-by”). Various managers managed to convince themselves that it wasn’t a problem, because it only happened on about a third of the flights, and the rings, at most, only burned a third of the way through. Because these “little” problems didn’t result in catastrophe the first seven times, NASA managers used this as evidence for safety. Every successful flight that had the problem was taken as reassurance that NASA could get away with it. In that sense, it was like Nassim Nicholas Taleb’s turkey, who increases his belief in the benevolence of the farmer every day… until some time in the week before Thanksgiving.

Richard Feynman, in his Appendix to the Rogers Commission Report on the Space Shuttle Challenger Accident, nailed the issue:

The phenomenon of accepting for flight, seals that had shown erosion and blow-by in previous flights, is very clear. The Challenger flight is an excellent example. There are several references to flights that had gone before. The acceptance and success of these flights is taken as evidence of safety. But erosion and blow-by are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in this unexpected and not thoroughly understood way. The fact that this danger did not lead to a catastrophe before is no guarantee that it will not the next time, unless it is completely understood. When playing Russian roulette the fact that the first shot got off safely is little comfort for the next.

That’s the problem with any evidence of any bug, at first observation; we only know about a symptom, not the cause, and not the consequences. When the system is in an unpredicted state, it’s in an unpredictable state.

Software is wonderfully deterministic, in that it does exactly what we tell it to do. But, as you know, there’s sometimes a big difference between what we tell it to do and what we meant to tell it to do. When software does what we tell it to do instead of what we meant, we find ourselves off the map that we drew for ourselves. And once we’re off the map, we don’t know where we are.

According to Wikipedia,

Feynman’s investigations also revealed that there had been many serious doubts raised about the O-ring seals by engineers at Morton Thiokol, which made the solid fuel boosters, but communication failures had led to their concerns being ignored by NASA management. He found similar failures in procedure in many other areas at NASA, but singled out its software development for praise due to its rigorous and highly effective quality control procedures – then under threat from NASA management, which wished to reduce testing to save money given that the tests had always been passed.

At NASA, back then, the software people realized that just because their checks were passing, it didn’t mean that they should relax their diligence. They realized that what really reduced risk on the project was appropriate testing, lots of tests, and paying attention to seemingly inconsequential failures.

I know we’re not sending people to the moon here. Even though we don’t know the consequences of this inconsistency, it’s hard to conceive of anyone dying because of it. So let’s make it clear: I’m not saying that the sky is falling, and I’m not making a value judgment as to whether we should fix it. That stuff is for you and the project managers to decide upon. It’s simply my role to observe it, to investigate it, and to report it.

I think it might be important, though, for us to understand why the problem is there in the first place. That’s because I don’t know whether the problem that I’m seeing is a big deal. And the thing is, until you’ve looked at the code, neither do you.

As always, it’s your call. And as usual, I’m happy to assist you in running whatever tests you’d like me to run on your behalf. I’ll also poke around and see if I can find any other surprises.

Your friend,

The Tester

P.S. I did run a second test. This time, I used PerlClip to craft a string of 100000 instances of :). That pair of characters, in normal circumstances, results in a smiley-face emoticon. It seemed as though the input field accepted the characters literally, and then converted them to the graphical smiley face. It took a long, long time for the input field to render this. I thought that my chat window had crashed, but it hadn’t. Eventually it finished processing, and displayed what it had parsed from this odd input. I didn’t see 32768 smileys, nor 29996, nor 16384, nor 14998. I saw exactly two dots. Weird, huh?
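
In case you want to reproduce that second test without PerlClip, the pattern itself is nothing exotic. Here’s a sketch (again, not what PerlClip actually does); it assumes the Win32::Clipboard module from CPAN:

    # Sketch only: put 100000 smileys on the Windows clipboard, ready to be
    # pasted into the chat window. Assumes the Win32::Clipboard CPAN module.
    use strict;
    use warnings;
    use Win32::Clipboard;

    my $clip = Win32::Clipboard();
    $clip->Set(':)' x 100_000);    # Perl's repetition operator does the work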

8 replies to “A Letter To The Programmer”

  1. Dear Tester – by design, but the docs should probably be updated to reflect this – https://developer.skype.com/Docs/ApiDoc/CHATMESSAGE.

    Someone asked this in our forums (in a much more concise manner) –
    http://forum.skype.com/lofiversion/index.php/t68833.html

    The 32768 limit – as anyone familiar with the basic fundamentals of windows programming knows, is the windows limit on characters in a standard text box. We could (I suppose) have added logic to limit the text to match with our api, but the _less_buggy_ approach was to simply let windows handle the text in the normal manner. We monitor usage, and except for a brief, tiny spike last week, users generally don't hit this "issue".

    Also, please consider using our bug database in the future – https://developer.skype.com/jira/browse/SPA

    -Not really a skype developer – just someone who looks at "bugs" with a different lens than you do.

  2. Dear Anonymous…

    I'm glad you've looked into some of the specifics of the problem. That is, to my mind, responsible and capable technical work. Quick, too. But I'm not sure you've addressed the bug itself.

    The 32768 limit – as anyone familiar with the basic fundamentals of windows programming knows, is the windows limit on characters in a standard text box. We could (I suppose) have added logic to limit the text to match with our api, but the _less_buggy_ approach was to simply let windows handle the text in the normal manner.

    As anyone familiar with the basic fundamentals of Windows programming knows, you can also constrain the input from an edit control by sending the edit control the EM_SETLIMITTEXT message. The current approach isn't any less buggy, since the bug is that not all of the text being sent gets across AND there's no notice of that. Data disappears without explanation and apparent awareness of it. In addition, we don't know from the outside whether the edit box's memory is coming from Windows' heap or ours, or how that memory is being managed.
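
    In the application's own code, that's a single SendMessage call when the control is created. Just to illustrate the message itself, here's a rough sketch of sending it from outside the application in Perl; the window handle below is made up, and the sketch assumes the Win32::API module plus a spy tool to find the real handle of the edit control:

        # Sketch only; the handle is hypothetical. EM_SETLIMITTEXT caps the
        # number of characters a standard edit control will accept.
        use strict;
        use warnings;
        use Win32::API;

        use constant EM_SETLIMITTEXT => 0x00C5;

        # SendMessageA(HWND, UINT, WPARAM, LPARAM) from user32.dll
        my $SendMessage = Win32::API->new('user32', 'SendMessageA', 'NNNN', 'N')
            or die "couldn't bind SendMessageA";

        my $hwnd_edit = 0x000A0B2C;    # stand-in for the chat input box's handle
        $SendMessage->Call($hwnd_edit, EM_SETLIMITTEXT, 32_768, 0);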

    But all that is actually beside the point. You'll appreciate, I hope, that the point of including the specifics of the Skype bug was to provide an example—something that people reading the blog post could reproduce easily. You'll also recognize, as I said at the very beginning of the post, that I put a ton more detail into the letter to the programmer than I would put into an ordinary bug report. My point was not to expose a bug in Skype, but rather to give an example of some of the kinds of thinking that testers need to consider (in my view, at least) when reflecting on a bug that seems, from the outside, trivial. Your answer is of the exact kind that would prompt my letter, rather than one that would address it.

    …someone who looks at "bugs" with a different lens than you do.

    I'm glad that you look at bugs through a different lens. We need all kinds of them, from telescopes to microscopes to contact lenses to fisheyes to fresnels, the better to choose what we're going to find and fix.

    Any notions about that second bug?

    —Michael B.

  3. Hi Michael!
    Interesting topic you covered in your post. I will not focus on the specific problem, but rather look at the overall process. Software development is a process that is changing rapidly. This puts a lot of pressure on both testers and developers to keep up with the pace.

    In your post you outline a lot of different reasons why the bug might occur.

    As you say: “it might be a good idea to seek an explanation.”

    I’m not sure it’s always efficient to seek an explanation. The main reason to seek an explanation, as a tester, is to identify more possible defects in the surroundings of the first bug. On the other hand, it could be time-consuming to seek an explanation. Of course, as you say, most of the possible explanations should have been eliminated by the developer through unit tests/checks.

    If you have a test team with a lot of experienced testers hunting bugs, you have to make sure they are using their time in the best way. I’m thinking that applying “Lean thinking” could help to locate waste in the test process. It might be that seeking an explanation could be a waste. Sometimes it will be necessary (and profitable) to dig very deep into the bugs, sometimes not.

    … I think it might be important, though, for us to understand why the problem is there in the first place. That's because I don't know whether the problem that I'm seeing is a big deal. And the thing is, until you've looked at the code, neither do you…

    If we look at developers and testers as one team working closely together, things will be much more efficient. The time from finding a bug to reporting and fixing it will most of the time be shortened.

    Concerning bugs, there are always discussions about priority and importance. The tester may think the bug is A-rated, since he cannot perform more tests in that specific area until it is fixed. The developer may think it is C-rated, since it is a very small bug not affecting anything important. The project management is looking at customer value and thinks the bug is D-rated, since it has very low customer impact.

    To be able to rate bugs correctly, I think it is necessary to have bug-rating meetings with the appropriate people (developers, testers, project, market). This may sound obvious, but working as a consultant in many different environments, I can say this is not always the case.

    I think it is important to have these meetings before the bug is reported into any tracking system.

    /Daniel Åberg

  4. Dear Mr. Not Really A Skype Developer,

    Thank you for your answer. Up until now I wasn't aware that users of your/this software required knowledge of the fundamentals of windows programming.

    I think you should start warning users about this prior to downloading, so that the software will only be operated by qualified personnel. Perhaps you do, but I don't remember seeing it. Maybe it could be emphasized more.

    I am happy to know that you monitor usage for this specific behaviour. That hints to me that you're to some extent aware that there might be a problem here; otherwise, why monitor it? Of course it may be that your usage monitoring is targeted at something else, and that you see when this 'issue' comes up only as an extra, unintentional outcome. I have no way of knowing; I trust you are doing your best.

    However, I agree with my colleague that your answer doesn't target the point: that data is in fact lost, and that the consequences are perhaps not quite understood or predictable. Be that as it may – this is not a real bug report, but a debate, in which the product in question could have been just about any product.
    The interesting part is your keen dismissal, which I think could not have been constructed in a more precise and illustrative way; thank you for that.

    All over the world developers answer bug reports and complaints in this fashion: without targeting the bug, it's dismissed by use of
    – it's old news (been in our forum)
    – not intelligent/skilled use ('anyone familiar…')
    – wrong forum/procedure (please use bug database…)
    – not happening frequently enough ('users generally don't hit this issue')
    – holding up the design as flawless ('it's by design').

    Besides the obviously arrogant part of the dismissal, it's exposing something even more daunting:
    – you knew it and didn't handle it!
    – your process is ignoring stuff you might pick up, because it didn't come through the right channels (like a radio distress call: "we're drowning" -> "please send in a form in three copies on the right kind of paper")
    – using a low frequency to evaluate a risk, while a risk is a product of both frequency and impact. Ignoring a huge impact because of a low frequency can potentially be fatal. In this case: a hacker needs only one day to gain access. It'll show in your statistics, but ..
    – ignoring that designs, however well constructed and manufactured, are never flawless is dangerous. The bug might point exactly this out. The argument is vague: "it doesn't work" – "but it's how we designed it" = "you might have designed it wrong".

    In the end it always turns out that it's never a competition as to who is best at running projects and maintenance procedures, interpreting contracts, designs, requirements and all – but simply a question of whether the product survives the usage of the users, who are lawfully ignorant of all those matters.

    Kind regards,
    Another Tester

  5. What I find really interesting about the letter is that it illustrates just how accustomed testers are to having our findings be "dismissed" or "trivialized" – I know I often over-analyze/over-document a bug's validity or potential impact because I'm used to having to do that.

    I also think the letter drives home the point (whether this was intentional, I don't know) that as testers, a significant aspect of our job is to question. If we encounter behavior we don't understand, we question it. It doesn't always mean we "don't like" the behavior or think it's "bad" or even "wrong." We might simply be questioning whether what we found was the intended behavior.

  6. I really like Marisa’s point: If we encounter behavior we don’t understand, we question it. It doesn’t always mean we “don’t like” the behavior or think it’s “bad” or even “wrong.” We might simply be questioning whether what we found was the intended behavior.

