Code Reviews #2
Ward-ing away Bugs

John M. Dlugosz

Ten years ago, QUE published “Debugging C” by Robert Ward (ISBN 0-88022-261-1). It was one of the books I bought when I was teaching myself C. It proved to be a wise choice indeed, for unlike most newcomers, I learned a proper appreciation for debugging at the same time I was learning about while loops and pointers and the mysteries of the linker. Although that was only a decade ago, it was truly another age. I was using a 4.77 MHz PC with a lavish 640K of RAM and two 360K floppy disk drives. I put the compiler on one floppy and my source on another, and exchanged the compiler disk for the linker disk after compiling. Those were the days! Naturally, a ten-year-old book in this business is normally thought of as ancient history, with little or no application to our current field. Even at the time, some of the material in that book was already dated, dealing with CP/M issues and aspects of compiler technology that were then changing.

The publication date of the book was 1986. I wrote this in August 1996, according to the timestamp on the file.

Nonetheless, I found the book to be nothing short of an inspiration. More than anything else, it taught me that debugging is a discipline different from design and coding, and that a formal approach is possible. That is, you can categorize things, and recognize when you're dealing with something from a particular category. Naming something is the first step to conquering it.

Chapter 1, just ten pages long, opened my eyes to the possibilities of a scientific approach to debugging. That’s where I started thinking about the kinds of things I’m covering in Code Reviews. So, it’s only fair that my first treatment pay homage to Robert Ward, who got me started on my own journey of enlightenment.

Ward states in his book:

It makes sense that laboratory experimentation based on scientific method is the model for debugging activities, if you consider that

The fit between scientific method and debugging is as compelling as the fit between engineering and software design. Recently, software design has benefited significantly from the engineering model. Debugging should benefit comparably from a deliberate effort to adapt the scientific-method model.

There are two important issues to understand here. The most important is that what we do during debugging is different from what we do during design and coding. It’s a whole different process, and we think differently. It’s only natural that we should look into different techniques. The more explicit point is that such techniques should be inspired by the scientific method.

Clues — Cause and Effect

The act of debugging can be very much a detective game. You are given clues. You search, often finding more clues rather than a final result. Then you have to figure out what the problem is based on these clues. Often, the problem you discover is not the ultimate bug, but merely brings you one step closer; then the process repeats.

The reason this is a “detective game” and that deductive reasoning is so much a part of the process is because you are exploring the cause-and-effect relationships in the program’s logic.

For example, a date is printed incorrectly on a report. That’s what you (the programmer assigned to it) get called about. Does the date get mangled within the reporting function, or is an incorrect value supplied to the function initially? You want to be able to make the statement “The date is incorrect at this point because …” Literally, answering this tells you the cause, which you need to know before you can fix it or work around it. Ward writes, “The primary, but least helpful, principle is that of causality. Programmers must believe that for every effect there is a cause. Else, why bother? A bug would be a spontaneous irregularity.”

With a grasp of cause-and-effect, you become a detective. You start at the known point, the end, where the bug is noticed as incorrect output. Someone skilled at debugging will already be thinking of how to break up the problem set. One such way is asking if the problem starts in the reporting module, or if it’s passed in from the outside. This is easy to check, as (hopefully!) the module interface will be a choke-point where we can easily see what data passes into the reporting system. This is gathering more clues… If a detective is called because a window is found to be broken, he starts there and finds more clues, such as footprints leading up to the window.
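One cheap way to exploit such a choke-point is a sanity check right at the module boundary, so the first question (“is the date already bad when it arrives?”) answers itself. A minimal sketch; the Date type, is_plausible, and print_report are invented for illustration, not taken from any real reporting system:

```cpp
#include <cassert>
#include <iostream>

// Hypothetical date type and reporting entry point, for illustration only.
struct Date { int year, month, day; };

bool is_plausible(const Date& d)
{
    return d.year > 1900
        && d.month >= 1 && d.month <= 12
        && d.day >= 1 && d.day <= 31;
}

void print_report(const Date& d)
{
    // Checkpoint at the module boundary: if the date is already bad here,
    // the cause is upstream; if it is good here, the reporting code mangled it.
    assert(is_plausible(d));
    std::cout << d.year << '-' << d.month << '-' << d.day << '\n';
}
```

If the assertion fires, you have eliminated the entire reporting module in one step.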

Alternatively, you may discard one possibility without having to check. This is possible through having a mental model of the part of the program under consideration. Without having to trace through it, you know enough cause-and-effect relationships to know where it might be coming from. Often, this kind of “reasoning” clue tells you where it’s not coming from.

Proximity

Our intuitive understanding of “clues” is based on cause-and-effect relationships. So, a better understanding of what can be causes to a class of effects, or effects of a class of causes, would aid us in developing our abilities. More to the point, it helps to know where to look. Cause and effect follows some kind of “connection” between the two events. The detective sees footprints leading up to the broken window—that is an example of lexical proximity. The two things (window and footprints) are found near each other in space. Anybody could find that.

Later in the day, the same investigator is looking into a theft that seems to be an “inside job” because there was no sign of a break-in. He can’t just look around the house for clues. However, he finds out who had keys and investigates them further. This is an example of referential proximity. One object (the key) is used in two different places, separated in time and space.

When the investigator starts looking into current and recently discharged employees, one of them tries to skip town. That draws the detective’s suspicion in that direction. When two things happen in sequence, it makes you wonder if there is some causal relationship between them. This is temporal proximity.

Enough of hypothetical detectives. Let’s turn to hypothetical programmers. Lexical proximity is when a cause and effect are found near each other in the source. That’s why we first wonder if the date bug is caused somewhere in the reporting system. This is by far the easiest type of relationship to deal with, and the most instinctive to us. For most of us, the first thing we do upon finding a problem on some line is to inspect the lines immediately preceding it.

Often, a cause-and-effect relationship follows more than one kind of proximity. Two consecutive lines (high lexical proximity) may also execute one right after the other, giving them high temporal proximity as well. But note that adjacent lines are not always executed consecutively, as with looping constructs or declaration (not executable) lines. Nonetheless, they are still lexically proximate, and a problem in one may show up in the other.

Conversely, passages of code may execute consecutively, yet be nowhere near each other in the source. This is the case with function calls. Functions foo and bar may be temporally close because foo calls bar, even though they are in different source files.

Temporal proximity is harder to track than lexical proximity because it changes, and is only apparent at run time. Lexical proximity, on the other hand, can be statically analyzed.
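A trivial sketch of the difference, using an invented function: the two marked lines are lexically adjacent, yet between successive executions of the first line an entire loop iteration intervenes, so their temporal spacing is only apparent at run time.

```cpp
#include <vector>

// Invented example: lexically adjacent lines whose temporal spacing
// depends on the run-time path through the loop.
int sum_of_squares(const std::vector<int>& v)
{
    int total = 0;
    for (int x : v) {
        int sq = x * x;   // these two lines are neighbors in the source,
        total += sq;      // but a full iteration separates their repeats
    }
    return total;
}
```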

It stands to reason that one function that interferes with another (which you locate due to their temporal proximity) has a deeper reason for interfering. One may set a global variable accessed by the other, perhaps. This is referential proximity.

Referential proximity is particularly important to consider when making changes, and is particularly easy to deal with in object-oriented programs. When I changed the implementation of a class, altering its private member data, I looked at every place that used that member (thanks to a search in my text editor). The code was in various functions, and I didn’t care what order it was executed in. Yet it was all related through use of the same variable. A change to one such line would potentially affect the others.
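A minimal sketch of that kind of coupling, with invented names: neither function calls the other, and they may run in any order, yet a change to either can break the other because both touch the same datum.

```cpp
#include <cassert>

// Invented example of referential proximity: two functions, possibly far
// apart in the source, coupled only through a shared variable.
namespace {
    int page_width = 80;   // the shared datum
}

void set_page_width(int w)
{
    page_width = w;
}

int columns_that_fit(int column_width)
{
    // A change to how set_page_width stores or interprets page_width
    // silently affects this function, though neither calls the other.
    return page_width / column_width;
}
```

A text-editor search for page_width finds every line in this relationship, which is exactly the technique described above.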

Phases of Debugging

Ward teaches that there are four phases of the debugging process: testing, stabilization, localization, and correction.

Testing is a necessary part of software development. How can you be sure it works unless you actually try it? Today I heard about someone who tried to install a program, only to find that files were missing from the CD. It was not user error: a call to tech support revealed that the CD was made incorrectly and they didn’t discover it until after it had shipped.

That may be an extreme case, but how many of us do just that in miniature? That is, write a function, and then proceed to use it. Later, a bug is chased back to that function. How much time would have been saved had the function been debugged before it became part of a larger system?

Stabilization is the process of controlling the execution environment so as to reproduce a bug at will. I could call Corel’s tech support and complain that “Ventura crashes a lot”. Would that do them much good? But suppose I send them email saying “load the attached document, position the cursor … type till you get a new page, then backspace.” Now they have a stabilized bug, which gives them someplace to start looking. It would be a serious leg-up on finding and fixing the bug, had the tech support service been at all interested in passing such information along to the engineers.

A customer who gives you stabilized bugs rather than just complaints can be your best friend. Make sure the people who deal with customers know to watch for this and bring it to your attention.

It is not unreasonable to ask whoever is reporting the bug to do at least a partial stabilization. If someone says “sometimes it gets the date wrong”, say “give me a specific example.” When he returns with “this report prints incorrectly”, you have something to work with.

Bugs that are not stabilized are not true bug reports. They are merely complaints and reports of anomalies.
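In code terms, a stabilized bug often takes the shape of a deterministic repro: one fixed input, checked against the expected output, failing the same way on every run. A sketch of that shape for the date example; format_date and the expected string are hypothetical:

```cpp
#include <string>

// Hypothetical formatting routine under suspicion.
std::string format_date(int year, int month, int day)
{
    return std::to_string(year) + "-" + std::to_string(month)
         + "-" + std::to_string(day);
}

// The stabilized repro: a specific input and the output we expect.
// If format_date were buggy for this input, this would return false
// every time, giving the engineer a fixed starting point.
bool repro_date_bug()
{
    return format_date(1996, 8, 15) == "1996-8-15";
}
```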

Localization goes beyond stabilization. A bug report should be a report on a stabilized bug, but typically only the engineer can localize the bug, once he starts the process of figuring it out. According to Ward, “The localization phase is characterized by intensive data collection and analysis.” In other words, this is where the fun part is. Stabilization is prep work, preparing the stage and setting up the problem. Localization is the actual detective work. The goal is to control the program to such a degree that you gain an understanding of the bug, and ultimately deduce its cause.
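One concrete localization tactic is to bisect the flow of data: plant checks at intermediate points and see where a value first goes wrong. A sketch with an invented three-stage computation; the stages and their expected values are illustrative assumptions, not any particular program:

```cpp
#include <cassert>

// Invented three-stage computation; suppose the final output is wrong
// and we want to know which stage to blame.
int stage1(int x) { return x + 1; }
int stage2(int x) { return x * 2; }
int stage3(int x) { return x - 1; }

int pipeline(int x)
{
    int a = stage1(x);
    assert(a == x + 1);        // value still good after stage1?
    int b = stage2(a);
    assert(b == 2 * (x + 1));  // if stage2 were buggy, this would fire first
    return stage3(b);
}
```

The first assertion to fire localizes the problem to the stage just before it, cutting the territory you must inspect.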

Naturally, once you find it, fix it! Hopefully, the fix is easy. Sometimes, it can be a major design problem. Normally, it’s something in between. As in my earlier example, be sure to apply the principles of proximity to make sure the fix is accurate and complete. Then, don’t forget to test again.

A Lost Cause

There really is a cause for every effect. If it's reproducible, you can be sure it’s a problem with the code and not stray cosmic rays or other random things beyond your control. It’s just not always apparent that there is a cause-and-effect relationship involved. Consider the fragment

void foo()
{
    int size = 7;
    int A[10];
    bar(A);
    // inspect size here.
    // ...
}

You might be surprised to find that size, a local variable, is not 7 at the indicated point. Its address is never taken and it’s not passed to the function bar. Yet somehow bar is altering it! Nothing expressed in the program’s C++ would account for this. The problem is that some C++ statement does something other than what it says. In bar,

void bar(int* p)
{
    p[10] = 20;
}

the code overwrites memory not belonging to the supplied array. There is indeed referential proximity between foo and bar through size, but the source never says so. We expect functions to return whence they came, variables to stay unaltered unless the code changes them, and other basic assumptions to hold. When such assumptions are violated, we can’t find a cause-and-effect relationship within that framework of assumptions. The rules have changed. Ward calls this violating the virtual machine. The run-time system of C++ is itself corrupted or is being misused, and we can’t see the cause and effect directly by looking at statements in C++.

A seasoned programmer notices “bizarre” behavior and uses that as a clue in itself. Can you recognize the symptoms of a stray pointer? It's one of the worst bugs to track down, but you tend to know early on that this is what you’re dealing with, simply because its behavior appears to violate referential proximity. It is usually found by using temporal proximity: once you see which value is corrupted, watch it and see when it changes.
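A low-tech way to “watch” a suspect value without debugger support is to bracket it with canary values and check them at interesting moments; the check that first fails brackets the corruption in time. This is a sketch of the canary idea, not any particular tool, and real memory layout is compiler-dependent, so it is illustrative only:

```cpp
#include <cassert>

// Canary sketch: sentinels on either side of a value we suspect is being
// clobbered by a stray pointer. Layout is compiler-dependent; illustrative only.
struct Guarded {
    unsigned before = 0xDEADBEEFu;
    int value = 7;
    unsigned after = 0xDEADBEEFu;

    bool intact() const
    {
        return before == 0xDEADBEEFu && after == 0xDEADBEEFu;
    }
};
```

Sprinkle intact() checks through the suspect region of execution; the first one that fails tells you between which two points the stray write happened.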

Conclusion

More than anything else, I learned from “Debugging C” that a scientific approach to debugging is not only possible, but a good idea.

Debugging should not be a random, haphazard process. If you’ve had any experience with it, you’ve developed some techniques. See if you can understand how you do it. That understanding can help you get better at it, and help you avoid future problems.

