Chapter 2. When to Catch a Bug

Why the Compiler Is Your Best Place to Catch Bugs

Given the choice of catching bugs at compile time vs. catching bugs at runtime, the short answer is that you want to catch bugs at compile time if at all possible. There are multiple reasons for this. First, if a bug is detected by the compiler, you will receive a message in plain English saying exactly where, in which file and at which line, the error has occurred. (I may be slightly optimistic here, because in some cases—especially when STL is involved—compilers produce error messages so cryptic that it takes an effort to figure out what exactly the compiler is unhappy about. But compilers are getting better all the time, and most of the time they are pretty clear about what the problem is.)

Another reason is that a complete compilation (with a final link) covers all the code in the program, and if the compiler returns with no errors or warnings, you can be 100% sure that there are no errors that could be detected at compile time in your program. You could never say the same thing about runtime testing; with a large enough piece of code, it is difficult to guarantee that all the possible branches were tested and that every line of code was executed at least once.

And even if you could guarantee that, it wouldn’t be enough—the same piece of code could work correctly with one set of inputs and incorrectly with another, so with runtime testing you are never completely sure that you have tested everything.

And finally, there is the time factor: you compile before you run your code, so if you catch your error during compilation, you’ve saved some time. Some runtime errors appear late in the program, so it might take minutes or even hours of running to get to an error. Moreover, the error might not even be reproducible: it could appear and disappear in consecutive runs in a seemingly random manner. Compared to all that, catching errors at compile time seems like child’s play!

How to Catch Bugs in the Compiler

By now you should be convinced that whenever possible, it’s best to catch errors at compile time. But how can we achieve this? Let’s look at a couple of examples.

The first is the story of a Variant class. Once upon a time, a software company was writing an Excel plug-in. This is a file that, after being opened by Microsoft Excel, adds some new functions that could be called from an Excel cell. Because the Excel cell can contain data of different types—an integer (e.g., 1), a floating-point number (e.g., 3.1415926535), a calendar date (such as 1/1/2000), or even a string (“This is the house that Jack built”)—the company developed a Variant class that behaved like a chameleon and could contain any of these data types. But then someone had the idea that a Variant could contain another Variant, and even a vector of Variants (i.e., std::vector<Variant>). And these Variants started being used not just to communicate with Excel, but also in internal code. So when looking at the function signature:

Variant SomeFunction(const Variant& input);

it became totally impossible to understand what kind of data the function expects on input and what kind of data it returns. So if for example it expects a calendar date and you pass it a string that does not resemble a date, this can be detected only at runtime. As we’ve just discussed, finding errors at compile time is preferable, so this approach prevents us from using the compiler to catch bugs early using type safety. The solution to this problem will be discussed below, but the short answer is that you should use separate C++ classes to represent different data types.

The preceding example is real but somewhat extreme. Here is a more typical situation. Suppose we are processing some financial data, such as the price of a stock, and we accompany each value with the corresponding timestamp, i.e., the date and time when this price was observed. So how do we measure time? The simplest solution is to count seconds since some time in the past (say, since 1/1/1970).

Suddenly someone realizes that the library used for this purpose provides a signed 32-bit integer, which has a maximum value of about 2 billion, after which the value will overflow and become negative. This would happen about 68 years after the starting point on the time axis, i.e., in the year 2038. The resulting problem is analogous to the famous “Y2K” problem. Fixing it would entail going through a rather large number of files, finding all these variables, and making them int64, which has 64 bits instead of 32. A 64-bit counter would last about 4 billion times longer, which should be enough even for the most outrageous optimist.

But by now another problem has turned up: some programmers used int64 num_of_seconds, while others used int64 num_of_millisec, while still others wrote int64 num_of_microsec. The compiler has absolutely no way of figuring out if a function that expects time in milliseconds is being passed time in microseconds or vice versa. Of course, if we assume that the time interval in which we want to analyze our stock prices starts after, say, the year 1990 and goes until some point in the future, say the year 3000, then we can add a sanity check at runtime that the value being passed must fall into this interval. However, multiple functions need to be equipped with this sanity check, which requires a lot of human work. And what if someone later decides to go back and analyze the stock prices throughout the 20th century?

The Proper Way to Handle Types

Now, this entire mess could have been avoided altogether if we had just created a Time class and left the details of when it starts and what unit it measures (seconds, milliseconds, etc.) as hidden details of the internal implementation. One advantage of this approach is that if we mistakenly try to pass some other data type instead of time (which now has a Time type), the compiler will catch it early. Another advantage is that if the Time class is currently implemented using milliseconds and we later decide to increase the accuracy to microseconds, we need only edit one class, where we can change this detail of internal implementation without affecting the rest of the code.
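A minimal sketch of such a Time class follows. The names FromSeconds, FromMilliseconds, AsMilliseconds, IsBefore, and TimeExample are hypothetical, chosen only for illustration, and the choice of milliseconds since 1/1/1970 as the internal representation is an assumption, not something the chapter prescribes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical Time class: the unit and the starting point are private
// details, so callers cannot confuse seconds with milliseconds.
class Time {
 public:
  // Named factory functions make the unit visible at the call site.
  static Time FromSeconds(std::int64_t seconds) { return Time(seconds * 1000); }
  static Time FromMilliseconds(std::int64_t millis) { return Time(millis); }

  std::int64_t AsMilliseconds() const { return millis_since_epoch_; }

  bool operator<(const Time& other) const {
    return millis_since_epoch_ < other.millis_since_epoch_;
  }
  bool operator==(const Time& other) const {
    return millis_since_epoch_ == other.millis_since_epoch_;
  }

 private:
  // Private and explicit: a raw integer never becomes a Time silently.
  explicit Time(std::int64_t millis) : millis_since_epoch_(millis) {}

  // Assumed representation: milliseconds since 1/1/1970.
  std::int64_t millis_since_epoch_;
};

// A function that takes Time cannot be handed a plain integer by mistake.
bool IsBefore(const Time& a, const Time& b) { return a < b; }

// Usage sketch. IsBefore(1000, 1500) would not compile.
void TimeExample() {
  const Time first = Time::FromSeconds(1);
  const Time second = Time::FromMilliseconds(1500);
  assert(IsBefore(first, second));
}
```

If the internal unit later changes to microseconds, only the bodies of the factory and accessor functions need to change; code that calls IsBefore is unaffected.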

So how do we catch these types of errors at compile time instead of runtime? We can start by having a separate class for each type of data. Let’s use int for integers, double for floating-point data, std::string for text, Date for calendar dates, Time for time, and so on for all the other types of data. But simply doing this is not enough. Suppose we have two classes, Apple and Orange, and a function that expects an input of a type Orange:

void DoSomethingWithOrange(const Orange& orange);

However, we accidentally could provide an object of type Apple instead:

Apple an_apple(some_inputs);
DoSomethingWithOrange(an_apple);

This might compile under some circumstances, because the C++ compiler is trying to do us a favor and will silently convert Apple to Orange if it can. This can happen in two ways:

  1. If the Orange class has a constructor taking only one argument of type Apple

  2. If the Apple class has an operator that converts it to Orange

The first case happens when the class Orange looks like this:

class Orange {
 public:
  Orange(const Apple& apple);
  // more code
};

It can even look like this:

class Orange {
 public:
  Orange(const Apple& apple, const Banana* p_banana=0);
  // more code
};

Even though in the last example the constructor looks like it has two inputs, it can be called with only one argument, so it can also serve to implicitly convert Apple into Orange. The solution to this problem is to declare these constructors with the keyword explicit. This prevents the compiler from doing an automatic (implicit) conversion, so we force the programmer to use Orange where Orange is expected:

class Orange {
 public:
  explicit Orange(const Apple& apple);
  // more code
};

and correspondingly in the second case:

class Orange {
 public:
  explicit Orange(const Apple& apple, const Banana* p_banana=0);
  // more code
};
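The effect of explicit can be checked at compile time. The sketch below assumes a C++11 compiler (for static_assert and the type_traits header); the empty Apple and Banana classes, the has_banana member, the return value of DoSomethingWithOrange, and ExplicitExample are hypothetical stand-ins added so that the example is self-contained.

```cpp
#include <cassert>
#include <type_traits>

class Apple {};
class Banana {};

class Orange {
 public:
  // Callable with a single argument, so it is marked explicit.
  explicit Orange(const Apple& /*apple*/, const Banana* p_banana = 0)
      : has_banana_(p_banana != 0) {}
  bool has_banana() const { return has_banana_; }

 private:
  bool has_banana_;
};

// Hypothetical body: returns a value so that calls can be checked.
bool DoSomethingWithOrange(const Orange& orange) { return orange.has_banana(); }

// The implicit conversion is gone...
static_assert(!std::is_convertible<Apple, Orange>::value,
              "Apple must not convert to Orange implicitly");
// ...but an Orange can still be built from an Apple on request.
static_assert(std::is_constructible<Orange, Apple>::value,
              "Orange can be constructed from Apple explicitly");

void ExplicitExample() {
  Apple an_apple;
  Banana a_banana;
  // DoSomethingWithOrange(an_apple);  // no longer compiles
  assert(!DoSomethingWithOrange(Orange(an_apple)));
  assert(DoSomethingWithOrange(Orange(an_apple, &a_banana)));
}
```

Removing the keyword explicit from the constructor makes the first static_assert fail, which is a quick way to see the difference.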

Another method that lets the compiler know how to convert an Apple into an Orange is to provide a conversion operator:

class Apple {
 public:
  // constructors and other code …
  operator Orange () const;
};

The very presence of this operator suggests that the programmer made an explicit effort to provide the compiler with a way to convert Apple into Orange, and therefore it might not be a mistake. However, the absence of the keyword explicit in front of the constructor could easily be a mistake, so it’s advisable to declare all constructors that could be called with one argument with the keyword explicit. In general, any possibility of implicit conversions is a bad idea, so if you want to provide a way of converting Apple into Orange inside the class Apple, as in the previous example, the better way of doing so is:

class Apple {
 public:
  // constructors and other code …
  Orange AsOrange() const;
};

In this case, in order to convert an Apple into an Orange you would need to write:

  Apple apple(some_inputs);
  DoSomethingWithOrange(apple.AsOrange()); // explicit conversion
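As a side note not taken from the original text: since C++11 a conversion operator can itself be declared explicit, which keeps the operator syntax but requires a cast at the call site. The sketch below shows both forms side by side; the weight_in_grams members, the return value of DoSomethingWithOrange, and ConversionExample are hypothetical and exist only to make the example checkable.

```cpp
#include <cassert>
#include <type_traits>

class Orange {
 public:
  explicit Orange(int weight_in_grams) : weight_in_grams_(weight_in_grams) {}
  int weight_in_grams() const { return weight_in_grams_; }

 private:
  int weight_in_grams_;
};

class Apple {
 public:
  explicit Apple(int weight_in_grams) : weight_in_grams_(weight_in_grams) {}

  // Named conversion, as recommended in the text.
  Orange AsOrange() const { return Orange(weight_in_grams_); }

  // C++11 alternative: an explicit conversion operator needs a cast.
  explicit operator Orange() const { return Orange(weight_in_grams_); }

 private:
  int weight_in_grams_;
};

// Hypothetical body: returns a value so that calls can be checked.
int DoSomethingWithOrange(const Orange& orange) {
  return orange.weight_in_grams();
}

static_assert(!std::is_convertible<Apple, Orange>::value,
              "Apple must not convert to Orange implicitly");

void ConversionExample() {
  Apple apple(150);
  // DoSomethingWithOrange(apple);  // does not compile
  assert(DoSomethingWithOrange(apple.AsOrange()) == 150);
  assert(DoSomethingWithOrange(static_cast<Orange>(apple)) == 150);
}
```

Either form makes the conversion visible in the calling code; the named function is arguably easier to search for.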

There is one more way to mix up different data types: by using enum. Suppose we defined the following two enums for days of the week and for months:

enum { SUN, MON, TUE, WED, THU, FRI, SAT };
enum { JAN=1, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC };

All of these constants are effectively integers (they convert implicitly to the built-in type int), and if we have a function that expects as an input a day of the week:

void FunctionExpectingDayOfWeek(int day_of_week);

the following call will compile without any warnings:

FunctionExpectingDayOfWeek(JAN);

And there is not much we can do at runtime because both JAN and MON are integers equal to 1. The way to catch this bug is not to use “plain vanilla” enums that create integers, but to use enums to create new types:

typedef enum { SUN, MON, TUE, WED, THU, FRI, SAT } DayOfWeek;
typedef enum { JAN=1, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC } Month;

In this case, the function expecting a day of week should be declared like this:

void FunctionExpectingDayOfWeek(DayOfWeek day_of_week);

An attempt to call it with a Month like this:

FunctionExpectingDayOfWeek(JAN);

results in a compilation error:

error: cannot convert 'Month' to 'DayOfWeek' for
        argument '1' to 'void
        FunctionExpectingDayOfWeek(DayOfWeek)'

which is exactly what we would want in this case.

This approach has a downside, however. In the case when enum creates integer constants, you can write code like this:

for(int month=JAN; month<=DEC; ++month)
  cout << "Month = " << month << endl;

But when the enum is used to create a new type, the following:

for(Month month=JAN; month<=DEC; ++month)
  cout << "Month = " << month << endl;

does not compile, because the ++ operator is not defined for an enum type. So if you need to iterate through the values of your enum, you are stuck with integers.
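One way to live with this restriction, sketched below, is to keep the loop counter as an int and convert it back to Month with a static_cast at the single point where the typed value is needed. MonthName, AllMonthNames, and MonthLoopExample are hypothetical functions invented for this illustration.

```cpp
#include <cassert>
#include <string>

typedef enum { JAN=1, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, DEC } Month;

// Hypothetical function that insists on a real Month, not an int.
std::string MonthName(Month month) {
  static const char* const kNames[] = {"Jan", "Feb", "Mar", "Apr", "May", "Jun",
                                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};
  return kNames[month - JAN];
}

// Iterate with an int counter; cast back to Month inside the loop body.
std::string AllMonthNames() {
  std::string result;
  for (int m = JAN; m <= DEC; ++m) {
    // The cast is safe here because m always stays within JAN..DEC.
    const Month month = static_cast<Month>(m);
    if (!result.empty()) result += ' ';
    result += MonthName(month);
  }
  return result;
}

void MonthLoopExample() {
  assert(MonthName(MAR) == "Mar");
  // MonthName(3);  // does not compile: int does not convert to Month
  assert(AllMonthNames().substr(0, 7) == "Jan Feb");
}
```

The cast is confined to one line, so the rest of the code still benefits from the compiler's type checking.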

Of course, there are exceptions to any rule, and sometimes programmers will have reasons to write classes such as Variant for the specific purpose of allowing implicit conversions. However, most of the time it is a good idea to avoid implicit conversions altogether: this allows you to use the full power of the compiler to check the types of different variables and catch potential errors early, at compile time.

Now suppose that we’ve done everything we can to use type safety to the fullest extent possible. Unfortunately, with the exceptions of types bool and char, the number of different values that each type can contain is astronomically high, and usually only a small portion of these values makes sense. For instance, if we use the type double for the price of a stock, we can be reasonably sure that the value will be between 0 and 10,000 (with the sole exception of the stock of the Berkshire Hathaway company, whose owner Warren Buffett apparently does not believe that it is a good idea to keep the stock price within a reasonable range and has therefore never split the stock, which at the time of this writing is above $100,000 per share). Still, even Berkshire Hathaway uses only a small portion of the range of a double-precision number, which can be as large as about 10^308 and can also be negative, which does not make sense for a stock price. Since for most types only a small portion of all possible values makes sense, there will always be errors that can be diagnosed only at runtime.

In fact, most of the problems of the C language, such as specifying an index out of bounds or accessing memory improperly through pointer arithmetic, can be diagnosed only at runtime. For this reason, the rest of this book is dedicated mainly to the discussion of catching runtime errors.

Rules for this chapter for diagnosing errors at compile time:

  • Prohibit implicit type conversions: declare constructors taking one parameter with the keyword explicit and avoid conversion operators.

  • Use different classes for different data types.

  • Do not use enums to create int constants; use them to create new types.
