Chapter 10. Techniques for Debugging

Debugging a program can often be as hard or, sometimes, even more difficult than writing it. Quite often, programmers seem to spend an awful amount of time hunting for that elusive bug, the reason for which may be staring them in the face, yet not revealing itself.

Many developers, even the good ones, find troubleshooting a difficult art. Most often, programmers resort to complicated debugging techniques when simple approaches such as properly placed print statements and strategically commented code would do the trick.

Python has its own set of problems when it comes to debugging code. Being a dynamically typed language, type-related exceptions, which happen due to the programmer assuming a type to be something (when it's something else), are pretty common in Python. Name errors and attribute errors fall in a similar category too.

In this chapter, we will exclusively focus on this lesser discussed aspect of software.

Here is a topic-wise listing of what we are going to encounter in this chapter:

  • Maximum subarray problem:
    • The power of "print"
    • Analysis and rewrite
    • Timing and optimizing the code
  • Simple debugging tricks and techniques:
    • Word searcher program
    • Word searcher program—debugging step 1
    • Word searcher program—debugging step 2
    • Word searcher program—final code
    • Skipping blocks of code
    • Stopping execution
    • External dependencies—using wrappers
    • Replacing functions with their return value/data (mocking)
    • Saving to/loading data from files as cache
    • Saving to/loading data from memory as cache
    • Returning random/mock data
    • Generating random patient data
  • Logging as a debugging technique:
    • Simple application logging
    • Advanced logging—logger objects
    • Advanced logging—custom formatting and loggers
    • Advanced logging—writing to syslog
  • Debugging tools—using debuggers:
    • A debugging session with pdb
    • Pdb—similar tools
    • iPdb
    • Pdb++
  • Advanced debugging—tracing:
    • The trace module
    • The iptrace program
    • System call tracing using strace

Okay, so let's debug it!

Maximum subarray problem

For starters, let's look at an interesting problem. In this problem, the goal is to find the maximum contiguous subarray of an array (sequence) of integers having mixed negative and positive numbers.

For example, say we have the following array:

>>> a  = [-5, 20, -10, 30, 15]

It is pretty obvious with a quick scan that the maximum sum is for the subarray [20, -10, 30, 15], giving a sum of 55.

Let's say, as a first cut, you write this piece of code:

import itertools

# max_subarray: v1
def max_subarray(sequence):
    """ Find sub-sequence in sequence having maximum sum """
    
    sums = []
    
    for i in range(len(sequence)):
        # Create all sub-sequences in given size
        for sub_seq in itertools.combinations(sequence, i):
            # Append sum
            sums.append(sum(sub_seq))

    return max(sums)

Now let's try it out:

>>>  max_subarray([-5, 20, -10, 30, 15])
65

This output seems clearly wrong, as any manual addition of any subarray in the array doesn't seem to yield a number more than 55. We need to debug the code.

The power of "print"

In order to debug the preceding example, a simple, strategically placed print statement does the trick. Let's print out the subsequences in the inner for loop:

The function is modified as follows:

# max_subarray: v1

def max_subarray(sequence):
    """ Find sub-sequence in sequence having maximum sum """

    sums = []
    for i in range(len(sequence)):
        for sub_seq in itertools.combinations(sequence, i):
            sub_seq_sum = sum(sub_seq)
            print(sub_seq,'=>',sub_seq_sum)
            sums.append(sub_seq_sum)

    return max(sums)

Now the code executes and prints this output:

>>> max_subarray([-5, 20, -10, 30, 15])
((), '=>', 0)
((-5,), '=>', -5)
((20,), '=>', 20)
((-10,), '=>', -10)
((30,), '=>', 30)
((15,), '=>', 15)
((-5, 20), '=>', 15)
((-5, -10), '=>', -15)
((-5, 30), '=>', 25)
((-5, 15), '=>', 10)
((20, -10), '=>', 10)
((20, 30), '=>', 50)
((20, 15), '=>', 35)
((-10, 30), '=>', 20)
((-10, 15), '=>', 5)
((30, 15), '=>', 45)
((-5, 20, -10), '=>', 5)
((-5, 20, 30), '=>', 45)
((-5, 20, 15), '=>', 30)
((-5, -10, 30), '=>', 15)
((-5, -10, 15), '=>', 0)
((-5, 30, 15), '=>', 40)
((20, -10, 30), '=>', 40)
((20, -10, 15), '=>', 25)
((20, 30, 15), '=>', 65)
((-10, 30, 15), '=>', 35)
((-5, 20, -10, 30), '=>', 35)
((-5, 20, -10, 15), '=>', 20)
((-5, 20, 30, 15), '=>', 60)
((-5, -10, 30, 15), '=>', 30)
((20, -10, 30, 15), '=>', 55)
65

The problem is clear now by looking at the output of the print statements.

There is a subarray [20, 30, 15] (highlighted in bold in the preceding output), which produces the sum 65. However, this is not a valid subarray, as the elements are not contiguous in the original array.

Clearly, the program is wrong and needs a fix.

Analysis and rewrite

A quick analysis tells us that the use of itertools.combinations is the culprit here. We used it as a way to quickly generate all the subarrays of different lengths from the array, but using combinations does not respect the order of items, and generates all combinations producing subarrays that are not contiguous.

Clearly, we need to rewrite this. Here is a first attempt at the rewrite:

# max_subarray: v2

def max_subarray(sequence):
    """ Find sub-sequence in sequence having maximum sum """

    sums = []
    
    for i in range(len(sequence)):
        for j in range(i+1, len(sequence)):
            sub_seq = sequence[i:j]
            sub_seq_sum = sum(sub_seq)
            print(sub_seq,'=>',sub_seq_sum)
            sums.append(sum(sub_seq))

    return max(sums)

Now the output is as follows:

>>> max_subarray([-5, 20, -10, 30, 15])
([-5], '=>', -5)
([-5, 20], '=>', 15)
([-5, 20, -10], '=>', 5)
([-5, 20, -10, 30], '=>', 35)
([20], '=>', 20)
([20, -10], '=>', 10)
([20, -10, 30], '=>', 40)
([-10], '=>', -10)
([-10, 30], '=>', 20)
([30], '=>', 30)
40

The answer is not correct again, as it gives the suboptimal answer 40, not the correct one, which is, 55. Again, the print statement comes to the rescue, as it tells us clearly that the main array itself is not being considered—we have an off-by-one bug.

Note

An off-by-one or one-off error occurs in programming when an array index used to iterate over a sequence (array) is off either by one less or one more than the correct value. This is often found in languages where the index for sequences starts from zero, such as C/C++, Java, or Python.

In this case, the off-by-one error is in this line:

    "sub_seq = sequence[i:j]"

The correct code should, instead, be as follows:

    "sub_seq = sequence[i:j+1]"

With this fix, our code produces the output as expected:

# max_subarray: v2

def max_subarray(sequence):
    """ Find sub-sequence in sequence having maximum sum """
    
    sums = []
    
    for i in range(len(sequence)):
        for j in range(i+1, len(sequence)):
            sub_seq = sequence[i:j+1]
            sub_seq_sum = sum(sub_seq)
          print(sub_seq,'=>',sub_seq_sum)
            sums.append(sub_seq_sum)

    return max(sums)

Here is the output:

>>> max_subarray([-5, 20, -10, 30, 15])
([-5, 20], '=>', 15)
([-5, 20, -10], '=>', 5)
([-5, 20, -10, 30], '=>', 35)
([-5, 20, -10, 30, 15], '=>', 50)
([20, -10], '=>', 10)
([20, -10, 30], '=>', 40)
([20, -10, 30, 15], '=>', 55)
([-10, 30], '=>', 20)
([-10, 30, 15], '=>', 35)
([30, 15], '=>', 45)
55

Let us assume at this point that you consider the code to be complete.

You pass the code on to a reviewer, and they mention that your code, though called max_subarray, actually forgets to return the subarray itself, instead returning only the sum. There is also the feedback that you don't need to maintain an array of sums.

You combine this feedback and produce a version 3.0 of the code, which fixes both the issues:

# max_subarray: v3

def max_subarray(sequence):
    """ Find sub-sequence in sequence having maximum sum """

    # Trackers for max sum and max sub-array
    max_sum, max_sub = 0, []
    
    for i in range(len(sequence)):
        for j in range(i+1, len(sequence)):
            sub_seq = sequence[i:j+1]
            sum_s = sum(sub_seq)
            if sum_s > max_sum:
                # If current sum > max sum so far, replace the values
                max_sum, max_sub = sum_s, sub_seq

    return max_sum, max_sub

>>>  max_subarray([-5, 20, -10, 30, 15])
(55, [20, -10, 30, 15])

Note that we removed the print statement in this last version, as the logic was already correct, and so there was no need for debugging.

All good.

Timing and optimizing the code

If you analyze the code a bit, you'll find that the code performs two passes through the full sequence, one outer and one inner. So if the sequence contains n items, the code performs n*n passes.

We know from Chapter 4, Good Performance is Rewarding!, on performance that such a piece of code performs at the order of O(n2). We can measure the real time spent on the code by using simple context-manager using the with operator.

Our context manager looks as follows:

import time
from contextlib import contextmanager

@contextmanager
def timer():
    """ Measure real-time execution of a block of code """
    
    try:
        start = time.time()
        yield
    finally:
        end = (time.time() - start)*1000
        print 'time taken=> %.2f ms' % end

Let's modify the code to create an array of random numbers of different sizes to measure the time taken. We will write a function for this:

import random

def num_array(size):
    """ Return a list of numbers in a fixed random range
    of given size """

    nums = []
    for i in range(size):
        nums.append(random.randrange(-25, 30))
    return nums

Let's time our logic for various sizes of arrays, beginning with 100:

>>> with timer():
... max_subarray(num_array(100))
... (121, [7, 10, -17, 3, 21, 26, -2, 5, 14, 2, -19, -18, 23, 12, 8, -12, -23, 28, -16, -19, -3, 14, 16, -25, 26, -16, 4, 12, -23, 26, 22, 12, 23])
time taken=> 16.45 ms

For an array of 1,000, the code will be as follows:

>>> with timer():
... max_subarray(num_array(100))
... (121, [7, 10, -17, 3, 21, 26, -2, 5, 14, 2, -19, -18, 23, 12, 8, -12, -23, 28, -16, -19, -3, 14, 16, -25, 26, -16, 4, 12, -23, 26, 22, 12, 23])
time taken=> 16.45 ms

So this takes about 3.3 seconds.

It can be shown that with an input size of 10,000, the code will take around 2 to 3 hours to run.

Is there a way to optimize the code? Yes, there is an O(n) version of the same code, which looks like this:

def max_subarray(sequence):
    """ Maximum subarray – optimized version """

    max_ending_here = max_so_far = 0

    for x in sequence:
        max_ending_here = max(0, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    
    return max_so_far

With this version, the time taken is much better:

>>> with timer():
... max_subarray(num_array(100))
... 240
time taken=> 0.77 ms

For an array of 1,000, the time taken is as follows:

>>> with timer():
... max_subarray(num_array(1000))
... 2272
time taken=> 6.05 ms

For an array of 10,000, the time is around 44 milliseconds:

>>> with timer():
... max_subarray(num_array(10000))
... 19362
time taken=> 43.89 ms
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.119.104.95