Chapter Four. Algorithms and Data Structures

This chapter presents fundamental data types that are essential building blocks for a broad variety of applications. This chapter is also a guide to using them, whether you choose to use Java library implementations or to develop your own variations based on the code given here.

Objects can contain references to other objects, so we can build structures known as linked structures, which can be arbitrarily complex. With linked structures and arrays, we can build data structures to organize information in such a way that we can efficiently process it with associated algorithms. In a data type, we use the set of values to build data structures and the methods that operate on those values to implement algorithms.

The algorithms and data structures that we consider in this chapter introduce a body of knowledge developed over the past 50 years that constitutes the basis for the efficient use of computers for a broad variety of applications. From n-body simulation problems in physics to genetic sequencing problems in bioinformatics, the basic methods we describe have become essential in scientific research; from database systems to search engines, these methods are the foundation of commercial computing. As the scope of computing applications continues to expand, so grows the impact of these basic methods.

Algorithms and data structures themselves are valid subjects of scientific study. Accordingly, we begin by describing a scientific approach for analyzing the performance of algorithms, which we apply throughout the chapter.

4.1 Performance

In this section, you will learn to respect a principle that is succinctly expressed in yet another mantra that should live with you whenever you program: pay attention to the cost. If you become an engineer, that will be your job; if you become a biologist or a physicist, the cost will dictate which scientific problems you can address; if you are in business or become an economist, this principle needs no defense; and if you become a software developer, the cost will dictate whether the software that you build will be useful to any of your clients.

To study the cost of running our programs, we use the scientific method, the commonly accepted body of techniques universally used by scientists to develop knowledge about the natural world. We also apply mathematical analysis to derive concise mathematical models of the cost.

Which features of the natural world are we studying? In most situations, we are interested in one fundamental characteristic: time. Whenever we run a program, we are performing an experiment involving the natural world, putting a complex system of electronic circuitry through a series of state changes involving a huge number of discrete events that we are confident will eventually stabilize to a state with results that we want to interpret. Although developed in the abstract world of Java programming, these events most definitely are happening in the natural world. What will be the elapsed time until we see the result? It makes a great deal of difference to us whether that time is a millisecond, a second, a day, or a week. Therefore, we want to learn, through the scientific method, how to properly control the situation, as when we launch a rocket, build a bridge, or smash an atom.

On the one hand, modern programs and programming environments are complex; on the other hand, they are developed from a simple (but powerful) set of abstractions. It is a small miracle that a program produces the same result each time we run it. To predict the time required, we take advantage of the relative simplicity of the supporting infrastructure that we use to build programs. You may be surprised at the ease with which you can develop cost estimates and predict the performance characteristics of many of the programs that you write.

Scientific method

The following five-step approach briefly summarizes the scientific method:

Observe some feature of the natural world.

Hypothesize a model that is consistent with the observations.

Predict events using the hypothesis.

Verify the predictions by making further observations.

Validate by repeating until the hypothesis and observations agree.

One of the key tenets of the scientific method is that the experiments we design must be reproducible, so that others can convince themselves of the validity of the hypothesis. In addition, the hypotheses we formulate must be falsifiable—we require the possibility of knowing for sure when a hypothesis is wrong (and thus needs revision).

Observations

Our first challenge is to make quantitative measurements of the running times of our programs. Although measuring the exact running time of a program is difficult, usually we are happy with approximate estimates. A number of tools can help us obtain such approximations. Perhaps the simplest is a physical stopwatch or the Stopwatch data type (see PROGRAM 3.2.2). We can simply run a program on various inputs, measuring the amount of time to process each input.

[Figure: two statements and the corresponding output, depicting the running time of a program]
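As a concrete sketch of such an experiment (assuming the booksite's StdIn, StdOut, and Stopwatch libraries and ThreeSum from PROGRAM 4.1.1 are available; the class name TimeThreeSum is ours):

public class TimeThreeSum
{
   public static void main(String[] args)
   {
      int[] a = StdIn.readAllInts();          // read the input to be processed
      Stopwatch timer = new Stopwatch();      // starts timing upon creation
      int count = ThreeSum.countTriples(a);   // the computation being timed
      double seconds = timer.elapsedTime();   // elapsed time, in seconds
      StdOut.println(count + " triples found in " + seconds + " seconds");
   }
}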

Our first qualitative observation about most programs is that there is a problem size that characterizes the difficulty of the computational task. Normally, the problem size is either the size of the input or the value of a command-line argument. Intuitively, the running time should increase with the problem size, but the question of by how much it increases naturally arises every time we develop and run a program.

Another qualitative observation for many programs is that the running time is relatively insensitive to the input itself; it depends primarily on the problem size. If this relationship does not hold, we need to run more experiments to better understand the running time’s sensitivity to the input. Since this relationship does often hold, we focus now on the goal of better quantifying the correspondence between problem size and running time.

As a concrete example, we start with ThreeSum (PROGRAM 4.1.1), which counts the number of (unordered) triples in an array of n numbers that sum to 0 (assuming that integer overflow plays no role). This computation may seem contrived to you, but it is deeply related to fundamental tasks in computational geometry, so it is a problem worthy of careful study. What is the relationship between the problem size n and the running time for ThreeSum?

Hypotheses

In the early days of computer science, Donald Knuth showed that, despite all of the complicating factors in understanding the running time of a program, it is possible in principle to create an accurate model that can help us predict precisely how long the program will take. Proper analysis of this sort involves:

• Detailed understanding of the program

• Detailed understanding of the system and the computer

• Advanced tools of mathematical analysis

Thus, it is best left for experts. Every programmer, however, needs to know how to make back-of-the-envelope performance estimates. Fortunately, we can often acquire such knowledge by using a combination of empirical observations and a small set of mathematical tools.

Doubling hypotheses

For a great many programs, we can quickly formulate a hypothesis for the following question: What is the effect on the running time of doubling the size of the input? For clarity, we refer to this hypothesis as a doubling hypothesis. Perhaps the easiest way to pay attention to the cost is to ask yourself this question about your programs as you develop them. Next, we describe how to answer this question by applying the scientific method.

Empirical analysis

Clearly, we can get a head start on developing a doubling hypothesis by doubling the size of the input and observing the effect on the running time. For example, DoublingTest (PROGRAM 4.1.2) generates a sequence of random input arrays for ThreeSum, doubling the array length at each step, and prints the ratio of the running time of ThreeSum.countTriples() for each input to the running time for an input of half the size. If you run this program, you will find yourself caught in a prediction–verification cycle: It prints several lines very quickly, but then begins to slow down. Each time it prints a line, you find yourself wondering how long it will take to solve a problem of twice the size. If you use a Stopwatch to perform the measurements, you will see that the ratio seems to converge to a value around 8. This leads immediately to the hypothesis that the running time increases by a factor of 8 when the input size doubles. We might also plot the running times, either on a standard plot, which clearly shows that the rate of increase of the running time increases with input size, or on a log–log plot. In the case of ThreeSum, the log–log plot is a straight line with slope 3, which clearly suggests the hypothesis that the running time satisfies a power law of the form cn³ (see EXERCISE 4.1.6).
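Since the doubling ratio for a power law T(n) = cnᵇ is T(2n)/T(n) = 2ᵇ, a measured ratio also yields an estimate of the exponent b. Here is a minimal sketch under that assumption (the class name Exponent and its command-line interface are ours, not the book's):

public class Exponent
{
   public static void main(String[] args)
   {
      double previous = Double.parseDouble(args[0]);   // time for input of size n/2
      double current  = Double.parseDouble(args[1]);   // time for input of size n
      double ratio = current / previous;               // doubling ratio, about 2^b
      double b = Math.log(ratio) / Math.log(2);        // estimated exponent: lg(ratio)
      StdOut.println(b);                               // about 3 for ThreeSum
   }
}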


Program 4.1.1 3-sum problem

public class ThreeSum
{
   public static void printTriples(int[] a)
   {  /* See Exercise 4.1.1. */  }

   public static int countTriples(int[] a)
   {  // Count triples that sum to 0.

      int n = a.length;
      int count = 0;
      for (int i = 0; i < n; i++)
         for (int j = i+1; j < n; j++)
            for (int k = j+1; k < n; k++)
               if (a[i] + a[j] + a[k] == 0)
                  count++;
      return count;
   }

   public static void main(String[] args)
   {
      int[] a = StdIn.readAllInts();
      int count = countTriples(a);
      StdOut.println(count);
      if (count < 10) printTriples(a);
   }
}

   n   | number of integers
  a[]  | the n integers
 count | number of triples that sum to 0

The countTriples() method counts the number of triples in a[] whose sum is exactly 0 (ignoring integer overflow). The test client invokes countTriples() for the integers on standard input and prints the triples if the count is low. The file 1Kints.txt contains 1,024 random values from the int data type. Such a file is not likely to have such a triple (see EXERCISE 4.1.28).


% more 8ints.txt
 30
-30
-20
-10
 40
  0
 10
  5

% java ThreeSum < 8ints.txt
4
 30 -30   0
 30 -20 -10
-30 -10  40
-10   0  10

% java ThreeSum < 1Kints.txt
0

[Figure: a standard plot of the running time]
Mathematical analysis

Knuth’s basic insight on building a mathematical model to describe the running time of a program is simple—the total running time is determined by two primary factors:

• The cost of executing each statement

• The frequency of executing each statement

The former is a property of the system, and the latter is a property of the algorithm. If we know both for all instructions in the program, we can multiply them together and sum for all instructions in the program to get the running time.

[Figure: a log–log plot of the running time]

The primary challenge is to determine the frequency of execution of the statements. Some statements are easy to analyze: for example, the statement that sets count to 0 in ThreeSum.countTriples() is executed only once. Other statements require higher-level reasoning: for example, the if statement in ThreeSum.countTriples() is executed precisely n (n–1)(n–2)/6 times (which is the number of ways to pick three different numbers from the input array—see EXERCISE 4.1.4).


Program 4.1.2 Validating a doubling hypothesis

public class DoublingTest
{
   public static double timeTrial(int n)
   {  // Compute time to solve a random input of size n.
      int[] a = new int[n];
      for (int i = 0; i < n; i++)
         a[i] = StdRandom.uniform(2000000) - 1000000;
      Stopwatch timer = new Stopwatch();
      int count = ThreeSum.countTriples(a);
      return timer.elapsedTime();
   }

   public static void main(String[] args)
   {  // Print table of doubling ratios.
      for (int n = 512; true; n *= 2)
      {  // Print doubling ratio for problem size n.
         double previous = timeTrial(n/2);
         double current  = timeTrial(n);
         double ratio = current / previous;
         StdOut.printf("%7d %4.2f
", n, ratio);
      }
   }
}

   n   | problem size
  a[]  | random integers
 timer | stopwatch

    n    | problem size
previous | running time for n/2
 current | running time for n
  ratio  | ratio of running times

This program prints to standard output a table of doubling ratios for the three-sum problem. The table shows how doubling the problem size affects the running time of the method call ThreeSum.countTriples() for problem sizes starting at 512 and doubling for each row of the table. These experiments lead to the hypothesis that the running time increases by a factor of 8 when the input size doubles. When you run the program, note carefully that the elapsed time between lines printed increases by a factor of about 8, verifying the hypothesis.


% java DoublingTest
    512 6.48
   1024 8.30
   2048 7.75
   4096 8.00
   8192 8.05
  ...

Frequency analyses of this sort can lead to complicated and lengthy mathematical expressions. To substantially simplify matters in the mathematical analysis, we develop simpler approximate expressions in two ways.

[Figure: the anatomy of a program's statement execution frequencies]

First, we work with only the leading term of a mathematical expression by using a mathematical device known as tilde notation. We write ∼f(n) to represent any quantity that, when divided by f(n), approaches 1 as n grows. We also write g(n) ∼ f(n) to indicate that g(n)/f(n) approaches 1 as n grows. With this notation, we can ignore complicated parts of an expression that represent small values. For example, the if statement in ThreeSum is executed ∼n³/6 times because n(n–1)(n–2)/6 = n³/6 – n²/2 + n/3, which certainly, when divided by n³/6, approaches 1 as n grows. This notation is useful when the terms after the leading term are relatively insignificant (for example, when n = 1,000, this assumption amounts to saying that –n²/2 + n/3 ≈ –499,667 is relatively insignificant by comparison with n³/6 ≈ 166,666,667, which it is).
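To see the convergence concretely, the following illustrative sketch (not from the book) prints the ratio of the exact frequency to its tilde approximation as n grows:

public class TildeCheck
{
   public static void main(String[] args)
   {
      for (long n = 10; n <= 100000000L; n *= 10)
      {
         double exact = (double) n * (n-1) * (n-2) / 6.0;   // n(n-1)(n-2)/6
         double tilde = (double) n * n * n / 6.0;           // ~n³/6
         StdOut.println(n + "  " + exact / tilde);          // ratio approaches 1
      }
   }
}

For n = 10, 100, and 1,000, the ratios are about 0.72, 0.97, and 0.997, approaching 1 as claimed.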

Second, we focus on the instructions that are executed most frequently, sometimes referred to as the inner loop of the program. In this program it is reasonable to assume that the time devoted to the instructions outside the inner loop is relatively insignificant.

The key point in analyzing the running time of a program is this: for a great many programs, the running time satisfies the relationship

T(n) ∼ c f (n)

[Figure: the leading-term approximation]

where c is a constant and f(n) is a function known as the order of growth of the running time. For typical programs, f(n) is a function such as log n, n, n log n, n², or n³, as you will soon see (customarily, we express order-of-growth functions without any constant coefficient). When f(n) is a power of n, as is often the case, this assumption is equivalent to saying that the running time obeys a power law. In the case of ThreeSum, it is a hypothesis already verified by our empirical observations: the order of growth of the running time of ThreeSum is n³. The value of the constant c depends both on the cost of executing instructions and on the details of the frequency analysis, but we normally do not need to work out the value, as you will now see.

The order of growth is a simple but powerful model of running time. For example, knowing the order of growth typically leads immediately to a doubling hypothesis. In the case of ThreeSum, knowing that the order of growth is n³ tells us to expect the running time to increase by a factor of 8 when we double the size of the problem because

T(2n)/T(n) = c(2n)³/(cn³) = 8

This matches the value resulting from the empirical analysis, thus validating both the model and the experiments. Study this example carefully, because you can use the same method to better understand the performance of any program that you write.

Knuth showed that it is possible to develop an accurate mathematical model of the running time of any program, and many experts have devoted much effort to developing such models. But you do not need such a detailed model to understand the performance of your programs: it is typically safe to ignore the cost of the instructions outside the inner loop (because that cost is negligible by comparison to the cost of the instructions in the inner loop) and not necessary to know the value of the constant in the running-time approximation (because it cancels out when you use a doubling hypothesis to make predictions).

 number of instructions | time per instruction in seconds | frequency          | total time
           6            | 2 × 10⁻⁹                        | n³/6 – n²/2 + n/3  | (2n³ – 6n² + 4n) × 10⁻⁹
           4            | 3 × 10⁻⁹                        | n²/2 + n/2         | (6n² + 6n) × 10⁻⁹
           4            | 3 × 10⁻⁹                        | n                  | (12n) × 10⁻⁹
          10            | 1 × 10⁻⁹                        | 1                  | 10 × 10⁻⁹

                                           grand total     | (2n³ + 22n + 10) × 10⁻⁹
                                           tilde notation  | ~2n³ × 10⁻⁹
                                           order of growth | n³

Analyzing the running time of a program (example)

The approximations are such that characteristics of the particular machine that you are using do not play a significant role in the models—the analysis separates the algorithm from the system. The fact that the order of growth of the running time of ThreeSum is n³ does not depend on whether it is implemented in Java or Python, or whether it is running on your laptop, someone else's cellphone, or a supercomputer; it depends primarily on the fact that it examines all the triples. The properties of the computer and the system are all summarized in various assumptions about the relationship between program statements and machine instructions, and in the actual running times that you observe as the basis for the doubling hypothesis. The algorithm that you are using determines the order of growth. This separation is a powerful concept because it allows us to develop knowledge about the performance of algorithms and then apply that knowledge to any computer. In fact, much of the knowledge about the performance of classic algorithms was developed decades ago, but that knowledge is still relevant to today's computers.

Empirical and mathematical analyses like those we have described constitute a model (an explanation of what is going on) that might be formalized by listing all of the assumptions mentioned (each instruction takes the same amount of time each time it is executed, running time has the given form, and so forth). Not many programs are worthy of a detailed model, but you need to have an idea of the running time that you might expect for every program that you write. Pay attention to the cost. Formulating a doubling hypothesis—through empirical studies, mathematical analysis, or (preferably) both—is a good way to start. This information about performance is extremely useful, and you will soon find yourself formulating and validating hypotheses every time you run a program. Indeed, doing so is a good use of your time while you wait for your program to finish!

Order-of-growth classifications

We use just a few structural primitives (statements, conditionals, loops, and method calls) to build Java programs, so very often the order of growth of our programs is one of just a few functions of the problem size, summarized in the following table. These functions immediately lead to a doubling hypothesis, which we can verify by running the programs. Indeed, you have been running programs that exhibit these orders of growth, as you can see in the following brief discussions.

 description  | order of growth | factor for doubling hypothesis
 constant     | 1               | 1
 logarithmic  | log n           | 1
 linear       | n               | 2
 linearithmic | n log n         | 2
 quadratic    | n²              | 4
 cubic        | n³              | 8
 exponential  | 2ⁿ              | 2ⁿ

Commonly encountered order-of-growth classifications

Constant

A program whose running time’s order of growth is constant executes a fixed number of statements to finish its job; consequently, its running time does not depend on the problem size. Our first several programs in CHAPTER 1—such as HelloWorld (PROGRAM 1.1.1) and LeapYear (PROGRAM 1.2.4)—fall into this classification. Each of these programs executes several statements just once. All of Java’s operations on primitive types take constant time, as do Java’s Math library functions. Note that we do not specify the size of the constant. For example, the constant for Math.tan() is much larger than that for Math.abs().

Logarithmic

A program whose running time's order of growth is logarithmic is barely slower than a constant-time program. The classic example of a program whose running time is logarithmic in the problem size is looking up a value in a sorted array, which we consider in the next section (see BinarySearch, in PROGRAM 4.2.3). The base of the logarithm is not relevant with respect to the order of growth (since all logarithms with a constant base are related by a constant factor), so we use log n when referring to order of growth. When we care about the constant in the leading term (such as when using tilde notation), we are careful to specify the base of the logarithm. We use the notation lg n for the binary (base-2) logarithm and ln n for the natural (base-e) logarithm.
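For a concrete sense of the halving idea, here is a sketch of binary search in a sorted integer array (BinarySearch, in PROGRAM 4.2.3, is the full implementation and may differ in details):

// Return the index of key in the sorted array a[], or -1 if key is not present.
public static int search(int key, int[] a)
{
   int lo = 0, hi = a.length - 1;
   while (lo <= hi)
   {  // key is in a[lo..hi], if it is in a[] at all.
      int mid = lo + (hi - lo) / 2;
      if      (key < a[mid]) hi = mid - 1;   // discard the upper half
      else if (key > a[mid]) lo = mid + 1;   // discard the lower half
      else                   return mid;
   }
   return -1;
}

Each iteration halves the interval, so the loop executes ~lg n times; for an array of one million elements, the loop body executes at most about 20 times.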

Linear

Programs that spend a constant amount of time processing each piece of input data, or that are based on a single for loop, are quite common. The order of growth of the running time of such a program is said to be linear—its running time is directly proportional to the problem size. Average (PROGRAM 1.5.3), which computes the average of the numbers on standard input, is prototypical, as is our code to shuffle the values in an array in SECTION 1.4. Filters such as PlotFilter (PROGRAM 1.5.5) also fall into this classification, as do the various image-processing filters that we considered in SECTION 3.2, which perform a constant number of arithmetic operations per input pixel.

Linearithmic

We use the term linearithmic to describe programs whose running time for a problem of size n has order of growth n log n. Again, the base of the logarithm is not relevant. For example, CouponCollector (PROGRAM 1.4.2) is linearithmic. The prototypical example is mergesort (see PROGRAM 4.2.6). Several important problems have natural solutions that are quadratic but clever algorithms that are linearithmic. Such algorithms (including mergesort) are critically important in practice because they enable us to address problem sizes far larger than could be addressed with quadratic solutions. In SECTION 4.2, we consider a general design technique known as divide-and-conquer for developing linearithmic algorithms.
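For intuition about this order of growth, the following illustrative fragment (not the mergesort code itself) makes ~lg n passes over n elements, each pass doing linear work:

long count = 0;
for (int len = 1; len < n; len *= 2)   // ~lg n passes
   for (int i = 0; i < n; i++)         // linear work per pass
      count++;                         // executed ~n lg n times in total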

Quadratic

A typical program whose running time has order of growth n2 has double nested for loops, used for some calculation involving all pairs of n elements. The double nested loop that computes the pairwise forces in Universe (PROGRAM 3.4.2) is a prototype of the programs in this classification, as is the insertion sort algorithm (PROGRAM 4.2.4) that we consider in SECTION 4.2.

[Figure: a log–log plot of the orders of growth]

constant (1): statement (increment an integer)

   count++;

logarithmic (log n): divide in half (bits in binary representation)

   for (int i = n; i > 0; i /= 2)
      count++;

linear (n): single loop (check each element)

   for (int i = 0; i < n; i++)
      if (a[i] == 0)
         count++;

linearithmic (n log n): divide-and-conquer

   [ see mergesort (PROGRAM 4.2.6) ]

quadratic (n²): double nested loop (check all pairs)

   for (int i = 0; i < n; i++)
      for (int j = i+1; j < n; j++)
         if (a[i] + a[j] == 0)
            count++;

cubic (n³): triple nested loop (check all triples)

   for (int i = 0; i < n; i++)
      for (int j = i+1; j < n; j++)
         for (int k = j+1; k < n; k++)
            if (a[i] + a[j] + a[k] == 0)
               count++;

exponential (2ⁿ): exhaustive search (check all subsets)

   [ see Gray code (PROGRAM 2.3.3) ]

Summary of common order-of-growth hypotheses

Cubic

Our example for this section, ThreeSum, is cubic (its running time has order of growth n³) because it has three nested for loops, to process all triples of n elements. The running time of matrix multiplication, as implemented in SECTION 1.4, has order of growth m³ to multiply two m-by-m matrices, so the basic matrix multiplication algorithm is often considered to be cubic. However, the size of the input (the number of elements in the matrices) is proportional to n = m², so the algorithm is best classified as n^(3/2), not cubic.
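For reference, here is a sketch of that triple loop, in the style of the SECTION 1.4 implementation (a[][] and b[][] are the m-by-m input matrices):

// Multiply two m-by-m matrices: ~m³ multiplications.
double[][] c = new double[m][m];
for (int i = 0; i < m; i++)
   for (int j = 0; j < m; j++)
      for (int k = 0; k < m; k++)
         c[i][j] += a[i][k] * b[k][j];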

Exponential

As discussed in SECTION 2.3, both TowersOfHanoi (PROGRAM 2.3.2) and Beckett (PROGRAM 2.3.3) have running times proportional to 2ⁿ because they process all subsets of n elements. Generally, we use the term exponential to refer to algorithms whose order of growth is 2^(a n^b) for any positive constants a and b, even though different values of a and b lead to vastly different running times. Exponential-time algorithms are extremely slow—you will never run one of them for a large problem. They play a critical role in the theory of algorithms because there exists a large class of problems for which it seems that an exponential-time algorithm is the best possible choice.

These classifications are the most common, but certainly not a complete set. Indeed, the detailed analysis of algorithms can require the full gamut of mathematical tools that have been developed over the centuries. Understanding the running time of programs such as Factors (PROGRAM 1.3.9), PrimeSieve (PROGRAM 1.4.3), and Euclid (PROGRAM 2.3.1) requires fundamental results from number theory. Classic algorithms such as HashST (PROGRAM 4.4.3) and BST (PROGRAM 4.4.4) require careful mathematical analysis. The programs Sqrt (PROGRAM 1.3.6) and Markov (PROGRAM 1.6.3) are prototypes for numerical computation: their running time is dependent on the rate of convergence of a computation to a desired numerical result. Simulations such as Gambler (PROGRAM 1.3.8) and its variants are of interest precisely because detailed mathematical models are not always available.

Nevertheless, a great many of the programs that you will write have straightforward performance characteristics that can be described accurately by one of the orders of growth that we have considered. Accordingly, we can usually work with simple higher-level hypotheses, such as the order of growth of the running time of mergesort is linearithmic. For economy, we abbreviate such a statement to just say mergesort is a linearithmic-time algorithm. Most of our hypotheses about cost are of this form, or of the form mergesort is faster than insertion sort. Again, a notable feature of such hypotheses is that they are statements about algorithms, not just about programs.

Predictions

You can always try to learn the running time of a program by simply running it, but that might be a poor way to proceed when the problem size is large. In that case, it is analogous to trying to learn where a rocket will land by launching it, how destructive a bomb will be by igniting it, or whether a bridge will stand by building it.

Knowing the order of growth of the running time allows us to make decisions about addressing large problems so that we can invest whatever resources we have to deal with the specific problems that we actually need to solve. We typically use the results of verified hypotheses about the order of growth of the running time of programs in one of the following ways.

Estimating the feasibility of solving large problems

To pay attention to the cost, you need to answer this basic question for every program that you write: will this program be able to process this input in a reasonable amount of time? For example, a cubic-time algorithm that runs in a couple of seconds for a problem of size n will require a few weeks for a problem of size 100n because it will be a million (100³) times slower, and a couple of million seconds is a few weeks. If that is the size of the problem that you need to solve, you have to find a better method. Knowing the order of growth of the running time of an algorithm provides precisely the information that you need to understand limitations on the size of the problems that you can solve. Developing such understanding is the most important reason to study performance. Without it, you are likely to have no idea how much time a program will consume; with it, you can make a back-of-the-envelope calculation to estimate costs and proceed accordingly.
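Such back-of-the-envelope estimates are easy to automate. The following sketch is illustrative (the class name Predict and the power-law assumption T(n) ~ cnᵇ are ours): it extrapolates an observed running time to a larger problem size.

public class Predict
{
   // Predicted time for size n2, given observed time t1 for size n1,
   // under the power-law hypothesis T(n) ~ c n^b.
   public static double predict(double t1, double n1, double n2, double b)
   {  return t1 * Math.pow(n2 / n1, b);  }

   public static void main(String[] args)
   {  // A cubic program that takes 2 seconds: increase the size 100-fold.
      StdOut.println(predict(2.0, 1000, 100000, 3.0) + " seconds");
      // prints 2000000.0 seconds (a couple of million seconds: a few weeks)
   }
}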

 order of growth | predicted running time if problem size is increased by a factor of 100
 linear          | a few minutes
 linearithmic    | a few minutes
 quadratic       | several hours
 cubic           | a few weeks
 exponential     | forever

Effect of increasing problem size for a program that runs for a few seconds

Estimating the value of using a faster computer

To pay attention to the cost, you also may be faced with this basic question: how much faster can I solve the problem if I get a faster computer? Again, knowing the order of growth of the running time provides precisely the information that you need. A famous rule of thumb known as Moore’s law implies that you can expect to have a computer with about twice the speed and double the memory 18 months from now, or a computer with about 10 times the speed and 10 times the memory in about 5 years. It is natural to think that if you buy a new computer that is 10 times faster and has 10 times more memory than your old one, you can solve a problem 10 times the size, but that is not the case for quadratic-time or cubic-time algorithms. Whether it is an investment banker running daily financial models or a scientist running a program to analyze experimental data or an engineer running simulations to test a design, it is not unusual for people to regularly run programs that take several hours to complete. Suppose that you are using a program whose running time is cubic, and then buy a new computer that is 10 times faster with 10 times more memory, not just because you need a new computer, but because you face problems that are 10 times larger. The rude awakening is that it will take several weeks to get results, because the larger problems would be a thousand times slower on the old computer and improved by only a factor of 10 on the new computer. This kind of situation is the primary reason that linear and linearithmic algorithms are so valuable: with such an algorithm and a new computer that is 10 times faster with 10 times more memory than an old computer, you can solve a problem that is 10 times larger than could be solved by the old computer in the same amount of time. In other words, you cannot keep pace with Moore’s law if you are using a quadratic-time or a cubic-time algorithm.

 order of growth | factor of increase in running time
 linear          | 1
 linearithmic    | 1
 quadratic       | 10
 cubic           | 100
 exponential     | forever

Effect of using a computer that is 10 times as fast to solve a problem that is 10 times as large

Comparing programs

We are always seeking to improve our programs, and we can often extend or modify our hypotheses to evaluate the effectiveness of various improvements. With the ability to predict performance, we can make design decisions during development that guide us toward better, more efficient code. As an example, a novice programmer might have written the nested for loops in ThreeSum (PROGRAM 4.1.1) as follows:

for (int i = 0; i < n; i++)
   for (int j = 0; j < n; j++)
      for (int k = 0; k < n; k++)
         if (i < j && j < k)
            if (a[i] + a[j] + a[k] == 0)
               count++;

With this code, the frequency of execution of the instructions in the inner loop would be exactly n³ (instead of approximately n³/6). It is easy to formulate and verify the hypothesis that this variant is 6 times slower than ThreeSum. Note that improvements like this for code that is not in the inner loop will have little or no effect.

More generally, given two algorithms that solve the same problem, we want to know which one will solve our problem using fewer computational resources. In many cases, we can determine the order of growth of the running times and develop accurate hypotheses about comparative performance. The order of growth is extremely useful in this process because it allows us to compare one particular algorithm with whole classes of algorithms. For example, once we have a linearithmic algorithm to solve a problem, we become less interested in quadratic-time or cubic-time algorithms (even if they are highly optimized) to solve the same problem.

Caveats

There are many reasons that you might get inconsistent or misleading results when trying to analyze program performance in detail. All of them have to do with the idea that one or more of the basic assumptions underlying our hypotheses might not be quite correct. We can develop new hypotheses based on new assumptions, but the more details that we need to take into account, the more care is required in the analysis.

Instruction time

The assumption that each instruction always takes the same amount of time is not always correct. For example, most modern computer systems use a technique known as caching to organize memory, in which case accessing elements in huge arrays can take much longer if they are not close together in the array. You can observe the effect of caching for ThreeSum by letting DoublingTest run for a while. After seeming to converge to 8, the ratio of running times will jump to a larger value for large arrays because of caching.

Nondominant inner loop

The assumption that the inner loop dominates may not always be correct. The problem size n might not be sufficiently large to make the leading term in the analysis so much larger than lower-order terms that we can ignore them. Some programs have a significant amount of code outside the inner loop that needs to be taken into consideration.

System considerations

Typically, there are many, many things going on in your computer. Java is one application of many competing for resources, and Java itself has many options and controls that significantly affect performance. Such considerations can interfere with the bedrock principle of the scientific method that experiments should be reproducible, since what is happening at this moment in your computer will never be reproduced again. Whatever else is going on in your system (that is beyond your control) should in principle be negligible.

Too close to call

Often, when we compare two different programs for the same task, one might be faster in some situations, and slower in others. One or more of the considerations just mentioned could make the difference. Again, there is a natural tendency among some programmers (and some students) to devote an extreme amount of energy running such horseraces to find the “best” implementation, but such work is best left for experts.

Strong dependence on input values

One of the first assumptions that we made to determine the order of growth of the program’s running time was that the running time should depend primarily on the problem size (and be relatively insensitive to the input values). When that is not the case, we may get inconsistent results or be unable to validate our hypotheses. Our running example ThreeSum does not have this problem, but many of the programs that we write certainly do. We will see several examples of such programs in this chapter. Often, a prime design goal is to eliminate the dependence on input values. If we cannot do so, we need to more carefully model the kind of input to be processed in the problems that we need to solve, which may be a significant challenge. For example, if we are writing a program to process a genome, how do we know how it will perform on a different genome? But a good model describing the genomes found in nature is precisely what scientists seek, so estimating the running time of our programs on data found in nature actually contributes to that model!

Multiple problem parameters

We have been focusing on measuring performance as a function of a single parameter, generally the value of a command-line argument or the size of the input. However, it is not unusual to have several parameters. For example, suppose that a[] is an array of length m and b[] is an array of length n. Consider the following code fragment that counts the number of (unordered) pairs i and j for which a[i] + b[j] equals 0:

for (int i = 0; i < m; i++)
   for (int j = 0; j < n; j++)
      if (a[i] + b[j] == 0)
         count++;

The order of growth of the running time depends on two parameters—m and n. In such cases, we treat the parameters separately, holding one fixed while analyzing the other. For example, the order of growth of the running time of the preceding code fragment is mn. Similarly, LongestCommonSubsequence (PROGRAM 2.3.6) involves two parameters—m (the length of the first string) and n (the length of the second string)—and the order of growth of its running time is mn.

Despite all these caveats, understanding the order of growth of the running time of each program is valuable knowledge for any programmer, and the methods that we have described are powerful and broadly applicable. Knuth’s insight was that we can carry these methods through to the last detail in principle to make detailed, accurate predictions. Typical computer systems are extremely complex and close analysis is best left to experts, but the same methods are effective for developing approximate estimates of the running time of any program. A rocket scientist needs to have some idea of whether a test flight will land in the ocean or in a city; a medical researcher needs to know whether a drug trial will kill or cure all the subjects; and any scientist or engineer using a computer program needs to have some idea of whether it will run for a second or for a year.

Performance guarantees

For some programs, we demand that the running time be less than a certain bound for any input of a given size. To provide such performance guarantees, theoreticians take an extremely pessimistic view: what would the running time be in the worst case?

For example, such a conservative approach might be appropriate for the software that runs a nuclear reactor or an air traffic control system or the brakes in your car. We must guarantee that such software completes its job within specified bounds because the result could be catastrophic if it does not. Scientists normally do not contemplate the worst case when studying the natural world: in biology, the worst case might be the extinction of the human race; in physics, the worst case might be the end of the universe. But the worst case can be a very real concern in computer systems, where the input is generated by another (potentially malicious) user, rather than by nature. For example, websites that do not use algorithms with performance guarantees are subject to denial-of-service attacks, where hackers flood them with pathological requests that degrade performance catastrophically.

Performance guarantees are difficult to verify with the scientific method, because we cannot test a hypothesis such as mergesort is guaranteed to be linearithmic without trying all possible inputs, which we cannot do because there are far too many of them. We might falsify such a hypothesis by providing a family of inputs for which mergesort is slow, but how can we prove it to be true? We must do so not with experimentation, but rather with mathematical analysis.

It is the task of the algorithm analyst to discover as much relevant information about an algorithm as possible, and it is the task of the applications programmer to apply that knowledge to develop programs that effectively solve the problems at hand. For example, if you are using a quadratic-time algorithm to solve a problem but can find an algorithm that is guaranteed to be linearithmic time, you will usually prefer the linearithmic one. On rare occasions, you might still prefer the quadratic-time algorithm because it is faster on the kinds of inputs that you need to solve or because the linearithmic algorithm is too complex to implement.

Ideally, we want algorithms that lead to clear and compact code that provides both a good worst-case guarantee and good performance on inputs of interest. Many of the classic algorithms that we consider in this chapter are of importance for a broad variety of applications precisely because they have all of these properties. Using these algorithms as models, you can develop good solutions yourself for the typical problems that you face while programming.

Memory

As with running time, a program’s memory usage connects directly to the physical world: a substantial amount of your computer’s circuitry enables your program to store values and later retrieve them. The more values you need to have stored at any given instant, the more circuitry you need. To pay attention to the cost, you need to be aware of memory usage. You probably are aware of limits on memory usage on your computer (even more so than for time) because you probably have paid extra money to get more memory.

Memory usage is well defined for Java on your computer (every value will require precisely the same amount of memory each time that you run your program), but Java is implemented on a very wide range of computational devices, and memory consumption is implementation dependent. For economy, we use the term typical to signal values that are subject to machine dependencies. On a typical 64-bit machine, computer memory is organized into words, where each 64-bit word consists of 8 bytes, each byte consists of 8 bits, and each bit is a single binary digit.

Analyzing memory usage is somewhat different from analyzing time usage, primarily because one of Java’s most significant features is its memory allocation system, which is supposed to relieve you of having to worry about memory. Certainly, you are well advised to take advantage of this feature when appropriate. Still, it is your responsibility to know, at least approximately, when a program’s memory requirements will prevent you from solving a given problem.

Primitive types

It is easy to estimate memory usage for simple programs like the ones we considered in CHAPTER 1: count the number of variables and weight them by the number of bytes according to their type. For example, since the Java int data type represents the set of integer values between –2,147,483,648 and 2,147,483,647, a grand total of 2³² different values, typical Java implementations use 32 bits (4 bytes) to represent each int value. Similarly, typical Java implementations represent each char value with 2 bytes (16 bits), each double value with 8 bytes (64 bits), and each boolean value with 1 byte (since computers typically access memory one byte at a time). For example, if you have 1GB of memory on your computer (about 1 billion bytes), you cannot fit more than about 256 million int values or 128 million double values in memory at any one time.

  type   | bytes
 boolean |   1
 byte    |   1
 char    |   2
 int     |   4
 float   |   4
 long    |   8
 double  |   8

Typical memory requirements for primitive types

Objects

To determine the memory usage of an object, we add the amount of memory used by each instance variable to the overhead associated with each object, typically 16 bytes. The memory is typically padded (rounded up) to be a multiple of 8 bytes—an integral number of machine words—if necessary.

For example, on a typical system, a Complex (PROGRAM 3.2.6) object uses 32 bytes (16 bytes of overhead and 8 bytes for each of its two double instance variables). Since many programs create millions of Color objects, typical Java implementations pack the information needed for them into a single 32-bit int value. So, a Color object uses 24 bytes (16 bytes of overhead, 4 bytes for the int instance variable, and 4 bytes for padding).

[Figure: typical object memory requirements, with program statements and the corresponding memory representation]
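To make the accounting concrete, here is a sketch of the data portion of a Complex-like class, annotated with the typical 64-bit layout described above (the constructor is included only so that the sketch compiles; PROGRAM 3.2.6 defines the full data type):

public class Complex
{  // 32 bytes total on a typical system:
   //   16 bytes of object overhead, plus 8 bytes per double instance variable
   private final double re;   //  8 bytes
   private final double im;   //  8 bytes

   public Complex(double real, double imag)
   {  re = real; im = imag;  }
}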

An object reference typically uses 8 bytes (1 word) of memory. When a class includes an object reference as an instance variable, we must account separately for the memory for the object reference (8 bytes) and the memory needed for the object itself. For example, a Body (PROGRAM 3.4.1) object uses 168 bytes: object overhead (16 bytes), one double value (8 bytes), and two references (8 bytes each), plus the memory needed for the Vector objects, which we consider next.

Arrays

Arrays in Java are implemented as objects, typically with an int instance variable for the length. For primitive types, an array of n elements requires 24 bytes of array overhead (16 bytes of object overhead, 4 bytes for the length, and 4 bytes for padding) plus n times the number of bytes needed to store each element. For example, the int array in Sample (PROGRAM 1.4.1) uses 4n + 24 bytes; the boolean arrays in CouponCollector (PROGRAM 1.4.2) use n + 24 bytes. Note that a boolean array consumes 1 byte of memory per element (wasting 7 of the 8 bits)—with some extra bookkeeping, you could get the job done using only 1 bit per element (see EXERCISE 4.1.26).

An array of objects is an array of references to the objects, so we need to account for both the memory for the references and the memory for the objects. For example, an array of n Charge objects consumes 48n + 24 bytes: the array overhead (24 bytes), the Charge references (8n bytes), and the memory for the Charge objects (40n bytes). This analysis assumes that all of the objects are different: it is possible that multiple array elements could refer to the same Charge object (aliasing).
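In code, the accounting for such an array looks like this (an illustrative fragment; the Charge constructor arguments shown are arbitrary):

Charge[] charges = new Charge[n];            // 24 + 8n bytes: array overhead plus n references
for (int i = 0; i < n; i++)
   charges[i] = new Charge(0.5, 0.5, 1.0);   // plus 40 bytes for each distinct Charge object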

The class Vector (PROGRAM 3.3.3) includes an array as an instance variable. On a typical system, a Vector object of length n requires 8n + 48 bytes: the object overhead (16 bytes), a reference to a double array (8 bytes), and the memory for the double array (8n + 24 bytes). Thus, each of the Vector objects in Body uses 64 bytes of memory (since n = 2).

String objects

We account for memory in a String object in the same way as for any other object. A String object of length n typically consumes 2n + 56 bytes: the object overhead (16 bytes), a reference to a char array (8 bytes), the memory for the char array (2n + 24 bytes), one int value (4 bytes), and padding (4 bytes). The int instance variable in String objects is a hash code that saves recomputation in certain circumstances that need not concern us now. If the number of characters in the string is not a multiple of 4, memory for the character array would be padded, to make the number of bytes for the char array a multiple of 8.

[Figure: typical memory requirements for Vector and String objects]
Two-dimensional arrays

As we saw in SECTION 1.4, a two-dimensional array in Java is an array of arrays. As a result, the two-dimensional array in Markov (PROGRAM 1.6.3) consumes 8n² + 32n + 24, or ~8n² bytes: the overhead for the array of arrays (24 bytes), the n references to the row arrays (8n bytes), and the n row arrays (8n + 24 bytes each). If the array elements are objects, then a similar accounting gives ~8n² bytes for the array of arrays filled with references to objects, to which we need to add the memory for the objects themselves.

These basic mechanisms are effective for estimating the memory usage of a great many programs, but there are numerous complicating factors that can make the task significantly more difficult. We have already noted the potential effect of aliasing. Moreover, memory consumption is a complicated dynamic process when function calls are involved because the system memory allocation mechanism plays a more important role, with more system dependencies. For example, when your program calls a method, the system allocates the memory needed for the method (for its local variables) from a special area of memory called the stack; when the method returns to the caller, the memory is returned to the stack. For this reason, creating arrays or other large objects in recursive programs is dangerous, since each recursive call implies significant memory usage. When you create an object with new, the system allocates the memory needed for the object from another special area of memory known as the heap, and you must remember that every object lives until no references to it remain, at which point a system process known as garbage collection can reclaim its memory for the heap. Such dynamics can make the task of precisely estimating memory usage of a program challenging.

    type     | bytes
 boolean[]   | n + 24 ~ n
 int[]       | 4n + 24 ~ 4n
 double[]    | 8n + 24 ~ 8n
 Charge[]    | 48n + 24 ~ 48n
 Vector      | 8n + 48 ~ 8n
 String      | 2n + 56 ~ 2n
 boolean[][] | n² + 32n + 24 ~ n²
 int[][]     | 4n² + 32n + 24 ~ 4n²
 double[][]  | 8n² + 32n + 24 ~ 8n²

Typical memory requirements for variable-length data types

[Figure: typical memory requirements for arrays of int values, double values, objects, and arrays]

Perspective

Good performance is important to the success of a program. An impossibly slow program is almost as useless as an incorrect one, so it is certainly worthwhile to pay attention to the cost at the outset, to have some idea of which sorts of problems you might feasibly address. In particular, it is always wise to have some idea of which code constitutes the inner loop of your programs.

Perhaps the most common mistake made in programming is to pay too much attention to performance characteristics. Your first priority is to make your code clear and correct. Modifying a program for the sole purpose of speeding it up is best left for experts. Indeed, doing so is often counterproductive, as it tends to create code that is complicated and difficult to understand. C. A. R. Hoare (the inventor of quicksort and a leading proponent of writing clear and correct code) once summarized this idea by saying that “premature optimization is the root of all evil,” to which Knuth added the qualifier “(or at least most of it) in programming.” Beyond that, improving the running time is not worthwhile if the available cost benefits are insignificant. For example, improving the running time of a program by a factor of 10 is inconsequential if the running time is only an instant. Even when a program takes a few minutes to run, the total time required to implement and debug an improved algorithm might be substantially more than the time required simply to run a slightly slower one—you may as well let the computer do the work. Worse, you might spend a considerable amount of time and effort implementing ideas that should improve a program but actually do not do so.

Perhaps the second most common mistake made in developing an algorithm is to ignore performance characteristics. Faster algorithms are often more complicated than brute-force solutions, so you might be tempted to accept a slower algorithm to avoid having to deal with more complicated code. However, you can sometimes reap huge savings with just a few lines of good code. Users of a surprising number of computer systems lose substantial time waiting for simple quadratic-time algorithms to finish solving a problem, even though linear or linearithmic algorithms are available that are only slightly more complicated and could therefore solve the problem in a fraction of the time. When we are dealing with huge problem sizes, we often have no choice but to seek better algorithms.

Improving a program to make it clearer, more efficient, and elegant should be your goal every time that you work on it. If you pay attention to the cost all the way through the development of a program, you will reap the benefits every time you use it.

Q&A

Q. How do I find out how long it takes to add or multiply two floating-point numbers on my system?

A. Run some experiments! The program TimePrimitives on the booksite uses Stopwatch to test the execution time of various arithmetic operations on primitive types. This technique measures the actual elapsed time as would be observed on a wall clock. If your system is not running many other applications, this can produce accurate results. You can find much more information about refining such experiments on the booksite.

Q. How much time does it take to call functions such as Math.sin(), Math.log(), and Math.sqrt() ?

A. Run some experiments! Stopwatch makes it easy to write programs such as TimePrimitives to answer questions of this sort for yourself, and you will be able to use your computer much more effectively if you get in the habit of doing so.

Q. How much time do string operations take?

A. Run some experiments! (Have you gotten the message yet?) The String data type is implemented to allow the methods length() and charAt() to run in constant time. Methods such as toLowerCase() and replace() take time linear in the length of the string. The methods compareTo(), equals(), startsWith(), and endsWith() take time proportional to the number of characters needed to resolve the answer (constant time in the best case and linear time in the worst case), but indexOf() can be slow. String concatenation and the substring() method take time proportional to the total number of characters in the result.

Q. Why does allocating an array of length n take time proportional to n?

A. In Java, array elements are automatically initialized to default values (0, false, or null). In principle, this could be a constant-time operation if the system would defer initialization of each element until just before the program accesses that element for the first time, but most Java implementations go through the whole array to initialize each element.

Q. How do I determine how much memory is available for my Java programs?

A. Java will tell you when it runs out of memory, so it is not difficult to run some experiments. For example, if you use PrimeSieve (PROGRAM 1.4.3) by typing

% java PrimeSieve 100000000

and get the result

50847534

but then type

% java PrimeSieve 1000000000

and get the result

Exception in thread "main"
java.lang.OutOfMemoryError: Java heap space

then you can figure that you have enough room for a boolean array of length 100 million but not for a boolean array of length 1 billion. You can increase the amount of memory allotted to Java with command-line options. The following command executes PrimeSieve with the command-line argument 1000000000 and the command-line option -Xmx1100m, which requests a maximum of 1,100 megabytes of memory (if available).

% java -Xmx1100m PrimeSieve 1000000000
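You can also ask the Java virtual machine for its configured limit directly. A minimal sketch (maxMemory() reports the maximum heap size the JVM will attempt to use, which only approximates what your program can actually allocate):

public class MaxMemory
{
   public static void main(String[] args)
   {
      long bytes = Runtime.getRuntime().maxMemory();   // maximum heap size, in bytes
      StdOut.println(bytes / 1048576 + " megabytes");
   }
}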

Q. What does it mean when someone says that the running time is O(n2)?

A. That is an example of a notation known as big-O notation. We write f(n) is O(g(n)) if there exist constants c and n₀ such that |f(n)| ≤ c|g(n)| for all n > n₀. In other words, the function f(n) is bounded above by g(n), up to constant factors and for sufficiently large values of n. For example, the function 30n² + 10n + 7 is O(n²). We say that the worst-case running time of an algorithm is O(g(n)) if the running time as a function of the input size n is O(g(n)) for all possible inputs. Big-O notation and worst-case running times are widely used by theoretical computer scientists to prove theorems about algorithms, so you are sure to see this notation if you take a course in algorithms and data structures.

Q. So can I use the fact that the worst-case running time of an algorithm is O(n³) or O(n²) to predict performance?

A. Not necessarily, because the actual running time might be much less. For example, the function 30n² + 10n + 7 is O(n²), but it is also O(n³) and O(n¹⁰) because big-O notation provides only an upper bound. Moreover, even if there is some family of inputs for which the running time is proportional to the given function, perhaps these inputs are not encountered in practice. Consequently, you should not use big-O notation to predict performance. The tilde notation and order-of-growth classifications that we use are more precise than big-O notation because they provide matching upper and lower bounds on the growth of the function. Many programmers incorrectly use big-O notation to indicate matching upper and lower bounds.

Exercises

4.1.1 Implement the static method printTriples() for ThreeSum (PROGRAM 4.1.1), which prints to standard output all of the triples that sum to zero.

4.1.2 Modify ThreeSum to take an integer command-line argument target and find a triple of numbers on standard input whose sum is closest to target.

4.1.3 Write a program FourSum that reads long integers from standard input, and counts the number of 4-tuples that sum to zero. Use a quadruple nested loop. What is the order of growth of the running time of your program? Estimate the largest input size that your program can handle in an hour. Then, run your program to validate your hypothesis.

4.1.4 Prove by induction that the number of (unordered) pairs of integers between 0 and n–1 is n (n–1) / 2, and then prove by induction that the number of (unordered) triples of integers between 0 and n–1 is n (n–1)(n–2) / 6.

Answer for pairs: The formula is correct for n = 1, since there are 0 pairs. For n > 1, count all the pairs that do not include n–1, which is (n–1)(n–2) / 2 by the inductive hypothesis, and all the pairs that do include n–1, which is n–1, to get the total

(n–1)(n–2) / 2 + (n–1) = n (n–1) / 2

Answer for triples: The formula is correct for n = 2. For n > 2, count all the triples that do not include n–1, which is (n–1)(n–2)(n–3) / 6 by the inductive hypothesis, and all the triples that do include n–1, which is (n–1)(n–2) / 2, to get the total

(n–1)(n–2)(n–3) / 6 + (n–1)(n–2) / 2 = n (n–1)(n–2) / 6

4.1.5 Show by approximating with integrals that the number of distinct triples of integers between 0 and n is about n³/6.

Answer: Σᵢ₌₀ⁿ Σⱼ₌₀ⁱ Σₖ₌₀ʲ 1 ≈ ∫₀ⁿ ∫₀ⁱ ∫₀ʲ dk dj di = ∫₀ⁿ ∫₀ⁱ j dj di = ∫₀ⁿ (i²/2) di = n³/6

4.1.6 Show that a log–log plot of the function cnᵇ has slope b and y-intercept log c. What are the slope and y-intercept for 4n³ (log n)²?

4.1.7 What is the value of the variable count, as a function of n, after running the following code fragment?

long count = 0;
for (int i = 0; i < n; i++)
   for (int j = i + 1; j < n; j++)
      for (int k = j + 1; k < n; k++)
         count++;

Answer: n (n–1)(n–2) / 6

4.1.8 Use tilde notation to simplify each of the following formulas, and give the order of growth of each:

a. n (n – 1) (n – 2) (n – 3) / 24

b. (n – 2) (lg n – 2) (lg n + 2)

c. n (n + 1) – n²

d. n (n + 1) / 2 + n lg n

e. ln((n – 1)(n – 2)(n – 3))²

4.1.9 Determine the order of growth of the running time of this statement in ThreeSum as a function of the number of integers n on standard input:

int[] a = StdIn.readAllInts();

Answer: Linear. The bottlenecks are the implicit array initialization and the implicit input loop. Depending on your system, however, the cost of an input loop like this might dominate in a linearithmic-time or even a quadratic-time program unless the input size is sufficiently large.

4.1.10 Determine whether the following code fragment takes linear time, quadratic time, or cubic time (as a function of n).

for (int i = 0; i < n; i++)
   for (int j = 0; j < n; j++)
      if (i == j) c[i][j] = 1.0;
      else        c[i][j] = 0.0;

4.1.11 Suppose the running times of an algorithm for inputs of size 1,000, 2,000, 3,000, and 4,000 are 5 seconds, 20 seconds, 45 seconds, and 80 seconds, respectively. Estimate how long it will take to solve a problem of size 5,000. Is the algorithm linear, linearithmic, quadratic, cubic, or exponential?

4.1.12 Which would you prefer: an algorithm whose order of growth of running time is quadratic, linearithmic, or linear?

Answer: While it is tempting to make a quick decision based on the order of growth, it is very easy to be misled by doing so. You need to have some idea of the problem size and of the relative value of the leading coefficients of the running times. For example, suppose that the running times are n² seconds, 100 n log₂ n seconds, and 10,000 n seconds. The quadratic algorithm will be fastest for n up to about 1,000, and the linear algorithm will never be faster than the linearithmic one (n would have to be greater than 2¹⁰⁰, far too large to bother considering).

4.1.13 Apply the scientific method to develop and validate a hypothesis about the order of growth of the running time of the following code fragment, as a function of the argument n.

public static int f(int n)
{
   if (n == 0) return 1;
   return f(n-1) + f(n-1);
}

4.1.14 Apply the scientific method to develop and validate a hypothesis about the order of growth of the running time of the collect() method in Coupon (PROGRAM 2.1.3), as a function of the argument n. Note: Doubling is not effective for distinguishing between the linear and linearithmic hypotheses—you might try squaring the size of the input.

4.1.15 Apply the scientific method to develop and validate a hypothesis about the order of growth of the running time of Markov (PROGRAM 1.6.3), as a function of the command-line arguments trials and n.

4.1.16 Apply the scientific method to develop and validate a hypothesis about the order of growth of the running time of each of the following two code fragments as a function of n.

String s = "";
for (int i = 0; i < n; i++)
   if (StdRandom.bernoulli(0.5)) s += "0";
   else                          s += "1";

StringBuilder sb = new StringBuilder();
for (int i = 0; i < n; i++)
   if (StdRandom.bernoulli(0.5)) sb.append("0");
   else                          sb.append("1");
String s = sb.toString();

4.1.17 Each of the four Java functions given here returns a string of length n whose characters are all x. Determine the order of growth of the running time of each function. Recall that concatenating two strings in Java takes time proportional to the length of the resulting string.

public static String method1(int n)
{
   if (n == 0) return "";
   String temp = method1(n / 2);
   if (n % 2 == 0) return temp + temp;
   else            return temp + temp + "x";
}

public static String method2(int n)
{
   String s = "";
   for (int i = 0; i < n; i++)
      s = s + "x";
   return s;
}

public static String method3(int n)
{
   if (n == 0) return "";
   if (n == 1) return "x";
   return method3(n/2) + method3(n - n/2);
}

public static String method4(int n)
{
   char[] temp = new char[n];
   for (int i = 0; i < n; i++)
       temp[i] = 'x';
   return new String(temp);
}

4.1.18 The following code fragment (adapted from a Java programming book) creates a random permutation of the integers from 0 to n–1. Determine the order of growth of its running time as a function of n. Compare its order of growth with the shuffling code in SECTION 1.4.

int[] a = new int[n];
boolean[] taken = new boolean[n];
int count = 0;
while (count < n)
{
   int r = StdRandom.uniform(n);
   if (!taken[r])
   {
       a[r] = count;
       taken[r] = true;
       count++;
   }
}

4.1.19 What is the order of growth of the running time of the following two functions? Each function takes a string as an argument and returns the string reversed.

public static String reverse1(String s)
{
   int n = s.length();
   String reverse = "";
   for (int i = 0; i < n; i++)
      reverse = s.charAt(i) + reverse;
   return reverse;
}

public static String reverse2(String s)
{
   int n = s.length();
   if (n <= 1) return s;
   String left = s.substring(0, n/2);
   String right = s.substring(n/2, n);
   return reverse2(right) + reverse2(left);
}

4.1.20 Give a linear-time algorithm for reversing a string.

Answer:

public static String reverse(String s)
{
   int n = s.length();
   char[] a = new char[n];
   for (int i = 0; i < n; i++)
      a[i] = s.charAt(n-i-1);
   return new String(a);
}

4.1.21 Write a program MooresLaw that takes a command-line argument n and outputs the increase in processor speed over a decade if processor speeds double every n months. How much will processor speed increase over the next decade if speeds double every n = 15 months? 24 months?

4.1.22 Using the 64-bit memory model in the text, give the memory usage for an object of each of the following data types from CHAPTER 3:

a. Stopwatch

b. Turtle

c. Vector

d. Body

e. Universe

4.1.23 Estimate, as a function of the grid size n, the amount of space used by PercolationVisualizer (PROGRAM 2.4.3) with the vertical percolation detection (PROGRAM 2.4.2). Extra credit: Answer the same question for the case where the recursive percolation detection method (PROGRAM 2.4.5) is used.

4.1.24 Estimate the size of the biggest two-dimensional array of int values that your computer can hold, and then try to allocate such an array.

4.1.25 Estimate, as a function of the number of documents n and the dimension d, the amount of memory used by CompareDocuments (PROGRAM 3.3.5).

4.1.26 Write a version of PrimeSieve (PROGRAM 1.4.3) that uses a byte array instead of a boolean array and uses all the bits in each byte, thereby increasing the largest value of n that it can handle by a factor of 8.

4.1.27 The following table gives running times for three programs for various values of n. Fill in the blanks with estimates that you think are reasonable on the basis of the information given.

program |    1,000     |    10,000    |   100,000   | 1,000,000
   A    | 0.001 second | 0.012 second | 0.16 second | ? seconds
   B    |   1 minute   |  10 minutes  |  1.7 hours  |  ? hours
   C    |   1 second   | 1.7 minutes  |  2.8 hours  |   ? days

Give hypotheses for the order of growth of the running time of each program.

Creative Exercises

4.1.28 Three-sum analysis. Calculate the probability that no triple among n random 32-bit integers sums to 0. Extra credit: Give an approximate formula for the expected number of such triples (as a function of n), and run experiments to validate your estimate.

4.1.29 Closest pair. Design a quadratic-time algorithm that, given an array of integers, finds a pair that are closest to each other. (In the next section you will be asked to find a linearithmic algorithm for the problem.)

4.1.30 The “beck” exploit. A popular web server supports a function named no2slash() whose purpose is to collapse multiple / characters. For example, the string /d1///d2////d3/test.html collapses to /d1/d2/d3/test.html. The original algorithm was to repeatedly search for a / and copy the remainder of the string:

int n = name.length();
int i = 1;
while (i < n)
{
   if ((c[i-1] == '/') && (c[i] == '/'))
   {
      for (int j = i+1; j < n; j++)
         c[j-1] = c[j];
      n--;
   }
   else i++;
}

Unfortunately, this code can take quadratic time (for example, if the string consists of the / character repeated n times). By sending multiple simultaneous requests with large numbers of / characters, a hacker could deluge the server and starve other processes for CPU time, thereby creating a denial-of-service attack. Develop a version of no2slash() that runs in linear time and does not allow for this type of attack.

4.1.31 Subset sum. Write a program SubsetSum that reads long integers from standard input, and counts the number of subsets of those integers that sum to exactly zero. Give the order of growth of the running time of your program.

4.1.32 Young tableaux. Suppose you have an n-by-n array of integers a[][] such that, for all i and j, a[i][j] < a[i+1][j] and a[i][j] < a[i][j+1], as in the following 5-by-5 array.

 5  23  54  67  89
 6  69  73  74  90
10  71  83  84  91
60  73  84  86  92
89  91  92  93  94

A two-dimensional array with this property is known as a Young tableau. Write a function that takes as arguments an n-by-n Young tableau and an integer, and determines whether the integer is in the Young tableau. The order of growth of the running time of your function should be linear in n.

4.1.33 Array rotation. Given an array of n elements, give a linear-time algorithm to rotate the array k positions. That is, if the array contains a₀, a₁, ..., aₙ₋₁, the rotated array is aₖ, aₖ₊₁, ..., aₙ₋₁, a₀, ..., aₖ₋₁. Use at most a constant amount of extra memory. Hint: Reverse three subarrays.

4.1.34 Finding a repeated integer. (a) Given an array of n integers from 1 to n with one value repeated twice and one missing, give an algorithm that finds the missing integer, in linear time and constant extra memory. Integer overflow is not allowed. (b) Given a read-only array of n integers, where each value from 1 to n–1 occurs once and one occurs twice, give an algorithm that finds the duplicated value, in linear time and constant extra memory. (c) Given a read-only array of n integers with values between 1 and n–1, give an algorithm that finds a duplicated value, in linear time and constant extra memory.

4.1.35 Factorial. Design a fast algorithm to compute n! for large values of n, using Java’s BigInteger class. Use your program to compute the longest run of consecutive 9s in 1000000!. Develop and validate a hypothesis for the order of growth of the running time of your algorithm.

4.1.36 Maximum sum. Design a linear-time algorithm that finds a contiguous subarray of length at most m in an array of n long integers that has the highest sum among all such subarrays. Implement your algorithm, and confirm that the order of growth of its running time is linear.

4.1.37 Maximum average. Write a program that finds a contiguous subarray of length at most m in an array of n long integers that has the highest average value among all such subarrays, by trying all subarrays. Use the scientific method to confirm that the order of growth of the running time of your program is mn². Next, write a program that solves the problem by first computing the quantity prefix[i] = a[0] + ... + a[i] for each i, then computing the average in the interval from a[i] to a[j] with the expression (prefix[j] - prefix[i]) / (j - i + 1). Use the scientific method to confirm that this method reduces the order of growth by a factor of n.

4.1.38 Pattern matching. Given an n-by-n array of black (1) and white (0) pixels, design a linear-time algorithm that finds the largest square subarray that contains no white pixels. In the following example, the largest such subarray is the 3-by-3 subarray highlighted in blue.

1 0 1 1 1 0 0 0
0 0 0 1 0 1 0 0
0 0 1 1 1 0 0 0
0 0 1 1 1 0 1 0
0 0 1 1 1 1 1 1
0 1 0 1 1 1 1 0
0 1 0 1 1 0 1 0
0 0 0 1 1 1 1 0

Implement your algorithm and confirm that the order of growth of its running time is linear in the number of pixels. Extra credit: Design an algorithm to find the largest rectangular black subarray.

4.1.39 Sub-exponential function. Find a function whose order of growth is larger than any polynomial function, but smaller than any exponential function. Extra credit: Find a program whose running time has that order of growth.

4.2 Sorting and Searching

The sorting problem is to rearrange an array of items into ascending order. It is a familiar and critical task in many computational applications: the songs in your music library are in alphabetical order, your email messages are displayed in reverse order of the time received, and so forth. Keeping things in some kind of order is a natural desire. One reason that it is so useful is that it is much easier to search for something in a sorted array than an unsorted one. This need is particularly acute in computing, where the array to search can be huge and an efficient search can be an important factor in a problem’s solution.

Sorting and searching are important for commercial applications (businesses keep customer files in order) and scientific applications (to organize data and computation), and have all manner of applications in fields that may appear to have little to do with keeping things in order, including data compression, computer graphics, computational biology, numerical computing, combinatorial optimization, cryptography, and many others.

We use these fundamental problems to illustrate the idea that efficient algorithms are one key to effective solutions for computational problems. Indeed, many different sorting and searching methods have been proposed. Which should we use to address a given task? This question is important because different algorithms can have vastly differing performance characteristics, enough to make the difference between success in a practical situation and not coming close to doing so, even on the fastest available computer.

In this section, we will consider in detail two classical algorithms for sorting and searching—binary search and mergesort—along with several applications in which their efficiency plays a critical role. With these examples, you will be convinced not just of the utility of these methods, but also of the need to pay attention to the cost whenever you address a problem that requires a significant amount of computation.

Binary search

The game of “twenty questions” (see PROGRAM 1.5.2) provides an important and useful lesson in the design of efficient algorithms. The setup is simple: your task is to guess the value of a secret number that is one of the n integers between 0 and n–1. Each time that you make a guess, you are told whether your guess is equal to the secret number, too high, or too low. For reasons that will become clear later, we begin by slightly modifying the game to make the questions of the form “is the number greater than or equal to x?” with true or false answers, and assume for the moment that n is a power of 2.

An illustration depicts the process of finding a hidden number with binary search.

As we discussed in SECTION 1.5, an effective strategy for the problem is to maintain an interval that contains the secret number. In each step, we ask a question that enables us to shrink the size of the interval in half. Specifically, we guess the number in the middle of the interval, and, depending on the answer, discard the half of the interval that cannot contain the secret number. More precisely, we use a half-open interval, which contains the left endpoint but not the right one. We use the notation [lo, hi) to denote all of the integers greater than or equal to lo and less than (but not equal to) hi. We start with lo = 0 and hi = n and use the following recursive strategy:

Base case: If hi – lo equals 1, then the secret number is lo.

Reduction step: Otherwise, ask whether the secret number is greater than or equal to the number mid = lo + (hi – lo)/2. If so, look for the number in [mid, hi); if not, look for the number in [lo, mid).

The function binarySearch() in Questions (PROGRAM 4.2.1) is an implementation of this strategy. It is an example of the general problem-solving technique known as binary search, which has many applications.


Program 4.2.1 Binary search (20 questions)

public class Questions
{
   public static int binarySearch(int lo, int hi)
   {  // Find number in [lo, hi)
      if (hi - lo == 1) return lo;
      int mid = lo + (hi - lo) / 2;
      StdOut.print("Greater than or equal to " + mid + "?  ");
      if (StdIn.readBoolean())
         return binarySearch(mid, hi);
      else
         return binarySearch(lo, mid);
   }

   public static void main(String[] args)
   {  // Play twenty questions.
      int k = Integer.parseInt(args[0]);
      int n = (int) Math.pow(2, k);
      StdOut.print("Think of a number ");
      StdOut.println("between 0 and " + (n-1));
      int guess = binarySearch(0, n);
      StdOut.println("Your number is " + guess);
   }
}

  lo   | smallest possible value
hi - 1 | largest possible value
  mid  | midpoint
   k   | number of questions
   n   | number of possible values

This code uses binary search to play the same game as PROGRAM 1.5.2, but with the roles reversed: you choose the secret number and the program guesses its value. It takes an integer command-line argument k, asks you to think of a number between 0 and n-1, where n = 2k, and always guesses the answer with k questions.


% java Questions 7
Think of a number between 0 and 127
Greater than or equal to 64?  true
Greater than or equal to 96?  false
Greater than or equal to 80?  false
Greater than or equal to 72?  true
Greater than or equal to 76?  true
Greater than or equal to 78?  false
Greater than or equal to 77?  true
Your number is 77

Correctness proof

First, we have to convince ourselves that the algorithm is correct: that it always leads us to the secret number. We do so by establishing the following facts:

• The interval always contains the secret number.

• The interval sizes are the powers of 2, decreasing from n.

The first of these facts is enforced by the code; the second follows by noting that if (hi – lo) is a power of 2, then (hi – lo) / 2 is the next smaller power of 2 and also the size of both halved intervals [lo, mid) and [mid, hi). These facts are the basis of an induction proof that the algorithm operates as intended. Eventually, the interval size becomes 1, so we are guaranteed to find the number.

Analysis of running time

Let n be the number of possible values. In PROGRAM 4.2.1, we have n = 2ᵏ, where k = lg n. Now, let T(n) be the number of questions. The recursive strategy implies that T(n) must satisfy the following recurrence relation:

T(n) = T(n/2) + 1

with T(1) = 0. Substituting 2ᵏ for n, we can telescope the recurrence (apply it to itself) to immediately get a closed-form expression:

T(2ᵏ) = T(2ᵏ⁻¹) + 1 = T(2ᵏ⁻²) + 2 = ... = T(1) + k = k

Substituting back n for 2ᵏ (and lg n for k) gives the result

T(n) = lg n

This justifies our hypothesis that the running time of binary search is logarithmic. Note: Binary search and Questions.binarySearch() work even when n is not a power of 2—we assumed that n is a power of 2 to simplify our proof (see EXERCISE 4.2.1).

Linear–logarithmic chasm

An alternative to using binary search is to guess 0, then 1, then 2, then 3, and so forth, until hitting the secret number. We refer to this algorithm as sequential search. It is an example of a brute-force algorithm, which seems to get the job done, but without much regard to the cost. The running time of sequential search is sensitive to the secret number: sequential search takes only 1 step if the secret number is 0, but it takes n steps if the secret number is n–1. If the secret number is chosen at random, the expected number of steps is n/2. Meanwhile, binary search is guaranteed to use no more than lg n steps. As you will learn to appreciate, the difference between n and lg n makes a huge difference in practical applications. Understanding the enormity of this difference is a critical step to understanding the importance of algorithm design and analysis. In the present context, suppose that it takes 1 second to process a guess. With binary search, you can guess the value of any secret number less than 1 million in 20 seconds; with the brute-force sequential search algorithm, it might take 1 million seconds, which is more than 1 week. We will see many examples where such a cost difference is the determining factor in whether a practical problem can be feasibly solved.
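To make the chasm concrete, here is a minimal sketch that counts the guesses made by the brute-force strategy (the class and method names are ours, not from the book's programs, and we use plain System.out rather than StdOut so that the fragment stands alone):

public class SequentialGuess
{
   // Count the guesses made by the brute-force strategy: 0, 1, 2, ...
   public static int countGuesses(int n, int secret)
   {
      for (int guess = 0; guess < n; guess++)
         if (guess == secret) return guess + 1;  // guesses made so far
      return n;  // not reached when 0 <= secret < n
   }

   public static void main(String[] args)
   {
      System.out.println(countGuesses(128, 77));  // prints 78
      // Binary search needs at most lg 128 = 7 questions for the same task.
   }
}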

Binary representation

If you refer back to PROGRAM 1.3.7, you will immediately recognize that binary search is nearly the same computation as converting a number to binary! Each guess determines one bit of the answer. In our example, the information that the number is between 0 and 127 says that the number of bits in its binary representation is 7, the answer to the first question (is the number greater than or equal to 64?) tells us the value of the leading bit, the answer to the second question tells us the value of the next bit, and so forth. For example, if the number is 77, the sequence of answers yes no no yes yes no yes immediately yields 1001101, the binary representation of 77. Thinking in terms of the binary representation is another way to understand the linear–logarithmic chasm: when we have a program whose running time is linear in a parameter n, its running time is proportional to the value of n, whereas a logarithmic running time is proportional to the number of digits in n. In a context that is perhaps slightly more familiar to you, think about the following question, which illustrates the same point: would you rather earn $6 or a six-figure salary?

A graph depicts binary search (bisection) used to invert an increasing function.
Inverting a function

As an example of the utility of binary search in scientific computing, we consider the problem of computing the inverse of an increasing function f (x). Given a value y, our task is to find a value x such that f (x) = y. In this situation, we use real numbers as the endpoints of our interval, not integers, but we use the same essential algorithm as for guessing a secret number: we halve the size of the interval at each step, keeping x in the interval, until the interval is sufficiently small that we know the value of x to within a desired precision δ. We start with an interval (lo, hi) known to contain x and use the following recursive strategy:

• Compute mid = lo + (hi – lo)/2.

Base case: If hi – lo is less than δ, then return mid as an estimate of x.

Reduction step: Otherwise, test whether f (mid) > y. If so, look for x in (lo, mid); if not, look for x in (mid, hi).


Program 4.2.2 Bisection search

public static double inverseCDF(double y)
{  return bisectionSearch(y, 0.00000001, -8, 8);  }

private static double bisectionSearch(double y, double delta,
                                      double lo, double hi)
{  // Compute x with cdf(x) = y.
   double mid = lo + (hi - lo)/2;
   if (hi - lo < delta) return mid;
   if (cdf(mid) > y)
      return bisectionSearch(y, delta, lo, mid);
   else
      return bisectionSearch(y, delta, mid, hi);
}

  y   | argument
delta | desired precision
  lo  | smallest possible value
 mid  | midpoint
  hi  | largest possible value

This implementation of inverseCDF() for our Gaussian library (PROGRAM 2.1.2) uses bisection search to compute a point x for which Φ(x) is equal to a given value y, within a given precision delta. It is a recursive function that halves the x-interval containing the desired point, evaluates the function at the midpoint of the interval, and takes advantage of the fact that Φ is increasing to decide whether the desired point is in the left half or the right half, continuing until the interval size is less than the given precision.


To fix ideas, PROGRAM 4.2.2 computes the inverse of the Gaussian cumulative distribution function Φ, which we considered in Gaussian (PROGRAM 2.1.2).

The key to this method is the idea that the function is increasing—for any values a and b, knowing that f (a) < f (b) tells us that a < b, and vice versa. The recursive step just applies this knowledge: knowing that y = f (x) < f (mid) tells us that x < mid, so that x must be in the interval (lo, mid), and knowing that y = f (x) > f (mid) tells us that x > mid, so that x must be in the interval (mid, hi). You can think of the algorithm as determining which of the n = (hi – lo) / δ tiny intervals of size δ within (lo, hi) contains x, with running time logarithmic in n. As with number conversion for integers, we determine one bit of x for each iteration. In this context, binary search is often called bisection search because we bisect the interval at each stage.
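If you want to experiment with this idea without the Gaussian library, the following self-contained sketch inverts the increasing function f(x) = x³ with the same bisection strategy (all names here are ours; f() is merely a stand-in for cdf()):

public class BisectionSketch
{
   private static double f(double x)  // any increasing function will do
   {  return x * x * x;  }

   // Compute x in (lo, hi) with f(x) = y, to within precision delta.
   public static double inverse(double y, double delta,
                                double lo, double hi)
   {
      double mid = lo + (hi - lo) / 2;
      if (hi - lo < delta) return mid;
      if (f(mid) > y) return inverse(y, delta, lo, mid);
      else            return inverse(y, delta, mid, hi);
   }

   public static void main(String[] args)
   {  // The cube root of 10 is 2.15443469...
      System.out.println(inverse(10.0, 1e-9, 0.0, 8.0));
   }
}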

A representation of an array depicts the binary search in a sorted array (one step).
Binary search in a sorted array

One of the most important uses of binary search is to find a piece of information using a key to guide the search. This usage is ubiquitous in modern computing, to the extent that printed artifacts that depend on the same concepts are now obsolete. For example, during the last few centuries, people would use a publication known as a dictionary to look up the definition of a word, and during much of the last century people would use a publication known as a phone book to look up a person’s phone number. In both cases, the basic mechanism is the same: elements appear in order, sorted by a key that identifies them (the word in the case of the dictionary, and the person’s name in the case of the phone book, sorted in alphabetical order in both cases). You probably use your computer to reference such information, but think about how you would look up a word in a dictionary. With sequential search, you would start at the beginning, examine each element one at a time, and continue until you find the word. No one uses that algorithm: instead, you open the book to some interior page and look for the word on that page. If it is there, you are done; otherwise, you eliminate either the part of the book before the current page or the part of the book after the current page from consideration, and then repeat. We now recognize this method as binary search (PROGRAM 4.2.3).


Program 4.2.3 Binary search (sorted array)

public class BinarySearch
{
   public static int search(String key, String[] a)
   {  return search(key, a, 0, a.length);  }

   public static int search(String key, String[] a, int lo, int hi)
   {  // Search for key in a[lo, hi).
      if (hi <= lo) return -1;
      int mid = lo + (hi - lo) / 2;
      int cmp = a[mid].compareTo(key);
      if      (cmp > 0) return search(key, a, lo, mid);
      else if (cmp < 0) return search(key, a, mid+1, hi);
      else              return mid;
   }

   public static void main(String[] args)
   {  // Print keys from standard input that
      // do not appear in file args[0].
      In in = new In(args[0]);
      String[] a = in.readAllStrings();
      while (!StdIn.isEmpty())
      {
         String key = StdIn.readString();
         if (search(key, a) < 0) StdOut.println(key);
      }
   }
}

   key    | search key
a[lo, hi) | sorted subarray
   lo     | smallest index
  mid     | middle index
   hi     | largest index

The search() method in this class uses binary search to return the index of a string key in a sorted array (or -1 if key is not in the array). The test client is an exception filter that reads a (sorted) whitelist from the file given as a command-line argument and prints the words from standard input that are not in the whitelist.


% more emails.txt
bob@office
carl@beach
marvin@spam
bob@office
bob@office
mallory@spam
dave@boat
eve@airport
alice@home

% more whitelist.txt
alice@home
bob@office
carl@beach
dave@boat

% java BinarySearch whitelist.txt < emails.txt
marvin@spam
mallory@spam
eve@airport

Exception filter

We will consider in SECTION 4.3 the details of implementing the kind of computer program that you use in place of a dictionary or a phone book. PROGRAM 4.2.3 uses binary search to solve the simpler existence problem: does a given key appear in a sorted array of keys? For example, when checking the spelling of a word, you need only know whether your word is in the dictionary and are not interested in the definition. In a computer search, we keep the information in an array, sorted in order of the key (for some applications, the information comes in sorted order; for others, we have to sort it first, using one of the algorithms discussed later in this section).

The binary search in PROGRAM 4.2.3 differs from our other applications in two details. First, the array length n need not be a power of 2. Second, it has to allow for the possibility that the key sought is not in the array. Coding binary search to account for these details requires some care, as discussed in this section’s Q&A and exercises.

The test client in PROGRAM 4.2.3 is known as an exception filter: it reads in a sorted list of strings from a file (which we refer to as the whitelist) and an arbitrary sequence of strings from standard input, and prints those in the sequence that do not appear in the whitelist. Exception filters have many direct applications. For example, if the whitelist is the words from a dictionary and standard input is a text document, the exception filter prints the misspelled words. Another example arises in web applications: your email application might use an exception filter to reject any email messages that are not on a whitelist that contains the email addresses of your friends. Or, your operating system might have an exception filter that disallows network connections to your computer from any device having an IP address that is not on a preapproved whitelist.

Weighing an object

Binary search has been known since antiquity, perhaps partly because of the following application. Suppose that you need to determine the weight of a given object using only a balancing scale and some weights. With binary search, you can do so with weights that are powers of 2 (you need only one weight of each type). Put the object on the right side of the balance and try the weights in decreasing order on the left side. If a weight causes the balance to tilt to the left, remove it; otherwise, leave it. This process is precisely analogous to determining the binary representation of a number by subtracting decreasing powers of 2, as in PROGRAM 1.3.7.
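The following sketch simulates this process for the 7-bit example used earlier in this section (the class name is ours; "leaving a weight on" corresponds to printing a 1 bit):

public class Weighing
{
   public static void main(String[] args)
   {
      int object = 77;       // weight of the object on the right side
      int remaining = object;
      for (int w = 64; w >= 1; w /= 2)  // try weights 64, 32, ..., 1
      {
         if (remaining >= w)
         {  System.out.print(1); remaining -= w;  }  // leave weight w on
         else
            System.out.print(0);   // balance tilts left: remove weight w
      }
      System.out.println();  // prints 1001101, the binary representation of 77
   }
}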

An illustration with three sections depicts the three applications of binary search.

Fast algorithms are an essential element of the modern world, and binary search is a prototypical example that illustrates the impact of fast algorithms. With a few quick calculations, you can convince yourself that problems like finding all the misspelled words in a document or protecting your computer from intruders using an exception filter require a fast algorithm like binary search. Take the time to do so. You can find the exceptions in a million-element document to a million-element whitelist in an instant, whereas that task might take days or weeks using a brute-force algorithm. Nowadays, web companies routinely provide services that are based on using binary search billions of times in sorted arrays with billions of elements—without a fast algorithm like binary search, we could not contemplate such services.

Whether it be extensive experimental data or detailed representations of some aspect of the physical world, modern scientists are awash in data. Binary search and fast algorithms like it are essential components of scientific progress. Using a brute-force algorithm is precisely analogous to searching for a word in a dictionary by starting at the first page and turning pages one by one. With a fast algorithm, you can search among billions of pieces of information in an instant. Taking the time to identify and use a fast algorithm for search certainly can make the difference between being able to solve a problem easily and spending substantial resources trying to do so (and failing).

Insertion sort

Binary search requires that the data be sorted, and sorting has many other direct applications, so we now turn to sorting algorithms. We first consider a brute-force method, then a sophisticated method that we can use for huge data sets.

The brute-force algorithm is known as insertion sort and is based on a simple method that people often use to arrange hands of playing cards. Consider the cards one at a time and insert each into its proper place among those already considered (keeping them sorted). The following code mimics this process in a Java method that rearranges the strings in an array so that they are in ascending order:

public static void sort(String[] a)
{
   int n = a.length;
   for (int i = 1; i < n; i++)
      for (int j = i; j > 0; j--)
         if (a[j-1].compareTo(a[j]) > 0)
              exchange(a, j-1, j);
         else break;
}

At the beginning of each iteration of the outer for loop, the first i elements in the array are in sorted order; the inner for loop moves a[i] into its proper position in the array, as in the following example when i is 6:

i  j   a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]

6  6   and   had   him   his   was   you   the   but
6  5   and   had   him   his   was   the   you   but
6  4   and   had   him   his   the   was   you   but
       and   had   him   his   the   was   you   but

Inserting a[6] into position by exchanging it with larger values to its left

Specifically, a[i] is put in its place among the sorted elements to its left by exchanging it (using the exchange() method that we first encountered in SECTION 2.1) with each larger value to its left, moving from right to left, until it reaches its proper position. The black elements in the three bottom rows in this trace are the ones that are compared with a[i].

The insertion process just described is executed, first with i equal to 1, then 2, then 3, and so forth, as illustrated in the following trace.

i  j   a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]

       was   had   him   and   you   his   the   but
1  0   had   was   him   and   you   his   the   but
2  1   had   him   was   and   you   his   the   but
3  0   and   had   him   was   you   his   the   but
4  4   and   had   him   was   you   his   the   but
5  3   and   had   him   his   was   you   the   but
6  4   and   had   him   his   the   was   you   but
7  1   and   but   had   him   his   the   was   you
       and   but   had   him   his   the   was   you

Inserting a[1] through a[n-1] into position (insertion sort)

Row i of the trace displays the contents of the array when the outer for loop completes, along with the value of j at that time. The highlighted string is the one that was in a[i] at the beginning of the loop, and the strings printed in black are the other ones that were involved in exchanges and moved to the right one position within the loop. Since the elements a[0] through a[i-1] are in sorted order when the loop completes for each value of i, they are, in particular, in sorted order the final time the loop completes, when the value of i is a.length. This discussion again illustrates the first thing that you need to do when studying or developing a new algorithm: convince yourself that it is correct. Doing so provides the basic understanding that you need to study its performance and use it effectively.

Analysis of running time

The inner loop of the insertion sort code is within a doubly nested for loop, which suggests that the running time is quadratic, but we cannot immediately draw this conclusion because of the break statement. For example, in the best case, when the input array is already in sorted order, the inner for loop amounts to nothing more than a single compare (to learn that a[j-1] is less than or equal to a[j] for each j from 1 to n-1) and the break, so the total running time is linear. In contrast, if the input array is in reverse-sorted order, the inner loop fully completes without a break, so the frequency of execution of the instructions in the inner loop is 1 + 2 + ... + (n–1) ~ n²/2 and the running time is quadratic. To understand the performance of insertion sort for randomly ordered input arrays, take a careful look at the trace: it is an n-by-n array with one black element corresponding to each exchange. That is, the number of black elements is the frequency of execution of instructions in the inner loop. We expect that each new element to be inserted is equally likely to fall into any position, so, on average, that element will move halfway to the left. Thus, on average, we expect only about half of the elements below the diagonal (about n²/4 in total) to be black. This leads immediately to the hypothesis that the expected running time of insertion sort for a randomly ordered input array is quadratic.
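You can check the n²/4 hypothesis directly by counting exchanges (one per black element in the trace). The following sketch is one way to run that experiment; it is ours, not one of the book's programs, and uses java.util.Random and System.out so that it stands alone:

import java.util.Random;

public class ExchangeCount
{
   public static void main(String[] args)
   {
      int n = Integer.parseInt(args[0]);
      double[] a = new double[n];
      Random random = new Random();
      for (int i = 0; i < n; i++)
         a[i] = random.nextDouble();   // randomly ordered input array

      long exchanges = 0;   // frequency of execution of the inner loop
      for (int i = 1; i < n; i++)
         for (int j = i; j > 0; j--)
            if (a[j-1] > a[j])
            {  double t = a[j-1]; a[j-1] = a[j]; a[j] = t; exchanges++;  }
            else break;

      System.out.println(exchanges + " exchanges; n*n/4 = " + ((long) n * n / 4));
   }
}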

The analysis of insertion sort is shown.
Sorting other types of data

We want to be able to sort all types of data, not just strings. In a scientific application, we might wish to sort experimental results by numeric values; in a commercial application, we might wish to use monetary amounts, times, or dates; in systems software, we might wish to use IP addresses or process IDs. The idea of sorting in each of these situations is intuitive, but implementing a sort method that works in all of them is a prime example of the need for a functional abstraction mechanism like the one provided by Java interfaces. For sorting objects in an array, we need only assume that we can compare two elements to see whether the first is bigger than, smaller than, or equal to the second. Java provides the java.util.Comparable interface for precisely this purpose.

public interface Comparable<Key>

   int compareTo(Key b)  |  compare this object with b for order

API for Java’s java.util.Comparable interface

A class that implements the Comparable interface promises to implement a method compareTo() for objects of its type so that a.compareTo(b) returns a negative integer (typically -1) if a is less than b, a positive integer (typically +1) if a is greater than b, and 0 if a is equal to b. (The <Key> notation, which we will introduce in SECTION 4.3, ensures that the two objects being compared have the same type.)

The precise meanings of less than, greater than, and equal to depend on the data type, though implementations that do not respect the natural laws of mathematics surrounding these concepts will yield unpredictable results. More formally, the compareTo() method must define a total order. This means that the following three properties must hold (where we use the notation x ≤ y as shorthand for x.compareTo(y) <= 0 and x = y as shorthand for x.compareTo(y) == 0):

Antisymmetric: if both x ≤ y and y ≤ x, then x = y.

Transitive: if both x ≤ y and y ≤ z, then x ≤ z.

Total: either x ≤ y or y ≤ x or both.

These three properties hold for a variety of familiar orderings, including alphabetical order for strings and ascending order for integers and real numbers. We refer to a data type that implements the Comparable interface as comparable and the associated total order as its natural order. Java’s String type is comparable, as are the primitive wrapper types (such as Integer and Double) that we introduced in SECTION 3.3.

With this convention, Insertion (PROGRAM 4.2.4) implements our sort method so that it takes an array of comparable objects as an argument and rearranges the array so that its elements are in ascending order, according to the order specified by the compareTo() method. Now, we can use Insertion.sort() to sort arrays of type String[], Integer[], or Double[].

It is also easy to make a data type comparable, so that we can sort user-defined types of data. To do so, we must include the phrase implements Comparable in the class declaration, and then add a compareTo() method that defines a total order. For example, to make the Counter data type comparable, we modify PROGRAM 3.3.2 as follows:

public class Counter implements Comparable<Counter>
{
   private int count;
 ...
   public int compareTo(Counter b)
   {
     if      (count < b.count) return -1;
     else if (count > b.count) return +1;
     else                      return  0;
   }
 ...
}

Now, we can use Insertion.sort() to sort an array of Counter objects in ascending order of their counts.


Program 4.2.4 Insertion sort

public class Insertion
{
   public static void sort(Comparable[] a)
   {  // Sort a[] into increasing order.
      int n = a.length;
      for (int i = 1; i < n; i++)
         // Insert a[i] into position.
         for (int j = i; j > 0; j--)
            if (a[j].compareTo(a[j-1]) < 0)
                 exchange(a, j-1, j);
            else break;
   }

   public static void exchange(Comparable[] a, int i, int j)
   {  Comparable temp = a[j]; a[j] = a[i]; a[i] = temp;  }

   public static void main(String[] args)
   {  // Read strings from standard input, sort them, and print.
      String[] a = StdIn.readAllStrings();
      sort(a);
      for (int i = 0; i < a.length; i++)
         StdOut.print(a[i] + " ");
      StdOut.println();
   }
}

a[] | array to sort
 n  | length of array

The sort() function is an implementation of insertion sort. It sorts arrays of any type of data that implements the Comparable interface (and, therefore, has a compareTo() method). Insertion.sort() is appropriate only for small arrays or for arrays that are nearly in order; it is too slow to use for large arrays that are out of order.


% more 8words.txt
was had him and you his the but
% java Insertion < 8words.txt
and but had him his the was you
% java Insertion < TomSawyer.txt
tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick
tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick
tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick
tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick tick
Empirical analysis

InsertionDoublingTest (PROGRAM 4.2.5) tests our hypothesis that insertion sort is quadratic for randomly ordered arrays by running Insertion.sort() on n random Double objects, computing the ratios of running times as n doubles. This ratio converges to 4, which validates the hypothesis that the running time is quadratic, as discussed in the last section. You are encouraged to run InsertionDoublingTest on your own computer. As usual, you might notice the effect of caching or some other system characteristic for some values of n, but the quadratic running time should be quite evident, and you will be quickly convinced that insertion sort is too slow to be useful for large inputs.

Sensitivity to input

Note that InsertionDoublingTest takes a command-line argument trials and runs trials experiments for each array length, not just one. As we have just observed, one reason for doing so is that the running time of insertion sort is sensitive to its input values. This behavior is quite different from (for example) ThreeSum, and means that we have to carefully interpret the results of our analysis. It is not correct to flatly predict that the running time of insertion sort will be quadratic, because your application might involve input for which the running time is linear. When an algorithm’s performance is sensitive to input values, you might not be able to make accurate predictions without taking them into account.

There are many natural applications for which insertion sort is quadratic, so we need to consider faster sorting algorithms. As we know from SECTION 4.1, a back-of-the-envelope calculation can tell us that having a faster computer is not much help. A dictionary, a scientific database, or a commercial database can contain billions of elements; how can we sort such a large array?


Program 4.2.5 Doubling test for insertion sort

public class InsertionDoublingTest
{
   public static double timeTrials(int trials, int n)
   {  // Sort random arrays of size n.
      double total = 0.0;
      Double[] a = new Double[n];
      for (int t = 0; t < trials; t++)
      {
         for (int i = 0; i < n; i++)
            a[i] = StdRandom.uniform(0.0, 1.0);
         Stopwatch timer = new Stopwatch();
         Insertion.sort(a);
         total += timer.elapsedTime();
      }
      return total;
   }
   public static void main(String[] args)
   {  // Print doubling ratios for insertion sort.
      int trials = Integer.parseInt(args[0]);
      for (int n = 1024; true; n += n)
      {
         double prev = timeTrials(trials, n/2);
         double curr = timeTrials(trials, n);
         double ratio = curr / prev;
         StdOut.printf("%7d %4.2f
", n, ratio);
      }
   }
}

trials | number of trials
   n   | problem size
 total | total elapsed time
 timer | stopwatch
  a[]  | array to sort
 prev  | running time for n/2
 curr  | running time for n
 ratio | ratio of running times

The method timeTrials() runs Insertion.sort() for arrays of random double values. The first argument trials is the number of trials; the second argument n is the length of the array. Multiple trials produce more accurate results because they dampen system effects and because insertion sort’s running time depends on the input.


% java InsertionDoublingTest 1
  1024 0.71
  2048 3.00
  4096 5.20
  8192 3.32
 16384 3.91
 32768 3.89

% java InsertionDoublingTest 10
  1024 1.89
  2048 5.00
  4096 3.58
  8192 4.09
 16384 4.83
 32768 3.96

Mergesort

To develop a faster sorting method, we use recursion and a divide-and-conquer approach to algorithm design that every programmer needs to understand. This nomenclature refers to the idea that one way to solve a problem is to divide it into independent parts, conquer them independently, and then use the solutions for the parts to develop a solution for the full problem. To sort an array with this strategy, we divide it into two halves, sort the two halves independently, and then merge the results to sort the full array. This algorithm is known as mergesort.

An overview of Mergesort is shown.

We process contiguous subarrays of a given array, using the notation a[lo, hi) to refer to a[lo], a[lo+1], ..., a[hi-1] (adopting the same convention that we used for binary search to denote a half-open interval that excludes a[hi]). To sort a[lo, hi), we use the following recursive strategy:

Base case: If the subarray length is 0 or 1, it is already sorted.

Reduction step: Otherwise, compute mid = lo + (hi - lo)/2, recursively sort the two subarrays a[lo, mid) and a[mid, hi), and merge them.

Merge (PROGRAM 4.2.6) is an implementation of this algorithm. The values in the array are rearranged by the code that follows the recursive calls, which merges the two subarrays that were sorted by the recursive calls. As usual, the easiest way to understand the merge process is to study a trace of the contents of the arrays during the merge. The code maintains one index i into the first subarray, another index j into the second subarray, and a third index k into an auxiliary array aux[] that temporarily holds the result. The merge implementation is a single loop that sets aux[k] to either a[i] or a[j] (and then increments k and the index of the subarray that was used). If either i or j has reached the end of its subarray, aux[k] is set from the other; otherwise, it is set to the smaller of a[i] and a[j]. After all of the values from the two subarrays have been copied to aux[], the sorted result in aux[] is copied back to the original array. Take a moment to study the following trace to convince yourself that this code always properly combines the two sorted subarrays to sort the full array.

i  j  k  aux[k]   a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]

                  and   had   him   was   but   his   the   you
0  4  0   and     and   had   him   was   but   his   the   you
1  4  1   but     and   had   him   was   but   his   the   you
1  5  2   had     and   had   him   was   but   his   the   you
2  5  3   him     and   had   him   was   but   his   the   you
3  5  4   his     and   had   him   was   but   his   the   you
3  6  5   the     and   had   him   was   but   his   the   you
3  7  6   was     and   had   him   was   but   his   the   you
4  7  7   you     and   had   him   was   but   his   the   you

Trace of the merge of the sorted left subarray with the sorted right subarray

The recursive method ensures that the two subarrays are each put into sorted order just prior to the merge. Again, the best way to gain an understanding of this process is to study a trace of the contents of the array each time the recursive sort() method returns. Such a trace for our example is shown next. First a[0] and a[1] are merged to make a sorted subarray in a[0, 2), then a[2] and a[3] are merged to make a sorted subarray in a[2, 4), then these two subarrays of size 2 are merged to make a sorted subarray in a[0, 4), and so forth. If you are convinced that the merge works properly, you need only convince yourself that the code properly divides the array to be convinced that the sort works properly. Note that when the number of elements in a subarray to be sorted is not even, the left half will have one fewer element than the right half.

 

                          a[0]  a[1]  a[2]  a[3]  a[4]  a[5]  a[6]  a[7]

                          was   had   him   and   you   his   the   but
sort(a, aux, 0, 8)
   sort(a, aux, 0, 4)
      sort(a, aux, 0, 2)
      return              had   was   him   and   you   his   the   but
      sort(a, aux, 2, 4)
      return              had   was   and   him   you   his   the   but
   return                 and   had   him   was   you   his   the   but
   sort(a, aux, 4, 8)
      sort(a, aux, 4, 6)
      return              and   had   him   was   his   you   the   but
      sort(a, aux, 6, 8)
      return              and   had   him   was   his   you   but   the
   return                 and   had   him   was   but   his   the   you
return                    and   but   had   him   his   the   was   you

Trace of recursive mergesort calls


Program 4.2.6 Mergesort

public class Merge

{
   public static void sort(Comparable[] a)
   {
      Comparable[] aux = new Comparable[a.length];
      sort(a, aux, 0, a.length);
   }

   private static void sort(Comparable[] a, Comparable[] aux,
                            int lo, int hi)
   {  // Sort a[lo, hi).
      if (hi - lo <= 1) return;
      int mid = lo + (hi-lo)/2;
      sort(a, aux, lo, mid);
      sort(a, aux, mid, hi);
      int i = lo, j = mid;
      for (int k = lo; k < hi; k++)
         if      (i == mid)  aux[k] = a[j++];
         else if (j == hi)   aux[k] = a[i++];
         else if (a[j].compareTo(a[i]) < 0) aux[k] = a[j++];
         else                               aux[k] = a[i++];
      for (int k = lo; k < hi; k++)
         a[k] = aux[k];
   }

   public static void main(String[] args)
   {  /* See Program 4.2.4. */  }
}

a[lo, hi) | subarray to sort
    lo    | smallest index
   mid    | middle index
    hi    | largest index
  aux[]   | auxiliary array

The sort() function is an implementation of mergesort. It sorts arrays of any type of data that implements the Comparable interface. In contrast to Insertion.sort(), this implementation is suitable for sorting huge arrays.


% java Merge < 8words.txt
and but had him his the was you

% java Merge < TomSawyer.txt
... achievement aching aching acquire acquired ...

Analysis of running time

The inner loop of mergesort is centered on the auxiliary array. The two for loops involve n iterations, so the frequency of execution of the instructions in the inner loop is proportional to the sum of the subarray lengths for all calls to the recursive function. The value of this quantity emerges when we arrange the calls on levels according to their size. For simplicity, suppose that n is a power of 2, with n = 2ᵏ. On the first level, we have one call for size n; on the second level, we have two calls for size n/2; on the third level, we have four calls for size n/4; and so forth, down to the last level with n/2 calls of size 2. There are precisely k = lg n levels, giving the grand total n lg n for the frequency of execution of the instructions in the inner loop of mergesort. This equation justifies a hypothesis that the running time of mergesort is linearithmic. Note: When n is not a power of 2, the subarrays on each level are not necessarily all the same size, but the number of levels is still logarithmic, so the linearithmic hypothesis is justified for all n (see EXERCISE 4.2.18 and EXERCISE 4.2.19).
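Written as a sum (a quick back-of-the-envelope check, assuming n = 2ᵏ as above): level j contains 2ʲ calls, each merging a subarray of length n/2ʲ, so the inner-loop total is

2⁰(n/2⁰) + 2¹(n/2¹) + ... + 2ᵏ⁻¹(n/2ᵏ⁻¹) = n + n + ... + n (k terms) = k n = n lg n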

An illustration depicts the mergesort inner loop count (where n is a power of 2).

You are encouraged to run a doubling test like PROGRAM 4.2.5 for Merge.sort() on your computer. If you do so, you certainly will appreciate that it is much faster for large arrays than is Insertion.sort() and that you can sort huge arrays with relative ease. Validating the hypothesis that the running time is linearithmic is a bit more work, but you certainly can see that mergesort makes it possible for us to address sorting problems that we could not contemplate solving with a brute-force algorithm such as insertion sort.

Quadratic–linearithmic chasm

The difference between n² and n log n makes a huge difference in practical applications, just the same as the linear–logarithmic chasm that is overcome by binary search. Understanding the enormity of this difference is another critical step to understanding the importance of the design and analysis of algorithms. For a great many important computational problems, a speedup from quadratic to linearithmic—such as we achieve with mergesort—makes the difference between the ability to solve a problem involving a huge amount of data and not being able to effectively address it at all.

Divide-and-conquer algorithms

The same basic divide-and-conquer paradigm is effective for many important problems, as you will learn if you take a course on algorithm design. For the moment, you are particularly encouraged to study the exercises at the end of this section, which describe a host of problems for which divide-and-conquer algorithms provide feasible solutions and which could not be addressed without such algorithms.

Reduction to sorting

A problem A reduces to a problem B if we can use a solution to B to solve A. Designing a new divide-and-conquer algorithm from scratch is sometimes akin to solving a puzzle that requires some experience and ingenuity, so you may not feel confident that you can do so at first. But it is often the case that a simpler approach is effective: given a new problem, ask yourself how you would solve it if the data were sorted. It often turns out to be the case that a relatively simple linear pass through the sorted data will do the job. Thus, we get a linearithmic algorithm, with the ingenuity hidden in the mergesort algorithm. For example, consider the problem of determining whether the values of the elements in an array are all distinct. This element distinctness problem reduces to sorting because we can sort the array, and then pass through the sorted array to check whether the value of any element is equal to the next—if not, the values are all distinct. For another example, an easy way to implement StdStats.median() (see SECTION 2.2) is to reduce selection to sorting. We consider next a more complicated example, and you can find many others in the exercises at the end of this section.
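For instance, a sketch of the element distinctness reduction just described might look like this (the names are ours, and we use java.util.Arrays.sort() so that the fragment stands alone; a linearithmic sort such as Merge.sort() from PROGRAM 4.2.6 would serve equally well):

import java.util.Arrays;

public class Distinct
{
   public static boolean isDistinct(String[] a)
   {
      String[] copy = a.clone();   // leave the caller's array untouched
      Arrays.sort(copy);           // any linearithmic sort will do
      for (int i = 1; i < copy.length; i++)
         if (copy[i].equals(copy[i-1])) return false;  // equal neighbors
      return true;                 // no element equals the next one
   }

   public static void main(String[] args)
   {
      System.out.println(isDistinct(new String[] { "to", "be", "or" }));        // true
      System.out.println(isDistinct(new String[] { "to", "be", "or", "to" }));  // false
   }
}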

Mergesort traces back to John von Neumann, an accomplished physicist, who was among the first to recognize the importance of computation in scientific research. Von Neumann made many contributions to computer science, including a basic conception of the computer architecture that has been used since the 1950s. When it came to applications programming, von Neumann recognized that:

• Sorting is an essential ingredient in many applications.

• Quadratic-time algorithms are too slow for practical purposes.

• A divide-and-conquer approach is effective.

• Proving programs correct and knowing their cost is important.

Computers are many orders of magnitude faster and have many orders of magnitude more memory than those available to von Neumann, but these basic concepts remain important today. People who use computers effectively and successfully know, as did von Neumann, that brute-force algorithms are often not good enough to do the job.

Application: frequency counts

FrequencyCount (PROGRAM 4.2.7) reads a sequence of strings from standard input and then prints a table of the distinct strings found and the number of times each was found, in decreasing order of frequency. This computation is useful in numerous applications: a linguist might be studying patterns of word usage in long texts, a scientist might be looking for frequently occurring events in experimental data, a merchant might be looking for the customers who appear most frequently in a long list of transactions, or a network analyst might be looking for the most active users. Each of these applications might involve millions of strings or more, so we need a linearithmic algorithm (or better). FrequencyCount is an example of developing such an algorithm by reduction to sorting. It actually does two sorts.

Computing the frequencies

Our first step is to sort the strings on standard input. In this case, we are not so much interested in the fact that the strings are put into sorted order, but in the fact that sorting brings equal strings together. If the input is

to be or not to be to

then the result of the sort is

be be not or to to to

with equal strings—such as the two occurrences of be and the three occurrences of to—brought together in the array. Now, with equal strings all together in the array, we can make a single pass through the array to compute the frequencies. The Counter data type that we considered in SECTION 3.3 is the perfect tool for the job. Recall that a Counter (PROGRAM 3.3.2) has a string instance variable (initialized to the constructor argument), a count instance variable (initialized to 0), and an increment() instance method, which increments the counter by 1. We maintain an integer m and an array of Counter objects zipf[] and do the following for each string:

• If the string is not equal to the previous one, create a new Counter object and increment m.

• Increment the most recently created Counter.

 i   m   a[i]   zipf[j].value()
                j=0  j=1  j=2  j=3

     0
 0   1   be      1
 1   1   be      2
 2   2   not     2    1
 3   3   or      2    1    1
 4   4   to      2    1    1    1
 5   4   to      2    1    1    2
 6   4   to      2    1    1    3

                 2    1    1    3

Counting the frequencies

At the end, the value of m is the number of different string values, and zipf[i] contains the ith string value and its frequency.

Sorting the frequencies

Next, we sort the Counter objects by frequency. We can do so in client code provided that Counter implements the Comparable interface and its compareTo() method compares objects by count (see EXERCISE 4.2.14). Once this is done, we simply sort the array! Note that FrequencyCount allocates zipf[] to its maximum possible length and sorts a subarray, as opposed to the alternative of making an extra pass through words[] to determine the number of distinct strings before allocating zipf[]. Modifying Merge (PROGRAM 4.2.6) to support sorting subarrays is left as an exercise (see EXERCISE 4.2.15).
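One possible solution to EXERCISE 4.2.14 (a sketch, assuming the private int count instance variable of PROGRAM 3.3.2): declare Counter to implement Comparable<Counter> and add the following method.

public int compareTo(Counter that)
{  // Compare Counter objects by frequency count.
   if      (this.count < that.count) return -1;
   else if (this.count > that.count) return +1;
   else                              return  0;
}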

 

          i   zipf[i]
before    0   2 be
          1   1 not
          2   1 or
          3   3 to

after     0   1 not
          1   1 or
          2   2 be
          3   3 to

Sorting the frequencies

Zipf’s law

The application highlighted in FrequencyCount is elementary linguistic analysis: which words appear most frequently in a text? A phenomenon known as Zipf’s law says that the frequency of the ith most frequent word in a text of m distinct words is proportional to 1/i, with constant of proportionality the inverse of the mth harmonic number H_m = 1 + 1/2 + 1/3 + … + 1/m. For example, the second most common word should appear about half as often as the first. This empirical hypothesis holds in a surprising variety of situations, ranging from financial data to web usage statistics. The test client run in PROGRAM 4.2.7 validates Zipf’s law for a database containing 1 million sentences drawn randomly from the web (see the booksite).
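As a rough check, you might compare each observed frequency, relative to the first, against the Zipf prediction 1/i. Here is a minimal sketch; the class name ZipfCheck is ours, and the counts are taken from the sample run of PROGRAM 4.2.7 shown below.

public class ZipfCheck
{
   public static void main(String[] args)
   {  // Frequencies of the most common words in Leipzig1M.txt,
      // in decreasing order (from the sample run of FrequencyCount).
      int[] freq = { 1160105, 593492, 560945, 472819, 435866 };
      for (int i = 0; i < freq.length; i++)
         StdOut.printf("%d  observed %.3f  predicted %.3f\n",
                       i + 1, (double) freq[i] / freq[0], 1.0 / (i + 1));
   }
}

For instance, the second count relative to the first is 593492/1160105 ≈ 0.512, close to the predicted 1/2.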

You are likely to find yourself writing a program sometime in the future for a simple task that could easily be solved by first using a sort. How many distinct values are there? Which value appears most frequently? What is the median value? With a linearithmic sorting algorithm such as mergesort, you can address these problems and many other problems like them, even for huge data sets. FrequencyCount, which uses two different sorts, is a prime example. If sorting does not apply directly, some other divide-and-conquer algorithm might apply, or some more sophisticated method might be needed. Without a good algorithm (and an understanding of its performance characteristics), you might find yourself frustrated by the idea that your fast and expensive computer cannot solve a problem that seems to be a simple one. With an ever-increasing set of problems that you know how to solve efficiently, you will find that your computer can be a much more effective tool than you now imagine.


Program 4.2.7 Frequency counts

public class FrequencyCount
{
   public static void main(String[] args)
   {  // Print input strings in decreasing order
      // of frequency of occurrence.
      String[] words = StdIn.readAllStrings();
      Merge.sort(words);
      Counter[] zipf = new Counter[words.length];
      int m = 0;
      for (int i = 0; i < words.length; i++)
      {  // Create new counter or increment prev counter.
         if (i == 0 || !words[i].equals(words[i-1]))
            zipf[m++] = new Counter(words[i], words.length);
         zipf[m-1].increment();
      }
      Merge.sort(zipf, 0, m);
      for (int j = m-1; j >= 0; j--)
         StdOut.println(zipf[j]);
   }
}

words[] | strings in input
zipf[]  | counter array
   m    | number of distinct strings

This program sorts the words on standard input, uses the sorted list to count the frequency of occurrence of each, and then sorts the frequencies. The test file used below has more than 20 million words. The plot compares the ith frequency relative to the first (bars) with 1/i (blue).


% java FrequencyCount < Leipzig1M.txt
the: 1160105
of: 593492
to: 560945
a: 472819
and: 435866
in: 430484
for: 205531
The: 192296
that: 188971
is: 172225
said: 148915
on: 147024
was: 141178
by: 118429
 ...

A horizontal bar graph is shown.

Lessons

The vast majority of programs that we write involve managing the complexity of addressing a new practical problem by developing a clear and correct solution, breaking the program into modules of manageable size, and testing and debugging our solution. From the very start, our approach in this book has been to develop programs along these lines. But as you become involved in ever more complex applications, you will find that a clear and correct solution is not always sufficient, because the cost of computation can be a limiting factor. The examples in this section are a basic illustration of this fact.

Respect the cost of computation

If you can quickly solve a small problem with a simple algorithm, fine. But if you need to address a problem that involves a large amount of data or a substantial amount of computation, you need to take into account the cost.

Reduce to a known problem

Our use of sorting for frequency counting illustrates the utility of understanding fundamental algorithms and using them for problem solving.

Divide-and-conquer

It is worthwhile for you to reflect a bit on the power of the divide-and-conquer paradigm, as illustrated by developing a linearithmic sorting algorithm (mergesort) that serves as the basis for addressing so many computational problems. Divide-and-conquer is but one approach to developing efficient algorithms.

Since the advent of computing, people have been developing algorithms such as binary search and mergesort that can efficiently solve practical problems. The field of study known as design and analysis of algorithms encompasses the study of design paradigms such as divide-and-conquer and dynamic programming, the invention of algorithms for solving fundamental problems like sorting and searching, and techniques to develop hypotheses about the performance of algorithms. Implementations of many of these algorithms are found in Java libraries or other specialized libraries, but understanding these basic tools of computation is like understanding the basic tools of mathematics or science. You can use a matrix-processing package to find the eigenvalues of a matrix, but you still need a course in linear algebra. Now that you know a fast algorithm can make the difference between spinning your wheels and properly addressing a practical problem, you can be on the lookout for situations where algorithm design and analysis can make the difference, and where efficient algorithms such as binary search and mergesort can do the job.

Q&A

Q. Why do we need to go to such lengths to prove a program correct?

A. To spare ourselves considerable pain. Binary search is a notable example: now that you understand it, try the classic programming exercise of writing a version that uses a while loop instead of recursion (see EXERCISE 4.2.2) without looking back at the code in the book. In a famous experiment, Jon Bentley once asked several professional programmers to do just that, and most of their solutions were not correct.

Q. Are there implementations for sorting and searching in the Java library?

A. Yes. The Java package java.util contains the static methods Arrays.sort() and Arrays.binarySearch() that implement mergesort and binary search, respectively. Actually, each represents a family of overloaded methods, one for Comparable types, and one for each primitive type.
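For example, a client might use them as follows (a minimal sketch; the class name is ours, and the I/O uses the booksite's StdIn and StdOut). Arrays.binarySearch() returns the index of the key if it is present, and a negative value otherwise.

import java.util.Arrays;

public class LibrarySortDemo
{
   public static void main(String[] args)
   {
      String[] a = StdIn.readAllStrings();
      Arrays.sort(a);                              // sort the array
      int index = Arrays.binarySearch(a, "key");   // search in the sorted array
      if (index >= 0) StdOut.println("found at index " + index);
      else            StdOut.println("not found");
   }
}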

Q. So why not just use them?

A. Feel free to do so. As with many topics we have studied, you will be able to use such tools more effectively if you understand the background behind them.

Q. Explain why we use lo + (hi - lo) / 2 to compute the index midway between lo and hi instead of using (lo + hi) / 2.

A. The latter fails when lo + hi overflows an int, producing a negative (and therefore invalid) array index; the former stays within bounds because hi - lo cannot overflow when lo and hi are valid indices with lo ≤ hi.
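A tiny illustration (the specific values are ours, chosen to trigger the overflow; this assumes the booksite's StdOut):

public class MidpointOverflow
{
   public static void main(String[] args)
   {
      int lo = 1500000000;            // both are legal int values
      int hi = 2000000000;
      int bad  = (lo + hi) / 2;       // lo + hi wraps around to -794967296,
                                      // so bad is -397483648, a negative index
      int good = lo + (hi - lo) / 2;  // 1750000000, as intended
      StdOut.println(bad + " " + good);
   }
}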

Q. Why do I get an unchecked or unsafe operation warning when compiling Insertion.java and Merge.java?

A. The argument to sort() is a Comparable array, but nothing, technically, prevents its elements from being of different types. To eliminate the warning, change the signature to:

public static <Key extends Comparable<Key>> void sort(Key[] a)

We’ll learn about the <Key> notation in the next section when we discuss generics.

Exercises

4.2.1 Develop an implementation of Questions (PROGRAM 4.2.1) that takes the maximum number n as a command-line argument. Prove that your implementation is correct.

4.2.2 Develop a nonrecursive version of BinarySearch (PROGRAM 4.2.3).

4.2.3 Modify BinarySearch (PROGRAM 4.2.3) so that if the search key is in the array, it returns the smallest index i for which a[i] is equal to key, and otherwise returns -i, where i is the smallest index such that a[i] is greater than key.

4.2.4 Describe what happens if you apply binary search to an unordered array. Why shouldn’t you check whether the array is sorted before each call to binary search? Could you check that the elements binary search examines are in ascending order?

4.2.5 Describe why it is desirable to use immutable keys with binary search.

4.2.6 Add code to Insertion to produce the trace given in the text.

4.2.7 Add code to Merge to produce a trace like the following:

% java Merge < tiny.txt
was had him and you his the but
had was
        and him
and had him was
                his you
                        but the
                but his the you
and but had him his the was you

4.2.8 Give traces of insertion sort and mergesort in the style of the traces in the text, for the input it was the best of times it was.

4.2.9 Implement a more general version of PROGRAM 4.2.2 that applies bisection search to any monotonically increasing function. Use functional programming, in the same style as the numerical integration example from SECTION 3.3.

4.2.10 Write a filter DeDup that reads strings from standard input and prints them to standard output, with all duplicate strings removed (and in sorted order).

4.2.11 Modify StockAccount (PROGRAM 3.2.8) so that it implements the Comparable interface (comparing the stock accounts by name). Hint: Use the compareTo() method from the String data type for the heavy lifting.

4.2.12 Modify Vector (PROGRAM 3.3.3) so that it implements the Comparable interface (comparing the vectors lexicographically by coordinates).

4.2.13 Modify Time (EXERCISE 3.3.21) so that it implements the Comparable interface (comparing the times chronologically).

4.2.14 Modify Counter (PROGRAM 3.3.2) so that it implements the Comparable interface (comparing the objects by frequency count).

4.2.15 Add methods to Insertion (PROGRAM 4.2.4) and Merge (PROGRAM 4.2.6) to support sorting subarrays.

4.2.16 Develop a nonrecursive version of mergesort (PROGRAM 4.2.6). For simplicity, assume that the number of items n is a power of 2. Extra credit: Make your program work even if n is not a power of 2.

4.2.17 Find the frequency distribution of words in your favorite novel. Does it obey Zipf’s law?

4.2.18 Analyze mathematically the number of compares that mergesort makes to sort an array of length n. For simplicity, assume n is a power of 2.

Answer: Let M(n) be the number of compares to mergesort an array of length n. Merging two subarrays whose total length is n requires between ½ n and n−1 compares. Thus, M(n) satisfies the following recurrence relation:

M(n) ≤ 2M(n/2) + n

with M(1) = 0. Substituting 2^k for n gives

M(2^k) ≤ 2M(2^(k−1)) + 2^k

which is similar to, but more complicated than, the recurrence that we considered for binary search. But if we divide both sides by 2^k, we get

M(2^k)/2^k ≤ M(2^(k−1))/2^(k−1) + 1

which is precisely the recurrence that we had for binary search. That is, M(2^k)/2^k ≤ T(2^k) = k. Substituting back n for 2^k (and lg n for k) gives the result M(n) ≤ n lg n. A similar argument shows that M(n) ≥ ½ n lg n.
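If you like, you can check the solution of the upper-bound recurrence numerically. A minimal sketch (the class and method names are ours; this assumes the booksite's StdOut): for n a power of 2, the recurrence with the bound taken as an equality gives exactly n lg n.

public class MergeCompareBound
{
   // Upper-bound recurrence: M(n) <= 2 M(n/2) + n, with M(1) = 0.
   public static long m(int n)
   {
      if (n == 1) return 0;
      return 2 * m(n / 2) + n;
   }

   public static void main(String[] args)
   {  // Print n, the recurrence value, and n lg n; the last two agree.
      for (int n = 2; n <= 1024*1024; n *= 2)
      {
         int lg = 31 - Integer.numberOfLeadingZeros(n);   // lg n for n a power of 2
         StdOut.println(n + "  " + m(n) + "  " + (long) n * lg);
      }
   }
}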

4.2.19 Analyze mergesort for the case when n is not a power of 2.

Partial solution. When n is an odd number, one subarray has one more element than the other, so when n is not a power of 2, the subarrays on each level are not necessarily all the same size. Still, every element appears in some subarray, and the number of levels is still logarithmic, so the linearithmic hypothesis is justified for all n.

Creative Exercises

The following exercises are intended to give you experience in developing fast solutions to typical problems. Think about using binary search and mergesort, or devising your own divide-and-conquer algorithm. Implement and test your algorithm.

4.2.20 Median. Add to StdStats (PROGRAM 2.2.4) a method median() that computes in linearithmic time the median of an array of n integers. Hint: Reduce to sorting.

4.2.21 Mode. Add to StdStats (PROGRAM 2.2.4) a method mode() that computes in linearithmic time the mode (value that occurs most frequently) of an array of n integers. Hint: Reduce to sorting.

4.2.22 Integer sort. Write a linear-time filter that reads from standard input a sequence of integers that are between 0 and 99 and prints to standard output the same integers in sorted order. For example, presented with the input sequence

98 2 3 1 0 0 0 3 98 98 2 2 2 0 0 0 2

your program should print the output sequence

0 0 0 0 0 0 1 2 2 2 2 2 3 3 98 98 98

4.2.23 Floor and ceiling. Given a sorted array of Comparable items, write functions floor() and ceiling() that return the index of the largest (or smallest) item not larger (or smaller) than an argument item in logarithmic time.

4.2.24 Bitonic maximum. An array is bitonic if it consists of an increasing sequence of keys followed immediately by a decreasing sequence of keys. Given a bitonic array, design a logarithmic algorithm to find the index of a maximum key.

4.2.25 Search in a bitonic array. Given a bitonic array of n distinct integers, design a logarithmic-time algorithm to determine whether a given integer is in the array.

4.2.26 Closest pair. Given an array of n real numbers, design a linearithmic-time algorithm to find a pair of numbers that are closest in value.

4.2.27 Furthest pair. Given an array of n real numbers, design a linear-time algorithm to find a pair of numbers that are furthest apart in value.

4.2.28 Two sum. Given an array of n integers, design a linearithmic-time algorithm to determine whether any two of them sum to 0.

4.2.29 Three sum. Given an array of n integers, design an algorithm to determine whether any three of them sum to 0. The order of growth of the running time of your program should be n² log n. Extra credit: Develop a program that solves the problem in quadratic time.

4.2.30 Majority. A value in an array of length n is a majority if it appears strictly more than n / 2 times. Given an array of strings, design a linear-time algorithm to identify a majority element (if one exists).

4.2.31 Largest empty interval. Given n timestamps for when a file is requested from a web server, find the largest interval of time in which no file is requested. Write a program to solve this problem in linearithmic time.

4.2.32 Prefix-free codes. In data compression, a set of strings is prefix-free if no string is a prefix of another. For example, the set of strings { 01, 10, 0010, 1111 } is prefix-free, but the set of strings { 01, 10, 0010, 1010 } is not prefix-free because 10 is a prefix of 1010. Write a program that reads in a set of strings from standard input and determines whether the set is prefix-free.

4.2.33 Partitioning. Design a linear-time algorithm to sort an array of Comparable objects that is known to have at most two distinct values. Hint: Maintain two pointers, one starting at the left end and moving right, and the other starting at the right end and moving left. Maintain the invariant that all elements to the left of the left pointer are equal to the smaller of the two values and all elements to the right of the right pointer are equal to the larger of the two values.

4.2.34 Dutch-national-flag problem. Design a linear-time algorithm to sort an array of Comparable objects that is known to have at most three distinct values. (Edsger Dijkstra named this the Dutch-national-flag problem because the result is three “stripes” of values like the three stripes in the flag.)

4.2.35 Quicksort. Write a recursive program that sorts an array of Comparable objects by using, as a subroutine, the partitioning algorithm described in the previous exercise: First, pick a random element v as the partitioning element. Next, partition the array into a left subarray containing all elements less than v, followed by a middle subarray containing all elements equal to v, followed by a right subarray containing all elements greater than v. Finally, recursively sort the left and right subarrays.

4.2.36 Reverse domain name. Write a filter that reads a sequence of domain names from standard input and prints the reverse domain names in sorted order. For example, the reverse domain name of cs.princeton.edu is edu.princeton.cs. This computation is useful for web log analysis. To do so, create a data type Domain that implements the Comparable interface (using reverse-domain-name order).

4.2.37 Local minimum in an array. Given an array of n real numbers, design a logarithmic-time algorithm to identify a local minimum (an index i such that both a[i] < a[i-1] and a[i] < a[i+1]).

4.2.38 Discrete distribution. Design a fast algorithm to repeatedly generate numbers from the discrete distribution: Given an array a[] of non-negative real numbers that sum to 1, the goal is to return index i with probability a[i]. Form an array sum[] of cumulative sums such that sum[i] is the sum of the first i elements of a[]. Now, generate a random real number r between 0 and 1, and use binary search to return the index i for which sum[i] ≤ r < sum[i+1]. Compare the performance of this approach with the approach taken in RandomSurfer (PROGRAM 1.6.2).

4.2.39 Implied volatility. Typically the volatility σ is the unknown value in the Black–Scholes formula (see EXERCISE 2.1.28). Write a program that reads s, x, r, t, and the current price of the European call option from the command line and uses bisection search to compute σ.

4.2.40 Percolation threshold. Write a Percolation (PROGRAM 2.4.1) client that uses bisection search to estimate the percolation threshold value.
