Sorting Algorithms

Choosing the right sorting algorithm can have a huge impact on application performance. What’s right for one application isn’t necessarily right for a different application. Here are some criteria to consider when selecting a sorting algorithm:

  • How much data is to be sorted? For small data sets it doesn’t matter which algorithm you choose because there is little difference in the execution times, but for large data sets, the worst-case bounds become radically different. Beware of data sets that are typically small but may occasionally be much larger — you need to select an algorithm that performs acceptably on the largest data sets your code may encounter.
  • Does the data fit in memory? Most sorting algorithms are efficient only when the data they operate on resides in memory. If the data set is too large for memory, you may need to split it into smaller chunks for sorting and then combine those sorted chunks to create the final sorted data set.
  • Is the data already mostly sorted? Adding new data to a sorted list can be done efficiently with certain algorithms, but those same algorithms have poor performance on randomly ordered data.
  • How much additional memory does the algorithm require? An in-place sorting algorithm sorts the data without using any additional memory, such as by swapping elements in an array. When memory is at a premium, an in-place algorithm may be a better choice than one with otherwise superior efficiency.
  • Is relative order preserved? A stable sorting algorithm preserves the relative order of data elements that are otherwise identical for sorting purposes. (In other words, if elements A and B have identical key values and A precedes B in the original data set, A will still precede B after a stable sorting.) Stability is generally a desirable feature, but in many cases it may be worth sacrificing stability for improved performance.

In an interview situation, it’s not unusual for the interviewer to vary the criteria as the interview progresses to see how well you understand the differences between sorting algorithms.

For simplicity, the sorting problems used in interviews often deal with simple integer values stored in arrays. In the real world, sorting usually involves more complex data structures with only one or a few of the values in those data structures affecting the sorting order. The value (or values) that determines the sorting order is referred to as the key. Most sorting algorithms in standard libraries are comparison algorithms, which require only a way to determine whether one key is less than, equal to, or greater than another key. No comparison algorithm can have a worst-case running time better than O(n log(n)).
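
For example, in Java you can supply the key comparison for a more complex data structure through a comparator. The following sketch (the Employee class and its fields are hypothetical) sorts records by a single key field:

import java.util.Arrays;
import java.util.Comparator;

public class KeySortExample {
    // A record where only the name field acts as the sorting key.
    static class Employee {
        final String name;
        final int id;

        Employee( String name, int id ){
            this.name = name;
            this.id = id;
        }
    }

    public static void main( String[] args ){
        Employee[] staff = {
            new Employee( "Turing", 3 ),
            new Employee( "Hopper", 1 ),
            new Employee( "Knuth", 2 )
        };

        // A comparison sort needs only this: a way to order two keys.
        Arrays.sort( staff, Comparator.comparing( ( Employee e ) -> e.name ) );

        for( Employee e : staff ){
            System.out.println( e.name + " " + e.id );
        }
    }
}

Note that Arrays.sort on object arrays is a stable comparison sort, so records with equal keys keep their relative order.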

Selection Sort

Selection sort is one of the simplest sorting algorithms. It starts with the first element in the array (or list) and scans through the array to find the element with the smallest key, which it swaps with the first element. The process is then repeated with each subsequent element until the last element is reached.

The description of this algorithm suggests a recursive approach, as shown here with the selectionSortRecursive method:

// Sort an array using a recursive selection sort.
public void selectionSortRecursive( int[] data ){
    selectionSortRecursive( data, 0 );
}

// Sort a subset of the array starting at the given index.
private void selectionSortRecursive( int[] data, int start ) {
    if( start < data.length - 1 ){
        swap( data, start, findMinimumIndex( data, start ) );
        selectionSortRecursive( data, start + 1 );
    }
}

This implementation depends on the two helper routines findMinimumIndex and swap:

// Find the position of the minimum value starting at the given index.
private int findMinimumIndex( int[] data, int start ) {
    int minPos = start;

    for( int i = start + 1; i < data.length; ++i ){
        if( data[i] < data[minPos] ){
            minPos = i;
        }
    }

    return minPos;
}

// Swap two elements in an array.
private void swap( int[] data, int index1, int index2 ){
    if( index1 != index2 ){
        int tmp = data[index1];
        data[index1] = data[index2];
        data[index2] = tmp;
    }
}

This implementation could be optimized by transforming the tail-recursive procedure into an iterative one and inlining the two helper functions.
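
For example, the tail recursion unrolls directly into a loop. A sketch (the name selectionSortIterative is illustrative), still relying on the same two helpers:

// Sort an array using an iterative selection sort.
public void selectionSortIterative( int[] data ){
    for( int start = 0; start < data.length - 1; ++start ){
        swap( data, start, findMinimumIndex( data, start ) );
    }
}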

How efficient is selection sort? The first swap requires n – 1 comparisons, the second n – 2, the third n – 3, and so on. This is the series (n – 1) + (n – 2) + ... + 1, which simplifies to n(n – 1)/2. This means that the algorithm is O(n²) in the best, average, and worst cases — the initial order of the data has no effect on the number of comparisons. As you’ll see later in this chapter, other sorting algorithms have more efficient running times than this.

Selection sort does have the advantage that it requires at most n – 1 swaps. In situations in which moving data elements is more expensive than comparing them, selection sort may perform better than other algorithms. The efficiency of an algorithm depends on what you’re optimizing for.

Selection sort is an in-place algorithm. Typical implementations of selection sort, such as the one shown here, are not stable: the long-distance swaps it performs can move an element past another element with an equal key.

Insertion Sort

Insertion sort is another simple sorting algorithm. It builds a sorted array (or list) one element at a time by comparing each new element to the already-sorted elements and inserting the new element into the correct location, similar to the way you sort a hand of playing cards.

A simple implementation of insertion sort is as follows:

// Sort an array using a simple insertion sort.
public void insertionSort( int[] data ){
  for( int which = 1; which < data.length; ++which ){
    int val = data[which];
    int i = which - 1;

    // Shift larger elements one position right to make room for val.
    while( i >= 0 && data[i] > val ){
      data[i + 1] = data[i];
      --i;
    }

    data[i + 1] = val;
  }
}

Unlike selection sort, the best-case running time for insertion sort is O(n), which occurs when the list is already sorted. This means insertion sort is an efficient way to add new elements to a presorted list. The average and worst cases are both O(n²), however, so it’s not the best algorithm to use for large amounts of randomly ordered data.

Insertion sort is a stable, in-place sorting algorithm especially suitable for sorting small data sets and is often used as a building block for other, more complicated sorting algorithms.

Quicksort

Quicksort is a divide-and-conquer algorithm that involves choosing a pivot value from a data set and splitting the set into two subsets: a set that contains all values less than the pivot and a set that contains all values greater than or equal to the pivot. The pivot/split process is recursively applied to each subset until there are no more subsets to split. The results are combined to form the final sorted set.

A naïve implementation of this algorithm looks like this:

// Sort an array using a simple but inefficient quicksort.
public int[] quicksortSimple( int[] data ){

  if( data.length < 2 ){
    return data;
  }

  int pivotIndex = data.length / 2;
  int pivotValue = data[ pivotIndex ];

  int leftCount = 0;

  // Count how many are less than the pivot

  for( int i = 0; i < data.length; ++i ){
    if( data[ i ] < pivotValue ) ++leftCount;
  }

  // Allocate the arrays and create the subsets

  int[] left = new int[ leftCount ];
  int[] right = new int[ data.length - leftCount - 1 ];

  int l = 0;
  int r = 0;

  for( int i = 0; i < data.length; ++i ){
    if( i == pivotIndex ) continue;

    int val = data[ i ];

    if( val < pivotValue ){
      left[ l++ ] = val;
    } else {
      right[ r++ ] = val;
    }
  }

  // Sort the subsets

  left = quicksortSimple( left );
  right = quicksortSimple( right );

  // Combine the sorted arrays and the pivot back into the original array

  System.arraycopy( left, 0, data, 0, left.length );
  data[ left.length ] = pivotValue;
  System.arraycopy( right, 0, data, left.length + 1, right.length );

  return data;
}

The preceding code illustrates the principles of quicksort, but it’s not a particularly efficient implementation due to scanning the starting array twice, allocating new arrays, and copying results from the new arrays to the original.
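
A more typical implementation partitions the array in place instead. Here is a sketch using Hoare-style partitioning and the swap helper shown earlier (the name quicksortInPlace is illustrative):

// Sort an array in place using quicksort.
public void quicksortInPlace( int[] data ){
    quicksortInPlace( data, 0, data.length - 1 );
}

// Sort the subarray between the given indices (inclusive).
private void quicksortInPlace( int[] data, int low, int high ){
    if( low >= high ) return;

    int pivotValue = data[ low + ( high - low ) / 2 ];
    int i = low - 1;
    int j = high + 1;

    // Scan inward from both ends, swapping out-of-place pairs,
    // until the two scans cross.
    while( true ){
        do { ++i; } while( data[i] < pivotValue );
        do { --j; } while( data[j] > pivotValue );
        if( i >= j ) break;
        swap( data, i, j );
    }

    // Recurse on the two partitions; no temporary arrays are needed.
    quicksortInPlace( data, low, j );
    quicksortInPlace( data, j + 1, high );
}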

Quicksort’s performance is dependent on the choice of pivot value. The ideal pivot value is one that splits the original data set into two subsets of identical (or nearly identical) size. Every time you do a pivot-and-split, you perform constant-time operations on each of the elements involved. How many times do you do this for each element? In the best case, the size of a sublist is halved on each successive recursive call, and the recursion terminates when the sublist size is 1. This means the number of times you operate on an element is equal to the number of times you can divide n by 2 before reaching one: log(n). Performing log(n) operations on each of n elements yields a combined best case complexity of O(n log(n)).
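
In recurrence form, the best case is T(n) = 2T(n/2) + cn for some constant c, which resolves to T(n) = O(n log(n)).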

On the other hand, what if your pivot choice is poor? In the worst case, the pivot is the minimum value in the data set, which means that one subset is empty and the other subset contains n – 1 items (all the items except for the pivot). The number of recursive calls is then O(n) (analogous to a completely unbalanced tree degrading to a linked list), which gives a combined worst-case complexity of O(n²). This is the same as selection sort or insertion sort.

On average almost any pivot value will split a data set into two non-empty subsets, making the number of recursive calls fall somewhere between O(log(n)) and O(n). A bit of mathematical work (omitted here) is enough to show that in most cases the number of times you operate on an element is still O(log(n)), so the average case complexity of quicksort is also O(n log(n)).

For truly randomly ordered data, the value of the pivot is unrelated to its location, so you can choose a pivot from any location because they’re all equally likely to be good choices. But if the data is already sorted (or mostly sorted), choosing the value located in the middle of the data set ensures that each subset contains approximately half the data, which gives guaranteed O(n log(n)) complexity for sorted data. Because the value in the middle location is the best choice for ordered data and no worse than any other for unordered data, most quicksort implementations use it as the pivot.

Like the preceding implementation, most implementations of quicksort are not stable.

Merge Sort

Merge sort is another divide-and-conquer algorithm that works by splitting a data set into two or more subsets, sorting the subsets, and then merging them together into the final sorted set.

The algorithm can be implemented recursively as follows:

// Sort an array using a simple but inefficient merge sort.
public int[] mergeSortSimple( int[] data ){

    if( data.length < 2 ){
        return data;
    }

    // Split the array into two subarrays of approx equal size.

    int   mid = data.length / 2;
    int[] left = new int[ mid ];
    int[] right = new int[ data.length - mid ];

    System.arraycopy( data, 0, left, 0, left.length );
    System.arraycopy( data, mid, right, 0, right.length );

    // Sort each subarray, then merge the result.

    mergeSortSimple( left );
    mergeSortSimple( right );

    return merge( data, left, right );
}

// Merge two smaller arrays into a larger array.
private int[] merge( int[] dest, int[] left, int[] right ){
    int dind = 0;
    int lind = 0;
    int rind = 0;

    // Merge arrays while there are elements in both
    while ( lind < left.length && rind < right.length ){
        if ( left[ lind ] <= right[ rind ] ){
            dest[ dind++ ] = left[ lind++ ];
        } else {
            dest[ dind++ ] = right[ rind++ ];
        }
    }

    // Copy rest of whichever array remains
    while ( lind < left.length )
        dest[ dind++ ] = left[ lind++ ];

    while ( rind < right.length )
        dest[ dind++ ] = right[ rind++ ];

    return dest;
}

Most of the work is done in the merge method, which combines two sorted arrays into a larger sorted array.

In a hybrid merge sort, a different sorting algorithm is used to sort subsets below a specified minimum size. For example, you can transform the mergeSortSimple method into a hybrid algorithm by replacing the termination condition:

if( data.length < 2 ){
    return data;
}

with an insertion sort:

if( data.length < 10 ){ // some small empirically determined value
    insertionSort( data );
    return data;
}

This is a common optimization because insertion sort has lower overhead than merge sort and typically has better performance on very small data sets.

Unlike most other sorting algorithms, merge sort is a good choice for data sets that are too large to fit into memory. In a typical scenario, the contents of a large file are split into multiple smaller files. Each of the smaller files is read into memory, sorted using an appropriate algorithm, and written back out. A merge operation is then performed using the sorted files as input and the sorted data is written directly to the final output file.
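
The merge phase of such an external sort follows the same logic as the merge method shown earlier, reading from files instead of arrays. The following sketch uses a priority queue to merge many sorted chunk files of integers, one value per line (the class and method names are hypothetical, and error handling is minimal):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalMergeSketch {
    // Merge sorted files of integers (one per line) into one output file.
    public static void mergeFiles( List<File> sortedChunks, File output )
            throws IOException {
        // Each heap entry is { next value, index of the file it came from }.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>( ( a, b ) -> Integer.compare( a[0], b[0] ) );
        List<BufferedReader> readers = new ArrayList<>();

        for( int i = 0; i < sortedChunks.size(); ++i ){
            BufferedReader reader =
                new BufferedReader( new FileReader( sortedChunks.get( i ) ) );
            readers.add( reader );
            String line = reader.readLine();
            if( line != null ) heap.add( new int[]{ Integer.parseInt( line ), i } );
        }

        try( PrintWriter out = new PrintWriter( new FileWriter( output ) ) ){
            while( !heap.isEmpty() ){
                int[] smallest = heap.poll(); // smallest pending value overall
                out.println( smallest[0] );

                // Refill the heap from the file that supplied the value.
                String line = readers.get( smallest[1] ).readLine();
                if( line != null ){
                    heap.add( new int[]{ Integer.parseInt( line ), smallest[1] } );
                }
            }
        }

        for( BufferedReader reader : readers ) reader.close();
    }
}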

The best, average, and worst-case running times for merge sort are all O(n log(n)), which is great when you need a guaranteed upper bound on the sorting time. However, merge sort requires O(n) additional memory — substantially more than many other algorithms.

Typical (maximally efficient) merge sort implementations are stable but not in-place.
