Chapter 6. Exploring Search Options

Searching is a widely used process in computer applications, primarily to determine whether an element with a particular value is present in a vector or list of elements or not. It acts as a substitute in case of deletions, as without searching an element of a particular value, deletion operations cannot take place. A search can be an evaluation of finding an element (exact match) in a set of given elements, or finding a group of elements (range match) which falls under a certain range of values. In a search operation, the location of the element is also determined. The location can be used later in deletion operations. A search is said to be successful if the element of a particular key value is found in the given vector (or list), and is said to be unsuccessful if the element of a particular key value is not found in the given vector (or list). This chapter shall cover concepts of sequential search operations and direct access by key value (hashing) search operations.

This chapter deals with search operations carried on both vectors and lists (including linked lists). The topics covered are as follows:

  • Searching unsorted and sorted vectors
  • Self-organizing lists
  • Hashing

The approaches discussed in the first two sections are more effective in implementation while using an in-memory-(single node) based system, and the third approach discussed is more effective in implementation while using either in-memory-(single node) or disks (multiple nodes) based systems.

Searching unsorted and sorted vectors

Vectors are simple and widely used data structures used to perform search operations in R. The simplest form of search operation performed on vectors is a linear search or sequential search. In a linear search, each element is compared sequentially within the vector, and then suitable insertions or deletions are performed. Consider an element, S which is to be searched in an unsorted vector, V of length n (indexed from 1 to n). If the element S is not present in vector V, then a minimum of n comparisons are performed, and if it is present in V at position i, then a minimum of i comparisons are performed. In both scenarios, the number of comparisons is linear, resulting in O(n) as the functional form of system runtime for sequential search in the worst case scenario. The following R code performs a linear search of element S in a vector V of length n:

Sequential_search <- function(V,S,n) 
{ 
  i=1 
  present = FALSE 
  while(i <= n & present==FALSE) 
  { 
    if(V[i] == S) 
    present=TRUE else i = i+1 
  } 
  return(list(present=present,key=i)) 
}

Figure 6.1 illustrates two different sequential operations. Part (a) represents the situation wherein the element S is not present, and Part (b) represents the presence of element S. Each step corresponds to a comparison, regardless of whether the current element in the vector is equal to S or not:

Searching unsorted and sorted vectors

Figure 6.1: Sequential or linear search

Sequential search becomes a bottleneck when n is a very large value. One way to deal with large vectors is to pre-process them using multiple sorting techniques learned previously, and then perform a sequential search. If the current element in vector V is larger than the element S, then the search operation can be terminated early. This reduces the system runtime; however, the asymptote of the search operation does not improve beyond O(n) for the worst-case scenario. The following R code performs a sequential search on an ordered vector V of length n:

Seq_ord_search <- function(V,S,n) 
{ 
  i=1 
  present = FALSE 
  while(i <= n & present==FALSE ) 
  { 
    if(V[i] == S) 
    present=TRUE else if(V[i] > S) 
    stop("element S not found") else i=i+1 
  } 
  return(present) 
}

Figure 6.2 illustrates a sequential search operation being performed on an ordered vector in which the element S is not present. Here, each step corresponds to two comparisons: One, whether the current element is equal to S or not, and the other, whether the current element is greater than S or not (provided S is not equal):

Searching unsorted and sorted vectors

Figure 6.2: Sequential or linear search performed on an ordered vector

Now, consider a situation in which the element S is directly compared with the element in the third position and is found to be greater; then, it becomes imperative that element S is greater than the elements in the first and second position, without even comparing explicitly. This kind of comparison performed on intermittent elements (best possible jumps) rather than on each element of the vector is a key feature of the jump search algorithm.

A jump search algorithm is an improvisation on the current sequential search algorithm performed on sorted vectors. Here, the element S is initially compared with elements of the vector V positioned at regular intervals, i. In other words, the element S is initially compared with V[i], V[2i], V[3i], and so on, till the condition of S being lower is met. Once the position wherein the element in that position (key value) is greater than the element S is determined, then the sequential search is performed on its previous i-1 elements. That is, if the element S is lesser than the element V[3i], then the sequential search is performed on elements between V[2i] and V[3i]. This dividing of the vector into sub-vectors and then performing search operations is similar to the concept of divide and conquer discussed in the Merge sort section in Chapter 5, Sorting Algorithms. The best possible i for a vector of length n is Searching unsorted and sorted vectors. The following R code performs a jump search algorithm on a sorted vector V of length n. The jumps are performed at an interval of Searching unsorted and sorted vectors:

Jump_search <- function(V,S,n) 
{ 
  jump <- floor(sqrt(n)) 
  present = FALSE 
  i=1 
  while(jump < n & V[jump] < S) 
  { 
    i=jump 
    jump = jump+floor(sqrt(n)) 
    if(jump>=n) 
    stop("element S not found") 
  } 
  while(V[i] < S & i <= jump) 
  i = i+1 
  if(V[i]==S) 
  present=TRUE 
  return(present) 
}

Figure 6.3 illustrates the working of a jump search algorithm. Initially, jumps are performed till the current value of the jumped position is greater than the element S, and then linear search is performed within the sub-vector (here, elements in positions 4, 5, and 6):

Searching unsorted and sorted vectors

Figure 6.3: Jump search algorithm

The jump search algorithm can also be modified such that the jumps are performed at two stages, and sequential search is performed at the third stage. In a sense, a sub-vector is first determined based on a certain jump; then, within that sub-vector, smaller jumps are performed to determine a sub-sub-vector within which the sequential search is performed. This can be generalized using recursive implementation of generating sub-vectors till a single element in the vector V is left out for comparison. This generalized jump search implementation is equivalent to a binary search algorithm. The approach of binary search is to jump directly toward the element in the middle of the vector, and then compare S with it. If the value of S is greater than the middle element, then jump backward, otherwise, jump forward. The jump is always performed toward the middle element of any sub-portion of the vector in consideration. The asymptote of a binary search algorithm for an average case scenario is O(log n).

A binary search algorithm can be implemented recursively or iteratively. However, recursive implementation can sometimes be risky. Following is the R code implementation of the binary search algorithm:

  • Recursive implementation (returns the position of the element S if found in V):
          Bin_search_recursive <- function(V,S,l,h) { 
            if ( h < l ) { 
              stop("h should be more than l") 
            } else { 
                m <- floor((l + h) / 2) 
                if ( V[m] > S ) 
                Bin_search_recursive(V, S,l,m-1) 
                else if ( V[m] < S ) 
                Bin_search_recursive(V, S, m+1, h) 
                else 
                return(m) 
              } 
          }
    
  • Iterative implementation (returns whether element S is present in the vector V or not):
          Bin_search_iterative <- function(V, S,n) { 
            l=1 
            h=n 
            i = 0 
            while ( l <= h ) { 
              m <- floor((l + h)/2) 
              if ( V[m] > S ) 
              h <- m - 1 
              else if ( V[m] < S ) 
              l <- m + 1 
              else if(V[m]==S) 
              return(TRUE) 
            } 
            return(FALSE) 
          }
    

Figure 6.4 illustrates the implementation of binary search on a sorted vector V:

Searching unsorted and sorted vectors

Figure 6.4: Binary search algorithm

In any of the algorithms covered so far, the distribution of key values (elements of the vector) has not been considered for search operations. Suppose you need to search a word algorithm in a dictionary. The first step will be to look for all the words which begin with the letter a, and then, within those, search for words which begin with al, and so on, till the word algorithm is found. In other words, the distribution of the word is taken into account before computing the next steps of where to search in the dictionary. This form of search, in which the knowledge of element S is considered before computing the next search steps, is called dictionary search or interpolation search. The first search position (p) for the element S in the sorted vector V is computed as follows:

Searching unsorted and sorted vectors

Upon computing the primary search position p, the element S is compared with V[p]. If S equals V[p], then the search operation is terminated; otherwise, the position p is used to split the vector into two sub-vectors. Based on the value of S with respect to V[p], the search continues in either of the sub-vectors, as illustrated in Figure 6.5. The split at position p is similar to the split in the binary search algorithm. Again, the new position of the sub-vector is computed based on the distribution of elements within it, and the search operation continues till the element S is found or the vector is narrowed until no elements are left. The system runtime reduces considerably, which follows an asymptote of O(log log n) for average case scenarios.

The following R code performs an interpolation search of element S in the sorted vector V of length n:

Interpolation_search <- function(V,S,n) 
{ 
  i=1; j=n; l=V[1]; h=V[j]; 
  if(S<l | S>h) return(FALSE) 
  while(i < j) 
  { 
    k = floor(i+((j-i)*(S-l))/(h-l)) 
    split = V[k] 
    if(S>split){ 
      i=k+1; l=split 
    }else if(S < split){ 
      j=k-1; h=split 
    }else if(V[k]==S){ 
      return(TRUE)} 
  } 
  return(FALSE) 
}

Figure 6.5 illustrates the working of an interpolation search:

Searching unsorted and sorted vectors

Figure 6.5: Interpolation or dictionary search algorithm

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.16.79.65