
13. Algorithms


It must certainly be apparent to you by now that software engineering requires significantly more consideration than simply programming or, more idiomatically, “coding.” In the previous chapter, you were introduced to an engineering strategy meant to optimize performance and minimize operational cost risk in your programs: choosing the appropriate data structure for a given scenario. In this chapter, you will be introduced to a set of algorithms that provide strategies to further optimize your code. An algorithm is a defined set of instructions meant to solve a particular problem under contextually similar circumstances, agnostic of the programming language used to solve it. An example of a problem that can be solved with algorithms is the Rubik’s Cube puzzle. Given a particular combination of colors on each side of the cube, a discrete set of steps can be enacted to solve part of the puzzle. By executing enough combinations of different algorithms, each solving the color pattern for an individual side or layer of the cube, you can solve the entire puzzle.

While the topic of algorithms is exceptionally large, increasingly complicated, and often considered one of the most difficult topics in computer science to learn, its importance cannot be overstated. Knowing some of the basic design principles of algorithms will help you to design your own algorithms in the future, to uncover solutions to previously unsolved problems, and to bring new knowledge to the world. New algorithms tend to become the basis for valuable intellectual property and advanced breakthroughs in technology. In addition, understanding existing algorithms will allow you to recognize when you should use previously discovered solutions to a problem rather than reinvent the proverbial wheel on your own. An exhaustive list of core algorithms will not be provided in this chapter, as that could be a book on its own. However, this chapter will provide you with a solid basis of understanding that can ideally lead you toward a path of continued education on this topic.

In this chapter, we will continue to build out our model operating system. You will first be introduced to the performance tuning strategy of greedy algorithms, demonstrated via a merge sort algorithm and a binary search algorithm used to look for files in a folder. Afterward, in order to optimally search for files across an entire operating system of folders, you will learn about various graph traversal algorithms. Finally, you will be shown an optimization strategy known as dynamic programming, which we will use to implement fuzzy match searching within our operating system.

Greedy Algorithms

Our model operating system has come a long way since we started developing a basic while loop. Let’s imagine that its value has become clear to a group of power users and they have begun to use it at scale. They are starting to create hundreds or thousands of text files in the system, and their businesses have become dependent upon being able to access them reliably. As we know, scale challenges and their associated risks create unique engineering problems. As these power users begin to ask for new features to help them manage their files, and subsequently their businesses, we will need to keep the engineering principles we’ve learned thus far in mind.

The first request your users have asked for is the ability to sort the files in a directory. This will make it easier to navigate through the text files when they are printed to the screen via the “list” command. While it is likely that the underlying operating system that our Nebula shell is operating on already sorts the files in the directory by name, let’s assume for the sake of example that they are always printed to the terminal in the order in which they were created. Given that information, the most intuitive, or “naive,” way to sort these files might be to iterate through each file i and compare i to each file that comes before it. During that comparison, we would find the first file that should belong after i, insert i right before it, and then move on to the next iteration. This is an algorithm known as insertion sort and, in the worst case scenario, if the list were sorted in reverse order, each file would have to be checked against every single item that comes before it. This results in quadratic time complexity, or O(n²). Given that quadratic time complexity is quite slow, our users will likely not be very happy with that type of solution. Surely we can do better than this “brute force” approach.
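To make the cost of this naive approach concrete, here is a minimal sketch of insertion sort over a small buffer of file names (this sketch is illustrative only and is not part of our Nebula code):

import scala.collection.mutable.ArrayBuffer

// Insertion sort sketch: for each file, shift every larger predecessor one
// slot to the right, then drop the file into its sorted position. In the
// worst case each of the n items scans all of its predecessors: O(n²).
def insertionSort(files: ArrayBuffer[String]): ArrayBuffer[String] = {
    for (i <- 1 until files.length) {
        val current = files(i)
        var j = i - 1
        while (j >= 0 && files(j) > current) {
            files(j + 1) = files(j) // shift the larger value right
            j -= 1
        }
        files(j + 1) = current // insert into its sorted position
    }
    files
}
println(insertionSort(ArrayBuffer("Zack.txt", "Billy.txt", "Alpha.txt")))
// ArrayBuffer(Alpha.txt, Billy.txt, Zack.txt)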

The second request that your users have asked for is a way to search for an individual file based on an input keyword. This would make it easier to see if a file exists in a directory without having to scroll through a long list of files that have already been created. To accomplish this, we could implement a naive solution known as sequential search, which performs a “brute force” check of every item in the directory one at a time. This approach would take O(n) time which, albeit better than O(n²), is still pretty slow. How might we appease our users’ requests in the most efficient manner possible?
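For comparison, the sequential search baseline is a single pass over the collection; a minimal sketch (again, illustrative only) might look like this:

// Sequential search sketch: check every item one at a time. This is O(n)
// in the worst case because a missing keyword forces a scan of the whole list.
def sequentialSearch(files: Seq[String], keyword: String): Boolean =
    files.exists(_ == keyword)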

Instead of using naive algorithms, let’s instead use what are known as greedy algorithms to accomplish both of these requests. A greedy algorithm is an algorithm that, given a series of iterations, chooses the locally optimal solution at each iteration regardless of the effect that choice will have on the future outcome of the algorithm. An example of this might be that of a software engineer trying to maximize their total earnings over the course of their career. Let’s assume, given that software engineers are in high demand, that they are recruited by a new firm every 2 years and that each new firm offers them a significant pay increase. The software engineer would be left with the choice of either staying with their current firm or leaving for the new firm to accept the pay increase. A greedy algorithm would suggest that, for each iteration (every 2 years), the engineer should accept the locally optimal solution. Therefore, in order to maximize their career earnings, they must take the pay increase and leave their current firm every time they are offered more money to leave.

Given that the choices made in greedy algorithms are local and do not, by nature, need to consider every possible scenario and its effect on the future or overall outcome, these algorithms are extremely fast. This is the benefit that we are trying to take advantage of for our operating system. The downside is that, because the algorithm doesn’t consider every possible scenario the way an exhaustive naive algorithm would, greedy algorithms often do not arrive at correct solutions. For example, in the case of optimizing career earnings, an engineer might have received a larger pay increase by staying with their current employer beyond the next 2-year span because longevity in a firm tends to lead to better contextual knowledge and productivity. However, since the algorithm considers only the optimal choice for a local iteration, which in this case is the choice between more money and the same money, the greedy algorithm might not actually maximize the career earnings of the engineer. Fortunately, for the problems we want to solve in our operating system, sorting and searching, there are greedy algorithms that have been proven to produce correct answers. In order for a greedy algorithm to yield correct results, it must satisfy two properties:
  1. They must have a “Greedy Choice Property,” meaning that making locally optimal choices derives a globally optimal solution.

  2. They must have an optimal substructure, meaning the problem must be able to be broken into smaller sub-problems wherein the optimal solution can be found.

The algorithms that we are going to use that satisfy these two properties are the merge sort algorithm and binary search algorithm. Both of these algorithms employ a useful strategy known as “divide and conquer” in their implementations. The divide and conquer technique is considered an extremely efficient strategy for solving algorithmic problems and can be used when designing your own algorithms outside of the context of the two algorithms that we will use for sorting and searching.

Divide and Conquer

The divide and conquer strategy is one of dividing a problem into a series of smaller and smaller sub-problems with the intention of creating a sub-problem that is exceptionally simple or computationally inexpensive to solve. It typically involves defining a “base case” branch, which is the most simplified version of the problem to be solved, and then a separate branch that will further divide the problem into smaller problems if the input does not meet the criteria of the base case. Implementations of divide and conquer algorithms often manifest themselves as recursive functions, which you were briefly introduced to in the chapter on the functional programming paradigm. Listing 13-1 provides an example of a function that prints the nth Fibonacci number in a Fibonacci sequence given an input number n using recursive divide and conquer. A Fibonacci number is a number that is derived by adding the two numbers that came before it in a sequence (i.e., 0, 1, 1, 2, 3, 5, 8, 13, 21, 34…).
def fibonacci(n: Int): Int = {
    if (n <= 1) n // base case: fib(0) = 0 and fib(1) = 1
    else (fibonacci(n-1) + fibonacci(n-2)) // divide: sum the two preceding Fibonacci numbers
}
println(fibonacci(9))
Terminal Output
34
Listing 13-1

A recursive divide and conquer algorithm that prints the nth Fibonacci number

Note

Recall that recursion and recursive functions refer to a function that calls itself within the body of the function. Recursive functions often take the place of iterative loops in the functional programming paradigm. Recursive functions will continue to call themselves repeatedly until a set of criteria is met to keep the function from calling itself again.

In this example, we have defined a function with a conditional branch which checks for our base case, that of our input number having a value less than or equal to 1. As you might have noticed in the example Fibonacci sequence, if you think about the sequence of numbers as an array with starting index 0 and you try to access the 0 index or the 1 index, the number returned is the same number as the index. Because of that, our most simplified base case is that of returning n if n is 0 or 1 (less than or equal to one). If the input number is greater than 1, then the fibonacci function calls itself recursively twice, substituting n with n-1 and n-2, respectively (which seeks to add the previous two numbers in the sequence). In our example, the input number is 9, and therefore the first thing that happens is our function calls itself twice and adds the results together: fibonacci(8) + fibonacci(7). Because neither 8 nor 7 is less than or equal to 1, those calls too must each call the function twice more: fibonacci(8) will call fibonacci(7) + fibonacci(6), and fibonacci(7) will call fibonacci(6) + fibonacci(5). These functions will continue to call themselves repeatedly, adding an exponentially growing number of function calls to the memory stack, until the base case is reached. Once the base case is reached and n is returned as either 0 or 1, each branch is able to start calculating actual values for its recursive calls, and results begin returning up the function call chain. This recursive call nature is illustrated in Figure 13-1 for further clarity.
Figure 13-1

Illustration of a recursive divide and conquer Fibonacci algorithm with input parameter 5 (for brevity)

As you might have guessed, this implementation of a Fibonacci algorithm is not greedy. In fact, it has a time complexity of O(2ⁿ), which makes it the slowest program we have examined in this book so far. However, it is a great representation of how you can approach solving seemingly impossible problems by breaking them down into smaller and smaller sub-problems. Let’s combine this divide and conquer strategy with the two properties of greedy algorithms to solve our users’ requests for our model operating system.

Merge Sort

The first request from our users is to have the ability to sort the files in our operating system. Since we know that the insertion sort algorithm is not going to be fast enough to solve our users’ needs, we must identify an algorithm that can sort the files in our operating system in faster than O(n²) time. Merge sort uses a divide and conquer method that ends up yielding O(n log n) time, which is relatively fast compared to the naive solution of insertion sort. What a merge sort attempts to do is take in two lists that are already sorted and merge them together, taking the lesser of the first values of the two lists as it builds up a new list. This satisfies the “Greedy Choice” property of the greedy algorithm pattern since it can take the locally optimal choice (the lesser of the first values of the two lists) and it will result in a globally optimal result.

For example, assume we have two previously sorted lists of files with values “Alpha.txt”, “Jason.txt”, and “Zordon.txt” in the first list and “Billy.txt”, “Kimberly.txt”, “Tina.txt”, and “Zack.txt” in the second list. The merge sort will examine the first item in both lists and determine that “Alpha.txt” should be added to its newly constructed list and not “Billy.txt”. Then the merge sort will examine what remains of the two lists and compare “Billy.txt” to “Jason.txt”, at which point it will add “Billy.txt” to its newly constructed list which now contains “Alpha.txt” and “Billy.txt” in sorted order. It will continue to examine both lists pulling the lesser value from the head of each list until an entirely new list is constructed in sorted order.

At this point, you might wonder how we can make the assumption that the two lists that are being merged are already in sorted order themselves. The answer to that comes from a recursive divide and conquer function. In our merge sort algorithm, we can continue to divide our list in half recursively until each list essentially contains only one item. This satisfies the optimal substructure property of the greedy algorithm pattern. In our example scenario, once the lists have been divided all the way down to their base case, we would have seven lists with only one item in each. From there, the recursive function that divided those lists will compare the two halves that it divided and merge them in sorted order.

Let’s assume our function divided a list of “Kimberly.txt”, “Tina.txt”, “Zack.txt”, and “Billy.txt” into two lists. The first list would contain “Kimberly.txt” and “Tina.txt”, and the second list would contain “Zack.txt” and “Billy.txt”. The next step would be to recursively divide each list again. At this point, we would have four separate lists with one item in each. The branch that divided “Kimberly.txt” and “Tina.txt” would merge those two individual lists back together. In this case, it would add “Kimberly.txt” into a newly created list and then “Tina.txt”, which was the order it was already in. The branch that divided “Zack.txt” and “Billy.txt” would merge them back together selecting “Billy.txt” first and “Zack.txt” second. Then, the original function that divided our list into two lists of two items each will merge these two now sorted sub-lists. As you can see, by dividing our list into two halves repeatedly in a divide and conquer fashion, we can more easily sort small lists and merge them back up through a recursion chain until the entire list is sorted. Listing 13-2 provides a definition of two recursive functions, mergeSort and merge, that employ this merge sort algorithm.
import scala.collection.mutable.ArrayBuffer

def mergeSort(files: ArrayBuffer[String]): ArrayBuffer[String] = {
    val midpoint = files.length / 2
    midpoint match {
        case 0 => files // base case: a list of 0 or 1 items is already sorted
        case _ =>
         val (left, right) = files.splitAt(midpoint)
         merge(mergeSort(left), mergeSort(right))
    }
}
def merge(left: ArrayBuffer[String], right: ArrayBuffer[String]): ArrayBuffer[String] = {
    (left, right) match {
      case (_, r) if r.isEmpty => left // base case: nothing remains on the right
      case (l, _) if l.isEmpty => right // base case: nothing remains on the left
      case (_, _) =>
       if (left.head < right.head) ArrayBuffer(left.head) ++ merge(left.tail, right)
       else ArrayBuffer(right.head) ++ merge(left, right.tail)
    }
}
val files = ArrayBuffer("Zordon.txt", "Jason.txt", "Billy.txt", "Zack.txt", "Kimberly.txt", "Tina.txt", "Tommy.txt", "Alpha.txt")
val sortedFiles = mergeSort(files)
println(sortedFiles)
Terminal Output
ArrayBuffer(Alpha.txt, Billy.txt, Jason.txt, Kimberly.txt, Tina.txt, Tommy.txt, Zack.txt, Zordon.txt)
Listing 13-2

Implementation of the merge sort algorithm

In this example, instead of using a basic List data structure, the optimal data structure to use is an Array because we will need to continually access the midpoint of each list. To know the midpoint of a list, we must first determine its length, which takes O(n) time for a List but O(1) time for an Array. Using an array helps keep the operational cost of our algorithm down. You’ll notice that we are actually using an ArrayBuffer from the Scala mutable collections library. For convenience, the ArrayBuffer collection implements various methods that help the array grow and shrink (via index copying to new arrays) so that we don’t have to implement them ourselves.

In our first recursive function, mergeSort, we first check our base case, which is that of whether or not the midpoint is 0. The midpoint would be zero only if the length of the list were less than or equal to one, which would be the most basic list. In that scenario, we simply return the list as it is already sorted. If the midpoint of the list is anything besides zero, as denoted by the underscore operator, then we split the list into two lists, left and right, at its midpoint. From there, we call the merge function and pass a recursive call to mergeSort on the left list and a recursive call to mergeSort on the right list, which will continue to divide and merge all the way down to the base case.

The merge function is also a recursive function, but this one has two base cases. The first base case is a scenario that matches if the left list has values but the right list is empty. In this case, it simply returns the left list. The second base case is a scenario that matches if the right list has values but the left list is empty. In that case, the right list is returned. If both lists have values, then the heads of the two lists are compared. The head with the lesser value is added to a new ArrayBuffer, and then that ArrayBuffer is prepended to the result of a recursive merge call on the remnants of the two lists. When all the recursive functions have been called down to their base cases, returned a value from the base case, and calculated back up the recursive chain, a new sorted ArrayBuffer will be returned in O(n log n) time. This should be fast enough to satisfy our users when using our file system.

Note

In many implementations of common data structures in the functional programming paradigm, including the ArrayBuffer, the tail of the data structure refers to every item of the list that is not the head rather than just the last item in the list.
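For example (a two-line illustration, assuming the ArrayBuffer import from Listing 13-2):

val names = ArrayBuffer("Alpha.txt", "Billy.txt", "Jason.txt")
println(names.head) // Alpha.txt
println(names.tail) // ArrayBuffer(Billy.txt, Jason.txt), everything except the head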

To provide this functionality within our system, we can add the merge function and the mergeSort function to our Utilities.scala file for use in our Nebula OS and refactor our list function to take an optional argument if the user wants the list that is printed to be sorted. Because the operating system that our Nebula model shell is running on top of is already sorting the files alphabetically, let’s flip the “less than” operator in our merge function to be a “greater than” operator to sort our list in descending order instead of ascending order. Next, let’s add an extra argument to our call to the list function in our nebula.scala file that tells our function whether or not to sort the results. Listing 13-3 demonstrates these changes.
nebula.scala
...
case c if c.contains("list") => Utilities.list(c.split(" ").lift(1))
...
Utilities.scala
...
def list(sort: Option[String]){
    val dir = new File("./").listFiles.map(file => stripDir(file.toString))
    sort match {
        case Some(_) => {
            val sortedDir = mergeSort(ArrayBuffer(dir: _*))
            sortedDir.foreach(println)
        }
        case None => dir.foreach(println)
    }
 }
...
Listing 13-3

Refactoring the list function to sort the results

Now the command for list splits the user input and calls a lift function on the results to pull out the value at index 1, if it exists. The result of the lift function is an Option type. An Option in Scala is a data type that wraps the underlying data that you are interested in (in this case, we are using an Option that wraps a String) and will indicate to the code whether the data in question exists or not. If it does exist, it is said to be of type Some; otherwise, it is of type None. Some and None are both wrapped up within the type Option, and we can use a pattern match to pull out whether the underlying data actually exists.
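As a minimal illustration of this behavior (the input string here is hypothetical):

val input = "list sort".split(" ")
println(input.lift(1)) // Some(sort)
println(input.lift(5)) // None: index 5 does not exist, but no exception is thrown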

In the method signature of our list function, we have added the parameter sort: Option[String]. This tells Scala that we are expecting an Option type to be passed to this function, and if any underlying data exists, it will be of type String. We then pattern match on the sort parameter. If the underlying data of the Option is of type Some, meaning it does contain data, then we want to take the contents of the directory, add them to an ArrayBuffer, and pass them to the mergeSort function. This will sort them in descending order and print them to the console. If the underlying data of the Option is of type None, then we just want to print each item in the directory in its existing order. We can test this command by typing the command list in our Nebula shell without any additional arguments to see the list printed normally or the command list sort (or any other additional argument) to see the list sorted in descending order.
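Assuming three hypothetical files created in the order Billy.txt, Zordon.txt, Jason.txt, a session might look like the following:

NOS> list
Billy.txt
Zordon.txt
Jason.txt
NOS> list sort
Zordon.txt
Jason.txt
Billy.txt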

Exercise 13-1

The example of merge sort provided was implemented in a completely recursive manner. In order to cognitively cement the concept of merge sort, try re-implementing the algorithm using a procedural paradigm.

Binary Search

The next request from the users of our model operating system is to be able to search for a file to see if it exists. We could simply check every file in the directory to see if it matches a given search term, but how might we use the concepts of greedy algorithms and divide and conquer to perform this operation in a more efficient manner? The binary search algorithm will improve the performance of a search function over a sequential search algorithm from O(n) time to O(log n) time.

Similar to the binary search tree data structure, the binary search algorithm attempts to divide a list in half and compare a given value to the midpoint of the list (rather than to a root node or child nodes in a tree). If the midpoint is the value that you were looking for, then you’ve completed the operation. If not, then you check whether the value you are looking for is greater than or less than the midpoint. If it is less than the midpoint, then you take the left half of the list, divide it in half again, and start the operation over. If the value you are looking for is larger than the midpoint, you perform the same operations recursively on the right half of the list. In this way, you eliminate half the search space each time you iterate through the process. This is all predicated on the assumption that the list you are searching through is already sorted (the description above assumes ascending order; because our merge sort now sorts in descending order, our implementation simply flips the comparisons). In your operating system, the files may be sorted by default, but if they are not, we can use the merge sort algorithm to sort them. Listing 13-4 provides an example implementation of a binary search algorithm added to our Utilities.scala file and a corresponding command added to the nebula.scala file.
Utilities.scala
def search(keyword: String, fileList: Option[ArrayBuffer[String]] = None): String = {
   val files = fileList match {
      case Some(f) => f
      case None => mergeSort(ArrayBuffer(new File("./").listFiles.map(file => stripDir(file.toString)): _*))
    }
    val midpoint = files.length / 2
    return files.length match {
       case 1 => if(keyword == files(0)) files(0) else "No match found."
       case 2 => if(keyword == files(0)) files(0) else if (keyword == files(1)) files(1) else "No match found."
       case _ =>
          val (left, right) = files.splitAt(midpoint)
          if(keyword == files(midpoint)) files(midpoint)
          else if (keyword > files(midpoint)) search(keyword, Some(left))
          else search(keyword, Some(right))
   }
}
nebula.scala
...
case c if c.contains("search") => println(Utilities.search(c.split(" ")(1)))
...
Listing 13-4

Binary search algorithm implementation

The input parameters to our search function are a keyword, which is the file that you are looking for, and optionally a file list (which is set to None by default), which is used during recursion but not for the initial call. The first thing our algorithm does is check to see if a file list was provided. If it is provided, then it sets the variable files to the provided value. If it is not provided, it initializes a new file list by pulling all of the files out of the current directory, adding them to an ArrayBuffer, and performing our mergeSort function on them. The result of the merge sort is then stored in the files variable. By performing the merge sort, we are satisfying the “Greedy Choice” property that allows us to choose a locally optimal path when we start dividing the list in half and searching through each half of the list. The fact that the sorted list can be continually divided and checked also inherently satisfies the optimal substructure property of the greedy algorithm requirements.

Next, the algorithm initializes a midpoint variable by performing integer division on the length of the list. This midpoint is used to check for values or continually divide the list in half, similar to how we used the midpoint in the merge sort algorithm. Finally, we perform a pattern match on the length of the file list that contains three cases. The first case is that the length of the ArrayBuffer is 1, in which case there is only one file to check the search keyword against. If that is the case, we perform the check, and if it matches, we return the value; otherwise, we return a message that says “No match found.” The next case is if the file list length is 2. In this case, we can simply check both files to see if they match the search keyword or return our failure message otherwise.

The final case matches all other lengths of the list. In this scenario, we initialize a left variable and a right variable by splitting the file list in half at its midpoint. Next, we check to see if the file matches at the midpoint. If so, we return it. If not, we check to see if the search term is larger than the word contained at the midpoint. If it is larger, then we recursively perform the search on the left half of the list (since it is sorted in descending order) by passing the keyword back to search along with Some(left) as the parameter for the optional fileList argument. If it is smaller, then we recursively perform the search on the right half of the list instead. This continues recursively, halving the search space on each iteration, until either the keyword is found or the value “No match found.” is returned. A visual representation of this recursive strategy is provided in Figure 13-2.
Figure 13-2

Visual representation of recursive binary search

In order to call this search function, we’ve added a command to our NebulaOS command list that checks if the user input contains the word “search.” If it does, we split the input into an array and pull out the second position of that array (index 1) as the search keyword to pass to the search function. Thus, if our user provides the input “search sample.txt”, it will traverse all of the files in the directory, and if it finds a text file named “sample.txt”, it will print “sample.txt” to the terminal; otherwise, it will print “No match found.”

This all works great if the users are happy to continue creating all of their files in a single directory. But at some point, for scalability’s sake, they are going to want to organize their files a little better to make them easier to search. Let’s take our operating system one step further and add the ability to create folders and navigate between them. Then we can put functions in place to search for files either within an individual directory or in the operating system as a whole. In order to do that, we must create a graph data structure to represent the folders in our operating system and implement graph traversal algorithms to search through them.

Graph Traversal Algorithms

In computer science, a graph is an abstract data type wherein a number of nodes are connected to various other nodes through what are known as edges. Our linked list data structure is an example of a graph that links each node in the list via a pointer that acts as the edge. In our linked list, each node has only one edge, which connects to the next node in the list. Another example of a graph is our binary search tree data structure. In our binary search tree, each node contained up to two edges, one connecting to its left child and one connecting to its right child. However, in an abstract graph, each node can contain any number of edges. Those edges can be either directed or undirected. Directed edges, such as the edges in our linked list and binary search tree implementations, are edges that only point to the next node in the graph but not back to their parent node. You can think of these as one-way streets. Undirected edges are edges that create a link between nodes wherein traversal can occur in either direction, like a two-way street. Some graphs can also contain circular references where a node has an edge that leads directly back to itself or indirectly back to itself through connections to other nodes. These are known as cyclic graphs because they contain a cycle, or circular reference. Binary search trees and linked lists are considered directed acyclic graphs because their edges can only be traversed in one direction and they contain no cycles. Figure 13-3 provides an illustration of various graphs for reference.
Figure 13-3

Various abstract graph data structures

By creating a folder structure for our operating system, we are essentially deriving an undirected acyclic graph which contains a folder at each node and a reference to each sub-folder as its edges. Each node, therefore, would contain a list of file values as a property of the node. Once we have this mental model of our operating system, we can apply various graph traversal algorithms to satisfy the requirement of searching for files at either a global level or at the level of a given folder.
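To make this mental model concrete, here is a small sketch of such a node (purely illustrative; our actual shell will use the real file system as the graph rather than building one in memory):

// Illustrative only: each node holds a folder name, the files it contains,
// and edges (references) to its sub-folder nodes.
case class FolderNode(name: String, files: List[String], subFolders: List[FolderNode])

val root = FolderNode("root", List("readme.txt"), List(
    FolderNode("docs", List("notes.txt"), Nil),
    FolderNode("pics", Nil, Nil)
))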

In order to set up a folder hierarchy in our operating system shell, we’ll need to implement some commands to allow for the creation of and navigation through folders. Before we do that, however, we’ll need to set up some infrastructure to let us know what the current directory is and where to start from when the shell originally boots up. Let’s start by creating an actual folder in the base directory of our project called “root” that we can set as our origin folder when the application starts. After that we can define a variable called curdir in our nebula.scala file that keeps track of the current directory that we are in as we navigate between folders. We’ll set the curdir variable to “./root” by default. We’ll also want to provide some feedback to the user as to which directory they are in as they navigate through the folders. To do that, we can create a variable called path that takes the curdir variable, which is a string, splits it into an array delimited by the slash, drops the first two items in that array (references to a period “.” and our origin folder “root”), and then joins them back together as a string using the mkString method. This will give us a reference to each sub-folder that we are in that we can append to our original prompt message NOS> without having a reference to our root folder. Listing 13-5 shows an example of how this might be implemented.
nebula.scala
object nebula extends App {
    println("Welcome to the Nebula Operating System (NOS)! Version 1.1.1")
    var curdir = "./root"
    var command = ""
    do {
      val path = curdir.split("/").drop(2).mkString("/")
      command = readLine(if(path.length > 0) s"NOS/${path}> " else "NOS> ")
      command match {
            ...
      }
   }
   while (!command.equalsIgnoreCase("shutdown"))
}
Listing 13-5

Implementing infrastructure to keep track of current directory in the Nebula OS command line shell

As you can see, when passing an argument to the readLine function to prompt the user for input, we can reference the path variable that was implemented to keep track of the directory path. If the path does not have a length greater than 0, then we must be in the root folder, and therefore our original prompt can be displayed as normal. If the path does have a length greater than 0, then we can append the path to the prompt to show the user what directory they are in and how they got there.

Now, let’s implement a mkdir and cd command to allow our users to make new directories and navigate between them. The mkdir command will need to take an argument that contains the name of the folder that the user wishes to create. The cd command will need to take an argument that contains the name of the folder the user wishes to navigate to. Both commands will need to know what directory is currently assigned to curdir in order to properly function. Let’s add these commands to our pattern matching expression as shown in Listing 13-6.
...
case c if c.contains("mkdir") => Utilities.mkdir(curdir, c.split(" ")(1))
case c if c.contains("cd") => curdir = Utilities.changeDir(curdir, c.split(" ")(1))
...
Listing 13-6

Addition of mkdir and cd commands to the pattern matching expression in the NebulaOS shell

Both of these commands contain references to functions in our Utilities module that we have not yet added. The Utilities.mkdir method is pretty simple to implement. It can simply append the name of the desired directory to the current directory and pass that derived path to Java’s File API to create the folder. The cd command, on the other hand, sets the return value of the Utilities.changeDir function to the curdir variable that we initialized earlier. The changeDir function needs to be able to do a couple of different things. First, if you pass it the string “..”, the operating system should know that means you wish to navigate up one level. Second, if you pass it any value that is not the double period operator, it needs to check whether the directory you want to navigate to exists. If the directory you are trying to navigate to doesn’t exist in the current directory, or if you are trying to move up one directory and you are already in the root folder, the function should print a message suggesting that it can’t find what you are looking for and simply return the curdir that was originally passed to it so that the directory of the operating system doesn’t change at all. The implementation of these requirements is listed in Listing 13-7.
def mkdir(curdir: String, dirname: String){
    println(s"[Creating Directory ${dirname}... ]")
    new File(s"${curdir}/${dirname}").mkdir()
}
def changeDir(curdir: String, newdir: String): String = {
    if(newdir == ".." && curdir != "./root") { return curdir.split("/").init.mkString("/") }
    if(dirExists(curdir, newdir)) { return s"${curdir}/${newdir}" }
    else { println(s"[Directory ${newdir} not found... ]"); return curdir}
}
def dirExists(curdir: String, newdir: String): Boolean = {
    val dir = new File(curdir).listFiles
    dir.foreach(item => if(stripDir(item.toString) == newdir) return true)
    return false
}
def stripDir(fileString: String): String = {
    val filename = fileString.split("/")
    return filename(filename.length - 1)
}
Listing 13-7

Implementation of the mkdir and changeDir functions, along with helper functions as necessary

In this implementation, the changeDir function first checks if the passed-in argument is the double period operator and if the operating system is not currently in the root directory. If these conditions are true, it takes the curdir argument, splits it into an array, uses the init method to keep all items except the last item, and then joins the array back into a string separated by a slash. It then returns this newly constructed directory so that the operating system can set the curdir variable to the directory one level up from where it was before.

The next thing the changeDir function does, assuming the user did not pass in the double period operator, is check to see if the new directory the user requested exists. It does this by passing the curdir argument and the newdir argument to a helper function called dirExists. The dirExists function initializes an array of the files and folders that exist in the directory referenced by curdir. Then it iterates through each of those items, passing them to a stripDir helper function that removes the path prefix from the directory name, and checks the result against the newdir argument. If it finds a match at any point, it returns true; otherwise, it returns false. If the directory that the user is trying to navigate to exists, then changeDir appends that directory to the current directory and returns that value to the operating system so that it can set the curdir variable to this new value. If it does not exist, it defaults to the final else condition, which prints out the error message and returns the curdir that was originally passed into the function with no changes.

At this point, given that the infrastructure to support an undirected acyclic graph of folders is in place, we are ready to start adding the searching capabilities that our users will inevitably want to have. We can first start by adding the current directory as an argument to our search command. Listing 13-8 provides a description of the changes to allow this functionality.
nebula.scala
...
case c if c.contains("search") => println(Utilities.search(curdir, c.split(" ")(1)))
...
Utilities.scala
def search(curdir: String, keyword: String, fileList: Option[ArrayBuffer[String]] = None): String = {
    val files = fileList match {
        case Some(f) => f
        case None => mergeSort(ArrayBuffer(new File(curdir).listFiles.map(file => stripDir(file.toString)): _*))
    }
    ...
            else if (keyword > files(midpoint)) search(curdir, keyword, Some(left))
            else search(curdir, keyword, Some(right))
    }
}
Listing 13-8

Addition of the current directory as an argument to the search command

As you can see, a curdir parameter has been added to the method signature for the search function in the Utilities.scala file, and the nebula.scala file passes its curdir value as an argument to the search function. The search function now uses the curdir parameter to initialize the list of files in the None case, and it has to continue to pass that curdir parameter to itself as it iterates through recursive calls. Now that this is in place, users can search for files in their current directory. Next, we’ll need to provide them the ability to search for a file in the entire graph of folders. There are two main algorithms that help us accomplish this, depth first search and breadth first search.

Exercise 13-2

By adding a current directory variable to the search function, we have given our users the ability to search for files locally within a given folder. They will want this same functionality applied to their list command and write command now that folders are available. Refactor the list command and the write command on your own to provide these features.

Depth First Search

Not all algorithms can be implemented in a greedy fashion. In the case of searching a graph for a value, it is not possible to use a divide and conquer strategy since the folders contain values independent of each other. Because of this, we are not able to sort the values in our graph to eliminate part of the search space as we traverse through the graph. Also, a graph like ours is not binary and therefore could have any number of child nodes for each node. This means that we cannot satisfy the optimal substructure or greedy choice properties required of greedy algorithms.

Because of this, we will need to resort to a more naive algorithm that checks every node for the values we are looking for. However, depending on the user’s knowledge of their file structure, they might be able to optimize their search by choosing one of two graph traversal algorithms. The first algorithm is called depth first search.

The depth first search algorithm is a strategy wherein we traverse the nodes of a graph one branch at a time, checking every node until we hit a leaf and then returning back up the branch to check additional branches. This algorithm is advantageous if you know that the value you are looking for is nested deep within many layers of nodes. If we possess this knowledge, there is no reason to broadly check the base of each branch before moving on to the leaves, as we know that would generate inefficiencies in our search. To further illustrate this point, consider the graph in Figure 13-4 and the order in which it visits each node.
Figure 13-4

A representation of the order in which nodes are visited in a depth first search algorithm

As you can see from this depiction, we start with an entry point or a root node. In our case, the root node is our root directory. Along the left side of this illustration are labels that represent the layers of depth that our graph contains. In this instance, we have a graph that is three layers deep. In a depth first search algorithm, we seek to search the deepest layer of an individual branch before moving on to the next branch. In this scenario, if you knew that the file you wanted to search for had a high probability of being found in layer 3, it would make no sense to explore all of layer 1 before moving on. Let’s implement this algorithm in our operating system.
nebula.scala
...
case c if c.contains("search -dfs") => Utilities.depthFirstSearch(c.split(" ")(2))
...
Utilities.scala
def depthFirstSearch(keyword: String) {
    val foldersToSearch: Stack[String] = Stack("./root")
    var folderFound = ""
    while (!foldersToSearch.isEmpty && folderFound.length == 0) {
        val curSearchFolder = foldersToSearch.pop()
        if(search(curSearchFolder, keyword) == keyword) {
            folderFound = curSearchFolder
        }
        foldersToSearch.pushAll(extractFolders(curSearchFolder))
    }
    if(folderFound.length > 0) println(s"Found ${keyword} in ${folderFound}")
    else println("No match found using depth first search.")
}
def extractFolders(dir: String): Array[String] = {
    val directory = new File(dir).listFiles
    directory.filter(item => item.isDirectory()).map(d => d.toString)
}
Listing 13-9

Implementation of depth first search algorithm

In Listing 13-9, we start by adding a search command with a -dfs parameter, signifying that the user wishes to use depth first search. Ensure that this command is placed before the original search command in our pattern matching expression so that it does not match the original search command first and ignore our depth first search command. This depth first search command should take a keyword in index position 2 after the command keyword and the -dfs parameter. Next, we define our depth first search function that is referenced in the command. The only argument it takes is the keyword to be searched for.

We start the algorithm by initializing a Stack data structure, imported from the scala.collection.mutable library, and adding the root directory to it. We call this stack foldersToSearch. Next we initialize a variable called folderFound, which we set to an empty string for now. After that, we start a while loop that will kick off our graph traversal. That while loop is going to continue to iterate until the stack is empty or the folderFound variable contains a value. In the while loop, we first pop the top item off the stack and then pass that value to our binary search algorithm to search for the keyword in that directory. If the value is found, it is added to the folderFound variable and our algorithm is done. If it does not find the keyword in that directory, it passes the node that was searched to a helper called extractFolders. This function checks all the contents of the given folder and pulls out only the items that are other folders, or other nodes in the graph. Once extracted, these nodes are all pushed onto the stack for future investigation, keeping the stack from being empty and therefore ensuring that the while loop will continue to traverse nodes. The loop will continue to iterate until there are no more nodes to visit or until it finds the folder that contains the keyword. Once we break out of the loop, we print to the terminal where the keyword was found or that it was not found at all.
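Assuming a hypothetical folder layout in which sample.txt lives two levels deep, an interaction with this command might look like the following:

NOS> search -dfs sample.txt
Found sample.txt in ./root/projects/drafts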

Note

If you recall, the binary search function makes a call to our merge sort function. Thus, both of these algorithms are employed each time a node is visited in these graph traversal algorithms. Because the graph traversal algorithms are not greedy, it is important that the algorithms that they depend on are performant since stacking algorithms on top of algorithms can add a lot of time complexity and potentially operational cost to our programs.

Breadth First Search

If you know that the keyword you are looking for is likely to be housed in a fairly shallow node in the graph, it is likely better to use a breadth first search algorithm. In contrast to the depth first search algorithm, breadth first search checks each node in a layer before moving on to the next layer. Figure 13-5 provides an illustration of the order in which nodes would be visited in breadth first traversal.
Figure 13-5

A representation of the order in which nodes are visited in a breadth first search algorithm

The implementation of a breadth first search algorithm is very similar to that of a depth first search algorithm. The main difference is in the data structure used to store the nodes to visit. Rather than using a Stack, breadth first search uses a Queue from the Scala mutable collections library. This ensures that the first nodes added to the queue are visited first before moving on to more nodes. Each node that is visited adds its children, or sub-folders, to the end of the queue rather than to the front, as happens with the stack data structure. By doing so, the queue ensures that entire layers are visited before moving on to the next layer. Listing 13-10 provides the details around this implementation.
nebula.scala
...
case c if c.contains("search -bfs") => Utilities.breadthFirstSearch(c.split(" ")(2))
...
Utilities.scala
def breadthFirstSearch(keyword: String) {
    val foldersToSearch: Queue[String] = Queue("./root")
    var folderFound = ""
    while (!foldersToSearch.isEmpty && folderFound.length == 0) {
        val curSearchFolder = foldersToSearch.dequeue()
        if(search(curSearchFolder, keyword) == keyword) {
            folderFound = curSearchFolder
        }
        foldersToSearch ++= extractFolders(curSearchFolder)
    }
    if(folderFound.length > 0) println(s"Found ${keyword} in ${folderFound}")
    else println("No match found using breadth first search.")
}
Listing 13-10

Implementation of breadth first search algorithm

Our users now have the ability to create folders, navigate between them, add files to folders, search within a folder, and search globally using both depth first search and breadth first search. However, their search is still predicated on matching a keyword exactly. We could add in some logic to match on sub-strings rather than full strings, but that does not protect our user in the event that they made a small typo. How might we still return a matching result in the event our user made a mistake? To do that, we could introduce fuzzy matching using an algorithmic strategy known as dynamic programming.

Dynamic Programming

Dynamic programming is an algorithm design strategy that seeks to minimize the operational cost of exhaustive operations, such as the brute force searching in our graph traversal algorithms. It does this using a process known as memoization. Memoization refers to storing the values of previously solved solutions in a hash table during the execution of an algorithm. For example, Figure 13-1 depicted the recursive call chain that our Fibonacci function created while employing a recursive divide and conquer algorithm. If you notice, that recursion chain ends up calculating the same value several times throughout the lifespan of the algorithm. To optimize this, any time the function needed to calculate the nth Fibonacci number, it could first check a memoized hash table of known Fibonacci numbers. If the answer is not there, it can calculate it on its own and store the result in the hash table for future iterations to check. But if it is there, it does not need to re-calculate it. This cuts down the execution time of the algorithm significantly.
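To see the effect in isolation, here is a sketch of the Fibonacci function from Listing 13-1 with a memoization table added (the naive version repeats an exponential number of calls; this version computes each value only once):

import scala.collection.mutable.Map

// Memoized Fibonacci sketch: before recursing, check the hash table of
// previously computed values. Each n is computed at most once, reducing
// the naive O(2ⁿ) call tree to roughly O(n) work.
val fibCache = Map[Int, Int]()
def fibonacci(n: Int): Int =
    fibCache.getOrElseUpdate(n,
        if (n <= 1) n
        else fibonacci(n - 1) + fibonacci(n - 2)
    )

println(fibonacci(9)) // 34, the same result as Listing 13-1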

In our model operating system, we want to add a way for our users to search for files in graph nodes that don’t exactly match the input keyword. This would allow for minor spelling differences in the input search term or if the files being searched for were unintentionally saved with a bad filename. One way to do this would be to calculate all the possible changes you could make to the input search term to make it match the file name in question and then return any file that takes less than a certain number of changes as a threshold. This type of calculation is known as an edit distance calculation, and the Levenshtein distance algorithm can help us determine the least amount of operations that could be performed on one string to turn it into another.

Levenshtein Distance

Given two strings to compare, the Levenshtein distance algorithm provides three operations to transform the first string into the second string: insertion, deletion, or replacement. Each operation that needs to be performed in the comparison incurs a cost that is counted. This dynamic programming algorithm tallies up all possible transformations to turn the first string into the second string and then returns a count of the minimum number of transformations required to satisfy the comparison.

For example, if the first string provided was “turn” and the second string provided was “turns,” the minimum number of operations required to change the first string into the second string is 1: insertion of the letter “s” at the end of the first string. If the first string provided was “turns” and the second string provided was “torn,” the minimum number of operations to ensure the words match would be 2: replacement of the “u” to be an “o” and deletion of the “s” at the end. Listing 13-11 provides an example of this algorithm written in a recursive fashion.
import scala.collection.mutable.Map

def editDist(s1: String, s2: String): Int = {
    val memoizedCosts = Map[(Int, Int), Int]() // memoization table keyed by (length of s1 prefix, length of s2 prefix)
    def lev(s1_length: Int, s2_length: Int): Int = {
        memoizedCosts.getOrElseUpdate((s1_length, s2_length),
            (s1_length, s2_length) match {
            case (s1L, 0) => s1L
            case (0, s2L) => s2L
            case (s1L, s2L) =>
                List(
                    1 + lev(s1L - 1, s2L),
                    1 + lev(s1L, s2L - 1),
                    (if (s1(s1L - 1) != s2(s2L - 1)) 1 else 0) + lev(s1L - 1, s2L - 1),
                ).min
            }
        )
    }
    lev(s1.length, s2.length)
    memoizedCosts(s1.length, s2.length)
}
Listing 13-11

Levenshtein distance implementation function

In this example, the first thing that we do is initialize a hash map called memoizedCosts that will store the cost of operations for any two strings of given integer lengths. This satisfies the memoization property of the dynamic programming design pattern. Next, it creates an inner function that takes the length of the first string and the length of the second string as input parameters. The first thing the inner function does is check to see if it has already calculated and stored in the hash table a minimum cost for prefixes of those two lengths. If it has, then the inner function is complete. If it has not, then it must calculate the value and store it in the hash table. This is all wrapped up in the getOrElseUpdate method call.

Next, the algorithm does a pattern match on the lengths of the strings which evaluates two base cases. First, a scenario wherein the first string has some length and the second string has length 0. The cost of transforming the second string of length 0 into the first string is the length of the first string since we will need to insert every character from the first string. Second, it performs this same logic but in reverse. After the base cases have been evaluated, in the event that both strings have some length, it initializes a list that contains three recursive function calls.

The first call represents a scenario wherein the last character position of the first string could be considered deleted. That would be the only operation needed if, for example, the first string were “turns” and the second string were “turn” and all we had to do was delete the last character of the first string. In that scenario, we count the cost (1) and then add in the cost of performing these evaluations on the remaining sub-string of string one against all the characters of string two.

The second call in the list represents a scenario where instead of deleting the last character position in the first string, we insert the last character in string two into the last position in string one. In that scenario, we now know that the last character from string two should now match, so we only need to finish comparing string one to a sub-string of all characters in string two except the last character. So we add one to a recursive call that removes one character length from string two.

The last recursive call checks to see if the last character positions in both strings actually match. If they do, the cost is zero, and if they don’t, the cost of replacement is 1. Once we’ve determined what the cost will be for a replacement operation, we subtract one from the character lengths of both strings and advance it through recursion. In our Fibonacci recursive function, you were able to see how each iteration through the recursion split out two additional function calls into a tree that evaluated all the way down to the base case in each branch. In this edit distance algorithm, each of these recursive calls will make three additional recursive calls all the way down to their base cases. However, in many of those calls, the cost will have already been calculated and stored in the memoization table, thus speeding up the algorithm. This is the power of dynamic programming.

It is worth noting that the unique part of this algorithm is that, through each iteration, as it starts to return values up through the recursion chain, the function only returns the minimum cost evaluated among the three scenarios in the list. This is denoted by the call to the min method at the end of the list initialization. It demonstrates how useful dynamic programming can be in optimization problems that seek to find a minimum or maximum value from a set of scenarios. You will see these types of strategies employed in many algorithms that seek to find the shortest path between two nodes in a graph.

After defining the inner function, we actually call that inner function with the lengths of the two strings provided to the outer function. This call to the inner function builds up our memoization table. Once it has been built, we can access the key in that table that represents the lengths of the two strings that we are comparing to find the minimum edit cost of the two strings, and return that as the overall value of the outer function.

Given this useful functionality, instead of comparing equality in our three search functions, you can check to see if the edit distance of the search keyword and the file being evaluated falls below a certain threshold (you can decide whatever threshold you want that to be). In my case, I elected to check if the edit distance between the words was less than 2, meaning that there could only be at most a one-character difference between the words for my operating system to consider it a match. Now my users have a small margin for error in spelling differences when searching for files which adds a marginal level of convenience that hopefully contributes to the overall user experience of the application.
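As a quick sanity check of the function against the earlier examples (the file names here are hypothetical):

println(editDist("turn", "turns")) // 1: one insertion
println(editDist("turns", "torn")) // 2: one replacement and one deletion
println(editDist("sample.txt", "sampl.txt") < 2) // true: a one-character typo still matches at this threshold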

Exercise 13-3

Go back and refactor the three search functions in your operating system. Instead of checking the keyword against the files in the operating system for equality, try comparing their edit distance to a threshold number. If the edit distance falls below the threshold, return the file as a match.

Summary

In this chapter, the algorithm design strategies of divide and conquer, greedy algorithms, and dynamic programming laid the groundwork for your future exploration into the vast world of known algorithms. You were exposed to a few common algorithms, namely, merge sort, binary search, depth first search, breadth first search, and Levenshtein distance. Having a base understanding of known algorithms is important; however, knowing how to design your own algorithms can be just as useful in constructing large-scale software engineering projects. As you might have noticed, we layered each algorithm on top of another in our operating system as we built up its functionality. This demonstrates how important it is to ensure that each function is as efficient as possible because, when combined with other operations, things can become incredibly computationally costly. Knowing the right algorithm for the job and how much time complexity it adds to your program might be the difference in determining whether your project is practical or even possible.
