Chapter 7. Composing Functional Pipelines with Algorithms and Ranges

A frequent source of difficulty for new D users with no background in functional programming is that making sense of the range-based functions from Phobos can be rather daunting, particularly when looking at a long chain of function calls with names that seem as if they come from an alien language. As an old C programmer myself, I still think of hash maps when I see std.algorithm.map, and the C function itoa pops into my head when I see std.range.iota. Until that "eureka" moment where it all falls into place, knowing which functions are used for what, and where to find them in Phobos, can be a challenging task. It's for this reason that some new D programmers tend to avoid ranges and algorithms altogether.

In this chapter, we're going to work on getting past that first hurdle with range-based functions in Phobos. We're going to look into how to use D ranges in a functional style and, in the process, explore the standard library modules that make it possible. We'll begin with an introduction to composable pipelines and then, through the rest of the chapter, take a look at a variety of functions from Phobos that can be combined in interesting ways. By the end of the chapter, you should be able to understand the idiomatic D snippets that people post in the community forums, know where to look in Phobos for the algorithms you need, and understand how to compose functional pipelines yourself. The chapter is shaped like so:

  • Functional programming and composable pipelines: an introduction and some examples
  • Navigating Phobos: std.range, std.algorithm, and std.array
  • Wrapping up MovieMan

Functional programming and composable pipelines

A major goal of functional programming is functional purity. This means that functions should be implemented such that they have no side effects. A pure function does not modify any global state or mutate any data structures in place. Take the example of appending an item to a list. In C, it might look like this:

list_append(list, item);

This will add item to list in place. When list_append returns, the state of list has been mutated. This is a side effect, making it an impure function. In a functional language, it would look more like this:

list2 = list_append(list1, item);

In this form, list1 is immutable; new items cannot be added or inserted. Instead, this function returns a new list, list2, containing all of the elements of list1 followed by the new addition, item. Immutability coupled with purity can make it easier to reason about a program's behavior. It also leads naturally to functions which are self-contained, tightly-sealed units, with no dependency on external state. This in turn makes it natural to string multiple functions together in a pipeline, where the output of one becomes the input of the next (sometimes, even a function can become the input of another).

The idea of chaining function calls together in a program is not new or particularly special. It's been a common idiom in Java for years, where the member functions of a class are often implemented to return a reference to the class instance, making something like the following possible:

myClass.operation1(10).operation2(30).process();

We can call this sort of operation chaining a pipeline, but it isn't very composable. The process function (or method, as member functions are called in Java) most likely depends on the state modified by operation1 and operation2, both of which might also be dependent on the order in which they will be compiled. This makes it impossible to swap their positions in the call chain, or replace either of them with a different implementation. It would be necessary to read the documentation to know for sure, but class methods written with chaining in mind aren't usually designed for composability.

Composable pipelines are more flexible than regular function chains in that each operation is independent of the others. This allows them to be placed at any point in a pipeline; the order of operations is dictated not by the operations themselves, but usually by the format of the desired output or other external constraints. The benefits are so great that support for them has begun to appear in non-functional languages, though usually without the strict guarantees that functional languages provide. Another difference is that functions in non-functional languages generally are not first-class citizens. In a functional language, a function can be passed as an argument to another function, or returned as the result of a function, with each instance carrying its own state (this sort of stateful function instance is often called a closure). Contrast this with C, in which function pointers can be passed to and returned from functions, but pointers to the same function all share the same state.

While variables in D are not immutable by default, we have observed that they can be made so. D also has the concept of functional purity, though we will only go through it briefly in Chapter 11, Taking D to the Next Level. Functions are not first-class citizens in D, but the language does have closures in the form of delegates. While it's possible to conceive of a composable pipeline implementation using D's delegates, ranges present a much more flexible alternative. With the variety of range interfaces available, and the ease with which they can be implemented to perform lazily, they provide a great abstraction for the inputs and outputs in composable pipelines.

There's a lot more to functional programming than composable pipelines. Refer to the article at https://en.wikipedia.org/wiki/Functional_programming for a more in-depth introduction.

A simple example

You're working on a tactical combat game where squads of robot warriors duke it out on a battlefield. You decide to implement a function that allows the AI to reorganize two squads such that the members with the most health remaining are placed in one, while those with the least are placed in the other. A naïve implementation follows:

class Robot {
  int health;
  this(int health) { this.health = health; }
  // Other members
}

// Assumes a squad size of 5 for simplicity
void healthBasedSwap(ref Robot[] squad1, ref Robot[] squad2) {
  import std.algorithm : sort;
  auto tmp = squad1 ~ squad2;
  tmp.sort!((a,b) => a.health > b.health)();
  squad1 = tmp[0 .. 5];
  squad2 = tmp[5 .. $];
}

This function first concatenates the two arrays, then sorts the elements of the resulting array in descending order with std.algorithm.sort. Finally, the sorted array is sliced such that the first five elements, the robots with the highest health, are assigned to squad1 and the remainder to squad2. In order for this to work properly, both squad1 and squad2 must be declared with the ref storage class. Remember that any changes to the existing elements of an array function parameter are reflected in the source array, except for those made to the length or ptr properties. This function modifies the ptr property of each array when it assigns the slices, so without ref the changes would only be local. The ref requirement could be eliminated, but this would require copying the data, for example, squad1[] = temp[0 .. 5].

The following unittest verifies that the function works as expected. Note that the test declares an inner function to help assemble the Robot arrays and includes a local import. This serves as a reminder that a unittest is just a special kind of function.

unittest {
  Robot[] makeRobots(int[] healthVals) {
    Robot[] bots;
    foreach(val; healthVals)
      bots ~= new Robot(val);
      return bots;
  }
  import std.algorithm : equal;
  auto squad1 = makeRobots([10, 76, 22, 67, 55]);
  auto squad2 = makeRobots([33, 94, 17, 27, 16]);
  healthBasedSwap(squad1, squad2);
  assert(squad1.equal!((a,b) => a.health == b.health)
  (makeRobots([94, 76, 67, 55, 33])));
  assert(squad2.equal!((a,b) => a.health == b.health)
  (makeRobots([27, 22, 17, 16, 10])));
}

This implementation gets the job done, but it's not an ideal solution. Every time it's called, the concatenation allocates new memory. This can add up if it's called frequently. Not only that, it's considered best practice in D to minimize memory allocations using lazy ranges. The idea is to avoid allocation completely where possible and, where it's not, only allocate at the point where the memory is actually needed. As it turns out, the allocation in healthBasedSwap can easily be done away with.

Note

Walter Bright gave a talk on the topic of memory allocation avoidance using lazy ranges at DConf 2015. You can watch the video, and download the slides at http://dconf.org/2015/talks/bright.html.

A D newcomer who is unfamiliar with ranges may attempt to avoid allocation by looping through the two arrays and swapping values as needed, perhaps using a static array to help out. This approach can be efficient and the algorithm, depending on its implementation, could likely be made generic so that it can be used elsewhere. On the other hand, a D user who understands ranges and uses them on a regular basis would likely never consider such an approach. Instead, he would reach for Phobos and the many range utilities suited to his purpose. Consider the following reimplementation of healthBasedSwap:

void healthBasedSwap(Robot[] squad1, Robot[] squad2) {
  import std.algorithm : sort;
  import std.range : chain;
  squad1.chain(squad2).sort!((a,b) => a.health > b.health)();
}

The most obvious property of this version is that it's shorter than the original. If it were in a module where other functions from std.algorithm and std.range are used, the imports could be moved to the top of the module and the function would become a one-liner. It's also true to say that the reduced number of statements makes it easier to reason about what the code is doing, but this is predicated on the assumption that the reader understands std.range.chain.

In this particular case, the name of std.range.chain is fairly descriptive and the pipeline is short, so it may look more like a familiar Earthly language rather than Martian. chain takes any number of input ranges and returns a range that presents them all as a single sequence. So, we've managed to replace the concatenation in the original implementation with something that effectively has the same end result without the allocation. Nothing is copied, nothing is moved. In fact, no action is taken at all, other than constructing and returning the output range; chain results in a lazy range.

std.algortithm.sort, on the other hand, is eager. In the function above, its return value is never used. This is because the input to sort is sorted in place; its output is only needed if it is to be handed off to another algorithm. Sorting in place can avoid allocation. It's also the reason the parameters in this version of healthBasedSwap are not declared ref; only the content of each array is modified, not the length or ptr properties.

To sum up, sort is passed given the output of chain and sees a single random-access range rather than two separate ones. Because squad1 is the first parameter to chain, it corresponds to indexes 0–4, while indexes 5–9 map to squad2. Because the delegate passed as a predicate to sort specifies descending order (the default is ascending), the end result is that squad1 contains the five highest values and squad2 the five lowest. This can be verified with the same unittest used with the first implementation. Now, squad1 can carry on the fight while squad2 withdraws for repairs.

A more complex example

You are making a word search game. You download the English Open Word List, a set of words provided freely by Ken Loge, from http://dreamsteep.com/projects/the-english-open-word-list.html. Your plan is to load the words into memory at startup, randomly shuffle them, then for each new round select 10 of them to hide in the puzzle space. You also decide to constrain the length of the selected words based on the size of the puzzle space (there are no words longer than 10 characters in the EOWL).

An imperative implementation of the pickWords function might look like this:

string[] pickWords(string[] words, size_t maxLen, string[] sink) {
  size_t matches;
  foreach(s; words) {
    if(s.length <= maxLen) {
      sink[matches] = s;
      if(++matches == sink.length)
        break;
    }
  }
  return sink[0 .. matches];
}

This implementation does not allocate memory. It takes an array to use as a sink. It returns a slice of sink, allowing the calling code to recognize the number of elements assigned to the array. To see why this is a good idea, consider the following code:

static string[10] arr;
arr[0] = "Hello";
writeln(arr);

This will print Hello followed by nine empty strings. sink might be the same length as the source array, or it might only be a slice of it. In the latter case, the .length property of the source array will always report the number of allocated elements, but not the number actually assigned to it in pickWords. By returning a slice of sink, the caller can read .length on the return value to get the actual number of elements assigned.

This function isn't bad, but let's rework it to use a range-based functional pipeline.

string[] pickWords(string[] words, size_t maxLen, string[] sink) {
    import std.algorithm : filter, copy;
    import std.range : take;
    auto remaining = words
        .filter!(s => s.length <= maxLen)
        .take(sink.length)
        .copy(sink);
    return sink[0 .. sink.length - remaining.length];
}

The highlighted lines have replaced the foreach loop, resulting in a code that looks cleaner and can be easier to follow. We can read it from top to bottom: iterate words; only consider strings whose length is less than or equal to maxLen; take up to sink.length of those; copy them all to sink. The return value assigned to remaining comes from the last call in the chain, copy. It's always of the same type as the target range, which in this case is string[], and its .length is the number of empty slots remaining in the target range. So if sink is filled up, it will return 0. The number of elements copied is determined by subtracting remaining.length from sink.length, which is what happens at the end to generate the returned slice.

This function looks more like idiomatic D, but an idiomatic D programmer probably isn't going to stop there. Two things stand out about it. First, words and sink aren't just arrays, they're ranges, too. That's why they can be used with the functions from std.algorithm and std.range. Second, the words themselves are just range elements, meaning that the algorithm could apply to any elements of any range. All that would be required is a way to specify the predicate, which right now compares the length of each word. From that perspective, a more generic pickElements function could be envisioned, but that will be left as an exercise for the reader.

Sometimes we can't

In the downloadable source code for the book, under Chapter07, you'll find the modules words1.d and words2.d. These files contain the implementations of pickWords, along with a main function that loads the word list, shuffles it, then calls pickWords and prints the selected strings to standard output. In both files, the following lines load and shuffle the word list:

import std.array : split;
import std.file : readText;
import std.random : randomShuffle;
auto wordList = readText("words.txt").split();
wordList.randomShuffle();

There are two things that are worth noting about this snippet. First off, the two highlighted lines are just screaming to be combined, but they can't be. The reason is that randomShuffle has a return value of void, so it can't be used in any pipeline that's making an initialization or an assignment.

Second, std.array.split is an eager function. When given a string, it splits on whitespace by default and returns an array of strings. Another function, std.algorithm.splitter, will lazily split strings on whitespace, but if you replace the call to split above with a call to splitter, it will no longer compile. This is because splitter returns an input range. If the source range is a forward or bidirectional range, the returned range will have those properties as well. randomShuffle, however, requires a random-access range as input. In other words, a range returned from splitter cannot be used with randomShuffle until it is first converted to a random-access range, which is usually accomplished by calling std.array.array.

Ultimately, it doesn't matter in this case whether the splitting operation is lazy or eager; the end result still needs to be an array of strings, which can be obtained equally well by split and splitter.array. The point of this little diversion is that when first learning your way around range-based functions in Phobos, it's not uncommon to run into situations where you can't get some of them to work together. Always pay attention to the documented inputs and outputs for any functions you use. For example, the documentation for the single-argument version of randomShuffle has this as a header:

void randomShuffle(Range)(Range r) if (isRandomAccessRange!Range);

The void return value is a quick indication that it can't be used directly in a pipeline. The template constraint shows that the type Range must be a random-access range, so we know that we can pass arrays to it directly, but other types of ranges must first be converted. There has been discussion in the D forums about the proliferation of template constraints in the Phobos documentation and how it can look like gibberish to new users. The flipside is that if you understand the meaning of the predicates in those constraints, then you can understand at a glance what sort of input a function requires.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.80.34