Chapter 6. Understanding Ranges

Since they were first introduced, ranges have become a pervasive part of D. It's possible to write D code and never need to create any custom ranges or algorithms yourself, but it helps tremendously to understand what they are, where they are used in Phobos, and how to get from a range to an array or another data structure. If you intend to use Phobos, you're going to run into them eventually. Unfortunately, some new D users have a difficult time understanding and using ranges.

The aim of this chapter and the next is to present ranges and functional style in D from the ground up, so the you can see they aren't some arcane secret understood only by a chosen few. Then, you can start writing idiomatic D early on in your journey. In this chapter, we lay the foundation with the basics of constructing and using ranges in two sections:

  • Ranges defined
  • Ranges in use

Ranges defined

In this section, we're going to explore what ranges are and examine concrete definitions of the different types of ranges recognized by Phobos. First, we'll dig into an example of the sort of problem ranges are intended to solve and, in the process, develop our own solution. This will help form an understanding of ranges from the ground up.

The problem

As part of an ongoing project, you've been asked to create a utility function, filterArray, that takes an array of any type and produces a new array containing all of the elements from the source array that satisfy a Boolean condition. The algorithm should be nondestructive, meaning it should not modify the source array at all. For example, given an array of integers as the input, filterArray could be used to produce a new array containing all of the even numbers from the source array.

It should be immediately obvious that a function template can handle the requirement to support any type. With a bit of thought and experimentation, a solution can soon be found to enable support for different Boolean expressions, perhaps a string mixin, a delegate, or both. After browsing the Phobos documentation for a bit, you come across a template, std.functional.unaryFun, that looks like it will help with the implementation. Its declaration is as follows:

template unaryFun(alias fun, string parmName = "a");

The alias fun parameter can be a string representing an expression, or any callable type that accepts one argument. If it is the former, the name of the variable inside the expression should be the value of parmName, which is "a" by default. The following snippet demonstrates this:

int num = 10;
assert(unaryFun!("(a & 1) == 0")(num));
assert(unaryFun!("x > 0", "x")(num));

If fun is a callable type, then unaryFun is documented to alias itself to fun and the parmName parameter is ignored. The following snippet calls unaryFun first with a struct that implements opCall, then calls it again with a delegate literal:

struct IsEven {
  bool opCall(int x) {
    return (x & 1) == 0;
  }
}
IsEven isEven;
assert(unaryFun!isEven(num));
assert(unaryFun!(x => x > 0)(num));

With this, you have everything you need to implement the utility function to spec:

import std.functional;
T[] filterArray(alias predicate, T)(T[] source)
  if(is(typeof(unaryFun!predicate(source[0])))
{
  T[] sink;
  foreach(t; source) {
    if(unaryFun!predicate(t))
      sink ~= t;
  }
  return sink;
}
unittest {
  auto ints = [1, 2, 3, 4, 5, 6, 7];
  auto even = ints.filterArray!(x => (x & 1) == 0)();
  assert(even == [2, 4, 6]);
}

The unittest verifies that the function works as expected. As a standalone implementation, it gets the job done and is quite likely good enough. But what if, later on down the road, someone decides to create more functions that perform specific operations on arrays in the same manner? The natural outcome of that is to use the output of one operation as the input for another, creating a chain of function calls to transform the original data.

The most obvious problem is that any such function that cannot perform its operation in place must allocate at least once every time it's called. This means that chain operations on a single array will end up allocating memory multiple times. This is not the sort of habit you want to get into in any language, especially in performance-critical code, but in D you have to take the GC into account. Any given allocation could trigger a garbage collection cycle, so it's a good idea to program to the GC; don't be afraid of allocating, but do so only when necessary and keep it out of your inner loops.

In filterArray, the naïve appending can be improved upon, but the allocation can't be eliminated unless a second parameter is added to act as the sink. This allows the allocation strategy to be decided at the call site rather than by the function, but it leads to another problem. If all of the operations in a chain require a sink and the sink for one operation becomes the source for the next, then multiple arrays must be declared to act as sinks. This can quickly become unwieldy.

Another potential issue is that filterArray is eager, meaning that every time the function is called, the filtering takes place immediately. If all of the functions in a chain are eager, it becomes quite difficult to get around the need for allocations or multiple sinks. The alternative, lazy functions, do not perform their work at the time they are called, but rather at some future point. Not only does this make it possible to put off the work until the result is actually needed (if at all), it also opens the door to reducing the amount of copying or allocating needed by operations in the chain. Everything could happen in one step at the end.

Finally, why should each operation be limited to arrays? Often, we want to execute an algorithm on the elements of a list, a set, or some other container, so why not support any collection of elements? By making each operation generic enough to work with any type of container, it's possible to build a library of reusable algorithms without the need to implement each algorithm for each type of container.

The solution

Now we're going to implement a more generic version of filterArray, called filter, that can work with any container type. It needs to avoid allocation and should also be lazy. To facilitate this, the function should work with a well-defined interface that abstracts the container away from the algorithm. By doing so, it's possible to implement multiple algorithms that understand the same interface. It also takes the decision on whether or not to allocate memory completely out of the algorithms. The interface of the abstraction need not be an actual interface type. Template constraints can be used to verify that a given type meets the requirements.

Note

You might have heard of duck typing. It originates from the old saying, If it looks like a duck, swims like a duck, and quacks like a duck, then it's probably a duck. The concept is that if a given object instance has the interface of a given type, then it's probably an instance of that type. D's template constraints and compile-time capabilities easily allow for duck typing. We've already seen some of this in Chapter 5, Generic Programming Made Easy. We're going to see more here, as it's a key component of range-based programming in D.

The interface

In looking for inspiration to define the new interface, it's tempting to turn to other languages such as Java and C++. On the one hand, we want to iterate the container elements, which brings to mind the iterator implementations in other languages. However, we also want to do a bit more than that, as demonstrated by the following chain of function calls:

container.getType.algorithm1.algorithm2.algorithm3.toContainer();

Conceptually, the instance returned by getType will be consumed by algorithm1, meaning that, inside the function, it will be iterated to the point where it can produce no more elements. But then, algorithm1 should return an instance of the same type, which can iterate over the same container, and which will in turn be consumed by algorithm2. The process repeats for algorithm3. This implies that instances of the new type should be able to be instantiated independent of the container they represent.

Moreover, given that D supports slicing, the role of getType could easily be played by opSlice. Iteration need not always begin with the first element of a container and end with the last; any range of elements should be supported. In fact, there's really no reason for an actual container instance to exist at all in some cases. Imagine a random number generator. We should be able to plug one into the preceding function chain just by eliminating the container and replacing getType with the generator instance. As long as it conforms to the interface we define, it doesn't matter that there is no concrete container instance backing it.

The short version of it is, we don't want to think solely in terms of iteration, as it's only a part of the problem we're trying to solve. We want a type that not only supports iteration, of either an actual container or a conceptual one, but one that also can be instantiated independently of any container, knows both its beginning and ending boundaries, and, in order to allow for lazy algorithms, can be used to generate new instances that know how to iterate over the same elements.

Considering these requirements, Iterator isn't a good fit as a name for the new type. Rather than naming it for what it does or how it's used, it seems more appropriate to name it for what it represents. There's more than one possible name that fits, but we'll go with Range (as in, a range of elements). That's it for the requirements and the type name. Now, let's move on to the API.

For any algorithm that needs to sequentially iterate a range of elements from beginning to end, three basic primitives are required:

  • There must be a way to determine whether or not any elements are available
  • There must be a means to access the next element in the sequence
  • There must be a way to advance the sequence so that another element can be made ready

Based on these requirements, there are several ways to approach naming the three primitives, but we'll just take a shortcut and use the same names used in D. The first primitive will be called empty and can be implemented either as a member function that returns bool or as a bool member variable. The second primitive will be called front, which again could be a member function or variable and which returns T, the element type of the range. The third primitive can only be a member function and will be called popFront, as conceptually it is removing the current front element from the sequence to ready the next element.

A range for arrays

Wrapping an array in the Range interface is quite easy. It looks like this:

auto range(T)(T[] array) {
  struct ArrayRange(T) {
    private T[] _array;
    bool empty() {
      return _array.length == 0;
    }
    ref T front() {
      return _array[0];
    }
    void popFront() {
      _array = _array[1 .. $];
    }
  }
  return ArrayRange!T(array);
}

By implementing the iterator as a struct, there's no need to allocate GC memory for a new instance. The only member is a slice of the source array, which again avoids allocation. Look at the implementation of popFront. Rather than requiring a separate variable to track the current array index, it slices the first element out of _array so that the next element is always at index 0, consequently shortening the length of the slice by one so that, after every item has been consumed, _array.length will be 0. This makes the implementation of both empty and front dead simple.

ArrayRange can be a Voldemort type because there is no need to declare its type in any algorithm it's passed to. As long as the algorithms are implemented as templates, the compiler can infer everything that needs to be known for them to work. Moreover, thanks to UFCS, it's possible to call this function as if it were an array property. Given an array called myArray, the following is valid:

auto range = myArray.range;

Next, we need a template to go in the other direction. This needs to allocate a new array, walk the iterator, and store the result of each call to front in the new array. Its implementation is as follows:

T[] array(T, R)(R range) {
  T[] ret;
  while(!range.empty) {
    ret ~= range.front;
    range.popFront();
  }
  return ret;
}

This can be called after any operation that produces any Range in order to get an array. If the range comes at the end of one or more lazy operations, this will cause all of them to execute simply by the call to popFront (we'll see how shortly). In that case, no allocations happen except as needed in this function when elements are appended to ret. Again, the appending strategy used here is naïve, so there's room for improvement in order to reduce the potential number of allocations. Now it's time to implement an algorithm to make use of our new range interface.

The implementation of filter

The filter function isn't going to do any filtering at all. If that sounds counterintuitive, recall that we want the function to be lazy; all of the work should be delayed until it is actually needed. The way to accomplish that is to wrap the input range in a custom range that has an internal implementation of the filtering algorithm. We'll call this wrapper FilteredRange. It will be a Voldemort type, which is local to the filter function. Before seeing the entire implementation, it will help to examine it in pieces as there's a bit more to see here than with ArrayRange.

FilteredRange has only one member:

private R _source;

R is the type of the range that is passed to filter. The empty and front functions simply delegate to the source range, so we'll look at popFront next:

void popFront() {
    _source.popFront();
    skipNext();
}

This will always pop the front element from the source range before running the filtering logic, which is implemented in the private helper function skipNext:

private void skipNext() {
    while(!_source.empty && !unaryFun!predicate(_source.front))
        _source.popFront();
}

This function tests the result of _source.front against the predicate. If it doesn't match, the loop moves on to the next element, repeating the process until either a match is found or the source range is empty. So, imagine you have an array arr of the values [1,2,3,4]. Given what we've implemented so far, what would be the result of the following chain?:

arr.range.filter!(x => (x & 1) == 0).front;

As mentioned previously, front delegates to _source.front. In this case, the source range is an instance of ArrayRange; its front returns _source[0]. Since popFront was never called at any point, the first value in the array was never tested against the predicate. Therefore, the return value is 1, a value which doesn't match the predicate. The first value returned by front should be 2, since it's the first even number in the array.

In order to make this behave as expected, FilteredRange needs to ensure the wrapped range is in a state such that either the first call to front will properly return a filtered value, or empty will return true, meaning there are no values in the source range that match the predicate. This is best done in the constructor:

this(R source) {
    _source = source;
    skipNext();
}

Calling skipNext in the constructor ensures that the first element of the source range is tested against the predicate before front is ever called; however, it does mean that our filter implementation isn't completely lazy. In an extreme case, when_source contains no values that match the predicate; it's actually going to be completely eager. The source elements will be consumed as soon as the range is instantiated. Not all algorithms will lend themselves to 100 percent laziness. No matter. What we have here is lazy enough. Wrapped up inside the filter function, the whole thing looks like this:

import std.functional;
auto filter(alias predicate, R)(R source)
  if(is(typeof(unaryFun!predicate))) {
    struct FilteredRange {
      private R _source;
      this(R source) {
        _source = source;
        skipNext();
      }
      bool empty() { return _source.empty; }
      auto ref front() { return _source.front; }
      void popFront() {
        _source.popFront();
        skipNext();
      }
      private void skipNext() {
        while(!_source.empty && !unaryFun!predicate(_source.front))
          _source.popFront();
      }
    }
    return FilteredRange(source);
}

It might be tempting to take the filtering logic out of the skipNext method and add it to front, which is another way to guarantee that it's performed on every element. Then no work would need to be done in the constructor and popFront would simply become a wrapper for _source.popFront. The problem with that approach is that front can potentially be called multiple times without calling popFront in between, meaning the predicate will be tested on each call. That's unnecessary work. As a general rule, any work that needs to be done inside a range to prepare a front element should happen as a result of calling popFront, leaving front to simply focus on returning the current element.

The test

With the implementation complete, it's time to put it through its paces. Here are a few test cases in a unittest block:

unittest {
  auto arr = [10, 13, 300, 42, 121, 20, 33, 45, 50, 109, 18];
  auto result = arr.range
                     .filter!(x => x < 100 )
                     .filter!(x => (x & 1) == 0)
                     .array!int();
  assert(result == [10,42,20,50,18]);

  arr = [1,2,3,4,5,6];
  result = arr.range.filter!(x => (x & 1) == 0).array!int;
  assert(result == [2, 4, 6]);

  arr = [1, 3, 5, 7];
  auto r = arr.range.filter!(x => (x & 1) == 0);
  assert(r.empty);

  arr = [2,4,6,8];
  result = arr.range.filter!(x => (x & 1) == 0).array!int;
  assert(result == arr);
}

Assuming all of this has been saved in a file called filter.d, the following will compile it for unit testing:

dmd -unittest -main filter

That should result in an executable called filter that, when executed, should print nothing to the screen, indicating a successful test run. Notice the test that calls empty directly on the returned range. Sometimes, we might not need to convert a range to a container at the end of the chain. For example, to print the results, it's quite reasonable to iterate a range directly. Why allocate when it isn't necessary?

The real ranges

The purpose of the preceding exercise was to get a feel of the motivation behind D ranges. We didn't develop a concrete type called Range, just an interface. D does the same, with a small set of interfaces defining ranges for different purposes. The interface we developed exactly corresponds to the basic kind of D range, called an input range, one of two foundational range interfaces in D (the upshot of that is that both ArrayRange and FilteredRange are valid input ranges, though, as we'll eventually see, there's no reason to use either outside of this chapter). There are also certain optional properties that ranges might have that, when present, some algorithms might take advantage of. We'll take a brief look at the range interfaces now, then see more details regarding their usage in the next section.

Input ranges

This foundational range is defined to be anything from which data can be sequentially read via the three primitives empty, front, and popFront. The first two should be treated as properties, meaning they can be variables or functions. This is important to keep in mind when implementing any generic range-based algorithm yourself; calls to these two primitives should be made without parentheses. The three higher-order range interfaces, we'll see shortly, build upon the input range interface.

To reinforce a point made earlier, one general rule to live by when crafting input ranges is that consecutive calls to front should return the same value until popFront is called; popFront prepares an element to be returned and front returns it. Breaking this rule can lead to unexpected consequences when working with range-based algorithms, or even foreach.

Input ranges are somewhat special in that they are recognized by the compiler. Recall the coverage of opApply in the previous chapter; it's what enables iteration of a custom type with a foreach loop. An alternative is to provide an implementation of the input range primitives. When the compiler encounters a foreach loop, it first checks to see if the iterated instance is of a type that implements opApply. If not, it then checks for the input range interface and, if found, rewrites the loop. For example, given a range someRange:

foreach(e; someRange) { ... }

This is rewritten to something like this:

for(auto __r = range; !__r.empty; __r.popFront()) {
    auto e = __r.front;
    ...
}

This has implications. To demonstrate, let's use the ArrayRange from earlier:

auto ar = [1, 2, 3, 4, 5].range;
foreach(n; ar) {
    writeln(n);
}
if(!ar.empty) writeln(ar.front);

The last line prints 1. If you're surprised, look again at the for loop that the compiler generates. ArrayRange is a struct, so when it's assigned to __r, a copy is generated. The slices inside, ar and __r, point to the same memory, but their .ptr and .length properties are distinct. As the length of the __r slice decreases, the length of the ar slice remains the same. This behavior was covered in the discussion of slices back in Chapter 2, Building a Foundation with D Fundamentals, and again when we talked about postblit constructors in Chapter 3, Programming Objects the D Way, but it's easy to forget.

When implementing generic algorithms that loop over a source range, it's not a good idea to rely on this behavior. If the range is a class instead of struct, it will be consumed by the loop, as classes are references types. Furthermore, there are no guarantees about the internal implementation of a range. There could be struct-based ranges that are actually consumed in a foreach loop. Generic functions should always assume this is the case.

To test if a given range type R is an input range:

import std.range : isInputRange;
static assert(isInputRange!R);

There are no special requirements on the return value of the front primitive. Elements can be returned by value or by reference, they can be qualified or unqualified, they can be inferred via auto, and so on. Any qualifiers, storage classes, or attributes that can be applied to functions and their return values can be used with any range function, though it might not always make sense to do so.

Forward ranges

The most basic of the higher-order ranges is the forward range. This is defined as an input range that allows its current point of iteration to be saved via a primitive appropriately named save. Effectively, the implementation should return a copy of the current state of the range. For ranges that are struct types, it could be as simple as:

auto save() { return this; }

For ranges that are class types, it requires allocating a new instance:

auto save() { return new MyForwardRange(this); }

Forward ranges are useful for implementing algorithms that require lookahead. For example, consider the case of searching a range for a pair of adjacent elements that pass an equality test:

auto saved = r.save;
if(!saved.empty) {
  for(saved.popFront(); !saved.empty;
              r.popFront(), saved.popFront()) {
    if(r.front == saved.front)
      return r;
  }
}
return saved;

Because this uses a for loop and not a foreach loop, the ranges are iterated directly and are going to be consumed. Before the loop begins, a copy of the current state of the range is made by calling r.save. Then, iteration begins over both the copy and the original. The original range is positioned as the first element, and the call to saved.popFront in the beginning of the loop statement positions the saved range as the second element. As the ranges are iterated in unison, the comparison is always made on adjacent elements. If a match is found, r is returned, meaning that the returned range is positioned at the first element of a matching pair. If no match is found, saved is returned—since it's one element ahead of r, it will have been consumed completely and its empty property will be true.

The preceding example is derived from a more generic implementation in Phobos, std.range.findAdjacent. It can use any binary (two-argument) Boolean condition to test adjacent elements and is constrained to only accept forward ranges.

It's important to understand that calling save usually does not mean a deep copy, but it sometimes can. If we were to add a save function to the ArrayRange from earlier, we could simply return this; the array elements would not be copied. A class-based range, on the other hand, will usually perform a deep copy because it's a reference type. When implementing generic functions, you should never make the assumption that the range does not require a deep copy. For example, let's assume a range r:

auto saved = r;         // INCORRECT!!
auto saved = r.save;    // Correct.

If r is a class, the first line is almost certainly going to result in incorrect behavior.

To test if a given range R is a forward range:

import std.range : isForwardRange;
static assert(isForwardRange!R);

Bidirectional ranges

A bidirectional range is a forward range that includes the primitives back and popBack, allowing a range to be sequentially iterated in reverse. The former should be a property, the latter a function. Given a bidirectional range r, the following forms of iteration are possible:

foreach_reverse(e; r) writeln(e);
for(; !r.empty; r.popBack)
    writeln(r.back);
}

Like its cousin foreach, the foreach_reverse loop will be rewritten into a for loop that might not consume the original range; the for loop shown here does consume it.

To test whether a given range type R is a bidirectional range:

import std.range : isBidirectionalRange;
static assert(isBidirectionalRange!R);

Random-access ranges

A random-access range is a bidirectional range that supports indexing and is required to provide a length primitive if it isn't infinite (two topics we'll discuss shortly). For custom range types, this is achieved via the opIndex operator overload. It is assumed that r[n] returns a reference to the (n+1)th element of the range, just as when indexing an array.

To test whether a given range R is a random-access range:

import std.range : isRandomAccessRange;
static assert(isRandomAccessRange!R);

Dynamic arrays can be treated as random-access ranges by importing std.array. This pulls functions into scope that accept dynamic arrays as parameters and allows them to pass all the isRandomAccessRange checks. This makes our ArrayRange from earlier obsolete. Often, when you need a random-access range, it's sufficient just to use an array instead of creating a new range type. However, char and wchar arrays (string and wstring) are not considered random-access ranges, so they will not work with any algorithm that requires one.

Tip

Getting a random-access range from char[] and wchar[]

Recall that a single Unicode character can be composed of multiple elements in a char or wchar array, which is an aspect of strings that would seriously complicate any algorithm implementation that needs to directly index the array. To get around this, the thing to do in the general case is to convert char[] and wchar[] into dchar[]. This can be done with std.utf.toUTF32, which encodes UTF-8 and UTF-16 strings into UTF-32 strings. Alternatively, if you know you're only working with ASCII characters, you can use std.string.representation to get ubyte[] or ushort[] (on dstring, it returns uint[]). This will also avoid auto-decoding in algorithms that work with input ranges, as calling popFront on any string will decode the next front element into UTF-32.

Output ranges

The output range is the second foundational range type. It's defined as anything that can be sequentially written to via the primitive put. Generally, it should be implemented to accept a single parameter, but the parameter could be a single element, an array of elements, a pointer to elements, or another data structure containing elements. When working with output ranges, never call the range's implementation of put directly; instead, use the Phobos utility function std.range.put. It will call the range's implementation internally, but it allows for a wider range of argument types. Given a range r and element e, it would look like this:

import std.range : put;
put(r, e);

The benefit here is that, if e is anything other than a single element, such as an array or another range, the global put does what is necessary to pull elements from it and put them into r one at a time. With this, you can define and implement a simple output range that might look something like this:

MyOutputRange(T) {
  private T[] _elements;
  void put(T elem) {
    _elements ~= elem;
  }
}

Now you need not worry about calling put in a loop, or overloading it to accept collections of T. For example:

MyOutputRange!int range;
auto nums = [11, 22, 33, 44, 55];
import std.range : put;
put(range, nums);

Note

Note that using UFCS here will cause compilation to fail, as the compiler will attempt to call MyOutputRange.put directly, rather than the utility function. However, it's fine to use UFCS when the first parameter is a dynamic array.

To test whether a given range R is an output range:

import std.range : isOutputRange;
static assert(isOutputRange!(R, E));

Here, E is the type of element accepted by R.put.

Optional range primitives

In addition to the five primary range types, some algorithms in Phobos are designed to look for optional primitives that can be used as an optimization or, in some cases, a requirement. There are predicate templates in std.range that allow the same options to be used outside of Phobos.

hasLength

Ranges that expose a length property can reduce the amount of work needed to determine the number of elements they contain. A great example is the std.range.walkLength function, which will calculate and return the length of any range, whether it has a length primitive or not. Given a range that satisfies the std.range.hasLength predicate, the operation becomes a call to the length property; otherwise, the range must be iterated until it is consumed, incrementing a variable every time popFront is called. Generally, length is expected to be a O(1) operation. If any given implementation cannot meet that expectation, it should be clearly documented as such. For non-infinite random-access ranges, length is a requirement. For all others, it's optional.

isInfinite

An input range with an empty property which is implemented as a compile-time value set to false, is considered an infinite range. For example:

struct IR {
    private uint _number;
    enum empty = false;
    auto front() { return _number; }
    void popFront() { ++_number; }
}

Here, empty is a manifest constant, but it could alternatively be implemented as follows:

static immutable empty = false;

The predicate template std.range.isInfinite can be used to identify infinite ranges. Any range that is always going to return false from empty should be implemented to pass isInfinite. Wrapper ranges (such as the FilterRange we implemented earlier) in some functions might check isInfinite and customize an algorithm's behavior when it's true. Simply returning false from an empty function will break this, potentially leading to infinite loops or other undesired behavior.

Other options

There are a handful of other optional primitives and behaviors, as follows:

  • hasSlicing: This returns true for any forward range that supports slicing. There are a set of requirements specified by the documentation for finite versus infinite ranges and whether opDollar is implemented.
  • hasMobileElements: This is true for any input range whose elements can be moved around in memory (as opposed to copied) via the primitives moveFront, moveBack, and moveAt.
  • hasSwappableElements: This returns true if a range supports swapping elements through its interface. The requirements differ depending on the range type.
  • hasAssignableElements: This returns true if elements are assignable through range primitives such as front, back, or opIndex.

At http://dlang.org/phobos/std_range_primitives.html, you can find specific documentation for all of these tests, including any special requirements that must be implemented by a range type to satisfy them.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.54.245