Chapter 1. “D”iving In

You know what’s coming first, so without further ado:


Click here to view code image

import std.stdio;
void main() {
   writeln("Hello, world!");
}


Depending on what other languages you know, you might have a feeling of déjà vu, a mild appreciation for simplicity, or perhaps a slight disappointment that D didn’t go the scripting languages’ route of allowing top-level statements. (Top-level statements invite global variables, which quickly turn into a liability as the program grows; D does offer ways of executing code outside main, just in a more structured manner.) If you’re a stickler for precision, you’ll be relieved to hear that void main is equivalent to an int main that returns “success” (code zero) to the operating system if it successfully finishes execution.

But let’s not get ahead of ourselves. The purpose of the traditional “Hello, world!” program is not to discuss a language’s capabilities, but instead to get you started on writing and running programs using that language. If you don’t have some IDE offering transparent builds, the command line is an easy route. After you have typed the code above in a file called, say, hello.d, fire a shell and type the following commands:


Click here to view code image

$ dmd hello.d
$ ./hello
Hello, world!
$ _


where $ stands in for your command prompt (it could be C:PathToDir> on Windows or /path/to/dir% on Unix systems, such as OSX, Linux, and Cygwin). You can even get the program to compile and run automatically if you apply a bit of your system-fu skills. On Windows, you may want to associate the shell command “Run” with the program rdmd.exe, which is part of the installation. Unix-like systems support the “shebang notation” for launching scripts, syntax that D understands; adding the line

to the very beginning of your hello.d program makes it directly executable. After you make that change, you can simply type at the command prompt:


Click here to view code image

$ chmod u+x hello.d
$ ./hello.d
Hello, world!
$ _


(You need to do the chmod thing only once.)

On all operating systems, the rdmd program is smart enough to cache the generated executable, such that compilation is actually done only after you’ve changed the program, not every time you run it. This, combined with the fact that the compiler proper is very fast, fosters a rapid edit-run cycle that helps short scripts and large programs alike.

The program itself starts with the directive


Click here to view code image

import std.stdio;


which instructs the compiler to look for a module called std.stdio and make its symbols available for use. import is akin to the #include preprocessor directive found in C and C++ but is closer in semantics to Python’s import: there is no textual inclusion taking place—just a symbol table acquisition. Repeated imports of the same file are of no import.

Per the venerable tradition established by C, a D program consists of a collection of declarations spread across multiple files. The declarations can introduce, among other things, types, functions, and data. Our first program defines the main function to take no arguments and return “nothingness”—void, that is. When invoked, main calls the writeln function (which, of course, was cunningly defined by the std.stdio module), passing it a constant string. The ln suffix indicates that writeln appends a newline to the printed text.

The following sections provide a quick drive through Deeville. Little illustrative programs introduce basic language concepts. At this point the emphasis is on conveying a feel for the language, rather than giving pedantic definitions. Later chapters will treat each part of the language in greater detail.

1.1 Numbers and Expressions

Are you ever curious how tall foreigners are? Let’s write a simple program that displays a range of usual heights in feet + inches and in centimeters.


Click here to view code image

/*
  Compute heights in centimeters for a range of heights
  expressed in feet and inches
*/
import std.stdio;

void main() {
   // Values unlikely to change soon
   immutable inchesPerFoot = 12;
   immutable cmPerInch = 2.54;

   // Loop'n write
   foreach (feet; 5 .. 7) {
      foreach (inches; 0 .. inchesPerFoot) {
         writeln(feet, "'", inches, "'' ",
            (feet * inchesPerFoot + inches) * cmPerInch);
      }
   }
}


When executed, this program will print a nice two-column list:


Click here to view code image

5'0''   152.4
5'1''   154.94
5'2''   157.48
...
6'10''  208.28
6'11''  210.82


The construct foreach (feet; 5 .. 7) { ... } is an iteration statement that defines an integer variable feet and binds it in turn to 5 and then 6, but not 7 (the interval is open to the right).

Just like Java, C++, and C#, D supports /*multiline comments*/ and //single-line comments (plus documentation comments, which we’ll get to later). One more interesting detail is the way our little program introduces its data. First, there are two constants:


Click here to view code image

immutable inchesPerFoot = 12;
immutable cmPerInch = 2.54;


Constants that will never, ever change are introduced with the keyword immutable. Constants, as well as variables, don’t need to have a manifest type; the actual type can be inferred from the value with which the symbol is initialized. In this case, the literal 12 tells the compiler that inchesPerFoot is an integer (denoted in D with the familiar int); similarly, the literal 2.54 causes cmPerInch to be a floating-point constant (of type double). Going forth, we notice that the definitions of feet and inches avail themselves of the same magic, because they look like variables all right, yet have no explicit type adornments. That doesn’t make the program any less safe than one that states:


Click here to view code image

immutable int inchesPerFoot = 12;
immutable double cmPerInch = 2.54;
...
foreach (int feet; 5 .. 7) {
   ...
}


and so on, only less redundant. The compiler allows omitting type declarations only when types can be unambiguously inferred from context. But now that types have come up, let’s pause for a minute and see what numeric types are available.

In order of increasing size, the signed integral types include byte, short, int, and long, having sizes of exactly 8, 16, 32, and 64 bits, respectively. Each of these types has an unsigned counterpart of the same size, named following a simple rule: ubyte, ushort, uint, and ulong. (There is no “unsigned” modifier as in C.) Floating-point types consist of float (32-bit IEEE 754 single-precision number), double (64-bit IEEE 754), and real (which is as large as the machine’s floating-point registers can go, but no less than 64 bits; for example, on Intel machines real is a so-called IEEE 754 double-extended 79-bit format).

Getting back to the sane realm of integral numbers, literals such as 42 can be assigned to any numeric type, but note that the compiler checks whether the target type is actually large enough to accommodate that value. So the declaration


Click here to view code image

immutable byte inchesPerFoot = 12;


is as good as the one omitting byte because 12 fits as comfortably in 8 bits as in 32. By default, if the target type is to be deduced from the number (as in the sample program), integral constants have type int and floating-point constants have type double.

Using these types, you can build a lot of expressions in D using arithmetic operators and functions. The operators and their precedence are much like the ones you’d find in D’s sibling languages: +, -, *, /, and % for basic arithmetic, ==, !=, <, >, <=, >= for comparisons, fun(argument1, argument2) for function calls, and so on.

Getting back to our centimeters-to-inches program, there are two noteworthy details about the call to writeln. One is that writeln takes five arguments (as opposed to one in the program that opened the hailing frequencies). Much like the I/O facilities found in Pascal (writeln), C (printf), or C++ (cout), D’s writeln function accepts a variable number of arguments (it is a “variadic function”). In D, however, users can define their own variadic functions (unlike in Pascal) that are always typesafe (unlike in C) without needing to gratuitously hijack operators (unlike in C++). The other detail is that our call to writeln awkwardly mixes formatting information with the data being formatted. Separating data from presentation is often desirable, so let’s use the formatted write function writefln instead:


Click here to view code image

writefln("%s'%s'' %s", feet, inches,
   (feet * inchesPerFoot + inches) * cmPerInch);


The newly arranged call produces exactly the same output, with the difference that writefln’s first argument describes the format entirely. % introduces a format specifier similar to C’s printf, for example, %i for integers, %f for floating-point numbers, and %s for strings.

If you’ve used printf, you’d feel right at home were it not for an odd detail: we’re printing ints and doubles here—how come they are both described with the %s specifier, which traditionally describes only strings? The answer is simple. D’s variadic argument facility gives writefln access to the actual argument types passed, a setup that has two nice consequences: (1) the meaning of %s could be expanded to “whatever the argument’s default string representation is,” and (2) if you don’t match the format specifier with the actual argument types, you get a clean-cut error instead of the weird behavior specific to misformatted printf calls (to say nothing about the security exploits made possible by printf calls with untrusted format strings).

1.2 Statements

In D, just as in its sibling languages, any expression followed by a semicolon is a statement (for example, the “Hello, world!” program’s call to writeln has a ; right after it). The effect of the statement is to simply evaluate the expression.

D is a member of the “curly-braces block-scoped” family, meaning that you can group several statements into one by surrounding them with { and }—something that’s necessary, for example, when you want to do several things inside a foreach loop. In the case of exactly one statement, you can omit the curly braces entirely. In fact, our entire height conversion double loop could be rewritten as follows:


Click here to view code image

foreach (feet; 5 .. 7)
  foreach (inches; 0 .. inchesPerFoot)
    writefln("%s'%s'' %s", feet, inches,
      (feet * inchesPerFoot + inches) * cmPerInch);


Omitting braces for single statements has the advantage of shorter code and the disadvantage of making edits more fiddly (during code maintenance, you’ll need to add or remove braces as you mess with statements). People tend to be pretty divided when it comes to rules for indentation and for placing curly braces. In fact, so long as you’re consistent, these things are not as important as they might seem, and as a proof, the style used in this book (full bracing even for single statements, opening braces on the introducing line, and closing braces on their own lines) is, for typographical reasons, quite different from the author’s style in everyday code. If he could do this without turning into a werewolf, so could anyone.

The Python language made popular a different style of expressing block structure by means of indentation—“form follows structure” at its best. Whitespace that matters is an odd proposition for programmers of some other languages, but Python programmers swear by it. D normally ignores whitespace but is especially designed to be easily parsed (e.g., parsing does not need to understand the meaning of symbols), which suggests that a nice pet project could implement a simple preprocessor allowing usage of Python indentation style with D without suffering any inconvenience in the process of compiling, running, and debugging programs.

The code samples above also introduced the if statement. The general form should be very familiar:


Click here to view code image

if (expressionstatement1 else statement2


A nice theoretical result known as the theorem of structure [10] proves that we can implement any algorithm using compound statements, if tests, and loops à la for and foreach. Of course, any realistic language would offer more than just that, and D is no exception, but for now let’s declare ourselves content as far as statements go and move on.

1.3 Function Basics

Let’s go beyond the required definition of the main function and see how to define other functions in D. Function definitions follow the model found in other Algol-like languages: first comes the return type, then the function’s name, and finally the formal parameters1 as a parenthesized comma-separated list. For example, to define a function called pow that takes a double and an int and returns a double, you’d write


Click here to view code image

double pow(double base, int exponent) {
  ...
}


Each function parameter (base and exponent in the example above) has, in addition to its type, an optional storage class that decides the way arguments are passed to the function when invoked. By default, arguments are passed into pow by value. If storage class ref is prepended to a parameter’s type, the parameter is bound directly to the incoming argument such that changing the parameter is directly and immediately reflected in the value received from the outside. For example:


Click here to view code image

import std.stdio;

void fun(ref uint x, double y) {
   x = 42;
   y = 3.14;
}
void main() {
   uint a = 1;
   double b = 2;
   fun(a, b);
   writeln(a, " ", b);
}


This program prints 42 2 because x is a ref uint, meaning that assigning to x really means assigning to a. On the other hand, assigning to y has no effect on b because y is a private copy at fun’s disposal.

The last adornments we’ll discuss in this brief introduction are in and out. Simply put, in is a promise on the part of the function that it wants only to look at, not touch, the parameter. Using out with a function parameter works similarly to ref, with the amendment that the parameter is forcibly initialized to its default value upon the function’s entry. (Each type T defines an initial value, denoted as T.init. User-defined types can define their own init.)

There is a lot more to say about functions. You can pass functions to other functions, nest them into one another, allow a function to save its local environment (full-fledged syntactic closures), create and comfortably manipulate unnamed functions (lambdas), and some additional juicy little bits. We will get to each of these in good order.

1.4 Arrays and Associative Arrays

Arrays and associative arrays (the latter colloquially referred to as hashtables or hashes) are arguably the most used compound data structures in the history of computing, enviously followed by Lisp’s lists. A lot of useful programs need no more than some sort of array and associative array, so it’s about time to see how D implements them.

1.4.1 Building a Vocabulary

For example, let’s write a simple program following this specification:

Read a text consisting of words separated by whitespace, and associate a unique number with each distinct word. Output lines of the form ID word.

This little script can be quite useful if you want to do some text processing; once you have built a vocabulary, you only need to manipulate numbers (cheaper), not full-fledged words. A possible approach to building such a vocabulary is to accumulate already seen words in a sort of a dictionary that maps words to integers. When adding a new mapping we only need to make sure the integer is unique (a solid option is to just use the current length of the dictionary, resulting in the IDs 0, 1, 2, ...). Let’s see how we can do that in D.


Click here to view code image

import std.stdio, std.string;

void main() {
   uint[string] dictionary;
   foreach (line; stdin.byLine()) {
      // Break sentence into words
      // Add each word in the sentence to the vocabulary
      foreach (word; splitter(strip(line))) {
         if (word in dictionary) continue// Nothing to do
         auto newID = dictionary.length;
         dictionary[word] = newID;
         writeln(newID, ' ', word);
      }
   }
}


In D, the type of an associative array (a hashtable) that maps values of type K to values of type V is denoted as V[K]. So the variable dictionary of type uint[string] maps strings to unsigned integers—just what we needed to store word-to-ID mappings. The expression word in dictionary is nonzero if the key word could be found in associative array dictionary. Finally, insertion in the dictionary is done with dictionary[word] = newID.

Although not made explicit in the script above, the type string is really an array of characters. Generally, dynamically sized arrays of T are denoted as T[] and are allocated in a number of ways, such as


Click here to view code image

int[] a = new int[20]; // 20 zero-initialized integers
int[] b = [ 1, 2, 3 ]; // An array containing 1, 2, and 3


Unlike C arrays, D arrays know their own length, accessible as arr.length for any array arr. Assigning to arr.length reallocates the array. Array accesses are bounds checked; code that enjoys risking buffer overruns can scare the pointer out of the array (by using arr.ptr) and then use unchecked pointer arithmetic. Also, a compiler option disables bounds checking if you really need everything that silicon wafer could give. This places the path of least resistance on the right side of safety: code is safe by default and can be made a tad faster with more work.

Here’s how to iterate over an array using a new form of the already familiar foreach statement:


Click here to view code image

int[] arr = new int[20];
foreach (elem; arr) {
   /* ... use elem ... */
}


The loop above binds elem to each element of arr in turn. Assigning to elem does not assign back to elements in arr. To change the array, just use the ref keyword:


Click here to view code image

// Zero all elements of arr
foreach (ref elem; arr) {
   elem = 0;
}


And now that we know how foreach works with arrays, let’s look into one more useful thing. If you also need the index of the array element while you’re iterating, foreach can do that for you:


Click here to view code image

int[] months = new int[12];
foreach (i, ref e; months) {
   e = i + 1;
}


The code above creates an array containing 1, 2, ..., 12. The loop is equivalent to the slightly more verbose code below, which uses foreach to iterate over a range of numbers:


Click here to view code image

foreach (i; 0 .. months.length) {
   months[i] = i + 1;
}


D also offers statically sized arrays denoted as, for example, int[5]. Outside a few specialized applications, dynamically sized arrays are to be preferred because more often than not you don’t know the size of the array in advance.

Arrays have shallow copy semantics, meaning that copying one array variable to another does not copy the entire array; it just spawns a new view to the same underlying storage. If you do want to obtain a copy, just use the dup property of the array:


Click here to view code image

int[] a = new int[100];
int[] b = a;
// ++x increments value x
++b[10];    // b[10] is now 1, as is a[10]
b = a.dup;  // Copy a entirely into b
++b[10];    // b[10] is now 2, a[10] stays 1


1.4.2 Array Slicing. Type-Generic Functions. Unit Tests

Array slicing is a powerful feature that allows referring to a portion of an array without actually copying array data. To exemplify, let’s write a function binarySearch implementing the eponymous algorithm: given a sorted array and a value, binarySearch quickly returns a Boolean value that tells whether the value is in the array. D’s standard library offers a function that does this in a more general way and returns something more informative than just a Boolean, but that needs to wait for more language features. Let us, however, bump our ambitions up just a notch, by setting to write a binarySearch that works not only with arrays of integers, but with arrays of any type as long as values of that type can be compared with <. It turns out that that’s not much of a stretch. Here’s what a generic binarySearch looks like:


Click here to view code image

import std.array;

bool binarySearch(T)(T[] input, T value) {
   while (!input.empty) {
      auto i = input.length / 2;
      auto mid = input[i];
      if (mid > value) input = input[0 .. i];
      else if (mid < value) input = input[i + 1 .. $];
      else return true;
   }
   return false;

}

unittest {
   assert(binarySearch([ 1, 3, 6, 7, 9, 15 ], 6));
   assert(!binarySearch([ 1, 3, 6, 7, 9, 15 ], 5));
}


The (T) notation in binarySearch’s signature introduces a type parameter T. The type parameter can then be used in the regular parameter list of the function. When called, binarySearch will deduce T from the actual arguments received. If you want to explicitly specify T (for example, for double-checking purposes), you may write


Click here to view code image

assert(binarySearch!(int)([ 1, 3, 6, 7, 9, 15 ], 6));


which reveals that a generic function can be invoked with two pairs of parenthesized arguments. First come the compile-time arguments enclosed in !(...), and then come the runtime arguments enclosed in (...). Mixing the two realms together has been considered, but experimentation has shown that such uniformity creates more trouble than it eliminates.

If you are familiar with similar facilities in Java, C#, or C++, you certainly noticed that D made a definite departure from these languages’ use of angle brackets < and > to specify compile-time arguments. This was a deliberate decision aimed at avoiding the crippling costs revealed by experience with C++, such as increased parsing difficulties, a hecatomb of special rules and arbitrary tiebreakers, and obscure syntax to effect user-directed disambiguation.2 The difficulty stems from the fact that < and > are at their heart comparison operators,3 which makes it very ambiguous to use them as delimiters when expressions are allowed inside those delimiters. Such would-be delimiters are very difficult to pair. Java and C# have an easier time exactly because they do not allow expressions inside < and >, but that limits their future extensibility for the sake of a doubtful benefit. D does allow expressions as compile-time arguments and chose to simplify the life of both human and compiler by extending the traditional unary operator ! to binary uses and using the classic parentheses (which (I’m sure) you always pair properly).

Another detail of interest in binarySearch’s implementation is the use of auto to leverage type deduction: i and mid have their types deduced from their initialization expressions.

In keeping with good programming practices, binarySearch is accompanied by a unit test. Unit tests are introduced as blocks prefixed with the unittest keyword (a file can contain as many unittests as needed, and you know what it’s like—too many are almost enough). To run unit tests before main is entered, pass the -unittest flag to the compiler. Although unittest looks like a small feature, it helps you observe good programming style by making it so easy to insert small tests that it’s embarrassing not to. Also, if you’re a top-level thinker who prefers to see the unittest first and the implementation second, feel free to move unittest before binarySearch; in D, the semantics of a module-level symbol never depends on its relative ordering with others.

The slice expression input[a .. b] returns a slice of input from index a up to, and excluding, index b. If a == b, an empty slice is produced, and if a > b, an exception is thrown. A slice does not trigger a dynamic memory allocation; it’s just an alias for a part of the array. Inside an index expression or a slice expression, $ stands in for the length of the array being accessed; for example, input[0 .. $] is exactly the same thing as input.

Again, although it might seem that binarySearch does a lot of array shuffling, no array is ever allocated; all of input’s slices share space with the original input. The implementation is in no way less efficient than a traditional one maintaining indices but is arguably easier to understand because it manipulates less state. Speaking of state, let’s write a recursive implementation of binarySearch that doesn’t reassign index at all:


Click here to view code image

import std.array;

bool binarySearch(T)(T[] input, T value) {
   if (input.empty) return false;
   auto i = input.length / 2;
   auto mid = input[i];
   if (mid > value) return binarySearch(input[0 .. i]);
   if (mid < value) return binarySearch(input[i + 1 .. $]);
   return true;
}


The recursive implementation is arguably simpler and terser than its iterative counterpart. It’s also every bit as efficient because the recursive calls can easily be optimized away by a popular compiler technique known as tail call elimination: in brief, if a function’s return simply calls itself with different arguments, the compiler modifies the argument and issues a jump to the beginning of the function.

1.4.3 Counting Frequencies. Lambda Functions

Let’s set out to write another useful program: counting distinct words in a text. Want to know what words were used most frequently in Hamlet? You’re in the right place.

The following program uses an associative array mapping strings to uints and has a structure similar to the vocabulary-building example. Adding a simple printing loop completes a useful frequency-counting program:


Click here to view code image

import std.stdio, std.string;

void main() {
  // Compute counts
  uint[string] freqs;
  foreach (line; stdin.byLine()) {
    foreach (word; split(strip(line))) {
      ++freqs[word.idup];
    }
  }
  // Print counts
  foreach (key, value; freqs) {
    writefln("%6u %s", value, key);
  }
}


All right, now after downloading hamlet.txt off the Net (you can find a permanent link at http://erdani.com/tdpl/hamlet.txt), running our little program against the Bard’s chef d’oeuvre prints


Click here to view code image

1  outface
1  come?
1  blanket,
1  operant
1  reckon
2  liest
1  Unhand
1  dear,
1  parley.
1  share.
...


which sadly reveals that output doesn’t come quite ordered, and that whichever words come first are not quite the most frequent. This isn’t surprising; in order to implement their primitives as fast as possible, associative arrays are allowed to store them internally in any order.

In order to sort output with the most frequent words first, you can just pipe the program’s output to sort -nr (sort numerically and reversed), but that’s in a way cheating. To integrate sorting into the program, let’s replace the last loop with the following code:


Click here to view code image

// Print counts
string[] words = freqs.keys;
sort!((a, b) { return freqs[a] > freqs[b]; })(words);
foreach (word; words) {
   writefln("%6u %s", freqs[word], word);
}


The property .keys yields only the keys of the freqs associative array as an array of strings. The array is newly allocated, which is necessary because we need to shuffle the strings. We now get to the code


Click here to view code image

sort!((a, b) { return freqs[a] > freqs[b]; })(words);


which features the pattern we’ve already seen:


Click here to view code image

sort!(compile-time arguments)(runtime arguments);


Peeling one layer of parentheses off !(...), we reach this notation, which looks like an incomplete function that forgot to mention parameter types, result type, and the name of the function itself:


Click here to view code image

(a, b) { return freqs[a] > freqs[b]; }


This is a lambda function—a short anonymous function that is usually meant to be passed to other functions. Lambda functions are so useful in so many places, D did its best to eliminate unnecessary syntactic baggage from defining a lambda: parameter types as well as the return type are deduced. This makes a lot of sense because the body of the lambda function is by definition right there for the writer, the reader, and the compiler to see, so there is no room for misunderstandings and no breakage of modularity principles.

There is one rather subtle detail to mention about the lambda function defined in this example. The lambda function accesses the freqs variable that is local to main; that is, it is not a global or a static. This is more like Lisp than C and makes for very powerful lambdas. Although traditionally such power comes at a runtime cost (by requiring indirect function calls), D guarantees no indirect calls (and consequently full opportunities for inlining).

The modified program outputs


Click here to view code image

929  the
680  and
625  of
608  to
523  I
453  a
444  my
382  in
361  you
358  Ham.
...


which is as expected, with commonly used words being the most frequent, with the exception of “Ham.” That’s not to indicate a strong culinary preference of the dramatis personae, it’s just the prefix of all of Hamlet’s lines. So apparently he has some point to make 358 times throughout, more than anyone else. If you browse down the list, you’ll see that the next speaker is the king with only 116 lines—fewer than a third of Hamlet’s. And at 58 lines, Ophelia is downright taciturn.

1.5 Basic Data Structures

Now that we’ve gotten into Hamlet, let’s analyze the text a bit further. For example, for all dramatis personae, we’d like to collect some information, such as how many words they say in total, and how rich their vocabulary is. To do that, we need to associate several data items with one persona.4 To group such information in one place, we can define a data structure as follows:


Click here to view code image

struct PersonaData {
   uint totalWordsSpoken;
   uint[string] wordCount;
}


In D you get structs and then you get classes. They share many amenities but have different charters: structs are value types, whereas classes are meant for dynamic polymorphism and are accessed solely by reference. That way confusion, slicing-related bugs, and comments á la // No! Do NOT inherit! do not exist. When you design a type, you decide up front whether it’ll be a monomorphic value or a polymorphic reference. C++ famously allows defining ambiguous-gender types, but their use is rare, error-prone, and objectionable enough to warrant simply avoiding them by design.

In our case, we just need to collect some data and we have no polymorphic ambitions, so using struct is a good choice. Let’s now define an associative array mapping persona names to PersonaData values:


Click here to view code image

PersonaData[string] info;


All we have to do is fill info appropriately from hamlet.txt. This needs some work because a character’s paragraph may extend for several lines, so we need to do some simple processing to coalesce physical lines into paragraphs. To figure out how to do that, let’s take a look at a short fragment from hamlet.txt, dumped verbatim below (with leading spaces made visible for clarity):

Whether or not Polonius’s enthusiasm about goto was a factor in his demise is, even to this day, a matter of speculation. Regardless of that, let’s note how each character’s line is preceded by exactly two spaces, followed by the (possibly contracted) character’s name, followed by a period and a space, finally followed by the actual content of the line. If a logical line extends to multiple physical lines, the continuations are always preceded by exactly four spaces. We could do such simple pattern matching by using a regular expression engine (found in the std.regex module), but we want to learn arrays so let’s match things “by hand.” We enlist the help of only the Boolean function a.startsWith(b), defined by std.algorithm, which tells whether a starts with b.

The main driver reads input lines, concatenates them in logical paragraphs (ignoring everything that doesn’t fit our pattern), passes complete paragraphs to an accumulator function, and at the end prints the desired information.


Click here to view code image

import std.algorithm, std.conv, std.ctype, std.regex,
   std.range, std.stdio, std.string;

struct PersonaData {
   uint totalWordsSpoken;
   uint[string] wordCount;
}

void main() {
   // Accumulates information about dramatis personae
   PersonaData[string] info;
   // Fill info
   string currentParagraph;
   foreach (line; stdin.byLine()) {
      if (line.startsWith(" ")
            && line.length > 4
            && isalpha(line[4])) {
         // Persona is continuing a line
         currentParagraph ~= line[3 .. $];
      } else if (line.startsWith(" ")
            && line.length > 2
            && isalpha(line[2])) {
         // Persona just started speaking
         addParagraph(currentParagraph, info);
         currentParagraph = to!string(line[2 .. $]);
      }
   }
   // Done, now print collected information
   printResults(info);
}


After we’ve equipped ourselves with information on how arrays work, the code should be self-explanatory, save for the presence of to!string(line[2 .. $]). Why is it needed, and what if we forgot about it?

The foreach loop that reads from stdin deposits successive lines of text in the variable line. Because it would be wasteful to allocate a brand-new buffer for each line read, byLine reuses the contents of line every pass through the loop. The type of line itself is char[]—an array of characters.

As long as you just inspect each line as it comes and then forget about it, everything works smoothly. But code that wants to squirrel away the contents of a line better makes a copy of it. Obviously currentParagraph is meant to indeed save text, so duplication is needed; hence the presence of to!string, which converts any expression into a string. The string type itself is impossible to overwrite, and to takes care of whatever duplication is necessary to observe that guarantee.

Now, if we forgot to!string and subsequently the code still compiled, the results would have been nonsensical and the bug rather hard to find. Having a part of a program modify data held in a different part of the program is very unpleasant to track down because it’s a non-local effect (just how many to calls could one forget in a large program?). Fortunately, that’s not the case because the types of line and currentParagraph reflect their respective capabilities: line has type char[], that is, an array of characters that could be overwritten at any time; whereas currentParagraph has type string, which is an array of characters that cannot be individually modified. (For the curious: the full name of string is immutable(char)[], which means precisely “contiguous region of immutable characters.” We’ll get to talk more about strings in Chapter 4.) The two cannot refer to the same memory content because line would break the promise of currentParagraph. So the compiler refuses to compile the erroneous code and demands a copy, which you provide in the form of the conversion to!string, and everybody’s happy.

On the other hand, when you copy string values around, there’s no more need to duplicate the underlying data—they can all refer to the same memory because it’s known neither will overwrite it, which makes string copying at the same time safe and efficient. Better yet, strings can be shared across threads without problems because, again, there’s never contention. Immutability is really cool indeed. If, on the other hand, you need to modify individual characters intensively, you may want to operate on char[], at least temporarily.

PersonaData as defined above is very simple, but in general structs can define not only data, but also other entities such as private sections, member functions, unittests, operators, constructors, and destructor. By default, each data member of a structure is initialized with its default initializer (zero for integral numbers, Not a Number (NaN) for floating-point numbers,5 and null for arrays and other indirect-access types). Let’s now implement addParagraph, which slices and dices a line of text and puts it into the associative array.

The line as served by main has the form "Ham. To be, or not to be- that is the question." We need to find the first ". " to distinguish the persona’s name from the actual line. To do so, we use the find function. haystack.find(needle) returns the right-hand portion of haystack starting with the first occurrence of needle. (If no occurrence is found, find returns an empty string.) While we’re at it, we should also do a little cleanup while collecting the vocabulary. First, we must convert the sentence to lowercase such that capitalized and non-capitalized words count as the same vocabulary element. That’s easily taken care of with a call to tolower. Second, we must eliminate a strong source of noise: punctuation that makes, for example, “him.” and “him” count as distinct words. To clean up the vocabulary, all we need to do is pass an additional parameter to split mentioning a regular expression that eliminates all chaff: regex("[ ,.;:?]+"). With that argument, the split function will consider any sequence of the characters mentioned in between [ and ] as part of word separators. That being said, we’re ready to do a lot of good stuff in just a little code:


Click here to view code image

void addParagraph(string line, ref PersonaData[string] info) {
   // Figure out persona and sentence
   line = strip(line);
   auto sentence = std.algorithm.find(line, ". ");
   if (sentence.empty) {
      return;
   }
   auto persona = line[0 .. $ - sentence.length];
   sentence = tolower(strip(sentence[2 .. $]));
   // Get the words spoken
   auto words = split(sentence, regex("[  ,.;:?]+"));
   // Insert or update information
   if (!(persona in info)) {
      // First time this persona speaketh
      info[persona] = PersonaData();
   }
   info[persona].totalWordsSpoken += words.length;
   foreach (word; words) ++info[persona].wordCount[word];
}


The bulk of addParagraph consists of updating the associative array. In case the person hasn’t been heard from yet, the code inserts an empty, default-constructed PersonaData object in the associative array. Since the default-constructed uint is zero and the default-constructed associative array is empty, the newly inserted slot is ready to start absorbing meaningful information.

Finally, let’s implement printResults to print a quick summary for each persona:


Click here to view code image

void printResults(PersonaData[string] info) {
  foreach (persona, data; info) {
     writefln("%20s %6u %6u", persona, data.totalWordsSpoken,
        data.wordCount.length);
  }
}


Ready for a test drive? Save and run!


Click here to view code image

     Queen   1104     500
       Ros    738     338
       For     55      45
      Fort     74      61
 Gentlemen      4       3
     Other    105      75
      Guil    349     176
       Mar    423     231
      Capt     92      66
      Lord     70      49
      Both     44      24
       Oph    998     401
     Ghost    683     350
       All     20      17
    Player     16      14
      Laer   1507     606
       Pol   2626     870
    Priest     92      66
       Hor   2129     763
      King   4153    1251
Cor., Volt     11      11
 Both [Mar      8       8
       Osr    379     179
      Mess    110      79
    Sailor     42      36
   Servant     11      10
Ambassador     41      34
      Fran     64      47
     Clown    665     298
      Gent    101      77
       Ham  11901    2822
       Ber    220     135
      Volt    150     112
       Rey     80      37


Now that’s some fun stuff. Unsurprisingly, our friend “Ham” gets the lion’s share by a large margin. Voltemand’s (“Volt”) role is rather interesting: he doesn’t have many words to say, but in these few words he does his best to display a solid vocabulary, not to mention the Sailor, who hardly repeats a word. Also compare the well-rounded Queen with Ophelia: the Queen has about 10% more words to say than Ophelia, but her vocabulary is no less than 25% larger.

The output has some noise in it (such as "Both [Mar"), easy for a diligent programmer to fix and hardly affecting the important statistics. Nevertheless, fixing the last little glitches would be an instructive (and recommended) exercise.

1.6 Interfaces and Classes

Object-oriented features are important for large projects; therefore, introducing them by means of small examples is at high risk of looking goofy. Add to that a pressing desire to stay away from overused examples featuring shapes, animals, or employees, and we’re faced with quite a pickle. Oh, and there’s one more thing—small examples usually gloss over the issue of polymorphic object creation, which is important. Talk about writer’s block! Fortunately, the real world provides a useful example in the form of a problem that’s relatively small, yet has no satisfactory procedural solution. The code we’ll discuss below is the rewrite of a small useful awk script that had grown well beyond the implicit limits set by its design. We will work together toward an object-oriented solution that is at the same time small, complete, and elegant.

Consider writing a small statistics program called stats with a simple interface: stats takes the statistical functions to compute as command-line parameters, gathers the numbers to operate on via standard input as a whitespace-separated list, and prints the statistical results one per line. Here is a sample session:


Click here to view code image

$ echo 3 5 1.3 4 10 4.5 1 5 | stats Min Max Average
1
10
4.225
$ _


A quick-and-dirty script can perform such tasks with no problem, yet the “dirty” tends to overshadow the “quick” as the number of statistical functions grows. So let’s put together a better solution. For now, we start with the simplest statistical functions: minimum, maximum, and average. After we figure out an extensible design, the door is open to implementing more complex statistical functions.

A simple way to approach things is to just loop through the input and compute all needed statistics. This is not a scalable design because each time we need to add a new statistical function, we’d have to do surgery on existing code. The modifications will be nontrivial if we want to perform only the computations asked for in the command line. Ideally, we’d confine each statistical function to one contiguous piece of code. That way, we can add new functionality to the program by simply appending new code—the Open-Closed principle [39] at its best.

Such an approach entails figuring out what all, or at least most, statistical functions have in common, with the goal of manipulating them all from one place and in a uniform manner. Let’s start by remarking that Min and Max take their input one number at a time and have the result handy as soon as the input is finished. The final result is only one number. In addition, Average must do a post-processing step (divide the accumulated sum by the number of inputs). Moreover, each algorithm maintains its own state. When different computations obey a uniform interface and need to keep state, it makes sense to make them objects and define a formal interface to manipulate any and all of them.


Click here to view code image

interface Stat {
   void accumulate(double x);
   void postprocess();
   double result();
}


An interface defines a required behavior as a set of functions. Of course, anyone claiming to implement the interface must define all functions as specified by their declarations. Speaking of implementation, let’s see how we can define Min to obey Stat’s iron fist:


Click here to view code image

class Min : Stat {
   private double min = double.max;
   void accumulate(double x) {
      if (x < min) {
         min = x;
      }
   }
   void postprocess() {} // Nothing to do
   double result() {
      return min;
   }
}


Min is a class—a user-defined type that brings lots of object orientation goodies into D. Min manifestly implements Stat through the syntax class Min : Stat and indeed defines Stat’s three functions exactly with the same arguments and return types (otherwise the compiler would not have allowed Min to get away with it). Min keeps only one private member variable min, which is the smallest value seen so far, and updates it inside accumulate. The initial value of Min is the largest possible number, such that the first input number will replace it.

Before defining more statistical functions, let’s write a driver for our stats program that reads the command-line parameters, creates the appropriate objects to do computations (such as Min when Min is passed in the command line), and uses the objects through the interface Stat.


Click here to view code image

import std.contracts, std.stdio;

void main(string[] args) {
   Stat[] stats;
   foreach (arg; args[1 .. $]) {
      auto newStat = cast(Stat) Object.factory("stats." ~ arg);
      enforce(newStat, "Invalid statistics function: " ~ arg);
      stats ~= newStat;
   }
   for (double x; readf(" %s ", &x) == 1; ) {
      foreach (s; stats) {
         s.accumulate(x);
      }
   }
   foreach (s; stats) {
      s.postprocess();
      writeln(s.result());
   }
}


This program does quite a lot but is only one mouthful. First off, main has a signature different from what we’ve seen so far—it takes an array of strings. The D runtime support initializes the array from the command-line parameters. The first loop initializes the stats array from args. Given that in D (as in other languages) the first argument is the name of the program itself, we skip that first argument by taking the slice args[1 .. $]. We now hit the statement


Click here to view code image

auto newStat = cast(Stat) Object.factory("stats." ~ arg);


which is quite long, but, to quote a sitcom cliché, “I can explain.” First, ~, when used as a binary operator, concatenates strings, so if the command-line argument was Max, the concatenation results in the string "stats.Max", which is passed to the function Object.factory. Object is the root of all class objects, and it defines the static method factory that takes a string, looks up a little database built during compilation, magically creates an object of the type named by the passed-in string, and returns it. If the class is not present, Object.factory returns null. For that call to succeed, all you need is to have a class called Max defined somewhere in the same file. Creating an object given the name of its type is an important facility with many useful applications—so important, in fact, that some dynamic languages make it a central feature; languages with a more static approach to typing need to rely on runtime support (such as D or Java) or leave it to the programmer to devise a manual registration and discovery mechanism.

Why stats.Max and not just Max? D is serious about modularity so it does not have a global namespace in which anyone can put anything. Each symbol lives in a named module, and by default the name of the module is the base name of the source file. So given that our file is called stats.d, D reckons that every name defined in that file belongs to module stats.

There is one more hitch left. The static type of the just-obtained Min object is actually not Min. That sounds dumb, but it’s justified by the fact that you could create any object by invoking Object.factory("whatever"), so the return type should be some common denominator of all possible object types—Object, that is. To get the appropriate handle on the newly created object, we must make it into a Stat object, an operation known as casting. In D, the expression cast(T) expr casts expression expr into type T. Casts involving class and interface types are always checked, so our code is foolproof.

Looking back, we notice that we’ve done a lot of solid work in main’s first five lines. That was the hardest part, because the rest of the code writes itself. The second loop reads one number at a time (readf takes care of that) and calls accumulate for all statistical objects. The readf function returns the number of items read successfully according to the specified format. In our case, the format is "%s", which means one item surrounded by any amount of whitespace. (The item’s type is decided by the type of the element being read, in this case x of type double.) Finally, the program prints all results.

1.6.1 More Statistics. Inheritance

Implementing Max is as trivial as implementing Min; aside from a slight change in accumulate, everything is exactly the same. Whenever a new task looks a lot like an old one, “interesting” and not “boring” is what should come to mind. A repetitive task is an opportunity for reuse, and rightly languages that can better exploit various flavors of similarity should rate higher on a certain quality scale. What we need to figure out is the particular kind of similarity that Min and Max (and we hope other statistical functions) enjoy. As we think it through, it looks like they both belong to the kind of statistical functions that build their result incrementally and need only one number to characterize the result. Let’s call this category of statistical functions, incremental functions.


Click here to view code image

class IncrementalStat : Stat {
   protected double _result;
   abstract void accumulate(double x);
   void postprocess() {}
   double result() {
      return _result;
   }
}


An abstract class can be seen as a partial commitment: it implements a number of methods, but not all, and as such cannot work stand-alone. The way to materialize an abstract class is to inherit it and complete its implementation. IncrementalStat takes care of Stat’s boilerplate code but leaves accumulate to be implemented by the derived class. Here’s what the new Min looks like:


Click here to view code image

class Min : IncrementalStat {
   this() {
      _result = double.max;
   }
   void accumulate(double x) {
      if (x < _result) {
         _result = x;
      }
   }
}


Class Min defined a constructor, too, in the form of a special function called this(), needed to initialize the result appropriately. Even with the constructor in place, the resulting code marks good savings from the initial state of affairs, particularly if we take into account the fact that many other statistical functions follow a similar pattern (e.g., sum, variance, average, standard deviation). Let’s look at implementing average, because it’s a great occasion to introduce a couple of more concepts:


Click here to view code image

class Average : IncrementalStat {
   private uint items = 0;
   this() {
      _result = 0;
   }
   void accumulate(double x) {
      _result += x;
      ++items;
   }
   override void postprocess() {
      if (items) {
         _result /= items;
      }
   }
}


First off, Average introduces one more member variable, items, which is initialized with zero through the syntax = 0 (just to showcase initialization syntax, but redundant in this case because integral types are zero-initialized anyway, as discussed on page 17). Second, Average defines a constructor that sets result to zero; this is because, unlike minimum or maximum, the average of zero numbers is defined to be zero. Although it might seem that initializing result with NaN just to overwrite it later with zero is needless busywork, optimizing away the so-called dead assignment is low-hanging fruit for any optimizer. Finally, Average overrides postprocess even though IncrementalStat already defined it. In D, by default, you can override (inherit and redefine) member functions of all classes, but you must specify override so as to avoid various accidents (e.g., failing to override because of some typo or a change in the base type, or overriding something by mistake). If you prepend final to a member function, that prohibits derived classes from overriding the function, effectively stopping the dynamic method lookup mechanism.

1.7 Values versus References

Let’s run a simple experiment:


Click here to view code image

import std.stdio;

struct MyStruct {
   int data;
}
class MyClass {
   int data;
}

void main() {
   // Play with a MyStruct object
   MyStruct s1;
   MyStruct s2 = s1;
   ++s2.data;
   writeln(s1.data); // Prints 0
   // Play with a MyClass object
   MyClass c1 = new MyClass;
   MyClass c2 = c1;
   ++c2.data;
   writeln(c1.data); // Prints 1
}


It seems like playing with a MyStruct object is quite a different game from playing with a MyClass object. In both cases we create a variable that we copy into another variable, after which we modify the copy (recall that ++ is a unary operator that increments its argument). The experiment seems to reveal that after a copy, c1 and c2 refer to the same underlying storage, while on the contrary, s1 and s2 have independent lives.

The behavior of MyStruct obeys value semantics: each variable refers to exactly one value, and assigning one variable to another really means copying the state of the variable over the state of the other variable. The source of the copy is unchanged, and the two variables continue to evolve independently. The behavior of MyClass obeys reference semantics: values are created explicitly (in our case by invoking newMyClass), and assigning a class variable to another simply means that the two variables refer to the same value.

Value semantics are easy to deal with, simple to reason about, and allow efficient implementation for small sizes. On the other hand, nontrivial programs are difficult to implement without some means to refer to a value without copying it. Value semantics alone preclude, for example, forming self-referential types (lists or trees), or mutually referential structures such as a child window knowing about its parent window. Any serious language implements some sort of reference semantics; it could be argued that it all depends on where the default is. C has value semantics exclusively and allows forming references explicitly, by means of pointers. In addition to pointers, C++ also defines reference types. Interestingly, pure functional languages are free to use reference or value semantics as they see fit, because user code cannot tell the difference. This is because pure functional languages don’t allow mutation, so you can’t tell if they snuck a copy of a value or just a reference to it—it’s frozen anyway, so you couldn’t verify whether the value is shared by changing it. On the contrary, pure object-oriented languages are traditionally mutation-intensive and employ reference semantics exclusively, some to the extent of allowing a disconcerting amount of flexibility such as changing system-wide constants dynamically. Finally, some languages take a hybrid approach, embracing both value and reference types, with various levels of commitment.

D makes a systematic approach to the hybrid method. To define reference types you use class. To define value types or hybrid types you use struct. As Chapters 6 and 7 (respectively) describe in detail, each of these type constructors is endowed with amenities specific to this fundamental design choice. For example, structs do not have support for dynamic inheritance and polymorphism (the kind we’ve shown in the stats program above), as such behaviors are not compatible with value semantics. Dynamic polymorphism of objects needs reference semantics, and any attempt to mess with that can only lead to terrible accidents. (For example, a common danger to watch for in C++ is slicing, i.e., suddenly stripping the polymorphic abilities of an object when inadvertently using it as a value. In D, slicing could never occur.)

A closing thought is that structs are arguably a more flexible design choice. By defining a struct, you can tap into any semantics that you want, be it eager-copy value, lazy copying á la copy-on-write or reference counting, or anything in between. You can even define reference semantics by using class objects or pointers inside your struct object. On the other hand, some of these stunts may require quite advanced technical savvy; in contrast, using classes offers simplicity and uniformity across the board.

1.8 Summary

Because of the introductory nature of this chapter, some concepts and examples glossed over a few details and assumed passing familiarity with some others. Also, an experienced programmer could easily find ways to complete and improve the examples.

With luck, this chapter included something for everyone. If you are the kind of practical, no-nonsense coder, you might have looked at the cleanliness of arrays and associative arrays with an appreciative eye. These two concepts alone make a world of difference in simplifying code day in and day out, for projects small and large. If you enjoy object orientation, no doubt interfaces and classes seemed dearly familiar to you and suggest good upward scalability of the language to large projects. If you need to use D for short scripts as well, this chapter has shown that short scripts that manipulate files are easy to write and get running.

As always, the whole story is a fair amount longer. However, it’s useful to get back to the basics and make sure that simple things remain simple.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.139.67.5