© Thomas Mailund 2018

Thomas Mailund, Domain-Specific Languages in R, https://doi.org/10.1007/978-1-4842-3588-1_1

1. Introduction

Thomas Mailund

(1)Aarhus N, Staden København, Denmark

This book introduces embedded domain-specific languages in R. The term domain-specific languages, or DSL, refers to programming languages specialized for a particular purpose, as opposed to general-purpose programming languages. Domain-specific languages ideally give you a precise way of specifying tasks you want to do and goals you want to achieve, within a specific context. Regular expressions are one example of a domain-specific language, where you have a specialized notation to express patterns of text. You can use this domain-specific language to define text strings to search for or specify rules to modify text. Regular expressions are often considered very hard to read, but they do provide a useful language for describing text patterns. Another example of a domain-specific language is SQL—a language specialized for extracting from and modifying a relational database. With SQL, you have an expressive domain-specific language in which you can specify rules as to which data points in a database you want to access or modify.

Who This Book Is For

This book is aimed at experienced R programmers. Some of the concepts we cover in this book are advanced, so at the very least you should be familiar with functional and object-oriented programming in R (although the next chapter will review the object-oriented programming features we will use). It will be helpful to have some experience with meta-programming when it comes to evaluating expressions in contexts that interact with the surrounding R code. However, Chapter 7 gives a crash course in the techniques we will use in this book, so you should be able to pick it up with a little effort from there.

Domain-Specific Languages

With domain-specific languages we often distinguish between “external” and “embedded” languages. Regular expressions and SQL are typically specified as strings when you use them in a program, and these strings must be parsed and interpreted when your program runs. In a sense, they are languages separated from the programming language you use them in. They need to be compiled separately and then called by the main programming language. They are therefore considered “external” languages. In contrast, embedded domain-specific languages provide domain-specific languages expressed in the general-purpose language in which they are used. In R, the grammar of graphics implemented in ggplot2 or the data transformation operations implemented in dplyr provides small languages—domain-specific languages—that you can use from within R, and you write the programs for these languages in R as well.

Embedded DSLs extend the programming language in which you are working. They provide more than what you usually find in a framework in the form of functions and classes as they offer a high level of flexibility in what you can do with them. They are programming languages, after all, and you can express complex ideas and tasks in them. They provide a language for expressing thoughts in a specific domain, so they do not give you a general programming language as the language you use them from, but they do extend that surrounding language with new expressive power. However, being embedded in the general-purpose language means that they will follow the rules you are familiar with there—or mostly, at least, since in languages such as R it is possible to modify the rules a bit using so-called non-standard evaluation . You can expect the syntax of the embedded DSL to follow the rules of the general-purpose language. The semantics will be determined by the DSL, but when you write programs in the DSL, the syntax is already familiar to you. If you implement a DSL yourself, embedding it in a general-purpose language lets you reuse the parsing and evaluation done by the general-purpose language so that you can focus on designing the domain-specific language.

Implementing embedded domain-specific languages often involves meta-programming ; that is, it consists of treating the program you are writing as data that you can manipulate from within the language itself. This might sound more complicated than it is, but quite often, it is reasonably straightforward to achieve. Using classes and operator overloading, we can use R’s parser to parse embedded languages by simply designing the language such that evaluating expressions automatically parse them. This leaves us with data structures we, ourselves, have defined, without us having to parse anything, and we can rewrite the results of such parsed expressions in various ways before we evaluate them to run the embedded program. Evaluating expressions can be relatively straightforward or involve a deeper analysis of the parsed expressions.

To get a feel for what we can do with embedded domain-specific languages, let’s consider a simple DSL: matrix multiplication (an example we cover in more detail in Chapter 2). You might not think of matrix expressions as much of a programming language, but the arithmetic notation is highly efficient for expressing ideas in a limited domain. Just imagine having to do mathematics without this notation. Of course, R already supports this language—it has infix operators and the semantics we associate with arithmetic expressions when we write them in an R program. However, since matrix multiplication is a well-known task, it serves as an excellent example to illustrate some of the things we can do if we extend R with other smaller programming languages.

R already supports arithmetic with matrices, and if you use the operator %*%, you can do matrix multiplication (if you use *, you will do component-wise multiplication instead). Multiplications are done one at a time, so if you have a series of them, such as this:

A %*% B %*% C %*% D

then the product will be computed from left to right, like this:

((A %*% B) %*% C) %*% D

For each multiplication, you produce a new matrix that will be used in the next multiplication.

Now, matrix multiplication is associative, so you should be able to set the parentheses in any way, as long as you respect the left-to-right order of the matrices (matrix multiplication is not commutative, after all), and you will get the same result. The running time, however, will not be the same. We can do a small experiment to see this using the microbenchmark package.

A <- matrix(1, nrow = 400, ncol = 300)
B <- matrix(1, nrow = 300, ncol = 30)
C <- matrix(1, nrow = 30, ncol = 500)
D <- matrix(1, nrow = 500, ncol = 400)


library(microbenchmark)
res <- microbenchmark(A %*% B %*% C %*% D,
                      ((A %*% B) %*% C) %*% D,
                      (A %*% (B %*% C)) %*% D,
                      (A %*% B) %*% (C %*% D),
                      A %*% (B %*% (C %*% D)),
                      A %*% ((B %*% C) %*% D))


options(microbenchmark.unit="relative")
print(res, signif = 3, order = "mean")


## Unit: relative
##                     expr  min   lq mean median
##  (A %*% B) %*% (C %*% D) 1.00 1.00 1.00   1.00
##  A %*% (B %*% (C %*% D)) 3.92 3.87 3.49   3.84
##      A %*% B %*% C %*% D 6.13 6.06 5.42   6.03
##  ((A %*% B) %*% C) %*% D 6.12 6.05 5.51   6.04
##  A %*% ((B %*% C) %*% D) 7.71 7.62 6.75   7.57
##  (A %*% (B %*% C)) %*% D 9.88 9.76 8.73   9.69
##    uq  max neval
##  1.00 1.00   100
##  3.62 1.41   100
##  5.57 2.06   100
##  5.61 2.35   100
##  7.00 2.30   100
##  8.99 3.71   100

Here, I’ve computed the matrix product in the five different possible ways. There are six expressions, but the first two will compute the matrix multiplication in the same order. With microbenchmark we compute each expression 100 times and collect the time each evaluation takes. We collect the time it takes to compute each expression, and here I have displayed the running time relative to the fastest expression, sorted by the mean evaluation time.

On average, there is almost a factor of ten between the fastest and the slowest evaluation (for the slowest evaluations in the two cases the difference is a factor of two, which is still a substantial relative difference). There is something to be gained by setting parentheses optimally if we multiply together several large matrices. The dimensions of matrices are not necessarily known before runtime, however, so ideally we want to set the parentheses when we evaluate expressions in an optimal way.

The approach we take in Chapter 2 is to delay the evaluation of matrix multiplication and instead build a data structure for matrix expressions, one we can evaluate later when we have the entire matrix multiplication expression constructed. It is a simple DSL, but it contains all the components we typically need in one: we need code for parsing an expression and creating a representation of it, we need to do some manipulation of expressions, and then we need to evaluate them.

For parsing expressions, we need to capture matrices and multiplications. We wrap matrices in a class to make them objects of our language, rather than plain R data.

m <- function(data) {
  structure(data,
            nrow = nrow(data),
            ncol = ncol(data),
            class = c("matrix_expr", class(data)))
}

The class matrix_expr is one we will use to overload the multiplication operator, and we want all elements of this class to know their dimensions, so when we wrap the data, we save the number of rows, nrow, and the number of columns, ncol.

We also want to capture matrix multiplications, where we do not evaluate the multiplication but simply save references to the matrices that we want to multiply together.

matrix_mult <- function(A, B) {
  structure(list(left = A, right = B),
            nrow = nrow(A),
            ncol = ncol(B),
            class = c("matrix_mult", "matrix_expr"))
}

Of course, we do not want to write expressions as follows:

matrix_mult(matrix_mult(m(A), m(B), matrix_mult(m(C), m(D))))

so we overload the * operator:

`*.matrix_expr` <- function(A, B) {
  matrix_mult(A, B)
}

Now, this expression:

m(A) * m(B) * m(C) * m(D)

is our new syntax for the matrix multiplication, shown here:

A %*% B %*% C %*% D

except that the former expression only constructs a data structure representing the expression. It does not evaluate it.

When we need to evaluate a matrix multiplication, we want to analyze the delayed evaluation and rearrange the multiplication to get the optimal performance. In Chapter 2 we will implement the functions rearrange_matrix_mult and eval_matrix_mult that do this. Here, we just define a function, v, for evaluating a matrix multiplication:

v <- function(expr)
    eval_matrix_expr(rearrange_matrix_expr(expr))

We can compare this automatic parentheses setting procedure with the default evaluation and the optimal evaluation order we saw earlier.

res <- microbenchmark(A %*% B %*% C %*% D,
                      (A %*% B) %*% (C %*% D),
                      v(m(A) * m(B) * m(C) * m(D)))


options(microbenchmark.unit="relative")
print(res, signif = 3, order = "mean")


## Unit: relative
##                          expr  min   lq mean
##       (A %*% B) %*% (C %*% D) 1.00 1.00 1.00
##  v(m(A) * m(B) * m(C) * m(D)) 1.13 1.19 1.37
##           A %*% B %*% C %*% D 6.13 6.09 5.65
##  median   uq  max neval
##    1.00 1.00 1.00   100
##    1.23 1.26 1.19   100
##    6.08 5.99 2.19   100

The automatic solution is only slightly slower than the optimal solution and about a factor of six better than the default evaluation.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.19.27.178