An example - alternating least squares

A recommender system tries to predict the items that a user might be interested in, based on the interaction history of that user and of other users.

So let's consider a so-called item-user or product-customer matrix, as illustrated here:

This is a so-called sparse matrix because only a small fraction of its cells are populated with non-zero values, each indicating an interaction between a user u and an item i. The value in a cell can simply be a one, or any other number, for example, the number of times user u bought item i or the rating that user u gave to item i. Let's call this matrix rui, where u stands for user and i for item.
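To make this concrete, here is a small, purely made-up example of such a matrix with four users (rows) and four items (columns); empty cells stand for zeros, that is, interactions we haven't observed:

\[
r = \begin{pmatrix}
5 &   &   & 1 \\
  & 3 &   &   \\
4 &   & 2 &   \\
  &   &   & 5
\end{pmatrix}
\]

Most of the cells are empty, which is exactly the sparsity property we just described.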

Those of you familiar with linear algebra might know that such a matrix can be approximated by the product of two smaller, low-rank matrices. This means that you have to find two matrices pu and qi that, when multiplied with each other, reconstruct the original matrix rui; let's call the reconstruction rui'. The goal is to find pu and qi such that rui' doesn't differ too much from rui on the known cells. This is done by minimizing a sum of squared errors objective.
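Written as a formula, the objective looks as follows; note that the regularization term with weight \(\lambda\) is an assumption on our part (practical ALS implementations typically add it, and the plain sum of squared errors is just the special case \(\lambda = 0\)):

\[
\min_{p,q} \sum_{(u,i)\ \mathrm{observed}} \left( r_{ui} - p_u^{\top} q_i \right)^2 + \lambda \left( \sum_u \lVert p_u \rVert^2 + \sum_i \lVert q_i \rVert^2 \right)
\]

Only the observed cells enter the sum; this is what keeps the problem well defined despite the sparsity of rui.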

The following figure illustrates this and the sparsity property of the matrix:

Once we've found good factors pu and qi, we can construct rui', and the cells that were empty in rui now contain predicted values; these become the new predicted product suggestions. In case you haven't understood all the details, don't worry, as we don't need too much of this example to understand the rest of this chapter.

A common algorithm to find pu and qi is called alternating least squares (ALS); alternating because in each iteration it keeps one of the two factors fixed and solves an ordinary least squares problem for the other, then switches roles, going from pu to qi and vice versa. Don't worry about the mathematical details too much; the important point is that, in Apache Spark MLlib, running ALS is just a single line of Scala code:
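The following is a minimal sketch of that line in context; the input RDD named ratings as well as the concrete values for the rank, the number of iterations, and the regularization parameter are illustrative assumptions:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // ratings is assumed to be an RDD[Rating], that is, (user, product, rating) triples
    // 10 latent factors (rank), 10 iterations, regularization parameter 0.01
    val model = ALS.train(ratings, 10, 10, 0.01)

Once trained, the resulting model can, for example, produce the top five product suggestions for a given user with model.recommendProducts(userId, 5).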

So what's wrong with this? Before we answer that, let's take a look at how ALS is implemented in a statistical programming language such as R:

Again, don't worry if you don't understand each line; the purpose of this figure is to show you that, in R, this algorithm can be expressed in only 27 lines of code. If we now take a look at the ALS implementation in MLlib, we'll see that it has more than 800 lines. You can find this implementation at https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation.

So why do we need more than 800 lines in Scala on Spark, but only 27 in R? The difference is performance optimization: more than 50% of the MLlib implementation is code that optimizes performance rather than expressing the algorithm itself. So what if we could do the following?

  • Get rid of all performance optimizations in our algorithm implementation
  • Port our R code 1:1 to some parallel framework
  • In case of changes, just change our R implementation

This is where Apache SystemML comes in: it supports all of this. Apache SystemML's DSL (domain-specific language) is a subset of R syntax, so you can take the previous example and run it 1:1, without any modification, on top of Apache SystemML. In addition, a cost-based performance optimizer generates a physical execution plan on top of Apache Spark that minimizes execution time based on the size properties of your data. So let's find out how this works.
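To give you a first impression, the following is a minimal sketch of how an R-like script is handed over to Apache SystemML from Spark using its MLContext API; the three-line DML script in the string is just a stand-in for a real script such as the ALS implementation shown previously, and the SparkContext sc is assumed to exist:

    import org.apache.sysml.api.mlcontext.MLContext
    import org.apache.sysml.api.mlcontext.ScriptFactory.dml

    val ml = new MLContext(sc)
    // an R-like DML script: create a random matrix and print the sum of its cells
    val script = dml(
      """
        x = rand(rows=100, cols=10)
        s = sum(x)
        print("sum: " + s)
      """)
    ml.execute(script)

SystemML parses the script and its optimizer then decides, based on the sizes of the matrices involved, how to execute it on the cluster.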
