The Mahout and Spark integration

Apache Mahout was a general machine learning library built on top of Hadoop. Mahout started out primarily as a Java MapReduce package to run machine learning algorithms. As machine learning algorithms are iterative in nature, MapReduce had major performance and scalability issues, so Mahout stopped the development of MapReduce-based algorithms and started supporting new platforms, such as Spark, H2O, and Flink, with a new package called Samsara.

Let's install Mahout, explore the Mahout shell with Scala bindings, and then build a recommendation system.

Installing Mahout

The latest version of Spark does not work well with Mahout yet, so I used the Spark 1.4.1 version with the Mahout 0.12.2 version. Download the Spark prebuilt binary from the following location and start Spark daemons:

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1-bin-hadoop2.6.tgz
tar xzvf spark-1.4.1-bin-hadoop2.6.tgz
cd spark-1.4.1-bin-hadoop2.6

Now, let's download the Mahout binaries and unpack them as shown here:

wget http://mirrors.sonic.net/apache/mahout/0.12.2/apache-mahout-distribution-0.12.2.tar.gz
tar xzvf apache-mahout-distribution-0.12.2.tar.gz

Now, export the following environment variables and start the Mahout shell:

export MAHOUT_HOME=/home/cloudera/apache-mahout-distribution-0.12.2
export SPARK_HOME=/home/cloudera/spark-1.4.1-bin-hadoop2.6
export MAHOUT_LOCAL=true
export MASTER=yarn-client
export JAVA_TOOL_OPTIONS="-Xmx2048m -XX:MaxPermSize=1024m -Xms1024m"
cd ~/apache-mahout-distribution-0.12.2
bin/mahout spark-shell

Exploring the Mahout shell

The following two Scala imports are typically used to enable Mahout Scala DSL Bindings for linear algebra:

import org.apache.mahout.math._
import scalabindings._
import MatlabLikeOps._

The two types of vectors supported by Mahout shell are as follows:

  • Dense vector: A dense vector is a vector with relatively few zero elements. On the Mahout command line, type the following command to initialize a dense vector:
    mahout> val denseVector1: Vector = (3.0, 4.1, 6.2)
    denseVector1: org.apache.mahout.math.Vector = {0:3.0,1:4.1,2:6.2}
    
  • Sparse vector: A sparse vector is a vector with a relatively large number of zero elements. On the Mahout command line, type the following command to initialize a sparse vector:
    mahout> val sparseVector1 = svec((6 -> 1) :: (9 -> 2.0) :: Nil)
    sparseVector1: org.apache.mahout.math.RandomAccessSparseVector = {9:2.0,6:1.0}
    

Access the elements of a vector:

mahout> denseVector1(2)
res0: Double = 6.2

Set values to a vector:

mahout> denseVector1(2)=8.2
mahout> denseVector1
res2: org.apache.mahout.math.Vector = {0:3.0,1:4.1,2:8.2}

The following are the vector arithmetic operations:

mahout> val denseVector2: Vector = (1.0, 1.0, 1.0)
denseVector2: org.apache.mahout.math.Vector = {0:1.0,1:1.0,2:1.0}

mahout> val addVec=denseVector1 + denseVector2
addVec: org.apache.mahout.math.Vector = {0:4.0,1:5.1,2:9.2}

mahout> val subVec=denseVector1 - denseVector2
subVec: org.apache.mahout.math.Vector = {0:2.0,1:3.0999999999999996,2:7.199999999999999}

Similarly, multiplication and division can be done as well.

The result of adding a scalar to a vector is that all elements are incremented by the value of the scalar. For example, the following command adds 10 to all the elements of the vector:

mahout> val addScalr=denseVector1+10
addScalr: org.apache.mahout.math.Vector = {0:13.0,1:14.1,2:18.2}

mahout> val addScalr=denseVector1-2
addScalr: org.apache.mahout.math.Vector = {0:1.0,1:2.0999999999999996,2:6.199999999999999}

Now let's see how to initialize the matrix. The inline initialization of a matrix, either dense or sparse, is always performed row-wise:

  • The dense matrix is initialized as follows:
    mahout> val denseMatrix = dense((10, 20, 30), (30, 40, 50))
    denseMatrix: org.apache.mahout.math.DenseMatrix = 
    {
     0 => {0:10.0,1:20.0,2:30.0}
     1 => {0:30.0,1:40.0,2:50.0}
    }
    
  • The sparse matrix is initialized as follows:
    mahout> val sparseMatrix = sparse((1, 30) :: Nil, (0, 20) :: (1, 20.5) :: Nil)
    sparseMatrix: org.apache.mahout.math.SparseRowMatrix = 
    {
     0 => {1:30.0}
     1 => {0:20.0,1:20.5}
    }
    
  • The diagonal matrix is initialized as follows:
    mahout> val diagonalMatrix=diag(20, 4)
    diagonalMatrix: org.apache.mahout.math.DiagonalMatrix = 
    {
     0 => {0:20.0}
     1 => {1:20.0}
     2 => {2:20.0}
     3 => {3:20.0}
    }
    
  • The identity matrix is initialized as follows:
    mahout> val identityMatrix = eye(4)
    identityMatrix: org.apache.mahout.math.DiagonalMatrix = 
    {
     0 => {0:1.0}
     1 => {1:1.0}
     2 => {2:1.0}
     3 => {3:1.0}
    }
    

Access the elements of a matrix as follows:

mahout> denseMatrix(1,1)
res5: Double = 40.0

mahout> sparseMatrix(0,1)
res18: Double = 30.0

Fetch a row of a vector as follows:

mahout> denseMatrix(1,::)
res21: org.apache.mahout.math.Vector = {0:30.0,1:40.0,2:50.0}

Fetch a column of a vector as follows:

mahout> denseMatrix(::,1)
res22: org.apache.mahout.math.Vector = {0:20.0,1:40.0}

Set a matrix row as follows:

mahout> denseMatrix(1,::)=(99,99,99)
res23: org.apache.mahout.math.Vector = {0:99.0,1:99.0,2:99.0}
mahout> denseMatrix
res24: org.apache.mahout.math.DenseMatrix = 
{
 0 => {0:10.0,1:20.0,2:30.0}
 1 => {0:99.0,1:99.0,2:99.0}
}

Matrices are assigned by reference and not as a copy. See the following example:

mahout> val newREF = denseMatrix
newREF: org.apache.mahout.math.DenseMatrix = 
{
 0 => {0:10.0,1:20.0,2:30.0}
 1 => {0:99.0,1:99.0,2:99.0}
}
mahout> newREF += 10.0
res25: org.apache.mahout.math.Matrix = 
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}
mahout> denseMatrix
res26: org.apache.mahout.math.DenseMatrix = 
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}

If you want a separate copy, you can clone it:

mahout> val newClone = denseMatrix clone
newClone: org.apache.mahout.math.Matrix = 
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}
mahout> newClone += 10
res27: org.apache.mahout.math.Matrix = 
{
 0 => {0:30.0,1:40.0,2:50.0}
 1 => {0:119.0,1:119.0,2:119.0}
}
mahout> newClone
res28: org.apache.mahout.math.Matrix = 
{
 0 => {0:30.0,1:40.0,2:50.0}
 1 => {0:119.0,1:119.0,2:119.0}
}
mahout> denseMatrix
res29: org.apache.mahout.math.DenseMatrix = 
{
 0 => {0:20.0,1:30.0,2:40.0}
 1 => {0:109.0,1:109.0,2:109.0}
}

Building a universal recommendation system with Mahout and search tool

Mahout provides you with recommendation algorithms such as spark-itemsimilarity and spark-rowsimilarity to create recommendations. When these recommendations are combined with a search tool, such as Solr or Elasticsearch, the recommendations will be personalized for individual users.

Figure 8.6 is a recommendation application with lambda architecture to create and update the model in batch mode and a search engine playing the real-time serving role. All user interactions are collected in real time and stored on HBase. In Solr, two collections are created—one for user history and one for item indicators. Indicators are user interactions created by spark-mahout correlation algorithms.

A recommender application queries Solr directly to get recommendations. There can be two actions from users such as purchase which is a primary action and secondary actions such as product detail-views or add-to-wishlists. The primary action from the user history with its co-occurrence and cross-co-occurrence indicators are usually recommended. However, recommendations can be customized with secondary actions.

Building a universal recommendation system with Mahout and search tool

Figure 8.6: A universal recommendation engine with Spark and Mahout

Let's learn how to create a similarity matrix and cross-similarity matrix using the spark-itemsimilarity algorithm. Create a file called infile with the following content. Note that you can directly provide a log file as an input by providing delimiters:

[cloudera@quickstart ~]$ cat infile
u1,purchase,iphone
u1,purchase,ipad
u2,purchase,nexus
u2,purchase,galaxy
u3,purchase,surface
u4,purchase,iphone
u4,purchase,galaxy
u1,view,iphone
u1,view,ipad
u1,view,nexus
u1,view,galaxy
u2,view,iphone
u2,view,ipad
u2,view,nexus
u2,view,galaxy
u3,view,surface
u3,view,nexus
u4,view,iphone
u4,view,ipad
u4,view,galaxy

Then, run the spark-itemsimilarity job with the following command:

[cloudera@quickstart apache-mahout-distribution-0.12.2]$ bin/mahout spark-itemsimilarity 
   --input /home/cloudera/infile        
   --output /home/cloudera/outdir         
   --master local[*]     
   --filter1 purchase    
   --filter2 view        
   -ic 2                 
   -rc 0                 
   -fc 1

The previous command line options are explained as follows:

  • -f1 or --filter1: Datum for the primary item set
  • -f2 or --filter2: Datum for the secondary item set
  • -ic or --itemIDColumn: Column number for item ID. Default is 1.
  • -rc or --rowIDColumn: Column number for row ID. Default is 0.
  • -fc or --filterColumn: Column number for filter string. Default is -1 for no filter.

Note that the --master parameter can be changed to a standalone master or yarn.

This program will produce the following output:

[cloudera@quickstart apache-mahout-distribution-0.12.2]$$ cd ~/outdir/
[cloudera@quickstart outdir]$ ls -R
.:
cross-similarity-matrix  similarity-matrix

./cross-similarity-matrix:
part-00000  part-00001  part-00002  _SUCCESS

./similarity-matrix:
part-00000  _SUCCESS

[cloudera@quickstart outdir]$ cat similarity-matrix/part-00000 
galaxy nexus:1.7260924347106847
ipad iphone:1.7260924347106847
surface
iphone ipad:1.7260924347106847
nexus galaxy:1.7260924347106847

[cloudera@quickstart outdir]$ cat cross-similarity-matrix/part-0000* 
galaxy galaxy:1.7260924347106847 ipad:1.7260924347106847 iphone:1.7260924347106847 nexus:1.7260924347106847
ipad galaxy:0.6795961471815897 ipad:0.6795961471815897 iphone:0.6795961471815897 nexus:0.6795961471815897
surface surface:4.498681156950466 nexus:0.6795961471815897
iphone galaxy:1.7260924347106847 ipad:1.7260924347106847 iphone:1.7260924347106847 nexus:1.7260924347106847
nexus galaxy:0.6795961471815897 ipad:0.6795961471815897 iphone:0.6795961471815897 nexus:0.6795961471815897

Based on the previous result, on the site for the page displaying the iPhone, we can now show that the iPad as a recommendation that was purchased by similar people. The current user's purchase history will be used to personalize the recommendations on Solr.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.14.93