Data types

Recall that in the Entree dataset (Chapter 2, Data Processing Pipeline Using Scala), we represented each restaurant's features with an array of Boolean values. We could also think of that array as a vector of feature values, which is exactly what the vector data type in Spark provides.

Vector

There are two variations of a vector: a dense vector and a sparse vector. For a vector of length N, a dense vector takes space equal to N double values. So, even if a restaurant has only a few features present, the vector occupies a lot of space. With a sparse representation, however, we only need to store the non-zero values. So, if there are M non-zero values, where M is much smaller than N, then the total space requirement is M integers and M doubles. Before deciding whether to use a sparse or dense representation, you may want to think about the overall sparsity of your feature vectors, as well as the operations to be performed on them. A dense vector allows fast random lookups, whereas a sparse vector does not.
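As a quick illustration, here is a minimal sketch of both representations using Spark MLlib's Vectors factory (the feature values here are made up for the example):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector of length 4: all values are stored, including the zeros
val dense: Vector = Vectors.dense(1.0, 0.0, 0.0, 3.0)

// Sparse vector of the same length: only the non-zero entries are stored,
// as an array of indices and a matching array of values
val sparse: Vector = Vectors.sparse(4, Array(0, 3), Array(1.0, 3.0))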

Matrix

Another data type is a matrix, which can also be either dense or sparse. The sparse implementation in Spark uses a coordinate-list (COO) format called CoordinateMatrix. IndexedRowMatrix is another sparse implementation where sparsity is available only at the row level, that is, each of the row vectors can be a sparse vector; however, there can be no missing rows. There are two further variations of a matrix: a matrix can either be local to one machine or distributed across a Spark cluster. You may want to read the Spark documentation to decide which representation to choose for your problem.
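As an illustrative sketch (assuming a SparkContext named sc, as in the spark-shell, and made-up entries), a CoordinateMatrix and an IndexedRowMatrix can be built as follows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix, MatrixEntry}

// Coordinate-list format: each non-zero entry is a (row, column, value) triple
val entries = sc.parallelize(Seq(
  MatrixEntry(0L, 1L, 2.5),
  MatrixEntry(2L, 0L, 1.0)))
val coordinateMatrix = new CoordinateMatrix(entries)

// Row-level sparsity: every row is present as an indexed vector,
// and each row vector can itself be sparse
val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.sparse(3, Array(1), Array(2.5))),
  IndexedRow(1L, Vectors.dense(1.0, 0.0, 0.0))))
val indexedRowMatrix = new IndexedRowMatrix(rows)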

Labeled point

We can represent an instance and its label using a data type called labeled point. A labeled point has two members: a real-valued label and a feature vector. By convention in Spark, a binary label is represented as 0 for negative and 1 for positive. Multi-class labels should start from 0 and increase in steps of 1. For example, if there are four classes, the labels should be 0, 1, 2, and 3.
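For example, a minimal sketch of labeled points for a binary problem (with made-up feature values) looks like this:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A positive instance: label 1.0 with a dense feature vector
val positive = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))

// A negative instance: label 0.0 with a sparse feature vector of the same length
val negative = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))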

Let me at this point introduce you to the RDD (Resilient Distributed Dataset). Essentially, it is an abstraction over a collection of objects (which could be Scala objects, Scala tuples, or just primitive data). In simple terms, an RDD is a distributed collection of elements, spread across multiple nodes in a cluster, and Spark takes care of this distribution for you.
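As a small sketch (again assuming a SparkContext named sc and made-up values), an RDD of feature vectors can be created by parallelizing a local collection:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Parallelize a local collection of vectors into an RDD[Vector];
// Spark distributes the elements across the nodes of the cluster
val featureVectors = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 1.0),
  Vectors.sparse(3, Array(2), Array(1.0))))

featureVectors.count() // triggers a distributed computation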

For more information about RDDs, you can read the documentation: https://spark.apache.org/docs/1.4.1/quick-start.html

So, finally, we have multiple options for encoding our original dataset into Spark data types. We can use an RDD of vectors, an RDD of labeled points, or a distributed matrix. Before proceeding with the machine learning (ML) algorithms, we may also want to understand our dataset better.
