Datatypes

Although DataFrames have become the de facto way of doing ML operations, we will still go over the basics of the traditional RDD-based Spark MLlib datatypes, as follows:

Local vector: A local vector is stored on a single machine. Its values are doubles, and its indices are integers starting at zero. Local vectors are further classified into dense and sparse vectors. A dense vector is backed by a single double array holding all of its values, while a sparse vector is backed by two parallel arrays holding the indices and values of its non-zero entries, together with the vector size.

For example, the vector (2.0, 0.0, 5.0, 3.0) can be represented as a:

- Dense vector as [2.0, 0.0, 5.0, 3.0]
- Sparse vector as [4, (0,2,3), (2.0,5.0,3.0)], that is, the vector size followed by the indices and the values of the non-zero entries
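
The two representations above can be sketched in plain Python, without a Spark installation. The helper names `to_sparse` and `to_dense` are hypothetical and are not part of the Spark API; they only illustrate how the (size, indices, values) triple relates to the full array of doubles.

```python
def to_sparse(dense):
    """Return (size, indices, values) holding only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of doubles from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


dense = [2.0, 0.0, 5.0, 3.0]
sparse = to_sparse(dense)
print(sparse)  # (4, [0, 2, 3], [2.0, 5.0, 3.0])
```

The sparse form pays off only when most entries are zero; for a short vector like this one, it actually stores more numbers than the dense form.
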
Labeled point: A labeled point is a dense or sparse local vector paired with a label, or response value. Labeled points are used in supervised learning, mainly by classification and regression algorithms. The label is stored as a double, so the same type serves both classification and regression.

Algorithm	Label values
Binary classification	0 (negative) or 1 (positive)
Multiclass classification	Class indices starting from zero: 0, 1, 2, ...
Regression	Any double value
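
As a rough illustration of the idea, a labeled point can be modeled in plain Python as a label plus a feature vector. The `LabeledPoint` class below is a hypothetical stand-in written for this sketch, not Spark's own class, and the feature values are made up.

```python
from typing import List, NamedTuple


class LabeledPoint(NamedTuple):
    label: float           # response value, always held as a double
    features: List[float]  # dense local vector of feature values


# Binary classification: the label is 0 or 1, but it is stored as 0.0 or 1.0.
positive = LabeledPoint(label=float(1), features=[2.0, 0.0, 5.0, 3.0])

# Regression: the label can be any real number.
target = LabeledPoint(label=249.5, features=[3.0, 1.0, 120.0])

print(positive.label, target.label)
```
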

Local matrix: A local matrix is stored on a single machine. Its values are doubles, and its row and column indices are integers. Local matrices are either dense or sparse. A dense matrix stores its values in a single double array in column-major order, while a sparse matrix stores its non-zero values in Compressed Sparse Column (CSC) format, also in column-major order.
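
To make the two local matrix layouts concrete, the following plain-Python sketch flattens a small 3 x 2 matrix in column-major order and then builds its CSC arrays. The function names and the sample matrix are assumptions made for illustration and are not Spark API calls.

```python
def column_major(rows):
    """Flatten a list of rows into one list, walking column by column."""
    n_rows, n_cols = len(rows), len(rows[0])
    return [rows[r][c] for c in range(n_cols) for r in range(n_rows)]


def to_csc(rows):
    """Return (col_ptrs, row_indices, values) for the non-zero entries."""
    n_rows, n_cols = len(rows), len(rows[0])
    col_ptrs, row_indices, values = [0], [], []
    for c in range(n_cols):
        for r in range(n_rows):
            if rows[r][c] != 0.0:
                row_indices.append(r)
                values.append(rows[r][c])
        col_ptrs.append(len(values))  # offset where the next column starts
    return col_ptrs, row_indices, values


matrix = [[9.0, 0.0],
          [0.0, 8.0],
          [0.0, 6.0]]
flat = column_major(matrix)
csc = to_csc(matrix)
print(flat)  # [9.0, 0.0, 0.0, 0.0, 8.0, 6.0]
print(csc)   # ([0, 1, 3], [0, 1, 2], [9.0, 8.0, 6.0])
```

In the CSC output, the column pointers [0, 1, 3] say that column 0 owns the stored entries from position 0 up to 1, and column 1 owns those from position 1 up to 3.
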
Distributed matrix: A distributed matrix is stored in one or more RDDs and is therefore spread across the cluster. Its row and column indices are of type long, while its values are of type double. Converting a distributed matrix from one format to another may require a shuffle, which is an expensive operation. Distributed matrices are further classified into the following subcategories:
- Row matrix: A row-oriented distributed matrix without meaningful row indices, backed by an RDD of its rows, where each row is a local vector
- Indexed row matrix: Like a row matrix, but with meaningful row indices; each row is an IndexedRow(row_index, vector)
- Coordinate matrix: Each element is defined explicitly as a MatrixEntry(row_index, col_index, value)
- Block matrix: Made up of a set of matrix blocks, each of the form ((row_index, col_index), matrix)
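
The relationship between these formats can be sketched on a single machine with plain Python lists standing in for RDDs. The entries, the column count, and the helper `to_indexed_rows` are assumptions made for this illustration. Grouping the entries by row index is the step that, on a real cluster, would move data between partitions, which is why format conversions can be expensive.

```python
from collections import defaultdict

# Coordinate form: one (row_index, col_index, value) entry per non-zero element.
entries = [(0, 0, 2.0), (0, 3, 3.0), (2, 1, 5.0)]
n_cols = 4


def to_indexed_rows(entries, n_cols):
    """Group entries by row index and return sorted (row_index, row) pairs."""
    rows = defaultdict(lambda: [0.0] * n_cols)
    for r, c, v in entries:
        rows[r][c] = v
    return sorted(rows.items())


indexed_rows = to_indexed_rows(entries, n_cols)

# Dropping the indices gives the row matrix form, where row position carries no meaning.
row_matrix = [row for _, row in indexed_rows]

print(indexed_rows)  # [(0, [2.0, 0.0, 0.0, 3.0]), (2, [0.0, 5.0, 0.0, 0.0])]
```

Note that row 1, which has no entries, is absent from the indexed form, and that this information is lost entirely once the indices are dropped.
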

Spark 2.0 marked a paradigm shift in the way machine learning algorithms are implemented in Spark. The DataFrame-based API in the spark.ml package became the primary machine learning API in place of the RDD-based one, and the focus moved to solving problems using pipelines. Some of the important concepts surrounding pipelines and their associated terminology are discussed in the following sections.
