Spark, particularly with memory-based storage systems, claims to substantially improve the speed of data access within and between nodes. ML seems to be a natural fit, as many algorithms require multiple passes over the data, or repartitioning. MLlib is the open source library of choice, although private companies are catching up with their own proprietary implementations.
As I will show in Chapter 5, Regression and Classification, most of the standard machine learning algorithms can be expressed as an optimization problem. For example, classical linear regression minimizes the sum of squared vertical distances between the regression line and the actual values of y:

$$\min_{A,B} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
Here, $\hat{y}_i$ are the predicted values according to the linear expression:

$$\hat{y}_i = A x_i + B$$
A is commonly called the slope, and B the intercept. In a more generalized formulation, a linear optimization problem is to minimize an additive function:

$$f(w) = \lambda\, R(w) + \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i)$$
Here, $L(w; x_i, y_i)$ is a loss function and $R(w)$ is a regularization function, weighted by a regularization parameter $\lambda \geq 0$. The regularization function is an increasing function of model complexity, for example, the number of parameters (or a natural logarithm thereof). The most common loss functions are given in the following table:
| Loss function | $L(w; x, y)$ | Gradient or sub-gradient |
|---|---|---|
| Linear (squared) | $\frac{1}{2}\left(w^T x - y\right)^2$ | $\left(w^T x - y\right) x$ |
| Logistic | $\log\left(1 + \exp(-y\, w^T x)\right),\ y \in \{-1, +1\}$ | $-y \left(1 - \frac{1}{1 + \exp(-y\, w^T x)}\right) x$ |
| Hinge | $\max\{0,\ 1 - y\, w^T x\},\ y \in \{-1, +1\}$ | $-y\, x$ if $y\, w^T x < 1$, $0$ otherwise |
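The losses and gradients in the preceding table can be checked with a few lines of plain Scala. The object and method names below are illustrative stand-ins, not part of the MLlib API:

```scala
// Plain-Scala sketch of the three losses for a single example (x, y).
// Each method returns the loss value and the (sub)gradient with respect to w.
object Losses {
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (a, b) => a * b }.sum

  // Squared (linear) loss: 1/2 * (w^T x - y)^2; gradient (w^T x - y) * x
  def squared(w: Array[Double], x: Array[Double], y: Double): (Double, Array[Double]) = {
    val r = dot(w, x) - y
    (0.5 * r * r, x.map(xi => r * xi))
  }

  // Logistic loss: log(1 + exp(-y w^T x)), with labels y in {-1, +1}
  def logistic(w: Array[Double], x: Array[Double], y: Double): (Double, Array[Double]) = {
    val m = y * dot(w, x)
    val g = -y * (1.0 - 1.0 / (1.0 + math.exp(-m)))
    (math.log1p(math.exp(-m)), x.map(xi => g * xi))
  }

  // Hinge loss: max(0, 1 - y w^T x); sub-gradient -y x when the margin is below 1
  def hinge(w: Array[Double], x: Array[Double], y: Double): (Double, Array[Double]) = {
    val m = y * dot(w, x)
    if (m < 1) (1.0 - m, x.map(xi => -y * xi)) else (0.0, x.map(_ => 0.0))
  }
}
```

For example, with w = (1, 2), x = (3, 1), and y = 4, the squared loss evaluates to 0.5 and its gradient to (3, 1).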
The purpose of the regularizer is to penalize more complex models to avoid overfitting and improve generalization error. MLlib currently supports the following regularizers:
| Regularizer | $R(w)$ | Gradient or sub-gradient |
|---|---|---|
| L2 | $\frac{1}{2} \lVert w \rVert_2^2$ | $w$ |
| L1 | $\lVert w \rVert_1$ | $\mathrm{sign}(w)$ |
| Elastic net | $\alpha \lVert w \rVert_1 + (1 - \alpha) \frac{1}{2} \lVert w \rVert_2^2$ | $\alpha\, \mathrm{sign}(w) + (1 - \alpha)\, w$ |
Here, sign(w) is the vector of the signs of all entries of w.
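These three regularizers can likewise be sketched in plain Scala; again, the names below are illustrative and not MLlib API:

```scala
// Plain-Scala sketch of the regularizers and their (sub)gradients.
// Each method returns the penalty value and its (sub)gradient.
object Regularizers {
  // L2: 1/2 * ||w||_2^2; gradient is w itself
  def l2(w: Array[Double]): (Double, Array[Double]) =
    (0.5 * w.map(v => v * v).sum, w.clone)

  // L1: ||w||_1; sub-gradient is sign(w), taken entrywise
  def l1(w: Array[Double]): (Double, Array[Double]) =
    (w.map(math.abs).sum, w.map(v => math.signum(v)))

  // Elastic net: alpha * L1 + (1 - alpha) * L2, for alpha in [0, 1]
  def elasticNet(w: Array[Double], alpha: Double): (Double, Array[Double]) = {
    val (p1, g1) = l1(w)
    val (p2, g2) = l2(w)
    (alpha * p1 + (1 - alpha) * p2,
      g1.zip(g2).map { case (a, b) => alpha * a + (1 - alpha) * b })
  }
}
```

Note that the elastic net interpolates between L1 (alpha = 1) and L2 (alpha = 0), combining the sparsity of the former with the stability of the latter.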
Currently, MLlib includes implementations of the following algorithms:
I will go over some of the algorithms in Chapter 5, Regression and Classification. More complex non-structured machine learning methods will be considered in Chapter 6, Working with Unstructured Data.
R is an implementation of the popular S programming language, created by John Chambers while working at Bell Labs. R is currently supported by the R Foundation for Statistical Computing, and its popularity has increased in recent years according to polls. SparkR provides a lightweight frontend for using Apache Spark from R. Starting with Spark 1.6.0, SparkR provides a distributed DataFrame implementation that supports operations such as selection, filtering, and aggregation, similar to R data frames and dplyr, but on very large datasets. SparkR also supports distributed machine learning using MLlib.
SparkR requires R version 3 or higher, and can be invoked via the ./bin/sparkR
shell. I will cover SparkR in Chapter 8, Integrating Scala with R and Python.
Graph algorithms are among the hardest to distribute correctly across nodes, unless the graph itself is naturally partitioned, that is, it can be represented as a set of disconnected subgraphs. Since social network analysis at multi-million-node scale became popular due to companies such as Facebook, Google, and LinkedIn, researchers have been coming up with new approaches to formalize graph representations, algorithms, and the types of questions asked.
GraphX is a modern framework for graph computations described in a 2013 paper (GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, and Ion Stoica, GRADES (SIGMOD workshop), 2013). Graph-parallel frameworks such as Pregel and PowerGraph are its predecessors. The graph is represented by two RDDs: one for vertices and another one for edges. Once the RDDs are joined, GraphX supports either a Pregel-like API or a MapReduce-like API, where the map function is applied to each node's neighbors and reduce is the aggregation step on top of the map results.
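The two-collection layout and the map-over-neighbors/reduce pattern can be sketched with plain Scala collections. The `aggregateMessages` below is a toy, single-machine stand-in for the distributed GraphX operation of the same idea, not the GraphX API itself:

```scala
// Toy graph as vertex and edge lists, mirroring GraphX's two-RDD layout.
// The aggregate step imitates the MapReduce-like API: a "map" function sends
// a message along each edge, and "reduce" merges the messages arriving at
// each vertex. Here we use it to count in-degrees.
object GraphSketch {
  type VertexId = Long
  final case class Edge(src: VertexId, dst: VertexId)

  def aggregateMessages[A](edges: Seq[Edge])
                          (sendMsg: Edge => (VertexId, A))
                          (mergeMsg: (A, A) => A): Map[VertexId, A] =
    edges.map(sendMsg).groupBy(_._1).map { case (v, msgs) =>
      v -> msgs.map(_._2).reduce(mergeMsg)
    }

  // In-degree: each edge sends the message 1 to its destination vertex,
  // and messages are merged by addition.
  def inDegrees(edges: Seq[Edge]): Map[VertexId, Int] =
    aggregateMessages(edges)(e => (e.dst, 1))(_ + _)
}
```

For a graph with edges 1→2, 1→3, and 2→3, `inDegrees` yields 1 for vertex 2 and 2 for vertex 3; the same send/merge shape underlies many of the algorithms listed next.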
At the time of writing, GraphX includes the implementation for the following graph algorithms:
As GraphX is an open source library, changes to the list are expected. GraphFrames is a newer implementation from Databricks that fully supports three languages: Scala, Java, and Python, and is built on top of DataFrames. I'll discuss specific implementations in Chapter 7, Working with Graph Algorithms.