actions on Spark RDDs, 33
Akka system, 123
alignment problems, 6
all grouping (Storm), 100
AM (Application Master), 163-165
analytics
Apache Hama, 136
application layer in BDAS, 29-31
Application Master (AM), 163-165
Apply phase (GraphLab vertex program), 151
architecture
of Hadoop YARN, 165
of Mesos, 48
basic statistics computations, 4
batch learning, online learning versus, 66
BDAS (Berkeley Data Analytics Stack), 13, 21
design and architecture, 26-31
Mesos
architecture, 48
cluster resource sharing, 46-47
fault tolerance, 52
motivation for developing, 21-23
Shark
distributed data loading, 45
full partition-wise joins, 45
ML (machine learning) support, 46
partition pruning, 46
Spark
DSM (Distributed Shared Memory) systems versus, 38-40
RDDs (resilient distributed datasets), 33-36
BI (business intelligence), history of term, 1-2
big data
history of term, 2
binary LR (logistic regression), 67-69
real-time computing example, 97-99
BSP (Bulk Synchronous Parallel)
expressibility through RDDs, 40
open source implementations, 134-137
bulk iterations, 133
business intelligence (BI), history of term, 1-2
CGD, as iterative ML algorithm, 11-12
chromatic engine (GraphLab), 142
classification, regression versus, 66
cluster resource sharing, 25-26, 46-47
code sketches, 171
JPMMLLinearRegInSpark.java, 182
NaiveBayesHandler.java, 171
NaiveBayesPMMLBolt.java, 178
sgd.cpp, 191
Simple_pagerank.cpp, 186
columnar memory store in Shark, 44-45
complex data structure processing, 8
complex decision planes in SVM, 74-76
computation paradigms, list of, 4-6
consistency models (GraphLab), 140-141
continuous transformations in PMML, 82
Conviva, 30
data dictionary in PMML, 81-82
data management layer in BDAS, 28-29
data processing
in BDAS, 29
data splits computations, 3
data transformations. See transformations
daxpy primitive, 11
ddot primitive, 11
decision trees example (machine learning), 62-64
deep learning, 167
dependencies between Spark RDDs, 36
design patterns in Storm
distributed remote procedure calls (DRPCs), 102-106
direct grouping (Storm), 100
discrete normalization in PMML, 83
discretization in PMML, 83
disk-based single-node analytics, 168-169
distributed data loading in Shark, 45
distributed remote procedure calls (DRPCs), 102-106
Distributed Shared Memory (DSM) systems, Spark versus, 38-40
distributed SQL systems, Shark as, 42-46
distributed version of GraphLab, 141-143
Domain Specific Language (DSL), 166-167
Dominant Resource Fairness (DRF), 49
Dremel, 123
Drill, 123
DRPCs (distributed remote procedure calls), 102-106
DryadLINQ, 40
DSL (Domain Specific Language), 166-167
DSM (Distributed Shared Memory) systems, Spark versus, 38-40
estimators in logistic regression (LR), 69-70
expressibility of RDDs (resilient distributed datasets), 40-41
Facebook graph search, 129
fault tolerance in Mesos, 52
field grouping (Storm), 99
first-generation ML (machine learning) tools, 10
comparison with subsequent generations, 16
framework schedulers, 28, 46-47
frameworks over Hadoop YARN, 165-166
full partition-wise joins in Shark, 45
future of big data analytics, 166-169
GAS (Gather, Apply, Scatter), 143-144, 149-153
Gather phase (GraphLab vertex program), 151
GBASE, 132
generalized N-body problems, 4
global grouping (Storm), 100
GoldenORB, 136
GPS (Graph Processing System), 137
graph computations, 5, 8, 14-15, 129-130
GraphLab
page rank algorithm example, 147-153, 186-191
stochastic gradient descent example, 153-156, 191-207
Hadoop YARN support, 166
open source implementations, 134-137
Graph Processing System (GPS), 137
Hadoop YARN support, 166
page rank algorithm example, 147-153, 186-191
stochastic gradient descent example, 153-156, 191-207
GraphX, 166
grouping Storm streams, 99-100
Hadoop
in history of big data analytics, 2
suitability and limitations, 3-9, 21-23, 162-163
Hadoop Distributed File System (HDFS), 22
Hadoop YARN
comparison with Mesos, 28
motivation for developing, 162-163
as resource scheduler, 163-165
Hama, 15
header information in PMML, 81
history of big data analytics terminology, 1-2
HMMs (Hidden Markov Models), 168
incremental iterations, 133-134
inductive approach, transductive approach versus, 65
integration problems, 6
Internet of Things (IoT), 169
Internet traffic filtering example (real-time analytics with Storm), 121-122
interrelated data splits computations, 3
IoT (Internet of Things), 169
iterative ML algorithms, 13-14
expressibility through RDDs, 41
Hadoop YARN support, 166
with Tez, 166
limitations
of Hadoop, 3-9, 21-23, 162-163
of Spark, 7
linear algebraic computations, 4
linear regression support in PMML, 88-89, 182-186
local grouping (Storm), 100
locking engine (GraphLab), 143
logistic regression (LR), 67-70
multinomial LR, 70
machine learning (ML) tools. See ML (machine learning) tools
machine-to-machine (M2M) data, 116-117
manufacturing log classification example (real-time analytics with Storm), 116-121
Map-Reduce (MR)
expressibility through RDDs, 40
parallel databases versus, 24-25
Markov Chain Monte Carlo (MCMC), 168
mathematics of SVM (support vector machine), 76-78
matmul primitive, 11
MCMC (Markov Chain Monte Carlo), 168
Mesos, 13
architecture, 48
cluster resource sharing, 46-47
comparison with Hadoop YARN, 28
fault tolerance, 52
motivation for developing, 25-26
message processing guarantees in Storm, 100-102
ML (machine learning), 9-16, 61
comparison of generations, 16
first-generation tools, 10
Hadoop YARN support, 166
logistic regression (LR). See logistic regression (LR)
PMML. See PMML (Predictive Modeling Markup Language)
real-time analytics. See also Storm
alternatives to Storm, 122-123
Internet traffic filtering example, 121-122
manufacturing log classification example, 116-121
second-generation tools, 10-12
support in Shark, 46
SVM. See SVM (support vector machine)
with Tez, 166
iterative ML algorithms, 13-14
real-time analytics, 14
model definition in PMML, 84
monolithic schedulers, 28
MR (Map-Reduce)
expressibility through RDDs, 40
parallel databases versus, 23-25
multicore version of GraphLab, 139-141
multinomial LR (logistic regression), 70
Naive Bayes support in PMML
Nectar, 42
Nimbus, 28
no grouping (Storm), 100
Omega, 28
online learning, batch learning versus, 66
open source implementations of Pregel, 134-137
optimization problems, 5
OS Level virtualization, 51
outputs in PMML, 85
page rank algorithm example (GraphLab), 147-153, 186-191
partition pruning in Shark, 46
partitioning with PowerGraph, 145-147
Phoebus, 136
Piccolo, 138
PMML (Predictive Modeling Markup Language), 12
linear regression support, 88-89, 182-186
Naive Bayes support
producers and consumers, 85-86
Pregel framework, 14-15, 130-132
open source implementations, 134-137
quantum computing, 168
R, scaling over Hadoop, 12
random forest (RF) example (machine learning), 62-64
RDDs (resilient distributed datasets), 33-36
real-time analytics, 7-8, 14. See also Storm
alternatives to Storm, 122-123
Internet traffic filtering example, 121-122
manufacturing log classification example, 116-121
recommender systems, 61
regression
classification versus, 66
linear regression support in PMML, 88-89, 182-186
logistic regression (LR), 67-70
multinomial LR, 70
reinforcement learning, ML (machine learning) as, 66
Remote Procedure Calls (RPCs), 102-104
resilient distributed datasets (RDDs), 33-36
resource allocation in Mesos, 49-50
resource management layer in BDAS, 26-28
Resource Manager (RM), 163-165
resource scheduler, Hadoop YARN as, 163-165
RF (random forest) example (machine learning), 62-64
RM (Resource Manager), 163-165
RPCs (Remote Procedure Calls), 102-104
S4 system, 123
Scala, 23
second-generation ML (machine learning) tools, 10-12
comparison with other generations, 16
Shark
distributed data loading, 45
full partition-wise joins, 45
ML (machine learning) support, 46
motivation for developing, 23-25
partition pruning, 46
shuffle grouping (Storm), 99
Spark
DSM (Distributed Shared Memory) systems versus, 38-40
Hadoop YARN support, 166
as iterative ML algorithm, 13
logistic regression (LR) in, 70-73
motivation for developing, 23
PMML support
for linear regression, 88-89, 182-186
for Naive Bayes, 87-88, 171-182
RDDs (resilient distributed datasets), 33-36
suitability and limitations, 7
SVM (support vector machine) in, 78-79
split-data computations, 3
real-time computing example, 97-99
SQL interfaces, Shark as, 42-46
Stanford GPS (Graph Processing System), 137
stateful operators, D-Streams, 125-126
stateless operators, D-Streams, 125
stochastic gradient descent example (GraphLab), 153-156, 191-207
Storm
design patterns
distributed remote procedure calls (DRPCs), 102-106
Internet traffic filtering example, 121-122
logistic regression (LR) in, 107-110
manufacturing log classification example, 116-121
message processing guarantees, 100-102
PMML support for Naive Bayes, 113-116
real-time computing example, 97-99
streams, 95
support vector machine (SVM) in, 110-113
streams
in Storm, 95
suitability
of Spark, 7
supervised learning, ML (machine learning) as, 65
Surfer, 133
SVM (support vector machine), 74-79
complex decision planes, 74-76
Tachyon, 13
targets in PMML, 85
third-generation ML (machine learning) tools, 12-16
comparison with previous generations, 16
iterative ML algorithms, 13-14
real-time analytics, 14
transductive approach, inductive approach versus, 65
transformations
on Spark RDDs, 33
unsupervised learning, ML (machine learning) as, 66
value mapping in PMML, 84
vertex program in GraphLab, 149-153
Yahoo, 30
YARN
comparison with Mesos, 28
motivation for developing, 162-163