Index

A

actions on Spark RDDs, 33

Akka system, 123

alignment problems, 6

all grouping (Storm), 100

AM (Application Master), 163-165

analytics

defined, 1-2

future of, 166-169

history of term, 1-2

Apache Hama, 136

application layer in BDAS, 29-31

Application Master (AM), 163-165

Apply phase (GraphLab vertex program), 151

architecture

of BDAS, 26-31

of Hadoop YARN, 165

of Mesos, 48

of MLbase, 90-91

B

basic statistics computations, 4

batch learning, online learning versus, 66

BDAS (Berkeley Data Analytics Stack), 13, 21

design and architecture, 26-31

Mesos

architecture, 48

cluster resource sharing, 46-47

fault tolerance, 52

isolation, 50-52

resource allocation, 49-50

motivation for developing, 21-23

Shark, 42-46

columnar memory store, 44-45

distributed data loading, 45

full partition-wise joins, 45

ML (machine learning) support, 46

partition pruning, 46

Spark extensions for, 43-44

Spark

data processing in, 31-32

DSM (Distributed Shared Memory) systems versus, 38-40

expressibility of RDDs, 40-41

implementation, 36-38

RDDs (resilient distributed datasets), 33-36

similar systems, 41-42

BI (business intelligence), history of term, 1-2

big data

future of, 166-169

history of term, 2

binary LR (logistic regression), 67-69

BLinkDB, 29-30

bolts in Storm, 95-96

real-time computing example, 97-99

BSP (Bulk Synchronous Parallel)

applications for, 132-134

expressibility through RDDs, 40

Pregel as, 14-15, 130-132

limitations, 138-139

open source implementations, 134-137

bulk iterations, 133

business intelligence (BI), history of term, 1-2

C

CGD, as iterative ML algorithm, 11-12

chromatic engine (GraphLab), 142

classification, regression versus, 66

cluster resource sharing, 25-26, 46-47

clusters in Storm, 96-97

code sketches, 171

JPMMLLinearRegInSpark.java, 182

NaiveBayesHandler.java, 171

NaiveBayesPMMLBolt.java, 178

sgd.cpp, 191

Simple_pagerank.cpp, 186

columnar memory store in Shark, 44-45

complex data structure processing, 8

complex decision planes in SVM, 74-76

computation paradigms, list of, 4-6

consistency models (GraphLab), 140-141

consumers in PMML, 85-86

continuous transformations in PMML, 82

Conviva, 30

D

data dictionary in PMML, 81-82

data management layer in BDAS, 28-29

data processing

in BDAS, 29

in Spark, 31-32

data splits computations, 3

data transformations. See transformations

daxpy primitive, 11

ddot primitive, 11

decision trees example (machine learning), 62-64

deep learning, 167

dependencies between Spark RDDs, 36

design of BDAS, 26-31

design patterns in Storm

distributed remote procedure calls (DRPCs), 102-106

Trident, 106-107

direct grouping (Storm), 100

discrete normalization in PMML, 83

discretization in PMML, 83

disk-based single-node analytics, 168-169

distributed data loading in Shark, 45

distributed remote procedure calls (DRPCs), 102-106

Distributed Shared Memory (DSM) systems, Spark versus, 38-40

distributed SQL systems, Shark as, 42-46

distributed version of GraphLab, 141-143

Domain Specific Language (DSL), 166-167

Dominant Resource Fairness (DRF), 49

Dremel, 123

Drill, 123

DRPCs (distributed remote procedure calls), 102-106

DryadLINQ, 40

DSL (Domain Specific Language), 166-167

DSM (Distributed Shared Memory) systems, Spark versus, 38-40

D-Streams, 124-126

E

estimators in logistic regression (LR), 69-70

expressibility of RDDs (resilient distributed datasets), 40-41

F

Facebook graph search, 129

fairness algorithms, 49-50

fault tolerance in Mesos, 52

field grouping (Storm), 99

first-generation ML (machine learning) tools, 10

comparison with subsequent generations, 16

Forge, 166-167

framework schedulers, 28, 46-47

frameworks over Hadoop YARN, 165-166

full partition-wise joins in Shark, 45

future of big data analytics, 166-169

G

GAS (Gather, Apply, Scatter), 143-144, 149-153

Gather phase (GraphLab vertex program), 151

GBASE, 132

generalized N-body problems, 4

Giraph, 134-136, 166

global grouping (Storm), 100

GoldenORB, 136

GPS (Graph Processing System), 137

graph computations, 5, 8, 14-15, 129-130

applications for, 132-134

GraphLab

distributed version, 141-143

multicore version, 139-141

page rank algorithm example, 147-153, 186-191

PowerGraph, 143-147

stochastic gradient descent example, 153-156, 191-207

vertex program, 149-153

Hadoop YARN support, 166

with Pregel, 130-132

limitations, 138-139

open source implementations, 134-137

Graph Processing System (GPS), 137

GraphChi, 168-169

GraphLab, 15, 138

distributed version, 141-143

Hadoop YARN support, 166

multicore version, 139-141

page rank algorithm example, 147-153, 186-191

PowerGraph, 143-147

stochastic gradient descent example, 153-156, 191-207

vertex program, 149-153

GraphX, 166

grouping Storm streams, 99-100

H

Hadoop

in history of big data analytics, 2

suitability and limitations, 3-9, 21-23, 162-163

Hadoop Distributed File System (HDFS), 22

Hadoop YARN, 161-162

comparison with Mesos, 28

frameworks for, 165-166

motivation for developing, 162-163

as resource scheduler, 163-165

HaLoop, 13-14, 41-42

Hama, 15

header information in PMML, 81

history of big data analytics terminology, 1-2

HMMs (Hidden Markov Models), 168

I

incremental iterations, 133-134

inductive approach, transductive approach versus, 65

integration problems, 6

interactive queries, 23-25

Internet of Things (IoT), 169

Internet traffic filtering example (real-time analytics with Storm), 121-122

interrelated data splits computations, 3

IoT (Internet of Things), 169

isolation in Mesos, 50-52

iterative computations, 3-4

iterative ML algorithms, 13-14

expressibility through RDDs, 41

Hadoop YARN support, 166

with Tez, 166

K

Kafka clusters, 118-120

Kafka spout, 14, 97-98

L

limitations

of Hadoop, 3-9, 21-23, 162-163

of Pregel, 138-139

of Spark, 7

linear algebraic computations, 4

linear regression support in PMML, 88-89, 182-186

local grouping (Storm), 100

locking engine (GraphLab), 143

logistic regression (LR), 67-70

binary LR, 67-69

estimators, 69-70

multinomial LR, 70

in Spark, 70-73

in Storm, 107-110

M

machine learning (ML) tools. See ML (machine learning) tools

machine-to-machine (M2M) data, 116-117

Mahout, 10-11, 107-108

manufacturing log classification example (real-time analytics with Storm), 116-121

Map-Reduce (MR)

expressibility through RDDs, 40

parallel databases versus, 23-25

Markov Chain Monte Carlo (MCMC), 168

mathematics of SVM (support vector machine), 76-78

matmul primitive, 11

MCMC (Markov Chain Monte Carlo), 168

Mesos, 13

architecture, 48

cluster resource sharing, 46-47

comparison with Hadoop YARN, 28

fault tolerance, 52

isolation, 50-52

motivation for developing, 25-26

resource allocation, 49-50

message processing guarantees in Storm, 100-102

mining schema in PMML, 84-85

ML (machine learning), 9-16, 61

comparison of generations, 16

decision trees example, 62-64

first-generation tools, 10

Hadoop YARN support, 166

logistic regression (LR). See logistic regression (LR)

MLbase, 90-91

PMML. See PMML (Predictive Modeling Markup Language)

real-time analytics. See also Storm

alternatives to Storm, 122-123

D-Streams, 124-126

Internet traffic filtering example, 121-122

manufacturing log classification example, 116-121

Storm example, 97-99

second-generation tools, 10-12

support in Shark, 46

SVM. See SVM (support vector machine)

taxonomy of, 65-66

with Tez, 166

third-generation tools, 12-16

graph computations, 14-15

iterative ML algorithms, 13-14

real-time analytics, 14

uses for, 61-62

MLbase, 90-91

model definition in PMML, 84

monolithic schedulers, 28

MR (Map-Reduce)

expressibility through RDDs, 40

parallel databases versus, 23-25

multicore version of GraphLab, 139-141

multinomial LR (logistic regression), 70

N

Naive Bayes support in PMML

in Spark, 87-88, 171-182

in Storm, 113-116

Nectar, 42

Neo4j, 129-130

Nimbus, 28

no grouping (Storm), 100

Node Manager (NM), 164-165

O

Omega, 28

online learning, batch learning versus, 66

Ooyala, 30-31

open source implementations of Pregel, 134-137

operators, D-Streams, 125-126

optimization problems, 5

OS Level virtualization, 51

outputs in PMML, 85

P

page rank algorithm example (GraphLab), 147-153, 186-191

parallel databases, 23-25

ParsePoint, 71-72

partial DAG execution, 43-44

partition pruning in Shark, 46

partitioning with PowerGraph, 145-147

Pegasus, 132-133

Phoebus, 136

Piccolo, 138

PMML (Predictive Modeling Markup Language), 12

linear regression support, 88-89, 182-186

Naive Bayes support

in Spark, 87-88, 171-182

in Storm, 113-116

producers and consumers, 85-86

structure of, 80-85

support in Spark, 79-80

PowerGraph, 15, 143-147

Pregel framework, 14-15, 130-132

limitations, 138-139

open source implementations, 134-137

producers in PMML, 85-86

Q

quantum computing, 168

R

R, scaling over Hadoop, 12

random forest (RF) example (machine learning), 62-64

RDDs (resilient distributed datasets), 33-36

expressibility of, 40-41

implementation, 36-38

real-time analytics, 7-8, 14. See also Storm

alternatives to Storm, 122-123

D-Streams, 124-126

Internet traffic filtering example, 121-122

manufacturing log classification example, 116-121

Storm example, 97-99

recommender systems, 61

regression

classification versus, 66

linear regression support in PMML, 88-89, 182-186

logistic regression (LR), 67-70

binary LR, 67-69

estimators, 69-70

multinomial LR, 70

in Spark, 70-73

in Storm, 107-110

reinforcement learning, ML (machine learning) as, 66

Remote Procedure Calls (RPCs), 102-104

resilient distributed datasets (RDDs), 33-36

expressibility of, 40-41

implementation, 36-38

resource allocation in Mesos, 49-50

resource management layer in BDAS, 26-28

Resource Manager (RM), 163-165

resource scheduler, Hadoop YARN as, 163-165

RF (random forest) example (machine learning), 62-64

RM (Resource Manager), 163-165

RPCs (Remote Procedure Calls), 102-104

S

S4 system, 123

Scala, 23

second-generation ML (machine learning) tools, 10-12

comparison with other generations, 16

Shark, 13, 42-46

columnar memory store, 44-45

distributed data loading, 45

full partition-wise joins, 45

ML (machine learning) support, 46

motivation for developing, 23-25

partition pruning, 46

Spark extensions for, 43-44

shuffle grouping (Storm), 99

Spark

data processing in, 31-32

DSM (Distributed Shared Memory) systems versus, 38-40

expressibility of RDDs, 40-41

extensions for Shark, 43-44

Hadoop YARN support, 166

implementation, 36-38

as iterative ML algorithm, 13

logistic regression (LR) in, 70-73

MLbase, 90-91

motivation for developing, 23

PMML support, 79-80

for linear regression, 88-89, 182-186

for Naive Bayes, 87-88, 171-182

RDDs (resilient distributed datasets), 33-36

similar systems, 41-42

streaming, 124-126

suitability and limitations, 7

SVM (support vector machine) in, 78-79

split-data computations, 3

spouts in Storm, 95-96

real-time computing example, 97-99

SQL interfaces, Shark as, 42-46

Stanford GPS (Graph Processing System), 137

stateful operators, D-Streams, 125-126

stateless operators, D-Streams, 125

stochastic gradient descent example (GraphLab), 153-156, 191-207

Storm, 14, 93

alternatives to, 122-123

clusters, 96-97

design patterns

distributed remote procedure calls (DRPCs), 102-106

Trident, 106-107

Internet traffic filtering example, 121-122

logistic regression (LR) in, 107-110

manufacturing log classification example, 116-121

message processing guarantees, 100-102

PMML support for Naive Bayes, 113-116

real-time computing example, 97-99

streams, 95

grouping, 99-100

support vector machine (SVM) in, 110-113

topology, 95-96

uses for, 93-94

Stratosphere, 133-134

streams

Spark streaming, 124-126

in Storm, 95

grouping, 99-100

suitability

of Hadoop, 3-9

of Spark, 7

supervised learning, ML (machine learning) as, 65

Surfer, 133

SVM (support vector machine), 74-79

complex decision planes, 74-76

mathematics of, 76-78

in Spark, 78-79

in Storm, 110-113

T

Tachyon, 13

targets in PMML, 85

Tez, 165-166

third-generation ML (machine learning) tools, 12-16

comparison with previous generations, 16

graph computations, 14-15

iterative ML algorithms, 13-14

real-time analytics, 14

topology of Storm, 95-96

transductive approach, inductive approach versus, 65

transformations

in PMML, 82-84

on Spark RDDs, 33

Trident, 106-107

Twister, 14, 41-42

U

unsupervised learning, ML (machine learning) as, 66

V

value mapping in PMML, 84

vertex program in GraphLab, 149-153

Y-Z

Yahoo, 30

YARN, 161-162

comparison with Mesos, 28

frameworks for, 165-166

motivation for developing, 162-163

as resource scheduler, 163-165

Zookeeper, 96-97, 134
