Index

A

actions on Spark RDDs, 33

Akka system, 123

alignment problems, 6

all grouping (Storm), 100

AM (Application Master), 163-165

analytics

    defined, 1-2

    future of, 166-169

    history of term, 1-2

Apache Hama, 136

application layer in BDAS, 29-31

Application Master (AM), 163-165

Apply phase (GraphLab vertex program), 151

architecture

    of BDAS, 26-31

    of Hadoop YARN, 165

    of Mesos, 48

    of MLbase, 90-91

B

basic statistics computations, 4

batch learning, online learning versus, 66

BDAS (Berkeley Data Analytics Stack), 13, 21

    design and architecture, 26-31

    Mesos

        architecture, 48

        cluster resource sharing, 46-47

        fault tolerance, 52

        isolation, 50-52

        resource allocation, 49-50

    motivation for developing, 21-23

    Shark, 42-46

        columnar memory store, 44-45

        distributed data loading, 45

        full partition-wise joins, 45

        ML (machine learning) support, 46

        partition pruning, 46

        Spark extensions for, 43-44

    Spark

        data processing in, 31-32

        DSM (Distributed Shared Memory) systems versus, 38-40

        expressibility of RDDs, 40-41

        implementation, 36-38

        RDDs (resilient distributed datasets), 33-36

        similar systems, 41-42

BI (business intelligence), history of term, 1-2

big data

    future of, 166-169

    history of term, 2

binary LR (logistic regression), 67-69

BLinkDB, 29-30

bolts in Storm, 95-96

    real-time computing example, 97-99

BSP (Bulk Synchronous Parallel)

    applications for, 132-134

    expressibility through RDDs, 40

    Pregel as, 14-15, 130-132

        limitations, 138-139

        open source implementations, 134-137

bulk iterations, 133

business intelligence (BI), history of term, 1-2

C

CGD, as iterative ML algorithm, 11-12

chromatic engine (GraphLab), 142

classification, regression versus, 66

cluster resource sharing, 25-26, 46-47

clusters in Storm, 96-97

code sketches, 171

    JPMMLLinearRegInSpark.java, 182

    NaiveBayesHandler.java, 171

    NaiveBayesPMMLBolt.java, 178

    sgd.cpp, 191

    Simple_pagerank.cpp, 186

columnar memory store in Shark, 44-45

complex data structure processing, 8

complex decision planes in SVM, 74-76

computation paradigms, list of, 4-6

consistency models (GraphLab), 140-141

consumers in PMML, 85-86

continuous transformations in PMML, 82

Conviva, 30

D

data dictionary in PMML, 81-82

data management layer in BDAS, 28-29

data processing

    in BDAS, 29

    in Spark, 31-32

data splits computations, 3

data transformations. See transformations

daxpy primitive, 11

ddot primitive, 11

decision trees example (machine learning), 62-64

deep learning, 167

dependencies between Spark RDDs, 36

design of BDAS, 26-31

design patterns in Storm

    distributed remote procedure calls (DRPCs), 102-106

    Trident, 106-107

direct grouping (Storm), 100

discrete normalization in PMML, 83

discretization in PMML, 83

disk-based single-node analytics, 168-169

distributed data loading in Shark, 45

distributed remote procedure calls (DRPCs), 102-106

Distributed Shared Memory (DSM) systems, Spark versus, 38-40

distributed SQL systems, Shark as, 42-46

distributed version of GraphLab, 141-143

Domain Specific Language (DSL), 166-167

Dominant Resource Fairness (DRF), 49

Dremel, 123

Drill, 123

DRPCs (distributed remote procedure calls), 102-106

DryadLINQ, 40

DSL (Domain Specific Language), 166-167

DSM (Distributed Shared Memory) systems, Spark versus, 38-40

D-Streams, 124-126

E

estimators in logistic regression (LR), 69-70

expressibility of RDDs (resilient distributed datasets), 40-41

F

Facebook graph search, 129

fairness algorithms, 49-50

fault tolerance in Mesos, 52

field grouping (Storm), 99

first-generation ML (machine learning) tools, 10

    comparison with subsequent generations, 16

Forge, 166-167

framework schedulers, 28, 46-47

frameworks over Hadoop YARN, 165-166

full partition-wise joins in Shark, 45

future of big data analytics, 166-169

G

GAS (Gather, Apply, Scatter), 143-144, 149-153

Gather phase (GraphLab vertex program), 151

GBASE, 132

generalized N-body problems, 4

Giraph, 134-136, 166

global grouping (Storm), 100

GoldenORB, 136

GPS (Graph Processing System), 137

graph computations, 5, 8, 14-15, 129-130

    applications for, 132-134

    GraphLab

        distributed version, 141-143

        multicore version, 139-141

        page rank algorithm example, 147-153, 186-191

        PowerGraph, 143-147

        stochastic gradient descent example, 153-156, 191-207

        vertex program, 149-153

    Hadoop YARN support, 166

    with Pregel, 130-132

        limitations, 138-139

        open source implementations, 134-137

Graph Processing System (GPS), 137

GraphChi, 168-169

GraphLab, 15, 138

    distributed version, 141-143

    Hadoop YARN support, 166

    multicore version, 139-141

    page rank algorithm example, 147-153, 186-191

    PowerGraph, 143-147

    stochastic gradient descent example, 153-156, 191-207

    vertex program, 149-153

GraphX, 166

grouping Storm streams, 99-100

H

Hadoop

    in history of big data analytics, 2

    suitability and limitations, 3-9, 21-23, 162-163

Hadoop Distributed File System (HDFS), 22

Hadoop YARN, 161-162

    comparison with Mesos, 28

    frameworks for, 165-166

    motivation for developing, 162-163

    as resource scheduler, 163-165

HaLoop, 13-14, 41-42

Hama, 15

header information in PMML, 81

history of big data analytics terminology, 1-2

HMMs (Hidden Markov Models), 168

I

incremental iterations, 133-134

inductive approach, transductive approach versus, 65

integration problems, 6

interactive queries, 23-25

Internet of Things (IoT), 169

Internet traffic filtering example (real-time analytics with Storm), 121-122

interrelated data splits computations, 3

IoT (Internet of Things), 169

isolation in Mesos, 50-52

iterative computations, 3-4

iterative ML algorithms, 13-14

    expressibility through RDDs, 41

    Hadoop YARN support, 166

    with Tez, 166

K

Kafka clusters, 118-120

Kafka spout, 14, 97-98

L

limitations

    of Hadoop, 3-9, 21-23, 162-163

    of Pregel, 138-139

    of Spark, 7

linear algebraic computations, 4

linear regression support in PMML, 88-89, 182-186

local grouping (Storm), 100

locking engine (GraphLab), 143

logistic regression (LR), 67-70

    binary LR, 67-69

    estimators, 69-70

    multinomial LR, 70

    in Spark, 70-73

    in Storm, 107-110

M

machine learning (ML) tools. See ML (machine learning) tools

machine-to-machine (M2M) data, 116-117

Mahout, 10-11, 107-108

manufacturing log classification example (real-time analytics with Storm), 116-121

Map-Reduce (MR)

    expressibility through RDDs, 40

    parallel databases versus, 24-25

Markov Chain Monte Carlo (MCMC), 168

mathematics of SVM (support vector machine), 76-78

matmul primitive, 11

MCMC (Markov Chain Monte Carlo), 168

Mesos, 13

    architecture, 48

    cluster resource sharing, 46-47

    comparison with Hadoop YARN, 28

    fault tolerance, 52

    isolation, 50-52

    motivation for developing, 25-26

    resource allocation, 49-50

message processing guarantees in Storm, 100-102

mining schema in PMML, 84-85

ML (machine learning), 9-16, 61

    comparison of generations, 16

    decision trees example, 62-64

    first-generation tools, 10

    Hadoop YARN support, 166

    logistic regression (LR). See logistic regression (LR)

    MLbase, 90-91

    PMML. See PMML (Predictive Modeling Markup Language)

    real-time analytics. See also Storm

        alternatives to Storm, 122-123

        D-Streams, 124-126

        Internet traffic filtering example, 121-122

        manufacturing log classification example, 116-121

        Storm example, 97-99

    second-generation tools, 10-12

    support in Shark, 46

    SVM. See SVM (support vector machine)

    taxonomy of, 65-66

    with Tez, 166

    third-generation tools, 12-16

        graph computations, 14-15

        iterative ML algorithms, 13-14

        real-time analytics, 14

    uses for, 61-62

MLbase, 90-91

model definition in PMML, 84

monolithic schedulers, 28

MR (Map-Reduce)

    expressibility through RDDs, 40

    parallel databases versus, 23-25

multicore version of GraphLab, 139-141

multinomial LR (logistic regression), 70

N

Naive Bayes support in PMML

    in Spark, 87-88, 171-182

    in Storm, 113-116

Nectar, 42

Neo4j, 129-130

Nimbus, 28

no grouping (Storm), 100

Node Manager (NM), 164-165

O

Omega, 28

online learning, batch learning versus, 66

Ooyala, 30-31

open source implementations of Pregel, 134-137

operators, D-Streams, 125-126

optimization problems, 5

OS-level virtualization, 51

outputs in PMML, 85

P

page rank algorithm example (GraphLab), 147-153, 186-191

parallel databases, 23-25

ParsePoint, 71-72

partial DAG execution, 43-44

partition pruning in Shark, 46

partitioning with PowerGraph, 145-147

Pegasus, 132-133

Phoebus, 136

Piccolo, 138

PMML (Predictive Modeling Markup Language), 12

    linear regression support, 88-89, 182-186

    Naive Bayes support

        in Spark, 87-88, 171-182

        in Storm, 113-116

    producers and consumers, 85-86

    structure of, 80-85

    support in Spark, 79-80

PowerGraph, 15, 143-147

Pregel framework, 14-15, 130-132

    limitations, 138-139

    open source implementations, 134-137

producers in PMML, 85-86

Q

quantum computing, 168

R

R, scaling over Hadoop, 12

random forest (RF) example (machine learning), 62-64

RDDs (resilient distributed datasets), 33-36

    expressibility of, 40-41

    implementation, 36-38

real-time analytics, 7-8, 14. See also Storm

    alternatives to Storm, 122-123

    D-Streams, 124-126

    Internet traffic filtering example, 121-122

    manufacturing log classification example, 116-121

    Storm example, 97-99

recommender systems, 61

regression

    classification versus, 66

    linear regression support, in PMML, 88-89, 182-186

    logistic regression (LR), 67-70

        binary LR, 67-69

        estimators, 69-70

        multinomial LR, 70

        in Spark, 70-73

        in Storm, 107-110

reinforcement learning, ML (machine learning) as, 66

Remote Procedure Calls (RPCs), 102-104

resilient distributed datasets (RDDs), 33-36

    expressibility of, 40-41

    implementation, 36-38

resource allocation in Mesos, 49-50

resource management layer in BDAS, 26-28

Resource Manager (RM), 163-165

resource scheduler, Hadoop YARN as, 163-165

RF (random forest) example (machine learning), 62-64

RM (Resource Manager), 163-165

RPCs (Remote Procedure Calls), 102-104

S

S4 system, 123

Scala, 23

second-generation ML (machine learning) tools, 10-12

    comparison with other generations, 16

Shark, 13, 42-46

    columnar memory store, 44-45

    distributed data loading, 45

    full partition-wise joins, 45

    ML (machine learning) support, 46

    motivation for developing, 23-25

    partition pruning, 46

    Spark extensions for, 43-44

shuffle grouping (Storm), 99

Spark

    data processing in, 31-32

    DSM (Distributed Shared Memory) systems versus, 38-40

    expressibility of RDDs, 40-41

    extensions for Shark, 43-44

    Hadoop YARN support, 166

    implementation, 36-38

    as iterative ML algorithm, 13

    logistic regression (LR) in, 70-73

    MLbase, 90-91

    motivation for developing, 23

    PMML support, 79-80

        for linear regression, 88-89, 182-186

        for Naive Bayes, 87-88, 171-182

    RDDs (resilient distributed datasets), 33-36

    similar systems, 41-42

    streaming, 124-126

    suitability and limitations, 7

    SVM (support vector machine) in, 78-79

split-data computations, 3

spouts in Storm, 95-96

    real-time computing example, 97-99

SQL interfaces, Shark as, 42-46

Stanford GPS (Graph Processing System), 137

stateful operators, D-Streams, 125-126

stateless operators, D-Streams, 125

stochastic gradient descent example (GraphLab), 153-156, 191-207

Storm, 14, 93

    alternatives to, 122-123

    clusters, 96-97

    design patterns

        distributed remote procedure calls (DRPCs), 102-106

        Trident, 106-107

    Internet traffic filtering example, 121-122

    logistic regression (LR) in, 107-110

    manufacturing log classification example, 116-121

    message processing guarantees, 100-102

    PMML support for Naive Bayes, 113-116

    real-time computing example, 97-99

    streams, 95

        grouping, 99-100

    support vector machine (SVM) in, 110-113

    topology, 95-96

    uses for, 93-94

Stratosphere, 133-134

streams

    Spark streaming, 124-126

    in Storm, 95

        grouping, 99-100

suitability

    of Hadoop, 3-9

    of Spark, 7

supervised learning, ML (machine learning) as, 65

Surfer, 133

SVM (support vector machine), 74-79

    complex decision planes, 74-76

    mathematics of, 76-78

    in Spark, 78-79

    in Storm, 110-113

T

Tachyon, 13

targets in PMML, 85

Tez, 165-166

third-generation ML (machine learning) tools, 12-16

    comparison with previous generations, 16

    graph computations, 14-15

    iterative ML algorithms, 13-14

    real-time analytics, 14

topology of Storm, 95-96

transductive approach, inductive approach versus, 65

transformations

    in PMML, 82-84

    on Spark RDDs, 33

Trident, 106-107

Twister, 14, 41-42

U

unsupervised learning, ML (machine learning) as, 66

V

value mapping in PMML, 84

vertex program in GraphLab, 149-153

Y-Z

Yahoo, 30

YARN, 161-162

    comparison with Mesos, 28

    frameworks for, 165-166

    motivation for developing, 162-163

    as resource scheduler, 163-165

Zookeeper, 96-97, 134
