Index

A

A/B testing
- about / A/B testing
actions engine
- about / Basic components of a data-driven system, Actions engine
Activity Monitor
- about / System monitoring
Alternating least squares (ALS)
- about / ML libraries
Analysis of Variance (ANOVA)
- about / Multivariate regression
AngularJS
- about / UI component
- URL / UI component
annotation
- about / Segmentation, annotation, and chunking
ANother Tool for Language Recognition (ANTLR)
- URL / Text analysis pipeline
architecture, Spark
- about / Understanding Spark architecture
- task scheduling / Task scheduling
- Spark components / Spark components
- MQTT / MQTT, ZeroMQ, Flume, and Kafka
- ZeroMQ / MQTT, ZeroMQ, Flume, and Kafka
- Flume / MQTT, ZeroMQ, Flume, and Kafka
- Kafka / MQTT, ZeroMQ, Flume, and Kafka
- HDFS / HDFS, Cassandra, S3, and Tachyon
- Cassandra / HDFS, Cassandra, S3, and Tachyon
- S3 / HDFS, Cassandra, S3, and Tachyon
- Tachyon / HDFS, Cassandra, S3, and Tachyon
- Mesos / Mesos, YARN, and Standalone
- YARN / Mesos, YARN, and Standalone
- Standalone / Mesos, YARN, and Standalone
Aster Data
- URL / Sessionization
Avro
- about / Other serialization formats
AvroParquet
- about / Other serialization formats
Azkaban
- about / Data transformation layer

B

Balancer
- about / Running Hadoop HDFS
basic sampling
- about / Basic, stratified, and consistent sampling
Box-Cox transformation
- about / Heteroscedasticity
Breeze
- URL / Linear regression
build.sbt file
- about / SBT
- URL / SBT

C

Cassandra
- about / HDFS, Cassandra, S3, and Tachyon
categorical field
- distinct values / Distinct values of a categorical field
central limit theorem (CLT)
- about / Sequential trials and dealing with risk
chunking
- about / Segmentation, annotation, and chunking
classification metrics
- about / Classification metrics
Client
- about / Running Hadoop HDFS
clique
- about / A quick introduction to graphs
Cloudera
- URL / Running Hadoop HDFS
command and control (C2)
- about / Influence diagrams
complex types
- ARRAY / Hive and Impala
- MAP / Hive and Impala
- STRUCT / Hive and Impala
- UNIONTYPE / Hive and Impala
connected components
- about / A quick introduction to graphs, Connected components
consistent sampling
- about / Basic, stratified, and consistent sampling
continuous space
- about / Continuous space and metrics
correlation engine
- about / Basic components of a data-driven system, Correlation engine
correlations
- about / Basic correlations

D

data-driven system
- data ingest / Basic components of a data-driven system
- data transformation layer / Basic components of a data-driven system
- data analytics / Basic components of a data-driven system
- machine learning engine / Basic components of a data-driven system
- UI component / Basic components of a data-driven system
- actions engine / Basic components of a data-driven system
- correlation engine / Basic components of a data-driven system
- monitoring / Basic components of a data-driven system
data analysis life cycle / Linear models
data analytics
- about / Basic components of a data-driven system, Data analytics and machine learning
DataFrame
- using / Spark SQL and DataFrame
data frames
- reference link / PySpark
DataFrameStatFunctions
- URL / Working with Scala and Spark Notebooks
data ingest
- about / Basic components of a data-driven system, Data ingest
- Syslog / Data ingest
- Rsync / Data ingest
- Kafka / Data ingest
data rearranging
- about / Sessionization
data transformation layer
- about / Basic components of a data-driven system, Data transformation layer
- Oozie / Data transformation layer
- Azkaban / Data transformation layer
- StreamSets / Data transformation layer
decision tree
- about / Decision tree
decision tree, parameters
- maxDepth / Decision tree
- minInstancesPerNode / Decision tree
- maxBins / Decision tree
- minInfoGain / Decision tree
- maxMemoryInMB / Decision tree
- subsamplingRate / Decision tree
- useNodeIdCache / Decision tree
- checkpointDir / Decision tree
- checkpointInterval / Decision tree
descriptive statistics
- about / Working with Scala and Spark Notebooks
Directed Acyclic Graph (DAG)
- about / Graph constraints
Dirichlet distribution
- about / LDA
distributed algorithms
- reference link / LDA
Drools
- URL / Actions engine
Dropwizard
- URL / UI component
- about / UI component
Druid
- about / Data transformation layer
- URL / Data transformation layer

E

e-mails
- obtaining / Who is getting e-mails?
edge list
- about / A quick introduction to graphs
edges
- about / A quick introduction to graphs
- adding, to graph / Adding nodes and edges
Elastic Net
- about / Regularization
Emacs / SBT
ensemble learning methods
- about / Bagging and boosting – ensemble learning methods
Expectation Maximization (EM) algorithm
- about / LDA
exploration-exploitation trade-off
- about / Exploration and exploitation
extract, transform, and load (ETL)
- about / Basic components of a data-driven system

F

FACTORIE toolkit
- URL / POS tagging
- binary image, URL / POS tagging
feature construction
- reference link / MLlib algorithms in Spark
Flex
- URL / Text analysis pipeline
Flume
- about / Data ingest, MQTT, ZeroMQ, Flume, and Kafka
- URL / MQTT, ZeroMQ, Flume, and Kafka
functional approach
- versus object-oriented approach / Other serialization formats

G

Ganglia
- URL / Monitoring, System monitoring
- about / System monitoring
Gaussian mixture
- about / Unsupervised learning
generalization error
- about / Generalization error and overfitting
graph
- about / A quick introduction to graphs
graph algorithms
- about / Graph algorithms – GraphX and GraphFrames
- GraphX / Graph algorithms – GraphX and GraphFrames
- GraphFrames / Graph algorithms – GraphX and GraphFrames
Graph for Scala
- graph, creating / Graph for Scala
- reference link / Graph for Scala
- nodes, adding / Adding nodes and edges
- edges, adding / Adding nodes and edges
- constraints, setting / Graph constraints
- support for JSON / JSON
GraphFrames
- about / Graph algorithms – GraphX and GraphFrames
Graphite
- URL / Monitoring
GraphX
- about / Graph algorithms – GraphX and GraphFrames, GraphX
- node IDs / GraphX
- e-mails, obtaining / Who is getting e-mails?
- connected components / Connected components
- triangle counting algorithm / Triangle counting
- strongly connected components / Strongly connected components
- PageRank algorithm / PageRank
- SVD++ / SVD++

H

Hadoop Distributed File System (HDFS)
- about / Task scheduling
Hadoop HDFS
- executing / Running Hadoop HDFS
- URL / Running Hadoop HDFS
HDFS
- about / HDFS, Cassandra, S3, and Tachyon
heteroscedasticity
- about / Heteroscedasticity
Hive
- about / Hive and Impala
- URL, for downloading / Hive and Impala
Homebrew package
- installation link / Setting up Python

I

Ignite File System (IGFS)
- URL / HDFS, Cassandra, S3, and Tachyon
Impala
- about / Hive and Impala
influence diagrams
- about / Influence diagrams
- demonstration / Influence diagrams
interactivity
- about / Optimization and interactivity
Iris dataset
- about / Iris dataset
- URL / Iris dataset

J

Java Management Extensions (JMX)
- about / Process monitoring
Java Mission Control (JMC)
- about / System monitoring
Java Specification Request (JSR) / Linear models
- about / Process monitoring
joda-time library
- about / GraphX
JSON format
- about / Other serialization formats
JSON package
- URL / JSON
JSON support
- about / JSON
JSR110
- about / Process monitoring
JSR 223
- reference link / Jython and JSR 223
- about / Jython and JSR 223
Jython
- reference link / Jython and JSR 223

K

k-means clustering
- about / Unsupervised learning
Kafka
- about / Data ingest, MQTT, ZeroMQ, Flume, and Kafka
Kamon
- about / Monitoring
- URL / Monitoring
Kelly Criterion
- about / Sequential trials and dealing with risk
Kryo
- about / Other serialization formats
Kudu
- URL / HDFS, Cassandra, S3, and Tachyon
Kullback-Leibler (KL) distance
- about / Continuous space and metrics

L

LabeledPoint
- about / Nested data
labeled point
- about / Labeled point
- reference link / Labeled point
Latent Dirichlet Allocation (LDA)
- about / ML libraries, Unsupervised learning, LDA
Least Absolute Shrinkage and Selection Operator (LASSO)
- about / Regularization
LIBSVM format
- need for / Labeled point
Lift
- about / UI component
lift-json library
- about / GraphX
Lift framework
- about / SBT
Limited-Memory BFGS (L-BFGS)
- about / ML libraries
linear regression
- about / Linear regression
Linear Support Vector Machine (SVM)
- about / SVMWithSGD
logistic regression
- about / Logistic regression, Logistic regression
loss functions
- about / Linear regression

M

Machine Learning (ML)
- about / Data analytics and machine learning
machine learning engine
- about / Basic components of a data-driven system, Data analytics and machine learning
map optimization
- about / Using word2vec to find word relationships
Markov Chain Decision Process
- about / Influence diagrams
Mesos
- URL / Task scheduling
- about / Mesos, YARN, and Standalone
metrics
- about / Continuous space and metrics
micro-batch processing
- about / Streaming word count
mirrors
- reference link / Linux
MLlib algorithms
- about / MLlib algorithms in Spark
- Term Frequency Inverse Document Frequency (TF-IDF) / TF-IDF
- Latent Dirichlet Allocation (LDA) / LDA
ML libraries
- about / ML libraries
- SparkR / SparkR
- graph algorithms / Graph algorithms – GraphX and GraphFrames
model monitoring
- about / Model monitoring
- performance, monitoring / Performance over time
- model, retiring criteria / Criteria for model retiring
- A/B testing / A/B testing
monitoring
- about / Basic components of a data-driven system, Monitoring
Monthly Active Users (MAU)
- about / Linear regression
MQTT
- about / MQTT, ZeroMQ, Flume, and Kafka
multiclass problems
- about / Multiclass problems
Multilayer Perceptron Classifier (MLPC)
- about / Perceptron
Multivariate Analysis of Variance (MANOVA)
- about / Multivariate regression
multivariate regression
- about / Multivariate regression
MurmurHash function
- about / Basic, stratified, and consistent sampling

N

.NET MyMediaLite library
- reference link / SVD++
NameNode
- about / Running Hadoop HDFS
NameNode UI
- URL / Running Hadoop HDFS
nested data
- about / Nested data
- working with / Nested data
NodeJS
- about / UI component
Node Manager
- about / Mesos, YARN, and Standalone
nodes
- adding, to graph / Adding nodes and edges
numeric field
- summarization / Summarization of a numeric field
- grepping, across multiple fields / Grepping across multiple fields

O

object-oriented approach
- versus functional approach / Other serialization formats
Online Transaction Processing (OLTP)
- about / Hive and Impala
Oozie
- about / Data transformation layer
optimization
- about / Optimization and interactivity
- feedback loops / Feedback loops
outputs, linear models
- residuals / Linear models
- coefficients / Linear models
- residual standard error / Linear models
- multiple R-squared / Linear models
- F-statistic / Linear models
overfitting
- about / Generalization error and overfitting

P

PageRank algorithm
- about / PageRank
parameters, SparkR glm implementation
- formula / Generalized linear model
- family / Generalized linear model
- data / Generalized linear model
- lambda / Generalized linear model
- alpha / Generalized linear model
- standardize / Generalized linear model
- solver / Generalized linear model
Pareto chart
- about / Working with Scala and Spark Notebooks
Parquet
- about / Nested data, Other serialization formats
- reference link / Nested data
parquet file
- about / Nested data
- URL / Nested data
pattern matching
- working with / Working with pattern matching
Pearson correlation coefficient
- about / Working with Scala and Spark Notebooks
perceptron
- about / Perceptron
Play
- about / UI component
Play framework
- about / SBT
Poisson distribution
- about / Heteroscedasticity
Porter Stemmer
- implementation / Simple text analysis, A Porter Stemmer implementation of the code
- URL / Simple text analysis
- reference link / A Porter Stemmer implementation of the code
POS (part-of-speech) tagging
- about / POS tagging
Power Iteration Clustering (PIC)
- about / ML libraries, Unsupervised learning
Principal Component Analysis (PCA)
- about / ML libraries
probabilistic structures
- about / Probabilistic structures
problem dimensionality
- about / Problem dimensionality
process monitoring
- about / Process monitoring
Project Gutenberg
- URL / Simple text analysis
projections
- about / Projections
Protobuf
- about / Other serialization formats
pseudo-regret
- about / Exploration and exploitation
PySpark / PySpark
Python
- integrating with Scala / Integrating with Python
- setting up / Setting up Python
- PySpark / PySpark
- calling from Java/Scala / Calling Python from Java/Scala
Python, calling from Java/Scala
- about / Calling Python from Java/Scala
- sys.process._, using / Using sys.process._
- Spark pipe / Spark pipe
- Jython / Jython and JSR 223
- JSR 223 / Jython and JSR 223

R

R
- Scala, integrating with / Integrating with R
- setting up / Setting up R and SparkR
- setting up, on Linux / Linux
- setting up, on Mac OS / Mac OS
- for Mac OS, download link / Mac OS
- for Windows, download link / Windows
- setting up, on Windows / Windows
read-evaluate-print-loop (REPL)
- about / Getting started with Scala
Receiver Operating Characteristic (ROC)
- about / SVMWithSGD
regression
- about / What regression stands for?
regression trees
- about / Regression trees
regularization
- about / Regularization
Remote Procedure Call (RPC)
- about / Other serialization formats
Resilient Distributed Dataset (RDD)
- about / Task scheduling
Resource Manager
- about / Mesos, YARN, and Standalone
risk handling
- about / Sequential trials and dealing with risk
ROC
- about / SVMWithSGD
Rsclient/Rserve
- reference link / Using Rserve
RStudio
- reference link / Running Spark via R's command line
Rsync
- about / Data ingest
Run-Length Encoding (RLE)
- about / Nested data

S

S3
- about / HDFS, Cassandra, S3, and Tachyon
SBT
- about / SBT
- features / SBT
- URL / SBT
sbteclipse project
- URL / SBT
Scala
- URL, for downloading / Getting started with Scala
- installing / Getting started with Scala
- working with / Working with Scala and Spark Notebooks
- integrating, with R / Integrating with R
- big data / Generalized linear model
- nulls / Generalized linear model
- invoking from R / Invoking Scala from R
- Rserve, using / Using Rserve
- integrating, with Python / Integrating with Python
Scala, integrating with Python
- about / Integrating with Python
- PySpark / PySpark
Scala, integrating with R
- DataFrames / DataFrames
- linear models / Linear models
- generalized linear model / Generalized linear model
- JSON files, reading in SparkR / Reading JSON files in SparkR
- Parquet files, writing in SparkR / Writing Parquet files in SparkR
Scala API
- reference link / PySpark
scalastyle plugin
- URL / SBT
Scala Swing
- about / UI component
Scalate template
- URL / Process monitoring
Scalatra
- URL / Process monitoring
Secondary NameNode
- about / Running Hadoop HDFS
segmentation
- about / Segmentation, annotation, and chunking
sequential trials
- managing / Sequential trials and dealing with risk
serialization
- about / Simple text analysis
serialization formats
- about / Other serialization formats
- XML / Other serialization formats
- JSON / Other serialization formats
- YAML / Other serialization formats
- Protobuf / Other serialization formats
- Avro / Other serialization formats
- Thrift / Other serialization formats
- Parquet / Other serialization formats
- Kryo / Other serialization formats
sessionization
- about / Sessionization
Singular Value Decomposition (SVD)
- about / ML libraries, SVD++
Slick
- about / UI component
Spark
- setting up / Setting up Spark
- URL / Setting up Spark
- architecture / Understanding Spark architecture
- components / Spark components
- performance, tuning / Spark performance tuning
Spark, applications
- word count / Word count
- word count, streaming / Streaming word count
- Spark SQL, using / Spark SQL and DataFrame
- DataFrame, using / Spark SQL and DataFrame
SPARK-3703
- reference link / Bagging and boosting – ensemble learning methods
Spark Master
- about / Task scheduling
Spark Notebook
- URL / Working with Scala and Spark Notebooks
Spark Notebooks
- working with / Working with Scala and Spark Notebooks
SparkR
- about / SparkR
- setting up / Setting up R and SparkR
- setting up, on Linux / Linux
- Linux setup, reference link / Linux
- setting up, on Mac OS / Mac OS
- setting up, on Windows / Windows
- running, via scripts / Running SparkR via scripts
- running, via R command line / Running Spark via R's command line
- JSON files, reading / Reading JSON files in SparkR
- Parquet files, writing / Writing Parquet files in SparkR
Spark RDDs
- reference link / PySpark
Spark SQL
- using / Spark SQL and DataFrame
Standalone
- URL / Task scheduling
- about / Mesos, YARN, and Standalone
Stochastic Gradient Descent (SGD)
- about / ML libraries
Stochastic Gradient Descent (SGD) algorithm
- about / Logistic regression
stratified sampling
- about / Basic, stratified, and consistent sampling
streaming k-means
- about / Unsupervised learning
StreamSets
- about / Data transformation layer
strongly connected components
- about / Strongly connected components
supervised learning
- about / Records and supervised learning
- Iris dataset / Iris dataset
- labeled point / Labeled point
- SVMWithSGD / SVMWithSGD
- logistic regression / Logistic regression
- decision tree / Decision tree
- ensemble learning methods / Bagging and boosting – ensemble learning methods
SVD++
- about / SVD++
SVMWithSGD
- about / SVMWithSGD
Syslog
- about / Data ingest
system monitoring
- about / System monitoring

T

Tachyon
- about / HDFS, Cassandra, S3, and Tachyon
task scheduling
- about / Task scheduling
Term Frequency Inverse Document Frequency (TF-IDF)
- about / TF-IDF
text analysis pipeline
- about / Text analysis pipeline
- simple text analysis / Simple text analysis
Thrift
- about / Other serialization formats
- reference link / Other serialization formats
tokenization
- about / Text analysis pipeline
traits
- working with / Working with traits
triangle counting algorithm
- about / Triangle counting
triangle inequality
- about / Continuous space and metrics
Turkey paradox
- about / Unknown unknowns

U

UI component
- about / Basic components of a data-driven system, UI component
- Scala Swing / UI component
- Lift / UI component
- Play / UI component
- Dropwizard / UI component
- Slick / UI component
- NodeJS / UI component
- AngularJS / UI component
unknown unknowns
- about / Unknown unknowns
unstructured data
- usage / Other uses of unstructured data
unsupervised learning
- about / Unsupervised learning

V

Vapnik-Chervonenkis (VC) dimension
- about / Problem dimensionality
Vector
- about / Nested data
- SparseVector / Nested data
- DenseVector / Nested data
vertices
- about / A quick introduction to graphs
vi / SBT

W

weighted graph
- about / A quick introduction to graphs
word2vec
- using / Using word2vec to find word relationships
- Porter Stemmer / A Porter Stemmer implementation of the code

X

XML format
- about / Other serialization formats

Y

YAML format
- about / Other serialization formats
YARN
- URL / Task scheduling
- about / Mesos, YARN, and Standalone

Z

ZeroMQ
- about / MQTT, ZeroMQ, Flume, and Kafka
