Index
A
- AccumulatorParam, Custom Accumulators
- accumulators, Accumulators
- aggregations, Aggregations
- Akka actor stream, Akka actor stream
- Alternating Least Squares (ALS) algorithm, Alternating Least Squares
- Amazon EC2, Amazon EC2-Storage on the cluster
- Amazon S3, Amazon S3
- Amazon Web Services (AWS) account, Launching a cluster
- Apache Flume, Apache Flume
- Apache Hive, Apache Hive
- Apache Kafka, Apache Kafka
- Apache Mesos, Spark Runtime Architecture, Apache Mesos-Configuring resource usage
- Apache Software Foundation
- Apache ZooKeeper, High availability
- applications, Spark, Introduction
- ascending parameter, sortByKey() function, Sorting Data
- assembly JAR, Packaging Your Code and Dependencies
- associative operations, Custom Accumulators
- Aurora, Client and cluster mode
- Avro sink, Push-based receiver
- AvroFlumeEvent, Pull-based receiver
- AWS_ACCESS_KEY_ID variable, Amazon S3, Launching a cluster
- AWS_SECRET_ACCESS_KEY variable, Amazon S3, Launching a cluster
C
- cache(), Determining an RDD’s Partitioner
- caching
- Cassandra, Cassandra
- saving to, from RDD types, Cassandra
- setting Cassandra property in Scala and Java, Cassandra
- Spark Cassandra connector, Cassandra
- table, loading as RDD with key/value data into Spark, Cassandra
- CassandraRow objects, Cassandra
- checkpointing, Spark Streaming, Architecture and Abstraction
- Chronos, Client and cluster mode
- classification, Machine Learning Basics, Classification and Regression
- client mode, Hadoop YARN
- cluster managers, Cluster Managers, Introduction, Cluster Manager, Cluster Managers-Which Cluster Manager to Use?
- cluster mode, Hadoop YARN
- cluster URLs, Deploying Applications with spark-submit
- clustering, Clustering
- clusters
- coalesce(), Tuning the level of parallelism, Level of Parallelism
- coarse-grained Mesos mode, Mesos scheduling modes
- codegen, enabling in Spark SQL, Performance Tuning Options
- cogroup(), Grouping Data
- collaborative filtering, Collaborative Filtering and Recommendation
- collect(), Sorting Data, Components of Execution: Jobs, Tasks, and Stages
- collectAsMap(), Actions Available on Pair RDDs
- combine(), Aggregations
- combineByKey(), Aggregations
- combiners, reduceByKey() and foldByKey() and, Aggregations
- comma-separated value file (see CSV files)
- commutative operations, Custom Accumulators
- compression
- computing cluster, A Unified Stack
- Concurrent Mark-Sweep garbage collector, Garbage Collection and Memory Usage
- conf/spark-defaults.conf file, Configuring Spark with SparkConf
- configuration
- configuring algorithms, Configuring Algorithms
- configuring Spark, Tuning and Debugging Spark
- connections, shared connection pool, Working on a Per-Partition Basis
- core input sources, Spark Streaming, Core Sources
- cores, number for executors, Hardware Provisioning
- count(), Numeric RDD Operations
- countByKey(), Actions Available on Pair RDDs
- countByValue(), Aggregations
- countByValueAndWindow(), Windowed transformations
- countByWindow(), Windowed transformations
- CSV files, File Formats, Comma-Separated Values and Tab-Separated Values
D
- data processing applications, Data Processing Applications
- data science tasks, Data Science Tasks
- data wrangling, Data Science Tasks
- databases
- debugging Spark
- decision trees, Decision trees and random forests
- dependencies
- deployment modes
- dimensionality reduction
- directed acyclic graph (DAG), The Driver, Components of Execution: Jobs, Tasks, and Stages
- directories
- discretized streams (see DStreams)
- distance program (in R), Piping to External Programs
- distributed data and computation, resilient distributed datasets (RDDs), Introduction to Spark’s Python and Scala Shells
- downloading Spark, Downloading Spark
- driver programs, Introduction to Core Spark Concepts, Spark Runtime Architecture, The Driver
- collecting an RDD to, Components of Execution: Jobs, Tasks, and Stages
- deploy mode on Apache Mesos, Client and cluster mode
- deploy modes on Standalone cluster manager, Submitting applications
- deploy modes on YARN cluster manager, Hadoop YARN
- duties performed by, The Driver
- fault tolerance, Driver Fault Tolerance
- in local mode, Executors
- launching in supervise mode, Driver Fault Tolerance
- logs, Driver and Executor Logs
- DStream.repartition(), Level of Parallelism
- DStream.transformWith(), Stateless Transformations
- DStreams (discretized streams), Spark Streaming
- as continuous series of RDDs, Architecture and Abstraction
- creating from Kafka messages, Apache Kafka
- creating with socketTextStream(), A Simple Example
- fault-tolerance properties for, Architecture and Abstraction
- of SparkFlumeEvents, Pull-based receiver
- output operations, Output Operations
- output operations support, Architecture and Abstraction
- transform() operator, Stateless Transformations
- transformations on, Transformations
- transformations support, Architecture and Abstraction
E
- EC2 clusters, Amazon EC2
- Elasticsearch, reading and writing data from, Elasticsearch
- Elephant Bird package (Twitter), Loading with other Hadoop input formats
- empty line count using accumulators (example), Accumulators
- ETL (extract, transform, and load), Working with Key/Value Pairs
- exactly-once semantics for transformations, Processing Guarantees
- execution components, Components of Execution: Jobs, Tasks, and Stages-Components of Execution: Jobs, Tasks, and Stages
- execution graph, The Driver
- executors, Introduction to Core Spark Concepts, Spark Runtime Architecture
- configuration values for, Configuring Spark with SparkConf
- in local mode, Executors
- information on, web UI executors page, Executors: A list of executors present in the application
- logs, Driver and Executor Logs
- memory usage, Memory Management
- memory, number of cores, and total number of executors, Hardware Provisioning
- requests for more memory, causing application not to run, Submitting applications
- resource allocation on Apache Mesos, Configuring resource usage
- resource allocation on Hadoop YARN cluster manager, Configuring resource usage
- resource allocation on Standalone cluster manager, Configuring resource usage
- scheduling modes on Apache Mesos, Mesos scheduling modes
- scheduling tasks on, The Driver
- sizing heaps for, Hardware Provisioning
- exiting a Spark application, Initializing a SparkContext
- Externalizable interface (Java), Optimizing Broadcasts
F
- Fair Scheduler, Scheduling Within and Between Spark Applications
- fakelogs_directory.sh script, Stream of files
- fault tolerance in Spark Streaming, 24/7 Operation
- fault tolerance, accumulators and, Accumulators and Fault Tolerance
- feature extraction algorithms, Overview, Feature Extraction
- feature extraction and transformation, Machine Learning Basics
- feature preparation, Preparing Features
- file formats, Motivation
- files
- filesystems, Filesystems
- filter()
- filtering, Python example of, Introduction to Core Spark Concepts
- fine-grained Mesos mode, Mesos scheduling modes
- flatMap(), Aggregations
- flatMapValues(), Operations That Affect Partitioning
- Flume (see Apache Flume)
- FlumeUtils object, Push-based receiver
- fold(), Aggregations
- foldByKey(), Aggregations
- foreach()
- foreachPartition(), Working on a Per-Partition Basis, Output Operations
- foreachRDD(), Output Operations
- functions, passing to Spark, Introduction to Core Spark Concepts
H
- Hadoop
- Hadoop Distributed File System (HDFS), HDFS
- Hadoop YARN, Spark Runtime Architecture, Hadoop YARN
- hadoopFile(), Loading with other Hadoop input formats
- HADOOP_CONF_DIR variable, Hadoop YARN
- hardware provisioning, Hardware Provisioning
- HashingTF algorithm, TF-IDF
- HashPartitioner object, Data Partitioning (Advanced)
- HBase, HBase
- HBaseConfiguration, HBase
- HDFS (Hadoop Distributed File System), HDFS
- Hive (see Apache Hive)
- Hive Query Language (HQL), Spark SQL, Apache Hive
- HiveContext object, Apache Hive, Linking with Spark SQL
- HiveContext.inferSchema(), From RDDs
- HiveContext.jsonFile(), JSON, JSON
- HiveContext.parquetFile(), Parquet
- hiveCtx.cacheTable(), Caching
- HiveServer2, JDBC/ODBC Server
J
- JAR files
- Java
- Apache Kafka, Apache Kafka
- Concurrent Mark-Sweep garbage collector, Garbage Collection and Memory Usage
- country lookup with Broadcast values in (example), Broadcast Variables
- creating HiveContext and selecting data, Apache Hive
- creating pair RDD, Creating Pair RDDs
- creating SchemaRDD from a JavaBean, From RDDs
- custom sort order, sorting integers as if strings, Sorting Data
- driver program using pipe() to call finddistance.R, Piping to External Programs
- FlumeUtils agent in, Push-based receiver
- FlumeUtils custom sink, Pull-based receiver
- Hive load in, Apache Hive
- joining DStreams in, Stateless Transformations
- lambdas in Java 8, Introduction to Core Spark Concepts
- linear regression in, Linear regression
- linking Spark into standalone applications, Standalone Applications
- loading and querying tweets, Basic Query Example
- loading CSV with textFile(), Loading CSV
- loading entire Cassandra table as RDD with key/value data, Cassandra
- loading JSON, Loading JSON, JSON, JSON
- loading text files, Loading text files
- map() and reduceByKey() on DStream, Stateless Transformations
- Maven coordinates for Spark SQL with Hive support, Linking with Spark SQL
- partitioner, custom, Custom Partitioners
- partitioner, determining for an RDD, Determining an RDD’s Partitioner
- partitioning in, Data Partitioning (Advanced)
- passing functions to Spark, Introduction to Core Spark Concepts
- per-key average using combineByKey(), Aggregations
- Row objects, getter functions, Working with Row objects
- saving JSON, Saving JSON
- saving SequenceFiles, Saving SequenceFiles, Saving with Hadoop output formats, Output Operations
- setting Cassandra property, Cassandra
- setting up driver that can recover from failure, Driver Fault Tolerance
- shared connection pool and JSON parser, Working on a Per-Partition Basis
- spam classifier in, Example: Spam Classification
- Spark application built with Maven, A Java Spark Application Built with Maven
- Spark Cassandra connector, Cassandra
- SQL imports, Initializing Spark SQL
- streaming filter for printing lines containing error, A Simple Example
- streaming imports, A Simple Example
- streaming text files written to a directory, Stream of files
- string length UDF, Spark SQL UDFs
- submitting applications with dependencies, Packaging Your Code and Dependencies
- transform() on a DStream, Stateless Transformations
- transformations on pair RDDs, Transformations on Pair RDDs
- UDF imports, Spark SQL UDFs
- updateStateByKey() transformation, UpdateStateByKey transformation
- using summary statistics to remove outliers in, Numeric RDD Operations
- vectors, creating, Working with Vectors
- visit counts per IP address, Windowed transformations
- window(), using, Windowed transformations
- windowed count operations in, Windowed transformations
- word count in, Building Standalone Applications, Aggregations
- Writable types, SequenceFiles
- Java Database Connectivity (JDBC), Java Database Connectivity
- Java Serialization, Object Files, Optimizing Broadcasts, Serialization Format
- Java Virtual Machine (JVM), Downloading Spark and Getting Started
- java.io.Externalizable interface, Optimizing Broadcasts
- java.io.Serializable, Hadoop Writable classes and, SequenceFiles
- JDBC/ODBC server in Spark SQL, Scheduling Within and Between Spark Applications, JDBC/ODBC Server-Long-Lived Tables and Queries
- JdbcRDD, Java Database Connectivity
- jobs
- join operator, Joins
- join()
- joins
- JSON, File Formats
L
- LabeledPoints, Example: Spam Classification, Data Types
- lambda (=>) syntax, Introduction to Core Spark Concepts
- LassoWithSGD, Linear regression
- launch scripts, using cluster launch scripts, Launching the Standalone cluster manager
- LBFGS algorithm, Logistic regression
- leftOuterJoin(), Joins
- linear regression, Linear regression
- LinearRegressionModel, Linear regression
- LinearRegressionWithSGD object, Linear regression
- Linux/Mac
- loading and saving data, Loading and Saving Your Data-Conclusion
- local mode, Spark running in, Downloading Spark, Executors
- local/regular filesystem, Local/“Regular” FS
- log4j, Driver and Executor Logs
- Log4j.properties.template, Introduction to Spark’s Python and Scala Shells
- logging
- logistic regression, Logistic regression
- LogisticRegressionModel, Logistic regression
- long-lived Spark applications, Scheduling Within and Between Spark Applications
- long-lived tables and queries, Long-Lived Tables and Queries
- lookup(), Actions Available on Pair RDDs
- LzoJsonInputFormat, Loading with other Hadoop input formats
M
- machine learning
- with MLlib, Machine Learning with MLlib
- algorithms, Algorithms
- basic machine learning concepts, Machine Learning Basics
- classification and regression, Classification and Regression
- clustering, Clustering
- collaborative filtering and recommendation, Collaborative Filtering and Recommendation
- data types, Data Types
- dimensionality reduction, Principal component analysis
- example, spam classification, Example: Spam Classification
- feature extraction algorithms, Feature Extraction
- model evaluation, Model Evaluation
- overview, Overview
- pipeline API, Pipeline API
- statistics, Statistics
- system requirements, System Requirements
- tips and performance considerations, Preparing Features
- working with vectors, Working with Vectors
- machine learning (ML) functionality (MLlib), MLlib
- main function, Introduction to Core Spark Concepts
- map(), Transformations on Pair RDDs, Data Partitioning (Advanced)
- mapPartitions(), Working on a Per-Partition Basis
- mapPartitionsWithIndex(), Working on a Per-Partition Basis
- mapValues(), Transformations on Pair RDDs, Operations That Affect Partitioning
- master, Cluster Manager, Standalone Cluster Manager
- --master flag (spark-submit), Deploying Applications with spark-submit
- master/slave architecture, Spark Runtime Architecture
- match operator, Custom Partitioners
- Matrix object, Principal component analysis
- MatrixFactorizationModel, Alternating Least Squares
- Maven, Standalone Applications, Packaging Your Code and Dependencies
- max(), Numeric RDD Operations
- mean(), Numeric RDD Operations
- memory management, Memory Management
- MEMORY_AND_DISK storage level, Memory Management
- Mesos (see Apache Mesos)
- Metrics object, Model Evaluation
- micro-batch architecture, Spark Streaming, Architecture and Abstraction
- min(), Numeric RDD Operations
- MLlib, MLlib
- model evaluation, Model Evaluation
- models, Machine Learning Basics
- MulticlassMetrics, Model Evaluation
- Multinomial Naive Bayes, Naive Bayes
- multitenant clusters, scheduling in, Scheduling Within and Between Spark Applications
N
- Naive Bayes algorithm, Naive Bayes
- NaiveBayes class, Naive Bayes
- NaiveBayesModel, Naive Bayes
- natural language libraries, TF-IDF
- network filesystems, Local/“Regular” FS
- newAPIHadoopFile(), Loading with other Hadoop input formats
- normalization, Normalization
- Normalizer class, Normalization
- NotSerializableException, Serialization Format
- numeric operations, Numeric RDD Operations
- NumPy, System Requirements
O
- object files, Object Files
- objectFile(), Object Files
- ODBC, Spark SQL ODBC driver, JDBC/ODBC Server
- optimizations performed by Spark driver, The Driver
- Option object, Joins, Determining an RDD’s Partitioner
- Optional object, Joins
- outer joins, Joins
- output operations, Output Operations
- OutputFormat interface, Motivation
P
- package managers (Python), Packaging Your Code and Dependencies
- packaging
- PageRank algorithm, Example: PageRank
- pair RDDs, Motivation
- parallel algorithms, Overview
- parallelism
- parallelize(), Creating Pair RDDs, Overview
- Parquet, Parquet
- loading data into Spark SQL, Parquet
- registering Parquet file as temp table and querying against it in Spark SQL, Parquet
- saving a SchemaRDD to, Parquet
- partitionBy(), Data Partitioning (Advanced)
- Partitioner object, Determining an RDD’s Partitioner
- partitioner property, Determining an RDD’s Partitioner
- partitioning, Working with Key/Value Pairs
- partitions
- PBs (see protocol buffers)
- PCA (principal component analysis), Principal component analysis
- performance
- persist(), Determining an RDD’s Partitioner
- pickle serialization library (Python), Object Files
- pipe(), Piping to External Programs
- pipeline API in MLlib, Overview, Pipeline API
- pipelining, Components of Execution: Jobs, Tasks, and Stages, Components of Execution: Jobs, Tasks, and Stages
- piping to external programs, Piping to External Programs
- port 4040, information on running Spark applications, The Driver
- predict(), Linear regression
- principal component analysis (PCA), Principal component analysis
- print(), A Simple Example, Output Operations
- programming, advanced, Introduction-High availability
- protocol buffers, Example: Protocol buffers
- pull-based receiver, Apache Flume, Pull-based receiver
- push-based receiver, Apache Flume
- PySpark shell, Data Science Tasks
- Python
- accumulator error count in, Accumulators
- accumulator empty line count in, Accumulators
- average without and with mapPartitions(), Working on a Per-Partition Basis
- constructing SQL context, Initializing Spark SQL
- country lookup in, Broadcast Variables
- country lookup with broadcast variables, Broadcast Variables
- creating an application using a SparkConf, Configuring Spark with SparkConf
- creating HiveContext and selecting data, Apache Hive
- creating pair RDD, Creating Pair RDDs
- creating SchemaRDD using Row and named tuple, From RDDs
- custom sort order, sorting integers as if strings, Sorting Data
- driver program using pipe() to call finddistance.R, Piping to External Programs
- HashingTF, using in, TF-IDF
- Hive load in, Apache Hive
- installing third-party libraries, Packaging Your Code and Dependencies
- IPython shell, Introduction to Spark’s Python and Scala Shells
- linear regression in, Linear regression
- loading and querying tweets, Basic Query Example
- loading CSV with textFile(), Loading CSV
- loading JSON with Spark SQL, JSON, JSON
- loading SequenceFiles, Loading SequenceFiles
- loading text files, Loading text files
- loading unstructured JSON, Loading JSON
- Parquet files in, Parquet
- partitioner, custom, Custom Partitioners
- partitioning in, Data Partitioning (Advanced)
- passing functions to Spark, Introduction to Core Spark Concepts
- per-key average using reduceByKey() and mapValues(), Aggregations
- per-key average using combineByKey(), Aggregations
- pickle serialization library, Object Files
- requirement for EC2 script, Amazon EC2
- Row objects, working with, Working with Row objects
- saving JSON, Saving JSON
- scaling vectors in, Scaling
- shared connection pool in, Working on a Per-Partition Basis
- shell in Spark, Introduction to Spark’s Python and Scala Shells
- spam classifier in, Example: Spam Classification
- Spark SQL with Hive support, Linking with Spark SQL
- SQL imports, Initializing Spark SQL
- string length UDF, Spark SQL UDFs
- submitting a Python program with spark-submit, Deploying Applications with spark-submit
- TF-IDF, using in, TF-IDF
- transformations on pair RDDs, Transformations on Pair RDDs
- using MLlib in, requirement for NumPy, System Requirements
- using summary statistics to remove outliers in, Numeric RDD Operations
- vectors, creating, Working with Vectors
- word count in, Aggregations
- writing CSV in, Saving CSV
- writing Spark applications as scripts, Standalone Applications
R
- R library, Piping to External Programs
- RandomForest class, Decision trees and random forests
- Rating objects, Alternating Least Squares
- Rating type, Data Types
- rdd.getNumPartitions(), Tuning the level of parallelism
- rdd.partitions.size(), Tuning the level of parallelism
- receivers, Architecture and Abstraction
- recommendations, Collaborative Filtering and Recommendation
- RecordReader (Hadoop), SequenceFiles
- reduce(), Aggregations
- reduceByKey(), Aggregations
- reduceByKeyAndWindow(), Windowed transformations
- reduceByWindow(), Windowed transformations
- regression, Classification and Regression
- repartition(), Tuning the level of parallelism, Level of Parallelism, Level of Parallelism
- resilient distributed datasets (RDDs), Spark Core (see caching in executors)
- caching to reuse, Caching RDDs to Reuse
- Cassandra table, loading as RDD with key/value pairs, Cassandra
- changing partitioning, Tuning the level of parallelism
- collecting, Components of Execution: Jobs, Tasks, and Stages
- computing an already cached RDD, Components of Execution: Jobs, Tasks, and Stages
- counts (example), Components of Execution: Jobs, Tasks, and Stages
- creating and doing simple analysis, Introduction to Spark’s Python and Scala Shells
- DStreams as continuous series of, Architecture and Abstraction
- JdbcRDD, Java Database Connectivity
- loading and saving data from, in Spark SQL, From RDDs
- numeric operations, Numeric RDD Operations
- of CassandraRow objects, Cassandra
- pair RDDs, Motivation
- persisted, information on, Storage: Information for RDDs that are persisted
- pipe(), Piping to External Programs
- pipelining of RDD transformations into a single stage, Components of Execution: Jobs, Tasks, and Stages
- running computations on RDDs in a DStream, Output Operations
- saving to Cassandra from, Cassandra
- SchemaRDDs, Spark SQL, SchemaRDDs-Caching
- visualizing with toDebugString() in Scala, Components of Execution: Jobs, Tasks, and Stages
- resource allocation
- RidgeRegressionWithSGD, Linear regression
- rightOuterJoin(), Joins
- Row objects, Structured Data with Spark SQL
- RowMatrix class, Principal component analysis
- runtime architecture (Spark), Spark Runtime Architecture
- runtime dependencies of an application, Deploying Applications with spark-submit
S
- s3n://, path starting with, Amazon S3
- sampleStdev(), Numeric RDD Operations
- sampleVariance(), Numeric RDD Operations
- save(), Sorting Data
- saveAsHadoopFiles(), Output Operations
- saveAsObjectFile(), Object Files
- saveAsParquetFile(), Parquet
- saveAsTextFile(), Saving text files
- sbt (Scala build tool), Packaging Your Code and Dependencies
- sc variable (SparkContext), Introduction to Core Spark Concepts, The Driver
- Scala, Downloading Spark and Getting Started
- accumulator empty line count in, Accumulators
- Apache Kafka, Apache Kafka
- constructing SQL context, Initializing Spark SQL
- country lookup with broadcast variables, Broadcast Variables
- creating HiveContext and selecting data, Apache Hive
- creating pair RDD, Creating Pair RDDs
- creating SchemaRDD from case class, From RDDs
- custom sort order, sorting integers as if strings, Sorting Data
- driver program using pipe() to call finddistance.R, Piping to External Programs
- Elasticsearch output in, Elasticsearch
- FlumeUtils agent in, Push-based receiver
- FlumeUtils custom sink, Pull-based receiver
- Hive load in, Apache Hive
- joining DStreams in, Stateless Transformations
- linear regression in, Linear regression
- linking to Spark, Standalone Applications
- loading and querying tweets, Basic Query Example
- loading compressed text file from local filesystem, Local/“Regular” FS
- loading CSV with textFile(), Loading CSV
- loading entire Cassandra table as RDD with key/value pairs, Cassandra
- loading JSON, Loading JSON, JSON, JSON
- loading LZO-compressed JSON with Elephant Bird, Loading with other Hadoop input formats
- loading SequenceFiles, Loading SequenceFiles
- loading text files, Loading text files
- map() and reduceByKey() on DStream, Stateless Transformations
- Maven coordinates for Spark SQL with Hive support, Linking with Spark SQL
- PageRank example, Example: PageRank
- partitioner, custom, Data Partitioning (Advanced), Custom Partitioners
- partitioner, determining for an RDD, Determining an RDD’s Partitioner
- passing functions to Spark, Introduction to Core Spark Concepts
- PCA (principal component analysis) in, Principal component analysis
- per-key average using reduceByKey() and mapValues(), Aggregations
- per-key average using combineByKey(), Aggregations
- processing text data in Scala Spark shell, Components of Execution: Jobs, Tasks, and Stages
- reading from HBase, HBase
- Row objects, getter functions, Working with Row objects
- saving data to external systems with foreachRDD(), Output Operations
- saving DStream to text files, Output Operations
- saving JSON, Saving JSON
- saving SequenceFiles, Saving SequenceFiles, Output Operations
- saving to Cassandra, Cassandra
- setting Cassandra property, Cassandra
- setting up driver that can recover from failure, Driver Fault Tolerance
- spam classifier in, Example: Spam Classification
- spam classifier, pipeline API version, Pipeline API
- Spark application built with sbt, A Scala Spark Application Built with sbt
- Spark Cassandra connector, Cassandra
- SparkFlumeEvent in, Pull-based receiver
- SQL imports, Initializing Spark SQL
- streaming filter for printing lines containing error, A Simple Example
- streaming imports, A Simple Example
- streaming SequenceFiles written to a directory, Stream of files
- streaming text files written to a directory, Stream of files
- string length UDF, Spark SQL UDFs
- submitting applications with dependencies, Packaging Your Code and Dependencies
- transform() on a DStream, Stateless Transformations
- transformations on pair RDDs, Transformations on Pair RDDs
- updateStateByKey() transformation, UpdateStateByKey transformation
- user information application (example), Data Partitioning (Advanced)
- using summary statistics to remove outliers in, Numeric RDD Operations
- vectors, creating, Working with Vectors
- visit counts per IP address, Windowed transformations
- visualizing RDDs with toDebugString(), Components of Execution: Jobs, Tasks, and Stages
- window(), using, Windowed transformations
- windowed count operations in, Windowed transformations
- word count application example, Building Standalone Applications
- word count in, Aggregations
- Writable types, SequenceFiles
- writing CSV in, Saving CSV
- Scala shell, Introduction to Spark’s Python and Scala Shells
- scala.Option object, Determining an RDD’s Partitioner
- scala.Tuple2 class, Creating Pair RDDs
- scaling vectors, Scaling
- schedulers
- scheduling information, Deploying Applications with spark-submit
- scheduling jobs, Scheduling Within and Between Spark Applications
- SchemaRDDs, Spark SQL, SchemaRDDs
- schemas, Structured Data with Spark SQL, Spark SQL
- accessing nested fields and array fields in SQL, JSON
- in JSON data, JSON
- partial schema of tweets, JSON
- SequenceFiles, File Formats, SequenceFiles
- SerDes (serialization and deserialization formats), Linking with Spark SQL
- serialization
- shading, Dependency Conflicts
- shared variables, Introduction, Accumulators
- shells
- driver program, creating, The Driver
- IPython, Introduction to Spark’s Python and Scala Shells
- launching against Standalone cluster manager, Submitting applications
- launching Spark shell and PySpark against YARN, Hadoop YARN
- opening PySpark shell in Spark, Introduction to Spark’s Python and Scala Shells
- opening Scala shell in Spark, Introduction to Spark’s Python and Scala Shells
- processing text data in Scala Spark shell, Components of Execution: Jobs, Tasks, and Stages
- sc variable (SparkContext), Introduction to Core Spark Concepts
- Scala and Python shells in Spark, Introduction to Spark’s Python and Scala Shells
- standalone Spark SQL shell, Long-Lived Tables and Queries
- singular value decomposition (SVD), Singular value decomposition
- skew, Jobs: Progress and metrics of stages, tasks, and more
- sliding duration, Windowed transformations
- sorting data
- spam classification example (MLlib), Example: Spam Classification-Example: Spam Classification
- Spark
- accessing Spark UI, Introduction to Spark’s Python and Scala Shells
- brief history of, A Brief History of Spark
- closely integrated components, A Unified Stack
- defined, Preface
- linking into standalone applications in different languages, Standalone Applications
- shutting down an application, Initializing a SparkContext
- storage layers, Storage Layers for Spark
- uses of, Who Uses Spark, and for What?
- versions and releases, Spark Versions and Releases
- web UI (see web UI)
- Spark Core, Spark Core
- Spark SQL, Spark SQL, Spark SQL-Conclusion
- Spark Streaming, Spark Streaming, Spark Streaming-Conclusion
- additional setup for applications, Spark Streaming
- architecture and abstraction, Architecture and Abstraction
- checkpointing, Architecture and Abstraction
- DStreams, Spark Streaming
- execution within Spark components, Architecture and Abstraction
- fault-tolerance properties for DStreams, Architecture and Abstraction
- input sources, Input Sources
- output operations, Output Operations
- performance considerations, Performance Considerations
- running applications 24/7, 24/7 Operation-Processing Guarantees
- simple example, A Simple Example
- Spark application UI showing, Architecture and Abstraction
- Streaming UI, Streaming UI
- transformations on DStreams, Transformations
- spark-class script, Launching the Standalone cluster manager
- spark-core package, Building Standalone Applications
- spark-ec2 script, Amazon EC2
- spark-submit script, Building Standalone Applications, Launching a Program
- --deploy-mode cluster flag, Submitting applications, Driver Fault Tolerance
- --deploy-mode flag, Hadoop YARN
- --executor-cores flag, Hardware Provisioning
- --executor-memory flag, Submitting applications, Configuring resource usage, Configuring resource usage, Hardware Provisioning
- --jars flag, Packaging Your Code and Dependencies
- --master mesos flag, Apache Mesos
- --master yarn flag, Hadoop YARN
- --num-executors flag, Configuring resource usage, Hardware Provisioning
- --py-files argument, Packaging Your Code and Dependencies
- --total-executor-cores argument, Configuring resource usage, Configuring resource usage
- common flags, summary listing of, Deploying Applications with spark-submit
- deploying applications with, Deploying Applications with spark-submit
- general format, Deploying Applications with spark-submit
- loading configuration values from a file, Configuring Spark with SparkConf
- setting configuration values at runtime with flags, Configuring Spark with SparkConf
- submitting application from Amazon EC2, Logging in to a cluster
- using with various options, Deploying Applications with spark-submit
- spark.cores.max, Hardware Provisioning
- spark.deploy.spreadOut config property, Configuring resource usage
- spark.executor.cores, Hardware Provisioning
- spark.executor.memory, Hardware Provisioning
- spark.local.dir option, Hardware Provisioning
- spark.Partitioner object, Determining an RDD’s Partitioner
- spark.serializer property, Optimizing Broadcasts
- spark.sql.codegen, Performance Tuning Options
- spark.sql.inMemoryColumnarStorage.batchSize, Performance Tuning Options
- spark.storage.memoryFraction, Memory Management
- SparkConf object, Initializing a SparkContext
- SparkContext object, Introduction to Core Spark Concepts, The Driver
- SparkContext.addFile(), Piping to External Programs
- SparkContext.parallelize(), Creating Pair RDDs
- SparkContext.parallelizePairs(), Creating Pair RDDs
- SparkContext.sequenceFile(), Loading SequenceFiles
- SparkFiles.get(), Piping to External Programs
- SparkFiles.getRootDirectory(), Piping to External Programs
- SparkFlumeEvents, Pull-based receiver
- SparkR project, Piping to External Programs
- SparkStreamingContext.checkpoint(), Checkpointing
- SPARK_LOCAL_DIRS variable, Configuring Spark with SparkConf, Hardware Provisioning
- SPARK_WORKER_INSTANCES variable, Hardware Provisioning
- sparsity, recognizing, Recognizing Sparsity
- SQL (Structured Query Language)
- SQL shell, Data Science Tasks
- sql(), Basic Query Example
- SQLContext object, Linking with Spark SQL
- SQLContext.parquetFile(), Parquet
- stack trace from executors, Executors: A list of executors present in the application
- stages, The Driver, Components of Execution: Jobs, Tasks, and Stages, Components of Execution: Jobs, Tasks, and Stages
- standalone applications, Standalone Applications
- Standalone cluster manager, Spark Runtime Architecture, Standalone Cluster Manager-High availability
- StandardScaler class, Scaling
- StandardScalerModel, Scaling
- start-thriftserver.sh, JDBC/ODBC Server
- stateful transformations, Transformations, Stateful Transformations
- stateless transformations, Transformations
- Statistics class, Statistics
- StatsCounter object, Numeric RDD Operations
- stdev(), Numeric RDD Operations
- stochastic gradient descent (SGD), Example: Spam Classification, Linear regression
- storage
- storage layers for Spark, Storage Layers for Spark
- storage levels
- streaming (see Spark Streaming)
- StreamingContext object, A Simple Example
- StreamingContext.awaitTermination(), A Simple Example
- StreamingContext.getOrCreate(), Driver Fault Tolerance
- StreamingContext.start(), A Simple Example
- StreamingContext.transform(), Stateless Transformations
- StreamingContext.union(), Stateless Transformations
- strings, sorting integers as, Sorting Data
- structured data, Spark SQL
- sum(), Numeric RDD Operations
- supervised learning, Classification and Regression
- Support Vector Machines, Support Vector Machines
- SVD (singular value decomposition), Singular value decomposition
- SVMModel, Support Vector Machines
- SVMWithSGD class, Support Vector Machines
T
- tab-separated value files (see TSV files)
- TableInputFormat, HBase
- tar command, Downloading Spark
- tar extractors, Downloading Spark
- tasks, The Driver, Components of Execution: Jobs, Tasks, and Stages
- term frequency, Example: Spam Classification
- Term Frequency–Inverse Document Frequency (TF-IDF), TF-IDF
- text files, File Formats
- textFile(), Loading text files, Components of Execution: Jobs, Tasks, and Stages
- Thrift server, JDBC/ODBC Server
- toDebugString(), Components of Execution: Jobs, Tasks, and Stages
- training data, Machine Learning Basics
- transitive dependency graph, Packaging Your Code and Dependencies
- TSV (tab-separated value) files, Comma-Separated Values and Tab-Separated Values
- tuning Spark, Tuning and Debugging Spark, Finding Information
- tuples, Creating Pair RDDs
- tweets, Spark SQL
- Twitter
- types
W
- web UI, The Driver, Spark Web UI
- WeightedEnsembleModel, Decision trees and random forests
- wholeFile(), Loading CSV
- wholeTextFiles(), Loading text files
- window duration, Windowed transformations
- window(), Windowed transformations
- windowed transformations, Windowed transformations
- Windows systems
- word count, distributed, Aggregations
- Word2Vec class, Word2Vec
- Word2VecModel, Word2Vec
- workers, Cluster Manager, Standalone Cluster Manager
- Writable interface (Hadoop), SequenceFiles
- Writable types (Hadoop)