Index
A
- AccumulatorParam, Custom Accumulators
- accumulators, Accumulators
- aggregations, Aggregations
- Akka actor stream, Akka actor stream
- Alternating Least Squares (ALS) algorithm, Alternating Least Squares
- Amazon EC2, Amazon EC2-Storage on the cluster
- Amazon S3, Amazon S3
- Amazon Web Services (AWS) account, Launching a cluster
- Apache Flume, Apache Flume
- Apache Hive, Apache Hive
- Apache Kafka, Apache Kafka
- Apache Mesos, Spark Runtime Architecture, Apache Mesos-Configuring resource usage
- Apache Software Foundation
- Apache ZooKeeper, High availability
- applications, Spark, Introduction
- ascending parameter, sortByKey() function, Sorting Data
- assembly JAR, Packaging Your Code and Dependencies
- associative operations, Custom Accumulators
- Aurora, Client and cluster mode
- Avro sink, Push-based receiver
- AvroFlumeEvent, Pull-based receiver
- AWS_ACCESS_KEY_ID variable, Amazon S3, Launching a cluster
- AWS_SECRET_ACCESS_KEY variable, Amazon S3, Launching a cluster
C
- cache(), Determining an RDD’s Partitioner
- caching
- Cassandra, Cassandra
- saving to, from RDD types, Cassandra
- setting Cassandra property in Scala and Java, Cassandra
- Spark Cassandra connector, Cassandra
- table, loading as RDD with key/value data into Spark, Cassandra
- CassandraRow objects, Cassandra
- checkpointing, Spark Streaming, Architecture and Abstraction
- Chronos, Client and cluster mode
- classification, Machine Learning Basics, Classification and Regression
- client mode, Hadoop YARN
- cluster managers, Cluster Managers, Introduction, Cluster Manager, Cluster Managers-Which Cluster Manager to Use?
- cluster mode, Hadoop YARN
- cluster URLs, Deploying Applications with spark-submit
- clustering, Clustering
- clusters
- coalesce(), Tuning the level of parallelism, Level of Parallelism
- coarse-grained Mesos mode, Mesos scheduling modes
- codegen, enabling in Spark SQL, Performance Tuning Options
- cogroup(), Grouping Data
- collaborative filtering, Collaborative Filtering and Recommendation
- collect(), Sorting Data, Components of Execution: Jobs, Tasks, and Stages
- collectAsMap(), Actions Available on Pair RDDs
- combine(), Aggregations
- combineByKey(), Aggregations
- combiners, reduceByKey() and foldByKey() and, Aggregations
- comma-separated value file (see CSV files)
- commutative operations, Custom Accumulators
- compression
- computing cluster, A Unified Stack
- Concurrent Mark-Sweep garbage collector, Garbage Collection and Memory Usage
- conf/spark-defaults.conf file, Configuring Spark with SparkConf
- configuration
- configuring algorithms, Configuring Algorithms
- configuring Spark, Tuning and Debugging Spark
- connections, shared connection pool, Working on a Per-Partition Basis
- core input sources, Spark Streaming, Core Sources
- cores, number for executors, Hardware Provisioning
- count(), Numeric RDD Operations
- countByKey(), Actions Available on Pair RDDs
- countByValue(), Aggregations
- countByValueAndWindow(), Windowed transformations
- countByWindow(), Windowed transformations
- CSV files, File Formats, Comma-Separated Values and Tab-Separated Values
D
- data processing applications, Data Processing Applications
- data science tasks, Data Science Tasks
- data wrangling, Data Science Tasks
- databases
- debugging Spark
- decision trees, Decision trees and random forests
- dependencies
- deployment modes
- dimensionality reduction
- directed acyclic graph (DAG), The Driver, Components of Execution: Jobs, Tasks, and Stages
- directories
- discretized streams (see DStreams)
- distance program (in R), Piping to External Programs
- distributed data and computation, resilient distributed datasets (RDDs), Introduction to Spark’s Python and Scala Shells
- downloading Spark, Downloading Spark
- driver programs, Introduction to Core Spark Concepts, Spark Runtime Architecture, The Driver
- collecting an RDD to, Components of Execution: Jobs, Tasks, and Stages
- deploy mode on Apache Mesos, Client and cluster mode
- deploy modes on Standalone cluster manager, Submitting applications
- deploy modes on YARN cluster manager, Hadoop YARN
- duties performed by, The Driver
- fault tolerance, Driver Fault Tolerance
- in local mode, Executors
- launching in supervise mode, Driver Fault Tolerance
- logs, Driver and Executor Logs
- DStream.repartition(), Level of Parallelism
- DStream.transformWith(), Stateless Transformations
- DStreams (discretized streams), Spark Streaming
- as continuous series of RDDs, Architecture and Abstraction
- creating from Kafka messages, Apache Kafka
- creating with socketTextStream(), A Simple Example
- fault-tolerance properties for, Architecture and Abstraction
- of SparkFlumeEvents, Pull-based receiver
- output operations, Output Operations
- output operations support, Architecture and Abstraction
- transform() operator, Stateless Transformations
- transformations on, Transformations
- transformations support, Architecture and Abstraction
E
- EC2 clusters, Amazon EC2
- Elasticsearch, reading and writing data from, Elasticsearch
- Elephant Bird package (Twitter), Loading with other Hadoop input formats
- empty line count using accumulators (example), Accumulators
- ETL (extract, transform, and load), Working with Key/Value Pairs
- exactly-once semantics for transformations, Processing Guarantees
- execution components, Components of Execution: Jobs, Tasks, and Stages-Components of Execution: Jobs, Tasks, and Stages
- execution graph, The Driver
- executors, Introduction to Core Spark Concepts, Spark Runtime Architecture
- configuration values for, Configuring Spark with SparkConf
- in local mode, Executors
- information on, web UI executors page, Executors: A list of executors present in the application
- logs, Driver and Executor Logs
- memory usage, Memory Management
- memory, number of cores, and total number of executors, Hardware Provisioning
- requests for more memory, causing application not to run, Submitting applications
- resource allocation on Apache Mesos, Configuring resource usage
- resource allocation on Hadoop YARN cluster manager, Configuring resource usage
- resource allocation on Standalone cluster manager, Configuring resource usage
- scheduling modes on Apache Mesos, Mesos scheduling modes
- scheduling tasks on, The Driver
- sizing heaps for, Hardware Provisioning
- exiting a Spark application, Initializing a SparkContext
- Externalizable interface (Java), Optimizing Broadcasts
F
- Fair Scheduler, Scheduling Within and Between Spark Applications
- fakelogs_directory.sh script, Stream of files
- fault tolerance in Spark Streaming, 24/7 Operation
- fault tolerance, accumulators and, Accumulators and Fault Tolerance
- feature extraction algorithms, Overview, Feature Extraction
- feature extraction and transformation, Machine Learning Basics
- feature preparation, Preparing Features
- file formats, Motivation
- files
- filesystems, Filesystems
- filter()
- filtering, Python example of, Introduction to Core Spark Concepts
- fine-grained Mesos mode, Mesos scheduling modes
- flatMap(), Aggregations
- flatMapValues(), Operations That Affect Partitioning
- Flume (see Apache Flume)
- FlumeUtils object, Push-based receiver
- fold(), Aggregations
- foldByKey(), Aggregations
- foreach()
- foreachPartition(), Working on a Per-Partition Basis, Output Operations
- foreachRDD(), Output Operations
- functions, passing to Spark, Introduction to Core Spark Concepts
H
- Hadoop
- Hadoop Distributed File System (HDFS), HDFS
- Hadoop YARN, Spark Runtime Architecture, Hadoop YARN
- hadoopFile(), Loading with other Hadoop input formats
- HADOOP_CONF_DIR variable, Hadoop YARN
- hardware provisioning, Hardware Provisioning
- HashingTF algorithm, TF-IDF
- HashPartitioner object, Data Partitioning (Advanced)
- HBase, HBase
- HBaseConfiguration, HBase
- HDFS (Hadoop Distributed File System), HDFS
- Hive (see Apache Hive)
- Hive Query Language (HQL), Spark SQL, Apache Hive
- HiveContext object, Apache Hive, Linking with Spark SQL
- HiveContext.inferSchema(), From RDDs
- HiveContext.jsonFile(), JSON, JSON
- HiveContext.parquetFile(), Parquet
- hiveCtx.cacheTable(), Caching
- HiveServer2, JDBC/ODBC Server
J
- JAR files
- Java
- Apache Kafka, Apache Kafka
- Concurrent Mark-Sweep garbage collector, Garbage Collection and Memory Usage
- country lookup with Broadcast values in (example), Broadcast Variables
- creating HiveContext and selecting data, Apache Hive
- creating pair RDD, Creating Pair RDDs
- creating SchemaRDD from a JavaBean, From RDDs
- custom sort order, sorting integers as if strings, Sorting Data
- driver program using pipe() to call finddistance.R, Piping to External Programs
- FlumeUtils agent in, Push-based receiver
- FlumeUtils custom sink, Pull-based receiver
- Hive load in, Apache Hive
- joining DStreams in, Stateless Transformations
- lambdas in Java 8, Introduction to Core Spark Concepts
- linear regression in, Linear regression
- linking Spark into standalone applications, Standalone Applications
- loading and querying tweets, Basic Query Example
- loading CSV with textFile(), Loading CSV
- loading entire Cassandra table as RDD with key/value data, Cassandra
- loading JSON, Loading JSON, JSON, JSON
- loading text files, Loading text files
- map() and reduceByKey() on DStream, Stateless Transformations
- Maven coordinates for Spark SQL with Hive support, Linking with Spark SQL
- partitioner, custom, Custom Partitioners
- partitioner, determining for an RDD, Determining an RDD’s Partitioner
- partitioning in, Data Partitioning (Advanced)
- passing functions to Spark, Introduction to Core Spark Concepts
- per-key average using combineByKey(), Aggregations
- Row objects, getter functions, Working with Row objects
- saving JSON, Saving JSON
- saving SequenceFiles, Saving SequenceFiles, Saving with Hadoop output formats, Output Operations
- setting Cassandra property, Cassandra
- setting up driver that can recover from failure, Driver Fault Tolerance
- shared connection pool and JSON parser, Working on a Per-Partition Basis
- spam classifier in, Example: Spam Classification
- Spark application built with Maven, A Java Spark Application Built with Maven
- Spark Cassandra connector, Cassandra
- SQL imports, Initializing Spark SQL
- streaming filter for printing lines containing error, A Simple Example
- streaming imports, A Simple Example
- streaming text files written to a directory, Stream of files
- string length UDF, Spark SQL UDFs
- submitting applications with dependencies, Packaging Your Code and Dependencies
- transform() on a DStream, Stateless Transformations
- transformations on pair RDDs, Transformations on Pair RDDs
- UDF imports, Spark SQL UDFs
- updateStateByKey() transformation, UpdateStateByKey transformation
- using summary statistics to remove outliers in, Numeric RDD Operations
- vectors, creating, Working with Vectors
- visit counts per IP address, Windowed transformations
- window(), using, Windowed transformations
- windowed count operations in, Windowed transformations
- word count in, Building Standalone Applications, Aggregations
- Writable types, SequenceFiles
- Java Database Connectivity (JDBC), Java Database Connectivity
- Java Serialization, Object Files, Optimizing Broadcasts, Serialization Format
- Java Virtual Machine (JVM), Downloading Spark and Getting Started
- java.io.Externalizable interface, Optimizing Broadcasts
- java.io.Serializable, Hadoop Writable classes and, SequenceFiles
- JDBC/ODBC server in Spark SQL, Scheduling Within and Between Spark Applications, JDBC/ODBC Server-Long-Lived Tables and Queries
- JdbcRDD, Java Database Connectivity
- jobs
- join operator, Joins
- join()
- joins
- JSON, File Formats
L
- LabeledPoints, Example: Spam Classification, Data Types
- lambda (=>) syntax, Introduction to Core Spark Concepts
- LassoWithSGD, Linear regression
- launch scripts, using cluster launch scripts, Launching the Standalone cluster manager
- LBFGS algorithm, Logistic regression
- leftOuterJoin(), Joins
- linear regression, Linear regression
- LinearRegressionModel, Linear regression
- LinearRegressionWithSGD object, Linear regression
- Linux/Mac
- loading and saving data, Loading and Saving Your Data-Conclusion
- local mode, Spark running in, Downloading Spark, Executors
- local/regular filesystem, Local/“Regular” FS
- log4j, Driver and Executor Logs
- Log4j.properties.template, Introduction to Spark’s Python and Scala Shells
- logging
- logistic regression, Logistic regression
- LogisticRegressionModel, Logistic regression
- long-lived Spark applications, Scheduling Within and Between Spark Applications
- long-lived tables and queries, Long-Lived Tables and Queries
- lookup(), Actions Available on Pair RDDs
- LzoJsonInputFormat, Loading with other Hadoop input formats
M
- machine learning
- with MLlib, Machine Learning with MLlib
- algorithms, Algorithms
- basic machine learning concepts, Machine Learning Basics
- classification and regression, Classification and Regression
- clustering, Clustering
- collaborative filtering and recommendation, Collaborative Filtering and Recommendation
- data types, Data Types
- dimensionality reduction, Principal component analysis
- example, spam classification, Example: Spam Classification
- feature extraction algorithms, Feature Extraction
- model evaluation, Model Evaluation
- overview, Overview
- pipeline API, Pipeline API
- statistics, Statistics
- system requirements, System Requirements
- tips and performance considerations, Preparing Features
- working with vectors, Working with Vectors
- machine learning (ML) functionality (MLlib), MLlib
- main function, Introduction to Core Spark Concepts
- map(), Transformations on Pair RDDs, Data Partitioning (Advanced)
- mapPartitions(), Working on a Per-Partition Basis
- mapPartitionsWithIndex(), Working on a Per-Partition Basis
- mapValues(), Transformations on Pair RDDs, Operations That Affect Partitioning
- master, Cluster Manager, Standalone Cluster Manager
- --master flag (spark-submit), Deploying Applications with spark-submit
- master/slave architecture, Spark Runtime Architecture
- match operator, Custom Partitioners
- Matrix object, Principal component analysis
- MatrixFactorizationModel, Alternating Least Squares
- Maven, Standalone Applications, Packaging Your Code and Dependencies
- max(), Numeric RDD Operations
- mean(), Numeric RDD Operations
- memory management, Memory Management
- MEMORY_AND_DISK storage level, Memory Management
- Mesos (see Apache Mesos)
- Metrics object, Model Evaluation
- micro-batch architecture, Spark Streaming, Architecture and Abstraction
- min(), Numeric RDD Operations
- MLlib, MLlib
- model evaluation, Model Evaluation
- models, Machine Learning Basics
- MulticlassMetrics, Model Evaluation
- Multinomial Naive Bayes, Naive Bayes
- multitenant clusters, scheduling in, Scheduling Within and Between Spark Applications
N
- Naive Bayes algorithm, Naive Bayes
- NaiveBayes class, Naive Bayes
- NaiveBayesModel, Naive Bayes
- natural language libraries, TF-IDF
- network filesystems, Local/“Regular” FS
- newAPIHadoopFile(), Loading with other Hadoop input formats
- normalization, Normalization
- Normalizer class, Normalization
- NotSerializableException, Serialization Format
- numeric operations, Numeric RDD Operations
- NumPy, System Requirements
O
- object files, Object Files
- objectFile(), Object Files
- ODBC, Spark SQL ODBC driver, JDBC/ODBC Server
- optimizations performed by Spark driver, The Driver
- Option object, Joins, Determining an RDD’s Partitioner
- Optional object, Joins
- outer joins, Joins
- output operations, Output Operations
- OutputFormat interface, Motivation
P
- package managers (Python), Packaging Your Code and Dependencies
- packaging
- PageRank algorithm, Example: PageRank
- pair RDDs, Motivation
- parallel algorithms, Overview
- parallelism
- parallelize(), Creating Pair RDDs, Overview
- Parquet, Parquet
- loading data into Spark SQL, Parquet
- registering Parquet file as temp table and querying against it in Spark SQL, Parquet
- saving a SchemaRDD to, Parquet
- partitionBy(), Data Partitioning (Advanced)
- Partitioner object, Determining an RDD’s Partitioner
- partitioner property, Determining an RDD’s Partitioner
- partitioning, Working with Key/Value Pairs
- partitions
- PBs (see protocol buffers)
- PCA (principal component analysis), Principal component analysis
- performance
- persist(), Determining an RDD’s Partitioner
- pickle serialization library (Python), Object Files
- pipe(), Piping to External Programs
- pipeline API in MLlib, Overview, Pipeline API
- pipelining, Components of Execution: Jobs, Tasks, and Stages, Components of Execution: Jobs, Tasks, and Stages
- piping to external programs, Piping to External Programs
- port 4040, information on running Spark applications, The Driver
- predict(), Linear regression
- principal component analysis (PCA), Principal component analysis
- print(), A Simple Example, Output Operations
- programming, advanced, Introduction-High availability
- protocol buffers, Example: Protocol buffers
- pull-based receiver, Apache Flume, Pull-based receiver
- push-based receiver, Apache Flume
- PySpark shell, Data Science Tasks
- Python
- accumulator error count in, Accumulators
- accumulator empty line count in, Accumulators
- average without and with mapPartitions(), Working on a Per-Partition Basis
- constructing SQL context, Initializing Spark SQL
- country lookup in, Broadcast Variables
- country lookup with broadcast variables, Broadcast Variables
- creating an application using a SparkConf, Configuring Spark with SparkConf
- creating HiveContext and selecting data, Apache Hive
- creating pair RDD, Creating Pair RDDs
- creating SchemaRDD using Row and named tuple, From RDDs
- custom sort order, sorting integers as if strings, Sorting Data
- driver program using pipe() to call finddistance.R, Piping to External Programs
- HashingTF, using in, TF-IDF
- Hive load in, Apache Hive
- installing third-party libraries, Packaging Your Code and Dependencies
- IPython shell, Introduction to Spark’s Python and Scala Shells
- linear regression in, Linear regression
- loading and querying tweets, Basic Query Example
- loading CSV with textFile(), Loading CSV
- loading JSON with Spark SQL, JSON, JSON
- loading SequenceFiles, Loading SequenceFiles
- loading text files, Loading text files
- loading unstructured JSON, Loading JSON
- Parquet files in, Parquet
- partitioner, custom, Custom Partitioners
- partitioning in, Data Partitioning (Advanced)
- passing functions to Spark, Introduction to Core Spark Concepts
- per-key average using reduceByKey() and mapValues(), Aggregations
- per-key average using combineByKey(), Aggregations
- pickle serialization library, Object Files
- requirement for EC2 script, Amazon EC2
- Row objects, working with, Working with Row objects
- saving JSON, Saving JSON
- scaling vectors in, Scaling
- shared connection pool in, Working on a Per-Partition Basis
- shell in Spark, Introduction to Spark’s Python and Scala Shells
- spam classifier in, Example: Spam Classification
- Spark SQL with Hive support, Linking with Spark SQL
- SQL imports, Initializing Spark SQL
- string length UDF, Spark SQL UDFs
- submitting a Python program with spark-submit, Deploying Applications with spark-submit
- TF-IDF, using in, TF-IDF
- transformations on pair RDDs, Transformations on Pair RDDs
- using MLlib in, requirement for NumPy, System Requirements
- using summary statistics to remove outliers in, Numeric RDD Operations
- vectors, creating, Working with Vectors
- word count in, Aggregations
- writing CSV in, Saving CSV
- writing Spark applications as scripts, Standalone Applications
R
- R library, Piping to External Programs
- RandomForest class, Decision trees and random forests
- Rating objects, Alternating Least Squares
- Rating type, Data Types
- rdd.getNumPartitions(), Tuning the level of parallelism
- rdd.partitions.size(), Tuning the level of parallelism
- receivers, Architecture and Abstraction
- recommendations, Collaborative Filtering and Recommendation
- RecordReader (Hadoop), SequenceFiles
- reduce(), Aggregations
- reduceByKey(), Aggregations
- reduceByKeyAndWindow(), Windowed transformations
- reduceByWindow(), Windowed transformations
- regression, Classification and Regression
- repartition(), Tuning the level of parallelism, Level of Parallelism, Level of Parallelism
- resilient distributed datasets (RDDs), Spark Core (see caching in executors)
- caching to reuse, Caching RDDs to Reuse
- Cassandra table, loading as RDD with key/value pairs, Cassandra
- changing partitioning, Tuning the level of parallelism
- collecting, Components of Execution: Jobs, Tasks, and Stages
- computing an already cached RDD, Components of Execution: Jobs, Tasks, and Stages
- counts (example), Components of Execution: Jobs, Tasks, and Stages
- creating and doing simple analysis, Introduction to Spark’s Python and Scala Shells
- DStreams as continuous series of, Architecture and Abstraction
- JdbcRDD, Java Database Connectivity
- loading and saving data from, in Spark SQL, From RDDs
- numeric operations, Numeric RDD Operations
- of CassandraRow objects, Cassandra
- pair RDDs, Motivation
- persisted, information on, Storage: Information for RDDs that are persisted
- pipe(), Piping to External Programs
- pipelining of RDD transformations into a single stage, Components of Execution: Jobs, Tasks, and Stages
- running computations on RDDs in a DStream, Output Operations
- saving to Cassandra from, Cassandra
- SchemaRDDs, Spark SQL, SchemaRDDs-Caching
- visualizing with toDebugString() in Scala, Components of Execution: Jobs, Tasks, and Stages
- resource allocation
- RidgeRegressionWithSGD, Linear regression
- rightOuterJoin(), Joins
- Row objects, Structured Data with Spark SQL
- RowMatrix class, Principal component analysis
- runtime architecture (Spark), Spark Runtime Architecture
- runtime dependencies of an application, Deploying Applications with spark-submit
S
- s3n://, path starting with, Amazon S3
- sampleStdev(), Numeric RDD Operations
- sampleVariance(), Numeric RDD Operations
- save(), Sorting Data
- saveAsHadoopFiles(), Output Operations
- saveAsObjectFile(), Object Files
- saveAsParquetFile(), Parquet
- saveAsTextFile(), Saving text files
- sbt (Scala build tool), Packaging Your Code and Dependencies
- sc variable (SparkContext), Introduction to Core Spark Concepts, The Driver
- Scala, Downloading Spark and Getting Started
- accumulator empty line count in, Accumulators
- Apache Kafka, Apache Kafka
- constructing SQL context, Initializing Spark SQL
- country lookup with broadcast variables, Broadcast Variables
- creating HiveContext and selecting data, Apache Hive
- creating pair RDD, Creating Pair RDDs
- creating SchemaRDD from case class, From RDDs
- custom sort order, sorting integers as if strings, Sorting Data
- driver program using pipe() to call finddistance.R, Piping to External Programs
- Elasticsearch output in, Elasticsearch
- FlumeUtils agent in, Push-based receiver
- FlumeUtils custom sink, Pull-based receiver
- Hive load in, Apache Hive
- joining DStreams in, Stateless Transformations
- linear regression in, Linear regression
- linking to Spark, Standalone Applications
- loading and querying tweets, Basic Query Example
- loading compressed text file from local filesystem, Local/“Regular” FS
- loading CSV with textFile(), Loading CSV
- loading entire Cassandra table as RDD with key/value pairs, Cassandra
- loading JSON, Loading JSON, JSON, JSON
- loading LZO-compressed JSON with Elephant Bird, Loading with other Hadoop input formats
- loading SequenceFiles, Loading SequenceFiles
- loading text files, Loading text files
- map() and reduceByKey() on DStream, Stateless Transformations
- Maven coordinates for Spark SQL with Hive support, Linking with Spark SQL
- PageRank example, Example: PageRank
- partitioner, custom, Data Partitioning (Advanced), Custom Partitioners
- partitioner, determining for an RDD, Determining an RDD’s Partitioner
- passing functions to Spark, Introduction to Core Spark Concepts
- PCA (principal component analysis) in, Principal component analysis
- per-key average using reduceByKey() and mapValues(), Aggregations
- per-key average using combineByKey(), Aggregations
- processing text data in Scala Spark shell, Components of Execution: Jobs, Tasks, and Stages
- reading from HBase, HBase
- Row objects, getter functions, Working with Row objects
- saving data to external systems with foreachRDD(), Output Operations
- saving DStream to text files, Output Operations
- saving JSON, Saving JSON
- saving SequenceFiles, Saving SequenceFiles, Output Operations
- saving to Cassandra, Cassandra
- setting Cassandra property, Cassandra
- setting up driver that can recover from failure, Driver Fault Tolerance
- spam classifier in, Example: Spam Classification
- spam classifier, pipeline API version, Pipeline API
- Spark application built with sbt, A Scala Spark Application Built with sbt
- Spark Cassandra connector, Cassandra
- SparkFlumeEvent in, Pull-based receiver
- SQL imports, Initializing Spark SQL
- streaming filter for printing lines containing error, A Simple Example
- streaming imports, A Simple Example
- streaming SequenceFiles written to a directory, Stream of files
- streaming text files written to a directory, Stream of files
- string length UDF, Spark SQL UDFs
- submitting applications with dependencies, Packaging Your Code and Dependencies
- transform() on a DStream, Stateless Transformations
- transformations on pair RDDs, Transformations on Pair RDDs
- updateStateByKey() transformation, UpdateStateByKey transformation
- user information application (example), Data Partitioning (Advanced)
- using summary statistics to remove outliers in, Numeric RDD Operations
- vectors, creating, Working with Vectors
- visit counts per IP address, Windowed transformations
- visualizing RDDs with toDebugString(), Components of Execution: Jobs, Tasks, and Stages
- window(), using, Windowed transformations
- windowed count operations in, Windowed transformations
- word count application example, Building Standalone Applications
- word count in, Aggregations
- Writable types, SequenceFiles
- writing CSV in, Saving CSV
- Scala shell, Introduction to Spark’s Python and Scala Shells
- scala.Option object, Determining an RDD’s Partitioner
- scala.Tuple2 class, Creating Pair RDDs
- scaling vectors, Scaling
- schedulers
- scheduling information, Deploying Applications with spark-submit
- scheduling jobs, Scheduling Within and Between Spark Applications
- SchemaRDDs, Spark SQL, SchemaRDDs
- schemas, Structured Data with Spark SQL, Spark SQL
- accessing nested fields and array fields in SQL, JSON
- in JSON data, JSON
- partial schema of tweets, JSON
- SequenceFiles, File Formats, SequenceFiles
- SerDes (serialization and deserialization formats), Linking with Spark SQL
- serialization
- shading, Dependency Conflicts
- shared variables, Introduction, Accumulators
- shells
- driver program, creating, The Driver
- IPython, Introduction to Spark’s Python and Scala Shells
- launching against Standalone cluster manager, Submitting applications
- launching Spark shell and PySpark against YARN, Hadoop YARN
- opening PySpark shell in Spark, Introduction to Spark’s Python and Scala Shells
- opening Scala shell in Spark, Introduction to Spark’s Python and Scala Shells
- processing text data in Scala Spark shell, Components of Execution: Jobs, Tasks, and Stages
- sc variable (SparkContext), Introduction to Core Spark Concepts
- Scala and Python shells in Spark, Introduction to Spark’s Python and Scala Shells
- standalone Spark SQL shell, Long-Lived Tables and Queries
- singular value decomposition (SVD), Singular value decomposition
- skew, Jobs: Progress and metrics of stages, tasks, and more
- sliding duration, Windowed transformations
- sorting data
- spam classification example (MLlib), Example: Spam Classification-Example: Spam Classification
- Spark
- accessing Spark UI, Introduction to Spark’s Python and Scala Shells
- brief history of, A Brief History of Spark
- closely integrated components, A Unified Stack
- defined, Preface
- linking into standalone applications in different languages, Standalone Applications
- shutting down an application, Initializing a SparkContext
- storage layers, Storage Layers for Spark
- uses of, Who Uses Spark, and for What?
- versions and releases, Spark Versions and Releases
- web UI (see web UI)
- Spark Core, Spark Core
- Spark SQL, Spark SQL, Spark SQL-Conclusion
- Spark Streaming, Spark Streaming, Spark Streaming-Conclusion
- additional setup for applications, Spark Streaming
- architecture and abstraction, Architecture and Abstraction
- checkpointing, Architecture and Abstraction
- DStreams, Spark Streaming
- execution within Spark components, Architecture and Abstraction
- fault-tolerance properties for DStreams, Architecture and Abstraction
- input sources, Input Sources
- output operations, Output Operations
- performance considerations, Performance Considerations
- running applications 24/7, 24/7 Operation-Processing Guarantees
- simple example, A Simple Example
- Spark application UI showing, Architecture and Abstraction
- Streaming UI, Streaming UI
- transformations on DStreams, Transformations
- spark-class script, Launching the Standalone cluster manager
- spark-core package, Building Standalone Applications
- spark-ec2 script, Amazon EC2
- spark-submit script, Building Standalone Applications, Launching a Program
- --deploy-mode cluster flag, Submitting applications, Driver Fault Tolerance
- --deploy-mode flag, Hadoop YARN
- --executor-cores flag, Hardware Provisioning
- --executor-memory flag, Submitting applications, Configuring resource usage, Configuring resource usage, Hardware Provisioning
- --jars flag, Packaging Your Code and Dependencies
- --master mesos flag, Apache Mesos
- --master yarn flag, Hadoop YARN
- --num-executors flag, Configuring resource usage, Hardware Provisioning
- --py-files argument, Packaging Your Code and Dependencies
- --total-executor-cores argument, Configuring resource usage, Configuring resource usage
- common flags, summary listing of, Deploying Applications with spark-submit
- deploying applications with, Deploying Applications with spark-submit
- general format, Deploying Applications with spark-submit
- loading configuration values from a file, Configuring Spark with SparkConf
- setting configuration values at runtime with flags, Configuring Spark with SparkConf
- submitting application from Amazon EC2, Logging in to a cluster
- using with various options, Deploying Applications with spark-submit
- spark.cores.max, Hardware Provisioning
- spark.deploy.spreadOut config property, Configuring resource usage
- spark.executor.cores, Hardware Provisioning
- spark.executor.memory, Hardware Provisioning
- spark.local.dir option, Hardware Provisioning
- spark.Partitioner object, Determining an RDD’s Partitioner
- spark.serializer property, Optimizing Broadcasts
- spark.sql.codegen, Performance Tuning Options
- spark.sql.inMemoryColumnarStorage.batchSize, Performance Tuning Options
- spark.storage.memoryFraction, Memory Management
- SparkConf object, Initializing a SparkContext
- SparkContext object, Introduction to Core Spark Concepts, The Driver
- SparkContext.addFile(), Piping to External Programs
- SparkContext.parallelize(), Creating Pair RDDs
- SparkContext.parallelizePairs(), Creating Pair RDDs
- SparkContext.sequenceFile(), Loading SequenceFiles
- SparkFiles.get(), Piping to External Programs
- SparkFiles.getRootDirectory(), Piping to External Programs
- SparkFlumeEvents, Pull-based receiver
- SparkR project, Piping to External Programs
- SparkStreamingContext.checkpoint(), Checkpointing
- SPARK_LOCAL_DIRS variable, Configuring Spark with SparkConf, Hardware Provisioning
- SPARK_WORKER_INSTANCES variable, Hardware Provisioning
- sparsity, recognizing, Recognizing Sparsity
- SQL (Structured Query Language)
- SQL shell, Data Science Tasks
- sql(), Basic Query Example
- SQLContext object, Linking with Spark SQL
- SQLContext.parquetFile(), Parquet
- stack trace from executors, Executors: A list of executors present in the application
- stages, The Driver, Components of Execution: Jobs, Tasks, and Stages, Components of Execution: Jobs, Tasks, and Stages
- standalone applications, Standalone Applications
- Standalone cluster manager, Spark Runtime Architecture, Standalone Cluster Manager-High availability
- StandardScaler class, Scaling
- StandardScalerModel, Scaling
- start-thriftserver.sh, JDBC/ODBC Server
- stateful transformations, Transformations, Stateful Transformations
- stateless transformations, Transformations
- Statistics class, Statistics
- StatsCounter object, Numeric RDD Operations
- stdev(), Numeric RDD Operations
- stochastic gradient descent (SGD), Example: Spam Classification, Linear regression
- storage
- storage layers for Spark, Storage Layers for Spark
- storage levels
- streaming (see Spark Streaming)
- StreamingContext object, A Simple Example
- StreamingContext.awaitTermination(), A Simple Example
- StreamingContext.getOrCreate(), Driver Fault Tolerance
- StreamingContext.start(), A Simple Example
- StreamingContext.transform(), Stateless Transformations
- StreamingContext.union(), Stateless Transformations
- strings, sorting integers as, Sorting Data
- structured data, Spark SQL
- sum(), Numeric RDD Operations
- supervised learning, Classification and Regression
- Support Vector Machines, Support Vector Machines
- SVD (singular value decomposition), Singular value decomposition
- SVMModel, Support Vector Machines
- SVMWithSGD class, Support Vector Machines
T
- tab-separated value files (see TSV files)
- TableInputFormat, HBase
- tar command, Downloading Spark
- tar extractors, Downloading Spark
- tasks, The Driver, Components of Execution: Jobs, Tasks, and Stages
- term frequency, Example: Spam Classification
- Term Frequency–Inverse Document Frequency (TF-IDF), TF-IDF
- text files, File Formats
- textFile(), Loading text files, Components of Execution: Jobs, Tasks, and Stages
- Thrift server, JDBC/ODBC Server
- toDebugString(), Components of Execution: Jobs, Tasks, and Stages
- training data, Machine Learning Basics
- transitive dependency graph, Packaging Your Code and Dependencies
- TSV (tab-separated value) files, Comma-Separated Values and Tab-Separated Values
- tuning Spark, Tuning and Debugging Spark, Finding Information
- tuples, Creating Pair RDDs
- tweets, Spark SQL
- Twitter
- types
W
- web UI, The Driver, Spark Web UI
- WeightedEnsembleModel, Decision trees and random forests
- wholeFile(), Loading CSV
- wholeTextFiles(), Loading text files
- window duration, Windowed transformations
- window(), Windowed transformations
- windowed transformations, Windowed transformations
- Windows systems
- word count, distributed, Aggregations
- Word2Vec class, Word2Vec
- Word2VecModel, Word2Vec
- workers, Cluster Manager, Standalone Cluster Manager
- Writable interface (Hadoop), SequenceFiles
- Writable types (Hadoop)