
Book Description

Harness the power of Scala to program Spark and analyze tonnes of data in the blink of an eye!

About This Book

  • Learn Scala’s sophisticated type system, which combines functional programming and object-oriented concepts
  • Work on a wide array of applications, from simple batch jobs to stream processing and machine learning
  • Explore the most common as well as some complex use cases to perform large-scale data analysis with Spark

Who This Book Is For

Anyone who wishes to learn how to perform data analysis by harnessing the power of Spark will find this book extremely useful. No knowledge of Spark or Scala is assumed, although prior programming experience (especially with other JVM languages) will help you pick up the concepts more quickly.

What You Will Learn

  • Understand the object-oriented and functional programming concepts of Scala
  • Gain an in-depth understanding of the Scala collection APIs
  • Work with RDDs and DataFrames to learn Spark’s core abstractions
  • Analyze structured and unstructured data using Spark SQL and GraphX
  • Develop scalable and fault-tolerant streaming applications using Spark Structured Streaming
  • Learn machine learning best practices for classification, regression, dimensionality reduction, and recommender systems to build predictive models with widely used algorithms in Spark MLlib and ML
  • Build clustering models to cluster vast amounts of data
  • Understand how to tune, debug, and monitor Spark applications
  • Deploy Spark applications on real clusters in standalone, Mesos, and YARN modes

In Detail

Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. If you want to leverage the power of Scala and Spark to make sense of big data, this book is for you.

The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark and its basic abstractions, RDDs and DataFrames. Building on these, you will develop scalable and fault-tolerant applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment.
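As a small taste of the Scala style that first part teaches (case classes, higher-order functions, and immutable collections), here is a minimal, self-contained sketch. The `Sale` case class and `totalWhere` helper are illustrative inventions for this description, not examples taken from the book:

```scala
// Illustrative sketch only: a case class plus a higher-order function,
// the kind of functional Scala the book builds on before introducing Spark.
case class Sale(product: String, amount: Double)

object SalesDemo {
  // Higher-order function: accepts a predicate and sums the matching amounts.
  def totalWhere(sales: Seq[Sale])(p: Sale => Boolean): Double =
    sales.filter(p).map(_.amount).sum

  def main(args: Array[String]): Unit = {
    val sales = Seq(Sale("book", 25.0), Sale("pen", 2.5), Sale("book", 30.0))
    println(totalWhere(sales)(_.product == "book")) // prints 55.0
  }
}
```

The same filter/map/sum shape carries over almost unchanged to Spark's RDD and DataFrame APIs, which is one reason the book teaches Scala first.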

You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics using Zeppelin, and do in-memory data processing with Alluxio.

By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics with the confidence that no amount of data is too big.

Style and approach

Filled with practical examples and use cases, this book will not only help you get up and running with Spark, but will also take you farther down the road to becoming a data scientist.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code files sent directly to you.

Table of Contents

  1. Preface
    1. What this book covers
    2. What you need for this book
    3. Who this book is for
    4. Conventions
    5. Reader feedback
    6. Customer support
      1. Downloading the example code
      2. Downloading the color images of this book
      3. Errata
      4. Piracy
      5. Questions
  2. Introduction to Scala
    1. History and purposes of Scala
    2. Platforms and editors
    3. Installing and setting up Scala
      1. Installing Java
      2. Windows
      3. Mac OS
        1. Using Homebrew installer
        2. Installing manually
      4. Linux
    4. Scala: the scalable language
      1. Scala is object-oriented
      2. Scala is functional
      3. Scala is statically typed
      4. Scala runs on the JVM
      5. Scala can execute Java code
      6. Scala can do concurrent and synchronized processing
    5. Scala for Java programmers
      1. All types are objects
      2. Type inference
      3. Scala REPL
      4. Nested functions
      5. Import statements
      6. Operators as methods
      7. Methods and parameter lists
      8. Methods inside methods
      9. Constructor in Scala
      10. Objects instead of static methods
      11. Traits
    6. Scala for the beginners
      1. Your first line of code
        1. I'm the hello world program, explain me well!
      2. Run Scala interactively!
      3. Compile it!
        1. Execute it with Scala command
    7. Summary
  3. Object-Oriented Scala
    1. Variables in Scala
      1. Reference versus value immutability
      2. Data types in Scala
        1. Variable initialization
        2. Type annotations
        3. Type ascription
        4. Lazy val
    2. Methods, classes, and objects in Scala
      1. Methods in Scala
        1. The return in Scala
      2. Classes in Scala
      3. Objects in Scala
        1. Singleton and companion objects
        2. Companion objects
        3. Comparing and contrasting: val and final
        4. Access and visibility
        5. Constructors
        6. Traits in Scala
          1. A trait syntax
          2. Extending traits
        7. Abstract classes
        8. Abstract classes and the override keyword
        9. Case classes in Scala
    3. Packages and package objects
    4. Java interoperability
    5. Pattern matching
    6. Implicit in Scala
    7. Generic in Scala
      1. Defining a generic class
    8. SBT and other build systems
      1. Build with SBT
      2. Maven with Eclipse
      3. Gradle with Eclipse
    9. Summary
  4. Functional Programming Concepts
    1. Introduction to functional programming
      1. Advantages of functional programming
    2. Functional Scala for the data scientists
    3. Why FP and Scala for learning Spark?
      1. Why Spark?
      2. Scala and the Spark programming model
      3. Scala and the Spark ecosystem
    4. Pure functions and higher-order functions
      1. Pure functions
      2. Anonymous functions
      3. Higher-order functions
      4. Function as a return value
    5. Using higher-order functions
    6. Error handling in functional Scala
      1. Failure and exceptions in Scala
      2. Throwing exceptions
      3. Catching exception using try and catch
      4. Finally
      5. Creating an Either
      6. Future
      7. Run one task, but block
    7. Functional programming and data mutability
    8. Summary
  5. Collection APIs
    1. Scala collection APIs
    2. Types and hierarchies
      1. Traversable
      2. Iterable
      3. Seq, LinearSeq, and IndexedSeq
      4. Mutable and immutable
      5. Arrays
      6. Lists
      7. Sets
      8. Tuples
      9. Maps
      10. Option
      11. Exists
      12. Forall
      13. Filter
      14. Map
      15. Take
      16. GroupBy
      17. Init
      18. Drop
      19. TakeWhile
      20. DropWhile
      21. FlatMap
    3. Performance characteristics
      1. Performance characteristics of collection objects
      2. Memory usage by collection objects
    4. Java interoperability
    5. Using Scala implicits
      1. Implicit conversions in Scala
    6. Summary
  6. Tackle Big Data – Spark Comes to the Party
    1. Introduction to data analytics
      1. Inside the data analytics process
    2. Introduction to big data
      1. 4 Vs of big data
        1. Variety of Data
        2. Velocity of Data
        3. Volume of Data
        4. Veracity of Data
    3. Distributed computing using Apache Hadoop
      1. Hadoop Distributed File System (HDFS)
        1. HDFS High Availability
        2. HDFS Federation
        3. HDFS Snapshot
        4. HDFS Read
        5. HDFS Write
      2. MapReduce framework
    4. Here comes Apache Spark
      1. Spark core
      2. Spark SQL
      3. Spark streaming
      4. Spark GraphX
      5. Spark ML
      6. PySpark
      7. SparkR
    5. Summary
  7. Start Working with Spark – REPL and RDDs
    1. Dig deeper into Apache Spark
    2. Apache Spark installation
      1. Spark standalone
      2. Spark on YARN
        1. YARN client mode
        2. YARN cluster mode
      3. Spark on Mesos
    3. Introduction to RDDs
      1. RDD Creation
        1. Parallelizing a collection
        2. Reading data from an external source
        3. Transformation of an existing RDD
        4. Streaming API
    4. Using the Spark shell
    5. Actions and Transformations
      1. Transformations
        1. General transformations
        2. Math/Statistical transformations
        3. Set theory/relational transformations
        4. Data structure-based transformations
          1. map function
          2. flatMap function
          3. filter function
          4. coalesce
          5. repartition
      2. Actions
        1. reduce
        2. count
        3. collect
    6. Caching
    7. Loading and saving data
      1. Loading data
        1. textFile
        2. wholeTextFiles
        3. Load from a JDBC Datasource
      2. Saving RDD
    8. Summary
  8. Special RDD Operations
    1. Types of RDDs
      1. Pair RDD
      2. DoubleRDD
      3. SequenceFileRDD
      4. CoGroupedRDD
      5. ShuffledRDD
      6. UnionRDD
      7. HadoopRDD
      8. NewHadoopRDD
    2. Aggregations
      1. groupByKey
      2. reduceByKey
      3. aggregateByKey
      4. combineByKey
      5. Comparison of groupByKey, reduceByKey, combineByKey, and aggregateByKey
    3. Partitioning and shuffling
      1. Partitioners
        1. HashPartitioner
        2. RangePartitioner
      2. Shuffling
        1. Narrow Dependencies
        2. Wide Dependencies
    4. Broadcast variables
      1. Creating broadcast variables
      2. Cleaning broadcast variables
      3. Destroying broadcast variables
    5. Accumulators
    6. Summary
  9. Introduce a Little Structure - Spark SQL
    1. Spark SQL and DataFrames
    2. DataFrame API and SQL API
      1. Pivots
      2. Filters
      3. User-Defined Functions (UDFs)
      4. Schema - structure of data
        1. Implicit schema
        2. Explicit schema
        3. Encoders
      5. Loading and saving datasets
        1. Loading datasets
        2. Saving datasets
    3. Aggregations
      1. Aggregate functions
        1. Count
        2. First
        3. Last
        4. approx_count_distinct
        5. Min
        6. Max
        7. Average
        8. Sum
        9. Kurtosis
        10. Skewness
        11. Variance
        12. Standard deviation
        13. Covariance
      2. groupBy
      3. Rollup
      4. Cube
      5. Window functions
        1. ntiles
    4. Joins
      1. Inner workings of join
        1. Shuffle join
      2. Broadcast join
      3. Join types
        1. Inner join
        2. Left outer join
        3. Right outer join
        4. Outer join
        5. Left anti join
        6. Left semi join
        7. Cross join
      4. Performance implications of join
    5. Summary
  10. Stream Me Up, Scotty - Spark Streaming
    1. A brief introduction to streaming
      1. At least once processing
      2. At most once processing
      3. Exactly once processing
    2. Spark Streaming
      1. StreamingContext
        1. Creating StreamingContext
        2. Starting StreamingContext
        3. Stopping StreamingContext
      2. Input streams
        1. receiverStream
        2. socketTextStream
        3. rawSocketStream
        4. fileStream
        5. textFileStream
        6. binaryRecordsStream
        7. queueStream
      3. textFileStream example
      4. twitterStream example
    3. Discretized streams
      1. Transformations
      2. Window operations
    4. Stateful/stateless transformations
      1. Stateless transformations
      2. Stateful transformations
    5. Checkpointing
      1. Metadata checkpointing
      2. Data checkpointing
      3. Driver failure recovery
    6. Interoperability with streaming platforms (Apache Kafka)
      1. Receiver-based approach
      2. Direct stream
      3. Structured streaming
    7. Structured streaming
      1. Handling Event-time and late data
      2. Fault tolerance semantics
    8. Summary
  11. Everything is Connected - GraphX
    1. A brief introduction to graph theory
    2. GraphX
    3. VertexRDD and EdgeRDD
      1. VertexRDD
      2. EdgeRDD
    4. Graph operators
      1. Filter
      2. MapValues
      3. aggregateMessages
      4. TriangleCounting
    5. Pregel API
      1. ConnectedComponents
      2. Traveling salesman problem
      3. ShortestPaths
    6. PageRank
    7. Summary
  12. Learning Machine Learning - Spark MLlib and Spark ML
    1. Introduction to machine learning
      1. Typical machine learning workflow
      2. Machine learning tasks
        1. Supervised learning
        2. Unsupervised learning
        3. Reinforcement learning
        4. Recommender system
        5. Semisupervised learning
    2. Spark machine learning APIs
      1. Spark machine learning libraries
        1. Spark MLlib
        2. Spark ML
        3. Spark MLlib or Spark ML?
    3. Feature extraction and transformation
      1. CountVectorizer
      2. Tokenizer
      3. StopWordsRemover
      4. StringIndexer
      5. OneHotEncoder
      6. Spark ML pipelines
        1. Dataset abstraction
    4. Creating a simple pipeline
    5. Unsupervised machine learning
      1. Dimensionality reduction
      2. PCA
        1. Using PCA
        2. Regression Analysis - a practical use of PCA
          1. Dataset collection and exploration
          2. What is regression analysis?
    6. Binary and multiclass classification
      1. Performance metrics
        1. Binary classification using logistic regression
        2. Breast cancer prediction using logistic regression of Spark ML
          1. Dataset collection
          2. Developing the pipeline using Spark ML
      2. Multiclass classification using logistic regression
      3. Improving classification accuracy using random forests
        1. Classifying MNIST dataset using random forest
    7. Summary
  13. My Name is Bayes, Naive Bayes
    1. Multinomial classification
      1. Transformation to binary
        1. Classification using One-Vs-The-Rest approach
        2. Exploration and preparation of the OCR dataset
      2. Hierarchical classification
      3. Extension from binary
    2. Bayesian inference
      1. An overview of Bayesian inference
        1. What is inference?
        2. How does it work?
    3. Naive Bayes
      1. An overview of Bayes' theorem
      2. My name is Bayes, Naive Bayes
      3. Building a scalable classifier with NB
        1. Tune me up!
    4. The decision trees
      1. Advantages and disadvantages of using DTs
        1. Decision tree versus Naive Bayes
          1. Building a scalable classifier with DT algorithm
    5. Summary
  14. Time to Put Some Order - Cluster Your Data with Spark MLlib
    1. Unsupervised learning
      1. Unsupervised learning example
    2. Clustering techniques
      1. Unsupervised learning and the clustering
        1. Hierarchical clustering
        2. Centroid-based clustering
        3. Distribution-based clustering
    3. Centroid-based clustering (CC)
      1. Challenges in CC algorithm
      2. How does K-means algorithm work?
        1. An example of clustering using K-means of Spark MLlib
    4. Hierarchical clustering (HC)
      1. An overview of HC algorithm and challenges
        1. Bisecting K-means with Spark MLlib
        2. Bisecting K-means clustering of the neighborhood using Spark MLlib
    5. Distribution-based clustering (DC)
      1. Challenges in DC algorithm
        1. How does a Gaussian mixture model work?
          1. An example of clustering using GMM with Spark MLlib
    6. Determining number of clusters
    7. A comparative analysis between clustering algorithms
    8. Submitting Spark job for cluster analysis
    9. Summary
  15. Text Analytics Using Spark ML
    1. Understanding text analytics
      1. Text analytics
        1. Sentiment analysis
        2. Topic modeling
        3. TF-IDF (term frequency - inverse document frequency)
        4. Named entity recognition (NER)
        5. Event extraction
    2. Transformers and Estimators
      1. Standard Transformer
      2. Estimator Transformer
    3. Tokenization
    4. StopWordsRemover
    5. NGrams
    6. TF-IDF
      1. HashingTF
      2. Inverse Document Frequency (IDF)
    7. Word2Vec
    8. CountVectorizer
    9. Topic modeling using LDA
    10. Implementing text classification
    11. Summary
  16. Spark Tuning
    1. Monitoring Spark jobs
      1. Spark web interface
        1. Jobs
        2. Stages
        3. Storage
        4. Environment
        5. Executors
        6. SQL
      2. Visualizing Spark application using web UI
        1. Observing the running and completed Spark jobs
        2. Debugging Spark applications using logs
        3. Logging with log4j with Spark
    2. Spark configuration
      1. Spark properties
      2. Environmental variables
      3. Logging
    3. Common mistakes in Spark app development
      1. Application failure
        1. Slow jobs or unresponsiveness
    4. Optimization techniques
      1. Data serialization
      2. Memory tuning
        1. Memory usage and management
        2. Tuning the data structures
        3. Serialized RDD storage
        4. Garbage collection tuning
        5. Level of parallelism
        6. Broadcasting
        7. Data locality
    5. Summary
  17. Time to Go to ClusterLand - Deploying Spark on a Cluster
    1. Spark architecture in a cluster
      1. Spark ecosystem in brief
      2. Cluster design
      3. Cluster management
        1. Pseudocluster mode (aka Spark local)
        2. Standalone
        3. Apache YARN
        4. Apache Mesos
        5. Cloud-based deployments
    2. Deploying the Spark application on a cluster
      1. Submitting Spark jobs
        1. Running Spark jobs locally and in standalone
      2. Hadoop YARN
        1. Configuring a single-node YARN cluster
          1. Step 1: Downloading Apache Hadoop
          2. Step 2: Setting the JAVA_HOME
          3. Step 3: Creating users and groups
          4. Step 4: Creating data and log directories
          5. Step 5: Configuring core-site.xml
          6. Step 6: Configuring hdfs-site.xml
          7. Step 7: Configuring mapred-site.xml
          8. Step 8: Configuring yarn-site.xml
          9. Step 9: Setting Java heap space
          10. Step 10: Formatting HDFS
          11. Step 11: Starting the HDFS
          12. Step 12: Starting YARN
          13. Step 13: Verifying on the web UI
        2. Submitting Spark jobs on YARN cluster
        3. Advanced job submissions in a YARN cluster
      3. Apache Mesos
        1. Client mode
        2. Cluster mode
      4. Deploying on AWS
        1. Step 1: Key pair and access key configuration
        2. Step 2: Configuring Spark cluster on EC2
        3. Step 3: Running Spark jobs on the AWS cluster
        4. Step 4: Pausing, restarting, and terminating the Spark cluster
    3. Summary
  18. Testing and Debugging Spark
    1. Testing in a distributed environment
      1. Distributed environment
        1. Issues in a distributed system
        2. Challenges of software testing in a distributed environment
    2. Testing Spark applications
      1. Testing Scala methods
      2. Unit testing
      3. Testing Spark applications
        1. Method 1: Using Scala JUnit test
        2. Method 2: Testing Scala code using FunSuite
        3. Method 3: Making life easier with Spark testing base
      4. Configuring Hadoop runtime on Windows
    3. Debugging Spark applications
      1. Logging with log4j with Spark recap
      2. Debugging the Spark application
        1. Debugging Spark application on Eclipse as Scala debug
        2. Debugging Spark jobs running as local and standalone mode
        3. Debugging Spark applications on YARN or Mesos cluster
        4. Debugging Spark application using SBT
    4. Summary
  19. PySpark and SparkR
    1. Introduction to PySpark
    2. Installation and configuration
      1. By setting SPARK_HOME
        1. Using Python shell
      2. By setting PySpark on Python IDEs
      3. Getting started with PySpark
      4. Working with DataFrames and RDDs
        1. Reading a dataset in Libsvm format
        2. Reading a CSV file
        3. Reading and manipulating raw text files
      5. Writing UDF on PySpark
      6. Let's do some analytics with k-means clustering
    3. Introduction to SparkR
      1. Why SparkR?
      2. Installing and getting started
      3. Getting started
      4. Using external data source APIs
      5. Data manipulation
      6. Querying SparkR DataFrame
      7. Visualizing your data on RStudio
    4. Summary