Mastering Apache Spark

Table of Contents

Mastering Apache Spark

Credits

Foreword

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers, and more

Why subscribe?

Free access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. Apache Spark

Overview

Spark Machine Learning

Spark Streaming

Spark SQL

Spark graph processing

Extended ecosystem

The future of Spark

Cluster design

Cluster management

Local

Standalone

Apache YARN

Apache Mesos

Amazon EC2

Performance

The cluster structure

The Hadoop file system

Data locality

Memory

Coding

Cloud

Summary

2. Apache Spark MLlib

The environment configuration

Architecture

The development environment

Installing Spark

Classification with Naïve Bayes

Theory

Naïve Bayes in practice

Clustering with K-Means

Theory

K-Means in practice

ANN – Artificial Neural Networks

Theory

Building the Spark server

ANN in practice

Summary

3. Apache Spark Streaming

Overview

Errors and recovery

Checkpointing

Streaming sources

TCP stream

File streams

Flume

Kafka

Summary

4. Apache Spark SQL

The SQL context

Importing and saving data

Processing the Text files

Processing the JSON files

Processing the Parquet files

DataFrames

Using SQL

User-defined functions

Using Hive

Local Hive Metastore server

A Hive-based Metastore server

Summary

5. Apache Spark GraphX

Overview

GraphX coding

Environment

Creating a graph

Example 1 – counting

Example 2 – filtering

Example 3 – PageRank

Example 4 – triangle counting

Example 5 – connected components

Mazerunner for Neo4j

Installing Docker

The Neo4j browser

The Mazerunner algorithms

The PageRank algorithm

The closeness centrality algorithm

The triangle count algorithm

The connected components algorithm

The strongly connected components algorithm

Summary

6. Graph-based Storage

Titan

TinkerPop

Installing Titan

Titan with HBase

The HBase cluster

The Gremlin HBase script

Spark on HBase

Accessing HBase with Spark

Titan with Cassandra

Installing Cassandra

The Gremlin Cassandra script

The Spark Cassandra connector

Accessing Cassandra with Spark

Accessing Titan with Spark

Gremlin and Groovy

TinkerPop's Hadoop Gremlin

Alternative Groovy configuration

Using Cassandra

Using HBase

Using the filesystem

Summary

7. Extending Spark with H2O

Overview

The processing environment

Installing H2O

The build environment

Architecture

Sourcing the data

Data Quality

Performance tuning

Deep learning

Example code – income

The example code – MNIST

H2O Flow

Summary

8. Spark Databricks

Overview

Installing Databricks

AWS billing

Databricks menus

Account management

Cluster management

Notebooks and folders

Jobs and libraries

Development environments

Databricks tables

Data import

External tables

The DbUtils package

Databricks file system

Dbutils fsutils

The DbUtils cache

The DbUtils mount

Summary

9. Databricks Visualization

Data visualization

Dashboards

An RDD-based report

A stream-based report

REST interface

Configuration

Cluster management

The execution context

Command execution

Libraries

Moving data

The table data

Folder import

Library import

Further reading

Summary

Index
