
Quickly build and deploy massive data pipelines and improve productivity using Azure Databricks

Key Features

  • Get to grips with the distributed training and deployment of machine learning and deep learning models
  • Learn how to integrate ETL pipelines with Azure Data Factory and Delta Lake
  • Explore deep learning and machine learning models in a distributed computing infrastructure

Book Description

Microsoft Azure Databricks helps you harness the power of distributed computing and apply it to create robust data pipelines and to train and deploy machine learning and deep learning models. Databricks' advanced features enable developers to process, transform, and explore data. Distributed Data Systems with Azure Databricks will help you put your knowledge of Databricks to work to create big data pipelines.

The book provides a hands-on approach to implementing Azure Databricks and its associated methodologies that will make you productive in no time. Complete with detailed explanations of essential concepts, practical examples, and self-assessment questions, it begins with a quick introduction to Databricks' core functionalities before moving on to distributed model training and inference using TensorFlow and Spark MLlib. As you advance, you'll explore MLflow Model Serving on Azure Databricks and implement distributed training pipelines using HorovodRunner in Databricks.

Finally, you'll discover how to transform and draw insights from massive amounts of data, use it to train predictive models, and create fully working data pipelines. By the end of this MS Azure book, you'll have gained a solid understanding of how to work with Databricks to create and manage an entire big data pipeline.
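As a rough, illustrative sketch only (not taken from the book), the following snippet shows how a distributed training run of the kind mentioned above is typically launched with HorovodRunner on Databricks. It assumes a cluster running the Databricks Runtime for Machine Learning, where sparkdl.HorovodRunner is available; the toy model and data are hypothetical placeholders.

    from sparkdl import HorovodRunner

    def train():
        # Imports happen inside the function so it can be shipped to the workers.
        import numpy as np
        import tensorflow as tf
        import horovod.tensorflow.keras as hvd

        hvd.init()  # one Horovod process per worker slot

        # Toy regression data, purely illustrative.
        x = np.random.rand(1000, 10).astype("float32")
        y = np.random.rand(1000, 1).astype("float32")

        model = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])

        # Scale the learning rate by the number of workers and wrap the optimizer.
        opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
        model.compile(optimizer=opt, loss="mse")

        # Broadcast initial parameters so every worker starts from the same state.
        callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
        model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks, verbose=2)

    # np=2 asks HorovodRunner to run the training function on two worker processes.
    hr = HorovodRunner(np=2)
    hr.run(train)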

What you will learn

  • Create extract, transform, load (ETL) pipelines for big data in Azure Databricks (a minimal sketch follows this list)
  • Train, manage, and deploy machine learning and deep learning models
  • Integrate Databricks with Azure Data Factory for ETL pipeline creation
  • Discover how to use Horovod for distributed deep learning
  • Find out how to use Delta Engine to query and process data from Delta Lake
  • Understand how to use Data Factory in combination with Databricks
  • Use Structured Streaming in a production-like environment
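As a minimal sketch of the kind of ETL the list above describes (not taken from the book), the following PySpark snippet reads CSV files from an ADLS Gen2 path, applies a simple transformation, and writes the result in Delta format. The storage account, container paths, and column name are hypothetical placeholders.

    from pyspark.sql import SparkSession, functions as F

    # On Azure Databricks a `spark` session is already provided; getOrCreate() reuses it.
    spark = SparkSession.builder.getOrCreate()

    # Hypothetical ADLS Gen2 paths; replace with your own storage account and containers.
    raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/sales/"
    delta_path = "abfss://curated@<storage-account>.dfs.core.windows.net/sales_delta/"

    # Extract: read the raw CSV files into a Spark DataFrame.
    raw_df = (spark.read
                   .option("header", "true")
                   .option("inferSchema", "true")
                   .csv(raw_path))

    # Transform: drop rows missing the (hypothetical) key column and stamp the load time.
    clean_df = (raw_df
                .dropna(subset=["order_id"])
                .withColumn("ingested_at", F.current_timestamp()))

    # Load: write the result as a Delta table that downstream jobs can query.
    (clean_df.write
             .format("delta")
             .mode("overwrite")
             .save(delta_path))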

Who this book is for

This book is for software engineers, machine learning engineers, data scientists, and data engineers who are new to Azure Databricks and want to build high-quality data pipelines without worrying about infrastructure. Knowledge of Azure Databricks basics is required to learn the concepts covered in this book more effectively. A basic understanding of machine learning concepts and beginner-level Python programming knowledge is also recommended.

Table of Contents

  1. Distributed Data Systems with Azure Databricks
  2. Contributors
  3. About the author
  4. About the reviewer
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Download the color images
    6. Conventions used
    7. Get in touch
    8. Reviews
  6. Section 1: Introducing Databricks
  7. Chapter 1: Introduction to Azure Databricks
    1. Technical requirements
    2. Introducing Apache Spark
    3. Introducing Azure Databricks
    4. Examining the architecture of Databricks
    5. Discovering core concepts and terminology
    6. Interacting with the Azure Databricks workspace
    7. Workspace assets
    8. Workspace object operations
    9. Using Azure Databricks notebooks
    10. Creating and managing notebooks
    11. Notebooks and clusters
    12. Exploring data management
    13. Databases and tables
    14. Viewing databases and tables
    15. Importing data
    16. Creating a table
    17. Table details
    18. Exploring computation management
    19. Displaying clusters
    20. Starting a cluster
    21. Terminating a cluster
    22. Deleting a cluster
    23. Cluster information
    24. Cluster logs
    25. Exploring authentication and authorization
    26. Cluster access control
    27. Folder permissions
    28. Notebook permissions
    29. MLflow Model permissions
    30. Summary
  8. Chapter 2: Creating an Azure Databricks Workspace
    1. Technical requirements
    2. Using the Azure portal UI
    3. Accessing the Workspace UI
    4. Configuring an Azure Databricks cluster
    5. Creating a new notebook
    6. Examining Azure Databricks authentication
    7. Access control
    8. Working with VNets in Azure Databricks
    9. Virtual network requirements
    10. Deploying to your own VNet
    11. Azure Resource Manager templates
    12. Creating an Azure Databricks workspace with an ARM template
    13. Reviewing deployed resources
    14. Cleaning up resources
    15. Setting up the Azure Databricks CLI
    16. Authentication through an access token
    17. Authentication using an Azure AD token
    18. Validating the installation
    19. Workspace CLI
    20. Using the CLI to explore the workspace
    21. Clusters CLI
    22. Jobs CLI
    23. Groups API
    24. The Databricks CLI from Azure Cloud Shell
    25. Summary
  9. Section 2: Data Pipelines with Databricks
  10. Chapter 3: Creating ETL Operations with Azure Databricks
    1. Technical requirements
    2. Using ADLS Gen2
    3. Setting up a basic ADLS Gen2 data lake
    4. Uploading data to ADLS Gen2
    5. Accessing ADLS Gen2 from Azure Databricks
    6. Loading data from ADLS Gen2
    7. Using S3 with Azure Databricks
    8. Connecting to S3
    9. Loading data into a Spark DataFrame
    10. Using Azure Blob storage with Azure Databricks
    11. Setting up Azure Blob storage
    12. Uploading files and access keys
    13. Setting up the connection to Azure Blob storage
    14. Transforming and cleaning data
    15. Spark DataFrames
    16. Querying using SQL
    17. Writing back table data to Azure Data Lake
    18. Orchestrating jobs with Azure Databricks
    19. ADF
    20. Creating an ADF resource
    21. Creating an ETL in ADF
    22. Scheduling jobs with Azure Databricks
    23. Scheduling a notebook as a job
    24. Job logs
    25. Summary
  11. Chapter 4: Delta Lake with Azure Databricks
    1. Technical requirements
    2. Introducing Delta Lake
    3. Ingesting data using Delta Lake
    4. Partner integrations
    5. The COPY INTO SQL command
    6. Auto Loader
    7. Batching table read and writes
    8. Creating a table
    9. Reading a Delta table
    10. Partitioning data to speed up queries
    11. Querying past states of a table
    12. Using time travel to query tables
    13. Working with past and present data
    14. Schema validation
    15. Streaming table read and writes
    16. Streaming from Delta tables
    17. Managing table updates and deletes
    18. Specifying an initial position
    19. Streaming modes
    20. Optimization with Delta Lake
    21. Summary
  12. Chapter 5: Introducing Delta Engine
    1. Technical requirements
    2. Optimizing file management with Delta Engine
    3. Merging small files using bin-packing
    4. Skipping data
    5. Using Z-order clustering
    6. Managing data recency
    7. Understanding checkpoints
    8. Automatically optimizing files with Delta Engine
    9. Using caching to improve performance
    10. Delta and Apache Spark caching
    11. Caching a subset of the data
    12. Configuring the Delta cache
    13. Optimizing queries using DFP
    14. Using DFP
    15. Using Bloom filters
    16. Understanding Bloom filters
    17. Bloom filters in Azure Databricks
    18. Creating a Bloom filter index
    19. Optimizing join performance
    20. Range join optimization
    21. Enabling range join optimization
    22. Skew join optimization
    23. Relationships and columns
    24. Summary
  13. Chapter 6: Introducing Structured Streaming
    1. Technical requirements
    2. Structured Streaming model
    3. Using the Structured Streaming API
    4. Mapping, filtering, and running aggregations
    5. Windowed aggregations on event time
    6. Merging streaming and static data
    7. Interactive queries
    8. Using different sources with continuous streams
    9. Using a Delta table as a stream source
    10. Azure Event Hubs
    11. Auto Loader
    12. Apache Kafka
    13. Avro data
    14. Data sinks
    15. Recovering from query failures
    16. Optimizing streaming queries
    17. Triggering streaming query executions
    18. Different kinds of triggers
    19. Trigger examples
    20. Visualizing data on streaming DataFrames
    21. Example on Structured Streaming
    22. Summary
  14. Section 3: Machine and Deep Learning with Databricks
  15. Chapter 7: Using Python Libraries in Azure Databricks
    1. Technical requirements
    2. Installing libraries in Azure Databricks
    3. Workspace libraries
    4. Cluster libraries
    5. Notebook-scoped Python libraries
    6. PySpark API
    7. Main functionalities of PySpark
    8. Operating with PySpark DataFrames
    9. pandas DataFrame API (Koalas)
    10. Using the Koalas API
    11. Using SQL in Koalas
    12. Working with PySpark
    13. Visualizing data
    14. Bokeh
    15. Matplotlib
    16. Plotly
    17. Summary
  16. Chapter 8: Databricks Runtime for Machine Learning
    1. Loading data
    2. Reading data from DBFS
    3. Reading CSV files
    4. Feature engineering
    5. Tokenizer
    6. Binarizer
    7. Polynomial expansion
    8. StringIndexer
    9. One-hot encoding
    10. VectorIndexer
    11. Normalizer
    12. StandardScaler
    13. Bucketizer
    14. Element-wise product
    15. Time-series data sources
    16. Joining time-series data
    17. Using the Koalas API
    18. Handling missing values
    19. Extracting features from text
    20. TF-IDF
    21. Word2vec
    22. Training machine learning models on tabular data
    23. Engineering the variables
    24. Building the ML model
    25. Registering the model in the MLflow Model Registry
    26. Model serving
    27. Summary
  17. Chapter 9: Databricks Runtime for Deep Learning
    1. Technical requirements
    2. Loading data for deep learning
    3. Using TFRecords for distributed learning
    4. Structuring TFRecords files
    5. Managing data using TFRecords
    6. Automating schema inference
    7. Using TFRecordDataset to load data
    8. Using Petastorm for distributed learning
    9. Introducing Petastorm
    10. Generating a dataset
    11. Reading a dataset
    12. Using Petastorm to prepare data for deep learning
    13. Data preprocessing and featurization
    14. Featurization using a pre-trained model for transfer learning
    15. Featurization using pandas UDFs
    16. Applying featurization to the DataFrame of images
    17. Summary
  18. Chapter 10: Model Tracking and Tuning in Azure Databricks
    1. Technical requirements
    2. Tuning hyperparameters with AutoML
    3. Automating model tracking with MLflow
    4. Managing MLflow runs
    5. Automating MLflow tracking with MLlib
    6. Hyperparameter tuning with Hyperopt
    7. Hyperopt concepts
    8. Defining a search space
    9. Applying best practices in Hyperopt
    10. Optimizing model selection with scikit-learn, Hyperopt, and MLflow
    11. Summary
  19. Chapter 11: Managing and Serving Models with MLflow and MLeap
    1. Technical requirements
    2. Managing machine learning models
    3. Using MLflow notebook experiments
    4. Registering a model using the MLflow API
    5. Transitioning a model stage
    6. Model Registry example
    7. Exporting and loading pipelines with MLeap
    8. Serving models with MLflow
    9. Scoring a model
    10. Summary
  20. Chapter 12: Distributed Deep Learning in Azure Databricks
    1. Technical requirements
    2. Distributed training for deep learning
    3. The ring allreduce technique
    4. Using the Horovod distributed learning library in Azure Databricks
    5. Installing the horovod library
    6. Using the horovod library
    7. Training a model on a single node
    8. Distributing training with HorovodRunner
    9. Distributing hyperparameter tuning using Horovod and Hyperopt
    10. Using the Spark TensorFlow Distributor package
    11. Summary
    12. Why subscribe?
  21. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Leave a review - let other readers know what you think