Ron L’Esteve

The Azure Data Lakehouse Toolkit

Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake

Ron L’Esteve

Chicago, IL, USA

ISBN 978-1-4842-8232-8 e-ISBN 978-1-4842-8233-5

https://doi.org/10.1007/978-1-4842-8233-5

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress Media, LLC, part of Springer Nature.

The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

For Cayden and Christina

Introduction

With the rise of data volume, velocity, and variety in the modern data and analytics platform on Azure, there is an ever-growing demand for innovative low-cost storage and on-demand compute options that are decoupled from each other. This book will enhance your understanding of some of the practical methods of designing and implementing the Data Lakehouse paradigm on Azure by demonstrating the capabilities of Apache Spark and Delta Lake to build cutting-edge modern Lakehouse solutions on Databricks, Synapse Analytics, and Snowflake. You will gain an understanding of these various technologies and how they fit into the modern data and analytics Lakehouse paradigm by supporting the needs of ingestion, processing, storing, serving, reporting, and consumption. You’ll also gain a better understanding of how machine learning, data governance, and continuous integration and deployment play a role in the Lakehouse.

The Data Lakehouse paradigm on Azure, which leverages Apache Spark and Delta Lake heavily, has become a popular choice for big data engineering, ELT (extract, load, and transform), AI/ML, real-time data processing, reporting, and querying use cases. In some scenarios of the Lakehouse paradigm, Spark coupled with massively parallel processing (MPP) is great for big data reporting and BI analytics platforms that require heavy querying and performance capabilities, since MPP is based on the traditional RDBMS and brings with it the best features of SQL Server, such as automated query tuning, data shuffling, ease of analytics platform management, even data distribution based on a primary key, and much more. As the Lakehouse matures, specifically with Delta Lake, it begins to demonstrate its capabilities of supporting many critical features, such as ACID (atomicity, consistency, isolation, and durability)-compliant transactions for batch and streaming jobs, data quality enforcement, and highly optimized performance tuning.

In the upcoming chapters of this book, we will unravel the complexities of the technologies used within the Lakehouse paradigm, along with their capabilities, through hands-on, scenario-based exercises. You will learn how to implement advanced performance optimization tools and patterns for Spark performance improvement in the Lakehouse by using partitioning, indexing, and other tuning options. You will also learn about the capabilities of Delta Lake, which include schema evolution, change data feed, Live Tables, sharing, and clones. Finally, you will gain more knowledge about some of the advanced capabilities within the Lakehouse, such as building and installing custom Python libraries, implementing security and controls, and working with event-driven auto-loading of data on the Lakehouse platform. The chapters presented within this book are intended to equip you with the right skills and knowledge to design and implement a modern Lakehouse by serving as your Data Lakehouse Toolkit.

Acknowledgments

The journey to completing this book has inspired me to learn more, dream more, and do more. I am thankful to my family, friends, and supporters who have fueled me with the determination and inspiration to write this book.

Table of Contents

Part II: Data Platforms

Part III: Apache Spark ELT

Chapter 5: Pipelines and Jobs

Databricks

Data Factory

Mapping Data Flows

HDInsight Spark Activity

Scheduling and Monitoring

Synapse Analytics Workspace

Summary

Chapter 6: Notebook Code

PySpark

Excel

XML

JSON

ZIP

Scala

SQL

Optimizing Performance

Summary

Part IV: Delta Lake

Chapter 7: Schema Evolution

Schema Evolution Using Parquet Format

Schema Evolution Using Delta Format

Append

Overwrite

Summary

Chapter 8: Change Data Feed

Create Database and Tables

Insert Data into Tables

Change Data Capture

Streaming Changes

Summary

Chapter 9: Clones

Shallow Clones

Deep Clones

Summary

Chapter 10: Live Tables

Advantages of Delta Live Tables

Create a Notebook

Create and Run a Pipeline

Schedule a Pipeline

Explore Event Logs

Summary

Chapter 11: Sharing

Architecture

Share Data

Access Data

Sharing Data with Snowflake

Summary

Part V: Optimizing Performance

Chapter 12: Dynamic Partition Pruning

Partitions

Prerequisites

DPP Commands

Create Cluster

Create Notebook and Mount Data Lake

Create Fact Table

Verify Fact Table Partitions

Create Dimension Table

Join Results Without DPP Filter

Join Results with DPP Filter

Summary

Chapter 13: Z-Ordering and Data Skipping

Prepare Data in Delta Lake

Verify Data in Delta Lake

Create Hive Table

Run Optimize and Z-Order Commands

Verify Data Skipping

Summary

Chapter 14: Adaptive Query Execution

How It Works

Prerequisites

Comparing AQE Performance on Query with Joins

Create Datasets

Disable AQE

Enable AQE

Summary

Chapter 15: Bloom Filter Index

How a Bloom Filter Index Works

Create a Cluster

Create a Notebook and Insert Data

Enable Bloom Filter Index

Create Tables

Create a Bloom Filter Index

Optimize Table with Z-Order

Verify Performance Improvements

Summary

Chapter 16: Hyperspace

Prerequisites

Create Parquet Files

Run a Query Without an Index

Import Hyperspace

Read the Parquet Files to a Data Frame

Create a Hyperspace Index

Rerun the Query with Hyperspace Index

Other Hyperspace Management APIs

Summary

Part VI: Advanced Capabilities

Chapter 17: Auto Loader

Advanced Schema Evolution

Prerequisites

Generate Data from SQL Database

Load Data to Azure Data Lake Storage Gen2

Configure Resources in Azure Portal

Configure Databricks

Run Auto Loader in Databricks

Configuration Properties

Rescue Data

Schema Hints

Infer Column Types

Add New Columns

Managing Auto Loader Resources

Read a Stream

Write a Stream

Explore Results

Summary

Chapter 18: Python Wheels

Install Application Software

Install Visual Studio Code and Python Extension

Install Python

Configure Python Interpreter Path for Visual Studio Code

Verify Python Version in Visual Studio Code Terminal

Set Up Wheel Directory Folders and Files

Create Setup File

Create Readme File

Create License File

Create Init File

Create Package Function File

Install Python Wheel Packages

Install Wheel Package

Install Check Wheel Package

Create and Verify Wheel File

Create Wheel File

Check Wheel Contents

Verify Wheel File

Configure Databricks Environment

Install Wheel to Databricks Library

Create Databricks Notebook

Mount Data Lake Folder

Create Spark Database

Verify Wheel Package

Import Wheel Package

Create Function Parameters

Run Wheel Package Function

Show Spark Tables

Files in Databricks Repos

Continuous Integration and Deployment

Summary

Chapter 19: Security and Controls

Implement Cluster, Pool, and Jobs Access Control

Implement Workspace Access Control

Implement Other Access and Visibility Controls

Table Access Control

Personal Access Tokens

Visibility Controls

Example Row-Level Security Implementation

Create New User Groups

Load Sample Data

Run Queries Using Row-Level Security

Create Row-Level Secured Views and Grant Selective User Access

Interaction with Azure Active Directory

Summary

Index

About the Author

Ron L’Esteve is a professional author, trusted technology leader, and digital innovation strategist residing in Chicago, IL, USA. He is well known for his impactful books and award-winning article publications about Azure Data and AI architecture and engineering. He possesses deep technical skills and experience in designing, implementing, and delivering modern Azure Data and AI projects for numerous clients around the world.

Having several Azure Data, AI, and Lakehouse certifications under his belt, Ron has been a trusted and go-to technical advisor for some of the largest and most impactful Azure implementation projects on the planet. He has been responsible for scaling key data architectures, defining the roadmap and strategy for the future of data and business intelligence needs, and challenging customers to grow by thoroughly understanding the fluid business opportunities and enabling change by translating them into high-quality and sustainable technical solutions that solve the most complex challenges and promote digital innovation and transformation.

Ron is a gifted presenter and trainer, known for his innate ability to clearly articulate and explain complex topics to audiences of all skill levels. He applies a practical and business-oriented approach by taking transformational ideas from concept to scale. He is a true enabler of positive and impactful change by championing a growth mindset.

About the Technical Reviewer

Diego Poggioli is an analytics architect with over ten years of experience managing big data and machine learning projects.

At Avanade, he is currently helping clients to identify, document, and translate business strategy and requirements into solutions and services that help clients to achieve their business outcomes using analytics.

With a passion for software development, cloud, and serverless innovations, he has worked on various product development initiatives spanning IaaS, PaaS, and SaaS.

He is also interested in the emerging topics of artificial intelligence, machine learning, and large-scale data processing and analytics.

He holds several Microsoft Azure and AWS certifications.

His hobbies are handcrafting with driftwood, playing with his two kids, and learning new technologies. He currently lives in Bologna, Italy.
