Ron L’Esteve

The Azure Data Lakehouse Toolkit

Building and Scaling Data Lakehouses on Azure with Delta Lake, Apache Spark, Databricks, Synapse Analytics, and Snowflake

Ron L’Esteve
Chicago, IL, USA
ISBN 978-1-4842-8232-8 e-ISBN 978-1-4842-8233-5
© Ron L'Esteve 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress Media, LLC, part of Springer Nature.

The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

For Cayden and Christina

Introduction

With the rise of data volume, velocity, and variety in the modern data and analytics platform on Azure, there is an ever-growing demand for innovative, low-cost storage and on-demand compute options built around the decoupling of storage from compute. This book will enhance your understanding of some of the practical methods of designing and implementing the Data Lakehouse paradigm on Azure by demonstrating the capabilities of Apache Spark and Delta Lake to build cutting-edge modern Lakehouse solutions on Databricks, Synapse Analytics, and Snowflake. You will gain an understanding of these technologies and how they fit into the modern data and analytics Lakehouse paradigm by supporting the needs of ingestion, processing, storing, serving, reporting, and consumption. You’ll also gain a better understanding of how machine learning, data governance, and continuous integration and deployment play a role in the Lakehouse.

The Data Lakehouse paradigm on Azure, which leverages Apache Spark and Delta Lake heavily, has become a popular choice for big data engineering, ELT (extract, load, and transform), AI/ML, real-time data processing, reporting, and querying use cases. In some Lakehouse scenarios, Spark coupled with MPP (massively parallel processing) is well suited to big data reporting and BI analytics platforms that require heavy querying and high performance. MPP is based on the traditional RDBMS and brings with it the best features of SQL Server, such as automated query tuning, data shuffling, ease of analytics platform management, even data distribution based on a primary key, and much more. As the Lakehouse matures, specifically with Delta Lake, it begins to demonstrate its support for many critical features, such as ACID (atomicity, consistency, isolation, and durability)-compliant transactions for batch and streaming jobs, data quality enforcement, and highly optimized performance tuning.

In the upcoming chapters of this book, we will unravel the many complexities of the technologies used within the Lakehouse paradigm, along with their capabilities, through hands-on, scenario-based exercises. You will learn how to implement advanced performance optimization tools and patterns for Spark performance improvement in the Lakehouse by using partitioning, indexing, and other tuning options. You will also learn about the capabilities of Delta Lake, which include schema evolution, change feed, Live Tables, sharing, and clones. Finally, you will gain more knowledge about some of the advanced capabilities within the Lakehouse, such as building and installing custom Python libraries, implementing security and controls, and working with event-driven autoloading of data on the Lakehouse platform. The chapters presented within this book are intended to equip you with the right skills and knowledge to design and implement a modern Lakehouse by serving as your Data Lakehouse Toolkit.

Acknowledgments

The journey to completing this book has inspired me to learn more, dream more, and do more. I am thankful to my family, friends, and supporters who have fueled me with the determination and inspiration to write this book.

Table of Contents
Part II: Data Platforms
Part VI: Advanced Capabilities
About the Author
Ron L’Esteve is a professional author, trusted technology leader, and digital innovation strategist residing in Chicago, IL, USA. He is well known for his impactful books and award-winning article publications about Azure Data and AI architecture and engineering. He possesses deep technical skills and experience in designing, implementing, and delivering modern Azure Data and AI projects for numerous clients around the world.

Having several Azure Data, AI, and Lakehouse certifications under his belt, Ron has been a trusted and go-to technical advisor for some of the largest and most impactful Azure implementation projects on the planet. He has been responsible for scaling key data architectures, defining the roadmap and strategy for the future of data and business intelligence needs, and challenging customers to grow by thoroughly understanding the fluid business opportunities and enabling change by translating them into high-quality and sustainable technical solutions that solve the most complex challenges and promote digital innovation and transformation.

Ron is a gifted presenter and trainer, known for his innate ability to clearly articulate and explain complex topics to audiences of all skill levels. He applies a practical and business-oriented approach by taking transformational ideas from concept to scale. He is a true enabler of positive and impactful change by championing a growth mindset.

 
About the Technical Reviewer
Diego Poggioli is an analytics architect with over ten years of experience managing big data and machine learning projects.

At Avanade, he currently helps clients identify, document, and translate business strategy and requirements into solutions and services that achieve their business outcomes using analytics.

With a passion for software development, cloud, and serverless innovations, he has worked on various product development initiatives spanning IaaS, PaaS, and SaaS.

He is also interested in the emerging topics of artificial intelligence, machine learning, and large-scale data processing and analytics.

He holds several Microsoft Azure and AWS certifications.

His hobbies are crafting with driftwood, playing with his two kids, and learning new technologies. He currently lives in Bologna, Italy.

 