This Apress imprint is published by the registered company APress Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.
For Cayden and Christina
With the rise of data volume, velocity, and variety in the modern data and analytics platform on Azure, there is an ever-growing demand for innovative, low-cost storage and on-demand compute options whose capabilities are decoupled from one another. This book will enhance your understanding of practical methods for designing and implementing the Data Lakehouse paradigm on Azure by demonstrating how Apache Spark and Delta Lake can be used to build cutting-edge, modern Lakehouse solutions on Databricks, Synapse Analytics, and Snowflake. You will gain an understanding of these technologies and how they fit into the modern data and analytics Lakehouse paradigm by supporting the needs of ingestion, processing, storage, serving, reporting, and consumption. You'll also gain a better understanding of the roles that machine learning, data governance, and continuous integration and deployment play in the Lakehouse.
The Data Lakehouse paradigm on Azure, which leverages Apache Spark and Delta Lake heavily, has become a popular choice for big data engineering, ELT (extraction, loading, and transformation), AI/ML, real-time data processing, reporting, and querying use cases. In some Lakehouse scenarios, Spark coupled with a massively parallel processing (MPP) engine is well suited to big data reporting and BI analytics platforms that demand heavy querying and high performance: because MPP engines build on the traditional RDBMS, they bring along the best features of SQL Server, such as automated query tuning, data shuffling, ease of analytics platform management, even data distribution based on a primary key, and much more. As the Lakehouse matures, specifically with Delta Lake, it demonstrates support for many critical features, such as ACID (atomicity, consistency, isolation, and durability)-compliant transactions for batch and streaming jobs, data quality enforcement, and highly optimized performance tuning.
In the upcoming chapters of this book, we will unravel the many complexities of the technologies used within the Lakehouse paradigm and explore their capabilities through hands-on, scenario-based exercises. You will learn how to apply advanced optimization tools and patterns to improve Spark performance in the Lakehouse by using partitioning, indexing, and other tuning options. You will also learn about the capabilities of Delta Lake, which include schema evolution, change data feed, Live Tables, sharing, and clones. Finally, you will gain more knowledge of some of the advanced capabilities within the Lakehouse, such as building and installing custom Python libraries, implementing security and controls, and working with event-driven auto-loading of data on the Lakehouse platform. The chapters presented within this book are intended to equip you with the right skills and knowledge to design and implement a modern Lakehouse by serving as your Data Lakehouse Toolkit.
The journey to completing this book has inspired me to learn more, dream more, and do more. I am thankful to my family, friends, and supporters who have fueled me with the determination and inspiration to write this book.
Having earned several Azure Data, AI, and Lakehouse certifications, Ron has been a trusted, go-to technical advisor for some of the largest and most impactful Azure implementation projects on the planet. He has been responsible for scaling key data architectures; defining the roadmap and strategy for the future of data and business intelligence needs; and challenging customers to grow by thoroughly understanding fluid business opportunities and translating them into high-quality, sustainable technical solutions that solve the most complex challenges and promote digital innovation and transformation.
Ron is a gifted presenter and trainer, known for his innate ability to clearly articulate and explain complex topics to audiences of all skill levels. He applies a practical and business-oriented approach by taking transformational ideas from concept to scale. He is a true enabler of positive and impactful change by championing a growth mindset.
At Avanade, he is currently helping clients to identify, document, and translate business strategy and requirements into solutions and services that enable them to achieve their business outcomes using analytics.
With a passion for software development, cloud, and serverless innovations, he has worked on various product development initiatives spanning IaaS, PaaS, and SaaS.
He is also interested in the emerging topics of artificial intelligence, machine learning, and large-scale data processing and analytics.
He holds several Microsoft Azure and AWS certifications.
His hobbies include driftwood handicrafts, playing with his two kids, and learning new technologies. He currently lives in Bologna, Italy.