© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
R. C. L'Esteve, The Definitive Guide to Azure Data Engineering, https://doi.org/10.1007/978-1-4842-7182-7_15

15. Delta Lake

Ron C. L'Esteve, Chicago, IL, USA

While working with Azure Data Lake Storage Gen2 and Apache Spark, users have come to learn both the limitations of Apache Spark and the many challenges of implementing a data lake. The need for an ACID-compliant feature set is critical within the data lake landscape, and Delta Lake offers many solutions to the current limitations of standard Azure Data Lake Storage Gen2 accounts.

Delta Lake is an open source storage layer that guarantees data atomicity, consistency, isolation, and durability in the lake. In short, a Delta Lake is ACID compliant. In addition to providing ACID transactions, scalable metadata handling, and more, Delta Lake runs on an existing data lake and is compatible with Apache Spark APIs. There are a few methods of getting started with Delta Lake. Databricks offers notebooks along with compatible Apache Spark APIs to create and manage Delta Lakes. Alternatively, Azure Data Factory’s Mapping Data Flows, which uses scaled-out Apache Spark clusters, can be used to perform ACID-compliant CRUD operations through GUI-designed ETL pipelines. This chapter will demonstrate how to get started with Delta Lake using Azure Data Factory’s Delta Lake connector through examples of how to create, insert, update, and delete in a Delta Lake.

Why an ACID Delta Lake

There are many advantages to introducing Delta Lake into a modern cloud data architecture. Traditionally, data lakes and Apache Spark are not ACID compliant. Delta Lake introduces ACID compliance to address the following issues:
  1. Atomicity: Write either all data or nothing. Apache Spark save modes do not use any locking and are not atomic, so a failed job may leave an incomplete file and corrupt the data. Additionally, a failing job may remove the old file and corrupt the new one. While this seems concerning, Spark's built-in DataFrame writer APIs, though not atomic, do behave atomically for append operations; this, however, comes with performance overhead when used with cloud storage. The currently available Apache Spark save modes are ErrorIfExists, Append, Overwrite, and Ignore (see the sketch after this list).
  2. Consistency: Data is always in a valid state. If the Spark API writer deletes an old file and creates a new one in an operation that is not transactional, there will always be a period of time between the deletion of the old file and the creation of the new one when the file does not exist. If the overwrite operation fails in that window, the data in the old file is lost and the new file may never be created. This is a typical consistency issue with Spark overwrite operations.
  3. Isolation: Multiple transactions occur independently without interference. When writing to a dataset, other concurrent reads or writes on the same dataset should not be impacted by the write operation. Typical transactional databases offer multiple isolation levels, such as read uncommitted, read committed, repeatable read, snapshot, and serializable. While Spark has task- and job-level commits, because it lacks atomicity it does not offer isolation levels.
  4. Durability: Committed data is never lost. When Spark does not correctly implement a commit, it undermines the strong durability features offered by cloud storage and can corrupt and/or lose the data, violating durability.
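
For readers who work with Spark directly, the following is a minimal PySpark sketch (not part of the ADF exercise) contrasting a plain parquet overwrite with a Delta overwrite. The file paths are placeholders, and the Delta write assumes the Delta Lake library is available on the cluster, as it is in Databricks.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/raw/userdata1.parquet")  # placeholder source path

# Plain parquet overwrite (the mode can be ErrorIfExists, Append, Overwrite,
# or Ignore): existing files are removed before the new ones are fully
# written, so a mid-write failure can leave the target incomplete or corrupted.
df.write.mode("overwrite").parquet("/staging/users_parquet")

# Delta overwrite: the change is committed as a single transaction in the
# _delta_log, so readers see either the previous version or the new one,
# never a partially written state.
df.write.format("delta").mode("overwrite").save("/staging/users_delta")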

Prerequisites

Now that you have an understanding of the current data lake and Spark challenges, along with the benefits of an ACID-compliant Delta Lake, let's get started with the exercise.

For this exercise, be sure to successfully create the following prerequisites:
  1. Create a Data Factory V2: Data Factory will be used to perform the ELT orchestrations. Additionally, ADF's Mapping Data Flows Delta Lake connector will be used to create and manage the Delta Lake.
  2. Create a Data Lake Storage Gen2: ADLS Gen2 will be the data lake storage, on top of which the Delta Lake will be created.
  3. Create Data Lake Storage Gen2 container and zones: Once your Data Lake Storage Gen2 account is created, also create the appropriate containers and zones. Revisit Chapter 3, where I discuss how to design an ADLS Gen2 storage account, for more information on designing ADLS Gen2 zones. This exercise will use the Raw zone to store a sample source parquet file. Additionally, the Staging zone will be used for Delta inserts, updates, and deletes, along with additional transformations. Though the Curated zone will not be used in this exercise, it is worth mentioning that this zone may contain the final ETL, advanced analytics, or data science models that are further transformed and curated from the Staging zone. Once the various zones are created in your ADLS Gen2 account, they will look similar to the illustration in Figure 15-1.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig1_HTML.jpg
Figure 15-1

ADLS Gen2 zones/folders

  4. Upload data to the Raw zone: Finally, you'll need some data for this exercise. By searching for "sample parquet files" online or within publicly available datasets, you'll find a number of free sample parquet files. For this exercise, you could download the sample parquet file from the following GitHub repo (https://github.com/Teradata/kylo/tree/master/samples/sample-data/parquet) and then upload it to your ADLS Gen2 storage account, as shown in Figure 15-2.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig2_HTML.jpg
Figure 15-2

Sample userdata1.parquet file

  5. Create a Data Factory parquet dataset pointing to the Raw zone: The final prerequisite would be to create a Parquet format dataset in the newly created instance of ADF V2, as shown in Figure 15-3, pointing to the sample parquet file stored in the Raw zone.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig3_HTML.jpg
Figure 15-3

userData1 connection setting in ADF

Create and Insert into Delta Lake

Now that all prerequisites are in place, you are ready to create the initial delta tables and insert data from your Raw zone into the delta tables.

Begin the process by creating a new Data Factory pipeline and adding a new Mapping Data Flow to it. Also remember to give the pipeline and data flow sensible names, much like the sample shown in Figure 15-4.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig4_HTML.jpg
Figure 15-4

Mapping Data Flow canvas for insert

Within the data flow, add a source with the configuration settings shown in Figure 15-5. Also, check the option to "Allow schema drift." Schema drift refers to frequent changes in metadata related to fields, columns, and types; without a proper process to handle it, an ETL pipeline might fail. ADF supports flexible schemas that change often and treats schema drift as late binding; therefore, the drifted schemas will not be available for you to view in the data flow.

When schema drift is enabled, all of the incoming fields are read from your source during execution and passed through the entire flow to the sink. All newly detected columns arrive as a string data type by default. If you need to automatically infer data types of your drifted columns, then you’ll need to enable Infer drifted column types in your source settings.

Enabling "Infer drifted column types" allows auto-detection of drifted column types. Sampling offers a method to limit the number of rows from the source and is mainly used for testing and debugging purposes.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig5_HTML.jpg
Figure 15-5

Mapping Data Flows ETL flow source settings for inserts
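
The schema drift handling described above is managed by ADF through late binding. As a rough code-first analog (an assumption for illustration, not what the connector does internally), Delta's schema evolution lets an append carry new columns when the mergeSchema option is set. The paths and file names below are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A source file whose schema has drifted, e.g. it now carries an extra column.
new_df = spark.read.parquet("/raw/userdata_drifted.parquet")  # placeholder

# Without mergeSchema, appending a DataFrame with a new column fails with a
# schema mismatch error; with it, the new column is added to the table schema.
(new_df.write.format("delta")
     .mode("append")
     .option("mergeSchema", "true")
     .save("/staging/users_delta"))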

Since Delta Lake leverages Spark's distributed processing power, it is capable of partitioning data appropriately on its own. However, to explore the capability of manually setting partitioning, configure 20 Hash partitions on the id column, as shown in Figure 15-6.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig6_HTML.jpg
Figure 15-6

Optimize settings for MDF partitioning
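
If you were doing the same thing in Spark code rather than in the Optimize tab, hash partitioning on a column looks roughly like the sketch below; the paths are placeholders, the column name assumes the id column from the sample file, and an existing spark session is assumed.

df = spark.read.parquet("/raw/userdata1.parquet")  # placeholder source path

# Redistribute the rows into 20 partitions by hashing the id column, which
# mirrors the "20 Hash partitions on id" setting in the data flow.
partitioned_df = df.repartition(20, "id")
partitioned_df.write.format("delta").mode("append").save("/staging/users_delta")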

After adding the destination activity, ensure that the sink type, shown in Figure 15-7, is set to Delta format in Azure Data Factory (https://docs.microsoft.com/en-us/azure/data-factory/format-delta). Note that Delta Lake is available as both a source and sink in Mapping Data Flows. Also, you will be required to select the linked service once the sink type of Delta Lake is selected.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig7_HTML.jpg
Figure 15-7

MDF sink settings

Under the Settings tab shown in Figure 15-8, ensure that the Staging folder is selected and select Allow insert for the update method. Also, select the Truncate table option if there is a need to truncate the delta table before loading it.

The process of vacuuming a delta table within the lake (https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-vacuum.html) will remove files that are no longer referenced by delta tables and are older than the retention threshold in hours. The default is 30 days if the value is left at 0 or empty.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig8_HTML.jpg
Figure 15-8

Settings for insert sink in MDF
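
Outside of the ADF sink setting, the same cleanup can be run with the Delta Lake Python API, roughly as follows. The path is a placeholder; note that Delta's own default retention is 7 days, while the 30-day default mentioned above is the behavior of the ADF sink when the value is left at 0 or empty.

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/staging/users_delta")  # placeholder path

# Remove files that are no longer referenced by the table and are older than
# 720 hours (30 days).
delta_table.vacuum(720)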

Finally, within the Optimize tab shown in Figure 15-9, use the current partitioning since the source partitioning will flow downstream to the sink.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig9_HTML.jpg
Figure 15-9

Partitioning options for optimizing sink

Once you trigger the pipeline and it completes running, notice from Figure 15-10 that, as expected, 13 columns have been created across 20 different partitions.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig10_HTML.jpg
Figure 15-10

ADF MDF pipeline run details

While looking at the ADLS Gen2 Staging folder, notice from Figure 15-11 that a delta_log folder along with 20 snappy compressed parquet files has been created.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig11_HTML.jpg
Figure 15-11

Delta Lake partitioned files

Open the delta_log folder to view the two transaction log files, shown in Figure 15-12. The transaction log captures many important features, including ACID transactions, scalable metadata handling, time travel, and more (https://databricks.com/blog/2019/08/21/diving-into-delta-lake-unpacking-the-transaction-log.html).
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig12_HTML.jpg
Figure 15-12

Delta log files

After checking the new data in the Staging Delta Lake, notice that the new records have been inserted. To easily explore the Delta Lake from ADF with Mapping Data Flows, simply toggle the Data Preview tab to view the details of the data flow, as shown in Figure 15-13.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig13_HTML.jpg
Figure 15-13

Insert data results reading from parquet file
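
If you prefer to check the inserted rows in code rather than in the Data Preview tab, a minimal sketch is shown below; the path is a placeholder, the column names assume the userdata1.parquet schema, and an existing spark session with the Delta library is assumed.

# Read the Delta table from the Staging zone and inspect a few rows,
# roughly what the Data Preview tab shows in the data flow.
staged_df = spark.read.format("delta").load("/staging/users_delta")
staged_df.select("id", "first_name", "last_name", "gender").show(10)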

Update Delta Lake

So far, inserts into the Delta Lake have been covered in the previous section. Next, let’s take a look at how Data Factory can handle updates to your delta tables. Similar to what you did in the previous section for inserts, create a new ADF pipeline with a Mapping Data Flow for updates, shown in Figure 15-14.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig14_HTML.jpg
Figure 15-14

MDF update Delta Lake

For this update exercise, update the first and last names of the user and convert them to lowercase. To do this, add a Derived column and AlterRow transform activity to the update Mapping Data Flow canvas, as shown in Figure 15-15.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig15_HTML.jpg
Figure 15-15

Source settings for updating Mapping Data Flows parquet

Within the Source options tab shown in Figure 15-16, the source data is still your Staging Delta Lake that was also configured for the inserts. The Delta Lake source connector within ADF also introduced delta time travel for large-scale data lakes to audit data changes, reproduce results, handle rollbacks, and more. Time travel allows you to query data by timestamp or version (https://databricks.com/blog/2019/02/04/introducing-delta-time-travel-for-large-scale-data-lakes.html).
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig16_HTML.jpg
Figure 15-16

Update source options
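
As a rough illustration of the time travel capability mentioned above, Delta lets you read an earlier snapshot by version or by timestamp. A minimal sketch, with a placeholder path and an assumed spark session, follows.

# Read the table as it was at version 0 (the initial insert). A timestamp
# string can be used instead via the timestampAsOf option.
v0_df = (spark.read.format("delta")
         .option("versionAsOf", 0)
         .load("/staging/users_delta"))
v0_df.show(5)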

The Derived column transform activity converts first and last names to lowercase using the expressions lower(first_name) and lower(last_name), as shown in Figure 15-17. Mapping Data Flows is capable of handling extremely complex transformations in this stage.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig17_HTML.jpg
Figure 15-17

Derived column settings for update
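
The equivalent transformation in PySpark, for readers following along in a notebook, is a pair of withColumn calls; the path is a placeholder and an existing spark session is assumed.

from pyspark.sql.functions import lower, col

# Same transformation as the Derived column activity: lowercase both names.
df = spark.read.format("delta").load("/staging/users_delta")
lowered_df = (df
              .withColumn("first_name", lower(col("first_name")))
              .withColumn("last_name", lower(col("last_name"))))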

For the alter row settings, you'll need to specify an Update if condition of true() to update all rows, as shown in Figure 15-18.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig18_HTML.jpg
Figure 15-18

Settings for the AlterRow updates

Ensure that you verify the Sink tab’s configuration settings, as shown in Figure 15-19.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig19_HTML.jpg
Figure 15-19

Sink settings for update MDF

Ensure that the sink is still pointing to the Staging Delta Lake data. Also, select Allow update as the update method. To show that multiple key columns can be simultaneously selected, there are three columns selected in Figure 15-20.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig20_HTML.jpg
Figure 15-20

Update method settings for MDF sink
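
Under the covers, an update against a Delta sink is expressed as a merge on the chosen key columns, as the delta logs later in this chapter show. A sketch of the same idea with the Delta Lake Python API follows; the path is a placeholder, and an existing spark session with the Delta library is assumed.

from delta.tables import DeltaTable
from pyspark.sql.functions import lower, col

path = "/staging/users_delta"  # placeholder for the Staging zone Delta folder
target = DeltaTable.forPath(spark, path)

# Source rows with the lowercased names, read from the same table.
source = (spark.read.format("delta").load(path)
          .withColumn("first_name", lower(col("first_name")))
          .withColumn("last_name", lower(col("last_name"))))

# Match on the same three key columns selected in the sink and update the
# name columns on the matched rows.
(target.alias("t")
 .merge(source.alias("s"),
        "t.id = s.id AND t.registration_dttm = s.registration_dttm "
        "AND t.ip_address = s.ip_address")
 .whenMatchedUpdate(set={"first_name": "s.first_name",
                         "last_name": "s.last_name"})
 .execute())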

After the pipeline is saved and triggered, notice the results from the ADF Data Preview tab shown in Figure 15-21. The first and last names have been updated to lowercase values.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig21_HTML.png
Figure 15-21

Data showing updates as expected

Delete from Delta Lake

To recap, inserts and updates have been covered so far. Next, let's look at an example of how Mapping Data Flows handles deletes in Delta Lake. Similar to the process of setting up the insert and update MDF pipelines, create a new Data Factory Mapping Data Flow, as shown in Figure 15-22.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig22_HTML.jpg
Figure 15-22

MDF for deleting from Delta Lake

Configure the Delta source settings as desired, as shown in Figure 15-23.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig23_HTML.jpg
Figure 15-23

Source settings for mdf_delete_deltalake

Since you are still working with the same Staging Delta Lake, these source settings, shown in Figure 15-24, will be configured similar to how you had configured the inserts and updates in the previous sections.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig24_HTML.jpg
Figure 15-24

Source options for mdf_delete_deltalake

For this example, delete all records where gender = male. To do this, you’ll need to configure the alter row conditions to Delete if gender == ‘Male’, as shown in Figure 15-25.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig25_HTML.jpg
Figure 15-25

Alter row settings for mdf_delete_deltalake delta

Finally, Figure 15-26 shows the sink delta configuration settings.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig26_HTML.jpg
Figure 15-26

Sink settings for mdf_delete_deltalake

Select the Staging Delta Lake for the sink and select “Allow delete” along with the desired key columns of id, registration_dttm, and ip_address, as shown in Figure 15-27.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig27_HTML.jpg
Figure 15-27

Destination for mdf_delete_deltalake
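
While ADF records this delete as a merge (as the delta logs later show), a simpler code-first equivalent with the Delta Lake Python API would be roughly the following; the path is a placeholder, and an existing spark session with the Delta library is assumed.

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/staging/users_delta")

# Remove all rows where gender is Male, matching the Alter Row "Delete if"
# condition combined with the Allow delete sink setting.
target.delete("gender = 'Male'")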

After publishing and triggering this pipeline, notice from Figure 15-28 how all records where gender = Male have been deleted.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig28_HTML.jpg
Figure 15-28

Delete data file as expected

Explore Delta Logs

Lastly, let's take a look at the delta logs to briefly understand how they have been created and populated. The main commit information is generated and stored in the Insert, Update, and Delete JSON commit files. Additionally, CRC files are created. CRC is a popular technique for checking data integrity, as it has excellent error detection abilities, uses few resources, and is easy to apply. The delta logs that have been created by the ADF Delta insert, update, and delete MDF pipelines are stored in the _delta_log folder within your ADLS Gen2 account, as shown in Figure 15-29.
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig29_HTML.jpg
Figure 15-29

Delta logs after insert, update, and delete
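
Rather than opening the JSON commit files by hand, the commit history can also be inspected through the Delta Lake API; a minimal sketch (placeholder path, existing spark session with the Delta library assumed) follows.

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "/staging/users_delta")

# One row per commit: version, timestamp, operation (WRITE, MERGE, ...),
# drawn from the same information captured in the _delta_log folder.
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)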

Insert

Open the Insert JSON commit file shown in Figure 15-30, and notice that it contains commit info for the insert operations, referenced by line items that begin with the word “add”. You will usually not need to work with or open these files while building ETL pipelines and processes. However, the logs will always be persisted to this folder in JSON format, so you will always have the option to open the logs when needed. For this exercise, opening these logs will give you a deeper understanding as to how these logs capture information to help with better understanding the mechanics of this process.
{"commitInfo":{"timestamp":1594782281467,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"isolationLevel":"WriteSerializable","isBlindAppend":true}}
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig30_HTML.jpg
Figure 15-30

Delta log insert

Update

Similar to the insert delta logs, open the Update JSON commit file shown in Figure 15-31 and notice that it contains commit info for the update operations. The line items that begin with “remove” show the path to the data files that have been deleted, and the line items that begin with “add” show the data that has been added as part of the update process. Additionally, important details related to commit info, timestamps, and more are available in these logs. Again, in a regular ADF pipeline process, you will not need to explore these delta logs, and this is meant to be an informational and exploratory exercise.
{"commitInfo":{"timestamp":1594782711552,"operation":"MERGE","operationParameters":{"predicate":"(((source.`id` = target.`id`) AND (source.`registration_dttm` = target.`registration_dttm`)) AND (source.`ip_address` = target.`ip_address`))","updatePredicate":"((NOT ((source.`ra2b434a305b34f2f96cd5b4b4149455e` & 2) = 0)) OR (NOT ((source.`ra2b434a305b34f2f96cd5b4b4149455e` & 8) = 0)))","deletePredicate":"(NOT ((source.`ra2b434a305b34f2f96cd5b4b4149455e` & 4) = 0))"},"readVersion":0,"isolationLevel":"WriteSerializable","isBlindAppend":false}}
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig31_HTML.jpg
Figure 15-31

Delta log update

Delete

Lastly, open the Delete JSON commit file shown in Figure 15-32 and notice that it contains commit info for the delete operations. The files that have been deleted are captured by line items that begin with “remove”. Now that you have an understanding of these delta transaction logs, along with how to open and interpret them, you’ll better understand the relevance of Delta Lake and how it is positioned to handle ACID-compliant transactions within a data lake.
"commitInfo":{"timestamp":1594783812366,"operation":"MERGE","operationParameters":{"predicate":"(((source.`id` = target.`id`) AND (source.`registration_dttm` = target.`registration_dttm`)) AND (source.`ip_address` = target.`ip_address`))","updatePredicate":"((NOT ((source.`ra079d97a688347b581710234d2cc4b63` & 2) = 0)) OR (NOT ((source.`ra079d97a688347b581710234d2cc4b63` & 8) = 0)))","deletePredicate":"(NOT ((source.`ra079d97a688347b581710234d2cc4b63` & 4) = 0))"},"readVersion":1,"isolationLevel":"WriteSerializable","isBlindAppend":false}}
../images/511918_1_En_15_Chapter/511918_1_En_15_Fig32_HTML.jpg
Figure 15-32

Delta log delete

Summary

In this chapter, I have demonstrated how to get started with Delta Lake using Azure Data Factory’s Delta Lake connector through examples of how to create, insert, update, and delete in a Delta Lake by using Azure Data Lake Storage Gen2 as the storage account. Since Delta Lake is an open source project that is meant to enable building a lakehouse architecture on top of existing storage systems, it can certainly be used with other storage systems such as Amazon S3, Google Cloud Storage, HDFS, and others. Additionally, you could just as easily work with Delta Lake by writing Spark, Python, and/or Scala code within Databricks notebooks.

With such flexibility, the lakehouse data management paradigm is gaining momentum, with a vision of becoming an industry standard and the evolution of both the data lake and the data warehouse. The low-cost storage of data lakes makes this option very attractive for organizations that embrace a cost-sensitive and growth mindset. Additional advantages of the data lakehouse include reduced data redundancy, elimination of simple ETL jobs, compute decoupled from storage, real-time streaming support, ease of data governance, and the capability to connect directly to modern-day BI tools. While still in their infancy, the data lakehouse and Delta Lake do have certain limitations that prevent them from being a full-fledged replacement for traditional data warehousing appliances such as SQL Server databases and warehouses. Nevertheless, this chapter has demonstrated the capability of working directly with Delta Lake in a code-free manner and shown how to easily work with Delta Lake to begin exploring the building blocks of a data lakehouse.
