© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
R. C. L'Esteve, The Definitive Guide to Azure Data Engineering, https://doi.org/10.1007/978-1-4842-7182-7_12

12. Aggregate and Transform Big Data Using Mapping Data Flows

Ron C. L'Esteve
Chicago, IL, USA

The process of cleansing and transforming big datasets in the data lake has become an increasingly popular and critical step in a modern enterprise's data architecture. Microsoft has introduced several big data analytics and orchestration tools to serve the need for big data lake extract-load-transform (ELT). Customers want cloud-based services that can cleanse, transform, and aggregate extremely large datasets with ease and with a low learning curve, and they want to understand which tools and technologies could potentially fit the bill for big data lake cleansing and transformations.

Azure Data Factory’s Mapping Data Flows has become a promising solution for big data lake cleansing and transformations. In Chapter 11, I discussed the concept of a modern data warehouse and demonstrated a practical example of Mapping Data Flows for enterprise data warehouse transformations. In this chapter, I will continue to demonstrate additional data cleansing and aggregation features of Mapping Data Flows, specifically to process big data files stored in Azure Data Lake Storage Gen2 as hierarchical files.

Add Files and Folders to Azure Data Lake Storage Gen2

Structuring a Data Lake Storage account correctly by following best practices is key. When data is stored in Data Lake Storage Gen2, the file size, number of files, and folder structure all have an impact on performance.

File Size

Depending on which services and workloads are using the data, a good size to consider for files is 256 MB or greater. If files cannot be batched to this size when landing in Data Lake Storage, you can run a separate compaction job that combines them into larger files.

Folder Structure

The folder and file organizational structure can help some queries read only a subset of the data, which improves performance by optimizing for the larger file sizes and a reasonable number of files in each folder. Be cognizant of performance tuning and optimization techniques, along with folder and file structure recommendations.

For this exercise, you will learn how to create an ADLS Gen2 container named lake, along with a few additional folders that will organize the data by the year 2016. To get started, ensure that you have an ADLS Gen2 account and ADF account set up. Next, you’ll need to upload the 2016 sales files from the following GitHub account (https://github.com/ronlesteve/sales-datasets) into the ADLS Gen2 folder structure shown in Figure 12-1.

As we can see from Figure 12-1, within the 2016 sales folder, the additional folders are organized by month number.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig1_HTML.jpg
Figure 12-1

2016 sales folders organized by month number

Within each month, there are .txt files organized and saved by day, as shown in Figure 12-2.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig2_HTML.jpg
Figure 12-2

2016 sales month folders containing text files organized by day
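For orientation, the resulting layout in the lake container looks roughly like the following sketch; the exact month folder and day-level file names depend on the GitHub files you upload, and the DataLakeCleansing/raw/sales path matches the wildcard path used later in this chapter:
lake (container)
└── DataLakeCleansing
    └── raw
        └── sales
            └── 2016
                ├── 1
                │   └── ... daily .txt files ...
                ├── 2
                └── ... one folder per month ...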

Finally, upon opening one of the text files, notice the structure of the data shown in Figure 12-3, along with the columns that it contains.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig3_HTML.jpg
Figure 12-3

Structure of the text files

Create Azure Data Factory Resources

Now that the data lake files and folders have been structured, it is time to create the necessary Azure Data Factory resources. Once Data Factory is open, begin by creating a new pipeline, as shown in Figure 12-4.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig4_HTML.jpg
Figure 12-4

Create a new ADF pipeline

Next, add a new dataset that will reference the lake container with the following connection properties. Notice that the Directory and File properties in Figure 12-5 have been left empty because they can be set dynamically in the Mapping Data Flow properties. Also, set the column delimiter to Tab (\t).
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig5_HTML.jpg
Figure 12-5

ADLS Gen2 dataset connection properties

After you publish the resources, the Factory Resources section in Figure 12-6 will be available. These resources consist of pipelines and datasets.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig6_HTML.jpg
Figure 12-6

ADF pipeline and dataset resources

Within the newly created pipeline, expand Move & Transform from Activities and then drag the Data Flow activity, shown in Figure 12-7, onto the canvas.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig7_HTML.jpg
Figure 12-7

ADF Data Flow activity

Create the Mapping Data Flow

Now you are ready to create the Data Flow activity shown in Figure 12-8.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig8_HTML.png
Figure 12-8

ADF SalesOrderDataFlow

Start by adding a source connection shown in Figure 12-9 to the SalesOrderDataset created in the previous section.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig9_HTML.png
Figure 12-9

ADF data flow source connection to SalesOrderDataset

Be sure to select Allow schema drift, as there may be columns that change in the files. Additionally, select Infer drifted column types to allow auto-detection of drifted column types. These options are illustrated in Figure 12-10.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig10_HTML.jpg
Figure 12-10

ADF source data flow settings

Parameterize the year 2016 so that you can maintain this value outside of a hard-coded path. To add parameters, click the white space in the Mapping Data Flows canvas, and then add the desired parameters shown in Figure 12-11.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig11_HTML.jpg
Figure 12-11

ADF source dataset parameters

After the parameter is added, return to the Source Options tab. Under Source Options, add the path to your 2016 sales folder in Wildcard paths. This setting will override the folder path set in the dataset, starting at the container root. The ** allows for recursive directory nesting. Lastly, specify that all text files are needed by using *.txt. These Source Options settings are shown in Figure 12-12.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig12_HTML.jpg
Figure 12-12

ADF source options

Here is the code that has been added to the Wildcard paths field within the Source Options tab in Figure 12-12:
'DataLakeCleansing/raw/sales/'+$Year+'/**/*.txt'
In the Projection tab shown in Figure 12-13, verify and alter the column schema.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig13_HTML.jpg
Figure 12-13

ADF data flow source projection

Next, add a Select schema modifier, shown in Figure 12-14, to prune the columns down to only those that are needed. To achieve this, click the + icon next to the source activity and add the Select schema modifier. Notice that there are many other options available for transforming the data and schema.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig14_HTML.jpg
Figure 12-14

ADF Mapping Data Flows “Select” modifier activity

Also select the options to skip duplicate input and output columns, as shown in Figure 12-15.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig15_HTML.jpg
Figure 12-15

ADF data flow “Select Settings” to skip duplicates

Next, add a Derived Column schema modifier to add two new columns, shown in Figure 12-16:
  1. Order month number, based on OrderDate
  2. Year number, also based on OrderDate
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig16_HTML.jpg
Figure 12-16

ADF data flow derived columns

Derived columns are great for data cleansing through the power of expressions.
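As a minimal sketch, the two derived columns above could be defined with expressions such as the following, assuming the source column is named OrderDate and can be cast to a date (toDate() accepts an optional format string if the text uses a different date pattern); the output names MonthNo and YearNo are illustrative:
MonthNo:  month(toDate(OrderDate))
YearNo:   year(toDate(OrderDate))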

Regular Expressions (Regex)

A regular expression is a sequence of characters that specifies a search pattern. Usually, such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings or for input validation. The technique was developed in theoretical computer science and formal language theory. Regex functions are supported in the Mapping Data Flows expression language and can be used within transformations in ADF pipelines. Here are some sample regular expressions and their intended purposes:

  • regexReplace(Address,`[^a-zA-Z\d\s:]`,''): Removes all non-alphanumeric characters (other than whitespace and colons) from the Address field.

  • regexReplace(Address,`[ ]{2}|\.`,' '): Takes the Address field, which contains street address strings, and replaces any occurrence of two consecutive spaces or a period "." with a single space.

  • regexExtract(Address,`^(\d+)`, 1): Uses the street address to extract just the house number.

Soundex

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling.

This can be a great expression to use when working with semi- or unstructured data in a lake to overcome issues with joining and cleansing data without keys.
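As an illustrative sketch, soundex() could be used to fuzzy-match customer names that are spelled differently but sound alike, for example, by adding a derived column on each stream and joining on it, or as a custom cross join condition; the column names CustomerName and CustName are assumptions for this example:
soundex(CustomerName) == soundex(CustName)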

Next, add an Aggregate schema modifier shown in Figure 12-17 to aggregate unit price * quantity.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig17_HTML.png
Figure 12-17

ADF data flow Aggregate schema modifier

Remember to group this aggregation by the following columns shown in Figure 12-18.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig18_HTML.jpg
Figure 12-18

ADF data flow Group by aggregate settings

Figure 12-19 shows where you would need to enter the total calculation.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig19_HTML.jpg
Figure 12-19

ADF data flow Aggregates
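A minimal sketch of the aggregate expression, assuming the source columns are named UnitPrice and Quantity and the output column is named Total (adjust these names to match your files):
Total:  sum(UnitPrice * Quantity)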

Now that the aggregation is complete, add a Window schema modifier, as shown in Figure 12-20.

The Window transformation is where you will define window-based aggregations of columns in your data streams. In the Expression Builder, you can define different types of aggregations that are based on data or time windows (analogous to the SQL OVER clause, with functions such as LEAD, LAG, NTILE, CUME_DIST, RANK, etc.). A new field that includes these aggregations will be generated in your output. You can also include optional group-by fields.

../images/511918_1_En_12_Chapter/511918_1_En_12_Fig20_HTML.png
Figure 12-20

ADF data flow Windows schema modifier

Figure 12-21 demonstrates how to rank the totals by CustomerName.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig21_HTML.jpg
Figure 12-21

ADF Windows Settings for “Over”

Figure 12-22 shows how to sort the totals in descending order so that they are ranked from highest to lowest.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig22_HTML.jpg
Figure 12-22

ADF Windows Settings for “Sort”

Leave the Range by setting as Unbounded, as shown in Figure 12-23.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig23_HTML.jpg
Figure 12-23

ADF Windows Settings for “Range by”

Next, add a dense rank function to the total. Note that there are a few rank and row number functions that fit specific needs and use cases.
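Putting the preceding window settings together, a hedged sketch of the configuration might look like the following, assuming the aggregated column is named Total and the output column is called RankByCustomer (both names are illustrative):
Over:           CustomerName
Sort:           Total, descending
Window column:  RankByCustomer = denseRank()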

RANK Function

The RANK function is used to retrieve ranked rows based on the condition of the ORDER BY clause.
Table 12-1

RANK function sample output

ENAME     SAL    RANK
SMITH      800      1
JAMES      950      2
ADAMS     1100      3
MARTIN    1250      4
WARD      1250      4
TURNER    1500      6

Here is a sample query containing the RANK function:
SELECT ename,
       sal,
       Rank()
         OVER (
           ORDER BY sal) RANK
FROM   emp;

Table 12-1 shows sample output for the RANK function when run against a sample dataset. Notice that the two tied salaries share rank 4 and the next rank, 5, is skipped.

DENSE_RANK Function

The DENSE_RANK function is like the RANK function. However, the DENSE_RANK function does not skip any ranks if there is a tie between the ranks of the preceding records.

Here is a sample query containing the DENSE_RANK function:
SELECT ename,
       sal,
       Dense_rank()
         OVER (
           ORDER BY sal) DEN_RANK
FROM   emp;
Table 12-2 shows sample output for the DENSE_RANK function when run against the same sample dataset. This time, no rank is skipped after the tie.
Table 12-2

DENSE_RANK function sample output

ENAME     SAL    DEN_RANK
SMITH      800          1
JAMES      950          2
ADAMS     1100          3
MARTIN    1250          4
WARD      1250          4
TURNER    1500          5

ROW_NUMBER Function

Unlike the RANK and DENSE_RANK functions, the ROW_NUMBER function simply returns the row numbers of the sorted records, starting with 1. The Window columns settings for the Mapping Data Flows Window transformation in ADF can be seen in Figure 12-24.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig24_HTML.jpg
Figure 12-24

ADF Windows Settings for “Window columns”
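For reference, the Mapping Data Flows expression language exposes equivalent window functions that can be used as Window column expressions; a brief sketch, with illustrative output column names:
SalesRank:       rank()
SalesDenseRank:  denseRank()
SalesRowNumber:  rowNumber()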

Once the window function is complete, add a sink to store the enriched results in the data lake, as shown in Figure 12-25.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig25_HTML.png
Figure 12-25

ADF data flow sink to store enriched data

Use the following sink dataset settings shown in Figure 12-26 and remember to check the option Allow schema drift. Azure Data Factory natively supports flexible schemas that change from execution to execution so that you can build generic data transformation logic without the need to recompile your data flows.

Schema drift is the case where your sources often change metadata. Fields, columns, and types can be added, removed, or changed on the fly. Without handling schema drift, your data flow becomes vulnerable to upstream data source changes. Typical ETL patterns fail when incoming columns and fields change because they tend to be tied to those source names.

To protect against schema drift, it's important for a data flow tool to provide facilities that allow you, as a data engineer, to do the following (see the sketch after this list):
  • Define sources that have mutable field names, data types, values, and sizes.

  • Define transformation parameters that can work with data patterns instead of hard-coded fields and values.

  • Define expressions that understand patterns to match incoming fields, instead of using named fields.
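As an illustrative sketch of that last point, a Derived Column transformation can use a column pattern instead of named fields. The rule below, an assumption for this example, trims every string column that arrives, whatever it is named ($$ refers to the matched column's value):
Matching condition:  type == 'string'
Expression:          trim($$)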

../images/511918_1_En_12_Chapter/511918_1_En_12_Fig26_HTML.jpg
Figure 12-26

ADF data flow sink dataset connection properties

Also, configure the settings to output to a single file and specify the file name, as shown in Figure 12-27.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig27_HTML.jpg
Figure 12-27

ADF data flow sink settings

One of the benefits of Mapping Data Flows is the Data Flow Debug mode, which allows for a preview of the transformed data without having to manually create clusters and run the pipeline.

Remember to turn on Debug mode in order to preview the data, as shown in Figure 12-28, and then turn it off before logging out of Azure Data Factory. Note that Debug mode will auto-terminate after a set period of time.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig28_HTML.jpg
Figure 12-28

ADF data flow Debug trigger

The ranked total results by customer will look similar to the results displayed in Figure 12-29.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig29_HTML.jpg
Figure 12-29

Ranked total results by customer

This next exercise will show you how to split the Total Aggregate into a new branch, as shown in Figure 12-30, to create a new file with a different window function, this time ranking the totals by month and outputting the results to a different file in the data lake.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig30_HTML.png
Figure 12-30

Split Total Aggregate into new branch

Ensure that the new window function's settings are configured as shown in Figure 12-31.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig31_HTML.jpg
Figure 12-31

Settings to rank the totals by month and output the results to file
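A hedged sketch of this second window configuration, assuming the derived month column is named MonthNo and the output column is called RankByMonth (both names are illustrative):
Over:           MonthNo
Sort:           Total, descending
Window column:  RankByMonth = denseRank()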

Once the Mapping Data Flow is complete, it will look like the flow displayed in Figure 12-32.
../images/511918_1_En_12_Chapter/511918_1_En_12_Fig32_HTML.png
Figure 12-32

Full end-to-end ADF Mapping Data Flow

Summary

In this chapter, I demonstrated how to create a dataset pointing to a data lake container. Next, I showed you how to add a parameterized wildcard path to all text files in the 2016 sales folder. I then showed you how to select the pertinent columns, add a few key derived columns, perform aggregations, add window functions, split branches, and export the desired results to enriched text files in the data lake.

The output dataset contains aggregated and descending ranked totals (unit price * quantity) by customer name and by month. All of this was done by utilizing Azure Data Factory’s Mapping Data Flows feature and tested with the Data Flow Debug functionality.
