© Sudhir Rawat and Abhishek Narain 2019
Sudhir Rawat and Abhishek Narain, Understanding Azure Data Factory, https://doi.org/10.1007/978-1-4842-4122-6_5

5. Data Transformation: Part 2

Sudhir Rawat (Bangalore, India) and Abhishek Narain (Shanghai, China)

In the previous chapter, you worked with various activities to build a solution for a range of data analytics requirements. On Microsoft Azure, you will notice that many services are available for storage and compute. There is no right or wrong choice of service; there will always be trade-offs when choosing one service over another. So, as a data professional, you need to be clear about what the business needs now and, more important, what it will need in the future.

Many data professionals want to know whether they can schedule their open source code for data transformation. That is the question this chapter answers by building a pipeline that processes data using Python.

Before diving into the code, let’s take a look at two terms that are frequently used.

Data Warehouse to Modern Data Warehouse

In the traditional sense, a data warehouse is a central repository that consolidates data from different sources such as file systems, customer relationship management (CRM) systems, SQL Server, and so on. The data is cleaned and transformed, values are calculated, and the data is stored for historical analysis. This approach has been in use for decades.

Today, many more data points are generating data, and it is becoming increasingly valuable to utilize it. In addition to on-premises sources, you can also get data from social media platforms and third-party APIs. Therefore, you need a scalable, high-compute system that can retrieve data from these sources and store it in a data warehouse. Figure 5-1 shows a modern data warehouse pattern.
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig1_HTML.jpg
Figure 5-1

Modern data warehouse pattern

ETL vs. ELT

Extract-transform-load (ETL) and extract-load-transform (ELT) are not new to data professionals. Both describe where the transformation happens: before or after loading the data. In ETL, the transformation happens on the source side or on a compute layer in between; in ELT, it happens on the target after the raw data has been loaded. If you are processing Big Data, you may prefer ETL with an Azure Databricks Spark cluster so that the transformation runs in an optimized, low-latency environment. If the destination can also serve as the compute, you can transform the data there instead: because ELT loads the raw data to the target first, you can leverage the power of a system such as Azure SQL Data Warehouse to transform the data in parallel.

As there is no right or wrong way to process data, you may want to look at various parameters such as the data source and destination, latency, scalability, performance, cost, skill set, and so on, to decide on ETL or ELT. This chapter will show how to apply both approaches using Azure Data Factory.

Azure Databricks

Apache Spark is one of the most actively contributed-to projects in the Apache ecosystem. Spark uses an in-memory engine to process Big Data, making it up to 100 times faster than Hadoop MapReduce. The best part of the technology is that on top of the core runtime engine sit various libraries such as Spark SQL, GraphX, Spark Streaming, and the machine learning library MLlib. Bringing this platform on-premises, configuring it, and building a security and collaboration layer is a tedious task. That’s where Azure Databricks comes into the picture: it provides an optimized platform to run Spark jobs. The beauty of Azure Databricks is that it helps set up the environment in less time, streamlines workflows, and provides a collaborative workspace for data scientists, data engineers, and data analysts. Figure 5-2 shows the architecture of Azure Databricks.
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig2_HTML.jpg
Figure 5-2

Azure Databricks architecture on Azure
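
To give a feel for how these libraries sit on top of the Spark engine, here is a minimal PySpark sketch (not tied to the AdventureWorks demo) that builds an in-memory DataFrame and queries it with Spark SQL. In an Azure Databricks notebook, a SparkSession named spark already exists, so the builder line can be dropped.

from pyspark.sql import SparkSession

# Outside Databricks you create the session yourself; in a Databricks
# notebook a SparkSession named `spark` is provided automatically.
spark = SparkSession.builder.appName("spark-libraries-demo").getOrCreate()

# A small in-memory DataFrame standing in for any Big Data source.
sales = spark.createDataFrame(
    [("Bike", 2, 150.0), ("Helmet", 5, 25.0), ("Bike", 1, 150.0)],
    ["product", "qty", "unit_price"])

# Spark SQL is one of the libraries layered on top of the core engine.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT product, SUM(qty * unit_price) AS revenue "
          "FROM sales GROUP BY product").show()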

Here are the benefits of using Azure Databricks:
  • Optimized Spark environment

  • Ease of setup and management

  • Enterprise security through Azure AD integration

  • Collaborative workspace

  • Power BI integration

  • Integration with other Azure services

When should you choose HDInsight Spark versus Azure Databricks? Both are optimized to run Spark workloads. HDInsight provides a platform-as-a-service (PaaS) experience where organizations can run many types of workloads, such as Kafka, Spark, MapReduce, Storm, HBase, and Hive LLAP. Azure Databricks supports only Spark clusters and provides a software-as-a-service (SaaS) experience; it also helps different people within the organization collaborate easily. The pricing of the two services is another consideration.

Build and Implement Use Case

In this use case, you will focus on transforming data using Azure Data Factory. This will be an example of ETL.

AdventureWorks wants to operationalize its data pipeline so that the company can visualize data seamlessly without worrying about the platform. Through this chapter and the previous chapter, each step is broken down separately so that it is easy to understand the solution. If required, you can put together all the blocks and build one ADF pipeline. The following are the services used in this demo:
  • Microsoft Azure Data Factory

  • Microsoft Azure SQL Data Warehouse (DWH)

  • Microsoft Azure Databricks

  • Microsoft Azure Blob Storage

One of the ways to load dimension data is to use Azure Data Factory’s Copy activity to transfer dimension data to Azure SQL Data Warehouse. The other way is to leverage Azure Databricks to do it (as shown in Figure 5-3). You can do various kinds of transformations such as managing slowly changing dimensions (SCD-1, SCD-2, SCD-3) and checking for data anomalies.

For the AdventureWorks retail company, you’ll build the solution shown in Figure 5-3 to move all the dimension data. In this scenario, you are ingesting data from a CSV file.
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig3_HTML.jpg
Figure 5-3

Architecture to feed dimension data

Let’s upload the CSV files to Azure Blob Storage. You can use Azure Data Factory’s Copy Data activity, as discussed in the previous chapter, to move the data to Azure Blob Storage; a short scripted alternative follows Figure 5-4. Figure 5-4 shows the files.
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig4_HTML.jpg
Figure 5-4

Dimension/master data files
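
If you would rather script the upload than use the Copy Data activity, a minimal sketch with the azure-storage-blob (v12) Python package looks like the following. The connection string, container name, and file names are placeholders, not values from the demo.

import os
from azure.storage.blob import BlobServiceClient

# Placeholder values; substitute your own storage account and container.
conn_str = os.environ["AZURE_STORAGE_CONNECTION_STRING"]
container = BlobServiceClient.from_connection_string(conn_str) \
    .get_container_client("dimensiondata")

# Upload each dimension/master CSV file shown in Figure 5-4.
for file_name in ["DimProduct.csv", "DimStore.csv"]:
    with open(file_name, "rb") as data:
        container.upload_blob(name=file_name, data=data, overwrite=True)
    print("Uploaded", file_name)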

Let’s set up Azure SQL Data Warehouse.
  1. 1)

    Log in to the Azure portal.

     
  2. 2)

    Click “Create a resource.”

     
  3. 3)

    Click Databases.

     
  4. 4)
    Click SQL Data Warehouse (see Figure 5-5).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig5_HTML.jpg
    Figure 5-5

    Selecting SQL Data Warehouse

     
  5. 5)

    Provide the database name.

     
  6. 6)

    Select your subscription.

     
  7. 7)

    Create or select a resource group.

     
  8. 8)

    Set “Select source” to “Blank database.”

     
  9. 9)

    Create a new server.

     
  10. 10)

    Select your needed performance level.

     
  11. 11)

    Leave Collation at its default value.

     
  12. 12)
    Click Create (see Figure 5-6).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig6_HTML.jpg
    Figure 5-6

    Providing values to set up the SQL Data Warehouse service

     
  13. 13)

    Once created, click SQL Data Warehouse.

     
  14. 14)
    Click “Query editor (preview),” as shown in Figure 5-7.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig7_HTML.jpg
    Figure 5-7

    Opening the query editor to run queries

     
  15. 15)
    Click Login (see Figure 5-8).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig8_HTML.jpg
    Figure 5-8

    Logging into SQL Data Warehouse

     
  16. 16)
    Enter the following command. The database master key is required so that SQL Data Warehouse can securely store credentials (for example, when Azure Databricks stages data through Azure Blob Storage):
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'adfBook@123';
    GO
     
  17. 17)
    Click Run (see Figure 5-9).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig9_HTML.jpg
    Figure 5-9

    Executing scripts in the query editor

     
Let’s set up the Azure Databricks workspace.
  1. 1)

    Click “Create a resource.”

     
  2. 2)

    Click Analytics.

     
  3. 3)
    Click Azure Databricks (see Figure 5-10).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig10_HTML.jpg
    Figure 5-10

    Selecting the Azure Databricks service

     
  4. 4)

    Provide the workspace name.

     
  5. 5)

    Select your subscription.

     
  6. 6)

    Create or select a resource group.

     
  7. 7)

    Select your location.

     
  8. 8)

    Select the needed pricing tier. For this demonstration, let’s select Standard.

     
  9. 9)
    Click Create (see Figure 5-11).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig11_HTML.jpg
    Figure 5-11

    Providing values to set up Azure Databricks

     
  10. 10)

    Click Azure Databricks Service in the Azure dashboard.

     
  11. 11)
    Click Launch Workspace (see Figure 5-12).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig12_HTML.jpg
    Figure 5-12

    Azure Databricks’ Launch Workspace option

     
  12. 12)
    Click New Notebook (see Figure 5-13).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig13_HTML.png
    Figure 5-13

    Creating a new notebook

     
  13. 13)

    Provide the name.

     
  14. 14)
    For Language, select Scala (see Figure 5-14).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig14_HTML.jpg
    Figure 5-14

    Creating a notebook

     
  15. 15)

    Click Create.

     
  16. 16)
    Once the notebook is created, paste the code shown in Figure 5-15, Figure 5-16, Figure 5-17, and Figure 5-18 into the notebook. (A rough PySpark sketch of the same pattern follows these steps.)
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig15_HTML.jpg
    Figure 5-15

    Azure Databricks Scala code

    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig16_HTML.jpg
    Figure 5-16

    Azure Databricks Scala code, continued

    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig17_HTML.jpg
    Figure 5-17

    Azure Databricks Scala code, continued

    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig18_HTML.jpg
    Figure 5-18

    Azure Databricks Scala code, continued

     
Let’s set up Azure Data Factory.
  1. 1)

    Switch to the Azure Data Factory Author & Monitor UI.

     
  2. 2)
    Drag and drop a Notebook activity onto the designer (see Figure 5-19).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig19_HTML.jpg
    Figure 5-19

    Setting up an activity in Azure Data Factory

     
  3. 3)
    Name the activity (see Figure 5-20).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig20_HTML.jpg
    Figure 5-20

    Setting up an activity

     
  4. 4)

    Click the Azure Databricks tab.

     
  5. 5)
    Click +New (see Figure 5-21).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig21_HTML.jpg
    Figure 5-21

    Creating a Databricks linked service

     
  6. 6)

    Provide the name and add a description.

     
  7. 7)

    Leave the default value for “Connect via integration runtime.”

     
  8. 8)

    Select From Azure Subscription for “Account selection method.”

     
  9. 9)

    Select your Azure subscription.

     
  10. 10)

    Select the Databricks workspace created earlier for “Databricks workspace.”

     
  11. 11)

    Select “New job cluster” for “Select cluster.”

     
  12. 12)

    Domain/Region will populate automatically. The value was set when you created the Databricks workspace.

     
  13. 13)

    Select the access token.

     
  14. 14)

    For “Access token,” click the text box; it will take you to the Azure Databricks account to generate a token. Copy the token and paste it here.

     
  15. 15)

    Select the cluster node type.

     
  16. 16)

    Select 4.1 (which includes Apache Spark 2.3.0 and Scala 2.11) for “Cluster version.”

     
  17. 17)
    Select 1 for Workers (see Figure 5-22).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig22_HTML.jpg
    Figure 5-22

    Setting up a linked service

     
  18. 18)
    Click “Test connection” (see Figure 5-23).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig23_HTML.jpg
    Figure 5-23

    Verifying a connection

     
  19. 19)

    Click Finish.

     
  20. 20)

    Click Settings.

     
  21. 21)
    Provide the notebook path. You can get this path from the Azure Databricks workspace (see Figure 5-24).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig24_HTML.jpg
    Figure 5-24

    Setting the notebook path

     
  22. 22)
    Click Publish All (see Figure 5-25).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig25_HTML.jpg
    Figure 5-25

    Setting up the activity

     
  23. 23)
    Click Trigger and then Trigger Now (see Figure 5-26).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig26_HTML.jpg
    Figure 5-26

    Running an Azure Data Factory pipeline

     
  24. 24)

    Click Finish.

     
  25. 25)
    Click the Monitor option to monitor the progress (see Figure 5-27).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig27_HTML.jpg
    Figure 5-27

    Pipeline progress

     
  26. 26)
    Once the pipeline has executed successfully, switch to the Azure SQL Data Warehouse query editor to view the tables and data. Let’s first see the error records (see Figure 5-28).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig28_HTML.png
    Figure 5-28

    Querying SQL Data Warehouse to check error data

     
Let’s query the product dimension table (see Figure 5-29).
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig29_HTML.jpg
Figure 5-29

Querying SQL Data Warehouse to verify dimension data

Let’s add sales data to Azure SQL Data Warehouse. This chapter doesn’t cover how to delete files from Azure Blob Storage once they’re processed successfully; we’ll cover that in a later chapter. Figure 5-30 shows the architecture you’ll build.
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig30_HTML.jpg
Figure 5-30

Architecture to load sales data

  1. 1)
    Switch to the Azure SQL Data Warehouse query editor to create a fact table using the following script (see Figure 5-31):
    CREATE TABLE [dbo].[FactStoreSales](
    [SalesOrderID] [int] NULL,
    [StoreId] [int] NULL,
    [OrderDate] [date] NULL,
    [SubTotal] [decimal](18, 2) NULL,
    [Taxperc] [int] NULL,
    [TaxAmt] [decimal](18, 2) NULL,
    [Freightperc] [int] NULL,
    [Freight] [decimal](18, 2) NULL,
    [TotalDue] [decimal](18, 2) NULL,
    [SalesOrderDetailID] [int] NULL,
    [ProductKey] [bigint] NULL,
    [OrderQty] [int] NULL,
    [UnitPrice] [decimal](18, 2) NULL,
    [UnitPriceDiscount] [decimal](18, 2) NULL,
    [LineTotal] [decimal](18, 2) NULL
     )
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig31_HTML.png
    Figure 5-31

    Building the fact table

     
  2. 2)
    Upload the sales data to Azure Blob Storage (see Figure 5-32).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig32_HTML.jpg
    Figure 5-32

    Sales data in CSV format

     
  3. 3)
    Figure 5-33 shows the Databricks script that will be executed from Azure Data Factory. Create a Python notebook in your Azure Databricks workspace and put the code there. (A rough sketch of this kind of notebook follows these steps.)
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig33_HTML.jpg
    Figure 5-33

    Azure Databricks Python code

     
  4. 4)

    Switch to the Azure Data Factory Author & Monitor UI.

     
  5. 5)
    Drag and drop a Notebook activity, as shown in Figure 5-34.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig34_HTML.jpg
    Figure 5-34

    Building the Azure data pipeline

     
  6. 6)

    On the General tab, provide a name and description for the activity.

     
  7. 7)

    On the Azure Databricks tab, create or use the existing Databricks linked service.

     
  8. 8)
    If you are creating a new Databricks linked service, provide the name, add a description, select your subscription, provide the Databricks workspace, generate a token if you don’t have one, and add the cluster details, as shown in Figure 5-35.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig35_HTML.jpg
    Figure 5-35

    Parameter selection for the Azure Databricks linked service

     
  9. 9)

    Click Finish.

     
  10. 10)
    On the Settings tab, provide the notebook path, as shown in Figure 5-36.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig36_HTML.jpg
    Figure 5-36

    Setting a notebook path

     
  11. 11)
    Drag and drop a Copy Data activity and connect it with the Notebook activity, as shown in Figure 5-37.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig37_HTML.jpg
    Figure 5-37

    Setting up the Azure Data Factory pipeline

     
  12. 12)
    On the Source tab, provide the link to the location of the raw files (see Figure 5-38).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig38_HTML.jpg
    Figure 5-38

    Setting a connection to Azure Blob Storage

     
  13. 13)

    Similarly, provide a linked service to the Azure Blob Storage location where you want to store the archived files.

     
  14. 14)

    Click Publish All.

     
  15. 15)

    Click Trigger and then Trigger Now.

     
  16. 16)

    Click Finish.

     
  17. 17)

    Click the Monitor icon on the left side to monitor the pipeline execution.

     
  18. 18)
    Once completed successfully, you can query the data in Azure SQL Data Warehouse (see Figure 5-39).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig39_HTML.jpg
    Figure 5-39

    Querying the fact table in SQL Data Warehouse

     
  19. 19)
    Click Azure Blob Storage to see the files copied in Azure Blob Storage (see Figure 5-40).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig40_HTML.jpg
    Figure 5-40

    Verifying the file movement

     

Stored Procedure

Running a stored procedure is another way to transform data for deeper insights and predictions. The benefit of running a stored procedure is that it provides compute near the data, which means the data doesn’t need to travel for processing. You can run a Stored Procedure activity against Azure SQL Database, Azure SQL Data Warehouse, and a SQL Server database running on-premises or in a virtual machine (VM). In the case of on-premises storage, you need to install the self-hosted integration runtime. Organizations use stored procedures to clean data and move it from a staging database to the production database. This is a classic example of ELT. Since the stored procedure runs on the database server, you’ll need to validate whether executing this heavy-lifting job causes any performance issues there.

Let’s say AdventureWorks wants to evaluate how to transform data using the ELT methodology. In this scenario, assume the data is available in Azure Blob Storage (in CSV format). Using Azure Data Factory, you will move the data to Azure SQL Database and then run a stored procedure to clean it (primarily removing duplicate records).

In Azure SQL Database, there are three tables. All_Sales_Records_Raw is the table you will use to load the raw data without doing any cleaning. All_Sales_Records_Production holds all the good/cleaned data, and All_Sales_Records_ERROR holds all the records that have errors. Let’s get started building this for AdventureWorks.
  1. 1)

    Log in to the Azure portal.

     
  2. 2)

    Click “Create a resource.”

     
  3. 3)

    Click Databases.

     
  4. 4)
    Click SQL Database (see Figure 5-41).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig41_HTML.jpg
    Figure 5-41

    Creating a SQL database

     
  5. 5)
    Fill in the following details and click Create to set up the new Azure SQL Server instance (see Figure 5-42).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig42_HTML.png
    Figure 5-42

    Inputting values to set up Azure SQL Database

     
  6. 6)
    Let’s use SQL Server Management Studio (SSMS) to connect to the Azure SQL server, or you can use the query editor (see Figure 5-43).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig43_HTML.jpg
    Figure 5-43

    Switching to the query editor

     
  7. 7)
    If you plan to use SSMS, add your machine’s IP address to the Azure SQL Server firewall so that you can access the server from your computer (SSMS), as shown in Figure 5-44.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig44_HTML.jpg
    Figure 5-44

    Accessing Azure SQL Server from the client tool

     
  8. 8)
    Create the SQL Server tables and stored procedure using the scripts provided. Figure 5-45 shows the structure after running the table and stored procedure scripts.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig45_HTML.jpg
    Figure 5-45

    Artifact in Azure SQL Database

     
  9. 9)
    Upload the data file (provided) to Azure Blob Storage (see Figure 5-46).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig46_HTML.jpg
    Figure 5-46

    Data file in Azure Blob Storage

     
Let’s start building an Azure Data Factory pipeline.
  1. 1)

    Switch to the Azure Data Factory Author & Monitor UI.

     
  2. 2)
    Let’s create a SQL and Azure Blob Storage connection. Click Connections (see Figure 5-47).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig47_HTML.jpg
    Figure 5-47

    Setting up a connection

     
  3. 3)
    Click +New (see Figure 5-48).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig48_HTML.jpg
    Figure 5-48

    Creating new linked services

     
  4. 4)
    Select Azure Blob Storage (see Figure 5-49).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig49_HTML.jpg
    Figure 5-49

    Azure Blob Storage option

     
  5. 5)

    Click Continue.

     
  6. 6)
    Provide information about the storage where you uploaded your data file (see Figure 5-50).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig50_HTML.jpg
    Figure 5-50

    Azure Blob Storage linked service options

     
  7. 7)

    Click Finish.

     
  8. 8)

    Let’s create a SQL connection. Click +New.

     
  9. 9)
    Select Azure SQL Database (see Figure 5-51).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig51_HTML.jpg
    Figure 5-51

    Azure SQL Database option

     
  10. 10)

    Click Continue.

     
  11. 11)
    Provide information about the Azure SQL Server instance created in earlier steps (see Figure 5-52).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig52_HTML.jpg
    Figure 5-52

    Setting up the Azure SQL Database linked service

     
  12. 12)

    Click Finish.

     
Finally, you have two connections (see Figure 5-53).
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig53_HTML.jpg
Figure 5-53

List of available linked services

  1. 13)
    Let’s create datasets. Click + and then Dataset (see Figure 5-54).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig54_HTML.jpg
    Figure 5-54

    Dataset option

     
  2. 14)
    Click Azure Blob Storage (see Figure 5-55).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig55_HTML.jpg
    Figure 5-55

    Azure Blob Storage option

     
  3. 15)

    Click Finish.

     
  4. 16)
    On the Connection tab, select the container and file name that you uploaded earlier (see Figure 5-56).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig56_HTML.jpg
    Figure 5-56

    Connecting to Azure Blob Storage

     
  5. 17)

    Select column names in the first row.

     
Let’s create another dataset on top of the SQL connection created earlier.
  1. 18)

    Click + and then Dataset.

     
  2. 19)

    Select Azure SQL Database.

     
  3. 20)

    Click Finish.

     
  4. 21)
    On the Connection tab, select the Azure SQL Server connection created in the previous step. For Table, select All_Sales_Records_Raw (see Figure 5-57).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig57_HTML.jpg
    Figure 5-57

    Inputting values to set up Azure SQL Database

     
  5. 22)

    Click Publish ALL.

     
  6. 23)

    Click + and then Pipeline.

     
  7. 24)
    Drag and drop the Copy Data activity, as shown in Figure 5-58.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig58_HTML.jpg
    Figure 5-58

    Copy Data activity option

     
  8. 25)
    On the Source tab, select the Azure Blob Storage dataset created earlier (see Figure 5-59).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig59_HTML.jpg
    Figure 5-59

    Source selection for the Copy activity

     
  9. 26)
    On the Sink tab, select the Azure SQL dataset created earlier (see Figure 5-60).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig60_HTML.jpg
    Figure 5-60

    Sink selection for the Copy activity

     
  10. 27)
    On the Mapping tab, click Import Schemas. Since the column names are the same on the source and target, the mapping is done automatically (see Figure 5-61).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig61_HTML.jpg
    Figure 5-61

    Field mapping

     
  11. 28)
    Drag and drop the Stored Procedure activity (see Figure 5-62).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig62_HTML.jpg
    Figure 5-62

    Stored Procedure activity

     
  12. 29)
    Select the Copy Data activity, click the “Add activity on:” option, select Success, and then drag the arrow to the Stored Procedure activity. This means the Stored Procedure activity runs only if the Copy Data activity completes successfully (see Figure 5-63).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig63_HTML.jpg
    Figure 5-63

    Setting up a link between two activities

     
  13. 30)
    Select Stored Procedure (see Figure 5-64).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig64_HTML.jpg
    Figure 5-64

    Setting values for the stored procedure

     
  14. 31)

    On the SQL Account tab, select “Azure SQL Database connection” for “Linked service.”

     
  15. 32)
    On the Stored Procedure tab, select the stored procedure (adfstoredprocactivity) created earlier (see Figure 5-65).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig65_HTML.jpg
    Figure 5-65

    Selecting the stored procedure

     
  16. 33)

    Click Publish All.

     
  17. 34)

    Click Trigger and then Trigger Now.

     
  18. 35)

    Click Finish.

     
  19. 36)
    Select the Monitor tab on the left side (see Figure 5-66).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig66_HTML.jpg
    Figure 5-66

    Monitoring the ADF pipeline

     
  20. 37)
    Once the pipeline has executed successfully, query the database (see Figure 5-67).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig67_HTML.jpg
    Figure 5-67

    SQL query to validate the transformation

     

Custom Activity

So far, you have seen various activities that can be used in Azure Data Factory for data transformation. Why would you need a custom activity?

You won’t always go with the built-in activities to transform data. There are many scenarios where developers want to add their own logic, written in a programming language, for transformation. For example, you might want to read a document, extract specific information, and store it in a database; call an API to retrieve data and store the results; or apply some other customization. These are not built-in tasks. In a nutshell, if Azure Data Factory doesn’t support the transformation you are looking for, you can use a Custom activity to write your own.

Azure Data Factory uses the Azure Batch service to let developers run their own code on either operating system (Windows or Linux). Configuring the scaling features of Azure Batch provides the scalability that the enterprise wants (see Figure 5-68).
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig68_HTML.jpg
Figure 5-68

Reference architecture

Let’s set up a Twitter application to obtain the credentials needed to make API calls that retrieve tweets for a specific hashtag.
  1. 1)

    Go to the Twitter application management site and sign in with your Twitter account.

     
  2. 2)

    Click Create New App.

     
  3. 3)
    Fill out the information (see Figure 5-69).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig69_HTML.jpg
    Figure 5-69

    Setting up a Twitter application

     
  4. 4)

    Click Create New Application.

     
  5. 5)
    Once the application is created, go to Keys and Access Tokens (see Figure 5-70).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig70_HTML.jpg
    Figure 5-70

    Application settings

     
  6. 6)

    Click “Create my access token” and save the token to use in a later step.

     
Let’s set up the Azure Active Directory app to get a token and access Azure Key Vault.
  1. 1)

    Log in to the Azure portal.

     
  2. 2)

    Click Azure Active Directory on the left side.

     
  3. 3)

    Click “App registrations.”

     
  4. 4)
    Click “New application registration” (see Figure 5-71).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig71_HTML.jpg
    Figure 5-71

    Azure AD app registration option

     
  5. 5)

    Provide a name.

     
  6. 6)

    Select the application type; here select “Web app /API.”

     
  7. 7)
    Provide the sign-on URL. This does not need to be a site that exists (you can put http://test1.adventureworks.com ), as shown in Figure 5-72.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig72_HTML.jpg
    Figure 5-72

    AD app registration values

     
  8. 8)

    Click Create.

     
  9. 9)

    Once the app is registered, click it.

     
  10. 10)
    Click Settings (see Figure 5-73).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig73_HTML.jpg
    Figure 5-73

    Registered app options

     
  11. 11)
    Click Keys, and in the Passwords section provide a description and an expiration date. Click Save, and the key value will be shown (see Figure 5-74). Copy it into a notepad; you won’t be able to retrieve it after you leave this blade.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig74_HTML.jpg
    Figure 5-74

    Setting app keys

     
Let’s set up Azure Key Vault.
  1. 1)
    Click “Create a resource” and search for Azure key vault (see Figure 5-75).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig75_HTML.jpg
    Figure 5-75

    Azure Key Vault

     
  2. 2)

    Select Key Vault and click Create.

     
  3. 3)
    Enter a name, select your subscription, enter or create a resource group, select the right location, and leave the defaults for “Pricing tier,” “Access Policies,” and “Virtual Network Access (preview),” as shown in Figure 5-76.
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig76_HTML.jpg
    Figure 5-76

    Inputting values to set up Azure Key Vault

     
  4. 4)

    Click Create.

     
Let’s set up the Azure Cosmos DB account to store tweets.
  1. 1)
    Click “Create a resource,” click Databases, and then click Azure Cosmos DB (Figure 5-77).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig77_HTML.jpg
    Figure 5-77

    Setting up Azure Cosmos DB

     
  2. 2)
    Provide an ID and the API, select your subscription, select or create a resource group, select the right location, leave the other settings at the defaults, and click Create (see Figure 5-78).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig78_HTML.png
    Figure 5-78

    Inputting values to set up Azure Cosmos Database

     
  3. 3)
    At the end, you will see three services added in your Azure subscription (see Figure 5-79).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig79_HTML.jpg
    Figure 5-79

    Services set up for this demo so far

     
  4. 4)

    Click Azure Cosmos DB Account (created in an earlier step).

     
  5. 5)
    Click Data Explorer and then New Database (see Figure 5-80).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig80_HTML.jpg
    Figure 5-80

    Creating a new Azure Cosmos DB database

     
  6. 6)
    Provide a database ID and click OK (see Figure 5-81).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig81_HTML.png
    Figure 5-81

    Inputting values to set up the Azure Cosmos DB database

     
  7. 7)
    Click New Collection (see Figure 5-82).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig82_HTML.png
    Figure 5-82

    Inputting values to set up a new collection

     
  8. 8)

    Select “Use existing” and choose the database ID created in the previous step.

     
  9. 9)

    Provide the collection ID.

     
  10. 10)

    Select Fixed (10 GB) as the storage capacity.

     
  11. 11)

    Click OK.

     
Let’s set up an Azure Batch service.
  1. 1)

    Click “Create a resource.”

     
  2. 2)

    Click Compute.

     
  3. 3)
    Click Batch Service (see Figure 5-83).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig83_HTML.jpg
    Figure 5-83

    Setting up the Azure Batch service

     
  4. 4)

    Provide an account name.

     
  5. 5)

    Select your subscription.

     
  6. 6)

    Select or create a new resource group.

     
  7. 7)

    Select the right location.

     
  8. 8)

    Select or create a new storage account.

     
  9. 9)
    Select “Batch service” for “Pool allocation mode” (see Figure 5-84).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig84_HTML.jpg
    Figure 5-84

    Setting up a new Azure Blob Storage account

     
  10. 10)

    Click Create.

     
  11. 11)
    Once the Azure Batch services are set up, you will see services shown in Figure 5-85 on the Azure dashboard (if you choose to pin to dashboard).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig85_HTML.jpg
    Figure 5-85

    Services set up for this demo

     
By default, there is no pool (no compute nodes) available, so let’s add one. In this demo, let’s use a custom Windows image built from a Windows Server 2008 R2 SP1 virtual machine. Make sure to install the required software and packages in the virtual machine before capturing the image. Table 5-1 shows the prerequisites.
Table 5-1

Prerequisites

Package Name                Description

Python 2.7                  Install Python 2.7 from https://www.python.org/downloads/ .

python-pip                  Install pip so that the remaining Python packages can be installed.

pip install tweepy          Connects to Twitter.

pip install pydocumentdb    Accesses Azure Cosmos DB.

pip install azure-keyvault  Accesses Azure Key Vault.

  1. 1)

    Click “Azure batch service.”

     
  2. 2)

    Click Pools.

     
  3. 3)
    Click Add (see Figure 5-86).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig86_HTML.jpg
    Figure 5-86

    Setting up pools

     
  4. 4)
    Add the information in Table 5-2 (see Figure 5-87).
    Table 5-2

    Setting Values to Set Up the Pool

    Pool ID: The name of the pool.

    Display name: A description (optional).

    Image Type: Custom image (Linux/Windows).

    Custom VM Image: Select the custom image created earlier.

    Operating System: Windows.

    OS Distribution: WindowsServer.

    OS version: Microsoft Windows Server 2008 R2 SP1 (latest).

    Caching: None.

    Container configuration: None.

    Metered licenses for rendering: Don’t change; you’re not doing any rendering for this demo.

    Node pricing tier: Standard A1 (1 core, 1.8 GB). This demo is not a compute-intensive job, so basic compute works; however, you can go for higher compute.

    Mode: Fixed. The service also offers an “Auto scale” option, which increases or decreases compute based on a formula, so the organization doesn’t have to worry about scaling out and scaling in.

    Target dedicated nodes: Set it to 1.

    Low priority nodes: 0. Low-priority nodes reduce compute cost by taking advantage of surplus capacity in Azure; use them for short-running or batch jobs. The trade-off is that the VMs may not be available for allocation, or may be preempted at any time, depending on the available capacity.

    Resize Timeout: 15 minutes. This is how long the service waits when resizing the pool.

    Start task: Disabled. A start task runs first whenever a VM is added to the pool.

    Max tasks per node: 1. This is the maximum number of tasks that can run on a VM; be cautious about the VM size you choose.

    User accounts: Default.

    Task scheduling policy: Pack. This defines how tasks are distributed between VMs in the pool.

    Inter-node communication: No.

    Application Package: 0. Use this if your application requires packages in order to run successfully.

    Certificates: 0.

    Pool endpoint configuration: Default.

    Network configuration: Default. Not required for this demo.

    Subnet: Default. Not required for this demo.

    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig87_HTML.jpg
    Figure 5-87

    Inputting values to set up a pool

     
  5. 5)

    Click OK.

     
  6. 6)
    Once the pool is created, click Pool (created in the previous step) and then Nodes to make sure a VM is created (see Figure 5-88).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig88_HTML.jpg
    Figure 5-88

    Available nodes

     
Let’s store the credentials in Azure Key Vault and give the Azure AD app access to them.
  1. 1)

    Switch to Azure Key Vault.

     
  2. 2)
    Add all the secrets, such as the Azure Cosmos DB details and the Twitter API details, to Azure Key Vault. Switch to the respective services to capture the keys (see Figure 5-89).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig89_HTML.jpg
    Figure 5-89

    Setting secrets

     
  3. 3)

    In Azure Key Vault, click “Access policies.”

     
  4. 4)
    Click +Add New (see Figure 5-90).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig90_HTML.jpg
    Figure 5-90

    Setting access policies

     
  5. 5)

    Select Principal. This is the application registered in Azure Active Directory.

     
  6. 6)
    Select Get for “Secret permission” (see Figure 5-91).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig91_HTML.jpg
    Figure 5-91

    Adding an access policy

     
  7. 7)

    Click OK.

     
  8. 8)
    Click Save (see Figure 5-92).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig92_HTML.jpg
    Figure 5-92

    Access policy defined

     

Now the environment is set.

Let’s look at the Python code. Upload the code to Azure Blob Storage (see Figure 5-93 and Figure 5-94).
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig93_HTML.jpg
Figure 5-93

Python code

../images/468833_1_En_5_Chapter/468833_1_En_5_Fig94_HTML.jpg
Figure 5-94

Python code, continued
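
The code in Figures 5-93 and 5-94 is not reproduced here. The sketch below shows roughly how such a script can be assembled from the packages in Table 5-1 (tweepy, azure-keyvault, pydocumentdb), assuming the older releases of those packages that were current when this book was written. The command-line arguments correspond to Table 5-3, while the vault URL, secret names, Azure AD application ID/key/tenant, and Cosmos DB database and collection IDs are placeholders, not the authors’ exact values.

import sys
import tweepy
import pydocumentdb.document_client as document_client
from azure.common.credentials import ServicePrincipalCredentials
from azure.keyvault import KeyVaultClient

# Arguments supplied by the ADF Custom activity (see Table 5-3):
# getTweets.py <hashtag> <since-date> <run-id>
hashtag, since_date, run_id = sys.argv[1], sys.argv[2], sys.argv[3]

# Authenticate to Azure Key Vault with the Azure AD app registered earlier.
credentials = ServicePrincipalCredentials(
    client_id="<aad-application-id>",
    secret="<aad-application-key>",
    tenant="<aad-tenant-id>")
kv = KeyVaultClient(credentials)
vault_url = "https://<your-key-vault>.vault.azure.net"

def secret(name):
    # An empty version string returns the latest version of the secret.
    return kv.get_secret(vault_url, name, "").value

# Twitter credentials stored as Key Vault secrets (Figure 5-89).
auth = tweepy.OAuthHandler(secret("twitter-consumer-key"),
                           secret("twitter-consumer-secret"))
auth.set_access_token(secret("twitter-access-token"),
                      secret("twitter-access-token-secret"))
api = tweepy.API(auth)

# Cosmos DB connection details also come from Key Vault.
cosmos = document_client.DocumentClient(
    secret("cosmosdb-endpoint"), {"masterKey": secret("cosmosdb-key")})
collection_link = "dbs/<database-id>/colls/<collection-id>"

# Get tweets for the given hashtag since the given date and store each
# one as a document in the Cosmos DB collection.
query = "#" + hashtag + " since:" + since_date.replace("/", "-")
for tweet in tweepy.Cursor(api.search, q=query).items(100):
    cosmos.CreateDocument(collection_link, {
        "id": str(tweet.id),
        "runId": run_id,
        "createdAt": str(tweet.created_at),
        "user": tweet.user.screen_name,
        "text": tweet.text})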

Let’s set up Azure Data Factory.
  1. 1)

    Switch to the Azure Data Factory Author & Monitor UI.

     
  2. 2)
    Drag and drop a Custom activity onto the designer (see Figure 5-95).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig95_HTML.jpg
    Figure 5-95

    Custom activity

     
  3. 3)
    Provide a name and add a description to the activity (see Figure 5-96).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig96_HTML.jpg
    Figure 5-96

    Setting up a Custom activity

     
  4. 4)
    On the Azure Batch tab, click +New (see Figure 5-97).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig97_HTML.jpg
    Figure 5-97

    Setting up an Azure Batch linked service

     
  5. 5)
    Provide the Azure Batch account details. Retrieve all the information from the Azure Batch account services (see Figure 5-98).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig98_HTML.jpg
    Figure 5-98

    Options to set up a new linked service

     
  6. 6)

    Click Finish.

     
  7. 7)

    Click Settings.

     
  8. 8)
    Provide the command in the Command text area (see Table 5-3). Putting the values together, the command text looks like, for example, python getTweets.py Azure 2018/07/28 todayrunid.
    Table 5-3

    Values for the Custom Activity

    getTweets.py: The name of the Python program to execute.

    Azure: The hashtag for which the application gets tweets.

    2018/07/28: The “read since” date; the application captures tweets from this date onward.

    todayrunid: Any text value, passed for testing or debugging purposes.

     
Figure 5-99 shows the screen after setting the values.
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig99_HTML.jpg
Figure 5-99

Setting values for the Custom activity

  1. 9)

    Select “Resource linked service.” This is the Azure storage location where the Python code is uploaded.

     
  2. 10)

    Select “Folder path.” This is the folder location where the Python code is uploaded.

     
  3. 11)

    Click Publish All.

     
  4. 12)

    Click Trigger and then Trigger Now.

     
  5. 13)

    Click Finish.

     
  6. 14)
    Go to the Monitoring page and wait until the pipeline gets executed successfully (see Figure 5-100).
    ../images/468833_1_En_5_Chapter/468833_1_En_5_Fig100_HTML.jpg
    Figure 5-100

    Monitoring pipeline progress

     
Finally, after successful completion, the tweets get stored in Azure Cosmos DB (see Figure 5-101).
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig101_HTML.jpg
Figure 5-101

Azure Cosmos DB

If you encounter any errors, look at the Azure Batch service logs for the specific job to get insight into the type of error encountered (see Figure 5-102).
../images/468833_1_En_5_Chapter/468833_1_En_5_Fig102_HTML.jpg
Figure 5-102

Azure Batch job logs

In this hands-on chapter, you explored the Databricks Notebook, Custom, and Stored Procedure activities to run various workloads. Azure Data Factory lets you build an end-to-end data pipeline, whether it runs on Microsoft or open source platforms.
