When your business needs to move data between cloud providers, Azure Data Factory presents a convenient and robust interface for this task. Microsoft provides connectors to integrate the data factory with multiple third-party services, including Amazon Web Services (AWS) and Google Cloud. In this chapter, we will walk through several illustrative examples of migrating data from these two cloud providers. In addition, you will learn how to use Azure Data Factory's Custom Activity to work with providers that are not supported by Microsoft's built-in connectors.
In this chapter, we will cover the following recipes:
All recipes in this chapter assume that you have a Microsoft Azure account and an instance of a data factory. Refer to Chapter 1, Getting Started with ADF, for instructions on how to set up your Azure account and create a data factory.
For the recipes in this chapter, you will need accounts with sufficient permissions on third-party services. For recipes 1 and 2, you will need to set up an account with AWS. For recipes 3, 4, and 5, you will need a Google Cloud account. For recipe 6, you will require a Dropbox account.
If you do not have accounts already set up with the aforementioned services, you can do this for free:
In this recipe, you will learn how to copy data from an AWS S3 bucket to an Azure Blob storage container using a data factory.
This recipe requires you to have an AWS account and an S3 bucket. Refer to the Technical requirements section to find out how to set up a new AWS account if you do not have one. Once you have your AWS account set up, go to https://s3.console.aws.amazon.com/s3/ to create a bucket. Upload the sample CSV files from https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook/tree/master/data to your bucket.
Rather than designing the pipeline in the Author tab, we will use the Copy Data Wizard. The Copy Data Wizard will walk you through pipeline creation step by step, and will create and run the pipeline for you:
Your New linked service blade should look like this:
With the source dataset formatted, we are ready to go to the next step.
The Copy Data Wizard is simply an application that assists you in designing a simple data movement pipeline. It presents you with a series of steps, and once you have filled them all in, it creates and publishes the pipeline for you. After you have gone through the steps in this recipe, go to the Author tab and find the pipeline. It consists of a single copy activity that we configured with the Copy Data Wizard.
Azure Data Factory can help you move very large datasets into the Azure ecosystem with speed and efficiency. The key to moving large datasets is data partitioning. The way you partition depends heavily on the nature of your data.
In this recipe, we will illustrate a methodology that utilizes a data partitioning table for moving a large dataset. We will use the public Common Crawl dataset, which contains petabytes of web crawl data from 2008 to the present day and is hosted on the AWS S3 platform. We will only use a small subset of this data for our example, enough to illustrate the power of data factory parallel processing.
In order to access Amazon Web Services, such as an S3 bucket, you need to have proper credentials. These credentials consist of an access key ID (for example, AKFAGOKFOLNN7EXAMPL8) and the secret access key itself (for example, pUgkrUXtPFEer/PO9rbNG/bPxRgiMYEXAMPLEKEY). In this book, we will refer to these credentials as your AWS Account key and Secret.
Even though Common Crawl data is hosted in a public repository, you will still need an AWS account and your own Account key and Secret. Refer to the instructions in the Technical requirements section at the beginning of the chapter on how to obtain these credentials.
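If you want to confirm that your Account key and Secret are valid before you build the pipeline, you can list a few objects under one of the Common Crawl index prefixes. The following is a minimal sketch using the boto3 library (not otherwise required by this recipe; the credentials are placeholders):

import boto3

# Placeholder credentials: substitute your own AWS Account key and Secret.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<your-access-key-id>",
    aws_secret_access_key="<your-secret-access-key>",
    region_name="us-east-1",  # the public commoncrawl bucket lives in us-east-1
)

# List a handful of index files under one of the prefixes used later in this recipe.
response = s3.list_objects_v2(
    Bucket="commoncrawl",
    Prefix="cc-index/collections/CC-MAIN-2019-04/indexes/",
    MaxKeys=5,
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

If the listing succeeds, the same key and secret will work in the Amazon S3 linked service that you create later in this recipe.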
We will need to create a partition table in our SQL database. Download the script to create the partitions from https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook/blob/master/Chapter07/sql-scripts/CreateCommonCrawlPartitionsTable.sql.
You will also need access to an Azure Data Lake Storage Gen2 account where you plan to copy the data. Refer to https://docs.microsoft.com/azure/storage/common/storage-account-create?tabs=azure-portal if you need to create one.
Finally, we will need an Azure SQL database instance and a linked service for it. Refer to the Using parameters and built-in functions recipe in Chapter 2, Orchestration and Control Flow, for detailed instructions on how to create this resource and configure the connection.
Note that moving large datasets can be very expensive. We suggest that you opt for local redundancy when you create your Azure Data Lake account, and delete data that is not required as a cost-saving measure. Once you construct your pipeline and verify that it works as expected, do not run it repeatedly: you might incur considerable charges.
In order to copy the large dataset, we will first create a database table that will define how we partition our data. Then, we will create an outer pipeline, which will read information from the partition table and pass it to a ForEach activity. The ForEach activity will invoke an inner pipeline that will perform the data movement:
CREATE TABLE [dbo].[CommonCrawlPartitions](
[YearAndMonth] [varchar](255) NULL,
[Path] [varchar](255) NULL,
[UpdatedAt] [datetime]
)
INSERT INTO CommonCrawlPartitions (YearAndMonth, Path, UpdatedAt)
VALUES
('01-2019', 'cc-index/collections/CC-MAIN-2019-04/indexes', GetDate()),
('02-2019', 'cc-index/collections/CC-MAIN-2019-09/indexes', GetDate()),
('03-2019', 'cc-index/collections/CC-MAIN-2019-13/indexes', GetDate()),
('04-2019', 'cc-index/collections/CC-MAIN-2019-18/indexes', GetDate()),
('05-2019', 'cc-index/collections/CC-MAIN-2019-22/indexes', GetDate()),
('06-2019', 'cc-index/collections/CC-MAIN-2019-26/indexes', GetDate()),
('07-2019', 'cc-index/collections/CC-MAIN-2019-30/indexes', GetDate()),
('08-2019', 'cc-index/collections/CC-MAIN-2019-35/indexes', GetDate()),
('09-2019', 'cc-index/collections/CC-MAIN-2019-39/indexes', GetDate()),
('10-2019', 'cc-index/collections/CC-MAIN-2019-43/indexes', GetDate()),
('11-2019', 'cc-index/collections/CC-MAIN-2019-47/indexes', GetDate()),
('12-2019', 'cc-index/collections/CC-MAIN-2019-51/indexes', GetDate());
Run the query to create and populate the dbo.CommonCrawlPartitions table.
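If you would rather verify the table from a script than from the query editor, the following minimal sketch uses pyodbc with placeholder server, database, and credential values to confirm that dbo.CommonCrawlPartitions was created and populated:

import pyodbc

# Placeholder connection details for your Azure SQL database.
conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-database>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT YearAndMonth, Path FROM dbo.CommonCrawlPartitions")
    for row in cursor.fetchall():
        print(row.YearAndMonth, row.Path)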
Once we have created the dbo.CommonCrawlPartitions table, we are ready to create the linked services and dataset for our pipeline. Perform the following steps to do so:
Go to your Data Factory instance and open the Author and Monitor interface. Create a dataset to store the data from the partition table. From Factory Resources (on the left), select New Dataset. In the New Dataset blade, select Azure SQL. Select the appropriate subscription and test connection, and then click on Create.
@dataset().DirName
The Connection tab for your dataset configuration window should look similar to the following screenshot:
We now have all the components to design the inner pipeline, the pipeline that will actually copy chunks of data (defined in our partition table) from the S3 Common Crawl dataset into our Azure Storage account:
Your inner pipeline Parameters tab should look similar to the following screenshot:
In the Wildcard Paths section, fill in the value using the dynamic content interface. Click inside the Wildcard folder path text field and then select the Add dynamic content link to bring up the interface. In that interface, scroll down to find the Parameters section and select the Path parameter. Click the Finish button to finalize your choice.
Finally, in the Wildcard file name text field, enter *.gz.
Next, we will build the outer pipeline. The outer pipeline reads the data partitions from the table and invokes the inner pipeline:
select top 2 Path from [dbo].[CommonCrawlPartitions]
Make sure the First Row only checkbox is not checked.
@activity('Get Partitions').output.value
Make sure that the Sequential checkbox is not checked.
Your Execute Pipeline activity configuration settings should look similar to the following screenshot:
In this recipe, we have constructed a pipeline that is capable of moving very large datasets from S3 to an Azure Storage account. For our example, we used 1 year of Common Crawl data, although we only copied over 2 months of this data (enough to validate our approach while saving the considerable cost of moving even more data). We then created the CommonCrawlPartitions table in which we listed our partitions. This table also contained paths to the data locations.
Next, we created the inner and outer pipelines to ingest this data from locations provided in the CommonCrawlPartitions table and copy it over to the Azure Data Lake storage. The key to parallel execution is the design of the outer pipeline: it is the ForEach activity that allows us to execute activities in parallel by allowing us to specify the batch count. Note that there is a limitation in terms of the degree of parallelism. If your batch count is more than 40, it is better to enhance the design by further partitioning the data and having multiple ForEach activities, each executing a subset of parallel sub-pipelines.
One thing to notice is that we have limited our dataset further by specifying the query in the Lookup activity as select top 2 Path from [dbo].[CommonCrawlPartitions]. We can regulate the size of our data pull by tweaking this query and, of course, by adding data to our data partitioning table.
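Once the outer pipeline is published, you can also trigger it and check its status without opening the Monitor tab. The following is a minimal sketch using the azure-identity and azure-mgmt-datafactory packages (neither is required by this recipe); all resource names are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder subscription, resource group, factory, and pipeline names.
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
run = adf_client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>", "<outer-pipeline-name>"
)
status = adf_client.pipeline_runs.get(
    "<resource-group>", "<data-factory-name>", run.run_id
)
print(status.status)  # Queued, InProgress, Succeeded, or Failed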
If you are interested in the Common Crawl dataset, which contains web crawl data from 2008 onward, you can obtain more information at https://commoncrawl.org/the-data/.
Microsoft offers guidance on pricing for Azure Data Factory. We strongly suggest that you familiarize yourself with this information before copying large datasets: https://docs.microsoft.com/azure/data-factory/pricing-concepts.
In this recipe, we will use the built-in Microsoft connectors to copy data from Google Cloud Storage to an Azure Storage account. You will learn how to configure the Google Storage account and grant permissions to allow your data factory to connect and import the files.
For this recipe, you will need to have a Google Cloud account and at least one Google Storage bucket:
In order to import the data, we will first set up the Google Storage account with the correct permissions and then use the built-in data factory connectors to configure the linked services and datasets to connect to both the Google Cloud Storage bucket and the Azure Storage account:
Log in to your Google Cloud Storage account and select the bucket that you want to export to Azure Storage. Select Settings from the menu on the left, and then go to the Interoperability tab in the main blade. In the Interoperability tab, click on the CREATE A KEY FOR A SERVICE ACCOUNT button:
If you have an existing service account, you can view and change its access permissions here: https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your-project-id> (replace <your-project-id> with the correct value for your project).
Select your desired service account and then click on CREATE KEY:
Make a note of your account key and secret; you will need these values later to create a linked service.
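The interoperability key and secret behave like S3-style (HMAC) credentials against the https://storage.googleapis.com endpoint, which is the same mechanism the data factory connector uses with these keys. If you want to verify the key pair before creating the linked service, here is a minimal sketch using boto3 (not required by this recipe; the bucket name is a placeholder):

import boto3

# Point an S3-compatible client at Google Cloud Storage using the HMAC key pair.
gcs = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="<interoperability-access-key>",
    aws_secret_access_key="<interoperability-secret>",
)
response = gcs.list_objects_v2(Bucket="<your-bucket>", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])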
We have set up the access policies on the Google Cloud service account to allow the data factory to perform a data pull. Now we can design the pipeline.
Note
You can also use an Azure Data Lake Storage V2 (ADLS) account for this recipe. If you choose to do that (for example, if you want to reuse the ADLS account from the previous Copying large datasets from S3 to ADLS recipe), you will have to create a linked service and dataset with corresponding connectors. Refer to the previous recipe for detailed instructions on how to do that.
In this recipe, the main challenge is to configure the Google Storage access permissions correctly. The pipeline is simple and follows all the pipeline design principles we have covered in previous recipes.
Google Cloud is a vast and nimble ecosystem. To learn more about Google Cloud Storage and access management on this platform, refer to the current Google documentation at https://cloud.google.com/iam/docs/granting-changing-revoking-access.
In this recipe, we will use Azure Data Factory to import a subset of a public fdic_banks.locations dataset from the Google BigQuery service (a cloud data warehouse) into an Azure Data Lake store. We will write the data into destination storage in Parquet format for convenience.
For this recipe, we assume that you have a Google Cloud account and a project, as well as an Azure account and a Data Lake storage account (ADLS Gen2). The following is a list of additional preparatory work:
Note that these instructions include consent given by the Azure Data Factory user. Make sure that this user (the Data Factory user, the account that is signed in to the Azure portal) has the correct permissions to run BigQuery user jobs. Azure Data Factory will not be able to access data in Google Cloud otherwise. To assign the correct permissions, perform the following steps:
Microsoft provides connectors to both Google BigQuery and Azure Data Lake storage. We will create a simple pipeline with a copy activity utilizing these built-in connectors:
From the Manage tab, select Linked Services and then click on New. In the new Linked Service blade, search for BigQuery and then enter the Project ID, Client ID, Client Secret, and Refresh Token information that you obtained previously (refer to the Getting ready section) in the blade presented. The blade should look similar to the following screenshot:
Test your connection and click on Create to save the linked service.
Configure Source and Sink for the copy activity. In Source, select the GoogleBigQueryDS dataset we created in step 2. Check the Query radio button in the Use Query section, and enter the following query:
select * from bigquery-public-data.fdic_banks.locations LIMIT 3000
In Sink, select the ADLS2 Parquet dataset we created in step 3. Leave all the other options as their default settings.
Run your pipeline in debug mode. Once the run is complete, you will see new Parquet files in your ADLS account. You can use a tool such as http://parquet-viewer-online.com/ to verify that this is the data from the BigQuery dataset.
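If you want to sanity-check the source data independently of the pipeline, a minimal sketch with the google-cloud-bigquery client library (not required by this recipe; it assumes application default credentials for your Google Cloud project) previews a few rows of the same public table:

from google.cloud import bigquery

# Placeholder project ID; authentication uses application default credentials.
client = bigquery.Client(project="<your-project-id>")
sql = "SELECT * FROM `bigquery-public-data.fdic_banks.locations` LIMIT 5"
for row in client.query(sql).result():
    print(dict(row))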
In this recipe, we will import data from the public github_repos dataset in Google BigQuery into Azure Synapse Analytics (formerly Azure SQL Data Warehouse). We will create a SQL pool, create the table to store our imported data, and configure the pipeline to migrate data from a public dataset hosted in Google BigQuery.
To complete this recipe, you will need a Google Cloud project with the BigQuery API enabled. Refer to the Getting ready section in the previous recipe for instructions on how to set those up and obtain your Project ID, Client ID, Client Secret, and Refresh Token fields.
You will also need an instance of an Azure Synapse SQL pool to import the data into. Refer to Chapter 3, Setting Up a Cloud Data Warehouse, for instructions on how to create and configure a SQL pool. Have the login credentials for this SQL pool to hand.
You will also need to create a table in your database to store the data we import. Download the script to create the table from https://github.com/PacktPublishing/Azure-Data-Factory-Cookbook/blob/master/Chapter07/sql-scripts/CreateGithubRepoTable.sql.
We will create a simple pipeline consisting of just one copy activity, which will use the BigQuery connector and Azure Synapse connector (both provided by Microsoft) to migrate the data from one cloud provider to the other:
CREATE SCHEMA github_repo
GO
CREATE TABLE github_repo.files(
repo_name VARCHAR(200) NOT NULL
,ref VARCHAR(17) NULL
,path VARCHAR(200) NULL
,mode INTEGER NULL
,id VARCHAR(64) NOT NULL
,symlink_target VARCHAR(200) NULL
);
This will create a table with an appropriate schema to load the data from BigQuery.
Your dataset connection configuration should look similar to the following screenshot:
Create a linked service to connect to the Azure Synapse SQL pool. Refer to Chapter 3, Setting Up a Cloud Data Warehouse, for instructions on how to obtain credentials and create the linked service. Name your linked service AzureSynapseLinkedService.
select * from bigquery-public-data.github_repos.sample_files limit 30
Important note
In order to configure and test our pipeline, we will limit the import to 30 records. Once we are sure that the pipeline works correctly, we can expand our selection.
Important note
Copy Activity supports three methods for copying data to Azure Synapse Analytics: PolyBase, Copy Command (preview), and Bulk Insert. PolyBase and Copy Command are much more efficient options, although only a limited number of services support them at the time of writing. For our use case (loading data from Google BigQuery), Bulk Insert is the most straightforward option. It will work out of the box; no additional setup is necessary.
In a production environment, where efficiency is of paramount importance, you will want to explore the PolyBase and Copy Command options. References to resources are included in the See also section of this recipe.
select * from bigquery-public-data.github_repos.sample_files
However, this is optional and can be done at your discretion.
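After a pipeline run, it is easy to confirm how many rows landed in the SQL pool. The following is a minimal sketch using pyodbc with placeholder connection details; any SQL client pointed at your Synapse SQL pool will do the same job:

import pyodbc

# Placeholder connection details for the Azure Synapse SQL pool.
conn_str = (
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<your-workspace>.sql.azuresynapse.net,1433;"
    "Database=<your-sql-pool>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)
with pyodbc.connect(conn_str) as conn:
    cursor = conn.cursor()
    cursor.execute("SELECT COUNT(*) FROM github_repo.files")
    print("Rows imported:", cursor.fetchone()[0])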
To learn more about the copy methods supported by the data factory's Azure Synapse connector, refer to the following resources:
As we described in previous chapters, Azure Data Factory provides many out-of-the-box connectors that integrate with various data sources both within the Azure ecosystem and outside of it (such as Amazon Web Services and Google Cloud). Sometimes, however, you need to send reports or other data files to a destination that is not supported by the Microsoft-provided connectors. How can we do that?
In this recipe, we will build a pipeline that exports data from an Azure Blob storage account into a folder in Dropbox. Currently, Microsoft does not offer a preconfigured Dropbox connector. We will use Azure Data Factory's Custom Activity, the Azure Batch service, and a simple Python client to achieve our goals.
This recipe assumes that you have a Dropbox account. If this is not the case, go to www.dropbox.com and sign up for a free account.
Create an Azure Storage account, which will serve as a source storage account. In the recipe, we will refer to this account as batchservicedata.
In this recipe, we will work with Azure Data Factory Custom Activity and a simple Python client application running on Azure Batch service to copy files from an Azure Storage account to Dropbox. We'll begin by setting up a Dropbox app to enable API access, and then we'll create an Azure Batch account and a batch pool to host the Python client. Only after these steps are complete will we design a pipeline to export our data from an Azure Storage account to Dropbox.
In order to access Dropbox folders from the Python client (and ultimately, from the ADF pipeline), we need to create a Dropbox app and generate the authentication token:
Fill in the Create a new app form, as shown in the following screenshot (make sure the name of your app is unique):
Copy and keep the value of the generated token: we will use it later in the Python client.
The next step is to upload our Python client script and the data we intend to copy into the storage account:
Open the script and edit it by replacing the words <enter your generated access token> with the token you generated in step 3.
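For reference, the core of such a script, written with the Dropbox Python SDK, might look like the following minimal sketch. It is not the exact contents of upload_2_dropbox.py, just an illustration of the upload logic, with the access token left as a placeholder:

import glob
import dropbox

ACCESS_TOKEN = "<enter your generated access token>"  # the token from your Dropbox app
dbx = dropbox.Dropbox(ACCESS_TOKEN)

# Upload every .csv file in the working directory to the root of the app folder.
for local_path in glob.glob("*.csv"):
    with open(local_path, "rb") as f:
        dbx.files_upload(
            f.read(),
            "/" + local_path,
            mode=dropbox.files.WriteMode.overwrite,
        )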
Important note
Azure Batch is a service that allows you to utilize the compute power of virtual machines to perform custom operations. In this recipe, we will use the DSVM (Data Science Virtual Machine) configuration for the batch pool. Although it is a more expensive option, this will simplify installation and deployment for our client, because these virtual machines are already configured with Python.
Now we will create a batch account with a pool of DSVMs from an ARM template.
Open the https://portal.azure.com/#create/Microsoft.Template URL.
Go into the pool. You should see that one node is allocated. It usually takes several minutes to start. Wait until it starts and verify that it is in the running or idle state:
We have configured the batch service, created a pool of virtual machines to serve our computing needs, and set up our Python client. All that is left now is to design the ADF pipeline and run it:
Test the connections (with the button on the lower right), and create the linked service.
python upload_2_dropbox.py
Then, expand the Advanced section and select BatchServiceStorageLinkedService as Resource linked service, and the container with the script and files that we will upload to Dropbox as Folder path.
The Settings tab of your custom activity should look similar to the following screenshot:
Important note
Do not forget to clean up your resources in Azure. Batch pools can be expensive, so make sure to delete the pool you created in the previous steps if you do not need it.
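Deleting the pool can also be done programmatically. The following is a minimal sketch using the azure-batch Python SDK with placeholder account details (deleting the pool from the portal's Pools blade works just as well):

from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder batch account name, key, regional URL, and pool ID.
credentials = SharedKeyCredentials("<batch-account-name>", "<batch-account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://<batch-account-name>.<region>.batch.azure.com"
)
client.pool.delete("<pool-id>")  # deletion happens asynchronously on the service side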
This pipeline involved creating and configuring several components: a batch service with a pool of nodes hosting a Python script acting as a Dropbox client, a Dropbox app, and finally, the data factory pipeline with its usual linked services and datasets, which are integral parts of any pipeline. The pipeline itself was simple – just one activity. The complexity lies in the interaction between the components.
Dropbox – a popular file hosting and syncing service – provides a public API, which allows third-party applications to manipulate the contents of a Dropbox account. As a common security practice, it is necessary to configure access and grant appropriate permissions to the application that takes advantage of the API. An authentication token, such as the one we generated in step 3 of the Creating a Dropbox app and enabling API access section, is a common authentication mechanism.
We used this token in the upload_2_dropbox.py Python script. The script acts as a Python client and is based on the example provided by Dropbox (https://github.com/dropbox/dropbox-sdk-python/blob/master/example/back-up-and-restore/backup-and-restore-example.py). Using the Dropbox SDK, the script uploads all the files with the .csv extension from the local directory to the Dropbox account specified by the access token. This is just an example client provided to illustrate the process. For your needs, you may use custom scripts and executables, or fully fledged third-party clients.
We used the Azure Batch service as a compute layer to host our Python client. You can learn more about the Batch service from the resources listed in the See also section. It is a very versatile and flexible resource: there are a multitude of options (virtual machine configurations, the number of nodes per pool, the regions where they are hosted, and so on) that you can tweak based on your business and computing needs. In this recipe, we deployed the batch account and batch pool virtual machines using Azure's ARM template deployment functionality with a predefined template. This powerful technique allowed us to bypass lengthy setup steps and roll out an instance very quickly. The ARM template provided in this recipe contained a specific configuration (a pool with one DSVM node). You will need to consider your business requirements when constructing your own batch pools.
Sometimes, it is useful to examine the logs on your server in order to understand how the application works, or to debug an error. In this recipe, we are using a batch pool for our compute layer, and we can easily access the logs, which will give us insight into our processing. There is an option to log in remotely to the nodes of your batch pool via SSH or a remote desktop, but you can also access the logs from the Azure portal directly:
Microsoft maintains extensive documentation on using Batch Service. To learn more, start at https://docs.microsoft.com/azure/batch/ for a general overview.