Azure Data Factory

Working with data is a complex topic, so the Azure Data Factory service is not easy to describe in a single sentence.

Let's start with a simple definition:

In general, a data factory lets you process on-premises data (for example, from a SQL Server instance) together with data from the cloud (for example, from an Azure SQL Database, Blobs, or Tables). It does not matter whether the data is structured, semi-structured, or unstructured. The data sources (input datasets) are created, processed, and monitored within the data factory via a simple, highly available data pipeline.

That is the simple definition; now let's move on to a more complete one:

Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows (so-called pipelines) in the cloud to orchestrate and automate data movement and data transformation.

In more detail: with Azure Data Factory, you can create and schedule data-driven workflows that collect data from different data stores (both on-premises and cloud), process and transform that data using services such as Azure HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning, and publish the output to data stores such as Azure SQL Data Warehouse, where it can be consumed by business intelligence (BI) applications such as Microsoft Power BI.
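To make this concrete, here is a minimal sketch of such a data-driven workflow. Azure Data Factory pipelines are defined declaratively in JSON; the nested dictionary below mirrors the general shape of that schema. All names (`IngestAndTransform`, `RawLogs`, `StagedLogs`, the script path) are hypothetical placeholders, not part of any real deployment.

```python
# A hedged sketch of a pipeline definition: one data movement (Copy) activity
# followed by one data transformation (HDInsight Hive) activity.
# Every name and path here is an illustrative placeholder.
pipeline = {
    "name": "IngestAndTransform",
    "properties": {
        "activities": [
            {
                "name": "CopyRawData",
                "type": "Copy",  # data movement activity
                "inputs": [{"referenceName": "RawLogs", "type": "DatasetReference"}],
                "outputs": [{"referenceName": "StagedLogs", "type": "DatasetReference"}],
            },
            {
                "name": "TransformWithHive",
                "type": "HDInsightHive",  # data transformation activity
                "dependsOn": [
                    {"activity": "CopyRawData", "dependencyConditions": ["Succeeded"]}
                ],
                "typeProperties": {"scriptPath": "scripts/transform.hql"},
            },
        ]
    },
}

# The pipeline is a logical group of activities; "dependsOn" defines the order.
ordered = [a["name"] for a in pipeline["properties"]["activities"]]
print(ordered)  # ['CopyRawData', 'TransformWithHive']
```

The point of the sketch is the orchestration model: the pipeline itself holds no data, it only groups activities and their dependencies.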

Let's look at the individual components of a data factory.

A data factory consists of:

  • Input datasets: The so-called input datasets are the incoming data from the following data sources:
    • Azure Blob Storage
    • Azure CosmosDB
    • Azure Data Lake Store
    • Azure SQL Database
    • Azure SQL Data Warehouse
    • Azure Table Storage
    • Amazon Redshift
    • DB2
    • MySQL
    • Oracle
    • PostgreSQL
    • SAP Business Warehouse
    • SAP HANA
    • SQL Server
    • Sybase
    • Teradata
    • Cassandra
    • MongoDB
    • Amazon S3
    • FTP
    • Hadoop distributed file system (HDFS)
    • SSH File Transfer Protocol (SFTP)
    • Generic HTTP
    • Generic OData
    • Generic ODBC
    • Salesforce
    • Web table (HTML table)
    • GE Historian
  • Linked services: Before you can use your dataset, you must first create a linked service to link your data store to the data factory. Linked services define the connection information that is required by the data factory to connect to the external data store.
  • Pipeline: A data factory has one or more pipelines. A pipeline is a logical group of activities that together form a task.
  • Activities: Each pipeline has one or more activities. Activities define the actions that apply to your data. Currently, two types of activities are supported:
    • Data movement activities: This activity type is easy to explain because it is limited to a pure copy operation between one of the available data sources and one of the available data sinks (both listed in the descriptions of input datasets and output datasets).
    • Data transformation activities: These are the processes with which you transform and process your raw data into predictions and insights.
Data transformation activities usually do not bring their own codebase, but instead execute scripts (code) from the following services:
  • Azure HDInsight Hive
  • Azure HDInsight Pig
  • Azure HDInsight MapReduce
  • Azure HDInsight Streaming
  • Azure HDInsight Spark
  • Azure Machine Learning (in combination with the Azure Batch Activity)
  • Azure Data Lake Analytics (U-SQL Script)

Data transformation activities can also execute your own code; these are the activities based on:

  • Stored procedure
  • Custom .NET code
  • Output datasets: The so-called output datasets are the outgoing data to one or more of the following data stores (data sinks):
    • Azure Blob Storage
    • Azure CosmosDB
    • Azure Data Lake Store
    • Azure SQL Database
    • Azure SQL Data Warehouse
    • Azure Table Storage
    • Oracle
    • SQL Server
    • File System
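Putting these components together, the following sketch shows how a linked service, a dataset, and a copy activity relate to one another, again as dictionaries mirroring the ADF JSON schema. All names, the connection string, and the folder and table names are illustrative assumptions, not real values.

```python
# Hypothetical component definitions. The connection string is a placeholder.
linked_service = {
    "name": "MyBlobStore",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;..."
        },
    },
}

# A dataset points at its linked service; without the linked service,
# the data factory has no connection information for the store.
input_dataset = {
    "name": "InputBlob",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {"referenceName": "MyBlobStore", "type": "LinkedServiceReference"},
        "typeProperties": {"folderPath": "incoming/"},
    },
}

output_dataset = {
    "name": "OutputTable",
    "properties": {
        "type": "AzureSqlTable",
        "linkedServiceName": {"referenceName": "MySqlDb", "type": "LinkedServiceReference"},
        "typeProperties": {"tableName": "Staging"},
    },
}

# A data movement activity is a pure copy between a source and a sink dataset.
copy_activity = {
    "name": "CopyBlobToSql",
    "type": "Copy",
    "inputs": [{"referenceName": "InputBlob", "type": "DatasetReference"}],
    "outputs": [{"referenceName": "OutputTable", "type": "DatasetReference"}],
}

# Sanity check: the input dataset resolves to the linked service defined above.
ref = input_dataset["properties"]["linkedServiceName"]["referenceName"]
assert ref == linked_service["name"]
```

The design point to take away is the separation of concerns: linked services hold connection information, datasets describe the shape and location of the data, and activities describe what happens to it.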
Because many data sources are located on-premises, and hybrid scenarios are therefore the default case, Microsoft has introduced the Azure Data Factory Integration Runtime (formerly known as the Data Management Gateway) as part of the Azure data services offering. It is available as a free download here: https://www.microsoft.com/en-us/download/details.aspx?id=39717&a03ffa40-ca8b-4f73-0358-c191d75a7468=True&751be11f-ede8-5a0c-058c-2ee190a24fa6=True.