© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2021
R. C. L'EsteveThe Definitive Guide to Azure Data Engineeringhttps://doi.org/10.1007/978-1-4842-7182-7_2

2. Data Factory vs. SSIS vs. Databricks

Ron C. L’Esteve1  
(1)
Chicago, IL, USA
 

Choosing the right ELT tool can be difficult based on the many data integration offerings from Microsoft’s ever-growing Azure data engineering and integration ecosystem. Technology professionals ranging from data engineers to data analysts are interested in choosing the right ELT tool for the job and often need guidance when determining when to choose between Azure Data Factory (ADF), SQL Server Integration Services (SSIS), and Azure Databricks for their data integration projects.

Both SSIS and ADF are robust GUI-driven data integration tools designed for ELT and ETL workflows, pipelines, and operations with connectors to multiple sources and sinks. SSIS development is hosted in SQL Server Data Tools, while ADF development is a browser-based experience; both have robust scheduling and monitoring features. With ADF’s Mapping Data Flows, transforming data through aggregations, derived columns, fuzzy lookups, and other visually designed data transformations similar to SSIS is a capability that allows data engineers to build ELT in a code-free manner. Both ADF’s Mapping Data Flows and Databricks utilize Spark clusters to transform and process big data and analytics workloads in Azure. This chapter aims to cover the similarities and differences between ADF, SSIS, and Databricks, in addition to providing some guidance to help you determine how to choose between these various data integration services.

Choosing the Right Data Integration Tool

When choosing between Azure Data Factory (ADF) and SQL Server Integration Services (SSIS) for a new project, it is critical to understand whether your organization has an Azure footprint, and if so, can your data integration project be hosted in Azure? If the answer is yes, then ADF is the perfect tool for the job. On the other hand, if the new project must be completed on-premises either for security reasons or because there is already an existing SSIS ecosystem, then SSIS is the tool of choice. Oftentimes organizations reap the benefits of combining SSIS with ADF through lift and shift scenarios in which they leverage ADF’s cloud-based computing services to schedule, run, and execute SSIS packages.

SSIS is a part of SQL Server’s several editions, ranging in price from free (Express and Developer editions) to ~$14K per core (Enterprise), and SSIS integration runtime nodes start at $0.84 per hour on Azure. That said, data volume can become a concern from both a price and performance standpoint when running big data workloads using SSIS since hardware will need to be purchased and frequently maintained.

Azure Data Factory V2’s pay-as-you-go plan starts at $1 per 1,000 orchestrated runs and $1.5 per 1,000 self-hosted IR runs. ADF would be a great resource for organizations that have hundreds of SSIS packages that they would not want to rewrite in ADF but would like to reduce operational costs, increase high availability, and increase scalability by leveraging Azure. For this scenario, a hybrid lift and shift of SSIS workloads to the cloud would be ideal.

From a data velocity perspective, ADF natively supports event-based and Tumbling window triggers in addition to scheduled batch triggers, whereas SSIS only supports batching natively with the capability of potentially building custom triggers for near-real-time data streams. For example, developing a file watcher task for SQL Server Integration Services would automate the process of continuously checking a directory for incoming files before processing them.

From a data variety perspective, ADF can natively connect to over 90 sources ranging from REST APIs to CRM systems to complex JSON structures, while SSIS is better suited for structured data sources but can integrate well to either third-party or custom C# connectors for JSON, REST APIs, and more.

From a programmability perspective, Azure Data Factory does not have a native programming SDK but does support automation through PowerShell without any third-party components, whereas SSIS has a programming SDK, along with automation through BIML and a variety of other third-party components. Figure 2-1 lists the various similarities and differences between SSIS and ADF.
../images/511918_1_En_2_Chapter/511918_1_En_2_Fig1_HTML.jpg
Figure 2-1

Capabilities of SSIS and ADF

When to Use Azure Data Factory, Azure Databricks, or Both

For big data projects, both Data Factory and Databricks are Azure cloud-based data integration tools that are available within Microsoft Azure’s data ecosystem and can handle big data, batch/streaming data, and structured/unstructured data. Both have browser-based interfaces along with pay-as-you-go pricing plans.

ADF’s Mapping Data Flows uses scaled-out Apache Spark clusters, which is similar to Databricks’ underlying architecture, and performs similarly for big data aggregations and transformations. It is important to note that Mapping Data Flows currently does not support connectivity to on-premises data sources. Also, ADF’s original Copy activity does not use Spark clusters but rather self-hosted integration runtimes and does allow connectivity to on-premises SQL Servers. Based on these options to connect to on-premises SQL Servers, Databricks does have capabilities to connect to on-premises data sources and may outperform ADF on big data workloads since it utilizes Spark clusters.

From a velocity perspective, both ADF and Databricks support batch and streaming options. ADF does not natively support real-time streaming capabilities, and Azure Stream Analytics would be needed for this. Databricks supports Structured Streaming, which is an Apache Spark API that can handle real-time streaming analytics workloads.

From a development interface perspective, ADF’s drag-and-drop GUI is very similar to that of SSIS, which fosters a low learning curve and ease of use for developers that are familiar with the code-free interface of SSIS. Additionally, cluster types, cores, and nodes in the Spark compute environment can be managed through the ADF activity GUI to provide more processing power to read, write, and transform your data.

Databricks does require the commitment to learn either Spark, Scala, Java, R, or Python for data engineering– and data science–related activities. This can equate to a higher learning curve for traditional MS SQL BI developers that have been ingrained in the SSIS ETL process for over a decade. For data engineers and scientists that are familiar and comfortable with the Databricks programming languages, Databricks offers a neat and organized method of writing and managing code through notebooks.

The last and most notable difference between ADF and Databricks is related to their primary purposes. ADF, which resembles SSIS in many aspects, is mainly used for ETL/ELT, data movement, and orchestration, whereas Databricks can be used for real-time data streaming and collaboration across data engineers, data scientists, and more, along with supporting the design and development of AI and machine learning models by data scientists.

For example, MLflow from Databricks simplifies the machine learning lifecycle by tracking experiment runs between multiple users within a reproducible environment and manages the deployment of models to production. Additionally, Databricks supports a variety of third-party machine learning tools. In the subsequent chapters, we will cover more details on the capabilities of MLflow and a variety of other advanced features of Databricks.

Once these Databricks models have been developed, they can be integrated within ADF’s Databricks activity and chained into complex ADF ETL/ELT pipelines, coupled with a seamless experience for passing parameters from ADF to Databricks. Additionally, the Databricks models can be scheduled and monitored via ADF. We will explore these various combined pipelines and parameter passing capabilities in future chapters. Figure 2-2 lists the various similarities and differences between Databricks and ADF.
../images/511918_1_En_2_Chapter/511918_1_En_2_Fig2_HTML.jpg
Figure 2-2

Capabilities of Databricks and ADF

Summary

In this chapter, I explored the differences and similarities between ADF, SSIS, and Databricks and made recommendations on when to choose one over the other and when to use them together. The solution truly depends on a number of different factors such as performance, cost, preference, security, feature capability, and more. In the upcoming chapters, we will use a combination of ADF and Databricks to demonstrate real-world end-to-end workflows and pipelines.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.255.116