Chapter 14. Microsoft Azure Databricks

The main purpose of this chapter is to highlight the Microsoft Data Environment and how we can utilize the many tools it provides, especially Azure Databricks: a fully fledged, powerful analytics platform powered by Apache Spark.

This chapter presents three main case studies that combine the principles of data science and machine learning covered in this book with the ease and power of the Microsoft Data Environment, specifically Azure Databricks. Each case study highlights different features of Azure Databricks, as well as aspects of machine learning that we have learned in this text.

The Microsoft data science environment

Microsoft offers the data scientist many tools and environments. The three main components are as follows:

  • Microsoft Azure: This is an enterprise-grade cloud computing platform that provides a great deal of access and support for production-ready scalable systems.
  • Microsoft Azure Machine Learning Studio: This is a GUI-based environment for creating and operationalizing a machine learning workflow that is optimized for Azure.
  • Azure Databricks: This is an Apache Spark-based analytics and machine learning platform optimized for the Microsoft Azure platform.

This text will focus heavily on the Azure Databricks environment as it provides access to many tools, including these:

  • Spark DataFrames: Spark DataFrames are distributed collections of data organized into rows and named columns. They are conceptually equivalent to a pandas DataFrame in Python, except that the data is partitioned across the machines in a cluster.
  • Notebooks: Like Jupyter, the Azure Databricks notebook tool provides a cell-based code environment for writing logically separated code that is easy to read and replicate.
  • Clusters/workers: Using Microsoft Azure (or another cloud computing platform), we can spin up resources in order to massively optimize and parallelize our machine learning and data analysis to allow for faster iteration.
  • MLlib: This is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, and more.
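
To make the last two items concrete, here is a minimal sketch of how a Spark DataFrame and MLlib might fit together. The toy data, the column names, and the function name train_toy_classifier are hypothetical and chosen only for illustration; the sketch assumes a SparkSession is passed in, which in an Azure Databricks notebook is normally available as the predefined variable spark.

```python
# Hypothetical toy data: (hours_studied, hours_slept, passed)
ROWS = [
    (1.0, 4.0, 0.0),
    (2.0, 5.0, 0.0),
    (6.0, 7.0, 1.0),
    (8.0, 8.0, 1.0),
]
COLUMNS = ["hours_studied", "hours_slept", "passed"]


def train_toy_classifier(spark):
    """Fit an MLlib logistic regression on ROWS using the given SparkSession.

    Illustrative sketch only: the data and column names are made up.
    """
    # Imported inside the function so this cell can be defined
    # before PySpark is actually needed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Build a distributed DataFrame from local rows and column names.
    df = spark.createDataFrame(ROWS, schema=COLUMNS)

    # MLlib estimators expect all features packed into one vector column.
    assembler = VectorAssembler(inputCols=COLUMNS[:-1], outputCol="features")
    train = assembler.transform(df)

    # Fit the classifier and attach a "prediction" column to each row.
    lr = LogisticRegression(featuresCol="features", labelCol="passed", maxIter=10)
    model = lr.fit(train)
    return model.transform(train).select("passed", "prediction")
```

In a notebook attached to a running cluster, calling train_toy_classifier(spark).show() would print the labels next to the model's predictions. With only four rows this is a smoke test of the workflow, not a meaningful model.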

Azure Databricks provides many more capabilities than those listed here. We will focus on the tools that are relevant to us as data scientists and machine learning engineers.

Note

For more information on Azure Databricks, check out https://docs.microsoft.com/en-us/azure/azure-databricks/what-is-azure-databricks.

What exactly are Spark and PySpark?

The backbone of Azure Databricks is Apache Spark. Spark is an analytics engine for big data processing. It has built-in modules for data streaming, SQL, machine learning, and more. PySpark is the Python API for Spark. It allows us to invoke the power of Spark using Python code. Azure Databricks puts all of this at our fingertips and makes it possible to set up clusters running Spark and PySpark within seconds. Let's see it in action!
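
As a first taste, the following is a minimal sketch (not taken from the case studies) of the same aggregation written three ways: in plain Python, with the PySpark DataFrame API, and with Spark SQL. The SALES data, the view name sales, and both function names are hypothetical. The Spark function assumes it receives a SparkSession, such as the spark variable that Databricks notebooks typically predefine.

```python
# Hypothetical sample data: (region, amount)
SALES = [("north", 120.0), ("south", 80.0), ("north", 60.0), ("south", 40.0)]


def average_by_region_python(rows):
    """Plain-Python reference: mean amount per region on a single machine."""
    totals, counts = {}, {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
        counts[region] = counts.get(region, 0) + 1
    return {region: totals[region] / counts[region] for region in totals}


def average_by_region_spark(spark, rows):
    """The same aggregation via the PySpark DataFrame API and via Spark SQL."""
    # Imported inside the function so this cell can be defined
    # before PySpark is actually needed.
    from pyspark.sql import functions as F

    df = spark.createDataFrame(rows, schema=["region", "amount"])

    # DataFrame API: group rows by region and average the amounts.
    by_api = df.groupBy("region").agg(F.avg("amount").alias("avg_amount"))

    # Spark SQL: register the DataFrame as a temporary view and query it.
    df.createOrReplaceTempView("sales")
    by_sql = spark.sql(
        "SELECT region, AVG(amount) AS avg_amount FROM sales GROUP BY region"
    )
    return by_api, by_sql
```

Outside Databricks, a session can be created with SparkSession.builder.appName("demo").getOrCreate() from pyspark.sql and passed in as spark. Both returned DataFrames are evaluated lazily, so nothing is computed on the cluster until an action such as show() or collect() is called on them.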
