© Leila Etaati 2019
Leila Etaati, Machine Learning with Microsoft Technologies, https://doi.org/10.1007/978-1-4842-3658-1_15

15. Machine Learning on HDInsight

Leila Etaati (1)
(1)
Auckland, New Zealand
 

In this chapter, an overview of how to use HDInsight for the purpose of machine learning will be presented. The focus is on Apache Spark clusters in HDInsight, which provide in-memory cluster processing. Processing data in memory is much faster than disk-based computing. Spark supports the Scala language, which works naturally with distributed data sets. Creating a Spark cluster is very fast, and it is able to use Jupyter Notebook, which makes data processing and visualization easier. Spark clusters can also be integrated with Azure Event Hubs and Kafka. Moreover, it is possible to set up an ML Services cluster to run distributed R computations. In the next section, the process of setting up Spark in HDInsight will be discussed.

HDInsight Overview

Azure HDInsight is a cloud-based service for open source analytics. It offers an easy, fast, and cost-effective way to process massive amounts of data. There are many different use-case scenarios for HDInsight, such as ETL (extract, transform, and load), data warehousing, machine learning, Internet of Things (IoT), and so forth.

The main benefit of using HDInsight for machine learning is access to a memory-based processing framework. HDInsight helps developers process and analyze big data and develop solutions, using popular open source frameworks, such as Hadoop, Spark, Hive, LLAP, Kafka, Storm, and Microsoft Machine Learning Server [1].

Setting Up Clusters in HDInsight

The first step is to set up HDInsight in Azure. Log in to your Azure account and create an HDInsight component in Azure (Figure 15-1). As you can see in Figure 15-1, there are several HDInsight-related offerings, such as HDInsight Spark Monitoring and HDInsight Interactive Query Monitoring. Among them, select the HDInsight option in the Analytics category.
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig1_HTML.jpg
Figure 15-1

Setting up HDInsight in Azure

When you create HDInsight, you must follow some steps for setting up the cluster and identifying the size. The first step is to set a name for the cluster, set the subscription, and choose the cluster type (Figure 15-2). Different cluster types are available, including Spark, Hadoop, Kafka, ML Services, and more.
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig2_HTML.jpg
Figure 15-2

Creating HDInsight in Azure

The next step is to identify the size and check the summary (Figure 15-3) of the cluster.
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig3_HTML.jpg
Figure 15-3

Setting up HDInsight

Creation of HDInsight may take some time. After creating an HDInsight component, on the main page, in the Overview section, select Cluster dashboards (Figure 15-4).
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig4_HTML.jpg
Figure 15-4

HDInsight environment

Next, select Jupyter Notebook. On the new page, choose the New option (Figure 15-5).
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig5_HTML.jpg
Figure 15-5

Jupyter Notebook environment

As you can see in Figure 15-5, there are kernel options, such as PySpark, PySpark3, and Spark. The PySpark and PySpark3 kernels run Python code, while the Spark kernel runs Scala.

After creating the new notebook page for Spark, you must log in with the username and password that you provided when creating the HDInsight component. The Jupyter environment is a notebook interface, similar to the Azure Databricks environment. As you can see in Figure 15-6, it is possible to write code there and run a whole cell to see the result.
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig6_HTML.jpg
Figure 15-6

Jupyter Notebook environment

It is also possible to fetch data from other Azure components, such as Azure Data Lake Store Gen1 (Figure 15-7). To do this, you must run the following code (the same code works in Databricks). Note that the client ID, credential, and tenant ID shown here are examples; replace them with the values from your own Azure service principal.
// Configure OAuth credentials so Spark can authenticate against Data Lake Store Gen1
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark.conf.set("dfs.adls.oauth2.client.id", "a1824181-e20c-4952-894f-6f53670672dd")
spark.conf.set("dfs.adls.oauth2.credential", "iRzOkcyahiomc5AKobyVxFdDVF/mEbS3mqN1moehG0w=")
spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/0b414bdb-2159-4b16-ad13-b2d54a1781da/oauth2/token")
// Read the Titanic CSV file from the Data Lake Store, using the first row as the header
val df = spark.read.option("header", "true").csv("adl://adlsbook.azuredatalakestore.net/titanic.csv")
// Keep only the columns of interest and rename Sex to Gender
val specificColumnsDf = df.select("Survived", "Pclass", "Sex", "Age")
val renamedColumnsDF = specificColumnsDf.withColumnRenamed("Sex", "Gender")
// Register the result as a temporary view, so it can be queried with Spark SQL
renamedColumnsDF.createOrReplaceTempView("some_name")
renamedColumnsDF.show()
../images/463840_1_En_15_Chapter/463840_1_En_15_Fig7_HTML.jpg
Figure 15-7

Getting the data from Azure Data Lake Store Gen1
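Because the last lines of the code above register renamedColumnsDF as the temporary view some_name, the same data can also be queried with Spark SQL from another notebook cell. The following is a minimal sketch of such a query (the aggregation shown here is illustrative, not part of the original example); it would run in the same Spark notebook session.

```scala
// Query the temporary view registered above with Spark SQL.
// Survived was read from CSV as a string, so cast it before summing.
val survivalByGender = spark.sql(
  "SELECT Gender, COUNT(*) AS total, SUM(CAST(Survived AS INT)) AS survived " +
  "FROM some_name GROUP BY Gender")
// The result is an ordinary DataFrame and can be displayed in the notebook
survivalByGender.show()
```

The result of spark.sql is a regular DataFrame, so it can be filtered, joined, or visualized like any other DataFrame in the notebook.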

It is also possible to perform machine learning in Jupyter Notebook with Spark. For an example, follow the tutorial available at https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-ipython-notebook-machine-learning.
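As a sketch of what such a machine learning notebook might contain, the following Scala cell trains a logistic regression model on the renamedColumnsDF DataFrame prepared earlier, using the Spark ML Pipeline API. The feature handling here is a simplified assumption for illustration, not the exact code from the tutorial above; it assumes the same Spark notebook session.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Cast the string columns read from CSV to numeric types and drop rows with missing values
val data = renamedColumnsDF
  .selectExpr("CAST(Survived AS DOUBLE) AS label",
              "CAST(Pclass AS DOUBLE) AS Pclass",
              "Gender",
              "CAST(Age AS DOUBLE) AS Age")
  .na.drop()

// Encode Gender as a numeric index and assemble the features into a single vector column
val genderIndexer = new StringIndexer().setInputCol("Gender").setOutputCol("GenderIndex")
val assembler = new VectorAssembler()
  .setInputCols(Array("Pclass", "GenderIndex", "Age"))
  .setOutputCol("features")
val lr = new LogisticRegression()

// Train on 80% of the data and evaluate predictions on the remaining 20%
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)
val model = new Pipeline().setStages(Array(genderIndexer, assembler, lr)).fit(train)
model.transform(test).select("label", "prediction").show()
```

Wrapping the indexer, assembler, and classifier in a Pipeline keeps the preprocessing and the model together, so the fitted model applies the same transformations to new data at prediction time.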

Summary

This chapter presented an overview of HDInsight and how to set it up. How to use HDInsight for different purposes was explained briefly, as well as how to set up HDInsight inside the Azure environment and how to access Jupyter Notebook for the purpose of writing code in a PySpark, PySpark3, or Spark environment. An example of how to connect to Azure Data Lake Store Gen1 to fetch data was shown.

Reference

  1. Microsoft Azure, "Azure HDInsight Documentation," https://docs.microsoft.com/en-us/azure/hdinsight/, 2019.

     