In this chapter, we will cover how to load and enrich data using the power of Apache Spark in Azure Synapse Analytics. We will learn various concepts and recipes for using Spark DataFrames to read data from Azure Data Lake Storage (ADLS) and write it to a SQL pool using PySpark.
This chapter comprises the following recipes:
Azure Synapse can read and write data from files placed in ADLS Gen2 using Apache Spark. You can read different file formats from Azure Storage with Synapse Spark using Python.
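For instance, the same reader API handles CSV, Parquet, and JSON. The following is a minimal sketch; the container, account, and file names are placeholders, not part of this recipe:

# Placeholder container/account names; substitute your own ADLS Gen2 path
base_path = 'abfss://<container>@<account>.dfs.core.windows.net'
# The format is determined by the reader method you call
csv_df = spark.read.option('header', 'true').csv(base_path + '/data/sample.csv')
parquet_df = spark.read.parquet(base_path + '/data/sample.parquet')
json_df = spark.read.json(base_path + '/data/sample.json')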
Apache Spark provides a framework that can perform in-memory parallel processing. On top of that, Spark pools help developers debug and work more effectively on their production workloads.
We will be using the same public dataset that we used in Chapter 1, Choosing the Optimal Method for Loading Data to Synapse. To retrieve the dataset, you can go to the following URL: https://www.kaggle.com/microize/newyork-yellow-taxi-trip-data-2020-2019.
The prerequisites for this recipe are as follows:
Let's begin this recipe and see how you can read data from ADLS Gen2 using a Spark notebook within Synapse Studio. We will leverage the notebook capability of Azure Synapse to connect to ADLS Gen2 and read the data from it using PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Build the ADLS Gen2 path: abfss://<container>@<account>.dfs.core.windows.net/<path>
adls_path = 'abfss://%s@%s.dfs.core.windows.net/%s' % ("taxistagingdata", "synapseadlsac", "")

# Read the CSV file into a DataFrame; the Synapse notebook provides the spark session
mydataframe = spark.read.option('header', 'true') \
    .option('delimiter', ',') \
    .csv(adls_path + '/yellow_tripdata_2020-06.csv')

mydataframe.show()
Please refer to Figure 2.3 for a better understanding of the execution and the results:
# Cast passenger_count from string to integer so it can be aggregated
mydataframe1 = mydataframe.withColumn("passenger_count", mydataframe["passenger_count"].cast(IntegerType()))
# Total passengers per vendor and payment type
mydataframe1.groupBy("VendorID", "payment_type").sum("passenger_count").show()
You can refer to Figure 2.4 to see how it looks:
%%pyspark
# Read the CSV from ADLS Gen2 and save it as a Spark table in the default database
df = spark.read.load('abfss://taxistagingdata@synapseadlsac.dfs.core.windows.net/yellow_tripdata_2019-01.csv',
                     format='csv', header=True)
df.write.mode("overwrite").saveAsTable("default.yellow_tripdata")
The following screenshot shows the result:
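Once the table is saved, it can be queried with Spark SQL from any notebook attached to the pool. The following is a minimal sketch of such a check (the column names come from the CSV header):

# Query the table we just saved to verify the write succeeded
spark.sql("SELECT VendorID, COUNT(*) AS trips FROM default.yellow_tripdata GROUP BY VendorID").show()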
You can also create charts to analyze it on the fly, as shown in Figure 2.7:
The Spark pool gives you the flexibility to define the compute as per your needs. You can define the node size as Small, Medium, Large, XLarge, XXLarge, or XXXLarge, with up to 80 vCores/505 GB. The autoscale feature provides the ability to automatically scale up and down based on the level of load and activity.
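Within a notebook, you can also adjust the resources a session requests from the pool using the %%configure magic (a Livy-style session configuration). The values below are purely illustrative, not a recommendation:

%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "28g",
    "executorCores": 4,
    "numExecutors": 4
}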
You can monitor compute allocation using the Spark pool monitor to understand the vCore allocation, active applications, and ended applications by date and time. This allows the developer to plan resource allocation more optimally, as you can see in Figure 2.8:
Let's now look at an interesting aspect of data exploration: plotting visuals within the Synapse notebook. Pictures and graphs are almost always easier to understand than a typical dataset laid out in rows and columns, especially when you are dealing with a very large dataset that may contain many key insights. To obtain data-driven insights, we work on data pointers that lead us to those insights, and to do that, we plot the data in the form of a visual.
This is exactly what we will be doing in this recipe, and you will learn how to do this within the notebook experience.
We will be leveraging the same data frame that we created in the Reading and writing data from ADLS Gen2 using PySpark recipe.
Basic knowledge of matplotlib is required, which will help you to create static and interactive Python visuals.
Let's get back to the same notebook, PySparkNotebook, that we published in the Reading and writing data from ADLS Gen2 using PySpark recipe:
import matplotlib.pyplot as plt
This is the visualization plotting library in Python, as shown in Figure 2.6:
# Convert the Spark DataFrame to pandas for local plotting
mydataframeplot = mydataframe1.toPandas()

# Histogram of passenger counts with 20 bins
ax = mydataframeplot['passenger_count'].plot(kind='hist', bins=20, facecolor='orange')
ax.set_title('Total Passenger distribution')
ax.set_xlabel('No. of Passengers')
ax.set_ylabel('Counts')
plt.suptitle('Trend')
plt.show()
Figure 2.10 shows the output:
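One caveat: toPandas() collects the entire DataFrame onto the driver node. For large datasets, it is safer to sample before converting; the 10% fraction below is an arbitrary illustration:

# Sample ~10% of rows before collecting to the driver to avoid memory pressure
sampledframe = mydataframe1.sample(fraction=0.1, seed=42).toPandas()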
This leverages the power of the Spark pool that you have created to perform data exploration. It makes the process of extracting useful insights from the data extremely fast. The notebook experience within Synapse makes it a one-stop-shop for the developer and the data analyst to collaborate and perform their respective activities.