Databricks notebook

We are now ready to consume and transform the data in the Sales.txt file we created in the previous section. We'll go back to the Databricks workspace and create a new notebook. The easiest way to achieve this is to use a template. As shown in the following screenshot, we select the Data icon in the toolbar at the left of our workspace and go to Tables | Spark Data Sources | Azure Blob Storage. Click on Create Table in Notebook to create the new notebook:

The notebook opens with several sections. We'll first rename it: right-click on the notebook name and select Rename.

Enter ADFV2Notebook and click on Rename to rename it.

We'll now move it to the ADFCalls folder. Right-click again on the newly renamed notebook and select Move from the submenu, as shown next:

The following dialog box appears. Select the folder ADFCalls and click on Select to move the notebook to its desired location:

Now that the notebook is named and located properly, we'll modify its content.

A notebook is where a data scientist documents how an experiment was conducted and the data it used. In our case, our experiment consists of the steps described next.

To execute the following steps in our notebook, we need computing capacity: a cluster. We'll make sure that the one we created earlier is running. At the top of the notebook, a status indicator shows whether the notebook is attached to a cluster. As shown in the following screenshot, we make sure the notebook is attached to our cluster:

To attach it to the cluster, click on the Detached icon and select your cluster in the list:

Once attached to the cluster, we can start the cluster if it's not running, as shown in the following screenshot:

We are now ready to focus on our notebook. The first step consists of telling Spark where to find the source data and what format it is in. In our case, the source file is in our ADFV2Book | sales-data blob storage. At the top of the notebook, textboxes hold this information; it will later be supplied as parameters by our ADF Databricks activity. The following list shows the properties that must be filled in:

  • Storage Key / SAS: This is the storage key used by the adfv2book storage account. It is accessible in the Access keys section of the adfv2book storage account.
  • Storage Account Name: This is our storage account name, adfv2book.
  • Upload Location: This property is a bit tricky to set. It takes the form wasbs:// + the container name + @ + the storage account name + .blob.core.windows.net/. In our case, we will use wasbs://sales-data@adfv2book.blob.core.windows.net/.
  • file_type: CSV.

The textboxes are created by the first section of our notebook, as shown in the following screenshot:

The notebook described here has been modified to make it simpler.

The dbutils.widgets.text command declares parameters that will be used later in the notebook. The next step sets the Spark configuration, as shown in the following screenshot. We can execute each step by pressing Ctrl + Enter to test the steps one at a time:
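In sketch form, these first two cells look roughly as follows. The widget names file_location and file_type appear in the notebook itself; the other two widget names and the exact configuration call are assumptions based on the Databricks Azure Blob Storage template, so your generated notebook may differ slightly:

# Cell 1: declare the notebook parameters as text widgets (the textboxes at the top of the notebook)
dbutils.widgets.text("storage_account_name", "")        # assumed widget name
dbutils.widgets.text("storage_account_access_key", "")  # assumed widget name
dbutils.widgets.text("file_location", "")               # e.g. wasbs://sales-data@adfv2book.blob.core.windows.net/
dbutils.widgets.text("file_type", "")                   # e.g. CSV

# Cell 2: register the storage account key so Spark can read from the wasbs:// location
spark.conf.set(
    "fs.azure.account.key." + dbutils.widgets.get("storage_account_name") + ".blob.core.windows.net",
    dbutils.widgets.get("storage_account_access_key"))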

This time, the command dbutils.widgets.get is used to read the parameters. We are now ready to go to step 3, reading the data. We'll use the following command:

df = spark.read.format(dbutils.widgets.get("file_type")).option("inferSchema", "true").load(dbutils.widgets.get("file_location")) 

The df object is a data frame, a tabular structure in Spark. It holds the content of the container at file_location (sales-data/), read as file_type (CSV).
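A quick way to confirm the load worked, before creating any tables, is to inspect the data frame directly in a cell. This check is a minimal sketch and is not part of the generated template:

df.printSchema()   # print the inferred column names and types
display(df)        # Databricks helper that renders the first rows as a grid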

We can now create the tables. As shown in the following screenshot, the notebook is well documented: each step we accomplish is described alongside its code:

We first use %sql to tell Spark that the next command uses the SQL language. Since we are creating a table from the data frame (which contains all the files in our sales-data container, that is, Sales.txt), we drop the table first if it exists; otherwise, the next command would return an error. The df.write.format... command writes the content of the data frame to a file on the cluster. This becomes a table that can be consumed by later processes.
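In sketch form, the pattern looks like the following two cells, assuming the staging table is named sales_data (the real name comes from the template notebook) and that the data frame is persisted in Parquet format:

%sql
-- SQL cell: remove any previous version of the staging table so the write below does not fail
DROP TABLE IF EXISTS sales_data

# Python cell: persist the data frame as a table on the cluster
df.write.format("parquet").saveAsTable("sales_data")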

The last step creates a table that will aggregate the data of our file, as shown in the following screenshot:

The table Sales is the one that will be used in the Power BI chapter. To make sure that the notebook executes properly, click on the Run All button at the top of the notebook, as shown in the screenshot.
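As a hedged sketch only, the aggregation step could look like the following %sql cells, with hypothetical column names (the real columns depend on Sales.txt) and assuming the staging table created earlier is named sales_data:

%sql
-- hypothetical example: one cell drops the previous aggregate table...
DROP TABLE IF EXISTS Sales

%sql
-- ...and the next cell rebuilds it as an aggregate of the staging table
CREATE TABLE Sales AS
SELECT SalesPerson,                -- hypothetical grouping column
       SUM(Amount) AS TotalAmount  -- hypothetical measure column
FROM sales_data
GROUP BY SalesPerson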

Now that we have a notebook, we'll execute it from ADF.

At the end of the notebook, we run a query to observe the results. By default, the results are displayed in a grid, but one of the advantages of a notebook is that we can use visuals to analyze the data inside the notebook itself. Clicking on the graph button lets us see the data as charts. We can see an example of this in the next screenshot:
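A minimal sketch of that final check, assuming the aggregate table is named Sales:

# Query the aggregate table and render the result; the chart button under the
# results grid switches the same result set to a bar chart or another visual
display(spark.sql("SELECT * FROM Sales"))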
