Using Spark DSL to build queries

In this section, we will use Spark DSL to build queries for structured data operations:

  1. The following command expresses the same query we used earlier, this time in the Spark DSL, to illustrate how the Spark DSL differs in syntax but achieves the same goal as the SQL shown in the previous section:
df.select("duration").filter(df.duration>2000).filter(df.protocol=="tcp").show()

In this command, we first take the df object that we created in the previous section. We then select the duration column by calling the select function with "duration" as its argument.

  2. Next, in the preceding code snippet, we call the filter function twice: first on df.duration, and then on df.protocol. In the first instance, we check whether the duration is larger than 2000, and in the second, whether the protocol is equal to "tcp". We also need to append the show function at the very end of the command to get the same results, as shown in the following code block:
+--------+
|duration|
+--------+
|   12454|
|   10774|
|   13368|
|   10350|
|   10409|
|   14918|
|   10039|
|   15127|
|   25602|
|   13120|
|    2399|
|    6155|
|   11155|
|   12169|
|   15239|
|   10901|
|   15182|
|    9494|
|    7895|
|   11084|
+--------+
only showing top 20 rows

Here, we again get the top 20 rows of data points that satisfy the conditions expressed in the query.
