Querying SparkR DataFrame

Similar to Scala, we can perform a SQL query on the DataFrame once it is saved as TempView using the createOrReplaceTempView() method. Let's see an example of that. At first, let's save the fight DataFrame (that is, flightDF) as follows:

# First, register the flights SparkDataFrame as a table
createOrReplaceTempView(flightDF, "flight")

Now let's select destination and destinations of all the flights with their associated carrier information as follows:

destDF<- sql("SELECT dest, origin, carrier FROM flight") 
showDF(destDF, numRows=10)

The output is as follows:

Figure 29: All the flights with their associated carrier information

Now let's make the SQL a bit more complex, such as finding the destination's airport of all the flights that are at least 120 minutes delayed as follows:

selected_flight_SQL<- sql("SELECT dest, origin, arr_delay FROM flight WHERE arr_delay>= 120")
showDF(selected_flight_SQL, numRows = 10)

The preceding code segment queries and shows the name of the airports of all the flights that are delayed by at least 2 hours:

Figure 30: Destination airports of all the flights that are delayed by at least 2 hours

Now let's do a more complex query. Let's find the origins of all the flights to Iowa that are delayed by at least 2 hours. Finally, sort them by arrival delay and limit the count up to 20 as follows:

selected_flight_SQL_complex<- sql("SELECT origin, dest, arr_delay FROM flight WHERE dest='IAH' AND arr_delay>= 120 ORDER BY arr_delay DESC LIMIT 20")
showDF(selected_flight_SQL_complex, numRows=20)

The preceding code segment queries and shows the name of the airports of all the flights that are delayed by at least 2 hours to Iowa:

Figure 31: Origins of all the flights that are delayed by at least 2 hours where the destination is Iowa
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.135.197.250