Data manipulation

Show the column names in the SparkDataFrame as follows:

columns(flightDF)
[1] "year" "month" "day" "dep_time" "dep_delay" "arr_time" "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
[13] "air_time" "distance" "hour" "minute"

Show the number of rows in the SparkDataFrame as follows:

count(flightDF)
[1] 336776

Filter flights data whose destination is only Miami and show the first six entries as follows:

 showDF(flightDF[flightDF$dest == "MIA", ], numRows = 10)

The output is as follows:

Figure 25: Flights with destination Miami only

Select specific columns. For example, let's select all the flights that are going to Iowa that are delayed. Also, include the origin airport names:

delay_destination_DF<- select(flightDF, "flight", "dep_delay", "origin", "dest") 
delay_IAH_DF<- filter(delay_destination_DF, delay_destination_DF$dest == "IAH") showDF(delay_IAH_DF, numRows = 10)

The output is as follows:

Figure 26: All the flights that are going to Iowa that are delayed

We can even use it to chain data frame operations. To show an example, at first, group the flights by date and then find the average daily delay. Then, finally, write the result into a SparkDataFrame as follows:

install.packages(c("magrittr")) 
library(magrittr)
groupBy(flightDF, flightDF$day) %>% summarize(avg(flightDF$dep_delay), avg(flightDF$arr_delay)) ->dailyDelayDF

Now print the computed DataFrame:

head(dailyDelayDF)

The output is as follows:

Figure 27: Group the flights by date and then find the average daily delay

Let's see another example that aggregates average arrival delay for the entire destination airport:

avg_arr_delay<- collect(select(flightDF, avg(flightDF$arr_delay))) 
head(avg_arr_delay)
avg(arr_delay)
1 6.895377

Even more complex aggregation can be performed. For example, the following code aggregates the average, maximum, and minimum delay per each destination airport. It also shows the number of flights that land in those airports:

flight_avg_arrival_delay_by_destination<- collect(agg( 
groupBy(flightDF, "dest"),
NUM_FLIGHTS=n(flightDF$dest),
AVG_DELAY = avg(flightDF$arr_delay),
MAX_DELAY=max(flightDF$arr_delay),
MIN_DELAY=min(flightDF$arr_delay)
))
head(flight_avg_arrival_delay_by_destination)

The output is as follows:

Figure 28: Maximum and minimum delay per each destination airport
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.221.140.111