This is a general tutorial aimed at finding interesting snippets of information from Flights data available in the nycflights13 package.
The dataset contains On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 as per its description:
# First, install the package if you haven't already install.packages("nycflights13") library(nycflights13)
library(data,table) flights <- data.table(nycflights13::flights) flights # Get descriptive information about the dataset describe(flights)
The output of the preceding code is as follows:
We can also get descriptive information on a per NYC Airport basis:
# There are 3 major airports in the nycflights13 dataset flights[,c(unique(origin))] # [1] "JFK" "EWR" "LGA" # Using describe.by, we can get descriptive information on each of them describe.by(flights, flights$origin)
The output of the preceding code is as follows:
flights_extra <- fread("http://ourairports.com/data/airports.csv") # Extract the columns name, iata_code and region for only those entries where the IATA code exists flights_extra2 <- flights_extra[!iata_code=="",.(dest_name=name, dest=iata_code, region=iso_region)] flights_extra2 setkey(flights, dest) setkey(flights_extra2, dest) # The table, flights does not contain the names of the destination airport. Using the data from the flights_extra2 DataFrame, we'd like to add the respective information to the flights data # We can do this using the "merge" function or the in-built functionality of data.table which lets us subset using the key columns flights2 <- flights_extra2[flights] flights2 # Find the maximum, minimum and average of flight timings between JFK and any other airport, ordered by the highest AvgTime (descending) flights3 <- flights[,.(MaxTime=max(air_time, na.rm = T), MinTime=min(air_time, na.rm = T), AvgTime=mean(air_time, na.rm = T)), by=c("origin", "dest")] flights3 <- merge(flights3, flights_extra2[,.(dest, dest_name)], by = "dest") flights3[order(-AvgTime)] # From this we can see that the flight with the longest average duration (498 minutes) is between JFK/EWR and Daniel K Inouye International Airport in Honolulu
Finding the average delays between all airports can be as follows:
flights4 <- flights[,.(MeanDelay=mean(arr_delay+dep_delay, na.rm=T)), by=c("origin","dest")][order(-MeanDelay)] # origin dest MeanDelay # 1: EWR TYS 82.80511 # 2: EWR CAE 78.94681 # 3: EWR TUL 68.54762 # 4: EWR OKC 59.80000 # 5: EWR JAC 59.73684 # Since the flights_extra2 is keyed on "dest" airport code, we can very easily extract the names of the airports (and join it with the respective table) by using the tablename[key] method as follows: setkey(flights4, dest) flights_extra2[flights4][order(-MeanDelay)] # > flights_extra2[flights4][order(-MeanDelay)] # dest_name dest region origin MeanDelay # 1: McGhee Tyson Airport TYS US-TN EWR 82.80511 # 2: Columbia Metropolitan Airport CAE US-SC EWR 78.94681 # 3: Tulsa International Airport TUL US-OK EWR 68.54762 # 4: Will Rogers World Airport OKC US-OK EWR 59.80000 # From this, we can see that the highest average delay was between EWR (Newark) and McGhee Tyson airport in Tennessee (82 minutes) # To see which of the NYC Airports is the busiest, we can use: flights[,.(total_flights=.N, total_distance_flown=sum(distance), total_air_time=sum(air_time, na.rm = T)), by=origin] # origin total_flights total_distance_flown total_air_time # 1: JFK 111279 140906931 19454136 # 2: EWR 120835 127691515 17955572 # 3: LGA 104662 81619161 11916902 # This shows that even though JFK has comparatively lesser flights than EWR, it has a much higher total_distance_flown and total_air_time # In order to check which airlines have the best overall performance, we can perform a group-by operation using the carrier code # Notice that mix dplyr code along with data.table as follows flights[,.(TotalFlights = .N, AvgDelay=mean(arr_delay+dep_delay, na.rm=T)), by=carrier][order(AvgDelay)] %>% left_join(nycflights13::airlines) # > flights[,.(TotalFlights = .N, AvgDelay=mean(arr_delay+dep_delay, na.rm=T)), by=carrier][order(AvgDelay)] %>% left_join(nycflights13::airlines) # Joining, by = "carrier" # carrier TotalFlights AvgDelay name # 1 AS 714 -4.100141 Alaska Airlines Inc. # 2 HA 342 -2.014620 Hawaiian Airlines Inc. # 3 US 20536 5.874288 US Airways Inc. # 4 AA 32729 8.933421 American Airlines Inc. # 5 DL 48110 10.868291 Delta Air Lines Inc. # 6 VX 5162 14.521110 Virgin America # 7 UA 58665 15.574920 United Air Lines Inc.
This shows the best on-time performance is that of Alaska Airlines followed by Hawaiian, US Airways, American Airlines, and Delta. We can combine output also with ggplot2 to make nice looking charts:
flights[,.(TotalFlights = .N, AvgDelay=mean(arr_delay+dep_delay, na.rm=T)), by=carrier][order(AvgDelay)] %>% left_join(nycflights13::airlines) %>% arrange(desc(AvgDelay)) %>% rename(Airline=name) %>% ggplot(aes(x=reorder(Airline, -AvgDelay),y=AvgDelay, fill=Airline)) + geom_bar(stat="identity") + geom_text(aes(label=round(AvgDelay)), vjust=1.6, color="white", size=4) + theme(axis.text.x = element_text(angle = 75, hjust = 1, size = 12)) + labs (title = "Average Airline Delays from NYC", x = "Airline", y = "Average Delay (Minutes)")
We have explored a few different ways we can manipulate datasets using data.table and dplyr. The intention of providing these examples is primarily to familiarize the user with the ways in which the packages can be used. The best resource for learning about data.table and dplyr can be found on Stack Overflow (search for any data.table or dplyr question and there is usually a link to questions on Stack Overflow that might already have an answer).