Tutorial – looking at airline flight times data

This is a general tutorial aimed at finding interesting snippets of information from Flights data available in the nycflights13 package.

The dataset contains On-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 as per its description:

# First, install the package if you haven't already 
 
install.packages("nycflights13") 
library(nycflights13)
library(data,table) flights <- data.table(nycflights13::flights) flights # Get descriptive information about the dataset describe(flights)

The output of the preceding code is as follows:

 We can also get descriptive information on a per NYC Airport basis:

# There are 3 major airports in the nycflights13 dataset 
 
flights[,c(unique(origin))] 
# [1] "JFK" "EWR" "LGA" 
 
# Using describe.by, we can get descriptive information on each of them 
describe.by(flights, flights$origin) 

The output of the preceding code is as follows:

flights_extra <- fread("http://ourairports.com/data/airports.csv") 
 
# Extract the columns name, iata_code and region for only those entries where the IATA code exists 
flights_extra2 <- flights_extra[!iata_code=="",.(dest_name=name, dest=iata_code, region=iso_region)] 
flights_extra2 
 
setkey(flights, dest) 
setkey(flights_extra2, dest) 
 
# The table, flights does not contain the names of the destination airport. Using the data from the flights_extra2 DataFrame, we'd like to add the respective information to the flights data 
# We can do this using the "merge" function or the in-built functionality of data.table which lets us subset using the key columns 
 
flights2 <- flights_extra2[flights] 
flights2 
 
# Find the maximum, minimum and average of flight timings between JFK and any other airport, ordered by the highest AvgTime (descending) 
 
flights3 <- flights[,.(MaxTime=max(air_time, na.rm = T), MinTime=min(air_time, na.rm = T), AvgTime=mean(air_time, na.rm = T)), by=c("origin", "dest")] 
flights3 <- merge(flights3, flights_extra2[,.(dest, dest_name)], by = "dest") 
flights3[order(-AvgTime)] 
 
# From this we can see that the flight with the longest average duration (498 minutes) is between JFK/EWR and Daniel K Inouye International Airport in Honolulu 

Finding the average delays between all airports can be as follows:

flights4 <- flights[,.(MeanDelay=mean(arr_delay+dep_delay, na.rm=T)), by=c("origin","dest")][order(-MeanDelay)] 
 
# origin dest MeanDelay 
# 1:    EWR  TYS  82.80511 
# 2:    EWR  CAE  78.94681 
# 3:    EWR  TUL  68.54762 
# 4:    EWR  OKC  59.80000 
# 5:    EWR  JAC  59.73684 
 
# Since the flights_extra2 is keyed on "dest" airport code, we can very easily extract the names of the airports (and join it with the respective table) by using the tablename[key] method as follows: 
 
setkey(flights4, dest) 
flights_extra2[flights4][order(-MeanDelay)] 
 
# > flights_extra2[flights4][order(-MeanDelay)] 
#     dest_name dest region origin MeanDelay 
# 1:  McGhee Tyson Airport  TYS  US-TN    EWR  82.80511 
# 2:  Columbia Metropolitan Airport  CAE  US-SC    EWR  78.94681 
# 3:  Tulsa International Airport  TUL  US-OK    EWR  68.54762 
# 4:  Will Rogers World Airport  OKC  US-OK    EWR  59.80000 
 
 
# From this, we can see that the highest average delay was between EWR (Newark) and McGhee Tyson airport in Tennessee (82 minutes) 
 
# To see which of the NYC Airports is the busiest, we can use: 
 
flights[,.(total_flights=.N, total_distance_flown=sum(distance), total_air_time=sum(air_time, na.rm = T)), by=origin] 
 
# origin total_flights total_distance_flown total_air_time 
# 1:    JFK        111279            140906931       19454136 
# 2:    EWR        120835            127691515       17955572 
# 3:    LGA        104662             81619161       11916902 
 
# This shows that even though JFK has comparatively lesser flights than EWR, it has a much higher total_distance_flown and total_air_time 
 
# In order to check which airlines have the best overall performance, we can perform a group-by operation using the carrier code 
 
# Notice that mix dplyr code along with data.table as follows 
 
flights[,.(TotalFlights = .N, AvgDelay=mean(arr_delay+dep_delay, na.rm=T)), by=carrier][order(AvgDelay)] %>% left_join(nycflights13::airlines) 
 
# > flights[,.(TotalFlights = .N, AvgDelay=mean(arr_delay+dep_delay, na.rm=T)), by=carrier][order(AvgDelay)] %>% left_join(nycflights13::airlines) 
 
# Joining, by = "carrier" 
# carrier TotalFlights  AvgDelay                        name 
# 1       AS          714 -4.100141        Alaska Airlines Inc. 
# 2       HA          342 -2.014620      Hawaiian Airlines Inc. 
# 3       US        20536  5.874288             US Airways Inc. 
# 4       AA        32729  8.933421      American Airlines Inc. 
# 5       DL        48110 10.868291        Delta Air Lines Inc. 
# 6       VX         5162 14.521110              Virgin America 
# 7       UA        58665 15.574920       United Air Lines Inc. 
 

This shows the best on-time performance is that of Alaska Airlines followed by Hawaiian, US Airways, American Airlines, and Delta. We can combine output also with ggplot2 to make nice looking charts:

flights[,.(TotalFlights = .N, AvgDelay=mean(arr_delay+dep_delay, na.rm=T)), by=carrier][order(AvgDelay)] %>% left_join(nycflights13::airlines) %>% arrange(desc(AvgDelay)) %>% rename(Airline=name) %>% ggplot(aes(x=reorder(Airline, -AvgDelay),y=AvgDelay, fill=Airline)) + geom_bar(stat="identity") + geom_text(aes(label=round(AvgDelay)), vjust=1.6, color="white", size=4) + theme(axis.text.x = element_text(angle = 75, hjust = 1, size = 12)) + labs (title = "Average Airline Delays from NYC", x = "Airline", y = "Average Delay (Minutes)") 

We have explored a few different ways we can manipulate datasets using data.table and dplyr. The intention of providing these examples is primarily to familiarize the user with the ways in which the packages can be used. The best resource for learning about data.table and dplyr can be found on Stack Overflow (search for any data.table or dplyr question and there is usually a link to questions on Stack Overflow that might already have an answer).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.142.133.180