Predicting airplane arrival time

R, together with add-on packages such as caret, provides functionality for splitting a data frame into training and testing sets, building a model from the training set, predicting results by applying the model to the testing set, and then visualizing how well the model works.

For this example, I am using airline arrival and departure times versus scheduled arrival and departure times from http://stat-computing.org/dataexpo/2009/the-data.html for 2008. The dataset is distributed as a .bz2 file that unpacks into a CSV file. I like this dataset, as the initial row count is over 7 million and it all works nicely in Jupyter.
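
If you would rather not unpack the archive by hand, base R can read a bzip2-compressed CSV through a bzfile() connection. The sketch below demonstrates the round trip on a tiny made-up file in a temporary location; the rows are illustrative and not taken from the dataset.

```r
# Write a tiny, made-up CSV through a bzip2 connection (illustrative data only)
sample_rows <- data.frame(ArrTime    = c(1211, 1002, 804),
                          CRSArrTime = c(1225, 1000, 750),
                          ArrDelay   = c(-14, 2, 14))
bz_path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(bz_path, open = "w")
write.csv(sample_rows, con, row.names = FALSE)
close(con)

# read.csv accepts a connection, so the archive never has to be unpacked on disk
restored <- read.csv(bzfile(bz_path))
str(restored)
```

With the real download, the same read.csv(bzfile(...)) call would simply point at wherever the .bz2 file was saved.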

We first read in the airplane data and display a summary. There are additional columns in the dataset that we are not using:

df <- read.csv("Documents/2008-airplane.csv")
summary(df)
...
CRSElapsedTime      AirTime          ArrDelay          DepDelay      
 Min.   :-141.0   Min.   :   0     Min.   :-519.00   Min.   :-534.00  
 1st Qu.:  80.0   1st Qu.:  55     1st Qu.: -10.00   1st Qu.:  -4.00  
 Median : 110.0   Median :  86     Median :  -2.00   Median :  -1.00  
 Mean   : 128.9   Mean   : 104     Mean   :   8.17   Mean   :   9.97  
 3rd Qu.: 159.0   3rd Qu.: 132     3rd Qu.:  12.00   3rd Qu.:   8.00  
 Max.   :1435.0   Max.   :1350     Max.   :2461.00   Max.   :2467.00  
 NA's   :844      NA's   :154699   NA's   :154699    NA's   :136246   
     Origin             Dest            Distance          TaxiIn      
 ATL    : 414513   ATL    : 414521   Min.   :  11.0   Min.   :  0.00  
 ORD    : 350380   ORD    : 350452   1st Qu.: 325.0   1st Qu.:  4.00  
 DFW    : 281281   DFW    : 281401   Median : 581.0   Median :  6.00  
 DEN    : 241443   DEN    : 241470   Mean   : 726.4   Mean   :  6.86  
 LAX    : 215608   LAX    : 215685   3rd Qu.: 954.0   3rd Qu.:  8.00  
 PHX    : 199408   PHX    : 199416   Max.   :4962.0   Max.   :308.00  
 (Other):5307095   (Other):5306783                    NA's   :151649  
    TaxiOut         Cancelled       CancellationCode    Diverted       
 Min.   :  0.00   Min.   :0.00000    :6872294        Min.   :0.000000  
 1st Qu.: 10.00   1st Qu.:0.00000   A:  54330        1st Qu.:0.000000  
 Median : 14.00   Median :0.00000   B:  54904        Median :0.000000  
 Mean   : 16.45   Mean   :0.01961   C:  28188        Mean   :0.002463  
 3rd Qu.: 19.00   3rd Qu.:0.00000   D:     12        3rd Qu.:0.000000  
 Max.   :429.00   Max.   :1.00000                    Max.   :1.000000  
 NA's   :137058                                                        
  CarrierDelay      WeatherDelay        NASDelay       SecurityDelay    
 Min.   :   0      Min.   :   0      Min.   :   0      Min.   :  0      
 1st Qu.:   0      1st Qu.:   0      1st Qu.:   0      1st Qu.:  0      
 Median :   0      Median :   0      Median :   6      Median :  0      
 Mean   :  16      Mean   :   3      Mean   :  17      Mean   :  0      
 3rd Qu.:  16      3rd Qu.:   0      3rd Qu.:  21      3rd Qu.:  0      
 Max.   :2436      Max.   :1352      Max.   :1357      Max.   :392      
 NA's   :5484993   NA's   :5484993   NA's   :5484993   NA's   :5484993  
 LateAircraftDelay
 Min.   :   0     
 1st Qu.:   0     
 Median :   0     
 Mean   :  21     
 3rd Qu.:  26     
 Max.   :1316     
 NA's   :5484993

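Many of the rows have NA values. We need to remove these in order to build an accurate model. Note that na.omit drops a row if any of its columns is NA; since the five delay-cause columns are NA for 5,484,993 rows, only about 1.5 million of the original 7 million rows remain: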

# eliminate rows with NA values
df <- na.omit(df)
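
Before dropping rows, it can help to see which columns are responsible for the missing values. The sketch below uses a small made-up data frame (the values are hypothetical) to count NAs per column and to compare dropping on all columns with dropping only on the columns a model would actually use.

```r
# Small made-up data frame mimicking the NA pattern (values are hypothetical)
flights <- data.frame(
  ArrTime      = c(1211, 1002, NA, 1959, 2037),
  ArrDelay     = c(-14, 2, NA, 34, 57),
  CarrierDelay = c(NA, NA, NA, 2, 10),
  WeatherDelay = c(NA, NA, NA, 0, 0)
)

# Count missing values per column
na_counts <- colSums(is.na(flights))
print(na_counts)

# na.omit drops a row if ANY column is NA
all_complete <- na.omit(flights)

# Alternative: require completeness only in selected columns
needed <- c("ArrTime", "ArrDelay")
partly_complete <- flights[complete.cases(flights[, needed]), ]

nrow(all_complete)     # rows with no NA anywhere
nrow(partly_complete)  # rows complete in the selected columns
```

Which variant is appropriate depends on the columns the model needs; the example in this section uses the delay-cause columns, so it needs them to be present.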

Let's create our partitions:

# sort by arrival time and count the remaining rows
# (createDataPartition itself does not need sorted input)
times <- df[order(df$ArrTime),]
nrow(times)
1524735

# partition data - 75% training
library(caret)
set.seed(1337)
trainingIndices <- createDataPartition(df$ArrTime, p = 0.75, list = FALSE)
trainingSet <- df[trainingIndices, ]
testingSet <- df[-trainingIndices, ]
nrow(trainingSet)
nrow(testingSet)
1143553
381182
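
For comparison, a plain random 75/25 split can be done with base R alone. This is a simpler, swapped-in technique and not what createDataPartition does: for a numeric outcome, caret samples within percentile groups of the outcome so that both sets have a similar distribution, whereas sample() ignores the outcome. The data frame here is synthetic.

```r
set.seed(1337)

# Synthetic stand-in for the cleaned flight data (values are made up)
flights <- data.frame(ArrTime = sample(1:2400, 1000, replace = TRUE))

# Draw 75% of the row numbers at random, without replacement
n_train <- floor(0.75 * nrow(flights))
train_rows <- sample(seq_len(nrow(flights)), size = n_train)

# drop = FALSE keeps the result a data frame even with a single column
training_set <- flights[train_rows, , drop = FALSE]
testing_set  <- flights[-train_rows, , drop = FALSE]

nrow(training_set)
nrow(testing_set)
```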

Let's build our model of the arrival time (ArrTime) based on the following fields:

CRSArrTime: Scheduled arrival time
ArrDelay: Arrival delay
DepDelay: Departure delay
Diverted: Whether the flight was diverted (1 = yes, 0 = no)
CarrierDelay: Delay attributed to the carrier
WeatherDelay: Delay due to weather
NASDelay: Delay attributed to the National Airspace System (NAS)
SecurityDelay: Delay due to security
LateAircraftDelay: Delay caused by the late arrival of the aircraft from an earlier flight
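
The model-fitting call itself is not shown here. A minimal sketch, assuming an ordinary linear regression fit with lm() on the fields listed above; the formula and the synthetic data frame are assumptions for illustration, not the author's actual code or data:

```r
set.seed(1337)
n <- 500

# Synthetic training data using the column names listed above (values are made up)
training <- data.frame(
  CRSArrTime        = sample(600:2200, n, replace = TRUE),
  DepDelay          = rpois(n, 20),
  Diverted          = rbinom(n, 1, 0.05),
  CarrierDelay      = rpois(n, 16),
  WeatherDelay      = rpois(n, 3),
  NASDelay          = rpois(n, 17),
  SecurityDelay     = rpois(n, 1),
  LateAircraftDelay = rpois(n, 21)
)
# Assumed relationships, chosen only so that the fit has something to find
training$ArrDelay <- training$DepDelay + round(rnorm(n, 0, 5))
training$ArrTime  <- training$CRSArrTime + training$ArrDelay + round(rnorm(n, 0, 3))

# Linear model of arrival time on the listed fields
model <- lm(ArrTime ~ CRSArrTime + ArrDelay + DepDelay + Diverted +
              CarrierDelay + WeatherDelay + NASDelay + SecurityDelay +
              LateAircraftDelay,
            data = training)

# Coefficient table: estimates, standard errors, t values, p-values
summary(model)$coefficients
```

On the real data, the same kind of call would use data = trainingSet, and the coefficient table is where the relative weight of each field can be read off.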

Two of the data items are just flags (0/1), unfortunately. The strongest predictor appears to be the scheduled arrival time. The various other delay factors have only small effects. I suspect an extra 20 minutes for a security check or the like simply feels like a big deal when you are traveling.

Now that we have a model, let's use the testing set to make predictions:

predicted <- predict(model, newdata=testingSet)
summary(predicted)
summary(testingSet$ArrTime)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   -941    1360    1629    1590    1843    2217 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1    1249    1711    1590    2034    2400
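
Comparing the two summaries gives only a rough picture. A common way to put a number on the fit is the root mean squared error (RMSE) together with the mean absolute error (MAE). The sketch below uses short made-up vectors as stand-ins for predicted and testingSet$ArrTime.

```r
# Made-up predicted and actual arrival times (hhmm values, illustrative only)
predicted_demo <- c(1210, 1005, 1950, 2040, 2390)
actual_demo    <- c(1211, 1002, 1959, 2037, 30)

errors <- predicted_demo - actual_demo   # -1, 3, -9, 3, 2360

rmse <- sqrt(mean(errors^2))   # penalises large misses heavily
mae  <- mean(abs(errors))      # average miss, in the same units as ArrTime

rmse
mae
```

A single large miss, like the last pair here, pulls the RMSE far above the MAE, which is one quick way to notice a cluster of badly predicted points.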

Plot out the predicted versus actual data to get a sense of the model's accuracy:

plot(predicted,testingSet$ArrTime)

Visually, the predictions match up well with the actuals, as shown by the almost 45-degree line. However, the set of points on the lower-right portion of the graphic is troublesome: there appear to be many predictions that differ widely from the actuals. There must be additional factors involved, as I would have expected all of the data to plot in one area rather than two.
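
One candidate for such an additional factor is the time encoding. ArrTime and CRSArrTime are clock times written as hhmm, so they wrap at midnight and are not evenly spaced as numbers (1059 is followed by 1100). The sketch below converts such values to minutes since midnight before doing arithmetic on them. The helper name hhmm_to_minutes is hypothetical, the sample values are made up, and whether this accounts for the second cluster is an assumption that would need to be tested against the data.

```r
# Hypothetical helper: convert an hhmm clock value (e.g. 1435) to minutes since midnight
hhmm_to_minutes <- function(hhmm) {
  hours   <- hhmm %/% 100
  minutes <- hhmm %% 100
  hours * 60 + minutes
}

scheduled <- c(1059, 1100, 2350)   # made-up scheduled arrivals (hhmm)
delay_min <- c(10, 10, 40)         # made-up delays in minutes

# Expected arrival in minutes since midnight, wrapped onto a 24-hour clock
expected <- (hhmm_to_minutes(scheduled) + delay_min) %% (24 * 60)
expected
```

The third flight shows the wraparound: scheduled late in the evening, it lands shortly after midnight, so its clock time is a small number even though its scheduled time is large.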
