The easiest way of dealing with missing values, especially with MCAR data, is simply removing all the observations with any missing values. If we want to exclude every row of a matrix or data.frame object that has at least one missing value, we can use the complete.cases function from the stats package, which returns TRUE for each row without missing values and FALSE for each row with at least one.
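As a minimal sketch of the idea, the following uses a small toy data frame; the object name df and its values are made up for illustration and are not part of the hflights data. It shows the logical vector returned by complete.cases, row filtering with that vector, and na.omit as an equivalent shortcut:

```r
# Toy data frame; the name `df` and its values are made up for illustration
df <- data.frame(
    id    = 1:5,
    delay = c(10, NA, 3, 7, NA),
    taxi  = c(5, 6, NA, 8, 9)
)

# Logical vector: TRUE for rows without any missing value
complete <- complete.cases(df)

# Keep only the complete rows (rows 1 and 4 here)
df_complete <- df[complete, ]

# na.omit drops the same rows and stores the removed row
# indices in the "na.action" attribute of the result
df_omitted <- na.omit(df)
```

Indexing with complete.cases keeps the filtering step explicit, which is handy when the same logical vector is also needed for counting, as in the hflights examples.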
For a quick start, let's see how many rows have at least one missing value:
> library(hflights)
> table(complete.cases(hflights))

 FALSE   TRUE 
  3622 223874 
This is around 1.6 percent of the roughly quarter million rows:
> prop.table(table(complete.cases(hflights))) * 100

    FALSE      TRUE 
 1.592116 98.407884 
Let's see what the distribution of NA looks like within different columns:
> sort(sapply(hflights, function(x) sum(is.na(x))))
             Year             Month        DayofMonth 
                0                 0                 0 
        DayOfWeek     UniqueCarrier         FlightNum 
                0                 0                 0 
          TailNum            Origin              Dest 
                0                 0                 0 
         Distance         Cancelled  CancellationCode 
                0                 0                 0 
         Diverted           DepTime          DepDelay 
                0              2905              2905 
          TaxiOut           ArrTime            TaxiIn 
             2947              3066              3066 
ActualElapsedTime           AirTime          ArrDelay 
             3622              3622              3622 
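The same kind of per-column count can also be computed without an anonymous function, because is.na applied to a data frame returns a logical matrix. The following is a sketch on a toy data frame (the name df and its values are made up for illustration), using colSums for counts and colMeans for percentages:

```r
# Toy data frame; names and values are made up for illustration
df <- data.frame(
    id    = 1:5,
    delay = c(10, NA, 3, 7, NA),
    taxi  = c(5, 6, NA, 8, 9)
)

# is.na() on a data frame returns a logical matrix,
# so colSums() counts the missing values in each column
na_counts <- sort(colSums(is.na(df)))

# colMeans() gives the share of missing values per column,
# multiplied by 100 to express it as a percentage
na_percent <- colMeans(is.na(df)) * 100
```

Here the delay column has two missing values out of five (40 percent) and the taxi column has one (20 percent).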