Identifying missing data

The easiest way to deal with missing values, especially when the data is MCAR, is simply to remove every observation that has a missing value. To exclude each row of a matrix or data.frame object that has at least one missing value, we can use the complete.cases function from the stats package, which returns a logical vector that is TRUE for the rows without any missing values and FALSE for the rest.
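To make the mechanics concrete before turning to real data, here is a minimal sketch on a small, made-up data frame. The object df and its columns are purely illustrative assumptions and are not part of the hflights dataset:

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(
    id    = 1:4,
    delay = c(5, NA, 12, 7),
    taxi  = c(10, 8, NA, 9)
)

# One logical value per row: TRUE if the row has no NA at all
ok <- complete.cases(df)
print(ok)

# Keep only the complete rows
df_complete <- df[ok, ]
print(nrow(df_complete))
```

Here the second and third rows each contain an NA, so only two rows survive the filter. Calling na.omit(df) should keep the same rows, while also recording the dropped ones in an na.action attribute.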

For a quick start, let's see how many rows have at least one missing value:

> library(hflights)
> table(complete.cases(hflights))
 FALSE   TRUE 
  3622 223874

This is about 1.6 percent of the nearly quarter of a million rows:

> prop.table(table(complete.cases(hflights))) * 100
    FALSE      TRUE 
 1.592116 98.407884

Let's see how the NA values are distributed across the columns:

> sort(sapply(hflights, function(x) sum(is.na(x))))
             Year             Month        DayofMonth 
                0                 0                 0 
        DayOfWeek     UniqueCarrier         FlightNum 
                0                 0                 0 
          TailNum            Origin              Dest 
                0                 0                 0 
         Distance         Cancelled  CancellationCode 
                0                 0                 0 
         Diverted           DepTime          DepDelay 
                0              2905              2905 
          TaxiOut           ArrTime            TaxiIn 
             2947              3066              3066 
ActualElapsedTime           AirTime          ArrDelay 
             3622              3622              3622
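The same kind of per-column summary can also be computed without an anonymous function, because is.na applied to a data frame returns a logical matrix. The following is only a sketch on a made-up data frame (df and its columns are assumptions for illustration), with colMeans added to express the counts as percentages:

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(
    id    = 1:4,
    delay = c(5, NA, 12, 7),
    taxi  = c(10, NA, NA, 9)
)

# Number of missing values per column, sorted in increasing order
na_counts <- sort(colSums(is.na(df)))
print(na_counts)

# Share of missing values per column, in percent
na_share <- colMeans(is.na(df)) * 100
print(na_share)
```

On a data frame as large as hflights, the sapply approach shown above avoids building the full logical matrix in one piece, so which variant to prefer is largely a matter of taste and memory.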