3 Data Inspection and Cleaning

3.1 Introduction

Data inspection and cleaning is the next important stage of the data analysis pipeline. Before starting analysis, visualization or machine learning, you should inspect and clean the data that is to be analyzed. Although machine learning, exploratory data analysis and data visualization take up more time in analytics education, in an actual data science project much more time is spent on data inspection and cleaning.

3.2 Data Inspection

Data inspection helps us confirm that the data import executed correctly, that the dataset has the expected length (rows) and breadth (columns), and that the variables (columns) are in the expected format.

3.2.1 Data Inspection in SAS

Let’s try this in SAS

Referring to a column is easier in SAS than in R

/* Refer to a column with var in a proc, or keep only that column with keep= in a data step */

data import4 (keep=ozone2);
set import3;
run;

Creating a new variable which is twice the value of the target column

data import4;
set import4;
ozone3=2*ozone2;
run;

Printing only that variable by using var (which is also used in other Procs)

proc print data=import3 (obs=5);
var ozone2;
run;

Referring to a row is more complex in SAS than R

proc print data=import3 (obs=7);
run;
Obs  Wind  Temp  Month  Day  Ozone2   Solar_R2   var2
1    7.4   67    5      1    41.0000  190.000   .
2    8     72    5      2    36.0000  118.000   .
3    12.6  74    5      3    12.0000  149.000   .
4    11.5  62    5      4    18.0000  313.000   .
5    14.3  56    5      5    35.3017  185.932   .
6    14.9  66    5      6    28.0000  185.932   .
7    8.6   65    5      7    23.0000  299.000   .

Using a do loop for getting certain rows

data output1;
  do i = 1, 3, 4, 7;
     set import3 point = i;
     output;
  end;
  stop;
run;

proc print data=output1;
run;

Obs  Wind  Temp  Month  Day  Ozone2  Solar_R2  var2
1    7.4   67    5      1    41      190       .
2    12.6  74    5      3    12      149       .
3    11.5  62    5      4    18      313       .
4    8.6   65    5      7    23      299       .

3.2.2 Data Inspection in R

head gives the first 6 rows
names gives the names of the columns
dim gives the dimensions (rows, columns)
str gives the structure (dimensions, variable names, variable types and the first few values)
class gives the type of data object (which is important in R, as an object can be one of many different types)
summary gives a summary of the whole object, including numerical summaries, counts of missing values and frequencies of factor variables

> data(airquality)

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

> names(airquality)
[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"

> dim(airquality)
[1] 153 6

> str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone  : int  41   36   12    18    NA    28    23   19    8     NA   ...
$ Solar.R: int  190  118  149   313   NA    NA    299  99    19    194  ...
$ Wind   : num  7.4  8    12.6  11.5  14.3  14.9  8.6  13.8  20.1  8.6  ...
$ Temp   : int  67   72   74    62    56    66    65   59    61    69   ...
$ Month  : int  5    5    5     5     5     5     5    5     5     5    ...
$ Day    : int  1    2    3     4     5     6     7    8     9     10   ...

> class(airquality)
[1] "data.frame"

> summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month            Day
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0
 NA's   :37       NA's   :7
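
These checks can also be scripted so that a bad import stops the analysis early. The following is a minimal sketch in base R, assuming the dimensions and column types shown in the output above; the helper name check_import is hypothetical and not part of any package.

```r
# Hypothetical helper: stop with an error if a data frame does not have
# the expected number of rows, column names and column classes.
check_import <- function(df, n_rows, col_classes) {
  stopifnot(is.data.frame(df))
  stopifnot(nrow(df) == n_rows)
  # First class of every column, as a named character vector
  actual <- vapply(df, function(x) class(x)[1], character(1))
  stopifnot(identical(names(actual), names(col_classes)))
  stopifnot(all(actual == col_classes))
  invisible(TRUE)
}

data(airquality)

# Expected layout, taken from the str() output of airquality
expected <- c(Ozone = "integer", Solar.R = "integer", Wind = "numeric",
              Temp = "integer", Month = "integer", Day = "integer")

ok <- check_import(airquality, 153, expected)
```

If the number of rows or any column type differs from what is expected, stopifnot raises an error that names the failed condition.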


We can choose specific parts of a data frame by using square brackets, i.e.

airquality[2,3] gives the value in the second row and third column of airquality
airquality[2,] gives the second row and all columns of airquality
airquality[,3] gives all rows of the third column of airquality
airquality[R,C] gives the value in the Rth row and Cth column of airquality

airquality$Ozone gives the values of the Ozone column in airquality
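
As a brief illustration, the sketch below applies each form of indexing to airquality and stores the results so they can be inspected. The object names (one_value, second_row and so on) are illustrative only, and the last two lines show that the brackets also accept column names and logical conditions.

```r
data(airquality)

one_value   <- airquality[2, 3]    # row 2, column 3 (Wind): a single number
second_row  <- airquality[2, ]     # a one-row data frame with all 6 columns
wind_column <- airquality[, 3]     # a plain numeric vector of 153 values
ozone       <- airquality$Ozone    # the same idea, selecting by column name

# Brackets also accept column names and logical conditions
first_three <- airquality[1:3, c("Ozone", "Temp")]
hot_days    <- airquality[airquality$Temp > 90, ]
```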

3.3 Missing Values

Data can be missing because of human data-entry error, formatting issues or incorrect import syntax. This is a problem because an analysis run on incomplete data can fail outright or give misleading results.

There are three ways to handle missing data:

Ignore it
Delete it
Replace it – Replace it with a value that does not significantly change the numerical properties of the variable. Replacing missing data is called missing value imputation. In its simplest form, we replace missing values with the mean or the median of the variable. In more sophisticated forms, we use the correlation with other, more complete variables, or machine learning algorithms, to impute the values. Packages such as mice in R help with more sophisticated missing value imputation.
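
The three options can be sketched side by side on the Ozone column of airquality. This is a minimal illustration in base R with illustrative object names; it uses simple mean replacement and does not attempt the more sophisticated imputation that packages such as mice provide.

```r
data(airquality)
oz <- airquality$Ozone

# 1. Ignore it: skip missing values when computing a statistic
mean_ignored <- mean(oz, na.rm = TRUE)

# 2. Delete it: keep only the observed values
oz_deleted <- oz[!is.na(oz)]

# 3. Replace it: fill each missing value with the mean of the observed values
oz_imputed <- ifelse(is.na(oz), mean(oz, na.rm = TRUE), oz)
```

Mean replacement leaves the mean of the variable unchanged but reduces its spread, which is one reason the choice between the three options depends on the analysis that follows.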

3.3.1 Missing Values in SAS

File Import

FILENAME REFFILE '/home/ajay4/book/airquality.csv';

PROC IMPORT DATAFILE=REFFILE
     DBMS=CSV
     OUT=WORK.IMPORT;
     GETNAMES=YES;
RUN;

Finding the variable types. To our surprise, some variables that are numeric in R have been imported as character (string) variables in SAS. This is because R reads NA as a missing value, whereas SAS treats it as ordinary text, so any column containing NA is read as character. In SAS, a missing numeric value is denoted by a single period.

PROC CONTENTS DATA=WORK.IMPORT; RUN;

Alphabetic List of Variables and Attributes
#   Variable   Type   Len   Format   Informat
7   Day        Num    8     BEST12.  BEST32.
6   Month      Num    8     BEST12.  BEST32.
2   Ozone      Char   2     $2.      $2.
3   Solar.R    Char   3     $3.      $3.
5   Temp       Num    8     BEST12.  BEST32.
1   VAR1       Char   4     $4.      $4.
4   Wind       Num    8     BEST12.  BEST32.

Let’s print the first six rows of data (similar to head function in R)

proc print data =import (obs=6);
run;
Obs  VAR1  Ozone  Solar.R  Wind  Temp  Month  Day
1    1     41     190      7.4   67    5      1
2    2     36     118      8     72    5      2
3    3     12     149      12.6  74    5      3
4    4     18     313      11.5  62    5      4
5    5     NA     NA       14.3  56    5      5
6    6     28     NA       14.9  66    5      6

Let's replace NA in SAS

If we have to remove NA from just one variable, we can use compress, which deletes the listed characters (here N and A) from each value


data import2;
set import;
Ozone=compress(Ozone,"NA","");
Run;

proc contents data=import2;
run;

However, if we want to replace it in all character variables, we use a SAS array with a do loop and an if statement. We then convert the character variables one by one into numeric variables and drop the originals. A point to note is to avoid a dot in variable names in SAS, which is why the code below refers to the Solar.R column as SolarR. The drop statement removes a variable from the SAS dataset.

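data import2;
set import;

/* Loop over every character variable and blank out the text "NA" */
array Chars[*] _character_;
do i = 1 to dim(Chars);
   Chars[i] = strip(Chars[i]);
   if Chars[i] = "NA" then Chars[i] = " ";  /* a blank is the missing value for character variables */
end;
drop i;

/* Convert each character variable to numeric and drop the original */
Ozone2=input(Ozone,2.);
drop Ozone;

Solar_R2=input(SolarR,3.);
drop SolarR;

var2=input(VAR1,8.);
drop VAR1;
run;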

proc means data=import2 n nmiss mean;
run;

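Using n and nmiss we can count the non-missing and missing values of each variable. PROC MEANS ignores missing values when it computes statistics such as the mean (similar to na.rm = TRUE in R). Note that the Ozone2 mean reported here (35.30) is lower than the 42.13 that R reports for the same data. PROC CONTENTS showed that Ozone was imported as a character variable of length 2, so three-digit readings such as 168 were truncated to two digits on import. Letting PROC IMPORT scan more rows before it chooses lengths (the GUESSINGROWS statement) and using a wider informat in input avoids this.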

The MEANS Procedure

Variable	N	N Miss	Mean
Wind	153	0	9.9575163
Temp	153	0	77.8823529
Month	153	0	6.9934641
Day	153	0	15.8039216
Ozone2	116	37	35.3017241
Solar_R2	146	7	185.9315068
var2	0	153

If we run the same PROC MEANS step requesting only the mean (without n and nmiss), the output no longer shows how many values are missing (in R, is.na is used to find them).

The MEANS Procedure

Variable	Mean
Wind	9.9575163
Temp	77.8823529
Month	6.9934641
Day	15.8039216
Ozone2	35.3017241
Solar_R2	185.9315068
var2

Missing values are represented differently for numeric and character variables:

Missing Values	Representation in Data
Numeric	. (a single period)
Character	' ' (a blank enclosed in quotes)

To replace a missing value in a character variable you can test for the blank, for example:

if var=' ' then var='Unknown'; /* 'Unknown' is only a placeholder replacement value */

To delete the rows in which a variable is missing (similar to na.omit in R, which drops rows with a missing value in any column) we use the following SAS code:

data import21;
set import2;
if Ozone2=. then delete;
run;

proc means data=import21 n nmiss mean;
Run;

In the output below, Solar_R2 has only 111 non-missing values, down from 146: 35 of the deleted rows had a valid Solar_R2 reading. Therefore, we need to be careful about explicit deletion.

The MEANS Procedure

Variable	N	N Miss	Mean
Wind	116	0	9.8620690
Temp	116	0	77.8706897
Month	116	0	7.1982759
Day	116	0	15.5344828
Ozone2	116	0	35.3017241
Solar_R2	111	5	184.8018018
var2	0	116

To replace missing values with the mean of the variable you can use the following:

proc stdize data=import2 reponly method=mean out=import3;
var ozone2 solar_r2;
Run;

proc print data=import2 (obs=6);
var ozone2 solar_r2;
run;
Obs     Ozone2     Solar_R2
1       41         190
2       36         118
3       12         149
4       18         313
5       .          .
6       28         .

proc print data=import3 (obs=6);
var ozone2 solar_r2;
run;


Obs     Ozone2     Solar_R2
1       41.0000    190.000
2       36.0000    118.000
3       12.0000    149.000
4       18.0000    313.000
5       35.3017    185.932
6       28.0000    185.932

3.3.2 Missing Values in R

Let’s do this in R

> data(airquality)
> summary(airquality)
     Ozone           Solar.R           Wind             Temp           Month            Day
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000   Min.   : 1.0
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000   1st Qu.: 8.0
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000   Median :16.0
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993   Mean   :15.8
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000   Max.   :31.0
 NA's   :37       NA's   :7

We first compute the mean, and then compute it again with missing values ignored using na.rm = T. We also count the total number of missing values using is.na. In R, as we have mentioned, missing values are denoted by NA.

> mean(airquality$Ozone)
[1] NA
Using na.rm=T we ignore missing values (in R they are NA, in SAS they are . )
> mean(airquality$Ozone,na.rm=T)
[1] 42.12931
> table(is.na(airquality))
FALSE TRUE 
 874 44
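
The table above gives only the grand total of 44 missing cells. To see which columns they sit in, is.na can be combined with colSums and colMeans. This is a short sketch in base R; the names na_per_column and na_share are illustrative.

```r
data(airquality)

# Number of missing values in each column
na_per_column <- colSums(is.na(airquality))

# The same information as a percentage of rows, rounded to one decimal place
na_share <- round(100 * colMeans(is.na(airquality)), 1)

print(na_per_column)   # Ozone has 37, Solar.R has 7, the other columns have 0
```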

We can delete all rows that contain a missing value by using na.omit

> airquality2=na.omit(airquality)

> str(airquality)

'data.frame': 153 obs. of 6 variables:
$ Ozone     : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R   : int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind      : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp      : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month     : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day       : int 1 2 3 4 5 6 7 8 9 10 ...

> str(airquality2)

'data.frame': 111 obs. of 6 variables:
$ Ozone    : int 41 36 12 18 23 19 8 16 11 14 ...
$ Solar.R  : int 190 118 149 313 299 99 19 256 290 274 ...
$ Wind     : num 7.4 8 12.6 11.5 8.6 13.8 20.1 9.7 9.2 10.9 ...
$ Temp     : int 67 72 74 62 65 59 61 69 66 68 ...
$ Month    : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day      : int 1 2 3 4 7 8 9 12 13 14 ...
- attr(*, "na.action") = 'omit' Named int 5 6 10 11 25 26 27 32 33 34 ...
..- attr(*, "names") = chr "5" "6" "10" "11" ...
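
na.omit removes a row if any of its columns is missing, which is stricter than deleting on a single variable as the SAS example did. The sketch below compares the two in base R, with illustrative object names; complete.cases is the logical test that corresponds to na.omit.

```r
data(airquality)

# Rows with no missing value in any column (the rows na.omit keeps)
complete_rows <- airquality[complete.cases(airquality), ]

# Rows where only Ozone is required to be present (mirrors the SAS delete step)
ozone_only <- airquality[!is.na(airquality$Ozone), ]

nrow(complete_rows)                 # 111
nrow(ozone_only)                    # 116
sum(is.na(ozone_only$Solar.R))      # 5 rows still lack Solar.R
```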

We can use a conditional function to replace missing values with the median. In ifelse, the first argument is the condition, the second is the value returned where the condition is true, and the third is the value returned where it is false. Here the condition is is.na, which checks for a missing value. Where is.na is true, we replace the value with the median of the variable (computed ignoring missing values); where it is false, we keep the original value. This is similar to proc stdize, except that the SAS example above replaced missing values with the mean rather than the median.

> data("airquality")

> summary(airquality$Ozone)

 Min.  1st Qu. Median  Mean   3rd Qu.  Max.    NA's
 1.00  18.00   31.50   42.13  63.25    168.00  37

> airquality$Ozone=ifelse(is.na(airquality$Ozone),median(airquality$Ozone,na.rm = T),airquality$Ozone)

> summary(airquality$Ozone)
 Min.  1st Qu.  Median  Mean  3rd Qu. Max.
 1.00  21.00    31.50   39.56 46.00   168.00
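
The same replacement can be applied to every numeric column at once while leaving the original data frame untouched. This is a minimal sketch in base R; the helper impute_median and the copy airquality_filled are hypothetical names introduced for illustration.

```r
# Hypothetical helper: replace NA with the median of the observed values;
# non-numeric columns are returned unchanged.
impute_median <- function(x) {
  if (!is.numeric(x)) return(x)
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

data(airquality)   # reload the original data, including its missing values

# Work on a copy so the original data frame is not overwritten
airquality_filled <- airquality
airquality_filled[] <- lapply(airquality, impute_median)

summary(airquality_filled$Ozone)
```

Keeping the filled data in a separate object makes it easy to compare results with and without imputation.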

3.4 Data Cleaning

Here we clean various types of errors in data values, in both SAS and R.

3.4.1 Data Cleaning in SAS

Our first step is to input the data. We then write SAS code to remove these kinds of errors and create a dataset that is useful for analysis.

We first input the data using SAS datalines (this corresponds to building a vector with R's c function).

data money;
infile datalines ;
input name$ ;
datalines;
'50000'
'50,000'
'$50000'
50000
'50000'
'50000.00'
;

run;

We check if data was input correctly using proc print

proc print data=money;
run;

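Note that /* test */ marks a comment in SAS, just as # test marks a comment in R. The last value is printed as '50000.0 because a character variable read with list input has a default length of 8, so the ten-character value is truncated.

/* Obs     name */
/* 1     '50000' */
/* 2     '50,000' */
/* 3     '$50000' */
/* 4     50000 */
/* 5     '50000' */
/* 6     '50000.0 */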

Here we use the compress function to get rid of junk characters such as , $ and ' (just like gsub in R).

We also use the input function to convert a character value to a numeric value, just as as.numeric does in R.

However, in SAS we have to specify formats and informats to convert data from one type to another. In R, we use packages such as lubridate and stringr, and the as.* family of functions, to do so.

data money2;
set money;
name2=compress(name,",$'");
name3 = input(name2,6.);
run;

Proc contents is like str in R to check data

proc contents data=money2;
run;

Proc means is like summary in R (however, in SAS we select a single variable with the var statement, whereas in R we use the $ operator)

proc means data=money2;
var name3;
run;

The output shows our data cleaning was successful.

The MEANS Procedure

Analysis Variable: name3
N     Mean        Std Dev    Minimum     Maximum
6     50 000.00   0          50 000.00   50 000.00

3.4.2 Data Cleaning in R

> money=c('50000','50,000','$50000',50000,'50000.00')

> mean(money)
[1] NA
Warning message:
In mean.default(money) : argument is not numeric or logical: returning NA

> str(money)
 chr [1:5] "50000" "50,000" "$50000" "50000" "50000.00"

> money=gsub(',','',money)
> money
[1] "50000" "50000" "$50000" "50000" "50000.00"

> money=gsub('\\$','',money)
> money
[1] "50000" "50000" "50000" "50000" "50000.00"

> money=as.numeric(money)
> money
[1] 50000 50000 50000 50000 50000

> mean(money)
[1] 50000
> str(money)
 num [1:5] 50000 50000 50000 50000 50000

Using the gsub function in R, it is easy to clean data, just as we used compress in SAS. In the SAS example we created a new variable at each step (name2, name3) so that the original data is not lost or changed; in R the same can be done by assigning each result to a new name instead of overwriting money. Data cleaning is quite a simple process in both R and SAS thanks to the built-in functions and their documentation. What adds to the complexity is the volume and variety of the data, both big and small. Data cleaning is also an intensive manual task, as data errors can be of many types. By one commonly quoted estimate, as much as 80% of the time in a data science project is spent on data hygiene.
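
The two gsub calls can also be combined into one by using a character class, and the result can be stored under a new name so the raw values are kept. This is a sketch under the same assumptions as the example above (only $ and , appear as junk characters); the names money_raw, money_stripped and money_num are illustrative.

```r
money_raw <- c('50000', '50,000', '$50000', 50000, '50000.00')

# Inside square brackets $ is a literal character, so no escaping is needed
money_stripped <- gsub('[$,]', '', money_raw)

# Convert the cleaned strings to numbers; money_raw is left unchanged
money_num <- as.numeric(money_stripped)

mean(money_num)   # 50000
```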

3.5 Quiz Questions

How do you represent missing values in SAS?
How do you represent missing values in R?
How will you replace a missing value by mean in R?
How will you replace a missing value by mean in SAS?
How will you clean data with junk values like $ and , in R?
How will you clean data with junk values like $ and , in SAS?
How do you check variable types in SAS?
How do you check variable types in R?
How do you print only one variable in SAS?
How do you print only one variable in R?

Quiz Answers

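. (a single period) for numeric variables and ' ' (a blank) for character variables
NA
Using ifelse
proc stdize
gsub
compress
proc contents
str

Use var in proc print like

proc print data=datasetname;
var variablename;
run;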

Use $ operator like
```
datasetname$variablename
```
