3
Data Inspection and Cleaning

3.1 Introduction

Data inspection and cleaning is the next important part of the data analysis pipeline. Before you start analysis, visualization or machine learning, the data to be analyzed should already be clean. Although machine learning, exploratory data analysis and data visualization take up more time in analytical education, in an actual data science project much more time is spent on data inspection and cleaning.

3.2 Data Inspection

Data inspection helps us confirm that the data import was executed correctly, that the data set has the expected length (rows) and breadth (columns), and that the variables (columns) are in the expected format.

3.2.1 Data Inspection in SAS

Let’s try this in SAS

  • Referring to a column is easier in SAS than in R

  • Referring to a row is more complex in SAS than R

3.2.2 Data Inspection in R

  • head gives the first six rows
  • names gives the names of the columns
  • dim gives the dimensions (rows, then columns)
  • str gives the structure of a data object (type of object, dimensions, variable names and variable types)
  • class gives the type of a data object (which is important in R, as there can be many different types of object)
  • summary gives a summary of the whole object, including numerical summaries, counts of missing values and frequencies of factor variables.
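The functions listed above can be tried on airquality, a data set that ships with base R. This is only a minimal sketch; the printed output is omitted here.

```r
# airquality is a built-in R data set, so no import step is needed
head(airquality)      # first six rows
names(airquality)     # column names
dim(airquality)       # number of rows, then number of columns
str(airquality)       # object type, dimensions, variable names and types
class(airquality)     # "data.frame"
summary(airquality)   # per-column summaries, including counts of NA values
```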

We can choose specific parts of a data frame by using square brackets, i.e.

  • airquality[2,3] gives the value in the second row and third column of airquality
  • airquality[2,] gives the second row and all columns of airquality
  • airquality[,3] gives all rows of the third column of airquality
  • airquality[R,C] gives the value in the Rth row and Cth column of airquality

airquality$Ozone gives the values of the Ozone column in airquality
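As a short illustration of the bracket and $ notation, the sketch below stores each selection in a variable. The variable names (cell, row2, col3, ozone) are chosen here for illustration only.

```r
cell  <- airquality[2, 3]    # single value: row 2, column 3 (the Wind column)
row2  <- airquality[2, ]     # one-row data frame with all columns
col3  <- airquality[, 3]     # numeric vector holding the whole third column
ozone <- airquality$Ozone    # the Ozone column, selected by name

# The third column is Wind, so selecting by position and by name agree
identical(col3, airquality$Wind)   # TRUE
```

Selecting by name with $ is usually easier to read than selecting by position, since the code still works if the column order changes.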

3.3 Missing Values

Data can be missing because of human data-entry error, formatting issues or incorrect syntax in the import code. Missing data is a problem because many calculations cannot be completed, or give misleading results, when values are absent.

There are three ways to handle missing data:

  1. Ignore it
  2. Delete it
  3. Replace it – Replace the missing value with one that does not change the numerical properties of the variable significantly. Replacing missing data is called missing value imputation. In its simplest form, we replace missing values with the mean or median of the variable. In more sophisticated forms, we use the correlation with other, more complete variables to impute them, or use machine learning algorithms to impute data from other variables. Specific packages, such as the mice package in R, help with more sophisticated missing value imputation.
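The idea of imputing from other, more complete variables can be sketched in base R with a simple linear regression. This is one possible illustration, not the method of any particular package: the choice of Temp and Wind as predictors of Ozone in the built-in airquality data set is an assumption made here, as are the variable names.

```r
# Fit a linear model on the rows where Ozone is observed;
# with R's default settings, lm() drops rows with missing values
fit <- lm(Ozone ~ Temp + Wind, data = airquality)

missing_ozone <- is.na(airquality$Ozone)

# Work on a copy so the original data frame is left untouched
imputed <- airquality
imputed$Ozone[missing_ozone] <- predict(fit, newdata = airquality[missing_ozone, ])

# Temp and Wind have no missing values, so every gap in Ozone is filled
sum(is.na(imputed$Ozone))
```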

3.3.1 Missing Values in SAS

3.3.2 Missing Values in R

Let’s do this in R

We first compute the mean, and then compute it again with missing values ignored by using na.rm = TRUE. We also count the total number of missing values using is.na. In R, as we have mentioned, missing values are represented by NA.
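A minimal sketch of these checks, assuming the Ozone column of the built-in airquality data set as the variable of interest:

```r
mean(airquality$Ozone)                 # NA, because the column has missing values
mean(airquality$Ozone, na.rm = TRUE)   # mean of the non-missing values only

sum(is.na(airquality$Ozone))           # number of missing values in one column
sum(is.na(airquality))                 # number of missing values in the whole data frame
colSums(is.na(airquality))             # number of missing values per column
```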

We can delete all rows that contain missing values by using na.omit.
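For example, on the built-in airquality data set (the name complete is chosen here for illustration):

```r
# Keep only the rows that have no missing value in any column
complete <- na.omit(airquality)

nrow(airquality)       # rows before deletion
nrow(complete)         # rows after deletion
sum(is.na(complete))   # 0, no missing values remain
```

Note that na.omit removes a whole row even if only one of its columns is missing, so the result can be much smaller than the original data.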

We can use a conditional function to replace missing values with the median. In ifelse, the first argument is the condition, the second is the value returned when the condition is true and the third is the value returned when the condition is false. We use is.na (which checks for missing values) as the condition. Where is.na is true, the value is missing, so we replace it with the median of the variable (computed ignoring missing values); where is.na is false, we keep the original value. This is similar to proc stdize in SAS.
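The ifelse approach can be sketched as follows, again assuming the Ozone column of airquality. The result is stored in a new variable (ozone_filled, a name chosen here) so that the original column is not overwritten.

```r
ozone <- airquality$Ozone

# Median of the observed values only
ozone_median <- median(ozone, na.rm = TRUE)

# Where the value is missing use the median, otherwise keep the original value
ozone_filled <- ifelse(is.na(ozone), ozone_median, ozone)

sum(is.na(ozone_filled))   # 0, every missing value has been replaced
```

Replacing ozone_median with mean(ozone, na.rm = TRUE) gives mean imputation instead.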

3.4 Data Cleaning

Here we clean various types of errors in a data set in both SAS and R.

3.4.1 Data Cleaning in SAS

In the first step, we input the data. We then use SAS code to remove these types of errors and create a dataset that is useful for analysis.

3.4.2 Data Cleaning in R

Using the gsub function in R, it is easy to clean data, just as we used compress in SAS. We create a new variable every time we make a replacement, so that the original data is not lost or changed. Data cleaning is quite a simple process in both R and SAS, thanks to the built-in functions as well as the documentation. What adds to the complexity is the volume and variety of the data, both big and small. Data cleaning is also an intensive manual task, as data errors can be of many types. It is estimated that in many data science projects as much as 80% of the time is spent on data hygiene.
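A minimal sketch of cleaning with gsub is shown below. The raw_price values are hypothetical and made up for illustration; a new variable is created at each step, as described above.

```r
# Hypothetical values containing junk characters ($ and ,)
raw_price <- c("$1,200", "$35", "$4,500.50")

# fixed = TRUE makes gsub treat "$" as a literal character,
# not as the regular-expression anchor for end of string
no_dollar <- gsub("$", "", raw_price, fixed = TRUE)
no_comma  <- gsub(",", "", no_dollar, fixed = TRUE)

# Convert the cleaned text to numbers
price <- as.numeric(no_comma)
price   # 1200.0 35.0 4500.5
```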

3.5 Quiz Questions

  1. How do you represent missing values in SAS?
  2. How do you represent missing values in R?
  3. How will you replace a missing value by mean in R?
  4. How will you replace a missing value by mean in SAS?
  5. How will you clean data with junk values like $ and , in R?
  6. How will you clean data with junk values like $ and , in SAS?
  7. How do you check variable types in SAS?
  8. How do you check variable types in R?
  9. How do you print only one variable in SAS?
  10. How do you print only one variable in R?

Quiz Answers

  1. A period (.) for numeric variables
  2. NA.
  3. Using ifelse
  4. proc stdize
  5. gsub
  6. compress
  7. proc contents
  8. str
  9. Use the var statement in proc print
  10. Use the $ operator, e.g. airquality$Ozone