Data cleaning and inspection is the next important part of the data analysis pipeline. Before starting analysis, visualization or machine learning, you should clean the data that is to be analyzed. Although machine learning, exploratory data analysis and data visualization take up more time in analytics education, in an actual data science project much more time is spent on data inspection and cleaning.
Data inspection helps us determine that the data import has executed correctly, that the dataset has the expected length (rows) and breadth (columns), and that the variables (columns) are in the expected format.
Let’s try this in SAS
In R, we can choose specific parts of a data frame by using square brackets, and a single column by name using the $ operator, e.g.
airquality$Ozone gives the values of the Ozone column in airquality
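As a minimal sketch of the inspection and selection steps described above, the following uses the airquality dataset that ships with base R. The variable names (ozone, first_cell, may_rows) are illustrative and not from the original text.

```r
# airquality ships with base R (datasets package), so no import step is needed.
data(airquality)

# Inspection: confirm the data has the expected shape and column types.
dim(airquality)      # number of rows and columns
str(airquality)      # type of each column
head(airquality, 3)  # first few rows

# Selection: $ picks one column by name; [rows, columns] picks by position or name.
ozone      <- airquality$Ozone                 # whole Ozone column as a vector
first_cell <- airquality[1, "Ozone"]           # row 1 of the Ozone column
may_rows   <- airquality[airquality$Month == 5, c("Ozone", "Temp")]  # May only
```

Leaving the row position empty, as in airquality[, "Ozone"], selects every row of that column.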
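Data can be missing because of human data input error, formatting issues or incorrect coding syntax during import. This is a problem because most analyses cannot be computed on incomplete data.
There are three ways to handle missing data: ignore the missing values when computing, delete the records that contain them, or replace them with an estimate such as the median.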
Let’s do this in R
We first compute the mean, which returns NA when the column contains missing values, and then compute the mean with missing values ignored using na.rm = TRUE. We also count the total missing values using is.na. In R, as we have mentioned, missing values are represented by NA.
We can delete all rows that contain missing values by using na.omit.
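A short sketch of these checks, assuming the built-in airquality dataset and illustrative variable names:

```r
data(airquality)

# Without na.rm, a single NA makes the whole mean NA.
mean_with_na <- mean(airquality$Ozone)

# na.rm = TRUE drops missing values before averaging.
mean_without_na <- mean(airquality$Ozone, na.rm = TRUE)

# is.na returns TRUE/FALSE per value; summing it counts the missing ones.
missing_ozone  <- sum(is.na(airquality$Ozone))
missing_by_col <- colSums(is.na(airquality))   # missing count for every column

print(mean_with_na)
print(mean_without_na)
print(missing_by_col)
```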
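The deletion approach might look like the sketch below, again assuming the built-in airquality dataset. Note that na.omit removes every row with a missing value in any column, not just the missing cells.

```r
data(airquality)

# Keep only rows where every column has a value.
complete_air <- na.omit(airquality)

# Compare row counts before and after to see how much data was dropped.
rows_before <- nrow(airquality)
rows_after  <- nrow(complete_air)
print(c(before = rows_before, after = rows_after))

# Confirm that no missing values remain.
print(anyNA(complete_air))
```

Storing the result in a new object keeps the original data frame available for comparison.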
We can use a conditional function to replace missing values with the median. In ifelse, the first argument is the condition, the second is the value returned when the condition is true and the third is the value returned when the condition is false. We set the condition to is.na (which checks for a missing value). If is.na is true, the value is missing and we replace it with the median of the variable (computed ignoring missing values); if is.na is false, we keep the original value. This is similar to PROC STDIZE in SAS.
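A minimal sketch of this median replacement, assuming the built-in airquality dataset. The new column name Ozone_imputed is illustrative; writing to a new column leaves the original values untouched.

```r
data(airquality)

# Median of the observed values only.
ozone_median <- median(airquality$Ozone, na.rm = TRUE)

# ifelse(condition, value if TRUE, value if FALSE), applied element by element.
airquality$Ozone_imputed <- ifelse(is.na(airquality$Ozone),
                                   ozone_median,        # missing: use the median
                                   airquality$Ozone)    # present: keep as is

# The imputed column should have no missing values left.
print(sum(is.na(airquality$Ozone_imputed)))
```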
Here we try to clean various types of errors in a data column in both SAS and R.
The first step in cleaning the data is to input it. SAS code can then remove these types of errors and create a dataset that is useful for analysis.
Using the gsub function in R, it is easy to clean data just as we used the COMPRESS function in SAS. We created a different variable each time we made a replacement so that the actual data is not lost or changed. Data cleaning is quite a simple process in both R and SAS thanks to the built-in functions as well as the documentation. What adds to the complexity is the volume and variety of the data, both big and small. You can also see that data cleaning is an intensive manual task, as data errors can be of many types. It is often estimated that as much as 80% of the time in a data science project is spent on data hygiene.
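The gsub-based cleaning described above can be sketched as follows. The raw_price values are hypothetical sample data invented for illustration, not the dataset used in the original walkthrough; each step writes to a new variable so the raw input is preserved.

```r
# Hypothetical messy input: currency symbols, commas, stray spaces and text.
raw_price <- c("$1,200", " 3,450 ", "$ 980", "2,015USD")

# gsub(pattern, replacement, x) replaces every match of the pattern in x.
no_symbols <- gsub("[$,]", "", raw_price)       # drop dollar signs and commas
no_letters <- gsub("[A-Za-z]", "", no_symbols)  # drop alphabetic characters
no_spaces  <- gsub("\\s+", "", no_letters)      # drop whitespace

# Convert the cleaned strings to numbers for analysis.
price <- as.numeric(no_spaces)
print(price)
```

Keeping the intermediate variables makes it easy to check which replacement changed which value.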