So far, we have learnt how to import data and create our own data in R and SAS, along with data inspection and cleaning. Here, we will learn how to work with different kinds of data, such as dates, strings and numbers, and how to convert one data format to another in both R and SAS. This includes handling numeric data, manipulating string/character variables (e.g. extracting a substring of a string variable), handling different date formats, performing numeric calculations with dates (e.g. the difference between two dates) and working with categorical data.
Integral types represent only whole numbers (positive, negative, and zero), and nonintegral types represent numbers with both integer and fractional parts.
Assigning Numeric Values to Variables
We also use the input function to convert a character variable to a numeric variable, but in SAS we need to specify the correct informat as well.
COMMA ELIMINATION USING INFORMAT
DOLLAR AND COMMA ELIMINATION USING INFORMAT
NUMERIC TO CHARACTER USING PUT
Using R As Calculator
Assignment Of Numeric Values
Arithmetic with Numeric Variables
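As a minimal sketch of assigning numeric values and doing arithmetic with them in R (the variable names x and y and their values are illustrative, not taken from the original listing):

```r
# Assign numeric values to two illustrative variables
x <- 7
y <- 2

total      <- x + y    # addition
difference <- x - y    # subtraction
product    <- x * y    # multiplication
quotient   <- x / y    # division (3.5)
int_div    <- x %/% y  # integer division
remainder  <- x %% y   # modulo
power      <- x ^ y    # exponentiation

print(c(total, difference, product, quotient, int_div, remainder, power))
```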
Numeric Vector
Naming A Vector
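A short sketch of building a numeric vector and naming its elements; the vector sales and its values are made up for illustration:

```r
# A numeric vector built with c(); the values are hypothetical
sales <- c(120, 95, 140)

# Attach a name to each element
names(sales) <- c("Jan", "Feb", "Mar")

print(sales["Feb"])  # look up an element by name
print(sum(sales))    # arithmetic still works on a named vector
```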
Convert Numeric to String
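One way to do this in base R is as.character; the sketch below uses an illustrative value n:

```r
n <- 42.5                # a numeric value
s <- as.character(n)     # convert it to a character string

print(class(n))          # "numeric"
print(class(s))          # "character"
print(is.numeric(s))     # FALSE: s can no longer be used in arithmetic
```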
Dates are a special case of numeric data: they are stored as numbers but can be displayed in many different formats. Date variables can therefore pose a challenge in data management. Date data are critical in industries such as finance, telecom and sales. R and SAS provide several options for dealing with date and datetime data.
A SAS date value is a value that represents the number of days between 1 January 1960 and a specified date. Dates before 1 January 1960 are negative numbers; dates after it are positive numbers. Various SAS language elements handle SAS date values: functions, formats and informats.
READING DATE
The informat used here is date9., which reads dates written as a two-digit day, the first three letters of the month and a four-digit year (for example, 15JAN2020), making nine characters in total.
DATE9. is an INFORMAT. In an INPUT statement, it provides the SAS interpreter with a set of translation commands it can send to the compiler to turn your text into the right numbers, which will then look like a date once the right FORMAT is applied.
INFORMAT tells the compiler how to read data while FORMAT tells the compiler how to write data. FORMATs are just visible representations of numbers (or characters).
Changing Date Formats
By default, the ddmmyy. format writes eight characters (dd/mm/yy, with a two-digit year). To show the year with century, use ddmmyy10.
Calculating Differences Between Dates
Here, today() in SAS is like Sys.Date() in R. We simply subtract dates in SAS (instead of using difftime as in R).
Using the INTCK Function
We can use datepart to extract the date from a datetime value and intck to count the number of intervals (such as days, months or years) between two dates.
Having your dates in the proper format allows R to know that they are dates, and what calculations it should perform on them.
The built-in as.Date function handles dates (without times). It accepts a variety of input formats through the format = argument.
The default format is a four‐digit year, followed by a month, then a day, separated by either dashes or slashes.
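As a small sketch of the default behaviour, both separators below are parsed without a format argument (the dates themselves are arbitrary examples):

```r
d1 <- as.Date("2020-01-15")  # year-month-day separated by dashes
d2 <- as.Date("2020/01/15")  # year/month/day separated by slashes

print(class(d1))  # "Date"
print(d1 == d2)   # TRUE: both strings describe the same day
```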
Consider another example of using the format argument of the as.Date function.
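A possible illustration, using made-up input strings: when the text is not in year-month-day order, the format argument describes the layout with conversion codes such as %d (day), %m (month), %Y (four-digit year) and %y (two-digit year).

```r
# Day/month/year with a four-digit year
d3 <- as.Date("15/01/2020", format = "%d/%m/%Y")

# Month.day.year with a two-digit year
d4 <- as.Date("01.15.20", format = "%m.%d.%y")

print(d3)        # printed in the default year-month-day form
print(d3 == d4)  # TRUE
```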
The tz option is used to specify a time zone.
CONVERTING DATE TO NUMERIC VALUE
R stores dates using 1 January 1970 as the origin. When R looks at dates as integers it calculates the days passed since 1 January 1970.
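The day count can be checked directly with as.numeric; the dates below are chosen only to make the arithmetic easy to verify:

```r
origin_day <- as.numeric(as.Date("1970-01-01"))  # 0: the origin itself
later_day  <- as.numeric(as.Date("1970-01-10"))  # 9: nine days after the origin
before_day <- as.numeric(as.Date("1969-12-31"))  # -1: one day before the origin

print(c(origin_day, later_day, before_day))
```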
CONVERTING DATE TO CHARACTER VALUE
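A minimal sketch of two base R ways to turn a Date into text, as.character and format (the date d is an arbitrary example):

```r
d <- as.Date("2020-01-15")

iso_text  <- as.character(d)        # "2020-01-15"
uk_text   <- format(d, "%d/%m/%Y")  # "15/01/2020"
year_text <- format(d, "%Y")        # "2020"

print(c(iso_text, uk_text, year_text))
print(class(uk_text))               # "character"
```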
USING Sys.Date FUNCTION
To get the current date, the Sys.Date function will return a Date object.
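For instance (the output depends on the day the code is run, so none is shown; the reference date 2020-01-01 is arbitrary):

```r
today <- Sys.Date()               # current date as a Date object
print(class(today))               # "Date"
print(format(today, "%d/%m/%Y"))  # the same date shown as day/month/year

# Number of days elapsed since an arbitrary reference date
days_since_2020 <- as.numeric(today - as.Date("2020-01-01"))
print(days_since_2020)
```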
Using the as.POSIXct Function
Date and time together are stored as POSIXct values, created with the as.POSIXct function. Without a format argument, it parses only standard layouts such as YYYY-MM-DD HH:MM:SS; strptime is used for other layouts.
strptime deals with the variations in formatting and ordering of the date pieces, producing a value that as.POSIXct can recognize.
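A sketch of both routes, using an arbitrary timestamp and fixing the time zone to UTC so the result does not depend on the machine's settings:

```r
# Standard layout: parsed directly
t1 <- as.POSIXct("2020-01-15 13:45:00", tz = "UTC")

# Non-standard layout: describe it to strptime, then convert
parsed <- strptime("15/01/2020 13:45", format = "%d/%m/%Y %H:%M", tz = "UTC")
t2 <- as.POSIXct(parsed)

print(t1)
print(t1 == t2)  # TRUE: both describe the same instant
```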
Giving Time Differences Between Two Dates using difftime
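As an illustration with two arbitrary dates, difftime returns the gap in the requested units, and as.numeric strips the units when a plain number is needed:

```r
start_date <- as.Date("2020-01-01")
end_date   <- as.Date("2020-03-01")

gap_days  <- difftime(end_date, start_date, units = "days")
gap_weeks <- difftime(end_date, start_date, units = "weeks")

print(gap_days)              # 60 days (2020 is a leap year: 31 + 29)
print(gap_weeks)
print(as.numeric(gap_days))  # 60 as a plain number
print(end_date - start_date) # plain subtraction also returns a difftime
```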
Lubridate Package
Date handling in R is greatly simplified and augmented by the lubridate package. It has many functions, but the main parsing functions are named after the order of the date components, such as dmy, mdy and ymd, which makes them easy to use.
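A brief sketch, assuming the lubridate package is installed (install.packages("lubridate") otherwise); the input strings are made up, and each function is chosen to match the order of the pieces in its input:

```r
library(lubridate)  # assumption: the package is available

date_ymd <- ymd("2020-01-15")  # year, month, day
date_dmy <- dmy("15/01/2020")  # day, month, year
date_mdy <- mdy("01/15/2020")  # month, day, year

print(class(date_ymd))                              # "Date"
print(date_ymd == date_dmy && date_dmy == date_mdy) # TRUE
```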
Giving Time Differences using Lubridate Package
A vast amount of data is in the form of text, particularly emails, documents and the Internet. The ability to manipulate string data types is critical to a data scientist.
A “string” is a collection of characters that makes up one element of a vector. You can recognize a string because it will usually be enclosed in (double) quotation marks.
Assigning String Values to Variables
Using the SUBSTRN Function
Using the TRIMN Function
CONCATENATION OF STRING
TO CONVERT CHARACTER TO NUMERIC
Replace One Value of String with Another: here, the translate function replaces a with &
Eliminating Whitespace using compress
In R, a piece of text is represented as a sequence of characters (letters, numbers, and symbols). The data type R provides for storing sequences of characters is character. Formally, the mode of an object that holds character strings in R is “character”.
You express character strings by surrounding text within double quotes or single quotes.
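A minimal sketch with illustrative values, showing that both quote styles create the same kind of object and that one style can be nested inside the other:

```r
s1 <- "data science"  # double quotes
s2 <- 'data science'  # single quotes

print(identical(s1, s2))  # TRUE: the quote style does not matter
print(mode(s1))           # "character"

# Use one quote style outside to keep the other inside the string
quoted <- 'She said "hello"'
print(quoted)
```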
ASSIGNING STRING VALUE TO AN OBJECT:
SPECIAL CHARACTERS IN STRINGS
PRINT FORMATTED STRING
COUNTING CHARACTERS IN STRING
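In base R, nchar counts the characters in a string, whereas length counts the elements of the vector; a short illustration with an arbitrary string:

```r
phrase <- "data science"

n_chars    <- nchar(phrase)   # 12: the space is counted too
n_elements <- length(phrase)  # 1: a single string is a vector of length one

print(n_chars)
print(n_elements)
```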
VECTOR STRINGS
USING EXTRACTION AND ASSIGNMENT OPERATORS
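A sketch that reads this heading as the square-bracket operators applied to a character vector (an assumption, since the original listing is not shown); the vector fruits is illustrative:

```r
# A character vector: each element is one string
fruits <- c("apple", "banana", "cherry")

second <- fruits[2]        # extraction by position
fruits[2] <- "blueberry"   # assignment replaces that element
subset <- fruits[c(1, 3)]  # several elements at once

print(second)
print(fruits)
print(subset)
```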
CONVERTING NUMERIC TO STRINGS
PARSING STRINGS TOGETHER
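One common way to join strings in base R is paste (and paste0, which uses no separator); the inputs below are illustrative:

```r
joined_space <- paste("data", "science")               # "data science"
joined_dash  <- paste("data", "science", sep = "-")    # "data-science"
quarters     <- paste0("Q", 1:4)                       # "Q1" "Q2" "Q3" "Q4"
collapsed    <- paste(c("a", "b", "c"), collapse = "+") # "a+b+c"

print(joined_space)
print(joined_dash)
print(quarters)
print(collapsed)
```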
BREAKING A STRING AT A DELIMITER
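A small sketch using strsplit on a made-up string. The function returns a list with one element per input string, so [[1]] pulls out the pieces; fixed = TRUE treats the delimiter literally rather than as a regular expression.

```r
parts <- strsplit("2020-01-15", split = "-", fixed = TRUE)

pieces <- parts[[1]]  # "2020" "01" "15"
print(class(parts))   # "list"
print(pieces)
print(pieces[1])      # "2020"
```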
ELIMINATING WHITE SPACE USING DescTools AND trimws
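The sketch below uses only base R (trimws and gsub), not DescTools, and an illustrative padded string:

```r
padded <- "   hello world   "

both_sides <- trimws(padded)                     # "hello world"
left_only  <- trimws(padded, which = "left")     # "hello world   "
no_spaces  <- gsub(" ", "", padded, fixed = TRUE) # "helloworld"

print(both_sides)
print(left_only)
print(no_spaces)
```

trimws removes only leading and trailing white space; removing interior spaces needs a substitution such as the gsub call.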
STRING TO NUMERIC
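A hedged illustration with made-up strings: as.numeric converts text that looks like a number, and returns NA (with a warning) when it does not, so characters such as commas have to be removed first.

```r
plain <- as.numeric("3.14")  # converts cleanly

# A thousands separator is not understood: the result is NA
with_comma <- suppressWarnings(as.numeric("1,234"))

# Strip the comma first, then convert
cleaned <- as.numeric(gsub(",", "", "1,234", fixed = TRUE))

print(plain)
print(with_comma)
print(cleaned + 1)  # arithmetic now works
```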
Note that R has another data type called factor, which is used for categorical variables. Factors may look like string data, but they usually have only a few levels and are identified as factor variables in R. In SAS, they would be stored as character variables and analyzed using PROC FREQ.
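As a minimal sketch (the vector grade and its levels are made up), a character vector can be turned into a factor with an explicit level order, and table then gives the kind of counts that PROC FREQ reports:

```r
grade <- c("low", "high", "medium", "low", "high")

# Convert to a factor with a chosen order of levels
grade_f <- factor(grade, levels = c("low", "medium", "high"))

print(levels(grade_f))      # "low" "medium" "high"
print(table(grade_f))       # frequency of each level
print(as.integer(grade_f))  # underlying integer codes: 1 3 2 1 3
```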