2
Data Input, Import and Print

2.1 Importing Data

Importing data is the first step in any analysis. The data must be reliable and relevant, and it must be imported correctly, because the computer processes exactly what you give it. If the imported data is faulty, any analysis performed on it will be erroneous and misleading.

This concept is commonly known as GIGO (Garbage In, Garbage Out). The input step is therefore one of the most important steps in the data science pipeline. Both R and SAS can read data from files or from data connections. R imports datasets with functions, whereas SAS uses procedures.

2.1.1 Packages in R

Data import in R relies on certain packages and functions. To use a package, we first need to install it.

A package is installed with the install.packages() function, giving the package name in quotes.

After installation, you must load the package before you can use it. Loading a package means making it active in the current session. To load a package, use the library() function.

An installed package is updated with the update.packages() function.

Note that we install a package only once, update it occasionally, and load it every time we begin an R session. To unload a package, we use the detach() function with the name prefixed by package:.

To uninstall a package, we use the remove.packages() function.
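The package lifecycle described above can be sketched as follows. This is a minimal illustration, not the book's original listing: the helper name ensure_package is hypothetical, and splines is used only because it ships with R, so the sketch runs without downloading anything. The package name readr in the comments is likewise just an example.

```r
# Hypothetical helper: install a package only if it is not already present.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # done once per machine; needs internet access
  }
  invisible(pkg)
}

ensure_package("splines")   # already present, so nothing is downloaded

# Load the package; repeat this at the start of every session.
library(splines)
loaded_after_library <- "package:splines" %in% search()

# Unload the package from the current session.
detach("package:splines")
loaded_after_detach <- "package:splines" %in% search()

# Occasional maintenance, left as comments because both change the library:
#   update.packages(oldPkgs = "readr", ask = FALSE)   # update
#   remove.packages("readr")                          # uninstall
```

The search() call lists the packages attached to the session, which is a simple way to confirm that loading and unloading worked.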

2.2 Importing Data in SAS

Here we study multiple ways to input data.

To save disk space in SAS, you can place the following statement at the beginning of your program: options compress = yes;

2.2.1 Data Input in SAS

Entering data manually in SAS is straightforward, and a few examples are enough to learn how to do it.

The INPUT statement reads raw data from instream data lines or external files into a SAS dataset. Data input is the first step of every analysis; without a dataset there can be no analysis of any kind. Data input can be done in various forms. Let's look at a few examples of data input in SAS.

In the examples given below, we have input normal numerical data, strings, names etc.

The code above creates a dataset named first.

  • The infile statement identifies the source of the raw data (an external file, or datalines for instream data) and sets the options that control how it is read.
  • The input statement specifies the names, number and type of the variables being read.
  • The datalines statement specifies that the following lines contain the data to be read.
  • The keyword “cards” can also be used instead of “datalines.”
  • The $ sign specifies that the variable it follows is a character variable.
  • The missover option prevents the data step from going to the next line if it does not find values for all variables of the input statement in the current record; the remaining variables are set to missing.
  • The dsd option treats commas as separator characters and reads two consecutive commas as a missing value.

The code for importing data differs between the two languages: in R we call functions, whereas in SAS we generally call procedures. R is an object-oriented language.

Using the Import Wizard is an easy and straightforward way to import existing data with well-behaved formatting into SAS. There are other methods for importing data into SAS, such as proc import, or entering raw observations into SAS itself to create a new dataset.

These methods of importing or creating data can give you greater control over how to read variables (the informats), how to write the variables (the formats), how to parse the data (delimited, aligned, repetition, etc.), and more.

2.2.2 Using Proc Import to Import a Raw File

Here we use the proc import step to import a raw data file and save it as a SAS dataset. The dbms = option specifies the file type, e.g. CSV or XLS.

The statement getnames = yes specifies that the first row contains column names.

Note: The type of dataset created (temporary or permanent) depends on the name you specify in the out = option.

A permanent dataset has to be referenced by a two-level name, library_name.data_set_name, whereas a temporary dataset has just a one-level name.

2.2.3 Creating a temporary dataset from a permanent one using “set”

2.3 Importing Data in R

There are a number of ways to import data into R, and several formats are available:

  1. From CSV files using the readr or data.table package
  2. From Excel to R
  3. From SAS to R
  4. From SPSS to R
  5. From Stata to R, and more
  6. From Relational Databases (RDBMS) using RODBC
  7. From json files using jsonlite package

https://rforanalytics.wordpress.com/useful‐links‐for‐r/odbc‐databases‐for‐r

Let us explore some of the ways to import data in R.

2.3.1 Importing from Comma Separated Value (CSV) Files

There are three functions which can be used to import csv files in R:

  1. read.csv() or read.table(), which are in the utils package that is installed and loaded by default.
  2. read_csv(), which is in the readr package.
  3. fread(), which is in the data.table package.

fread() and read_csv() are generally the fastest of these.

You can use the system.time() function to verify this on your own files.
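A timing comparison could look like the sketch below. It is an illustration under stated assumptions: the CSV file is synthetic and written to a temporary location, the object names are made up, and the readr and data.table readers are only timed if those packages happen to be installed. Actual timings depend on the machine and the file, so no figures are shown.

```r
# Write a throwaway CSV file to time against (synthetic data, illustration only).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:50000, value = rnorm(50000)),
          csv_path, row.names = FALSE)

# Base R reader from the utils package.
time_base <- system.time(df_base <- read.csv(csv_path))
print(time_base)

# readr reader, timed only if the package is available.
if (requireNamespace("readr", quietly = TRUE)) {
  print(system.time(df_readr <- readr::read_csv(csv_path)))
}

# data.table reader, timed only if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(system.time(df_dt <- data.table::fread(csv_path)))
}
```

The "elapsed" entry of each result is the wall-clock time in seconds; the differences become more visible as the file grows.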

2.3.2 Importing from Excel Files

We need to install the readxl package and use its read_excel() function to import .xls or .xlsx files.

Example: to import sheet 1 of an Excel file with the first row as column names, call read_excel() with sheet = 1 and col_names = TRUE.

Instead of the sheet number, we can give the sheet name in double quotes to specify the sheet we want from any Excel file.
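Both ways of selecting a sheet can be sketched as follows. To stay self-contained, the sketch assumes the example workbook datasets.xlsx that ships with readxl (located with readxl_example()) and assumes its first sheet is named "iris"; with your own data you would pass the path to your file instead. The code runs only if readxl is installed.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Path to an example workbook bundled with readxl (an assumption of this sketch).
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # Select the sheet by position; the first row supplies the column names.
  by_number <- readxl::read_excel(xlsx_path, sheet = 1, col_names = TRUE)

  # Select the same sheet by its name in double quotes.
  by_name <- readxl::read_excel(xlsx_path, sheet = "iris", col_names = TRUE)

  print(head(by_number))
}
```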

2.3.3 Importing from SAS

The read.sas7bdat() function from the sas7bdat package is used to import .sas7bdat files.

2.3.4 Importing from SPSS and STATA

We use the read.spss() and read.dta() functions from the foreign package to import SPSS and Stata files respectively.
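A minimal sketch of both readers follows. Because no sample files accompany this section, the Stata file is created on the spot by writing the built-in mtcars data with write.dta(), and the SPSS example assumes that the foreign package ships a sample file named electric.sav; that part is skipped if the file is not found. All object names are illustrative.

```r
if (requireNamespace("foreign", quietly = TRUE)) {
  # Stata: write a built-in data frame to a .dta file, then read it back.
  dta_path <- tempfile(fileext = ".dta")
  foreign::write.dta(mtcars, dta_path)
  stata_data <- foreign::read.dta(dta_path)
  print(head(stata_data))

  # SPSS: sample file assumed to be bundled with the foreign package.
  sav_path <- system.file("files", "electric.sav", package = "foreign")
  if (nzchar(sav_path)) {
    spss_data <- foreign::read.spss(sav_path, to.data.frame = TRUE)
    print(head(spss_data))
  }
}
```

Setting to.data.frame = TRUE makes read.spss() return a data frame rather than a list, which is usually the more convenient form for analysis.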

2.3.5 Assigning the Values Imported to a Data Object in R

Assignment in R uses the <- operator, with the general form object_name <- value.

To keep imported data, assign the result of the import function, such as read.csv(), to an object.
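As a small sketch of this, the code below writes a temporary CSV file (standing in for a real data file, and built from the first rows of the built-in iris data) and assigns the imported result to an object. The name mydata is illustrative.

```r
# General form: object_name <- value

# Create a small CSV file to import (a stand-in for a real data file).
csv_path <- tempfile(fileext = ".csv")
write.csv(head(iris), csv_path, row.names = FALSE)

# Assign the imported data to an object named mydata.
mydata <- read.csv(csv_path)
str(mydata)
```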

Similarly, data read using other functions can be assigned to R objects.

Note: Each of the import functions discussed above takes further parameters that control how the data is formatted during import.

To enter a value manually, we assign it directly to an object, for example a numeric value such as 10.

We can do the same for other types of data; string values must be enclosed in quotes (e.g. "ten").
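For instance, manual input of single values might look like this; the object names are made up for illustration.

```r
x <- 10           # numeric value
y <- "ten"        # string value, enclosed in quotes
is_done <- TRUE   # logical value

class(x)
class(y)
class(is_done)
```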

2.4 Providing Data Input

We can also create datasets, vectors or matrices by using the input value given by us.

2.4.1 Data Input in R

2.4.1.1 Using the c() function is the simplest way to create a vector in R

We can input numeric, date and string values as follows:
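A minimal sketch with made-up values: each call to c() combines its arguments into one vector, and as.Date() converts date strings in year-month-day form into Date values.

```r
heights <- c(150.5, 162, 171.2)                # numeric vector
cities <- c("Delhi", "Mumbai", "Pune")         # character (string) vector
visit_dates <- as.Date(c("2019-01-15", "2019-02-20", "2019-03-05"))  # dates

class(heights)
class(cities)
class(visit_dates)
```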

2.4.1.2 Providing missing values to the vector

NA in R signifies a missing value (in SAS, a missing numeric value is denoted by a single period).

The is.na() function is used to detect missing values in a vector.
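The following sketch, using invented values, puts NA into a vector and then locates and counts the missing entries.

```r
scores <- c(12, NA, 7, NA, 30)    # two values are missing

missing_flags <- is.na(scores)    # TRUE where a value is missing
n_missing <- sum(missing_flags)   # number of missing values

print(missing_flags)
print(n_missing)

# Many functions can skip missing values on request.
mean(scores, na.rm = TRUE)
```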

2.4.1.3 To Input multiple columns of data

This creates a data frame with two columns as follows:
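A two-column data frame of this kind can be sketched with data.frame(); the column names and values here are invented for illustration.

```r
people <- data.frame(
  name = c("Asha", "Ben", "Chen"),   # character column
  age = c(31, 45, 28)                # numeric column
)

print(people)
```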

Alternatively, we can create a matrix using the matrix() function:
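A minimal sketch of such a call, with values chosen only for illustration:

```r
# Six values arranged in three rows and two columns, filled column by column.
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

print(m)
```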

A call such as matrix(c(...), nrow = 3, ncol = 2) makes a matrix with the values in c() arranged in three rows and two columns, filled column-wise. Note: a vector or a matrix must hold values of a single type, but a data frame can hold columns of different types.

2.4.1.4 Using loops to input
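As one possible illustration of loop-based input (not the book's original listing), the sketch below pre-allocates a numeric vector and fills it one element at a time with a for loop. The vector name and the values stored are arbitrary.

```r
# Pre-allocate a vector of length 5, then fill it inside a loop.
squares <- numeric(5)
for (i in seq_along(squares)) {
  squares[i] <- i^2   # store the square of the position
}

print(squares)
```

Pre-allocating the vector avoids growing it on every iteration, which matters for long loops.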

2.5 Data Input in SAS

Entering data manually in SAS is straightforward, and a few examples are enough to learn how to do it.

The INPUT statement reads raw data from instream data lines or external files into a SAS dataset. Data input is the first step of every analysis; without a dataset there can be no analysis of any kind. Data input can be done in various forms. Let's look at a few examples of data input in SAS.

In the examples given below, we have input normal numerical data, strings, names etc.

This code creates a dataset named first.

  • The infile statement identifies the source of the raw data (an external file, or datalines for instream data) and sets the options that control how it is read.
  • The input statement specifies the names, number and type of the variables being read.
  • The datalines statement specifies that the following lines contain the data to be read.
  • The keyword “cards” can also be used instead of “datalines.”

The $ sign specifies that the variable it follows is a character variable.

The missover option prevents the data step from going to the next line if it does not find values for all variables of the input statement in the current record. The dsd option treats commas as separator characters.

2.6 Printing Data

After importing the data, the next important step is to print that data to have a look at the type of data you now have to analyze.

2.6.1 Print in SAS

Printing a dataset in SAS involves calling the print procedure. The following statement prints the whole dataset named ajaydat: proc print data = ajaydat; run;

To print only the first five observations of ajaydat, add the obs = dataset option: proc print data = ajaydat (obs = 5); run;

To print observations 10 to 20 of ajaydat, combine the firstobs = and obs = options: proc print data = ajaydat (firstobs = 10 obs = 20); run;

2.6.2 Print in R

In R, printing data does not need any additional package or an explicit function call. You simply type the name of the dataset and run it.

If you have read data into mydata and type the object name, mydata:

The whole of mydata is printed to the console.

With head(mydata, n = 1), only the first observation of mydata is printed to the console. The default value of n is 6.

With mydata[10:20, ], observations 10 to 20 are displayed.
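The three ways of printing can be sketched as follows. The data frame mydata here is a small synthetic one built only so that the example runs on its own.

```r
# Synthetic data frame with 25 observations (illustration only).
mydata <- data.frame(id = 1:25, value = (1:25) * 2)

# Typing the object name prints the whole dataset.
mydata

# Print only the first observation; without n, head() shows six rows.
first_row <- head(mydata, n = 1)
print(first_row)

# Print observations 10 to 20 by indexing the rows.
rows_10_to_20 <- mydata[10:20, ]
print(rows_10_to_20)
```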

2.7 Summary

Importing data in R requires a variety of functions for different types of files, whereas in SAS proc import is used with different options or parameters to import any type of file. Manual data input is done with the c() function in R and with a data step containing an input statement in SAS. In R, printing a dataset just requires typing the name of the dataset and running it, whereas SAS uses proc print to print any dataset.

2.8 Quiz Questions

  1. How will you load an installed package in R?
  2. Give three functions which can be used to import csv files in R.
  3. Which package contains read_csv() and fread() respectively?
  4. Which function in R can you use to measure the time taken by a code to execute?
  5. Which procedure in SAS is used to import raw data files?
  6. How can you create a temporary dataset from a permanent one in SAS using a data step?
  7. Which symbol is used to specify that a particular variable is a character variable in SAS?
  8. What is the missover option used for in the infile statement in a SAS data step?
  9. How will you print a data set in R?
  10. Which procedure is used to print a data set in SAS?

Quiz Answers

  1. library("package_name")
  2. read_csv(), fread(), read.csv()
  3. readr, data.table
  4. system.time()
  5. proc import
  6. Use a data step with a set statement that names the permanent dataset, e.g. data temp; set library_name.data_set_name; run;
  7. $
  8. Missover tells SAS not to jump to the next line if it does not find values for all variables.
  9. Just type the name of the object and run it.
  10. proc print