Chapter 3
Data Preparation and Other Tricks

Package(s): gdata, chron

Dataset(s): 100mrun.xls, Earthwormbiomass.xls, Bacteria.XLS, nerve.dat, atombombtest.xls, airquality, wine.dat, sat, faithful, 2005-10.txt.gz

3.1 Introduction

Data comes in various forms and complexities. It is a difficult task to even list the major/minor complexity levels of data preparation. The different forms of data, as well as complexity levels, may be known or unknown. Thus, it is difficult to have a standard set of guidelines for teaching data preparation methods.

Complexities arise on various counts, such as file types, files with missing data values, files with different kinds of attributes, etc. In some cases, it may be nearly impossible for the user to read the data properly without rewriting the code several times. In Section 3.2, we use the options available in the R function read.table to import data from external files which pose some difficulties. The options may vary to accommodate data problems, to skip a certain number of lines of a file, and so forth. A good practice while learning is to validate the data imported into R and check that it is along the expected lines. Thus, it helps to inspect the imported data using the functions head, tail, str, View, etc., and such functions will be illustrated in Section 3.4. The R functions aggregate, with, and assign are effective in carrying out data manipulation without the need to create new R objects. The use of these functions will be seen in Section 3.5. Time and date vectors need special consideration, and we will aid the reader with the arithmetic of it in Section 3.6. Dealing with text matter is one of the most intricate tasks, and the final technical Section 3.7 will consider the preliminary aspects of this area. Rscript is a very important backend R tool and helps to run programs without even opening the software. Furthermore, rich text editors are important too, and the author recommends using them whenever possible. This forms the topic of Section 3.8.

3.2 Manipulation with Complex Format Files

Section 2.5 introduced us to a method of reading data from external data files using the scan function. However, data is not always well organized. The word organized here has a rather vague meaning. After all, why would anybody write files which are not organized? It is helpful to understand that “organized” is used in a rather internal sense here: the data that is “internally” stored and managed by R gives consistent results if we read the data as its layout demands. We will illustrate with some examples.
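As a minimal sketch of the kind of options meant here, consider a file with a few leading lines of commentary, a non-default separator, and a special code for missing values. The file contents below are hypothetical and are passed through the text argument of read.table so that the example runs without any external file; with a real file, the path would take its place.

```r
# Hypothetical file contents: two comment lines, then a header and data,
# with missing values recorded as "NA" or "-99".
raw <- c("Survey export, version 2",
         "Generated by a hypothetical logger",
         "id;height;group",
         "1;172.5;A",
         "2;-99;B",
         "3;168.0;NA")

dat <- read.table(text = raw,        # 'text' stands in for a file path
                  skip = 2,          # ignore the two leading lines
                  header = TRUE,     # first remaining line holds the names
                  sep = ";",         # fields are semicolon-separated
                  na.strings = c("NA", "-99"),
                  stringsAsFactors = FALSE)
str(dat)
```

Checking the result with str, as done here, is the validation step recommended in the Introduction.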

Remark about gdata. This R package needs the Perl software too. In general, Linux and Macintosh operating systems have this software by default and the read.xls function works fine. However, Windows does not include Perl, and it needs to be installed before using the gdata package. Thus, the user will need to first download and install the software from http://www.perl.org/get.html. Furthermore, the perl option needs to be explicitly specified in the read.xls code. That is, we need a modification with newsome1 <- read.xls("100mrun.xls",perl="C:/Perl/bin/perl.exe",sheet=1,skip=1) to ensure that the data is properly imported in R. If the sheets have names, and not numbers, the code changes slightly to sheet='sheet name'.

We will check out one more example for the frequency table, which is predominantly useful for categorical data analysis.


3.3 Reading Datasets of Foreign Formats

Datasets may be available in formats other than csv or dat. It is also a frequent situation where we need to read data stored in xls (Microsoft Excel) format, sav (SPSS), ssd (SAS), dta (STATA), etc. For example, if the Youden-Beale data was saved in the first sheet of an xls file, we can use the command:

> yb <- read.xls("/.../youden.xls",header=TRUE,sheet=1)
Converting xls file to csv file... Done.
Reading csv file... Done.

Note that R first internally converts the xls file into a csv file, and then imports it into the session.

Similarly, we can read datasets of other software. Consider the rootstock.dta available from http://www.stata-press.com/data/r10/rootstock.dta. This is a popular dataset in the domain of multivariate statistics. The file rootstock.dta has been generated by the Stata software. We assume that this dataset has been downloaded from the web and stored in the current working directory. We can read this dataset in R using the read.dta function of the foreign package.

library(foreign)
rootstock <- read.dta("/.../rootstock.dta")

Section 1.4 listed many sources of data on the web. The laborious way of using them for analysis is to download them from the sources to the local hard disk and maybe to the current working directory. The technical way is to ask R to access and download the file and load the data into the working session. The next two small examples will clarify these ideas.

> rootstock.url <- paste0("http://www.stata-press.com/data/r10/",
+   "rootstock.dta") # Example 1
> rootstock <- read.dta(rootstock.url)
> crime.url <- paste0("http://www.jrsainfo.org/jabg/state_data2/",
+   "Tribal_Data00.xls") # Example 2
> crime <- read.xls(crime.url, pattern = "State")

Using the url link as a text string and appropriate importing functions such as read.dta and read.xls, we can import data in foreign formats to R.

3.4 Displaying R Objects

R objects are of varying nature and we may be interested in having a quick look at the dataset itself, and not through sophisticated tools such as graphics or statistical summaries. The utils package shipped along with R contains a host of functionalities for our purpose.

Suppose we want to see the first ten observations of the 100mrun.xls dataset that we imported earlier. Or we may like to see the last five observations of the same dataset. In R, head and tail are the two functions which give us this facility:

> head(newsome1,10)
   Year Time.sec.
1  1896      12.0
2  1900      11.0
10 1936      10.3
> tail(newsome1,5)
   Year Time.sec.
20 1984      9.99
24 2000      9.87

Another compact way of visualizing an object is to horizontally display the dataset. This is provided by the str function.

> str(newsome1)
'data.frame': 23 obs. of  2 variables:
 $ X1896: int  1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 ...
 $ X12  : num  11 11 10.8 10.8 10.8 10.6 10.8 10.3 10.3 10.3 ...

The fix function opens the dataset in an editor window and saves any changes back to the object, whereas the View function only displays the dataset without allowing edits.

> fix(newsome1)
> View(newsome1)
> View(wine)

Check what exactly the edit function does to an R object.

Note that the window arising due to the fix function contains three tabs Copy, Paste and Quit, which may be used to change the dataset. The reader may find the edit function to be useful too. For more such interesting functions, run library(help=utils) at the R console and experiment.


3.5 Manipulation Using R Functions

Consider a dataset where we have one column for measurement values of different individuals and another column with group indicators for those individuals. Now, we need to obtain some summaries of these measurements by the group indicator. This task can be achieved with the aggregate function, and the next example illustrates this. The examples discussed in Section 2.4.5 are also useful for data manipulation during data preparation.
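A minimal sketch of such a grouped summary follows. It uses the airquality dataset shipped with R, with Month as the group indicator, and is only meant to show the mechanics of the function.

```r
# airquality ships with R: daily readings with a Month indicator (5 to 9).
# Mean ozone per month; the formula interface drops rows with missing Ozone.
ozone_by_month <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
ozone_by_month

# The same idea with an explicit grouping list and a different summary.
temp_by_month <- aggregate(airquality$Temp,
                           by = list(Month = airquality$Month),
                           FUN = max)
temp_by_month
```

The formula form names the result column after the variable, whereas the list form calls it x unless the data is supplied as a named data frame.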

Consider a situation where you know what the names of the variables should be. However, for some technical reason you cannot declare them before they are actually assigned some value. The question is then how we can do such assignments in the flow of a program. As an artificial example, assume that you feed to R the current top ten Sensex companies of the day. Sensex refers to a number indicative of the relative prices of shares on the Bombay Stock Exchange in Mumbai. Now, you would like to create objects whose name is the company name and whose value is its closing value. We will use the assign function towards this end.
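The idea can be sketched as follows. The company names and closing values are made up for illustration and do not correspond to any real quotes.

```r
# Hypothetical company names and closing values, for illustration only.
companies <- c("AlphaBank", "BetaSteel", "GammaTech")
closing   <- c(1523.40, 487.15, 2210.00)

# Create one object per company; the object name is known only at run time.
for (i in seq_along(companies)) {
  assign(companies[i], closing[i])
}

BetaSteel          # the object now exists under the company's name
get("GammaTech")   # get() retrieves an object by a name held in a string
```

The function get is the natural counterpart of assign when the name of the object is itself stored in a character variable.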

With large data files it is memory-consuming to create new objects for some modifications of existing columns (or rows). Thus, there is this economic reason for modifying the objects without creating new ones. The R functions with and within meet the said requirement. For the faithful dataset, we will use the within function for carrying out necessary changes.
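A short sketch with the faithful dataset is given next. The derived column names and the cut-off of 70 minutes are our own choices for illustration; the result is stored under a new name here only to leave the original dataset untouched, and it may equally be assigned back to the same name.

```r
# faithful ships with R: eruption duration and waiting time, both in minutes.
faithful2 <- within(faithful, {
  eruptions_sec <- eruptions * 60     # new column computed from an existing one
  long_wait     <- waiting > 70       # logical flag; 70 is an arbitrary cut-off
})
head(faithful2, 3)

# with() evaluates an expression inside the data frame and returns the result.
with(faithful, mean(eruptions[waiting > 70]))
```

Note that within returns the modified data frame, whereas with returns only the value of the expression.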


3.6 Working with Time and Date

Time and dates have always been complex entities, and they do not become any easier in programming languages. A year has 7 months with 31 days, 4 months with 30 days, and 1 month with 28 days in three years out of four and 29 days every fourth year. Even if we ignore leap years, the number of times each weekday, such as Monday, Tuesday, etc., occurs in a year differs, and the weekdays are unevenly distributed across the months. The number of days in a month, except for February in non-leap years, is not an integer multiple of 7, and the number of days in a year is not an integer multiple of the number of months or of the number of weeks in a year. Similarly, time order is also a complex issue to deal with.

Furthermore, a date can be written in multiple ways. The complexity can be understood from the different styles in which dates are written: “9-Sep-2010”, “9-Sep-10”, “09-September-2010”, “09-09-10”, “09/09/10”, etc., all represent the same date. The month is written in numeric as well as text form. The year may be specified either in full four digits or by the last two digits. We need to take into account all such complexities. Chapter 4 of Spector (2008) is a dedicated and rigorous treatment of handling dates, and we will deal with it in some detail.

As used in the previous section, the current time may be obtained using Sys.time(). Internally, for each time stamp, R stores a number. Similarly, Sys.Date() returns the current system date. This can be easily verified.

> Sys.time()
[1] "2011-06-14 23:28:34 IST"
> as.numeric(Sys.time())
[1] 1308074318
> as.numeric(Sys.time()+1)
[1] 1308074367
> as.numeric(Sys.time()+2)
[1] 1308074370
> Sys.Date()
[1] "2011-06-14"

We see that R is currently reading time up to seconds accuracy. Can the accuracy be increased? That is, we need to know the time in millisecond units. As is the practice, set the default number of digits at 3 using the options function.

> op <- options(digits.secs=3)
> Sys.time()
[1] "2011-06-14 23:33:43.964 IST"
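If changing a session-wide option is not desirable, a single time stamp can be printed with milliseconds through the format function. The following is a small sketch; the variable name now is our own, and the last two lines show how the value saved in op restores the earlier setting.

```r
now <- Sys.time()

# "%OS3" prints the seconds with three decimal places for this call only.
format(now, "%Y-%m-%d %H:%M:%OS3")

# options() returns the previous settings, so they can be restored later.
op <- options(digits.secs = 3)
options(op)
```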

The objects returned by Sys.time() belong to the classes POSIXct and POSIXt, which are the date/time classes, whereas Sys.Date() returns an object of class Date. For more details, try ?POSIXct. In Example 3.4.2 we had used a small built-in object: month.abb. What is it really? Let us check it out.

> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug"
 [9] "Sep" "Oct" "Nov" "Dec"
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"

Let us store some date objects. To begin with we will consider the system date itself, and see the analysis (extraction actually) that may be performed with it.

> curr_date <- Sys.Date()
> curr_date
[1] "2015-04-13"
> years(curr_date); quarters(curr_date); months(curr_date)
[1] 2015
Levels: 2015
[1] "Q2"
[1] "April"
> days(curr_date); weekdays(curr_date); julian(curr_date)
[1] 13
31 Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8  < ... < 31
[1] "Monday"
[1] 16538
attr(,"origin")
[1] "1970-01-01"

In the above display, some functions are from the chron package, whereas the rest are from the base package. Tables 4.1, 4.2 and 4.3 of Spector (2008) have details about various formats of dates and time. We will clarify a few of them here. As seen earlier, a single date may be written in distinct ways: 9-Sep-2010, 9-Sep-10, 09-September-2010, 09-09-10, 09/09/10. The format can be specified through the format option as a string, and the conversion of text matter to date is achieved through the as.Date function. Let us now check how R understands these dates as one and the same.

> x1 <- as.Date('9-Sep-2010',format='%d-%b-%Y')
> x2 <- as.Date('9-Sep-10',format='%d-%b-%y')
> x3 <- as.Date('09-September-2010','%d-%B-%Y')
> x4 <- as.Date('09-09-10','%d-%m-%y')
> x5 <- as.Date('09/09/10','%d/%m/%y')
> x1;x2;x3;x4;x5
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
> x1==x2; x2==x3; x3==x4; x4==x5
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
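The conversion also works in the opposite direction: the format function turns a Date object back into text in any desired style. A brief sketch, with the object name d chosen for illustration:

```r
d <- as.Date("2010-09-09")      # the ISO style needs no format string

format(d, "%d/%m/%y")           # "09/09/10"
format(d, "%A, %d %B %Y")       # weekday and month names follow the locale
as.numeric(format(d, "%j"))     # day of the year as a number
```

Since month and weekday names depend on the locale of the session, the second line may print differently on differently configured machines.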

Some algebra is possible with date objects, for example, difftime, mean, and range.

> x1+1
[1] "2010-09-10"
> difftime(x1,x2)
Time difference of 0 secs
> mean(c(x1,x2))
[1] "2010-09-09"
> range(c(x1,x2))
[1] "2010-09-09" "2010-09-09"
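Since x1 and x2 are the same date, the output above is not very revealing. The following sketch repeats the arithmetic with two distinct dates, which are chosen arbitrarily for illustration.

```r
start <- as.Date("2010-09-09")
end   <- as.Date("2010-12-25")

end - start                              # difference as a difftime in days
difftime(end, start, units = "weeks")    # same gap expressed in weeks
as.numeric(end - start)                  # plain number of days

# Sequences and comparisons also work on Date objects.
seq(start, by = "month", length.out = 3)
start < end
```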

R is efficient in dealing with time and date variables and this section has given a brief exposition of it.


3.7 Text Manipulations

Data is not always in a ready-to-analyze format. We have seen that working with character or factor objects is not as convenient as working with numeric or integer objects. Working with text matter is a task with much higher complexity and inevitable inconvenience too. In fact, there is a specialized school working with such problems, known as Text miners. Here, we will illustrate the important text tools working our way through a complex text matter. The complexity of text functions is such that we feel that it is better to work with some examples instead of looking at their definitions. This approach forms the remainder of this section.

In Section 1.5, we indicated the importance and relevance of subscribing to the R mailing list. The mail exchanged among the subscribers is uploaded at the end of the day. Furthermore, all the mail in a month is consolidated in a gzip-compressed text file. As an example, the mail exchanged during the month of October in 2005 is available in the file 2005-10.txt.gz, which can be downloaded from the R website. The first few lines of this text file are displayed below:

From lisawang at uhnres.utoronto.ca  Sat Oct  1 00:14:23 2005
From: lisawang at uhnres.utoronto.ca (Lisa Wang)
Date: Fri, 30 Sep 2005 17:14:23 -0500
Subject: [R] How to get to the varable in a list
Message-ID: <[email protected]>
Hello,
I have a list "lis" as the following:

We will first learn how to read such text files in R using the readLines function, which reads the different lines of a txt file as a character class.

> Imine <- readLines("2005-10.txt.gz")
> Imine[1:10]
 [1] "From lisawang at uhnres.utoronto.ca  Sat Oct  1 00:14:23 2005"
 [2] "From: lisawang at uhnres.utoronto.ca (Lisa Wang)"
 [3] "Date: Fri, 30 Sep 2005 17:14:23 -0500"
 [4] "Subject: [R] How to get to the varable in a list"
 [5] "Message-ID: <[email protected]>"
 [6] ""
 [7] "Hello,"
 [8] ""
 [9] "I have a list \"lis\" as the following:"
[10] ""

Verify for yourself the difference in actual text file 2005-10.txt.gz and the R object Imine. The rest of the section will help you to extract information from such files. As an example, we will first extract Date, Subject, and Message-ID for the first mail of October 2005. We see that Date is in the third line of the object. We will ask R to return this line number with the use of functions grep and grepl.

> grep("Date",Imine[1:10])
[1] 3
> grepl("Date",Imine[1:10])
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Thus, we see that the grep function finds the line (row number) in which the text of interest, commonly referred to as a string, occurs. The grepl function returns a logical vector indicating, for each line, whether the string occurs in it. The number of characters in a line can be found with the R function nchar, as in the next line of code.

> unlist(lapply(Imine[1:10],nchar))
 [1] 61 48 37 48 50  0  6  0 37  0

The fourth line of the text file contains the subject of the mail. We want to obtain the subject sans the content header Subject: [R]. Note that we have included the space after [R]. We will first find the position where the subject begins and next extract the exact matter using the R function substring.

> nchar("Subject: [R] ")
[1] 13
> substring(Imine[4],14)
[1] "How to get to the varable in a list"

Thus, the function substring skips the first 13 characters of the string and returns the rest of the string. Let us now extract the message ID of this particular mail. As with the subject, we have Message-ID: < as an indicator of the line of the object which contains the message ID. An added complexity is that we need to remove the sign > at the end of the message ID. We can see once more the utility of the nchar function.

> grep("Message-ID: <",Imine[1:10])
[1] 5
> nchar(Imine[5])
[1] 50
> nchar("Message-ID: <")
[1] 13
> substr(Imine[5],14,49)
[1] "[email protected]"

We will conclude this section with the extraction of the time and date of this message. The line containing the date and time is indicated by Date:. After some further manipulations we can obtain the date and time of the message. Recall from the previous section that a string such as 30 Sep 2005 17:14:23 can be described by the format %d %B %Y %H:%M:%S; on input, %B also matches abbreviated month names such as Sep. Thus, we can extract the exact date and time of this email.

> grep("Date: ",Imine[1:10])
[1] 3
> temp <- strsplit(Imine[3],"Date: ")[[1]][2]
> temp
[1] "Fri, 30 Sep 2005 17:14:23 -0500"
> tempdate <- substring(temp,6,nchar(temp)-6)
> tempdate
[1] "30 Sep 2005 17:14:23"
> strptime(tempdate,"%d %B %Y %H:%M:%S")
[1] "2005-09-30 17:14:23"
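The position counting used so far can often be replaced by pattern matching. The sketch below works on a small hypothetical header block (the sender is made up) instead of the downloaded file, and uses sub with regular expressions; it is one possible alternative under these assumptions, not the only way to proceed.

```r
# A hypothetical header block standing in for lines read with readLines().
mail <- c("From: someone at example.org (Some One)",
          "Date: Fri, 30 Sep 2005 17:14:23 -0500",
          "Subject: [R] How to get to the varable in a list",
          "",
          "Hello,")

# Locate the header lines by anchoring the pattern at the start of the line.
subj_line <- grep("^Subject: ", mail, value = TRUE)
date_line <- grep("^Date: ", mail, value = TRUE)

# sub() removes the matched prefix; fixed = TRUE treats "[R]" literally.
subject <- sub("Subject: [R] ", "", subj_line, fixed = TRUE)
subject

# Drop the weekday and the UTC offset, keeping only the part in parentheses.
stamp <- sub("^Date: [A-Za-z]+, (.*) [-+][0-9]{4}$", "\\1", date_line)
stamp
```

The string in stamp has the same layout as tempdate and can be passed to strptime in the same manner.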

Though this book does not deal with the emerging area of text mining, data also exists in rich and hidden forms in text files. The functions used here form the base, and they will be useful for many text manipulations.


3.8 Scripts and Text Editors for R

In earlier sections we saw the need for using objects as against just plain computing at the terminal. As the need and experience grow, the user finds it difficult to get the task accomplished, even within this framework. Consider the hypothetical scenario where the program runs into a few hundred lines. A mistake made at the 21st line is observed after 89 lines of code have been executed. There is thus an intrinsic need to write the R code in a separate file and execute it from there. Consider the following set of code:

yb <- read.table("/.../youden.csv",header=TRUE,sep=",")
quantile(yb$Preparation_1,seq(0,1,.1))
# here seq gives 0, .1, .2, ..., 1
quantile(yb$Preparation_2,seq(0,1,.1))
fivenum(yb$Preparation_1)
fivenum(yb$Preparation_2)
sd(yb$Preparation_1); sd(yb$Preparation_2)
var(yb$Preparation_1); var(yb$Preparation_2)
range(yb$Preparation_1); range(yb$Preparation_2)

Copy and paste this code into any text editor, such as Notepad, vi, kate, gedit, etc., and save the file as yb.R or yb.txt. In the File option of the menu ribbon, a Windows user will find New Script and Open Script options. Using the Open Script option, load the yb.R file. Choose Run line or selection from the Edit option. We will then get the same results as those obtained in Section 2.4.1. If there are any errors, we can modify the code in the yb.R file, and thus the task of fixing the bugs is simplified. The Windows user may also explore the package Rcmdr explained in the next subsection.
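A saved script can also be executed from within an R session on any operating system using the source function. The sketch below writes a tiny stand-in script to a temporary file, since the youden.csv file is not assumed to be available here, and then runs it.

```r
# Write a two-line script to a temporary file (a stand-in for yb.R).
script <- tempfile(fileext = ".R")
writeLines(c("x <- c(2, 4, 4, 5, 7, 9)",
             "five <- fivenum(x)"),
           script)

# source() runs the file in the current session; echo = TRUE prints each
# command as it is evaluated, much like typing it at the console.
source(script, echo = TRUE)
five
```

With an actual script, source("yb.R", echo = TRUE) would play the same role, provided that yb.R is in the working directory.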

3.8.1 Text Editors for Linuxians

Linuxians unfortunately do not have any Menu ribbon option available to them. Interestingly, we have a host of other options. As an example, just open the terminal and set the address to the working directory. At the terminal, run the following one-line code:

equation

We demonstrate two more options for Linuxians. Prof. John Fox of McMaster University and his team have developed a special package, Rcmdr. Installing packages and loading libraries will be explained in the next subsection. Enter the code below in the R session:

> library(Rcmdr)

and what you will see next is a user-friendly version of R. Yes, we have a new set of very useful tools. The options on the menu ribbon now include File, Edit, Data, Statistics, Graphics, Models, Distributions, Tools, and Help. From the File menu, choose Open script file and open the file yb.R and then click on the Submit button. We get the same results as earlier.

The third option for a Linuxian is even better. You can go to the website http://rkward.sourceforge.net/ and download the RKWard 0.5.3 binaries. Of course, knowing that Linuxians prefer the terminal, simply key in sudo apt-get install rkward. Having installed RKWard, start it. This software is as user-friendly as many enterprise editions. The options on the ribbon are File, Edit, View, Workspace, Run, Analysis, Plots, Distributions, Windows, Settings, and Help. If the reader is wondering why we are dwelling on such trivial details, our justification is that we have seen a lot of users struggling with R in the Linux environment. We will compromise here, though, by not repeating how to run yb.R in RKWard.

RStudio is quickly emerging as a popular variant and may be obtained from www.rstudio.com.


3.9 Further Reading

Spector (2008) is a comprehensive book dedicated to data preparation. Chapter 2 of Venables and Ripley (2002) contains data manipulation for R and S. We also recommend that the reader go through Chapter 9 of Zuur, et al. (2009) for common R mistakes.

3.10 Complements, Problems, and Programs

  1. Problem 3.1 For the data.frame some in Example 3.2.2, what will be your expectation of the R code summary(some)? Validate the expectation by running the code too.

  2. Problem 3.2 By considering the dataset rootstock imported in Section 3.3, export the data back to the working directory using the write.dta function from the foreign package.

  3. Problem 3.3 Run edit(newsome1) as required in Section 3.4, and comment on how this function is different from the View function.

  4. Problem 3.4 For any directory in your computer, use the function list.files to obtain the contents, inclusive of files and maybe other directories. Recollect that the default list.files() function returns the contents in the working directory, and hence you need to experiment with a directory other than getwd().

  5. Problem 3.5 The attach function, when applied on a data.frame object, loads the variables in the R session. How do you undo this operation? If the attach function is repeated more than once, what will be the result?

  6. Problem 3.6 Suppose that the option header=FALSE has been used in error when an object is imported. Write appropriate code which brings up the right variable names and deletes the wrong observations too. For example, suppose that the chest data is inappropriately imported with chest <- read.csv("Chest_VH.csv",header=FALSE). A simple use of names(chest) <- chest[1,] and chest <- chest[-1,] will not fix the problem.

  7. Problem 3.7 Using the aggregate function, as in Example 3.5.1, obtain the frequency instead of sum. Also, extend the list variables in the example to include both GPP and Grade, and hence obtain the sum of Sat for possible combinations of these two variables.

  8. Problem 3.8 Using the ifelse conditional function, create a new as.Date type of function, which can read date objects available in a vector in two different forms.

  9. Problem 3.9 Find the time difference between two time objects in units of hours, days, etc.
