Package(s): gdata, chron
Dataset(s): 100mrun.xls, Earthwormbiomass.xls, Bacteria.XLS, nerve.dat, atombombtest.xls, airquality, wine.dat, sat, faithful, 2005-10.txt.gz
Data comes in various forms and complexities. It is a difficult task even to list the major and minor complexity levels of data preparation. The different forms of data, as well as the complexity levels, may be known or unknown. Thus, it is difficult to lay down a standard set of guidelines for teaching data preparation methods.
Complexities arise on various counts, such as file types, files with missing data values, files with different kinds of attributes, and so on. In some cases, the user may simply be unable to read the data properly without repeated attempts at rewriting the code. In Section 3.2, we use the options available in the R function read.table to import data from external files which pose some difficulties. The options may vary to accommodate data problems, to skip a certain number of lines of a file, and so forth. A good practice during the learning curve is to validate the data imported into R and check whether it matches expectations. Thus, it helps to inspect the imported data using the functions head, tail, str, View, etc., and such functions will be illustrated in Section 3.4. The R functions aggregate, with, and assign are effective in carrying out data manipulation without the need to create new R objects. The use of these functions will be seen in Section 3.5. Time and date vectors need special consideration, and we will aid the reader with the mathematics of them in Section 3.6. Dealing with text matter is one of the most involved tasks, and the final technical Section 3.7 will consider the preliminary aspects of this new area. Rscript is a very important back-end R tool and helps to run programs without even opening the software. Furthermore, rich text editors are important too, and whenever it is possible to use them, the author recommends that such editors be promptly deployed. This forms the topic of Section 3.8.
Section 2.5 introduced us to a method of reading data from external data files using the scan function. However, data is not always well organized. The word organized here has a rather vague meaning; after all, why would anybody write files which are not organized? It helps to understand that "organized" is used in an internal sense here: data that is "internally" stored and managed by R gives consistent results if we read it according to the need of the hour. We will illustrate with some examples.
Remark about gdata. This R package also needs the perl software. In general, Linux and Macintosh OS have this software by default and the read.xls function works fine. However, Windows OS does not contain the perl software, and it needs to be installed before using the gdata package. Thus, the user will need to first download and install the software from http://www.perl.org/get.html. Furthermore, the perl option needs to be explicitly specified in the read.xls code. That is, we need a modification such as newsome1 <- read.xls("100mrun.xls", perl="C:/Perl/bin/perl.exe", sheet=1, skip=1) to ensure that the data is properly imported into R. If the sheets have names, and not numbers, the code changes slightly to sheet='sheet name'.
We will check out one more example for the frequency table, which is predominantly useful for categorical data analysis.
Datasets may be available in formats other than csv or dat. It is also a frequent situation where we need to read data stored in xls (Microsoft Excel) format, sav (SPSS), ssd (SAS), dta (STATA), etc. For example, if the Youden-Beale data was saved in the first sheet of an xls file, we can use the command:
> yb <- read.xls("/.../youden.xls",header=TRUE,sheet=1)
Converting xls file to csv file... Done.
Reading csv file... Done.
Note that R first internally converts the xls file into a csv file, and then imports it into the session.
Similarly, we can read datasets from other software. Consider the file rootstock.dta available from http://www.stata-press.com/data/r10/rootstock.dta. This is a popular dataset in the domain of multivariate statistics. The soft copy rootstock.dta has been generated from the Stata software. We assume that this dataset has been downloaded from the web and stored in the current working directory. We can read this dataset in R using the read.dta function from the foreign package.
library(foreign)
rootstock <- read.dta("/.../rootstock.dta")
Section 1.4 listed many sources of data on the web. The laborious way of using them for analysis is to download them from the sources to the local hard disk and maybe to the current working directory. The technical way is to ask R to access and download the file and load the data into the working session. The next two small examples will clarify these ideas.
> rootstock.url <- "http://www.stata-press.com/data/r10/
+ rootstock.dta" # Example 1
> rootstock <- read.dta(rootstock.url)
> crime.url <- "http://www.jrsainfo.org/jabg/state_data2/
+ Tribal_Data00.xls" # Example 2
> crime <- read.xls(crime.url, pattern = "State")
Using the URL as a text string and appropriate importing functions such as read.dta and read.xls, we can import data in foreign formats into R.
R objects are of varying nature and we may be interested in having a quick look at the dataset itself, and not through sophisticated tools such as graphics or statistical summaries. The utils package shipped along with R contains a host of functionalities for our purpose. Suppose we want to see the first ten observations of the 100mrun.xls dataset that we imported earlier, or the last five observations of the same dataset. In R, head and tail are the two functions which give us this facility:
> head(newsome1,10)
Year Time.sec.
1 1896 12.0
2 1900 11.0
10 1936 10.3
> tail(newsome1,5)
Year Time.sec.
20 1984 9.99
24 2000 9.87
Another compact way of visualizing an object is to display its structure, one variable per line. This is provided by the str function.
> str(newsome1)
'data.frame': 23 obs. of 2 variables:
$ X1896: int 1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 ...
$ X12 : num 11 11 10.8 10.8 10.8 10.6 10.8 10.3 10.3 10.3 ...
The fix function displays the dataset in a new window and allows editing, whereas the View function is used to just view the dataset.
> fix(newsome1)
> View(newsome1)
> View(wine)
Check what exactly the edit function does to an R object. Note that the window arising from the fix function contains the three options Copy, Paste, and Quit, which may be used to change the dataset. The reader may find the edit function to be useful too. For more such interesting functions, run library(help=utils) at the R console and experiment.
Consider a dataset where we have one column of measurement values for different individuals and another column with group indicators for those individuals. Now, we need to obtain some summaries of these measurements by the group indicator. This task can be achieved with the aggregate function, and the next example illustrates this. The examples discussed in Section 2.4.5 are also useful for data preparation and manipulation.
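As a sketch of the idea, the call below computes group-wise means; the data frame, its column names, and the values are invented purely for illustration, not taken from the text's example.

```r
# Invented example data: one measurement column, one group indicator
measurements <- data.frame(
  value = c(5.1, 4.8, 6.2, 5.9, 7.0, 6.8),
  group = c("A", "A", "B", "B", "C", "C")
)
# Mean of 'value' within each level of 'group'
aggregate(value ~ group, data = measurements, FUN = mean)
```

Other summaries are obtained by swapping FUN, for example FUN = length for group frequencies or FUN = sum for group totals.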
Consider a situation where you know what the names of the variables should be. However, for some technical reason you cannot declare them before they are actually assigned some value. The question is then how we can do such assignments in the flow of a program. As an artificial example, assume that you feed R the current top ten Sensex companies of the day. Sensex refers to a number indicative of the relative prices of shares on the Mumbai Stock Exchange. Now, you would like to create objects whose names are the company names and whose values are their Sensex closing values. We will use the assign function towards this end.
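A minimal sketch of this pattern follows; the company names and closing values below are invented for illustration, not actual Sensex data.

```r
# Invented company names and closing values
companies <- c("RIL", "TCS", "INFY")
closings  <- c(2450.30, 3310.70, 1522.40)
# Create one object per company, named after the company,
# holding its closing value
for (i in seq_along(companies)) assign(companies[i], closings[i])
TCS          # the object named "TCS" now holds 3310.7
get("INFY")  # get() retrieves an object by name, the converse of assign()
```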
With large data files it is memory-consuming to create new objects for modifications of existing columns (or rows). Thus, there is an economic reason for modifying objects without creating new ones. The R functions with and within meet this requirement. For the faithful dataset, we will use the within function to carry out the necessary changes.
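As a sketch of the idea (the derived column name eruptions_sec is our own choice, not from the text), within evaluates an expression inside the data frame and returns the modified copy in one step:

```r
# faithful ships with base R: eruptions (min) and waiting (min)
# Add a derived column without assembling a new object column by column
faithful2 <- within(faithful, eruptions_sec <- eruptions * 60)
head(faithful2, 3)
```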
Time and dates have always been complex entities, and they do not become any easier in programming languages. A year has 7 months with 31 days, 4 months with 30 days, and 1 month with 28 days (29 days every fourth year). Even ignoring leap years, the number of occurrences of each weekday, such as Monday, Tuesday, etc., varies from year to year, and they are unevenly distributed across the months. The number of days in a month, except for February in non-leap years, is not an integer multiple of 7, and the number of days in a year is not an integer multiple of the number of months, nor of the number of weeks. Similarly, time order is also a complex issue to deal with.
The date itself can be written in a further multiplicity of ways. The complexity can be appreciated from the different styles in which dates are written: "9-Sep-2010", "9-Sep-10", "09-September-2010", "09-09-10", "09/09/10", etc., all represent the same date. The month may be written numerically or as text, and the year may be specified either in full four digits or with just the last two digits of the century. We need to take all such variations into account. Chapter 4 of Spector (2008) is a dedicated and rigorous treatment of handling dates, and we will deal with the topic in some detail.
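To see that these styles are merely different renderings of one underlying date, the same Date object can be printed in several of them with format; the month-name rendering assumes an English locale.

```r
d <- as.Date("2010-09-09")
# One Date object rendered in some of the styles listed above
format(d, "%d-%b-%Y")  # "09-Sep-2010" (in an English locale)
format(d, "%d-%m-%y")  # "09-09-10"
format(d, "%d/%m/%y")  # "09/09/10"
```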
As used in the previous section, the current time may be obtained using Sys.time(). Internally, R stores a number for each time stamp. Similarly, Sys.Date() returns the current system date. This can be easily verified.
> Sys.time()
[1] "2011-06-14 23:28:34 IST"
> as.numeric(Sys.time())
[1] 1308074318
> as.numeric(Sys.time()+1)
[1] 1308074367
> as.numeric(Sys.time()+2)
[1] 1308074370
> Sys.Date()
[1] "2011-06-14"
We see that R is currently reading time to an accuracy of seconds. Can the accuracy be increased? That is, suppose we need to know the time in millisecond units. As is the practice, set the default number of digits for fractional seconds to 3 using the options function.
> op <- options(digits.secs=3)
> Sys.time()
[1] "2011-06-14 23:33:43.964 IST"
Date-time objects belong to the classes POSIXct and POSIXt, which represent date/time values. For more details, try ?POSIXct. In Example 3.4.2 we had used month.abb, a small built-in constant. What is it really? Let us check it out.
> month.abb
[1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug"
[9] "Sep" "Oct" "Nov" "Dec"
> month.name
[1] "January" "February" "March" "April" "May" "June"
[7] "July" "August" "September" "October" "November" "December"
Let us store some date objects. To begin with we will consider the system date itself, and see the analysis (extraction actually) that may be performed with it.
> curr_date <- Sys.Date()
> curr_date
[1] "2015-04-13"
> years(curr_date); quarters(curr_date); months(curr_date)
[1] 2015
Levels: 2015
[1] "Q2"
[1] "April"
> days(curr_date); weekdays(curr_date); julian(curr_date)
[1] 13
31 Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < ... < 31
[1] "Monday"
[1] 16538
attr(,"origin")
[1] "1970-01-01"
In the above display, some functions are from the chron package, whereas the rest are from the base package. Tables 4.1, 4.2 and 4.3 of Spector (2008) have details about the various formats of dates and times. We will clarify a few of them here. As seen earlier, a single date may be written in distinct ways: 9-Sep-2010, 9-Sep-10, 09-September-2010, 09-09-10, 09/09/10. The format can be specified through the format option as a string, and the conversion of text to a date is achieved through the as.Date function. Let us now check how R understands these dates as one and the same.
> x1 <- as.Date('9-Sep-2010',format='%d-%b-%Y')
> x2 <- as.Date('9-Sep-10',format='%d-%b-%y')
> x3 <- as.Date('09-September-2010','%d-%B-%Y')
> x4 <- as.Date('09-09-10','%d-%m-%y')
> x5 <- as.Date('09/09/10','%d/%m/%y')
> x1;x2;x3;x4;x5
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
> x1==x2; x2==x3; x3==x4; x4==x5
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
Some algebra is possible with date objects, for example through difftime, mean, and range.
> x1+1
[1] "2010-09-10"
> difftime(x1,x2)
Time difference of 0 days
> mean(c(x1,x2))
[1] "2010-09-09"
> range(c(x1,x2))
[1] "2010-09-09" "2010-09-09"
R is efficient in dealing with time and date variables and this section has given a brief exposition of it.
Data is not always in a ready-to-analyze format. We have seen that working with character or factor objects is not as convenient as working with numeric or integer objects. Working with text matter is a task of much higher complexity and inevitable inconvenience too. In fact, there is a specialized field working on such problems, known as text mining. Here, we will illustrate the important text tools by working our way through a complex piece of text. The complexity of the text functions is such that we feel it is better to work through some examples instead of looking at their definitions. This approach forms the remainder of this section.
In Section 1.5, we indicated the importance and relevance of subscribing to the R mailing list. The mail exchanged among the subscribers is uploaded at the end of the day. Furthermore, all the mail in a month is consolidated in a gzip-compressed text file. As an example, the mail exchanged during the month of October 2005 is available in the file 2005-10.txt.gz, which can be downloaded from the R website. The first few lines of this text file are displayed below:
From lisawang at uhnres.utoronto.ca Sat Oct 1 00:14:23 2005
From: lisawang at uhnres.utoronto.ca (Lisa Wang)
Date: Fri, 30 Sep 2005 17:14:23 -0500
Subject: [R] How to get to the varable in a list
Message-ID: <[email protected]>
Hello,
I have a list "lis" as the following:
We will first learn how to read such text files in R using the readLines function, which reads the different lines of a txt file as a character vector.
> Imine <- readLines("2005-10.txt.gz")
> Imine[1:10]
[1] "From lisawang at uhnres.utoronto.ca Sat Oct 1 00:14:23 2005"
[2] "From: lisawang at uhnres.utoronto.ca (Lisa Wang)"
[3] "Date: Fri, 30 Sep 2005 17:14:23 -0500"
[4] "Subject: [R] How to get to the varable in a list"
[5] "Message-ID: <[email protected]>"
[6] ""
[7] "Hello,"
[8] ""
[9] "I have a list \"lis\" as the following:"
[10] ""
Verify for yourself the difference between the actual text file 2005-10.txt.gz and the R object Imine. The rest of the section will help you to extract information from such files. As an example, we will first extract the Date, Subject, and Message-ID for the first mail of October 2005. We see that the Date is in the third line of the object. We will ask R to return this line number with the use of the functions grep and grepl.
> grep("Date",Imine[1:10])
[1] 3
> grepl("Date",Imine[1:10])
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Thus, we see that the grep function finds in which line (row number) the text of interest, commonly referred to as a string, occurs. The grepl function returns a logical vector instead. The number of characters in each line can be found with the R function nchar:
> unlist(lapply(Imine[1:10],nchar))
[1] 61 48 37 48 50 0 6 0 37 0
The fourth line of the text file contains the subject of the mail. We want to obtain the subject sans the content header Subject: [R]. Note that we have included the space after [R]. We will first find the position where the subject begins and then extract the exact matter using the R function substring.
> nchar("Subject: [R] ")
[1] 13
> substring(Imine[4],14)
[1] "How to get to the varable in a list"
Thus, the function substring deletes the first 13 characters of the string and returns the rest of it. Let us now extract the Message-ID of this particular mail. As with the subject, we have Message-ID: < as an indicator of the line of the object which contains the message ID. An added complexity is that we need to remove the sign > at the end of the message ID. We can see once more the utility of the nchar function.
> grep("Message-ID: <",Imine[1:10])
[1] 5
> nchar(Imine[5])
[1] 50
> nchar("Message-ID: <")
[1] 13
> substr(Imine[5],14,49)
[1] "[email protected]"
We will conclude this section with the extraction of the time and date of this message. The line containing the date and time is indicated by Date:. After a few manipulations we can obtain the date and time of the message. Recall from the previous section that the format of 30 Sep 2005 17:14:23 is %d %B %Y %H:%M:%S. Thus, we can extract the exact date and time of this email.
> grep("Date: ",Imine[1:10])
[1] 3
> temp <- strsplit(Imine[3],"Date: ")[[1]][2]
> temp
[1] "Fri, 30 Sep 2005 17:14:23 -0500"
> tempdate <- substring(temp,6,nchar(temp)-6)
> tempdate
[1] "30 Sep 2005 17:14:23"
> strptime(tempdate,"%d %B %Y %H:%M:%S")
[1] "2005-09-30 17:14:23"
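The single-mail steps above generalize to the whole archive. A hedged sketch follows; the tiny character vector stands in for the Imine object read earlier (its lines are invented), and we assume each subject line begins exactly with "Subject: ".

```r
# A small stand-in for the mail archive vector (invented lines)
mails <- c("From: someone",
           "Subject: [R] first question",
           "Hello,",
           "Subject: [R] second question")
# Collect every subject line and strip the "Subject: " header text
subj_lines <- grep("^Subject: ", mails, value = TRUE)
subjects   <- substring(subj_lines, nchar("Subject: ") + 1)
subjects
```

Replacing mails with the full Imine vector extracts every subject in the month's archive in one pass.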
Though this book does not deal with the emerging area of text mining, data often exists in rich but hidden forms in text files. The functions used here form the base for many text manipulations.
In earlier sections we saw the need for using objects rather than just plain computing at the terminal. As needs and experience grow, the user finds it difficult to accomplish tasks even within this framework. Consider the hypothetical scenario where a program runs to a few hundred lines, and a mistake made at the 21st line is observed only after 89 lines of code have been executed. There is thus an intrinsic need to write R code in a separate file and execute it from there. Consider the following set of code:
yb <- read.table("/.../youden.csv",header=TRUE,sep=",")
quantile(yb$Preparation_1,seq(0,1,.1))
# here seq gives 0, .1, .2, ..., 1
quantile(yb$Preparation_2,seq(0,1,.1))
fivenum(yb$Preparation_1)
fivenum(yb$Preparation_2)
sd(yb$Preparation_1); sd(yb$Preparation_2)
var(yb$Preparation_1); var(yb$Preparation_2)
range(yb$Preparation_1); range(yb$Preparation_2)
Copy and paste this code into any text editor, such as Notepad, vi, kate, gedit, etc., and save the file as yb.R or yb.txt. In the File option of the menu ribbon, a Windows user will find New Script and Open Script options. Using the Open Script option, load the yb.R file. Choose Run line or selection from the Edit option. We will then get the same results as those obtained in Section 2.4.1. If there are any errors, we can modify the code in the yb.R file, and thus the task of fixing bugs is simplified. The Windows user may also explore the package Rcmdr explained in the next subsection.
Linuxians unfortunately do not have any Menu ribbon option available to them. Interestingly, we have a host of other options. As an example, just open the terminal and set the address to the working directory. At the terminal, run the following one-line code:
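The one-line command itself does not appear in the text; presumably it invokes the Rscript tool mentioned in the chapter introduction on the saved file, along the lines of the following sketch (assuming R is installed and yb.R is in the current directory):

```shell
# Run the saved script non-interactively from the terminal
Rscript yb.R
```

Alternatively, R CMD BATCH yb.R runs the script and writes the results to the file yb.Rout.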
We demonstrate two more options for Linuxians. Prof. John Fox, McMaster University, and his team have developed a special package, Rcmdr. Installing packages and loading libraries will be explained in the next subsection. Enter the code below in the R session:
> library(Rcmdr)
and what you will see next is a user-friendly version of R. Yes, we have a new set of very useful tools. The options on the menu ribbon now include File, Edit, Data, Statistics, Graphics, Models, Distributions, Tools, and Help. From the File menu, choose Open script file, open the file yb.R, and then click on the Submit button. We get the same results as earlier.
A third option for the Linuxian is also available. You can go to http://rkward.sourceforge.net/ and download the RKWard 0.5.3 binaries. Of course, knowing that Linuxians prefer the terminal, simply key in sudo apt-get install rkward. Having installed RKWard, start it. This software is as user-friendly as many enterprise editions. The options on the ribbon are File, Edit, View, Workspace, Run, Analysis, Plots, Distributions, Windows, Settings, and Help. If the reader is wondering why we are trivializing here, we justify it as we have seen many users struggling to use R in the Linux environment. We will compromise here, though, by not repeating how to run yb.R in RKWard.
RStudio is quickly emerging as a popular variant and may be obtained from www.rstudio.com.
Spector (2008) is a comprehensive book dedicated to data preparation. Chapter 2 of Venables and Ripley (2002) contains data manipulation for R and S. We also recommend that the reader go through Chapter 9 of Zuur et al. (2009) for common R mistakes.
Problem 3.1 For the data.frame some in Example 3.2.2, what will be your expectation of the R code summary(some)? Validate the expectation by running the code too.
Problem 3.2 By considering the dataset rootstock imported in Section 3.3, export the data back to the working directory using the write.dta function from the foreign package.
Problem 3.3 Run edit(newsome1) as required in Section 3.4, and comment on how this function differs from the View function.
Problem 3.4 For any directory on your computer, use the function list.files to obtain the contents, inclusive of files and maybe other directories. Recollect that the default list.files() call returns the contents of the working directory, and hence you need to experiment with a directory other than getwd().
Problem 3.5 The attach function, when applied to a data.frame object, loads the variables into the R session. How do you undo this operation? If the attach function is repeated more than once, what will be the result?
Problem 3.6 Suppose that the option header=FALSE was erroneously used when an object was imported. Write appropriate code which brings up the right variable names and deletes the wrong observations too. For example, suppose that the chest data is inappropriately imported with chest <- read.csv("Chest_VH.csv", header=FALSE). A simple use of names(chest) <- chest[1,] and chest <- chest[-1,] will not fix the problem.
Problem 3.7 Using the aggregate function, as in Example 3.5.1, obtain the frequency instead of the sum. Also, extend the list variables in the example to include both GPP and Grade, and hence obtain the sum of Sat for possible combinations of these two variables.
Problem 3.8 Using the ifelse conditional function, create a new as.Date type of function which can read date objects available in a vector in two different forms.
Problem 3.9 Find the time difference between two time objects in units of hours, days, etc.