Chapter 3
Data Preparation and Other Tricks

Package(s): gdata, chron

Dataset(s): 100mrun.xls, Earthwormbiomass.xls, Bacteria.XLS, nerve.dat, atombombtest.xls, airquality, wine.dat, sat, faithful, 2005-10.txt.gz

3.1 Introduction

Data comes in various forms and complexities. It is a difficult task to even list the major/minor complexity levels of data preparation. The different forms of data, as well as complexity levels, may be known or unknown. Thus, it is difficult to have a standard set of guidelines for teaching data preparation methods.

Complexities arise on various counts, such as file types, files with missing data values, files with different kinds of attributes, etc. In some cases, it may be nearly impossible for the user to read the data properly without repeatedly rewriting the code. In Section 3.2, we use the options available in the R function read.table to import data from external files which pose some difficulties. The options may vary to accommodate data problems, skipping a certain number of lines of a file, and so forth. A good practice during the learning curve is to validate the data imported into R and check that it is along the expected lines. Thus, it may help to see the imported data using the functions head, tail, str, View, etc., and such functions will be illustrated in Section 3.4. The R functions aggregate, with, and assign are effective in carrying out data manipulation without the need to create new R objects. The use of these functions will be seen in Section 3.5. Time and date vectors need special consideration and we will aid the reader with the math of it in Section 3.6. The complexity of dealing with text matter is one of the most detailed ones and the final technical Section 3.7 will consider the preliminary aspects of this area. Rscript is a very important backend R tool and helps to run programs without the necessity of even opening the software. Furthermore, rich text editors are important too, and whenever it is possible to use them, the author would recommend that such editors be promptly deployed. This forms the topic of Section 3.8.

3.2 Manipulation with Complex Format Files

Section 2.5 introduced us to a method of reading data from external data files using the scan function. However, it is not the case that data is always well organized. The word organized here has a very vague meaning. After all, why would anybody write files which are not organized? It is nice to understand that “organized” is used in a rather internal sense here. The data that is “internally” stored and managed by R gives consistent results if we read the data according to the need of the hour. We will illustrate with some examples.

Example 3.2.1. The Hundred Meter Running Race

Consider the dataset 100mrun.xls from Gore, et al. (2006). To obtain the file of interest, please visit http://ces.iisc.ernet.in/hpg/nvjoshi/statspunedatabook/databook.html and download datafile sxls.zip and unzip it and copy the file 100mrun.xls to your working directory. The two columns give us the year of the Olympics and the winning recorded time. In spite of using the read.xls function from the package gdata, our problems are not over. The first row does not contain the variable names and is a descriptor of the file characteristic containing data about the 100 meters running race. The variable names appear on the second row, and we need to read the data from the second row onward. Hence, we ask R to skip the first line. We will first load the gdata package. The data is then properly read as below.

> library(gdata)
> newsome1 <- read.xls("100mrun.xls",sheet=1,skip=1)
> newsome1
   Year Time.sec.
1  1896     12.00
2  1900     11.00
3  1904     11.00
24 2000      9.87

The option skip ensures that the data is read from the second line of the file. Since the xls (xlsx) document may have multiple sheets, it is necessary for R to know from which sheet we are importing the data. This is met through the sheet option. The argument for the sheet option may either be the sheet number or the sheet name.

Remark about gdata. This R package needs the Perl software too. In general, Linux and Macintosh operating systems have this software by default and the read.xls function works fine. However, Windows does not ship with Perl, and it needs to be installed before using the gdata package. Thus, the user will need to first download and install the software from http://www.perl.org/get.html. Furthermore, the perl option needs to be explicitly specified in the read.xls code. That is, we need a modification with newsome1 <- read.xls("100mrun.xls",perl="C:/Perl/bin/perl.exe",sheet=1,skip=1) to ensure that the data is properly imported in R. If the sheets have names, and not numbers, the code changes slightly to sheet='sheet name'.

Example 3.2.2. The Earthworm Density

In this example, we have a few more added complexities. Here, we have to skip the first line of the xls file, which is a description of the data, where the option is skip=1. Columns I, II, and IV are numeric vectors, though Column IV contains years which may be considered both numeric as well as a factor. The variable names in the xls file will be changed with col.names=c("Density","Biomass","Crop","Year","Soil Layer"). The data points are followed by a two-line citation of the research paper in which the data appears. We should not read this citation into R, and hence we specify that the number of rows which need to be read should be 12, and this is done with the option nrows=12. The xls file containing this dataset is Earthwormbiomass.xls. The source of the dataset is the same as discussed in the previous example.

> some <- read.xls("Earthwormbiomass.xls",skip=1,nrows=12,
+ header=TRUE,sep=",", col.names=c("Density","Biomass","Crop",
+ "Year","Soil Layer"))
> some
   Density Biomass              Crop Year Soil.Layer
1      210    15.1             Maize 1998      0-10
2      251    22.2             Maize 1999      0-10
12       3     0.6 Wheat and Mustard 1999     10-20
> sapply(some,class)
   Density    Biomass       Crop       Year Soil.Layer
 "integer"  "numeric"   "factor"  "integer"   "factor"

The sapply function shows that we have properly read the data into R.
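The same three options (skip, nrows, and col.names) are also accepted by read.csv, so the idea can be tried without an Excel file or the gdata package. The following is only a minimal sketch: the temporary file, its descriptor line, and its footer are made up for illustration, and only three data rows are used.

```r
# Hypothetical file: a descriptor line, a header, three data rows, a citation footer.
lines <- c(
  "Earthworm density and biomass (illustrative file)",
  "density,biomass,crop,year,layer",
  "210,15.1,Maize,1998,0-10",
  "251,22.2,Maize,1999,0-10",
  "3,0.6,Wheat and Mustard,1999,10-20",
  "Source: citation text that must not be read as data"
)
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

worms <- read.csv(tmp,
                  skip = 1,       # drop the descriptor line
                  nrows = 3,      # stop before the citation footer
                  header = TRUE,  # the next line holds the original names
                  col.names = c("Density", "Biomass", "Crop", "Year", "Soil Layer"),
                  stringsAsFactors = FALSE)
sapply(worms, class)
unlink(tmp)                       # remove the temporary file
```

Note that the name "Soil Layer" is turned into the syntactically valid Soil.Layer, exactly as in the output of the example above.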

Example 3.2.3. Removing Percentage Symbol from a Dataset

We consider the dataset Bacteria.XLS from the same source, as in the previous two examples. Here, the data in the second and third columns are percentage values, and accordingly end with a % symbol. We know from our experience that the value with the symbol is the percentage number. Unfortunately, R does not know this fact. The rest of the columns are numeric vectors.

We need to power our read.xls command with the colClasses option, which tells R to read all the columns as character vectors. The vectors that hold plain numbers can then be made numeric straightaway using the as.numeric function. For the second and third columns, we use type.convert and declare % as the decimal mark, so that a value such as 5% is read as the number 5 followed by an empty decimal part. The program given below carries out the steps detailed here.

> bacteria <- read.xls("Bacteria.XLS",colClasses="character")
> sapply(bacteria,class)
   Response        salt       lipid          pH        Temp
"character" "character" "character" "character" "character"
> bacteria[,"Response"] <- as.numeric(bacteria[,"Response"])
> bacteria[,"salt"] <- type.convert(bacteria[,"salt"],dec="%")
> bacteria[,"lipid"] <- type.convert(bacteria[,"lipid"],dec="%")
> bacteria[,"pH"] <- as.numeric(bacteria[,"pH"])
> bacteria[,"Temp"] <- as.numeric(bacteria[,"Temp"])
> sapply(bacteria,class)
 Response      salt     lipid        pH      Temp
"numeric" "numeric" "numeric" "numeric" "numeric"
> bacteria
    Response salt lipid pH Temp
1      -5.55    0     0  3    0
2      -5.15    1     0  3    0
3      -5.05    0     5  3    0
            .  .  .
299     0.25    3    20  0    2
300     1.03    4    20  0    2
> sapply(bacteria,mean)
Response     salt    lipid       pH     Temp
  0.2693   2.0000  10.0000   1.5000   1.0000

Note that we are extracting the variables from a data.frame using their names for the first time. The class of the variables shows that all the variables have been imported as character variables, which is not the data that we really require. Thus, there is a need to change them. The Response is simply converted using the as.numeric function, that is, we are reinforcing that we need the Response as a numeric variable. Next, the variables salt and lipid are converted from character to numeric with the specification that the decimals for the numeric values are occurring at the % symbol, and we have to use the function type.convert to achieve the result. The rest of the program can be understood without any further explanation. Also note that we have used the data.frame names to index the columns instead of the column numbers, which is again a nice R feature.
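A different route to the same result, shown here only as a hedged alternative to the type.convert approach, is to delete the % symbol with sub and then convert with as.numeric. The vector salt below is made up for illustration and is not taken from Bacteria.XLS.

```r
# Hypothetical percentage strings, as they might arrive from a spreadsheet column.
salt <- c("0%", "1%", "2.5%", "4%")

# fixed = TRUE treats "%" as a literal character rather than a regular expression.
salt_num <- as.numeric(sub("%", "", salt, fixed = TRUE))
salt_num

# If proportions rather than percentage points are needed, divide by 100.
salt_prop <- salt_num / 100
salt_prop
```

This variant also copes with values that carry a genuine decimal part, such as 2.5%.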

Example 3.2.4. Reading from the “nerve.dat” using the “scan” Function

We consider reading the nerve dataset, which was probably first used by Cox and Lewis (1966). This data is available on the web at http://www.stat.cmu.edu/larry/all-of-nonpar/=data/nerve.dat. The data consists of 799 observations of the waiting times between successive pulses along a nerve fiber. The dataset in the file is displayed as an arrangement of six observations per line, and we have 133 lines, and one more line containing the last observation. If we use the function read.csv, this dataset will be read as a data.frame consisting of six variables with 134 observations each. The first variable will contain 134 observations, whereas the 134-th observation for the remaining five variables will be a missing value NA. Using the scan function will properly read the dataset in the required format.

> nerve <- read.csv("nerve.dat",sep="\t") # Not the correct way
> dim(nerve)
[1] 133   6
> nerve <- scan("nerve.dat")
Read 799 items

Thus, we have been able to read the data correctly using the scan function, whereas read.csv forces it into an unsuitable data.frame.
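The difference between the two readers can be reproduced on a small scale. The sketch below writes 13 made-up values to a temporary file, six per line with the last value alone on the third line, and reads the file in both ways; the file and its values are assumptions for illustration only.

```r
vals <- (1:13) / 100                     # 13 made-up waiting times
tmp <- tempfile(fileext = ".dat")

# Six tab-separated values per line; the last line holds a single value.
chunks <- split(vals, ceiling(seq_along(vals) / 6))
writeLines(vapply(chunks, paste, character(1), collapse = "\t"), tmp)

# read.table forces a rectangular shape and pads the short line with NA.
as_table <- read.table(tmp, sep = "\t", fill = TRUE)
dim(as_table)

# scan simply returns all the values as one numeric vector.
as_vector <- scan(tmp)
length(as_vector)

unlink(tmp)
```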

Example 3.2.5. Reading the “Wine and Raters” Frequency Dataset using the “ftable” Function

Examples of judges and their ratings are of interest to consumers. Wine tasting is more of an art than a science. However, this cannot stop us from considering the data arising out of such experiments! Lindley does a wonderful analysis of such a scenario, see http://www.liquidasset.com/lindley.htm. The experiment involves two types of Tastings: Chardonnay and Cabernet. Each tasting has ten wines, labeled A-J. The 11 judges, labeled 1 to 11 across the file, comprise one Englishman, Steven Spurrier, one American, Patricia Gallagher, and nine French judges. Each of these 11 tasters tastes the 10 wines from both tastings. The ratings are made on a scale of 0 to 20. Thus, we have a total of 11 tasters times 10 wines times 2 tastings, that is, 220 observations. The Chardonnay wines are white wines and the Cabernet wines are red. White wine is popular among Americans and red among the French. This is an example of ordinal data and we require a special function to handle it.

Frequency data, if properly entered in a file and saved as .dat or .txt file, may be conveniently read into R using the read.ftable function as follows:¹

> wine <- read.ftable("wine.dat")
> wine
                 Tasters    1    2       10   11
Tastings   Wines
Chardonnay A             10.0 18.0     16.5 17.0
           B             15.0 15.0     16.0 14.5
           J              0.0  8.0      5.0  7.0
Cabernet   A             14.0 15.0     16.5 14.0
           B             16.0 14.0     16.0 14.0
           J              7.0  7.0      6.0  7.0

The function read.ftable is useful to read data from flat contingency tables. What is the advantage of reading the data as table objects? This framework allows easy handling of frequency in the sense that we can obtain the average ratings received by the ten wines across judges and tastings, or average Chardonnay and Cabernet ratings, and so forth. This needs the use of the xtabs function.

> xtabs(Freq~Wines,data=wine)/22
Wines
        A         B         C                J
14.272727 14.204545 13.636364    7.681818
> xtabs(Freq~Tastings,data=wine)/110
Tastings
Chardonnay   Cabernet
  11.35455   11.83636
> xtabs(Freq~Tasters,data=wine)/20
Tasters
     1      2      11
10.700 11.800    12.050

We have used two special features of R programming in ~ and data. In general ~ is used in R formulas, indicating a relationship that the variable on the left-hand side of the expression depends on the right-hand side. The variables on the right-hand side may be more than one. The evaluation of the formula depends on the function that is being deployed. The data option is used to specify that the variables used in the expression ~ are to be found in the data frame as declared.
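The formula and data mechanism can be tried on a tiny data frame. The ratings below are made up for illustration (two tastings, two wines, two tasters) and are not the values of wine.dat.

```r
# One row per (tasting, wine, taster) combination; Tastings varies fastest.
ratings <- expand.grid(Tastings = c("Chardonnay", "Cabernet"),
                       Wines    = c("A", "B"),
                       Tasters  = c("1", "2"))
ratings$Freq <- c(10, 14, 15, 16, 18, 15, 15, 14)   # made-up scores

# One variable on the right-hand side: total score per wine.
by_wine <- xtabs(Freq ~ Wines, data = ratings)
by_wine / 4        # each wine received 2 tastings x 2 tasters = 4 ratings

# Two variables on the right-hand side give a two-way table.
by_tasting_wine <- xtabs(Freq ~ Tastings + Wines, data = ratings)
by_tasting_wine
```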

The table, ftable, xtabs, etc., are very useful functions for analysis of categorical data, see Chapter 16.

We will check out one more example for the frequency table, which is predominantly useful for categorical data analysis.

Example 3.2.6. Preparing a Contingency Table

In this example we will read a dataset from an xls file, and then convert that data frame into a table or matrix form. Gore, et al. (2006) consider the frequencies of cancer deaths of Japanese atomic bomb survivors by extent of exposure, years after exposure, etc. This dataset has appeared in the journal “Statistical Sleuth”. The data is first read from an Excel file which creates a data.frame object in R. This needs to be converted into a contingency table format, which is later achieved using the xtabs function in R. The next R program achieves exactly the same.

> library(gdata)
> atombomb <- read.xls("atombombtest.xls",header=TRUE)
> atombombxtabs <- xtabs(Frequency~Radians+Count.Type+Count.Age.Group, data=atombomb)
> atombombxtabs
, , Count.Age.Group = '0-7'
       Count.Type
Radians At Risk Death Count
    0       262          10
    400      15           0
, , Count.Age.Group = '12-15'
       Count.Type
Radians At Risk Death Count
    0       240          19
    400      14           5
, , Count.Age.Group = '16-19'
       Count.Type
Radians At Risk Death Count
    0       243          12
    400      14           2
> class(atombombxtabs)
[1] "xtabs" "table"

This dataset will be used in Chapter 16.

Example 3.2.7. Reading Data from the Clipboard

A common practice is “Copy and Paste”. This practice is so prevalent that it is tempting to do that in R. Suppose that the data is to be copied from any external source, say SAS, Gedit, SPSS, EXL, etc., and then pasted into R. The common practice is merely to copy the matter which your computer then holds on the clipboard. For example, the matter, in the vertical display in a spreadsheet file, is the following:

NOP 10 28 ... 0 14 8

which we have copied to the clipboard. Next, do the following at the R console:

> read.table("clipboard",header=TRUE) # Copy-paste methods die hard
   NOP
1   10
2   28
...
17   0
18  14
19   8

Thus, the copied matter in the clipboard may be easily imported in R. Note that you might prefer to copy certain columns/rows from the spreadsheet available in your machine to the clipboard.

Example 3.2.8. Reading the Row Names

Thus far we have read external files with column names and we would like to find out if we can read the row names too. Consider an external file in a csv format. Here, we have taken the dataset from Everitt and Hothorn (2011) and saved the data in a csv file. This dataset is related to life expectancies for different countries and we have further information on the age and gender groups. In the csv file the first column has the names of the countries, and the remaining columns cover four different age groups for males and females. Particularly, we require the row names to be read from the country names and the column names to reflect the age group with gender. Using the code life=read.csv("lifedata.csv",header=TRUE,row.names=1), we can read the data in the required format.
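The effect of row.names=1 is easiest to see on a small case. In the sketch below the csv content is supplied through the text argument of read.csv instead of a file; the country labels, column names, and values are placeholders and do not reproduce lifedata.csv.

```r
# Placeholder csv content: first column holds the labels to use as row names.
csv_lines <- c("country,m0,m25,f0,f25",
               "CountryA,63,51,67,54",
               "CountryB,34,29,38,32")

life_demo <- read.csv(text = paste(csv_lines, collapse = "\n"),
                      header = TRUE,
                      row.names = 1)   # column 1 becomes the row names

rownames(life_demo)
life_demo["CountryB", "f25"]           # index by row name and column name
```

With the row names in place, the country column no longer counts as a variable, so the data frame has four columns rather than five.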

3.3 Reading Datasets of Foreign Formats

Datasets may be available in formats other than csv or dat. It is also a frequent situation where we need to read data stored in xls (Microsoft Excel) format, sav (SPSS), ssd (SAS), dta (STATA), etc. For example, if the Youden-Beale data was saved in the first sheet of an xls file, we can use the command:

> yb <- read.xls("/.../youden.xls",header=TRUE,sheet=1)
Converting xls file to csv file... Done.
Reading csv file... Done.

Note that R first internally converts the xls file into a csv file, and then imports it into the session.

Similarly, we can read datasets of other software. Consider the rootstock.dta available from http://www.stata-press.com/data/r10/rootstock.dta. This is a popular dataset in the domain of multivariate statistics. The soft copy rootstock.dta has been generated from the Stata software. We assume that this dataset has been downloaded from the web and stored in the current working directory. We can read this dataset in R using the read.dta function of the foreign package.

> library(foreign)
> rootstock <- read.dta("/.../rootstock.dta")

Section 1.4 listed many sources of data on the web. The laborious way of using them for analysis is to download them from the sources to the local hard disk and maybe to the current working directory. The technical way is to ask R to access and download the file and load the data into the working session. The next two small examples will clarify these ideas.

> rootstock.url <- "http://www.stata-press.com/data/r10/rootstock.dta" # Example 1
> rootstock <- read.dta(rootstock.url)
> crime.url <- "http://www.jrsainfo.org/jabg/state_data2/Tribal_Data00.xls" # Example 2
> crime <- read.xls(crime.url, pattern = "State")

Using the url link as a text string and appropriate importing functions such as read.dta and read.xls, we can import data in foreign formats to R.

3.4 Displaying R Objects

R objects are of varying nature and we may be interested in having a quick look at the dataset itself, and not through sophisticated tools such as graphics or statistical summaries. The utils package shipped along with R contains a host of functionalities for our purpose.

Suppose we want to see the first ten observations of the 100mrun.xls dataset that we imported earlier. Or we may like to see the last five observations of the same dataset. In R, head and tail are the two functions which give us this facility:

> head(newsome1,10)
   Year Time.sec.
1  1896      12.0
2  1900      11.0
10 1936      10.3
> tail(newsome1,5)
   Year Time.sec.
20 1984      9.99
24 2000      9.87

Another compact way of visualizing an object is to horizontally display the dataset. This is provided by the str function.

> str(newsome1)
'data.frame': 23 obs. of  2 variables:
 $ X1896: int  1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 ...
 $ X12  : num  11 11 10.8 10.8 10.8 10.6 10.8 10.3 10.3 10.3 ...

The fix function opens the dataset in a new editor window and saves any changes back to the object, whereas the View function is used to just view the dataset.

> fix(newsome1)
> View(newsome1)
> View(wine)

Check what exactly the edit function does to an R object.

Note that the window arising due to the fix function contains three tabs Copy, Paste and Quit, which may be used to change the dataset. The reader may find the edit function to be useful too. For more such interesting functions, run library(help=utils) at the R console and experiment.
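The display functions of this section can be tried on any data frame that ships with R. The short sketch below uses the built-in airquality dataset, which has 153 rows; a negative count in head or tail means "all but that many rows".

```r
first3 <- head(airquality, 3)       # first three rows
last2  <- tail(airquality, 2)       # last two rows

# A negative n drops rows instead: everything except the last 150 rows.
all_but_last150 <- head(airquality, -150)

first3
last2
str(airquality)                     # compact, one line per variable
```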

3.5 Manipulation Using R Functions

Consider a dataset where we have one column for measurement values of different individuals and another column with group indicators for those individuals. Now, we need to obtain some summaries of these measurements by the group indicator. This task can be achieved with the aggregate function, and the next example illustrates this. The examples discussed in Section 2.4.5 are also useful for manipulation of data preparation.

Example 3.5.1. Use of the `aggregate` function

We have the sat.csv file which contains data on Student ID Number, Grade, Pass indicator, Sat score, and GPP grade. Now, we wish to obtain the sum of the Sat scores by the GPP grade. The aggregate function helps us to achieve the result.

> data(sat)
> aggregate(sat$Sat,by=list(sat$GPP),sum)
  Group.1    x
1       A 4055
2       B 5590
3       C 4393
4       D 2164
5       F  574

Here we have used the by option to specify the groups, and sum is the FUN option. Thus, we have obtained the group sum using the aggregate function.
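The aggregate function also has a formula interface, which returns the original variable names instead of Group.1 and x. The sketch below uses a small made-up data frame named scores with the columns GPP and Sat; it is not the sat dataset.

```r
# Made-up scores for illustration only.
scores <- data.frame(GPP = c("A", "B", "A", "C", "B", "A"),
                     Sat = c(600, 550, 620, 480, 530, 610))

# Sum of Sat within each level of GPP.
by_grade <- aggregate(Sat ~ GPP, data = scores, FUN = sum)
by_grade

# Any summary function can be supplied through FUN, for example the mean.
mean_by_grade <- aggregate(Sat ~ GPP, data = scores, FUN = mean)
mean_by_grade
```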

Consider a situation where you know what the name of the variables should be. However, for some technical reason you cannot declare them before they are actually assigned some value. The question is then how can we do such assignments in the flow of a program. As an artificial example, assume that you feed to R the current top ten Sensex companies of the day. Sensex refers to a number indicative of the relative prices of shares on the Mumbai Stock Exchange. Now, you would like to create objects whose name is the company name and whose value is its Sensex closing value. We will use the assign function towards this end.

Example 3.5.2. Creating Variables in the Flow of a Program

Suppose that we have collected on our clipboard the top ten companies of Sensex for today. We then want to create ten new R objects which will have the end-of-day Sensex value. Check the one thing that has gone wrong with the below R program.

> Sensex <- read.table("clipboard",header=FALSE)
> Sensex
            V1       V2
1         Ram1 867.1884
2        Dyan3 866.1884
3    Kaps&Japs 865.1884
4  Rocks&Rolls 864.1884
5     JUSTBEST 863.1884
6     Sin_Gine 862.1884
7       Books1 861.1884
8  BikesMotors 860.1884
9          RCB 859.1884
10         JCF 858.1884
> library(chron) # provides the days function
> for(i in 1:10) {
+ nam <- paste(as.character(Sensex[i,1]),"_",days(Sys.time()),sep="")
+ assign(nam,Sensex[i,2])
+ }
> ls()
 [1] "BikesMotors_14" "Books1_14"      "Dyan3_14"       "i"
 [5] "JCF_14"         "JUSTBEST_14"    "Kaps&Japs_14"   "nam"
 [9] "Ram1_14"        "RCB_14"         "Rocks&Rolls_14" "Sensex"
[13] "Sin_Gine_14"
> JUSTBEST_14
[1] 863.1884
> Books1_14
[1] 861.1884
> RCB_14
[1] 859.1884

In this program we have used the paste function to create distinct names for the R objects. The variable names are constructed from the elements of the first column of the Sensex object. As we need to create the variables along with the current date, we use the Sys.time function, which returns the current date and time. Then, we use the days function from the chron package, which extracts the “day” from the date object. The variable name from the first column is concatenated with the day of the month using an underscore symbol "_", and the option sep="" says that there should be no gap between the various arguments of the paste function. The Sys.time function will be dealt with in more detail in Section 3.6.
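The assign function has a natural partner in get, which retrieves an object by its name given as a string. The sketch below is a variation under stated assumptions: it uses three of the company names from the example, obtains the day of the month with the base function format instead of chron's days, and writes into a separate environment called store so that the workspace is not cluttered.

```r
closing <- data.frame(company = c("Ram1", "JUSTBEST", "RCB"),
                      value   = c(867.1884, 863.1884, 859.1884),
                      stringsAsFactors = FALSE)

day <- format(Sys.Date(), "%d")   # day of the month as two-digit text
store <- new.env()                # assumption: keep generated objects out of the workspace

for (i in seq_len(nrow(closing))) {
  nam <- paste(closing$company[i], day, sep = "_")
  assign(nam, closing$value[i], envir = store)
}
ls(store)

# get() is the counterpart of assign(): look an object up by its name.
rcb_today <- get(paste("RCB", day, sep = "_"), envir = store)
rcb_today

# A named vector gives the same lookup without creating separate objects.
lookup <- setNames(closing$value, paste(closing$company, day, sep = "_"))
lookup[[paste("Ram1", day, sep = "_")]]
```

Whether separate objects or a single named vector is preferable depends on how the values are used later; the named vector is usually easier to loop over.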

With large data files it is memory-consuming to create new objects for some modifications of existing columns (or rows). Thus, there is this economic reason for modifying the objects without creating new ones. The R functions with and within meet the said requirement. For the faithful dataset, we will use the within function for carrying out necessary changes.

Example 3.5.3. Modifying `faithful` Dataset Using `within` R Function

In the faithful dataset there are two variables, eruptions and waiting, and both are measured in minutes. Suppose we seek to convert the eruption time into seconds and to transform the waiting time to the logarithmic scale. The within function can be used for this required manipulation.

> head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55
> faithful <- within(faithful,{
+ eruptions <- eruptions*60
+ waiting <- log(waiting)
+ })
> head(faithful)
  eruptions  waiting
1    216.00 4.369448
2    108.00 3.988984
3    199.98 4.304065
4    136.98 4.127134
5    271.98 4.442651
6    172.98 4.007333

The use of the within function helps in quick data preparation while avoiding creation of unnecessary new variables.
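The functions with and within differ in what they return: with evaluates an expression and returns its value, whereas within returns a modified copy of the data frame. A minimal sketch on the first six rows of the packaged dataset follows; the object names geyser and geyser2 are chosen here for illustration.

```r
geyser <- head(datasets::faithful)   # small copy; the packaged data stay untouched

# with(): evaluate an expression using the columns, return the value only.
mean_sec <- with(geyser, mean(eruptions * 60))
mean_sec

# within(): return a modified data frame; rm() inside the block drops a column.
geyser2 <- within(geyser, {
  eruptions_sec <- eruptions * 60
  rm(waiting)
})
names(geyser2)
```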

3.6 Working with Time and Date

Time and dates have always been a complex entity and they do not become any easier in programming languages either. A year has 7 months with 31 days, 4 months with 30 days, and 1 month with 28 days for 3 years and 29 days every fourth year. Even if we ignore leap years, the number of times each weekday, such as Monday, Tuesday, etc., occurs in a year differs, and these occurrences are unevenly distributed across the months. The number of days in a month, except for February in non-leap years, is not an integer multiple of 7, and the number of days in a year is not an integer multiple of either the number of months or the number of weeks in a year. Similarly, time order is also a complex issue to deal with.

A further complication is that the same date can be written in multiple ways. The styles “9-Sep-2010”, “9-Sep-10”, “09-September-2010”, “09-09-10”, “09/09/10”, etc., all represent the same date. The month may be written as a number or as text. The year may be specified either in full four digits or by its last two digits. We need to take into account all such complexities. Chapter 4 of Spector (2008) is a dedicated and rigorous treatment of handling dates, and we will deal with it in some detail.

As used in the previous section, the current time may be obtained using Sys.time(). Similarly, Sys.Date() returns the current system date. Internally, R stores each time stamp as a number, which can be easily verified.

> Sys.time()
[1] "2011-06-14 23:28:34 IST"
> as.numeric(Sys.time())
[1] 1308074318
> as.numeric(Sys.time()+1)
[1] 1308074367
> as.numeric(Sys.time()+2)
[1] 1308074370
> Sys.Date()
[1] "2011-06-14"

We see that R is currently reading time up to seconds accuracy. Can the accuracy be increased? That is, we need to know the time in millisecond units. As is the practice, set the default number of digits at 3 using the options function.

> op <- options(digits.secs=3)
> Sys.time()
[1] "2011-06-14 23:33:43.964 IST"
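If fractional seconds are needed only occasionally, a format string can request them for a single call without changing the global option. The sketch below relies on the %OS3 specification documented in ?strptime; the object names are illustrative.

```r
now <- Sys.time()

# %OS3 prints the seconds with three decimal places for this call only.
stamp <- format(now, "%H:%M:%OS3")
stamp
format(now, "%Y-%m-%d %H:%M:%OS3")

# options() returns the previous settings, so the change can be undone.
op <- options(digits.secs = 3)
options(op)                          # restore the earlier setting
```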

Objects returned by Sys.time belong to the classes POSIXct and POSIXt, which are the date/time classes of R. For more details, try ?POSIXct. In Example 3.4.2 we had used a small built-in object: month.abb. What is it really? Let us check it out.

> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug"
 [9] "Sep" "Oct" "Nov" "Dec"
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"

Let us store some date objects. To begin with we will consider the system date itself, and see the analysis (extraction actually) that may be performed with it.

> curr_date <- Sys.Date()
> curr_date
[1] "2015-04-13"
> years(curr_date); quarters(curr_date); months(curr_date)
[1] 2015
Levels: 2015
[1] "Q2"
[1] "April"
> days(curr_date); weekdays(curr_date); julian(curr_date)
[1] 13
31 Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8  < ... < 31
[1] "Monday"
[1] 16538
attr(,"origin")
[1] "1970-01-01"

In the above display, some functions are from the chron package, whereas the rest are from the base package. Tables 4.1, 4.2 and 4.3 of Spector (2008) have details about various formats of dates and time. We will clarify a few of them here. As seen earlier, a single date may be written in distinct ways: 9-Sep-2010, 9-Sep-10, 09-September-2010, 09-09-10, 09/09/10. The format can be specified through the format option as a string, and the conversion of text matter to date is achieved through the as.Date function. Let us now check how R understands these dates as one and the same.

> x1 <- as.Date('9-Sep-2010',format='%d-%b-%Y')
> x2 <- as.Date('9-Sep-10',format='%d-%b-%y')
> x3 <- as.Date('09-September-2010','%d-%B-%Y')
> x4 <- as.Date('09-09-10','%d-%m-%y')
> x5 <- as.Date('09/09/10','%d/%m/%y')
> x1;x2;x3;x4;x5
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
[1] "2010-09-09"
> x1==x2; x2==x3; x3==x4; x4==x5
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
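Two further points may help when experimenting with as.Date: a Date is stored internally as the number of days since 1970-01-01, and a format string that does not match the text yields NA rather than an error. The sketch below uses only numeric month formats, so it does not depend on the language of the session; the object names are illustrative.

```r
d1 <- as.Date("09-09-10",   format = "%d-%m-%y")
d2 <- as.Date("2010/09/09", format = "%Y/%m/%d")
d1 == d2

as.numeric(d1)                      # days elapsed since 1970-01-01

# A template that does not match the text gives NA, not an error.
bad <- as.Date("9-Sep-2010", format = "%d/%m/%Y")
bad

# format() goes the other way: from a Date back to text in a chosen style.
as_text <- format(d1, "%d/%m/%Y")
as_text
```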

Some algebra is possible with date objects, for example, difftime, mean, and range.

> x1+1
[1] "2010-09-10"
> difftime(x1,x2)
Time difference of 0 secs
> mean(c(x1,x2))
[1] "2010-09-09"
> range(c(x1,x2))
[1] "2010-09-09" "2010-09-09"

R is efficient in dealing with time and date variables and this section has given a brief exposition of it.
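As a small sketch of such date arithmetic on two distinct dates (chosen here purely for illustration), subtraction and difftime give the gap between them, and seq generates regularly spaced dates.

```r
start <- as.Date("2010-09-09")
end   <- as.Date("2010-12-25")

end - start                                            # a difftime object
gap_days  <- as.numeric(difftime(end, start, units = "days"))
gap_weeks <- as.numeric(difftime(end, start, units = "weeks"))
gap_days
gap_weeks

# The same day of the month for four consecutive months.
monthly <- seq(start, by = "month", length.out = 4)
monthly
```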

3.7 Text Manipulations

Data is not always in a ready-to-analyze format. We have seen that working with character or factor objects is not as convenient as working with numeric or integer objects. Working with text matter is a task with much higher complexity and inevitable inconvenience too. In fact, there is a specialized school working with such problems, known as Text miners. Here, we will illustrate the important text tools working our way through a complex text matter. The complexity of text functions is such that we feel that it is better to work with some examples instead of looking at their definitions. This approach forms the remainder of this section.

In Section 1.5, we indicated the importance and relevance of subscribing to the R mailing list. The mail exchanged among the subscribers is uploaded at the end of the day. Furthermore, all the mail in a month is consolidated in a gzip-compressed text file. As an example, the mail exchanged during the month of October in 2005 is available in the file 2005-10.txt.gz, which can be downloaded from the R website. The first few lines of this text file are displayed below:

From lisawang at uhnres.utoronto.ca  Sat Oct  1 00:14:23 2005
From: lisawang at uhnres.utoronto.ca (Lisa Wang)
Date: Fri, 30 Sep 2005 17:14:23 -0500
Subject: [R] How to get to the varable in a list
Message-ID: <[email protected]>
Hello,
I have a list "lis" as the following:

We will first learn how to read such text files in R using the readLines function, which reads the different lines of a txt file as a character class.

> Imine <- readLines("2005-10.txt.gz")
> Imine[1:10]
 [1] "From lisawang at uhnres.utoronto.ca  Sat Oct  1 00:14:23 2005"
 [2] "From: lisawang at uhnres.utoronto.ca (Lisa Wang)"
 [3] "Date: Fri, 30 Sep 2005 17:14:23 -0500"
 [4] "Subject: [R] How to get to the varable in a list"
 [5] "Message-ID: <[email protected]>"
 [6] ""
 [7] "Hello,"
 [8] ""
 [9] "I have a list \"lis\" as the following:"
[10] ""
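Note that readLines was given the compressed file directly. The sketch below reproduces this on a small scale, assuming a temporary gzip file with made-up lines: the file is written through a gzfile connection and then read back by name.

```r
tmp <- tempfile(fileext = ".txt.gz")

# Write a few made-up lines through a gzip connection.
con <- gzfile(tmp, open = "wt")
writeLines(c("From: someone at example.org (A. Sender)",
             "Subject: [R] a made-up subject line",
             "",
             "Hello,"), con)
close(con)

# Reading by file name: the compression is detected automatically.
msg <- readLines(tmp)
length(msg)
class(msg)

unlink(tmp)
```

Each line of the file, including the empty one, becomes one element of the character vector.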

Verify for yourself the difference in actual text file 2005-10.txt.gz and the R object Imine. The rest of the section will help you to extract information from such files. As an example, we will first extract Date, Subject, and Message-ID for the first mail of October 2005. We see that Date is in the third line of the object. We will ask R to return this line number with the use of functions grep and grepl.

> grep("Date",Imine[1:10])
[1] 3
> grepl("Date",Imine[1:10])
 [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Thus, we see that the grep function finds in which line (row number) the text of interest, commonly referred to as a string, occurs. The grepl function is its logical counterpart and returns TRUE or FALSE for each line. The number of characters in a line can be found through the R function nchar, as in the next line of code.

> unlist(lapply(Imine[1:10],nchar))
 [1] 61 48 37 48 50  0  6  0 37  0
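A few variations on these functions are worth trying. The header lines below are made up so that the word Date occurs in two of them; the sketch shows how anchoring the pattern with ^ narrows the match, how value=TRUE returns the lines themselves, and that nchar is already vectorized.

```r
hdr <- c("From: someone at example.org (A. Sender)",
         "Date: Fri, 30 Sep 2005 17:14:23 -0500",
         "Subject: [R] Update on the Date class",
         "")

any_date  <- grep("Date", hdr)                       # matches anywhere in the line
date_line <- grep("^Date: ", hdr)                    # only lines that start with "Date: "
date_text <- grep("^Date: ", hdr, value = TRUE)      # the matching line itself

# fixed = TRUE searches for the literal text "[R]" instead of a character class.
has_tag <- grepl("[R]", hdr, fixed = TRUE)

line_lengths <- nchar(hdr)                           # no lapply needed
any_date; date_line; has_tag; line_lengths
```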

The fourth line of the text file contains the subject of the mail. We want to obtain the subject sans the content header Subject: [R]. Note that we have included the space after [R]. We will first find the position where the subject begins and next extract the exact matter using the R function substring.

> nchar("Subject: [R] ")
[1] 13
> substring(Imine[4],14)
[1] "How to get to the varable in a list"

Thus, the function substring skips the first 13 characters of the string and returns the rest, leaving the original object unchanged. Let us now extract the message ID of this particular mail. As with the subject, the text Message-ID: < indicates the line of the object which contains the message ID. An added complexity is that we need to remove the sign > at the end of the line. We can see once more the utility of the nchar function: the line has 50 characters and the header 13, so the ID occupies positions 14 to 49.

> grep("Message-ID: <",Imine[1:10])
[1] 5
> nchar(Imine[5])
[1] 50
> nchar("Message-ID: <")
[1] 13
> substr(Imine[5],14,49)
[1] "[email protected]"
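
The positions 14 and 49 hold only for this particular line. A sketch of how the same extraction might be written so that it adapts to the length of the line is given below; the identifier in id_line is made up for illustration, and the pattern-based variant is an alternative, not the approach taken in the text.

```r
# Hypothetical header line; the identifier is made up for illustration
id_line <- "Message-ID: <abc123@example.org>"

# Position-based, with both positions computed rather than typed in
start <- nchar("Message-ID: <") + 1
id_by_position <- substr(id_line, start, nchar(id_line) - 1)

# Pattern-based: capture whatever sits between "<" and the final ">"
id_by_pattern <- sub("^Message-ID: <(.*)>$", "\\1", id_line)

id_by_position
id_by_pattern
```

Both calls return the text between the angle brackets, whatever its length.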

We will conclude this section with the extraction of the date and time of this message. The line containing them is indicated by Date:. After removing the header, the weekday, and the time zone offset, we are left with the date and time of the message. Recall from the previous section that the format of 30 Sep 2005 17:14:23 is %d %B %Y %H:%M:%S (on input, %B also matches the abbreviated month name). Thus, we can extract the exact date and time of this email.

> grep("Date: ",Imine[1:10])
[1] 3
> temp <- strsplit(Imine[3],"Date: ")[[1]][2]
> temp
[1] "Fri, 30 Sep 2005 17:14:23 -0500"
> tempdate <- substring(temp,6,nchar(temp)-6)
> tempdate
[1] "30 Sep 2005 17:14:23"
> strptime(tempdate,"%d %B %Y %H:%M:%S")
[1] "2005-09-30 17:14:23"
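
The steps above discard the offset -0500, so the result is read as if it were local time. If the offset matters, for instance when comparing mails sent from different time zones, one possible variant keeps it and parses it with %z. This is a sketch under two assumptions: the month names are read in an English locale, and the result is wanted in UTC. Since 17:14:23 at five hours behind UTC is 22:14:23 UTC, that is the time we expect to see.

```r
temp <- "Fri, 30 Sep 2005 17:14:23 -0500"

# Drop the leading weekday ("Fri, ") but keep the UTC offset
stamp <- sub("^[A-Za-z]+, ", "", temp)

# %b is the abbreviated month name, %z the signed offset from UTC
# (assumes an English locale for the month names)
when <- as.POSIXct(strptime(stamp, "%d %b %Y %H:%M:%S %z", tz = "UTC"))
format(when, "%Y-%m-%d %H:%M:%S", tz = "UTC")
```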

Though this book does not deal with the emerging area of text mining, a great deal of data exists in rich and hidden forms in text files. The functions used here form the base for many text manipulations.


3.8 Scripts and Text Editors for R

In earlier sections we saw the need for storing results in objects rather than just computing at the terminal. As needs and experience grow, the user finds it difficult to get the task accomplished even within this framework. Consider a hypothetical scenario where the program runs to a few hundred lines, and a mistake made at the 21st line is noticed only after 89 lines of code have been executed. There is thus an intrinsic need to write the R code in a separate file and execute it from there. Consider the following set of code:

yb <- read.table("/.../youden.csv",header=TRUE,sep=",")
quantile(yb$Preparation_1,seq(0,1,.1))
# here seq gives 0, .1, .2, ..., 1
quantile(yb$Preparation_2,seq(0,1,.1))
fivenum(yb$Preparation_1)
fivenum(yb$Preparation_2)
sd(yb$Preparation_1); sd(yb$Preparation_2)
var(yb$Preparation_1); var(yb$Preparation_2)
range(yb$Preparation_1); range(yb$Preparation_2)

Copy and paste this code into any text editor, such as Notepad, vi, kate, gedit, etc., and save the file as yb.R or yb.txt. Under File in the menu ribbon, a Windows user will find the New Script and Open Script options. Using the Open Script option, load the yb.R file. Choose Run line or selection from the Edit menu. We will then get the same results as those obtained in Section 2.4.1. If there are any errors, we can modify the code in the yb.R file, and thus the task of fixing bugs is simplified. The Windows user may also explore the package Rcmdr explained in the next subsection.
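
A saved script can also be run from within an R session on any platform with the source function. The sketch below does not use yb.R, since the data file is not available here; instead it writes a small, made-up script to a temporary file and runs it. The file contents and the object x are assumptions for illustration.

```r
# Write a small, hypothetical script to a temporary file
script <- tempfile(fileext = ".R")
writeLines(c("x <- c(2, 4, 4, 5, 7)",
             "fivenum(x)",
             "sd(x)"), script)

# Run every line of the file in the current session;
# echo = TRUE prints each command before its result
source(script, echo = TRUE)
```

Without echo = TRUE, source runs the file silently and results such as fivenum(x) are not printed unless wrapped in print.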

3.8.1 Text Editors for Linuxians

Linuxians unfortunately do not have any menu ribbon option available to them. Fortunately, there is a host of other options. As an example, open the terminal and change to the working directory that holds the script. From there, a one-line command such as Rscript yb.R runs the file without opening an interactive R session.

We demonstrate two more options for Linuxians. Prof John Fox, McMaster University, and his team have developed a special package, Rcmdr. Installing packages and loading libraries will be explained in the next subsection. Enter the code below in the R session:

> library(Rcmdr)

and what you will see next is a user-friendly version of R, with a new set of very useful tools. The options on the menu ribbon now include File, Edit, Data, Statistics, Graphics, Models, Distributions, Tools, and Help. From the File menu, choose Open script file, open the file yb.R, and then click on the Submit button. We get the same results as earlier.

The third option for a Linuxian is better still. You can go to http://rkward.sourceforge.net/ and download the RKWard 0.5.3 binaries. Of course, knowing that Linuxians prefer the terminal, you can simply key in sudo apt-get install rkward. Having installed RKWard, start it. This software is as user-friendly as many enterprise editions. The options on the ribbon are File, Edit, View, Workspace, Run, Analysis, Plots, Distributions, Windows, Settings, and Help. If the reader is wondering why we dwell on such trivial details, the justification is that we have seen a lot of users struggle with R in the Linux environment. We will compromise here, though, by not repeating how to run yb.R in RKWard.

RStudio is quickly emerging as a popular variant and may be obtained from www.rstudio.com.


3.9 Further Reading

Spector (2008) is a comprehensive book dedicated to data preparation. Chapter 2 of Venables and Ripley (2002) covers data manipulation for R and S. We also recommend that the reader go through Chapter 9 of Zuur et al. (2009) for common R mistakes.

3.10 Complements, Problems, and Programs

Problem 3.1 For the data.frame some in Example 3.2.2, what will be your expectation of the R code summary(some)? Validate the expectation by running the code too.
Problem 3.2 By considering the dataset rootstock imported in Section 3.3, export the data back to the working directory using the write.dta function from the foreign package.
Problem 3.3 Run edit(newsome1) as required in Section 3.4, and comment on how this function is different from the View function.
Problem 3.4 For any directory in your computer, use the function list.files to obtain the contents, inclusive of files and maybe other directories. Recollect that the default list.files() function returns the contents in the working directory, and hence you need to experiment with a directory other than getwd().
Problem 3.5 The attach function, when applied on a data.frame object, loads the variables in the R session. How do you undo this operation? If the attach function is repeated more than once, what will be the result?
Problem 3.6 Suppose that the option header=FALSE is used in error when an object is imported. Write appropriate code which brings up the right variable names and also deletes the wrong observations. For example, suppose that the chest data is inappropriately imported with chest <- read.csv("Chest_VH.csv",header=FALSE). A simple use of names(chest) <- chest[1,] and chest <- chest[-1,] will not fix the problem.
Problem 3.7 Using the aggregate function, as in Example 3.5.1, obtain the frequency instead of sum. Also, extend the list variables in the example to include both GPP and Grade, and hence obtain the sum of Sat for possible combinations of these two variables.
Problem 3.8 Using the ifelse conditional function, create a new as.Date type of function, which can read date objects available in a vector in two different forms.
Problem 3.9 Find the time difference between two time objects in units of hours, days, etc.
