4
Data Analysis Using R Programming

An online job advertisement for a statistician reads as follows:

Job Summary
Statistician I
Salary: Open
Employer: XYZ Research and Statistics
Location: City X, State Y
Type: Full time – entry level
Category: Financial analyst/Statistics,
Data analysis/processing, Statistical organization
& administration
Required Education: Master's degree preferred

XYZ Research and Statistics is a national leader in designing, managing, and analyzing financial data. XYZ partners with other investigators to offer respected statistical expertise supported by sophisticated web-based data management systems. XYZ services assure timely and secure implementation of trials and reliable data analyses.

Job Description
Position Summary: An exciting opportunity is available for a statistician to join a small but growing group focused on financial investment analysis and related translational research. XYZ, which is located in downtown City XX, is responsible for the design, management, and analysis of a variety of investment and financial studies, as well as the analysis of associated market data. The successful candidate will collaborate with fellow statistics staff and financial investigators to design, evaluate, and interpret investment studies.
Primary Duties and Responsibilities: Analyzes investment situations and associated ancillary studies in collaboration with fellow statisticians and other financial engineers. Prepares tables, figures, and written summaries of study results; interprets results in collaboration with other financial investigators; and assists in preparation of manuscripts. Provides statistical consultation to collaborating staff. Performs other job-related duties as assigned.
Requirements
Required Qualifications: Master's degree in Statistics, Applied Mathematics, or a related field. Sound knowledge of applied statistics. Proficiency in statistical computing in R.
Preferred Responsibilities/Qualifications: Statistical consulting experience. S-Plus or R programming language experience. Experience with analysis of high-dimensional data. Ability to communicate well orally and in writing. Excellent interpersonal/teamwork skills for effective collaboration. Spanish language skills a plus.
*In your cover letter, describe how your skills and experience match the qualifications for the position.
To learn more about XYZ, visit www.XYZ.org.

Clearly, one should be cognizant of the overt requirement of an acceptable level of professional proficiency in data analysis using R programming!

Even if one is not in such a job market, a statistician working in the fields of finance, asset allocation, portfolio optimization, and so on would find a skill set that includes R programming helpful and interesting.

4.1 Data and Data Processing

Data are facts or figures from which conclusions can be drawn. When the data have been recorded, classified, organized, related, or interpreted within a framework so that meaning emerges, they become information. There are several steps involved in turning data into information, and these steps are known as data processing. This section describes data processing and how computers perform these steps efficiently and effectively.

It will be indicated that many of these processing activities may be undertaken using R programming, or performed in an R environment with the aid of available R packages, in which R functions and data sets are stored.

4.1.1 Introduction

4.1.1.1 Coding

4.1.1.1.1 Automated coding systems

The simplified flowchart shows how raw data are transformed into information:

Raw data → Data processing → Information

Data processing takes place once all of the relevant data have been collected. They are gathered from various sources and entered into a computer where they can be processed to produce information – the output.

Data processing includes the following steps:

  • Data coding
  • Data capture
  • Editing
  • Imputation
  • Quality control
  • Producing results
Data Coding

First, before raw data can be entered into a computer, they must be coded. To do this, survey responses must be labeled, usually with simple, numerical codes. This may be done by the interviewer in the field or by an office employee. The data coding step is important because it makes data entry and data processing easier.

Surveys have two types of questions – closed questions and open questions. The responses to these questions affect the type of coding performed.

A closed question means that only a fixed number of predetermined survey responses are permitted. These responses will have already been coded.

The following question, in a survey on sporting activities, is an example of a closed question:

To what degree is sport important in providing you with the following benefits?

  1. Very important
  2. Somewhat important
  3. Not important
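
As a minimal illustration (with hypothetical responses), such precoded answers may be stored and tabulated directly in R; the vector importance below is invented for this sketch:

> importance <- c(1, 3, 2, 1, 1, 2, 3, 1) # Hypothetical coded responses
> table(importance) # Counting the responses per code
importance
1 2 3 
4 2 2 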

An open question implies that any response is allowed, making subsequent coding more difficult. In order to code an open question, the processor must sample a number of responses, and then design a code structure that includes all possible answers.

The following is an example of an open question:

What sports do you participate in?

Specify (28 characters)__________________________________

In the Census and almost all other surveys, the codes for each question field are premarked on the questionnaire. To process the questionnaire, the codes are entered directly into the database and are prepared for data capturing. The following is an example of premarked coding:

What language does this person speak most often at home?

  1. English
  2. French
  3. Other – Specify_________________
Automated Coding Systems

There are programs in use that will automate repetitive and routine tasks. Some of the advantages of an automated coding system are that the process increasingly becomes

  • faster,
  • more consistent, and
  • more economical.

The next step in data processing is inputting the coded data into a computer database. This method is known as data capture.

Data Capture

This is the process by which data are transferred from a paper copy, such as questionnaires and survey responses, to an electronic file. The responses are then put into a computer. Before this procedure takes place, the questionnaires must be groomed (prepared) for data capture. In this processing step, the questionnaire is reviewed to ensure that all of the minimum required data have been reported and that they are decipherable. This grooming is usually performed during extensive automated edits.

There are several methods used for capturing data:

  • Tally charts are used to record data, such as the number of occurrences of a particular event, and to develop frequency distribution tables.
  • Batch keying is one of the oldest methods of data capture. It uses a computer keyboard to type in the data. This process is very practical for high-volume entry where fast production is a requirement. No editing procedures are necessary but there must be a high degree of confidence in the editing program.
  • Interactive capture is often referred to as intelligent keying. Usually, captured data are edited before they are imputed. However, this method combines data capture and data editing in one function.
  • Optical character readers or bar code scanners are able to recognize alpha or numeric characters. These readers scan lines and translate them into the program. These bar code scanners are quite common and often seen in department stores. They can take the shape of a gun or a wand.
  • Magnetic recordings allow for both reading and writing capabilities. This method may be used in areas where data security is important. The largest application for this type of data capture is the PIN found on automatic bank cards. A computer keyboard is one of the best known input (or data entry) devices in current use. In the past, people performed data entry using punch cards or paper tape.

Some modern examples of data input devices are

  • optical mark reader
  • bar code reader
  • scanner used in desktop publishing
  • light pen
  • trackball
  • mouse

Once data have been entered into a computer database, the next step is ensuring that all of the responses are accurate. This method is known as data editing.

Data Editing

Data should be edited before being presented as information. This action ensures that the information provided is accurate, complete, and consistent. There are two levels of data editing – micro- and macroediting.

Microediting corrects the data at the record level. This process detects errors in data through checks of the individual data records. The intent at this point is to determine the consistency of the data and correct the individual data records.

Macroediting also detects errors in data, but does this through the analysis of aggregate data (totals). The data are compared with data from other surveys, administrative files, or earlier versions of the same data. This process determines the compatibility of data.
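
As a brief hedged sketch in R (the variable age and the limits are hypothetical), a microedit might range-check individual records, while a macroedit-style check compares an aggregate against reference figures:

> age <- c(34, 29, 151, 42, -3) # Hypothetical survey records
> which(age < 0 | age > 120) # Microediting: flag impossible ages
[1] 3 5
> mean(age) # A macroediting-style aggregate, to be compared
[1] 50.6    # with totals from other surveys or files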

Imputations

Editing is of little value to the overall improvement of the actual survey results, if no corrective action is taken when items fail to follow the rules set out during the editing process. When all of the data have been edited using the applied rules and a file is found to have missing data, then imputation is usually done as a separate step.

Nonresponse and invalid data definitely impact the quality of the survey results.

Imputation resolves the problems of missing, invalid, or incomplete responses identified during editing, as well as any editing errors that might have occurred.

At this stage, all of the data are screened for errors because respondents are not the only ones capable of making mistakes; errors can also occur during coding and editing.

Some types of imputation methods include the following:

  • Hot deck uses other records as “donors” in order to answer the question (or set of questions) that needs imputation.
  • Substitution relies on the availability of comparable data. Imputed data can be extracted from the respondent's record from a previous cycle of the survey, or the imputed data can be taken from the respondent's alternative source file (e.g., administrative files or other survey files for the same respondent).
  • Estimator uses information from other questions or from other answers (from the current cycle or a previous cycle), and through mathematical operations derives a plausible value for the missing or incorrect field.
  • Cold deck makes use of a fixed set of values, which covers all of the data items. These values can be constructed with the use of historical data, subject matter expertise, and so on.
  • The donor can also be found through a method called nearest neighbor imputation. In this case, some sort of criteria must be developed to determine which responding unit is “most like” the unit with the missing value in accordance with the predetermined characteristics. The closest unit to the missing value is then used as the donor.

Imputation methods can be performed automatically, manually, or in combination.
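
As a small sketch in R (all values hypothetical), two of these ideas may be imitated in a few lines: an estimator-style mean imputation, and a crude hot deck that draws donor values at random from the responding records:

> income <- c(52, 61, NA, 48, NA, 57) # Hypothetical records with missing values
> est <- income
> est[is.na(est)] <- mean(income, na.rm=TRUE) # Estimator-style mean imputation
> hot <- income
> donors <- income[!is.na(income)] # The responding records act as donors
> hot[is.na(hot)] <- sample(donors, sum(is.na(hot)), replace=TRUE) # Hot deck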

Data Quality

  • Quality assurance
  • Quality control
  • Quality management in statistical agencies

Quality is an essential element at all levels of processing. To ensure the quality of a product or service in survey development activities, both quality assurance and quality control methods are used.

Quality Assurance

Quality assurance refers to all planned activities necessary in providing confidence that a product or service will satisfy its purpose and the users' needs. In the context of survey conducting activities, this can take place at any of the major stages of survey development: planning, design, implementation, processing, evaluation, and dissemination.

This approach anticipates problems before they occur, and uses all available information to generate improvements. It is not restricted to any specific quality standard; it is applicable mostly at the planning stage, and is all-encompassing in its activities.

Quality Control

Quality control is a regulatory procedure through which one may measure quality against preset standards, and then act on any differences. Examples of this include controlling the quality of the coding operation, the quality of the survey interviewing, and the quality of the data capture.

Quality control responds to observed problems, using ongoing measurements to make decisions on the processes or products. It requires a prespecified quality standard for comparison, and it is applicable mostly at the processing stage, following a set procedure that is a subset of quality assurance.

Quality Management in Statistical Agencies

The quality of the data must be defined and assured in the context of being “fit for use,” which will depend on the intended function of the data and the fundamental characteristics of quality. It also depends on the users' expectations of what is considered to be useful information.

There is no standard definition among statistical agencies for the term "official statistics." There is a generally accepted, but evolving, range of quality issues underlying the concept of "fitness for use." These elements of quality need to be considered and balanced in the design and implementation of an agency's statistical program.

The following is a list of the elements of quality:

  • Relevance
  • Accuracy
  • Timeliness
  • Accessibility
  • Interpretability
  • Coherence

These elements of quality tend to overlap. Just as there is no single measure of accuracy, there is no effective statistical model for bringing together all these characteristics of quality into a single indicator. Also, except in simple or one-dimensional cases, there is no general statistical model for determining whether one particular set of quality characteristics provides higher overall quality than another.

Producing Results

After editing, data may be processed further to produce a desired output. The computer software used to process the data will depend on the form of output required. Software applications for word processing, desktop publishing, graphics (including graphing and drawing), programming, databases, and spreadsheets are commonly used. The following are some examples of ways that software can produce results from data (a small sketch follows the list):

  • Spreadsheets are programs that automatically add columns and rows of figures, calculate means, and perform statistical analyses.
  • Databases are electronic filing cabinets. They systematically store data for easy access, and produce summaries, aggregates, or reports.
  • Specialized programs can be developed to edit, clean, impute, and process the final tabular output.
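
For instance, a few lines of R can play the spreadsheet role described above, computing column totals and means for a small, hypothetical table of coded responses:

> results <- data.frame(q1=c(1, 2, 2, 1), q2=c(3, 3, 1, 2)) # Hypothetical data
> colSums(results) # Column totals
q1 q2 
 6  9 
> colMeans(results) # Column means
  q1   q2 
1.50 2.25 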

Review Questions for Section 4.1

  1. In the job description for an entry level statistician today, from the viewpoint of a prospective applicant for that position, what basic statistical computing languages are important in order to meet the requirement? Why?
  2. For a typical MBA (Master of Business Administration) program in Business and Finance, should the core curriculum include the development of proficient skill in the use of R programming in Statistics? Why?
    1. Contrast the concepts of data and information.
    2. How does data processing convert data into information?
  3. In the steps that convert data into information, how are statistics and computing applied to the various data processing steps?
    1. Describe and delineate quality assurance and quality control in computer data processing.
    2. In what way does statistics feature in these phases of data processing?

4.2 Beginning R

R is an open-source, freely available, integrated software environment for data manipulation, computation, analysis, and graphical display.

The R environment consists of the following:

  • A data handling and storage facility
  • Operators for computations on arrays and matrices
  • A collection of tools for data analysis
  • Graphical capabilities for analysis and display
  • An efficient and continually developing, algebra-like programming language that includes loops, conditionals, user-defined functions, and input and output capabilities.

The term “environment” is used to show that it is indeed a planned and coherent system.

R and Statistics

R was initially written by Ross Ihaka and Robert Gentleman of the Statistics Department of the University of Auckland, New Zealand. Since mid-1997 there has been an R development core group of about 20 people with write access to the R source code.

The R environment, which evolved from the S/S-Plus languages, was not originally directed primarily toward statistics. However, since its development in the 1990s, it appears to have been "hijacked" by many working in the areas of classical and modern statistical techniques, including many applications in financial engineering, econometrics, and biostatistics with respect to epidemiology, public health, and preventive medicine. These applications are the raison d'être for writing this book.

As of this writing, the latest version of R is R-3.3.2, officially released on October 31, 2016. The primary source of R packages is the Comprehensive R Archive Network, CRAN, at http://cran.r-project.org/. Another source of R packages may be found in numerous publications, for example, the Journal of Statistical Software, now at its 45th volume, is available at http://www.jstatsoft.org/v45.

Let us get started (the R-3.3.2 environment is being used here). The R environment may be obtained as follows:

Here is R:
Let us download the open-source high-level program R from the Internet and take a first look at the R computing environment.
Remark: Access the Internet at the website of CRAN (The Comprehensive R Archive Network): http://cran.r-project.org/
To install R (R-3.3.2-win32.exe): http://www.r-project.org/
=> Download R
=> Select a CRAN mirror, for example (USA):
   http://cran.cnr.berkeley.edu/ (University of California, Berkeley, CA)
=> Windows (95 and later)
=> base
=> R-3.3.2-win32.exe
AFTER the downloading:
=> Double-click on R-3.3.2-win32.exe (on the desktop) to unzip and install R
=> An icon (R 3.3.2) will appear on one's computer "desktop," as shown in Figure 4.1.
On the computer "desktop" is the R icon:

In this book, the following special color scheme legend will be used for all statements during the computational activities in the R environment, to clarify the various inputs to and outputs from the computational process:

  1. Texts in this book (WarnockPro-Regular font)
  2. Line input in R code (CourierStd)
  3. Line output in R code (CourierStd)
  4. Line comment statements in R code (WarnockPro-Italic font)

Figure 4.1 The R icon on the computer desktop (The R 3.3.2 looks exactly the same as that for R 2.9.1).

Note: The # sign is the comment character: all text on the line following this sign is treated as a comment by the R program, that is, no computational action is taken on it; the computation proceeds exactly as though the comments were absent. Comment statements help the programmer and user by clarifying the purposes of the surrounding R code.

# is known as the number sign; it is also known as the pound sign/key, the hash key, and, less commonly, as the octothorp (with variant spellings such as octothorpe, octathorp, and octatherp).

To use R under Windows: Double-click on the R 3.3.2 icon…

Upon selecting and clicking on R, the R window opens with the following declaration:

R version 3.3.2 (2016-10-31)
Copyright 2016 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> # This is the R computing environment.
> # Computations may begin now!
> # First, use R as a calculator, and try a simple arithmetic
> # operation, say: 1 + 1
> 1+1
[1] 2 # This is the output!
> # WOW! It's really working!
> # The [1] in front of the output result is part of R’s way of printing
> # numbers and vectors. Although it is not so useful here, it does
> # become so when the output result is a longer vector

From this point on, this book is most beneficially read with the R environment at hand. It will be a most effective learning experience if one practices each R command as one goes along the textual materials.

4.2.1 A First Session Using R

This section introduces some important and practical features of the R environment (Figure 4.2). Log in and start an R session in the Windows system of the computer:

>
> # This is the R environment.
> help.start() # Outputting the page shown in Figure 4.2
> # Statistical Data Analysis Manuals
starting httpd help server ... done
If nothing happens, you should open
‘http://127.0.0.1:28103/doc/html/index.html’ yourself
At this point, explore the HTML interface to on-line help right from the desktop, using the mouse pointer to note the various features of this facility available within the R environment. Then, returning to the R environment:
> help.start()
Carefully read through each of the sections under "Manuals" to obtain an introduction to the basic language of the R environment. Then look through the items under "Reference" to reach beyond the elementary level, including access to the available "R Packages" – all R functions and datasets are stored in packages.
For example, if one selects the Packages reference, the R Package Index window shown in Figure 4.3 will open, listing the collection of R program packages under the R library: C:\Program Files\R\R-2.14.1\library

Figure 4.2 Output of the R Command.

One may now access each of these R program packages, and use them for further applications as needed.


Figure 4.3 Package Index.

Returning to the R environment:

>
> x <- rnorm(100)
> # Generating a pseudo-random 100-vector x
> y <- rnorm(x)
> # Generating another pseudo-random 100-vector y
> plot (x, y)
> # Plotting x vs. y in the plane, resulting in a graphic
> # window: Figure 4.4.

Remark: For reference, Appendix 1 contains the CRAN documentation of the R function plot(), available for graphic outputting, which may be found by the R code segment:


Figure 4.4 Graphical Output for plot (x, y).

> ?plot

CRAN has documentation for many R functions and packages.

Again returning to the R workspace, enter

>
>
> ls() # (This is a lower-case “L” followed by “s”, viz., the ‘list’
>  # command.)
> # (NOT 1 = “ONE” followed by “s”)
> # This command will list all the R objects now in the
> # R workspace:
> # Outputting:
[1] "E" "n" "s" "x" "y" "z"

Again returning to the R workspace, enter

>
> rm (x, y) # Removing all x and all y from the R workspace
> x # Calling for x
Error: object 'x' not found
> # Of course, the xs have just been removed!
> y # Calling for y
Error: object 'y' not found # Because the ys have also been
# removed!
>
> x <- 1:10 # Let x = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> x # Outputting x (just checking!)
[1] 1 2 3 4 5 6 7 8 9 10
> w <- 1 + sqrt(x)/2 # w is a weighting vector of
> # standard deviations
> dummy <- data.frame (x = x, y = x + rnorm(x)*w)
> # Making a data frame of 2 columns, x, and y, for inspection
> dummy # Outputting the data frame dummy
    x         y
1   1  1.311612
2   2  4.392003
3   3  3.669256
4   4  3.345255
5   5  7.371759
6   6 -0.190287
7   7 10.835873
8   8  4.936543
9   9  7.901261
10 10 10.712029
>
> fm <- lm(y∼x, data=dummy)
> # Doing a simple Linear Regression
> summary(fm) # Fitting a simple linear regression of y on x,
> # then inspect the analysis, and outputting:
Call:
lm(formula = y ∼ x, data = dummy)
Residuals:
    Min      1Q  Median      3Q     Max 
-6.0140 -0.8133 -0.0385  1.7291  4.2218 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.0814     2.0604   0.525   0.6139  
x             0.7904     0.3321   2.380   0.0445 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.016 on 8 degrees of freedom
Multiple R-squared: 0.4146, Adjusted R-squared: 0.3414
F-statistic: 5.665 on 1 and 8 DF, p-value: 0.04453
> fm1 <- lm(y∼x, data=dummy, weights=1/w^2)
> summary(fm1) # Knowing the standard deviation,
> # then doing a weighted
> # regression and outputting:


Call:
lm(formula = y ∼ x, data = dummy, weights = 1/w^2)
Residuals:
  Min 1Q Median 3Q Max
-2.69867 -0.46190 -0.00072 0.90031 1.83202
Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   1.2130     1.6294   0.744   0.4779  
x             0.7668     0.3043   2.520   0.0358 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.356 on 8 degrees of freedom
Multiple R-squared: 0.4424, Adjusted R-squared: 0.3728
F-statistic: 6.348 on 1 and 8 DF, p-value: 0.03583
> attach(dummy) # Making the columns in the data
> # frame available as variables
The following object(s) are masked _by_ '.GlobalEnv': x
> lrf <- lowess(x, y) # A nonparametric local
> # regression function
> lrf
> plot (x, y) # Making a standard point plot, outputting: Figure 4.5.

Figure 4.5 Graphical Output for plot (x, y).

> # Continuing the session:
> lines(lrf) # Adding in the local regression line,
> # outputting: Figure 4.6.
> abline(0, 1, lty=3) # adding in the true regression line:
> # (Intercept = 0, Slope = 1),
> # outputting: Figure 4.7.
> abline(coef(fm)) # adding in the unweighted regression line:
> # outputting Figure 4.8
abline(coef(fm1), col="red")
# adding in the weighted regression line:
# outputting Figure 4.8.
detach() # Removing data frame from the search path
plot(fitted(fm), resid(fm), # Doing a standard diagnostic
+ # plot
+ xlab="Fitted values", # to check for heteroscedasticity**,
+ ylab="residuals", # viz., checking for differing variance.
+ main="Residuals vs Fitted")
# Outputting Figure 4.9.
**Heteroskedasticity occurs when the variance of the error terms differ across observations.
> qqnorm(resid(fm), main="Residuals Rankit Plot")
> # Doing a normal scores plot to check for skewness, kurtosis,
> # and outliers. (Not very useful here.)
> # Outputting Figure 4.11.
> rm(fm, fm1, lrf, x, dummy) # Removing these 5 objects
> fm
Error: object 'fm' not found # Checked!
> fm1
Error: object 'fm1' not found # Checked!
> lrf
Error: object 'lrf' not found # Checked!
> x
Error: object 'x' not found # Checked!
> dummy
Error: object 'dummy' not found # Checked!
> # END OF THIS PRACTICE SESSION!

Figure 4.6 Adding in the local regression line.


Figure 4.7 Adding in the true regression line (intercept = 0, slope =1).


Figure 4.8 Adding in the unweighted regression line.


Figure 4.9 Adding in the weighted regression line.


Figure 4.10 A standard diagnostic plot to check for heteroscedasticity.


Figure 4.11 A normal scores plot to check for skewness, kurtosis, and outliers.

4.2.2 The R Environment – This is Important!

Getting through the First Session in Section 4.2.1 shows:

Technically, R is an expression language with a simple syntax, which is almost self-explanatory. It is case sensitive, so x and X are different symbols and refer to different variables. All alphanumeric symbols are allowed, plus '.' and '_', with the restriction that a name must start with '.' or a letter, and if it starts with '.' the second character must not be a digit. The command prompt > indicates when R is ready for input.

This is where one types commands to be processed by R, which will happen when one hits the ENTER key. Commands consist of either expressions or assignments. When an expression is given as a command, it is immediately evaluated and printed, and the value is discarded. An assignment evaluates an expression and passes the value to a variable, but the value is not automatically printed. To print the computed value, simply enter the variable again at the next command. Commands are separated either by a new line or by a semicolon (';'). Several elementary commands may be grouped together into one compound expression by braces ('{' and '}').
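
For example, a minimal sketch of these rules in action:

> x <- 5 # An assignment: the value is not printed
> x # An expression: the value is printed, then discarded
[1] 5
> y <- 2; x + y # Two commands on one line, separated by a semicolon
[1] 7
> {x <- x + 1; x*2} # A compound expression grouped by braces
[1] 12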

Comments, starting with a hashmark/number sign (‘#’), may be put almost anywhere: everything to the end of the line following this sign is a comment.

Comments may not be used in an argument list of a function definition or inside strings. If a command is not complete at the end of a line, R will give a different prompt, a “+” sign, by default.
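
For example, an incomplete command elicits the "+" continuation prompt:

> 1 +
+ 1
[1] 2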

On the second and subsequent lines, R continues to read input until the command is syntactically complete. The result of a command is printed to the output device: if the result is an array, such as a vector or a matrix, then the elements are formatted with line breaks (wherever necessary), with the index of the leading entry of each line shown in square brackets: [index].

For example, an array of 15 elements may be outputted as:

> array(8, 15)
[1]  8 8 8 8 8 8 8 8 8 8
[11] 8 8 8 8 8
The labels '[1]' and '[11]' indicate the 1st and 11th elements in the output. These labels are not part of the data itself! Similarly, the labels for a matrix are placed at the start of each row and column in the output. For example, the 3 × 5 matrix M is output as:
>
> M <- matrix(1:15, nrow=3)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
>

Note that the storage is column-major: the elements of the first column are printed first, followed by those of the second column, and so on. To fill a matrix row-wise rather than in the default column-wise fashion, add the switch byrow=T:

>
> M <- matrix(1:15, nrow=3, byrow=T)
> M
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
>

The First Session also shows that there is a host of helpful resources embedded in the R environment that one can readily access, using the online help provided by CRAN.

Review Questions for Section 4.2

  1. Let us get started!

    Please follow the step-by-step instructions given in Section 4.2 to set up an R environment. The R window should look like this:

    >
    Great!
    

    Now enter the following arithmetic operations: press “Enter” after each entry

    (a) 2 + 3 <Enter>
    (b) 13 – 7 <Enter>
    (c) 17 * 23 <Enter>
    (d) 100/25 <Enter>
    (e) Did you obtain the following results:
    5, 6, 391, 4?
    
  2. Here are a few more (the <Enter> prompt will be omitted from now on):
    1. 2^4
    2. sqrt(3)
    3. 1i [1i is used for the complex unit i, where i = √(−1)]
    4. (2 + 3i) + (4 + 5i)
    5. (2 + 3i) × (4 + 5i)
  3. Here is a short session on using R to do complex arithmetic, just enter the following commands into the R environment, and report the results:
    > th <- seq(-pi, pi, len=20)
    > th  (a)   How many numbers are printed out?
    > z <- exp(1i*th)
    > z   (b)   How many complex numbers are printed out?
    > par(pty="s")  
    (c)   Along the menu-bar at the top of the R environment:
    *   Select and left-click on “Window”:, then
    *   Move downwards and select the 2nd option:
              R Graphic Device 2 (ACTIVE)
    *   Go to the “R Graphic Device 2 (ACTIVE) Window”
    (d)   What is there?
    > plot(z)
    (e)   Describe what is in the Graphic Device 2 Window.

4.3 R as a Calculator

4.3.1 Mathematical Operations Using R

To learn to do statistical analysis and computations, one may start by considering the R programming language as a simple calculator.

Start from here: just enter an arithmetic expression, press the <Enter> key, and the answer appears on the next line:

>
> 2 + 3
[1] 5
>
OK! What about other calculations? Such as: 13 − 7, 3 × 5, 12/4, 7^2,
√2, e^3, e^(iπ), ln 5 = log_e(5), (4 + √3)(4 − √3), (4 + i√3)(4 − i√3),
and so on. Just try:
>
> 13 - 7
[1] 6
> 3*5
[1] 15
> 12/4
[1] 3
> 7^2
[1] 49
> sqrt(2)
[1] 1.414214
>
> exp(3)
[1] 20.08554
>
> exp(1i*pi)   [1i is used for the complex number i = √(-1).]
[1] -1-0i      [This is just the famous Euler's identity:
                e^(iπ) + 1 = 0.]
> log(5)
[1] 1.609438
> (4 + sqrt(3))*(4 - sqrt(3))
[1] 13
[Checking: (4+√3)(4−√3) = 4^2 − (√3)^2 = 16 − 3 = 13. Checked!]
> (4 + 1i*sqrt(3))*(4 - 1i*sqrt(3))
[1] 19+0i   [Checking: (4+i√3)(4−i√3) = 4^2 − (i√3)^2
             = 16 − (−3) = 19. Checked!]

Remark: The [1] in front of the computed result is R's way of outputting numbers. It becomes useful when the result is a long vector. The number N in the brackets [N] is the index of the first number on that line. For example, if one generated 23 random numbers from a normal distribution

>
> x <- rnorm(23)
> x
 [1] -0.5561324  0.2478934 -0.8243522  1.0697415  1.5681899
 [6] -0.3396776 -0.7356282  0.7781117  1.2822569 -0.5413498
[11]  0.3348587 -0.6711245 -0.7789205 -1.1138432 -1.9582234
[16] -0.3193033 -0.1942829  0.4973501 -1.5363843 -0.3729301
[21]  0.5741554 -0.4651683 -0.2317168
>

Remark: After the random numbers have been generated, there is no output until one calls for x; x has become a vector with 23 elements, called a 23-vector.

The [11] on the third line of the output indicates that 0.3348587 is the 11th element of the 23-vector x. The number of outputs per line depends on the length of each element as well as the width of the page.

4.3.2 Assignment of Values in R and Computations Using Vectors and Matrices

R is designed to be a dynamically typed language; namely, at any time one may change the data type of any variable. For example, one can first set x to be numeric, as has been done so far, say x = 7; next one may set x to be a vector, say x = c(1, 2, 3, 4); then again one may set x to a character object, such as "Hi!". Just watch the following R session:

>
> x <- 7
> x
[1] 7
> x <- c(1, 2, 3, 4) #  x is assigned to be a 4-vector.
> x
[1] 1  2  3  4
> x <- c("Hi!") #  x is assigned to be a character string.
> x
[1] "Hi!"
> x <- c("Greetings & Salutations!")
> x
[1] "Greetings & Salutations!"
> x <- c("The rain in Spain falls mainly on the
+            plain.")
[1] "The rain in Spain falls mainly on the plain."
>  x <- c("Calculus", "Financial", "Engineering", “R”)
> x
[1] "Calculus", "Financial", "Engineering", “R”
>
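
The built-in function class reports the current type of a variable, confirming the dynamic retyping just shown:

> x <- 7
> class(x)
[1] "numeric"
> x <- c("Hi!")
> class(x)
[1] "character"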

4.3.3 Computations in Vectors and Simple Graphics

The use of arrays and matrices was introduced in Section 4.2.2.

In finite mathematics, a matrix is a two-dimensional array of elements, which are usually numbers. In R, the use of the matrix extends to elements of any type, such as a matrix of character strings. Arrays and matrices may be represented as vectors with dimensions. In statistics, where most variables carry multiple values, computations are usually performed between vectors of many elements, and these operations among multivariates result in large matrices. Graphical representations are often useful for demonstrating the results. The following simple example illustrates how readily these operations are accomplished in the R environment:

>
> weight <- c(73, 59, 97)
> height <- c(1.79, 1.64, 1.73)
> bmi <- weight/height^2
> bmi  #  Read the BMI Notes below
[1] 22.78331   21.93635   32.41004
> #  To summarize the results proceed to compute as follows:
> cbind(weight, height, bmi)
     weight height      bmi
[1,]     73   1.79 22.78331
[2,]     59   1.64 21.93635
[3,]     97   1.73 32.41004
>
> rbind(weight, height, bmi)
           [,1]     [,2]     [,3]
weight 73.00000 59.00000 97.00000
height  1.79000  1.64000  1.73000
bmi    22.78331 21.93635 32.41004
>

Clearly, the functions cbind and rbind bind (namely, join, link, glue, concatenate) by column and row, respectively, the vectors to form new vectors or matrices.
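
The results are genuine matrices, as a quick check with dim and t (the transpose function, treated formally in Section 4.3.11) confirms:

> m <- cbind(weight, height, bmi)
> dim(m) # 3 rows and 3 columns
[1] 3 3
> all(t(m) == rbind(weight, height, bmi)) # rbind yields the transpose of cbind
[1] TRUE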

4.3.4 Use of Factors in R Programming

In the analysis of, for example, health science data sets, categorical variables are often needed. These categorical variables indicate subdivisions of the original data set into various classes, for example: age, gender, disease stages, degrees of diagnosis, and so on. On input, the original data set is generally delineated into several categories using a numeric code: 1 = age, 2 = gender, 3 = disease stage, and so on. Such variables are specified as factors in R, resulting in a data structure that enables one to assign specific names to the various categories. In certain analyses, it is necessary for R to distinguish between categorical codes and variables whose values have direct numerical meanings.

A factor with four levels consists of two items:

  1. a vector of integers between 1 and 4 and
  2. a character vector of length four containing strings which describe the four levels.

Consider the following example:

  • A certain type of cancer is categorized into 4 levels: Levels 1, 2, 3, and 4.
  • The pain levels corresponding to these diagnoses are: none, mild, moderate, and severe, respectively.
  • In the data set, six case subjects have been diagnosed in terms of their respective levels.

The following R code segment delineates the data set:

> cancerpain <- c(1, 4, 3, 3, 2, 4)
> fcancerpain <- factor(cancerpain, levels=1:4)
> levels(fcancerpain) <- c("none", "mild",
+                          "moderate", "severe")

The first statement creates a numerical vector cancerpain that encodes the pain levels of six case subjects. This is considered a categorical variable, for which a factor fcancerpain is created using the factor function. The function is called with one argument in addition to cancerpain, namely levels=1:4, which indicates that the input coding uses the values 1 through 4. In the final line, the level names are changed to the four specified character strings. The result is

> fcancerpain
[1] none     severe   moderate moderate mild     severe  
Levels: none mild moderate severe
> as.numeric(fcancerpain)
[1] 1 4 3 3 2 4
> levels(fcancerpain)
[1] "none"     "mild"     "moderate"      "severe"

Remarks: The function as.numeric outputs the numerical coding as numbers 1–4, and the function levels outputs the names of the respective levels.

The original input coding in terms of the numbers 1–4 is no longer needed. There is an additional option using the function ordered that is similar to the function factor used here.
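
As a brief sketch, the function ordered applied to the same data produces a factor whose levels carry an ordering, indicated by the "<" signs in the printout (the object name ocancerpain is invented for this illustration):

> ocancerpain <- ordered(cancerpain, levels=1:4,
+                        labels=c("none", "mild", "moderate", "severe"))
> ocancerpain
[1] none     severe   moderate moderate mild     severe  
Levels: none < mild < moderate < severe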

BMI:

The body mass index (BMI) is a useful measure of human body fat based on an individual's weight and height – it does not actually measure the percentage of fat in the body. Devised in the nineteenth century, BMI is defined as a person's body weight (in kilograms) divided by the square of the height (in meters). The formula universally used in health science produces a unit of measure of kg/m²:

BMI = weight (kg) / [height (m)]²
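
For instance, the first subject in the example of Section 4.3.3 weighs 73 kg and stands 1.79 m tall:

> 73/1.79^2 # BMI = weight/height^2, in kg/m^2
[1] 22.78331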

A BMI chart may be used displaying BMI as a function of weight (horizontal axis) and height (vertical axis) with contour lines for different values of BMI or colors for different BMI categories (Figure 4.12).

Figure 4.12 A graph of BMI (body mass index): The dashed lines represent subdivisions within a major class. The “underweight” classification is further divided into “severe,” “moderate,” and “mild” subclasses. World Health Organization data.

4.3.5 Simple Graphics

Generating graphical presentations is an important aspect of statistical data analysis. Within the R environment, one may produce plots and control their graphical features. Thus, with the previous example, the relationship between body weight and height may be considered by first plotting one versus the other using the following R code segments:

>
> plot (weight, height)
> #  Outputting:  Figure 4.13.
A graphical representation for weight, height, where height is plotted on the y-axis on a scale of 1.65–1.75 and weight on the  x-axis on a scale of 60–90.

Figure 4.13 An x–y plot for > plot (weight, height).

Remarks:

  1. Note the order of the parameters in the plot (x, y) command: the first parameter is x (the independent variable – on the horizontal axis), and the second parameter is y (the dependent variable – on the vertical axis).
  2. Within the R environment, there are many plotting parameters that may be selected to modify the output. To get a full list of available options, return to the R environment and call for
> ?plot # This is a call for “Help!” within the R environment.
> # The output is the R documentation for:
> plot {graphics} # Generic X-Y plotting

This is the official documentation of the R function plot, within the R package graphics – note the special notations used for plot and {graphics}. To fully make use of the provisions of the R environment, one should carefully investigate all such documentations. (R has many available packages, each containing a number of useful functions.) This document shows all the plotting options available with the R environment. A copy of this documentation is shown in Appendix 1 for reference.

For example, to change the plotting symbol, one may use the keyword pch (for “plotting character”) in the following R command:

> plot (weight, height, pch=8)
> #  Outputting: Figure 4.14.
A graphical representation for weight, height, pch=8, where height is plotted on the y-axis on a scale of 1.65–1.75 and weight on the  x-axis on a scale of 60–90.

Figure 4.14 An x–y plot for > plot (weight, height, pch = 8).

Note that the output is the same as that shown in Figure 4.13, except that the points are marked with little “8-point stars”, corresponding to plotting character pch = 8.

In the documentation for pch, a total of 26 options are available, providing different plotting characteristics for points in R graphics. They are shown in Figure 4.15.

Figure depicting plotting symbols in R: pch=n, n =0, 1, 2, . . . , 25.

Figure 4.15 Plotting symbols in R: pch = n, n = 0, 1, 2,…, 25.

The parameter BMI was chosen so that its value is independent of a person's height, thus providing a single number or index indicating whether a case subject is overweight, and by what relative amount. Of course, one may plot "height" as the abscissa (namely, the horizontal "x-axis") and "weight" as the ordinate (namely, the vertical "y-axis"), as follows:

>  plot(height, weight, pch=8) #  Outputting: Figure 4.16.
A graphical representation for height, weight, pch=8 ; where weight is plotted on the y-axis on a scale of 60–90 and height on the  x-axis on a scale of 1.65–1.75.

Figure 4.16 An xy plot for > plot (height, weight, pch = 8).

A normal BMI lies between 18.5 and 25, averaging (18.5 + 25)/2 = 21.75. For this BMI value, the weight of a typical "normal" person would be 21.75 × height². Thus, one can superimpose a line of "expected" weights at BMI = 21.75 on Figure 4.16. This may be accomplished in the R environment by the following code segments:

> ht <- c(1.79, 1.64, 1.73)
> lines(ht, 21.75*ht^2) # Outputting: Figure 4.17.

A graphical representation where weight is plotted on the y-axis on a scale of 60–90 and height on the  x-axis on a scale of 1.65–1.75.

Figure 4.17 Superimposed reference curve using lines(ht, 21.75*ht^2).

In the last plot, a new variable for heights (ht) was defined instead of the original (height) because

  1. The relation between height and weight is a quadratic one, and hence nonlinear. Although it may not be obvious on the plot, it is preferable to use points that are spread evenly along the x-axis than to rely on the distribution of the original data (see the sketch after this list).
  2. As the values of height are not sorted, the line segments would not connect neighboring points but would run back and forth between distant points.
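
A small sketch illustrating point 1 (the grid end-points below are chosen for illustration only): seq generates evenly spaced, sorted heights over which the reference curve may be drawn smoothly:

> plot(height, weight, pch=8) # The scatterplot of Figure 4.16
> ht <- seq(1.60, 1.85, by=0.05) # Evenly spaced, sorted heights
> lines(ht, 21.75*ht^2) # The reference curve at BMI = 21.75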

Remarks:

  1. In the final example above, R was actually doing the arithmetic of vectors.
  2. Notice that the two vectors weight and height are both 3-vectors, making it reasonable to perform the next step.
  3. The cbind statement, used immediately after the computations have been completed, forms a new matrix by binding together matrices horizontally or column-wise. It results in a multivariate response variable. Similarly, the rbind statement does a similar operation vertically, or row-wise.
  4. But, if for some reason (such as a mistake in one of the entries) the two vectors weight and height have different numbers of elements, then R will output a warning message. For example
>
> weight <- c(73, 59, 97)                        #  a 3-vector
> height <- c(1.79, 1.64, 1.73, 1.48) #  a 4-vector !
> bmi <- weight/height^2                       #  Outputting:
Warning message:                                    # A warning message!
In weight/height^2 :
longer object length is not a multiple of shorter object length
>

4.3.6 x as Vectors and Matrices in Statistics

It has just been shown that a variable, such as x or M may be assigned as

  1. a number, such as x = 7,
  2. a vector or an array, such as x = c(1, 2, 3, 4), and
  3. a matrix, such as x =
          [,1] [,2] [,3] [,4] [,5]
     [1,]    1    4    7   10   13
     [2,]    2    5    8   11   14
     [3,]    3    6    9   12   15
  4. a character string, such as x = “The rain in Spain falls mainly on the plain.”
  5. in fact, in R, a variable x may be assigned a complete data set that may consist of a multidimensional set of elements, each of which may in turn be any one of the above kinds of variables. For example, besides being a numerical vector, as in (2) above, x may be the following:
    1. A character vector, which is a vector of text strings whose elements are expressed in quotes, using double-, single-, or mixed-quotes:
            > c("one", "two", "three", "four", "five")
            > # Double-quotes
            [1] "one"   "two"   "three" "four"  "five"
            >
            > c('one', 'two', 'three', 'four', 'five')  
            > # Single-quotes
            [1] "one"   "two"   "three" "four"  "five"
            >
            > c("one", 'two', "three", 'four', "five")
            > # Mixed-quotes
            [1] "one"   "two"   "three" "four"  "five"
            > # Note: R always prints strings with double quotes.
      

      However, a mismatched pair of quotes, such as "five' below, will not be accepted. For example:

      > c("one", "two", "three", "four", "five')
      
    2. A logical vector, which takes the values TRUE or FALSE (or NA). For input, one may use the abbreviations T or F.

These vectors are similarly specified using the c function:

                 > c(T, F, T, F, T)
            [1]  TRUE  FALSE  TRUE  FALSE  TRUE

In most cases, there is no need to specify logical vectors element by element; they usually arise from relational expressions, and a single relational expression applied to a vector of more than one value yields a whole logical vector. Observe:

> weight <- c(73, 59, 97)
> height <- c(1.79, 1.64, 1.73)
> bmi <- weight/height^2
> bmi # Outputting:  
[1] 22.78331   21.93635   32.41004
> bmi > 25 # A single logical value will suffice!
[1] FALSE  FALSE  TRUE
>

4.3.7 Some Special Functions that Create Vectors

Three functions that create vectors are c, seq, and rep

  1. c for “concatenate”, or the joining of objects end-to-end (this was introduced earlier) – for example
    > x <- c(1, 2, 3, 4) #  x is assigned to be a 4-vector.
    > x
    [1] 1  2  3  4
    
  2. seq for “sequence”, for defining an equidistant sequence of numbers – for example
    > seq(1, 20, 2) #  To output a sequence from 1 to 20, in steps of 2
     [1]  1  3  5  7  9 11 13 15 17 19
    > seq(1, 20) #  To output a sequence from 1 to 20, in steps of 1
    >                           #  (which may be omitted)
    [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > 1:20 #  This is a simplified alternative to writing seq(1, 20).
    [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > seq(1, 20, 2.5) #  To output a sequence from 1 to 20, in steps of  
    >                                 # 2.5 .
    [1]  1.0  3.5  6.0  8.5 11.0 13.5 16.0 18.5
    
  3. rep for “ replicate”, for generating repeated values. This function takes two forms, depending on whether the second argument is a single number or a vector – for example
    > rep(1:2, c(3,5)) # Replicating the first element (1) 3 times, and
    >                  # then replicating the second element (2) 5 times
    [1] 1 1 1 2 2 2 2 2 # This is the output.
    > vector <- c(1, 2, 3, 4)
    > vector # Outputting the vector
    [1] 1 2 3 4
    > rep(vector, 5) #  Replicating vector 5 times:
     [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
    

4.3.8 Arrays and Matrices

In finite mathematics, a matrix M is a two-dimensional array of elements, generally numbers, such as

        ( 1  4  7 10 13 )
    M = ( 2  5  8 11 14 )
        ( 3  6  9 12 15 )

and the array is usually placed inside parentheses ( ) or brackets { }, [ ], and so on. In R, the use of a matrix is extended to elements of many types: numbers as well as character strings. For example, in R the matrix M is expressed as

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

4.3.9 Use of the Dimension Function dim() in R

In R, the above 3 × 5 matrix may be set up as a vector with dimensions, using the dimension function dim(x), via the following code segment:

> x <- 1:15
> x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
> dim(x) <- c(3, 5) #  a dimension of 3 rows by 5 columns
> x # Outputting the matrix:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

Remark: Here a total of 15 elements, 1–15, are set to be the elements of the matrix x. Then the dimension of x is set as c(3, 5), making x a 3 × 5 matrix. The assignment of the 15 elements follows a column-wise procedure: the elements of the first column are allocated first, followed by those of the second column, then the third column, and so on.

4.3.10 Use of the Matrix Function matrix() In R

Another way to generate a matrix is using the function matrix().

The above 3 × 5 matrix may be created by the following 1-line code segment:

> M <- matrix(1:15, nrow=3)
> M # Outputting:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15

However, if the 15 elements should be allocated by row, then the following code segment should be used:

> M <- matrix(1:15, nrow=3, byrow=T)
> M # Outputting:
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15

4.3.11 Some Useful Functions Operating on Matrices in R: Colnames, Rownames, and t (for transpose)

Using the previous example

  1. the five columns of the 3 × 5 matrix x are first assigned the names C1, C2, C3, C4, and C5, respectively, then
  2. the transpose is obtained, and finally
  3. one takes the transpose of the transpose to obtain the original matrix x:
    > x <- matrix(1:15, nrow=3)
    > x # Outputting:
         [,1] [,2] [,3] [,4] [,5]
    [1,]    1    4    7   10   13
    [2,]    2    5    8   11   14
    [3,]    3    6    9   12   15
    > colnames(x) <- c("C1", "C2", "C3", "C4", "C5")
    > x # Outputting:
         C1 C2 C3 C4 C5
    [1,]  1  4  7 10 13
    [2,]  2  5  8 11 14
    [3,]  3  6  9 12 15
    > t(x)
       [,1] [,2] [,3]
    C1    1    2    3
    C2    4    5    6
    C3    7    8    9
    C4   10   11   12
    C5   13   14   15
    > t(t(x)) # which is just x, as expected!
         C1 C2 C3 C4 C5
    [1,]  1  4  7 10 13
    [2,]  2  5  8 11 14
    [3,]  3  6  9 12 15
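
    The section title also mentions rownames, which works analogously to colnames; as a brief sketch, continuing with the matrix x above (the row names R1–R3 are invented for illustration):

    > rownames(x) <- c("R1", "R2", "R3")
    > x
       C1 C2 C3 C4 C5
    R1  1  4  7 10 13
    R2  2  5  8 11 14
    R3  3  6  9 12 15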
    

Yet another way to assign names is to use LETTERS, a built-in constant containing the capital letters A through Z. Other useful built-in vectors include letters, month.name, and month.abb for the lowercase letters, the month names, and the abbreviated month names, respectively. Take a look:

> X <- LETTERS
> X # Outputting:
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> M <- month.name
> M # Outputting:
 [1] "January"   "February"  "March"     "April"     "May"      
 [6] "June"      "July"      "August"    "September" "October"  
[11] "November"  "December"
> m <- month.abb
> m # Outputting:
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct"
[11] "Nov" "Dec"

4.3.12 NA: "Not Available" for Missing Values in Data Sets

NA is a logical constant of length 1 that contains a missing value indicator. NA may be coerced to any other vector type except raw. There are also constants NA_integer_, NA_real_, NA_complex_, and NA_character_ of the other atomic vector types that support missing values; all of these are reserved words in the R language.

The generic function is.na indicates which elements are missing.

The generic function is.na<- sets elements to NA.

The reserved words in R's parser are

if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_, and the special argument references ..., ..1, ..2, and so on, which are used to refer to arguments passed down from an enclosing function.

Reserved words outside quotes are always parsed as references to the objects in the foregoing list, and are not allowed as syntactic names. They are, however, allowed as nonsyntactic names, for example when quoted with backticks.
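For instance (a minimal illustration), a reserved word cannot be used directly as a variable name, but it may serve as a nonsyntactic name when quoted with backticks:

> `repeat` <- 5 #  backtick-quoted, so parsed as a (nonsyntactic) name
> `repeat`
[1] 5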

4.3.13 Special Functions That Create Vectors

There are three useful R functions that are often used to create vectors:

  1. c for “concatenate”, which was introduced in Section 4.3.2 for joining items together end-to-end, for example:
    > c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29)
    > #  The first 10 prime numbers
     [1]  2  3  5  7 11 13 17 19 23 29
    
  2. seq for “sequence” is used for listing equidistant sequences of numbers, for example:
    > seq(1, 20) #  Sequence from 1 to 20
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > seq(1, 20, 1) #  Sequence from 1 to 20, in steps of 1
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > 1:20 #  Sequence from 1 to 20, in steps of 1
    [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    > seq(1, 20, 2) #  Sequence from 1 to 20, in steps of 2
    [1]  1  3  5  7  9 11 13 15 17 19
    > seq(1, 20, 3) #  Sequence from 1 to 20, in steps of 3
    [1]  1  4  7 10 13 16 19
    > seq(1, 20, 10) #  Sequence from 1 to 20, in steps of 10
    [1]  1 11
    > seq(1, 20, 20) #  Sequence from 1 to 20, in steps of 20
    [1] 1
    > seq(1, 20, 21) #  Sequence from 1 to 20, in steps of 21
    [1] 1
    >
    
  3. rep for “replicate” is used to generate repeated values, and may be expressed in two ways, for example:
    > x <- c(3, 4, 5)
    > rep(x, 4) #  Replicate the vector x 4-times.
     [1] 3 4 5 3 4 5 3 4 5 3 4 5
    > rep(x, 1:3) #  Replicate the elements of x: the first element
    >                         # once, the second element twice,
    >                         # and the third element three times.
    [1] 3 4 4 5 5 5
    > rep(1:3, c(3,4,5)) #  For the sequence (1, 2, 3), replicate its
    >                                   # elements 3-, 4-, and 5-times, respectively
    [1] 1 1 1 2 2 2 2 3 3 3 3 3
    

Review Questions for Section 4.3

  1. A Tower of Powers – by computations using R

    There is an interesting challenge in arithmetic that goes like this:

    x = √2^√2^√2^···

    What is the value of x? Namely, an infinite ascending tower of powers of the square root of 2.

    Solution: Let x be the value of this “Tower of Powers”; then, since the tower above the lowest √2 is again the whole tower, it is easily seen that √2^x = x itself! Agree?

    Watch the lowest √2.

    And clearly it follows that x = 2, because √2^2 = 2.

    This shows that the value of this “Infinite Tower of Powers of √2” is just 2.

    Now use the R environment to verify this interesting result:

    (a)  Compute √2:
    > sqrt(2)
    (b)  Compute √2^√2:
    > sqrt(2)^sqrt(2)                              [a 2-Tower of √2-s]
    (c)  > sqrt(2)^sqrt(2)^sqrt(2)                 [a 3-Tower of √2-s]
    (d)  > sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)         [a 4-Tower of √2-s]
    (e)  > sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2) [a 5-Tower of √2-s]
    (f)  Now try the following computations of 10-, 20-, 30-, and 40-Towers of Powers of √2, reaching the result of 2 (accurate to six decimal places!).
    > sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)
    [1] 1.983668     [a 10-Towers of Powers of √2-s]
    > sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^
    + sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)
    [1] 1.999586          [a 20-Towers of Powers of √2-s]
    >sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^
    +sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^
    + sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)
    [1] 1.999989       [a 30-Towers of Powers of √2-s]
    >sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^
    +sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^
    +sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^
    + sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)^sqrt(2)
    [1] 2                   [a 40-Towers of Powers of √2-s]
    

    Thus, this R computation verifies the solution.
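    Rather than typing ever-longer chains of sqrt(2)^..., the same computation may be written as a short loop. The following is a minimal sketch (the function name tower is chosen here for illustration); it exploits the fact that the n-tower is obtained from the (n − 1)-tower x by computing sqrt(2)^x:

    > tower <- function(n) {
    +   x <- sqrt(2) #  the 1-tower
    +   for (i in seq_len(n - 1)) x <- sqrt(2)^x #  add one level at a time
    +   x
    + }
    > tower(10)
    [1] 1.983668
    > tower(40)
    [1] 2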

    1. What are the equivalents in R for the basic mathematical operations of

      +, −, ×, / (division), √, squaring of a number?

    2. Describe the use of factors in R programming. Give an example.
  2. If x = (0, 1, 2, 3, 4, 5) and y = (0, 1, 4, 9, 16, 25) in R, plot:
    1. y versus x,
    2. x versus y,
    3. y versus x²,
    4. y versus √x,
    5. √y versus √x, and
    6. x versus √y.
  3. Explain, given an example, how the following functions may be used to combine matrices to form new ones:
    1. cbind and
    2. rbind.
    1. Describe the R function factor().
    2. Give an example of using factor() to create new arrays.
  4. Using examples, illustrate two procedures for creating
    1. a vector and
    2. a matrix.
  5. Describe, using examples, the following three functions for creating vectors:
    1. c,
    2. seq, and
    3. rep.
    1. Use the function dim() to set up a matrix. Give an example.
    2. Use the function matrix() to set up a matrix. Give an example.
  6. Describe, using an example, the use of the following functions operating on a matrix in R: t(), colnames(), and rownames().
    1. What are reserved words in the R environment?
    2. In R, how is the logical constant NA used? Give an example.

Exercises for Section 4.3

Enter the R environment, and do the following exercises using R programming:

  1. Perform the following elementary arithmetic exercises:
       (a)   7 + 31;   (b)   87 − 23;   (c)   3.1417 × 7²;      (d)   22/7;
       (e)   e^√2
    
  2. Body mass index (BMI) is calculated from your weight in kilograms and your height in meters:
    BMI = weight (kg) / [height (m)]²
    Using 1 kg ≈ 2.2 lb, and 1 m ≈ 3.3 ft ≈ 39.4 in:
       (a)   Calculate your BMI.
       (b)   Is it in the “normal” range  18.5 ≤ BMI ≤ 25?
    
  3. In the MPH program, five graduate students taking the class in Introductory Epidemiology measured their weight (in kg) and height (in meters). The results are summarized in the following matrix:

                 John   Chang   Michael   Bryan   Jose
       WEIGHT    69.1    62.5     74.3     70.9   96.6
       HEIGHT    1.81    1.46     1.69     1.82   1.74

       (a)  Construct a matrix showing their BMI as the last row (a sketch follows below).
       (b)  Plot:  (i)   WEIGHT (on the y-axis) vs HEIGHT (on the x-axis)
                   (ii)  HEIGHT vs WEIGHT
                   (iii) Assuming that the weight of a typical “normal” person is
                         (21.75 × HEIGHT²), superimpose a line-of-“expected”-weight
                         at BMI = 21.75 on the plot in (i).
    
    1. To convert between temperatures in degrees Fahrenheit (F) and Celsius (C), the following conversion formulas are used:

      F = (9/5)C + 32

      C = (5/9)(F − 32)

      At standard temperature and pressure, the freezing and boiling points of water are 0 and 100 °C, respectively. What are the freezing and boiling points of water in degrees Fahrenheit?

    2. For C = 0, 5, 10, 15, 20, 25, ..., 80, 85, 90, 95, 100, compute a conversion table that shows the corresponding F temperatures.

      Note: To create the sequence of Celsius temperatures use the R function seq(0, 100, 5).
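      A minimal sketch of such a conversion table, following the note above (the object names degC and degF are chosen for illustration):

      > degC <- seq(0, 100, 5) #  Celsius temperatures 0, 5, 10, ..., 100
      > degF <- (9/5)*degC + 32 #  the corresponding Fahrenheit temperatures
      > data.frame(C=degC, F=degF) #  the conversion table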

  4. Use the data in Table A below. Assume a person is initially HIV-negative. If the probability of getting infected per act is p, then the probability of not getting infected per act is (1 − p).

    The probability of not getting infected after two consecutive acts is (1 − p)², and after three consecutive acts is (1 − p)³. Therefore, the probability of not getting infected after n consecutive acts is (1 − p)ⁿ, and the probability of getting infected after n consecutive acts is 1 − (1 − p)ⁿ.

    For the nonblood transfusion transmission probability (per act risk) in Table A, calculate the risk of being infected after one year (365 days) if one carries out the needle sharing injection drug use (IDU) once daily for one year.

    Do these cumulative risks seem reasonable? Why? Why not?

    Table A
    Estimated per-act risk (transmission probability) for acquisition of HIV, by exposure route to an infected source. Source: CDC.

    Exposure route                                Risk per 10,000 exposures
    Blood Transfusion (BT)                                        9,000
    Needle-sharing Injection-Drug Use (IDU)                          67
    

    Solution:

    > p <- 67/10000
    > p
    [1] 0.0067
    > q <- (1 - p)
    > q
    [1] 0.9933
    > q365 <- q^365
    > q365
    [1] 0.08597238
    > p365 <- 1 - q365
    > p365
    [1] 0.9140276
    # =>  Probability of being infected in a year = 91.40%.  
    #      A high risk, indeed!
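    As a check on how this cumulative risk grows over the year, the whole computation can be done in one vectorized line (a minimal sketch; the object name risk is chosen for illustration):

    > risk <- 1 - (1 - 67/10000)^(1:365) #  cumulative risk after n = 1, ..., 365 acts
    > risk[365] #  agrees with p365 above
    [1] 0.9140276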
    
Straddle in the Black–Scholes Model

What is a Straddle? A straddle is a type of financial investment strategy involving simultaneously both the put and the call of a given stock, to provide additional opportunities for profiting (with the concomitant risk of losing!).

In finance, a straddle refers to two transactions that share the same security, with positions that offset one another. One holds long risk, the other short. Thus, it involves the purchase or sale of particular option derivatives that allow the holder to profit based on how much the price of the underlying security moves, regardless of the direction of price movement.

A straddle involves buying a call and a put with the same strike price and expiration date:

  1. If the stock price is close to the strike price at expiration of the options, the straddle leads to a loss.
  2. However, if there is a sufficiently large move in either direction, a significant profit will result.

A straddle is appropriate when an investor is expecting a large move in a stock price but does not know in which direction the move will be.

The purchase of particular option derivatives is known as a long straddle, while the sale of the option derivatives is known as a short straddle.

A long straddle buys both a call option and a put option on some stock, interest rate, index, or other underlying. The two options are bought at the same strike price and expire at the same time. The owner of a long straddle makes a profit if the underlying price moves a long way from the strike price, either above or below. Thus, an investor may take a long straddle position if he thinks the market is highly volatile, but does not know in which direction it is going to move. This position is a limited risk, since the most a purchaser may lose is the cost of both options. At the same time, there is unlimited profit potential.

For example, company ABC is set to release its quarterly financial results in 2 weeks. An investor believes that the release of these results will cause a large movement in the price of the ABC's stock, but does not know whether the price will go up or down. The investor may enter into a long straddle, where one gets a profit no matter which way the price of ABC stock moves, if the price changes enough either way:

  • If the price goes up enough, the investor exercises the call option and ignores the put option.
  • If the price goes down, the investor uses the put option and ignores the call option.
  • If the price does not change enough, he loses money, up to the total amount paid for the two options.

    Thus, the risk is limited by the total premium paid for the options, as opposed to the short straddle where the risk is virtually unlimited.

  • If the stock is sufficiently volatile and option duration is long, the investor could profit from both options. This would require the stock to move both below the put option's strike price and above the call option's strike price at different times before the option expiration date.

A short straddle is a nondirectional option trading strategy that involves simultaneously selling a put and a call of the same underlying security, strike price, and expiration date. The profit is limited to the premium received from the sale of the put and call. The risk is virtually unlimited, as large moves of the underlying security's price, either up or down, will cause losses proportional to the magnitude of the price move. A maximum profit upon expiration is achieved if the underlying security trades exactly at the strike price of the straddle. In that case, both the puts and calls comprising the straddle expire worthless, allowing the straddle owner to keep the full credit received as their profit. This strategy is also called “nondirectional” because the short straddle profits when the underlying security changes little in price before the expiration of the straddle. The short straddle may be classified as a credit spread because the sale of the short straddle results in a credit of the premiums of the put and call.

(The risk for the holder of a short straddle position is unlimited due to the sale of the call and the put options, which expose the investor to unlimited losses (on the call) or losses limited to the strike price (on the put), whereas the maximum profit is limited to the premium gained by the initial sale of the options.)

An example of a Straddle Financial Investment, using R:

In CRAN, the package FinancialMath (Financial Mathematics for Actuaries) provides a numerical example for assessing a straddle investment, using the Black–Scholes equation for estimating the call and put prices:

straddle.bls                               Straddle Spread - Black Scholes

Description

Gives a table and graphical representation of the payoff and profit of a long or short straddle for a range of future stock prices. Uses the Black–Scholes equation for estimating the call and put prices.

Usage

straddle.bls(S, K, r, t, sd, position, plot=FALSE)

Arguments

S            spot price at time 0
K            strike price of the put and call
r            continuously compounded yearly risk-free rate
t            time of expiration (in years)
sd           standard deviation of the stock (a measure of its volatility)
position     either buyer or seller of option ("long" or "short")
plot         specifies whether or not to plot the payoff and profit

Details

Stock price at time t = St
For St ≤ K: payoff = K − St
For St > K: payoff = St − K
profit = payoff − (price_call + price_put) * e^(r*t)
(For a short position, the signs of the payoff and profit are reversed.)

Value

A list of two components.

Payoff A data frame of different payoffs and profits for given stock prices.

Premiums A matrix of the premiums for the call and put options, and the net cost.

See Also
option.put
option.call
strangle.bls
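
A minimal usage sketch, assuming the FinancialMath package is installed; the parameter values below are purely illustrative:

> library(FinancialMath)
> straddle.bls(S=100, K=100, r=0.03, t=1, sd=0.25,
+              position="long", plot=TRUE)
> #  Returns the Payoff data frame and the Premiums matrix,
> #  and plots the payoff and profit against future stock prices.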

4.4 Using R in Data Analysis in Financial Engineering

In financial investigations, after preparing the collected data sets (as discussed in Section 3.1) for financial analysis, the first step is to enter the data sets into the R environment. Once the data sets are placed within the R environment, the analysis can process the data to obtain results leading to credible conclusions, and likely recommendations for definitive courses of action. Several methods for data set entry will be examined.

[Figure: profit (solid line) and payoff (dashed line), in dollars, plotted against the stock price.]

Figure 4.18(a) Straddle investment in the Black–Scholes model.

4.4.1 Entering Data at the R Command Prompt

Data Frames and Data sets: For many financial investigators, the terms data frame and data set may be used interchangeably.

Data sets: In many applications, a complete data set contains several data frames, including the real data that have been collected.

Data Frames: Rules for data frames are similar to those for arrays and matrices, introduced earlier. However, data frames are more complicated than arrays. In an array, if just one cell is a character, then the whole array will be coerced to character. On the other hand, a data frame can consist of the following:

  • A column of “IDnumber,” which is numeric
  • A column of “Name,” which is character.

In a data frame:

  1. each variable can have long variable descriptions and
  2. a factor can have “levels” or value levels.

These properties can be transferred from the original data set in other software formats (such as SPSS, Stata, etc.). They can also be created in R.

4.4.1.1 Creating a Data Frame for R Computation Using the EXCEL Spreadsheet (on a Windows Platform)

As an example, using a typical set of real case control epidemiologic research data, consider the data set in Table B, from a clinical trial to evaluate the efficacy of maintenance chemotherapy for case subjects with acute myelogenous leukemia (AML), conducted at Stanford University, California, U.S.A., in 1977. After reaching a status of remission through treatment by chemotherapy, the patients who entered the study were assigned randomly to two groups:

  1. Maintained: This group received maintenance chemotherapy
  2. Nonmaintained: This group did not and is the control group.

The clinical trial was to ascertain whether maintenance chemotherapy prolonged the time until relapse (= “death”).

Procedure: (1) Create an acute myelogenous leukemia (AML) data file, called AML.csv, in Windows; (2) input it into R as a data file aml.

(1) Creating a data frame for R computation

1.  Data input using Excel:
    (a)  Open the Excel spreadsheet.
    (b)  Type in the data such that the variable names are in row 1 of the Excel spreadsheet.
    (c)  Consider each row of data as an individual in the study.
    (d)  Start with column A.
2.  Save as a .csv file:
    (a)  Click: “File” → “Save as” → and then, in the file name box (the upper box at the bottom), type: AML
    (b)  In the “Save in:” box (at the top), choose “Local Disk (C:)”. The file AML will then be saved in the top level of the C: drive, but another level may also be chosen.
    (c)  In the “Save as Type” box (the lower box at the bottom), scroll down, select, and click on CSV (Comma delimited = Comma Separated Values).
    (d)  Close out of Excel by clicking the big “X” at the top right-hand corner.
                                                      
                                     Table B**
Data for the AML maintenance clinical study (a + indicates a censored value)
____________________________________________________________
Group                        Duration of Complete Remission (weeks)
1 = Maintained (11)          9, 13, 13+, 18, 23, 28+, 31, 34, 45+, 48, 161+
0 = Nonmaintained (12)       5, 5, 8, 8, 12, 16+, 23, 27, 30, 33, 43, 45
Status: 1 = uncensored; 0 = censored (+)
____________________________________________________________
**The AML clinical study data: Tableman & Kim (2004), Table B-1; 23 data points, taken from “Survival Analysis Using S: Analysis of Time-to-Event Data” by Mara Tableman and Jong Sung Kim, published by Chapman & Hall/CRC, Boca Raton, 2004.
3.  In Windows, check the C: drive for the AML.csv file, namely, C:\AML.csv.
4.  Read AML into R:
    (a)  Open R.
    (b)  Use the read.csv() function:
    > aml <- read.csv("C:/AML.csv", header=TRUE,
    +                 sep=",")
    (c)  Actually, it can also be done by
    > aml <- read.csv("C:/AML.csv")
    > #  Read in the AML.csv file from the C: drive of the
    > #  computer, and call it aml
5.  Output the aml data frame for inspection:
> aml #  Outputting:
   weeks group status
1      9     1      1
2     13     1      1
3     13     1      0
4     18     1      1
5     23     1      1
6     28     1      0
7     31     1      1
8     34     1      1
9     45     1      0
10    48     1      1
11   161     1      0
12     5     0      1
13     5     0      1
14     8     0      1
15     8     0      1
16    12     0      1
17    16     0      0
18    23     0      1
19    27     0      1
20    30     0      1
21    33     0      1
22    43     0      1
23    45     0      1
>

Later, in Section 6.3, this data set will be revisited and further processed for survival analysis.

4.4.1.2 Obtaining a Data Frame from a Text File

Data from various sources are often entered using many different software programs.

They may be transferred from one format to another through the ASCII file format.

For example, in Windows, a text file is the most common ASCII file, usually having a “.txt” extension. There are other files in ASCII format, including the “.R” command file.

Data from most software programs can be outputted or saved as an ASCII file. From Excel, a very common spreadsheet program, the data can be saved as “.csv” (comma separated values) format. This is an easy way to interface between Excel spreadsheet files and R. Open the Excel file and “save as” the csv format.

  • Files with field separators:

    As an example, suppose the file “csv1.xls” is originally an Excel spreadsheet. After “save as” into csv format, the output file is called “csv1.csv”, the contents of which is

         "name",    "gender", "age"
         "A",       "F",      20
         "B",       "M",      30
         "C",       "F",      40
    

    The characters are enclosed in quotes and the delimiters (variable separators) are commas. Sometimes the file may not contain quotes, as in the file “csv2.csv”.

          name,     gender, age
          A,         F,      20
          B,         M,      30
          C,         F,      40
    

    For both files, the R command to read in the data set is the same.

    > a <- read.csv("csv1.csv", as.is=TRUE)
    > a
         name gender age
    1    A         F   20
    2    B         M   30
    3    C         F   40
    

    The argument ‘as.is=TRUE’ keeps all characters as they are; otherwise the characters would be coerced into factors. The variable “name” should not be factored but “gender” should. The following command should, therefore, be entered:

    > a$gender <- factor(a$gender)
    

    Note that the object “a” has class data frame and that the names of the variables within the data frame “a” must be referenced using the dollar sign notation $. Otherwise, R will state that the object “gender” cannot be found.

    For files with white space (spaces and tabs) as the separator, such as in the file

    "data1.txt", the command to use is read.table():
    > a <- read.table("data1.txt", header=TRUE,
    +                         as.is=TRUE)
    
  • Files without field separators:

    Consider the file “data2.txt”, which is in fixed field format without field separators (each data line packs a 1-character name, a 1-character gender, and a 2-character age):

    name gender age
    AF20
    BM30
    CF40
    

    To read in such a file, use the function read.fwf():

    1. Skip the first line, which is the header.
    2. The width of each variable and the column names must be specified.
    > a <- read.fwf("data2.txt", skip=1, width=c(1,1,2),
    +                col.names = c("name", "gender", "age"),
    +                as.is=TRUE)
    

4.4.1.3 Data Entry and Analysis Using the Function data.entry()

The previous section deals with creating data frames by reading in data created from programs outside R, such as Excel. It is also possible to enter data directly into R by using the function data.entry(). However, if the data size is large (say more than 15 columns and/or more than 25 rows), the chance of human error is high with the spreadsheet or text mode data entry. A software program specially designed for data entry, such as Epidata, is more appropriate. (http://www.epidata.dk)

4.4.1.4 Data Entry Using Several Available R Functions

The data set, in Table C, lists deaths among subjects who received a dose of tolbutamide or a placebo in the University Group Diabetes Program (1970), stratifying by age:

                               Table C**
Deaths among subjects who received tolbutamide or a placebo in the University Group Diabetes Program (1970)

              Age < 55               Age ≥ 55               Combined
          Tolbutamide Placebo    Tolbutamide Placebo    Tolbutamide Placebo
Deaths              8       5             22      16             30      21
Survivors          98     115             76      69            174     184
____________________________________________________
**Available at http://www.medepi.net/data/ugdp.txt

The R functions that can be used to import the data frame have been previously introduced in Sections 4.3.3–4.3.13. A convenient way to enter data at the command prompt is to use the R functions

c(), matrix(), array(), apply(), list(), and data.frame(),

together with a user-defined function such as odds.ratio() below, as shown by the following examples using the data in Table C.
> #Entering data for a vector
> vector1 <- c(8, 98, 5, 115) #  Using data from Table C.
> vector1
[1]   8  98   5 115
>
> vector2 <- c(22, 76, 16, 69); vector2
> #  Data from Table C.
[1] 22 76 16 69
>
> # Entering data for a matrix
> matrix1 <- matrix(vector1, 2, 2)
> matrix1
      [,1] [,2]
[1,]   8    5
[2,]   98   115
> matrix2 <- matrix(vector2, 2, 2); matrix2
       [,1]  [,2]
[1,]   22     16
[2,]   76     69
>
> # Entering data for an array
>  udata <- array(c(vector1, vector2), c(2, 2, 2))
>  udata
,,  1
        [,1]  [,2]
[1,]     8    5
[2,]    98    115
,,  2
       [,1]   [,2]
[1,]   22      16
[2,]   76      69
> udata.tot <- apply(udata, c(1, 2), sum); udata.tot
       [,1]   [,2]
[1,]   30      21
[2,] 174     184
>
> # Entering a list
> x <- list(crude.data = udata.tot, stratified.data =
+              udata)
> x$crude.data
       [,1] [,2]
[1,]   30     21
[2,] 174   184
> x$stratified.data
,,  1
       [,1]  [,2]
[1,]     8    5
[2,]   98    115
,,  2
       [,1]     [,2]
[1,]    22      16
[2,]    76      69
>
> # Entering a simple data frame
> subjectname <- c("Peter", "Paul", "Mary")
> subjectnumber <- 1:length(subjectname)
> age <- c(26, 27, 28) # These are their true ages,
>                                         # respectively, in 1964!
> gender <- c("Male", "Male", "Female")
> data1 <- data.frame(subjectnumber, subjectname,
+                                age, gender)
> data1
                     subjectnumber subjectname   age   gender
1                       1            Peter       26      Male
2                       2            Paul        27      Male
3                       3            Mary        28      Female
>
> # Entering a simple function
> odds.ratio <- function(aa, bb, cc, dd){ aa*dd /
+                                 (bb*cc)}
> odds.ratio(30, 174, 21, 184) # Data from Table C.
[1] 1.510673

4.4.1.5 Data Entry and Analysis Using the Function scan()

The R function scan() is taken from the CRAN package base.

This function reads data into a vector or list from the console or file. This function takes the following usage form:

scan(file = ", what = double(), nmax = -1, n = -1,  
sep = "",
                quote = if(identical(sep, "
")) "" else  
                "'"", dec = ".",
                skip = 0, nlines = 0, na.strings = "NA",
                flush = FALSE, fill = FALSE, strip.white =
                FALSE,
                quiet = FALSE, blank.lines.skip = TRUE,
                multi.line = TRUE,
                comment.char = "", allowEscapes =
                FALSE,
                fileEncoding = "", encoding = "unknown",  
                text)

Arguments

what  The type of what gives the type of data to be read. The supported types are logical, integer, numeric, complex, character, raw, and list. If what is a list, it is assumed that the lines of the data file are records, each containing length(what) items (fields), and that the list components have elements that are one of the first six types listed or NULL.

The what argument describes the tokens that scan() should expect in the input file.

For a detailed description of this function, execute

> ?scan

The methodology of applying scan() is similar to that of c(), as described in Section 4.4.1.4, except that it does not matter whether the numbers are entered on different lines; the result will still be a vector.

  • Use scan() when accessing data from a file that has an irregular or a complex structure.
  • Use scan() to read individual tokens and use the argument what to describe the stream of tokens in the file.
  • scan() converts tokens into data and then assembles the data into records.
  • Use scan() along with readLines(), especially when one attempts to read an unorthodox file format. Together, these two functions will likely result in a successful processing of the individual lines and tokens of the file!

The function readLines() reads lines from a file, and returns them as a character vector (one string per line):

> lines <- readLines("input.text")

One may limit the number of lines to be read, per pass, by using the n parameter that gives the maximum number of lines to be read:

> lines <- readLines("input.text", n=5)
> #  read 5 lines and stop

The function scan() reads one token at a time, and handles it accordingly as instructed.

An example:

Assume that the file to be scanned and read contains triplets of data (like the dates, and the corresponding daily highs and lows of financial markets):

15-Oct-1987 2439.78 2345.63
16-Oct-1987 2396.21 2207.73
19-Oct-1987 2164.16 1677.55
20-Oct-1987 2067.47 1616.23
21-Oct-1987 2087.07 1951.76

Use a list to tell scan() that it should expect a repeating, 3-token sequence:

> triplets <- scan("triples.txt", what=list(character(0),
+                  numeric(0), numeric(0)))

Give names to the list elements, and scan() will assign those names to the data:

> triplets <- scan("triples.txt",
+                  what=list(date=character(0),
+                  high=numeric(0), low=numeric(0)))
Read 5 records
> triplets #  Outputs:
$date
[1] "15-Oct-1987" "16-Oct-1987" "19-Oct-1987" "20-Oct-1987" "21-Oct-1987"
$high
[1] 2439.78 2396.21 2164.16 2067.47 2087.07
$low
[1] 2345.63 2207.73 1677.55 1616.23 1951.76

4.4.1.6 Data Entry and Analysis Using the Function source()

The R function source() is also taken from the CRAN package base. This function reads R commands from a file or connection and evaluates them. It takes the following usage form:

  1. source() causes R to accept its input from the named file or URL or connection.
  2. Input is read and parsed from that file until the end of the file is reached, then the parsed expressions are evaluated sequentially in the chosen environment:
source(file, local = FALSE, echo = verbose, print.eval = echo,
       verbose = getOption("verbose"),
       prompt.echo = getOption("prompt"),
       max.deparse.length = 150, chdir = FALSE,
       encoding = getOption("encoding"),
       continue.echo = getOption("continue"),
       skip.echo = 0, keep.source = getOption("keep.source"))

For commands that are stored in an external file, such as “commands.R” in the working directory “work,” they can be executed in an R environment with the command

> source("commands.R")

The function source() instructs R to read the text and execute its contents. Thus, when one has a long, or frequently used, piece of R code, one may capture it inside a text file. This allows one to rerun the code without having to retype it, and use the function source() to read and execute the code.

For example, suppose the file howdy.R contains the familiar greeting:

        print("Hi, My Friend!")

Then, by sourcing the file, one may execute the content of the file, as in the following R code segment:

> source("howdy.R")
[1] "Hi, My Friend!"

Setting echo=TRUE will echo the script lines before they are executed, with the R prompt shown before each line:

> source("howdy.R", echo=TRUE)
> print("Hi, My Friend!")
[1] "Hi, My Friend!"

4.4.1.7 Data Entry and Analysis Using the Spreadsheet Interface in R

This method consists of the following R functions in the package utils.

  • Spreadsheet Interface for Entering Data

    This is a spreadsheet-like editor for entering or editing data, with the following R functions:

    data.entry(..., Modes = NULL, Names = NULL)
    dataentry(data, modes)
    de(..., Modes = list(), Names = NULL)
    

The arguments of these R functions are:

...    A list of variables: currently these should be numeric or character vectors, or a list containing such vectors.

Modes The modes to be used for the variables.
Names The names to be used for the variables.
data A list of numeric and/or character vectors.
modes A list of length up to that of data giving the modes of (some of) the variables. list() is allowed.

The function data.entry() edits an existing object, saving the changes to the original object name.

However, the function edit() edits an existing object without saving the changes to the original object name, so one must assign the result to an object name (even if it is the original name); see the sketch after the code below. To enter a vector, one needs to initialize a vector and then use the function data.entry(). For example:

Start by entering the R environment, and set
> x <- c(2, 4, 6, 8, 10)
> #  X is initially defined as an array of 5 elements.
> x #  Just checking – to make sure!
[1]  2  4  6  8 10 # x is indeed set to be an array of 5 elements
>                      
> data.entry(x) #  Entering the Data Editor:
> #  The Data Editor window pops up, and looking at the first
> # column: it is now named “x”, with the first 5 rows (all on first
> # column) filled,  respectively, by the numbers 2, 4, 6, 8, 10
> #  One can now edit this dataset by, say, changing all the
> # entries to 2, then closing the Data Editor window, and
> # returning to the R console window:
> x
[1] 2 2 2 2 2  # x  is indeed changed!
> #  Thus one can change the entries for x via the Data Editor,
> # and save the changes.
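
By contrast, a minimal sketch of the same task using edit(), which does not save changes back to x unless the result is assigned (the name x2 is chosen here for illustration):

> x2 <- edit(x) #  edit in the spreadsheet window, then assign
> #  x itself remains unchanged unless one writes:  x <- edit(x)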

When using the functions data.entry(x) and edit() for data entry, there are a number of limitations:

  1. Arrays and nontabular lists cannot be entered using a spreadsheet editor.
  2. When using the function edit() to create a new data frame, one must assign an object name in order to save the data frame.
  3. This approach is not a preferred method of entering data, because one often prefers to have the original data in a text editor or available to be read in from a data file.

4.4.1.8 Financial Mathematics Using R: The CRAN Package FinancialMath

To illustrate the ease of use of R in financial mathematics, consider a simple example of the repayment process of a loan, such as a mortgage on a piece of real estate.

Two examples are used for calculating the loan repayment process:

Example (A): A loan of one million dollars, at an interest rate of 2.5%, is to be repaid by equal monthly installments over 30 years, or 360 months. One should also consider the rate and total amount of interest to be repaid over the life of the loan. Since pf is the payment frequency in payments per year, monthly installments correspond to pf=12:

> amort.table(Loan=1000000, n=360,
+             i=0.025, pf=12, plot=TRUE)

Example (B): Example A shows that the level monthly payment is approximately $3,936.56 (see the hand check following the Details section below). Suppose the borrower can afford to pay more than this monthly amount, say, $5,000.00 monthly. How does this alternate repayment scheme affect the overall repayment process, particularly in terms of the total interest payable over the whole life of the loan?
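
For Example (B), a minimal sketch: leave n at its default NA so that amort.table() solves for the number of payments (all other parameter values as in Example (A)):

> amort.table(Loan=1000000, pmt=5000, i=0.025, pf=12, plot=TRUE)
> #  n is left unspecified (NA), so the function solves for the
> #  number of $5,000 monthly payments needed to repay the loan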

Package: FinancialMath
Date: December 16, 2016
Type: Package
Title: Financial Mathematics for Actuaries
Version: 0.1.1
Author Kameron Penn [aut, cre],
Jack Schmidt [aut]
Maintainer Kameron Penn <[email protected]>
Description Contains financial math functions and introductory derivative functions included in the Society of Actuaries and Casualty Actuarial Society ‘Financial Mathematics’ exam, and some topics in the ‘Models for Financial Economics’ exam.
License GPL-2
Encoding UTF-8
LazyData true
Needs Compilation no
Repository CRAN
Date/Publication 2016-12-16 22:51:34

amort.table Amortization Table

Description

Produces an amortization table for paying off a loan while also solving for either the number of payments, the loan amount, or the payment amount. In the amortization table, the payment amount, interest paid, principal paid, and balance of the loan are given for each period. If n ends up not being a whole number, outputs for the balloon payment, drop payment, and last regular payment are provided. The total interest paid and total amount paid are also given. It can also plot the percentage of each payment toward interest versus period.

Usage

amort.table(Loan=NA, n=NA, pmt=NA, i=0.025,
           ic=1, pf=1, plot=TRUE)

Arguments

Loan loan amount
n the number of payments/periods
pmt value of level payments
i nominal interest rate convertible ic times per year
ic interest conversion frequency per year
pf the payment frequency – the number of payments per year
plot tells whether or not to plot the percentage of each payment toward interest vs. period

Details

Effective rate of interest: eff.i = [1 + (i/ic)]^ic − 1
Rate per payment period: j = (1 + eff.i)^(1/pf) − 1
Loan = pmt * a_n|j
Balance at the end of period t: B_t = pmt * a_(n−t)|j
Interest paid at the end of period t: i_t = B_(t−1) * j
Principal paid at the end of period t: p_t = pmt − i_t
Total Paid = pmt * n
Total Interest Paid = pmt * n − Loan
If n = n* + k, where n* is an integer and 0 < k < 1:
Last regular payment (at period n*) = pmt * s_k|j
Drop payment (at period n* + 1) = Loan * (1 + j)^(n*+1) − pmt * s_n*|j
Balloon payment (at period n*) = Loan * (1 + j)^n* − pmt * s_n*|j + pmt
(Here a_n|j and s_n|j denote the present value and the accumulated value, respectively, of an annuity of n level payments at rate j.)
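
As a hand check of these formulas for Example (A), assuming monthly payments (pf=12, ic=1, so j = 1.025^(1/12) − 1), the level payment may be computed directly at the R prompt:

> j <- (1 + 0.025)^(1/12) - 1 #  interest rate per monthly period
> a.nj <- (1 - (1 + j)^(-360)) / j #  a_n|j for n = 360 payments
> 1000000 / a.nj #  Loan = pmt * a_n|j, solved for pmt:
> #  approximately $3,936.56 per month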

Value

A list of two components.

Schedule A data frame of the amortization schedule.
Other A matrix of the input variables and other calculated variables.

Note

Assumes that payments are made at the end of each period.

One of n, Loan, or pmt must be NA (unknown).

If pmt is less than the amount of interest accumulated in the first period, then the function will stop because the loan will never be paid off due to the payments being too small.

If pmt is greater than the loan amount plus interest accumulated in the first period, then the function will stop because one payment will pay off the loan.

Author(s)

K. Penn and J. Schmidt

See Also

amort.period

annuity.level

Remark:

[Figure: percentage of each payment toward interest, plotted against the payment period.]

Figure 4.18(b) A $1 million loan repaid by 360 equal monthly payments at a fixed annual mortgage rate of 2.5%.

In this case, it took only 258.04 months to pay off the loan. Hence, the overall saving in interest payments for the $1M loan is approximately

(360 × $3,936.56 − $1,000,000) − (258.04 × $5,000 − $1,000,000) ≈ $417,160 − $290,200 ≈ $127,000

(Seems attractive?)

4.4.2 The Function list() and the Construction of data.frame() in R

The Function list()

A list in R consists of an ordered collection of objects – its components, which may be of any modes or types. For example, a list may consist of a matrix, a numeric vector, a complex vector, a logical value, a character array, a function, and so on. Thus, a simple way to create a list would be

Example 1:  It is as easy as “1, 2, 3”!
> x <- 1
> y <- 2
> z <- 3
> list1 <- list(x, y, z) #  Forming a simple list
> list1 #  Outputting:
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3

Moreover, the components are always numbered, and may be referred to as such. Thus, if my.special.list is the name of a list with four components, they may be referred to, individually, as

my.special.list[[1]], my.special.list[[2]],
my.special.list[[3]], and my.special.list[[4]].
If one defines my.special.list as:
> my.special.list <- list(name="John", wife="Mary",
+   number.of.children=3, children.age=c(2, 4, 6))
then
> my.special.list[[1]] #  Outputting:
[1] "John"
> my.special.list[[2]]
[1] "Mary"
> my.special.list[[3]]
[1] 3
> my.special.list[[4]]
[1] 2 4 6
The Number of Components in a List

The number of (top-level) components in a list may be found by the function length(). Thus

> length(my.special.list)
[1] 4
viz., the list my.special.list has 4 components.

To combine a set of objects into a larger composite collection for more efficient processing, the list function may be used to construct a list from its components.

As an example, consider

> odds <- c(1, 3, 5, 7, 9, 11,13,15,17,19)
> evens <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
> mylist <- list(before=odds, after=evens)
> mylist
$before
 [1]  1  3  5  7  9 11 13 15 17 19
$after
 [1]  2  4  6  8 10 12 14 16 18 20
> mylist$before
 [1]  1  3  5  7  9 11 13 15 17 19
> mylist$after
 [1]  2  4  6  8 10 12 14 16 18 20
Components of a List

Components of a list may be named. In such a case, the component may be referred to either

  1. by giving the component name as a character string in place of the number in double square brackets or
  2. by giving an expression of the form
> name$component_name
       for the same object.
Example 2:  A family affair -
> my.special.list <- list(name="John", wife="Mary",
+   number.of.children=3, children.age=c(2, 4, 6))
> my.special.list # Outputting:
$name
[1] "John"
$wife
[1] "Mary"
$number.of.children
[1] 3
$children.age
[1] 2 4 6
Thus, for this list:
> my.special.list[[1]]
[1] "John"
> my.special.list$name
>  #  This is the same as my.special.list[[1]]
[1] "John"
> my.special.list[[2]]
[1] "Mary"
> my.special.list$wife
> #  This is the same as my.special.list[[2]]
[1] "Mary"
> my.special.list[[2]]
[1] "Mary"
> my.special.list[[3]]
[1] 3
> my.special.list$number.of.children
> #  This is the same as my.special.list[[3]]
[1] 3
> my.special.list[[4]]
[1] 2 4 6
> my.special.list$children.age
> # This is the same as my.special.list[[4]]
[1] 2 4 6
Extraction of a Variable from a List

To extract the name of a component stored in another variable, one may use the names of the list components in double square brackets, viz., list1[[“name”]]. The following R code segment may be used:

> x <- "name"; my.special.list[[John]]
[1] "John"
Constructing, Modifying, and Concatenating Lists:
New lists may be constructed from existing objects by the function list(). Thus, the form
> new.list <- list(name_1=object_1, ...,
+                  name_n=object_n)
will set up a list, new.list, of n components, using object_1, …, object_n for the components and giving them the names specified.
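
Lists may also be concatenated with the function c(), as in Review Question 9 below; a minimal sketch (the list names are chosen for illustration):

> list.A <- list(1, 2); list.B <- list("a"); list.C <- list(TRUE)
> list.ABC <- c(list.A, list.B, list.C)
> #  One list whose components are those of the argument
> #  lists, joined together end-to-end:
> length(list.ABC)
[1] 4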

4.4.3 Stock Market Risk Analysis: ES (Expected Shortfall) in the Black–Scholes Model

                            Package ‘Dowd’
March 11, 2016
Type Package
Title Functions Ported from ‘MMR2' Toolbox Offered in Kevin Dowd's
Book Measuring Market Risk
Version 0.12
Date 2015-08-20
Author Dinesh Acharya <[email protected]>
Maintainer Dinesh Acharya <[email protected]>
Description ‘Kevin Dowd's’ book Measuring Market Risk is a widely read book
in the area of risk measurement by students and
practitioners alike. As he claims, ‘MATLAB’ indeed might have been the most
suitable language when he originally wrote the functions, but,
with the growing popularity of R, that is no longer entirely
the case. As ‘Dowd's’ code was not intended to be error free and was mainly for reference, some functions in this package have inherited those errors. An attempt will be made in future releases to identify and correct them. ‘Dowd's’ original code can be downloaded from www.kevindowd.org/measuring-market-risk/.
It should be noted that ‘Dowd’ offers both ‘MMR2’
and ‘MMR1’ toolboxes. Only ‘MMR2’ was ported to
R. ‘MMR2’ is the more recent version of the ‘MMR1’ toolbox and they both have mostly similar functions. The toolbox mainly
contains different parametric and non-parametric
methods for measurement of market risk as well as
backtesting of risk measurement methods.
Depends R (>= 3.0.0), bootstrap, MASS, forecast
Suggests PerformanceAnalytics, testthat
License GPL
NeedsCompilation no
Repository CRAN
Date/Publication 2016-03-11 00:45:03
BlackScholesCallESSim ES of Black-Scholes call using Monte Carlo Simulation
Description
Estimates ES of Black-Scholes call Option using Monte Carlo simulation
Usage
BlackScholesCallESSim(amountInvested, stockPrice, strike, r, mu, sigma,
maturity, numberTrials, cl, hp)
Arguments
amountInvested Total amount paid for the Call Option and is positive (negative) if the option
position is long (short)
stockPrice Stock price of underlying stock
strike Strike price of the option
r Risk-free rate
mu Expected rate of return on the underlying asset and is in annualised term
sigma Volatility of the underlying stock and is in annualised term
maturity The term to maturity of the option in days
numberTrials The number of iterations in the Monte Carlo simulation exercise
cl Confidence level for which ES is computed and is scalar
hp Holding period of the option in days and is scalar
Value
ES
Author(s)
Dinesh Acharya
References
Dowd, Kevin. Measuring Market Risk, Wiley, 2007.
Lyuu, Yuh-Dauh. Financial Engineering & Computation: Principles, Mathematics, Algorithms,
Cambridge University Press, 2002.
Examples
# Market risk of a Black-Scholes call with given parameters.
BlackScholesCallESSim(0.20, 27.2, 25, .16, .2, .05, 60, 30, .95, 30)
> #  In the R domain:
>
> install.packages("Dowd")
Installing package into ‘C:/Users/Bert/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
# A CRAN mirror is selected
The downloaded binary packages are in
      C:\Users\Bert\AppData\Local\Temp\RtmpuYe2Ox\downloaded_packages
> #  4.4.3 Stock Market Risk Analysis:
> #  ES (Expected Shortfall) in the Black-Scholes Model
> library(Dowd)
> ls("package:Dowd") # Outputting:
  [1] "AdjustedNormalESHotspots"              
  [2] "AdjustedNormalVaRHotspots"            
  [3] "AdjustedVarianceCovarianceES"          
  [4] "AdjustedVarianceCovarianceVaR"        
  [5] "ADTestStat"                            
  [6] "AmericanPutESBinomial"                
  [7] "AmericanPutESSim"                      
  [8] "AmericanPutPriceBinomial"              
  [9] "AmericanPutVaRBinomial"                
 [10] "BinomialBacktest"                      
 [11] "BlackScholesCallESSim"                
 [12] "BlackScholesCallPrice"                
 [13] "BlackScholesPutESSim"                  
 [14] "BlackScholesPutPrice"                  
 [15] "BlancoIhleBacktest"                    
---------------------------------------------
 [145] "tVaRPlot3D"                            
[146] "VarianceCovarianceES"                  
[147] "VarianceCovarianceVaR"                
>
>
> #  Market risk of a Black-Scholes call with given
> #  parameters:
> BlackScholesCallESSim(0.20, 27.2, 25, .16, .2, .05,
+                                   60, 30, .95, 30)
> # Outputting the Black-Scholes Call ES (Expected Shortfall):
[1] 0.001294227
> #  viz., according to the Black-Scholes Model, for this Call,
> #  the ES (Expected Shortfall) is predicted to be at the 0.1%
> #  level, or, very unlikely indeed!
Some Remarks on Expected Shortfall

  • Expected shortfall (ES) is a risk measure – a concept used in the field of financial risk measurement to evaluate the market risk, or credit risk, of a portfolio.

    The “expected shortfall at the q% level” is the expected return on the portfolio in the worst q% of cases. ES is an alternative to Value at Risk that is more sensitive to the shape of the tail of the loss distribution.

  • Expected shortfall is also called
    1. Conditional Value at Risk (CVaR),
    2. Average Value at Risk (AVaR), and
    3. Expected Tail Loss (ETL).
  • ES estimates the risk of an investment in a conservative way, focusing on the less profitable outcomes. A value of q often used in practice is 5%.
  • Expected shortfall is a coherent, as well as a spectral, measure of financial portfolio risk. It requires a quantile level q, and is defined to be the expected loss of portfolio value in the worst q% of cases (see the sketch below).
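
A minimal sketch of the idea behind ES, computed directly from simulated returns (all names and parameter values here are illustrative, and not part of the Dowd package):

> set.seed(1)
> returns <- rnorm(10000, mean=0.0005, sd=0.01) #  simulated daily returns
> q <- 0.05 #  the 5% level
> VaR <- -quantile(returns, q) #  Value at Risk at level q
> ES <- -mean(returns[returns <= quantile(returns, q)])
> #  ES: the average loss over the worst q% of cases, so ES >= VaR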

Review Questions for Section 4.4

  1. To use R in data analysis, the data to be processed must first be entered into the R environment; discuss 7 ways of data entry, giving examples.
  2. How can the function list() be used to enter data into the R environment? Provide an example.
  3. Use the function data.frame() to enter data into the R environment, giving an example.
  4. Use the following functions to input data into the R environment, giving an example of each: c(), matrix(), and array().
  5. Use the function source() to enter data into the R environment, giving an example.
  6. When using the functions data.entry() and edit() for data entry, what are the limitations?
  7. Show that the function list() may be used to combine several components to form a new list. Give an example.
  8. Write a code segment in R to extract the name of a component stored in another variable, giving an example.
  9. With an example, show that applying the concatenation function c() to given list arguments produces a list whose components are those of the argument lists joined together sequentially, in the form of
         > list.ABC <- c(list.A, list.B, list.C)
    
  10. Look up the software Epidata from its web site http://www.epidata.dk, and suggest an efficient method of data entry in R.

4.5 Univariate, Bivariate, and Multivariate Data Analysis

  1. A univariate data set has only one variable: {x}, for example, {patient name}
  2. A bivariate data set has two variables: {x1, x2}, or {x, y}, for example, {patient name, gender}
  3. A multivariate data set has more than two, or many, variables: {x1, x2, x3, …, xn}, for example, {investor name, gender, age, capital, preferred minimum return on investment, periodic payout, …}

4.5.1 Univariate Data Analysis

As an example, enter the following code segments:

> x <- rexp(100); x
> #  Outputting 100 exponentially-distributed
> # random numbers:  
 [1] 0.39136880 0.66948212 1.48543076 0.34692128 0.71533079 0.12897216
 [7] 1.08455419 0.07858231 1.01995665 0.81232737 0.78253619 4.27512555
 [13] 2.11839466 0.47024886 0.62351482 1.02834522 2.17253419 0.37622879
 [19] 0.16456926 1.81590741 0.16007371 0.95078524 1.26048607 5.92621325
 [25] 0.21727112 0.07086311 0.83858727 1.01375231 1.49042968 0.53331210
 [31] 0.21069467 0.37559212 0.10733795 2.84094906 0.17899040 1.34612473
 [37] 0.00290699 1.77078060 1.79505318 0.09763821 1.96568170 0.15911043
 [43] 4.36726420 0.33652419 0.01196883 0.35657882 0.72797670 0.91958975
 [49] 0.68777857 0.29100399 0.22553560 1.56909742 0.20617517 0.37169621
 [55] 0.53173534 0.26034316 0.21965356 2.94355695 1.88392667 1.13933083
 [61] 0.31663107 0.23899975 0.01544856 1.30674088 0.53674598 1.72018758
 [67] 0.31035278 0.81074737 0.09104104 1.52426229 1.35520172 0.27969075
 [73] 1.36320488 0.56317216 0.85022837 0.49031656 0.17158651 0.31015165
 [79] 2.07315953 1.29566872 1.28955269 0.33487343 0.20902716 2.84732652
 [85] 0.58873236 1.54868210 2.93994181 0.46520037 0.73687959 0.50062507
 [91] 0.20275282 0.49697531 0.58578119 0.49747575 1.53430435 4.56340237
 [97] 0.90547787 0.72972219 2.60686316 0.33908320

Note: The function rexp() is defined as follows:

  rexp(n, rate = 1)

with arguments:

n      number of observations. If length(n) > 1, the length is taken to be the number required.
rate   vector of rates.

The exponential distribution with rate λ has density

f(x) = λe^(−λx),      for x ≥ 0.

If the rate λ is not specified, it assumes the default value of 1.

Remark: The function rexp() is one of the functions in R under exponential in the CRAN package stats.

To undertake a statistical analysis of this set of univariate data, one may call up the function univar(), in the package epibasix, using the following code segments:

> library(epibasix)
> univar(x) #  Outputting:
Univariate Summary
Sample Size: 100
Sample Mean: 1.005
Sample Median: 0.646
Sample Standard Deviation: 1.067
>

Thus, for this sample size of 100 elements, the mean, median and standard deviation have been computed.

For data analysis of univariate data sets, the R package epibasix may be used.

This CRAN package covers many elementary functions for statistics and econometrics. It contains elementary tools for the analysis of common financial problems, ranging from sample size estimation, through 2 × 2 contingency table analysis, to basic measures of agreement (kappa, sensitivity/specificity).

Appropriate print and summary statements are also written to facilitate interpretation wherever possible. This work is appropriate for graduate financial engineering courses.

This package is a work in progress.

To start, enter the R environment and use the code segment:

> install.packages("epibasix")
Installing package(s) into ‘C:/Users/bertchan/Documents/R/win-library/2.14’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
> #  Select CA1
trying URL
'http://cran.cnr.Berkeley.edu/bin/windows/contrib/2.14/epibasix_1.1.zip'
Content type 'application/zip' length 57888 bytes (56 Kb)
opened URL
downloaded 56 Kb
package ‘epibasix’ successfully unpacked and MD5 sums checked
The downloaded packages are in
      C:\Users\bertchan\AppData\Local\Temp\RtmpMFOrEn\downloaded_packages
With epibasix loaded into the R environment, to learn more about this package, follow these steps:
1.  Go to the CRAN website: http://cran.r-project.org/
2.  Select (single-click) Packages, on the left-hand column
3.  On the page: select E (for epibasix)
        Available CRAN Packages By Name
        A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
4.  Scroll down list of packages whose name starts with “E” or “e”, and select:
epibasix
5.  When the epibasix page opens up, select:   Reference manual:  epibasix.pdf
6.  The information is now on displayed, as follows:
Package      ‘epibasix’
             January 2, 2012
Version      1.1
Date         2009-05-13
Author       Michael A Rotondi <[email protected]>
Maintainer   Michael A Rotondi [email protected]
Depends R (>= 2.01)

For another example, consider the same analysis on the first one hundred natural numbers, using the following R code segments:
> x <- 1:100; x #  Consider, and then output, the first 100
>               #  natural numbers
[1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
[19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
[37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
[55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
[73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
[91]  91  92  93  94  95  96  97  98  99 100
> library(epibasix)
> univar(x) #  Performing a univariate data analysis
>           #  on the vector x, and Outputting:
Univariate Summary
Sample Size: 100
Sample Mean: 50.5
Sample Median: 50.5
Sample Standard Deviation: 29.011
And that’s it!

4.5.2 Bivariate and Multivariate Data Analysis

When there are two variables, (X, Y), one needs to consider the following two cases:

  1. Case I: In the classical regression model, only Y, called the dependent variable, is required to be random. X is defined as a fixed, nonrandom variable, the independent variable. Under this model, observations are obtained by preselecting values of X and determining the corresponding values of Y.
  2. Case II: If both X and Y are random variables, one has the correlation model, under which sample observations are obtained by selecting a random sample of the units of association (such as persons, characteristics (age, gender, locations, points of time, specific events/actions), or elements on which the two measurements are based) and recording a measurement of X and a measurement of Y. In this case, values of X are not preselected but occur at random, depending on the unit of association selected in the sample.

Regression Analysis:

  1. Case I: Correlation analysis cannot be meaningfully performed under this model.
  2. Case II: Regression analysis can be performed under the correlation model.

Correlation for two variables implies a co-relationship between the variables, and does not distinguish between them as to which is the dependent and which the independent variable. Thus, one may fit a straight line to the data either by minimizing Σ(xi − x̂i)² or by minimizing Σ(yi − ŷi)². The fitted regression line will, in general, be different in the two cases, and a logical question arises as to which line to fit.

Two situations exist, and should be considered:

  1. If the objective is to obtain a measure of strength of the relationship between the two variables, it does not matter which line is fitted – the measure calculated will be the same in either case.
  2. If one needs to use the equation describing the relationship between the two variables for the dependency of one upon the other, it does matter which line is fitted. The variable for which one wishes to estimate means or to make predictions should be treated as the dependent variable; that is, this variable should be regressed on the other variable, as illustrated in the sketch below.
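As a brief illustration of these two situations, the following R code segment (a sketch on simulated data; all variable names and coefficient values are illustrative) fits both possible lines: the two sets of coefficients differ, but the squared correlation, the measure of strength, is the same either way:

set.seed(123)                    # for reproducibility
x <- rnorm(100)                  # simulated measurements of X
y <- 2 * x + rnorm(100)          # simulated measurements of Y
coef(lm(y ~ x))                  # line fitted by regressing y on x
coef(lm(x ~ y))                  # a different line: x regressed on y
cor(x, y)^2                      # strength of the co-relationship ...
summary(lm(y ~ x))$r.squared     # ... equals R-squared of either regression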
Available R Packages for Bivariate Data Analysis

Among the R packages for bivariate data analysis, a notable one, available for sample size calculations under the bivariate random intercept regression model, is bivarRIpower.

An Example in Bivariate Data Analysis

As an example, this package may be used to calculate the sample size necessary to achieve 80% power, at the 5% alpha level, for null and alternative hypotheses under which the correlation between random intercepts (RI) is 0 and 0.2, respectively, across six time points. The other covariance parameters are set as follows:

Correlation between residuals: 0 under H0 and 0.1 under Ha;
Standard deviations: 1st RI = 1, 2nd RI = 2, 1st residual = 0.5, 2nd residual = 0.75.
The following R code segment may be used:
> library(bivarRIpower)
> bivarcalcn(power=.80,powerfor='RI',timepts=6,
+ d1=1,d2=2, p=0,p1=.2,s1=.5,s2=.75,r=0,r1=.1)
#  Outputting:
Variance parameters
Clusters                            = 209.2
Repeated measurements         = 6
Standard deviations
 1st random intercept         = 1
 2nd random intercept         = 2
 1st residual term            = 0.5
 2nd residual term            = 0.75
Correlations      
 RI under H_o                 = 0
 RI under H_a                 = 0.2
 Residual under H_o           = 0
 Residual under H_a           = 0.1
 Con obs   under H_o          = 0
 Con obs   under H_a          = 0.1831984
 Lag obs   under H_o          = 0
 Lag obs   under H_a          = 0.1674957
Correlation variances under H_o
------------------------------------------------------------
Random intercept              = 0.005096138
Residual                      = 0.0009558759
Concurrent observations       = 0.00358999
Lagged observations           = 0.003574277
Power (%) for correlations
------------------------------------------------------------
Random intercept              = 80%
Residual                      = 89.9%
Concurrent observations       = 86.4%
Lagged observations           = 80%
>
Bivariate Normal Distribution

Under the correlation model, the variables X and Y vary together in a joint distribution. If this joint distribution is a normal distribution, it is called a bivariate normal distribution, and inferences may be made based on the results of sampling properly from the population. If the joint distribution is known to be non-normal, or if its form is unknown, such inferential procedures are invalid. The following assumptions must hold for inferences about the population to be valid when sampling from a bivariate distribution:

  1. For each value of X, there is a normally distributed subpopulation of Y values.
  2. For each value of Y, there is a normally distributed subpopulation of X values.
  3. The joint distribution of X and Y is a normal distribution called the bivariate normal distribution.
  4. The subpopulation of Y values have the same variance.
  5. The sub-population of X values have the same variance.

Two random variables X and Y are said to be jointly normal if they can be expressed in the form

(4.2)   X = aU + bV,   Y = cU + dV

where U and V are independent normal random variables.

If X and Y are jointly normal, then any linear combination

(4.3)   Z = s1X + s2Y

has a normal distribution. The reason is that if one has X = aU + bV and Y = cU + dV for some independent normal random variables U and V, then

(4.4)   Z = s1(aU + bV) + s2(cU + dV) = (as1 + cs2)U + (bs1 + ds2)V

Thus, Z is the sum of the independent normal random variables (as1 + cs2)U and (bs1 + ds2)V, and is therefore normal.

A very important property of jointly normal random variables is that zero correlation implies independence.

Zero Correlation Implies Independence

If two random variables X and Y are jointly normal and are uncorrelated, then they are independent.

(This property can be verified using multivariate transforms)
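The construction in Equation (4.2) may also be checked numerically in R. The following sketch (the coefficient values are illustrative assumptions) builds X and Y from independent normals U and V, forms a linear combination Z, and inspects its normality:

set.seed(1)
U <- rnorm(10000)                # independent normal random variables
V <- rnorm(10000)
X <- U + 2 * V                   # X = aU + bV, with a = 1, b = 2
Y <- 3 * U - V                   # Y = cU + dV, with c = 3, d = -1
Z <- 0.5 * X + 0.25 * Y          # a linear combination s1*X + s2*Y
qqnorm(Z); qqline(Z)             # points fall close to the line: Z is normal
shapiro.test(sample(Z, 500))     # normality is typically not rejected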

Multivariate Data Analysis

The following are the two similar, but distinct, approaches used for multivariate data analysis:

  1. The Multiple Linear Regression Analysis: This assumes that a linear relationship exists between some variable Y, called the dependent variable, and n independent variables, X1, X2, X3,…, Xn, which are called explanatory or predictor variables because they are used to explain or predict Y.

    The following are the assumptions underlying multiple regression model analysis:

    1. The Xi are nonrandom fixed variables indicating that any inferences drawn from sample data apply only to the set of X values observed, but not to larger collections of X. Under this regression model, correlation analysis is not meaningful.
    2. For each set of Xi values, there is a subpopulation of Y values. Usually, one assumes that these Y values are normally distributed.
    3. The variances of Y are all equal.
    4. The Y values are statistically independent of one another for the different selected sets of X values.

    For multiple linear regression, the model equation is

    (4.5)   yj = β0 + β1x1j + β2x2j + ⋯ + βnxnj + ej

    where yj is a typical value from one of the subpopulations of Y values, and the βi values are the regression coefficients.

    x1j, x2j, x3j,…, xnj are, respectively, particular values of the independent variables X1, X2, X3,…, Xn, and ej is a random variable with mean 0 and variance σ2, the common variance of the subpopulation of Y values. Generally, ej is assumed normal and independently distributed.

    When Equation (4.5) consists of one dependent variable and two independent variables, the model becomes

    (4.6)   yj = β0 + β1x1j + β2x2j + ej

    A plane in three-dimensional space may be fitted to the data points. For models containing more than two variables, it is a hyperplane.
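    A minimal sketch of fitting such a plane with R's built-in lm() function, using simulated data (all names and coefficient values are illustrative):

    set.seed(7)
    x1 <- rnorm(50)                          # first explanatory variable
    x2 <- rnorm(50)                          # second explanatory variable
    y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(50)  # Y with normal errors ej
    fit <- lm(y ~ x1 + x2)                   # fits the plane of Equation (4.6)
    coef(fit)                                # estimates of β0, β1, β2
    summary(fit)$r.squared                   # coefficient of multiple determination
    sqrt(summary(fit)$r.squared)             # multiple correlation coefficient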

    Consider observed values y1, y2,…, yn, each paired with a fitted (calculated) value f1, f2,…, fn from the regression. If ȳ is the mean of the observed data,

    ȳ = (1/n) Σ yi

    then the variability of the data set may be measured using three sums of squares (proportional to the variance of the data):

    1. The Total Sum of Squares:
      (4.7)   SStot = Σ(yi − ȳ)²
    2. The Regression Sum of Squares (the explained sum of squares):
      (4.8)   SSreg = Σ(fi − ȳ)²
    3. The Sum of Squares of Residuals (the error sum of squares):
      (4.9)   SSres = Σ(yi − fi)²

    The most general definition of the coefficient of multiple determination is

    (4.10)   R² = 1 − SSres/SStot

    The parameter of interest in this model is the coefficient of multiple determination, R², obtained by dividing the explained sum of squares by the total sum of squares:

    (4.11)   R² = SSreg/SStot

    or, in terms of the sample values,

    (4.12)   R² = Σ(fi − ȳ)² / Σ(yi − ȳ)²
    where:
    Σ(fi − ȳ)² = the explained variation
    = the sum of squared deviations of the calculated values from the mean of the observed Y values, or
    = the sum of squares due to regression (SSR)
    Σ(yi − fi)² = the unexplained variation
    = the sum of squared deviations of the original observations from the calculated values
    = the sum of squares about regression, or
    = the error sum of squares (SSE)

    The total variation is the sum of squared deviations of each observation of Y from the mean of the observations: namely,

    (4.13)   total variation = explained variation + unexplained variation

    or

    (4.14)   Σ(yi − ȳ)² = Σ(fi − ȳ)² + Σ(yi − fi)²

    The Multiple Correlation Model Analysis – The object of this approach is to gain insight into the strength of the relationship between variables.

    The multiple regression correlation model analysis equation is

    (4.15)   yj = β0 + β1x1j + β2x2j + ⋯ + βnxnj + ej

    where yj is a typical value from one of the subpopulations of Y values, the βi are the regression coefficients, x1j, x2j, x3j,…, xnj are, respectively, particular known values of the random variables X1, X2, X3,…, Xn, and ej is a random variable with mean 0 and variance σ2, the common variance of the subpopulation of Y values. Generally, ej is assumed normal and independently distributed.

    This model is similar to model Equation (4.5), with the following important distinction:

    1. in Equation (4.5), the xi are nonrandom variables, but
    2. in Equation (4.15), the xi are random variables.

    That is, in the correlation model, Equation (4.15), there is a joint distribution of Y and the Xi, which is called a multivariate distribution.

    Under this model, the variables are no longer considered as being dependent or independent, because logically they are interchangeable, and either of the Xi may play the role of Y.

  2. The Correlation Model Analysis

    The Multiple Correlation Coefficient: To analyze the relationships among the variables, consider the multiple correlation coefficient, which is the square root of the coefficient of multiple determination; hence, the sample value may be computed by taking the square root of Equation (4.12), namely,

    (4.16)   R = √R²
  3. Analysis of Variance (ANOVA)

    In statistics, ANalysis Of VAriance (ANOVA) is a collection of statistical models in which the observed variance in a particular variable is partitioned into components attributable to different sources of variation. ANOVA provides a statistical test of whether or not the means of several groups are all equal, and therefore generalizes the two-sample t-test to more than two groups. Performing multiple two-sample t-tests instead would inflate the chance of committing a Type I error; for this reason, ANOVA is useful when comparing two or more means.

    ANOVA Tables: Summarized in the following tables, ANOVA is used for two different purposes:

  1. To estimate and test hypotheses for simple linear regression about population variances
  2. To estimate and test hypotheses about population means

ANOVA table for testing hypotheses about simple linear regression

Source     DF      Sum of squares           Mean squares        F-value             P-value
Model      1       Σ(ŷi − ȳ)² = SSModel     SSM/1 = MSM         MSM/MSE = F1,n−2    Pr(F > F1,n−2)
Residual   n − 2   Σei² = SSResidual        SSR/(n − 2) = MSE
Total      n − 1   Σ(yi − ȳ)² = SSTotal     SST/(n − 1) = MST

Residuals are often called errors, since they are the part of the variation that the line could not explain; thus MSR = MSE = (sum of squared residuals)/df, the estimate σ̂² of the variance about the population regression line. Likewise, SSTot/(n − 1) = MSTot = sy², the total variance of the y's; and F = t² for simple linear regression.

The larger the F (and the smaller the p-value), the more of the variation in y the line explains, and the less likely it is that H0 is true. One rejects the null hypothesis when the p-value < α.

R2 = proportion of the total variation of y explained by the regression line
= SSM/SST
= 1 − SSResidual/SST
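This regression ANOVA table, and the F = t² identity, may be inspected with R's built-in anova() function; the following is a sketch on simulated data (all names are illustrative):

set.seed(42)
x <- rnorm(30)
y <- 1 + 0.8 * x + rnorm(30)
fit <- lm(y ~ x)
anova(fit)                                    # DF, Sum Sq, Mean Sq, F value, Pr(>F)
summary(fit)$coefficients["x", "t value"]^2   # t^2 equals the F-value above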

ANOVA table for testing hypotheses about population means

Source            DF      Sum of squares          Mean squares         F-value    P-value
Group (between)   k − 1   Σni(x̄i − x̄)² = SSG      SSG/(k − 1) = MSG    MSG/MSE    Pr(F > Fk−1,N−k)
Error (within)    N − k   Σ(ni − 1)si² = SSE      SSE/(N − k) = MSE
Total             N − 1   Σ(xij − x̄)² = SSTot     SSTot/(N − 1) = MST

N = total number of observations = Σni, where ni = number of observations for group i.

The F test statistic has two different degrees of freedom: the numerator df = k − 1 and the denominator df = N − k; that is, F ~ Fk−1,N−k.

Note: SSE/(N − k) = MSE = sp² (the pooled sample variance):

sp² = [(n1 − 1)s1² + (n2 − 1)s2² + ⋯ + (nk − 1)sk²] / (N − k)

(This is the “average” variance for each group.) SSTot/(N − 1) = MSTot = s² = the total variance of the data (assuming NO groups).

F ≈ (variance of the between-group sample means) / (average within-group variance of the data). The larger the F (and the smaller the p-value), the more varied the means are, and the less likely it is that H0 is true; H0 is rejected when the p-value < α.

R² = proportion of the total variation explained by the difference in means = SSG/SSTot, as may be checked with the sketch below.
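A corresponding one-way ANOVA may be run with R's built-in aov() function; the sketch below (group sizes and means are illustrative assumptions) also recovers R² = SSG/SSTot from the fitted table:

set.seed(99)
group <- gl(3, 10, labels = c("A", "B", "C"))        # k = 3 groups, ni = 10 each
x <- rnorm(30, mean = rep(c(0, 0.5, 1), each = 10))  # group means differ
fit <- aov(x ~ group)
summary(fit)                      # Group and Error rows: DF, Sum Sq, Mean Sq, F, Pr(>F)
ss <- summary(fit)[[1]][["Sum Sq"]]
ss[1] / sum(ss)                   # R^2 = SSG/SSTot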

Figure 4.19 $1 million loan repaid by equal monthly payments of $5,000 each, at a fixed annual mortgage rate of 2.5%; the percentage of each payment going toward interest (y-axis, 0.0–0.4) is plotted against the period (x-axis, 0–250).

Cluster Means
	2005-11-01	2005-11-02	2005-11-03	2005-11-04	2005-11-07	
1	-7.174567e-05	-0.0004228193	0.0070806633	0.0086669900	0.003481978
2	8.414595e-03	0.0025193420	0.0127072920	-0.0007027570	0.006205226
3	-9.685162e-04	-0.0021474822	-0.0004991982	-0.0005400452	0.000752058
	2005-11-08	2005-11-09	2005-11-10	2005-11-11	2005-11-14
1	0.0006761683	0.001840756	0.001091617	0.008291539	-0.0006469167
2	0.0003292600	-0.002378200	0.000922087	0.013334906	-0.0046930640
3	0.0014039486	-0.001694911	0.001119758	0.001929053	-0.0003708446
	2005-11-15	2005-11-16	2005-11-17	2005-11-18	2005-11-21
1	-0.0005408223	0.001686769	0.0048643273	0.005051224	0.001901881
2	0.0012668650	-0.007187498	0.0076581030	0.012527202	0.002659666
3	-0.0008278810	0.001879324	0.0003510514	-0.001005688	0.001511155
	2005-11-22	2005-11-23	2005-11-24	2005-11-25	2005-11-28
1	0.0037938340	0.0024438233	-0.000394091	0.0022790387	-0.006316902
2	0.0021424940	0.0035671280	-0.002559544	0.0033748180	-0.009816739
3	0.0003204556	0.0008044326	0.001046522	0.0007771762	-0.001717004
	2005-11-29	2005-11-30	2005-12-01	2005-12-02	2005-12-05
1	0.0031941750	-0.0037985373	0.011412029	0.0030362450	-0.003450339
2	0.0018864440	-0.0040370550	0.015977559	0.0070552900	-0.000268694
3	0.0001304264	-0.0001264086	0.002280999	0.0004551984	-0.001906796
	2005-12-06	2005-12-07	2005-12-08	2005-12-09	2005-12-12
1	-0.001325485	0.0007240550	-0.0066049723	0.0031775860	-0.001708432
2	0.002864672	-0.0026064460	-0.0034829450	0.0007472420	-0.000278182
3	0.002119992	-0.0007569246	0.0008520178	0.0006191402	0.000879002
	2005-12-13	2005-12-14	2005-12-15	2005-12-16	2005-12-19
1	0.0046313440	-0.004764346	0.0029376953	0.001213580	0.0004901883
2	0.0013586830	-0.007283576	-0.0073441080	0.004292528	0.0049806720
3	0.0008959768	-0.001095387	0.0007868516	0.001616794	0.0023614496
	2005-12-20	2005-12-21	2005-12-22	2005-12-23	2005-12-26
1	0.0071593677	0.007886516	0.0001159380	0.001659755	0.001861593
2	-0.0016334200	0.004319502	-0.0036299400	-0.002234004	0.000000000
3	0.0009404428	0.002617884	-0.0002855498	0.000507245	0.000246350
	2005-12-27	2005-12-28	2005-12-29	2005-12-30	2006-01-02
1	-0.001653545	-0.0005308263	0.004244265	-0.0020539910	0.0007404967
2	0.006571546	0.0016811710	0.007701353	-0.0039764730	0.0000000000
3	0.001191907	0.0015647450	0.001259883	-0.0007382012	0.0000960024
	2006-01-03	2006-01-04	2006-01-05	2006-01-06	2006-01-09
1	-0.0013538470	0.001985783	-0.001291249	0.0015069157	0.006899025
2	0.0065956320	0.012163609	-0.002285923	0.0022278130	-0.000174350
3	-0.0001111704	0.001657156	-0.000790521	0.0005815596	0.000574890
	2006-01-10	2006-01-11	2006-01-12	2006-01-13	2006-01-16
1	-0.001533005	0.0040728800	0.004428087	-0.001791852	-0.0005044543
2	-0.006484459	0.0109940950	0.005887780	0.000629145	0.0041673760
3	-0.000713456	-0.0008200066	0.001761519	-0.002024939	-0.0000711238
	2006-01-17	2006-01-18	2006-01-19	2006-01-20	2006-01-23
1	-0.0064264427	-0.009959086	0.008732922	-0.005498331	-0.012300008
2	-0.0060610800	-0.007104074	0.004901245	-0.008712876	0.000080500
3	-0.0004887026	-0.000363750	0.001841324	-0.002770640	-0.000594795
	2006-01-24	2006-01-25	2006-01-26	2006-01-27	2006-01-30	2006-01-31
1	0.0043293403	0.003928724	0.007596804	0.013018524	0.004510001	-0.000363935
2	-0.0009696750	-0.001245166	0.008348556	0.004847171	0.001426750	0.002736095
3	-0.0007514866	-0.001468543	0.001332284	0.002198388	0.000585831	0.000676559
	2006-02-01	2006-02-02	2006-02-03	2006-02-06	2006-02-07
1	0.0025679990	-0.0033918090	0.001392298	0.003829826	-0.0016356553
2	0.0033390250	-0.0052315370	0.006229563	0.000010100	-0.0003661590
3	-0.0001937426	-0.0009799974	0.001561223	0.000789857	-0.0005261388
	2006-02-08	2006-02-09	2006-02-10	2006-02-13	2006-02-14
1	-0.0034102587	0.004724740	-0.0002670790	-0.0043036747	0.0071958650
2	-0.0007797950	0.005323768	-0.0010290070	0.0030521410	-0.0005124970
3	0.0002705448	0.001218430	-0.0001116748	-0.0002136692	0.0009441608
	2006-02-15	2006-02-16	2006-02-17	2006-02-20	2006-02-21
1	-0.0012863397	0.0045143087	-0.0008902317	-0.002237808	0.0050509303
2	-0.0099343840	0.0129171990	0.0012844190	0.004250758	0.0063638350
3	0.0002048986	0.0009410282	0.0012602358	0.001315374	0.0009478938
	2006-02-22	2006-02-23	2006-02-24	2006-02-27	2006-02-28
1	0.007771075	0.0008127150	0.0050935397	0.0050390380	-0.0108695373
2	0.004433658	-0.0060669620	-0.0027457400	0.0053722110	-0.0119569840
3	0.002287663	0.0003148286	-0.0003851528	0.0009570004	-0.0006375348
	2006-03-01	2006-03-02	2006-03-03	2006-03-06	2006-03-07
1	0.0038673107	-0.003960535	-0.004273008	0.0005488833	-0.001580974
2	0.0105047210	-0.004177747	0.001755361	0.0004602100	-0.007028455
3	0.0001534274	-0.003340033	-0.001016466	0.0001020288	-0.001403624
	2006-03-08	2006-03-09	2006-03-10	2006-03-13	2006-03-14
1	-0.004631919	0.0054281930	0.0077513163	0.0026288817	-0.0011077267
2	-0.004230567	0.0035344700	0.0128746620	0.0082543960	-0.0005305960
3	-0.000989756	0.0001691144	-0.0002940006	0.0006133858	0.0006779516
	2006-03-15	2006-03-16	2006-03-17	2006-03-20	2006-03-21
1	0.0039076300	-0.002874349	0.0035407443	-0.0002971253	0.0017513113
2	-0.0017679700	0.000666818	0.0043129660	0.0021624820	-0.0002753190
3	0.0001087012	0.000484471	-0.0001434976	0.0006474828	-0.0008360246
	2006-03-22	2006-03-23	2006-03-24	2006-03-27	2006-03-28
1	0.0034417023	0.0042167183	-0.0005082437	-0.002477860	-0.005123098
2	0.0013273100	-0.0043185900	0.0027725900	-0.006945127	-0.000727865
3	0.0006235516	0.0002566278	0.0001903118	-0.000559792	-0.001748221
	2006-03-29	2006-03-30	2006-03-31	2006-04-03	2006-04-04
1	0.0083719110	-0.0023536467	0.0032189740	0.002929918	-0.0053531483
2	0.0009252440	0.0062064890	-0.0009292510	0.006944399	-0.0026184930
3	-0.0004704764	0.0003112066	-0.0003227242	-0.001220841	-0.0005425296
	2006-04-05	2006-04-06	2006-04-07	2006-04-10	2006-04-11
1	0.001672089	0.0035391920	0.0006688370	-0.0018577213	-0.0078612010
2	0.004752739	0.0015471340	0.0005119710	0.0000160000	-0.0095747430
3	0.000371034	0.0001308954	-0.0001606024	0.0004964256	0.0008645472
	2006-04-12	2006-04-13	2006-04-14	2006-04-17	2006-04-18
1	-0.001794527	-6.576767e-05	0.0003451073	-0.01162293	0.0072958027
2	-0.002978612	3.927437e-03	0.0000000000	0.00000000	-0.0050312550
3	-0.001221078	-2.154967e-03	-0.0000179728	-0.00127895	0.0008801984
	2006-04-19	2006-04-20	2006-04-21	2006-04-24	2006-04-25
1	0.005262631	0.004346695	-0.0001275733	-0.0055535810	-0.0008195537
2	0.010893639	0.004201724	0.0048380810	-0.0016947400	-0.0041756790
3	0.001021676	0.001559365	-0.0000182248	-0.0001482958	-0.0017556356
	2006-04-26	2006-04-27	2006-04-28	2006-05-01	2006-05-02
1	0.001310372	-0.003174862	-0.0081970307	-0.0042204567	0.005113142
2	0.004699408	0.000883969	-0.0029437410	0.0000000000	0.003189327
3	-0.000656350	0.000053330	0.0002421188	-0.0007466388	0.001444051
	2006-05-03	2006-05-04	2006-05-05	2006-05-08	2006-05-09
1	-0.001263695	-0.0006502133	0.006613750	0.0063954017	-0.003357079
2	-0.011943802	0.0030146350	0.011235363	0.0061756300	0.002517020
3	-0.001037496	-0.0000032026	0.002150292	0.0007181488	-0.001401839
	2006-05-10	2006-05-11	2006-05-12	2006-05-15	2006-05-16
1	-0.0025826107	-0.008643966	-0.016678837	-0.006493649	-0.002878691
2	-0.0013872090	-0.001096609	-0.017989732	-0.013399941	0.003285150
3	-0.0005785598	-0.003397807	-0.004047129	-0.000274325	0.001044719
	2006-05-17	2006-05-18	2006-05-19	2006-05-22	2006-05-23
1	-0.010999777	-0.0085399823	0.005933202	-0.021077161	0.007398834
2	-0.028406916	-0.0097114170	0.001848007	-0.025997761	0.018970677
3	-0.003777801	0.0001129756	0.002052023	-0.001940486	0.002266990
	2006-05-24	2006-05-25	2006-05-26	2006-05-29	2006-05-30
1	-0.0004267907	0.0061746340	0.016697015	-0.0011805273	-0.017512962
2	-0.0111155890	0.0000000000	0.025842125	0.0017720070	-0.019842413
3	-0.0011767074	0.0006506848	0.002053482	0.0005608306	-0.003672446
	2006-05-31	2006-06-01	2006-06-02	2006-06-05	2006-06-06
1	0.004247769	0.0048529913	0.002548541	-0.006215087	-0.007154035
2	0.009323326	0.0015364830	0.006993059	0.000000000	-0.022326545
3	0.001607127	-0.0008596914	0.002280503	-0.001082275	-0.001973220
	2006-06-07	2006-06-08	2006-06-09	2006-06-12	2006-06-13
1	-0.0015405703	-0.008894756	0.004967781	-0.004911457	-0.017072989
2	0.0056383270	-0.027379310	0.012429170	-0.013895865	-0.023992295
3	-0.0003462346	-0.001317298	0.002478040	-0.001395419	-0.003113317
	2006-06-14	2006-06-15	2006-06-16	2006-06-19	2006-06-20
1	-0.0008148080	0.0180758310	0.0023954587	0.001269003	-0.002988990
2	0.0022722680	0.0215693860	-0.0029262460	0.008192221	0.003859327
3	-0.0006585664	0.0004955194	-0.0000012182	-0.001387469	-0.001853536
	2006-06-21	2006-06-22	2006-06-23	2006-06-26	2006-06-27
1	0.0019448093	0.005720239	0.0003236553	0.0022263750	-0.004784092
2	0.0033826850	0.007397751	0.0005459380	-0.0024934820	-0.006852493
3	-0.0002515638	-0.000396063	-0.0009850436	-0.0002276698	-0.001709434
	2006-06-28	2006-06-29	2006-06-30	2006-07-03	2006-07-04	2006-07-05
1	0.002418113	0.013946565	0.0007082933	0.0051894390	0.0036357180	-0.003018747
2	0.002764602	0.014041765	0.0144738140	0.0087848690	0.0019620600	-0.008807793
3	0.000155729	0.003295715	0.0007134052	0.0007825544	-0.0005997304	-0.002343919
	2006-07-06	2006-07-07	2006-07-10	2006-07-11	2006-07-12
1	0.0006826483	-0.0053623837	0.006527871	-0.0026637643	-0.000460215
2	0.0055633270	-0.0057979700	0.005982938	-0.0038346560	0.004470950
3	0.0008409888	0.0001182382	0.001124756	0.0001879154	-0.000366662
	2006-07-13	2006-07-14	2006-07-17	2006-07-18	2006-07-19
1	-0.013003389	-0.0051940770	-0.0004672143	0.0002666863	0.010286439
2	-0.015294320	-0.0103727210	-0.0022662700	-0.0056064500	0.018419580
3	-0.002107831	0.0007863122	0.0008104944	-0.0013472198	0.001975698
	2006-07-20	2006-07-21	2006-07-24	2006-07-25	2006-07-26	2006-07-27
1	0.001826095	-0.0080939067	0.013847674	0.0046651393	0.0005912200	-0.000995908
2	0.008023110	-0.0053387720	0.019239808	0.0006067870	0.0043343300	0.005560211
3	0.001140619	-0.0006294194	0.003391497	0.0009658202	0.0007748634	0.001358731
	2006-07-28	2006-07-31	2006-08-01	2006-08-02	2006-08-03
1	0.0054530087	0.0009621367	-0.0016914860	0.001702776	3.388133e-05
2	0.0102982490	0.0002672360	0.0000000000	-0.003903872	-1.101953e-02
3	0.0008377254	0.0012288024	-0.0002328868	-0.000336470	-1.946893e-03
	2006-08-04	2006-08-07	2006-08-08	2006-08-09	2006-08-10
1	0.0004074303	-0.005170347	0.001632508	0.0004397503	0.005330442
2	0.0095373900	-0.012098329	0.000553560	0.0107776200	-0.005122863
3	0.0025471316	-0.001348114	0.001391077	0.0003048744	0.000744234
	2006-08-11	2006-08-14	2006-08-15	2006-08-16	2006-08-17
1	0.0003234277	0.0042420490	0.006281944	0.002505853	0.0022312940
2	0.0016952790	0.0080151810	0.014854679	0.004466055	0.0029039360
3	-0.0003461608	-0.0005048628	0.002816991	0.001751071	0.0003821246
	2006-08-18	2006-08-21	2006-08-22	2006-08-23	2006-08-24
1	0.001322035	-0.0061511060	0.007819312	-0.0024164317	-0.0006088927
2	-0.003261951	-0.0030139970	0.003136533	-0.0002148440	0.0020208950
3	0.001275974	-0.0002004374	0.002229195	-0.0002547412	-0.0001443246
	2006-08-25	2006-08-28	2006-08-29	2006-08-30	2006-08-31
1	0.001879964	0.0001150750	0.0039964717	-0.0013533323	0.003702581
2	0.000665318	0.0018015970	0.0060754350	0.0027585440	-0.001466397
3	0.000606633	0.0008119902	-0.0000166692	0.0008165558	0.002074951
	2006-09-01	2006-09-04	2006-09-05	2006-09-06	2006-09-07
1	0.003825121	0.0022519797	0.002990688	-0.005576266	-0.0046239360
2	0.003126659	0.0045617030	-0.000424630	-0.005904139	-0.0061051260
3	0.001543151	-0.0000862626	0.000150264	-0.002215612	-0.0008681144
	2006-09-08	2006-09-11	2006-09-12	2006-09-13	2006-09-14	2006-09-15
1	0.005100734	-0.006268633	0.0079822300	0.004665103	-0.0020439300	0.007261748
2	0.005038040	-0.008630266	0.0115443790	0.003047893	-0.0032925020	0.005426908
3	0.002150122	-0.002463850	0.0007386444	0.001504165	-0.0003697208	0.001212241
	2006-09-18	2006-09-19	2006-09-20	2006-09-21	2006-09-22
1	-0.002313630	-0.0022077103	0.002879550	0.0017740240	-0.0120944313
2	0.003786049	-0.0025686540	0.012243951	0.0039715930	-0.0091896050
3	-0.001438723	0.0004212428	0.001333139	0.0009339414	0.0000184756
	2006-09-25	2006-09-26	2006-09-27	2006-09-28	2006-09-29	2006-10-02
1	0.0015608587	0.007422705	0.0054981810	0.003993309	0.002248728	-0.004799167
2	-0.0027169680	0.012816618	0.0036014040	0.000706224	0.001409431	-0.005008288
3	0.0006968214	0.002279966	-0.0000253728	0.000284538	0.000532484	-0.001994171
	2006-10-03	2006-10-04	2006-10-05	2006-10-06	2006-10-09
1	0.0011308367	0.007532294	0.007702274	0.0013383343	-1.842333e-06
2	0.0006241600	0.006657881	0.007166048	0.0010145910	3.046625e-03
3	-0.0000141456	0.001957197	0.001609076	-0.0009962644	2.434124e-04
	2006-10-10	2006-10-11	2006-10-12	2006-10-13	2006-10-16
1	0.0058490050	-0.001032232	0.005763076	0.0057859347	0.0016738733
2	0.0089863210	0.001066884	0.004239901	-0.0020605700	-0.0018373360
3	-0.0004574944	0.000480683	0.001357373	0.0004827066	-0.0004780952
	2006-10-17	2006-10-18	2006-10-19	2006-10-20	2006-10-23
1	-0.0072083463	0.0065625757	-0.004172731	0.0013755277	0.0067030360
2	-0.0105018630	0.0091540220	0.001017744	0.0026080710	0.0055130300
3	-0.0002134596	0.0004904236	-0.001158779	-0.0000334576	-0.0001258452
	2006-10-24	2006-10-25	2006-10-26	2006-10-27	2006-10-30
1	0.0006909480	0.0007045297	0.0004360173	-0.006801505	-0.0013724143
2	-0.0035125790	0.0028470680	-0.0006172110	0.002649247	-0.0047902870
3	0.0001992902	0.0006721280	0.0005680150	0.000502184	0.0003469408
	2006-10-31	2006-11-01	2006-11-02	2006-11-03	2006-11-06
1	-0.001518782	1.051133e-05	-0.0013112907	0.0043942913	0.006986784
2	-0.008238610	4.875862e-03	0.0034278750	0.0059986060	0.010778959
3	0.001557833	1.065105e-03	-0.0000054624	-0.0000061736	0.002111045
	2006-11-07	2006-11-08	2006-11-09	2006-11-10	2006-11-13
1	-0.000492798	0.0010355810	-0.0042464430	-0.0022569473	0.0017332780
2	0.003869904	-0.0059257000	-0.0005821010	-0.0026741380	0.0015220670
3	0.002714624	0.0004917422	0.0007154908	-0.0007612248	-0.0003075366
	2006-11-14	2006-11-15	2006-11-16	2006-11-17	2006-11-20
1	0.005042735	0.0039623527	0.000571091	-0.0041334430	-0.0008046817
2	-0.001419646	0.0067483290	0.000088600	-0.0046228550	0.0012852990
3	0.001633393	-0.0003903638	-0.000064044	-0.0004950332	0.0008012360
	2006-11-21	2006-11-22	2006-11-23	2006-11-24	2006-11-27
1	0.0035024650	-0.001423817	-0.0022464847	-0.0074566157	-0.008884057
2	0.0015966630	0.000878824	-0.0028429630	-0.0120525280	-0.014190477
3	0.0000896774	-0.000362714	-0.0009733886	-0.0007358458	-0.001985401
	2006-11-28	2006-11-29	2006-11-30	2006-12-01	2006-12-04
1	-0.0005720070	0.010011509	-0.0028207700	-0.0045097767	0.008035140
2	-0.0064998550	0.012901202	-0.0086468080	-0.0064254010	0.006552949
3	0.0004750374	0.001182122	0.0003041304	-0.0001392252	0.003841311
	2006-12-05	2006-12-06	2006-12-07	2006-12-08	2006-12-11
1	0.0006129080	0.0012128337	0.0014398583	-0.001522731	0.006351745
2	-0.0028113180	0.0071741430	0.0074404230	-0.002915097	0.006216417
3	-0.0002819442	-0.0002501772	-0.0002762668	0.000266215	0.000853780
	2006-12-12	2006-12-13	2006-12-14	2006-12-15	2006-12-18
1	0.0003822723	0.0032613870	0.0085999303	0.004852111	0.0007975687
2	0.0077836640	0.0016147920	0.0100852550	0.001294973	0.0045667980
3	0.0014112930	-0.0003111648	0.0006167068	0.002972998	-0.0018724352
	2006-12-19	2006-12-20	2006-12-21	2006-12-22	2006-12-25
1	-0.007259400	0.0039956217	-0.001713398	-0.001404090	-0.0004215033
2	-0.006190852	0.0015531190	0.000940550	-0.005018284	0.0000000000
3	-0.001259345	0.0002665752	0.002535426	-0.002011179	-0.0000406544
	2006-12-26	2006-12-27	2006-12-28	2006-12-29	2007-01-01
1	0.0026915907	0.009066652	0.0004164007	-0.001200784	-9.820667e-06
2	0.0000000000	0.010333820	-0.0020348850	-0.001101976	0.000000e+00
3	0.0004873476	0.002159424	-0.0010962306	-0.000159042	-1.140000e-05
	2007-01-02	2007-01-03	2007-01-04	2007-01-05	2007-01-08
1	0.0016820867	0.004437189	-0.0002060123	-0.002683635	-0.001690284
2	0.0000000000	0.014834908	0.0001279650	-0.002580941	-0.004541686
3	0.0004246192	0.001879946	0.0006782908	-0.001927514	-0.001011925
	2007-01-09	2007-01-10	2007-01-11	2007-01-12	2007-01-15
1	0.0044103520	-0.0006727953	0.007703170	0.003633612	0.005514158
2	0.0032808260	-0.0014696800	0.013481449	0.005427095	0.008135203
3	0.0001659808	-0.0012173928	0.001547265	-0.000644134	0.001385737
	2007-01-16	2007-01-17	2007-01-18	2007-01-19	2007-01-22
1	-0.0002334037	-0.0009348027	0.0018915933	0.005485118	-0.001422125
2	-0.0023385150	0.0036009040	-0.0000306000	0.004899725	-0.004463529
3	0.0019940186	-0.0003052906	-0.0002513524	0.001870075	0.000533480
	2007-01-23	2007-01-24	2007-01-25	2007-01-26	2007-01-29
1	-0.0024795987	0.010078597	-0.005617319	0.0003207103	0.0026204570
2	-0.0005070610	0.006027774	-0.001732155	-0.0095941300	0.0074250550
3	-0.0000391516	0.001921089	-0.001420198	0.0000189644	0.0006710216
	2007-01-30	2007-01-31	2007-02-01	2007-02-02	2007-02-05	2007-02-06
1	0.002703326	-0.0017867277	0.005665622	0.005261889	1.452767e-05	0.0016964537
2	0.004207812	-0.0000649000	0.009002711	0.004474549	2.320000e-05	0.0009123640
3	0.002512929	0.0001274554	0.001459012	0.002004385	1.534503e-03	-0.0001289984
	2007-02-07	2007-02-08	2007-02-09	2007-02-12	2007-02-13
1	0.001088440	0.0005737837	-0.0000244410	-0.003474908	0.0039971257
2	0.002937055	-0.0060861880	0.0065013820	-0.004102506	-0.0023852220
3	-0.001036504	0.0008835816	-0.0007572888	-0.001322240	0.0002060698
	2007-02-14	2007-02-15	2007-02-16	2007-02-19	2007-02-20
1	0.003429143	0.0003378953	-0.0010496707	0.001237031	0.002481922
2	0.006839015	0.0014878740	0.0030582220	0.000894606	-0.003270610
3	0.001787848	0.0004864332	-0.0005123012	-0.002126213	0.001312554
	2007-02-21	2007-02-22	2007-02-23	2007-02-26	2007-02-27
1	-0.0003544737	0.004133925	-0.003687659	-0.0006432653	-0.028181932
2	-0.0099202370	0.003907766	-0.000693145	-0.0035030230	-0.035746244
3	0.0008831304	-0.000198173	0.000224204	0.0003787182	-0.003557775
	2007-02-28	2007-03-01	2007-03-02	2007-03-05	2007-03-06	2007-03-07
1	-0.006545974	-0.004681529	-0.007204097	-0.016965306	0.0131994283	0.001848668
2	-0.011946524	-0.001959064	0.002364749	-0.014462125	0.0116318040	0.011790712
3	-0.001276448	-0.000431924	-0.001183209	-0.000934475	0.0006908982	0.000013807
	2007-03-08	2007-03-09	2007-03-12	2007-03-13	2007-03-14	2007-03-15
1	0.01238125	0.0054817963	-0.0019347513	-0.011489525	-0.016127064	0.010698674
2	0.01192565	0.0003297920	-0.0031733170	-0.007777588	-0.028205491	0.014756610
3	0.00211282	0.0003630802	0.0003176836	-0.001476343	-0.003026142	0.003330527
	2007-03-16	2007-03-19	2007-03-20	2007-03-21	2007-03-22	2007-03-23
1	-0.0040807663	0.011023104	0.006051645	0.008317470	0.006371441	0.003735611
2	0.0011294700	0.014488916	0.004157206	0.007430739	0.013969587	0.001698465
3	-0.0004977044	0.001539199	0.001811865	0.002705776	0.002193543	-0.000066116
	2007-03-26	2007-03-27	2007-03-28	2007-03-29	2007-03-30
1	-0.0038874940	-0.0024984800	-0.007625166	0.008954610	0.0028986417
2	-0.0075840220	-0.0047538740	-0.009870398	0.011153959	0.0004970940
3	-0.0006071596	-0.0008749648	-0.001164874	0.001407261	-0.0002237774
	2007-04-02	2007-04-03	2007-04-04	2007-04-05	2007-04-06
1	-0.0013403807	0.008656804	0.0031424653	-0.0014326240	-0.0004997497
2	-0.0014515930	0.010335207	0.0013071590	0.0050004920	0.0000000000
3	0.0000685862	0.002129822	0.0004029036	-0.0001666756	-0.0000500682
	2007-04-09	2007-04-10	2007-04-11
1	0.006564997	0.0004269093	-0.001272110
2	0.000000000	0.0063294250	-0.001044170
3	0.000570819	-0.0000510372	-0.000411593
Clustering Vector
  SBI   SPI   SII   LMI   MPI   ALT LPP25 LPP40 LPP60
    3     2     3     3     1     1     3     3     1

Within cluster sum of squares by cluster:

[1] 0.003806242 0.000000000 0.005432037
 (between_SS / total_SS =  75.3 %)

Available components:

[1] "cluster"   "centers"   "totss"    "withinss"  "tot.withinss"
[6] "betweenss" "size"      "iter"     "ifault"  
>
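Output of the form above is produced by the base R kmeans() function. The originating call is not shown, so the following is only a hedged sketch: it assumes a numeric matrix returns with one row per index (SBI, SPI, …, LPP60) and one column per trading day:

# 'returns' is an assumed matrix of daily returns: rows = indices, cols = days
set.seed(1)
km <- kmeans(returns, centers = 3)   # partition the nine series into 3 clusters
km$centers                           # the "Cluster Means" block
km$cluster                           # the "Clustering Vector"
km$withinss                          # within-cluster sums of squares
km$betweenss / km$totss              # the between_SS / total_SS ratio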
++++++++++++++++++++++++++++++++++++++
Adjust Open, High, Low, Close Prices For Splits and Dividends

Description

Adjust all columns of an OHLC object for split and dividend.

Usage

adjustOHLC(x,
           adjust = c("split", "dividend"),
           use.Adjusted = FALSE,
           ratio = NULL,
           symbol.name = deparse(substitute(x)))

Arguments

x An OHLC object
adjust adjust by split, dividend, or both (default)
use.Adjusted use the ‘Adjusted’ column in Yahoo! data to adjust
ratio ratio to adjust with, bypassing internal calculations
symbol.name used if x is not named the same as the symbol adjusting

Details

This function calculates the adjusted Open, High, Low, and Close prices according to split and dividend information.

There are three methods available to calculate the new OHLC object prices.

By default, getSplits and getDividends are called to retrieve the respective information. These may dispatch to custom methods following the “.” methodology used by quantmod dispatch. See getSymbols for information related to extending quantmod. This information is passed to adjRatios from the TTR package, and the resulting ratio calculations are used to adjust the observed historical prices. This is the most precise way to adjust a series.

The second method works only on standard Yahoo! data containing an explicit Adjusted column.

A final method allows for one to pass a ratio into the function directly.

All methods proceed as follows:

  1. New columns are derived by taking the ratio of adjusted value to original Close, and multiplying by the difference of the respective column and the original Close.
  2. This is then added to the modified Close column to arrive at the remaining “adjusted” Open, High, and Low column values. If no adjustment is needed, the function returns the original data unaltered. (See the arithmetic sketch below.)
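Since Adjusted = ratio × Close, steps 1 and 2 together reduce to scaling each price column by the ratio. A standalone arithmetic sketch (using, for illustration, the 1990-01-04 AAPL row shown later in this section) makes this explicit:

op <- 38.25; cl <- 37.625            # illustrative Open and Close prices
ratio <- 1.143471 / cl               # adjusted close / original close
adj_cl <- cl * ratio                 # Adjusted = Cl * ratio
ratio * (op - cl) + adj_cl           # adjusted Open, per steps 1 and 2 ...
ratio * op                           # ... which equals ratio * Open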

Value

An object of the original class, with prices adjusted for splits and dividends.

Warning

Using use.Adjusted = TRUE will be less precise than the method that employs actual split and dividend information. This is due to loss of precision from Yahoo! using Adjusted columns of only two decimal places. The advantage is that this can be run offline, and for short series or those with few adjustments the loss of precision will be small.

The resulting precision loss will be from row observation to row observation, as the calculation will be exact for intraday values.

Author(s)

Jeffrey A. Ryan

References

Yahoo Finance http://finance.yahoo.com

See Also

getSymbols.yahoo getSplits getDividends

Examples

getSymbols("AAPL", from="1990-01-01",
                  src="yahoo")
head(AAPL)
head(AAPL.a <- adjustOHLC(AAPL))
head(AAPL.uA <- adjustOHLC(AAPL,  
        use.Adjusted=TRUE))
# intraday adjustments are precise across all
# methods
# an example with Open to Close (OpCl)
head(cbind(OpCl(AAPL),OpCl(AAPL.a),
OpCl(AAPL.uA)))
# Close to Close changes may lose precision
head(cbind(ClCl(AAPL),ClCl(AAPL.a),
        ClCl(AAPL.uA)))
## End

In the R domain:

>
> install.packages("quantmod")
Installing package into ‘C:/Users/Bert/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
> #  A CRAN mirror is selected.
also installing the dependencies ‘xts’, ‘TTR’
package ‘quantmod’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
        C:\Users\Bert\AppData\Local\Temp\Rtmp2jzxqk\downloaded_packages
> library(quantmod)
> ls("package:quantmod")
[1]	"Ad"	"add_axis"	"add_BBands"
[4]	"add_DEMA"	"add_EMA"	"add_EVWMA"
[7]	"add_GMMA"	"add_MACD"	"add_RSI"
[10]	"add_Series"	"add_SMA"	"add_SMI"
[13]	"add_TA"	"add_VMA"	"add_Vo"
[16]	"add_VWAP"	"add_WMA"	"addADX"
[19]	"addAroon"	"addAroonOsc"	"addATR"
[22]	"addBBands"	"addCCI"	"addChAD"
[25]	"addChVol"	"addCLV"	"addCMF"
[28]	"addCMO"	"addDEMA"	"addDPO"
[31]	"addEMA"	"addEMV"	"addEnvelope"
[34]	"addEVWMA"	"addExpiry"	"addKST"
[37]	"addLines"	"addMACD"	"addMFI"
[40]	"addMomentum"	"addOBV"	"addPoints"
[43]	"addROC"	"addRSI"	"addSAR"
[46]	"addShading"	"addSMA"	"addSMI"
[49]	"addTA"	"addTDI"	"addTRIX"
[52]	"addVo"	"addVolatility"	"addWMA"
[55]	"addWPR"	"addZigZag"	"addZLEMA"
[58]	"adjustOHLC"	"allReturns"	"annualReturn"
[61]	"as.quantmod.OHLC"	"attachSymbols"	"axTicksByTime2"
[64]	"axTicksByValue"	"barChart"	"buildData"
[67]	"buildModel"	"candleChart"	"chart_pars"
[70]	"chart_Series"	"chart_theme"	"chartSeries"
[73]	"chartShading"	"chartTA"	"chartTheme"
[76]	"Cl"	"ClCl"	"current.chob"
[79]	"dailyReturn"	"Delt"	"dropTA"
[82]	"findPeaks"	"findValleys"	"fittedModel"
[85]	"fittedModel<-"	"flushSymbols"	"futures.expiry"
[88]	"getDefaults"	"getDividends"	"getFin"
[91]	"getFinancials"	"getFX"	"getMetals"
[94]	"getModelData"	"getOptionChain"	"getPrice"
[97]	"getQuote"	"getSplits"	"getSymbolLookup"
[100]	"getSymbols"	"getSymbols.csv"	"getSymbols.FRED"
[103]	"getSymbols.google"	"getSymbols.mysql"	"getSymbols.MySQL"
[106]	"getSymbols.oanda"	"getSymbols.rda"	"getSymbols.RData"
[109]	"getSymbols.SQLite"	"getSymbols.yahoo"	"getSymbols.yahooj"
[112]	"has.Ad"	"has.Ask"	"has.Bid"
[115]	"has.Cl"	"has.Hi"	"has.HLC"
[118]	"has.Lo"	"has.OHLC"	"has.OHLCV"
[121]	"has.Op"	"has.Price"	"has.Qty"
[124]	"has.Trade"	"has.Vo"	"Hi"
[127]	"HiCl"	"HLC"	"importDefaults"
[130]	"is.BBO"	"is.HLC"	"is.OHLC"
[133]	"is.OHLCV"	"is.quantmod"	"is.quantmodResults"
[136]	"is.TBBO"	"Lag"	"lineChart"
[139]	"listTA"	"Lo"	"loadSymbolLookup"
[142]	"loadSymbols"	"LoCl"	"LoHi"
[145]	"matchChart"	"modelData"	"modelSignal"
[148]	"monthlyReturn"	"moveTA"	"new.replot"
[151]	"newTA"	"Next"	"oanda.currencies"
[154]	"OHLC"	"OHLCV"	"Op"
[157]	"OpCl"	"OpHi"	"OpLo"
[160]	"OpOp"	"options.expiry"	"peak"
[163]	"periodReturn"	"quantmodenv"	"quarterlyReturn"
[166]	"reChart"	"removeSymbols"	"saveChart"
[169]	"saveSymbolLookup"	"saveSymbols"	"seriesAccel"
[172]	"seriesDecel"	"seriesDecr"	"seriesHi"
[175]	"seriesIncr"	"seriesLo"	"setDefaults"
[178]	"setSymbolLookup"	"setTA"	"show"
[181]	"showSymbols"	"specifyModel"	"standardQuote"
[184]	"summary"	"swapTA"	"tradeModel"
[187]	"unsetDefaults"	"unsetTA"	"valley"
[190]	"viewFin"	"viewFinancials"	"Vo"
[193]	"weeklyReturn"	"yahooQF"	"yahooQuote.EOD"
[196]	"yearlyReturn"	"zoom_Chart"	"zoomChart"
[199]	"zooom"
> adjustOHLC
function (x, adjust = c("split", "dividend"), use.Adjusted = FALSE,
    ratio = NULL, symbol.name = deparse(substitute(x)))
{
    if (is.null(ratio)) {
        if (use.Adjusted) {
            if (!has.Ad(x))
                stop("no Adjusted column in 'x'")
            ratio <- Ad(x)/Cl(x)
        }
        else {
            div <- getDividends(symbol.name, from = "1900-01-01")
            splits <- getSplits(symbol.name, from = "1900-01-01")
            if (is.xts(splits) && is.xts(div) && nrow(splits) >
                0 && nrow(div) > 0)
                div <- div * 1/adjRatios(splits = merge(splits,
                  index(div)))[, 1]
            ratios <- adjRatios(splits, div, Cl(x))
            if (length(adjust) == 1 && adjust == "split") {
                ratio <- ratios[, 1]
            }
            else if (length(adjust) == 1 && adjust == "dividend") {
                ratio <- ratios[, 2]
            }
            else ratio <- ratios[, 1] * ratios[, 2]
        }
    }
    Adjusted <- Cl(x) * ratio
    structure(cbind((ratio * (Op(x) - Cl(x)) + Adjusted), (ratio *
        (Hi(x) - Cl(x)) + Adjusted), (ratio * (Lo(x) - Cl(x)) +
        Adjusted), Adjusted, if (has.Vo(x))
        Vo(x)
    else NULL, if (has.Ad(x))
        Ad(x)
    else NULL), .Dimnames = list(NULL, colnames(x)))
}
<environment: namespace:quantmod>
> ## Not run:
> getSymbols("AAPL", from="1990-01-01", src="yahoo")
    As of 0.4-0, ‘getSymbols’ uses env=parent.frame() and
 auto.assign=TRUE by default.

This behavior will be phased out in 0.5-0, when the call will default to use auto.assign=FALSE. getOption("getSymbols.env") and getOption("getSymbols.auto.assign") are now checked for alternate defaults.

This message is shown once per session and may be disabled by setting options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details.

[1] "AAPL"
Warning message:
In download.file(paste(yahoo.URL, "s=", Symbols.name, "&a=", from.m,  :
  downloaded length 452489 != reported length 200
> head(AAPL)
              AAPL.Open  AAPL.High  AAPL.Low  AAPL.Close  AAPL.Volume  AAPL.Adjusted
1990-01-02    35.25  37.50  35.00  37.250  45799600   1.132075
1990-01-03    38.00  38.00  37.50  37.500  51998800   1.139673
1990-01-04    38.25  38.75  37.25  37.625  55378400   1.143471
1990-01-05    37.75  38.25  37.00  37.75   30828000   1.147270
1990-01-08    37.50  38.00  37.00  38.000  25393200   1.154868
1990-01-09    38.00  38.00  37.00  37.625  21534800   1.143471
> head(AAPL.a <- adjustOHLC(AAPL))
            AAPL.Open  AAPL.High  AAPL.Low   AAPL.Close  AAPL.Volume  AAPL.Adjusted
1990-01-02  1.071292  1.139673  1.063694   1.132075   45799600   1.132075
1990-01-03  1.154868  1.154868  1.139673   1.139673   51998800   1.139673
1990-01-04  1.162466  1.177662  1.132075   1.143471   55378400   1.143471
1990-01-05  1.147270  1.162466  1.124477   1.147270   30828000   1.147270
1990-01-08  1.139673  1.154868  1.124477   1.154868   25393200   1.154868
1990-01-09  1.154868  1.154868  1.124477   1.143471   21534800   1.143471
> head(AAPL.uA <- adjustOHLC(AAPL,
+         use.Adjusted=TRUE))
            AAPL.Open  AAPL.High  AAPL.Low   AAPL.Close  AAPL.Volume  AAPL.Adjusted
1990-01-02  1.071292  1.139673   1.063695   1.132075    45799600    1.132075
1990-01-03  1.154869  1.154869   1.139673   1.139673    51998800    1.139673
1990-01-04  1.162466  1.177661   1.132074   1.143471    55378400    1.143471
1990-01-05  1.147270  1.162466   1.124477   1.147270    30828000    1.147270
1990-01-08  1.139672  1.154868   1.124477   1.154868    25393200    1.154868
1990-01-09  1.154868  1.154868   1.124476   1.143471    21534800    1.143471
>  
> # intraday adjustments are precise across all methods
> # an example with Open to Close (OpCl)
> head(cbind(OpCl(AAPL),OpCl(AAPL.a),
+            OpCl(AAPL.uA)))
             OpCl.AAPL     OpCl.AAPL.a   OpCl.AAPL.uA
1990-01-02   0.056737647   0.056737647   0.056737647
1990-01-03  -0.013157869  -0.013157869  -0.013157869
1990-01-04  -0.016339895  -0.016339895  -0.016339895
1990-01-05   0.000000000   0.000000000   0.000000000
1990-01-08   0.013333307   0.013333307   0.013333307
1990-01-09  -0.009868395  -0.009868395  -0.009868395
> # Close to Close changes may lose precision
> head(cbind(ClCl(AAPL),ClCl(AAPL.a),
+    ClCl(AAPL.uA)))
             ClCl.AAPL     ClCl.AAPL.a   ClCl.AAPL.uA
1990-01-02   NA            NA            NA
1990-01-03   0.006711382   0.006711382   0.006711569
1990-01-04   0.003333333   0.003333333   0.003332535
1990-01-05   0.003322259   0.003322259   0.003322340
1990-01-08   0.006622490   0.006622490   0.006622678
1990-01-09  -0.009868395  -0.009868395  -0.009868660
>  
chartSeries Create Financial Charts

Description

Charting tool to create standard financial charts, given a time-series-like object. Serves as the base function for future technical analysis additions. Possible chart styles include candles, matches (1-pixel candles), bars, and lines. Charts may have a white or black background.

reChart allows for dynamic changes to the chart without having to respecify the full chart parameters.

Usage

chartSeries(x,
   type = c("auto", "candlesticks", "matchsticks", "bars", "line"),
   subset = NULL,
   show.grid = TRUE, name = NULL,
   time.scale = NULL,
   log.scale = FALSE, TA = 'addVo()',
   TAsep = ';', line.type = "l",
   bar.type = "ohlc",
   theme = chartTheme("black"),
   layout = NA,
   major.ticks = 'auto', minor.ticks = TRUE,
   yrange = NULL,
   plot = TRUE,
   up.col, dn.col, color.vol = TRUE,
   multi.col = FALSE, ...)
reChart(type = c("auto", "candlesticks",
       "matchsticks", "bars", "line"),
       subset = NULL,
       show.grid = TRUE,
       name = NULL,
       time.scale = NULL,
       line.type = "l",
       bar.type = "ohlc",
       theme = chartTheme("black"),
       major.ticks = 'auto', minor.ticks = TRUE,
       yrange = NULL,
       up.col, dn.col, color.vol = TRUE,
       multi.col = FALSE, ...)

Arguments

x an OHLC object – see details
type style of chart to draw
subset xts style date subsetting argument
show.grid display price grid lines?
name name of chart
time.scale what is the timescale? automatically deduced (broken)
log.scale should the y-axis be log-scaled?
TA a vector of technical indicators and params, or character strings
TAsep TA delimiter for TA strings
line.type type of line in line chart
bar.type type of barchart - ohlc or hlc
theme a chart.theme object
layout if NULL bypass internal layout
major.ticks where should major ticks be drawn
minor.ticks should minor ticks be drawn?
yrange override y-scale
plot should plot be drawn
up.col up bar/candle color
dn.col down bar/candle color
color.vol color code volume?
multi.col 4 color candle pattern
... additional parameters

Details

Currently, chartSeries displays standard-style OHLC charts familiar in financial applications, or line charts when not passing OHLC data. It works with objects having explicit time-series properties.

Line charts are created with close data, or from single column time series.

The subset argument can be used to specify a particular area of the series to view. The underlying series is left intact to allow for TA functions to use the full data set. Additionally, it is possible to use syntax borrowed from the first and last functions, for example, “last 4 months.”

TA allows for the inclusion of a variety of chart overlays and technical indicators. A full list is available from addTA. The default TA argument is addVo() – which adds volume, if available, to the chart being drawn.

theme requires an object of class chart.theme, created by a call to chartTheme. This function can be used to modify the look of the resulting chart. See chart.theme for details. line.type and bar.type allow further fine tuning of chart styles to user tastes. multi.col implements a color coding scheme used in some charting applications, and follows the following rules:

• grey => Op[t] < Cl[t] and Op[t] < Cl[t-1]
• white => Op[t] < Cl[t] and Op[t] > Cl[t-1]
• red => Op[t] > Cl[t] and Op[t] < Cl[t-1]
• black => Op[t] > Cl[t] and Op[t] > Cl[t-1]

reChart takes any number of arguments from the original chart call and redraws the chart with the updated parameters. One item of note: if multiple color bars/candles are desired, it is necessary to respecify the theme argument. Additionally, it is not possible to change TA parameters at present. This must be done with addTA/dropTA/swapTA/moveTA commands.

Value

Returns a standard chart plus volume, if available, suitably scaled.

If plot=FALSE a chob object will be returned.

Note

Most details can be fine-tuned within the function, though the code does a reasonable job of scaling and labeling axes for the user. The current implementation maintains a record of actions carried out for any particular chart. This is used to recreate the original when adding new indicators. A list of applied TA actions is available with a call to listTA. This list can be assigned to a variable and used in new chart calls to recreate a set of technical indicators, as sketched below. It is also possible to force all future charts to use the same indicators by calling setTA.
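For instance, the following sketch (based on the paragraph above; the symbol YHOO and the chosen indicators are illustrative) records and reuses a set of TA calls:

getSymbols("YHOO")                             # fetch the data
chartSeries(YHOO, TA = c(addVo(), addMACD()))  # chart with volume and MACD
myTA <- listTA()                               # record the applied TA calls
chartSeries(YHOO, TA = myTA)                   # recreate the same indicators
setTA()                                        # make them the default from now on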

Additional motivation to add outlined candles to allow for scaling and advanced color coding is owed to Josh Ulrich, as are the base functions (from TTR) for the yet to be released technical analysis charting code.

Many improvements in the current version were the result of conversations with Gabor Grothendieck. Many thanks to him.

Author(s)

Jeffrey A. Ryan

References

Josh Ulrich - TTR package and multi.col coding

See Also

  getSymbols, addTA, setTA, chartTheme

Examples

## Not run:
getSymbols("YHOO")
chartSeries(YHOO)
chartSeries(YHOO, subset='last 4 months')
chartSeries(YHOO, subset='2007::2008-01')
chartSeries(YHOO,theme=chartTheme('white'))
chartSeries(YHOO,TA=NULL) #no volume
chartSeries(YHOO,TA=c(addVo(),addBBands())) #add volume and Bollinger Bands from TTR
addMACD() # add MACD indicator to current chart
setTA()
chartSeries(YHOO) # draws chart again, this time with all indicators present
## End(Not run)
Figure 4.20 Output for plot(hclust): a cluster dendrogram.

Figure 4.21 Output for chartSeries(YHOO, subset = 'last 4 months').

Figures 4.22–4.29 Further chartSeries chart outputs.

Review Questions for Section 4.5

  1. Define univariate, bivariate, and multivariate data analyses, giving an example of each.
     (a) How are these analyses carried out in the R environment?
     (b) Give examples of the R code segments for these analyses.
  2. (a) What is meant by regression analysis?
     (b) How is regression analysis used in data analysis?
  3. (a) How is regression analysis carried out in the R environment?
     (b) Provide examples of the R functions used for regression analysis.
  4. (a) Summarize the two uses of the ANOVA table in data analysis.
     (b) For each use, suggest an applicable R code segment.

Exercise for Section 4.5

Using the R code segment below,

  1. create a 50-vector x of 50 random numbers from the standard normal distribution,
  2. output x, and
  3. perform a univariate data analysis on x.
> x <- rnorm(50)             #  50 random numbers from the
>                            #  standard normal distribution
> x                          #  Output x
> install.packages("epibasix")
> library(epibasix)
> univar(x)                  #  Univariate data analysis on x

Appendix 1: Documentation for the Plot Function

plot {graphics}
R Documentation
Generic X-Y Plotting

Description

Generic function for plotting of R objects. For more details about the graphical parameter arguments, see par.

For simple scatter plots, plot.default will be used. However, there are plot methods for many R objects, including functions, data.frames, density objects, and so on. Use methods(plot) and the documentation for these.

Usage

plot(x, y, ...)

Arguments

x: the coordinates of points in the plot. Alternatively, a single plotting structure, function, or any R object with a plot method can be provided.
y: the y coordinates of points in the plot, optional if x is an appropriate structure.
...: arguments to be passed to methods, such as graphical parameters (see par). Many methods will accept the following arguments:

type

what type of plot should be drawn. Possible types are

  • “p” for points,
  • “l” for lines,
  • “b” for both,
  • “c” for the lines part alone of “b”,
  • “o” for both “overplotted,”
  • “h” for “histogram” like (or “high-density”) vertical lines,
  • “s” for stair steps,
  • “S” for other steps, see “Details” below,
  • “n” for no plotting.

All other types give a warning or an error; for example, type = “punkte” is treated as equivalent to type = “p” for S compatibility. Note that some methods, for example plot.factor, do not accept this argument.

main: an overall title for the plot: see title.

sub: a subtitle for the plot: see title.

xlab: a title for the x-axis: see title.

ylab: a title for the y-axis: see title.

asp: the y/x aspect ratio: see plot.window.

Details

The two step types differ in their x–y preference: going from (x1, y1) to (x2, y2) with x1 < x2, type = “s” moves first horizontally, then vertically, whereas type = “S” moves the other way around.
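A short sketch contrasting the two step types on a common toy data set:

x <- 1:5
y <- c(1, 3, 2, 5, 4)
op <- par(mfrow = c(1, 2))                   # two panels side by side
plot(x, y, type = "s", main = "type = 's'")  # horizontal first
plot(x, y, type = "S", main = "type = 'S'")  # vertical first
par(op)                                      # restore plotting parameters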

See Also

plot.default, plot.formula and other methods; points, lines, par.
For X-Y-Z plotting see contour, persp, and image.