4
Principal component analysis of multivariate time series

Principal component analysis (PCA) is a statistical technique used for explaining the variance–covariance matrix of a set of m‐dimensional variables through a few linear combinations of these variables. In this chapter, we will illustrate the method to show that a large m‐dimensional process can often be sufficiently explained by smaller k principal components and thus reduce a higher dimension problem to one with fewer dimensions.

4.1 Introduction

Because of the advances of computing technology, the dimension, m, used in data analysis has become larger and larger. PCA is a statistical method that converts a set of correlated variables into a set of uncorrelated variables through an orthogonal transformation. Hopefully, a small subset of the uncorrelated variables carries sufficient information of the original large set of correlated variables.

The PCA concept to achieve parsimony was first introduced by Karl Pearson (1901). However, it was Hotelling (1933) who developed the method for stochastic variables and introduced the term principal components. The technique can be applied to the original variables or to standardized variables, and hence to either the covariance matrix or the correlation matrix. The two resulting analyses may seem closely related, but in reality, they could be quite different. The goal of the method is to represent a large m‐dimensional process with a much smaller number, k, of principal components and hence reduce a higher dimension problem to one with fewer dimensions.

We will begin with the population PCA, discuss its properties and interpretations, and then extend it to a sample PCA. We will also discuss the large sample properties of the sample principal components.

4.2 Population PCA

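Given an m‐dimensional random vector Z = [Z₁, …, Z_m]′, let Γ be the covariance matrix,

$$\Gamma = \mathrm{Cov}(Z) = E[(Z - \mu)(Z - \mu)'] = [\gamma_{i,j}],$$

where μ = E(Z). We will choose a vector α = [α₁, …, α_m]′ such that Y₁ = α′Z has the maximum variance. Moreover, to obtain a unique solution, we also require that α′α = 1. That is, we will choose α such that

$$\max_{\alpha}\ \mathrm{Var}(\alpha' Z) = \max_{\alpha}\ \alpha'\Gamma\alpha \quad \text{subject to} \quad \alpha'\alpha = 1. \tag{4.1}$$

Putting them together, we obtain the solution using the method of the Lagrange multiplier. That is, let

$$V = \alpha'\Gamma\alpha - \lambda(\alpha'\alpha - 1), \tag{4.2}$$

where λ is a Lagrange multiplier, and we maximize V with the constraint. Thus,

$$\frac{\partial V}{\partial \alpha} = 2\Gamma\alpha - 2\lambda\alpha = 2(\Gamma - \lambda I)\alpha = 0.$$

Since α ≠ 0, we have

$$|\Gamma - \lambda I| = 0.$$

That is, λ is an eigenvalue and α is the corresponding eigenvector of Γ, that is

$$\Gamma\alpha = \lambda\alpha. \tag{4.3}$$

In fact, because

$$\mathrm{Var}(Y_1) = \alpha'\Gamma\alpha = \lambda\alpha'\alpha = \lambda,$$

this λ is actually the largest eigenvalue of Γ. We will call it λ₁ and its corresponding eigenvector α₁.

Because Γ is m × m, there will be m such eigenvalues, λ₁ ≥ λ₂ ≥ ⋯ ≥ λ_m ≥ 0, and we denote the corresponding normalized eigenvectors as α₁, α₂, …, α_m, where α_i′α_i = 1 and α_i′α_j = 0 for i ≠ j. Thus, we obtain the following linear combinations

$$Y_i = \alpha_i' Z = \alpha_{i,1}Z_1 + \alpha_{i,2}Z_2 + \cdots + \alpha_{i,m}Z_m, \quad i = 1, 2, \ldots, m, \tag{4.4}$$

such that

$$\mathrm{Var}(Y_i) = \alpha_i'\Gamma\alpha_i = \lambda_i, \qquad \mathrm{Cov}(Y_i, Y_j) = \alpha_i'\Gamma\alpha_j = 0, \quad i \neq j. \tag{4.5}$$

We will call Y₁ the first principal component, Y₂ the second principal component, and so on.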
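The eigenvalue characterization above can be checked numerically. The following is a minimal sketch in R, assuming a hypothetical 3 × 3 covariance matrix `Gamma` whose values are chosen only for illustration. It verifies that the leading eigenvector satisfies Γα₁ = λ₁α₁ with α₁′α₁ = 1, that the variance of Y₁ = α₁′Z equals λ₁, and that randomly drawn unit vectors never give a larger variance.

```r
# Hypothetical covariance matrix (illustrative values, not from the chapter)
Gamma <- matrix(c(4.0, 2.0, 0.5,
                  2.0, 3.0, 1.0,
                  0.5, 1.0, 2.0), nrow = 3, byrow = TRUE)

eig    <- eigen(Gamma, symmetric = TRUE)  # eigenvalues come back in decreasing order
lambda <- eig$values                      # lambda_1 >= lambda_2 >= lambda_3
P      <- eig$vectors                     # columns are the normalized eigenvectors alpha_i
alpha1 <- P[, 1]

# Residual of Gamma alpha_1 = lambda_1 alpha_1, and the unit-length constraint
resid_eig <- max(abs(Gamma %*% alpha1 - lambda[1] * alpha1))
norm_sq   <- sum(alpha1^2)

# Var(Y_1) = alpha_1' Gamma alpha_1, which should equal lambda_1
var_y1 <- drop(t(alpha1) %*% Gamma %*% alpha1)

# Variances of alpha' Z for 1000 random unit vectors alpha
set.seed(1)
rand_var <- replicate(1000, {
  a <- rnorm(3)
  a <- a / sqrt(sum(a^2))
  drop(t(a) %*% Gamma %*% a)
})

print(c(lambda1 = lambda[1], var_y1 = var_y1, max_random = max(rand_var)))
```

The random search is only a sanity check on the maximization, not a proof; the Lagrange multiplier argument is what establishes the result.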

4.3 Implications of PCA

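Let Λ = diag(λ₁, λ₂, …, λ_m) be the diagonal matrix of eigenvalues and P = [α₁, α₂, …, α_m] be the matrix formed from the corresponding normalized eigenvectors in the PCA. Since P′P = PP′ = I, Eq. (4.3) implies that

$$P'\Gamma P = \Lambda \tag{4.6}$$

and

$$\Gamma = P\Lambda P' = \sum_{i=1}^{m} \lambda_i\alpha_i\alpha_i'. \tag{4.7}$$

Since γ_i,i = Var(Z_i), i = 1, 2, …, m, we note that

$$\sum_{i=1}^{m}\mathrm{Var}(Z_i) = \mathrm{tr}(\Gamma) = \mathrm{tr}(P\Lambda P') = \mathrm{tr}(\Lambda P'P) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{m}\lambda_i = \sum_{i=1}^{m}\mathrm{Var}(Y_i). \tag{4.8}$$

The proportion of the total variance of Z explained by the ith principal component Y_i is given by

$$\frac{\lambda_i}{\lambda_1 + \lambda_2 + \cdots + \lambda_m}. \tag{4.9}$$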

In many applications, the sum of the variances of the first few principal components may account for more than 85 or 90% of the total variance. This implies that the study of a given high m‐dimensional process Z can be accomplished through the careful study of a set containing a small number of principal components without losing much information.

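Let e_i be the ith m‐dimensional unit vector with its ith element being 1 and 0 otherwise, for example, e₁ = [1, 0, …, 0]′ and e₂ = [0, 1, 0, …, 0]′. In many applications, we may want to find the relationship between a principal component Y_i and variable Z_j in Z. It is interesting to note that Z_j = e_j′Z and

$$\mathrm{Cov}(Y_i, Z_j) = \mathrm{Cov}(\alpha_i'Z, e_j'Z) = \alpha_i'\Gamma e_j = \lambda_i\alpha_i'e_j = \lambda_i\alpha_{i,j},$$

where α_i,j is the jth element of α_i. It follows that

$$\mathrm{Corr}(Y_i, Z_j) = \frac{\lambda_i\alpha_{i,j}}{\sqrt{\lambda_i}\sqrt{\gamma_{j,j}}} = \frac{\alpha_{i,j}\sqrt{\lambda_i}}{\sqrt{\gamma_{j,j}}}, \tag{4.10}$$

for i, j = 1, 2, …, m. Thus, one can use the relative magnitude of the correlations or the coefficients α_i,j associated with variable Z_j to interpret the principal components. These correlations and coefficients together with their positive or negative signs can also be used to measure the importance of the variables in Z to a given principal component. They sometimes lead to different rankings, but they often reach similar results.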
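The identities of this section can be illustrated with a short R sketch, again assuming a hypothetical covariance matrix `Gamma` (illustrative values only). It checks that the total variance tr(Γ) equals the sum of the eigenvalues, computes the proportion λ_i/(λ₁ + ⋯ + λ_m) explained by each component, and compares the correlation α_i,j√λ_i/√γ_j,j between Y_i and Z_j with a direct computation from Cov(Y, Z) = P′Γ.

```r
# Hypothetical covariance matrix (illustrative values, not from the chapter)
Gamma <- matrix(c(4.0, 2.0, 0.5,
                  2.0, 3.0, 1.0,
                  0.5, 1.0, 2.0), nrow = 3, byrow = TRUE)
eig    <- eigen(Gamma, symmetric = TRUE)
lambda <- eig$values
P      <- eig$vectors            # column i is alpha_i, so P[j, i] = alpha_{i,j}

# Total variance: sum of Var(Z_i) versus sum of eigenvalues
total_var_z <- sum(diag(Gamma))
total_var_y <- sum(lambda)

# Proportion of total variance explained by each component, and its running sum
prop     <- lambda / sum(lambda)
cum_prop <- cumsum(prop)

# Corr(Y_i, Z_j) from the closed form: entry [i, j] = alpha_{i,j} sqrt(lambda_i) / sqrt(gamma_{j,j})
corr_formula <- t(P) * outer(sqrt(lambda), 1 / sqrt(diag(Gamma)))

# The same correlations computed directly from Cov(Y, Z) = P' Gamma
corr_direct <- diag(1 / sqrt(lambda)) %*% t(P) %*% Gamma %*% diag(1 / sqrt(diag(Gamma)))

print(round(cum_prop, 3))
print(round(corr_formula, 3))
```

Rows of `corr_formula` index the principal components and columns index the original variables, so each row can be read as the interpretation profile of one component.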

In these discussions, we obtain principal components for a given process Z through its variance–covariance matrix Γ. Clearly, we can also construct these principal components through its correlation matrix,

$$\rho = D^{-1/2}\,\Gamma\,D^{-1/2}, \tag{4.11}$$

where D is the diagonal matrix in which the ith diagonal element is the variance of the ith process, that is

$$D = \mathrm{diag}(\gamma_{1,1}, \gamma_{2,2}, \ldots, \gamma_{m,m}). \tag{4.12}$$

In other words, we will construct the principal components for the standardized variables,

$$U = D^{-1/2}(Z - \mu), \tag{4.13}$$

and ρ is the covariance matrix of U because Cov(U) = D^−1/2ΓD^−1/2 = ρ. Hence, the procedure of constructing the principal components from these standardized variables is exactly the same as before. The first natural question to ask is, why do we use standardized variables? One obvious answer is that in many applications the unit used in variables may not be the same. For example, if values of one variable are mostly small numbers between 0 and 1 like percentages but values of one other variable are mostly numbers in the millions or billions like imports and exports, then to avoid the unnecessary impact of the different units used in these variables, we will naturally consider using standardized variables. The second question to ask is, are the two sets of principal components constructed from Γ and ρ the same? The unfortunate answer is decidedly no. In fact, they could be very different. The choice depends on applications and could be a challenge in applied research.
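To make the effect of units concrete, here is a small simulated sketch in R. The three series `rate`, `trade`, and `index` are hypothetical and are generated only to mimic a percentage-like variable, a variable measured in millions, and an unrelated index. With the covariance matrix, the first component is essentially the large-scale variable alone; with the correlation matrix, it becomes a balanced combination of the two related variables.

```r
set.seed(123)
n <- 500
common <- rnorm(n)   # shared source of variation for rate and trade

# Hypothetical data on very different scales (assumed for illustration)
Z <- cbind(
  rate  = 0.05 + 0.01 * (common + rnorm(n)),  # small numbers, like percentages
  trade = 5e6 + 1e6 * (common + rnorm(n)),    # large numbers, like imports
  index = 100 + 10 * rnorm(n)                 # unrelated series
)

pc_cov <- eigen(cov(Z), symmetric = TRUE)   # PCA from the covariance matrix
pc_cor <- eigen(cor(Z), symmetric = TRUE)   # PCA from the correlation matrix

# Absolute loadings of the first component under each choice
load_cov <- abs(pc_cov$vectors[, 1])
load_cor <- abs(pc_cor$vectors[, 1])
names(load_cov) <- names(load_cor) <- colnames(Z)

# Proportion of total variance explained by the first component
prop_cov <- pc_cov$values[1] / sum(pc_cov$values)
prop_cor <- pc_cor$values[1] / sum(pc_cor$values)

print(round(rbind(covariance = load_cov, correlation = load_cor), 3))
print(round(c(covariance = prop_cov, correlation = prop_cor), 3))
```

Absolute values are compared because the sign of an eigenvector is arbitrary.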

4.4 Sample principal components

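In practice, we may not know the parameter value Γ of Z. So, given an m‐dimensional stationary vector time series Z_t, we will simply replace the unknown population variance–covariance matrix, Γ, by the sample variance–covariance matrix, Γ̂, computed from Z_t, t = 1, 2, …, n,

$$\hat{\Gamma} = [\hat{\gamma}_{i,j}], \quad i, j = 1, 2, \ldots, m, \tag{4.14}$$

where

Z̄ = (1/n)(Z₁ + Z₂ + ⋯ + Z_n) is the sample mean vector,
γ̂_i,i is the sample variance of the Z_i,
γ̂_i,j is the sample covariance between Z_i and Z_j.

In exactly the same approach, we will now choose a vector α̂₁ such that Ŷ₁ = α̂₁′Z has the maximum sample variance, where we denote the vector and its resulting linear combination with a hat for distinction between the population and sample principal components. Thus, the first sample principal component will be obtained by

$$\max_{\hat{\alpha}_1}\ \hat{\alpha}_1'\hat{\Gamma}\hat{\alpha}_1 \quad \text{subject to} \quad \hat{\alpha}_1'\hat{\alpha}_1 = 1.$$

The second sample principal component will be obtained by

$$\max_{\hat{\alpha}_2}\ \hat{\alpha}_2'\hat{\Gamma}\hat{\alpha}_2 \quad \text{subject to} \quad \hat{\alpha}_2'\hat{\alpha}_2 = 1$$

and zero sample covariance between Ŷ₁ and Ŷ₂.

Continuing, the ith sample principal component will be obtained by

$$\max_{\hat{\alpha}_i}\ \hat{\alpha}_i'\hat{\Gamma}\hat{\alpha}_i \quad \text{subject to} \quad \hat{\alpha}_i'\hat{\alpha}_i = 1$$

and zero sample covariance between Ŷ_i and Ŷ_k for all k < i.

Let (λ̂_i, α̂_i), i = 1, 2, …, m, be the eigenvalue–eigenvector pairs of the sample variance–covariance matrix, Γ̂. The ith sample principal component is in fact given by

$$\hat{Y}_i = \hat{\alpha}_i' Z = \hat{\alpha}_{i,1}Z_1 + \hat{\alpha}_{i,2}Z_2 + \cdots + \hat{\alpha}_{i,m}Z_m, \quad i = 1, 2, \ldots, m,$$

where λ̂₁ ≥ λ̂₂ ≥ ⋯ ≥ λ̂_m ≥ 0. From the process and the results from Sections 4.2 and 4.3, we have

The sample variance of Ŷ_i is λ̂_i, and the sample covariance between Ŷ_i and Ŷ_j is zero for i ≠ j.
Sample covariance between Ŷ_i and Z_j is λ̂_i α̂_i,j.
Sample correlation between Ŷ_i and Z_j is α̂_i,j √λ̂_i / √γ̂_j,j.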

Similar to the population principal components, we can obtain the sample principal components through the sample variance–covariance matrix or the sample correlation matrix,

$$\hat{\rho} = \hat{D}^{-1/2}\,\hat{\Gamma}\,\hat{D}^{-1/2}, \tag{4.15}$$

where D̂ is the diagonal matrix in which the ith diagonal element is the sample variance of Z_{i, t}, that is

$$\hat{D} = \mathrm{diag}(\hat{\gamma}_{1,1}, \hat{\gamma}_{2,2}, \ldots, \hat{\gamma}_{m,m}). \tag{4.16}$$

In other words, we will construct the sample principal components for the standardized sample variables,

$$\hat{U}_t = \hat{D}^{-1/2}(Z_t - \bar{Z}), \tag{4.17}$$

and ρ̂ is the sample covariance matrix of Û_t. The procedure is exactly the same as before. For simplicity, we will use the same notations, Ŷ_i, λ̂_i, and α̂_i, for the sample principal components and their associated eigenvalues and eigenvectors irrespective of whether they are constructed from Γ̂ or ρ̂. Which matrix is used should be clear from the applications. Again, the two sets of sample principal components constructed from Γ̂ and ρ̂ will in general be different. The choice depends on applications.
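The sample procedure can be sketched in R as follows. The data matrix `Z` is simulated from two hypothetical common factors and is an assumption made only for illustration. The sketch obtains the sample principal components from `eigen(cov(Z))`, checks that their sample covariance matrix is diagonal with the eigenvalues on the diagonal, checks the sample correlation between a component and a variable against α̂_i,j√λ̂_i/√γ̂_j,j, and compares the result with `princomp()`. Note that `cov()` divides by n − 1 whereas `princomp()` divides by n, so the variances differ by the factor (n − 1)/n while the eigenvectors agree up to sign.

```r
set.seed(42)
n <- 200
m <- 4

# Hypothetical data: m correlated series driven by two common factors
fac  <- matrix(rnorm(n * 2), n, 2)
load <- matrix(c(1.0, 0.8, 0.6, 0.2,
                 0.0, 0.3, 0.7, 1.0), nrow = 2, byrow = TRUE)
Z <- fac %*% load + 0.5 * matrix(rnorm(n * m), n, m)
colnames(Z) <- paste0("Z", 1:m)

# Eigen decomposition of the sample covariance matrix (divisor n - 1)
Gamma_hat  <- cov(Z)
eig        <- eigen(Gamma_hat, symmetric = TRUE)
lambda_hat <- eig$values
alpha_hat  <- eig$vectors

# Sample principal component scores from the mean-adjusted data
Zc    <- scale(Z, center = TRUE, scale = FALSE)
Y_hat <- Zc %*% alpha_hat

# Scores are uncorrelated with sample variances lambda_hat
err_cov <- max(abs(cov(Y_hat) - diag(lambda_hat)))

# Sample correlation between Y_i and Z_j versus the closed form
corr_formula <- t(alpha_hat) * outer(sqrt(lambda_hat), 1 / sqrt(diag(Gamma_hat)))
err_corr     <- max(abs(cor(Y_hat, Z) - corr_formula))

# Comparison with princomp(), which uses divisor n
pc       <- princomp(Z)
err_pc   <- max(abs(pc$sdev^2 - lambda_hat * (n - 1) / n))

print(round(lambda_hat, 4))
print(c(err_cov = err_cov, err_corr = err_corr, err_pc = err_pc))
```

Replacing `cov(Z)` by `cor(Z)` gives the components for the standardized variables in the same way.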

Once the sample principal components are obtained, the next natural step is to investigate their properties through λ̂_i and α̂_i. Their large sample properties are summarized in the following theorem, and we refer readers to Anderson (1963) for its proof. We also refer readers to reference books by Johnson and Wichern (2002), and Rao (2002).

The result in (1) implies that, for large n, the values λ̂_i are independently and approximately normally distributed as N(λ_i, 2λ_i²/n). Hence, the 100(1 − α)% confidence interval for λ_i can be obtained as follows:

$$\frac{\hat{\lambda}_i}{1 + N_{\alpha/2}\sqrt{2/n}} \le \lambda_i \le \frac{\hat{\lambda}_i}{1 - N_{\alpha/2}\sqrt{2/n}}, \tag{4.19}$$

where N_α/2 is the upper 100(α/2)th percentile of the standard normal distribution.
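As a rough illustration of the large sample interval, the following R sketch computes λ̂_i/(1 ± N_α/2 √(2/n)), the interval implied by an approximate N(λ_i, 2λ_i²/n) distribution for λ̂_i. The helper `ci_eigen()` is a hypothetical name introduced here, and the example input is the largest eigenvalue reported in Table 4.1 with n = 106. The asymptotic result in Anderson (1963) is derived for independent normal observations, so for serially dependent data the interval should be read as an approximation only.

```r
# Hypothetical helper: large-sample confidence interval for an eigenvalue
ci_eigen <- function(lambda_hat, n, level = 0.95) {
  z    <- qnorm(1 - (1 - level) / 2)   # upper 100(alpha/2)th percentile of N(0, 1)
  half <- z * sqrt(2 / n)
  stopifnot(half < 1)                  # otherwise the upper limit is undefined
  cbind(lower    = lambda_hat / (1 + half),
        estimate = lambda_hat,
        upper    = lambda_hat / (1 - half))
}

# Largest eigenvalue from Table 4.1, with n = 106 observations
ci <- ci_eigen(0.00058, n = 106)
print(signif(ci, 3))
```

The interval is not symmetric around the estimate because the standard deviation of λ̂_i is proportional to λ_i itself.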

From these discussions, it is clear that PCA is a statistical method used to find k linear combinations of m original statistical variables through the analysis of the covariance or correlation matrix of these m variables with k being much less than m, so that the study of a given high m‐dimensional process can be accomplished through the study of a much smaller number of principal components without losing much information. Obviously, this makes sense only when the covariance and correlation structures are stable over time. Hence, a time series process has to be stationary. A nonstationary series needs to be reduced to a stationary one by using transformations such as a power transformation and differencing. Since the residual vector, a_t, is actually a function of Z_t, one can also analyze the covariance or correlation matrix of residuals after a VAR model fitting. Also, when using PCA, we normally work on a mean adjusted data set.
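The preprocessing described above can be sketched in R as follows, assuming three hypothetical nonstationary series that share a common random-walk trend. The series are differenced once, mean adjusted, and then passed to `prcomp()`; the choice of a single first difference is an assumption for this simulated example and would need to be justified for real data.

```r
set.seed(7)
n <- 300
trend <- cumsum(rnorm(n))   # common stochastic trend (random walk)

# Hypothetical nonstationary series (assumed for illustration)
Z <- cbind(z1 =  1.0 * trend + cumsum(rnorm(n, sd = 0.5)),
           z2 =  0.8 * trend + cumsum(rnorm(n, sd = 0.5)),
           z3 = -0.5 * trend + cumsum(rnorm(n, sd = 0.5)))

dZ <- diff(Z)                                  # first difference, column by column
dZ <- scale(dZ, center = TRUE, scale = FALSE)  # mean adjustment

pc   <- prcomp(dZ, center = FALSE)             # data are already centered
prop <- pc$sdev^2 / sum(pc$sdev^2)             # proportion of variance per component
print(round(prop, 3))
```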

4.5 Empirical examples

4.5.1 Daily stock returns from the first set of 10 stocks

Example 4.1

Let us consider the data set, listed as WW4a in the Data Appendix, of the daily stock returns from 10 different stocks, including CVX (Chevron), XOM (Exxon), AAPL (Apple), FB (Facebook), MSFT (Microsoft), MRK (Merck), PFE (Pfizer), BAC (Bank of America), JPM (JP Morgan), and WFC (Wells Fargo & Co.) that were traded on the New York Stock Exchange from August 2, 2016 to December 30, 2016. The data are plotted in Figure 4.1.

Figure 4.1 Daily stock returns for 10 stocks between August 2 and December 30, 2016.

In this example, the sample Z_t = [Z_1,t, Z_2,t, Z_3,t, Z_4,t, Z_5,t, Z_6,t, Z_7,t, Z_8,t, Z_9,t, Z_10,t]′, t = 1, 2, …, n, consists of the daily stock returns for Chevron (Z_1,t), Exxon (Z_2,t), Apple (Z_3,t), Facebook (Z_4,t), Microsoft (Z_5,t), Merck (Z_6,t), Pfizer (Z_7,t), Bank of America (Z_8,t), JP Morgan (Z_9,t), and Wells Fargo & Co. (Z_10,t) from August 2, 2016 to December 30, 2016, so that m = 10 and n = 106. We have the sample mean vector,

with sample covariance matrix,

	CVX	XOM	AAPL	FB	MSFT
CVX	1.087154e−04	6.983585e−05	8.576443e−06	2.841400e−05	2.349692e−05
XOM	6.983585e−05	1.063956e−04	1.159600e−05	2.475253e−05	2.188846e−05
AAPL	8.576443e−06	1.159600e−05	1.231327e−04	6.237891e−05	5.277169e−05
FB	2.841400e−05	2.475253e−05	6.237891e−05	1.487562e−04	6.697871e−05
MSFT	2.349692e−05	2.188846e−05	5.277169e−05	6.697871e−05	1.012891e−04
MRK	2.015158e−05	5.188890e−05	3.017592e−05	3.534955e−05	2.720992e−05
PFE	1.959854e−05	4.650749e−05	1.080636e−05	3.061439e−05	1.225974e−05
BAC	4.039745e−05	5.224031e−05	−1.339308e−05	−2.972687e−06	7.338406e−06
JPM	3.425218e−05	3.701334e−05	−1.168483e−05	6.079249e−07	6.054941e−06
WFC	1.889755e−05	2.852743e−05	−2.163606e−05	−1.736215e−05	−1.199474e−05
	MRK	PFE	BAC	JPM	WFC
CVX	2.015158e−05	1.959854e−05	4.039745e−05	3.425218e−05	1.889755e−05
XOM	5.188890e−05	4.650749e−05	5.224031e−05	3.701334e−05	2.852743e−05
AAPL	3.017592e−05	1.080636e−05	−1.339308e−05	−1.168483e−05	−2.163606e−05
FB	3.534955e−05	3.061439e−05	−2.972687e−06	6.079249e−07	−1.736215e−05
MSFT	2.720992e−05	1.225974e−05	7.338406e−06	6.054941e−06	−1.199474e−05
MRK	2.522852e−04	1.180967e−04	9.456417e−05	7.451309e−05	2.881565e−05
PFE	1.180967e−04	1.659857e−04	7.508658e−05	6.907373e−05	3.707669e−05
BAC	9.456417e−05	7.508658e−05	2.364619e−04	1.434413e−04	1.192228e−04
JPM	7.451309e−05	6.907373e−05	1.434413e−04	1.184657e−04	8.792855e−05
WFC	2.881565e−05	3.707669e−05	1.192228e−04	8.792855e−05	1.720248e−04

and the sample correlation matrix,

	CVX	XOM	AAPL	FB	MSFT
CVX	1.00000000	0.6493384	0.07412677	0.223434058	0.22391548
XOM	0.64933841	1.0000000	0.10131175	0.196752448	0.21084926
AAPL	0.07412677	0.1013118	1.00000000	0.460907181	0.47253396
FB	0.22343406	0.1967524	0.46090718	1.000000000	0.54565465
MSFT	0.22391548	0.2108493	0.47253396	0.545654650	1.00000000
MRK	0.12167962	0.3167136	0.17120954	0.182473725	0.17021586
PFE	0.14589582	0.3499658	0.07558875	0.194828564	0.09455067
BAC	0.25195788	0.3293542	−0.07848980	−0.015850063	0.04741763
JPM	0.30181896	0.3296857	−0.09674747	0.004579479	0.05527545
WFC	0.13818618	0.2108654	−0.14866064	−0.108535145	−0.09086861
	MRK	PFE	BAC	JPM	WFC
CVX	0.1216796	0.14589582	0.25195788	0.301818963	0.13818618
XOM	0.3167136	0.34996582	0.32935425	0.329685727	0.21086537
AAPL	0.1712095	0.0755887	−0.07848980	−0.096747468	−0.14866064
FB	0.1824737	0.19482856	−0.01585006	0.004579479	−0.10853514
MSFT	0.1702159	0.09455067	0.04741763	0.055275452	−0.09086861
MRK	1.0000000	0.57710707	0.38716859	0.431012985	0.13832064
PFE	0.5771071	1.00000000	0.37900626	0.492585061	0.21941693
BAC	0.3871686	0.37900626	1.00000000	0.857032493	0.59113031
JPM	0.4310130	0.49258506	0.85703249	1.000000000	0.61593958
WFC	0.1383206	0.21941693	0.59113031	0.615939576	1.00000000

4.5.1.1 The PCA based on the sample covariance matrix

The eigenvalues and eigenvectors, which are often known as variances and component loadings, of the sample variance–covariance matrix, are given in Table 4.1.

Table 4.1 Sample PCA results for the 10 stock returns based on the sample covariance matrix.

	Comp. 1	Comp. 2	Comp. 3	Comp. 4	Comp. 5	Comp. 6	Comp. 7	Comp. 8	Comp. 9	Comp. 10
CVX	−0.166	−0.104	−0.407	0.576	0.120				0.622	−0.227
XOM	−0.230	−0.126	−0.243	0.552		0.203			−0.686	0.215
AAPL		−0.431	−0.173	−0.354	0.180	0.377	−0.639	−0.264
FB		−0.513	−0.319	−0.231	−0.324	−0.213	0.505	−0.399
MSFT		−0.362	−0.286	−0.197	0.127	−0.116		0.841
MRK	−0.470	−0.329	0.596		0.396	0.223	0.320
PFE	−0.369	−0.167	0.316	0.126	−0.725	−0.118	−0.358	0.136		−0.165
BAC	−0.539	0.289	−0.191	−0.222	0.279	−0.438	−0.164	−0.155	−0.217	−0.416
JPM	−0.391	0.175				−0.206			0.265	0.827
WFC	−0.324	0.379	−0.251	−0.265	−0.266	0.678	0.255			−0.110
Variance	0.00058	0.00030	0.00018	0.00011	0.00009	0.00008	0.00007	0.00005	0.00003	0.00002
Percentage of Total Variance	0.384	0.199	0.119	0.073	0.060	0.053	0.046	0.033	0.020	0.013

Figure 4.2 shows a useful plot, which is often known as a scree plot or screeplot. It is a plot of eigenvalues versus i, that is, the magnitude of an eigenvalue versus its number. To determine the suitable number, k, of components, we look for an elbow in the plot, where the component preceding the vertex of the elbow is chosen to be cutoff point k. In our example, we will choose k to be 3 or 4.
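The scree plot and the cumulative proportion of variance can be reproduced from the eigenvalues alone. The sketch below uses the rounded values in the "Variance" row of Table 4.1, so the results agree with the table only up to rounding. The 75% threshold used to pick `k_thresh` is an assumption introduced here as a simple numerical companion to the visual elbow rule, not the rule used in the text.

```r
# Eigenvalues copied from the "Variance" row of Table 4.1 (rounded values)
lambda_hat <- c(0.00058, 0.00030, 0.00018, 0.00011, 0.00009,
                0.00008, 0.00007, 0.00005, 0.00003, 0.00002)

prop     <- lambda_hat / sum(lambda_hat)  # proportion explained by each component
cum_prop <- cumsum(prop)                  # cumulative proportion
print(round(cum_prop, 3))

# Scree plot: eigenvalue against component number
plot(seq_along(lambda_hat), lambda_hat, type = "b",
     xlab = "Component number i", ylab = "Eigenvalue", main = "Scree plot")

# Smallest k whose cumulative proportion reaches an assumed 75% threshold
k_thresh <- which(cum_prop >= 0.75)[1]
print(k_thresh)
```

With these values the threshold rule selects four components, in line with the range suggested by the elbow.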

Figure 4.2 Screeplot of the PCA from the sample covariance matrix.

The eigenvalues of the first four components account for

$$38.4\% + 19.9\% + 11.9\% + 7.3\% = 77.5\%$$

of the total sample variance.

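The four sample principal components are

$$\begin{aligned}
\hat{Y}_{1,t} &= -0.166Z_{1,t} - 0.230Z_{2,t} - 0.470Z_{6,t} - 0.369Z_{7,t} - 0.539Z_{8,t} - 0.391Z_{9,t} - 0.324Z_{10,t},\\
\hat{Y}_{2,t} &= -0.104Z_{1,t} - 0.126Z_{2,t} - 0.431Z_{3,t} - 0.513Z_{4,t} - 0.362Z_{5,t} - 0.329Z_{6,t} - 0.167Z_{7,t} + 0.289Z_{8,t} + 0.175Z_{9,t} + 0.379Z_{10,t},\\
\hat{Y}_{3,t} &= -0.407Z_{1,t} - 0.243Z_{2,t} - 0.173Z_{3,t} - 0.319Z_{4,t} - 0.286Z_{5,t} + 0.596Z_{6,t} + 0.316Z_{7,t} - 0.191Z_{8,t} - 0.251Z_{10,t},\\
\hat{Y}_{4,t} &= 0.576Z_{1,t} + 0.552Z_{2,t} - 0.354Z_{3,t} - 0.231Z_{4,t} - 0.197Z_{5,t} + 0.126Z_{7,t} - 0.222Z_{8,t} - 0.265Z_{10,t},
\end{aligned}$$

where the coefficients are the loadings in Table 4.1 and the blank entries of the table are omitted.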

Now, let us examine the four components more carefully. The first component represents the general market other than communication technology sector. The second component represents the contrast between financial and non‐financial sectors. The third component represents the contrast between health and non‐health sectors. The fourth component represents the contrast between oil and non‐oil industries. Thus, the PCA has provided us with four components that contain a vast amount of information for the 10 stock returns traded on the New York Stock Exchange.

4.5.1.2 The PCA based on the sample correlation matrix

Now let us try the PCA using the sample correlation matrix, which is based on the standardized variables.

The eigenvalues and eigenvectors, which are also known as variances and component loadings, of the sample correlation matrix are given in Table 4.2. The screeplot is shown in Figure 4.3.

Table 4.2 Sample PCA results for the 10 stock returns based on the sample correlation matrix.

	Comp. 1	Comp. 2	Comp. 3	Comp. 4	Comp. 5	Comp. 6	Comp. 7	Comp. 8	Comp. 9	Comp. 10
CVX	−0.287	−0.155	0.654	−0.136			0.287		0.578	−0.150
XOM	−0.354	−0.137	0.464	−0.320	0.187		−0.319	0.116	−0.598	0.146
AAPL		−0.491	−0.199	0.233	0.744	−0.192	0.163	−0.204
FB	−0.143	−0.506		0.171	−0.517	−0.416	0.355	0.308	−0.167
MSFT	−0.150	−0.490		0.341	−0.241	0.493	−0.539	−0.131	0.102
MRK	−0.346		−0.418	−0.405	0.164	0.318		0.602	0.207
PFE	−0.364		−0.354	−0.429	−0.198	−0.367	−0.217	−0.538	0.143	−0.166
BAC	−0.433	0.236		0.281		0.258	0.310	−0.121	−0.367	−0.602
JPM	−0.458	0.231		0.206		0.154	0.238	−0.210		0.747
WFC	−0.310	0.320		0.458	0.144	−0.457	−0.415	0.342	0.269
Variance	3.393	2.210	1.196	0.939	0.557	0.504	0.406	0.378	0.296	0.121
Cumulative % of Total Variance	33.93%	56.03%	67.99%	77.38%	82.95%	87.99%	92.05%	95.83%	98.79%	100%

Figure 4.3 Screeplot of the PCA from the sample correlation matrix.

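The screeplot again indicates k = 4. The eigenvalues of the first four components account for

$$\frac{3.393 + 2.210 + 1.196 + 0.939}{10} = 77.38\%$$

of the total sample variance, which is almost the same as the one obtained using the covariance matrix. The four sample principal components are now

$$\begin{aligned}
\hat{Y}_{1,t} &= -0.287U_{1,t} - 0.354U_{2,t} - 0.143U_{4,t} - 0.150U_{5,t} - 0.346U_{6,t} - 0.364U_{7,t} - 0.433U_{8,t} - 0.458U_{9,t} - 0.310U_{10,t},\\
\hat{Y}_{2,t} &= -0.155U_{1,t} - 0.137U_{2,t} - 0.491U_{3,t} - 0.506U_{4,t} - 0.490U_{5,t} + 0.236U_{8,t} + 0.231U_{9,t} + 0.320U_{10,t},\\
\hat{Y}_{3,t} &= 0.654U_{1,t} + 0.464U_{2,t} - 0.199U_{3,t} - 0.418U_{6,t} - 0.354U_{7,t},\\
\hat{Y}_{4,t} &= -0.136U_{1,t} - 0.320U_{2,t} + 0.233U_{3,t} + 0.171U_{4,t} + 0.341U_{5,t} - 0.405U_{6,t} - 0.429U_{7,t} + 0.281U_{8,t} + 0.206U_{9,t} + 0.458U_{10,t},
\end{aligned}$$

where U_i,t denotes the standardized return of the ith stock, the coefficients are the loadings in Table 4.2, and the blank entries of the table are omitted.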

Now, let us examine the four components. The first component now represents the general stock market. The second component represents the contrast mainly between financial and non‐health related sectors. The third component represents the contrast between oil and health sectors. The fourth component now represents the contrast between financial/technology and non‐financial/non‐technology industry. For this data set, the PCA results from the covariance matrix and the correlation matrix are very much equivalent.

4.5.2 Monthly Consumer Price Index (CPI) from five sectors

Example 4.2

Let us consider the data set, listed as WW4b in the Data Appendix, of the monthly CPI for five sectors (energy, apparel, commodities, housing, and gas) in Greater New York City from January 1986 to December 2014, where the January 1984 index is the reference point at 100 for all sectors. The data are plotted in Figure 4.4.

Figure 4.4 Consumer price index for five industry sectors for the Greater New York City Area between January 1986 and December 2014.

The plot shows that these series are seasonal and nonstationary. So, we remove seasonality using the method of Cleveland et al. (1990) and trend phenomenon with differencing. With the notations introduced before, we have m = 5 and n = 348. Let Z_1,t, Z_2,t, Z_3,t, Z_4,t, Z_5,t be the monthly price index for energy, apparel, commodities, housing, and gas, respectively. Then, we have the sample mean vector,
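The two preprocessing steps can be sketched in R with `stl()`, which implements the seasonal-trend decomposition based on loess of Cleveland et al. (1990). The monthly series `x` below is simulated and the setting `s.window = "periodic"` is an assumption; the chapter does not state which settings were used for the CPI data. Differencing a series of length 348 leaves 347 observations.

```r
set.seed(2014)
n     <- 348
t_idx <- 1:n

# Hypothetical monthly series with trend and seasonality, starting January 1986
x <- ts(100 + 0.3 * t_idx + 5 * sin(2 * pi * t_idx / 12) + rnorm(n),
        start = c(1986, 1), frequency = 12)

# Seasonal-trend decomposition based on loess
fit <- stl(x, s.window = "periodic")

# Remove the estimated seasonal component, then difference to remove the trend
x_sa <- x - fit$time.series[, "seasonal"]
dx   <- diff(x_sa)

print(length(dx))
```

For a multivariate data set the same two steps would be applied to each column before computing the covariance or correlation matrix.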

the sample covariance matrix,

and the sample correlation matrix,

Note the large variances of energy, housing, and gas, the small variances of apparel and commodities, and the large correlation among energy, commodities, housing, and gas.

4.5.2.1 The PCA based on the sample covariance matrix

The eigenvalues and eigenvectors, which are often known as variances and component loadings of the sample variance–covariance matrix, are given in Table 4.3.

Table 4.3 Sample PCA results for the Greater New York City CPI based on the sample covariance matrix.

	Comp. 1	Comp. 2	Comp. 3	Comp. 4	Comp. 5
Energy	0.516	0.081	−0.301	0.782	−0.158
Apparel	0.003	−0.057	0.859	0.398	0.318
Commodities	0.227	−0.331	0.354	−0.148	−0.832
Housing	0.461	−0.755	−0.117	−0.187	0.411
Gas	0.686	0.557	0.184	−0.415	0.118
Variance	11068.83	379.65	89.84	18.11	2.70
Proportion of total variance explained by ith component	0.9576	0.0328	0.0078	0.0016	0.0002

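The two sample principal components are

$$\begin{aligned}
\hat{Y}_{1,t} &= 0.516Z_{1,t} + 0.003Z_{2,t} + 0.227Z_{3,t} + 0.461Z_{4,t} + 0.686Z_{5,t},\\
\hat{Y}_{2,t} &= 0.081Z_{1,t} - 0.057Z_{2,t} - 0.331Z_{3,t} - 0.755Z_{4,t} + 0.557Z_{5,t},
\end{aligned}$$

where the coefficients are the loadings in Table 4.3.

The first component explains 95.76% of the total sample variance, and the first two explain 99.04%. Thus, sample variation is very much summarized by the first principal component or the first two principal components. Figure 4.5 shows the useful screeplot where the vertex of the elbow can be easily seen to be k = 1.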

Figure 4.5 Screeplot of the PCA from the sample covariance matrix.

Now, let us examine component 1 more carefully. In this component, the loadings are all positive. The component can be regarded as the CPI growth component that grew over the time period that we observed. The five variables are combined into a composite score, which is plotted in Figure 4.6, and it follows a combination of patterns observed mainly for gasoline and energy in Figure 4.4.

Figure 4.6 Time series plot of principal component 1.

Thus, the PCA has provided us with a single component that contains the vast majority of information for the five individual variables. From this, we can conclude that gasoline and energy were the true drivers of the overall economy for the Greater New York City area during the period between 1986 and 2014.

4.5.2.2 The PCA based on the sample correlation matrix

Now let us try the PCA using the sample correlation matrix. The eigenvalues and eigenvectors, which are also known as variances and component loadings, of the sample correlation matrix are given in Table 4.4.

Table 4.4 Sample PCA results for the Greater New York City CPI based on the sample correlation matrix.

	Comp. 1	Comp. 2	Comp. 3	Comp. 4	Comp. 5
Energy	0.503	0.100	0.322	0.676	−0.420
Apparel	0.044	−0.987	0.100	0.103	0.060
Commodities	0.501	−0.107	−0.417	−0.505	−0.556
Housing	0.499	0.032	−0.550	0.267	0.614
Gas	0.495	0.061	0.641	−0.455	0.366
Variance	3.837	1.018	0.135	0.008	0.002
Proportion of total variance explained by ith component	0.7674	0.2036	0.027	0.0016	0.0004

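The two sample principal components are

$$\begin{aligned}
\hat{Y}_{1,t} &= 0.503U_{1,t} + 0.044U_{2,t} + 0.501U_{3,t} + 0.499U_{4,t} + 0.495U_{5,t},\\
\hat{Y}_{2,t} &= 0.100U_{1,t} - 0.987U_{2,t} - 0.107U_{3,t} + 0.032U_{4,t} + 0.061U_{5,t},
\end{aligned}$$

where U_i,t denotes the standardized ith variable and the coefficients are the loadings in Table 4.4.

The first component explains 76.74% of the total sample variance, and the first two explain 97.1%. Thus, sample variation of the five industries is primarily summarized by the first two principal components. Figure 4.7 shows the screeplot, which clearly indicates k = 2.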

Figure 4.7 Screeplot of the PCA from the sample correlation matrix.

From Table 4.4, we see that the loadings in component 1 are all positive and almost equal for energy, commodities, housing, and gas, which have strong positive correlations among them. It represents the CPI growth over the time period that we observed. The loadings in component 2 are relatively small positive numbers for energy, housing, and gas, and negative for apparel and commodities. It represents the market contrast between consumer goods and utility housing. Since the loading for apparel is especially dominating, component 2 can also be simply regarded as representing the apparel sector.

The five variables are combined into two composite scores, which are plotted in Figure 4.8.

Figure 4.8 Time series plot of principal components 1 and 2.

The plots of component 1 based on sample covariance matrix and sample correlation matrix as shown in Figures 4.6 and 4.8 are almost the same. However, the proportion of total variance explained by component 1 is 96% when the original variables are used and 77% when the standardized variables are used. As shown in the example, the PCA results from the covariance matrix and the correlation matrix could be different.

For further information on PCA and applications, we refer readers to Joyeux (1992), Cubadda (1995), Ait‐Sahalia and Xiu (2018), Estrada and Perron (2017), Jandarov et al. (2017), Passemier et al. (2017), Sang et al. (2017), and Zhu et al. (2017), among others.

Software code

R code for Example 4.1

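library(ggplot2)
setwd("C:/Bookdata/")
##Import Data: daily stock returns from ten different stocks from C:/Bookdata/WW4a.csv
d10 <- as.data.frame(read.csv("WW4a.csv")[,-1])
##Mean-adjust each column; the result is a numeric matrix
d10 <- t(t(d10) - colMeans(d10))
##Label rows by date; nrow(d10) avoids a hard-coded length that may not match the file.
##Note that by="day" steps through calendar days, not trading days.
rownames(d10) <- as.character(seq(as.Date("2016/8/2"), by="day", length.out=nrow(d10)))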

cov(d10)
cor(d10)

##Plot Time Series Data
plot(seq(as.Date("2016/8/2"), by="day", length.out=nrow(d10)),d10[,1],type='l',ylab="Stock Returns",xlab="Day",ylim=range(d10))
for(i in 2:10){
  lines(seq(as.Date("2016/8/2"), by="day", length.out=nrow(d10)),d10[,i],type='l',col=i)
}
legend("topleft",legend=colnames(d10),col=1:10,lty=1)

##Principal Component Analysis
##PCA based on the sample covariance matrix
pca <- princomp(d10,cor=FALSE)
lds <- pca$loadings
screeplot(pca,type="lines",main="screeplot")
##PCA based on the sample correlation matrix
pca <- princomp(d10,cor=TRUE)
lds <- pca$loadings
scs <- pca$scores
screeplot(pca,type="lines",main="screeplot")
##Plot: Sectors by Their First 2 Loadings
C <- as.data.frame(cbind(lds[,1],lds[,2]))
ggplot(C,aes(C[,1],C[,2],label=rownames(C))) +
    geom_point(size=4,col=3) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Loading 1") +
    ylab("Loading 2")

##Plot: Time Points by Their Scores on First 2 Components
C <- as.data.frame(cbind(scs[,1],scs[,2]))
palette(rainbow(400))
ggplot(C,aes(C[,1],C[,2],label=substring(rownames(C),1,7))) +
    geom_point(size=4,col=1:nrow(C)) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Scoring 1") +
    ylab("Scoring 2")
palette("default")

##Plot: Time Points by Their Scores on First 2 Components
C <- as.data.frame(cbind(scs[,1],scs[,2]))
palette(rainbow(400))
ggplot(C,aes(C[,1],C[,2],label=substring(rownames(C),1,4))) +
    geom_path() +
    geom_point(size=4,col=1:nrow(C)) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Scoring 1") +
    ylab("Scoring 2")
palette("default")
plot(seq(as.Date("2016/8/2"), by="day", length.out=nrow(C)),C[,1],type="l",ylab="Scoring 1",xlab="Date")
plot(seq(as.Date("2016/8/2"), by="day", length.out=nrow(C)),C[,2],type="l",ylab="Scoring 2",xlab="Date")
ag <- aggregate(C[,1:2],list(substr(rownames(C),1,4)),mean)
C <- as.data.frame(ag)
ggplot(C,aes(C[,2],C[,3],label=C[,1])) +
    geom_point(size=4,col=3) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Scoring 1") +
    ylab("Scoring 2")
q()

R code for Example 4.2

library(ggplot2)

##Import Data: monthly Consumer Price Index (CPI) from five different sectors from C:/Bookdata/WW4b.csv
setwd("C:/Bookdata/")
d5 <- as.data.frame(read.csv("WW4b.csv")[,-1])
rownames(d5) <- as.character(seq(as.Date("1986/1/1"), by="month", length.out=nrow(d5)))

##Covariance and Correlation Matrices
d5 <- t(t(d5) - colMeans(d5))
cov(d5)
cor(d5)

##Plot Time Series Data
plot(seq(as.Date("1986/1/1"), by="month", length.out=nrow(d5)),d5[,1],type='l',ylab="Consumer Price Index (Jan84=100)",xlab="Date",ylim=c(-90,160))
for(i in 2:5){
        lines(seq(as.Date("1986/1/1"), by="month", length.out=nrow(d5)),d5[,i],type='l',lty=i)
}

legend("topleft",legend=colnames(d5),lty=1:5)
##Principal Component Analysis
##PCA based on the sample correlation matrix
pca <- princomp(d5,cor=TRUE)
lds <- pca$loadings
scs <- pca$scores
screeplot(pca,type="lines",main="screeplot")

##Plot: Sectors by Their First 2 Loadings
C <- as.data.frame(cbind(lds[,1],lds[,2]))
ggplot(C,aes(C[,1],C[,2],label=rownames(C))) +
    geom_point(size=4,col=3) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Loading 1") +
    ylab("Loading 2")

##Plot: Time Points by Their Scores on First 2 Components
C <- as.data.frame(cbind(scs[,1],scs[,2]))
palette(rainbow(400))
ggplot(C,aes(C[,1],C[,2],label=substring(rownames(C),1,7))) +
    geom_point(size=4,col=1:nrow(C)) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Scoring 1") +
    ylab("Scoring 2")
palette("default")

##Plot: Time Points by Their Scores on First 2 Components
C <- as.data.frame(cbind(scs[,1],scs[,2]))
palette(rainbow(400))
ggplot(C,aes(C[,1],C[,2],label=substring(rownames(C),1,4))) +
    geom_path() +
    geom_point(size=4,col=1:nrow(C)) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Scoring 1") +
    ylab("Scoring 2")
palette("default")

plot(seq(as.Date("1986/1/1"), by="month", length.out=nrow(C)),C[,1],type="l",ylab="Scoring 1",xlab="Date")
plot(seq(as.Date("1986/1/1"), by="month", length.out=nrow(C)),C[,2],type="l",ylab="Scoring 2",xlab="Date")
ag <- aggregate(C[,1:2],list(substr(rownames(C),1,4)),mean)
C <- as.data.frame(ag)
ggplot(C,aes(C[,2],C[,3],label=C[,1])) +
    geom_point(size=4,col=3) +
    geom_text(vjust=0,hjust=0,angle = 10,size=5) +
    xlab("Scoring 1") +
    ylab("Scoring 2")
q()

Projects

1. Find a multivariate analysis book and carefully read its chapter on PCA.
2. Find an m‐dimensional social science related time series data set with m ≥ 6. Construct your principal component model based on its sample covariance matrix and evaluate your findings with a written report and analysis software code.
3. For the data set in Project 2, construct your principal component model based on its sample correlation matrix, and compare your result with that in Project 2. Write a report on your findings with associated software code.
4. Find an m‐dimensional natural science related data set with m ≥ 6. Construct your principal component model based on its sample covariance and correlation matrices separately, and evaluate your findings with a written report and analysis software code.
5. Find an m‐dimensional time series data set of your interest with m ≥ 10. Construct your principal component model based on its sample covariance and correlation matrices separately, and evaluate your findings with a written report and analysis software code.

References

Ait‐Sahalia, Y. and Xiu, D. (2018). Principal component analysis of high frequency data. Journal of American Statistical Association 11–14. https://doi.org/10.1080/01621459.2017.1401542.
Anderson, T.W. (1963). Asymptotic theory for principal components analysis. Annals of Mathematical Statistics 34: 122–148.
Cleveland, R.B., Cleveland, W.S., McRae, J.E., and Terpenning, I. (1990). A seasonal‐trend decomposition procedure based on loess (with discussion). Journal of Official Statistics 6: 3–73.
Cubadda, G. (1995). A note on testing for seasonal cointegration using principal components in the frequency domain. Journal of Time Series Analysis 16: 499–508.
Estrada, F. and Perron, P. (2017). Extracting and analyzing the warming trend in global and hemispheric temperatures. Journal of Time Series Analysis 38: 711–732.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 24: 417–441, 498–520.
Jandarov, R.A., Sheppard, L.A., Sampson, P.D., and Szpiro, A.A. (2017). A novel principal component analysis for spatially misaligned multivariate air pollution data. Journal of Statistical Society, Series C 66: 3–28.
Johnson, R.A. and Wichern, D.W. (2002). Applied Multivariate Statistical Analysis, 5e. Prentice Hall.
Joyeux, R. (1992). Testing for seasonal cointegration using principal components. Journal of Time Series Analysis 13: 109–118.
Passemier, D., Li, Z., and Yao, J. (2017). On estimation of the noise variance in high dimensional probabilistic principal component analysis. Journal of Royal Statistical Society, Series B 79: 51–67.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 6th Series 2: 559–572.
Rao, C.R. (2002). Linear Statistical Inference and Its Applications, 2e. Wiley.
Sang, T., Wang, L., and Cao, J. (2017). Parametric functional principal component analysis. Biometrics 73: 802–810.
Zhu, H., Shen, D., Peng, X., and Liu, L.Y. (2017). MWPCR: multiscale weighted principal component regression for high‐dimensional prediction. Journal of American Statistical Association 112: 1009–1021.
