The most critical feature of the conditional return distribution is arguably its second moment structure, which is empirically the dominant time-varying characteristic of the distribution. This fact has spurred an enormous literature on the modeling and forecasting of return volatility.
Andersen et al. (2003)
“Some concepts are easy to understand but hard to define. This also holds true for volatility.” This could be a quote from someone living before Markowitz, because the way he models volatility is very clear and intuitive. Markowitz proposed his celebrated portfolio theory, in which volatility is defined as the standard deviation, and from then on finance became more intertwined with mathematics.
Volatility is the backbone of finance in the sense that it not only provides an information signal to investors but also serves as an input to various financial models. What makes volatility so important? The answer lies in the importance of uncertainty, which is the main characteristic of financial models.
There is a long tradition in finance of predicting volatility with ARCH- and GARCH-type models, which have certain drawbacks in dealing with features such as volatility clustering and information asymmetry. Even though these issues have been addressed via different models, machine learning models have not been used extensively in the literature.
In this chapter, our aim is to show how we can enhance predictive performance using machine learning-based models. We will visit various machine learning algorithms, namely support vector regression, neural networks, and deep learning, so that we are able to compare their predictive performance.
Modeling volatility amounts to modeling uncertainty, so that we can better understand and approach uncertainty and obtain a good enough approximation of the real world. To gauge the extent to which a proposed model accounts for the real situation, we need to calculate the return volatility, which is also known as “realized volatility”. Realized volatility is the square root of realized variance, which is the sum of squared returns. Realized volatility is used to evaluate the performance of a volatility prediction method. Here is the formula for return volatility:

$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(r_i - \mu)^2}$$

where $r$ and $\mu$ are the return and the mean of returns, and $n$ is the number of observations.
Let’s see how return volatility is computed in Python:
In [1]:
import numpy as np
from scipy.stats import norm
import scipy.optimize as opt
import yfinance as yf
import pandas as pd
import datetime
import time
from arch import arch_model
import matplotlib.pyplot as plt
from numba import jit
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

In [2]:
stocks = '^GSPC'
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2020, 3, 15)
s_p500 = yf.download(stocks, start=start, end=end, interval='1d')
[*********************100%***********************]  1 of 1 downloaded

In [3]:
ret = 100 * (s_p500.pct_change()[1:]['Adj Close'])
realized_vol = ret.rolling(5).std()

In [4]:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol.index, realized_vol)
plt.title('Realized Volatility- S&P-500')
plt.ylabel('Volatility')
plt.xlabel('Date')
plt.savefig('images/realized_vol.png')
plt.show()
Figure 4-1 shows the realized volatility of the S&P-500 over the period 2010-2020. The most striking feature is the spikes around turbulent periods, above all the onset of the Covid-19 pandemic in early 2020.
The way volatility is estimated has an undeniable impact on the reliability and accuracy of the related analysis. So, this chapter deals with both classical and ML-based volatility prediction techniques, with a view to showing the superior prediction performance of the ML-based models. In order to have a benchmark for the brand new ML-based models, we start by modeling the classical volatility models. Some very well-known classical volatility models include, but are not limited to:
ARCH
GARCH
GJR-GARCH
EGARCH
It is time to dig into the classical volatility models. Let’s start with the ARCH model.
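One of the early attempts to model volatility was proposed by Engle (1982) and is known as the ARCH model. ARCH is a univariate model based on historical asset returns. The ARCH(p) model has the following form:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k r_{t-k}^2$$

where the mean model is

$$r_t = \sigma_t \epsilon_t$$

where $\epsilon_t$ is assumed to be normally distributed. For the variance to be strictly positive, the parameters must satisfy $\omega > 0$ and $\alpha_k \geq 0$.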
All these equations tell us that ARCH is a univariate and non-linear model in which volatility is estimated from the squares of past returns. One of the most distinctive features of ARCH is its time-varying conditional variance¹, which allows ARCH to model the phenomenon known as volatility clustering: large changes tend to be followed by large changes of either sign, and small changes tend to be followed by small changes, as put by Benoit Mandelbrot (1963). Hence, when an important announcement arrives in the market, it results in a burst of high volatility.
The following code shows how to plot clustering and what it looks like:
In [5]:
# ret is already in percent, so retv carries an extra factor of 100
retv = 100 * ret.values
date = pd.bdate_range(start='1/1/2010', end='3/15/2020')

In [6]:
plt.figure(figsize=(10, 6))
plt.plot(s_p500.index[1:], ret)
plt.title('Volatility clustering of S&P-500')
plt.ylabel('Daily returns')
plt.xlabel('Date')
plt.savefig('images/vol_clustering.png')
plt.show()
Similar to the spikes in realized volatility, Figure 4-2 suggests some large movements and, unsurprisingly, these ups and downs happen around important events such as the onset of the Covid-19 pandemic in early 2020.
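To see volatility clustering in a controlled setting, the following minimal sketch simulates an ARCH(1) process and compares the lag-1 autocorrelation of returns with that of squared returns. The parameter values (omega=0.5, alpha=0.3), the seed, and the helper names `simulate_arch1` and `lag1_autocorr` are illustrative assumptions and are not part of the S&P-500 analysis in this chapter.

```python
import numpy as np


def simulate_arch1(omega, alpha, n_obs, seed=0):
    """Simulate r_t = sigma_t * eps_t with sigma_t^2 = omega + alpha * r_{t-1}^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_obs)
    r = np.zeros(n_obs)
    sigma2 = np.zeros(n_obs)
    sigma2[0] = omega / (1.0 - alpha)  # start at the unconditional variance
    r[0] = np.sqrt(sigma2[0]) * eps[0]
    for t in range(1, n_obs):
        sigma2[t] = omega + alpha * r[t - 1] ** 2
        r[t] = np.sqrt(sigma2[t]) * eps[t]
    return r, sigma2


def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))


# Illustrative (assumed) parameters, not estimates from the chapter
sim_ret, sim_sigma2 = simulate_arch1(omega=0.5, alpha=0.3, n_obs=5000)
acf_ret = lag1_autocorr(sim_ret)
acf_sq = lag1_autocorr(sim_ret ** 2)
print(f"lag-1 autocorrelation of returns:         {acf_ret:.3f}")
print(f"lag-1 autocorrelation of squared returns: {acf_sq:.3f}")
```

If clustering is present, the squared returns should be clearly positively autocorrelated even though the returns themselves are close to uncorrelated.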
Despite its appealing features, such as simplicity, non-linearity, ease of use, and adjustment for forecasts, ARCH has certain drawbacks, which can be listed as:
Equal response to the positive and negative shocks
Strong assumptions such as restrictions on parameters
Possible misprediction due to slow-adjustment to large movements
These drawbacks motivated researchers to work on extensions of the ARCH model, and Bollerslev (1986) and Taylor (2008) proposed the GARCH model.
Now we will employ the ARCH model to predict volatility. First, let’s write our own Python code, and then compare it with the built-in implementation to see the difference.
In [7]:
n = 252
split_date = ret.iloc[-n:].index

In [8]:
sgm2 = ret.var()
K = ret.kurtosis()
alpha = (-3.0 * sgm2 + np.sqrt(9.0 * sgm2 ** 2 - 12.0 *
                               (3.0 * sgm2 - K) * K)) / (6 * K)
omega = (1 - alpha) * sgm2
# Note: arch_likelihood reads index 0 as omega and index 1 as alpha,
# so these starting values enter the optimizer in swapped roles
initial_parameters = [alpha, omega]
omega, alpha
Out[8]: (0.5579500203177984, 0.44568819417572125)

In [9]:
@jit(nopython=True, parallel=True)
def arch_likelihood(initial_parameters, retv):
    omega = abs(initial_parameters[0])
    alpha = abs(initial_parameters[1])
    T = len(retv)
    logliks = 0
    sigma2 = np.zeros(T)
    sigma2[0] = np.var(retv)
    for t in range(1, T):
        sigma2[t] = omega + alpha * (retv[t - 1]) ** 2
    # Negative Gaussian log-likelihood (up to a constant), to be minimized
    logliks = np.sum(0.5 * (np.log(sigma2) + retv ** 2 / sigma2))
    return logliks

In [10]:
logliks = arch_likelihood(initial_parameters, retv)
logliks
Out[10]: 414610.07621266536

In [11]:
def opt_params(x0, retv):
    opt_result = opt.minimize(arch_likelihood, x0=x0, args=(retv,),
                              method='Nelder-Mead',
                              options={'maxiter': 5000})
    params = opt_result.x
    print('\nResults of Nelder-Mead minimization\n{}\n{}'
          .format(''.join(['-'] * 28), opt_result))
    print('\nResulting params = {}'.format(params))
    return params

In [12]:
params = opt_params(initial_parameters, retv)

Results of Nelder-Mead minimization
----------------------------
 final_simplex: (array([[6.66113178e+03, 3.11668324e-01],
       [6.66113182e+03, 3.11668301e-01],
       [6.66113170e+03, 3.11668310e-01]]), array([12899.46225979, 12899.46225979, 12899.46225979]))
           fun: 12899.462259792343
       message: 'Optimization terminated successfully.'
          nfev: 256
           nit: 130
        status: 0
       success: True
             x: array([6.66113178e+03, 3.11668324e-01])

Resulting params = [6.66113178e+03 3.11668324e-01]

In [13]:
def arch_apply(ret):
    omega = params[0]
    alpha = params[1]
    T = len(ret)
    sigma2_arch = np.zeros(T + 1)
    sigma2_arch[0] = np.var(ret)
    for t in range(1, T):
        sigma2_arch[t] = omega + alpha * ret[t - 1] ** 2
    return sigma2_arch

In [14]:
sigma2_arch = arch_apply(ret)
Defining the split location and assigning the split data to the split_date variable
Calculating the variance of the S&P-500
Calculating the kurtosis of the S&P-500
Identifying the initial value for the slope coefficient
Identifying the initial value for the constant term
Using parallel processing to decrease the processing time
Taking absolute values and assigning the initial values to the related variables
Identifying the initial value of the variance
Iterating the variance of the S&P-500
Calculating the log-likelihood
Calling the function
Minimizing the log-likelihood function
Creating a variable “params” for the optimized parameters
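One way to gain confidence in a hand-written likelihood of this kind is to run it on simulated data whose true parameters are known. The sketch below is a simplified, vectorized variant under stated assumptions: the function name `arch1_neg_loglik`, the true values (0.5, 0.3), the starting point, and the seed are all illustrative, and numba is left out to keep the example self-contained.

```python
import numpy as np
from scipy.optimize import minimize


def arch1_neg_loglik(params, r):
    """Negative Gaussian log-likelihood of ARCH(1), up to an additive constant."""
    omega, alpha = np.abs(params)
    sigma2 = np.empty_like(r)
    sigma2[0] = r.var()                        # initial variance
    sigma2[1:] = omega + alpha * r[:-1] ** 2   # ARCH(1) recursion, vectorized
    return 0.5 * np.sum(np.log(sigma2) + r ** 2 / sigma2)


# Simulate an ARCH(1) series with known (assumed) parameters
rng = np.random.default_rng(42)
true_omega, true_alpha, n_obs = 0.5, 0.3, 5000
r = np.zeros(n_obs)
for t in range(1, n_obs):
    sigma_t = np.sqrt(true_omega + true_alpha * r[t - 1] ** 2)
    r[t] = sigma_t * rng.standard_normal()

# Recover the parameters with the same optimizer used in the chapter
res = minimize(arch1_neg_loglik, x0=[0.1, 0.1], args=(r,),
               method='Nelder-Mead')
omega_hat, alpha_hat = np.abs(res.x)
print(f"omega: true={true_omega}, estimated={omega_hat:.3f}")
print(f"alpha: true={true_alpha}, estimated={alpha_hat:.3f}")
```

If the estimates land close to the true values, the likelihood and the optimizer are at least wired together consistently; this says nothing about how well ARCH(1) fits real returns.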
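Well, we have modeled volatility via ARCH using our own optimization method and ARCH equation. How does it compare with the built-in Python code? This built-in code can be imported from the arch library and is extremely easy to apply. The result of the built-in code is provided below, and the two sets of results turn out to be very similar: our α of about 0.312 matches the built-in estimate, and our ω of about 6661 corresponds to about 0.666 once the extra factor of 100 applied to retv is undone (6661 / 100² ≈ 0.666).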
In [15]:
arch = arch_model(ret, mean='zero', vol='ARCH', p=1).fit(disp='off')
print(arch.summary())

                        Zero Mean - ARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                       ARCH   Log-Likelihood:               -3440.61
Distribution:                  Normal   AIC:                           6885.22
Method:            Maximum Likelihood   BIC:                           6896.92
                                        No. Observations:                 2566
Date:                Mon, May 03 2021   Df Residuals:                     2564
Time:                        11:20:45   Df Model:                            2
                            Volatility Model
========================================================================
                 coef    std err          t      P>|t|  95.0% Conf. Int.
------------------------------------------------------------------------
omega          0.6663  4.827e-02     13.803  2.441e-43 [  0.572,  0.761]
alpha[1]       0.3124  5.880e-02      5.312  1.082e-07 [  0.197,  0.428]
========================================================================

Covariance estimator: robust
Although developing our own code is always helpful and improves our understanding, the beauty of the built-in code is not restricted to its simplicity. Finding the optimal lag value is another advantage of the built-in code, along with its optimized running procedure.
All we need is to create a for loop and define a proper information criterion. Below, the Bayesian Information Criterion (BIC) is chosen as the model selection method to select the lag. BIC is picked because, as long as we have a large enough sample, it is a reliable tool for model selection, as discussed by Burnham and Anderson (2002, 2004). Now, we iterate the ARCH model over 1 to 4 lags.
In [16]:
bic_arch = []

for p in range(1, 5):
    arch = arch_model(ret, mean='zero', vol='ARCH', p=p).fit(disp='off')
    bic_arch.append(arch.bic)
    if arch.bic == np.min(bic_arch):
        best_param = p
# Note: p holds the last loop value (4) here, not best_param
arch = arch_model(ret, mean='Constant', vol='ARCH', p=p).fit(disp='off')
print(arch.summary())
forecast = arch.forecast(start=split_date[0])
forecast_arch = forecast

                      Constant Mean - ARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                      -0.002
Mean Model:             Constant Mean   Adj. R-squared:                 -0.002
Vol Model:                       ARCH   Log-Likelihood:               -3146.52
Distribution:                  Normal   AIC:                           6305.03
Method:            Maximum Likelihood   BIC:                           6340.14
                                        No. Observations:                 2566
Date:                Mon, May 03 2021   Df Residuals:                     2560
Time:                        11:20:48   Df Model:                            6
                                Mean Model
==========================================================================
                 coef    std err          t      P>|t|    95.0% Conf. Int.
--------------------------------------------------------------------------
mu             0.0834  1.443e-02      5.779  7.515e-09 [5.511e-02,  0.112]
                             Volatility Model
==========================================================================
                 coef    std err          t      P>|t|    95.0% Conf. Int.
--------------------------------------------------------------------------
omega          0.2463  2.505e-02      9.832  8.214e-23   [  0.197,  0.295]
alpha[1]       0.1699  3.935e-02      4.317  1.584e-05 [9.274e-02,  0.247]
alpha[2]       0.2138  3.848e-02      5.557  2.745e-08   [  0.138,  0.289]
alpha[3]       0.2166  4.191e-02      5.166  2.385e-07   [  0.134,  0.299]
alpha[4]       0.2064  4.595e-02      4.491  7.075e-06   [  0.116,  0.296]
==========================================================================

Covariance estimator: robust

In [17]:
rmse_arch = np.sqrt(mean_squared_error(realized_vol[-n:] / 100,
                                       np.sqrt(forecast_arch.variance
                                               .iloc[-len(split_date):] / 100)))
print('The RMSE value of ARCH model is {:.4f}'.format(rmse_arch))
The RMSE value of ARCH model is 0.1116

In [18]:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(forecast_arch.variance.iloc[-len(split_date):] / 100,
         label='Volatility Prediction-ARCH')
plt.title('Volatility Prediction with ARCH', fontsize=12)
plt.legend()
plt.savefig('images/arch.png')
plt.show()
Iterating ARCH parameter p over specified interval
Running ARCH model with different p values
Finding the minimum Bayesian Information Criteria score to select the best model
Running ARCH model with the best p value
Forecasting the volatility based on the optimized ARCH model
Calculating the RMSE score
The result of volatility prediction based on our first model is shown in Figure 4-3.
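The information criteria reported in the model summaries can be reproduced by hand from the log-likelihood, the number of estimated parameters, and the number of observations. The sketch below uses the standard definitions AIC = 2k - 2 log L and BIC = k ln(n) - 2 log L and plugs in the values printed in the zero-mean ARCH(1) summary; the function name `information_criteria` is an illustrative assumption.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return (AIC, BIC) from a maximized log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return aic, bic


# Values taken from the zero-mean ARCH(1) summary:
# log-likelihood = -3440.61, 2 parameters (omega, alpha[1]), 2566 observations
aic_arch1, bic_arch1 = information_criteria(-3440.61, n_params=2, n_obs=2566)
print(f"AIC = {aic_arch1:.2f}, BIC = {bic_arch1:.2f}")
```

Because the BIC penalty grows with ln(n) per parameter, adding lags has to improve the log-likelihood by more than about 3.9 per extra parameter for this sample size before BIC prefers the larger model.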
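The GARCH model is an extension of the ARCH model that incorporates lagged conditional variance. So, ARCH is improved by adding q lagged conditional variance terms, which makes GARCH an autoregressive moving average model for the conditional variance, with p lagged squared returns and q lagged conditional variances. GARCH(p, q) can be formulated as:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p}\alpha_k r_{t-k}^2 + \sum_{k=1}^{q}\beta_k \sigma_{t-k}^2$$

where $\omega$, $\alpha_k$, and $\beta_k$ are the parameters to be estimated, and $p$ and $q$ are the maximum lags. A consistent GARCH model requires $\omega > 0$, $\alpha_k \geq 0$, $\beta_k \geq 0$, and $\sum_k \alpha_k + \sum_k \beta_k < 1$.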
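An ARCH model with a finite number of lags captures the influence of only a limited window of historical innovations. GARCH, as a more parsimonious model, can account for the whole history of innovations because it can be expressed as an infinite-order ARCH. Let’s show this for GARCH(1,1):

$$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$

Then replace $\sigma_{t-1}^2$ by $\omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2$:

$$\sigma_t^2 = \omega(1 + \beta) + \alpha r_{t-1}^2 + \alpha\beta r_{t-2}^2 + \beta^2 \sigma_{t-2}^2$$

Now, let’s substitute $\sigma_{t-2}^2$ by $\omega + \alpha r_{t-3}^2 + \beta \sigma_{t-3}^2$ and keep repeating the substitution. For $0 \leq \beta < 1$ we end up with:

$$\sigma_t^2 = \omega(1 + \beta + \beta^2 + \dots) + \alpha \sum_{k=1}^{\infty} \beta^{k-1} r_{t-k}^2 = \frac{\omega}{1 - \beta} + \alpha \sum_{k=1}^{\infty} \beta^{k-1} r_{t-k}^2$$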
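The infinite-order ARCH representation of GARCH(1,1), namely sigma_t^2 = omega / (1 - beta) + alpha * sum over k of beta^(k-1) * r_(t-k)^2, can be checked numerically. The following minimal sketch compares the recursive GARCH(1,1) variance with this sum truncated at the start of the sample. The parameter values and the synthetic returns are illustrative assumptions.

```python
import numpy as np

# Illustrative (assumed) GARCH(1,1) parameters and synthetic returns
omega, alpha, beta = 0.04, 0.17, 0.79
rng = np.random.default_rng(1)
r = rng.standard_normal(500)
T = len(r)

# Recursive GARCH(1,1) variance
sigma2 = np.zeros(T)
sigma2[0] = r.var()
for t in range(1, T):
    sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]

# ARCH(infinity) form evaluated at the last date, using all available lags
t = T - 1
k = np.arange(1, t + 1)                      # lags 1, ..., t
arch_inf = omega / (1 - beta) + alpha * np.sum(beta ** (k - 1) * r[t - k] ** 2)

print(f"recursive variance: {sigma2[t]:.10f}")
print(f"ARCH(inf) variance: {arch_inf:.10f}")
```

The two numbers differ only by a term proportional to beta raised to the power t, which is negligible after a few hundred observations; this is why the starting value of the variance has little influence far from the beginning of the sample.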
Similar to the ARCH model, there is more than one way to model volatility using GARCH in Python. Let us first try to develop our own Python code using an optimization technique. After that, the arch library will be used to predict volatility.
In [19]:
a0 = 0.0001
sgm2 = ret.var()
K = ret.kurtosis()
# alpha and omega on the right-hand sides below still hold the values
# computed for the ARCH starting point
h = 1 - alpha / sgm2
alpha = np.sqrt(K * (1 - h ** 2) / (2.0 * (K + 3)))
beta = np.abs(h - omega)
omega = (1 - omega) * sgm2
initial_parameters = np.array([omega, alpha, beta])
print('Initial parameters for omega, alpha, and beta are \n{}\n{}\n{}'
      .format(omega, alpha, beta))
Initial parameters for omega, alpha, and beta are
0.444951365916522
0.519448113662449
0.0007320236366448185

In [20]:
retv = ret.values

In [21]:
@jit(nopython=True, parallel=True)
def garch_likelihood(initial_parameters, retv):
    omega = initial_parameters[0]
    alpha = initial_parameters[1]
    beta = initial_parameters[2]
    T = len(retv)
    logliks = 0
    sigma2 = np.zeros(T)
    sigma2[0] = np.var(retv)
    for t in range(1, T):
        sigma2[t] = omega + alpha * (retv[t - 1]) ** 2 + beta * sigma2[t - 1]
    # Negative Gaussian log-likelihood (up to a constant), to be minimized
    logliks = np.sum(0.5 * (np.log(sigma2) + retv ** 2 / sigma2))
    return logliks

In [22]:
logliks = garch_likelihood(initial_parameters, retv)
print('The Log likelihood is {:.4f}'.format(logliks))
The Log likelihood is 1143.7064

In [23]:
def garch_constraint(initial_parameters):
    # Stationarity condition 1 - alpha - beta > 0 for [omega, alpha, beta]
    omega = initial_parameters[0]
    alpha = initial_parameters[1]
    beta = initial_parameters[2]
    return np.array([1 - alpha - beta])

In [24]:
bounds = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]

In [25]:
def opt_paramsG(initial_parameters, retv):
    # Nelder-Mead cannot handle constraints, so none are passed here
    opt_result = opt.minimize(garch_likelihood,
                              x0=initial_parameters,
                              bounds=bounds, args=(retv,),
                              method='Nelder-Mead',
                              options={'maxiter': 5000})
    params = opt_result.x
    print('\nResults of Nelder-Mead minimization\n{}\n{}'
          .format('-' * 35, opt_result))
    print('-' * 35)
    print('\nResulting parameters = {}'.format(params))
    return params

In [26]:
params = opt_paramsG(initial_parameters, retv)

Results of Nelder-Mead minimization
-----------------------------------
 final_simplex: (array([[0.03754678, 0.17239775, 0.79143287],
       [0.03753309, 0.17240623, 0.7914592 ],
       [0.03754475, 0.17232523, 0.79150698],
       [0.03757972, 0.172482  , 0.79134976]]), array([760.48013548, 760.4801371 , 760.48014785, 760.48014867]))
           fun: 760.4801354810166
       message: 'Optimization terminated successfully.'
          nfev: 250
           nit: 141
        status: 0
       success: True
             x: array([0.03754678, 0.17239775, 0.79143287])
-----------------------------------

Resulting parameters = [0.03754678 0.17239775 0.79143287]

In [27]:
def garch_apply(ret):
    omega = params[0]
    alpha = params[1]
    beta = params[2]
    T = len(ret)
    sigma2 = np.zeros(T + 1)
    sigma2[0] = np.var(ret)
    for t in range(1, T):
        sigma2[t] = omega + alpha * ret[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
The parameters we get from our own code for the GARCH model are approximately:
ω = 0.0375
α = 0.1724
β = 0.7914
The following built-in Python code confirms that we did a good job, in that the parameters obtained via the built-in code are very close to ours. So, we have learned how to code GARCH and ARCH models to predict volatility.
In [28]:
garch = arch_model(ret, mean='zero', vol='GARCH', p=1, o=0, q=1)\
        .fit(disp='off')
print(garch.summary())

                       Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:               -3118.48
Distribution:                  Normal   AIC:                           6242.96
Method:            Maximum Likelihood   BIC:                           6260.51
                                        No. Observations:                 2566
Date:                Mon, May 03 2021   Df Residuals:                     2563
Time:                        11:21:04   Df Model:                            3
                              Volatility Model
============================================================================
                 coef    std err          t      P>|t|      95.0% Conf. Int.
----------------------------------------------------------------------------
omega          0.0376  8.599e-03      4.367  1.259e-05 [2.070e-02,5.441e-02]
alpha[1]       0.1724  2.295e-02      7.514  5.756e-14     [  0.127,  0.217]
beta[1]        0.7914  2.347e-02     33.714 3.572e-249     [  0.745,  0.837]
============================================================================

Covariance estimator: robust
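The GARCH(1,1) estimates in the summary above can be translated into two quantities that are easier to interpret. Under the standard stationarity condition alpha + beta < 1, the unconditional variance is omega / (1 - alpha - beta), and the half-life of a volatility shock is ln(0.5) / ln(alpha + beta). The sketch below applies these textbook formulas to the rounded coefficients from the summary; the variable names are illustrative.

```python
import math

# Rounded estimates taken from the GARCH(1,1) summary (returns in percent)
omega, alpha, beta = 0.0376, 0.1724, 0.7914

persistence = alpha + beta                      # must be below 1 for stationarity
uncond_var = omega / (1.0 - persistence)        # long-run variance, in percent^2
uncond_vol = math.sqrt(uncond_var)              # long-run daily volatility, in percent
half_life = math.log(0.5) / math.log(persistence)  # days for a shock to decay by half

print(f"persistence (alpha + beta): {persistence:.4f}")
print(f"unconditional daily vol:    {uncond_vol:.3f}%")
print(f"shock half-life:            {half_life:.1f} days")
```

A persistence close to 1 means that shocks to volatility die out slowly, which is the GARCH counterpart of the clustering visible in the return plot.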
It is apparent that it is easy to work with GARCH(1,1), but how do we know that these parameters are the optimal ones? Let us choose the optimal parameter set based on the lowest BIC value.
In [29]:
bic_garch = []

for p in range(1, 5):
    for q in range(1, 5):
        garch = arch_model(ret, mean='zero', vol='GARCH', p=p, o=0, q=q)\
                .fit(disp='off')
        bic_garch.append(garch.bic)
        if garch.bic == np.min(bic_garch):
            best_param = p, q
# Note: p and q hold the last loop values (4, 4) here, not best_param,
# so the summary below reports GARCH(4,4)
garch = arch_model(ret, mean='zero', vol='GARCH', p=p, o=0, q=q)\
        .fit(disp='off')
print(garch.summary())
forecast = garch.forecast(start=split_date[0])
forecast_garch = forecast

                       Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                      GARCH   Log-Likelihood:               -3114.97
Distribution:                  Normal   AIC:                           6247.95
Method:            Maximum Likelihood   BIC:                           6300.60
                                        No. Observations:                 2566
Date:                Mon, May 03 2021   Df Residuals:                     2557
Time:                        11:21:06   Df Model:                            9
                              Volatility Model
===========================================================================
                 coef    std err          t      P>|t|     95.0% Conf. Int.
---------------------------------------------------------------------------
omega          0.1027  5.964e-02      1.723  8.498e-02 [-1.416e-02,  0.220]
alpha[1]       0.1306  3.561e-02      3.667  2.456e-04 [ 6.078e-02,  0.200]
alpha[2]       0.1659      0.113      1.468      0.142 [-5.561e-02,  0.387]
alpha[3]       0.0900      0.131      0.686      0.492    [ -0.167,  0.347]
alpha[4]       0.0804  5.161e-02      1.557      0.119 [-2.078e-02,  0.182]
beta[1]    2.3949e-15      0.883  2.713e-15      1.000    [ -1.730,  1.730]
beta[2]        0.2298      0.333      0.691      0.490    [ -0.422,  0.882]
beta[3]    8.3894e-15      0.449  1.868e-14      1.000    [ -0.880,  0.880]
beta[4]        0.2058      0.139      1.478      0.139 [-6.715e-02,  0.479]
===========================================================================

Covariance estimator: robust

In [30]:
rmse_garch = np.sqrt(mean_squared_error(realized_vol[-n:] / 100,
                                        np.sqrt(forecast_garch.variance
                                                .iloc[-len(split_date):] / 100)))
print('The RMSE value of GARCH model is {:.4f}'.format(rmse_garch))
The RMSE value of GARCH model is 0.1027

In [31]:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(forecast_garch.variance.iloc[-len(split_date):] / 100,
         label='Volatility Prediction-GARCH')
plt.title('Volatility Prediction with GARCH', fontsize=12)
plt.legend()
plt.savefig('images/garch.png')
plt.show()
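So, as shown, GARCH is able to explain the effect of all historical shocks on the contemporaneous conditional variance. However, like ARCH, it responds equally to positive and negative shocks and therefore cannot capture asymmetry. To remedy this issue, GJR-GARCH was proposed by Glosten, Jagannathan and Runkle (1993).
This model performs well in modeling the asymmetric effects of announcements. The equation of the model includes one more parameter, $\gamma$:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p}\alpha_k r_{t-k}^2 + \sum_{k=1}^{o}\gamma_k r_{t-k}^2 I(r_{t-k} < 0) + \sum_{k=1}^{q}\beta_k \sigma_{t-k}^2$$

where $I(\cdot)$ is an indicator function equal to 1 when the lagged return is negative and 0 otherwise, so that $\gamma$ controls the asymmetry of the response. If $\gamma = 0$, positive and negative shocks have the same effect; if $\gamma > 0$, a negative shock raises the variance more than a positive shock of the same size; and if $\gamma < 0$, the opposite holds.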
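The asymmetry can be made concrete with a one-step variance update. The sketch below assumes the common GJR-GARCH(1,1,1) form sigma_t^2 = omega + (alpha + gamma * I(r_{t-1} < 0)) * r_{t-1}^2 + beta * sigma_{t-1}^2; the function name `gjr_garch_next_variance` and all parameter values are illustrative assumptions, not estimates from this chapter.

```python
def gjr_garch_next_variance(ret_prev, sigma2_prev, omega, alpha, gamma, beta):
    """One-step GJR-GARCH(1,1,1) variance update."""
    indicator = 1.0 if ret_prev < 0 else 0.0   # 1 only after a negative return
    return (omega
            + (alpha + gamma * indicator) * ret_prev ** 2
            + beta * sigma2_prev)


# Illustrative (assumed) parameters with gamma > 0
params = dict(omega=0.04, alpha=0.05, gamma=0.10, beta=0.80)

# Same shock size, opposite signs, same previous variance
var_after_gain = gjr_garch_next_variance(2.0, 1.0, **params)
var_after_loss = gjr_garch_next_variance(-2.0, 1.0, **params)

print(f"next variance after +2% return: {var_after_gain:.2f}")
print(f"next variance after -2% return: {var_after_loss:.2f}")
```

With a positive gamma, the loss of 2% pushes the next-period variance higher than the gain of 2% does, which is exactly the asymmetric response that plain GARCH cannot produce.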
In [32]:
bic_gjr_garch = []

for p in range(1, 5):
    for q in range(1, 5):
        gjrgarch = arch_model(ret, mean='zero', p=p, o=1, q=q)\
                   .fit(disp='off')
        bic_gjr_garch.append(gjrgarch.bic)
        if gjrgarch.bic == np.min(bic_gjr_garch):
            best_param = p, q
# Note: p and q hold the last loop values (4, 4) here, not best_param
gjrgarch = arch_model(ret, mean='zero', p=p, o=1, q=q).fit(disp='off')
print(gjrgarch.summary())
forecast = gjrgarch.forecast(start=split_date[0])
forecast_gjrgarch = forecast

                     Zero Mean - GJR-GARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                  GJR-GARCH   Log-Likelihood:               -3030.12
Distribution:                  Normal   AIC:                           6080.24
Method:            Maximum Likelihood   BIC:                           6138.74
                                        No. Observations:                 2566
Date:                Mon, May 03 2021   Df Residuals:                     2556
Time:                        11:21:10   Df Model:                           10
                               Volatility Model
=============================================================================
                 coef    std err          t      P>|t|       95.0% Conf. Int.
-----------------------------------------------------------------------------
omega          0.0390     43.311  9.006e-04      0.999    [-84.850, 84.928]
alpha[1]   5.8308e-15     86.209  6.764e-17      1.000 [-1.690e+02,1.690e+02]
alpha[2]   4.5942e-15     64.320  7.143e-17      1.000 [-1.261e+02,1.261e+02]
alpha[3]   2.7611e-15     54.949  5.025e-17      1.000 [-1.077e+02,1.077e+02]
alpha[4]   8.2718e-16    120.430  6.869e-18      1.000 [-2.360e+02,2.360e+02]
gamma[1]       0.3242    329.839  9.830e-04      0.999 [-6.461e+02,6.468e+02]
beta[1]        0.8073   2333.868  3.459e-04      1.000 [-4.573e+03,4.575e+03]
beta[2]    1.9236e-08   1719.001  1.119e-11      1.000 [-3.369e+03,3.369e+03]
beta[3]    1.7134e-16   1709.038  1.003e-19      1.000 [-3.350e+03,3.350e+03]
beta[4]    3.2047e-16   1355.064  2.365e-19      1.000 [-2.656e+03,2.656e+03]
=============================================================================

Covariance estimator: robust

In [33]:
rmse_gjr_garch = np.sqrt(mean_squared_error(realized_vol[-n:] / 100,
                                            np.sqrt(forecast_gjrgarch.variance
                                                    .iloc[-len(split_date):] / 100)))
print('The RMSE value of GJR-GARCH models is {:.4f}'.format(rmse_gjr_garch))
The RMSE value of GJR-GARCH models is 0.1144

In [34]:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(forecast_gjrgarch.variance.iloc[-len(split_date):] / 100,
         label='Volatility Prediction-GJR-GARCH')
plt.title('Volatility Prediction with GJR-GARCH', fontsize=12)
plt.legend()
plt.savefig('images/gjr_garch.png')
plt.show()
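Together with the GJR-GARCH model, EGARCH, proposed by Nelson (1991), is also a tool for controlling for the effect of asymmetric announcements. Additionally, because it is specified in logarithmic form, there is no need to impose restrictions to avoid negative volatility:

$$\log(\sigma_t^2) = \omega + \sum_{k=1}^{p}\alpha_k \frac{|r_{t-k}|}{\sqrt{\sigma_{t-k}^2}} + \sum_{k=1}^{o}\gamma_k \frac{r_{t-k}}{\sqrt{\sigma_{t-k}^2}} + \sum_{k=1}^{q}\beta_k \log(\sigma_{t-k}^2)$$

The main difference in the EGARCH equation is that the logarithm of the variance is taken on the left-hand side of the equation. The $\gamma$ term captures the leverage effect, meaning that there is a negative correlation between past asset returns and volatility. If $\gamma < 0$, it implies a leverage effect, and if $\gamma \neq 0$, it indicates asymmetry in volatility.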
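The two properties claimed for EGARCH, positivity without parameter restrictions and the leverage effect, can be illustrated with a one-step update. The sketch below assumes a simplified EGARCH(1,1,1) form, log sigma_t^2 = omega + alpha * |z_{t-1}| + gamma * z_{t-1} + beta * log sigma_{t-1}^2 with z = r / sigma; software implementations may center the |z| term differently. The function name and the parameter values are illustrative assumptions.

```python
import math


def egarch_next_variance(ret_prev, sigma2_prev, omega, alpha, gamma, beta):
    """One-step update of a simplified EGARCH(1,1,1) in log-variance form."""
    z = ret_prev / math.sqrt(sigma2_prev)          # standardized return
    log_sigma2 = (omega + alpha * abs(z) + gamma * z
                  + beta * math.log(sigma2_prev))
    return math.exp(log_sigma2)                    # always strictly positive


# Illustrative (assumed) parameters: negative omega and gamma are allowed
params = dict(omega=-0.1, alpha=0.1, gamma=-0.2, beta=0.9)

var_after_gain = egarch_next_variance(2.0, 1.0, **params)
var_after_loss = egarch_next_variance(-2.0, 1.0, **params)

print(f"next variance after +2% return: {var_after_gain:.4f}")
print(f"next variance after -2% return: {var_after_loss:.4f}")
```

Even with negative coefficients the variance stays positive because it is obtained by exponentiation, and with a negative gamma the negative return raises the variance more than the positive one.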
In [35]:
bic_egarch = []

for p in range(1, 5):
    for q in range(1, 5):
        egarch = arch_model(ret, mean='zero', vol='EGARCH', p=p, o=1, q=q)\
                 .fit(disp='off')
        bic_egarch.append(egarch.bic)
        if egarch.bic == np.min(bic_egarch):
            best_param = p, q
# Note: p and q hold the last loop values (4, 4) here, not best_param
egarch = arch_model(ret, mean='zero', vol='EGARCH', p=p, o=1, q=q)\
         .fit(disp='off')
print(egarch.summary())
forecast = egarch.forecast(start=split_date[0])
forecast_egarch = forecast

                       Zero Mean - EGARCH Model Results
==============================================================================
Dep. Variable:              Adj Close   R-squared:                       0.000
Mean Model:                 Zero Mean   Adj. R-squared:                  0.000
Vol Model:                     EGARCH   Log-Likelihood:               -3005.07
Distribution:                  Normal   AIC:                           6030.15
Method:            Maximum Likelihood   BIC:                           6088.65
                                        No. Observations:                 2566
Date:                Mon, May 03 2021   Df Residuals:                     2556
Time:                        11:21:14   Df Model:                           10
                               Volatility Model
=============================================================================
                 coef    std err          t      P>|t|       95.0% Conf. Int.
-----------------------------------------------------------------------------
omega         -0.0143  7.763e-03     -1.847  6.469e-02 [-2.956e-02,8.741e-04]
alpha[1]       0.0685  6.360e-02      1.077      0.282   [-5.618e-02,  0.193]
alpha[2]       0.0558  8.367e-02      0.667      0.505      [ -0.108,  0.220]
alpha[3]       0.0107  7.159e-02      0.149      0.881      [ -0.130,  0.151]
alpha[4]       0.0851  4.927e-02      1.728  8.396e-02   [-1.142e-02,  0.182]
gamma[1]      -0.2763  4.216e-02     -6.554  5.587e-11      [ -0.359, -0.194]
beta[1]        0.8652      0.160      5.420  5.970e-08      [  0.552,  1.178]
beta[2]    1.8427e-15      0.189  9.760e-15      1.000      [ -0.370,  0.370]
beta[3]        0.0597      0.193      0.310      0.757      [ -0.318,  0.437]
beta[4]    2.1382e-15      0.176  1.215e-14      1.000      [ -0.345,  0.345]
=============================================================================

Covariance estimator: robust

In [36]:
rmse_egarch = np.sqrt(mean_squared_error(realized_vol[-n:] / 100,
                                         np.sqrt(forecast_egarch.variance
                                                 .iloc[-len(split_date):] / 100)))
print('The RMSE value of EGARCH models is {:.4f}'.format(rmse_egarch))
The RMSE value of EGARCH models is 0.0987

In [37]:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(forecast_egarch.variance.iloc[-len(split_date):] / 100,
         label='Volatility Prediction-EGARCH')
plt.title('Volatility Prediction with EGARCH', fontsize=12)
plt.legend()
plt.savefig('images/egarch.png')
plt.show()
Up to now, we have discussed the classical volatility models but from this point on we will see how Machine Learning and Bayesian Approach can be used to model volatility. In the context of Machine Learning, Support Vector Machines and Neural Network will be the first models to visit. Let’s get started.
Support Vector Machines (SVM) is a supervised learning algorithm that is applicable to both classification and regression. The aim in SVM is to find a line that separates two classes. It sounds easy, but here is the challenging part: there are infinitely many lines that can be used to distinguish the classes. We are looking for the optimal line, by which the classes can be perfectly discriminated.
In linear algebra jargon, the optimal line is called a “hyperplane”, which maximizes the distance between the points that are closest to the hyperplane but belong to different classes. These closest points are the support vectors, and the distance between them is known as the “margin”. So, in SVM, what we are trying to do is maximize the margin between support vectors.
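The margin idea can be illustrated on a tiny, linearly separable toy dataset. The sketch below fits a linear support vector classifier with scikit-learn and computes the margin width as 2 / ||w||, where w is the normal vector of the separating hyperplane. The data points, the variable names, and the large value of C (used to approximate a hard margin) are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (assumed for illustration)
X_toy = np.array([[0, 0], [1, 0], [0, 1],
                  [3, 3], [4, 3], [3, 4]], dtype=float)
y_toy = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates a hard-margin SVM
clf_toy = SVC(kernel='linear', C=1e6).fit(X_toy, y_toy)

w = clf_toy.coef_[0]                   # normal vector of the hyperplane
margin = 2.0 / np.linalg.norm(w)       # distance between the two margin lines

print("support vectors:")
print(clf_toy.support_vectors_)
print(f"margin width: {margin:.4f}")
```

Only the points closest to the other class end up as support vectors; moving any of the remaining points slightly would leave the hyperplane unchanged.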
SVM for classification is labeled Support Vector Classification (SVC). Keeping all the characteristics of SVM, it can also be applied to regression. Again, in regression, the aim is to find the hyperplane that minimizes the error and maximizes the margin. This method is called Support Vector Regression (SVR) and, in this part, we will apply it to the GARCH model. Combining these two models gives a model with a different name: “SVR-GARCH”.
The following code shows the preparations before running the SVR-GARCH in Python. The most crucial step here is to obtain the independent variables, which are realized volatility and the square of historical returns.
In [38]:
from sklearn.svm import SVR
from scipy.stats import uniform as sp_rand
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

In [39]:
realized_vol = ret.rolling(5).std()
realized_vol = pd.DataFrame(realized_vol)
realized_vol.reset_index(drop=True, inplace=True)

In [40]:
returns_svm = ret ** 2
returns_svm = returns_svm.reset_index()
del returns_svm['Date']

In [41]:
X = pd.concat([realized_vol, returns_svm], axis=1, ignore_index=True)
X = X[4:].copy()
X = X.reset_index()
X.drop('index', axis=1, inplace=True)

In [42]:
realized_vol = realized_vol.dropna().reset_index()
realized_vol.drop('index', axis=1, inplace=True)

In [43]:
svr_poly = SVR(kernel='poly')
svr_lin = SVR(kernel='linear')
svr_rbf = SVR(kernel='rbf')
Computing realized volatility and assigning it to a new variable named “realized_vol”
Creating new variables for the different SVR kernels
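The alignment between features and target is the part of this preparation that is easiest to get wrong, so it may help to see it on a small synthetic series. The sketch below builds the same two features (5-day realized volatility and squared return) and pairs each row with the next day's realized volatility as the target. The synthetic returns and the names `toy_ret`, `features`, and `target` are illustrative assumptions, and the use of `shift(-1)` is one possible way of expressing the one-step-ahead offset, not necessarily how the chapter's code does it.

```python
import numpy as np
import pandas as pd

# Synthetic returns, for illustration only
rng = np.random.default_rng(0)
toy_ret = pd.Series(rng.standard_normal(30))

window = 5
toy_vol = toy_ret.rolling(window).std()          # first 4 values are NaN

# Features observed at day t: realized volatility and squared return
features = pd.DataFrame({'realized_vol': toy_vol,
                         'ret_sq': toy_ret ** 2}).dropna()

# Target: realized volatility at day t + 1
target = toy_vol.shift(-1).loc[features.index]

# The last day has no next-day target, so drop it from both
features, target = features.iloc[:-1], target.iloc[:-1]

print(features.shape, target.shape)
```

Each feature row dated t is now matched with the volatility dated t + 1, so a model fitted on these arrays produces genuine one-step-ahead predictions.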
Let us run and see our first SVR-GARCH application with linear kernel. Root mean squared error (RMSE) is the metric to be used to compare.
In [44]:
para_grid = {'gamma': sp_rand(),
             'C': sp_rand(),
             'epsilon': sp_rand()}
clf = RandomizedSearchCV(svr_lin, para_grid)
clf.fit(X.iloc[:-n].values,
        realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
predict_svr_lin = clf.predict(X.iloc[-n:])

In [45]:
predict_svr_lin = pd.DataFrame(predict_svr_lin)
predict_svr_lin.index = ret.iloc[-n:].index

In [46]:
rmse_svr = np.sqrt(mean_squared_error(realized_vol.iloc[-n:] / 100,
                                      predict_svr_lin / 100))
print('The RMSE value of SVR with Linear Kernel is {:.4f}'.format(rmse_svr))
The RMSE value of SVR with Linear Kernel is 0.0010

In [47]:
realized_vol.index = ret.iloc[4:].index

In [48]:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(predict_svr_lin / 100, label='Volatility Prediction-SVR-GARCH')
plt.title('Volatility Prediction with SVR-GARCH (Linear)', fontsize=12)
plt.legend()
plt.savefig('images/svr_garch_linear.png')
plt.show()
Identifying the hyperparameter space for tuning
Applying hyperparameter tuning with RandomizedSearchCV
Fitting SVR-GARCH with a linear kernel to the data
Predicting the volatilities based on the last 250 observations and storing them in “predict_svr_lin”
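The tuning steps listed above can be tried in isolation on toy data. The sketch below is only an illustration under stated assumptions: `sp_rand` is assumed to be an alias for `scipy.stats.uniform`, the arrays `X_demo` and `y_demo` are synthetic stand-ins for the feature matrix and the target, and the values of `n_iter`, `cv`, and `random_state` are arbitrary choices made so that the run is short and reproducible.

```python
import numpy as np
from scipy.stats import uniform as sp_rand   # assumed meaning of `sp_rand`
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVR

# Synthetic stand-ins for the features and the target (assumptions).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 0.6 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(0, 0.01, size=200)

n_test = 50  # hold out the last 50 observations

# Each hyperparameter is drawn from a uniform distribution on [0, 1].
param_dist = {'gamma': sp_rand(), 'C': sp_rand(), 'epsilon': sp_rand()}

search = RandomizedSearchCV(SVR(kernel='linear'), param_dist,
                            n_iter=10, cv=3, random_state=0)
search.fit(X_demo[:-n_test], y_demo[:-n_test])

# Predict the held-out block with the best estimator found by the search.
pred = search.predict(X_demo[-n_test:])
rmse = np.sqrt(np.mean((y_demo[-n_test:] - pred) ** 2))
print(search.best_params_, rmse)
```

Fixing `random_state` makes the sampled hyperparameters repeatable; without it, each run of a randomized search can return a different best model and therefore a different RMSE.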
Figure 4-7 shows the predicted values alongside the actual observations. By eyeballing the plot, one can tell that SVR-GARCH performs well. As you might guess, the linear kernel works well if the dataset is linearly separable, and it is also the choice suggested by Occam’s Razor2. But what if the dataset is not linearly separable? Let’s continue with the RBF and polynomial kernels. The former uses elliptical curves around the observations, and the latter, unlike the first two, also focuses on combinations of samples. Let’s now see how they work.
The SVR-GARCH application with the RBF kernel, a function that projects data into a new vector space, can be found below. From a practical standpoint, applying SVR-GARCH with different kernels is not a labor-intensive process: all we need to do is switch the kernel name.
In [49]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_rbf, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_rbf = clf.predict(X.iloc[-n:])

In [50]: predict_svr_rbf = pd.DataFrame(predict_svr_rbf)
         predict_svr_rbf.index = ret.iloc[-n:].index

In [51]: rmse_svr_rbf = np.sqrt(mean_squared_error(realized_vol.iloc[-n:] / 100,
                                                   predict_svr_rbf / 100))
         print('The RMSE value of SVR with RBF Kernel is {:.4f}'
               .format(rmse_svr_rbf))
         The RMSE value of SVR with RBF Kernel is 0.0058

In [52]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_rbf / 100, label='Volatility Prediction-SVR_GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (RBF)', fontsize=12)
         plt.legend()
         plt.savefig('images/svr_garch_rbf.png')
         plt.show()
Both the RMSE score and the visualization suggest that SVR-GARCH with a linear kernel outperforms SVR-GARCH with an RBF kernel. The RMSEs of SVR-GARCH with linear and RBF kernels are 0.0010 and 0.0058, respectively. In addition, the application with the linear kernel is able to capture the huge spike in mid-2020 corresponding to the Covid-19 pandemic.
Lastly, SVR-GARCH with a polynomial kernel is employed, but it turns out to have the highest RMSE, implying that it is the worst-performing kernel among these three applications.
In [53]: para_grid = {'gamma': sp_rand(),
                      'C': sp_rand(),
                      'epsilon': sp_rand()}
         clf = RandomizedSearchCV(svr_poly, para_grid)
         clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_poly = clf.predict(X.iloc[-n:])

In [54]: predict_svr_poly = pd.DataFrame(predict_svr_poly)
         predict_svr_poly.index = ret.iloc[-n:].index

In [55]: rmse_svr_poly = np.sqrt(mean_squared_error(realized_vol.iloc[-n:] / 100,
                                                    predict_svr_poly / 100))
         print('The RMSE value of SVR with Polynomial Kernel is {:.4f}'
               .format(rmse_svr_poly))
         The RMSE value of SVR with Polynomial Kernel is 0.1090

In [56]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(predict_svr_poly / 100, label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Polynomial)', fontsize=12)
         plt.legend()
         plt.savefig('images/svr_garch_poly.png')
         plt.show()
A neural network (NN) is the building block of deep learning. In an NN, data is processed in multiple stages in order to make a decision. Each neuron takes the result of a dot product as input and passes it through an activation function to make a decision:
latexmath:[$z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b$]
where b is bias, w is weight, and x is input data.
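As a small illustration of what a single neuron computes, the sketch below forms the dot product of weights and inputs, adds the bias, and applies a rectified linear unit. The function name `neuron` and all numeric values are hypothetical and chosen only so that the arithmetic is easy to follow by hand.

```python
import numpy as np

def neuron(x, w, b):
    """Return the linear transformation z and its ReLU activation."""
    z = np.dot(w, x) + b      # weighted sum of the inputs plus the bias
    a = max(z, 0.0)           # ReLU: negative values are clipped to zero
    return z, a

x = np.array([0.5, -1.0, 2.0])   # input data (hypothetical)
w = np.array([0.2, 0.4, 0.1])    # weights (hypothetical)

z_pos, a_pos = neuron(x, w, b=0.3)    # weighted sum is -0.1, so z is about 0.2
z_neg, a_neg = neuron(x, w, b=-0.3)   # z is about -0.4, so the neuron outputs 0
print(z_pos, a_pos, z_neg, a_neg)
```

The second call shows why the activation matters: the same inputs and weights produce no output at all once the bias pushes the linear transformation below zero.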
During this process, the input data undergoes various mathematical manipulations in the hidden and output layers. Generally speaking, an NN has three types of layers:
Input layer
Hidden layer
Output layer
The input layer holds the raw data. In going from the input layer to the hidden layer, we learn coefficients. There may be one or more hidden layers, depending on the network structure. The more hidden layers the network has, the more complicated it is. Hidden layers, located between the input and output layers, perform nonlinear transformations via activation functions.
Finally, the output layer is the layer in which the output is produced and the decision is made.
In machine learning, gradient descent is the tool applied to minimize the cost function, but employing gradient descent on its own in a neural network is not feasible due to the chain-like structure of the network. Thus, a concept known as backpropagation is proposed to minimize the cost function. The idea of backpropagation rests on calculating the error between the predicted and actual output and passing this error back to the hidden layer. So, we move backward, and the main equation takes the form of:
where z is linear transformation and
Now we apply neural network-based volatility prediction using the “MLPRegressor” class from scikit-learn, even though we have various other options3 for running neural networks in Python. In the following network structure, the hidden layer has 100 neurons and the maximum number of iterations is set to 1,000. The optimization procedure iterates until convergence or until this limit is reached, using the default activation function, the rectified linear unit.
In [57]: from sklearn.neural_network import MLPRegressor
         clf = MLPRegressor(hidden_layer_sizes=100, max_iter=1000,
                            learning_rate_init=0.001)

In [58]: clf.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
Out[58]: MLPRegressor(hidden_layer_sizes=100, max_iter=1000)

In [59]: NN_predictions = clf.predict(X.iloc[-n:])

In [60]: NN_predictions = pd.DataFrame(NN_predictions)
         NN_predictions.index = ret.iloc[-n:].index

In [61]: rmse_NN = np.sqrt(mean_squared_error(realized_vol.iloc[-n:] / 100,
                                              NN_predictions / 100))
         print('The RMSE value of NN is {:.4f}'.format(rmse_NN))
         The RMSE value of NN is 0.0022

In [62]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(NN_predictions / 100, label='Volatility Prediction-NN')
         plt.title('Volatility Prediction with Neural Network',
                   fontsize=12, fontweight=0)
         plt.legend()
         plt.savefig('images/NN.png')
         plt.show()
Importing the “MLPRegressor” class
Configuring the neural network model
Fitting the neural network model to the training data4
Predicting the volatilities based on the last 250 observations and storing them in the “NN_predictions” variable
Figure 4-10 shows the volatility prediction result based on the neural network model. Despite its reasonable performance, we can play with the number of hidden neurons to find the best-fitting neural network model. To do that, we can use the Keras library, a Python interface for artificial neural networks.
Now it is time to predict volatility using deep learning. With Keras, it is easy to configure the network structure: all we need to do is determine the number of neurons in each layer. Here, the numbers of neurons in the first and second hidden layers are 256 and 128, respectively. As volatility is a continuous variable, we have only one output neuron.
In [63]: import tensorflow as tf
         from tensorflow import keras
         from tensorflow.keras import layers

In [64]: model = keras.Sequential(
             [layers.Dense(256, activation="relu"),
              layers.Dense(128, activation="relu"),
              layers.Dense(1, activation="linear"),])

In [65]: model.compile(loss='mse', optimizer='rmsprop')

In [66]: epochs_trial = np.arange(100, 400, 4)
         batch_trial = np.arange(100, 400, 4)
         models = np.zeros((4, 1))
         for i, j, k in zip(range(4), epochs_trial, batch_trial):
             model.fit(X.iloc[:-n].values,
                       realized_vol.iloc[1:-(n-1)].values.reshape(-1,),
                       batch_size=k, epochs=j, verbose=False)
             DL_predict = model.predict(np.asarray(X.iloc[-n:]))
             DL_RMSE = np.sqrt((np.array(realized_vol.iloc[-n:]) / 100 -
                                DL_predict.flatten() / 100) ** 2).mean()
             print('DL_RMSE_{}:{}'.format(i+1, DL_RMSE))
         print('Minimum DL_RMSE:{}'.format(DL_RMSE.min()))
         DL_RMSE_1:0.00685412675103267
         DL_RMSE_2:0.006896319277362136
         DL_RMSE_3:0.007131744702991328
         DL_RMSE_4:0.007362087484620285
         Minimum DL_RMSE:0.007362087484620285

In [67]: DL_RMSE = np.sqrt((np.array(realized_vol.iloc[-n:]) / 100 -
                            DL_predict.flatten() / 100) ** 2).mean()
         print('The Average value of RMSE of DL model is {:.4f}'.format(DL_RMSE))
         The Average value of RMSE of DL model is 0.0074

In [68]: DL_predict = pd.DataFrame(DL_predict)
         DL_predict.index = ret.iloc[-n:].index

In [69]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(DL_predict / 100, label='Volatility Prediction-DL')
         plt.title('Volatility Prediction with Deep Learning', fontsize=12)
         plt.legend()
         plt.savefig('images/DL.png')
         plt.show()
Configuring the network structure by deciding the number of layers and neurons
Compiling the model with a loss function and an optimizer
Deciding the epoch and batch sizes using “np.arange”
Fitting the deep learning model
Predicting the volatility based on the weights obtained from the training phase
Calculating the RMSE score by flattening the predictions
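Two details are easy to get wrong when an RMSE is computed by hand with NumPy. First, subtracting a flat array of shape (n,) from a one-column array of shape (n, 1) broadcasts to an n-by-n matrix instead of n errors. Second, taking the square root before the mean yields a mean absolute error rather than an RMSE. The helper below is a minimal sketch that guards against both; the function name `rmse` and the toy numbers are hypothetical.

```python
import numpy as np
import pandas as pd

def rmse(actual, predicted):
    """Root mean squared error after flattening both inputs to 1-D."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same number of elements")
    # Square the errors, average them, and only then take the square root.
    return np.sqrt(np.mean((actual - predicted) ** 2))

# A one-column DataFrame and a column vector, as produced by typical workflows.
actual = pd.DataFrame({'vol': [1.0, 2.0, 3.0]})
predicted = np.array([[1.0], [2.0], [5.0]])

score = rmse(actual, predicted)
print(score)
```

Here the errors are 0, 0, and 2, so the mean squared error is 4/3 and the score is its square root.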
It turns out that we get the minimum RMSE score as we increase the layer size, which is quite understandable because, most of the time, the number of layers and model performance go hand in hand up to the point at which the model starts to overfit. Figuring out the proper number of layers for a specific dataset is key in deep learning, in the sense that we should stop adding layers before the model runs into the overfitting problem.
Figure 4-11 shows the volatility prediction result derived from the code above, and it implies that deep learning, too, provides a strong tool for modeling volatility.
The way we approach probability is of central importance in the sense that it distinguishes the classical (or frequentist) and Bayesian approaches. According to the former, the relative frequency will converge to the true probability. The Bayesian approach, however, is based on a subjective interpretation. Unlike frequentists, Bayesian statisticians consider the probability distribution as uncertain, and it is revised as new information comes in.
Because these two approaches interpret probability differently, the likelihood, defined as the probability of an observed event given a set of parameters, is computed differently.
Starting from the joint density function, we can give the mathematical representation of the likelihood function:
Among possible
In fact, you are already familiar with the method based on this approach, which is maximum likelihood estimation. Having defined the main difference between the Bayesian and frequentist approaches, it is time to delve further into Bayes’ Theorem.
The Bayesian approach is based on conditional distributions, in which probability gauges the degree of belief one holds about an uncertain event. So, the Bayesian approach suggests a rule that can be used to update the beliefs that one holds in light of new information (Rachev et al., 2008).
Bayesian estimation is used when we have some prior information regarding a parameter. For example, before looking at a sample to estimate the mean latexmath:[$\mu$] of a distribution, we may have some prior belief that it is close to 2, between 1 and 3. Such prior beliefs are especially important when we have a small sample. In such a case, we are interested in combining what the data tells us, namely, the value calculated from the sample, and our prior information.
Alpaydin (2020)
Similar to the frequentist application, Bayesian estimation is based on probability density
In the light of this information, we can estimate
or
where
Finally,
Consequently, Bayes’ Theorem suggests that the posterior density is directly proportional to the prior and likelihood terms but inversely related to the evidence term. As the evidence is there only for scaling, we can describe this process as:
where
Within this context, Bayes’ Theorem sounds attractive, doesn’t it? Well, it does, but it comes at a cost, which is analytical intractability. Even if Bayes’ Theorem is theoretically intuitive, it is, by and large, hard to solve analytically. This is the major drawback to the wide applicability of Bayes’ Theorem. The good news, however, is that numerical methods provide solid ways to solve this probabilistic model.
So, some methods have been proposed to deal with the computational issues in Bayes’ Theorem. These methods provide approximate solutions and can be listed as:
Quadrature approximation
Maximum a posteriori estimation
Grid Approach
Sampling Based Approach
Metropolis-Hastings
Gibbs Sampler
No U-Turn Sampler
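To give a feel for how such approximations work, here is a minimal sketch of the grid approach from the list above, applied to the mean of a normal distribution with a prior belief that the mean is close to 2. The data values, the known standard deviation, the prior standard deviation, and the grid limits are all assumptions made for illustration. Because this toy problem also has a closed-form conjugate solution, the grid result can be checked against it.

```python
import numpy as np
from scipy.stats import norm

data = np.array([2.9, 3.4, 2.6, 3.1, 3.3])   # hypothetical small sample
sigma = 1.0                                   # data standard deviation, assumed known
prior_mean, prior_sd = 2.0, 0.5               # prior belief: mean close to 2

# Evaluate prior and likelihood on a grid of candidate values for the mean.
grid = np.linspace(0.0, 5.0, 2001)
log_prior = norm.logpdf(grid, prior_mean, prior_sd)
log_lik = norm.logpdf(data[:, None], grid[None, :], sigma).sum(axis=0)

# Posterior is proportional to likelihood times prior; normalising the
# weights plays the role of the evidence term.
log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())
post /= post.sum()
grid_mean = (grid * post).sum()

# Closed-form posterior mean for the normal-normal conjugate case.
precision = 1 / prior_sd ** 2 + len(data) / sigma ** 2
exact_mean = (prior_mean / prior_sd ** 2 + data.sum() / sigma ** 2) / precision

print(grid_mean, exact_mean)
```

The posterior mean lands between the prior mean of 2 and the sample mean of 3.06, which is exactly the combination of prior information and data that the quotation above describes. The grid approach scales poorly as the number of parameters grows, which is one reason sampling-based methods are preferred for richer models.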
Of these approaches, I will restrict my attention to the Metropolis-Hastings algorithm, which will be our method for modeling Bayes’ Theorem. The Metropolis-Hastings (M-H) method rests upon Markov Chain Monte Carlo (MCMC). Also, maximum a posteriori estimation will be discussed in Chapter 6. So, before moving forward, it would be better to talk about the MCMC method.
Markov Chain is a model for us to describe the transition probabilities among states, which is a rule of a game. A chain is called Markovian if the probability of current state
Thus, MCMC relies on Markov Chain to find the parameter space
where D refers to distributional approximation. Realized values of the parameter space can be used to make inferences about the posterior. In a nutshell, the MCMC method helps us gather samples from the posterior density so that we can calculate the posterior probability.
To illustrate, we can refer to Figure 4-12, which shows the probability of moving from one state to another. For the sake of simplicity, several of the transition probabilities are set to 0.2; for instance, the transition from studying to travelling has a probability of 0.2.
In [70]: import quantecon as qe
         from quantecon import MarkovChain
         import networkx as nx
         from pprint import pprint

In [71]: P = [[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]]
         mc = qe.MarkovChain(P, ('studying', 'travelling', 'sleeping'))
         mc.is_irreducible
Out[71]: True

In [72]: states = ['studying', 'travelling', 'sleeping']
         initial_probs = [0.5, 0.3, 0.6]
         state_space = pd.Series(initial_probs, index=states, name='states')

In [73]: q_df = pd.DataFrame(columns=states, index=states)
         q_df.loc[states[0]] = [0.5, 0.2, 0.3]
         q_df.loc[states[1]] = [0.2, 0.3, 0.5]
         q_df.loc[states[2]] = [0.2, 0.2, 0.6]

In [74]: def _get_markov_edges(Q):
             edges = {}
             for col in Q.columns:
                 for idx in Q.index:
                     edges[(idx, col)] = Q.loc[idx, col]
             return edges
         edges_wts = _get_markov_edges(q_df)
         pprint(edges_wts)
         {('sleeping', 'sleeping'): 0.6,
          ('sleeping', 'studying'): 0.2,
          ('sleeping', 'travelling'): 0.2,
          ('studying', 'sleeping'): 0.3,
          ('studying', 'studying'): 0.5,
          ('studying', 'travelling'): 0.2,
          ('travelling', 'sleeping'): 0.5,
          ('travelling', 'studying'): 0.2,
          ('travelling', 'travelling'): 0.3}

In [75]: G = nx.MultiDiGraph()
         G.add_nodes_from(states)
         for k, v in edges_wts.items():
             tmp_origin, tmp_destination = k[0], k[1]
             G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
         pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
         nx.draw_networkx(G, pos)
         edge_labels = {(n1, n2): d['label']
                        for n1, n2, d in G.edges(data=True)}
         nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
         nx.drawing.nx_pydot.write_dot(G, 'mc_states.dot')
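The transition matrix used above also tells us where the chain settles in the long run. The sketch below is an illustrative addition that uses only NumPy: it propagates a starting distribution through the matrix repeatedly and compares the result with the stationary distribution obtained from the left eigenvector associated with the eigenvalue 1. The starting state and the number of steps are arbitrary assumptions.

```python
import numpy as np

states = ['studying', 'travelling', 'sleeping']
P = np.array([[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]])

# Each row of a transition matrix must sum to one.
assert np.allclose(P.sum(axis=1), 1.0)

# Start in 'studying' with certainty and apply the transition rule 50 times.
dist = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    dist = dist @ P

# Stationary distribution: left eigenvector of P for the eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

for name, p_iter, p_eig in zip(states, dist, stationary):
    print(name, round(p_iter, 4), round(p_eig, 4))
```

Both routes agree, and the result does not depend on the starting state. This convergence to a fixed distribution is the property that MCMC exploits: the chain is constructed so that the distribution it settles into is the posterior.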
There are two common MCMC methods: Metropolis-Hastings and the Gibbs sampler. Here, we delve into the former.
Metropolis-Hastings (M-H) gives us an efficient sampling procedure with two steps: first, we draw a sample from the proposal density; second, we decide whether to accept or reject it.
Let
Select initial value for
Select a new parameter value
Compute the following acceptance probability:
If
Repeat from step 2.
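The steps above can be sketched in a few lines of Python. This is a minimal random-walk version under stated assumptions, not the implementation used by any library: the proposal is a symmetric normal density, so the acceptance probability reduces to the ratio of target densities (the Metropolis special case), and the target `log_target` is a hypothetical unnormalized normal density centred at 3. The function names, step size, seed, and burn-in length are illustrative choices.

```python
import numpy as np

def log_target(theta):
    # Hypothetical unnormalised log-posterior: normal with mean 3 and variance 1.
    return -0.5 * (theta - 3.0) ** 2

def metropolis_hastings(log_density, theta0, n_draws, step=2.0, seed=0):
    rng = np.random.default_rng(seed)
    draws = np.empty(n_draws)
    theta = theta0                       # step 1: initial value
    accepted = 0
    for t in range(n_draws):
        # Step 2: propose a new value from a symmetric proposal density.
        proposal = theta + step * rng.normal()
        # Step 3: log acceptance ratio (proposal terms cancel by symmetry).
        log_alpha = log_density(proposal) - log_density(theta)
        # Step 4: accept with probability min(1, exp(log_alpha)).
        if np.log(rng.uniform()) < log_alpha:
            theta = proposal
            accepted += 1
        draws[t] = theta                 # a rejected proposal repeats the old value
    return draws, accepted / n_draws

draws, acc_rate = metropolis_hastings(log_target, theta0=0.0, n_draws=20000)
posterior_draws = draws[2000:]           # discard the burn-in period
print(posterior_draws.mean(), posterior_draws.std(), acc_rate)
```

Working with log densities avoids numerical underflow, and discarding the early draws removes the influence of the arbitrary starting value. The retained draws should have a mean near 3 and a standard deviation near 1, matching the target.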
Well, it appears intimidating, but don’t worry. Python has built-in code that makes applying the M-H algorithm much easier. We use the “PyFlux” library to make use of Bayes’ Theorem. Let’s go ahead and apply the M-H algorithm to predict volatility.
In [76]: import pyflux as pf
         from scipy.stats import kurtosis

In [77]: model = pf.GARCH(ret.values, p=1, q=1)
         print(model.latent_variables)
         model.adjust_prior(1, pf.Normal())
         model.adjust_prior(2, pf.Normal())
         x = model.fit(method='M-H', iterations='1000')
         print(x.summary())

         Index    Latent Variable    Prior     Prior Hyperparameters    V.I. Dist  Transform
         ======== ================== ========= ======================== ========== ==========
         0        Vol Constant       Normal    mu0: 0, sigma0: 3        Normal     exp
         1        q(1)               Normal    mu0: 0, sigma0: 0.5      Normal     logit
         2        p(1)               Normal    mu0: 0, sigma0: 0.5      Normal     logit
         3        Returns Constant   Normal    mu0: 0, sigma0: 3        Normal     None
         Acceptance rate of Metropolis-Hastings is 0.11135
         Acceptance rate of Metropolis-Hastings is 0.1741
         Acceptance rate of Metropolis-Hastings is 0.25145

         Tuning complete! Now sampling.
         Acceptance rate of Metropolis-Hastings is 0.2579
         GARCH(1,1)
         ================================== =========================================
         Dependent Variable: Series         Method: Metropolis Hastings
         Start Date: 1                      Unnormalized Log Posterior: -3098.3787
         End Date: 2565                     AIC: 6204.757311703505
         Number of observations: 2565       BIC: 6228.156166733925
         ============================================================================
         Latent Variable       Median     Mean       95% Credibility Interval
         ===================== ========== ========== ===========================
         Vol Constant          0.0388     0.0389     (0.0308 | 0.0491)
         q(1)                  0.1926     0.1941     (0.1646 | 0.2282)
         p(1)                  0.7739     0.773      (0.7404 | 0.8022)
         Returns Constant      0.0841     0.0844     (0.0616 | 0.1074)
         ============================================================================
         None

In [78]: model.plot_z([1, 2])
         model.plot_fit(figsize=(15, 5))
         model.plot_ppc(T=kurtosis, nsims=1000)
Configuring the GARCH model using the PyFlux library
Printing the estimation of latent variables (parameters)
Adjusting the priors for the model latent variables
Fitting the model using the M-H process
Plotting the latent variables
Plotting the fitted model
Plotting the histogram for the posterior check
It is worthwhile to visualize the results of what we have done so far for volatility prediction with the Bayesian-based GARCH model.
Figure 4-13 exhibits the distribution of latent variables. Latent variable q gathers around 0.2, and the other latent variable, p, mostly takes values between 0.7 and 0.8.
Figure 4-14 shows the demeaned volatility series and the GARCH prediction result based on the Bayesian approach.
Figure 4-15 visualizes the posterior predictions of the Bayesian model together with the data so that we are able to detect systematic discrepancies, if any. The vertical line represents the test statistic, and it turns out that the observed value is larger than that of our model.
Now that we are done with the training part, we are all set to move on to the next phase, which is prediction. A total of 252 steps are predicted, the result is compared with realized volatility, and the RMSE is computed.
In [79]: bayesian_prediction = model.predict_is(n, fit_method='M-H')
         Acceptance rate of Metropolis-Hastings is 0.10645
         Acceptance rate of Metropolis-Hastings is 0.16775
         Acceptance rate of Metropolis-Hastings is 0.2475

         Tuning complete! Now sampling.
         Acceptance rate of Metropolis-Hastings is 0.2435

In [80]: bayesian_RMSE = np.sqrt((np.array(realized_vol.iloc[-n:]) / 100 -
                                  bayesian_prediction.values / 100) ** 2).mean()
         print('The RMSE of Bayesian model is {:.4f}'.format(bayesian_RMSE))
         The RMSE of Bayesian model is 0.0049

In [81]: bayesian_prediction.index = ret.iloc[-n:].index

In [82]: plt.figure(figsize=(10, 6))
         plt.plot(realized_vol / 100, label='Realized Volatility')
         plt.plot(bayesian_prediction['Series'] / 100,
                  label='Volatility Prediction-Bayesian')
         plt.title('Volatility Prediction with M-H Approach', fontsize=12)
         plt.legend()
         plt.savefig('images/bayesian.png')
         plt.show()
Finally, we are ready to look at the prediction result of the Bayesian approach, which the plotting code in In [82] produces for us.
Figure 4-16 visualizes the volatility prediction based on the Metropolis-Hastings-based Bayesian approach. It seems to overshoot toward mid-2020, but the overall performance of this method is not bad in the sense that it outperforms many of the models introduced here, with the exception of SVR-GARCH with a linear kernel and the neural network.
Volatility prediction is key to understanding the dynamics of financial markets in the sense that it helps us gauge uncertainty. It is also used as an input in many financial models, including risk models. These facts emphasize the importance of accurate volatility prediction. Traditionally, parametric methods such as ARCH, GARCH, and their extensions have been used extensively, but these models suffer from being inflexible. To remedy this issue, data-driven models are found promising, and this chapter attempts to make use of these models, namely support vector machines, neural networks, and deep learning-based models. It turns out that the data-driven models outperform the parametric models.
In the next chapter, market risk, a core financial risk topic, will be discussed from both theoretical and empirical standpoints, and machine learning models will be incorporated to further improve the estimation of this risk.
Articles cited in this chapter:
Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Paul Labys. “Modeling and forecasting realized volatility.” Econometrica 71, no. 2 (2003): 579-625.
Burnham, Kenneth P., and David R. Anderson. “A practical information-theoretic approach.” Model selection and multimodel inference 2 (2002).
Burnham, Kenneth P., and David R. Anderson. “Multimodel inference: understanding AIC and BIC in model selection.” Sociological methods & research 33, no. 2 (2004): 261-304.
Engle, Robert F. “Autoregressive conditional heteroskedasticity with estimates of the variance of UK inflation.” Econometrica 50, no. 4 (1982): 987-1008.
Mandelbrot, Benoit. “New methods in statistical economics.” Journal of political economy 71, no. 5 (1963): 421-440.
Books cited in this chapter:
Alpaydin, E., 2020. Introduction to machine learning. MIT press.
Focardi, Sergio M. Modeling the market: New theories and techniques. Vol. 14. John Wiley & Sons, 1997.
Rachev, Svetlozar T., John SJ Hsu, Biliana S. Bagasheva, and Frank J. Fabozzi. Bayesian methods in finance. Vol. 153. John Wiley & Sons, 2008.
Wilmott, Paul. Paul Wilmott on quantitative finance. John Wiley & Sons, 2013.
1 Conditional variance means that volatility estimation is a function of the past values of asset returns.
2 Occam’s Razor, also known as the “law of parsimony”, states that, given a set of explanations, the simplest explanation is the most plausible and likely one.
3 Of these alternatives, TensorFlow, PyTorch, and Neurolab are the most prominent libraries.
4 Please see the scikit-learn manual: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html