Chapter 4. Machine Learning-Based Volatility Prediction

The most critical feature of the conditional return distribution is arguably its second moment structure, which is empirically the dominant time-varying characteristic of the distribution. This fact has spurred an enormous literature on the modeling and forecasting of return volatility.

Andersen et al. (2003)

“Some concepts are easy to understand but hard to define. This also holds true for volatility.” This could be a quote from someone living before Markowitz, because his way of modeling volatility is very clear and intuitive. Markowitz proposed his celebrated portfolio theory, in which volatility is defined as standard deviation, and from then onward finance has become more intertwined with mathematics.

Volatility is the backbone of finance in the sense that it not only provides an information signal to investors but also serves as an input to various financial models. What makes volatility so important? The answer stresses the importance of uncertainty, which is the main characteristic of financial modeling.

There is a long tradition in finance of predicting volatility using ARCH- and GARCH-type models, which nevertheless have certain drawbacks that may cause them to fail, e.g., in the presence of volatility clustering, information asymmetry, and so on. Even though these issues are addressed by different model variants, machine learning models have not been used extensively in the literature.

In this chapter, our aim is to show how we can enhance predictive performance using machine learning-based models. We will visit various machine learning algorithms, namely support vector regression, neural networks, and deep learning, so that we are able to compare their predictive performance.

Modeling volatility amounts to modeling uncertainty so that we better understand and approach it, which enables us to have a good enough approximation of the real world. In order to gauge the extent to which a proposed model accounts for the real situation, we need to calculate the return volatility, also known as realized volatility. Realized volatility is the square root of realized variance, which is the sum of squared returns. Realized volatility is used to gauge the performance of the volatility prediction methods. Here is the formula for return volatility:

\hat{\sigma} = \sqrt{\frac{1}{n-1} \sum_{t=1}^{n} (r_t - \mu)^2}

where r_t and μ are the return and the mean return, respectively, and n is the number of observations.

Let’s see how return volatility is computed in Python:

In [1]: import numpy as np
        from scipy.stats import norm
        import scipy.optimize as opt
        import yfinance as yf
        import pandas as pd
        import datetime
        import time
        from arch import arch_model
        import matplotlib.pyplot as plt
        from numba import jit
        from sklearn.metrics import mean_squared_error
        import warnings
        warnings.filterwarnings('ignore')

In [2]: stocks = '^GSPC'
        start = datetime.datetime(2010,1,1)
        end = datetime.datetime(2020,3,15)
        s_p500 = yf.download(stocks,start=start,end = end, interval='1d')
        [*********************100%***********************]  1 of 1 downloaded

In [3]: ret=100*(s_p500.pct_change()[1:]['Adj Close'])1
        realized_vol=ret.rolling(5).std()

In [4]: plt.figure(figsize=(10,6))
        plt.plot(realized_vol.index,realized_vol)
        plt.title('Realized Volatility- S&P-500')
        plt.ylabel('Volatility')
        plt.xlabel('Date')
        plt.savefig('images/realized_vol.png')
        plt.show()
1

Calculating the returns of S&P-500 based on adjusted closing prices

Figure 4-1 shows the realized volatility of the S&P 500 over the period 2010-2020. What is striking are the spikes that occur around periods of market stress, most notably the Covid-19 sell-off in early 2020.

Figure 4-1. Realized Volatility- S&P-500

The way volatility is estimated has an undeniable impact on the reliability and accuracy of the related analysis. So, this chapter deals with both classical and ML-based volatility prediction techniques with a view to showing the superior prediction performance of the ML-based models. In order to have a benchmark for the ML-based models, we start by modeling the classical volatility models. Some very well-known classical volatility models include, but are not limited to:

  • ARCH

  • GARCH

  • GJR-GARCH

  • EGARCH

It is time to dig into the classical volatility models. Let’s start with the ARCH model.

ARCH Model

One of the early attempts to model volatility was proposed by Engle (1982) and is known as the ARCH model. The ARCH model is a univariate model based on historical asset returns. The ARCH(p) model has the following form:

\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k (r_{t-k})^2

where

r_t = \sigma_t \epsilon_t

where ϵ_t is assumed to be normally distributed. In this parametric model, we need to satisfy some assumptions to have a strictly positive variance. In this respect, the following conditions should hold:

  • ω>0

  • α_k ≥ 0

All these equations tell us that ARCH is a univariate and nonlinear model in which volatility is estimated from squared past returns. One of the most distinctive features of ARCH is that it has the property of time-varying conditional variance1, so ARCH is able to model the phenomenon known as volatility clustering, that is, large changes tend to be followed by large changes of either sign, and small changes tend to be followed by small changes, as put by Benoit Mandelbrot (1963). Hence, when an important announcement arrives in the market, it may result in huge volatility.
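
To build intuition for how the ARCH recursion generates volatility clustering, here is a minimal simulation sketch of an ARCH(1) process; the parameter values 0.5 and 0.4 are arbitrary choices for illustration, not estimates:

    import numpy as np
    np.random.seed(42)

    omega, alpha = 0.5, 0.4  # illustrative ARCH(1) parameters (assumed, not estimated)
    T = 1000
    eps = np.random.standard_normal(T)
    sigma2 = np.zeros(T)
    r = np.zeros(T)
    sigma2[0] = omega / (1 - alpha)  # unconditional variance as a starting value
    for t in range(1, T):
        sigma2[t] = omega + alpha * r[t - 1] ** 2  # ARCH(1) recursion
        r[t] = np.sqrt(sigma2[t]) * eps[t]  # r_t = sigma_t * eps_t

Plotting the simulated r reveals quiet stretches interrupted by bursts of large moves, which is exactly the clustering pattern described above.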

The following code shows how to plot clustering and what it looks like:

In [5]: retv=100*ret.values1
        date=pd.bdate_range(start='1/1/2010', end='3/15/2020')2

In [6]: plt.figure(figsize=(10,6))
        plt.plot(s_p500.index[1:],ret)
        plt.title('Volatility clustering of S&P-500')
        plt.ylabel('Daily returns')
        plt.xlabel('Date')
        plt.savefig('images/vol_clustering.png')
        plt.show()
1

Return DataFrame into Numpy representation

2

Creating a business-day date range covering the sample period

Similar to the spikes in realized volatility, Figure 4-2 shows some large movements and, unsurprisingly, these ups and downs happen around important events such as the Covid-19 pandemic in early 2020.

Figure 4-2. Volatility Clustering- S&P-500

Despite its appealing features, such as simplicity, nonlinearity, ease of use, and adjustment for forecasting, the ARCH model has certain drawbacks, which can be listed as:

  • Equal response to the positive and negative shocks

  • Strong assumptions such as restrictions on parameters

  • Possible misprediction due to slow-adjustment to large movements

These drawbacks motivated researchers to work on extensions of the ARCH model, and the GARCH model was proposed by Bollerslev (1986) and Taylor (2008).

Now, we will employ the ARCH model to predict volatility, but first let’s write our own Python code and then compare it with the built-in Python function to see the difference.

In [7]: n=2521
        split_date=ret.iloc[-n:].index2

In [8]: sgm2=ret.var()3
        K=ret.kurtosis()4
        alpha=(-3.0*sgm2+np.sqrt(9.0*sgm2**2-12.0*(3.0*sgm2-K)*K))/(6*K)5
        omega=(1-alpha)*sgm26
        initial_parameters = [alpha,omega]
        omega,alpha
Out[8]: (0.5579500203177984, 0.44568819417572125)

In [9]: @jit(nopython=True,parallel=True)7
        def arch_likelihood(initial_parameters,retv):
            omega = abs(initial_parameters[0])8
            alpha =abs(initial_parameters[1])8
            T=len(retv)
            logliks=0
            sigma2 = np.zeros(T)
            sigma2[0]=np.var(retv)9
            for t in range(1,T):
                sigma2[t]=omega + alpha * (retv[t - 1])**2 10
            logliks=np.sum(0.5*(np.log(sigma2)+retv**2/sigma2))11
            return logliks


In [10]: logliks=arch_likelihood(initial_parameters,retv)
         logliks
Out[10]: 414610.07621266536

In [11]: def opt_params(x0, retv):
             opt_result = opt.minimize(arch_likelihood,x0=x0,args = (retv),method='Nelder-Mead',options={'maxiter': 5000})12
             params = opt_result.x13
             print('\nResults of Nelder-Mead minimization\n{}\n{}'.format(''.join(['-'] * 28), opt_result))
             print('\nResulting params = {}'.format(params))
             return params

In [12]: params=opt_params(initial_parameters, retv)

         Results of Nelder-Mead minimization
         ----------------------------
          final_simplex: (array([[6.66113178e+03, 3.11668324e-01],
                [6.66113182e+03, 3.11668301e-01],
         [6.66113170e+03, 3.11668310e-01]]), array([12899.46225979, 12899.46225979,
          12899.46225979]))
                    fun: 12899.462259792343
                message: 'Optimization terminated successfully.'
                   nfev: 256
                    nit: 130
                 status: 0
                success: True
                      x: array([6.66113178e+03, 3.11668324e-01])

         Resulting params = [6.66113178e+03 3.11668324e-01]

In [13]: def arch_apply(ret):
                 omega =params[0]
                 alpha =params[1]
                 T=len(ret)
                 sigma2_arch = np.zeros(T + 1)
                 sigma2_arch[0] = np.var(ret)
                 for t in range(1, T):
                     sigma2_arch[t] = omega + alpha * ret[t - 1] ** 2
                 return sigma2_arch

In [14]: sigma2_arch=arch_apply(ret)
1

Defining the split location, i.e., the number of observations (252) reserved for the test period

2

Assigning the dates of the test period to the split_date variable

3

Calculating the variance of the S&P 500 returns

4

Calculating the kurtosis of the S&P 500 returns

5

Identifying the initial value for the slope coefficient α

6

Identifying the initial value for the constant term ω

7

Using parallel processing to decrease the processing time

8

Taking absolute values and assigning the initial values to the related variables

9

Identifying the initial value of the variance

10

Iterating the variance of the S&P 500

11

Calculating the log-likelihood

12

Minimizing the log-likelihood function

13

Creating a params variable for the optimized parameters

Well, we have modeled volatility via ARCH using our own optimization method and the ARCH equation. How does it compare with the built-in Python code? This built-in function can be imported from the arch library and is extremely easy to apply. The result of the built-in function is provided below, and it turns out that the two results are very similar to each other.

In [15]: arch = arch_model(ret,mean='zero',vol='ARCH',p=1).fit(disp='off')
         print(arch.summary())
         Zero Mean - ARCH Model Results

         ======================================================================
          ========
         Dep. Variable:              Adj Close   R-squared:
          0.000
         Mean Model:                 Zero Mean   Adj. R-squared:
          0.000
         Vol Model:                       ARCH   Log-Likelihood:
          -3440.61
         Distribution:                  Normal   AIC:
          6885.22
         Method:            Maximum Likelihood   BIC:
          6896.92
         No. Observations:                 2566

         Date:                Mon, May 03 2021   Df Residuals:
          2564
         Time:                        11:20:45   Df Model:
            2
                                     Volatility Model
         ========================================================================
                          coef    std err          t      P>|t|  95.0% Conf. Int.
         ------------------------------------------------------------------------
         omega          0.6663  4.827e-02     13.803  2.441e-43 [  0.572,  0.761]
         alpha[1]       0.3124  5.880e-02      5.312  1.082e-07 [  0.197,  0.428]
         ========================================================================

         Covariance estimator: robust

Although developing our own code is always helpful and improves our understanding, the beauty of the built-in function is not restricted to its simplicity. Finding the optimal lag value with the built-in function is another advantage, along with its optimized running procedure.

All we need is to create a for loop and define a proper information criterion. Below, the Bayesian Information Criterion (BIC) is chosen as the model selection method and used to select the lag. The reason BIC is picked is that, as long as we have large enough samples, BIC is a reliable tool for model selection, as discussed by Burnham and Anderson (2002, 2004). Now, we iterate the ARCH model over lags 1 to 4.

In [16]: bic_arch=[]
         for p in range(1,5):1
                 arch = arch_model(ret,mean='zero',vol='ARCH',p=p).fit(disp='off')2
                 bic_arch.append(arch.bic)
                 if arch.bic==np.min(bic_arch):3
                     best_param=p
         arch = arch_model(ret,mean='Constant',vol='ARCH',p=p).fit(disp='off')4
         print(arch.summary())
         forecast=arch.forecast(start=split_date[0])5
         forecast_arch=forecast
         Constant Mean - ARCH Model Results

         ======================================================================
          ========
         Dep. Variable:              Adj Close   R-squared:
          -0.002
         Mean Model:             Constant Mean   Adj. R-squared:
          -0.002
         Vol Model:                       ARCH   Log-Likelihood:
          -3146.52
         Distribution:                  Normal   AIC:
          6305.03
         Method:            Maximum Likelihood   BIC:
          6340.14
         No. Observations:                 2566

         Date:                Mon, May 03 2021   Df Residuals:
          2560
         Time:                        11:20:48   Df Model:
            6
         Mean Model

         ======================================================================
          ====
         coef    std err          t      P>|t|    95.0% Conf. Int.

         ----------------------------------------------------------------------
          ----
         mu             0.0834  1.443e-02      5.779  7.515e-09 [5.511e-02,
          0.112]
         Volatility Model

         ======================================================================
          ====
         coef    std err          t      P>|t|    95.0% Conf. Int.

         ----------------------------------------------------------------------
          ----
         omega          0.2463  2.505e-02      9.832  8.214e-23   [  0.197,
          0.295]
         alpha[1]       0.1699  3.935e-02      4.317  1.584e-05 [9.274e-02,
          0.247]
         alpha[2]       0.2138  3.848e-02      5.557  2.745e-08   [  0.138,
          0.289]
         alpha[3]       0.2166  4.191e-02      5.166  2.385e-07   [  0.134,
          0.299]
         alpha[4]       0.2064  4.595e-02      4.491  7.075e-06   [  0.116,
          0.296]
         ======================================================================
          ====

         Covariance estimator: robust

In [17]: rmse_arch = np.sqrt(mean_squared_error(realized_vol[-n:]/100, np.sqrt(forecast_arch.variance.iloc[-len(split_date):]/100)))6
         print('The RMSE value of ARCH model is {:.4f}'.format(rmse_arch))
         The RMSE value of ARCH model is 0.1116

In [18]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(forecast_arch.variance.iloc[-len(split_date):]/100,label='Volatility Prediction-ARCH')
         plt.title('Volatility Prediction with ARCH', fontsize=12)
         plt.legend()
         plt.savefig('images/arch.png')
         plt.show()
1

Iterating ARCH parameter p over specified interval

2

Running ARCH model with different p values

3

Finding the minimum Bayesian Information Criteria score to select the best model

4

Running ARCH model with the best p value

5

Forecasting the volatility based on the optimized ARCH model

6

Calculating the RMSE score

The result of volatility prediction based on our first model is shown in Figure 4-3.

Figure 4-3. Volatility Prediction with ARCH

GARCH Model

The GARCH model is an extension of the ARCH model that incorporates lagged conditional variances. In other words, ARCH is improved by adding p lagged conditional variances, which turns the conditional variance into an autoregressive moving average-type structure with q lagged squared returns and p lagged conditional variances. GARCH(p,q) can be formulated as:

\sigma_t^2 = \omega + \sum_{k=1}^{q} \alpha_k r_{t-k}^2 + \sum_{k=1}^{p} \beta_k \sigma_{t-k}^2

where ω, α, and β are parameters to be estimated and p and q are the maximum lags in the model. In order to have a stationary GARCH process with positive variance, the following conditions should hold:

  • ω>0

  • β ≥ 0

  • α ≥ 0

  • β+α<1

A finite-order ARCH model is unable to capture the influence of the whole history of innovations. However, being a more parsimonious model, the GARCH model can account for the full history of innovations because it can be expressed as an infinite-order ARCH. Let’s show how GARCH can be written as an ARCH model of infinite order.

\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2

Then replace \sigma_{t-1}^2 with \omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2:

\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta(\omega + \alpha r_{t-2}^2 + \beta \sigma_{t-2}^2)
= \omega(1 + \beta) + \alpha r_{t-1}^2 + \beta \alpha r_{t-2}^2 + \beta^2 \sigma_{t-2}^2

Now, if we substitute \sigma_{t-2}^2 with \omega + \alpha r_{t-3}^2 + \beta \sigma_{t-3}^2 and keep iterating, we end up with:

\sigma_t^2 = \omega(1 + \beta + \beta^2 + \cdots) + \alpha \sum_{k=1}^{\infty} \beta^{k-1} r_{t-k}^2
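
As a quick sanity check of this representation, the following sketch compares the GARCH(1,1) recursion with its truncated infinite-order ARCH counterpart on simulated data; the parameter values and returns are made up purely for illustration:

    import numpy as np
    np.random.seed(0)

    omega, alpha, beta = 0.1, 0.1, 0.8  # illustrative GARCH(1,1) parameters (assumed)
    r = np.random.standard_normal(500)  # placeholder returns, just for the comparison

    # GARCH(1,1) recursion
    sigma2 = np.zeros(len(r))
    sigma2[0] = omega / (1 - alpha - beta)  # unconditional variance as a starting value
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]

    # Truncated infinite-order ARCH representation, evaluated at the last observation
    t, K = len(r) - 1, 100  # truncate the infinite sum at 100 lags
    arch_inf = omega / (1 - beta) + alpha * sum(beta ** (k - 1) * r[t - k] ** 2
                                                for k in range(1, K + 1))
    print(sigma2[t], arch_inf)  # the two values should be nearly identical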

Similar to the ARCH model, there is more than one way to model volatility using GARCH in Python. Let us first try to develop our own Python code using an optimization technique. After that, the arch library will be used to predict volatility.

In [19]: a0=0.0001
         sgm2=ret.var()
         K=ret.kurtosis()
         h=1-alpha/sgm2
         alpha=np.sqrt(K*(1-h**2)/(2.0*(K+3)))
         beta=np.abs(h-omega)
         omega=(1-omega)*sgm2
         initial_parameters = np.array([omega,alpha,beta])
         print('Initial parameters for omega, alpha, and beta are\n{}\n{}\n{}'.format(omega, alpha, beta))
         Initial parameters for omega, alpha, and beta  are
         0.444951365916522
         0.519448113662449
         0.0007320236366448185

In [20]: retv=ret.values

In [21]: @jit(nopython=True,parallel=True)
         def garch_likelihood(initial_parameters,retv):
             omega = initial_parameters[0]
             alpha =initial_parameters[1]
             beta =initial_parameters[2]
             T=len(retv)
             logliks=0
             sigma2 = np.zeros(T)
             sigma2[0]=np.var(retv)
             for t in range(1,T):
                 sigma2[t]=omega + alpha * (retv[t - 1])**2+beta*sigma2[t-1]
             logliks=np.sum(0.5*(np.log(sigma2)+retv**2/sigma2))
             return logliks

In [22]: logliks=garch_likelihood(initial_parameters,retv)
         print('The Log likelihood  is {:.4f}'.format(logliks))
         The Log likelihood  is 1143.7064

In [23]: def garch_constraint(initial_parameters):
             omega = initial_parameters[0]
             alpha = initial_parameters[1]
             beta = initial_parameters[2]
             return np.array([1 - alpha - beta])

In [24]: bounds = [(0.0,1.0), (0.0,1.0), (0.0,1.0)]

In [25]: def opt_paramsG(initial_parameters, retv):
             opt_result = opt.minimize(garch_likelihood,x0=initial_parameters,constraints=np.array([1-alpha-beta]),bounds=bounds,args = (retv),method='Nelder-Mead',options={'maxiter': 5000})
             params = opt_result.x
             print('\nResults of Nelder-Mead minimization\n{}\n{}'.format('-' * 35, opt_result))
             print('-' * 35)
             print('\nResulting parameters = {}'.format(params))
             return params

In [26]: params=opt_paramsG(initial_parameters, retv)

         Results of Nelder-Mead minimization
         -----------------------------------
          final_simplex: (array([[0.03754678, 0.17239775, 0.79143287],
                [0.03753309, 0.17240623, 0.7914592 ],
                [0.03754475, 0.17232523, 0.79150698],
         [0.03757972, 0.172482  , 0.79134976]]), array([760.48013548, 760.4801371 ,
          760.48014785, 760.48014867]))
                    fun: 760.4801354810166
                message: 'Optimization terminated successfully.'
                   nfev: 250
                    nit: 141
                 status: 0
                success: True
                      x: array([0.03754678, 0.17239775, 0.79143287])
         -----------------------------------

         Resulting parameters = [0.03754678 0.17239775 0.79143287]

In [27]: def garch_apply(ret):
                 omega =params[0]
                 alpha =params[1]
                 beta =params[2]
                 T=len(ret)
                 sigma2 = np.zeros(T + 1)
                 sigma2[0] = np.var(ret)
                 for t in range(1, T):
                     sigma2[t] = omega + alpha * ret[t - 1] ** 2+beta*sigma2[t-1]
                 return sigma2

The parameters we get from our own GARCH code are approximately:

  • ω: 0.0375

  • α: 0.1724

  • β: 0.7913

The following built-in Python function confirms that we did a great job, on the grounds that the parameters obtained from the built-in function are quite similar to those from our own code. So, we have learned how to code ARCH and GARCH models to predict volatility.

In [28]: garch = arch_model(ret,mean='zero',vol='GARCH',p=1,o=0,q=1).fit(disp='off')
         print(garch.summary())
         Zero Mean - GARCH Model Results

         ======================================================================
          ========
         Dep. Variable:              Adj Close   R-squared:
          0.000
         Mean Model:                 Zero Mean   Adj. R-squared:
          0.000
         Vol Model:                      GARCH   Log-Likelihood:
          -3118.48
         Distribution:                  Normal   AIC:
          6242.96
         Method:            Maximum Likelihood   BIC:
          6260.51
         No. Observations:                 2566

         Date:                Mon, May 03 2021   Df Residuals:
          2563
         Time:                        11:21:04   Df Model:
            3
         Volatility Model

         ======================================================================
          ======
         coef    std err          t      P>|t|      95.0% Conf. Int.

         ----------------------------------------------------------------------
          ------
         omega          0.0376  8.599e-03      4.367  1.259e-05
          [2.070e-02,5.441e-02]
         alpha[1]       0.1724  2.295e-02      7.514  5.756e-14     [  0.127,
          0.217]
         beta[1]        0.7914  2.347e-02     33.714 3.572e-249     [  0.745,
          0.837]
         ======================================================================
          ======

         Covariance estimator: robust

It is apparent that it is easy to work with GARCH(1,1), but how do we know that these lag orders are the optimal ones? Let us decide on the optimal lag configuration based on the lowest BIC value.

In [29]: bic_garch=[]
         for p in range(1,5):
             for q in range(1,5):
                 garch = arch_model(ret,mean='zero',vol='GARCH',p=p,o=0,q=q).fit(disp='off')
                 bic_garch.append(garch.bic)
                 if garch.bic==np.min(bic_garch):
                     best_param=p,q
         garch = arch_model(ret,mean='zero',vol='GARCH',p=p,o=0,q=q).fit(disp='off')
         print(garch.summary())
         forecast=garch.forecast(start=split_date[0])
         forecast_garch=forecast
         Zero Mean - GARCH Model Results

         ======================================================================
          ========
         Dep. Variable:              Adj Close   R-squared:
          0.000
         Mean Model:                 Zero Mean   Adj. R-squared:
          0.000
         Vol Model:                      GARCH   Log-Likelihood:
          -3114.97
         Distribution:                  Normal   AIC:
          6247.95
         Method:            Maximum Likelihood   BIC:
          6300.60
         No. Observations:                 2566

         Date:                Mon, May 03 2021   Df Residuals:
          2557
         Time:                        11:21:06   Df Model:
            9
         Volatility Model

         ======================================================================
          =====
         coef    std err          t      P>|t|     95.0% Conf. Int.

         ----------------------------------------------------------------------
          -----
         omega          0.1027  5.964e-02      1.723  8.498e-02 [-1.416e-02,
          0.220]
         alpha[1]       0.1306  3.561e-02      3.667  2.456e-04  [6.078e-02,
          0.200]
         alpha[2]       0.1659      0.113      1.468      0.142 [-5.561e-02,
          0.387]
         alpha[3]       0.0900      0.131      0.686      0.492    [ -0.167,
          0.347]
         alpha[4]       0.0804  5.161e-02      1.557      0.119 [-2.078e-02,
          0.182]
         beta[1]    2.3949e-15      0.883  2.713e-15      1.000    [ -1.730,
          1.730]
         beta[2]        0.2298      0.333      0.691      0.490    [ -0.422,
          0.882]
         beta[3]    8.3894e-15      0.449  1.868e-14      1.000    [ -0.880,
          0.880]
         beta[4]        0.2058      0.139      1.478      0.139 [-6.715e-02,
          0.479]
         ======================================================================
          =====

         Covariance estimator: robust

In [30]: rmse_garch = np.sqrt(mean_squared_error(realized_vol[-n:]/100, np.sqrt(forecast_garch.variance.iloc[-len(split_date):]/100)))
         print('The RMSE value of GARCH model is {:.4f}'.format(rmse_garch))
         The RMSE value of GARCH model is 0.1027

In [31]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(forecast_garch.variance.iloc[-len(split_date):]/100,label='Volatility Prediction-GARCH')
         plt.title('Volatility Prediction with GARCH', fontsize=12)
         plt.legend()
         plt.savefig('images/garch.png')
         plt.show()
Figure 4-4. Volatility Prediction with GARCH

So, as shown, GARCH is able to capture the effect of all historical shocks on the contemporaneous conditional variance. However, like ARCH, it responds to positive and negative shocks symmetrically. To remedy this issue, the GJR-GARCH model was proposed by Glosten, Jagannathan, and Runkle (1993).

GJR-GARCH

This model performs well in modeling the asymmetric effects of announcements. The equation of the model includes one more parameter, γ, and it takes the following form:

\sigma_t^2 = \omega + \sum_{k=1}^{q} \left( \alpha_k r_{t-k}^2 + \gamma r_{t-k}^2 I(\epsilon_{t-k} < 0) \right) + \sum_{k=1}^{p} \beta_k \sigma_{t-k}^2

where I(·) is an indicator function equal to 1 when the shock is negative and 0 otherwise, and γ controls for the asymmetry of the announcements (a toy numerical illustration follows the list). If

  • γ=0, then the response to the past shock is the same

  • γ>0, then the response to the past negative shock is stronger than that of a positive one

  • γ<0, then the response to the past positive shock is stronger than that of a negative one
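
To make the asymmetry concrete, here is a toy calculation with made-up parameter values (purely illustrative, not estimates) showing how positive and negative shocks of the same size feed into the next-period variance; after the sketch, we fit GJR-GARCH via the arch library, again selecting the lag orders with BIC:

    # Toy GJR-GARCH(1,1): next-period variance after a +1 and a -1 shock
    omega, alpha, gamma, beta = 0.05, 0.05, 0.2, 0.85  # assumed values for illustration
    sigma2_prev = 1.0  # previous conditional variance

    for shock in (1.0, -1.0):
        indicator = 1.0 if shock < 0 else 0.0
        sigma2_next = omega + (alpha + gamma * indicator) * shock ** 2 + beta * sigma2_prev
        print('shock = {:+.1f} -> next variance = {:.3f}'.format(shock, sigma2_next))
    # The negative shock raises the variance more, reflecting the asymmetric response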

In [32]: bic_gjr_garch=[]
         for p in range(1,5):
             for q in range(1,5):
                 gjrgarch = arch_model(ret,mean='zero',p=p,o=1,q=q).fit(disp='off')
                 bic_gjr_garch.append(gjrgarch.bic)
                 if gjrgarch.bic==np.min(bic_gjr_garch):
                     best_param=p,q
         gjrgarch = arch_model(ret,mean='zero',p=p,o=1,q=q).fit(disp='off')
         print(gjrgarch.summary())
         forecast=gjrgarch.forecast(start=split_date[0])
         forecast_gjrgarch=forecast
         Zero Mean - GJR-GARCH Model Results

         ======================================================================
          ========
         Dep. Variable:              Adj Close   R-squared:
          0.000
         Mean Model:                 Zero Mean   Adj. R-squared:
          0.000
         Vol Model:                  GJR-GARCH   Log-Likelihood:
          -3030.12
         Distribution:                  Normal   AIC:
          6080.24
         Method:            Maximum Likelihood   BIC:
          6138.74
         No. Observations:                 2566

         Date:                Mon, May 03 2021   Df Residuals:
          2556
         Time:                        11:21:10   Df Model:
           10
         Volatility Model

         ======================================================================
          =======
         coef    std err          t      P>|t|       95.0% Conf. Int.

         ----------------------------------------------------------------------
          -------
         omega          0.0390     43.311  9.006e-04      0.999      [-84.850,
          84.928]
         alpha[1]   5.8308e-15     86.209  6.764e-17      1.000
          [-1.690e+02,1.690e+02]
         alpha[2]   4.5942e-15     64.320  7.143e-17      1.000
          [-1.261e+02,1.261e+02]
         alpha[3]   2.7611e-15     54.949  5.025e-17      1.000
          [-1.077e+02,1.077e+02]
         alpha[4]   8.2718e-16    120.430  6.869e-18      1.000
          [-2.360e+02,2.360e+02]
         gamma[1]       0.3242    329.839  9.830e-04      0.999
          [-6.461e+02,6.468e+02]
         beta[1]        0.8073   2333.868  3.459e-04      1.000
          [-4.573e+03,4.575e+03]
         beta[2]    1.9236e-08   1719.001  1.119e-11      1.000
          [-3.369e+03,3.369e+03]
         beta[3]    1.7134e-16   1709.038  1.003e-19      1.000
          [-3.350e+03,3.350e+03]
         beta[4]    3.2047e-16   1355.064  2.365e-19      1.000
          [-2.656e+03,2.656e+03]
         ======================================================================
          =======

         Covariance estimator: robust

In [33]: rmse_gjr_garch = np.sqrt(mean_squared_error(realized_vol[-n:]/100, np.sqrt(forecast_gjrgarch.variance.iloc[-len(split_date):]/100)))
         print('The RMSE value of GJR-GARCH models is {:.4f}'.format(rmse_gjr_garch))
         The RMSE value of GJR-GARCH models is 0.1144

In [34]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(forecast_gjrgarch.variance.iloc[-len(split_date):]/100,label='Volatility Prediction-GJR-GARCH')
         plt.title('Volatility Prediction with GJR-GARCH', fontsize=12)
         plt.legend()
         plt.savefig('images/gjr_garch.png')
         plt.show()
Figure 4-5. Volatility Prediction with GJR-GARCH

EGARCH

Together with the GJR-GARCH model, EGARCH, proposed by Nelson (1991), is also a tool for controlling for the effect of asymmetric announcements. Additionally, as it is specified in logarithmic form, there is no need to impose restrictions to avoid negative volatility.

\log(\sigma_t^2) = \omega + \sum_{k=1}^{p} \beta_k \log \sigma_{t-k}^2 + \sum_{k=1}^{q} \alpha_k \frac{|r_{t-k}|}{\sqrt{\sigma_{t-k}^2}} + \sum_{k=1}^{q} \gamma_k \frac{r_{t-k}}{\sqrt{\sigma_{t-k}^2}}

The main difference of the EGARCH equation is that the logarithm is taken of the variance on the left-hand side of the equation. The γ parameter captures the leverage effect, meaning that there exists a negative correlation between past asset returns and volatility: if γ < 0, it implies a leverage effect, and if γ ≠ 0, it indicates asymmetry in volatility.

In [35]: bic_egarch=[]
         for p in range(1,5):
             for q in range(1,5):
                 egarch = arch_model(ret,mean='zero',vol='EGARCH',p=p,o=1,q=q).fit(disp='off')
                 bic_egarch.append(egarch.bic)
                 if egarch.bic==np.min(bic_egarch):
                     best_param=p,q
         egarch = arch_model(ret,mean='zero',vol='EGARCH',p=p,o=1,q=q).fit(disp='off')
         print(egarch.summary())
         forecast=egarch.forecast(start=split_date[0])
         forecast_egarch=forecast
         Zero Mean - EGARCH Model Results

         ======================================================================
          ========
         Dep. Variable:              Adj Close   R-squared:
          0.000
         Mean Model:                 Zero Mean   Adj. R-squared:
          0.000
         Vol Model:                     EGARCH   Log-Likelihood:
          -3005.07
         Distribution:                  Normal   AIC:
          6030.15
         Method:            Maximum Likelihood   BIC:
          6088.65
         No. Observations:                 2566

         Date:                Mon, May 03 2021   Df Residuals:
          2556
         Time:                        11:21:14   Df Model:
           10
         Volatility Model

         ======================================================================
          =======
         coef    std err          t      P>|t|       95.0% Conf. Int.

         ----------------------------------------------------------------------
          -------
         omega         -0.0143  7.763e-03     -1.847  6.469e-02
          [-2.956e-02,8.741e-04]
         alpha[1]       0.0685  6.360e-02      1.077      0.282   [-5.618e-02,
          0.193]
         alpha[2]       0.0558  8.367e-02      0.667      0.505      [ -0.108,
          0.220]
         alpha[3]       0.0107  7.159e-02      0.149      0.881      [ -0.130,
          0.151]
         alpha[4]       0.0851  4.927e-02      1.728  8.396e-02   [-1.142e-02,
          0.182]
         gamma[1]      -0.2763  4.216e-02     -6.554  5.587e-11      [ -0.359,
          -0.194]
         beta[1]        0.8652      0.160      5.420  5.970e-08      [  0.552,
          1.178]
         beta[2]    1.8427e-15      0.189  9.760e-15      1.000      [ -0.370,
          0.370]
         beta[3]        0.0597      0.193      0.310      0.757      [ -0.318,
          0.437]
         beta[4]    2.1382e-15      0.176  1.215e-14      1.000      [ -0.345,
          0.345]
         ======================================================================
          =======

         Covariance estimator: robust

In [36]: rmse_egarch = np.sqrt(mean_squared_error(realized_vol[-n:]/100, np.sqrt(forecast_egarch.variance.iloc[-len(split_date):]/100)))
         print('The RMSE value of EGARCH models is {:.4f}'.format(rmse_egarch))
         The RMSE value of EGARCH models is 0.0987

In [37]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(forecast_egarch.variance.iloc[-len(split_date):]/100,label='Volatility Prediction-EGARCH')
         plt.title('Volatility Prediction with EGARCH', fontsize=12)
         plt.legend()
         plt.savefig('images/egarch.png')
         plt.show()
Figure 4-6. Volatility Prediction with EGARCH

Up to now, we have discussed the classical volatility models, but from this point on we will see how machine learning and the Bayesian approach can be used to model volatility. In the context of machine learning, support vector machines and neural networks will be the first models to visit. Let’s get started.

Support Vector Regression-GARCH

Support Vector Machines (SVM) is a supervised learning algorithm applicable to both classification and regression. The aim of SVM is to find a line that separates two classes. It sounds easy, but here is the challenging part: there are almost infinitely many lines that can be used to distinguish the classes. What we are looking for is the optimal line by which the classes can be best discriminated.

In linear algebra jargon, the optimal line is called a “hyperplane,” which maximizes the distance to the points that are closest to the hyperplane but belong to different classes. These closest points are the support vectors, and the distance between them across the hyperplane is known as the “margin.” So, in SVM, what we are trying to do is maximize the margin between the support vectors.

SVM applied to classification is labeled Support Vector Classification (SVC). Keeping all the characteristics of SVM, the method is also applicable to regression. In regression, the aim is to find the hyperplane that minimizes the error while keeping the margin. This method is called Support Vector Regression (SVR), and, in this part, we will apply it to the GARCH model. Combining these two models yields a different name: “SVR-GARCH.”
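
As a warm-up, here is a minimal sketch of plain SVR on synthetic data, just to illustrate the epsilon-insensitive fit before we move on to SVR-GARCH; the data and hyperparameters are made up for illustration:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X_toy = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)  # synthetic feature
    y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 80)  # noisy target

    # Fit an RBF-kernel SVR; epsilon defines the tube within which errors are ignored
    svr_toy = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X_toy, y_toy)
    print('Number of support vectors:', len(svr_toy.support_))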

The following code shows the preparations before running SVR-GARCH in Python. The most crucial step here is to obtain the independent variables, which are realized volatility and the square of historical returns.

In [38]: from sklearn.svm import SVR
         from scipy.stats import uniform as sp_rand
         from sklearn.model_selection import RandomizedSearchCV
         from sklearn.metrics import mean_squared_error

In [39]: realized_vol=ret.rolling(5).std()1
         realized_vol=pd.DataFrame(realized_vol)
         realized_vol.reset_index(drop=True,inplace=True)

In [40]: returns_svm=ret**2
         returns_svm=returns_svm.reset_index()
         del returns_svm['Date']

In [41]: X=pd.concat([realized_vol,returns_svm],axis=1,ignore_index=True)
         X=X[4:].copy()
         X=X.reset_index()
         X.drop('index',axis=1,inplace=True)

In [42]: realized_vol=realized_vol.dropna().reset_index()
         realized_vol.drop('index',axis=1,inplace=True)

In [43]: svr_poly=SVR(kernel='poly')2
         svr_lin=SVR(kernel='linear')2
         svr_rbf=SVR(kernel='rbf')2
1

Computing realized volatility and assigning it to a new variable named “realized_vol”

2

Creating new variables for SVR models with different kernels

Let us run our first SVR-GARCH application with the linear kernel and see the result. Root mean squared error (RMSE) is the metric used for comparison.

In [44]: para_grid ={'gamma': sp_rand(),
         'C': sp_rand(),
         'epsilon': sp_rand()}1
         clf =RandomizedSearchCV(svr_lin,para_grid)2
         clf.fit(X.iloc[:-n].values,realized_vol.iloc[1:-(n-1)].values.reshape(-1,))3
         predict_svr_lin=clf.predict(X.iloc[-n:])4

In [45]: predict_svr_lin=pd.DataFrame(predict_svr_lin)
         predict_svr_lin.index=ret.iloc[-n:].index

In [46]: rmse_svr =np.sqrt(mean_squared_error(realized_vol.iloc[-n:]/100, predict_svr_lin/100))
         print('The RMSE value of SVR with Linear Kernel is {:.4f}'.format(rmse_svr))
         The RMSE value of SVR with Linear Kernel is 0.0010

In [47]: realized_vol.index=ret.iloc[4:].index

In [48]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(predict_svr_lin/100,label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Linear)', fontsize=12)
         plt.legend()
         plt.savefig('images/svr_garch_linear.png')
         plt.show()
1

Identifying the hyperparameter space for tuning

2

Applying hyperparameter tuning with RandomizedSearchCV

3

Fitting SVR-GARCH with linear kernel to data

4

Predicting the volatilities for the last 252 observations and storing them in “predict_svr_lin”

Figure 4-7. Volatility Prediction with SVR-GARCH Linear Kernel

Figure 4-7 exhibits the predicted values against the actual observations. By eyeballing it, one can tell that SVR-GARCH performs well. As you can guess, the linear kernel works fine if the dataset is linearly separable; it is also the suggestion of Occam’s Razor2. What if it is not? Let’s continue with the RBF and polynomial kernels. The former uses elliptical curves around the observations, and the latter, differently from the first two, also focuses on combinations of the samples. Let’s now see how they work.

The SVR-GARCH application with the RBF kernel, a function that projects the data into a new vector space, can be found below. From a practical standpoint, applying SVR-GARCH with different kernels is not a labor-intensive process; all we need to do is switch the kernel name.

In [49]: para_grid ={'gamma': sp_rand(),
         'C': sp_rand(),
         'epsilon': sp_rand()}
         clf =RandomizedSearchCV(svr_rbf,para_grid)
         clf.fit(X.iloc[:-n].values,realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_rbf=clf.predict(X.iloc[-n:])

In [50]: predict_svr_rbf=pd.DataFrame(predict_svr_rbf)
         predict_svr_rbf.index=ret.iloc[-n:].index

In [51]: rmse_svr_rbf =np.sqrt(mean_squared_error(realized_vol.iloc[-n:]/100, predict_svr_rbf/100))
         print('The RMSE value of SVR with RBF Kernel is  {:.4f}'.format(rmse_svr_rbf))
         The RMSE value of SVR with RBF Kernel is  0.0058

In [52]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(predict_svr_rbf/100,label='Volatility Prediction-SVR_GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (RBF)', fontsize=12)
         plt.legend()
         plt.savefig('images/svr_garch_rbf.png')
         plt.show()

Both the RMSE score and the visualization suggest that SVR-GARCH with the linear kernel outperforms the one with the RBF kernel. The RMSEs of SVR-GARCH with the linear and RBF kernels are 0.0010 and 0.0058, respectively. In addition, the application with the linear kernel is able to capture the huge spike in early 2020 corresponding to the Covid-19 pandemic.

Figure 4-8. Volatility Prediction with SVR-GARCH RBF Kernel

Lastly, SVR-GARCH with the polynomial kernel is employed, but it turns out that it has the highest RMSE, implying that it is the worst-performing kernel among these three applications.

In [53]: para_grid ={'gamma': sp_rand(),
         'C': sp_rand(),
         'epsilon': sp_rand()}
         clf =RandomizedSearchCV(svr_poly,para_grid)
         clf.fit(X.iloc[:-n].values,realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
         predict_svr_poly=clf.predict(X.iloc[-n:])

In [54]: predict_svr_poly=pd.DataFrame(predict_svr_poly)
         predict_svr_poly.index=ret.iloc[-n:].index

In [55]: rmse_svr_poly =np.sqrt(mean_squared_error(realized_vol.iloc[-n:]/100, predict_svr_poly/100))
         print('The RMSE value of SVR with Polynomial Kernel is {:.4f}'.format(rmse_svr_poly))
         The RMSE value of SVR with Polynomial Kernel is 0.1090

In [56]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(predict_svr_poly/100,label='Volatility Prediction-SVR-GARCH')
         plt.title('Volatility Prediction with SVR-GARCH (Polynomial)', fontsize=12)
         plt.legend()
         plt.savefig('images/svr_garch_poly.png')
         plt.show()
Figure 4-9. Volatility Prediction with SVR-GARCH Polynomial Kernel

Neural Network

Neural networks (NN) are the building block of deep learning. In an NN, data is processed in multiple stages in order to make a decision. Each neuron takes the result of a dot product plus a bias as input and passes it through an activation function to make a decision:

z=w1x1+w2x2+b

where b is bias, w is weight, and x is input data.

During this process, the input data undergoes various mathematical manipulations in the hidden and output layers. Generally speaking, an NN has three types of layers:

  • Input layer

  • Hidden layer

  • Output layer

The input layer includes the raw data. In going from the input layer to the hidden layer, we learn coefficients. There may be one or more hidden layers depending on the network structure. The more hidden layers the network has, the more complicated it is. Hidden layers, located between the input and output layers, perform nonlinear transformations via the activation function.

Finally, the output layer is the layer in which the output is produced and the decision is made.
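
A minimal forward-pass sketch of a single hidden layer with a ReLU activation, using made-up weights, may help make the layer mechanics concrete:

    import numpy as np

    x = np.array([0.5, -1.2])  # input layer: two features (made-up values)
    W1 = np.array([[0.1, 0.4], [-0.3, 0.2], [0.5, -0.1]])  # hidden layer weights (3 neurons)
    b1 = np.array([0.01, 0.02, 0.03])  # hidden layer biases
    W2 = np.array([0.2, -0.5, 0.7])  # output layer weights
    b2 = 0.05  # output layer bias

    z1 = W1 @ x + b1  # z = w . x + b for each hidden neuron
    a1 = np.maximum(z1, 0)  # ReLU activation
    y_hat = W2 @ a1 + b2  # single continuous output (e.g., a predicted volatility)
    print(y_hat)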

In machine learning, gradient descent is the tool applied to minimize the cost function, but computing the required gradients directly in a neural network is not straightforward due to its chain-like structure. Thus, a concept known as backpropagation is used to obtain these gradients and minimize the cost function. The idea of backpropagation rests upon calculating the error between the predicted and actual output and passing this error back to the hidden layers. So, we move backward through the network, and the main equation takes the form:

\delta_j^l = \frac{\partial J}{\partial z_j^l}

where z is the linear transformation and δ represents the error. There is much more to say, but to keep on track I stop here. For those who want to dig deeper into the math behind neural networks, please refer to Wilmott (2013) and Alpaydin (2020).
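
For intuition, here is a tiny numerical sketch of the chain rule behind backpropagation for a single linear neuron with a squared-error cost; the values are made up, and the analytic gradient is checked against a finite-difference approximation:

    import numpy as np

    x, y = 2.0, 1.0  # single input and target (made-up values)
    w, b = 0.3, 0.1  # current weight and bias

    def cost(w, b):
        z = w * x + b  # linear transformation
        return 0.5 * (z - y) ** 2  # squared-error cost J

    # Backpropagation: delta = dJ/dz, then the chain rule gives dJ/dw and dJ/db
    z = w * x + b
    delta = z - y  # dJ/dz
    grad_w, grad_b = delta * x, delta

    # Finite-difference check of dJ/dw
    eps = 1e-6
    grad_w_numeric = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
    print(grad_w, grad_w_numeric)  # the two numbers should be nearly identical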

Now, we apply neural-network-based volatility prediction using the MLPRegressor module from scikit-learn, even though we have various options3 for running neural networks in Python. In the following network structure, the number of hidden neurons is set to 100 in a single layer and the maximum number of iterations is set to 1000, so the optimization procedure runs until convergence or until that limit is reached, with the default rectified linear unit (ReLU) activation function.

In [57]: from sklearn.neural_network import MLPRegressor1
         clf = MLPRegressor(hidden_layer_sizes=100,max_iter=1000, learning_rate_init=0.001)2

In [58]: clf.fit(X.iloc[:-n].values, realized_vol.iloc[1:-(n-1)].values.reshape(-1,))3
Out[58]: MLPRegressor(hidden_layer_sizes=100, max_iter=1000)

In [59]: NN_predictions=clf.predict(X.iloc[-n:])4

In [60]: NN_predictions=pd.DataFrame(NN_predictions)
         NN_predictions.index=ret.iloc[-n:].index

In [61]: rmse_NN =np.sqrt(mean_squared_error(realized_vol.iloc[-n:]/100, NN_predictions/100))
         print('The RMSE value of NN is {:.4f}'.format(rmse_NN))
         The RMSE value of NN is 0.0022

In [62]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(NN_predictions/100,label='Volatility Prediction-NN')
         plt.title('Volatility Prediction with Neural Network', fontsize=12, fontweight=0)
         plt.legend()
         plt.savefig('images/NN.png')
         plt.show()
1

Importing “MLPRegressor” library

2

Configuring Neural Network model

3

Fitting Neural Network model to the training data4.

4

Predicting the volatilities for the last 252 observations and storing them in the “NN_predictions” variable

Figure 4-10 shows the volatility prediction result based on the neural network model. Despite its reasonable performance, we can play with the number of hidden neurons and layers to find the best-fitting neural network model. To do that, we can apply the Keras library, a Python interface for artificial neural networks.

Figure 4-10. Volatility Prediction with Neural Network

Now, it is time to predict volatility using deep learning. With Keras, it is easy to configure the network structure. All we need is to determine the number of neurons in each layer. Here, the numbers of neurons for the first and second hidden layers are 256 and 128, respectively. As volatility is a continuous variable, we have only one output neuron.

In [63]: import tensorflow as tf
         from tensorflow import keras
         from tensorflow.keras import layers

In [64]: model = keras.Sequential(
             [layers.Dense(256, activation="relu"),
              layers.Dense(128, activation="relu"),
                  layers.Dense(1, activation="linear"),])1

In [65]: model.compile(loss='mse', optimizer='rmsprop')2

In [66]: epochs_trial=np.arange(100,400,4)3
         batch_trial=np.arange(100,400,4)3
         models=np.zeros((4,1))
         for i,j,k in zip(range(4),epochs_trial,batch_trial):
             model.fit(X.iloc[:-n].values, realized_vol.iloc[1:-(n-1)].values.reshape(-1,),batch_size=k, epochs=j, verbose=False)4
             DL_predict=model.predict(np.asarray(X.iloc[-n:]))5
             DL_RMSE = np.sqrt((np.array(realized_vol.iloc[-n:])/100 - DL_predict.flatten()/100) ** 2).mean()6
             print('DL_RMSE_{}:{}'.format(i+1,DL_RMSE))
         print('Minimim DL_RMSE:{}'.format(DL_RMSE.min()))
         DL_RMSE_1:0.00685412675103267
         DL_RMSE_2:0.006896319277362136
         DL_RMSE_3:0.007131744702991328
         DL_RMSE_4:0.007362087484620285
         Minimim DL_RMSE:0.007362087484620285

In [67]: DL_RMSE = np.sqrt((np.array(realized_vol.iloc[-n:])/100 - DL_predict.flatten()/100) ** 2).mean()
         print('The Average value of RMSE of DL model is {:.4f}'.format(DL_RMSE))
         The Average value of RMSE of DL model is 0.0074

In [68]: DL_predict=pd.DataFrame(DL_predict)
         DL_predict.index=ret.iloc[-n:].index

In [69]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(DL_predict/100,label='Volatility Prediction-DL')
         plt.title('Volatility Prediction with Deep Learning',  fontsize=12)
         plt.legend()
         plt.savefig('images/DL.png')
         plt.show()
1

Configuring network structure by deciding number of layers and neurons

2

Compiling model with loss and optimizer

3

Deciding the epoch and batch size using “np.arange”

4

Fitting the deep learning model

5

Predicting the volatility based on the weights obtained from the training phase.

6

Calculating RMSE score by flattening the predictions

It turns out that increasing model complexity does not keep lowering the RMSE, which is quite understandable because, most of the time, the size of the network (and the amount of training) and model performance go hand in hand only up to a point, after which the model tends to overfit. Figuring out the proper configuration for a specific dataset is key in deep learning, in the sense that we stop adding complexity before the model runs into the overfitting problem.
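
One practical way to stop before overfitting sets in is early stopping on a validation split. A minimal sketch with Keras, reusing the X, realized_vol, and n objects created earlier (the patience value, validation fraction, and epoch/batch settings are arbitrary choices), might look like this:

    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.callbacks import EarlyStopping

    model_es = keras.Sequential(
        [layers.Dense(256, activation='relu'),
         layers.Dense(128, activation='relu'),
         layers.Dense(1, activation='linear')])
    model_es.compile(loss='mse', optimizer='rmsprop')

    # Stop training once the validation loss has not improved for 10 epochs
    early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
    model_es.fit(X.iloc[:-n].values,
                 realized_vol.iloc[1:-(n - 1)].values.reshape(-1,),
                 validation_split=0.2, epochs=300, batch_size=128,
                 callbacks=[early_stop], verbose=False)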

Figure 4-11 shows the volatility prediction result derived from the preceding code, and it implies that deep learning provides a strong tool for modeling volatility, too.

Figure 4-11. Volatility Prediction with Deep Learning

Bayesian Approach

The way we approach probability is of central importance in the sense that it distinguishes the classical (or frequentist) approach from the Bayesian one. According to the former, the relative frequency converges to the true probability. However, the Bayesian application is based on a subjective interpretation. Unlike frequentists, Bayesian statisticians consider the probability distribution itself as uncertain, and it is revised as new information comes in.

Due to the different interpretations of probability in these two approaches, the likelihood, defined as the probability of the observed event given a set of parameters, is interpreted differently.

Starting from the joint density function, we can give the mathematical representation of the likelihood function:

\mathcal{L}(\theta | x_1, x_2, \ldots, x_p) = \Pr(x_1, x_2, \ldots, x_p | \theta)

Among the possible θ values, what we are trying to do is decide which one is more likely, that is, which θ makes the observed data x_1, ..., x_p most probable under the proposed statistical model.

In fact, you are already familiar with the method based on this approach: maximum likelihood estimation. Having defined the main difference between the Bayesian and frequentist approaches, it is time to delve deeper into Bayes’ Theorem.
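
As a quick refresher, the following sketch evaluates the Gaussian log-likelihood of a small made-up sample over a grid of candidate means and picks the most likely one, which is exactly the maximum likelihood logic:

    import numpy as np
    from scipy.stats import norm

    data = np.array([1.2, 0.8, 1.5, 1.1, 0.9])  # made-up observations
    candidates = np.linspace(0, 2, 201)  # grid of candidate values for the mean

    # Log-likelihood of the data for each candidate mean, assuming unit variance
    loglik = [norm.logpdf(data, loc=m, scale=1).sum() for m in candidates]
    theta_hat = candidates[np.argmax(loglik)]
    print(theta_hat, data.mean())  # the MLE of the mean coincides with the sample mean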

Bayes’ Theorem

The Bayesian approach is based on conditional probability, according to which probability gauges the degree of belief one has about an uncertain event. So, the Bayesian application suggests a rule that can be used to update the beliefs one holds in light of new information (Rachev et al., 2008).

Bayesian estimation is used when we have some prior information regarding a parameter. For example, before looking at a sample to estimate the mean μ of a distribution, we may have some prior belief that it is close to 2, between 1 and 3. Such prior beliefs are especially important when we have a small sample. In such a case, we are interested in combining what the data tells us, namely, the value calculated from the sample, and our prior information.

Alpaydin (2020)

Similar to the frequentist application, Bayesian estimation is based on the probability density Pr(x|θ). However, as we have discussed before, Bayesian and frequentist methods treat the parameter set θ differently. Frequentists assume θ to be fixed, whereas in the Bayesian setting it is taken as a random variable, whose distribution is known as the prior density Pr(θ). Oh no, another new term, but no worries: it is easy to understand.

In light of this information, we can combine the likelihood ℒ(x|θ) with the prior density Pr(θ), and we come up with the following formula. The prior is employed when we need to estimate the conditional distribution of the parameters given the observations.

\Pr(\theta | x_1, x_2, \ldots, x_p) = \frac{\Pr(x_1, x_2, \ldots, x_p | \theta) \Pr(\theta)}{\Pr(x_1, x_2, \ldots, x_p)}

or

\Pr(\theta | \text{data}) = \frac{\mathcal{L}(\text{data} | \theta) \Pr(\theta)}{\Pr(\text{data})}

where

  • Pr(θ|data) is the posterior density, which gives us information about the parameters given the observed data.

  • ℒ(data|θ) is the likelihood function, which gives the probability of the data given the parameters.

  • Pr(θ) is the prior probability, i.e., the probability distribution of the parameters. The prior basically encodes the initial beliefs about the estimates.

  • Finally, Pr(data) is the evidence, which serves as a scaling term.

Consequently, Bayes’ Theorem suggests that the posterior density is directly proportional to the prior and likelihood terms but inversely related to the evidence term. As the evidence is only there for scaling, we can describe this process as:

\text{Posterior} \propto \text{Likelihood} \times \text{Prior}

where ∝ means “is proportional to.”
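
A tiny grid-based example may make this proportionality concrete: for a coin with unknown heads probability θ, a uniform prior, and some observed tosses, the unnormalized posterior is just the likelihood times the prior evaluated on a grid (the data here are made up):

    import numpy as np
    from scipy.stats import binom

    heads, tosses = 7, 10  # made-up data: 7 heads out of 10 tosses
    theta_grid = np.linspace(0.01, 0.99, 99)  # grid of candidate values for theta

    prior = np.ones_like(theta_grid)  # uniform prior over the grid
    likelihood = binom.pmf(heads, tosses, theta_grid)  # Pr(data | theta)
    posterior_unnorm = likelihood * prior  # the posterior is proportional to this
    posterior = posterior_unnorm / posterior_unnorm.sum()  # scaling by the evidence
    print(theta_grid[np.argmax(posterior)])  # posterior mode, close to 0.7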

Within this context, Bayes’ Theorem sounds attractive, doesn’t it? Well, it does, but it comes with a cost, which is analytical intractability. Even if Bayes’ Theorem is theoretically intuitive, it is, by and large, hard to solve analytically. This is the major drawback for the wide applicability of Bayes’ Theorem. The good news, however, is that numerical methods provide solid ways of solving this probabilistic model.

So, some methods have been proposed to deal with the computational issue in Bayes’ Theorem. These methods provide approximate solutions, and they can be listed as:

  • Quadrature approximation

  • Maximum a posteriori estimation

  • Grid Approach

  • Sampling Based Approach

  • Metropolis-Hastings

  • Gibbs Sampler

  • No U-Turn Sampler

Of these approaches, I will restrict my attention to the Metropolis-Hastings algorithm, which will be our method for applying Bayes’ Theorem. The Metropolis-Hastings (M-H) method rests upon Markov Chain Monte Carlo (MCMC). Also, maximum a posteriori estimation will be discussed in Chapter 6. So, before moving forward, it would be better to talk about the MCMC method.

Markov Chain Monte Carlo

A Markov chain is a model used to describe the transition probabilities among states; it is the rule of the game, so to speak. A chain is called Markovian if the probability of the current state s_t depends only on the most recent state s_{t-1}:

\Pr(s_t | s_{t-1}, s_{t-2}, \ldots, s_{t-p}) = \Pr(s_t | s_{t-1})

Thus, MCMC relies on a Markov chain to explore the region of the parameter space θ with the highest posterior probability. As the number of samples grows, the sampled parameter values approximate the posterior density:

\lim_{j \to +\infty} \theta^{j} \stackrel{D}{\to} \Pr(\theta | x)

where D refers to distributional approximation. The realized values of the parameter space can be used to make inferences about the posterior. In a nutshell, the MCMC method helps us gather samples from the posterior density so that we can calculate posterior quantities.

To illustrate, we can refer to Figure 4-12. This figure shows the probabilities of moving from one state to another. For instance, the transition from studying to travelling has a probability of 0.2.

In [70]: import quantecon as qe
         from quantecon import MarkovChain
         import networkx as nx
         from pprint import pprint

In [71]: P = [[0.5, 0.2, 0.3],
              [0.2, 0.3, 0.5],
              [0.2, 0.2, 0.6]]

         mc = qe.MarkovChain(P, ('studying', 'travelling', 'sleeping'))
         mc.is_irreducible
Out[71]: True

In [72]: states = ['studying', 'travelling', 'sleeping']
         initial_probs = [0.5, 0.3, 0.6]
         state_space = pd.Series(initial_probs, index=states, name='states')

In [73]: q_df = pd.DataFrame(columns=states, index=states)
         q_df.loc[states[0]] = [0.5, 0.2, 0.3]
         q_df.loc[states[1]] = [0.2, 0.3, 0.5]
         q_df.loc[states[2]] = [0.2, 0.2, 0.6]

In [74]: def _get_markov_edges(Q):
             edges = {}
             for col in Q.columns:
                 for idx in Q.index:
                     edges[(idx,col)] = Q.loc[idx,col]
             return edges
         edges_wts = _get_markov_edges(q_df)
         pprint(edges_wts)
         {('sleeping', 'sleeping'): 0.6,
          ('sleeping', 'studying'): 0.2,
          ('sleeping', 'travelling'): 0.2,
          ('studying', 'sleeping'): 0.3,
          ('studying', 'studying'): 0.5,
          ('studying', 'travelling'): 0.2,
          ('travelling', 'sleeping'): 0.5,
          ('travelling', 'studying'): 0.2,
          ('travelling', 'travelling'): 0.3}

In [75]: G = nx.MultiDiGraph()
         G.add_nodes_from(states)
         for k, v in edges_wts.items():
             tmp_origin, tmp_destination = k[0], k[1]
             G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)

         pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
         nx.draw_networkx(G, pos)
         edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)}
         nx.draw_networkx_edge_labels(G , pos, edge_labels=edge_labels)
         nx.drawing.nx_pydot.write_dot(G, 'mc_states.dot')
Figure 4-12. Interactions of different states
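
As a side note, once the mc object (the quantecon MarkovChain created in the preceding code) is available, we can also simulate the chain and inspect its long-run behavior, which is the property that MCMC exploits; a minimal sketch:

    # Reusing the mc object (quantecon MarkovChain) defined above
    sample_path = mc.simulate(ts_length=10, init='studying')  # a short simulated path
    print(sample_path)
    print(mc.stationary_distributions)  # long-run probabilities of the three states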

There are two common MCMC methods: Metropolis-Hastings and Gibbs Sampler. Here, we delve into the former one.

Metropolis-Hastings

Metropolis-Hastings (M-H) allows us to have an efficient sampling procedure with two steps: first, we draw a sample from the proposal density, and, in the second step, we decide whether to accept or reject it.

Let $q(\theta \mid \theta^{t-1})$ be the proposal density and $\theta$ the parameter space. The entire M-H algorithm can be summarized as follows (a minimal sketch in code appears after the list):

  1. Select an initial value $\theta^{1}$ from the parameter space $\theta$.

  2. Propose a new parameter value $\theta^{*}$ from the proposal density, which can be, for the sake of simplicity, a Gaussian or uniform distribution.

  3. Compute the following acceptance probability:

     $$\Pr_a(\theta^{*}, \theta^{t-1}) = \min\left(1, \frac{p(\theta^{*}) / q(\theta^{*} \mid \theta^{t-1})}{p(\theta^{t-1}) / q(\theta^{t-1} \mid \theta^{*})}\right)$$

  4. Accept $\theta^{*}$ if $\Pr_a(\theta^{*}, \theta^{t-1})$ is greater than a sample value drawn from the uniform distribution $U(0,1)$; otherwise, keep the previous value.

  5. Repeat from step 2 until the desired number of draws is collected.
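
To make these steps concrete, here is a minimal sketch of a random-walk M-H sampler for a one-dimensional target density. The function name metropolis_hastings, the Gaussian proposal, and the toy target density are illustrative assumptions; this is not the PyFlux implementation used below.

    import numpy as np

    def metropolis_hastings(p, n_samples=5000, init=0.0, scale=1.0, seed=0):
        """Minimal random-walk M-H sampler for an unnormalized 1-D density p."""
        rng = np.random.default_rng(seed)
        theta = init
        draws = np.empty(n_samples)
        for i in range(n_samples):
            theta_star = theta + rng.normal(0, scale)       # step 2: propose a candidate
            accept_prob = min(1, p(theta_star) / p(theta))  # step 3: q terms cancel for a symmetric proposal
            if rng.uniform(0, 1) < accept_prob:             # step 4: accept or reject
                theta = theta_star
            draws[i] = theta                                # step 5: repeat
        return draws

    # Toy example: sample from an unnormalized standard normal density
    samples = metropolis_hastings(lambda x: np.exp(-0.5 * x ** 2))
    print(samples.mean(), samples.std())  # should be roughly 0 and 1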

Well, it may appear intimidating, but don’t worry: built-in Python code makes applying the M-H algorithm much easier. We use the PyFlux library to make use of Bayes’ Theorem. Let’s apply the M-H algorithm to predict volatility.

In [76]: import pyflux as pf
         from scipy.stats import kurtosis

In [77]: model = pf.GARCH(ret.values,p=1,q=1)1
         print(model.latent_variables)2
         model.adjust_prior(1, pf.Normal())3
         model.adjust_prior(2, pf.Normal())3
         x = model.fit(method='M-H', iterations=1000)4
         print(x.summary())
         Index    Latent Variable           Prior           Prior Hyperparameters     V.I. Dist  Transform
         ======== ========================= =============== ========================= ========== ==========
         0        Vol Constant              Normal          mu0: 0, sigma0: 3         Normal     exp
         1        q(1)                      Normal          mu0: 0, sigma0: 0.5       Normal     logit
         2        p(1)                      Normal          mu0: 0, sigma0: 0.5       Normal     logit
         3        Returns Constant          Normal          mu0: 0, sigma0: 3         Normal     None
         Acceptance rate of Metropolis-Hastings is 0.11135
         Acceptance rate of Metropolis-Hastings is 0.1741
         Acceptance rate of Metropolis-Hastings is 0.25145

         Tuning complete! Now sampling.
         Acceptance rate of Metropolis-Hastings is 0.2579
         GARCH(1,1)

         ==========================================================================================
         Dependent Variable: Series                              Method: Metropolis Hastings
         Start Date: 1                                           Unnormalized Log Posterior: -3098.3787
         End Date: 2565                                          AIC: 6204.757311703505
         Number of observations: 2565                            BIC: 6228.156166733925
         ==========================================================================================
         Latent Variable                          Median             Mean               95% Credibility Interval
         ======================================== ================== ================== =========================
         Vol Constant                             0.0388             0.0389             (0.0308 | 0.0491)
         q(1)                                     0.1926             0.1941             (0.1646 | 0.2282)
         p(1)                                     0.7739             0.773              (0.7404 | 0.8022)
         Returns Constant                         0.0841             0.0844             (0.0616 | 0.1074)
         ==========================================================================================
         None

In [78]: model.plot_z([1,2])5
         model.plot_fit(figsize=(15,5))6
         model.plot_ppc(T=kurtosis,nsims=1000)7
1

Configuring the GARCH model using the PyFlux library

2

Printing the estimated latent variables (parameters)

3

Adjusting the priors of the model's latent variables

4

Fitting the model using the M-H process

5

Plotting the latent variables

6

Plotting the fitted model

7

Plotting the histogram for the posterior predictive check

It is worthwhile to visualize the results of what we have done so far for volatility prediction with the Bayesian-based GARCH model.

Figure 4-13 exhibits the distribution of the latent variables. The latent variable q is concentrated around 0.2, and the other latent variable, p, mostly takes values between 0.7 and 0.8.

latent_variable
Figure 4-13. Latent Variables

Figure 4-14 shows the demeaned volatility series and the GARCH prediction result based on the Bayesian approach.

model_fit
Figure 4-14. Model Fit

Figure 4-15 compares the posterior predictions of the Bayesian model with the data so that we are able to detect systematic discrepancies, if any. The vertical line represents the test statistic (here, kurtosis), and it turns out that the observed value is larger than that of our model.

posterior_predict
Figure 4-15. Posterior Prediction
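
To see what such a check does under the hood, the following sketch simulates GARCH(1,1) return paths with parameter values close to the posterior medians reported above and compares the simulated kurtosis with the observed one. This illustrates the idea only and is not PyFlux's internal code; treating q(1) as the ARCH coefficient and p(1) as the GARCH coefficient, and the exact parameter values, are assumptions.

    # Simulate GARCH(1,1) paths and compare their kurtosis with the observed returns
    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(42)
    omega, alpha, beta = 0.04, 0.19, 0.77                 # illustrative values near the posterior medians
    n_obs, n_sims = len(ret), 1000

    z = rng.standard_normal((n_sims, n_obs))
    r = np.empty((n_sims, n_obs))
    sigma2 = np.full(n_sims, omega / (1 - alpha - beta))  # start at the unconditional variance
    for t in range(n_obs):
        r[:, t] = np.sqrt(sigma2) * z[:, t]
        sigma2 = omega + alpha * r[:, t] ** 2 + beta * sigma2

    sim_kurt = kurtosis(r, axis=1)
    print('Observed kurtosis:', kurtosis(ret))
    print('Share of simulations with smaller kurtosis:', (sim_kurt < kurtosis(ret)).mean())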

After we are done with the training part, we are all set to move on to the next phase, which is prediction. The last 252 steps are predicted in-sample, the result is compared with the realized volatility, and the RMSE is computed.

In [79]: bayesian_prediction=model.predict_is(n, fit_method='M-H')1
         Acceptance rate of Metropolis-Hastings is 0.10645
         Acceptance rate of Metropolis-Hastings is 0.16775
         Acceptance rate of Metropolis-Hastings is 0.2475

         Tuning complete! Now sampling.
         Acceptance rate of Metropolis-Hastings is 0.2435

In [80]: bayesian_RMSE = np.sqrt((np.array(realized_vol.iloc[-n:])/100 - bayesian_prediction.values/100) ** 2).mean()2
         print('The RMSE of Bayesian model is {:.4f}'.format(bayesian_RMSE))
         The RMSE of Bayesian model is 0.0049

In [81]: bayesian_prediction.index=ret.iloc[-n:].index

In [82]: plt.figure(figsize=(10,6))
         plt.plot(realized_vol/100,label='Realized Volatility')
         plt.plot(bayesian_prediction['Series']/100,label='Volatility Prediction-Bayesian')
         plt.title('Volatility Prediction with M-H Approach',  fontsize=12)
         plt.legend()
         plt.savefig('images/bayesian.png')
         plt.show()
1

In-sample volatility prediction

2

Calculating the RMSE score
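
For reference, a conventional RMSE can also be computed with scikit-learn's mean_squared_error, which was imported at the beginning of the chapter. The snippet below is a sketch (the variable name rmse_check is ours), and its figure may differ slightly from the elementwise expression used above.

    rmse_check = np.sqrt(mean_squared_error(np.array(realized_vol.iloc[-n:]) / 100,
                                            bayesian_prediction['Series'].values / 100))
    print('RMSE (scikit-learn): {:.4f}'.format(rmse_check))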

Finally, we are ready to observe the prediction result of the Bayesian approach, and the preceding code plots it for us.

Figure 4-16 visualizes the volatility prediction based on the Metropolis-Hastings Bayesian approach. It seems to overshoot towards the end of the sample period, but the overall performance of this method is not bad in the sense that it outperforms many of the models introduced here, with the exceptions of SVR-GARCH with a linear kernel and the neural network.

bayesian
Figure 4-16. Bayesian Volatility Prediction

Conclusion

Volatility prediction is key to understanding the dynamics of financial markets in the sense that it helps us gauge uncertainty. Moreover, volatility is used as an input in many financial models, including risk models. These facts emphasize the importance of accurate volatility prediction. Traditionally, parametric methods such as ARCH, GARCH, and their extensions have been used extensively, but they suffer from inflexibility. To remedy this issue, data-driven models are promising, and this chapter makes use of such models, namely support vector regression, neural networks, and deep learning-based models. It turns out that the data-driven models outperform the parametric ones.

In the next chapter, market risk, a core financial risk topic, will be discussed from both theoretical and empirical standpoints, and machine learning models will be incorporated to further improve the estimation of this risk.

Further Resources

Articles cited in this chapter:

  • Andersen, Torben G., Tim Bollerslev, Francis X. Diebold, and Paul Labys. “Modeling and forecasting realized volatility.” Econometrica 71, no. 2 (2003): 579-625.

  • Burnham, Kenneth P., and David R. Anderson. “A practical information-theoretic approach.” Model selection and multimodel inference 2 (2002).

  • Burnham, Kenneth P., and David R. Anderson. “Multimodel inference: understanding AIC and BIC in model selection.” Sociological methods & research 33, no. 2 (2004): 261-304.

  • Engle, Robert F. “Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation.” Econometrica 50, no. 4 (1982): 987-1007.

  • Mandelbrot, Benoit. “New methods in statistical economics.” Journal of political economy 71, no. 5 (1963): 421-440.

Books cited in this chapter:

  • Alpaydin, Ethem. Introduction to Machine Learning. MIT Press, 2020.

  • Focardi, Sergio M. Modeling the market: New theories and techniques. Vol. 14. John Wiley & Sons, 1997.

  • Rachev, Svetlozar T., John SJ Hsu, Biliana S. Bagasheva, and Frank J. Fabozzi. Bayesian methods in finance. Vol. 153. John Wiley & Sons, 2008.

  • Wilmott, Paul. Paul Wilmott on quantitative finance. John Wiley & Sons, 2013.

1 Conditional variance means that volatility estimation is a function of the past values of asset returns.

2 Occam’s Razor, also known as the “law of parsimony”, states that, given a set of explanations, the simplest explanation is the most plausible and likely one.

3 Of these alternatives, Tensorflow, PyTorch, and Neurolab are the most prominent libraries.

4 Please see the scikit-learn manual for MLPClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
