Chapter 9. Regression and Regularization

We selected binary logistic regression to introduce the basics of machine learning in the Kicking the tires section of Chapter 1, Getting Started. The purpose was to illustrate the concept of discriminative classification. It is important to keep in mind that some regression algorithms, such as logistic regression, are classification models.

The variety and the number of regression models go well beyond the ubiquitous ordinary least square linear regression and logistic regression [9:1]. Have you heard of isotonic regression?

The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being one that is commonly used. The Accessing a model section in Chapter 2, Data Pipelines, introduced the thorny challenge of overfitting, which will be partially addressed in this chapter by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

The chapter starts with a description of the linear least squares regression. The second section introduces the concept of regularization with an implementation of ridge regression. Finally, logistic regression will be revisited in detail, then applied as a classification model.

Linear regression

Although simplistic, linear regression should have a prominent place in your machine learning toolbox. The term regression is usually associated with the concept of fitting a model to data and minimizing the error between the expected and predicted values, measured as the sum of squared errors, the residual sum of squares, or the least squares error.

Least square problems fall into two broad categories:

  • Ordinary least squares
  • Non-linear least squares

Univariate linear regression

Let's start with the simplest form of linear regression, single-variable regression, in order to introduce the terms and concepts behind linear regression. In its simplest interpretation, single-variable linear regression consists of fitting a line to a set of data points {x, y}.

Note

M1: This is a single variable linear regression for a model f, with weights wj for features xj, and labels (or expected values) yj:

f(x) = w0 + w1.x
(w0, w1) = argmin Σj=0,n-1 (yj - f(xj))²

Here, w1 is the slope, w0 is the intercept, f is the linear function that minimizes the RSS, and (xj, yj) is the set of n observations.

The RSS is also known as the sum of square errors (SSE). The mean squared error (MSE) for n observations is derived from the SSE and computed as the ratio of RSS/n.
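
For the univariate case, the optimal weights have a well-known closed form: the slope is the ratio of the covariance between x and y to the variance of x, and the intercept follows from the means of x and y. The following standalone sketch, independent of the classes introduced later in this chapter and with purely illustrative names, shows the computation of the weights and of the MSE:

object UnivariateOLS {
  // Closed-form ordinary least squares for a single variable:
  //   slope = cov(x, y) / var(x)     intercept = mean(y) - slope * mean(x)
  def fit(x: Seq[Double], y: Seq[Double]): (Double, Double) = {
    val (mx, my) = (x.sum / x.size, y.sum / y.size)
    val cov  = x.zip(y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum
    val varX = x.map(xi => (xi - mx) * (xi - mx)).sum
    val slope = cov / varX
    (slope, my - slope * mx)
  }

  // Mean squared error: MSE = RSS / n
  def mse(x: Seq[Double], y: Seq[Double], slope: Double, intercept: Double): Double =
    x.zip(y).map { case (xi, yi) =>
      val err = yi - (slope * xi + intercept)
      err * err
    }.sum / x.size
}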

Note

Terminology

The terminology used in the scientific literature regarding regression is a bit confusing at times. Regression weights are also known as regression coefficients or regression parameters. The weights are referred to as w in formulas and the source code throughout the chapter, although β is also used in reference books.

Implementation

Let's start with the implementation of formula M1 and create the class SingleLinearRegression. The linear regression is a data transformation that uses a model implicitly built from input data. Therefore, the simple linear regression implements the ITransform trait (line 1), as described in the Monadic data transformation section in Chapter 2, Data Pipelines.

The SingleLinearRegression class takes two arguments:

  • An xt vector of single variable observations of type T
  • A vector of expected values or labels

Let's implement our first and simplest regression model:

class SingleLinearRegression[T: ToDouble](
    xt: Vector[T], 
    expected: Vector[T])
  extends ITransform[T, Double] with Monitor[Double] {   //1
  type DblPair = (Double, Double) 

  val model: Option[DblPair] = train //3
  def train: Option[DblPair]= { … }
  override def |> : PartialFunction[T, Try[Double]] //2
}

The Monitor trait is used to collect profiling information during training (refer to the Monitor section under Utility classes in the Appendix).

The class has to define the type of the output of the prediction method, |>, that is a Double (line 2).
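
The prediction itself reduces to evaluating the fitted line. Here is a minimal sketch of what the |> method may look like, under the assumption that the model pair stores (slope, intercept):

override def |> : PartialFunction[T, Try[Double]] = {
  // Predict y = slope*x + intercept only if a model was successfully trained
  case x if model.isDefined => Try {
    val (slope, intercept) = model.get
    slope * implicitly[ToDouble[T]].apply(x) + intercept
  }
}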

Tip

Model instantiation

The model parameters are computed through training and the model is instantiated regardless of whether it is actually validated. A commercial application requires the model to be validated using a methodology such as K-fold cross-validation, as described in the Design template for classifiers section of the Appendix.

The training generates the model defined as the regression weights, in this case the slope and intercept (line 3). The model is set as None if an exception is thrown during training:

def train: Option[DblPair] = Try {
   val regr = new SimpleRegression(true) //4
   regr.addData(zipToSeries(xt, expected).toArray)  //5
   (regr.getSlope, regr.getIntercept)  //6
}.toOption

The regression weights or coefficients pairs, model, are computed using the SimpleRegression class from the stats.regression package of Apache Commons Math library, along with the true argument to trigger the computation of the intercept (line 4). The input time series and the labels (or expected values) are zipped to generate an array of two values (input, expected) (line 5). The model is initialized with the slope and intercept computed during the training (line 6).

The zipToSeries method converts a pair of vectors of type Vector[T] into a vector of type Vector[Array[Double]]. The method is defined in the TSeries object described in the Time series in Scala section of Chapter 3, Data Pre-processing.
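
Its behavior can be summarized by the following hypothetical sketch (the actual implementation in the TSeries object may differ):

// Hypothetical sketch: zips two single-variable series into a series of two-element arrays
def zipToSeries[T](x: Vector[T], y: Vector[T])(implicit f: ToDouble[T]): Vector[Array[Double]] =
  x.zip(y).map { case (a, b) => Array(f.apply(a), f.apply(b)) }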

Tip

Private versus private[this]

A private value or variable can be accessed by all the instances of a class. A value declared private[this] can be manipulated only by this instance. For instance, the value model can be accessed only by this instance of the SingleLinearRegression.
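
A short, self-contained illustration of the difference, with purely illustrative class and member names:

class Account(private val owner: String, private[this] val pin: Int) {
  // owner is class-private: another Account instance can read it
  def sameOwner(other: Account): Boolean = owner == other.owner
  // pin is object-private: other.pin would not compile here
  def checkPin(candidate: Int): Boolean = pin == candidate
}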

Test case

For our first test case, we compute the single variate linear regression of the price of the copper ETF (ticker symbol: CU) over a period of six months (January 1, 2013 to June 30, 2013):

for {
  path <- getPath(s"supervised/regression/CU.csv")
  src <- DataSource(path, false, true, 1)
  price <- src get adjClose //7
  days <- Try(Vector.tabulate(price.size)(_.toDouble)) //8
  linRegr <- SingleLinearRegression[Double](days, price) //9
} yield {
  if( linRegr.isModel )
    linRegr.model match {
      case Some(m) => 
        val (slope, intercept) = m
        val error = mse(days, price, slope, intercept) //10
      case None => …
    }
}

The daily closing price for the ETF, CU is extracted from a CSV file (line 7) as the expected values using a DataSource instance described in the Data extraction and Data sources section under Source consideration in the Appendix. The x-values days are automatically generated as a linear function (line 8). The expected values (price) and the sessions (days) are the input to the instantiation of the simple linear regression (line 9).

Once the model is created successfully, the test code computes the mean squared error, mse, between the predicted and expected values (line 10):

def mse(
    xt: DblVec, 
    expected: DblVec, 
    slope: Double, 
    intercept: Double): Double = {
  val predicted = xt.map( slope*_ + intercept)
  Loss.mse(predicted, expected)  //11
}

The mean squared error is computed using the mse method of the Loss singleton (line 11). The original stock price and the linear regression are plotted in the following chart:

Simple linear regression of CU ETF daily price

The linear model, y = -0.087.x + 30.947, is accurate, with a mean squared error (MSE) of 0.0832.

Although the single variable linear regression is convenient, it is limited to scalar time series. Let's consider the case of multiple variables.

Ordinary least squares (OLS) regression

OLS is a generalization of the simple linear regression described in the previous section. It applies to problems with more than one feature or variable. It computes the parameters, w, of a linear function y = f(x0, x1, …, xd) by minimizing the residual sum of squares. The optimization problem is solved by performing vector and matrix operations (transposition, inversion, and substitution).

M2: Given the weights wj, the n observations (xi , yi )i:0,n-1 from the vector x and the expected output values y and the linear multivariate function y = f (x0, x1, …,xd, ..), the minimization of loss function is computed by the following formula:

f(x) = w0 + Σj=1,d wj.xj
w* = argmin w Σi=0,n-1 (yi - f(xi))²

There are several methodologies to minimize the RSS of a linear regression:

  • Resolution of the set of n equations with d variables (weights) using QR decomposition of the n by d matrix representing the time series of n observations of a vector of d dimensions with n >= d [9:2]
  • Singular value decomposition on the observations-features matrix in the case the dimension d exceeds the number of observations n [9:3]
  • Minimization of loss function using the batch gradient descent [9:4]
  • Minimization of loss function using the stochastic gradient descent [9:5]

An overview of these matrix decompositions and optimization techniques can be found in the Elements of linear algebra and Summary of optimization techniques sections of the Appendix.

The QR decomposition generates the smallest relative error (MSE) for the most common least squares problems. This technique is used in our implementation of the least squares regression.
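
To make the role of the QR decomposition concrete, the following standalone sketch solves a least squares problem directly with the linear algebra classes of the Apache Commons Math library; it is not the implementation used in this chapter, which relies on the higher-level OLSMultipleLinearRegression class introduced in the next section:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, ArrayRealVector, QRDecomposition}

// Solves min ||X.w - y||^2 through a QR factorization of the n x d design matrix X.
// A column of 1s has to be prepended to X if an intercept is required.
def solveOLS(x: Array[Array[Double]], y: Array[Double]): Array[Double] = {
  val designMatrix = new Array2DRowRealMatrix(x, false)
  val target = new ArrayRealVector(y, false)
  new QRDecomposition(designMatrix).getSolver.solve(target).toArray
}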

Design

We select the linear algebra classes and methods defined in the Apache Commons Math library to implement our ordinary least squares regression [9:6].

This chapter describes several types of regression algorithms. It makes sense to define a generic trait, Regression, that defines the key components of a regression algorithm:

  • A model of type RegressionModel (line 1)
  • Two methods to access the components of the regression model: weights and rss (lines 2 and 3)
  • A polymorphic method, train, that implements the training of this specific regression algorithm; it returns the model wrapped in a Try monad, which is converted into the optional model value (line 4):
    trait Regression {
      type Obs = Array[Double]
      val model: Option[RegressionModel] = train.toOption //1
       
      def weights: Option[Obs] = model.map(_.weights) //2
      def rss: Option[Double] = model.map(_.rss) //3
      def isModel: Boolean = model.isDefined
      protected def train: Try[RegressionModel] //4
    }

The model is simply defined by its weights and its residual sum of squares, rss:

case class RegressionModel( //5
   weights: Obs, 
   rss: Double) extends Model[RegressionModel]

Let's create a class, MultiLinearRegression, as a data transformation, the model of which is implicitly derived from the input data (training set) as described in the Monadic data transformation section of Chapter 2, Data Pipelines:

class MultiLinearRegression[@specialized(Double) T: ToDouble](
    xt: Vector[Array[T]], 
    expected: DblVec
  ) extends ITransform[Array[T], Double] with Regression { //7

  override protected def train: Try[RegressionModel] //8
  override def |> : PartialFunction[Array[T], Try[Double]] //9
}

The MultiLinearRegression class takes two arguments: the multi-dimensional time series of observations, xt, and the vector of expected values (line 7). The class implements the ITransform trait and defines the type of the output value for the prediction or regression as a Double. The constructor for MultiLinearRegression creates the model through training (line 8). The ITransform method |> implements the run-time prediction for the multi-linear regression (line 9).

Tip

Regression model

The RSS is included in the model because it provides the client code with important information regarding the accuracy of the underlying technique used to minimize the loss function.

The relationship between the different components of the multi-linear regression is described in the following UML class diagram:

UML class diagram for multi-linear (OLS) regression

The UML diagram omits the helper traits and classes such as Monitor or Apache Commons Math components.

Implementation

The training is performed during the instantiation of the MultiLinearRegression class (refer to the Design template for classifiers section in Appendix):

override protected def train: Try[RegressionModel] = Try {
  val mLr = new MultiLinearRAdapter   //10
  mLr.createModel(
    expected,
    xt.map(_.map(implicitly[ToDouble[T]].apply(_)))
  ) //11
  RegressionModel(mLr.weights, mLr.rss) //12
}

The functionality of the ordinary least squares regression in the Apache Commons Math library is accessed through a reference mLr to the adapter class MultiLinearRAdapter (line 10).

The train method creates the model by invoking the OLSMultipleLinearRegression Apache Commons Math class (line 11) and returns the regression model (line 12). The various methods of the class are accessed through the MultiLinearRAdapter adapter class:

class MultiLinearRAdapter extends OLSMultipleLinearRegression {

  def createModel(y: DblVec, x: Vector[Obs]): Unit = 
      super.newSampleData(y.toArray, x.toArray)
  def weights: Obs = estimateRegressionParameters
  def rss: Double = calculateResidualSumOfSquares
}

The createModel, weights, and rss methods route the request to the corresponding methods in OLSMultipleLinearRegression.

The Scala exception handling monad Try{} is used as the return type for the train method in order to catch the different types of exception thrown by the Apache Commons Math library, such as MathIllegalArgumentException, MathRuntimeException, and OutOfRangeException.

Note

Exception handling

Wrapping the invocation of methods from a separate library or framework in a Scala exception handler, Try { }, matters for a couple of reasons:

  • It makes debugging easier by segregating your code from the third-party library
  • It allows your code to recover from the exception by re-executing the same function with an alternative third-party library method, whenever possible
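
As an illustration, the following sketch wraps the Apache Commons Math regression call in Try and recovers with a trivial, purely illustrative fallback (an all-zero weights vector):

import scala.util.Try
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression

// Wraps the third-party call; recovers with a trivial fallback (all-zero weights)
def safeWeights(x: Array[Array[Double]], y: Array[Double]): Try[Array[Double]] =
  Try {
    val ols = new OLSMultipleLinearRegression
    ols.newSampleData(y, x)               // may throw MathIllegalArgumentException
    ols.estimateRegressionParameters()    // may throw a runtime exception
  }.recoverWith {
    case _: Exception => Try(Array.fill(x.head.length + 1)(0.0))
  }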

The predictive algorithm for ordinary least squares regression is implemented by the data transformation |>. The method predicts the output value given a model and an input value x:

def |> : PartialFunction[Array[T], Try[Double]] = {
  case x: Array[T] if isModel && 
       x.length == model.get.size - 1 => 
           Try(margin(x, model.get))  //13
}

The predictive value is computed using the margin method defined in the RegressionModel singleton introduced earlier in this section (line 13).
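
The margin method is not reproduced here; a plausible sketch, consistent with the x.length == model.get.size - 1 guard and under the assumption that the first regression weight holds the intercept, is the following:

// Hypothetical sketch: dot product of the input with the weights, plus the intercept
def margin[T: ToDouble](x: Array[T], model: RegressionModel): Double = {
  val toDouble = implicitly[ToDouble[T]]
  x.zip(model.weights.drop(1))
   .map { case (xi, w) => w * toDouble.apply(xi) }
   .sum + model.weights.head
}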

Test case 1 – trending

The objective is to identify a trend in a time series. In this context, trending consists of extracting the long-term movement in a time series. Trend lines are detected using a multivariate least squares regression. The objective of this first test is to evaluate the filtering capability of the ordinary least squares regression.

The regression is performed on the relative price variation of the Copper ETF (ticker symbol: CU). The selected features are volatility and volume, and the label or target variable is the price change between two consecutive trading sessions, y.

The naming convention for the trading data and metrics is described in the Trading data section under Technical analysis in the Appendix.

The volume, volatility, and price variation for CU between January 1, 2013 and June 30, 2013 are plotted in the following chart:

Chart for price variation, volatility, and trading volume for Copper ETF

Let's write the client code to compute the multi-variate linear regression, price change = w0 + volatility.w1 + volume.w2:

for {
  path <- getPath(s"supervised/regression/CU.csv") //14
  src <- DataSource(path, true, true, 1)  //15
  price <- src.get(adjClose) //16
  volatility <- src.get(volatility) //17
  volume <- src.get(volume) //18
  (features, expected) <- differentialData(
     volatility, volume, price, diffDouble
  ) //19
  regression <- MultiLinearRegression[Double](features, expected) //20
} yield {
  if( regression.isModel ) {
    val trend = features.map(margin(_,regression.weights.get))
    display(expected, trend)  //21
   }
}

Let's look at the steps in the execution of the test. It consists of collecting data, extracting the features and expected values, and training the multi-linear regression model:

  1. Locate the CSV-formatted data source file (line 14).
  2. Create a data source extractor, DataSource, for the trading session closing price, the session volatility, and session volume for the ETF, CU (line 15).
  3. Extract the price of the ETF (line 16), its volatility within a trading session (line 17), and trading volume during the session (line 18), using the DataSource transform.
  4. Generate the labeled data as a pair of features (relative volatility and relative volume for the ETF) and the expected outcome, the variation in the ETF price between two consecutive trading sessions (line 19). The generic differentialData method of the Stats singleton is described in the Time series section of Chapter 3, Data Pre-processing; a hypothetical sketch of this helper follows the list.
  5. The multi-linear regression is instantiated using the features set and the expected change in daily ETF price (line 20).
  6. Display the expected and trending values using JfreeChart (line 21).
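
The exact signature of differentialData belongs to Chapter 3; as a rough, hypothetical illustration of the kind of transformation it performs, the one-step changes of the two feature series and of the target series could be computed as follows:

import scala.util.Try

// Hypothetical sketch only: pairs the one-step changes of the two feature series
// with the change of the target series computed by the supplied operator
def differentialData[T](
    x1: Vector[Double],
    x2: Vector[Double],
    target: Vector[Double],
    diff: (Double, Double) => T): Try[(Vector[Array[Double]], Vector[T])] = Try {
  val features = x1.zip(x2).sliding(2).map {
    case Seq((a0, b0), (a1, b1)) => Array(a1 - a0, b1 - b0)
  }.toVector
  val expected = target.sliding(2).map { case Seq(p0, p1) => diff(p0, p1) }.toVector
  (features, expected)
}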

The time series of expected value and the data predicted by the regression are plotted in the following chart:

Price variation and least squares regression for Copper ETF according to volatility and volume

The least squares regression model is defined by the linear function for the estimation of price variation:

price(t+1)-price(t) = -0.01 + 0.014 volatility – 0.0042 volume

The estimated price change (the dotted line in the preceding chart) represents the long-term trend from which the noise is filtered out. In other words, the least squares regression operates as a simple low-pass filter, an alternative to some of the filtering techniques such as the discrete Fourier transform or the Kalman filter described in Chapter 3, Data Pre-processing [9:7].

Although trend detection is an interesting application of the least squares regression, the method has limited filtering capabilities for time series [9:8]:

  • It is quite sensitive to outliers
  • As a deterministic method, it does not support noise analysis (distribution, frequencies, and so on)

Test case 2 – features selection

Our objective is to discover which subset of initial features generates the most accurate regression model; that is, the model with the smallest RSS on the training set.

Let's consider an initial set of D features {xi}. The objective is to estimate the subset of features {xid} that is the most relevant to the set of observations, using a least squares regression. Each subset of features is associated with a model fj(x|wj):

Diagram for the features set selection

The selection of the model parameters, w, uses ordinary least squares regression when the features set is small. Performing a regression on each subset of a large original features set is not practical.

M3: Given the weights wjk of the function/model fj and the n observations (xi, yi)i:0,n-1 and the linear multivariate function y = f (x0, x1, …,xd, ..), the features selection can be expressed mathematically as follows:

wj* = argmin wj Σi=0,n-1 (yi - fj(xi|wj))²
j* = argmin j Σi=0,n-1 (yi - fj(xi|wj*))²

Let's consider the following four financial time series over the period from January 1, 2009 to December 31, 2013:

  • Exchange rate of Chinese Yuan to US Dollar
  • S&P 500 index
  • Spot price of gold
  • 10-year treasury bond price

The problem is to estimate which combination of the three variables, S&P 500 index, gold price, and 10-year treasury bond price, is the most correlated to the exchange rate of the Yuan. For practical reasons, we use the Exchange Traded Fund CNY as the proxy for the Yuan/US dollar exchange rate (similarly, we use SPY, GLD, and TLT for the S&P 500 index, the spot price of gold, and the 10-year treasury bond price, respectively).

Tip

Automation of features extraction

The code in this section implements an ad hoc extraction of features with an arbitrary fixed set of models. The process can be easily automated with an optimizer (such as gradient descent or a genetic algorithm) using 1/RSS as the objective function.
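
For a small number of features, the ad hoc evaluation can even be written as an exhaustive search; the following hypothetical sketch ranks every non-empty subset of features by its RSS, the rss function being supplied by the caller:

// Hypothetical sketch: exhaustive search over feature subsets, ranked by their RSS
def bestSubset(labels: Array[String], rss: Set[String] => Double): Set[String] =
  (1 to labels.length)
    .flatMap(k => labels.toSet.subsets(k))    // all non-empty subsets of features
    .minBy(rss)                               // subset with the smallest RSS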

The number of models to evaluate is relatively small, so an ad hoc approach to compute the RSS for each combination is acceptable. Have a look at the following graph:

Graph of the Chinese Yuan exchange rate, gold, 10-year treasury bond price, and S&P 500 index

The getRss method implements the computation of the RSS value given a set of observations, xt, expected (smoothed) values, expected, and labels for the features, labels, and returns a textual result:

def getRss(
  xt: Vector[Array[Double]], 
  expected: DblVec, 
  labels: Array[String]): String = {

  val regression = new MultiLinearRegression[Double](
    xt, expected
  ) //22
  val descriptor = regression.weights.map(
    _.zipWithIndex.map { case (w, n) =>
       …  // Display regression weights
    }
  )
  s"$descriptor rss= ${regression.rss}" //23
}

The getRss method merely trains the model by instantiating the multi-linear regression class (line 22). Once the regression model is trained during the instantiation of the MultiLinearRegression class, the regression weights and the RSS value are converted into a string (line 23). The getRss method is invoked for each combination of the ETFs GLD, SPY, and TLT against the CNY label.

Let's look at the test code:

val SMOOTHING_PERIOD: Int = 16 //24
val symbols = Array[String]("CNY", "GLD", "SPY", "TLT") //25
val movAvg = SimpleMovingAverage[Double](SMOOTHING_PERIOD) //26

for {
  pfnMovAve <- Try(movAvg |>)  //27
  smoothed <- filter(pfnMovAve) //28
  models <- createModels(smoothed) //29
  rsses <- Try(getModelsRss(models, smoothed)) //30
  (mses, tss) <- totalSquaresError(models,smoothed.head) //31
} yield {          
   s"""${rsses.mkString("
")}
${mses.mkString("
")}
      | 
Residual error= $tss".stripMargin
}

The dataset is large (1,260 trading sessions) and noisy enough to warrant filtering using a simple moving average with a period of 16 trading sessions (line 24). The purpose of the test is to evaluate the possible correlation between the four ETFs: CNY, GLD, SPY, and TLT (line 25). The execution test instantiates the simple moving average (line 26) as described in the Simple moving average section of Chapter 3, Data Pre-processing.

The workflow executes the following steps:

  1. Instantiate the simple moving average partial function pfnMovAve (line 27).
  2. Generate a smoothed historical prices series for the CNY, GLD, SPY, and TLT ETFs using the function filter as follows (line 28):
    type PFNMOVAVE = PartialFunction[DblVec, Try[DblVec]]
    
    def filter(pfnMovAve: PFNMOVAVE): Try[Array[DblVec]] = Try {
       symbols.map(s => 
           DataSource(s"$dataPath/$s.csv", true, true, 1))
         .map(_.flatMap(_.get(adjClose)))
         .map(_.flatMap(pfnMovAve(_)))
         .map(_.getOrElse(Vector.empty[Double]))
    }
  3. Generate the list of features for each model using the createModels method (line 29):
    type Models = List[(Array[String], DblMatrix)]
    type DblMatrix = Array[Array[Double]]
    
    def createModels(smoothed: Array[DblVec]): Try[Models] = Try {
      val features = smoothed.drop(1).map(_.toArray)  //32
      List[(Array[String], DblMatrix)](   //33
        (Array[String]("CNY","SPY","GLD","TLT"),
                features.transpose),
        (Array[String]("CNY","GLD","TLT"),
                features.drop(1).transpose),
        (Array[String]("CNY","SPY","GLD"),
                features.take(2).transpose),
        (Array[String]("CNY","SPY","TLT"),
                features.zipWithIndex
                        .filter(_._2 != 1)
                        .map(_._1)
                        .transpose),
        (Array[String]("CNY","GLD"),
                features.slice(1,2).transpose)
      )
    }

    The smoothed values for CNY are used as the expected values and are therefore removed from the features list (line 32). The five models are evaluated by adding or removing elements from the features list (line 33).

  4. Next, the workflow computes the residual sum of squares for all the models using getModelsRss (line 30). The method invokes getRss, introduced earlier in this section, for each model (line 34):
    def getModelsRss(models: Models, y: Array[DblVec]): 
      List[String] = 
      models.map{ case (labels, m) => 
             s"${getRss(m.toVector, y.head, labels)}" }  //34
  5. Finally, the last step of the workflow consists of computing the mean squared errors, mses, for each model and the total squared error, tss (line 31):
    def totalSquaresError(
        models: Models, 
        expected: DblVec): Try[(List[String], Double)] = Try {
     
      val errors = models.map{
         case (labels, m) => rssSum(m, expected)._1 //35
      }
      val mses = models.zip(errors).map{
         case(f, e) => s"MSE: ${f._1.mkString(" ")} = $e"
      }
      (mses, Math.sqrt(errors.sum)/models.size)  //36
    }

The totalSquaresError method computes the error for each model by summing the RSS value, rssSum, for each model (line 35). The method returns a pair of array of the mean squared error for each model and the total squares error (line 36).

The RSS does not always provide an accurate picture of the fitness of the regression model. The fitness of a regression model is commonly assessed using the r² statistic. The r² value is a number that indicates how well the data fits a statistical model.

M4: The RSS and the r² statistic are defined by the following formulas:

RSS = Σi=0,n-1 (yi - f(xi))²
r² = 1 - RSS/TSS, with TSS = Σi=0,n-1 (yi - ȳ)² and ȳ the mean of the expected values
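
In code, the r² statistic can be computed directly from the predicted and expected values; a minimal, self-contained sketch:

// r2 = 1 - RSS/TSS computed from the predicted and expected values
def rSquared(predicted: Seq[Double], expected: Seq[Double]): Double = {
  val mean = expected.sum / expected.size
  val rss = predicted.zip(expected).map { case (p, e) => (e - p) * (e - p) }.sum
  val tss = expected.map(e => (e - mean) * (e - mean)).sum
  1.0 - rss / tss
}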

The implementation of the computation of the r² statistic is simple: for each model, fj, the rssSum method computes the tuple {RSS, sum of squared errors} as defined in formula M4:

def rssSum(xt: DblMatrix, expected: DblVec): DblPair = {
  val regression = MultiLinearRegression[Double](xt,expected)//37
  val pfnRegr = regression |> //38
  val results = sse(expected.toArray, xt.map(pfnRegr(_).get))
  (regression.rss, results) //39
}

The rssSum method instantiates the MultiLinearRegression class (line 37), retrieves the RSS value, then validates the regression model, pfnRegr (line 38), against the expected values (line 39).

The output results clearly show that the three-variable regression CNY = f(SPY, GLD, TLT) is the most accurate or fittest model for the time series CNY, followed by CNY = f(SPY, TLT). Therefore, the feature selection process generates the features set {SPY, GLD, TLT}.

Let's plot the model against the raw data:

Ordinary least squares regression on the Chinese Yuan ETF (CNY)

The regression model smoothed the original time series, CNY. It weeded out all but the most significant price variations.

Bar chart of the r² value for different regression models for the CNY ETF

The graph plotting the r² value for each model confirms that the three-feature model, CNY = f(SPY, GLD, TLT), is the most accurate.

Note

General linear regression

The concept of linear regression is not restricted to polynomial fitting models such as y = w0 + w1.x + w2.x² + … + wn.xⁿ. Regression models can also be defined as a linear combination of basis functions ϕj: y = w0 + w1.ϕ1(x) + w2.ϕ2(x) + … + wn.ϕn(x) [9:9].
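
Under the assumption that a design matrix is built by applying the basis functions to each observation, such a generalized model can still be fitted with ordinary least squares; a brief, hypothetical sketch using the Apache Commons Math library:

import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression

// Fits y = w0 + w1.phi1(x) + ... + wn.phin(x) by regressing y against the transformed features
def fitBasis(x: Array[Double], y: Array[Double], basis: Seq[Double => Double]): Array[Double] = {
  val design = x.map(xi => basis.map(phi => phi(xi)).toArray)  // one row per observation
  val ols = new OLSMultipleLinearRegression                    // the intercept w0 is added by default
  ols.newSampleData(y, design)
  ols.estimateRegressionParameters()
}

For instance, basis = Seq[Double => Double](identity, math.sin _, (v: Double) => v * v) would fit y = w0 + w1.x + w2.sin(x) + w3.x².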
