We selected binary logistic regression to introduce the basics of machine learning in the Kicking the tires section of Chapter 1, Getting Started. The purpose was to illustrate the concept of discriminative classification. It is important to keep in mind that some regression algorithms, such as logistic regression, are classification models.
The variety and the number of regression models go well beyond the ubiquitous ordinary least squares linear regression and logistic regression [9:1]. Have you heard of isotonic regression?
The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being one that is commonly used. The Accessing a model section in Chapter 2, Data Pipelines, introduced the thorny challenge of overfitting, which will be partially addressed in this chapter by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.
The chapter starts with a description of the linear least squares regression. The second section introduces the concept of regularization with an implementation of ridge regression. Finally, logistic regression will be revisited in detail, then applied as a classification model.
Although simplistic, linear regression should have a prominent place in your machine learning toolbox. The term regression is usually associated with the concept of fitting a model to data and minimizing the error between the expected and predicted values by computing the sum of square errors, residual sum of square errors, or least square errors.
Least square problems fall into two broad categories:
Let's start with the simplest form of linear regression, which is single variable regression, in order to introduce the terms and concepts behind linear regression. In its simplest interpretation, one variate linear regression consists of fitting a line to a set of data points {x, y}.
The RSS is also known as the sum of square errors (SSE). The mean squared error (MSE) for n observations is derived from the SSE and computed as the ratio of RSS/n.
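As an illustration of these definitions, here is a minimal sketch of a closed-form single-variable least squares fit, together with the RSS and MSE of the fitted line. The object name SimpleFit and its method names are hypothetical and are not part of the library described in this chapter:

```scala
// Minimal sketch (hypothetical names): closed-form fit of y = slope * x + intercept.
object SimpleFit {
  /** Returns (slope, intercept); assumes x holds at least two distinct values. */
  def fit(x: Vector[Double], y: Vector[Double]): (Double, Double) = {
    require(x.size == y.size && x.size > 1, "need paired observations")
    val n = x.size
    val meanX = x.sum / n
    val meanY = y.sum / n
    // Covariance and variance terms (both left unnormalized; the ratio is unchanged)
    val sxy = x.zip(y).map { case (xi, yi) => (xi - meanX) * (yi - meanY) }.sum
    val sxx = x.map(xi => (xi - meanX) * (xi - meanX)).sum
    val slope = sxy / sxx
    (slope, meanY - slope * meanX)
  }

  /** Residual sum of squares of the fitted line. */
  def rss(x: Vector[Double], y: Vector[Double], slope: Double, intercept: Double): Double =
    x.zip(y).map { case (xi, yi) =>
      val residual = yi - (slope * xi + intercept)
      residual * residual
    }.sum

  /** Mean squared error: RSS divided by the number of observations. */
  def mse(x: Vector[Double], y: Vector[Double], slope: Double, intercept: Double): Double =
    rss(x, y, slope, intercept) / x.size
}
```

For example, SimpleFit.fit(Vector(0.0, 1.0, 2.0, 3.0), Vector(1.0, 3.0, 5.0, 7.0)) recovers a slope of 2 and an intercept of 1, with an RSS of 0 because the points lie exactly on a line.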
Terminology
The terminology used in the scientific literature regarding regression is a bit confusing at times. Regression weights are also known as regression coefficients or regression parameters. The weights are referred to as w in formulas and the source code throughout the chapter, although β is also used in reference books.
Let's start with the implementation of formula M1 and create the class SingleLinearRegression. The linear regression is a data transformation that uses a model implicitly built from input data. Therefore, the simple linear regression implements the ITransform trait (line 1) as described in the Monadic data transformation section in Chapter 2, Data Pipelines.
The SingleLinearRegression class takes two arguments: the xt vector of single variable observations of type T and the expected values or labels (line 1).
Let's implement our first and simplest regression model:
class SingleLinearRegression[T: ToDouble](
    xt: Vector[T],
    expected: Vector[T])
  extends ITransform[T, Double] with Monitor[Double] { //1
  type DblPair = (Double, Double)

  val model: Option[DblPair] = train //3
  def train: Option[DblPair] = { … }
  override def |> : PartialFunction[T, Try[Double]] //2
}
The Monitor trait is used to collect profiling information during training (refer to the Monitor section under Utility classes in the Appendix).
The class has to define the type of the output of the prediction method, |>, that is a Double (line 2).
Model instantiation
The model parameters are computed through training and the model is instantiated regardless of whether it has actually been validated. A commercial application requires the model to be validated using a methodology such as K-fold validation, as described in the Design template for classifiers section of the Appendix.
The training generates the model defined as the regression weights, in this case the slope and intercept (line 3). The model is set as None if an exception is thrown during training:
def train: Option[DblPair] = {
  val regr = new SimpleRegression(true) //4
  regr.addData(zipToSeries(xt, expected).toArray) //5
  Some((regr.getSlope, regr.getIntercept)) //6
}
The regression weights or coefficients pair, model, is computed using the SimpleRegression class from the stats.regression package of the Apache Commons Math library, along with the true argument to trigger the computation of the intercept (line 4). The input time series and the labels (or expected values) are zipped to generate an array of two values (input, expected) (line 5). The model is initialized with the slope and intercept computed during the training (line 6).
The zipToSeries method converts a pair of vectors of type Vector[T] into a vector of type Vector[Array[Double]]. The method is defined in the TSeries object described in the Time series in Scala section of Chapter 3, Data Pre-processing.
For our first test case, we compute the single variate linear regression of the price of the copper ETF (ticker symbol: CU) over a period of six months (January 1, 2013 to June 30, 2013):
for {
  path <- getPath(s"supervised/regression/CU.csv")
  src <- DataSource(path, false, true, 1)
  price <- src get adjClose //7
  days <- Try(Vector.tabulate(price.size)(_.toDouble)) //8
  linRegr <- SingleLinearRegression[Double](days, price) //9
} yield {
  if (linRegr.isModel) linRegr.model match {
    case Some(m) =>
      val (slope, intercept) = m
      val error = mse(days, price, slope, intercept) //10
    case None => …
  }
}
The daily closing price for the ETF, CU, is extracted from a CSV file (line 7) as the expected values using a DataSource instance described in the Data extraction and Data sources section under Source consideration in the Appendix. The x-values, days, are automatically generated as a linear function (line 8). The expected values (price) and the sessions (days) are the input to the instantiation of the simple linear regression (line 9).
Once the model is created successfully, the test code computes the mean squared error, mse, between the predicted and expected values (line 10):
def mse(
    xt: DblVec,
    expected: DblVec,
    slope: Double,
    intercept: Double): Double = {
  // Predicted values of the fitted line for each observation
  val predicted = xt.map(slope * _ + intercept)
  Loss.mse(predicted, expected) //11
}
The mean squared error is computed using the mse method of the Loss singleton (line 11). The original stock price and the linear regression equation are plotted in the following chart:
The linear model, y = -0.087 x + 30.947, is accurate, with a mean squared error (MSE) of 0.0832.
Although the single variable linear regression is convenient, it is limited to scalar time series. Let's consider the case of multiple variables.
OLS is a generalization of the simple linear regression described in the previous section. It applies to problems with more than one feature or variable. It computes the parameters, w, of a linear function y = f(x0, x1, …, xd) by minimizing the residual sum of squares. The optimization problem is solved by performing vector and matrix operations (transposition, inversion, and substitution).
M2: Given the weights wj, the n observations (xi, yi), i = 0, …, n-1, from the vector x and the expected output values y, and the linear multivariate function y = f(x0, x1, …, xd), the minimization of the loss function is computed by the following formula:
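In standard notation, and assuming that w0 denotes the intercept and xij the j-th feature of the i-th observation, the ordinary least squares minimization is usually written as follows:

```latex
\hat{w} = \arg\min_{w} \sum_{i=0}^{n-1} \bigl( y_i - f(x_i \mid w) \bigr)^2,
\qquad
f(x_i \mid w) = w_0 + \sum_{j=1}^{d} w_j \, x_{ij}
```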
There are several methodologies to minimize the RSS for a linear regression:
An overview of these matrix decompositions and optimization techniques can be found in the Elements of linear algebra and Summary of optimization techniques sections of the Appendix.
The QR decomposition generates the smallest relative error MSE for the most common least squares problem. The technique is used in our implementation of the least squares regression.
We select the linear algebra classes and methods defined in the Apache Commons Math library to implement our ordinary least squares regression [9:6].
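To make the matrix operations concrete without any library dependency, the following sketch solves the least squares problem through the normal equations and Gaussian elimination. This is a deliberately simpler technique than the QR decomposition used by the library, and it is less robust numerically. The object NormalEquations and its fit method are hypothetical names:

```scala
// Minimal sketch (hypothetical names): OLS through the normal equations
// (X^T X) w = X^T y. This is not the QR decomposition used by Apache Commons Math.
object NormalEquations {
  type Obs = Array[Double]

  /** Returns the weights (w0, w1, ..., wd), where w0 is the intercept. */
  def fit(x: Vector[Obs], y: Vector[Double]): Obs = {
    require(x.nonEmpty && x.size == y.size, "need paired observations")
    val rows = x.map(obs => 1.0 +: obs) // prepend 1.0 for the intercept
    val d = rows.head.length
    // Augmented matrix [X^T X | X^T y]
    val a = Array.tabulate(d, d + 1) { (j, k) =>
      if (k < d) rows.map(r => r(j) * r(k)).sum
      else rows.zip(y).map { case (r, yi) => r(j) * yi }.sum
    }
    // Forward elimination with partial pivoting
    for (col <- 0 until d) {
      val pivot = (col until d).maxBy(r => math.abs(a(r)(col)))
      val tmp = a(col); a(col) = a(pivot); a(pivot) = tmp
      require(math.abs(a(col)(col)) > 1e-12, "singular system")
      for (r <- col + 1 until d) {
        val factor = a(r)(col) / a(col)(col)
        for (k <- col to d) a(r)(k) -= factor * a(col)(k)
      }
    }
    // Back substitution
    val w = new Array[Double](d)
    for (j <- d - 1 to 0 by -1) {
      val known = (j + 1 until d).map(k => a(j)(k) * w(k)).sum
      w(j) = (a(j)(d) - known) / a(j)(j)
    }
    w
  }
}
```

The normal equations square the condition number of the problem, which is one reason a QR-based solver is generally preferred for ill-conditioned data.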
This chapter describes several types of regression algorithms. It makes sense to define a generic trait, Regression, that defines the key components of a regression algorithm:
A model of type RegressionModel (line 1)
The accessors weights and rss (lines 2 and 3)
A method, train, that implements the training of this specific regression algorithm (line 4)
A method, training, that wraps train into a Try monad (line 5):
trait Regression {
  type Obs = Array[Double]

  val model: Option[RegressionModel] = train.toOption //1
  def weights: Option[Obs] = model.map(_.weights) //2
  def rss: Option[Double] = model.map(_.rss) //3
  def isModel: Boolean = model != None
  protected def train: RegressionModel //4
}
The model is simply defined by its weights and its residual sum of squares, rss:
case class RegressionModel( //5
weights: Obs,
rss: Double) extends Model[RegressionModel]
Let's create a class, MultiLinearRegression, as a data transformation, the model of which is implicitly derived from the input data (training set) as described in the Monadic data transformation section of Chapter 2, Data Pipelines:
class MultiLinearRegression[@specialized(Double) T: ToDouble](
    xt: Vector[Array[T]],
    expected: DblVec)
  extends ITransform[Array[T], Double] with Regression { //7
  override def train: Option[RegressionModel] //8
  override def |> : PartialFunction[Array[T], Try[Double]] //9
}
The MultiLinearRegression class takes two arguments: the multi-dimensional time series of observations, xt, and the vector of expected values (line 7). The class implements the ITransform trait and defines the type of the output value for the prediction or regression as a Double. The constructor for MultiLinearRegression creates the model through training (line 8). The ITransform method |> implements the run-time prediction for the multi-linear regression (line 9).
The relationship between the different components of the multi-linear regression is described in the following UML class diagram:
The UML diagram omits the helper traits and classes such as Monitor or Apache Commons Math components.
The training is performed during the instantiation of the MultiLinearRegression class (refer to the Design template for classifiers section in the Appendix):
def train: RegressionModel = {
  val mLr = new MultiLinearRAdapter //10
  // createModel expects the expected values first, then the observations
  mLr.createModel(
    expected,
    xt.map(_.map(implicitly[ToDouble[T]].apply(_)))) //11
  RegressionModel(mLr.weights, mLr.rss) //12
}
The functionality of the ordinary least squares regression in the Apache Commons Math library is accessed through a reference, mLr, to the adapter class MultiLinearRAdapter (line 10).
The train method creates the model by invoking the OLSMultipleLinearRegression Apache Commons Math class (line 11) and returns the regression model (line 12). The various methods of the class are accessed through the MultiLinearRAdapter adapter class:
class MultiLinearRAdapter extends OLSMultipleLinearRegression {
  def createModel(y: DblVec, x: Vector[Obs]): Unit =
    super.newSampleData(y.toArray, x.toArray)

  def weights: Obs = estimateRegressionParameters
  def rss: Double = calculateResidualSumOfSquares
}
The createModel, weights, and rss methods route the request to the corresponding methods in OLSMultipleLinearRegression.
The Scala exception handling monad Try{} is used as the return type for the train method in order to catch the different types of exception thrown by the Apache Commons Math library, such as MathIllegalArgumentException, MathRuntimeException, and OutOfRangeException.
Exception handling
Wrapping the invocation of methods of a separate library or framework with a Scala exception handler, Try { }, matters for a couple of reasons:
It makes debugging easier by segregating your code from the third-party code
It allows your code to recover from the exception by re-executing the same function with an alternative third-party library method, whenever possible
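The recovery pattern described in this note can be sketched as follows. The object SafeCall and its withFallback method are hypothetical helpers introduced for illustration only:

```scala
import scala.util.Try

// Minimal sketch (hypothetical names): run a primary computation and,
// if it throws a non-fatal exception, retry with an alternative one.
object SafeCall {
  def withFallback[A](primary: => A)(fallback: => A): Try[A] =
    Try(primary).orElse(Try(fallback))
}
```

Both arguments are passed by name, so the fallback is evaluated only if the primary computation fails. If both fail, the result is the Failure raised by the fallback.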
The predictive algorithm for ordinary least squares regression is implemented by the data transformation |>. The method predicts the output value given a model and an input value x:
def |> : PartialFunction[Array[T], Try[Double]] = {
  // The model holds one weight per feature plus the intercept
  case x: Array[T] if isModel && x.length == model.get.size - 1 =>
    Try(margin(x, model.get)) //13
}
The predictive value is computed using the margin method defined in the RegressionModel singleton introduced earlier in this section (line 13).
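The margin computation can be pictured as the intercept plus the dot product of the features and the remaining weights. The following sketch assumes that the first weight is the intercept, which is consistent with the size check in the prediction method. The object name Margin and the signature are assumptions, not the library's actual definition:

```scala
// Minimal sketch (assumed layout): w(0) is the intercept and
// w(1), ..., w(d) multiply the d features of the observation x.
object Margin {
  def margin(x: Array[Double], w: Array[Double]): Double = {
    require(x.length == w.length - 1, "expected one weight per feature plus the intercept")
    w(0) + x.zip(w.tail).map { case (xi, wi) => xi * wi }.sum
  }
}
```

For instance, Margin.margin(Array(2.0, 3.0), Array(1.0, 0.5, 2.0)) evaluates 1.0 + 2.0 * 0.5 + 3.0 * 2.0.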
The objective is to identify a trend in a time series. In this context, trending consists of extracting the long-term movement in a time series. Trend lines are detected using a multivariate least squares regression. The objective of this first test is to evaluate the filtering capability of the ordinary least squares regression.
The regression is performed on the relative price variation of the Copper ETF (ticker symbol: CU). The selected features are volatility and volume, and the label or target variable is the price change between two consecutive trading sessions, y.
The naming convention for the trading data and metrics is described in the Trading data section under Technical analysis in the Appendix.
The volume, volatility, and price variation for CU between January 1, 2013 and June 30, 2013 are plotted in the following chart:
Let's write the client code to compute the multi-variate linear regression, price change = w0 + volatility.w1 + volume.w2:
for {
  path <- getPath(s"supervised/regression/CU.csv") //14
  src <- DataSource(path, true, true, 1) //15
  price <- src.get(adjClose) //16
  volatility <- src.get(volatility) //17
  volume <- src.get(volume) //18
  (features, expected) <- differentialData(
    volatility, volume, price, diffDouble) //19
  regression <- MultiLinearRegression[Double](features, expected)
} yield { //20
  if (regression.isModel) {
    val trend = features.map(margin(_, regression.weights.get))
    display(expected, trend) //21
  }
}
Let's look at the steps in the execution of the test. It consists of collecting data, extracting the features and expected values, and training the multi-linear regression model:
A data source extractor, DataSource, is created for the trading session closing price, the session volatility, and the session volume for the ETF, CU (line 15).
The price (line 16), the volatility (line 17), and the volume (line 18) are extracted using the DataSource transform.
The labeled data is generated as a pair of features and the expected outcome {0, 1} for training the model, for which 1 represents a price increase and 0 represents a price decrease (line 19). The generic differentialData method of the Stats singleton is described in the Times series section of Chapter 3, Data Pre-processing.
The regression is instantiated using the features set and the expected change in daily ETF price (line 20).
The time series of expected values and the data predicted by the regression are plotted in the following chart:
The least squares regression model is defined by the linear function for the estimation of price variation:
price(t+1)-price(t) = -0.01 + 0.014 volatility – 0.0042 volume
The estimated price change (dotted line in the preceding chart) represents the long-term trend from which the noise has been filtered out. In other words, the least squares regression operates as a simple low-pass filter, an alternative to filtering techniques such as the discrete Fourier transform or the Kalman filter described in Chapter 3, Data Pre-processing [9:7].
Although trend detection is an interesting application of the least squares regression, the method has limited filtering capabilities for time series [9:8]:
Our objective is to discover which subset of initial features generates the most accurate regression model; that is, the model with the smallest RSS on the training set.
Let's consider an initial set of D features {xi}. The objective is to estimate the subset of features {xid} that are the most relevant to the set of observations using a least squares regression. Each subset of features is associated with a model fj (x|wj):
The selection of the model parameters, w, uses ordinary least squares regression when the features set is small. Performing a regression on each subset of a large original features set is not practical.
M3: Given the weights wjk of the function/model fj and the n observations (xi, yi)i:0,n-1 and the linear multivariate function y = f (x0, x1, …,xd, ..), the features selection can be expressed mathematically as follows:
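Under the stated objective (the model with the smallest RSS on the training set), one common way to write the selection, using the same symbols as above, is the following:

```latex
f_{best} = \arg\min_{j} \Bigl\{ \min_{w_j} \sum_{i=0}^{n-1} \bigl( y_i - f_j(x_i \mid w_j) \bigr)^2 \Bigr\}
```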
Let's consider the following four financial time series over the period from January 1, 2009 to December 31, 2013:
The problem is to estimate which combination of the three variables, S&P 500 index, gold price, and 10-year treasury bond price, is the most correlated to the exchange rate of the Yuan. For practical reasons, we use the exchange-traded fund CNY as the proxy for the Yuan/US dollar exchange rate (similarly, we use SPY, GLD, and TLT for the S&P 500 index, spot price of gold, and 10-year treasury bond price, respectively).
The number of models to evaluate is relatively small, so an ad hoc approach to compute the RSS for each combination is acceptable. Have a look at the following graph:
The getRss method implements the computation of the RSS value given a set of observations, xt, expected (smoothed) values, expected, and labels for features, labels, and returns a textual result:
def getRss(
    xt: Vector[Array[Double]],
    expected: DblVec,
    labels: Array[String]): String = {
  // Training occurs during the instantiation of the regression
  val regression = new MultiLinearRegression[Double](xt, expected) //22
  val descriptor = regression.weights.map(
    _.zipWithIndex.map { case (w, n) =>
      … // Display regression weights
    }
  )
  s"rss= ${regression.rss}" //23
}
The getRss method merely trains the model by instantiating the multi-linear regression class (line 22). Once the regression model is trained during the instantiation of the MultiLinearRegression class, the coefficients of the regression weights and the RSS values are stringized (line 23). The getRss method is invoked for any combination of the ETFs, GLD, SPY, and TLT, against the CNY label.
Let's look at the test code:
val SMOOTHING_PERIOD: Int = 16 //24
val symbols = Array[String]("CNY", "GLD", "SPY", "TLT") //25
val movAvg = SimpleMovingAverage[Double](SMOOTHING_PERIOD) //26

for {
  pfnMovAve <- Try(movAvg |>) //27
  smoothed <- filter(pfnMovAve) //28
  models <- createModels(smoothed) //29
  rsses <- Try(getModelsRss(models, smoothed)) //30
  (mses, tss) <- totalSquaresError(models, smoothed.head) //31
} yield {
  s"""${rsses.mkString(" ")} ${mses.mkString(" ")}
     | Residual error= $tss""".stripMargin
}
The dataset is large (1,260 trading sessions) and noisy enough to warrant filtering using a simple moving average with a period of 16 trading sessions (line 24). The purpose of the test is to evaluate the possible correlation between the four ETFs: CNY, GLD, SPY, and TLT (line 25). The execution test instantiates the simple moving average (line 26) as described in the Simple moving average section of Chapter 3, Data Pre-processing.
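For readers who have not yet met the smoothing step, a simple moving average can be sketched in a few lines. This version is an assumption for illustration only: it returns one value per full window and therefore drops the first period - 1 points, which may differ from the behavior of the SimpleMovingAverage class used in the test. The names MovingAverage and simple are hypothetical:

```scala
// Minimal sketch (hypothetical names): trailing simple moving average.
// The output contains xs.size - period + 1 values, one per full window.
object MovingAverage {
  def simple(xs: Vector[Double], period: Int): Vector[Double] = {
    require(period > 0 && period <= xs.size, "invalid period")
    xs.sliding(period).map(_.sum / period).toVector
  }
}
```

With a period of 16 and 1,260 sessions, this sketch would produce 1,245 smoothed values.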
The workflow executes the following steps:
Create the moving average partial function, pfnMovAve (line 27).
Generate the smoothed historical prices series for the CNY, GLD, SPY, and TLT ETFs using the function filter as follows (line 28):
type PFNMOVAVE = PartialFunction[DblVec, Try[DblVec]]

def filter(pfnMovAve: PFNMOVAVE): Try[Array[DblVec]] = Try {
  symbols.map(s =>
    DataSource(s"$getPath(s"$dataPath/$s.csv}")}",
      true, true, 1))
    .map(_.get(adjClose))
    .map(pfnMovAve(_))
    .map(_.getOrElse(-1.0))
}
Create the models to evaluate using the createModels method (line 29):
type Models = List[(Array[String], DblMatrix)]
type DblMatrix = Array[Array[Double]]

def createModels(smoothed: Array[DblVec]): Try[Models] = Try {
  val features = smoothed.drop(1).map(_.toArray) //32
  List[(Array[String], DblMatrix)]( //33
    (Array[String]("CNY", "SPY", "GLD", "TLT"), features.transpose),
    (Array[String]("CNY", "GLD", "TLT"), features.drop(1).transpose),
    (Array[String]("CNY", "SPY", "GLD"), features.take(2).transpose),
    (Array[String]("CNY", "SPY", "TLT"),
      features.zipWithIndex
        .filter(_._2 != 1)
        .map(_._1)
        .transpose),
    (Array[String]("CNY", "GLD"), features.slice(1, 2).transpose)
  )
}
The smoothed values for CNY are used as the expected values. Therefore, they are removed from the features list (line 32). The five models are evaluated by adding or removing elements from the features list (line 33).
Compute the RSS for each of the models using getModelsRss (line 30). The method invokes getRss, introduced earlier in this section, for each model (line 34):
def getModelsRss(models: Models, y: Array[DblVec]): List[String] =
  models.map { case (labels, m) =>
    s"${getRss(m.toVector, y.head, labels)}"
  } //34
Compute the mean squared errors, mses, for each model and the total squared error, tss (line 31):
def totalSquaresError(
    models: Models,
    expected: DblVec): Try[(List[String], Double)] = Try {
  val errors = models.map { case (labels, m) =>
    rssSum(m, expected)._1 //35
  }
  val mses = models.zip(errors).map { case (f, e) =>
    s"MSE: ${f._1.mkString(" ")} = $e"
  }
  (mses, Math.sqrt(errors.sum) / models.size) //36
}
The totalSquaresError method computes the error for each model by retrieving its RSS value through rssSum (line 35). The method returns a pair made of the list of mean squared errors for each model and the total squared error (line 36).
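The five models of this test are listed by hand. As a hedged alternative, the candidate subsets of features can be enumerated programmatically with the standard combinations method of Scala collections. The object FeatureSubsets is a hypothetical helper, and an exhaustive enumeration yields 2^D - 1 non-empty subsets (seven for the three features used here) rather than the five evaluated in the test:

```scala
// Minimal sketch (hypothetical names): enumerate every non-empty subset
// of the feature labels, starting with the largest subsets.
object FeatureSubsets {
  def all(labels: Vector[String]): Vector[Vector[String]] =
    (labels.size to 1 by -1).toVector
      .flatMap(k => labels.combinations(k).toVector)
}
```

The number of subsets doubles with each additional feature, which is why this exhaustive approach is only practical for a small features set.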
The RSS does not always provide an accurate visualization of the fitness of the regression model. The fitness of the regression model is commonly assessed using the r² statistic. The r² value is a number that indicates how well data fits a statistical model.
M4: RSS and r² statistics are defined in the following formula:
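In the usual notation, with ȳ denoting the mean of the expected values (a symbol introduced here for convenience), the two statistics are commonly written as follows:

```latex
RSS = \sum_{i=0}^{n-1} \bigl( y_i - f(x_i \mid w) \bigr)^2,
\qquad
r^2 = 1 - \frac{RSS}{\sum_{i=0}^{n-1} ( y_i - \bar{y} )^2},
\qquad
\bar{y} = \frac{1}{n} \sum_{i=0}^{n-1} y_i
```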
The implementation of the computation of the r² statistic is simple: for each model, fj, the rssSum method computes the tuple {RSS, least squares errors} as defined in formula M4:
def rssSum(xt: DblMatrix, expected: DblVec): DblPair = {
  val regression = MultiLinearRegression[Double](xt, expected) //37
  val pfnRegr = regression |> //38
  val results = sse(expected.toArray, xt.map(pfnRegr(_).get))
  (regression.rss, results) //39
}
The rssSum method instantiates the MultiLinearRegression class (line 37), retrieves the RSS value, then validates the regressive model pfnRegr (line 38) against the expected values (line 39). The output of the test is presented in the following screenshot:
The output results clearly show that the three-variable regression CNY=f (SPY, GLD, TLT) is the most accurate or fittest model for the time series CNY, followed by CNY =f (SPY, TLT). Therefore, the feature selection process generates the features set {SPY, GLD, TLT}.
Let's plot the model against the raw data:
The regression model smoothed the original time series, CNY. It weeded out all but the most significant price variations.
The graph plotting the r² value for each of the models confirms that, of the three-feature models, CNY=f (SPY, GLD, TLT) is the most accurate.
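For reference, the r² statistic can be computed directly from the expected and predicted values. The following sketch uses the standard definition, 1 - RSS/TSS, and assumes that the expected values are not all identical (otherwise the total sum of squares is zero). The object RSquared is a hypothetical name:

```scala
// Minimal sketch (hypothetical names): r2 = 1 - RSS / TSS.
object RSquared {
  def r2(expected: Vector[Double], predicted: Vector[Double]): Double = {
    require(expected.nonEmpty && expected.size == predicted.size, "need paired values")
    val mean = expected.sum / expected.size
    // Residual sum of squares between expected and predicted values
    val rss = expected.zip(predicted).map { case (y, p) => (y - p) * (y - p) }.sum
    // Total sum of squares around the mean of the expected values
    val tss = expected.map(y => (y - mean) * (y - mean)).sum
    1.0 - rss / tss
  }
}
```

A perfect fit gives an r² of 1, and a model that always predicts the mean of the expected values gives 0.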