© Ervin Varga 2019
E. Varga, Practical Data Science with Python 3, https://doi.org/10.1007/978-1-4842-4859-1_7

7. Machine Learning

Ervin Varga
Kikinda, Serbia

Machine learning is regarded as a subfield of artificial intelligence that deals with algorithms and technologies to squeeze out knowledge from data. Its fundamental ingredient is Big Data, since without the help of a machine, our attempt to manually process huge volumes of data would be hopeless. As a product of computer science, machine learning tries to approach problems algorithmically rather than purely via mathematics. An external spectator of a machine learning module might admire it as some sort of magic happening inside a box. Eager reductionism may lead us to say that it is all just “bare” code executed on a classical computer system. Of course, such a statement would be an abomination. Machine learning does belong to a separate branch of software, which learns from data instead of blindly following predefined rules. Nonetheless, for its efficient application, we must know how and what such algorithms learn as well as what type of algorithm(s) to apply in a given context. No machine learning system can notice that it is being misapplied. The goal of this chapter is to lay down the foundational concepts and principles of machine learning exclusively through examples.

There are multiple ways to group machine learning algorithms. We can differentiate between the following three learning styles:
  • Supervised learning: Here an algorithm is exercised on known observations until it achieves a desirable level of performance. The main challenge is to acquire enough high-quality marked data for appropriate training. Some members of this group are linear regression, logistic regression, support vector machine, naive Bayes classifier, etc.

  • Unsupervised learning: These algorithms try to autonomously discover hidden structures in data for the purpose of grouping them, finding interesting relationships (a.k.a. association rule learning), or reducing inherent dimensionality (describing effects with fewer features). Some members of this group are K-Means clustering, principal component analysis, manifold learning, the Apriori algorithm, etc.

  • Semi-supervised learning: These algorithms try to decipher hidden structures based on guidance from labeled specimens. It isn’t uncommon that unsupervised learning algorithms are run on half-marked data, as a preprocessing step, to label remaining data points (a technique known as label propagation).

Besides grouping algorithms, we can also discern various learning methods as follows:
  • Full-batch learning (a.k.a. statistical learning): We feed an algorithm all training data at once. After initial training, model parameters remain fixed. This scheme is also most popular for demonstrating various algorithms in action due to its simplicity.

  • Mini-batch learning: We feed an algorithm data in chunks. For example, with an enormous training sample, the standard gradient descent optimization method is prohibitive. This is where the stochastic gradient descent alternative becomes attractive, since it works with a small chunk at any moment in time (a minimal sketch appears after this list).

  • Online learning: This is an extreme case of mini-batching in which the batch size is reduced to a single observation. The usual setup is that the system is warmed up on historical data and left to update its parameters while running in production. This learning method has many subvariants, which are listed below as separate groups to avoid nesting.

  • Streaming: Here, an algorithm also works on a single observation at a time but cannot revisit past records (generic online learning algorithms are allowed to go over data multiple times). It is OK for a streaming system to cache recent items or keep running statistics (like a moving average), although these are minuscule in comparison to the training corpus used in the previously described setups. Finally, many streaming approaches actually rely on micro-batching (with tiny batches). They are still called streaming since they don’t pass multiple times over older batches.

  • Active learning: This method actively seeks feedback from other entities while running. For example, an online learning–based spam filter may incorporate user actions (reclassifying a piece of spam as normal, or vice versa) to continuously update itself. A semi-supervised system could ask for help to label conflicting (confusing) observations.

  • Reinforcement learning: This special dynamic learning method builds upon interaction between the system and its environment. Using an efficient feedback loop and rewarding mechanism (with positive and negative rewards), an algorithm learns what actions are proper in a given context. Robots typically learn in this manner.
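To make the mini-batch/online idea concrete, here is a minimal sketch (my own illustration, not from the book’s code base) using scikit-learn’s SGDRegressor, whose partial_fit method updates the coefficients chunk by chunk without revisiting older data. The synthetic chunks below are purely hypothetical:
import numpy as np
from sklearn.linear_model import SGDRegressor
rng = np.random.RandomState(0)
model = SGDRegressor(learning_rate='constant', eta0=0.01)
for batch in range(100):
    # Each "batch" simulates a chunk of freshly arriving observations.
    X_chunk = rng.uniform(size=(32, 3))
    y_chunk = X_chunk @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 32)
    # partial_fit updates the model incrementally; old chunks are never revisited.
    model.partial_fit(X_chunk, y_chunk)
print(model.coef_, model.intercept_)

The recovered coefficients should approach the true values [1.5, -2.0, 0.5] as chunks keep arriving, which is exactly the behavior a streaming or online learner relies on.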

We can also embark on separating algorithms by their similarity (e.g., neural networks, spectral methods, Bayesian networks, etc.). Nevertheless, any classification can only broadly enumerate knowledge areas, with blurred borders and overlaps. However, some notions are shared by all of them. These will be the topic in the rest of this chapter.

Irrespective of which approach you choose, at some point you will need to peek under the hood of many algorithms to tweak advanced parameters, combine them to work in a unified fashion (a.k.a. an ensemble), or connect them as a pipeline for staged processing. Eventually, you cannot escape deep mathematics. The good news is that you can survive this encounter in an incremental fashion.

Exposition of Core Concepts and Techniques

This section gradually introduces common machine learning concepts using ordinary least squares regression, which is simple to comprehend yet powerful enough to serve as a basis for discussion. It establishes a linear relationship between features (predictors) to predict an output value (response). The features themselves may be of arbitrary degree and complexity (for example, they could be polynomial terms). Listing 7-1 shows the data_generator.py module, which produces features and outputs based on various criteria by simulating fake “real world” processes. It provides the scaffolding for our demo harness to exhibit the following notions: real world, training process, process parameters, runtime model, runtime parameters, estimators, evaluation metrics (loss function, mean squared error, explained variance), overfitting, underfitting, feature interaction, collinearity, and regularization. My intention here is to err on the side of oversimplification to convey essential ideas. There are a lot of excellent books about machine learning with a gamut of complex mathematics behind the scenes; the field is immense.
import numpy as np
import pandas as pd
def generate_base_features(sample_size):
    x_normal = np.random.normal(6, 9, sample_size)
    x_uniform = np.random.uniform(0, 1, sample_size)
    x_interacting = x_normal * x_uniform
    # x_combined depends on x_normal, and x_collinear is an exact multiple of
    # x_combined; these interrelationships are exploited in later demos.
    x_combined = 3.6 * x_normal + np.random.exponential(2/3, sample_size)
    x_collinear = 5.6 * x_combined
    features = {
        'x_normal': x_normal,
        'x_uniform': x_uniform,
        'x_interacting': x_interacting,
        'x_combined': x_combined,
        'x_collinear': x_collinear
    }
    return pd.DataFrame.from_dict(features)
def identity(x):
    return x
def generate_response(X, error_spread, beta, f=identity):
    # Homoscedastic Gaussian noise: the error spread does not depend on X.
    error = np.random.normal(0, error_spread, (X.shape[0], 1))
    intercept = beta[0]
    coef = np.array(beta[1:]).reshape(X.shape[1], 1)
    return f(intercept + np.dot(X, coef)) + error
Listing 7-1

data_generator.py Module

Our intimate knowledge about each underlying data generator process allows us to illuminate concepts in an exact manner. The real world simulated by generate_response has the following vector form: $y = f(\beta_0 + \beta_{1:n}X) + \varepsilon$ (see also the sidebar “Linear Regression Varieties”), where f is a discretionary function. The error term represents an inherent noise, which isn’t encompassed by the model. We cast it as a Gaussian random variable $\varepsilon \sim \mathcal{N}(\mu = 0, \sigma = error\_spread)$; that is, a term that symmetrically fluctuates around the output following the Normal distribution. For simplicity, we produce homoscedastic outputs, meaning the errors are uncorrelated and uniform (they don’t depend on X). The vector β is arbitrary.

An external observer can only see records (xi, yi) from this world, where i ∈ [1, sample size]. The art, engineering, and science is to reconstruct real-world phenomena by using only one or more samples, as shown in Figure 7-1. We proceed by assuming a specific model (for example, linear with a given error distribution). Afterward, we try to figure out model parameters (like vector β and σ) and seek proper features (this obviously has an impact on parameters). All in all, we have lots of stuff to presume and calculate.
Figure 7-1

From outside we can only gather observations and try to establish sound relationships between features and outputs

Establishing relationships between features and output allows us to either predict future outcomes or better understand what is going on in the system. In Figure 7-1, the frontal digits represent assumed predictors and fitted values (what we closely examine and crunch in production), while the digits on the globe represent real-world data. We can state that our model closely resembles the real world if there is an acceptable match between them.

Linear Regression Varieties

The generic form of a linear regression is given by y = m(x) + ε. The conditional expectation function m is E[y|x], if E[ε|x]=0 (when the homoscedasticity property holds). Depending on m’s structure, we can discern the following types of linear regressions:
  • Parametric: $m(x) = \beta_0 + \beta_{1:n}x$ is smooth, defined, and easily interpretable.

  • Nonparametric: m is smooth, must be discovered from data, flexible, and hard to interpret.

  • Semiparametric: Somewhere between the preceding two cases (partially structured).

You must evaluate the cost-benefit ratio before deciding which version to use. For example, if you have outliers (extreme but valid observations that you cannot eradicate), then simple parametric linear regression isn’t a good choice. The nonparametric version is more robust and less vulnerable to outliers.
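To illustrate that robustness point, here is a small sketch of my own (not part of the book’s code base) that contrasts ordinary least squares with a nonparametric k-nearest-neighbors regressor on data containing a few extreme observations at the right edge:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
rng = np.random.RandomState(42)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = 2.5 * x.ravel() + rng.normal(0, 1, 200)
y[-5:] += 80   # a handful of extreme but "valid" observations at large x
parametric = LinearRegression().fit(x, y)
nonparametric = KNeighborsRegressor(n_neighbors=15).fit(x, y)
# The outliers distort the global linear fit (slope and intercept), while the
# local nonparametric estimate at x=2 is barely affected (true value is 5.0).
print(parametric.predict([[2.0]]), nonparametric.predict([[2.0]]))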

Imagine that you encounter a helpful wizard, who reveals everything about the real world’s processes. With that knowledge, how accurately may you predict future outcomes? You definitely cannot talk in absolute terms, as you have to contend with randomness in making your predictions; you may only speak about probabilities of seeing particular responses. In other words, you can develop a conditional probability distribution of an output given an observation. This can be expressed as $P(y \mid X) = \mathcal{N}(f(\beta_0 + \beta_{1:n}X), \sigma)$.

The truth is, you are not straying far from reality (and sanity) if you are hoping for such a wizard. The scikit-learn framework ( https://scikit-learn.org ) is a wizard in its own right. You just need to carefully interpret scikit-learn’s words and never misapply the power bestowed on you. The aim of this section is to shed some light on necessary communication skills and language so that you become this wizard’s beloved apprentice.

Listing 7-2 shows the scikit-learn package in action with some visualization of what happens under the hood. The observer.py module contains functions to recover parameters and demonstrate various effects pertaining to training, testing, and evaluation. You don’t need to fully understand the auxiliary display code (see the explain_sse and plot_mse functions) to follow along. The train_model function recovers the real world’s parameters solely by using observations (a.k.a. labeled training data). The evaluate_model function checks the validity of the model by calculating the mean squared error (MSE) and explained variance score using test data. In practice, you should always select performance criteria that align with the objectives of the problem. Out-of-context measurements are distorting.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
def train_model(model, X_train, y_train):
    model.fit(X_train, y_train)
def evaluate_model(model, X_test, y_test, plot_residuals=False, title=''):
    from sklearn.metrics import mean_squared_error, explained_variance_score
    y_pred = model.predict(X_test)
    if plot_residuals:
        _, ax = plt.subplots(figsize=(9, 9))
        ax.set_title('Residuals Plot - ' + title, fontsize=19)
        ax.set_xlabel('Predicted values', fontsize=15)
        ax.set_ylabel('Residuals', fontsize=15)
        sns.residplot(y_pred.squeeze(), y_test.squeeze(),
                      lowess=True,
                      ax=ax,
                      scatter_kws={'alpha': 0.3},
                      line_kws={'color': 'black', 'lw': 2, 'ls': '--'})
    metrics = {
        'explained_variance': explained_variance_score(y_test, y_pred),
        'mse': mean_squared_error(y_test, y_pred)
    }
    return metrics
def make_poly_pipeline(model, degree):
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    return make_pipeline(PolynomialFeatures(degree=degree, include_bias=False), model)
def print_parameters(linear_model, metrics):
    print('Intercept: %.3f' % linear_model.intercept_)
    print('Coefficients: ', linear_model.coef_)
    print('Explained variance score: %.3f' % metrics['explained_variance'])
    print("Mean squared error: %.3f" % metrics['mse'])
def plot_mse(model, X, y, title, error_spread):
    def collect_mse():
        from sklearn.model_selection import train_test_split
        from sklearn.model_selection import cross_val_score
        metrics_all = []
        for train_size_pct in range(10, 110, 10):
            X_train, X_test, y_train, y_test = \
                train_test_split(X, y, shuffle=False, train_size=train_size_pct / 100)
            metrics_current = dict()
            metrics_current['percent_train'] = train_size_pct
            train_model(model, X_train, y_train)
            metrics_train = evaluate_model(model, X_train, y_train)
            metrics_current['Training score'] = metrics_train['mse']
            metrics_cv = cross_val_score(
                model,
                X_train, y_train,
                scoring='neg_mean_squared_error', cv=10)
            metrics_current['CV score'] = -metrics_cv.mean()
            if X_test.shape[0] > 0:
                metrics_test = evaluate_model(model, X_test, y_test)
                metrics_current['Testing score'] = metrics_test['mse']
            else:
                metrics_current['Testing score'] = np.NaN
            metrics_all.append(metrics_current)
        return pd.DataFrame.from_records(metrics_all)
    import matplotlib.ticker as mtick
    df = collect_mse()
    error_variance = error_spread**2
    ax = df.plot(
        x='percent_train',
        title=title,
        kind='line',
        xticks=range(10, 110, 10),
        sort_columns=True,
        style=['b+--', 'ro-', 'gx:'],
        markersize=10.0,
        grid=False,
        figsize=(8, 6),
        lw=2)
    ax.set_xlabel('Training set size', fontsize=15)
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    y_min, y_max = ax.get_ylim()
    # FIX ME: See Exercise 2!
    ax.set_ylim(max(0, y_min), min(2 * error_variance, y_max))
    ax.set_ylabel('MSE', fontsize=15)
    ax.title.set_size(19)
    # Draw and annotate the minimum MSE.
    ax.axhline(error_variance, color="g", ls="--", lw=1)
    ax.annotate(
        'Inherent error level',
        xy=(15, error_variance),
        textcoords='offset pixels',
        xytext=(10, 80),
        arrowprops=dict(facecolor='black', width=1, shrink=0.05))
def explain_sse(slope, intercept, x, y):
    # Configure the diagram.
    _, ax = plt.subplots(figsize=(7, 9))
    ax.set_xlabel('x', fontsize=15)
    ax.set_ylabel('y', fontsize=15)
    ax.set_title(r'$SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$', fontsize=19)
    ax.grid(False)
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.tick_params(direction='out', length=6, width=2, colors="black")
    # Show x-y pairs.
    ax.scatter(x, y, alpha=0.5, marker="x")
    # Draw the regression line.
    xlims = np.array([np.min(x), np.max(x)])
    ax.plot(xlims, slope * xlims + intercept, lw=2, color="b")
    # Draw the error terms.
    for x_i, y_i in zip(x, y):
        ax.plot([x_i, x_i], [y_i, slope * x_i + intercept], color="r", lw=2, ls="--")
Listing 7-2

observer.py Module

Notice the advantage of having a unified API (fit, predict, etc.) to work with various models. Even pipelines are handled in the same manner. Listing 7-3 shows the session.py module’s demo_metrics_and_mse function. It depicts the steps to reconstruct parameters from observations using various noise levels. In the absence of noise, we have a deterministic linear regression. As noise increases, the estimation of parameters deteriorates.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from data_generator import *
from observer import *
def set_session_seed(seed):
    np.random.seed(seed)    # Enables perfect reproduction of published results.
def demo_metrics_and_mse():
    set_session_seed(100)
    X = generate_base_features(1000)[['x_normal']]
    for noise_level in [0, 2, 15]:
        y = generate_response(X, noise_level, [-1.5, 4.1])
        model = LinearRegression()
        train_model(model, X, y)
        metrics = evaluate_model(model, X, y)
        print(' Iteration with noise level: %d' % noise_level)
        print_parameters(model, metrics)
        # Visualize the regression line and error terms.
        if noise_level == 15:
            slope = model.coef_[0][0]
            intercept = model.intercept_
            explain_sse(slope, intercept, X[:15].values, y[:15])
Listing 7-3

First Part of session.py Module (Some Imports Are for Later Use)

The warnings module is used to silence FutureWarning messages, thus removing clutter from the output. The following is the printout from executing demo_metrics_and_mse():
Iteration with noise level: 0
Intercept: -1.500
Coefficients:
 [[4.1]]
Explained variance score: 1.000
Mean squared error: 0.000
Iteration with noise level: 2
Intercept: -1.470
Coefficients:
 [[4.0925705]]
Explained variance score: 0.997
Mean squared error: 4.187
Iteration with noise level: 15
Intercept: -1.535
Coefficients:
 [[4.06304243]]
Explained variance score: 0.867
Mean squared error: 223.945

Figure 7-2 pictures how error accumulates; the line itself was calculated over the whole dataset, so it appears off on this subset. The MSE is simply the average of the sum of squared errors (SSE). Each vertical dashed red line designates a single error term (the difference between the true $y_i$ and its predicted value $\hat{y}_i$). The model’s coefficients minimize this MSE (loss function), and the corresponding expression to calculate the vector β is an unbiased minimal variance estimator. It is independent of the inherent noise factor, and MSE is an estimator of this error term’s spread ($\sigma \cong \sqrt{MSE}$). Estimates and values dependent upon them are denoted by a hat symbol.

Note

The intercept only makes sense when the matching predictor is of ratio level data type (the notion of absolute zero exists). Otherwise, it should be turned off (see the constructor for the LinearRegression class).

Hint

The print_parameters method is rudimentary. For a complete R-style summary of model parameters, you may want to utilize the statsmodels package (visit https://www.statsmodels.org ).

Once a model is prepared (trained), it is ready for production usage to deal with unseen data. This is what I refer to as a runtime model. All learned (runtime) parameters are its integral part and are usually distributable over various communication channels. This property allows training to be done separately on powerful machines with lots of data. For example, all we need to know in production to handle real data is the corresponding vector $\hat{\beta}$. Calculating $\hat{y} = \hat{\beta}_0 + \hat{\beta}_{1:n}X$ can be performed on any constrained device.
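The following sketch (a hypothetical illustration, not one of the chapter’s modules) makes this concrete: it extracts the learned parameters from a fitted LinearRegression and reproduces its predictions with a plain dot product, which is all a constrained production device would need:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 2))
y_train = 1.5 + X_train @ np.array([2.0, -3.0]) + rng.normal(0, 0.5, 500)
model = LinearRegression().fit(X_train, y_train)
# "Ship" only the runtime parameters; no scikit-learn is needed on the device.
beta_0 = model.intercept_
beta = model.coef_
X_new = rng.normal(size=(3, 2))
manual_pred = beta_0 + X_new @ beta   # hat{y} = beta_0 + beta_{1:n} X
assert np.allclose(manual_pred, model.predict(X_new))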

While working with world1 (the original model) we used a trivial training process. All available data were used both for training and testing, which is something you should avoid in practice. In this case, it didn’t cause problems. Our model’s complexity perfectly matched the truth, as evidenced by the fact that higher noise only caused a drop in explained variance and a rise in MSE. This was an indication that we hadn’t tried to capture nonessential properties of the real world. An overly complex model is capable of doing so, which leads to overfitting. Contrary to this is underfitting, when our model is too weak to capture even the fundamental characteristics. The next two sections demonstrate these aspects thoroughly.
Figure 7-2

Error terms $e_i = y_i - \hat{y}_i$, which are squared, summed up, and averaged to calculate MSE. The root MSE ($\sqrt{MSE}$) is useful to restore the response’s original unit.

Overfitting

We will augment the linear model with polynomial features of various degrees; the model itself remains linear in terms of features. The aim is to demonstrate overfitting. Since this model is more powerful than the previously investigated model, it has enough capacity to encompass erroneous fluctuations in data; in a deterministic case (without error), you wouldn’t notice any difference. Machine learning algorithms cannot decipher that irregularities in data aren’t critical. There is a mechanism to detect overfitting by splitting historical data into training and testing sets.

Listing 7-4 shows the demo_overfitting function, which uses polynomial features, as well as the plot_mse function that plots MSE for both training and test sets of various sizes. During the run, the data is split into training and test sets of varying sizes. This expansion of the training process introduces a new process parameter: the volume of data reserved for training purposes (the rest is kept for testing). Previously, we just used whatever we had without any breakdown. The inner visualization function demonstrates what actually happens behind the curtain.
def demo_overfitting():
    def visualize_overfitting():
        train_model(optimal_model, X, y)
        train_model(complex_model, X, y)
        _, ax = plt.subplots(figsize=(9, 7))
        ax.set_yticklabels([])
        ax.set_xticklabels([])
        ax.grid(False)
        X_test = np.linspace(0, 1.2, 100)
        plt.plot(X_test, np.sin(2 * np.pi * X_test), label='True function')
        plt.plot(
            X_test,
            optimal_model.predict(X_test[:, np.newaxis]),
            label='Optimal model',
            ls='-.')
        plt.plot(
            X_test,
            complex_model.predict(X_test[:, np.newaxis]),
            label='Complex model',
            ls='--',
            lw=2,
            color='red')
        plt.scatter(X, y, alpha=0.2, edgecolor="b", s=20, label='Training Samples')
        ax.fill_between(X_test, -2, 2, where=X_test > 1, hatch='/', alpha=0.05, color="black")
        plt.xlabel('x', fontsize=15)
        plt.ylabel('y', fontsize=15)
        plt.xlim((0, 1.2))
        plt.ylim((-2, 2))
        plt.legend(loc='upper left')
        plt.title('Visualization of How Overfitting Occurs', fontsize=19)
        plt.show()
    set_session_seed(172)
    X = generate_base_features(120)[['x_uniform']]
    y = generate_response(X, 0.1, [0, 2 * np.pi], f=np.sin)
    optimal_model = make_poly_pipeline(LinearRegression(), 5)
    plot_mse(optimal_model, X, y, 'Optimal Model', 0.1)
    complex_model = make_poly_pipeline(LinearRegression(), 35)
    plot_mse(complex_model, X, y, 'Complex Model', 0.1)
    visualize_overfitting()
Listing 7-4

Contains Functions to Exemplify Overfitting

When a model matches the truth, the MSEs induced by both the training set and the test set gravitate around the achievable minimum MSE (reflects inherent variance in data), as shown in Figure 7-3. Otherwise, the test set shows a worse performance, as presented in Figure 7-4. This is a sign that the model isn’t generalizing properly and is picking up unimportant details from a training set. As a model’s power increases above the required level, the cross-validation (CV) score considerably deteriorates, too.

The CV score is a very efficient way to evaluate your model’s performance. It is an average of individual CV scores. The training data is randomly partitioned into K number of equal-sized complementary subsets (we use K=10; see the collect_mse inner function in Listing 7-2 as well as Exercise 7-2); this number K is another process-related parameter. In each round, one segment is used for testing (more precisely, for validation), while the rest is used as real training data (in the later section “Regularization,” we will use K-fold CV to set a runtime parameter). Every data point from the initial dataset is used only once for testing. In this manner, the algorithm cycles through all partitions, so the overall mean score is indeed a reliable performance indicator. There are also extreme variants, where testing is only done with a single observation (singleton) per iteration (a.k.a. leave-one-out CV).
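The K-fold procedure can also be spelled out explicitly. The sketch below is my own illustration under the same K=10 convention (not a verbatim excerpt of collect_mse); it shows both a manual loop with KFold and the equivalent call to cross_val_score:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
y = 4.1 * X.ravel() - 1.5 + rng.normal(0, 2, 200)
model = LinearRegression()
fold_scores = []
for train_idx, valid_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Train on 9 folds, validate on the held-out fold.
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))
print('Manual 10-fold MSE:', np.mean(fold_scores))
# cross_val_score performs the same kind of bookkeeping; note the negated MSE convention.
cv_scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)
print('cross_val_score MSE:', -cv_scores.mean())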
Figure 7-3

MSE of the test set approaches natural variance when the model’s complexity is just right

Furthermore, the variance between CV, testing, and training scores lessens as we use more training data. There is no test score marker when we use all available data for training. The horizontal dashed line denotes the inherent error in data (it may stem from measurement errors). A proper model wouldn’t try to embody this segment. In machine learning we cannot simply instruct the computer to forget about inherent error. The trick is to set your model’s complexity just right, so that it has no capacity to memorize unwanted details. Apparently, the machine first tried to soak up everything while the training set was small, but later gave up and only followed the main trend. Consequently, you must possess enough training data for a given model’s complexity. Monstrous models (especially deep neural networks) must devour a huge amount of training data before being ready for production.

Overfitting and underfitting (demonstrated in the next section) are tightly associated with the central issue of supervised learning known as bias-variance trade-off . A highly biased model may miss important variances in the training data, while a high-variance model may try to capture nonessential properties. Balancing these two forces is one of the most difficult aspects of training machine learning algorithms. For example, extreme care must be given to decision trees, which may easily cover all edge cases in training data.

Figure 7-5 shows why an overly complex model doesn’t generalize well; that is, it has low performance on a test set. Now, this set isn’t exclusively about some totally uncharted area from a domain but is simply a collection of data points excluded from the training set. Reasoning about unknown space entails a different learning approach (see also reference [1]), which will be the topic of the next case study.
Figure 7-4

MSE of the test set rises above the inherent error level when the model’s complexity is too high

The cross-validation score is even out of range here. Observe that the model completely memorized the small training sample at the beginning and always managed to follow unintended fluctuations. One way to combat this situation is to saturate the system with more training data. In our case, the model started to perform reasonably on the test set only once the training portion reached about 60%.

Underfitting and Feature Interaction

This section introduces another typical situation in which features interact. The goal is to showcase that it isn’t enough to solely identify pertinent features. You must understand how they interrelate in the real world. We assume two models: one that treats individual features as independent, and another that includes their interaction.

Listing 7-5 shows the demo_underfitting function that demonstrates underfitting. When a model matches the truth, the MSEs induced by both the training set and the test set gravitate around the achievable minimum MSE (similarly as shown in Figure 7-3). By contrast, Figure 7-6 illustrates what happens with a weak model; obviously, adding more data to a weak model doesn’t help. Underfitting is less common in practice than overfitting, particularly with powerful deep neural networks.
Figure 7-5

The complex model’s prediction line is jagged due to an attempt to encompass inherent error

On the right side of Figure 7-5, you can see what happens in an unexplored territory. Both fitted lines depart from the true function.
def demo_underfitting():
    set_session_seed(15)
    X = generate_base_features(200)
    X_interacting = X[['x_interacting']]
    y = generate_response(X_interacting, 2, [1.7, -4.3])
    plot_mse(LinearRegression(), X_interacting, y, 'Optimal Model', 2)
    X_weak = X[['x_normal', 'x_uniform']]
    plot_mse(LinearRegression(), X_weak, y, 'Weak Model', 2)
Listing 7-5

Function to Exemplify Underfitting via Feature Interaction

The weak model properly enlists both participating features, but taken separately, they cannot provide value.
Figure 7-6

MSEs of all sets are far away from inherent error with a weak model

Collinearity

If you scrutinize the generate_base_features function, you will notice interrelationships between the features x_normal, x_combined, and x_collinear. Our new simulated world uses only x_normal and x_combined. As before, we will presume two variants of this world: one using the same features, and the other one incorporating x_collinear, too. Listing 7-6 shows the code for demonstrating collinearity. In machine learning, this phenomenon has a negative impact on performance and stability (it becomes hard to assess the impact of individual features on the outcome).
def demo_collinearity():
    set_session_seed(10)
    X = generate_base_features(1000)
    X_world = X[['x_normal', 'x_combined']]
    y = generate_response(X_world, 2, [1.1, -2.3, 3.1])
    model = LinearRegression()
    # Showcase the first assumed model.
    train_model(model, X_world, y)
    metrics = evaluate_model(model, X_world, y)
    print(' Dumping stats for model 1')
    print_parameters(model, metrics)
    # Showcase the second assumed model.
    X_extended_world = X[['x_normal', 'x_combined', 'x_collinear']]
    train_model(model, X_extended_world, y)
    metrics = evaluate_model(model, X_extended_world, y)
    print(' Dumping stats for model 2')
    print_parameters(model, metrics)
    # Produce a scatter matrix plot.
    df = X
    df.columns = ['x' + str(i + 1) for i in range(len(df.columns))]
    df['y'] = y
    pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal="kde")
Listing 7-6

Extension of session.py with the demo_collinearity Function

Here is the output of executing demo_collinearity():
Dumping stats for model 1
Intercept: 1.021
Coefficients:
 [[-2.1290165   3.05443636]]
Explained variance score: 0.999
Mean squared error: 4.274
Dumping stats for model 2
Intercept: 1.021
Coefficients:
 [[-2.1290165   0.09438926  0.52857984]]
Explained variance score: 0.999
Mean squared error: 4.274

The only difference between these runs is the significance of x_combined. This is a typical sign of collinearity. The system is confused about whether to “work” with x_combined or x_collinear, since one is a direct linear combination of the other. There is also a strong relationship between x_normal and x_combined. A practical way to check for such interdependence is to generate a scatter matrix plot, as shown in Figure 7-7; this is the final result of running demo_collinearity (the columns are renamed to better fit on the diagram). It is possible to produce a correlation matrix, but it will only reveal linear relationships. A diagram may show you non-linear associations, too.

The independence assumption is common in machine learning algorithms. The naive Bayes method fully relies upon this characteristic. Feature dependence has negative consequences on convergence in logistic regression. Moreover, highly related features are redundant and just complicate a model (a quick numeric check with variance inflation factors is sketched below).
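Beyond eyeballing a scatter matrix, a common numeric check for collinearity is the variance inflation factor (VIF). The sketch below is my own illustration and assumes the statsmodels package is installed; it mimics the interrelationships from generate_base_features rather than reusing the chapter’s modules:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
rng = np.random.RandomState(10)
x_normal = rng.normal(6, 9, 1000)
x_combined = 3.6 * x_normal + rng.exponential(2 / 3, 1000)
x_collinear = 5.6 * x_combined
X = pd.DataFrame({'x_normal': x_normal,
                  'x_combined': x_combined,
                  'x_collinear': x_collinear})
# A VIF well above ~10 usually signals problematic collinearity; the perfectly
# collinear column yields an effectively infinite value here.
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))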

Residuals Plot

As a data scientist, you must constantly seek to inspect problems from multiple angles. Likewise, you should know about complementary plotting techniques, since they could illuminate otherwise invisible aspects of a problem. You have just seen the utility of a scatter matrix plot. In this section we generate two worlds, one linear and one quadratic. Both have tiny coefficients, inherent randomness, and are approximated by truly linear models. Listing 7-7 contains the demo_residuals function to highlight the importance of residuals plots.

Figure 7-8 shows two scatter plots together with regression lines for fitting a linear world by linear model (case 1) and fitting a quadratic world by linear model (case 2). The MSE is same in both cases, while the explained variance score is better for case 2. Which case would you choose as more agreeable for a linear model? My guess is that you would pick case 2. Well, Figure 7-9 tells a different story.
Figure 7-7

Scatter plots of all pairs of variables, with density plots on diagonal

Notice the heavy connection between x1, x4, x5, and y. Furthermore, observe x3’s distinctive shape in relation to the other features.
def demo_residuals():
    def plot_regression_line(x, y, case_num):
        _, ax = plt.subplots(figsize=(9, 9))
        ax.set_title('Regression Plot - Case ' + str(case_num), fontsize=19)
        ax.set_xlabel('x', fontsize=15)
        ax.set_ylabel('y', fontsize=15)
        sns.regplot(x.squeeze(), y.squeeze(),
                    ci=None,
                    ax=ax,
                    scatter_kws={'alpha': 0.3},
                    line_kws={'color': 'green', 'lw': 3})
    set_session_seed(100)
    X = generate_base_features(1000)
    X1 = X[['x_normal']]
    y1 = generate_response(X1, 0.04, [1.2, 0.00003])
    X2 = X1**2
    y2 = generate_response(X2, 0.04, [1.2, 0.00003])
    model = LinearRegression()
    # Showcase the first world with a linearly assumed model.
    plot_regression_line(X1, y1, 1)
    train_model(model, X1, y1)
    metrics = evaluate_model(model, X1, y1, True, 'Case 1')
    print(' Dumping stats for case 1')
    print_parameters(model, metrics)
    # Showcase the second world with a linearly assumed model.
    plot_regression_line(X1, y2, 2)
    train_model(model, X1, y2)
    metrics = evaluate_model(model, X1, y2, True, 'Case 2')
    print(' Dumping stats for case 2')
    print_parameters(model, metrics)
Listing 7-7

Function That Creates Two Deceptive Offers, Where the Obvious One Isn’t Obviously Wrong

This code reuses the model instance from case 1 in case 2. You must always carefully read the documentation about what happens when you call the fit method repeatedly. Here is the citation from scikit-learn’s tutorial (see https://scikit-learn.org/stable/tutorial/basic/tutorial.html ): “Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit().” This behavior is exactly what we need.

You also must take care to reuse the same scaler instance that was used for training the model. The validation and test sets must be scaled in the same manner as the training dataset. Forgetting this detail may cause subtle bugs.
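Here is a generic sketch of that rule (an illustration of my own, not tied to the chapter’s code): fit the scaler on the training split only, then apply the same fitted instance to every other split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
rng = np.random.RandomState(0)
X = rng.normal(100, 25, size=(300, 2))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the SAME fitted scaler
# A fresh scaler fitted on the test split would use different statistics,
# silently shifting the features the model sees in production.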
Figure 7-8

The direction of the slope is wrong for case 1, since it should be positive (see Listing 7-7)

For case 2 the slope is positive, as it should be. All signs suggest that this is a better match.
Figure 7-9

The residuals plot clearly favors case 1 over case 2 with evident visual explanation; the curvature nicely reveals a “quadratic” pattern in residuals

There is no residual pattern for case 1. See Listing 7-2 and the evaluate_model function (it creates a residuals plot with a lowess line to depict the trend).

Regularization

We don’t know in advance which model is optimal. We can either start with a weak model and add more features or start with an overly complex model and try to tone it down. Neither of these approaches is scalable with manual work. The idea is to err on the side of complexity and utilize some automation to reduce the model’s complexity toward an optimal level. This is all about regularization, an automatic mechanism to eschew overfitting. There are many types of regularization. We will use Ridge regression with built-in cross-validation.

Regularization encodes some constraint over coefficients using the language of mathematical optimization; this is expressed in the form of a penalty function. Ridge regression (a.k.a. Tikhonov regularization) aims to keep coefficients as small as possible, since this is equivalent to attaining a least-complex model. Consequently, Ridge regression defines an L2 regularization term (l2-norm) $\alpha \lVert w \rVert_2^2$ that is added to the basic loss function (in our case, MSE). The vector w contains the model’s coefficients (weights). The parameter α balances minimization attempts between the MSE and the penalty term. Higher values tend to reward smaller coefficients, and vice versa. Alpha is calculated by some trial-and-error method (such as cross-validation over a list of candidates). Listing 7-8 shows the function to illuminate regularization, while the aftermath is shown in Figure 7-10.
def demo_regularization():
    from sklearn.linear_model import RidgeCV
    set_session_seed(172)
    X = generate_base_features(120)[['x_uniform']]
    y = generate_response(X, 0.1, [0, 2 * np.pi], f=np.sin)
    regularized_model = make_poly_pipeline(
            RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 10, 20], gcv_mode="auto"),
            35)
    plot_mse(regularized_model, X, y, 'Regularized Model', 0.1)
Listing 7-8

Implementation of demo_regularization Function to Showcase Ridge Regression

All the magic happens inside the scikit-learn framework’s RidgeCV class.
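After fitting, you can ask the pipeline which α the cross-validation selected. The snippet below is an illustrative addition (not part of Listing 7-8); it rebuilds a comparable pipeline on synthetic data and reads the alpha_ attribute, using scikit-learn’s convention that make_pipeline names each step after its lowercased class:
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.RandomState(172)
X = rng.uniform(0, 1, size=(120, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 120)
regularized_model = make_pipeline(
    PolynomialFeatures(degree=35, include_bias=False),
    RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 10, 20]))
regularized_model.fit(X, y)
# The selected alpha tells you how strongly the coefficients were shrunk.
print('Selected alpha:', regularized_model.named_steps['ridgecv'].alpha_)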

Predicting Financial Movements Case Study

The realm of financial modeling will shed some light on time series analysis, where we want to react to events in near real time (for example, predicting stock prices in markets). This is totally different from what we have done thus far using batch processing. My aim here isn’t to develop a new breed of stock market application, but to draw your attention to novel problems and potential solutions with streaming data. You cannot access such data all at once, so stream processing techniques are intrinsically incremental and causal (they act on current knowledge to predict future outcomes). This entails that model parameters evolve over time instead of being fixed at the end of the training stage. Monitoring and regularizing this evolution are also new tasks compared to classical approaches (see references [1-2]).

Another crucial difference concerns the timestamping of observations. In the Chapter 2 case study of e-commerce customer segmentation, the data was coarsely timestamped (data files were segregated by days). Here, each record will have its own absolute timestamp, so that we can monitor trend, seasonality, and other time-based patterns; relative timing isn’t enough for this purpose. To avoid time zone–related difficulties, it is beneficial to register timestamps as seconds since the beginning of the epoch (or a similar higher-granularity scheme).
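For instance, a pandas DatetimeIndex can be converted to epoch seconds in one line (a generic illustration; the case study below keeps the DatetimeIndex as-is):
import pandas as pd
idx = pd.to_datetime(['2018-11-05', '2018-11-06'])
# Elapsed time since the epoch, floor-divided into whole seconds.
epoch_seconds = (idx - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
print(epoch_seconds.tolist())   # [1541376000, 1541462400]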
Figure 7-10

Although the model’s complexity, in terms of features, is the same as in our overfitting example, due to regularization it doesn’t overfit as before

Data Retrieval

The accompanying source code already contains the CSV file with daily time series stock data for Apple (its stock symbol is AAPL). It’s located inside the stock_market subfolder (with other artifacts). You can get a fresh copy, or work with another company’s equity (change the symbol below and rename the target file accordingly), by issuing the following command from Spyder’s IPython console:
!curl -o daily_AAPL.csv "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AAPL&outputsize=full&apikey=<YOUR API KEY>&datatype=csv"

We are relying on Alpha Vantage’s API (see https://www.alphavantage.co/documentation/#daily ) to retrieve daily time series data. You can get a free API key from Alpha Vantage and insert it into the URL in the preceding code where indicated. There are limits on the number of requests that are well documented on the site. The outputsize parameter is set to full, which pulls up to 20 years’ worth of historical data. The output format (datatype parameter) is set to csv.

Data Preprocessing

In the spirit of the data science process, we will first do some preprocessing and exploratory data analysis. All subsequent steps should be carried out from Spyder’s IPython console (ensure that you are in the stock_market folder). The next command shows the first five lines of the downloaded file:
>> !head -n 5 daily_AAPL.csv
timestamp,open,high,low,close,volume
2018-11-07,205.9700,210.0600,204.1300,209.9500,33106489
2018-11-06,201.9200,204.7200,201.6900,203.7700,31882881
2018-11-05,204.3000,204.3900,198.1700,201.5900,66163669
2018-11-02,209.5500,213.6500,205.4300,207.4800,91328654
According to the API documentation, “The most recent data point is the prices and volume information of the current trading day, updated realtime.” We will omit this and work only with stable data points. The next lines read the stock data into a Pandas data frame and show the first couple of records:
>> import pandas as pd
>> stock_data = pd.read_csv('daily_AAPL.csv', usecols=[0, 4, 5], skiprows=[1])
>> stock_data.head()
    timestamp   close    volume
0  2018-11-06  203.77  31882881
1  2018-11-05  201.59  66163669
2  2018-11-02  207.48  91328654
3  2018-11-01  222.22  58323180
4  2018-10-31  218.86  38358933
We only need the timestamp, close price, and volume fields without the most recent data point. At this moment it is convenient to see the overall information about data types, number of rows, etc. The next command shows this information:
>> stock_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5247 entries, 0 to 5246
Data columns (total 3 columns):
timestamp    5247 non-null object
close        5247 non-null float64
volume       5247 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 123.1+ KB
Apparently, the timestamp has an inconvenient object type, which is too generic to be useful. The next lines convert this column into a DateTime index (this allows us to treat data chronologically):
>> stock_data['timestamp'] = pd.to_datetime(stock_data['timestamp'])
>> stock_data.set_index('timestamp', inplace=True, verify_integrity=True)
>> stock_data.head()
             close    volume
timestamp
2018-11-06  203.77  31882881
2018-11-05  201.59  66163669
2018-11-02  207.48  91328654
2018-11-01  222.22  58323180
2018-10-31  218.86  38358933
Figure 7-11 shows AAPL closing levels over time, which is produced with the following snippet. The style parameter controls the appearance of lines in a plot. To differentiate lines in a grayscale image, you cannot solely use colors. We see a huge price drop around 2015 in Figure 7-11. According to one report (read at http://time.com/money/3991712/apple-stock-price-drop ), the reason was a missed business expectation around iPhone sales.
>> import matplotlib.pyplot as plt
>> def plot_time_series(ts, title_prefix, style='b-'):
       ax = ts.plot(figsize=(9, 8), lw=2, fontsize=12, style=style)
       ax.set_title('%s Over Time' % title_prefix, fontsize=19)
       ax.set_xlabel('Year', fontsize=15)
>> plot_time_series(stock_data['close'], 'AAPL Closing Levels')

Discovering Trends in Time Series

The chart in Figure 7-11 is very ragged, due to noise and seasonality. A popular way to identify trends in time series is to take a moving average . The following command plots the trend in closing levels by calculating the simple moving average (SMA) , as shown in Figure 7-12:
>> stock_data.sort_index(inplace=True)
>> plot_time_series(stock_data['close'].rolling('365D').mean(), 'AAPL Closing Trend')
If we don’t sort the index, we will receive an error: ValueError: index must be monotonic. The rolling window is defined to be 365 days. Figure 7-13 combines closing levels and volume trends inside a single diagram. The volume doesn’t change much over time, although when the price started to drop, the volume increased. Maybe there was an urge to sell stocks while the price was still good enough. This kind of comparison is useful for feature engineering and for getting more insight into behavior. The next lines produce such a composite plot:
>> def compose_trends(ts):
       from sklearn.preprocessing import MinMaxScaler
       scaler = MinMaxScaler()
       scaled_ts = pd.DataFrame(scaler.fit_transform(ts), columns=ts.columns, index=ts.index)
       return pd.concat([scaled_ts['close'].rolling('365D').mean(),
                         scaled_ts['volume'].rolling('365D').mean()], axis=1)
>> plot_time_series(compose_trends(stock_data), 'AAPL Closing & Volume Trends', ['b-', 'g--'])
Taking a moving average eliminates small nuances in data. Also observe in Figure 7-13 that the slope of the massive price drop is less than in Figure 7-11. It is imperative to scale the features before combining them in the same diagram. Scaling is also mandatory in various machine learning situations, when movement in one direction may completely shadow movements in other directions. This happens when features are on totally different scales.
Figure 7-11

Variation of AAPL closing levels over time

Figure 7-12

Much smoother diagram compared to Figure 7-11

Figure 7-13

The y axis is now showing scaled values

Transforming Features

In finance, looking at returns on a daily basis is more useful than using absolute quantities. Returns indicate how much an asset’s value fluctuates over time. There are two main ways to formulate returns:
  • Log returns: $r_t = \log\left(\frac{v_t}{v_{t-1}}\right)$, where v denotes the asset’s value (such as closing price, adjusted closing price, volume, adjusted volume, etc.).

  • Scaled percent returns: $r_t = \frac{v_t}{v_{t-1}} - 1$. There is a method pct_change in Pandas to calculate this quantity. For daily returns, scaled percent returns are close to log returns. You can see this from the Taylor expansion of the log function: when $x = \frac{v_t}{v_{t-1}}$ is close to 1, only the first term (x − 1) matters (a quick numeric check follows this list).

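A quick numeric sanity check (illustrative only, not part of the chapter’s modules) confirms that for small daily moves the two formulations nearly coincide:
import numpy as np
import pandas as pd
close = pd.Series([100.0, 101.0, 100.5, 102.0])
log_ret = np.log(close).diff()
pct_ret = close.pct_change()
# For ~1% daily moves the difference shows up only around the 4th decimal.
print(pd.concat([log_ret, pct_ret], axis=1, keys=['log', 'pct']).round(5))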
Log returns have nice mathematical properties (such as additivity, symmetry pertaining to gains and losses, etc.). The following snippet attaches two new features to our data frame: the daily log returns of the stock price and the daily log changes in volume. Figure 7-14 shows AAPL price log returns over time, and Figure 7-15 presents the same for the volume.
>> import numpy as np
>> stock_data['close_ret'] = np.log(stock_data['close']).diff()
>> stock_data['volume_ret'] = np.log(stock_data['volume']).diff()
>> stock_data.head()
            close    volume  close_ret  volume_ret
timestamp
1998-01-02  16.25   6411700        NaN         NaN
1998-01-05  15.88   5820300  -0.023032   -0.096773
1998-01-06  18.94  16182800   0.176216    1.022597
1998-01-07  17.50   9300200  -0.079075   -0.553913
1998-01-08  18.19   6910900   0.038671   -0.296936
>> stock_data.dropna(inplace=True)
>> plot_time_series(stock_data['close_ret'], 'AAPL Price Log Returns')
>> plot_time_series(stock_data['volume_ret'], 'AAPL Volume Log Returns')
Figure 7-14

This diagram has a couple of downward spikes that are far away from central values

We could treat spikes as outliers, which are hard to model and account for (unless you are a finance guru). One way to circumvent this problem is to use volatility-normalized log returns.

The next code section implements the logic to produce volatility-normalized log returns of the stock price, as shown in Figure 7-16. There is no need to perform the same for the volume returns, as we will soon see (its distribution is nearly normal).
>> stock_data['close_ret'] /= stock_data['close_ret'].ewm(halflife=23).std()
>> plot_time_series(stock_data['close_ret'], 'AAPL Volatility-Norm. Price Log Returns')

The half-life of 23 days amounts to a decay factor (weight) of 0.97 (you need to solve for w the equation $w^{23} = \frac{1}{2}$), which controls how long the stock market remembers (or how fast it forgets) old events. The i-th data point has a decay factor of $w^i$. Volatility is calculated as a rolling standard deviation across data points taking into account the exponential decay for old data. Higher decay damps volatility. You should experiment with this factor to match the desired risk level.
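The half-life arithmetic is easy to verify with a one-liner:
w = 0.5 ** (1 / 23)
print(round(w, 4))   # ~0.9703, i.e., the ~0.97 decay factor quoted above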

Streaming Amounts

For simplicity reasons, we have thus far managed the data in batch mode. Nonetheless, log returns, moving averages, and volatility are all amounts that may be calculated in real time. For log returns, you just need to cache the last value to make an update. The same is true for volatility with exponential downweighting, although it is not as obvious as with log returns. To track a moving average, you will need to store and update the last numerator and denominator.

Suppose that you know the current weighted variance $V_{current} = \frac{\sum_i w^i r_i^2}{\sum_i w^i}$, where w is the decay factor; for daily returns, we may assume the mean to be zero. The denominator is the sum of a geometric series whose limit is $\frac{1}{1-w}$. When a new return $r_0$ arrives, then we have $V_{new} = (1-w)\left(w \sum_i w^i r_i^2 + r_0^2\right) = w V_{current} + (1-w) r_0^2$. You can compute the current volatility as $\sqrt{V_{current}}$.

To perform volatility normalization in streaming mode, you need to divide the new return by the current volatility: $r_0 \leftarrow r_0 / \sqrt{V_{current}}$. Of course, you would do this before updating the current volatility; that is, before executing $V_{current} \leftarrow V_{new}$.

Besides computing various running totals, there is a whole gamut of incremental algorithms that may run in streaming fashion. For example, gradient descent is an iterative optimization method that handles all data in one sweep. Stochastic gradient descent is an incremental and iterative method that runs in online mode and updates parameters on-the-fly. Streaming linear regression uses this approach (as described later in this chapter).
Figure 7-15

This diagram seems to be properly centered around zero, but we should still eyeball its distribution

Figure 7-16

Normalization has smoothed out the log returns and made the time series better behaved

To get a better feeling of what normalization did to price log returns, Figures 7-17 and 7-18 show the histograms of non-normalized and normalized variants, respectively. Later in the section you will see the complete code.
Figure 7-17

Histogram of the non-normalized price log returns, which is heavily left-tailed

Figure 7-18

Histogram of the normalized price log returns, which is bell shaped

For the sake of completeness, Figure 7-19 shows the histogram of the volume log returns.
Figure 7-19

This diagram is normally distributed with a slight right tail

Listings 7-9, 7-10, and 7-11 show separate modules that bundle all pertinent steps into coherent units. This concludes the preprocessing stage. The driver.py module calls these functions to implement the whole pipeline.
import numpy as np
import pandas as pd
def read_daily_equity_data(file):
    stock_data = pd.read_csv(file, usecols=[0, 4, 5], skiprows=[1])
    stock_data['timestamp'] = pd.to_datetime(stock_data['timestamp'])
    stock_data.set_index('timestamp', inplace=True, verify_integrity=True)
    stock_data.sort_index(inplace=True)
    return stock_data
def compose_trends(ts):
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    scaled_ts = pd.DataFrame(scaler.fit_transform(ts), columns=ts.columns, index=ts.index)
    return pd.concat([scaled_ts['close'].rolling('365D').mean(),
                      scaled_ts['volume'].rolling('365D').mean()], axis=1)
def create_log_returns(ts, halflife, normalize_close=True):
    ts['close_ret'] = np.log(ts['close']).diff()
    if normalize_close:
        ts['close_ret'] /= ts['close_ret'].ewm(halflife=halflife).std()
    ts['volume_ret'] = np.log(ts['volume']).diff()
    return ts.dropna()
Listing 7-9

data_preprocessing.py Module , Which Contains Functions to Prepare the Data Frame for the Analysis Phase

import matplotlib.pyplot as plt
def plot_time_series(ts, title_prefix, style='b-'):
    ax = ts.plot(figsize=(9, 8), lw=2, fontsize=12, style=style)
    ax.set_title('%s Over Time' % title_prefix, fontsize=19)
    ax.set_xlabel('Year', fontsize=15)
    plt.show()
def hist_time_series(ts, xlabel, bins):
    ax = ts.hist(figsize=(9, 8), xlabelsize=12, ylabelsize=12, bins=bins, grid=False)
    ax.set_title('Distribution of %s' % xlabel, fontsize=19)
    ax.set_xlabel(xlabel, fontsize=15)
    plt.show()
Listing 7-10

data_visualization.py Module , Which Contains Auxiliary Visualizations of Time Series

from data_preprocessing import *
from data_visualization import *
stock_data = read_daily_equity_data('daily_AAPL.csv')
stock_data = create_log_returns(stock_data, 23)
plot_time_series(stock_data['close'], 'AAPL Closing Levels')
plot_time_series(stock_data['close'].rolling('365D').mean(), 'AAPL Closing Trend')
plot_time_series(compose_trends(stock_data), 'AAPL Closing & Volume Trends', ['b-', 'g--'])
# To produce the non-normalized price log returns plot you must call
# the create_log_returns function with normalize_close=False. Try this as an
# additional exercise.
plot_time_series(stock_data['close_ret'], 'AAPL Volatility-Norm. Price Log Returns')
plot_time_series(stock_data['volume_ret'], 'AAPL Volume Log Returns')
hist_time_series(stock_data['close_ret'], 'Daily Stock Log Returns', 50)
hist_time_series(stock_data['volume_ret'], 'Daily Volume Log Returns', 50)
Listing 7-11

driver.py Module , Which Connects All the Pieces Together (First Part of File Shown)

Feature Engineering

Currently, we have raw close levels, raw volume levels, volatility-normalized closing log returns, and volume log returns as our features (we will create more). To see how these impact a potential target (like the predicted stock price change), it is useful to consult Pearson’s correlation coefficient r. It is an indicator of a linear relationship between two features, whose range is [-1, 1]. A positive correlation means that as one value increases/decreases, the other does the same. A negative correlation denotes the opposite behavior. A value of zero represents the absence of a linear relationship, although the quantities may be interrelated in non-linear ways. Usually, if |r| > 0.3, then we can consider the correlation to be noticeable (this is a judgment call, so take this heuristic with a decent pinch of salt).

Log returns denote fluctuations, and it is illuminating to find out whether these are mean-reverting or trend-following in some time period (for example, N days). Mean reversion suggests that returns oscillate around a mean, while trend-following says that they mimic the recent period. We can discover this by fixing N and calculating the coefficient r between returns from the past and the future. If the correlation is low, then we have mean reversion; otherwise, we have trend-following behavior. Listing 7-12 shows the function that reports correlation coefficients and creates scatter plots between past and future price as well as volume log returns (see also Exercise 7-3). It calls the scatter_time_series function to make a scatter plot (see Listing 7-13).
from data_visualization import *
def report_auto_correlation(ts, periods=5):
    for column in filter(lambda str: str.endswith('_ret'), ts.columns):
        future_column = 'future_' + column
        ts[future_column] = ts[column].shift(-periods).rolling(periods).sum()
        current_column = 'current_' + column
        ts[current_column] = ts[column].rolling(periods).sum()
        print(ts[[current_column, future_column]].corr())
        scatter_time_series(ts, current_column, future_column)
Listing 7-12

feature_engineering.py Module, with Function to Investigate Auto-correlation

def scatter_time_series(ts, x, y):
    ax = ts.plot(x=x, y=y, figsize=(9, 8), kind="scatter", fontsize=12)
    ax.set_title('Auto-correlation Graph', fontsize=19)
    ax.set_xlabel(x, fontsize=15)
    ax.set_ylabel(y, fontsize=15)
    plt.show()
Listing 7-13

Function to Create a Scatter Plot to Trace Auto-correlation (data_visualization.py Module)

The following are the correlations for both quantities, and Figure 7-20 shows the scatter plot for volume log returns (you should run the accompanying source code to see the graph for price returns):
                   current_close_ret  future_close_ret
current_close_ret           1.000000          0.013661
future_close_ret            0.013661          1.000000
                    current_volume_ret  future_volume_ret
current_volume_ret            1.000000          -0.419489
future_volume_ret            -0.419489           1.000000
We may conclude that price log returns revert to the mean, while volume log returns show a pronounced negative auto-correlation, moving in the opposite direction of their recent past.
Figure 7-20

As current volume log returns increase, future volume log returns decrease, and vice versa

By varying the period’s length, it is possible to find the highest level of auto-correlation, which may serve as a basis for creating features. It seems that 5 days is a good choice (you may experiment with different periods using the accompanying code base; a minimal sketch of such an experiment is shown below). Therefore, our target feature will be the 5-day future price change, while our initial basic features will be the current 5-day price and volume changes.
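The following is a minimal sketch (the scan_autocorrelation helper is my own, not part of the chapter’s modules) of scanning several candidate periods, assuming it is run before create_features drops the raw return columns:

def scan_autocorrelation(ts, column='close_ret', candidate_periods=(3, 5, 10, 20)):
    # For each candidate period N, correlate the past N-day sum of returns
    # with the future N-day sum and report Pearson's r.
    for n in candidate_periods:
        past = ts[column].rolling(n).sum()
        future = ts[column].shift(-n).rolling(n).sum()
        print('N = %2d days, r = %.4f' % (n, past.corr(future)))

scan_autocorrelation(stock_data)
scan_autocorrelation(stock_data, column='volume_ret')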

Every domain has its own set of favorite, battle-tested features that have proven valuable in predicting targets. There is a powerful technical analysis library with 200 financial indicators called TA-Lib (see https://mrjbq7.github.io/ta-lib ). We will use three of them: normalized SMA, relative strength index (RSI), and on-balance volume (OBV). You have already seen SMA in the “Discovering Trends in Time Series” section. RSI is defined as $$ 100-\frac{100}{1+RS} $$, where $$ RS=\frac{\text{mean gain over a period}}{\text{mean loss over a period}} $$. OBV connects volume flow with price changes. Listing 7-14 contains the code for creating candidate features; a pandas-only approximation of the RSI arithmetic is sketched right after it.
def create_features(ts):
    from talib import SMA, RSI, OBV
    target = 'future_close_ret'
    features = ['current_close_ret', 'current_volume_ret']
    for n in [14, 25, 50, 100]:
        ts['sma_' + str(n)] = SMA(ts['close'].values, timeperiod=n) / ts['close']
        ts['rsi_' + str(n)] = RSI(ts['close'].values, timeperiod=n)
    ts['obv'] = OBV(ts['close'].values, ts['volume'].values.astype('float64'))
    ts.drop(['close', 'volume', 'close_ret', 'volume_ret', 'future_volume_ret'],
            axis='columns',
            inplace=True)
    ts.dropna(inplace=True)
    return ts.corr()
Listing 7-14

Code to Create Features Described in Text (Part of feature_engineering.py)
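If you just want to see the arithmetic behind RSI without installing TA-Lib, the following rough approximation (my own sketch, using plain rolling means instead of TA-Lib’s Wilder smoothing, so the numbers will differ slightly from RSI’s output) may help:

def rsi_approx(close, timeperiod=14):
    # Split daily price changes into gains and losses.
    delta = close.diff()
    gains = delta.clip(lower=0)
    losses = -delta.clip(upper=0)
    # Plain rolling means; TA-Lib uses Wilder's exponential smoothing instead.
    mean_gain = gains.rolling(timeperiod).mean()
    mean_loss = losses.rolling(timeperiod).mean()
    rs = mean_gain / mean_loss
    return 100 - 100 / (1 + rs)

You could compare rsi_approx(stock_data['close']) against the rsi_14 column (before the raw close column is dropped) to see how close the approximation gets.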

The function returns the correlation matrix, which may be conveniently visualized as a heat map. It is also possible to return an exponentially weighted correlation matrix and to use a weighted moving average (see the Pandas EWM object’s corr and mean methods, respectively; a short sketch follows Listing 7-15). The heat map plotting is implemented in a new function, as shown in Listing 7-15 (inside our visualization module).
def heat_corr_plot(corr_matrix):
    import numpy as np
    import seaborn as sns
    mask = np.zeros_like(corr_matrix)
    mask[np.triu_indices_from(mask)] = True
    _, ax = plt.subplots(figsize=(9, 8))
    sns.heatmap(corr_matrix, annot=True, cmap="gist_gray", fmt=".2f", lw=.5, mask=mask, ax=ax)
    plt.tight_layout()
    plt.show()
Listing 7-15

Code to Report Correlations Using Seaborn’s heatmap Facility
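As mentioned above, an exponentially weighted alternative is available directly in Pandas. A minimal sketch, assuming it is applied to the engineered stock_data frame from the driver and reusing the arbitrary half-life of 23 days:

# Pairwise exponentially weighted correlations; the result is indexed by
# (date, feature), so the block at the most recent date is the latest matrix.
ewm_corr = stock_data.ewm(halflife=23).corr()
latest_corr = ewm_corr.loc[stock_data.index[-1]]
heat_corr_plot(latest_corr)

# Analogously, .mean() yields exponentially weighted moving averages of the columns.
ewma = stock_data.ewm(halflife=23).mean()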

Finally, the driver.py module must be extended with the following statements:
from feature_engineering import *
report_auto_correlation(stock_data)
corr_matrix = create_features(stock_data)
heat_corr_plot(corr_matrix)
Figure 7-21 shows the correlation matrix, which can be used for filtering features. Reducing the number of features may improve both accuracy and performance.
Figure 7-21

This diagram shows only the relevant part of the correlation matrix

All diagonal elements are 1, since features are perfectly aligned with themselves. The matrix is symmetric, so the upper-right triangle is redundant. None of the features are strongly coupled with the target, at least not in a linear way.
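As a simple illustration of using the matrix for filtering (the select_features helper below is my own sketch, not part of the chapter’s modules), you could keep only features whose absolute correlation with the target exceeds some threshold:

def select_features(corr_matrix, target='future_close_ret', threshold=0.05):
    # Absolute correlation of every feature with the target, excluding the target itself.
    corr_with_target = corr_matrix[target].drop(target).abs()
    return corr_with_target[corr_with_target > threshold].index.tolist()

print(select_features(corr_matrix))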

Implementing Streaming Linear Regression

Streaming linear regression continually learns and updates its parameters as new training data arrives. We rely on Apache Spark’s MLlib framework, which provides the StreamingLinearRegressionWithSGD class to perform streaming regression. The only task left for us is to provide streaming sources for training and test data. One handy way is to deliver a list of resilient distributed dataset (RDD) instances over a queue. Converting a section of the Pandas DataFrame into an RDD is straightforward. Listing 7-16 shows the implementation to fit a linear model in an online regime (see also Exercise 7-4), and Listing 7-17 shows the final expansion of the driver.py module. You will need to install pyspark (see https://spark.apache.org/docs/latest/api/python/index.html ).
def fit_and_predict(sparkSession, ts):
    import numpy as np
    from sklearn.model_selection import train_test_split
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
    def to_scaled_rdd(pandasDataFrame):
        import pandas as pd
        from sklearn.preprocessing import RobustScaler
        from pyspark.mllib.regression import LabeledPoint
        regressors = pandasDataFrame.columns[1:]
        num_regressors = len(regressors)
        # FIX ME: As a bonus exercise, read the last paragraph from section about residual
        # plots and make the necessary bug fix! Compare the behavior of this version with the
        # fixed one and see whether you can decipher anything from the outputs.
        scaler = RobustScaler()
        scaled_regressors = scaler.fit_transform(pandasDataFrame[regressors])
        scaled_pandasDataFrame = pd.DataFrame(scaled_regressors, columns=regressors)
        scaled_pandasDataFrame['target'] = pandasDataFrame[pandasDataFrame.columns[0]].values
        sparkDataFrame = sparkSession.createDataFrame(scaled_pandasDataFrame)
        return sparkDataFrame.rdd.map(
                lambda row: LabeledPoint(row[num_regressors], row[:num_regressors]))
    def report_accuracy(result_rdd):
        from pyspark.mllib.evaluation import RegressionMetrics
        if not result_rdd.isEmpty():
            metrics = RegressionMetrics(
                    result_rdd.map(lambda t: (float(t[1]), float(t[0]))))
            print("MSE = %s" % metrics.meanSquaredError)
            print("RMSE = %s" % metrics.rootMeanSquaredError)
            print("R-squared = %s" % metrics.r2)
            print("MAE = %s" % metrics.meanAbsoluteError)
            print("Explained variance = %s" % metrics.explainedVariance)
    df_train, df_test = train_test_split(ts, test_size=0.2, shuffle=False)
    train_rdd = to_scaled_rdd(df_train)
    test_rdd = to_scaled_rdd(df_test)
    streamContext = StreamingContext(sparkSession.sparkContext, 1)
    train_stream = streamContext.queueStream([train_rdd])
    test_stream = streamContext.queueStream([test_rdd])
    numFeatures = len(ts.columns) - 1
    model = StreamingLinearRegressionWithSGD(stepSize=0.05, numIterations=300)
    np.random.seed(0)
    model.setInitialWeights(np.random.rand(numFeatures))
    model.trainOn(train_stream)
    result_stream = model.predictOnValues(test_stream.map(lambda lp: (lp.label, lp.features)))
    result_stream.cache()
    result_stream.foreachRDD(report_accuracy)
    streamContext.start()
    streamContext.awaitTermination()
Listing 7-16

Content of the streaming_regression.py Module to Showcase Streaming Linear Regression in Online Mode

from pyspark.sql import SparkSession
from streaming_regression import *
# Parentheses allow the builder calls to span multiple lines.
sparkSession = (SparkSession.builder
                            .master("local[4]")
                            .appName("Streaming Regression Case Study")
                            .getOrCreate())
fit_and_predict(sparkSession, stock_data)
Listing 7-17

Last Piece of the Main driver.py Module

The system will print the following report (once it appears, you can terminate the session):
MSE = 5.663621710391528
RMSE = 2.379836488162901
R-squared = -0.3217279555407153
MAE = 1.8382286514438086
Explained variance = 1.7169768418351132

At the time of this writing, the StreamingLinearRegressionWithSGD class is missing the setIntercept method. Consequently, we get rather “strange” values for R-squared and explained variance. It is also crucial to remember the importance of scaling features before training the model. The RobustScaler scaler is convenient if you aren’t sure about the distribution of your regressors.
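To get a feeling for why RobustScaler is a safe default under unknown distributions, the following standalone sketch (with synthetic data of my own making) contrasts it with StandardScaler on a feature contaminated by outliers:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

np.random.seed(0)
# A mostly well-behaved feature with a handful of extreme outliers.
feature = np.random.normal(size=(1000, 1))
feature[:5] = 100.0

standard = StandardScaler().fit_transform(feature)
robust = RobustScaler().fit_transform(feature)

# The outliers inflate the standard deviation, squashing the bulk of the
# standardized values toward zero; the median/IQR-based scaling is far less affected.
print('StandardScaler spread of inliers: %.3f' % standard[5:].std())
print('RobustScaler spread of inliers: %.3f' % robust[5:].std())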

Exercise 7-1. Improve Reusability

The plot_mse function presets font sizes for title and axes. This tactic is also repeated in other places to preserve consistency. Nonetheless, this approach is a real maintenance nightmare (notice that explain_sse repeats the same setup). Improve the code to centralize setting common parameters (hint: read about style sheets at https://matplotlib.org/users/customizing.html ). What else can you devise to make the code base more flexible and reusable?

As you work on various machine learning problems, utility functions become very handy. There is no point in reinventing the wheel each time you need to plot MSEs. Try to make your code base reusable as much as you can.

Exercise 7-2. Fix a Bug

In Listing 7-2, inside the plot_mse function , you will find the following two lines of code:
# FIX ME: See Exercise 2!
ax.set_ylim(max(0, y_min), min(2 * error_variance, y_max))

Your task is to find out why this marked line is wrong and implement a fix. In reality, the situation is even worse, because nobody will point out to you a wrong section of code. Locating the exact place of an issue is a huge milestone toward correcting errors.

Hint: Figure 7-6 was produced with both lines commented out. Obviously, such resolution is a quick and dirty hack. Try running demo_underfitting with the original plot_mse function and see what happens.

Exercise 7-3. Avoid Side-Effects

In Listing 7-12 you can see a function with side effects. It modifies the input data frame with extra columns. This modification is a prerequisite for calling the function presented in Listing 7-14. All the chaining is driven from the driver module.

Programming in terms of side effects is generally bad practice. You may wonder, then, why it is acceptable for the model to be altered in a similar fashion (see Listing 7-2) after calling its fit method. The crucial difference is that the model encapsulates its state inside an object, so you have much more control over changes to its internals. When ordinary functions are scattered around with assumed side effects, it is easy to lose control and create an unmaintainable mess.

Refactor the feature_engineering.py module to group feature-related manipulations inside a dedicated class. Compare how such object orientation helps to retain control over internals of the system.

Exercise 7-4. Experience Streaming Behavior

In Listing 7-16 the training and test streams were constructed in the following fashion:
train_stream = streamContext.queueStream([train_rdd])
test_stream = streamContext.queueStream([test_rdd])

This was OK to demonstrate the scaffolding of the solution but makes no sense in a real environment. The whole point of streaming is to allow data to continuously arrive in chunks. Apache Spark even allows streams to be combined, so that you may have data coming over multiple channels (for example, you can also monitor a folder for new data files and parse them in real time).

Refactor the solution to have many parts in the training and test sets. You need to split the df_train and df_test referenced data frames into sections and convert each into an RDD. Finally, provide these parts in a list to the queueStream method. Observe what you get on the output (you should receive as many reports as there are pieces of test data).

Bonus task: The current accuracy report doesn’t reveal much about performance. You may want to create a scatter plot of predictions vs. actual values (observations from the test data). Also draw the ideal line (the function y = x).

Summary

This chapter barely scratches the surface of the machine learning (ML) domain. Even dozens of books wouldn’t be enough to fully cover (even at an introductory level) all available algorithms and technologies. The major aim of this chapter was to present common concepts that permeate the ML knowledge area. Without being aware of underfitting, overfitting, regularization, scaling, and similar topics, there is no way to be efficient with any ML approach. To be proficient in ML, you must also recall the golden tenet of data science and engineering: “Keep it simple, stupid!” (a.k.a. KISS principle). Typically, you don’t even need ML to solve a problem, and rarely will you ever need to fire up complicated deep neural networks.

Another potent message of this chapter is the rather blurred borderline between science and art regarding parameter tuning. There are some rules of thumb (for a good overview, I suggest the document at http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf ), but you will need lots of experimentation and trial and error. Consequently, to make all this happen in a reasonable amount of time, you will need powerful hardware (readily available in the cloud as infrastructure or platform as a service).

ML is all about mathematics. There is no way to escape this fact. Neural networks appear to give you a seemingly good escape route (we will cover them in detail in Chapter 12), but eventually you will need to dig under the hood to understand what is going on. This is tightly associated with interpretability of your model. With our ordinary least-square method of regression, coefficients can be easily explained. A particular coefficient represents how much the target changes for a unit change in the corresponding feature. With more complex models, the situation is different. At any rate, I will talk more about inner characteristics of models in Chapter 9.
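For instance, here is a toy illustration (with made-up coefficients, independent of the case study) showing that interpreting an ordinary least-squares model boils down to reading off its coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.normal(size=(100, 2))
# Target built so that a unit change in the first feature moves it by ~2,
# and a unit change in the second by ~-0.5.
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + np.random.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_)       # roughly [ 2.0, -0.5]
print(model.intercept_)  # roughly 0.0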

ML is getting huge attention from both research institutions and industry. This isn’t surprising, since as we delve more into the realm of Big Data, there is a greater need to handle such large amounts of data in an efficient manner. Only with the help of machines can we hope to seize control of massive data. One hot topic is transfer learning, which is an attempt to boost reuse of trained models. The idea is to leverage models optimized for one task to perform well on other tasks, too (maybe with minor extra tweaking and training). This will surely trigger a whole bunch of new algorithms and technologies.

One urgent need in the area of machine learning is an efficient and standardized way to exchange prediction models. One such standard is the Predictive Model Markup Language (PMML). There is a Python library for converting scikit-learn pipelines to PMML at https://github.com/jpmml/sklearn2pmml .
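A hedged sketch of how a scikit-learn pipeline might be exported with that library (the exact API may vary between versions, and the conversion step requires a local Java runtime; consult the project’s README):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Synthetic stand-ins for a real feature matrix and target vector.
np.random.seed(0)
X = np.random.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + np.random.normal(scale=0.1, size=200)

pipeline = PMMLPipeline([
    ('scaler', RobustScaler()),
    ('regressor', LinearRegression())
])
pipeline.fit(X, y)
sklearn2pmml(pipeline, 'linear_model.pmml')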

References

  1. Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

  2. Nathan George, “Machine Learning for Finance in Python,” DataCamp, https://www.datacamp.com/courses/machine-learning-for-finance-in-python .

  3. Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Morgan Kaufmann, 2016.