© Ervin Varga 2019
E. Varga, Practical Data Science with Python 3, https://doi.org/10.1007/978-1-4842-4859-1_7

7. Machine Learning

Ervin Varga
Kikinda, Serbia

Machine learning is regarded as a subfield of artificial intelligence that deals with algorithms and technologies to squeeze out knowledge from data. Its fundamental ingredient is Big Data, since without the help of a machine, our attempt to manually process huge volumes of data would be hopeless. As a product of computer science, machine learning tries to approach problems algorithmically rather than purely via mathematics. An external spectator of a machine learning module might admire it as some sort of magic happening inside a box. Eager reductionism may lead us to say that it is all just “bare” code executed on a classical computer system. Of course, such a statement would be an abomination. Machine learning does belong to a separate branch of software, which learns from data instead of blindly following predefined rules. Nonetheless, for its efficient application, we must know how and what such algorithms learn as well as what type of algorithm(s) to apply in a given context. No machine learning system can notice that it is being misapplied. The goal of this chapter is to lay down the foundational concepts and principles of machine learning exclusively through examples.

There are multiple ways to group machine learning algorithms. We can differentiate between the following three learning styles:
  • Supervised learning: Here an algorithm is exercised on known observations until it achieves a desirable level of performance. The main challenge is to acquire enough high-quality marked data for appropriate training. Some members of this group are linear regression, logistic regression, support vector machine, naive Bayes classifier, etc.

  • Unsupervised learning: These algorithms try to autonomously discover hidden structures in data for the purpose of grouping them, finding interesting relationships (a.k.a. association rule learning), or reducing inherent dimensionality (describing effects with fewer features). Some members of this group are K-Means clustering, principal component analysis, manifold learning, the Apriori algorithm, etc.

  • Semi-supervised learning: These algorithms try to decipher hidden structures based on guidance from labeled specimens. It isn’t uncommon that unsupervised learning algorithms are run on half-marked data, as a preprocessing step, to label remaining data points (a technique known as label propagation).

Besides grouping algorithms, we can also discern various learning methods as follows:
  • Full-batch learning (a.k.a. statistical learning): We feed an algorithm all training data at once. After initial training, model parameters remain fixed. This scheme is also most popular for demonstrating various algorithms in action due to its simplicity.

  • Mini-batch learning: We feed an algorithm data in chunks. For example, with an enormous training sample, the standard gradient descent optimization method is prohibitive. This is where the stochastic gradient descent alternative becomes attractive, since it works with a small chunk at any moment in time (a minimal sketch appears after this list).

  • Online learning: This is an extreme case of mini-batching in which the batch size is reduced to a single observation. The usual setup is that the system is warmed up on historical data and left to update its parameters while running in production. This learning method has many subvariants, which are listed below as separate groups to avoid nesting.

  • Streaming: Here, an algorithm also works on a single observation at a time but cannot revisit past records (generic online learning algorithms are allowed to go over data multiple times). It is OK for a streaming system to cache recent items or keep running statistics (like a moving average), although these are minuscule in comparison to the training corpus used in the previously described setups. Finally, many streaming approaches actually rely on micro-batching (with tiny batches). They are still called streaming since they don’t pass multiple times over older batches.

  • Active learning: This method actively seeks feedback from other entities while running. For example, an online learning–based spam filter may incorporate user actions (reclassifying a piece of spam as normal, or vice versa) to continuously update itself. A semi-supervised system could ask for help to label conflicting (confusing) observations.

  • Reinforcement learning: This special dynamic learning method builds upon interaction between the system and its environment. Using an efficient feedback loop and rewarding mechanism (with positive and negative rewards), an algorithm learns what actions are proper in a given context. Robots typically learn in this manner.
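To make the mini-batch/online idea concrete, here is a minimal sketch (my own illustration, not from the book’s code base) using scikit-learn’s SGDRegressor, whose partial_fit method updates the coefficients chunk by chunk without revisiting older data. The synthetic chunks below are purely hypothetical:
import numpy as np
from sklearn.linear_model import SGDRegressor
rng = np.random.RandomState(0)
model = SGDRegressor(learning_rate='constant', eta0=0.01)
for batch in range(100):
    # Each "batch" simulates a chunk of freshly arriving observations.
    X_chunk = rng.uniform(size=(32, 3))
    y_chunk = X_chunk @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 32)
    # partial_fit updates the model incrementally; old chunks are never revisited.
    model.partial_fit(X_chunk, y_chunk)
print(model.coef_, model.intercept_)

The recovered coefficients should approach the true values [1.5, -2.0, 0.5] as chunks keep arriving, which is exactly the behavior a streaming or online learner relies on.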

We can also embark on separating algorithms by their similarity (e.g., neural networks, spectral methods, Bayesian networks, etc.). Nevertheless, any classification can only broadly enumerate knowledge areas, with blurred borders and overlaps. However, some notions are shared by all of them. These will be the topic in the rest of this chapter.

Irrespective of which approach you choose, at some point you will need to peek under the hood of many algorithms to tweak advanced parameters, combine them to work in a unified fashion (a.k.a. an ensemble), or connect them as a pipeline for staged processing. Eventually, you cannot escape deep mathematics. The good news is that you can survive this encounter in an incremental fashion.

Exposition of Core Concepts and Techniques

This section gradually introduces common machine learning concepts using ordinary least squares regression, which is simple to comprehend yet powerful enough to serve as a basis for discussion. It establishes a linear relationship between features (predictors) to predict an output value (response). The features themselves may be of arbitrary degree and complexity (for example, they could be polynomial terms). Listing 7-1 shows the data_generator.py module, which produces features and outputs based on various criteria by simulating fake “real world” processes. It provides the scaffolding for our demo harness to exhibit the following notions: real world, training process, process parameters, runtime model, runtime parameters, estimators, evaluation metrics (loss function, mean squared error, explained variance), overfitting, underfitting, feature interaction, collinearity, and regularization. My intention here is to err on the side of oversimplification to convey essential ideas. There are a lot of excellent books about machine learning with a gamut of complex mathematics behind the scenes; the field is immense.
import numpy as np
import pandas as pd
def generate_base_features(sample_size):
    x_normal = np.random.normal(6, 9, sample_size)
    x_uniform = np.random.uniform(0, 1, sample_size)
    x_interacting = x_normal * x_uniform
    # x_combined depends on x_normal, and x_collinear is an exact multiple of
    # x_combined; these interrelationships are exploited in later demos.
    x_combined = 3.6 * x_normal + np.random.exponential(2/3, sample_size)
    x_collinear = 5.6 * x_combined
    features = {
        'x_normal': x_normal,
        'x_uniform': x_uniform,
        'x_interacting': x_interacting,
        'x_combined': x_combined,
        'x_collinear': x_collinear
    }
    return pd.DataFrame.from_dict(features)
def identity(x):
    return x
def generate_response(X, error_spread, beta, f=identity):
    # Homoscedastic Gaussian noise: the error spread does not depend on X.
    error = np.random.normal(0, error_spread, (X.shape[0], 1))
    intercept = beta[0]
    coef = np.array(beta[1:]).reshape(X.shape[1], 1)
    return f(intercept + np.dot(X, coef)) + error
Listing 7-1

data_generator.py Module

Our intimate knowledge about each underlying data generator process allows us to illuminate concepts in an exact manner. The real world simulated by generate_response has the following vector form: $y = f(\beta_0 + \beta_{1:n}X) + \varepsilon$ (see also the sidebar “Linear Regression Varieties”), where f is a discretionary function. The error term represents an inherent noise, which isn’t encompassed by the model. We cast it as a Gaussian random variable $\varepsilon \sim \mathcal{N}(\mu = 0, \sigma = error\_spread)$; that is, a term that symmetrically fluctuates around the output following the Normal distribution. For simplicity, we produce homoscedastic outputs, meaning the errors are uncorrelated and uniform (they don’t depend on X). The vector β is arbitrary.

An external observer can only see records (xi, yi) from this world, where i ∈ [1, sample size]. The art, engineering, and science is to reconstruct real-world phenomena by using only one or more samples, as shown in Figure 7-1. We proceed by assuming a specific model (for example, linear with a given error distribution). Afterward, we try to figure out model parameters (like vector β and σ) and seek proper features (this obviously has an impact on parameters). All in all, we have lots of stuff to presume and calculate.
Figure 7-1

From outside we can only gather observations and try to establish sound relationships between features and outputs

Establishing relationships between features and output allows us to either predict future outcomes or better understand what is going on in the system. In Figure 7-1, the frontal digits represent assumed predictors and fitted values (what we closely examine and crunch in production), while the digits on the globe represent real-world data. We can state that our model closely resembles the real world if there is an acceptable match between them.

Linear Regression Varieties

The generic form of a linear regression is given by y = m(x) + ε. The conditional expectation function m is E[y|x], if E[ε|x]=0 (when the homoscedasticity property holds). Depending on m’s structure, we can discern the following types of linear regressions:
  • Parametric: $m(x) = \beta_0 + \beta_{1:n}x$ is smooth, defined, and easily interpretable.

  • Nonparametric: m is smooth, must be discovered from data, flexible, and hard to interpret.

  • Semiparametric: Somewhere between the preceding two cases (partially structured).

You must evaluate the cost-benefit ratio before deciding which version to use. For example, if you have outliers (extreme but valid observations that you cannot eradicate), then simple parametric linear regression isn’t a good choice. The nonparametric version is more robust and less vulnerable to outliers.
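To illustrate that robustness point, here is a small sketch of my own (not part of the book’s code base) that contrasts ordinary least squares with a nonparametric k-nearest-neighbors regressor on data containing a few extreme observations at the right edge:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
rng = np.random.RandomState(42)
x = np.sort(rng.uniform(0, 10, 200)).reshape(-1, 1)
y = 2.5 * x.ravel() + rng.normal(0, 1, 200)
y[-5:] += 80   # a handful of extreme but "valid" observations at large x
parametric = LinearRegression().fit(x, y)
nonparametric = KNeighborsRegressor(n_neighbors=15).fit(x, y)
# The outliers distort the global linear fit (slope and intercept), while the
# local nonparametric estimate at x=2 is barely affected (true value is 5.0).
print(parametric.predict([[2.0]]), nonparametric.predict([[2.0]]))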

Imagine that you encounter a helpful wizard, who reveals everything about the real world’s processes. With that knowledge, how accurately may you predict future outcomes? You definitely cannot talk in absolute terms, as you have to contend with randomness in making your predictions; you may only speak about probabilities of seeing particular responses. In other words, you can develop a conditional probability distribution of an output given an observation. This can be expressed as $P(y \mid X) = \mathcal{N}(f(\beta_0 + \beta_{1:n}X), \sigma)$.

The truth is, you are not straying far from reality (and sanity) if you are hoping for such a wizard. The scikit-learn framework ( https://scikit-learn.org ) is a wizard in its own right. You just need to carefully interpret scikit-learn’s words and never misapply the power bestowed on you. The aim of this section is to shed some light on necessary communication skills and language so that you become this wizard’s beloved apprentice.

Listing 7-2 shows the scikit-learn package in action with some visualization of what happens under the hood. The observer.py module contains functions to recover parameters and demonstrate various effects pertaining to training, testing, and evaluation. You don’t need to fully understand the auxiliary display code (see the explain_sse and plot_mse functions) to follow along. The train_model function recovers the real world’s parameters solely by using observations (a.k.a. labeled training data). The evaluate_model function checks the validity of the model by calculating the mean squared error (MSE) and explained variance score using test data. In practice, you should always select performance criteria that align with the objectives of the problem. Out-of-context measurements are distorting.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
def train_model(model, X_train, y_train):
    model.fit(X_train, y_train)
def evaluate_model(model, X_test, y_test, plot_residuals=False, title=''):
    from sklearn.metrics import mean_squared_error, explained_variance_score
    y_pred = model.predict(X_test)
    if plot_residuals:
        _, ax = plt.subplots(figsize=(9, 9))
        ax.set_title('Residuals Plot - ' + title, fontsize=19)
        ax.set_xlabel('Predicted values', fontsize=15)
        ax.set_ylabel('Residuals', fontsize=15)
        sns.residplot(y_pred.squeeze(), y_test.squeeze(),
                      lowess=True,
                      ax=ax,
                      scatter_kws={'alpha': 0.3},
                      line_kws={'color': 'black', 'lw': 2, 'ls': '--'})
    metrics = {
        'explained_variance': explained_variance_score(y_test, y_pred),
        'mse': mean_squared_error(y_test, y_pred)
    }
    return metrics
def make_poly_pipeline(model, degree):
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    return make_pipeline(PolynomialFeatures(degree=degree, include_bias=False), model)
def print_parameters(linear_model, metrics):
    print('Intercept: %.3f' % linear_model.intercept_)
    print('Coefficients: ', linear_model.coef_)
    print('Explained variance score: %.3f' % metrics['explained_variance'])
    print("Mean squared error: %.3f" % metrics['mse'])
def plot_mse(model, X, y, title, error_spread):
    def collect_mse():
        from sklearn.model_selection import train_test_split
        from sklearn.model_selection import cross_val_score
        metrics_all = []
        for train_size_pct in range(10, 110, 10):
            X_train, X_test, y_train, y_test = \
                train_test_split(X, y, shuffle=False, train_size=train_size_pct / 100)
            metrics_current = dict()
            metrics_current['percent_train'] = train_size_pct
            train_model(model, X_train, y_train)
            metrics_train = evaluate_model(model, X_train, y_train)
            metrics_current['Training score'] = metrics_train['mse']
            metrics_cv = cross_val_score(
                model,
                X_train, y_train,
                scoring='neg_mean_squared_error', cv=10)
            metrics_current['CV score'] = -metrics_cv.mean()
            if X_test.shape[0] > 0:
                metrics_test = evaluate_model(model, X_test, y_test)
                metrics_current['Testing score'] = metrics_test['mse']
            else:
                metrics_current['Testing score'] = np.NaN
            metrics_all.append(metrics_current)
        return pd.DataFrame.from_records(metrics_all)
    import matplotlib.ticker as mtick
    df = collect_mse()
    error_variance = error_spread**2
    ax = df.plot(
        x='percent_train',
        title=title,
        kind='line',
        xticks=range(10, 110, 10),
        sort_columns=True,
        style=['b+--', 'ro-', 'gx:'],
        markersize=10.0,
        grid=False,
        figsize=(8, 6),
        lw=2)
    ax.set_xlabel('Training set size', fontsize=15)
    ax.xaxis.set_major_formatter(mtick.PercentFormatter())
    y_min, y_max = ax.get_ylim()
    # FIX ME: See Exercise 2!
    ax.set_ylim(max(0, y_min), min(2 * error_variance, y_max))
    ax.set_ylabel('MSE', fontsize=15)
    ax.title.set_size(19)
    # Draw and annotate the minimum MSE.
    ax.axhline(error_variance, color="g", ls="--", lw=1)
    ax.annotate(
        'Inherent error level',
        xy=(15, error_variance),
        textcoords='offset pixels',
        xytext=(10, 80),
        arrowprops=dict(facecolor='black', width=1, shrink=0.05))
def explain_sse(slope, intercept, x, y):
    # Configure the diagram.
    _, ax = plt.subplots(figsize=(7, 9))
    ax.set_xlabel('x', fontsize=15)
    ax.set_ylabel('y', fontsize=15)
    ax.set_title(r'$SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2$', fontsize=19)
    ax.grid(False)
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.tick_params(direction='out', length=6, width=2, colors="black")
    # Show x-y pairs.
    ax.scatter(x, y, alpha=0.5, marker="x")
    # Draw the regression line.
    xlims = np.array([np.min(x), np.max(x)])
    ax.plot(xlims, slope * xlims + intercept, lw=2, color="b")
    # Draw the error terms.
    for x_i, y_i in zip(x, y):
        ax.plot([x_i, x_i], [y_i, slope * x_i + intercept], color="r", lw=2, ls="--")
Listing 7-2

observer.py Module

Notice the advantage of having a unified API (fit, predict, etc.) to work with various models. Even pipelines are handled in the same manner. Listing 7-3 shows the session.py module’s demo_metrics_and_mse function. It depicts the steps to reconstruct parameters from observations using various noise levels. In the absence of noise, we have a deterministic linear regression. As noise increases, the estimation of parameters deteriorates.
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from data_generator import *
from observer import *
def set_session_seed(seed):
    np.random.seed(seed)    # Enables perfect reproduction of published results.
def demo_metrics_and_mse():
    set_session_seed(100)
    X = generate_base_features(1000)[['x_normal']]
    for noise_level in [0, 2, 15]:
        y = generate_response(X, noise_level, [-1.5, 4.1])
        model = LinearRegression()
        train_model(model, X, y)
        metrics = evaluate_model(model, X, y)
        print(' Iteration with noise level: %d' % noise_level)
        print_parameters(model, metrics)
        # Visualize the regression line and error terms.
        if noise_level == 15:
            slope = model.coef_[0][0]
            intercept = model.intercept_
            explain_sse(slope, intercept, X[:15].values, y[:15])
Listing 7-3

First Part of session.py Module (Some Imports Are for Later Use)

The warnings module is used to silence FutureWarning messages, thus removing clutter from the output. The following is the printout from executing demo_metrics_and_mse():
Iteration with noise level: 0
Intercept: -1.500
Coefficients:
 [[4.1]]
Explained variance score: 1.000
Mean squared error: 0.000
Iteration with noise level: 2
Intercept: -1.470
Coefficients:
 [[4.0925705]]
Explained variance score: 0.997
Mean squared error: 4.187
Iteration with noise level: 15
Intercept: -1.535
Coefficients:
 [[4.06304243]]
Explained variance score: 0.867
Mean squared error: 223.945

Figure 7-2 pictures how error accumulates; the line itself was calculated over the whole dataset, so it appears off on this subset. The MSE is simply the average of the sum of squared errors (SSE). Each vertical dashed red line designates a single error term (the difference between the true $y_i$ and its predicted value $\hat{y}_i$). The model’s coefficients minimize this MSE (loss function), and the corresponding expression to calculate the vector β is an unbiased minimal variance estimator. It is independent of the inherent noise factor, and MSE is an estimator of this error term’s spread ($\sigma \cong \sqrt{MSE}$). Estimates and values dependent upon them are denoted by a hat symbol.

Note

The intercept only makes sense when the matching predictor is of ratio level data type (the notion of absolute zero exists). Otherwise, it should be turned off (see the constructor for the LinearRegression class).

Hint

The print_parameters method is rudimentary. For a complete R-style summary of model parameters, you may want to utilize the statsmodels package (visit https://www.statsmodels.org ).

Once a model is prepared (trained), it is ready for production usage to deal with unseen data. This is what I refer to as a runtime model. All learned (runtime) parameters are its integral part and are usually distributable over various communication channels. This property allows training to be done separately on powerful machines with lots of data. For example, all we need to know in production to handle real data is the corresponding vector $\hat{\beta}$. Calculating $\hat{y} = \hat{\beta}_0 + \hat{\beta}_{1:n}X$ can be performed on any constrained device.
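The following sketch (a hypothetical illustration, not one of the chapter’s modules) makes this concrete: it extracts the learned parameters from a fitted LinearRegression and reproduces its predictions with a plain dot product, which is all a constrained production device would need:
import numpy as np
from sklearn.linear_model import LinearRegression
rng = np.random.RandomState(0)
X_train = rng.normal(size=(500, 2))
y_train = 1.5 + X_train @ np.array([2.0, -3.0]) + rng.normal(0, 0.5, 500)
model = LinearRegression().fit(X_train, y_train)
# "Ship" only the runtime parameters; no scikit-learn is needed on the device.
beta_0 = model.intercept_
beta = model.coef_
X_new = rng.normal(size=(3, 2))
manual_pred = beta_0 + X_new @ beta   # hat{y} = beta_0 + beta_{1:n} X
assert np.allclose(manual_pred, model.predict(X_new))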

While working with world1 (the original model) we used a trivial training process. All available data were used both for training and testing, which is something you should avoid in practice. In this case, it didn’t cause problems. Our model’s complexity perfectly matched the truth, as evidenced by the fact that higher noise only caused a drop in explained variance and a rise in MSE. This was an indication that we hadn’t tried to capture nonessential properties of the real world. An overly complex model is capable of doing so, which leads to overfitting. Contrary to this is underfitting, when our model is too weak to capture even the fundamental characteristics. The next two sections demonstrate these aspects thoroughly.
Figure 7-2

Error terms $e_i = y_i - \hat{y}_i$, which are squared, summed up, and averaged to calculate MSE. The root MSE ($\sqrt{MSE}$) is useful to restore the response’s original unit.

Overfitting

We will augment the linear model with polynomial features of various degrees; the model itself remains linear in terms of features. The aim is to demonstrate overfitting. Since this model is more powerful than the previously investigated model, it has enough capacity to encompass erroneous fluctuations in data; in a deterministic case (without error), you wouldn’t notice any difference. Machine learning algorithms cannot decipher that irregularities in data aren’t critical. There is a mechanism to detect overfitting by splitting historical data into training and testing sets.

Listing 7-4 shows the demo_overfitting function, which uses polynomial features, as well as the plot_mse function that plots MSE for both training and test sets of various sizes. During the run, the data is split into training and test sets of varying sizes. This expansion of the training process introduces a new process parameter: the volume of data reserved for training purposes (the rest is kept for testing). Previously, we just used whatever we had without any breakdown. The inner visualization function demonstrates what actually happens behind the curtain.
def demo_overfitting():
    def visualize_overfitting():
        train_model(optimal_model, X, y)
        train_model(complex_model, X, y)
        _, ax = plt.subplots(figsize=(9, 7))
        ax.set_yticklabels([])
        ax.set_xticklabels([])
        ax.grid(False)
        X_test = np.linspace(0, 1.2, 100)
        plt.plot(X_test, np.sin(2 * np.pi * X_test), label='True function')
        plt.plot(
            X_test,
            optimal_model.predict(X_test[:, np.newaxis]),
            label='Optimal model',
            ls='-.')
        plt.plot(
            X_test,
            complex_model.predict(X_test[:, np.newaxis]),
            label='Complex model',
            ls='--',
            lw=2,
            color='red')
        plt.scatter(X, y, alpha=0.2, edgecolor="b", s=20, label='Training Samples')
        ax.fill_between(X_test, -2, 2, where=X_test > 1, hatch='/', alpha=0.05, color="black")
        plt.xlabel('x', fontsize=15)
        plt.ylabel('y', fontsize=15)
        plt.xlim((0, 1.2))
        plt.ylim((-2, 2))
        plt.legend(loc='upper left')
        plt.title('Visualization of How Overfitting Occurs', fontsize=19)
        plt.show()
    set_session_seed(172)
    X = generate_base_features(120)[['x_uniform']]
    y = generate_response(X, 0.1, [0, 2 * np.pi], f=np.sin)
    optimal_model = make_poly_pipeline(LinearRegression(), 5)
    plot_mse(optimal_model, X, y, 'Optimal Model', 0.1)
    complex_model = make_poly_pipeline(LinearRegression(), 35)
    plot_mse(complex_model, X, y, 'Complex Model', 0.1)
    visualize_overfitting()
Listing 7-4

Contains Functions to Exemplify Overfitting

When a model matches the truth, the MSEs induced by both the training set and the test set gravitate around the achievable minimum MSE (reflects inherent variance in data), as shown in Figure 7-3. Otherwise, the test set shows a worse performance, as presented in Figure 7-4. This is a sign that the model isn’t generalizing properly and is picking up unimportant details from a training set. As a model’s power increases above the required level, the cross-validation (CV) score considerably deteriorates, too.

The CV score is a very efficient way to evaluate your model’s performance. It is an average of individual CV scores. The training data is randomly partitioned into K number of equal-sized complementary subsets (we use K=10; see the collect_mse inner function in Listing 7-2 as well as Exercise 7-2); this number K is another process-related parameter. In each round, one segment is used for testing (more precisely, for validation), while the rest is used as real training data (in the later section “Regularization,” we will use K-fold CV to set a runtime parameter). Every data point from the initial dataset is used only once for testing. In this manner, the algorithm cycles through all partitions, so the overall mean score is indeed a reliable performance indicator. There are also extreme variants, where testing is only done with a single observation (singleton) per iteration (a.k.a. leave-one-out CV).
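The K-fold procedure can also be spelled out explicitly. The sketch below is my own illustration under the same K=10 convention (not a verbatim excerpt of collect_mse); it shows both a manual loop with KFold and the equivalent call to cross_val_score:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 1))
y = 4.1 * X.ravel() - 1.5 + rng.normal(0, 2, 200)
model = LinearRegression()
fold_scores = []
for train_idx, valid_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Train on 9 folds, validate on the held-out fold.
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(mean_squared_error(y[valid_idx], model.predict(X[valid_idx])))
print('Manual 10-fold MSE:', np.mean(fold_scores))
# cross_val_score performs the same kind of bookkeeping; note the negated MSE convention.
cv_scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=10)
print('cross_val_score MSE:', -cv_scores.mean())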
Figure 7-3

MSE of the test set approaches natural variance when the model’s complexity is just right

Furthermore, the variance between CV, testing, and training scores lessens as we use more training data. There is no test score marker when we use all available data for training. The horizontal dashed line denotes the inherent error in data (it may stem from measurement errors). A proper model wouldn’t try to embody this segment. In machine learning we cannot simply instruct the computer to forget about inherent error. The trick is to set your model’s complexity just right, so that it has no capacity to memorize unwanted details. Apparently, the machine first tried to soak up everything while the training set was small, but later gave up and only followed the main trend. Consequently, you must possess enough training data for a given model’s complexity. Monstrous models (especially deep neural networks) must devour a huge amount of training data before being ready for production.

Overfitting and underfitting (demonstrated in the next section) are tightly associated with the central issue of supervised learning known as bias-variance trade-off . A highly biased model may miss important variances in the training data, while a high-variance model may try to capture nonessential properties. Balancing these two forces is one of the most difficult aspects of training machine learning algorithms. For example, extreme care must be given to decision trees, which may easily cover all edge cases in training data.

Figure 7-5 shows why an overly complex model doesn’t generalize well; that is, it has low performance on a test set. Now, this set isn’t exclusively about some totally uncharted area from a domain but is simply a collection of data points excluded from the training set. Reasoning about unknown space entails a different learning approach (see also reference [1]), which will be the topic of the next case study.
Figure 7-4

MSE of the test set rises above the inherent error level when the model’s complexity is too high

The cross-validation score is even out of range here. Observe that the model completely memorized the small training sample at the beginning and always managed to follow unintended fluctuations. One way to combat this situation is to saturate the system with more training data. In our case, the model started to perform reasonably on the test set only once the training portion reached about 60%.

Underfitting and Feature Interaction

This section introduces another typical situation in which features interact. The goal is to showcase that it isn’t enough to solely identify pertinent features. You must understand how they interrelate in the real world. We assume two models: one that treats individual features as independent, and another that includes their interaction.

Listing 7-5 shows the demo_underfitting function that demonstrates underfitting. When a model matches the truth, the MSEs induced by both the training set and the test set gravitate around the achievable minimum MSE (similarly as shown in Figure 7-3). By contrast, Figure 7-6 illustrates what happens with a weak model; obviously, adding more data to a weak model doesn’t help. Underfitting is less common in practice than overfitting, particularly with powerful deep neural networks.
Figure 7-5

The complex model’s prediction line is jagged due to an attempt to encompass inherent error

On the right side of Figure 7-5, you can see what happens in an unexplored territory. Both fitted lines depart from the true function.
def demo_underfitting():
    set_session_seed(15)
    X = generate_base_features(200)
    X_interacting = X[['x_interacting']]
    y = generate_response(X_interacting, 2, [1.7, -4.3])
    plot_mse(LinearRegression(), X_interacting, y, 'Optimal Model', 2)
    X_weak = X[['x_normal', 'x_uniform']]
    plot_mse(LinearRegression(), X_weak, y, 'Weak Model', 2)
Listing 7-5

Function to Exemplify Underfitting via Feature Interaction

The weak model properly enlists both participating features, but taken separately, they cannot provide value.
Figure 7-6

MSEs of all sets are far away from inherent error with a weak model

Collinearity

If you scrutinize the generate_base_features function, you will notice interrelationships between the features x_normal, x_combined, and x_collinear. Our new simulated world uses only x_normal and x_combined. As before, we will presume two variants of this world: one using the same features, and the other one incorporating x_collinear, too. Listing 7-6 shows the code for demonstrating collinearity. In machine learning, this phenomenon has a negative impact on performance and stability (it becomes hard to assess the impact of individual features on the outcome).
def demo_collinearity():
    set_session_seed(10)
    X = generate_base_features(1000)
    X_world = X[['x_normal', 'x_combined']]
    y = generate_response(X_world, 2, [1.1, -2.3, 3.1])
    model = LinearRegression()
    # Showcase the first assumed model.
    train_model(model, X_world, y)
    metrics = evaluate_model(model, X_world, y)
    print(' Dumping stats for model 1')
    print_parameters(model, metrics)
    # Showcase the second assumed model.
    X_extended_world = X[['x_normal', 'x_combined', 'x_collinear']]
    train_model(model, X_extended_world, y)
    metrics = evaluate_model(model, X_extended_world, y)
    print(' Dumping stats for model 2')
    print_parameters(model, metrics)
    # Produce a scatter matrix plot.
    df = X
    df.columns = ['x' + str(i + 1) for i in range(len(df.columns))]
    df['y'] = y
    pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal="kde")
Listing 7-6

Extension of session.py with the demo_collinearity Function

Here is the output of executing demo_collinearity():
Dumping stats for model 1
Intercept: 1.021
Coefficients:
 [[-2.1290165   3.05443636]]
Explained variance score: 0.999
Mean squared error: 4.274
Dumping stats for model 2
Intercept: 1.021
Coefficients:
 [[-2.1290165   0.09438926  0.52857984]]
Explained variance score: 0.999
Mean squared error: 4.274

The only difference between these runs is the significance of x_combined. This is a typical sign of collinearity. The system is confused about whether to “work” with x_combined or x_collinear, since one is a direct linear combination of the other. There is also a strong relationship between x_normal and x_combined. A practical way to check for such interdependence is to generate a scatter matrix plot, as shown in Figure 7-7; this is the final result of running demo_collinearity (the columns are renamed to better fit on the diagram). It is possible to produce a correlation matrix, but it will only reveal linear relationships. A diagram may show you non-linear associations, too.

The independence assumption is common in machine learning algorithms. The naive Bayes method fully relies upon this characteristic. Feature dependence has negative consequences on convergence in logistic regression. Moreover, highly related features are redundant and just complicate a model (a quick numeric check with variance inflation factors is sketched below).
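Beyond eyeballing a scatter matrix, a common numeric check for collinearity is the variance inflation factor (VIF). The sketch below is my own illustration and assumes the statsmodels package is installed; it mimics the interrelationships from generate_base_features rather than reusing the chapter’s modules:
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
rng = np.random.RandomState(10)
x_normal = rng.normal(6, 9, 1000)
x_combined = 3.6 * x_normal + rng.exponential(2 / 3, 1000)
x_collinear = 5.6 * x_combined
X = pd.DataFrame({'x_normal': x_normal,
                  'x_combined': x_combined,
                  'x_collinear': x_collinear})
# A VIF well above ~10 usually signals problematic collinearity; the perfectly
# collinear column yields an effectively infinite value here.
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))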

Residuals Plot

As a data scientist, you must constantly seek to inspect problems from multiple angles. Likewise, you should know about complementary plotting techniques, since they could illuminate otherwise invisible aspects of a problem. You have just seen the utility of a scatter matrix plot. In this section we generate two worlds, one linear and one quadratic. Both have tiny coefficients, inherent randomness, and are approximated by truly linear models. Listing 7-7 contains the demo_residuals function to highlight the importance of residuals plots.

Figure 7-8 shows two scatter plots together with regression lines for fitting a linear world by linear model (case 1) and fitting a quadratic world by linear model (case 2). The MSE is same in both cases, while the explained variance score is better for case 2. Which case would you choose as more agreeable for a linear model? My guess is that you would pick case 2. Well, Figure 7-9 tells a different story.
Figure 7-7

Scatter plots of all pairs of variables, with density plots on diagonal

Notice the heavy connection between x1, x4, x5, and y. Furthermore, observe x3’s distinctive shape in relation to the other features.
def demo_residuals():
    def plot_regression_line(x, y, case_num):
        _, ax = plt.subplots(figsize=(9, 9))
        ax.set_title('Regression Plot - Case ' + str(case_num), fontsize=19)
        ax.set_xlabel('x', fontsize=15)
        ax.set_ylabel('y', fontsize=15)
        sns.regplot(x.squeeze(), y.squeeze(),
                    ci=None,
                    ax=ax,
                    scatter_kws={'alpha': 0.3},
                    line_kws={'color': 'green', 'lw': 3})
    set_session_seed(100)
    X = generate_base_features(1000)
    X1 = X[['x_normal']]
    y1 = generate_response(X1, 0.04, [1.2, 0.00003])
    X2 = X1**2
    y2 = generate_response(X2, 0.04, [1.2, 0.00003])
    model = LinearRegression()
    # Showcase the first world with a linearly assumed model.
    plot_regression_line(X1, y1, 1)
    train_model(model, X1, y1)
    metrics = evaluate_model(model, X1, y1, True, 'Case 1')
    print(' Dumping stats for case 1')
    print_parameters(model, metrics)
    # Showcase the second world with a linearly assumed model.
    plot_regression_line(X1, y2, 2)
    train_model(model, X1, y2)
    metrics = evaluate_model(model, X1, y2, True, 'Case 2')
    print(' Dumping stats for case 2')
    print_parameters(model, metrics)
Listing 7-7

Function That Creates Two Deceptive Offers, Where the Obvious One Isn’t Obviously Wrong

This code reuses the model instance from case 1 in case 2. You must always carefully read the documentation about what happens when you call the fit method repeatedly. Here is the citation from scikit-learn’s tutorial (see https://scikit-learn.org/stable/tutorial/basic/tutorial.html ): “Hyper-parameters of an estimator can be updated after it has been constructed via the set_params() method. Calling fit() more than once will overwrite what was learned by any previous fit().” This behavior is exactly what we need.

You also must take care to reuse the same scaler instance that was used for training the model. The validation and test sets must be scaled in the same manner as the training dataset. Forgetting this detail may cause subtle bugs.
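Here is a generic sketch of that rule (an illustration of my own, not tied to the chapter’s code): fit the scaler on the training split only, then apply the same fitted instance to every other split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
rng = np.random.RandomState(0)
X = rng.normal(100, 25, size=(300, 2))
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the SAME fitted scaler
# A fresh scaler fitted on the test split would use different statistics,
# silently shifting the features the model sees in production.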
Figure 7-8

The direction of the slope is wrong for case 1, since it should be positive (see Listing 7-7)

For case 2 the slope is positive, as it should be. All signs suggest that this is a better match.
Figure 7-9

The residuals plot clearly favors case 1 over case 2 with evident visual explanation; the curvature nicely reveals a “quadratic” pattern in residuals

There is no residual pattern for case 1. See Listing 7-2 and the evaluate_model function (it creates a residuals plot with a lowess line to depict the trend).

Regularization

We don’t know in advance which model is optimal. We can either start with a weak model and add more features or start with an overly complex model and try to tone it down. Neither of these approaches is scalable with manual work. The idea is to err on the side of complexity and utilize some automation to reduce the model’s complexity toward an optimal level. This is all about regularization, an automatic mechanism to eschew overfitting. There are many types of regularization. We will use Ridge regression with built-in cross-validation.

Regularization encodes some constraint over coefficients using the language of mathematical optimization; this is expressed in the form of a penalty function. Ridge regression (a.k.a. Tikhonov regularization) aims to keep coefficients as small as possible, since this is equivalent to attaining a least-complex model. Consequently, Ridge regression defines an L2 regularization term (l2-norm) $\alpha \lVert w \rVert_2^2$ that is added to the basic loss function (in our case, MSE). The vector w contains the model’s coefficients (weights). The parameter α balances minimization attempts between the MSE and the penalty term. Higher values tend to reward smaller coefficients, and vice versa. Alpha is calculated by some trial-and-error method (such as cross-validation over a list of candidates). Listing 7-8 shows the function to illuminate regularization, while the aftermath is shown in Figure 7-10.
def demo_regularization():
    from sklearn.linear_model import RidgeCV
    set_session_seed(172)
    X = generate_base_features(120)[['x_uniform']]
    y = generate_response(X, 0.1, [0, 2 * np.pi], f=np.sin)
    regularized_model = make_poly_pipeline(
            RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 10, 20], gcv_mode="auto"),
            35)
    plot_mse(regularized_model, X, y, 'Regularized Model', 0.1)
Listing 7-8

Implementation of demo_regularization Function to Showcase Ridge Regression

All the magic happens inside the scikit-learn framework’s RidgeCV class.
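After fitting, you can ask the pipeline which α the cross-validation selected. The snippet below is an illustrative addition (not part of Listing 7-8); it rebuilds a comparable pipeline on synthetic data and reads the alpha_ attribute, using scikit-learn’s convention that make_pipeline names each step after its lowercased class:
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
rng = np.random.RandomState(172)
X = rng.uniform(0, 1, size=(120, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 120)
regularized_model = make_pipeline(
    PolynomialFeatures(degree=35, include_bias=False),
    RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1, 5, 10, 20]))
regularized_model.fit(X, y)
# The selected alpha tells you how strongly the coefficients were shrunk.
print('Selected alpha:', regularized_model.named_steps['ridgecv'].alpha_)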

Predicting Financial Movements Case Study

The realm of financial modeling will shed some light on time series analysis, where we want to react to events in near real time (for example, predicting stock prices in markets). This is totally different from what we have done thus far using batch processing. My aim here isn’t to develop a new breed of stock market application, but to draw your attention to novel problems and potential solutions with streaming data. You cannot access such data all at once, so stream processing techniques are intrinsically incremental and causal (they act on current knowledge to predict future outcomes). This entails that model parameters evolve over time instead of being fixed at the end of the training stage. Monitoring and regularizing this evolution are also new tasks compared to classical approaches (see references [1-2]).

Another crucial difference concerns the timestamping of observations. In the Chapter 2 case study of e-commerce customer segmentation, the data was coarsely timestamped (data files were segregated by days). Here, each record will have its own absolute timestamp, so that we can monitor trend, seasonality, and other time-based patterns; relative timing isn’t enough for this purpose. To avoid time zone–related difficulties, it is beneficial to register timestamps as seconds since the beginning of the epoch (or a similar higher-granularity scheme).
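For instance, a pandas DatetimeIndex can be converted to epoch seconds in one line (a generic illustration; the case study below keeps the DatetimeIndex as-is):
import pandas as pd
idx = pd.to_datetime(['2018-11-05', '2018-11-06'])
# Elapsed time since the epoch, floor-divided into whole seconds.
epoch_seconds = (idx - pd.Timestamp('1970-01-01')) // pd.Timedelta('1s')
print(epoch_seconds.tolist())   # [1541376000, 1541462400]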
Figure 7-10

Although the model’s complexity, in terms of features, is the same as in our overfitting example, due to regularization it doesn’t overfit as before

Data Retrieval

The accompanying source code already contains the CSV file with daily time series stock data for Apple (its stock symbol is AAPL). It’s located inside the stock_market subfolder (with other artifacts). You can get a fresh copy, or work with another company’s equity (change the symbol below and rename the target file accordingly), by issuing the following command from Spyder’s IPython console:
!curl -o daily_AAPL.csv "https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=AAPL&outputsize=full&apikey=<YOUR API KEY>&datatype=csv"

We are relying on Alpha Vantage’s API (see https://www.alphavantage.co/documentation/#daily ) to retrieve daily time series data. You can get a free API key from Alpha Vantage and insert it into the URL in the preceding code where indicated. There are limits on the number of requests that are well documented on the site. The outputsize parameter is set to full, which pulls up to 20 years’ worth of historical data. The output format (datatype parameter) is set to csv.

Data Preprocessing

In the spirit of the data science process, we will first do some preprocessing and exploratory data analysis. All subsequent steps should be carried out from Spyder’s IPython console (ensure that you are in the stock_market folder). The next command shows the first five lines of the downloaded file:
>> !head -n 5 daily_AAPL.csv
timestamp,open,high,low,close,volume
2018-11-07,205.9700,210.0600,204.1300,209.9500,33106489
2018-11-06,201.9200,204.7200,201.6900,203.7700,31882881
2018-11-05,204.3000,204.3900,198.1700,201.5900,66163669
2018-11-02,209.5500,213.6500,205.4300,207.4800,91328654
According to the API documentation, “The most recent data point is the prices and volume information of the current trading day, updated realtime.” We will omit this and work only with stable data points. The next lines read the stock data into a Pandas data frame and show the first couple of records:
>> import pandas as pd
>> stock_data = pd.read_csv('daily_AAPL.csv', usecols=[0, 4, 5], skiprows=[1])
>> stock_data.head()
    timestamp   close    volume
0  2018-11-06  203.77  31882881
1  2018-11-05  201.59  66163669
2  2018-11-02  207.48  91328654
3  2018-11-01  222.22  58323180
4  2018-10-31  218.86  38358933
We only need the timestamp, close price, and volume fields without the most recent data point. At this moment it is convenient to see the overall information about data types, number of rows, etc. The next command shows this information:
>> stock_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5247 entries, 0 to 5246
Data columns (total 3 columns):
timestamp    5247 non-null object
close        5247 non-null float64
volume       5247 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 123.1+ KB
Apparently, the timestamp has an inconvenient object type, which is too generic to be useful. The next lines convert this column into a DateTime index (this allows us to treat data chronologically):
>> stock_data['timestamp'] = pd.to_datetime(stock_data['timestamp'])
>> stock_data.set_index('timestamp', inplace=True, verify_integrity=True)
>> stock_data.head()
             close    volume
timestamp
2018-11-06  203.77  31882881
2018-11-05  201.59  66163669
2018-11-02  207.48  91328654
2018-11-01  222.22  58323180
2018-10-31  218.86  38358933
Figure 7-11 shows AAPL closing levels over time, which is produced with the following snippet. The style parameter controls the appearance of lines in a plot. To differentiate lines in a grayscale image, you cannot solely use colors. We see a huge price drop around 2015 in Figure 7-11. According to one report (read at http://time.com/money/3991712/apple-stock-price-drop ), the reason was a missed business expectation around iPhone sales.
>> import matplotlib.pyplot as plt
>> def plot_time_series(ts, title_prefix, style='b-'):
       ax = ts.plot(figsize=(9, 8), lw=2, fontsize=12, style=style)
       ax.set_title('%s Over Time' % title_prefix, fontsize=19)
       ax.set_xlabel('Year', fontsize=15)
>> plot_time_series(stock_data['close'], 'AAPL Closing Levels')

Discovering Trends in Time Series

The chart in Figure 7-11 is very ragged, due to noise and seasonality. A popular way to identify trends in time series is to take a moving average . The following command plots the trend in closing levels by calculating the simple moving average (SMA) , as shown in Figure 7-12:
>> stock_data.sort_index(inplace=True)
>> plot_time_series(stock_data['close'].rolling('365D').mean(), 'AAPL Closing Trend')
If we don’t sort the index, we will receive an error: ValueError: index must be monotonic. The rolling window is defined to be 365 days. Figure 7-13 combines closing levels and volume trends inside a single diagram. The volume doesn’t change much over time, although when the price started to drop, the volume increased. Maybe there was an urge to sell stocks while the price was still good enough. This kind of comparison is useful for feature engineering and for getting more insight into behavior. The next lines produce such a composite plot:
>> def compose_trends(ts):
       from sklearn.preprocessing import MinMaxScaler
       scaler = MinMaxScaler()
       scaled_ts = pd.DataFrame(scaler.fit_transform(ts), columns=ts.columns, index=ts.index)
       return pd.concat([scaled_ts['close'].rolling('365D').mean(),
                         scaled_ts['volume'].rolling('365D').mean()], axis=1)
>> plot_time_series(compose_trends(stock_data), 'AAPL Closing & Volume Trends', ['b-', 'g--'])
Taking a moving average eliminates small nuances in data. Also observe in Figure 7-13 that the slope of the massive price drop is less than in Figure 7-11. It is imperative to scale the features before combining them in the same diagram. Scaling is also mandatory in various machine learning situations, when movement in one direction may completely shadow movements in other directions. This happens when features are on totally different scales.
Figure 7-11

Variation of AAPL closing levels over time

Figure 7-12

Much smoother diagram compared to Figure 7-11

Figure 7-13

The y axis is now showing scaled values

Transforming Features

In finance, looking at returns on a daily basis is more useful than using absolute quantities. Returns indicate how much an asset’s value fluctuates over time. There are two main ways to formulate returns:
  • Log returns: $r_t = \log\left(\frac{v_t}{v_{t-1}}\right)$, where v denotes the asset’s value (such as closing price, adjusted closing price, volume, adjusted volume, etc.).

  • Scaled percent returns: $r_t = \frac{v_t}{v_{t-1}} - 1$. There is a method pct_change in Pandas to calculate this quantity. For daily returns, scaled percent returns are close to log returns. You can see this from the Taylor expansion of the log function: when $x = \frac{v_t}{v_{t-1}}$ is close to 1, only the first term (x − 1) matters (a quick numeric check follows this list).

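A quick numeric sanity check (illustrative only, not part of the chapter’s modules) confirms that for small daily moves the two formulations nearly coincide:
import numpy as np
import pandas as pd
close = pd.Series([100.0, 101.0, 100.5, 102.0])
log_ret = np.log(close).diff()
pct_ret = close.pct_change()
# For ~1% daily moves the difference shows up only around the 4th decimal.
print(pd.concat([log_ret, pct_ret], axis=1, keys=['log', 'pct']).round(5))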
Log returns have nice mathematical properties (such as additivity, symmetry pertaining to gains and losses, etc.). The following snippet attaches two new features to our data frame: the daily log returns of the stock price and the daily log changes in volume. Figure 7-14 shows AAPL price log returns over time, and Figure 7-15 presents the same for the volume.
>> import numpy as np
>> stock_data['close_ret'] = np.log(stock_data['close']).diff()
>> stock_data['volume_ret'] = np.log(stock_data['volume']).diff()
>> stock_data.head()
            close    volume  close_ret  volume_ret
timestamp
1998-01-02  16.25   6411700        NaN         NaN
1998-01-05  15.88   5820300  -0.023032   -0.096773
1998-01-06  18.94  16182800   0.176216    1.022597
1998-01-07  17.50   9300200  -0.079075   -0.553913
1998-01-08  18.19   6910900   0.038671   -0.296936
>> stock_data.dropna(inplace=True)
>> plot_time_series(stock_data['close_ret'], 'AAPL Price Log Returns')
>> plot_time_series(stock_data['volume_ret'], 'AAPL Volume Log Returns')
Figure 7-14

This diagram has a couple of downward spikes that are far away from central values

We could treat spikes as outliers, which are hard to model and account for (unless you are a finance guru). One way to circumvent this problem is to use volatility-normalized log returns.

The next code section implements the logic to produce volatility-normalized log returns of the stock price, as shown in Figure 7-16. There is no need to perform the same for the volume returns, as we will soon see (its distribution is nearly normal).
>> stock_data['close_ret'] /= stock_data['close_ret'].ewm(halflife=23).std()
>> plot_time_series(stock_data['close_ret'], 'AAPL Volatility-Norm. Price Log Returns')

The half-life of 23 days amounts to a decay factor (weight) of 0.97 (you need to solve for w the equation $w^{23} = \frac{1}{2}$), which controls how long the stock market remembers (or how fast it forgets) old events. The i-th data point has a decay factor of $w^i$. Volatility is calculated as a rolling standard deviation across data points taking into account the exponential decay for old data. Higher decay damps volatility. You should experiment with this factor to match the desired risk level.
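The half-life arithmetic is easy to verify with a one-liner:
w = 0.5 ** (1 / 23)
print(round(w, 4))   # ~0.9703, i.e., the ~0.97 decay factor quoted above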

Streaming Amounts

For simplicity reasons, we have thus far managed the data in batch mode. Nonetheless, log returns, moving averages, and volatility are all amounts that may be calculated in real time. For log returns, you just need to cache the last value to make an update. The same is true for volatility with exponential downweighting, although it is not as obvious as with log returns. To track a moving average, you will need to store and update the last numerator and denominator.

Suppose that you know the current weighted variance $V_{current} = \frac{\sum_i w^i r_i^2}{\sum_i w^i}$, where w is the decay factor; for daily returns, we may assume the mean to be zero. The denominator is the sum of a geometric series whose limit is $\frac{1}{1-w}$. When a new return $r_0$ arrives, then we have $V_{new} = (1-w)\left(w \sum_i w^i r_i^2 + r_0^2\right) = w V_{current} + (1-w) r_0^2$. You can compute the current volatility as $\sqrt{V_{current}}$.

To perform volatility normalization in streaming mode, you need to divide the new return by the current volatility: $r_0 \leftarrow r_0 / \sqrt{V_{current}}$. Of course, you would do this before updating the current volatility; that is, before executing $V_{current} \leftarrow V_{new}$.

Besides computing various running totals, there is a whole gamut of incremental algorithms that may run in streaming fashion. For example, gradient descent is an iterative optimization method that handles all data in one sweep. Stochastic gradient descent is an incremental and iterative method that runs in online mode and updates parameters on-the-fly. Streaming linear regression uses this approach (as described later in this chapter).
Figure 7-15

This diagram seems to be properly centered around zero, but we should still eyeball its distribution

Figure 7-16

Normalization has smoothed out the log returns and made the time series better behaved

To get a better feeling of what normalization did to price log returns, Figures 7-17 and 7-18 show the histograms of non-normalized and normalized variants, respectively. Later in the section you will see the complete code.
Figure 7-17

Histogram of the non-normalized price log returns, which is heavily left-tailed

Figure 7-18

Histogram of the normalized price log returns, which is bell shaped

For the sake of completeness, Figure 7-19 shows the histogram of the volume log returns.
Figure 7-19

This diagram is normally distributed with a slight right tail

Listings 7-9, 7-10, and 7-11 show separate modules that bundle all pertinent steps into coherent units. This concludes the preprocessing stage. The driver.py module calls these functions to implement the whole pipeline.
import numpy as np
import pandas as pd
def read_daily_equity_data(file):
    stock_data = pd.read_csv(file, usecols=[0, 4, 5], skiprows=[1])
    stock_data['timestamp'] = pd.to_datetime(stock_data['timestamp'])
    stock_data.set_index('timestamp', inplace=True, verify_integrity=True)
    stock_data.sort_index(inplace=True)
    return stock_data
def compose_trends(ts):
    from sklearn.preprocessing import MinMaxScaler
    scaler = MinMaxScaler()
    scaled_ts = pd.DataFrame(scaler.fit_transform(ts), columns=ts.columns, index=ts.index)
    return pd.concat([scaled_ts['close'].rolling('365D').mean(),
                      scaled_ts['volume'].rolling('365D').mean()], axis=1)
def create_log_returns(ts, halflife, normalize_close=True):
    ts['close_ret'] = np.log(ts['close']).diff()
    if normalize_close:
        ts['close_ret'] /= ts['close_ret'].ewm(halflife=halflife).std()
    ts['volume_ret'] = np.log(ts['volume']).diff()
    return ts.dropna()
Listing 7-9

data_preprocessing.py Module , Which Contains Functions to Prepare the Data Frame for the Analysis Phase

import matplotlib.pyplot as plt
def plot_time_series(ts, title_prefix, style='b-'):
    ax = ts.plot(figsize=(9, 8), lw=2, fontsize=12, style=style)
    ax.set_title('%s Over Time' % title_prefix, fontsize=19)
    ax.set_xlabel('Year', fontsize=15)
    plt.show()
def hist_time_series(ts, xlabel, bins):
    ax = ts.hist(figsize=(9, 8), xlabelsize=12, ylabelsize=12, bins=bins, grid=False)
    ax.set_title('Distribution of %s' % xlabel, fontsize=19)
    ax.set_xlabel(xlabel, fontsize=15)
    plt.show()
Listing 7-10

data_visualization.py Module , Which Contains Auxiliary Visualizations of Time Series

from data_preprocessing import *
from data_visualization import *
stock_data = read_daily_equity_data('daily_AAPL.csv')
stock_data = create_log_returns(stock_data, 23)
plot_time_series(stock_data['close'], 'AAPL Closing Levels')
plot_time_series(stock_data['close'].rolling('365D').mean(), 'AAPL Closing Trend')
plot_time_series(compose_trends(stock_data), 'AAPL Closing & Volume Trends', ['b-', 'g--'])
# To produce the non-normalized price log returns plot you must call
# the create_log_returns function with normalize_close=False. Try this as an
# additional exercise.
plot_time_series(stock_data['close_ret'], 'AAPL Volatility-Norm. Price Log Returns')
plot_time_series(stock_data['volume_ret'], 'AAPL Volume Log Returns')
hist_time_series(stock_data['close_ret'], 'Daily Stock Log Returns', 50)
hist_time_series(stock_data['volume_ret'], 'Daily Volume Log Returns', 50)
Listing 7-11

driver.py Module , Which Connects All the Pieces Together (First Part of File Shown)

Feature Engineering

Currently, we have raw close levels, raw volume levels, volatility-normalized closing log returns, and volume log returns as our features (we will create more). To see how these impact a potential target (like the predicted stock price change), it is useful to consult Pearson’s correlation coefficient r. It is an indicator of a linear relationship between two features, whose range is [-1, 1]. A positive correlation means that as one value increases/decreases, the other does the same. A negative correlation denotes the opposite behavior. A value of zero represents the absence of a linear relationship, although the quantities may be interrelated in non-linear ways. Usually, if |r| > 0.3, then we can consider the correlation to be noticeable (this is a judgment call, so take this heuristic with a decent pinch of salt).

Log returns denote fluctuations, and it is illuminating to find out whether these are mean-reverting or trend-following in some time period (for example, N days). Mean reversion suggests that returns oscillate around a mean, while trend-following says that they mimic the recent period. We can discover this by fixing N and calculating the coefficient r between returns from the past and the future. If the correlation is low, then we have mean reversion; otherwise, we have trend-following behavior. Listing 7-12 shows the function that reports correlation coefficients and creates scatter plots between past and future price as well as volume log returns (see also Exercise 7-3). It calls the scatter_time_series function to make a scatter plot (see Listing 7-13).
from data_visualization import *
def report_auto_correlation(ts, periods=5):
    for column in filter(lambda str: str.endswith('_ret'), ts.columns):
        future_column = 'future_' + column
        ts[future_column] = ts[column].shift(-periods).rolling(periods).sum()
        current_column = 'current_' + column
        ts[current_column] = ts[column].rolling(periods).sum()
        print(ts[[current_column, future_column]].corr())
        scatter_time_series(ts, current_column, future_column)
Listing 7-12

feature_engineering.py Module, with Function to Investigate Auto-correlation

def scatter_time_series(ts, x, y):
    ax = ts.plot(x=x, y=y, figsize=(9, 8), kind="scatter", fontsize=12)
    ax.set_title('Auto-correlation Graph', fontsize=19)
    ax.set_xlabel(x, fontsize=15)
    ax.set_ylabel(y, fontsize=15)
    plt.show()
Listing 7-13

Function to Create a Scatter Plot to Trace Auto-correlation (data_visualization.py Module)

The following are the correlations for both quantities, and Figure 7-20 shows the scatter plot for volume log returns (you should run the accompanying source code to see the graph for price returns):
                   current_close_ret  future_close_ret
current_close_ret           1.000000          0.013661
future_close_ret            0.013661          1.000000
                    current_volume_ret  future_volume_ret
current_volume_ret            1.000000          -0.419489
future_volume_ret            -0.419489           1.000000
We may conclude that price log returns revert to the mean, while volume log returns show a pronounced negative auto-correlation, moving in the opposite direction of their recent past.
Figure 7-20

As current volume log returns increase, future volume log returns decrease, and vice versa

By varying the period’s length, it is possible to find the highest level of auto-correlation, which may serve as a basis for creating features. It seems that 5 days is a good choice (you may experiment with different periods using the accompanying code base; a minimal sketch of such an experiment is shown below). Therefore, our target feature will be the 5-day future price change, while our initial basic features will be the current 5-day price and volume changes.
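The following is a minimal sketch (the scan_autocorrelation helper is my own, not part of the chapter’s modules) of scanning several candidate periods, assuming it is run before create_features drops the raw return columns:

def scan_autocorrelation(ts, column='close_ret', candidate_periods=(3, 5, 10, 20)):
    # For each candidate period N, correlate the past N-day sum of returns
    # with the future N-day sum and report Pearson's r.
    for n in candidate_periods:
        past = ts[column].rolling(n).sum()
        future = ts[column].shift(-n).rolling(n).sum()
        print('N = %2d days, r = %.4f' % (n, past.corr(future)))

scan_autocorrelation(stock_data)
scan_autocorrelation(stock_data, column='volume_ret')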

Every domain has its own set of favorite, battle-tested features that have proven valuable in predicting targets. There is a powerful technical analysis library with 200 financial indicators called TA-Lib (see https://mrjbq7.github.io/ta-lib ). We will use three of them: normalized SMA, relative strength index (RSI), and on-balance volume (OBV). You have already seen SMA in the “Discovering Trends in Time Series” section. RSI is defined as $$ 100-\frac{100}{1+RS} $$, where $$ RS=\frac{\text{mean gain over a period}}{\text{mean loss over a period}} $$. OBV connects volume flow with price changes. Listing 7-14 contains the code for creating candidate features; a pandas-only approximation of the RSI arithmetic is sketched right after it.
def create_features(ts):
    from talib import SMA, RSI, OBV
    target = 'future_close_ret'
    features = ['current_close_ret', 'current_volume_ret']
    for n in [14, 25, 50, 100]:
        ts['sma_' + str(n)] = SMA(ts['close'].values, timeperiod=n) / ts['close']
        ts['rsi_' + str(n)] = RSI(ts['close'].values, timeperiod=n)
    ts['obv'] = OBV(ts['close'].values, ts['volume'].values.astype('float64'))
    ts.drop(['close', 'volume', 'close_ret', 'volume_ret', 'future_volume_ret'],
            axis='columns',
            inplace=True)
    ts.dropna(inplace=True)
    return ts.corr()
Listing 7-14

Code to Create Features Described in Text (Part of feature_engineering.py)
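If you just want to see the arithmetic behind RSI without installing TA-Lib, the following rough approximation (my own sketch, using plain rolling means instead of TA-Lib’s Wilder smoothing, so the numbers will differ slightly from RSI’s output) may help:

def rsi_approx(close, timeperiod=14):
    # Split daily price changes into gains and losses.
    delta = close.diff()
    gains = delta.clip(lower=0)
    losses = -delta.clip(upper=0)
    # Plain rolling means; TA-Lib uses Wilder's exponential smoothing instead.
    mean_gain = gains.rolling(timeperiod).mean()
    mean_loss = losses.rolling(timeperiod).mean()
    rs = mean_gain / mean_loss
    return 100 - 100 / (1 + rs)

You could compare rsi_approx(stock_data['close']) against the rsi_14 column (before the raw close column is dropped) to see how close the approximation gets.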

The function returns the correlation matrix, which may be conveniently visualized as a heat map. It is also possible to return an exponentially weighted correlation matrix and to use a weighted moving average (see the Pandas EWM object’s corr and mean methods, respectively; a short sketch follows Listing 7-15). The heat map plotting is implemented in a new function, as shown in Listing 7-15 (inside our visualization module).
def heat_corr_plot(corr_matrix):
    import numpy as np
    import seaborn as sns
    mask = np.zeros_like(corr_matrix)
    mask[np.triu_indices_from(mask)] = True
    _, ax = plt.subplots(figsize=(9, 8))
    sns.heatmap(corr_matrix, annot=True, cmap="gist_gray", fmt=".2f", lw=.5, mask=mask, ax=ax)
    plt.tight_layout()
    plt.show()
Listing 7-15

Code to Report Correlations Using Seaborn’s heatmap Facility
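As mentioned above, an exponentially weighted alternative is available directly in Pandas. A minimal sketch, assuming it is applied to the engineered stock_data frame from the driver and reusing the arbitrary half-life of 23 days:

# Pairwise exponentially weighted correlations; the result is indexed by
# (date, feature), so the block at the most recent date is the latest matrix.
ewm_corr = stock_data.ewm(halflife=23).corr()
latest_corr = ewm_corr.loc[stock_data.index[-1]]
heat_corr_plot(latest_corr)

# Analogously, .mean() yields exponentially weighted moving averages of the columns.
ewma = stock_data.ewm(halflife=23).mean()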

Finally, the driver.py module must be extended with the following statements:
from feature_engineering import *
report_auto_correlation(stock_data)
corr_matrix = create_features(stock_data)
heat_corr_plot(corr_matrix)
Figure 7-21 shows the correlation matrix, which can be used for filtering features. Reducing the number of features may improve both accuracy and performance.
Figure 7-21

This diagram shows only the relevant part of the correlation matrix

All diagonal elements are 1, since features are perfectly aligned with themselves. The matrix is symmetric, so the upper-right triangle is redundant. None of the features are strongly coupled with the target, at least not in a linear way.
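As a simple illustration of using the matrix for filtering (the select_features helper below is my own sketch, not part of the chapter’s modules), you could keep only features whose absolute correlation with the target exceeds some threshold:

def select_features(corr_matrix, target='future_close_ret', threshold=0.05):
    # Absolute correlation of every feature with the target, excluding the target itself.
    corr_with_target = corr_matrix[target].drop(target).abs()
    return corr_with_target[corr_with_target > threshold].index.tolist()

print(select_features(corr_matrix))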

Implementing Streaming Linear Regression

Streaming linear regression continually learns and updates its parameters as new training data arrives. We rely on Apache Spark’s MLlib framework, which provides the StreamingLinearRegressionWithSGD class to perform streaming regression. The only task left for us is to provide streaming sources for training and test data. One handy way is to deliver a list of resilient distributed dataset (RDD) instances over a queue. Converting a section of the Pandas DataFrame into an RDD is straightforward. Listing 7-16 shows the implementation to fit a linear model in an online regime (see also Exercise 7-4), and Listing 7-17 shows the final expansion of the driver.py module. You will need to install pyspark (see https://spark.apache.org/docs/latest/api/python/index.html ).
def fit_and_predict(sparkSession, ts):
    import numpy as np
    from sklearn.model_selection import train_test_split
    from pyspark.streaming import StreamingContext
    from pyspark.mllib.regression import StreamingLinearRegressionWithSGD
    def to_scaled_rdd(pandasDataFrame):
        import pandas as pd
        from sklearn.preprocessing import RobustScaler
        from pyspark.mllib.regression import LabeledPoint
        regressors = pandasDataFrame.columns[1:]
        num_regressors = len(regressors)
        # FIX ME: As a bonus exercise, read the last paragraph from section about residual
        # plots and make the necessary bug fix! Compare the behavior of this version with the
        # fixed one and see whether you can decipher anything from the outputs.
        scaler = RobustScaler()
        scaled_regressors = scaler.fit_transform(pandasDataFrame[regressors])
        scaled_pandasDataFrame = pd.DataFrame(scaled_regressors, columns=regressors)
        scaled_pandasDataFrame['target'] = pandasDataFrame[pandasDataFrame.columns[0]].values
        sparkDataFrame = sparkSession.createDataFrame(scaled_pandasDataFrame)
        return sparkDataFrame.rdd.map(
                lambda row: LabeledPoint(row[num_regressors], row[:num_regressors]))
    def report_accuracy(result_rdd):
        from pyspark.mllib.evaluation import RegressionMetrics
        if not result_rdd.isEmpty():
            metrics = RegressionMetrics(
                    result_rdd.map(lambda t: (float(t[1]), float(t[0]))))
            print("MSE = %s" % metrics.meanSquaredError)
            print("RMSE = %s" % metrics.rootMeanSquaredError)
            print("R-squared = %s" % metrics.r2)
            print("MAE = %s" % metrics.meanAbsoluteError)
            print("Explained variance = %s" % metrics.explainedVariance)
    df_train, df_test = train_test_split(ts, test_size=0.2, shuffle=False)
    train_rdd = to_scaled_rdd(df_train)
    test_rdd = to_scaled_rdd(df_test)
    streamContext = StreamingContext(sparkSession.sparkContext, 1)
    train_stream = streamContext.queueStream([train_rdd])
    test_stream = streamContext.queueStream([test_rdd])
    numFeatures = len(ts.columns) - 1
    model = StreamingLinearRegressionWithSGD(stepSize=0.05, numIterations=300)
    np.random.seed(0)
    model.setInitialWeights(np.random.rand(numFeatures))
    model.trainOn(train_stream)
    result_stream = model.predictOnValues(test_stream.map(lambda lp: (lp.label, lp.features)))
    result_stream.cache()
    result_stream.foreachRDD(report_accuracy)
    streamContext.start()
    streamContext.awaitTermination()
Listing 7-16

Content of the streaming_regression.py Module to Showcase Streaming Linear Regression in Online Mode

from pyspark.sql import SparkSession
from streaming_regression import *
# Parentheses allow the builder calls to span multiple lines.
sparkSession = (SparkSession.builder
                            .master("local[4]")
                            .appName("Streaming Regression Case Study")
                            .getOrCreate())
fit_and_predict(sparkSession, stock_data)
Listing 7-17

Last Piece of the Main driver.py Module

The system will print the following report (once it appears, you can terminate the session):
MSE = 5.663621710391528
RMSE = 2.379836488162901
R-squared = -0.3217279555407153
MAE = 1.8382286514438086
Explained variance = 1.7169768418351132

At the time of this writing, the StreamingLinearRegressionWithSGD class is missing the setIntercept method. Consequently, we get rather “strange” values for R-squared and explained variance. It is also crucial to remember the importance of scaling features before training the model. The RobustScaler scaler is convenient if you aren’t sure about the distribution of your regressors.
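To get a feeling for why RobustScaler is a safe default under unknown distributions, the following standalone sketch (with synthetic data of my own making) contrasts it with StandardScaler on a feature contaminated by outliers:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

np.random.seed(0)
# A mostly well-behaved feature with a handful of extreme outliers.
feature = np.random.normal(size=(1000, 1))
feature[:5] = 100.0

standard = StandardScaler().fit_transform(feature)
robust = RobustScaler().fit_transform(feature)

# The outliers inflate the standard deviation, squashing the bulk of the
# standardized values toward zero; the median/IQR-based scaling is far less affected.
print('StandardScaler spread of inliers: %.3f' % standard[5:].std())
print('RobustScaler spread of inliers: %.3f' % robust[5:].std())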

Exercise 7-1. Improve Reusability

The plot_mse function presets font sizes for title and axes. This tactic is also repeated in other places to preserve consistency. Nonetheless, this approach is a real maintenance nightmare (notice that explain_sse repeats the same setup). Improve the code to centralize setting common parameters (hint: read about style sheets at https://matplotlib.org/users/customizing.html ). What else can you devise to make the code base more flexible and reusable?

As you work on various machine learning problems, utility functions become very handy. There is no point in reinventing the wheel each time you need to plot MSEs. Try to make your code base reusable as much as you can.

Exercise 7-2. Fix a Bug

In Listing 7-2, inside the plot_mse function , you will find the following two lines of code:
# FIX ME: See Exercise 2!
ax.set_ylim(max(0, y_min), min(2 * error_variance, y_max))

Your task is to find out why this marked line is wrong and implement a fix. In reality, the situation is even worse, because nobody will point out to you a wrong section of code. Locating the exact place of an issue is a huge milestone toward correcting errors.

Hint: Figure 7-6 was produced with both lines commented out. Obviously, such resolution is a quick and dirty hack. Try running demo_underfitting with the original plot_mse function and see what happens.

Exercise 7-3. Avoid Side-Effects

In Listing 7-12 you can see a function with side effects. It modifies the input data frame with extra columns. This modification is a prerequisite for calling the function presented in Listing 7-14. All the chaining is driven from the driver module.

Programming in terms of side effects is generally bad practice. You may wonder, then, why it is acceptable for the model to be altered in a similar fashion (see Listing 7-2) after calling its fit method. The crucial difference is that the model encapsulates its state inside an object, so you have much more control over changes to its internals. When ordinary functions are scattered around with assumed side effects, it is easy to lose control and create an unmaintainable mess.

Refactor the feature_engineering.py module to group feature-related manipulations inside a dedicated class. Compare how such object orientation helps to retain control over internals of the system.

Exercise 7-4. Experience Streaming Behavior

In Listing 7-16 the training and test streams were constructed in the following fashion:
train_stream = streamContext.queueStream([train_rdd])
test_stream = streamContext.queueStream([test_rdd])

This was OK to demonstrate the scaffolding of the solution but makes no sense in a real environment. The whole point of streaming is to allow data to continuously arrive in chunks. Apache Spark even allows streams to be combined, so that you may have data coming over multiple channels (for example, you can also monitor a folder for new data files and parse them in real time).

Refactor the solution to have many parts in the training and test sets. You need to split the df_train and df_test referenced data frames into sections and convert each into an RDD. Finally, provide these parts in a list to the queueStream method. Observe what you get on the output (you should receive as many reports as there are pieces of test data).

Bonus task: The current accuracy report doesn’t reveal much about performance. You may want to create a scatter plot of predictions vs. actual values (observations from the test data). Also draw the ideal line (the function y = x).

Summary

This chapter barely scratches the surface of the machine learning (ML) domain. Even dozens of books wouldn’t be enough to fully cover (even at an introductory level) all available algorithms and technologies. The major aim of this chapter was to present common concepts that permeate the ML knowledge area. Without being aware of underfitting, overfitting, regularization, scaling, and similar topics, there is no way to be efficient with any ML approach. To be proficient in ML, you must also recall the golden tenet of data science and engineering: “Keep it simple, stupid!” (a.k.a. KISS principle). Typically, you don’t even need ML to solve a problem, and rarely will you ever need to fire up complicated deep neural networks.

Another potent message of this chapter is the rather blurred borderline between science and art regarding parameter tuning. There are some rules of thumb (for a good overview, I suggest the document at http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf ), but you will need lots of experimentation and trial and error. Consequently, to make all this happen in a reasonable amount of time, you will need powerful hardware (readily available in the cloud as infrastructure or platform as a service).

ML is all about mathematics. There is no way to escape this fact. Neural networks appear to give you a seemingly good escape route (we will cover them in detail in Chapter 12), but eventually you will need to dig under the hood to understand what is going on. This is tightly associated with interpretability of your model. With our ordinary least-square method of regression, coefficients can be easily explained. A particular coefficient represents how much the target changes for a unit change in the corresponding feature. With more complex models, the situation is different. At any rate, I will talk more about inner characteristics of models in Chapter 9.
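For instance, here is a toy illustration (with made-up coefficients, independent of the case study) showing that interpreting an ordinary least-squares model boils down to reading off its coefficients:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
X = np.random.normal(size=(100, 2))
# Target built so that a unit change in the first feature moves it by ~2,
# and a unit change in the second by ~-0.5.
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + np.random.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_)       # roughly [ 2.0, -0.5]
print(model.intercept_)  # roughly 0.0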

ML is getting huge attention from both research institutions and industry. This isn’t surprising, since as we delve more into the realm of Big Data, there is a greater need to handle such large amounts of data in an efficient manner. Only with the help of machines can we hope to seize control of massive data. One hot topic is transfer learning, which is an attempt to boost reuse of trained models. The idea is to leverage models optimized for one task to perform well on other tasks, too (maybe with minor extra tweaking and training). This will surely trigger a whole bunch of new algorithms and technologies.

One urgent need in the area of machine learning is an efficient and standardized way to exchange prediction models. One such standard is the Predictive Model Markup Language (PMML). There is a Python library for converting scikit-learn pipelines to PMML at https://github.com/jpmml/sklearn2pmml .
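A hedged sketch of how a scikit-learn pipeline might be exported with that library (the exact API may vary between versions, and the conversion step requires a local Java runtime; consult the project’s README):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import RobustScaler
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

# Synthetic stand-ins for a real feature matrix and target vector.
np.random.seed(0)
X = np.random.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + np.random.normal(scale=0.1, size=200)

pipeline = PMMLPipeline([
    ('scaler', RobustScaler()),
    ('regressor', LinearRegression())
])
pipeline.fit(X, y)
sklearn2pmml(pipeline, 'linear_model.pmml')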

References

  1. Christopher Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

  2. Nathan George, “Machine Learning for Finance in Python,” DataCamp, https://www.datacamp.com/courses/machine-learning-for-finance-in-python .

  3. Ian H. Witten, Eibe Frank, Mark A. Hall, and Christopher J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, Morgan Kaufmann, 2016.