Resampling methods are a set of techniques used to repeat data sampling – they simply rearrange the data to estimate the accuracy of a statistic. If we are developing a simulation model and we get unsatisfactory results, we can try to reorganize the starting data to remove any spurious correlations and re-check the capabilities of the model. Resampling methods are one of the most interesting inferential applications of stochastic simulations and random numbers. They are particularly useful in the nonparametric field, where the traditional inference methods cannot be correctly applied. They generate random numbers to be assigned to random variables or random samples. They require computing time that grows with the number of repeated operations. They are very simple to implement and, once implemented, they are automatic. The required elements must be placed in a sample that is, or at least can be, representative of the population. To achieve this, all the characteristics of the population must be included in the sample. In this chapter, we will try to extrapolate the results obtained from a representative sample to the entire population. Given the possibility of making mistakes in this extrapolation, it will be necessary to evaluate the degree of accuracy of the sample and the risk of arriving at incorrect predictions. We will learn how to apply resampling methods to approximate some characteristics of the distribution of a sample to validate a statistical model. We will analyze the basics of the most common resampling methods and learn how to use them by solving some practical cases.
In this chapter, we’re going to cover the following main topics:
In this chapter, we will address resampling method technologies. To deal with the topics in this chapter, you must have a basic knowledge of algebra and mathematical modeling. To work with the Python code in this chapter, you’ll need the following files (available on GitHub at the following URL: https://github.com/PacktPublishing/Hands-On-Simulation-Modeling-with-Python-Second-Edition):
Resampling methods are a set of techniques based on the use of subsets of data, which can be extracted either randomly or according to a systematic procedure. The purpose of these techniques is to approximate some characteristics of the sample distribution – a statistic, a test, or an estimator – to validate a statistical model.
Resampling methods are one of the most interesting inferential applications of stochastic simulations and the generation of random numbers. These methods became widespread during the 1960s, originating from the basic concepts of Monte Carlo methods. Their development took place mainly in the 1980s, following the progress of information technology and the increase in the power of computers. Their usefulness is linked to the development of non-parametric methods, in situations where the methods of classical inference cannot be correctly applied.
The following details can be observed from resampling methods:
Over time, various resampling methods have been developed and can be classified based on some characteristics.
Important note
A first classification can be made between methods based on randomly extracting subsets of sample data and methods in which resampling occurs according to a non-randomized procedure.
Further classification can be performed as follows:
Sampling is one of the fundamental topics of all statistical research. Sampling generates a group of elementary units, that is, a subset of a population, with the same properties as the entire population, at least with a defined risk of error.
By population, we mean the set, finite or unlimited, of all the elementary units to which a certain characteristic is attributed, which identifies them as homogeneous.
Important note
For example, this could be the population of temperature values in each place, in a period that can be daily, monthly, or yearly.
Sampling theory is an integral and preparatory part of statistical inference, along with the resulting sampling techniques, and allows us to identify the units whose variables are to be analyzed.
Statistical sampling is a method used to randomly select items so that every item in a population has a known, non-zero probability of being included in the sample. Random selection is considered a powerful means of building a representative sample that, in its structure and diversity, reflects the population under consideration. Statistical sampling allows us to obtain an objective sample: the selection of an element does not depend on the criteria defined for reasons of research convenience or availability and does not systematically exclude or favor any group of elements within a population.
In random sampling, associated with the calculation of probabilities, the following actions are performed:
Now, let’s learn why it may be preferable to analyze the data of a sample rather than that of the entire population:
Important note
Sampling is used if not all the elements of the population are available. For example, investigations into the past can only be done on available historical data, which is often incomplete.
When information is collected, a survey is performed on all the units that make up the population under study. When an analysis is carried out on the information collected, it is only possible to use it on part of the units that make up the population.
The pros of sampling are as follows:
The disadvantages of sampling are as follows:
Sampling can be performed by forced choice in cases where the reference population is partially unknown in terms of composition or size. Sampling cannot always replace a complete investigation, such as in the case of surveys regarding the movement of marital status, births, and deaths: all individual cases must be known.
In probability sampling, the probability that each unit of the population has of being extracted is known. In contrast, in non-probability sampling, this probability is not known.
Let’s take a look at an example. If we extract a sample of university students by drawing lots from those present at the university on a given day, we do not get a probabilistic sample for the following reasons: non-attending students have no chance of being included, and students who attend more often are more likely to be extracted than those who attend less often.
The sampling procedure involves a series of steps that need to be followed appropriately to extract data that can adequately represent the population. Sampling is carried out as follows:
Now that we’ve adequately introduced the various sampling techniques, let’s look at a practical case.
The Jackknife method is used to estimate characteristics such as the bias and the standard deviation of a statistic. This technique allows us to obtain the desired estimates without necessarily resorting to parametric assumptions. A statistical parameter is a value that defines an essential characteristic of a population, so it is essential for its description. Jackknife is based on calculating the statistic of interest on the sub-samples obtained by leaving out one sample observation at a time. The Jackknife estimate is consistent for various sample statistics, such as the mean, variance, correlation coefficient, maximum likelihood estimator, and others.
The Jackknife method was proposed in 1949 by M. H. Quenouille who, due to the low computational power of the time, created an algorithm that requires a fixed number of operations.
Important note
The main idea behind this method is to cut a different observation from the original sample each time and to re-evaluate the parameter of interest. The estimate will be compared with the same one that was calculated on the original sample.
Since the distribution of the variable is not known, the distribution of the estimator is not known either.
Jackknife samples are constructed by leaving an observation, xi, out of the original sample each time, as shown in the following equation:
Then, n samples of size m = n-1 are obtained. Let’s take a look at an example. Consider a sample of size n = 5 that produces five Jackknife samples of size m = 4, as follows:
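As a minimal sketch of this construction, the five leave-one-out samples can be listed with a few lines of Python. The sample values and the helper name jackknife_samples are illustrative assumptions and do not come from the chapter's code:

```python
def jackknife_samples(sample):
    """Return the n leave-one-out samples of a list, each of size n - 1."""
    return [sample[:i] + sample[i + 1:] for i in range(len(sample))]


# Hypothetical sample of size n = 5
sample = [1, 2, 3, 4, 5]
jack_samples = jackknife_samples(sample)

for i, s in enumerate(jack_samples, start=1):
    print(f"Jackknife sample {i} (x{i} left out): {s}")
```

Each of the five printed lists has four elements, and every observation is missing from exactly one of them.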
The pseudo-value is recalculated on the generic i-th Jackknife sample. The procedure is iterated n times, once for each of the available Jackknife samples:
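In assumed notation, with \hat{\theta} the estimate computed on the full sample and \hat{\theta}_{(i)} the estimate computed on the i-th Jackknife sample (these symbols are not taken from the original text), the pseudo-value that matches the Python code used later in this chapter can be written as:

```latex
\tilde{\theta}_i = n\,\hat{\theta} - (n-1)\,\hat{\theta}_{(i)}, \qquad i = 1, \dots, n
```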
The following diagram shows this preliminary procedure:
Figure 6.1 – Representation of the Jackknife method
To calculate the variance of the Jackknife estimate, the following equation must be used:
In the previous equation, the term is defined as follows:
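A standard form of this variance, written in assumed notation where \tilde{\theta}_i denotes the i-th pseudo-value and \bar{\tilde{\theta}} their mean, is shown below. It agrees with the calculation carried out later in the chapter's code, where the sample variance of the pseudo-values is divided by N:

```latex
\widehat{\mathrm{Var}}_{\mathrm{jack}} = \frac{1}{n(n-1)} \sum_{i=1}^{n} \left( \tilde{\theta}_i - \bar{\tilde{\theta}} \right)^2,
\qquad
\bar{\tilde{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \tilde{\theta}_i
```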
The calculated standard deviation will be used to build confidence intervals for the parameters.
To evaluate and possibly reduce the estimator bias, the Jackknife estimate of the bias is calculated as follows:
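The following sketch illustrates the idea using the commonly quoted form of the Jackknife bias estimate, (n - 1) times the difference between the mean of the leave-one-out estimates and the full-sample estimate. The data and the function names plug_in_variance and jackknife_bias are hypothetical, and the divide-by-n variance is chosen only because its bias is well understood:

```python
import statistics


def plug_in_variance(data):
    """Biased (divide-by-n) variance estimator."""
    return statistics.pvariance(data)


def jackknife_bias(data, estimator):
    """(n - 1) * (mean of leave-one-out estimates - full-sample estimate)."""
    n = len(data)
    full_estimate = estimator(data)
    loo_estimates = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    return (n - 1) * (statistics.mean(loo_estimates) - full_estimate)


# Hypothetical data
data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
bias = jackknife_bias(data, plug_in_variance)
corrected = plug_in_variance(data) - bias

print("Plug-in variance:       ", plug_in_variance(data))
print("Jackknife bias estimate:", bias)
print("Bias-corrected variance:", corrected)
print("Unbiased variance:      ", statistics.variance(data))
```

For this particular estimator, subtracting the Jackknife bias estimate recovers the usual unbiased (divide-by-n-1) sample variance, which makes the bias reduction easy to check.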
Essentially, the Jackknife method reduces bias and evaluates variance for an estimator.
To make comparisons regarding variability between different distributions, we can use the coefficient of variation (CV) since it considers the average of the distribution. The variation coefficient is a relative measure of dispersion and is a dimensionless magnitude. It allows us to evaluate the dispersion of the values around the average, regardless of the unit of measurement.
Important note
For example, the standard deviation of a sample of income expressed in dollars is completely different from the standard deviation of the same income expressed in euros, while the coefficient of variation is the same in both cases.
The coefficient of variation is calculated using the following equation:
In the previous equation, we use the following parameters:
The variance is the average of the squared differences between each of the observations in a group of data and the arithmetic mean of the data:
So, it represents the squared error that we commit, on average, replacing a generic observation, xi, with the average, µ. The standard deviation is the square root of the variance and therefore represents the square root of the mean squared error:
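Collected in one place, and in assumed notation (N observations x_i with mean µ and standard deviation σ), the three quantities introduced in this section can be written as follows. Note that the Python code later in this chapter uses the sample standard deviation, which divides by N - 1 rather than N:

```latex
CV = \frac{\sigma}{\mu}, \qquad
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
```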
The coefficient of variation, which can be defined starting from the average and standard deviation, is the appropriate index for comparing the variability of two variables. CV is particularly useful when you want to compare the dispersion of data with different units of measurement or with different ranges of variation.
If the mean of the distribution approaches zero, the coefficient of variation will approach infinity. In this case, it is sensitive to small variations in the mean.
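The independence from the unit of measurement can be checked with a short sketch. The income values and the exchange rate are purely illustrative assumptions, and coefficient_of_variation is a hypothetical helper name:

```python
import statistics


def coefficient_of_variation(data):
    """Ratio between the sample standard deviation and the mean."""
    return statistics.stdev(data) / statistics.mean(data)


# Hypothetical incomes in dollars and an assumed exchange rate
incomes_usd = [30000.0, 42000.0, 55000.0, 61000.0, 78000.0]
usd_to_eur = 0.9
incomes_eur = [value * usd_to_eur for value in incomes_usd]

cv_usd = coefficient_of_variation(incomes_usd)
cv_eur = coefficient_of_variation(incomes_eur)

print("Std dev (USD):", statistics.stdev(incomes_usd))
print("Std dev (EUR):", statistics.stdev(incomes_eur))
print("CV (USD):", cv_usd)
print("CV (EUR):", cv_eur)
```

The two standard deviations differ by the conversion factor, while the two coefficients of variation coincide up to floating-point rounding.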
Now, let’s look at some Python code that compares the CV of a distribution and the one obtained with resampling according to the Jackknife method:
import random
import statistics
import matplotlib.pyplot as plt
The random module implements pseudo-random number generators for various distributions. It is based on the Mersenne Twister algorithm, a pseudorandom number generator originally developed to produce inputs for Monte Carlo simulations. The numbers it generates are almost uniformly distributed, which makes them suitable for a wide range of applications.
The statistics module contains numerous functions for calculating mathematical statistics from numerical data. With the tools available in this module, it is possible to calculate averages and other measures of central tendency, as well as measures of dispersion.
The matplotlib library is a Python library for producing high-quality graphics. With matplotlib, it is possible to generate graphs, histograms, bar graphs, power spectra, error graphs, scatter graphs, and so on with a few commands. Its pyplot module is a collection of command-style functions like those provided by the MATLAB software.
PopData = list()
A list is an ordered collection of values and can be of various types. It is an editable container – it allows us to add, delete, and modify existing values. For our purpose, which is to continuously update our values, the list represents the most suitable solution. The list() function accepts a sequence of values and converts them into lists. With this command, we simply initialized the list, which is currently empty.
random.seed(5)
The random.seed() function is useful if we want to have the same set of data available to be processed in different ways as this makes the simulation reproducible.
Important note
This function initializes the basic random number generator. If you use the same seed in two successive simulations, you always get the same sequence of random numbers.
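A quick way to see this behavior is to seed the generator twice with the same value and compare the numbers produced. The variable names below are illustrative only:

```python
import random

random.seed(5)
first_run = [random.random() for _ in range(3)]

random.seed(5)  # re-seeding with the same value restarts the same sequence
second_run = [random.random() for _ in range(3)]

random.seed(6)  # a different seed produces a different sequence
third_run = [random.random() for _ in range(3)]

print("Same seed, same numbers:", first_run == second_run)
print("Different seed, same numbers:", first_run == third_run)
```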
for i in range(100):
    DataElem = 10 * random.random()
    PopData.append(DataElem)
In the previous piece of code, we generated 100 random numbers between 0 and 1 using the random() function. Then, for each step of the for loop, this number was multiplied by 10 to obtain a distribution of numbers comprised between 0 and 10.
def CVCalc(Dat):
    CV = statistics.stdev(Dat)/statistics.mean(Dat)
    return CV
As indicated in the Estimating the coefficient of variation section, this coefficient is simply the ratio between the standard deviation and the mean. To calculate the standard deviation, we used the statistics.stdev() function. This function calculates the sample standard deviation, which represents the square root of the sample variance. To calculate the mean of the data, we used the statistics.mean function. This function calculates the sample arithmetic mean of the data. We can immediately use the newly created function to calculate the variation coefficient of the distribution that we have created:
CVPopData = CVCalc(PopData)
print(CVPopData)
The following result is returned:
0.6569398125747403
For now, we will leave this result out, but we will use it later to compare the results we obtained by resampling.
N = len(PopData)
JackVal = list()
PseudoVal = list()
N represents the number of observations in the starting distribution. The JackVal list will contain the current Jackknife sample, while the PseudoVal list will contain the Jackknife pseudo values.
for i in range(N-1):
    JackVal.append(0)
for i in range(N):
    PseudoVal.append(0)
The JackVal list has a length of N-1 and relates to what we discussed in the Defining the Jackknife method section.
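for i in range(N):
    # Build the i-th Jackknife sample by skipping observation i
    for j in range(N):
        if j < i:
            JackVal[j] = PopData[j]
        elif j > i:
            JackVal[j-1] = PopData[j]
    # Pseudo value for the i-th Jackknife sample
    PseudoVal[i] = N*CVCalc(PopData) - (N-1)*CVCalc(JackVal)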
Jackknife samples (JackVal) are constructed by leaving an observation, xi, out of the original sample at each step of the external loop (for i in range(N)). At the end of each step of the external cycle, the pseudo value is evaluated using the following equation:
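The pseudo value computed at each step is N times the estimate on the full sample minus (N - 1) times the estimate on the i-th Jackknife sample. The same calculation can be sketched more compactly with list slicing. The helper name jackknife_pseudo_values and the data are hypothetical, and the mean is used as the estimator only because it gives an easy sanity check:

```python
import statistics


def jackknife_pseudo_values(data, estimator):
    """Return the pseudo values n*theta - (n-1)*theta_(i) for i = 0..n-1."""
    n = len(data)
    full_estimate = estimator(data)
    pseudo_values = []
    for i in range(n):
        loo_sample = data[:i] + data[i + 1:]  # leave observation i out
        pseudo_values.append(n * full_estimate - (n - 1) * estimator(loo_sample))
    return pseudo_values


# Hypothetical data; with the mean as estimator, the pseudo values
# reduce to the original observations (up to rounding).
data = [1.5, 2.0, 3.5, 4.0, 6.0]
pseudo = jackknife_pseudo_values(data, statistics.mean)
print(pseudo)
```

Passing a different estimator, such as a coefficient of variation function, reuses the same leave-one-out logic without preallocating the lists.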
plt.hist(PseudoVal)
plt.show()
The following graph will be printed:
Figure 6.2 – Distribution of pseudo values
MeanPseudoVal=statistics.mean(PseudoVal)
print(MeanPseudoVal)
The following result is returned:
0.6545985339842991
As we can see, the value we’ve obtained is comparable with what we obtained from the starting distribution. Now, we will calculate the variance of the pseudo values:
VariancePseudoVal = statistics.variance(PseudoVal)
print(VariancePseudoVal)
The following result is returned:
0.2435929299444099
Finally, let’s evaluate the variance of the Jackknife estimator:
VarJack = statistics.variance(PseudoVal)/N
print(VarJack)
The following result is returned:
0.002435929299444099
We can use these results to compare the different resampling methods.
The resampling techniques are different and each approaches the problem from a different point of view. Now, let’s learn how to resample the data through bootstrapping.
The most well-known resampling technique is bootstrapping, introduced by B. Efron in 1979. The logic of the bootstrap method is to build samples that are not observed but are statistically similar to those observed. This is achieved by resampling the observed series through an extraction procedure with replacement.
This procedure is like extracting a number from an urn and putting it back before the next extraction. Once a test statistic has been chosen, it is calculated both on the observed sample and on a large number, N, of samples of the same size as the observed one, obtained by resampling. The N values of the test statistic then allow us to define the sample distribution – that is, the empirical distribution of the chosen statistic.
Important note
A statistical test is a rule for discriminating samples that, if observed, lead to the rejection of an initial hypothesis, from those which, if observed, lead to accepting the same hypothesis until proven otherwise.
Since the bootstrapped samples derive from a random extraction process with replacement from the original series, any temporal correlation structure of the observed series is not preserved. It follows that bootstrapped samples have properties similar to those of the observed sample, but respect, at least approximately, the hypothesis of independence. This makes them suitable for calculating the distributions of test statistics under a null hypothesis of no trends, change points, or other systematic temporal patterns.
Once the sample distribution of the generic test statistic under the null hypothesis is known, it is possible to compare the value of the statistic calculated on the observed sample with the quantiles deduced from the sample distribution, and check whether the value falls into the critical regions at a significance level of, for example, 5% or 10%. Alternatively, you can compute the percentage of times that the value of the statistic calculated on the observed sample is exceeded by the values coming from the N samples. This percentage is the p-value of the statistic for the observed sample, and it can be compared with the commonly adopted significance levels of 5% and 10%.
Bootstrapping is a statistical technique of resampling with replacement that lets us approximate the sample distribution of a statistic. Therefore, it allows us to approximate the mean and variance of an estimator so that we can build confidence intervals and calculate test p-values when the distribution of the statistic of interest is not known.
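The second approach, counting how often the resampled statistics exceed the observed one, can be sketched as follows. This is only an illustration under stated assumptions: the series, the choice of the least-squares slope as the test statistic, and the helper names slope and bootstrap_p_value are hypothetical and not taken from this chapter:

```python
import random


def slope(series):
    """Least-squares slope of the series against its time index 0, 1, 2, ..."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def bootstrap_p_value(series, statistic, n_resamples=2000, seed=0):
    """Share of resampled statistics at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(statistic(series))
    exceed = 0
    for _ in range(n_resamples):
        # Resampling with replacement destroys any temporal structure,
        # which mimics the null hypothesis of no trend.
        resample = rng.choices(series, k=len(series))
        if abs(statistic(resample)) >= observed:
            exceed += 1
    return exceed / n_resamples


# Hypothetical series with a clear upward trend plus Gaussian noise
rng = random.Random(1)
series = [0.1 * t + rng.gauss(0, 1) for t in range(50)]
p_value = bootstrap_p_value(series, slope)
print("Bootstrap p-value for the no-trend hypothesis:", p_value)
```

A p-value below the chosen significance level (for example, 5%) would lead us to reject the hypothesis that the series has no trend.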
Important note
Bootstrap is based on the fact that the only available sample is used to generate many more samples and to build the theoretical reference distribution. You use the data from the original sample to calculate a statistic and estimate its sample distribution without making any assumptions about the distribution model.
So, the original sample is used to generate the distribution; that is, the estimate of θ is constructed by substituting the empirical equivalent of the unknown distribution function of the population. The distribution function of the sample is obtained by constructing a distribution of frequencies of all the values it can assume in that experimental situation.
In the simple case of simple random sampling, the operation is as follows. Consider an observed sample with n elements, as described by the following equation:
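Written out, with the component symbols being an assumed notation, such a sample is:

```latex
x = (x_1, x_2, \dots, x_n)
```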
From this sample, m other samples of constant size n, say x*1, ..., x*m, are drawn by resampling. In each bootstrap extraction, any element of the original sample can be extracted more than once, and each element has a probability equal to 1/n of being extracted.
Let E be the estimator of θ that we are interested in studying, say, E(x) = θ. Here, θ is a parameter of the statistical distribution that is essential for its description. This quantity is calculated for each bootstrap sample, E(x*1), …, E(x*m). In this way, m estimates of θ are available, from which it is possible to calculate the bootstrap mean, the bootstrap variance, the bootstrap percentiles, and so on. These values are approximations of the corresponding unknown values and carry information on the distribution of E(x). Therefore, starting from these estimated quantities, it is possible to calculate confidence intervals, test hypotheses, and so on.
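The procedure described above can be sketched in a few lines. Everything named here is an assumption made for illustration: the observed sample is simulated, the mean plays the role of the estimator E, and bootstrap_estimates and percentile_interval are hypothetical helpers using a simple index-based percentile rule:

```python
import random
import statistics


def bootstrap_estimates(sample, estimator, m=2000, seed=0):
    """Apply the estimator to m resamples of size n drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [estimator(rng.choices(sample, k=n)) for _ in range(m)]


def percentile_interval(values, level=0.95):
    """Simple percentile interval read off the sorted values."""
    ordered = sorted(values)
    lower_idx = int((1 - level) / 2 * len(ordered))
    upper_idx = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lower_idx], ordered[upper_idx]


# Hypothetical observed sample of n = 100 elements
rng = random.Random(42)
sample = [rng.uniform(0, 50) for _ in range(100)]

estimates = bootstrap_estimates(sample, statistics.mean)
boot_mean = statistics.mean(estimates)
boot_var = statistics.variance(estimates)
ci_low, ci_high = percentile_interval(estimates)

print("Bootstrap mean:         ", boot_mean)
print("Bootstrap variance:     ", boot_var)
print("95% percentile interval:", (ci_low, ci_high))
```

Note that every resample is drawn from the observed sample itself, not from the underlying population, which is what distinguishes bootstrapping from repeated sampling.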
We will proceed in a similar way to what we did for Jackknife resampling: we will generate a random distribution, carry out a resampling according to the bootstrap method, and then compare the results. Let’s see the code step by step to understand the procedure:
import random
import numpy as np
import matplotlib.pyplot as plt
The random module implements pseudo-random number generators for various distributions. It is based on the Mersenne Twister algorithm, a pseudorandom number generator originally developed to produce inputs for Monte Carlo simulations. The numbers it generates are almost uniformly distributed, which makes them suitable for a wide range of applications.
numpy is a Python library that contains numerous functions that help us manage multidimensional matrices. Furthermore, it contains a large collection of high-level mathematical functions that we can use on these matrices.
matplotlib is a Python library for producing high-quality graphics. With matplotlib, it is possible to generate graphs, histograms, bar graphs, power spectra, error graphs, scatter graphs, and so on with a few commands. Its pyplot module is a collection of command-style functions like those provided by the MATLAB software.
PopData = list()
A list is an ordered collection of values and can be of various types. It is an editable container – it allows us to add, delete, and modify existing values. For our purpose, which is to provide continuous updates for our values, the list represents the most suitable solution. The list() function accepts a sequence of values and converts them into lists. With this command, we simply initialized the list that is currently empty. This list will be populated by generating random numbers.
random.seed(7)
The random.seed() function is useful if we want to have the same set of data available to be processed in different ways as it makes the simulation reproducible.
for i in range(1000):
    DataElem = 50 * random.random()
    PopData.append(DataElem)
In the previous piece of code, we generated 1,000 random numbers between 0 and 1 using the random() function. Then, for each step of the for loop, this number was multiplied by 50 to obtain a distribution of numbers comprised between 0 and 50.
PopSample = random.choices(PopData, k=100)
This function extracts a sample of k elements chosen from the population with replacement. We extracted a sample of 100 elements from the original population of 1,000 elements.
PopSampleMean = list()
for i in range(10000):
    SampleI = random.choices(PopData, k=100)
    PopSampleMean.append(np.mean(SampleI))
In this piece of code, we created a new list that will contain the sample means. Here, we used a for loop with 10,000 steps. At each step, a sample of 100 elements was extracted from the initial population using the random.choices() function. Then, we obtained the average of this sample. This value was then added to the end of the list.
Important note
We resampled the data with replacement, keeping the size of each resample (100 elements) equal to the size of the simple random sample extracted earlier.
plt.hist(PopSampleMean)
plt.show()
The following graph will be printed:
Figure 6.3 – Histogram of the sample distribution
Here, we can see that the distribution of the sample means is approximately normal, as expected from the central limit theorem.
MeanPopSampleMean = np.mean(PopSampleMean)
print("The mean of the Bootstrap estimator is ",MeanPopSampleMean)
The following result is returned:
The mean of the Bootstrap estimator is 24.105354873028915
Then, we can calculate the mean of the initial population:
MeanPopData = np.mean(PopData)
print("The mean of the population is ",MeanPopData)
The following result is returned:
The mean of the population is 24.087053989747968
Finally, we can calculate the mean of the simple sample that was extracted from the initial population:
MeanPopSample = np.mean(PopSample)
print("The mean of the simple random sample is ",MeanPopSample)
The following result is returned:
The mean of the simple random sample is 23.140472976536497
We can now compare the results. Here, the mean of the population and the mean of the bootstrap estimator are practically identical, while the mean of the single sample deviates from these values. This is expected: averaging over 10,000 resamples smooths out the sampling variability that a single sample of 100 elements carries in full.
In this section, we will compare the two sampling methods that we have studied by highlighting their strengths and weaknesses:
Now, let’s learn how to use bootstrapping in the case of a regression problem.
Linear regression analysis is used to determine a linear relationship between two variables, x and y. If x is the independent variable, we try to verify if there is a linear relationship with the dependent variable, y: we try to identify the line capable of representing the distribution of points on a two-dimensional plane. If the points corresponding to the observations are close to the line, the model will effectively describe the link between x and y. The lines that can approximate the observations are infinite, but only one of them optimizes the representation of the data. In the case of a linear mathematical relationship, the observations of y can be obtained from a linear function of the observations of x:
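Using the symbols that the rest of this section adopts (α for the slope and β for the intercept), and leaving out any error term as a simplifying assumption, this relationship takes the form:

```latex
y = \alpha x + \beta
```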
In the previous equation, the terms are defined as follows:
The α and β parameters must be estimated starting from the observations collected for the two variables, x and y.
The slope, α, represents the variation of the mean response for every single increment of the explanatory variable:
Therefore, regression analysis aims to identify the α and β parameters that minimize the difference between the observed values of y and the estimated ones.
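For a single explanatory variable, the parameters that minimize the sum of squared differences have a well-known closed form. The sketch below computes them in plain Python. The function name least_squares_line and the data, which lie exactly on the line y = 2x + 1, are illustrative assumptions:

```python
def least_squares_line(xs, ys):
    """Return (alpha, beta): the slope and intercept of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    alpha = sxy / sxx                # slope
    beta = y_mean - alpha * x_mean   # intercept
    return alpha, beta


# Hypothetical observations lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
alpha, beta = least_squares_line(xs, ys)
print(f"slope (alpha) = {alpha}, intercept (beta) = {beta}")
```

Because the points are perfectly aligned, the function recovers a slope of 2 and an intercept of 1. With noisy data, the same formulas return the line with the smallest residual sum of squares.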
In the Demystifying bootstrapping section, we adequately introduced the bootstrapping technique: we can use this methodology to evaluate, for example, the uncertainty returned by an estimation model. In the example we will analyze, we will apply this methodology to study how possible outliers cause a significant deviation in the variance of a dependent variable that is explained by an independent variable in a regression model.
Here is the Python code (bootstrap_regression.py). As always, we will analyze the code line by line:
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
The numpy library is a Python library that contains numerous functions that can help us manage multidimensional matrices. Furthermore, it contains a large collection of high-level mathematical functions we can use on these matrices. Here, we imported the LinearRegression() function from sklearn.linear_model: this function performs an ordinary least squares linear regression.
The matplotlib library is a Python library for printing high-quality graphics.
The pandas library is an open source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Finally, we imported the seaborn library. It is a Python library that enhances the data visualization tools of the matplotlib module. In the seaborn module, there are several features we can use to graphically represent our data. Some methods facilitate the construction of statistical graphs with matplotlib.
x = np.linspace(0, 1, 100)
y = x + (np.random.rand(len(x)))
for i in range(30):
    x=np.append(x, np.random.choice(x))
    y=np.append(y, np.random.choice(y))
x=x.reshape(-1, 1)
y=y.reshape(-1, 1)
To start, we generated 100 values for the independent variable, x, using the linspace() function: the linspace() function of the numpy library allows us to define an array composed of a series of N numerical elements equally spaced between two extremes (0, 1). Then, we generated the dependent variable, y, by adding a random term to the value of x using the random.rand() function: this function creates an array of the given shape and populates it with random samples from a uniform distribution over [0, 1). Next, to add outliers artificially, we added 30 more observations using the random.choice() function: this function returns a random element of the non-empty sequence passed as an argument (see the The random.choice() function section in Chapter 2, Understanding Randomness and Random Numbers). Because the extra x and y values are drawn independently of each other, these pairs do not follow the linear relationship. Finally, we used the reshape() function to format the data as required by the linear regression model we will be using. In this case, we passed the parameters (-1, 1): the -1 tells NumPy to infer the number of rows from the length of the array, while the number of columns is set to 1.
reg_model = LinearRegression().fit(x, y)
r_sq = reg_model.score(x, y)
print(f"R squared = {r_sq}")
The LinearRegression() function minimizes the residual sum of squares between the observed targets in the dataset and the predicted targets from the linear approximation. To do this, we simply passed the x and y variables to the fit() method. Subsequently, to evaluate the performance of the model, we computed the coefficient of determination, R-squared. R-squared measures how well a model can predict the data and, for a least squares fit evaluated on the data used to fit it, falls between zero and one. The higher the value of the coefficient of determination, the better the model is at predicting the data. The following value is printed on the screen:
R squared = 0.286581509708418
The value of the coefficient tells us that only about 29% of the variance is explained by the model. We must retrieve the values of the model parameters – slope and intercept:
alpha = float(reg_model.coef_[0][0])
print(f"slope: {reg_model.coef_}")
beta = float(reg_model.intercept_[0])
print(f"intercept: {reg_model.intercept_}")
The following values are printed on the screen:
slope: [[0.79742372]]
intercept: [0.5632016]
Then, we must draw a graph in which we report all the observations and the regression line. To do this, we must use the model to obtain the values of y starting from the values of x:
y_pred = reg_model.predict(x)
plt.scatter(x, y)
plt.plot(x, y_pred, linewidth=2)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
The following graph will be returned:
Figure 6.4 – Regression line on a scatter plot
boot_slopes = []
boot_interc = []
r_sqs= []
n_boots = 500
num_sample = len(x)
data = pd.DataFrame({'x': x[:,0],'y': y[:,0]})
We will need the first three lists to collect the slope, intercept, and R-squared values obtained in each bootstrap iteration. Next, we set the number of bootstrap iterations and the number of observations to be drawn in each resample. Finally, we created a DataFrame with the initial data (x, y): this will be used for resampling.
Now, let’s create a figure for the diagram we’re going to draw and then set up a for loop that repeats the instructions a number of times equal to the number of bootstrap iterations:
plt.figure()
for k in range(n_boots):
    sample = data.sample(n=num_sample, replace=True)
    x_temp = sample['x'].values.reshape(-1, 1)
    y_temp = sample['y'].values.reshape(-1, 1)
    reg_model = LinearRegression().fit(x_temp, y_temp)
    r_sqs_temp = reg_model.score(x_temp, y_temp)
    r_sqs.append(r_sqs_temp)
    boot_interc.append(float(reg_model.intercept_[0]))
    boot_slopes.append(float(reg_model.coef_[0][0]))
    y_pred_temp = reg_model.predict(x_temp)
    plt.plot(x_temp, y_pred_temp, color='grey', alpha=0.2)
To start, we extracted the sample of the initial distribution using the sample() function. This function returns a random sample of elements; two parameters were passed: n = num_sample, replace = True. The first sets the number of elements to draw, which is equal to the number of initial observations. The second parameter (replace = True) enables sampling with replacement, so the same observation can appear more than once in a resample, while others may be left out entirely.
After extracting the new sample, we must extract the values to be used to fit the model. After evaluating the parameters of the regression line and the coefficient of determination, R-squared, these values are added to the previously initialized lists. Finally, the regression line for the current model is evaluated and added to the graph. This procedure is repeated a number of times equal to the number of bootstrap iterations.
plt.scatter(x, y)
plt.plot(x, y_pred, linewidth=2)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
First, we add all the regression lines we have evaluated to the scatter plot of the observations. The result is shown in the following figure:
Figure 6.5 – Regression lines evaluation
sns.histplot(data=boot_slopes, kde=True)
plt.show()
sns.histplot(data=boot_interc, kde=True)
plt.show()
The following graph shows the slope distribution of all 500 models fitted to the samples extracted from the initial distributions:
Figure 6.6 – The model’s slope distribution
The following graph shows the distribution of the intercept of all 500 models fitted to the samples extracted from the initial distributions:
Figure 6.7 – The model’s intercept distribution
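The spread of these distributions can also be summarized numerically, for example with a percentile interval. The following is a minimal sketch that uses synthetic stand-in data, because the chapter's x and y are defined elsewhere; the data-generating values and the names rng, slopes, low, and high are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): a noisy linear relation
x = rng.uniform(0, 1, size=(100, 1))
y = 1.0 * x + 0.5 + rng.normal(scale=0.5, size=(100, 1))

n_boots = 500
slopes = np.empty(n_boots)
for k in range(n_boots):
    # Resample row indices with replacement
    idx = rng.integers(0, len(x), size=len(x))
    model = LinearRegression().fit(x[idx], y[idx])
    slopes[k] = model.coef_[0, 0]

# 95% percentile interval of the bootstrap slopes
low, high = np.percentile(slopes, [2.5, 97.5])
print(f"Slope 95% percentile interval: [{low:.3f}, {high:.3f}]")
```

The same two lines at the end can be applied to the boot_slopes and boot_interc lists collected in the loop above.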
plt.plot(r_sqs)
The following graph is returned:
Figure 6.8 – The model’s performance evaluation
We can see that the values oscillate between 0.1 and 0.5, so we try to extract the maximum value:
max_r_sq = max(r_sqs)
print(f"Max R squared = {max_r_sq}")
The following value is displayed:
Max R squared = 0.5245632432772953
If we compare this value with the one obtained on the initial data (R-squared = 0.286581509708418), the difference is evident: the best-fitting resample explains about 52% of the variance, against about 29% for the original fit. Keep in mind that this maximum is computed on a resampled dataset, so it describes how much the quality of the fit varies across bootstrap samples rather than a guaranteed gain on new data.
Now, let’s try to extract the values of the parameters of the regression line that return these results:
pos_max_r_sq = r_sqs.index(max(r_sqs))
print(f"Boot of the best Regression model = {pos_max_r_sq}")
max_slope = boot_slopes[pos_max_r_sq]
print(f"Slope of the best Regression model = {max_slope}")
max_interc = boot_interc[pos_max_r_sq]
print(f"Intercept of the best Regression model = {max_interc}")
The following results are shown:
Boot of the best Regression model = 383
Slope of the best Regression model = 1.1086506372800053
Intercept of the best Regression model = 0.3752482619581162
In this way, we will be able to draw the regression line that best approximates the resampled data.
Now that we’ve analyzed the resampling techniques in detail, let’s learn how to perform permutation tests.
When observing a phenomenon belonging to a set of possible results, we ask ourselves what the law of probability is that we can assign to this set. Statistical tests provide a rule that allows us to decide whether to reject a hypothesis based on the sample observations.
Parametric approaches rely on strong assumptions about the experimental design and the population model. When these assumptions are not respected, particularly when the data distribution does not conform to the requirements of the test, the parametric results are less reliable. When the hypothesis is not based on knowledge of the data distribution and assumptions have not been verified, nonparametric tests are used. Nonparametric tests offer a very important alternative since they need fewer assumptions.
Permutation tests are a special case of randomization tests, which use random rearrangements of the observed data to draw statistical inferences. The computing power of modern computers has made their widespread application possible. These methods do not require specific assumptions about the data distribution to be met.
A permutation test is performed through the following steps:
Two experiments use values in the same sample space under the respective distributions, P1 and P2, both of which are members of an unknown population distribution. Given the same dataset, x, the inference conditional on x, obtained using the same test statistic, is the same for both, provided that exchangeability within each group holds under the null hypothesis. The importance of permutation tests lies in their robustness and flexibility. The idea behind these methods is to generate a reference distribution from the data by recalculating the test statistic for each permutation of the data, and then to judge the observed statistic against the resulting discrete distribution.
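The idea of building a reference distribution by permuting the data can be sketched with a simple two-sample test on the difference of means. This is an illustrative sketch, not the chapter's procedure: the two groups are synthetic, and the names group_a, group_b, perm_stats, and p_value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples (assumption): two groups with different means
group_a = rng.normal(loc=0.0, scale=1.0, size=30)
group_b = rng.normal(loc=1.0, scale=1.0, size=30)

# Test statistic on the observed grouping
observed = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

# Reference distribution: recompute the statistic after shuffling the labels
n_perm = 2000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_stats[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# Two-sided empirical p-value; the +1 counts the observed arrangement itself
p_value = (np.sum(np.abs(perm_stats) >= abs(observed)) + 1) / (n_perm + 1)
print(f"Observed difference = {observed:.3f}, p-value = {p_value:.4f}")
```

Because only a random subset of all permutations is evaluated, this is an approximate (Monte Carlo) version of the test; enumerating every permutation would give the exact one.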
Now, let’s look at a practical case of a permutation test in a Python environment.
A permutation test is a powerful non-parametric test for comparing the central tendencies of two independent samples. No verification is needed on the variability of the population groups or the shape of the distribution. The limitation of the exact test is that it is practical only for small samples, since the number of possible permutations grows very quickly with the sample size.
Therefore, the permutation test allows us to evaluate the correlation between the data by returning the distribution of the test statistic under the null hypothesis: it is obtained by calculating all the possible values of the test statistic using an adequate number of resamplings of the observed data. In a dataset, the data labels are associated with those features; if the labels are swapped under the null hypothesis, the resulting tests produce exact significance levels. The confidence intervals can then be derived from the tests.
As mentioned in the Demystifying bootstrapping section, statistical significance tests initially assume the so-called null hypothesis. When comparing two or more groups of data, the null hypothesis always states that there is no difference between the groups regarding the parameter considered: the null hypothesis specifies that the groups are equal to each other and any observed differences must be attributed to chance alone.
We proceed by applying a statistical significance test, the result of which must be compared with a critical value: if the test result exceeds the critical value, then the difference between the groups is declared statistically significant and, therefore, the null hypothesis is rejected; otherwise, the null hypothesis is not rejected.
The results of a statistical test do not have a value of absolute and mathematical certainty, but only of probability. Therefore, a decision to reject the null hypothesis is probably right, but it could be wrong. The measure of this risk of falling into error is called the significance level of the test. The significance level can be chosen at will by the researcher, but usually, a level of 0.05 or 0.01 is chosen. The test itself returns a probability, called the p-value, which is compared with the chosen significance level: the null hypothesis is rejected when the p-value falls below it.
The p-value represents the probability of obtaining a result at least as extreme as the one observed if the difference is entirely due to sampling variability alone, that is, assuming that the null hypothesis is true. Being a probability, p can only assume values between 0 and 1. A p-value approaching 0 means that a difference as large as the observed one would rarely arise by chance alone.
Statistically significant does not mean relevant; it simply means that what has been observed is unlikely to be due to chance. Numerous statistical tests are used to determine, with a certain degree of probability, whether significant differences exist in the data under examination.
In the example we will analyze, we will compare the results obtained on two datasets: the first is the well-known Iris flower dataset, while the second is generated artificially. Our goal is to verify that, in the first dataset, there is a strong correlation between features and labels, something that is non-existent in the artificially generated dataset.
Here is the Python code (permutation_tests.py). As always, we will analyze the code line by line:
from sklearn.datasets import load_iris
import numpy as np
from sklearn import tree
from sklearn.model_selection import permutation_test_score
import matplotlib.pyplot as plt
import seaborn as sns
The sklearn.datasets library contains the most widely used datasets in data analysis. Among these, we import the well-known Iris dataset. The numpy library is a Python library that contains numerous functions that can help us manage multidimensional matrices. Furthermore, it contains a large collection of high-level mathematical functions we can use on these matrices. Here, we imported the tree classification model from the sklearn library. Then, we imported the permutation_test_score() function from sklearn.model_selection, which calculates the significance of a cross-validated score with permutations. The matplotlib library is a Python library for producing high-quality graphics. Finally, we imported the seaborn library, which enhances the data visualization tools of the matplotlib module. The seaborn module offers several features we can use to graphically represent our data, including methods that facilitate the construction of statistical graphs with matplotlib.
data = load_iris()
X = data.data
y = data.target
The Iris dataset is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in 1936 as an example of linear discriminant analysis. The dataset contains 50 samples from each of the three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
The following variables are contained:
np.random.seed(0)
X_nc_data = np.random.normal(size=(len(X), 4))
To do this, we used numpy’s random.normal() function. By default, this function draws samples from a normal distribution with a mean equal to 0 and a standard deviation equal to 1. We set the seed for reproducibility.
clf = tree.DecisionTreeClassifier(random_state=1)
Here, DecisionTreeClassifier was chosen. A decision tree algorithm is based on a non-parametric supervised learning method used for classification and regression. The aim is to build a model that predicts the value of a target variable using decision rules inferred from the data features.
p_test_iris = permutation_test_score(
clf, X, y, scoring="accuracy", n_permutations=1000
)
print(f"Score of iris flower classification = {p_test_iris[0]}")
print(f"P_value of permutation test for iris dataset = {p_test_iris[2]}")
p_test_nc_data = permutation_test_score(
clf, X_nc_data, y, scoring="accuracy", n_permutations=1000
)
print(f"Score of no-correletd data classification = {p_test_nc_data[0]}")
print(f"P_value of permutation test for no-correletd dataset = {p_test_nc_data[2]}")
For the test, we used the permutation_test_score() function, which evaluates the significance of a cross-validated score with permutations. This function performs a permutation of the target to resample the data and calculate the empirical p-value concerning the null hypothesis, which assumes the characteristics and objectives are independent. Three results are returned:
The p-value represents the fraction of randomized datasets in which the classifier performed at least as well as on the original data: a small p-value tells us that there is a real dependence between characteristics and targets. A high p-value may be due to a lack of real dependency between characteristics and targets.
Score of iris flower classification = 0.9666666666666668
P_value of permutation test for iris dataset = 0.000999000999000999
1,000 permutations were performed. The accuracy of the decision tree-based classifier is very high, which tells us that the model can predict the class of the type of Iris with excellent performance. Moreover, the p-value is very low, which confirms the real dependence of the target, y, on the features contained in the matrix, X.
Score of no-correletd data classification = 0.2866666666666667
P_value of permutation test for no-correletd dataset = 0.8711288711288712
The classification score is low, indicating that the forecasts do not have good accuracy. In addition, the p-value is high, which indicates that no correlation between the features and the target has been found.
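The p-values reported above can be recomputed from the permutation scores. The sketch below unpacks the three values returned by permutation_test_score() and applies the formula given in the scikit-learn documentation, (C + 1) / (n_permutations + 1), where C is the number of permutations whose score is at least as high as the original one. A small n_permutations and a fixed random_state are assumptions chosen only to keep the example quick and repeatable.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import permutation_test_score

X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(random_state=1)

n_permutations = 30  # kept small for speed (assumption)
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, scoring="accuracy",
    n_permutations=n_permutations, random_state=0
)

# Recompute the empirical p-value from the permutation scores
manual_p = (np.sum(perm_scores >= score) + 1) / (n_permutations + 1)

print(f"Score = {score:.4f}")
print(f"p-value from the function = {p_value:.6f}")
print(f"p-value recomputed = {manual_p:.6f}")
```

This also explains the value 0.000999... shown earlier: with 1,000 permutations, the smallest p-value the function can return is 1/1001, which occurs when no permuted dataset matches the original score.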
pbox1=sns.histplot(data=p_test_iris[1], kde=True)
plt.axvline(p_test_iris[0],linestyle="-", color='r')
plt.axvline(p_test_iris[2],linestyle="--", color='b')
pbox1.set(xlim=(0,1))
plt.show()
The following graph was displayed:
Figure 6.9 – Histogram with a density curve for the permutation test
This graph shows a histogram with the density curve of the distribution of the scores obtained in the 1,000 permutations. A blue dashed vertical line is drawn to indicate the p-value, and a continuous red vertical line is drawn to indicate the accuracy of the classifier. Note that since these are the original data (Iris dataset), characterized by a strong correlation between features and targets, the score of the classifier is much higher than those obtained by permuting the targets: the permutations of the targets cause the correlation between the features and targets to be lost. Furthermore, the p-value is located on the far left of the graph to indicate a low value and therefore a strong correlation between the features and targets.
Let’s see what happens to the randomly generated data:
Figure 6.10 – Histogram with a density curve for the randomly generated data
In this case, being random data, characterized by no correlation between features and targets, the score of the classifier is low and in line with those obtained by permuting the targets. Furthermore, the p-value is on the far right of the graph to indicate a high value and therefore a low correlation between the features and targets.
Now that we’ve analyzed a practical case of permutation testing, let’s learn how to perform data resampling by applying cross-validation.
Cross-validation is a method used in model selection procedures based on the principle of predictive accuracy. A sample is divided into two subsets, of which the first (training set) is used for construction and estimation, while the second (validation set) is used to verify the accuracy of the predictions of the estimated model. Through a synthesis of repeated predictions, a measure of the accuracy of the model is obtained. One cross-validation method is similar to the Jackknife in that it leaves one observation out at a time. In another method, known as k-fold validation, the sample is divided into k subsets and, in turn, each of them is left out as a validation set.
Important note
Cross-validation can be used to estimate the mean squared error (MSE) (or, in general, any measure of precision) of a statistical learning technique to evaluate its performance or select its level of flexibility.
Cross-validation can be used for both regression and classification problems. The three main validation techniques of a simulation model are the validation set approach, leave-one-out cross-validation (LOOCV), and k-fold cross-validation. In the following sections, we will learn about these concepts in more detail.
This technique consists of randomly dividing the available dataset into two parts:
A statistical learning model is adapted to the training data and subsequently used for predicting the data of the validation set.
The measurement of the resulting test error, which is typically the MSE in the case of regression, provides an estimate of the real test error. The validation set is the result of a sampling procedure and therefore different samplings result in different estimates of the test error.
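The dependence of the estimate on the particular split can be seen with a short sketch. It fits a linear regression on synthetic data and repeats the validation set approach with different random splits; the data, the 50/50 split, and the names mse_values and seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (assumption): a linear relation with Gaussian noise
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=2.0, size=100)

mse_values = []
for seed in range(5):
    # A different random split into training and validation sets each time
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    mse_values.append(mean_squared_error(y_val, model.predict(X_val)))

for seed, mse in enumerate(mse_values):
    print(f"Split {seed}: validation MSE = {mse:.3f}")
```

Each split produces a different validation MSE, even though the model and the data are the same.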
This validation technique has various pros and cons. Let’s take a look at a few:
The LOOCV and k-fold cross-validation techniques try to overcome these problems.
LOOCV also divides the observation set into two parts. However, instead of creating two subsets of comparable size, we do the following:
But even if MSE1 is an unbiased estimate of the test error, it is a poor one because it is highly variable. This is because it is based on a single observation (x1, y1).
LOOCV has some advantages over the validation set approach:
LOOCV can be computationally intensive, so for large datasets, it takes a long time to calculate. In the case of linear regression, however, there are direct computational formulas with low computational intensity.
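For ordinary least squares, one such direct formula uses the leverage h_i of each observation (the i-th diagonal element of the hat matrix): the LOOCV estimate equals the mean of ((y_i - ŷ_i) / (1 - h_i))², computed from a single fit on all the data. The sketch below compares this shortcut with an explicit LOOCV loop on synthetic data; the data and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Synthetic data (assumption): a linear relation with Gaussian noise
X = rng.uniform(0, 10, size=(40, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=2.0, size=40)

# Explicit LOOCV: n model fits, each leaving one observation out
sq_errors = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])[0]
    sq_errors.append((y[test_idx][0] - pred) ** 2)
loocv_explicit = np.mean(sq_errors)

# Shortcut: a single fit plus the leverages from the hat matrix
design = np.column_stack([np.ones(len(X)), X])  # intercept column + feature
hat = design @ np.linalg.inv(design.T @ design) @ design.T
leverage = np.diag(hat)
full_model = LinearRegression().fit(X, y)
residuals = y - full_model.predict(X)
loocv_shortcut = np.mean((residuals / (1 - leverage)) ** 2)

print(f"Explicit LOOCV MSE = {loocv_explicit:.6f}")
print(f"Shortcut LOOCV MSE = {loocv_shortcut:.6f}")
```

The two values agree up to floating-point error, but the shortcut needs one fit instead of n.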
In k-fold cross-validation (k-fold CV), the set of observations is randomly divided into k groups, or folds, of approximately equal size. The first fold is considered a validation set and the function is estimated on the remaining k-1 folds. The mean squared error, MSEi, is then calculated on the observations of the fold that’s kept out. This procedure is repeated k times, each time choosing a different fold for validation, thus obtaining k estimates of the test error. The k-fold CV estimate is calculated by averaging these values, as follows:
$$CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i$$
This method has the advantage of being less computationally intensive if k << n. Furthermore, the k-fold CV tends to have less variability than the LOOCV on different-sized datasets, n.
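The k-fold CV estimate described above, the average of the k fold-wise MSE values, can be sketched as follows. The synthetic data and the helper function kfold_cv_estimate() are hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic data (assumption): a linear relation with Gaussian noise
X = rng.uniform(0, 10, size=(60, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=2.0, size=60)

def kfold_cv_estimate(X, y, k, seed=1):
    """Return the mean of the k fold-wise MSE values and the number of fits."""
    kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
    fold_mse = []
    for train_idx, val_idx in kfold.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        fold_mse.append(
            mean_squared_error(y[val_idx], model.predict(X[val_idx]))
        )
    return np.mean(fold_mse), len(fold_mse)

# k = len(X) corresponds to LOOCV and requires n model fits
for k in (2, 5, 10, len(X)):
    estimate, n_fits = kfold_cv_estimate(X, y, k)
    print(f"k = {k:2d}: CV estimate = {estimate:.3f} ({n_fits} model fits)")
```

The number of model fits grows with k, which is why a moderate k is much cheaper than LOOCV on large datasets.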
Choosing k is crucial in k-fold CV. What happens when k changes in cross-validation? Let’s see what an extreme choice of k entails:
In this section, we will look at an example of the application of cross-validation. First, we will create an example dataset that contains simple, easily identifiable values so that we can verify the procedure being performed by the algorithm. Then, we will apply k-fold CV and analyze the results:
import numpy as np
from sklearn.model_selection import KFold
numpy is a Python library that contains numerous functions that help us manage multidimensional matrices. Furthermore, it contains a large collection of high-level mathematical functions we can use on these matrices.
scikit-learn is an open source Python library that provides multiple tools for machine learning. In particular, it contains numerous classification, regression, and clustering algorithms; this includes support vector machines, logistic regression, and much more. Since it was released in 2007, scikit-learn has become one of the most widely used libraries in the field of machine learning, both supervised and unsupervised, thanks to the wide range of tools it offers, but also thanks to its API, which is documented, easy to use, and versatile.
Important note
Application programming interfaces (APIs) are sets of definitions and protocols that application software is created and integrated with. They allow products or services to communicate with other products or services without knowing how they are implemented, thus simplifying app development, and allowing a net saving of time and money. When creating new tools and products or managing existing ones, APIs offer flexibility, simplify design, administration, and use, and provide opportunities for innovation.
The scikit-learn API combines a functional user interface with an optimized implementation of numerous classification and meta-classification algorithms. It also provides a wide variety of data pre-processing, cross-validation, optimization, and model evaluation functions. scikit-learn is particularly popular for academic research since developers can use the tool to experiment with different algorithms by changing only a few lines of code.
StartedData = np.arange(10, 110, 10)
print(StartedData)
Here, we generated a vector containing 10 integers, starting from the value 10 up to 100 with a step equal to 10. To do this, we used the numpy arange() function. This function generates equidistant values within a certain range. Three arguments have been passed, as follows:
The following array was returned:
[ 10 20 30 40 50 60 70 80 90 100]
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scikit-learn’s KFold() function performs k-fold CV by dividing the dataset into k consecutive folds without shuffling by default. Each fold is then used once as validation, while the remaining k - 1 folds form the training set. Three arguments were passed, as follows:
for TrainData, TestData in kfold.split(StartedData):
    print("Train Data :", StartedData[TrainData], "Test Data :", StartedData[TestData])
To do this, we used a loop for the elements generated by the kfold.split() method, which returns the indexes that the dataset is divided into. Then, for each step, which is equal to the number of folds, the elements of the subsets that were drawn are printed.
The following results are returned:
Train Data : [ 10 20 40 50 60 70 80 90] Test Data : [ 30 100]
Train Data : [ 10 20 30 40 60 80 90 100] Test Data : [50 70]
Train Data : [ 20 30 50 60 70 80 90 100] Test Data : [10 40]
Train Data : [ 10 30 40 50 60 70 90 100] Test Data : [20 80]
Train Data : [ 10 20 30 40 50 70 80 100] Test Data : [60 90]
These pairs of data (Train Data, Test Data) will be used in succession to train the model and validate it. This helps to detect overfitting and reduces the dependence of the evaluation on a single split. Each time the model is evaluated, the extracted part of the dataset is used for testing, and the remaining part is used for training.
In this chapter, we learned how to resample a dataset. We analyzed several techniques that approach this problem in different ways. First, we analyzed the basic concepts of sampling and learned about the reasons that push us to use a sample extracted from a population. Then, we examined the pros and cons of this choice. We also analyzed how a resampling algorithm works.
Then, we tackled the first resampling method: the Jackknife method. First, we defined the concepts behind the method and then moved on to the procedure, which allows us to obtain samples from the original population. To put the concepts we learned into practice, we applied Jackknife resampling to a practical case.
Next, we explored the bootstrap method, which builds samples that are unobserved but statistically similar to the observed ones. This is accomplished by resampling the observed series through an extraction procedure with replacement. After defining the method, we worked through an example to highlight the characteristics of the procedure. Furthermore, a comparison between Jackknife and bootstrap was made. Then, we analyzed a practical case of bootstrapping applied to a regression problem.
After analyzing the concepts underlying permutation tests and exploring an example of this test, we concluded this chapter by looking at various cross-validation methods. Our knowledge of the k-fold CV method was deepened through an example.
In the next chapter, we will learn about the basic concepts of various optimization techniques and how to implement them. We will understand the difference between numerical and stochastic optimization techniques, and we will learn how to implement stochastic gradient descent. Then, we will discover how to estimate missing or latent variables and optimize model parameters. Finally, we will discover how to use optimization methods in real-life applications.