CHAPTER 7
Missing Data: Background

7.1. INTRODUCTION

As we discussed in Section 3.3.2, dealing with missing data – a ubiquitous problem – is one of the crucial steps in making data useful at all. In this chapter we will describe the problem of missing data imputation in more general terms. We will present a specific case study that focuses on filling gaps in multivariate financial time series in the next chapter.

Providing a general recipe for tackling missing data is not possible, because the problem arises in practical applications of very different natures. For example, filling gaps in financial time series can be quite different from filling gaps in satellite images or text. Nevertheless, some techniques can be widely reused across different domains, as we will show in this chapter and the next. Techniques to fill missing data are applicable regardless of whether or not a dataset is alternative, so in what follows we will not make such a distinction. We only remark that, in general, we expect to have more missing data and data quality problems in the alternative data space. This is due to the increased variety, velocity, and variability of alternative data compared to more standardized traditional datasets.

Treating missing data is something that must be performed before any further analysis is attempted. A predictive model (e.g. an investment strategy) can then be calibrated on the treated dataset as a second step. We must be careful, though, to understand whether the missing data in the training set was something accidental (e.g. records deleted from the historical database by mistake) or is a recurrent and inescapable characteristic of the data that will reappear in live feeds, hopefully with the same patterns, when the model is later deployed in production. In the latter case, the missing data algorithms must be implemented in production as well. It is also important to understand whether the missing data algorithms we built in the preprocessing stage are applicable in a live environment. This will depend on constraints such as how those algorithms are implemented, the maximum computational time tolerated for the execution of the missing data treatment step, and the like.

However, as we already mentioned in Chapter 5, if missing data is not something accidental in the training set but reappears in production, it could start to appear in a completely different pattern for a variety of reasons. One reason could be a temporary technical glitch that must be fixed. Alternatively, it might be because certain information is no longer collected and hence the associated data feed is interrupted. In the latter case this might call for a complete revision of the algorithms – both for the investment strategy and for the missing data treatment step. Another possibility is that the missing data pattern has changed compared to the training set due to the changing nature of the input data. With market data, one obvious example can be changes in trading hours or the holiday calendar. This calls for a revision and update of the algorithm used to fill the missing data and possibly of the investment strategy as well. A careful analysis of each individual case is necessary to assess the best course of action. Last, non-stationarity (see Section 4.4.2) or regime changes can also impact the data collection and hence the missingness pattern. For example, when consensus estimates are collected, say, for credit default swap prices, they are not published if the dispersion of the analysts' estimates is too large. A disagreement between analysts is more likely to happen in periods of market turmoil, which could thus add different missingness patterns to the data.

7.2. MISSING DATA CLASSIFICATION

Patterns of missingness can appear in very different forms, which can impact the imputation strategy, as we will describe in the following sections. Hence, it is useful to first analyze possible missing mechanisms as well as common patterns.

In the statistical literature one usually considers the data as being generated by a distribution function, $f(Y \mid \theta)$, with unknown parameters $\theta$. The functional form of $f$ may or may not be known. It is then of interest to clarify how the missingness pattern $M$ is generated and how it is related to the observed data – that is, what general form the conditional distribution function $f(M \mid Y, \phi)$ has, where $\phi$ is a collection of unknown parameters. Formally, we can separate the data into observed and missing parts, $Y = (Y_{obs}, Y_{mis})$. This is meant to be understood as follows: there exists a complete dataset $Y$, but we only observe values $Y_{obs}$. The values $Y_{mis}$ are not observed, so usually we would not know them. However, for the following reasoning it is very useful to consider their values and their relation to the missingness patterns as well. In the literature typically the following distinction is made:¹

Missing Completely at Random (MCAR): Missingness patterns do not depend on any observed or non-observed data values:
$$f(M \mid Y, \phi) = f(M \mid \phi) \tag{7.1}$$
Missing at Random (MAR): Missingness patterns depend on observed but not on non-observed data values:
$$f(M \mid Y, \phi) = f(M \mid Y_{obs}, \phi) \tag{7.2}$$

One may find the term MAR confusing, since the missingness pattern $M$ is not random, but rather depends on the observed values. It is, however, commonly used in the literature.
Missing Not at Random (MNAR): Missingness patterns depend on both observed and non-observed data values:
$$f(M \mid Y, \phi) = f(M \mid Y_{obs}, Y_{mis}, \phi) \tag{7.3}$$

An example of MAR is a survey in which income values are missing for respondents above a certain age. An example of MNAR would be a survey in which income values are more likely to be missing if they are below a certain threshold and age (observed) is above a certain value. In other words, respondents leave out income if they are old and earn little. The distinction has the following consequences: MCAR and MAR belong to a class of missingness called ignorable, which makes them suitable for the multiple imputation (MI) approaches that we will describe later. Roughly speaking, the non-observed values can be integrated out in these cases. In contrast, treating MNAR carefully is more difficult since in principle we cannot predict the missing values only from the observed ones. In these situations, extra data collection or additional insights from domain experts can be useful. Formally, one can then introduce suitable priors to deal with the imputation. Some of the MI packages allow for that.
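The three mechanisms can be made concrete with a small simulation of the survey example. The sketch below is illustrative only: the age and income distributions, the thresholds, and the missingness probabilities are all assumptions chosen for the demonstration, not values taken from any real survey.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical survey: age is always observed, income may be missing.
# Both distributions are assumptions made for this illustration.
age = rng.uniform(20, 80, size=n)
income = rng.lognormal(mean=10.5, sigma=0.5, size=n)

# MCAR: every income value is dropped with the same probability.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed variable (age).
mar_mask = (age > 60) & (rng.random(n) < 0.5)

# MNAR: missingness depends on age and on the unobserved income itself.
low_income = income < np.median(income)
mnar_mask = (age > 60) & low_income & (rng.random(n) < 0.8)


def observed_mean(values, mask):
    """Mean of the values that remain observed (mask is True where missing)."""
    return values[~mask].mean()


true_mean = income.mean()
ratios = {}
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    ratios[name] = observed_mean(income, mask) / true_mean
    print(f"{name}: missing {mask.mean():.1%}, "
          f"observed mean / true mean = {ratios[name]:.3f}")
```

Because income is drawn independently of age in this toy setup, the MCAR and MAR masks leave the mean of the observed incomes roughly unchanged. The MNAR mask removes low incomes preferentially, so the observed mean overstates the true one, and nothing in the observed data alone reveals by how much.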

7.2.1. Missing Data Treatments

In general, there are three methods to deal with missing data: (1) deletion, (2) replacement, and (3) predictive imputation. The first two are very simple and rudimentary, but they could be used in cases where the impact of their application is small or where building a predictive imputation model would be too costly. We describe the three methods in the following.

7.2.1.1. Deletion

Deletion is the simplest method. It consists of simply removing records. This can be done listwise or pairwise. Listwise deletion means that any record in a dataset is deleted from an analysis if there is missing data on any variable taken into consideration in the analysis. In certain cases, this can be a viable option, but more often this constitutes a very costly procedure because a lot of data is discarded. Dropping records reduces the sample size and hence the statistical power of the results unless the remaining sample is still substantial. Moreover, this approach only works if the data is MCAR. If it is not, incomplete records that are dropped will differ from the complete cases still in the sample. Then the remaining selected random sample is no longer reflective of the entire population. This could lead to biased results. In some cases, listwise deletion is entirely impractical; for instance, for the credit default swap data discussed in the next chapter, we would lose a lot of valuable data.² Therefore, listwise deletion nowadays is usually dismissed in favor of more sophisticated techniques.

In pairwise deletion, missing data is simply ignored and only the non-missing variables are considered for each record. Pairwise deletion allows the use of more of the data. However, each computed statistic may be based on a different subset of cases and this could cause problems. For instance, using pairwise deletion may not yield a proper positive semidefinite correlation matrix.
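Both deletion variants can be sketched on a small constructed example. The data below is artificial and deliberately extreme: each pair of variables is observed on a different block of rows, so that no record is complete and the pairwise correlations contradict each other. The sketch relies on pandas' `DataFrame.corr`, which excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

nan = np.nan

# Artificial data: x and y are observed together on rows 0-3, y and z on
# rows 4-7, x and z on rows 8-11. No row has all three variables.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, nan, nan, nan, nan, 1, 2, 3, 4],
    "y": [1, 2, 3, 4, 1, 2, 3, 4, nan, nan, nan, nan],
    "z": [nan, nan, nan, nan, 1, 2, 3, 4, 4, 3, 2, 1],
})

# Listwise deletion: keep only records with no missing variable.
complete_cases = df.dropna()
print("records left after listwise deletion:", len(complete_cases))

# Pairwise deletion: each correlation uses the rows where both variables
# are observed, so every entry is based on a different subset of cases.
pairwise_corr = df.corr()
print(pairwise_corr)

# A valid correlation matrix has no negative eigenvalues.
min_eigenvalue = np.linalg.eigvalsh(pairwise_corr.to_numpy()).min()
print("smallest eigenvalue:", min_eigenvalue)
```

Here listwise deletion discards every record, while pairwise deletion reports corr(x, y) = 1, corr(y, z) = 1 and corr(x, z) = -1, a combination that no single dataset can satisfy. The negative eigenvalue confirms that the resulting matrix is not positive semidefinite.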

More flexible and powerful strategies are ones where we predict the missing data from the observed data. Generally, one can distinguish deterministic from stochastic approaches for data imputation.

7.2.1.2. Replacement

A basic deterministic approach is to impute missing values for a particular feature by a simple guess, such as the mean of the observed values of this feature or the majority value (mode). This can be a successful strategy if the missing fraction is very small. There are, however, two problems with this approach: (1) mean or mode imputation can be inaccurate, and (2) as discussed extensively in the literature (see Little & Rubin, 2019; Schafer, 1997), this simple imputation technique alters the statistical properties of the data. For instance, the variance of a variable is decreased through mean imputation. For missing values in a time series, we also need to be careful not to use a mean that is computed using future values, and only use a mean computed on historical values.
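The two caveats about mean imputation, the reduced variance and the look-ahead problem in time series, can be illustrated as follows. This is a minimal sketch on a made-up six-point series. The expanding mean is one possible way to restrict the calculation to historical values, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up time series with two gaps.
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Global mean imputation: every gap receives the mean of all observed
# values, including values that lie in the future of the gap.
global_filled = s.fillna(s.mean())

# Historical mean imputation: the expanding mean at time t uses only the
# observed values up to t, so each gap is filled without look-ahead.
historical_mean = s.expanding(min_periods=1).mean()
causal_filled = s.fillna(historical_mean)

print("variance of observed values:  ", s.var())
print("variance after global filling:", global_filled.var())
print("global fill:    ", global_filled.tolist())
print("historical fill:", causal_filled.tolist())
```

The imputed points sit exactly at the mean and add no spread, so the sample variance of the filled series is lower than that of the observed values. Note that a gap at the very start of the series has no history and would remain unfilled under the historical variant.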

7.2.1.3. Predictive Imputation

To overcome the limitations of the simpleminded approaches, like mean imputation, a statistical framework has emerged over the last 30 years, which is termed multiple imputation (MI). The general idea of this framework is to deduce joint distribution functions from which the imputed data can be sampled. The data imputation is then nondeterministic, and multiple imputation sets can be generated. For predictive analytics on a completed dataset, statistics for the predicted quantities can be computed. Hence, the uncertainty about the imputation can be properly accounted for. Moreover, these imputation techniques ensure that statistical properties of the data, such as the underlying distribution, mean, and variance are not altered by the imputation.
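The idea of generating several imputed datasets and computing statistics across them can be sketched with a simplified example. The code below uses stochastic regression imputation repeated m times on simulated data. It is a simplification under stated assumptions: the data-generating process, the missingness mechanism, and m = 20 are invented for illustration, and a full MI procedure would also draw the regression parameters from their posterior distribution rather than fixing them at their estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated data (an assumption for illustration): y depends linearly on x.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# MAR mechanism: y is more likely to be missing when the observed x is large.
miss = rng.random(n) < 1.0 / (1.0 + np.exp(-x))
obs = ~miss
y_obs = np.where(miss, np.nan, y)

# Fit a linear model of y on x using the complete pairs only.
slope, intercept = np.polyfit(x[obs], y_obs[obs], 1)
residuals = y_obs[obs] - (slope * x[obs] + intercept)
resid_sd = residuals.std(ddof=2)

# Draw m imputed datasets: prediction plus random noise for each gap.
m = 20
estimates = []
for _ in range(m):
    y_imp = y_obs.copy()
    noise = rng.normal(scale=resid_sd, size=miss.sum())
    y_imp[miss] = slope * x[miss] + intercept + noise
    estimates.append(y_imp.mean())  # statistic of interest: the mean of y

pooled_mean = np.mean(estimates)          # pooled point estimate
between_var = np.var(estimates, ddof=1)   # spread caused by the imputation

print("complete-case mean:", np.nanmean(y_obs))
print("pooled MI mean:    ", pooled_mean)
print("between-imputation variance:", between_var)
print("mean of full data (unknown in practice):", y.mean())
```

The between-imputation variance is what a single deterministic fill cannot provide: it quantifies how much the estimate moves when the missing values are redrawn. In this setup the complete-case mean is biased downward because large values of x, and hence of y, are missing more often.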

Multiple imputation will also be one of the approaches we use in the case study examined in the next chapter. Before that, we review the literature on some missing data treatments that fall into the predictive imputation class.

7.3. LITERATURE OVERVIEW OF MISSING DATA TREATMENTS³

According³ to Wang (2010), inappropriate handling of missing data can introduce bias, leading to misleading conclusions and limited generalizability of research findings. Barnard (1999) argues that the most frequent types of associated problems with the lack of missing data treatment are: (1) loss of efficiency; (2) complications in handling and analyzing the data; and (3) bias resulting from differences between missing and complete data. This points to the fact that treating missing data is of crucial importance to practical applications.

In what follows we will review some of the important papers, in our view, on missing data imputation. We will substantiate the fact that – as expected by virtue of the no-free-lunch theorem – we cannot have a best-performing imputation algorithm for every problem. Instead, the “best” algorithm must be chosen for the specific problem we are examining.

7.3.1. Luengo et al. (2012)

The first paper we will summarize is that of Luengo et al. (2012), which compares the effects of 14 different imputation techniques on data on which 23 classifiers are subsequently trained. The classifiers fall into these three categories:

Rule Induction Learning. This group refers to algorithms that infer rules using different strategies. Those methods that produce a set of more or less interpretable rules belong in this category. These rules include discrete and/or continuous features, which are treated by each method depending on their definition and representation. This type of classification method has been the most used in cases of imperfect data.
Approximate Models. This group includes artificial neural networks, support vector machines, and statistical learning. Luengo et al. include in this group those methods that act like a black box. Hence, those methods that do not produce an interpretable model fall under this category. Although the naïve Bayes method is not a completely black box method, the paper considers that this is the most appropriate category for it.
Lazy Learning. This group includes methods that are not based on any model but use the training data to perform the classification directly. This process implies the presence of measures of similarity of some kind. Thus, all the methods that use a similarity function to relate the inputs to the training set are considered as belonging to this category.

The classification methods falling into the rule induction learning group are C4.5 (C4.5); Ripper (Ripper); CN2 (CN2); AQ-15 (AQ); PART (PART); Slipper (Slipper); scalable rule induction (SRI); Rule induction two in one (Ritio); and Rule extraction system version 6 (Rule-6). The classification methods falling into the approximate models group are multilayer perceptron (MLP); C-SVM (C-SVM); ν-SVM (ν-SVM); sequential minimal optimization (SMO); radial basis function network (RBFN); RBFN decremental (RBFND); RBFN incremental (RBFNI); logistic (LOG); naïve Bayes (NB); and learning vector quantization (LVQ). The classification methods falling into the lazy learning group are 1-NN (1-NN); 3-NN (3-NN), locally weighted learning (LWL), and lazy learning of Bayesian rules (LBR).

Finally, the imputation techniques they employ are do not impute (DNI), case deletion or ignore missing (IM), global most common/average (MC), concept most common/average (CMC), k-nearest neighbor (KNNI), weighted k-NN (WKNNI), k-means clustering imputation (KMI), fuzzy k-means clustering (FKMI), support vector machines (SVMI), event covering (EC), regularized expectation-maximization (EM), singular value decomposition imputation (SVDI), Bayesian principal component analysis (BPCA), and local least squares imputation (LLSI).

They first apply each imputation technique before applying each classification method to each of the 21 (imputed) datasets. Each imputer-classifier combination is then given a rank on how it performed over the given dataset. The Wilcoxon signed rank test is then used to assign each imputer-classifier a single rank,⁴ which can be seen in Figure 7.1. The lower the value of the rank, the better that imputation technique performs in combination with that classifier.

7.3.1.1. Induction Learning Methods

Luengo et al. conclude that, for the rule induction learning classifiers, the imputation methods FKMI, SVMI, and EC perform best, as can be seen in Figure 7.2. These three imputation methods are, therefore, the most suitable for this type of classifier. Furthermore, both the FKMI and EC methods were also considered among the best overall.

7.3.1.2. Approximate Models

In the case of approximate models, differences between imputation methods are more evident. One can clearly select the EC imputation technique as the best solution (see Figure 7.3), as seen by its average rank of 4.75, almost 1 lower than the next nearest technique, KMI, which stands as the second best with an average rank of 5.65. Next, we see FKMI with an average rank of 6.20. In this family of classification methods, EC is, therefore, the superior imputation technique.

7.3.1.3. Lazy Learning Methods

For this set of methods (Figure 7.4) Luengo et al. find that MC is the best imputation technique with an average rank of 3.63, followed by CMC with an average ranking of 4.38. Only the FKMI method can be compared with the MC and CMC methods with an average rank of 4.75, with all other techniques having an average rank at or above 6.25. Once again, the DNI and IM methods obtain low rankings, with DNI coming 13th of 14, with only the BPCA method performing worse.

	RBFN	RBFND	RBFNI	C4.5	1-NN	LOG	LVQ	MLP	NB	ν-SVM	C-SVM	Ripper
IM	9	6.5	4.5	5	5	6	3.5	13	12	10	5.5	8.5
EC	1	1	1	2.5	9.5	3	7	8.5	10	13	1	8.5
KNNI	5	6.5	10.5	9	2.5	9	7	11	6.5	8	5.5	2.5
WKNNI	13	6.5	4.5	11	4	10	10	4.5	6.5	4.5	5.5	2.5
KMI	3.5	2	7	5	12	3	11	3	4.5	8	5.5	2.5
FKMI	12	6.5	10.5	7.5	6	3	1.5	4.5	11	4.5	5.5	2.5
SVMI	2	11.5	2.5	1	9.5	7.5	3.5	1.5	13	8	11	5.5
EM	3.5	6.5	13	13	11	12	12.5	10	4.5	4.5	10	12
SVDI	9	6.5	7	11	13	11	12.5	8.5	3	11.5	12	11
BPCA	14	14	14	14	14	13	7	14	2	2	13	13
LLSI	6	6.5	10.5	11	7.5	7.5	7	6.5	9	4.5	5.5	5.5
MC	9	6.5	10.5	7.5	7.5	3	7	6.5	8	11.5	5.5	8.5
CMC	9	13	2.5	5	1	3	1.5	1.5	14	14	5.5	8.5
DNI	9	11.5	7	2.5	2.5	14	14	12	1	1	14	14

	PART	Slipper	3-NN	AQ	CN2	SMO	LBR	LWL	SRI	Ritio	Rule-6	Avg.	RANKS
IM	1	4	11	6.5	10	5.5	5	8	6.5	6	5	6.83	7
EC	6.5	1	13	6.5	5.5	2	9	8	6.5	6	1	5.7	2
KNNI	6.5	11	5.5	11	5.5	5.5	9	8	11.5	11	11	7.76	10
WKNNI	6.5	7	5.5	6.5	1	5.5	9	8	11.5	6	11	6.96	8
KMI	6.5	3	5.5	6.5	5.5	9	9	2.5	9.5	12	7.5	6.24	5
FKMI	6.5	10	1.5	2	5.5	3	9	2.5	1	2	3	5.26	1
SVMI	6.5	7	9	1	5.5	9	3	8	6.5	6	2	6.09	3
EM	6.5	7	5.5	12	13	11.5	9	2.5	3	6	4	8.37	11
SVDI	6.5	12	12	10	12	11.5	1	12	9.5	10	11	9.72	12
BPCA	13	7	14	13	14	13	13	13	13	13	13	11.87	14
LLSI	6.5	7	5.5	6.5	11	9	9	8	3	6	7.5	7.22	9
MC	6.5	2	1.5	6.5	5.5	5.5	3	2.5	3	6	7.5	6.11	4
CMC	12	13	5.5	3	5.5	1	3	8	6.5	1	7.5	6.28	6
DNI	14	14	10	14	5.5	14	14	14	14	14	14	10.61	13

FIGURE 7.1 Average rank for all the classifiers. Column “Avg.” is the average of all ranks for a given imputation technique.
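The group averages quoted in Section 7.3.1.2 can be recomputed from the ranks in Figure 7.1. The sketch below copies the ranks of three imputation techniques for the ten classifiers of the approximate models group and averages them. The variable names are ours, and "nu-SVM" stands for ν-SVM.

```python
# Classifiers of the approximate models group, in the order used below.
approx_classifiers = ["RBFN", "RBFND", "RBFNI", "LOG", "LVQ",
                      "MLP", "NB", "nu-SVM", "C-SVM", "SMO"]

# Ranks copied from Figure 7.1 for three imputation techniques.
ranks = {
    "EC":   [1, 1, 1, 3, 7, 8.5, 10, 13, 1, 2],
    "KMI":  [3.5, 2, 7, 3, 11, 3, 4.5, 8, 5.5, 9],
    "FKMI": [12, 6.5, 10.5, 3, 1.5, 4.5, 11, 4.5, 5.5, 3],
}

# Average rank of each technique within the group (lower is better).
group_avg = {name: sum(values) / len(values) for name, values in ranks.items()}

for name, avg in sorted(group_avg.items(), key=lambda item: item[1]):
    print(f"{name}: {avg:.2f}")

best = min(group_avg, key=group_avg.get)
```

The results reproduce the figures given in the text: 4.75 for EC, 5.65 for KMI, and 6.20 for FKMI. The same calculation over the four lazy learning classifiers (1-NN, 3-NN, LBR, LWL) yields the averages quoted in Section 7.3.1.3.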