Measurement and inferential challenges

Many of the activities that fall under the umbrella term data mining involve measurement, inference, or both. This section details some of the challenges researchers face when attempting to measure difficult social science concepts or to infer general patterns from samples or subpopulations. These two tasks, measurement and inference, are often intertwined in the social sciences. While one can use a ruler to measure height, there is no way to directly measure sentiment or affinity. Instead, we create proxy measures for these concepts and hope to make accurate inferences about the underlying quantities.

Overfitting is a common problem in social science research, especially in Big Data applications. The goal of many analyses is to quantify a relationship accurately. In the effort to do so, many researchers build a model that fits their particular dataset very closely; that is, the model explains a high proportion of the variation in the data at hand. They then erroneously conclude that a model that fits their data well is a high-quality model. What matters, however, is how well the model captures the relationships in the data-generating process generally, not just in the sample at hand. A set of relationships that characterizes a sample (also called training data) but does not fit other data of the same type or source (commonly referred to as test data) is said to be overfit. Though overfitting is most often discussed in small data settings, researchers should be wary of it when using Big Data as well. To avoid overfitting, data miners ought to prefer parsimonious models and cross-validate their models on held-out data or subsets of their initial data. We discuss these methods throughout the book as examples of good practice.
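
As a minimal sketch of this advice, consider the following Python snippet. It is purely illustrative: the synthetic data, the sample sizes, and the use of scikit-learn are our own choices, not anything prescribed above. With many candidate predictors and few observations, the model fits its training data almost perfectly but does far worse on held-out data, and cross-validation gives the more honest estimate.

```python
# Illustrative sketch: in-sample fit versus out-of-sample fit.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 40))              # many candidate predictors ...
y = 2.0 * X[:, 0] + rng.normal(size=80)    # ... but only the first one matters

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

print("R^2 on the training data:", round(model.score(X_train, y_train), 2))
print("R^2 on held-out data:    ", round(model.score(X_test, y_test), 2))
print("Mean 5-fold CV R^2:      ", round(cross_val_score(model, X, y, cv=5).mean(), 2))
```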

Big Data further complicates a pair of issues that pose challenges in smaller datasets as well. The first is technically referred to as a mixture of relationships. In plain language, this means that a dataset does not contain one pattern but several, possibly conflicting, patterns. For instance, suppose a drug has a positive effect on men but a negative effect on women. An imprudent researcher might estimate a single overall effect of the drug and, if roughly half of the data belongs to each gender, find an effect near zero. Here, the two patterns mask each other. This challenge is exacerbated by the large number of possible patterns in data that includes many variables and covers broad swaths of time, space, groups, and individuals. Identifying mixtures is often best done by exploring interactions between variables and by visualizing your data.
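
The following toy simulation (all numbers are invented for illustration) makes the masking concrete: the pooled comparison of treated and untreated units shows almost no difference, while splitting the comparison by gender reveals two substantial effects of opposite sign.

```python
# Toy example of a mixture of relationships masking a pooled effect.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "treated": rng.integers(0, 2, n),
})
effect = np.where(df["female"] == 1, -1.0, 1.0)   # opposite effects by group
df["outcome"] = effect * df["treated"] + rng.normal(scale=0.5, size=n)

# Pooled comparison: treated and untreated means are nearly identical.
print(df.groupby("treated")["outcome"].mean())
# Group-specific comparison: substantial effects of opposite sign.
print(df.groupby(["female", "treated"])["outcome"].mean())
```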

The second concern exacerbated by very large datasets is the recovery of findings that are statistically significant but substantively tiny. Many statistical techniques not only quantify relationships, say, between variables x and y, but also provide measures of the extent to which those relationships are unlikely to be zero. Relationships found to be unlikely to equal zero are said to be statistically significant. However, a relationship can be simultaneously non-zero and trivially small. This is especially true in Big Data applications, because our ability to distinguish effect sizes from zero usually grows with the size of the data. To avoid this pitfall, assess the substantive importance of any findings you generate, not just their statistical robustness. For instance, if you find an increase of 2 percent in consumer sentiment related to your product, would this increase, even if statistically robust, be important? To whom, and why? What if the effect were non-zero but only 1 or 0.5 percent?
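
As a rough sketch of why this happens, the simulation below (the sample size and the half-percentage-point shift are assumptions chosen for illustration) shows that with a million observations per group, a trivially small difference in means yields a vanishingly small p-value.

```python
# With enough data, a substantively tiny difference becomes "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1_000_000
control = rng.normal(loc=0.500, scale=0.2, size=n)   # baseline sentiment score
treated = rng.normal(loc=0.505, scale=0.2, size=n)   # shifted by only 0.005

t, p = stats.ttest_ind(treated, control)
print(f"p-value: {p:.2e}")                                             # tiny
print(f"difference in means: {treated.mean() - control.mean():.4f}")   # ~0.005
```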

The modifiable areal unit problem (MAUP) is a source of statistical bias that comes in two flavors: scale and zonal problems. The issue was discovered by Gehlke and Biehl (1934) and described more completely by Robinson (1950) and Openshaw (1984), who lamented the following:

"The areal units (zonal objects) used in many geographical studies are arbitrary, modifiable, and subject to the whims and fancies of whoever is doing, or did, the aggregating."

Openshaw and Taylor (1979) described how they had constructed all possible groupings of counties in Iowa into larger districts. When considering the correlation between the percentage of Republican voters and the percentage of elderly voters, they could produce a million or so correlation coefficients. A set of 12 districts could be contrived to produce correlations that ranged from -0.97 to +0.99. The modifiable temporal unit problem is the temporal companion of the MAUP: the same difficulty arises from temporal rather than geographic aggregation. To avoid finding spurious relationships when aggregating data, try to aggregate to a natural or substantively meaningful unit. Additionally, try aggregating to several units of varying sizes to check the robustness of your results.
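
The following sketch mimics the scale component of the problem with simulated data; the spatial trend and noise levels are our own assumptions, not Openshaw and Taylor's. Because the elderly share is spatially clustered while much of the variation in the Republican share is local noise, coarser zones average the noise away and the estimated correlation grows noticeably.

```python
# Rough sketch of the scale component of the MAUP on simulated point data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 20_000
df = pd.DataFrame({
    "x_coord": rng.uniform(0, 100, n),
    "y_coord": rng.uniform(0, 100, n),
})
# The elderly share is spatially clustered (it rises with the x coordinate) ...
df["pct_elderly"] = 20 + 0.2 * df["x_coord"] + rng.normal(0, 5, n)
# ... and the Republican share tracks it weakly, with plenty of local noise.
df["pct_republican"] = 0.3 * df["pct_elderly"] + rng.normal(0, 15, n)

print(f"point level: r = {df['pct_elderly'].corr(df['pct_republican']):.2f}")
for cell in (5, 10, 25):                       # progressively coarser zones
    zone = (df["x_coord"] // cell).astype(int).astype(str) + "_" + \
           (df["y_coord"] // cell).astype(int).astype(str)
    agg = df.groupby(zone)[["pct_elderly", "pct_republican"]].mean()
    r = agg["pct_elderly"].corr(agg["pct_republican"])
    print(f"{cell} x {cell} zones: r = {r:.2f}")
```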

Another concern stems from how you choose what data to analyze. When you build your sample based on the value of the variable of interest (that is, the dependent variable), you bias your study in a way that leads to low or even zero explanatory power. A toy example illustrates the problem: suppose you want to study the factors that lead to business success, so you examine 50 successful businesses and find that all 50 have CEOs who drive sports cars. You might conclude that this is a key cause of success. Such a study is obviously flawed because it samples on the dependent variable; to understand the causes of success, you would have to compare successful companies with unsuccessful ones. Though this seems obvious, even academic researchers fall victim to this error. For instance, Robert Pape's (2003) study of suicide terrorism looked only at cases of suicide terrorism in an effort to make claims about its causes.
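
A small simulation (the probabilities are made up) shows why the comparison group matters: a trait that has nothing to do with success looks impressive among successful firms, and only sampling the unsuccessful firms reveals that it is just as common there.

```python
# Toy simulation of sampling on the dependent variable.
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
sports_car = rng.random(n) < 0.8   # most CEOs drive sports cars, successful or not
success = rng.random(n) < 0.1      # success is independent of the car

print("Sports cars among successful firms:  ", round(sports_car[success].mean(), 2))
print("Sports cars among unsuccessful firms:", round(sports_car[~success].mean(), 2))
```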

Self-selection is a constant concern in social science research. We must keep in mind that social data is volunteered by people, and that these people may not be representative of the population in general. For instance, if we pull down geolocated Twitter data about President Obama, we must remember that these tweets come almost entirely from Twitter users (perhaps a young, upper-middle-class group). This sample may not be representative of, say, likely US voters (whose modal member is older and middle class).

When interpreting our findings, we need to be careful not to fall victim to the ecological fallacy: the incorrect assumption that facts about a group apply equally to the members of that group. For example, if a researcher finds that men are, on average, more aggressive drivers than women, this does not imply that all men are aggressive drivers. Conversely, finding no relationship between income and infant mortality at the county or state level does not mean that there is no relationship between these factors at the household level. The ecological fallacy can be considered a special case of the MAUP. In general, we can avoid these fallacies by making inferences only at the level of the data we have: inferences about groups from group-level data and inferences about individuals from individual-level data. More concretely, try to match the level of the data you collect to your research question. For example, if you are studying household-level economic decisions, then attempt to capture household-level data!
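
The simulation below (all numbers assumed) makes the distinction concrete: the correlation between average income and average mortality across groups is nearly perfect, even though income and mortality are unrelated within every group.

```python
# Toy illustration of the ecological fallacy.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
frames = []
for g in range(30):
    group_income = 30 + 2 * g          # richer groups ...
    group_mortality = 25 - 0.5 * g     # ... have lower average mortality
    frames.append(pd.DataFrame({
        "group": g,
        # within a group, individual income is noise around the group mean ...
        "income": group_income + rng.normal(0, 5, 100),
        # ... and individual mortality ignores individual income entirely
        "mortality": group_mortality + rng.normal(0, 5, 100),
    }))
people = pd.concat(frames, ignore_index=True)

groups = people.groupby("group")[["income", "mortality"]].mean()
print("Group-level correlation: ", round(groups["income"].corr(groups["mortality"]), 2))

within = people.groupby("group")[["income", "mortality"]].transform(lambda s: s - s.mean())
print("Within-group correlation:", round(within["income"].corr(within["mortality"]), 2))
```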

As Zeynep Tufekci (2013) astutely noted, social media mining and sentiment analysis often hinge on a plus-one additive property, in which polarity is counted as the cumulative frequency of positive and negative words. However, not all words have equal impact, and some words scale differently than others. Later in this book, we discuss an unsupervised method that outlines this scaling problem and offers a solution.
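
A minimal sketch of the scoring scheme being criticized looks like the following; the tiny lexicon is deliberately made up for illustration. Note that an enthusiastic "love" and a lukewarm "good" each contribute exactly one point, which is precisely the scaling problem.

```python
# "Plus one" additive polarity: count positive words minus negative words.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}

def additive_polarity(text: str) -> int:
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

print(additive_polarity("I love this product, really good"))   # 2
print(additive_polarity("awful service, terrible product"))    # -2
```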

The final pitfall we warn against is dissimilar denominators, which lead to intrinsic heterogeneity. This issue affects researchers who attempt to compare the raw sizes of different phenomena when the better measure would be a rate. For instance, if one tweet is retweeted 50 times and another is retweeted 100 times, at first glance we might be tempted to conclude that the second tweet was more interesting. However, what if the first tweet was seen by only 100 people in total, while the second was seen by 10,000 people? Then we could more accurately say that the first had a 50 percent retweet rate, while the second had only a 1 percent retweet rate. While the example makes clear that the way to avoid dissimilar denominators is to pay attention to them, it is not always easy to obtain the appropriate denominator for a given metric.
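
Translating the example into a few lines of code (using the counts and exposure figures from the example above), the comparison of rates rather than raw counts looks like this:

```python
# Normalize by the appropriate denominator: retweet rates, not raw counts.
tweets = [
    {"id": "A", "retweets": 50,  "impressions": 100},
    {"id": "B", "retweets": 100, "impressions": 10_000},
]

for t in tweets:
    rate = t["retweets"] / t["impressions"]
    print(f"tweet {t['id']}: {t['retweets']} retweets, {rate:.1%} retweet rate")
# Tweet A has fewer retweets but a far higher rate (50% versus 1%).
```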
