Chapter 5. Data assessment: poking and prodding

This chapter covers

  • Descriptive statistics and other techniques for learning about your data
  • Checking assumptions you have about your data and its contents
  • Sifting through your data for examples of things you want to find
  • Performing quick, rough analyses to gain insight before spending a lot of time on software or product development

Figure 5.1 shows where we are in the data science process: assessing the data available and the progress we’ve made so far. In previous chapters we’ve searched for, captured, and wrangled data. Most likely, you’ve learned a lot along the way, but you’re still not ready to throw the data at the problem and hope that questions get answered. First, you have to learn as much as you can about what you have: its contents, scope, and limitations, among other features.

Figure 5.1. The fourth and final step of the preparation phase of the data science process: assessing available data and progress so far

It can be tempting to start developing a data-centric product or sophisticated statistical methods as soon as possible, but the benefits of getting to know your data are well worth the sacrifice of a little time and effort. If you know more about your data—and if you maintain awareness about it and how you might analyze it—you’ll make more informed decisions at every step throughout your data science project and will reap the benefits later.

5.1. Example: the Enron email data set

In my first job at a software company—after years of research-oriented, academic data science—I was helping to build software that would analyze communication patterns of employees of large, heavily regulated organizations in order to detect anomalous or problematic behavior. My employer, a Baltimore startup, developed software that helps make sense of massive amounts of employee data, which in many cases must be archived for a number of years according to current law and often contains evidence of wrongdoing that can be useful in investigations of known infractions as well as in the detection of yet-unknown ones. Two good examples of potential customers are compliance officers at large financial institutions and security departments at government agencies. Both of these have the express responsibility to prevent anyone from divulging or mishandling privileged or secret information. The monitoring of employees’ use of internal networks is often mandated or highly recommended by regulating agencies. Needless to say, we needed to do a thorough statistical analysis of employee communications and other activity while still being extremely careful regarding points of ethics and privacy.

But privacy was not much of a concern for one of the first data sets we used for demonstration purposes. The set of emails that were collected by investigators after the Enron scandal in the early 2000s is now a matter of public record and is well documented and studied by researchers (www.cs.cmu.edu/~enron). Because it’s one of the most comprehensive and relevant public data sets available, we wanted to use the Enron emails to test and demonstrate the capabilities of our software.

Several versions of the Enron data set are available, including text versions in CSV format as well as the original proprietary format, PST, which can be generated by Microsoft Outlook. Chapter 4 covered the basics of data wrangling, and all of the problems and warnings described there certainly applied here. Depending on which version of the data set we started with, various preprocessing and wrangling steps may have already been done for us. Mostly, this was a good thing, but we always had to be wary of mistakes or non-standard choices that may have been made before the data got to us.

For this reason, and innumerable others, we needed to treat our data set like an unfamiliar beast. As with a newly discovered animal species, what we thought we had might not be what we had. Our initial assumptions might not have been true, and even if they were true, within a (figurative) species, there could have been tremendous differences from one individual to another. Likewise, even if you’re confident that your data set contains what you think it contains, the data itself surely varies from one data point to another. Without a preliminary assessment, you may run into problems with outliers, biases, precision, specificity, or any number of other inherent aspects of the data. In order to uncover these and get to know the data better, the first step of post-wrangling data analysis is to calculate some descriptive statistics.

5.2. Descriptive statistics

Descriptive statistics are what you might think:

  • Descriptions of a data set
  • Summaries of a data set
  • Maximum values
  • Minimum values
  • Average values
  • A list of possible values
  • A range of time covered by the data set
  • And much more

Those are examples; for a definition, let’s look at one from Wikipedia:

Descriptive statistics is the discipline of quantitatively describing the main features of a collection of information, or the quantitative description itself.

Descriptive statistics is both a set of techniques and the description of the data sets that are produced using those techniques.

It’s often hard to discuss descriptive statistics without mentioning inferential statistics. Inferential statistics is the practice of using the data you have to deduce—or infer—knowledge or quantities of which you don’t have direct measurements or data. For example, surveying 1000 voters in a political election and then attempting to predict how the general voting population (presumably far larger than 1000 individuals) will vote uses inferential statistics. Descriptive statistics concerns itself with only the data you have, namely the 1000 survey responses. In this example, the generalization step from sample to population separates the two concepts.

With respect to a data set, you can say the following:

  • Descriptive statistics asks, “What do I have?”
  • Inferential statistics asks, “What can I conclude?”

Although descriptive and inferential statistics can be spoken of as two different techniques, the border between them is often blurry. In the case of election surveys, as in many others, you would have to perform descriptive statistics on the 1000 data points in order to infer anything about the rest of the voting populace that wasn’t surveyed, and it isn’t always clear where the description stops and where the inference starts.

I think most statisticians and businesspeople alike would agree that it takes inferential statistics to draw most of the cool conclusions: when the world’s population will peak and then start to decline, how fast a viral epidemic will spread, when the stock market will go up, whether people on Twitter have generally positive or negative sentiment about a topic, and so on. But descriptive statistics plays an incredibly important role in making these conclusions possible. It pays to know the data you have and what it can do for you.

5.2.1. Stay close to the data

I mentioned staying close to the data in chapter 1 as well, but it’s certainly worth repeating and is perhaps more important to mention here. The purpose of calculating descriptive statistics at this stage in a data science project is to learn about your data set so that you understand its capabilities and limitations; trying to do anything but learn about your data at this point would be a mistake. Complex statistical techniques such as those in machine learning, predictive analytics, and probabilistic modeling, for example, are completely out of the question for the moment.

Some people would argue with me, saying that it’s OK to dive right in and apply some machine learning (for example) to your data, because you’ll get to know the data as you go along, and if you’re astute, you’ll recognize any problems as they come and then remedy them. I wholeheartedly disagree. Complex methods like most of those used in machine learning today are not easily dissected or even understood. Random forests, neural networks, and support vector machines, among others, may be understood in theory, but each of these has so many moving parts that one person (or a team) can’t possibly comprehend all of the specific pieces and values involved in obtaining a single result. Therefore, when you notice an incorrect result, even one that’s grossly incorrect, it’s not straightforward to extract from a complex model exactly which pieces contributed to this egregious error. More importantly, complex models that involve some randomness (again, most machine learning techniques) may not reproduce a specific error if you rerun the algorithm. Such unpredictability in sophisticated statistical methods also suggests that you should get to know your own data before you allow any random processes or black boxes to draw conclusions for you.

The definition I use for close to the data is this:

You are close to the data if you are computing statistics that you are able to verify manually or that you can replicate exactly using another statistical tool.

In this phase of the project, you should calculate descriptive statistics that you can verify easily by some other means, and in some cases you should do that verification to be sure. Because you’re doing simple calculations and double-checking them, you can be nearly 100% certain that the results are correct. The set of close-to-the-data descriptive statistics that you accumulate becomes a sort of inviolable canon of knowledge about your data set that will be of great use later. If you ever run across results that contradict these or seem unlikely in relation to these, you can be nearly certain that you’ve made a significant error at some point in producing those results. In addition, knowing which results within the canon are contradicted can be hugely informative in diagnosing the error.

Staying close to the data ensures that you can be incredibly certain about these preliminary results, and keeping a set of good descriptive statistics with you throughout your project provides you an easy reference to compare with subsequent more relevant but more abstruse results that are the real focus of your project.

5.2.2. Common descriptive statistics

Examples of helpful and informative descriptive statistics methods include but are not limited to mean, variance, median, sum, histogram, scatter plot, tabular summaries, quantiles, maximum, minimum, and cumulative distributions. Any or all of these might be helpful in your next project, and it’s largely a matter of both preference and relevance when deciding which ones you might calculate in order to serve your goals.

In the Enron email data set, here’s the first line of statistical questioning that occurs to me:

  1. How many people are there?
  2. How many messages are there?
  3. How many messages did individual people write?

A short paper called “Introducing the Enron Corpus” (2004) by Bryan Klimt and Yiming Yang gives a good summary that answers these questions.

In the Enron email corpus, there are 619,446 messages from the accounts of 158 employees. But by removing mass emails and duplicate emails appearing in multiple accounts, Klimt and Yang reduced the data to 200,399 emails in a clean version of the data set. In the clean version, the average user sent 757 emails. These are useful facts to know. Without them, it may not be obvious that there is a problem if, later, a statistical model suggests that most people send dozens of emails per day. Because this data set spans two years, roughly speaking (another descriptive statistic!), we know that one or two emails per day is much more typical.
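
If your copy of the corpus has already been wrangled into a flat table, counts like these take only a few lines to reproduce. The following is a minimal sketch using pandas; the file name and column names are hypothetical stand-ins for whatever your own wrangling step produced.

import pandas as pd

# Hypothetical flat file from the wrangling step: one row per message,
# with at least a 'sender' column identifying the account that sent it.
emails = pd.read_csv("enron_emails.csv")

total_messages = len(emails)                  # how many messages are there?
total_senders = emails["sender"].nunique()    # how many distinct senders?
per_sender = emails["sender"].value_counts()  # how many messages did each person write?

print(total_messages, total_senders)
print(per_sender.describe())                  # mean, quartiles, max, and min per sender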

Speaking of ranges of time, in the Enron data set and others, I’ve seen dates reported incorrectly. Because of the way dates are formatted, a corrupted file can easily cause dates to be reported as 1900 or 1970 or another year, when that’s obviously erroneous. Enron didn’t exist until much later, and for that matter neither did email as we know it. If you want to use time as an important variable in your later analyses, having a few emails dated a century before the rest may be a big problem. It would have been helpful to recognize these issues while wrangling the data, as described in chapter 4, but they may have slipped past unnoticed, and some descriptive statistics can help you catch them now.

For example, let’s say you’re interested in analyzing how many emails are sent from year to year, but you skipped descriptive statistics, and you jumped straight into writing a statistical application that begins in the year of the earliest email (circa 1900) and ends at the latest (circa 2003). Your results would be heavily biased by the many years in the middle of that range that contained no messages. You might catch this error early and not lose much time in the process, but for larger, more complex analyses, you might not be so lucky. Comparing the real date range with your presumed one beforehand could have uncovered the erroneous dates more quickly. In today’s big data world, it wouldn’t be uncommon for someone to write an application that does this—analyzes quantities of emails over time—for billions of emails, which would probably require using a computing cluster and would cost hundreds or thousands of dollars per run. Not doing your homework—descriptive statistics—in that case could be costly.
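
A date-range sanity check along these lines is cheap to run before any larger analysis. Here’s a rough sketch, again assuming a hypothetical CSV with a 'date' column; the cutoff years are arbitrary choices for what counts as plausible.

import pandas as pd

emails = pd.read_csv("enron_emails.csv")

# Parse dates, coercing anything unparseable to NaT instead of failing outright.
dates = pd.to_datetime(emails["date"], errors="coerce", utc=True)

print("earliest:", dates.min())
print("latest:  ", dates.max())
print("unparseable:", dates.isna().sum())

# Count timestamps outside the window in which Enron could plausibly have used email.
lo = pd.Timestamp("1995-01-01", tz="UTC")
hi = pd.Timestamp("2003-12-31", tz="UTC")
print("out-of-range:", ((dates < lo) | (dates > hi)).sum())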

5.2.3. Choosing specific statistics to calculate

In the paper describing the Enron corpus, Klimt and Yang make it clear that they’re primarily focused on classifying the emails into topics or other groups. In their case, dates and times are less important than subjects, term frequencies, and email threads. Their choice of descriptive statistics reflects that.

We were concerned mainly with users’ behavior over time, and so we calculated descriptive statistics such as these:

  • Total number of emails sent per month
  • Most prolific email senders and the number they sent
  • Number of emails sent each month by the most prolific senders
  • Most prolific email recipients and the number they received
  • Most prolific sender-recipient pairs and the number of emails they exchanged
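
Under the same hypothetical table layout as before (one row per sender-recipient pair, with 'sender', 'recipient', and 'date' columns), a rough sketch of a few of these statistics might look like this:

import pandas as pd

emails = pd.read_csv("enron_emails.csv", parse_dates=["date"])

# Total number of emails sent per month
per_month = emails.set_index("date").resample("MS").size()

# Most prolific senders and the number they sent
top_senders = emails["sender"].value_counts().head(10)

# Most prolific sender-recipient pairs (directional: from sender to recipient)
top_pairs = (emails.groupby(["sender", "recipient"]).size()
                   .sort_values(ascending=False).head(10))

print(per_month.tail(12))
print(top_senders)
print(top_pairs)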

It’s not always obvious which statistics would be the best choice for your particular project, but you can ask yourself a few questions that will help lead you to useful choices:

  1. How much data is there, and how much of it is relevant?
  2. What are the one or two most relevant aspects of the data with respect to the project?
  3. Considering the most relevant aspects, what do typical data points look like?
  4. Considering the most relevant aspects, what do the most extreme data points look like?

Question 1 is usually fairly straightforward to answer. For the Enron data set, you find the total number of emails, or the total number of email accounts, both of which I’ve mentioned already. Or if the project concerns only a subset of the data—for example, emails involving Ken Lay, the CEO who was later convicted of multiple counts of fraud, or maybe only emails sent in 2001—then you should find the totals for that subset as well. Is there enough relevant data for you to accomplish the goals of the project? Always be wary that prior data wrangling may not have been perfect, and so obtaining the precise subset may not be as easy as it seems. Errors in name or date formatting, among other things, could cause problems.

Question 2 concerns the focus of the project. If you’re studying the rise and fall of Enron as an organization, then time is a relevant aspect of the data. If you’re looking mainly at email classification, as Klimt and Yang were, then email folders are important, as are the subject and body text from the emails. Word counts or other language features may be informative at this point. Think about your project, look at a few individual data points, and ask yourself, “Which part do I care about the most?”

For question 3, take the answer to question 2 and calculate some summary statistics on the values corresponding to those aspects. If a time variable is important to you, then calculate a mean, median, and maybe some quantiles of all of the email timestamps in the data set (don’t forget to convert timestamps to a numerical value—for example, Unix time—and then back again for a sanity check). You might also calculate the number of emails sent each week, month, or year. If email classification is important to you, then add up the emails that appear in each of the folders and find the folders that contain the most emails across all accounts. Or look at how different people have different numbers and percentages of their emails in different folders. Do any of these results surprise you? Given your project goals, can you foresee any problems occurring in your analysis?
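
For instance, the timestamp sanity check described above might look something like this sketch, with the same hypothetical file and column names as before:

import pandas as pd

emails = pd.read_csv("enron_emails.csv", parse_dates=["date"])

# Convert timestamps to a numeric value (Unix time, in seconds) ...
unix_seconds = (emails["date"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")

# ... compute the mean, median, and quartiles of the numeric values ...
stats = unix_seconds.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# ... and convert back to dates for the sanity check.
print(pd.to_datetime(stats, unit="s"))
print(pd.to_datetime(unix_seconds.mean(), unit="s"))

# Number of emails sent each year, another quick summary of the time variable.
print(emails["date"].dt.year.value_counts().sort_index())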

Question 4 is similar to question 3, but instead of looking at typical values, it looks at extreme values such as the maximum and minimum. The earliest and latest timestamps, as well as some extreme quantiles such as 0.01 and 0.99, can be useful. For email classifications, you should look at the folders containing the most emails as well as folders that contain few—it’s likely that many folders contain only one or a few emails and are mostly useless for analysis. Perhaps for later stages of the project you would consider excluding these. When looking at extreme values, are there any values so high or so low that they don’t make sense? How many values are outside a reasonable range? For categorical or other non-numeric data, what are the most common and least common categories? Are all of these meaningful and useful to subsequent analysis?

5.2.4. Make tables or graphs where appropriate

Beyond calculating the raw values of these statistics, you might find value in formatting some of the data as tables, such as the quantities of emails for the various categories of most prolific, or as graphs, such as the timelines of monthly quantities of email sent and received.

Tables and graphs can convey information more thoroughly and more quickly at times than pure text. Producing tables and graphs and keeping them for reference throughout your project is a good idea.

Figure 5.2 shows two plots excerpted from Klimt and Yang’s paper. They’re graphical representations of descriptive statistics. The first shows a cumulative distribution of users versus the number of messages they sent within the data set. The second plots the number of messages in Enron employees’ inboxes versus the number of folders that were present in those email accounts. If you’re interested in the number of emails sent by the various employees or in how the employees used folders, it’s a good idea to keep these graphs handy and compare them with all subsequent results. They’ll either help verify that your results are reasonable or, if the results aren’t reasonable, help you diagnose the problem.

Figure 5.2. Two graphs redrawn from Klimt and Yang’s “The Enron Corpus: A New Dataset for Email Classification Research” (published by Springer in Machine Learning: ECML 2004).

The types of descriptive graphs or tables that are appropriate for your project might be different from these, but they similarly should address the aspects of the data that are relevant to your goals and the questions you hope to answer.

5.3. Check assumptions about the data

Whether we like to admit it or not, we all make assumptions about data sets. As implied in the previous section, we might assume that our data is contained within a particular time period. Or we might assume that the names of the folders that contain emails are appropriate descriptors of the topics or classifications of those emails. These assumptions about the data can be expectations or hopes, conscious or subconscious.

5.3.1. Assumptions about the contents of the data

Let’s consider the element of time in the Enron data. I certainly assumed, when I began looking at the data, that the emails would span the few years between the advent of email in the late 1990s and the demise of the firm in the early 2000s. I would have been mistaken, because of the potential errors or corruption in the date formatting that I’ve already mentioned. In practice, I saw dates far outside the range that I assumed as well as some other dates that were questionable. My assumption about the date range certainly needed to be checked.

If you want to use the folder names in the email accounts to inform you about the contents of emails within, there’s an implied assumption that these folder names are indeed informative. You definitely would want to check this, which would likely involve a fair amount of manual work, such as reading a bunch of emails and using your best judgment about whether the folder name describes what’s in the email.

One specific thing to watch out for is missing data or placeholder values. Most people tend to assume—or at least hope—that all fields in the data contain a usable value. But often emails have no subject, or there is no name in the From field, or in CSV data there might be NA, NaN, or a blank space where a number should be. It’s always a good idea to check whether such placeholder values occur often enough to cause problems.
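
A quick audit for missing and placeholder values might look like the following sketch; the file name, column names, and placeholder list are hypothetical and would need adapting to your own data.

import numpy as np
import pandas as pd

emails = pd.read_csv("enron_emails.csv")

# Treat common placeholder strings as missing values too.
placeholders = ["NA", "NaN", "", " "]
emails = emails.replace(placeholders, np.nan)

# How many values are missing in each column, and what fraction of the rows is that?
report = pd.DataFrame({"missing": emails.isna().sum(),
                       "fraction": emails.isna().mean()})
print(report)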

5.3.2. Assumptions about the distribution of the data

Beyond the contents and range of the data, you may have further assumptions about its distribution. In all honesty, I know many statisticians who will get excited about the heading of this section but then disappointed with its contents. Statisticians love to check the appropriateness of distribution assumptions. Try Googling “normality test” or go straight to the Wikipedia page and you’ll see what I mean. It seems there are about a million ways to test whether your data is normally distributed, and that’s one statistical distribution.

I’ll probably be banned from all future statistics conferences for writing this, but I’m not usually that rigorous. Generally, plotting the data using a histogram or scatter plot can tell you whether the assumption you want to make is at all reasonable. For example, figure 5.3 is a graphic from one of my research papers in which I analyzed performances in track and field. Pictured is a histogram of the best men’s 400 m performances of all time (after taking their logarithms), and overlaid on it is the curve of a normal distribution. That the top performances fit the tail of a normal distribution was one of the key assumptions of my research, so I needed to justify that assumption. I didn’t use any of the statistical tests for normality, partially because I was dealing with a tail of the distribution—only the best performances, not all performances in history—but also because I intended to use the normal distribution unless it was obviously inappropriate for the data. To me, visually comparing the histogram with a plot of the normal distribution sufficed as verification of the assumption. The histogram was similar enough to the bell curve for my purposes.

Figure 5.3. The logarithms of the best men’s 400 m performances of all time seem to fit the tail of a normal distribution.
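
A visual check like the one in figure 5.3 is easy to reproduce for your own data. The sketch below overlays a fitted normal curve on a histogram; the values here are synthetic stand-ins, since the point is the comparison technique rather than the track-and-field data itself.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the (log-transformed) values you want to check.
rng = np.random.default_rng(0)
values = rng.normal(loc=3.78, scale=0.01, size=500)

# Histogram of the data, normalized so it is comparable to a density curve.
plt.hist(values, bins=30, density=True, alpha=0.6)

# Overlay the density of a normal distribution fitted to the same values.
mu, sigma = stats.norm.fit(values)
x = np.linspace(values.min(), values.max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, sigma))
plt.xlabel("value")
plt.show()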

Although I may have been less than statistically rigorous with the distribution of the track and field data, I don’t want to be dismissive of the value of checking the distributions of data. Bad things can happen if you assume you have normally distributed data and you don’t. Statistical models that assume normal distributions don’t handle outliers well, and the vast majority of popular statistical models make some sort of assumption of normality. This includes the most common kinds of linear regression, as well as the t-test. Assuming normality when your data isn’t even close can make your results appear significant when in fact they’re insignificant or plain wrong.

This last statement is valid for any statistical distribution, not only the normal. You may have categorical data that you think is uniformly distributed, when in fact some categories appear far more often than others. Social networking statistics, such as the kind I’ve calculated from the Enron data set—number of emails sent, number of people contacted in a day, and so on—are notoriously non-normal. They’re typically something like exponentially or geometrically distributed, both of which you should also check against the data before assuming them.

All in all, although it might be OK to skip a statistical test for checking that your data fits a particular distribution, do be careful and make sure that your data matches any assumed distribution at least roughly. Skipping this step can be catastrophic for results.

5.3.3. A handy trick for uncovering your assumptions

If you feel like you don’t have assumptions, or you’re not sure what your assumptions are, or even if you think you know all of your assumptions, try this: describe your data and project to a friend—what’s in the data set and what you’re going to do with it—and write down your description. Then, dissect your description, looking for assumptions.

For example, I might describe my original project involving the Enron data like this: “My data set is a bunch of emails, and I’m going to establish organization-wide patterns of behavior over the network of people using techniques from social network analysis. I’d like to draw conclusions about things like employee responsiveness as well as communication up the hierarchy, with a boss.”

In dissecting this description, you should first identify phrases and then think about what assumptions might underlie them, as in the following:

  • My data set is a bunch of emails— That’s probably true, but it might be worth checking to see whether there might be other non-email data types in there, such as chat messages or call logs.
  • Organization-wide— What is the organization? Are you assuming it’s clearly defined, or are there fuzzy boundaries? It might help to run some descriptive statistics regarding the boundaries of the organization, possibly people with a certain email address domain or people who wrote more than a certain number of messages.
  • Patterns of behavior— What assumptions do you have about what constitutes a pattern of behavior? Does everyone need to engage in the same behavior in order for it to be declared a pattern, or do you have a set of patterns and you’re looking for individual examples that match those patterns?
  • Network of people— Does everyone in the network need to be connected? Can there be unconnected people? Are you planning to assume a certain statistical model from social network analysis literature? Does it require certain assumptions?
  • Responsiveness— What do you assume this term means? Can you define it statistically and verify that the data supports such a definition by using the basic definition along with some descriptive statistics?
  • Hierarchy— Are you assuming you have complete knowledge of the organization’s hierarchy? Do you assume that it’s rigid, or does it change?

Realizing when you’re making assumptions—by dissecting your project description and then asking such questions—can help you avoid many problems later. You wouldn’t want to find out that a critical assumption was false only after you had completed your analysis, found odd results, and then gone back to investigate. Even more, you wouldn’t want a critical assumption to be false and never notice it.

5.4. Looking for something specific

Data science projects have all sorts of goals. One common goal is to be able to find entities within your data set that match a certain conceptual description. Here, I’m using the term entity to represent any unique individual represented in your data set. An entity could be a specific person, place, date, IP address, genetic sequence, or other distinct item.

If you’re working in online retailing, you might consider customers as your main entities, and you might want to identify those who are likely to purchase a new video game system or a new book by a particular author. If you’re working in advertising, you might be looking for people who are most likely to respond to a particular advertisement. If you’re working in finance, you might be looking for equities on the stock market that are about to increase in price. If it were possible to perform a simple search for these characterizations, the job would be easy and you wouldn’t need data science or statistics. But although these characterizations aren’t inherent in the data (can you imagine a stock that tells you when it’s about to go up?), you often can recognize them when you see them, at least in retrospect. The main challenge in such data science projects is to create a method of finding these interesting entities in a timely manner.

In the Enron email data set, we were looking for suspicious behavior that might somehow be connected to the illegal activity that we now know was taking place at the company. Although suspicious behavior in an email data set can take many forms, we can name a few: in general, employees discussing illegal activity, trying to cover something up, talking to suspicious people, or otherwise communicating in an abnormal fashion.

We already had statistical models of communication across social/organizational networks that we wanted to apply to the Enron data set, but there were any number of ways in which we could configure the model and its parameters in order to find suspicious behavior in its various forms. There was no guarantee that we would find the kind we were looking for, and there was also no guarantee that we would find any at all. One reason we might not find any was that there might not have been any.

5.4.1. Find a few examples

If you’re looking for something fairly specific, something interesting in your data, try to find something. Go through the data manually or use some simple searches or basic statistics to locate some examples of these interesting things. You should stay close to the data and be able to verify that these examples are indeed interesting. If you have a lot of data—if it’s hard even to browse through it—it’s OK to take a subset of the data and look for some good examples there.

If you can’t find any interesting examples, you might be in trouble. Sometimes, interesting things are rare or don’t exist in the form you think they do. In fact, the Enron data, as published, doesn’t contain a trail of clues or a smoking gun of any kind. It often helps to dig deeper, to change the way you’re searching, to think differently about the data and what you’re looking for, or otherwise to exhaust all possible ways to find the good, interesting examples in the data.

Sometimes brute force works. A team of a few people could theoretically read all of the Enron emails in a few days. It wouldn’t be fun—and I’m sure more than a few lawyers have done it—but it is possible and would be the most thorough way of finding what you’re looking for. I feel fairly confident in saying that over the course of the several months that I worked with the Enron data, I read most of the emails in that data set. This would have been a hallmark of a data science project gone wrong had the goal of the project not extended far beyond the Enron data set. We were developing software whose entire purpose was to make reading all emails unnecessary. We wanted to use the Enron data to characterize what suspicious communications look like so that we could use those characterizations to find such communications in other data sets. That’s why brute force made sense for us at the time. Depending on your data, looking through all your data manually might make sense for you, too.

You might also use brute force on a subset of your data. Think of it this way: if 1 in 1000 entities—data points, people, days, messages, whatever—is supposed to be interesting, then you should find one if you manually look at more than 0.1% of a data set consisting of a million entities.

You should probably look at more data than that to be sure you haven’t had bad luck, but if you’ve covered 1% of the data and you still haven’t found any, you know the interesting entities are rarer than you thought or nonexistent. You can adjust these percentages for rarer or more common entities.
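
The arithmetic behind that rule of thumb is just the probability of getting at least one hit in n independent looks, as in this tiny sketch:

def prob_at_least_one(p, n):
    """Probability of finding at least one interesting entity after inspecting
    n randomly chosen entities, when a fraction p of all entities is interesting."""
    return 1 - (1 - p) ** n

# 1 in 1000 entities is interesting; inspect 0.1% and then 1% of a million entities.
print(prob_at_least_one(0.001, 1_000))   # about 0.63
print(prob_at_least_one(0.001, 10_000))  # about 0.99995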

If it turns out that you can’t find any interesting entities via any means, the only options are to look for another type of interesting thing, following another of your project’s goals, or to go back to all your data sources and find another data set that contains some trace of something you find interesting. It’s not fun to have to do that, but it’s a real possibility and not that rare in practice. Let’s be optimistic, though, and assume you were successful in finding some interesting entities within your data set.

5.4.2. Characterize the examples: what makes them different?

Once you’ve found at least a few examples of interesting things, take a close look and see how they’re represented within your data. The goal in this step is to figure out which features and attributes of the data could help you accomplish your goal of finding even more examples of these interesting things. Often you may be able to recognize by simple inspection some pattern or value that the interesting examples share, some aspect of their data points that could identify them and differentiate them from the rest of the data set.

For email data, is it the email text that has interesting content and terms, or is it the time at which the email was sent? Or is it possibly that the sender and recipient(s) themselves are interesting? For other data types, have a look at the various fields and values that are present and make note of the ones that seem to be most important in differentiating the interesting things from everything else. After all, that’s the foundation of most statistics (in particular, machine learning) projects: differentiating two (or more) groups of things from one another. If you can get a rough idea of how you can do this manually, then it’s far easier to create a statistical model and implement it in code that will help you find many more of these examples, which I cover in a later chapter.

Often there’s nothing quantitative about a data point that’s remarkable or easily differentiable from typical data points, but it’s interesting nonetheless. Take, for instance, a single email from Andrew Fastow, the CFO of Enron in its final years and one of the main perpetrators of the fraudulent activity that later landed him and others in jail. In the data set, none of the emails from Fastow contains any sign of fraud or secrecy, but what’s interesting is that there are only nine emails from him in the entire corpus. One would think that, as CFO, his role would include communicating with others more often than once every couple of months. Therefore, either he avoided email or he did a good job of deleting his emails from all servers and others’ personal inboxes and archives. In any case, an email from Fastow might be remarkable not because of any inherent information but only within the context of its rarity.

In a similar way, interesting things in your data might be characterizable by their context or by something I might call their neighborhood. Neighborhood is a term I’m borrowing from topology, a branch of mathematics:

The neighborhood of a [data] point is, loosely speaking, a set of other points that are similar to, or located near, the point in question.

Similarity and location can take on many meanings. For the Enron data, we could define one type of neighborhood of a particular email as “the set of emails sent by this email’s sender.” Or we could define a neighborhood as “the set of emails sent during the same week.” Both of these definitions contain a notion of similarity or nearness. By the first definition, an email sent by Andrew Fastow has a small neighborhood indeed: only eight other emails. By the second definition of neighborhood, the neighborhoods are much larger, often with several hundred emails in a given week.

In addition to aspects of a data point itself, you can use its neighborhood to help characterize it. If emails from Andrew Fastow are interesting to you, maybe all emails sent by people who seldom write emails are interesting. In that case, one quantitative characterization of interesting uses the same-sender definition of neighborhood and can be stated like this:

An email might be interesting if it’s sent by someone who rarely sends emails.

You can incorporate this statement into a statistical model. It is quantitative (rarity can be quantified) and it can be determined by information contained within the data set that you have.
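
Stated as code, using the same hypothetical table layout as earlier in the chapter, that characterization might be sketched like this (the threshold is an arbitrary choice):

import pandas as pd

emails = pd.read_csv("enron_emails.csv")

# Size of each email's same-sender neighborhood: how many emails its sender wrote.
sender_counts = emails["sender"].value_counts()
emails["sender_total"] = emails["sender"].map(sender_counts)

# Flag emails sent by people who rarely send emails.
emails["rare_sender"] = emails["sender_total"] < 10
print(emails["rare_sender"].mean())  # fraction of emails flagged as potentially interesting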

Likewise, you could use a time-based neighborhood to create another characterization of interesting. Maybe, hypothetically, you found during your search an email that was sent in the middle of the night, from a work account to a private email address, asking to meet at an all-night diner. No such email exists in the Enron data set, but I like to pretend that the case was much more dramatic than it was—data science’s version of cloak and dagger, perhaps.

This concept of middle of the night, or odd hours, can be quantified in a few ways. One way is to choose hours of the day that represent middle of the night. Another way is to characterize odd hours as those hours in which almost no emails were written. You could use a time neighborhood of a few hours and characterize some interesting emails like this:

An email might be interesting if there are few other emails sent from within two hours both before and after the email in question.

This characterization, like the previous one, is both quantitative (few can be quantified empirically) and answerable within the data set you have.
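
One rough way to compute this time-based neighborhood is to sort the timestamps and count how many fall within the window around each email, as in this sketch (same hypothetical columns; the thresholds are arbitrary):

import pandas as pd

emails = pd.read_csv("enron_emails.csv", parse_dates=["date"])
times = emails["date"].dropna().sort_values().reset_index(drop=True)

window = pd.Timedelta(hours=2)
# For each email, count the other emails sent within two hours before or after it.
left = times.searchsorted(times - window, side="left")
right = times.searchsorted(times + window, side="right")
neighborhood_size = right - left - 1  # exclude the email itself

# Emails with very few neighbors in time were sent at unusually quiet hours.
print((neighborhood_size < 3).sum(), "emails sent at unusually quiet times")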

A good characterization of an interesting entity or data point is one that is quantitative, that is present in or calculable from the data you have, and that in some way helps differentiate it from normal data, if only a little. We’ll use these characterizations, and I’ll talk about them more, in later sections on preliminary analyses and choosing statistical models.

5.4.3. Data snooping (or not)

Some might call it data snooping to poke around in the data, find examples of something you find interesting, and then tailor subsequent analyses to fit the examples. Some might say that this will unfairly bias the results and make them appear better than they are. For example, if you’re looking to estimate the number of blue pickup trucks in your neighborhood, and you happen to know that there’s usually a blue pickup truck parked a few blocks away, you’ll probably walk in that direction, counting trucks along the way, and that one truck that you already know about could skew your results upward, if only slightly. Or, at best, you’ll walk around randomly, but being close to your house and on one of your typical routes, you’re more likely to walk past it.

You want to avoid significant bias in your results, so you should be careful not to let the preliminary characterizations I’m suggesting here skew your final conclusions. Data snooping can be a problem, and astute critics are sometimes right to say you should avoid it. But snooping is a problem only when assessing the accuracy or quality of your results. In particular, if you already know about some of the things your methods are attempting to discover again, you’re likely to be successful in those cases, and your results will be unfairly good.

But you’re not at the assessment phase yet. Right now, while trying to find and characterize data points, entities, and other things within the data set that are interesting and rare, you should do everything you can to be successful, because it’s a hard task. Later, though, all this useful snooping can complicate assessment of results, so I bring it up now to make you take note of a potential complication and to address potential critics who might say that you shouldn’t snoop around in your data.

5.5. Rough statistical analysis

Already in this chapter I’ve discussed basic descriptive statistics, validating assumptions, and characterizing some types of interesting things you’re looking for. Now, in terms of statistical sophistication, it’s time to take the analysis up one notch, but not two. I cover full-fledged statistical modeling and analysis in chapter 7, but before you get that far, it’s better to take only a single step in that direction and see how it works out.

Most sophisticated statistical algorithms take a while to implement; sometimes they take a while to run or compute on all your data. And as I’ve mentioned, many of them are fragile, and it can be difficult to understand how and why they give a specific result and whether that result is correct in any meaningful sense. That’s why I prefer to approach such sophisticated analyses slowly and with care, particularly with a new or unfamiliar data set.

If some of the statistical concepts in this section are unfamiliar to you, feel free to skip ahead for now and come back to this section after you finish the rest of the book—or at least chapter 7. If you’re already familiar with most sophisticated statistical methods, this section can help you decide whether your planned statistical method is a good choice. Or if you don’t know yet what method you might use, this section can help you figure it out.

5.5.1. Dumb it down

Most statistical methods can be translated into rough versions that can be implemented and tested in a fraction of the time when compared to the full method. Trying one or a few of these now, before you begin the full implementation and analysis, can provide tremendous insight into what statistical methods will be useful to you and how.

If, for your final analysis, all you intend to do is a linear regression or a t-test, by all means charge right in. This section concerns primarily those projects that will likely include some sort of classification, clustering, inference, modeling, or any other statistical method that has more than a few parameters, fixed or variable.

Classification

If you plan to do some classification as part of your analysis, there are numerous statistical models designed for the task, from random forests to support vector machines to gradient boosting. But one of the simplest methods of classification is logistic regression.

The task of classification is, in its simplest form, assigning one of two class labels to entities based on a set of entity features that you’ve chosen and whose values you’ve calculated from the data. Typically, the labels are 0 and 1, where 1 represents interesting in the same sense I’ve used previously, and 0 is normal. You can have more classes and more complicated classification, but I’ll save that for later.

The most sophisticated methods in classification have many moving parts and therefore have the potential to perform much better than logistic regression. But as I’ve mentioned, they’re much harder to understand and debug. Logistic regression is a relatively simple method that works like linear regression, except the output values (the predictions, for new data) are between 0 and 1.

Compared to classification methods from machine learning, logistic regression is much faster to calculate and has virtually no parameters that you need to fiddle with. On the other hand, it carries a few assumptions of its own—most notably that the log-odds of the outcome depend roughly linearly on the features—so if you have profoundly skewed or otherwise weird data values, it might not be the best choice.

If you have a favorite entity feature that you think will help classify yet-unknown entities, try it as the only feature/parameter in your logistic regression model. Your software tool can tell you whether the feature does indeed help, and then you can proceed to try another feature, either by itself or in addition to the first. Starting simple, then increasing complexity and checking whether it helps, is generally the best approach.

One good candidate for an informative feature for finding suspicious emails in the Enron data set is the time of day at which the email was sent. Late-night emails might prove suspicious. Another feature might be the number of recipients of the email.
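
Here’s a sketch of that kind of one-feature-at-a-time logistic regression using statsmodels. It assumes, hypothetically, that you’ve already labeled a sample of emails as suspicious (1) or not (0) and computed a couple of candidate features; the file and column names are placeholders.

import pandas as pd
import statsmodels.api as sm

# Hypothetical labeled sample with 'date', 'num_recipients', and a 0/1 'suspicious' column.
labeled = pd.read_csv("labeled_emails.csv", parse_dates=["date"])
labeled["hour"] = labeled["date"].dt.hour
y = labeled["suspicious"]

# Start with a single feature: the hour of day at which the email was sent.
single = sm.Logit(y, sm.add_constant(labeled[["hour"]])).fit()
print(single.summary())  # the coefficient and its p-value hint at whether the feature helps

# Then add a second feature and check whether the fit improves.
both = sm.Logit(y, sm.add_constant(labeled[["hour", "num_recipients"]])).fit()
print(both.summary())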

Another, more general method for investigating the usefulness of features for classification is to look at the distributions of feature values for each of the two classes (0 or 1). Using a couple of plots, you can see whether there seems to be a significant difference between the feature values of the two classes. Figure 5.4 is a two-dimensional plot of data points from three classes, designated by shape. The x- and y-axes represent two feature values for the data points. If your goal is to make statistical software that can find data points that are square—without knowing the true shape—you’d probably be in good shape; square data points have high x- and y-values. It would be easy to find a statistical model that correctly identifies square data points. The tough part might be finding the features that, when plotted, give neatly grouped classes such as these. Creating plots can help you find those good, useful features by giving you a sense of where the classes fall in the space of all data points and can help you figure out how to develop or tweak features to make them better.

Figure 5.4. A plot of three classes, given by shape, in two dimensions

Lastly, if you have a favorite statistical method for classification that you understand, and you know how to adjust its parameters in your statistical software to make it simple and easy to understand, then that’s also a good way to perform a rough and fast classification. For example, you might use a random forest with 10 trees and a maximum depth of 2. Or a support vector machine with linear kernel functions can be reasonably easy to understand if you know the theory behind it. Both of these might be good choices if you’re familiar with how those techniques work. If you do choose this route, it’s important that you understand how to evaluate the results of your method, check the contributions of the various features, and make sure that the method is working the way you expect it to.
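
If you go the route of a deliberately constrained version of a familiar method, the configuration might look like this sketch, which uses scikit-learn and the same hypothetical labeled sample as in the previous example:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

labeled = pd.read_csv("labeled_emails.csv", parse_dates=["date"])
labeled["hour"] = labeled["date"].dt.hour

X = labeled[["hour", "num_recipients"]]
y = labeled["suspicious"]

# Deliberately few, shallow trees, so the model stays small enough to inspect.
forest = RandomForestClassifier(n_estimators=10, max_depth=2, random_state=0).fit(X, y)

# Check which features this simple model actually relies on.
print(dict(zip(X.columns, forest.feature_importances_)))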

Clustering

Clustering is conceptually a lot like classification—there are entities with feature values that are intended to fall into groups—except there are no explicit, known labels. The process of clustering is often called unsupervised learning because the results are groups of similar entities—but because there are no labels, it may not immediately be clear what each group, or cluster, represents. It often takes manual inspection or descriptive statistics to figure out what kinds of entities are in each cluster.

As a rough version of clustering, I like to plot various values pertaining to entities and use plain visual inspection to determine whether the entities tend to form clusters. For data or entities with many aspects and values, it may take a while to visually inspect plots of one or two dimensions/variables at a time. But if you believe that a few key features should differentiate groups of entities from one another, you should be able to see that in a two-dimensional plot. If you can’t, you may want to revisit some assumptions that you’ve made. Blindly assuming that two entities are similar merely because they fall into the same cluster can lead to problems later. For example, even without labels/colors, the data points in figure 5.4 seem to cluster well into three groups. A clustering algorithm should be able to find them, if you know there are three clusters. On the other hand, if your data doesn’t group together so nicely, clustering might give poor results.

Beyond visual inspection, the simplest versions of clustering contain few variables and few clusters (most of the time, you have to set the number of clusters beforehand). If you can choose, say, three or four of your favorite entity features, and they cluster nicely using one of the simplest clustering algorithms, perhaps k-means, then you’re off to a good start and can proceed to more sophisticated clustering methods or configurations. It may also help to plot the results of your clustering algorithm within your software tool to make sure everything looks like it makes sense.
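
A minimal clustering sketch along these lines, using k-means from scikit-learn and a handful of hypothetical per-employee behavior features (the file and column names are placeholders):

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical table with one row per employee and a few numeric behavior features.
people = pd.read_csv("employee_features.csv")
cols = ["emails_sent", "emails_received", "distinct_contacts"]

# Scale the features so no single one dominates the distance calculation.
X = StandardScaler().fit_transform(people[cols])

# The number of clusters must be chosen up front; three is an arbitrary starting point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
people["cluster"] = kmeans.labels_

# Descriptive statistics per cluster help you figure out what each cluster represents.
print(people.groupby("cluster")[cols].mean())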

Inference

Statistical inference is the estimation of a quantitative value that you haven’t observed directly. For instance, in the case of the Enron project I’ve mentioned, at one point we wanted to estimate the probability that each employee would send an email to their boss as opposed to any other potential recipient. We intended to include this probability as a latent variable in a statistical model of communication, a complex model whose optimal parameter values could be found only via a complex optimization technique. For a lot of data, it could be slow.

But we could approximate the inference of this particular latent parameter value by counting the number of times each employee wrote an email to their boss and how many times they didn’t. It’s a rough approximation, but later, if the full model’s optimal parameter is found to be something quite different, we’d know that something may have gone wrong. If the two values differ, it doesn’t mean that something definitely did go wrong, but if we don’t understand and can’t figure out why they differ, it definitely shows that we don’t know how the model works on our data.
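
That rough approximation amounts to little more than counting. A sketch, assuming a hypothetical lookup table that maps each employee to their boss:

import pandas as pd

emails = pd.read_csv("enron_emails.csv")      # one row per sender-recipient pair
bosses = pd.read_csv("employee_bosses.csv")   # hypothetical columns: 'employee', 'boss'

merged = emails.merge(bosses, left_on="sender", right_on="employee", how="left")
merged["to_boss"] = merged["recipient"] == merged["boss"]

# Per-employee fraction of emails addressed to the boss: a rough stand-in for the
# latent probability that the full model will later try to infer.
p_boss = merged.groupby("sender")["to_boss"].mean()
print(p_boss.describe())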

We could approach other latent variables in our statistical model in the same way: find a way to get a rough approximation and make note of it for later comparison with the estimated value within the full model. Not only is this a good check for possible errors, but it also tells us things about our data that we may not have learned while calculating descriptive statistics, as discussed earlier in this chapter—and, even better, these new pieces of information are specific to our project’s goals.

Other statistical methods

I certainly haven’t covered how to do a rough approximation of every statistical method here, but hopefully the previous examples give you the idea. As with almost everything in data science, there’s no one solution; nor are there 10 or 100. There are an infinite number of ways to approach each step of the process.

You have to be creative in how you devise and apply quick-and-dirty methods because every project is different and has different goals. The main point is that you shouldn’t apply sophisticated statistical methods without first making sure they’re reasonably appropriate for your project’s goals and your data and that you’re using them properly. Applying a simple version of the statistical analysis first gives you a feeling for how the method interacts with your data and whether it’s appropriate. Chapter 7 discusses several types of statistical methods in far more detail, and from it you should be able to gather more ideas for your own analyses.

5.5.2. Take a subset of the data

Often you have too much data to run even a simple analysis in a timely fashion. At this stage of doing many rough preliminary analyses, it might be OK to use subsets of data for testing the applicability of simple statistical methods.

If you do apply the rough statistical methods to subsets of the full data set, watch out for a few pitfalls:

  • Make sure you have enough data and entities for the statistical methods to give significant results. The more complicated the method is, the more data you need.
  • If the subset is not representative of the full data set, your results could be way off. Calculate descriptive statistics on this subset and compare them to the relevant descriptive statistics on the full data set (see the sketch after this list). If they’re similar in ways that matter—keep your project’s goals in mind—then you’re in good shape.
  • If you try only one subset, you may unknowingly have chosen a highly specialized or biased subset, even if you run some descriptive statistics. But if you take three distinct subsets, do a quick analysis on them all, and get similar results, you can be reasonably certain the results will generalize to the full data set.
  • If you try different subsets and get different results, it might not be a bad thing. Try to figure out why. Data inherently has variance, and different data might give different results, slightly or greatly. Use descriptive statistics and diagnosis of these simple statistical methods to make sure you understand what’s happening.
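
Here’s a sketch of that subset-versus-full-data comparison, drawing three random subsets and computing a few descriptive statistics for each (hypothetical file and column names again):

import pandas as pd

emails = pd.read_csv("enron_emails.csv", parse_dates=["date"])

def summarize(df):
    # A few descriptive statistics that matter for a time-focused project.
    return {"rows": len(df), "senders": df["sender"].nunique(),
            "earliest": df["date"].min(), "latest": df["date"].max()}

print("full data:", summarize(emails))

# Three distinct random subsets; if their summaries agree with the full data set
# (and with one another), rough results on them are more likely to generalize.
for seed in (1, 2, 3):
    print("subset", seed, ":", summarize(emails.sample(frac=0.05, random_state=seed)))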

5.5.3. Increasing sophistication: does it improve results?

If you can’t get at least moderately good or promising results from a simple statistical method, proceeding with a more sophisticated method is dangerous. Increasing the sophistication of your method should improve results, but only if you’re on the right track. If the simple version of the method isn’t appropriate for your data or project, or if the algorithm isn’t configured properly, chances are that stepping up the sophistication isn’t going to help. Also, it’s harder to fix the configuration of a more sophisticated method, so if you begin with an improper configuration of a simple method, your configuration of the more sophisticated version will probably be as improper—or even more so—and harder to fix.

I like to make sure I have a solid, simple method that I understand and that clearly gives some helpful if not ideal results, and then I check the results as I step up the sophistication of the method. If the results improve with each step, I know I’m doing something right. If the results don’t improve, I know I’m either doing something wrong or I’ve reached the limit of complexity that the data or my project’s goals can handle.

Applying methods that are too complex for the data or for the project’s goals generally leads to what’s called over-fitting. Specifically, over-fitting means that the method has so many moving parts that it fits your existing data perfectly, but when you give it new data, the accuracy of the results isn’t nearly as good. I cover over-fitting in more detail in chapter 7, but for now let it suffice to say that sophistication should lead to better results—to a point—and if you’re not experiencing that, there’s likely a problem somewhere.

Exercises

Continuing with the Filthy Money Forecasting personal finance app scenario first described in chapter 2, and relating to previous chapters’ exercises, try these exercises:

1.

Given that a main goal of the app is to provide accurate forecasts, describe three types of descriptive statistics you would want to perform on the data in order to understand it better.

2.

Assume that you’re strongly considering trying to use a statistical model to classify repeating and one-time financial transactions. What are three assumptions you might have about the transactions in one of these two categories?

Summary

  • Instead of jumping straight into a sophisticated statistical analysis, be intentionally deliberate during the exploratory phase of data science, because most problems can be avoided—or fixed quickly—through knowledge and awareness.
  • Before analyzing data, state your prior assumptions about the data and check that they’re appropriate.
  • Before assuming there are needles in your haystack, sift through the data, manually if necessary, and find some good examples of the types of things you want to find more of during the project.
  • Perform rough statistical analyses on subsets of the data to make sure you’re on the right track before you devote too much time to a full software implementation.
  • Record any exploratory results somewhere handy; they might be of use later when making follow-up decisions.