Previous chapters have introduced the basic programming fundamentals for working with data, detailing how you can tell a computer to do data processing for you. To use a computer to analyze data, you need to both access a data set and interpret that data set so that you can ask meaningful questions about it. This will enable you to transform raw data into actionable information.
This chapter provides a high-level overview of how to interpret data sets as you get started doing data science—it details the sources of data you might encounter, the formats that data may take, and strategies for determining which questions to ask of that data. Developing a clear mental model of what the values in a data set signify is a necessary prerequisite before you can program a computer to effectively analyze that data.
Before beginning to work with data, it’s important to understand where data comes from. There are a variety of processes for capturing events as data, each of which has its own limitations and assumptions. The primary modes of data collection fall into the following categories:
Sensors: The volume of data being collected by sensors has increased dramatically in the last decade. Sensors that automatically detect and record information, such as pollution sensors that measure air quality, are now entering the personal data management sphere (think of FitBits or other step counters). Assuming these devices have been properly calibrated, they offer a reliable and consistent mechanism for data collection.
Surveys: Data that is less externally measurable, such as people’s opinions or personal histories, can be gathered from surveys. Because surveys depend on individuals’ self-reporting of their behavior, the quality of data may vary (across surveys, or across individuals). Depending on the domain, people may have poor recall (e.g., people don’t remember what they ate last week) or have incentives to respond in a particular way (e.g., people may over-report healthy behaviors). The biases inherent in survey responses should be recognized and, when possible, adjusted for in your analysis.
Record keeping: In many domains, organizations use both automatic and manual processes to keep track of their activities. For example, a hospital may track the length and result of every surgery it performs (and a governing body may require that hospital to report those results). The reliability of such data will depend on the quality of the systems used to produce it. Scientific experiments also depend on diligent record keeping of results.
Secondary data analysis: Data can be compiled from existing knowledge artifacts or measurements, such as counting word occurrences in a historical text (computers can help with this!).
All of these methods of collecting data can lead to potential concerns and biases. For example, sensors may be inaccurate, people may present themselves in particular ways when responding to surveys, record keeping may only focus on particular tasks, and existing artifacts may already exclude perspectives. When working with any data set, it is vital to consider where the data came from (e.g., who recorded it, how, and why) to effectively and meaningfully analyze it.
Computers’ abilities to record and persist data have led to an explosion of available data values that can be analyzed, ranging from personal biological measures (how many steps have I taken?) to social network structures (who are my friends?) to private information leaked from insecure websites and government agencies (what are their Social Security numbers?). In professional environments, you will likely be working with proprietary data collected or managed by your organization. This might be anything from purchase orders of fair trade coffee to the results of medical research—the range is as wide as the types of organizations (since everyone now records data and sees a need for data analytics).
Luckily, there are also plenty of free, nonproprietary data sets that you can work with. Organizations will often make large amounts of data available to the public to support experiment duplication, promote transparency, or just see what other people can do with that data. These data sets are great for building your data science skills and portfolio, and are made available in a variety of formats. For example, data may be accessed as downloadable CSV spreadsheets (see Chapter 10), as relational databases (see Chapter 13), or through a web service API (see Chapter 14).
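To give a sense of what "accessing" a downloadable CSV looks like in practice, here is a minimal sketch in R. To keep the sketch self-contained, it first writes a tiny two-row CSV to a temporary file; in practice you would point `read.csv()` at the file you downloaded (the column names here are invented for illustration).

```r
# Create a tiny stand-in for a downloaded CSV file (illustrative values only)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Ada,64", "Bob,74"), csv_file)

# read.csv() loads a CSV file into a data frame for analysis
my_data <- read.csv(csv_file, stringsAsFactors = FALSE)

nrow(my_data)      # the number of observations (rows)
colnames(my_data)  # the names of the features (columns)
```

The same `read.csv()` call works whether the file came from a government portal, a journalism outlet's GitHub repository, or an online community like Kaggle.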
Popular sources of open data sets include:
Government publications: Government organizations (and other bureaucratic systems) produce a lot of data as part of their everyday activities, and often make these data sets available in an effort to appear transparent and accountable to the public. You can currently find publicly available data from many countries, such as the United States,1 Canada,2 India,3 and others. Local governments will also make data available: for example, the City of Seattle4 makes a vast amount of data available in an easy-to-access format. Government data covers a broad range of topics, though it can be influenced by the political situation surrounding its gathering and retention.
1U.S. government’s open data: https://www.data.gov
2Government of Canada open data: https://open.canada.ca/en/open-data
3Open Government Data Platform India: https://data.gov.in
4City of Seattle open data portal: https://data.seattle.gov
News and journalism: Journalism remains one of the most important contexts in which data is gathered and analyzed. Journalists do much of the legwork in producing data—searching existing artifacts, questioning and surveying people, or otherwise revealing and connecting previously hidden or ignored information. News media usually publish the analyzed, summative information for consumption, but they also may make the source data available for others to confirm and expand on their work. For example, the New York Times5 makes much of its historical data available through a web service, while the data politics blog FiveThirtyEight6 makes all of the data behind its articles available on GitHub (invalid models and all).
5New York Times Developer Network: https://developer.nytimes.com
6FiveThirtyEight: Our Data: https://data.fivethirtyeight.com
Scientific research: Another excellent source of data is ongoing scientific research, whether performed in academic or industrial settings. Scientific studies are (in theory) well grounded and structured, providing meaningful data when considered within their proper scope. Since science needs to be disseminated and validated by others to be usable, research is often made publicly available for others to study and critique. Some scientific journals, such as the premier journal Nature, require authors to make their data available for others to access and investigate (check out its list7 of scientific data repositories!).
7Nature: Recommended Data Repositories: https://www.nature.com/sdata/policies/repositories
Social networks and media organizations: Some of the largest quantities of data produced occur online, automatically recorded from people’s usage of and interactions with social media applications such as Facebook, Twitter, or Google. To better integrate these services into people’s everyday lives, social media companies make much of their data programmatically available for other developers to access and use. For example, it is possible to access live data from Twitter,8 which has been used for a variety of interesting analyses. Google also provides programmatic access9 to most of its many services (including search and YouTube).
8Twitter developer platform: https://developer.twitter.com/en/docs
9Google APIs Explorer: https://developers.google.com/apis-explorer/
Online communities: As data science has rapidly increased in popularity, so too has the community of data science practitioners. This community and its online spaces are another great source for interesting and varied data sets and analyses. For example, Kaggle10 hosts a number of data sets as well as “challenges” to analyze them. Socrata11 (which powers the Seattle data repository) also collects a variety of data sets (often from professional or government contributors). Somewhat similarly, the UCI Machine Learning Repository12 maintains a collection of data sets used in machine learning, drawn primarily from academic sources. And there are many other online lists of data sources as well—including a dedicated subreddit.
10Kaggle: “the home of data science and machine learning”: https://www.kaggle.com
11Socrata: data as a service platform: https://opendata.socrata.com
12UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/index.php
In short, there are a huge number of real-world data sets available for you to work with—whether you have a specific question you would like to answer, or just want to explore and be inspired.
Once you acquire a data set, you will have to understand its structure and content before (programmatically) investigating it. Understanding the types of data you will encounter depends on your ability to discern the level of measurement for a given piece of data, as well as the different structures that are used to hold that data.
Data can be made up of a variety of types of values (represented by the concept of “data type” in R). More generally, data values can also be discussed in terms of their level of measurement14—a way of classifying data values in terms of how they can be measured and compared to other values.
14Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677
The field of statistics commonly classifies values into one of four levels, described in Table 9.1.
Table 9.1 The four levels of measurement

Level      Example                                       Operations supported
Nominal    Fruits: apples, bananas, oranges, etc.        “same or different”
Ordinal    Hotel ratings: 5-star, 4-star, etc.           “bigger or smaller”
Ratio      Lengths: 1 inch, 1.5 inches, 2 inches, etc.   “twice as big”
Interval   Dates: 05/15/2012, 04/17/2015, etc.           “3 units bigger”
Nominal data (often equivalently categorical data) is data that has no implicit ordering. For example, you cannot say that “apples are more than oranges,” though you can indicate that a particular fruit either is an apple or an orange. Nominal data is commonly used to indicate that an observation belongs in a particular category or group. You do not usually perform mathematical analysis on nominal data (e.g., you can’t find the “average” fruit), though you can discuss counts or distributions. Nominal data can be represented by strings (such as the name of the fruit), but also by numbers (e.g., “fruit type #1”, “fruit type #2”). Just because a value in a data set is a number, that does not mean you can do math on it! Note that boolean values (TRUE and FALSE) are a type of nominal value.
Ordinal data establishes an order for nominal categories. Ordinal data may be used for classification, but it also establishes that some groups are greater than or less than others. For example, you may have classifications of hotels or restaurants as 5-star, 4-star, and so on. There is an ordering to these categories, but the distances between the values may vary. You are able to find the minimum, maximum, and even median values of ordinal variables, but you can’t compute a statistical mean (since ordinal values do not define how much greater one value is than another). Note that it is possible to treat nominal variables as ordinal by enforcing an ordering, though in effect this changes the measurement level of the data. For example, colors are usually nominal data—you cannot say that “red is greater than blue.” This is despite the conventional ordering based on the colors of a rainbow; when you say that “red comes before blue (in the rainbow),” you’re actually replacing the nominal color value with an ordinal value representing its position in a rainbow (which itself is dependent on the ratio value of its wavelength)! Ordinal data is also considered categorical.
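In R, both nominal and ordinal data can be represented with factors; a minimal sketch (using the fruit and hotel-rating examples, with invented values):

```r
# Nominal data: categories with no implicit ordering, stored as a factor
fruits <- factor(c("apple", "banana", "apple", "orange"))
table(fruits)   # counts per category are meaningful...
# mean(fruits)  # ...but a "mean fruit" is not (R warns and returns NA)

# Ordinal data: ordered categories, stored as an ordered factor
ratings <- factor(
  c("3-star", "5-star", "4-star", "4-star"),
  levels = c("1-star", "2-star", "3-star", "4-star", "5-star"),
  ordered = TRUE
)
min(ratings)  # min/max comparisons are valid for ordinal data
```

Note that because the ordinal levels are ordered but not evenly spaced, R supports `min()` and `max()` on an ordered factor but refuses to compute a mean.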
Ratio data (often equivalently continuous data) is the most common level of measurement in real-world data: data based on population counts, monetary values, or amounts of activity is usually measured at the ratio level. With ratio data, you can find averages, as well as measure the distance between different values (a feature also available with interval data). As you might expect, you can also compare the ratio of two values when working with ratio data (i.e., value x is twice as great as value y).
Interval data is similar to ratio data, except there is no fixed zero point. For example, dates cannot be discussed in proportional terms (i.e., you wouldn’t say that Wednesday is twice as much as Monday). Therefore, you can compute the distance (interval) between two values (i.e., 2 days apart), but you cannot compute the ratio between two values. Interval data is also considered continuous.
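The distinction between ratio and interval data can be sketched in R (using the length and date examples from Table 9.1):

```r
# Ratio data: numeric values with a true zero, so ratios are meaningful
lengths <- c(1, 1.5, 2)    # lengths in inches
mean(lengths)              # averages are valid
lengths[3] / lengths[1]    # "twice as big" is a meaningful statement

# Interval data: distances are meaningful, but there is no fixed zero point
dates <- as.Date(c("2012-05-15", "2015-04-17"))
dates[2] - dates[1]        # the interval between two dates is meaningful
# dates[2] / dates[1]      # ...but the ratio of two dates is not (an error)
```

R’s `Date` class reflects this: subtracting two dates yields a difference in days, while dividing them is not a defined operation.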
Identifying and understanding the level of measurement of a particular data feature is important when determining how to analyze a data set. In particular, you need to know what kinds of statistical analysis will be valid for that data, as well as how to interpret what that data is measuring.
In practice, you will need to organize the numbers, strings, vectors, and lists of values described in the previous chapters into more complex formats. Data is organized into more robust structures—particularly as the data set gets large—to better signify what those numbers and strings represent. To work with real-world data, you will need to be able to understand these structures and the terminology used to discuss them.
In practice, most data sets are structured as tables of information, with individual data values arranged into rows and columns (see Figure 9.1). These tables are similar to how data may be recorded in a spreadsheet (using a program such as Microsoft Excel). In a table, each row represents a record or observation: an instance of a single thing being measured (e.g., a person, a sports match). Each column represents a feature: a particular property or aspect of the thing being measured (e.g., the person’s height or weight, the scores in a sports game). Each data value can be referred to as a cell in the table.
Viewed in this way, a table is a collection of “things” being measured, each of which has a particular value for a characteristic of that thing. And, because all the observations share the same characteristics (features), it is possible to analyze them comparatively. Moreover, by organizing data into a table, each data value (cell) can be automatically given two associated meanings: which observation it is from as well as which feature it represents. This structure allows you to discern semantic meaning from the numbers: the number 64 in Figure 9.1 is not just some value; it’s “Ada’s height.”
The table in Figure 9.1 represents a small (even tiny) data set, in that it contains just five observations (rows). The size of a data set is generally measured in terms of its number of observations: a small data set may contain only a few dozen observations, while a large data set may contain thousands or hundreds of thousands of records. Indeed, “Big Data” is a term that, in part, refers to data sets that are so large that they can’t be loaded into the computer’s memory without special handling, and may have billions or even trillions of rows! Yet, even a data set with a relatively small number of observations can contain a large number of cells if it records a lot of features per observation (though these tables can often be “inverted” to have more rows and fewer columns; see Chapter 12). Overall, the number of observations and features (rows and columns) is referred to as the dimensions of the data set—not to be confused with a table being a “two-dimensional” data structure (so called because each data value has two meanings: observation and feature).
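In R, this table structure corresponds to a data frame. A minimal sketch: Ada’s height (64) comes from Figure 9.1, while the other names and values below are invented for illustration.

```r
# Each row is an observation (a person); each column is a feature
people <- data.frame(
  name   = c("Ada", "Bob", "Chris"),
  height = c(64, 74, 69),    # heights (in inches)
  weight = c(135, 156, 139)  # weights (in pounds)
)

dim(people)  # the dimensions of the data set: rows, then columns

# A single cell carries two meanings: this one is "Ada's height"
people[people$name == "Ada", "height"]
```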
Although it is commonly structured in this way, data need not be represented as a single table. More complex data sets may spread data values across multiple tables (such as in a database; see Chapter 13). In other complex data structures, each individual cell in the table may hold a vector or even its own data table. This can cause the table to no longer be two-dimensional, but three- or more-dimensional. Indeed, many data sets available from web services are structured as “nested tables”; see Chapter 14 for details.
The first thing you will need to do upon encountering a data set (whether one you found online or one that was provided by your organization) is to understand the meaning of the data. This requires understanding the domain you are working in, as well as the specific data schema you are working with.
The first step toward being able to understand a data set is to research and understand the data’s problem domain. The problem domain is the set of topics that are relevant to the problem—that is, the context for that data. Working with data requires domain knowledge: you need to have a basic level of understanding of that problem domain to do any sensible analysis of that data. You will need to develop a mental model of what the data values mean. This includes understanding the significance and purpose of any features (so you’re not doing math on contextless numbers), the range of expected values for a feature (to detect outliers and other errors), and some of the subtleties that may not be explicit in the data set (such as biases or aggregations that may hide important causalities).
As a specific example, if you wanted to analyze the table shown in Figure 9.1, you would need to first understand what is meant by “height” and “weight” of a person, the implied units of the numbers (inches, centimeters, … or something else?), an expected range (does Ada’s height of 64 mean she is short?), and other external factors that may have influenced the data (e.g., age).
You do not need to necessarily be an expert in the problem domain (though it wouldn’t hurt); you just need to acquire sufficient domain knowledge to work within that problem domain!
While people’s heights and other data sets discussed in this text should be familiar to most readers, in practice you are quite likely to come across data from problem domains that are outside of your personal domain expertise. Or, more problematically, the data set may be from a problem domain that you think you understand but actually have a flawed mental model of (a failure of meta-cognition).
For example, consider the data set shown in Figure 9.2, a screenshot taken from the City of Seattle’s data repository. This data set presents information on Land Use Permits, a somewhat opaque bureaucratic procedure with which you may be unfamiliar. The question becomes: how would you acquire sufficient domain knowledge to understand and analyze this data set?
15City of Seattle: Land Use Permits (access requires a free account): https://data.seattle.gov/Permitting/Land-Use-Permits/uyyd-8gak
Gathering domain knowledge almost always requires outside research—you will rarely be able to understand a domain just by looking at a spreadsheet of numbers. To gain general domain knowledge, we recommend you start by consulting a general knowledge reference: Wikipedia provides easy access to basic descriptions. Be sure to read any related articles or resources to improve your understanding: sifting through the vast amount of information online requires cross-referencing different resources, and mapping that information to your data set.
That said, the best way to learn about a problem is to find a domain expert who can help explain the domain to you. If you want to know about land use permits, try to find someone who has used one in the past. The second best solution is to ask a librarian—librarians are specifically trained to help people discover and acquire basic domain knowledge. Libraries may also support access to more specialized information sources.
Once you have a general understanding of the context for a data set, you can begin interpreting the data set itself. You will need to focus on understanding the data schema (e.g., what is represented by the rows and columns), as well as the specific context for those values. We suggest you use the following questions to guide your research:
“What meta-data is available for the data set?”
Many publicly available data sets come with summative explanations, instructions for access and usage, or even descriptions of individual features. This meta-data (data about the data) is the best way to begin to understand what value is represented by each cell in the table, since the information comes directly from the source.
For example, Seattle’s land use permits page has a short summary (though you would want to look up what an “over-the-counter review application” is), provides a number of categories and tags, lists the dimensions of the data set (14,200 rows as of this writing), and gives a quick description of each column.
A particularly important piece of meta-data to search for is:
“Who created the data set? Where does it come from?”
Understanding who generated the data set (and how they did so!) will allow you to know where to find more information about the data—it will let you know who the domain experts are. Moreover, knowing the source and methodology behind the data can help you uncover hidden biases or other subtleties that may not be obvious in the data itself. For example, the Land Use Permits page notes that the data was provided by the “City of Seattle, Department of Planning and Development” (now the Department of Construction & Inspections). If you search for this organization, you can find its website.16 This website would be a good place to gain further information about the specific data found in the data set.
Once you understand this meta-data, you can begin researching the data set itself:
“What features does the data set have?”
Regardless of the presence of meta-data, you will need to understand the columns of the table to work with it. Go through each column and check if you understand:
What “real-world” aspect does each column attempt to capture?
For continuous data: what units are the values in?
For categorical data: what different categories are represented, and what do those mean?
What is the possible range of values?
If the meta-data provides a key to the data table, this becomes an easy task. Otherwise, you may need to study the source of the data to determine how to understand the features, sparking additional domain research.
As you read through a data set—or anything, really—you should write down the terms and phrases you are not familiar with to look up later. This will discourage you from (inaccurately) guessing a term’s meaning, and will help you keep track of which terms you have and have not yet clarified.
For example, the Land Use Permits data set provides clear descriptions of the columns in the meta-data, but looking at the sample data reveals that some of the values may require additional research. For example, what are the different Permit Types and Decision Types? By going back to the source of the data (the Department of Construction home page), you can navigate to the Permits page and then to the “Permits We Issue (A-Z)” to see a full list of possible permit types. This will let you find out, for example, that “PLAT” refers to “creating or modifying individual parcels of property”—in other words, adjusting lot boundaries.
To understand the features, you will need to look at some sample observations. Open up the spreadsheet or table and look at the first few rows to get a sense for what kind of values they have and what that may say about the data.
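This kind of inspection is straightforward in R; a sketch using a small, invented data frame that stands in for whatever data set you have loaded:

```r
# An invented data frame standing in for your loaded data set
people <- data.frame(
  name   = c("Ada", "Bob", "Chris"),
  height = c(64, 74, 69)  # heights (in inches)
)

head(people)           # the first few rows: a sense of sample observations
str(people)            # the type of each column, with sample values
range(people$height)   # the observed range of a continuous feature
unique(people$name)    # the distinct values of a categorical feature
```

Comparing what `str()` and `range()` report against the meta-data’s descriptions is a quick way to catch surprises (unexpected units, out-of-range values, or categories the documentation never mentioned).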
Finally, throughout this process, you should continually consider:
“What terms do you not know or understand?”
16Seattle Department of Construction & Inspections (access requires a free account): http://www.seattle.gov/dpd/
Depending on the problem domain, a data set may contain a large amount of jargon, both to explain the data and inside the data itself. Making sure you understand all the technical terms used will go a long way toward ensuring you can effectively discuss and analyze the data.
Watch out for acronyms you are not familiar with, and be sure to look them up!
For example, looking at the “Table Preview,” you may notice that many of the values for the “Permit Type” feature use the term “SEPA.” Searching for this acronym would lead you to a page describing the State Environmental Policy Act (requiring environmental impact to be considered in how land is used), as well as details on the “Threshold Determination” process.
Overall, interpreting a data set will require research and work that is not programming. While it may seem like such work is keeping you from making progress in processing the data, having a valid mental model of the data is both useful and necessary to perform data analysis.
Perhaps the most challenging aspect of data analysis is effectively applying questions of interest to the data set to construct the desired information. Indeed, as a data scientist, it will often be your responsibility to translate from various domain questions to specific observations and features in your data set. Take, for example, a question like:
“What is the worst disease in the United States?”
To answer this question, you will need to understand the problem domain of disease burden measurement and acquire a data set that is well positioned to address the question. For example, one appropriate data set would be the Global Burden of Disease17 study performed by the Institute for Health Metrics and Evaluation, which details the burden of disease in the United States and around the world.
17IHME: Global Burden of Disease: http://www.healthdata.org/node/835
Once you have acquired this data set, you will need to operationalize the motivating question. Considering each of the key words, you will need to identify a set of diseases, and then quantify what is meant by “worst.” For example, the question could be more concretely phrased as any of these interpretations:
Which disease causes the largest number of deaths in the United States?
Which disease causes the most premature deaths in the United States?
Which disease causes the most disability in the United States?
Depending on your definition of “worst,” you will perform very different computations and analysis, possibly arriving at different answers. You thus need to be able to decide what precisely is meant by a question—a task that requires understanding the nuances found in the question’s problem domain.
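The effect of operationalization can be sketched in R. The `disease_data` values below are invented for illustration (they are not from the Global Burden of Disease study); the point is that each interpretation of “worst” can pick out a different disease.

```r
# Invented, illustrative values -- each column is one way to quantify "worst"
disease_data <- data.frame(
  cause      = c("CVD", "Neoplasms", "Self-harm", "MSK"),
  deaths     = c(900, 600, 45, 5),   # number of deaths
  ylls       = c(150, 170, 60, 4),   # years of life lost (premature death)
  disability = c(30, 25, 15, 80)     # years lived with disability
)

# Each operationalization yields a (possibly different) "worst" disease
disease_data$cause[which.max(disease_data$deaths)]      # most deaths
disease_data$cause[which.max(disease_data$ylls)]        # most premature death
disease_data$cause[which.max(disease_data$disability)]  # most disability
```

With these (invented) numbers, the three questions produce three different answers—exactly the divergence the three phrasings above anticipate.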
Figure 9.3 shows visualizations that try to answer this very question. The figure contains screenshots of treemaps from an online tool called GBD Compare.18 A treemap is like a pie chart that is built with rectangles: the area of each segment is drawn proportionally to an underlying piece of data. The additional advantage of the treemap is that it can show hierarchies of information by nesting different levels of rectangles inside of one another. For example, in Figure 9.3, the disease burden from each communicable disease (shown in red) is nested within the same segment of each chart.
18GBD Compare: visualization for global burden of disease: https://vizhub.healthdata.org/gbd-compare/
Depending on how you choose to operationalize the idea of the “worst disease,” different diseases stand out as the most impactful. As you can see in Figure 9.3, almost 90% of all deaths are caused by non-communicable diseases such as cardiovascular diseases (CVD) and cancers (Neoplasms), shown in blue. When you consider the age of death for each person (computing a metric called Years of Life Lost), this value drops to 80%. Moreover, this metric enables you to identify causes of death that disproportionately affect young people, such as traffic accidents (Trans Inj) and self-harm, shown in green (see the middle chart in Figure 9.3). Finally, if you consider the “worst” disease to be that currently causing the most physical disability in the population (as in the bottom chart in Figure 9.3), the impacts of musculoskeletal conditions (MSK) and mental health issues (Mental) are exposed.
Because data analysis is about identifying answers to questions, the first step is to ensure you have a strong understanding of the question of interest and how it is being measured. Only after you have mapped from your questions of interest to specific features (columns) of your data can you perform an effective and meaningful analysis of that data.