Chapter 2. Preparing Data for Analysis

Estimates for how long data scientists spend preparing their data vary, but it’s safe to say that this step takes up a significant part of the time spent working with data. In 2014 the New York Times reported that data scientists spend 50 to 80 percent of their time cleaning and wrangling their data.1 A 2016 survey by Crowdflower found that data scientists spend 60 percent of their time cleaning and organizing data in order to prepare it for analysis or modeling work.2 Preparing data is such a common task that terms have sprung up to describe it, lovingly or loathingly: data munging, data wrangling, and data prep. Munging, or “mung,” has an acronym, Mash Until No Good, which I have certainly done on occasion.

Is all this data preparation work just mindless toil, or an important part of the process?

Ideally there is a data dictionary with clear descriptions of the fields, possible values, how the data was collected, and how it joins with other data. Unfortunately this is not the norm. Documentation often isn’t prioritized, even by people who see its value, and it goes out of date quickly: if it’s not tightly coupled to the data storage, it easily falls out of sync with reality. Ideally the data dictionary would live in the data store itself, but this isn’t very common. If your organization already has a data dictionary tool, even if it’s in a wiki or document somewhere, please use it and contribute to it. And if nothing exists currently, consider starting one! This is one of the most valuable gifts you can give to your team, and your future self.

In a data warehouse, we try to store as much data as possible, without always knowing exactly what we’re going to do with it. This allows us to ask and answer all sorts of questions we couldn’t have thought of when initially designing the system. It also means that there is usually some work to be done before arriving at the final answer. In this chapter I’ll begin by talking about profiling the data, as a way to get to know its contents and check for data quality. Next I’ll walk through some useful tools for cleaning data to deal with any data quality issues. Finally I’ll talk about some data shaping techniques that will return the columns and rows needed for further analysis.

2.0 Types of Data

Before diving into preparing data for analysis, let’s talk about the types of data you may encounter. As data analysts and data scientists, we often work with a mixture of what I like to call data exhaust (data created in the course of some other purpose, such as running a website) and data collected explicitly for the purpose of analysis, whether it comes from website or mobile interaction trackers, surveys, or other sources.

2.0.1 Database data types

There is a fair amount written about data types from a database perspective, and most database documentation goes into them in depth. You don’t necessarily need to be an expert on the nuances to be good at analysis, but we’ll encounter situations later in the book where considering the data type is important, so this section will cover the basics. The main types of data are strings, numeric, logical, and datetime.

First the string data types. These can hold letters, numbers, and special characters, including unprintable characters like tabs and newlines. Mathematical operations, such as adding, subtracting, and multiplying, can’t be performed on string data even if the field contains numbers. String fields can be defined to hold a fixed number of characters or a variable number of characters. State abbreviations are two letters, so a field holding them could be defined to allow only two characters, whereas the full names of states are different lengths, so a field could be defined to allow a variable number of characters, but no more than 100. Strings can also be very long if they contain data like free-text survey responses, emails, or other texts; depending on the database, these can be stored in TEXT or BLOB fields, though they take up more space, so those data types tend to be used sparingly. When data is loaded, strings that are too big for the defined data type can be truncated (shortened) or rejected entirely. Databases have a number of string functions that we will make use of for various analysis purposes.

Numeric data types store numbers, whole and fractional, positive and negative; mathematical functions and aggregations can generally be applied to them.

Datetime data types store dates, times, or both. Many functions are interchangeable between dates and datetimes, but some functions and some databases are picky about which one is used.

Other data types, such as JSON and geographic types, are supported by some but not all databases. Table 2-1 summarizes the data types you are most likely to encounter.

Table 2-1. Summary of database data types
Category Type Description
String CHAR / VARCHAR Hold strings. A CHAR is always a fixed length, whereas a VARCHAR is variable length, up to some maximum size (for example, 256 characters).
TEXT / BLOB These data types can hold longer strings that won’t fit in a VARCHAR. Descriptions or free text entered by survey respondents might be held in these fields.
Numeric INT / SMALLINT / BIGINT Hold integers, which are whole numbers, negative or positive. Some databases have a SMALLINT and/or BIGINT. SMALLINT is sometimes used in a highly optimized database when the range of values is known and is small in absolute terms, since it takes less memory than a regular INT. A BIGINT, on the other hand, can hold numbers with more digits than an INT; it also takes up more space, so it is typically used only when needed.
FLOAT / DOUBLE Hold decimal numbers, sometimes with the number of decimal places specified.
Datetime DATE Holds dates, typically in YYYY-MM-DD format, where YYYY is a 4-digit year, MM is the 2-digit month number, and DD is the 2-digit day.
DATETIME / TIMESTAMP Hold dates with times, typically in YYYY-MM-DD hh:mi:ss format, where hh is the 2-digit hour (usually 24-hour time, with values of 0 to 23), mi is the 2-digit minutes, and ss is the 2-digit seconds.
TIME Holds just the time portion of a DATETIME.
Logical BOOLEAN Holds TRUE or FALSE values.
Other JSON Not all databases support JSON data types, but it’s an emerging trend. JSON stands for JavaScript Object Notation and, despite its name, is useful as a language-independent data format. JSON is useful for recursive data models where not all records have the same number of values.
Geographic Some specialized databases can hold geographic data types, but most typical data warehouses do not.

2.0.2 Structured vs. Unstructured

The next grouping of data is structured vs. unstructured (and sometimes semi-structured). Most databases were designed to handle structured data. With structured data, each element fits into a column, and entities are rows. First a data model is created, and then data is inserted according to that data model. For example, an address record can be highly structured, with a field for street address, one for city, one for state, and another for postal code. Each field has a data type and allows only data of that type to be entered. Structured data is easy to query and analyze. It’s well suited for relational databases and SQL. There is a mature set of tools and methods for working with this data.

Unstructured data is more freeform. There is no predetermined structure or data model. Unstructured data is often the “everything else” that isn’t database data. Documents, emails, and web pages are unstructured. Photos, images, videos, and audio files are also examples of unstructured data. They don’t fit into the traditional data types, and so aren’t well suited for relational databases and SQL.

Semi-structured data is in-between, and in fact some of the “unstructured” data types have some structure that we can make use of. For example, emails have a from, to, subject, body, and sent timestamp that can be stored separately in a data model with those fields. Metadata, or data about data, can be extracted from other file types and stored for analysis. For example, music audio files might be tagged with artist, song name, genre, and duration.

“Schema on write” describes the approach in which the data model is designed in advance and the data store is set up to enforce that model when new data is loaded. Relational databases have a predefined schema, with data types, numbers of columns, whether null values are allowed, and uniqueness constraints defined. Data that does not conform to the data model is rejected, usually with an error that alerts the administrator. This type of setup slows down loading somewhat, since each piece of data has to be verified before it is written, but reading the data is fast and there is a higher level of data quality assurance.
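
As a minimal sketch of schema on write, a table definition declares a type for each column and any constraints the database should enforce at load time. The table and column names here are hypothetical:

CREATE TABLE customer_addresses
(
customer_id bigint not null       -- integer type; rows without a value are rejected
,street_address varchar(200)      -- variable-length string, up to 200 characters
,city varchar(100)
,state char(2)                    -- fixed-length string for two-letter abbreviations
,postal_code varchar(10)
,created_at timestamp not null    -- date with time
)
;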

“Schema on read” is used to describe when there is not a predefined schema that is enforced at write time. Data is written regardless, and no checks are performed. Instead, a schema is assigned when the data is queried, or read. This approach leads to faster writing of data, since no checking is required, but slower query times because the data has to be checked at that point. Data quality is also potentially lower since no validation happened at write time, and therefore there are no alerts that might cause someone to fix upstream data quality issues.

2.0.3 First-party, Third-party, and Cloud Vendor data

First-party data is collected by the organization itself. This can be through server logs, databases that keep track of transactions and customer information, or other systems that are built and controlled by the organization and generate data interesting for analysis. Since the systems were created in-house, finding the people who built them and learning about how the data is generated is usually possible. Data analysts may also be able to influence or have control over how certain pieces of data are created and stored, particularly when bugs are responsible for poor data quality.

Third-party data may be purchased or obtained from freely available data sources, such as those published by governments. Unless the data has been collected specifically on behalf of the organization, data teams usually have little control over the format, frequency, and data quality of these sources. These sources also may not integrate well with first-party systems, except along broad categories. For example, most third-party sources do not have user-level data; instead, the data might be joined with first-party data at the postal code, city, or higher level. Third-party data can have unique and useful information, however, such as aggregate spending patterns, demographics, and market trends that would be very expensive or impossible to collect otherwise.
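
For example, first-party customer counts might be joined to a purchased demographic data set at the postal code level. This is only a sketch; the census_income table, its columns, and the postal_code field on the customers table are hypothetical:

SELECT a.postal_code
,a.customers
,b.median_household_income
FROM
(
    SELECT postal_code, count(customer_id) as customers
    FROM customers
    GROUP BY 1
) a
LEFT JOIN census_income b on a.postal_code = b.postal_code
;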

Cloud vendor data is something of a blend of first-party and third-party data. Examples of cloud vendor data sources are CRMs, email and marketing automation tools, ecommerce-enabling software, and web and mobile interaction tracking. The data is usually considered first-party, since it is about the organization itself and is created by its employees and customers. However, the code that generates and stores the data, as well as the data model, is not controlled by the organization. Many cloud software vendors provide APIs to read the data out into other systems, though building and maintaining connectors is time consuming. Additional cloud vendors now offer services that connect to many common tools and move the data into data warehouses or data lakes, and some on-premise ETL tools also have connectors available for cloud services. The trend toward relying on cloud services doesn’t seem likely to reverse, and fortunately more vendors are making it easier to pull the data out of their infrastructure into data warehouses owned by their customers.

Many cloud vendor data sources provide some reporting capabilities, meaning that moving the data to a data warehouse is not always necessary. The department that interacts with a tool may find that reporting sufficient. For example, a customer service department might use a ticketing tool to manage its workflow; the tool also provides reporting on time to respond and agent productivity, which may be enough for a manager in the department to manage their team. Using customer service interactions as an input to customer lifetime value, or CLTV, however, might require ingesting that data into a data store to integrate it with sales and cancellation data sets.

Tip
A good rule of thumb when deciding whether to import data from a particular data source is to ask whether this data needs to be combined with data sets from other systems in order to produce insights. If the answer is yes, do the work to integrate it. If the answer is no, or you’re not sure yet, wait until there is a stronger case to do so.

2.0.4 Sparse data

Sparse data occurs when there is a small amount of information within a larger set of empty or unimportant information. It might show up as many nulls in a particular column, with only a few values present. Sparse data can occur when events are rare, such as software errors or purchases of products in the long tail of a product catalogue. It can also occur in the early days of a feature or product launch, when only testers or beta customers have access. JSON is one approach that has been developed to deal with sparse data from a writing and storage perspective, as it only stores the data that is present and omits the rest. This is in contrast to a row-store database, which has to allocate space for a field even if there is no value in it.

Sparse data can be problematic for analysis. When events are rare, trends aren’t necessarily meaningful and correlations are hard to distinguish from chance fluctuations. It’s worth profiling your data, discussed in the next section, to understand if and where your data is sparse. Some options are to group infrequent events or items into categories that are more common, exclude the sparse data or time period from the analysis entirely, or to show descriptive statistics along with cautionary explanations that the trends are not necessarily meaningful.
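
A quick way to gauge sparsity is to compare the total row count to the count of non-null values, since counting a column ignores nulls. A sketch, where feature_field is a hypothetical sparsely populated column:

SELECT count(*) as total_rows
,count(feature_field) as rows_with_value        -- count(column) skips nulls
,count(*) - count(feature_field) as null_rows
FROM table
;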

2.0.5 Quantitative vs. qualitative data

Quantitative data is numeric. It measures people, things, and events. Quantitative data can include descriptors, such as customer information, product type, or device configurations, but it comes with numeric information such as price, quantity, or visit duration. Counts, sums, averages, or other numeric functions can be applied to the data. Quantitative data is often machine generated these days, but it doesn’t need to be. Height, weight, and blood pressure recorded on a paper patient intake form are quantitative, as are student quiz scores typed into a spreadsheet by a teacher.

Qualitative data is usually text-based and includes opinions, feelings, and descriptions that aren’t strictly quantitative. Temperature and humidity levels are quantitative; descriptors like ‘hot and humid’ are qualitative. The price a customer paid for a product is quantitative; whether they like or dislike it is qualitative. Survey feedback, customer support inquiries, and social media posts are qualitative. There are whole professions that deal with qualitative data. In a data analysis context, we usually try to quantify the qualitative. One technique for this is to extract keywords or phrases and count their occurrences. We’ll look at this in more detail in Chapter 5 on Text Analysis. Another technique is sentiment analysis, where the structure of language is used to interpret the meaning of the words used, in addition to their frequency. Sentences or other bodies of text can be scored for their level of positivity or negativity, and then counts or averages are used to derive insights that would be hard to summarize otherwise. There have been exciting advances in the field of natural language processing, or NLP, though much of this work is done with tools such as Python.

2.0.6 Categorical vs. continuous

Categorical data is descriptive, and each value is independent of other values. Binary or boolean values are probably the simplest categorical data types: each record is one or the other, true or false, 0 or 1. Categorical data includes countries, colors, and customer names. Categorical data can be ordered or unordered. Elementary school students in the United States progress from first grade, to second grade, to third grade. Each school year is distinct, but students move through them in order. I recently bought a stack of books that included a novel, a memoir, and a history book; there is nothing about their categories that implies or requires a particular reading order. Mathematical functions generally can’t be applied to categorical data, even if the values themselves are numerical. Frequency counts can be quite useful, as we’ll see in the next section. There’s no such thing as the average customer name, but the most commonly occurring country can be useful. The average school grade of my children can be calculated, but there is no such thing as grade 3.67.

In contrast, continuous data is numeric, and any value is technically possible, including fractional or decimal values. Speed, weight, and volume are continuous. Prices are continuous, even if a company tends to set prices at common levels such as $9.99, $19.99, and $29.99. Factor in discounts, summing and averaging across customers, and you will see that price data is in fact continuous. Time is continuous, though we break it up into hours, days, months and other larger and smaller slices that can serve as ordered categorical data.

2.1 Profiling: Distributions

Profiling is the first thing I do when I start working with any new data set. I look at how the data is arranged into schemas and tables. I look at the table names to get familiar with the topics covered, such as customers, orders, or visits. I check out the column names in a few tables, and start to construct a mental model of how the tables relate to each other. For example, the tables might include an order_detail table with line-item breakouts that relates to the order table via an order_id, while the order table relates to the customer table via a customer_id.

The tables generally represent the operations of an organization, or some subset of the operations, so I think about what domain or domains are covered, such as ecommerce, marketing, or product interactions. Working with data is easier when we have knowledge of how the data was generated. Profiling can give clues about this, or about what questions to ask of the source, or people inside or outside the organization responsible for the collection or generation of the data. Even when you collect the data yourself, profiling is useful.

Another detail I check for is how history is represented, if at all. Data sets that are replicas of production databases may not contain previous values for customer addresses, or order statuses, for example, whereas a well constructed data warehouse may have daily snapshots of changing data fields.

Profiling data is related to the concept of Exploratory Data Analysis, or EDA, named by John Tukey. In his book of that name, Tukey described how to analyze data sets by computing various summaries, and visualizing the results. He included techniques for looking at distributions of data, including stem-and-leaf plots, box plots, and histograms.

After checking a few samples of data, I start looking at distributions. Distributions allow me to understand the range of values that exist in the data, and how often they occur, whether there are nulls, and whether negative values exist alongside positive ones. Distributions can be created with continuous or categorical data, and are also called frequencies. In this section we’ll look at how to create histograms, how binning can help us understand the distribution of continuous values, and how to use n-tiles to get more precise about distributions.

2.1.1 Histograms and frequencies

A frequency plot is a way to visualize the number of times something occurs in the data set. The field being profiled is usually plotted on the x-axis, with the count of observations on the y-axis, though graphs can also be made horizontally to accommodate long value names.

SELECT fruit, count(*) as quantity
FROM fruit_inventory
GROUP BY 1
;
Figure 2-1. Frequency plot of fruit inventory

A histogram is a way to visualize the distribution of numerical values in a data set, and will be familiar to those with a statistics background. A basic histogram might show the distribution of ages across a group of customers. Rather than group by the category or descriptor value, as we did with the fruit, here we group by the numerical value and count the entities, or people:

SELECT age, count(customer_id) as customers
FROM customers
GROUP BY 1 
;
Figure 2-2. Customers by Age

One of my favorite SQL interview questions is to ask for a histogram of the number of orders per customer, given a table that contains one record per order. This can’t be solved with a simple query; it requires an intermediate aggregation step. This is also one of the most common ways I profile data, and the solution is not complex:

SELECT orders, count(*) as num_customers
FROM
(
    SELECT customer_id, count(order_id) as orders
    FROM order_table
    GROUP BY 1
) a
GROUP BY 1
;

First the subquery counts the number of orders placed by each customer. Then the outer query uses the number of orders as a category and counts the number of customers with each value. Other aggregations can of course be used in the subquery: min, for example, to create a histogram of customers by first purchase date, or avg for customers by average order size.
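
A sketch of the first of these, assuming the order_table also has an order_date field:

SELECT first_purchase, count(customer_id) as customers
FROM
(
    SELECT customer_id, min(order_date) as first_purchase
    FROM order_table
    GROUP BY 1
) a
GROUP BY 1
;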

2.1.2 Binning

Binning is useful when working with continuous values. Rather than count the number of observations or records for each value, ranges of values are grouped together and these are called bins or buckets. The number of records that fall into each interval is then counted. Bins can be variable or fixed size, depending on whether your goal is to group the data into bins that have particular meaning for the organization, are roughly equal width, or contain roughly equal numbers of records. Bins of varied size are created using CASE statements. Bins of fixed size can be accomplished in a few ways, including with rounding, logarithms, and n-tiles.

The CASE statement is a flexible way to control the number of bins, the range of values that fall into each bin, and how the bins are named. I find case statements useful when there is a long tail of very small or very large values that I want to group together rather than have empty bins in part of the distribution. Sometimes certain ranges of values have a special business meaning that we want to recreate in the data. Many companies selling to other businesses find it useful to separate customers into ‘enterprise’ and ‘SMB’ (small and medium businesses), based on number of employees or revenue, because their buying patterns are different. Imagine we want to group orders into three buckets because it will influence how we offer discounted shipping:

SELECT 
case when order_amount <= 100 then 'up to 100'
     when order_amount <= 500 then '100 - 500'
     else '500+' end as sales_bin
,case when order_amount <= 100 then 'small'
     when order_amount <= 500 then 'medium'
     else 'large' end as sales_bin_name
,count(customer_id) as customers
FROM orders
GROUP BY 1,2
;

To create equal-width bins, rounding is useful. Rounding reduces the precision of the values; we usually think of rounding as reducing the number of decimal places, or removing them altogether by rounding to the nearest integer. The round function takes the form:

round(value,number_of_decimal_places)

You may not realize that the number of decimal places can also be a negative number, allowing this function to round to nearest tens, hundreds, thousands, and so on.

Table: The number 123,456.789 rounded with various decimal places

Decimal places formula result
2 round(123456.789,2) 123456.79
1 round(123456.789,1) 123456.8
0 round(123456.789,0) 123457
-1 round(123456.789,-1) 123460
-2 round(123456.789,-2) 123500
-3 round(123456.789,-3) 123000

For example, to bin sales values to the nearest ten and count the customers in each bin:

SELECT round(sales,-1) as bin, count(customer_id) as customers
FROM table
GROUP BY 1
;

While they don’t create bins of equal width, logarithms create bins that increase in size with a useful pattern. To refresh your memory, a logarithm is the exponent to which 10 must be raised to produce that number:

log(number) = exponent

In this case 10 is called the base, and this is usually the default implementation in databases, but technically the base can be any number.

Table: results of log function on powers of 10

formula result
log(1) 0
log(10) 1
log(100) 2
log(1000) 3
log(10000) 4

Similarly, applying log to the sales values creates bins by order of magnitude:

SELECT log(sales) as bin, count(customer_id) as customers
FROM table
GROUP BY 1
;

Note that the logarithm function does not work when values can be less than or equal to 0. It will return null or an error, depending on the database.
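
One way to work around this, sketched here, is to compute the logarithm only for positive values; rows with zero or negative sales land in a null bin that can be labeled or filtered separately:

SELECT case when sales > 0 then log(sales) end as bin   -- zero or negative values fall into a null bin
,count(customer_id) as customers
FROM table
GROUP BY 1
;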

2.1.3 N-tiles

You’re probably familiar with the median, or middle value, of a data set. This is the 50th percentile value: half of the values are larger, and half are smaller. With quartiles, we fill in the 25th and 75th percentile values. At the 25th percentile, a quarter of the values are smaller and three quarters are larger; at the 75th percentile, three quarters are smaller and one quarter is larger. Deciles break the data set into 10 equal parts. N-tiles make this concept generic, allowing us to calculate any percentile of the data set: the 27th percentile, the 50.5th percentile, and so on.

Many databases have a median function built in, but rely on more generic n-tile functions for the rest. N-tile functions are window functions, computing across a range of rows to return a value for a single row. They take an argument that specifies the number of bins to split the data into, and optionally a partition by and/or order by clause:

ntile(num_bins) over (partition by... order by...)

This can be used to bin records in practice by first calculating the ntile of each row in a subquery, and then wrapping it in an outer query that uses min and max to find the upper and lower boundaries of the value range:

SELECT ntile
,min(order_amount) as lower_bound
,max(order_amount) as upper_bound
,count(order_id) as orders
FROM
(
    SELECT customer_id, order_id, order_amount
    ,ntile(10) over (order by order_amount) as ntile
    FROM orders
) a
GROUP BY 1
;

A related function is percent_rank. Instead of returning the bins that the data falls into, percent_rank returns the percentile. It takes no argument but requires parentheses, and optionally takes a partition by and/or order by clause:

percent_rank() over (partition by... order by...)

While not as useful as ntile for binning, percent_rank can be used to create a continuous distribution, or as an output itself for reporting or further analysis.
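
As a brief sketch, using the orders table from the previous example:

SELECT customer_id, order_id, order_amount
,percent_rank() over (order by order_amount) as pct_rank   -- 0 for the smallest order_amount, 1 for the largest
FROM orders
;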

Ntile and percent_rank can be expensive to compute over large data sets, since they require sorting all of the rows. Filtering the table to only the data set you need helps. Some databases have also implemented approximate versions of the functions that are faster to compute and generally return high quality results if absolute precision is not required.

In many contexts there is no single correct or objectively best way to look at distributions of data, and there is significant leeway for analysts to use these techniques to understand data and present it to others. However, data scientists need to use judgment and bring their ethical radar along whenever sharing distributions of sensitive data.

2.2 Profiling: Data Quality

Data quality is absolutely critical when it comes to creating good analysis. Although this may seem obvious, it has been one of the hardest lessons I’ve learned in my years of working with data. It’s easy to get caught up in the mechanics of processing the data, finding clever query techniques and just the right visualization, only to have stakeholders ignore all of that and point out the one data inconsistency. Ensuring data quality can be one of the hardest and most frustrating parts of analysis. The saying “garbage in, garbage out” captures only part of it. Good ingredients in plus incorrect assumptions can also lead to garbage out.

Comparing data against ground truth, or what is otherwise known to be true, is ideal though not always possible. For example, if you are working with a replica of a production database, you could compare the row counts in each system to verify that all of the rows arrived in the replica database. In other cases, you might know the dollar value and count of sales in a particular month, and so can query for this information in the database to make sure the sum of sales and count of records match. Often the difference between your query results and the expected values comes down to applying the correct filters, such as whether to include cancelled orders or to filter out test accounts; how you handle things like nulls and spelling anomalies; and whether you’ve set up correct join conditions between tables.
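
For example, a quick aggregation by month can be compared against totals known from the source system. This is only a sketch, assuming an orders table and a Postgres-style date_trunc function:

SELECT date_trunc('month',order_date) as order_month
,count(order_id) as orders               -- compare to the known count of sales
,sum(order_amount) as total_amount       -- compare to the known dollar value
FROM orders
GROUP BY 1
;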

Profiling is a way to uncover data quality issues early, before they negatively impact results and conclusions drawn from the data. Profiling reveals nulls, categorical codings that need to be deciphered, fields with multiple values that need to be parsed, and unusual datetime formats. Profiling can also uncover gaps and step changes in the data that have resulted from tracking changes or outages. Data is rarely perfect, and it’s often only through use in analysis that data quality issues are uncovered.

2.2.1 Detecting duplicates

A duplicate is any time you have two (or more) rows with the same information. Duplicates can exist for any number of reasons. A mistake might have been made during data entry if there is a manual step. A tracking call might have fired twice. A processing step might have run multiple times. You might have created duplicates accidentally with a hidden one-to-many join. However they came to be, duplicates can really throw a wrench in your analysis. I can recall times early in my career when I thought I had a great finding, only to have a product manager point out that my sales figure was twice the actual sales. It’s embarrassing, it erodes trust, and it requires rework and sometimes painstaking reviews of the code to find the problem. I’ve learned to check for duplicates as I go.

Fortunately, it’s relatively easy to find duplicates in our data. One way is to inspect a sample, with all columns ordered:

SELECT column_a, column_b, column_c...
FROM table
ORDER BY 1,2,3...
;

This will reveal whether the data is full of duplicates, and it is useful when looking at a brand-new data set to understand how it’s composed, when you suspect that a process is generating duplicates, or after a join that you suspect might have created a Cartesian product.

However, if there are only a few duplicates, they might not show up in the sample. It’s also taxing on your eyes and brain to scroll through and spot the duplicates. A more systematic way to find duplicates is to group by all of the columns, count the records in each group, and then check for groups with more than one record (this might look familiar from the discussion of histograms!):

SELECT count(*)
FROM
(
    SELECT column_a, column_b, column_c…, count(*) as records
    FROM…
    GROUP BY 1,2,3...
) a
WHERE records > 1
;

This will tell you whether there are any cases of duplicates. If the query returns 0, you’re good to go. For more detail you can list out the number of records (2, 3, 4, etc.):

SELECT records, count(*)
FROM
(
    SELECT column_a, column_b, column_c…, count(*) as records
    FROM…
    GROUP BY 1,2,3...
) a
WHERE records > 1
GROUP BY 1
;

And for full detail on which records have duplicates, you can list out all of the fields and then use this information to chase down the problematic records:

SELECT *
FROM
(
    SELECT column_a, column_b, column_c…, count(*) as records
    FROM…
    GROUP BY 1,2,3...
) a
WHERE records > 1
;

Detecting duplicates is one thing, figuring out what to do about them is another. It’s almost always useful to understand why duplicates are occurring, and if possible fix the problem upstream. Can a data process be improved to reduce or remove duplication? Is there an error in an ETL process? Have you failed to account for a one-to-many relationship in a JOIN? Next we’ll turn to how to handle and remove duplicates.

2.2.2 Deduplication with GROUP BY and DISTINCT

Duplicates happen, and they’re not always a result of bad data. For example, take a join from a customer table to a transaction table. Hopefully some of our customers have more than one transaction. Imagine we want to generate a list of all customers with transactions in order to send them an email. We might write a query like this:

SELECT customer_id, customer_name, customer_email
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
;

This will return a row for each customer for each transaction, however, which is probably not what we want; it could result in some customers getting multiple copies of the same email. (Many email tools filter out duplicates, but we shouldn’t rely on someone else catching our errors!)

One way to remove duplicates is to use the keyword distinct:

SELECT distinct customer_id, customer_name, customer_email
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
;

Another option is to use a GROUP BY:

SELECT customer_id, customer_name, customer_email
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
GROUP BY 1,2,3
;

We typically see GROUP BY when doing some kind of aggregation, and that is how I more often use it:

SELECT customer_id, customer_name, customer_email, count(*)
FROM customers a
JOIN transactions b on a.customer_id = b.customer_id
GROUP BY 1,2,3
;

I remember the first time I saw a colleague use GROUP BY to dedupe. I didn’t even realize it was possible to use GROUP BY without some kind of aggregation. Most of the time I use the last form. It’s partially a habit, and partially so I have a trace of which records had dupes. It’s also often useful for counting transactions, which I might want in the output anyway. I also use distinct when having that extra column of output is a problem. I find GROUP BY without an aggregation to be slightly less readable. There may be some performance differences on some databases, though in my experience they work out pretty much the same. It’s worth experimenting with if performance is a particular concern, but the data returned should be the same either way.

Sometimes we want to “smooth out” the duplicates by taking the maximum or minimum value across them. For example, if we have a number of transactions by the same customer and need to dedupe to one record per customer, we might want to find the minimum (first) and/or the maximum (most recent) purchase date:

SELECT customer_id
,min(transaction_date) as first_transaction_date
,max(transaction_date) as last_transaction_date
,count(*) as total_orders
FROM table
GROUP BY customer_id
;

Window (analytic) functions can be useful when we want to preserve other attributes from the first or last record. Apply the window function to the whole data set in a subquery, and then use the outer query to restrict the final result to just the rows desired:

SELECT customer_id, transaction_date, source, amount
FROM
(
    SELECT customer_id, transaction_date, source, amount
    ,rank() over (partition by customer_id order by transaction_date) as rank
    FROM table
    WHERE …
) a
WHERE rank = 1
;

Row_number could also be used in place of rank() here; rank() returns multiple rows when there are ties on transaction_date, whereas row_number() returns exactly one row per customer.

2.2.3 Missing data

Data can be missing for a variety of reasons, with different implications for how you handle it. The field might not have been required by the system or process that collected it, such as an optional ‘how did you hear about us?’ field in an ecommerce checkout flow. Requiring this field might create friction for the customer and decrease successful checkouts. Alternatively, data can normally be required but fail to be collected due to a code bug or human error, such as in a medical questionnaire where the interviewer missed the second page of questions. A change in the way the data was collected can result in records before or after the change having missing values. A tool tracking mobile app interactions might add an additional field recording whether the interaction was a tap or scroll, for example, or remove another field due to a functionality change. Finally, data may be available but not at the level of detail, or granularity, needed for the analysis. An example of this comes from subscription businesses, where customers pay on an annual basis for a monthly product, and we want to analyze monthly revenue.

Missing data can be an important signal in and of itself, so don’t assume that it always needs to be fixed or filled. Missing data can reveal underlying system design, or biases in the data collection process that can be valuable insights alongside other analysis.

Records with missing fields can be filtered out entirely, but often we want to keep them and instead make some adjustments based on what we know about expected or typical values. We’ll talk about handling null values shortly. We have some options for filling in missing data.

One option is to fill missing data with a constant value. Filling with a constant value might be useful if, for example, item_price isn’t filled for some records, but I know that the price for item XYZ is always $20.

case when item_price is null and item_name = 'xyz' then 20 else item_price end as item_price

Another option is to fill with a derived value, either a mathematical function on other columns, or a case statement. For example, imagine we wanted to know the age of a customer, but only have the date of birth, or birth year. We can compute the age as of the date of the record:

SELECT datediff('year',date_of_birth,transaction_date)...

Alternatively we might want to calculate the current age of a customer:

SELECT datediff('year',date_of_birth,now())...

For data that is available, but not at the granularity needed, we often need to create additional rows in the data set. Taking the subscription example, imagine we have the subscription date and the annual subscription amount, and Finance has requested a monthly revenue report with some slices by customer type. We can fill with fractional values, spreading the annual subscription amount into 12 equal monthly revenue amounts; this converts ARR, or annual recurring revenue, into MRR, or monthly recurring revenue (the same approach works for quarterly data). We’ll assume we can just divide the yearly amount by 12 to get the monthly revenue, but we still have only one record per customer. We could create 12 new columns, one for each month of the subscription:

SELECT customer_id
,order_date
,order_amount
,order_amount / 12 as month_1
,order_amount / 12 as month_2
...
,order_amount / 12 as month_12
FROM orders
;

This gets a bit tedious, particularly if subscription periods can be not only one year, but two, three or five years. It’s also not helpful if what we want is the actual dates of the months. In theory we could write a query like this:

SELECT customer_id
,order_date
,order_amount
,order_amount / 12 as "2020-01"
,order_amount / 12 as "2020-02"
...
,order_amount / 12 as "2020-12"
FROM orders
;

However, if the data includes orders from customers across time, hard-coding the month names won’t be accurate. We could use case statements in combination with hard-coded month names, but again this is tedious and is likely to be error-prone as you add more convoluted logic. Instead, creating new rows through a join to a table such as a date dimension provides an elegant solution.

A date dimension is a static table that has one row per day, with optional extended date attributes, such as day of the week, month name, end of month, and fiscal year. The dates extend far enough into the past and far enough into the future to cover all anticipated uses. Because there are only 365 or 366 days per year, tables covering even 100 years don’t take up a lot of space.

Figure 2-3. A date dimension with date attributes

If you’re using a Postgres database, the GENERATE_SERIES function can be used to create a date dimension either in order to populate the table initially, or if creating a table is not an option. It takes the form:

generate_series(start, stop, step interval)

Start is the first date you want in the series, stop is the last date, and step interval is the time period between values. The step interval can take any value, but one day is appropriate for a date dimension:

SELECT * 
FROM generate_series('2000-01-01'::timestamp,'2030-12-31','1 day')
;
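
Extended date attributes can be derived from the generated dates with date functions. A sketch using Postgres functions (the attribute list is illustrative, not exhaustive):

SELECT generate_series::date as date
,date_part('dow',generate_series) as day_of_week                 -- 0 = Sunday through 6 = Saturday
,to_char(generate_series,'Month') as month_name
,date_trunc('month',generate_series)::date as first_day_of_month
FROM generate_series('2000-01-01'::timestamp,'2030-12-31','1 day')
;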

A query that generates a row for every day, regardless of whether a customer ordered on a particular day, looks like this:

SELECT a.generate_series as order_date, b.customer_id, b.items
FROM
(
    SELECT *
    FROM generate_series('2020-01-01'::timestamp,'2020-12-31','1 day')
) a
LEFT JOIN 
(
    SELECT customer_id, order_date, count(item_id) as items
    FROM orders
    GROUP BY 1,2
) b on a.generate_series = b.order_date
;

Returning to our subscription example, in this case we want a record not for every day but for every month. Joining each subscription to the range of dates it covers, and then keeping only one day per month (here the first of the month), creates the 12 monthly records we need:

SELECT a.date
,b.customer_id
,b.subscribe_date
,b.annual_subscription / 12 as monthly_subscription
FROM date_dim a
JOIN customer_subscriptions b on a.date between date_trunc('month',b.subscribe_date) 
 and b.subscribe_date + interval '11 months'
WHERE date_part('day',a.date) = 1    -- keep one row per month
;

2.3 Data cleaning

Profiling often reveals where changes can make the data more useful for analysis. Some of the most common steps are CASE transformations, adjusting for nulls, and changing data types. We’ll cover text transformations in depth in Chapter 5.

2.3.1 CASE transformations

Sometimes the data exists and is accurate, but it would be more useful for analysis if values were standardized or grouped into categories.

Non-standard values occur for a variety of reasons. Values might come from different systems with slightly different lists of choices, system code might have changed, options might have been presented to the customer in different languages, or the customer might have been able to fill out the value rather than pick from a list. Imagine a field containing information about the gender of a person. Values indicating a female person exist as ‘F’, ‘female’, ‘femme’. We can standardize the values like this:

CASE when gender = 'F' then 'Female'
     when gender = 'female' then 'Female'
     when gender = 'femme' then 'Female'
     else gender 
     end as gender_cleaned

The database evaluates each when… then… in order and stops executing when a match is found, so it’s important to consider the order of the statements. Else tells the database what to use as a default value if no matches are found; it can be a field name or a constant value. Else is optional, and if it’s not included, any non-matches will return null. The return values following each then… must be the same data type (strings, numeric, boolean, etc.) or you’ll get an error.

CASE statements can also be used to add categorization or enrichment that does not exist in the original data. As an example, many organizations use a Net Promoter Score, or NPS, to monitor customer sentiment. NPS surveys ask respondents to rate, on a scale of 0 to 10, how likely they are to recommend a company or product to a friend or colleague. Scores of 0 to 6 are considered detractors, 7 and 8 are passive, and 9 and 10 are promoters. The final score is calculated by subtracting the percent of detractors from the percent of promoters. Survey result data sets usually include optional free text comments, and are sometimes enriched with information the organization knows about the person surveyed. Given a data set of NPS survey responses, the first step is to group the responses into the categories of detractor, passive, and promoter:

SELECT response_id
,likelihood
,case when likelihood <= 6 then 'Detractor'
      when likelihood <= 8 then 'Passive'
      else 'Promoter'
      end as response_type
FROM nps_responses
;

Note that the data type can differ between the field being checked and the return data type; in this case we are checking an integer and returning a string. Listing out all of the values, instead of checking greater than or less than, is also an option, and is useful when the input isn’t continuous or when values that are in order shouldn’t be grouped together:

case when likelihood in (0,1,2,3,4,5,6) then 'Detractor'
     when likelihood in (7,8) then 'Passive'
     when likelihood in (9,10) then 'Promoter'
     end as response_type

CASE statements can consider multiple columns, and contain AND/OR logic. They can also be nested, though often this can be avoided with AND/OR logic, as shown below.

case when likelihood <= 6 and country = 'US' and high_value = true then 'US high value detractor'
     when likelihood >= 9 and (country in ('CA','JP') or high_value = true) then 'some other label'
     ... end

Cleaning or enriching data with a CASE statement works well as long as there is a relatively short list of variations, you can find them all in the data, and the list of values isn’t expected to change. For longer lists and ones that change frequently, a lookup table can be a better option. A lookup table exists in the database and is either static or populated with code that checks for new values periodically. The query joins to the lookup table to get the cleaned data; in this way the cleaned values can be maintained outside of your code and used by many queries, without worrying about maintaining consistency between them. An example of this might be a lookup table that maps state abbreviations to full state names. In my own work, I often start with a case statement and create a lookup table only once the list becomes unruly, or once it’s clear that I or my team will need to use this cleaning step repeatedly.
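
A minimal sketch of the lookup pattern, using a hypothetical state_lookup table that maps abbreviations to full names:

SELECT a.customer_id
,a.state as state_abbrev
,b.state_name as state_cleaned        -- cleaned value comes from the lookup table
FROM customers a
LEFT JOIN state_lookup b on a.state = b.state_abbrev
;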

Of course, it’s worth investigating whether the data can be cleaned upstream. I once started with a case statement that had 5 or so lines that grew to 10, then eventually to over 100, at which point it was unruly and difficult to maintain. The insights were valuable enough that I was able to convince engineers to change the tracking code, and send the meaningful categorizations in the data stream in the first place.

Another useful thing you can do with CASE statements is to create flags indicating whether a certain value is present, without returning the actual value. This can be useful during profiling, to understand how common a particular attribute is. Another use for flagging is preparing a data set for statistical analysis. In this context a flag is also known as a dummy variable, taking a value of 0 or 1 and indicating the presence or absence of some qualitative variable. A flag or dummy variable can be constructed as:

SELECT customer_id
,case when gender = 'F' then 1 else 0 end as is_female
,case when likelihood in (9,10) then 1 else 0 end as is_promoter
FROM ...
;

If you are working with a data set that has multiple rows per entity, such as with line items in an order, you can flatten and flag at the same time:

SELECT customer_id
,max(case when fruit = 'apple' then 1 else 0 end) as bought_apples
,max(case when fruit = 'orange' then 1 else 0 end) as bought_oranges
FROM ...
GROUP BY 1
;

You can also construct more complex conditions for flags, such as requiring a threshold or amount of something before labeling with a value of one:

SELECT customer_id
,max(case when fruit = 'apple' and quantity > 5 then 1 else 0 end) as loves_apples
,max(case when fruit = 'orange' and quantity > 5 then 1 else 0 end) as loves_oranges
FROM ...
GROUP BY 1
;

CASE statements are powerful, and as we saw they can be used to clean, enrich and flag or add dummy variables to data sets. In the next section we’ll look at some special functions related to CASE statements that handle null values specifically.

2.3.2 Dealing with nulls: COALESCE, NULLIF, NVL

Null was one of the stranger concepts I had to get used to when I started working with data. Null isn’t something we think about in daily life, where we’re used to dealing in concrete quantities of things. Null has a special meaning in databases; it was introduced by E.F. Codd, the inventor of the relational database, to ensure that a database has a way to represent missing information. If someone asks me how many parachutes I have, I can answer zero. But if the question is never asked, I have null parachutes.

Nulls can represent fields for which no data was collected, or that aren’t applicable for that row. When new columns are added to a table, the values for previously created rows will be null unless explicitly filled with some other value. When two tables are joined via an outer join, nulls will appear in any fields for which there is no matching record in the second table.

Nulls are problematic for certain aggregations and for grouping, and different types of databases handle them in different ways. For example, imagine I have five records, with values of 5, 10, 15, 20, and null. The sum of these numbers is 50, but the average is either 10 or 12.5, depending on whether the null value is counted in the denominator, or arguably the whole question is invalid since one of the values is null.

When tables are defined, they can either allow nulls, reject nulls, or populate a default value if the field would otherwise be left null. In practice, this means that you can’t always rely on a field to show up as null if the data is missing, because it may have been filled with a default value such as 0. I once had a long debate with a data engineer when it turned out that null dates in the source system were defaulting to ‘1970-01-01’ in our data warehouse. I insisted that the dates should be null instead, to reflect the fact that they were unknown or not applicable. The engineer pointed out that I could remember to filter those dates, or change them back to null with a case statement. I finally prevailed by pointing out that one day another user who wasn’t as aware of the nuances of default dates would come along, run a query, and get the puzzling result that many customers clustered around a year long before the company was even founded.

Nulls are often inconvenient or inappropriate for the analysis we want to do. They can also make output confusing to the intended audience for your analysis. Business people don’t necessarily understand how to interpret a null value or may assume there is a problem with data quality. There are a few ways to replace nulls with alternate values: CASE statements, and the specialized COALESCE and NULLIF functions. We saw previously that CASE statements can check a condition and return a value. This can also be used to check for a null, and if found, replace it with another value:

case when num_orders is null then 0 else num_orders end
case when address is null then 'Unknown' else address end
case when column_a is null then column_b else column_a end

The COALESCE function is a more compact way to achieve this. It takes two or more arguments, and returns the first one that is not null:

coalesce(num_orders,0)
coalesce(address,'Unknown')
coalesce(column_a,column_b)
coalesce(column_a,column_b,column_c)

The function NVL exists in some databases, and is similar to COALESCE, but allows only two arguments.

The NULLIF function compares two values; if they are not equal, it returns the first value, and if they are equal, the function returns null.

nullif(6,7) 

returns 6, whereas

nullif(6,6)

returns null.

NULLIF is equivalent to the following more wordy case statement:

case when 6 = 7 then 6 
     when 6 = 6 then null
     end

This function can be useful for turning values back into nulls when you know a certain default value has been inserted into the database. For example, with the default date example from earlier, we could change ‘1970-01-01’ back to null by using:

nullif(date,'1970-01-01') 

Nulls can be problematic when filtering data in the WHERE clause using not equal to or not in a list: null values will be excluded along with the values you intend to exclude, because a comparison with null evaluates to neither true nor false. To retain the null rows, add an or expression to the WHERE clause:

WHERE my_field is null or my_field <> 'apple'

Tip
To figure out how your database handles nulls you could scour the documentation, or you can run a quick query on a column that you know includes nulls from your profiling, and examine whether the results include nulls when expected.

A concept related to nulls but slightly different is empty string, where there is no value but the field is not technically null. One reason an empty string might be used is to indicate that a field is known to be blank, as opposed to null where the value might be missing or unknown. For example, the database might have a name_suffix field that can be used to hold a value such as ‘Jr.’ Many people do not have a name_suffix, so empty string is appropriate. Empty string can also be used as a default value instead of null, or as a way to overcome a NOT NULL constraint by inserting a value, even if empty. An empty string can be specified in a query with two quote marks:

WHERE my_field = '' or my_field <> 'apple'

Profiling the frequencies of values should reveal whether your data includes nulls, empty strings, or both.
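
A sketch of such a profiling query, which buckets each row by whether the field is null, empty, or populated:

SELECT case when my_field is null then 'null'
            when my_field = '' then 'empty string'
            else 'has a value' end as value_status
,count(*) as records
FROM table
GROUP BY 1
;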

2.3.3 Casting and type conversions

Every field in a database is defined with a data type, which we reviewed at the beginning of this chapter. Data that isn’t of the field’s type will be rejected: strings can’t be inserted into integer fields, and booleans are not allowed in date fields. Most of the time we can take the data types for granted and apply string functions to strings, date functions to dates, and so on. Occasionally, however, we need to override the data type of the field and force it to be something else. This is where casting and type conversions come in.

Type conversion functions allow pieces of data in the appropriate format to be changed from one data type to another. The syntax comes in a few forms that are basically equivalent. One way to change the data type is with the CAST function, cast (input as data_type); another is the double colon, input :: data_type. Both of these are equivalent, and both of the following convert the integer 1234 to a string:

cast (1234 as varchar)
1234::varchar

Converting an integer to a string can be useful in case statements when categorizing numeric values with an unbounded upper or lower bin. For example:

case when order_items <= 3 then order_items
     else '4+' 
     end

Leaving the values that are less than or equal to 3 as integers, while returning the string ‘4+’ for higher values would result in an error. Casting the integers to char or varchar type solves the problem:

case when order_items <= 3 then order_items::varchar
     else '4+' 
     end

Another example of how this comes in useful is when you have parsed an integer out of a string, and then want to aggregate it or use mathematical functions. We’ll cover string parsing in more detail in Chapter 5, but imagine we have a data set of sales data, but the data was entered into the database with the $ sign, and so the field was created as a varchar data type. Removing the $ is straightforward with the replace function:

replace('$19.99','$','')

The result is still a string, however, and so trying to apply an aggregation will return an error. To fix this, cast the result as a float:

replace('$19.99','$','')::float
cast(replace('$19.99','$','') as float)

Dates and datetimes can come in a bewildering array of formats, and knowing how to cast them to the desired format is useful. As a simple example, transaction or event data often arrives in the database as a datetime, but often we want to summarize by day. Simply grouping by the timestamp will result in more rows than necessary. Casting the timestamp to a date reduces the size of the results, and achieves our goal:

SELECT tx_timestamp::date, count(transactions) as num_transactions
FROM ...
GROUP BY 1
;

Likewise, a date can be cast to a timestamp if, for example, a particular function requires a timestamp as an input. Sometimes the year, month, and day are stored in separate columns, or end up as separate elements because they’ve been parsed out of a longer string. These then need to be assembled back into a date. To do this, first concatenate the values, which will result in a string. Then cast as a date. Any of these syntaxes works and returns the same value:

(year || '-' || month || '-' || day)::date

or equivalently:

cast(concat(year, '-', month, '-', day) as date)

Another way to convert between data types is to use a DATATYPE function, where you specify the type and include the expression as the argument:

date(concat(year, '-', month, '-', day))

The TO_DATATYPE functions can take both a value and a format string, and so give you more control over how the data is converted. They are particularly useful when converting in and out of date or datetime formats, as they allow you to specify the order of the date and time elements.

Function      Purpose
to_char       Converts other types to string
to_number     Converts other types to numeric
to_date       Converts other types to date, with specified date parts
to_timestamp  Converts other types to timestamp, with specified date and time parts
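
For example, with Postgres-style format strings (a sketch; the exact format tokens vary by database), to_date can parse a string in which the month comes first, and to_char can turn a timestamp back into a string:

to_date('01/15/2020','MM/DD/YYYY')        -- returns the date 2020-01-15
to_char(current_timestamp,'YYYY-MM-DD')   -- returns the current date as a string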

Sometimes the database automatically converts a data type. This is called type coercion. For example, integers and float numerics can usually be used together in mathematical functions or aggregations without explicitly changing the type. Char and varchar values can usually be mixed. Some databases will coerce boolean fields to 0 and 1 values, where 0 is false and 1 is true, but some databases require you to convert the values explicitly. Some databases are pickier than others about mixing dates and datetimes in result sets and functions. You can read through the documentation, or you can do some simple query experiments to learn how the database you’re working with handles data types implicitly and explicitly. There is usually a way to accomplish what you want, though sometimes you need to get creative in using functions in your queries.
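
A couple of quick experiments along these lines, as a sketch assuming a Postgres-style database (results may differ elsewhere):

SELECT 1 + 2.5;               -- integer and numeric are coerced, returning 3.5
SELECT cast(true as integer); -- explicit conversion of a boolean, returning 1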

2.3 Shaping Data

At the most basic level, data shape refers to how the data is represented in columns and rows. Each table in the database has a shape. The result set of each query has a shape. Shaping data is a rather abstract concept, but it’s a skill that can be learned, practiced, and mastered.

One of the most important concepts in shaping data is figuring out the granularity (or level of detail) of the data needed. Just as rocks can range in size from giant boulders down to grains of sand, and even further down to microscopic dust, so too can data have varying levels of detail. For example, if the population of a country is a boulder, the population of a city is a small rock, and that of a household is a grain of sand. Data at an even finer level of detail might include individual births, deaths, and moves from one city or country to another.

Flattening data is another important concept in shaping. This refers to reducing the number of rows that represent an entity, including down to a single row. Joining multiple tables together to create a single output data set is one way to flatten data. Another way is through aggregation.

In this section we’ll first cover some considerations for choosing data shapes. Then we’ll look at some common use cases: pivoting and unpivoting. Later chapters will also touch on shaping data, and Chapter 8 will go into more detail on keeping complex SQL organized when creating data sets for further analysis.

2.3.1 For which output: BI, visualization, statistics, ML

Deciding how to shape your data with SQL depends a lot on what you are planning to do with it afterwards. It’s generally a good idea to output a data set that has as few rows as possible, and still meets your need for granularity. This will leverage the computing power of the database, reduce the time it takes to move data from the database somewhere else, and reduce the amount of processing you or someone else needs to do in other tools. Some of the other tools that your output might go to are a BI tool for reporting and dashboarding, a spreadsheet for business users to examine, a statistics tool such as R, a machine learning model in Python, or straight to a visualization created with a range of tools.

When outputting data to a business intelligence tool for reports and dashboards, it’s important to understand the use case. Data sets may need to be generic and detailed, in order to enable exploration by end users. They may need to be small, aggregated, and include specific calculations to enable fast loading and response times in executive dashboards. Understanding how the tool works, and whether it performs better with smaller data sets, or is architected to perform its own aggregations across larger data sets, is important. There is no one size fits all answer. The more you know about how the data will be used, the better prepared you will be to shape the data appropriately.

Smaller, aggregated, and highly specific data sets often work best for visualizations, whether created in commercial software or in a programming language like R, Python, or JavaScript. Think about the level of aggregation needed and the slices, or various elements, that end users will need to filter on. Sometimes the data sets require a row for each slice, as well as an ‘everything’ slice. You may need to UNION together two queries, one at the detail level and one at the ‘everything’ level.
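
For example, a product-level slice combined with an ‘everything’ slice might be constructed like this (a sketch, assuming a hypothetical orders table with product and order_amount fields):

SELECT product, sum(order_amount) as total_amount
FROM orders
GROUP BY 1
	UNION ALL
SELECT 'All products' as product, sum(order_amount) as total_amount
FROM orders
;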

When creating output for statistics packages or machine learning models, it’s important to understand the core entity being studied, the level of aggregation desired, and the attributes or features needed. For example, a model might need one record per customer with several attributes, or a record per transaction with its associated attributes as well as customer attributes. Generally the output for modeling will follow the notion of “tidy data” proposed by Hadley Wickham. Tidy data has these properties:

  1. Each variable forms a column

  2. Each observation forms a row

  3. Each value is a cell

Data sets constructed with SQL can take any number of forms or shapes.

2.3.2 Pivoting with CASE statements

A pivot table is a summary of the statistics calculated from a larger data set. The summaries are aggregate functions applied to the underlying data, such as SUM, COUNT, AVERAGE, MIN, or MAX. The commercialization of the concept goes back to the early 1990s, when Lotus Development released the Improv spreadsheet program. Pivot tables are widely known from their implementation in Microsoft Excel, which has a drag-and-drop interface for creating summaries of data.

Pivot tables, or pivoted output, can be created in SQL using a CASE statement along with one or more aggregation functions. For example, imagine you have a data set that includes all purchases made by your customers, with each purchase as a separate line item. You could pivot, or summarize, the data by customer with:

SELECT customer_id
,count(distinct order_id) as num_orders
,sum(order_amount) as total_amount
FROM orders
GROUP BY 1
; 

Now imagine you have a similar purchase data set, with customers, the products purchased, and the purchase date, and you want to create one record per day, with the sales of each product summarized. Here a set of case statements can transform the data:

SELECT order_date
,sum(case when product = 'shirt' then order_amount else 0 end) as shirts_amount
,count(case when product = 'shirt' then customer_id end) as shirt_customers
,sum(case when product = 'shoes' then order_amount else 0 end) as shoes_amount
,count(case when product = 'shoes' then customer_id end) as shoes_customers
...
FROM orders
GROUP BY 1
;

Note that with the sum aggregation you can optionally use ‘else 0’ to avoid nulls in the result set. With count or count distinct, however, you should not include an else clause, or you will inflate the counts. This is because the database won’t count a null, but it will count a substitute value such as zero.
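
To see why, compare the two forms, using the fields from the query above:

count(case when product = 'shirt' then customer_id end)        -- counts only rows with shirt purchases
count(case when product = 'shirt' then customer_id else 0 end) -- counts every row, because the 0 is counted too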

Pivoting with case statements is quite handy. Having this ability also opens up data warehouse table designs that are long but not wide, which can be better for storing sparse data, since adding columns to a table can be an expensive operation. This technique works well when there are a finite number of items to pivot. For people who have worked with other programming languages, it’s essentially looping, but written out explicitly line by line. This gives you a lot of control, such as when you want to calculate different metrics in each column, but it can also be tedious. Pivoting with case statements also doesn’t work well when new values arrive constantly or change rapidly. In those cases, pushing the computation to another layer of your analysis stack, such as a BI tool or statistical language, is more appropriate.
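
As an example of the long, narrow design mentioned above, imagine a hypothetical daily_metrics table with one row per date and metric, stored in metric_name and metric_value columns. The same case statement technique pivots it back into a wide report (a sketch):

SELECT metric_date
,sum(case when metric_name = 'new_customers' then metric_value end) as new_customers
,sum(case when metric_name = 'orders' then metric_value end) as orders
FROM daily_metrics
GROUP BY 1
;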

2.3.3 Unpivot with UNION statements

Sometimes we have the opposite problem, and need to move data stored in columns into rows instead. This generally results in more lines of code, but it can be done. A common example of this kind of data set is one that has a single row per entity, with the metric stored in a column for each time period. For example, the population of North American countries at 10-year intervals starting in 1980 was as follows:

Table: country population by year (in thousands)3

Country        year_1980  year_1990  year_2000  year_2010
Canada            24,593     27,791     31,100     34,207
Mexico            68,347     84,634     99,775    114,061
United States    227,225    249,623    282,162    309,326

To turn this into a result set with a row per country per year, we can use a UNION ALL:

SELECT country
,'1980' as year
,year_1980 as population
FROM country_populations
	UNION ALL
SELECT country
,'1990' as year
,year_1990 as population
FROM country_populations
	UNION ALL
SELECT country
,'2000' as year
,year_2000 as population
FROM country_populations
	UNION ALL
SELECT country
,'2010' as year
,year_2010 as population
FROM country_populations
;

In this example we hard-code, or use a constant for, the year, in order to keep track of which year each population value corresponds to. The hard-coded values can be of any type, depending on your use case. You may need to explicitly cast certain hard-coded values, such as when entering a date:

'2020-01-01'::date as date_of_interest

What is the difference between UNION and UNION ALL? Both can be used to append, or stack, data together in this fashion, but they are slightly different: UNION removes duplicates from the result set, whereas UNION ALL retains all records, duplicate or not. UNION ALL is faster, since the database doesn’t have to make a pass over the data to find duplicates. It also ensures that every record ends up in the result set. I tend to use UNION ALL, and only use UNION when I have a reason to suspect duplicate data.
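
A quick way to see the difference is to stack a query on top of itself. As a sketch, using the country_populations table from above, UNION returns each country once, while UNION ALL would return each country twice:

SELECT country FROM country_populations
	UNION
SELECT country FROM country_populations
;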

Unioning data can also be useful to bring together data from different sources. For example, imagine you have a table with yearly country population, and another with yearly gross domestic product, or GDP. One option is to join the tables, and obtain a result set with one column for population and another for GDP:

SELECT a.country, a.population, b.gdp
FROM populations a
JOIN gdp b on a.country = b.country
;

Another option is to UNION ALL the data sets, and end up with a stacked data set:

SELECT country, 'population' as metric, population as metric_value
FROM populations
	UNION ALL
SELECT country, 'gdp' as metric, gdp as metric_value
FROM gdp
;

Which approach to use largely depends on the output that you need for your analysis. The latter option can be useful when you have a number of different metrics in different tables, and no single table has a full set of entities, in this case countries. This is an alternative approach to a FULL OUTER JOIN.

2.3.4 PIVOT and UNPIVOT

Recognizing that the pivot and unpivot use cases are common, some database vendors have implemented functions to accomplish them with fewer lines of code. Microsoft SQL Server and Snowflake have PIVOT functions that take the form of extra expressions in the FROM clause:

SELECT...
FROM... 
    pivot( aggregation(value_column) for label_column in (label_1, label_2, ...))
;
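
For example, the shirt and shoe sales from the earlier orders example might be pivoted along these lines. This is a sketch of Snowflake-style syntax, assuming a hypothetical order_products table with only customer_id, product, and order_amount columns; quoting and aliasing details vary by database:

SELECT *
FROM order_products
    pivot(sum(order_amount) for product in ('shirt', 'shoes')) as p
;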

Although this syntax is more compact than the CASE construction we saw earlier, the desired columns still need to be specified. As a result, PIVOT doesn’t solve the problem of newly arriving or rapidly changing sets of fields that need to be turned into columns. Postgres has a similar CROSSTAB function, available in the tablefunc module.

Microsoft SQL Server and Snowflake also have UNPIVOT functions that work in a similar fashion, as expressions in the FROM clause, and transform columns into rows:

SELECT...
FROM... 
    unpivot( value_column for label_column in (label_1, label_2, ...))
;
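
For example, the country population table could be unpivoted along these lines (again a sketch of Snowflake-style syntax):

SELECT country, year, population
FROM country_populations
    unpivot(population for year in (year_1980, year_1990, year_2000, year_2010))
;

Note that the year column will contain the original column names, such as YEAR_1980, rather than bare year values, so some string cleanup may still be needed to match the UNION ALL output exactly.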

Here again the syntax is more compact than the UNION or UNION ALL approach we looked at earlier, but the list of columns must be specified in the query.

Postgres has an UNNEST array function that can be used to unpivot data, thanks to its array data type. An array is a collection of elements, and in Postgres you can list the elements of an array in square brackets. The function can be used in the SELECT clause and takes the form:

unnest(array[element_1, element_2, ...])

Returning to our earlier example with countries and populations, this query returns the same result as the query with the repeated UNION ALL clauses:

SELECT 
country
,unnest(array['1980', '1990', '2000', '2010']) as year
,unnest(array[year_1980, year_1990, year_2000, year_2010]) as population
FROM country_populations
;

2.4 Conclusion

In this chapter we’ve covered a range of topics around preparing data for analysis with SQL. First we reviewed the different types of data you’re likely to encounter. Then we dove into profiling, in order to learn more about what is in the data set and check its quality. Next we looked at some ways to clean and enhance data sets. Finally, we discussed shaping data and the various approaches to creating the output needed for analysis. The next few chapters cover specific analysis applications of SQL, starting with time series analysis.
