Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

E. Varga Practical Data Science with Python 3https://doi.org/10.1007/978-1-4842-4859-1_2

2. Data Engineering

Ervin Varga¹

(1)

Kikinda, Serbia

After project initiation, the data engineering team takes over to build necessary infrastructure to acquire (identify, retrieve, and query), munge, explore, and persist data. The goal is to enable further data analysis tasks. Data engineering requires different expertise than is required in later stages of a data science process. It is typically an engineering discipline oriented toward craftsmanship to provide necessary input to later phases. Often disparate technologies must be orchestrated to handle data communication protocols and formats, perform exploratory visualizations, and preprocess (clean, integrate, and package), scale, and transform data. All these tasks must be done in context of a global project vision and mission relying on domain knowledge. It is extremely rare that raw data from sources is immediately in perfect shape to perform analysis. Even in the case of a clean dataset, there is often a need to simplify it. Consequently, dimensionality reduction coupled with feature selection (remove, add, and combine) is also part of data engineering. This chapter illustrates data engineering through a detailed case study, which highlights most aspects of it. The chapter also presents some publicly available data sources.

Data engineers also must be inquisitive about data collection methods. Often this fact is neglected, and the focus is put on data representation. It is always possible to alter the raw data format, something hardly imaginable regarding data collection (unless you are willing to redo the whole data acquisition effort). For example, if you receive a survey result in an Excel file, then it is a no-brainer to transform it and save it in a relational database. On the other hand, if the survey participants were not carefully selected, then the input could be biased (favor a particular user category). You cannot apply a tool or program to correct such a mistake.

Figure 2-1 shows the well-known golden principle of data preparation: if you get garbage on input, you’re going to have garbage output.

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig1_HTML.jpg — Figure 2-1
Poorly worded questions do not lead to good answers. Similarly, bad input cannot result in good output.

E-Commerce Customer Segmentation: Case Study

This case study introduces exploratory data analysis (EDA) by using the freely available small dataset from https://github.com/oreillymedia/doing_data_science (see also [1] in the “References” section at the end of the chapter).¹ It contains (among other data) simulated observations about ads shown and clicks recorded on the New York Times home page in May 2012. There are 31 files in CSV format, one file per day; each file is named nyt<DD>.csv, where DD is the day of the month (for example, nyt1.csv is the file for May 1). Every line in a file designates a single user. The following characteristics are tracked: age, gender (0=female, 1=male), number of impressions, number of clicks, and signed-in status (0=not signed in, 1=signed in).

The preceding description is representative of the usual starting point in many data science projects. Based on this description, we can deduce that this case study involves structured data (see also the sidebar “Flavors of Data”) because the data is organized in two dimensions (rows and columns) following a CSV format, where each column item is single valued and mandatory. If the description indicated some column items were allowed to take multiple values (for example, a list of values as text), be absent, or change data type, then we would be dealing with semistructured data. Nevertheless, we don’t know anything about the quality of the content nor have any hidden assumptions regarding the content. This is why we must perform EDA: to acquaint ourselves with data, get invaluable perceptions regarding users’ behavior, and prepare everything for further analysis. During EDA we can also notice data acquisition deficiencies (like missing or invalid data) and report these findings as defects in parallel with cleaning activities.

Note

Structured data doesn’t imply that the data is tidy and accurate. It just emphasizes what our expectation is about the organization of the data, estimated data processing effort, and complexity.

Flavors of Data

We have two general categories of data types: quantitative (numeric) and qualitative (categorical). You can apply meaningful arithmetic operations only to quantitative data. In our case study, the number of impressions is a quantitative variable, while gender is qualitative despite being encoded numerically (we instead could have used the symbols F and M, for example).

A data type may be on one of the following levels: nominal, ordinal, interval, or ratio. Nominal is a bare enumeration of categories (such as gender and logged-in status). It makes no sense to compare the categories nor perform mathematical operations. The ordinal level establishes ordering of values (T-shirt sizes XS, S, M, L, and XL are categorical with clear order regarding size). Interval applies to quantitative variables where addition and subtraction makes sense. Nonetheless, multiplication and division doesn’t (this means there is no proper starting point for values). The ratio level allows all arithmetic operations, although in practice these variables are restricted to be non-negative. In our case study, age, number of impressions, and number of clicks are all at this level (all of them are non-negative).

Understanding the differentiation of levels is crucial to understanding the appropriate statistics to summarize data. There are two basic descriptors of data: center of mass describes where the data is tending to, and standard deviation describes the average spread around a center (this isn’t a precise formulation, but will do for now). For example, it makes no sense to find the mean of nominal values. It only makes sense to dump the mode (most frequent item). For ordinals, it makes sense to present the median, while for interval and ratio we usually use the mean (of course, for them, all centers apply).

Chapter 1 covered the data science process and highlighted the importance of project initiation, the phase during which we establish a shared vision and mission about the scope of the work and major goals. The data engineering team must be aware of these and be totally aligned with the rest of the stakeholders on the project. This entails enough domain knowledge to open up proper communication channels and make the outcome of the data engineering effort useful to the others. Randomly poking around is only good to fool yourself that you are performing some “exploration.” All actions must be framed by a common context.

That being said, let us revisit the major aims of the case study here and penetrate deeper into the marketing domain. The initial data collection effort was oriented toward higher visibility pertaining to advertisement efforts. It is quite common that companies just acquire data at the beginning, as part of their strategic ambitions. With this data available, we now need to take the next step, which is to improve sales by better managing user campaigns.

As users visit a web site, they are exposed to various ads; each exposure increases the number of impressions counter. The goal is to attract users to click ads, thus making opportunities for a sale. A useful metric to trace users’ behavior is the click-through rate (part of domain knowledge), which is defined as $frac{number of clicks}{number of impressions}$ . On average, this rate is quite small (2% is regarded as a high value), so even a tiny improvement means a lot. The quest is to figure out what is the best tactic to increase the click-through rate.

It is a well-known fact in marketing that customized campaigns are much more effective than just blindly pouring ads on users. Any customization inherently assumes the existence of different user groups; otherwise, there would be no basis for customization. At this moment, we cannot make any assumptions about campaign design nor how to craft groups. Nonetheless, we can try to research what type of segmentation shows the greatest variation in users’ behavior. Furthermore, we must identify what are the best indicators to describe users. All in all, we have a lot of moving targets to investigate; this shouldn’t surprise you, as we are in the midst of EDA. Once we know where to go, then we could automate the whole process by creating a full data preparation pipeline (we will talk about such pipelines in later chapters).

Creating a Project in Spyder

Launch Spyder (refer to Chapter 1 for the Anaconda setup instructions, if needed) and select the Projects ➤ New Project menu item. In the dialog box that opens, specify the project’s name (in our case, segmentation), location (choose a folder on your machine to become this project’s parent directory), and type (leave it as Empty Project). In the File Explorer, create a folder structure as shown in Figure 2-2. The raw_data folder contains the unpacked files, the results folder hosts the cleaned and prepared data, and the scripts folder stores the various Python programs. The segmentation folder contains the driver code. In subsequent sections you will need to create Python files by selecting File ➤ New file.

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig2_HTML.jpg — Figure 2-2
Folder structure of our Spyder project

Downloading the Dataset

You should now download the dataset and unpack it. You may do this manually, via shell script, or from Python (to name a few options). I’ve chosen the last option to demonstrate a couple of techniques in Python, although I do prefer bash scripts for this stuff. Listing 2-1 shows the download script, nyt_data.py, which downloads the dataset and performs necessary housekeeping. Notice how it unpacks the embedded ZIP file containing our data. The cleanup procedure showcases basic exception handling.

import requests, zipfile, io, shutil

unpackedFolder = '/dds_datasets/'

unpackedZipFile = 'dds_ch2_nyt.zip'

def retrieve(sourceFile, destinationFolder):

def cleanup():

try:

shutil.rmtree(destinationFolder + unpackedFolder)

except OSError as e:

print("Folder: %s, Error: %s" % (e.filename, e.strerror))

r = requests.get(sourceFile)

assert r.status_code == requests.codes.ok

z = zipfile.ZipFile(io.BytesIO(r.content))

z.extractall(destinationFolder)

# The top archive contains another ZIP file with our data.

z = zipfile.ZipFile(destinationFolder + unpackedFolder + unpackedZipFile)

z.extractall(destinationFolder)

cleanup()

Listing 2-1

The nyt_data.py Script

The requests package is the main HTTP library for Python (see https://requests.readthedocs.io ). The retrieved content is available under r.content. The zipfile package manages archives, and shutil provides high-level file operations. We use it to remove the temporary folder with embedded archives. Listing 2-2 shows the driver code for this part of the job. It is the first part of the driver.py script for downloading the raw data. Since it calls scripts in the scripts folder, this must be added to the path. The driver must be executed from the project’s base folder. Also, observe that the fileUrl starts with raw instead of blob.

import sys

import os

sys.path.append(os.path.abspath('scripts'))

from nyt_data import retrieve

repoUrl = 'https://github.com/oreillymedia/doing_data_science/'

fileUrl = 'raw/master/dds_datasets.zip'

retrieve(repoUrl + fileUrl, 'raw_data')

print('Raw data files are successfully retrieved.')

Listing 2-2

The Driver Code

After the execution of Listing 2-2, the CSV files will be available in the raw_data subfolder. You can easily verify this in Spyder’s File Explorer.

Exploring the Dataset

Let’s start our journey by peeking into our dataset. We will carry out all steps from Spyder’s IPython console. This provides an interactive working environment, and this is the cornerstone of why Python is so popular among data scientists. You can issue statements and watch their immediate execution. There’s no need to wait for long edit/build/execute cycles as is the case with compiled languages. Below is the result of printing the current working directory (your output will differ depending on your project’s parent folder)²:

>> %pwd

'/Users/evarga/Projects/pdsp_book/src/ch2/segmentation'

The %pwd magic command is mimicking the native pwd bash command. You could also get the same result with !pwd (! is the escape character to denote what follows is a shell command). The shell-related magic commands are operating system neutral, so it makes sense to use them when available. Sometimes, as is the case with !cd, the native shell command cannot even be properly run (you must rely on the corresponding magic variant). Visit https://ipython.readthedocs.io/en/stable/interactive/magics.html to receive help on built-in magic commands; run %man <shell command> to get help on a shell magic command (for example, try %man head or %man wc). Now, switch into the raw_data folder by running %cd raw_data (watch how the File Explorer refreshes its display).

I advise you to always at first preview some raw data file using the most rudimentary technique. Once you develop your first impression of what is there, then you can utilize more powerful tools and frameworks for data manipulation. Here is the dump of the nyt1.csv file (only the first ten lines):

>> !head -n 10 nyt1.csv

"Age","Gender","Impressions","Clicks","Signed_In"

36,0,3,0,1

73,1,3,0,1

30,0,3,0,1

49,1,3,0,1

47,1,11,0,1

47,0,11,1,1

0,0,7,1,0

46,0,5,0,1

16,0,3,0,1

The first line is a header with decently named columns. The rest of the lines are user records. Nonetheless, just with these ten lines displayed, we can immediately spot a strange value for age: 0. We don’t expect it be non-positive. We will need to investigate this later.

By looking at file sizes in the File Explorer, we see that they are all around 4.5MB (a couple of them are larger, between 7.5MB and 8MB). Consequently, most of them should contain a similar number of records. Let’s find out how many records are in the smallest file and in the biggest file (click the Size tab to sort files by their size):

>> !wc -l nyt26.csv

362235 nyt26.csv

>> !wc -l nyt13.csv

786045 nyt13.csv

So, the number of records fluctuates in the range of [360K, 790K] with sizes between 3.7MB and 8.0MB; we shouldn’t have a problem processing a single file in memory even on a modest laptop.

Finding Associations Between Features

The previously mentioned columns are our basic features for describing each web page visitor. Before we delve deeper into any feature engineering (a process of creating more informative characteristics, like the already mentioned click-through rate), we must see how the existing features interplay. Any rule or pattern designates a particular association (relationship). Visualization is very effective here and complements descriptive statistics (like minimum, maximum, mean, etc.). The SciPy ecosystem offers the Pandas framework to work with tabular data, including visualization, which is just what we need here (see reference [3] for superb coverage of Pandas). Recall that we are in the EDA stage, so nothing fancy is required (Chapter 6 describes more advanced visualizations). Type the following snippet in your console:

>> import pandas as pd

>> nyt_data = pd.read_csv('nyt1.csv')

>> nyt_data.head(n=10)

Age Gender Impressions Clicks Signed_In

0 36 0 3 0 1

1 73 1 3 0 1

2 30 0 3 0 1

3 49 1 3 0 1

4 47 1 11 0 1

5 47 0 11 1 1

6 0 0 7 1 0

7 46 0 5 0 1

8 16 0 3 0 1

9 52 0 4 0 1

The pd is a common acronym for the pandas package (like np is for numpy). The read_csv function does all the gory details of parsing the CSV file and forming the appropriate DataFrame object. This object is the principal entity in the Pandas framework for handling 2D data. The head method shows the first specified number of records in a nice tabular format. Observe that our table has inherited the column names from the CSV file. The leftmost numbers are indices; by default, they are record numbers starting at 0, but the example below shows how it may be easily redefined.

The next line shows the overall dimensionality of our table:

>> nyt_data.shape

(458441, 5)

We have 458441 records (rows) and 5 features (columns). The number of rows is very useful for finding out missing column values. So, you should always print the shape of your table.

The next snippet presents the overall descriptive statistics for columns (watch out for the custom index):

>> summary_stat = nyt_data.describe()

>> summary_stat

Age Gender ... Clicks Signed_In

count 458441.000000 458441.000000 ... 458441.000000 458441.000000

mean 29.482551 0.367037 ... 0.092594 0.700930

std 23.607034 0.481997 ... 0.309973 0.457851

min 0.000000 0.000000 ... 0.000000 0.000000

25% 0.000000 0.000000 ... 0.000000 0.000000

50% 31.000000 0.000000 ... 0.000000 1.000000

75% 48.000000 1.000000 ... 0.000000 1.000000

max 108.000000 1.000000 ... 4.000000 1.000000

[8 rows x 5 columns]

>> summary_stat['Impressions']

count 458441.000000

mean 5.007316

std 2.239349

min 0.000000

25% 3.000000

50% 5.000000

75% 6.000000

max 20.000000

Name: Impressions, dtype: float64

The describe method calculates the descriptive statistics (as shown in the leftmost part of the output) for each column and returns the result as a new DataFrame. When we display summary_stat, then it automatically abbreviates the output if all columns cannot fit on the screen. This is why we need to explicitly print the information for the Impressions column.

At first glance, all counts equal the total number of rows, so we have no missing values (at least, in this file). The Age column’s minimum value is zero, although we have already noticed this peculiarity. About 36% of users are male and 70% of them were signed in (the mean is calculated by summing up the 1s and dividing the sum by the number of rows). Well, presenting a mean for a categorical variable isn’t right, but Pandas has no way to know what terms like Gender or Signed In signify. Apparently, conveying semantics is much more than just properly naming things.

The next excerpt checks what Pandas thinks about column types:

>> nyt_data.dtypes

Age int64

Gender int64

Impressions int64

Clicks int64

Signed_In int64

dtype: object

Gender and Signed_In are not ratio-level values, although in many scenarios using numerical encoding for Boolean values is beneficial (they don’t confuse many machine-learning algorithms). The procedure to make quantitative variables out of qualitative ones is called making dummy variables (see also the pandas.get_dummies function). Let’s help Pandas and mark our nominal columns properly:

>> nyt_data['Gender'] = nyt_data['Gender'].astype('category')

>> nyt_data['Signed_In'] = nyt_data['Signed_In'].astype('category')

>> nyt_data['Gender'].describe()

count 458441

unique 2

top 0

freq 290176

Name: Gender, dtype: int64

The first (bold) line shows how to change the column’s type to category. Look how the type of statistics was altered to make sense for a categorical variable. We will later bundle up all these findings in a script to automate everything. Currently, we are solely playing with data. Think again how interactivity helps us in this mission. The next two lines produce two scatter plots, Age vs. Impressions and Age vs. Clicks, as shown in Figure 2-3 (a semicolon at the end of a line prevents printing the function’s return value):

>> nyt_data.plot(x='Age', y="Impressions", c=nyt_data['Gender'], kind="scatter", ➥

colormap='Paired', colorbar=False);

>> nyt_data.plot(x='Age', y="Clicks", c=nyt_data['Gender'], kind="scatter", ➥

colormap='Paired', colorbar=False);

Since plotting in Python is so easy, you should always make some plots. There is no guarantee that you will immediately hit the mark, but you could learn something. For example, we have discovered that age zero isn’t a bare error in data (all users are designated as females, cover the full range of impressions and clicks, and there is a distinct gap from the non-zero group). The next snippet selects users with age zero:

>> zero_age_users = nyt_data[nyt_data['Age'] == 0]

>> zero_age_users.head(n=10)

Age Gender Impressions Clicks Signed_In

6 0 0 7 1 0

10 0 0 8 1 0

12 0 0 4 0 0

15 0 0 6 0 0

19 0 0 5 0 0

24 0 0 4 0 0

39 0 0 7 2 0

41 0 0 4 0 0

46 0 0 3 0 0

47 0 0 7 0 0

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig3_HTML.jpg — Figure 2-3
A scatter plot is handy to show potential relationships between numerical variables

You can add a third dimension by differently coloring the points.³ The dark dots denote males, while the light ones denote females. Due to the very large number of records (points are heavily overlapped), we cannot discern any clear pattern except that the zero age group is special.

The bolded section illustrates the Boolean indexing technique (a.k.a. masking), where a Boolean vector is used for filtering records. Only those records are selected where the matching element is True. Based on the dump, we see that all users are marked as females and logged out.

The next lines assure that zero age actually designates a logged-out category:

>> logged_in_zero_age_users_mask = (nyt_data['Age'] == 0) & (nyt_data['Signed_In'] == 1)

>> logged_in_zero_age_users_mask.any()

False

>> logged_out_non_zero_age_users_mask = (nyt_data['Age'] > 0) & (nyt_data['Signed_In'] == 0)

>> logged_out_non_zero_age_users_mask.any()

False

The any method checks if there is at least one True element in a Boolean mask. Both times we see False. Therefore, we can announce a very important rule that every logged-out user is signaled with zero age, which absolutely makes sense. If a user isn’t logged in, then how could we know her age? Observe that we have gradually reached this conclusion, always using data as a source of truth buttressed with logic and domain knowledge. All in all, the Signed_In column seems redundant (noninformative), but it may be useful for grouping purposes.

Don’t worry too much at this point about performance or elegancy. Many solutions are possible for the same problem. For example, here is another way to check that all logged-out users are of age zero:

>> grouped_by_logged_status = nyt_data.groupby('Signed_In')

>> import numpy as np

>> grouped_by_logged_status.agg([np.min, np.max])['Age']

amin amax

Signed_In

0 0 0

1 7 108

We have grouped the data by the signed-in status and aggregated it based on minimum and maximum. For a logged-out state, the only possible age value is zero. For a logged-in state, the youngest user is 7 years old.

Incorporating Custom Features

Feature engineering is both art and science that heavily relies on domain knowledge. Many scientists from academia and industry constantly seek new ways to characterize observations. Most findings are published, so that others may continue to work on various improvements. In the domain of e-commerce, it turned out that click-through rate, demographics (in our case study, gender), and age groups (see reference [2]) are informative features.

Let’s start by introducing a new column, Age_Group , that classifies users as “Unknown”, “1-17”, “18-24”, “25-34”, “35-44”, “45-54”, “55-64”, and “65+” (we remove the Age column):

>> nyt_data['Age_Group'] = pd.cut(nyt_data['Age'], bins=[-1, 0, 17, 24, 34, 44, 54, 64, 120], ➥

labels=["Unknown", "1-17", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"])

>> nyt_data.drop('Age', axis="columns", inplace=True)

>> nyt_data.head(n=10)

Gender Impressions Clicks Signed_In Age_Group

0 0 3 0 1 35-44

1 1 3 0 1 65+

2 0 3 0 1 25-34

3 1 3 0 1 45-54

4 1 11 0 1 45-54

5 0 11 1 1 45-54

6 0 7 1 0 Unknown

7 0 5 0 1 45-54

8 0 3 0 1 1-17

9 0 4 0 1 45-54

>> nyt_data['Age_Group'].dtypes

CategoricalDtype(categories=['Unknown', '1-17', '18-24', '25-34', '35-44', '45-54',

'55-64', '65+'],

ordered=True)

The following code adds the click-through rate (CTR) column to our table:

>> nyt_data['CTR'] = nyt_data['Clicks'] / nyt_data['Impressions']

>> nyt_data.head(n=10)

Gender Impressions Clicks Signed_In Age_Group CTR

0 0 3 0 1 35-44 0.000000

1 1 3 0 1 65+ 0.000000

2 0 3 0 1 25-34 0.000000

3 1 3 0 1 45-54 0.000000

4 1 11 0 1 45-54 0.000000

5 0 11 1 1 45-54 0.090909

6 0 7 1 0 Unknown 0.142857

7 0 5 0 1 45-54 0.000000

8 0 3 0 1 1-17 0.000000

9 0 4 0 1 45-54 0.000000

We have to be careful to avoid potential division by zero issues. Pandas will automatically mark such values as Not a Number (NaN) . We could instantly remove such rows, but we must be careful here. The lack of ads is a sign that potential buyers are omitted from a campaign. The next statement shows how many users haven’t seen ads (True values are treated as 1s and False values as 0s, so we can directly sum them up):

>> np.sum(nyt_data['Impressions'] == 0)

3066

This is a low percentage of users (about 0.6%). Removing them wouldn’t impact the analysis. The next statement prunes away rows whose CTR is NaN:

>> nyt_data.dropna(inplace=True)

The click-through rate is a number from the interval [0, 1]. Consequently, we should delete all rows where the number of clicks is higher than the number of impressions. The next step is kind of a generic sanity action that we will apply to all data files:

>> nyt_data.drop((nyt_data['Clicks'] > nyt_data['Impressions']).nonzero()[0], inplace=True)

The Boolean mask must be converted to an array of indices where elements are True. This is the job of the nonzero method . Of course, we could also get an empty array, but this is OK. As you likely realize, besides exploring, we are also cleaning up the dataset. Most actions in data science are intermixed, and it is hard to draw borders between activity spaces.

The following lines produce two bar plots to visualize the distributions of the number of impressions and CTR for different age ranges (these are shown in Figure 2-4):

>> grouped_by_age_group = nyt_data.groupby(by='Age_Group').agg(np.mean)

>> grouped_by_age_group['Impressions'].plot(kind='bar', colormap="tab20c", title="Impressions");

>> grouped_by_age_group['CTR'].plot(kind='bar', colormap="tab20c", title="CTR");

Another interesting indicator may be the total number of clicks across age groups and logged-in statuses. The following lines produce two bar plots to visualize these metrics (these are shown in Figure 2-5):

>> grouped_by_age_group = nyt_data.groupby(by='Age_Group').agg(np.sum)

>> grouped_by_logged_status = nyt_data.groupby(by='Signed_In').agg(np.sum)

>> grouped_by_age_group['Clicks'].plot(kind='bar', colormap="tab20c", title="Clicks");

>> grouped_by_logged_status['Clicks'].plot(kind='bar', colormap="tab20c", title="Clicks");

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig4_HTML.jpg — Figure 2-4
These two bar plots clearly emphasize the difference between features: number of impressions and CTR. The youngest and oldest people are more susceptible to click ads. The unknown age group is a mixture of all ages. At any rate, the segmentation by age groups turns out to be useful.

It is an amazing fact that 30% of unregistered users generate the same total number of clicks as the 70% registered ones. Furthermore, we can notice a steady increase in the number of clicks as age ranges advance. This might suggest that older people have more money to spend on shopping than younger ones. It could also be the case that older people are more attracted to the site than youngsters. Investigating the reason at some point in the future makes sense. At this moment, we are simply trying out various approaches to best describe users’ behavior.

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig5_HTML.jpg — Figure 2-5
The total number of clicks represents the matching group’s importance (potential) to generate sale

Figure 2-6 illustrates how click-through rate and total number of clicks are distributed across combinations of age ranges and demographics. The next couple of lines implement this idea:

>> grouped_by_age_group_and_gender = nyt_data.groupby(by=['Age_Group', 'Gender'])➥

[['CTR', 'Clicks']].agg([np.mean, np.sum])

>> grouped_by_age_group_and_gender['CTR']['mean'].plot(kind='bar', colormap="tab20c",➥ title="CTR", rot=45);

>> grouped_by_age_group_and_gender['Clicks']['sum'].plot(kind='bar', colormap="tab20c",➥ title="Clicks", rot=45);

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig6_HTML.jpg — Figure 2-6
There is no significant difference between genders, aside from total number of clicks for the youngest and oldest age groups

These diagrams also show how easy it is to customize plots. To enhance readability, I have printed the X axis labels slanted by 45 degrees.

We are now ready to capture our findings in tabular format; it compresses the whole data file into a set of descriptive statistics while still preserving crucial traits of user behavior. This is the essence of data preprocessing by means of mathematical statistics. The next snippet produces a multilevel indexed data frame:

>> def q25(x):

>> return x.quantile(0.25)

>> def q75(x):

>> return x.quantile(0.75)

>> compressed_nyt_data = nyt_data.groupby(by=['Age_Group', 'Gender'])➥

[['CTR', 'Clicks']].agg([np.mean, np.std, np.max, q25, np.median, q75, np.sum])

>> compressed_nyt_data

CTR ... Clicks

mean std amax q25 ... q25 median q75 sum

Age_Group Gender ...

Unknown 0 0.028355 0.085324 1.0 0.0 ... 0 0 0 19480

1-17 0 0.031786 0.088829 1.0 0.0 ... 0 0 0 683

1 0.029582 0.087169 1.0 0.0 ... 0 0 0 1382

18-24 0 0.010789 0.053556 1.0 0.0 ... 0 0 0 1002

1 0.010685 0.052763 1.0 0.0 ... 0 0 0 1165

25-34 0 0.010186 0.049832 1.0 0.0 ... 0 0 0 1388

1 0.010111 0.051750 1.0 0.0 ... 0 0 0 1549

35-44 0 0.010234 0.050691 1.0 0.0 ... 0 0 0 1707

1 0.010332 0.051503 1.0 0.0 ... 0 0 0 1955

45-54 0 0.010187 0.050994 1.0 0.0 ... 0 0 0 1542

1 0.009754 0.049075 1.0 0.0 ... 0 0 0 1690

55-64 0 0.020378 0.072637 1.0 0.0 ... 0 0 0 2105

1 0.020245 0.071184 1.0 0.0 ... 0 0 0 2451

65+ 0 0.029856 0.084455 1.0 0.0 ... 0 0 0 2765

1 0.029709 0.083338 1.0 0.0 ... 0 0 0 1585

[15 rows x 14 columns]

>> compressed_nyt_data['Clicks']

mean std amax q25 median q75 sum

Age_Group Gender

Unknown 0 0.143049 0.386644 4 0 0 0 19480

1-17 0 0.157664 0.386004 2 0 0 0 683

1 0.147021 0.383844 3 0 0 0 1382

18-24 0 0.053207 0.230291 2 0 0 0 1002

1 0.053995 0.233081 2 0 0 0 1165

25-34 0 0.051310 0.225113 2 0 0 0 1388

1 0.050374 0.224588 2 0 0 0 1549

35-44 0 0.051782 0.225929 2 0 0 0 1707

1 0.052232 0.227133 3 0 0 0 1955

45-54 0 0.051303 0.227742 2 0 0 0 1542

1 0.050018 0.224011 2 0 0 0 1690

55-64 0 0.102026 0.318453 3 0 0 0 2105

1 0.102854 0.320578 3 0 0 0 2451

65+ 0 0.152065 0.384822 3 0 0 0 2765

1 0.152801 0.386677 3 0 0 0 1585

The preceding table contains enough information to even restore some other properties. For example, the size of groups may be calculated by dividing the corresponding total number of clicks by its mean.

Automating the Steps

Listing 2-3 shows the extension of the script nyt_data.py with a summarize function that gathers all the pertinent steps from our exploration. It receives a data file name and returns a DataFrame instance containing descriptive statistics for depicting user behavior.

import pandas as pd

import numpy as np

def summarize(data_file):

def q25(x):

return x.quantile(0.25)

def q75(x):

return x.quantile(0.75)

# Read and parse the CSV data file.

nyt_data = pd.read_csv(data_file, dtype={'Gender': 'category'})

# Segment users into age groups.

nyt_data['Age_Group'] = pd.cut(

nyt_data['Age'],

bins=[-1, 0, 17, 24, 34, 44, 54, 64, 120],

labels=["Unknown",

"1-17",

"18-24",

"25-34",

"35-44",

"45-54",

"55-64",

"65+"])

nyt_data.drop('Age', axis="columns", inplace=True)

# Create the click through rate feature.

nyt_data['CTR'] = nyt_data['Clicks'] / nyt_data['Impressions']

nyt_data.dropna(inplace=True)

nyt_data.drop((nyt_data['Clicks'] > nyt_data['Impressions']).nonzero()[0],

inplace=True)

# Make final description of data.

compressed_nyt_data =

nyt_data.groupby(by=['Age_Group', 'Gender'])[['CTR', 'Clicks']]

.agg([np.mean, np.std, np.max, q25, np.median, q75, np.sum])

return compressed_nyt_data

Listing 2-3

The summarize Function Responsible for Preprocessing a Single Data File

You should get back the same result as from the previous section after passing the 'nyt1.csv' file to summarize . We are now left with implementing the pipeline to process all files. Observe how all those preliminary sidesteps and visualizations are gone. The source code is purely an end result of complex intellectual processes.

Inspecting Results

The next stage is to create a test pipeline to process all files. Parts of this test harness will transition into the final data pipeline infrastructure. Listing 2-4 shows the last extension to the nyt_data.py script. The traverse function calls back a custom collector, which further handles the data frame instance. The function uses the pathlib package introduced in Python 3.4. The get_file_number function extracts the sequence number from a data file name. This is important, as files may be traversed in arbitrary order. The file number is used to index into a collection storing some values from each data frame.

import pathlib

def traverse(sourceFolder, collect):

def get_file_number(data_file):

return int(data_file.name[3:-4]) - 1

for data_file in pathlib.Path(sourceFolder).glob('nyt∗.csv'):

collect(summarize(data_file.absolute()), get_file_number(data_file))

Listing 2-4

The traverse Function Calling Back a Custom Collector

Listing 2-5 shows the section from driver.py to call traverse and make some plots as shown in Figure 2-7 (those diagrams are vital to reinforce the expressive power of our segmentation tactic and accumulated statistics). Notice how traverse is called with the select_stats_unregistered callback function to collect pertinent data for each file in Listing 2-5. The main program collects the average click-through rate and total clicks for unregistered users that are representative enough for postulating a hypothesis.

import numpy as np

import pandas as pd

from nyt_data import traverse

summary_data = dict()

summary_data.setdefault('CTR', np.empty(31))

summary_data.setdefault('Clicks', np.empty(31))

def select_stats_unregistered(df, file_num):

summary_data['CTR'][file_num] = df['CTR']['mean'][('Unknown', '0')]

summary_data['Clicks'][file_num] = df['Clicks']['sum'][('Unknown', '0')]

traverse('raw_data', select_stats_unregistered)

print('Raw data files are successfully processed.')

# Make some plots of CTR and Total Clicks over time.

df = pd.DataFrame.from_dict(summary_data)

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=1)

df['CTR'].plot(

title='Click Through Rate Over 1 Month',

ax=axes[0],

figsize=(8, 9),

xticks=[]

);

df['Clicks'].plot(

xticks=range(0, 31, 2),

title='Total Clicks Over 1 Month',

ax=axes[1],

figsize=(8, 9)

);

Listing 2-5

np.empty Allocates Spaces for 31 Days of Data Points

../images/473947_1_En_2_Chapter/473947_1_En_2_Fig7_HTML.jpg — Figure 2-7
There is no noticeable fluctuation pattern in click-through rate

Notice that there are clear peaks in the total number of clicks. They happened on the 6th, 13th, 20th, and 27th of May 2012, which are all Sundays (the numbering starts at zero; see also Exercise 2-2). Consequently, on those days the web site had experienced much higher traffic than during weekdays. Moreover, we see an overall increase in activity in the second half of the month (could be that people waited their monthly salaries until around 15th). This may be a recurring condition, since on the last day of the month the number of clicks drops to the level witnessed at the beginning of the month. Finally, on every Saturday, activity languished a bit compared to the rest of the week (perhaps people typically sit in front of their computers less on Saturdays). These effects are intriguing and deserve deeper analysis with more data.

Persisting Results

The whole point of data preprocessing is to preserve the results for later use. The previous extension of driver.py only visualizes compressed data frames. Once the program is run, all that remains are diagrams and raw CSV files. It would be better to save data frames into the results folder, as depicted earlier in Figure 2-2. During later data analysis, these data frames could be read without starting everything all over. There are many available file formats. We are going to use Apache Parquet (see https://parquet.apache.org ), which delivers a partitioned binary columnar storage format for data frames. The following are some of the benefits and important aspects regarding Parquet:

It faithfully and efficiently writes and reads data frames.
It allows sharing of data across a multitude of programming languages and frameworks.
It nicely interoperates with the Hadoop ecosystem.
Pandas has a very intuitive API for handling serialization and deserialization of data frames into/from this format.

Listing 2-6 shows the final section of driver.py to save each data frame into a separate file (the part from Listing 2-5 should be commented out; see Exercise 2-2). A set of these files provides a clear demarcation line between this stage and the next one in a data science process.

def save_stats(df, file_num):

targetFile = 'nyt_summary_' + str(file_num + 1) + '.parquet'

df.columns = ['_'.join(column).rstrip('_') for column in df.columns.values]

df.to_parquet('results/' + targetFile)

traverse('raw_data', save_stats)

print('Raw data files are successfully processed.')

Listing 2-6

Code to Write Data Frames into Parquet Files Using the nyt_summary_<day number>.parquet File Naming Convention

It is mandatory to have string column names in a DataFrame instance before writing it into a Parquet file. The code shows how to convert a multilevel index into a flat structure, where the top index’s name is prepended to its subordinate columns. This transformation is reversible, so nothing is lost.

After all files are written, you may browse them in the File Explorer. They are all the same size of 6KB. Compare this to initial CSV file sizes.

Parquet Engines

We didn’t designate a specific Parquet engine in Listing 2-6, which means it was set to auto. In this case, Pandas tries to load pyarrow or fastparquet to handle read/write operations. If you don’t have these installed, then you will receive the following error message:

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.

pyarrow or fastparquet is required for parquet support

An easy to way to resolve this is to issue conda install pyarrow (you may also want to specify an environment). This command installs the pyarrow package, which is a Python binding for Apache Arrow (see http://arrow.apache.org ). Arrow is a cross-language development platform for columnar in-memory format for flat and hierarchical data. With pyarrow it is easy to read Apache Parquet files into Arrow structures. The capability to effortlessly convert from one format into another is an important criterion when choosing a specific technology.

Restructuring Code to Cope with Large CSV Files

Breaking up data into smaller files, as we have done so far, is definitely a good strategy to deal with a massive volume of data. Nonetheless, sometimes even a single CSV file is too large to fit into memory. Of course, you can try vertical scaling, by finding a beefed-up machine, but this approach is severely limited and costly. A better way is to process files in chunked fashion. A chunk is a piece of data from a file; once you are done with the current chunk, you can throw it away and get a new one. This tactic can save memory tremendously but may increase processing time.

Listing 2-7 shows the restructured version of the summarize function that handles data chunk-by-chunk. There are lots of further improvements that may be done regarding how chunks are processed. However, the intention here is to convey the basic idea in a straightforward manner. The key point is that only when you’re absolutely convinced that neither file sharding nor chunking helps should you switch to more advanced frameworks and setup. This is consistent with the golden tenet of data science to assume a simple resolution unless proven otherwise.

def summarize(data_file, chunksize):

def q25(x):

return x.quantile(0.25)

def q75(x):

return x.quantile(0.75)

# Read and parse the CSV data file chunk-by-chunk.

nyt_data = pd.DataFrame()

for chunk_df in pd.read_csv(

data_file,

dtype={'Gender': 'category'},

chunksize=chunksize):

# Segment users into age groups.

chunk_df['Age_Group'] = pd.cut(

chunk_df['Age'],

bins=[-1, 0, 17, 24, 34, 44, 54, 64, 120],

labels=["Unknown",

"1-17",

"18-24",

"25-34",

"35-44",

"45-54",

"55-64",

"65+"])

# Create the click-through rate feature.

chunk_df['CTR'] = chunk_df['Clicks'] / chunk_df['Impressions']

chunk_df.dropna(inplace=True)

chunk_df.drop((chunk_df['Clicks'] > chunk_df['Impressions']).nonzero()[0],

inplace=True)

# Append chunk to the main data frame.

nyt_data = nyt_data.append(

chunk_df[['Age_Group', 'Gender', 'Clicks', 'CTR']],

ignore_index=True)

# Make final description of data.

compressed_nyt_data =

nyt_data.groupby(by=['Age_Group', 'Gender'])[['CTR', 'Clicks']]

.agg([np.mean, np.std, np.max, q25, np.median, q75, np.sum])

return compressed_nyt_data

Listing 2-7

Restructured nyt_data_chunked.py to Showcase Chunked Data Processing

The traverse function accepts chunksize as a parameter, which is set by default to 10000. The driver.py script is redirected to use the new module, and all the rest runs unchanged.

Public Data Sources

Data analysis revolves around data, of course, so finding useful data sources is crucial for development and testing. In the absence of freely available data, you may synthetically generate some or rely on accrued corporate data. However, that usually isn’t necessary, because many public data sources are available; most are constantly inspected and improved by community members, and frequently accompanied with descriptions about their intended use. There are also high-quality datasets embedded in popular data science frameworks. The following snippet lists what is shipped with scikit-learn (only some lines are shown, for brevity):

>> import sklearn.datasets as datasets

>> [item for item in dir(datasets) if not item.startswith('_')]

['base',

'california_housing',

'clear_data_home',

'covtype',

'dump_svmlight_file',

'fetch_20newsgroups',

'fetch_20newsgroups_vectorized',

'fetch_california_housing',

'fetch_covtype',

'fetch_kddcup99',

'fetch_lfw_pairs',

'fetch_lfw_people',

...]

>> datasets.fetch_20newsgroups?

Signature: datasets.fetch_20newsgroups(data_home=None, subset="train", categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True)

Docstring:

Load the filenames and data from the 20 newsgroups dataset (classification).

Download it if necessary.

================= ==========

Classes 20

Samples total 18846

Dimensionality 1

Features text

================= ==========

Read more in the :ref:`User Guide <20newsgroups_dataset>`.

Parameters

----------

data_home : optional, default: None

Specify a download and cache folder for the datasets. If None,

all scikit-learn data is stored in '~/scikit_learn_data' subfolders.

...

The following list is just a taste of public data sources accessible on the Internet and what types of access are offered:

Data.gov ( https://www.data.gov ) is the U.S. federal government’s data hub. In addition to loads of datasets, you can find tools and resources to conduct research and create data products.
Geocoding API ( https://developers.google.com/maps/documentation/geocoding/start ) is a Google REST service that provides geocoding and reverse geocoding of addresses. This isn’t a dataset per se but is important in many data engineering endeavors.
GitHub repository ( https://github.com/awesomedata/awesome-public-datasets ) offers a categorized list of publicly available datasets. This is an example of a source aggregator that alleviates the burden of targeted searches for datasets.
HealthData.gov ( https://healthdata.gov ) offers all sorts of health data for research institutions and corporations.
Kaggle ( https://www.kaggle.com ) is a central place for data science projects. It regularly organizes data science competitions with excellent prizes. Many datasets are accompanied by kernels, which are notebooks containing examples of working with the matching datasets.
National Digital Forecast Database REST Web Service ( https://graphical.weather.gov/xml/rest.php ) delivers data from the National Weather Service’s National Digital Forecast Database in XML format.
Natural Earth ( http://www.naturalearthdata.com ) is a public domain map dataset that may be integrated with Geographic Information System software to create dynamic map-enabled queries and visualizations.
Open Data Network (see https://www.opendatanetwork.com ) is another good example of a source aggregator. It partitions data based on categories and geographical regions.
OpenStreetMap ( https://wiki.openstreetmap.org ) provides free geographic data for the world.
Quandl ( https://www.quandl.com ) is foremost a commercial platform for financial data but also includes many free datasets. The data retrieval is via a nice API with a Python binding.
Scikit Data Access ( https://github.com/MITHaystack/scikit-dataaccess ) a curated data pipeline and set of data interfaces for Python. It prescribes best practices to build your data infrastructure.
Statistics Netherlands opendata API client for Python ( https://cbsodata.readthedocs.io ) is a superb example of using the Open Data Protocol (see https://www.odata.org ) to standardize interaction with a source.
UCI Machine Learning Repository ( http://archive.ics.uci.edu/ml/index.php ) maintains over 400 datasets for different machine learning tasks. Each set is properly marked to identify which problems it applies to. This is a classical example of a source where you need to download the particular dataset.

Exercise 2-1. Enhance Reusability

The driver.py script contains lots of hard-coded values. This seriously hinders its reusability, since it assumes more than necessary. Refactor the code to use an external configuration file, which would be loaded at startup.

Don’t forget that the current driver code assumes the existence of the input and output folders. This assumption no longer holds with a configuration file. So, you will need to check that these folders exist and, if not, create them on demand.

Exercise 2-2. Avoid Nonrunnable Code

The driver.py script contains a huge chunk of commented-out code. This chunk isn’t executed unless the code is uncommented again. Such passive code tends to diverge from the rest of the source base, since it is rarely maintained and tested. Furthermore, lots of such commented sections impede readability.

Restructure the program to avoid relying on comments to control what section will run. There are many viable alternatives to choose from. For example, try to introduce a command-line argument to select visualization, output file generation, or both. You may want to look at https://docs.python.org/3/library/argparse.html .

Exercise 2-3. Augment Data

Sometimes you must augment data with missing features, and you should be vigilant to search for available solutions. If they are written in different languages, then you need to find a way to make them interoperate with Python. Suppose you have a dataset with personal names, but you are lacking an additional gender column. You have found a potential candidate Perl module, Text::GenderFromName (see https://metacpan.org/pod/Text::GenderFromName ). Figure out how to call this module from Python to add that extra attribute (you will find plenty of advice by searching on Google).

This type of investigation is also a job of data engineers. People focusing on analytics don’t have time or expertise to deal with peculiar data wrangling.

Summary

Data engineering is an enabler for data analysis. It requires deep engineering skills to cope with various raw data formats, communication protocols, data storage systems, exploratory visualization techniques, and data transformations. It is hard to exemplify all possible scenarios.

The e-commerce customer segmentation case study demonstrates mathematical statistics in action to compress the input data into a set of descriptive properties. These faithfully preserve main behavioral traits of users whose actions were preserved in raw data. The case study ends by examining large datasets and how to process them chunk-by-chunk. This is a powerful technique that may solve the problem without reaching to more advanced and distributed approaches.

In most cases, when dealing with common data sources, you have a standardized solution. For example, to handle disparate relational database management systems in a unified fashion, you can rely on the Python Database API (see https://www.python.org/dev/peps/pep-0249 ). By contrast, when you need to squeeze out information from unusual data sources (like low-level devices), then you must be creative and vigilant. Always look for opportunity to reuse before crafting your own answer. In the case of low-level devices, the superb library PyVISA (see https://pyvisa.readthedocs.io ) enables you to control all kinds of measurement devices independently of the interface (e.g., GPIB, RS232, USB, and Ethernet).

Data-intensive systems must be secured. It isn’t enough to connect the pieces and assume that under normal conditions all will properly work. Sensitive data must be anonymized before being made available for further processing (especially before being openly published) and must be encrypted both at rest and in transit. It is also imperative to ensure consistency and integrity of data. Analysis of maliciously tampered data may have harmful consequences (see reference [4] for a good example of this).

There is a surge to realize the concept of data as commodity , like cloud computing did with computational resources. One such initiative is coming from Data Rivers (consult [5]), whose system performs data collection, cleansing (including wiping out personally identifiable information), and validation. It uses Kafka as a messaging hub with Confluent’s schema registry to handle a humungous number of messages.

Another interesting notion is data science as a service (watch the webinar at [6]) that operationalizes data science and makes all necessary infrastructure instantly available to corporations.

References

1.
Cathy O’Neil and Rachel Schutt, Doing Data Science, O’Reilly, 2013.
2.
“Using Segmentation to Improve Click Rate and Increase Sales,” Mailchimp, https://mailchimp.com/resources/using-segmentation-to-improve-click-rate-and-increase-sales , Feb. 28, 2017.
3.
Wes McKinney, Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, O’Reilly, 2017.
4.
UC San Diego Jacobs School of Engineering, “How Unsecured, Obsolete Medical Record Systems and Medical Devices Put Patient Lives at Risk,” http://jacobsschool.ucsd.edu/news/news_releases/release.sfe?id=2619 , Aug. 28, 2018.
5.
Stephanie Kanowitz, “Open Data Grows Up,” GCN Magazine, https://gcn.com/articles/2018/08/28/pittsburgh-data-rivers.aspx , Aug. 28, 2018,.
6.
“Introducing the Data Science Sandbox as a Service,” Cloudera webinar recorded Aug. 30, 2018, available at https://www.cloudera.com/resources/resources-library.html .

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 2. Data Engineering

Create new playlist

Sign In

Sign Up

2. Data Engineering

E-Commerce Customer Segmentation: Case Study

Note

Flavors of Data

Creating a Project in Spyder

Downloading the Dataset

Exploring the Dataset

Finding Associations Between Features

Incorporating Custom Features

Automating the Steps

Inspecting Results

Persisting Results

Parquet Engines

Restructuring Code to Cope with Large CSV Files

Public Data Sources

Exercise 2-1. Enhance Reusability

Exercise 2-2. Avoid Nonrunnable Code

Exercise 2-3. Augment Data

Summary

References

Table of Contents for
2. Data Engineering