CHAPTER 21
Cases

21.1 Charles Book Club1

CharlesBookClub.csv is the dataset for this case study.

The Book Industry

Approximately 50,000 new titles, including new editions, are published each year in the United States, giving rise to a $25 billion industry in 2001. In terms of percentage of sales, this industry may be segmented as follows:

16% Textbooks
16% Trade books sold in bookstores
21% Technical, scientific, and professional books
10% Book clubs and other mail-order books
17% Mass-market paperbound books
20% All other books

Book retailing in the United States in the 1970s was characterized by the growth of bookstore chains located in shopping malls. The 1980s saw increased purchases in bookstores stimulated through the widespread practice of discounting. By the 1990s, the superstore concept of book retailing gained acceptance and contributed to double-digit growth of the book industry. Conveniently situated near large shopping centers, superstores maintain large inventories of 30,000–80,000 titles and employ well-informed sales personnel. Book retailing changed fundamentally with the arrival of Amazon, which started out as an online bookseller and, as of 2015, was the world’s largest online retailer of any kind. Amazon’s margins were small and the convenience factor high, putting intense competitive pressure on all other book retailers. Borders, one of the two major superstore chains, discontinued operations in 2011.

Subscription-based book clubs offer an alternative model that has persisted, though it too has suffered from the dominance of Amazon.

Historically, book clubs offered their readers different types of membership programs. Two common membership programs are the continuity and negative option programs, which are both extended contractual relationships between the club and its members. Under a continuity program, a reader signs up by accepting an offer of several books for just a few dollars (plus shipping and handling) and an agreement to receive a shipment of one or two books each month thereafter at more-standard pricing. The continuity program is most common in the children’s book market, where parents are willing to delegate the rights to the book club to make a selection, and much of the club’s prestige depends on the quality of its selections.

In a negative option program, readers get to select how many and which additional books they would like to receive. However, the club’s selection of the month is delivered to them automatically unless they specifically mark “no” on their order form by a deadline date. Negative option programs sometimes result in customer dissatisfaction and always give rise to significant mailing and processing costs.

In an attempt to combat these trends, some book clubs have begun to offer books on a positive option basis, but only to specific segments of their customer base that are likely to be receptive to specific offers. Rather than expanding the volume and coverage of mailings, some book clubs are beginning to use database-marketing techniques to target customers more accurately. Information contained in their databases is used to identify who is most likely to be interested in a specific offer. This information enables clubs to design special programs carefully tailored to meet their customer segments’ varying needs.

Database Marketing at Charles

The Club

The Charles Book Club (CBC) was established in December 1986 on the premise that a book club could differentiate itself through a deep understanding of its customer base and by delivering uniquely tailored offerings. CBC focused on selling specialty books by direct marketing through a variety of channels, including media advertising (TV, magazines, and newspapers) and mailing. CBC is strictly a distributor and does not publish any of the books that it sells. In line with its commitment to understanding its customer base, CBC built and maintained a detailed database about its club members. Upon enrollment, readers were required to fill out an insert and mail it to CBC. Through this process, CBC created an active database of 500,000 readers; most were acquired through advertising in specialty magazines.

The Problem

CBC sent mailings to its club members each month containing the latest offerings. On the surface, CBC appeared very successful: mailing volume was increasing, book selection was diversifying and growing, and its customer database was growing. However, its bottom-line profits were falling. The decreasing profits led CBC to revisit its original plan of using database marketing to improve mailing yields and to stay profitable.

A Possible Solution

CBC embraced the idea of deriving intelligence from their data to allow them to know their customers better and enable multiple targeted campaigns where each target audience would receive appropriate mailings. CBC’s management decided to focus its efforts on the most profitable customers and prospects, and to design targeted marketing strategies to best reach them. The two processes they had in place were:

  1. Customer acquisition:
    • New members would be acquired by advertising in specialty magazines, newspapers, and on TV.
    • Direct mailing and telemarketing would contact existing club members.
    • Every new book would be offered to club members before general advertising.
  2. Data collection:
    • All customer responses would be recorded and maintained in the database.
    • Any critical information not already being collected would be requested from the customer.

For each new title, they decided to use a two-step approach:

  1. Conduct a market test involving a random sample of 4000 customers from the database to enable analysis of customer responses. The analysis would create and calibrate response models for the current book offering.
  2. Based on the response models, compute a score for each customer in the database. Use this score and a cutoff value to extract a target customer list for direct-mail promotion.

Targeting promotions was considered to be of prime importance. Other opportunities to create successful marketing campaigns based on customer behavior data (returns, inactivity, complaints, compliments, etc.) would be addressed by CBC at a later stage.

Art History of Florence

A new title, The Art History of Florence, is ready for release. CBC sent a test mailing to a random sample of 4000 customers from its customer base. The customer responses have been collated with past purchase data. The dataset was randomly partitioned into three parts:

  • Training data (1800 customers): initial data to be used to fit models
  • Validation data (1400 customers): holdout data used to compare the performance of different models
  • Test data (800 customers): data to be used only after a final model has been selected, to estimate the probable performance of the model when it is deployed

Each row (or case) in the spreadsheet (other than the header) corresponds to one market test customer. Each column is a variable, with the header row giving the name of the variable. The variable names and descriptions are given in Table 21.1.

Data Mining Techniques

Various data mining techniques can be used to mine the data collected from the market test. No one technique is universally better than another. The particular context and the particular characteristics of the data are the major factors in determining which techniques perform better in an application. For this assignment, we focus on two fundamental techniques: k-nearest neighbors (k-NN) and logistic regression. We compare them with each other as well as with a standard industry practice known as RFM (recency, frequency, monetary) segmentation.

Table 21.1 List of Variables in Charles Book Club Dataset

Variable name Description
Seq# Sequence number in the partition
ID# Identification number in the full (unpartitioned) market test dataset
Gender 0 = Male, 1 = Female
M Monetary—Total money spent on books
R Recency—Months since last purchase
F Frequency—Total number of purchases
FirstPurch Months since first purchase
ChildBks Number of purchases from the category child books
YouthBks Number of purchases from the category youth books
CookBks Number of purchases from the category cookbooks
DoItYBks Number of purchases from the category do-it-yourself books
RefBks Number of purchases from the category reference books (atlases, encyclopedias, dictionaries)
ArtBks Number of purchases from the category art books
GeoBks Number of purchases from the category geography books
ItalCook Number of purchases of book title Secrets of Italian Cooking
ItalAtlas Number of purchases of book title Historical Atlas of Italy
ItalArt Number of purchases of book title Italian Art
Florence = 1 if The Art History of Florence was bought; = 0 if not
Related Purchase Number of related books purchased

(Source: Reproduced with permission of Jacob Zahavi)

RFM Segmentation

The segmentation process in database marketing aims to partition customers in a list of prospects into homogeneous groups (segments) that are similar with respect to buying behavior. The homogeneity criterion we need for segmentation is the propensity to purchase the offering. However, since we cannot measure this attribute, we use variables that are plausible indicators of this propensity.

In the direct marketing business, the most commonly used variables are the RFM variables:

R = recency, time since last purchase

F = frequency, number of previous purchases from the company over a period

M = monetary, amount of money spent on the company’s products over a period

The assumption is that the more recent the last purchase, the more products bought from the company in the past, and the more money spent in the past buying the company’s products, the more likely the customer is to purchase the product offered.

The 1800 observations in the dataset were divided into recency, frequency, and monetary categories as follows:

Recency:

0–2 months (Rcode = 1)

3–6 months (Rcode = 2)

7–12 months (Rcode = 3)

13 months and up (Rcode = 4)

Frequency:

1 book (Fcode = 1)

2 books (Fcode = 2)

3 books and up (Fcode = 3)

Monetary:

$0–$25 (Mcode = 1)

$26–$50 (Mcode = 2)

$51–$100 (Mcode = 3)

$101–$200 (Mcode = 4)

$201 and up (Mcode = 5)
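These category codes can be constructed from the raw R, F, and M columns of Table 21.1 with pandas' cut function. The following is a minimal sketch; the tiny DataFrame is illustrative stand-in data, not the actual CharlesBookClub.csv (which already carries Rcode, Fcode, and Mcode columns, so this binning only makes the category definitions concrete):

```python
import pandas as pd

# Illustrative stand-in for the R, F, M columns of CharlesBookClub.csv
df = pd.DataFrame({"R": [1, 5, 9, 20], "F": [1, 2, 7, 3], "M": [12, 40, 75, 250]})

# Bin edges follow the category definitions above (right-closed intervals)
df["Rcode"] = pd.cut(df["R"], bins=[-1, 2, 6, 12, float("inf")], labels=[1, 2, 3, 4]).astype(int)
df["Fcode"] = pd.cut(df["F"], bins=[0, 1, 2, float("inf")], labels=[1, 2, 3]).astype(int)
df["Mcode"] = pd.cut(df["M"], bins=[-1, 25, 50, 100, 200, float("inf")], labels=[1, 2, 3, 4, 5]).astype(int)
```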

Assignment

Partition the data into training (60%) and validation (40%). Use seed = 1.

  1. What is the response rate for the training data customers taken as a whole? What is the response rate for each of the 4 × 3 × 5 = 60 combinations of RFM categories? Which combinations have response rates in the training data that are above the overall response rate in the training data?
  2. Suppose that we decide to send promotional mail only to the “above-average” RFM combinations identified in part 1. Compute the response rate in the validation data using these combinations.
  3. Rework parts 1 and 2 with three segments:
    • Segment 1: RFM combinations that have response rates that exceed twice the overall response rate
    • Segment 2: RFM combinations that exceed the overall response rate but do not exceed twice that rate
    • Segment 3: the remaining RFM combinations
    Draw the lift curve (consisting of three points for these three segments) showing the number of customers in the validation dataset on the x-axis and cumulative number of buyers in the validation dataset on the y-axis.
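As a sketch of parts 1 and 2, the overall and per-combination response rates reduce to a groupby on the three code columns. Column names follow Table 21.1; the toy data below are stand-ins for the actual training partition:

```python
import pandas as pd

# Toy stand-in for the training partition of CharlesBookClub.csv
train = pd.DataFrame({
    "Rcode": [1, 1, 2, 2, 3, 3, 4, 4],
    "Fcode": [3, 3, 2, 2, 1, 1, 1, 1],
    "Mcode": [5, 5, 3, 3, 2, 2, 1, 1],
    "Florence": [1, 1, 1, 0, 0, 0, 0, 0],
})

overall = train["Florence"].mean()                        # overall response rate
by_combo = train.groupby(["Rcode", "Fcode", "Mcode"])["Florence"].mean()
above_avg = by_combo[by_combo > overall]                  # "above-average" RFM combinations
```

For part 2, the same set of above-average combinations would be used to filter the validation partition and compute its response rate.
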
k-Nearest Neighbors

The k-NN technique can be used to create segments based on proximity, measured both by past purchases of products similar to the one being offered and by propensity to purchase (as captured in the RFM variables). For The Art History of Florence, a possible segmentation by product proximity could be created using the following variables:

  • R: recency—months since last purchase
  • F: frequency—total number of past purchases
  • M: monetary—total money (in dollars) spent on books
  • FirstPurch: months since first purchase
  • RelatedPurch: total number of past purchases of related books (i.e., sum of purchases from the art and geography categories and of titles Secrets of Italian Cooking, Historical Atlas of Italy, and Italian Art)

  1. Use the k-NN approach to classify cases with k = 1, 2, …, 11, using Florence as the outcome variable. Based on the validation set, find the best k. Remember to normalize all five variables. Create a lift curve for the best k model, and report the expected lift for an equal number of customers from the validation dataset.
  2. The k-NN prediction algorithm gives a numerical value, which is a weighted average of the values of the Florence variable for the k nearest neighbors, with weights that are inversely proportional to distance. Using the best k that you calculated above with k-NN classification, now run a model with k-NN prediction and compute a lift curve for the validation data. Use all five predictors and normalized data. What is the range within which a prediction will fall? How does this result compare to the output you get with the k-NN classification?
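The scan over k in part 1 might look like the following scikit-learn sketch. The arrays here are synthetic stand-ins for the five normalized predictors and the Florence outcome; with the real data, the X matrices would hold R, F, M, FirstPurch, and RelatedPurch:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-ins for the five predictors and the Florence outcome
X_train, X_valid = rng.normal(size=(200, 5)), rng.normal(size=(100, 5))
y_train = (X_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
y_valid = (X_valid[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X_train)       # normalize using training statistics only
Xt, Xv = scaler.transform(X_train), scaler.transform(X_valid)

# Validation accuracy for k = 1, ..., 11; pick the best-performing k
accuracies = {k: KNeighborsClassifier(n_neighbors=k).fit(Xt, y_train).score(Xv, y_valid)
              for k in range(1, 12)}
best_k = max(accuracies, key=accuracies.get)
```

For part 2, replacing KNeighborsClassifier with KNeighborsRegressor(weights="distance") yields the distance-weighted numerical prediction described above.
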
Logistic Regression

The logistic regression model offers a powerful method for modeling response because it yields well-defined purchase probabilities. The model is especially attractive in consumer-choice settings because it can be derived from the random utility theory of consumer behavior.

Use the training set data of 1800 records to construct three logistic regression models with Florence as the outcome variable and each of the following sets of predictors:

  • The full set of 15 predictors in the dataset
  • A subset of predictors that you judge to be the best
  • Only the R, F, and M variables

  1. Create a cumulative gains chart summarizing the results from the three logistic regression models created above, along with the expected cumulative gains for a random selection of an equal number of customers from the validation dataset.
  2. If the cutoff criterion for a campaign is a 30% likelihood of a purchase, find the customers in the validation data that would be targeted and count the number of buyers in this set.
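A sketch of the fitting and cutoff steps with scikit-learn, on synthetic stand-in data (with the real data, X_train would hold the chosen predictor set and the outcome would be the Florence column):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Synthetic stand-ins sized like the case: 1800 training and 1400 validation records
X_train, X_valid = rng.normal(size=(1800, 15)), rng.normal(size=(1400, 15))
beta = np.zeros(15)
beta[:3] = 1.0                                    # planted signal in three predictors
y_train = (X_train @ beta + rng.normal(size=1800) > 1).astype(int)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
prop = model.predict_proba(X_valid)[:, 1]         # purchase propensities (validation)

targeted = prop >= 0.30                           # 30% likelihood-of-purchase cutoff
n_targeted = int(targeted.sum())
```

Repeating the fit with the three predictor sets (all 15 variables, a judged-best subset, and R/F/M only) gives the three models to compare on a single cumulative gains chart.
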

21.2 German Credit

GermanCredit.csv is the dataset for this case study.

Background

Money-lending has been around since the advent of money; it is perhaps the world’s second-oldest profession. The systematic evaluation of credit risk, though, is a relatively recent arrival; historically, lending was based largely on reputation and very incomplete data. Thomas Jefferson, the third President of the United States, was in debt throughout his life and unreliable in his debt payments, yet people continued to lend him money. It wasn’t until the beginning of the 20th century that the Retail Credit Company was founded to share information about credit. That company is now Equifax, one of the big three credit-scoring agencies (the other two are TransUnion and Experian).

Individual and local human judgment are now largely irrelevant to the credit reporting process. Credit agencies and other big financial institutions extending credit at the retail level collect huge amounts of data to predict whether defaults or other adverse events will occur, drawing on numerous items of customer and transaction information.

Data

This case deals with an early stage of the historical transition to predictive modeling, in which humans were employed to label records as either good or poor credit. The German Credit dataset2 has 30 variables and 1000 records, each record being a prior applicant for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases). Table 21.2 shows the values of these variables for the first four records. All the variables are explained in Table 21.3. New applicants for credit can also be evaluated on these 30 predictor variables and classified as a good or a bad credit risk based on the predictor values.

The consequences of misclassification have been assessed as follows: The costs of a false positive (incorrectly saying that an applicant is a good credit risk) outweigh the benefits of a true positive (correctly saying that an applicant is a good credit risk) by a factor of 5. This is summarized in Table 21.4. The opportunity cost table was derived from the average net profit per loan as shown in Table 21.5. Because decision makers are used to thinking of their decision in terms of net profits, we use these tables in assessing the performance of the various models.

Table 21.2 First four records from German Credit dataset

Table 21.3 Variables for the German Credit Dataset

Variable number Variable name Description Variable type Code description
1 OBS# Observation number Categorical Sequence number in dataset
2 CHKACCT Checking account status Categorical 0: <0 DM; 1: 0–200 DM; 2: >200 DM; 3: No checking account
3 DURATION Duration of credit in months Numerical
4 HISTORY Credit history Categorical 0: No credits taken; 1: All credits at this bank paid back duly; 2: Existing credits paid back duly until now; 3: Delay in paying off in the past; 4: Critical account
5 NEWCAR Purpose of credit Binary Car (new), 0: No, 1: Yes
6 USEDCAR Purpose of credit Binary Car (used), 0: No, 1: Yes
7 FURNITURE Purpose of credit Binary Furniture/equipment, 0: No, 1: Yes
8 RADIO/TV Purpose of credit Binary Radio/television, 0: No, 1: Yes
9 EDUCATION Purpose of credit Binary Education, 0: No, 1: Yes
10 RETRAINING Purpose of credit Binary Retraining, 0: No, 1: Yes
11 AMOUNT Credit amount Numerical
12 SAVACCT Average balance in savings account Categorical 0: <100 DM; 1: 101–500 DM; 2: 501–1000 DM; 3: >1000 DM; 4: Unknown/no savings account
13 EMPLOYMENT Present employment since Categorical 0: Unemployed; 1: <1 year; 2: 1–3 years; 3: 4–6 years; 4: ⩾7 years
14 INSTALLRATE Installment rate as % of disposable income Numerical
15 MALEDIV Applicant is male and divorced Binary 0: No, 1: Yes
16 MALESINGLE Applicant is male and single Binary 0: No, 1: Yes
17 MALEMARWID Applicant is male and married or a widower Binary 0: No, 1: Yes
18 CO-APPLICANT Application has a coapplicant Binary 0: No, 1: Yes
19 GUARANTOR Applicant has a guarantor Binary 0: No, 1: Yes
20 PRESENTRESIDENT Present resident since (years) Categorical 0: ⩽1 year; 1: 1–2 years; 2: 2–3 years; 3: ⩾3 years
21 REALESTATE Applicant owns real estate Binary 0: No, 1: Yes
22 PROPUNKNNONE Applicant owns no property (or unknown) Binary 0: No, 1: Yes
23 AGE Age in years Numerical
24 OTHERINSTALL Applicant has other installment plan credit Binary 0: No, 1: Yes
25 RENT Applicant rents Binary 0: No, 1: Yes
26 OWNRES Applicant owns residence Binary 0: No, 1: Yes
27 NUMCREDITS Number of existing credits at this bank Numerical
28 JOB Nature of job Categorical 0: Unemployed/unskilled, non-resident; 1: Unskilled, resident; 2: Skilled employee/official; 3: Management/self-employed/highly qualified employee/officer
29 NUMDEPENDENTS Number of people for whom liable to provide maintenance Numerical
30 TELEPHONE Applicant has phone in his or her name Binary 0: No, 1: Yes
31 FOREIGN Foreign worker Binary 0: No, 1: Yes
32 RESPONSE Credit rating is good Binary 0: No, 1: Yes

The original dataset had a number of categorical variables; some were transformed into a series of binary variables, while some ordered categorical variables were left as is, to be treated as numerical. (Data adapted from German Credit)

Table 21.4 Opportunity Cost Table (Deutsche Marks)

                     Predicted (decision)
Actual     Good (accept)    Bad (reject)
Good       0                100
Bad        500              0


Table 21.5 Average Net Profit (Deutsche Marks)

                     Predicted (decision)
Actual     Good (accept)    Bad (reject)
Good       100              0
Bad        −500             0


Assignment

  1. Review the predictor variables and guess what their role in a credit decision might be. Are there any surprises in the data?
  2. Divide the data into training and validation partitions, and develop classification models using the following data mining techniques: logistic regression, classification trees, and neural networks.
  3. Choose one model from each technique and report the confusion matrix and the cost/gain matrix for the validation data. Which technique has the highest net profit?
  4. Let us try to improve our performance. Rather than accept the default classification of all applicants’ credit status, use the estimated probabilities (propensities) from the logistic regression (where success means 1) as a basis for selecting the best credit risks first, followed by poorer-risk applicants. Create a vector containing the net profit for each record in the validation set. Use this vector to create a cumulative gains chart for the validation set that incorporates the net profit.
    1. How far into the validation data should you go to get maximum net profit? (Often, this is specified as a percentile or rounded to deciles.)
    2. If this logistic regression model is used to score future applicants, what “probability of success” cutoff should be used in extending credit?
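The profit-ranking logic of question 4 can be sketched as follows, on synthetic stand-in scores; with the real data, prop would come from the fitted logistic regression and actual_good from the RESPONSE column, while the 100/−500 profits come from Table 21.5:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins: propensities and actual outcomes (1 = good credit)
prop = rng.uniform(size=400)
actual_good = (rng.uniform(size=400) < prop).astype(int)

order = np.argsort(-prop)                              # best credit risks first
profit = np.where(actual_good[order] == 1, 100, -500)  # net profit per accepted loan
cum_profit = np.cumsum(profit)                         # cumulative gains in net profit

best_depth = int(np.argmax(cum_profit)) + 1            # records to accept for max profit
cutoff = float(prop[order][best_depth - 1])            # implied probability-of-success cutoff
```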

21.3 Tayko Software Cataloger3

Tayko.csv is the dataset for this case study.

Background

Tayko is a software catalog firm that sells games and educational software. It started out as a software manufacturer and later added third-party titles to its offerings. It has recently put together a revised collection of items in a new catalog, which it is preparing to roll out in a mailing.

In addition to its own software titles, Tayko’s customer list is a key asset. In an attempt to expand its customer base, it has recently joined a consortium of catalog firms that specialize in computer and software products. The consortium affords members the opportunity to mail catalogs to names drawn from a pooled list of customers. Members supply their own customer lists to the pool, and can “withdraw” an equivalent number of names each quarter. Members are allowed to do predictive modeling on the records in the pool so they can do a better job of selecting names from the pool.

The Mailing Experiment

Tayko has supplied its customer list of 200,000 names to the pool, which totals over 5,000,000 names, so it is now entitled to draw 200,000 names for a mailing. Tayko would like to select the names that have the best chance of performing well, so it conducts a test—it draws 20,000 names from the pool and does a test mailing of the new catalog.

This mailing yielded 1065 purchasers, a response rate of 0.053. To optimize the performance of the data mining techniques, it was decided to work with a stratified sample that contained equal numbers of purchasers and nonpurchasers. For ease of presentation, the dataset for this case includes just 1000 purchasers and 1000 nonpurchasers, an apparent response rate of 0.5. Therefore, after using the dataset to predict who will be a purchaser, we must adjust the purchase rate back down by multiplying each case’s “probability of purchase” by 0.053/0.5, or 0.107.

Data

There are two outcome variables in this case. Purchase indicates whether or not a prospect responded to the test mailing and purchased something. Spending indicates, for those who made a purchase, how much they spent. The overall procedure in this case will be to develop two models. One will be used to classify records as purchase or no purchase. The second will be used for those cases that are classified as purchase and will predict the amount they will spend.

Table 21.6 shows the first few rows of data. Table 21.7 provides a description of the variables available in this case.

Table 21.6 First 10 records from Tayko dataset

Table 21.7 Description of Variables for Tayko Dataset

Variable number Variable name Description Variable type Code description
1 US Is it a US address? Binary 1: Yes, 0: No
2–16 Source_* Source catalog for the record (15 possible sources) Binary 1: Yes, 0: No
17 Freq. Number of transactions in last year at source catalog Numerical
18 last_update_days_ago How many days ago last update was made to customer record Numerical
19 1st_update_days_ago How many days ago first update was made to customer record Numerical
20 RFM% Recency–frequency–monetary percentile, as reported by source catalog (see Section 21.1) Numerical
21 Web_order Customer placed at least one order via web Binary 1: Yes, 0: No
22 Gender=mal Customer is male Binary 1: Yes, 0: No
23 Address_is_res Address is a residence Binary 1: Yes, 0: No
24 Purchase Person made purchase in test mailing Binary 1: Yes, 0: No
25 Spending Amount (dollars) spent by customer in test mailing Numerical

Assignment

  1. Each catalog costs approximately $2 to mail (including printing, postage, and mailing costs). Estimate the gross profit that the firm could expect from the remaining 180,000 names if it selects them randomly from the pool.
  2. Develop a model for classifying a customer as a purchaser or nonpurchaser.
    1. Partition the data randomly into a training set (800 records), validation set (700 records), and test set (500 records).
    2. Run logistic regression with L2 penalty, using method LogisticRegressionCV, to select the best subset of variables, then use this model to classify the data into purchasers and nonpurchasers. Use only the training set for running the model. (Logistic regression is used because it yields an estimated “probability of purchase,” which is required later in the analysis.)
  3. Develop a model for predicting spending among the purchasers.

    1. Create subsets of the training and validation sets for only purchasers’ records by filtering for Purchase = 1.
    2. Develop models for predicting spending with the filtered datasets, using:
      1. Multiple linear regression (use stepwise regression)
      2. Regression trees
    3. Choose one model on the basis of its performance on the validation data.
  4. Return to the original test data partition. Note that this test data partition includes both purchasers and nonpurchasers. Create a new data frame called Score Analysis that contains the test data portion of this dataset.

    1. Add a column to the data frame with the predicted scores from the logistic regression.
    2. Add another column with the predicted spending amount from the prediction model chosen.
    3. Add a column for “adjusted probability of purchase” by multiplying “predicted probability of purchase” by 0.107. This is to adjust for oversampling the purchasers (see earlier description).
    4. Add a column for expected spending: adjusted probability of purchase × predicted spending.
    5. Plot the cumulative gains chart of the expected spending (cumulative expected spending as a function of number of records targeted).
    6. Using this cumulative gains curve, estimate the gross profit that would result from mailing to the 180,000 names on the basis of your data mining models.
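Steps 3–6 above can be sketched as follows. The arrays are synthetic stand-ins for the two models' outputs on the 500-record test partition, and the scale-up to 180,000 names in the last line is one plausible reading of step 6, not the only one:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two models' scores on the 500 test records
p_purchase = rng.uniform(size=500)              # predicted probability of purchase
pred_spending = rng.uniform(20, 200, size=500)  # predicted spending (dollars)

adj_prob = p_purchase * 0.107                 # adjust for oversampling of purchasers
expected_spend = adj_prob * pred_spending     # expected spending per record

order = np.argsort(-expected_spend)           # target the largest expected spenders first
cum_gains = np.cumsum(expected_spend[order])  # y-axis of the cumulative gains chart

# Rough gross-profit estimate: scale a chosen targeting depth up to 180,000 names,
# net of the $2 mailing cost per catalog
depth = 250
est_gross = cum_gains[depth - 1] * (180000 / 500) - 2 * 180000 * (depth / 500)
```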

Note: Although Tayko is a hypothetical company, the data in this case (modified slightly for illustrative purposes) were supplied by a real company that sells software through direct sales. The concept of a catalog consortium is based on the Abacus Catalog Alliance.

21.4 Political Persuasion4

Voter-Persuasion.csv is the dataset for this case study.

Note: Our thanks to Ken Strasma, President of HaystaqDNA and director of targeting for the 2004 Kerry campaign and the 2008 Obama campaign, for the data used in this case, and for sharing the information in the following writeup.

Background

When you think of political persuasion, you may think of the efforts that political campaigns undertake to persuade you that their candidate is better than the other candidate. In truth, campaigns are less about persuading people to change their minds, and more about persuading those who agree with you to actually go out and vote. Predictive analytics now plays a big role in this effort, but in 2004, it was a new arrival in the political toolbox.

Predictive Analytics Arrives in US Politics

In January of 2004, candidates in the US presidential campaign were competing in the Iowa caucuses, part of a lengthy state-by-state primary campaign that culminates in the selection of the Republican and Democratic candidates for president. Among the Democrats, Howard Dean was leading in national polls. The Iowa caucuses, however, are a complex and intensive process attracting only the most committed and interested voters. Those participating are not a representative sample of voters nationwide. Surveys of those planning to take part showed a close race between Dean and three other candidates, including John Kerry.

Kerry ended up winning by a surprisingly large margin, and the better than expected performance was due to his campaign’s innovative and successful use of predictive analytics to learn more about the likely actions of individual voters. This allowed the campaign to target voters in such a way as to optimize performance in the caucuses. For example, once the model showed sufficient support in a precinct to win that precinct’s delegate to the caucus, money and time could be redirected to other precincts where the race was closer.

Political Targeting

Targeting of voters is not new in politics. It has traditionally taken three forms:

  • Geographic
  • Demographic
  • Individual

In geographic targeting, resources are directed to a geographic unit—state, city, county, etc.—on the basis of prior voting patterns or surveys that reveal the political tendency in that geographic unit. It has significant limitations, though. If a county is only, say, 52% in your favor, it may be in the greatest need of attention, but if messaging is directed to everyone in the county, nearly half of it is reaching the wrong people.

In demographic targeting, the messaging is intended for demographic groups—for example, older voters, younger women voters, Hispanic voters, etc. The limitation of this method is that it is often not easy to implement—messaging is hard to deliver just to single demographic groups.

Traditional individual targeting, the most effective form of targeting, was done on the basis of surveys asking voters how they plan to vote. The big limitation of this method is, of course, the cost. The expense of reaching all voters in a phone or door-to-door survey can be prohibitive.

The use of predictive analytics adds power to the individual targeting method, and reduces cost. A model allows prediction to be rolled out to the entire voter base, not just those surveyed, and brings to bear a wealth of information. Geographic and demographic data remain part of the picture, but they are used at an individual level.

Uplift

In a classical predictive modeling application for marketing, a sample of data is selected and an offer is made (e.g., on the web) or a message is sent (e.g., by mail), and a predictive model is developed to classify individuals as responding or not-responding. The model is then applied to new data, propensities to respond are calculated, individuals are ranked by their propensity to respond, and the marketer can then select those most likely to respond to mailings or offers.

Some key information is missing from this classical approach: how would the individual respond in the absence of the offer or mailing? Might a high-propensity customer be inclined to purchase irrespective of the offer? Might a person’s propensity to buy actually be diminished by the offer? Uplift modeling (see Chapter 13) allows us to estimate the effect of “offer vs. no offer” or “mailing vs. no mailing” at the individual level.

In this case, we will apply uplift modeling to actual voter data that were augmented with the results of a hypothetical experiment. The experiment consisted of the following steps:

  1. Conduct a pre-survey of the voters to determine their inclination to vote Democratic.
  2. Randomly split the voters into two samples—control and treatment.
  3. Send a flyer promoting the Democratic candidate to the treatment group.
  4. Conduct another survey of the voters to determine their inclination to vote Democratic.

Data

The data in this case are in the file Voter-Persuasion.csv. The target variable is MOVED_AD, where a 1 = “opinion moved in favor of the Democratic candidate” and 0 = “opinion did not move in favor of the Democratic candidate.” This variable encapsulates the information from the pre- and post-surveys. The important predictor variable is Flyer, a binary variable that indicates whether or not a voter received the flyer. In addition, there are numerous other predictor variables from these sources:

  1. Government voter files
  2. Political party data
  3. Commercial consumer and demographic data
  4. Census neighborhood data

Government voter files are maintained, and made public, to assure the integrity of the voting process. They contain essential data for identification purposes such as name, address and date of birth. The file used in this case also contains party identification (needed if a state limits participation in party primaries to voters in that party). Parties also staff elections with their own poll-watchers, who record whether an individual votes in an election. These data (termed “derived” in the case data) are maintained and curated by each party, and can be readily matched to the voter data by name. Demographic data at the neighborhood level are available from the census, and can be appended to the voter data by address matching. Consumer and additional demographic data (buying habits, education) can be purchased from marketing firms and appended to the voter data (matching by name and address).

Assignment

The task in this case is to develop an uplift model that predicts the uplift for each voter. Uplift is defined as the increase in propensity to move one’s opinion in a Democratic direction. First, review the variables in Voter-Persuasion.csv and consider which data source each likely comes from. Then, answer the following questions and perform the tasks indicated:

  1. Overall, how well did the flyer do in moving voters in a Democratic direction? (Look at the target variable among those who got the flyer, compared to those who did not.)
  2. Explore the data to learn more about the relationships between the predictor variables and MOVED_AD (visualization can be helpful). Which of the predictors seem to have good predictive potential? Show supporting charts and/or tables.
  3. Partition the data using the partition variable that is in the dataset, make decisions about predictor inclusion, and fit three predictive models accordingly. For each model, give sufficient detail about the method used, its parameters, and the predictors used, so that your results can be replicated.
  4. Among your three models, choose the best one in terms of predictive power. Which one is it? Why did you choose it?
  5. Using your chosen model, report the propensities for the first three records in the validation set.
  6. Create a derived variable that is the opposite of Flyer. Call it Flyer-reversed. Using your chosen model, re-score the validation data using the Flyer-reversed variable as a predictor, instead of Flyer. Report the propensities for the first three records in the validation set.
  7. For each record, uplift is computed based on the following difference:

    P(success | Flyer = 1) − P(success | Flyer = 0)

    Compute the uplift for each of the voters in the validation set, and report the uplift for the first three records.
  8. If a campaign has the resources to mail the flyer only to 10% of the voters, what uplift cutoff should be used?
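The scoring procedure in steps 5 to 7 can be sketched as follows. This is a minimal sketch assuming pandas and a fitted classifier with a scikit-learn-style `predict_proba`; the model object and the treatment column name are placeholders for whatever your own pipeline produces.

```python
import pandas as pd

def score_uplift(model, valid_X, flyer_col="Flyer"):
    """Return P(success | Flyer=1) - P(success | Flyer=0) for each record."""
    X1 = valid_X.copy()
    X1[flyer_col] = 1                        # score as if everyone got the flyer
    p_flyer = model.predict_proba(X1)[:, 1]

    X0 = valid_X.copy()
    X0[flyer_col] = 0                        # score as if no one got the flyer
    p_no_flyer = model.predict_proba(X0)[:, 1]

    return pd.Series(p_flyer - p_no_flyer, index=valid_X.index, name="uplift")
```

For question 8, sort the resulting uplift scores in descending order and use the score at the 10th-percentile position as the mailing cutoff.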

21.5 Taxi Cancellations5

Taxi-cancellation-case.csv is the dataset for this case study.

Business Situation

In late 2013, the taxi company Yourcabs.com in Bangalore, India was facing a problem with the drivers using their platform—not all drivers were showing up for their scheduled calls. Drivers would cancel their acceptance of a call, and, if the cancellation did not occur with adequate notice, the customer would be delayed or even left high and dry.

Bangalore is a key tech center in India, and technology was transforming the taxi industry. Yourcabs.com featured an online booking system (though customers could phone in as well), and presented itself as a taxi booking portal. The Uber ride sharing service started its Bangalore operations in mid-2014.

Yourcabs.com had collected data on its bookings from 2011 to 2013, and posted a contest on Kaggle, in coordination with the Indian School of Business, to see what it could learn about the problem of cab cancellations.

The data presented for this case are a randomly selected subset of the original data, with 10,000 rows, one row for each booking. There are 17 input variables, including user (customer) ID, vehicle model, whether the booking was made online or via a mobile app, type of travel, type of booking package, geographic information, and the date and time of the scheduled trip. The target variable of interest is the binary indicator of whether a ride was canceled. The overall cancellation rate is between 7% and 8%.

Assignment

  1. How can a predictive model based on these data be used by Yourcabs.com?
  2. How can a profiling model (identifying predictors that distinguish canceled/uncanceled trips) be used by Yourcabs.com?
  3. Explore, prepare, and transform the data to facilitate predictive modeling. Here are some hints:
    • In exploratory modeling, it is useful to move fairly soon to at least an initial model without solving all data preparation issues. One example is the GPS information—other geographic information is available so you could defer the challenge of how to interpret/use the GPS information.
    • How will you deal with missing data, such as cases where NaN is indicated?
    • Think about what useful information might be held within the date and time fields (the booking timestamp and the trip timestamp). The data file contains a worksheet with some hints on how to extract features from the date/time field.
    • Think also about the categorical variables, and how to deal with them. Should we turn them all into dummies? Use only some?
  4. Fit several predictive models of your choice. Do they provide information on how the predictor variables relate to cancellations?
  5. Report the predictive performance of your model in terms of error rates (the confusion matrix). How well does the model perform? Can the model be used in practice?
  6. Examine the predictive performance of your model in terms of ranking (lift). How well does the model perform? Can the model be used in practice?
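As a starting point for the date/time hint in task 3, here is a hedged pandas sketch of feature extraction. The column names `from_date` and `booking_created` are assumptions and should be checked against the actual headers in Taxi-cancellation-case.csv.

```python
import pandas as pd

def add_time_features(df, trip_col="from_date", booking_col="booking_created"):
    """Derive hour-of-day, day-of-week, and booking lead time (hours)."""
    df = df.copy()
    # errors="coerce" turns unparseable or missing entries into NaT
    df[trip_col] = pd.to_datetime(df[trip_col], errors="coerce")
    df[booking_col] = pd.to_datetime(df[booking_col], errors="coerce")

    df["trip_hour"] = df[trip_col].dt.hour        # time-of-day pattern
    df["trip_dow"] = df[trip_col].dt.dayofweek    # weekday vs. weekend
    # Lead time between making the booking and the scheduled trip
    df["lead_time_hr"] = (df[trip_col] - df[booking_col]).dt.total_seconds() / 3600
    return df
```

Short lead times or odd trip hours are plausible candidates for cancellation predictors, which is exactly what exploratory modeling can check.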

21.6 Segmenting Consumers of Bath Soap6

BathSoapHousehold.csv is the dataset for this case study.

Business Situation

CRISA is an Asian market research agency that specializes in tracking consumer purchase behavior in consumer goods (both durable and nondurable). In one major research project, CRISA tracks numerous consumer product categories (e.g., “detergents”), and, within each category, perhaps dozens of brands. To track purchase behavior, CRISA constituted household panels in over 100 cities and towns in India, covering most of the Indian urban market. The households were carefully selected using stratified sampling to ensure a representative sample; a subset of 600 records is analyzed here. The strata were defined on the basis of socioeconomic status and the market (a collection of cities).

CRISA has both transaction data (each row is a transaction) and household data (each row is a household), and for the household data it maintains the following information:

  • Demographics of the households (updated annually)
  • Possession of durable goods (car, washing machine, etc., updated annually; an “affluence index” is computed from this information)
  • Purchase data of product categories and brands (updated monthly)

CRISA has two categories of clients: (1) advertising agencies that subscribe to the database services, obtain updated data every month, and use the data to advise their clients on advertising and promotion strategies; (2) consumer goods manufacturers, which monitor their market share using the CRISA database.

Key Problems

CRISA has traditionally segmented markets on the basis of purchaser demographics. They would now like to segment the market based on two key sets of variables more directly related to the purchase process and to brand loyalty:

  1. Purchase behavior (volume, frequency, susceptibility to discounts, and brand loyalty)
  2. Basis of purchase (price, selling proposition)

Doing so would allow CRISA to gain information about what demographic attributes are associated with different purchase behaviors and degrees of brand loyalty, and thus deploy promotion budgets more effectively. More effective market segmentation would enable CRISA’s clients (in this case, a firm called IMRB) to design more cost-effective promotions targeted at appropriate segments. Thus, multiple promotions could be launched, each targeted at different market segments at different times of the year. This would result in a more cost-effective allocation of the promotion budget to different market segments. It would also enable IMRB to design more effective customer reward systems and thereby increase brand loyalty.

Data

The data in Table 21.8 profile each household, each row containing the data for one household.

Measuring Brand Loyalty

Several variables in this case measure aspects of brand loyalty. The number of different brands purchased by the customer is one measure of loyalty. However, a consumer who purchases one or two brands in quick succession, then settles on a third for a long streak, is different from a consumer who constantly switches back and forth among three brands. Therefore, how often customers switch from one brand to another is another measure of loyalty. Yet a third perspective on the same issue is the proportion of purchases that go to different brands—a consumer who spends 90% of his or her purchase money on one brand is more loyal than a consumer who spends more equally among several brands.

All three of these components can be measured with the data in the purchase summary worksheet.
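A rough pandas sketch of the three loyalty components follows. The `No. of Brands` and `Brand Runs` names come from Table 21.8; the list of brand-share columns is left for you to supply from the worksheet, so treat the exact names as assumptions.

```python
import pandas as pd

def loyalty_measures(df, brand_share_cols):
    """Derive the three loyalty measures described in the case."""
    out = pd.DataFrame(index=df.index)
    out["n_brands"] = df["No. of Brands"]    # breadth: how many brands bought
    out["brand_runs"] = df["Brand Runs"]     # switching: consecutive-purchase streaks
    # Share of volume going to the single most-purchased brand; this derived
    # variable treats "all brand A" and "all brand B" as equally loyal
    out["max_brand_share"] = df[brand_share_cols].max(axis=1)
    return out
```

The `max_brand_share` variable is one way to address Note 2 of the assignment: using raw per-brand shares in a distance measure would separate equally loyal customers who happen to favor different brands.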

Assignment

  1. Use k-means clustering to identify clusters of households based on:
    1. The variables that describe purchase behavior (including brand loyalty)
    2. The variables that describe the basis for purchase
    3. The variables that describe both purchase behavior and basis of purchase

    Note 1: How should k be chosen? Think about how the clusters would be used. It is likely that the marketing efforts would support two to five different promotional approaches.

    Note 2: How should the percentages of total purchases comprised by various brands be treated? Isn’t a customer who buys all brand A just as loyal as a customer who buys all brand B? What will be the effect on any distance measure of using the brand share variables as is? Consider using a single derived variable.

    Table 21.8 Description of variables for each household

    Member ID
      Member id: Unique identifier for each household
    Demographics
      SEC: Socioeconomic class (1 = high, 5 = low)
      FEH: Eating habits (1 = vegetarian, 2 = vegetarian but eat eggs, 3 = nonvegetarian, 0 = not specified)
      MT: Native language (see table in worksheet)
      SEX: Gender of homemaker (1 = male, 2 = female)
      AGE: Age of homemaker
      EDU: Education of homemaker (1 = minimum, 9 = maximum)
      HS: Number of members in household
      CHILD: Presence of children in household (4 categories)
      CS: Television availability (1 = available, 2 = unavailable)
      Affluence Index: Weighted value of durables possessed
    Purchase summary over the period
      No. of Brands: Number of brands purchased
      Brand Runs: Number of instances of consecutive purchase of brands
      Total Volume: Sum of volume
      No. of Trans: Number of purchase transactions (multiple brands purchased in a month are counted as separate transactions)
      Value: Sum of value
      Trans/Brand Runs: Average transactions per brand run
      Vol/Trans: Average volume per transaction
      Avg. Price: Average price of purchase
    Purchase within promotion
      Pur Vol No Promo - %: Percent of volume purchased under no promotion
      Pur Vol Promo 6 %: Percent of volume purchased under promotion code 6
      Pur Vol Other Promo %: Percent of volume purchased under other promotions
    Brandwise purchase
      Br. Cd. (57, 144), 55, 272, 286, 24, 481, 352, 5, and 999 (others): Percent of volume purchased of the brand
    Price categorywise purchase
      Price Cat 1 to 4: Percent of volume purchased under the price category
    Selling propositionwise purchase
      Proposition Cat 5 to 15: Percent of volume purchased under the product proposition category

    (Source: © Cytel, Inc. and Datastats, LLC 2019)

  2. Select what you think is the best segmentation and comment on the characteristics (demographic, brand loyalty, and basis for purchase) of these clusters. (This information would be used to guide the development of advertising and promotional campaigns.)
  3. Develop a model that classifies the data into these segments. Since this information would most likely be used in targeting direct-mail promotions, it would be useful to select a market segment that would be defined as a success in the classification model.
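The clustering in assignment step 1 can be sketched as below, assuming scikit-learn and a feature matrix `X` of purchase-behavior variables that you assemble yourself. Standardizing first prevents large-scale variables such as Total Volume from dominating the percentage variables in the distance measure.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def segment(X, k):
    """Standardize features, then run k-means with k clusters."""
    Xs = StandardScaler().fit_transform(X)   # put all variables on one scale
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(Xs)
    return km.labels_, km.cluster_centers_
```

Running this for k = 2 through 5 and comparing cluster profiles is one practical way to act on Note 1 about the number of supportable promotional approaches.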

21.7 Direct-Mail Fundraising

Fundraising.csv and FutureFundraising.csv are the datasets used for this case study.

Background

Note: Be sure to read the information about oversampling and adjustment in Chapter 5 before starting to work on this case.

A national veterans’ organization wishes to develop a predictive model to improve the cost-effectiveness of its direct marketing campaigns. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to its recent mailing records, the overall response rate is 5.1%. Among those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling is used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Data

The file Fundraising.csv contains 3120 records with 50% donors (TARGET_B = 1) and 50% non-donors (TARGET_B = 0). The amount of donation (TARGET_D) is also included but is not used in this case. The descriptions for the 22 variables (including two target variables) are listed in Table 21.9.

Assignment

  • Step 1—Partitioning: Partition the dataset into 60% training and 40% validation (set the seed to 12345).
  • Step 2—Model Building: Follow these steps to build, evaluate, and choose a model.
    1. Select classification tool and parameters: Run at least two classification models of your choosing. Be sure NOT to use TARGET_D in your analysis. Describe the two models that you chose, with sufficient detail (method, parameters, variables, etc.) so that it can be replicated.
    2. Classification under asymmetric response and cost: What is the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset?
    3. Calculate net profit: For each method, calculate the cumulative gains of net profit for both the training and validation sets based on the actual response rate (5.1%). Again, the expected donation, given that they are donors, is $13.00, and the total cost of each mailing is $0.68. (Hint: To calculate estimated net profit, we will need to undo the effects of the weighted sampling and calculate the net profit that would reflect the actual response distribution of 5.1% donors and 94.9% non-donors. To do this, divide each row’s net profit by the oversampling weight applicable to the actual status of that row. The oversampling weight for actual donors is 50%/5.1% = 9.8. The oversampling weight for actual non-donors is 50%/94.9% = 0.53.)

      Table 21.9 Description of Variables for the Fundraising Dataset

      ZIP: Zip code group (zip codes were grouped into five groups; 1 = the potential donor belongs to this zip group)
        00000–19999 ⇒ zipconvert_1
        20000–39999 ⇒ zipconvert_2
        40000–59999 ⇒ zipconvert_3
        60000–79999 ⇒ zipconvert_4
        80000–99999 ⇒ zipconvert_5
      HOMEOWNER: 1 = homeowner, 0 = not a homeowner
      NUMCHLD: Number of children
      INCOME: Household income
      GENDER: 0 = male, 1 = female
      WEALTH: Wealth rating; uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0 to 9, with 9 being the highest-wealth group and 0 the lowest. Each rating has a different meaning within each state.
      HV: Average home value in potential donor’s neighborhood, in hundreds of dollars
      ICmed: Median family income in potential donor’s neighborhood, in hundreds of dollars
      ICavg: Average family income in potential donor’s neighborhood, in hundreds of dollars
      IC15: Percent earning less than $15K in potential donor’s neighborhood
      NUMPROM: Lifetime number of promotions received to date
      RAMNTALL: Dollar amount of lifetime gifts to date
      MAXRAMNT: Dollar amount of largest gift to date
      LASTGIFT: Dollar amount of most recent gift
      TOTALMONTHS: Number of months from last donation to July 1998 (the last time the case was updated)
      TIMELAG: Number of months between first and second gift
      AVGGIFT: Average dollar amount of gifts to date
      TARGET_B: Outcome variable, binary indicator for response (1 = donor, 0 = non-donor)
      TARGET_D: Outcome variable, donation amount in dollars (NOT used in this case)
    4. Draw cumulative gains curves: Draw the different models’ net profit cumulative gains curves for the validation set in a single plot (net profit on the y-axis, proportion of list or number mailed on the x-axis). Is there a model that dominates?
    5. Select best model: From your answer in (4), what do you think is the “best” model?
  • Step 3—Testing: The file FutureFundraising.csv contains the attributes for future mailing candidates.

    1. Using your “best” model from Step 2 (number 5), which of these candidates do you predict as donors and non-donors? List them in descending order of the probability of being a donor. Starting at the top of this sorted list, roughly how far down would you go in a mailing campaign?
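The net-profit adjustment described in Step 2(3) can be sketched as follows, assuming NumPy arrays of validation-set propensities and actual outcomes. Each row’s profit is divided by its oversampling weight (50%/5.1% for donors, 50%/94.9% for non-donors) to undo the 50/50 weighted sampling.

```python
import numpy as np

def cumulative_net_profit(propensity, actual, donation=13.00, cost=0.68):
    """Cumulative net profit at each mailing depth, adjusted for oversampling."""
    propensity = np.asarray(propensity, float)
    actual = np.asarray(actual)
    order = np.argsort(-propensity)            # mail the best prospects first
    actual = actual[order]
    profit = np.where(actual == 1, donation - cost, -cost)
    weight = np.where(actual == 1, 0.5 / 0.051, 0.5 / 0.949)
    return np.cumsum(profit / weight)          # running total down the list
```

Plotting this running total against mailing depth for each model gives the cumulative gains curves asked for in Step 2(4).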

21.8 Catalog Cross-Selling7

CatalogCrossSell.csv is the dataset for this case study.

Background

Exeter, Inc. is a catalog firm that sells products in a number of different catalogs that it owns. The catalogs number in the dozens, but fall into nine basic categories:

  1. Clothing
  2. Housewares
  3. Health
  4. Automotive
  5. Personal electronics
  6. Computers
  7. Garden
  8. Novelty gift
  9. Jewelry

The costs of printing and distributing catalogs are high. By far the biggest cost of operation is the cost of promoting products to people who buy nothing. Having invested so much in the production of artwork and printing of catalogs, Exeter wants to take every opportunity to use them effectively. One such opportunity is in cross-selling—once a customer has “taken the bait” and purchases one product, try to sell them another while you have their attention.

Such cross-promotion might take the form of enclosing a catalog in the shipment of a purchased product, together with a discount coupon to induce a purchase from that catalog. Or, it might take the form of a similar coupon sent by e-mail, with a link to the web version of that catalog.

But which catalog should be enclosed in the box or included as a link in the e-mail with the discount coupon? Exeter would like it to be an informed choice—a catalog that has a higher probability of inducing a purchase than simply choosing a catalog at random.

Assignment

Using the dataset CatalogCrossSell.csv, perform an association rules analysis, and comment on the results. Your discussion should provide interpretations in English of the meanings of the various output statistics (lift ratio, confidence, support) and include a very rough estimate (precise calculations are not necessary) of the extent to which this will help Exeter make an informed choice about which catalog to cross-promote to a purchaser.
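To make the output statistics concrete, here is a small illustration of support, confidence, and lift for one hypothetical rule, assuming a 0/1 purchase matrix with one column per catalog category. An association-rules package would enumerate such rules automatically; this sketch only shows what the statistics measure.

```python
import pandas as pd

def rule_stats(df, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent => consequent."""
    n = len(df)
    both = ((df[antecedent] == 1) & (df[consequent] == 1)).sum()
    support = both / n                                  # P(A and B)
    confidence = both / (df[antecedent] == 1).sum()     # P(B | A)
    lift = confidence / (df[consequent] == 1).mean()    # P(B | A) / P(B)
    return support, confidence, lift
```

A lift well above 1 for a rule such as "Clothing => Garden" would suggest enclosing the Garden catalog with Clothing shipments rather than choosing a catalog at random.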

Acknowledgment

The data for this case have been adapted from the data in a set of cases provided for educational purposes by the Direct Marketing Education Foundation (“DMEF Academic Data Set Two, Multi Division Catalog Company, Code: 02DMEF”); used with permission.

21.9 Time Series Case: Forecasting Public Transportation Demand

bicup2006.csv is the dataset for this case study.

Background

Forecasting transportation demand is important for multiple purposes such as staffing, planning, and inventory control. The public transportation system in Santiago de Chile has gone through a major effort of reconstruction. In this context, a business intelligence competition took place in October 2006, which focused on forecasting demand for public transportation. This case is based on the competition, with some modifications.

Problem Description

A public transportation company is expecting an increase in demand for its services and is planning to acquire new buses and to extend its terminals. These investments require a reliable forecast of future demand. To create such forecasts, one can use data on historic demand. The company’s data warehouse has data on each 15-minute interval between 6:30 and 22:00, on the number of passengers arriving at the terminal. As a forecasting consultant, you have been asked to create a forecasting method that can generate forecasts for the number of passengers arriving at the terminal.

Available Data

Part of the historic information is available in the file bicup2006.csv. The file contains the historic information with known demand for a 3-week period, separated into 15-minute intervals, and dates and times for a future 3-day period (DEMAND = NaN), for which forecasts should be generated (as part of the 2006 competition).

Assignment Goal

Your goal is to create a model/method that produces accurate forecasts. To evaluate your accuracy, partition the given historic data into two periods: a training period (the first two weeks), and a validation period (the last week). Models should be fitted only to the training data and evaluated on the validation data.

Although the competition winning criterion was the lowest Mean Absolute Error (MAE) on the future 3-day data, this is not the goal for this assignment. Instead, if we consider a more realistic business context, our goal is to create a model that generates reasonably good forecasts on any time/day of the week. Consider not only predictive metrics such as MAE, MAPE (mean absolute percentage error), and RMSE (root-mean-squared error), but also look at actual and forecasted values, overlaid on a time plot, as well as a time plot of the forecast errors.
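A small helper for these error metrics, assuming NumPy arrays of actual and forecasted values aligned over the validation period. Zero actuals are excluded from the MAPE computation, a common pitfall with count data such as passenger arrivals.

```python
import numpy as np

def mae_mape_rmse(actual, forecast):
    """MAE, MAPE (%), and RMSE; zero actuals are skipped in MAPE."""
    actual = np.asarray(actual, float)
    forecast = np.asarray(forecast, float)
    err = actual - forecast
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    nz = actual != 0                       # guard against division by zero
    mape = np.abs(err[nz] / actual[nz]).mean() * 100
    return mae, mape, rmse
```

Reporting these alongside time plots of actual vs. forecasted values, and of the forecast errors, gives a fuller picture than any single metric.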

Assignment

For your final model, present the following summary:

  1. Name of the method/combination of methods.
  2. A brief description of the method/combination.
  3. All estimated equations associated with constructing forecasts from this method.
  4. The MAPE and MAE for the training period and the validation period.
  5. Forecasts for the future period (March 22–24), in 15-minute bins.
  6. A single chart showing the fit of the final version of the model to the entire period (including training, validation, and future). Note that this model should be fitted using the combined training plus validation data.

Tips and Suggested Steps

  1. Use exploratory analysis to identify the components of this time series. Is there a trend? Is there seasonality? If so, how many “seasons” are there? Are there any other visible patterns? Are the patterns global (the same throughout the series) or local?
  2. Consider the frequency of the data from a practical and technical point of view. What are some options?
  3. Compare the weekdays and weekends. How do they differ? Consider how these differences can be captured by different methods.
  4. Examine the series for missing values or unusual values. Think of solutions.
  5. Based on the patterns that you found in the data, which models or methods should be considered?
  6. Consider how to handle actual counts of zero within the computation of MAPE.

Notes

  1. The Charles Book Club case was derived, with the assistance of Ms. Vinni Bhandari, from The Bookbinders Club, a Case Study in Database Marketing, prepared by Nissan Levin and Jacob Zahavi, Tel Aviv University; used with permission.
  2. This dataset is available from ftp.ics.uci.edu/pub/machine-learning-databases/statlog.
  3. Copyright © Datastats, LLC 2019; used with permission.
  4. Copyright © Datastats, LLC 2019; used with permission.
  5. Copyright © Datastats, LLC and Galit Shmueli 2019; used with permission.
  6. Copyright © Cytel, Inc. and Datastats, LLC 2019; used with permission.
  7. Copyright © Datastats, LLC 2019; used with permission.