By the end of this chapter, you will be able to:
In this chapter, we will examine an extensive and detailed real-world case study of sales data. This case study will not only demonstrate the processes used in SQL analysis to find solutions for actual problems but will also provide you with confidence and experience in solving such problems.
Throughout SQL for Data Analytics, you have learned a range of new skills, including basic descriptive statistics, SQL commands and importing and exporting data in PostgreSQL, as well as more advanced methods, such as functions and triggers. In this final chapter of the book, we will combine these new skills with the scientific method and critical thinking to solve the real-world problem of understanding the cause of an unexpected drop in sales. This chapter provides a case study and will help you to develop confidence in applying your new SQL skillset to your own problem domains. To solve the problem presented in this use case, we will use the complete range of your newly developed skills, from using basic SQL searches to filter out the available information to aggregating and joining multiple sets of information and using windowing methods to group the data in a logical manner. By completing case studies such as this, you will refine one of the key tools in your data analysis toolkit, providing a boost to your data science career.
Throughout this chapter, we will cover the following case study. The new ZoomZoom Bat Scooter is now available for sale exclusively through its website. Sales are looking good, but suddenly, pre-orders start plunging by 20% after a couple of weeks. What's going on? As the best data analyst at ZoomZoom, it's been assigned to you to figure it out.
In this case study, we will be following the scientific method to help solve our problem, which, at its heart, is about testing guesses (or hypotheses) using objectively collected data. We can decompose the scientific method into the following key steps:
It is to be noted that in this chapter, we are completing a post-hoc analysis of the data, that is, the event has happened, and all available data has been collected. Post-hoc data analysis is particularly useful when events have been recorded that cannot be repeated or when certain external factors cannot be controlled. It is with this data that we are able to perform our analysis, and, as such, we will extract information to support or refute our hypothesis. We will, however, be unable to definitively confirm or reject the hypothesis without practical experimentation. The question that will be the subject of this chapter and that we need to answer is this: why did the sales of the ZoomZoom Bat Scooter drop by approximately 20% after about 2 weeks?
So, let's start with the absolute basics.
In this exercise, we will collect preliminary data using SQL techniques. We have been told that the pre-orders for the ZoomZoom Bat Scooter were good, but the orders suddenly dropped by 20%. So, when was production started on the scooter, and how much was it selling for? How does the Bat Scooter compare with other types of scooters in terms of price? The goal of this exercise is to answer these questions:
$ psql sqlda
sqlda=# SELECT model, base_msrp, production_start_date FROM products WHERE product_type='scooter';
The following table shows the details of all the products for the scooter product type:
Looking at the results from the search, we can see that we have two scooter products with Bat in the name; Bat and Bat Limited Edition. The Bat Scooter, which started production on October 10, 2016, with a suggested retail price of $599.99; and the Bat Limited Edition Scooter, which started production approximately 4 months later, on February 15, 2017, at a price of $699.99.
Looking at the product information supplied, we can see that the Bat Scooter is somewhat unique from a price perspective, being the only scooter with a suggested retail price of $599.99. There are two others at $699.99 and one at $499.99.
Similarly, if we consider the production start date in isolation, the original Bat Scooter is again unique in that it is the only scooter starting production in the last quarter or even half of the year (date format: YYYY-MM-DD). All other scooters start production in the first half of the year, with only the Blade scooter starting production in June.
In order to use the sales information in conjunction with the product information available, we also need to get the product ID for each of the scooters.
sqlda=# SELECT model, product_id FROM products WHERE product_type='scooter';
The query yields the product IDs shown in the following table:
sqlda=# SELECT model, product_id INTO product_names FROM products WHERE product_type='scooter';
Inspect the contents of the product_names table shown in the following figure:
As described in the output, we can see that the Bat Scooter lies between the price points of some of the other scooters and that it was also manufactured a lot later in the year compared to the others.
By completing this very preliminary data collection step, we have the information required to collect sales data on the Bat Scooter as well as other scooter products for comparison. While this exercise involved using the simplest SQL commands, it has already yielded some useful information.
This exercise has also demonstrated that even the simplest SQL commands can reveal useful information and that they should not be underestimated. In the next exercise, we will try to extract the sales information related to the reduction in sales of the Bat Scooter.
In this exercise, we will use a combination of simple SELECT statements, as well as aggregate and window functions, to examine the sales data. With the preliminary information at hand, we can use it to extract the Bat Scooter sales records and discover what is actually going on. We have a table, product_names, that contains both the model names and product IDs. We will need to combine this information with the sales records and extract only those for the Bat Scooter:
$ psql sqlda
sqlda=# d
The preceding query yields the following fields present in the database:
We can see that we have references to customer and product IDs, as well as the transaction date, sales information, the sales channel, and the dealership ID.
sqlda=# SELECT model, customer_id, sales_transaction_date, sales_amount, channel, dealership_id INTO products_sales FROM sales INNER JOIN product_names ON sales.product_id=product_names.product_id;
The output of the preceding code can be seen in the next step.
Throughout this chapter, we will be storing the results of queries and calculations in separate tables as this will allow you to look at the results of the individual steps in the analysis at any time. In a commercial/production setting, we would typically only store the end result in a separate table, depending upon the context of the problem being solved.
sqlda=# SELECT * FROM products_sales LIMIT 5;
The following table lists the top five customers who made a purchase. It shows the sale amount and the transaction details, such as the date and time:
sqlda=# SELECT * FROM products_sales WHERE model='Bat' ORDER BY sales_transaction_date;
The preceding query generates the following output:
sqlda=# SELECT COUNT(model) FROM products_sales WHERE model='Bat';
The model count for the 'Bat' model is as shown here:
So, we have 7328 sales, beginning October 10, 2016. Check the date of the final sales record by performing the next step.
sqlda=# SELECT MAX(sales_transaction_date) FROM products_sales WHERE model='Bat';
The last sale date is shown here:
The last sale in the database occurred on May 31, 2019.
sqlda=# SELECT * INTO bat_sales FROM products_sales WHERE model='Bat' ORDER BY sales_transaction_date;
sqlda=# UPDATE bat_sales SET sales_transaction_date=DATE(sales_transaction_date);
sqlda=# SELECT * FROM bat_sales ORDER BY sales_transaction_date LIMIT 5;
The following is the output of the preceding code:
sqlda=# SELECT sales_transaction_date, COUNT(sales_transaction_date) INTO bat_sales_daily FROM bat_sales GROUP BY sales_transaction_date ORDER BY sales_transaction_date;
sqlda=# SELECT * FROM bat_sales_daily LIMIT 22;
This will display the following output:
We can see a drop-in sales after October 20, as there are 7 days in the first 11 rows that record double-digit sales, and none over the next 11 days.
At this stage, we can confirm that there has been a drop off in sales, although we are yet to quantify precisely the extent of the reduction or the reason for the drop off in sales.
In this activity, we will use our knowledge of the windowing methods that we learned in Chapter 5, Window Functions for Data Analysis. In the previous exercise, we identified the occurrence of the sales drop as being approximately 10 days after launch. Here, we will try to quantify the drop off in sales for the Bat Scooter.
Perform the following steps to complete the activity:
Expected Output
The solution to the activity can be found on page 354.
While the count and cumulative sum columns are reasonably straightforward, why do we need the lag and volume columns? This is because we are looking for drops in sales growth over the first couple of weeks, hence, we compare the daily sum of sales to the same values 7 days earlier (the lag). By subtracting the sum and lag values and dividing by the lag, we obtain the volume value and can determine sales growth compared to the previous week.
Notice that the sales volume on October 17 is 700% above that of the launch date of October 10. By October 22, the volume is over double that of the week prior. As time passes, this relative difference begins to decrease dramatically. By the end of October, the volume is 28% higher than the week prior. At this stage, we have observed and confirmed the presence of a reduction in sales growth after the first 2 weeks. The next stage is to attempt to explain the causes of the reduction.
In this exercise, we will try to identify the causes of a sales drop. Now that we have confirmed the presence of the sales growth drop, we will try to explain the cause of the event. We will test the hypothesis that the timing of the scooter launch attributed to the reduction in sales. Remember, in Exercise 34, Preliminary Data Collection Using SQL Techniques, that the ZoomZoom Bat Scooter launched on October 10, 2016. Observe the following steps to complete the exercise:
$ psql sqlda
sqlda=# SELECT * FROM products;
The following figure shows the launch dates for all the products:
All the other products launched before July, compared to the Bat Scooter, which launched in October.
sqlda=# SELECT * FROM products WHERE product_type='scooter';
The following table shows all the information for products with the product type of scooter:
To test the hypothesis that the time of year had an impact on sales performance, we require a scooter model to use as the control or reference group. In an ideal world, we could launch the ZoomZoom Bat Scooter in a different location or region, for example, but just at a different time, and then compare the two. However, we cannot do this here. Instead, we will choose a similar scooter launched at a different time. There are several different options in the product database, each with its own similarities and differences to the experimental group (ZoomZoom Bat Scooter). In our opinion, the Bat Limited Edition Scooter is suitable for comparison (the control group). It is slightly more expensive, but it was launched only 4 months after the Bat Scooter. Looking at its name, the Bat Limited Edition Scooter seems to share most of the same features, with a number of extras given that it's a "limited edition."
sqlda=# SELECT * FROM sales LIMIT 5;
The sales information for the first five customers is as follows:
sqlda=# SELECT products.model, sales.sales_transaction_date INTO bat_ltd_sales FROM sales INNER JOIN products ON sales.product_id=products.product_id WHERE sales.product_id=8 ORDER BY sales.sales_transaction_date;
sqlda=# SELECT * FROM bat_ltd_sales LIMIT 5;
The following table shows the transaction details for the first five entries of Bat Limited Edition:
sqlda=# SELECT COUNT(model) FROM bat_ltd_sales;
The total sales count can be seen in the following figure:
This is compared to the original Bat Scooter, which sold 7,328 items.
sqlda=# SELECT MAX(sales_transaction_date) FROM bat_ltd_sales;
The transaction details of the last Bat Limited Edition product are as follows:
sqlda=# ALTER TABLE bat_ltd_sales ALTER COLUMN sales_transaction_date TYPE date;
sqlda=# SELECT * FROM bat_ltd_sales LIMIT 5;
The following table shows the first five records of bat_ltd_sales:
sqlda=# SELECT sales_transaction_date, count(sales_transaction_date) INTO bat_ltd_sales_count FROM bat_ltd_sales GROUP BY sales_transaction_date ORDER BY sales_transaction_date;
sqlda=# SELECT * FROM bat_ltd_sales_count;
The sales count is shown in the following figure:
sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date) INTO bat_ltd_sales_growth FROM bat_ltd_sales_count;
sqlda=# SELECT * FROM bat_ltd_sales_growth LIMIT 22;
The following table displays the first 22 records of sales growth:
sqlda=# SELECT * FROM bat_sales_growth LIMIT 22;
The following table shows the sales details for the first 22 records of the bat_sales_growth table:
Sales of the limited-edition scooter did not reach double digits during the first 22 days, nor did the daily volume of sales fluctuate as much. In keeping with the overall sales figure, the limited edition sold 64 fewer units over the first 22 days.
sqlda=# SELECT *, lag(sum , 7) OVER (ORDER BY sales_transaction_date) INTO bat_ltd_sales_delay FROM bat_ltd_sales_growth;
sqlda=# SELECT *, (sum-lag)/lag AS volume INTO bat_ltd_sales_vol FROM bat_ltd_sales_delay;
sqlda=# SELECT * FROM bat-ltd_sales_vol LIMIT 22;
The sales volume can be seen in the following figure:
Looking at the volume column in the preceding diagram, we can again see that the sales growth is more consistent than the original Bat Scooter. The growth within the first week is less than that of the original model, but it is sustained over a longer period. After 22 days of sales, the sales growth of the limited-edition scooter is 65% compared to the previous week, as compared with the 28% growth identified in the second activity of the chapter.
At this stage, we have collected data from two similar products launched at different time periods and found some differences in the trajectory of the sales growth over the first 3 weeks of sales. In a professional setting, we may also consider employing more sophisticated statistical comparison methods, such as tests for differences of mean, variance, survival analysis, or other techniques. These methods lie outside the scope of this book and, as such, limited comparative methods will be used.
While we have shown there to be a difference in sales between the two Bat Scooters, we also cannot rule out the fact that the sales differences can be attributed to the difference in the sales price of the two scooters, with the limited-edition scooter being $100 more expensive. In the next activity, we will compare the sales of the Bat Scooter to the 2013 Lemon, which is $100 cheaper, was launched 3 years prior, is no longer in production, and started production in the first half of the calendar year.
In this activity, we are going to investigate the hypothesis that the reduction in sales growth can be attributed to the price point of the Bat Scooter. Previously, we considered the launch date. However, there could be another factor – the sales price included. If we consider the product list of scooters shown in Figure 9.23, and exclude the Bat model scooter, we can see that there are two price categories, $699.99 and above, or $499.99 and below. The Bat Scooter sits exactly between these two groups; perhaps the reduction in sales growth can be attributed to the different pricing model. In this activity, we will test this hypothesis by comparing Bat sales to the 2013 Lemon:
The following are the steps to perform:
Expected Output
The solution to the activity can be found on page 356.
Now that we have collected data to test the two hypotheses of timing and cost, what observations can we make and what conclusions can we draw? The first observation that we can make is regarding the total volume of sales for the three different scooter products. The Lemon Scooter, over its production life cycle of 4.5 years, sold 16,558 units, while the two Bat Scooters, the Original and Limited Edition models, sold 7,328 and 5,803 units, respectively, and are still currently in production, with the Bat Scooter launching about 4 months earlier and with approximately 2.5 years of sales data available. Looking at the sales growth of the three different scooters, we can also make a few different observations:
Based on this information, we can make a number of different conclusions:
Before we draw the conclusion that the difference can be attributed to seasonal variations and launch timing, let's ensure that we have extensively tested a range of possibilities. Perhaps marketing work, such as email campaigns, that is, when the emails were sent, and the frequency with which the emails were opened, made a difference.
Now that we have considered both the launch timing and the suggested retail price of the scooter as a possible cause of the reduction in sales, we will direct our efforts to other potential causes, such as the rate of opening of marketing emails. Does the marketing email opening rate have an effect on sales growth throughout the first 3 weeks? We will find this out in our next exercise.
In this exercise, we will analyze the sales growth using the email opening rate. To investigate the hypothesis that a decrease in the rate of opening emails impacted the Bat Scooter sales rate, we will again select the Bat and Lemon Scooters and will compare the email opening rate.
Perform the following steps to complete the exercise:
$ psql sqlda
sqlda=# SELECT * FROM emails LIMIT 5;
The following table displays the email information for the first five rows:
To investigate our hypothesis, we need to know whether an email was opened, and when it was opened, as well as who the customer was who opened the email and whether that customer purchased a scooter. If the email marketing campaign was successful in maintaining the sales growth rate, we would expect a customer to open an email soon before a scooter was purchased.
The period in which the emails were sent, as well as the ID of customers who received and opened an email, can help us to determine whether a customer who made a sale may have been encouraged to do so following the receipt of an email.
sqlda=# SELECT emails.email_subject, emails.customer_id, emails.opened, emails.sent_date, emails.opened_date, bat_sales.sales_transaction_date INTO bat_emails FROM emails INNER JOIN bat_sales ON bat_sales.customer_id=emails.customer_id ORDER BY bat_sales.sales_transaction_date;
sqlda=# SELECT * FROM bat_emails LIMIT 10;
The following table shows the first 10 rows of the bat_emails table ordered by sales_transaction_date:
We can see here that there are several emails unopened, over a range of sent dates, and that some customers have received multiple emails. Looking at the subjects of the emails, some of them don't seem related to the Zoom scooters at all.
sqlda=# SELECT * FROM bat_emails WHERE sent_date < sales_transaction_date ORDER BY customer_id LIMIT 22;
The following table lists the emails sent to the customers before the sales_transaction_date column:
sqlda=# DELETE FROM bat_emails WHERE sent_date < '2016-04-10';
In this exercise, we are removing information that we no longer require from an existing table. This differs from the previous exercises, where we created multiple tables each with slightly different information from other. The technique you apply will differ depending upon the requirements of the problem being solved; do you require a traceable record of analysis, or is efficiency and reduced storage key?
sqlda=# DELETE FROM bat_emails WHERE sent_date > sales_transaction_date;
sqlda=# DELETE FROM bat_emails WHERE (sales_transaction_date-sent_date) > '30 days';
sqlda=# SELECT * FROM bat_emails ORDER BY customer_id LIMIT 22;
The following table shows the emails where the difference between the transaction date and the sent date is less than 30:
At this stage, we have reasonably filtered the available data based on the dates the email was sent and opened. Looking at the preceding email_subject column, it also appears that there are a few emails unrelated to the Bat Scooter, for example, 25% of all EVs. It's a Christmas Miracle! and Black Friday. Green Cars. These emails seem more related to electric car production instead of scooters, and so we can remove them from our analysis.
sqlda=# SELECT DISTINCT(email_subject) FROM bat_emails;
The following table shows a list of distinct email subjects:
sqlda=# DELETE FROM bat_emails WHERE position('Black Friday' in email_subject)>0;
The position function in the preceding example is used to find any records where the Black Friday string is at the first character in the mail or more in email_structure. Thus, we are deleting any rows where Black Friday is in the email subject. For more information on PostgreSQL, refer to the documentation regarding string functions: https://www.postgresql.org/docs/current/functions-string.html.
sqlda=# DELETE FROM bat_emails WHERE position('25% off all EV' in email_subject)>0;
sqlda=# DELETE FROM bat_emails WHERE position('Some New EV' in email_subject)>0;
sqlda=# SELECT count(sales_transaction_date) FROM bat_emails;
We can see that 401 rows are left in the sample:
sqlda=# SELECT count(opened) FROM bat_emails WHERE opened='t'
We can see that 98 emails were opened:
sqlda=# SELECT COUNT(DISTINCT(customer_id)) FROM bat_emails;
We can see that 396 customers who received an email made a purchase:
sqlda=# SELECT COUNT(DISTINCT(customer_id)) FROM bat_sales;
Following is the output of the preceding code:
sqlda=# SELECT 396.0/6659.0 AS email_rate;
The output of the preceding query is displayed as follows:
In the preceding calculation, you can see that we included a decimal place in the figures, for example, 396.0 instead of a simple integer value (396). This is because the resulting value will be represented as less than 1 percentage point. If we excluded these decimal places, the SQL server would have completed the division operation as integers and the result would be 0.
Just under 6% of customers who made a purchase received an email regarding the Bat Scooter. Since 18% of customers who received an email made a purchase, there is a strong argument to be made that actively increasing the size of the customer base who receive marketing emails could increase Bat Scooter sales.
sqlda=# SELECT * INTO bat_emails_threewks FROM bat_emails WHERE sales_transaction_date < '2016-11-01';
sqlda=# SELECT COUNT(opened) FROM bat_emails_threewks;
We can see that we have sent 82 emails during this period:
sqlda=# SELECT COUNT(opened) FROM bat_emails_threewks WHERE opened='t';
The following is the output of the preceding code:
We can see that 15 emails were opened in the first 3 weeks.
sqlda=# SELECT COUNT(DISTINCT(customer_id)) FROM bat_emails_threewks;
We can see that 82 customers received emails during the first 3 weeks:
sqlda=# SELECT 15.0/82.0 AS sale_rate;
The following table shows the calculated percentage:
Approximately 18% of customers who received an email about the Bat Scooter made a purchase in the first 3 weeks. This is consistent with the rate for all available data for the Bat Scooter.
sqlda=# SELECT COUNT(DISTINCT(customer_id)) FROM bat_sales WHERE sales_transaction_date < '2016-11-01';
The following output reflects 160 customers where the transaction took place before November 1, 2016:
There were 160 customers in the first 3 weeks, 82 of whom received emails, which is slightly over 50% of customers. This is much more than 6% of customers over the entire period of availability of the scooter.
Now that we have examined the performance of the email marketing campaign for the Bat Scooter, we need a control or comparison group to establish whether the results were consistent with that of other products. Without a group to compare against, we simply do not know whether the email campaign of the Bat Scooter was good, bad, or neither. We will perform the next exercise to investigate performance.
In this exercise, we will investigate the performance of the email marketing campaign for the Lemon Scooter to allow for a comparison with the Bat Scooter. Our hypothesis is that if the email marketing campaign performance of the Bat Scooter is consistent with another, such as the 2013 Lemon, then the reduction in sales cannot be attributed to differences in the email campaigns.
Perform the following steps to complete the exercise:
$ psql sqlda
sqlda=# DROP TABLE lemon_sales;
sqlda=# SELECT customer_id, sales_transaction_date INTO lemon_sales FROM sales WHERE product_id=3;
sqlda=# SELECT emails.customer_id, emails.email_subject, emails.opened, emails.sent_date, emails.opened_date, lemon_sales.sales_transaction_date INTO lemon_emails FROM emails INNER JOIN lemon_sales ON emails.customer_id=lemon_sales.customer_id;
sqlda=# SELECT production_start_date FROM products Where product_id=3;
The following table shows the production_start_date column:
Now, delete the emails that were sent before the start of production of the 2013 Lemon Scooter:
sqlda=# DELETE FROM lemon_emails WHERE sent_date < '2013-05-01';
sqlda=# DELETE FROM lemon_emails WHERE sent_date > sales_transaction_date;
sqlda=# DELETE FROM lemon_emails WHERE (sales_transaction_date - sent_date) > '30 days';
sqlda=# SELECT DISTINCT(email_subject) FROM lemon_emails;
The following table shows the distinct email subjects:
Now, delete the email subject not related to the Lemon Scooter using the DELETE command:
sqlda=# DELETE FROM lemon_emails WHERE POSITION('25% off all EVs.' in email_subject)>0;
sqlda=# DELETE FROM lemon_emails WHERE POSITION('Like a Bat out of Heaven' in email_subject)>0;
sqlda=# DELETE FROM lemon_emails WHERE POSITION('Save the Planet' in email_subject)>0;
sqlda=# DELETE FROM lemon_emails WHERE POSITION('An Electric Car' in email_subject)>0;
sqlda=# DELETE FROM lemon_emails WHERE POSITION('We cut you a deal' in email_subject)>0;
sqlda=# DELETE FROM lemon_emails WHERE POSITION('Black Friday. Green Cars.' in email_subject)>0;
sqlda=# DELETE FROM lemon_emails WHERE POSITION('Zoom' in email_subject)>0;
sqlda=# SELECT COUNT(opened) FROM lemon_emails WHERE opened='t';
We can see that 128 emails were opened:
sqlda=# SELECT COUNT(DISTINCT(customer_id)) FROM lemon_emails;
The following figure shows that 506 customers made a purchase after receiving emails:
sqlda=# SELECT 128.0/506.0 AS email_rate;
We can see that 25% of customers opened the emails and made a purchase:
sqlda=# SELECT COUNT(DISTINCT(customer_id)) FROM lemon_sales;
We can see that 13854 customers made a purchase:
sqlda=# SELECT 506.0/13854.0 AS email_sales;
The preceding calculation generates a 36% output:
sqlda=# SELECT * INTO lemon_emails_threewks FROM lemon_emails WHERE sales_transaction_date < '2013-06-01';
sqlda=# SELECT COUNT(sales_transaction_date) FROM lemon_emails_threewks;
The following is the output of the preceding code:
There is a lot of interesting information here. We can see that 25% of customers who opened an email made a purchase, which is a lot higher than the 18% figure for the Bat Scooter. We have also calculated that just over 3.6% of customers who purchased a Lemon Scooter were sent an email, which is much lower than the almost 6% of Bat Scooter customers. The final interesting piece of information we can see is that none of the Lemon Scooter customers received an email during the first 3 weeks of product launch compared with the 82 Bat Scooter customers, which is approximately 50% of all customers in the first 3 weeks!
In this exercise, we investigated the performance of an email marketing campaign for the Lemon Scooter to allow for a comparison with the Bat Scooter using various SQL techniques.
Now that we have collected a range of information about the timing of the product launches, the sales prices of the products, and the marketing campaigns, we can make some conclusions regarding our hypotheses:
At this stage, we have completed our post-hoc analysis (that is, data analysis completed after an event) and have evidence to support a couple of theories as to why the sales of the Bat Scooter dropped after the first 2 weeks. However, we cannot confirm these hypotheses to be true as we cannot isolate one from the other. This is where we need to turn to another tool in our toolkit: in-field testing. Precisely as the name suggests, in-field testing is testing hypotheses in the field, for instance, while a new product is being launched or existing sales are being made. One of the most common examples of in-field testing is A/B testing, whereby we randomly divide our users or customers into two groups, A and B, and provide them with a slightly modified experience or environment and observe the result. As an example, let's say we randomly assigned customers in group A to a new marketing campaign and customers in group B to the existing marketing campaign. We could then monitor sales and interactions to see whether one campaign was better than the other. Similarly, if we wanted to test the launch timing, we could launch in Northern California, for example, in early November, and Southern California in early December, and observe the differences.
The essence of in-field testing is that unless we test our post-hoc data analysis hypotheses, we will never know whether our hypothesis is true and, in order to test the hypothesis, we must only alter the conditions to be tested, for example, the launch date. To confirm our post-hoc analysis, we could recommend that the sales teams apply one or more of the following scenarios and monitor the sales records in real time to determine the cause of the reduction in sales:
Congratulations! You have just completed your first real-world data analysis problem using SQL. In this chapter, you developed the skills necessary to develop hypotheses for problems and systematically gather the data required to support or reject your hypothesis. You started this case study with a reasonably difficult problem of explaining an observed discrepancy in sales data and discovered two possible sources (launch timing and marketing campaign) for the difference while rejecting one alternative explanation (sales price). While being a required skill for any data analyst, being able to understand and apply the scientific method in our exploration of problems will allow you to be more effective and find interesting threads of investigation. In this chapter, you used the SQL skills developed throughout this book; from simple SELECT statements to aggregating complex datatypes as well as windowing methods. After completing this chapter, you will be able to continue and repeat this type of analysis in your own data analysis projects to help find actionable insights.
18.188.181.163