This section is included to help readers work through the activities in the book. It provides the detailed steps required to achieve each activity's objectives.
Solution
SELECT email
FROM customers
WHERE state='FL'
ORDER BY email;
The following is the output of the preceding code:
SELECT first_name, last_name, email
FROM customers
WHERE city='New York City'
AND state='NY'
ORDER BY last_name, first_name;
The following is the output of the preceding code:
SELECT *
FROM customers
WHERE phone IS NOT NULL
ORDER BY date_added;
The following is the output of the preceding code:
Solution
CREATE TABLE customers_nyc AS (
SELECT * FROM customers
WHERE city='New York City'
AND state='NY');
DELETE FROM customers_nyc WHERE postal_code='10014';
ALTER TABLE customers_nyc ADD COLUMN event text;
UPDATE customers_nyc
SET event = 'thank-you party';
DROP TABLE customers_nyc;
This will delete the customers_nyc table from the database.
Solution
SELECT
c.*,
p.*,
COALESCE(s.dealership_id, -1) AS dealership_id,
CASE WHEN p.base_msrp - s.sales_amount > 500 THEN 1 ELSE 0 END AS high_savings
FROM sales s
INNER JOIN customers c ON c.customer_id=s.customer_id
INNER JOIN products p ON p.product_id=s.product_id
LEFT JOIN dealerships d ON s.dealership_id = d.dealership_id;
With this output, the data science team now has the data it needs to build a new model that predicts which customers are the best prospects for remarketing.
Solution
SELECT COUNT(*)
FROM sales;
You should get 37,711 sales.
SELECT c.state, SUM(sales_amount) as total_sales_amount
FROM sales s
INNER JOIN customers c ON c.customer_id=s.customer_id
GROUP BY 1
ORDER BY 1;
You will get the following output:
SELECT s.dealership_id, COUNT(*)
FROM sales s
WHERE channel='dealership'
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;
You should get the following output:
SELECT s.channel, s.product_id, AVG(sales_amount) as avg_sales_amount
FROM sales s
GROUP BY
GROUPING SETS(
(s.channel), (s.product_id),
(s.channel, s.product_id)
)
ORDER BY 1, 2;
You should get the following output:
From the preceding figure, we can see the average sales amount for each channel, for each product, and for each channel and product combination.
Using aggregates, you have uncovered patterns that will help your company understand how to increase revenue and improve its overall performance.
Solution
SELECT sales_transaction_date::DATE,
SUM(sales_amount) as total_sales_amount
FROM sales
WHERE sales_transaction_date>='2018-01-01'
AND sales_transaction_date<'2019-01-01'
GROUP BY 1
ORDER BY 1;
The following is the output of the preceding code:
WITH daily_deals as (
SELECT sales_transaction_date::DATE,
COUNT(*) as total_deals
FROM sales
GROUP BY 1
),
moving_average_calculation_30 AS (
SELECT sales_transaction_date, total_deals,
AVG(total_deals) OVER (ORDER BY sales_transaction_date ROWS BETWEEN 30 PRECEDING and CURRENT ROW) AS deals_moving_average,
ROW_NUMBER() OVER (ORDER BY sales_transaction_date) as row_number
FROM daily_deals
ORDER BY 1)
SELECT sales_transaction_date,
CASE WHEN row_number>=30 THEN deals_moving_average ELSE NULL END
AS deals_moving_average_30
FROM moving_average_calculation_30
WHERE sales_transaction_date>='2018-01-01'
AND sales_transaction_date<'2019-01-01';
The following is the output of the preceding code:
WITH total_dealership_sales AS
(
SELECT dealership_id,
SUM(sales_amount) AS total_sales_amount
FROM sales
WHERE sales_transaction_date>='2018-01-01'
AND sales_transaction_date<'2019-01-01'
AND channel='dealership'
GROUP BY 1
)
SELECT *,
NTILE(10) OVER (ORDER BY total_sales_amount)
FROM total_dealership_sales;
The following is the output of the preceding code:
Solution
from sqlalchemy import create_engine
import pandas as pd
%matplotlib inline
cnxn_string = ("postgresql+psycopg2://{username}:{pswd}"
"@{host}:{port}/{database}")
engine = create_engine(cnxn_string.format(
username="your_username",
pswd="your_password",
host="your_host",
port=5432,
database="your_db"))
data = pd.read_csv("~/Downloads/public_transportation_statistics_by_zip_code.csv", dtype={'zip_code':str})
Check that the data looks correct by creating a new cell, entering data, and then hitting Shift + Enter to see the contents of data. You can also use data.head() to see just the first few rows:
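For example, a quick sanity check in a new cell might look like this (a minimal sketch, assuming the data DataFrame loaded in the previous step):

data.head()   # preview the first five rows
data.dtypes   # confirm that zip_code was read as a string (object) column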
import csv
from io import StringIO
def psql_insert_copy(table, conn, keys, data_iter):
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = '{}.{}'.format(table.schema, table.name)
        else:
            table_name = table.name

        sql = 'COPY {} ({}) FROM STDIN WITH CSV'.format(
            table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)
data.to_sql('public_transportation_by_zip', engine, if_exists='replace', method=psql_insert_copy)
SELECT
MAX(public_transportation_pct) AS max_pct,
MIN(public_transportation_pct) AS min_pct
FROM public_transportation_by_zip;
SELECT
(public_transportation_pct > 10) AS is_high_public_transport,
COUNT(s.customer_id) * 1.0 / COUNT(DISTINCT c.customer_id) AS sales_per_customer
FROM customers c
INNER JOIN public_transportation_by_zip t ON t.zip_code = c.postal_code
LEFT JOIN sales s ON s.customer_id = c.customer_id
WHERE public_transportation_pct >= 0
GROUP BY 1
;
Here's an explanation of this query:
We can identify customers living in an area with public transportation by looking at the public transportation data associated with their postal code. If public_transportation_pct > 10, then the customer is in a high public transportation area. We can group by this expression to identify the population that is or is not in a high public transportation area.
We can look at sales per customer by counting the sales (using the COUNT(s.customer_id) aggregate) and dividing by the unique number of customers (using the COUNT(DISTINCT c.customer_id) aggregate). We want to make sure that we retain fractional values, so we multiply by 1.0 to cast the entire expression to a fractional type: COUNT(s.customer_id) * 1.0 / COUNT(DISTINCT c.customer_id) (see the short demonstration after this explanation).
In order to do this, we need to join our customer data to the public transportation data, and finally to the sales data. We also need to keep only the zip codes where public_transportation_pct is greater than or equal to 0, so that we exclude the missing data (denoted by -666666666).
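As a quick demonstration of why the multiplication by 1.0 matters, integer division in PostgreSQL truncates the fractional part:

SELECT 7 / 2 AS integer_division,       -- returns 3
       7 * 1.0 / 2 AS numeric_division; -- returns 3.5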
Finally, running the full query shown above produces the following output:
From this, we see that customers in high public transportation areas have 12% more product purchases than customers in low public transportation areas.
data = pd.read_sql_query('SELECT * FROM public_transportation_by_zip WHERE public_transportation_pct > 0 AND public_transportation_pct < 50', engine)
data.plot.hist(y='public_transportation_pct')
data.to_sql('public_transportation_by_zip', engine, if_exists='replace')
CREATE TEMP VIEW public_transport_statistics AS (
SELECT
10 * ROUND(public_transportation_pct/10) AS public_transport,
COUNT(s.customer_id) * 1.0 / COUNT(DISTINCT c.customer_id) AS sales_per_customer
FROM customers c
INNER JOIN public_transportation_by_zip t ON t.zip_code = c.postal_code
LEFT JOIN sales s ON s.customer_id = c.customer_id
WHERE public_transportation_pct >= 0
GROUP BY 1
);
\copy (SELECT * FROM public_transport_statistics) TO 'public_transport_distribution.csv' CSV HEADER;
First, we want to wrap our query in a temporary view, public_transport_statistics, so that we can easily write the result to a CSV file later.
Next is the tricky part: we want to aggregate the public transportation statistics somehow. What we can do is round this percentage to the nearest 10%, so 22% would become 20%, and 39% would become 40%. We can do this by dividing the percentage number (represented as 0.0-100.0) by 10, rounding off, and then multiplying back by 10: 10 * ROUND(public_transportation_pct/10).
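You can verify this rounding behavior with a quick query (the aliases here are just for illustration):

SELECT 10 * ROUND(22/10.0) AS rounds_down, -- returns 20
       10 * ROUND(39/10.0) AS rounds_up;   -- returns 40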
The logic for the remainder of the query is explained in step 6.
After creating the scatterplot, we get the following result, which shows a clear positive relationship between public transportation and sales in the geographical area:
Based on all this analysis, we can say that there is a positive relationship between geographies with public transportation and demand for electric vehicles. Intuitively, this makes sense, because electric vehicles can provide an alternative to public transport for getting around cities. As a result of this analysis, we recommend that ZoomZoom management consider expanding in urban areas and regions with high public transportation usage.
Solution
CREATE MATERIALIZED VIEW customer_search AS (
SELECT
customer_json -> 'customer_id' AS customer_id,
customer_json,
to_tsvector('english', customer_json) AS search_vector
FROM customer_sales
);
CREATE INDEX customer_search_gin_idx ON customer_search USING GIN(search_vector);
SELECT
customer_id,
customer_json
FROM customer_search
WHERE search_vector @@ plainto_tsquery('english', 'Danny Bat');
This results in eight matching rows:
In this complex task, we need to find customers whose records match both a scooter and an automobile. That means we need to build a search query for each combination of scooter and automobile.
SELECT DISTINCT
p1.model,
p2.model
FROM products p1
LEFT JOIN products p2 ON TRUE
WHERE p1.product_type = 'scooter'
AND p2.product_type = 'automobile'
AND p1.model NOT ILIKE '%Limited Edition%';
This produces the following output:
SELECT DISTINCT
plainto_tsquery('english', p1.model) &&
plainto_tsquery('english', p2.model)
FROM products p1
LEFT JOIN products p2 ON TRUE
WHERE p1.product_type = 'scooter'
AND p2.product_type = 'automobile'
AND p1.model NOT ILIKE '%Limited Edition%';
This produces the following result:
SELECT
sub.query,
(
SELECT COUNT(1)
FROM customer_search
WHERE customer_search.search_vector @@ sub.query)
FROM (
SELECT DISTINCT
plainto_tsquery('english', p1.model) &&
plainto_tsquery('english', p2.model) AS query
FROM products p1
LEFT JOIN products p2 ON TRUE
WHERE p1.product_type = 'scooter'
AND p2.product_type = 'automobile'
AND p1.model NOT ILIKE '%Limited Edition%'
) sub
ORDER BY 2 DESC;
The following is the output of the preceding query:
While there could be a multitude of factors at play here, we see that the Lemon scooter and the Model Sigma automobile are the combination most frequently purchased together, followed by the Lemon and the Model Chi. The Bat is also fairly frequently purchased with both of those models, as well as with the Model Epsilon. The other combinations are much less common, and it seems that customers rarely purchase the Lemon Zester, the Blade, or the Model Gamma.
Solution
C:> psql sqlda
sqlda=# EXPLAIN SELECT * FROM customers;
This query will produce the following output from the planner:
The setup cost is 0, the total query cost is 1,536, the estimated number of rows is 50,000, and the width of each row is 140. Note that the cost is expressed in the planner's arbitrary cost units, while the width is in bytes.
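Based on the figures quoted above, the planner line should look roughly as follows (exact numbers can vary between PostgreSQL versions and machines):

Seq Scan on customers  (cost=0.00..1536.00 rows=50000 width=140)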
sqlda=# EXPLAIN SELECT * FROM customers LIMIT 15;
This query will produce the following output from the planner:
Two steps are involved in the query, and the limiting step costs 0.46 units within the plan.
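Reconstructing from the quoted figures, the plan should resemble the following sketch, with a Limit node sitting on top of the sequential scan:

Limit  (cost=0.00..0.46 rows=15 width=140)
  ->  Seq Scan on customers  (cost=0.00..1536.00 rows=50000 width=140)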
sqlda=# EXPLAIN SELECT * FROM customers WHERE latitude > 30 and latitude < 40;
This query will produce the following output from the planner:
The total plan cost is 1,786 units, and it returns 26,439 rows.
Solution
EXPLAIN ANALYZE SELECT * FROM customers WHERE ip_address = '18.131.58.65';
The following output will be displayed:
The query takes 0.191 ms to plan and 15.625 ms to execute.
CREATE INDEX ON customers(ip_address);
EXPLAIN ANALYZE SELECT * FROM customers WHERE ip_address = '18.131.58.65';
The following is the output of the preceding code:
Figure 8.79: Index scan with a filter on ip_address
The query takes 0.467 ms to plan and 0.123 ms to execute.
CREATE INDEX ix_ip_where ON customers(ip_address) WHERE ip_address = '18.131.58.65';
EXPLAIN ANALYZE SELECT * FROM customers WHERE ip_address = '18.131.58.65';
The following is the output of the preceding code:
The query takes 0.458 ms to plan and 0.056 ms to execute. We can see that both indices took around the same amount of time to plan, with the index that specifies the exact IP address being much faster to execute and slightly quicker to plan as well.
EXPLAIN ANALYZE SELECT * FROM customers WHERE suffix = 'Jr';
The following output will be displayed:
The query takes 0.153 ms of planning and 14.238 ms of execution.
CREATE INDEX ix_jr ON customers(suffix);
EXPLAIN ANALYZE SELECT * FROM customers WHERE suffix = 'Jr';
The following output will be displayed:
Again, the planning time is significantly elevated, but this cost is more than outweighed by the improvement in the execution time, which is reduced from 14.238 ms to 0.511 ms.
Solution
EXPLAIN ANALYZE SELECT * FROM emails where email_subject='Shocking Holiday Savings On Electric Scooters';
The following output will be displayed:
The planning time is 0.117 ms and the execution time is 119.801 ms. There is no setup cost for the query, but the total cost of executing it is 10,652 units.
EXPLAIN ANALYZE SELECT * FROM emails where email_subject='Black Friday. Green Cars.';
The following output will be displayed:
Approximately 0.097 ms is spent on planning the query, with 127.736 ms being spent on executing it. This elevated execution time can be partially attributed to an increase in the number of rows being returned. Again, there is no setup cost, but a similar execution cost of 10,652.
CREATE INDEX ix_email_subject ON emails USING HASH(email_subject);
EXPLAIN ANALYZE SELECT * FROM emails where email_subject='Shocking Holiday Savings On Electric Scooters';
The following output will be displayed:
The query plan shows that our newly created hash index is being used and has significantly reduced the execution time by over 100 ms, as well as the cost. There is a minor increase in the planning time and planning cost, all of which is easily outweighed by the reduction in execution time.
EXPLAIN ANALYZE SELECT * FROM emails where email_subject='Black Friday. Green Cars.';
The following is the output of the preceding code:
Again, we can see a reduction in both planning and execution expenses. However, the reductions in the "Black Friday…" search are not as good as those achieved in the "Shocking Holiday Savings..." search. If we look in more detail, we can see that the scan on the index is approximately two times longer, but there are also about twice as many records in the latter example. From this, we can conclude that the increase is simply due to the increase in the number of records being returned by the query.
CREATE INDEX ix_customer_id ON emails USING HASH(customer_id);
EXPLAIN ANALYZE SELECT * FROM emails WHERE customer_id > 100;
The following output will be displayed:
So, the final execution time comes to 152.656 ms and the planning time comes to 0.199 ms.
Solution
$ psql sqlda
sqlda=# SELECT customers.customer_id, customers.first_name, customers.last_name, emails.opened, emails.clicked FROM customers INNER JOIN emails ON customers.customer_id=emails.customer_id;
The following screenshot shows the output of the preceding code:
sqlda=# SELECT customers.customer_id, customers.first_name, customers.last_name, emails.opened, emails.clicked INTO customer_emails FROM customers INNER JOIN emails ON customers.customer_id=emails.customer_id;
sqlda=# SELECT * FROM customer_emails WHERE clicked='t' AND opened='t';
The following figure shows the output of the preceding code:
sqlda=# SELECT customers.customer_id, customers.first_name, customers.last_name, customers.city FROM customers LEFT JOIN dealerships on customers.city=dealerships.city;
This will display the following output:
sqlda=# SELECT customers.customer_id, customers.first_name, customers.last_name, customers.city INTO customer_dealers FROM customers LEFT JOIN dealerships on customers.city=dealerships.city;
sqlda=# SELECT * from customer_dealers WHERE city is NULL;
The following figure shows the output of the preceding code:
The output shows the final list of customers in the cities where we have no dealerships.
Solution
$ psql sqlda
sqlda=# CREATE FUNCTION max_sale() RETURNS integer AS $big_sale$
sqlda$# DECLARE big_sale numeric;
sqlda$# BEGIN
sqlda$# SELECT MAX(sales_amount) INTO big_sale FROM sales;
sqlda$# RETURN big_sale;
sqlda$# END; $big_sale$
sqlda-# LANGUAGE PLPGSQL;
sqlda=# SELECT max_sale();
The following figure shows the output of the preceding code:
The output is created from a function that determines the highest sale amount, that is, 115000, in the database.
Solution
sqlda=# CREATE FUNCTION avg_sales_window(from_date DATE, to_date DATE) RETURNS numeric AS $sales_avg$
sqlda$# DECLARE sales_avg numeric;
sqlda$# BEGIN
sqlda$# SELECT AVG(sales_amount) INTO sales_avg FROM sales WHERE sales_transaction_date > from_date AND sales_transaction_date < to_date;
sqlda$# RETURN sales_avg;
sqlda$# END; $sales_avg$
sqlda-# LANGUAGE PLPGSQL;
sqlda=# SELECT avg_sales_window('2013-04-12', '2014-04-12');
The following figure shows the output of the preceding code:
The final output shows the average sales within specific dates, which comes to around 477.687.
Solution
smalljoins=# CREATE TABLE avg_qty_log (order_id integer, avg_qty numeric);
smalljoins=# CREATE FUNCTION avg_qty() RETURNS TRIGGER AS $_avg$
smalljoins$# DECLARE _avg numeric;
smalljoins$# BEGIN
smalljoins$# SELECT AVG(qty) INTO _avg FROM order_info;
smalljoins$# INSERT INTO avg_qty_log (order_id, avg_qty) VALUES (NEW.order_id, _avg);
smalljoins$# RETURN NEW;
smalljoins$# END; $_avg$
smalljoins-# LANGUAGE PLPGSQL;
smalljoins=# CREATE TRIGGER avg_trigger
smalljoins-# AFTER INSERT ON order_info
smalljoins-# FOR EACH ROW
smalljoins-# EXECUTE PROCEDURE avg_qty();
smalljoins=# SELECT insert_order(3, 'GROG1', 6);
smalljoins=# SELECT insert_order(4, 'GROG1', 7);
smalljoins=# SELECT insert_order(1, 'GROG1', 8);
smalljoins=# SELECT * FROM avg_qty_log;
The following figure shows the output of the preceding code:
With these orders and the entries in the log, we can see an increase in the average quantity of items per order.
Solution
C:> psql sqlda
sqlda=# SELECT pg_sleep(1000);
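The process ID passed to pg_terminate_backend in the next step (14131 here; yours will differ) can be found from a second connection by querying the pg_stat_activity system view, for example:

sqlda=# SELECT pid, state, query FROM pg_stat_activity WHERE query LIKE '%pg_sleep%' AND pid <> pg_backend_pid();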
sqlda=# SELECT pg_terminate_backend(14131);
The following figure shows the output of the preceding code:
sqlda=# SELECT pg_sleep(1000);
This will display the following output:
From the screenshot, we can see that the sleeping query has now been terminated by pg_terminate_backend.
Solution
$ psql sqlda
sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date) INTO bat_sales_growth FROM bat_sales_daily;
The following table shows the daily cumulative sum of sales:
sqlda=# SELECT *, lag(sum, 7) OVER (ORDER BY sales_transaction_date) INTO bat_sales_daily_delay FROM bat_sales_growth;
sqlda=# SELECT * FROM bat_sales_daily_delay LIMIT 15;
The following is the output of the preceding code:
sqlda=# SELECT *, (sum-lag)/lag AS volume INTO bat_sales_delay_vol FROM bat_sales_daily_delay;
The percentage sales volume can be calculated via the following equation:
(new_volume - old_volume) / old_volume
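For example, if the cumulative sum was 1,000 sales a week ago and is 1,300 today, the relative volume is (1300 - 1000) / 1000 = 0.3, that is, 30% growth over the week.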
sqlda=# SELECT * FROM bat_sales_delay_vol LIMIT 22;
The delay volume for the first 22 entries can be seen in the following:
Looking at the output table, we can see four sets of information: the daily sales count, the cumulative sum of the daily sales count, the cumulative sum offset by 1 week (the lag), and the relative daily sales volume.
Solution
$ psql sqlda
sqlda=# SELECT sales_transaction_date INTO lemon_sales FROM sales WHERE product_id=3;
sqlda=# SELECT count(sales_transaction_date) FROM lemon_sales;
We can see that 16,558 records are available:
sqlda=# SELECT max(sales_transaction_date) FROM lemon_sales;
The following figure displays the latest sales_transaction_date:
sqlda=# ALTER TABLE lemon_sales ALTER COLUMN sales_transaction_date TYPE DATE;
We convert the column type from TIMESTAMP to DATE to remove the time information from the field. We are only interested in accumulating counts by date, so we need just the date, not the time.
sqlda=# SELECT *, COUNT(sales_transaction_date) INTO lemon_sales_count FROM lemon_sales GROUP BY sales_transaction_date ORDER BY sales_transaction_date;
sqlda=# SELECT *, sum(count) OVER (ORDER BY sales_transaction_date) INTO lemon_sales_sum FROM lemon_sales_count;
sqlda=# SELECT *, lag(sum, 7) OVER (ORDER BY sales_transaction_date) INTO lemon_sales_delay FROM lemon_sales_sum;
sqlda=# SELECT *, (sum-lag)/lag AS volume INTO lemon_sales_growth FROM lemon_sales_delay;
sqlda=# SELECT * FROM lemon_sales_growth LIMIT 22;
The following table shows the sales growth:
Similar to the previous exercise, we have calculated the cumulative sum, lag, and relative sales growth of the Lemon Scooter. We can see that the initial sales volume is much larger than that of the other scooters, at over 800%, and it again finishes higher, at 55%.