By the end of this chapter, you will be able to:
In this chapter, we will improve the performance of some of our previous SQL queries. Now that we have a good understanding of the basics, we will build upon this foundation by making our queries more resource and time efficient. As we begin to work with larger datasets, these efficiencies become even more important, with each computational step taking longer to compute.
In the previous chapter, we developed the skills necessary to effectively analyze data within a SQL database, and in this chapter, we will turn our attention to the efficiency of this analysis, investigating how we can increase the performance of our SQL queries. Efficiency and performance are key components of data analytics, since without considering these factors, physical constraints such as time and processing power can significantly affect the outcome of an analysis. To elaborate on these limitations, we can consider two separate scenarios.
Let's say that we are performing post-hoc analysis (analysis after the fact or event). In this first scenario, we have completed a study and have collected a large dataset of individual observations of a variety of different factors or features. One such example is that described within our dealership sales database – analyzing the sales data for each customer. With the data collection process, we want to analyze the data for patterns and insights as specified by our problem statement. If our dataset was sufficiently large, we could quickly encounter issues if we didn't optimize the queries first; the most common issue would simply be the time taken to execute the queries. While this doesn't sound like a significant issue, unnecessarily long processing times can cause:
Similarly, another potential issue with sub-optimal queries is an increase in the required system memory and compute power. This can result in either of the following two scenarios:
Analysis/queries are part of a service or product. Let's think of a second scenario, where analysis is being completed as a component of a greater service or product, and so database queries may need to be completed in real time, or at least near-real time. In such cases, optimization and efficiency are key for the product to be a success. One such example is a GPS navigation system that incorporates the state of traffic as reported by other users. For such a system to be effective and provide up-to-date navigation information, the database must be analyzed at a rate that keeps up with the speed of the car and the progress of the journey. Any delays in the analysis that would prevent the navigation from being updated in response to traffic would be of significant impact to the commercial viability of the application.
After looking at these two examples, we can see that while efficiency is important in an effective and thorough post-hoc analysis, it is absolutely critical when incorporating the data analysis as a component of a separate product or service. While it is certainly not the job of a data scientist or data analyst to ensure that production and the database are working at optimal efficiency, it is critical that the queries of the underlying analysis are as effective as possible. If we do not have an efficient and current database in the first place, further refinements will not help in improving the performance of the analysis. In the next section, we will discuss methods of increasing the performance of scans for information throughout a database.
SQL-compliant databases provide a number of different methods for scanning, searching, and selecting data. The right scan method to use is very much dependent on the use case and the state of the database at the time of scanning. How many records are in the database? Which fields are we interested in? How many records do we expect to be returned? How often do we need to execute the query? These are just some of the questions that we may want to ask when selecting the most appropriate scanning method. Throughout this section, we will describe some of the search methods available, how they are used within SQL to execute scans, and a number of scenarios where they should/should not be used.
Before investigating the different methods of executing queries or scanning a database for information, it is useful to understand how the SQL server makes various decisions about the types of queries to be used. SQL-compliant databases possess a powerful tool known as a query planner, which implements a set of features within the server to analyze a request and decides upon the execution path. The query planner optimizes a number of different variables within the request with the aim of reducing the overall execution time. These variables are described in greater detail within the PostgreSQL documentation (https://www.postgresql.org/docs/current/runtime-config-query.html) and include parameters that correspond with the cost of sequential page fetches, CPU operations, and cache size.
In this chapter, we will not cover the details of how a query planner implements its analysis, since the technical details are quite involved. However, it is important to understand how to interpret the plan reported by the query planner. Interpreting the planner is critical if we want to get high performance from a database, as doing so allows us to modify the contents and structure of queries to optimize performance. So, before embarking on a discussion of the various scanning methods, we will gain practical experience in using and interpreting the analysis of the query planner.
When we want to retrieve information from a database, the query planner needs to search through the available records in order to get the data we need. There are various strategies employed within the database to order and allocate the information for fast retrieval. The process that the SQL server uses to search through a database is known as scanning.
There are a number of different types of scans that can be used to retrieve information. We will start with the sequential scan, as this is the easiest to understand and is the most reliable scan available within a SQL database. If all other scans fail, you can always fall back to the reliable sequential scan to get the information you need out of a database. In some circumstances, the sequential scan isn't the fastest or most efficient; however, it will always produce a correct result. The other interesting thing to note about the sequential scan is that, at this stage in the book, while you may not be aware of it, you have already executed a number of sequential scans. Do you recall entering the following command in Chapter 6, Importing and Exporting Data?
sqlda=# SELECT * FROM customers LIMIT 5
The following is the output of the preceding code:
Extracting data using the SELECT command directly from the database executes a sequential scan, where the database server traverses through each record in the database and compares each record to the criteria in the sequential scan, returning those records that match the criteria. This is essentially a brute-force scan and, thus, can always be called upon to execute a search. In many situations, a sequential scan is also often the most efficient method and will be automatically selected by the SQL server. This is particularly the case if any of the following is true:
In this exercise, we will introduce the EXPLAIN command, which displays the plan for a query before it is executed. When we use the EXPLAIN command in combination with a SQL statement, the SQL interpreter will not execute the statement, but rather return the steps that are going to be executed (a query plan) by the interpreter in order to return the desired results. There is a lot of information returned in a query plan and being able to comprehend the output is vital in tuning the performance of our database queries. Query planning is itself a complex topic and can require some practice in order to be comfortable in interpreting the output; even the PostgreSQL official documentation notes that plan-reading is an art that deserves significant attention in its own right. We will start with a simple plan and will work our way through more complicated queries and query plans.
In this exercise, we will interpret a query planner using the EXPLAIN command. We will interpret the query planner of the emails table of the sqlda database. Then, we will employ a more involved query, searching for dates between two specific values in the clicked_date field. We will need to ensure that the sqlda database is loaded as described within the Preface.
Retrieve the Exercise26.sql file from the accompanying source code. This file will contain all the queries used throughout this exercise. However, we will enter them manually using the SQL interpreter to reinforce our understanding of the query planner's operation.
All the exercises and activities in this chapter are also available on GitHub: https://github.com/TrainingByPackt/SQL-for-Data-Analytics/tree/master/Lesson08.
Observe the following steps to perform the exercise:
C:> psql sqlda
Upon successful connection, you will be presented with the interface to the PostgreSQL database:
sqlda=# EXPLAIN SELECT * FROM emails;
Information similar to the following will then be presented:
This information is returned by the query planner; while this is the simplest example possible, there is quite a bit to unpack in the planner information, so let's look through the output step by step:
The first aspect of the plan that is provided is the type of scan executed by the query. We will cover more of the scan types later in the chapter, but, as discussed in more detail soon, the Seq Scan (see Figure 8.4), or sequential scan, is a simple yet robust type of query:
The first measurement reported by the planner, as shown in Figure 8.5, is the start up cost, which is the time expended before the scan starts. This time may be required to first sort the data or complete other pre-processing applications. It is also important to note that the time measured is actually reported in cost units (see Figure 8.5) as opposed to seconds or milliseconds. Often, the cost units are an indication of the number of disk requests or page fetches made, rather than this being a measure in absolute terms. The reported cost is typically more useful as a means of comparing the performance of various queries, rather than as an absolute measure of time:
The next number in the sequence (see Figure 8.6) indicates the total cost of executing the query if all available rows are retrieved. There are some circumstances in which all the available rows may not be retrieved, but we will cover that soon:
The next figure in the plan (see Figure 8.7) indicates the total number of rows that are available to be returned – again, if the plan is completely executed:
The final figure (see Figure 8.8), as suggested by its label, indicates the width of each row in bytes.
When executing the EXPLAIN command, PostgreSQL does not actually implement the query or return the values. It does, however, return a description, along with the processing costs involved in executing each stage of the plan.
sqlda=# EXPLAIN SELECT * FROM emails LIMIT 5;
This repeats the previous statement, where the planner is limited to the first five records. This query will produce the following output from the planner:
Referring to Figure 8.9, we can see that there are two individual rows in the plan. This indicates that the plan is composed of two separate steps, with the lower line of the plan (or, in this case, the first step to be executed) being a repeat of that shown in Figure 8.8. The upper line of the plan is the component that limits the result to only 5 rows. The Limit process is an additional cost of the query; however, it is quite insignificant compared to the lower-level plan, which retrieves approximately 418158 rows at a cost of 9606 pages requests. The Limit stage only returns 5 rows at a cost of 0.11 page requests.
The overall estimated cost for a request comprises the time taken to retrieve the information from the disk as well as the number of rows that need to be scanned. The internal parameters, seq_page_cost and cpu_tuple_cost, define the cost of the corresponding operations within the tablespace for the database. While not recommended at this stage, these two variables can be changed to modify the steps prepared by the planner.
For more information, refer to the PostgreSQL documentation: https://www.postgresql.org/docs/current/runtime-config-query.html.
sqlda=# EXPLAIN SELECT * FROM emails WHERE clicked_date BETWEEN '2011-01-01' and '2011-02-01';
This will produce the following query plan:
The first aspect of this query plan to note is that it comprises a few different steps. The lower-level query is similar to the previous query in that it executes a sequential scan. However, rather than limiting the output, we are filtering it on the basis of the timestamp strings provided. Notice that the sequential scan is to be completed in parallel, as indicated by the Parallel Seq Scan, and the fact that two workers are planned to be used. Each individual sequence scan should return approximately 54 rows, taking a cost of 8038.49 to complete. The upper level of the plan is a Gather state, which is executed at the start of the query. We can see here for the first time that the upfront costs are non-zero (1,000) and total 9051.49, including the gather and search steps.
In this exercise, we worked with the query planner and the output of the EXPLAIN command. These relatively simple queries highlighted a number of the features of the SQL query planner as well as the detailed information that is provided by it. Having a good understanding of the query planner and the information it is returning to you will serve you well in your data science endeavors. Just remember that this understanding will come with time and practice; never hesitate to consult the PostgreSQL documentation: https://www.postgresql.org/docs/current/using-explain.html.
We will continue to practice reading query plans throughout this chapter as we look at different scan types and the methods, they use to improve performance.
Our aim in this activity is to query plan for reading and interpreting the information returned by the planner. Let's say that we are still dealing with our sqlda database of customer records and that our finance team would like us to implement a system to regularly generate a report of customer activity in a specific geographical region. To ensure that our report can be run in a timely manner, we need an estimate of how long the SQL queries will take. We will use the EXPLAIN command to find out how long some of the report queries will take:
Looking at the updated query plan, how many steps are involved in the query plan? What is the cost of the limiting step?
Expected output:
The solution to the activity can be found on page 340. For an additional challenge, try completing this exercise in Python using psycopg2.
In this activity, we practiced reading the plans returned by the query planner. As discussed previously, plan reading requires substantial practice to master it. This activity began this process and it is strongly recommended that you frequently use the EXPLAIN command to improve your plan reading.
Index scans are one method of improving the performance of our database queries. Index scans differ from sequential scan in that a pre-processing step is executed before the search of database records can occur. The simplest way to think of an index scan is just like the index of a text or reference book. In writing a non-fiction book, a publisher parses through the contents of the book and writes the page numbers corresponding with each alphabetically sorted topic. Just as the publisher goes to the initial effort of creating an index for the reader's reference, so we can create a similar index within the PostgreSQL database. This index within the database creates a prepared and organized set or a subset of references to the data under specified conditions. When a query is executed and an index is present that contains information relevant to the query, the planner may elect to use the data that was pre-processed and pre-arranged within the index. Without using an index, the database needs to repeatedly scan through all records, checking each record for the information of interest.
Even if all of the desired information is at the start of the database, without indexing, the search will still scan through all available records. Clearly, this would take a significantly longer time than necessary.
There are a number of different indexing strategies that PostgreSQL can use to create more efficient searches, including B-trees, hash indexes, Generalized Inverted Indexes (GINs), and Generalized Search Trees (GISTs). Each of these different index types has its own strengths and weaknesses and, hence, is used in different situations. One of the most frequently used indexes is the B-tree, which is the default indexing strategy used by PostgreSQL and is available in almost all database software. We will first spend some time investigating the B-tree index, looking at what makes it useful as well as some of its limitations.
The B-tree index is a type of binary search tree and is characterized by the fact that it is a self-balancing structure, maintaining its own data structure for efficient searching. A generic B-tree structure can be found in Figure 8.12, in which we can see that each node in the tree has no more than two elements (thus providing balance) and that the first node has two children. These traits are common among B-trees, where each node is limited to n components, thus forcing the split into child nodes. The branches of the trees terminate at leaf nodes, which, by definition, have no children:
Using Figure 8.12 as an example, say we were looking for the number 13 in the B-tree index. We would start at the first node and select whether the number was less than 5 or greater than 10. This would lead us down the right-hand branch of the tree, where we would again choose between less than 15 and greater than 20. We would then select less than 15 and arrive at the location of 13 in the index. We can see immediately that this operation would be much faster than looking through all available values. We can also see that for performance, the tree must be balanced to allow for an easy path for traversal. Additionally, there must be sufficient information to allow splitting. If we had a tree index with only a few possible values to split on and a large number of samples, we would simply divide the data into a few groups.
Considering B-trees in the context of database searching, we can see that we require a condition to divide the information (or split) on, and we also need sufficient information for a meaningful split. We do not need to worry about the logic of following the tree, as that will be managed by the database itself and can vary depending on the conditions for searching. Even so, it is important for us to understand the strengths and weaknesses of the method to allow us to make appropriate choices when creating the index for optimal performance.
To create an index for a set of data, we use the following syntax:
CREATE INDEX <index name> ON <table name>(table column);
We can also add additional conditions and constraints to make the index more selective:
CREATE INDEX <index name> ON <table name>(table column) WHERE [condition];
We can also specify the type of index:
CREATE INDEX <index name> ON <table name> USING TYPE(table column)
For example:
CREATE INDEX ix_customers ON customers USING BTREE(customer_id);
In the next exercise, we will start with a simple plan and work our way through more complicated queries and query plans using index scans.
In this exercise, we will create a number of different index scans and will investigate the performance characteristics of each of the scans.
Continuing with the scenario from the previous activity, say we had completed our report service but wanted to make the queries faster. We will try to improve this performance using indexing and index scans. You will recall that we are using a table of customer information that includes contact details such as name, email address, phone number, and address information, as well as the latitude and longitude details of their address. The following are the steps to perform:
This file can be downloaded from the accompanying source code available at https://github.com/TrainingByPackt/SQL-for-Data-Analytics/blob/master/Lesson08/Exercise27.
C:> psql sqlda
Upon successful connection, you will be presented with the interface to the PostgreSQL database:
sqlda=# EXPLAIN SELECT * FROM customers WHERE state='FO';
The following is the output of the preceding code:
Note that there is only 1 row returned and that the setup cost is 0, but the total query cost is 1661.
sqlda=# EXPLAIN SELECT DISTINCT state FROM customers;
The output is as follows:
So, there are 51 unique values within the state column.
sqlda=# CREATE INDEX ix_state ON customers(state);
sqlda=# EXPLAIN SELECT * FROM customers WHERE state='FO';
The following is the output of the preceding code:
Notice that an index scan is now being used using the index we just created in step 5. We can also see that we have a non-zero setup cost (0.29), but the total cost is very much reduced, from the previous 1661 to only 8.31! This is the power of the index scan.
Now, let's look at a slightly different example, looking at the time it takes to return a search on the gender column.
sqlda=# EXPLAIN SELECT * FROM customers WHERE gender='M';
The output is as follows:
sqlda=# CREATE INDEX ix_state ON customers(gender);
d customers;
Scrolling to the bottom, we can see the indexes using the ix_ prefix, as well as the column from the table used to create the index:
sqlda=# EXPLAIN SELECT * FROM customers WHERE gender='M';
The following is the output of the preceding code:
Notice that the query plan has not changed at all, despite the use of the index scan. This is because there is insufficient information to create a useful tree within the gender column. There are only two possible values, M and F. The gender index essentially splits the information in two; one branch for males, and one for females. The index has not split the data into branches of the tree well enough to gain any benefit. The planner still needs to sequentially scan through at least half of the data, and so it is not worth the overhead of the index. It is for this reason that the query planner insists on not using the index.
sqlda=# EXPLAIN SELECT * FROM customers WHERE (latitude < 38) AND (latitude > 30);
The following is the output of the preceding code:
Notice that the query is using a sequential scan with a filter. The initial sequential scan returns 17788 before the filter and costs 1786 with 0 start up cost.
sqlda=# CREATE INDEX ix_latitude ON customers(latitude);
We can see that the plan is a lot more involved than the previous one, with a bitmap heap scan and a bitmap index scan being used. We will cover bitmap scans soon, but first, let's get some more information by adding the ANALYZE command to EXPLAIN.
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE (latitude < 38) AND (latitude > 30);
The following output will be displayed:
With this extra information, we can see that there is 0.3 ms of planning time and 14.582 ms of execution time, with the index scan taking almost the same amount of time to execute as the bitmap heat scan takes to start.
sqlda=# CREATE INDEX ix_latitude_less ON customers(latitude) WHERE (latitude < 38) and (latitude > 30);
Using this more targeted index, we were able to shave 0.681 ms off the execution time, at the cost of an additional 0.3 ms of planning time.
Thus, we have squeezed some additional performance out of our query as our indexes have made the searching process more efficient. We may have had to pay an upfront cost to create the index, but once created, repeat queries can be executed more quickly.
In this activity, we will determine whether index scans can be used to reduce query time. After creating our customer reporting system for the marketing department in Activity 10: Query Planning, we have received another request to allow records to be identified by their IP address or the associated customer names. We know that there are a lot of different IP addresses and we need performant searches. Plan out the queries required to search for records by IP address as well as for certain customers with the suffix Jr in their name.
Here are the steps to follow:
Expected output
The solution to the activity can be found on page 341.
In this activity, we have squeezed some additional performance out of our query as our indexes have made the searching process more efficient. We will learn how the hash index works in the next section.
The final indexing type we will cover is the hash index. The hash index has only recently gained stability as a feature within PostgreSQL, with previous versions issuing warnings that the feature is unsafe and reporting that the method is typically not as performant as B-tree indexes. At the time of writing, the hash index feature is relatively limited in the comparative statements it can run, with equality (=) being the only one available. So, given that the feature is only just stable and somewhat limited in options for use, why would anyone use it? Well, hash indices are able to describe large datasets (in the order of tens of thousands of rows or more) using very little data, allowing more of the data to be kept in memory and reducing search times for some queries. This is particularly important for databases that are at least several gigabytes in size.
A hash index is an indexing method that utilizes a hash function to achieve its performance benefits. A hash function is a mathematical function that takes data or a series of data and returns a unique length of alphanumeric characters depending upon what information was provided and the unique hash code used. Let's say we had a customer named "Josephine Marquez." We could pass this information to a hash function, which could produce a hash result such as 01f38e. Say we also had records for Josephine's husband, Julio; the corresponding hash for Julio could be 43eb38a. A hash map uses a key-value pair relationship to find data.
We could (but are not limited to) use the values of a hash function to provide the key, using the data contained in the corresponding row of the database as the value. As long as the key is unique to the value, we can quickly access the information we require. This method can also reduce the overall size of the index in memory, if only the corresponding hashes are stored, thereby dramatically reducing the search time for a query.
The following example shows how to create a hash index:
sqlda=# CREATE INDEX ix_gender ON customers USING HASH(gender);
You will recall that the query planner is able to ignore the indices created if it deems them to be not significantly faster for the existing query or just not appropriate. As the hash scan is somewhat limited in use, it may not be uncommon for a different search to ignore the indices.
In this exercise, we will generate a number of hash indexes and investigate the potential performance increases that can be gained from using them. We will start the exercise by rerunning some of the queries of previous exercises and comparing the execution times:
DROP INDEX <index name>;
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE gender='M';
The following output will be displayed:
We can see here that the estimated planning time is 0.107 ms and the execution time is 29.905 ms.
sqlda=# CREATE INDEX ix_gender ON customers USING btree(gender);
sqlda=#
The following is the output of the preceding code:
We can see here that the query planner has not selected the B-tree index, but rather the sequential scan. The costs of the scans do not differ, but the planning and execution time estimates have been modified. This is not unexpected, as these measures are exactly that – estimates based on a variety of different conditions, such as data in memory and I/O constraints.
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE gender='M';
The results of the five individual queries can be seen in the following screenshot; note that the planning and execution times differ for each separate execution of the query:
sqlda=# DROP INDEX ix_gender;
sqlda=# CREATE INDEX ix_gender ON customers USING HASH(gender);
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE gender='M';
The following output will be displayed:
As with the B-tree index, there was no benefit to using the hash index on the gender column, and so it was not used by the planner.
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE state='FO';
The following output will be displayed:
sqlda=# CREATE INDEX ix_state ON customers USING BTREE(state);
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE state='FO';
The following is the output of the preceding code:
Here, we can see a significant performance increase due to the B-tree index with a slight setup cost. How does the hash scan perform? Given that the execution time has dropped from 22.3 ms to 0.103 ms, it is reasonable to conclude that the increased planning cost has increased by approximately 50%.
sqlda=# DROP INDEX ix_state;
sqlda=# CREATE INDEX ix_state ON customers USING HASH(state);
sqlda=# EXPLAIN ANALYZE SELECT * FROM customers WHERE state='FO';
The following is the output of the preceding code:
We can see that, for this specific query, a hash index is particularly effective, reducing both the planning/setup time and cost of the B-tree index, as well as reducing the execution time to less than 1 ms from approximately 25 ms.
In this exercise, we used hash indexes to find the effectiveness of a particular query. We saw how the execution time goes down when using a hash index in a query.
In this activity, we will investigate the use of hash indexes to improve performance using the emails table from the sqlda database. We have received another request from the marketing department. This time, they would like us to analyze the performance of an email marketing campaign. Given that the success rate of email campaigns is low, many different emails are sent to many customers at a time. Use the EXPLAIN and ANALYZE commands to determine the planning time and cost, as well as the execution time and cost, of selecting all rows where the email subject is Shocking Holiday Savings On Electric Scooters:
Expected output:
The solution to the activity can be found on page 343.
In this activity, a sequential scan was used in this query rather than the hash scan created due to the current limitations of hash scan usage. At the time of writing, use of the hash scan is limited to equality comparisons, which involves searching for values equal to a given value.
So far in this chapter, we have looked at a number of different scanning methods, and the use of both B-trees and hash scans as a means of reducing query times. We have also presented a number of different examples of where an index was created for a field or condition and was explicitly not selected by the query planner when executing the query as it was deemed a more inefficient choice. In this section, we will spend some time discussing the appropriate use of indexes for reducing query times, since, while indexes may seem like an obvious choice for increasing query performance, this is not always the case. Consider the following situations:
The JOIN functionality in SQL-compliant databases provides a very powerful and efficient method of combining data from different sources, without the need for complicated looping structures or a series of individual SQL statements. We covered joins and join theory in detail in Chapter 3, SQL for Data Preparation. As suggested by the name of the command, a join takes information from two or more tables and uses the contents of the records within each table to combine the two sets of information. Because we are combining this information without the use of looping structures, this can be done very efficiently. In this section, we will consider the use of joins as a more performant alternative to looping structures. The following is the Customer Information table:
The following table shows the Order Information table:
So, with this information, we may want to see whether there are some trends in the items that are sold based on the customer's address. We can use JOIN to bring these two sets of information together; we will use the Customer ID column to combine the two datasets and produce the information shown in the following table:
We can see in the preceding example that the join included all of the records where there was information available for both the customer and the order. As such, the customer Meat Hook was omitted from the combined information since no order information was available. In the example, we executed INNER JOIN; there are, however, a number of different joins available, and we will spend some time looking through each of them. The following is an example that shows the use of a performant INNER JOIN:
smalljoins=# EXPLAIN ANALYZE SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers INNER JOIN order_info ON customers.customer_id=order_info.customer_id;
Refer to Chapter 3, SQL for Data Preparation, for more information on joins. In the next exercise, we will investigate the use of performant inner joins.
In this exercise, we will investigate the use of inner joins to efficiently select multiple rows of data from two different tables. Let's say that our good friends in the marketing department gave us two separate databases: one from SalesForce and one from Oracle. We could use a JOIN statement to merge the corresponding information from the two sources into a single source. Here are the steps to follow:
$ createdb smalljoins
$psql smalljoins < smalljoins.dump
$ psql smalljoins
smalljoins=# SELECT * FROM customers;
The following figure shows the output of the preceding code:
smalljoins=# SELECT * FROM order_info;
This will display the following output:
smalljoins=# SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers INNER JOIN order_info ON customers.customer_id=order_info.customer_id;
The following figure shows the output of the preceding code:
smalljoins=# SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty INTO join_results FROM customers INNER JOIN order_info ON customers.customer_id=order_info.customer_id;
The following figure shows the output of the preceding code:
smalljoins=# EXPLAIN ANALYZE SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers INNER JOIN order_info ON customers.customer_id=order_info.customer_id;
This will display the following output:
smalljoins=# EXPLAIN ANALYZE SELECT * FROM customers WHERE customer_id IN (SELECT customer_id FROM order_info);
The following screenshot shows the output of the preceding code:
Looking at the results of the two query planners, we can see that not only did the inner join take about a third of the time of the sequential query (0.177 ms compared with 1.533 ms), but also that we have returned more information by the inner join, with order_id, product_code, and qty also being returned.
smalljoins=# SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers LEFT JOIN order_info ON customers.customer_id=order_info.customer_id;
The following screenshot shows the output of the preceding code:
Notice the differences between the left join and the inner join. The left join has included the result for customer_id 4 twice, and has included the result for Meat Hook once, although there is no order information available. It has included the results of the left table with blank entries for information that is not present in the right table.
smalljoins=# EXPLAIN ANALYZE SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers LEFT JOIN order_info ON customers.customer_id=order_info.customer_id;
This will display the following output:
smalljoins=# EXPLAIN ANALYZE SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers RIGHT JOIN order_info ON customers.customer_id=order_info.customer_id;
The following screenshot shows the output of the preceding code:
Again, we have two entries for customer_id 4, Guybrush Threepwood, but we can see that the entry for customer_id 1, Meat Hook, is no longer present as we have joined on the basis of the information within the contents of the order_id table.
smalljoins=# EXPLAIN ANALYZE SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers RIGHT JOIN order_info ON customers.customer_id=order_info.customer_id;
The following screenshot shows the output of the preceding code:
We can see that the right join was marginally faster and more cost effective, which can be attributed to one less row being returned than in the left join.
smalljoins=# INSERT INTO order_info (order_id, customer_id, product_code, qty) VALUES (1621, 6, 'MEL386', 1);
smalljoins=# SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers FULL OUTER JOIN order_info ON customers.customer_id=order_info.customer_id;
This will display the following output:
Notice the line that contains product_code MEL386, but no information regarding the customer; there's a similar case for the line for customer_id Meat Hook. The full outer join has combined all available information even if some of the information is not available from either table.
smalljoins=# EXPLAIN ANALYZE SELECT customers.*, order_info.order_id, order_info.product_code, order_info.qty FROM customers FULL OUTER JOIN order_info ON customers.customer_id=order_info.customer_id;
The following screenshot shows the output of the preceding code:
The performance is very similar to that of the other queries, given that an additional row is provided, which can be clearly seen in the final output.
In this exercise, we were introduced to the usage and performance benefits of joins. We observed the combination of information from two separate tables using fewer resources than individual searches require, as well as the use of OUTER JOIN to efficiently combine all information. In the next activity, we will build upon our understanding of joins with a much larger dataset.
In this activity, our goal is to implement various performant joins. In this activity, we will use joins to combine information from a table of customers as well as information from a marketing email dataset. Say we have just collated a number of different email records from a variety of different databases. We would like to distill the information down into a single table so that we can perform some more detailed analysis. Here are the steps to follow:
Expected output
The output shows the final list of customers in the cities where we have no dealerships.
The solution to the activity can be found on page 346.
In this activity, we used joins to combine information from a table of customers as well as information from a marketing email dataset and helped the marketing manager to solve their query.
So far in this chapter, we have discovered how to quantify query performance via the query planner, as well as the benefits of using joins to collate and extract information from multiple database tables. In this section, we will construct reusable queries and statements via functions, as well as automatic function execution via trigger callbacks. The combination of these two SQL features can be used to not only run queries or re-index tables as data is added to/updated in/removed from the database, but also to run hypothesis tests and track the results of the tests throughout the life of the database.
As in almost all other programming or scripting languages, functions in SQL are contained sections of code, which provides a lot of benefits, such as efficient code reuse and simplified troubleshooting processes. We can use functions to repeat/modify statements or queries without re-entering the statement each time or searching for its use throughout longer code segments. One of the most powerful aspects of functions is also that they allow us to break the code into smaller, testable chunks. As the popular computer science expression goes "If the code is not tested, it cannot be trusted."
So, how do we define functions in SQL? There is a relatively straightforward syntax, with the SQL syntax keywords:
CREATE FUNCTION some_function_name (function_arguments)
RETURNS return_type AS $return_name$
DECLARE return_name return_type;
BEGIN
<function statements>;
RETURN <some_value>;
END; $return_name$
LANGUAGE PLPGSQL;
The following is a small explanation of the function used in the preceding code:
The DECLARE return_name return_type statement is only required if return_name is provided, and a variable is to be returned from the function. If return_name is not required, this line can be omitted from the function definition.
The complete PostgreSQL documentation for functions can be found at https://www.postgresql.org/docs/current/extend.html.
In this exercise, we will create the most basic function – one that simply returns a constant value – so we can build up a familiarity with the syntax. We will construct our first SQL function that does not take any arguments as additional information. This function may be used to repeat SQL query statements that provide basic statistics about the data within the tables of the sqlda database. These are the steps to follow:
$ psql sqlda
sqlda=# CREATE FUNCTION fixed_val() RETURNS integer AS $$
This line starts the function declaration for fixed_val, and we can see that there are no arguments to the function, as indicated by the open/closed brackets, (), nor any returned variables.
sqlda$#
sqlda$# BEGIN
sqlda$# RETURN 1;
sqlda$# END; $$
sqlda-# LANGUAGE PLPGSQL;
This will complete the function definition.
sqlda=# SELECT * FROM fixed_val();
This will display the following output:
Notice that the function is called using the open and closed brackets in the SELECT statement.
sqlda=# EXPLAIN ANALYZE SELECT * FROM fixed_val();
The following screenshot shows the output of the preceding code:
So far, we have seen how to create a simple function, but simply returning a fixed value is not particularly useful. We will now create a function that determines the number of samples in the sales table. Notice that the three rows being referenced in the preceding screesnhot refer not to the result of SELECT * FROM fixed_val(); but rather the result of the query planner. Looking at the first line of the information returned by the query planner, we can see that only one row of information is returned from the SELECT statement.
sqlda=# CREATE FUNCTION num_samples() RETURNS integer AS $total$
sqlda$# DECLARE total integer;
sqlda$# BEGIN
sqlda$# SELECT COUNT(*) INTO total FROM sales;
sqlda$# RETURN total;
sqlda$# END; $total$
sqlda-# LANGUAGE PLPGSQL;
This will complete the function definition, and upon successful creation, the CREATE_FUNCTION statement will be shown.
sqlda=# SELECT num_samples();
The following figure shows the output of the preceding code:
We can see that by using the SELECT statement in combination with our SQL function, there are 37,711 records within the sales database.
In this exercise, we have created our first user-defined SQL function and discovered how to create and return information from variables within the function.
Our aim here is to create a user-defined function so we can calculate the largest sale amount in a single function call. In this activity, we will reinforce our knowledge of functions as we create a function that determines the highest sale amount in a database. At this stage, our marketing department is starting to make a lot of data analysis requests and we need to be more efficient in fulfilling them, as they are currently just taking too long. Perform the following steps:
Expected output
The solution to the activity can be found on page 348.
In this activity, we created a user-defined function to calculate the largest sale amount from a single function call using the MAX function.
Our goal is now to create a function that will allow us to calculate information from multiple tables using a single function. In this exercise, we will create a function that determines the average value from the sales amount column, with respect to the value of the corresponding channel. After creating our previous user-defined function to determine the biggest sale in the database, we have observed a significant increase in the efficiency with which we fulfill our marketing department's requests.
Perform the following steps to complete the exercise:
$ psql sqlda
sqlda=# CREATE FUNCTION avg_sales(channel_type TEXT) RETURNS numeric AS $channel_avg$
sqlda$# DECLARE channel_avg numeric;
sqlda$# BEGIN
sqlda$# SELECT AVG(sales_amount) INTO channel_avg FROM sales WHERE channel=channel_type;
sqlda$# RETURN channel_avg;
sqlda$# END; $channel_avg$
sqlda-# LANGUAGE PLPGSQL;
sqlda=# SELECT avg_sales('internet');
The following figure shows the output of the preceding code:
Now do the same for the dealership channel:
sqlda=# SELECT avg_sales('dealership');
The following figure shows the output of the preceding code:
This output shows the average sales for a dealership, which is 7939.331.
In this exercise, we were introduced to using function arguments to further modify the behavior of functions and the outputs they return.
The df and sf commands
You can use the df command in PostgreSQL to get a list of functions available in memory, including the variables and data types passed as arguments:
The sf function_name command in PostgreSQL can be used to review the function definition for already-defined functions:
In this activity, our goal is to create a function with arguments and compute the output. In this activity, we will construct a function that computes the average sales amount for transaction sales within a specific date range. Each date is to be provided to the function as a text string. These are the steps to follow:
Expected output
The solution to the activity can be found on page 349.
In this activity, we constructed a function that computes the average sales amount for transaction sales within a specific date range from the database.
Triggers, known as events or callbacks in other programming languages, are useful features that, as the name suggests, trigger the execution of SQL statements or functions in response to a specific event. Triggers can be initiated when one of the following happens:
The timing of the trigger can also be specified to occur:
Depending upon the context and the purpose of the database, triggers can have a wide variety of different use cases and applications. In a production environment where a database is being used to store business information and make process decisions (such as for a ride-sharing application or an e-commerce store), triggers can be used before any operation to create access logs to the database. These logs can then be used to determine who has accessed or modified the data within the database. Alternatively, triggers could be used to re-map database operations to a different database or table using the INSTEAD OF trigger.
In the context of a data analysis application, triggers can be used to either create datasets of specific features in real time (such as for determining the average of data over time or a sample-to-sample difference), test hypotheses concerning the data, or flag outliers being inserted/modified in a dataset.
Given that triggers are used frequently to execute SQL statements in response to events or actions, we can also see why functions are often written specifically for or paired with triggers. Self-contained, repeatable function blocks can be used for both trialing/debugging the logic within the function as well as inserting the actual code within the trigger. So, how do we create a trigger? Similarly, to the case with function definitions, there is a standard syntax; again, the SQL keywords:
CREATE TRIGGER some_trigger_name { BEFORE | AFTER | INSTEAD OF } { INSERT | DELETE | UPDATE | TRUNCATE } ON table_name
FOR EACH { ROW | STATEMENT }
EXECUTE PROCEDURE function_name ( function_arguments)
Looking at this generic trigger definition, we can see that there are a few individual components:
Some other functions that we will use are these:
There are a number of different options available for SQL triggers that lie outside the scope of this book. For the complete trigger documentation, you can refer to https://www.postgresql.org/docs/current/sql-createtrigger.html.
In this exercise, we will create a trigger that updates the fields whenever data is added. For this exercise, we will use the smalljoins database from the section of this chapter on joins and will create a trigger that updates the stock value within products for a product each time that an order is inserted into the order_info table. Using such a trigger, we can update our analysis in real time as end users interact with the database. These triggers will remove the need for us to run the analysis for the marketing department manually; instead, they will generate the results for us.
For this scenario, we will create a trigger to update the records for the available stock within the database for each of our products. As items are bought, the triggers will be fired, and the quantity of available stock will be updated. Here are the steps to perform:
$ psql smalljoins < Functions.sql
$ psql smalljoins
smalljoins=# df
This will display the following output:
smalljoins=# SELECT * FROM products;
The following figure shows the output of the preceding code:
For the order_info table, we can write the following query:
smalljoins=# SELECT * FROM order_info;
The following figure shows the output of the preceding code:
smalljoins=# SELECT insert_order(4, 'MON636', 10);
The following figure shows the output of the preceding code:
smalljoins=# SELECT * FROM order_info;
This will display the following output:
Notice the additional row with order_id 1623.
smalljoins=# SELECT update_stock();
The following figure shows the output of the preceding code:
This function call will determine how many Red Herrings are left in inventory (after the sales of the 10 additional herrings) and will update the table accordingly.
smalljoins=# SELECT * FROM products;
The following figure shows the output of the preceding code:
Updating the stock values manually will quickly become tedious. Let's create a trigger to do this automatically whenever a new order is placed.
smalljoins=# DROP FUNCTION update_stock;
smalljoins=# CREATE FUNCTION update_stock() RETURNS TRIGGER AS $stock_trigger$
smalljoins$# DECLARE stock_qty integer;
smalljoins$# BEGIN
smalljoins$# stock_qty := get_stock(NEW.product_code) – NEW.qty;
smalljoins$# UPDATE products SET stock=stock_qty WHERE product_code=NEW.product_code;
smalljoins$# RETURN NEW;
smalljoins$# END; $stock_trigger$
smalljoins-# LANGUAGE PLPGSQL;
Note that in this function definition, we are using the NEW keyword followed by the dot operator (.) and the product_code (NEW.product_code) and qty (NEW.qty) field names from the order_info table. The NEW keyword refers to the record that was recently inserted, updated, or deleted and provides a reference to the information within the record.
In this exercise, we want the trigger to fire after the record is inserted into order_info and thus the NEW reference will contain this information. So, we can use the get_stock function with NEW.product_code to get the currently available stock for the record and simply subtract the NEW.qty value from the order record.
smalljoins=# CREATE TRIGGER update_trigger
smalljoins-# AFTER INSERT ON order_info
smalljoins-# FOR EACH ROW
smalljoins-# EXECUTE PROCEDURE update_stock();
smalljoins=# SELECT insert_order(4, 'MON123', 2);
The following figure shows the output of the preceding code:
smalljoins=# SELECT * FROM order_info;
This will display the following output:
smalljoins=# SELECT * FROM products;
The following figure shows the output of the preceding code:
Our trigger worked! We can see that the available stock for the Rubber Chicken + Pulley MON123 has been reduced from 7 to 5, in accordance with the quantity of the inserted order.
In this exercise, we have successfully constructed a trigger to execute a secondary function following the insertion of a new record into the database.
Our goal here is to create a trigger for keeping track of the data that is updated. Let's say you are working as a data scientist for Monkey Islands, finest distributor of questionable and obscure items. The business is looking at trying a few different strategies to increase the number of items in each sale. To simplify your analysis, you decide to add a simple trigger that for each new order computes the average quantity in all the orders and puts the result in a new table along with the corresponding order_id. Here are the steps to follow:
Expected output
The solution to the activity can be found on page 350.
In this activity, we created a trigger for continuously keeping track of the data that is updated to analyze a product in the database.
Sometimes, you have a lot of data, or perhaps insufficient hardware resources, and a query just runs for a very long time. In such a situation, you may need to stop the query, perhaps so you can implement an alternative query to get the information you need, but without the delayed response. In this section of the chapter, we are going to investigate how we can stop hanging, or, at least, hanging extremely long queries through the use of a secondary PostgreSQL interpreter. The following are some of the commands that we will use to kill queries:
Our goal here is to learn how to cancel a long query to save time when we are stuck at query execution. You have been lucky enough to receive a large data store and you decided to run what you originally thought was a simple enough query to get some basic descriptive statistics of the data. For some reason, however, the query is taking an extremely long time and you are not even sure that it is running. You decide it is time to cancel the query, which means you would like to send a stop signal to the query but allow it sufficient time to clean up its resources gracefully. As there may be a wide variety of hardware available to us and the data required to induce a long query could be quite a lot to download, we will simulate a long query using the pg_sleep command. Here are the steps to follow.
For this exercise, you will require two separate SQL interpreter sessions running in separate windows, as shown in the following figure:
C:> psql sqlda
This will display the following output in two separate windows:
sqlda=# SELECT pg_sleep(1000);
After pressing Enter, you should notice that the cursor of the interpreter does not return:
sqlda=# SELECT pid, query FROM pg_stat_activity WHERE state = 'active';
The following figure shows the output of the preceding code:
sqlda=# SELECT pg_cancel_backend(14131);
The following figure shows the output of the preceding code:
This output screenshot shows an error as the query was canceled after the user's request.
In this exercise, we learned how to cancel a query that has taken a long time to execute.
Now our aim is to terminate a long query using the pg_terminate_background command just as we used pg_cancel_backend to stop the process. In this activity, we will consider the scenario as being one in which the cancelation of the query was not enough to stop the excessively long process. In such a situation, we require something a little heavier handed that, rather than requesting a clean termination of the process, forces a process to be terminated. Launch two separate SQL interpreters. Here are the steps to follow:
Expected output
The solution to the activity can be found on page 352.
In this activity, we terminated a long-running query using the pg_terminate_background command.
In this chapter, we have covered a wide variety of topics all designed to help us understand and improve the performance of our SQL queries. The chapter began with a thorough discussion of the query planner, including the EXPLAIN and ANALYZE statements, as well as various indexing methods. We discussed a number of different compromises and considerations that can be made to reduce the time taken to execute queries. We considered a number of scenarios where indexing methods would be of benefit and others where the query planner may disregard the index, thus reducing the efficiency of the query. We then moved on to the use of joins to efficiently combine information from a number of different tables and ended with an in-depth look at functions and automatic function calls through the use of triggers.
In the next chapter, we will combine all of the topics we have covered thus far in a final case study, applying our SQL knowledge and the scientific method in general, as we solve a real-world problem.
18.216.172.229