11. Inner Joins

Keywords Introduced

INNER JOIN • ON

Back in Chapter 1, we talked about the huge advance of relational databases over their predecessors. The significant achievement of relational databases was in their ability to allow data to be organized in any number of tables that are related but at the same time independent of each other. Prior to the advent of relational databases, traditional databases utilized a chain of internal pointers to explicitly define the relationships between tables. For example, you might start with a Customers table and then follow pointers to find the first order for a particular customer, then the next order, and so on until all orders for the customer have been retrieved. In contrast, relational databases allow relationships to be inferred by columns that tables have in common. These relationships are sometimes formalized by the definition of primary and foreign keys, but that isn’t always necessary.

With relational databases, it is the responsibility of the SQL developer to determine and define the relationships between tables. This allows for great flexibility in how different data elements can be combined. The great virtue of relational databases lies in the fact that someone can grab data from a variety of tables in numerous ways.

Let’s start with a common example. Most organizations have a business entity known as a customer. As such, a database typically contains a Customers table that defines each customer. Such a table would normally include a primary key to uniquely identify each customer, and any number of columns with attributes that further define the customer. Common attributes might include phone number, address, city, state, and so on.

The main idea is that all information about the customer is stored in a single table and only in that table. This simplifies the task of data updates. When a customer changes their phone number, only one table needs to be updated. However, the downside to this setup is that whenever someone needs any information about a customer, they must access the Customers table to retrieve the data.

This brings us to the concept of a join. Let’s say that someone is analyzing products that have been purchased. Along with information about the products, it might be necessary to provide information about the customers who purchased each product. For example, an analyst may wish to obtain customer ZIP codes for a geographic analysis. Whereas product information may be found in a Products table, the ZIP code is stored only in the Customers table. To get information on both customers and products, the analyst must join both tables together in such a way that the data matches correctly.

In essence, the promise of relational databases is fulfilled by the ability to join tables together in any desired manner. This is the normal situation. With this chapter, we will leave behind the somewhat artificial examples for which data is retrieved only from a single table and encounter more realistic scenarios involving data in multiple tables.

Joining Two Tables

To begin our exploration of the join process, let’s revisit the Sales table that we have previously encountered:

Image

To some extent, the use of this table in earlier chapters was somewhat misleading. In reality, a competent database designer would seldom create a table such as this. The problem is that it contains information about two separate entities, customers and orders. In the real world, this information would be split into at least two separate tables. A Customers table might look like the following table, seen previously in Chapter 2.:

Image

An Orders table might look like this:

Image

In this Orders table, we’ve chosen to include OrderDate and OrderAmount columns, rather than the QuantityPurchased and PricePerItem columns seen in the first Sales table. The data that was in the Sales table has now been split into two separate tables. The Customers table contains information only about customers. The Orders table contains information solely about items purchased. The Orders table includes a CustomerID column to indicate which customer placed the order. As you might remember from Chapter 1, “Relational Databases and SQL,” this is referred to as a foreign key.

The Customers and Orders tables both have four rows but that is something of a coincidence. There is one customer in the Customers table who has not placed an order. As you can see, CustomerID 4, Sofia Flores, does not appear in the Orders table. On the other hand, George Miller has placed two different orders, as evidenced by the two rows in the Orders table with an CustomerID of 2.

Even with these two tables, this is still a simplistic scenario with much missing. For example, an Orders table would typically include additional columns, such as information on the tax collected or the salesperson’s name. Plus, the Orders table itself might in fact be split into more than one table so that information about the entire order, such as the order date, could be stored apart from information about each item that was ordered. In other words, this is still not a completely realistic example. However, now that we’ve split our information into two separate tables, we can address how to create a SELECT statement that can pull data from both tables simultaneously.

Before we get to the SELECT statement itself, we need to address one additional concern, which is how to visually represent the two tables and the implied relationship that exists between them. Previously, we displayed each table with column names on the top row and corresponding data on subsequent rows. Now that we have more than one table to deal with, we’ll introduce another type of visual representation. Figure 11.1 shows a diagram with both tables, with the table name on the top row and the column names in each subsequent row. This diagram is a simplified version of what is commonly called an entity-relationship diagram. The term entity refers to the tables, and relationship refers to the line drawn between the data elements in those tables. Rather than showing detailed data, this diagram indicates the overall structure of the data.

Image

Figure 11.1 Entity-relationship diagram

The important point to notice is that we’ve drawn a line from CustomerID in the Customers table to the CustomerID in the Orders table. This indicates a relationship between these two tables, namely that both tables share values stored in the CustomerID column.

The Inner Join

We are now ready to present a SELECT statement with what is called an inner join:

SELECT *
FROM Customers
INNER JOIN Orders
ON Customers.CustomerID = Orders.CustomerID

Let’s examine this statement line by line. The SELECT keyword on the first line merely states that we want all (*) columns from both tables. The second line, with the FROM clause, indicates that the first table we want to specify is the Customers table. The third line introduces an INNER JOIN keyword, which is used to specify an additional table that we want to include in our query. In this case, we want to add the Orders table.

Finally, the fourth line introduces the ON keyword. This ON works in conjunction with the INNER JOIN and specifies exactly how the two tables are to be joined. In this situation, we are connecting the CustomerID column of the Customers table (Customers.CustomerID) to the CustomerID column of the Orders table (Orders.CustomerID). Because the CustomerID column has the same name in both the Customers and Orders table, we need to specify the table name as a prefix to the CustomerID column name. The prefix allows us to distinguish between these columns in two separate tables.

The preceding SELECT statement produces this data:

Image

Let’s analyze the results. Both the Customers table and the Orders table had four rows. Looking at the OrderID column, you can tell that we have data from all four rows from the Orders table. However, looking at the CustomerID column, you might notice that only three customers are shown. Why is that? The answer is that the customer with a CustomerID of 4 doesn’t exist in the Orders table. Because we’re joining the two tables together on the CustomerID field, we have no rows in the Orders table that match the CustomerID of 4 in the Customers table.

This brings us to an important observation: An inner join only returns data for which there is a match between both tables being joined. In the next chapter, we’ll talk about an alternative method of joining tables that will allow the customer information for the CustomerID of 4 to be shown, even if there are no orders for that customer.

Here’s a second important observation: Notice that the customer data for George Miller is repeated twice. He existed only once in the Customers table, so one might wonder why he appears on two rows. The answer is that the INNER JOIN causes all possible matches to be shown. Because George has two rows in the Orders table, both of these rows match with his row in the Customers table, resulting in his customer information being displayed twice.

Finally, you may be wondering why this join is referred to an inner join. There are, in fact, two main variations of the join: the inner join and the outer join. Outer joins will be covered in the next chapter.

Table Order in Inner Joins

An inner join brings back data where there is a match between the two specified tables. In the previous SELECT, we specified the Customers table in the FROM clause and the Orders table in the INNER JOIN clause. We might ask whether it matters which table is specified first. As it turns out, for inner joins, the order in which the tables are listed can be reversed with no difference in the results. The following two SELECT statements are logically identical and return the same data:

SELECT *
FROM Customers
INNER JOIN Orders
ON Customers.CustomerID = Orders.CustomerID
SELECT *
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID = Customers.CustomerID

The only difference is that the first statement would display columns from the Customers table first and the Orders table second. The second statement would display columns from the Orders table first and the Customers table second. Despite the order of the columns, both statements return identical data.

Remember that SQL is not a procedural language. It doesn’t specify how a task is to be completed. SQL only indicates the desired logic, and leaves it to the internals of the database to decide exactly how to perform the required task. As such, SQL doesn’t influence or care how the database physically retrieves data. It doesn’t define which table to look at first. Instead, the database software determines the optimal method of obtaining the data.

Implicit Inner Joins

In the previous examples, we used the INNER JOIN and ON keywords to explicitly specify inner joins. It is also possible to specify inner joins with just the FROM and WHERE clauses. This is an older format and is sometimes referred to as implicit inner joins. We have already seen this statement that joins the Customers and Orders tables:

SELECT *
FROM Customers
INNER JOIN Orders
ON Customers.CustomerID = Orders.CustomerID

An alternate way of implicitly specifying the same inner join without the INNER JOIN and ON keywords is:

SELECT *
FROM Customers, Orders
WHERE Customers.CustomerID = Orders.CustomerID

In this older format, rather than using the INNER JOIN keyword to define the new table to join to, we merely list all tables to be joined in the FROM clause. Instead of using the ON clause to say how the tables are related, we use the WHERE clause to specify the relationship between the tables.

Even though this method works perfectly well and produces the same results, we highly recommend not using this format. The advantage of the INNER JOIN and ON keywords is that they explicitly present the logic of the join. That is their only purpose. Although it is possible to specify the relationship in a WHERE clause, the meaning of the SQL statement is much less apparent when the WHERE clause is used for both selection criteria and to indicate relationships between multiple tables. Further, the implicit join format does not work with outer joins, a topic to be covered in the following chapter.

Table Aliases Revisited

Let’s now look at the columns that were returned from the previous SELECT statement. Because we specified all (*) columns, we see all columns from both tables. The CustomerID column appears twice because that column exists in both tables. In practice, however, we would not want this data repeated. Here’s an alternate version of that SELECT, which now specifies only the columns we want to see. In this variant, we employ table aliases, C for Customers and O for Orders, which are placed immediately after the FROM and INNER JOIN keywords by inserting the AS keyword. The statement looks like this:

SELECT
C.CustomerID AS ‘Cust ID’,
C.FirstName AS ‘First Name’,
C.LastName AS ‘Last Name’,
O.OrderID AS ‘Order ID’,
O.OrderDate AS ‘Date’,
O.OrderAmount AS ‘Amount’
FROM Customers AS C
INNER JOIN Orders AS O
ON C.CustomerID = O.CustomerID

The results are:

Image

In this statement, we’re displaying only the CustomerID from the Customers table, and not from the Orders table. Also notice that we’re using the AS keyword to specify both column and table aliases. Note that the AS keyword is completely optional. All of the AS keywords can be removed from this SELECT, and the statement would still be valid and return the same results. However, we recommend using the AS keywords for the sake of clarity.

Looking Ahead

The ability to join tables together in a query is an essential feature of SQL. Relational databases would be of little use without joins. This chapter focused on the formulation of the inner join. The inner join brings back data for which there is a match between both tables being joined. We also talked about an alternate way of implicitly specifying the inner join, and the usefulness of using table aliases.

In our next chapter, we will turn to another important type of join: the outer join. As mentioned, inner joins only allow us to view data where there is a match between the tables being joined. So, if you have a customer with no orders, you won’t see any customer information when doing an inner join between a Customers table and an Orders table. The outer join will allow you to view that customer information, even if there are no orders for the customer. In other words, the outer join lets us see data that we would not otherwise be able to obtain with an inner join. Additionally, the next chapter will begin to introduce scenarios in which more than two tables are joined.

Joining Two Tables

The Inner Join

Table Order in Inner Joins

Implicit Inner Joins

Table Aliases Revisited

Looking Ahead

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
34.201.8.144