Fixing disastrous joins

NULL is not the only thing crying out for disaster. Joins are also a good source of trouble if not used properly. If you have the feeling that you know everything about joins and that this section can be skipped, give it a chance. Things might not be as easy as expected.

Create demo data for joins

To show what can go wrong during a join, the best thing to do is to create a simplistic example. In this case, two tables are created:

test=# CREATE TABLE a (aid int);
CREATE TABLE
test=# CREATE TABLE b (bid int);
CREATE TABLE

Then some rows can be added:

test=# INSERT INTO a VALUES (1), (2), (3);
INSERT 0 3
test=# INSERT INTO b VALUES (2), (3), (4);
INSERT 0 3

Note that the tables are quite similar but not identical.

For the sake of completeness, here is how common values can be found:

test=# SELECT * FROM a, b WHERE aid = bid;
 aid | bid 
-----+-----
   2 |   2
   3 |   3
(2 rows)

Understanding outer joins

A simple inner join is not what we are most interested in when talking about joins and problems in general. A lot more trouble comes from the so-called outer joins.

Let's look for all values on the left-hand side and match those that are present on the right-hand side:

test=# SELECT * 
   FROM a LEFT JOIN b 
    ON (aid = bid);
 aid | bid 
-----+-----
   1 |    
   2 |   2
   3 |   3
(3 rows)

The query is pretty simple and actually quite risk free as well, but what happens if an AND clause is added to the ON clause? Many people tend to do that to filter the data and reduce the result:

test=# SELECT * 
   FROM a LEFT JOIN b 
    ON (aid = bid AND bid = 2);
 aid | bid 
-----+-----
   1 |    
   2 |   2
   3 |    
(3 rows)

Most people don't expect this result. Keep in mind that the ON clause is here to tell PostgreSQL what to join. In a LEFT JOIN keyword, the left side is always taken completely. The ON clause merely indicates what fits the left-hand side. The scary part here is that the number of rows returned by the query is not reduced at all. It has to be that way.

Somebody might argue that this is an easy problem to fix. Why not just move the filter to the WHERE clause, as shown here?

test=# SELECT * 
   FROM a LEFT JOIN b 
    ON (aid = bid) 
   WHERE bid = 2;
 aid | bid 
-----+-----
   2 |   2
(1 row)

Only one row is returned now. Then what is the point of using a LEFT JOIN keyword here? The outer join is totally useless in this case because the filter strips all the rows created by the outer join.

The trouble with outer joins is that a wrong outer join will hit our support desk and will be nicely covered up as a performance problem. Here is why:

test=# SELECT avg(aid), avg(bid) 
   FROM a LEFT JOIN b 
    ON (aid = bid AND bid = 2);
        avg         |        avg         
--------------------+--------------------
 2.0000000000000000 | 2.0000000000000000
(1 row)

The results look correct at first glance. However, the customer expects that bid = 2 actually reduces the amount of data processed. People wonder why indexes are not used and complain about bad performance. In reality, this is not a performance problem but a logical problem because the query is most likely not what was originally intended by the developer.

A second problem with outer joins has to do with NULL values:

test=# SELECT count(aid), count(bid), count(*) 
   FROM a LEFT JOIN b 
    ON (aid = bid AND bid = 2);
 count | count | count 
-------+-------+-------
     3 |     1 |     3
(1 row)

The count(bid) function returns 1 while count(*) function will return 3. Keep in mind that count(*) will count rows regardless of their contents. However, count(column) will count all values that don't happen to be NULL. As there is just one number, the correct answer is 1 here. The count(*) function and outer joins, if not used wisely, can result in a deadly mixture—ready to explode!

Before writing an aggregate, always try to fix the outer join first and then think twice. Where shall NULL values be expected? What kind of count is really needed? Is the AND operator inside the ON clause really doing what you think it is doing? Whenever there is AND in the ON clause, be attentive and check twice. The same applies to count(*) versus count(column). An error along these lines can affect you in a brutal way.

Let's focus our attention on FULL JOIN now:

test=# SELECT * 
   FROM a FULL JOIN b 
    ON (aid = bid);
 aid | bid 
-----+-----
   1 |    
   2 |   2
   3 |   3
     |   4
(4 rows)

The idea of a FULL JOIN keyword is to find everything on the left and match it with everything on the right. All values from each of the sides will be shown in the result.

After all that you have learned, do you expect the following result?

test=# SELECT * 
   FROM a FULL JOIN b 
    ON (aid = bid AND aid = 2);
 aid | bid 
-----+-----
   1 |    
   2 |   2
   3 |    
     |   4
     |   3
(5 rows)

Again, the AND clause will only tell the system what goes together; it does not reduce data—quite the opposite. In this example, the amount of data returned is even increased by one row.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.98.208