Chapter 5 – How Joins Work Internally

“Only he who attempts the ridiculous may achieve the impossible”

Don Quixote

 

Netezza Join Quiz

image

Do you know which statement above is False?

 

Netezza Join Quiz Answer

image

All statements above are TRUE! Two joining rows have to be in the same memory of a single SPU

 

Redistribution

image

Netezza can redistribute data (temporarily) by re-hashing Customer_Number from the Order_Table. Now, all joining rows will be on the same SPUs memory. That is one of two way to get matching rows together.

 

Duplication of the Smaller Table across All-SPUs

image

Netezza took the Department_Table and gathered up all 5-rows (temporarily) and in memory Duplicated the entire 5-row Table across all SPUs. Now the joins can happen! This is the second way to get rows together. If one table is much bigger than the other, Netezza will duplicate the smaller table on all SPUs, just for the life of the query.

 

If the Join Condition is the Distribution Key no Movement

image

Netezza knows that it can only JOIN two rows together if they are physically on the same SPU. This can occur naturally if the join condition columns are the Distribution Keys of their respective tables, but most likely Netezza will have to move data to get the matching rows on the same SPU. What will the Optimizer decide to do next?

 

Matching Rows landed on SPU because of Distribution Keys

image

If both the Employee_Table and the Department_table (being joined by Dept_No) also have Dept_No as their respective Distribution Keys they are considered co-located. Anytime these two tables are joined, there is no need to redistribute or duplicate because the matching rows are naturally on the same SPU. That is the brilliance of the Netezza Hash Formula. All matching values are sent to the same SPU.

 

What if the Join Condition Columns are Not Distribution Keys

image

The Optimizer (PE) knows that the Dept_No column is the Distribution Key for the Department_Table. It also know s that the Dept_No column is NOT the Distribution Key for the Employee_Table, so the Optimizer commands the SPUs to Redistribute the entire Employee_Table by Dept_No temporarily. This is equivalent to loading the Employee_Table with a Distribution Key of Dept_No. Now all matching rows can join.

 

Quiz – Redistribute the Employees by their Dept_No

image

Fill in the quiz above. This is a great opportunity to understand the Netezza engine.

 

Quiz – Employees’ Dept_No landed on SPU with Matches

image

Each redistributed row landed on the same SPU as its matching row. Notice that Squiggy Jones has a NULL department so Netezza will not redistribute that row on an Inner Join. Smythe in Dept_No 10 hashes to SPU 1 but has no match. Turn the page.

 

When Rows are on the same SPU they can be Joined

image

 

Quiz – Redistribute the Orders to the Proper SPU

image

If Netezza decides to Redistribute the Order_Table by Customer_No, which SPUs will hold which Orders? Place their Customer_Number and Order_Total on the SPU after Redistribution.

Fill in the quiz above. This is a great opportunity to understand the Netezza engine.

 

Answer to Redistribute the Employees by their Dept_No Quiz

image

The Netezza Hashing Formula is consistent. It is used to load a Table’s rows via the Distribution Key of the table. Netezza follows the same Hash Formula to Redistribute for Joins.

Each redistributed row landed on the same SPU as its matching row. Turn the page.

 

A Visual of the Join in Action

image

 

The Joining of Two Tables

image

CustNo (1-6) (red) are the Join Condition (PK/FK). Each customer has placed one order. The matching join rows are on different SPUs because the tables were distributed by different Distribution Keys. Netezza gets the joining rows on the same SPU. They will redistribute the Order_Table by Cust_No in SPU memory.

 

Netezza Moves Joining Rows to the Same SPU

image

On all joins the matching rows must be on the same SPU, so hashing is how it is done.

 

Imagine Joining Two Random Distribution Tables

image

CustNo (1-6) (red) are the Join Condition (PK/FK). Each customer has placed one order. The matching join rows are on different SPUs because both tables are distributed randomly, but evenly. How will Netezza get the joining rows on the same SPU? They will redistribute both tables by CustNo or broadcast the smaller table.

 

Both Tables are Redistributed to Join Rows on the Same SPU

image

On all joins, the matching rows must be on the same SPU so hashing is how it is done.

 

How do you join if One Table is Big and One Table is Small?

image

CustNo is the Join Condition (PK/FK), and the Customer_Table has CustNo as its Primary Index. The Order_Table has OrderNo as its Distribution Key, but this table has millions of rows and the Customer_Table only has four rows. Netezza will NOT redistribute a big table, but instead duplicate the smaller table across all SPUs.

 

Duplicate the Small Table on Every SPU (like a mirror)

image

On a Big Table/Small Table join, the smaller table is duplicated on all SPUs.

 

What Could You Do If Two Tables Joined 1000 Times a Day?

image

Each time these two tables are joined via the CustNo column, there will be no data movement because the matching CustNo rows will be on the same SPU. That is because CustNo is the Distribution Key for both tables, so the matching rows are hashed and distributed to the same SPU. This is the beauty of the Hash Formula.

 

Joining Two Tables with the same PK/FK Primary Index

image

CustNo is the join condition (PK/FK), so the matching customer numbers are on the same SPU. They were hashed there originally. Each customer has placed one order. Netezza will have each SPU move their blocks into memory and perform a "Row Hash join". Those key words in the Explain tell you the join is taking place.

 

A Join With No Redistribution or Duplication

image

Both tables have the same Distribution Key and it is the join condition of CustNo.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.46.141