Chapter 2 – Best Practices for Table Design

“Beware of the young doctor and the old barber.”

- Benjamin Franklin

Converting Table Structures to Actian Matrix

image

Above, we are converting all of the tables in a Teradata database to Matrix table structures. We went to our Teradata system and right clicked on the database SQL_Class and chose "Convert Table Structures". We selected all of the tables and hit the blue arrow. We then chose to convert to Matrix. Watch in amazement what happens next!

Converting Table Structures to Actian Matrix Finale

image

All 20 Teradata tables have now been converted to Matrix. Just cut and paste to your Matrix system, and you have converted the tables.

Best Practices for Designing Tables

1.Choose the best sort key

2.Choose a great distribution key

3.Consider defining primary key and foreign key constraints

4.Use the smallest possible column size

5.Use date/time data types for date columns

6.Specify redundant predicates on the sort column

“I have found the best way to give advice to your children is to find out what they want and then advise them to do it.”

--Harry S. Truman

As you design your database, there are important decisions you must make that will heavily influence overall query performance. These design choices also have a significant effect on how data is stored, which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process certain queries. Harry S. Truman was right. "If you want your Matrix system to run brilliantly, take advice from your users, and use best practices to deliver what they asked for".

Choose the Best Sort Key

When you give an Actian Matrix table a sort key, it stores your data on disk in sorted order.

The sort order is used by the optimizer to determine optimal query plans.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.

If you do range filtering or equality filtering on one column, specify that column as the sort key.

If you frequently join a table, specify the join column as both the sort key and the distribution key.

Data sorted correctly helps eliminate unneeded blocks. This is because
Matrix has metadata on each block showing column min and max values.

When you give an Actian Matrix table a sort key, it stores your data on disk in sorted order. The sort order is used by the optimizer to determine optimal query plans. If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. If you frequently join a table, specify the join column as both the sort key and the distribution key. This enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.

Each Block Comes With Metadata

image

Actian Matrix stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table. Where most databases use indexes to determine where data is, Matrix uses the block's metadata to determine where data is NOT!

Our query above is looking for data WHERE Order_Total < 300. The metadata shows this block will contain rows, and therefore it will be moved into memory for processing. Each slice has metadata for each of the blocks they own.

Creating a Sort Key

image

There are three basic reasons to use the sortkey keyword when creating a table. 1) If recent data is queried most frequently, specify the timestamp or date column as the leading column for the sort key. 2) If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. 3) If you frequently join a (dimension) table, specify the join column as the sort key. Above, you can see we have made our sortkey the Order_Date column. Look how the data is sorted!

Sort Keys Help Group By, Order By and Window Functions

image

When data is sorted on a strategic column, it will improve (GROUP BY and ORDER BY operations), window functions (PARTITION BY and ORDER BY operations), and even as a means of optimizing compression. But, as new rows are incrementally loaded, these new rows are sorted but they reside temporarily in a separate region on disk. In order to maintain a fully sorted table, you need to run the VACUUM command at regular intervals. You will also need to run ANALYZE.

Choose a Great Distribution Key

Good data distribution has two goals:

1.To distribute data evenly among the nodes and slices in a cluster.

2.To collocate data for joins and aggregations.

image

Uneven distribution, or data skew, forces some nodes to do more work than others which slows down the entire process. With parallel processing, a query is only as fast as the slowest node. Even distribution is a key concept when each node processes the information they own simultaneously with their node peers.

When rows that participate in joins or aggregations are located on different nodes, more data has to be moved among nodes. This is because Actian Matrix must ensure that two rows being joined are on the same node in the same memory. If this is not the case, then Matrix will either copy the smaller table to all nodes temporarily or redistribute one or both tables.

Distribution Key Where the Data is Unique

image

The entire row of a table is on a slice, but each column in the row is in a separate container (block). A Unique Distribution Key spreads the rows of a table evenly across the slices. A good Distribution Key is the key to good distribution!

Matching Distribution Keys for Co-Location of Joins

image

Notice that both tables are distributed on Dept_No. When these two tables are joined WHERE Dept_No = Dept_No, the rows with matching department numbers are on the same Slice. This is called Co-Location. This makes joins efficient and fast.

Big Table / Small Table Joins

image

Notice that the Department_Table has only four rows. Those four rows are copied to every slice. This is distributed by ALL. Now, the Department_Table can be joined to the Employee_Table with a guarantee that matching rows are co-located. They are co-located because the smaller table has copied ALL of its rows to each slice. When two joining tables have one large table (fact table) and the other table is small (dimension table), then use the ALL keyword to distribute the smaller table.

Define Primary Key and Foreign Key Constraints

1.Define primary key and foreign key constraints between tables wherever appropriate.

2.Primary key and foreign key constraints are informational only.

3.Actian Matrix does not enforce unique, primary key, and foreign key constraints.

4.The query planner uses these keys in certain statistical computations, to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large
numbers of joins, and to eliminate redundant joins.

5.Actian Matrix does enforce NOT NULL column constraints.

Actian Matrix does not enforce unique, primary-key, and foreign-key constraints. Your application is responsible for ensuring uniqueness and managing the DML operations. The query planner will use primary and foreign keys in certain statistical computations to infer uniqueness and referential relationships that affect subquery decorrelation techniques, to order large numbers of joins, and to eliminate redundant joins. The planner leverages these key relationships, but it assumes that all keys in Actian Matrix tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

Primary Key and Foreign Key Examples

image

The query planner uses referential integrity in certain situations, to infer uniqueness and for referential relationships that affect subquery techniques, to order large numbers of joins, and to eliminate redundant joins.

Actian Matrix does not enforce primary key and foreign key constraints. The only reason to apply them is so the query optimizer can generate a better query plan.

Use the Smallest Column Size When Creating Tables

image

You will improve query performance by
reducing columns to the minimum possible size.

Table size is not impacted, but query processing will be if the column is being processed in a temporary table to gather intermediate results.

Actian Matrix compresses column data very effectively, so creating columns much larger than necessary has minimal impact on the size of data tables. It is in the processing of queries that the size can hurt you. This is because during processing for complex queries, intermediate query results might need to be stored in temporary tables. Because temporary tables are not compressed, unnecessarily large columns consume excessive memory and temporary disk space, which can affect query performance. Don't go overboard here! Don't have columns so small that they can't contain your largest values!

Use Date/Time Data Types for Date Columns

image

Use the DATE or TIMESTAMP data type rather than a character type when storing date/time information. Actian Matrix stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results in better query performance. Let Actian Matrix handle the DATE or TIMESTAMP conversions internally instead of you trying to do so in your applications. Most of the time users utilize CHAR or VARCHAR is in the ETL process of moving data. There is no need to do that because Matrix handles any conversions necessary.

Specify Redundant Predicates on the Sort Column

We want to join Table1 with Table2. Both have a distribution key on Customer_Number and both have a sort key on Order_Date.

SELECT T1.*, T2.Order_Number

FROM Table1 as T1

INNER JOIN

Table2 as T2

ON T1.Customer_Number = T2.Customer_Number

WHERE T1.Order_Date > '1/1/2014';

image

You should consider using a predicate on the leading sort column of the fact table, or the largest table, in a join. You can also add predicates to filter other tables that participate in the join, even when the predicates are redundant. These predicates refer to WHERE or AND clauses. Because Matrix has the max and min value for each column per block, you can get better performance when you choose a good sortkey. This allows Matrix to skip reading certain blocks because Matrix always checks the min and max values to see if the block should even be read. The second example above uses a redundant AND clause in hopes the entire table won't have to be read.

Setting the statement_timeout to Abort Long Queries

image

The above query aborts because it took longer than 10 milliseconds. The statement_timeout is designed to abort any statement that takes over the milliseconds specified. If the system setting WLM timeout (max_execution_time) is also specified as part of a WLM configuration, the lower of statement_timeout and max_execution_time is used.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.220.82.93