Chapter 18
Optimizing VertiPaq

The previous chapter introduced some of the internals of VertiPaq. That knowledge is useful for designing and optimizing a data model so that DAX queries execute faster. While the previous chapter was more theoretical, in this chapter we move on to the more practical side. Indeed, this chapter describes the most important guidelines for saving memory and thereby improving the performance of a data model. The main objective in creating an efficient data model is to reduce the cardinality of columns in order to decrease the dictionary size, improve the compression, and speed up iterations and filters.

The final goal of the chapter is optimizing a model. However, before going there, the first and most important skill to learn is the ability to evaluate the pros and cons of each design choice. You should not follow any rules blindly without evaluating their impact. For this reason, the first part of the chapter illustrates how to measure the size of each object in a model in memory. This is important when evaluating whether a decision made on a model was worth the effort or not, based on the memory impact of the decision.

Before moving on, we want to stress once more this important concept: You should always test the techniques described here on every data model. Data distribution is important in VertiPaq. The very same Sales table structure may be compressed in different ways because of the data distribution, leading to different results for the same optimization techniques. Do not learn best practices. Instead, learn different optimization techniques, knowing in advance that not all of them will be applicable to every data model.

Gathering information about the data model

The first step for optimizing a data model is gathering information about the cost of the objects in the database. This section describes the tools and the techniques to collect all the data that help in prioritizing the possible optimizations of the physical structure.

Table 18-1 shows the pieces of information to collect from each object in a database.

Table 18-1 Information to collect for each object in a database

Object          Information to Collect
Table           Number of rows
Column          Number of unique values
                Size of dictionary
                Size of data (total size of all segments)
Hierarchy       Size of hierarchy structure
Relationship    Size of relationship structure

In general, object size strongly depends on the number of unique values in the columns being used or referenced. For this reason, the number of unique values in a column, also known as column cardinality, is the single most important piece of information to gather from a database.

In Chapter 17, “The DAX engines,” we introduced the Dynamic Management Views (DMVs) to retrieve information about the objects in the VertiPaq storage engine. The following sections describe how to interpret the relevant information through VertiPaq Analyzer, which simplifies the collection of data from DMVs.

The first piece of information to consider in a data model is the size of each table, in terms of cardinality (number of rows) and size in memory. Figure 18-1 shows the Table section of VertiPaq Analyzer executed on a Contoso data model in Power BI. The model used in this example contains more tables and data than the simplified data model previously used throughout the book.

Figure 18-1 Details of tables shown in VertiPaq Analyzer.

The Table Size column represents the amount of memory used to store the compressed data in VertiPaq, whereas the Cardinality column shows the number of rows of each table. By drilling down on a table name, it is possible to see the details of each column. At the column level, Cardinality shows the number of unique values of the column across the entire table; however, the Table Size value is not available because each column only has the cost shown in Columns Total Size. For example, Figure 18-2 shows the columns available in the largest table of the data model, SalesQuota; note that the total size of each column varies greatly within the same table.

Figure 18-2 Details of tables and columns shown in VertiPaq Analyzer.

Each column reported by VertiPaq Analyzer carries a specific meaning described in the following list:

  • Cardinality: Object cardinality; the number of rows in a table or the number of unique values in a column, depending on the level of detail in the report.

  • Rows: Number of rows in the table. This metric is shown in the columns report (visible later in Figure 18-3) and not in the table report (in Figure 18-2), where the same information is available in the Cardinality metric, at the table detail level of the report.

  • Table Size: Size of the table in bytes. This metric contains the sum of Columns Total Size, User Hierarchies Size, and Relationships Size.

  • Columns Total Size: Size in bytes of a column. This metric contains the sum of Data Size, Dictionary Size, and Columns Hierarchies Size.

  • Data Size: Size in bytes of all the compressed data in segments and partitions. It does not include dictionary and column hierarchies. This number depends on the compression of the column, which, in turn, depends on the number of unique values and the distribution of the data across the table.

  • Dictionary Size: Size in bytes of dictionary structures. This number is only relevant for columns with hash encoding; it is a small fixed number for columns with value encoding. The dictionary size depends on the number of unique values in the column and on the average length of the strings in case of a text column.

  • Columns Hierarchies Size: Size in bytes of the automatically generated attribute hierarchies for columns. These hierarchies are necessary to access a column in MDX, and they are also used by DAX to optimize filter and sort operations.

  • Encoding: Type of encoding (hash or value) used for the column. The encoding of a column is selected automatically by the VertiPaq compression algorithm.

  • User Hierarchies Size: Bytes of user-defined hierarchies. This structure is computed at the table level, and its values are only visible at the table level detail in a VertiPaq Analyzer report. The user hierarchy size depends on the number of unique values and on the average length of the strings of the columns used in the hierarchy itself.

  • Relationship Size: Bytes of relationships between tables. The relationship size is related to the table on the many-side of a relationship. The size of a relationship depends on the cardinality of the columns involved in the relationship, although this is usually a tiny fraction of the cost of the table.

  • Table Size %: Ratio of Columns Total Size versus Table Size.

  • Database Size %: Ratio of Table Size versus Database Size, which is the sum of Table Size for all the tables.

  • Segments #: Number of segments. All the columns of a table have the same number of segments as the table itself.

  • Partitions #: Number of partitions. All the columns of a table have the same number of partitions as the table itself.

  • Columns #: Number of columns.

The first possible optimization using VertiPaq Analyzer reports is removing any columns that are not useful for the reports and that are expensive in memory. For example, the data shown in Figure 18-2 highlights that one of the most expensive columns of the SalesQuota table is SalesQuotaKey. SalesQuotaKey is not used in any report, and it is not required by the data model structure, as it would be if it were used in a relationship. Indeed, the SalesQuotaKey column could be removed from the model without affecting any report or calculation, saving both refresh time and precious memory.

The process of identifying the most expensive columns is made simpler by using another report available in VertiPaq Analyzer, shown in Figure 18-3. This Columns report shows all the columns in a flattened list where the reported name is the concatenation of the table and column names, sorted by descending Columns Total Size.

Figure 18-3 Details of columns shown in VertiPaq Analyzer.

Two of the three most expensive columns of the entire Contoso data model, OnlineSalesKey and SalesOrderNumber in the OnlineSales table, are seldom used in a report at the aggregated level. Each of these two columns imported in VertiPaq requires 10% of the data size of the entire data model. By removing these two columns, it is possible to save 20% of the database size. Being aware of the cost of every column helps one choose what to keep in the data model and what is too expensive relative to its analytical value.

The reason why the report in Figure 18-3 shows Rows and Cardinality side-by-side is to help recognize columns that are unique in a table. When the two numbers are close or identical, the column contains a unique value for almost every row; summarizing such a column is rarely useful unless it is the target of an aggregation, such as the Amount column in the StrategyPlan table.

Another important piece of information available in VertiPaq Analyzer is included in the Relationships report shown in Figure 18-4. This report makes it easy to identify expensive relationships present in a data model, even though there are no critical situations in this specific example.

Figure 18-4 Size and cardinality of relationships shown in VertiPaq Analyzer.

In VertiPaq, relationships with a cardinality larger than 1 million unique values are particularly expensive, impacting the storage engine cost of any request involving that relationship. A common rule of thumb is to start paying attention to a relationship whenever its cardinality exceeds 100,000. Such relationships usually do not produce visible performance issues, but their presence starts to be measurable in hundreds of milliseconds and could create problems with any future growth of the database. While a single large relationship does not necessarily slow down a report visibly, its presence can undermine the performance of more complex calculations and reports.

An awareness of the cardinality of tables and columns is important in any further analyses of a DAX query’s performance. While this information could be retrieved by running simple DAX queries, it is faster and more efficient to use a tool like VertiPaq Analyzer to collect this data automatically—spending more time evaluating the metrics obtained rather than manually running trivial queries on the data model.
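
For a quick check of these numbers, a simple DAX query can return the number of rows and the cardinality of specific columns. The following is a minimal sketch that assumes the OnlineSales table and the two columns mentioned earlier are available with these names:

EVALUATE
ROW (
    "Rows", COUNTROWS ( OnlineSales ),
    "OnlineSalesKey Cardinality", DISTINCTCOUNT ( OnlineSales[OnlineSalesKey] ),
    "SalesOrderNumber Cardinality", DISTINCTCOUNT ( OnlineSales[SalesOrderNumber] )
)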

Denormalization

The first optimization that can be applied to a data model is to denormalize data. Every relationship has a memory cost and introduces additional overhead when the engine transfers a filter from one table to another. Purely from a performance point of view, an optimal model would be one made of a single table. However, such an approach would be hardly usable and would force a single granularity for all the measures. Thus, an optimal data model is organized as a star schema around each table containing measures that share the same granularity. For this reason, one should denormalize unnecessary related tables, thus reducing the number of columns and relationships in the data model.

The denormalization required in a data model for DAX is usually counterintuitive for anyone with some experience in data modeling for a relational database. For instance, consider a simple data model where a Payment Type table has two columns, Payment Type Code and Payment Type Description. In a relational database, a table with Code and Description is commonly used to avoid duplicating the description content in each row of a Transactions table. It is common practice to only store the Payment Type Code in Transactions to save space in a relational model.

Table 18-2 shows a denormalized version of the Transactions table. There are many rows with duplicated values of Credit Card and Cash in the Payment Type Description column.

Table 18-2 Transactions table with Payment Type denormalized in the Code and Description columns

Date         Amount   Payment Type Code   Payment Type Description
2015-06-21   100      00                  Cash
2015-06-21   100      02                  Credit Card
2015-06-22   200      02                  Credit Card
2015-06-23   200      00                  Cash
2015-06-23   100      03                  Wire Transfer
2015-06-24   200      02                  Credit Card
2015-06-25   100      00                  Cash

By using a separate table containing all the payment types, it is possible to only store the Payment Type Code in the Transactions table, as shown in Table 18-3.

Table 18-3 Transactions table normalized, with Payment Type Code only

Date         Amount   Payment Type Code
2015-06-21   100      00
2015-06-21   100      02
2015-06-22   200      02
2015-06-23   200      00
2015-06-23   100      03
2015-06-24   200      02
2015-06-25   100      00

By storing the description of payment types in a separate table (see Table 18-4), there is only one row for each payment type code and description. That table in a relational database reduces the total amount of space required, by avoiding the duplication of a long string in the Transactions table.

Table 18-4 Payment Type table that normalizes Code and Description

Payment Type Code   Payment Type Description
00                  Cash
01                  Debit Card
02                  Credit Card
03                  Wire Transfer

However, this optimization, which works perfectly fine for a relational database, might be a bad choice in a data model for DAX. The VertiPaq engine automatically creates a dictionary for each column, which means that the Transactions table will not pay a cost for duplicated descriptions as would be the case in a relational model.

Note

Compression techniques based on dictionaries are also available in certain relational databases. For example, Microsoft SQL Server offers this feature through the clustered columnstore indexes. However, the default behavior of a relational database is to store data without using a dictionary-based compression.

In terms of space saving, denormalizing a single column from a separate table is generally convenient; on the other hand, the denormalization of many columns in a single table, as is the case for the attributes of a Product, might be more expensive than using a normalized model. For example, we can compare the memory cost between a normalized and a denormalized model:

  • Memory cost for normalized model:

    • Column Transactions[Type Code]

    • Column Payments[Type Code]

    • Column Payments[Type Description]

    • Relationship Transactions[Type Code] → Payments[Type Code]

  • Memory cost for denormalized model:

    • Column Transactions[Type Code]

    • Column Transactions[Type Description]

The denormalized model removes the cost of the Payments[Type Code] column and the cost of the relationship on Transactions[Type Code]. However, the cost of the Type Description column is different between the Transactions and Payments tables, and in a very large table, the difference might be in favor of the normalized model. Nevertheless, the aggregation of a column usually performs better when a filter is applied to another column of the same table, rather than to a column in another table connected through a relationship. Does this justify a complete denormalization of the data model into a single table? Absolutely not! In terms of usability, the star schema should always be the preferred choice because it is a good trade-off between resource usage and performance.

A star schema contains a table for each business entity such as Customer and Product, and all the attributes related to an entity are completely denormalized in such tables. For example, the Product table should have attributes such as Category, Subcategory, Model, and Color. This model works well whenever the cardinality of the relationship is not too large. As mentioned before, 1 million unique values is the threshold to define a large cardinality for a relationship, although 100,000 unique values already classifies a relationship as a potential risk for the performance of the queries.

In order to understand why the cardinality of a relationship is important for performance, it is useful to know what happens when a filter is applied to a column. Consider the schema in Figure 18-5, where there are relationships between the Sales table and Product, Customer, and Date. When querying the data model filtering customers by gender, the engine transfers the filter from Customer to Sales by specifying the list of customer keys that belong to each gender type included in the query. If there are 10,000 customers, any list generated by a filter cannot be larger than this number. However, if there are 6 million customers, a filter by a single gender type might generate a list of around 3 million unique keys. A large number of keys involved in a relationship always has an impact on performance, even though in absolute terms that impact also depends on the version of the engine and on the hardware being used (CPU clock, cache size, RAM speed).

Figure 18-5 The Sales table has relationships with the Product, Customer, and Date tables.

What can be done to optimize the data model when a relationship involves millions of unique values? If the measured performance degradation is not compatible with the query latency requirements, one might consider other forms of denormalization that reduce the cardinality of the relationship or that entirely remove the need for a relationship in certain queries. In the previous example, one might consider denormalizing the Gender column into the Sales table, in case it is the only column that needs optimizing. If there are more columns to optimize, consider creating another table with the columns of the Customer table that users query often and that have a low cardinality (and a low selectivity).

For instance, consider a table called Customer Info with Gender, Occupation, and Education columns. If the cardinality of these columns is 2, 5, and 5 values, respectively, a table with all the possible combinations has 50 rows (2 × 5 × 5). A query on any of these columns will be much faster because the filter applied to Sales will have a very short list of values. In terms of usability, the user will see two groups of attributes for the same entity, corresponding to the two tables, Customer and Customer Info. This is not an ideal situation. For this reason, this optimization should only be considered when strictly necessary, unless the same result can be obtained by using the Aggregations feature in the Tabular model.
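
As a reference, the following calculated table is a minimal sketch of how such a Customer Info table could be built in DAX. The column names come from the example above, while the CustomerInfoKey generated with RANKX is only illustrative; in a real model the key should preferably be a native column prepared in the data source.

Customer Info =
VAR Combinations =
    -- Existing combinations in Customer; a CROSSJOIN of the DISTINCT values
    -- would return all the possible combinations instead
    SUMMARIZE (
        Customer,
        Customer[Gender],
        Customer[Occupation],
        Customer[Education]
    )
RETURN
    ADDCOLUMNS (
        Combinations,
        -- Illustrative surrogate key: a dense ranking over the concatenated attributes
        "CustomerInfoKey",
            RANKX (
                Combinations,
                Customer[Gender] & "|" & Customer[Occupation] & "|" & Customer[Education],
                ,
                ASC,
                DENSE
            )
    )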

Important

The Aggregations feature is discussed later in this chapter. It is a feature that automates the creation of the underlying tables and relationships whose only purpose is to optimize the performance of the storage engine requests. As of April 2019, the Aggregations feature only works for tables stored in DirectQuery and cannot replace the techniques described in this section. This will be possible when the Aggregations also work for tables stored in VertiPaq.

It is important that both tables have a direct relationship with the Sales table, as shown in Figure 18-6.

Figure 18-6 Both the Customer and Customer Info tables have a relationship with Sales.

The CustomerInfoKey column should be added to the Sales table before any data is imported into it so that it is a native column. As discussed in Chapter 17, native columns are better compressed than calculated columns. However, a calculated column could also be created with the following DAX expression:

Sales[CustomerInfoKey] =
LOOKUPVALUE (
    'Customer Info'[CustomerInfoKey],
    'Customer Info'[Gender], RELATED ( Customer[Gender] ),
    'Customer Info'[Occupation], RELATED ( Customer[Occupation] ),
    'Customer Info'[Education], RELATED ( Customer[Education] )
)

From a user experience perspective, the columns that are denormalized in the Customer Info table should be hidden from the Customer table. Showing the same attributes (Gender, Occupation, and Education) in two tables would generate confusion. However, by hiding these attributes from the Customer table, it is not possible to create a report with the list of customers with a certain Occupation without looking at the transactions in the Sales table. In order to avoid losing such a feature, the model should be enhanced by including an inactive relationship, which can be activated when needed. We need specific measures to activate that relationship, as we will see later in the optimized Sales Amount measure. Figure 18-7 shows that there is an active relationship between the Customer Info table and the Sales table, and an inactive relationship between the Customer Info table and the Customer table.

Figure 18-7 An inactive relationship connects the Customer and Customer Info tables.

The relationship between Customer Info and Customer can be activated whenever there is any other filter active in the Customer table. For example, consider the following definition of the Sales Amount measure:

Sales Amount :=
IF (
    ISCROSSFILTERED ( Customer[CustomerKey] ),
    CALCULATE (
        [Sales Internal],
        USERELATIONSHIP ( Customer[CustomerInfoKey], 'Customer Info'[CustomerInfoKey] ),
        CROSSFILTER ( Sales[CustomerInfoKey], 'Customer Info'[CustomerInfoKey], NONE )
    ),
    [Sales Internal]
)
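
The Sales Internal measure referenced above is assumed to be a hidden base measure computing the sales amount; its exact definition depends on the model. A minimal sketch, assuming Quantity and Net Price columns in Sales, could be the following:

Sales Internal :=
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )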

The ISCROSSFILTERED condition over Customer[CustomerKey] is true whenever there is a filter on any column of the Customer table, unless the relationship between Sales and Customer is bidirectional. When that condition is true, the relationship between Customer and Customer Info is enabled by using USERELATIONSHIP, automatically deactivating the other relationship between Customer Info and Sales. Strictly speaking, the CROSSFILTER function is not necessary, but it is a good idea to keep it there; it highlights the intention to disable the filter propagation in the relationship between Customer Info and Sales. The idea is that, since the engine must process a list of CustomerKey values in any case, it is better to reduce such a filter by also including the attributes moved into Customer Info. On the other hand, when the user filters columns in Customer Info and not in Customer, the default active relationship between Customer Info and Sales is used, which involves a lower number of unique values. Unfortunately, in order to optimize the use of the Customer-Sales relationship in a data model, this DAX pattern must be applied to all the measures that might involve Customer Info attributes. This is not necessary when using Aggregations in the data model, because the pattern is implemented automatically by the engine without requiring any change to the DAX code.

Another very common scenario where a high cardinality in a relationship should be denormalized is that of a relationship between two large tables. For example, consider the Sales Header and Sales Detail tables in the data model in Figure 18-8.

Figure 18-8 The Customer table filters Sales Detail transactions through relationships with Sales Header.

This situation is common because many normalized relational databases use this very design. However, the relationship between Sales Header and Sales Detail is particularly dangerous for a DAX query because of its high number of unique values. Any query aggregating the Quantity column (from Sales Detail) by Customer[Gender] transfers a filter from Sales Header to Sales Detail through the SalesOrderNumber column. A better design is possible by denormalizing in Sales Detail all the relationships stored in Sales Header. In practice, there should be two star schemas sharing the same dimensions. The only purpose of the denormalization is to avoid passing a filter through the relationship between Sales Header and Sales Detail, which no longer exists in the new design shown in Figure 18-9.

Figure 18-9 There are direct relationships between the Sales Header and Sales Detail tables, and Customer and Calendar.
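
The denormalization of the keys from Sales Header into Sales Detail is best performed in the data source, so that they become native columns. Purely as an illustration, a calculated column could obtain the same value through the existing relationship; the CustomerKey column name used here is an assumption:

'Sales Detail'[CustomerKey] =
RELATED ( 'Sales Header'[CustomerKey] )  -- hypothetical key copied from Sales Header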

Use the right degree of denormalization in a data model for DAX, especially for performance reasons. The best practices described in this section provide a good balance between usability and performance.

Columns cardinality

The cardinality of a column is the number of unique values that the column contains. This number is important to reduce the size of the column, which has a direct impact on VertiPaq scan performance. Another reason to reduce the cardinality of a column to a necessary minimum is that many DAX operations, such as iterations and filters, have an execution time that directly depends on this number. Often, the cardinality of a column is more important than the number of rows of the table containing the column.

The data model designer should identify the cardinality of a column and consider possible optimizations if the column is to be used in relationships, filters, or calculations. There are several common scenarios to consider:

  • Key of a relationship: The cardinality of the column cannot be changed unless the cardinality of the related table is changed, too. See the “Denormalization” section, earlier in this chapter.

  • Numeric value aggregated in a measure: Do not change the precision of a number if that number represents a quantity or the amount of a monetary transaction. However, if a number represents a measure with a floating-point value, one might consider removing the decimals that are not relevant. For example, when collecting temperatures, the value could be rounded to the closest decimal digit; the removed part is probably below the precision of the measuring tool.

  • Low cardinality text description: The only impact is on dictionary size in case the column has many unique values. There are no advantages in moving the column into a separate table because the dictionary would be the same. Keep this column if users need it.

  • High cardinality text notes: Potentially different for every row of the table, but it is not a big issue if most of the rows have a blank value.

  • Pictures: This column is required to display graphics in a client tool—for example, a picture of a product. This data type is not available in Power BI; storing the URL of an image that is loaded dynamically is a better alternative that saves memory.

  • Transaction ID: This column has a high cardinality in a large table. Consider removing it if it is not necessary in DAX queries. If used in drill-through operations—for example, to see the transactions that form a particular aggregation—consider splitting the number/string into two or more parts, each with a smaller number of unique values.

  • Date and time: Consider splitting the column into two parts. More on this in the following section in this chapter, “Handling date and time.”

  • Audit columns: A table in a relational database often has standard columns used for auditing purposes—for instance, timestamp and user of last update. These columns should not be imported in a model stored by VertiPaq, unless required for drill-through. In that case, consider splitting the timestamp following the same rules applied to date and time.

As a rule of thumb, consider that reducing the cardinality of a column saves memory and improves performance. Because reducing cardinality might imply losing information and/or accuracy, be careful in considering the implications of these optimizations.

Handling date and time

Almost any data model has one or more date columns. Every so often, the time is also an interesting dimension of analysis. Usually, these columns come from original Datetime columns in the data source. There are several best practices to optimize these types of columns.

First and foremost, date and time should be always split into two separate columns, without using calculated columns to do so. The split should take place by reading the original column in two different columns of the data model: one for the date, the other for the time. For example, reading a TransactionExecution column from a table in SQL Server, one should use the following syntax in a T-SQL query to create two columns, TransactionDate and TransactionTime:

 ...
CAST ( TransactionExecution AS DATE ) AS TransactionDate,
CAST ( TransactionExecution AS TIME ) AS TransactionTime,
...

It is very important to do this split operation; otherwise, the model would have a column in which dictionary and cardinality would increase every day. Moreover, analyzing a timestamp in Tabular is very hard. A Date table needs an exact match with the date, and the Datetime column would not work correctly in a relationship with the Date column of a Date table.

A Date column usually has a good granularity: 10 years correspond to less than 3,700 unique values, and even 100 years still fall within a manageable order of magnitude. Moreover, time intelligence functions require a complete calendar for each year considered, so removing days (for example, keeping only one day per month) is not an optimization to consider.

The Time column, on the other hand, should be subject to more considerations. With a Time column, one should consider creating a Time table, which contains one row for each point in the chosen granularity. The time should be rounded to the same granularity as the one chosen for the Time table. The Time table will make it easy to consider different time periods: for example, morning and evening, or 15-minute intervals. Depending on the data and the analysis required, the time could be rounded down to the closest hour or millisecond—even though the latter is very unlikely. Table 18-5 shows the different cardinality corresponding to different precision levels.

Table 18-5 Cardinality corresponding to different precision levels for a Time column

Precision     Cardinality
Hour          24
15 Minutes    96
5 Minutes     288
Minute        1,440
Second        86,400
Millisecond   86,400,000

Choosing the millisecond precision is usually the worst choice, and a precision down to the second still has a relatively high number of unique values. Most of the time, the precision chosen will be in a range between hours and minutes. At this point, one might think that the minute precision is a safe choice because it has a relatively low cardinality. However, remember that the compression of a column depends on the presence of duplicated values in contiguous rows. Thus, moving from a minute to a 15-minute precision can have a big impact on the compression of large tables.
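
As an example of the Time table mentioned earlier, the following calculated table is a minimal sketch at the minute granularity, with grouping attributes for the hour and for 15-minute intervals; the table and column names are arbitrary:

Time =
ADDCOLUMNS (
    GENERATESERIES ( 0, 1439, 1 ),  -- one row per minute of the day
    "Time", TIME ( INT ( [Value] / 60 ), MOD ( [Value], 60 ), 0 ),
    "Hour", INT ( [Value] / 60 ),
    "Minute", MOD ( [Value], 60 ),
    "Period 15 Min",
        FORMAT (
            TIME ( INT ( [Value] / 60 ), MOD ( [Value], 60 ) - MOD ( [Value], 15 ), 0 ),
            "HH:mm"
        )
)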

The choice between rounding to the closest second/minute or truncating the detail not needed for the analysis depends on analytical requirements. Here is an example of the T-SQL code that truncates a time to different precision levels:

-- Truncate to the second
DATEADD (
    MILLISECOND,
    - DATEPART ( MILLISECOND, CAST ( TransactionExecution AS TIME(3) ) ),
    CAST ( TransactionExecution AS TIME(3) )
)
-- Truncate to the minute
DATEADD (
    SECOND,
    - DATEPART (SECOND, CAST ( TransactionExecution AS TIME(0) ) ),
    CAST ( TransactionExecution AS TIME(0) )
)

-- Truncate to 5 minutes
--   change 5 to 15 to truncate to 15 minutes
--   change 5 to 60 to truncate to the hour
CAST (
    DATEADD (
        MINUTE,
        ( DATEDIFF (
              MINUTE,
              0,
              DATEADD (
                  SECOND,
                  - DATEPART ( SECOND, CAST ( TransactionExecution AS TIME(0) ) ),
                  CAST ( TransactionExecution AS TIME(0) )
              )
          ) / 5 ) * 5,
        0
    ) AS TIME(0)
)

The following T-SQL code shows examples for rounding time instead of truncating it:

-- Round to the second
CAST ( TransactionExecution AS TIME(0) )

-- Round to the minute
CAST ( DATEADD (
    MINUTE,
    DATEDIFF (
        MINUTE,
        0,
        DATEADD ( SECOND, 30, CAST ( TransactionExecution AS TIME(0) ) )
    ),
    0
) AS TIME ( 0 ) )

-- Round to 5 minutes
--   change 5 to 15 to round to 15 minutes
--   change 5 to 60 to round to the hour
CAST ( DATEADD (
    MINUTE,
    ( DATEDIFF (
        MINUTE,
        0,
        DATEADD ( SECOND, 5 * 30, CAST ( TransactionExecution AS TIME(0) ) )
      ) / 5 ) * 5,
    0
) AS TIME ( 0 ) )

Similar transformations can be applied in Power Query when importing data, though for tables with millions of rows a transformation made in the original data source may provide better performance.

When storing millions of new rows every day in a single table, these details can make a big difference in memory usage and performance. At the same time, do not spend too much time optimizing a data model that does not require a very high level of compression; after all, reducing the precision means removing some information that will no longer be available for deeper insights if needed.

Calculated columns

A calculated column stores the result of a DAX expression evaluated row-by-row during a table refresh. For this reason, calculated columns might be considered as a possible way to optimize query execution time. However, a calculated column has hidden costs, and it is only a good optimization technique under specific conditions.

Calculated columns should be considered as viable options only in these two situations:

  • Group or filter data: If a calculated column returns a value used to group or filter data, there is no alternative other than creating the same value before importing data into the data model. For example, the price of a product might be classified into Low, Medium, and High categories, as shown in the sketch after this list. This value is usually a string, especially when the user makes it available as a selection.

  • Precalculate complex formulas: A calculated column can store the result of a complex calculation that is not sensitive to filters made at query time. However, it is very hard to establish when this produces a real computational advantage, and it is necessary to measure the presence of a real advantage at query time in order to justify its use.
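
As a minimal sketch of the first scenario, a calculated column can assign a price category to each product. The Unit Price column appears in examples later in this chapter, whereas the column name and the threshold values are arbitrary assumptions:

Product[Price Category] =
SWITCH (
    TRUE (),
    Product[Unit Price] <= 10, "Low",     -- thresholds are illustrative only
    Product[Unit Price] <= 100, "Medium",
    "High"
)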

Do not make the wrong assumption that any calculated column is faster than doing the same computation at query time. This is often inaccurate. Other times, the advantage is barely measurable and does not balance out the cost of the calculated column. There should be a relevant performance improvement at query time to justify a calculated column for optimization reasons. There are also many factors to consider when evaluating the cost/benefit ratio of a calculated column against an equivalent calculation made at runtime in a measure.

A calculated column is not as optimized as a native column. It might have a lower compression rate compared to native columns of the table because it does not take part in the heuristic that VertiPaq executes to find the optimal sort order of the data in each segment. Only a column storing a very low number of unique values might benefit from a good compression, but this is usually the result of logical conditions and not of numeric expressions.

For example, consider the case of a simple calculated column:

Sales[Amount] = Sales[Quantity] * Sales[Price]

If there are 100 unique values in Quantity and 1,000 unique values in Price, the resulting Amount column might have a cardinality between 1 and 100,000 unique values, depending on the actual values in the columns and on their distribution across table rows. Usually, the larger the number of rows in the table, the higher the number of unique values found in the Amount column—because of statistical distribution. With a dictionary that is one or two orders of magnitude larger than the original columns, the compression is usually worse. What about query performance? It depends, and it should be measured case-by-case in order to get a correct answer, considering the two possible calculations: one based on a calculated column and the other completely dynamic and based on measures.

A simple measure can sum the Amount calculated column:

TotalAmountCC := SUM ( Sales[Amount] )

The alternative dynamic implementation transfers the expressions of the calculated column in an iterator over the table:

TotalAmountM := SUMX ( Sales, Sales[Quantity] * Sales[Price] )

Is the cost of scanning the single Sales[Amount] column smaller than scanning the two original Sales[Quantity] and Sales[Price] columns? It is impossible to estimate this in advance, so it must be measured. Usually, the difference between these two options is only visible in very large tables. In small tables the performance might be very close, so the calculated column is not worth its memory footprint.

Most of the time, calculated columns used to compute aggregated values can be replaced by using the same expressions in iterators such as SUMX and AVERAGEX. In the previous example, TotalAmountM is a measure that dynamically executes the same expression defined in the calculated Amount column, used by the simple aggregation in TotalAmountCC.

A different evaluation is necessary when a context transition is present in an iterator. For example, consider the following DAX measure in a model where the Sales Header and Sales Detail tables are connected through a relationship:

AverageOrder :=
AVERAGEX (
    'Sales Header',
    CALCULATE (
        SUMX (
            'Sales Detail',
            'Sales Detail'[Quantity] * 'Sales Detail'[Unit Price]
        ),
        ALLEXCEPT ( 'Sales Detail', 'Sales Header' )
    )
)

In this case, the context transition within the loop can be very expensive, especially if the Sales Header table contains millions of rows or more. Storing the value in a calculated column will probably save a lot of execution time.

'Sales Header'[Amount] =
CALCULATE (
    SUMX (
        'Sales Detail',
        'Sales Detail'[Quantity] * 'Sales Detail'[Unit Price]
    )
)

AverageOrder :=
AVERAGEX (
    'Sales Header',
    'Sales Header'[Amount]
)

We will never grow tired of repeating that these examples are guidelines. One should measure the performance improvements of a calculated column and its related memory cost in order to decide whether to use it or not.

Consider that a calculated column can be avoided by creating the same value as a native column in the data source when populating the table—for example, using an SQL statement or a Power Query transformation. A useful calculated column should leverage the VertiPaq engine, providing a faster and more flexible way to compute a column than reading the entire table again from the data source. Usually, this happens when the calculated column expression aggregates rows from tables other than the one it belongs to; the previous Amount calculated column in the Sales Header table is an example of such a condition.

Finally, a calculated column increases the time required to refresh a data model, especially because its evaluation cannot scale over multiple threads, as explained in more detail in a following section, “Processing of calculated columns.”

At this point, it should be clear that calculated columns are expensive for two reasons:

  • Memory: The values are persisted using a nonoptimal compression.

  • Duration of Refresh: Processing calculated columns is a sequential operation using a single thread, which does not scale even on large servers.

With that said, calculated columns prove useful in many scenarios. We do not want to give the impression that calculated columns should always be avoided. Instead, be aware of their cost and make an educated decision on whether to use them or not. In the next section we describe a good example where calculated columns really shine in improving performance.

Optimizing complex filters with Boolean calculated columns

It is worth mentioning a specific case where optimization is achieved using calculated columns. A logical expression used to filter a high-cardinality column can be consolidated using a calculated column that stores the result of the logical expression itself.

For example, consider the following measure:

ExpensiveTransactions :=
COUNTROWS (
    FILTER (
        Sales,
        VAR UnitPrice =
            IF (
                Sales[Unit Discount] > 0,
                RELATED ( 'Product'[Unit Price] ),
                Sales[Net Price]
            )
        VAR IsLargeTransaction = UnitPrice * Sales[Quantity] > 100
        VAR IsLargePrice = UnitPrice > 70
        VAR IsExpensive = IsLargeTransaction || IsLargePrice
        RETURN
            IsExpensive
    )
)

In case there are millions of rows in the Sales table, the filter iteration could be expensive. If the expression used in the filter does not depend on the existing filter context, as in this case, the result of the expression can be consolidated in a calculated column, applying a filter on that column in a CALCULATE statement instead. For example, the previous operation can be rewritten this way:

Sales[IsExpensive] =
VAR UnitPrice =
    IF (
        Sales[Unit Discount] > 0,
        RELATED ( 'Product'[Unit Price] ),
        Sales[Net Price]
    )
VAR IsLargeTransaction = UnitPrice * Sales[Quantity] > 100
VAR IsLargePrice = UnitPrice > 70
VAR IsExpensive = IsLargeTransaction || IsLargePrice
RETURN
    IsExpensive

ExpensiveTransactions :=
CALCULATE (
    COUNTROWS ( Sales ),
    Sales[IsExpensive] = TRUE
)

The calculated column containing a logical value (TRUE or FALSE) usually benefits from good compression and a low memory cost. It is also very effective at execution time because it applies a direct filter to the scan of the Sales table required to count the rows. In this case, the benefit at query time is usually evident. Just consider if it is worth the longer processing time for the column; that processing time must be measured before making a final decision.

Processing of calculated columns

The presence of one or more calculated columns slows down the refresh of any part of a table that is somewhat related to the calculated column. This section describes the reasons for that; it also provides background information on why an incremental refresh operation can be very expensive because of the presence of calculated columns.

Any refresh operation of a table requires recomputing all the calculated columns in the entire data model referencing any column of that table. For example, refreshing a partition of a table—as during any incremental refresh—requires a complete update of all the calculated columns stored in the table. Such a calculation is performed for all the rows of the table, even though the refresh only affects a single partition of the table. It does not matter whether the expression of the calculated column only depends on other columns of the same table; the calculated column is always computed for the entire table and not for a single partition.

Moreover, the expression of a calculated column might depend on the content of other tables. In this case, the calculated columns referencing a partially refreshed table must also be recalculated to guarantee the consistency of the data model. The cost for computing a calculated column usually depends on the number of rows of the table where the column is stored.

Processing a calculated column is a single-threaded job that iterates over all the rows of the table to compute the column expression. If there are several calculated columns, they are evaluated one at a time, making the entire operation a processing bottleneck for large tables. For these reasons, creating a calculated column in a large table with hundreds of millions of rows is not a good idea. Creating tens of calculated columns in a large table can result in a very long processing time, adding minutes to the time required to process the native data.

Choosing the right columns to store

The previous section about calculated columns explained that storing a column that can be computed row-by-row using other columns of the same table is not always an advantage. The same consideration is also valid for native columns of the table. When choosing the columns to store in a table, consider the memory size and the query performance. Good optimizations of resource allocation (and memory in particular) are possible by doing the right evaluation in this area.

We consider the following types of columns in a table:

  • Primary or alternate keys: The column contains a unique value for each row of the table.

  • Qualitative attributes: The column can be text or number, used to group and/or filter rows in a table; for instance, name, color, city, country.

  • Quantitative attributes: The column contains a number used both as a filter (for example, less than a certain value) and as an argument in a calculation, such as price, amount, or quantity.

  • Descriptive attributes: The column contains text providing additional information about a row, but its content is never used to filter or to aggregate rows—for example, notes, comments.

  • Technical attributes: Information recorded in the database for technical reasons, without a business value, such as username of last update, timestamp, GUID for replication.

The general principle is to try to minimize the cardinality of the columns imported into a table, not importing columns that have a high cardinality and that are not relevant for the analysis. However, every type of column deserves additional considerations.

The columns for primary or alternate keys are necessary if there are one or more one-to-many relationships with other tables. For instance, the product code and the product key columns of a table of products are certainly required columns. However, a table should not include a primary or alternate key column not used in a relationship with other tables. For example, the Sales table might have a unique identifier for each row in the original table. Such a column has a cardinality that corresponds to the number of rows of the Sales table. Moreover, a unique identifier is not necessary for relationships because no tables target Sales for a relationship. For these reasons, it is a very expensive column in terms of memory, and it should not be imported in memory. In a composite data model, a similar high-granularity column could be accessed only through DirectQuery without being stored in memory, as described later in the “Optimizing column storage” section of this chapter.

A table should always include qualitative attributes that have a low cardinality because they have a good compression and might be useful for the analysis. For example, the product category is a column that has a low cardinality, related to the Product table. In case there is a high cardinality, we should consider carefully whether to import the column or not because its storage memory cost can be high. The high selectivity might justify the cost, but we should check that filters in queries usually select a low number of values in that column. For instance, the production lot number might be a piece of information included in the Sales table that users want to filter at query time. Its high cost might be justified by a business need to apply this filter in certain queries.

All the quantitative attributes are generally imported to guarantee any calculation, although we might consider skipping columns providing redundant information. Consider the Quantity, Price, and Amount columns of a Sales table, where the Amount column contains the product of Quantity and Price. We probably want to create measures that aggregate each of these columns; yet we will probably compute the average price as a weighted average, dividing the sum of Amount by the sum of Quantity, rather than as a simple average of the price over individual transactions. This is an example of the measures we want to define:

Sum of Quantity := SUM ( Sales[Quantity] )

Sum of Amount   := SUM ( Sales[Amount] )

Average Price   := DIVIDE ( [Sum of Amount], [Sum of Quantity] )

By looking at these measures, we might say that we only need to import Quantity and Amount in the data model, without importing the Price column, which is not used by these measures. However, if we consider the cardinality of the columns, we start to have doubts. If there are 100 unique values in the Quantity column and 10,000 unique values in the Price column, we might have up to 1,000,000 unique values in the Amount column. At this point, we might consider importing only the Quantity and Price columns, using the following definition of the measures in the data model; only Sum of Amount changes, whereas the other two measures stay the same:

Sum of Quantity := SUM ( Sales[Quantity] )

Sum of Amount   := SUMX ( Sales, Sales[Quantity] * Sales[Price] )

Average Price   := DIVIDE ( [Sum of Amount], [Sum of Quantity] )

The new definition of the Sum of Amount measure might be slower because it has to scan two columns instead of one. However, these columns might be smaller than the original Amount. Trying to predict the faster option is very hard because we should also consider the distribution of the values in the table, and not only the cardinality of the column. We suggest measuring the memory used and the performance in both scenarios before making a final decision. Based on our experience, removing the Amount column in a small data model can be more important for Power BI and Power Pivot. Indeed, the available memory in personal computers is usually more limited than that of a server, and a smaller memory footprint also produces a faster loading time when opening the smaller file. At any rate, in a large table with billions of rows stored in an Analysis Services Tabular model, the performance penalty of the multiplication between two columns (Quantity and Price) could be larger than the increased memory scan time for the Amount column. In this case, the better response time for the queries justifies the higher memory cost to store the Amount column. Regardless, we should measure size and performance in each specific case because the distribution of data plays a key role in compression and affects any decision pertaining to it.
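
Before building and comparing the two versions of the model, a quick DAX query can verify the actual cardinality of the three columns involved; the following sketch assumes the column names used in this example:

EVALUATE
ROW (
    "Quantity Cardinality", DISTINCTCOUNT ( Sales[Quantity] ),
    "Price Cardinality", DISTINCTCOUNT ( Sales[Price] ),
    "Amount Cardinality", DISTINCTCOUNT ( Sales[Amount] )
)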

Note

Storing Quantity and Price instead of Amount is an advantage if the table is stored in VertiPaq, whereas it is not the suggested best practice for DirectQuery models. Moreover, if the table in VertiPaq contains billions of rows in memory, the Amount column can provide better query performance and it is compatible with future Aggregations over VertiPaq. More details in the section “Managing VertiPaq Aggregations” later in this chapter.

We should consider whether to import descriptive attributes or not. In general, they have a high storage cost for the dictionary of the column when imported in memory. A few examples of descriptive attributes are the Notes field in an invoice and the Description column in the Product table. Usually, these attributes are mainly used to provide additional information about a specific entity. Users hardly use this type of column to group or filter data; the typical use case is to get detailed drill-through information. The only issue with including these columns in the data model is their memory storage cost, mainly related to the column dictionary. If the column has many blank values and a low number of unique nonblank values in the table, then its dictionary will be small and the column cost will be more acceptable. Nevertheless, a column containing the transcription of conversations made in a call center is probably too expensive for a Service Calls table containing date, time, duration, and operator who managed the call. When the cost of storing descriptive attributes in memory is too expensive, we can consider only accessing them through DirectQuery in a composite data model.

A particular type of descriptive attribute is the information provided as detail for transactions in a drill-through operation. For example, the invoice number or the order number of a transaction is an attribute that has a high cardinality, but that could be important for some reports. In this case, we should consider the particular optimizations for drill-through attributes described in the next section, “Optimizing column storage.”

Most of the time, there is no reason to import columns for technical attributes, such as timestamp, date, time, and operator of the last update. This information is mainly for auditing and forensic requirements. Unless we have a data model specifically built for auditing requirements, the need for this information is usually low in an analytical solution. However, technical attributes are good candidates for columns accessed only through DirectQuery in a composite data model.

Optimizing column storage

The best optimization for a column is to remove the column from a table entirely. In the previous section, we described when this decision makes sense based on the type of columns in a table. Once we define the set of columns that are part of the data model, we can still use optimization techniques in order to reduce the amount of memory used, even though each optimization comes with side effects. In case the composite data model feature is available, an additional option is that of keeping a column in the data source, only making it accessible through DirectQuery.

Using column split optimization

The memory footprint of a column can be lowered by reducing the column cardinality. In certain conditions, we can achieve this result by splitting the column into two or more parts. The column split cannot be obtained with calculated columns because that would require storing the original column in memory. We show examples of the split operation in SQL, but any other transformation tool (such as Power Query) can obtain the same result.

For instance, if there is a 10-character string (such as the values in TransactionID), we can split the column in two parts, five characters each (as in TransactionID_High and TransactionID_Low):

SELECT
    LEFT ( TransactionID, 5 ) AS TransactionID_High,
    SUBSTRING ( TransactionID, 6, LEN ( TransactionID ) - 5 ) AS TransactionID_Low,
    ...

In case of an integer value, we can use division and modulo for a number that creates an even distribution between the two columns. If there is an integer TransactionID column with numbers between 0 and 100 million, we can divide them by 10,000 as in the following example:

SELECT
    TransactionID / 10000 AS TransactionID_High,
    TransactionID % 10000 AS TransactionID_Low,
    ...

We can use a similar technique for decimal numbers. An easy split is separating the integer from the decimal part, although this might not produce an even distribution. For example, we can transform a UnitPrice decimal number column into UnitPrice_Integer and UnitPrice_Decimal columns:

SELECT
    FLOOR ( UnitPrice ) AS UnitPrice_Integer,
    UnitPrice - FLOOR ( UnitPrice ) AS UnitPrice_Decimal,
    ...

We can use the result of a column split as-is in simple detail reports, or in measures that restore the original value during the calculation. If available in the client tool, the Detail Rows feature allows us to control the drill-through operation, showing the original column to the client and hiding the presence of the two split columns.
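
For example, a Detail Rows expression could recombine the two parts of the split columns while hiding them from the user. The following sketch assumes the split columns shown above, restoring both the TransactionID string and the UnitPrice number:

SELECTCOLUMNS (
    Sales,
    "TransactionID", Sales[TransactionID_High] & Sales[TransactionID_Low],  -- original string restored
    "Quantity", Sales[Quantity],
    "Unit Price", Sales[UnitPrice_Integer] + Sales[UnitPrice_Decimal]       -- original number restored
)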

Important

The column split can optimize numbers aggregated in measures, using the separation between integer and decimal parts as in the previous example or similar techniques. However, consider that the aggregation operation will have to scan more than one column, and the total time of the operation is usually larger than with a single column. When optimizing for performance, saving memory might not be effective in this case, unless the dictionary is removed by enforcing value encoding instead of hash encoding for a currency or integer data type. A specific measurement is always required for a data model to validate whether such optimization also works from a performance point of view.

Optimizing high-cardinality columns

A column with a high cardinality has a high cost because of a large dictionary, a large hierarchy structure, and a lower compression in encoding. The attribute hierarchy structure can be expensive and may be disabled under certain conditions. We describe how to disable attribute hierarchies in the next section.

If it is not possible to disable the hierarchy, or if this reduction is not enough for memory optimization, then consider the column split optimization for a high-cardinality column used in a measure. We can hide this optimization from the user by hiding the split columns and by adapting the calculation in measures. For example, if we optimize UnitPrice using the column split, we can create the Sum of Amount measure this way:

Sum of Amount :=
SUMX (
    Sales,
    Sales[Quantity] * ( Sales[UnitPrice_Integer] + Sales[UnitPrice_Decimal] )
)

Remember that the calculation will be more expensive, and only an accurate measurement of the performance of the two models (with and without column split optimization) can establish which one is better for a specific data model.
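
Along the same lines, if a high-cardinality column such as TransactionID is only needed to count unique transactions, a hypothetical measure can group by the two split columns instead of referencing the original column. The following is only a sketch, assuming the split columns shown earlier; it also requires careful benchmarking, because iterating two columns is usually slower than a DISTINCTCOUNT over a single column:

Transactions :=
COUNTROWS (
    SUMMARIZE ( Sales, Sales[TransactionID_High], Sales[TransactionID_Low] )
)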

Disabling attribute hierarchies

The attribute hierarchy structure is required by MDX queries that reference the column as an MDX attribute hierarchy. This structure contains a sorted list of all the values of the column, and its creation might require a large amount of time during a refresh operation, including incremental ones. The size of this structure is measured in the Columns Hierarchies Size column of VertiPaq Analyzer. If a column is only used by measures and in drill-through results, and it is not shown to the user as an attribute to filter or group data, then the attribute hierarchy structure is not necessary because it is never used.

The Available In MDX property of a column disables the creation of the attribute hierarchy structure when set to False. By default, this property is True. The name of this property in TMSL and TOM is isAvailableInMdx. Depending on the development tool and on the compatibility level of the data model, this property might not be available. A tool that exposes this property is Tabular Editor: https://github.com/otykier/TabularEditor/releases/latest.

The attribute hierarchy structure is also used in DAX to optimize sorting and filter operations. It is safe to set isAvailableInMdx to False when a column is only used in measure expressions, is not visible, and is never used to filter or sort data. This property is also documented at https://docs.microsoft.com/en-us/dotnet/api/microsoft.analysisservices.tabular.column.isavailableinmdx.

Optimizing drill-through attributes

If a column contains data used only for drill-through operations, there are two possible optimizations. The first is the column split optimization; the second is keeping the columns accessible only through DirectQuery in a composite data model.

When the column is not being used in measures, there are no concerns about possible costs of the materialization of the original values. By leveraging the Detail Rows feature, it is possible to show the original column in the result of a drill-through operation, hiding the presence of the two split columns. However, it is not possible to use the original value as a filter or group-by column.

In a composite data model, the entire table can be made accessible through a DirectQuery request, whereas the columns used by relationships and measures can be included in an in-memory aggregation managed by the VertiPaq engine. This way, it is possible to get the best performance when aggregating data, whereas the query execution time will be longer when the drill-through attributes are requested from the data source via DirectQuery. The next section, "Managing VertiPaq Aggregations," provides more details about that feature.

Managing VertiPaq Aggregations

The VertiPaq storage engine can be used to manage aggregations over DirectQuery data sources and, in the future, possibly also over large VertiPaq tables. Aggregations were initially introduced in late 2018 as a Power BI feature; the same feature may later be adopted by other products. The purpose of Aggregations is to reduce the cost of a storage engine request, removing the need for an expensive DirectQuery request when the data is available in a smaller table containing aggregated data.

The Aggregations feature is not necessarily related to VertiPaq: it is possible to define aggregations in a DirectQuery model so that different tables are queried on the data source, depending on the granularity of a client request. However, the typical use case for Aggregations is defining them in a composite data model, where each table has three possible storage modes:

  • Import: The table is stored in memory and managed by the VertiPaq storage engine.

  • DirectQuery: The data is kept in the data source; at runtime, every DAX query might generate one or more requests to the data source, typically sending SQL queries.

  • Dual: The table is stored in memory by VertiPaq and can also be used in DirectQuery, typically joining other tables stored in DirectQuery or Dual mode.

The principle of aggregations is to provide different options to solve a storage engine request. For example, a Sales table can store the details of each transaction, such as product, customer, and date. When one creates an aggregation by product and month, the aggregated table has a much smaller number of rows. The Sales table could also have more than one aggregation, each with a precedence value used when multiple aggregations are compatible with the same request. Consider a case where the following aggregations are available in a model with Sales, Product, Date, and Store:

  • Product and Date—precedence 50

  • Store and Date—precedence 20

If a query requires the total of sales by product brand and year, it uses the first aggregation. The same aggregation is used when drilling down to the month or day level. Indeed, the aggregation that stores the Sales data at the Product and Date granularity can solve any query that groups rows by attributes of these tables. With the same logic, a query aggregating data by store country and year uses the second aggregation, created at the granularity of Store and Date. However, a query aggregating data by store country and product brand cannot use any existing aggregation. Such a query must use the Sales table, which has all the details, because none of the available aggregations has a granularity compatible with the request. If two or more aggregations are compatible with the request, the choice is made based on the precedence setting defined for each aggregation: The engine chooses the aggregation with the highest precedence. Table 18-6 recaps the aggregations used based on the query request in the examples described.

Table 18-6 Examples of aggregation used, based on query request

Query Request                               Aggregation Used
Group by product brand and year             Product and Date
Group by product brand and month            Product and Date
Group by store country and year             Store and Date
Group by store country and month            Store and Date
Group by year                               Product and Date (highest precedence)
Group by month                              Product and Date (highest precedence)
Group by store country and product brand    No aggregation; query the Sales table at detail level

The engine chooses the aggregation to use considering only the precedence order, regardless of the aggregation storage mode. Indeed, every aggregation has an underlying table that can be stored either in VertiPaq or in DirectQuery. Common sense would suggest that a VertiPaq aggregation should be preferred over a DirectQuery aggregation. Nevertheless, the DAX engine only follows the precedence rules: if a DirectQuery aggregation has a higher precedence than a VertiPaq aggregation and both are candidates to speed up a request, the engine chooses the DirectQuery aggregation. It is up to the developer to define a good set of precedence rules.

An aggregation can match a storage engine request depending on several conditions:

  • Granularity of the relationships involved in the storage engine request.

  • Matching of columns defined as GroupBy in the summarization type of the aggregation.

  • Summarization corresponding to a simple aggregation of a single column.

  • Presence of a Count summarization of the detail table.

These conditions might have an impact on the data model design. A model that imports all the tables in VertiPaq is usually designed to minimize memory requirements. As described in the previous section, "Choosing the right columns to store," storing the Quantity and Price columns allows the developer to compute the Amount at query time using a measure such as:

Sales Amount := SUMX ( Sales, Sales[Quantity] * Sales[Price] )

This version of the Sales Amount measure might not use an aggregation with a Sum summarization type because the Sum summarization only references a single column. However, an aggregation could match the request if Sales[Quantity] and Sales[Price] have the GroupBy summarization and if there is a Count summarization of the Sales table. For complex expressions it could be hard to define an efficient aggregation, and this could impact the model and aggregation design.
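
To see why the GroupBy columns plus a Count of the detail table are sufficient, consider the following logically equivalent formulation of Sales Amount. This is only an educational sketch written in plain DAX, not the rewrite performed internally by the engine; the measure name and the "@Rows" column are hypothetical:

Sales Amount Equivalent :=
SUMX (
    ADDCOLUMNS (
        SUMMARIZE ( Sales, Sales[Quantity], Sales[Price] ),   -- GroupBy columns
        "@Rows", CALCULATE ( COUNTROWS ( Sales ) )            -- Count of the detail table
    ),
    Sales[Quantity] * Sales[Price] * [@Rows]
)

Each combination of Quantity and Price contributes its product multiplied by the number of detail rows sharing that combination, which is exactly the information available in an aggregation with GroupBy summarizations on the two columns plus a Count summarization of the Sales table.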

Consider the following code as an educational example. If there are two Sum aggregations for the Sales[Amount] and Sales[Cost] columns, then a Margin measure should be implemented as the difference between two aggregated values (as in Margin1 and Margin2), instead of aggregating a difference computed row by row (as in Margin3).

Sales Amount := SUM ( Sales[Amount] )                         -- Can use Sum aggregations
Total Cost   := SUM ( Sales[Cost] )                           -- Can use Sum aggregations
Margin1      := [Sales Amount] - [Total Cost]                 -- Can use Sum aggregations
Margin2      := SUM ( Sales[Amount] ) - SUM ( Sales[Cost] )   -- Can use Sum aggregations

Margin3 := SUMX ( Sales, Sales[Amount] - Sales[Cost] )   -- CANNOT use Sum aggregations

However, the Margin3 measure could match an aggregation that defines the GroupBy summarization for the Sales[Amount] and Sales[Cost] columns and that also includes a Count summarization of the Sales table. Such an aggregation would potentially also be useful for the previous definitions of the Sales Amount and Total Cost measures, even though it would be less efficient than a Sum aggregation on the specific column.

As of April 2019, the Aggregations feature is available for DirectQuery tables. While it is not possible to define aggregations for a table imported in memory, that feature might be implemented in the near future. At that point, all these combinations will become possible:

  • DirectQuery aggregation over a DirectQuery table

  • VertiPaq aggregation over a DirectQuery table

  • VertiPaq aggregation over a VertiPaq table (not available as of April 2019)

The ability to create a VertiPaq aggregation over VertiPaq tables will provide a tool to optimize two scenarios for models imported in memory: very large tables (billions of rows) and relationships with a high cardinality (millions of unique values). These two scenarios can be managed by manually modifying the data model and the DAX code as described in the “Denormalization” section earlier in this chapter. The aggregations over VertiPaq tables will automate this process, resulting in better performance, reduced maintenance, and decreased development costs.

Conclusions

In this chapter we focused on how to optimize a data model imported in memory using the VertiPaq storage engine. The goal is to reduce the memory required for a data model, obtaining as a side effect an improvement in query performance. VertiPaq can also be used to store aggregations in composite models, combining the use of the DirectQuery and VertiPaq storage engines in a single model.

The main takeaways of this chapter are:

  • Only import in memory the columns required for the analysis.

  • Control column cardinality, because a low-cardinality column compresses better.

  • Manage date and time in separate tables and store them at the proper granularity level for the analysis. Storing a higher precision than required (e.g., milliseconds) consumes memory and lowers query performance.

  • Consider using VertiPaq to store in-memory aggregations for DirectQuery data sources in composite models.
