Working with tables

Tables are an important part of DAX formulas. In previous chapters you learned how to iterate over tables, how to create calculated tables, and how to perform several other calculation techniques that require a table as their starting point. Moreover, *CALCULATE* filter arguments are tables: When authoring complex formulas, the ability to build the correct filter table is of paramount importance. DAX offers a rich set of functions to manage tables. In this chapter we introduce many DAX functions that are helpful for creating and managing tables.

For most of the new functions, we provide some examples that are useful for two purposes: They show how to use the function, and they act as a good DAX exercise to understand how to write complex measures.

The first function to manipulate tables is *CALCULATETABLE*. We have already used *CALCULATETABLE* multiple times in the book prior to this point. In this section we provide a more complete reference to the function, along with some considerations about when to use it.

*CALCULATETABLE* performs the same operations as *CALCULATE*, the only difference being in their result. *CALCULATETABLE* returns a table, whereas *CALCULATE* returns a single value like an integer or a string. As an example, if one needs to produce a table containing only red products, then *CALCULATETABLE* is the function to use:

CALCULATETABLE (
    'Product',
    'Product'[Color] = "Red"
)

A common question is what the difference is between *CALCULATETABLE* and *FILTER*. Indeed, the previous expression can be written with *FILTER* too:

FILTER (
    'Product',
    'Product'[Color] = "Red"
)

Even though the only difference seems to be the function name, in reality the semantics of these two functions are very different. *CALCULATETABLE* operates by changing the filter context first and later evaluating the expression. *FILTER*, on the other hand, iterates the result of its first argument, retrieving the rows that satisfy the condition. In other words, *FILTER* does not change the filter context.

You can appreciate the difference by reviewing the following example:

Red Products CALCULATETABLE =
CALCULATETABLE (
    ADDCOLUMNS (
        VALUES ( 'Product'[Color] ),
        "Num of Products", COUNTROWS ( 'Product' )
    ),
    'Product'[Color] = "Red"
)

The result is in Figure 12-1.

By using *CALCULATETABLE*, the filter context where both *ADDCOLUMNS* and *COUNTROWS* are evaluated is filtering red products. Therefore, the result contains one row only, with Red as the color and 99 as the number of products. In other words, *COUNTROWS* only counted the red products, without requiring a context transition from the row generated by the *VALUES* function.

If one replaces *CALCULATETABLE* with *FILTER*, the result is different. Look at the following table:

Red Products FILTER external =
FILTER (
    ADDCOLUMNS (
        VALUES ( 'Product'[Color] ),
        "Num of Products", COUNTROWS ( 'Product' )
    ),
    'Product'[Color] = "Red"
)

This time, the result is no longer 99; instead, it shows the total number of products, as shown in Figure 12-2.

This table still contains Red for the product color, but now the number of products computes 2,517, which is the total number of products. The reason is that *FILTER* does not change the filter context. Moreover, *FILTER* is evaluated after *ADDCOLUMNS*. Consequently, *ADDCOLUMNS* iterates all the products, and *COUNTROWS* computes the total number of products because there is no context transition. Only later does *FILTER* select the Red row out of all the colors.

If one uses *FILTER* instead of *CALCULATETABLE*, the expression must be written differently, relying on *CALCULATE* to force the context transition:

Red Products FILTER internal =
ADDCOLUMNS (
    FILTER (
        VALUES ( 'Product'[Color] ),
        'Product'[Color] = "Red"
    ),
    "Num of Products", CALCULATE ( COUNTROWS ( 'Product' ) )
)

Now the result is back to 99. In order to obtain the same behavior as *CALCULATETABLE*, we needed to invert the execution order. This way *FILTER* runs first, and then the calculation of the number of rows relies on the context transition to force the row context of *ADDCOLUMNS* to become a filter context for *COUNTROWS*.

*CALCULATETABLE* works by modifying the filter context. It is powerful because it propagates its effect to multiple functions in a DAX expression. Its power comes with limitations in the type of filtering it can create. For example, *CALCULATETABLE* can only apply filters to columns that belong to the data model. If one only needs the customers whose sales amount is greater than one million, then *CALCULATETABLE* is not the right choice, because *Sales Amount* is a measure: *CALCULATETABLE* cannot apply a filter on a measure, whereas *FILTER* can. This is shown in the following expression; replacing *FILTER* with *CALCULATETABLE* is not an option, as it would lead to a syntax error:

Large Customers =
FILTER (
    Customer,
    [Sales Amount] > 1000000
)

*CALCULATETABLE*, like *CALCULATE*, performs a context transition and accepts all the *CALCULATE* modifiers like *ALL*, *USERELATIONSHIP*, *CROSSFILTER*, and many others. Consequently, it is much more powerful than *FILTER*. This is not to say that one should always try to use *CALCULATETABLE* and stop using *FILTER*. Each of the two functions has advantages and disadvantages, and the choice needs to be an educated one.
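As a sketch of how a modifier fits into *CALCULATETABLE* (the table name is illustrative, and the example assumes the model contains an inactive relationship between *Sales[Delivery Date]* and *'Date'[Date]*, as in the sample model used in this book), the following expression returns the sales delivered in 2007 rather than the sales ordered in 2007:

```dax
-- Hypothetical example: USERELATIONSHIP activates the inactive
-- relationship on Sales[Delivery Date] for this evaluation only.
SalesDeliveredIn2007 =
CALCULATETABLE (
    Sales,
    'Date'[Calendar Year] = "CY 2007",
    USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
)
```

*FILTER* has no equivalent way to change which relationship is active; this is one of the functionalities only available through the *CALCULATE*/*CALCULATETABLE* modifiers.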

As a rule of thumb, one uses *CALCULATETABLE* whenever the filter applies to a model column and/or the other functionalities of *CALCULATETABLE* are needed, like context transition and filter context modifiers.

DAX offers several functions to manipulate tables. These functions can be used to create new calculated tables, to create tables to iterate on, or to use their results as filter arguments in *CALCULATE*. In this section we provide a complete reference of those functions, along with examples. There are also other table functions that are mainly useful in queries. We show them in Chapter 13, “Authoring queries.”

*ADDCOLUMNS* is an iterator that returns all the rows and columns of its first argument, adding newly created columns to the output. For example, the following calculated table definition produces a table with all the colors and the value of sales amount for each color:

ColorsWithSales =
ADDCOLUMNS (
    VALUES ( 'Product'[Color] ),
    "Sales Amount", [Sales Amount]
)

You can see the result in Figure 12-3.

Being an iterator, *ADDCOLUMNS* evaluates the column expressions in a row context. In this example, it computes the sales of the given product color because the expression of *Sales Amount* uses a measure. Thus, there is an automatic *CALCULATE* surrounding *Sales Amount* that generates the context transition. If one uses a regular expression instead of a measure, then an explicit *CALCULATE* is required to force the context transition.
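For example, the following sketch (the table name is illustrative; the columns come from the same sample model) sums a column instead of invoking a measure. Without the explicit *CALCULATE*, every color would get the total quantity of all sales:

```dax
-- CALCULATE turns the iterated color (row context) into a filter
-- context, so SUM only aggregates the rows of that color.
ColorsWithQuantity =
ADDCOLUMNS (
    VALUES ( 'Product'[Color] ),
    "Quantity", CALCULATE ( SUM ( Sales[Quantity] ) )
)
```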

*ADDCOLUMNS* is oftentimes used in conjunction with *FILTER* to obtain filters on temporary calculated columns. For example, to compute the products that sold more than 150,000.00 USD, a possible implementation is the following one:

HighSalesProducts =
VAR ProductsWithSales =
    ADDCOLUMNS (
        VALUES ( 'Product'[Product Name] ),
        "Product Sales", [Sales Amount]
    )
VAR Result =
    FILTER (
        ProductsWithSales,
        [Product Sales] >= 150000
    )
RETURN Result

You can see the result in Figure 12-4.

The same expression can be written in several different ways, even without using *ADDCOLUMNS*. The following code, for example, ends up even simpler than the previous one, even though it does not add a *Product Sales* column to the output:

FILTER (
    VALUES ( 'Product'[Product Name] ),
    [Sales Amount] >= 150000
)

*ADDCOLUMNS* is useful to compute multiple columns or when further calculations are needed after this first step. For example, consider computing the set of products that together represent 15% of total sales. This calculation is no longer trivial because several steps are needed:

1. Computing the sales amount for each product.
2. Computing a running total of sales amount, by aggregating each product with all the products that sold more than the product itself.
3. Transforming the running total into a percentage against the grand total of sales.
4. Only returning the products whose percentage is less than or equal to 15%.

Authoring the full query in a single step is unnecessarily complex, whereas splitting the evaluation in four steps proves much easier:

Top Products =
VAR TotalSales = [Sales Amount]
VAR ProdsWithSales =
    ADDCOLUMNS (
        VALUES ( 'Product'[Product Name] ),
        "ProductSales", [Sales Amount]
    )
VAR ProdsWithRT =
    ADDCOLUMNS (
        ProdsWithSales,
        "RunningTotal",
        VAR SalesOfCurrentProduct = [ProductSales]
        RETURN
            SUMX (
                FILTER (
                    ProdsWithSales,
                    [ProductSales] >= SalesOfCurrentProduct
                ),
                [ProductSales]
            )
    )
VAR Top15Percent =
    FILTER (
        ProdsWithRT,
        [RunningTotal] / TotalSales <= 0.15
    )
RETURN Top15Percent

You can see the result in Figure 12-5.

In the example, we implemented the result as a calculated table, but other uses are possible. For example, one could iterate the *Top15Percent* variable using *SUMX* to create a measure computing the sales of those products.
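As a sketch of that idea (the measure name is illustrative; the variables repeat the definitions of the *Top Products* calculated table), a measure could end with *SUMX* instead of returning the table:

```dax
-- Sketch: same variables as the Top Products calculated table,
-- but the final step aggregates instead of returning rows.
Sales of Top Products :=
VAR TotalSales = [Sales Amount]
VAR ProdsWithSales =
    ADDCOLUMNS (
        VALUES ( 'Product'[Product Name] ),
        "ProductSales", [Sales Amount]
    )
VAR ProdsWithRT =
    ADDCOLUMNS (
        ProdsWithSales,
        "RunningTotal",
        VAR SalesOfCurrentProduct = [ProductSales]
        RETURN
            SUMX (
                FILTER ( ProdsWithSales, [ProductSales] >= SalesOfCurrentProduct ),
                [ProductSales]
            )
    )
VAR Top15Percent =
    FILTER ( ProdsWithRT, [RunningTotal] / TotalSales <= 0.15 )
RETURN
    SUMX ( Top15Percent, [ProductSales] )
```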

As with most other DAX functions, one should think of *ADDCOLUMNS* as one of the many building blocks of DAX. The real power of DAX unfolds when you learn how to leverage those building blocks and have them interact in more sophisticated calculations.

*SUMMARIZE* is one of the most commonly used functions in DAX. It scans a table (its first argument) and groups it by one or more columns of that table or of related tables. The main use of *SUMMARIZE* is to retrieve only the existing combinations of values, rather than the full list of values.

An example would be computing the number of distinct colors sold, to produce a report that shows the number of colors available and the number of colors sold at least once. The following measures would produce the desired result:

Num of colors :=
COUNTROWS ( VALUES ( 'Product'[Color] ) )

Num of colors sold :=
COUNTROWS ( SUMMARIZE ( Sales, 'Product'[Color] ) )

You can see the result of these two measures by brand in the report in Figure 12-6.

In this case we used *SUMMARIZE* to group the sales by *Product[Color]*, and then we counted the number of rows in the result. Because *SUMMARIZE* performs a group by, it only returns the colors referenced by *Sales*. On the other hand, *VALUES ( Product[Color] )* returns all the existing colors whether they have sales or not.

Using *SUMMARIZE*, one can group data by any number of columns, provided that the columns used as parameters are reachable from *Sales* by following many-to-one or one-to-one relationships. For example, to compute the average quantity sold per product and per day, this is one possible implementation:

AvgDailyQty :=
VAR ProductsDatesWithSales =
    SUMMARIZE (
        Sales,
        'Product'[Product Name],
        'Date'[Date]
    )
VAR Result =
    AVERAGEX (
        ProductsDatesWithSales,
        CALCULATE ( SUM ( Sales[Quantity] ) )
    )
RETURN Result

You can see the result of this measure in Figure 12-7.

In this case, we used *SUMMARIZE* to scan *Sales* and to group it by product name and by date. The resulting table contains the product name and the date, only considering days with sales for that product. *AVERAGEX* takes care of computing the average over each row of the temporary table returned by *SUMMARIZE*. If a product has no sales on a given day, the resulting table does not contain that product/date combination.

*SUMMARIZE* can also be used like *ADDCOLUMNS* to add further columns to the result. For example, the previous measure could also be authored the following way:

AvgDailyQty :=
VAR ProductsDatesWithSalesAndQuantity =
    SUMMARIZE (
        Sales,
        'Product'[Product Name],
        'Date'[Date],
        "Daily qty", SUM ( Sales[Quantity] )
    )
VAR Result =
    AVERAGEX (
        ProductsDatesWithSalesAndQuantity,
        [Daily qty]
    )
RETURN Result

In this case *SUMMARIZE* returns a table that contains the product name, the date, and a newly introduced column named *Daily qty*. *Daily qty* is later averaged by *AVERAGEX*. Nevertheless, **the use of *SUMMARIZE* to create temporary columns is deprecated** because of the complex semantics it applies when evaluating the new columns. The best practice is to use *SUMMARIZE* only to group, adding the columns with *ADDCOLUMNS*, as in the following equivalent version:

AvgDailyQty :=
VAR ProductsDatesWithSales =
    SUMMARIZE (
        Sales,
        'Product'[Product Name],
        'Date'[Date]
    )
VAR ProductsDatesWithSalesAndQuantity =
    ADDCOLUMNS (
        ProductsDatesWithSales,
        "Daily qty", CALCULATE ( SUM ( Sales[Quantity] ) )
    )
VAR Result =
    AVERAGEX (
        ProductsDatesWithSalesAndQuantity,
        [Daily qty]
    )
RETURN Result

Despite the code being more verbose, it is much easier to read and write because there is a single row context used in a context transition. This row context is introduced by *ADDCOLUMNS* while iterating over the result of *SUMMARIZE*. This pattern results in simpler (and most of the time faster) code.

It is possible to use more optional parameters with *SUMMARIZE*, intended for the calculation of subtotals and for adding columns to the result. We deliberately decided not to write about them to make the point even stronger: *SUMMARIZE works fine when grouping tables and should not be used to compute additional columns*. Although you can find code on the web that still uses *SUMMARIZE* to author new columns, consider it a very bad practice and always replace it with the pair *ADDCOLUMNS*/*SUMMARIZE*.

*CROSSJOIN* performs the cross-join of two tables, returning the Cartesian product of the two input tables. In other words, it returns all possible combinations of the values in the input tables. For example, the following expression returns all the combinations of product names and years:

CROSSJOIN (
    ALL ( 'Product'[Product Name] ),
    ALL ( 'Date'[Calendar Year] )
)

If the model contains 1,000 product names and five years, the resulting table contains 5,000 rows. *CROSSJOIN* is more often used in queries than in measures. Nevertheless, there are some scenarios where the use of *CROSSJOIN* becomes relevant—mostly because of performance.

For example, consider the need for an *OR* condition between two different columns in a *CALCULATE* filter argument. Because *CALCULATE* merges its filter arguments with an intersection, implementing the *OR* condition requires closer attention. For example, this is a possible implementation of *CALCULATE* filtering all the products that belong to the Audio category or have a Black color:

AudioOrBlackSales :=
VAR CategoriesColors =
    SUMMARIZE (
        'Product',
        'Product'[Category],
        'Product'[Color]
    )
VAR AudioOrBlack =
    FILTER (
        CategoriesColors,
        OR (
            'Product'[Category] = "Audio",
            'Product'[Color] = "Black"
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], AudioOrBlack )
RETURN Result

The previous code works well, and it is optimal also from a performance point of view. *SUMMARIZE* scans the *Product* table, which is expected to contain a small number of rows. Thus, the evaluation of the filter is very quick.

If the requirement is to filter columns from different tables like color and year, then things are different. Indeed, one could extend the previous example; but to summarize by columns from two separate tables, *SUMMARIZE* needs to scan the *Sales* table:

AudioOr2007 Sales :=
VAR CategoriesYears =
    SUMMARIZE (
        Sales,
        'Product'[Category],
        'Date'[Calendar Year]
    )
VAR Audio2007 =
    FILTER (
        CategoriesYears,
        OR (
            'Product'[Category] = "Audio",
            'Date'[Calendar Year] = "CY 2007"
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], Audio2007 )
RETURN Result

*Sales* is not a small table; it might contain hundreds of millions of rows. Scanning it to retrieve the existing combinations of category and years could result in an expensive operation. Regardless, the resulting filter is not going to be large because there are only a few categories and years, yet the engine needs to scan a large table to retrieve the filter.

In that scenario, we recommend you build all the combinations of category and year, producing a small table; you would then filter that table, as in the following code:

AudioOr2007 Sales :=
VAR CategoriesYears =
    CROSSJOIN (
        VALUES ( 'Product'[Category] ),
        VALUES ( 'Date'[Calendar Year] )
    )
VAR Audio2007 =
    FILTER (
        CategoriesYears,
        OR (
            'Product'[Category] = "Audio",
            'Date'[Calendar Year] = "CY 2007"
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], Audio2007 )
RETURN Result

The full *CROSSJOIN* of categories and years contains a few hundred rows and the execution of this last version of the measure is faster.

*CROSSJOIN* is not only useful to speed up calculations. Sometimes, one is interested in retrieving rows even when no event happened. For example, by using *SUMMARIZE* to scan sales by category and country, the result only contains the categories and countries with sales for certain products. This is the intended behavior of *SUMMARIZE*, so it is not surprising. However, sometimes the absence of an event is more important than its presence. For example, one might want to investigate which brands have no sales in certain regions. In that case, the measure needs to build a more complex expression involving a *CROSSJOIN*, so as to also retrieve nonexistent combinations of values. We will provide more examples of *CROSSJOIN* in the next chapter.
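A hedged sketch of that idea (the table name is illustrative, and the *'Product'[Brand]* column is an assumption based on the sample model): cross-join brands and store regions, then keep only the combinations where the sales measure is blank:

```dax
-- Combinations of brand and store region with no sales at all.
-- Inside FILTER, the [Sales Amount] measure triggers a context
-- transition, so it is evaluated for each brand/region pair.
BrandRegionNoSales =
VAR AllCombinations =
    CROSSJOIN (
        ALL ( 'Product'[Brand] ),
        ALL ( Store[CountryRegion] )
    )
VAR Result =
    FILTER (
        AllCombinations,
        ISBLANK ( [Sales Amount] )
    )
RETURN Result
```

A *SUMMARIZE* over *Sales* could never return these rows, because by definition they have no corresponding sales events.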

*UNION* is a set function that performs the union of two tables. The ability to combine different tables into a single table can be important in certain circumstances. It is mainly used in calculated tables, much less frequently in measures. For example, the following table contains all the countries from both the *Customer* and the *Store* tables:

AllCountryRegions =
UNION (
    ALL ( Customer[CountryRegion] ),
    ALL ( Store[CountryRegion] )
)

You can look at the result in Figure 12-8.

*UNION* does not remove duplicates before returning a result. Thus, if Australia belongs to the countries of both customers and stores, it appears twice in the resulting table. If needed, leveraging the *DISTINCT* function will remove duplicates.

We have described and used *DISTINCT* multiple times prior to this point, to obtain the distinct values of a column as visible in the current filter context. *DISTINCT* can also be used with a table expression as a parameter, and in that special case it returns the distinct rows of the table. Thus, the following is a good implementation removing potential duplicates from the *CountryRegion* column:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
RETURN UniqueCountryRegions

You can see the resulting table in Figure 12-9.

*UNION* maintains the data lineage of the input tables only if the lineage of both tables is the same. In the previous formula, the result of *DISTINCT* does not retain the lineage of either column, because the first table contains *Customer[CountryRegion]* and the second table contains *Store[CountryRegion]*. Because the data lineage of the input tables is different, the result has a new lineage not corresponding to any of the existing columns. Therefore, the following calculated table returns the same grand total of sales on all the rows:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
VAR Result =
    ADDCOLUMNS (
        UniqueCountryRegions,
        "Sales Amount", [Sales Amount]
    )
RETURN Result

The result is presented in Figure 12-10.

If the calculated table needs to contain both the sales amount and the number of stores, including all country regions of both customers and stores, then the filtering must be handled manually through a more complex expression:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
VAR Result =
    ADDCOLUMNS (
        UniqueCountryRegions,
        "Customer Sales Amount",
            VAR CurrentRegion = [CountryRegion]
            RETURN
                CALCULATE (
                    [Sales Amount],
                    Customer[CountryRegion] = CurrentRegion
                ),
        "Number of stores",
            VAR CurrentRegion = [CountryRegion]
            RETURN
                CALCULATE (
                    COUNTROWS ( Store ),
                    Store[CountryRegion] = CurrentRegion
                )
    )
RETURN Result

You see the result in Figure 12-11.

In the previous example *CALCULATE* applies a filter on either the customer or the store country region, using the value currently iterated by *ADDCOLUMNS* on the result of *UNION*. Another option to obtain the same result is to restore the lineage using *TREATAS* (see Chapter 10, "Working with the filter context," for more information about *TREATAS*), as in the following equivalent expression:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
VAR Result =
    ADDCOLUMNS (
        UniqueCountryRegions,
        "Customer Sales Amount",
            CALCULATE (
                [Sales Amount],
                TREATAS ( { [CountryRegion] }, Customer[CountryRegion] )
            ),
        "Number of stores",
            CALCULATE (
                COUNTROWS ( Store ),
                TREATAS ( { [CountryRegion] }, Store[CountryRegion] )
            )
    )
RETURN Result

The result of our last two examples is the same; what differs is the technique used to move the filter from a new column to one that is part of the model. Moreover, in this last example you may notice the use of a table constructor: The curly braces transform *CountryRegion* into a table that can be used as a parameter of *TREATAS*.

Because *UNION* loses the data lineage if values come from different columns, *TREATAS* is a convenient function to control the data lineage of the result. It is worth noting that *TREATAS* ignores values that do not exist in the target columns.
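For example, in the following sketch (the measure name is illustrative, and "Just A Test" is a made-up value assumed not to exist in *'Product'[Color]*), *TREATAS* silently discards the nonexistent value, so only Red and Blue filter the column:

```dax
-- The nonexistent value is ignored by TREATAS; the resulting
-- filter context only contains the colors Red and Blue.
RedBlueSales :=
CALCULATE (
    [Sales Amount],
    TREATAS (
        { "Red", "Blue", "Just A Test" },
        'Product'[Color]
    )
)
```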

*INTERSECT* is a set function much like *UNION*. However, instead of appending one table to another, it returns the intersection of the two tables, that is, only the rows that appear in both tables. It was popular before the introduction of *TREATAS* because it allowed one to apply the result of a table expression as a filter to other tables and columns. Since the introduction of *TREATAS*, the number of use cases for *INTERSECT* has been greatly reduced.

For example, if one needs to retrieve the customers who bought in both 2007 and 2008, a possible implementation is the following:

CustomersBuyingInTwoYears =
VAR Customers2007 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2007"
    )
VAR Customers2008 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2008"
    )
VAR Result =
    INTERSECT ( Customers2007, Customers2008 )
RETURN Result

From the lineage point of view, *INTERSECT* retains the data lineage of the first table. In the previous example, both tables have the same data lineage. If the two input tables have different data lineages, then only the lineage of the first table is kept. For example, the countries where there are both customers and stores can be expressed as follows:

INTERSECT (
    ALL ( Store[CountryRegion] ),
    ALL ( Customer[CountryRegion] )
)

In this latter example, the lineage is that of *Store[CountryRegion]*. Consequently, a more complex expression like the following returns the sales filtered by *Store[CountryRegion]*, not *Customer[CountryRegion]*:

SalesStoresInCustomersCountries =
VAR CountriesWithStoresAndCustomers =
    INTERSECT (
        ALL ( Store[CountryRegion] ),
        ALL ( Customer[CountryRegion] )
    )
VAR Result =
    ADDCOLUMNS (
        CountriesWithStoresAndCustomers,
        "StoresSales", [Sales Amount]
    )
RETURN Result

You see the result of this expression in Figure 12-12.

In this latter example, the *StoresSales* column contains the sales related to the country of the store.

*EXCEPT* is the last of the set functions introduced in this section. *EXCEPT* removes the rows present in the second table from the first table. As such, it implements set subtraction with two tables. For example, if one is interested in customers who bought a product in 2007 but not in 2008, one possible implementation is the following:

CustomersBuyingIn2007butNotIn2008 =
VAR Customers2007 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2007"
    )
VAR Customers2008 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2008"
    )
VAR Result =
    EXCEPT ( Customers2007, Customers2008 )
RETURN Result

The first rows of the calculated table are visible in Figure 12-13.

As usual, one could use the previous calculation as a filter argument of *CALCULATE* to obtain the sales amount of those customers. *EXCEPT* is frequently used when analyzing customer behavior. For example, a common calculation for many businesses is establishing the number of new customers, returning customers, and lost customers.

There are several possible implementations of the same calculations, each one targeted to a specific data model. The following implementation is not always the most optimal, but it is flexible and easy to understand. To compute the number of customers who did not buy anything last year but bought something this year, the following measure removes customers who bought a product in the previous year from the set of current customers:

SalesOfNewCustomers :=
VAR CurrentCustomers =
    VALUES ( Sales[CustomerKey] )
VAR CustomersLastYear =
    CALCULATETABLE (
        VALUES ( Sales[CustomerKey] ),
        DATESINPERIOD (
            'Date'[Date],
            MIN ( 'Date'[Date] ) - 1,
            -1,
            YEAR
        )
    )
VAR CustomersNotInLastYear =
    EXCEPT ( CurrentCustomers, CustomersLastYear )
VAR Result =
    CALCULATE ( [Sales Amount], CustomersNotInLastYear )
RETURN Result

The implementation of this code as a measure works with any filter and provides a flexible way to slice by any column. Please be mindful that this implementation of new customers is not the best in terms of performance. We used it here to demonstrate a possible usage of *EXCEPT*. Later in this chapter we show a much faster version of the same calculation, although a bit more complex to learn.

From the lineage point of view, *EXCEPT* retains the data lineage of the first table, as was the case with *INTERSECT*. For example, the following expression computes the sales made to customers living in countries where there are no stores:

SalesInCountriesWithNoStores :=
VAR CountriesWithActiveStores =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Store[CountryRegion] ),
        ALL ( Sales )
    )
VAR CountriesWithSales =
    SUMMARIZE ( Sales, Customer[CountryRegion] )
VAR CountriesWithNoStores =
    EXCEPT ( CountriesWithSales, CountriesWithActiveStores )
VAR Result =
    CALCULATE ( [Sales Amount], CountriesWithNoStores )
RETURN Result

The result of *EXCEPT* filters the *Customer[CountryRegion]* column because it is the column used by the table taken as the first argument of *EXCEPT*.

Functions manipulating tables are oftentimes used to build complex filters for *CALCULATE* parameters. In this section, we provide further examples, always leading you one step further in your understanding of DAX.

A first example where manipulating tables proves to be a useful skill is the following. Imagine having to implement an *OR* condition between the selections made in different slicers, instead of the default *AND* behavior provided by client tools like Excel and Power BI.

The report in Figure 12-14 contains two slicers. The default behavior of Power BI is to intersect the two conditions. As a consequence, the numbers shown represent the sales of Home Appliances to customers with a High School education.

Instead of intersecting the two conditions, one might want to merge them. In other words, the numbers shown in the report need to be the sales of products sold to customers with a High School education or the sales of Home Appliances. Because Power BI does not support “or” conditions between slicers, one can solve the problem by using DAX.

Remember that each cell of the report has a filter context containing both a filter on the category and a filter on the education. Both filters need to be replaced. There are several possible solutions to the same pattern; we demonstrate the use of three different formulas.

The first, and probably the easiest expression of that filter, is the following:

OR 1 :=
VAR CategoriesEducations =
    CROSSJOIN (
        ALL ( 'Product'[Category] ),
        ALL ( Customer[Education] )
    )
VAR CategoriesEducationsSelected =
    FILTER (
        CategoriesEducations,
        OR (
            'Product'[Category] IN VALUES ( 'Product'[Category] ),
            Customer[Education] IN VALUES ( Customer[Education] )
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], CategoriesEducationsSelected )
RETURN Result

The measure first builds the cross-join of all the categories and education levels. Once the table is prepared, it filters out the rows that do not satisfy the condition, and finally it uses the resulting table as a filter argument for *CALCULATE*. *CALCULATE* overrides the current filter on both the category and the education, resulting in the report in Figure 12-15.

The first implementation of the measure is already simple—both to use and to understand. In case there is a large number of rows in the columns used to filter the *OR* condition, or when there are more than just two conditions, the resulting temporary table would quickly become huge. In such a case, one can limit its size by removing the *CROSSJOIN* in favor of *SUMMARIZE*, as in this second implementation of the same measure:

OR 2 :=
VAR CategoriesEducations =
    CALCULATETABLE (
        SUMMARIZE (
            Sales,
            'Product'[Category],
            Customer[Education]
        ),
        ALL ( 'Product'[Category] ),
        ALL ( Customer[Education] )
    )
VAR CategoriesEducationsSelected =
    FILTER (
        CategoriesEducations,
        OR (
            'Product'[Category] IN VALUES ( 'Product'[Category] ),
            Customer[Education] IN VALUES ( Customer[Education] )
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], CategoriesEducationsSelected )
RETURN Result

The logic of this second implementation is close to the first one, the only noticeable difference being the presence of *SUMMARIZE* instead of *CROSSJOIN*. Moreover, it is worth pointing out that *SUMMARIZE* needs to be executed in a filter context without the filters on *Category* and *Education*. Otherwise, the slicers would affect the calculation executed by *SUMMARIZE*, defeating the purpose of the filter.

There is at least a third solution to the same scenario, potentially faster though harder to understand at first sight. Indeed, the same table filter can be expressed by reasoning that if the category is in the selected values for the categories, then any value for the education level is fine. The same happens for the education level: As long as the education level is in the selected values, then any category is fine. This reasoning leads to the third formulation of the same expression:

OR 3 :=
VAR Categories =
    CROSSJOIN (
        VALUES ( 'Product'[Category] ),
        ALL ( Customer[Education] )
    )
VAR Educations =
    CROSSJOIN (
        ALL ( 'Product'[Category] ),
        VALUES ( Customer[Education] )
    )
VAR CategoriesEducationsSelected =
    UNION ( Categories, Educations )
VAR Result =
    CALCULATE ( [Sales Amount], CategoriesEducationsSelected )
RETURN Result

As you can see, one can author the same formula in several ways. The difference is in both readability and performance. Being able to write the same formula using different methods is a skill that will prove extremely useful in the final optimization chapters, where you learn to evaluate the performance of different versions of the same code, seeking the most optimal.

As another example of a useful calculation involving the manipulation of tables, we demonstrate how to analyze sales over time but only considering customers who made a purchase during the first year of a selected time period. In other words, we consider the first year with sales in the visual, evaluate the customers who bought during that first year, and then we only analyze the sales of those customers over the following years, ignoring those who became customers afterwards.

The code needs to perform three steps:

1. Determining the first year with sales on any product.
2. Storing the set of customers of that first year into a variable, ignoring any other filter.
3. Computing the sales of the customers determined in step 2 in the current period.

The following code implements this algorithm by using variables to store temporary results:

SalesOfFirstYearCustomers :=
VAR FirstYearWithSales =
    CALCULATETABLE (
        FIRSTNONBLANK ( 'Date'[Calendar Year], [Sales Amount] ),
        ALLSELECTED ()
    )
VAR CustomersFirstYear =
    CALCULATETABLE (
        VALUES ( Sales[CustomerKey] ),
        FirstYearWithSales,
        ALLSELECTED ()
    )
VAR Result =
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS ( CustomersFirstYear )
    )
RETURN
    Result

The *FirstYearWithSales* variable stores the first year with sales. Please note that *FIRSTNONBLANK* returns a table as a result, with the data lineage of *Date[Calendar Year]*. The *CustomersFirstYear* variable retrieves the list of all customers in that first year. The last step is the easiest because it only applies the filter on the customer; in each cell of the report, the value of *Sales Amount* is restricted to only the customers found during the second step. The *KEEPFILTERS* modifier makes it possible to filter these customers by country, for example.

The result is visible in Figure 12-16, indicating that—after the first year—sales made to those customers are decreasing over time.

This latter example is important to learn. Indeed, there are several scenarios where one needs to place a filter over time, compute a set, and finally analyze the behavior of this set (of customers, products, stores) over different years. With this pattern one can easily implement same-store analyses or any other calculation with similar requirements.

In a previous section of this chapter about *EXCEPT*, we showed how to compute new customers. In this section we provide a much better implementation of the same calculation that—again—makes heavy use of table functions.

The idea in this new algorithm is the following: First we determine the earliest day when each customer made a purchase. Once this table is available, the formula checks if the first sale to the customer falls within the current time period. If that holds true, it means that the customer—in the current period—is a new customer.

Here is the code of the measure:

New Customers :=
VAR CustomersFirstSale =
    CALCULATETABLE (
        ADDCOLUMNS (
            VALUES ( Sales[CustomerKey] ),
            "FirstSale", CALCULATE ( MIN ( Sales[Order Date] ) )
        ),
        ALL ( 'Date' )
    )
VAR CustomersWith1stSaleInCurrentPeriod =
    FILTER (
        CustomersFirstSale,
        [FirstSale] IN VALUES ( 'Date'[Date] )
    )
VAR Result =
    COUNTROWS ( CustomersWith1stSaleInCurrentPeriod )
RETURN
    Result

The *CustomersFirstSale* variable needs to use *ALL* on the *Date* table so that the first sale date of each customer is computed over all periods, including sales that happened before the current time period. You can see the resulting report in Figure 12-17.

The way it is written now, if a user further filters other tables like the product category, a customer is considered new when they buy the selected category for the first time. Thus, one individual customer might be considered new multiple times, depending on the filters applied. By adding further *CALCULATE* modifiers to the computation of the first variable, it is possible to implement several different variations of the same code. For example, by adding *ALL ( 'Product' )*, customers are considered new only the first time they buy any product, regardless of the category selected. By adding *ALL ( Store )*, customers are new only the first time they buy in any store.
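As a sketch of one such variation—the measure name *New Customers Any Product* is our own, and the rest follows the pattern of the previous measure—adding *ALL ( 'Product' )* to the *CALCULATETABLE* removes the product filter when searching for the first sale:

```dax
New Customers Any Product :=
VAR CustomersFirstSale =
    CALCULATETABLE (
        ADDCOLUMNS (
            VALUES ( Sales[CustomerKey] ),
            "FirstSale", CALCULATE ( MIN ( Sales[Order Date] ) )
        ),
        ALL ( 'Date' ),
        ALL ( 'Product' )   -- First sale of any product, not only the selected ones
    )
VAR CustomersWith1stSaleInCurrentPeriod =
    FILTER (
        CustomersFirstSale,
        [FirstSale] IN VALUES ( 'Date'[Date] )
    )
VAR Result =
    COUNTROWS ( CustomersWith1stSaleInCurrentPeriod )
RETURN
    Result
```

With this version, a customer filtered by a product category slicer still counts as new only in the period of their very first purchase overall.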

The PivotTable in Excel offers a feature to retrieve the underlying data used to compute a cell. That feature is called “Show Details” in the Excel user interface, and its more technical name is “drillthrough.” This name could be confusing because in Power BI the term “drillthrough” refers to a feature that allows the user to move from one report page to another, in a manner controlled by the report author. For this reason, the feature that allows control over the “Show Details” result was called “Detail Rows Expression” in the Tabular model and was introduced in SQL Server Analysis Services 2017. As of April 2019, it is not available in Power BI, but support is planned for a future release.

The Detail Rows Expression is a DAX table expression associated with a measure and invoked to retrieve the table for the Show Details feature. This expression is executed in the filter context of the measure. The idea is that if a measure changes the filter context to compute a variable, the Detail Rows Expression should apply a similar transformation to the filter context.

For example, consider *Sales YTD* that computes the year-to-date value of *Sales Amount*:

Sales YTD := CALCULATE ( [Sales Amount], DATESYTD ( 'Date'[Date] ) )

The corresponding Detail Rows Expression should be a *CALCULATETABLE* that applies the same filter context transformation as the one made by the corresponding measure. For example, the following expression returns all the columns of the *Sales* table from the beginning of the year considered in the calculation:

CALCULATETABLE ( Sales, DATESYTD ( 'Date'[Date] ) )

A client tool retrieves the details of a measure by invoking a specific DAX function called *DETAILROWS*, specifying the measure that the Detail Rows Expression belongs to:

DETAILROWS ( [Sales YTD] )

The *DETAILROWS* function invokes a table expression stored in a measure. Therefore, one can define hidden measures just to store long table expressions often used as filter arguments of many other DAX measures. For example, consider a *Cumulative Total* measure with a Detail Rows Expression that retrieves any dates less than or equal to the maximum date available in the filter context:

-- Detail Rows Expression for the Cumulative Total measure
VAR LastDateSelected = MAX ( 'Date'[Date] )
RETURN
    FILTER (
        ALL ( 'Date'[Date] ),
        'Date'[Date] <= LastDateSelected
    )

One can reference this table expression in different measures by using the *DETAILROWS* function:

Cumulative Sales Amount :=
CALCULATE (
    [Sales Amount],
    DETAILROWS ( [Cumulative Total] )
)

Cumulative Total Cost :=
CALCULATE (
    [Total Cost],
    DETAILROWS ( [Cumulative Total] )
)

More detailed examples of this technique are available at https://www.sqlbi.com/articles/creating-table-functions-in-dax-using-detailrows/. However, reusing table expressions with *DETAILROWS* is just a workaround for the lack of custom-defined functions in DAX, and it may have performance implications. Many use cases for *DETAILROWS* can be solved by using calculation groups, and this technique will become obsolete once DAX introduces measures returning tables or custom-defined DAX functions.

All the table functions shown in the previous sections can be used either as table filters in *CALCULATE* or to create calculated tables and queries. Earlier on, we described the ones more likely to be used as table filters, whereas in this section we describe some additional functions that are mostly used when creating calculated tables. There are other table functions whose main usage is in authoring queries; we will describe them in the next chapter. Nevertheless, be mindful that there are no restrictions on the use of table functions: Nothing prevents anyone from using *DATATABLE*, *SELECTCOLUMNS*, or *GENERATESERIES* (some of the functions described later) in a measure or as a table filter. It is only a matter of convenience: Some functions better fit certain specific needs.

*SELECTCOLUMNS* is useful to reduce the number of columns in a table, and it also provides the capability to add new columns like *ADDCOLUMNS* does. In practice, *SELECTCOLUMNS* implements projection of columns like the SQL *SELECT* statement.

The most common usage of *SELECTCOLUMNS* is to scan a table and only return some of the columns. For example, the following expression only returns the customer education and gender, that is, two columns:

SELECTCOLUMNS (
    Customer,
    "Education", Customer[Education],
    "Gender", Customer[Gender]
)

The result contains a lot of duplicates, as you can see in Figure 12-18.

*SELECTCOLUMNS* is very different from *SUMMARIZE*. *SUMMARIZE* performs a grouping of the result, whereas *SELECTCOLUMNS* only reduces the number of columns. Therefore, the output of *SELECTCOLUMNS* might contain duplicates, whereas the output of *SUMMARIZE* does not. One needs to provide *SELECTCOLUMNS* with pairs of names and expressions for each column in the resulting set. The resulting columns can also be new ones. For example, the following formula returns a new column named *Customer* containing the name followed by its code in parentheses:
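To make the contrast concrete, the following sketch groups the same two columns with *SUMMARIZE*; unlike the *SELECTCOLUMNS* version above, it returns each combination of education and gender only once:

```dax
SUMMARIZE (
    Customer,
    Customer[Education],
    Customer[Gender]
)
```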

SELECTCOLUMNS (
    Customer,
    "Education", Customer[Education],
    "Gender", Customer[Gender],
    "Customer", Customer[Name] & " (" & Customer[Customer Code] & ")"
)

You can see the result in Figure 12-19.

*SELECTCOLUMNS* maintains the data lineage if the expression is a single column reference, whereas it generates a new data lineage whenever one uses an expression. Consequently, the following result contains two columns: The first column has the data lineage of *Customer[Name]*, whereas the second column has a different data lineage that cannot filter the original columns, even though the content of the two columns is the same:

SELECTCOLUMNS (
    Customer,
    "Customer Name with lineage", Customer[Name],
    "Customer Name without lineage", Customer[Name] & ""
)

*ROW* is a simple function that returns a table with only one row. *ROW* requires pairs of name and expression, and the result is a table with one row and a suitable number of columns. For example, the following expression is a table with one row and two columns, containing the sales amount and the quantity sold:

ROW (
    "Sales", [Sales Amount],
    "Quantity", SUM ( Sales[Quantity] )
)

The result is a table with one row and two columns, as you can see in Figure 12-20.

*ROW* is no longer commonly used since the table constructor syntax was introduced. Indeed, the previous expression can be written as:

{ ( [Sales Amount], SUM ( Sales[Quantity] ) ) }

The column names are generated automatically by the table constructor syntax, as shown in Figure 12-21.

When using the table constructor syntax, commas separate rows. To include multiple columns, one needs to use parentheses to encapsulate multiple columns in a single row. The main difference between the *ROW* function and the curly braces syntax is that *ROW* specifies names for the columns, whereas the curly braces automatically generate names for the columns. The latter makes it harder to later reference column values.
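As an illustration of the syntax—the values here are arbitrary examples—the following table constructor defines a table with three rows and two columns; DAX automatically names the columns *Value1* and *Value2*:

```dax
{
    ( "LOW", 0 ),
    ( "MEDIUM", 20 ),
    ( "HIGH", 50 )
}
```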

*ROW* is useful when wanting to create a table with a single row. On the other hand, to create multiple rows, one would use *DATATABLE*. *DATATABLE* creates a table specifying not only the column names, but also the data type of each column and its content. For example, if one needs a table with three rows to cluster prices, an easy way to build the table is the following expression:

DATATABLE (
    "Segment", STRING,
    "Min", DOUBLE,
    "Max", DOUBLE,
    {
        { "LOW", 0, 20 },
        { "MEDIUM", 20, 50 },
        { "HIGH", 50, 99 }
    }
)

You can see the result in Figure 12-22.

The data type of columns can be any of the following values: *INTEGER*, *DOUBLE*, *STRING*, *BOOLEAN*, *CURRENCY*, and *DATETIME*. The syntax is somewhat inconsistent with the new table constructor using curly braces. Indeed, *DATATABLE* uses curly braces to delimit rows, whereas the anonymous table constructor uses regular parentheses leaving the curly braces only to delimit the entire table.

A strong limitation of *DATATABLE* is that the contents of the table need to be constant values. Using any DAX expression would result in an error. This makes *DATATABLE* a function that is not used much. The table constructor syntax gives developers much more flexibility in terms of expressivity.

One can use *DATATABLE* to define simple, constant calculated tables. In SQL Server Data Tools (SSDT) for Analysis Services Tabular, a calculated table using *DATATABLE* is generated when a developer pastes the content of the clipboard into the model, whereas Power BI uses Power Query to define constant tables. This is another reason why *DATATABLE* is not common among Power BI users.

*GENERATESERIES* is a utility function that generates series of values once the developer provides a lower boundary, an upper boundary, and a step. For example, the following expression produces a table containing 20 values, from 1 to 20:

GENERATESERIES ( 1, 20, 1 )

The resulting data type depends on the input; that can be either a number or a *DateTime*. For example, if the developer needs a table containing the time of the day, this expression provides a quick way of generating an 86,400-row table (one row per second):

Time =
GENERATESERIES (
    TIME ( 0, 0, 0 ),    -- Start value
    TIME ( 23, 59, 59 ), -- End value
    TIME ( 0, 0, 1 )     -- Step: 1 second
)

By changing the step and adding new columns, one could create a smaller table that acts as a suitable dimension—for example, to slice sales by time:

Time =
SELECTCOLUMNS (
    GENERATESERIES (
        TIME ( 0, 0, 0 ),
        TIME ( 23, 59, 59 ),
        TIME ( 0, 30, 0 )
    ),
    "Time", [Value],
    "HH:MM AMPM", FORMAT ( [Value], "HH:MM AM/PM" ),
    "HH:MM", FORMAT ( [Value], "HH:MM" ),
    "Hour", HOUR ( [Value] ),
    "Minute", MINUTE ( [Value] )
)

You can see the result in Figure 12-23.

Using *GENERATESERIES* in a measure is uncommon; instead, the function is typically used to create simple tables that serve as slicers, so the user can select different parameters. For example, Power BI uses *GENERATESERIES* to add parameters for the what-if analysis.
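The what-if feature generates a calculated table and a companion measure along the lines of the following sketch; the *Discount* table name and the column name are our own examples, and the code Power BI actually generates may differ in details:

```dax
Discount =
GENERATESERIES ( 0, 0.50, 0.05 )

Discount Value :=
SELECTEDVALUE ( 'Discount'[Value], 0 )
```

A slicer bound to the *Discount[Value]* column lets the user pick a value, and the measure retrieves the current selection—falling back to 0 when no single value is selected.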

In this chapter, we introduced many new table functions. Still, many have yet to come in the next chapter. Here we focused the attention on the set of table functions that are commonly used to create calculated tables or to implement complex filter arguments for *CALCULATE* and *CALCULATETABLE*. Always remember that the code we provide is an example of what is possible with DAX; we leave it up to the reader’s imagination to find practical scenarios calling for the code in a specific model.

The main functions you learned in this chapter are:

- *ADDCOLUMNS*, to add new columns to the input table.
- *SUMMARIZE*, to perform grouping after the scan of a table.
- *CROSSJOIN*, to perform the cartesian product of two tables.
- *UNION*, *INTERSECT*, and *EXCEPT*, to compute the basic set operations on tables.
- *SELECTCOLUMNS*, to select certain columns of a table.
- *ROW*, *DATATABLE*, and *GENERATESERIES*, to generate mostly constant tables as calculated tables.

In the next chapter, we will describe other table functions focusing more on complex queries or complex calculated tables.
