Working with tables

Tables are an important part of DAX formulas. In previous chapters you learned how to iterate over tables, how to create calculated tables, and how to perform several other calculation techniques that require a table as their starting point. Moreover, *CALCULATE* filter arguments are tables: When authoring complex formulas, the ability to build the correct filter table is of paramount importance. DAX offers a rich set of functions to manage tables. In this chapter we introduce many DAX functions that are helpful for creating and managing tables.

For most of the new functions, we provide some examples that are useful for two purposes: They show how to use the function, and they act as a good DAX exercise to understand how to write complex measures.

The first function to manipulate tables is *CALCULATETABLE*. We have already used *CALCULATETABLE* multiple times in the book prior to this point. In this section we provide a more complete reference to the function, along with some considerations about when to use it.

*CALCULATETABLE* performs the same operations as *CALCULATE*, the only difference being in their result. *CALCULATETABLE* returns a table, whereas *CALCULATE* returns a single value like an integer or a string. As an example, if one needs to produce a table containing only red products, then *CALCULATETABLE* is the function to use:

CALCULATETABLE (
    'Product',
    'Product'[Color] = "Red"
)

A common question is what the difference is between *CALCULATETABLE* and *FILTER*. Indeed, the previous expression can be written with *FILTER* too:

FILTER (
    'Product',
    'Product'[Color] = "Red"
)

Even though the only difference seems to be the function name, in reality the semantics of these two functions are very different. *CALCULATETABLE* operates by changing the filter context first and later evaluating the expression. *FILTER*, on the other hand, iterates the result of its first argument, retrieving the rows that satisfy the condition. In other words, *FILTER* does not change the filter context.

You can appreciate the difference by reviewing the following example:

Red Products CALCULATETABLE =
CALCULATETABLE (
    ADDCOLUMNS (
        VALUES ( 'Product'[Color] ),
        "Num of Products", COUNTROWS ( 'Product' )
    ),
    'Product'[Color] = "Red"
)

The result is in Figure 12-1.

By using *CALCULATETABLE*, the filter context where both *ADDCOLUMNS* and *COUNTROWS* are evaluated is filtering red products. Therefore, the result contains one row only, with Red as the color and 99 as the number of products. In other words, *COUNTROWS* only counted the red products, without requiring a context transition from the row generated by the *VALUES* function.

If one replaces *CALCULATETABLE* with *FILTER*, the result is different. Look at the following table:

Red Products FILTER external =
FILTER (
    ADDCOLUMNS (
        VALUES ( 'Product'[Color] ),
        "Num of Products", COUNTROWS ( 'Product' )
    ),
    'Product'[Color] = "Red"
)

This time, the result is no longer 99; instead, it shows the total number of products, as shown in Figure 12-2.

This table still contains Red for the product color, but now the number of products computes 2,517, which is the total number of products. The reason is that *FILTER* does not change the filter context. Moreover, *FILTER* is evaluated after *ADDCOLUMNS*. Consequently, *ADDCOLUMNS* iterates all the products, and *COUNTROWS* computes the total number of products because there is no context transition. Only later does *FILTER* select the Red row out of all the colors.

If one uses *FILTER* instead of *CALCULATETABLE*, the expression must be written differently, relying on *CALCULATE* to force the context transition:

Red Products FILTER internal =
ADDCOLUMNS (
    FILTER (
        VALUES ( 'Product'[Color] ),
        'Product'[Color] = "Red"
    ),
    "Num of Products", CALCULATE ( COUNTROWS ( 'Product' ) )
)

Now the result is back to 99. In order to obtain the same behavior as *CALCULATETABLE*, we needed to invert the execution order. This way *FILTER* runs first, and then the calculation of the number of rows relies on the context transition to force the row context of *ADDCOLUMNS* to become a filter context for *COUNTROWS*.

*CALCULATETABLE* works by modifying the filter context. It is powerful because it propagates its effect to multiple functions in a DAX expression. Its power comes with limitations in the type of filtering it can create. For example, *CALCULATETABLE* can only apply filters to columns that belong to the data model. If one only needs the customers whose sales amount is greater than one million, then *CALCULATETABLE* is not the right choice, because *Sales Amount* is a measure: *CALCULATETABLE* cannot apply a filter on a measure, whereas *FILTER* can. This is shown in the following expression; replacing *FILTER* with *CALCULATETABLE* is not an option, as it would lead to a syntax error:

Large Customers =
FILTER (
    Customer,
    [Sales Amount] > 1000000
)

*CALCULATETABLE*, like *CALCULATE*, performs a context transition and accepts all the *CALCULATE* modifiers like *ALL*, *USERELATIONSHIP*, *CROSSFILTER*, and many others. Consequently, it is much more powerful than *FILTER*. This is not to say that one should always try to use *CALCULATETABLE* and stop using *FILTER*. Each of the two functions has advantages and disadvantages, and the choice needs to be an educated one.
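As a sketch of how a modifier fits into *CALCULATETABLE* (the table name is illustrative, and the example assumes the model contains an inactive relationship between *Sales[Delivery Date]* and *'Date'[Date]*, as in the sample model used in this book), the following expression returns the sales delivered in 2007 rather than the sales ordered in 2007:

```dax
-- Hypothetical example: USERELATIONSHIP activates the inactive
-- relationship on Sales[Delivery Date] for this evaluation only.
SalesDeliveredIn2007 =
CALCULATETABLE (
    Sales,
    'Date'[Calendar Year] = "CY 2007",
    USERELATIONSHIP ( Sales[Delivery Date], 'Date'[Date] )
)
```

*FILTER* has no equivalent way to change which relationship is active; this is one of the functionalities only available through the *CALCULATE*/*CALCULATETABLE* modifiers.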

As a rule of thumb, one uses *CALCULATETABLE* whenever the filter applies to a model column and/or the other functionalities of *CALCULATETABLE* are needed, like context transition and filter context modifiers.

DAX offers several functions to manipulate tables. These functions can be used to create new calculated tables, to create tables to iterate on, or to use their results as filter arguments in *CALCULATE*. In this section we provide a complete reference of those functions, along with examples. There are also other table functions that are mainly useful in queries. We show them in Chapter 13, “Authoring queries.”

*ADDCOLUMNS* is an iterator that returns all the rows and columns of its first argument, adding newly created columns to the output. For example, the following calculated table definition produces a table with all the colors and the value of sales amount for each color:

ColorsWithSales =
ADDCOLUMNS (
    VALUES ( 'Product'[Color] ),
    "Sales Amount", [Sales Amount]
)

You can see the result in Figure 12-3.

Being an iterator, *ADDCOLUMNS* evaluates the column expressions in a row context. In this example, it computes the sales of the given product color because the expression of *Sales Amount* uses a measure. Thus, there is an automatic *CALCULATE* surrounding *Sales Amount* that generates the context transition. If one uses a regular expression instead of a measure, then an explicit *CALCULATE* is required to force the context transition.
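For example, the following sketch (the table name is illustrative; the columns come from the same sample model) sums a column instead of invoking a measure. Without the explicit *CALCULATE*, every color would get the total quantity of all sales:

```dax
-- CALCULATE turns the iterated color (row context) into a filter
-- context, so SUM only aggregates the rows of that color.
ColorsWithQuantity =
ADDCOLUMNS (
    VALUES ( 'Product'[Color] ),
    "Quantity", CALCULATE ( SUM ( Sales[Quantity] ) )
)
```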

*ADDCOLUMNS* is oftentimes used in conjunction with *FILTER* to obtain filters on temporary calculated columns. For example, to compute the products that sold more than 150,000.00 USD, a possible implementation is the following one:

HighSalesProducts =
VAR ProductsWithSales =
    ADDCOLUMNS (
        VALUES ( 'Product'[Product Name] ),
        "Product Sales", [Sales Amount]
    )
VAR Result =
    FILTER (
        ProductsWithSales,
        [Product Sales] >= 150000
    )
RETURN Result

You can see the result in Figure 12-4.

The same expression can be written in several different ways, even without using *ADDCOLUMNS*. The following code, for example, ends up even simpler than the previous one, even though it does not add a *Product Sales* column to the output:

FILTER (
    VALUES ( 'Product'[Product Name] ),
    [Sales Amount] >= 150000
)

*ADDCOLUMNS* is useful to compute multiple columns or when further calculations are needed after this first step. For example, consider computing the set of products that together represent 15% of total sales. This calculation is no longer trivial because several steps are needed:

1. Computing the sales amount for each product.
2. Computing a running total of sales amount, by aggregating each product with all the products that sold more than the product itself.
3. Transforming the running total into a percentage against the grand total of sales.
4. Only returning the products whose percentage is less than or equal to 15%.

Authoring the full query in a single step is unnecessarily complex, whereas splitting the evaluation in four steps proves much easier:

Top Products =
VAR TotalSales = [Sales Amount]
VAR ProdsWithSales =
    ADDCOLUMNS (
        VALUES ( 'Product'[Product Name] ),
        "ProductSales", [Sales Amount]
    )
VAR ProdsWithRT =
    ADDCOLUMNS (
        ProdsWithSales,
        "RunningTotal",
        VAR SalesOfCurrentProduct = [ProductSales]
        RETURN
            SUMX (
                FILTER (
                    ProdsWithSales,
                    [ProductSales] >= SalesOfCurrentProduct
                ),
                [ProductSales]
            )
    )
VAR Top15Percent =
    FILTER (
        ProdsWithRT,
        [RunningTotal] / TotalSales <= 0.15
    )
RETURN Top15Percent

You can see the result in Figure 12-5.

In the example, we implemented the result as a calculated table, but other uses are possible. For example, one could iterate the *Top15Percent* variable using *SUMX* to create a measure computing the sales of those products.
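As a sketch of that idea (the measure name is illustrative; the variables repeat the definitions of the *Top Products* calculated table), a measure could end with *SUMX* instead of returning the table:

```dax
-- Sketch: same variables as the Top Products calculated table,
-- but the final step aggregates instead of returning rows.
Sales of Top Products :=
VAR TotalSales = [Sales Amount]
VAR ProdsWithSales =
    ADDCOLUMNS (
        VALUES ( 'Product'[Product Name] ),
        "ProductSales", [Sales Amount]
    )
VAR ProdsWithRT =
    ADDCOLUMNS (
        ProdsWithSales,
        "RunningTotal",
        VAR SalesOfCurrentProduct = [ProductSales]
        RETURN
            SUMX (
                FILTER ( ProdsWithSales, [ProductSales] >= SalesOfCurrentProduct ),
                [ProductSales]
            )
    )
VAR Top15Percent =
    FILTER ( ProdsWithRT, [RunningTotal] / TotalSales <= 0.15 )
RETURN
    SUMX ( Top15Percent, [ProductSales] )
```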

As with most other DAX functions, one should think of *ADDCOLUMNS* as one of the many building blocks of DAX. The real power of DAX unfolds when you learn how to leverage those building blocks and have them interact in more sophisticated calculations.

*SUMMARIZE* is one of the most commonly used functions in DAX. It scans a table (its first argument) and groups it by one or more columns of that table or of related tables. The main use of *SUMMARIZE* is to retrieve only the existing combinations of values, rather than the full list of values.

An example would be computing the number of distinct colors sold, to produce a report that shows the number of colors available and the number of colors sold at least once. The following measures would produce the desired result:

Num of colors :=
COUNTROWS ( VALUES ( 'Product'[Color] ) )

Num of colors sold :=
COUNTROWS ( SUMMARIZE ( Sales, 'Product'[Color] ) )

You can see the result of these two measures by brand in the report in Figure 12-6.

In this case we used *SUMMARIZE* to group the sales by *Product[Color]*, and then we counted the number of rows in the result. Because *SUMMARIZE* performs a group by, it only returns the colors referenced by *Sales*. On the other hand, *VALUES ( Product[Color] )* returns all the existing colors whether they have sales or not.

Using *SUMMARIZE*, one can group data by any number of columns, provided that the columns used as parameters are reachable from *Sales* by following many-to-one or one-to-one relationships. For example, to compute the average quantity sold per product and per day, this is one possible implementation:

AvgDailyQty :=
VAR ProductsDatesWithSales =
    SUMMARIZE (
        Sales,
        'Product'[Product Name],
        'Date'[Date]
    )
VAR Result =
    AVERAGEX (
        ProductsDatesWithSales,
        CALCULATE ( SUM ( Sales[Quantity] ) )
    )
RETURN Result

You can see the result of this measure in Figure 12-7.

In this case, we used *SUMMARIZE* to scan *Sales* and to group it by product name and by date. The resulting table contains the product name and the date, only considering days with sales for that product. *AVERAGEX* takes care of computing the average over each row of the temporary table returned by *SUMMARIZE*. If a product has no sales on a given day, the resulting table does not contain that product/date combination.

*SUMMARIZE* can also be used like *ADDCOLUMNS* to add further columns to the result. For example, the previous measure could also be authored the following way:

AvgDailyQty :=
VAR ProductsDatesWithSalesAndQuantity =
    SUMMARIZE (
        Sales,
        'Product'[Product Name],
        'Date'[Date],
        "Daily qty", SUM ( Sales[Quantity] )
    )
VAR Result =
    AVERAGEX (
        ProductsDatesWithSalesAndQuantity,
        [Daily qty]
    )
RETURN Result

In this case *SUMMARIZE* returns a table that contains the product name, the date, and a newly introduced column named *Daily qty*. *Daily qty* is later averaged by *AVERAGEX*. Nevertheless, **the use of *SUMMARIZE* to create temporary columns is deprecated** because of the complex semantics it applies when evaluating the new columns. The best practice is to use *SUMMARIZE* only to group, adding the columns with *ADDCOLUMNS*, as in the following equivalent version:

AvgDailyQty :=
VAR ProductsDatesWithSales =
    SUMMARIZE (
        Sales,
        'Product'[Product Name],
        'Date'[Date]
    )
VAR ProductsDatesWithSalesAndQuantity =
    ADDCOLUMNS (
        ProductsDatesWithSales,
        "Daily qty", CALCULATE ( SUM ( Sales[Quantity] ) )
    )
VAR Result =
    AVERAGEX (
        ProductsDatesWithSalesAndQuantity,
        [Daily qty]
    )
RETURN Result

Despite the code being more verbose, it is much easier to read and write because there is a single row context used in a context transition. This row context is introduced by *ADDCOLUMNS* while iterating over the result of *SUMMARIZE*. This pattern results in simpler (and most of the time faster) code.

It is possible to use more optional parameters with *SUMMARIZE*, intended for the calculation of subtotals and for adding columns to the result. We deliberately decided not to write about them to make the point even stronger: *SUMMARIZE works fine when grouping tables and should not be used to compute additional columns*. Although you can find code on the web that still uses *SUMMARIZE* to author new columns, consider it a very bad practice and always replace it with the pair *ADDCOLUMNS*/*SUMMARIZE*.

*CROSSJOIN* performs the cross-join of two tables, returning the Cartesian product of the two input tables. In other words, it returns all possible combinations of the values in the input tables. For example, the following expression returns all the combinations of product names and years:

CROSSJOIN (
    ALL ( 'Product'[Product Name] ),
    ALL ( 'Date'[Calendar Year] )
)

If the model contains 1,000 product names and five years, the resulting table contains 5,000 rows. *CROSSJOIN* is more often used in queries than in measures. Nevertheless, there are some scenarios where the use of *CROSSJOIN* becomes relevant—mostly because of performance.

For example, consider the need for an *OR* condition between two different columns in a *CALCULATE* filter argument. Because *CALCULATE* merges its filter arguments with an intersection, implementing the *OR* condition requires closer attention. For example, this is a possible implementation of *CALCULATE* filtering all the products that belong to the Audio category or have a Black color:

AudioOrBlackSales :=
VAR CategoriesColors =
    SUMMARIZE (
        'Product',
        'Product'[Category],
        'Product'[Color]
    )
VAR AudioOrBlack =
    FILTER (
        CategoriesColors,
        OR (
            'Product'[Category] = "Audio",
            'Product'[Color] = "Black"
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], AudioOrBlack )
RETURN Result

The previous code works well, and it is optimal also from a performance point of view. *SUMMARIZE* scans the *Product* table, which is expected to contain a small number of rows. Thus, the evaluation of the filter is very quick.

If the requirement is to filter columns from different tables like color and year, then things are different. Indeed, one could extend the previous example; but to summarize by columns from two separate tables, *SUMMARIZE* needs to scan the *Sales* table:

AudioOr2007 Sales :=
VAR CategoriesYears =
    SUMMARIZE (
        Sales,
        'Product'[Category],
        'Date'[Calendar Year]
    )
VAR Audio2007 =
    FILTER (
        CategoriesYears,
        OR (
            'Product'[Category] = "Audio",
            'Date'[Calendar Year] = "CY 2007"
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], Audio2007 )
RETURN Result

*Sales* is not a small table; it might contain hundreds of millions of rows. Scanning it to retrieve the existing combinations of category and years could result in an expensive operation. Regardless, the resulting filter is not going to be large because there are only a few categories and years, yet the engine needs to scan a large table to retrieve the filter.

In that scenario, we recommend you build all the combinations of category and year, producing a small table; you would then filter that table, as in the following code:

AudioOr2007 Sales :=
VAR CategoriesYears =
    CROSSJOIN (
        VALUES ( 'Product'[Category] ),
        VALUES ( 'Date'[Calendar Year] )
    )
VAR Audio2007 =
    FILTER (
        CategoriesYears,
        OR (
            'Product'[Category] = "Audio",
            'Date'[Calendar Year] = "CY 2007"
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], Audio2007 )
RETURN Result

The full *CROSSJOIN* of categories and years contains a few hundred rows and the execution of this last version of the measure is faster.

*CROSSJOIN* is not only useful to speed up calculations. Sometimes, one is interested in retrieving rows even when no event happened. For example, by using *SUMMARIZE* to scan sales by category and country, the result only contains the categories and countries with sales for certain products. This is the intended behavior of *SUMMARIZE*, so it is not surprising. However, sometimes the absence of an event is more important than its presence. For example, one might want to investigate which brands have no sales in certain regions. In that case, the measure needs to build a more complex expression involving a *CROSSJOIN*, so as to also retrieve nonexistent combinations of values. We will provide more examples of *CROSSJOIN* in the next chapter.
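A hedged sketch of that idea (the table name is illustrative, and the *'Product'[Brand]* column is an assumption based on the sample model): cross-join brands and store regions, then keep only the combinations where the sales measure is blank:

```dax
-- Combinations of brand and store region with no sales at all.
-- Inside FILTER, the [Sales Amount] measure triggers a context
-- transition, so it is evaluated for each brand/region pair.
BrandRegionNoSales =
VAR AllCombinations =
    CROSSJOIN (
        ALL ( 'Product'[Brand] ),
        ALL ( Store[CountryRegion] )
    )
VAR Result =
    FILTER (
        AllCombinations,
        ISBLANK ( [Sales Amount] )
    )
RETURN Result
```

A *SUMMARIZE* over *Sales* could never return these rows, because by definition they have no corresponding sales events.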

*UNION* is a set function that performs the union of two tables. The ability to combine different tables into a single table can be important in certain circumstances. It is mainly used in calculated tables, much less frequently in measures. For example, the following table contains all the countries from both the *Customer* and the *Store* tables:

AllCountryRegions =
UNION (
    ALL ( Customer[CountryRegion] ),
    ALL ( Store[CountryRegion] )
)

You can look at the result in Figure 12-8.

*UNION* does not remove duplicates before returning a result. Thus, if Australia belongs to the countries of both customers and stores, it appears twice in the resulting table. If needed, leveraging the *DISTINCT* function will remove duplicates.

We have described and used *DISTINCT* multiple times prior to this point, to obtain the distinct values of a column as visible in the current filter context. *DISTINCT* can also be used with a table expression as a parameter, and in that special case it returns the distinct rows of the table. Thus, the following is a good implementation removing potential duplicates from the *CountryRegion* column:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
RETURN UniqueCountryRegions

You can see the resulting table in Figure 12-9.

*UNION* maintains the data lineage of the input tables only if the lineage of both tables is the same. In the previous formula, the result of *DISTINCT* does not retain the lineage of either column, because the first table contains *Customer[CountryRegion]* and the second table contains *Store[CountryRegion]*. Because the data lineage of the input tables is different, the result has a new lineage not corresponding to any of the existing columns. Therefore, the following calculated table returns the same grand total of sales on all the rows:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
VAR Result =
    ADDCOLUMNS (
        UniqueCountryRegions,
        "Sales Amount", [Sales Amount]
    )
RETURN Result

The result is presented in Figure 12-10.

If the calculated table needs to contain both the sales amount and the number of stores, including all country regions of both customers and stores, then the filtering must be handled manually through a more complex expression:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
VAR Result =
    ADDCOLUMNS (
        UniqueCountryRegions,
        "Customer Sales Amount",
            VAR CurrentRegion = [CountryRegion]
            RETURN
                CALCULATE (
                    [Sales Amount],
                    Customer[CountryRegion] = CurrentRegion
                ),
        "Number of stores",
            VAR CurrentRegion = [CountryRegion]
            RETURN
                CALCULATE (
                    COUNTROWS ( Store ),
                    Store[CountryRegion] = CurrentRegion
                )
    )
RETURN Result

You see the result in Figure 12-11.

In the previous example *CALCULATE* applies a filter on either the customer or the store country region, using the value currently iterated by *ADDCOLUMNS* on the result of *UNION*. Another option to obtain the same result is to restore the lineage using *TREATAS* (see Chapter 10, "Working with the filter context," for more information about *TREATAS*), as in the following equivalent expression:

DistinctCountryRegions =
VAR CountryRegions =
    UNION (
        ALL ( Customer[CountryRegion] ),
        ALL ( Store[CountryRegion] )
    )
VAR UniqueCountryRegions =
    DISTINCT ( CountryRegions )
VAR Result =
    ADDCOLUMNS (
        UniqueCountryRegions,
        "Customer Sales Amount",
            CALCULATE (
                [Sales Amount],
                TREATAS ( { [CountryRegion] }, Customer[CountryRegion] )
            ),
        "Number of stores",
            CALCULATE (
                COUNTROWS ( Store ),
                TREATAS ( { [CountryRegion] }, Store[CountryRegion] )
            )
    )
RETURN Result

The result of our last two examples is the same; what differs is the technique used to move the filter from a new column to one that is part of the model. Moreover, in this last example you may notice the use of a table constructor: The curly braces transform *CountryRegion* into a table that can be used as a parameter of *TREATAS*.

Because *UNION* loses the data lineage if values come from different columns, *TREATAS* is a convenient function to control the data lineage of the result. It is worth noting that *TREATAS* ignores values that do not exist in the target columns.
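For example, in the following sketch (the measure name is illustrative, and "Just A Test" is a made-up value assumed not to exist in *'Product'[Color]*), *TREATAS* silently discards the nonexistent value, so only Red and Blue filter the column:

```dax
-- The nonexistent value is ignored by TREATAS; the resulting
-- filter context only contains the colors Red and Blue.
RedBlueSales :=
CALCULATE (
    [Sales Amount],
    TREATAS (
        { "Red", "Blue", "Just A Test" },
        'Product'[Color]
    )
)
```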

*INTERSECT* is a set function much like *UNION*. However, instead of appending one table to another, it returns the intersection of the two tables, that is, only the rows that appear in both tables. It was popular before the introduction of *TREATAS* because it allowed one to apply the result of a table expression as a filter to other tables and columns. Since the introduction of *TREATAS*, the number of use cases for *INTERSECT* has been greatly reduced.

For example, if one needs to retrieve the customers who bought in both 2007 and 2008, a possible implementation is the following:

CustomersBuyingInTwoYears =
VAR Customers2007 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2007"
    )
VAR Customers2008 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2008"
    )
VAR Result =
    INTERSECT ( Customers2007, Customers2008 )
RETURN Result

From the lineage point of view, *INTERSECT* retains the data lineage of the first table. In the previous example, both tables have the same data lineage. If the two input tables have different data lineages, then only the lineage of the first table is kept. For example, the countries where there are both customers and stores can be expressed as follows:

INTERSECT (
    ALL ( Store[CountryRegion] ),
    ALL ( Customer[CountryRegion] )
)

In this latter example, the lineage is that of *Store[CountryRegion]*. Consequently, a more complex expression like the following returns the sales filtered by *Store[CountryRegion]*, not *Customer[CountryRegion]*:

SalesStoresInCustomersCountries =
VAR CountriesWithStoresAndCustomers =
    INTERSECT (
        ALL ( Store[CountryRegion] ),
        ALL ( Customer[CountryRegion] )
    )
VAR Result =
    ADDCOLUMNS (
        CountriesWithStoresAndCustomers,
        "StoresSales", [Sales Amount]
    )
RETURN Result

You see the result of this expression in Figure 12-12.

In this latter example, the *StoresSales* column contains the sales related to the country of the store.

*EXCEPT* is the last of the set functions introduced in this section. *EXCEPT* removes the rows present in the second table from the first table. As such, it implements set subtraction with two tables. For example, if one is interested in customers who bought a product in 2007 but not in 2008, one possible implementation is the following:

CustomersBuyingIn2007butNotIn2008 =
VAR Customers2007 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2007"
    )
VAR Customers2008 =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Customer[Customer Code] ),
        'Date'[Calendar Year] = "CY 2008"
    )
VAR Result =
    EXCEPT ( Customers2007, Customers2008 )
RETURN Result

The first rows of the calculated table are visible in Figure 12-13.

As usual, one could use the previous calculation as a filter argument of *CALCULATE* to obtain the sales amount of those customers. *EXCEPT* is frequently used when analyzing customer behavior. For example, a common calculation for many businesses is establishing the number of new customers, returning customers, and lost customers.

There are several possible implementations of the same calculations, each one targeted to a specific data model. The following implementation is not always the most optimal, but it is flexible and easy to understand. To compute the number of customers who did not buy anything last year but bought something this year, the following measure removes customers who bought a product in the previous year from the set of current customers:

SalesOfNewCustomers :=
VAR CurrentCustomers =
    VALUES ( Sales[CustomerKey] )
VAR CustomersLastYear =
    CALCULATETABLE (
        VALUES ( Sales[CustomerKey] ),
        DATESINPERIOD (
            'Date'[Date],
            MIN ( 'Date'[Date] ) - 1,
            -1,
            YEAR
        )
    )
VAR CustomersNotInLastYear =
    EXCEPT ( CurrentCustomers, CustomersLastYear )
VAR Result =
    CALCULATE ( [Sales Amount], CustomersNotInLastYear )
RETURN Result

The implementation of this code as a measure works with any filter and provides a flexible way to slice by any column. Please be mindful that this implementation of new customers is not the best in terms of performance. We used it here to demonstrate a possible usage of *EXCEPT*. Later in this chapter we show a much faster version of the same calculation, although a bit more complex to learn.

From the lineage point of view, *EXCEPT* retains the data lineage of the first table, as was the case with *INTERSECT*. For example, the following expression computes the sales made to customers living in countries where there are no stores:

SalesInCountriesWithNoStores :=
VAR CountriesWithActiveStores =
    CALCULATETABLE (
        SUMMARIZE ( Sales, Store[CountryRegion] ),
        ALL ( Sales )
    )
VAR CountriesWithSales =
    SUMMARIZE ( Sales, Customer[CountryRegion] )
VAR CountriesWithNoStores =
    EXCEPT ( CountriesWithSales, CountriesWithActiveStores )
VAR Result =
    CALCULATE ( [Sales Amount], CountriesWithNoStores )
RETURN Result

The result of *EXCEPT* filters the *Customer[CountryRegion]* column because it is the column used by the table taken as the first argument of *EXCEPT*.

Functions manipulating tables are oftentimes used to build complex filters for *CALCULATE* parameters. In this section, we provide further examples, always leading you one step further in your understanding of DAX.

A first example where manipulating tables proves to be a useful skill is the following. Imagine having to implement an *OR* condition between the selections made in different slicers, instead of the default *AND* behavior provided by client tools like Excel and Power BI.

The report in Figure 12-14 contains two slicers. The default behavior of Power BI is to intersect the two conditions. As a consequence, the numbers shown represent the sales of Home Appliances to customers with a High School education.

Instead of intersecting the two conditions, one might want to merge them. In other words, the numbers shown in the report need to be the sales of products sold to customers with a High School education or the sales of Home Appliances. Because Power BI does not support “or” conditions between slicers, one can solve the problem by using DAX.

Remember that each cell of the report has a filter context containing both a filter on the category and a filter on the education. Both filters need to be replaced. There are several possible solutions to the same pattern; we demonstrate the use of three different formulas.

The first, and probably the easiest expression of that filter, is the following:

OR 1 :=
VAR CategoriesEducations =
    CROSSJOIN (
        ALL ( 'Product'[Category] ),
        ALL ( Customer[Education] )
    )
VAR CategoriesEducationsSelected =
    FILTER (
        CategoriesEducations,
        OR (
            'Product'[Category] IN VALUES ( 'Product'[Category] ),
            Customer[Education] IN VALUES ( Customer[Education] )
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], CategoriesEducationsSelected )
RETURN Result

The measure first builds the cross-join of all the categories and education levels. Once the table is prepared, it filters out the rows that do not satisfy the condition, and finally it uses the resulting table as a filter argument for *CALCULATE*. *CALCULATE* overrides the current filter on both the category and the education, resulting in the report in Figure 12-15.

The first implementation of the measure is already simple—both to use and to understand. In case there is a large number of rows in the columns used to filter the *OR* condition, or when there are more than just two conditions, the resulting temporary table would quickly become huge. In such a case, one can limit its size by removing the *CROSSJOIN* in favor of *SUMMARIZE*, as in this second implementation of the same measure:

OR 2 :=
VAR CategoriesEducations =
    CALCULATETABLE (
        SUMMARIZE (
            Sales,
            'Product'[Category],
            Customer[Education]
        ),
        ALL ( 'Product'[Category] ),
        ALL ( Customer[Education] )
    )
VAR CategoriesEducationsSelected =
    FILTER (
        CategoriesEducations,
        OR (
            'Product'[Category] IN VALUES ( 'Product'[Category] ),
            Customer[Education] IN VALUES ( Customer[Education] )
        )
    )
VAR Result =
    CALCULATE ( [Sales Amount], CategoriesEducationsSelected )
RETURN Result

The logic of this second implementation is close to the first one, the only noticeable difference being the presence of *SUMMARIZE* instead of *CROSSJOIN*. Moreover, it is worth pointing out that *SUMMARIZE* needs to be executed in a filter context without the filters on *Category* and *Education*. Otherwise, the slicers would affect the calculation executed by *SUMMARIZE*, defeating the purpose of the filter.

There is at least a third solution to the same scenario, potentially faster though harder to understand at first sight. Indeed, the same table filter can be expressed by reasoning that if the category is in the selected values for the categories, then any value for the education level is fine. The same happens for the education level: As long as the education level is in the selected values, then any category is fine. This reasoning leads to the third formulation of the same expression:

OR 3 :=
VAR Categories =
    CROSSJOIN (
        VALUES ( 'Product'[Category] ),
        ALL ( Customer[Education] )
    )
VAR Educations =
    CROSSJOIN (
        ALL ( 'Product'[Category] ),
        VALUES ( Customer[Education] )
    )
VAR CategoriesEducationsSelected =
    UNION ( Categories, Educations )
VAR Result =
    CALCULATE ( [Sales Amount], CategoriesEducationsSelected )
RETURN Result

As you can see, one can author the same formula in several ways. The difference is in both readability and performance. Being able to write the same formula using different methods is a skill that will prove extremely useful in the final optimization chapters, where you learn to evaluate the performance of different versions of the same code, seeking the most optimal.

As another example of a useful calculation involving the manipulation of tables, we demonstrate how to analyze sales over time but only considering customers who made a purchase during the first year of a selected time period. In other words, we consider the first year with sales in the visual, evaluate the customers who bought during that first year, and then we only analyze the sales of those customers over the following years, ignoring those who became customers afterwards.

The code needs to perform three steps:

1. Determining the first year with sales on any product.
2. Storing the set of customers of that first year into a variable, ignoring any other filter.
3. Computing the sales of the customers determined in step 2 in the current period.

The following code implements this algorithm by using variables to store temporary results:

SalesOfFirstYearCustomers :=
VAR FirstYearWithSales =
    CALCULATETABLE (
        FIRSTNONBLANK ( 'Date'[Calendar Year], [Sales Amount] ),
        ALLSELECTED ()
    )
VAR CustomersFirstYear =
    CALCULATETABLE (
        VALUES ( Sales[CustomerKey] ),
        FirstYearWithSales,
        ALLSELECTED ()
    )
VAR Result =
    CALCULATE (
        [Sales Amount],
        KEEPFILTERS ( CustomersFirstYear )
    )
RETURN
    Result

The *FirstYearWithSales* variable stores the first year with sales. Please note that *FIRSTNONBLANK* returns a table as a result, with the data lineage of *Date[Calendar Year]*. The *CustomersFirstYear* variable retrieves the list of all customers in that first year. The last step is the easiest because it only applies the filter on the customer; in each cell of the report, the value of *Sales Amount* is restricted to only the customers found during the second step. The *KEEPFILTERS* modifier makes it possible to filter these customers by country, for example.

The result is visible in Figure 12-16, indicating that—after the first year—sales made to those customers are decreasing over time.

This latter example is important to learn. Indeed, there are several scenarios where one needs to place a filter over time, compute a set, and finally analyze the behavior of this set (of customers, products, stores) over different years. With this pattern one can easily implement same-store analyses or any other calculation with similar requirements.

In a previous section of this chapter about *EXCEPT*, we showed how to compute new customers. In this section we provide a much better implementation of the same calculation that—again—makes heavy use of table functions.

The idea in this new algorithm is the following: First we determine the earliest day when each customer made a purchase. Once this table is available, the formula checks if the first sale to the customer falls within the current time period. If that holds true, it means that the customer—in the current period—is a new customer.

Here is the code of the measure:

New Customers :=
VAR CustomersFirstSale =
    CALCULATETABLE (
        ADDCOLUMNS (
            VALUES ( Sales[CustomerKey] ),
            "FirstSale", CALCULATE ( MIN ( Sales[Order Date] ) )
        ),
        ALL ( 'Date' )
    )
VAR CustomersWith1stSaleInCurrentPeriod =
    FILTER (
        CustomersFirstSale,
        [FirstSale] IN VALUES ( 'Date'[Date] )
    )
VAR Result =
    COUNTROWS ( CustomersWith1stSaleInCurrentPeriod )
RETURN
    Result

The *CustomersFirstSale* variable needs to use *ALL* on the *Date* table so that the first sale date of each customer is computed over all periods, including sales that happened before the current time period. You can see the resulting report in Figure 12-17.

The way it is written now, if a user further filters other tables like the product category, a customer is considered new when they buy the selected category for the first time. Thus, one individual customer might be considered new multiple times, depending on the filters applied. By adding further *CALCULATE* modifiers to the computation of the first variable, it is possible to implement several different variations of the same code. For example, by adding *ALL ( 'Product' )*, customers are considered new only the first time they buy any product, regardless of the category selected. By adding *ALL ( Store )*, customers are new only the first time they buy in any store.
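As a sketch of one such variation—the measure name *New Customers Any Product* is our own, and the rest follows the pattern of the previous measure—adding *ALL ( 'Product' )* to the *CALCULATETABLE* removes the product filter when searching for the first sale:

```dax
New Customers Any Product :=
VAR CustomersFirstSale =
    CALCULATETABLE (
        ADDCOLUMNS (
            VALUES ( Sales[CustomerKey] ),
            "FirstSale", CALCULATE ( MIN ( Sales[Order Date] ) )
        ),
        ALL ( 'Date' ),
        ALL ( 'Product' )   -- First sale of any product, not only the selected ones
    )
VAR CustomersWith1stSaleInCurrentPeriod =
    FILTER (
        CustomersFirstSale,
        [FirstSale] IN VALUES ( 'Date'[Date] )
    )
VAR Result =
    COUNTROWS ( CustomersWith1stSaleInCurrentPeriod )
RETURN
    Result
```

With this version, a customer filtered by a product category slicer still counts as new only in the period of their very first purchase overall.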

The PivotTable in Excel offers a feature to retrieve the underlying data used to compute a cell. That feature is called “Show Details” in the Excel user interface, and its more technical name is “drillthrough.” This name could be confusing because in Power BI the term “drillthrough” refers to a feature that allows the user to move from one report page to another, in a manner controlled by the report author. For this reason, the feature that allows control over the “Show Details” result was called “Detail Rows Expression” in the Tabular model and was introduced in SQL Server Analysis Services 2017. As of April 2019, it is not available in Power BI, but support is planned for a future release.

The Detail Rows Expression is a DAX table expression associated with a measure and invoked to retrieve the table for the Show Details feature. This expression is executed in the filter context of the measure. The idea is that if a measure changes the filter context to compute a variable, the Detail Rows Expression should apply a similar transformation to the filter context.

For example, consider *Sales YTD* that computes the year-to-date value of *Sales Amount*:

Sales YTD := CALCULATE ( [Sales Amount], DATESYTD ( 'Date'[Date] ) )

The corresponding Detail Rows Expression should be a *CALCULATETABLE* that applies the same filter context transformation as the one made by the corresponding measure. For example, the following expression returns all the columns of the *Sales* table from the beginning of the year considered in the calculation:

CALCULATETABLE ( Sales, DATESYTD ( 'Date'[Date] ) )

A client tool retrieves the details of a measure by invoking a specific DAX function called *DETAILROWS*, specifying the measure that the Detail Rows Expression belongs to:

DETAILROWS ( [Sales YTD] )

The *DETAILROWS* function invokes a table expression stored in a measure. Therefore, one can define hidden measures just to store long table expressions often used as filter arguments of many other DAX measures. For example, consider a *Cumulative Total* measure with a Detail Rows Expression that retrieves any dates less than or equal to the maximum date available in the filter context:

-- Detail Rows Expression for the Cumulative Total measure
VAR LastDateSelected = MAX ( 'Date'[Date] )
RETURN
    FILTER (
        ALL ( 'Date'[Date] ),
        'Date'[Date] <= LastDateSelected
    )

One can reference this table expression in different measures by using the *DETAILROWS* function:

Cumulative Sales Amount :=
CALCULATE (
    [Sales Amount],
    DETAILROWS ( [Cumulative Total] )
)

Cumulative Total Cost :=
CALCULATE (
    [Total Cost],
    DETAILROWS ( [Cumulative Total] )
)

More detailed examples of this technique are available at https://www.sqlbi.com/articles/creating-table-functions-in-dax-using-detailrows/. However, reusing table expressions with *DETAILROWS* is just a workaround for the lack of custom-defined functions in DAX, and it may have performance implications. Many use cases for *DETAILROWS* can be solved by using calculation groups, and this technique will become obsolete once DAX introduces measures returning tables or custom-defined DAX functions.

All the table functions shown in the previous sections can be used either as table filters in *CALCULATE* or to create calculated tables and queries. Earlier on, we described the ones more likely to be used as table filters, whereas in this section we describe some additional functions that are mostly used when creating calculated tables. There are other table functions whose main usage is in authoring queries; we will describe them in the next chapter. Nevertheless, be mindful that there are no restrictions on the use of table functions: Nothing prevents anyone from using *DATATABLE*, *SELECTCOLUMNS*, or *GENERATESERIES* (some of the functions described later) in a measure or as a table filter. It is only a matter of convenience: Some functions better fit certain specific needs.

*SELECTCOLUMNS* is useful to reduce the number of columns in a table, and it also provides the capability to add new columns like *ADDCOLUMNS* does. In practice, *SELECTCOLUMNS* implements projection of columns like the SQL *SELECT* statement.

The most common usage of *SELECTCOLUMNS* is to scan a table and only return some of the columns. For example, the following expression only returns the customer education and gender, that is, two columns:

SELECTCOLUMNS (
    Customer,
    "Education", Customer[Education],
    "Gender", Customer[Gender]
)

The result contains a lot of duplicates, as you can see in Figure 12-18.

*SELECTCOLUMNS* is very different from *SUMMARIZE*. *SUMMARIZE* performs a grouping of the result, whereas *SELECTCOLUMNS* only reduces the number of columns. Therefore, the output of *SELECTCOLUMNS* might contain duplicates, whereas the output of *SUMMARIZE* does not. One needs to provide *SELECTCOLUMNS* with pairs of names and expressions for each column in the resulting set. The resulting columns can also be new ones. For example, the following formula returns a new column named *Customer* containing the name followed by its code in parentheses:
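To make the contrast concrete, the following sketch groups the same two columns with *SUMMARIZE*; unlike the *SELECTCOLUMNS* version above, it returns each combination of education and gender only once:

```dax
SUMMARIZE (
    Customer,
    Customer[Education],
    Customer[Gender]
)
```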

SELECTCOLUMNS (
    Customer,
    "Education", Customer[Education],
    "Gender", Customer[Gender],
    "Customer", Customer[Name] & " (" & Customer[Customer Code] & ")"
)

You can see the result in Figure 12-19.

*SELECTCOLUMNS* maintains the data lineage if the expression is a single column reference, whereas it generates a new data lineage whenever one uses an expression. Consequently, the following result contains two columns: The first column has the data lineage of *Customer[Name]*, whereas the second column has a different data lineage that cannot filter the original columns, even though the content of the two columns is the same:

SELECTCOLUMNS (
    Customer,
    "Customer Name with lineage", Customer[Name],
    "Customer Name without lineage", Customer[Name] & ""
)

*ROW* is a simple function that returns a table with only one row. *ROW* requires pairs of name and expression, and the result is a table with one row and a suitable number of columns. For example, the following expression is a table with one row and two columns, containing the sales amount and the quantity sold:

ROW (
    "Sales", [Sales Amount],
    "Quantity", SUM ( Sales[Quantity] )
)

The result is a table with one row and two columns, as you can see in Figure 12-20.

*ROW* is no longer commonly used since the table constructor syntax was introduced. Indeed, the previous expression can be written as:

{ ( [Sales Amount], SUM ( Sales[Quantity] ) ) }

The column names are generated automatically by the table constructor syntax, as shown in Figure 12-21.

When using the table constructor syntax, commas separate rows. To include multiple columns, one needs to use parentheses to encapsulate multiple columns in a single row. The main difference between the *ROW* function and the curly braces syntax is that *ROW* specifies names for the columns, whereas the curly braces automatically generate names for the columns. The latter makes it harder to later reference column values.
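As an illustration of the syntax—the values here are arbitrary examples—the following table constructor defines a table with three rows and two columns; DAX automatically names the columns *Value1* and *Value2*:

```dax
{
    ( "LOW", 0 ),
    ( "MEDIUM", 20 ),
    ( "HIGH", 50 )
}
```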

*ROW* is useful when wanting to create a table with a single row. On the other hand, to create multiple rows, one would use *DATATABLE*. *DATATABLE* creates a table specifying not only the column names, but also the data type of each column and its content. For example, if one needs a table with three rows to cluster prices, an easy way to build the table is the following expression:

DATATABLE (
    "Segment", STRING,
    "Min", DOUBLE,
    "Max", DOUBLE,
    {
        { "LOW", 0, 20 },
        { "MEDIUM", 20, 50 },
        { "HIGH", 50, 99 }
    }
)

You can see the result in Figure 12-22.

The data type of columns can be any of the following values: *INTEGER*, *DOUBLE*, *STRING*, *BOOLEAN*, *CURRENCY*, and *DATETIME*. The syntax is somewhat inconsistent with the new table constructor using curly braces. Indeed, *DATATABLE* uses curly braces to delimit rows, whereas the anonymous table constructor uses regular parentheses leaving the curly braces only to delimit the entire table.

A strong limitation of *DATATABLE* is that the contents of the table need to be constant values. Using any DAX expression would result in an error. This makes *DATATABLE* a function that is not used much. The table constructor syntax gives developers much more flexibility in terms of expressivity.

One can use *DATATABLE* to define simple, constant calculated tables. In SQL Server Data Tools (SSDT) for Analysis Services Tabular, a calculated table using *DATATABLE* is generated when a developer pastes the content of the clipboard into the model, whereas Power BI uses Power Query to define constant tables. This is another reason why *DATATABLE* is not common among Power BI users.

*GENERATESERIES* is a utility function that generates series of values once the developer provides a lower boundary, an upper boundary, and a step. For example, the following expression produces a table containing 20 values, from 1 to 20:

GENERATESERIES ( 1, 20, 1 )

The resulting data type depends on the input; that can be either a number or a *DateTime*. For example, if the developer needs a table containing the time of the day, this expression provides a quick way of generating an 86,400-row table (one row per second):

Time =
GENERATESERIES (
    TIME ( 0, 0, 0 ),    -- Start value
    TIME ( 23, 59, 59 ), -- End value
    TIME ( 0, 0, 1 )     -- Step: 1 second
)

By changing the step and adding new columns, one could create a smaller table that acts as a suitable dimension—for example, to slice sales by time:

Time =
SELECTCOLUMNS (
    GENERATESERIES (
        TIME ( 0, 0, 0 ),
        TIME ( 23, 59, 59 ),
        TIME ( 0, 30, 0 )
    ),
    "Time", [Value],
    "HH:MM AMPM", FORMAT ( [Value], "HH:MM AM/PM" ),
    "HH:MM", FORMAT ( [Value], "HH:MM" ),
    "Hour", HOUR ( [Value] ),
    "Minute", MINUTE ( [Value] )
)

You can see the result in Figure 12-23.

Using *GENERATESERIES* in a measure is uncommon; instead, the function is typically used to create simple tables that serve as slicers, so the user can select different parameters. For example, Power BI uses *GENERATESERIES* to add parameters for the what-if analysis.
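The what-if feature generates a calculated table and a companion measure along the lines of the following sketch; the *Discount* table name and the column name are our own examples, and the code Power BI actually generates may differ in details:

```dax
Discount =
GENERATESERIES ( 0, 0.50, 0.05 )

Discount Value :=
SELECTEDVALUE ( 'Discount'[Value], 0 )
```

A slicer bound to the *Discount[Value]* column lets the user pick a value, and the measure retrieves the current selection—falling back to 0 when no single value is selected.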

In this chapter, we introduced many new table functions. Still, many have yet to come in the next chapter. Here we focused the attention on the set of table functions that are commonly used to create calculated tables or to implement complex filter arguments for *CALCULATE* and *CALCULATETABLE*. Always remember that the code we provide is an example of what is possible with DAX; we leave it up to the reader’s imagination to find practical scenarios calling for the code in a specific model.

The main functions you learned in this chapter are:

- *ADDCOLUMNS*, to add new columns to the input table.
- *SUMMARIZE*, to perform grouping after the scan of a table.
- *CROSSJOIN*, to perform the cartesian product of two tables.
- *UNION*, *INTERSECT*, and *EXCEPT*, to compute the basic set operations on tables.
- *SELECTCOLUMNS*, to select certain columns of a table.
- *ROW*, *DATATABLE*, and *GENERATESERIES*, to generate mostly constant tables as calculated tables.

In the next chapter, we will describe other table functions focusing more on complex queries or complex calculated tables.
