Besides cleaning up your data inputs, Tableau Prep can be used to increase the value of your new dataset by augmenting it with complementary data. This can be done by extending the dataset vertically, adding more rows, or horizontally, by adding new data columns. Performing such data preparation tasks within Tableau Prep allows you to create a dataset that includes key data from multiple inputs, making the end result a comprehensive dataset for analysis.
In this chapter, you will learn how to combine different datasets by using a variety of different methods. Combining data is one of the most common actions in data preparation. Most organizations source data from multiple systems and combining that data into a holistic dataset allows more insightful analysis than looking at each dataset in isolation.
In this chapter, you'll find the following recipes to help you combine your data for analytics:
To follow along with the recipes in this chapter, you will require Tableau Prep Builder. The recipes in this chapter use sample data files that you can download from the book's GitHub repository at https://github.com/PacktPublishing/Tableau-Prep-Cookbook.
Data is typically produced by multiple systems and certain systems may produce similar data that needs to be combined vertically. That is, the rows need to be stacked on top of one another. A use case we'll use in this recipe is combining sales data from two different sales systems, in order to get the total sales dataset prepared. This is a typical scenario you may encounter when your organization is operating multiple systems, is migrating from one system to another, is maintaining legacy systems, or is integrating systems from a partner or acquired company. In this recipe, we'll combine multiple datasets using Union.
To follow along with this recipe, download the Sample Files 5.1 folder from this book's GitHub repository.
Start by opening Tableau Prep and perform the following steps:
Drag the connection so that Union is highlighted, and then release. Tableau Prep will instantly add a step named Union 1:
Important note
In the bottom pane, take note of the two colors Tableau Prep has assigned to each data connection; blue for DataExport_NOV_Sales.csv and amber for DataExport_NOV_SalesData.csv. Observe the column headers and notice the multi-colored bar beneath all the field names, with the exception of the Table Names field. This multi-colored bar indicates from which data connection the field was sourced. In our case, all bars are a combination of blue and amber, meaning data was sourced from both data connections.
Hover to the left to highlight the Add action and release the mouse button. This will add the third source to your existing union. If you have selected Union instead, a second union step would have been created, combining Union 1 with the third data connection.
Using the same color bars we described in Step 4, you can quickly see the origins of each field. For example, SalesID is blue and amber, meaning it contains data from those two data connections. That is, the SalesID field does not exist in the POS_Sales_Data_November_2016.xlsx data connection. Similarly, the Date field has three colors – blue, amber, and red – indicating the field is present in all three inputs.
Rename Transaction Amount to Revenue.
Rename Cashier to Sales Person.
Rename Store ZIP to Postal Code.
Rename Store Region to Region.
The following screenshot illustrates the end result, where fields have automatically been merged as a result of renaming field names:
As you can see, Tableau Prep automatically and instantly re-aligns the fields once they have identical names. In our case, we have one field remaining that is not present in the POS_Sales_Data_November_2016.xlsx data connection – SalesID. This is not an issue for the union step. The field will be retained, and any rows originating from the POS_Sales_Data_November_2016.xlsx data connection will have a null value for that field.
Tableau Prep stacks rows from any input to a union on top of each other and aligns the data by field names at the same time. In our recipe, we've resolved two scenarios. First, we combined data from two similar, but separate, data connections. In this case, Tableau Prep combined the data without any issues. Then we combined data that was significantly different, that is, data with a different number of fields, and different field names. Tableau resolves this by setting the value to null for fields that are not present in a certain input, and we can manually rename fields in order to trigger Tableau to align them.
When you have multiple data sources that need to be vertically stacked using a union, you may opt to perform that action during ingestion. This avoids you having to create a separate Union step in Tableau Prep. Furthermore, you can use wildcards in your input, such that the input becomes dynamic, and new data files can be ingested as they are added. A typical scenario for this would be an automated process that exports data on a recurring schedule, which you then need to union with prior data exports.
In this recipe, we'll use a special type of union that is part of the input step, rather than a step by itself. Using the Union functionality during input allows you to ingest and union multiple input files simultaneously.
To follow along with this recipe, download the Sample Files 5.2 folder from this book's GitHub repository. This folder contains sales data that has been exported every month, and so we have one file for every month of the year 2016, 12 files in total, to be combined using a union.
Start by opening Tableau Prep and perform the following steps to create a dynamic input and union step:
Wildcard union allows you to perform a union within the input data connection. That is, you can union multiple files at the same time. Wildcards, represented by the asterisk, *, symbol, are utilized to refer to any number of files in a given directory that match a certain naming pattern.
In this recipe, we will perform a union to combine all 12 Excel files in the sample folder.
The data we want to capture resides in the Sales_Data sheet. As such, we need to provide a matching pattern for sheets, so that only the sheet carrying that name will be included. Since the sheet names are identical, we can achieve this without a wildcard.
Note that we can set our Matching Pattern behavior to Include or Exclude. Using the dropdown highlighted in Figure 5.10, by setting this option to Exclude, you can specify the Matching Pattern value as Info. In doing so, you're instructing Tableau Prep to exclude all sheets named Info.
Notice that there is a field named Revenue with null values. Throughout this book, we've used these sample files multiple times and the Revenue field name is unexpected, as is the null value. To investigate this further, add a clean step to your flow.
This time, you can see that the origin is the opposite. All null values originate from the November 2016 Sales.xlsx file. And so we have determined that Transaction Amount and Revenue are likely to represent the same data, but the field labels have changed in the November file.
Tableau Prep will instantly merge the fields and assign it the name of the field first selected. An icon above that field will indicate the merge action. Hovering over the icon provides details on the fields that were merged:
With these steps completed, you have successfully performed a union during the input step.
Tableau Prep stacks rows from data files in scope, as defined by matching patterns during its data ingestion. It does so by opening up multiple files in the background and stacking them without any significant controls in place. That is, completely different data sources can be combined in this union, leading to undesirable outcomes. We've seen a brief example of this in this recipe, where only a single field mismatch justified an investigation.
However, if you are confident that the data format is consistent, as is often the case with automated processes, for example, the input wildcard union option is a great way to ingest files dynamically. For example, a folder that is updated every day with a new file carrying the current date can be included by Tableau Prep thanks to the input wildcard union support.
In this recipe, we've created a union as part of the input step, leveraged a filename and sheet matching pattern, toggled the matching style from include to exclude, and investigated a discrepancy as a result of a union of files with a different structure. As a result, we've learned how to ingest multiple files and perform a Union action using a single input step.
We often store in multiple different places for a variety of reasons, including different systems, storage optimization and efficiency, security, costs, and so forth. When analyzing data, however, we often want to bring data from many different areas together to create a richer dataset for analysis. Doing so may result in a better understanding of the data and provide valuable business insights. We can use the Join functionality to combine data horizontally, that is, widen the dataset by adding fields from two or more sources together. There are a variety of different join types, as you can see by the recipes in this chapter. In this recipe, we'll look at creating an inner join.
An inner join is the end result of joining two data sources together and retaining only those rows that overlap. For example, let's assume we have order data for a B2B seller of office supplies. Their data may be segmented into a database containing order information, and another database containing customer information. The overlap between these two sources is a common identifier, such as the customer's name, or, more likely, a unique customer ID.
To follow along with this recipe, download the Sample Files 5.3 folder from this book's GitHub repository.
Start by opening Tableau Prep and perform the following steps to combine the two sample files using the Inner Join functionality:
A join will always require additional configuration before it is ready for use. Because Tableau Prep instantly evaluates the validity of your flow, creating a new join will initially result in a warning message. This is expected behavior and nothing to be concerned about:
These data sources are connected by customer ID. In the Orders data source, this field is denoted by cust_id. In the Customers data source, it is denoted by id. Select these two fields to create your join clause.
Once a join clause has been created, as we did in the previous step, Tableau Prep will immediately execute the join and provide detailed results in the bottom pane. There are two specific areas to pay careful attention to when creating a join, Join Type and Summary of Join Results.
Join Type allows us to set the type of join, that is, an inner, left, right, full outer, or other type of join. By default, the type is set to inner, which is the type we require in this recipe, and so we do not have to alter it. An inner join will return all data that overlaps between the two data sources; in our case, all orders that have a customer associated with them.
In the following screenshot, we can see that the overlapping section of the two circles is selected, resulting in Join Type : inner:
Summary of Join Results provides valuable information regarding the outcome of a join operation. Specifically, it will inform us how many records are returned from each data source, after passing through the join, and how many records did not pass through because of a mismatch. In our example, a mismatch would occur if an order would not have a customer associated with it in the Customers.csv file. Our sample data has not mismatched and we're seeing a total of 15,821 rows from the Orders file passed through the join, as well as 1,467 rows from the Customers file. You can open both files in Excel to confirm that this is indeed the correct number of rows.
The following screenshot shows the join results of the steps we have just performed:
Let's move on to the next section!
In this recipe, you've connected two data sources in a Tableau Prep flow. You then joined these two data sources using a common denominator, in this case, the customer ID. You've experienced how Tableau Prep immediately performs join operations without the need for you to run the entire flow. This is by design, and helps you validate your join prior to building out your flow further and possibly running into unexpected issues later on as a result of the output of the join step.
In the Combining datasets using an inner join recipe, we combined data that was complementary and complete, orders, and associated customer information. However you may find a use case where complementary data is present for some records, and not for others. An example of this that we'll use in this recipe involves two data sources, one with sales data from a department store, and another data source with information from customers who checked out with a loyalty card. Of course, not all customers may have a loyalty card, and so we cannot expect to match every row in the data. This is where a left join comes into play.
In a left join, we pass through all the data from the first dataset, that is, the left data source, and only those records from the second, right data source that we were able to match. This means that any sales records that did not involve a loyalty card will still pass through, but the additional fields from the second source will simply be empty, or null.
To follow along with this recipe, download the Sample Files 5.4 folder from this book's GitHub repository.
Start by opening Tableau Prep and perform the following steps to enrich the sample sales data with the customer loyalty card data, if available:
Note that Tableau Prep indicates a number of mismatched values in the Summary of Join Results section. Because the default join type is an inner join, Tableau Prep will only return rows that have a common identifier in both sources, in this case, our loyalty ID. Any other rows are excluded. The join result gives the count of rows that actually passed through the join, 216 in this case, out of a total of 826 sales transactions:
The middle section in the join step settings shows all the IDs that could not be matched in a red font. For example, null from Sales Data could not be matched to any row in the LoyaltyData file and, as a result, is excluded from the join results. Similarly, the ID 2000 from the LoyaltyData file was not present in the sales data, and therefore is also excluded from the join results:
Tableau Prep immediately updates the join results, and, at a glance, we can now see from the summary that all 826 Sales Data rows are being passed through:
You may review the row-by-row results in a more organized pane by adding a clean step to your flow and selecting null from the Loyalty_ID field. This will display all rows where a transaction was made without a loyalty card, and you can see that the values for customer information are set to null as a result:
With these steps completed, you've successfully utilized Tableau Prep to perform a left join.
In this recipe, you've created a left join, which resulted in returning all rows from one data source, and only those rows from a complementary data source when there was a common identifier. A right join would have performed the exact opposite, that is, returned all rows from the second data source, and only matching rows from the first data source.
There are many use cases for a left or right join, as in many scenarios we combine data from multiple sources to a core data source, such as transactions or customers.
In the Combining data ingest and Union actions recipe, we created an inner join to return rows from two data sources that had a commonality. In the Combining datasets using a left or right join recipe, we created a left join to return all rows from a data source and enrich that data with information from a second source, whenever there was additional information available, without dropping any rows from the original source.
In this recipe, we'll look at a variation of the join, which is named the full outer join. In this case, we'll want to retrieve all rows from both data sources involved in the join, that is, even if there's no overlap. It's essentially doing a left and right join at the same time; you won't lose any data from either data source.
In the example that follows, we'll use a use case where a company is running several projects and each project may have a number of people assigned to it. However, some projects may not have started yet, or may have already been completed. Those inactive projects won't have people assigned to them. Similarly, people may be assigned to a project, or are currently between projects and not assigned directly to any project. We want the result of our join to return all projects and all people.
To follow along with this recipe, download the Sample Files 5.5 folder from this book's GitHub repository.
Start by opening Tableau Prep and perform the following steps to create a full outer join to combine project and people data:
With these steps completed, you've successfully performed a full outer join with Tableau Prep.
In this recipe, you've leveraged a full outer join to combine two data sources, even if those sources have nothing in common. In this case, we have included projects that were not assigned any staff and included staff that were not assigned to any project.
Using a full outer join can come in handy for specific scenarios. However, there is a risk of creating a dataset that is incorrect since, with a full outer join, you may expect many mismatched values, and a mistake is easily made by ignoring the mismatches. It is prudent to double-check that your join clause is correct with any join, but perhaps even more so with a full join, for this reason.
In this chapter, we have assumed in all join-related recipes that there was an overlap between two data sources. However, for analysis purposes, you may be interested in what data is not overlapping, so that you can take action appropriately.
Using the same data as we've used in the Expanding datasets using a full outer join recipe, where we have a data source with projects, and another data source with project staff, we may change our use case to focus solely on data that does not overlap. That is, we are only interested in projects without staff assigned to them, or staff members not currently assigned to work on any project.
To follow along with this recipe, download the Sample Files 5.6 folder from this book's GitHub repository.
Start by opening Tableau Prep and perform the following steps to create a not inner join:
With this step completed, you've performed the Not Inner Join method of discarding data.
A not inner join works by discarding the data which would be put through as the result of an inner join. In doing so, Tableau Prep allows you to quickly identify rows that cannot be matched across the data sources used. This can be helpful in scenarios such as the one used in this recipe, or even for identifying data quality issues where you expect a match, but the data is not providing a match on every row.
3.129.211.87