In this chapter, you'll learn how to go beyond the built-in capabilities in Tableau Prep Builder by extending it with R and Python code. R and Python are two of the world's most popular programming languages and can perform numerous data science functions. Tableau Prep allows you to pass your data to an R or Python script at any stage during your flow, with the exception of the input data step. When you insert a script, Tableau Prep will pass the data to R or Python using an API. The script will execute in the R or Python environment and then output the results back to Tableau Prep and your flow continues. The ability to embed scripts allows you to greatly improve the functionality of Tableau Prep and perform advanced functions that are not otherwise possible.
In this chapter, we're going to cover the following main topics:
To follow along with the recipes in this chapter, you will need Tableau Prep Builder, an internet connection, and administrative rights to install software.
The recipes in this chapter use sample data files that you can download from the book's GitHub repository: https://github.com/PacktPublishing/Tableau-Prep-Cookbook.
R is a statistical programming language favored by many data scientists. It is available free of charge and benefits from a vast online community for ideation and support. To find out more about R, visit https://www.r-project.org/about.html. R by itself does not support interaction with other applications such as Tableau Prep. In order to extend R with that functionality, we need to install Rserve in addition to R itself.
Important note
The R programming language and the R and RStudio application interfaces are not part of Tableau Prep; they are separate technologies. An in-depth look at R is beyond the scope of this book. However, every system is slightly different, and you may run into unexpected challenges. For help, please refer to the R website: https://cran.r-project.org.
In this recipe, we will provide general guidance for configuring your system successfully.
To follow along with this recipe, download the Sample Files 8.1 folder from this book's GitHub repository.
Start by opening your browser, then follow these steps to configure your machine:
Important note
Depending on your system configuration, not all versions may work for you. At the time of writing, we found that version 1.8.7 works for our intended purposes on both macOS and Windows. If you are familiar with RStudio, note that we are downloading the latest snapshot manually rather than using the built-in package manager, as the package manager may install an earlier version that has compatibility challenges.
The following screenshot shows the Rserve download page on the RForge website:
Important note
If you ever need to start Rserve again, for example, after your computer reboots, simply issue these two commands again.
This completes setting up R, Rserve, and Tableau Prep to talk to each other. From now on, you can start leveraging R scripts in your flows. In the next recipe, Embedding R code in a Tableau Prep flow, we'll do exactly that.
In this recipe, we learned how to configure your Mac or Windows PC with R, RStudio, and Rserve. This will allow Tableau Prep to pass data to R for R to run a script against that data and return the results to Tableau Prep. Since R is a programming language, you can write virtually any function you desire, greatly expanding the functionality of Tableau Prep.
In the previous recipe, you learned how to configure your machine with R. You also set up a connection from Tableau Prep to R, using the Rserve package. In this recipe, we'll leverage that foundational setup to run an R script using Tableau Prep. We'll learn how to embed a script in our flow and understand how R sends the data back to our Tableau Prep data flow. The examples used are simple but form a solid foundation. Once you have mastered the basics, you'll be able to implement more advanced scripts, such as the script in the recipe titled Forecasting time series using R.
To follow along with this recipe, download the Sample Files 8.2 folder from this book's GitHub repository.
Important note
R is a programming language of its own, not covered by this book. We'll look at R code in this recipe at a very high level only, so you will gain sufficient knowledge to integrate R with Tableau Prep. However, a look at the R language itself is beyond the scope of this book.
Start by opening up Tableau Prep and connecting to the Transaction Amount by Date.xlsx Excel file included in the sample files for this recipe, then follow these steps:
This sample script contains three functions that'll help you better understand how to use R and Tableau Prep together.
The first eight lines contain a function named divide. The function uses the data from a DataFrame, df. Whenever you write an R script for Tableau Prep, calling a function with (df) will result in Tableau Prep passing data to that function.
This example function does nothing more than dividing a given number by 2.
Lines 10 to 17 demonstrate a similar function, multiply. Again, we see the use of (df) to pass data from Tableau Prep to this function. The function itself multiplies a given number by 2.
Lines 19 to 22 include a function named getOutputSchema. Whenever you want R to return data to Tableau Prep, you must use this function to do so. In this example, we're returning the TransactionAmount field to Tableau Prep. In this function, in addition to the name of the field to return, we must specify the data type for Tableau Prep. In this example, you can see that the data type for TransactionAmount has been set to prep_string(), which tells Tableau Prep that this field has a string data type.
The following table lists the various data types that are available, along with their respective functions to use in R:
Pressing the Return key will instantly run the data through the R script. If successful, you will see the TransactionAmount field returned in Tableau Prep, with its original values divided by 2.
Important note
We only got one field back from R, TransactionAmount, yet we started our flow with two fields: Date and TransactionAmount. This is expected behavior; R will only return the fields as specified in its script. In our example script, in the getOutputSchema function, Date is not listed and therefore is not returned to Tableau Prep.
It is good practice to avoid sending unnecessary data to R and vice versa, as it could cause performance drawbacks, particularly when working with large datasets. You can always include a unique identifier and join the result from R back to your original dataset.
With these steps completed, you've successfully embedded a Script step with R in your Tableau Prep flow.
In this recipe, we learned how to call an R script and a particular R function from within the Tableau Prep interface. In doing so, you've opened up the door to integrating advanced data science methods into your Tableau Prep data preparation process. We've learned that an R script file may contain multiple functions and that a script step in Tableau Prep can execute one of those functions by specifying the function name. We also learned that a function only returns data to Tableau Prep if it includes a specific function, getOutputSchema, and only the data specified in that function.
In the previous two recipes, Preparing Tableau Prep to work with R and Embedding R code in a Tableau Prep flow, you learned how to set up R, RStudio, and Rserve, as well as integrating an R script into Tableau Prep. The example script used was relatively basic and required no additional configuration. However, R is an extensible programming language, and many functions are bundled in what are called packages. In this recipe, we'll look at a more advanced script that has dependencies on such packages. You'll learn how to prepare R for these complex scripts that require additional preparation.
To follow along with this recipe, download the Sample Files 8.3 folder from this book's GitHub repository.
Important note
The R scripts in this book serve to illustrate the process of integrating R and Tableau Prep. Each dataset must be carefully analyzed prior to applying advanced functionality such as forecasting. The script referenced in this recipe is purely for illustration purposes and is not intended for real-world use.
Start by opening up Tableau Prep and connecting to the Transaction Amount by Date.xlsx Excel file included in the sample files for this recipe, then follow the steps:
The forecastTS function contains a script that is designed to take time series data and predict, or forecast, future performance. In our case, our sample data contains Date and TransactionAmount. We'll run the data through this script, which has been designed to forecast the TransactionAmount value for the next 30 days (relative to the maximum date in our data):
You'll notice Tableau Prep will show an error message when it is trying to validate the flow and leverage the specified R function. In this case, the error is caused by the fact that our script depends on extra R packages that we have not installed.
install.packages("dplyr")
install.packages("zoo")
install.packages("car")
install.packages("forecast")
install.packages("tseries")
install.packages("fUnitRoots")
install.packages("PerformanceAnalytics")
Run the preceding code to install all packages. The following screenshot shows how your RStudio console will appear during this process:
Important note
In Step 4, notice how the Script step returns more fields than were originally inputted. As highlighted in the previous recipe, titled Embedding R code in a Tableau Prep flow, which fields are returned to Tableau Prep is determined by the R script. Since an R script can also generate data and new fields, it is entirely possible for it to output more fields than were inputted. You can always refer back to the getOutputSchema function in the R script to evaluate what data is being returned to Tableau Prep.
With these steps completed, you've successfully integrated a complex R script in Tableau Prep.
In this recipe, we learned how to install additional R packages to support complex R scripts. When it comes to R, there is a near-infinite amount of code possibilities driven by a large community of authors. It is good practice to consult your data science teams prior to implementing any R script in your Tableau Prep flow so that you can validate that the script is indeed appropriate for your data.
Python is a general-purpose programming language ranking highly as one of the most popular languages right now. It is available free of charge and, similar to R, benefits from a community for ideation and support. To find out more about Python, visit https://www.python.org/. In order to allow Tableau Prep to communicate with Python, you need to prepare your machine with a Tableau Python server known as TabPy.
Important note
The Python programming language and the TabPy package are not part of Tableau Prep. Python is a separate technology and an in-depth look at Python is beyond the scope of this book. Every system is slightly different, and you may run into unexpected challenges. For help, please refer to the Python website at https://www.python.org/.
In this recipe, we will provide general guidance for configuring your system successfully.
If you are a Windows user, you must ensure you have Microsoft C++ Build Tools installed. You can download this at https://visualstudio.microsoft.com/visual-cpp-build-tools/.
Start by opening your terminal. On macOS, you can find your Terminal application in Applications, in the Utilities folder. On Windows machines, use the Start menu to search for CMD, then select Command Prompt from the search results:
The terminal will either return a version number or inform you that Python was not found. If your version is lower than 3.8, which is what we are using while writing this book, or you receive the Python was not found message, you need to install a newer version of Python. If you have Python 3.8 or higher, you can skip ahead to Step 5.
Important note
Not all versions of Python will be suitable for running TabPy. We found that the latest version (at the time of writing), 3.9.1, has some challenges with running TabPy. If you are a macOS user, it's important to note that all Macs ship with Python installed. The installation of a newer version may not cause the terminal to actually use that version. That is, you will see the same version number after the installation as you did in Step 1. This does not mean the installation has failed, but that your machine has multiple versions of Python installed. There are several ways to adjust this. However, which method you choose is up to you. It's a topic of discussion between Python programmers. You can select a simple solution, as described in this forum article: https://stackoverflow.com/questions/43354382/how-to-switch-python-versions-in-terminal/, or you can search online for a so-called Python version manager. As the scope of this book does not cover Python itself, we shall assume that, from Step 5 onward, you are able to run Python version 3.9.1 or higher from your terminal.
Important note
TabPy installation instructions are documented on TabPy's GitHub repository, which you can find at https://github.com/tableau/TabPy/blob/master/docs/server-install.md. Refer to this page if you experience any installation issues and verify the steps you've taken.
In your terminal application, run the python -m pip install --upgrade pip command to update pip, an installation manager that ships with Python.
Important note
Whenever you close all terminal windows or reboot your machine, you will likely shut down the TabPy server. To restart the server, simply open up your terminal and run the tabpy command again.
Set the server name to localhost and the port to 9004 and click Sign In to save your configuration:
With these steps completed, you've successfully installed and started the Tableau Python server.
In this recipe, we learned how to install and start the Tableau Python server, also known as TabPy, using the terminal. TabPy will allow Tableau Prep to communicate with Python. Similar to R, Tableau Prep will send data to an external Python process using TabPy, and TabPy will deliver any outputs back to Tableau Prep.
In the previous recipe, we prepared our machine to work with Python, using the Tableau Python server. In this recipe, we'll create a Tableau Prep flow and embed a Python script in it. This process is largely the same as we've practiced in the recipes for R. Using this script, we'll evaluate a dataset and flag anomalies, that is, outliers.
To follow along with this recipe, download the Sample Files 8.5 folder from this book's GitHub repository.
Important note
The Python scripts in this book serve to illustrate the process of integrating Python and Tableau Prep. Each dataset must be carefully analyzed prior to applying advanced functionality such as anomaly detection. The script referenced in this recipe is purely for illustration purposes and is not intended for real-world use.
Start by opening up Tableau Prep and connect to the sales amount by date.csv file from the sample files folder:
This script summarizes time series data by day and then performs analysis to identify outliers, that is, relatively high or low numbers, relative to the entire dataset. You can open up the script in a text editor to look under the hood.
Similar to R scripts, we must specify a function and DataFrame for Tableau Prep to pass data to, and the script must include the get_output_schema function to return data to Tableau Prep. Unlike R, we do not have to install libraries used by our script upfront.
Important note
Tableau provides useful Python resources in its documentation, available at https://help.tableau.com/current/prep/en-us/prep_scripts_TabPy.htm.
The following screenshot shows a Python script, with a function named detect_outliers for the forecast function, and a function named get_output_data to return data to Tableau Prep:
Scroll through the data and observe the results. We can see that 11/01/2020 has been flagged as an anomaly. Looking at its value of 1,351, that makes sense as its neighboring values are significantly higher:
With these steps done, you've successfully completed this recipe.
In this recipe, we learned how to integrate a Python script into a Tableau Prep workflow. Similar to R scripts, a Python script may contain multiple functions, but the Script step can only perform the single function specified. The script must include a function named get_output_data to return data to Tableau Prep, as we have done in this recipe. This analytics extension can add significant functions to your flow that are not available out of the box.
3.133.109.30