Linear regression is a common technique to find the best fit straight line in a scatter plot. The resulting line can help in predictive analysis.
In this recipe, we will add trend lines to CO2 Emission graphs using R.
This recipe has a few prerequisites before it can be followed:
Install R, then install and start the Rserve package from an R session:
install.packages("Rserve"); library(Rserve); run.Rserve();
In Tableau's R connection settings, enter localhost for the Server: field, and leave the Port/API Key: field at 6311. Test the connection, and ensure you get a successful connection message before proceeding.
Once R is ready and connected, you may proceed with the recipe.
To follow this recipe, open B05527_06 – STARTER.twbx. Use the worksheet called Integrating R, and connect to the CO2 (Worldbank) data source:
Here are the steps to create the view with R-generated trend lines:
Linear regression, according to Wikipedia, is defined as follows:
"… an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted x. The case of one explanatory variable is called simple linear regression."
The R scripting language is popular with statisticians, data miners, and data scientists. Numerous R libraries support statistical analyses such as linear regression and clustering. There are even libraries for high performance computing (Rcpp) and interactive web visualizations (Shiny).
R support in Tableau was introduced in Tableau 8.1. Tableau enables R code to be executed from within Tableau by using a series of functions that pass R code to R using the Rserve package. You can find a list of these functions in the help section of your calculated field window. All the function names start with SCRIPT: SCRIPT_BOOL, SCRIPT_INT, SCRIPT_REAL, and SCRIPT_STR.
These SCRIPT functions accept R code in the form of a string. In our recipe, we are using the R function lm, which is used to fit linear models.
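As a rough illustration of what lm does outside Tableau, the sketch below fits a straight line to a handful of yearly values and reads back the points on that line. The data frame co2 and its numbers are invented for the example and are not taken from the recipe's workbook.

```r
# Hypothetical data: five yearly observations (not from the workbook).
co2 <- data.frame(
  year     = c(2000, 2001, 2002, 2003, 2004),
  emission = c(10, 12, 13, 15, 18)
)

# Fit emission as a linear function of year.
fit <- lm(emission ~ year, data = co2)

# Intercept and slope of the best-fit line.
coef(fit)

# One fitted value per input row; these are the points that form the trend line.
trend <- fitted(fit)
```

The fitted values come back as a vector with one entry per input row, which is the shape a trend line drawn over the same marks needs.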
The string expression is followed by the arguments, that is, the Tableau-supplied values that replace the placeholders .arg1, .arg2, and so on in the R code.
The SCRIPT functions expect aggregated arguments. In the case of dimensions, you may need to resort to aggregate functions such as MIN, MAX, and ATTR to pass them as arguments. This is why we had to enclose the Year field in ATTR when we passed this field in our Trend Line calculated field.
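To see how the script string and its arguments fit together, the following sketch emulates the hand-off in plain R. The calculated field shown in the comment is a hypothetical example of the pattern, not necessarily the exact one used in this recipe, and the field name [CO2 Emission] and the numbers are assumptions.

```r
# A calculated field of this general shape (hypothetical) would be written as:
#   SCRIPT_REAL("fit <- lm(.arg1 ~ .arg2); fit$fitted.values",
#               SUM([CO2 Emission]), ATTR([Year]))

# The R code exactly as it would appear inside the string argument.
script <- "fit <- lm(.arg1 ~ .arg2); fit$fitted.values"

# Stand-ins for the aggregated values Tableau would supply (invented numbers).
.arg1 <- c(10, 12, 13, 15, 18)            # plays the role of SUM([CO2 Emission])
.arg2 <- c(2000, 2001, 2002, 2003, 2004)  # plays the role of ATTR([Year])

# Evaluate the string; the value of its last expression is what is handed back.
result <- eval(parse(text = script))
```

Running a script this way in an R console before pasting it into a calculated field is a convenient way to catch syntax errors early.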
Each function returns values of a single type, for example real numbers for SCRIPT_REAL. To be more exact, it returns a single column (or vector) of values. Tableau also treats the resulting values as table calculations:
This is why, when you use calculated fields with R expressions, you may need to adjust the addressing and partitioning of the fields depending on the data and layout in your view.
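The effect of partitioning can be approximated in plain R by fitting one model per group. The sketch below assumes a view partitioned by a hypothetical country field and addressed along year, so that each country gets its own regression; the data frame and the helper fit_partition are invented for illustration.

```r
# Hypothetical view data: two countries, four years each.
view <- data.frame(
  country  = rep(c("A", "B"), each = 4),
  year     = rep(2000:2003, times = 2),
  emission = c(1, 2, 3, 4, 10, 8, 6, 4)
)

# Fit a separate line for the rows of one partition and return its fitted values.
fit_partition <- function(rows) {
  unname(fitted(lm(emission ~ year, data = rows)))
}

# Split by country, fit each partition, and put the results back in row order.
view$trend <- unsplit(lapply(split(view, view$country), fit_partition),
                      view$country)
```

Country A rises and country B falls, so a single regression over all eight rows would describe neither; fitting per partition gives each country its own trend line.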
If you need a refresher on table calculations, there is a section on table calculations in Appendix A, Calculated Fields Primer.
Before R code can be executed, it is also important to set up your environment so that R is installed, and so that the Rserve package is also installed and running.
It is important to filter out the null values before you add the calculated trend line field from R. Otherwise, you will get an error that says there is a mismatch in the number of values in your view and the ones returned by the SCRIPT function.
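The mismatch can be reproduced in plain R. This sketch assumes that null values reach R as NA and uses invented numbers: by default lm drops incomplete rows, so fewer fitted values come back than were passed in.

```r
# Hypothetical input with one missing emission value.
year     <- c(2000, 2001, 2002, 2003, 2004)
emission <- c(10, NA, 13, 15, 18)

# With R's default na.action, the incomplete row is dropped before fitting.
fit <- lm(emission ~ year)
n_sent     <- length(emission)           # 5 values passed in
n_returned <- length(fit$fitted.values)  # 4 values returned

# Filtering the missing rows first keeps input and output the same length.
keep      <- !is.na(emission)
fit_clean <- lm(emission[keep] ~ year[keep])
n_clean   <- length(fitted(fit_clean))
```

Filtering the nulls in the view achieves the same thing as the keep step here: the number of marks and the number of returned values line up again.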