Performing linear regression with R

Linear regression is a common technique for finding the best-fit straight line through the points of a scatter plot. The resulting line can help in predictive analysis.

In this recipe, we will add trend lines to CO2 Emission graphs using R.

Getting ready

This recipe has a few prerequisites before it can be followed:

  1. Install R. The R Project for Statistical Computing website is https://www.r-project.org/.
  2. Optionally, install RStudio from https://www.rstudio.com/, which is an integrated environment for R.
  3. Install and start the Rserve package by running the following in R:
      install.packages("Rserve")
      library(Rserve)
      run.Rserve()
  4. You will get a message that says the connection is blocked.
  5. In Tableau, go to Help and, under Settings and Performance, select Manage External Service Connection...
  6. Type localhost for Server, and leave Port/API Key set to 6311. Test the connection, and ensure you get a successful connection message before proceeding.

Once R is ready and connected, you may proceed with the recipe.

To follow this recipe, open B05527_06 – STARTER.twbx. Use the worksheet called Integrating R, and connect to the CO2 (Worldbank) data source.

How to do it...

Here are the steps to create the view with R-generated trend lines:

  1. From Dimensions, drag Country Name to the Filters shelf.
  2. In the Filters window, choose Canada, China, and United States.
  3. From Dimensions, drag Country Name to Columns.
  4. From Dimensions, drag Year to Columns.
  5. Right-click on the Year pill in Columns and choose Continuous.
  6. From Measures, drag CO2 Emission to Rows.
  7. Click on the null indicator that appears on the bottom right of the view and select Filter data.
  8. Create a new calculated field called Trend Line with an R formula that calls R's lm function through a SCRIPT function, passing CO2 Emission and ATTR(Year) as arguments.
  9. Drag the new calculated field Trend Line from Measures to Rows, to the right of CO2 Emission.
  10. Right-click on the Trend Line pill and set Compute Using to Pane (across).
  11. Right-click on the Trend Line pill and select Dual Axis.
  12. Right-click on the Trend Line axis and select Synchronize Axis.
  13. Right-click on the Trend Line axis and uncheck Show Header.

How it works...

Linear regression, according to Wikipedia, is defined as follows:

"… an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted x. The case of one explanatory variable is called simple linear regression."

The R scripting language is widely used by statisticians, data miners, and data scientists. Numerous R libraries support statistical analyses such as linear regression and clustering. There are even libraries for parallel and high-performance computing (for example, parallel) and for interactive web visualizations (Shiny).

Tableau introduced R support in version 8.1. Tableau executes R code through a set of functions that pass the code to R using the Rserve package. You can find these functions listed in the help section of the calculated field window. All of them start with SCRIPT: SCRIPT_BOOL, SCRIPT_INT, SCRIPT_REAL, and SCRIPT_STR.

These SCRIPT functions accept R code in the form of a string. In our recipe, we are using the R function lm, which is used to fit linear models.
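To see what lm does outside Tableau, here is a minimal sketch that can be run in an R console. The year and emission values are made up for illustration and do not come from the CO2 data source. The call lm(emission ~ year) fits a line of the form emission = intercept + slope * year.

```r
# Hypothetical yearly values for one country (made-up numbers).
year <- c(2000, 2001, 2002, 2003, 2004)
emission <- c(10.0, 12.1, 13.9, 16.2, 17.8)

# Fit emission as a linear function of year.
fit <- lm(emission ~ year)

# Intercept and slope of the best-fit line.
print(coef(fit))

# One fitted value per input row: these are the points on the trend line.
trend <- fitted(fit)
print(trend)
```

The fitted values, not the coefficients, are what a trend line needs, because the view requires one value for every mark.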

The string expression is followed by the arguments, which are the Tableau-supplied values. Inside the R code, these values are referenced by the placeholders .arg1, .arg2, and so on, in the order in which they are passed.
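The following sketch imitates this hand-off in plain R. The SCRIPT_REAL expression in the comment is an assumed reconstruction of what such a calculated field could look like, not the recipe's exact formula, and the values assigned to .arg1 and .arg2 are made up.

```r
# Assumed shape of the Tableau calculated field (hypothetical reconstruction):
#   SCRIPT_REAL("fit <- lm(.arg1 ~ .arg2); fit$fitted",
#               SUM([CO2 Emission]), ATTR([Year]))

# Stand-ins for the vectors Tableau would send for one partition (made-up values).
.arg1 <- c(10.0, 12.1, 13.9, 16.2, 17.8)  # would hold the aggregated CO2 Emission
.arg2 <- c(2000, 2001, 2002, 2003, 2004)  # would hold ATTR of Year

# The same kind of R string that would be passed to the SCRIPT function.
r_code <- "fit <- lm(.arg1 ~ .arg2); fit$fitted"

# Evaluate the string here, as a rough imitation of what happens on the R side.
trend_line <- eval(parse(text = r_code))
print(trend_line)
```

The value of the last expression in the string is what comes back, which is why the string ends with the fitted values rather than with the model object.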

The SCRIPT functions expect aggregated arguments. For dimensions, you may need to use aggregate functions such as MIN, MAX, or ATTR to pass them as arguments. This is why we enclosed the Year field in ATTR when we passed it to our Trend Line calculated field.

These functions return values of a scalar type. More precisely, they return a single column (or vector) of such values. Tableau treats the results as table calculations.

This is why, when you use calculated fields with R expressions, you may need to adjust the addressing and partitioning of the fields depending on the data and layout in your view.
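As a rough analogy for partitioning, the sketch below fits a separate line for each country in plain R, which is similar in spirit to computing the Trend Line field once per pane. The country labels and emission values are made up, and the loop is only an illustration of the idea, not how Tableau works internally.

```r
# Made-up data for two countries; in the view, each pane would be one partition.
co2 <- data.frame(
  country  = rep(c("A", "B"), each = 4),
  year     = rep(2000:2003, times = 2),
  emission = c(5.0, 5.4, 6.1, 6.5, 20.0, 19.2, 18.9, 18.1)
)

# Fit one line per country and store one fitted value per row.
co2$trend <- NA_real_
for (ctry in unique(co2$country)) {
  rows <- co2$country == ctry
  co2$trend[rows] <- fitted(lm(emission ~ year, data = co2[rows, ]))
}
print(co2)
```

If all rows were fitted together instead, both countries would share one line, which is the kind of result you get when the addressing and partitioning do not match the layout of the view.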

Note

If you need a refresher on table calculations, there is a section on table calculations in Appendix A, Calculated Fields Primer.

Before R code can be executed from Tableau, R must be installed, and the Rserve package must be installed and running.

It is important to filter out the null values before you add the calculated trend line field from R. Otherwise, you will get an error that says there is a mismatch in the number of values in your view and the ones returned by the SCRIPT function.
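One plausible source of this mismatch can be reproduced in plain R: by default, lm drops rows that contain missing values, so fewer fitted values come back than rows went in. The values below are made up for illustration.

```r
# Made-up values with two missing emissions.
year <- c(2000, 2001, 2002, 2003, 2004)
emission <- c(10.0, NA, 13.9, 16.2, NA)

# With R's default na.action (na.omit), rows with NA are dropped before fitting.
fit_default <- lm(emission ~ year)
n_in <- length(year)
n_out <- length(fitted(fit_default))
print(c(rows_in = n_in, fitted_out = n_out))

# For comparison, na.exclude pads the fitted values with NA so the lengths line up.
fit_padded <- lm(emission ~ year, na.action = na.exclude)
print(length(fitted(fit_padded)))
```

Filtering the nulls out of the view, as the recipe does, avoids the problem at the source because R then receives only complete rows.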

See also

Please refer to the Adding a trend line recipe in this chapter.
