Chapter 1. R for Pythonistas

Welcome, brave Pythonista, to the world of the useR1! In this chapter our goal is to introduce you to R’s core features and try to address some of the confusing bits that you’ll encounter along the way. Thus, it’s useful to mention what we’re not going to do.

First, we’re not writing for the naïve data scientist. If you want to learn R from scratch, there are many wonderful resources available. We encourage you to explore them and choose those which suit your needs and learning style. Here, we’ll bring up topics and concerns that may confuse the complete novice. We’ll take some detours to explain topics that we hope will specifically help the friendly Pythonista to adapt to R more easily.

Second, this is not a quick guide or a bilingual dictionary. Here, we want to take you on a journey of exploRation and undeRstanding. We want you to get a feel for R so that you begin to think in R, becoming bilingual. Thus, for the sake of narrative, we may introduce some items much later than we would when writing for a complete novice. We’ll include some convenient Python/R 1:1 translations in the appendix, but without context, they are less useful.

Third, this is not a comprehensive guide. Once you crack the R coconut, you’ll get plenty of enjoyment exploring the language more deeply to address your specific needs as they arise. As we mentioned in the first part of the book, the R community is diverse, friendly, welcoming — and helpful! We’re convinced it’s one of the less tech-bro cultures out there. To get an idea of the community, you can follow #rstats on Twitter.

Up and running with R

To follow the exercises in this chapter, you can either access our RStudio Cloud project or install R locally. Follow the instructions in one of the next two subsections.

RStudio Cloud

To use RStudio Cloud, make an account at http://rstudio.cloud/ and then navigate to our publicly available project. Make sure to save a copy of the project in your workspace (you’ll see the link in the header) so that you have your own copy.

Your RStudio session should look like Figure 1-1. Open the file ch03.R and that’s it! You’re ready to follow along with all the examples. To execute commands, press Ctrl + Enter (or Cmd + Enter on macOS).

Figure 1-1. Our project in RStudio Cloud.

RStudio Desktop

If you want to run R locally, first download and install R for your operating system from https://www.r-project.org/. R 4.0 was released in June 2020. The difference between R 4.x and R 3.x is nowhere near as drastic as that between Python 3.x and Python 2.x. With a few notable exceptions, R 4.x is backwards compatible. Nonetheless, we’ll assume you are running at least R 4.0.0: “Taking Off Again”. Each release gets a name inspired by Peanuts (the classic comic strip and film franchise featuring Charlie Brown, Snoopy, and co.), which is a nice personal touch, we think.

Next, install the RStudio IDE (discussed in [Link to Come]) from https://rstudio.com/.

Finally, set up a project to work in. This is a bit different from a virtual environment, which we’ll discuss later on. There are two typical ways to make a project with pre-existing files.

If you’re using git, you’ll be happy to know that RStudio is also a basic git client. In RStudio, select File > New project > Version Control > Git and enter the repository URL https://github.com/Scavetta/PyR4MDS. The project directory name will use the repo name automatically. Choose where you want to store the repo and click “Create Project”.

If you’re not using git, you can just download and unzip the repo from https://github.com/Scavetta/PyR4MDS. In RStudio, select File > New Project > Existing Directory and navigate to the downloaded directory. A new R project file, *.Rproj, will be created in that directory.

Your RStudio session should look like Figure 1-2. Open the file ch03.R and that’s it! You’re ready to follow along with all the examples. To execute commands, press Ctrl + Enter (or Cmd + Enter on macOS).

Figure 1-2. Our project in RStudio.

The perks & perils of projects & packages

We could begin exploring R by using a built-in data set and diving right into the tidyverse (introduced in [Link to Come]), but we want to step back for a second, take a deep breath, and begin our story at the beginning. We’ll start by reading in a simple CSV file. For this, we’re going to use a data set that is actually already available in R in the {ggplot2} package. For our purposes, we’re less bothered with the actual analysis than with how it’s being done in R. We’ve provided the data set as a file in the project subdirectory R4Py.

If you set up your project correctly (see above) all you’ll need to execute is:

diamonds <- read.csv("R4Py/diamonds.csv")

Just like in Python, single ('') and double ("") quotation marks are interchangeable.

You should now have the file imported and available as an object in your global environment. This is where your user-defined objects are found, and it is the last environment in a chain of environments. The first thing you’ll notice is that the environment pane of RStudio displays the object and already gives some summary information. This lovely, simple touch is similar to the Jupyter notebook extension for VS Code (see [Link to Come]), which also lets you view your environment. Although this is a standard feature in RStudio, viewing a list of objects when scripting in Python, or many languages for that matter, is not typical. Clicking the little blue arrow beside the object name will reveal a text description (see Figure 1-3).

Figure 1-3. A pulldown of a data frame.

Clicking on the name will open it in an Excel-like viewer (see Figure 1-4).

Figure 1-4. A data frame in table view.
Note

The RStudio viewer is much nicer than Excel, since it only loads into memory what you’re seeing on the screen. You can search for specific text and filter your data here, so it’s a handy tool for getting a peek at your data.

Although these are nice features, some useRs consider them to be a bit too much GUI2 and a bit too little IDE3. Pythonistas would mostly agree, and some criticize the user experience of RStudio. We partly agree. For example, to import your data set, you could also have clicked on the “Import Dataset” button. This can be convenient if you’re having a really hard time parsing the file’s structure, but it leads to undocumented, non-reproducible actions, which are extremely frustrating since scripts/projects will not be self-contained. The command to import the file will be executed in the console and visible in the history panel, but it will not appear in the script unless you explicitly copy it. The result is objects in your environment that are not defined in your script. However, remember that RStudio is not R; you can use R with other text editors.

If you couldn’t import your data with the above command, either (i) the file doesn’t exist in that directory, or (ii) you’re working in the wrong working directory, the latter being more likely. You may be tempted to write something terrible, like this:

diamonds <- read.csv("~/Documents/R projects/PyR4MDS/R4Py/diamonds.csv")

Please never hard-code file paths! You’ll be familiar with avoiding this from using virtual environments in Python. Neither the working directory nor the project is a virtual environment, but they are nonetheless very handy, so let’s check them out!

The working directory is the first place R looks for a file. When you use R projects, the working directory is wherever the *.Rproj file is. Thus, “R4Py/diamonds.csv” says to take diamonds.csv from the R4Py sub-directory of whatever the working directory is. It doesn’t matter what the working directory is called or where it is; you can move the entire project anywhere on your computer and it will still just work once you open the project (the *.Rproj file) in RStudio.

Warning

If you’re not using R projects, then your working directory will likely be your home directory. This is terrible, because you’ll have to specify the entire path to your file, e.g. “~/Documents/R projects/PyR4MDS/R4Py/”, instead of just the sub-directories within your project. In many outdated tutorials you’ll find the command getwd() to get, and setwd() to set, the working directory. Please don’t use these commands!

In this first command to import data, you’ll already notice some things that will confuse and/or aggravate the seasoned Pythonista. Three things in particular stand out.

First, notice that it’s commonplace, and even preferred, to use <- as the assignment operator in R. You can use =, as in Python, and indeed you’ll see prominent and experienced useRs do this, but <- is more explicit as “assign to object”, since = is also used to assign values to arguments in function calls, and we all know how much Pythonistas love being explicit!

Note

The <- assign operator is actually a legacy from pre-standardized keyboards, which had a dedicated key for it: <- didn’t mean “move the cursor one space to the left”, it literally made <- appear.

Second, notice that the function name is read.csv(). Nope, that’s not a typo. csv() is not a method of the object read, nor is it a function of the module read. Both would be completely acceptable interpretations if this were a Python command. In R, with a few (but notable) exceptions, . doesn’t mean anything special. It’s a bit disorienting if you’re used to more OOP-oriented languages where . is a special character.
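For instance, this is perfectly legal R (a minimal sketch; the object name my.data is our own):

> my.data <- c(1, 2, 3)  # the . is just another character in the name
> my.data
[1] 1 2 3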

Installing and loading packages

Finally, you’ll notice that we didn’t load any packages to accomplish this task. The read.*() function variants are a part of base R. Interestingly, there are newer and more convenient ways of reading in files if these functions don’t satisfy your needs. For example, the read_csv() function is in the {readr} package. We know you’re excited to see that _!

In general, when you see simple functions with . in their names, these are old base R functions, created when nobody worried that it would be confusing to have . in a name. The _ counterparts generally come from packages in the tidyverse (see [Link to Come]), of which {readr} is one. They do basically the same thing, but with some slight tweaks to make them more user-friendly.

Let’s see this in action. Just like in Python, you’ll need to install the package. This is typically done directly in the R console; there is no pip equivalent in R. In RStudio, you can do this by going to the “Packages” panel in the lower-right pane and clicking on the “Install” button. Just type in the name and it will install from CRAN (the Comprehensive R Archive Network), the repository of official R packages, which have undergone quality control and are hosted on mirrored servers around the world. The first time you do this, you’ll be asked to choose a mirror site to install from; for the most part, it doesn’t matter which one you choose. Type in tidyverse, make sure that the “Install all dependencies” box is checked, and click OK. You’ll see a lot of red text as the core tidyverse packages and all their dependencies are installed. This is mostly just a convenient way to get lots of useful packages installed all at once.

The actual command that is executed is shown in the console:

install.packages("tidyverse")

Of course, you can just type and execute this yourself instead of using the IDE. The most common problem when installing packages is not having write permission in the packages directory, in which case you’ll be prompted to create a personal library. You can always check where your packages are installed by executing:

.libPaths()
[1] "/Library/Frameworks/R.framework/Versions/4.0/Resources/library"

If you have a personal library, it will also be listed here.

Note

In contrast to Pythonistas, who tend to use virtual environments, useRs typically install a package once, making it available system-wide. Project-specific libraries in R are currently implemented with the {renv} package.

After installing a package, it needs to be loaded in each new R session. When we say initialize, or load, a package, what we’re really saying is “use the library() function to load an installed package and attach it to the search path, making its contents visible from the global environment”. All your packages comprise your library, hence library(). The core suite of tidyverse packages can be loaded using library(tidyverse). That is commonplace, and for the most part not a problem, but you may want to get into the habit of loading only those packages that you actually require instead of filling up your search path needlessly. Let’s start with {readr}, which contains the read_csv() function.

library(readr)

This is the equivalent of:

import readr

Although R uses OOP, it mostly operates in the background, so you’ll never see strange aliases for packages like:

import readr as rr

That’s just a foreign concept in R. After you have attached a package, all of its functions and datasets are available to use directly.

Warning

This calls to mind another legacy function that you may see floating around. You must absolutely avoid attach() (and, for the most part, its counterpart detach()). This function allows you to attach an object to your search path, much like how we attached a package. Thus, you can call elements within the object directly, without first specifying the object name, just like how we call functions within a package without having to explicitly call the package name every time. The reason this has fallen out of favor is that you are likely to have many data objects that you want to access, so conflicting names are likely to be an issue (i.e. leading to masking of objects). Plus, it’s just not explicit.

We want to address one other issue with loading packages before we continue. You’ll often see:

require(readr)

This also loads a package. require() will load an installed package and return TRUE or FALSE depending on whether it succeeded. That makes it useful for testing whether a package exists, so it should be reserved for those instances where such a test is necessary. For everything else, use library().
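If you do need that test, a common idiom looks like this (a sketch, assuming you have permission to install packages):

# Load {readr}, installing it first if it's missing
if (!require(readr)) {
  install.packages("readr")
  library(readr)
}

For everyday loading, library() is better precisely because it fails loudly when a package is missing.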

Alright, let’s read in our data set again, this time using read_csv() to make some simple comparisons between the two methods.

> diamonds_2 <- read_csv("R4Py/diamonds.csv")
Parsed with column specification:
cols(
  carat = col_double(),
  cut = col_character(),
  color = col_character(),
  clarity = col_character(),
  depth = col_double(),
  table = col_double(),
  price = col_double(),
  x = col_double(),
  y = col_double(),
  z = col_double()
)

You’ll notice that we’re afforded a more detailed account of what’s happened.

As we mentioned in the first part of the book, tidyverse design choices tend to be more user-friendly than older processes they update. This output tells us the column names of our tabular data and their types (see Table 1-2).

Also note that the current trend in R is to use snake case: underscores (“_”) between words, and only lowercase letters. Although there has classically been poor adherence to any single style guide in R, the Advanced R book offers good suggestions. Google has also attempted to promote an R style guide, but the community doesn’t seem very strict on this issue.

The Triumph of Tibbles

So far, we’ve imported our data twice, using two different commands. We did this so that you can see some of how R works under the hood and some typical behavior of the tidyverse versus base R. We already mentioned that you can click on the object in the environment pane to view it, but it’s also typical to just print it to the console. You may be tempted to execute:

> print(diamonds)

But the print() function is not necessary except in specific cases, like within a for loop. As in a Jupyter notebook, just execute the object name, e.g.:

> diamonds

This prints the object to the console. We won’t reproduce it here, but if you do execute the above command, you’ll notice that this is not a nice output! Indeed, one wonders why the default allows so much to be printed to the console in interactive mode. Now try the data frame we read in using read_csv():

> diamonds_2
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# … with 53,930 more rows

Wow! That’s a much nicer output than the default base R version. We have a neat little table with the names of the columns on one row, and 3-letter codes for the data types below that in <>. We only see the first 10 rows, and then a note telling us how much we’re not seeing. If there were too many columns for our screen, we’d see them listed at the bottom. Give that a try: make your console output very narrow and execute the command again:

# A tibble: 53,940 x 10
   carat cut     color clarity
   <dbl> <chr>   <chr> <chr>
 1 0.23  Ideal   E     SI2
 2 0.21  Premium E     SI1
 3 0.23  Good    E     VS1
 4 0.290 Premium I     VS2
 5 0.31  Good    J     SI2
 6 0.24  Very GJ     VVS2
 7 0.24  Very GI     VVS1
 8 0.26  Very GH     SI1
 9 0.22  Fair    E     VS2
10 0.23  Very GH     VS1
# … with 53,930 more rows,
#   and 6 more variables:
#   depth <dbl>, table <dbl>,
#   price <dbl>, x <dbl>,
#   y <dbl>, z <dbl>

Base R was already pretty good for EDA4, but this is next-level convenience. So what happened? Actually understanding this is pretty important, but first we want to highlight two other interesting points.

First, notice that we didn’t need to load all of readr to gain access to the read_csv() function. We could have left out library(readr) and just used:

> diamonds_2 <- readr::read_csv("R4Py/diamonds.csv")

The double-colon operator :: is used to access functions within a package. It’s akin to:

from pandas import read_csv

You’ll see :: used when useRs know that they’ll only need one very specific function from a package, or that functions in two packages may conflict with each other, so they want to avoid attaching an entire package to their namespace.

Second, this is the first time we see actual data in R and we can tell right away that numbering begins with 1! (and why wouldn’t it?).

Note

Just an aside on printing objects to the screen: you’ll often see round brackets around an entire expression. This just means “execute the expression and print the object to the screen”.

(aa <- 8)

It mostly just clutters up commands. Unless it’s necessary, just explicitly call the object.

aa <- 8
aa

Plus, it’s easier to just comment out the print line (use Ctrl + Shift + C in RStudio) instead of having to go back and remove all those extra brackets.

OK, so let’s get to the heart of what’s happening here. Why do diamonds and diamonds_2 look so different when printed to the console? Answering this question will help us understand a bit about how R handles objects. Let’s take a look at the class of these objects:

class(diamonds)

[1] "data.frame"

class(diamonds_2)

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

You’ll be familiar with a data.frame from the pandas DataFrame (OK, can we just admit that a pandas DataFrame is just a Python implementation of an R data.frame?). But using the tidyverse read_csv() function produced an object with three additional classes. The two to mention here are the class tbl and its sub-class tbl_df; the two go hand-in-hand in defining a tibble (hence tbl) that has a data frame structure (tbl_df).

Tibbles are a core feature of the tidyverse and have many perks over base R objects. For example, printing to the console: recall that calling an object name is just a shortcut for calling print(). print() has a method to handle data frames, and now that we’ve attached the {readr} package, it also has a method to handle objects of class tbl_df.

So here we see OOP principles operating in the background, implicitly handling object classes and calling the methods appropriate to a given class. Convenient! Confusing? Implicit! We can see why Pythonistas get annoyed, but once you get over it, you’ll see that you can just get on with your work without too much hassle.
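You can peek at this dispatch yourself. A quick sketch:

> class(diamonds_2)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
> inherits(diamonds_2, "data.frame")
[1] TRUE

print() walks the class vector from left to right until it finds a matching method, which is why diamonds_2 prints as a tibble but can still be used anywhere a plain data frame is expected.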

A word about types and exploring

Let’s take a deeper look at our data and see how R stores and handles data.

We already saw two very common classes: data.frame, which is basically what pandas.DataFrame is modelled on, and a special class built on it called tbl (say “tibble”), along with its sub-class tbl_df. So what is a data frame? Have a look at Table 1-1. A data frame is a 2-dimensional, heterogeneous data structure. That sounds simple, but let’s break it down a bit further.

Table 1-1. Dimensions and data types of R’s common data storage objects

Name         Number of dimensions   Type of data
Vector       1                      Homogeneous
List         1                      Heterogeneous
Data frame   2                      Heterogeneous
Matrix       2                      Homogeneous
Array        n                      Homogeneous

Vectors

Vectors are the most basic form of data storage. They are 1-dimensional and homogeneous: one element after another, where every element is of the same type. It’s like a 1-dimensional numpy array composed solely of scalars. We don’t refer to scalars in R; that’s just a 1-element-long vector. There are many types in R, and four commonly used “user-defined atomic vector types”. The term “atomic” already tells us that it doesn’t get any more basic than what we find in Table 1-2.

Table 1-2. Data types

Type        Data frame shorthand   Tibble shorthand   Description
Logical     logi                   <lgl>              Binary TRUE/FALSE, T/F, 1/0
Integer     int                    <int>              Whole numbers from [-Inf, Inf]
Double      num                    <dbl>              Real numbers from [-Inf, Inf]
Character   chr                    <chr>              All alphanumeric characters, including white spaces

The two other user-defined atomic vector types are raw and complex but you’re unlikely to encounter them at this stage in your work, so we won’t discuss them further.

Vectors are fundamental building blocks. There are a few things to note about vectors, so let’s get that out of the way before we return to the workhorse of data science, the beloved data frame.

Vector types have an inherent hierarchy

The four user-defined atomic vector types listed in Table 1-2 are ordered according to increasing information content. When you create a vector, R will find the lowest information-content type that can encompass all the information in that vector, e.g. logical:

> a <- c(TRUE, FALSE)
> typeof(a)
[1] "logical"

logical is R’s equivalent of bool, but it is very rarely referred to as boolean or binary. Also, note that T and F are not in themselves reserved terms in R, so although they are valid shorthand, they are not recommended for logical vectors. Use TRUE and FALSE instead. Let’s take a look at numbers:

> b <- c(1, 2)
> typeof(b)
[1] "double"
> c <- c(3.14, 6.8)
> typeof(c)
[1] "double"

R will automatically convert between double and integer as needed. Math is performed primarily in double precision, which is why the data frame shorthand for double is num, for numeric. Unless you explicitly need to restrict a number to be a true integer, numeric/double will be fine. If you do want to restrict values to integers, you can coerce them to a specific type using one of the as.*() functions, or use the L suffix to specify that a number is an integer.

> b <- as.integer(c(1, 2))
> typeof(b)
[1] "integer"
> b <- c(1L, 2L)
> typeof(b)
[1] "integer"

Sometimes you may explicitly want to store the elements of a vector as integers, if you know that they will never be converted to doubles (e.g. values used as IDs or for indexing), since integers require less storage space. But if they are going to be used in any math that converts them to double, then it will probably be easiest to just store them as doubles to begin with.
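As a rough check (exact byte counts may vary by platform), you can compare the two with object.size():

> object.size(rep(1L, 1e6))  # a materialized integer vector
4000048 bytes
> object.size(rep(1, 1e6))   # the same values stored as doubles
8000048 bytes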

Characters are R’s version of strings. You’ll know these as str in Python, which, confusingly, is also the name of a common R function, str(), that gives the structure of an object. Nonetheless, characters are frequently referred to as strings in R, including in argument and package names, which is an unfortunate inconsistency.

> d <- c("a", "b")
> typeof(d)
[1] "character"

Putting these together, in a vanilla data frame using data.frame() or in the more recently developed tibble using tibble(), gives us:

myTb <- tibble(a = c(T, F),
               b = c(1L, 2L),
               c = c(3.14, 6.8),
               d = c("a", "b"))
myTb
# A tibble: 2 x 4
  a         b     c d
  <lgl> <int> <dbl> <chr>
1 TRUE      1  3.14 a
2 FALSE     2  6.8  b

Notice that we get the nice output from print(), since it’s a tibble. When we look at the structure, we’ll see some potentially confusing features:

> str(myTb)
tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
 $ a: logi [1:2] TRUE FALSE
 $ b: int [1:2] 1 2
 $ c: num [1:2] 3.14 6.8
 $ d: chr [1:2] "a" "b"

str() is a classic base R function that gives some bare-bones output; it’s similar to what you’ll see when you click on the reveal arrow beside the object’s name in the environment pane. The first row gives the object’s class (which we already saw above). S3 refers to the specific OOP system that this object uses, in this case the most basic and least strict OOP system.

Alternatively, we can use the tidyverse glimpse() function, from the {dplyr} package.

> library(dplyr)
> glimpse(myTb)
Rows: 2
Columns: 4
$ a <lgl> TRUE, FALSE
$ b <int> 1, 2
$ c <dbl> 3.14, 6.80
$ d <chr> "a", "b"

Notice that Table 1-2 also lists the shorthand num, which does not appear in the output of glimpse(). It stands for the “numeric” class, which covers both the double (double-precision floating-point numbers) and integer types.

The above examples showed us that a data.frame is a heterogeneous, 2-dimensional collection of homogeneous, 1-dimensional vectors, each having the same length. We’ll get to why R prints all those dollar signs below (and no, it has nothing to do with your salary!).

Naming (internal) Things

We already mentioned that snake case is the current trend in naming objects in R. However, naming columns in a data frame is a different beast altogether, since we just inherit names from the first line of the source file. Data frames in base R, whether obtained using the read.*() family of functions or created manually using the data.frame() function, don’t allow any “illegal” characters in their names. Illegal characters include all white spaces and all reserved characters in R, e.g.:

  • Arithmetic operators (+, -, /, *, etc.)

  • Logical operators (&, |, etc.),

  • Relational operators (==, !=, >, <, etc.)

  • Brackets ([, (, {, < and their closers)

In addition, although names can contain numbers, they can’t begin with one. Let’s see what happens:

# Base package version
data.frame("Weight (g)" = 15,
           "Group" = "trt1",
           "5-day check" = TRUE)
  Weight..g. Group X5.day.check
1         15  trt1         TRUE

All the illegal characters have been replaced with .! I know, right? R is really having a good time mocking you OOP obsessives! On top of that, any variable that began with a number is now prefaced with an X.

So what about importing a file with no header?

> myDiamonds_base_nohead <- read.csv("R4Py/diamonds_noheader.csv", header = F)
> names(myDiamonds_base_nohead)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"

In base R, if we don’t have a header, the assigned names are V, for “variable”, followed by the column number.

The same file read in with one of the readr::read_*() family of functions, or created with tibble(), will maintain illegal characters! This seems trivial, but it’s actually a serious critique of the tidyverse, and it’s something you really need to be aware of if you start meddling in other people’s scripts. Let’s look:

> tibble("Weight (g)" = 15,
+            "Group" = "trt1",
+            "5-day check" = TRUE)
# A tibble: 1 x 3
  `Weight (g)` Group `5-day check`
         <dbl> <chr> <lgl>
1           15 trt1  TRUE

Notice the paired backticks for the columns Weight (g) and 5-day check? You now need to use these to escape the illegal characters. Perhaps this makes for more informative commands, since you have the full name, but you’ll likely want to maintain short and informative column names anyway. Information about the unit (e.g. g for weight) is extraneous information that belongs in a data set legend.

Not only that, but the names given to header-less datasets are also different:

> myDiamonds_tidy_nohead <- read_csv("R4Py/diamonds_noheader.csv", col_names = F)
> names(myDiamonds_tidy_nohead)
 [1] "X1"  "X2"  "X3"  "X4"  "X5"  "X6"  "X7"  "X8"  "X9"  "X10"

Instead of V we get X! This takes us back to the tidyverse as a distinct dialect of R. If you inherit a script written entirely in base R, you’ll have a tricky time if you just start throwing in tidyverse functions with wild abandon. It’s like asking for a Berliner5 in a Berlin bakery!

List the ways

Lists are another type of data structure you’ll use a lot in R. Actually, we’ve already encountered them in our very short R journey. That’s because data.frames are a specific class of type list. Yup, you heard that right.

> typeof(myTb)
[1] "list"

Table 1-1 tells us that a list is a 1-dimensional, heterogeneous object. That means every element in this 1-dimensional object can be a different type; indeed, lists can contain not only vectors, but other lists, data frames, matrices, and on and on. In the case that each element is a vector of the same length, we end up with tabular data of class data.frame. Pretty convenient, right?
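As a bare-bones sketch, here’s a hand-made list holding mixed types (the names are our own):

> my_list <- list(nums = 1:3,
+                 chars = c("a", "b"),
+                 nested = list(x = TRUE))
> typeof(my_list)
[1] "list"
> length(my_list)
[1] 3

Each of the three elements has a different type and length, something a data frame would never tolerate. Typically, though, you’ll encounter pure lists in the wild as the output from statistical tests, so let’s take a look.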

The PlantGrowth data frame is a built-in object in R. It contains two variables (i.e. elements in the list, aka columns in the tabular data): weight and group.

> glimpse(PlantGrowth)
Rows: 30
Columns: 2
$ weight <dbl> 4.17, 5.58, 5.18, 6.11, 4.50, 4.61, …
$ group  <fct> ctrl, ctrl, ctrl, ctrl, ctrl, ctrl, …

The data set describes the dry plant weight (in grams, thank you) of 30 observations (i.e. individual plants, aka rows in the tabular data) grown under one of three conditions given in group: ctrl, trt1, and trt2. The convenient glimpse() function doesn’t show us all three groups, but the classic str() does:

> str(PlantGrowth)
'data.frame':	30 obs. of  2 variables:
 $ weight: num  4.17 5.58 5.18 6.11 4.5 4.61 5.17 4.53 5.33 5.14 ...
 $ group : Factor w/ 3 levels "ctrl","trt1",..: 1 1 1 1 1 1 1 1 1 1 ...

If you’re getting nervous about <fct> and Factor w/ 3 levels, just hang tight — we’ll talk about that after we’re done with lists.

Alright, let’s get to some tests. We may want to define a linear model for weight described by group:

PGlm <- lm(weight ~ group, data = PlantGrowth)

lm() is a foundational and flexible function for defining linear models in R. Our model is written in formula notation, where weight ~ group takes the form y ~ x. You’ll recognize ~ as the standard symbol for “described by” in statistics. The output is of type list and class lm:

> typeof(PGlm)
[1] "list"
> class(PGlm)
[1] "lm"

There are two things that we want to remind you of and build on here.

First, remember that we mentioned that a data frame is a collection of vectors of the same length? Now we see that this just means a data frame is a special class of list in which each element is a vector of the same length. We can access a named element within a list using the $ notation:

> names(PlantGrowth)
[1] "weight" "group"
> PlantGrowth$weight
 [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14 4.81 4.17 4.41 3.59
[15] 5.87 3.83 6.03 4.89 4.32 4.69 6.31 5.12 5.54 5.50 5.37 5.29 4.92 6.15
[29] 5.80 5.26

Notice the way it’s printed, along rows, where each row begins with the index position of its first element in []. (We already mentioned that R begins indexing at 1, right?) In RStudio, you’ll get an autocomplete list of column names after typing $.

We can also access a named element within a list using the same notation:

> names(PGlm)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "contrasts"     "xlevels"       "call"          "terms"
[13] "model"

You can see how a list is such a nice way to store the results of a statistical test, since we have lots of different kinds of output. For example, coefficients:

> PGlm$coefficients
(Intercept)   grouptrt1   grouptrt2
      5.032      -0.371       0.494

This is a named, 3-element-long numeric vector. (Although its elements are named, the $ operator is invalid for atomic vectors, but we have some other tricks up our sleeve, of course — see indexing with [] below.) We didn’t get into the details, but you may be aware that, given our data, we expect to have three coefficients (estimates) in our model.

Consider residuals:

> PGlm$residuals
     1      2      3      4      5      6      7      8      9     10
-0.862  0.548  0.148  1.078 -0.532 -0.422  0.138 -0.502  0.298  0.108
    11     12     13     14     15     16     17     18     19     20
 0.149 -0.491 -0.251 -1.071  1.209 -0.831  1.369  0.229 -0.341  0.029
    21     22     23     24     25     26     27     28     29     30
 0.784 -0.406  0.014 -0.026 -0.156 -0.236 -0.606  0.624  0.274 -0.266

They are stored in a named, 30-element-long numeric vector (remember, we had 30 observations). So lists are pretty convenient for storing heterogeneous data, and you’ll see them quite often in R, although there is a concerted effort in the tidyverse to favor data frames and their variants.

Second, remember that we mentioned that the . mostly doesn’t have any special meaning? Well, here’s one of the exceptions where the . does have a meaning. Probably the most common use is in formula notation, where it means “all other variables” when defining a model. Here, since PlantGrowth only has one column other than weight, we could have written:

lm(weight ~ ., data = PlantGrowth)

It’s not really necessary here, since we only have one independent variable, but in some cases it’s convenient. The ToothGrowth data set has a similar experimental setup, but we’re measuring the length of tooth growth under two conditions: a specific supplement (supp) and its dosage (dose).

lm(len ~ ., data = ToothGrowth)
# is the same as
lm(len ~ supp + dose, data = ToothGrowth)

But, like always, being explicit has its advantages, such as defining more precise models:

lm(len ~ supp * dose, data = ToothGrowth)

Can you spot the difference between the two outputs? (Hint: unlike +, the * also includes the interaction term, supp:dose.)

The Facts about Factors

Alright, the last thing we need to clear up before we continue is the phenomenon of the factor. Factors are akin to the pandas category type in Python. They are a wonderful and useful class in R. For the most part, they just exist and you won’t have cause to worry about them, but do be aware: their uses and misuses will make your life a dream or a misery, respectively. Let’s take a look.

The name “factor” is very much a statistics term; we may refer to factors as categorical variables, as Python does, but you’ll also see them referred to as qualitative and discrete variables in textbooks and in specific R packages, like RColorBrewer and ggplot2, respectively. Although these terms all refer to the same kind of variable, when we say factor in R, we’re referring to a class of type integer. It’s like how data.frame is a class of type list. Observe:

> typeof(PlantGrowth$group)
[1] "integer"
> class(PlantGrowth$group)
[1] "factor"

You can easily identify a factor because in both the output from str() (see above) and in plain vector formatting, the levels will be stated:

> PlantGrowth$group
 [1] ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl ctrl
 [11] trt1 trt1 trt1 trt1 trt1 trt1 trt1 trt1 trt1 trt1
 [21] trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2 trt2
Levels: ctrl trt1 trt2

Levels are the statistician’s name for what we refer to as “groups”. Another giveaway is that, although we have characters, they are not enclosed in quotation marks! This is very curious, since we can actually treat them as characters even though they are type integer (see Table 1-2). You may be interested in looking at the internal structure of an object using dput(). Here we can see that we have an integer vector, c(1L, ...), and two attributes: the labels and the class.

> dput(PlantGrowth$group)
structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
            2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
            3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L),
          .Label = c("ctrl", "trt1", "trt2"),
          class = "factor")

The labels define the names of each level in the factor and are mapped to the integers, 1 being ctrl, and so on. So when we print to the screen, we only see the names, not the integers. This seems to be a legacy element from the days when memory was expensive and it made sense to store a small integer many times over instead of a potentially long character string.
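You can pull the two layers apart yourself:

> as.integer(PlantGrowth$group)  # the underlying integers
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
> levels(PlantGrowth$group)      # the labels they map onto
[1] "ctrl" "trt1" "trt2"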

So far, the only kind of factor we’ve seen describes a nominal variable (a categorical variable with no order), but we have a nice solution for ordinal variables too. Check out this variable from the diamonds data set:

> diamonds$color
[1] E E E I J J I H E H ...
Levels: D < E < F < G < H < I < J

The levels have an order, in the sense that D comes before E, and so on.
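You can build an ordered factor yourself with factor() (a minimal sketch; the values are our own):

> sizes <- factor(c("small", "large", "medium"),
+                 levels = c("small", "medium", "large"),
+                 ordered = TRUE)
> sizes
[1] small  large  medium
Levels: small < medium < large
> sizes < "large"
[1]  TRUE FALSE  TRUE

Because the levels are ordered, relational operators like < now make sense.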

How to find… stuff

Alright, by now we’ve seen how R stores data, plus various subtleties that you’ll need to keep in mind, in particular things that may trip up a Pythonista. Let’s move on to logical expressions and indexing, which is to say: how to find… stuff.

Logical expressions are combinations of relational operators, which ask yes/no questions of comparison, and logical operators, which combine those yes/no questions.

Let’s begin with a vector:

> diamonds$price > 18000
   [1] FALSE FALSE FALSE FALSE FALSE FALSE
   ...

This simply asks which of our diamonds are more expensive than $18,000. There are three key things to always keep in mind here.

First, the length of the shorter object, here the unassigned, 1-element-long numeric vector 18000, will be “recycled” over the entire length of the longer vector, here the price column from the diamonds data frame, accessed with the $ notation (53,940 elements). In Python, you may refer to this as broadcasting when using numpy arrays, with vectorization as a distinct concept. In R, we simply refer to both as vectorization, or vector recycling.

Second, this means that the output vector is the same length as the length of the longest vector, here 53,940 elements.

Third, anytime you see a relational or logical operator, you know that the output vector will always be a logical vector. Remember logical as in TRUE/FALSE in R, not logical as in Mr. Spock.

If you want to combine questions, you’ll have to combine two complete questions, such as: really expensive and small diamonds (classy!):

> diamonds$price > 18000 & diamonds$carat < 1.5
   [1] FALSE FALSE FALSE FALSE FALSE FALSE
   ...

Notice that all three key points above hold true. When we introduced the atomic vector types in Table 1-2, we noted that logical is also defined by 1 and 0. This means we can do math on logical vectors, which is very convenient. How many expensive little diamonds do we have?

> sum(diamonds$price > 18000 & diamonds$carat < 1.5)
[1] 9

(Not enough, if we’re being honest.) What proportion of our data set do they represent? Just divide by the total number of observations.

> sum(diamonds$price > 18000 & diamonds$carat < 1.5)/nrow(diamonds)
[1] 0.0001668521
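Since a logical vector is just 1s and 0s under the hood, mean() gets us the proportion in one step:

> mean(diamonds$price > 18000 & diamonds$carat < 1.5)
[1] 0.0001668521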

So that’s asking and combining questions. Let’s take a look at indexing using []. You’re already familiar with [], but we feel that it is more straightforward in R right out of the box. Here’s a summary:

Table 1-3. Indexing

Use         Data object                     Result
xx[i]       Vector                          A vector of only the i elements
xx[[i]]     List, data frame, tibble        The i element extracted from the list
xx[i]       List, data frame, tibble        The i element, maintaining the original structure
xx[i,j]     Data frame, tibble, or matrix   The i rows and j columns
xx[i,j,k]   Array                           The i rows, j columns, and k dimension of an array

i, j, and k can each be one of three different types of vector inside []:

  1. An integer vector

  2. A logical vector, or

  3. A character vector containing names, if the elements are named.

This should be familiar to you already from Python. The integer and logical vectors can be unassigned vectors, or objects or functions that resolve to integer or logical vectors. The numbers don’t need to be of type integer, although whole numbers are clearer. Indexing with a numeric/double rounds down to the nearest whole number, but try to avoid using real numbers when indexing unless it serves a purpose.
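Here are all three index types at work on one small named vector (a sketch; the vector is our own):

> x <- c(a = 10, b = 20, c = 30)
> x[2]                      # an integer vector
 b
20
> x[c(FALSE, TRUE, FALSE)]  # a logical vector
 b
20
> x["b"]                    # a character vector of names
 b
20

All three get you the same element; which one you use depends on what you have at hand.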

Let’s begin with integers. We’ll take another little detour here to discuss the omnipresent : operator, which won’t do what your Pythonista brain tells you it should do. We’ll begin with a built-in character vector, letters, which is the same as having a column in a data frame, like PlantGrowth$weight.

> letters[1] # The 1st element (indexing begins at 1)
[1] "a"

So that’s pretty straightforward. How about counting backwards?

> letters[-4] # Everything except the 4th element,
> # (*not* the fourth element, counting backwards!)
 [1] "a" "b" "c" "e" "f" "g" "h" ...

Nope, that’s not happening. The - means to exclude an element, not to count backwards, but it was a nice try. We can also exclude a range of values:

> letters[-(1:20)] # Exclude elements 1 through 20
[1] "u" "v" "w" "x" "y" "z"

and, of course, index a range of values:

> letters[23:26] # The 23rd to the 26th element
[1] "w" "x" "y" "z"

And remember, we can combine this with anything that will give us an integer vector. length() will tell us how many elements we have in our vector, and lhs:rhs is shorthand for the function seq(from = lhs, to = rhs, by = 1), which creates a sequence of values in incremental steps of by, in this case defaulting to 1.

> letters[23:length(letters)] # The 23rd to the last element
[1] "w" "x" "y" "z"

So that means you always need an lhs and an rhs when using :. It’s a pity, but this isn’t going to work:

> letters[23:] # error

Using [] inappropriately gives rise to a legendary and mysterious error message in R:

> df[1]
Error in df[1] : object of type 'closure' is not subsettable
> t[6]
Error in t[6] : object of type 'closure' is not subsettable

Can you tell where we went wrong? df and t are not data storage objects that we can index! They are functions, and thus they must be followed by (), where we provide the arguments. [] is always used to subset, and these functions, df() and t(), are objects of type closure, which are not subsettable. So it’s actually a pretty clear error message, and a good reminder not to call objects by ambiguous short names, nor to confuse functions with data storage objects.

That’s all fine and good, but you’re probably aware that the true power of indexing comes from using logical vectors to index the TRUE elements, just like using type bool in Python. The most common way of obtaining a logical vector for indexing is to use a logical expression (see above). This is exactly what happens with masking in numpy.

So what are the colors of those fancy diamonds?

> diamonds$color[diamonds$price > 18000 & diamonds$carat < 1.5]
[1] D D D D F D F F E
Levels: D < E < F < G < H < I < J

Here, we’re using price and carat to find the colors of the diamonds that we’re interested in. Not surprisingly, they are the best color classifications. You may find it annoying to have to write diamonds$ repeatedly, but we would argue that it just makes things more explicit, and it’s what happens when we reference a Series in Python. Since we’re indexing a vector, we get a vector as output. Let’s turn to data frames. We could have written the above indexing command as:

> diamonds[diamonds$price > 18000 & diamonds$carat < 1.5, "color"]
# A tibble: 9 x 1
  color
  <ord>
1 D
2 D
3 D
4 D
5 F
6 D
7 F
8 F
9 E

As you would expect, in [i,j], i always refers to the rows (observations) and j always refers to the columns (variables). Notice that we also mixed two different types of input, which works because they appear in different parts of the expression. We used a logical vector as long as the data frame’s number of observations (thank you, vector recycling) to obtain all the TRUE rows, and then a character vector to extract a named element; recall that each column in a data frame is a named element. This is a really typical formulation in R. The output is a data frame, specifically a tibble, since we used indexing on the diamonds data frame and not on a specific 1-dimensional vector therein. Not to get bogged down in the topic, but it is worth noting that if we didn’t have a tibble, indexing for a single column (in j) would return a vector:

> diamonds <- data.frame(diamonds)
> class(diamonds)
[1] "data.frame"
> diamonds[diamonds$price > 18000 & diamonds$carat < 1.5, "color"]
[1] D D D D F D F F E
Levels: D < E < F < G < H < I < J

This is indeed confusing, and it highlights the necessity of always being aware of the class of our data object. The tidyverse tries to address some of this by maintaining data frames even in those instances where base R prefers to revert to a vector. The tidyverse method, shown below, makes things easier. (The base package shorthand, subset(), works in much the same way, but filter() works better in a tidyverse context.)

> diamonds %>%
+   filter(price > 18000, carat < 1.5) %>%
+   select(color)
# A tibble: 9 x 1
  color
  <ord>
1 D
2 D
3 D
4 D
5 F
6 D
7 F
8 F
9 E

We introduced the principles behind the tidyverse in the first part of the book, and now we’re seeing it in action. The %>% above allows us to unnest objects and functions. For example, we could have written:

> select(filter(diamonds, price > 18000, carat < 1.5), color)

That has the format of a long, nested function call that is quite difficult to follow. We can pronounce %>% as “and then”, and thus read the entire command above as “take the diamonds data set, and then filter using these criteria, and then select only these columns”. This goes a long way toward helping us literally read and understand code, and it is why dplyr is described as a grammar of data manipulation. Objects, like tibbles, are the nouns; %>% is our punctuation; and functions are the verbs.

Table 1-4. Function descriptions

Function      Works on   Description
filter()      rows       Use a logical vector to retain only TRUE rows
arrange()     rows       Reorder rows according to values in a specific column
select()      columns    Use a name or a helper function to extract only those columns
summarise()   columns    Apply aggregation functions to a column
mutate()      columns    Apply transformation functions to a column

The five most important verbs in dplyr are listed in Table 1-4. We already saw filter() and select() in action, so let’s take a look at applying functions with summarise() and mutate(). summarise() is used to apply an aggregation function, which returns a single value, like the mean, mean(), or the standard deviation, sd(). It’s common to see summarise() used in combination with the group_by() function. In our analogy of grammatical elements, group_by() is an adverb: it modifies how a verb operates. In the example below, we use group_by() to add a grouping attribute to our data frame, and the functions applied in summarise() are thus group-specific. It’s just like the .groupby() method for pandas DataFrames!

> PlantGrowth %>%
+   group_by(group) %>%
+   summarise(avg = mean(weight),
+             stdev = sd(weight))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
  group   avg stdev
  <fct> <dbl> <dbl>
1 ctrl   5.03 0.583
2 trt1   4.66 0.794
3 trt2   5.53 0.443

mutate() is used to apply a transformation function, which returns as many outputs as there are inputs. In these cases, it’s not unusual to use tidyverse syntax and native [] indexing in combination to index specific values. For example, this data set contains the area under irrigation (in thousands of hectares) for different regions of the world at four different time points.

> irrigation <- read_csv("R4Py/irrigation.csv")
Parsed with column specification:
cols(
  region = col_character(),
  year = col_double(),
  area = col_double()
)
> irrigation
# A tibble: 16 x 3
   region      year  area
   <chr>      <dbl> <dbl>
 1 Africa      1980   9.3
 2 Africa      1990  11
 3 Africa      2000  13.2
 4 Africa      2007  13.6
 5 Europe      1980  18.8
 6 Europe      1990  25.3
 7 Europe      2000  26.7
 8 Europe      2007  26.3
...

We may want to measure the area fold-change relative to 1980 for each region.

irrigation %>%
  group_by(region) %>%
  mutate(area_std_1980 = area/area[year == 1980])
# A tibble: 16 x 4
# Groups:   region [4]
   region      year  area area_std_1980
   <chr>      <dbl> <dbl>         <dbl>
 1 Africa      1980   9.3          1
 2 Africa      1990  11            1.18
 3 Africa      2000  13.2          1.42
 4 Africa      2007  13.6          1.46
 5 Europe      1980  18.8          1
 6 Europe      1990  25.3          1.35
 7 Europe      2000  26.7          1.42
 8 Europe      2007  26.3          1.40
 ...

Just like with summarise(), we can add more transformations within a single mutate() call, like the percentage change between time points:

> irrigation <- irrigation %>%
+   group_by(region) %>%
+   mutate(area_std_1980 = area/area[year == 1980],
+          area_per_change = c(0, diff(area)/area[-length(area)] * 100))
> irrigation
# A tibble: 16 x 5
# Groups:   region [4]
   region      year  area area_std_1980 area_per_change
   <chr>      <dbl> <dbl>         <dbl>           <dbl>
 1 Africa      1980   9.3          1               0
 2 Africa      1990  11            1.18           18.3
 3 Africa      2000  13.2          1.42           20.0
 4 Africa      2007  13.6          1.46            3.03
 5 Europe      1980  18.8          1               0
 6 Europe      1990  25.3          1.35           34.6
 7 Europe      2000  26.7          1.42            5.53
 8 Europe      2007  26.3          1.40           -1.50
 ...

Reiterations Redo

Notice that we didn’t need any looping in the above examples. You may have intuitively wanted to apply a for loop to calculate the aggregation or transformation functions for each region, but it’s not necessary. Avoiding for loops is somewhat of a pastime in R, supported in the base package by the apply family of functions.

Because vectorization is so fundamental to R, there’s a bit of an unofficial contest to see how few for loops you can write. We imagine some useRs have a wall sign: “Days since last for loop:” like factories have for accidents.

This means there are some very old methods for reiterating tasks, along with some newer methods which make the process more convenient.

Table 1-5. Base package apply family

Function   Use
apply()    Apply a function to each row or column of a matrix or data frame
lapply()   Apply a function to each element in a list
sapply()   Simplify the output of lapply()
mapply()   The multivariate version of sapply()
tapply()   Apply a function to values defined by an index
eapply()   Apply a function to values in an environment

The old-school method relies on the apply family of functions, listed in Table 1-5. Except for apply(), pronounce them all as the first letter and then “apply”, hence “t apply”, not “tapply”. There’s a bit of a trend to disavow these workhorses of reiteration, but you’ll still see them a lot, so they’re worth getting familiar with. Doing so will also help you to appreciate why the tidyverse arose. As an example, let’s return to the aggregation functions we applied to the PlantGrowth data frame above. With the apply family of functions, we could have used:

> tapply(PlantGrowth$weight, PlantGrowth$group, mean)
 ctrl  trt1  trt2
5.032 4.661 5.526
> tapply(PlantGrowth$weight, PlantGrowth$group, sd)
     ctrl      trt1      trt2
0.5830914 0.7936757 0.4425733

You can imagine reading this as: “take the weight column from the PlantGrowth data set, split the values according to the labels in the group column of the PlantGrowth data set, apply the mean function to each group of values, and then return a named vector”.

Can you see how tedious this gets if you want to add more functions? Named vectors can be convenient, but they are not really a typical way to store important data.

One attempt to simplify this process was implemented in plyr, the precursor to dplyr. plyr is pronounced “plier”, like the small multifunctional hand-held tool. We use it as such:

library(plyr)

ddply(PlantGrowth, "group", summarize,
      avg = mean(weight))

This is still sometimes used today, but it has mostly been superseded by a data-frame-centric version of the package, hence the d in dplyr (say “d-plier”):

library(dplyr)
PlantGrowth %>%
  group_by(group) %>%
  summarize(avg = mean(weight))

But to be clear, we could have returned a data frame with other very old functions:

> aggregate(weight ~ group, PlantGrowth, mean)
  group weight
1  ctrl  5.032
2  trt1  4.661
3  trt2  5.526

Wow, what a great function, right? This thing is super old! You’ll still see it around, and why not: once you wrap your head around it, it’s elegant and gets the job done, even though it still only applies one function at a time. However, the ongoing push toward a unified tidyverse framework, which is easier to read and arguably easier to learn, means the ancient arts are fading into the background.

These functions have existed since the early days of R and reflect, intuitively, what statisticians do all the time. They split data into chunks defined by some property (rows, columns, categorical variables, objects), then they apply some kind of action (plotting, hypothesis testing, modelling, etc.), and then they combine the output together in some way (data frame, list, etc.). The process is sometimes called Split-Apply-Combine. Realizing that this process kept repeating itself made it clearer to the community how to start thinking about data and, indeed, how to actually organize data. From this, the idea of “tidy” data was born.

Final Thoughts

In Python, you often hear about the Pythonic way, meaning the proper Python syntax and the preferred method for performing a specific action. This doesn’t really exist in R; there are many ways to go about the same thing, and people will use every variety! Plus, they’ll often mix dialects. Although some dialects are easier to read than others, this hybridization can make the language harder to get into.

Added to this is the constant tweaking of an expanding tidyverse. Functions are tagged as experimental, dormant, maturing, stable, questioning, superseded, and archived. Couple that with relatively lax standards for project-specific package management and for the use of virtual environments, and you can imagine a certain amount of growing frustration.

R officially celebrated its 20th birthday in 2020, and its roots are much older than that. Yet it sometimes feels like R is currently experiencing a teenage growth spurt: it’s trying to figure out how it suddenly got a lot bigger, and it can sometimes be both awkward and cool. Blending the different R dialects will take you a long way in discovering its full potential.

1 useR! is the annual R conference and also a series of books published by Springer.

2 “Graphical User Interface”

3 “Integrated Development Environment”

4 Exploratory Data Analysis, an informal method for summarizing data

5 Berliner (noun): in Berlin, a resident of the city. Everywhere else: a tasty, jelly-filled, sugar-powdered donut.
