© Thomas Mailund 2019
Thomas Mailund, R Data Science Quick Reference, https://doi.org/10.1007/978-1-4842-4894-2_1

1. Introduction

Thomas Mailund, Aarhus, Denmark

R is a functional programming language with a focus on statistical analysis. It has built-in support for model specifications that can be manipulated as first-class objects, and an extensive collection of functions for dealing with probability distributions and model fitting, both built-in and through extension packages.
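As a minimal sketch of what "model specifications as first-class objects" can look like in practice, the snippet below stores a formula in a variable, inspects it, modifies it, and only then fits it. The built-in mtcars data set and the variable names are my choices for illustration; they are not taken from the text.

```r
# A model specification (formula) is an ordinary R object.
model <- mpg ~ wt + hp
class(model)      # "formula"
all.vars(model)   # "mpg" "wt" "hp"

# It can be manipulated before any data is involved:
# drop hp from the right-hand side.
smaller <- update(model, . ~ . - hp)

# Only now is the specification combined with data and fitted.
fit <- lm(smaller, data = mtcars)
coef(fit)

# Built-in probability distribution functions follow a d/p/q/r pattern.
pnorm(1.96)    # cumulative probability, about 0.975
qnorm(0.975)   # the matching quantile, about 1.96
```

Because the formula is a value, it can be passed to functions, stored in lists, or generated programmatically before fitting.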

The language has a long history. It was created in 1992 and is based on an even older language, S, from 1976. Many quirks and inconsistencies have entered the language over the years. There are, for example, at least three partly incompatible ways of implementing object orientation, and one of these is based on a naming convention that clashes with some built-in functions. It can be challenging to navigate through the many quirks of R, but this is alleviated by a suite of extensions, collectively known as the “Tidyverse.”
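The naming-convention quirk can be illustrated with the S3 system, where a method is simply a function named after the generic and the class, separated by a dot. The sketch below uses a hypothetical generic, describe, and a hypothetical class, reading, that I made up for the example; t.test is a real function from the stats package.

```r
# S3 methods are plain functions named <generic>.<class>.
describe <- function(x) UseMethod("describe")   # hypothetical generic
describe.default <- function(x) "something else"
describe.reading <- function(x) paste("reading of", unclass(x))

r <- structure(42, class = "reading")
describe(r)     # dispatches to describe.reading
describe("a")   # falls back to describe.default

# The same dot pattern appears in function names that are not methods:
# t.test() performs a t-test; it is not a method of t() for a class "test".
is.function(stats::t.test)
```

Since the dot carries no syntactic meaning, nothing but convention distinguishes a method from a function that merely has a dot in its name.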

While many data science applications involve more complex data structures, such as graphs and trees, most bread-and-butter analyses involve rectangular data. That is, the data can be structured as a table of rows and columns where, usually, the rows correspond to observations and the columns correspond to the variables recorded for each observation, explanatory and response variables alike. The usual data sets are also small enough to be loaded into memory and analyzed on a single desktop or laptop. I will assume that both are the case here. If they are not, you need big data techniques that go beyond the scope of this book.
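To make "rectangular data" concrete, here is a small sketch of such a table built with the tibble package from the Tidyverse. The column names and values are invented purely for illustration.

```r
library(tibble)

# Each row is one observation; each column is one variable.
# All values below are made up for the example.
patients <- tibble(
  id       = 1:3,
  age      = c(34, 51, 29),
  treated  = c(TRUE, FALSE, TRUE),
  response = c(1.2, 0.7, 1.9)
)

nrow(patients)    # number of observations
ncol(patients)    # number of variables
names(patients)   # the variable names
```

Every column holds values of a single type, while different columns can have different types.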

The Tidyverse is a collection of extensions to R: packages that are primarily aimed at analyzing tabular data that fits into your computer’s memory. Some of the packages go beyond this, but since data science is predominantly manipulation of tabular data, this is the focus of this book.

The Tidyverse packages provide consistent naming conventions, consistent programming interfaces, and, more importantly, a consistent notation that captures how data analysis consists of successive steps of data manipulation and model fitting.

The packages do not merely provide collections of functions for manipulating data but rather small domain-specific languages for different aspects of your analysis. Almost all of these small languages are based on the same overall “pipeline” syntax, so once you know one of them, you can easily use the others. A notable exception is the plotting library ggplot2. It is slightly older than the other extensions and because of this has a different syntax. The main difference is the operator used for combining operations. The data pipeline notation uses the %>% operator, while ggplot2 combines plotting instructions using +. If you are like me, then you will often try to combine ggplot2 instructions using %>%, purely out of habit, but once you get an error from R, you will recognize your mistake and can quickly fix it.
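The difference between the two operators is easiest to see side by side. The following is a small sketch using the built-in mtcars data set; the particular dplyr verbs (filter, group_by, summarise) and plot layers are my choices for illustration and are not prescribed by the text.

```r
library(dplyr)
library(ggplot2)

# Data pipeline: each step passes its result to the next with %>%.
heavy_means <- mtcars %>%
  filter(wt > 3) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# ggplot2: plotting instructions are combined with + instead.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight", y = "Miles per gallon")

# The two notations can be mixed: a pipeline can feed data into ggplot(),
# but from that point on the layers are still added with +.
p2 <- mtcars %>%
  filter(wt > 3) %>%
  ggplot(aes(x = wt, y = mpg)) +
  geom_point()
```

A useful rule of thumb is that %>% moves data from one step to the next, while + adds another component to a plot that is under construction.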

This book is a syntax reference for modern data science in R. It is a guide to the Tidyverse packages, written for programmers who want to use them instead of basic R programming.

This guide does not explain each Tidyverse package exhaustively. The development of Tidyverse packages progresses rapidly, so even a complete guide would be out of date shortly after it was printed. The structure of the extensions and the domain-specific languages they provide are stable, however, and after seeing examples that use a subset of their functionality, you should have no difficulty reading the package documentation for each of them and finding the features you need that are not covered in the book.

To get started with the Tidyverse, install and load it:
install.packages("tidyverse")  # install once per R installation
library(tidyverse)             # load at the start of each session

The Tidyverse consists of many packages that you can install and load independently, but loading them all through the tidyverse package is the easiest, so unless you have good reasons not to, just load tidyverse when you start an analysis. In this book I describe three packages that are not loaded from tidyverse but are generally considered part of the Tidyverse.1
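If you do want to load packages independently, a sketch of that approach could look like the following. It assumes the tidyverse package is already installed; the choice of dplyr and ggplot2 is only an example.

```r
# Attach only the packages a particular analysis needs ...
library(dplyr)
library(ggplot2)

# ... and list every package that belongs to the tidyverse package,
# including those that library(tidyverse) does not attach.
all_pkgs <- tidyverse::tidyverse_packages()
"dplyr" %in% all_pkgs
```

Calling tidyverse::tidyverse_packages() with the :: prefix looks up the function without attaching the whole collection.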
