Transforming data into actionable information requires the ability to clearly and reproducibly wrangle, analyze, and visualize that data. These skills are the foundations of data science, a field that has amplified our collective understanding of issues ranging from disease transmission to racial inequities. Moreover, the ability to programmatically interact with data enables researchers and professionals to quickly discover and communicate patterns in data that are often difficult to detect. Understanding how to write code to work with data allows people to engage with information in new ways and on larger scales.
The existence of free and open source software has made these tools accessible to anyone with access to a computer. The purpose of this book is to teach people how to leverage programming to ask questions of their data sets.
This book revolves around the practical steps needed to program for data science using the R programming language. It takes a holistic approach to teaching the topic, recognizing that an entire ecosystem of tools and technologies is needed to do this. While writing code is a core part of being a data scientist (and of this book), many more foundational skills must be acquired as part of this journey. Data science requires installing and configuring software to write, execute, and manage code; tracking the version of (and changes to) your projects; leveraging core concepts from computer science to understand how to accomplish a given task; accessing and processing data from a variety of sources; leveraging visual communication to expose patterns in your data; and building applications to share insights with others. The purpose of this text is to help people develop a strong foundation across these areas so that they can enter the data science field (or bring data science to their field).
This book is written for people with no programming or data science experience, though it would still be helpful for people active in the field. This book was originally developed to support a course in the Informatics undergraduate degree program at the University of Washington, so it is (not surprisingly) well suited for college students interested in entering the data science field. We also believe that anyone whose job involves working with data can benefit from learning how to reproducibly create analyses, visualizations, and reports.
If you are interested in pursuing a career in data science, or if you use data on a regular basis and want to use programming techniques to gain information from that data, then this text is for you.
The book is divided into six sections, each of which is summarized here.
Part I walks through the steps of downloading and installing the software needed for the rest of the book. More specifically, Chapter 1 details how to install a text editor, a Bash terminal, the R interpreter, and the RStudio program. Then, Chapter 2 describes how to use the command line for basic file system navigation.
Part II walks through the technical basis of project management, including keeping track of the version of your code and producing documentation. Chapter 3 introduces the git software for tracking line-by-line code changes, as well as GitHub, the popular code hosting and collaboration service built around it. Chapter 4 then describes how to use Markdown to produce the well-structured and well-styled documentation needed for sharing and presenting data.
Part III: Foundational
This section introduces the R programming language, the primary language used throughout the book. In doing so, it introduces the basic syntax of the language (Chapter 5), describes fundamental programming concepts such as functions (Chapter 6), and introduces the basic data structures of the language: vectors (Chapter 7) and lists (Chapter 8).
Because the most time-consuming part of data science is often loading, formatting, exploring, and reshaping data, Part IV provides a deep dive into the best ways to wrangle data in R. After introducing techniques and concepts for understanding the structure of real-world data (Chapter 9), the book presents the data structure most commonly used for managing data in R: the data frame (Chapter 10). To better support working with this data, the book then describes two packages for programmatically interacting with it: dplyr (Chapter 11) and tidyr (Chapter 12). The last two chapters of the section describe how to load data from databases (Chapter 13) and from web-based data services with application programming interfaces (APIs) (Chapter 14).
Part V focuses on the conceptual and technical skills necessary to design and build visualizations as part of the data science process. It begins with an overview of data visualization principles (Chapter 15) to guide your choices in designing visualizations. Chapter 16 then describes in granular detail how to use the ggplot2 visualization package in R. Finally, Chapter 17 explores the use of three additional R packages for producing engaging interactive visualizations.
Part VI: Building and Sharing Applications
As in any domain, data science insights are valuable only if they can be shared with and understood by others. The final section of the book focuses on using two different approaches to creating interactive platforms to share your insights (directly from your R program!). Chapter 18 uses the R Markdown framework to transform analyses into sharable documents and websites. Chapter 19 takes this a step further with the Shiny framework, which allows you to create interactive web applications using R. Chapter 20 then describes approaches for working on collaborative teams of data scientists, and Chapter 21 details how you can further your education beyond this book.
Throughout the book, you will see computer code appear inline with the text, as well as in distinct blocks. When code appears inline, it will appear in monospace font. A distinct code block looks like this:
# This is a comment - it describes the code that follows
# The next line of code prints the text "Hello world!"
print("Hello world!")
The text in the code blocks is colored to reflect the syntax of the programming language used (typically the R language). Example code blocks often include values that you need to replace. These replacement values appear in UPPER_CASE_FONT, with words separated by underscores. For example, if you need to work with a folder of your choosing, you would put the name of your folder where it says FOLDER_NAME in the code. Code sections will all include comments: in programming, comments are bits of text that are not interpreted as computer instructions. They aren’t code; they’re just notes about the code! While a computer is able to understand the code, comments are there to help people understand it. Tips for writing your own descriptive comments are discussed in Chapter 5.
To guide your reading, we also include five types of special callout notes:
These boxes provide best practices and shortcuts that can make your life easier.
These boxes provide interesting background information on a topic.
These boxes reinforce key points that are important to keep in mind.
These boxes describe common mistakes and explain how to avoid them.
Throughout the text there are instructions for using specific keyboard keys. These are included in the text in lowercase monospace font. When multiple keys need to be pressed at the same time, they are separated by a plus sign (+). For example, if you needed to press the Command and “c” keys at the same time, it would appear as cmd+c. Whenever the cmd key is used, Windows users should instead use the Control (ctrl) key.
The individual chapters in this book will walk you through the process of programming for data science. Chapters often build upon earlier examples and concepts (particularly through Part III and Part IV).
This book includes a large number of code examples and demonstrations, with reported output and results. That said, the best way to learn to program is to do it, so we highly recommend that as you read, you type out the code examples and try them yourself! Experiment with different options and variations—if you’re wondering how something works or if an option is supported, the best thing to do is try it yourself. This will help you not only practice the actual writing of code, but also better develop your own mental model of how data science programs work.
Many chapters conclude by applying the described techniques to a real data set in an In Action section. These sections take a data-driven approach to understanding issues such as gentrification, investment in education, and variation in life expectancy around the world. They give you hands-on practice with new skills, and all code is available online (In-Action Code: https://github.com/programming-for-data-science/in-action).
As you move through each chapter, you may want to complete the accompanying set of online exercises (Book Exercises: https://github.com/programming-for-data-science). This will help you practice new techniques and ensure your understanding of the material. Solutions to the exercises are also available online.
Finally, you should know that this text does not aim to be comprehensive. It is both impractical and detrimental to learning to attempt to explain every nuance and option in the R language and ecosystem (particularly to people who are just starting out). While we discuss a large number of popular tools and packages, the book cannot explain all possible options that exist now or will be created in the future. Instead, this text aims to provide a primer on each topic, giving you enough details to understand the basics and to get up and running with a particular data science programming task. Beyond those basics, we provide copious links and references to further resources where you can explore more and dive deeper into topics that are relevant or of interest to you. This book will provide the foundations of using R for data science; it is up to each reader to apply and build upon those skills.
To guide your learning, a set of online exercises (and their solutions) is available for each chapter. The complete analysis code for all seven In Action sections is also provided. See the book website for details.
Register your copy of Programming Skills for Data Science on the InformIT site for convenient access to updates and/or corrections as they become available. To start the registration process, go to informit.com/register and log in or create an account. Enter the product ISBN (9780135133101) and click Submit. Look on the Registered Products tab for an Access Bonus Content link next to this product, and follow that link to access any available bonus materials. If you would like to be notified of exclusive offers on new editions and updates, please check the box to receive email from us.