13 Accessing Databases

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

13
Accessing Databases

This chapter introduces relational databases as a way to structure and organize complex data sets. After introducing the purpose and format of relational databases, it describes the syntax for interacting with them using R. By the end of the chapter you will be able to wrangle data from a database.

13.1 An Overview of Relational Databases

Simple data sets can be stored and loaded from .csv files, and are readily represented in the computer’s memory as a data frame. This structure works great for data that is structured just as a set of observations made up of features. However, as data sets become more complex, you run against some limitations.

In particular, your data may not be structured in a way that it can easily and efficiently be represented as a single data frame. For example, imagine you were trying to organize information about music playlists (e.g., on a service such as Spotify). If your playlist is the unit of analysis you are interested in, each playlist would be an observation (row) and would have different features (columns) included. One such feature you could be interested in is the songs that appear on the playlist (implying that one of your columns should be songs). However, playlists may have lots of different songs, and you may also be tracking further information about each song (e.g., the artist, the genre, the length of the song). Thus you could not easily represent each song as a simple data type such as a number or string. Moreover, because the same song may appear in multiple playlists, such a data set would include a lot of duplicate information (e.g., the title and artist of the song).

To solve this problem, you could use multiple data frames (perhaps loaded from multiple .csv files), joining those data frames together as described in Chapter 11 to ask questions of the data. However, that solution would require you to manage multiple different .csv files, as well as to determine an effective and consistent way of joining them together. Since organizing, tracking, and updating multiple .csv files can be difficult, many large data sets are instead stored in databases. Metonymically, a database is a specialized application (called a database management system) used to save, organize, and access information—similar to what git does for versions of code, but in this case for the kind of data that might be found in multiple .csv files. Because many organizations store their data in a database of some kind, you will need to be able to access that data to analyze it. Moreover, accessing data directly from a database makes it possible to process data sets that are too large to fit into your computer’s memory (RAM) at once. The computer will not be required to hold a reference to all the data at once, but instead will be able to apply your data manipulation (e.g., selecting and filtering the data) to the data stored on a computer’s hard drive.

13.1.1 What Is a Relational Database?

The most commonly used type of database is a relational database. A relational database organizes data into tables similar in concept and structure to a data frame. In a table, each row (also called a record) represents a single “item” or observation, while each column (also called a field) represents an individual data property of that item. In this way, a database table mirrors an R data frame; you can think of them as somewhat equivalent. However, a relational database may be made up of dozens, if not hundreds or even thousands, of different tables—each one representing a different facet of the data. For example, one table may store information about which music playlists are in the database, another may store information about the individual songs, another may store information about the artists, and so on.

What makes relational databases special is how they specify the relationships between these tables. In particular, each record (row) in a table is given a field (column) called the primary key. The primary key is a unique value for each row in the table, so it lets you refer to a particular record. Thus even if there were two songs with the same name and artist, you could still distinguish between them by referencing them through their primary key. Primary keys can be any unique identifier, but they are almost always numbers and are frequently automatically generated and assigned by the database. Note that databases can’t just use the “row number” as the primary key, because records may be added, removed, or reordered—which means a record won’t always be at the same index!

Moreover, each record in one table may be associated with a record in another—for example, each record in a songs table might have an associated record in the artists table indicating which artist performed the song. Because each record in the artists table has a unique key, the songs table is able to establish this association by including a field (column) for each record that contains the corresponding key from artists (see Figure 13.1). This is known as a foreign key (it is the key from a “foreign” or other table). Foreign keys allow you to join tables together, similar to how you would with dplyr. You can think of foreign keys as a formalized way of defining a consistent column for the join() function’s by argument.

A figure shows the example of a pair of database tables and how the foreign key is used when joining the two tables. — Figure 13.1 An example pair of database tables (top). Each table has a *primary key* column `id`. The `songs` table (top right) also has an `artist_id` *foreign key* used to associate it with the `artists` table (top left). The bottom table illustrates how the foreign key can be used when joining the tables.

The figure shows two tables namely artists and songs at the top and one resultant table at the bottom. The artists table includes three rows and two columns. The column headers read, id and name. The id is pointed out as the primary key. Row 1 reads 10 and david bowie. Row 2 reads 11 and queen. Row 3 reads 12 and prince. The songs table includes four rows and three columns. The column headers read, id, title, and artist_id. The artist_id is the foreign key. Row 1 reads 80, Bohemian Rhapsody, and 11. Row 2 reads 81, Don't Stop Me Now, and 11. Row 3 reads 82, Purple Rain, and 12. Row 4 reads 83, Starman, and 10. The resultant table at the bottom shows the output for the query "artists JOIN songs ON artists.id = songs.artists.id". The resultant table includes four rows and five columns. The column headers read, artists.id, artists.name, songs.id, songs.title, and songs.artists_id. Row 1 reads 10, David Bowie, 83, Starman, and 10. Row 2 reads 11, Queen, 80, Bohemian Rhapsody, and 11. Row 3 reads 11, Queen, 81, Don't Stop Me Now, and 11. Row 4 reads 12, Prince, 82, Purple Rain, and 12.

Databases can use tables with foreign keys to organize data into complex structures; indeed, a database may have a table that just contains foreign keys to link together other tables! For example, if a database needs to represent data such that each playlist can have multiple songs, and songs can be on many playlists (a “many-to-many” relationship), you could introduce a new “bridge table” (e.g., playlists_songs) whose records represent the associations between the two other tables (see Figure 13.2). You can think of this as a “table of lines to draw between the other tables.” The database could then join all three of the tables to access the information about all of the songs for a particular playlist.

A figure shows the example of a bridge table, used to associate playlists and songs and how the three tables joined. — Figure 13.2 An example “bridge table” (top right) used to associate many playlists with many songs. The bottom table illustrates how these three tables might be joined.

The figure shows three tables namely playlists, songs, and plylists_songs at the top and one resultant table at the bottom. The playlists table includes two rows and two columns. The column headers read, id and name. Row 1 reads 100 and awesome mix. Row 2 reads 101 and sweet tunes. The songs table includes four rows and two columns. The column headers read, id and title. Row 1 reads 80 and Bohemian Rhapsody. Row 2 reads 81 and Don't Stop Me Now. Row 3 reads 82 and Purple Rain. Row 4 reads 83 and Starman. The playlists_songs table includes five rows and two columns. Row 1 reads 100 and 81. Row 2 reads 100 and 82. Row 3 reads 100 and 83. Row 4 reads 101 and 80. Row 5 reads 101 and 82. The resultant table at the bottom shows the output for the query "playlist JOIN playlists_songs ON playlists.id = playlist_id JOIN songs ON songs.id = songs_id". The resultant table includes five rows and four columns. The column headers read, playlist_id, playlists.name, songs.id, and songs.title. Row 1 reads 100, Awesome Mix, 81, and Don't Stop Me. Row 2 reads Now 100, Awesome Mix, 82, and Purple Rain. Row 3 reads 100, Awesome Mix, 83, and Starman. Row 4 reads 101, Sweet Tunes, 80, and Bohemian Rhapsody. Row 5 reads 101, Sweet Tunes, 82, and Purple Rain.

Going Further

Database design, development, and use is actually its own (very rich) problem domain. The broader question of making databases reliable and efficient is beyond the scope of this book.

13.1.2 Setting Up a Relational Database

To use a relational database on your own computer (e.g., for experimenting or testing your analysis), you will need to install a separate software program to manage that database. This program is called a relational database management system (RDMS). There are a couple of different popular RDMS systems; each of them provides roughly the same syntax (called SQL) for manipulating the tables in the database, though each may support additional specialized features. The most popular RDMSs are described here. You are not required to install any of these RDMSs to work with a database through R; see Section 13.3, below. However, brief installation notes are provided for your reference.

SQLite¹ is the simplest SQL database system, and so is most commonly used for testing and development (though rarely in real-world “production” systems). SQLite databases have the advantage of being highly self-contained: each SQLite database is a single file (with the .sqlite extension) that is formatted to enable the SQLite RDMS to access and manipulate its data. You can almost think of these files as advanced, efficient versions of .csv files that can hold multiple tables! Because the database is stored in a single file, this makes it easy to share databases with others or even place one under version control.

¹SQLite: https://www.sqlite.org/index.html

To work with an SQLite database you can download and install a command line application² for manipulating the data. Alternatively, you can use an application such as DB Browser for SQLite,³ which provides a graphical interface for interacting with the data. This is particularly useful for testing and verifying your SQL and R code.

²SQLite download page: https://www.sqlite.org/download.html; look for “Precompiled Binaries” for your system.

³DB Browser for SQLite: http://sqlitebrowser.org
PostgreSQL⁴ (often shortened to “Postgres”) is a free open source RDMS, providing a more robust system and set of features (e.g., for speeding up data access and ensuring data integrity) and functions than SQLite. It is often used in real-world production systems, and is the recommended system to use if you need a “full database.” Unlike with SQLite, a Postgres database is not isolated to a single file that can easily be shared, though there are ways to export a database.

⁴PostgreSQL: https://www.postgresql.org

You can download and install the Postgres RDMS from its website;⁵ follow the instructions in the installation wizard to set up the database system. This application will install the manager on your machine, as well as provide you with a graphical application (pgAdmin) to administer your databases. You can also use the provided psql command line application if you add it to your PATH; alternatively, the SQL Shell application will open the command line interface directly.

⁵PostgreSQL download page: https://www.postgresql.org/download
MySQL⁶ is a free (but closed source) RDMS, providing a similar level of features and structure as Postgres. MySQL is a more popular system than Postgres, so its use is more common, but can be somewhat more difficult to install and set up.

⁶MySQL: https://www.mysql.com

If you wish to set up and use a MySQL database, we recommend that you install the Community Server Edition from the MySQL website.⁷ Note that you do not need to sign up for an account (click the smaller “No thanks, just start my download” link instead).

⁷MySQL download Page: https://dev.mysql.com/downloads/mysql

We suggest you use SQLite when you’re just experimenting with a database (as it requires the least amount of setup), and recommend Postgres if you need something more full-featured.

13.2 A Taste of SQL

The reason all of the RDMSs described in Section 13.1.2 have “SQL” in their names is because they all use the same syntax—SQL—for manipulating the data stored in the database. SQL (Structured Query Language) is a programming language used specifically for managing data in a relational database—a language that is structured for querying (accessing) that information. SQL provides a relatively small set of commands (referred to as statements), each of which is used to interact with a database (similar to the operations described in the Grammar of Data Manipulation used by dplyr).

This section introduces the most basic of SQL statements: the SELECT statement used to access data. Note that it is absolutely possible to access and manipulate a database through R without using SQL; see Section 13.3. However, it is often useful to understand the underlying commands that R is issuing. Moreover, if you eventually need to discuss database manipulations with someone else, this language will provide some common ground.

Caution

Most RDMSs support SQL, though systems often use slightly different “flavors” of SQL. For example, data types may be named differently, or different RDMSs may support additional functions or features.

Tip

For a more thorough introduction to SQL, w3schools^a offers a very newbie-friendly tutorial on SQL syntax and usage. You can also find more information in Forta, Sams Teach Yourself SQL in 10 Minutes, Fourth Edition (Sams, 2013), and van der Lans, Introduction to SQL, Fourth Edition (Addison-Wesley, 2007).

^a https://www.w3schools.com/sql/default.asp

The most commonly used SQL statement is the SELECT statement. The SELECT statement is used to access and extract data from a database (without modifying that data)—this makes it a query statement. It performs the same work as the select() function in dplyr. In its simplest form, the SELECT statement has the following format:

Table of Contents for 13 Accessing Databases

Create new playlist

Sign In

Sign Up

13Accessing Databases

13.1 An Overview of Relational Databases

13.1.1 What Is a Relational Database?

13.1.2 Setting Up a Relational Database

13.2 A Taste of SQL

13.3 Accessing a Database from R

Table of Contents for
13 Accessing Databases

13
Accessing Databases

13.3 Accessing a Database from `R`