The previous two units (Confirmatory Data Analysis and Inferential Statistics and Predictive Analytics) have focused on teaching both theory and practice in ideal data scenarios, so that our more academic quests could be divorced from outside concerns about the veracity or format of the data. To this end, we deliberately stayed away from data sets not already built into R or available from add-on packages. But very few people I know get by in their careers using R without importing any data from sources outside R packages. Well, we very briefly touched upon how to load data into R (the read.* commands) in the very first chapter of this book, did we not? So we should be all set, right?
Here's the rub: I know almost as few people who can get by using simple CSVs and tab-delimited text locally with the primary read.* commands as can get by without using outside sources of data at all! The unfortunate fact is that many introductory analytics texts largely disregard this reality. This produces many well-informed new analysts who are nevertheless stymied on their first attempt to apply their fresh knowledge to "real-world data". In my opinion, any text that purports to be a practical resource for data analysts cannot afford to ignore this.
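To jog your memory, reading a local delimited file is usually a one-liner with those functions; the file names in this sketch are made up purely for illustration:

read.csv("my_data.csv")                  # comma-separated values
read.delim("my_data.tsv")                # tab-delimited text
read.table("my_data.txt",                # other delimiters (here, pipes)
           sep="|", header=TRUE)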
Luckily, thanks to the largely undirected and unplanned personal research I do for blog posts and my own edification, using motley collections of publicly available data sources in various formats, I (perhaps delusionally) consider myself fairly adept at navigating this criminally overlooked portion of practical analytics. It is the body of lessons I've learned during these wild data adventures that I'd like to impart to you in this chapter and the next, dear reader.
It's common for data sources to be not only difficult to load, for various reasons, but also difficult to work with because of errors, junk, or just general idiosyncrasies. Because of this, this chapter and the next, Dealing with messy data, will have a lot in common. This chapter concentrates on getting data from outside sources into a somewhat usable form in R; the next discusses some particularly common gotchas you'll encounter while working with data in an imperfect world.
I appreciate that not everyone has the interest or the time to go on wild goose chases for publicly available data to answer questions formed on a whim. Nevertheless, the techniques we'll be discussing in this chapter should be very helpful in handling the variety of data formats you'll have to contend with in the course of your work or research. Additionally, having the wherewithal to employ freely available data from the web can be indispensable when learning new analytics methods and technologies.
The first source of data we'll be looking at is the venerable relational database.
Perhaps the most common external source of data is the relational database. Since this section is probably of interest only to those who work with databases (or at least plan to), some knowledge of the basics of relational databases is assumed.
One way to connect to databases from R is to use the RODBC package. This allows one to access any database that implements the ODBC common interface (for example, PostgreSQL, Access, Oracle, SQLite, DB2, and so on). A more common method, for whatever reason, is to use the DBI package and DBI-compliant drivers.
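To give you a feel for the ODBC route before we move on, connecting and querying with RODBC looks roughly like the following sketch; the data source name, "my_dsn", is a placeholder that you would have to configure on your own system:

library(RODBC)

con <- odbcConnect("my_dsn")    # "my_dsn" is a hypothetical DSN
artists <- sqlQuery(con, "SELECT * FROM artists;")
odbcClose(con)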
DBI is an R package that defines a generalized interface for communication between different databases and R. As with ODBC, it allows the same compliant SQL to run on multiple databases. The DBI package alone is not sufficient for communicating with any particular database from R; in order to use DBI, you must also install and load a DBI-compliant driver for your particular database. Packages exist providing drivers for many RDBMSs; among them are RPostgreSQL, RSQLite, RMySQL, and ROracle.
In order to most easily demonstrate R/DB communication, we will be using a SQLite database. This will also make it easiest for the prudent reader to create the example database and follow along. The SQL we'll be using is standard, though, so you can really use any DB you want.
Our example database has two tables: artists and paintings. The artists table contains a unique integer ID, an artist's name, and the year they were born. The paintings table contains a unique integer ID, an artist ID, the name of the painting, and its completion date. The artist ID in the paintings table is a foreign key that references the artist ID in the artists table; this is how this database links paintings to their respective painters.
If you want to follow along, use the following SQL statements to create and populate the database. If you're using SQLite, name the database art.db.
CREATE TABLE artists(
  artist_id INTEGER PRIMARY KEY,
  name TEXT,
  born_on INTEGER
);

CREATE TABLE paintings(
  painting_id INTEGER PRIMARY KEY,
  painting_artist INTEGER,
  painting_name TEXT,
  year_completed INTEGER,
  FOREIGN KEY(painting_artist) REFERENCES artists(artist_id)
);

INSERT INTO artists(name, born_on)
VALUES ("Kay Sage", 1898),
       ("Piet Mondrian", 1872),
       ("Rene Magritte", 1898),
       ("Man Ray", 1890),
       ("Jean-Michel Basquiat", 1960);

INSERT INTO paintings(painting_artist, painting_name, year_completed)
VALUES (4, "Orquesta Sinfonica", 1916),
       (4, "La Fortune", 1938),
       (1, "Tomorrow is Never", 1955),
       (1, "The Answer is No", 1958),
       (1, "No Passing", 1954),
       (5, "Bird on Money", 1981),
       (2, "Place de la Concorde", 1943),
       (2, "Composition No. 10", 1942),
       (3, "The Human Condition", 1935),
       (3, "The Treachery of Images", 1948),
       (3, "The Son of Man", 1964);
Confirm for yourself that the following SQL commands yield the appropriate results by typing them into the sqlite3 command line interface.
SELECT * FROM artists;
--------------------------------
1 | Kay Sage             | 1898
2 | Piet Mondrian        | 1872
3 | Rene Magritte        | 1898
4 | Man Ray              | 1890
5 | Jean-Michel Basquiat | 1960

SELECT * FROM paintings;
--------------------------------------
1  | 4 | Orquesta Sinfonica      | 1916
2  | 4 | La Fortune              | 1938
3  | 1 | Tomorrow is Never       | 1955
4  | 1 | The Answer is No        | 1958
5  | 1 | No Passing              | 1954
6  | 5 | Bird on Money           | 1981
7  | 2 | Place de la Concorde    | 1943
8  | 2 | Composition No. 10      | 1942
9  | 3 | The Human Condition     | 1935
10 | 3 | The Treachery of Images | 1948
11 | 3 | The Son of Man          | 1964
For our first act, we load the necessary packages, choose our database driver, and connect to the database:
library(DBI)
library(RSQLite)

sqlite <- dbDriver("SQLite")

# we read the art sqlite db from the current
# working directory, which can be retrieved and set
# with getwd() and setwd(), respectively
art_db <- dbConnect(sqlite, "./art.db")
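If you'd like to sanity-check the connection before going any further, DBI's dbListTables function will enumerate the tables the connection can see; for our example database, it should list artists and paintings:

dbListTables(art_db)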
Again, we are using SQLite for this example, but this procedure is applicable to all DBI-compliant database drivers.
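To illustrate, if our art database lived in PostgreSQL instead, only the driver and the connection details would change; the host and credentials below are fabricated for the example, and everything that follows would work unchanged:

library(DBI)
library(RPostgreSQL)

pg <- dbDriver("PostgreSQL")
art_db <- dbConnect(pg, dbname = "art", host = "localhost",
                    user = "hypothetical_user",
                    password = "hypothetical_password")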
Let's now run a query against this database: we'll get a list of all the painting names paired with the names of their respective artists. This will require a join operation between the two tables:
result <- dbSendQuery(art_db,
                      "SELECT paintings.painting_name, artists.name
                       FROM paintings
                       INNER JOIN artists
                       ON paintings.painting_artist = artists.artist_id;")
response <- fetch(result)
head(response)
dbClearResult(result)
----------------------------------------------
       painting_name                 name
1 Orquesta Sinfonica              Man Ray
2         La Fortune              Man Ray
3  Tomorrow is Never             Kay Sage
4   The Answer is No             Kay Sage
5         No Passing             Kay Sage
6      Bird on Money Jean-Michel Basquiat
Here, we used the dbSendQuery function to send a query to the database. Its first and second arguments were the database handle (from the dbConnect function) and the SQL statement, respectively. We store a handle to the result in a variable. Next, the fetch function retrieves the response from the handle. By default, it will retrieve all matches from the query, though this can be limited by specifying the n argument (see help("fetch")). The result of the fetch is then stored in the variable response. response is an R data frame like any other; we can perform any of the operations we've already learned on it. Finally, we clear the result; this is good practice, as it frees resources.
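Incidentally, DBI also provides a convenience function, dbGetQuery, which performs the send/fetch/clear dance in a single step; and when a result set is too large to hold in memory at once, the n argument to fetch lets you process it piecemeal. Both patterns are sketched here against our example database:

# one-step equivalent of the send/fetch/clear above
response <- dbGetQuery(art_db,
  "SELECT paintings.painting_name, artists.name
   FROM paintings
   INNER JOIN artists
   ON paintings.painting_artist = artists.artist_id;")

# chunked version: process (at most) five rows per fetch
result <- dbSendQuery(art_db, "SELECT * FROM paintings;")
while (!dbHasCompleted(result)) {
  chunk <- fetch(result, n = 5)
  print(nrow(chunk))
}
dbClearResult(result)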
For a slightly more involved query, let's find the average (mean) age of the artists at the time each of the paintings was completed. This still requires a join, but this time we are selecting paintings.year_completed and artists.born_on.
result <- dbSendQuery(art_db,
                      "SELECT paintings.year_completed, artists.born_on
                       FROM paintings
                       INNER JOIN artists
                       ON paintings.painting_artist = artists.artist_id;")
response <- fetch(result)
head(response)
dbClearResult(result)
----------------------------
  year_completed born_on
1           1916    1890
2           1938    1890
3           1955    1898
4           1958    1898
5           1954    1898
6           1981    1960
From here, the row-wise subtraction and the averaging can be performed simply:
mean(response$year_completed - response$born_on)
-----------
[1] 51.09091
Finally, we close our connection to the database:
dbDisconnect(art_db)
Why did we bother doing this in R rather than letting the database do the arithmetic? Although this very simple example could easily have been written into the logic of the SQL query itself, for more complicated data analysis that simply won't cut it. Unless you are using a really specialized database, many databases can't be relied upon for certain mathematical functions where numerical accuracy matters. More importantly, most databases don't implement advanced math functions at all, and even if they did, they almost certainly wouldn't be portable between different RDBMSs. There is great merit in having the analytics logic reside in R: if, for whatever reason, you have to switch databases, your analysis code will remain unchanged.
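That said, this particular computation is simple enough to push into the query itself. For the curious, a SQL-only version (using the same join) might look like the following sketch; note that it assumes we haven't yet disconnected:

# assumes art_db is still connected
dbGetQuery(art_db,
  "SELECT AVG(year_completed - born_on) AS mean_age
   FROM paintings
   INNER JOIN artists
   ON paintings.painting_artist = artists.artist_id;")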
If SQL is your cup of tea, did you know that you can use the sqldf package to perform arbitrary SQL queries on data.frames?
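As a minimal illustration, the following snippet runs a GROUP BY query against mtcars, one of the data frames that ships with R:

library(sqldf)

# mean miles-per-gallon for each cylinder count
sqldf("SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")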
There is rising interest in, and (to a lesser extent) a rising need for, databases that don't adhere to the relational paradigm. These so-called NoSQL databases include the immensely popular Hadoop/HDFS, MongoDB, CouchDB, Neo4j, and Redis, among many others. There are R packages for communicating with most of these, too, including at least one for every database mentioned here by name. Since the operation of these packages is idiosyncratic and depends heavily on which species of NoSQL database you're dealing with, your best bet for learning how to use them is to read the help pages and/or vignettes for each package.
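Just to give you a flavor of how different these interfaces can be, here is a sketch using the mongolite package to talk to MongoDB; the art database, its paintings collection, and the locally running MongoDB server are all assumptions made for the example:

library(mongolite)

# connect to a hypothetical collection on a local MongoDB server
paintings <- mongo(collection = "paintings", db = "art",
                   url = "mongodb://localhost")

# queries are expressed as JSON documents, not SQL
paintings$find('{"year_completed": {"$gt": 1950}}')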