Chapter 10. Sources of Data

The previous two units (Confirmatory Data Analysis and Inferential Statistics, and Predictive Analytics) focused on teaching both theory and practice in ideal data scenarios, so that our more academic quests could be divorced from outside concerns about the veracity or format of the data. To this end, we deliberately stayed away from data sets not already built into R or available from add-on packages. But very few people I know get by in their R careers without ever importing data from sources outside R packages. Well, we very briefly touched upon how to load data into R (the read.* commands) in the very first chapter of this book, did we not? So we should be all set, right?

Here's the rub: I know almost as few people who can get by using only simple CSVs and tab-delimited text files loaded locally with the primary read.* commands as I know people who can get by without using outside sources of data at all! The unfortunate fact is that many introductory analytics texts largely disregard this reality, producing many well-informed new analysts who are nevertheless stymied on their first attempt to apply their fresh knowledge to "real-world data". In my opinion, any text that purports to be a practical resource for data analysts cannot afford to ignore this.

Luckily, due to largely undirected and unplanned personal research I do for blog posts and my own edification using motley collections of publicly available data sources in various formats, I—perhaps delusionally—consider myself fairly adept in navigating this criminally overlooked portion of practical analytics. It is the body of lessons I've learned during these wild data adventures that I'd like to impart to you in this and the subsequent chapter, dear reader.

It's common for data sources to be not only difficult to load, for various reasons, but also difficult to work with because of errors, junk, or just general idiosyncrasies. Because of this, this chapter and the next, Dealing with messy data, will have a lot in common. This chapter will concentrate more on getting data from outside sources into a somewhat usable form in R. The next chapter will discuss particularly common gotchas when working with data in an imperfect world.

I appreciate that not everyone has the interest or the time to go on wild goose chases for publicly available data to answer questions formed on a whim. Nevertheless, the techniques we'll be discussing in this chapter should be very helpful in handling the variety of data formats that you'll have to contend with in the course of your work or research. Additionally, having the wherewithal to employ freely available data from the web can be indispensable for learning new analytics methods and technologies.

The first source of data we'll be looking at is that of the venerable relational database.

Relational Databases

Perhaps the most common external source of data is the relational database. Since this section is probably of interest only to those who work with databases—or at least plan to—some knowledge of the basics of relational databases is assumed.

One way to connect to databases from R is to use the RODBC package. This allows one to access any database that implements the ODBC common interface (for example, PostgreSQL, Access, Oracle, SQLite, DB2, and so on). A more common method—for whatever reason—is to use the DBI package and DBI-compliant drivers.
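To give a flavor of the RODBC approach before we move on, a minimal session might look like the following sketch. The DSN name, credentials, and table here are hypothetical placeholders; you would substitute an ODBC data source configured on your own system.

library(RODBC)

# "my_dsn", the credentials, and the table name below are all
# hypothetical placeholders for an ODBC data source on your system
channel <- odbcConnect("my_dsn", uid = "analyst", pwd = "secret")
results <- sqlQuery(channel, "SELECT * FROM some_table;")
odbcClose(channel)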

DBI is an R package that defines a generalized interface for communication between different databases and R. Like with ODBC, it allows the same compliant SQL to run on multiple databases. The DBI package alone is not sufficient for communicating with any particular database from R; in order to use DBI, you must also install and load a DBI-compliant driver for your particular database. Packages exist providing drivers for many RDBMSs. Among them are RPostgreSQL, RSQLite, RMySQL, and ROracle.
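Installing a driver is no different than installing any other package. For example, to prepare for communicating with a SQLite database, as we're about to do:

install.packages("RSQLite")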

In order to most easily demonstrate R/DB communication, we will be using a SQLite database. This will also make it easiest for the prudent reader to create the example database and follow along. The SQL we'll be using is standard, though, so you can really use any DB you want.

Our example database has two tables: artists and paintings. The artists table contains a unique integer ID, an artist's name, and the year they were born. The paintings table contains a unique integer ID, an artist ID, the name of the painting, and its completion date. The artist ID in the paintings table is a foreign key that references the artist ID in the artists table; this is how this database links paintings to their respective painters.

If you want to follow along, use the following SQL statements to create and populate the database. If you're using SQLite, name the database art.db.

CREATE TABLE artists(
  artist_id INTEGER PRIMARY KEY,
  name TEXT,
  born_on INTEGER
);

CREATE TABLE paintings(
  painting_id INTEGER PRIMARY KEY,
  painting_artist INTEGER,
  painting_name TEXT,
  year_completed INTEGER,
  FOREIGN KEY(painting_artist) REFERENCES artists(artist_id)
);

INSERT INTO artists(name, born_on)
VALUES ('Kay Sage', 1898),
('Piet Mondrian', 1872),
('Rene Magritte', 1898),
('Man Ray', 1890),
('Jean-Michel Basquiat', 1960);

INSERT INTO paintings(painting_artist, painting_name, year_completed)
VALUES (4, 'Orquesta Sinfonica', 1916),
(4, 'La Fortune', 1938),
(1, 'Tomorrow is Never', 1955),
(1, 'The Answer is No', 1958),
(1, 'No Passing', 1954),
(5, 'Bird on Money', 1981),
(2, 'Place de la Concorde', 1943),
(2, 'Composition No. 10', 1942),
(3, 'The Human Condition', 1935),
(3, 'The Treachery of Images', 1948),
(3, 'The Son of Man', 1964);

Confirm for yourself that the following SQL commands yield the appropriate results by typing them into the sqlite3 command line interface.

SELECT * FROM artists;
--------------------------------
1 | Kay Sage             | 1898
2 | Piet Mondrian        | 1872
3 | Rene Magritte        | 1898
4 | Man Ray              | 1890
5 | Jean-Michel Basquiat | 1960


SELECT * FROM paintings;
--------------------------------------
1  | 4 | Orquesta Sinfonica      | 1916
2  | 4 | La Fortune              | 1938
3  | 1 | Tomorrow is Never       | 1955
4  | 1 | The Answer is No        | 1958
5  | 1 | No Passing              | 1954
6  | 5 | Bird on Money           | 1981
7  | 2 | Place de la Concorde    | 1943
8  | 2 | Composition No. 10      | 1942
9  | 3 | The Human Condition     | 1935
10 | 3 | The Treachery of Images | 1948
11 | 3 | The Son of Man          | 1964

For our first act, we load the necessary packages, choose our database driver, and connect to the database:

library(DBI)
library(RSQLite)
sqlite <- dbDriver("SQLite")
# we read the art SQLite db from the current
# working directory, which can be retrieved and set
# with getwd() and setwd(), respectively
art_db <- dbConnect(sqlite, "./art.db")

Again, we are using SQLite for this example, but this procedure is applicable to all DBI-compliant database drivers.
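For example, if our art database lived on a remote PostgreSQL server instead, only the driver and the connection call would change; something like the following sketch, where the host, database name, and credentials are all hypothetical:

library(DBI)
library(RPostgreSQL)

# all connection parameters below are hypothetical placeholders
postgres <- dbDriver("PostgreSQL")
art_db <- dbConnect(postgres, dbname = "art",
                    host = "db.example.com", port = 5432,
                    user = "analyst", password = "secret")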

Let's now run a query against this database: we'll get a list of all the painting names and their respective artists' names. This requires a join operation between the two tables:

result <- dbSendQuery(art_db, 
  "SELECT paintings.painting_name, artists.name
  FROM paintings INNER JOIN artists
  ON paintings.painting_artist=artists.artist_id;")
response <- fetch(result)
head(response)
dbClearResult(result)

----------------------------------------------

       painting_name                 name
1 Orquesta Sinfonica              Man Ray
2         La Fortune              Man Ray
3  Tomorrow is Never             Kay Sage
4   The Answer is No             Kay Sage
5         No Passing             Kay Sage
6      Bird on Money Jean-Michel Basquiat

Here we used the dbSendQuery function to send a query to the database. Its first and second arguments were the database handle variable (from the dbConnect function) and the SQL statement, respectively. We store a handle to the result in a variable. Next, the fetch function retrieves the response from the handle. By default, it will retrieve all matches from the query, though this can be limited by specifying the n argument (see help("fetch")). The result of the fetch is then stored in the variable response. response is an R data frame like any other; we can do any of the operations we've already learned with it. Finally, we clear the result, which is good practice, because it frees resources.
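As an aside, when you want all the results at once, as we do here, DBI provides the dbGetQuery convenience function, which performs the send, fetch, and clear steps in a single call:

response <- dbGetQuery(art_db,
  "SELECT paintings.painting_name, artists.name
  FROM paintings INNER JOIN artists
  ON paintings.painting_artist=artists.artist_id;")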

For a slightly more involved query, let's try to find the average (mean) age the artists were when each of the paintings was completed. This still requires a join, but this time we select paintings.year_completed and artists.born_on.

result <- dbSendQuery(art_db,
  "SELECT paintings.year_completed, artists.born_on
  FROM paintings INNER JOIN artists
  ON paintings.painting_artist=artists.artist_id;")
response <- fetch(result)
head(response)
dbClearResult(result)

----------------------------

  year_completed born_on
1           1916    1890
2           1938    1890
3           1955    1898
4           1958    1898
5           1954    1898
6           1981    1960

Now the row-wise subtraction and averaging can be performed simply:

mean(response$year_completed - response$born_on)

-----------

[1] 51.09091

Finally, we close our connection to the database:

dbDisconnect(art_db)

Why didn't we just do that in SQL?

Why, indeed. Although this very simple example could easily have been written into the logic of the SQL query itself, for more complicated data analysis this simply won't cut it. Unless you are using a really specialized database, many databases can't guarantee the numerical accuracy of certain mathematical functions. More importantly, most databases don't implement advanced math functions at all, and even if they did, they almost certainly wouldn't be portable between different RDBMSs. There is great merit in having the analytics logic reside in R, so that if—for whatever reason—you have to switch databases, your analysis code will remain unchanged.
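For the record, here is how the computation could have been folded into the query itself; the AVG aggregate (which is standard SQL) does the work that mean did in R. You can try it in the sqlite3 command line interface:

SELECT AVG(paintings.year_completed - artists.born_on) AS mean_age
FROM paintings INNER JOIN artists
ON paintings.painting_artist=artists.artist_id;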

Note

If SQL is your cup of tea, did you know you can use the sqldf package to perform arbitrary SQL queries on data.frames?
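For example, the following treats the built-in mtcars data frame as a database table and computes a grouped mean using nothing but SQL (by default, sqldf quietly uses an in-memory SQLite database behind the scenes):

library(sqldf)

# compute the mean mpg for each cylinder count, querying
# the mtcars data frame as if it were a database table
sqldf("SELECT cyl, AVG(mpg) AS mean_mpg FROM mtcars GROUP BY cyl")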

There is rising interest in, and (to a lesser extent) a need for, databases that don't adhere to the relational paradigm. These so-called NoSQL databases include the immensely popular Hadoop/HDFS, MongoDB, CouchDB, Neo4j, and Redis, among many others. There are R packages for communicating with most of these, too, including one for every database mentioned here by name. Since the operation of all of these packages is idiosyncratic and heavily dependent on which species of NoSQL database you're using, your best bet for learning how to use them is to read the help pages and/or vignettes for each package.
