© Matt Wiley and Joshua F. Wiley 2016

Matt Wiley and Joshua F. Wiley, Advanced R, 10.1007/978-1-4842-2077-1_10

10. Reading Big Data(bases)

Matt Wiley and Joshua F. Wiley1

(1)Elkhart Group Ltd. & Victoria College, Columbia City, Indiana, USA

Now that you understand how to manage data inside R, let’s consider where data is found. While smaller data sets live in comma-separated values (CSV) files or files easily converted to such, larger data tends to live elsewhere. This chapter deals with big data, or at least data that may be big. What is the challenge with big data? R works in memory, that is, random-access memory (RAM), not on the hard drive. A quick check of your system settings should reveal how much memory you have; we, the authors, use between 4 and 32 gigabytes in our real-world systems, with the larger number being a somewhat expensive habit. R cannot analyze data larger than the available RAM.
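
If you want a rough check from inside R itself, the following minimal sketch works for us on Windows (memory.limit() is Windows-only; on other platforms, consult your system monitor instead):

memory.limit()      # Windows only: memory available to R, in megabytes
object.size(iris)   # how much RAM a single object occupies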

This constraint leaves two possibilities: save R data objects and load them one at a time, or keep the data in a proper database. R is not designed to be a full-time database, and large data is often managed by groups other than the analysts anyway. Big data is a reality, and most of it lives inside a database of some sort. While most databases can export files to CSV, this is not always convenient. Our goal is to provide just enough information to get you access to some popular and common databases. If you are a data user, you’ll be able to access data from several common databases by the end of this chapter. If you are a big data owner, you’ll gain the ability to write newly collected data into a database. Along the way, we show the commands that load a subset of your data into R, and we discuss read versus write privileges. First, though, we install these databases with you, so you are ready to practice using large data. Fair warning: our goal is not to provide expertise in database management; we recommend books throughout the chapter for more detail.

This chapter has three main sections that each follow the same pattern: installation of database software followed by interacting with that database via R. The first two database systems are SQLite and PostgreSQL, which are both relational databases. As such, they fit nicely with the previous chapters, because so far we’ve used tabular data. Tabular data has fixed columns and a variable number of rows; new data slots neatly into a new row, so most such data has a rigid format. Additionally, while there may be more than one table, there tend to be common keys that allow us to match tables to each other (and rows within those tables to each other). The last database we demonstrate is MongoDB, which is an example of a document store. In a document store, each piece, or blob, of data may be unique. There is no reason to believe that any keys or relations exist; of course, there is no reason to believe that there aren’t any similarities either.

In this chapter, we use the following packages: devEMF (Johnson, 2015), DBI (R-SIG-DB, Wickham, and Müller, 2016), RSQLite (Wickham, James, and Falcon, 2014), jsonlite (Ooms, 2014), tm (Feinerer and Hornik, 2015), wordcloud (Fellows, 2014), RMongo (Cheng, 2013), RPostgreSQL (Conway, Eddelbuettel, Nishiyama, Prayaga, and Tiffin, 2016), and data.table (Dowle, Srinivasan, Short, and Lianoglou, 2015). The following code loads the checkpoint (Microsoft Corporation, 2016) package to control the exact version of R packages used and then loads the packages:

## load checkpoint and required packages
library(checkpoint)
checkpoint("2016-09-04", R.version = "3.3.1")


library(devEMF)
library(RPostgreSQL)
library(DBI)
library(RSQLite)
library(jsonlite)
library(tm)
library(wordcloud)
library(RMongo)
library(data.table)
options(width = 70) # only 70 characters per line

This chapter is not about curating big data and certainly is not about curating big data in R. Instead, this chapter gets you learning the ins and outs of using R with some popular open source databases. Before moving into specific databases, a word of advice: should you wish to learn all three of these databases, go to each section for installation instructions. Follow those, and, after you are finished, just to be safe, restart your computer.

SQLite

SQLite is both extraordinarily small and exceptionally efficient. It is also freely available (as of this writing and for the last 16 years). For a thorough treatment, we recommend The Definitive Guide to SQLite by Grant Allen and Mike Owens (Apress, 2010). For our purposes, it is enough to know that SQLite recognizes five main data types (compare this with R’s types): null, integer, real, text, and blob.
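
To see how R values map onto those storage types, the DBI generic dbDataType() reports the type a given driver would use; a quick sketch:

## how RSQLite maps a few R values to SQLite storage classes
dbDataType(SQLite(), 1L)      # "INTEGER"
dbDataType(SQLite(), 1.5)     # "REAL"
dbDataType(SQLite(), "text")  # "TEXT"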

Each SQLite database exists as just one file and is thus quite portable. These databases have no users, no passwords, and a simple connection structure. The disadvantages are a consequence of those same advantages: without user privileges, there is no distinction between merely reading information and being able to modify data.

This lack of user privileges deserves an important warning. We build the databases we use in this chapter with mock data. In real-life work, databases are not for practice. There is a healthy amount of fear in directly connecting to a live database for the first time, especially if you are one wrong DROP call away from deleting your team’s or client’s data. Because an SQLite database is just one file, it is wise to work on a copy of the actual data first if at all possible. Alternatively, as in our case, practice first on something completely safe.
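
A minimal sketch of that working-on-a-copy habit; the file names here are illustrative only and are not created anywhere in this chapter:

## copy the live database file and practice against the copy instead
file.copy(from = "C:/sqlite/LiveDB.db", to = "C:/sqlite/LiveDB_practice.db")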

Installing SQLite on Windows

There is no installer for SQLite; the file(s) are simply downloaded. A visit to SQLite at www.sqlite.org/download.html is enough; we downloaded the sqlite-tools-win32-VERSION.zip file, where VERSION maps to the current product and version. Next, extract sqliteV.exe, where V is the version, to a folder on the disk where you want your data to live. We extracted to C:/sqlite as our folder. Make sure you have free space on that disk, because the database eventually takes up some space.

That is all it takes to get SQLite up and running on your system. Now, in real life, perhaps someone else created the SQLite database, and you have a database copy or access via a network share drive. Our goals are to use data from the database and to store data to the database from R. This is not the same, nor as efficient, as managing an SQLite database by using the “proper” techniques. All the same, it gets us using the database quickly!

SQLite and R

Our first task is connecting to our database. Connections use the DBI package and build a link to our database. In the case of SQLite, there is no need for a username or password, but we do need to tell the link the location and the type of the database. Recall that we made some choices about where to locate SQLite, and that location must match what we give here. We call this first connection con1 and then use the dbIsValid() function from the RSQLite package to test whether it started correctly. Ours returns as valid, so we are ready to go:

con1 = dbConnect(SQLite(), dbname = "C:/sqlite/LiteDB.db")                
dbIsValid(con1)
[1] TRUE

Our goal is to get you connecting to this database quickly. We push our familiar iris data to SQLite, which lets us see that the system is working correctly. Data in SQLite lives in tables, and inside a table are fields; this corresponds to a data table in R, which has a name and columns. We assume that you are already overly familiar with iris and show the result of writing the data table version of iris (stored as diris) to SQLite. The dbWriteTable() function takes three required arguments: the connection, the name of the new table, and the R object to write into that table. The R object should be a data frame (a data table also works). Additional arguments may be passed, including settings that determine whether we overwrite or append, should a table with the same name already be in the database. For this first attempt, we set both to their defaults of FALSE:

diris <- as.data.table(iris)  # the data.table version of iris used throughout
dbWriteTable(con1, name = "Iris", diris, overwrite = FALSE, append = FALSE)
[1] TRUE
dbListTables(con1)
[1] "Iris"
dbListFields(con1, "Iris")
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" "Species"     

Now that this data is in our SQLite Iris table, we can read it back into R. Our goal is for you to learn how to read an entire table into R and to confirm that our storage was lossless. Using the function dbReadTable() on our con1 connection and the Iris table inside SQLite gives us access to our data. This function returns a data frame:

diris_lite1 <- dbReadTable(con1, "Iris")                
diris_lite1 <- as.data.table(diris_lite1)
all.equal(diris, diris_lite1)
[1] TRUE

The all.equal() function confirms that there was no loss. This opens up an easy way to store large data: data tables from R are easily stored, and through proper use of overwrite, append, or name, we have a good-enough way to curate that data with minimal direct knowledge of SQLite. Should you find yourself in possession of large tabular data, this is admittedly an easy way to keep track of it. One of the authors has a series of tabular data sets that change occasionally over time but are highly similar; an extra column holding the current date allows large chunks to be readily appended to SQLite as a sort of permanent archive (see the sketch that follows). Now, the main reason this might be useful is to read into R not an entire table, but merely part of one.
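
A minimal sketch of that archiving pattern; the table name IrisArchive is purely illustrative:

## stamp today's snapshot with a date column, then append it to a running archive
snapshot <- as.data.table(iris)
snapshot[, ArchiveDate := as.character(Sys.Date())]  # store the date as text for SQLite
dbWriteTable(con1, name = "IrisArchive", snapshot, append = TRUE)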

Reading part of a table is our goal, and it is not quite as easy as what we did previously. Depending on your familiarity with Structured Query Language (SQL), this step may be more or less difficult. If you are less familiar with SQL, that is perfectly fine; our advice, in that case, is first to cast a broad net into your SQLite table and then use R to remove the rows or columns you neither need nor want (a small sketch follows). On the other hand, if you are more comfortable with SQL, there is no reason not to select precisely the parts you want. Here, we give a few examples that ought to be enough for beginning levels of SQL comfort. Again, we recommend other books for moving deeper into SQL.
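
For the broad-net approach, the full table we already read back (diris_lite1) can then be trimmed with ordinary data.table syntax, for example:

## keep only two columns and drop one species, entirely within R
diris_lite1[Species != "setosa", .(Sepal.Length, Species)]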

A basic SQL query SELECTs specific columns, FROM a specific table, WHERE specific rows of specific columns have a certain feature. We use dbSendQuery() to send our query to SQLite and execute it there. Then, we use dbFetch() to pull the results into R. Finally, we use dbClearResult() to clear the buffer in SQLite in preparation for any future queries.

query <- dbSendQuery(con1, "SELECT [Sepal.Length] FROM Iris WHERE Species = 'setosa'")                
diris_lite2 <- dbFetch(query, n=-1)
head(diris_lite2)
  Sepal.Length
1          5.1
2          4.9
3          4.7
4          4.6
5          5.0
6          5.4
dbClearResult(query)
[1] TRUE

The preceding code returns 50 observations of sepal length for the setosa species. In dbFetch(), we reference the already executed query, and the n argument gives the number of rows to fetch; n = -1 asks for all rows. Alternatively, we might have asked for the 100 rows that are not setosa. The command is similar, as shown in the following code:

query <- dbSendQuery(con1, "SELECT [Sepal.Length], Species FROM Iris WHERE Species <> 'setosa'")                
diris_lite3 <- dbFetch(query, n=-1)
head(diris_lite3)
  Sepal.Length    Species
1          7.0 versicolor
2          6.4 versicolor
3          6.9 versicolor
4          5.5 versicolor
5          6.5 versicolor
6          5.7 versicolor
dbClearResult(query)
[1] TRUE

Notice that because of the full stop between Sepal and Length, we have to enclose that column name in brackets, unlike Species. It is worth mentioning that SQL has some linguistic quirks and reserved characters (much as R does, to some degree). In fact, this could well be the biggest risk for those of us less experienced in SQL. Remember, as soon as you use dbSendQuery(), SQLite cheerfully receives and executes that query. Since SQLite has no user authentication, there is no way to limit yourself to minimal, read-only privileges. In the best case, a wrong command simply has no effect; in the worst case, you could delete, overwrite, or otherwise damage data. Our hard-earned advice is always to work on a local copy of your data, especially if you are reading this to start your big data project. Otherwise, talk to the database owner or administrator if you are new to SQL.

Our last iris example returns all columns by using the * wildcard rather than listing out all columns. Again, we ask for all rows that are not setosa:

query <- dbSendQuery(con1, "SELECT * FROM Iris WHERE Species <> 'setosa'")                
diris_lite4 <- dbFetch(query, n=-1)
diris_lite4 <- as.data.table(diris_lite4)
diris_lite4
     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
  1:          7.0         3.2          4.7         1.4 versicolor
  2:          6.4         3.2          4.5         1.5 versicolor
  3:          6.9         3.1          4.9         1.5 versicolor
 ---                                                             
 98:          6.5         3.0          5.2         2.0  virginica
 99:          6.2         3.4          5.4         2.3  virginica
100:          5.9         3.0          5.1         1.8  virginica
dbClearResult(query)

Perhaps the main purpose of SQLite is to cope with data larger than RAM physically allows. Such data can be read in chunks, as we’ve demonstrated, or we can push large amounts of data out of R to free up memory. Our next example comes with a choice: if you have enough free space in your system RAM, go ahead and run the example with 200 million rows. Otherwise, either change the number of observations in rnorm() to something lower, or use our code online to run the smaller example of 70 million rows.

This code creates 200 million entirely made-up data observations with a mean of 1 and a default standard deviation of 1. As you see, it is about 1.6 gigabytes in size. Certainly, data can be and is larger, but this is good enough as a test case for you to see the effects on your system. It likely takes about 5 minutes to run on your system, so be prepared for a bit of a wait.

BiggerData <- rnorm(200000000, 1)                
BiggerData <-as.data.table(BiggerData)
object.size(BiggerData)
1600001088 bytes
dbWriteTable(con1, "BiggerData", BiggerData)
[1] TRUE
dbListTables(con1)
[1] "BiggerData" "Iris"      

Dropping a table from SQLite is possible from R. Notice that we have two tables listed in our database. BiggerData is a large table, so dropping it is not instant; still, the operation is fairly quick. SQLite does not reduce the size of the database file on your hard disk as a result of this operation. It does internally free up that space, so the file need not grow again, but the space is not returned to the operating system. In SQL parlance, what is needed is a VACUUM of the database. This procedure can require temporary space of up to twice the size of the file, and until it runs, the released space is available for new use only inside the database. Either way, this seems, to us, better done outside R.

dbRemoveTable(con1, "BiggerData")                
[1] TRUE
dbListTables(con1)
[1] "Iris"

Our final bit of code shuts down our connection to our database. The function call is dbDisconnect(), and we call that on our connection, con1. A successful call returns TRUE, and we also see with dbIsValid() that it is indeed shut down. We also remove some unneeded data before we move on to the next database.

dbDisconnect(con1)                
[1] TRUE
dbIsValid(con1)
[1] FALSE
rm(diris_lite1, diris_lite2, diris_lite3, diris_lite4)

We said earlier that the space was not freed up. Since this is invented data, to reclaim the disk space, simply use Windows Explorer after we are done to delete the LiteDB.db file in your C:/sqlite folder. It bears repeating that database management should not be done through R. Rather, R can natively store some data if necessary or access subsets of data for analysis.

PostgreSQL

PostgreSQL is both open source and popular (ranking fifth among all databases according to some sources). Highly advanced, this database understands more than 30 data types internally. We recommend Beginning Databases with PostgreSQL by Neil Matthew and Richard Stones (Apress, 2005). As with SQLite, we walk you through installing a practice version on your local machine. Our goal is to get you using R on data inside this database as soon as possible. More advanced database operations most likely ought not to be done in R.

This database is recommended when data integrity, reliability, and various levels of user access matter. On the other hand, if you just need fast read speed for large but relatively static data, it might not be the best choice. Of course, we admit that the choice of database often has much less to do with what we want and more to do with where the data already lives. One warning about PostgreSQL is that it does have users, and these users have certain privileges. For our example, we have one superuser. From a security standpoint, this is unwise, and from a real-life perspective, a database owner is unlikely to grant analysts such privileges. If you intend to work extensively in this environment and want more practice, an easy way to get it is to create more levels of users than we do in our brief installation, and then practice accessing your database with those usernames and passwords in R, as in the sketch that follows.
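
A minimal sketch of what such a practice connection might look like once the server (installed next) is running; the role name reader and its password are hypothetical, and creating that role happens in pgAdmin or psql rather than in R:

## connect as a hypothetical low-privilege role instead of the postgres superuser
drv <- dbDriver("PostgreSQL")
con_reader <- dbConnect(drv, dbname = "postgres", host = "localhost",
                        port = 5432, user = "reader", password = "readonly")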

Installing PostgreSQL on Windows

Visit the PostgreSQL website ( www.postgresql.org/download/windows/ ) and choose one of the installer options. We chose the latest stable release (not a beta) for 64-bit architecture. After downloading, you should have a file named something like postgresql-9.5.4-windows-x64.exe, which is your installation program. Of course, your version number likely is higher than 9.5.4, as that is our edition as of the writing of this chapter.

Go ahead and install the program, clicking the Next button as needed, and accepting the default settings for the installation directory. The installer asks for a password for the database superuser named postgres, which we set to advancedr. If you select a different password, please be sure to record it securely. Incidentally, this also creates a database with the same name as the superuser. PostgreSQL defaults to port 5432, and if that is not showing as the default option, you likely have another version of PostgreSQL living on your computer. Although we configure the port in R, we highly recommend going with as many defaults as possible for this first practice run. Clicking Next a few more times, accepting the defaults, should start the installation itself. You may be asked on the final screen about Stack Builder; it is not required, and you may uncheck that option before clicking Finish.

At this point, PostgreSQL’s installation is over, and it is running on your computer. If you are familiar with this database, you could add data to it directly (and likely would not have needed this primer). We add data via R and quickly get to accessing subsets of that data.

PostgreSQL and R

Our first task is connecting to our database. This again uses the DBI package to build a link to our database. In the case of PostgreSQL, we need a username and password. We also need to tell the link where the database is and what type it is; because this database is a server, we specify a host and a port rather than a file location. We call this second connection con2 and then use the dbGetInfo() function from the RPostgreSQL package to test whether it started correctly. Ours returns quite a bit of information, so we are ready to go:

drv <- dbDriver("PostgreSQL")                  
con2 <- dbConnect(drv, dbname = "postgres", host = "localhost",
+                  port = 5432, user = "postgres", password = "advancedr")
dbGetInfo(con2)
$host
[1] "localhost"


$port
[1] "5432"


$user
[1] "postgres"


$dbname
[1] "postgres"


$serverVersion
[1] "9.5.4"


$protocolVersion
[1] 3


$backendPId
[1] 14220


$rsId
list()

It may be worth mentioning that this time we included the password for dbConnect() directly in the code, in plain text. This is not the most secure approach, and care should be taken (particularly here, since this is our superuser); we sketch a slightly safer alternative after the next block of code. All the same, our connection is working, it is connected to port 5432 (the default for this database, though not always the port in use), and our host is localhost. Should you connect to a database over a network, you would change the host to that database’s address. Now that our connection is working, we see which tables are in the database and add the familiar iris data set:

dbListTables(con2)                
character(0)
diris <- as.data.table(iris)
dbWriteTable(con2, "iris", diris, overwrite = TRUE)
[1] TRUE
dbListTables(con2)
[1] "iris"
dbListFields(con2, "iris")
[1] "row.names"    "Sepal.Length" "Sepal.Width"  "Petal.Length"  "Petal.Width"  "Species"     

A nice feature of the dbListTables(), dbWriteTable(), and dbListFields() functions is that they do what they say. All the same, we draw your attention to dbWriteTable(). Although writing is not our main focus, this function takes both an overwrite and an append option, each set to FALSE or TRUE; both are convenient in the right situation. We now turn to reading from this database and confirm, with a visual inspection, that we successfully pull data from PostgreSQL into R:

diris_gre1 <- dbReadTable(con2, "iris")                
diris_gre1 <- as.data.table(diris_gre1)
diris_gre1
     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
  1:          5.1         3.5          1.4         0.2    setosa
  2:          4.9         3.0          1.4         0.2    setosa
  3:          4.7         3.2          1.3         0.2    setosa
 ---                                                            
148:          6.5         3.0          5.2         2.0 virginica
149:          6.2         3.4          5.4         2.3 virginica
150:          5.9         3.0          5.1         1.8 virginica

A basic SQL query SELECTs specific columns, FROM a specific table, WHERE specific rows of specific columns have a certain feature. We use dbSendQuery() to send our query to and execute it in PostgreSQL. Then, we use dbFetch() to get our results into R. Notice that when we fetch, we select only the first three rows that have this feature. At the end, we use dbClearResult() to clear the buffer in PostgreSQL in preparation for future queries.

query <- dbSendQuery(con2, statement = 'SELECT * FROM iris WHERE "Petal.Length" > 2')                
diris_gre2 <- dbFetch(query, n=3)
head(diris_gre2)
   row.names Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1        51          7.0         3.2          4.7         1.4 versicolor
2        52          6.4         3.2          4.5         1.5 versicolor
3        53          6.9         3.1          4.9         1.5 versicolor
dbClearResult(query)
[1] TRUE

This is highly similar to what we did earlier with SQLite, because both databases are SQL databases that store tabular data. Alternatively, if we needed only one column of data, we would SELECT that specific column rather than using a wildcard value:

query <- dbSendQuery(con2, 'SELECT "Petal.Length" FROM iris')                
diris_gre3 <- dbFetch(query, n=6)
head(diris_gre3)
  Petal.Length
1          1.4
2          1.4
3          1.3
4          1.5
5          1.4
6          1.7
dbClearResult(query)
[1] TRUE

The next query returns the Sepal.Length and Species columns from our iris data set, for rows where the sepal width does not equal 2.2 (an arbitrary number). Of note is that this returns only 147 rows instead of the full 150, and of course only two columns instead of five. Scaling this idea up, you can readily see how big data might be broken into relevant chunks for analysis. We renamed query to qu simply to save a few characters and keep the query text on one readable line.

qu <- dbSendQuery(con2, 'SELECT "Sepal.Length", "Species" FROM iris WHERE "Sepal.Width" <> 2.2')                
diris_gre5 <- dbFetch(qu, n=-1)
head(diris_gre5)
  Sepal.Length Species
1          5.1  setosa
2          4.9  setosa
3          4.7  setosa
4          4.6  setosa
5          5.0  setosa
6          5.4  setosa
dbClearResult(qu)
[1] TRUE

As with SQLite, part of the point of PostgreSQL is to cope with data larger than RAM physically allows. Such data can be read in chunks, as we’ve demonstrated, or we can push large amounts of data out of R to free up memory. Our next example comes with a choice: if you have enough free space in your system RAM, go ahead and run the example with 200 million rows as written. Otherwise, either change the number of observations in rnorm() to something lower, or use our code online to run the smaller example of 70 million rows.

This code creates 200 million entirely made-up data observations with a mean of 1 and a default standard deviation of 1. As you see, it is about 1.6 gigabytes in size. Certainly, data can be larger, but this might be good enough as a test case to see how it works on your system. It likely takes around 5 minutes to run, so be prepared for a bit of a wait. If you have already run this in an earlier step, there is no need to repeat it.

BiggerData <- rnorm(200000000, 1)                  
BiggerData <-as.data.table(BiggerData)


dbWriteTable(con2, "BiggerData", BiggerData)
[1] TRUE
dbListTables(con2)
[1] "iris"       "BiggerData"

Once again, while we are able (at least with superuser privileges) to DROP a table, doing so does not release the space back to us in PostgreSQL (and hence on our hard drive). Also once again, we recommend saving VACUUM commands for actual database management (and thus recommend running that SQL on the database directly rather than from R).

dbListTables(con2)                
[1] "iris"       "BiggerData"
dbRemoveTable(con2, "BiggerData")
[1] TRUE
dbListTables(con2)
[1] "iris"

To close this connection, there are two steps rather than one: we must disconnect from PostgreSQL and also unload the driver that allows DBI to make that connection. If successful, both calls return TRUE.

dbDisconnect(con2)                
[1] TRUE
dbUnloadDriver(drv)
[1] TRUE

This concludes our demonstration of PostgreSQL and of tabular databases generally. The real way to make this knowledge more powerful is to delve further into SQL syntax, so that dbSendQuery() can be sent on ever more exotic quests. For now, though, you can read a small enough quantity of data into R that some analytics can happen. We turn our attention to a different sort of database.

MongoDB

As mentioned earlier, MongoDB is different. It is a document store, which is a type of nonrelational database: instead of tables, it stores documents (also called objects). These documents are binary versions of JavaScript Object Notation (JSON); because they are binary, MongoDB calls them BSON objects. This type of database is used for data that may be semistructured or unstructured. It is closer to the concept of a data lake (an increasingly popular buzzword in data science), in which diverse types of data may need to be stored and accessed.

One powerful feature of JSON and semistructured data is the ability to collect all the information in one place for one particular instance. For example, suppose we had all patient records in one place. Rather than dozens of tables in which each patient occupies a row, and each table is something along the lines of Address or Diagnosis, we instead store all records for one patient in one blob. Of course, most patients have an address, but perhaps not all need three types of blood panels in their diagnosis section. This is semistructured data. On the other hand, if you are simply collecting data from the Web about a favorite company—including news articles, social media posts or likes, and stock market performance figures—there may be almost no similarity between elements of that data. Regardless, MongoDB can store it, and we can access it.
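
To make the idea concrete, here is a small sketch using the jsonlite package loaded earlier; the patient fields and values are entirely invented for illustration:

## two made-up patient blobs; note that the second simply omits the diagnoses element
patients <- list(
  list(id = 1, name = "Ann", address = list(city = "Columbia City", zip = "46725"),
       diagnoses = c("blood panel A", "blood panel B")),
  list(id = 2, name = "Bob", address = list(city = "Victoria"))
)
cat(toJSON(patients, pretty = TRUE, auto_unbox = TRUE))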

Installing MongoDB on Windows

A visit to the MongoDB website (www.mongodb.com/download-center) shows an option to download the 64-bit version of MongoDB with SSL support (version 3.2.8 as of this writing); see Figure 10-1. Once that is downloaded, go ahead and start the installation process. Select Next, read and agree to the license documentation, and proceed. We used the complete installation and then clicked the Install button. A short time later, click Finish.

Figure 10-1. Downloading MongoDB

We also need to download the sample database for MongoDB. We will use the restaurants data set available at MongoDB’s Import Example Dataset page at https://docs.mongodb.com/getting-started/shell/import-data/ under step 1 (primer-dataset.json). We will also use Ch10_MOCK_DATA.json as another data set. Both are also available as part of the code packet for this chapter from this book’s site. Go ahead and download both files, and store them in C:\Program Files\MongoDB\Server\3.2\bin. Keep in mind that you may be using version 3.3 or later, so be sure to change the version number in this file path as needed. With that, MongoDB is installed, and we have our data sets. Figure 10-2 shows the folder.

Figure 10-2. The MongoDB\Server\3.2\bin folder

Now press and hold the Windows key and press the R key. This opens the Run dialog box. Type cmd and then press Enter/Return. There needs to be a data directory for your database; by default, this is at the root of the main drive (usually the C drive for Windows). In the command prompt, type md \data\db to make the default directory. Note the space between md and \data\db. Next, still in the command prompt, type cd C:\Program Files\MongoDB\Server\3.2\bin and hit Enter/Return. Should version 3.3 release before the printing of this book, change the 3.2 in the preceding file path to 3.3. From here, type mongod.exe and hit Enter/Return. If all goes well, the last line should read waiting for connections on port 27017. Keep this window open (otherwise, our server is not serving files).

Next, again keeping the first console open, press and hold the Windows key and press the R key to open a second Run dialog box. This time, cmd is most likely already filled in, so press Enter/Return to open a second command prompt. In this command prompt, again type cd C:\Program Files\MongoDB\Server\3.2\bin and hit Enter/Return. We now use MongoDB itself to import the two data sets. Run the two mongoimport commands shown next; the following output shows the results:

mongoimport --db test --collection restaurants --drop --file primer-dataset.json                  
2016-09-25T11:11:24    connected to: localhost
2016-09-25T11:11:24    dropping: test.restaurants
2016-09-25T11:11:26    imported 25359 documents


mongoimport --db test --collection mock --drop --file Ch10_MOCK_DATA.json
2016-09-25T11:19:45    connected to: localhost
2016-09-25T11:19:45    dropping: test.mock
2016-09-25T11:19:45    imported 1000 documents

Note that this server has a database named test, and inside that database there are now two collections. These collections, namely restaurants and mock, were created by importing the JSON files. Had those two collections already existed in our test database, the --drop option would have purged them first. It is somewhat standard to use a database named test or temp as a way of signaling that experimentation is expected in such a place.

Go ahead and close this second command prompt window (be careful—you must keep the first window, which should be giving you server COMMAND and NETWORK feedback, open). After closing just this second window, we turn our attention to R.

MongoDB and R

MongoDB does have options for usernames, passwords, and multiple databases. As we have not set up any of these, we simply connect to our local server with our third connection, con3. Unlike our first two connections, this uses a new command, mongoDbConnect() from the RMongo package. Also notice that the host is given as the localhost and the port is specified as well. If you are connecting to a MongoDB instance over the Web, be sure to identify its IP address as well as its port. Of course, also find out from the administrator the name of the database you wish to access. Finally, find out whether you have only read permissions or more. It does not do to experiment on other people’s databases!

con3 <- mongoDbConnect("test", host = "127.0.0.1", port = 27017)

Should authentication be required, you need to run a second command: type and run dbAuthenticate(con3, username, password) in your R session. A word of caution is needed here: depending on the version of MongoDB being accessed, the version of Java installed on your local machine, and the authentication protocol set up by the database administrator, this functionality may or may not work. We realize that this is a vague statement. As of the writing of this book, authentication seems possible, but there are signs that it does not work for all users.

To add to the bad news, at this time the function call dbShowCollections() does not work on the most recent versions of MongoDB. If you are connecting to a mature database, chances are it is using an older version, and you may be able to see the collections. The RMongo package is approximately three years old, and we have hopes for an update. For now, we simply offer the heartless wisdom that connecting to databases, and data munging in general, is rarely easy. Fortunately, we know that our collections are named restaurants and mock.

Thus, we may proceed to access some data from our collections to confirm that we can, in fact, read data from MongoDB into R. Notice that the function call dbGetQuery() takes at least three arguments: first the connection, second a text string naming the collection, and third the search terms used to access our data. In this first case, we do not define any search terms and thus merely pass the empty wrapper {} for that formal. The last two arguments are optional: the 0 says we start at the beginning of the collection, and the 8 says we want the next eight entries only. In other words, we skip no entries and limit the ones returned to eight. We have edited the final output for readability and show just a portion of the View() in Figure 10-3.

Figure 10-3. Part of the results of the View(output1) command
output1 <- dbGetQuery(con3, 'restaurants' , "{}", 0,8 )                  
is.data.frame(output1)
[1] TRUE
names(output1)
[1] "address"  "restaurant_id"  "name"  "cuisine"      "X_id"  "borough"      "grades"       
output1[1,]
                                                                                                                address
{ "building" : "2206" , "coord" : [ -74.1377286 , 40.6119572] , "street" : "Victory Boulevard" , "zipcode" : "10314"}


restaurant_id   name           cuisine        X_id                            borough
40356442        Kosher Island  Jewish/Kosher  57e7f9a275bec64c417b9ab2        Staten Island
                                                                                                                                                                                                                                                                                                                                              grades
[ { "date" : { "$date" : "2014-10-06T00:00:00.000Z"} , "grade" : "A" , "score" : 9} ,


 { "date" : { "$date" : "2014-05-20T00:00:00.000Z"} , "grade" : "A" , "score" : 12} ,

 { "date" : { "$date" : "2013-04-04T00:00:00.000Z"} , "grade" : "A" , "score" : 12} ,

 { "date" : { "$date" : "2012-01-24T00:00:00.000Z"} , "grade" : "A" , "score" : 9}]
View(output1)

This data seems to contain information about locations, types of food, and names of restaurants. Using that knowledge, we refine our search to American cuisine. This is interesting because we remove the start and limit arguments and adjust the query criteria. Here is where data hygiene becomes crucial. Naively, we first attempted dbGetQuery(con3, 'restaurants', '{"cuisine": "American"}'). This did not work. Having already rewritten this section once, because our former favorite package for this database was removed from CRAN, this was a somber moment. Many things were tried; many things failed. In the end, a careful inspection of the raw downloaded JSON file revealed that, for whatever reason, there is a space at the end of American. Not the other food types, mind you, just that one. Had we selected any other cuisine, we would not have had this story to tell. The lesson we want to share is that this process takes time. Expect things to go not so well. Keep persevering, and the data will come eventually.
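
A small sketch of the kind of check that would have revealed the stray space, run on whatever rows you have already pulled (here, the eight rows in output1):

## do any cuisine values carry leading or trailing whitespace?
unique(output1$cuisine)
any(output1$cuisine != trimws(output1$cuisine))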

After correcting our query, we can get the results. Rather than simply view those results, we decided to put the restaurant names into a word cloud, to see whether we could detect a pattern. Be sure to note the space after our cuisine type! Also note that if you search for, say, Bakery, no such space is needed. We show the resulting word cloud in Figure 10-4.

Figure 10-4. A word cloud comprising the names of 1,000 American cuisine restaurants
output2 <- dbGetQuery(con3, 'restaurants' , '{"cuisine": "American "}')                
output2 <- as.data.table(output2)
rNames <- Corpus(VectorSource(output2$name))
wordcloud(rNames, max.words = 20)

Our other, mock data came with an id value already, and MongoDB gives each document or blob its own _id. Thus, to avoid a name conflict, the column comes into R as X_id. This is a rather important point: most SQL or NoSQL databases have some reserved names. While we did our best to sidestep that in this chapter, if you intend to store data yourself in such a database rather than simply read it into R, we recommend reading more on the topic and proceeding with healthy caution. We also formatted this output a bit to clean it up.

dbGetQuery(con3, 'mock', '{}', 0,1)                  

gender
   Male
                                                                                                 financial
{ "account1" : "DO56 T7LH 5884 0956 8784 3936 1833" ,


 "account2" : "197VPF5XPA13ngPwAmSiWsSMoUUPoWFQpa"}

last_name       X_id                          id       ip_address     first_name
Lawson          57e7f92175bec64c417b96ca      1        255.61.75.176  Jesse


diagnoses
Manic affective disorder, recurrent episode, moderate


email
[email protected]

As you can see from our first bit of mock data, we seem to have patient information. We hasten to assure you that this is all imaginary data. Of note with blobs is that while this first one does, in fact, have every field, there is no requirement for each item to be present. Thus, if only one financial account were available, we would simply not include the second account. Unlike tabular data, which requires NA entries of some sort, semistructured data can simply omit elements that are unknown. These data are thus highly flexible (and more efficient to store, because unknowns are not included at all).

The downside is that searching through such unstructured data can be fairly difficult without a strong knowledge of the data set and its naming conventions. All the same, we can find all documents in our collection that have a specific feature in common. The function call dbGetQuery() can take many things in its third formal. Here we pass it a JSON query string that finds only documents in which the gender is Male. Again, the additional formals include skip, which skips the first x documents, and limit, which caps the number of results returned.

mfound <- dbGetQuery(con3, 'mock', '{"gender": "Male"}')

For readers more familiar with MongoDB queries, the third formal can, in fact, contain MongoDB’s query language directly (we sketch one such query after the word cloud code). To learn more about queries in MongoDB, we recommend The Definitive Guide to MongoDB, Third Edition by David Hows et al. (Apress, 2015), particularly Chapter 8. For now, we take our records indicating male gender and pull out the first names to do a bit of analysis on semistructured data. The types of data stored in such blobs are often less suited to traditional statistical tests and perhaps better suited to modern analytics. For simplicity’s sake, after extracting the male gender blobs, we access the first-name component and store it in a corpus called fNames. From there, we build a word cloud of the top 20 names, as shown in Figure 10-5.

Figure 10-5. Word cloud of first names, where gender = male
output3 <- dbGetQuery(con3, 'mock', '{"gender": "Male"}')                
fNames <- Corpus(VectorSource(output3$first_name))
wordcloud(fNames, max.words = 20)
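
As one example of a richer query, standard MongoDB operators such as $regex can be embedded in the JSON string. A hedged sketch; we have not tested every operator against RMongo, so treat this as illustrative:

## male records whose first name starts with "J"
outputJ <- dbGetQuery(con3, 'mock', '{"gender": "Male", "first_name": {"$regex": "^J"}}')
head(outputJ$first_name)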

Our final bit of code shuts down our connection to our database. The function call is dbDisconnect(), and we call that on our third connection, con3. This call both disconnects and frees up client and server resources:

dbDisconnect(con3)                

Quite a bit more could be said about unstructured documents and the exciting possibilities of this type of data. The ability to store and then access chaotic information becomes ever more routine. This is an area where we anticipate much room for growth, as fuzzy matching allows machines to sort through quickly and return near matches—perhaps a topic for other authors, and certainly a topic for another book. The last step to take in this section is to close the command-prompt window that is still open.

Summary

This chapter gave you direct access to three databases with R. While R’s database functions are not designed for curating data, R nevertheless provides a quick way to get a handle on big data. Such data sets seem to be becoming ever more common, and while it may not be feasible for researchers to be familiar with every database, that is not required. Using R, it is entirely possible to pull out a subset of data that contains all items of interest and is small enough to fit into working memory. Skilled users of R, such as this book’s readers, can prune the data further in R itself and from there perform the desired analytics. Table 10-1 provides a summary of the key functions in this chapter.

Table 10-1. Key Functions Described in This Chapter

Function: What It Does

dbConnect(): Uses the DBI package plus a driver to connect to a SQLite or PostgreSQL database.
dbDriver(): Loads the driver needed for PostgreSQL (and other DBI back ends).
dbIsValid(): Tests a SQLite connection for validity and returns a Boolean.
dbWriteTable(): Writes data from R to SQLite or PostgreSQL (you must be a user with write privileges).
dbListTables(): Lists all tables in a SQLite or PostgreSQL database.
dbListFields(): Lists all fields (column names) of a table in a SQLite or PostgreSQL database.
dbSendQuery(): Sends an SQL query to a SQLite or PostgreSQL database and executes the query there.
dbFetch(): Fetches the results from the last sent query in SQLite or PostgreSQL.
dbClearResult(): Sets SQLite or PostgreSQL back to neutral for the next query.
dbRemoveTable(): Drops a table from SQLite or PostgreSQL (user must have privileges); does not VACUUM.
dbDisconnect(): Closes the DBI package connection to SQLite or PostgreSQL.
dbUnloadDriver(): Unloads the PostgreSQL (and other) drivers.
mongoDbConnect(): Creates a connection to a MongoDB server and a specific database on that server.
dbInsertDocument(): Inserts a record into a MongoDB collection; the collection name is its second formal.
dbGetQuery(): For RMongo connections, finds all blobs that match a query; this has several possibilities we only briefly explored.
dbDisconnect(): For RMongo connections, destroys a MongoDB connection and frees client and server resources.
