Installing and using MongoDB

MongoDB is supported on all major platforms, including Windows, Linux, and OS X.

Installation instructions for MongoDB can be found on the official website at https://docs.mongodb.com/manual/installation/. Note that we will be using the MongoDB Community Edition.

For our exercise, we will re-use the Linux CentOS environment from our Cloudera Hadoop Distribution VM.

The exercise does not, however, depend on the platform on which you install MongoDB. Once the installation is complete, you can execute the commands in this chapter on any supported platform. If you have access to a separate Linux machine, you can use that as well.

We will look at some of the common operations in MongoDB and download two datasets to compute the number of Nobel Prizes grouped by continent. The complete dump of the Nobel Prize data on Nobel Laureates is available from nobelprize.org. The data contains all the primary attributes of Laureates. We wish to integrate this data with demographic information on the respective countries to extract more interesting analytical information:

  1. Download MongoDB: MongoDB can be downloaded from https://www.mongodb.com/download-center#community.

To determine which version is applicable for us, we checked the version of Linux installed on the CDH VM:

[cloudera@quickstart ~]$ lsb_release -a
LSB Version:     :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch 
Distributor ID:  CentOS 
Description:     CentOS release 6.7 (Final) 
Release:  6.7 
Codename: Final 
  2. Based on this information, we have to use the CentOS version of MongoDB. Following the instructions at https://docs.mongodb.com/manual/tutorial/install-mongodb-on-red-hat/, we installed the software as follows:
The first step is to add the repository. Type sudo nano /etc/yum.repos.d/mongodb-org-3.4.repo on the command line and enter the text as shown.
 
 
[root@quickstart cloudera]# sudo nano /etc/yum.repos.d/mongodb-org-3.4.repo
 
### Type in the information shown below and press CTRL-X 
### When prompted to save buffer, type in yes

[mongodb-org-3.4]
name=MongoDB Repository
baseurl=https://repo.mongodb.org/yum/redhat/$releasever/mongodb-org/3.4/x86_64/
gpgcheck=1
enabled=1
gpgkey=https://www.mongodb.org/static/pgp/server-3.4.asc

The following screenshot shows the contents of the file:

Setting up MongoDB repository

As seen in the following screenshot, type Y for Yes:

Saving the .repo file

Save the file as shown in the following image. This will allow us to install MongoDB:

Writing and Saving the .repo file
# Back in terminal, type in the following

[cloudera@quickstart ~]$ sudo yum install -y mongodb-org

(...)

Installing:
 mongodb-org            x86_64   3.4.6-1.el6   mongodb-org-3.4   5.8 k
Installing for dependencies:
 mongodb-org-mongos     x86_64   3.4.6-1.el6   mongodb-org-3.4    12 M
 mongodb-org-server     x86_64   3.4.6-1.el6   mongodb-org-3.4    20 M
 mongodb-org-shell      x86_64   3.4.6-1.el6   mongodb-org-3.4    11 M
 mongodb-org-tools      x86_64   3.4.6-1.el6   mongodb-org-3.4    49 M

Transaction Summary
=====================================================================
Install       5 Package(s)

Total download size: 91 M
Installed size: 258 M

Downloading Packages:
(1/5): mongodb-org-3.4.6-1.el6.x86_64.rpm          | 5.8 kB     00:00

(...)

Installed:
  mongodb-org.x86_64 0:3.4.6-1.el6

Dependency Installed:
  mongodb-org-mongos.x86_64 0:3.4.6-1.el6
  mongodb-org-server.x86_64 0:3.4.6-1.el6
  mongodb-org-shell.x86_64 0:3.4.6-1.el6
  mongodb-org-tools.x86_64 0:3.4.6-1.el6

Complete!

### Attempting to start mongo without first starting the daemon will produce an error message ###
### You need to start the mongo daemon before you can use it ###

[cloudera@quickstart ~]$ mongo
MongoDB shell version v3.4.6
connecting to: mongodb://127.0.0.1:27017
2017-07-30T10:50:58.708-0700 W NETWORK  [thread1] Failed to connect to 127.0.0.1:27017, in(checking socket for error after poll), reason: Connection refused
2017-07-30T10:50:58.708-0700 E QUERY    [thread1] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed :
connect@src/mongo/shell/mongo.js:237:13
@(connect):1:6
exception: connect failed
### The first step is to create the MongoDB dbpath - this is where MongoDB will store all data 
 
### Create a folder called, mongodata, this will be the mongo dbpath ### 
 
[cloudera@quickstart ~]$ mkdir mongodata
### Start mongod ###

[cloudera@quickstart ~]$ mongod --dbpath mongodata
2017-07-30T10:52:17.200-0700 I CONTROL  [initandlisten] MongoDB starting : pid=16093 port=27017 dbpath=mongodata 64-bit host=quickstart.cloudera
(...)
2017-07-30T10:52:17.321-0700 I INDEX    [initandlisten] build index done.  scanned 0 total records. 0 secs
2017-07-30T10:52:17.321-0700 I COMMAND  [initandlisten] setting featureCompatibilityVersion to 3.4
2017-07-30T10:52:17.321-0700 I NETWORK  [thread1] waiting for connections on port 27017

Open a new terminal and download the JSON data files as shown in the following screenshot:

Selecting Open Terminal from Terminal App on Mac OS X
# Download Files
# laureates.json and country.json ###
# Change directory to go to the mongodata folder that you created earlier
[cloudera@quickstart ~]$ cd mongodata

[cloudera@quickstart mongodata]$ curl -o laureates.json "http://api.nobelprize.org/v1/laureate.json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  428k    0  428k    0     0   292k      0 --:--:--  0:00:01 --:--:--  354k

### Clean the file laureates.json
### Delete content up to the first [ on the first line of the file
### Delete the last } character from the file
### Store the cleansed dataset in a file called mongofile.json

Note that the file needs to be slightly modified. The code is shown in the following image:

Modifying the .json file for our application
[cloudera@quickstart mongodata]$ cat laureates.json | sed 's/^{"laureates"://g' | sed 's/}$//g' > mongofile.json
 
 
### Import the cleaned laureates data into MongoDB
### mongoimport is a utility that is used to import data into MongoDB
### The command below will import data from the file mongofile.json
### into a db named nobel, in a collection (i.e., a table) called laureates

[cloudera@quickstart mongodata]$ mongoimport --jsonArray --db nobel --collection laureates --file mongofile.json
2017-07-30T11:06:35.228-0700   connected to: localhost 
2017-07-30T11:06:35.295-0700   imported 910 documents 
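
To see what the two sed substitutions accomplish, here is a small JavaScript sketch (runnable with Node.js, no MongoDB required) that applies the same two patterns to a tiny made-up payload. The sample string and the function name stripWrapper are illustrative assumptions, not part of the Nobel Prize API, and the sketch assumes the downloaded text sits on a single line. The point is only that removing the {"laureates": prefix and the final } leaves a bare JSON array, which is the shape that mongoimport --jsonArray expects.

```javascript
// Hypothetical miniature of the API response: an object wrapping an array.
const raw = '{"laureates":[{"id":"1","surname":"Röntgen"},{"id":"2","surname":"Lorentz"}]}';

// Mirror the two sed expressions used above:
//   s/^{"laureates"://   drops the wrapper key at the start of the text
//   s/}$//               drops the closing brace at the very end
function stripWrapper(text) {
  return text.replace(/^\{"laureates":/, "").replace(/\}$/, "");
}

const cleaned = stripWrapper(raw);
const docs = JSON.parse(cleaned); // throws if the result is not valid JSON

console.log(Array.isArray(docs)); // true
console.log(docs.length);         // 2
```

If JSON.parse fails on the cleaned text, the import would most likely fail as well, so a check of this kind can be a cheap sanity test before running mongoimport.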

In order to combine the data in laureates.json with country-specific information, we need the country data published by geonames.org as countryInfo.txt. We will now download the second file that we need for the exercise, country.json, and use it together with laureates.json.

### country.json: Download it from http://www.geonames.org (license: https://creativecommons.org/licenses/by/3.0/). Modify the start and end of the JSON string to import into MongoDB as follows:

# The file country.json contains descriptive information about all countries
# We will use this file for our tutorial

### Download country.json
[cloudera@quickstart mongodata]$ curl -o country.json "https://raw.githubusercontent.com/xbsd/packtbigdata/master/country.json"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  113k  100  113k    0     0   360k      0 --:--:-- --:--:-- --:--:--  885k

### The file country.json has already been cleaned and can be imported directly into MongoDB
[cloudera@quickstart mongodata]$ mongoimport --jsonArray --db nobel --collection country --file country.json
2017-07-30T12:10:35.554-0700   connected to: localhost
2017-07-30T12:10:35.580-0700   imported 250 documents

### MONGO SHELL ###
[cloudera@quickstart mongodata]$ mongo
MongoDB shell version v3.4.6
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.4.6
Server has startup warnings:
(...)
2017-07-30T10:52:17.298-0700 I CONTROL  [initandlisten]

### Switch to the database nobel using the 'use <databasename>' command
> use nobel
switched to db nobel

### Show all collections (i.e., tables)
### This will show the tables that we imported into MongoDB - country and laureates
> show collections
country
laureates
>

### Collections in MongoDB are the equivalent of tables in SQL

### 1. Common Operations

### View collection statistics using db.<collectionname>.stats()
> db.laureates.stats()

   "ns" : "nobel.laureates", # Name Space
   "size" : 484053, # Size in Bytes
   "count" : 910, # Number of Records
   "avgObjSize" : 531, # Average Object Size
   "storageSize" : 225280, # Data size

# Check space used (in bytes)
> db.laureates.storageSize()
225280

# Check number of records
> db.laureates.count()
910
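
The figures reported by stats() are related to one another: dividing size by count gives the average document size. The quick check below uses plain JavaScript and the numbers from the output above. That the result is truncated to a whole number of bytes is an inference from those numbers, not something the output itself states.

```javascript
// Values copied from the db.laureates.stats() output above.
const size = 484053; // total size of all documents, in bytes
const count = 910;   // number of documents

// Average document size, truncated to whole bytes (assumed rounding).
const avgObjSize = Math.floor(size / count);

console.log(avgObjSize); // 531
```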
### 2. View data from collection ###
### There is an extensive list of commands that can be used in MongoDB. As such, discussing them all is outside the scope of the text. However, a few of the familiar commands have been given below as a marker to help the reader get started with the platform.

### See first record for laureates using findOne()
### findOne() will show the first record in the collection
> db.laureates.findOne()

{
    "_id" : ObjectId("597e202bcd8724f48de485d4"),
    "id" : "1",
    "firstname" : "Wilhelm Conrad",
    "surname" : "Röntgen",
    "born" : "1845-03-27",
    "died" : "1923-02-10",
    "bornCountry" : "Prussia (now Germany)",
    "bornCountryCode" : "DE",
    "bornCity" : "Lennep (now Remscheid)",
    "diedCountry" : "Germany",
    "diedCountryCode" : "DE",
    "diedCity" : "Munich",
    "gender" : "male",
    "prizes" : [
        {
            "year" : "1901",
            "category" : "physics",
            "share" : "1",
            "motivation" : "\"in recognition of the extraordinary services he has rendered by the discovery of the remarkable rays subsequently named after him\"",
            "affiliations" : [
                {
                    "name" : "Munich University",
                    "city" : "Munich",
                    "country" : "Germany"
                }
            ]
        }
    ]
}

### See all records for laureates
> db.laureates.find()

{ "_id" : ObjectId("597e202bcd8724f48de485d4"), "id" : "1", "firstname" : "Wilhelm Conrad", "surname" : "Röntgen", "born" : "1845-03-27", "died" : "1923-02-10", "bornCountry" : "Prussia (now Germany)", "bornCountryCode" : "DE", "bornCity" : "Lennep (now Remscheid)"
(...)
...

### MongoDB functions accept JSON-formatted documents as parameters and options
### Some examples are shown below for reference

### Query a field - Find all Nobel Laureates who were male
> db.laureates.find({"gender":"male"})
(...)
{ "_id" : ObjectId("597e202bcd8724f48de485d5"), "id" : "2", "firstname" : "Hendrik Antoon", "surname" : "Lorentz", "born" : "1853-07-18", "died" : "1928-02-04", "bornCountry" : "the Netherlands", "bornCountryCode" : "NL", "bornCity" : "Arnhem", "diedCountry" : "the Netherlands", "diedCountryCode" : "NL", "gender" : "male", "prizes" : [ { "year" : "1902", "category" : "physics", "share" : "2", "motivation" : "\"in recognition of the extraordinary service they rendered by their researches into the influence of magnetism upon radiation phenomena\"", "affiliations" : [ { "name" : "Leiden University", "city" : "Leiden", "country" : "the Netherlands" } ] } ] }
(...)

Query a field - find all Nobel Laureates who were born in the US and received a Nobel Prize in Physics; the example additionally restricts the results to laureates whose birth city contains Chicago. Note that here we have a nested field (category is under prizes, as shown). Hence, we will use the dot notation, as shown in the following image.

Image illustrating category, one of the nested fields:

Nested JSON Fields
> db.laureates.find({"bornCountryCode":"US", "prizes.category":"physics", "bornCity": /Chicago/}) 
 
{ "_id" : ObjectId("597e202bcd8724f48de48638"), "id" : "103", "firstname" : "Ben Roy", "surname" : "Mottelson", "born" : "1926-07-09", "died" : "0000-00-00", "bornCountry" : "USA", "bornCountryCode" : "US", "bornCity" : "Chicago, IL", 
... 
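
The following plain-JavaScript sketch mimics how the filter in the query above is evaluated. It is not MongoDB code: the two sample documents are heavily trimmed stand-ins, and matches is a hypothetical helper. It illustrates three points: all conditions must hold (an implicit AND), "prizes.category" matches when any element of the prizes array has that category, and /Chicago/ is an unanchored regular-expression match rather than an equality test.

```javascript
// Trimmed sample documents shaped like those in the laureates collection.
const sampleDocs = [
  { surname: "Mottelson", bornCountryCode: "US", bornCity: "Chicago, IL",
    prizes: [{ category: "physics" }] },
  { surname: "Millikan", bornCountryCode: "US", bornCity: "Morrison, IL",
    prizes: [{ category: "physics" }] },
];

// Hypothetical helper emulating the filter
// {"bornCountryCode":"US", "prizes.category":"physics", "bornCity": /Chicago/}
function matches(doc) {
  return doc.bornCountryCode === "US"
    && doc.prizes.some((prize) => prize.category === "physics") // reaches into each array element
    && /Chicago/.test(doc.bornCity);                            // substring match, not equality
}

const result = sampleDocs.filter(matches).map((doc) => doc.surname);
console.log(result); // [ 'Mottelson' ]
```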
 
 
### Check number of distinct prize categories using distinct 
> db.laureates.distinct("prizes.category") 
[
   "physics",
   "chemistry",
   "peace",
   "medicine",
   "literature",
   "economics"
]
 
### Using Comparison Operators 
### MongoDB allows users to chain multiple comparison operators
### Details on MongoDB operators can be found at: https://docs.mongodb.com/manual/reference/operator/ 
 
# Find Nobel Laureates born in either India or Egypt using the $in operator
> db.laureates.find( { bornCountryCode: { $in: ["IN","EG"] } } )
 
{ "_id" : ObjectId("597e202bcd8724f48de485f7"), "id" : "37", "firstname" : "Sir Chandrasekhara Venkata", "surname" : "Raman", "born" : "1888-11-07", "died" : "1970-11-21", "bornCountry" : "India", "bornCountryCode" : "IN", "bornCity" : "Tiruchirappalli", "diedCountry" : "India", "diedCountryCode" : "IN", "diedCity" : "Bangalore", "gender" : "male", "prizes" : [ { "year" : "1930", "category" : "physics", "share" : "1", "motivation" : "\"for his work on the scattering of light and for the discovery of the effect named after him\"", "affiliations" : [ { "name" : "Calcutta University", "city" : "Calcutta", "country" : "India" } ] } ] }
... 
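
The $in operator matches a document when the field equals any one of the listed values, which makes it a compact alternative to several equality tests combined with $or. A plain-JavaScript equivalent on a few trimmed sample records is sketched below; the records and the helper name bornIn are illustrative assumptions, and the helper only covers simple scalar fields.

```javascript
// Trimmed sample records; only the fields needed for the filter are kept.
const people = [
  { surname: "Raman", bornCountryCode: "IN" },
  { surname: "Lorentz", bornCountryCode: "NL" },
  { surname: "Röntgen", bornCountryCode: "DE" },
];

// Hypothetical helper emulating { bornCountryCode: { $in: codes } }.
function bornIn(docs, codes) {
  return docs.filter((doc) => codes.includes(doc.bornCountryCode));
}

const found = bornIn(people, ["IN", "EG"]).map((doc) => doc.surname);
console.log(found); // [ 'Raman' ]
```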
 
### Using Multiple Comparison Operators 
 
### Find Nobel laureates who were born in either the US or China and won a prize in either Physics or Chemistry using the $and and $or operators
> db.laureates.find( {
$and : [ { $or : [ { bornCountryCode : "US" }, { bornCountryCode : "CN" } ] },
{ $or : [ { "prizes.category" : "physics" }, { "prizes.category" : "chemistry" } ] } ] } )
{ "_id" : ObjectId("597e202bcd8724f48de485ee"), "id" : "28", "firstname" : "Robert Andrews", "surname" : "Millikan", "born" : "1868-03-22", "died" : "1953-12-19", "bornCountry" : "USA", "bornCountryCode" : "US", "bornCity" : "Morrison, IL", "diedCountry" : "USA", "diedCountryCode" : "US", "diedCity" : "San Marino, CA", "gender" : "male", "prizes" : [ { "year" : "1923", "category" : "physics", "share" : "1", "motivation" : "\"for his work on the elementary charge of electricity and on the photoelectric effect\"", "affiliations" : [ { "name" : "California Institute of Technology (Caltech)", "city" : "Pasadena, CA", "country" : "USA" } ] } ] }
...

### Performing Aggregations is one of the common operations in MongoDB queries
### MongoDB allows users to perform pipeline aggregations, map-reduce aggregations and single purpose aggregations

### Details on MongoDB aggregations can be found at the URL
### https://docs.mongodb.com/manual/aggregation/

### Aggregation Examples

### Count and aggregate total Nobel laureates by year and sort in descending order
### Step 1: Use the $group operator to indicate that prizes.year will be the grouping variable
### Step 2: Use the $sum operator (accumulator) to sum each entry under a variable called totalPrizes
### Step 3: Use the $sort operator to rank totalPrizes

> db.laureates.aggregate(
  {$group: {_id: '$prizes.year', totalPrizes: {$sum: 1}}},
  {$sort: {totalPrizes: -1}}
);

{ "_id" : [ "2001" ], "totalPrizes" : 15 }
{ "_id" : [ "2014" ], "totalPrizes" : 13 }
{ "_id" : [ "2002" ], "totalPrizes" : 13 }
{ "_id" : [ "2000" ], "totalPrizes" : 13 }

(...)

### To count and aggregate total prizes by country of birth

> db.laureates.aggregate(
  {$group: {_id: '$bornCountry', totalPrizes: {$sum: 1}}},
  {$sort: {totalPrizes: -1}}
);

{ "_id" : "USA", "totalPrizes" : 257 }
{ "_id" : "United Kingdom", "totalPrizes" : 84 }
{ "_id" : "Germany", "totalPrizes" : 61 }
{ "_id" : "France", "totalPrizes" : 51 }
...
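
Two details of these results are easy to miss. Each document in laureates is one laureate, so {$sum: 1} counts laureates rather than individual prizes. Also, because prizes is an array, '$prizes.year' resolves to an array of years, which is why each _id in the first result is an array holding a year rather than a plain string. The plain-JavaScript sketch below mimics the $group and $sort stages on a handful of invented records to show the effect. groupCount is a hypothetical helper and the data is made up for illustration; the sketch approximates, rather than reproduces, the server's behavior.

```javascript
// Invented records: one document per laureate, each with an array of prizes.
const laureates = [
  { surname: "A", prizes: [{ year: "2001" }] },
  { surname: "B", prizes: [{ year: "2001" }] },
  { surname: "C", prizes: [{ year: "1903" }, { year: "1911" }] }, // two prizes
  { surname: "D", prizes: [{ year: "1903" }] },
];

// Hypothetical helper emulating {$group: {_id: <key>, totalPrizes: {$sum: 1}}}
// followed by {$sort: {totalPrizes: -1}}.
function groupCount(docs, keyOf) {
  const counts = new Map();
  for (const doc of docs) {
    const key = JSON.stringify(keyOf(doc));      // serialize so arrays compare by value
    counts.set(key, (counts.get(key) || 0) + 1); // $sum: 1 adds one per document
  }
  return [...counts]
    .map(([key, totalPrizes]) => ({ _id: JSON.parse(key), totalPrizes }))
    .sort((a, b) => b.totalPrizes - a.totalPrizes);
}

// '$prizes.year' yields the array of years held by each laureate.
const byYear = groupCount(laureates, (doc) => doc.prizes.map((p) => p.year));
console.log(JSON.stringify(byYear));
// [{"_id":["2001"],"totalPrizes":2},{"_id":["1903","1911"],"totalPrizes":1},{"_id":["1903"],"totalPrizes":1}]
```

Laureate C ends up in a group of its own, separate from D, even though both won in 1903. To count one entry per prize instead, the aggregation framework provides the $unwind stage, which could be placed before $group; the aggregation documentation referenced above covers it.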
### MongoDB also supports PCRE (Perl-Compatible) Regular Expressions
### For more information, see https://docs.mongodb.com/manual/reference/operator/query/regex

### Using Regular Expressions: Find count of nobel laureates by country of birth whose prize was related to 'radiation' (as indicated in the field motivation under prizes)

> db.laureates.aggregate(
  {$match : { "prizes.motivation" : /radiation/ }},
  {$group: {_id: '$bornCountry', totalPrizes: {$sum: 1}}},
  {$sort: {totalPrizes: -1}}
);

{ "_id" : "USA", "totalPrizes" : 4 }
{ "_id" : "Germany", "totalPrizes" : 2 }
{ "_id" : "the Netherlands", "totalPrizes" : 2 }
{ "_id" : "United Kingdom", "totalPrizes" : 2 }
{ "_id" : "France", "totalPrizes" : 1 }
{ "_id" : "Prussia (now Russia)", "totalPrizes" : 1 }

### Result: We see that the highest number of prizes (in which radiation was mentioned as a keyword) went to laureates born in the US

### Interestingly, we can also do joins and other similar operations that allow us to combine the data with other data sources
### In this case, we'd like to join the data in laureates with the data from country information obtained earlier
### The collection country contains many interesting fields, but for this exercise, we will show how to find the total number of nobel laureates by continent

### The Left Join
### Step 1: Use the $lookup operator to define the from/to fields, collection names and assign the data to a field named countryInfo
### We can join the field bornCountryCode from laureates with the field countryCode from the collection country

> db.laureates.aggregate(
  {$lookup: { from: "country", localField: "bornCountryCode", foreignField: "countryCode", as: "countryInfo" }})

{ "_id" : ObjectId("597e202bcd8724f48de485d4"), "id" : "1", "firstname" : "Wilhelm Conrad", "surname" : "Röntgen", "born" : "1845-03-27", "died" : "1923-02-10", "bornCountry" : "Prussia (now Germany)", "bornCountryCode" : "DE", "bornCity" : "Lennep (now (..) "country" : "Germany" } ] } ], "countryInfo" : [ { "_id" : ObjectId("597e2f2bcd8724f48de489aa"), "continent" : "EU", "capital" : "Berlin", "languages" : "de", "geonameId" : 2921044, "south" : 47.2701236047002,
...

### With the data joined, we can now perform combined aggregations
### Find the number of Nobel laureates by continent

> db.laureates.aggregate(
  {$lookup: { from: "country", localField: "bornCountryCode", foreignField: "countryCode", as: "countryInfo" }},
  {$group: {_id: '$countryInfo.continent', totalPrizes: {$sum: 1}}},
  {$sort: {totalPrizes: -1}}
);

{ "_id" : [ "EU" ], "totalPrizes" : 478 }
{ "_id" : [ "NA" ], "totalPrizes" : 285 }
{ "_id" : [ "AS" ], "totalPrizes" : 67 }
...

This indicates that Europe has by far the highest number of Nobel Laureates.
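
The shape of the joined result explains why the continent appears as an array in the final output: $lookup stores all matching documents from the other collection in an array field, and laureates with no matching country keep an empty array, as in a left outer join. The plain-JavaScript sketch below imitates that behavior on invented sample data. leftJoin is a hypothetical helper that only handles simple scalar fields, and the records are placeholders rather than rows from the real collections.

```javascript
// Invented stand-ins for the two collections.
const laureateDocs = [
  { surname: "A", bornCountryCode: "DE" },
  { surname: "B", bornCountryCode: "US" },
  { surname: "C", bornCountryCode: "DE" },
  { surname: "D" }, // no country code recorded
];
const countryDocs = [
  { countryCode: "DE", continent: "EU" },
  { countryCode: "US", continent: "NA" },
];

// Hypothetical helper emulating {$lookup: {from, localField, foreignField, as}}.
function leftJoin(left, right, localField, foreignField, as) {
  return left.map((doc) => ({
    ...doc,
    [as]: right.filter((other) => other[foreignField] === doc[localField]),
  }));
}

const joined = leftJoin(laureateDocs, countryDocs, "bornCountryCode", "countryCode", "countryInfo");

// Count laureates per continent; unmatched laureates fall under "(none)".
const totals = {};
for (const doc of joined) {
  const key = doc.countryInfo.map((c) => c.continent).join(",") || "(none)";
  totals[key] = (totals[key] || 0) + 1;
}
console.log(JSON.stringify(totals)); // {"EU":2,"NA":1,"(none)":1}
```

Every laureate survives the join, including D, whose countryInfo array is empty. This is the sense in which the operation behaves like a left join.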

There are many other operations that can be performed, but the intention of the prior section was to introduce MongoDB at a high level with a simple use case. The URLs given in this chapter contain more in-depth information on using MongoDB.

There are also several visualization tools in the industry that are used to interact with and visualize data stored in MongoDB collections using a point-and-click interface. A simple yet powerful tool called MongoDB Compass is available at https://www.mongodb.com/download-center?filter=enterprise?jmp=nav#compass.

Navigate to the previously mentioned URL and download the version of Compass that is appropriate for your environment:

Downloading MongoDB Compass

After installation, you'll see a welcome screen. Click on Next until you see the main dashboard:

MongoDB Compass Screenshot

Click on Performance to view the current status of MongoDB:

MongoDB Performance Screen

Expand the nobel database by clicking on the arrow next to its name in the left sidebar. You can click and drag on different parts of the bar charts to run ad hoc queries. This is very useful if you want to get an overall understanding of the dataset without having to run every query by hand, as shown in the following screenshot:

Viewing our file in MongoDB Compass