Steps

The steps to set up the MongoDB Connector for Hadoop are:

  1. Download the mongo-hadoop-core JAR from the Maven repository at: http://repo1.maven.org/maven2/org/mongodb/mongo-hadoop/mongo-hadoop-core/2.0.2/.
  1. Download the MongoDB Java driver from: https://oss.sonatype.org/content/repositories/releases/org/mongodb/mongodb-driver/3.5.0/.
  1. Create a directory (in our case named mongo_lib) and copy these two JARs into it. Then add the directory to Hadoop's classpath with the following command:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:<path_to_directory>/mongo_lib/

Alternatively, we can copy these JARs under the share/hadoop/common/ directory. As these JARs need to be available on every node, for a clustered deployment it's easier to use Hadoop's DistributedCache to distribute them to all nodes.

  1. The next step is to install Hive from: https://hive.apache.org/downloads.html.

For this example, we used a MySQL server for Hive's metastore data. This can be a local MySQL server for development and is recommended to be a remote server for production environments.
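Pointing Hive's metastore at MySQL is done in hive-site.xml. A minimal sketch follows; the hostname, database name, user, and password are placeholders you would replace with your own values:

```xml
<configuration>
  <!-- JDBC connection string for the metastore database (placeholder host/db) -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Placeholder credentials -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
</configuration>
```

The MySQL JDBC driver JAR must also be on Hive's classpath for this configuration to work.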

  1. Once we have Hive set up, we just run the following:
> hive
  1. Then we add the three JARs (mongo-hadoop-core, mongodb-driver, and mongo-hadoop-hive) that we downloaded earlier:
hive> add jar /Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar;
Added [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar] to class path
Added resources: [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar]
hive> add jar /Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar;
Added [/Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar] to class path
Added resources: [/Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar]
hive> add jar /Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar;
Added [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar] to class path
Added resources: [/Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar]
hive>
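Rather than typing the three add jar statements in every session, they can be placed in an initialization file and passed to the Hive CLI with its standard -i flag. The file name and paths below are illustrative:

```sql
-- add-mongo-jars.hql (hypothetical file name)
-- Run with: hive -i add-mongo-jars.hql
add jar /Users/dituser/code/hadoop-2.8.1/mongo-hadoop-core-2.0.2.jar;
add jar /Users/dituser/code/hadoop-2.8.1/mongodb-driver-3.5.0.jar;
add jar /Users/dituser/code/hadoop-2.8.1/mongo-hadoop-hive-2.0.2.jar;
```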

And then, assuming our data is in the exchanges table:

| Column | Type |
| --- | --- |
| customerid | int |
| pair | string |
| time | timestamp |
| recommendation | int |
We can also use Gradle or Maven to download the JARs into our local project. If we only need MapReduce, we just download the mongo-hadoop-core JAR. For Pig, Hive, Streaming, and so on, we must download the appropriate JAR from: http://repo1.maven.org/maven2/org/mongodb/mongo-hadoop/.
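For a Gradle build, the dependency coordinates can be read off the repository paths above (groupId org.mongodb.mongo-hadoop for the connector JARs, org.mongodb for the driver). A sketch of the dependencies block:

```groovy
dependencies {
    // MongoDB Connector for Hadoop core (MapReduce support)
    compile 'org.mongodb.mongo-hadoop:mongo-hadoop-core:2.0.2'
    // MongoDB Java driver
    compile 'org.mongodb:mongodb-driver:3.5.0'
    // Hive support for the connector
    compile 'org.mongodb.mongo-hadoop:mongo-hadoop-hive:2.0.2'
}
```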
Useful Hive commands:
show databases;
create table exchanges(customerid int, pair string, time timestamp, recommendation int);
  1. Now that we are all set, we can create a MongoDB collection backed by our local Hive data:
hive> create external table exchanges_mongo (objectid STRING, customerid INT, pair STRING, time STRING, recommendation INT)
    > STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
    > WITH SERDEPROPERTIES('mongo.columns.mapping'='{"objectid":"_id", "customerid":"customerid", "pair":"pair", "time":"Timestamp", "recommendation":"recommendation"}')
    > tblproperties('mongo.uri'='mongodb://localhost:27017/exchange_data.xmr_btc');
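The mongo.columns.mapping property is a JSON object that maps each Hive column to a field name in the resulting MongoDB document. To make the mapping concrete, here is a small Python sketch (not part of the pipeline) that applies it to a hypothetical Hive row; the sample values are made up:

```python
import json

# The value of 'mongo.columns.mapping' from the DDL above:
# Hive column name -> MongoDB document field name.
MAPPING = json.loads(
    '{"objectid": "_id", "customerid": "customerid", "pair": "pair",'
    ' "time": "Timestamp", "recommendation": "recommendation"}'
)

def hive_row_to_document(row):
    """Illustrative only: the document shape the storage handler
    produces for one Hive row, per the column mapping."""
    return {MAPPING[col]: value for col, value in row.items() if col in MAPPING}

# A hypothetical row from the exchanges table:
row = {
    "objectid": "59a1aabbccddeeff00112233",  # becomes the _id field
    "customerid": 42,
    "pair": "XMR/BTC",
    "time": "2017-09-01 12:00:00",           # stored under the "Timestamp" key
    "recommendation": 1,
}
doc = hive_row_to_document(row)
print(doc)
```

Note that the Hive column time ends up under the key Timestamp in MongoDB, while the other columns keep their names.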
  1. Finally, we can copy all data from the exchanges' Hive table into MongoDB as follows:
hive> insert into table exchanges_mongo select * from exchanges;

This way, we have established a pipeline between Hadoop and MongoDB using Hive, without any external server.
