Exporting data from HDFS into MongoDB using Pig

MongoDB is a NoSQL database designed for storing and retrieving large amounts of data. It is often used to serve user-facing data, which must be cleaned and formatted before it can be made available. Apache Pig was designed, in part, with this kind of work in mind. The MongoStorage class makes it convenient to bulk process data in HDFS using Pig and then load the results directly into MongoDB. This recipe will use the MongoStorage class to store data from HDFS in a MongoDB collection.

Getting ready

The easiest way to get started with the Mongo Hadoop Adaptor is to clone the mongo-hadoop project from GitHub and build it for a specific version of Hadoop. A Git client must be installed to clone the project.

This recipe assumes that you are using the CDH3 distribution of Hadoop.

The official Git Client can be found at http://git-scm.com/downloads.

GitHub for Windows can be found at http://windows.github.com/.

GitHub for Mac can be found at http://mac.github.com/.

The Mongo Hadoop Adaptor can be found on GitHub at https://github.com/mongodb/mongo-hadoop. This project needs to be built for a specific version of Hadoop. The resulting JAR file must be installed on each node in the $HADOOP_HOME/lib folder.

The Mongo Java Driver must also be installed in the $HADOOP_HOME/lib folder on each node. It can be found at https://github.com/mongodb/mongo-java-driver/downloads.

How to do it...

Complete the following steps to copy data from HDFS to MongoDB:

  1. Clone the mongo-hadoop repository:
    git clone https://github.com/mongodb/mongo-hadoop.git
  2. Switch to the stable release 1.0 branch:
    git checkout release-1.0
  3. Set the Hadoop version which mongo-hadoop should target. In the folder that mongo-hadoop was cloned to, open the build.sbt file with a text editor. Change the following line:
    hadoopRelease in ThisBuild := "default"

    to

    hadoopRelease in ThisBuild := "cdh3"
  4. Build mongo-hadoop:
    ./sbt package

    This will create a file named mongo-hadoop-core_cdh3u3-1.0.0.jar in the core/target folder. It will also create a file named mongo-hadoop-pig_cdh3u3-1.0.0.jar in the pig/target folder.

  5. Download the Mongo Java Driver Version 2.8.0 from: https://github.com/mongodb/mongo-java-driver/downloads.
  6. Copy mongo-hadoop-core, mongo-hadoop-pig, and the MongoDB Java Driver to $HADOOP_HOME/lib on each node:
    cp mongo-hadoop-core_cdh3u3-1.0.0.jar mongo-hadoop-pig_cdh3u3-1.0.0.jar mongo-2.8.0.jar $HADOOP_HOME/lib
  7. Create a Pig script that will read the weblogs from HDFS and store them in a MongoDB collection (a sketch for running and verifying the script follows these steps):
    register /path/to/mongo-hadoop/mongo-2.8.0.jar
    register /path/to/mongo-hadoop/core/target/mongo-hadoop-core_cdh3u3-1.0.0.jar
    register /path/to/mongo-hadoop/pig/target/mongo-hadoop-pig_cdh3u3-1.0.0.jar
    
    define MongoStorage com.mongodb.hadoop.pig.MongoStorage();
    
    weblogs = load '/data/weblogs/weblog_entries.txt' as 
                    (md5:chararray, url:chararray, date:chararray, time:chararray, ip:chararray);
    
    store weblogs into 'mongodb://<HOST>:<PORT>/test.weblogs_from_pig' using MongoStorage;
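
To run the script and spot-check the result, submit it like any other Pig job. A minimal sketch, assuming the script was saved as weblogs_to_mongo.pig (a filename chosen here for illustration) and that MongoDB is reachable at the host and port used in the connection string:

    pig -f weblogs_to_mongo.pig

    # Spot-check one stored document from the mongo shell; "test" and
    # "weblogs_from_pig" match the connection string in the Pig script.
    mongo <HOST>:<PORT>/test --eval 'printjson(db.weblogs_from_pig.findOne())'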

How it works...

The Mongo Hadoop Adaptor provides the MongoInputFormat and MongoOutputFormat classes, Hadoop-compatible InputFormat and OutputFormat implementations. These abstractions make reading from and writing to MongoDB similar to working with any Hadoop-compatible data source. MongoStorage uses MongoDB's BasicDBObjectBuilder to convert each Pig tuple into the document type that MongoDB stores; the field names from the Pig schema become the keys of each document.
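
The mapping is easy to see in the stored collection: each field of the weblogs relation becomes a key in a document. A document in weblogs_from_pig would have roughly the following shape (placeholder values, shown only to illustrate the field-to-key mapping):

    {
        "_id"  : ObjectId("..."),
        "md5"  : "<md5 value from the weblog entry>",
        "url"  : "<url value>",
        "date" : "<date value>",
        "time" : "<time value>",
        "ip"   : "<ip value>"
    }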
