Time for action – getting web server data into Hadoop

Let's take a look at a very simple way of copying data from a web server onto HDFS.

  1. Retrieve the text of the NameNode web interface to a local file:
    $ curl localhost:50070 > web.txt
    
  2. Check the file size:
    $ ls -ldh web.txt 
    

    You will receive the following response:

    -rw-r--r-- 1 hadoop hadoop 246 Aug 19 08:53 web.txt
    
  3. Copy the file to HDFS:
    $ hadoop fs -put web.txt web.txt
    
  4. Check the file on HDFS:
    $ hadoop fs -ls 
    

    You will receive the following response:

    Found 1 items
    -rw-r--r--   1 hadoop supergroup        246 2012-08-19 08:53 /user/hadoop/web.txt
    

What just happened?

There shouldn't be anything surprising here. We use the curl utility to retrieve a web page from the embedded web server hosting the NameNode web interface and save it to a local file. We then check the file size, copy the file to HDFS, and verify that it has been transferred successfully.

The point of note here is not the series of actions; it is, after all, just another use of the hadoop fs command we have used since Chapter 2, Getting Up and Running. What deserves discussion is the pattern we followed.

Though the data we wanted lived on a web server and was accessible via HTTP, the out-of-the-box Hadoop tools are file-based and have no intrinsic support for such remote information sources. This is why we had to copy our network data into a local file before transferring it to HDFS.

We can, of course, write data directly to HDFS through the programmatic interface mentioned back in Chapter 3, Writing MapReduce Jobs, and this would work well. This would, however, require us to start writing custom clients for every different network source from which we need to retrieve data.

Have a go hero

Programmatically retrieving data and writing it to HDFS is a very powerful capability and worth some exploration. A very popular Java library for HTTP is Apache HttpClient, part of the HttpComponents project found at http://hc.apache.org/httpcomponents-client-ga/index.html.

Use HttpClient and the Java HDFS interface to retrieve a web page as before and write it to HDFS.
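As a starting point, here is a minimal sketch of one way this might look. It assumes HttpClient 4.x and the Hadoop client libraries are on the classpath and that the cluster configuration is visible to the program; the class name, URL, and target path are purely illustrative:

    import java.io.InputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;

    public class WebToHdfs {
        public static void main(String[] args) throws Exception {
            // Retrieve the NameNode web interface page over HTTP
            CloseableHttpClient client = HttpClients.createDefault();
            CloseableHttpResponse response =
                    client.execute(new HttpGet("http://localhost:50070"));

            // Open an output stream on HDFS using the configured default filesystem
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("web.txt"));

            // Stream the response body straight into HDFS; the final argument
            // tells copyBytes to close both streams when it finishes
            InputStream in = response.getEntity().getContent();
            IOUtils.copyBytes(in, out, 4096, true);

            response.close();
            client.close();
        }
    }

Note that the response body is streamed directly into HDFS; unlike the curl example, no intermediate local file is involved.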

Getting files into Hadoop

Our previous example showed the simplest method of getting file-based data into Hadoop: the use of the standard command-line tools or the programmatic APIs. There is little else to discuss here, as it is a topic we have dealt with throughout the book.

Hidden issues

Though the preceding approaches are good as far as they go, there are several reasons why they may be unsuitable for production use.

Keeping network data on the network

Our model of copying network-accessed data to a file before placing it on HDFS will have an impact on performance. There is added latency due to the round-trip to disk, usually the slowest part of the system. This may not be an issue for large amounts of data retrieved in a single call (though disk space can then become a concern), but for small amounts of data retrieved at high speed, it may become a real problem.

Hadoop dependencies

The file-based approach implicitly assumes that the host on which the file lands has access to a Hadoop installation and is configured to know the location of the cluster. This potentially adds dependencies to the system and could force us to add Hadoop to hosts that really need to know nothing about it. We can mitigate this by using tools such as SFTP to retrieve the files to a Hadoop-aware machine and, from there, copy them onto HDFS.

Reliability

Notice the complete lack of error handling in the previous approaches. The tools we are using do not have built-in retry mechanisms, which means we would need to wrap a degree of error detection and retry logic around each data retrieval ourselves.
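To make this concrete, the following is a rough sketch of the kind of retry wrapper we would end up hand-rolling around every such retrieval; the class name, attempt count, and linear back-off are arbitrary choices for illustration:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    public class RetryingFetch {
        // Try the fetch up to maxAttempts times, sleeping between failures
        public static byte[] fetchWithRetry(String url, int maxAttempts)
                throws IOException, InterruptedException {
            IOException lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                try (InputStream in = new URL(url).openStream()) {
                    // Read the full response body into memory
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int read;
                    while ((read = in.read(chunk)) != -1) {
                        buffer.write(chunk, 0, read);
                    }
                    return buffer.toByteArray();
                } catch (IOException e) {
                    lastFailure = e;               // remember the failure and try again
                    Thread.sleep(1000L * attempt); // crude linear back-off
                }
            }
            throw lastFailure != null ? lastFailure
                    : new IOException("maxAttempts must be at least 1");
        }
    }

Every ad hoc ingestion script ends up needing something like this, which leads directly to the duplication problem discussed next.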

Re-creating the wheel

This last point touches on perhaps the biggest issue with these ad hoc approaches: it is very easy to end up with a dozen different chains of command-line tools and scripts, each doing very similar tasks. The potential costs in terms of duplicated effort and more difficult error tracking can be significant over time.

A common framework approach

Anyone with experience in enterprise computing will, at this point, be thinking that this sounds like a problem best solved with some type of common integration framework. This is exactly right, and such frameworks are a well-known class of product in fields such as Enterprise Application Integration (EAI).

What we need, though, is a framework that is Hadoop-aware and can easily integrate with Hadoop (and related projects) without requiring massive effort in writing custom adaptors. We could create our own, but instead let's look at Apache Flume, which provides much of what we need.
