Let's take a look at the simplest way to copy data from a web server onto HDFS.
$ curl localhost:50070 > web.txt
$ ls -ldh web.txt
You will receive the following response:
-rw-r--r-- 1 hadoop hadoop 246 Aug 19 08:53 web.txt
$ hadoop fs -put web.txt web.txt
$ hadoop fs -ls
You will receive the following response:
Found 1 items -rw-r--r-- 1 hadoop supergroup 246 2012-08-19 08:53 /user/hadoop/web.txt
There shouldn't be anything surprising here. We use the curl utility to retrieve a web page from the embedded web server hosting the NameNode web interface and save it to a local file. We then check the file size, copy it to HDFS, and verify that the file has been transferred successfully.
The point of note here is not the series of actions; it is, after all, just another use of the hadoop fs command we have used since Chapter 2, Getting Up and Running. Rather, it is the pattern being used that deserves discussion.
Though the data we wanted lived on a web server and was accessible via the HTTP protocol, the out-of-the-box Hadoop tools are very file-based and have no intrinsic support for such remote information sources. This is why we need to copy our network data into a file before transferring it to HDFS.
We can, of course, write data directly to HDFS through the programmatic interface mentioned back in Chapter 3, Writing MapReduce Jobs, and this would work well. This would, however, require us to start writing custom clients for every different network source from which we need to retrieve data.
Programmatically retrieving data and writing it to HDFS is a very powerful capability and worth some exploration. A very popular Java library for HTTP is the Apache HTTPClient, within the HTTP Components project found at http://hc.apache.org/httpcomponents-client-ga/index.html.
Use the HTTPClient and the Java HDFS interface to retrieve a web page as before and write it to HDFS.
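As a starting point for this exercise, the following sketch combines the two APIs. It assumes HttpClient 4.3 or later (for HttpClients.createDefault) and the hadoop-common jar on the classpath; the URL and the target path in HDFS simply mirror the earlier curl example and are not required by the exercise:

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class HttpToHdfs {
    public static void main(String[] args) throws Exception {
        CloseableHttpClient client = HttpClients.createDefault();
        // Uses the cluster location from the standard Hadoop configuration files.
        FileSystem fs = FileSystem.get(new Configuration());
        try {
            // Fetch the NameNode web interface, as in the curl example.
            HttpResponse response =
                client.execute(new HttpGet("http://localhost:50070/"));
            InputStream in = response.getEntity().getContent();

            // Stream the response body straight into HDFS; no local file needed.
            FSDataOutputStream out = fs.create(new Path("web.txt"));
            IOUtils.copyBytes(in, out, 4096, true); // closes both streams
        } finally {
            client.close();
            fs.close();
        }
    }
}
```

Note that the response is streamed directly from the network connection into HDFS, avoiding the intermediate local file used in the command-line version.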
Our previous example showed the simplest method for getting file-based data into Hadoop, using the standard command-line tools or programmatic APIs. There is little else to discuss here, as it is a topic we have dealt with throughout the book.
Though the preceding approaches are good as far as they go, there are several reasons why they may be unsuitable for production use.
Our model of copying network-accessed data to a file before placing it on HDFS will have an impact on performance. There is added latency due to the round-trip to disk, the slowest part of a system. This may not be an issue for large amounts of data retrieved in one call—though disk space potentially becomes a concern—but for small amounts of data retrieved at high speed, it may become a real problem.
For the file-based approach, it is implicit in the model mentioned before that the host at which we access the file must also have access to the Hadoop installation and be configured with the location of the cluster. This potentially adds dependencies to the system and could force us to add Hadoop to hosts that really need to know nothing about it. We can mitigate this by using tools such as SFTP to retrieve the files to a Hadoop-aware machine and, from there, copy them onto HDFS.
Notice the complete lack of error handling in the previous approaches. The tools we are using do not have built-in retry mechanisms, which means we would need to wrap a degree of error detection and retry logic around each data retrieval.
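To sketch what such a wrapper might look like, here is a minimal POSIX shell function that retries a command a fixed number of times before giving up. The attempt count, the delay between attempts, and the curl and hadoop commands in the usage comment are all illustrative assumptions, not part of any Hadoop tooling:

```shell
# retry: run a command up to a given number of times, pausing briefly
# between attempts, and fail only when every attempt has failed.
retry() {
  attempts=$1
  shift
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      echo "giving up after $attempts attempts: $*" >&2
      return 1
    fi
    i=$((i + 1))
    sleep 2
  done
}

# Usage sketch: fetch the page, then copy it onto HDFS only if the
# download eventually succeeded.
# retry 3 curl -sf localhost:50070 -o web.txt && \
#   retry 3 hadoop fs -put web.txt web.txt
```

Even a wrapper this simple has to make policy decisions (how many attempts, how long to wait, what to log), and duplicating those decisions across every retrieval script is exactly the problem described next.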
This last point touches on perhaps the biggest issue with these ad hoc approaches; it is very easy to end up with a dozen different chains of command-line tools and scripts, each doing very similar tasks. The potential costs in terms of duplicated effort and more difficult error tracking can be significant over time.
Anyone with experience in enterprise computing will, at this point, be thinking that this sounds like a problem best solved with some type of common integration framework. This is exactly correct and is indeed a general type of product well known in fields such as Enterprise Application Integration (EAI).
What we need, though, is a framework that is Hadoop-aware and can easily integrate with Hadoop (and related projects) without requiring massive effort in writing custom adaptors. We could create our own, but instead let's look at Apache Flume, which provides much of what we need.