Time for action – checking the prerequisites

Hadoop is written in Java, so you will need a recent Java Development Kit (JDK) installed on the Ubuntu host. Perform the following steps to check the prerequisites:

  1. First, check what's already available by opening up a terminal and typing the following:
    $ javac
    $ java -version
    
  2. If either of these commands gives a "no such file or directory" or similar error, or if the latter mentions OpenJDK, it's likely you need to download the full JDK. Grab this from the Oracle download page at http://www.oracle.com/technetwork/java/javase/downloads/index.html; you should get the latest release.
  3. Once Java is installed, add the JDK/bin directory to your path and set the JAVA_HOME environment variable with commands such as the following, modified for your specific Java version:
    $ export JAVA_HOME=/opt/jdk1.6.0_24
    $ export PATH=$JAVA_HOME/bin:${PATH}
    

What just happened?

These steps ensure the right version of Java is installed and available from the command line without having to use lengthy pathnames to refer to the install location.
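As a quick sanity check that the settings took effect in the current shell, you can query the environment and the Java tools directly; these are all standard commands available on any Ubuntu host:

$ echo $JAVA_HOME
$ which java
$ java -version
$ javac -version

The first two should point at your JDK install location, and the reported version should match the JDK you downloaded.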

Remember that the preceding commands only affect the currently running shell, and the settings will be lost after you log out, close the shell, or reboot. To ensure the same setup is always available, you can add these commands to the startup file for your shell of choice; for example, the .bash_profile file for the BASH shell or the .cshrc file for TCSH.
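For the BASH shell, for instance, a minimal sketch would be to append the same export lines to the .bash_profile file in your home directory (the JDK path shown is only illustrative; substitute your own install location):

$ echo 'export JAVA_HOME=/opt/jdk1.6.0_24' >> ~/.bash_profile
$ echo 'export PATH=$JAVA_HOME/bin:${PATH}' >> ~/.bash_profile

The lines are then read on your next login; run source ~/.bash_profile to apply them to the current session as well.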

An alternative I favor is to put all the required configuration settings into a standalone file and then explicitly source it from the command line; for example:

$ source Hadoop_config.sh

This technique allows you to keep multiple setup files in the same account without making shell startup overly complex; it also helps when the required configurations for different applications are actually incompatible with each other. Just remember to load the file at the start of each session!
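As an illustration, a minimal Hadoop_config.sh (the filename is just the example used above) need contain nothing more than the environment settings from the previous steps, with the paths adjusted to your own installation:

$ cat Hadoop_config.sh
export JAVA_HOME=/opt/jdk1.6.0_24
export PATH=$JAVA_HOME/bin:${PATH}

Note that the file is loaded with the source command rather than executed directly; sourcing runs the lines in the current shell, so the variables remain set after the script finishes.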

Setting up Hadoop

One of the most confusing aspects of Hadoop to a newcomer is its various components, projects, sub-projects, and their interrelationships. The fact that these have evolved over time hasn't made the task of understanding it all any easier. For now, though, go to http://hadoop.apache.org and you'll see that there are three prominent projects mentioned:

  • Common
  • HDFS
  • MapReduce

The last two of these should be familiar from the explanation in Chapter 1, What It's All About, and the Common project comprises a set of libraries and tools that help the Hadoop product work in the real world. For now, the important thing is that the standard Hadoop distribution bundles the latest versions of all three of these projects, and the combination is what you need to get going.

A note on versions

Hadoop underwent a major change in the transition from the 0.19 to the 0.20 versions, most notably with a migration to a set of new APIs used to develop MapReduce applications. We will be primarily using the new APIs in this book, though we do include a few examples of the older API in later chapters, as not all of the existing features have been ported to the new API.

Hadoop versioning also became complicated when the 0.20 branch was renamed to 1.0. The 0.22 and 0.23 branches remained and, in fact, included features not present in the 1.0 branch. At the time of this writing, things were becoming clearer, with the 1.1 and 2.0 branches being used for future development releases. As most existing systems and third-party tools are built against the 0.20 branch, we will use Hadoop 1.0 for the examples in this book.
