Time for action – getting and installing Avro

Let's download Avro and get it installed on our system.

  1. Download the latest stable version of Avro from http://avro.apache.org/releases.html.
  2. Download the latest version of the ParaNamer library from http://paranamer.codehaus.org.
  3. Add the classes to the build classpath used by the Java compiler.
    $ export CLASSPATH=avro-1.7.2.jar:${CLASSPATH}
    $ export CLASSPATH=avro-mapred-1.7.2.jar:${CLASSPATH}
    $ export CLASSPATH=paranamer-2.5.jar:${CLASSPATH}
    
  4. Add existing JAR files from the Hadoop distribution to the build classpath.
    $ export CLASSPATH=${HADOOP_HOME}/lib/jackson-core-asl-1.8.jar:${CLASSPATH}
    $ export CLASSPATH=${HADOOP_HOME}/lib/jackson-mapper-asl-1.8.jar:${CLASSPATH}
    $ export CLASSPATH=${HADOOP_HOME}/lib/commons-cli-1.2.jar:${CLASSPATH}
    
  5. Add the new JAR files to the Hadoop lib directory.
    $ cp avro-1.7.2.jar ${HADOOP_HOME}/lib
    $ cp avro-mapred-1.7.2.jar ${HADOOP_HOME}/lib
    $ cp paranamer-2.5.jar ${HADOOP_HOME}/lib
    

What just happened?

Setting up Avro is a little involved; it is a much newer project than the other Apache tools we'll be using, so it requires more than a single download of a tarball.

We download the Avro and Avro-mapred JAR files from the Apache website. There is also a dependency on ParaNamer that we download from its home page at http://codehaus.org.

Note

The ParaNamer home page has a broken download link at the time of writing; as an alternative, try the following link:

http://search.maven.org/remotecontent?filepath=com/thoughtworks/paranamer/paranamer/2.5/paranamer-2.5.jar

After downloading these JAR files, we need to add them to the classpath used by our environment, primarily for the Java compiler. In addition to these new files, several packages that ship with Hadoop must also be on the build classpath because they are required to compile and run Avro code.
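Before going further, it can be worth confirming that the build classpath really does see the new JAR files. The following throwaway class is a minimal sketch of such a check; the class name AvroClasspathCheck is purely illustrative and is not part of any later example.

    // AvroClasspathCheck.java -- a throwaway class to confirm Avro is on the build classpath
    import org.apache.avro.Schema;

    public class AvroClasspathCheck {
        public static void main(String[] args) {
            // Build a trivial schema; if this compiles and runs, avro-1.7.2.jar is visible
            Schema schema = Schema.create(Schema.Type.STRING);
            System.out.println("Avro is on the classpath: " + schema.toString());
        }
    }

If javac AvroClasspathCheck.java and java AvroClasspathCheck both succeed, the Avro classes are being picked up from the classpath we just configured.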

Finally, we copy the three new JAR files into the Hadoop lib directory on each host in the cluster to enable the classes to be available for the map and reduce tasks at runtime. We could distribute these JAR files through other mechanisms, but this is the most straightforward means.

Avro and schemas

One advantage Avro has over tools such as Thrift and Protocol Buffers is the way it approaches the schema describing an Avro datafile. While the other tools always require the schema to be available as a distinct resource, Avro datafiles encode the schema in their header, which allows code to parse the files without ever seeing a separate schema file.
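As a small illustration of this, the following sketch opens an Avro datafile with the generic API and recovers the schema from nothing more than the file itself; the filename sightings.avro is only a placeholder.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class ReadEmbeddedSchema {
        public static void main(String[] args) throws Exception {
            // No schema file is supplied; the reader discovers the schema from the file header
            File input = new File("sightings.avro"); // placeholder filename
            DataFileReader<GenericRecord> reader =
                    new DataFileReader<GenericRecord>(input, new GenericDatumReader<GenericRecord>());
            try {
                // The writer's schema was stored in the datafile when it was created
                System.out.println("Embedded schema: " + reader.getSchema().toString(true));
                for (GenericRecord record : reader) {
                    System.out.println(record);
                }
            } finally {
                reader.close();
            }
        }
    }

Nothing other than the datafile itself is needed to recover either the schema or the records.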

Avro supports, but does not require, code generation that produces code tailored to a specific data schema. This is an optimization that is valuable when possible, but not a necessity.
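To make the no-code-generation path concrete, here is a brief sketch that writes a datafile using only a schema defined inline as a JSON string and Avro's generic records; the Event schema and the events.avro filename are illustrative and unrelated to the sighting schema defined in the following examples.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class GenericWriteExample {
        public static void main(String[] args) throws Exception {
            // An illustrative schema defined inline as JSON; no generated classes are involved
            String schemaJson =
                "{\"type\": \"record\", \"name\": \"Event\", \"fields\": ["
                + "  {\"name\": \"city\", \"type\": \"string\"},"
                + "  {\"name\": \"duration\", \"type\": \"int\"}"
                + "]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            // Populate a record generically, addressing fields by name
            GenericRecord event = new GenericData.Record(schema);
            event.put("city", "Springfield");
            event.put("duration", 30);

            // Write the datafile; the schema is embedded in the file header automatically
            DataFileWriter<GenericRecord> writer =
                    new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, new File("events.avro")); // placeholder output name
            writer.append(event);
            writer.close();
        }
    }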

We can therefore write a series of Avro examples that never explicitly reference a separate schema file, though we'll only take that approach for parts of the process. In the following examples, we will define a schema that represents a cut-down version of the UFO sighting records we used previously.
