Let's download Avro and get it installed on our system.
$ export CLASSPATH=avro-1.7.2.jar:${CLASSPATH}
$ export CLASSPATH=avro-mapred-1.7.2.jar:${CLASSPATH}
$ export CLASSPATH=paranamer-2.5.jar:${CLASSPATH}
Add the packages that ship with Hadoop to the build classpath:

$ export CLASSPATH=${HADOOP_HOME}/lib/jackson-core-asl-1.8.jar:${CLASSPATH}
$ export CLASSPATH=${HADOOP_HOME}/lib/jackson-mapper-asl-1.8.jar:${CLASSPATH}
$ export CLASSPATH=${HADOOP_HOME}/lib/commons-cli-1.2.jar:${CLASSPATH}
Copy the three new JAR files into the Hadoop lib directory:

$ cp avro-1.7.2.jar ${HADOOP_HOME}/lib
$ cp avro-mapred-1.7.2.jar ${HADOOP_HOME}/lib
$ cp paranamer-2.5.jar ${HADOOP_HOME}/lib
Setting up Avro is a little involved; it is a much newer project than the other Apache tools we'll be using, so it requires more than a single download of a tarball.
We download the Avro and Avro-mapred JAR files from the Apache website. There is also a dependency on ParaNamer that we download from its home page at http://codehaus.org.
After downloading these JAR files, we need to add them to the classpath used by our environment, primarily for the Java compiler. In addition to these files, we also need to add to the build classpath several packages that ship with Hadoop, because they are required to compile and run Avro code.
Finally, we copy the three new JAR files into the Hadoop lib directory on each host in the cluster to make the classes available to the map and reduce tasks at runtime. We could distribute these JAR files through other mechanisms, but this is the most straightforward means.
One advantage Avro has over tools such as Thrift and Protocol Buffers is the way it approaches the schema describing an Avro datafile. While the other tools always require the schema to be available as a distinct resource, Avro datafiles encode the schema in their header, which allows code to parse the files without ever seeing a separate schema file.
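The value of a self-describing file can be illustrated with a small sketch. The following Python snippet is not the real Avro binary format (the avro library implements that); it only mimics the principle: a JSON schema stored in the file header lets a reader interpret the records with no external schema resource.

```python
import json
import struct

def write_datafile(path, schema, records):
    """Write a toy self-describing file: a length-prefixed JSON schema
    header, followed by length-prefixed JSON-encoded records."""
    with open(path, "wb") as f:
        header = json.dumps(schema).encode("utf-8")
        f.write(struct.pack(">I", len(header)))   # 4-byte header length
        f.write(header)
        for rec in records:
            body = json.dumps(rec).encode("utf-8")
            f.write(struct.pack(">I", len(body)))  # 4-byte record length
            f.write(body)

def read_datafile(path):
    """Read the file back; the schema comes from the header, so the
    caller needs no separate schema file."""
    with open(path, "rb") as f:
        (hlen,) = struct.unpack(">I", f.read(4))
        schema = json.loads(f.read(hlen).decode("utf-8"))
        records = []
        while True:
            prefix = f.read(4)
            if not prefix:
                break
            (rlen,) = struct.unpack(">I", prefix)
            records.append(json.loads(f.read(rlen).decode("utf-8")))
        return schema, records
```

With real Avro files the same property holds: a reader can recover the writer's schema from the datafile header rather than being handed it separately.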
Avro supports, but does not require, code generation that produces code tailored to a specific data schema. This is an optimization that is valuable when possible but not a necessity.
We can therefore write a series of Avro examples that never actually use the datafile schema, but we'll only do that for parts of the process. In the following examples, we will define a schema that represents a cut-down version of the UFO sighting records we used previously.
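As a sketch of what such a schema might look like: an Avro record schema is just a JSON document, so we can define and inspect one with nothing but the standard library. The field names below are illustrative assumptions, not the exact definition used in the examples that follow.

```python
import json

# A hypothetical cut-down UFO sighting schema; the record and field
# names here are assumptions for illustration only.
UFO_SCHEMA = json.loads("""
{
    "type": "record",
    "name": "UFO_Sighting_Record",
    "fields": [
        {"name": "sighting_date", "type": "string"},
        {"name": "location", "type": "string"},
        {"name": "shape", "type": ["null", "string"], "default": null},
        {"name": "duration", "type": "float"}
    ]
}
""")
```

Note the union type `["null", "string"]` on the shape field: this is the standard Avro idiom for an optional value, since not every sighting report records a shape.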