To show that the data is accessible from multiple languages, let's also display the job output using Java.
OutputRead.java
import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class OutputRead
{
    public static void main(String[] args) throws IOException
    {
        String filename = args[0];
        File file = new File(filename);

        DatumReader<GenericRecord> reader =
            new GenericDatumReader<GenericRecord>();
        DataFileReader<GenericRecord> dataFileReader =
            new DataFileReader<GenericRecord>(file, reader);

        while (dataFileReader.hasNext())
        {
            GenericRecord result = dataFileReader.next();
            String output = String.format("%s %d",
                result.get("shape"), result.get("count"));
            System.out.println(output);
        }
    }
}
$ javac OutputRead.java
$ java OutputRead result.avro
blur 1
cylinder 1
diamond 2
formation 1
light 3
saucer 1
We added this example to show the Avro data being read by more than one language. The code is very similar to the earlier InputRead
class; the only difference is that the named fields are used to display each datum as it is read from the datafile.
As previously mentioned, we worked hard to reduce representation-related complexity in our GraphPath
class. But with mappings to and from flat lines of text and objects, there was an overhead in managing these transformations.
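To make that overhead concrete, the following is a minimal sketch of the kind of to-and-from-text conversion such a job must perform for every node. The field layout here (tab-separated ID, neighbor list, distance, and status) is hypothetical and only illustrates the pattern, not the book's actual GraphPath format:

```java
// A sketch of the per-record conversion overhead when graph nodes are
// stored as flat tab-separated text (hypothetical layout, for illustration).
public class NodeTextCodec
{
    public static class Node
    {
        public int nodeId;
        public int[] neighbors;
        public int distance;
        public String status;
    }

    // Parse "id<TAB>n1,n2,...<TAB>distance<TAB>status" into a Node object.
    public static Node parse(String line)
    {
        String[] fields = line.split("\t");
        Node node = new Node();
        node.nodeId = Integer.parseInt(fields[0]);
        String[] parts = fields[1].split(",");
        node.neighbors = new int[parts.length];
        for (int i = 0; i < parts.length; i++)
        {
            node.neighbors[i] = Integer.parseInt(parts[i]);
        }
        node.distance = Integer.parseInt(fields[2]);
        node.status = fields[3];
        return node;
    }

    // Serialize a Node back to the same flat-text layout.
    public static String format(Node node)
    {
        StringBuilder sb = new StringBuilder();
        sb.append(node.nodeId).append('\t');
        for (int i = 0; i < node.neighbors.length; i++)
        {
            if (i > 0)
            {
                sb.append(',');
            }
            sb.append(node.neighbors[i]);
        }
        sb.append('\t').append(node.distance).append('\t').append(node.status);
        return sb.toString();
    }
}
```

Every mapper and reducer invocation pays this parse/format cost, and any change to the node structure means revisiting both methods; with Avro the nested record is read and written directly, so this layer disappears.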
With its support for nested complex types, Avro can natively support a representation of a node that is much closer to the runtime object. Modify the GraphPath
class job to read and write the graph representation to an Avro datafile comprising a datum for each node. The following example schema may be a good starting point, but feel free to enhance it:
{
  "type": "record",
  "name": "Graph_representation",
  "fields": [
    {"name": "node_id", "type": "int"},
    {"name": "neighbors",
     "type": {"type": "array", "items": "int"}},
    {"name": "distance", "type": "int"},
    {"name": "status",
     "type": {"type": "enum", "name": "Status",
              "symbols": ["PENDING", "CURRENT", "DONE"]}}
  ]
}
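For reference, a single node datum conforming to this schema would look like the following in Avro's JSON encoding (the values shown are purely illustrative):

{"node_id": 1, "neighbors": [2, 3, 4], "distance": 0, "status": "CURRENT"}

Note how the neighbor list and the status enum map directly onto the runtime object's fields, with no string parsing required.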
There are many features of Avro we did not cover in this case study. We focused only on its value as an at-rest data representation. It can also be used within a remote procedure call (RPC) framework, and can optionally be used as the default RPC format in Hadoop 2.0. We didn't use Avro's code generation facilities, which produce a much more domain-focused API. Nor did we cover issues such as Avro's support for schema evolution, which, for example, allows new fields to be added to newer records without invalidating old datums or breaking existing clients. It's a technology you are very likely to see more of in the future.
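As a sketch of how schema evolution works, consider extending the shape-count record used earlier with an additional field (the field name below is hypothetical). Because the new field declares a default value, a reader using this extended schema can still consume datafiles written with the original two-field schema; the missing field is simply filled in with the default:

{
  "type": "record",
  "name": "ShapeCount",
  "fields": [
    {"name": "shape", "type": "string"},
    {"name": "count", "type": "int"},
    {"name": "first_seen", "type": ["null", "string"], "default": null}
  ]
}

This is what allows the format to evolve without rewriting old datafiles or coordinating a simultaneous upgrade of every client.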