Avro Data as HIVE Table

In order to create Avro data from the contacts text file, we will make use of an INTERNAL Hive table, since managed tables come with a built-in mechanism to convert text data to Avro data. However, as discussed before, an EXTERNAL table over the Avro data would be more practical from an integration perspective with Sqoop, Flume and Flink.

In order to see this in action we will need to execute the following steps:

  1. The Avro schema for contacts, namely contact.avsc (the schema file), can be represented as shown in the following definition:
{
  "namespace": "example.avro",
  "type": "record",
  "name": "Contact",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "cell", "type": "string"},
    {"name": "phone", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}

All Avro objects depend on a schema definition, and at the storage layer these Avro objects are serialized into Avro data files. The Avro serializers need a reference to this schema to perform serialization. Avro serialization is incremental in nature, which makes it possible to create external tables over Avro data files. When a Sqoop job is run to load data as Avro data files, Sqoop generates a default schema and uses it to serialize the data into Avro data files.

As seen previously, the Avro schema definition is similar to the JSON Schema draft specification, but there are differences in data type support and in the structure of the Avro schema declaration.
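
Since the schema lives in its own contact.avsc file, the more practical EXTERNAL table mentioned earlier can reference that file through the avro.schema.url property instead of embedding the schema inline. The following is a minimal sketch; the table name, the LOCATION directory and the HDFS path of the uploaded schema file are illustrative assumptions:

CREATE EXTERNAL TABLE contactsAvroExt
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
-- hypothetical directory holding the Avro data files
LOCATION '/user/hive/external/contacts_avro'
-- hypothetical HDFS location of the uploaded contact.avsc schema file
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hive/schemas/contact.avsc');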

  2. In the Hive editor, run the following command to create another Hive table, this time in the Avro data format with an inline schema definition:
CREATE TABLE contactsAvro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "namespace": "example.avro",
    "type": "record",
    "name": "Contact",
    "fields": [
      {"name": "id", "type": "string"},
      {"name": "cell", "type": "string"},
      {"name": "phone", "type": "string"},
      {"name": "email", "type": "string"}
    ]
  }'
);
  3. Now let us load the data into this Hive table using the INSERT OVERWRITE query in the Hive Query Builder. This may take some time, as internally it triggers MapReduce jobs for the data load:
INSERT OVERWRITE TABLE contactsAvro SELECT id, cell, phone, email FROM contactsText;
  4. Querying the newly created table (contactsAvro) gives us the same output as we saw before, but if we look into the Hive warehouse we see the Avro data files created by the data load operation, as shown in the following figures:
Figure 21: Avro Data Backed HIVE Table - Data Loaded with INSERT OVERWRITE

The following figure shows the Avro data files in the Hive warehouse folder:

Figure 22: Generated Avro Data Files

The following screenshot shows the content of one of the Avro data files:

Figure 23: View of Avro Data file
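
If browsing the warehouse folder through a file browser is not convenient, the same information can be pulled from within the Hive session. The following is a minimal sketch; the warehouse path used in the dfs command assumes the default warehouse location and may differ in your environment:

-- Show the storage format, SerDe and warehouse directory backing the table
DESCRIBE FORMATTED contactsAvro;

-- List the generated Avro data files from the Hive CLI
-- (the path below assumes the default warehouse location)
dfs -ls /user/hive/warehouse/contactsavro;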

Similarly, Hive tables can be defined for Parquet storage, and the data would then be stored in Parquet format. The only difference is in the way the table is created, with a different SerDe, as sketched below.
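
As a rough illustration of that difference, the following sketch uses the STORED AS PARQUET shorthand available in Hive 0.13 and later (on older versions the Parquet SerDe and input/output formats would have to be spelled out, just as was done for Avro above); the table name contactsParquet is an assumption for illustration:

CREATE TABLE contactsParquet (
  id STRING,
  cell STRING,
  phone STRING,
  email STRING
)
STORED AS PARQUET;

-- Load the same text data into the Parquet-backed table
INSERT OVERWRITE TABLE contactsParquet SELECT id, cell, phone, email FROM contactsText;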
