SerDe

SerDe stands for Serialization and Deserialization. It is the mechanism Hive uses to process records and map them to the column data types of a table. To explain where SerDe fits in, we first need to understand how Hive reads and writes data.

The process to read data is as follows.

  1. Data is read from HDFS.
  2. Data is processed by the INPUTFORMAT implementation, which defines the input data splits and key/value records. In Hive, we can use CREATE TABLE ... STORED AS <FILE_FORMAT> (see Chapter 9, Performance Considerations) to specify which INPUTFORMAT the table reads with.
  3. The Java Deserializer class defined in the SerDe is called to format the data into a record that maps to the columns and data types of the table.

For an example of reading data, we can use JSON SerDe to read TEXTFILE-format data from HDFS and translate each row's JSON attributes and values into rows of a Hive table with the correct schema.
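The read path above can be sketched as follows. This is an illustrative example only: the table name, column names, and HDFS location are made up, and it assumes the HCatalog JSON SerDe is available (as it is in Hive v0.12.0 and later). The TEXTFILE INPUTFORMAT yields one line per record, and the Deserializer maps each JSON attribute to a column:

```sql
-- Sketch only: table, columns, and location are illustrative
> CREATE EXTERNAL TABLE employee_json(
> name string,
> age int
> )
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE
> LOCATION '/tmp/employee_json';

-- A file line such as {"name":"Will","age":30} is deserialized
-- into a row (name='Will', age=30) at query time
> SELECT name, age FROM employee_json;
```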

The process to write data is as follows:

  1. Data to be written (for example, by an INSERT statement) is translated by the Serializer class defined in the SerDe into a format that the OUTPUTFORMAT class can process.
  2. Data is processed by the OUTPUTFORMAT implementation, which creates the RecordWriter object. Similar to the INPUTFORMAT implementation, the OUTPUTFORMAT implementation is specified in the same way, on the table to which the data is written.
  3. The data is written to the table (data saved in HDFS).

For an example of writing data, we can write rows of data to a Hive table using JSON SerDe, which serializes each row into a JSON text string saved to HDFS.
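The write path can be sketched the same way. Again, the table name and columns here are illustrative, and the HCatalog JSON SerDe is assumed. An INSERT passes each row through the Serializer, which renders it as a JSON text line that the TEXTFILE OUTPUTFORMAT's RecordWriter then writes to HDFS:

```sql
-- Sketch only: table and columns are illustrative
> CREATE TABLE employee_json_out(
> name string,
> age int
> )
> ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE;

-- Each inserted row is serialized to a JSON line,
-- e.g. {"name":"Will","age":30}, before being written to HDFS
> INSERT INTO TABLE employee_json_out
> SELECT 'Will' as name, 30 as age;
```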

A list of commonly used SerDes (in the org.apache.hadoop.hive.serde2 package) is as follows:

  • LazySimpleSerDe: The default built-in SerDe (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe) that's used with the TEXTFILE format. It can be implemented as follows:
      > CREATE TABLE test_serde_lz 
> STORED as TEXTFILE as
> SELECT name from employee;
No rows affected (32.665 seconds)
  • ColumnarSerDe: This is the built-in SerDe used with the RCFILE format (the ORC format uses its own built-in OrcSerde). It can be used as follows:
      > CREATE TABLE test_serde_rc
> STORED as RCFILE as
> SELECT name from employee;
No rows affected (27.187 seconds)

> CREATE TABLE test_serde_orc
> STORED as ORC as
> SELECT name from employee;
No rows affected (24.087 seconds)
  • RegexSerDe: This is the built-in SerDe that uses Java regular expressions to parse text files. It can be used as follows:
      > CREATE TABLE test_serde_rex(
> name string,
> gender string,
> age string
> )
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES(
> 'input.regex' = '([^,]*),([^,]*),([^,]*)',
> 'output.format.string' = '%1$s %2$s %3$s'
> )
> STORED AS TEXTFILE;
No rows affected (0.266 seconds)
  • HBaseSerDe: This is the built-in SerDe that enables Hive to integrate with HBase. We can map a Hive table to an existing HBase table by leveraging this SerDe for querying as well as inserting data. Make sure the HBase daemons are running before running the following query. More details are introduced in Chapter 10, Working with Other Tools:
      > CREATE TABLE test_serde_hb(
> id string,
> name string,
> gender string,
> age string
> )
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.hbase.HBaseSerDe'
> STORED BY
> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES (
> "hbase.columns.mapping"=
> ":key,info:name,info:gender,info:age"
> )
> TBLPROPERTIES("hbase.table.name" = "test_serde");
No rows affected (0.387 seconds)
  • AvroSerDe: This is the built-in SerDe that enables reading and writing Avro (see http://avro.apache.org/) data in Hive tables. Avro is a remote-procedure-call and data-serialization framework. As of Hive v0.14.0, Avro-backed tables can simply be created by specifying the file format as AVRO, in three ways:
      > CREATE TABLE test_serde_avro( -- Specify schema directly 
> name string,
> gender string,
> age string
> )
> STORED as AVRO;
No rows affected (0.31 seconds)


> CREATE TABLE test_serde_avro2 -- Specify schema from properties
> STORED as AVRO
> TBLPROPERTIES (
> 'avro.schema.literal'='{
> "type":"record",
> "name":"user",
> "fields":[
> {"name":"name", "type":"string"},
> {"name":"gender", "type":"string", "aliases":["gender"]},
> {"name":"age", "type":"string", "default":"null"}
> ]
> }'
> );
No rows affected (0.41 seconds)


-- Using a schema file as follows is a more flexible way
> CREATE TABLE test_serde_avro3 -- Specify schema from a schema file
> STORED as AVRO
> TBLPROPERTIES (
> 'avro.schema.url'='/tmp/schema/test_avro_schema.avsc'
> );
No rows affected (0.21 seconds)


-- Check the schema file
$ cat /tmp/schema/test_avro_schema.avsc
{
"type" : "record",
"name" : "test",
"fields" : [
{"name":"name", "type":"string"},
{"name":"gender", "type":"string", "aliases":["gender"]},
{"name":"age", "type":"string", "default":"null"}
]
}
  • ParquetHiveSerDe: This is the built-in SerDe (parquet.hive.serde.ParquetHiveSerDe) that enables reading and writing the Parquet data format as of Hive v0.13.0. It can be used as follows:
      > CREATE TABLE test_serde_parquet
> STORED as PARQUET as
> SELECT name from employee;
No rows affected (34.079 seconds)
  • OpenCSVSerDe: This is the SerDe used to read and write CSV data. It comes as a built-in SerDe as of Hive v0.14.0. OpenCSVSerDe is more powerful than the default delimiter-based row format because it supports features such as quote and escape characters. It can be used as follows:
      > CREATE TABLE test_serde_csv(
> name string,
> gender string,
> age string
> )
> ROW FORMAT SERDE
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> WITH SERDEPROPERTIES (
> "separatorChar" = " ",
> "quoteChar" = "'",
> "escapeChar" = "\\"
> )
> STORED AS TEXTFILE;
  • JSONSerDe: JSON SerDe is available as of Hive v0.12.0 to read and write JSON data records with Hive: 
      > CREATE TABLE test_serde_js(
> name string,
> gender string,
> age string
> )
> ROW FORMAT SERDE
> 'org.apache.hive.hcatalog.data.JsonSerDe'
> STORED AS TEXTFILE;
No rows affected (0.245 seconds)

Hive also allows users to define a customized SerDe if none of these work for their data format. For more information about custom SerDe, please refer to the Hive Wiki at https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HowtoWriteYourOwnSerDe.
