Apache Tez

Apache Tez is part of the Stinger initiative, led by Hortonworks, to make Hive enterprise-ready and suitable for interactive SQL queries. The Tez design is based on research done by Microsoft on parallel and distributed computing.

Tez entered the Apache Incubator in February 2013 and graduated to a top-level project in July 2014.

Tez is an embeddable and extensible framework for building high-performance batch and interactive data-processing applications that need to integrate easily with YARN.

Confusion often arises when Tez is thought of as an engine. Tez is not a general-purpose engine; it is a framework on which tools build engines tailored to their own needs. For example, Tez enables Hive, Pig, and others to build their own purpose-built engines and embed them in those technologies. Projects such as Hive, Pig, and Cascading now see significant improvements in response times when they use Tez instead of MapReduce.

Tez generalizes the MapReduce paradigm into a more powerful framework based on expressing computations as a dataflow graph, and it exists to address some of the limitations of MapReduce. For example, a typical MapReduce pipeline stores a lot of temporary data (such as each mapper's output, which is written to disk), and this disk I/O is overhead. Tez avoids much of this disk I/O for temporary data, which results in higher performance than the MapReduce model.

Tez can also adjust the parallelism of reduce tasks at runtime, depending on the actual size of the data coming out of the previous task. In MapReduce, by contrast, the number of reducers is static and has to be decided by the user before the job is submitted to the cluster.
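The difference between a static and a runtime-chosen reducer count can be illustrated with a minimal sketch. This is not Tez's actual algorithm or API; the function name `choose_reducer_count`, the per-reducer target size, and the sample numbers are all assumptions made for illustration only.

```python
import math

GIB = 1024 ** 3


def choose_reducer_count(observed_bytes, target_bytes_per_reducer, max_reducers):
    """Hypothetical helper: size the reduce stage from the data actually produced.

    observed_bytes is the upstream output size measured at runtime,
    target_bytes_per_reducer is how much data one reducer should handle,
    max_reducers is the upper bound configured before submission.
    """
    if observed_bytes <= 0:
        return 1  # keep at least one task so the stage still runs
    needed = math.ceil(observed_bytes / target_bytes_per_reducer)
    return max(1, min(max_reducers, needed))


# MapReduce style: the user fixes the count before submitting the job.
static_reducers = 100

# Runtime style: upstream tasks turned out to emit only 3 GiB,
# so far fewer reducers are needed than the static guess.
dynamic_reducers = choose_reducer_count(3 * GIB, 1 * GIB, max_reducers=static_reducers)

print(static_reducers, dynamic_reducers)  # 100 3
```

In this toy model the static guess would launch 100 reducers for 3 GiB of data, whereas the runtime choice launches 3.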

The processing done by multiple MapReduce jobs can now be done by a single Tez job, as follows:

[Figure: Apache Tez]

Referring to the preceding diagram, earlier (with Pig/Hive) we needed multiple MapReduce jobs to do some processing. Now, with Tez, a single Tez job does the same work; that is, the reducers (the green boxes) of the previous step feed the mappers (the blue boxes) of the next step.

The preceding image is taken from http://www.infoq.com/articles/apache-tez-saha-murthy.
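To make the comparison concrete, here is a rough counting sketch. It is a toy model, not a measurement: it assumes a linear pipeline, assumes that each MapReduce job can contain at most one reduce stage, and assumes that the output of every job except the last is written to durable storage as temporary data. The function names are hypothetical.

```python
def mapreduce_jobs_needed(stages):
    """Count MapReduce jobs for a linear pipeline of 'map'/'reduce' stages.

    Toy rule: consecutive map stages share a job, and every reduce stage
    closes the current job, so a reduce that follows a reduce needs a new job.
    """
    jobs = 0
    open_job = False
    for stage in stages:
        if stage == "map":
            open_job = True
        elif stage == "reduce":
            jobs += 1
            open_job = False
        else:
            raise ValueError(f"unknown stage: {stage!r}")
    if open_job:
        jobs += 1  # trailing map-only job
    return jobs


def intermediate_writes(jobs):
    """Temporary hand-offs through durable storage between consecutive jobs."""
    return max(0, jobs - 1)


pipeline = ["map", "reduce", "reduce", "reduce"]

mr_jobs = mapreduce_jobs_needed(pipeline)
dag_jobs = 1  # the whole pipeline is expressed as one dataflow graph

print(mr_jobs, intermediate_writes(mr_jobs))    # 3 2
print(dag_jobs, intermediate_writes(dag_jobs))  # 1 0
```

Under these assumptions, the three-reduce pipeline costs three MapReduce jobs with two temporary hand-offs between them, while the single-graph version has none.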

Tez is not meant directly for end users; rather, it enables developers to build end-user applications with much better performance and flexibility. Traditionally, Hadoop has been a batch-processing platform for large amounts of data. However, many use cases need near-real-time query performance. There are also several workloads, such as machine learning, that do not fit into the MapReduce paradigm. Tez helps Hadoop address these use cases.

Tez provides an expressive dataflow-definition API that lets developers create their own unique data-processing graphs (DAGs) to represent their applications' data-processing flows. Once the developer defines a flow, Tez then provides additional APIs to inject custom business logic that will run in that flow. These APIs then combine inputs (that read data), outputs (that write data), and processors (that process data) to process the flow.
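The idea of defining a flow first and then plugging logic into it can be sketched in a few lines. The following is a conceptual illustration only and does not use the Tez API (which is Java); the vertex names, the `run_dag` helper, and the in-memory `sink` are hypothetical. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

sink = []  # stands in for an output that writes data somewhere


def read_input():
    """Input: reads data."""
    return ["a b", "b c", "c"]


def tokenize(lines):
    """Processor: splits lines into words."""
    return [word for line in lines for word in line.split()]


def count_words(words):
    """Processor: counts occurrences of each word."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


def write_output(counts):
    """Output: writes data."""
    sink.append(counts)


# The flow: vertex name -> (logic to run, upstream vertices it depends on).
dag = {
    "input": (read_input, []),
    "tokenizer": (tokenize, ["input"]),
    "summation": (count_words, ["tokenizer"]),
    "output": (write_output, ["summation"]),
}


def run_dag(dag):
    """Run every vertex after its upstream vertices, passing results along edges."""
    graph = {name: set(deps) for name, (_, deps) in dag.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        logic, deps = dag[name]
        results[name] = logic(*(results[dep] for dep in deps))
    return results


run_dag(dag)
print(sink[0])  # {'a': 1, 'b': 2, 'c': 2}
```

The graph definition and the business logic stay separate here: swapping `count_words` for another function changes what the flow computes without changing its shape.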

Tez can also run any existing MR job without any modification. For more information on Tez, refer to http://tez.apache.org/.
