Other Big Data Tools and Technologies | 171
7.3.2 Features of Apache Pig
Apache Pig comes with the following features.
• Rich Set of Operators: It provides many operators to perform operations such as join, sort,
filter, etc.
• Ease of Programming: Pig Latin is similar to SQL, so it is easy to write a Pig script if you are
familiar with SQL.
• Extensibility: Using the existing operators, users can develop their own functions to read,
process and write data.
• UDFs: Pig provides the facility to create User-Defined Functions (UDFs) in other program-
ming languages, such as Java, and invoke them in Pig scripts very easily.
• Handles All Kinds of Data: Apache Pig analyses all kinds of data, both structured and
unstructured. Pig stores the results in HDFS.
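As a brief illustration of these features, the following sketch (the file names, HDFS paths, and field layout are hypothetical) loads a dataset with a declared schema, filters and sorts it with the built-in operators, and stores the result back into HDFS:

```pig
-- Load a hypothetical comma-separated employee file from HDFS
employees = LOAD 'hdfs://namenode/data/employees.csv'
            USING PigStorage(',')
            AS (id:int, name:chararray, dept:chararray, salary:double);

-- FILTER: keep only the well-paid employees
highpaid = FILTER employees BY salary > 75000.0;

-- ORDER: sort the surviving records by salary, highest first
sorted = ORDER highpaid BY salary DESC;

-- STORE: write the result back into HDFS as comma-separated text
STORE sorted INTO 'hdfs://namenode/output/highpaid' USING PigStorage(',');
```

Each statement defines a named relation; nothing is executed until a STORE (or DUMP) forces the data flow to run, which is the lazy-execution behaviour discussed below.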
7.3.3 Apache Pig vs. MapReduce
| Apache Pig | MapReduce |
|---|---|
| Apache Pig is a data flow language. | MapReduce is a data processing framework. |
| It is a high-level language. | MapReduce is low level and rigid. |
| Performing a Join operation in Apache Pig is pretty simple, with SQL-like syntax. | It is quite difficult in MapReduce to perform a Join operation between datasets. |
| Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent. | MapReduce requires almost 20 times more lines of code to perform the same task. |
| There is no need for compilation. On execution, every Apache Pig operator is converted internally into a MapReduce job. Pig supports lazy execution. | MapReduce jobs must be compiled before execution, and each child task's intermediate state is written to every data node's local file system. |
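The join comparison above can be made concrete. In Pig Latin, a join between two datasets is a single statement, whereas the equivalent MapReduce program needs custom mapper, reducer, and driver classes. A minimal sketch, with hypothetical file names and schemas:

```pig
-- Two hypothetical input datasets sharing a customer id
customers = LOAD 'customers.txt' USING PigStorage(',')
            AS (cust_id:int, name:chararray);
orders    = LOAD 'orders.txt' USING PigStorage(',')
            AS (order_id:int, cust_id:int, amount:double);

-- SQL-like inner join on the customer id: a single Pig Latin statement
joined = JOIN customers BY cust_id, orders BY cust_id;

DUMP joined;  -- print the joined tuples to the console
```

Pig translates this one JOIN statement into the underlying shuffle-and-merge MapReduce job on the user's behalf.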
7.3.4 Pig Architecture
As shown in Figure 7.3, there are various components in the Apache Pig framework.
• Script Parser: Initially, the Pig scripts are handled by the parser, which checks the syntax of
the script, performs type checking, and carries out other miscellaneous checks. The output
of the parser is a DAG (directed acyclic graph), which represents the Pig Latin statements
and logical operators. In the DAG, the logical operators of the script are represented as the
nodes and the data flows are represented as edges.
• Pig Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out
logical optimizations such as projection pushdown (pruning unused columns early).
• Pig Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
• Execution Engine: The MapReduce jobs are submitted to the Hadoop execution engine in
sorted order, typically through YARN containers. Finally, these MapReduce jobs are executed
on Hadoop, producing the results.
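One way to observe this pipeline is Pig's EXPLAIN statement, which prints the logical plan (the parser and optimizer's DAG), the physical plan, and the MapReduce plan that the compiler produces for a relation. A short sketch, assuming a hypothetical input file of words:

```pig
-- A simple word-count data flow
data    = LOAD 'input.txt' AS (word:chararray);
grouped = GROUP data BY word;
counts  = FOREACH grouped GENERATE group, COUNT(data);

-- Show how the script above is translated into the logical,
-- physical, and MapReduce plans by the Pig compiler
EXPLAIN counts;
```

Because Pig is lazy, no job runs here; EXPLAIN only displays the plans that would be submitted to Hadoop if the relation were stored or dumped.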