7.3.2 Features of Apache Pig
Apache Pig comes with the following features.
Rich Set of Operators: It provides many operators to perform operations such as join, sort,
filter, etc.
Ease of Programming: Pig Latin is similar to SQL, so it is easy to write a Pig script if you are
good at SQL.
Extensibility: Using the existing operators, users can develop their own functions to read,
process and write data.
UDFs: Pig provides the facility to create User-Defined Functions (UDFs) in other programming
languages, such as Java, and to invoke them in Pig scripts very easily.
Handles All Kinds of Data: Apache Pig analyses all kinds of data, both structured and
unstructured, and stores the results in HDFS.
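A minimal sketch of these features in action is given below; the input file customers.txt, the jar myudfs.jar and the Java UDF class UPPER are hypothetical names used only for illustration.

-- register a (hypothetical) jar containing a Java UDF
register myudfs.jar;
cust = load '/data/customers.txt' using PigStorage(',') as (id:int, name:chararray, country:chararray);
usa = filter cust by country == 'USA';                              -- filter operator
sorted_cust = order usa by name;                                    -- sort operator
upper_names = foreach sorted_cust generate id, myudfs.UPPER(name);  -- invoke the Java UDF
DUMP upper_names;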
7.3.3 Apache Pig vs. MapReduce
Apache Pig is a data flow language; MapReduce is a data processing framework.
Pig Latin is a high-level language; MapReduce is low level and rigid.
Performing a join operation in Apache Pig is pretty simple, with SQL-like syntax; in MapReduce it is quite difficult to perform a join between datasets.
Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent; MapReduce requires almost 20 times more lines of code to perform the same task.
In Pig there is no need for a separate compilation step: on execution, every Apache Pig operator is converted internally into a MapReduce job, and Pig supports lazy execution; MapReduce jobs have a long compilation process, and every task's intermediate output is written to the data node's local file system.
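To make the contrast concrete, a join that would otherwise need a full MapReduce program can be written in a couple of Pig Latin statements; the file and relation names below are placeholders, not part of the book's later example.

customers = load '/data/customers.txt' using PigStorage(',') as (custid:int, name:chararray);
orders = load '/data/orders.txt' using PigStorage(',') as (custid:int, amount:float);
cust_orders = join orders by custid, customers by custid;   -- SQL-like join in a single statement
DUMP cust_orders;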
7.3.4 Pig Architecture
As shown in Figure 7.3, there are various components in the Apache Pig framework.
Script Parser: Initially, the Pig scripts are handled by the parser, which checks the syntax of the
script, performs type checking and carries out other miscellaneous checks. The output of the
parser is a DAG (directed acyclic graph) representing the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data
flows are represented as edges.
Pig Optimizer: The logical plan (DAG) is passed to the logical optimizer, which carries out
the logical optimizations such as projection and pushdown.
Pig Compiler: The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution Engine: The MapReduce jobs are submitted to Hadoop through YARN in sorted order.
Finally, these MapReduce jobs are executed on Hadoop, producing the results.
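These stages can be observed directly from the Grunt shell using Pig's EXPLAIN operator, which prints the logical, physical and MapReduce plans produced by the parser, optimizer and compiler for a relation. The small script below is only an illustrative sketch (the file and relation names are hypothetical).

s = load '/data/sample.txt' using PigStorage(',') as (id:int, val:int);
t = filter s by val > 10;
EXPLAIN t;   -- prints the logical, physical and MapReduce plans for relation t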
Data le for this below example:
emp.txt
4564546454,280813,640,890,Anup,USA
4564546678,280813,740,875,Sam,UK
4564546489,280813,640,840,Ramos,USA
4564546498,280813,840,820,Mohor,India
4564546498,280813,840,820,Jack,Aus
transactions.txt
4564546454,800,08-20-2013
4564546454,900,08-20-2013
4564546454,700,08-20-2013
4564546454,600,08-21-2013
4564546678,900,08-21-2013
4564546489,400,08-21-2013
4564546498,500,08-21-2013
4564546678,1600,08-20-2013
4564546678,1800,08-20-2013
4564546678,1900,08-20-2013
4564546678,900,08-21-2013
Figure 7.3 Apache Pig architecture: Pig Latin scripts, submitted from the Grunt shell or Pig Server, pass through the script parser, Pig optimizer, Pig compiler and Pig execution engine, and run as MapReduce jobs on YARN over HDFS.
Example: Load the data from emp.txt as (accno-long, date-long, ctrycd-int, groupid-int,
name-chararray, ctry-chararray)
a = load '/ana/data/employee/emp.txt' using PigStorage(',') as
    (accno:long, date:long, ctrycd:int, groupid:int, name:chararray,
    ctry:chararray);
DUMP a;
Load the data from transactions.txt as (accno-long, amount-float, date-chararray)
b = load '/user/sayan/transactions.txt' using PigStorage(',') as
    (accno:long, amnt:float, date:chararray);
DUMP b;
Find the accounts where the total transaction amount is greater than $2400
c = group b by accno;                                            -- group transactions by account number
d = foreach c generate group as accno, SUM(b.amnt) as sumamnt;   -- total amount per account
e = filter d by sumamnt > 2400;                                  -- keep accounts with totals above 2400
DUMP e;
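As a possible extension (not part of the original example), the filtered totals in e can be joined back to the employee data in relation a to show each qualifying account holder's name next to the total amount:

f = join e by accno, a by accno;                        -- join totals with employee details
g = foreach f generate e::accno, a::name, e::sumamnt;   -- disambiguate fields after the join
DUMP g;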