256 | Big Data Simplified
7. We can run the mapper and reducer on local files first (for example, python-input.txt). Running the map and reduce steps against the Hadoop Distributed File System (HDFS) requires the Hadoop Streaming jar library, so before running the scripts on the Hadoop engine, test them locally to ensure that they work as expected.
Run the mapper.
cat python-input.txt | python mapper.py
Run the reducer. Note that the mapper output is piped through sort -k1,1 first, because the reducer expects its input grouped by key, just as Hadoop's shuffle phase would deliver it.
cat python-input.txt | python mapper.py | sort -k1,1 | python reducer.py
Local testing is now complete. The mapper and reducer behave as expected, so the same scripts should run without further issues on Hadoop.
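For reference, a Streaming mapper and reducer follow a simple contract: the mapper reads lines from stdin and emits tab-separated key/value pairs, and the reducer reads the key-sorted pairs and aggregates them. A minimal word-count pair in that style might look like the sketch below; the chapter's actual mapper.py and reducer.py are defined earlier and may differ.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) pair per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_pairs(sorted_pairs):
    """Reducer: sum the counts for each word. The input must already be
    sorted by key (Hadoop's shuffle, or `sort -k1,1` when testing locally,
    guarantees this)."""
    split = (pair.rstrip("\n").split("\t", 1) for pair in sorted_pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# In mapper.py the entry point would simply be:
#     for out in map_lines(sys.stdin): print(out)
# and in reducer.py:
#     for out in reduce_pairs(sys.stdin): print(out)
```

Splitting the logic into plain generator functions keeps the scripts testable without Hadoop: the same functions can be fed lists in a unit test or sys.stdin in production.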
9.4.4 Running the MapReduce Python Code on Hadoop
1. Before we run the MapReduce task on Hadoop, copy the local data file (python-input.txt) into the ‘/data’ directory in HDFS.
hadoop fs -put /<your-local-path>/python-input.txt /data
hadoop fs -cat /data/python-input.txt
2. Now locate the path of the Hadoop Streaming jar inside the Hadoop home directory.
3. Now run the ‘hadoop jar…’ command as shown below.
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input /data/python-input.txt -output /data/pythonoutput
The Python MapReduce job completes successfully, and its output is written to the /data/pythonoutput directory in HDFS.