Configuring Hadoop runtime on Windows

We have already seen how to test Spark applications written in Scala on Eclipse or IntelliJ, but there is another potential issue that should not be overlooked. Although Spark works on Windows, it is designed to run on UNIX-like operating systems, so extra care is needed when working in a Windows environment.

While using Eclipse or IntelliJ to develop Spark applications for data analytics, machine learning, data science, or deep learning on Windows, you might face an I/O exception, and your application might fail to compile or be interrupted. The reason is that Spark expects a Hadoop runtime environment on Windows as well. For example, if you run a Spark application, say KMeansDemo.scala, on Eclipse for the first time, you will see an I/O exception like the following:

17/02/26 13:22:00 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

The reason is that, by default, Hadoop is built for the Linux environment, and if you are developing your Spark applications on the Windows platform, a bridge is required that provides a Hadoop runtime environment for Spark to execute properly. The details of the I/O exception can be seen in the following figure:

Figure 14: I/O exception caused by the failure to locate the winutils binary in the Hadoop binary path

Now, how do we get rid of this problem? The solution is straightforward. As the error message says, we need an executable named winutils.exe. Download the winutils.exe file from https://github.com/steveloughran/winutils/tree/master/hadoop-2.7.1/bin, paste it into the Spark distribution directory, and configure Eclipse. More specifically, suppose your Spark distribution containing Hadoop is located at C:/Users/spark-2.1.0-bin-hadoop2.7. Inside the Spark distribution, there is a directory named bin. Paste the executable there (that is, path = C:/Users/spark-2.1.0-bin-hadoop2.7/bin/).
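As an alternative to configuring Eclipse, you can point Spark at the Hadoop runtime directly from your application by setting the hadoop.home.dir system property before the SparkSession is created. The following is a minimal sketch, assuming the distribution path shown previously; the object and application names are only illustrative:

import org.apache.spark.sql.SparkSession

object WinutilsDemo {
  def main(args: Array[String]): Unit = {
    // Assumed path: the directory whose bin folder contains winutils.exe.
    // Adjust this to wherever your Spark/Hadoop distribution actually lives.
    System.setProperty("hadoop.home.dir", "C:/Users/spark-2.1.0-bin-hadoop2.7")

    val spark = SparkSession.builder()
      .appName("WinutilsDemo")
      .master("local[*]")
      .getOrCreate()

    // A trivial job to confirm that Spark starts without the winutils I/O exception
    spark.range(10).show()

    spark.stop()
  }
}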

The second phase of the solution is to go to Eclipse, select the main class (that is, KMeansDemo.scala in this case), and open the Run menu. From the Run menu, go to the Run Configurations option, and from there select the Environment tab, as shown in the following figure:

Figure 15: Solving the I/O exception caused by the absence of the winutils binary in the Hadoop binary path

On that tab, you will have the option to create a new environment variable for the JVM that Eclipse uses. Create a new environment variable named HADOOP_HOME and set its value to C:/Users/spark-2.1.0-bin-hadoop2.7/. Now click the Apply button and rerun your application, and your problem should be resolved.
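If you want to confirm that the variable is actually visible to your application before rerunning a heavier job, a small check like the following can help. This is only a sketch; the object name is hypothetical:

object HadoopHomeCheck {
  def main(args: Array[String]): Unit = {
    // HADOOP_HOME should point to the directory whose bin folder contains winutils.exe
    println(s"HADOOP_HOME     = ${sys.env.getOrElse("HADOOP_HOME", "<not set>")}")
    // hadoop.home.dir is the system property alternative shown earlier
    println(s"hadoop.home.dir = ${Option(System.getProperty("hadoop.home.dir")).getOrElse("<not set>")}")
  }
}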

It is to be noted that the winutils.exe file is also required when working with Spark on Windows from PySpark. For PySpark, refer to Chapter 19, PySpark and SparkR.

Please note that the preceding solution also applies when debugging your applications. Sometimes, even if the preceding error occurs, your Spark application will run properly. However, if the dataset is large, it is likely that the error will occur.
