Installing Spark on Windows

Getting Spark installed on Windows involves several steps that we'll walk you through here. I'm going to assume you're on Windows, since that's how most readers use this book at home; we'll talk a little bit about dealing with other operating systems in a moment. If you're already familiar with installing software and dealing with environment variables on your computer, you can just take the following little cheat sheet and go off and do it. If you're not so familiar with Windows internals, I will walk you through it one step at a time in the upcoming sections. Here are the quick steps for those Windows pros:

  1. Install a JDK: You need to first install a JDK, that's a Java Development Kit. You can just go to Oracle's website and download and install it if you need to. We need the JDK because, even though we're going to be developing in Python during this course, our Python code drives Spark's engine under the hood, and that engine is written natively in Scala. Scala, in turn, runs on top of the Java Virtual Machine (JVM). So, in order to run Python code, you need a Scala runtime, which comes bundled with Spark, and you need Java, or more specifically the JVM, to actually run that Scala code. It's like a technology layer cake. (There's a quick way to check your Java setup in the sketches after this list.)
  2. Install Python: Obviously you're going to need Python, but if you've gotten to this point in the book, you should already have a Python environment set up, hopefully with Enthought Canopy. So, we can skip this step.
  3. Install a prebuilt version of Spark for Hadoop: Fortunately, the Apache website makes prebuilt versions of Spark available, precompiled for the latest Hadoop version, that just run out of the box. You don't have to build anything; you can download one to your computer, stick it in the right place, and be good to go for the most part.
  4. Create a conf/log4j.properties file: We have a few configuration things to take care of. One thing we want to do is adjust our warning level so we don't get a bunch of warning spam when we run our jobs. We'll walk through how to do that. Basically, you need to rename one of the properties files, and then adjust the log level setting within it (see the sketches after this list).
  5. Add a SPARK_HOME environment variable: Next, we need to set up some environment variables to make sure that you can actually run Spark from whatever path you happen to be in. We're going to add a SPARK_HOME environment variable pointing to where you installed Spark, and then we will add %SPARK_HOME%\bin to your system PATH, so that when you run spark-submit, or pyspark, or whatever Spark command you need, Windows will know where to find it.
  6. Set a HADOOP_HOME variable: On Windows there's one more thing we need to do: we need to set a HADOOP_HOME variable as well, because Spark expects to find one little bit of Hadoop, even if you're not using Hadoop on your standalone system.
  7. Install winutils.exe: Finally, we need to install a file called winutils.exe, which supplies that little bit of Hadoop; it needs to live in a bin folder under HADOOP_HOME. There's a link to winutils.exe within the resources for this book, so you can get it there (see the sketches after this list).
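
For the Windows pros, here are a few quick sketches of those steps. First, to check that the JDK is installed and on your path, open a Command Prompt and run:

    java -version

If that prints a version string instead of an error, you're set on the Java front.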
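
For the log4j tweak in step 4, Spark ships a template file that you rename and then edit. Assuming you unpacked Spark to C:\spark (that path is just an example), from a Command Prompt:

    cd C:\spark\conf
    ren log4j.properties.template log4j.properties

Then open log4j.properties in a text editor and change the root logging level from INFO to ERROR, so the line reads:

    log4j.rootCategory=ERROR, console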
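
For steps 5 through 7, again assuming C:\spark as your install location and C:\winutils as a home for the Hadoop bits (both paths are examples, adjust to taste), you can set the variables from a Command Prompt with setx:

    setx SPARK_HOME C:\spark
    setx HADOOP_HOME C:\winutils
    rem winutils.exe itself should sit at C:\winutils\bin\winutils.exe

Add %SPARK_HOME%\bin to your PATH through the Environment Variables dialog in Control Panel rather than with setx, which can truncate a long PATH. Note that setx only affects newly opened Command Prompts, so open a fresh one before testing.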

If you want to walk through the steps in more detail, you can refer to the upcoming sections.
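
Once everything is in place, a quick way to prove the whole stack works is a tiny PySpark job. This is a minimal sketch, with no assumptions beyond the setup above; save it as smoke_test.py (any name will do) and run it from a fresh Command Prompt:

    # smoke_test.py -- confirm the JDK, Spark, and winutils are wired together.
    # Run with: spark-submit smoke_test.py
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("SmokeTest")
    sc = SparkContext(conf=conf)

    # Distribute a small list and collect it back; printing the numbers
    # 0 through 9 means the full Python-to-JVM pipeline is working.
    print(sc.parallelize(range(10)).collect())

    sc.stop()

If you see that list of numbers amid the log output, congratulations: Spark is installed.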
