Installing Spark

Let's open a new browser tab, head to spark.apache.org, and click on the Download Spark button:

Now, we have used Spark 2.1.1 in this book, but anything beyond 2.0 should work just fine.

Make sure you get a prebuilt version, and select the Direct Download option; all of the defaults here are perfectly fine. Go ahead and click on the link next to instruction number 4 to download that package.

Now, it downloads a TGZ (Tar in GZip) file, which you might not be familiar with. Windows is kind of an afterthought with Spark quite honestly because on Windows, you're not going to have a built-in utility for actually decompressing TGZ files. This means that you might need to install one, if you don't have one already. The one I use is called WinRAR, and you can pick that up from www.rarlab.com. Go to the Downloads page if you need it, and download the installer for WinRAR 32-bit or 64-bit, depending on your operating system. Install WinRAR as normal, and that will allow you to actually decompress TGZ files on Windows:

So, let's go ahead and decompress the TGZ files. I'm going to open up my Downloads folder to find the Spark archive that we downloaded, and let's go ahead and right-click on that archive and extract it to a folder of my choosing - I'm just going to put it in my Downloads folder for now. Again, WinRAR is doing this for me at this point:
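If you would rather not install a separate tool, Python's standard tarfile module can also unpack a TGZ file. What follows is only a minimal sketch of that alternative, not the method used in this book; the function name extract_tgz and any paths you pass to it are assumptions made for illustration.

```python
import tarfile
from pathlib import Path


def extract_tgz(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive into dest_dir.

    Returns the sorted names of whatever ends up at the top level of
    dest_dir, so you can see the folder the archive created.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" tells tarfile to expect a gzip-compressed tar (a .tgz file).
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest)
    return sorted(entry.name for entry in dest.iterdir())
```

You would call it with the path of the archive you downloaded and the folder you want it unpacked into, and only on archives you trust, since extraction writes whatever files the archive contains.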

So, I should now have a folder in my Downloads folder associated with that package. Let's open that up and there is Spark itself. You should see something like the folder content shown below. So, you need to install that in some place that you can remember:

You don't want to leave it in your Downloads folder obviously, so let's go ahead and open up a new file explorer window here. I go to my C drive and create a new folder, and let's just call it spark. So, my Spark installation is going to live in C:\spark. Again, nice and easy to remember. Open that folder. Now, I go back to my downloaded spark folder and use Ctrl + A to select everything in the Spark distribution, Ctrl + C to copy it, and then go back to C:\spark, where I want to put it, and Ctrl + V to paste it in:

It is very important to paste the contents of the spark folder, not the spark folder itself. So, what I should have now is my C drive with a spark folder that contains all of the files and folders from the Spark distribution.
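The same copy can be scripted, which makes the "contents, not the folder itself" distinction explicit. This is a hedged sketch rather than a required step; copy_contents is a hypothetical helper, and it assumes Python 3.8 or later for the dirs_exist_ok argument.

```python
import shutil
from pathlib import Path


def copy_contents(src_dir, dest_dir):
    """Copy everything inside src_dir directly into dest_dir.

    The src_dir folder itself is not recreated inside dest_dir; only
    its files and subfolders are copied across.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for item in src.iterdir():
        target = dest / item.name
        if item.is_dir():
            # Merge into an existing folder instead of failing.
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)
    return sorted(entry.name for entry in dest.iterdir())
```

Here src_dir would be the extracted Spark folder in Downloads and dest_dir would be the spark folder on the C drive.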

Well, there are still a few things we need to configure. So, while we're in C:\spark let's open up the conf folder, and in order to make sure that we don't get spammed to death by log messages, we're going to change the logging level setting here. So to do that, right-click on the log4j.properties.template file and select Rename:

Delete the .template part of the filename to make it an actual log4j.properties file. Spark will use this to configure its logging:

Now, open this file in a text editor of some sort. On Windows, you might need to right-click there and select Open with and then WordPad. In the file, locate log4j.rootCategory=INFO:

Let's change this to log4j.rootCategory=ERROR and this will just remove the clutter of all the log spam that gets printed out when we run stuff. Save the file, and exit your editor.
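If you prefer to make that edit with a script instead of a text editor, the change amounts to a one-line substitution. The sketch below works on the text of the properties file; set_root_log_level is a hypothetical helper name, and it assumes the setting sits at the start of a line as log4j.rootCategory=INFO, optionally followed by an appender name.

```python
import re


def set_root_log_level(properties_text, level="ERROR"):
    """Return properties_text with the log4j.rootCategory level replaced.

    Only the level word right after the equals sign changes; anything
    that follows it on the same line is left untouched.
    """
    pattern = re.compile(r"^(log4j\.rootCategory\s*=\s*)\w+", re.MULTILINE)
    return pattern.sub(lambda match: match.group(1) + level, properties_text)
```

To apply it, read the log4j.properties file into a string, pass the string through this function, and write the result back to the same file.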

So far, we installed Python, Java, and Spark. Now the next thing we need to do is to install something that will trick your PC into thinking that Hadoop exists, and again this step is only necessary on Windows. So, you can skip this step if you're on Mac or Linux.

I have a little file available that will do the trick. Let's go to http://media.sundog-soft.com/winutils.exe. Downloading winutils.exe will give you a copy of a little snippet of an executable, which can be used to trick Spark into thinking that you actually have Hadoop:

Now, since we're going to be running our scripts locally on our desktop, it's not a big deal, we don't need to have Hadoop installed for real. This just gets around another quirk of running Spark on Windows. So, now that we have that, let's find it in the Downloads folder, Ctrl + C to copy it, and let's go to our C drive and create a place for it to live:

So, create a new folder again in the root C drive, and we will call it winutils:

Now let's open this winutils folder and create a bin folder inside it:

Now in this bin folder, I want you to paste the winutils.exe file we downloaded. So you should have C:\winutils\bin and then winutils.exe:

This next step is only required on some systems, but just to be safe, open Command Prompt on Windows. You can do that by going to your Start menu and going down to Windows System, and then clicking on Command Prompt. Here, I want you to type cd c:\winutils\bin, which is where we stuck our winutils.exe file. Now if you type dir, you should see that file there. Now type winutils.exe chmod 777 \tmp\hive. This just makes sure that all the file permissions you need to actually run Spark successfully are in place without any errors. You can close Command Prompt now that you're done with that step. Wow, we're almost done, believe it or not.
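For reference, the same permissions command can be assembled in Python, which is handy if you ever script this setup. This is only a sketch: winutils_chmod_command is a hypothetical helper, and its defaults assume winutils.exe lives in the bin folder under c:\winutils and that the target folder is \tmp\hive.

```python
import ntpath


def winutils_chmod_command(hadoop_home=r"c:\winutils", target=r"\tmp\hive"):
    """Build the argument list for the winutils chmod step.

    ntpath joins paths using Windows rules, so the result looks the
    same no matter which operating system builds it.
    """
    exe = ntpath.join(hadoop_home, "bin", "winutils.exe")
    return [exe, "chmod", "777", target]
```

On Windows, you could then run it with subprocess.run(winutils_chmod_command(), check=True) after importing subprocess; typing the command in Command Prompt does exactly the same thing.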

Now we need to set some environment variables for things to work. I'll show you how to do that on Windows. On Windows 10, you'll need to open up the Start menu and go to Windows System | Control Panel to open up Control Panel:

In Control Panel, click on System and Security:

Then, click on System:

Then click on Advanced system settings from the list on the left-hand side:

From here, click on Environment Variables...:

We will get these options:

Now, this is a very Windows-specific way of setting environment variables. On other operating systems, you'll use different processes, so you'll have to look at how to install Spark on them. Here, we're going to set up some new user variables. Click on the first New... button for a new user variable and call it SPARK_HOME, as shown below, all uppercase. This is going to point to where we installed Spark, which for us is c:\spark, so type that in as the Variable value and click on OK:

We also need to set up JAVA_HOME, so click on New... again and type in JAVA_HOME as Variable name. We need to point that to where we installed Java, which for us is c:\jdk:

We also need to set up HADOOP_HOME, and that's where we installed the winutils package, so we'll point that to c:\winutils:

So far, so good. The last thing we need to do is to modify our path. You should have a PATH environment variable here:

Click on the PATH environment variable, then on Edit..., and add a new path. This is going to be %SPARK_HOME%\bin, and I'm going to add another one, %JAVA_HOME%\bin:

Basically, this makes all the binary executables of Spark and Java available to Windows, wherever you're running them from. Click on OK on this menu and on the previous two menus. We finally have everything set up.
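A quick way to double-check these settings is to inspect the environment from Python in a freshly opened Command Prompt. The sketch below is an illustration rather than part of the installation; missing_spark_settings is a hypothetical helper, and it assumes Windows conventions (PATH entries separated by semicolons, and bin folders directly under SPARK_HOME and JAVA_HOME).

```python
import os

REQUIRED_VARS = ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")


def missing_spark_settings(env=None):
    """Return a list of problems found in the environment settings.

    An empty list means all three variables are set and PATH contains
    the bin folders for both Spark and Java.
    """
    env = os.environ if env is None else env
    problems = [name for name in REQUIRED_VARS if not env.get(name)]
    # Normalize PATH entries so the comparison ignores case and
    # trailing slashes, as Windows itself does.
    path_entries = [
        entry.strip().rstrip("\\/").lower()
        for entry in env.get("PATH", "").split(";")
    ]
    for name in ("SPARK_HOME", "JAVA_HOME"):
        home = env.get(name)
        if home:
            expected = (home.rstrip("\\/") + "\\bin").lower()
            if expected not in path_entries:
                problems.append(f"PATH is missing {name}\\bin")
    return problems
```

Calling missing_spark_settings() with no arguments checks the real environment. Keep in mind that a Command Prompt opened before you changed the variables will still see the old values, so open a new one first.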
