Getting data into Spark

  1. Next, load the KDD cup data into PySpark using sc, as shown in the following command:
raw_data = sc.textFile("./kddcup.data.gz")

  2. In the following command, we can see that the raw data is now in the raw_data variable:
raw_data

The output looks similar to the following:

./kddcup.data.gz MapPartitionsRDD[3] at textFile at NativeMethodAccessorImpl.java:0

Evaluating the raw_data variable shows that it references the kddcup.data.gz file and that it is a MapPartitionsRDD. Because Spark loads data lazily, no records have actually been read at this point; the file is only read when an action is executed on the RDD.

Now that we know how to load the data into Spark, let's learn about parallelization with Spark RDDs.
