Creating an RDD using a Python list

The following is a very simple example:

nums = sc.parallelize([1, 2, 3, 4])

If I just want to make an RDD out of a plain old Python list, I can call the parallelize() method on the SparkContext object (conventionally named sc). That converts a list of stuff, in this case just the numbers 1, 2, 3, 4, into an RDD object called nums.

That is the simplest way to create an RDD: straight from a hard-coded list of stuff. The list could come from anywhere, and it doesn't have to be hard-coded, but building an RDD this way kind of defeats the purpose of big data. I mean, if I have to load the entire dataset into memory before I can create an RDD from it, what's the point?
