Actions

You can also perform actions on an RDD when you want to actually get a result. Here are some examples of what you can do:

  • collect(): You can call collect() on an RDD, which gives you back a plain old Python list that you can then iterate through to print out the results, save them to a file, or whatever you want to do.
  • count(): You can also call count(), which will force it to actually go count how many entries are in the RDD at this point.
  • countByValue(): This function will give you a breakdown of how many times each unique value within that RDD occurs.
  • take(): You can also peek into the RDD using take(), which returns the first few entries from it; take(n) gives you the first n entries.
  • top(): top() will give you the largest few entries in that RDD if you just want to get a little peek into what's in there for debugging purposes.
  • reduce(): The most powerful action is reduce(), which lets you define a function for combining all the values in the RDD into a single result. You can also use RDDs in the context of key-value data; there, the related reduceByKey() lets you define a way of combining together all the values for a given key. It is very much similar in spirit to MapReduce: reduce() is basically the analogous operation to a reducer in MapReduce, and map() is analogous to a mapper. So, it's often very straightforward to actually take a MapReduce job and convert it to Spark by using these functions. A short sketch of all these actions follows this list.
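
Here is a minimal PySpark sketch of all of these actions in one place. The data and the local SparkContext setup are just illustrative assumptions, not anything from a real job:

    from pyspark import SparkContext

    sc = SparkContext("local", "ActionsDemo")
    rdd = sc.parallelize([1, 2, 2, 3, 3, 3])

    print(rdd.collect())        # everything, back as a Python list: [1, 2, 2, 3, 3, 3]
    print(rdd.count())          # total number of entries: 6
    print(rdd.countByValue())   # occurrences of each unique value: 1->1, 2->2, 3->3
    print(rdd.take(2))          # the first two entries: [1, 2]
    print(rdd.top(2))           # the two largest entries: [3, 3]
    print(rdd.reduce(lambda a, b: a + b))  # all values combined into one: 14

    # For key-value RDDs, reduceByKey() combines the values for each key:
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    print(pairs.reduceByKey(lambda a, b: a + b).collect())  # [('a', 4), ('b', 2)] (order may vary)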

Remember, too, that nothing actually happens in Spark until you call an action. Once you call one of those action methods, Spark goes out and does its magic with directed acyclic graphs, and computes the optimal way to get the answer you want. But nothing really occurs until that action happens. That can sometimes trip you up when you're writing Spark scripts: you might have a little print statement in there and expect to see an answer, but it won't actually appear until an action is performed.
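
To see this lazy evaluation for yourself, here is a small sketch (again, the data is just an assumption for illustration, reusing the SparkContext from the earlier example):

    lengths = sc.parallelize(["spark", "is", "lazy"]).map(len)  # defines a transformation; nothing runs yet
    print(lengths)            # prints an RDD description, not your data
    print(lengths.collect())  # the action - only now does Spark compute [5, 2, 4]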

That is Spark 101 in a nutshell. Those are the basics you need for Spark programming: what an RDD is, and what you can do to an RDD. Once you get those concepts, you can write some Spark code. Let's change tack now and talk about MLlib, and some specific features in Spark that let you do machine learning algorithms using Spark.
