Using the Spark shell to run logistic regression

When you run a command and have not specified a left-hand side (that is, leaving out the val x of val x = y), the Spark shell prints the value and binds it to a variable of the form res[number]. The res[number] variable can then be used as if we had written val res[number] = y.
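
For example, in a fresh shell:

scala> 8 * 5
res0: Int = 40

scala> res0 + 2
res1: Int = 42

Now that you have the data in a more usable format, try to do something cool with it! Use Spark to run logistic regression over the dataset as follows: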

scala> import spark.util.Vector
import spark.util.Vector

scala> case class DataPoint(x: Vector, y: Double)
defined class DataPoint

scala> def parsePoint(x: Array[Double]): DataPoint = {
        // All but the last value are features; the last value is the label.
        DataPoint(new Vector(x.slice(0, x.size - 1)), x(x.size - 1))
      }
parsePoint: (x: Array[Double])this.DataPoint

scala> val points = nums.map(parsePoint(_))
points: spark.RDD[this.DataPoint] = MappedRDD[3] at map at <console>:24

scala> import java.util.Random
import java.util.Random

scala> val rand = new Random(53)
rand: java.util.Random = java.util.Random@3f4c24

scala> var w = Vector(nums.first.size - 1, _ => rand.nextDouble)
13/03/31 00:57:30 INFO spark.SparkContext: Starting job: first at <console>:20
...
13/03/31 00:57:30 INFO spark.SparkContext: Job finished: first at <console>:20, took 0.01272858 s
w: spark.util.Vector = (0.7290865701603526, 0.8009687428076777, 0.6136632797111822, 0.9783178194773176, 0.3719683631485643, 0.46409291255379836, 0.5340172959927323, 0.04034252433669905, 0.3074428389716637, 0.8537414030626244, 0.8415816118493813, 0.719935849109521, 0.2431646830671812, 0.17139348575456848, 0.5005137792223062, 0.8915164469396641, 0.7679331873447098, 0.7887571495335223, 0.7263187438977023, 0.40877063468941244, 0.7794519914671199, 0.1651264689613885, 0.1807006937030201, 
0.3227972103818231, 0.2777324549716147, 0.20466985600105037, 0.5823059390134582, 0.4489508737465665, 0.44030858771499415, 0.6419366305419459, 0.5191533842209496, 0.43170678028084863, 0.9237523536173182, 0.5175019655845213, 0.47999523211827544, 0.25862648071479444, 0.020548000101787922, 0.18555332739714137, 0....

scala> val iterations = 100
iterations: Int = 100

scala> import scala.math._
import scala.math._

scala> for (i <- 1 to iterations) {
        val gradient = points.map(p =>
          (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
        ).reduce(_ + _)
        w -= gradient
      }
[....]
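
Each iteration maps over all of the points to compute the gradient of the logistic loss and then sums the per-point contributions with reduce. For a point with features x and label y, the contribution is (1 / (1 + e^(-y * (w · x))) - 1) * y * x, and w is moved against the summed gradient. To keep the example simple, the step size is fixed at 1 rather than being a tunable learning rate.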

scala> w
res27: spark.util.Vector = (0.2912515190246098, 1.05257972144256, 1.1620192443948825, 0.764385365541841, 1.3340446477767611, 0.6142105091995632, 0.8561985593740342, 0.7221556020229336, 0.40692442223198366, 0.8025693176035453, 0.7013618380649754, 0.943828424041885, 0.4009868306348856, 0.6287356973527756, 0.3675755379524898, 1.2488466496117185, 0.8557220216380228, 0.7633511642942988, 6.389181646047163, 1.43344096405385, 1.729216408954399, 0.4079709812689015, 0.3706358251228279, 0.8683036382227542, 0.36992902312625897, 0.3918455398419239, 0.2840295056632881, 0.7757126171768894, 0.4564171647415838, 0.6960856181900357, 0.6556402580635656, 0.060307680034745986, 0.31278587054264356, 0.9273189009376189, 0.0538302050535121, 0.545536066902774, 0.9298009485403773, 0.922750704590723, 0.072339496591...

If things went well, you just used Spark to run logistic regression. Awesome! We have done a number of things: we defined a class, created an RDD, and created a function. As you can see, the Spark shell is quite powerful. Much of this power comes from it being based on the Scala REPL (Read-Evaluate-Print Loop), the Scala interactive shell, so it inherits all of the Scala REPL's capabilities. That being said, most of the time you will probably want to work with more traditional compiled code rather than in the REPL environment.
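
As a preview of that workflow, here is a minimal sketch of what this session might look like as a standalone compiled program. It assumes the same pre-1.0 Spark API used in this chapter; the object name, master URL ("local"), and input path ("data.csv") are illustrative placeholders rather than part of the original example:

import spark.SparkContext
import spark.util.Vector

import java.util.Random
import scala.math.exp

// Hypothetical standalone version of the shell session above.
object LogisticRegressionJob {
  case class DataPoint(x: Vector, y: Double)

  // All but the last value are features; the last value is the label.
  def parsePoint(x: Array[Double]): DataPoint =
    DataPoint(new Vector(x.slice(0, x.size - 1)), x(x.size - 1))

  def main(args: Array[String]) {
    // "local" and "data.csv" are placeholders for your master and input.
    val sc = new SparkContext("local", "LogisticRegressionJob")
    val nums = sc.textFile("data.csv").map(_.split(',').map(_.toDouble))
    val points = nums.map(parsePoint).cache()

    val rand = new Random(53)
    var w = Vector(points.first.x.length, _ => rand.nextDouble)

    // Gradient descent on the logistic loss, as in the shell session.
    for (i <- 1 to 100) {
      val gradient = points.map(p =>
        (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
      ).reduce(_ + _)
      w -= gradient
    }
    println("Final w: " + w)
  }
}

A compiled job like this can be packaged and submitted to a cluster and rerun reproducibly, rather than typed in line by line.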
