Reading a dataset in LIBSVM format

To read data in LIBSVM format, use the read API together with the load() method and specify the format of the data (that is, libsvm) as follows:

# Creating DataFrame from libsvm dataset
myDF = spark.read.format("libsvm").load("C:/Exp/mnist.bz2")

The MNIST dataset used in the preceding code can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist.bz2. The load() call returns a DataFrame, and its content can be displayed by calling the show() method as follows:

myDF.show() 

The output is as follows:

Figure 7: A snapshot of the handwritten digits dataset in LIBSVM format
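
To make the format itself more concrete, here is a minimal pure-Python sketch of how a single LIBSVM line (a label followed by index:value pairs) maps to a label and a sparse feature vector. The helper name parse_libsvm_line and the sample line are illustrative assumptions and not part of the Spark API; Spark's reader does equivalent work internally, including converting the one-based indices in the file to zero-based ones.

```python
def parse_libsvm_line(line):
    """Parse one LIBSVM line of the form '<label> <index>:<value> ...'.

    Hypothetical helper for illustration only; not a Spark function.
    """
    parts = line.strip().split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # file indices are one-based
        values.append(float(value))
    return label, indices, values


# Made-up sample line: label 5 with three non-zero features
sample = "5 153:3 154:18 155:18"
label, indices, values = parse_libsvm_line(sample)
print(label, indices, values)
```

Only the non-zero features appear in the file, which is why the format suits sparse data such as image pixels.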

You can also specify other options, such as the number of features (that is, the size of the feature vectors) that the resulting DataFrame should use, as follows:

# Parentheses let the chained calls span several lines
myDF = (spark.read.format("libsvm")
        .option("numFeatures", "780")
        .load("data/Letterdata_libsvm.data"))
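
As a rough illustration of why this option matters: when numFeatures is not given, the reader has to work out the vector size from the data itself, which costs an additional pass over the file. The plain-Python sketch below mimics that idea; infer_num_features and the sample lines are hypothetical, and this is not Spark's actual implementation.

```python
def infer_num_features(lines):
    """Return the largest one-based feature index found in LIBSVM lines.

    Hypothetical helper for illustration only; not a Spark function.
    """
    max_index = 0
    for line in lines:
        # Skip the label, then inspect each index:value pair
        for item in line.split()[1:]:
            index = int(item.split(":")[0])
            max_index = max(max_index, index)
    return max_index


# Made-up sample lines
lines = ["1 2:0.5 7:1.0", "0 3:2.0 16:0.25"]
inferred = infer_num_features(lines)
print(inferred)
```

Setting numFeatures explicitly avoids this extra scan, and it also helps keep vector sizes consistent when several files (for example, separate training and test files) are loaded independently.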

Now if you want to create an RDD from the same dataset, you can use the MLUtils API from pyspark.mllib.util as follows:

# Creating RDD from the libsvm data file
from pyspark.mllib.util import MLUtils
myRDD = MLUtils.loadLibSVMFile(spark.sparkContext, "data/Letterdata_libsvm.data")

Now you can save the RDD to your preferred location as follows (note that saveAsTextFile() writes a directory of part files rather than a single file):

myRDD.saveAsTextFile("data/myRDD")