Reading and manipulating raw text files

You can read a raw text data file using the textFile() method. Suppose you have the logs of some purchase:

number	product_name	transaction_id	website	price	date0	jeans	30160906182001	ebay.com	100	12-02-20161	camera	70151231120504	amazon.com	450	09-08-20172	laptop	90151231120504	ebay.ie	1500	07--5-20163	book	80151231120506	packt.com	45	03-12-20164	drone	8876531120508	alibaba.com	120	01-05-2017

Now reading and creating RDD is pretty straightforward using the textFile() method as follows:

myRDD = spark.sparkContext.textFile("sample_raw_file.txt")
$cd myRDD
$ cat part-00000
number product_name transaction_id website price date 0 jeans 30160906182001 ebay.com 100 12-02-20161 camera 70151231120504 amazon.com 450 09-08-2017

As you can see, the structure is not that readable. So we can think of giving a better structure by converting the texts as DataFrame. At first, we need to collect the header information as follows:

header = myRDD.first() 

Now filter out the header and make sure the rest looks correct as follows:

textRDD = myRDD.filter(lambda line: line != header)
newRDD = textRDD.map(lambda k: k.split("\t"))

We still have the RDD but with a bit better structure of the data. However, converting it into DataFrame will provide a better view of the transactional data.

The following code creates a DataFrame by specifying the header.split is providing the names of the columns:

 textDF = newRDD.toDF(header.split("\t"))
textDF.show()

The output is as follows:

Figure 10: Sample of the transactional data

Now you could save this DataFrame as a view and make a SQL query. Let's do a query with this DataFrame now:

textDF.createOrReplaceTempView("transactions")
spark.sql("SELECT * FROM transactions").show()
spark.sql("SELECT product_name, price FROM transactions WHERE price >=500 ").show()
spark.sql("SELECT product_name, price FROM transactions ORDER BY price DESC").show()

The output is as follows:

Figure 11: Query result on the transactional data using Spark SQL
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.225.95.216