Loading an RDD from a text file

I can also load an RDD from a text file, and that could be anywhere.

sc.textFile("file:///c:/users/frank/gobs-o-text.txt")  

In this example, I have a giant text file that's the entire encyclopedia or something. I'm reading it from my local disk here, but I could also use the s3n prefix if I wanted to host this file in a distributed Amazon S3 bucket, or the hdfs prefix if I wanted to refer to data stored on a distributed HDFS (Hadoop Distributed File System) cluster. When you're dealing with big data and working with a Hadoop cluster, that's usually where your data will live.
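Just to illustrate, here's a sketch of what those alternatives might look like; the bucket name and paths are hypothetical, and the exact S3 scheme you use (s3n, s3a, and so on) depends on your Spark and Hadoop versions:

sc.textFile("file:///c:/users/frank/gobs-o-text.txt")  # local disk
sc.textFile("s3n://my-bucket/gobs-o-text.txt")         # Amazon S3 bucket (hypothetical)
sc.textFile("hdfs:///user/frank/gobs-o-text.txt")      # HDFS cluster (hypothetical)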

That line of code will convert every line of that text file into its own row in an RDD. So, you can think of the RDD as a database of rows; in this example, it loads my text file into an RDD where every row contains one line of text. I can then do further processing on that RDD to parse the data or break it apart on its delimiters. But that's where I start from.
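For example, if the file happened to be comma-delimited, a map operation could split each line into its fields. This is just a sketch; the file name and the delimiter are assumptions:

lines = sc.textFile("file:///c:/users/frank/gobs-o-text.txt")
fields = lines.map(lambda line: line.split(","))  # one list of fields per row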

Remember when we talked about ETL and ELT earlier in the book? This is a good example of where you might load raw data into a system and do the transform on the same system you use to query your data. You can take raw text files that haven't been processed at all and use the power of Spark to transform them into more structured data.

Spark can also talk to things like Hive, so if you have an existing Hive database set up at your company, you can create a Hive context that's based on your Spark context. How cool is that? Take a look at this example code:

from pyspark.sql import HiveContext

hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT name, age FROM users")

You can create an RDD, in this case called rows, that's generated by executing a SQL query on your Hive database.
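From there, rows behaves like any other result you'd work with in Spark. As a sketch (assuming a users table with name and age columns actually exists in your Hive database), you could pull the results back to the driver and print them:

for row in rows.collect():
    print(row)  # each row carries the name and age columns from the query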
