Getting started

Let's start by loading, parsing, and viewing some simple flight data. First, download the NYC flights dataset as a CSV file from https://s3-us-west-2.amazonaws.com/sparkr-data/nycflights13.csv. Then load and parse the dataset using R's read.csv() function:

# Creating an R data frame from the downloaded CSV file
dataPath <- "C:/Exp/nycflights13.csv"
df <- read.csv(file = dataPath, header = TRUE, sep = ",")

Now let's view the contents of the dataset using R's View() function as follows:

View(df)

Figure 20: A snapshot of the NYC flight dataset
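
View() opens a viewer in an interactive session such as RStudio. In a plain script, str(), head(), and dim() give much the same information. The following is a minimal sketch only: the inline CSV text is a small made-up sample standing in for the real file, and its values are purely illustrative.

```r
# A tiny made-up sample standing in for nycflights13.csv (illustrative values only)
csvText <- "year,month,day,dep_delay,carrier,origin,dest
2013,1,1,2,UA,EWR,IAH
2013,1,1,4,UA,LGA,IAH
2013,1,1,-1,AA,JFK,MIA"

# read.csv() can parse inline text through its 'text' argument
sampleDF <- read.csv(text = csvText, header = TRUE, sep = ",")

str(sampleDF)          # column names, types, and the first few values
head(sampleDF, n = 2)  # first two rows
dim(sampleDF)          # number of rows and columns
```

The same three calls can be applied to df once the full file has been loaded.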

Now let's create a Spark DataFrame from the R data frame as follows:

# Converting the R data frame to a Spark DataFrame
flightDF <- as.DataFrame(df)
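
as.DataFrame() needs an active Spark session. If one has not been started yet, it can be created with sparkR.session(). The following is a minimal sketch, assuming the SparkR package is installed and Spark is available locally. The master URL, the application name, and the use of R's built-in faithful dataset are assumptions made for illustration.

```r
library(SparkR)

# Start (or reuse) a Spark session; "local[*]" and the app name are illustrative choices
sparkR.session(master = "local[*]", appName = "FlightDataExample")

# Convert a small built-in R data frame to a Spark DataFrame
localDF <- faithful
sparkDF <- as.DataFrame(localDF)

class(sparkDF)  # a SparkDataFrame, not a data.frame
nrow(sparkDF)   # row count, computed by Spark
```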

Let's see the structure by exploring the schema of the DataFrame:

printSchema(flightDF)

The output is as follows:

Figure 21: The schema of the NYC flight dataset

Now let's see the first 10 rows of the DataFrame:

showDF(flightDF, numRows = 10)

The output is as follows:

Figure 22: The first 10 rows of the NYC flight dataset

As you can see, the Spark DataFrame has the same structure as the R data frame. However, this approach is not scalable, because the CSV file was first loaded into local memory using the standard R API. To make loading faster and more scalable, as in Scala, we can use the external data source APIs.
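
One way this might look is with SparkR's read.df(), which hands the parsing to Spark's CSV data source instead of reading the file into R first. This is a sketch under assumptions: it presumes Spark 2.x or later with SparkR installed, reuses the dataPath shown earlier, and the header and inferSchema options are choices made here for illustration rather than settings taken from the text.

```r
library(SparkR)

# Start (or reuse) a Spark session with default settings
sparkR.session()

# Same file location as before
dataPath <- "C:/Exp/nycflights13.csv"

# Let Spark read the CSV directly; options are passed as strings
flightDF <- read.df(dataPath, source = "csv",
                    header = "true", inferSchema = "true")

printSchema(flightDF)
showDF(flightDF, numRows = 10)
```

With this route the file is never materialized as a local R data frame, so its size is not bounded by the memory of the R process.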
