Creating a DataFrame from CSV

In this recipe, we'll look at how to create a new DataFrame from a delimiter-separated values file.

How to do it...

This recipe involves four steps:

  1. Add the spark-csv support to our project.
  2. Create a Spark configuration (SparkConf) object that gives information about the environment we are running Spark in.
  3. Create a Spark context that serves as an entry point into Spark. Then, we proceed to create an SQLContext from the Spark context.
  4. Load the CSV using the SQLContext.

Let's look at each of these steps in detail:

  1. CSV support isn't first-class in Spark, but it is available through an external library from Databricks. So, let's go ahead and add that dependency to our build.sbt.

    After adding the spark-csv dependency, our complete build.sbt looks like this:

    organization := "com.packt"
    
    name := "chapter1-spark-csv"
    
    scalaVersion := "2.10.4"
    
    val sparkVersion="1.4.1"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
    "com.databricks" %% "spark-csv" % "1.0.3"
    )
  2. SparkConf holds all of the information required to run this Spark "cluster". For this recipe, we are running locally, and we intend to use only two cores on the machine, which is what local[2] means. More details about this can be found in the There's more… section of this recipe:
    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

    Note

    When we say that the master of this run is "local," we mean that we are running Spark in standalone mode. We'll see what "standalone" mode means in the There's more… section.

  3. Initialize the Spark context with the Spark configuration. This is the core entry point for doing anything with Spark:
    import org.apache.spark.SparkContext
    val sc = new SparkContext(conf)

    The easiest way to query data in Spark is with SQL queries, and for that we need an SQLContext:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
  4. Now, let's load our pipe-separated file. The resulting students value is of type org.apache.spark.sql.DataFrame:
    import com.databricks.spark.csv._
    val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
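
    To confirm that the file loaded correctly, we can inspect the DataFrame and query it. This is an optional sketch rather than part of the recipe; printSchema, show, and registerTempTable are standard DataFrame and SQLContext calls in Spark 1.4:

    // Column names are picked up from the header row of the pipe-separated file
    students.printSchema()

    // Display the first five rows in tabular form
    students.show(5)

    // Register the DataFrame as a temporary table and query it with SQL
    students.registerTempTable("students")
    sqlContext.sql("select * from students limit 5").show()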

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, setting the useHeader flag to true reads the first row as the column names. The delimiter flag defaults to a comma, but you can override it with whatever character you need.
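
For a plain comma-separated file with a header row, we can lean on the default delimiter and pass just the path and the header flag. This is a small sketch; Students.csv is a hypothetical file name, not part of the recipe:

import com.databricks.spark.csv._

// The delimiter defaults to a comma, so it does not need to be specified here
val commaSeparated = sqlContext.csvFile(filePath = "Students.csv", useHeader = true)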

Instead of using the csvFile function, we could also use the load function available in SQLContext. The load function accepts the format of the file (in our case, CSV) and the options as a Map. We can specify the same parameters that we specified earlier using the Map, like this:

val options = Map("header" -> "true", "path" -> "ModifiedStudent.csv")

val newStudents = sqlContext.load("com.databricks.spark.csv", options)
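
If the file being loaded were pipe-separated, like our StudentData.csv, the delimiter could be passed through the same options Map. This is a sketch that assumes the delimiter option of the spark-csv version in use; the file name is the one from the earlier step:

// Pipe-separated variant of the load call
val pipeOptions = Map(
  "header" -> "true",
  "delimiter" -> "|",
  "path" -> "StudentData.csv"
)

val pipeStudents = sqlContext.load("com.databricks.spark.csv", pipeOptions)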

There's more…

As we saw earlier, we ran this Spark program in standalone mode. In standalone mode, the Driver program (the brain) and the Worker nodes all get crammed into a single JVM. In our example, we set master to local[2], which means that we intend to run Spark in standalone mode and ask it to use only two cores on the machine.

Spark can be run in three different modes:

  • Standalone
  • Standalone cluster, using its in-built cluster manager
  • Using external cluster managers, such as Apache Mesos and YARN

In Chapter 6, Scaling Up, we have dedicated explanations and recipes on how to run Spark on its in-built cluster manager, as well as on Mesos and YARN. In a clustered environment, Spark runs a Driver program along with a number of Worker nodes. As the name indicates, the Driver program houses the brain of the program, which is our main program. The Worker nodes hold the data and perform various transformations on it.
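
The mode is selected through the master URL passed to SparkConf.setMaster. The following is a minimal sketch of the common variants for Spark 1.4; the host names and ports are placeholders, not values from this recipe:

import org.apache.spark.SparkConf

// Standalone (local) mode: the Driver and Workers share a single JVM, using two cores
val localConf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

// Spark's in-built standalone cluster manager (placeholder host, default port)
val clusterConf = new SparkConf().setAppName("csvDataFrame").setMaster("spark://master-host:7077")

// External cluster managers such as Mesos or YARN
val mesosConf = new SparkConf().setAppName("csvDataFrame").setMaster("mesos://mesos-host:5050")
val yarnConf = new SparkConf().setAppName("csvDataFrame").setMaster("yarn-client")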
