Creating a DataFrame from CSV

In this recipe, we'll look at how to create a new DataFrame from a delimiter-separated values file.

How to do it...

This recipe involves four steps:

  1. Add the spark-csv support to our project.
  2. Create a Spark configuration (SparkConf) object that gives information about the environment we are running Spark in.
  3. Create a Spark context that serves as an entry point into Spark. Then, we proceed to create an SQLContext from the Spark context.
  4. Load the CSV using the SQLContext.

Let's look at each of these steps in detail:

  1. CSV support isn't first-class in Spark, but it is available through an external library from Databricks. So, let's go ahead and add that dependency to our build.sbt.

    After adding the spark-csv dependency, our complete build.sbt looks like this:

    organization := "com.packt"
    
    name := "chapter1-spark-csv"
    
    scalaVersion := "2.10.4"
    
    val sparkVersion="1.4.1"
    
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-sql" % sparkVersion,
    "com.databricks" %% "spark-csv" % "1.0.3"
    )
  2. SparkConf holds all of the information required to run this Spark "cluster". For this recipe, we are running locally, and we intend to use only two cores on the machine, which is what local[2] means. More details about this can be found in the There's more… section of this recipe:
    import org.apache.spark.SparkConf

    val conf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

    Note

    When we say that the master of this run is "local," we mean that we are running Spark in standalone mode. We'll see what "standalone" mode means in the There's more… section.

  3. Initialize the Spark context with the Spark configuration. This is the core entry point for doing anything with Spark:
    import org.apache.spark.SparkContext
    val sc = new SparkContext(conf)

    The easiest way to query data in Spark is with SQL queries, and for that we need an SQLContext:

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
  4. Now, let's load our pipe-separated file. The resulting students value is of type org.apache.spark.sql.DataFrame:
    import com.databricks.spark.csv._
    val students = sqlContext.csvFile(filePath = "StudentData.csv", useHeader = true, delimiter = '|')
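
    To confirm that the file loaded correctly, we can inspect the DataFrame and query it. This is an optional sketch rather than part of the recipe; printSchema, show, and registerTempTable are standard DataFrame and SQLContext calls in Spark 1.4:

    // Column names are picked up from the header row of the pipe-separated file
    students.printSchema()

    // Display the first five rows in tabular form
    students.show(5)

    // Register the DataFrame as a temporary table and query it with SQL
    students.registerTempTable("students")
    sqlContext.sql("select * from students limit 5").show()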

How it works...

The csvFile function of sqlContext accepts the full filePath of the file to be loaded. If the CSV has a header, setting the useHeader flag to true reads the first row as the column names. The delimiter flag defaults to a comma, but you can override it with whatever character you need.
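
For a plain comma-separated file with a header row, we can lean on the default delimiter and pass just the path and the header flag. This is a small sketch; Students.csv is a hypothetical file name, not part of the recipe:

import com.databricks.spark.csv._

// The delimiter defaults to a comma, so it does not need to be specified here
val commaSeparated = sqlContext.csvFile(filePath = "Students.csv", useHeader = true)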

Instead of using the csvFile function, we could also use the load function available in SQLContext. The load function accepts the format of the file (in our case, CSV) and the options as a Map. We can specify the same parameters that we specified earlier using the Map, like this:

val options = Map("header" -> "true", "path" -> "ModifiedStudent.csv")

val newStudents = sqlContext.load("com.databricks.spark.csv", options)
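
If the file being loaded were pipe-separated, like our StudentData.csv, the delimiter could be passed through the same options Map. This is a sketch that assumes the delimiter option of the spark-csv version in use; the file name is the one from the earlier step:

// Pipe-separated variant of the load call
val pipeOptions = Map(
  "header" -> "true",
  "delimiter" -> "|",
  "path" -> "StudentData.csv"
)

val pipeStudents = sqlContext.load("com.databricks.spark.csv", pipeOptions)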

There's more…

As we saw earlier, we ran this Spark program in standalone mode. In standalone mode, the Driver program (the brain) and the Worker nodes all get crammed into a single JVM. In our example, we set master to local[2], which means that we intend to run Spark in standalone mode and ask it to use only two cores on the machine.

Spark can be run in three different modes:

  • Standalone
  • Standalone cluster, using its in-built cluster manager
  • Using external cluster managers, such as Apache Mesos and YARN

In Chapter 6, Scaling Up, we have dedicated explanations and recipes on how to run Spark on its in-built cluster manager, as well as on Mesos and YARN. In a clustered environment, Spark runs a Driver program along with a number of Worker nodes. As the name indicates, the Driver program houses the brain of the program, which is our main program. The Worker nodes hold the data and perform various transformations on it.
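
The mode is selected through the master URL passed to SparkConf.setMaster. The following is a minimal sketch of the common variants for Spark 1.4; the host names and ports are placeholders, not values from this recipe:

import org.apache.spark.SparkConf

// Standalone (local) mode: the Driver and Workers share a single JVM, using two cores
val localConf = new SparkConf().setAppName("csvDataFrame").setMaster("local[2]")

// Spark's in-built standalone cluster manager (placeholder host, default port)
val clusterConf = new SparkConf().setAppName("csvDataFrame").setMaster("spark://master-host:7077")

// External cluster managers such as Mesos or YARN
val mesosConf = new SparkConf().setAppName("csvDataFrame").setMaster("mesos://mesos-host:5050")
val yarnConf = new SparkConf().setAppName("csvDataFrame").setMaster("yarn-client")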
