Dataframe

Although it evolved from SchemaRDD, dataframe differs significantly from RDD. As part of Project Tungsten, it allows data to be stored off-heap in a binary format, which helps reduce garbage collection overhead.

A dataframe always carries a schema along with its data (the Spark framework manages the schema and passes only the data between nodes). Since Spark understands the schema and the data is stored off-heap in binary format, there is no need to encode the data using Java (or Kryo) serialization, which reduces the serialization effort as well.

The dataframe API also introduced the concept of building optimized query plans, which are produced by the Spark Catalyst optimizer. Since a dataframe, like an RDD, is lazily evaluated, an optimized query plan can be constructed before execution. In the end, the operations still run on RDDs, but this is transparent to the user. Consider the following query:

select a.name, b.title from a, b where a.id = b.id and b.title = 'Architect'

Here, instead of joining the tables and then filtering the rows by title, the optimized query plan filters the data first and then performs the join. This reduces the amount of data shuffled across the network.
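
To see this in action, a dataframe's logical and physical plans can be printed with explain(). The following is a minimal sketch, assuming a SQLContext named sqlContext and dataframes already registered as temporary tables a and b:

// Sketch only: assumes dataframes for a and b have been registered
// via registerTempTable("a") and registerTempTable("b") (Spark 1.6 API).
DataFrame joined = sqlContext.sql(
    "select a.name, b.title from a, b "
    + "where a.id = b.id and b.title = 'Architect'");

// Prints the parsed, analyzed, optimized, and physical plans; in the
// optimized plan the filter on b.title appears below (before) the join.
joined.explain(true);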

Despite all these advancements, dataframe has some shortcomings: it does not provide compile-time type safety. Consider the following dataframe example from Spark 1.6:

// jsc is an existing JavaSparkContext; Employee is a serializable JavaBean
JavaRDD<Employee> empRDD = jsc.parallelize(Arrays.asList(
        new Employee("Foo", 1), new Employee("Bar", 2)));
SQLContext sqlContext = new SQLContext(jsc);

// Derive the schema from the Employee bean class
DataFrame df = sqlContext.createDataFrame(empRDD, Employee.class);

Here, we created a dataframe of Employee objects using sqlContext. The constructor of Employee accepts two parameters: a name and an id.
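
Because the schema always travels with the data, it can be inspected directly. A quick check (the output shown in the comments is indicative of what a bean-derived schema looks like):

// Print the schema Spark inferred from the Employee bean
df.printSchema();
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)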

A JavaBean must provide a default constructor and implement Serializable to be used as an element of an RDD, dataframe, or dataset.
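
The Employee class itself is not shown here; a minimal sketch of such a bean could look like this:

import java.io.Serializable;

public class Employee implements Serializable {
    private String name;
    private int id;

    // Default constructor, required for bean introspection
    public Employee() {}

    public Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
}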

Notice that DataFrame carries no information about the type of its rows. To filter elements, a filter operation can be executed on the DataFrame as follows:

DataFrame filter = df.filter("id > 1");

Here, the compiler has no way to verify whether the column name specified in the filter condition actually exists. If it does not, the query fails with an exception at runtime.
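
For example, the following compiles without complaint but fails when executed, because Employee has no salary column (salary is a hypothetical name used only for illustration):

// Compiles fine, but throws an exception at runtime:
// the column "salary" does not exist in the schema.
DataFrame bad = df.filter("salary > 1000");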

Another disadvantage of dataframe is that it is Scala-centric; not all of its APIs are available in Java or Python.
