Unified dataframe and dataset API

With the evolution of Spark 2.x, the untyped DataFrame API and the typed Dataset API have been merged into a single Dataset API. Fundamentally, a dataframe is now just an alias for a dataset of generic Row objects, where Row is a generic untyped JVM object. In Java, a dataframe is therefore represented as Dataset&lt;Row&gt;. The unification of dataframe with dataset allows us to operate upon the data in multiple ways, such as:
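The snippets that follow all filter a typed dataset named dsEmp. As a point of reference, here is a minimal sketch of how such a Dataset&lt;Employee&gt; might be created; the Employee bean, its fields, and the sample rows are assumptions for illustration, not part of the original text:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EmployeeDatasetExample {

  // Simple JavaBean; Spark derives the dataset schema from its getters/setters.
  // The empId field matches the column name used in the filter examples.
  public static class Employee implements Serializable {
    private int empId;
    private String name;

    public Employee() {}
    public Employee(int empId, String name) {
      this.empId = empId;
      this.name = name;
    }
    public int getEmpId() { return empId; }
    public void setEmpId(int empId) { this.empId = empId; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("UnifiedDatasetApi")
        .master("local[*]")
        .getOrCreate();

    // Typed Dataset<Employee>; the bean encoder maps fields to columns.
    Dataset<Employee> dsEmp = spark.createDataset(
        Arrays.asList(new Employee(1, "Alice"), new Employee(2, "Bob")),
        Encoders.bean(Employee.class));

    dsEmp.show();
    spark.stop();
  }
}
```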

  • Using strongly typed APIs: Java functions provide type safety while working on a dataset; this is also referred to as a typed transformation:
dsEmp.filter(new FilterFunction<Employee>() {
@Override
public boolean call(Employee emp) throws Exception {
return emp.getEmpId() > 1;
}
}).show();

//Using a Java 8 lambda; the cast resolves the overloaded filter() method
dsEmp.filter((FilterFunction<Employee>) emp -> emp.getEmpId() > 1).show();
  • Using untyped APIs: An untyped transformation refers to columns by their SQL-like string names:
dsEmp.filter("empId > 1").show();
  • Using DSL: Dataframes/datasets provide a domain-specific language for structured data manipulation. In Java, one can use the function col() by statically importing it from the Spark SQL package org.apache.spark.sql.functions; it turns an untyped column name into a typed Column object:
import static org.apache.spark.sql.functions.col;

dsEmp.filter(col("empId").gt(1)).show();
Figure 8.1: Unified Dataset API in Spark 2.x
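The unification also makes it easy to move between the typed and untyped views of the same data. A minimal sketch, assuming the Dataset&lt;Employee&gt; dsEmp used above and a bean-encodable Employee class with an empId field:

```java
// Untyped view: a dataframe is just Dataset<Row>.
Dataset<Row> dfEmp = dsEmp.toDF();
dfEmp.filter("empId > 1").show();

// Back to the typed view via a bean encoder.
Dataset<Employee> typedAgain = dfEmp.as(Encoders.bean(Employee.class));
typedAgain.filter((FilterFunction<Employee>) e -> e.getEmpId() > 1).show();
```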