Unified dataframe and dataset API

With the evolution of Spark 2.x, the untyped DataFrame API and the typed Dataset API have been merged into a single Dataset API. Fundamentally, a dataframe is now just an alias for a dataset of generic Row objects, where Row is a generic untyped JVM object. In Java, a dataframe is therefore represented as Dataset&lt;Row&gt;. The unification of dataframe with dataset allows us to operate upon the data in multiple ways, such as:
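The snippets that follow all filter a typed dataset named dsEmp. As a point of reference, here is a minimal sketch of how such a Dataset&lt;Employee&gt; might be created; the Employee bean, its fields, and the sample rows are assumptions for illustration, not part of the original text:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EmployeeDatasetExample {

  // Simple JavaBean; Spark derives the dataset schema from its getters/setters.
  // The empId field matches the column name used in the filter examples.
  public static class Employee implements Serializable {
    private int empId;
    private String name;

    public Employee() {}
    public Employee(int empId, String name) {
      this.empId = empId;
      this.name = name;
    }
    public int getEmpId() { return empId; }
    public void setEmpId(int empId) { this.empId = empId; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("UnifiedDatasetApi")
        .master("local[*]")
        .getOrCreate();

    // Typed Dataset<Employee>; the bean encoder maps fields to columns.
    Dataset<Employee> dsEmp = spark.createDataset(
        Arrays.asList(new Employee(1, "Alice"), new Employee(2, "Bob")),
        Encoders.bean(Employee.class));

    dsEmp.show();
    spark.stop();
  }
}
```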

  • Using strongly typed APIs: Java functions provide type safety while working on a dataset; this is also referred to as a typed transformation:
dsEmp.filter(new FilterFunction<Employee>() {
@Override
public boolean call(Employee emp) throws Exception {
return emp.getEmpId() > 1;
}
}).show();

//Using a Java 8 lambda; the cast resolves the overloaded filter() method
dsEmp.filter((FilterFunction<Employee>) emp -> emp.getEmpId() > 1).show();
  • Using untyped APIs: An untyped transformation refers to columns by their SQL-like string names:
dsEmp.filter("empId > 1").show();
  • Using DSL: Dataframes/datasets provide a domain-specific language for structured data manipulation. In Java, one can use the function col() by statically importing it from the Spark SQL package org.apache.spark.sql.functions; it turns an untyped column name into a typed Column object:
import static org.apache.spark.sql.functions.col;

dsEmp.filter(col("empId").gt(1)).show();
Figure 8.1: Unified Dataset API in Spark 2.x
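The unification also makes it easy to move between the typed and untyped views of the same data. A minimal sketch, assuming the Dataset&lt;Employee&gt; dsEmp used above and a bean-encodable Employee class with an empId field:

```java
// Untyped view: a dataframe is just Dataset<Row>.
Dataset<Row> dfEmp = dsEmp.toDF();
dfEmp.filter("empId > 1").show();

// Back to the typed view via a bean encoder.
Dataset<Employee> typedAgain = dfEmp.as(Encoders.bean(Employee.class));
typedAgain.filter((FilterFunction<Employee>) e -> e.getEmpId() > 1).show();
```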