Untyped dataset operations

Once the dataset has been created, Spark provides several handy functions for basic SQL-style operations and analysis, such as the following:

  • show(): This displays the top 20 rows of the dataset in a tabular form. Strings of more than 20 characters will be truncated, and all cells will be aligned right:
emp_ds.show();

Another variant of show() takes a Boolean that controls the 20-character limit; passing false disables truncation of strings:

emp_ds.show(false);
  • printSchema(): This function prints the schema to the console in a tree format:
emp_ds.printSchema();
  • select(): This selects the set of columns passed as arguments from the dataset:
emp_ds.select("empName", "empId").show();

After statically importing org.apache.spark.sql.functions.col, you can refine a column through the methods available on the Column that col() returns, such as creating an alias, incrementing the value of the column, casting to a specific datatype, and so on (DataTypes comes from org.apache.spark.sql.types):

emp_ds.select(
    col("empName").name("Employee Name"),
    col("empId").cast(DataTypes.IntegerType).name("Employee Id")).show();
  • filter(): This function selects the rows that meet the condition passed as its argument. The following sorts by empId, keeps the rows whose salary is greater than 2500, and displays the result:
emp_ds.sort(col("empId").asc()).filter(col("salary").gt(2500)).show();
  • groupBy(): This groups the dataset using the specified columns, so that one can run aggregation on them:
emp_ds.select("job").groupBy(col("job")).count().show();
  • join(): This joins the dataset with another dataset using the given join expression. The function accepts three parameters: the first is the other dataset; the second is the join expression, that is, the column condition on which the data will be joined; and the third is the joinType, such as inner, outer, left_outer, right_outer, and leftsemi. The following performs a left-outer join between emp_ds and deptDf ("left" is accepted as a synonym for "left_outer"):
emp_ds.as("A")
    .join(deptDf.as("B"), emp_ds.col("deptno").equalTo(deptDf.col("deptno")), "left")
    .select("A.empId", "A.empName", "A.job", "A.manager", "A.hiredate",
        "A.salary", "A.comm", "A.deptno", "B.dname", "B.loc")
    .show();
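
To make the left-outer semantics of the join above concrete, here is a minimal plain-Java sketch that deliberately does not use the Spark API. The class name LeftOuterJoinSketch, the leftOuterJoin helper, and the sample rows are all hypothetical, and the sketch assumes that each deptno appears at most once on the department side. Every employee row is kept, and an employee without a matching department gets null in place of dname, which mirrors how the right-hand columns are filled in a left-outer join:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeftOuterJoinSketch {

    /**
     * Illustrative helper, not part of Spark.
     * Each employee row is {empName, deptno}; deptNames maps deptno to dname.
     * Assumes deptno is unique on the department side, so every employee
     * produces exactly one output row {empName, deptno, dname}.
     */
    public static List<String[]> leftOuterJoin(List<String[]> emps,
                                               Map<String, String> deptNames) {
        List<String[]> out = new ArrayList<>();
        for (String[] emp : emps) {
            // get() returns null when no department matches; the row is still kept.
            String dname = deptNames.get(emp[1]);
            out.add(new String[] {emp[0], emp[1], dname});
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the dataset used in the text.
        List<String[]> emps = new ArrayList<>();
        emps.add(new String[] {"Allen", "30"});
        emps.add(new String[] {"Clark", "10"});
        emps.add(new String[] {"Nomad", "99"}); // no department 99 exists

        Map<String, String> deptNames = new HashMap<>();
        deptNames.put("10", "ACCOUNTING");
        deptNames.put("30", "SALES");
        deptNames.put("40", "OPERATIONS"); // no employees, so it never appears

        for (String[] row : leftOuterJoin(emps, deptNames)) {
            System.out.println(String.join(", ", row[0], row[1], String.valueOf(row[2])));
        }
    }
}
```

Running main prints three rows, the last being Nomad, 99, null; department 40, which has no employees, does not appear in the output. An inner join would instead drop the Nomad row.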