In Chapter 5, Working with Data and Storage, we read a CSV file using SparkSession in the form of a Java RDD. This time, we will read the CSV file as a dataset. Suppose you have a CSV file with the following content:
emp_id,emp_name,emp_dept
1,Foo,Engineering
2,Bar,Admin
The SparkSession can be used to read this CSV file as follows:
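// Backslashes in a Java string literal must be escaped as \\
Dataset<Row> csv = sparkSession.read()
    .format("csv")
    .option("header", "true") // treat the first line as column names
    .load("C:\\Users\\sgulati\\Documents\\my_docs\\book\\testdata\\emp.csv");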
Just as the collect() function can be used to inspect the content of an RDD, a dataset provides the show() function, which prints the content of the dataset (the first 20 rows by default) to the console:
csv.show();
Executing this function prints the content of the CSV file along with its header, in a layout that resembles a relational table. The content of a dataset can be transformed or filtered using Spark SQL, which will be discussed in the next sections.
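The steps in this section can be put together as a small standalone program. The following is a minimal sketch, not a definitive implementation. It assumes the Spark SQL library is on the classpath and that Spark runs in local mode. The class name CsvDatasetExample, the helper readCsv(), the application name, and the temporary file are all illustrative choices made here so that the example does not depend on a machine-specific path:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvDatasetExample {

    // Hypothetical helper: reads a CSV file whose first line is a header row.
    static Dataset<Row> readCsv(SparkSession sparkSession, String path) {
        return sparkSession.read()
                .format("csv")
                .option("header", "true") // use the first line as column names
                .load(path);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: write the sample data to a temporary file instead of
        // relying on a fixed location on disk.
        Path csvPath = Files.createTempFile("emp", ".csv");
        Files.write(csvPath, Arrays.asList(
                "emp_id,emp_name,emp_dept",
                "1,Foo,Engineering",
                "2,Bar,Admin"), StandardCharsets.UTF_8);

        // Assumption: local mode, using all available cores.
        SparkSession sparkSession = SparkSession.builder()
                .appName("CsvDatasetExample")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> csv = readCsv(sparkSession, csvPath.toUri().toString());

        // Prints the dataset as a table:
        // +------+--------+-----------+
        // |emp_id|emp_name|   emp_dept|
        // +------+--------+-----------+
        // |     1|     Foo|Engineering|
        // |     2|     Bar|      Admin|
        // +------+--------+-----------+
        csv.show();

        sparkSession.stop();
        Files.deleteIfExists(csvPath);
    }
}
```

Because only the header option is set, every column in this sketch is read as a string, including emp_id.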