Processing the Parquet files

Apache Parquet is another columnar data format used by many tools in the Hadoop ecosystem, such as Hive, Pig, and Impala. It improves performance through efficient compression, a columnar layout, and encoding routines. The Parquet processing example is very similar to the JSON Scala code. The DataFrame is created and then saved in Parquet format by calling the parquet method on the writer returned by the DataFrame's write method:

df.write.parquet("hdfs://localhost:9000/tmp/test.parquet")
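To make the one-line call above easier to try in isolation, here is a minimal sketch of a full write and read-back cycle. It is not the book's own listing: the object name ParquetRoundTrip, the helper roundTrip, the sample rows, the local[*] master, and the temporary local output path are all assumptions chosen for illustration, standing in for the DataFrame built earlier and the HDFS path used in the text.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetRoundTrip {

  // Hypothetical helper: saves df as Parquet under path, then loads it back.
  def roundTrip(spark: SparkSession, df: DataFrame, path: String): DataFrame = {
    // "overwrite" replaces any previous output at the same location.
    df.write.mode("overwrite").parquet(path)
    spark.read.parquet(path)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; the text targets hdfs://localhost:9000.
    val spark = SparkSession.builder()
      .appName("ParquetRoundTrip")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumption: sample rows standing in for the DataFrame created earlier.
    val df = Seq((1, "alice"), (2, "bob"), (3, "carol")).toDF("id", "name")

    // Assumption: a temporary local directory instead of an HDFS path.
    val path = Files.createTempDirectory("parquet-demo")
      .resolve("test.parquet").toString

    val restored = roundTrip(spark, df, path)
    restored.printSchema() // the schema is stored in the Parquet files themselves
    restored.show()

    spark.stop()
  }
}
```

Because Parquet stores the schema alongside the data, the DataFrame read back should carry the same column names and types without any schema being supplied on read.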

This results in an HDFS directory that contains eight Parquet files.
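The output is a directory rather than a single file because Spark writes one part file per non-empty partition of the DataFrame. The following sketch illustrates that relationship under stated assumptions: the object name ParquetPartFiles, the helper countPartFiles, the generated data, and the local temporary path are hypothetical, and repartition(8) is used only to mirror the eight files mentioned above. The source does not say how its eight partitions arose.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ParquetPartFiles {

  // Hypothetical helper: counts part files in a local Parquet output directory.
  // Marker files such as _SUCCESS and hidden .crc checksum files are ignored.
  def countPartFiles(dir: String): Int =
    Option(new File(dir).listFiles()).getOrElse(Array.empty[File])
      .count(f => f.getName.startsWith("part-") && f.getName.endsWith(".parquet"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session and local file system, not HDFS.
    val spark = SparkSession.builder()
      .appName("ParquetPartFiles")
      .master("local[*]")
      .getOrCreate()

    // Assumption: generated data, explicitly split into eight partitions.
    val df = spark.range(0, 1000).repartition(8)

    val out = Files.createTempDirectory("parquet-parts")
      .resolve("test.parquet").toString
    df.write.parquet(out)

    println(s"partitions at write time: ${df.rdd.getNumPartitions}")
    println(s"part files written: ${countPartFiles(out)}")

    spark.stop()
  }
}
```

Under these assumptions both numbers should match. To produce fewer, larger files, the partition count can be reduced with coalesce or repartition before writing.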

For more information about the available SparkContext and SparkSession methods, see the API documentation for the org.apache.spark.SparkContext and org.apache.spark.sql.SparkSession classes in the Apache Spark API reference at http://spark.apache.org/docs/latest/api/scala/index.html.

In the next section, we will examine Apache Spark DataFrames. They were introduced in Spark 1.3 and became first-class citizens in Apache Spark 1.5 and 1.6.
