In order to get an idea of what is happening behind the scenes, we can call the explain method on the Dataset. This prints a text representation of the physical execution plan (PEP), showing what the Catalyst Optimizer did behind the scenes.
So let's have a look at what happens if we join two Datasets backed by Parquet files:
spark.sql("select c.familyName from clientbigparquet c inner join accountbigparquet a on c.id=a.clientId").explain
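The same query can also be expressed with the Dataset API. The following is a minimal sketch, assuming hypothetical file locations for the two Parquet files used in the text:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("explain-demo").getOrCreate()

// Hypothetical paths; adjust to where the Parquet files actually live
val clients  = spark.read.parquet("/data/clientbigparquet")
val accounts = spark.read.parquet("/data/accountbigparquet")

// Register temporary views so the SQL string above works as well
clients.createOrReplaceTempView("clientbigparquet")
accounts.createOrReplaceTempView("accountbigparquet")

// Equivalent join via the Dataset API; explain prints the physical plan
clients.join(accounts, clients("id") === accounts("clientId"))
  .select("familyName")
  .explain()
```

Both the SQL string and the Dataset API version go through the Catalyst Optimizer and should therefore produce the same physical plan.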
The resulting execution plan (PEP) looks like this:
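The exact output depends on the Spark version, the file locations, and the generated column identifiers, but a plan for this query typically has roughly the following shape (an abridged, illustrative sketch, not the verbatim output):

```
== Physical Plan ==
*Project [familyName#27]
+- *SortMergeJoin [id#25], [clientId#40], Inner
   :- *Sort [id#25 ASC], false, 0
   :  +- Exchange hashpartitioning(id#25, 200)
   :     +- *Project [id#25, familyName#27]
   :        +- *Filter isnotnull(id#25)
   :           +- *FileScan parquet [id#25,familyName#27],
   :                 PushedFilters: [IsNotNull(id)],
   :                 ReadSchema: struct<id:int,familyName:string>
   +- *Sort [clientId#40 ASC], false, 0
      +- Exchange hashpartitioning(clientId#40, 200)
         +- *Project [clientId#40]
            +- *Filter isnotnull(clientId#40)
               +- *FileScan parquet [clientId#40],
                     PushedFilters: [IsNotNull(clientId)],
                     ReadSchema: struct<clientId:int>
```

The numeric suffixes (such as #25) are internal column identifiers assigned by Catalyst and will differ from run to run.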
Since this might be a bit hard to read for beginners, the same plan is shown in the following figure:
Reading the plan from bottom to top, there are two read operations on the Parquet files, each followed by a filter operation that ensures the columns of the join predicate are not null. This matters because rows with null join keys can never match in an inner join, so they can be discarded early.
Then, a projection operation is performed, which removes unnecessary columns. We only select familyName, but the join also needs clientId and id, so these fields are included in the set of columns to be read as well.
Finally, all data is available to execute the join.
As previously stated, Parquet files are smart data sources: projection and filtering can be performed at the storage level, eliminating disk reads of unnecessary data. This can be seen in the PushedFilters and ReadSchema sections of the explained plan, where the IsNotNull filter and the projection on id, clientId, and familyName are pushed down to the read level of the Parquet files.
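Filter pushdown can also be observed in isolation, without a join. The following is a minimal sketch, reusing the hypothetical client Parquet path from before and an illustrative age column that is not part of the original example:

```scala
// Reading with an explicit filter and a narrow select; both surface
// in the FileScan node as PushedFilters and ReadSchema respectively
val adults = spark.read.parquet("/data/clientbigparquet")
  .filter("age > 18")        // hypothetical column, for illustration only
  .select("familyName")

// The plan should show pushed filters such as IsNotNull(age) and
// GreaterThan(age,18), and a ReadSchema restricted to the needed columns
adults.explain()
```

Because Parquet stores data column by column and keeps min/max statistics per row group, such pushed-down filters and projections allow Spark to skip reading data that cannot contribute to the result.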