Using the explain method to obtain the PEP

In order to get an idea of what is happening behind the scenes, we can call the explain method on the Dataset. This returns (as text) an output what the Catalyst Optimizer did behind the scenes by printing a text representation of the PEP.

So let's have a look at what happens if we join two Datasets backed by the parquet files:

spark.sql("select c.familyName from clientbigparquet c inner join accountbigparquet a on c.id=a.clientId").explain

The resulting execution plan (PEP) looks like this:

Since this might be a bit hard to read for beginners, find the same plan in the following figure:

So what is done here is basically--when read from bottom to top--two read operations to parquet files followed by a filter operations to ensure that the columns of the join predicate are not null. This is important since otherwise the join operation fails.

In relational databases, the isNotNull filter would be a redundant step, since RDBMS recognizes the notion of not nullable columns; therefore, the null check is done during insert and not during read.

Then, a projection operation is performed, which basically means that unnecessary columns are removed. We are only selecting familyName, but for the join operation, clientId and id are also necessary. Therefore, these fields are included in the set of column names to be obtained as well.

Finally, all data is available to execute the join.

Note that in the preceding diagram, BroadCastHashJoin is illustrated as single stage, but in reality, a join spans partitions over - potentially - multiple physical nodes. Therefore, the two tree branches are executed in parallel and the results are shuffled over the cluster using hash bucketing based on the join predicate.

As previously stated, the Parquet files are smart data sources, since projection and filtering can be performed at the storage level, eliminating disk reads of unnecessary data. This can be seen in the PushedFilters and ReadSchema sections in the explained plan, where the IsNotNull operation and the ReadSchema projection on id, clientId, and familyName is directly performed at the read level of the Parquet files.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.149.255.168