Join types

The following is a table of the different types of joins. This is important, as the choice made when joining two datasets makes all the difference in the output, and also the performance.

Join type Description
inner The inner join compares each row from left to rows from right and combines matched pair of rows from left and right datasets only when both have non-NULL values.

cross

The cross join matches every row from left with every row from right generating a Cartesian cross product.
outer, full, fullouter The full outer Join gives all rows in left and right filling in NULL if only in right or left.
leftanti The leftanti Join gives only rows in left based on non-existence on right side.
left, leftouter The leftouter Join gives all rows in left plus common rows of left and right (inner join). Fills in NULL if not in right.
leftsemi The leftsemi Join gives only rows in left based on existence on right side. The does not include right-side values.
right, rightouter The rightouter Join gives all rows in right plus common rows of left and right (inner join). Fills in NULL if not in left.

 

We will examine how the different join types work by using the sample datasets.

scala> val statesPopulationDF = spark.read.option("header", "true").option("inferschema", "true").option("sep", ",").csv("statesPopulation.csv")
statesPopulationDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]

scala> val statesTaxRatesDF = spark.read.option("header", "true").option("inferschema", "true").option("sep", ",").csv("statesTaxRates.csv")
statesTaxRatesDF: org.apache.spark.sql.DataFrame = [State: string, TaxRate: double]

scala> statesPopulationDF.count
res21: Long = 357

scala> statesTaxRatesDF.count
res32: Long = 47

%sql
statesPopulationDF.createOrReplaceTempView("statesPopulationDF")
statesTaxRatesDF.createOrReplaceTempView("statesTaxRatesDF")

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.188.38.142