Joins

In traditional databases, a join combines a transaction table with a lookup table to produce a more complete view. For example, if you have one table of online transactions keyed by customer ID and another table that maps each customer ID to a city, you can join the two to generate reports on transactions by city.

Transactions table: The following table has three columns: the CustomerID, the Purchased item, and the Price paid for the item:

CustomerID   Purchased item   Price paid
1            Headphone         25.00
2            Watch            100.00
3            Keyboard          20.00
1            Mouse             10.00
4            Cable             10.00
3            Headphone         30.00

Customer Info table: The following table has two columns: the CustomerID and the City the customer lives in:

CustomerID   City
1            Boston
2            New York
3            Philadelphia
4            Boston

Joining the Transactions table with the Customer Info table generates the following view:

CustomerID   Purchased item   Price paid   City
1            Headphone         25.00       Boston
2            Watch            100.00       New York
3            Keyboard          20.00       Philadelphia
1            Mouse             10.00       Boston
4            Cable             10.00       Boston
3            Headphone         30.00       Philadelphia

Now we can use this joined view to generate a report of the total sale price by city:

City           #Items   Total sale price
Boston         3         45.00
Philadelphia   2         50.00
New York       1        100.00

Joins are an important function of Spark SQL because, as shown previously, they let you bring two datasets together. Spark, of course, is not only meant to generate reports; it is used to process data at petabyte scale for real-time streaming use cases, machine learning algorithms, and plain analytics. To accomplish these goals, Spark provides the necessary API functions.
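As a concrete illustration, the following is a minimal Scala sketch that reproduces the example above with the DataFrame API. The DataFrame names transactions and customerInfo and the camel-cased column names are our own illustrative choices, and an existing SparkSession named spark is assumed (as in the snippet later in this section):

// Scala:
import org.apache.spark.sql.functions._
import spark.implicits._

// The Transactions table from above
val transactions = Seq(
  (1, "Headphone", 25.00),
  (2, "Watch", 100.00),
  (3, "Keyboard", 20.00),
  (1, "Mouse", 10.00),
  (4, "Cable", 10.00),
  (3, "Headphone", 30.00)
).toDF("CustomerID", "PurchasedItem", "PricePaid")

// The Customer Info table from above
val customerInfo = Seq(
  (1, "Boston"),
  (2, "New York"),
  (3, "Philadelphia"),
  (4, "Boston")
).toDF("CustomerID", "City")

// Inner join on the shared CustomerID column produces the joined view
val joined = transactions.join(customerInfo, "CustomerID")

// Group the joined view by city to produce the report
joined.groupBy("City")
  .agg(count(lit(1)).as("#Items"), sum("PricePaid").as("Total sale price"))
  .show()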

A typical join between two datasets uses one or more keys from the left and right datasets and evaluates a conditional expression on those keys as a Boolean expression. If the Boolean expression evaluates to true, the two rows are joined; otherwise, the combined row does not appear in the joined DataFrame.
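The condition is not limited to key equality; any Boolean column expression works. Here is a small sketch with made-up orders and discounts DataFrames (hypothetical names and columns, reusing the imports from the earlier sketch) showing a non-equality condition; a row whose expression evaluates to false simply does not appear in the inner-join result:

// Scala:
val orders = Seq((1, 80.00), (2, 120.00)).toDF("OrderID", "Amount")
val discounts = Seq((100.00, 0.10)).toDF("MinAmount", "Pct")

// Only order 2 satisfies Amount >= MinAmount, so only it survives the join
orders.join(discounts, orders("Amount") >= discounts("MinAmount")).show()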

The join API has six overloads, described below (a short sketch of a few of them follows this list):

join(right: Dataset[_]): DataFrame
Condition-less inner join

join(right: Dataset[_], usingColumn: String): DataFrame
Inner join on a single column

join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
Inner join on multiple columns

join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
Join on multiple columns with a join type (inner, outer, ...)

join(right: Dataset[_], joinExprs: Column): DataFrame
Inner join using a join expression

join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
Join using a join expression and a join type (inner, outer, ...)

We will use the last of these overloads to understand how the join APIs work; you can choose other overloads depending on the use case:

def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
Join with another DataFrame using the given join expression

right: Right side of the join.
joinExprs: Join expression.
joinType: Type of join to perform (the overloads that omit this parameter default to an inner join).

// Scala:
import org.apache.spark.sql.functions._
import spark.implicits._

// df1 and df2 are two existing DataFrames with join keys df1Key and df2Key;
// the outer join keeps unmatched rows from both sides, filling the missing
// side with nulls
df1.join(df2, $"df1Key" === $"df2Key", "outer")

Note that joins will be covered in detail in the next few sections.
