Dataset joining

In this section, we will cover dataset joining techniques. We will also discuss Spark's special features for data joining and some joining solutions that Spark makes easy.

After this section, we will be able to join data for various machine learning needs.

Dataset joining and its tool – Spark SQL

In preparing datasets for a machine learning project, we often need to combine data from multiple datasets. For relational tables, the task is to join tables through a primary and foreign key relationship.

Joining two or more datasets together sounds easy, but it can be very challenging and time-consuming. In SQL, SELECT is the most frequently used command. As an example, the following is a typical SQL statement to perform a join:

SELECT column1, column2, …
FROM table1, table2
WHERE table1.joincolumn = table2.joincolumn
AND search_condition(s);

To handle the table joining tasks mentioned before, data scientists and machine learning professionals often turn to their familiar SQL tools. Within the Spark environment, Spark SQL was created for this task.

Spark SQL lets users query structured data inside Spark programs using either SQL or a familiar DataFrame API, which is usable in R notebooks as well. Spark SQL reuses the Hive frontend and metastore, giving full compatibility with existing Hive data and queries.

Spark SQL includes a cost-based optimizer, columnar storage, and code generation to make queries fast. At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance.

The two main components when using Spark SQL are DataFrame and SQLContext.

As discussed before, a DataFrame is a distributed collection of data organized into named columns. It is based on the data frame concept in the R language and is similar to a table in a relational database. Spark SQL provides SQLContext to encapsulate all relational functionality in Spark.
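For instance, from an R notebook, the SparkR API can be used to turn local data into a DataFrame, register it, and query it with SQL. The following is a minimal sketch: the table content is made up, and it uses the sparkR.session() entry point of newer Spark versions, which subsumes the SQLContext (older versions create a SQLContext explicitly):

library(SparkR)
sparkR.session()   # entry point; older Spark versions use a SQLContext instead

# A small, illustrative Users table promoted to a Spark DataFrame
users_local <- data.frame(userId = c(1L, 2L, 3L),
                          age = c(25L, 31L, 40L),
                          subscribed = c(TRUE, FALSE, TRUE))
users_df <- as.DataFrame(users_local)

# Register the DataFrame so that it can be queried with SQL
createOrReplaceTempView(users_df, "Users")
head(sql("SELECT userId, age FROM Users WHERE subscribed = TRUE"))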

Dataset joining in Spark

Here, with some concrete examples, we will demonstrate methods and related processes for using Spark SQL.

For the purposes of illustration, imagine an application with the following four tables:

  • Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
  • Events(userId INT, action INT, Default)
  • WebLog(userId, webAction)
  • Demographic(memberId, age, edu, income)

At a minimum, we need to join the Users and Events tables; we can do this with the following code:

val trainingDataTable = sql("""
  SELECT e.action,
         u.age,
         u.latitude,
         u.longitude
  FROM Users u
  JOIN Events e
  ON u.userId = e.userId""")

As the results of Spark SQL queries are also stored in RDDs, interfacing with other Spark libraries is trivial. The results returned by the preceding query can be used directly for machine learning.
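For instance, if the same join is issued from an R notebook through SparkR, the returned Spark DataFrame can be sampled or pulled into a local data frame for model fitting. The following is a rough sketch, assuming the Users and Events tables have been registered as temporary views (as in the earlier sketch):

trainingDF <- sql("
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId")

trainingLocal <- collect(trainingDF)   # only for results small enough to fit on the driver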

In the preceding example, Spark SQL made it easy to join various datasets, preparing them for the machine learning algorithm. Furthermore, Spark SQL allows developers to close the loop by making it easy to manipulate and join the output of these algorithms, producing the desired final result.

For more information about using Spark SQL, please go to:

http://spark.apache.org/docs/1.0.0/sql-programming-guide.html

Dataset joining with the R data.table package

The Spark technology has made data work faster and data analytics easier than before.

According to the Spark development team, Spark SQL was created for:

  • Writing less code
  • Reading less data
  • Passing the hard work to the optimizer

This was achieved by utilizing DataFrames and the Spark SQL commands sqlContext.read and df.write.
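In an R notebook, SparkR exposes the same reader and writer functionality through read.df and write.df. The following sketch uses hypothetical Parquet paths:

# Read a Parquet file into a Spark DataFrame; column pruning and filter
# pushdown in the optimizer mean less data is actually read
eventsDF <- read.df("events.parquet", source = "parquet")

# Write a (possibly joined) DataFrame back out for later use
write.df(eventsDF, path = "events_out.parquet", source = "parquet", mode = "overwrite")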

Besides Spark SQL, users may also use R to join tables; for this, the data.table R package is very powerful. The data.table package was created for:

  • Fast aggregation of large data (for example, 100 GB in RAM) and fast ordered joins
  • Fast adding/modifying/deleting of columns by group, using no copies at all
  • List columns and a fast file reader (fread)

This package offers a natural and flexible syntax for faster development as well.
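As a quick illustration of this syntax, the following sketch builds a small, hypothetical Demographic table, aggregates income by education level, and shows the fread call for fast file reading (the file name is illustrative):

library(data.table)

demo <- data.table(memberId = 1:6,
                   edu = c("HS", "BA", "BA", "MS", "HS", "MS"),
                   income = c(40000, 55000, 60000, 70000, 42000, 75000))

demo[, .(avgIncome = mean(income), n = .N), by = edu]   # fast grouped aggregation

# demo <- fread("demographic.csv")   # fast file reader; path is illustrative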

To use data.table for joining, you first need to create data.table objects, typically with keys set on the join columns, which is easy.

Then, just use X[Y] to join two tables.

The rolling variant of this join, described below, is also known as Last Observation Carried Forward (LOCF).

X[Y] is a join between data.table X and data.table Y. If Y has two columns, the first column is matched to the first column of the key of X, and the second column is matched to the second. An equi-join is performed by default, meaning that the values must be equal.
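For example, the following is a minimal sketch of an equi-join between two small, hypothetical tables keyed on userId:

library(data.table)

X <- data.table(userId = c(1L, 2L, 3L), age = c(25L, 31L, 40L), key = "userId")
Y <- data.table(userId = c(2L, 3L, 4L), action = c(10L, 20L, 30L), key = "userId")

X[Y]   # look up each row of Y in X; userId 4 has no match in X, so age is NA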

Instead of an equi-join, a rolling join can be performed as follows:

X[Y, roll = TRUE]

As before, the first column of Y is matched to X, where the values are equal. However, the last join column in Y, the second one in this example, is treated specially. If no match is found, then the row before is returned, provided the first column still matches.
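The following sketch illustrates this behavior with a hypothetical price table X and a table of query times Y, both keyed on time:

library(data.table)

X <- data.table(time = c(1L, 3L, 6L), price = c(100, 105, 103), key = "time")
Y <- data.table(time = c(2L, 4L, 5L), key = "time")

X[Y, roll = TRUE]   # the price at time 1 is carried forward to time 2; the price at time 3 is carried forward to times 4 and 5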

Further controls are rolling forwards, rolling backwards, rolling to the nearest, and limited staleness.

For example, type the following and follow the output at the prompt:

example(data.table)

The R data.table package provides an enhanced version of data.frame, including:

  • Fast aggregation of large data—for example, 100 GB in RAM (take a look at the benchmarks on up to two billion rows)
  • Fast ordered joins—for example, rolling forwards, rolling backwards, rolling to the nearest, and limited staleness
  • Fast overlapping range joins—for example, GenomicRanges

As we may recall from the section Data cleaning made easy, we had four tables for the purposes of illustration, as follows:

  • Users(userId INT, name STRING, email STRING, age INT, latitude DOUBLE, longitude DOUBLE, subscribed BOOLEAN)
  • Events(userId INT, action INT, Default)
  • WebLog(userId, webAction)
  • Demographic(memberId, age, edu, income)

For this example, we obtained a subset of the first table and an aggregate of the fourth table in the last section. Now, we need to join them together. As described in the preceding section, mixing Spark SQL with R in an R notebook can make such data joining very easy.
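As a rough sketch of the join itself in data.table, the following uses made-up values in place of the actual subset and aggregate from the previous section, and assumes that memberId identifies the same users as userId:

library(data.table)

usersSub <- data.table(userId = c(1L, 2L, 3L), age = c(25L, 31L, 40L))
demoAgg  <- data.table(memberId = c(1L, 2L, 4L), avgIncome = c(52000, 61000, 47000))

# Join on userId = memberId; each row of usersSub is matched against demoAgg
joined <- demoAgg[usersSub, on = c(memberId = "userId")]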
