Chapter 6. Understanding Data Flows

In this chapter, we will cover:

  • Splitting a stream into two or more streams based on a condition
  • Merging rows from two streams with the same or different structure
  • Comparing two streams and generating differences
  • Generating all possible pairs formed from two datasets
  • Joining two streams based on conditions
  • Interspersing new rows between existing rows
  • Executing steps even when your stream is empty
  • Processing rows differently based on the row number

Introduction

The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation.

When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends.

The dataset that flows from step to step is nothing more than a set of rows, all sharing the same structure, or metadata. This means that every row has the same number of columns, and that corresponding columns in all rows have the same name and type.
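To make the idea of uniform metadata concrete, here is a minimal Python sketch (not Kettle code) that models rows as dictionaries and checks that every row exposes the same columns with the same value types. The `same_metadata` helper is purely illustrative; Kettle performs this kind of consistency enforcement internally.

```python
# Hypothetical model: a dataset is a list of rows that all share one
# metadata definition (same column names, same order, same types).
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

def same_metadata(rows):
    # Capture (column name, value type) pairs of the first row and
    # require every other row to match them exactly.
    first = [(k, type(v)) for k, v in rows[0].items()]
    return all([(k, type(v)) for k, v in r.items()] == first for r in rows)

print(same_metadata(rows))  # True: both rows have (name: str, age: int)

# A row with a missing column breaks the metadata contract:
bad = rows + [{"name": "Carol"}]
print(same_metadata(bad))   # False
```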

Suppose that you have a single stream of data and you apply the same transformations to all rows; that is, all your steps are connected one after another in a single line. This is the simplest transformation from a structural point of view. In this case, you don't have to worry much about the structure of your data stream, nor about the origin or destination of the rows. The interesting part comes when you face other situations, for example:

  • You want a step to start processing rows only after another given step has processed all rows
  • You have more than one stream and you have to combine them into a single stream
  • You have to inject rows in the middle of your stream and those rows don't have the same structure as the rows in your dataset

Kettle lets you do all of this, but you have to be careful: it's easy to end up doing the wrong thing and getting unexpected results or, even worse, errors.

The first example does not reflect the default behavior, because of the parallel nature of transformations explained earlier. There are, however, two steps that can help:

  • Blocking Step: This step blocks processing until all incoming rows have been processed.
  • Block this step until steps finish: This step blocks processing until the selected steps finish.

Both these steps are in the Flow category.
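The effect of the Blocking Step can be sketched in plain Python (this is an illustrative model, not Kettle's implementation). Each step runs as a thread, hops are queues, and an assumed `END` sentinel plays the role of the "no more rows" signal; the blocking step buffers every incoming row and only releases them downstream once the upstream step has finished.

```python
import queue
import threading

END = object()  # sentinel marking end-of-stream (stands in for Kettle's signal)

def generate_rows(out_hop):
    # Upstream step: emits five rows onto its outgoing hop, then signals the end.
    for i in range(5):
        out_hop.put({"id": i})
    out_hop.put(END)

def blocking_step(in_hop, out_hop):
    # Like the Blocking Step: buffer all incoming rows and start
    # emitting only after the upstream step has delivered every row.
    buffered = []
    while True:
        row = in_hop.get()
        if row is END:
            break
        buffered.append(row)
    for row in buffered:  # release everything downstream at once
        out_hop.put(row)
    out_hop.put(END)

hop1, hop2 = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=generate_rows, args=(hop1,)),
    threading.Thread(target=blocking_step, args=(hop1, hop2)),
]
for t in threads:
    t.start()

# The "next" step: reads rows only after the blocking step releases them.
rows = []
while True:
    row = hop2.get()
    if row is END:
        break
    rows.append(row)
for t in threads:
    t.join()

print(rows)  # all five rows, delivered only after the upstream step finished
```

The same structure models "Block this step until steps finish" if, instead of buffering rows, the step simply waits for the watched threads to join before passing rows through.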

You will find examples of the use of the latter step in the following recipes:

  • Writing an Excel file with several sheets (Chapter 2, Reading and Writing Files)
  • Generating a custom log file (Chapter 9, Getting the Most Out of Kettle)

This chapter focuses on the other two examples and some similar use cases, by explaining the different ways for combining, splitting, or manipulating streams of data.
