Chapter 6. Understanding Data Flows

In this chapter, we will cover:

  • Splitting a stream into two or more streams based on a condition
  • Merging rows from two streams with the same or different structure
  • Comparing two streams and generating differences
  • Generating all possible pairs formed from two datasets
  • Joining two streams based on conditions
  • Interspersing new rows between existing rows
  • Executing steps even when your stream is empty
  • Processing rows differently based on the row number

Introduction

The main purpose of Kettle transformations is to manipulate data in the form of a dataset; this task is done by the steps of the transformation.

When a transformation is launched, all its steps are started. During the execution, the steps work simultaneously reading rows from the incoming hops, processing them, and delivering them to the outgoing hops. When there are no more rows left, the execution of the transformation ends.

The dataset that flows from step to step is nothing more than a set of rows, all sharing the same structure, or metadata. This means that every row has the same number of columns, and that corresponding columns in all rows have the same name and type.
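To make the idea of uniform metadata concrete, here is a minimal Python sketch (not Kettle code) that models rows as dictionaries and checks that every row exposes the same columns with the same value types. The `same_metadata` helper is purely illustrative; Kettle performs this kind of consistency enforcement internally.

```python
# Hypothetical model: a dataset is a list of rows that all share one
# metadata definition (same column names, same order, same types).
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

def same_metadata(rows):
    # Capture (column name, value type) pairs of the first row and
    # require every other row to match them exactly.
    first = [(k, type(v)) for k, v in rows[0].items()]
    return all([(k, type(v)) for k, v in r.items()] == first for r in rows)

print(same_metadata(rows))  # True: both rows have (name: str, age: int)

# A row with a missing column breaks the metadata contract:
bad = rows + [{"name": "Carol"}]
print(same_metadata(bad))   # False
```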

Suppose that you have a single stream of data and you apply the same transformations to all rows; that is, all your steps are connected one after another in a single line. This is the simplest transformation from a structural point of view. In this case, you don't have to worry much about the structure of your data stream, nor about the origin or destination of the rows. The interesting part comes when you face other situations, for example:

  • You want a step to start processing rows only after another given step has processed all rows
  • You have more than one stream and you have to combine them into a single stream
  • You have to inject rows in the middle of your stream and those rows don't have the same structure as the rows in your dataset

Kettle lets you do all of this, but you have to be careful: it's easy to end up doing the wrong thing and getting unexpected results or, even worse, errors.

The first example does not reflect the default behavior, because of the parallel nature of transformations explained earlier. There are, however, two steps that can help:

  • Blocking Step: This step blocks processing until all incoming rows have been processed.
  • Block this step until steps finish: This step blocks processing until the selected steps finish.

Both these steps are in the Flow category.
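The effect of the Blocking Step can be sketched in plain Python (this is an illustrative model, not Kettle's implementation). Each step runs as a thread, hops are queues, and an assumed `END` sentinel plays the role of the "no more rows" signal; the blocking step buffers every incoming row and only releases them downstream once the upstream step has finished.

```python
import queue
import threading

END = object()  # sentinel marking end-of-stream (stands in for Kettle's signal)

def generate_rows(out_hop):
    # Upstream step: emits five rows onto its outgoing hop, then signals the end.
    for i in range(5):
        out_hop.put({"id": i})
    out_hop.put(END)

def blocking_step(in_hop, out_hop):
    # Like the Blocking Step: buffer all incoming rows and start
    # emitting only after the upstream step has delivered every row.
    buffered = []
    while True:
        row = in_hop.get()
        if row is END:
            break
        buffered.append(row)
    for row in buffered:  # release everything downstream at once
        out_hop.put(row)
    out_hop.put(END)

hop1, hop2 = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=generate_rows, args=(hop1,)),
    threading.Thread(target=blocking_step, args=(hop1, hop2)),
]
for t in threads:
    t.start()

# The "next" step: reads rows only after the blocking step releases them.
rows = []
while True:
    row = hop2.get()
    if row is END:
        break
    rows.append(row)
for t in threads:
    t.join()

print(rows)  # all five rows, delivered only after the upstream step finished
```

The same structure models "Block this step until steps finish" if, instead of buffering rows, the step simply waits for the watched threads to join before passing rows through.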

You will find examples of the use of the latter step in the following recipes:

  • Writing an Excel file with several sheets (Chapter 2, Reading and Writing Files)
  • Generating a custom log file (Chapter 9, Getting the Most Out of Kettle)

This chapter focuses on the other two examples and some similar use cases, by explaining the different ways for combining, splitting, or manipulating streams of data.
