Time for action—updating a file with news about examinations

Imagine you are responsible to collect the results of an annual examination that is being taken in a language school. The examination evaluates writing, reading, speaking, and listening skills. Every professor gives the exam to the students, the students take the examination, the professors grade the examinations in the scale 0-100 for each skill, and write the results in a text file, like the following:

student_code;name;writing;reading;speaking;listening
80711-85;William Miller;81;83;80;90
20362-34;Jennifer Martin;87;76;70;80
75283-17;Margaret Wilson;99;94;90;80
83714-28;Helen Thomas;89;97;80;80
61666-55;Maria Thomas;88;77;70;80

All the files follow that pattern.

When a professor has the file ready, he/she sends it to you, and you have to integrate the results in a global list. Let's do it with Kettle.

  1. Before starting, be sure to have a file ready to read. Type it or download the sample files from the Packt's official website.
  2. Create the file where the news will be appended. Type this:
    ---------------------------------------------------------
    Annual Language Examinations
    Testing writing, reading, speaking and listening skills
    ---------------------------------------------------------
    student_code;name;writing;reading;speaking;listening;file_processed;process_date
    

    Save the file as C:/pdi_files/output/examination.txt.

  3. Create a new transformation.
  4. Expand the Input branch of the steps tree.
  5. Drag the Get System Info and Text file input icons to the canvas.
  6. Expand the Output branch of the steps tree, and drag a Text file output step to the canvas.
  7. Link the steps as follows:
    Time for action—updating a file with news about examinations
  8. Double-click the first Get System Info step icon and give it a name.
  9. Fill the grid as follows:
    Time for action—updating a file with news about examinations
  10. Click OK.
  11. Double-click the Text file Input step icon and configure it like here:
    Time for action—updating a file with news about examinations
  12. Select the Content tab.
  13. Check the Include filename in output? checkbox and type file_processed in the Filename fieldname textbox.
  14. Check the Add filenames to result checkbox.
  15. Select the Fields tab and Click the Get Fields button to fill the grid.
  16. Click OK.
  17. Double-click the second Get System Info step icon and give it a name.
  18. Add a field named process_date, and from the list of choices select system date (fixed).
  19. Double-click the Text file output step icon and give it a name.
  20. Type C:/pdi_files/output/examination as the filename.
  21. In the Fields tab, press the Get Fields button to fill the grid.
  22. Change the format of the Date row to yy/MM/dd.
  23. Give a name and description to the transformation and save it.
  24. Press F9 to run the transformation.
  25. Fill in the argument grid, writing the full path of the file created.
    Time for action—updating a file with news about examinations
  26. Click Launch.
  27. The output file should look like this:
    ---------------------------------------------------------
    Annual Language Examinations
    Testing writing, reading, speaking and listening skills
    ---------------------------------------------------------
    student_code;name;writing;reading;speaking;listening;file_processed;process_date
    80711-85;William Miller;81;83;80;90;C:examsexam1.txt;28-05-2009
    20362-34;Jennifer Martin;87;76;70;80;C:examsexam1.txt;28-05-2009
    75283-17;Margaret Wilson;99;94;90;80;C:examsexam1.txt;28-05-2009
    83714-28;Helen Thomas;89;97;80;80;C:examsexam1.txt;28-05-2009
    61666-55;Maria Thomas;88;77;70;80;C:examsexam1.txt;28-05-2009
    
  28. Run the transformation again.
  29. This time fill the argument grid with the name of a second file.
  30. Click Launch.
  31. Verify that the data from this second file was appended to the previous data in the output file.

What just happened?

You read a file whose name is known at runtime, and fed a destination file by appending the contents of the input file.

The first Get System Info step tells Kettle to take the first command line argument, and assume that it is the name of the file to read.

In the Text File Input step, you didn't specify the name of the file, but told Kettle to take as the name of the file, the field coming from the previous step, which is the read argument.

With the second Get System Info step you just took from the system, the date, which you used later to enrich the data sent to the destination file.

The destination file is appended with new data every time you run the transformation. Beyond the basic required data (student code and grades), the name of the processed file and the date on which the data is being appended are added as part of the data.

Tip

When you don't specify the name and location of a file (like in this example), or when the real file is not available at design time, you won't be able to use the Get Fields button, nor preview to see if the step is well configured. The trick is to configure the step by using a real file identical to the expected one. After the step is configured, change the name and location of the file as needed.

Getting information by using Get System Info step

The Get System Info step allows you to get different information from the system. In this exercise, you took the system date and an argument. If you look to the available list, you will see more than just these two options.

Here we used the step in two different ways:

  • As a resource to take the name of the file from the command line
  • To add a field to the dataset

The use of this step will be clearer with a picture.

Getting information by using Get System Info step

In this example, the Text File Input doesn't know the name or the location of the file. It takes it from the previous step, which is a Get System Info Step. As the Get System Info serves as a supplier of information, the hop that leaves the step changes its look and feel to show the situation.

The second time the Get System Info is used, its function is simply to add a field to the incoming dataset.

Data types

Every field must have a data type. The data type can be any of the common data types—number (float), string, date, Boolean, integer, or big number. Strings are simple, just text for which you may specify a length. Date and numeric fields have more variants, and are worthy of while a separate explanation.

Date fields

Date is one the main data types available in Kettle. In the matches tutorial, you have an example of date field—the match date field. Its values were 2/Jun, 6/Jun, 10/Jun. Take a look at how you defined that field in the Text file input step. You defined the field as a date field with format dd/MMM. What does it mean? To Kettle it means that it has to interpret the field as a date, where the first two positions represent the day, then there is a slash, and finally there is the month in letters (that's the meaning of the three last positions).

Generally speaking, when a date field is created, like the text input field of the example, you have to define the format of the data so that Kettle can recognize in the field the different components of the date. There are several formats that may be defined for a date, all of them combinations of letters that represents date or time components. Here are the most basic ones:

Letters

Meaning

y

Year

M

Month

d

Day

H

Hour (0-23)

m

Minutes

s

Seconds

Now let's see the other end of the same transformation—the output step. Here you set another format for the same field: dd/MM. According the table, this means the date has to have two positions for the day, then a slash, and then two positions for the month. Here, the format specification represents the mask you want to apply when the date is shown. Instead of 2/Jun, 6/Jun, 10/Jun, in the output file, you expect to see 02/06, 06/06, 10/06.

In the examination tutorial, you also have a Date field—the process date. When you created it, you didn't specify a format because you took the system date which, by definition, is a date and Kettle knows it. But when writing this date to the output file, again you defined a format, in this case it was yyyy/MM/dd.

In general, when you are writing a date, the format attribute is used of format the data before sending it to the destination. In case you don't specify a format, Kettle sets a default format.

As said earlier, there are more combinations to define the format to a date field. For a complete reference, check the Sun Java API documentation located at http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html.

Numeric fields

Numeric fields are present in almost all Kettle transformations. In the Examination example, you encountered numeric fields for the first time. The input file had four numeric fields. As the numbers were all integer, you didn't set a specific format. When you have more elaborate fields such as numbers with separators, dollar signs, and so on, you should specify a format to tell Kettle how to interpret the number. If you don't, Kettle will do its best to interpret the number, but this could lead to unexpected results.

At the other extreme of the flow, when writing to the output file text, you may specify the format in which you want the number to be shown.

There are several formats you may apply to a numeric field. The format is basically a combination of predefined symbols, each with a special meaning. The following are the most used symbols:

Symbol

Meaning

#

Digit Leading zeros are not shown

0

Digit If the digit is not present, zero is displayed in its place

.

Decimal separator

-

Minus sign

%

Field has to be multiplied by 100 and shown as a percentage

These symbols are not used alone. In order to specify the format of your numbers, you have to combine them. Suppose that you have a numeric field whose value is 99.55; the following table shows you the same value after applying different formats to it:

Format

Result

#

100

0

100

#.#

99.6

#.##

99.55

#.000

99.550

000.000

099.550

If you don't specify a format for your numbers, you may still provide a Length and Precision. Length is the total number of significant figures, while precision is the number of floating-point digits.

If you neither specify format nor length or precision, Kettle behaves as follow. While reading, it does its best to interpret the incoming number, and when writing, it sends the data as it comes without applying any format.

For a complete reference on number formats, you can check the Sun Java API documentation available at http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html.

Running transformations from a terminal window

In the examination exercise, you specified that the name of the input file will be taken from the first command-line argument. That means when executing the transformation, the filename has to be supplied as an argument. Until now, you only ran transformations from inside Spoon. In the last exercise, you provided the argument by typing it in a dialog window. Now it is time to learn how to run transformations with or without arguments from a terminal window.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.40.129