Let's forget the examinations for a while, and retake the World Cup tutorial from the previous chapter. The file you obtained from that tutorial was a list of results of football matches. These are sample rows of the final file:
Match Date;Home Team;Away Team;Result 02/06;Italy;France;2-1 02/06;Argentina;Hungary;2-1 06/06;Italy;Hungary;3-1 06/06;Argentina;France;2-1 10/06;France;Hungary;3-1 10/06;Italy;Argentina;1-0 ...
Now you want to take that information to obtain some statistics such as the maximum number of goals per match in a given day. To do it, follow these instructions:
wcup_first_round.txt
file you generated in Chapter 2. Give the name and location of the file, check the Content tab to see that everything matches your file, and fill the Fields tab. Match_Date
. Range
field in this case. Click I understand, and don't worry because you already sorted the data in the previous step.You opened a file with results from several matches and got some statistics from it.
In the file, there was a column with the match result in the format n-m
, with n
being the goals of the home team and m
being the goals of the away team. With the Split Fields step, you split this field in two—one with each of these two numbers.
With the Calculator you did two things:
Note that in order to create a description, you used the +
operator to concatenate string rather than add numbers.
After that, you ordered the data by match date with a Sort rows step.
In the preview window of the Sort rows step, you could see all the calculated fields: home team goals, away team goals, match goals, and description.
Finally, you did some statistical calculations:
Match_Date
in the upper grid of the Group by step.Let's see how it works. Because the Group by step was preceded by a Sort rows step, the rows came to the step already ordered. When the rows arrive to the Group by step, Kettle creates groups based on the field(s) indicated in the upper grid—the Match_Date field in this case. The following drawing shows this idea:
Then, for every group, the fields that you put in the lower grid are calculated. Let's see, for example, the group for the match date 03/06
. For the rows in this group, Kettle calculated the following:
03/06
. There were 4
. 03/06
. There were 3+2+3+4=12.
; : Austria-Spain; Sweden-Brazil; Netherlands-Iran; Peru-Scotland.The same calculations were made for every group. You can verify the details by looking in the preview window.
Look at the Step Metrics tab in the Execution Results area of the screen:
Note that 24 rows entered the Group by step and only 7 came out of that step towards the Dummy step. That is because after the grouping, you no longer have the detail of matches. The output of the Group by step is your new data now—one row for every group created.
The Group by step allows you to create groups of rows and calculate new fields over those groups.
In order to define the groups, you have to specify which field(s) are the keys. For every combination of values for those fields, Kettle builds a new group.
In the tutorial you grouped by a single field Match_date
. Then for every value of Match_date
, Kettle created a different group.
The Group by step operates on consecutive rows. Suppose that the rows are already sorted by date, but those with date 10/06
are above the rest. The step traverses the dataset and each time the value for any of the grouping field changes, it creates a new group. If you see it this way, you will notice that the step will work even if the data is not sorted by the grouping field.
As you probably don't know how the data is ordered, it is safer and recommended that you sort the data by using a Sort rows step just before using a Group by step.
Once you have defined the groups, you are free to specify new fields to be calculated for every group. Every new field is defined as an aggregate function over some of the existent fields.
Let's review some of the fields you created in the tutorial:
Matches
field is the result of applying the Number of values
function over the field Match_date
. Sum of goals
field is the result of applying the Sum
function over the field goals
. Maximum
field is the result of applying the Maximum
function over the field goals
.Finally, you have the option to calculate aggregate functions over the whole dataset. You do this by leaving the upper grid blank. Following the same example, you could calculate the total number of matches and the average number of goals for all those matches. This is how you do it:
The following is what you get:
In any case, as a result of the Group by step, you will no longer have the detailed rows, unless you check the Include all rows? checkbox.
Here you have one more task related with the examinations file. Create a new transformation, read the file, and calculate:
Read the file with countries' information you used in Chapter 2. Build a file where each row has two columns—the name of a country and the list of spoken languages in that country.
3.141.2.157