An International Musical Contest will take place and 24 countries will participate, each presenting a duet. Your task is to hire interpreters so the contestants can communicate in their native language. In order to do that, you need to find out the language they speak:
countries.xml
file that contains information about countries that you used in Chapter 2. isofficial= T
.Now let's create the main flow of data:
ID;Country;Duet 1;Russia;Mikhail Davydova ;;Anastasia Davydova 2;Spain;Carmen Rodriguez ;;Francisco Delgado 3;Japan;Natsuki Harada ;;Emiko Suzuki 4;China;Lin Jiang ;;Wei Chiu 5;United States;Chelsea Thompson ;;Cassandra Sullivan 6;Canada;Mackenzie Martin ;;Nathan Gauthier 7;Italy;Giovanni Lombardi ;;Federica Lombardi
Note that Country Name
is a field coming from the text file stream, while the country field comes from the countries stream.
language-IS NOT NULL
. Duet, Country Name
and language
to Name, Country
, and Language
. people_and_languages.txt
with the selected fields.Name|Country|Language Mikhail Davydova|Russia| Anastasia Davydova|Russia| Carmen Rodriguez|Spain|Spanish Francisco Delgado|Spain|Spanish Natsuki Harada|Japan|Japanese Emiko Suzuki|Japan|Japanese Lin Jiang|China|Chinese Wei Chiu|China|Chinese Chelsea Thompson|United States|English Cassandra Sullivan|United States|English Mackenzie Martin|Canada|French Nathan Gauthier|Canada|French Giovanni Lombardi|Italy|Italian Federica Lombardi|Italy|Italian
First of all, you read a file with information about countries and the languages spoken in those countries.
Then you read a list of people along with the country they come from. For every row in this list, you told Kettle to look for the country (Country Name
field) in the countries stream (country
field), and to give you back a language and the percentage of people that speaks that language (language
and percentage
fields). Let's explain it with a sample row: The row for Francisco Delgado
from Spain
. When this row gets to the Stream lookup step, Kettle looks in the list of countries for a row with the country Spain
. It finds it. Then, it returns the value of the columns language
and percentage: Spanish
and 74.4
.
Now take another sample row—the row with the country Russia
. When the row gets to the Stream lookup step, Kettle looks for it in the list of countries, but it doesn't find it. So what you get as language is a null string.
Whether the country is found or not, two new fields are added to your stream—language
and percentage
.
After the Stream lookup step, you discarded the rows where language
is null, that is, those whose country wasn't found in the list of countries.
With the successful rows you generated an output file.
The Stream lookup step allows you to look up data in a secondary stream.
You tell Kettle which of the incoming streams is the stream used to look up, by selecting the right choice in the Lookup step list.
The upper grid in the configuration window allows you to specify the names of the fields that are used to look up.
In the left column, Field, you indicate the field of your main stream. You can fill this column by using the Get Fields button, and deleting all the fields you don't want to use for the search.
In the right column, Lookup Field, you indicate the field of the secondary stream.
When a row of data comes to the step, a lookup is made to see if there is a row in the secondary stream for which, every pair (Field
, LookupField
) in the grid has the value of Field
equal to the value of LookupField
. If there is one, the look up will be successful.
In the lower grid, you specify the names of the secondary stream fields that you want back as a result of the look up. You can fill this column by using the Get lookup fields button, and deleting all the fields you don't want to retrieve.
After the lookup, new fields are added to your dataset—one for every row of this grid.
For the rows for which the look up is successful, the values for the new fields will be taken from the lookup stream.
For the others, the fields will remain null, unless you set a default value.
When you use a Stream lookup, all lookup data is loaded into memory. Then the stream lookup is made using a hash table algorithm. Even if you don't know how this algorithm works, it is important that you know the implications of using this step:
The tutorial where you counted the words in a file worked pretty well, but you may have noticed that it has some details you can fix or enhance.
You discarded a very small list of words, but there are much more that are quite usual in English—prepositions, pronouns, auxiliary verbs, and many more. So here is the challenge: Get a list of commonly used words and save it in a file. Instead of excluding words from a small list as you did with a Filter rows step, exclude the words that are in your common words file.
3.145.47.252