Time for action validating genres with a Regex Evaluation step

In this tutorial you will read the modified films file and validate the genres field.

  1. Create a new transformation.
  2. Read the modified films file just as you did in the previous tutorial.
  3. In the Content tab, check the Rownum in output? option and fill the Rownum fieldname with the text rownum.
  4. Do a preview. You should see this:
    Time for action validating genres with a Regex Evaluation stepunexpected errors, avoidingdata, validating
  5. After the Text file input step, add a Regex Evaluation step. You will find it under the Scripting category of steps.
  6. Under the Step settings box, select Genres as the Field to evaluate, and type genres_ok as the Result Fieldname.
  7. In the Regular expression textbox type [A-Za-zs-]*(|[A-Za-zs-]*)* .
  8. Add the Filter rows step, an Add constants step, and two Text file output steps and link them as shown next:
    Time for action validating genres with a Regex Evaluation stepunexpected errors, avoidingdata, validating
  9. Edit the Add constants step.
  10. Add a String constant named err_code with value GEN_INV and a String constant named err_desc with value Invalid list of genres.
  11. Configure the Text file output step after the Add constant step to create the ${LABSOUTPUT}/films_err.txt file, with the fields rownum, err_code, and err_desc.
  12. Configure the other Text file output step to create the ${LABSOUTPUT}/films_ok.txt file, with the fields film, Year, Genres, Director, and Actors.
  13. Double-click the Filter rows step and add the condition genres_ok = Y, Y being a Boolean value. Send true data to the stream that generates the films_ok.txt file. Send false data to the other stream.
  14. Run the transformation.
  15. Check the generated files. The films_err.txt file looks like the following:
    rownum;err_code;err_desc
    12;GEN_INV;Invalid list of genres
    18;GEN_INV;Invalid list of genres
    20;GEN_INV;Invalid list of genres
    21;GEN_INV;Invalid list of genres
    22;GEN_INV;Invalid list of genres
    33;GEN_INV;Invalid list of genres
    34;GEN_INV;Invalid list of genres
    ...
    

The films_ok.txt file looks like this:

film;Year;Genres;Director;Actors
Persepolis;2007;Animation | Comedy | Drama | History;Vincent Paronnaud, Marjane Satrapi;Chiara Mastroianni, Catherine Deneuve, Danielle Darrieux
Trois couleurs - Rouge;1994;Drama;Krzysztof Kieslowski;Irène Jacob, Jean-Louis Trintignant, Frédérique Feder, Jean-Pierre Lorit, Samuel Le Bihan
Les Misérables;1933;Drama | History;Raymond Bernard;
...

What just happened?

You read the films file and checked that the Genres field was a list of strings separated by |. You created two files:

  • One file with the valid rows.
  • Another file with the rows with an invalid Genres field. Note that the rownum field you added when you read the file is used here for identifying the wrong lines.

In order to check the validity of the Genres field, you used a regular expression. The expression you typed accepts any combination of characters, spaces, or hyphens separated by a pipe. The * symbol allows empty genres as well. For a detailed explanation of regular expressions, please refer to Chapter 2.

Validating data

As said, you would validate data mainly for two reasons:

  • To prevent the transformation from aborting because of unexpected errors
  • To check that your data meets some pre-existing requirements

For example, consider some of the sample data from previous chapters:

  • In the match file, the results field had to be a string formed by two numbers separated by a -
  • In the real estate file, the flag for Fireplace had to be Yes or No
  • In the contest file the name of the country had to be a valid country, not a random string

If your data doesn't meet these requirements, it is possible that you don't have errors but you will still be working with invalid data.

In the last tutorial you just validated one of the fields. If you want to validate more than one field, you have a specific step that simplifies that work: The Data Validator.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.12.163.180