Reading unstructured files

The simplest files to read are those where all rows follow the same pattern: each row has a fixed number of columns, and each column holds the same kind of data in every row. However, it is common to have files where the information does not follow that format. On many occasions, files have little or no structure. Suppose you have a file with roller coaster descriptions that looks like the following:

JOURNEY TO ATLANTIS
SeaWorld Orlando
Journey to Atlantis is a unique thrill ride since it is ...
Roller Coaster Stats
Drop: 60 feet
Trains: 8 passenger boats
Train Mfg: Mack
KRAKEN
SeaWorld Orlando
Named after a legendary sea monster, Kraken is a ...
Kraken begins with a plunge from a height of 15-stories ...
Roller Coaster Stats
Height: 151 feet
Drop: 144 feet
Top Speed: 65 mph
Length: 4,177 feet
Inversions: 7
Trains: 3 - 32 passenger
Ride Time: 2 minutes, 2 seconds
KUMBA
Busch Gardens Tampa
...

As you can see, the preceding file is far from being a structured file that you can read simply by configuring a Text file input step. By following this recipe, you will learn how to deal with this kind of file.

Getting ready

When you have to read an unstructured file, such as the preceding sample file, the first thing to do is to take a detailed look at it. Try to understand how the data is organized; despite being unstructured, it has a hidden format that you have to discover in order to be able to read it.

So, let's analyze the sample file, which is available for download from the book's site.

The file has data about several roller coasters. Let's take note of the characteristics of the file:

Tip

As a useful exercise, you could do this yourself before reading the following list.

  • Each roller coaster spans several lines.
  • There are blank lines, which should be eliminated.
  • What distinguishes the first line of each roller coaster from the rest is that it is written entirely in uppercase letters.
  • The first line below the name of the roller coaster is the name of the amusement park where it is located.
  • Most of the lines contain a property of the roller coaster in the format code: description, for example, Drop: 60 feet.
  • Above the properties, there is a line with the text Roller Coaster Stats, which doesn't add any information and should be discarded.
  • There are lines with additional information about the roller coaster. There is nothing that distinguishes these lines. They simply do not fall into any of the other kinds of lines (lines with the name of the park, lines with properties of the roller coaster, and so on).
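
To make the analysis concrete, the following standalone JavaScript sketch (not part of the recipe; the function name and return values are just illustrative) classifies a line according to the characteristics listed above. Note that the park line cannot be told apart by its content alone, which is why the recipe identifies it later by its position within each group.

    // Illustrative only: classify a line of the sample file by its content.
    function classifyLine(text) {
        if (text == '') return 'blank';                          // to be eliminated
        if (text == 'Roller Coaster Stats') return 'discard';    // adds no information
        if (text == text.toUpperCase() && !/\d/.test(text))
            return 'name';                                       // e.g. KRAKEN
        if (/.+:.+/.test(text)) return 'property';               // e.g. Drop: 60 feet
        return 'additional_information';                         // anything else
    }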

Once you understand the content of your file, you are ready to read and parse it.

How to do it...

Carry out the following steps:

  1. Create a transformation and drag a Text file input step.
  2. Use that step to read the file named rollercoasters_II.txt. Under the Content tab, uncheck the Header option and, as Separator, type |. Under the Fields tab, enter a single field named text of type String. As the character | is not present anywhere in the file, you can be sure that each whole line will be read as a single field.
  3. From the Scripting category of steps, add a Modified Java Script Value step, double-click it, and under the main tab window type the following snippet of code:
    var attraction;
    trans_Status = CONTINUE_TRANSFORMATION;

    // Reset the attraction name when the first row is processed
    if (getProcessCount('r') == 1) attraction = '';

    // An all-uppercase line with no digits is a roller coaster name:
    // remember it and skip the row
    if (text == upper(removeDigits(text))) {
        attraction = text;
        trans_Status = SKIP_TRANSFORMATION;
    }
    // The Roller Coaster Stats line adds no information: skip it
    else if (text == 'Roller Coaster Stats')
        trans_Status = SKIP_TRANSFORMATION;
    
  4. Click on the Get variables button to populate the grid with the variable attraction.
  5. From the Transform category, add an Add value fields changing sequence step.
  6. Double-click the step. As Result field type line_nr. In the first row of the grid type attraction.
  7. Do a preview on this last step. You should see one row for each line of the original file, with the attraction field holding the name of the roller coaster the line belongs to, and line_nr numbering the lines within each roller coaster, starting at 1.

So far, you've read the file and identified the rows belonging to each roller coaster. Now it's time to parse the different kinds of lines. First, let's parse the lines that contain properties:

  1. Add a Filter rows step and enter the condition text REGEXP (.+):(.+).
  2. From the Scripting category, add a Regex Evaluation step and send the true rows toward this step.
  3. Configure the step as follows: As Field to evaluate select text. Check the Create fields for capture groups option. As Regular expression: type (.+):(.+).
  4. Fill the lower grid with two rows: as New field type code in the first row and desc in the second. In both rows, under Type, select String, and under Trim select both.
  5. Finally add a Select values step to select the fields attraction, code, desc, and line_nr.
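
If you are unsure of what the capture groups do, the following standalone JavaScript sketch mimics the behavior you just configured (the variable names are illustrative; in the transformation, the Regex Evaluation step does this for you):

    // Apply the same expression, (.+):(.+), to a sample property line
    var text = 'Drop: 60 feet';
    var match = text.match(/(.+):(.+)/);
    if (match != null) {
        var code = match[1].replace(/^\s+|\s+$/g, '');   // 'Drop'    (Trim: both)
        var desc = match[2].replace(/^\s+|\s+$/g, '');   // '60 feet' (Trim: both)
    }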

Tip

In order to do a preview and see how the steps are transforming your data, you can add a Dummy step and send the false rows of the Filter rows step towards it. The only purpose of this step is to keep the transformation from crashing.

Now you will parse the other lines: the lines that contain the park name, and the additional comments:

  1. Add another Filter rows step, and send the false rows of the other Filter rows step toward this one.
  2. Add two Add constants steps and a Select values step. Link them so that each output of the Filter rows step goes to one of the Add constants steps, and both Add constants steps go to the Select values step, as detailed in the following steps.
  3. In the Filter rows enter the condition line_nr=1.
  4. In the first Add constants step, add a String field named code with value park. Make sure the true rows of the Filter rows step go toward this step.
  5. In the other Add constants step, add a String field named code with value additional_information. Make sure the false rows of the Filter rows step go toward this step.
  6. Use the Select values step that joins the two Add constants steps to select the fields attraction, code, text, and line_nr. In the same step, rename text as desc.

Note

Make sure that the fields are selected in this exact order: the metadata of both Select values steps must coincide, because their rows will be merged in the next steps.

Now that you have parsed all the types of rows it's time to join the rows together.

  1. Join both Select values steps with a Sort rows step. Sort the rows by attraction and line_nr.
  2. Select the Sort rows step and do a preview. You should see all the parsed rows back together: for each roller coaster, a row with code park and the park name as desc, one row for each property, and a row with code additional_information for each descriptive line.

How it works...

When you have an unstructured file, the first thing to do is understand its content, in order to be able to parse the file properly.

If the entities described in the file (roller coasters in this example) span several lines, the very first task is to identify the rows that make up a single entity. The usual method is to do it with a JavaScript step. In this example, the JavaScript code took advantage of the fact that the first line of each roller coaster is written in uppercase letters to create and fill a field named attraction. In the same code, you removed the unwanted lines.

In this example, as you needed to know which row was the first in each group, you added an Add value fields changing sequence step.
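
The following standalone sketch (plain JavaScript, not a Kettle step; the sample rows are made up) shows the logic that the Add value fields changing sequence step applies: the counter restarts every time the value of the grouping field changes.

    // Restart a counter whenever the grouping field (attraction) changes.
    // The sample rows are illustrative.
    var rows = [
        { attraction: 'KRAKEN', text: 'SeaWorld Orlando' },
        { attraction: 'KRAKEN', text: 'Height: 151 feet' },
        { attraction: 'KUMBA',  text: 'Busch Gardens Tampa' }
    ];
    var prev = null;
    var line_nr = 0;
    for (var i = 0; i < rows.length; i++) {
        line_nr = (rows[i].attraction == prev) ? line_nr + 1 : 1;
        prev = rows[i].attraction;
        rows[i].line_nr = line_nr;   // 1, 2, 1
    }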

After doing this, which as noted is only necessary for this kind of file, you have to parse the lines. If the lines do not all follow the same pattern, you have to split your stream into as many streams as kinds of rows you have. In this example, you split the main stream into three, as follows:

  1. One for parsing the lines with properties, for example Drop: 60 feet.
  2. One for setting the name of the amusement park where the roller coaster is located.
  3. One for keeping the additional information.

In each stream, you proceeded differently according to the format of the line.

The most useful step for parsing individual unstructured fields is the Regex Evaluation step. It validates whether a field matches a given pattern (provided as a regular expression) and, optionally, captures groups. In this case, you used that step to capture a code and a description. In the preceding example (Drop: 60 feet), the Regex Evaluation step allowed you to build two fields: code with value Drop, and desc with value 60 feet.

Once you have parsed the lines with the Regex Evaluation step, or the step of your choice, you can continue transforming or modifying the fields according to your needs and the characteristics of your particular file.

Likewise, depending on the purpose of your transformation, you can leave the streams separated or join them back together, as you did in this recipe.

There's more...

There are some common kinds of files that can be parsed in the way you parsed the roller coasters file:

Master/detail files

Suppose that you have a file of invoices such as the following:

INV.0001-0045;02/28/2010;$323.99
CSD-031;2;$34.00
CSA-110;1;$100.99
LSK-092;1;$189.00
INV.0001-0046;02/28/2010;$53.99
DSD-031;2;$13.00
CXA-110;1;$40.99
INV.0001-0047;02/28/2010;$1149.33
...

The lines beginning with INV. are the invoice headers; the lines following the headers are the details of those invoices.

Files like this are not uncommon. If you have a file with records that represent headers, followed by records that represent details about those headers, and the header and detail records have different structures, you can parse it as explained in the recipe.

Read the file, do whatever is necessary to find out if a row is a header or a detail, and split the stream in two. After that, parse header rows and detail rows accordingly.
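
As a rough sketch only (plain JavaScript rather than a Kettle transformation; the field names are illustrative), the header/detail logic described above could look like this:

    // A line is a header if it starts with 'INV.'; otherwise it is a detail
    // belonging to the last header seen. Field names are illustrative.
    var lines = [
        'INV.0001-0045;02/28/2010;$323.99',
        'CSD-031;2;$34.00',
        'CSA-110;1;$100.99'
    ];
    var currentInvoice = null;
    for (var i = 0; i < lines.length; i++) {
        var parts = lines[i].split(';');
        if (lines[i].indexOf('INV.') == 0) {
            // Header row: invoice number, date, and total
            currentInvoice = { number: parts[0], date: parts[1], total: parts[2] };
        } else {
            // Detail row: product, quantity, and amount, linked to the last header
            var detail = { invoice: currentInvoice.number, product: parts[0],
                           quantity: parts[1], amount: parts[2] };
        }
    }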

Log files

Log files are among the most common kinds of unstructured files. Look at the following sample lines belonging to a Pentaho Server log:

...
2010-09-30 13:01:30,437 DEBUG [org.pentaho.platform.engine.services.solution.SolutionEngine] fd386728-ccab-11df-9...
2010-09-30 13:01:30,484 INFO [org.pentaho.platform.repository.solution.SolutionRepositoryBase] Solution Reposito...
2010-09-30 13:01:30,484 INFO [org.pentaho.platform.repository.solution.SolutionRepositoryBase] Solution Reposit...
2010-09-30 13:01:30,515 INFO [org.pentaho.platform.repository.solution.SolutionRepositoryBase] Could not find d...
2010-09-30 13:01:30,531 ERROR [org.pentaho.platform.engine.services.solution.SolutionEngine] fd386728-ccab-11df-...
2010-09-30 13:01:42,515 WARN [org.hibernate.cache.EhCacheProvider] Could not find configuration [file]; using d...
...

In this case, all lines begin with a timestamp, followed by the level of log (DEBUG, INFO, and so on), and then the details of the log.

Despite being unstructured, the lines in a log file such as the one shown above contain some text that lets you know what kind of data each line holds. Using that knowledge, you can parse the different kinds of lines as explained in the recipe.

In this particular example, you could read the file as containing two fields: one for the timestamp, and another with the rest of the line. Then you can parse the second field, splitting it into two: the log level (DEBUG, INFO, and so on) and the detail. Optionally, if you wanted to treat each level of log differently, you could split the stream with a Switch / Case step or nested Filter rows steps and proceed from there.
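
As a minimal sketch (plain JavaScript, not a Kettle step; the regular expression is only an assumption that happens to fit the sample lines), the parsing could look like this:

    // Split a log line into timestamp, log level, and detail.
    // The regular expression is an assumption based on the sample lines above.
    var line = '2010-09-30 13:01:42,515 WARN [org.hibernate.cache.EhCacheProvider] Could not find configuration [file]; using d...';
    var match = line.match(/^(\d{4}-\d{2}-\d{2} [\d:,]+)\s+(\w+)\s+(.*)$/);
    if (match != null) {
        var timestamp = match[1];   // 2010-09-30 13:01:42,515
        var level     = match[2];   // WARN
        var detail    = match[3];   // [org.hibernate.cache.EhCacheProvider] Could not ...
    }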
