The simplest files to read are those where all rows follow the same pattern: each row has a fixed number of columns, and each column holds the same kind of data in every row. However, it is common to find files where the information does not have that format. On many occasions, files have little or no structure. Suppose you have a file with roller coaster descriptions that looks like the following:
JOURNEY TO ATLANTIS
SeaWorld Orlando
Journey to Atlantis is a unique thrill ride since it is ...
Roller Coaster Stats
Drop: 60 feet
Trains: 8 passenger boats
Train Mfg: Mack
KRAKEN
SeaWorld Orlando
Named after a legendary sea monster, Kraken is a ...
Kraken begins with a plunge from a height of 15-stories ...
Roller Coaster Stats
Height: 151 feet
Drop: 144 feet
Top Speed: 65 mph
Length: 4,177 feet
Inversions: 7
Trains: 3 - 32 passenger
Ride Time: 2 minutes, 2 seconds
KUMBA
Busch Gardens Tampa
...
As you can see, the preceding file is far from being a structured file that you can read simply by configuring a Text file input step. Following this recipe, you will learn how to deal with this kind of file.
When you have to read an unstructured file, such as the preceding sample file, the first thing to do is to take a detailed look at it. Try to understand how the data is organized; despite being unstructured, it has a hidden format that you have to discover in order to be able to read it.
So, let's analyze the sample file, which is available for download from the book's site.
The file has data about several roller coasters. Let's take note of the characteristics of the file:

- The first line of each roller coaster is the name of the attraction, written entirely in uppercase letters.
- The line that follows the name is the name of the park where the roller coaster is located.
- The stats lines have the format code:description, as for example Drop: 60 feet.
- The stats are preceded by a line with the fixed text Roller Coaster Stats, which doesn't add any information. It should be discarded.
- The remaining lines contain additional comments about the attraction.

Once you understand the content of your file, you are ready to read it and parse it.
Carry out the following steps:
1. Create a new transformation and use a Text file input step to read the file rollercoasters_II.txt. Under the Content tab, uncheck the Header option and, in the Separator textbox, type |. Under the Fields tab, enter a single field named text of type String. As the character | is not present in any part of the file, you are sure that the whole line will be read as a single field.
2. Add a Modified Java Script Value step and type the following code:

var attraction;
trans_Status = CONTINUE_TRANSFORMATION;
if (getProcessCount('r') == 1) attraction = '';
if (text == upper(removeDigits(text))) {
  attraction = text;
  trans_Status = SKIP_TRANSFORMATION;
}
else if (text == 'Roller Coaster Stats')
  trans_Status = SKIP_TRANSFORMATION;

3. In the lower grid of the JavaScript step, add the new field attraction, so that it is added to the stream.
4. Add an Add value fields changing sequence step. As Result field, type line_nr. In the first row of the grid, type attraction.
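To see the idea behind steps 2 to 4, the same grouping logic can be sketched in plain JavaScript outside Kettle. This is an illustration only: the function names are made up, and the Kettle functions upper and removeDigits are replaced by their standard JavaScript equivalents.

```javascript
// A line is an attraction header when it equals its own uppercase
// form once digits are removed (e.g. "KRAKEN", but not "Drop: 60 feet").
function isAttractionHeader(line) {
  const noDigits = line.replace(/[0-9]/g, '');
  return line === noDigits.toUpperCase();
}

// Walk the file lines, carrying the current attraction along and
// numbering the lines of each group (the role of the
// Add value fields changing sequence step).
function groupLines(lines) {
  const rows = [];
  let attraction = '';
  let lineNr = 0;
  for (const text of lines) {
    if (isAttractionHeader(text)) {   // a new roller coaster starts here
      attraction = text;
      lineNr = 0;
      continue;                       // the header row itself is skipped
    }
    if (text === 'Roller Coaster Stats') continue; // discarded row
    lineNr += 1;
    rows.push({ attraction, line_nr: lineNr, text });
  }
  return rows;
}
```

Each emitted row carries the attraction it belongs to plus its position in the group, which is exactly the state the PDI stream is in after step 4.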
So far, you've read the file and identified all the rows belonging to each roller coaster. It's time to parse the different kinds of lines. In the first place, let's parse the lines that contain properties:
5. After the Add value fields changing sequence step, add a Filter rows step with the condition text REGEXP (.+):(.+).
6. From the true output of the filter, add a Regex Evaluation step. As Regular expression, type (.+):(.+). In the capture group fields grid, type code in the first row and desc in the second. In both rows, under Type, select String, and under Trim, select both.
7. Add a Select values step to keep only the fields attraction, code, desc, and line_nr.

In order to do a preview to see how the steps are transforming your data, you can add a Dummy step and send the false rows of the Filter rows step towards it. The only purpose of this is to avoid the transformation crashing.
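Outside Kettle, the combined effect of the filter condition and the Regex Evaluation step can be sketched in a few lines of plain JavaScript. The function name is illustrative only; it shows how the pattern (.+):(.+) yields the two trimmed fields code and desc.

```javascript
// Sketch of what the Regex Evaluation step does with (.+):(.+):
// match the line, capture the two groups, and trim both of them.
function parseProperty(text) {
  const m = /^(.+):(.+)$/.exec(text);
  if (m === null) return null;   // the row is not a property line
  return { code: m[1].trim(), desc: m[2].trim() };
}
```

A non-matching row returns null, which corresponds to the rows the Filter rows step sends down its false output.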
Now you will parse the other lines: the lines that contain the park name, and the additional comments:
8. From the false output of the first Filter rows step, add another Filter rows step with the condition line_nr = 1.
9. Add an Add constants step that adds a field named code with value park. Make sure the true rows of the Filter rows step go toward this step.
10. Add another Add constants step, this time adding a field named code with value additional_information. Make sure the false rows of the Filter rows step go toward this step.
11. After each of these two steps, add a Select values step to keep only the fields attraction, code, text, and line_nr. In the same step, rename text as desc.

Make sure that the fields are in this exact order. The metadata of both Select values steps must coincide.
Now that you have parsed all the types of rows, it's time to join the rows together:

12. Finally, join the three streams back into a single one, ordered by the fields attraction and line_nr.
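Putting the pieces together, here is a minimal plain-JavaScript sketch of the flow after the grouping: classify each row as a property, the park name, or additional information, then merge the streams ordered by attraction and line_nr. The row and function names mimic the fields used in the recipe and are illustrative, not PDI code.

```javascript
// Classify one grouped row into the unified (attraction, code, desc) layout.
function classifyRow(row) {
  const m = /^(.+):(.+)$/.exec(row.text);
  if (m) {
    // property stream: e.g. "Drop: 144 feet"
    return { attraction: row.attraction, code: m[1].trim(),
             desc: m[2].trim(), line_nr: row.line_nr };
  }
  // first line of each group is the park; the rest are comments
  const code = row.line_nr === 1 ? 'park' : 'additional_information';
  return { attraction: row.attraction, code, desc: row.text,
           line_nr: row.line_nr };
}

// Merge all rows into one stream, ordered by attraction and line_nr.
function mergeStreams(rows) {
  return rows.map(classifyRow).sort((a, b) =>
    a.attraction === b.attraction
      ? a.line_nr - b.line_nr
      : (a.attraction < b.attraction ? -1 : 1));
}
```

The ordering step mirrors the final merge of the three PDI streams, where all branches must share the same field layout before being joined.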
When you have an unstructured file, the first thing to do is to understand its content, in order to be able to parse the file properly.
If the entities described in the file (roller coasters in this example) span several lines, the very first task is to identify the rows that make up a single entity. The usual way to do it is with a JavaScript step. In this example, the JavaScript code used the fact that the first line of each roller coaster was written in uppercase letters to create and fill a field named attraction. In the same code, you removed the unwanted lines.
In this example, as you needed to know which row was the first in each group, you added an Add value fields changing sequence step.
After doing this, which as noted is only necessary for this particular kind of file, you have to parse the lines. If the lines do not follow the same pattern, you have to split your stream into as many streams as kinds of rows you have. In this example, you split the main stream into three, as follows:

- One stream with the property lines, that is, the lines with the format code: description, as for example Drop: 60 feet.
- One stream with the lines containing the park name.
- One stream with the lines containing additional comments.

In each stream, you proceeded differently according to the format of the line.
The most useful step for parsing individual unstructured fields is the Regex Evaluation step. It both validates whether a field follows a given pattern (provided as a regular expression) and, optionally, captures groups. In this case, you used that step to capture a code and a description. In the preceding example (Drop: 60 feet), the Regex Evaluation step allowed you to build two fields: code with value Drop, and desc with value 60 feet.
Once you have parsed the line with the Regex Evaluation step, or the step of your choice, you can continue transforming or modifying the fields according to your needs and the characteristics of your particular file.
In the same way, depending on the purpose of your transformation, you can leave the streams separated or join them back together as you did in the recipe.
There are some common kinds of files that can be parsed in the way you parsed the roller coasters file:
Suppose that you have a file of invoices such as the following:
INV.0001-0045;02/28/2010;$323.99
CSD-031;2;$34.00
CSA-110;1;$100.99
LSK-092;1;$189.00
INV.0001-0046;02/28/2010;$53.99
DSD-031;2;$13.00
CXA-110;1;$40.99
INV.0001-0047;02/28/2010;$1149.33
...
The lines beginning with INV. are the invoice headers; the lines following each header are the details of that invoice.
Files like these are not uncommon. If you have a file like this with records that represent headers followed by records that represent details about those headers, and the header and detail records have different structures, you could parse it as explained in the recipe.
Read the file, do whatever is necessary to find out if a row is a header or a detail, and split the stream in two. After that, parse header rows and detail rows accordingly.
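As an illustration of that header/detail logic in plain JavaScript (the field names are assumptions, not taken from the file itself):

```javascript
// Split a header/detail file: lines starting with "INV." are invoice
// headers; every other line is a detail of the most recent header.
function parseInvoices(lines) {
  const invoices = [];
  for (const line of lines) {
    const fields = line.split(';');
    if (line.startsWith('INV.')) {
      invoices.push({ number: fields[0], date: fields[1],
                      total: fields[2], details: [] });
    } else if (invoices.length > 0) {
      // attach the detail to the last header seen
      invoices[invoices.length - 1].details.push(
        { product: fields[0], quantity: fields[1], price: fields[2] });
    }
  }
  return invoices;
}
```

In PDI terms, the startsWith test plays the role of the Filter rows step, and the two branches of the if correspond to the two streams parsed with different field layouts.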
Log files are among the most common kinds of unstructured files. Look at the following sample lines belonging to a Pentaho Server log:
...
2010-09-30 13:01:30,437 DEBUG [org.pentaho.platform.engine.services.solution.SolutionEngine] fd386728-ccab-11df-9...
2010-09-30 13:01:30,484 INFO [org.pentaho.platform.repository.solution.SolutionRepositoryBase] Solution Reposito...
2010-09-30 13:01:30,484 INFO [org.pentaho.platform.repository.solution.SolutionRepositoryBase] Solution Reposit...
2010-09-30 13:01:30,515 INFO [org.pentaho.platform.repository.solution.SolutionRepositoryBase] Could not find d...
2010-09-30 13:01:30,531 ERROR [org.pentaho.platform.engine.services.solution.SolutionEngine] fd386728-ccab-11df-...
2010-09-30 13:01:42,515 WARN [org.hibernate.cache.EhCacheProvider] Could not find configuration [file]; using d...
...
In this case, all lines begin with a timestamp, followed by the level of log (DEBUG, INFO, and so on), and then the details of the log.
Despite being unstructured, the lines in a log file such as this one contain some text that lets you know what kind of data is in each line. Using that knowledge, you can parse the different lines as explained in the recipe.
In this particular example, you could read the file as containing two fields: one with the timestamp, and the other with the rest of the line. Then you can parse the second field, splitting it in two: the level of log (DEBUG, INFO, and so on) and the detail. Optionally, if you wanted to treat each level of log differently, you could split the stream with a Switch / Case step or nested Filter rows steps and proceed.
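A minimal sketch of that parsing in plain JavaScript, assuming the timestamp and level formats shown in the sample lines (the function name is illustrative):

```javascript
// Matches lines like:
// "2010-09-30 13:01:30,437 DEBUG [org.pentaho...] message"
const LOG_LINE =
  /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) +(DEBUG|INFO|WARN|ERROR|FATAL) +(.*)$/;

// Split one log line into timestamp, level, and detail.
function parseLogLine(line) {
  const m = LOG_LINE.exec(line);
  if (m === null) return null;   // e.g. a continuation or garbage line
  return { timestamp: m[1], level: m[2], detail: m[3] };
}
```

The same pattern could be typed directly into a Regex Evaluation step, with three capture group fields for the timestamp, the level, and the detail.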