Importing data from fixed-width data files

Log files from events and time series data files are common sources for data visualizations. Sometimes we can read them using the CSV dialect for tab-separated data, but sometimes the fields are not separated by any specific character. Instead, each field has a fixed width, and we can infer the layout from the data and use it to extract the fields.

One way to approach this is to read a file line by line and then use string manipulation functions to split a string into separate parts. This approach seems straightforward, and if performance is not an issue, it should be tried first.
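As a minimal sketch of this string-manipulation approach (assuming Python 3), we can cut each line with slices. The column boundaries and the helper name split_fixed_width are assumptions for illustration, read off the sample records shown later in this recipe: nine characters, a space, 13 characters, a space, and four characters.

```python
# Column boundaries are an assumption taken from the sample records:
# 9 characters, a space, 13 characters, a space, 4 characters.
FIELD_SLICES = [slice(0, 9), slice(10, 23), slice(24, 28)]


def split_fixed_width(line):
    """Cut one record into its fields using plain string slicing."""
    return [line[s].strip() for s in FIELD_SLICES]


sample = '207152670 3984356804116 9532\n'
print(split_fixed_width(sample))
```

Keeping the boundaries in one list means a change in the file layout only has to be made in one place.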

If performance is more important or the file to parse is large (hundreds of megabytes), using the Python module struct (http://docs.python.org/library/struct.html) can speed up parsing, as the module is implemented in C rather than in Python.

Getting ready

As the module struct is part of the Python Standard Library, we don't need to install any additional software to implement this recipe.

How to do it...

We will use a pregenerated dataset with a million rows of fixed-width records. Here's what sample data looks like:

…
207152670 3984356804116 9532
427053180 1466959270421 5338
316700885 9726131532544 4920
138359697 3286515244210 7400
476953136 0921567802830 4214
213420370 6459362591178 0546
…

This dataset is generated using code that can be found in the repository for this chapter, in ch02-generate_f_data.py.

Now we can read the data by carrying out the following steps:

  1. Define the data file to read.
  2. Define the mask for how to read the data.
  3. Read line by line using the mask to unpack each line into separate data fields.
  4. Print each line as separate fields.

This is shown in the following code snippet:

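import struct

datafile = 'ch02-fixed-width-1M.data'

# this is where we define how to
# understand a line of data from the file:
# three string fields of 9, 14, and 5 bytes
mask = '9s14s5s'

# compile the mask once and reuse it for every line
record = struct.Struct(mask)

# struct works on bytes, so open the file in binary mode
with open(datafile, 'rb') as f:
    for line in f:
        fields = record.unpack_from(line)
        # decode each field to text and drop the padding spaces
        print('fields: ', [field.decode('ascii').strip() for field in fields])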

How it works...

We define our format mask according to what we have previously seen in the data file. To inspect the file, we could have used Linux shell commands such as head or more.

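Format strings are used to define the expected layout of the data to extract. We use format characters to define what type of data we expect. So if the mask is defined as 9s14s5s, we can read that as "a string nine characters wide, followed by a string 14 characters wide, followed by a string five characters wide." The 14-character and five-character fields each include the single space that precedes the digits in the file, which is why the three widths add up to the full 28 characters of a record.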

In general, c defines the character (the char type in C) or a string of length 1, s defines a string (the char[] type in C), d defines a float (the double type in C), and so on. The complete table is available on the official Python website at http://docs.python.org/library/struct.html#format-characters.
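A short sketch of these format characters in use may help. It assumes Python 3, where struct works on bytes objects; the record is one of the sample lines from this recipe, and the little-endian prefix on the double is a choice made only for this illustration.

```python
import struct

# Size in bytes of the record described by the mask: 9 + 14 + 5.
print(struct.calcsize('9s14s5s'))

# 's' takes its repeat count as the field length; 'c' is a single byte.
record = b'207152670 3984356804116 9532'
print(struct.unpack('9s14s5s', record))
print(struct.unpack('c', b'x'))

# 'd' is an 8-byte C double; pack one and unpack it again.
# '<' selects little-endian, standard-size layout (an arbitrary choice here).
packed = struct.pack('<d', 1.5)
print(len(packed), struct.unpack('<d', packed))
```

Note that struct.unpack requires the buffer to be exactly as long as the format, which is why the record here carries no trailing newline.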

We then read the file line by line and use the unpack_from method to extract the fields from each line according to the specified format. Because we might have extraneous spaces before (or after) our fields, we use strip() to strip every extracted field.
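One thing to watch for is that unpack_from raises struct.error when a line is shorter than the format, for example a blank line at the end of the file. The following sketch, assuming Python 3, skips such lines. The io.BytesIO object is only a stand-in for the real data file, and skipping short lines silently is an assumption about what we want; logging or raising may suit other cases better.

```python
import io
import struct

record = struct.Struct('9s14s5s')

# io.BytesIO stands in for the real data file opened in binary mode.
data = io.BytesIO(
    b'207152670 3984356804116 9532\n'
    b'\n'
    b'427053180 1466959270421 5338\n'
)

rows = []
for line in data:
    # Skip lines too short to hold one full record.
    if len(line) < record.size:
        continue
    rows.append([field.strip() for field in record.unpack_from(line)])

print(rows)
```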

For unpacking, we used the object-oriented (OO) approach with the struct.Struct class, but we could just as well have used the module-level function, in which case the line would be as shown here:

fields = struct.unpack_from(mask, line)

The only difference is how the format is supplied: the module-level function takes the mask on every call, while a struct.Struct object is built from the mask once and can then be reused. If we are to perform more processing using the same formatting mask, the OO approach saves us from stating that format in every call. Moreover, it lets us subclass struct.Struct in the future, extending it or providing additional functionality for specific needs.
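As an illustration of extending the class, here is a minimal sketch assuming Python 3. The subclass name RecordStruct and its method unpack_clean are hypothetical and not part of the struct module.

```python
import struct


class RecordStruct(struct.Struct):
    """Hypothetical subclass: unpacks a record and cleans each field."""

    def unpack_clean(self, line):
        # unpack_from reads self.size bytes from the start of the buffer
        # and ignores anything after them, such as the trailing newline.
        return [f.decode('ascii').strip() for f in self.unpack_from(line)]


record = RecordStruct('9s14s5s')  # the format is stated once
lines = [
    b'207152670 3984356804116 9532\n',
    b'427053180 1466959270421 5338\n',
]
for line in lines:
    print(record.unpack_clean(line))
```

The subclass adds a method without overriding the constructor, so it is created exactly like a plain struct.Struct object.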
