Importing data from tab-delimited files

Another very common flat-file format is the tab-delimited file. It may come from an Excel export, or it may be the output of some custom software that we must take our input from.

The good news is that this format can usually be read in almost the same way as CSV files. The Python csv module supports so-called dialects, which let us use the same principles to read variations of similar file formats, one of them being the tab-delimited format.

Getting ready

You should already be able to read CSV files. If not, please refer to the Importing data from CSV recipe first.

How to do it...

We will reuse the code from the Importing data from CSV recipe; all we need to change is the dialect we are using, as shown in the following code:

import csv
import sys

filename = 'ch02-data.tab'

header = None
data = []
try:
    with open(filename) as f:
        # excel_tab is the Excel dialect with a tab as the delimiter
        reader = csv.reader(f, dialect=csv.excel_tab)
        # the first row holds the column names
        header = next(reader)
        data = [row for row in reader]
except csv.Error as e:
    print("Error reading CSV file at line %s: %s" % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('===================')

for datarow in data:
    print(datarow)

How it works...

The dialect-based approach is very similar to what we already did in the Importing data from CSV recipe, except for the line where we instantiate the csv reader object, giving it the parameter dialect and specifying the excel_tab dialect that we want.
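To make the role of the dialect concrete, here is a minimal sketch that reads a small in-memory sample instead of a file. The sample text and variable names are made up for illustration, and the sketch assumes Python 3 (io.StringIO holding text).

```python
import csv
import io

# Hypothetical in-memory sample standing in for a tab-delimited file.
sample = "name\tage\nAnn\t31\nBob\t45\n"

# The excel_tab dialect differs from excel only in its delimiter.
print(repr(csv.excel.delimiter))
print(repr(csv.excel_tab.delimiter))

# Passing the dialect class makes the reader split on tabs.
rows = list(csv.reader(io.StringIO(sample), dialect=csv.excel_tab))

# The registered name 'excel-tab' selects the same dialect.
rows_by_name = list(csv.reader(io.StringIO(sample), dialect='excel-tab'))

print(rows)
```

Either form of the dialect argument yields the same rows, so the choice is mostly a matter of taste.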

There's more...

The csv-based approach does not work well if the data is "dirty", that is, if some lines end not with just a newline character but with additional tab (\t) markers before it. So we need to clean such lines separately before splitting them. A sample "dirty" tab-delimited file can be found in ch02-data-dirty.tab. The following code sample cleans the data as it reads it:

datafile = 'ch02-data-dirty.tab'

with open(datafile, 'r') as f:
    for line in f:
        # uncomment the next line to see the line before cleanup
        # print('DIRTY: ', line.split('\t'))

        # strip whitespace (including stray tabs) from both ends of the line
        line = line.strip()

        # now split the line on the tab delimiter
        print(line.split('\t'))

We can also see that there is another approach to do this: using the split('\t') method.
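The effect of the cleanup step can be seen on a single line. The following is a small sketch with a made-up "dirty" line (not taken from ch02-data-dirty.tab) that compares splitting before and after stripping.

```python
# Hypothetical "dirty" line: two stray tabs before the newline.
dirty_line = "Ann\t31\tOslo\t\t\n"

# Splitting as-is leaves an empty field and the newline in the result.
raw_fields = dirty_line.split('\t')

# Stripping first removes the trailing tabs and the newline.
clean_fields = dirty_line.strip().split('\t')

print(raw_fields)
print(clean_fields)
```

Note that strip() also removes leading tabs, so a row whose first field is empty would lose that column; rstrip() may be a better fit in that case.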

The advantage of using the csv module approach over split() is that we can reuse the same code for reading by just changing the dialect and detecting it with the file extension (.csv and .tab) or some other method (for example, using the csv.Sniffer class).
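Both ways of choosing the dialect might look roughly like the following sketch. The DIALECTS mapping and the helper names dialect_for and sniff_dialect are hypothetical, introduced only for illustration; csv.Sniffer().sniff() is the standard library call, here restricted to commas and tabs as candidate delimiters.

```python
import csv
import os

# Hypothetical mapping from file extension to csv dialect.
DIALECTS = {'.csv': csv.excel, '.tab': csv.excel_tab}


def dialect_for(filename):
    """Pick a dialect from the file extension (illustrative helper)."""
    ext = os.path.splitext(filename)[1].lower()
    return DIALECTS[ext]


def sniff_dialect(sample):
    """Let csv.Sniffer guess the dialect from a text sample."""
    return csv.Sniffer().sniff(sample, delimiters=',\t')


# Made-up sample text standing in for the first lines of a file.
sample = "name\tage\tcity\nAnn\t31\tOslo\nBob\t45\tRome\n"

print(dialect_for('ch02-data.tab') is csv.excel_tab)
print(repr(sniff_dialect(sample).delimiter))
```

The dialect returned by either helper can be passed straight to csv.reader() through its dialect parameter, so the reading code itself stays the same.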
