Importing data from CSV

In this recipe, we'll work with the most common file format that you will encounter in the wild world of data: CSV. It stands for Comma Separated Values, which explains almost all the formatting there is. (A file may also start with a header row, but those values are comma separated too.)

Python has a module called csv that supports reading and writing CSV files in various dialects. Dialects are important because there is no single standard CSV format, and different applications implement CSV in slightly different ways. A file's dialect can usually be recognized by looking at the first few lines of the file.
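As a minimal sketch of what a dialect difference looks like in practice, the following Python 3 snippet lists the dialects registered with the csv module and reads semicolon-delimited text by overriding only the delimiter. The sample text is made up for illustration and is held in memory, so no file is needed.

```python
import csv
import io

# Hypothetical sample: semicolon-delimited text, as some spreadsheet
# exports produce. Held in memory so the sketch needs no file on disk.
sample = io.StringIO("name;score\nalice;10\nbob;7\n")

# Names of the dialects registered with the csv module (includes 'excel').
builtin_dialects = csv.list_dialects()

# Override only the delimiter; all other settings follow the default
# 'excel' dialect.
rows = list(csv.reader(sample, delimiter=';'))

print(builtin_dialects)
print(rows)
```

Passing delimiter=',' (the default) to the same text would instead return each line as a single-item list, which is usually the first visible symptom of a dialect mismatch.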

Getting ready

What we need for this recipe is the CSV file itself. We'll use sample CSV data that you can download from ch02-data.csv.

We assume that sample data files are in the same folder as the code reading them.

How to do it...

The following code example demonstrates how to import data from a CSV file. We will perform the following steps for this:

  1. Open the ch02-data.csv file for reading.
  2. Read the header first.
  3. Read the rest of the rows.
  4. If an error occurs while reading, report it and exit.
  5. After reading everything, print the header and the rest of the rows.

This is shown in the following code:

import csv
import sys

filename = 'ch02-data.csv'

header = None
data = []
try:
    with open(filename) as f:
        reader = csv.reader(f)
        # The first row holds the column names (None if the file is empty).
        header = next(reader, None)
        # Every remaining row is a list of string values.
        data = [row for row in reader]
except csv.Error as e:
    print("Error reading CSV file at line %s: %s" % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('==================')

for datarow in data:
    print(datarow)

How it works...

First, we import the csv module in order to access the required functions. Then, we open the data file using the with compound statement and bind it to the object f. The with statement acts as a context manager, which relieves us of having to close the resource ourselves when we are finished with it. It is a very handy way of working with file-like resources because it makes sure that the resource is freed (for example, that the file is closed) after the block of code has executed, even if an error occurs inside the block.
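The effect of the with statement can be checked directly. The following Python 3 sketch uses a hypothetical scratch file (created in a temporary directory so that it does not depend on ch02-data.csv) and inspects the closed attribute of the file object inside and after the block.

```python
import os
import tempfile

# Hypothetical scratch file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'scratch.csv')
with open(path, 'w') as f:
    f.write('a,b\n1,2\n')

with open(path) as f:
    first_line = f.readline()
    open_inside = not f.closed   # the file is still open inside the block

closed_after = f.closed          # the file was closed when the block ended

print(open_inside, closed_after)
```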

Then, we use the csv.reader() function, which returns a reader object that allows us to iterate over all the rows of the file. Every row is just a list of string values; the rows are printed in the loop at the end of the script.

Reading the first row is somewhat different as it is the header of the file and describes the data in each column. A header is not mandatory for CSV files and some files don't have one, but headers are a really nice way of providing minimal metadata about datasets. Sometimes, though, you will find separate text or even CSV files that are used only as metadata, describing the format and giving additional information about the data.
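Because the header names each column, it can also be used to label the values in every row. As a hedged illustration (this is not part of the recipe's code), the standard library's csv.DictReader does exactly that. The column names and values below are invented for the example, and the snippet is written for Python 3.

```python
import csv
import io

# Hypothetical in-memory sample with a header row.
sample = io.StringIO("name,score\nalice,10\nbob,7\n")

reader = csv.DictReader(sample)
# Each record maps a column name from the header to that row's value.
records = [dict(row) for row in reader]
# The header itself stays available as a list of column names.
columns = reader.fieldnames

print(columns)
print(records[0]['name'], records[0]['score'])
```

Note that the values are still strings; converting them to numbers remains the caller's job.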

The simplest way to check what the first line looks like is to open the file and inspect it visually (for example, by looking at its first few lines). On Linux, this can be done efficiently with shell commands such as head, as shown here:

$ head some_file.csv
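Where the head command is not available, the same quick look can be sketched in Python 3. The head() helper below is a hypothetical name introduced for this example, and the scratch file is generated only so that the snippet can run on its own.

```python
import os
import tempfile
from itertools import islice

def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as f:
        # islice stops after n lines, so the rest of the file is never read.
        return [line.rstrip('\n') for line in islice(f, n)]

# Hypothetical scratch file standing in for some_file.csv.
scratch = os.path.join(tempfile.mkdtemp(), 'some_file.csv')
with open(scratch, 'w') as f:
    f.write('h1,h2\n' + ''.join('%d,%d\n' % (i, i * i) for i in range(50)))

first_three = head(scratch, 3)
print(first_three)
```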

During iteration of data, we save the first row in header while we add every other row to the data list.

We can also check whether a CSV file appears to have a header by using the has_header() method of the csv.Sniffer class, which applies a heuristic to a sample of the file's text.
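The header check in the csv module lives on the csv.Sniffer class and takes a sample of the file's text rather than a filename. A minimal Python 3 sketch follows; the two sample strings are invented for illustration. The result is a guess based on whether the first row looks different in type from the rows below it, so it should be treated as a hint and not as a guarantee.

```python
import csv

# Hypothetical samples: the same data with and without a header row.
with_header = "name,score\nalice,10\nbob,7\ncarol,12\n"
without_header = "alice,10\nbob,7\ncarol,12\n"

sniffer = csv.Sniffer()
# has_header() compares the first row against the following rows; here the
# second column is numeric everywhere except in the header row.
guess_with = sniffer.has_header(with_header)
guess_without = sniffer.has_header(without_header)

print(guess_with, guess_without)
```

For a real file, a sample can be obtained by reading the first few kilobytes, for example with f.read(2048).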

Should any errors occur during reading, the reader object raises csv.Error, which we catch in order to print a helpful message that includes the line number, making the problem easier to locate.
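To see csv.Error in action, the following Python 3 sketch parses one well-formed and one malformed piece of text. It assumes the reader is created with strict=True, which makes the parser raise on bad input that it would otherwise tolerate; read_rows() is a hypothetical helper name, and both samples are invented.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text and return (rows_read_so_far, error_message_or_None)."""
    # strict=True makes the reader raise csv.Error on malformed input.
    reader = csv.reader(io.StringIO(text), strict=True)
    rows = []
    try:
        for row in reader:
            rows.append(row)
    except csv.Error as e:
        # line_num is the number of source lines read so far.
        return rows, "line %d: %s" % (reader.line_num, e)
    return rows, None

good_rows, good_err = read_rows('a,b\n1,2\n')
# Malformed: a stray character follows the closing quote on line 2.
bad_rows, bad_err = read_rows('a,b\n1,"2"x\n')

print(good_rows, good_err)
print(bad_rows, bad_err)
```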

There's more...

If you want to read about the background and reasoning behind the csv module, see PEP 305, CSV File API, which is available at http://www.python.org/dev/peps/pep-0305/.

If you have larger files that you want to load, it's often better to use a well-known library such as NumPy, whose loadtxt() function copes better with large CSV files.

The basic usage is simple as shown in the following code snippet:

import numpy
data = numpy.loadtxt('ch02-data.csv', dtype=str, delimiter=',')

Note that we need to define a delimiter to instruct NumPy to separate our data as appropriate. The function numpy.loadtxt() is somewhat faster than the similar function numpy.genfromtxt(), but the latter can cope better with missing data, and you are able to provide functions to express what is to be done during the processing of certain columns of loaded data files.

Note

In Python 2, the csv module doesn't support Unicode input, so you must explicitly encode the data you read as UTF-8 or printable ASCII. The official Python CSV documentation offers good examples of how to resolve data encoding issues.

In Python 3, the csv module works with Unicode text by default, and there are no such issues.
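As a minimal sketch for Python 3, the file's encoding is passed to open() and the csv module then works on decoded text. The scratch file and its contents are invented for the example, and UTF-8 is an assumption about how the file was saved. The csv documentation also recommends opening files with newline='' so that the reader handles line endings itself.

```python
import csv
import os
import tempfile

# Hypothetical scratch file containing non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), 'unicode.csv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    # '\u00fc' is u-umlaut and '\u00f3' is o-acute, written as escapes.
    csv.writer(f).writerows([['city', 'country'],
                             ['Z\u00fcrich', 'Schweiz'],
                             ['Krak\u00f3w', 'Polska']])

# State the encoding explicitly instead of relying on the platform default.
with open(path, encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))

print(rows)
```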
