Importing data from Microsoft Excel files

Although Microsoft Excel supports some charting, sometimes you need more flexible and powerful visualization and need to export data from existing spreadsheets into Python for further use.

A common approach to importing data from Excel files is to export data from Excel into CSV-formatted files and use the tools described in the previous recipe to import data using Python from the CSV file. This is a fairly easy process if we have one or two files (and have Microsoft Excel or OpenOffice.org installed), but if we are automating a data pipe for many files (as part of an ongoing data processing effort), we are not in a position to manually convert every Excel file into CSV. So, we need a way to read any Excel file.

Python has decent support for reading and writing Excel files through the project www.python-excel.org. This support is available in the form of different modules for reading and writing and is platform-independent; in other words, we don't have to run it on Windows in order to read Excel files.

The Microsoft Excel file format changed over time, and support for different versions is available in different Python libraries. The latest stable version of XLRD is 0.90 at the time of this writing and it has support for reading .xlsx files.

Getting ready

First, we need to install the required module. For this example, we will use the module xlrd. We will use pip in our virtual environment, as shown in the following code:

$ mkvirtualenv xlrdexample
(xlrdexample)$ pip install xlrd

After successful installation, use the sample file ch02-xlsxdata.xlsx.

How to do it...

The following code example demonstrates how to read a sample dataset from a known Excel file. We will do this as shown in the following steps:

  1. Open the file workbook.
  2. Find the sheet by name.
  3. Read the cells using the number of rows (nrows) and columns (ncols).
  4. For demonstration purposes, we only print the read dataset.

This is shown in the following code:

import xlrd

file = 'ch02-xlsxdata.xlsx'

wb = xlrd.open_workbook(filename=file)

ws = wb.sheet_by_name('Sheet1')

dataset = []

for r in xrange(ws.nrows):
    col = []
    for c in range(ws.ncols):
        col.append(ws.cell(r, c).value)
    dataset.append(col)

from pprint import pprint
pprint(dataset)

How it works...

Let's try to explain the simple object model that xlrd uses. At the top level, we have a workbook (the Python class xlrd.book.Book) that consists of one or more worksheets (xlrd.sheet.Sheet), and every sheet has a cell (xlrd.sheet.Cell) from which we can then read the value.

We load a workbook from a file using open_workbook(), which returns the xlrd.book.Book instance that contains all the information about a workbook like sheets. We access sheets using sheet_by_name(); if we need all sheets, we could use sheets(), which returns a list of the xlrd.sheet.Sheet instances. The xlrd.sheet.Sheet class has a number of columns and rows as attributes that we can use to infer ranges for our loop to access every particular cell inside a worksheet using the method cell(). There is an xrld.sheet.Cell class, though it is not something we want to use directly.

Note that the date is stored as a floating point number and not as a separate data type, but the xlrd module is able to inspect the value and try to infer if the data is in fact a date. So, we can inspect the cell type for the cell to get the Python date object. The module xlrd will return xlrd.XL_CELL_DATE as the cell type if the number format string looks like a date. Here is a snippet of code that demonstrates this:

from datetime import datetime
from xlrd import open_workbook, xldate_as_tuple
…
cell = sheet.cell(1, 0)
print cell
print cell.value
print cell.ctype
if cell.ctype == xlrd.XL_CELL_DATE:
    date_value = xldate_as_tuple(cell.value, book.datemode)
    print datetime(*date_value)

This field still has issues, so please refer to the official documentation and mailing list in case you require extensive work with dates.

There's more...

A neat feature of xlrd is its ability to load only parts of the file that are required in the memory. There is an on_demand parameter that can be passed as True value while calling open_workbook so that the worksheet will only be loaded when requested. See the following example of code snippet for this:

book = open_workbook('large.xls', on_demand=True)

We didn't mention writing Excel files in this section partly because there will be a separate recipe for that and partly because there is a different module for that—xlwt. You will read more about it in the Exporting data to JSON, CSV, and Excel recipe in this chapter.

If you need specific usage that was not covered with the module and examples explained earlier, here is a list of other Python modules on PyPi that might help you out with spreadsheets http://pypi.python.org/pypi?:action=browse&c=377.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.117.99.152