Reading and writing to Excel with pandas

Excel files contain a lot of important data. Of course, we can export that data in other more portable formats such as CSV. However, it is more convenient to read and write Excel files with Python. As is common in the Python world, there is more than one project with the goal of providing Excel I/O capabilities. The modules that we will need to install to get Excel I/O to work with pandas are somewhat obscurely documented. The reason is that the projects that pandas depends on are independent and rapidly developing. The pandas package is picky about the files it accepts as Excel files. These files must have the .xls or .xlsx suffix; otherwise, we get the following error:

ValueError: No engine for filetype: ''

This is easy to fix. For instance, if we create a temporary file, we just give it the proper suffix. If you don't install anything, you will get the following error message:

ImportError: No module named openpyxl.workbook

The following command gets rid of the error by installing openpyxl:

$ pip install openpyxl

Check the version with the following command:

$ pip freeze|grep openpyxl
openpyxl==2.0.3

The openpyxl module is a port of PHPExcel and supports the reading and writing of .xlsx files.

Tip

If for some reason the pip install method didn't work for you, you can find alternative installation instructions at http://pythonhosted.org/openpyxl/.

Even after installing openpyxl, you might get the following error:

ImportError: No module named style

Fix this by installing xlsxwriter:

$ pip install xlsxwriter

Also, we can check the xlsxwriter version again. I have installed version 0.5.5. The xlsxwriter module is also needed to read the .xlsx files. At this point, you will most likely get the following error:

ImportError: No module named xlrd

This module can be installed with pip as well:

$ pip install xlrd
$ pip freeze|grep xlrd
xlrd==0.9.3

The xlrd module is able to extract data from the .xls and .xlsx files. Let's generate random values to populate a pandas DataFrame, create an Excel file from the DataFrame, recreate the DataFrame from the Excel file, and apply the mean() method to it. For the sheet of the Excel file, we can either specify a zero-based index or name.

Refer to the pd_xls.py file in the book's code bundle, which will contain the following code:

import numpy as np
import pandas as pd
from tempfile import NamedTemporaryFile

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile(suffix='.xlsx')
df = pd.DataFrame(a)
print tmpf.name
df.to_excel(tmpf.name, sheet_name='Random Data')
print "Means
", pd.read_excel(tmpf.name, 'Random Data').mean()

Create an Excel file with the to_excel() method:

df.to_excel(tmpf.name, sheet_name='Random Data')

Recreate the DataFrame with the top-level read_excel() function:

print "Means
", pd.read_excel(tmpf.name, 'Random Data').mean()

The means are printed as follows:

/var/folders/k_/xx_xz6xj0hx627654s3vld440000gn/T/tmpeBEfnO.xlsx
Means
0    0.037860
1    0.024483
2    0.059836
3    0.058417
dtype: float64
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.217.147.193