Excel files contain a lot of important data. Of course, we can export that data in other more portable formats such as CSV. However, it is more convenient to read and write Excel files with Python. As is common in the Python world, there is more than one project with the goal of providing Excel I/O capabilities. The modules that we will need to install to get Excel I/O to work with pandas are somewhat obscurely documented. The reason is that the projects that pandas depends on are independent and rapidly developing. The pandas package is picky about the files it accepts as Excel files. These files must have the .xls
or .xlsx
suffix; otherwise, we get the following error:
ValueError: No engine for filetype: ''
This is easy to fix. For instance, if we create a temporary file, we just give it the proper suffix. If you don't install anything, you will get the following error message:
ImportError: No module named openpyxl.workbook
The following command gets rid of the error by installing openpyxl:
$ pip install openpyxl
Check the version with the following command:
$ pip freeze|grep openpyxl openpyxl==2.0.3
The openpyxl module is a port of PHPExcel and supports the reading and writing of .xlsx
files.
If for some reason the pip install
method didn't work for you, you can find alternative installation instructions at http://pythonhosted.org/openpyxl/.
Even after installing openpyxl, you might get the following error:
ImportError: No module named style
Fix this by installing xlsxwriter:
$ pip install xlsxwriter
Also, we can check the xlsxwriter version again. I have installed version 0.5.5. The xlsxwriter module is also needed to read the .xlsx
files. At this point, you will most likely get the following error:
ImportError: No module named xlrd
This module can be installed with pip
as well:
$ pip install xlrd $ pip freeze|grep xlrd xlrd==0.9.3
The xlrd module is able to extract data from the .xls
and .xlsx
files. Let's generate random values to populate a pandas DataFrame
, create an Excel file from the DataFrame
, recreate the DataFrame
from the Excel file, and apply the mean()
method to it. For the sheet of the Excel file, we can either specify a zero-based index or name.
Refer to the pd_xls.py
file in the book's code bundle, which will contain the following code:
import numpy as np import pandas as pd from tempfile import NamedTemporaryFile np.random.seed(42) a = np.random.randn(365, 4) tmpf = NamedTemporaryFile(suffix='.xlsx') df = pd.DataFrame(a) print tmpf.name df.to_excel(tmpf.name, sheet_name='Random Data') print "Means ", pd.read_excel(tmpf.name, 'Random Data').mean()
Create an Excel file with the to_excel()
method:
df.to_excel(tmpf.name, sheet_name='Random Data')
Recreate the DataFrame
with the top-level read_excel()
function:
print "Means ", pd.read_excel(tmpf.name, 'Random Data').mean()
The means are printed as follows:
/var/folders/k_/xx_xz6xj0hx627654s3vld440000gn/T/tmpeBEfnO.xlsx Means 0 0.037860 1 0.024483 2 0.059836 3 0.058417 dtype: float64
18.217.147.193