Hierarchical Data Format (HDF) is a specification and technology for the storage of large numerical data. HDF was created in the supercomputing community and is now an open standard. The latest version, HDF5, is the one we will be using. HDF5 structures data in groups and datasets. Datasets are multidimensional homogeneous arrays. Groups can contain other groups or datasets, so groups are like directories in a hierarchical filesystem.
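The group/dataset hierarchy can be sketched with a few lines of PyTables. This is a minimal illustration, not part of this chapter's example; the group and dataset names are made up, and it uses the PEP 8 method names (open_file, create_group, create_array) that PyTables 3.0 introduced as replacements for the older camelCase names used elsewhere in this chapter:

```python
import numpy as np
import tables
from tempfile import NamedTemporaryFile

# Create a temporary HDF5 file containing one group and one dataset.
tmpf = NamedTemporaryFile()
h5file = tables.open_file(tmpf.name, mode='w', title="Hierarchy demo")

# Groups play the role of directories; the names here are illustrative.
group = h5file.create_group(h5file.root, "mygroup", "An example group")
h5file.create_array(group, "mydataset", np.arange(10))

# Datasets are addressed by path, much like files: /mygroup/mydataset
print(h5file.get_node("/mygroup/mydataset").read())
h5file.close()
```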
The two main HDF5 Python libraries are:

- h5py, a thin wrapper around the HDF5 C API
- PyTables, a higher-level library built on top of HDF5
In this example, we will be using PyTables. PyTables has a number of dependencies:

- The NumPy package
- The numexpr package
- The HDF5 library itself
The parallel version of HDF5 also requires MPI. HDF5 can be installed by obtaining a distribution from http://www.hdfgroup.org/HDF5/release/obtain5.html and running the following commands (which could take a few minutes):
$ gunzip < hdf5-X.Y.Z.tar.gz | tar xf -
$ cd hdf5-X.Y.Z
$ ./configure
$ make
$ make install
In all likelihood, your favorite package manager has a distribution for HDF5. Please choose the latest stable version. At the time of writing this book, the most recent version was 1.8.12.
The second dependency, numexpr, claims to perform certain array operations faster than NumPy. It supports multithreading and has its own virtual machine implemented in C. Numexpr and PyTables are available on PyPI, so we can install them with pip as follows:
$ pip install numexpr
$ pip install tables
Check the installed versions with the following command:
$ pip freeze|grep tables
tables==3.1.1
$ pip freeze|grep numexpr
numexpr==2.4
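Numexpr's speed claim is easy to try out. The following sketch assumes both packages are installed; the expression and array sizes are arbitrary. numexpr compiles the expression string and evaluates it chunk by chunk on its own virtual machine, optionally across several threads, which avoids the large temporary arrays that the equivalent NumPy expression creates:

```python
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Evaluate the expression with numexpr; the names a and b are picked up
# from the local scope.
result = ne.evaluate("2 * a + 3 * b")

# The result agrees with the plain NumPy computation.
print(np.allclose(result, 2 * a + 3 * b))  # True
```

Timing the two versions with the timeit module is a reasonable way to check whether numexpr actually wins on your machine; the advantage usually shows up only on large arrays.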
Again, we will generate random values and fill a NumPy array with those random values. Create an HDF5 file and attach the NumPy array to the root node with the following code:
tmpf = NamedTemporaryFile()
h5file = tables.openFile(tmpf.name, mode='w', title="NumPy Array")
root = h5file.root
h5file.createArray(root, "array", a)
h5file.close()
Read the HDF5 file and print its file size:
h5file = tables.openFile(tmpf.name, "r")
print getsize(tmpf.name)
The value that we get for the file size is 13824. This is somewhat more than the 11,680 bytes (365 x 4 x 8) of raw float64 data; the difference is HDF5 file-format overhead such as headers and metadata. Once we read an HDF5 file and obtain a handle for it, we would normally traverse it to find the data we need. Since we only have one dataset, traversing is pretty simple. Call the iterNodes() and read() methods to get the NumPy array back:
for node in h5file.iterNodes(h5file.root):
    b = node.read()
    print type(b), b.shape
The type and shape of the dataset correspond to our expectations:
<type 'numpy.ndarray'> (365, 4)
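Since there is only one dataset, iterating is not strictly necessary: PyTables also supports "natural naming", where each node is exposed as an attribute of its parent group. The following self-contained sketch rebuilds the same file and fetches the array directly; it uses the PEP 8 spellings (open_file, create_array) that correspond to the camelCase methods shown above:

```python
import numpy as np
import tables
from tempfile import NamedTemporaryFile

np.random.seed(42)
a = np.random.randn(365, 4)

# Write the array to a temporary HDF5 file, as in the example above.
tmpf = NamedTemporaryFile()
h5file = tables.open_file(tmpf.name, mode='w', title="NumPy Array")
h5file.create_array(h5file.root, "array", a)
h5file.close()

h5file = tables.open_file(tmpf.name, "r")
# Natural naming: the node we attached as "array" is an attribute of root.
b = h5file.root.array.read()
print(b.shape)  # (365, 4)
h5file.close()
```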
The following code can be found in the hf5storage.py file in this book's code bundle:
import numpy as np
import tables
from tempfile import NamedTemporaryFile
from os.path import getsize

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile()
h5file = tables.openFile(tmpf.name, mode='w', title="NumPy Array")
root = h5file.root
h5file.createArray(root, "array", a)
h5file.close()

h5file = tables.openFile(tmpf.name, "r")
print getsize(tmpf.name)

for node in h5file.iterNodes(h5file.root):
    b = node.read()
    print type(b), b.shape

h5file.close()