Storing data with PyTables

Hierarchical Data Format (HDF) is a specification and technology for the storage of big numerical data. HDF was created in the supercomputing community and is now an open standard. The latest version of HDF is HDF5, which is the one we will be using. HDF5 structures data in groups and datasets. Datasets are multidimensional homogeneous arrays. Groups can contain other groups or datasets, and are analogous to directories in a hierarchical filesystem.
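This hierarchy can be sketched directly with PyTables (covered in the rest of this section); the group and dataset names below are made up purely for illustration:

```python
import numpy as np
import tables
from tempfile import NamedTemporaryFile

# A group ("measurements") acts like a directory; the dataset
# ("temperatures") is a homogeneous array stored inside it.
tmpf = NamedTemporaryFile(suffix='.h5')
h5file = tables.open_file(tmpf.name, mode='w')
group = h5file.create_group('/', 'measurements')
h5file.create_array(group, 'temperatures', np.arange(12.0))
print(h5file)  # prints the group/dataset tree
h5file.close()
```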

The two main HDF5 Python libraries are:

  • h5py
  • PyTables

In this example, we will be using PyTables. PyTables has a number of dependencies:

  • NumPy: We installed NumPy in Chapter 1, Getting Started with Python Libraries
  • numexpr: This package claims that it evaluates multiple-operator array expressions many times faster than NumPy can
  • HDF5

    Note

    The parallel version of HDF5 also requires MPI. HDF5 can be installed by obtaining a distribution from http://www.hdfgroup.org/HDF5/release/obtain5.html and running the following commands (which could take a few minutes):

    $ gunzip < hdf5-X.Y.Z.tar.gz | tar xf -
    $ cd hdf5-X.Y.Z
    $ ./configure
    $ make
    $ make install
    

In all likelihood, your favorite package manager has a distribution for HDF5. Please choose the latest stable version. At the time of writing this book, the most recent version was 1.8.12.

The second dependency, numexpr, claims to be able to perform certain operations faster than NumPy. It supports multithreading and has its own virtual machine implemented in C. Both numexpr and PyTables are available on PyPI, so we can install them with pip as follows:

$ pip install numexpr
$ pip install tables

Check the installed versions with the following command:

$ pip freeze|grep tables
tables==3.1.1
$ pip freeze|grep numexpr
numexpr==2.4
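As a quick sanity check of the numexpr claim, the sketch below evaluates a multiple-operator expression both with numexpr and with plain NumPy; the expression itself is an arbitrary example:

```python
import numpy as np
import numexpr as ne

np.random.seed(0)
x = np.random.rand(1000000)
y = np.random.rand(1000000)

# numexpr compiles the string expression and evaluates it in one pass,
# avoiding the temporary arrays NumPy allocates for each sub-expression.
fast = ne.evaluate('2*x + 3*y + abs(x - y)')
slow = 2*x + 3*y + np.abs(x - y)
print(np.allclose(fast, slow))  # prints: True
```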

Again, we will generate random values and fill a NumPy array with those random values. Create an HDF5 file and attach the NumPy array to the root node with the following code:

tmpf = NamedTemporaryFile()
h5file = tables.open_file(tmpf.name, mode='w', title="NumPy Array")
root = h5file.root
h5file.create_array(root, "array", a)
h5file.close()

Read the HDF5 file and print its file size:

h5file = tables.open_file(tmpf.name, "r")
print(getsize(tmpf.name))

The value that we get for the file size is 13824. Once we have read an HDF5 file and obtained a handle for it, we would normally traverse it to find the data we need. Since we only have one dataset, traversing is pretty simple. Call the iter_nodes() and read() methods to get the NumPy array back:

for node in h5file.iter_nodes(h5file.root):
    b = node.read()
    print(type(b), b.shape)

The type and shape of the dataset correspond to our expectations:

<class 'numpy.ndarray'> (365, 4)

The following code can be found in the hf5storage.py file in this book's code bundle:

import numpy as np
import tables
from tempfile import NamedTemporaryFile
from os.path import getsize

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile()
h5file = tables.open_file(tmpf.name, mode='w', title="NumPy Array")
root = h5file.root
h5file.create_array(root, "array", a)
h5file.close()

h5file = tables.open_file(tmpf.name, "r")
print(getsize(tmpf.name))

for node in h5file.iter_nodes(h5file.root):
    b = node.read()
    print(type(b), b.shape)

h5file.close()
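Iterating over nodes is not the only way to get at a dataset. PyTables also supports natural naming, where each node is reachable as an attribute of its parent group; a minimal round-trip sketch:

```python
import numpy as np
import tables
from tempfile import NamedTemporaryFile

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile(suffix='.h5')
with tables.open_file(tmpf.name, mode='w') as h5file:
    h5file.create_array(h5file.root, 'array', a)

# Natural naming: the /array dataset is an attribute of the root group.
with tables.open_file(tmpf.name, 'r') as h5file:
    b = h5file.root.array.read()

print(np.array_equal(a, b))  # prints: True
```

Using the file object as a context manager ensures the file is closed even if an exception is raised.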