Comparing the NumPy .npy binary format and pickling pandas DataFrames

Saving data in the CSV format is fine most of the time. CSV files are easy to exchange, since most programming languages and applications can handle the format. However, CSV is not very efficient: it and other plaintext formats take up a lot of space. Numerous file formats have been invented that offer a high level of compression, such as zip, bzip, and gzip.

The following is the complete code for this storage comparison exercise, which can also be found in the binary_formats.py file of this book's code bundle:

import numpy as np
import pandas as pd
from tempfile import NamedTemporaryFile
from os.path import getsize

np.random.seed(42)
a = np.random.randn(365, 4)

tmpf = NamedTemporaryFile()
np.savetxt(tmpf, a, delimiter=',')
print "Size CSV file", getsize(tmpf.name)

tmpf = NamedTemporaryFile()
np.save(tmpf, a)
tmpf.seek(0)
loaded = np.load(tmpf)
print "Shape", loaded.shape
print "Size .npy file", getsize(tmpf.name)

df = pd.DataFrame(a)
df.to_pickle(tmpf.name)
print "Size pickled dataframe", getsize(tmpf.name)
print "DF from pickle\n", pd.read_pickle(tmpf.name)

NumPy offers its own format, called .npy, for storing NumPy arrays. Before demonstrating this format, we will generate a 365 x 4 NumPy array filled with random values. This array simulates daily measurements of four variables over a year (for instance, a weather station with sensors measuring temperature, humidity, precipitation, and atmospheric pressure). We will use a standard Python NamedTemporaryFile to store the data. The temporary file is deleted automatically when it is closed.

Store the array in a CSV file and check its size as follows:

tmpf = NamedTemporaryFile()
np.savetxt(tmpf, a, delimiter=',')
print "Size CSV file", getsize(tmpf.name)

The CSV file size is printed as follows:

Size CSV file 36864
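The following is a minimal Python 3 sketch of the same measurement, not the book's original listing. It assumes a recent NumPy and adds an explicit flush() call: the file object buffers its writes, so a size read before flushing may undercount what will eventually be written. The raw size of the array data (a.nbytes) is printed for comparison.

```python
from os.path import getsize
from tempfile import NamedTemporaryFile

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

with NamedTemporaryFile() as tmpf:
    np.savetxt(tmpf, a, delimiter=',')
    # Push buffered data to disk before asking the OS for the file size.
    tmpf.flush()
    csv_size = getsize(tmpf.name)

print("Size CSV file", csv_size)
print("Raw array bytes", a.nbytes)
```

The exact CSV size depends on the number format that savetxt() uses, so no figure is quoted here.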

Save the array in the NumPy .npy format, load it back, and check its shape and the size of the .npy file:

tmpf = NamedTemporaryFile()
np.save(tmpf, a)
tmpf.seek(0)
loaded = np.load(tmpf)
print "Shape", loaded.shape
print "Size .npy file", getsize(tmpf.name)

The call to the seek() method rewinds the temporary file to its beginning, which simulates closing and reopening the file so that np.load() reads from the start. The shape and the file size are printed as follows:

Shape (365, 4)
Size .npy file 11760
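The reported size can be accounted for with simple arithmetic: 365 x 4 float64 values occupy 365 x 4 x 8 = 11680 bytes, which leaves 80 bytes for the .npy header. The following Python 3 sketch checks this split using an in-memory buffer instead of a temporary file (io.BytesIO is an assumption made for brevity). The header size may differ between NumPy versions, so it is computed rather than hard-coded.

```python
import io

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

buf = io.BytesIO()
np.save(buf, a)
npy_size = len(buf.getvalue())

# Everything that is not raw array data belongs to the .npy header.
header_size = npy_size - a.nbytes

# Rewind before loading, as with the temporary file.
buf.seek(0)
loaded = np.load(buf)

print("Data bytes", a.nbytes)
print("Header bytes", header_size)
print("Round trip exact", np.array_equal(a, loaded))
```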

The .npy file is roughly a third of the size of the CSV file, as expected, since it stores the values in binary form rather than as text. Python lets us store data structures of practically arbitrary complexity. We can store a pandas DataFrame or Series as a pickle as well.

Note

Pickle is a Python format for storing Python objects on disk or another medium. Storing an object this way is called pickling. Recreating the Python objects from storage is the reverse process, called unpickling (refer to http://docs.python.org/2/library/pickle.html). Pickling has evolved over the years, and as a result, various pickle protocols exist. Not all Python objects can be pickled; however, alternative implementations such as dill exist, which allow more types of Python objects to be pickled. If possible, use cPickle (included in the standard Python 2 distribution) because it is implemented in C and is, therefore, faster.
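As an illustration of the note above, here is a small Python 3 sketch using only the standard pickle module. In Python 3, cPickle no longer exists as a separate module; pickle uses the C implementation automatically when it is available. The record dictionary is a made-up example. The sketch round-trips it under every available protocol, then shows that a lambda is one kind of object that the standard pickler rejects.

```python
import pickle

# Hypothetical sample object, used only for illustration.
record = {"station": "example", "readings": [0.5, -0.1, 0.6], "ok": True}

sizes = {}
for protocol in range(pickle.HIGHEST_PROTOCOL + 1):
    payload = pickle.dumps(record, protocol=protocol)  # pickling
    restored = pickle.loads(payload)                   # unpickling
    assert restored == record
    sizes[protocol] = len(payload)
    print("protocol", protocol, "bytes", sizes[protocol])

# Functions are pickled by name, so an anonymous lambda cannot be stored.
try:
    pickle.dumps(lambda x: x)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False
print("lambda pickled:", lambda_pickled)
```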

Create a DataFrame from the generated NumPy array, write it to a pickle with the to_pickle() method, and retrieve it from the pickle with the read_pickle() function:

df = pd.DataFrame(a)
df.to_pickle(tmpf.name)
print "Size pickled dataframe", getsize(tmpf.name)
print "DF from pickle\n", pd.read_pickle(tmpf.name)

The pickle of the DataFrame is slightly larger than the .npy file, as you can confirm in the following printout:

Size pickled dataframe 14991
DF from pickle
           0         1         2         3
0   0.496714 -0.138264  0.647689  1.523030
[TRUNCATED]
59 -2.025143  0.186454 -0.661786  0.852433
         ...       ...       ...       ...

[365 rows x 4 columns]
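A Python 3 sketch of the same round trip follows, assuming recent NumPy and pandas versions. It writes to a file inside a temporary directory (the frame.pkl name is hypothetical) and verifies that the restored DataFrame equals the original. The pickle size varies with the pandas version and pickle protocol, so it is printed rather than quoted.

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(365, 4))

with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = os.path.join(tmpdir, "frame.pkl")  # hypothetical file name
    df.to_pickle(pkl_path)
    pkl_size = os.path.getsize(pkl_path)
    restored = pd.read_pickle(pkl_path)

print("Size pickled dataframe", pkl_size)
print("Round trip exact", df.equals(restored))
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.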