Saving data in the CSV format is fine most of the time. CSV files are easy to exchange, since most programming languages and applications can handle the format. However, CSV is not very efficient; like other plaintext formats, it takes up a lot of space. Numerous file formats and tools, such as zip, bzip, and gzip, have been invented to offer a high level of compression.
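As a rough illustration of how much room plaintext leaves for compression, the following sketch builds CSV text in memory and compresses it with the standard gzip module. The csv_bytes helper and its random data are made up for this illustration, and the ratio you see will depend on the data.

```python
import gzip
import random

random.seed(42)


def csv_bytes(rows, cols):
    """Build CSV text of Gaussian random numbers (hypothetical helper)."""
    lines = []
    for _ in range(rows):
        lines.append(",".join(repr(random.gauss(0, 1)) for _ in range(cols)))
    return ("\n".join(lines) + "\n").encode("ascii")


raw = csv_bytes(365, 4)
packed = gzip.compress(raw)

print("Raw CSV bytes", len(raw))
print("Gzipped bytes", len(packed))
print("Round trip OK", gzip.decompress(packed) == raw)
```

Compression reduces the size, but the data still has to be parsed from text after decompression, which is one reason to consider a binary format instead.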
The following is the complete code for this storage comparison exercise, which can also be found in the binary_formats.py file of this book's code bundle:
import numpy as np
import pandas as pd
from tempfile import NamedTemporaryFile
from os.path import getsize

np.random.seed(42)
a = np.random.randn(365, 4)

# Store the array as CSV and report the file size.
tmpf = NamedTemporaryFile()
np.savetxt(tmpf, a, delimiter=',')
print("Size CSV file", getsize(tmpf.name))

# Store the array as .npy, rewind the file, and load it back.
tmpf = NamedTemporaryFile()
np.save(tmpf, a)
tmpf.seek(0)
loaded = np.load(tmpf)
print("Shape", loaded.shape)
print("Size .npy file", getsize(tmpf.name))

# Pickle a DataFrame built from the same array and read it back.
df = pd.DataFrame(a)
df.to_pickle(tmpf.name)
print("Size pickled dataframe", getsize(tmpf.name))
print("DF from pickle ", pd.read_pickle(tmpf.name))
NumPy offers a NumPy-specific format called .npy, which can be used to store NumPy arrays. Before demonstrating this format, we will generate a 365 x 4 NumPy array filled with random values. This array simulates daily measurements of four variables over a year (for instance, a weather station with sensors measuring temperature, humidity, precipitation, and atmospheric pressure). We will use a standard Python NamedTemporaryFile to store the data. The temporary file is deleted automatically when it is closed.
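The automatic cleanup can be observed directly. The following minimal sketch assumes the default delete=True behavior of NamedTemporaryFile; the sample row written to the file is made up for the illustration.

```python
import os
from tempfile import NamedTemporaryFile

with NamedTemporaryFile() as tmpf:
    path = tmpf.name
    tmpf.write(b"1.0,2.0,3.0,4.0\n")
    tmpf.flush()  # push buffered bytes to disk before measuring
    exists_while_open = os.path.exists(path)
    size_while_open = os.path.getsize(path)

# Leaving the with block closes the file, which also removes it.
exists_after_close = os.path.exists(path)

print("Exists while open", exists_while_open)
print("Size while open", size_while_open)
print("Exists after close", exists_after_close)
```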
Store the array in a CSV file and check its size as follows:
tmpf = NamedTemporaryFile()
np.savetxt(tmpf, a, delimiter=',')
print("Size CSV file", getsize(tmpf.name))
The CSV file size is printed as follows:
Size CSV file 36864
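The printed size of 36864 is exactly 9 x 4096 bytes, which suggests that the last partially filled write buffer had not yet reached the disk when the size was measured. The following sketch is a variation on the step above, not the book's code: it flushes the file before calling getsize(), so the flushed size may come out slightly larger.

```python
import numpy as np
from os.path import getsize
from tempfile import NamedTemporaryFile

np.random.seed(42)
a = np.random.randn(365, 4)

with NamedTemporaryFile() as tmpf:
    np.savetxt(tmpf, a, delimiter=',')
    tmpf.flush()  # make sure buffered rows reach the disk
    csv_size = getsize(tmpf.name)

# Each value is written with savetxt's default '%.18e' format
# (24 characters, 25 with a minus sign), followed by a comma or newline.
print("Size CSV file after flush", csv_size)
print("Average bytes per value", csv_size / a.size)
```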
Save the array in the NumPy .npy format, load the array, and check its shape and the size of the .npy file:
tmpf = NamedTemporaryFile()
np.save(tmpf, a)
tmpf.seek(0)
loaded = np.load(tmpf)
print("Shape", loaded.shape)
print("Size .npy file", getsize(tmpf.name))
The call to the seek() method was needed to simulate closing and reopening the temporary file: it moves the file position back to the start so that np.load() reads from the beginning. The shape should be printed with the file size:
Shape (365, 4)
Size .npy file 11760
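The size reported above can be checked by hand: 365 x 4 values at 8 bytes each give 11680 bytes of raw data, and the remainder is the .npy header. The following sketch saves to an in-memory io.BytesIO buffer instead of a temporary file (a choice made for this illustration only) and separates the two parts. The header length may differ between NumPy versions, so the total need not be exactly 11760.

```python
import io

import numpy as np

np.random.seed(42)
a = np.random.randn(365, 4)

buf = io.BytesIO()
np.save(buf, a)
total = len(buf.getvalue())

print("Raw data bytes", a.nbytes)  # 365 * 4 * 8
print("Header bytes", total - a.nbytes)
print("Total .npy bytes", total)

buf.seek(0)  # rewind, as with the temporary file
loaded = np.load(buf)
print("Arrays equal", np.array_equal(a, loaded))
```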
The .npy file is roughly three times smaller than the CSV file, as expected: each value takes 8 bytes in binary form instead of about 25 characters of text.
Python lets us store data structures of practically arbitrary complexity. We can store a pandas DataFrame or Series as a pickle as well.
The Python pickle is a format for storing Python objects on disk or another medium. This is called pickling. We can recreate the Python objects from storage. This reverse process is called unpickling (refer to http://docs.python.org/2/library/pickle.html). Pickling has evolved over the years, and as a result, various pickle protocols exist. Not all Python objects can be pickled; however, alternative implementations such as dill exist, which allow more types of Python objects to be pickled. In Python 2, use cPickle if possible (it is included in the standard Python distribution) because it is implemented in C and is, therefore, faster. In Python 3, the pickle module uses the C implementation automatically when it is available.
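The ideas in this paragraph can be sketched with the standard pickle module alone. The record dictionary below is invented for the illustration; the example serializes it with two different protocols, restores it, and then shows one kind of object (a lambda) that the standard pickler rejects.

```python
import pickle

# Hypothetical record, used only for this illustration.
record = {"station": "hypothetical-01", "readings": [20.5, 0.61, 0.0, 1013.2]}

# Serialize with the oldest and the newest protocol available.
oldest = pickle.dumps(record, protocol=0)
newest = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

print("Protocol 0 bytes", len(oldest))
print("Highest protocol bytes", len(newest))
print("Round trip OK", pickle.loads(newest) == record)

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False
print("Lambda pickled", lambda_pickled)
```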
Create a DataFrame from the generated NumPy array, write it to a pickle with the to_pickle() method, and retrieve it from the pickle with the read_pickle() function:
df = pd.DataFrame(a)
df.to_pickle(tmpf.name)
print("Size pickled dataframe", getsize(tmpf.name))
print("DF from pickle ", pd.read_pickle(tmpf.name))
The pickle of the DataFrame is slightly larger than the .npy file, as you can confirm in the following printout:
Size pickled dataframe 14991
DF from pickle             0         1         2         3
0   0.496714 -0.138264  0.647689  1.523030
[TRUNCATED]
59 -2.025143  0.186454 -0.661786  0.852433
          ...       ...       ...       ...
[365 rows x 4 columns]
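Rather than comparing the printed values by eye, the round trip can be checked programmatically. The following sketch writes the pickle to a throwaway directory (the weather.pkl file name is hypothetical) and compares the restored DataFrame with the original using equals(). The pickle size depends on the pandas version and pickle protocol, so it need not match the number printed above.

```python
import os
import tempfile

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.randn(365, 4))

# Hypothetical file name inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "weather.pkl")
    df.to_pickle(path)
    restored = pd.read_pickle(path)
    pickle_size = os.path.getsize(path)

print("Size pickled dataframe", pickle_size)
print("Identical after round trip", df.equals(restored))
```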