HDF5 file

The Hierarchical Data Format (HDF) is a file format designed to store and manage large amounts of data. It was designed in the 1990s at the National Center for Supercomputing Applications (NCSA), and NASA then decided to use this format. Portability and efficiency for time series storage were key in the design of this format. The trading world rapidly adopted it, in particular High-Frequency Trading (HFT) firms, hedge funds, and investment banks. These financial firms rely on gigantic amounts of data for backtesting, trading, and other kinds of analysis.

This format allows HDF users in finance to handle very large datasets and to access either a whole dataset or only a subsection of the tick data. Additionally, since it is a free format, a significant number of open source tools support it.
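As a minimal sketch of reading only a subsection of a large dataset, the following example writes a synthetic price series with h5py and then reads back a ten-value window. The file name ticks_example.h5, the dataset name prices, and the chunk size are assumptions made for illustration, and the prices are randomly generated rather than real tick data.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical tick data: one million synthetic prices (not real market data).
rng = np.random.default_rng(seed=0)
prices = 100.0 + rng.standard_normal(1_000_000).cumsum() * 0.01

# Illustrative file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'ticks_example.h5')

# Write the whole series once, split into chunks of 10,000 values.
with h5py.File(path, 'w') as f:
    f.create_dataset('prices', data=prices, chunks=(10_000,))

# Read back only a small window instead of loading the full array.
with h5py.File(path, 'r') as f:
    dataset = f['prices']
    total_ticks = dataset.shape[0]          # known without reading the data
    window = dataset[500_000:500_010]       # NumPy array holding 10 values

print(total_ticks, window.shape)
```

Slicing the dataset object, rather than calling dataset[:] first, is what keeps the read limited to the requested range.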

The hierarchical structure of HDF5 uses two major types of objects:

  • Datasets: Multidimensional arrays of a given type
  • Groups: Container of other groups and/or datasets

The following diagram shows the hierarchical structure of the HDF5:

To get a dataset's content, we address it much like a regular file, using POSIX-style path syntax such as /path/file. Metadata is stored alongside the data, attached to groups and datasets. The HDF5 format uses B-trees to index datasets, which makes it a good storage format for time series, especially financial asset price series.
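The group, dataset, and path concepts described above can be sketched with h5py as follows. The group path /equities/GOOG, the dataset name close, the attribute names, and the three price values are all hypothetical and chosen only to illustrate the layout.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'hierarchy_example.h5')

with h5py.File(path, 'w') as f:
    # Groups behave like directories; intermediate groups are created as needed.
    goog = f.create_group('/equities/GOOG')
    # Datasets behave like files holding a typed, multidimensional array.
    close = goog.create_dataset('close', data=np.array([101.5, 102.0, 101.8]))
    # Metadata is attached to groups and datasets as attributes.
    goog.attrs['exchange'] = 'NASDAQ'
    close.attrs['currency'] = 'USD'

with h5py.File(path, 'r') as f:
    # Address the dataset with a POSIX-style path inside the file.
    close_prices = f['/equities/GOOG/close'][:]
    currency = f['/equities/GOOG/close'].attrs['currency']
    exchange = f['/equities/GOOG'].attrs['exchange']
    # Collect the name of every group and dataset below the root.
    object_names = []
    f.visit(object_names.append)

print(object_names)
print(close_prices, currency, exchange)
```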

In the following code, we describe an example of how to use an HDF5 file in Python. We will use the load_financial_data function that we used earlier in this book to get the GOOG prices. We store the data frame in an HDF5 file called goog_data.h5. Then, we use the h5py library to open this file, read its attributes, and print its data content.

This code gets the GOOG financial data and stores it in the data frame goog_data:

#!/bin/python3
import pandas as pd
import numpy as np
from pandas_datareader import data
import matplotlib.pyplot as plt
import h5py

def load_financial_data(start_date, end_date, output_file):
    # Reuse the locally cached pickle if it exists; otherwise download the data.
    try:
        df = pd.read_pickle(output_file)
        print('File data found...reading GOOG data')
    except FileNotFoundError:
        print('File not found...downloading the GOOG data')
        df = data.DataReader('GOOG', 'yahoo', start_date, end_date)
        df.to_pickle(output_file)
    return df

goog_data = load_financial_data(start_date='2001-01-01',
                                end_date='2018-01-01',
                                output_file='goog_data.pkl')

In this part of the code, we store the data frame goog_data in the file goog_data.h5:

goog_data.to_hdf('goog_data.h5', key='goog_data', mode='w', format='table', data_columns=True)

We then open the file goog_data.h5 with h5py as the file object goog_data_from_h5_file, and print the table, its content, and its attributes:

goog_data_from_h5_file = h5py.File('goog_data.h5', 'r')

# Print the dataset object, then its full content, then each attribute.
print(goog_data_from_h5_file['goog_data']['table'])
print(goog_data_from_h5_file['goog_data']['table'][:])
for attributes in goog_data_from_h5_file['goog_data']['table'].attrs.items():
    print(attributes)
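To get the stored prices back as an actual data frame, pandas can read the file directly. The sketch below assumes the PyTables package is installed (pandas relies on it for HDF5 support) and uses a small synthetic data frame as a hypothetical stand-in for goog_data, so that it runs without downloading anything. The file name prices_example.h5, the key prices, and the filter values are illustrative.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-in for goog_data: synthetic daily prices, not real GOOG data.
dates = pd.date_range('2017-01-02', periods=250, freq='B')
prices = pd.DataFrame(
    {'Close': np.linspace(800.0, 1000.0, len(dates)),
     'Volume': np.arange(len(dates)) * 1000.0},
    index=dates,
)

# Illustrative file location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'prices_example.h5')

# format='table' with data_columns=True makes the columns queryable on disk.
prices.to_hdf(path, key='prices', mode='w', format='table', data_columns=True)

# Read the whole table back as a data frame.
whole = pd.read_hdf(path, key='prices')

# Read only the rows and columns that match a condition.
subset = pd.read_hdf(
    path,
    key='prices',
    where="(index >= '2017-06-01') & (Close > 900)",
    columns=['Close'],
)

print(len(whole), len(subset))
```

The where filter is only available because the data was written with format='table'; the default fixed format does not support this kind of selection.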

Despite being portable and open source, the HDF5 file format has some important caveats:

  • The likelihood of getting corrupted data is high. When the software handling the HDF5 file crashes, it is possible to lose all the data located in the same file.
  • It has limited features. For example, arrays cannot easily be removed: deleting a dataset does not reclaim its space in the file until the file is repacked.
  • It offers low performance. There is no use of operating system caching.

Many financial companies still use this standardized file format, and it will likely remain in use for a few years. Next, we will talk about the alternative to file storage: databases.
