Writing and reading HDF5 files with the HighFive library

HDF5 is a highly efficient file format for storing datasets and scientific values. The HighFive library provides a higher-level C++ interface to the C library developed by the HDF Group. In this example, we look at its interface by converting the dataset used in the previous section to HDF5 format.

The main concepts of the HDF5 format are groups and datasets. Each group can contain other groups, a set of dataset entries, and attributes of different types. Each dataset is a multidimensional array of values of the same type, which can also have attributes of different types.

Let's start with including required headers, as follows:

#include <highfive/H5DataSet.hpp>
#include <highfive/H5DataSpace.hpp>
#include <highfive/H5File.hpp>

Then, we have to create a file object where we will write our dataset, as follows:

HighFive::File file(file_name, HighFive::File::ReadWrite |
                                   HighFive::File::Create |
                                   HighFive::File::Truncate);

After we have a file object, we can start creating groups. We define a group of papers that should hold all paper objects, as follows:

auto papers_group = file.createGroup("papers");

Then, we iterate through an array of papers (as shown in the previous section) and create a group for each paper object with two attributes: the numerical id attribute and the preliminary_decision attribute of the string type, as illustrated in the following code block:

for (const auto& paper : papers) {
  auto paper_group =
      papers_group.createGroup("paper_" + std::to_string(paper.id));
  std::vector<uint32_t> id = {paper.id};
  auto id_attr = paper_group.createAttribute<uint32_t>(
      "id", HighFive::DataSpace::From(id));
  id_attr.write(id);
  auto dec_attr = paper_group.createAttribute<std::string>(
      "preliminary_decision",
      HighFive::DataSpace::From(paper.preliminary_decision));
  dec_attr.write(paper.preliminary_decision);

After we have created an attribute, we write its value with the write() method. Notice that the HighFive::DataSpace::From function automatically detects the size of the attribute value, which is the amount of memory required to hold it. Then, for each paper_group, we create a corresponding group of reviews, as follows:

  auto reviews_group = paper_group.createGroup("reviews");

Into each reviews_group, we insert a dataset with the numerical values of the confidence, evaluation, and orientation fields. For the dataset, we define a DataSpace (the number of elements in the dataset) of size 3 and a storage type of 32-bit integer, as follows:

  std::vector<size_t> dims = {3};
  std::vector<int32_t> values(3);
  for (const auto& r : paper.reviews) {
    auto dataset = reviews_group.createDataSet<int32_t>(
        std::to_string(r.id), HighFive::DataSpace(dims));
    values[0] = std::stoi(r.confidence);
    values[1] = std::stoi(r.evaluation);
    values[2] = std::stoi(r.orientation);
    dataset.write(values);
  }
}

After we have created and initialized all objects, the Papers/Reviews dataset in HDF5 format is ready. When the file object goes out of scope, its destructor saves everything to secondary storage.
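
For reference, the resulting file has the following layout (one paper and one review shown; the group, attribute, and dataset names come from the code above, and <id> stands for the concrete numerical identifiers):

/papers
    /paper_<id>            attributes: id, preliminary_decision
        /reviews
            /<review id>   dataset of three int32 values:
                           [confidence, evaluation, orientation]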

Having the file in HDF5 format, we can explore the HighFive library interface for reading files. As the first step, we again create the HighFive::File object, but with the flag for reading, as follows:

HighFive::File file(file_name, HighFive::File::ReadOnly);

Then, we use the getGroup() method to get the top-level papers_group object, as follows:

auto papers_group = file.getGroup("papers");

We need to get a list of all nested objects in this group because we can access objects only by their names. We can do this by running the following code:

auto papers_names = papers_group.listObjectNames();

Using a loop, we iterate over all paper objects in the papers_group container, like this:

for (const auto& pname : papers_names) {
  auto paper_group = papers_group.getGroup(pname);
  ...
}

For each paper object, we read its attributes and allocate the memory required for the attribute values. Also, because each attribute can be multidimensional, we should take care of that and use an appropriate container, as follows:

std::vector<uint32_t> id;
paper_group.getAttribute("id").read(id);
std::cout << id[0];

std::string decision;
paper_group.getAttribute("preliminary_decision").read(decision);
std::cout << " " << decision << std::endl;

For reading datasets, we can use the same approach: get the reviews group, then get a list of dataset names, and, finally, read each dataset in a loop, as follows:

auto reviews_group = paper_group.getGroup("reviews");
auto reviews_names = reviews_group.listObjectNames();
std::vector<int32_t> values(2);
for (const auto& rname : reviews_names) {
  std::cout << " review: " << rname << std::endl;
  auto dataset = reviews_group.getDataSet(rname);
  auto selection = dataset.select(
      {1}, {2});  // or use the dataset.read() method to get the whole data
  selection.read(values);
  std::cout << " evaluation: " << values[0] << std::endl;
  std::cout << " orientation: " << values[1] << std::endl;
}

Notice that we use the select() method of the dataset, which allows us to read only a part of it. We define this part with the offset and the number of elements given as arguments. The dataset type also provides the read() method to read a whole dataset at once.
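
As a comparison, here is a minimal sketch of reading a whole review dataset at once with the read() method; the dataset variable and the field order are the ones from the loop above:

std::vector<int32_t> whole_review;
// read() fills the vector with all three stored values:
// confidence, evaluation, and orientation
dataset.read(whole_review);
std::cout << " confidence: " << whole_review[0] << std::endl;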

Using these techniques, we can read and transform any HDF5 dataset. This file format allows us to work with only the part of the data we need rather than loading the whole file into memory. Also, because it is a binary format, reading it is more efficient than reading large text files.
