Reading CSV files with the Fast-CPP-CSV-Parser library

Consider how to deal with the CSV format in C++. There are many C++ libraries for parsing CSV files. They differ in the functionality they offer and in how they integrate into an application. Header-only libraries are the easiest to use because they do not need to be built and linked separately. We use the Fast-CPP-CSV-Parser library because it is a small, single-file, header-only library that provides the minimal required functionality and can be easily integrated into an existing code base.

As an example of a CSV file, we use the Iris dataset, which was introduced by R. A. Fisher and describes three different types of iris plants. Each row in the file contains the following fields: sepal length, sepal width, petal length, petal width, and a string with the class name.

The reference to the Iris dataset is the following: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

To read this dataset with the Fast-CPP-CSV-Parser library, we need to include a single header file, as follows:

#include <csv.h>

Then, we define an object of the io::CSVReader type. The number of columns must be passed as a template parameter. This is one of the library's limitations: the column count has to be known at compile time, so we need to know the structure of the CSV file in advance. The code for this is illustrated in the following snippet:

const uint32_t columns_num = 5;
io::CSVReader<columns_num> csv_reader(file_path);

Next, we define containers for storing the values we read, as follows:

std::vector<std::string> categorical_column;
std::vector<double> values;

Then, to make our code more generic and gather all information about column types in one place, we introduce the following helper types and functions. We define a tuple object that describes values for a row, like this:

using RowType = std::tuple<double, double, double, double, std::string>;
RowType row;

The reason for using a tuple is that we can easily iterate over its elements with metaprogramming techniques. Then, we define two helper functions. The first one reads a row from a file by using the read_row() method of the io::CSVReader class. The read_row() method takes a variable number of parameters of different types, and our RowType type describes these values. We fill in the parameters automatically by expanding a std::index_sequence with the std::get function, as illustrated in the following code snippet:

template <std::size_t... Idx, typename T, typename R>
bool read_row_help(std::index_sequence<Idx...>, T& row, R& r) {
  return r.read_row(std::get<Idx>(row)...);
}
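
To see the pack expansion in isolation, the following minimal sketch calls the same helper with a stand-in reader instead of the real library. MockReader and read_one_mock_row() are hypothetical names introduced only for this illustration, and the sample values are hard-coded; the sketch assumes nothing about io::CSVReader beyond the variadic read_row() call shape described above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in for io::CSVReader. It is NOT part of the library and
// only mimics a variadic read_row() that fills its arguments.
struct MockReader {
  int rows_left = 1;

  template <typename... Cols>
  bool read_row(Cols&... cols) {
    if (rows_left == 0) return false;  // no more rows to read
    --rows_left;
    assign(cols...);
    return true;
  }

 private:
  // Fills the five expected columns with one hard-coded sample row.
  static void assign(double& a, double& b, double& c, double& d,
                     std::string& label) {
    a = 5.1;
    b = 3.5;
    c = 1.4;
    d = 0.2;
    label = "Iris-setosa";
  }
};

// Same helper as in the text: expands the tuple elements into arguments.
template <std::size_t... Idx, typename T, typename R>
bool read_row_help(std::index_sequence<Idx...>, T& row, R& r) {
  return r.read_row(std::get<Idx>(row)...);
}

using RowType = std::tuple<double, double, double, double, std::string>;

// Reads one row through the helper and returns the filled tuple.
RowType read_one_mock_row() {
  MockReader reader;
  RowType row;
  bool ok = read_row_help(
      std::make_index_sequence<std::tuple_size<RowType>::value>{}, row,
      reader);
  assert(ok);
  (void)ok;  // avoid an unused-variable warning when NDEBUG is defined
  return row;
}
```

The call inside read_row_help() expands to r.read_row(std::get<0>(row), ..., std::get<4>(row)), so each tuple element is passed by reference and filled in place.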

The second helper function uses a similar technique for transforming a row tuple object to our value vectors, as follows:

template <std::size_t... Idx, typename T>
void fill_values(std::index_sequence<Idx...>,
                 T& row,
                 std::vector<double>& data) {
  data.insert(data.end(), {std::get<Idx>(row)...});
}
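
As a quick check of this helper, the following sketch applies it to a hand-built tuple. The function name numeric_part_of_sample_row() and the sample values are assumptions made for the illustration. Note that the index sequence covers only the first four elements, so the string element is never placed in the initializer list of doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Same helper as in the text: appends the selected tuple elements to data.
template <std::size_t... Idx, typename T>
void fill_values(std::index_sequence<Idx...>,
                 T& row,
                 std::vector<double>& data) {
  data.insert(data.end(), {std::get<Idx>(row)...});
}

// Hypothetical demo function: copies the four numeric fields of one
// hard-coded row into a vector and returns it.
std::vector<double> numeric_part_of_sample_row() {
  std::tuple<double, double, double, double, std::string> row{
      5.1, 3.5, 1.4, 0.2, "Iris-setosa"};
  std::vector<double> values;
  // Indices 0..3 only; index 4 (the class name) is skipped on purpose.
  fill_values(std::make_index_sequence<4>{}, row, values);
  return values;
}
```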

Now, we can put all the parts together. We define a loop in which we repeatedly read row values and move them to our containers. After each read, we check the return value of the read_row() method, which tells us whether the read was successful. A false return value means that we have reached the end of the file. In the case of a parsing error, we catch an exception from the io::error namespace. There are different exception types for different kinds of parsing failures. In the following example, we handle number parsing errors:

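try {
  bool done = false;
  while (!done) {
    // read_row_help() returns false when the end of the file is reached
    done = !read_row_help(
        std::make_index_sequence<std::tuple_size<RowType>::value>{}, row,
        csv_reader);
    if (!done) {
      // the last tuple element is the class name
      categorical_column.push_back(std::get<4>(row));
      // the first four tuple elements are the numeric features
      fill_values(std::make_index_sequence<columns_num - 1>{}, row,
                  values);
    }
  }
} catch (const io::error::no_digit& err) {
  // a badly formatted number ends the loop; rows read so far are kept
  std::cerr << err.what() << std::endl;
}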

Also, notice that we moved only four values per row to our vector of doubles because the last column contains strings, which we store in a separate vector of categorical values.
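
Because the four numeric values of every row are appended one after another, the vector of doubles ends up as a flat, row-major buffer. The following sketch shows one way to address a single feature of a single sample in such a buffer. The names features_num, samples_count(), and feature_at() are hypothetical helpers introduced for this illustration and are not part of the library or the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of numeric columns stored per sample (assumed to be 4, as above).
constexpr std::size_t features_num = 4;

// Number of complete samples stored in the flat buffer.
std::size_t samples_count(const std::vector<double>& values) {
  return values.size() / features_num;
}

// Returns feature `feature` of sample `sample` from the row-major buffer.
// at() throws std::out_of_range if the computed index is invalid.
double feature_at(const std::vector<double>& values,
                  std::size_t sample,
                  std::size_t feature) {
  return values.at(sample * features_num + feature);
}
```

With this layout, the class name for sample i is found at the same index i in the vector of categorical values.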
