Preprocessing CSV files

Sometimes, the data we have comes in a format that's incompatible with the libraries we want to use. For example, the Iris dataset file has a column of string values. Many machine learning libraries cannot read such values, because they assume that CSV files contain only numerical values that can be loaded directly into an internal matrix representation.

So, before using such datasets, we need to preprocess them. In the case of the Iris dataset, we need to replace the categorical column containing string labels with a numeric encoding. In the following code sample, we replace the strings with distinct numbers, but in general, such an approach is a bad idea, especially for classification tasks. Machine learning algorithms usually learn only numerical relations, so a model can interpret the values 1, 2, and 3 as an order or a distance between classes that doesn't actually exist. A more suitable approach would be to use a specialized encoding, for example, one-hot encoding. The code can be seen in the following block:

#include <fstream>
#include <iterator>
#include <regex>
#include <string>
...
// Read the whole file into a single string.
std::ifstream data_stream("iris.data");
std::string data_string((std::istreambuf_iterator<char>(data_stream)),
                        std::istreambuf_iterator<char>());
// Replace each class name with a distinct number.
data_string =
    std::regex_replace(data_string, std::regex("Iris-setosa"), "1");
data_string =
    std::regex_replace(data_string, std::regex("Iris-versicolor"), "2");
data_string =
    std::regex_replace(data_string, std::regex("Iris-virginica"), "3");
// Write the modified content to a new file.
std::ofstream out_stream("iris_fix.csv");
out_stream << data_string;

We read the content of the CSV file into a std::string object through a std::ifstream object. Then, we use the std::regex_replace() function to replace the string class names with numbers. Using the regex functions allows us to reduce the code size and make it more expressive in comparison with the loop approach, which typically uses the std::string::find() and std::string::replace() methods. After replacing all the categorical class names, we write the result to a new file with a std::ofstream object.
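For comparison, the loop approach mentioned above can be sketched as follows. This is only an illustration: the replace_all name is a hypothetical helper introduced here, not part of the standard library or of the original sample.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard function): replaces every
// occurrence of `from` in `text` with `to`, modifying `text` in place.
void replace_all(std::string& text, const std::string& from,
                 const std::string& to) {
  assert(!from.empty());  // an empty pattern would never advance
  std::size_t pos = 0;
  while ((pos = text.find(from, pos)) != std::string::npos) {
    text.replace(pos, from.length(), to);
    pos += to.length();  // continue the search after the inserted text
  }
}
```

Calling replace_all(data_string, "Iris-setosa", "1") would then stand in for the corresponding std::regex_replace() call, at the cost of writing and maintaining the search loop by hand.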

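The one-hot encoding suggested earlier can be sketched in the same string-replacement style. The following is a minimal illustration under a few assumptions: the one_hot and one_hot_encode_labels names are hypothetical, each class name is replaced with three comma-separated 0/1 columns, and the class order simply follows the order used in the earlier sample.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Class names from the Iris file; the array position is the one-hot index.
const std::array<std::string, 3> kClassNames = {
    "Iris-setosa", "Iris-versicolor", "Iris-virginica"};

// Hypothetical helper: builds a comma-separated one-hot vector,
// with a single '1' at position `index` and '0' everywhere else.
std::string one_hot(std::size_t index, std::size_t num_classes) {
  assert(index < num_classes);
  std::string encoded;
  for (std::size_t i = 0; i < num_classes; ++i) {
    if (i > 0) encoded += ',';
    encoded += (i == index) ? '1' : '0';
  }
  return encoded;
}

// Hypothetical helper: replaces every class name in the CSV text
// with its one-hot columns, so one label column becomes three.
std::string one_hot_encode_labels(std::string text) {
  for (std::size_t c = 0; c < kClassNames.size(); ++c) {
    text = std::regex_replace(text, std::regex(kClassNames[c]),
                              one_hot(c, kClassNames.size()));
  }
  return text;
}
```

With this encoding, no class is numerically larger than or closer to another, but the output file has three label columns instead of one, so any code that loads it has to account for the wider rows.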