Data Processing

Training data is one of the essential ingredients of machine learning. We can gather it from the processes we work with, or we can take prepared datasets from third-party sources. In either case, we have to store the training data in a file format that satisfies our development requirements. These requirements depend on the task we are solving and on the data-gathering process. Sometimes, we need to convert data from one format to another to meet our needs. Examples of such needs are as follows:

  • Increasing human readability to ease communication with engineers
  • Supporting compression so that data occupies less space on secondary storage
  • Storing data in binary form to speed up parsing
  • Supporting complex relations between different parts of the data to mirror a specific domain precisely
  • Platform independence, so that the dataset can be used in different development and production environments

Today, a variety of file formats are used for storing different kinds of information. Some are very specific, and some are general-purpose. Software libraries exist for manipulating these formats, so there is rarely a need to develop a new format and parser from scratch. Using existing software to read a format can significantly reduce development and testing time, which allows us to focus on the task at hand.

This chapter discusses how to process popular file formats that we use for storing data. It shows what libraries exist for working with JavaScript Object Notation (JSON), Comma-Separated Values (CSV), and Hierarchical Data Format v5 (HDF5) formats. This chapter also introduces the basic operations required to load and process image data with the OpenCV and Dlib libraries, and how to convert the data format used in these libraries to data types used in linear algebra libraries. It also describes data normalization techniques such as feature scaling and standardization procedures to deal with heterogeneous data.

This chapter will cover the following topics:

  • Parsing data formats to C++ data structures
  • Initializing matrix and tensor objects from C++ data structures
  • Manipulating images with the OpenCV and Dlib libraries
  • Transforming images into matrix and tensor objects of various libraries
  • Normalizing data
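
As a brief preview of the last topic, the two normalization techniques mentioned above, feature scaling and standardization, can be sketched in plain C++. This is a minimal illustration and not the chapter's implementation: the function names MinMaxScale and Standardize are hypothetical, min-max rescaling to the [0, 1] range is assumed as the form of feature scaling, and the population standard deviation is assumed for standardization.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper: min-max feature scaling.
// Maps the values of one feature linearly into the [0, 1] range.
std::vector<double> MinMaxScale(const std::vector<double>& values) {
  assert(!values.empty());
  const auto min_max = std::minmax_element(values.begin(), values.end());
  const double min_v = *min_max.first;
  const double range = *min_max.second - min_v;
  std::vector<double> result(values.size(), 0.0);
  if (range == 0.0) {
    return result;  // Constant feature: every value is mapped to 0.
  }
  std::transform(values.begin(), values.end(), result.begin(),
                 [=](double v) { return (v - min_v) / range; });
  return result;
}

// Hypothetical helper: standardization (z-score).
// Shifts the feature to zero mean and rescales it to unit variance,
// using the population standard deviation (division by N).
std::vector<double> Standardize(const std::vector<double>& values) {
  assert(!values.empty());
  const double n = static_cast<double>(values.size());
  const double mean =
      std::accumulate(values.begin(), values.end(), 0.0) / n;
  double squared_sum = 0.0;
  for (double v : values) {
    squared_sum += (v - mean) * (v - mean);
  }
  const double stddev = std::sqrt(squared_sum / n);
  std::vector<double> result(values.size(), 0.0);
  if (stddev == 0.0) {
    return result;  // Constant feature: every value is mapped to 0.
  }
  std::transform(values.begin(), values.end(), result.begin(),
                 [=](double v) { return (v - mean) / stddev; });
  return result;
}
```

Both functions operate on a single feature column, so a dataset with several features would be normalized one column at a time. Calling MinMaxScale on the values 1, 2, 3, 4, 5 gives 0, 0.25, 0.5, 0.75, 1, while Standardize gives values centered on 0 with unit variance.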