Reading JSON files with the RapidJSON library

Some datasets come with structured annotations and can span multiple files and folders. An example of such a complex dataset is the Common Objects in Context (COCO) dataset, which is widely used to train models for segmentation, object detection, and classification tasks. Its annotations are stored in the JSON file format. JSON is a widely used text format for representing objects (entities): it uses a small set of notations to describe objects, their attributes, and the relations between objects and their parts. In the following code samples, we show how to work with this file format using the RapidJSON C++ library. However, we are going to use a more straightforward dataset that contains paper reviews. The authors of this dataset are Keith, B., Fuentes, E., & Meneses, C. (2017), and they made it for their work titled A Hybrid Approach for Sentiment Analysis Applied to Paper Reviews. The following code sample shows a reduced part of this dataset (long text values are shortened):

{
  "paper": [
    {
      "id": 1,
      "preliminary_decision": "accept",
      "review": [
        {
          "confidence": "4",
          "evaluation": "1",
          "id": 1,
          "lan": "es",
          "orientation": "0",
          "remarks": "",
          "text": "- El artículo aborda un problema contingente y muy relevante, e incluye tanto un diagnóstico nacional de uso de buenas prácticas como una solución (buenas prácticas concretas)...",
          "timespan": "2010-07-05"
        },
        {
          "confidence": "4",
          "evaluation": "1",
          "id": 2,
          "lan": "es",
          "orientation": "1",
          "remarks": "",
          "text": "El artículo presenta recomendaciones prácticas para el desarrollo de software seguro...",
          "timespan": "2010-07-05"
        },
        {
          "confidence": "5",
          "evaluation": "1",
          "id": 3,
          "lan": "es",
          "orientation": "1",
          "remarks": "",
          "text": "- El tema es muy interesante y puede ser de mucha ayuda una guía para incorporar prácticas de seguridad...",
          "timespan": "2010-07-05"
        }
      ]
    },
    ...
  ]
}

There are two main approaches to parsing and processing JSON files, which are listed as follows:

  • The first approach parses the whole file at once and creates a Document Object Model (DOM). The DOM is a hierarchical structure of objects that represents the entities stored in the file. It is usually kept in memory, so for large files it can occupy a significant amount of memory.
  • Another approach is to parse the file continuously and provide an application program interface (API) for a user to handle and process each event related to the file-parsing process. This second approach is usually called Simple API for XML (SAX). Despite its name, it's a general approach that is used with non-XML data too.

Using a DOM for working with training datasets usually requires a lot of memory for structures that are useless for machine learning algorithms. So, in many cases, it is preferable to use the SAX interface. It allows us to filter irrelevant data and initialize structures that we can use directly in our algorithms. In the following code sample, we use this approach.

As a preliminary step, we define types for paper/review entities, as follows:

...
// Review is defined first because Paper stores Review objects by value.
struct Review {
  std::string confidence;
  std::string evaluation;
  uint32_t id{0};
  std::string language;
  std::string orientation;
  std::string remarks;
  std::string text;
  std::string timespan;
};
...
struct Paper {
  uint32_t id{0};
  std::string preliminary_decision;
  std::vector<Review> reviews;
};

using Papers = std::vector<Paper>;
...

Then, we declare a type for the object that the parser will use to handle parsing events. This type should inherit from the rapidjson::BaseReaderHandler base class, and we redefine the handler functions that the parser calls when a particular parsing event occurs. These functions are not virtual: BaseReaderHandler takes the derived type as a template parameter, so the calls are resolved at compile time. The declaration is illustrated in the following code block:

#include <rapidjson/error/en.h>
#include <rapidjson/filereadstream.h>
#include <rapidjson/reader.h>
...
struct ReviewsHandler
    : public rapidjson::BaseReaderHandler<rapidjson::UTF8<>, ReviewsHandler> {
  ReviewsHandler(Papers* papers) : papers_(papers) {}
  bool Uint(unsigned u);
  bool String(const char* str, rapidjson::SizeType length, bool /*copy*/);
  bool Key(const char* str, rapidjson::SizeType length, bool /*copy*/);
  bool StartObject();
  bool EndObject(rapidjson::SizeType /*memberCount*/);
  bool StartArray();
  bool EndArray(rapidjson::SizeType /*elementCount*/);

  Paper paper_;
  Review review_;
  std::string key_;
  Papers* papers_{nullptr};
  HandlerState state_{HandlerState::None};
};

Notice that we defined handlers only for the object and array parsing events, for keys, and for unsigned int and string values. Now, we can create the rapidjson::FileReadStream object and initialize it with a handle to the opened file and with a buffer that the parser will use for intermediate storage. We pass the rapidjson::FileReadStream object as the first argument to the Parse() method of the rapidjson::Reader object. The second argument is an object of the type we derived from rapidjson::BaseReaderHandler, as illustrated in the following code block:

auto file = std::unique_ptr<FILE, void (*)(FILE*)>(
    fopen(filename.c_str(), "r"), [](FILE* f) {
      if (f)
        ::fclose(f);
    });
if (file) {
  char readBuffer[65536];
  rapidjson::FileReadStream is(file.get(), readBuffer, sizeof(readBuffer));
  rapidjson::Reader reader;
  Papers papers;
  ReviewsHandler handler(&papers);
  auto res = reader.Parse(is, handler);
  if (!res) {
    throw std::runtime_error(rapidjson::GetParseError_En(res.Code()));
  }
  return papers;
} else {
  throw std::invalid_argument("File can't be opened " + filename);
}
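
Once parsing succeeds, the returned Papers container can be used directly by later processing steps. Because the dataset stores numeric fields such as evaluation as strings, they usually need converting first. The following is a minimal sketch of that step, assuming reduced local copies of the Review and Paper types; MeanEvaluation and CountByDecision are hypothetical helpers that are not part of the original code:

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>
#include <vector>

// Reduced local mirrors of the Review and Paper types from the text; only the
// fields this sketch needs are kept.
struct Review {
  std::uint32_t id{0};
  std::string evaluation;
};

struct Paper {
  std::uint32_t id{0};
  std::string preliminary_decision;
  std::vector<Review> reviews;
};

using Papers = std::vector<Paper>;

// Hypothetical helper: mean of the "evaluation" scores of one paper.
// Scores that std::stoi cannot convert (for example, empty strings) are
// skipped. Returns 0.0 when no score could be converted.
double MeanEvaluation(const Paper& paper) {
  double sum = 0.0;
  std::size_t count = 0;
  for (const Review& review : paper.reviews) {
    try {
      sum += std::stoi(review.evaluation);
      ++count;
    } catch (const std::exception&) {
      // std::stoi throws std::invalid_argument or std::out_of_range.
    }
  }
  return count == 0 ? 0.0 : sum / static_cast<double>(count);
}

// Hypothetical helper: number of papers with the given preliminary decision.
std::size_t CountByDecision(const Papers& papers,
                            const std::string& decision) {
  std::size_t count = 0;
  for (const Paper& paper : papers) {
    if (paper.preliminary_decision == decision) ++count;
  }
  return count;
}
```

Skipping unconvertible scores rather than throwing is a design choice of this sketch; a stricter pipeline might prefer to reject such records.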

When there are no parsing errors, we have an initialized array of Paper objects. Let's look more closely at the event handler's implementation. Our event handler works as a state machine: in one state it fills the current Review object, in another it fills the current Paper object, and the remaining states track the enclosing arrays and the top-level object, as shown in the following code snippet:

enum class HandlerState {
  None,
  Global,
  PapersArray,
  Paper,
  ReviewArray,
  Review
};
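
The states mirror the nesting of the JSON document, from the top-level object down to a single review. That nesting can be sketched as a parent lookup; Parent, Depth, and ToString below are hypothetical helpers for illustration and debugging, not part of the original handler:

```cpp
#include <string>

// Same enumerators as the HandlerState enum in the text.
enum class HandlerState {
  None, Global, PapersArray, Paper, ReviewArray, Review
};

// Hypothetical helper: the state the handler returns to when the current
// object or array ends. None has no parent and maps to itself.
HandlerState Parent(HandlerState state) {
  switch (state) {
    case HandlerState::Review:      return HandlerState::ReviewArray;
    case HandlerState::ReviewArray: return HandlerState::Paper;
    case HandlerState::Paper:       return HandlerState::PapersArray;
    case HandlerState::PapersArray: return HandlerState::Global;
    case HandlerState::Global:      return HandlerState::None;
    case HandlerState::None:        return HandlerState::None;
  }
  return HandlerState::None;
}

// Hypothetical helper: how many levels deep a state is in the document.
int Depth(HandlerState state) {
  int depth = 0;
  while (state != HandlerState::None) {
    state = Parent(state);
    ++depth;
  }
  return depth;
}

// Hypothetical debugging helper: readable name of a state for log messages.
std::string ToString(HandlerState state) {
  switch (state) {
    case HandlerState::None:        return "None";
    case HandlerState::Global:      return "Global";
    case HandlerState::PapersArray: return "PapersArray";
    case HandlerState::Paper:       return "Paper";
    case HandlerState::ReviewArray: return "ReviewArray";
    case HandlerState::Review:      return "Review";
  }
  return "Unknown";
}
```

The chain Review, ReviewArray, Paper, PapersArray, Global, None is the order in which the end-of-object and end-of-array events unwind the state.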

We parse unsigned int values only for the id attributes of the Paper and the Review objects, and we update these values according to the current state and the previously parsed key, as follows:

bool Uint(unsigned u) {
  bool res{true};
  try {
    if (state_ == HandlerState::Paper && key_ == "id") {
      paper_.id = u;
    } else if (state_ == HandlerState::Review && key_ == "id") {
      review_.id = u;
    } else {
      res = false;
    }
  } catch (...) {
    res = false;
  }
  key_.clear();
  return res;
}

String values also exist in both types of objects, so we do the same checks to update corresponding values, as follows:

bool String(const char* str, rapidjson::SizeType length, bool /*copy*/) {
  bool res{true};
  try {
    if (state_ == HandlerState::Paper && key_ == "preliminary_decision") {
      paper_.preliminary_decision = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "confidence") {
      review_.confidence = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "evaluation") {
      review_.evaluation = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "lan") {
      review_.language = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "orientation") {
      review_.orientation = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "remarks") {
      review_.remarks = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "text") {
      review_.text = std::string(str, length);
    } else if (state_ == HandlerState::Review && key_ == "timespan") {
      review_.timespan = std::string(str, length);
    } else {
      res = false;
    }
  } catch (...) {
    res = false;
  }
  key_.clear();
  return res;
}

The event handler for the JSON key attribute stores the key value to the appropriate variable, which we use to identify a current object in the parsing process, as follows:

bool Key(const char* str, rapidjson::SizeType length, bool /*copy*/) {
  key_ = std::string(str, length);
  return true;
}

The StartObject event handler switches states according to the current key value and the previous state value. We base the current implementation on our knowledge of the structure of this JSON file: the top-level object holds an array of Paper objects, and each Paper object includes an array of reviews. This is one of the limitations of the SAX interface: we need to know the structure of the document to implement all event handlers correctly. The code can be seen in the following block:

bool StartObject() {
  if (state_ == HandlerState::None && key_.empty()) {
    state_ = HandlerState::Global;
  } else if (state_ == HandlerState::PapersArray && key_.empty()) {
    state_ = HandlerState::Paper;
  } else if (state_ == HandlerState::ReviewArray && key_.empty()) {
    state_ = HandlerState::Review;
  } else {
    return false;
  }
  return true;
}

In the EndObject event handler, we populate arrays of Paper and Review objects according to the current state. Also, we switch the current state back to the previous one by running the following code:

bool EndObject(rapidjson::SizeType /*memberCount*/) {
  if (state_ == HandlerState::Global) {
    state_ = HandlerState::None;
  } else if (state_ == HandlerState::Paper) {
    state_ = HandlerState::PapersArray;
    papers_->push_back(paper_);
    paper_ = Paper();
  } else if (state_ == HandlerState::Review) {
    state_ = HandlerState::ReviewArray;
    paper_.reviews.push_back(review_);
    // Reset so that fields missing from the next review do not keep the
    // values of this one.
    review_ = Review();
  } else {
    return false;
  }
  return true;
}

In the StartArray event handler, we switch the current state to a new one according to the current state value by running the following code:


bool StartArray() {
  if (state_ == HandlerState::Global && key_ == "paper") {
    state_ = HandlerState::PapersArray;
    key_.clear();
  } else if (state_ == HandlerState::Paper && key_ == "review") {
    state_ = HandlerState::ReviewArray;
    key_.clear();
  } else {
    return false;
  }
  return true;
}

In the EndArray event handler, we switch the current state to the previous one based on our knowledge of the document structure by running the following code:

bool EndArray(rapidjson::SizeType /*elementCount*/) {
  if (state_ == HandlerState::ReviewArray) {
    state_ = HandlerState::Paper;
  } else if (state_ == HandlerState::PapersArray) {
    state_ = HandlerState::Global;
  } else {
    return false;
  }
  return true;
}

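The vital thing in this approach is to clear the current key value once the value or array it names has been processed. The StartObject handler accepts a new object only when no key is pending, so a stale key makes the handler return false at the point where the document departs from the expected structure. This makes parsing errors easier to localize, and the key_ member always reflects the entity that is currently being processed.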
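
The effect of key clearing can be checked without the parser by replaying, by hand, the events that a SAX parser would be expected to emit for a tiny document. The following is a minimal sketch under that assumption: MiniHandler is a hypothetical, reduced variant of the handler that keeps the state and key logic but only counts papers and reviews, and its clear_key_after_value_ flag exists only to demonstrate the failure mode:

```cpp
#include <cstddef>
#include <string>

enum class HandlerState {
  None, Global, PapersArray, Paper, ReviewArray, Review
};

// Hypothetical reduced handler: same transitions as in the text, no data.
struct MiniHandler {
  bool Key(const std::string& key) {
    key_ = key;
    return true;
  }

  // Stands in for Uint() and String(): a scalar value consumes the key.
  bool Value() {
    const bool expected =
        !key_.empty() &&
        (state_ == HandlerState::Paper || state_ == HandlerState::Review);
    if (clear_key_after_value_) key_.clear();
    return expected;
  }

  bool StartObject() {
    if (!key_.empty()) return false;  // a pending key is unexpected here
    if (state_ == HandlerState::None) state_ = HandlerState::Global;
    else if (state_ == HandlerState::PapersArray) state_ = HandlerState::Paper;
    else if (state_ == HandlerState::ReviewArray) state_ = HandlerState::Review;
    else return false;
    return true;
  }

  bool EndObject() {
    if (state_ == HandlerState::Global) {
      state_ = HandlerState::None;
    } else if (state_ == HandlerState::Paper) {
      state_ = HandlerState::PapersArray;
      ++papers;
    } else if (state_ == HandlerState::Review) {
      state_ = HandlerState::ReviewArray;
      ++reviews;
    } else {
      return false;
    }
    return true;
  }

  bool StartArray() {
    if (state_ == HandlerState::Global && key_ == "paper") {
      state_ = HandlerState::PapersArray;
    } else if (state_ == HandlerState::Paper && key_ == "review") {
      state_ = HandlerState::ReviewArray;
    } else {
      return false;
    }
    key_.clear();
    return true;
  }

  bool EndArray() {
    if (state_ == HandlerState::ReviewArray) state_ = HandlerState::Paper;
    else if (state_ == HandlerState::PapersArray) state_ = HandlerState::Global;
    else return false;
    return true;
  }

  HandlerState state_{HandlerState::None};
  std::string key_;
  bool clear_key_after_value_{true};
  std::size_t papers{0};
  std::size_t reviews{0};
};

// Hand-written event sequence for
// {"paper":[{"id":1,"review":[{"id":1},{"id":2}]}]}
// Stops at the first handler that returns false, as a SAX parser would.
bool Replay(MiniHandler& h) {
  return h.StartObject() && h.Key("paper") && h.StartArray() &&
         h.StartObject() && h.Key("id") && h.Value() &&
         h.Key("review") && h.StartArray() &&
         h.StartObject() && h.Key("id") && h.Value() && h.EndObject() &&
         h.StartObject() && h.Key("id") && h.Value() && h.EndObject() &&
         h.EndArray() && h.EndObject() && h.EndArray() && h.EndObject();
}
```

With the default settings, Replay succeeds and the handler ends in the None state. With clear_key_after_value_ set to false, the key of the first review's id is still pending when the second review object starts, so StartObject rejects it and Replay returns false.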

For small files, using the DOM approach can be preferable because it leads to less code and cleaner algorithms.
