A look at data formats

Now that we have looked at our own data format, let's go ahead and take a look at some fairly popular data formats that are currently out there. This is not an exhaustive look, but more an introduction to data formats and what we may encounter out in the wild.

The first data format that we will look at is a schema-less format. As stated previously, schema-based formats either send the schema for the data ahead of time, or they send the schema along with the data itself. This usually allows a more compact form of the data to go over the wire, while also making sure both endpoints agree on how the data will be received. The other form is schema-less, where the data itself carries everything needed to decode it, and the rules for decoding come from the format's specification.

JSON is one of these formats. When we send JSON, we have to encode it on one side and decode it once we are on the other side. Another schema-less data format is XML. Both of these should be quite familiar to web developers, as we utilize JSON extensively and we use a form of XML when putting together our frontends (HTML).
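As a minimal sketch of that round trip, this is all the built-in JSON methods need (the payload here is a hypothetical stand-in shaped like the test.json file we use later):

// A hypothetical stand-in payload; any JSON-serializable object works the same way.
const payload = { item1: 'item', item2: 120, item3: 3.3 };
// Encode to a string before sending it over the wire...
const wire = JSON.stringify(payload);
// ...and decode it back into an object on the other side.
const received = JSON.parse(wire);
console.log(wire.length, received);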

Another popular format is MessagePack (https://msgpack.org/index.html). MessagePack is a format that is known for producing smaller payloads than JSON. What is also nice about MessagePack is the number of languages that have a native implementation of the library. We will take a look at the Node.js version, but just note that it can be used both on the frontend (in the browser) and on the server. So let's begin:

  1. We will install the what-the-pack package with npm by utilizing the following command:
> npm install what-the-pack
  2. Once we have done this, we can start to use the library. With the following code, we can see how easy it is to work with this data format over the wire:
import MessagePack from 'what-the-pack';
import json from '../schema/test.json';

// Reserve a 2**22-byte (4 MB) buffer that the library slices
// into during encoding and decoding.
const { encode, decode } = MessagePack.initialize(2**22);
const encoded = encode(json);
const decoded = decode(encoded);
// Compare the MessagePack payload size against the JSON size.
console.log(encoded.byteLength, Buffer.from(JSON.stringify(decoded)).byteLength);
console.log(encoded, decoded);

What we see here is a slightly modified version of the example on the what-the-pack page (https://www.npmjs.com/package/what-the-pack). We import the package and then initialize the library. One way this library differs from others is that we need to initialize a buffer for the encoding and decoding process. This is what the 2**22 in the initialize method is doing: we are initializing a buffer that is 2 to the power of 22 bytes (4 MB) large. This way, the library can slice the buffer and copy from it without expensive array-based operations. Another thing keen observers will note is that the library is not based on streaming. This was most likely done to stay compatible between the browser and Node.js. Other than these small issues, the overall library works just as we would expect.

The first console log shows us that the encoded buffer is 5 bytes smaller than the JSON version. While this does showcase that the library gives us a more compact form, it should be noted that there are cases where MessagePack may not be smaller than the corresponding JSON. It may also run slower than the built-in JSON.stringify and JSON.parse methods. Remember, everything is a trade-off.
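If we want to check this for our own payloads, we can measure both encodings side by side. The following sketch reuses the encode function we initialized above; the sample object is just a made-up example:

// A made-up sample payload; swap in your own data to compare.
const sample = { text: 'a'.repeat(32), values: [1, 2, 3] };
const packed = encode(sample);
const plain = Buffer.from(JSON.stringify(sample));
console.log('MessagePack:', packed.byteLength, 'JSON:', plain.byteLength);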

There are plenty of schema-less data formats out there, and each has its own tricks to make encoding/decoding faster and the over-the-wire data smaller. However, when we are dealing with enterprise systems, we will most likely see a schema-based data format being used.

There are a couple of ways to define a schema but, in our case, we will use the proto file format:

  1. Let's go ahead and create a proto file that mirrors the test.json file that we had. The schema could look something like the following:
syntax = "proto3";
package exampleProtobuf;

message TestData {
  string item1 = 1;
  int32 item2 = 2;
  float item3 = 3;
}

What we are declaring here is that a message called TestData is going to live in a package called exampleProtobuf. The package is mainly there to group like items (this is heavily utilized in languages such as Java and C#). The syntax statement tells our encoder and decoder that the protocol we are going to use is proto3; note that this declaration has to be the first statement in the file. There have been other versions of the protocol, and proto3 is the most up-to-date stable version.

We then declare a new message called TestData that has three fields. One will be called item1 and will be of type string, one will be a whole number called item2, and the final one will be a floating-point number called item3. We also give each field a number; these field numbers identify the fields in the encoded message, which is why they are mandatory for protobuf to work. We will not go into exactly how they are encoded, but note that they help make the encoding and decoding process compact.

  2. Next, we can write some code that uses this file to create a TestData type in our code that can specifically handle these messages. This would look like the following:
import protobuf from 'protobufjs';
import json from '../schema/test.json';

protobuf.load('test.proto', function(err, root) {
  if( err ) throw err;
  const TestTypeProto = root.lookupType("exampleProtobuf.TestData");
  // verify returns null when the payload is valid, or an error string otherwise
  if( TestTypeProto.verify(json) ) {
    throw Error("Invalid type!");
  }
  const message2 = TestTypeProto.create(json);
  const buf2 = TestTypeProto.encode(message2).finish();
  const final2 = TestTypeProto.decode(buf2);
  console.log(buf2.byteLength, Buffer.from(JSON.stringify(final2)).byteLength);
  console.log(buf2, final2);
});

Notice that this is similar to most of the code we have seen, except for the verification and creation steps. First, the library reads in our proto file and makes sure it is actually correct. Next, we look up the message type by the package and name we gave it. Then, we verify our payload and create a message from it. We then run it through the encoder specific to this data type. Finally, we decode the message and check that we got the same data back out that we put in.

Two things should be noticeable from this example. First, the data size is quite small! This is one advantage that schema-based formats such as protobuf have over schema-less ones. Since we know ahead of time what the types should be, we do not need to encode that information into the message itself. Second, we will see that the floating-point number did not come back out as 3.3. This is due to the limited precision of 32-bit floats, and it is something we should be on the lookout for.
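We can see the same precision loss without protobuf at all. Math.fround rounds a number to the nearest 32-bit float, which is the same representation that the float type in our schema uses:

// 3.3 has no exact 32-bit float representation, so a float field
// comes back as the nearest representable value.
console.log(Math.fround(3.3)); // prints roughly 3.299999952316284
console.log(Math.fround(3.3) === 3.3); // false

If the exact value matters, a double field (64-bit) would preserve what a JavaScript number can hold.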

  3. Now, if we do not want to read in proto files like this, we can build the message in code, like the following:
// Build the same message type programmatically: field name, field number, type.
const TestType = new protobuf.Type("TestType");
TestType.add(new protobuf.Field("item1", 1, "string"));
TestType.add(new protobuf.Field("item2", 2, "int32"));
TestType.add(new protobuf.Field("item3", 3, "float"));

This should resemble the message that we created in the proto file, but we will go over each line to show that it is the same protobuf object. We first create a new type, called TestType in this case (instead of TestData). Next, we add three fields, each with its own name, a field number, and the type of data that is stored in it. If we run this through the same verify, create, encode, and decode process, we will get the same results as before.
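To confirm this, here is a minimal sketch of that same round trip with the TestType we just built, assuming the same json payload from the earlier examples:

// verify returns null for a valid payload, or an error string otherwise.
const error = TestType.verify(json);
if (error) throw Error(error);
const message = TestType.create(json);
const buffer = TestType.encode(message).finish();
const decoded = TestType.decode(buffer);
console.log(buffer.byteLength, decoded);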

While this has not been a comprehensive overview of different data formats, it should help us recognize when we might use schema-less formats (when we do not know what the data may look like) and when to use schema-based ones (when communicating between known systems or when we need a decrease in payload size).
