Implementing the decoder

The decoder carries quite a bit more state than the encoder, and this is usually true of data formats: parsing information back out of raw bytes is typically harder than writing the data out in that raw format in the first place.

Let's take a look at the helper methods that we will use to decode the data types we support:

import { CONSTANTS } from './helper.js';

export const decodeString = function(buf) {
    // The first byte tells us the data type; bail out if it is not a string
    if(buf[0] !== CONSTANTS.string) {
        return false;
    }
    // Bytes 1-4 hold the string's length as a 32-bit big-endian integer
    const len = buf.readUInt32BE(1);
    // The string itself starts at byte offset 5 and runs for len bytes
    return buf.slice(5, 5 + len).toString('utf8');
}
export const decodeNumber = function(buf) {
    // Bytes 1-4 hold the 32-bit number; we skip the type byte at offset 0
    return buf.readInt32BE(1);
}

The decodeString method showcases how we could handle errors when the data is incorrectly formatted; the decodeNumber method does not. For decodeString, we need to grab the length of the string from the buffer, and we know it sits in the four bytes starting at offset 1. Based on this, we can grab the string starting at byte offset 5 (the first byte tells us this is a string; the next four hold the string's length) and read exactly that many bytes. We then run this slice through the toString method.

decodeNumber is quite simple since we only have to read the 4 bytes after the first byte telling us it is a number (again, we should do a type check here, but we are keeping it simple). These are the two main helper methods that we need to decode the data types we support; the short sketch below shows the byte layout they expect.
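To make the layout concrete, here is a small, hand-rolled sanity check. It is only a sketch: the actual value of CONSTANTS.string comes from helper.js, and the buffers are built by hand to mirror the layout the helpers expect (a type byte, a 4-byte big-endian length for strings, then the payload):

import { decodeString, decodeNumber, CONSTANTS } from './helper.js';

// Build a string buffer by hand: type byte, 4-byte length, then the UTF-8 bytes
const text = Buffer.from('hello', 'utf8');
const strBuf = Buffer.alloc(5 + text.length);
strBuf[0] = CONSTANTS.string;
strBuf.writeUInt32BE(text.length, 1);
text.copy(strBuf, 5);
console.log(decodeString(strBuf)); // 'hello'

// Build a number buffer by hand: decodeNumber skips the type byte at offset 0
const numBuf = Buffer.alloc(5);
numBuf.writeInt32BE(42, 1);
console.log(decodeNumber(numBuf)); // 42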

Next, let's take a look at the actual decoder. As stated previously, the decoding process is a bit more involved, for a number of reasons:

  • We are working directly on the bytes, so we have to do quite a bit of processing.
  • We are dealing with a header and body section. If we created a non-schema-based system, we might be able to write a decoder without as much state as this one has.
  • Again, since we are dealing with the buffers directly, all of the data may not come in at once, so we need to handle this case. The encoder does not have to worry about this since its writable side operates in object mode.

With this in mind, let's run through the decoding stream:

  1. We will set up our decode stream with the same kind of setup that we have used for Transform streams in the past. We will also set up a few private variables to track state as we move through the decoder:
import { Transform } from 'stream'
import { decodeString, decodeNumber, CONSTANTS } from './helper.js'

export default class SimpleSchemaReader extends Transform {
    #obj = {}          // the object we are currently building
    #inHeaders = false // are we inside the header section?
    #inBody = false    // are we inside the body section?
    #keys = []         // the keys pulled from the header section
    #currKey = 0       // which key the next body value belongs to
    // the constructor and methods are added in the following steps
}
  2. Next, we are going to utilize an index throughout the decoding process. We cannot simply read one byte at a time since the decoding process runs through the buffer at different speeds (when we read a number, we consume 5 bytes; when we read a string, we consume at least 6 bytes). Because of this, a while loop will work better:
#decode = function(chunk, index, type='headers') {
    // The first byte tells us whether to decode a string or a number
    const item = chunk[index] === CONSTANTS.string ?
        decodeString(chunk.slice(index)) :
        decodeNumber(chunk.slice(index, index + 5));
    if( type === 'headers' ) {
        // In the header section the item is a key; its value comes later
        this.#obj[item] = null;
    } else {
        // In the body section the item is the value for the current key
        this.#obj[this.#keys[this.#currKey]] = item;
    }
    // Return the index of the next unread byte
    return chunk[index] === CONSTANTS.string ?
        index + item.length + 5 :
        index + 5;
}
constructor(opts={}) {
    opts.readableObjectMode = true;
    super(opts);
}
_transform(chunk, encoding, callback) {
    let index = 0;
    while( index < chunk.byteLength ) {
        // filled in over the next step
    }
}

  3. Now, we check the current byte to see whether it is the header or body delineation mark. This lets us know whether we are working on the object's keys or on its values. If we detect the header byte, we toggle the #inHeaders Boolean stating whether we are in the headers. If we detect the body byte, we have more work to do:
// in the while loop
const byte = chunk[index];
if( byte === CONSTANTS.header ) {
    // Header delimiter: flip in or out of the header section
    this.#inHeaders = !this.#inHeaders;
    index += 1;
    continue;
} else if( byte === CONSTANTS.body ) {
    // Body delimiter: flip in or out of the body section
    this.#inBody = !this.#inBody;
    if(!this.#inBody ) {
        // We just closed the body, so the object is complete
        this.push(this.#obj);
        this.#obj = {};
        this.#keys = [];
        this.#currKey = 0;
        return callback();
    } else {
        // We just entered the body, so grab the keys collected from the headers
        this.#keys = Object.keys(this.#obj);
    }
    index += 1;
    continue;
}
if( this.#inHeaders ) {
    index = this.#decode(chunk, index);
} else if( this.#inBody ) {
    index = this.#decode(chunk, index, 'body');
    this.#currKey += 1;
} else {
    return callback(new Error("Unknown state!"));
}
  4. The paragraphs that follow explain how we pull out the headers and the values of each JSON object.

First, we toggle our #inBody Boolean. If we are going from inside the body to outside of it, we are done with this object, so we can push out the object we are currently working on and reset all of our internal state variables (the temporary object, #obj; the temporary set of #keys that we got from the header; and #currKey, which tracks which key we are working on while in the body). Once we have done this, we run the callback, returning as we do so that we don't run through the rest of the loop body; if we did not return, we would keep going through the loop and end up in a bad state.

Otherwise, we have just finished going through the headers of our payload and have reached the values for each object. We set our private #keys variable to the keys of the temporary object (since, at this point, the header section should have given us all of the keys). We can now get into the decoding process itself.

If we are in the headers, we run our private #decode method without the third argument, since its default is to behave as if we are in the headers. Otherwise, we pass 'body' as the third argument to state that we are in the body, and we also increment our #currKey variable.

Finally, we can take a look at the heart of the decoding process, the #decode method. We pick the item based on the first byte in the buffer, which tells us which decoding helper method to run. Then, if we are running in header mode, we create a new key on our temporary object and set its value to null, since it will be filled in once we get to the body. If we are in body mode, we set the value of the key corresponding to the #currKey index in the #keys array we are walking through.
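To see the whole round trip in one place, here is a hedged sketch that hand-assembles a single record in the layout this decoder expects and feeds it straight in. The delimiter and string type values (CONSTANTS.header, CONSTANTS.body, CONSTANTS.string) come from helper.js; CONSTANTS.number is only an assumed name for the number type marker, and the byte layout simply mirrors decodeString and decodeNumber above:

import { CONSTANTS } from './helper.js';
import SimpleSchemaReader from './decoder.js';

// type byte, 4-byte big-endian length, then the UTF-8 bytes
const str = (s) => {
    const body = Buffer.from(s, 'utf8');
    const head = Buffer.alloc(5);
    head[0] = CONSTANTS.string;
    head.writeUInt32BE(body.length, 1);
    return Buffer.concat([head, body]);
};
// type byte, then the 32-bit value (CONSTANTS.number is an assumed name)
const num = (n) => {
    const buf = Buffer.alloc(5);
    buf[0] = CONSTANTS.number;
    buf.writeInt32BE(n, 1);
    return buf;
};

const record = Buffer.concat([
    Buffer.from([CONSTANTS.header]), // header section opens
    str('item1'), str('item2'),      // the keys
    Buffer.from([CONSTANTS.header]), // header section closes
    Buffer.from([CONSTANTS.body]),   // body section opens
    str('item'), num(12),            // the values, aligned with the keys above
    Buffer.from([CONSTANTS.body])    // body section closes -> object is pushed
]);

const reader = new SimpleSchemaReader();
reader.on('data', (obj) => console.log(obj)); // { item1: 'item', item2: 12 }
reader.write(record);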

Stepping back from the code, the basic process can be summed up in a few steps:

  1. We need to go through the headers and set the object's keys to these values. We are temporarily setting the values for each of these keys to null since they will be filled in later.
  2. Once we move out of the header section and into the body section, we can grab all of the keys from the temporary object, and each decode run in the body fills in the value for the key at the current index in that array.
  3. Once we are out of the body, we reset all of the temporary variables for the state and send out the corresponding object since we are finished with the decoding process.

This may seem confusing, but all we are doing is lining up the header at some index with the body element at that same index. It would be similar to the following code if we wanted to put an array of keys and values together:

const keys = ['item1', 'item2', 'item3'];
const values = [1, 'what', 2.2];
const tempObj = {};
// First pass: register the keys (the header section)
for(let i = 0; i < keys.length; i++) {
    tempObj[keys[i]] = null;
}
// Second pass: line each value up with the key at the same index (the body section)
for(let i = 0; i < values.length; i++) {
    tempObj[keys[i]] = values[i];
}

This code is almost exactly the same as what we were doing with the preceding buffer, except we have to work with the raw bytes instead of higher-level items such as strings, arrays, and objects.

With both the decoder and the encoder finished, we can now run an object through our encoder and decoder to see whether we get the same value out. Let's run the following test harness code:

import encoder from './encoder.js'
import decoder from './decoder.js'
// Note: importing JSON from an ES module may require a flag or an import
// assertion depending on your Node version (for example, --experimental-json-modules)
import json from './test.json'

const enc = new encoder();
const dec = new decoder();
enc.pipe(dec);
dec.on('data', (obj) => {
    console.log(obj);
});
enc.write(json);

We'll use the following test object:

{
    "item1" : "item",
    "item2" : 12,
    "item3" : 3.3
}

We will see the same object come out the other side as we pipe the data through the encoder and into the decoder. Now, it's great that we created our own encoding and decoding scheme, but how does its transfer size hold up against JSON (since we are trying to do better than just stringifying and parsing)? With this payload, we are actually increasing the size! If we think about it, this makes sense: we have to add in all of our special encoding items (all of the information other than the data itself, such as the 0x10 and 0x11 bytes). However, if we switch to a payload with more large numerical items, we start to beat the basic JSON.stringify and JSON.parse:

{
    "item1" : "item",
    "item2" : 120000000,
    "item3" : 3.3,
    "item4" : 120000000,
    "item5" : 120000000,
    "item6" : 120000000
}

This happens because stringified numbers are turned into just that, string versions of the numbers, so once a number's string representation is longer than 5 characters, we start to save on bytes (we always spend 1 byte for the data type and 4 bytes for the 32-bit number encoding, whereas JSON spends 1 byte per digit; 120000000 costs 9 bytes as a string but only 5 in our format). With strings, we will never see savings, since we always add an extra 5 bytes of information (1 byte for the data type and 4 bytes for the length of the string).
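If you want to check this claim yourself, a rough measurement like the following works. It is only a sketch: it assumes the encoder pushes plain Buffers on its readable side (its writable side is the object-mode one), and it inlines the larger test object so it does not depend on importing JSON:

import encoder from './encoder.js';

const payload = {
    item1: 'item',
    item2: 120000000,
    item3: 3.3,
    item4: 120000000,
    item5: 120000000,
    item6: 120000000
};

const enc = new encoder();
let encodedBytes = 0;
// Count every byte the encoder emits on its readable side
enc.on('data', (chunk) => { encodedBytes += chunk.byteLength; });
enc.on('end', () => {
    const jsonBytes = Buffer.byteLength(JSON.stringify(payload));
    console.log(`custom format: ${encodedBytes} bytes, JSON: ${jsonBytes} bytes`);
});
enc.write(payload);
enc.end();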

Most encoding and decoding schemes behave this way: how they handle data involves trade-offs depending on the type of data being passed. In our case, if we are sending large, highly numerical data over the wire, our scheme will probably work better, but if we are sending strings across, we will not benefit at all from this encoding and decoding scheme. Keep this thought in mind as we take a look at some data formats that are used quite heavily out in the wild.

Remember, this encoding and decoding scheme is not meant to be used in actual environments as it is riddled with issues. However, it does showcase the underlying theme of building out data formats. While most of us will never have to build data formats, it is nice to understand what goes on when building them out, and where data formats may have to specialize their encoding and decoding schemes based on the type of data that they are primarily working with.
