Serialization

The next layer (going from bottom to top in Figure 12.12) is serialization. Once a datagram is available, it has to be converted to a Python object. As with framing, many different serialization protocols exist. Some are text-based, and others are binary.

Text-based serialization is widely used because it is usually easy to implement. In some cases, a hand-made serialization algorithm based on key/value pairs separated by a specific character can be enough. For example, the Comma-Separated Values (CSV) format, widely used to store datasets, is one of the simplest forms of text-based serialization. The price of this simplicity is that such algorithms have limitations. The main one is that they cannot handle complex types: the values cannot be structured, only base types such as integers and strings.
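As an illustration, a hand-made key/value serializer can be sketched in a few lines. The function names, the `;` separator, and the `key=value` layout are assumptions chosen for this example, not a standard format. The sketch also shows the limitation: every value comes back as a string, and a nested value is flattened to its text representation.

```python
def serialize(record: dict, sep: str = ";") -> str:
    """Encode a flat dict as 'key=value' pairs joined by sep (hypothetical format)."""
    return sep.join(f"{key}={value}" for key, value in record.items())


def deserialize(text: str, sep: str = ";") -> dict:
    """Decode 'key=value' pairs; type information is lost, all values are strings."""
    return dict(pair.split("=", 1) for pair in text.split(sep))


flat = {"name": "sensor-1", "value": 42}
encoded = serialize(flat)
decoded = deserialize(encoded)

# A structured value does not survive the round trip: it comes back as a string.
nested = deserialize(serialize({"position": {"x": 1}}))
```

Here `decoded["value"]` is the string `"42"` rather than the integer 42, and `nested["position"]` is a string rather than a dictionary. Keys or values that contain the separator would also break this format.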

More advanced text-based algorithms exist to cope with these limitations. The most widely used one is probably JavaScript Object Notation (JSON). JSON is widely used to encode information on the web. Its syntax is simple, and it allows us to serialize complex types composed of lists and objects. Another advantage of JSON is that an object can be serialized to a string that fits on a single line, which makes it a good match for line-based framing. The main drawback of all text-based serialization algorithms is performance: parsing text is comparatively expensive, so these algorithms can have a significant performance impact under high workloads.
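The combination of JSON and line-based framing can be sketched with the standard `json` module. The helper names `encode_line` and `decode_lines` are hypothetical and only serve this example. With its default settings, `json.dumps` escapes newline characters inside strings, so each serialized object occupies exactly one line.

```python
import json


def encode_line(obj) -> bytes:
    """Serialize obj to a single JSON line terminated by a newline."""
    return (json.dumps(obj) + "\n").encode("utf-8")


def decode_lines(data: bytes) -> list:
    """Split a byte stream on newlines and parse each non-empty line as JSON."""
    return [json.loads(line) for line in data.split(b"\n") if line]


messages = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "position": {"x": 1.5, "y": -2}},
    {"id": 3, "text": "two\nlines"},  # the embedded newline is escaped by json.dumps
]
stream = b"".join(encode_line(m) for m in messages)
decoded = decode_lines(stream)
```

Nested lists and objects survive the round trip, and the stream contains exactly one newline per message.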

Binary-based serialization is more complex to implement than text-based serialization. There are several reasons for this. The first one is that, since objects are encoded as binary blobs, ensuring compatibility between multiple systems and programming languages is harder. The second one is also why binary-based serialization is worth the effort: encoding data in a binary format allows for many optimizations, both in the size of the encoded data and in the speed of encoding. As a result, binary-based serialization is usually more efficient than text-based serialization. Depending on the context, this can be an important criterion when choosing the serialization algorithm for an application.
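The size difference can be illustrated with the standard `struct` module. The record layout below (an unsigned 32-bit identifier followed by a 64-bit float, little-endian) is an assumption made for this sketch, not a format defined in this chapter. The explicit byte order in the format string is the kind of detail that has to be agreed on for different systems to interoperate.

```python
import json
import struct

# Hypothetical fixed layout: "<" little-endian without padding,
# "I" unsigned 32-bit integer, "d" 64-bit float.
RECORD = struct.Struct("<Id")


def encode_binary(sensor_id: int, value: float) -> bytes:
    """Pack the two fields into a fixed-size binary blob."""
    return RECORD.pack(sensor_id, value)


def decode_binary(blob: bytes) -> tuple:
    """Unpack a blob produced by encode_binary into (sensor_id, value)."""
    return RECORD.unpack(blob)


binary = encode_binary(1234, 21.5)
text = json.dumps({"id": 1234, "value": 21.5}).encode("utf-8")

print(len(binary), len(text))
```

The binary blob is smaller than the JSON text, but it carries no field names: the reader must already know the layout to decode it.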

Among all the available algorithms, Google developed and open-sourced one of the most versatile ones: Protobuf. Protobuf allows us to serialize anything from simple objects to large and very complex ones. Moreover, it supports a feature rarely supported by its competitors: managing the evolution of data structures. This allows us to add or remove fields in the objects being serialized, while ensuring that existing software keeps working with these changes (not in all cases, but in many). Google provides implementations of Protobuf for many languages, with code generated from a Domain Specific Language (DSL), which makes it very easy to use. But Protobuf is only one of the many available solutions. MessagePack is an alternative that also supports many programming languages. MessagePack aims to be similar to JSON in its ease of use but with a binary encoding, which makes it more compact and faster to process than JSON. Other popular solutions are Avro and Thrift, both of which are Apache Software Foundation projects (Thrift was originally developed at Facebook).

The choice between text-based and binary-based serialization should rest on two main criteria: performance and ease of use. If serialization may become the performance bottleneck of the application (which is rarely the case), then binary-based serialization should be used. Otherwise, the choice should be based on ease of use within the ecosystem. Special care is needed on that point for binary-based serialization: many algorithms are not supported in JavaScript running in a browser, so they cannot be used for communication between a web frontend and a backend.
