Scaling databases with Microservices

One of, Microservices rules is that every service is supposed to manage its own data, which makes it easy to scale data. The idea is that managing and handling a smaller set of data is easier than managing a monolithic database. For example, in our example of an e-commerce application, say we have one product service that manages product-related information, another for users, and so on. The first thing to note is that we are not trying to keep all the data in one database machine. Each service can scale independently with its own data.

How you manage your data will depend on the kind of application you are trying to build. But it is worth spending some time understanding the scalability aspects of a database. Traditionally, when designing for scalability, databases were the most difficult part of an application to scale and used to usually be the bottleneck. I remember in the early days of my career – when NoSQL (we will come to this soon) databases were not so popular, and we used to rely more on Relational or SQL databases – scaling was a nightmare. It wasn't impossible, but it was definitely tricky. Let me explain what I'm talking about.

Imagine we have this e-commerce application that has different important aspects, such as products (details of products being sold), inventory, users, shopping cart, orders, and transactions. Everything works fine until we have a few thousand products with a few thousand users. Imagine the site starts getting more popular, which of course we want, and the number of products and users is getting into the millions. We had our relations database on 100 GB of storage, which is no longer sufficient. Ages ago, the obvious solution was to move our database to a bigger machine, with more disk, memory, and computational power. But, as already discussed, this approach has its limitations. A single high-power machine is usually costlier than multiple low-power machines. And, more importantly, there is a limit to increasing the power of a single machine; that is, you can add only a limited amount of CPU or RAM to a system.

There are solutions to this problem; two of the most common are read replicas and sharding.

Read replicas are a solution where we create additional database imagread-only only copies. The replicas would only serve read requests. This solution usually helps in cases where we expect more read operations than data manipulation operations. For example, when people are browsing products on a website, they are actually not modifying any data. The idea is to distribute these search requests across read-only database images.

The following diagram shows the use of read replicas:

Though read replicas help to speed up fetch queries, the design is not without its drawbacks. For example, if we added a new product or an existing product goes out of stock, the information first gets updated to the main database and then copied over to read replicas. There is some lag in this operation; copying from the main database to replicas will take some time. The delay can last anywhere from a few seconds to a few minutes, based on the business logic applied. So while this data is getting updated, users are seeing stale data. Imagine the frustration of users who saw a product, added it to their cart, but when they were checking out they got a message that the product was out of stock.

Another important and useful aspect that helps to scale a database is sharding. The idea behind the concept of sharding is to logically divide data into multiple databases, where it helps to keep the data in different machines. A simple example would be in that the case of an e-commerce site, we would like to keep data from different countries in different database instances. Another common way to divide data is through a primary key or some other logical divider. For example, in a database that manages employee data, we might want to keep data for employees whose surnames start from A-H in one database, I-P in another, and Q-Z in another.

The following diagram shows how a sharded database appears:

Though sharding is another important and widely-used approach for scaling databases, it comes with its own problems. The most common one is joining the tables. In a single database instance, it would be easy to join multiple tables and generate reports, whereas if data for one table is in one machine and another table is in the second machine, it is difficult to join the two tables and fetch any meaningful data.

NoSQL databases have given us many options in managing database scalability. Before going into detail, let me take a moment here to explain what NoSQL databases are, and how they help us manage scalability. I believe NoSQL itself is a somewhat misunderstood the concept. A lot of the time, I have heard of Not-SQL being referred to as something that would simply solve all your scalability problems automatically.

So, what is NoSQL? Literally, it would mean No-SQL. But I actually prefer the term non-relational database. Let's look into this concept. NoSQL actually is an umbrella term, which groups all the databases that do not keep data in traditional relational or table-based formats.

There are four major types into which we can classify NoSQL databases, based on the approach they use to store the data:

Key-value based database: This is perhaps the simplest kind of database. We are storing data in the form of key-value combinations. Think of a hashmap kind of structure, where we are adding a unique identifier as a key and data objects as values. It is very easy to scale, as long as we have unique keys. Examples of this kind of database include Redis and Riak.
Column-based databases: We have been using row-based relational databases for a long time. In row-based databases, we think of a record as a single object that can be stored in a database's table row. For example, an employee record that has the ID, name, department, salary, and joining date can be thought of as a single record of a relational database:

ID	Name	Department	Salary	Joining date
1211	Joe	Finance	70000	29-07-2014
1212	David	IT	77000	17-09-2016

Let's say most of our requirements are around grouping data based on different aspects, such as, fetch how many employees are in IT, how many employees earn more than $75,000?, or, find the average salary. Row-based storage is not very effective in such cases.

In contrast, a database with column-based storage stores data based on the columns. For example, think of all the salary data is present independently and can be accessed for independent calculation without worrying about what extra data like department, joining date etc is available. This makes the calculations and access much easier based on a particular column.

Examples of databases using a column-based storage mechanism are Cassandra and Vertica.

Document-based databases: This can be thought of as an extension to key-value-based storage. In a key-value-based system, a value can be anything, but a document-based database adds a restriction for proper formatting to be followed for data being stored. The data that is stored as documents. Metadata is provided for each document being stored, for better indexing and searching. Examples of document-based storage databases are MongoDB and CouchDB.
Graph-based databases: This kind of database is useful when our data records are connected to other records in some way and we need a method to parse this connectivity. A simple example is when we are storing information for people, and we need to capture friendship information, such as P1 is a friend of P2, who in turn is a friend of P4 and P6, so we can capture some relationship between P1, P2, P4, P6, and so on. Examples of databases that use graph-based storage are Neo4J and OrientDB.

A complete discussion on databases is outside the scope of this book, so we will quickly jump back to our core topic, the scalability of Microservices and databases.

As we have already mentioned, with a Microservices-based architecture, it is recommended that each Microservice be responsible for its own data. The first advantage we get with this approach is that we are not dealing with a huge set of data in one go.

The second important advantage we get is that because each Microservice is independent, it is easy to manage different types of data stores. For example, you might want to keep product data in a document-based database, such as MongoDB, whereas you keep user information in traditional relational databases, such as Oracle RDBMS or MySQL.

Command Query Responsibility Segregation (CQRS): We are used to looking at any entity as having four core operations, known as CRUD operations. These are Create, Read, Update, and Delete. Normally, we would have a single service, model, and database managing all these operations, but sometimes it makes sense to segregate Command (Create, Update, and Delete) and Query (Read) operations. This approach can have many advantages, such as we can manage the Read or Query operation in multiple ways; for example, at some places we need only a subset of data in list form, and at other places we might need a more detailed version.

The approach also has additional advantages as we are segregating the read operations from the update operations. We can use Read replicas of the database to achieve better performance. CQRS also helps in segregating UI views. An update to the data can trigger an event to update the view.

Though CQRS is helpful in separating Read and Update concerns, and hence supports the Single Responsibility principle, we need to be careful in implementing this as we are adding complexity to the system.

If the entity is simple and does not need too many customized operations, it might be a good idea to stick with a simple CRUD operation implementation.

Table of Contents for Scaling databases with Microservices

Create new playlist

Sign In

Sign Up

Table of Contents for
Scaling databases with Microservices