Chapter 1. Understanding Replication Concepts

In this chapter, you will be introduced to various replication concepts, and you will learn which kind of replication is best suited to which practical scenario. By the end of the chapter, you will be able to judge whether a certain concept is feasible under given circumstances.

We will cover the following topics in this chapter:

  • CAP theory
  • Physical limitations of replication
  • Why latency matters
  • Synchronous and asynchronous replication
  • Sharding and replication

Before we jump into practical work using PostgreSQL, we will guide you through some very fundamental ideas and facts related to replication.

The CAP theory and physical limitations

You might wonder why a theory can be found at such a prominent place in a book that is supposed to be highly practical. Well, there is a very simple reason for that: the nice-looking marketing papers of some commercial database vendors might leave you with the impression that everything is possible and easy to do, without any serious limitations. This is not the case; there are physical limitations that every software vendor has to cope with. There is simply no way around the laws of nature, and shiny marketing cannot overcome them.

In this chapter, you will be introduced to the so-called CAP theory. Understanding the basic ideas behind this theory is essential when it comes to fending off requirements that simply cannot be turned into reality.

Understanding the CAP theory

Before we dig into the details, we have to discuss what CAP actually means. CAP is an abbreviation for the following three features:

  • Consistency: This feature indicates whether all the nodes in a cluster see the same data at the same time or not.
  • Availability: This feature indicates whether it is certain that every request will receive an answer, that is, whether a user can consider all the nodes in a cluster to be available. Think of data or state information split between two machines: a request arrives, and machine 1 holds some of the data while machine 2 holds the rest. If either machine goes down, not all requests can be fulfilled, because the data is not available in its entirety on either machine.
  • Partition tolerance: This feature indicates whether the system will continue to work even if arbitrary messages are lost on the way. A network partition occurs when a node is no longer reachable (think of a failed network connection). Another way to look at partition tolerance is to think in terms of message passing: if an individual node can no longer send messages to, or receive messages from, the other nodes, it has effectively been partitioned out of the network.

Why are these three points relevant to normal users? Well, the bad news is that a replicated (or distributed) system can only provide two of these three features at the same time.

It is theoretically impossible to offer consistency, availability, and partition tolerance at the very same time. As you will see later in this book, this has a significant impact on which system layouts are safe and feasible to use. There is simply no such thing as the solution to all replication-related problems. When you are planning a large-scale system, you may have to come up with different concepts, depending on your specific requirements.
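To make this trade-off a little more tangible, here is a minimal, purely illustrative sketch of the decision a two-node system has to make once its nodes can no longer reach each other. The function and its parameters are made up for illustration; this is not how PostgreSQL or any particular NoSQL system implements the behavior:

    # Illustrative sketch only: how a node might react to a write request
    # once it can no longer reach its partner (a network partition).
    def handle_write(key, value, partner_reachable, prefer_consistency):
        if partner_reachable:
            # Normal operation: replicate to the partner and acknowledge.
            return "replicated and acknowledged"
        if prefer_consistency:
            # CP-style behavior: refuse the write rather than risk divergence,
            # sacrificing availability for the duration of the partition.
            return "rejected: partner unreachable"
        # AP-style behavior: accept the write locally and reconcile later,
        # sacrificing consistency (the partner sees stale data for a while).
        return "accepted locally, to be synchronized later"

Whichever branch is taken during a partition, one of the three features is given up; the only real choice is which one.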

Tip

PostgreSQL, Oracle, DB2, and so on will provide you with consistency and availability (CA), but not with partition tolerance, while NoSQL systems such as MongoDB or Cassandra will provide you with availability and partition tolerance (AP), but not with strict consistency. This is why such NoSQL systems are often referred to as eventually consistent.

Why the speed of light matters

The speed of light is not just a theoretical issue; it really does have an impact on your daily life. More importantly, it has serious implications when it comes to finding the right solution for your cluster.

We all know that there is some sort of cosmic speed limit called the speed of light. So why care? Well, let us do a simple mental experiment. Let us assume for a second that our database server is running at a clock speed of 3 GHz.

How far can light travel within one clock cycle of your CPU? If you do the math, you will figure out that light travels around 10 cm per clock cycle (in a pure vacuum). We can safely assume that an electrical signal inside a CPU is orders of magnitude slower than light in a vacuum. The core insight is this: 10 cm in one clock cycle is not much at all.
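The number is easy to verify; the following snippet simply repeats the back-of-the-envelope calculation:

    # Back-of-the-envelope calculation: how far light travels in a vacuum
    # during a single clock cycle of a 3 GHz CPU.
    speed_of_light = 299_792_458      # meters per second
    clock_speed = 3_000_000_000       # 3 GHz = 3 * 10^9 cycles per second

    distance_per_cycle = speed_of_light / clock_speed
    print(f"{distance_per_cycle * 100:.1f} cm per clock cycle")  # roughly 10.0 cm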

For the sake of our mental experiment, let us now consider various distances:

  • Distance from one end of the CPU to the other
  • Distance from your server to some other server next door
  • Distance from your server in central Europe to a server somewhere in China

Considering the size of a CPU core on a die, you can assume that a signal (even if it is not traveling anywhere near the speed of light) can be sent from one part of the CPU to another quite fast. It simply won't take a million clock cycles to add up two numbers that are already in the first-level cache of your CPU.

But what happens if you have to send a signal from one server to another server and back? You can safely assume that sending a signal from server A to server B next door takes a lot longer, because the cable is simply a lot longer than 10 cm. In addition to that, network switches and other network components will add some latency as well.

Note

I am talking about the length of the cable here, not about its bandwidth.

Sending a message (or a transaction) from Europe to China is, of course, many times more time-consuming than sending some data to a server next door. Again, the important point here is that the amount of data is not as relevant as the so-called latency.
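If you want to get a feeling for the round-trip times involved, you can simply measure them. The following sketch times the setup of a TCP connection to a remote host; the host name and port are placeholders, and a plain ping would of course serve the same purpose:

    # Minimal sketch: estimate the network round trip by timing a TCP handshake.
    # "db.example.com" and port 5432 are placeholders; use any host you can reach.
    import socket
    import time

    host, port = "db.example.com", 5432

    start = time.monotonic()
    with socket.create_connection((host, port), timeout=5):
        pass
    elapsed = time.monotonic() - start
    print(f"connection setup took {elapsed * 1000:.1f} ms")

Run this against a server in the same rack and against one on another continent, and the difference will be dramatic, even though hardly any data is transferred.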

Long distance transmission

Let me try to explain the concept of latency with a very simple example. Let us assume you are in Europe and you are sending a letter to China. You will easily accept the fact that the size of your letter is not the limiting factor here. It makes absolutely no difference whether your letter is two or twenty pages long; the time it takes to reach its destination is basically the same. It also makes no difference whether you send one, two, or ten letters at the same time. Given a reasonable number of letters, the size of the aircraft (that is, the bandwidth) shipping them to China is usually not the problem. However, the so-called round trip might very well be an issue. If you rely on the response to your letter from China in order to continue your work, you will soon find yourself waiting for a long time.

Why latency matters

The same concept applies to replication: if you send a chunk of data from Europe to China, you should avoid waiting for the response. If you send a chunk of data from your server to a server in the same rack, you might well be able to wait for the response, because the electrical signal will simply be fast enough to make it back in time.

Note

The basic problems of latency described in this section are not PostgreSQL-specific. The very same concepts and physical limitations apply to all types of databases and systems. As mentioned before, this fact is sometimes silently hidden and neglected in shiny commercial marketing papers. Nevertheless, the laws of physics will stand firm. This applies to commercial and open source software.

The most important point you have to keep in mind here is that bandwidth is not always the magical fix to a performance problem in a replicated environment. In many setups, latency is at least as important as bandwidth.
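A quick calculation illustrates the point. Assume, purely as an example, a round-trip time of 250 ms between two distant locations: a session that waits for a remote confirmation after every operation cannot complete more than four such operations per second, no matter how much bandwidth is available.

    # Illustrative numbers only: how the round-trip time caps the rate of
    # operations that each wait for a remote confirmation before proceeding.
    round_trip_seconds = 0.250    # assumed 250 ms intercontinental round trip
    ops_per_second = 1 / round_trip_seconds
    print(f"at most {ops_per_second:.0f} waiting operations per second")
    # -> at most 4 waiting operations per second, regardless of bandwidth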
