Chapter 9. Distributed Databases and Storage: The Central Bank

One of the largest hurdles to overcome when building a cloud native infrastructure is operating reliable and scalable storage for both online and offline use. While running storage services in the cloud was once frowned upon, a number of cloud native storage projects have since succeeded, the most prominent being Vitess. Cloud providers, including Azure, have also invested significantly in managed disks for virtual machines and virtual machine scale sets, as well as advanced storage accounts (such as Gen2) and managed database services (such as MySQL, SQL Server, and Redis). All of this has made storage a first-class concept in Azure’s cloud. In this chapter, we will look at why you should run storage solutions at scale in Azure and how you can orchestrate them in a way that is easy to manage.

The Need for Distributed Databases in Cloud Native Architecture

Although Azure provides a number of managed data (both SQL and NoSQL) solutions, sometimes you may need to run larger and more specialized setups while operating in the cloud. The ecosystem of self-operated Cloud Native Computing Foundation (CNCF) database projects will allow you to scale to a larger size than Azure platform as a service (PaaS) services provide, and will give you more features and performance options.

As this chapter will further discuss, the CNCF landscape lays out a set of storage solutions that are primarily designed for cloud operations. The benefit of these systems is that server management overhead is minimized, while you still have complete control over the system.

In this chapter, we will examine some of the mature data stores that are part of the CNCF landscape. Vitess, Rook, TiKV, and etcd are all mature cloud data stores that are able to scale to a tremendous size and are being run as first-class data systems across Azure. These systems represent the foundational building blocks of a data ecosystem that’s operated in a cloud infrastructure.

Running Stateful Workloads on AKS

At the time of this writing, Azure Kubernetes Service (AKS) supports multiple availability zones (i.e., multiple data centers within a region); however, it does not support multiple fault domains. This means that if your AKS cluster runs in only one availability zone, it is theoretically possible to experience downtime when multiple nodes in your cluster become unavailable at the same time. We recommend that you split your AKS cluster over multiple availability zones until AKS supports fault domains.
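As a minimal sketch, you can spread an AKS cluster's default node pool across availability zones at creation time with the Azure CLI; the resource group, cluster name, and node count below are placeholders:

$ az aks create \
    --resource-group my-resource-group \
    --name my-aks-cluster \
    --node-count 3 \
    --zones 1 2 3

Nodes in the node pool will then be distributed across the three zones of the region.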

Azure Storage and Database Options

Before you start building your cloud data stores, you should thoroughly evaluate your use case, particularly the storage, performance, and cost requirements of what you are trying to build. You may find that an Azure PaaS service is a better fit for a particular use case than a CNCF product. For example, Azure’s Hadoop Distributed File System (HDFS)-compatible storage offering, Data Lake Storage Gen2 storage accounts, is an extremely economical solution for storing large volumes of data.
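For instance, a Gen2 (hierarchical namespace) storage account can be created with a single Azure CLI command; the account name, resource group, and location below are placeholders:

$ az storage account create \
    --name mydatalakestore \
    --resource-group my-resource-group \
    --location eastus2 \
    --kind StorageV2 \
    --sku Standard_LRS \
    --hns true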

At the time of this writing, Azure provides the following storage and database PaaS services:

  • Azure Cosmos DB

  • Azure Cache for Redis

  • Azure Database for:

    • MySQL

    • PostgreSQL

    • MariaDB

  • Storage accounts (providing blob, Network File System [NFS], and hierarchical storage)

  • Azure Storage Explorer

We recommend that you evaluate these Azure offerings, specifically considering the performance, availability, and cost of a PaaS service, to ensure that you’re using the correct system for your use case. Azure provides a pricing calculator tool that is extremely useful for modeling costs.

If you believe that an Azure PaaS service does not fit your requirements, continue reading this chapter to learn about other potential data storage solutions.

Introduction to Vitess: Distributed and Sharded MySQL

Vitess is a database cluster system for horizontal scaling of MySQL and MariaDB. It was created by YouTube engineers in 2010 as a means to protect and scale their databases. It provides a number of features that aid in performance, security, and monitoring, but most importantly, it offers powerful tools to manage database topologies and provide first-class support for vertical and horizontal sharding and resharding. In many aspects, it allows you to operate SQL databases in a NoSQL manner, while still getting all the benefits of running a relational database.

Why Run Vitess?

You may be asking yourself: “Why should I run Vitess, instead of using Azure Cosmos DB?” Although Cosmos DB is an attractive option for globally distributed writes and replicas, the current cost model makes it prohibitively expensive to run at scale (thousands of queries per second), making a system like Vitess an attractive offering for those who need a high-throughput relational database with a more customizable setup.

Vitess provides the following benefits:

Scalability
You can scale individual schemas up and down.
Manageability
A centralized control plane manages all database-related operations.
Protection
You can rewrite queries, build query deny lists, and add table-level access control lists (ACLs).
Shard management
You can easily create and manage database shards.
Performance
A lot of the internals are built for performance, including client connection pooling and query deduping.

Vitess helps you manage the operational and scaling overhead of relational databases and the health of your cluster. Furthermore, Vitess offers a wider range of configurables than traditional cloud database offerings (including Cosmos DB), which will help you meet your scaling and replication requirements.

The Vitess Architecture

Vitess provides sharding as a service by creating a middleware-style system while still using MySQL to store the data. The client application connects to a daemon known as VTGate (see Figure 9-1). VTGate is a cluster-aware router that routes queries to the appropriate VTTablet instances, which in turn manage each MySQL instance. The Topology service stores the topology of the infrastructure, including where the data is stored and which VTTablet instances are able to serve it.

Figure 9-1. The Vitess architecture

Vitess also provides two administrative interfaces: vtctl, a CLI; and vtctld, a web interface.

VTGate is a query router (or proxy) for routing queries from the client to the databases. The client doesn’t need to know about anything except where the VTGate instance is located, which means the client-to-server architecture is straightforward.
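Because VTGate speaks the MySQL protocol, any standard MySQL client or driver can be pointed at it. The hostname, port, and credentials below are hypothetical; in practice you would use the address of your VTGate service:

$ mysql -h vtgate.example.internal -P 3306 -u appuser -p \
    -e "SELECT * FROM customer LIMIT 5;"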

Deploying Vitess on Kubernetes

Vitess has first-class support for Kubernetes, and it is very straightforward to get started with it. In this walkthrough, we’ll create a cluster and a schema and then expand the cluster:

  1. First, clone the vitess repository and move to the vitess/examples/operator folder:

    $ git clone git@github.com:vitessio/vitess.git
    $ cd vitess/examples/operator
  2. Now, install the Kubernetes operator by running:

    $ kubectl apply -f operator.yaml
  3. Next, create an initial Vitess cluster by running:

    $ kubectl apply -f 101_initial_cluster.yaml
  4. You’ll be able to verify that the initial cluster install was successful as follows:

    $ kubectl get pods
    NAME                                             READY   STATUS    RESTARTS   AGE
    example-etcd-faf13de3-1                          1/1     Running   0          78s
    example-etcd-faf13de3-2                          1/1     Running   0          78s
    example-etcd-faf13de3-3                          1/1     Running   0          78s
    example-vttablet-zone1-2469782763-bfadd780       3/3     Running   1          78s
    example-vttablet-zone1-2548885007-46a852d0       3/3     Running   1          78s
    example-zone1-vtctld-1d4dcad0-59d8498459-kwz6b   1/1     Running   2          78s
    example-zone1-vtgate-bc6cde92-6bd99c6888-vwcj5   1/1     Running   2          78s
    vitess-operator-8454d86687-4wfnc                 1/1     Running   0          2m29s

You now have a small, single-zone Vitess cluster running with two replicas.
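To create the example schema mentioned at the start of this walkthrough, you can port-forward the cluster services and apply the sample commerce schema. The helper script and file names below are those found in the upstream vitess/examples/operator directory and may differ between Vitess releases:

$ ./pf.sh &
$ alias mysql="mysql -h 127.0.0.1 -P 15306 -u user"
$ alias vtctlclient="vtctlclient -server localhost:15999"
$ vtctlclient ApplySchema -sql="$(cat create_commerce_schema.sql)" commerce
$ vtctlclient ApplyVSchema -vschema="$(cat vschema_commerce_initial.json)" commerce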

You can find more database operation examples in the Vitess Kubernetes Operations example page.

One of the more common use cases you will encounter is the need to add or remove replicas of the database. You can easily do this by running the following:

$ kubectl edit planetscale.com example

# To add or remove replicas, change the replicas field of the appropriate resource 
# to the desired value. E.g.

  keyspaces:
  - name: commerce
    turndownPolicy: Immediate
    partitionings:
    - equal:
        parts: 1
        shardTemplate:
          databaseInitScriptSecret:
            name: example-cluster-config
            key: init_db.sql
          replication:
            enforceSemiSync: false
          tabletPools:
          - cell: zone1
            type: replica
            replicas: 2 # Change this value
Note

One consideration of building distributed data storage is determining the size of your failure domain. Vitess recommends that you store a maximum of 250 GB of data per server. While there are some MySQL performance reasons for this, other regular operational tasks become more difficult with a larger shard size/failure domain. You can read more about why Vitess recommends 250 GB per server on the Vitess blog.

Introduction to Rook: Storage Orchestrator for Kubernetes

So far, we’ve covered relational SQL database storage and high-performance serving, but there are other types of storage that need to be served and operated in the cloud. One of these is blob (object) storage. Blob stores are great for storing images, videos, and other files that aren’t suitable for a relational database, but they come with extra challenges around storing and replicating large amounts of data efficiently.

Rook is an open source, cloud native storage orchestrator for file, block, and object storage that provides software-defined storage. Rook joined the CNCF in 2018 and is specifically designed around running storage on Kubernetes. Similar to Vitess, Rook manages a lot of the daily operations of a cluster, including self-scaling and self-healing, and it automates other tasks such as disaster recovery, monitoring, and upgrades.

The Rook Architecture

Rook is a Kubernetes orchestrator, not an actual storage solution. Rook supports the following storage systems:

  • Ceph, a highly scalable distributed storage solution for block storage, object storage, and shared filesystems with years of production deployments

  • Cassandra, a highly available NoSQL database featuring lightning-fast performance, tunable consistency, and massive scalability

  • NFS, which allows remote hosts to mount filesystems over a network and interact with those filesystems as though they are mounted locally

Rook can orchestrate a number of storage providers on a Kubernetes cluster. Each has its own operator for deploying and managing the resources.

Deploying Rook on Kubernetes

In this example, we will deploy a Cassandra cluster using the Rook Cassandra Kubernetes operator. (If you’re deploying a Ceph cluster via Rook, you can also use a Helm chart to deploy it.) Follow these steps:

  1. Clone the Rook operator:

    $ git clone --single-branch --branch master https://github.com/rook/rook.git
  2. Install the operator:

    $ cd rook/cluster/examples/kubernetes/cassandra
    $ kubectl apply -f operator.yaml
  3. You can verify that the operator is installed by running:

    $ kubectl -n rook-cassandra-system get pod
  4. Now, go to the cassandra folder and create the cluster:

    $ cd rook/cluster/examples/kubernetes/cassandra
    $ kubectl create -f cluster.yaml
  5. To verify that all the desired nodes are running, issue the following command:

    $ kubectl -n rook-cassandra get pod -l app=rook-cassandra

It is exceptionally easy to scale your Cassandra cluster up and down using the Kubernetes kubectl edit command. You can simply change the value of Spec.Members up or down to scale the cluster accordingly:

$ kubectl edit clusters.cassandra.rook.io rook-cassandra
# To scale a rack up or down, change the Spec.Members field of the rack to the
# desired value. After editing and saving the YAML, check your cluster’s Status
# and Events for information on what’s happening:

apiVersion: cassandra.rook.io/v1alpha1
kind: Cluster
metadata:
  name: rook-cassandra
  namespace: rook-cassandra
spec:
  version: 3.11.6
  repository: my-private-repo.io/cassandra
  mode: cassandra
  annotations:
  datacenter:
    name: us-east2
    racks:
      - name: us-east2
        members: 3 # Change this number up or down


$ kubectl -n rook-cassandra describe clusters.cassandra.rook.io rook-cassandra
Note

You can find more details on Cassandra configurables in the Cassandra CRD documentation.

Similarly, Ceph and NFS clusters can be easily deployed using the Rook operator.
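Once a Ceph cluster is up, for example, applications consume it through ordinary Kubernetes PersistentVolumeClaims. The sketch below assumes a StorageClass named rook-ceph-block, which is the name used in Rook’s upstream Ceph block storage example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi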

Introduction to TiKV

TiKV (Titanium Key-Value), created by PingCAP, Inc., is an open source, distributed, and transactional key-value store. Contrary to many key-value and NoSQL systems, TiKV provides simple (raw) APIs as well as transactional APIs that provide atomicity, consistency, isolation, and durability (ACID) compliance.

Why Use TiKV?

TiKV provides the following features:

  • Geo-replication

  • Horizontal scalability

  • Consistent distributed transactions

  • Coprocessor support

  • Automatic sharding

TiKV is an attractive option due to its auto-sharding and geo-replication features, as well as its ability to scale to store more than 100 TB of data. It also provides strong consistency guarantees through its use of the Raft consensus protocol for replication.

The TiKV Architecture

As previously mentioned, TiKV has two APIs that can be used for different use cases: raw and transactional. Table 9-1 outlines the differences between these two APIs.

Table 9-1. TiKV APIs: Raw versus transactional
  Raw API
    Description: A lower-level key-value API for interacting directly with individual key-value pairs
    Atomicity: Single key
    Use when: Your application doesn’t require distributed transactions or multiversion concurrency control (MVCC)

  Transactional API
    Description: A higher-level key-value API that provides ACID semantics
    Atomicity: Multiple keys
    Use when: Your application requires distributed transactions and/or MVCC

TiKV uses RocksDB as the storage engine on each node and uses Raft groups to provide distributed transactions, as shown in Figure 9-2. The placement driver acts as the cluster manager, ensuring that the replication constraints of all shards are met and that data is load-balanced across the nodes. Finally, clients connect to TiKV nodes using the Google Remote Procedure Call (gRPC) protocol, which is optimized for performance.

Figure 9-2. The TiKV architecture

Within a TiKV node (pictured in Figure 9-3), RocksDB provides the underlying storage mechanisms and Raft provides consensus for transactions. The TiKV API provides an interface for clients to interact with, and the coprocessor handles SQL-like queries and assembles the result from the underlying storage.

Figure 9-3. TiKV instance architecture

Deploying TiKV on Kubernetes

TiKV can be deployed via multiple automation methods, including Ansible, Docker, and Kubernetes. In the following example, we will deploy a basic cluster with Kubernetes and Helm:

  1. Install the TiKV custom resource definition:

    $ kubectl apply -f \
      https://raw.githubusercontent.com/tikv/tikv-operator/master/manifests/crd.v1beta1.yaml
  2. Next, we will install the operator with Helm. First, add the PingCAP chart repository:

    $ helm repo add pingcap https://charts.pingcap.org/
  3. Now create a namespace for the tikv-operator:

    $ kubectl create ns tikv-operator
  4. Install the operator:

    $ helm install --namespace tikv-operator tikv-operator pingcap/tikv-operator \
      --version v0.1.0
  5. Deploy the cluster:

    $ kubectl apply -f \
      https://raw.githubusercontent.com/tikv/tikv-operator/master/examples/basic/tikv-cluster.yaml
  6. You can check the status of the deployment by running:

    $ kubectl wait --for=condition=Ready --timeout 10m tikvcluster/basic

This will create a single-host storage cluster with 1 GB of storage and a placement driver (PD) instance.
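As a quick sanity check, you can port-forward the placement driver and query its HTTP API for registered stores. This assumes the operator creates a PD Service named basic-pd for the cluster named basic, as in the upstream example:

$ kubectl port-forward svc/basic-pd 2379:2379 &
$ curl http://127.0.0.1:2379/pd/api/v1/stores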

You can modify the replicas and storage parameters in the cluster definition to match your requirements. We recommend that you run your storage with at least three replicas for redundancy purposes. For example, if you run kubectl edit tikvcluster.tikv.org basic and modify replicas to 4 and storage to 500Gi:

apiVersion: tikv.org/v1alpha1
kind: TikvCluster
metadata:
  name: basic
spec:
  version: v4.0.0
  pd:
    baseImage: pingcap/pd
    replicas: 4
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster 
    # will be used
    # storageClassName: local-storage
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    replicas: 4
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster 
    # will be used
    # storageClassName: local-storage
    requests:
      storage: "500Gi"
    config: {}

you will get four replicas of a 500 GB storage partition.

More on etcd

We’ve utilized etcd (stealthily) throughout this book. Here we will spend some time talking about it in detail.

etcd is a strongly consistent (Raft-consensus), distributed key-value store that is intended for use in distributed systems. It is designed to gracefully handle leader elections and faults such as network interruptions and unexpected node downtime. etcd has a simple API (Figure 9-4), so you will find a number of the systems we’ve discussed in this book (e.g., Kubernetes, TiKV, Calico, and Istio) leveraging it behind the scenes.

Often, etcd is used to store configuration values, feature flags, and service discovery information in a key-value format. etcd allows clients to watch these items for changes so that they can reconfigure themselves when the values change. etcd is also well suited for leader election and distributed locking. etcd is used as the backing store for Kubernetes as well as for orchestration in Rook.

Figure 9-4. etcd operational architecture
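The key-value and watch primitives shown in Figure 9-4 map directly onto etcdctl commands. Here is a minimal sketch; the key names and values are illustrative only (etcdctl v3.4 and later default to the v3 API):

$ etcdctl put /config/feature-x "enabled"   # store a configuration value
$ etcdctl get /config/feature-x             # read it back
$ etcdctl watch /config/feature-x           # block and stream future changes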

If you are utilizing etcd for your core infrastructure, you will quickly understand how important it is for this infrastructure to run reliably. As an etcd cluster is a critical piece of your infrastructure, we recommend the following to ensure a stable etcd cluster in Azure.

Hardware Platform

Consistent key-value store performance relies heavily on the performance of the underlying hardware. A poorly performing consistent key-value cluster can severely impact the performance of your distributed system. The same applies for etcd. If you are doing hundreds of transactions per second, you should utilize a managed solid state drive (SSD) with your virtual machine (we recommend a premium or ultra SSD) to prevent performance degradation. You will need at least four cores for etcd to run, and the throughput will scale with an increase in CPU cores. You can find more information on capacity planning and hardware configurations for the cloud on the general hardware page and the AWS guide (etcd doesn’t currently publish an Azure-specific guide).

We recommend running etcd on the following Azure SKUs:

  • Standard_D8ds_v4 (eight-core CPU)

  • Standard_D16ds_v4 (16-core CPU)

These SKUs explicitly allow you to attach premium or ultra SSDs, which will increase the performance of the cluster.

Note

As with any production use case that we’ve discussed in this book, we strongly recommend that you enable accelerated networking on all Azure compute instances.
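For example, a single etcd node on one of these SKUs could be provisioned roughly as follows; the resource group, VM name, and image alias are placeholders, and you should confirm the image alias available in your subscription:

$ az vm create \
    --resource-group etcd-rg \
    --name etcd-node-1 \
    --image Ubuntu2204 \
    --size Standard_D8ds_v4 \
    --zone 1 \
    --storage-sku Premium_LRS \
    --accelerated-networking true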

Autoscaling and Auto-remediation

We would be remiss if we did not discuss how autoscaling and auto-remediation work with etcd. Autoscaling is not recommended for etcd, because automatically adding or removing members can adversely affect cluster performance and quorum. That said, enabling auto-remediation (auto-healing) of bad etcd instances is considered to be OK. You can disable autoscaling for etcd by running the following command on your Kubernetes cluster:

$ kubectl delete hpa etcd

This will delete the Horizontal Pod Autoscaler (HPA), which is used in Kubernetes to automatically scale workloads up and down. You can find more information about the HPA in the Kubernetes documentation.

Azure Autoscaling

Azure provides autoscaling as a feature in a number of its products, including virtual machine scale sets and Azure Functions. These autoscaling features allow you to efficiently scale your resources up and down when appropriate. With virtual machine scale sets, there can be up to a 10-minute delay before new virtual machines become available when scaling out.
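As an illustration (not specific to etcd), an autoscale profile for a virtual machine scale set can be defined with the Azure CLI; the resource names below are placeholders:

$ az monitor autoscale create \
    --resource-group my-rg \
    --resource my-vmss \
    --resource-type Microsoft.Compute/virtualMachineScaleSets \
    --name my-autoscale-profile \
    --min-count 2 \
    --max-count 10 \
    --count 3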

Availability and Security

etcd is a control plane service, so it is important that it is highly available and that communications to it are secure. We recommend that you run at least three etcd nodes in a cluster to ensure availability. Cluster throughput will depend on storage performance. We strongly recommend against using static discovery mechanisms for your cluster nodes; instead, use DNS SRV records, per the etcd clustering guide.
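A sketch of DNS SRV-based bootstrapping follows; the domain and member names are placeholders, and the record names match those documented in the etcd clustering guide:

# SRV records published in the example.com zone (one pair per member):
#   _etcd-server-ssl._tcp.example.com.  300 IN SRV 0 0 2380 etcd-1.example.com.
#   _etcd-client-ssl._tcp.example.com.  300 IN SRV 0 0 2379 etcd-1.example.com.

$ etcd --name etcd-1 \
    --discovery-srv example.com \
    --initial-advertise-peer-urls https://etcd-1.example.com:2380 \
    --advertise-client-urls https://etcd-1.example.com:2379 \
    --listen-peer-urls https://0.0.0.0:2380 \
    --listen-client-urls https://0.0.0.0:2379 \
    --initial-cluster-token my-etcd-cluster \
    --initial-cluster-state new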

etcd TLS

etcd supports client-to-server and peer (server-to-server) Transport Layer Security (TLS) encryption. We strongly recommend that you enable this as etcd is a control plane system. We won’t go into implementation details here (as they are highly dependent on your infrastructure). You can find configuration options in the etcd security documentation.
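The relevant flags look roughly like the following; the certificate paths are placeholders and should point at certificates issued by your own certificate authority:

$ etcd --name etcd-1 \
    --cert-file=/etc/etcd/pki/server.crt \
    --key-file=/etc/etcd/pki/server.key \
    --trusted-ca-file=/etc/etcd/pki/ca.crt \
    --client-cert-auth \
    --peer-cert-file=/etc/etcd/pki/peer.crt \
    --peer-key-file=/etc/etcd/pki/peer.key \
    --peer-trusted-ca-file=/etc/etcd/pki/ca.crt \
    --peer-client-cert-auth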

Role-based access control

Since etcd v2.1, role-based access control (RBAC) has been a feature of the etcd API. Using RBAC, you can create access policies that grant certain users (or applications) access to only a subset of the etcd data. Traditionally, this is done by providing HTTP basic authentication when making HTTP requests to etcd. As of etcd v3.2, if you’re using the --client-cert-auth=true configuration option, the Common Name (CN) of the client TLS certificate will be used as the user (instead of a username/password combination).

You can find examples of how to apply RBACs to your etcd data space in the documentation.
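A sketch of the typical flow with etcdctl follows; the role, user, and key prefix are illustrative:

$ etcdctl user add root                    # the root user is required before enabling auth
$ etcdctl role add app-readonly
$ etcdctl role grant-permission app-readonly read /config/ --prefix=true
$ etcdctl user add app-svc
$ etcdctl user grant-role app-svc app-readonly
$ etcdctl auth enable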

Summary

Despite common misconceptions, it is possible to run large-scale data systems natively in Azure. In this chapter, we reviewed why it is advantageous to use cloud native data systems in Azure’s cloud. We covered four systems: Vitess (relational), Rook (blob), TiKV (key-value), and etcd (key-value configuration/service discovery). These systems are the bedrock of a cloud native architecture that stores and serves online data and offers a large upside over utilizing PaaS components. By now, you should understand what software makes sense for you to manage yourself as well as which PaaS services to use and how to deploy them in your infrastructure.

The cloud, and Kubernetes in particular, has become a much friendlier place to run your stateful infrastructure. While Azure Gen2 storage accounts are a great resource for blob storage, cloud native software can really help you build long-lasting, large-scale, stateful infrastructure.

Now that you understand how you can store and serve data in a cloud environment, let’s look at how to move data between systems using real-time messaging.
