Key-value databases

Key-value databases operate on the principle of structuring data as pairs of values corresponding to keys. To highlight the benefits of key-value databases, it would help to revisit the significance of hash maps, a common term prevalent in computer science to specify a unique data-structure that provides a constant-time lookup for key pairs.

An intuitive example for a hash table is as follows:

Consider a collection of 500 books and five bookcases. Each bookcase has five shelves. The books can be placed in an arbitrary order, but that would make it incredibly difficult to find a specific book and you may need to go through hundreds of books before locating the one you need. One method of categorizing the books would be to assign ranges of letters to each of the bookshelves, for example, A-E, F-J, K-O, P-T, U-Z, and use the first letter of the name of the book to assign it to a specific shelf. However, suppose you have a disproportionate number of books that start with the letters A-E. This means that the case assigned for A-E would have a much higher number of books relative to the other ones.

A more elegant alternative could be to assign a value to each of the books and use the respective value to determine which bookcase or bookshelf the book belongs to. To assign a number, a specific value to each book, we could sum up the numbers corresponding to each letter of the title of the book using a range of 1-26 for the letters A-Z respectively:

Our Simple Hash Map

Since we have five bookcases, each with five shelves, we have a total of 25 shelves. One method of allocating a book to a specific shelf would be to take the numeric value of the book obtained by summing the letters in the title and dividing the value by 26. Any number, when divided by 25, will yield a remainder between 0-25; that is, 26 unique values. We can use this value then to assign the book to a particular shelf. This then becomes our self-created hash function.

Of the 25 shelves, each of them is now assigned a numeric value corresponding to the values 0-25 respectively, with the last shelf being assigned the values 24 and 25. For example, shelf zero is assigned to store books whose numeric value divided by 26 yields zero, shelf one is assigned to store books whose numeric value divided by 26 yields one, and shelf 25 is assigned to store books whose numeric value divided by 26 yields 24 or 25.

An example will help to illustrate this concept more concretely.

Book name: HAMLET

Numeric value of title:

Hash values

Sum total of the numeric value = 8 + 1 + 13 + 12 + 5 + 20 = 59

Divide number by 26 = 2, remainder seven

Hence, the book is assigned to shelf number seven.

We have essentially found a way to methodically assign a shelf to each individual book, and because we have a fixed rule, when a new request for a book arrives, we can find it almost instantaneously since we will know the shelf corresponding to the book.

The preceding method illustrates the concept of hashing, and in practice, we would use a hash function that would find a unique value for each book, and assuming we could get an arbitrary number of bookshelves and slots in which we can place the books, we could simply use the plain numeric value of the book to identify which shelf it would belong to.

There would be cases where two books would have the same numeric value, and in those cases we could stack the books in the slot corresponding to the number. In computer science, this effect of multiple values corresponding to a key is known as a collision, and in those cases we would assign multiple items by means of a list or similar datatype.

In real-life use cases, we have much more complex items to work with than the simple example of books. Generally, we'd use more complex hash functions that lower the chance of collision and accordingly assign the key-value pair. The data would be stored in a contiguous array in memory and hence, when a request for a certain key arrived, we could instantaneously find the value by using the hash function to identify the location in memory where the data resides.

Hence, using key-value pairs to store data can be immensely powerful because the time to retrieve information corresponding to a key can be very fast as there is no need to search through a long list to identify a matching key.

Key-value databases employ the same principle of assigning unique keys to each record, and the data corresponding to each key is stored in the corresponding location. In our discussion of MongoDB, we saw that records were assigned a certain key identified by the _id value in each record. In practice, we could use this value to retrieve the corresponding data in constant time.

As mentioned before, memcached used to be the preferred method to store data in key-value pairs for web services that required very fast access to frequently used data. In essence, it served as a memory cache to store temporary information. With the advent of NoSQL databases, new platforms that extended the limited use case of memcached became prominent. Solutions such as Redis offered not only the ability to store data in key-value pairs in memory, but also the ability to persist the data on disk. In addition, these key-value stores supported horizontal scaling, which permitted the distribution of key-value pairs across hundreds of nodes.

The disadvantage of key-value storage was that the data could not be queried with the same flexibility as standard databases, which supported multiple levels of indexing and a more richer set of SQL commands. Nevertheless, the benefits of constant time lookup implied that for use cases that required a key-value structure, there were few other solutions that were comparable in both performance and efficiency. For instance, a shopping website with thousands of users could store user profile information in a key-value database and be able to look up individual information by simply applying a hash function corresponding to, for example, the user ID.

Today, key-value databases use a variety of methods to store data:

  • SSTables: A file of sorted key-value pairs represented as strings (and directly mapped to the Google File System (GFS)).
  • B-trees: Balanced trees where values are identified by traversing along leaves/nodes.
  • Bloom filters: A more optimal key-value method used when the number of keys is high. It uses multiple hash functions to set the bit-value to one in an array corresponding to keys.
  • Shards: A process involving partitioning data across multiple nodes.

Well known key-value databases include:

Open source

Commercial

Redis

Amazon DynamoDB

Cassandra

Riak

Aerospike

Oracle NoSQL

Apache Ignite

Azure Cosmos DB

Apache Accumulo

Oracle Berkeley DB

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.225.255.187