Columnar databases have existed since the 1990s but, as mentioned earlier, rose to prominence after the release of Google Bigtable. In essence, columnar storage is a method of organizing data that is optimized for querying very large volumes of data quickly and efficiently relative to row-based (tuple-based) storage.
The benefits of columnar databases, or more concretely storing each column of data independently, can be illustrated with a simple example.
Consider a table consisting of 100 million household addresses and phone numbers. Consider also a simple query that requires the user to find the number of households in the state of New York, in the city of Albany, built after 1990. We'll create a hypothetical table to illustrate the difference in querying the data row by row versus column by column.
Hardware characteristics:
Average disk read speed: 200 MB per second
Database characteristics:
Table name: housedb
- Total rows = 100 million
- Total rows with State NY = 2 million
- Total rows with State NY and City Albany = 10,000
- Total rows with State NY and City Albany and YearBuilt > 1990 = 500
Data size:
Let us assume the following sizes for the fields in each row:
- PlotNumber and YearBuilt: 8 bytes each = 16 bytes total
- Owner, Address, State, and City: 12 bytes each = 48 bytes total
- Net size of each row = 16 + 48 = 64 bytes
Note that the actual size will be higher, as there are several other considerations such as indexing and other table optimizations and related overheads that we won't consider here for the sake of simplicity.
We will also assume that the columnar database maintains an implicit row index that permits querying the data at certain indices in each column vector.
The following table shows the first four records:

| PlotNumber | Owner | Address | State | City | YearBuilt |
| --- | --- | --- | --- | --- | --- |
| 1 | John | 1 Main St. | WA | Seattle | 1995 |
| 2 | Mary | 20 J. Ave. | NY | Albany | 1980 |
| 3 | Jane | 5 45th St. | NY | Rye Brook | 2001 |
| 4 | John | 10 A. Blvd. | CT | Stamford | 2010 |
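To make the difference between the two layouts concrete, here is a minimal Python sketch (using plain lists and dictionaries, not any particular database engine) of the first four records stored row-wise versus column-wise. In the column-wise layout, each position in a column vector serves as the implicit row index mentioned above:

```python
# Row-wise layout: each record is kept together, so a filter on State
# must read every field of every row.
rows = [
    (1, "John", "1 Main St.", "WA", "Seattle", 1995),
    (2, "Mary", "20 J. Ave.", "NY", "Albany", 1980),
    (3, "Jane", "5 45th St.", "NY", "Rye Brook", 2001),
    (4, "John", "10 A. Blvd.", "CT", "Stamford", 2010),
]

# Column-wise layout: each column is an independent vector, and the
# position (0, 1, 2, ...) acts as the implicit row index.
columns = {
    "PlotNumber": [1, 2, 3, 4],
    "Owner":      ["John", "Mary", "Jane", "John"],
    "Address":    ["1 Main St.", "20 J. Ave.", "5 45th St.", "10 A. Blvd."],
    "State":      ["WA", "NY", "NY", "CT"],
    "City":       ["Seattle", "Albany", "Rye Brook", "Stamford"],
    "YearBuilt":  [1995, 1980, 2001, 2010],
}

# A filter on State touches only the State vector; the matching
# positions can then be used to read other columns.
ny_positions = [i for i, s in enumerate(columns["State"]) if s == "NY"]
print(ny_positions)  # [1, 2]
```

This is the essence of the scan-size savings worked through below: the row-wise filter reads 64 bytes per row, while the column-wise filter reads only the 12-byte State values.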
In total, the table has 100 million records. The last few are shown as follows:
| PlotNumber | Owner | Address | State | City | YearBuilt |
| --- | --- | --- | --- | --- | --- |
| 99999997 | Jim | 23 B. Lane | NC | Cary | 1995 |
| 99999998 | Mike | 5 L. Street | NY | Syracuse | 1993 |
| 99999999 | Tim | 10 A. Blvd. | NY | Albany | 2001 |
| 100000000 | Jack | 10 A. Blvd. | CT | Stamford | 2010 |
The query we will run against this dataset is as follows:
select * from housedb where State like 'NY' and City like 'Albany' and YearBuilt > 1990
Scenario A: Searching row by row
In the first scenario, we perform a naïve row-by-row search. Since the data for each column is not stored separately, the entire contents of every row must be scanned, which means reading:
100 million * 64 bytes (the size of each row) = 6,400 million bytes = approximately 6,400 MB of data
At a disk read speed of, say, 200 MBps, it would take approximately 6,400 / 200 = 32 seconds to read all the records and find the matching entries.
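The arithmetic for this scenario can be captured in a few lines of Python (using decimal MB and ignoring indexing and storage overheads, as stated above):

```python
# Time to scan the full table row by row.
# Assumptions from the text: 100 million rows, 64 bytes per row,
# 200 MB/s sequential disk read, 1 MB = 1,000,000 bytes.
ROWS = 100_000_000
ROW_SIZE_BYTES = 64
DISK_MB_PER_SEC = 200

total_mb = ROWS * ROW_SIZE_BYTES / 1_000_000  # data that must be read
scan_seconds = total_mb / DISK_MB_PER_SEC

print(total_mb)      # 6400.0
print(scan_seconds)  # 32.0
```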
Scenario B: Searching column by column
Assuming each column of data resides in an individual file representing that column, we will evaluate each where clause individually:
select * from housedb where State like 'NY' and City like 'Albany' and YearBuilt > 1990
- Where clause part 1: where State like 'NY'
The State column, as described earlier, has 100 million entries each of size 12 bytes.
In this case, we only need to search across:
100 million * 12 bytes = 1,200 million bytes = approximately 1,200 MB of data.
At a data read rate of 200 MBps, it would take 1,200 / 200 = 6 seconds to read the column of data.
This returns two million records (as noted earlier in the database characteristics).
- Where clause part 2: City like 'Albany'
In the preceding step, we narrowed our window of search to the two million records that satisfied the criterion State = NY. For the second where clause, we therefore need not query across all 100 million records; we can simply examine those two million records to determine which ones belong to the city of Albany.
In this case, we only need to search across:
2 million * 12 bytes = 24 million bytes = approximately 24 MB of data.
At a data read rate of 200 MBps, this would take roughly 0.12 seconds.
This returns 10,000 records (as noted earlier in Database Characteristics).
- Where clause part 3: YearBuilt > 1990
In the preceding step, we further narrowed our window of search to 10,000 records fulfilling both the criteria of State NY and City Albany. In this step, we will query 10,000 records in the YearBuilt column to find which ones fulfil the criteria of YearBuilt > 1990.
In this case, we only need to search across:
10,000 * 8 bytes (the YearBuilt column alone) = 80,000 bytes = approximately 80 KB of data.
At a data read rate of 200 MBps, this would take 0.0004 seconds, which we can treat as effectively zero.
Hence, the net time spent in querying across the data was:
- Where clause part 1: where State like 'NY' - 6 seconds
- Where clause part 2: City like 'Albany' - 0.12 seconds
- Where clause part 3: YearBuilt > 1990 - effectively zero seconds
Net time taken to read the data = approximately 6.1 seconds.
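The three steps above can be sketched as a single calculation (a simplification that, like the text, considers only raw read volume at 200 MB/s and uses decimal MB):

```python
# Per-predicate scan times for the columnar layout.
# Selectivities are those stated in the database characteristics:
# 2 million NY rows, 10,000 of which are in Albany.
DISK_BPS = 200_000_000  # 200 MB/s expressed in bytes per second

def scan_time(rows, bytes_per_value):
    """Seconds needed to read `rows` values of the given width."""
    return rows * bytes_per_value / DISK_BPS

t_state = scan_time(100_000_000, 12)  # full State column
t_city  = scan_time(2_000_000, 12)    # only rows matching State = NY
t_year  = scan_time(10_000, 8)        # only rows matching both filters

print(round(t_state + t_city + t_year, 2))  # 6.12
```

Each successive predicate touches a dramatically smaller slice of data, which is why the total stays close to the cost of the first column scan alone.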
The preceding example demonstrates the query performance and efficiency benefits of storing data in columns, based purely on the volume of data that must be read. Columnar storage offers another benefit as well: it allows tables with arbitrary, varying schemas to be stored in columns.
Consider the first four rows of the prior table. If, for example, we had missing information in some of the rows, that would lead to sparse columns:
| PlotNumber | Owner | Address | State | City | YearBuilt |
| --- | --- | --- | --- | --- | --- |
| 1 | John | 1 Main St. | NULL | Seattle | 1995 |
| 2 | Mary | 20 J. Ave. | NY | NULL | NULL |
| 3 | Jane | NULL | NY | Rye Brook | NULL |
| 4 | John | 10 A. Blvd. | CT | NULL | NULL |
Instead of populating NULL values, we can create a column family called Complete_Address that contains an arbitrary number of key-value pairs, corresponding only to those fields that have data:
| PlotNumber | Owner | Complete_Address | YearBuilt |
| --- | --- | --- | --- |
| 1 | John | Address: 1 Main St.; City: Seattle | 1995 |
| 2 | Mary | Address: 20 J. Ave.; State: NY | NULL |
| 3 | Jane | State: NY; City: Rye Brook | NULL |
| 4 | John | Address: 10 A. Blvd.; State: CT | NULL |
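A column family of this kind can be sketched in Python as a nested dictionary, where each row stores only the address fields it actually has (a simplification, not how any specific engine lays out column families on disk):

```python
# Each row's Complete_Address family holds only the key-value pairs
# that exist, so no NULL placeholders are stored.
table = {
    1: {"Owner": "John",
        "Complete_Address": {"Address": "1 Main St.", "City": "Seattle"},
        "YearBuilt": 1995},
    2: {"Owner": "Mary",
        "Complete_Address": {"Address": "20 J. Ave.", "State": "NY"}},
    3: {"Owner": "Jane",
        "Complete_Address": {"State": "NY", "City": "Rye Brook"}},
    4: {"Owner": "John",
        "Complete_Address": {"Address": "10 A. Blvd.", "State": "CT"}},
}

# Row 2 has no City on record, and nothing is stored for it:
print(table[2]["Complete_Address"])  # {'Address': '20 J. Ave.', 'State': 'NY'}
```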
A third and very important benefit offered by columnar databases is the ability to retrieve data based on three keys: a row key, a column key, and a timestamp that uniquely identifies each record, permitting very fast access to the data in question.
For example, since the Owner field can change when the property (PlotNumber) is sold, we can add another field that denotes the date of the record; that is, the date that the record corresponds to. This would allow us to distinguish among properties that had a change of ownership whilst all the other data remained the same:
| PlotNumber | Owner | Address | State | City | YearBuilt | RecordDate |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | John | 1 Main St. | WA | Seattle | 1995 | 2001.04.02 |
| 2 | Mary | 20 J. Ave. | NY | Albany | 1980 | 2007.05.30 |
| 3 | Jane | 5 45th St. | NY | Rye Brook | 2001 | 2001.10.24 |
| 4 | John | 10 A. Blvd. | CT | Stamford | 2010 | 2003.07.20 |
Since there can be multiple records for each PlotNumber to accommodate change of ownership, we can now define three keys that could uniquely identify each cell of data in each record, as follows:
- Row key: PlotNumber
- Column key: The column name
- Timestamp key: RecordDate
Each cell in each record in the table thus carries a unique combination of three values (row key, column key, and timestamp) that distinguishes it from every other cell.
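The three-key addressing scheme can be illustrated with a dictionary keyed by (row key, column key, timestamp) triplets. Note that this is only a conceptual sketch: the second Owner entry below (Alice, dated 2015.08.11) is a hypothetical later sale of plot 1, not a record from the tables above:

```python
# Each cell is addressed by (row key, column key, timestamp).
# RecordDate strings in YYYY.MM.DD form sort chronologically.
cells = {
    (1, "Owner", "2001.04.02"): "John",
    (1, "Owner", "2015.08.11"): "Alice",   # hypothetical later sale
    (1, "City",  "2001.04.02"): "Seattle",
}

# All versions of one logical cell (PlotNumber 1, column Owner),
# ordered oldest to newest:
owner_history = sorted(
    (ts, value)
    for (row_key, col_key, ts), value in cells.items()
    if row_key == 1 and col_key == "Owner"
)
print(owner_history[-1])  # ('2015.08.11', 'Alice')
```

This is why a lookup of "the current owner of plot 1" can be answered by reading only the newest version of a single cell, rather than rescanning the table.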
Databases such as Bigtable, Cassandra, and others employ this method to perform data analysis at scale both expeditiously and efficiently.
Some of the popular columnar databases are listed as follows. Note that there may be repetitions as databases can have multiple NoSQL properties (such as both in-memory and columnar):
| Open source | Commercial |
| --- | --- |
| Apache Parquet | Kdb+ |
| MonetDB | Teradata |
| MariaDB | SAP HANA |
| Druid | HP Vertica |
| HBase | Oracle Exadata |
| Apache Kudu | ParAccel |
| Apache Arrow | Actian Vector |