Columnar databases have existed since the 1990s but, as mentioned earlier, rose to prominence after the release of Google Bigtable. In essence, columnar storage is a method of organizing data that is optimized for querying very large volumes of data quickly and efficiently relative to row-based (tuple-based) storage.
The benefits of columnar databases, or more concretely storing each column of data independently, can be illustrated with a simple example.
Consider a table consisting of 100 million household addresses and phone numbers. Consider also a simple query that requires the user to find the number of households in the state of New York, in the city of Albany, built after 1990. We'll create a hypothetical table to illustrate the difference in querying the data row by row versus column by column.
Hardware characteristics:
Average disk read speed: 200 MB per second
Database characteristics:
Table name: housedb
- Total rows = 100 million
- Total rows with State NY = 2 million
- Total rows with State NY and City Albany = 10,000
- Total rows with State NY and City Albany and YearBuilt > 1990 = 500
Data size:
Let us assume the following sizes for the fields in each row:
- PlotNumber and YearBuilt: 8 bytes each = 16 bytes total
- Owner, Address, State, and City: 12 bytes each = 48 bytes total
- Net size of each row = 16 + 48 = 64 bytes
Note that the actual size will be higher, as there are several other considerations such as indexing and other table optimizations and related overheads that we won't consider here for the sake of simplicity.
We will also assume that the columnar database maintains an implicit row index that permits querying the data at certain indices in each column vector.
The following table shows the first four records:

| PlotNumber | Owner | Address | State | City | YearBuilt |
| --- | --- | --- | --- | --- | --- |
| 1 | John | 1 Main St. | WA | Seattle | 1995 |
| 2 | Mary | 20 J. Ave. | NY | Albany | 1980 |
| 3 | Jane | 5 45th St. | NY | Rye Brook | 2001 |
| 4 | John | 10 A. Blvd. | CT | Stamford | 2010 |
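To make the difference between the two layouts concrete, here is a minimal Python sketch (using plain lists and dictionaries, not any particular database engine) of the first four records stored row-wise versus column-wise. In the column-wise layout, each position in a column vector serves as the implicit row index mentioned above:

```python
# Row-wise layout: each record is kept together, so a filter on State
# must read every field of every row.
rows = [
    (1, "John", "1 Main St.", "WA", "Seattle", 1995),
    (2, "Mary", "20 J. Ave.", "NY", "Albany", 1980),
    (3, "Jane", "5 45th St.", "NY", "Rye Brook", 2001),
    (4, "John", "10 A. Blvd.", "CT", "Stamford", 2010),
]

# Column-wise layout: each column is an independent vector, and the
# position (0, 1, 2, ...) acts as the implicit row index.
columns = {
    "PlotNumber": [1, 2, 3, 4],
    "Owner":      ["John", "Mary", "Jane", "John"],
    "Address":    ["1 Main St.", "20 J. Ave.", "5 45th St.", "10 A. Blvd."],
    "State":      ["WA", "NY", "NY", "CT"],
    "City":       ["Seattle", "Albany", "Rye Brook", "Stamford"],
    "YearBuilt":  [1995, 1980, 2001, 2010],
}

# A filter on State touches only the State vector; the matching
# positions can then be used to read other columns.
ny_positions = [i for i, s in enumerate(columns["State"]) if s == "NY"]
print(ny_positions)  # [1, 2]
```

This is the essence of the scan-size savings worked through below: the row-wise filter reads 64 bytes per row, while the column-wise filter reads only the 12-byte State values.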
In total, the table has 100 million records. The last few are shown as follows:
| PlotNumber | Owner | Address | State | City | YearBuilt |
| --- | --- | --- | --- | --- | --- |
| 99999997 | Jim | 23 B. Lane | NC | Cary | 1995 |
| 99999998 | Mike | 5 L. Street | NY | Syracuse | 1993 |
| 99999999 | Tim | 10 A. Blvd. | NY | Albany | 2001 |
| 100000000 | Jack | 10 A. Blvd. | CT | Stamford | 2010 |
The query we will run against this dataset is as follows:
select * from housedb where State like 'NY' and City like 'Albany' and YearBuilt > 1990
Scenario A: Searching row by row
In the first scenario, we perform a naïve row-by-row search. Since the data for each column is not stored separately, the entire contents of every row must be scanned, which means reading:
100 million * 64 bytes (the size of each row) = 6,400 million bytes = approximately 6,400 MB of data
At a disk read speed of, say, 200 MBps, it would take approximately 6,400 / 200 = 32 seconds to read all the records and find the matching entries.
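The arithmetic for this scenario can be captured in a few lines of Python (using decimal MB and ignoring indexing and storage overheads, as stated above):

```python
# Time to scan the full table row by row.
# Assumptions from the text: 100 million rows, 64 bytes per row,
# 200 MB/s sequential disk read, 1 MB = 1,000,000 bytes.
ROWS = 100_000_000
ROW_SIZE_BYTES = 64
DISK_MB_PER_SEC = 200

total_mb = ROWS * ROW_SIZE_BYTES / 1_000_000  # data that must be read
scan_seconds = total_mb / DISK_MB_PER_SEC

print(total_mb)      # 6400.0
print(scan_seconds)  # 32.0
```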
Scenario B: Searching column by column
Assuming each column of data resides in an individual file representing that column, we will evaluate each where clause individually:
select * from housedb where State like 'NY' and City like 'Albany' and YearBuilt > 1990
- Where clause part 1: where State like 'NY'
The State column, as described earlier, has 100 million entries each of size 12 bytes.
In this case, we only need to search across:
100 million * 12 bytes = 1,200 million bytes = approximately 1,200 MB of data.
At a data read rate of 200 MBps, it would take 1,200 / 200 = 6 seconds to read the column of data.
This returns two million records (as noted earlier in the database characteristics).
- Where clause part 2: City like 'Albany'
In the preceding step, we narrowed our window of search to the two million records that satisfied the criterion State = NY. For the second where clause, we therefore need not query across all 100 million records; we can simply examine those two million records to determine which ones belong to the city of Albany.
In this case, we only need to search across:
2 million * 12 bytes = 24 million bytes = approximately 24 MB of data.
At a data read rate of 200 MBps, this would take roughly 0.12 seconds.
This returns 10,000 records (as noted earlier in Database Characteristics).
- Where clause part 3: YearBuilt > 1990
In the preceding step, we further narrowed our window of search to 10,000 records fulfilling both the criteria of State NY and City Albany. In this step, we will query 10,000 records in the YearBuilt column to find which ones fulfil the criteria of YearBuilt > 1990.
In this case, we only need to search across:
10,000 * 8 bytes (the YearBuilt column alone) = 80,000 bytes = approximately 80 KB of data.
At a data read rate of 200 MBps, this would take 0.0004 seconds, which we can treat as effectively zero.
Hence, the net time spent in querying across the data was:
- Where clause part 1: where State like 'NY' - 6 seconds
- Where clause part 2: City like 'Albany' - 0.12 seconds
- Where clause part 3: YearBuilt > 1990 - effectively zero seconds
Net time taken to read the data = approximately 6.1 seconds.
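The three steps above can be sketched as a single calculation (a simplification that, like the text, considers only raw read volume at 200 MB/s and uses decimal MB):

```python
# Per-predicate scan times for the columnar layout.
# Selectivities are those stated in the database characteristics:
# 2 million NY rows, 10,000 of which are in Albany.
DISK_BPS = 200_000_000  # 200 MB/s expressed in bytes per second

def scan_time(rows, bytes_per_value):
    """Seconds needed to read `rows` values of the given width."""
    return rows * bytes_per_value / DISK_BPS

t_state = scan_time(100_000_000, 12)  # full State column
t_city  = scan_time(2_000_000, 12)    # only rows matching State = NY
t_year  = scan_time(10_000, 8)        # only rows matching both filters

print(round(t_state + t_city + t_year, 2))  # 6.12
```

Each successive predicate touches a dramatically smaller slice of data, which is why the total stays close to the cost of the first column scan alone.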
The preceding example demonstrates the query performance and efficiency benefits of storing data in columns, based purely on the volume of data that must be read. Columnar storage offers another benefit as well: it allows tables with arbitrary, varying schemas to be stored in columns.
Consider the first four rows of the prior table. If, for example, we had missing information in some of the rows, that would lead to sparse columns:
| PlotNumber | Owner | Address | State | City | YearBuilt |
| --- | --- | --- | --- | --- | --- |
| 1 | John | 1 Main St. | NULL | Seattle | 1995 |
| 2 | Mary | 20 J. Ave. | NY | NULL | NULL |
| 3 | Jane | NULL | NY | Rye Brook | NULL |
| 4 | John | 10 A. Blvd. | CT | NULL | NULL |
Instead of populating NULL values, we can create a column family called Complete_Address that contains an arbitrary number of key-value pairs, corresponding only to those fields that have data:
| PlotNumber | Owner | Complete_Address | YearBuilt |
| --- | --- | --- | --- |
| 1 | John | Address: 1 Main St.; City: Seattle | 1995 |
| 2 | Mary | Address: 20 J. Ave.; State: NY | NULL |
| 3 | Jane | State: NY; City: Rye Brook | NULL |
| 4 | John | Address: 10 A. Blvd.; State: CT | NULL |
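A column family of this kind can be sketched in Python as a nested dictionary, where each row stores only the address fields it actually has (a simplification, not how any specific engine lays out column families on disk):

```python
# Each row's Complete_Address family holds only the key-value pairs
# that exist, so no NULL placeholders are stored.
table = {
    1: {"Owner": "John",
        "Complete_Address": {"Address": "1 Main St.", "City": "Seattle"},
        "YearBuilt": 1995},
    2: {"Owner": "Mary",
        "Complete_Address": {"Address": "20 J. Ave.", "State": "NY"}},
    3: {"Owner": "Jane",
        "Complete_Address": {"State": "NY", "City": "Rye Brook"}},
    4: {"Owner": "John",
        "Complete_Address": {"Address": "10 A. Blvd.", "State": "CT"}},
}

# Row 2 has no City on record, and nothing is stored for it:
print(table[2]["Complete_Address"])  # {'Address': '20 J. Ave.', 'State': 'NY'}
```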
A third and very important benefit offered by columnar databases is the ability to retrieve data based on three keys: a row key, a column key, and a timestamp that uniquely identifies each record, permitting very fast access to the data in question.
For example, since the Owner field can change when the property (PlotNumber) is sold, we can add another field that denotes the date of the record; that is, the date that the record corresponds to. This would allow us to distinguish among properties that had a change of ownership whilst all the other data remained the same:
| PlotNumber | Owner | Address | State | City | YearBuilt | RecordDate |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | John | 1 Main St. | WA | Seattle | 1995 | 2001.04.02 |
| 2 | Mary | 20 J. Ave. | NY | Albany | 1980 | 2007.05.30 |
| 3 | Jane | 5 45th St. | NY | Rye Brook | 2001 | 2001.10.24 |
| 4 | John | 10 A. Blvd. | CT | Stamford | 2010 | 2003.07.20 |
Since there can be multiple records for each PlotNumber to accommodate change of ownership, we can now define three keys that could uniquely identify each cell of data in each record, as follows:
- Row key: PlotNumber
- Column key: The column name
- Timestamp key: RecordDate
Each cell in each record in the table thus carries a unique combination of three values (row key, column key, and timestamp) that distinguishes it from every other cell.
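The three-key addressing scheme can be illustrated with a dictionary keyed by (row key, column key, timestamp) triplets. Note that this is only a conceptual sketch: the second Owner entry below (Alice, dated 2015.08.11) is a hypothetical later sale of plot 1, not a record from the tables above:

```python
# Each cell is addressed by (row key, column key, timestamp).
# RecordDate strings in YYYY.MM.DD form sort chronologically.
cells = {
    (1, "Owner", "2001.04.02"): "John",
    (1, "Owner", "2015.08.11"): "Alice",   # hypothetical later sale
    (1, "City",  "2001.04.02"): "Seattle",
}

# All versions of one logical cell (PlotNumber 1, column Owner),
# ordered oldest to newest:
owner_history = sorted(
    (ts, value)
    for (row_key, col_key, ts), value in cells.items()
    if row_key == 1 and col_key == "Owner"
)
print(owner_history[-1])  # ('2015.08.11', 'Alice')
```

This is why a lookup of "the current owner of plot 1" can be answered by reading only the newest version of a single cell, rather than rescanning the table.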
Databases such as Bigtable, Cassandra, and others employ this method to perform data analysis at scale both expeditiously and efficiently.
Some of the popular columnar databases are listed as follows. Note that there may be repetitions as databases can have multiple NoSQL properties (such as both in-memory and columnar):
| Open source | Commercial |
| --- | --- |
| Apache Parquet | Kdb+ |
| MonetDB | Teradata |
| MariaDB | SAP HANA |
| Druid | HP Vertica |
| HBase | Oracle Exadata |
| Apache Kudu | ParAccel |
| Apache Arrow | Actian Vector |