Columnar databases

Columnar databases have existed since the 1990s, but, as mentioned earlier, they came to prominence after the release of Google Bigtable. They are, in essence, a method of storing data that is optimized for querying very large volumes of data quickly and efficiently relative to row-based (tuple-based) storage.

The benefits of columnar databases, or more concretely storing each column of data independently, can be illustrated with a simple example.

Consider a table consisting of 100 million household address records. Consider also a simple query that finds the households in the state of New York, in the city of Albany, that were built after 1990. We'll create a hypothetical table to illustrate the difference between querying the data row by row and column by column.

Hardware characteristics:

Average disk read speed: 200 MB per second

Database characteristics:

Table name: housedb

  • Total rows = 100 million
  • Total rows with State NY = Two million
  • Total rows with State NY and City Albany = 10,000
  • Total rows with State NY and City Albany and YearBuilt > 1990 = 500

Data size:

Let us assume that the size of the data in each row is as follows:

  • PlotNumber and YearBuilt: 8 bytes each = 16 bytes in total
  • Owner, Address, State, and City: 12 bytes each = 48 bytes in total
  • Net size of each row = 16 + 48 = 64 bytes

Note that the actual size will be higher, as there are several other considerations such as indexing and other table optimizations and related overheads that we won't consider here for the sake of simplicity.

We will also assume that the columnar database maintains an implicit row index that permits querying the data at certain indices in each column vector.
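
To make this assumption concrete, the following is a minimal sketch in Python (with a handful of hypothetical rows, not an actual database implementation) of what storing each column independently with an implicit row index might look like: each column is a separate list, and a row is reconstructed by reading the same position from every column.

```python
# Hypothetical sketch of columnar storage: each column is kept
# independently as its own list (or file), and the position within
# the list acts as the implicit row index.
columns = {
    "PlotNumber": [1, 2, 3, 4],
    "Owner":      ["John", "Mary", "Jane", "John"],
    "Address":    ["1 Main St.", "20 J. Ave.", "5 45th St.", "10 A. Blvd."],
    "State":      ["WA", "NY", "NY", "CT"],
    "City":       ["Seattle", "Albany", "Rye Brook", "Stamford"],
    "YearBuilt":  [1995, 1980, 2001, 2010],
}

def rows_at(indices):
    """Reconstruct full rows for the given implicit row indices."""
    return [
        {name: values[i] for name, values in columns.items()}
        for i in indices
    ]

# Fetch the rows at positions 1 and 2 without touching any other rows.
print(rows_at([1, 2]))
```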

The following table shows the first 4 records:

| PlotNumber | Owner | Address     | State | City      | YearBuilt |
|------------|-------|-------------|-------|-----------|-----------|
| 1          | John  | 1 Main St.  | WA    | Seattle   | 1995      |
| 2          | Mary  | 20 J. Ave.  | NY    | Albany    | 1980      |
| 3          | Jane  | 5 45th St.  | NY    | Rye Brook | 2001      |
| 4          | John  | 10 A. Blvd. | CT    | Stamford  | 2010      |

In total, the table has 100 million records. The last few are shown as follows:

| PlotNumber | Owner | Address     | State | City     | YearBuilt |
|------------|-------|-------------|-------|----------|-----------|
| 99999997   | Jim   | 23 B. Lane  | NC    | Cary     | 1995      |
| 99999998   | Mike  | 5 L. Street | NY    | Syracuse | 1993      |
| 99999999   | Tim   | 10 A. Blvd. | NY    | Albany   | 2001      |
| 100000000  | Jack  | 10 A. Blvd. | CT    | Stamford | 2010      |

The query we will run against this dataset is as follows:

select * from housedb where State like 'NY' and City like 'Albany' and YearBuilt > 1990 

Scenario A: Searching row by row

In the first scenario, a naïve row-by-row search cannot take advantage of columns being stored separately; every row has to be scanned in full, so we would have to read across:

100 million * 64 bytes (size of each row in bytes) = 6,400 million bytes = approximately 6,400 MB of data

At a disk read speed of, say, 200 MB per second, it would take approximately 6,400 / 200 = 32 seconds to read all the records and find the matching entries.
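
As a quick sanity check, here is the same back-of-the-envelope arithmetic expressed in Python (a sketch assuming decimal megabytes, that is, 1 MB = 1,000,000 bytes, and the hypothetical figures above):

```python
# Back-of-the-envelope estimate for the row-by-row scan (Scenario A).
ROWS = 100_000_000          # total rows in housedb
ROW_SIZE_BYTES = 64         # 16 bytes (numeric columns) + 48 bytes (text columns)
DISK_MB_PER_SEC = 200       # assumed sequential disk read speed

bytes_scanned = ROWS * ROW_SIZE_BYTES            # 6,400,000,000 bytes
mb_scanned = bytes_scanned / 1_000_000           # ~6,400 MB
seconds = mb_scanned / DISK_MB_PER_SEC           # ~32 seconds
print(f"{mb_scanned:,.0f} MB scanned, ~{seconds:.0f} seconds")
```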

Scenario B: Searching column by column

Assuming each column of data resides in its own individual file, we will look at each where clause individually:

select * from housedb where State like 'NY' and City like 'Albany' and YearBuilt > 1990 
  1. Where clause part 1: where State like 'NY'

The State column, as described earlier, has 100 million entries each of size 12 bytes.

In this case, we only need to search across:

100 million * 12 bytes = 1,200 million bytes = approximately 1,200 MB of data.

At a data read rate of 200 MB per second, it would take 1,200 / 200 = 6 seconds to read the column of data.

This returns two million records (as noted earlier in the database characteristics).

  2. Where clause part 2: City like 'Albany'

The preceding step narrowed our window of search to the two million records that satisfied the criterion of State NY. For the second where clause, we no longer need to query across all 100 million records; instead, we can simply look at the City values of those two million records to determine which ones belong to Albany.

In this case, we only need to search across:

2 million * 12 bytes = 24 million bytes = approximately 24 MB of data.

At a data read rate of 200 MB per second, this would take approximately 0.12 seconds.

This returns 10,000 records (as noted earlier in Database Characteristics).

  3. Where clause part 3: YearBuilt > 1990

In the preceding step, we further narrowed our window of search to the 10,000 records fulfilling both the criteria of State NY and City Albany. In this step, we will scan those 10,000 entries of the YearBuilt column to find which ones fulfil the criterion YearBuilt > 1990.

In this case, we only need to search across:

10,000 * 8 bytes (the size of the YearBuilt column) = 80,000 bytes = approximately 80 KB of data.

At a data read rate of 200 MB per second, this would take roughly 0.0004 seconds, which we can round to zero seconds.

Hence, the net time spent in querying across the data was:

  • Where clause part 1: where State like 'NY' - approximately six seconds
  • Where clause part 2: City like 'Albany' - approximately 0.12 seconds
  • Where clause part 3: YearBuilt > 1990 - effectively zero seconds

Net time taken to read the data = approximately 6.1 seconds, compared to roughly 32 seconds for the row-by-row scan.
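
The same progressive narrowing can be sketched in Python. This is only an illustration of the arithmetic above, reusing the hypothetical table statistics; the helper function estimates the sequential scan time for a given number of entries of a given width.

```python
# Estimate the column-by-column scan (Scenario B), step by step.
DISK_MB_PER_SEC = 200

def scan_seconds(num_entries, bytes_per_entry):
    """Time to sequentially read num_entries values of bytes_per_entry each."""
    return (num_entries * bytes_per_entry) / 1_000_000 / DISK_MB_PER_SEC

t_state = scan_seconds(100_000_000, 12)  # State column, all rows      -> ~6.0 s
t_city  = scan_seconds(2_000_000, 12)    # City, only NY rows          -> ~0.12 s
t_year  = scan_seconds(10_000, 8)        # YearBuilt, only Albany rows -> ~0.0004 s

print(f"Total column scan time: ~{t_state + t_city + t_year:.2f} seconds")
```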

Important: Note that the actual read, or more specifically scan, performance depends on various other factors: the size of the tuple (row), the time to reconstruct the tuple (tuple reconstruction), memory bandwidth (how fast data can be read into the CPU from main memory), cache line size, and so on. In practice, there are various levels of abstraction due to which the actual performance may be slower. Furthermore, other considerations, such as hardware architecture and parallel operations, can affect the overall performance positively or otherwise. These topics are more advanced and require dedicated reading. The analysis here focuses exclusively on disk I/O, which is one of the critical aspects of overall performance at a high level.

The preceding example demonstrates the performance and efficiency benefits of querying data stored in columns, based purely on the volume of data that needs to be read. There is another benefit offered by columnar storage as well: it allows tables with arbitrary or sparse schemas to be stored efficiently in columns.

Consider the first four rows of the prior table. If, for example, we had missing information in some of the rows, that would lead to sparse columns:

| PlotNumber | Owner | Address     | State | City      | YearBuilt |
|------------|-------|-------------|-------|-----------|-----------|
| 1          | John  | 1 Main St.  | NULL  | Seattle   | 1995      |
| 2          | Mary  | 20 J. Ave.  | NY    | NULL      | NULL      |
| 3          | Jane  | NULL        | NY    | Rye Brook | NULL      |
| 4          | John  | 10 A. Blvd. | CT    | NULL      | NULL      |

Instead of populating NULL values, we can create a column family called Complete_Address that contains an arbitrary number of key-value pairs, covering only those fields that actually have data:

| PlotNumber | Owner | Complete_Address                   | YearBuilt |
|------------|-------|------------------------------------|-----------|
| 1          | John  | Address: 1 Main St., City: Seattle | 1995      |
| 2          | Mary  | Address: 20 J. Ave., State: NY     | NULL      |
| 3          | Jane  | State: NY, City: Rye Brook         | NULL      |
| 4          | John  | Address: 10 A. Blvd., State: CT    | NULL      |
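
One minimal way to picture this, as a hypothetical Python sketch rather than the storage format of any particular database, is to hold the Complete_Address column family as a per-row dictionary that contains only the keys for which data exists, instead of fixed columns padded with NULLs:

```python
# Hypothetical sketch: the Complete_Address column family stores only
# the key-value pairs that exist for each row -- no NULL placeholders.
records = [
    {"PlotNumber": 1, "Owner": "John",
     "Complete_Address": {"Address": "1 Main St.", "City": "Seattle"},
     "YearBuilt": 1995},
    {"PlotNumber": 2, "Owner": "Mary",
     "Complete_Address": {"Address": "20 J. Ave.", "State": "NY"}},
    {"PlotNumber": 3, "Owner": "Jane",
     "Complete_Address": {"State": "NY", "City": "Rye Brook"}},
    {"PlotNumber": 4, "Owner": "John",
     "Complete_Address": {"Address": "10 A. Blvd.", "State": "CT"}},
]

# Only rows that actually recorded a City carry a "City" key.
for rec in records:
    city = rec["Complete_Address"].get("City")
    if city is not None:
        print(rec["PlotNumber"], city)
```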

 

A third and very important benefit offered by columnar databases is the ability to retrieve data based on three keys: a row key, a column key, and a timestamp. Together, these uniquely identify each cell of data, permitting very fast access to the data in question.

For example, since the Owner field can change when the property (PlotNumber) is sold, we can add another field that denotes the date of the record; that is, the date that the record corresponds to. This would allow us to distinguish among properties that had a change of ownership whilst all the other data remained the same:

| PlotNumber | Owner | Address     | State | City      | YearBuilt | RecordDate |
|------------|-------|-------------|-------|-----------|-----------|------------|
| 1          | John  | 1 Main St.  | WA    | Seattle   | 1995      | 2001.04.02 |
| 2          | Mary  | 20 J. Ave.  | NY    | Albany    | 1980      | 2007.05.30 |
| 3          | Jane  | 5 45th St.  | NY    | Rye Brook | 2001      | 2001.10.24 |
| 4          | John  | 10 A. Blvd. | CT    | Stamford  | 2010      | 2003.07.20 |

Since there can be multiple records for each PlotNumber to accommodate change of ownership, we can now define three keys that could uniquely identify each cell of data in each record, as follows:

  • Row key: PlotNumber
  • Column key: The column name
  • Timestamp key: RecordDate

Each cell in each record in the table will thus have a unique combination of these three values that distinguishes it from all other cells.
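
A simple way to model this addressing scheme, purely as a hypothetical sketch and not how Bigtable or Cassandra implement it internally, is a mapping keyed by the (row key, column key, timestamp) triple; the extra Owner value and dates below are made up for illustration:

```python
# Hypothetical sketch of Bigtable-style cell addressing:
# every cell value is looked up by (row key, column key, timestamp).
cells = {
    # (PlotNumber, column name, RecordDate) -> value
    (1, "Owner", "2001.04.02"): "John",
    (1, "Owner", "2015.09.12"): "Alice",   # ownership changed later (made-up record)
    (1, "City",  "2001.04.02"): "Seattle",
}

# Exact lookup of a single cell:
print(cells[(1, "Owner", "2001.04.02")])   # -> John

# All versions of a given (row, column) pair, oldest to newest:
versions = sorted(
    (ts, val) for (row, col, ts), val in cells.items()
    if row == 1 and col == "Owner"
)
print(versions[-1])   # latest owner record
```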

Databases such as Bigtable and Cassandra employ this method to perform data analysis at scale both expeditiously and efficiently.

Some of the popular columnar databases are listed as follows. Note that some names may appear in more than one category, as databases can have multiple NoSQL properties (such as being both in-memory and columnar):

| Open source    | Commercial     |
|----------------|----------------|
| Apache Parquet | Kdb+           |
| MonetDB        | Teradata       |
| MariaDB        | SAP HANA       |
| Druid          | HP Vertica     |
| HBase          | Oracle Exadata |
| Apache Kudu    | ParAccel       |
| Apache Arrow   | Actian Vector  |
