106 | Big Data Simplied
are automatically partitioned and replicated throughout the cluster. Using a process called com-
paction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstone
(an indicator that data was deleted).
5.4.3 Components of Cassandra
The key components of Cassandra are listed as follows.
Node: It is the place where data is stored.
Data centre: It is a collection of related nodes.
Cluster: A cluster is a component that contains one or more data centres.
Commit log: The commit log is a crash-recovery mechanism in Cassandra. Every write oper-
ation is written to the commit log.
Memtable: A memtable is a memory-resident data structure. After commit log, the data will
be written to the memtable. Sometimes, for a single-column family, there will be multiple
memtables. Data is written in a memtable temporarily.
SSTable: It is a disk file to which the data is flushed from the memtable when its contents
reach a threshold value.
Bloom filter: These are nothing but quick, non-deterministic algorithms for testing whether
an element is a member of a set. It is a special kind of cache. The bloom filters are accessed
after every query.
5.4.4 Cassandra Write Operations at a Node Level
The write process (as depicted in Figure 5.7) in Cassandra is explained below.
1. When write request comes to the node, first of all, it logs in the commit log.
2. Cassandra writes the data in the memtable. Data written in the memtable on each write
request also writes in the commit log separately. Memtable is a temporarily stored data in the
memory while Commit log logs the transaction records for backup purposes.
3. When memtable is full, the data is flushed to the SSTable data file.
A memtable is ushed to an immutable structure called SSTable (Sorted String Table). The
commit log is used for playback purposes in case when the data from memtable is lost due
to node failure. For example, the machine has a power outage before the memtable could get
ushed. Every SSTable creates three les on disk which include a bloom lter, a key index and
a data le. Over a period of time, a number of SSTables are created. This results in the need
to read multiple SSTables to satisfy a read request. Compaction is the process of combining
SSTables so that related data can be found in a single SSTable. This helps with making reads
much faster.
Each node processes the request individually. Every node first writes the mutation to the
commit log and then writes the mutation to the memtable. Writing to the commit log ensures
durability of the write as the memtable is an in-memory structure and is only written to disk
when the memtable is flushed to disk. A memtable is flushed to the disk due to the following
reasons.
M05 Big Data Simplified XXXX 01.indd 106 5/20/2019 7:42:46 PM
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.213.238