So far, we have learned about building custom Java applications (for example, TripPlanner). Cassandra offers a rich set of configurations, and we have already discussed some of them in the previous chapters. In this chapter, we will discuss some important features of Cassandra, such as secondary indexes and composite columns.
Indexes over column values are known as secondary indexes. Secondary indexes are distributed: each node stores, in a hidden local column family, the index entries for the data it holds. They are quite useful whenever we need to query by a column value other than the row key. It is advisable not to use secondary indexes for unique (high-cardinality) column values, or when the set of rows matching an indexed value is very small. Cassandra builds secondary indexes automatically in the background without blocking subsequent read/write requests, and atomicity is guaranteed between the row write and the index update.
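Conceptually, each node maintains a local mapping from an indexed column's value to the row keys it owns. The following is a minimal Python sketch of that idea only; names such as SecondaryIndex are illustrative, and Cassandra actually stores the index in a hidden local column family, not an in-memory dictionary:

```python
# Conceptual sketch: a node-local secondary index mapping an
# indexed column's value to the row keys that hold it. Not
# Cassandra code; it only illustrates the lookup direction.
from collections import defaultdict

class SecondaryIndex:
    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.index = defaultdict(set)  # column value -> set of row keys

    def on_write(self, row_key, columns):
        # The index entry is updated together with the row write,
        # mirroring the atomicity guarantee described above.
        value = columns.get(self.indexed_column)
        if value is not None:
            self.index[value].add(row_key)

    def lookup(self, value):
        # Query by column value instead of row key.
        return sorted(self.index.get(value, set()))

idx = SecondaryIndex("author_name")
idx.on_write("author1", {"author_name": "cassandra", "age": 32})
idx.on_write("author2", {"author_name": "packtpub", "age": 22})
print(idx.lookup("packtpub"))  # ['author2']
```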
Let's create a column family author with a secondary index enabled for author_name, as follows:

create column family author with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and column_metadata=[{column_name:author_name, validation_class: UTF8Type, index_type:KEYS},{column_name:age, validation_class: IntegerType},{column_name:sex, validation_class: UTF8Type}];
Next, let's insert some data into the author column family as follows:

set author[author1][author_name]='cassandra';
set author[author1][age]= 32;
set author[author2][author_name]='packtpub';
set author[author2][age]= 22;
We enabled a secondary index over author_name while creating the column family, so now you can fetch rows using it as follows:

get author where author_name = 'packtpub';
get author where author_name = 'packtpub' and age <= 22;
However, there is no index over the age column, so if you try to fetch rows for a given age value as follows:

get author where age = 22;
The output would be an error:
No indexed columns present in index clause with operator EQ
The reason for such an error is that we have not added a secondary index over the age column values. We can update the author column family to add an index over age, as follows:
update column family author with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and column_metadata=[{column_name:author_name, validation_class: UTF8Type, index_type:KEYS},{column_name:age, validation_class: IntegerType, index_type:KEYS},{column_name:sex, validation_class: UTF8Type}];
Now, with the index added over the age column values, the same query succeeds:
get author where age = 22;
This will return the rows as follows:
RowKey: author2 => (column=age, value=22, timestamp=1359544767274000) => (column=author_name, value=packtpub, timestamp=1359544767264000)
Columns within a Cassandra row are sorted by column name; Cassandra relies on the comparator datatype of the column names to keep them sorted. A sorted wide row means that, for a single row key, many columns (for example, the subcolumns of a super column in a super column family) are stored in sorted order, unlike in a static column family.
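The effect of comparator-driven sorting can be sketched as follows. This is a conceptual Python model of a wide row, not client code; the point is that keeping column names sorted makes range slices over a row cheap:

```python
# Conceptual sketch: columns within one wide row, kept sorted
# by column name (the comparator's job in Cassandra).
import bisect

class WideRow:
    def __init__(self):
        self.names = []    # column names, kept in sorted order
        self.values = {}   # column name -> value

    def insert(self, name, value):
        if name not in self.values:
            bisect.insort(self.names, name)
        self.values[name] = value

    def slice(self, start, end):
        # A range slice over column names: because columns are
        # stored sorted, this is a contiguous read, not a scan.
        lo = bisect.bisect_left(self.names, start)
        hi = bisect.bisect_right(self.names, end)
        return [(n, self.values[n]) for n in self.names[lo:hi]]

row = WideRow()
for name in ["2013-01-25", "2013-01-23", "2013-01-24"]:
    row.insert(name, "hit")
print(row.slice("2013-01-23", "2013-01-24"))
```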
CompositeType is essentially a comparator built from a combination of more than one comparator type. You may use CompositeType to build features such as inverted indexes (secondary indexes do not simply work over wide rows, and they also provide less control).
Let's try a quick exercise on the usage of composite columns. In this exercise, we will build a sample application (webTrafficMonitor) to monitor the traffic for a specific website and perform various searches, such as searching for the IP addresses logged for a specific website in a given time range.
create keyspace compositesample WITH strategy_class = SimpleStrategy AND strategy_options:replication_factor = 1;
use compositesample;
create columnfamily web_traffic (url varchar, logtime timestamp, ip_address varchar, country varchar, PRIMARY KEY(url,logtime,ip_address));
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-23','192.168.43.12','India');
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-23','10.16.143.121','USA');
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-24','192.168.43.23','India');
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-24','192.168.43.29','India');
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-25','192.168.43.129','India');
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-25','192.168.143.157','India');
insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-25','192.168.143.104','India');
You can always perform a search using secondary indexes or a wildcard search, as follows:
select * from web_traffic where url='www.packtpublishing.com' order by ip_address desc;
We can also filter on the partition key (url) to fetch all the rows for www.packtpublishing.com with log times between 2013-01-23 and 2013-01-24, ordered by log time:

select * from web_traffic where url='www.packtpublishing.com' and logtime >='2013-01-23' and logtime <='2013-01-24' order by logtime desc;
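The composite primary key behaves like a sorted tuple of (url, logtime, ip_address). The following is a conceptual Python model of the range query above, not actual Cassandra storage code; it shows why restricting the partition key plus the first clustering column yields a contiguous slice:

```python
# Conceptual sketch: rows keyed by the composite
# (url, logtime, ip_address). Sorting the tuples gives the same
# range-scan behavior the CQL query relies on.
rows = [
    ("www.packtpublishing.com", "2013-01-23", "192.168.43.12", "India"),
    ("www.packtpublishing.com", "2013-01-23", "10.16.143.121", "USA"),
    ("www.packtpublishing.com", "2013-01-24", "192.168.43.23", "India"),
    ("www.packtpublishing.com", "2013-01-25", "192.168.43.129", "India"),
]

def query(rows, url, start, end):
    # url is the partition key; logtime is the first clustering
    # column, so the range predicate selects a contiguous slice.
    hits = [r for r in sorted(rows)
            if r[0] == url and start <= r[1] <= end]
    return sorted(hits, key=lambda r: r[1], reverse=True)

for r in query(rows, "www.packtpublishing.com", "2013-01-23", "2013-01-24"):
    print(r)
```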
Cassandra is meant to deal with large volumes of data, so it is quite possible to encounter problems such as high resource usage and performance bottlenecks. Let's discuss some important performance-specific configurations.
In the previous sections (Cassandra storage architecture), we discussed Cassandra's write path. Data written to memtables is periodically flushed to disk in the form of small SStables (Sorted String Tables). The process of combining row segments, rebuilding indexes for merged SStables, and removing tombstones is called compaction, and it always runs in the background. Cassandra offers two types of compaction strategy.
Size-tiered compaction (the default strategy) for a column family is preferred for applications with heavy write loads and infrequent reads/updates. Upon reaching the configured value of min_compaction_threshold (the default is 4), Cassandra will perform compaction and merge SStables of similar size together into a single SStable.
An SStable is a data structure written to disk containing data (key-value pairs), an index, and a bloom filter. Generally, the columns for a row key persist across multiple SStables (as SStables are immutable). With SizeTieredCompactionStrategy, based on min_compaction_threshold, Cassandra merges SStables of similar size (possibly holding different row keys) into one SStable. Hence, over a period of time, if there are frequent write requests for a particular row key, it is highly possible that multiple segments of that row are spread across a varying number of SStables. This can result in inconsistent read/write performance. It can also be problematic because, during repeated compactions, the same SStables containing obsolete data remain on disk until merging is completely done.
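The size-tiered trigger can be sketched as follows: once min_compaction_threshold SStables of similar size accumulate, they are merged into one larger SStable. This is a conceptual Python model (sizes in MB, threshold of 4 as per the default); the similar-size bucketing here is a deliberate simplification of Cassandra's actual grouping:

```python
# Conceptual sketch of size-tiered compaction: when 4 SStables
# of similar size accumulate in a bucket, merge them into one.
MIN_COMPACTION_THRESHOLD = 4

def bucket_of(size_mb):
    # Group SStables of "similar size" by order of magnitude
    # (a simplification of Cassandra's real bucketing).
    b = 0
    while size_mb >= 10:
        size_mb //= 10
        b += 1
    return b

def flush(sstables, size_mb):
    sstables.append(size_mb)
    buckets = {}
    for s in sstables:
        buckets.setdefault(bucket_of(s), []).append(s)
    for same_size in buckets.values():
        if len(same_size) >= MIN_COMPACTION_THRESHOLD:
            for s in same_size:
                sstables.remove(s)
            sstables.append(sum(same_size))  # the merged SStable
    return sstables

tables = []
for _ in range(4):
    tables = flush(tables, 5)  # four 5 MB memtable flushes
print(tables)  # [20]: the four 5 MB tables merged into one
```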
Leveled compaction in Cassandra is inspired by Google's leveldb (http://code.google.com/p/leveldb). It is considered the best fit for applications with frequent reads/updates. It creates fixed, small-sized SStables (5 MB by default), which are arranged in different levels with no overlap within a level. Compaction happens as soon as a new SStable add request arrives, and the SStable is promoted to the next level (say, from L0 to L1) after immediate compaction.
For example, when a particular row's data is flushed from a memtable into multiple SStables under LCS, compaction happens as soon as the add requests for these SStables arrive, and each SStable is promoted to the next level (say, from L0 to L1) after immediate compaction. This guarantees minimal data overlap, as most of a row's data segments will be available from a single SStable on each level. It also saves space by removing obsolete rows (though not entirely) on each level. Because of this continuous leveling, the strategy performs more I/O requests than SizeTieredCompactionStrategy. Each level is 10 times bigger than the previous one. As mentioned, it makes a lot of sense to prefer LCS for frequent read/write workloads, to avoid high read latency.
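The 10x level growth can be sketched numerically. In this Python sketch the 50 MB target for L1 is an assumption chosen for illustration (it is a commonly cited default, not stated in the text above), and the numbers only illustrate the shape of the hierarchy, not real internals:

```python
# Conceptual sketch of leveled compaction sizing: each level
# holds 10x more data than the previous one; fixed-size 5 MB
# SStables are promoted upward as levels overflow.
SSTABLE_MB = 5
LEVEL_FANOUT = 10
L1_TARGET_MB = 50  # assumed target for L1, for illustration

def level_capacity_mb(level):
    # L1 = 50 MB, L2 = 500 MB, L3 = 5000 MB, ...
    return L1_TARGET_MB * LEVEL_FANOUT ** (level - 1)

def level_for(total_data_mb):
    # The first level large enough to hold this much data.
    level = 1
    while level_capacity_mb(level) < total_data_mb:
        level += 1
    return level

for mb in (40, 400, 4000):
    print(mb, "MB fits in level", level_for(mb))
```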
You can specify the compaction strategy while creating a column family via the compaction_strategy attribute, as follows:

create column family users with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and compaction_strategy='SizeTieredCompactionStrategy' and default_validation_class = 'UTF8Type';
update column family users with compaction_strategy='LeveledCompactionStrategy';
By default, Cassandra has bloom filters enabled. A bloom filter is used to locate the SStables that may hold data for a particular row: upon receiving a read request, Cassandra consults the bloom filter to check whether the required data may be present in an SStable. The bloom filter false positive ratio ranges from 0.000744 (the default) to 1.0 (disabled). A higher ratio means lower memory consumption, and is advisable for applications where reads for nonexistent rows are almost nil.
You can always update a column family's bloom filter false positive ratio, as follows:
update column family users WITH bloom_filter_fp_chance = 0.004;
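The memory-versus-accuracy trade-off that bloom_filter_fp_chance tunes can be seen in a minimal, generic bloom filter. This Python sketch is not Cassandra's implementation; it only shows that a bloom filter never produces false negatives, while more bits per key drive the false positive chance down:

```python
# Conceptual sketch: a tiny bloom filter. More bits per key
# lowers the false positive chance at the cost of memory,
# which is the trade-off bloom_filter_fp_chance tunes.
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the bit array, packed into one int

    def _positions(self, key):
        # Derive num_hashes deterministic bit positions per key.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "maybe".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("row1")
print(bf.might_contain("row1"))   # True: no false negatives
print(bf.might_contain("row-x"))  # almost certainly False here
```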
Cassandra offers caching of row keys (key cache) and entire rows (row cache). These cache tunings are quite useful for improving read performance. Based on an application's read behavior, you can configure them for hot rows (frequently fetched rows) to avoid disk I/O. Higher key cache and row cache values are recommended in cases where almost the complete dataset needs to be accessed frequently. Please refer to the Configuring Cassandra section for more details on configuration.
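The hot-row idea can be sketched with a simple least-recently-used cache. This Python model is purely illustrative: Cassandra's caches are configured rather than hand-rolled, and the LRU policy here is an assumption used to show why repeated reads of hot rows skip disk I/O:

```python
# Conceptual sketch: a row cache keeping hot rows in memory so
# repeated reads avoid disk I/O, evicting the least-recently-
# used row when the cache is full.
from collections import OrderedDict

class RowCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()  # row_key -> row, in LRU order
        self.hits = self.misses = 0

    def read(self, row_key, load_from_disk):
        if row_key in self.rows:
            self.hits += 1
            self.rows.move_to_end(row_key)  # mark as recently used
            return self.rows[row_key]
        self.misses += 1
        row = load_from_disk(row_key)  # the disk I/O we want to avoid
        self.rows[row_key] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the LRU row
        return row

cache = RowCache(capacity=2)
disk = lambda key: {"row_key": key}
cache.read("hot", disk)   # miss: loaded from disk
cache.read("hot", disk)   # hit: served from memory
cache.read("a", disk)     # miss
cache.read("b", disk)     # miss: cache full, "hot" is evicted
print(cache.hits, cache.misses)  # 1 3
```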