Top features you'll want to know about

So far, we have learned about building custom Java applications (for example, TripPlanner). Cassandra offers a rich set of configurations, and we have already discussed some of them in the previous chapters. In this chapter, we will discuss some important features of Cassandra, such as secondary indexes and composite columns.

Cassandra secondary indexes

Indexes over column values are known as secondary indexes. Secondary indexes are distributed: each node locally stores, in a hidden table, the index entries for the data it holds. They are useful whenever we need to query by a column value rather than by the row key. It is advisable not to use secondary indexes for columns with mostly unique values, or when very few rows share the indexed value. Cassandra builds secondary indexes automatically in the background without blocking subsequent read/write requests, and atomicity is guaranteed between persisting a row and updating its index.
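Conceptually, a secondary index behaves like a hidden lookup table mapping an indexed column value back to the row keys that hold it, kept in step with each row write. The following is a minimal Python sketch of that idea (illustrative only; the class and method names are our own, not Cassandra's implementation):

```python
# Sketch of a value -> row-key inverted index kept in sync with the
# primary rows, mimicking how a secondary index lets you query by
# column value instead of row key.
class IndexedStore:
    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}   # row_key -> {column: value}
        self.index = {}  # indexed value -> set of row keys

    def put(self, row_key, columns):
        # Drop any stale index entry for this row, then write the row
        # and its index entry together (Cassandra does this atomically
        # per row).
        old = self.rows.get(row_key, {}).get(self.indexed_column)
        if old is not None:
            self.index.get(old, set()).discard(row_key)
        self.rows.setdefault(row_key, {}).update(columns)
        value = self.rows[row_key].get(self.indexed_column)
        if value is not None:
            self.index.setdefault(value, set()).add(row_key)

    def get_by_value(self, value):
        # "get author where author_name = ..." analogue
        return sorted(self.index.get(value, set()))

store = IndexedStore("author_name")
store.put("author1", {"author_name": "cassandra", "age": 32})
store.put("author2", {"author_name": "packtpub", "age": 22})
print(store.get_by_value("packtpub"))  # ['author2']
```

The key point the sketch shows is that every write must update both the row and the index entry, which is why Cassandra maintains these indexes itself rather than leaving it to the application.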

  1. Create a column family with a secondary index: You can enable secondary indexes while creating a column family. For example, create column family author with secondary indexes enabled for author_name, as follows:
    create column family author with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and column_metadata=[{column_name:author_name, validation_class: UTF8Type, index_type:KEYS},{column_name:age, validation_class: IntegerType},{column_name:sex, validation_class: UTF8Type}];
    
  2. Insert: First, store some columns in the author column family as follows:
    set author[author1][author_name]='cassandra';
    set author[author1][age]= 32;
    set author[author2][author_name]='packtpub';
    set author[author2][age]= 22;
    
  3. Using the secondary index: We have already enabled the secondary index on author_name while creating the column family. So now you can fetch rows using it as follows:
    get author where author_name = 'packtpub';
    get author where author_name = 'packtpub' and age <= 22;
    
  4. Update the column family: The secondary index is not enabled on the age column. So, if you try to fetch rows for a given age value as follows:
    get author where age = 22;
    

    The output would be an error:

    No indexed columns present in index clause with operator EQ
    

    The reason for such an error is that we have not added a secondary index over the age column values. We can update the author column family to add the index over age, as follows:

    update column family author with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and column_metadata=[{column_name:author_name, validation_class: UTF8Type, index_type:KEYS},{column_name:age, validation_class: IntegerType, index_type:KEYS},{column_name:sex, validation_class: UTF8Type}];
    

    With the index now added over the age column values, run the query again:

    get author where age = 22;
    

    This will return the rows as follows:

    RowKey: author2
    => (column=age, value=22, timestamp=1359544767274000)
    => (column=author_name, value=packtpub, timestamp=1359544767264000)
    

Note

For more details about using cassandra-cli, please refer to the quick exercises in the Hands on with the Cassandra command-line interface section.

Cassandra composite columns

Columns within a row in Cassandra are sorted by column name, and Cassandra relies on the comparator datatype of the column name to determine the sort order. In a wide row, the many columns stored under a single row key (for example, the subcolumns of a super column in a super column family) are kept sorted, unlike in a static column family.

CompositeType is essentially a comparator built by combining more than one comparator type. You can use CompositeType to build features such as inverted indexes, which give you more control than secondary indexes (secondary indexes do not simply work over wide rows).
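A CompositeType comparator orders column names component by component, much like Python compares tuples. A small sketch (illustrative of the ordering only, not the on-disk encoding):

```python
# Composite column names modeled as tuples: sorting compares the
# first component, then the second, and so on -- which is how a
# CompositeType comparator keeps wide-row columns ordered.
columns = [
    ("2013-01-24", "192.168.43.23"),
    ("2013-01-23", "192.168.43.12"),
    ("2013-01-25", "192.168.143.157"),
    ("2013-01-23", "10.16.143.121"),
]
ordered = sorted(columns)  # component-wise (lexicographic) order
for name in ordered:
    print(name)
```

Because all columns sharing the same first component sit next to each other, range scans over the first component become cheap contiguous reads.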

Let's try a quick exercise on the usage of composite columns. In this exercise, we will build a sample application (webTrafficMonitor) to monitor the traffic for a specific website and perform various searches, such as searching for IP addresses logged for a specific website in the given time range.

  1. Create the keyspace:
    create keyspace compositesample WITH strategy_class = SimpleStrategy AND strategy_options:replication_factor = 1;
    
  2. Use the keyspace:
    use compositesample;
    
  3. Create the column family:
    create columnfamily web_traffic (url varchar, logtime timestamp, ip_address varchar, country varchar, PRIMARY KEY(url, logtime, ip_address));
    
  4. Insert rows:
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-23','192.168.43.12','India');
    
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-23','10.16.143.121','USA');
    
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-24','192.168.43.23','India');
    
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-24','192.168.43.29','India');
    
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-25','192.168.43.129','India');
    
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-25','192.168.143.157','India');
    
    insert into web_traffic(url,logtime,ip_address,country) values('www.packtpublishing.com','2013-01-25','192.168.143.104','India');
    
  5. Perform the search:

    You can perform a search over the whole partition, or over a slice of it, as follows:

    1. Fetch all the rows for www.packtpublishing.com, ordered by log time (in CQL, ORDER BY is permitted only on the clustering columns in their declared order, so we order by logtime rather than ip_address):
      select * from web_traffic where url='www.packtpublishing.com' order by logtime desc;
      
    2. Now, suppose we need to monitor traffic and retrieve all the data for www.packtpublishing.com over a given period of time; we can use a clustering key in combination with the partition key (for example, url) to fetch all the rows for www.packtpublishing.com with log times between 2013-01-23 and 2013-01-24, ordered by log time:
      select * from web_traffic where url='www.packtpublishing.com' and logtime >='2013-01-23' and logtime <='2013-01-24' order by logtime desc;
      
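Because the columns of a partition are stored sorted by the clustering key, a time-range query like the one above reduces to a contiguous slice of the row rather than a scan. A hedged sketch using Python's bisect module (hypothetical in-memory data, not a driver API):

```python
import bisect

# Columns of one partition (one url), kept sorted by the clustering
# key (logtime, ip_address) -- as Cassandra stores them on disk.
row = sorted([
    ("2013-01-23", "192.168.43.12"),
    ("2013-01-23", "10.16.143.121"),
    ("2013-01-24", "192.168.43.23"),
    ("2013-01-24", "192.168.43.29"),
    ("2013-01-25", "192.168.43.129"),
])

def slice_by_logtime(row, start, end):
    # Binary-search the contiguous range [start, end]; no full scan,
    # which is why clustering-key range queries are cheap.
    lo = bisect.bisect_left(row, (start, ""))
    hi = bisect.bisect_right(row, (end, "\xff"))
    return row[lo:hi]

print(slice_by_logtime(row, "2013-01-23", "2013-01-24"))
```

The slice touches only the matching span of the sorted row, which mirrors how Cassandra serves a clustering-key range with a single sequential read per SStable.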

Cassandra performance tuning

Cassandra is meant to deal with large volumes of data, so it is quite possible to encounter problems such as high resource usage and performance bottlenecks. Let's discuss some important performance-specific configurations.

Compaction tuning

In the previous sections (Cassandra storage architecture), we discussed Cassandra's write path. Data written to memtables is periodically flushed to disk in the form of small SStables (Sorted String Tables). The process of combining row segments, rebuilding indexes for merged SStables, and removing tombstones is called compaction, and it runs continuously in the background. Cassandra offers two compaction strategies.

SizeTieredCompactionStrategy

Size-tiered compaction (the default strategy for a column family) is preferred for write-heavy workloads with infrequent reads/updates. Upon reaching the configured value of min_compaction_threshold (default is 4), Cassandra performs compaction and merges SStables of a similar size into a single SStable.

An SStable is a data structure written to disk containing data (key-value pairs), an index, and a bloom filter. Because SStables are immutable, the columns for a row key generally persist across multiple SStables. With SizeTieredCompactionStrategy, based on min_compaction_threshold, Cassandra merges SStables of similar size (possibly holding different row keys) into one SStable. Hence, over time, if there are frequent write requests for a particular row key, multiple segments of that row may be spread across a varying number of SStables, which can result in inconsistent read performance. Repeated compactions can also be problematic: SStables containing obsolete data remain on disk until merging completes, temporarily consuming extra space.
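The size-tiered behavior can be sketched as bucketing SStables of similar size and merging a bucket once it reaches min_compaction_threshold. The bucketing rule and names below are simplified assumptions for illustration, not Cassandra's exact algorithm:

```python
# Simplified size-tiered compaction: group SStables whose sizes are
# within 50% of the bucket's average size; merge a bucket once it
# holds min_compaction_threshold (default 4) tables.
MIN_COMPACTION_THRESHOLD = 4

def bucket_by_size(sstable_sizes, fuzz=0.5):
    buckets = []  # each bucket is a list of SStable sizes
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * (1 - fuzz) <= size <= avg * (1 + fuzz):
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets

def compact(sstable_sizes):
    out = []
    for bucket in bucket_by_size(sstable_sizes):
        if len(bucket) >= MIN_COMPACTION_THRESHOLD:
            out.append(sum(bucket))   # merge into one larger SStable
        else:
            out.extend(bucket)        # too few similar tables: wait
    return sorted(out)

print(compact([10, 11, 9, 10, 100, 95]))  # -> [40, 95, 100]
```

Note how the four ~10 MB tables merge into one 40 MB table while the two large tables are left alone; over many rounds this produces the tiers of ever-larger SStables the strategy is named for.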

LeveledCompactionStrategy (LCS)

Leveled compaction in Cassandra is inspired by Google's leveldb (http://code.google.com/p/leveldb). It is considered the best fit for applications with frequent reads/updates. It creates SStables of a fixed, small size (5 MB by default), which are arranged in levels with no overlapping SStables within a level. Compaction happens as soon as new SStables arrive at a level, and the merged data is promoted to the next level (say, from L0 to L1).

For example, when a memtable flush produces multiple new SStables at L0, they are compacted as soon as they arrive and promoted to the next level (say, from L0 to L1). This guarantees that most of a row's data segments are available from a single SStable at each level, avoiding overlap. It also saves space by removing obsolete rows (though not entirely) at each level. Each level is 10 times bigger than the previous one. Because of this continuous leveling, the strategy performs more I/O than SizeTieredCompactionStrategy; even so, LCS makes a lot of sense for frequent read/write workloads, as it avoids high read latency.
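The 10x level growth can be sketched numerically, assuming the 5 MB SStable size and 10x factor mentioned above (the helper names are our own):

```python
# Each LCS level holds roughly 10x the data of the previous one, so
# even large datasets fit in a handful of levels -- and a read needs
# to consult at most one SStable per level.
SSTABLE_MB = 5   # fixed SStable size (default, per the text)
GROWTH = 10      # each level is 10x the previous one

def level_capacity_mb(level):
    # L1 capacity = 10 * 5 MB, L2 = 100 * 5 MB, and so on.
    return SSTABLE_MB * GROWTH ** level

def levels_needed(total_mb):
    # How many levels until the cumulative capacity covers total_mb?
    level, capacity = 1, 0
    while capacity < total_mb:
        capacity += level_capacity_mb(level)
        level += 1
    return level - 1

print(level_capacity_mb(2))   # -> 500
print(levels_needed(10_000))  # levels needed for ~10 GB of data
```

This is why LCS bounds read latency: the number of SStables a read may touch grows with the number of levels, which grows only logarithmically with data size.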

  • You can create a column family with a specific compaction_strategy, as follows:
    create column family users with key_validation_class = 'UTF8Type' and comparator = 'UTF8Type' and compaction_strategy='SizeTieredCompactionStrategy' and default_validation_class = 'UTF8Type';
    
  • You can also update a column family for compaction_strategy as follows:
    update column family users with compaction_strategy='LeveledCompactionStrategy';
    

Bloom filter false positive chance

By default, Cassandra has bloom filters enabled. A bloom filter is used to determine which SStables may hold data for a particular row: upon receiving a read request, Cassandra checks the bloom filter to see whether the required data might be present in an SStable. The bloom filter false positive chance ranges from 0.000744 (the default) to 1.0 (disabled). A higher value means lower memory consumption, and is advisable only for applications in which reads for nonexistent rows are almost nil.
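The memory/accuracy trade-off can be illustrated with a toy bloom filter; this is a sketch, not Cassandra's implementation, and the point is only that fewer bits mean a higher false positive chance, never a false negative:

```python
import hashlib

# Toy bloom filter: k hash functions set k bits per key. Lookups can
# return false positives (bit collisions) but never false negatives.
# Shrinking `bits` raises the false positive chance, analogous to
# raising bloom_filter_fp_chance to save memory.
class BloomFilter:
    def __init__(self, bits=1024, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.bitset = 0  # integer used as a bit array

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key):
        for pos in self._positions(key):
            self.bitset |= 1 << pos

    def might_contain(self, key):
        # True means "maybe present"; False means "definitely absent".
        return all(self.bitset >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
for row_key in ("author1", "author2"):
    bf.add(row_key)
print(bf.might_contain("author1"))    # True (definitely added)
print(bf.might_contain("author999"))  # usually False; True = false positive
```

A "maybe present" answer still requires reading the SStable to confirm, which is why a high false positive chance translates into wasted disk reads when absent rows are queried often.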

You can always update a column family's bloom filter false positive chance, as follows:

update column family users WITH bloom_filter_fp_chance = 0.004;

Note

Updating bloom_filter_fp_chance would require a rebuild of SStables to regenerate bloom filters.

Cache

Cassandra offers caching of row keys (key cache) and entire rows (row cache). These caches are very useful for improving read performance. Based on an application's read pattern, you can configure them so that hot rows (frequently fetched rows) are served without disk I/O. Higher key cache and row cache sizes are recommended when an almost complete dataset needs to be accessed frequently. Please refer to the Configuring Cassandra section for more details on configuration.
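The effect of a row cache can be sketched with a small LRU layer in front of a slow "disk" read. This is purely illustrative (Cassandra's caches are configured, not hand-coded), but it shows why hot rows stop costing disk I/O:

```python
from collections import OrderedDict

# Tiny LRU row cache: hot rows are served from memory, evicting the
# least recently used entry once capacity is reached -- the same idea
# as Cassandra's row cache for frequently fetched rows.
class RowCache:
    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, row_key):
        if row_key in self.cache:
            self.hits += 1
            self.cache.move_to_end(row_key)  # mark as most recent
            return self.cache[row_key]
        self.misses += 1
        row = self.read_from_disk(row_key)   # the expensive path
        self.cache[row_key] = row
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recent
        return row

disk = {"author1": {"age": 32}, "author2": {"age": 22}}
cache = RowCache(capacity=1, read_from_disk=disk.get)
cache.get("author1"); cache.get("author1"); cache.get("author2")
print(cache.hits, cache.misses)  # 1 hit, 2 misses
```

With capacity 1, the second read of author1 is a hit, but reading author2 evicts it; sizing the cache against the hot set is the tuning decision the text describes.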
