Apache Cassandra

Apache Cassandra mixes features of key-value and traditional relational databases. In a conventional relational database, the columns of a table are fixed. In Cassandra, however, rows within the same table can have different columns. Cassandra is therefore described as column oriented, since it allows a flexible schema for each row. Columns are organized in so-called column families, which are the equivalent of tables in relational databases. Cassandra does not support joins or subqueries.

Cassandra can be downloaded from http://cassandra.apache.org/download/. The latest version at the time of writing was 2.0.9. Refer to http://wiki.apache.org/cassandra/GettingStarted to get started.
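
To make the flexible-schema idea concrete, here is a minimal sketch in plain Python, with no Cassandra involved. It models a column family as a dictionary of rows, where each row is its own dictionary of columns. The names users, email, and phone are made up for illustration and do not come from Cassandra itself.

```python
# A toy "column family": row key -> {column name: value}.
# The row keys and column names are hypothetical examples.
users = {
    "alice": {"email": "alice@example.com", "phone": "555-0100"},
    "bob": {"email": "bob@example.com"},  # this row has no phone column
}


def columns_of(column_family, row_key):
    """Return the sorted column names stored for one row."""
    return sorted(column_family[row_key])


# Rows in the same column family need not share the same columns.
print(columns_of(users, "alice"))
print(columns_of(users, "bob"))
```

A relational table would force both rows to carry a phone column, possibly empty; here the second row simply does not have one.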

Run the server from the command line as follows:

$ bin/cassandra -f

If you run the previous command, you may get the following error message:

Cassandra 2.0 and later require Java 7 or later.

Java in this context is a high-level programming language, like Python. Java 7 is the marketing name for version 1.7, which is the number the version string reports. If you have Java installed, you can check its version as follows:

$ java -version
java version "1.7.0_60"
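
Since the version string says 1.7 while the error message says Java 7, a small helper can translate between the two. The following is only a sketch: java_major_version is a hypothetical function name, and it assumes the version line has the quoted form shown above. It parses a string rather than calling java itself, so it runs even on a machine without Java.

```python
import re


def java_major_version(version_line):
    """Extract the major Java version from a line such as
    'java version "1.7.0_60"'. Returns None if nothing matches."""
    match = re.search(r'"(\d+)(?:\.(\d+))?', version_line)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Old-style strings start with "1." and carry the major version second.
    if first == 1 and second is not None:
        return int(second)
    return first


print(java_major_version('java version "1.7.0_60"'))
```

Cassandra 2.0 needs this number to be 7 or higher.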

Note

For most operating systems, except Mac OS X, you can download Java from http://www.oracle.com/technetwork/java/javase/downloads/index.html.

Instructions for installing Java on Mac are given at http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html. Since this is a Python book, we will not dwell too long on the details of installing Java. A quick web search should give you more than enough information.

Create the directories listed in conf/cassandra.yaml or tweak them as follows:

data_file_directories:
    - /tmp/lib/cassandra/data
commitlog_directory: /tmp/lib/cassandra/commitlog
saved_caches_directory: /tmp/lib/cassandra/saved_caches

Directories under /tmp are temporary, so the following commands make sense only if you don't want to keep the data:

$ mkdir -p /tmp/lib/cassandra/data
$ mkdir -p /tmp/lib/cassandra/commitlog
$ mkdir -p /tmp/lib/cassandra/saved_caches
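
The same directories can be created from Python, which may be convenient in a setup script. This is a minimal sketch assuming Python 3.2 or later (for the exist_ok argument); make_cassandra_dirs is a hypothetical helper, and the demonstration uses a throwaway temporary directory rather than /tmp/lib/cassandra.

```python
import os
import tempfile


def make_cassandra_dirs(base):
    """Create data, commitlog and saved_caches directories under base."""
    created = []
    for name in ("data", "commitlog", "saved_caches"):
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # behaves like mkdir -p
        created.append(path)
    return created


# Demonstrated on a throwaway directory; pass "/tmp/lib/cassandra" to
# reproduce the layout used in the text.
demo_base = os.path.join(tempfile.mkdtemp(), "cassandra")
paths = make_cassandra_dirs(demo_base)
print(paths)
```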

Install a Python driver with the following command:

$ sudo pip install cassandra-driver
$ pip freeze|grep cassandra-driver
cassandra-driver==2.0.2
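
The installed version can also be checked from within Python. The sketch below assumes Python 3.8 or later, where importlib.metadata is part of the standard library; installed_version is a hypothetical helper name. It returns None instead of raising an error when the package is missing.

```python
from importlib import metadata  # standard library in Python 3.8+


def installed_version(package="cassandra-driver"):
    """Return the installed version string, or None if not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version())
```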

You might get the following error message:

The required version of setuptools (>=0.9.6) is not available,
    and can't be installed while this script is running. Please
    install a more recent version first, using
    'easy_install -U setuptools'.

The message is self-explanatory: upgrade setuptools with the suggested command and then rerun the driver installation.

Now it's time for the code. Connect to a cluster and create a session as follows:

from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()

Cassandra has the concept of a keyspace, which holds tables. Cassandra also has its own query language, called Cassandra Query Language (CQL), which is very similar to SQL. Create the keyspace and set the session to use it:

session.execute("CREATE KEYSPACE IF NOT EXISTS mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")
session.set_keyspace('mykeyspace')
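
If you need more than one keyspace, the CQL statement above can be built from parameters. The following is a sketch only: create_keyspace_cql is a hypothetical helper, not part of the driver, and the name check with isidentifier is a coarse guard rather than a full implementation of CQL's naming rules.

```python
def create_keyspace_cql(name, replication_factor=1):
    """Build a CREATE KEYSPACE statement using SimpleStrategy."""
    # Coarse sanity check so that arbitrary text is not pasted into CQL.
    if not name.isidentifier():
        raise ValueError("unexpected keyspace name: %r" % (name,))
    return (
        "CREATE KEYSPACE IF NOT EXISTS %s WITH REPLICATION = "
        "{ 'class' : 'SimpleStrategy', 'replication_factor' : %d };"
        % (name, int(replication_factor))
    )


print(create_keyspace_cql("mykeyspace"))
```

The returned string can be passed to session.execute in the same way as the literal statement.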

Now, create a table for the sunspots data:

session.execute("CREATE TABLE IF NOT EXISTS sunspots (year decimal PRIMARY KEY, sunactivity decimal);")
  1. Create a statement that we will use in a loop to insert rows of the data as tuples:
    query = SimpleStatement(
        "INSERT INTO sunspots (year, sunactivity) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
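  2. The following loop inserts the data one row at a time (rows is a list of (year, sunactivity) tuples; the full listing at the end of this section shows how it is built):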
    for row in rows:
        session.execute(query, row)
  3. Get the count of the rows in the table:
    print(session.execute("SELECT COUNT(*) FROM sunspots"))

    This prints the row count as follows:

    [Row(count=309)]
    
  4. Drop the keyspace and shut down the cluster:
    session.execute('DROP KEYSPACE mykeyspace')
    cluster.shutdown()
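
The insert loop from step 2 can be wrapped in a small function, which also makes it possible to try the logic without a running server. This is a sketch under assumptions: insert_rows and RecordingSession are hypothetical names, and RecordingSession is a stand-in that only records calls. The function works with any object that offers an execute(query, parameters) method, which is how the driver's session is used in the steps above. The sample rows are placeholder values for illustration.

```python
def insert_rows(session, query, rows):
    """Execute query once per row and return how many rows were sent."""
    sent = 0
    for row in rows:
        session.execute(query, row)
        sent += 1
    return sent


class RecordingSession(object):
    """Hypothetical stand-in for a driver session; it only records calls."""

    def __init__(self):
        self.calls = []

    def execute(self, query, parameters=None):
        self.calls.append((query, parameters))


fake = RecordingSession()
insert_cql = "INSERT INTO sunspots (year, sunactivity) VALUES (%s, %s)"
sample_rows = [(1700.0, 5.0), (1701.0, 11.0)]  # placeholder values
print(insert_rows(fake, insert_cql, sample_rows))
```

With a real session, you would pass the SimpleStatement from step 1 as the query argument.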

Refer to the cassandra_demo.py file in this book's code bundle:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
import statsmodels.api as sm

cluster = Cluster()
session = cluster.connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")
session.set_keyspace('mykeyspace')
session.execute("CREATE TABLE IF NOT EXISTS sunspots (year decimal PRIMARY KEY, sunactivity decimal);")

query = SimpleStatement(
    "INSERT INTO sunspots (year, sunactivity) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

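# Load the sunspots dataset that ships with statsmodels.
data_loader = sm.datasets.sunspots.load_pandas()
df = data_loader.data
# Turn each DataFrame row into a (year, sunactivity) tuple.
rows = [tuple(x) for x in df.values]
for row in rows:
    session.execute(query, row)

# Parenthesized print works under both Python 2 and Python 3.
print(session.execute("SELECT COUNT(*) FROM sunspots"))

session.execute('DROP KEYSPACE mykeyspace')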
cluster.shutdown()
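
The count query prints a list holding a single Row rather than a bare number. The sketch below shows one way to pull out the integer. It assumes the result can be iterated and that rows behave like named tuples, as the printed output [Row(count=309)] suggests; the Row type defined here is a stand-in so that the example runs without a server, and row_count is a hypothetical helper.

```python
from collections import namedtuple

# Stand-in for the driver's row type, so the sketch runs without a server.
Row = namedtuple("Row", ["count"])
result = [Row(count=309)]  # same shape as the output shown in the text


def row_count(result):
    """Return the integer from a one-row COUNT(*) result."""
    first = list(result)[0]
    return first.count


print(row_count(result))
```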