Apache Cassandra mixes features of key-value and traditional relational databases. In a conventional relational database, the columns of a table are fixed. In Cassandra, however, rows within the same table can have different columns. Cassandra is therefore column oriented, since it allows a flexible schema for each row. Columns are organized in so-called column families, which are equivalent to tables in relational databases. Cassandra does not support joins or subqueries.
Cassandra can be downloaded from http://cassandra.apache.org/download/. The latest version at the time of writing was 2.0.9. Please refer to http://wiki.apache.org/cassandra/GettingStarted to get started.
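To picture the flexible schema described above, the following sketch models a column family with plain Python dictionaries. It is only an analogy and says nothing about how Cassandra stores data internally; the names column_family and columns_of, as well as the row keys and values, are made up for illustration.

```python
# A toy model of a column family: row key -> {column name: value}.
# Illustration only; this is not how Cassandra stores data on disk.
column_family = {
    "row1": {"name": "a"},
    "row2": {"name": "b", "comment": "c"},  # this row has an extra column
}


def columns_of(family, row_key):
    """Return the sorted column names stored for one row."""
    return sorted(family[row_key])


for key in sorted(column_family):
    print(key, columns_of(column_family, key))
```

Both rows live in the same "table", yet only the second one carries a comment column, which a fixed relational schema would not allow without a NULL placeholder.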
Run the server from the command line as follows:
$ bin/cassandra -f
If you run the previous command, you may get the following error message:
Cassandra 2.0 and later require Java 7 or later.
Java, like Python, is a high-level programming language. Java 7 refers to version 1.7 (the shorter number is the marketing name). If you have Java installed, you can check its version as follows:
$ java -version
java version "1.7.0_60"
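Since this is a Python book, the version check can also be sketched in Python. The helper below, java_major_version, is a hypothetical name of our own; it only parses a version line like the one shown above and maps the legacy 1.x numbering to the marketing number. If you capture the line yourself with subprocess, note that java -version writes to standard error rather than standard output.

```python
import re


def java_major_version(version_line):
    """Extract the major Java version from a 'java -version' line.

    Legacy strings such as "1.7.0_60" map to 7; newer strings
    such as "11.0.2" map to 11.
    """
    match = re.search(r'"(\d+)(?:\.(\d+))?', version_line)
    if match is None:
        raise ValueError("no version string found in: %r" % version_line)
    first = int(match.group(1))
    second = int(match.group(2)) if match.group(2) else 0
    # Releases up to Java 8 report themselves as 1.x
    return second if first == 1 else first


sample = 'java version "1.7.0_60"'
major = java_major_version(sample)
print(major, major >= 7)  # prints: 7 True
```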
For most operating systems, except Mac OS X, you can download Java from http://www.oracle.com/technetwork/java/javase/downloads/index.html.
Instructions for installing Java on Mac are given at http://docs.oracle.com/javase/7/docs/webnotes/install/mac/mac-jdk.html. Since this is a Python book, we will not dwell too long on the details of installing Java. A quick web search should give you more than enough information.
Create the directories listed in conf/cassandra.yaml or tweak them as follows:
data_file_directories:
    - /tmp/lib/cassandra/data
commitlog_directory: /tmp/lib/cassandra/commitlog
saved_caches_directory: /tmp/lib/cassandra/saved_caches
The following commands make sense if you don't want to keep the data:
$ mkdir -p /tmp/lib/cassandra/data
$ mkdir -p /tmp/lib/cassandra/commitlog
$ mkdir -p /tmp/lib/cassandra/saved_caches
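The same directories can be created from Python, which may be convenient in a setup script. This is a minimal sketch; make_cassandra_dirs and SUBDIRS are names invented here, and the demo deliberately writes to a throwaway temporary directory instead of /tmp/lib/cassandra.

```python
import os
import tempfile

# The three subdirectories used in the configuration above
SUBDIRS = ("data", "commitlog", "saved_caches")


def make_cassandra_dirs(base):
    """Create the Cassandra directories under base, like mkdir -p."""
    paths = [os.path.join(base, name) for name in SUBDIRS]
    for path in paths:
        if not os.path.isdir(path):
            os.makedirs(path)  # also creates missing parent directories
    return paths


# Demo in a temporary location; pass "/tmp/lib/cassandra" to match the text.
demo_base = os.path.join(tempfile.mkdtemp(), "lib", "cassandra")
created = make_cassandra_dirs(demo_base)
print(created)
```

Calling the function a second time is harmless, because existing directories are skipped.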
Install a Python driver with the following command:
$ sudo pip install cassandra-driver
$ pip freeze|grep cassandra-driver
cassandra-driver==2.0.2
You might get the following error message:
The required version of setuptools (>=0.9.6) is not available, and can't be installed while this script is running. Please install a more recent version first, using 'easy_install -U setuptools'.
The message is self-explanatory: upgrade setuptools with the suggested command and then rerun the installation.
Now it's time for the code. Connect to a cluster and create a session as follows:
from cassandra.cluster import Cluster

cluster = Cluster()
session = cluster.connect()
Cassandra has the concept of a keyspace, which holds tables. Cassandra also has its own query language, called Cassandra Query Language (CQL), which is very similar to SQL. Create the keyspace and set the session to use it:
session.execute("CREATE KEYSPACE IF NOT EXISTS mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")
session.set_keyspace('mykeyspace')
Now, create a table for the sunspots data:
session.execute("CREATE TABLE IF NOT EXISTS sunspots (year decimal PRIMARY KEY, sunactivity decimal);")
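Create a statement that we will reuse in a loop to insert the rows as tuples:
query = SimpleStatement(
    "INSERT INTO sunspots (year, sunactivity) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
Insert the data, where rows is the list of (year, sunactivity) tuples built from the statsmodels sunspots dataset in the full listing:
for row in rows:
    session.execute(query, row)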
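The insert statement requests ConsistencyLevel.QUORUM, which still succeeds on a single node because the keyspace was created with a replication factor of 1. As a rough sketch, assuming the usual definition of a quorum as a majority of the replicas, the required number of replica acknowledgements can be computed as follows. The function name quorum is ours and is not part of the driver.

```python
def quorum(replication_factor):
    """Replicas that must acknowledge a request at consistency level QUORUM.

    Assumes a quorum is a simple majority: floor(RF / 2) + 1.
    """
    return replication_factor // 2 + 1


for rf in (1, 3, 5):
    print(rf, quorum(rf))
```

With a replication factor of 1 the quorum is 1, so the lone local node is enough to satisfy the request.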
Get the count of the rows in the table:
print(session.execute("SELECT COUNT(*) FROM sunspots"))
This prints the row count as follows:
[Row(count=309)]
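The printed result is a list with a single row. The output format suggests that the driver returns rows as named tuples, which we believe is its default behavior. The sketch below mimics such a row with collections.namedtuple to show how the number itself could be extracted; the Row class here is a stand-in that we define ourselves and not the driver's own class.

```python
from collections import namedtuple

# Stand-in for the row type returned by the driver (an assumption for this demo)
Row = namedtuple("Row", ["count"])

result = [Row(count=309)]  # mimics the result printed above
print(result)

first = result[0]
print(first[0])     # positional access to the first column
print(first.count)  # access by column name; the field shadows tuple.count
```

Positional access is the safer of the two, since a column named count hides the tuple method of the same name.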
Drop the keyspace and shut down the cluster:
session.execute('DROP KEYSPACE mykeyspace')
cluster.shutdown()
Refer to the cassandra_demo.py file in this book's code bundle:
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
import statsmodels.api as sm

# Connect to the local cluster and open a session
cluster = Cluster()
session = cluster.connect()

# Create the keyspace and make it the default for this session
session.execute("CREATE KEYSPACE IF NOT EXISTS mykeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };")
session.set_keyspace('mykeyspace')

# Create the table and the reusable insert statement
session.execute("CREATE TABLE IF NOT EXISTS sunspots (year decimal PRIMARY KEY, sunactivity decimal);")
query = SimpleStatement(
    "INSERT INTO sunspots (year, sunactivity) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

# Load the sunspots data and convert each row to a tuple
data_loader = sm.datasets.sunspots.load_pandas()
df = data_loader.data
rows = [tuple(x) for x in df.values]

for row in rows:
    session.execute(query, row)

print(session.execute("SELECT COUNT(*) FROM sunspots"))

# Clean up: drop the keyspace and close the connection
session.execute('DROP KEYSPACE mykeyspace')
cluster.shutdown()