Installing Shark

As of the writing of this chapter, the latest version of Shark is v0.7.0, and it requires Spark 0.7.2 as well as a recent JVM (OpenJDK 7 or Oracle HotSpot JDK 7). Shark is available pre-built for both Hadoop 1 and Hadoop 2, with a separate download for each. Once you have downloaded and extracted Shark, it's time to configure it. In this example, we will assume that you extracted it in /home/spark/. Shark has a separate configuration from Spark, which lives in shark-0.7.0/conf/. For local mode, you need to set at least HIVE_HOME and SPARK_HOME, like so:

export HIVE_HOME=/home/spark/hive-0.9.0-bin
export SPARK_HOME=/home/spark/spark-0.7.2
source $SPARK_HOME/conf/
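
A mistyped path in these variables tends to show up only later, as a confusing failure at startup, so it can be worth checking them before launching Shark. The following is a minimal sketch and not part of Shark itself; the function name check_dir_var is hypothetical.

```shell
# Hypothetical helper (not shipped with Shark): succeed only if the named
# environment variable is set and points to an existing directory.
check_dir_var() {
  name="$1"
  # Look up the value of the variable whose name was passed in.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "$name is not set" >&2
    return 1
  fi
  if [ ! -d "$value" ]; then
    echo "$name points to a missing directory: $value" >&2
    return 1
  fi
  echo "$name=$value"
}
```

After the export lines, running check_dir_var HIVE_HOME && check_dir_var SPARK_HOME prints both settings, or stops at the first one that is wrong.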

In local mode, you also need to create a place for Hive to store its files, which by default is /user/hive/warehouse. Use the chown command to make the directory accessible to your user, like so:

mkdir -p /user/hive/warehouse && chown [your-spark-user] /user/hive/warehouse
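
To confirm that the directory ended up usable by the account that will run Shark, a small check along the following lines may help. It is only a sketch: warehouse_ready is a hypothetical name, and the default path is the Hive default mentioned above.

```shell
# Hypothetical helper: succeed if the warehouse directory exists and is
# writable by the current user. Defaults to Hive's default location.
warehouse_ready() {
  dir="${1:-/user/hive/warehouse}"
  [ -d "$dir" ] && [ -w "$dir" ]
}
```

Run as your Spark user, warehouse_ready || echo "warehouse directory is missing or not writable" gives a quick yes-or-no answer.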

If you are using Shark with a Spark cluster, you also need to set the MASTER and HADOOP_HOME variables. If you are using Shark with an existing Hive installation, you must set HIVE_CONF_DIR to the directory containing the Hive XML configuration files. If you add these after the source... line, you can reference variables from the Spark configuration, such as SPARK_MASTER_IP:

export HADOOP_HOME=/path/to/hadoop
export MASTER=spark://$SPARK_MASTER_IP:7077
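
The order matters here: if the export runs before the Spark configuration has been sourced, SPARK_MASTER_IP is empty and MASTER silently becomes spark://:7077. One way to guard against that is sketched below; spark_master_url is a hypothetical helper, and 7077 is simply the port used in the line above.

```shell
# Hypothetical guard: build the master URL only when the host is known.
# The host defaults to $SPARK_MASTER_IP and the port to 7077.
spark_master_url() {
  host="${1:-${SPARK_MASTER_IP:-}}"
  port="${2:-7077}"
  if [ -z "$host" ]; then
    echo "SPARK_MASTER_IP is empty; was the Spark configuration sourced first?" >&2
    return 1
  fi
  echo "spark://${host}:${port}"
}
```

In the configuration file this could be used as MASTER=$(spark_master_url) && export MASTER, which leaves MASTER untouched when the host is unknown.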

Once you have Shark installed and set up, you also need to copy Shark and its custom Hive to all the worker nodes; do this with:

pscp -v -r -h ./spark-0.7.2/conf/slaves -l sparkuser ./shark-0.7.0 ~/
pscp -v -r -h ./spark-0.7.2/conf/slaves -l sparkuser ./hive-0.9.0-bin ~/
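
After copying, it may be worth confirming that every worker actually received the directories. The sketch below reads the same slaves file and checks each host over SSH. Both function names are hypothetical, and it assumes passwordless SSH as sparkuser, as the pscp commands above do.

```shell
# Print one worker hostname per line, skipping blank lines and # comments.
list_workers() {
  sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1"
}

# Hypothetical check: report every worker that lacks the given directory.
# The directory is resolved relative to the remote user's home directory.
check_workers() {
  slaves_file="$1"
  remote_dir="$2"
  list_workers "$slaves_file" | while read -r host; do
    # -n keeps ssh from consuming the host list on standard input.
    ssh -n -l sparkuser "$host" "test -d $remote_dir" || echo "missing on $host"
  done
}
```

For example, check_workers ./spark-0.7.2/conf/slaves shark-0.7.0 prints one line per worker where the Shark directory is absent, and nothing if all copies succeeded.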

If you are doing an EC2-based setup, just use the latest AMI; it should already be set up for Shark.

