At the time of writing, the latest version of Shark is v0.7.0, which requires Spark 0.7.2 and a recent JVM (OpenJDK 7 or Oracle HotSpot JDK 7). Shark is available pre-built for both Hadoop 1 and Hadoop 2; the respective files are http://spark-project.org/download/shark-0.7.0-hadoop1-bin.tgz and http://spark-project.org/download/shark-0.7.0-hadoop2-bin.tgz. Once you have downloaded and extracted Shark, it's time to configure it. In this example, we assume that you extracted it in /home/spark/. Shark has a separate configuration from Spark, which lives at shark-0.7.0/conf/shark-env.sh. For local mode, you need to set at least HIVE_HOME and SPARK_HOME, like so:
export HIVE_HOME=/home/spark/hive-0.9.0-bin
export SPARK_HOME=/home/spark/spark-0.7.2
source $SPARK_HOME/conf/spark-env.sh
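Before starting Shark, it can help to confirm that both variables point at real directories. The following is a minimal sketch, not part of Shark itself; the function name check_dir_var is hypothetical and introduced only for illustration.

```shell
# Hypothetical helper: report whether the named variable is set
# and points to an existing directory.
check_dir_var() {
  name="$1"
  # Look up the value of the variable whose name was passed in.
  eval "value=\${$name:-}"
  if [ -z "$value" ]; then
    echo "$name is not set"
    return 1
  fi
  if [ ! -d "$value" ]; then
    echo "$name points to a missing directory: $value"
    return 1
  fi
  echo "$name OK: $value"
}

# Check the two variables that local mode needs.
for var in HIVE_HOME SPARK_HOME; do
  check_dir_var "$var" || true
done
```

Run it in a shell where shark-env.sh has already been sourced, so the exported values are visible.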
In local mode, you also need to create a place for Hive to store its files, which by default is /user/hive/warehouse. Make sure to use the chown command to make the directory accessible to your user, like so:
mkdir -p /user/hive/warehouse && chown [your-spark-user] /user/hive/warehouse
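To confirm that the step above worked for the user who will run Shark, a small check along these lines may be useful. This is only a sketch; warehouse_ready is a hypothetical name, and the default path is simply the one used above.

```shell
# Hypothetical helper: succeed only if the warehouse directory exists
# and the current user can write to it.
warehouse_ready() {
  dir="${1:-/user/hive/warehouse}"
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "writable: $dir"
  else
    echo "missing or not writable: $dir"
    return 1
  fi
}

# Report on the default location without aborting the shell on failure.
warehouse_ready /user/hive/warehouse || true
```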
If you are using Shark with a Spark cluster, you also need to set the MASTER and HADOOP_HOME variables. If you are using Shark with an existing Hive installation, you must set HIVE_CONF_DIR to the directory containing the Hive XML configuration files. If you add these after the source ... line, you can reference the variables from the Spark configuration, like so:
export HADOOP_HOME=/path/to/hadoop
export MASTER=spark://$SPARK_MASTER_IP:7077
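Putting the pieces together, a cluster-mode shark-env.sh might look like the following sketch. The paths /path/to/hadoop and /path/to/hive/conf are placeholders. The fallback to this machine's hostname when SPARK_MASTER_IP is unset is an assumption rather than something Shark requires, and the file-existence guard is added only so that the fragment can be sourced safely on a machine without Spark.

```shell
# Sketch of shark-0.7.0/conf/shark-env.sh for cluster mode (paths are placeholders).
export HIVE_HOME=/home/spark/hive-0.9.0-bin
export SPARK_HOME=/home/spark/spark-0.7.2

# Pull in Spark's settings (such as SPARK_MASTER_IP) if the file exists.
if [ -f "$SPARK_HOME/conf/spark-env.sh" ]; then
  . "$SPARK_HOME/conf/spark-env.sh"
fi

export HADOOP_HOME=/path/to/hadoop
# Only needed with an existing Hive installation; placeholder path.
export HIVE_CONF_DIR=/path/to/hive/conf
# Assumption: fall back to this machine's hostname if SPARK_MASTER_IP is unset.
export MASTER="spark://${SPARK_MASTER_IP:-$(hostname)}:7077"
```

The ordering matters: MASTER is built from SPARK_MASTER_IP, so it has to come after the line that sources the Spark configuration.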
Once you have Shark installed and set up, you also need to copy Shark and its custom Hive to all the worker nodes; do this with:
pscp -v -r -h ./spark-0.7.2/conf/slaves -l sparkuser ./shark-0.7.0 ~/
pscp -v -r -h ./spark-0.7.2/conf/slaves -l sparkuser ./hive-0.9.0-bin ~/
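If pscp is not available, the same copy can be done one host at a time with plain scp. The sketch below only prints the commands so that you can review them first; print_copy_commands is a hypothetical helper, and it assumes the slaves file lists one hostname per line, with blank lines and lines starting with # ignored.

```shell
# Hypothetical helper: print one scp command per worker and directory.
# Arguments: slaves file, remote user, then one or more directories to copy.
print_copy_commands() {
  slaves_file="$1"
  user="$2"
  shift 2
  # The extra test lets the loop handle a last line without a trailing newline.
  while IFS= read -r host || [ -n "$host" ]; do
    case "$host" in
      ''|'#'*) continue ;;  # skip blank lines and comments
    esac
    for dir in "$@"; do
      echo "scp -r $dir $user@$host:~/"
    done
  done < "$slaves_file"
}

# Example (prints commands only; pipe the output to sh to run them):
# print_copy_commands ./spark-0.7.2/conf/slaves sparkuser ./shark-0.7.0 ./hive-0.9.0-bin
```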
If you are doing an EC2-based setup, just use the latest AMI; it should already be set up for Shark.