© Raju Kumar Mishra 2018
Raju Kumar Mishra, PySpark Recipes, https://doi.org/10.1007/978-1-4842-3141-8_2

2. Installation

Raju Kumar Mishra
Bangalore, Karnataka, India
In the upcoming chapters, we are going to solve many problems by using PySpark. PySpark also interacts with many other big data frameworks to provide end-to-end solutions. PySpark might read data from HDFS, NoSQL databases, or a relational database management system (RDBMS). After data analysis, we can also save the results into HDFS or databases.
This chapter covers all the software installations that are required to go through this book. We are going to install all the required big data frameworks on the CentOS operating system. CentOS is an enterprise-class operating system. It is free to use and easily available. You can download CentOS from www.centos.org/download/ and then install it on a virtual machine.
This chapter covers the following recipes:
  • Recipe 2-1. Install Hadoop on a single machine
  • Recipe 2-2. Install Spark on a single machine
  • Recipe 2-3. Use the PySpark shell
  • Recipe 2-4. Install Hive on a single machine
  • Recipe 2-5. Install PostgreSQL
  • Recipe 2-6. Configure the Hive metastore on PostgreSQL
  • Recipe 2-7. Connect PySpark to Hive
  • Recipe 2-8. Install Apache Mesos
  • Recipe 2-9. Install HBase
I suggest that you install every piece of software on your own. It is a good exercise and will give you a deeper understanding of the components of each software package.

Recipe 2-1. Install Hadoop on a Single Machine

Problem

You want to install Hadoop on a single machine.

Solution

You might be thinking, Why are we installing Hadoop while we are learning PySpark? Are we going to use Hadoop MapReduce as a distributed framework for our problem solving? The answer is, Not at all. We are going to use two components of Hadoop: HDFS and YARN—HDFS for data storage and YARN as cluster manager. Installation of Hadoop requires you to download it and configure it.

How It Works

Follow these steps to complete the installation.

Step 2-1-1. Creating a New CentOS User

In this step, we’ll create a new user. You might be thinking, Why a new user? Why can’t we install Hadoop on an existing user? The reason is that we want to provide a dedicated user for all the big data frameworks. With the following lines of code, we create the user pysparkbook:
[root@localhost pyspark]# adduser pysparkbook
[root@localhost pyspark]# passwd pysparkbook
The output is as follows:
Changing password for user pysparkbook.
New password:
passwd: all authentication tokens updated successfully.
In the preceding code, you can see that the command adduser has been used to create or add a user. The Linux command passwd has been used to provide a password to our new user pysparkbook.
After creating the user, we have to add it to sudo. Sudo stands for superuser do. Using sudo, we can run any code as a super user. Sudo is used to install software.

Step 2-1-2. Adding the New User to sudo

To let pysparkbook run commands with sudo, we add it to the wheel group. Switch to the root user, add the user to the group with usermod, and then exit back:
[pyspark@localhost ∼]$ su root
[root@localhost pyspark]# usermod -aG wheel pysparkbook
[root@localhost pyspark]# exit
Then we switch to our new user pysparkbook:
[pyspark@localhost ∼]$ su pysparkbook
We will create two directories. The binaries directory under the home directory will be used to download software, and the allPySpark directory under the root (/) directory will be used to install big data frameworks:
[pysparkbook@localhost ∼]$ mkdir binaries
[pysparkbook@localhost ∼]$ sudo  mkdir /allPySpark

Step 2-1-3. Installing Java

Hadoop, Hive, Spark, and many other big data frameworks run on Java, so we will install Java first. We are going to use OpenJDK for this purpose; we’ll install version 8 of OpenJDK. We can install Java on CentOS by using the yum installer, as follows:
[pysparkbook@localhost binaries]$ sudo yum install java-1.8.0-openjdk.x86_64
After installation of any software, it is a good idea to check the installation to ensure that everything is fine.
To check the Java installation, I prefer the command java -version:
[pysparkbook@localhost binaries]$ java -version
The output is as follows:
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-b15)
OpenJDK 64-Bit Server VM (build 25.111-b15, mixed mode)
Java has been installed. Now we have to look for the environment variable JAVA_HOME, which will be used by all the distributed frameworks. After installation, JAVA_HOME can be found by using jrunscript as follows:
[pysparkbook@localhost binaries]$ jrunscript -e 'java.lang.System.out.println(java.lang.System.getProperty("java.home"));'
Here is the output:
/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre
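If jrunscript is not available, a rough alternative (assuming the java command is already on the PATH) is to resolve the symlink behind the java binary:
[pysparkbook@localhost binaries]$ readlink -f $(which java)
The output ends in /bin/java; the directory two levels above it (ending in jre here) is the value to use for JAVA_HOME.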

Step 2-1-4. Creating a Passwordless Login for pysparkbook

Use this command to create a passwordless login:
[pysparkbook@localhost binaries]$ ssh-keygen -t rsa
Here is the output:
Generating public/private rsa key pair.
Enter file in which to save the key (/home/pysparkbook/.ssh/id_rsa):
/home/pysparkbook/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/pysparkbook/.ssh/id_rsa.
Your public key has been saved in /home/pysparkbook/.ssh/id_rsa.pub.
The key fingerprint is:
fd:9a:f3:9d:b6:66:f5:29:9f:b5:a5:bb:34:df:cd:6c [email protected]
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|         .       |
|        S .      |
|           .    .|
|            . o.=|
|          .o ++OE|
|          oo.+XO*|
+-----------------+
[pysparkbook@localhost binaries]$ cat ∼/.ssh/id_rsa.pub >> ∼/.ssh/authorized_keys
[pysparkbook@localhost binaries]$ chmod 755 ∼/.ssh/authorized_keys
[pysparkbook@localhost binaries]$ ssh localhost
Here is the output:
Last login: Wed Dec 21 16:17:45 2016 from localhost
[pysparkbook@localhost ∼]$ exit
Here is the output:
logout
Connection to localhost closed.

Step 2-1-5. Downloading Hadoop

We are going to download Hadoop from the Apache website. As noted previously, we will download all the software into the binaries directory. We’ll use the wget command to download Hadoop:
[pysparkbook@localhost ∼]$ cd  binaries
[pysparkbook@localhost binaries]$ wget http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
Here is the output:
--2016-12-21 12:50:55--  http://redrockdigimark.com/apachemirror/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz
Resolving redrockdigimark.com (redrockdigimark.com)... 119.18.61.94
Connecting to redrockdigimark.com (redrockdigimark.com)|119.18.61.94|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 199635269 (190M) [application/x-gzip]
Saving to: 'hadoop-2.6.5.tar.gz'
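Optionally, you can verify that the archive arrived intact by computing its checksum and comparing it against the value published on the Apache mirror (the expected value is not reproduced here):
[pysparkbook@localhost binaries]$ sha256sum hadoop-2.6.5.tar.gz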

Step 2-1-6. Moving Hadoop Binaries to the Installation Directory

Our installation directory is allPySpark. The downloaded file, hadoop-2.6.5.tar.gz, is a compressed archive, so first we have to decompress it by using the tar command as follows:
[pysparkbook@localhost binaries]$ tar xvzf hadoop-2.6.5.tar.gz
Now we’ll move Hadoop under the allPySpark directory:
[pysparkbook@localhost binaries]$ sudo mv hadoop-2.6.5 /allPySpark/hadoop

Step 2-1-7. Modifying the Hadoop Environment File

We have to make some changes in the Hadoop environment file. This file is found in the Hadoop configuration directory. In our case, the Hadoop configuration directory is /allPySpark/hadoop/etc/hadoop/. Use the following command to open the hadoop-env.sh file and add JAVA_HOME to it:
[pysparkbook@localhost binaries]$ vim /allPySpark/hadoop/etc/hadoop/hadoop-env.sh
After opening the Hadoop environment file, add the following line:
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre

Step 2-1-8. Modifying the Hadoop Properties Files

In this step, we are concerned with three property files:
  • hdfs-site.xml: HDFS properties
  • core-site.xml: Core properties related to the cluster
  • mapred-site.xml: Properties for the MapReduce framework
These properties files are found in the Hadoop configuration directory. In the preceding chapter, we discussed HDFS. You learned that HDFS has two components: NameNode and DataNode. You also learned that HDFS uses data replication for fault-tolerance. In our hdfs-site.xml file, we are going to set the NameNode directory by using the dfs.name.dir parameter, the DataNode directory by using the dfs.data.dir parameter, and the replication factor by using the dfs.replication parameter.
Let’s modify hdfs-site.xml:
[pysparkbook@localhost binaries]$ vim /allPySpark/hadoop/etc/hadoop/hdfs-site.xml
After opening hdfs-site.xml, we have to put the following property elements inside its <configuration> element:
<property>
  <name>dfs.name.dir</name>
  <value>file:/allPySpark/hdfs/namenode</value>
  <description>NameNode location</description>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>file:/allPySpark/hdfs/datanode</value>
  <description>DataNode location</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Number of block replication</description>
</property>
After updating hdfs-site.xml, we are going to update core-site.xml. In core-site.xml, we are going to update only one property, fs.default.name. This property is used to determine the host, port, and other details of the file system. Add the following property inside the <configuration> element of core-site.xml:
<property>
  <name>fs.default.name</name>
    <value>hdfs://localhost:9746</value>
</property>
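Once the Hadoop binaries are on the PATH (after Step 2-1-9), you can confirm that this setting is picked up. A quick check, using the non-deprecated key name fs.defaultFS, looks like this:
[pysparkbook@localhost binaries]$ hdfs getconf -confKey fs.defaultFS
This should print hdfs://localhost:9746.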
Finally, we are going to modify mapred-site.xml. We will set mapreduce.framework.name, which decides which runtime framework is used; the possible values are local, classic, and yarn. First copy the template file, then open mapred-site.xml and add the property that follows:
[pysparkbook@localhost binaries]$ cp /allPySpark/hadoop/etc/hadoop/mapred-site.xml.template /allPySpark/hadoop/etc/hadoop/mapred-site.xml
[pysparkbook@localhost binaries]$ vim /allPySpark/hadoop/etc/hadoop/mapred-site.xml
 <property>
  <name>mapreduce.framework.name</name>
   <value>yarn</value>
 </property>

Step 2-1-9. Updating the .bashrc File

Next, we’ll add the following lines to the .bashrc file. Open the file:
[pysparkbook@localhost binaries]$ vim  ∼/.bashrc  
Then add the following lines:
export HADOOP_HOME=/allPySpark/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin
export PATH=$PATH:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre
export PATH=$PATH:$JAVA_HOME/bin
Then we have to source the .bashrc file. After sourcing it, the updated values are available in the current shell session.
[pysparkbook@localhost binaries]$ source ∼/.bashrc
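As a quick sanity check that the new variables are visible in the current shell, ask Hadoop for its version:
[pysparkbook@localhost binaries]$ hadoop version
The first line of the output should report Hadoop 2.6.5.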

Step 2-1-10. Running the NameNode Format

We have updated some property files. Now we have to format the NameNode so that all the changes take effect. Use the following command to format the NameNode:
[pysparkbook@localhost binaries]$ hdfs namenode -format
Here is the output:
16/12/22 02:38:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 2.6.5
.
16/12/22 02:40:59 INFO util.ExitUtil: Exiting with status 0
16/12/22 02:40:59 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
************************************************************/

Step 2-1-11. Starting Hadoop

Hadoop has been installed, and now we can start it. We can find the Hadoop starting script in /allPySpark/hadoop/sbin/. Although this script has been deprecated, we’ll use it for this example:
[pysparkbook@localhost binaries]$ /allPySpark/hadoop/sbin/start-all.sh
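Because start-all.sh is deprecated, you can equivalently start HDFS and YARN with their separate scripts (the same scripts are used again in Recipe 2-9):
[pysparkbook@localhost binaries]$ /allPySpark/hadoop/sbin/start-dfs.sh
[pysparkbook@localhost binaries]$ /allPySpark/hadoop/sbin/start-yarn.sh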

Step 2-1-12. Checking the Hadoop Installation

We know that the jps command will show all the Java processes running on the machine. Here is the command:
[pysparkbook@localhost binaries]$ jps
If everything is fine, we will see the processes running as shown here:
25720 NodeManager
25896 Jps
25625 ResourceManager
25195 NameNode
25292 DataNode
25454 SecondaryNameNode
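As an additional check, you can list the HDFS root and, assuming the default Hadoop 2.x ports, open the NameNode web UI at http://localhost:50070 and the ResourceManager web UI at http://localhost:8088:
[pysparkbook@localhost binaries]$ hadoop fs -ls /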
Congratulations! We have finally installed Hadoop on our systems.

Recipe 2-2. Install Spark on a Single Machine

Problem

You want to install Spark on a single machine.

Solution

We are going to install prebuilt Spark 2.0.0 for Hadoop version 2.6. Spark can also be built from source code, but in this book we are going to use the prebuilt Apache Spark distribution.

How It Works

Follow the steps in this section to complete the installation.

Step 2-2-1. Downloading Apache Spark

We are going to download Spark from its mirror. We’ll use the wget command as follows:
[pysparkbook@localhost binaries]$ wget  https://d3kbcqa49mib13.cloudfront.net/spark-2.0.0-bin-hadoop2.6.tgz

Step 2-2-2. Extracting a .tgz file of Spark

Use this command to extract the .tgz file:
[pysparkbook@localhost binaries]$ tar xvzf spark-2.0.0-bin-hadoop2.6.tgz

Step 2-2-3. Moving the Extracted Spark Directory to /allPySpark

Now we have to move the extracted Spark directory under the /allPySpark location. Use this command:
[pysparkbook@localhost binaries]$ sudo mv spark-2.0.0-bin-hadoop2.6 /allPySpark/spark

Step 2-2-4. Changing the Spark Environment File

The Spark environment file holds all the environment variables required to run Spark. We are going to set the following environment variables in the environment file:
  • HADOOP_CONF_DIR: Configuration directory of Hadoop
  • SPARK_CONF_DIR: Alternate configuration directory (default: ${SPARK_HOME}/conf)
  • SPARK_LOG_DIR: Stores log files (default: ${SPARK_HOME}/log)
  • SPARK_WORKER_DIR: Sets the working directory of worker processes
  • HIVE_CONF_DIR: Used to read data from Hive
First we have to copy the spark-env.sh.template file to spark-env.sh. The Spark environment file, spark-env.sh, is found inside spark/conf (the configuration directory location). Here is the command:
[pysparkbook@localhost binaries]$ cp /allPySpark/spark/conf/spark-env.sh.template /allPySpark/spark/conf/spark-env.sh
Now let’s open the spark-env.sh file:
[pysparkbook@localhost binaries]$ vim /allPySpark/spark/conf/spark-env.sh
Now append the following lines to the end of spark-env.sh:
export HADOOP_CONF_DIR=/allPySpark/hadoop/etc/hadoop/
export SPARK_LOG_DIR=/allPySpark/logSpark/
export SPARK_WORKER_DIR=/tmp/spark
export HIVE_CONF_DIR=/allPySpark/hive/conf

Step 2-2-5. Amending the .bashrc File

In the .bashrc file, we have to add the Spark bin directory. Use the following command:
[pysparkbook@localhost binaries]$ vim  ∼/.bashrc
Then add the following lines in the .bashrc file:
export SPARK_HOME=/allPySpark/spark
export PATH=$PATH:$SPARK_HOME/bin
After this, source the .bashrc file:
[pysparkbook@localhost binaries]$ source  ∼/.bashrc
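A quick check that the Spark binaries are now on the PATH is to ask for the version; the output should mention Spark 2.0.0:
[pysparkbook@localhost binaries]$ spark-submit --version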

Step 2-2-6. Starting PySpark

We can start the PySpark shell by using the pyspark script. Discussion of the pyspark script will continue in the next recipe.
[pysparkbook@localhost binaries]$ pyspark
We have completed one more successful installation. Other installations are required to move through this book, but first let’s focus on the PySpark shell.

Recipe 2-3. Use the PySpark Shell

Problem

You want to use the PySpark shell.

Solution

The PySpark shell is an interactive shell for interacting with PySpark by using Python. The PySpark shell is started by using the pyspark script, which can be found in the spark/bin directory.

How It Works

The PySpark shell can be started as follows:
[pysparkbook@localhost binaries]$ pyspark
After starting, PySpark will show the screen in Figure 2-1.
Figure 2-1. Starting up the console screen in PySpark
You can see that, after starting, PySpark displays a lot of information, including the Python version it is using as well as the PySpark version.
The >>> symbol is known to Python programmers. Whenever we start the Python shell, we get this symbol. It tells us that we can now write our Python commands. Similarly, in PySpark, this symbol tells us that we can write our Python or PySpark commands and see the results.
The PySpark shell works similarly on both a single-machine installation and a cluster installation of PySpark.
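As a small sanity check inside the shell, you can use the sc (SparkContext) and spark (SparkSession) objects that the PySpark shell creates for you; the output shown here is illustrative:
>>> sc.version
u'2.0.0'
>>> sc.parallelize([1, 2, 3, 4]).count()
4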

Recipe 2-4. Install Hive on a Single Machine

Problem

You want to install Hive on a single machine.

Solution

We discussed Hive in the first chapter. Now it is time to install Hive on our machines. We are going to read data from Hive in PySpark in upcoming chapters.

How It Works

Follow the steps in this section to complete the installation.

Step 2-4-1. Downloading Hive

We can download Hive from the Apache Hive website. We download the Hive tar.gz file by using the wget command as follows:
[pysparkbook@localhost binaries]$ wget http://www-eu.apache.org/dist/hive/hive-2.0.1/apache-hive-2.0.1-bin.tar.gz
Here is the output:
100%[=============================================>] 139,856,338  709KB/s   in 3m 5s  
2016-12-26 09:34:21 (737 KB/s) - 'apache-hive-2.0.1-bin.tar.gz' saved [139856338/139856338]

Step 2-4-2. Extracting Hive

We have downloaded apache-hive-2.0.1-bin.tar.gz, which is a .tar.gz archive, so now we have to extract it. We can extract it by using the tar command as follows:
[pysparkbook@localhost binaries]$ tar xvzf   apache-hive-2.0.1-bin.tar.gz

Step 2-4-3. Moving the Extracted Hive Directory

Move the extracted directory to the installation location:
[pysparkbook@localhost binaries]$ sudo mv apache-hive-2.0.1-bin /allPySpark/hive

Step 2-4-4. Updating hive-site.xml

Hive ships with an embedded Derby database for its metastore. By default, Derby creates the metastore files in whichever directory Hive is started from, so it is better to provide a definite location for it. We can provide that location in hive-site.xml. For that, we have to rename hive-default.xml.template to hive-site.xml.
[pysparkbook@localhost binaries]$ mv /allPySpark/hive/conf/hive-default.xml.template /allPySpark/hive/conf/hive-site.xml
Then open hive-site.xml and update the following:
[pysparkbook@localhost binaries]$ vim /allPySpark/hive/conf/hive-site.xml
You can add the following property inside hive-site.xml, or change the existing javax.jdo.option.ConnectionURL property in the file:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/allPySpark/hive/metastore/metastore_db;create=true</value>
</property>
After that, we have to add HADOOP_HOME to the hive-env.sh file, as follows:
[pysparkbook@localhost binaries]$ mv /allPySpark/hive/conf/hive-env.sh.template /allPySpark/hive/conf/hive-env.sh
[pysparkbook@localhost binaries]$ vim  /allPySpark/hive/conf/hive-env.sh
And in hive-env.sh, add the following line:
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=/allPySpark/hadoop

Step 2-4-5. Updating the .bashrc File

Open the .bashrc file. This file is in the home directory:
[pysparkbook@localhost binaries]$ vim  ∼/.bashrc
Add the following lines into the .bashrc file:
####################Hive Parameters ######################
export HIVE_HOME=/allPySpark/hive
export PATH=$PATH:$HIVE_HOME/bin
Now source the .bashrc file by using the following command:
[pysparkbook@localhost binaries]$ source ∼/.bashrc

Step 2-4-6. Creating Data Warehouse Directories of Hive

Now we have to create data warehouse directories. This data warehouse directory is used by Hive to place the data files.
[pysparkbook@localhost binaries]$hadoop fs -mkdir -p /user/hive/warehouse
[pysparkbook@localhost binaries]$hadoop fs -mkdir -p /tmp
[pysparkbook@localhost binaries]$hadoop fs -chmod g+w /user/hive/warehouse
[pysparkbook@localhost binaries]$hadoop fs -chmod g+w /tmp
The /user/hive/warehouse directory is the Hive warehouse directory.

Step 2-4-7. Initiating the Metastore Database

Sometimes it is necessary to initialize the metastore schema. You might be thinking, a schema of what? We know that Hive stores metadata of tables in a relational database. For the time being, we are going to use a Derby database as the metastore database for Hive. Then, in upcoming recipes, we are going to connect our Hive to an external PostgreSQL database. On Ubuntu, the Hive installation works without this command, but on CentOS I found it indispensable; without the following command, Hive throws errors:
[pysparkbook@localhost  binaries]$ schematool -initSchema -dbType derby

Step 2-4-8. Checking the Hive Installation

Now that Hive has been installed, we should check our work. Start the Hive shell by using the following command:
[pysparkbook@localhost binaries]$ hive
Then we will find that the Hive shell has been opened as follows:
hive>
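To confirm that the shell is working, you can run a simple query; a fresh installation lists only the default database:
hive> show databases;
OK
default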

Recipe 2-5. Install PostgreSQL

Problem

You want to install PostgreSQL.

Solution

PostgreSQL is a relational database management system developed at the University of California. The PostgreSQL license provides permission to use, modify, and distribute PostgreSQL. PostgreSQL can run on macOS and on Unix-like systems such as Red Hat and Ubuntu. We are going to install it on CentOS.
We are going to use our PostgreSQL in two ways. First, we’ll use PostgreSQL as a metastore database for Hive. After having an external database as a metastore, we will be able to easily read data from the existing Hive. Second, we are going to read data from PostgreSQL, and after analysis, we will save our result to PostgreSQL.
Installing PostgreSQL can be done from source code, but we are going to install it via the yum installer on the command line.

How It Works

Follow the steps in this section to complete the installation.

Step 2-5-1. Installing PostgreSQL

PostgreSQL can be installed using the yum installer. Here is the command:
[pysparkbook@localhost binaries]$ sudo yum install postgresql-server
[sudo] password for pysparkbook:

Step 2-5-2. Initializing the Database

To use PostgreSQL, we first need to initialize the database with the initdb utility; until the database is initialized, it cannot be used. At initialization time, we can also specify the data directory of the database. The database can be initialized using the following command:
[pysparkbook@localhost binaries]$ sudo postgresql-setup initdb
Here is the output:
[sudo] password for pysparkbook:
Initializing database ... OK

Step 2-5-3. Enabling and Starting the Database

[pysparkbook@localhost binaries]$ sudo systemctl enable postgresql
[pysparkbook@localhost binaries]$ sudo systemctl start postgresql
[pysparkbook@localhost binaries]$ sudo -i -u postgres
Here is the output:
[sudo] password for pysparkbook:
-bash-4.2$ psql
psql (9.2.18)
Type "help" for help.
postgres=#
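Two handy psql meta-commands at this prompt are \l, which lists the available databases, and \q, which quits the shell and returns you to the operating system prompt:
postgres=# \l
postgres=# \q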
Note
The installation procedure is located at the following web page:

Recipe 2-6. Configure the Hive Metastore on PostgreSQL

Problem

You want to configure the Hive metastore on PostgreSQL.

Solution

As we know, Hive puts metadata of tables in a relational database. We have already installed Hive. Our Hive installation has an embedded metastore. Hive uses the Derby relational database system for its metastore. In upcoming chapters, we will read existing Hive tables from PySpark.
Configuring the Hive metastore on PostgreSQL requires us to populate tables in the PostgreSQL database. These tables will hold the metadata of the Hive tables. After this, we have to configure the Hive property file.

How It Works

In this section, we are going to configure the Hive metastore on the PostgreSQL database. Then our Hive will have metadata in PostgreSQL.

Step 2-6-1. Downloading the PostgreSQL JDBC Connector

We need a JDBC connector so that the Hive process can connect to the external PostgreSQL. We can get a JDBC connector by using the following command:
[pysparkbook@localhost binaries]$ wget https://jdbc.postgresql.org/download/postgresql-9.4.1212.jre6.jar

Step 2-6-2. Copying the JDBC Connector to the Hive lib Directory

After getting the JDBC connector, we have to put it in the Hive lib directory:
[pysparkbook@localhost binaries]$ cp postgresql-9.4.1212.jre6.jar  /allPySpark/hive/lib/

Step 2-6-3. Connecting to PostgreSQL

Use this command to connect to PostgreSQL:
[pysparkbook@localhost binaries]$ sudo -u postgres psql

Step 2-6-4. Creating the Required User and Database

In this step, we are going to create a PostgreSQL user, pysparkBookUser. Then we are going to create a database named pymetastore. Our database is going to hold all the tables related to the Hive metastore.
First, create the user:
postgres=# CREATE USER pysparkBookUser WITH PASSWORD 'pbook';
Here is the output:
CREATE ROLE
Next, create the database:
postgres=# CREATE DATABASE pymetastore;
Here is the output:
CREATE DATABASE
The \c PostgreSQL command stands for connect. We have created our database pymetastore. Now we are going to connect to this database by using the \c command:
postgres=# \c pymetastore
You are now connected to the pymetastore database. You can see more PostgreSQL commands at www.postgresql.org/docs/9.0/static/app-psql.html.

Step 2-6-5. Populating Data in the pymetastore Database

Hive possesses its own PostgreSQL scripts to populate tables for the metastore. The \i command reads commands from a PostgreSQL script and executes them. The following command runs the hive-txn-schema-2.0.0.postgres.sql script, which will create all the tables required for the Hive metastore:
pymetastore=# \i /allPySpark/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.0.0.postgres.sql
Here is the output:
psql:/allPySpark/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.0.0.postgres.sql:30: NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "txns_pkey" for table "txns"
CREATE TABLE
CREATE TABLE
INSERT 0 1
psql:/allPySpark/hive/scripts/metastore/upgrade/postgres/hive-txn-schema-2.0.0.postgres.sql:69: NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "hive_locks_pkey" for table "hive_locks"
CREATE TABLE

Step 2-6-6. Granting Permissions

The following commands will grant some permissions:
pymetastore=# grant select, insert,update,delete on public.txns to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.txn_components to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_txn_components   to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.next_txn_id to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.hive_locks to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.next_lock_id to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.compaction_queue to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.next_compaction_queue_id to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.completed_compactions to pysparkBookUser;
GRANT
pymetastore=# grant select, insert,update,delete on public.aux_table to pysparkBookUser;
GRANT

Step 2-6-7. Changing the pg_hba.conf File

Remember that in order to update pg_hba.conf, you are supposed to be the root user. So first become the root user. Then open the pg_hba.conf file:
[root@localhost binaries]# vim /var/lib/pgsql/data/pg_hba.conf
Then change all the peer and ident entries to trust:
#local   all             all                                     peer
local    all             all                                     trust
# IPv4 local connections:
#host    all             all             127.0.0.1/32            ident
host     all             all             127.0.0.1/32            trust
# IPv6 local connections:
#host    all             all             ::1/128                 ident
host     all             all             ::1/128                 trust
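For the changed authentication rules to take effect, restart (or reload) the PostgreSQL service while you are still the root user:
[root@localhost binaries]# systemctl restart postgresql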
Then come out of the root user.

Step 2-6-8. Testing Our User

Next, we’ll test that we can connect to our database as the newly created user:
[pysparkbook@localhost binaries]$ psql -h localhost -U pysparkbookuser -d pymetastore
Here is the output:
psql (9.2.18)
Type "help" for help.
pymetastore=>

Step 2-6-9. Modifying Our hive-site.xml

We can modify the Hive-related configuration in the configuration file hive-site.xml. We have to modify the following properties:
  • javax.jdo.option.ConnectionURL: Connection URL to the database
  • javax.jdo.option.ConnectionDriverName: Connection JDBC driver name
  • javax.jdo.option.ConnectionUserName: Database connection user
  • javax.jdo.option.ConnectionPassword: Connection password
Either modify these properties or add the following lines at the end of the Hive property file to get the required result:
<property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:postgresql://localhost/pymetastore</value>
      <description>postgreSQL server metadata store</description>
 </property>
 <property>
      <name>javax.jdo.option.ConnectionDriverName</name>
      <value>org.postgresql.Driver</value>
      <description>Driver class of postgreSQL</description>
 </property>
  <property>
      <name>javax.jdo.option.ConnectionUserName</name>
      <value>pysparkbookuser</value>
      <description>User name to connect to postgreSQL</description>
 </property>
 <property>
      <name>javax.jdo.option.ConnectionPassword</name>
      <value>pbook</value>
      <description>password for connecting to PostgreSQL server</description>
 </property>

Step 2-6-10. Starting Hive

We have connected Hive to an external relational database management system. So now it is time to start Hive and check that everything is fine. First, start Hive:
[pysparkbook@localhost binaries]$ hive
Our activities will be reflected in PostgreSQL. Let’s create a database and a table inside the database. We’ll create the database apress and the table apressBooks via the following commands:
hive> create database apress;
Here is the output:
OK
Time taken: 1.397 seconds
hive> use apress;
Here is the output:
OK
Time taken: 0.07 seconds
hive> create table apressBooks (
    >      bookName String,
    >      bookWriter String
    >      )
    >      row format delimited
    >      fields terminated by ',';
Here is the output:
OK
Time taken: 0.581 seconds

Step 2-6-11. Testing Creation of Metadata in PostgreSQL

The created database and table will be reflected in PostgreSQL. We can see the updated data in the TBLS table as follows:
pymetastore=> SELECT * from "TBLS";
TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME |    OWNER    | RETENTION | SD_ID |  TBL_NAME   |
   TBL_TYPE    | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT
--------+-------------+-------+------------------+-------------+-----------+-------+-------------+
---------------+--------------------+--------------------
     1 |  1482892229 |     6 |                0 | pysparkbook |         0 |    1 | apressbooks |
 MANAGED_TABLE |                    |
(1 row)
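You can similarly confirm that the apress database was recorded. Assuming the standard metastore schema, the DBS table holds one row per Hive database:
pymetastore=> SELECT "DB_ID", "NAME" FROM "DBS";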
The significant work needed to connect Hive to an external database is done. In the following recipe, we are going to connect PySpark to Hive.

Recipe 2-7. Connect PySpark to Hive

Problem

You want to connect PySpark to Hive.

Solution

PySpark needs the Hive property file to know the configuration parameters of Hive. The Hive property file, hive-site.xml, stays in the Hive configuration directory. Copy the Hive property file to the Spark configuration directory. Then we will be finished and we can start PySpark.

How It Works

Connecting PySpark to Hive takes two steps.

Step 2-7-1. Copying the Hive Property File to the Spark Conf Directory

Use this command to copy the Hive property file:
[pysparkbook@localhost binaries]$ cp /allPySpark/hive/conf/hive-site.xml /allPySpark/spark/conf/

Step 2-7-2. Starting PySpark

Use this command to start PySpark:
[pysparkbook@localhost binaries]$pyspark
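Once the shell is up, a quick way to confirm that PySpark can see the Hive metastore is to list the Hive databases from the SparkSession; the listing should include the apress database created in Recipe 2-6 along with default:
>>> spark.sql("show databases").show()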

Recipe 2-8. Install Apache Mesos

Problem

You want to install Apache Mesos.

Solution

Installing Apache Mesos requires downloading the source code and then building and configuring it.

How It Works

Follow the steps in this section to complete the installation.

Step 2-8-1. Downloading Apache Mesos

Use this command to obtain Apache Mesos:
[pysparkbook@localhost binaries]$ wget http://www.apache.org/dist/mesos/1.1.0/mesos-1.1.0.tar.gz
The output is as follows:
--2016-12-28 08:15:14--  http://www.apache.org/dist/mesos/1.1.0/mesos-1.1.0.tar.gz
Resolving www.apache.org (www.apache.org)... 88.198.26.2, 140.211.11.105, 2a01:4f8:130:2192::2
Connecting to www.apache.org (www.apache.org)|88.198.26.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41929556 (40M) [application/x-gzip]
Saving to: 'mesos-1.1.0.tar.gz.1'
100%[=============================================>] 41,929,556   226KB/s   in 58s    
2016-12-28 08:16:23 (703 KB/s) - 'mesos-1.1.0.tar.gz.1' saved [41929556/41929556]

Step 2-8-2. Extracting Mesos from .tar.gz

To extract Mesos, use this command:
[pysparkbook@localhost binaries]$ tar xvzf mesos-1.1.0.tar.gz

Step 2-8-3. Installing Repo to Install Maven

To install Maven, we first need to install the repo:
[pysparkbook@localhost binaries]$ sudo bash -c 'cat > /etc/yum.repos.d/wandisco-svn.repo <<EOF
> [WANdiscoSVN]
> name=WANdisco SVN Repo 1.9
> enabled=1
> baseurl=http://opensource.wandisco.com/centos/7/svn-1.9/RPMS/$basearch/
> gpgcheck=1
> gpgkey=http://opensource.wandisco.com/RPM-GPG-KEY-WANdisco
> EOF'

Step 2-8-4. Installing Dependencies of Maven

It is time to install the dependencies required to install Maven:
[pysparkbook@localhost binaries]$ sudo yum install -y apache-maven python-devel java-1.8.0-openjdk-devel zlib-devel libcurl-devel openssl-devel cyrus-sasl-devel cyrus-sasl-md5 apr-devel subversion-devel apr-util-devel

Step 2-8-5. Downloading Apache Maven

Now we’re ready to download Maven:
[pysparkbook@localhost binaries]$ wget http://www-us.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
Here is the output:
100%[===================================================>] 8,491,533    274KB/s   in 87s    
2016-12-28 23:47:40 (95.5 KB/s) - 'apache-maven-3.3.9-bin.tar.gz' saved [8491533/8491533]

Step 2-8-6. Extracting the Maven Directory

As with other software, we have to extract and move the Maven directory:
[pysparkbook@localhost binaries]$ tar -xvzf apache-maven-3.3.9-bin.tar.gz
[pysparkbook@localhost binaries]$ sudo mv apache-maven-3.3.9 /allPySpark/maven
We then have to link the mvn file:
[pysparkbook@localhost binaries]$ sudo ln -s /allPySpark/maven/bin/mvn /usr/bin/mvn
[sudo] password for pysparkbook:

Step 2-8-7. Checking the Maven Installation

The best way to check our installation is to run the version command:
[pysparkbook@localhost binaries]$ mvn -version
Here is the output:
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T22:11:47+05:30)
Maven home: /allPySpark/maven
Java version: 1.8.0_111, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.10.0-514.2.2.el7.x86_64", arch: "amd64", family: "unix"

Step 2-8-9. Configuring Mesos

We have to create a build directory inside the extracted Mesos directory and move into it. From there, we run the configure script:
[pysparkbook@localhost build]$ ../configure
Note
You may get errors as follows during the build:
make[2]: *** [../3rdparty/protobuf-2.6.1/python/dist/protobuf-2.6.1-py2.7.egg] Error 1
make[2]: Leaving directory '/allPySpark/mesos/build/src'
make[1]: *** [all] Error 2
make[1]: Leaving directory '/allPySpark/mesos/build/src'
make: *** [all-recursive] Error 1
If you do get these errors, you need to perform the following installations and upgrade pytz:
[pysparkbook@localhost build]$sudo yum install python-setuptools python-setuptools-devel
[pysparkbook@localhost build]$sudo easy_install pip
[pysparkbook@localhost build]$sudo pip install --upgrade pytz

Step 2-8-10. Running Make

[pysparkbook@localhost build]$ make
Here is the tail end of the output:
Installing build/bdist.linux-x86_64/wheel/mesos.scheduler-1.1.0-py2.7-nspkg.pth
running install_scripts
creating build/bdist.linux-x86_64/wheel/mesos.scheduler-1.1.0.dist-info/WHEEL
make[2]: Leaving directory '/allPySpark/mesos/build/src'
make[1]: Leaving directory '/allPySpark/mesos/build/src'

Step 2-8-11. Running make install

Use this command to install Mesos:
[pysparkbook@localhost build]$make install

Step 2-8-12. Starting Mesos Master

After successful installation of Mesos, we can start the master by using the following command:
[pysparkbook@localhost build]$ mesos-master --work_dir=/allPySpark/mesos/workdir

Step 2-8-13. Starting Mesos Slaves

Use this command to start the slave on the same machine:
[root@localhost binaries]#mesos-slave --master=127.0.0.1:5050 --work_dir=/allPySpark/mesos/workdir --systemd_runtime_directory=/allPySpark/mesos/systemd
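To confirm that the master is up, you can open the Mesos web UI at http://localhost:5050 in a browser, or simply check that the port answers HTTP requests (a rough check; a 200 response means the UI is being served):
[pysparkbook@localhost binaries]$ curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:5050
200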
In an upcoming chapter, you will see how to start the PySpark shell on Mesos.

Recipe 2-9. Install HBase

Problem

You want to install Apache HBase.

Solution

HBase is a NoSQL database, as we discussed in Chapter 1. We are going to install HBase, and later we will read data from this installation by using Spark. We will go for the simplest installation of HBase: we download HBase from the HBase website (https://hbase.apache.org/) and then configure it.

How It Works

Follow the steps in this section to complete the installation.

Step 2-9-1. Obtaining HBase

Use the following command to download HBase in our binaries directory:
[pysparkbook@localhost binaries]$ wget http://www-eu.apache.org/dist/hbase/stable/hbase-1.2.4-bin.tar.gz

Step 2-9-2. Extracting HBase

[pysparkbook@localhost binaries]$ tar xzf  hbase-1.2.4-bin.tar.gz
[pysparkbook@localhost binaries]$ sudo mv hbase-1.2.4 /allPySpark/hbase

Step 2-9-3. Updating the HBase Environment File

HBase also looks for JAVA_HOME. So we’ll update the HBase environment file with JAVA_HOME:
[pysparkbook@localhost binaries]$ vim /allPySpark/hbase/conf/hbase-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.111-2.b15.el7_3.x86_64/

Step 2-9-4. Creating the HBase Directory in HDFS

In a single-machine installation, HBase can keep its data on the local filesystem. But to make the workings of HBase clearer, we are going to create a directory on HDFS for HBase. First, we have to start HDFS and YARN. Then use the following command to create a directory in HDFS:
[pysparkbook@localhost binaries]$/allPySpark/hadoop/sbin/start-dfs.sh
[pysparkbook@localhost binaries]$/allPySpark/hadoop/sbin/start-yarn.sh
[pysparkbook@localhost binaries]$hadoop fs -mkdir /hbase

Step 2-9-5. Updating the HBase Property File and .bashrc

Let’s start with updating the property file, and then we will update the .bashrc file. In the HBase property file, we are going to set hbase.rootdir. The HBase property file stays in the HBase configuration directory. For us, the HBase configuration directory is /allPySpark/hbase/conf.
[pysparkbook@localhost binaries]$ vim /allPySpark/hbase/conf/hbase-site.xml
Now add the following lines in the hbase-site.xml file:
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://localhost:9746/hbase</value>
</property>
It is time to update .bashrc, as shown here:
[pysparkbook@localhost binaries]$vim ∼/.bashrc
Add the following lines in .bashrc:
export HBASE_HOME=/allPySpark/hbase
export PATH=$PATH:$HBASE_HOME/bin
Then source the .bashrc file:
[pysparkbook@localhost binaries]$ source ∼/.bashrc

Step 2-9-6. Starting HBase and the HBase Shell

Finally, start HBase and open the HBase shell:
[pysparkbook@localhost binaries]$ /allPySpark/hbase/bin/start-hbase.sh
[pysparkbook@localhost binaries]$ /allPySpark/hbase/bin/hbase shell
Here is the output:
hbase(main):001:0>
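At this prompt, a couple of simple commands confirm that HBase is up: status reports the running servers, and list shows the (initially empty) set of user tables:
hbase(main):001:0> status
hbase(main):002:0> list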