Setting up Hadoop

Before we start working with Hadoop and Hunk, we are going to download and run a virtual machine (VM). This section briefly describes how to get everything up and running and how to load some data for later processing.

Starting and using a virtual machine with CDH5

We have decided to take the default Cloudera CDH 5.3.1 VM from the Cloudera site and fine-tune it for our needs. Please open this link to prepare a VM: http://www.bigdatapath.com/2015/08/learning-hunk-links-to-vm-with-all-stuff-you-need/.

This post may have been updated by the time you read this book.

SSH user

You can run the terminal application by clicking its icon on the top bar.

Your user is cloudera. sudo is passwordless:

[cloudera@quickstart ~]$ whoami
cloudera
[cloudera@quickstart ~]$ sudo su
[root@quickstart cloudera]# whoami
root
[root@quickstart cloudera]#

MySQL

MySQL serves as the data source in our data ingestion example. The user name is dwhuser and the password is also dwhuser. For root access, use the root user name with the password cloudera:

[cloudera@quickstart ~]$ mysql -u root -p

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| cdrdb              |
| cm                 |
| firehose           |
| hue                |
| metastore          |
| mysql              |
| oozie              |
| retail_db          |
| sentry             |
+--------------------+
10 rows in set (0.00 sec)

We will import data into Hadoop from the MySQL database named cdrdb. The other databases are used by Cloudera Manager services and by Hadoop components such as Hive Metastore and Oozie.
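
Before importing anything, it can help to look at cdrdb with the non-root account. The following is a minimal sketch, assuming the dwhuser account is allowed to read cdrdb, which the text implies but does not show. The names DB_USER, DB_NAME, and list_tables are illustrative and are not part of the VM:

```shell
# Credentials and database name as given in the text above.
DB_USER="dwhuser"
DB_NAME="cdrdb"

# Hypothetical helper: list the tables in the source database.
# -p without a value makes the client prompt for the password;
# -e runs a single statement and exits.
list_tables() {
  mysql -u "$DB_USER" -p "$DB_NAME" -e 'SHOW TABLES;'
}
```

After pasting this into the terminal, run list_tables and type the dwhuser password at the prompt. The table names it prints depend on how the VM was prepared, so no sample output is shown here.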

Tip

Hive Metastore is a service designed to centralize metadata management. It plays a role similar to Teradata's DBC.Table and DBC.Columns or IBM DB2's syscat.Tables and syscat.Columns. The idea is to define a strict schema over the bytes stored in Hadoop and then access that data using SQL.

Oozie is a kind of Hadoop CRON without a Single Point of Failure (SPOF). Think it through: is it easy to build a distributed, reliable CRON with failover functionality? Oozie uses an RDBMS to persist metadata about planned, running, and finished tasks. This VM doesn't provide an Oozie HA configuration.
