We have already learned about setting up the Hadoop and HBase clusters. Now, we will learn about the aspects we need to consider to maintain the cluster and keep it up and running. This chapter will help readers make their HBase cluster more reliable by making it highly available.
In this chapter, we will concentrate on the operational part of HBase. We will discuss the following topics:
As HBase runs on top of Hadoop, before starting with the HBase administration, let's look at Hadoop administration tasks and aspects in brief.
Here is the list of available Hadoop shell commands and steps on how to use them.
The hadoop binary is present inside the bin directory. We can call the following command if we need to know all the commands available:
<Hadoop directory path>bin/hadoop
In versions prior to Hadoop 2, we can use the preceding command. However, in later versions, we have to use the following command:
<Hadoop directory path>bin/hdfs
Running the binary without any parameter will display the list of available commands. We can check the actual implementation of the Hadoop shell and its Java source at https://github.com/shot/hadoop-source-reading/blob/master/src/core/org/apache/hadoop/fs/FsShell.java.
Let's take a look at the Hadoop shell commands. However, first we will look at the generic options available with the aforementioned bin/hadoop
and bin/hdfs
. The following is the syntax:
hdfs [--config <configuration dir>] [command] [generic_options] [command_options]
The following table explains the parameters of the preceding command:

Parameter | Description
---|---
--config <configuration dir> | This overwrites the default configuration directory ($HADOOP_CONF_DIR)
command | This is the command to run, for example, balancer or dfsadmin
generic_options | These are common options supported by multiple commands, such as -conf <configuration file>, -D <property=value>, -fs <namenode:port>, -files, -libjars, and -archives
command_options | These are the options specific to the command being run
We categorized Hadoop shell commands into the following three types:
Let's explore the commands under the aforementioned types.
The following is the list of administration commands:
balancer
: Using this command, we can balance data distribution throughout the cluster. Sometimes, a few of the DataNodes become overloaded when write operations happen at a fast pace. This might also happen when a new DataNode is added and is still underutilized. We can stop this command at any time using Ctrl + C. The syntax for this command is as follows:
hdfs balancer [-threshold <threshold value>]
The following is the example:
hdfs balancer -threshold 20
The balancer process is iterative. The threshold is a percentage in the range of 1 to 100. The balancer tries to equalize data usage across all the DataNodes and to keep each node's utilization within the range [average - threshold, average + threshold].
The smaller the value of the given threshold, the more balanced the cluster will be.
While balancing the cluster, the balancer can use a lot of network bandwidth. We can control this using another administration command, dfsadmin -setBalancerBandwidth <bandwidth>, so that the balancer uses at most the specified bandwidth. This should be set to prevent read/write exceptions during cluster operation. The same limit can also be set through the dfs.balance.bandwidthPerSec parameter (value in bytes per second) in the Hadoop configuration file, or changed at runtime using the dfsadmin command.
The balancer first picks DataNodes with disk usage above the higher threshold (seen as over-utilized DataNodes) and tries to find blocks from these DataNodes to be copied to underutilized DataNodes. In the next round, it selects DataNodes that are over-utilized and moves blocks to nodes whose utilization is below average. A further round chooses nodes with utilization above average to move data to underutilized nodes.
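The node-selection rule above can be sketched as a small model. This is an illustrative simplification (the function name and the node usages are invented for the example), not the actual balancer implementation:

```python
def classify_nodes(usage_pct, threshold):
    """Classify DataNodes against the [avg - threshold, avg + threshold] band.

    usage_pct: dict mapping node name -> disk usage percentage (invented data).
    Returns (cluster average, over-utilized nodes, under-utilized nodes).
    """
    avg = sum(usage_pct.values()) / len(usage_pct)
    over, under = [], []
    for node, used in sorted(usage_pct.items()):
        if used > avg + threshold:
            over.append(node)      # source: blocks are copied away from here
        elif used < avg - threshold:
            under.append(node)     # target: blocks are copied to here
    return avg, over, under

avg, over, under = classify_nodes(
    {"dn1": 90.0, "dn2": 55.0, "dn3": 35.0, "dn4": 20.0}, threshold=20)
# avg is 50.0, so dn1 (90 > 70) is over-utilized and dn4 (20 < 30) is under-utilized
```

A smaller threshold narrows the band, so more nodes fall outside it and more block movement is needed, which is why a small threshold yields a more balanced cluster at the cost of longer balancing runs.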
For more details on balancer (flow, architecture, and administration), visit https://issues.apache.org/jira/browse/HADOOP-1652. Here, PDF files are available on the balancer architecture; you can also visit http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#balancer.
daemonlog
: This command is used to set the logging level of each Hadoop daemon process. It comes in handy when we debug a problem with Hadoop, as it can be used to increase or decrease the log level for debugging purposes. This log-level modification can also be done through configuration or the Hadoop daemon web pages; however, it is better that an administrator does it through the command line. The command accepts two parameters, -getlevel and -setlevel: -getlevel is used to get the current log level, and -setlevel is used to set it.
The following is the syntax to get the log level information:
-getlevel <host:port> <name>
The preceding command gets the log level information of the daemon processes running at the specified host and port by internally connecting to http://<host>:<port>/logLevel?log=<name>
.
The <host> and <port> parameters specify the host and the port on which the daemon's HTTP service is running.
The <name> parameter specifies the logger whose level we want to get. This is typically the fully qualified class name of the daemon performing the logging.
An example of it is org.apache.hadoop.mapred.JobTracker for the JobTracker daemon.
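Since daemonlog is essentially a thin HTTP client for the /logLevel servlet described above, the URL it requests can be sketched as follows (loglevel_url is a hypothetical helper written for illustration, not the real client code):

```python
def loglevel_url(host, port, name, level=None):
    """Build the /logLevel URL the daemonlog command connects to.

    With level=None this models -getlevel; passing a level models -setlevel.
    """
    url = f"http://{host}:{port}/logLevel?log={name}"
    if level is not None:
        url += f"&level={level}"   # -setlevel adds the target level
    return url

print(loglevel_url("nn1", 50070, "org.apache.hadoop.mapred.JobTracker"))
# http://nn1:50070/logLevel?log=org.apache.hadoop.mapred.JobTracker
```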
The following is the syntax to set the log level:
-setlevel <host:port> <name> <level>
The preceding command sets the log level of the daemon running at the specified host by internally connecting to http://<host>:<port>/logLevel?log=<name>
.
The <host> and <port> parameters specify the host and the port on which the daemon's HTTP service is running. The <name> parameter specifies the logger on which to set the log level, and the <level> parameter specifies the log level to set.
The following command is an example of how to get the log level:
hdfs daemonlog -getlevel host:<port> org.apache.hadoop.mapred.JobTracker
The following command is an example of how to set the log level:
hdfs daemonlog -setlevel host:<port> org.apache.hadoop.mapred.JobTracker <ERROR or DEBUG>
You can also find the description of the daemonlog
command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#daemonlog.
datanode
: This command is used to start or stop the DataNode daemon process. The following is the syntax:

hdfs datanode [-rollback]
The rollback option helps to roll back DataNode to the previous version. If the upgrade process is in progress and something goes wrong, we need to restore the DataNode metadata to the previous existing version. If the command is specified without any parameter, it will start the DataNode daemon, if it's not already running.
You can also find the description of the datanode
command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#datanode.
dfsadmin
: This command runs the dfsadmin
client for the Hadoop cluster to perform administration commands. We can check the actual implementation in Java at https://github.com/facebook/hadoop-20/blob/master/src/hdfs/org/apache/hadoop/hdfs/tools/DFSAdmin.java. The following is the syntax:hdfs dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota><dirname>] [-clrQuota <dirname>......<dirname>] [-help [cmd]] [-restoreFailedStorage true|false|check]
The following list explains the different parameters of this command. We already discussed the generic options earlier, and they remain the same here.
report
: This parameter displays the basic status of the cluster and the HDFS file system. For example, have a look at the following command:

hdfs dfsadmin -report
The following is what you will get as output:
safemode
: Safe mode is the state of NameNode in which HDFS is read-only; NameNode does not accept changes to the namespace, and blocks are neither replicated nor deleted. Hadoop enters this state during startup, while it loads and updates the metadata. This command parameter takes the options -safemode <enter | leave | get | wait>, where enter puts NameNode into safe mode, leave forces Hadoop to come out of safe mode explicitly, get reports whether NameNode is currently in safe mode, and wait blocks until NameNode comes out of safe mode.
If you force Hadoop to come out of safe mode, you are asking it to proceed without finishing the metadata update, and this can lead to data corruption. However, if it is really necessary to force Hadoop to leave safe mode, first verify and check the NameNode logs to see why it is stuck in safe mode.
Hadoop enters safe mode automatically at startup, and it leaves safe mode by itself once the minimum percentage of blocks satisfying the replication condition (based on the replication factor) has been reached.
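The exit condition can be modelled as a one-line check. The default threshold of 0.999 used here is an assumption based on the dfs.namenode.safemode.threshold-pct property (dfs.safemode.threshold.pct in older releases), and the function itself is purely illustrative:

```python
def can_leave_safemode(safe_blocks, total_blocks, threshold_pct=0.999):
    """Model of the safe-mode exit check: NameNode may leave safe mode once
    the fraction of blocks meeting minimal replication reaches the threshold.
    """
    if total_blocks == 0:
        return True
    return safe_blocks / total_blocks >= threshold_pct

print(can_leave_safemode(9990, 10000))  # exactly at 99.9%
print(can_leave_safemode(9989, 10000))  # still below the threshold
```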
NameNode can also be put into safe mode manually, but then it can only be taken out of safe mode manually as well.
This parameter of dfsadmin
can be used as follows:
hdfs dfsadmin -safemode <enter | leave | get | wait>
Let's see one example:
hdfs dfsadmin -safemode get
The following screenshot shows the output of the preceding command:
refreshNodes
: This parameter makes Hadoop re-read the hosts and exclude files in order to update the set of DataNodes that are allowed to connect to the NameNode and the set of DataNodes that should be or are already decommissioned. For example, have a look at the following command:

hdfs dfsadmin -refreshNodes
finalizeUpgrade
: When we issue the dfsadmin command with this parameter, it makes an upgrade permanent. It does so by deleting the previous version of the directories on the DataNodes and NameNode. This completes the upgrade process, after which a downgrade is no longer possible.
upgradeProgress
: This parameter fetches the information on the Hadoop upgrade process. It has three options: status, details, and force.
metasave
: This parameter saves the NameNode's primary data structures to a file. The file contains one line for each of the following: DataNodes heartbeating with NameNode, blocks waiting to be replicated, blocks currently being replicated, and blocks waiting to be deleted.
setQuota
: This parameter is used to set the quota for each directory; the quota is a long integer value that puts a hard limit on the number of names in the directory tree. It reports errors if one of the following is true: the user is not an administrator, the quota is not a positive integer, the directory does not exist or is a file, or the directory would immediately exceed the new quota.
clrQuota
: This parameter clears the quota for each directory. An error is reported if one of the following is true: the directory does not exist or is a file, or the user is not an administrator. clrQuota does not fault if the directory has no quota.
help
: This displays the help for all the commands.
restoreFailedStorage
: This parameter turns automatic attempts to restore failed storage replicas on or off. If a failed storage location comes online again, the system will attempt to restore edits and/or fsimage during a checkpoint. This parameter has the options true, false, and check; the check option returns the current setting.
command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#dfsadmin.
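To make the name-quota rule of setQuota concrete, here is a toy model that counts names (files plus directories, where a directory counts as a name itself) in a nested-dict file tree. The helper functions and all names in the example tree are invented for illustration; this is not Hadoop code:

```python
def count_names(tree):
    """Count names in a directory tree.

    tree: dict representing a directory (entry name -> subtree);
    a value of None represents a file. The directory itself counts as one name.
    """
    total = 1  # the directory itself
    for child in tree.values():
        total += count_names(child) if isinstance(child, dict) else 1
    return total

def within_quota(tree, quota):
    """A name quota is a hard limit on the number of names in the tree."""
    return count_names(tree) <= quota

fs = {"logs": {"a.log": None, "b.log": None}, "data": {}}
print(count_names(fs))      # 5: root + logs + 2 files + data
print(within_quota(fs, 4))  # False: 5 names exceed a quota of 4
```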
mradmin
: This command runs a MapReduce admin client. The following is the syntax:

hadoop mradmin [GENERIC_OPTIONS] [-refreshqueueacls]
The -refreshqueueacls
parameter refreshes the queue ACLs used by Hadoop to check access during submissions of the job by the user. The properties present in mapred-queue-acls.xml
are reloaded by the queue manager.
Some other options of this command are -refreshServiceAcl, which reloads the service-level authorization policy file; -refreshNodes, which refreshes the hosts information at JobTracker; and -help [cmd], which displays help for the given command.
You can also find the description of the mradmin
command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#mradmin.
jobtracker
: This command runs the MapReduce JobTracker node if the daemon is not already started. The following is the syntax:

hadoop jobtracker [-dumpConfiguration]
The -dumpConfiguration
option dumps the configuration used by JobTracker, along with the queue configuration in JSON format, into a standard output used by JobTracker, and then exits.
You can also find the description of the jobtracker
command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#jobtracker.
namenode
: This command runs a NameNode instance. The following is the syntax:

hadoop namenode [-format] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint]

The following list describes the parameters of this command: -format formats the NameNode (it starts the NameNode, formats it, and then shuts it down); -upgrade starts the NameNode with the upgrade option after a new Hadoop version has been distributed; -rollback rolls the NameNode back to the previous version (stop the cluster and distribute the older Hadoop version first); -finalize removes the previous state of the file system and makes the most recent upgrade permanent, after which a rollback is no longer available, and then shuts the NameNode down; -importCheckpoint loads the image from a checkpoint directory (specified by fs.checkpoint.dir) and saves it into the current one.
secondarynamenode
: This command starts the secondary NameNode instance. The following is the syntax:

hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]

The following list explains the parameters of this command: -checkpoint performs a checkpoint on the secondary NameNode if the EditLog size is greater than or equal to fs.checkpoint.size; if force is specified, it checkpoints regardless of the EditLog size. -geteditsize prints the current EditLog size.
You can also find the description of this command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#secondarynamenode.
tasktracker
: This starts the TaskTracker node; the syntax for this is as follows:

hadoop tasktracker
You can also find the description of this command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#tasktracker.
The following is the list of user commands:
archive
: This command is used to create Hadoop archive files. Its syntax is as follows:

hadoop archive -archiveName NAME <src> <dest>
distcp
: This command is used to copy files from one cluster to another, or to a different location on the same cluster. It uses a MapReduce job to copy files in parallel. Its syntax is:

hadoop distcp <source url> <destination url>
Have a look at the following example:
hadoop distcp hdfs://hadoop1:9000/files hdfs://hadoop2:9000/filesdir
fs
: Instead of this command, we use hdfs dfs, which we will discuss in the next section with all of its various options.
fsck
: This command is used to find inconsistencies in HDFS. It reports problems with various files, for example, missing blocks for a file or under-replicated blocks. This is not a Hadoop shell command. It can be run as:

hdfs fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
The following is the description of the parameters of this command:

Command options | Description
---|---
<path> | This is the path to start checking from
-move | This moves corrupted files to /lost+found
-delete | This deletes corrupted files
-openforwrite | This prints out the files opened for write
-files | This prints out the files being checked
-list-corruptfileblocks | This prints the list of missing blocks and the files they belong to
-blocks | This prints out the block report
-locations | This prints out the locations of every block
-racks | This prints out the network topology for the DataNode locations
By default, the fsck
command ignores files opened for write; we can use -openforwrite
to report such files. They are generally tagged CORRUPT
or HEALTHY
depending on their block allocation status.
The following screenshot shows the output of fsck
:
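The per-block classification that fsck performs can be summarized with a small illustrative function. The labels mirror the report's terminology, but the helper itself is hypothetical and not part of Hadoop:

```python
def block_health(expected_replicas, live_replicas):
    """Toy view of what fsck reports for a single block: a block with no
    live replicas is missing, and one with fewer live replicas than the
    file's replication factor is under-replicated.
    """
    if live_replicas == 0:
        return "MISSING"
    if live_replicas < expected_replicas:
        return "UNDER-REPLICATED"
    return "HEALTHY"

print(block_health(3, 0))  # MISSING
print(block_health(3, 2))  # UNDER-REPLICATED
print(block_health(3, 3))  # HEALTHY
```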
fetchdt
: This retrieves delegation tokens from NameNode. Authentication is a two-party authentication protocol based on Java SASL Digest-MD5. The token is obtained during job submissions and submitted to JobTracker as part of the job submission. Find more details at http://hortonworks.com/wp-content/uploads/2011/10/security-design_withCover-1.pdf. The following is the syntax:
fetchdt <opts> <token file>
The main command option is --webservice <namenode_http_addr>, which makes fetchdt retrieve the token over HTTP from the NameNode web interface instead of using RPC.
jar
: This runs a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using this command, the syntax of which is as follows:

hadoop jar <jar> [mainClass] arguments
The following is an example of this command:
hadoop jar hadoop-mapreduce-examples-*.jar pi 20 20
job
: This is used to submit and interact with MapReduce jobs. The following is the syntax:

hadoop job [GENERIC_OPTIONS] [-submit <job-file>] [-status <job-id>] [-counter <job-id> <group-name> <counter-name>] [-kill <job-id>] [-set-priority <job-id> <priority>] [-events <job-id> <from-event-#> <#-of-events>] [-history <jobOutputDir>] [-list [all]] [-list-active-trackers] [-list-blacklisted-trackers] [-list-attempt-ids <job-id> <task-type> <task-state>] [-kill-task <task-id>] [-fail-task <task-id>]

The following list guides you through the command options: -submit submits the job; -status prints the map and reduce completion percentages and all job counters; -counter prints the value of a single counter; -kill kills the job; -set-priority changes the priority of the job (allowed values are VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW); -events prints the events received by JobTracker for the given range; -history prints the job details along with failed and killed task details; -list displays the jobs that are yet to complete, while -list all displays all jobs; -list-active-trackers and -list-blacklisted-trackers list the active and blacklisted TaskTrackers; -list-attempt-ids lists the task attempt IDs for the given job, task type, and task state; -kill-task kills the given task attempt; and -fail-task fails the given task attempt, which is counted against the job's failed attempts.
pipes
: This command enables Hadoop to run MapReduce code written in C++. This library is supported on 32-bit Linux installations. The following is the syntax:

hadoop pipes [-conf <path>] [-jobconf <key=value>, <key=value>, ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>] [-program <executable>] [-reduces <num>]
The following are the descriptions of the command options:

Command options | Description
---|---
-conf <path> | This specifies the configuration for the job
-jobconf <key=value>, ... | This adds or overrides the configuration for the job
-input <path> | This specifies the input directory
-output <path> | This specifies the output directory
-jar <jar file> | This specifies the JAR filename
-inputformat <class> | This specifies the InputFormat class
-map <class> | This specifies the Java Map class
-partitioner <class> | This specifies the Java Partitioner class
-reduce <class> | This specifies the Java Reduce class
-writer <class> | This specifies the Java RecordWriter class
-program <executable> | This specifies the URI of the executable
-reduces <num> | This specifies the number of reduces
version
: This displays the Hadoop version. Its syntax is as follows:

hadoop version
The hdfs dfs
command provides shell-based Hadoop commands that directly interact with Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, or others.
This command can be executed using the following syntax:
hadoop fs <args>
Alternatively, we can also use:
hdfs dfs <args>
Let's discuss these dfs
commands briefly:
appendToFile
: This appends local files, or input read from stdin, to a file on HDFS. The syntax is as follows:

hdfs dfs -appendToFile <localsrc> ... <dst>

Giving - as the source reads the input from stdin.
cat
: This displays the content of a file on HDFS. The syntax is as follows:

hdfs dfs -cat <file URI>

Here is an example:

hdfs dfs -cat hdfs://<namenode>/file
chgrp
: This changes the group of a file or directory. The syntax is as follows:

hdfs dfs -chgrp [-R] <group> <URI>

Here are some examples:

hdfs dfs -chgrp -R hadoop hdfs://namenode/dir
hdfs dfs -chgrp hadoop hdfs://namenode/file

For permission-related information, visit http://hadoop.apache.org/docs/r1.2.1/hdfs_permissions_guide.html.
chmod
: This changes the access mode of a file or directory. The syntax is as follows:

hdfs dfs -chmod [-R] <mode> <URI>

Here are some examples:

hdfs dfs -chmod 777 hdfs://namenode/filename
hdfs dfs -chmod -R 777 hdfs://namenode/directory

For more on permissions and modes, visit http://hadoop.apache.org/docs/r1.2.1/hdfs_permissions_guide.html.
chown
: This changes the owner of a directory or file. The syntax is as follows:

hdfs dfs -chown [-R] <owner>[:<group>] <URI>

Here is an example:

hdfs dfs -chown -R shashwat:hadoop hdfs://namenode/directory
copyFromLocal
: This copies files from the local disk to an HDFS file system. The syntax is as follows:

hdfs dfs -copyFromLocal <local file or directory> <URI>

Here are some examples:

hdfs dfs -copyFromLocal /user/home/Shashwat/file1 hdfs://namenode/newdir
hdfs dfs -copyFromLocal /user/home/Shashwat/dir hdfs://namenode/newdir
copyToLocal
: This copies files from HDFS to the local disk. Adding -ignorecrc skips the CRC check, and -crc copies the CRC checksum files as well. This is the syntax:

hdfs dfs -copyToLocal [-ignorecrc] [-crc] <URI> <local file or directory>

Here is an example:

hdfs dfs -copyToLocal hdfs://namenode/newdir /user/home/Shashwat
count
: This counts the number of directories, files, and bytes under the given path; the -q option also reports quota information. The following is the syntax:

hdfs dfs -count [-q] <path>

Here's an example:

hdfs dfs -count /dir
cp
: This is used to copy files from one HDFS location to another, on the same Hadoop cluster or on other Hadoop clusters. This is the syntax:

hdfs dfs -cp <source URI> <destination URI>

Here is an example:

hdfs dfs -cp /user/file1 /user/dir1/
du
: This displays the size of the directories and files under the given path. Adding the -s option displays an aggregate summary instead of individual entries, and -h prints the sizes in a human-readable format. This is the syntax:

hdfs dfs -du [-s] [-h] <URI>

Here are some examples:

hdfs dfs -du -s /user/dir
hdfs dfs -du /user/dir
hdfs dfs -du -s -h /user/dir
dus
: This is equivalent to du -s; it displays a summary of the file sizes.
expunge
: When trash is enabled on HDFS, deleted files go to the trash and are not directly deleted from HDFS. This command enables us to empty the trash. This is the syntax:

hdfs dfs -expunge
get
: This is equivalent to copyToLocal; it copies files from HDFS to the local file system. The syntax is as follows:

hdfs dfs -get <HDFS location> <local destination>

Here is an example:

hdfs dfs -get hdfs://namenode/dir /tmp
getmerge
: This concatenates the files under a source directory and copies the result to a single file in the local file system. This is the syntax:

hdfs dfs -getmerge <src> <localdst> [addnl]

The optional addnl flag adds a newline character at the end of each file.
ls
: This lists out the files and folders in the given path. Here is the syntax:

hdfs dfs -ls <directory path>

This is the example:

hdfs dfs -ls /
lsr
: This lists out files and folders recursively; the syntax and usage are the same as ls.
mkdir
: This creates a directory on HDFS. The syntax is as follows:

hdfs dfs -mkdir <directory path to be created>

Here are some examples:

hdfs dfs -mkdir /user/hadoop/dirtocreate
hdfs dfs -mkdir /user/hadoop/dirtocreate /user/hadoop/dirtocreate1
moveFromLocal
: This copies a file from the local directory to HDFS and then deletes the source file from the local path. This is the syntax:

hdfs dfs -moveFromLocal <localsrc> <dst>
moveToLocal
: This moves a file from HDFS to a local destination. This is the syntax:

hdfs dfs -moveToLocal [-crc] <src> <dst>
mv
: This moves a file or directory from one HDFS location to another, either on the same cluster or to a different cluster. This is the syntax:

hdfs dfs -mv <source> <dest>

Here are some examples:

hdfs dfs -mv /user/shashwat/file /user/shashwat/file1
hdfs dfs -mv hdfs://namenode/file hdfs://namenode/file1
put
: This copies a single source or multiple sources from the local file system to the destination file system. It can also read the input from stdin and write it to the destination file system. This is the syntax:

hdfs dfs -put <local source> <HDFS destination>

Here are some examples:

hdfs dfs -put /tmp/userfilelocal hdfs://namenode/dirtarget
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile

Giving - as the source reads the input from stdin.
rm
: This deletes the specified files. If the -skipTrash option is specified, the trash is bypassed and the files are deleted immediately. This is the syntax:

hdfs dfs -rm [-skipTrash] URI [URI ...]

Here are the examples:

hdfs dfs -rm hdfs://namenode/file
hdfs dfs -rm hdfs://namenode/file hdfs://namenode/file1
hdfs dfs -rm /user/files/file
rmr
: This deletes files/directories recursively. This is the syntax:

hdfs dfs -rmr [-skipTrash] URI [URI ...]

Here is an example:

hdfs dfs -rmr /user/shashwat/dirTodelete
setrep
: This is a helpful command for explicitly changing the replication factor of existing files on HDFS. This is the syntax:

hdfs dfs -setrep [-w] [-R] <rep> <path>

Here is an example:

hdfs dfs -setrep -w 5 -R /user/shashwat/dir

Here, -w waits for the replication to complete, and -R applies the change recursively.
stat
: This displays the statistics of the given file/directory. This is the syntax:

hdfs dfs -stat URI [URI ...]

Here is an example:

hdfs dfs -stat /user/Shashwat/file
tail
: This displays the trailing content of a file on HDFS, in the same way as the Unix tail command; the -f option follows the file as it grows. This is the syntax:

hdfs dfs -tail [-f] URI

Here are some examples:

hdfs dfs -tail /user/Shashwat/file.log
hdfs dfs -tail -f /user/Shashwat/file.log
test
: This tests a condition with the following options: -e checks whether the file exists, -z checks whether the file is zero length, and -d checks whether the path is a directory. This is the syntax:

hdfs dfs -test -[ezd] URI

Here are some examples:

hdfs dfs -test -e /user/Shashwat/file
hdfs dfs -test -z /user/Shashwat/file
hdfs dfs -test -d /user/Shashwat/file
text
: This takes a source file and outputs it in text format; the allowed formats are zip and TextRecordInputStream. This is the syntax:

hdfs dfs -text <src>

Here is an example:

hdfs dfs -text /user/Shashwat/mr/_part01
touchz
: This command creates a zero-length file on HDFS. This is the syntax:

hdfs dfs -touchz URI [URI ...]

Here are some examples:

hdfs dfs -touchz hdfs://namenode/user/Shashwat/file
hdfs dfs -touchz /user/Shashwat/file
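As an aside, what getmerge does conceptually can be sketched against a local directory. This is a simplified stand-in written in plain Python on the local file system, not the HDFS implementation:

```python
import os

def getmerge(src_dir, local_dst, addnl=False):
    """Concatenate every file under src_dir (in sorted name order) into a
    single local file; addnl models the optional [addnl] flag, which adds
    a newline at the end of each file.
    """
    with open(local_dst, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), "rb") as part:
                out.write(part.read())
            if addnl:
                out.write(b"\n")
```

This mirrors the common case of merging a job's part-00000, part-00001, ... outputs into one local file.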
If HDFS contains the path /user/files/file
, and if the local disk also contains the same path, the HDFS API won't know which one we mean, unless we specify a scheme such as file://
or hdfs://
. It might pick the path we did not want to copy.
Therefore, we have -copyFromLocal
, which prevents us from mistakenly copying the wrong file by limiting the parameter we give to the local file system.
The put
command is for users who know which scheme to put in front. It is sometimes confusing for new Hadoop users to decide which file system they are currently on and where their files actually are.
copyFromLocal
is similar to the put
command, except that the source is restricted to a local file reference.
copyToLocal
is similar to the get
command, except that the destination is restricted to a local file reference.
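The scheme resolution described above can be illustrated with a small sketch. Here, resolve_filesystem is a hypothetical helper that mimics how a configured default file system (fs.defaultFS) fills in a missing scheme:

```python
from urllib.parse import urlparse

def resolve_filesystem(path, default_fs="hdfs"):
    """Return the file system a path refers to: its explicit URI scheme if
    present, otherwise the configured default (modelling fs.defaultFS).
    """
    scheme = urlparse(path).scheme
    return scheme or default_fs

print(resolve_filesystem("hdfs://namenode/user/files/file"))  # hdfs
print(resolve_filesystem("file:///user/files/file"))          # file
print(resolve_filesystem("/user/files/file"))                 # hdfs
```

The last case is exactly the ambiguity the text describes: a bare path silently resolves to whatever the default file system is, which is why copyFromLocal and copyToLocal, with their local-only side, are safer for newcomers.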
For the latest Hadoop documentation, visit http://hadoop.apache.org/docs/ and select the Hadoop version.
Now, let's start with HBase administration and operation tasks.