Chapter 6. HBase Cluster Maintenance and Troubleshooting

We have already learned how to set up Hadoop and HBase clusters. Now, we will learn what we need to consider to maintain a cluster and keep it up and running. This chapter will help readers make their HBase cluster more reliable by making it highly available.

In this chapter, we will concentrate on the operational part of HBase. We will discuss the following topics:

  • Introduction to the HBase administration
  • HBase shell
  • Different administration tools for HBase
  • Using Java in HBase shell for various tweaks
  • HBase and shell scripting for HBase
  • Connecting Hive with HBase to run Hive Query Language (HQL) queries from Hive
  • Implementing security in HBase
  • Frequently occurring errors and their solutions
  • Other miscellaneous topics

As HBase runs on top of Hadoop, before starting with the HBase administration, let's look at Hadoop administration tasks and aspects in brief.

Here is the list of available Hadoop shell commands and the steps to use them.

Hadoop shell commands

The hadoop binary is present inside the bin directory. If we need to know all the available commands, we can call it as follows:

<Hadoop directory path>bin/hadoop

In Hadoop v1 and earlier versions, we can use the preceding command. However, in the later versions, we have to use the following command:

<Hadoop directory path>bin/hdfs

Running the binary without any parameter displays the list of available commands. We can check the actual implementation of the Hadoop shell and its Java source at https://github.com/shot/hadoop-source-reading/blob/master/src/core/org/apache/hadoop/fs/FsShell.java.

Tip

We can use bin/hadoop or bin/hdfs based on the version of Hadoop we have. In the newer versions of Hadoop, it is advisable to use bin/hdfs instead of bin/hadoop. Here, we will use bin/hadoop, but you can use either command, depending on the version you are using.
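
A quick, non-destructive way to see which form your installation provides is to invoke the binary with no arguments or to ask for the version; the directory below is a placeholder for your actual Hadoop installation path:

<Hadoop directory path>/bin/hadoop            # lists the available subcommands
<Hadoop directory path>/bin/hdfs              # same, for the newer HDFS-specific binary
<Hadoop directory path>/bin/hadoop version    # prints the installed Hadoop version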

Types of Hadoop shell commands

Let's take a look at the Hadoop shell commands. First, however, we will look at the generic options available with the aforementioned bin/hadoop and bin/hdfs commands. The following is the syntax:

hdfs [--config <configuration dir>] [command] [generic_options] [command_options]

The following list explains the parameters of the preceding command:

  • --config: Using this parameter, we can define the current and active configuration directory, as we might have more than one configuration directory for the cluster. We can define it as follows:

    hadoop --config /home/shashwat/hadoop2/config1

  • -D <parameter-name>=<parameter-value>: Using this, we can override, at runtime, parameters that are found in the configuration files by passing them on the command line.
  • -jt <local> or <jobtrackerHostname:port>: Using this, we can pass the JobTracker host address and port while dealing with MapReduce.
  • -files <comma-separated list of files>: With this parameter, we provide the list of files that need to be copied to the cluster for a job when it is submitted; this copies the resource files required by the job.
  • -libjars <comma-separated list of JAR files>: Here, we can list the library JAR files that are needed for the job to run; these will be included in the Java classpath.
  • -archives <comma-separated list of archive files>: Here, we can list the archive files that are to be extracted for the job resources.

Tip

All the earlier options are valid in the cases of the fs, dfs, dfsadmin, fsck, job, and fetchdt commands.
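
As an illustration of how the generic options combine with these commands, the following line (with a placeholder configuration directory, file path, and a replication value chosen only for the example) points the client at an alternate configuration directory and overrides the replication factor for a single upload:

hadoop --config /home/shashwat/hadoop2/config1 fs -D dfs.replication=2 -put localfile /user/data/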

We categorized Hadoop shell commands into the following three types:

  • Administration commands
  • User commands
  • File-system-related commands

Let's explore the commands under the aforementioned types.

Administration commands

The following is the list of administration commands:

  • balancer: Using this command, we can balance data distribution throughout the cluster. Sometimes, a few of the DataNodes become overloaded when write operations happen at a fast pace; the cluster can also become unbalanced when a new DataNode is added and is still underutilized. We can stop this command at any time using Ctrl + C.

    The syntax for this command is as follows:

    hdfs balancer [-threshold <threshold value>]
    

    The following is the example:

    hdfs balancer -threshold 20
    

    The balancer process is iterative. The threshold is a percentage in the range of 1 to 100. The balancer tries to equalize data usage across all the DataNodes and to keep each DataNode's utilization within the range [average - threshold, average + threshold].

    The smaller the threshold value, the more evenly balanced the cluster will be.

    While balancing the cluster, the balancer uses a lot of network bandwidth. We can control this using another administration command, dfsadmin -setBalancerBandwidth <bandwidth>, so that the balancer uses no more than the specified bandwidth (in bytes per second). This should be set to prevent read/write exceptions during normal cluster operation. The same limit can also be set through the dfs.balance.bandwidthPerSec parameter (value in bytes per second) in the Hadoop configuration, or changed at runtime using the dfsadmin command (a combined usage sketch follows this list of administration commands).

    The balancer picks DataNodes with disk usage above the higher threshold (seen as overutilized DataNodes) and tries to find blocks on these DataNodes that can be copied to underutilized DataNodes. In the second round, the balancer selects DataNodes that are overutilized and moves blocks to nodes where utilization is below average. The third round chooses nodes with utilization above average to move data to underutilized nodes.

    Note

    For more details on balancer (flow, architecture, and administration), visit https://issues.apache.org/jira/browse/HADOOP-1652. Here, PDF files are available on the balancer architecture; you can also visit http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#balancer.

  • daemonlog: This command is used to get or set the logging level of each Hadoop daemon process. It comes in handy when we debug a problem with Hadoop, as we can increase or decrease the log level for debugging purposes. The log level can also be modified through configuration or the Hadoop daemon web pages; however, it is usually better for an administrator to do it through the command line.

    This command accepts two parameters, namely -getlevel and -setlevel. The -getlevel parameter is used to get information about the current log level, and -setlevel is used to set the log level.

    The following is the syntax to get the log level information:

    -getlevel <host:port> <name>
    

    The preceding command gets the log level information of the daemon processes running at the specified host and port by internally connecting to http://<host>:<port>/logLevel?log=<name>.

    Here, <host>:<port> is the host and the HTTP port on which the daemon's web interface is running.

    The <name> parameter is the fully qualified classname of the daemon that performs the logging.

    An example of this is org.apache.hadoop.mapred.JobTracker for the JobTracker daemon.

    The following is the syntax to set the log level:

    -setlevel <host:port> <name> <level>
    

    The preceding command sets the log level of the daemon running at the specified host by internally connecting to http://<host>:<port>/logLevel?log=<name>.

    Here, <host>:<port> is the host and the HTTP port on which the daemon is running, <name> is the fully qualified classname of the daemon whose log level is to be set, and <level> is the log level to set (for example, DEBUG or ERROR).

    The following command is an example of how to get the log level:

    hdfs daemonlog -getlevel host:<port> org.apache.hadoop.mapred.JobTracker
    

    The following command is an example of how to set the log level:

    hdfs daemonlog -setlevel host:<port> org.apache.hadoop.mapred.JobTracker <ERROR or DEBUG>
    
  • datanode: This command is used to start the DataNode daemon process. The following is the syntax:
    hdfs datanode [-rollback]
    

    The -rollback option rolls the DataNode back to the previous version; if an upgrade is in progress and something goes wrong, we use it to restore the DataNode data to the previously existing version. If the command is specified without any parameter, it starts the DataNode daemon if it is not already running.

  • dfsadmin: This command runs the dfsadmin client for the Hadoop cluster to perform administration commands. We can check the actual implementation in Java at https://github.com/facebook/hadoop-20/blob/master/src/hdfs/org/apache/hadoop/hdfs/tools/DFSAdmin.java. The following is the syntax:
    hdfs dfsadmin [GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes] [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename] [-setQuota <quota><dirname>] [-clrQuota <dirname>......<dirname>] [-help [cmd]] [-restoreFailedStorage true|false|check]
    

    The following list explains the different parameters of this command. We already discussed the generic options earlier, and they remain the same here.

    • report: This parameter with the hdfs command displays the basic status of the cluster and HDFS file system.

      For example, have a look at the following command:

      hdfs dfsadmin -report
      

      The following is what you will get as output:

      [Screenshot: output of the hdfs dfsadmin -report command]
    • safemode: Safe mode is a condition in which Hadoop does not allow changes to the file system while it loads and updates the metadata during the startup process. We have a command to get and set this state. Safe mode is a state of NameNode in which the file system is effectively read-only; NameNode does not accept changes to the namespace, and blocks are neither replicated nor deleted.

      This command has parameters such as -safemode <enter | leave | get | wait>, where enter puts NameNode into the safe mode, leave forces Hadoop to come out of the safe mode explicitly, get returns whether NameNode is currently in the safe mode, and wait blocks until NameNode comes out of the safe mode on its own (which is handy in startup scripts; see the sketch after this list of administration commands).

      If you force Hadoop to come out of the safe mode, you are asking it to proceed without completing its metadata checks, which can lead to data corruption. So, if it is at all necessary to force Hadoop to leave the safe mode, first verify, check, and try to see what is there in the NameNode logs.

      Hadoop enters the safe mode automatically at startup, and it leaves the safe mode by itself once the minimum percentage of blocks satisfying the configured replication condition (based on the replication factor) has been reported by the DataNodes.

      NameNode can also be put into the safe mode manually, but then it can only be taken out of the safe mode manually as well.

      This parameter of dfsadmin can be used as follows:

      hdfs dfsadmin -safemode [enter | leave | get | wait]
      

      Let's see one example:

      hdfs dfsadmin -safemode get
      

      The following screenshot shows the output of the preceding command:

      [Screenshot: output of the hdfs dfsadmin -safemode get command]
    • refreshNodes: This parameter makes NameNode re-read the hosts and exclude files in order to update the set of DataNodes that are allowed to connect to NameNode and those that should be or are already decommissioned. For example, have a look at the following command:
      hdfs dfsadmin -refreshNodes
      
    • finalizeUpgrade: When we issue the dfsadmin command with this parameter, it makes an upgrade permanent. It does so by deleting the previous versions of the NameNode and DataNode storage directories. This completes the upgrade process; after this, a rollback is no longer possible.
    • upgradeProgress: This parameter of the command has three options: status, details, and force. It requests the current status of the distributed upgrade, detailed status information, or forces the upgrade to proceed.
    • metasave: This parameter saves the NameNode primary data structures to a file. The file contains one line for each of the following:
      • DataNodes heartbeating with NameNode
      • Blocks waiting to be replicated
      • Blocks currently being replicated
      • Blocks waiting to be deleted
    • setQuota: This parameter is used to set a quota for each directory; the quota is a long integer that puts a hard limit on the number of names in the directory tree. It reports an error if one of the following is true:
      • The quota value N is not a positive integer
      • The user is not an administrator
      • The directory does not exist or is a file
      • The directory exceeds the new quota
    • clrQuota: This parameter clears the quota for each directory. An error is reported if one of the following is true:
      • The directory does not exist or is a file
      • The user is not an administrator (clrQuota does not fail if the directory has no quota)
    • help: This displays the help for all the commands.
    • restoreFailedStorage: This parameter turns automatic attempts to restore failed storage on or off. If a failed storage location comes online again, the system will attempt to restore edits and/or fsimage during a checkpoint. This parameter has three options: true, false, and check; the check option returns the current setting.
  • mradmin: This command runs the MapReduce admin client. The following is the syntax:
    hadoop mradmin [ generic_options ] [-refreshqueueacls]
    

    The -refreshqueueacls parameter refreshes the queue ACLs used by Hadoop to check access when users submit jobs. The properties present in mapred-queue-acls.xml are reloaded by the queue manager.

    Some other options of this command are as follows:

    • The -refreshQueues option to refresh a job queue
    • The -refreshUserToGroupsMappings option to refresh user groups
    • -refreshSuperUserGroupsConfiguration
    • -refreshNodes
    • -help [cmd]
  • jobtracker: This command runs the JobTracker (MapReduce master) daemon if it is not already started. The following is the syntax:
    hadoop jobtracker [-dumpConfiguration]
    

    The -dumpConfiguration option dumps the configuration used by JobTracker, along with the queue configuration, in JSON format to standard output, and then exits.

    Note

    You can also find the description of the jobtracker command at http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html#jobtracker.

  • namenode: This command runs a NameNode instance. The following is the syntax:
    hadoop namenode [-format] / [-upgrade] / [-rollback] / [-finalize] / [-importCheckpoint]
    

    The following list describes the parameters of this command:

    • -format: This parameter should be used only once, when a new cluster is configured for the first time. The command with this parameter formats the file system as HDFS and prepares it for use. It must not be used on a working, in-production cluster, as the whole data will be destroyed.
    • -upgrade: This initiates the upgrade process to a newer version.
    • -rollback: This rolls back the upgrade process if something goes wrong. It must be used after stopping the cluster and distributing the old Hadoop version files on it.
    • -finalize: Once all NameNodes and DataNodes are upgraded successfully, this commits the changes and removes the previous state of the HDFS file system. After using this option, rollback will not work.
    • -importCheckpoint: This loads the image data from a checkpoint directory and saves it into the current directory. The checkpoint directory is read from the fs.checkpoint.dir property.

  • secondarynamenode: This command starts the secondary NameNode instance. The following is the syntax:
    hadoop secondarynamenode [-checkpoint [force]] / [-geteditsize]
    

    The following list explains the parameters of this command:

    • -checkpoint [force]: This performs checkpointing on the secondary NameNode if the EditLog size is greater than or equal to fs.checkpoint.size.

      If force is used, the checkpoint is performed irrespective of the EditLog size.

    • -geteditsize: This prints out the EditLog size.
  • tasktracker: This starts the TaskTracker node; the syntax for this is as follows:
    hadoop tasktracker
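
To tie a few of the preceding administration commands together, here is a minimal maintenance sketch of the kind an operator might run after adding a new DataNode; the bandwidth and threshold values are arbitrary examples, not recommendations:

#!/bin/bash
# Block until NameNode has left safe mode
hdfs dfsadmin -safemode wait

# Print the basic status of the cluster and the HDFS file system
hdfs dfsadmin -report

# Cap the bandwidth the balancer may consume (value in bytes per second)
hdfs dfsadmin -setBalancerBandwidth 10485760

# Rebalance until each DataNode is within 10 percent of the average utilization
hdfs balancer -threshold 10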
    

User commands

The following is the list of user commands:

  • archive: This command is used to create Hadoop archive (HAR) files. Its syntax is as follows:
    hadoop archive -archiveName NAME <src> <dest>
    
  • distcp: This command is used to copy files from one cluster to another, or to a different location on the same cluster. It uses a MapReduce job to copy files in parallel. Its syntax is:
    hadoop distcp <source url> <destination url>
    

    Have a look at the following example:

    hadoop distcp hdfs://hadoop1:9000/files hdfs://hadoop2:9000/filesdir
    
  • fs: Instead of this command, we use hdfs dfs, which we will discuss in the next section with all of its various options.
  • fsck: This command is used to find inconsistencies in HDFS. It reports problems with various files, for example, missing blocks for a file or under-replicated blocks. Unlike the traditional fsck utility for native file systems, this command does not correct the errors it detects (see the sketch after this list of user commands for a usage example). It can be run as follows:
    hdfs fsck [GENERIC_OPTIONS] <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
    

    The following list describes the parameters of this command:

    • <path>: This defines the path to be checked
    • -move: This moves corrupted files to /lost+found
    • -delete: This deletes corrupted files
    • -files: This prints out the files being checked
    • -openforwrite: This prints out files opened for write
    • -list-corruptfileblocks: This prints the list of missing blocks and the files they belong to
    • -blocks: This prints out the block report
    • -locations: This prints out the locations of every block
    • -racks: This prints the network topology for DataNode locations

    By default, the fsck command ignores files opened for write; we can use -openforwrite to report such files. They are generally tagged CORRUPT or HEALTHY depending on their block allocation status.

    The following screenshot shows the output of fsck:

    [Screenshot: output of the hdfs fsck command]
  • fetchdt: This retrieves a delegation token from NameNode. Authentication here is a two-party protocol based on Java SASL DIGEST-MD5; the token is obtained by the client and passed to JobTracker as part of the job submission. Find more details at http://hortonworks.com/wp-content/uploads/2011/10/security-design_withCover-1.pdf.

    The following is the syntax:

    hdfs fetchdt <opts> <token file path>
    

    The following list describes the different command options:

    • --webservice <url>: This is the URL of the NameNode to contact
    • --renewer <name>: This is the name of the delegation token renewer
    • --cancel: This cancels the delegation token
    • --renew: This renews the delegation token, which must have been fetched using the --renewer <name> option
    • --print: This prints the delegation token

  • jar: This runs a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using the command, the syntax of which is as follows:
    hadoop jar <jar> [mainClass] arguments
    

    The following is an example of this command:

    hadoop jar hadoop-mapreduce-examples-*.jar pi 20 20
    
  • job: This is used to interact with MapReduce jobs; it can be invoked as hadoop job or mapred job. The following is the syntax:
    hadoop job [GENERIC_OPTIONS]
      [-submit <job-file>]
      [-status <job-id>]
      [-counter <job-id> <group-name> <counter-name>]
      [-kill <job-id>]
      [-set-priority <job-id> <priority>]
      [-events <job-id> <from-event-#> <#-of-events>]
      [-history <jobOutputDir>]
      [-list [all]]
      [-list-active-trackers]
      [-list-blacklisted-trackers]
      [-list-attempt-ids <job-id> <task-type> <task-state>]
      [-kill-task <task-id>]
      [-fail-task <task-id>]
    

    The following list describes the command options:

    • -submit <job-file>: This is used to submit the job.
    • -status <job-id>: This prints the map and reduce completion percentages and all job counters.
    • -counter <job-id> <group-name> <counter-name>: This prints the counter value of a job.
    • -kill <job-id>: This is used to kill the job.
    • -events <job-id> <from-event-#> <#-of-events>: This prints the details of the events received by JobTracker for the given range.
    • -history [all] <jobOutputDir>: This prints job details, along with details of failed and killed tasks. More details about the job, such as successful tasks and the task attempts made for each task, can be viewed by specifying the [all] option.
    • -list [all]: This displays the jobs that are yet to complete. The -list all option displays all jobs.
    • -kill-task <task-id>: This is used to kill a task using its task ID.
    • -fail-task <task-id>: This is used to fail a task using its task ID; failed tasks are counted against failed attempts.
    • -set-priority <job-id> <priority>: Using this option, we can change the priority of a job to any one of these values: VERY_HIGH, HIGH, NORMAL, LOW, and VERY_LOW.

  • pipes: This command enables Hadoop to run MapReduce code written in C++. This library is supported on 32-bit Linux installations. The following is the syntax:
    hadoop pipes [-conf <path>] [-jobconf <key=value>, <key=value>, ...] [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat <class>] [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>] [-program <executable>] [-reduces <num>]
    

    The following list describes the command options:

    • -conf <path>: This is the path to the job's configuration file
    • -jobconf <key=value>, <key=value>, ...: This adds or overrides configuration parameters for the job
    • -input <path>: This is the path to the input directory
    • -output <path>: This is the path to the output directory
    • -jar <jar file>: This is the JAR filename
    • -inputformat <class>: This is the InputFormat class
    • -map <class>: This is the Java Map class
    • -partitioner <class>: This is the Java Partitioner class
    • -reduce <class>: This is the Java Reduce class
    • -writer <class>: This is the Java RecordWriter class
    • -program <executable>: This is the URI of the executable
    • -reduces <num>: This is the number of reduces

  • version: This displays the Hadoop version. Its syntax is as follows:
    hadoop version
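
As a small combined example of the user commands above, the following sketch (with placeholder hosts and paths) first checks a directory for missing or under-replicated blocks and then copies it to a second cluster with a parallel MapReduce copy:

# Check /user/data and print the files and blocks being examined
hdfs fsck /user/data -files -blocks

# Copy the same directory to another cluster in parallel
hadoop distcp hdfs://hadoop1:9000/user/data hdfs://hadoop2:9000/backup/data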
    

File system-related commands

The hdfs dfs command provides shell-based Hadoop commands that directly interact with Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, or others.

This command can be executed using the following syntax:

[hadoop fs <args>]

Alternatively, we can also use:

[hdfs dfs <args>]
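
For instance, assuming the Hadoop binaries are on the PATH, both of the following invocations list the root of HDFS and produce the same result:

hadoop fs -ls /
hdfs dfs -ls /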

Let's discuss these dfs commands briefly:

Options

Description

-appendToFile

This appends one or more local files, or input read from stdin, to a file on HDFS.

Have a look at the following example:

  • hdfs dfs -appendToFile file /user/data/appendedfile
  • hdfs dfs -appendToFile file file0 /user/data/appendedfile
  • hdfs dfs -appendToFile localfile hdfs://<namenode>/user/data/appendedfile
  • hdfs dfs -appendToFile - hdfs://<namenode>/user/data/appendedfile

The last form, with - as the source, reads the input from stdin.

-cat

This displays the content of a file on stdout. The following is the syntax:

hdfs dfs -cat <file URI>

Here is an example:

hdfs dfs -cat hdfs://<namenode>/file

** -chgrp

This changes the group of a file or directory. -R is used to apply the change recursively. The syntax is:

hdfs dfs -chgrp [-R] <group> <URI>

Here are some examples:

hdfs dfs -chgrp -R hadoop hdfs://namenode/dir
hdfs dfs -chgrp hadoop hdfs://namenode/file

For permission-related information, visit http://hadoop.apache.org/docs/r1.2.1/hdfs_permissions_guide.html.

** -chmod

This changes the access mode (permissions) of a file or directory. -R is used to apply the change recursively.

The syntax is as follows:

hdfs dfs -chmod [-R] <mode> <URI>

Here are some examples:

hdfs dfs -chmod 777 hdfs://namenode/filename
hdfs dfs -chmod -R 777 hdfs://namenode/directory

For more on permission and mode, visit http://hadoop.apache.org/docs/r1.2.1/hdfs_permissions_guide.html.

** -chown

This changes the owner of a directory or file; we use -R to apply the change recursively.

The syntax is as follows:

hdfs dfs -chown [-R] <owner>[:<group>] <URI>

Here is an example:

hdfs dfs -chown -R shashwat:hadoop hdfs://namenode/directory

-copyFromLocal

This copies files from a local drive to an HDFS file system.

The syntax is as follows:

hdfs dfs -copyFromLocal <local file or directory> <URI>

Here are some examples:

hdfs dfs -copyFromLocal /user/home/Shashwat/file1 hdfs://namenode/newdir
hdfs dfs -copyFromLocal /user/home/Shashwat/dir hdfs://namenode/newdir

-copyToLocal

This copies files from HDFS to the local drive. Adding -ignorecrc skips the CRC check after copying, and the -crc option copies the CRC checksum files along with the data.

This is the syntax:

hdfs dfs -copyToLocal [-ignorecrc] [-crc] <URI> <local file or directory>

Here is an example:

hdfs dfs -copyToLocal hdfs://namenode/newdir /user/home/Shashwat

## -count

This counts the number of directories, files, and bytes under the given path. -q can be added to also report quota information.

The following is the syntax:

hdfs dfs -count [-q] <path>

Here's an example:

hdfs dfs -count /dir

## -cp

This is used to copy files from one HDFS location to another on the same Hadoop cluster or other Hadoop clusters.

This is the syntax:

hdfs dfs -cp <source URI> <destination URI>

Here is an example:

hdfs dfs -cp /user/file1 /user/dir1/

## -du

This displays the size of the directories and files under the given path.

This is the syntax:

hdfs dfs -du [-s] [-h] <URI>

Adding the -s option gives the summarized (aggregated) size, and -h displays it in a human-readable format (in MB, GB, and so on).

Here are some examples:

hdfs dfs -du -s /user/dir
hdfs dfs -du /user/dir
hdfs dfs -du -s -h /user/dir

## -dus

This is equivalent to -du -s and displays the size of directories or files as an aggregated summary.

## -expunge

When trash is enabled on HDFS, deleted files go to the trash rather than being removed from HDFS immediately. This command enables us to empty the trash.

This is the syntax:

hdfs dfs -expunge

-get

This is equivalent to -copyToLocal.

This is the syntax:

hdfs dfs -get <HDFS location> <local destination>

Here is an example:

hdfs dfs -get hdfs://namenode/dir /tmp

-getmerge

This concatenates the files under a source directory on HDFS into a single file at the local destination; the optional addnl argument adds a newline at the end of each file.

This is the syntax:

hdfs dfs -getmerge <src> <localdst> [addnl]

-ls

This lists out files and folders in a given path.

Here is the syntax:

hdfs dfs -ls <directory path>

This is the example:

hdfs dfs -ls /

-lsr

This lists out files and folders recursively; the syntax and usage are the same as -ls.

-mkdir

This creates a directory on HDFS.

This is the syntax:

hdfs dfs -mkdir <directory path to be created>

Here are some examples:

hdfs dfs -mkdir /user/hadoop/dirtocreate
hdfs dfs -mkdir /user/hadoop/dirtocreate /user/hadoop/dirtocreate1

-moveFromLocal

This moves a file from the local file system to HDFS; the local source file is deleted once it has been copied.

This is the syntax:

hdfs dfs -moveFromLocal <localsrc> <dst>

-moveToLocal

This moves the file from HDFS to a local destination.

This is the syntax:

hdfs dfs -moveToLocal [-crc] <src> <dst>

** -mv

This moves a file or directory from one HDFS location to another within the same file system; moving files across file systems is not permitted.

This is the syntax:

hdfs dfs -mv <source> <dest>

Here are some examples:

hdfs dfs -mv /user/shashwat/file /user/shashwat/file1
hdfs dfs -mv hdfs://namenode/file hdfs://namenode/file1

-put

This copies a single source or multiple sources from the local file system to the destination file system. It also reads the input from stdin and writes to the destination file system.

This is the syntax:

hdfs dfs -put <local source> <HDFS destination>

Here are some examples:

hdfs dfs -put /tmp/userfilelocal hdfs://namenode/dirtarget
hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile

Giving - (a hyphen) instead of the source path takes the input from stdin.

** -rm

This deletes a specified file. If the –skipTrash option is specified, the trash, if enabled, will be bypassed, and the specified file or files will be deleted immediately. This can be useful when it is necessary to delete files from an over-quota directory.

This is the syntax:

hdfs dfs -rm [-skipTrash] URI [URI …]

Here are the examples:

hdfs dfs -rm hdfs://namenode/file
hdfs dfs -rm hdfs://namenode/file hdfs://namenode/file1
hdfs dfs -rm /user/files/file

** -rmr

This deletes files/directories recursively.

This is the syntax:

hdfs dfs -rmr [-skipTrash] URI [URI …]

Here is an example:

hdfs dfs -rmr /user/shashwat/dirTodelete

## -setrep

This is a helpful command for explicitly changing the replication factor of existing files on HDFS. -R is added to set the replication factor recursively for a whole directory tree.

This is the syntax:

hdfs dfs -setrep [-R] [-w] <replication factor> <path>

Here is an example:

hdfs dfs -setrep -w 5 -R /user/shashwat/dir

Here, -w will wait until the replication is complete, and -R will perform the operation recursively.

## -stat

This displays the statistics of the given argument file/directory.

This is the syntax:

hdfs dfs -stat URI [URI …]

Here is an example:

hdfs dfs -stat /user/Shashwat/file

-tail

This displays the trailing content of a file on HDFS, much like the tail command in Linux. -f is added to continuously tail the file content.

This is the syntax:

hdfs dfs -tail [-f] URI

Here are some examples:

hdfs dfs -tail /user/Shashwat/file.log
hdfs dfs -tail -f /user/Shashwat/file.log

-test

This tests the condition with the following options:

  • -e checks to see whether the file exists; it returns 0 if true
  • -z checks to see whether the file is of zero length; it returns 0 if true
  • -d checks to see whether the path is a directory; it returns 0 if true

This is the syntax:

hdfs dfs -test -[ezd] URI

Here are some examples:

hdfs dfs -test -e /user/Shashwat/file
hdfs dfs -test -z /user/Shashwat/file
hdfs dfs -test -d /user/Shashwat/file

## -text

hdfs dfs -cat displays the file content correctly when the file is text based. If we need to read a binary sequence file or a compressed file, cat will not do, so we have to use this command, which outputs the file in text format. The allowed formats are ZIP and TextRecordInputStream.

This is the syntax:

hdfs dfs -text <src>

Here is an example:

hdfs dfs -text /user/Shashwat/mr/_part01

-touchz

This command creates a zero-length file on HDFS.

This is the syntax:

hdfs dfs -touchz URI [URI …]

Here are some examples:

hdfs dfs -touchz hdfs://namenode/user/Shashwat/file
hdfs dfs -touchz /user/Shashwat/file

Commands preceded by ## are important admin commands.

Commands preceded by ** are commands to be used with caution, as they might result in data loss.
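
To see several of these options working together, here is a short sketch (all paths are placeholders) that creates a directory, uploads a file, inspects it, and copies it back:

# Create a working directory on HDFS and upload a local file into it
hdfs dfs -mkdir /user/shashwat/demo
hdfs dfs -put /tmp/localfile.txt /user/shashwat/demo/

# Verify the upload and check its size in a human-readable format
hdfs dfs -ls /user/shashwat/demo
hdfs dfs -du -h /user/shashwat/demo

# Copy the file back to the local disk and remove the HDFS copy
hdfs dfs -get /user/shashwat/demo/localfile.txt /tmp/copyback.txt
hdfs dfs -rm /user/shashwat/demo/localfile.txt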

Difference between copyToLocal/copyFromLocal and get/put

If HDFS contains the path /user/files/file, and if the local disk also contains the same path, the HDFS API won't know which one we mean, unless we specify a scheme such as file:// or hdfs://. It might pick the path we did not want to copy.

Therefore, we have -copyFromLocal, which prevents us from mistakenly copying the wrong file by limiting the parameter we give to the local file system.

The put command is for users who know which scheme to put in front; new Hadoop users sometimes find it confusing to decide or specify which file system they are currently working with and where their files actually are.

copyFromLocal is similar to the put command, except that the source is restricted to a local file reference.

copyToLocal is similar to the get command, except that the destination is restricted to a local file reference.
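
A minimal illustration of this, assuming /user/files/file exists both on the local disk and on HDFS and that namenode is a placeholder host: spelling out the scheme, or using the restricted commands, removes the ambiguity.

# Explicit schemes leave no doubt about which file system each path refers to
hadoop fs -cp file:///user/files/file hdfs://namenode/user/files/

# copyFromLocal always treats the source as a local path
hdfs dfs -copyFromLocal /user/files/file /user/files/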

Tip

For the latest Hadoop documentation, visit http://hadoop.apache.org/docs/ and select the Hadoop version.

Now, let's start with HBase administration and operation tasks.
