Exploring HDFS commands

To perform filesystem-related tasks, use commands that begin with hdfs dfs. These filesystem commands are designed to behave like the corresponding Unix/Linux filesystem commands.

What is a URI? URI stands for Uniform Resource Identifier. In the commands listed in this section, file locations are given as URIs. The URI syntax to access a file in HDFS is hdfs://namenodehost/parent/child/<file>.
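
As a minimal sketch of the URI syntax described above, the following shell snippet assembles a fully qualified HDFS URI from its parts. The helper name hdfs_uri and the file name file.txt are hypothetical and used only for illustration; namenodehost is the same placeholder used in the text.

```shell
# Build a fully qualified HDFS URI from a namenode host and an absolute path.
# "namenodehost" and "file.txt" are placeholders, not real endpoints.
hdfs_uri() {
  local host="$1" path="$2"
  printf 'hdfs://%s%s\n' "$host" "$path"
}

uri="$(hdfs_uri namenodehost /parent/child/file.txt)"
echo "$uri"

# On a machine with a configured Hadoop client, the URI can be passed
# directly to a filesystem command, for example:
#   hdfs dfs -ls "$uri"
# A path without a scheme and host is resolved against the default
# filesystem set in fs.defaultFS:
#   hdfs dfs -ls /parent/child/file.txt
```

The hdfs dfs calls are left as comments so that the snippet runs even where no Hadoop client is installed.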

Commonly used HDFS commands

The following are some of the most commonly used HDFS commands:

  • ls: This command lists files in HDFS.

    The syntax of the ls command is hdfs dfs -ls <args>. The following is the screenshot showing an example of the ls command:

    [Screenshot: example of the ls command]
  • cat: This command displays the contents of one or more files in the terminal.

    The syntax of the cat command is hdfs dfs -cat URI [URI …]. The following is a sample output of the cat command:

    [Screenshot: sample output of the cat command]
  • copyFromLocal: This command copies one or more files from the local filesystem to HDFS.

    The syntax of the copyFromLocal command is hdfs dfs -copyFromLocal <localsrc> URI. The following is the screenshot showing an example of the copyFromLocal command:

    [Screenshot: example of the copyFromLocal command]
  • copyToLocal: This command copies one or more files from HDFS to the local filesystem.

    The syntax of the copyToLocal command is hdfs dfs -copyToLocal URI <localdst>. The following is the screenshot showing an example of the copyToLocal command:

    [Screenshot: example of the copyToLocal command]
  • cp: This command copies files within HDFS.

    The syntax of the cp command is hdfs dfs -cp URI [URI …] <dest>. The following is the screenshot showing an example of the cp command:

    [Screenshot: example of the cp command]
  • mkdir: This command creates a directory in HDFS.

    The syntax of the mkdir command is hdfs dfs -mkdir <paths>. The following is the screenshot showing an example of the mkdir command:

    [Screenshot: example of the mkdir command]
  • mv: This command moves files within HDFS.

    The syntax of the mv command is hdfs dfs -mv URI [URI …] <dest>. The following is the screenshot showing an example of the mv command:

    [Screenshot: example of the mv command]
  • rm: This command deletes files from HDFS.

    The syntax of the rm command is hdfs dfs -rm URI [URI …]. The following is the screenshot showing an example of the rm command:

    [Screenshot: example of the rm command]
  • rm -r: This command deletes a directory and its contents from HDFS.

    The syntax of the rm -r command is hdfs dfs -rm -r URI [URI …]. The following is the screenshot showing an example of the rm -r command:

    [Screenshot: example of the rm -r command]
  • setrep: This command sets the replication factor for a file in HDFS.

    The syntax of the setrep command is hdfs dfs -setrep [-R] <rep> <path>, where <rep> is the desired replication factor. The following is the screenshot showing an example of the setrep command:

    [Screenshot: example of the setrep command]
  • tail: This command displays the trailing kilobyte of the contents of a file in HDFS.

    The syntax of the tail command is hdfs dfs -tail [-f] URI. The following is the screenshot showing an example of the tail command:

    [Screenshot: example of the tail command]
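
The commands in the preceding list can be strung together into a short session. The following is a hedged sketch rather than a definitive script: the paths /user/demo/input and ./sample.txt, the replication factor of 2, the -p flag on mkdir, and the run helper with its DRY_RUN switch are all assumptions made for illustration. By default the script only prints each command, so it can be read and tried without a cluster.

```shell
# DRY_RUN=1 (the default here) only prints each command. Set DRY_RUN=0 on a
# machine with a configured Hadoop client to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical paths, used only for illustration.
LOCAL_FILE="./sample.txt"
HDFS_DIR="/user/demo/input"

# Create a directory (with parents) and copy a local file into it.
run hdfs dfs -mkdir -p "$HDFS_DIR"
run hdfs dfs -copyFromLocal "$LOCAL_FILE" "$HDFS_DIR"

# Inspect the directory and the file.
run hdfs dfs -ls "$HDFS_DIR"
run hdfs dfs -cat "$HDFS_DIR/sample.txt"
run hdfs dfs -tail "$HDFS_DIR/sample.txt"

# Copy and move within HDFS, then change the replication factor.
run hdfs dfs -cp "$HDFS_DIR/sample.txt" "$HDFS_DIR/sample.copy.txt"
run hdfs dfs -mv "$HDFS_DIR/sample.copy.txt" "$HDFS_DIR/sample.moved.txt"
run hdfs dfs -setrep 2 "$HDFS_DIR/sample.txt"

# Copy a file back to the local filesystem, then clean up.
run hdfs dfs -copyToLocal "$HDFS_DIR/sample.moved.txt" ./sample.moved.txt
run hdfs dfs -rm "$HDFS_DIR/sample.moved.txt"
run hdfs dfs -rm -r "$HDFS_DIR"
```

Keeping the dry-run default makes it easy to review the exact commands before running the destructive ones (rm and rm -r) against a real cluster.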

Commands to administer HDFS

Hadoop provides several commands to administer HDFS. The following are two of the commonly used administration commands in HDFS:

  • balancer: New datanodes can be added to a cluster to provide more storage space. However, a newly added datanode does not hold any blocks, so the data blocks are no longer evenly spread across the datanodes. The administrator can use the balancer command to rebalance the cluster.

    The syntax of the balancer command is hdfs balancer -threshold <threshold>. Here, threshold is the balancing threshold expressed as a percentage. The threshold is specified as a float value that ranges from 0 to 100. The default threshold value is 10. The balancer tries to distribute blocks to the underutilized datanodes. For example, if the average utilization of all the datanodes in the cluster is 50 percent, the balancer, by default, will try to pick up blocks from nodes that have a utilization of above 60 percent (50 percent + 10 percent) and move them to nodes that have a utilization of below 40 percent (50 percent - 10 percent).

  • dfsadmin: The dfsadmin command is used to run administrative commands on HDFS.

    The syntax of the dfsadmin command is hdfs dfsadmin <options>. Let's understand a few of the important command options and the actions they perform:

    • [-report]: This generates a report of the basic filesystem information and statistics.
    • [-safemode <enter | leave | get | wait>]: This controls safe mode, a namenode state in which the namenode does not accept changes to the namespace (read-only) and does not replicate or delete blocks.
    • [-saveNamespace]: This saves the current state of the namespace to a storage directory and resets the edits log.
    • [-rollEdits]: This forces a rollover of the edits log, that is, it saves the state of the current edits log and creates a fresh edits log for new transactions.
    • [-restoreFailedStorage true|false|check]: This turns automatic attempts to restore failed storage replicas on or off, or checks the current setting.
    • [-refreshNodes]: This updates the namenode daemon with the set of datanodes allowed to connect to the namenode daemon.
    • [-setQuota <quota> <dirname>...<dirname>]: This sets the quota (the number of items) for the directory/directories.
    • [-clrQuota <dirname>...<dirname>]: This clears the set quota for the directory/directories.
    • [-setSpaceQuota <quota> <dirname>...<dirname>]: This sets the disk space quota for the directory/directories.
    • [-clrSpaceQuota <dirname>...<dirname>]: This clears the disk space quota for the directory/directories.
    • [-refreshServiceAcl]: This refreshes the service-level authorization policy file. We will be learning more about authorization later.
    • [-printTopology]: This prints the tree of the racks and their nodes as reported by the namenode daemon.
    • [-refreshNamenodes datanodehost:port]: This reloads the configuration files for a datanode daemon, stops serving the removed block pools, and starts serving new block pools. A block pool is a set of blocks that belong to a single namespace. We will be looking into this concept a bit later.
    • [-deleteBlockPool datanodehost:port blockpoolId [force]]: This deletes a block pool of a datanode daemon.
    • [-setBalancerBandwidth <bandwidth>]: This sets the bandwidth limit to be used by the balancer. The bandwidth is the value in bytes per second that the balancer should use for moving data blocks.
    • [-fetchImage <local directory>]: This gets the latest fsimage file from the namenode and saves it to the specified local directory.
    • [-help [cmd]]: This displays help for the given command or all commands if a command is not specified.
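
The balancer threshold arithmetic and a few of the dfsadmin options described above can be sketched as follows. This is an illustration under stated assumptions, not a recommended maintenance procedure: the helper names balancer_bounds and run, the DRY_RUN switch, the directory /user/demo, and the quota and bandwidth values are all hypothetical. The helper uses integer percentages for simplicity, although the threshold itself may be a float.

```shell
# Compute the utilization bounds from the balancer example in the text:
# average utilization plus and minus the threshold (integer percentages).
balancer_bounds() {
  local avg="$1" threshold="${2:-10}"
  echo "$((avg + threshold)) $((avg - threshold))"
}

bounds="$(balancer_bounds 50)"   # threshold falls back to the default of 10
upper="${bounds% *}"
lower="${bounds#* }"
echo "move blocks from nodes above ${upper}% to nodes below ${lower}%"

# DRY_RUN=1 (the default here) only prints each command. Set DRY_RUN=0 on a
# machine with a configured Hadoop client to actually execute them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Limit balancer traffic to 10 MB per second (value is in bytes per second),
# then start the balancer with the default threshold made explicit.
run hdfs dfsadmin -setBalancerBandwidth "$((10 * 1024 * 1024))"
run hdfs balancer -threshold 10

# Basic health and topology checks.
run hdfs dfsadmin -report
run hdfs dfsadmin -safemode get
run hdfs dfsadmin -printTopology

# Set and clear quotas on a hypothetical directory.
run hdfs dfsadmin -setQuota 1000 /user/demo
run hdfs dfsadmin -setSpaceQuota 10g /user/demo
run hdfs dfsadmin -clrQuota /user/demo
run hdfs dfsadmin -clrSpaceQuota /user/demo
```

With an average utilization of 50 percent and the default threshold, the helper reproduces the 60 percent and 40 percent bounds from the balancer example.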