Sorting and finding unique text

The shell provides several ways to sort input text with the sort command. Repeated lines can be removed with the uniq command; because uniq only detects adjacent duplicates, unsorted input is usually sorted first. Input to sort and uniq can come from a file or be piped from another command.

Sorting an input text

By default, lines are sorted according to the collation rules of the current locale. In the C (POSIX) locale, the order is as follows:

  • Numbers from 0 to 9
  • Uppercase letters from A to Z
  • Lowercase letters from a to z
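The order listed above is the byte-wise order that sort uses in the C locale; in other locales, upper- and lowercase letters are typically interleaved. A minimal sketch, assuming a POSIX shell, with sample words made up for illustration:

```shell
# Force byte-wise (ASCII) ordering for this one command via LC_ALL=C.
c_sorted=$(printf '%s\n' banana Apple 42 apple Banana | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# 42
# Apple
# Banana
# apple
# banana
```

Setting LC_ALL=C only for the sort invocation leaves the rest of the shell session in its usual locale.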

The syntax is as follows:

sort [OPTION] [FILE …]

One or more input files can be passed to sort.

The sort command takes many options that make sorting flexible. The most commonly used options are described in the following table:

Option     Description
-b         Ignores leading blanks
-d         Considers only blanks and alphanumeric characters
-f         Ignores case
-i         Ignores non-printable characters
-M         Compares by month name (unknown < JAN < FEB… < DEC)
-n         Sorts by numerical value
-r         Sorts in reverse order
-h         Sorts human-readable numbers, for example, 9K, 5M, 1G
-u         Outputs only unique lines
-o file    Writes the output to file instead of stdout
-m         Merges already sorted files without re-sorting them
-k n       Sorts on a key that starts at field (column) n
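A few of the options in the table are easiest to understand on small inline input. The following sketch assumes GNU sort (for -h) and uses made-up values; LC_ALL=C is set only to make the lexical result reproducible:

```shell
# Default (lexical) sort compares character by character: "10" < "2".
lexical=$(printf '%s\n' 10 9 100 2 | LC_ALL=C sort)     # 10 100 2 9
# -n compares whole lines as numbers.
numeric=$(printf '%s\n' 10 9 100 2 | LC_ALL=C sort -n)  # 2 9 10 100
# -h understands size suffixes such as K, M, and G.
human=$(printf '%s\n' 1G 9K 5M | LC_ALL=C sort -h)      # 9K 5M 1G
printf '%s\n' "$lexical" "$numeric" "$human"
```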

The following examples show how input text can be sorted in different ways.

Sorting a single file

In our example, we will consider the sort1.txt file for sorting. The content of this file is as follows:

$ cat sort1.txt
Japan
Singapore
Germany
Italy
France
Sri Lanka

To sort the content alphabetically, we can use the sort command without any option:

$ sort sort1.txt
France
Germany
Italy
Japan
Singapore
Sri Lanka

To sort the content in reverse order, we can use the -r option:

$ sort -r sort1.txt
Sri Lanka
Singapore
Japan
Italy
Germany
France

Sorting multiple files

We can also sort multiple files together, and the sorted output can be used for further queries.

For example, consider sort1.txt and sort2.txt files. We will reuse the content of the sort1.txt file from the previous example. The content of sort2.txt is as follows:

$ cat sort2.txt
India
USA
Canada
China
Australia

We can sort both the files together alphabetically as follows:

$ sort sort1.txt sort2.txt
Australia
Canada
China
France
Germany
India
Italy
Japan
Singapore
Sri Lanka
USA

We can also use the -o option to save the sorted output of files in a file instead of displaying it on stdout:

$ sort sort1.txt sort2.txt -o sorted.txt
$ cat sorted.txt
Australia
Canada
China
France
Germany
India
Italy
Japan
Singapore
Sri Lanka
USA
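One practical consequence of -o is that the output file may be one of the input files, which allows sorting a file in place. A minimal sketch, assuming a system with mktemp; the file name countries.txt is hypothetical:

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' Japan Germany Italy > "$tmpdir/countries.txt"

# sort reads all of its input before it opens the -o file,
# so naming the same file as input and output is safe.
LC_ALL=C sort -o "$tmpdir/countries.txt" "$tmpdir/countries.txt"
in_place=$(cat "$tmpdir/countries.txt")

# By contrast, "sort file > file" lets the shell truncate the file
# before sort runs, so sort would read empty input.
printf '%s\n' "$in_place"
# Germany
# Italy
# Japan
rm -r "$tmpdir"
```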

Redirecting output to sort

We can sort the output of another command by piping it to sort. The following example starts from the output of the df -h command:

$ df -h    # Disk space usage in human-readable format
[Figure: output of df -h]

The following command sorts the output of df -h by the content of its second column (Size):

$ df -h | sort -hb -k2  # Sort by the 2nd column (Size)
[Figure: output of df -h | sort -hb -k2]
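Because real df output differs from machine to machine, the same idea can be tried on a fixed sample. This sketch assumes GNU sort; the filesystem names and sizes are invented for illustration. It also drops the header line first, which would otherwise be sorted along with the data:

```shell
# Hypothetical df-like sample: filesystem name, then size.
sample='Filesystem Size
/dev/sda1 20G
tmpfs 512M
/dev/sdb1 1.5T
devtmpfs 4.0K'

# tail -n +2 skips the header; -k2,2h restricts the key to the 2nd
# field only and compares it as a human-readable number.
by_size=$(printf '%s\n' "$sample" | tail -n +2 | LC_ALL=C sort -k2,2h)
printf '%s\n' "$by_size"
# devtmpfs 4.0K
# tmpfs 512M
# /dev/sda1 20G
# /dev/sdb1 1.5T
```

Writing the key as -k2,2 rather than -k2 stops the key at the end of the second field instead of letting it run to the end of the line.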

We can sort the ls -l output according to the last modification day and month:

$ ls -l /var/cache/    # Long listing of the content of /var/cache
[Figure: output of ls -l /var/cache/]

To sort the ls -l output, first sort by month, which is the 6th field, using the -M option. If two or more rows have the same month, sort them by day, which is the 7th field, using -n for a numerical sort:

$ ls -l /var/cache/ | sort -bk 6M -nk7
[Figure: output of ls -l /var/cache/ | sort -bk 6M -nk7]
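The two-key idea can be checked on fixed data, since the column layout of ls -l varies between systems. The sketch below assumes GNU sort and uses invented "name month day" records, so the month is field 2 and the day is field 3 here:

```shell
# Hypothetical records, loosely modelled on the date columns of ls -l.
records='backup.tar Mar 14
notes.txt Jan 30
build.log Mar 2
todo.md Jan 5'

# First key: field 2 only, compared as a month name (M).
# Second key: field 3 only, compared numerically (n); it breaks ties.
by_date=$(printf '%s\n' "$records" | LC_ALL=C sort -k2,2M -k3,3n)
printf '%s\n' "$by_date"
# todo.md Jan 5
# notes.txt Jan 30
# build.log Mar 2
# backup.tar Mar 14
```

Attaching the ordering letter to each key (M, n) keeps the two comparisons independent of each other.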

Filtering unique elements

In many use cases, we need to remove duplicate items and keep only one occurrence of each. This is especially useful when the output of a command or an input file is large and contains many duplicate lines. To get unique lines from a file or from piped output, use the uniq command. Note that uniq only detects adjacent duplicates, so the input must already be sorted or be passed through sort first. The syntax is as follows:

uniq [OPTION] [INPUT [OUTPUT]]

An input to uniq can be given from a file or another command's output.

If an input file is provided, an optional output file can also be specified on the command line. If no output file is specified, the output is printed on stdout.

The options that uniq supports are discussed in the following table:

Option     Description
-c         Prefixes each line with its number of occurrences
-d         Prints only duplicated lines, one for each group
-f N       Skips the first N fields when comparing
-i         Ignores case when comparing
-u         Prints only lines that are not repeated
-s N       Skips the first N characters of each line when comparing
-w N       Compares no more than the first N characters of each line
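The -f option is not demonstrated elsewhere in this section, so here is a minimal sketch on invented log lines, where the first field is a timestamp that should be ignored when comparing:

```shell
# Hypothetical log lines: a timestamp field followed by a message.
log='10:01 disk full
10:02 disk full
10:03 disk ok
10:04 disk full'

# -f 1 skips the first field, so only the message is compared.
# Only adjacent repeats collapse; the last "disk full" line is kept
# because "disk ok" separates it from the earlier ones.
collapsed=$(printf '%s\n' "$log" | uniq -f 1)
printf '%s\n' "$collapsed"
# 10:01 disk full
# 10:03 disk ok
# 10:04 disk full
```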

Unique elements in a file

Consider the unique.txt file as an example on which we will run the uniq command with its options. The content of unique.txt is as follows:

$ cat unique.txt
Welcome to Linux shell scripting
1
Welcome to LINUX shell sCripting
2
Welcome To Linux Shell Scripting
4
2
4
Welcome to Linux shell scripting
2
3
Welcome to Linux shell scripting
2
Welcome to Linux shell scripting
Welcome to LINUX shell sCripting

To remove duplicate lines from the unique.txt file, we can do the following:

  • First, sort the file and pipe the sorted text to the uniq command:
    $ sort unique.txt | uniq
    
  • Use the -u option with the sort command:
    $ sort -u unique.txt
    

Both commands produce the same output.

[Figure: output of sort unique.txt | uniq]
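The equivalence of the two commands can be checked with a short script. This sketch recreates unique.txt in a scratch directory (assuming mktemp is available) and pins the collation with LC_ALL=C so that the result is reproducible; in other locales the three "Welcome" variants may appear in a different order:

```shell
# Recreate unique.txt from the listing above in a scratch directory.
tmpdir=$(mktemp -d)
cat > "$tmpdir/unique.txt" <<'EOF'
Welcome to Linux shell scripting
1
Welcome to LINUX shell sCripting
2
Welcome To Linux Shell Scripting
4
2
4
Welcome to Linux shell scripting
2
3
Welcome to Linux shell scripting
2
Welcome to Linux shell scripting
Welcome to LINUX shell sCripting
EOF

via_uniq=$(LC_ALL=C sort "$tmpdir/unique.txt" | uniq)
via_sort_u=$(LC_ALL=C sort -u "$tmpdir/unique.txt")
printf '%s\n' "$via_uniq"
# 1
# 2
# 3
# 4
# Welcome To Linux Shell Scripting
# Welcome to LINUX shell sCripting
# Welcome to Linux shell scripting
rm -r "$tmpdir"
```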

We can use the -c option to print the number of occurrences of each line in the input file:

$ sort unique.txt | uniq -c
[Figure: output of sort unique.txt | uniq -c]

Using the -c and -i options together prints each unique line along with its occurrence count, comparing lines case-insensitively:

$ sort unique.txt | uniq -ci
[Figure: output of sort unique.txt | uniq -ci]
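The counts for unique.txt can be reproduced as follows. This is a sketch under the same assumptions as before: the file is recreated in a scratch directory, LC_ALL=C pins the sort order, and sed removes the left padding that uniq -c prints before each count, since the padding width differs between implementations:

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/unique.txt" <<'EOF'
Welcome to Linux shell scripting
1
Welcome to LINUX shell sCripting
2
Welcome To Linux Shell Scripting
4
2
4
Welcome to Linux shell scripting
2
3
Welcome to Linux shell scripting
2
Welcome to Linux shell scripting
Welcome to LINUX shell sCripting
EOF

# Case-sensitive counts: each spelling is counted separately.
counts=$(LC_ALL=C sort "$tmpdir/unique.txt" | uniq -c | sed 's/^ *//')
# Case-insensitive counts: all "Welcome" variants form one group,
# reported under the first line of that group.
counts_nocase=$(LC_ALL=C sort "$tmpdir/unique.txt" | uniq -ci | sed 's/^ *//')
printf '%s\n\n%s\n' "$counts" "$counts_nocase"
rm -r "$tmpdir"
```

With this input, the case-sensitive run reports counts of 1, 4, 1, and 2 for the lines 1, 2, 3, and 4, and 1, 2, and 4 for the three "Welcome" spellings; the case-insensitive run merges those three into a single group with a count of 7.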

To get only the lines that appear exactly once in the file, use the -u option:

$ sort unique.txt | uniq -u
1
3
Welcome To Linux Shell Scripting

Similarly, to get the lines that appear more than once in the file, use the -d option:

$ sort unique.txt | uniq -d
2
4
Welcome to Linux shell scripting
Welcome to LINUX shell sCripting

We can also tell the uniq command to find unique lines by comparing only the first N characters of each line:

$ sort unique.txt | uniq -w 10
1
2
3
4
Welcome to Linux shell scripting
Welcome To Linux Shell Scripting

Note

  • The uniq command does not detect repeated lines unless they are adjacent.
  • To find unique lines, first sort the input using the sort command and then apply the uniq command.
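The effect described in this note is easy to see on a three-line input. A minimal sketch with made-up words:

```shell
# Without sorting, the two "apple" lines are not adjacent, so both survive.
unsorted=$(printf '%s\n' apple pear apple | uniq)
# After sorting, the duplicates sit next to each other and collapse.
sorted=$(printf '%s\n' apple pear apple | LC_ALL=C sort | uniq)
printf '%s\n---\n%s\n' "$unsorted" "$sorted"
# apple
# pear
# apple
# ---
# apple
# pear
```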