CHAPTER 24
Text-Processing One-Liners

Even though this book is about using the shell's command language, I use a fair number of calls to other utilities for text processing. Sed, awk, and grep are the primary UNIX text-processing utilities, although I have used others. This chapter gives you a collection of short and useful one-liners that illustrate quite a few methods for gathering specific information from various textual sources.

Very often when writing a script, you need to know source data locations before you start pruning the data for further processing. For instance, you can find the load average of a running Linux system from the first line of the output of the top utility, the output of the uptime command, the output of the w command, and in the /proc/loadavg file. There are almost always multiple ways to gather and process information, and the tools introduced in this chapter should give you an excellent start on knowing what you will need to do in many situations.

For more information about any of these utilities, consult Appendix C of this book or the man pages of individual utilities. This chapter is not intended to cover these utilities exhaustively; several of these utilities have had complete books written about them.

An extremely common use of the utilities discussed in this chapter is to modify or filter a string that is obtained from any one of a number of sources, such as from an environment variable or from output of a system command. For consistency in these examples, the following common variable is echoed and piped to the utility to illustrate the mode of use:

VAR="The quick brown fox jumped over the lazy dog."

Displaying Specific Fields

The following example is a simple awk statement to extract data fields from a string containing a record with multiple fields, assuming that whitespace characters separate the fields. The awk field variables start at $1 and increment up through the end of the string. In our example string, there are nine fields separated by whitespace. The awk positional variable $0 is special in that it holds the value of the whole string. Quite often, the print statement will target only a single field, but this example shows how to extract and reorder several of the input fields:

echo $VAR | awk '{print $1, $8, $4, $5, $6, $7, $3, $9}'

This produces the following output:


The lazy fox jumped over the brown dog.
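
When only a single field is needed, the statement shrinks accordingly:

echo $VAR | awk '{print $3}'

This produces the following output:

brown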

Specifying the Field Separator

Here is another simple use of awk, where the field separator is specified using the -F command-line switch. Using this option causes the source string to be split up based on something other than whitespace. In this case it is the letter o.

echo $VAR | awk -Fo '{print $4}'

This produces the following output:


ver the lazy d

Simple Pattern-Matching

Matching specific fields of the input is very useful in finding data quickly. A grep command can easily return lines that match a given string, but awk can return lines that match a specific value in a specific field. The following example finds and displays all lines whose second field is equal to the string casper in /etc/hosts. The test used for the second field could be changed from equal (==) to not equal (!=) to find the lines in the file that do not contain the string casper in the second field, and more complicated conditions can be constructed in the usual way.

awk '$2 == "casper" {print $0}' /etc/hosts

This produces the following output:


172.16.5.4 casper casper.mydomain.com
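
The test is easily inverted; the following displays every line whose second field is not casper:

awk '$2 != "casper" {print $0}' /etc/hosts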

Matching Fields Against Several Values

Another pattern-matching technique, which is similar to the previous one, is to look for one of several alternatives in a specific field. The example here extends the previous one a bit by looking for lines in my /etc/hosts file whose IP addresses (in field 1) start with either 127 or 172. Note that each alternative between the slashes (/) is separated by the pipe (|) character; this is awk notation for the regular expression specifying the pattern "starting with 127 or starting with 172." The pattern-matching operator ~ could also be replaced with the negated operator !~ to return the lines in the file that don't match the expression.

awk '$1 ~ /^127|^172/ {print $0}' /etc/hosts

This produces the following output:


127.0.0.1 localhost
172.16.5.2 phred phred.mydomain.com
172.16.5.4 casper casper.mydomain.com
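
Likewise, the negated operator returns the remaining lines; this displays the entries whose IP addresses start with neither 127 nor 172:

awk '$1 !~ /^127|^172/ {print $0}' /etc/hosts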

Determining the Number of Fields

This one-liner illustrates the use of a special awk internal variable NF whose value is the number of fields in the current line of input. You may want to try changing the field separator as shown in the earlier example and note the difference in the result.

echo $VAR | awk '{print NF}'

This produces the following output:


9
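
Combining this with the earlier field-separator example shows the difference. Splitting on the letter o yields five fields, because the string contains four o characters:

echo $VAR | awk -Fo '{print NF}'

This produces the following output:

5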

Determining the Last Field

This is a slightly modified version of the previous example; it adds a dollar sign ($) in front of the NF variable. This will print out the value of the last field instead of the number of fields.

echo $VAR | awk '{print $NF}'

The following output results:


dog.

Determining the Second-to-Last Field

We can use NF to get the second-to-last field of the string, as in the next example. This could be easily modified to reference other positions in the input relative to the last field. The previous three examples all relate directly to the standard numeric awk field variables; from our example string, $NF is equal to $9. NF is one layer more abstract than a direct positional reference: expressions such as $NF and $(NF-1) work for input with any number of fields.

echo $VAR | awk '{print $(NF-1)}'

You get the following output:


lazy
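
Moving one more position to the left follows the same pattern:

echo $VAR | awk '{print $(NF-2)}'

You get the following output:

the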

Passing Variables to awk

In some cases you may not know which field you want until the command is run. You can deal with this by passing a value to awk when it is invoked. The following example shows how you can pass the value of the shell variable TheCount to an awk command. The -v switch to awk specifies that you are going to set an awk variable; it is followed by the assignment itself.

TheCount=3
echo $VAR | awk -v counter=$TheCount '{print $counter}'

This produces the following output:


brown

The -v switch is a relatively new option for assigning a variable, and it may not be ideal when you're shooting for portability. In that case, this usage should do the trick:

TheCount=3
echo $VAR | awk '{print $counter}' counter=$TheCount

It produces the following output:


brown

Using a Variable Passed to awk in a Condition

Here is another use of shell variables with the awk command. The NODE=$node assignment sets the internal awk variable NODE to the value of the shell variable $node. The awk command then checks each line of the input file to see whether $2 is equal to the value of NODE; if it is, $3 is output. In this example, the /etc/hosts file was used. The code works like that in the "Simple Pattern-Matching" example shown earlier, except that the value to compare against can be specified independently of the field that is output.

awk -v NODE=$node '$2 == NODE {print $3}' /etc/hosts

The output depends on the contents of your /etc/hosts file, but the intended effect is to display the domain name corresponding to the specified node name. Try setting the node variable to the name of your system before running this command. My system is named casper and this is its hosts file entry:

172.16.5.4 casper casper.mydomain.com

Thus, if the system name stored in the node variable appears in field 2 of some line in the /etc/hosts file, the third field of that line will be displayed. When I run this command after setting the shell variable $node to casper, the output is the third field of the /etc/hosts entry for casper: casper.mydomain.com.
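
Putting the pieces together, a quick test on my system looks like this:

node=casper
awk -v NODE=$node '$2 == NODE {print $3}' /etc/hosts

This produces the following output:

casper.mydomain.com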

Displaying a Range of Fields (Main Method)

Printing a range of fields from an input line usually cannot be expressed with simple syntax. Unless the range is fixed, you generally need to have awk loop through the desired fields, printing each one in turn. In this example, the for loop starts with a fixed field number (here, 3) and ends with the value of the NF variable. You can modify this easily to permit any range. The printf (formatted print) command in the body of the loop prints the current field, followed by a space. The last print statement outside the loop adds a final newline at the end of the output.

echo $VAR | awk '{for(i=3; i<=NF; i++) {printf "%s ",$i}; print ""}'

Here is the output:


brown fox jumped over the lazy dog.

Displaying a Range of Fields (Alternate Method)

One last use of external variables being passed to awk is related to potential problems with awk versions. In some cases, the versions of awk, nawk, or gawk handle the -v switch differently. There are also issues when passing variables whose values contain spaces. Most awk programs given on the command line are contained within single quotes ('). In such cases you can embed the shell variable directly into the awk command by surrounding it with additional single quotes. In the following example, the awk command starts with a single quote and then begins a for loop. The counter variable i is set to the initial value of 3 and the loop continues while i is less than or equal to $end, a shell variable embedded between two single quotes. The first of these quotes ends the initial awk statement, and the shell then expands the value of the $end variable. The second single quote, following the $end variable, reopens the awk command, which includes the loop increment as well as the print statements. The final single quote ends the whole awk statement.

This example is very simple and nearly the same as the range-printing solution. It illustrates the use of a shell variable within an awk command. The differences are that the ending variable ($end) is passed from the shell environment and it is not contained within the single quotes of the awk command. The shell variable $end is set to the value 6.

echo $VAR | awk '{for(i=3; i<='$end'; i++) {printf "%s ",$i}; print ""}'

Here is the output:


brown fox jumped over

Determining the Length of a String Using awk

The length keyword in awk is actually a built-in function; used without an argument, as here, it returns the number of characters in the current line.

echo $VAR | awk '{print length}'

Here's the output:


45

Determining the Length of a String Using expr

Another solution for this task uses the internal length function of expr.

expr length "$VAR"

The following output results:


45

Displaying a Substring with awk

Substring extraction can be performed using a built-in function of awk. The function has the following form:

substr(string, start position, character count)

The following example extracts a substring of three characters from the third field of the VAR variable, starting from the second character in the field.

echo $VAR | awk '{print substr($3,2,3)}'

You get the following output:


row

Displaying a Substring with expr

Here is a method of extracting a substring using expr. It uses the substr() function of expr. As before, the first argument is the string, the second is the position of the desired substring's starting character, and the last is the number of characters in the substring. The example gets 4 characters from the string stored in VAR, starting at character number 12.

expr substr "$VAR" 12 4

The following output results:


rown

Conducting Simple Search and Replace with sed

The following example searches for space characters within each line of input and replaces them with the string %20. The search-and-replace syntax follows the pattern s/search string/replacement string/. The g at the end of the expression is optional; it stands for global and indicates that you want to replace all instances of the search term found in the line. Without the g, the command replaces only the first instance of the search term.

echo $VAR | sed -e "s/ /%20/g"

The following output results:


The%20quick%20brown%20fox%20jumped%20over%20the%20lazy%20dog.
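
Without the trailing g, only the first space is replaced:

echo $VAR | sed -e "s/ /%20/"

The following output results:

The%20quick brown fox jumped over the lazy dog.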

Disregarding Blank and Commented Lines from a File

This example is a little more involved. First it uses a sed command to strip comments from a specified file (here, /etc/ntp.conf). The sed expression matches a pound sign (#) followed by the pattern .*, which denotes "any number of any characters," and replaces the match with nothing. Because the pattern is not anchored to the start of the line, this removes trailing comments as well as whole commented lines, which simply become empty. The sed output is then piped into an awk one-liner that prints only non-null lines (i.e., lines whose length is not 0). The resulting sequence is a quick way to remove all blank and commented entries of a file.

sed -e "s/#.*//g" /etc/ntp.conf | awk '{if(length !=0) print $0}'

The output will, of course, be specific to the file used as input.

Conducting Dual Search and Replace with sed

A more advanced search and replace first checks the input for a string other than the one that is going to be replaced, and performs the search-and-replace operation only if this string is found. For instance, you might have a file in which each line contains a name and address, and you want to change "Portland" to "Gresham" on the lines containing the name Ron Peters.

This can be accomplished using sed by including a pattern before the search expression. Continuing with our "quick brown fox" example, the following code first searches for the word "quick" in the input and then replaces all instances (g) of the string he with the replacement string she on the line if the word was found.

echo $VAR | sed -e "/quick/s/he/she/g"

Here's the output:


Tshe quick brown fox jumped over tshe lazy dog.
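
Returning to the name-and-address scenario, a minimal sketch of that task (assuming the records live in a hypothetical file named addresses.txt) would look like this:

sed -e "/Ron Peters/s/Portland/Gresham/" addresses.txt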

Filtering Lines with sed

Sometimes filtering out certain lines is desirable. For instance, when parsing ps output, you might not want the header line displayed. The following sed example removes the first line from the stdout of a call to ps. This is similar to the head command, but it has the opposite effect: while a head command grabs the specified number of leading lines and drops the rest, our example removes the specified number of initial lines from the output of ps (here, 1) and displays the rest. (You could use the tail command, but you would need to know the total number of lines.) Removing more than the first line is as simple as changing the specified line to a range of lines; to remove the first three lines, you would change 1d to 1,3d.

ps -ef | sed -e '1d'

This produces the following output; the first line below is the header, shown here for reference even though the command actually removes it:


UID    PID  PPID  C   STIME  TTY    TIME CMD
root    1    0    0   22:32   ?  00:00:05 init [5]
root    2    1    0   22:32   ?  00:00:01 [keventd]
root    3    1    0   22:32   ?  00:00:00 [kapmd]
...
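
Removing the first three lines instead is just a matter of supplying the range:

ps -ef | sed -e '1,3d'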

Searching for Multiple Strings with egrep

egrep is a utility that works in much the same way as the traditional grep command (it is equivalent to grep -E). Handily, it will search for more than one string at a time. In this example, I search for any one of three alternative search strings within the /etc/passwd file.

egrep "desktop|mysql|ntp" /etc/passwd

It produces the following output:


ntp:x:38:38::/etc/ntp:/sbin/nologin
desktop:x:80:80:desktop:/var/lib/menu/kde:/sbin/nologin
mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash

A Clean Method of Searching the Process Table

Traditionally a command to find a specific process in the process table would look something like this:

ps -ef | grep some_string

When this command is run, the output includes not only the process data you were looking for, but also an entry for the grep process itself, since the search string also appears in the invocation of grep. To clean up the output, you can add another pipe that removes the grep process entry with the -v switch to grep, like this:

ps -ef | grep some_string | grep -v grep

There is a little trick for performing this task without the additional pipe:

ps -ef | grep "[s]ome_string"

This turns the original search string into a regular expression. The new grep command has the same effect as the previous one because the regular expression still matches the string some_string. The process-table entry for the grep command itself, however, shows the pattern as it was typed, prior to regular-expression evaluation ([s]ome_string). That entry therefore fails to match and is not included in the output.

Summing Columns Using awk

On occasion I've run across the need to add up the values in a column of output. Most of the time this is from a simple directory listing, but it could be any custom data. Here the input is a listing of the *gz files in a directory (although this choice is arbitrary; it could have been any directory listing), which is piped to awk, where the fifth field of each line is added to a running total. The following is the directory listing that will be used as input to the awk command:

$ ls -l *gz
-rwxr--r-- 1 rpeters rpeters     3337 Jul  9 2007   mysqlstatus.tar.gz
-rw-r--r-- 1 rpeters rpeters     1367 Sep 21 2007   countdown.tgz
-rwxr--r-- 1 rpeters rpeters  1214743 Mar 12 12:35  dokuwiki.tgz
-rw-r--r-- 1 root    root        6724 Sep 21 2007   lvm.tar.gz
-rwxr--r-- 1 rpeters rpeters  1043064 Jul 25 2007   rrdtool.tar.gz
-rwxr--r-- 1 rpeters rpeters  5271568 Aug 17 2007   PerlAPI.tar.gz

The awk portion of the one-liner is itself a simple two-part script. The awk utility takes input one line at a time and performs any necessary processing over and over as long as there are lines of input. In this case, the processing takes field number 5 ($5 is the file size) and adds the value of that variable to the total variable.

$ ls -l *gz | awk '{total+=$5} END {print total/1024/1024}'
7.19147

There are two special rules in awk: BEGIN and END. These allow customized processing to happen outside the main input loop, either before or after it runs. In this case we're using the END rule to signify that the summing is complete and awk should move on to postprocessing. The action following END converts the value of the total variable to megabytes and prints the result.
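
A variant that uses both rules might look like this; the BEGIN label text here is an arbitrary choice:

$ ls -l *gz | awk 'BEGIN {print "Total (MB):"} {total+=$5} END {print total/1024/1024}'
Total (MB):
7.19147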

Generating Random Numbers Using awk

I don't use random-number generators very often. However, I have sometimes needed one when writing simple games and when starting multiple tasks at random intervals so that they wouldn't conflict with each other.

The following command generates a random number between 0 and 100. The rand() function of awk generates a number between 0 and 1. The srand() function initializes the random-number generator using the seed value passed to it as an argument. If the seed expression is left out (as in the example here), the time of day is the default value used for the seed. For testing purposes, you may want to remove the srand() function from the code so the "random" number returned won't be random, but rather predictable.

echo | awk '{srand(); print int(100 * rand())}'
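
If you pass srand() a constant seed instead, the sequence of numbers becomes repeatable, which can be useful for testing; the seed value 42 here is an arbitrary choice:

echo | awk '{srand(42); print int(100 * rand())}'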

Generating Random Numbers from the Shell

Both bash and ksh have the ability to generate random numbers. There is a built-in shell variable called RANDOM that you can use for this purpose. This variable will generate a random integer between 0 and 32767 every time it is accessed.

echo $RANDOM
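
To scale the result into a smaller range, shell arithmetic with the modulo operator does the trick; this yields a value between 0 and 99:

echo $((RANDOM % 100))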

Displaying Character-Based Fields with sed

awk is very good at displaying fields separated by whitespace or by specific delimiters. It is more challenging to extract a specific character or range of characters from a string whose length you don't know. You could find the length of the string with awk and then use the cut command to grab specific characters, but that requires more than one command. The same result can be achieved more simply by using sed.

You can use sed to split strings based on character patterns rather than fields. A pattern describes the elements into which the string will be split. These elements are represented by parentheses containing one or more dots (.), which stand for single characters. Each element in the pattern corresponds to a field in the input string when it is split.

The possible elements are shown here:

(.): One character

(.*): An arbitrary number of characters

(...): Here, three consecutive characters; in general, as many consecutive characters as there are dots

The split instruction consists of two parts separated by forward slashes (/) before and after. The first part is the pattern and the second specifies the field or fields from the string that should be displayed, referenced as \1, \2, and so on. When sed is invoked, the entire split instruction, including the pattern, is quoted, and the parentheses in the pattern are escaped with backslashes (\( and \)).

The following examples clarify this technique. In the first example, the first element in the pattern specifies an arbitrary number of characters leading up to the second, final element, which consists of two characters. The dollar sign ($) used here signifies the end of the line or, in this case, the end of the input string. The output is the second field of the input string. Thus this command prints the last two characters in the input string: in our case, the last character of the phrase and the period at the end of the sentence.

echo $VAR | sed 's/\(.*\)\(..\)$/\2/'

Here's the output:


g.

The second example has three elements in the pattern. The first consists of the first four characters in the string. The second consists of all characters apart from the first four, leading up to the final element. The third element consists of the last three characters in the string. The first and third elements are then printed. Note that the fourth character in the output is a space.

echo $VAR | sed 's/\(....\)\(.*\)\(...\)$/\1\3/'

Here's the output:


The og.

Escaping Special Characters

You have seen several occasions in which special characters had to be escaped because they were not to be evaluated using their normal meanings. This occurs frequently in sed operations, particularly replacements. These replacements can be somewhat tricky because of all the backslashes and forward slashes.

The next few examples show the code for several replacement operations. The code works within a script, but because of the way the shell evaluates escape characters, it will not work from the command line, so keep that in mind if you want to test it manually. There are two possibilities for most of these examples. The first uses escapes to search for and replace the special characters. The second uses square brackets ([ and ]) to specify the character in the search.


Note This option doesn't always work, such as when searching for a square bracket or an escape character itself. See Chapter 25 for another method of escaping special characters.


You have to escape all characters that have a special meaning, such as !, @, #, %, ^, ., *, and so on. This example shows how to escape a period:

some_var=`echo $some_var | sed -e s/\\./\\\\./g`
some_var=`echo $some_var | sed -e s/[.]/\\\\./g`

To escape the dollar sign, use the following code:

some_var=`echo $some_var | sed -e s/\\$/\\\\$/g`
some_var=`echo $some_var | sed -e s/[$]/\\\\$/g`

The following lets you escape the ampersand; substitute a parenthesis for the ampersand to escape that character in the same way:

some_var=`echo $some_var | sed -e s/&/\\\\&/g`

To escape forward slashes you use the following code:

some_var=`echo $some_var | sed -e s/\\//\\\\\//g`
some_var=`echo $some_var | sed -e s/[/]/\\\\\//g`

The longest and ugliest of all is escaping backslashes, because you're trying to escape the escape character. The syntax with the square brackets doesn't work in this case and you're left with seven consecutive backslashes as the search string.

some_var=`echo $some_var | sed -e s/\\\/\\\\\\\/g`

Returning Trailing Lines from a Pattern Match Using grep

It's easy to grab certain lines from a file or output identified by running grep with a regular expression. What isn't quite so simple is getting lines that follow the lines matching the grep search expression. Consider a log file in which a particular entry precedes the record of a certain type of sequence of events, and you want to grab both the initial entry and the sequence that follows from the log.

The following awk command performs the task by searching for a line containing the STRING that identifies the initial entry. The getline command can then be used to get the next line of the input. Several getline statements with a print statement between each could be used to retrieve more than one line. You can print each retrieved line, or any field or range of fields within it. This example shows a print statement applying to only the first field of the two lines following the initial line that contains the STRING:

some_command_output | awk '$1 ~ /^STRING/ {getline; print $1; getline; print $1}'

If you want to omit printing selected lines, you would perform several getline commands in a row without a print in between.
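
For example, this skips the line immediately following the match and prints only the first field of the line after that:

some_command_output | awk '$1 ~ /^STRING/ {getline; getline; print $1}'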

Note that STRING is preceded by a caret, ^. This forms a regular expression specifying that STRING occurs at the beginning of the string being matched, $1 in this case. This match works if you want to find text that starts with STRING, possibly followed by additional text. If you want to match STRING exactly, you can add $ to the end of the regular expression to specify the end-of-line character, like so: /^STRING$/. To match STRING anywhere in the line, remove both the ^ and the $.

The current GNU grep utility has the ability to return an arbitrary number of lines following a matching search already built in. This feature is accessed using the -A switch.
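
For example, the following returns each matching line plus the two lines that follow it:

grep -A 2 "STRING" /some/file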

Returning Preceding Lines to a Pattern Match Using grep

This technique is a bit more involved, as you need to cache previous lines and print them out when necessary. I outline three examples here, each more complex than the previous one.

The first command searches through the file for the pattern string. As it processes each line of the file, it saves the line prior to the match, and that line's record number, in the variables p0 and pNR, respectively. (The awk internal variable NR holds the number of the current input record, i.e., the current line number.) When a line containing the pattern string is found, the record number of the previous line (pNR) and the previous line itself (p0) are displayed. Here the record number is simply used to display the number of the line that is being output.

awk '/pattern/{print pNR, p0}{pNR=NR;p0=$0}' /some/file

The next example works in almost the same way except that it saves and prints the two lines preceding each line that matches the pattern string.

awk '/pattern/{print ppNR, pp0," ", pNR, p0}
{ppNR=pNR;pNR=NR;pp0=p0;p0=$0}' /some/file

The last example pushes the limits of what I would consider a reasonable one-liner. It grabs and retains previous lines as in the first two examples, but in a more general fashion. Instead of using distinct variables as in the second example, it saves the data in an array that it populates with a loop.

The first part of the command is a for loop that iterates through an array containing the previous lines of input. The loop shifts each element down one position (a[j]=a[j+1]) and then saves the current line ($0) in the highest element of the array. This loop is executed once for each line in the file.

Once it finds the pattern string, it prints an entry delimiter Entry. (This isn't required, but the final output can easily get confusing without some type of demarcation between groupings of lines.) Then another loop iterates through and prints out the lines stored in the array.

To configure this command for your own purposes, change the upper value of the j loop to the number of previous lines you want to return, plus 1. If you want the 3 previous lines for each of the pattern entries, set this value to 4. Also change the index in the array assignment that stores the current line (a[4]=$0) to the same value.

Next modify the k loop, which runs from 1 through the number of previous lines you want printed. If you want 3 as before, use 3 as its upper limit; this prints a[1] through a[3]. If you would like to include the pattern line as well as the previous lines, make the upper limit of the k loop the same as that of the j loop. In this case it would be 4.

Since this is a somewhat complex command sequence, a generic example and a more specific one are presented here. It is worth some experimentation time to get a feel for how this works.

Here's the generic example:

awk '{for(j=0;j<='prevlines+1';j++){a[j]=a[j+1];a['prevlines+1']=$0}}
  /some pattern/{{print "Entry"}
  {for(k=1;k<='prevlines or prevlines+1';k++)
  {print a[k]}}}' /some/file

And here's the more specific example:

awk '{for(j=0;j<=4;j++){a[j]=a[j+1];a[4]=$0}}
  /some pattern/{{print "Entry"}{for(k=1;k<=3;k++)
  {print a[k]}}}' /some/file



Note The current GNU version of the grep utility has the ability to return an arbitrary number of lines found immediately prior to the actual matched lines found. You can access this feature via the -B switch.
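
For example, this returns each matching line preceded by the three lines before it:

grep -B 3 "pattern" /some/file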

