© David Both 2020
David Both, Using and Administering Linux: Volume 2, https://doi.org/10.1007/978-1-4842-5455-4_6

6. Regular Expressions

David Both, Raleigh, NC, USA

Objectives

In this chapter you will learn
  • To define the term “regular expression”
  • To describe the purpose of regular expressions and extended regular expressions
  • To differentiate between different styles of regular expressions as used by different tools
  • To state the difference between basic regular expressions and extended regular expressions
  • To identify and use many of the metacharacters and expressions used to build regular expressions for typical administrative tasks
  • To use regular expressions and extended regular expressions with tools like grep and sed

Introducing regular expressions

In Volume 1 of this course, we explored the use of file name globbing using wildcard characters like * and ? as a means to select specific files or lines of data from a data stream. We have also seen how to use brace expansion and sets to provide more flexibility in matching more complex patterns. These tools are powerful and I use them many times a day. Yet there are things that cannot be done with wildcards.
Regular expressions (REGEXes or REs) provide us with more complex and flexible pattern matching capabilities. Just as certain characters take on special meaning when using file globbing, REs also have special characters. There are two main types of regular expressions (REs), basic regular expressions (BREs) and extended regular expressions (EREs).
The first thing we need is some definitions. There are many definitions for the term “regular expressions,” but many are dry and uninformative. Here are mine:
  • Regular expressions are strings of literal and metacharacters that can be used as patterns by various Linux utilities to match strings of ASCII plain text data in a data stream. When a match occurs, it can be used to extract or eliminate a line of data from the stream or to modify the matched string in some way.
  • Basic regular expressions (BREs) and extended regular expressions (EREs) are not significantly different in terms of functionality.1 The primary difference is in the syntax used and how metacharacters are specified. In basic regular expressions, the metacharacters “?”, “+”, “{”, “|”, “(”, and “)” lose their special meaning; instead, it is necessary to use the backslashed versions “\?”, “\+”, “\{”, “\|”, “\(”, and “\)”. The ERE syntax is believed by many users to be easier to use.
Regular expressions (REs)2 take the concept of using metacharacters to match patterns in data streams much further than file globbing and give us even more control over the items we select from a data stream. REs are used by various tools to parse3 a data stream to match patterns of characters in order to perform some transformation on the data.
Regular expressions have a reputation for being obscure and arcane incantations that only those with special wizardly SysAdmin powers use. Figure 6-1 would seem to confirm this. The command pipeline appears to be an intractable sequence of meaningless gibberish to anyone without the knowledge of regex. It certainly seemed that way to me the first time I encountered something similar early in my career. As you will see, it is actually relatively simple once it is all explained.
Figure 6-1
A real-world sample of the use of regular expressions. It is actually a single line that I used to transform a file that was sent to me into a usable form:
cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
We can only begin to touch upon all of the possibilities opened to us by regular expressions in a single chapter. There are entire books devoted exclusively to regular expressions so we will explore the basics in this chapter – just enough to get started with tasks common to SysAdmins.

Getting started

Now we need a real-world example to use as a learning tool. Here is one I encountered several years ago.

The mailing list

This example highlights the power and flexibility of the Linux command line, especially regular expressions, and their ability to automate common tasks. I have administered several listservs during my career and still do. People send me lists of email addresses to add to those lists. In more than one case, I have received a list of names and email addresses in a Word document that were to be added to one of the lists.
The list itself was not really very long, but it was very inconsistent in its formatting. An abbreviated version of that list, with name and domain changes, is shown in Figure 6-2. The original list has extra lines, characters like brackets and parentheses that need to be deleted, whitespace such as spaces and tabs, and some empty lines. The format required to add these emails to the list is first last <[email protected]>. Our task is to transform this list into a format usable by the mailing list software.
Figure 6-2
A partial, modified listing of the document of email addresses to add to a listserv
It was obvious that I needed to manipulate the data in order to mangle it into an acceptable format for inputting to the list. It is possible to use a text editor or a word processor such as LibreOffice Writer to make the necessary changes to this small file. However, people send me files like this quite often so it becomes a chore to use a word processor to make these changes. Despite the fact that Writer has a good search and replace function, each character or string must be replaced singly and there is no way to save previous searches. Writer does have a very powerful macro feature, but I am not familiar with either of its two languages, LibreOffice Basic or Python. I do know Bash shell programming.

The first solution

I did what comes naturally to a SysAdmin – I automated the task. The first thing I did was to copy the address data to a text file so I could work on it using command-line tools. After a few minutes of work, I developed the Bash command-line program in Figure 6-1 that produced the desired output as the file, addresses.txt. I used my normal approach to writing command-line programs like this by building up the pipeline one command at a time.
Let’s break this pipeline down into its component parts to see how it works and fits together. All of the experiments in this chapter are to be performed as the student user.
Experiment 6-1
First we download the sample file Experiment_6-1.txt from the Apress GitHub web site. Let’s do all of this work in a new directory so we will create that too.
[student@studentvm1 ~]$ mkdir chapter6 ; cd chapter6
[student@studentvm1 chapter6]$ wget https://raw.githubusercontent.com/Apress/using-and-administering-linux-volume-2/master/Experiment_6-1.txt
Now we just take a look at the file and see what we need to do.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt
Team 1  Apr 3
Leader  Virginia Jones  [email protected]
Frank Brown  [email protected]
Cindy Williams  [email protected]
Marge smith   [email protected]
 [Fred Mack]   [email protected]
Team 2  March 14
leader  Alice Wonder  [email protected]
John broth  [email protected]
Ray Clarkson  [email protected]
Kim West    [email protected]
[JoAnne Blank]  [email protected]
Team 3  Apr 1
Leader  Steve Jones  [email protected]
Bullwinkle Moose [email protected]
Rocket Squirrel [email protected]
Julie Lisbon  [email protected]
[Mary Lastware) [email protected]
[student@studentvm1 chapter6]$
The first things I see that can be done are a couple of easy ones. Since the team names and dates are on lines by themselves, we can use the following to remove those lines that contain the word “Team.” I place the end-of-sentence period outside the quotes for clarity, to ensure that only the intended string is inside the quotes.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt | grep -v Team
I won’t reproduce the results of each stage of building this Bash program, but you should be able to see the changes in the data stream as it shows up on STDOUT, the terminal session. We won’t save it in a file until the end.
In this first step in transforming the data stream into one that is usable, we use the grep command with a simple literal pattern, “Team.” Literals are the most basic type of pattern we can use as a regular expression because there is only a single possible match in the data stream being searched, and that is the string “Team”.
We need to discard empty lines, so we can use another grep statement to eliminate them. I find that enclosing the regular expression for the second grep command in quotes ensures that it gets interpreted properly.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$"
Leader  Virginia Jones  [email protected]
Frank Brown  [email protected]
Cindy Williams  [email protected]
Marge smith   [email protected]
 [Fred Mack]   [email protected]
leader  Alice Wonder  [email protected]
John broth  [email protected]
Ray Clarkson  [email protected]
Kim West    [email protected]
[JoAnne Blank]  [email protected]
Leader  Steve Jones  [email protected]
Bullwinkle Moose [email protected]
Rocket Squirrel [email protected]
Julie Lisbon  [email protected]
[Mary Lastware) [email protected]
[student@studentvm1 chapter6]$
The expression "^\s*$" illustrates anchors and using the backslash (\) as an escape character to change the meaning of a literal, “s” in this case, to a metacharacter that means any whitespace such as spaces, tabs, or other characters that are unprintable. We cannot see these characters in the file, but it does contain some of them. The asterisk, a.k.a. splat (*), specifies that we are to match zero or more of the whitespace characters. This would match multiple tabs or multiple spaces or any combination of those in an otherwise empty line.
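The effect of this expression can be checked on a tiny synthetic stream. This sketch is separate from the experiment's sample file; the input lines are fabricated:

```shell
# Line 2 contains only a space and a tab, so "^\s*$" matches it
# and grep -v drops it from the output. \s is a GNU grep extension
# equivalent to [[:space:]].
printf 'line one\n \t \nline two\n' | grep -v "^\s*$"
```

Only "line one" and "line two" survive; the whitespace-only line is gone.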
I configured my Vim editor to display whitespace using visible characters. Do this by adding the following line to your own ~/.vimrc or the global /etc/vimrc file. Then start – or restart – Vim.
set listchars=eol:$,nbsp:_,tab:<->,trail:~,extends:>,space:+
I found a lot of bad, very incomplete, and contradictory information on the Internet in my searches on how to do this. The built-in Vim help has the best information, and the data line I have created from that here is one that works for me.
The result, before any operation on the file, is shown in Figure 6-3. Regular spaces are shown as +; tabs are shown as <, <>, or <-->, and fill the length of the space that the tab covers. The end of line (EOL) character is shown as $.
Figure 6-3
The Experiment_6-1.txt file showing all of the embedded whitespace
You can see that there are a lot of whitespace characters that need to be removed from our file. We also need to get rid of the word “leader,” which appears twice and is capitalized once. Let’s get rid of “leader” first. This time we will use sed (stream editor) to perform this task by substituting a new string – or a null string in our case – for the pattern it matches. Adding sed -e "s/[Ll]eader//" to the pipeline does this.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//"
In this sed command, -e means that the quote enclosed expression is a script that produces a desired result. In the expression the s means that this is a substitution. The basic form of a substitution is s/regex/replacement string/. So /[Ll]eader/ is our search string. The set [Ll] matches L or l so [Ll]eader matches leader or Leader. In this case the replacement string is null because it looks like this - // - a double forward slash with no characters or whitespace between the two slashes.
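The substitution syntax can be tried in isolation on a fabricated line or two; a quick sketch, independent of the sample file:

```shell
# The set [Ll] matches either "L" or "l", so a single substitution
# removes both capitalizations. The replacement between // is null.
printf 'Leader  Virginia\nleader  Alice\n' | sed -e "s/[Ll]eader//"
```

Each output line keeps its remaining text, including the leading whitespace that a later stage of the pipeline will discard.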
Now let’s get rid of some of the extraneous characters like []() that will not be needed.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g"
We have added four new expressions to the sed statement. Each one removes a single character. The first of these additional expressions is a bit different. Because the left square brace [ character can mark the beginning of a set, we need to escape it to ensure that sed interprets it correctly as a regular character and not a special one.
We could use sed to remove the leading spaces from some of the lines, but the awk command can do that as well as reorder the fields if necessary, and add the <> characters around the email address.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
The awk utility is actually a very powerful programming language that can accept data streams on its STDIN. This makes it extremely useful in command-line programs and scripts.
The awk utility works on data fields and the default field separator is spaces – any amount of whitespace. The data stream we have created so far has three fields separated by whitespace, first, last, and email. This little program awk '{print $1" "$2" <"$3">"}' takes each of the three fields, $1, $2, and $3, and extracts them without leading or trailing whitespace. It then prints them in sequence adding a single space between each as well as the <> characters needed to enclose the email address.
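The whitespace handling can be seen on a single fabricated line; the name and address here are made up for the sketch:

```shell
# Leading whitespace and the embedded tab are ignored; awk splits on any
# run of whitespace and rebuilds the line with single spaces and the <>
# wrapper around the third field.
printf '  Frank   Brown\tfrank.brown@example.com\n' | awk '{print $1" "$2" <"$3">"}'
# Frank Brown <frank.brown@example.com>
```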
The last step here would be to redirect the output data stream to a file, but that is trivial, so I leave it to you to perform that step. It is not really necessary that you do so.
I saved the Bash program in an executable file and now I can run this program any time I receive a new list. Some of those lists are fairly short, as is the one in Figure 6-3, but others have been quite long, sometimes containing up to several hundred addresses and many lines of “stuff” that do not contain addresses to be added to the list.

The second solution

But now that we have a working solution, one that is a step-by-step exploration of the tools we are using, we can do quite a bit more to perform the same task in a more compact and optimized command-line program.
Experiment 6-2
In this experiment we explore ways in which we can shorten and simplify the command-line program from Experiment 6-1. The final result of that experiment was the following CLI program.
cat Experiment_6-1.txt | grep -v Team | grep -v "^\s*$" | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
Let’s start near the beginning and combine the two grep statements. The result is shorter and more succinct. It also means faster execution because grep only needs to parse the data stream once.
Tip When the STDOUT from grep is not piped through another utility and when using a terminal emulator that supports color, the regex matches are highlighted in the output data stream. The default for the Xfce4-terminal is a black background, white text, and highlighted text in red.
In the revised grep command, grep -vE "Team|^\s*$", we add the -E option, which specifies an extended regex. According to the grep man page, “In GNU grep there is no difference in available functionality between basic and extended syntaxes.” That statement is about functionality, not syntax: our new combined expression fails without the -E option because, in a basic regex, the alternation operator must be written \| rather than |. Run the following to see the results.
[student@studentvm1 chapter6]$ cat Experiment_6-1.txt | grep -vE "Team|^\s*$"
Try it without the -E option.
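The functional equivalence the man page describes can be demonstrated: the same alternation works as a basic regex if the | operator is escaped. A minimal sketch on fabricated input:

```shell
# ERE form: | is a metacharacter as written.
printf 'Team 1\nkeep me\n' | grep -vE "Team|^\s*$"

# BRE form: the same alternation must be spelled \| instead.
# (Both \| and \s are GNU grep extensions.)
printf 'Team 1\nkeep me\n' | grep -v "Team\|^\s*$"
```

Both commands print only "keep me".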
The grep tool can also read data from a file, so we can eliminate the cat command.
[student@studentvm1 chapter6]$ grep -vE "Team|^\s*$" Experiment_6-1.txt
This leaves us with the following, somewhat simplified CLI program.
grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
We can also simplify the sed command, and we will do so in Experiment 6-6 after we learn more about regular expressions.
It is important to realize that my solution is not the only one. There are different methods in Bash for producing the same output; there are other languages like Python and Perl that can also be used. And, of course, there are always LibreOffice Writer macros. But I can always count on Bash as part of any Linux distribution. I can perform these tasks using Bash programs on any Linux computer, even one without a GUI desktop or that does not have LibreOffice installed.

grep

Because GNU grep is one of the tools I use most, and because it provides a more or less standardized implementation of regular expressions, I will use its set of expressions as the basis for the next part of this chapter. We will then look at sed, another tool that uses regular expressions.
Throughout this self-study course, you will have already encountered globs and regexes. Along with the previous experiments in this chapter, you should have at least a basic understanding of regexes and how they work. However, there are many details that are important to understanding some of the complexity of regex implementations and how they work.

Data flow

All implementations of regular expressions are line based. A pattern created by a combination of one or more expressions is compared against each line of a data stream. When a match is made, an action is taken on that line as prescribed by the tool being used. For example, when a pattern match occurs with grep, the usual action is to pass that line on to STDOUT and lines that do not match the pattern are discarded. As we have seen, the -v option reverses those actions so that the lines with matches are discarded.
Each line of the data stream is evaluated on its own, and the results of matching the expressions in the pattern with the data from previous lines are not carried over. It might be helpful to think of each line of a data stream as a record and that the tools that use regexes process one record at a time. When a match is made, an action defined by the tool in use is taken on the line that contains the matching string.
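A one-line check makes the record-at-a-time model concrete; this sketch is not part of the chapter's experiments:

```shell
# The pattern "one two" can never match here: grep evaluates each line
# independently, and the newline between "one" and "two" is a record
# boundary, so the count (-c) is 0.
printf 'one\ntwo\n' | grep -c "one two"
```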

regex building blocks

Figure 6-4 contains a list of the basic building block expressions and metacharacters implemented by the GNU grep command and their descriptions. When used in a pattern, each of these expressions or metacharacters matches a single character in the data stream being parsed.
Figure 6-4
These expressions and metacharacters are implemented by grep and most other regex implementations
Let’s explore these building blocks before continuing on with some of the modifiers. The text file we will use for Experiment 6-3 is from a lab project I created for an old Linux class I used to teach. It was originally in a LibreOffice Writer ODT file, but I saved it to an ASCII text file. Most of the formatting of things like tables was removed, but the result is a long ASCII text file that we can use for this series of experiments.
Experiment 6-3
We must download the sample file from the Apress GitHub web site. If the directory ~/chapter6 is not the PWD, make it so. This is a document containing the lab projects from an old course that I used to teach.
[student@studentvm1 chapter6]$ wget https://raw.githubusercontent.com/Apress/using-and-administering-linux-volume-2/master/Experiment_6-3.txt
To begin, just use the less command to look at and explore the Experiment_6-3.txt file for a few minutes so you have an idea of its content.
Now we will use some simple expressions in grep to extract lines from the input data stream. The Table of Contents (TOC) contains a list of projects and their respective page numbers in the PDF document. Let’s extract the TOC starting with lines ending in two digits.
[student@studentvm1 chapter6]$ grep [0-9][0-9]$ Experiment_6-3.txt
That is not really what we want. It displays all lines that end in two digits and misses TOC entries with only one digit. We will look at how to deal with an expression for one or more digits in a later experiment. Looking at the whole file in less, we could do something like this.
[student@studentvm1 chapter6]$ grep "^Lab Project" Experiment_6-3.txt | grep "[0-9]$"
This is much closer to what we want but it is not quite there. We get some lines from later in the document that also match these expressions. If you study the extra lines and look at those in the complete document, you can see why they match while not being part of the TOC. This also misses TOC entries that do not start with “Lab Project.” Sometimes this is the best you can do, but it does give a better look at the TOC than we had before. We will look at how to combine these two grep instances into a single one in a later experiment in this chapter.
Now let’s modify this a bit and use a POSIX expression. Notice the double square brackets around the POSIX class; single brackets generate an error message.
[student@studentvm1 chapter6]$ grep "^Lab Project" Experiment_6-3.txt | grep "[[:digit:]]$"
This gives the same results as the previous attempt. Let’s look for something different.
[student@studentvm1 chapter6]$ grep systemd Experiment_6-3.txt
This lists all occurrences of “systemd” in the file. Try using the -i option to ensure that you get all instances including those that start with uppercase.4 Or you could just change the literal expression to Systemd. Count the number of lines with the string systemd contained in them. I always use -i to ensure that all instances of the search expression are found regardless of case.
[student@studentvm1 chapter6]$ grep -i systemd Experiment_6-3.txt | wc
     20     478    3098
As you can see I have 20 lines and you should have the same number.
Here is an example of matching a metacharacter, the left bracket ([). First let’s try it without doing anything special.
[student@studentvm1 chapter6]$ grep -i "[" Experiment_6-3.txt
grep: Invalid regular expression
This occurs because [ is interpreted as a metacharacter. We need to “escape” this character with a backslash so that it is interpreted as a literal character and not as a metacharacter.
[student@studentvm1 chapter6]$ grep -i "\[" Experiment_6-3.txt
Most metacharacters lose their special meaning when used inside bracket expressions. To include a literal ], place it first in the list. To include a literal ^, place it anywhere but first. To include a literal [, place it last.
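Each of those placement rules can be verified with a quick one-liner. This is a sketch using grep -o, which prints only the matched text; the inputs are fabricated:

```shell
echo 'a]b' | grep -o "[]x]"   # ] placed first in the list: matches the literal ]
echo 'a^b' | grep -o "[x^]"   # ^ anywhere but first: matches the literal ^
echo 'a[b' | grep -o "[x[]"   # [ placed last in the list: matches the literal [
```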

Repetition

Regular expressions may be modified using some operators that allow specification of zero, one, or more repetitions of a character or expression. These repetition operators, shown in Figure 6-5, are placed immediately following the literal character or metacharacter used in the pattern.
Figure 6-5
Metacharacter modifiers that specify repetition
Experiment 6-4
Run each of the following commands and examine the results carefully so that you understand what is happening.
[student@studentvm1 chapter6]$ grep -E files? Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "drives*" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "drives+" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "drives{2}" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "drives{2,}" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "drives{,2}" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "drives{2,3}" Experiment_6-3.txt
Be sure to experiment with these modifiers on other text in the sample file.
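If you want to see the repetition operators work in isolation before applying them to the sample file, here is a sketch on fabricated input, again using grep -o to print only the matches:

```shell
# ? makes the preceding "s" optional: both words match.
echo "file files" | grep -oE "files?"

# {2} demands exactly two repetitions of "s": only "drivess" matches.
echo "drives drivess" | grep -oE "drives{2}"
```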

Other metacharacters

There are still some interesting and important modifiers that we need to explore. These metacharacters are listed and described in Figure 6-6.
Figure 6-6
Metacharacter modifiers
We now have a way to specify word boundaries with the \< and \> metacharacters. This means we can now be even more explicit with our patterns. We can also use some logic in more complex patterns.
Experiment 6-5
Start with a couple of simple patterns. This first one selects all instances of “drives” but not “drive,” “drivess,” or forms with additional trailing “s” characters.
[student@studentvm1 chapter6]$ grep -Ei "\<drives\>" Experiment_6-3.txt
Now let’s build up a search pattern to locate references to tar, the tape archive command, and related references. The first two iterations display more than just tar-related lines.
[student@studentvm1 chapter6]$ grep -Ei "tar" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ei "\<tar" Experiment_6-3.txt
[student@studentvm1 chapter6]$ grep -Ein "\<tar\>" Experiment_6-3.txt
The -n option in the last command displays the line numbers of each line in which a match occurred. This can assist in locating specific instances of the search pattern.
Tip Matching lines of data can extend beyond a single screen, especially when searching a large file. You can pipe the resulting data stream through the less utility and then use the less search facility, which also implements regexes, to highlight the occurrences of matches to the search pattern. The search argument in less is \<tar\>.
This next pattern searches for “shell script” or “shell program” or “shell variable” or “shell environment” or “shell prompt” in our test document. The parentheses alter the logical order in which the pattern comparisons are resolved.
[student@studentvm1 chapter6]$ grep -Eni "\<shell (script|program|variable|environment|prompt)" Experiment_6-3.txt
Remove the parentheses from the preceding command and run it again to see the difference.
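The difference the parentheses make can be reproduced on a fabricated line; a sketch:

```shell
line="the shell prompt and a shell script"

# Grouped: "shell " distributes over every alternative.
echo "$line" | grep -oE "\<shell (script|prompt)\>"

# Ungrouped: the alternation splits the whole pattern into
# "\<shell script" or "prompt\>", so a bare "prompt" also matches.
echo "$line" | grep -oE "\<shell script|prompt\>"
```

The first command matches "shell prompt" and "shell script"; the second matches "prompt" and "shell script".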
Although we have now explored the basic building blocks of regular expressions in grep, there are an infinite variety of ways in which they can be combined to create complex yet elegant search patterns. However, grep is a search tool and does not provide any direct capability to edit or modify the contents of a line of text in the data stream when a match is made.

sed

The sed utility not only allows searching for text that matches a regex pattern, it can also modify, delete, or replace the matched text. I use sed at the command line and in Bash shell scripts as a fast and easy way to locate text and alter it in some way. The name sed stands for stream editor because it operates on data streams in the same manner as other tools that can transform a data stream. Most of those changes simply involve selecting specific lines from the data stream and passing them on to another transformer5 program.
We have already seen sed in action, but now, with an understanding of regular expressions, we can better analyze and understand our earlier usage.
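Besides substitution, sed has a delete command, d, which drops whole matching lines the way grep -v does. This sketch on fabricated input hints at how the grep stage of our pipeline could be replaced, something one of the end-of-chapter exercises asks you to do:

```shell
# /Team/d deletes any line containing "Team"; the s command still
# performs substitution on the lines that remain.
printf 'Team 1\nLeader Alice\n' | sed -e "/Team/d" -e "s/[Ll]eader//"
```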
Experiment 6-6
In Experiment 6-2 we simplified the CLI program we used to transform a list of names and email addresses into a form that can be used as input to a listserv. That CLI program looks like this after some simplification.
grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/\[//g" -e "s/]//g" -e "s/)//g" -e "s/(//g" | awk '{print $1" "$2" <"$3">"}'
It is possible to combine four of the five expressions used in the sed command into a single expression. The sed command now has two expressions instead of five.
sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"
This does make the combined expression a bit more difficult to understand. Note that no matter how many expressions a single sed command contains, the data stream is only parsed once to match all of the expressions.
Let’s examine the revised expression, -e "s/[]()\[]//g", more closely. By default, sed interprets a [ character as the beginning of a set and the last ] character as the end of that set. The intervening ] characters are not interpreted as metacharacters. Since we need to match [ as a literal character in order to remove it from the data stream, and sed normally interprets [ as a metacharacter, we escape it as \[ so that it is interpreted as a literal [. Let’s plug this into the CLI script and test it.
[student@studentvm1 chapter6]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g"
I know that you are asking, “Why not place the \[ after the [ that opens the set and before the ] character?” Try it as I did.
[student@studentvm1 chapter6]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[\[]()]//g"
I think that should work, but it does not. Little unexpected results like this make it clear that we must be careful and test each regex carefully to ensure that it actually does what we intend. After some experimentation of my own, I discovered that the escaped left square brace \[ works fine in all positions of the expression except for the first one. This behavior is noted in the grep man page, which I probably should have read first. However, I find that experimentation reinforces the things I read, and I usually discover more interesting things than those for which I was looking.
Adding the last component, the awk statement, our optimized program looks like this and the results are exactly what we want.
[student@studentvm1 chapter6]$ grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g" | awk '{print $1" "$2" <"$3">"}'
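The combined set can be spot-checked on a single fabricated line (the name and address here are made up) before trusting it in the pipeline:

```shell
# The set lists ] first, then ( and ), then the escaped [ last, so all
# four bracket and parenthesis characters are removed in one
# substitution.
echo "[Mary Lastware) mary@example.com" | sed -e "s/[]()\[]//g"
# Mary Lastware mary@example.com
```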

Other tools that implement regular expressions

Many Linux tools implement regular expressions. Most of those implementations are very similar to those of awk, grep, and sed, so it should be easy to learn the differences. Although we have not looked in detail at awk, it is a powerful text processing language that also implements regexes.
Most of the more advanced text editors use regexes. Vim, gVim, Kate, and GNU Emacs are no exceptions. The less utility implements regexes as does the search and replace facility of LibreOffice Writer.
Programming languages like Perl, awk, and Python also contain implementations of regexes which makes them well suited to writing tools for text manipulation.

Resources

I have found some excellent resources for learning about regular expressions. There are more than I have listed here, but these are the ones I have found to be particularly useful.
The grep man page has a good reference but is not appropriate for learning about regular expressions. The O’Reilly book, Mastering Regular Expressions,6 is a very good tutorial and reference for regular expressions. I recommend it for anyone who is or wants to be a Linux SysAdmin because you will use regular expressions. Another good O’Reilly book is sed & awk7 which covers both of these powerful tools, and it also has an excellent discussion of regular expressions.
There are also some good web sites that can help you learn about regular expressions and which provide interesting and useful cookbook-style regex examples. There are some that ask for money in return for using them. Jason Baker, my technical reviewer for Volumes 1 and 2 of this course, suggests https://regexcrossword.com/ as a good learning tool.

Chapter summary

This chapter has provided a very brief introduction to the complex world of regular expressions. We have explored the regex implementation in the grep utility in just enough depth to give you an idea of some of the amazing things that can be accomplished with regexes. We have also looked at several Linux tools and programming languages that also implement regexes.
But make no mistake! We have only scratched the surface of these tools and regular expressions. There is much more to learn and there are some excellent resources for doing so.

Exercises

Perform these exercises to complete this chapter:
  1. 1.
    In Experiment 6-1 we included a sed search for the ( character even though there was not one in the Experiment_6-1.txt data file. Why do you think that might be a good idea?
     
  2. 2.
    Consider the following problem regarding Experiments 6-1 and 6-2. What would happen to the resulting data stream if one or more lines had a different data format such as first, middle, last, or if it were last, first?
     
  3. 3.
    The following regex is used in Experiment 6-5: grep -Eni "\<shell (script|program|variable|environment|prompt)" Experiment_6-3.txt. Create a statement of the logic defined by this regex. Then create a statement of the logic of this regex with the parentheses removed.
     
  4. 4.
    The grep utility has an option that can be used to specify that only whole words are to be matched so that the \< and \> metacharacters are not required. In Experiment 6-5, eliminate the word-boundary metacharacters using that option and test the result.
     
  5. 5.
    Use the sed command to replace the grep command in the last iteration of the CLI program in Experiment 6-6: grep -vE "Team|^\s*$" Experiment_6-1.txt | sed -e "s/[Ll]eader//" -e "s/[]()\[]//g" | awk '{print $1" "$2" <"$3">"}'.
     
Footnotes
1
See the grep info page in Section 3.6 Basic vs. Extended Regular Expressions.
 
2
When I talk about regular expressions, in a general sense I usually mean to include both basic and extended regular expressions. If there is a differentiation to be made, I will use the acronyms BRE for basic regular expression and ERE for extended regular expression.
 
3
One general meaning of parse is to examine something by studying its component parts. For our purposes, we parse a data stream to locate sequences of characters that match a specified pattern.
 
4
The official form of systemd is all lowercase.
 
5
Many people call tools like grep “filter” programs because they filter unwanted lines out of the data stream. I prefer the term “transformers” because ones such as sed and awk do more than just filter. They can test the content for various string combinations and alter the matching content in many different ways. Tools like sort, head, tail, uniq, fmt, and more all transform the data stream in some way.
 
6
Friedl, Jeffrey E. F., Mastering Regular Expressions, O’Reilly, 2012, Paperback ISBN-13: 978-0596528126
 
7
Robbins, Arnold, and Dougherty, Dale, sed & awk: UNIX Power Tools (Nutshell Handbooks), O’Reilly, 2012, ISBN-13: 978-1565922259
 