Character matching

We now know how we can search for whole words, even if we're not entirely sure about uppercase and lowercase yet.

We've also seen that regular expressions under (most) Linux applications are greedy, so we need to be sure that we're dealing with this properly by specifying whitespace and character anchors, which we will explain shortly.

In both these cases, we knew what we were looking for. But what if we do not really know what we are looking for, or perhaps only part of it? The answer to this dilemma is character matching.

In regular expressions, two special characters let us match text we cannot spell out exactly:

  • . (dot) matches any one character (except a newline)
  • * (asterisk) matches any number of repeats of the preceding character (including zero)

An example will help in understanding this:

reader@ubuntu:~/scripts/chapter_10$ vim character-class.txt 
reader@ubuntu:~/scripts/chapter_10$ cat character-class.txt
eee
e2e
e e
aaa
a2a
a a
aabb
reader@ubuntu:~/scripts/chapter_10$ grep 'e.e' character-class.txt
eee
e2e
e e
reader@ubuntu:~/scripts/chapter_10$ grep 'aaa*' character-class.txt
aaa
aabb
reader@ubuntu:~/scripts/chapter_10$ grep 'aab*' character-class.txt
aaa
aabb

A lot of things happened there, some of which may feel very counter-intuitive. We'll walk through them one by one and go into detail on what is happening:

reader@ubuntu:~/scripts/chapter_10$ grep 'e.e' character-class.txt 
eee
e2e
e e

In this example, we use the dot to substitute for any character. As we can see, this includes both letters (eee) and numbers (e2e). However, it also matches the space character between the two es on the last line.
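
As a small sketch of the same idea, using made-up input lines rather than the chapter's file, the following shows that the dot stands for exactly one character. The line ee has nothing between the two es and e22e has two characters there, so neither should match:

```shell
# Illustrative input on standard input; these lines are not from character-class.txt.
dot_matches=$(printf 'ee\ne2e\ne e\ne22e\n' | grep 'e.e')
printf '%s\n' "$dot_matches"
# prints:
# e2e
# e e
```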

Here's another example:

reader@ubuntu:~/scripts/chapter_10$ grep 'aaa*' character-class.txt 
aaa
aabb

When we use the * notation, we're looking for zero or more instances of the preceding character. In the search pattern aaa*, this means the following strings are valid:

  • aa
  • aaa
  • aaaa
  • aaaaa

... and so on. While everything after the first result should be clear, why does aa also match aaa*? Because of the zero in zero or more! If the final a occurs zero times, we are left with only aa.

The same thing happens in the last example:

reader@ubuntu:~/scripts/chapter_10$ grep 'aab*' character-class.txt 
aaa
aabb

The pattern aab* matches the aa within aaa, since the b* can be zero, which makes the pattern end up as aa. Of course, it also matches one or more bs (aabb is fully matched).
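
To see which part of a line a pattern actually matched, grep's -o option prints only the matched text instead of the whole line. The sketch below assumes a grep that supports -o (GNU grep does) and uses illustrative input lines:

```shell
# -o prints only the matched part of each line.
# The input lines are made up for this sketch.
star_a=$(printf 'aa\naaaaa\nab\n' | grep -o 'aaa*')
star_b=$(printf 'aaa\naabb\n' | grep -o 'aab*')

printf '%s\n' "$star_a"   # aa, then aaaaa; ab has only one a and is skipped
printf '%s\n' "$star_b"   # aa (from aaa), then aabb
```

Note that aaa yields only aa for the pattern aab*: the b* part matched zero bs, and the single a that remains is too short to match again.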

These wildcards are great when you have only a general idea about what you're looking for. Sometimes, however, you will have a more specific idea of what you need.

In this case, we can use brackets, [...], to narrow our substitution to a certain character set. The following example should give you a good idea of how to use this:

reader@ubuntu:~/scripts/chapter_10$ grep 'f.r' grep-file.txt 
We can use this regular file for testing grep.
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ grep 'f[ao]r' grep-file.txt
We can use this regular file for testing grep.
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ grep 'f[abcdefghijklmnopqrstuvwxyz]r' grep-file.txt
We can use this regular file for testing grep.
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ grep 'f[az]r' grep-file.txt
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ grep 'f[a-z]r' grep-file.txt
We can use this regular file for testing grep.
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ grep 'f[a-k]r' grep-file.txt
Also, New Zealand is pretty far away.
reader@ubuntu:~/scripts/chapter_10$ grep 'f[k-q]r' grep-file.txt
We can use this regular file for testing grep.

First, we demonstrate using . (dot) to replace any character. In this scenario, the pattern f.r matches both for and far.

Next, we use the bracket notation in f[ao]r to convey that we'll accept a single character between f and r, as long as it is either a or o. As expected, this again returns both far and for.

If we do this with the f[az]r pattern, we can only match with far and fzr. Since the string fzr isn't in our text file (and not a word, obviously), we only see the line with far printed.

Next, let's say you wanted to match with a letter, but not a number. If you used . (dot) to search, as in the first example, this would return both letters and numbers. So, you would also get, for example, f2r as a match (should that be in the file, which it is not).

If you used the bracket notation, you could use the following notation: f[abcdefghijklmnopqrstuvwxyz]r. That matches on any letter, a-z, between f and r. However, it's not great to type that out on a keyboard (trust me on this).

Luckily, the creators of POSIX regular expressions introduced a shorthand for this: [a-z], as shown in the previous example. We can also use a subset of the alphabet, as shown: f[a-k]r. Since the letter o is not between a and k, it does not match on for.
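
The difference between the dot and a range can be sketched with a few made-up strings. Setting LC_ALL=C is an assumption made for this example, so that a-z means exactly the lowercase ASCII letters; in other locales, ranges may behave differently:

```shell
# Illustrative input; f2r and fZr are not in grep-file.txt.
sample='far
for
f2r
fZr'

# LC_ALL=C (an assumption for this sketch) pins a-z to lowercase ASCII letters.
any_char=$(printf '%s\n' "$sample" | LC_ALL=C grep -c 'f.r')
lowercase=$(printf '%s\n' "$sample" | LC_ALL=C grep 'f[a-z]r')

printf '%s lines match f.r\n' "$any_char"   # 4 lines match f.r
printf '%s\n' "$lowercase"                  # far, then for
```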

A last example demonstrates that this is a powerful, and also practical, pattern:

reader@ubuntu:~/scripts/chapter_10$ grep 'reali[sz]e' grep-file.txt
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!

Hopefully, this still all makes sense. Before moving on to line anchors, we're going to go one step further by combining notations.

In the preceding example, you see that we can use bracket notation to handle some of the differences between American and British English. However, this only works when the difference in spelling is a single letter, as with realise/realize.

In the case of color/colour, there is an extra letter we need to deal with. This sounds like a case for zero or more, does it not?

reader@ubuntu:~/scripts/chapter_10$ grep 'colo[u]*r' grep-file.txt 
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!

By using the pattern colo[u]*r, we're searching for a line containing a word that starts with colo, may or may not contain any number of us, and ends with an r. Since both color and colour are acceptable for this pattern, both lines are printed.
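
Because the brackets here hold only a single character, [u]* should behave the same as u*. A short sketch with made-up words (colouur and colr are there only for illustration) also shows that any number of us really does include more than one:

```shell
# Illustrative word list; not taken from grep-file.txt.
words='color
colour
colouur
colr'

bracketed=$(printf '%s\n' "$words" | grep 'colo[u]*r')
plain=$(printf '%s\n' "$words" | grep 'colou*r')

printf '%s\n' "$bracketed"   # color, colour, colouur (colr lacks the second o)
```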

You might be tempted to use the dot character with the zero-or-more * notation. However, look closely at what happens in that case:

reader@ubuntu:~/scripts/chapter_10$ grep 'colo.*r' grep-file.txt 
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!

Again, both lines are matched. But, since the second line contains another r further on, the match on that line is the string color (and r, not just the word color.
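
The extent of the greedy match can be made visible with the -o option, which prints only the matched text. This is a sketch that assumes a grep with -o support (such as GNU grep) and feeds it the second line directly:

```shell
# The second line of the example output, passed on standard input.
line='but in the USA they use color (and realize)!'

greedy=$(printf '%s\n' "$line" | grep -o 'colo.*r')
printf '%s\n' "$greedy"
# prints:
# color (and r
```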

This is a typical instance where the regular expression pattern is too greedy for our purposes. While we cannot tell this pattern to be less greedy, there is an option in grep that lets us only look for whole words that match.

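The -w option only accepts a match that forms a whole word: the match must start at the beginning of the line or after a non-word character, and end at the end of the line or before a non-word character. Word characters are letters, digits, and the underscore, so punctuation counts as a boundary just like whitespace does. This is how it is used: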

reader@ubuntu:~/scripts/chapter_10$ grep -w 'colo.*r' grep-file.txt 
Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!

Now, only the words colour and color are matched. Earlier, we put whitespace around our word to facilitate this behavior, but the word colour is followed by a comma at the end of its line, not by whitespace.

Try for yourself and see why enclosing the colo.*r search pattern in whitespace does not work, but using the -w option does.
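
One way to run this comparison is sketched below, counting matching lines with -c. It uses the two lines from the example output as input and assumes GNU grep's handling of -w:

```shell
# The two lines from the example output, passed on standard input.
lines='Did you ever realise that in the UK they say colour,
but in the USA they use color (and realize)!'

# Pattern wrapped in spaces: colour is followed by a comma, so that line is missed.
with_spaces=$(printf '%s\n' "$lines" | grep -c ' colo.*r ')
# -w accepts punctuation and line boundaries as word boundaries.
with_w=$(printf '%s\n' "$lines" | grep -c -w 'colo.*r')

printf 'spaces: %s line(s), -w: %s line(s)\n' "$with_spaces" "$with_w"
# prints: spaces: 1 line(s), -w: 2 line(s)
```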

Some implementations of regular expressions have the {3} notation, to supplement the * notation. In this notation, you can specify exactly how often a pattern should be present. The search pattern [a-z]{3} would match exactly three consecutive lowercase letters. With grep, the unescaped {3} form requires extended regular expressions, which we will see later in this chapter; basic regular expressions use the escaped form \{3\} instead.
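
As a brief preview, and only as a sketch with made-up input, the following uses grep -E for extended regular expressions. It also uses the -x option, which requires the whole line to match, to tell "contains three lowercase letters in a row" apart from "is exactly three lowercase letters". LC_ALL=C is an assumption made here to keep the a-z range to ASCII letters:

```shell
# Illustrative input strings.
strings='ab
abc
abcd'

# Lines containing three consecutive lowercase letters.
contains=$(printf '%s\n' "$strings" | LC_ALL=C grep -E '[a-z]{3}')
# Lines consisting of exactly three lowercase letters (-x matches whole lines).
exact=$(printf '%s\n' "$strings" | LC_ALL=C grep -E -x '[a-z]{3}')

printf '%s\n' "$contains"   # abc, then abcd
printf '%s\n' "$exact"      # abc
```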