Regular expressions

A regular expression (also known as regex or regexp) specifies a pattern to be matched in a body of text. It uses a set of special characters to describe the pattern. Regular expressions are widely used for text search and string manipulation. Many shell commands, such as grep, sed, and find, accept a regex as an argument.

Regular expressions are also used in programming languages such as C++, Python, Java, and Perl, each of which provides libraries that support them.

Regular expression metacharacters

The metacharacters used in regular expressions are explained in the following table:

Metacharacter   Description
* (asterisk)    This matches zero or more occurrences of the previous element
+ (plus)        This matches one or more occurrences of the previous element
?               This matches zero or one occurrence of the previous element
. (dot)         This matches any one character
^               This matches the start of the line
$               This matches the end of the line
[...]           This matches any one character listed within the square brackets
[^...]          This matches any one character not listed within the square brackets
| (bar)         This matches either the left-side or the right-side element of |
{X}             This matches exactly X occurrences of the previous element
{X,}            This matches X or more occurrences of the previous element
{X,Y}           This matches X to Y occurrences of the previous element
(...)           This groups the enclosed elements
\<              This matches the empty string at the beginning of a word
\>              This matches the empty string at the end of a word
\ (backslash)   This disables the special meaning of the next character
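As a rough illustration of a few of these metacharacters, the following sketch runs grep -E (extended regex) over a small made-up word list. The helper name count_matches and the sample words are assumptions introduced only for this example.

```shell
# Hypothetical sample input: one word per line.
words='color
colour
colouur
cat
cot
ct
dog'

# count_matches PATTERN: print how many sample lines match PATTERN entirely.
# -E selects extended regex, -x requires the whole line to match, -c counts.
count_matches() {
    printf '%s\n' "$words" | grep -E -x -c -- "$1"
}

count_matches 'colou?r'     # ?   : zero or one "u"   -> 2 (color, colour)
count_matches 'colou*r'     # *   : zero or more "u"  -> 3
count_matches 'colou+r'     # +   : one or more "u"   -> 2 (colour, colouur)
count_matches 'c.t'         # .   : any one character -> 2 (cat, cot)
count_matches 'c[ao]t'      # [ ] : either a or o     -> 2
count_matches 'cat|dog'     # |   : either side       -> 2
count_matches 'colou{2}r'   # {X} : exactly two "u"   -> 1 (colouur)
```

The -x option is used here so that each pattern has to cover the whole line; without it, grep reports a line as soon as the pattern matches anywhere inside it.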

Character ranges and classes

The content of a human-readable file consists mostly of letters (a to z) and digits (0 to 9). When writing a regex to match a pattern made up of letters or digits, we can use character ranges or classes.

Character ranges

A character range is specified by a pair of characters separated by a hyphen. Any character that falls within that range, inclusive, is matched. Character ranges are enclosed inside square brackets.

The following table shows some character ranges:

Character range   Description
[a-z]             This matches any single lowercase letter from a to z
[A-Z]             This matches any single uppercase letter from A to Z
[0-9]             This matches any single digit from 0 to 9
[a-zA-Z0-9]       This matches any single alphabetic or numeric character
[h-k]             This matches any single letter from h to k
[2-46-8j-lB-M]    This matches any single digit from 2 to 4 or 6 to 8, or any letter from j to l or B to M
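The ranges in the table can be tried out with a small sketch like the one below. The helper name check and the sample strings are assumptions made for illustration. LC_ALL=C is set because the ordering of characters inside a range depends on the locale.

```shell
# check PATTERN STRING: report whether the whole STRING matches PATTERN.
# LC_ALL=C pins the range ordering to plain ASCII.
check() {
    if printf '%s\n' "$2" | LC_ALL=C grep -E -x -q -- "$1"; then
        echo "match"
    else
        echo "no match"
    fi
}

check '[a-z]+' 'hello'         # match
check '[a-z]+' 'Hello'         # no match: H is uppercase
check '[A-Z][a-z]+' 'Hello'    # match
check '[0-9]{4}' '2024'        # match
check '[h-k]' 'm'              # no match: m lies outside h to k
check '[2-46-8j-lB-M]' '5'     # no match: 5 is in neither 2-4 nor 6-8
check '[2-46-8j-lB-M]' 'k'     # match
```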

Character classes

Another way of specifying a set of characters to match is by using character classes. A class is written as [:class:] and must itself appear inside a bracket expression, for example [[:digit:]]. The possible class names are listed in the following table:

Character class   Description
[:alnum:]         This matches any single alphabetic or numeric character; equivalent to [a-zA-Z0-9] for ASCII text
[:alpha:]         This matches any single alphabetic character; equivalent to [a-zA-Z] for ASCII text
[:digit:]         This matches any single digit; equivalent to [0-9]
[:lower:]         This matches any single lowercase letter; equivalent to [a-z] for ASCII text
[:upper:]         This matches any single uppercase letter; equivalent to [A-Z] for ASCII text
[:blank:]         This matches a space or tab
[:graph:]         This matches any visible character (ASCII 33 to 126), excluding the space character
[:print:]         This matches any printable character (ASCII 32 to 126), including the space character
[:punct:]         This matches any punctuation mark such as '?', '!', '.', ',', and so on
[:xdigit:]        This matches any hexadecimal digit; equivalent to [a-fA-F0-9]
[:cntrl:]         This matches any control character
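The following sketch shows two of these classes in use with sed and grep. The helper names strip_punct and is_hex and the sample strings are hypothetical, chosen only for this example. Note the double brackets: the class name sits inside a bracket expression.

```shell
# strip_punct STRING: print STRING with every punctuation mark removed.
strip_punct() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[[:punct:]]//g'
}

# is_hex STRING: succeed if STRING consists only of hexadecimal digits.
is_hex() {
    printf '%s\n' "$1" | LC_ALL=C grep -E -x -q '[[:xdigit:]]+'
}

strip_punct 'Hello, world!'    # prints: Hello world

for s in c0ffee coffee 1A2b; do
    if is_hex "$s"; then
        echo "$s: hexadecimal"
    else
        echo "$s: not hexadecimal"
    fi
done
```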

Creating your own regex

In the previous sections, we discussed metacharacters, character ranges, and character classes, along with their usage. Using these concepts, we can create powerful regexes that filter text data as needed. Now, we will create a few regexes using the concepts we have learned.

Matching dates in mm-dd-yyyy format

In this example, we will treat all dates from the UNIX Epoch (1st January 1970) through 31st December 2099 as valid. The regex is built up in the following subsections:

Matching a valid month

  • 0[1-9] matches months 01 to 09
  • 1[0-2] matches months 10, 11, and 12
  • '|' matches either the left or the right expression

Putting it all together, the regex for matching a valid month is 0[1-9]|1[0-2].

Matching a valid day

  • 0[1-9] matches days 01 to 09
  • [12][0-9] matches days 10 to 29
  • 3[0-1] matches days 30 and 31
  • '|' matches either the left or the right expression
  • 0[1-9]|[12][0-9]|3[0-1] matches all the valid days in a date

Matching the valid year in a date

  • 19[7-9][0-9] matches years from 1970 to 1999
  • 20[0-9]{2} matches years from 2000 to 2099
  • '|' matches either the left or the right expression
  • 19[7-9][0-9]|20[0-9]{2} matches all the valid years from 1970 to 2099

Combining valid months, days, and years regex to form valid dates

Our date will be in mm-dd-yyyy format. Putting together the regexes formed in the preceding sections for months, days, and years gives the regex for a valid date:

(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[0-1])-(19[7-9][0-9]|20[0-9]{2})
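A minimal sketch of how this regex might be used from the shell is shown below. The variable name date_re, the function name is_valid_date, and the sample dates are assumptions introduced for illustration.

```shell
# The date regex from the text, stored in a variable for reuse.
date_re='(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[0-1])-(19[7-9][0-9]|20[0-9]{2})'

# is_valid_date STRING: succeed if the whole STRING matches the date regex.
# -x anchors the match to the full line, so extra text around a date fails.
is_valid_date() {
    printf '%s\n' "$1" | grep -E -x -q -- "$date_re"
}

for d in 01-01-1970 12-31-2099 13-01-2000 12-25-1969 02-31-2020; do
    if is_valid_date "$d"; then
        echo "$d: valid"
    else
        echo "$d: invalid"
    fi
done
```

Note that 02-31-2020 is reported as valid: the regex checks only the shape and range of each field, not whether the day actually exists in the given month.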

The website http://regexr.com/ can also be used to test regular expressions. The following screenshot shows valid dates being matched in the given input:

[Screenshot: Combining valid months, days, and years regex to form valid dates]

Regex for a valid shell variable

In Chapter 1, Beginning of Scripting Journey, we learned the naming rules for shell variables. A valid variable name can contain alphanumeric characters and underscores, and its first character can't be a digit.

Keeping these rules in mind, a valid shell variable regex can be written as follows:

^[_a-zA-Z][_a-zA-Z0-9]*$

The parts of this regex are as follows:

  • ^ (caret) matches the start of the line
  • [_a-zA-Z] matches _ or any uppercase or lowercase letter
  • [_a-zA-Z0-9]* matches zero or more occurrences of _, any digit, or any uppercase or lowercase letter
  • $ (dollar) matches the end of the line

In character class format, we can write the regex as ^[_[:alpha:]][_[:alnum:]]*$.
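The character class form of the regex can be tried with a short sketch such as the following. The function name is_valid_name and the sample names are hypothetical.

```shell
# is_valid_name STRING: succeed if STRING is a valid shell variable name
# according to the regex from the text. The ^ and $ anchors already force
# the whole line to match, so -x is not needed here.
is_valid_name() {
    printf '%s\n' "$1" | LC_ALL=C grep -E -q '^[_[:alpha:]][_[:alnum:]]*$'
}

for name in foo _bar var_1 1abc my-var; do
    if is_valid_name "$name"; then
        echo "$name: valid"
    else
        echo "$name: invalid"
    fi
done
```

Here foo, _bar, and var_1 are accepted, while 1abc (starts with a digit) and my-var (contains a hyphen) are rejected.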

The following screenshot shows valid shell variable names being matched by this regex:

[Screenshot: Regex for a valid shell variable]

Note

  • Enclose a regular expression in single quotes (') to prevent the shell from expanding it before the command sees it.
  • Use a backslash (\) before a metacharacter to escape its special meaning.
  • Metacharacters such as ?, +, {, |, (, and ) belong to extended regex. In basic regex, they are treated as ordinary characters. To give them their special meaning in basic regex, precede them with a backslash: '\?', '\+', '\{', '\|', '\(', and '\)'. Of these, '\?', '\+', and '\|' are GNU extensions to basic regex.
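The difference between basic and extended regex can be seen in a sketch like the one below. The helper names bre and ere and the sample lines are assumptions made for this example. grep without -E uses basic regex, and grep -E uses extended regex.

```shell
# Hypothetical sample input: one string per line.
sample='ab+c
abbc
abc'

# bre PATTERN: print sample lines that fully match PATTERN as a basic regex.
bre() { printf '%s\n' "$sample" | grep -x -- "$1"; }

# ere PATTERN: print sample lines that fully match PATTERN as an extended regex.
ere() { printf '%s\n' "$sample" | grep -E -x -- "$1"; }

bre 'ab+c'        # basic: + is an ordinary character      -> ab+c
ere 'ab+c'        # extended: + means one or more "b"      -> abbc, abc
ere 'ab\+c'       # extended: backslash makes + literal    -> ab+c
bre 'ab\{2\}c'    # basic: \{2\} means exactly two "b"     -> abbc
```

All patterns are single-quoted so that the shell passes characters such as \, {, and | to grep unchanged.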