A regular expression (also known as regex or regexp) provides a way of specifying a pattern to be matched in a given chunk of text. It uses a set of special characters to specify the pattern. It is widely used for text search and string manipulation. Many shell commands provide an option to specify a regex, such as grep, sed, find, and so on.
The regular expression concept is also used in other programming languages such as C++, Python, Java, Perl, and so on. Libraries are available in different languages to support regular expression features.
The metacharacters used in regular expressions are explained in the following table:
When we look into a human-readable file or data, most of its content consists of letters (a to z) and digits (0 to 9). While writing a regex for matching a pattern consisting of letters or digits, we can make use of character ranges or classes.
We can specify a character range by a pair of characters separated by a hyphen. Any character that falls within that range, inclusive, is matched. Character ranges are enclosed inside square brackets.
The following table shows some character ranges:
Character range | Description |
---|---|
[2-46-8j-lB-M] | This matches any single digit from 2 to 4 or 6 to 8 or any letter from j to l or B to M |
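As a quick illustration of how a combined character range behaves, here is a minimal sketch using grep. The sample input is hypothetical, and LC_ALL=C is set as an assumption so that ranges follow plain ASCII ordering regardless of the system locale.

```shell
# Hypothetical sample input: one character per line.
sample=$(printf '%s\n' 3 5 7 k K a 9)

# -x requires the whole line to match, so only single characters that
# fall inside one of the ranges 2-4, 6-8, j-l, or B-M are printed.
# LC_ALL=C is an assumption made here to get ASCII range ordering.
matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -x '[2-46-8j-lB-M]')

printf '%s\n' "$matches"
```

With this input, the lines 3, 7, k, and K are printed, while 5, a, and 9 are filtered out.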
Character classes: Another way of specifying a set of characters to match is by using character classes. A class is written as [:class:] inside the square brackets of a bracket expression, for example [[:digit:]]. The possible class values are listed in the following table:
Creating your own regex: In the previous sections on regular expressions, we discussed metacharacters, character ranges, character classes, and their usage. Using these concepts, we can create powerful regexes that filter text data as per our needs. Now, we will create a few regexes using the concepts we have learned.
We will consider valid dates to start from the UNIX Epoch, that is, 1st January 1970. In this example, we will consider all the dates between the UNIX Epoch and 31st December 2099 as valid dates. An explanation of forming the regex is given in the following subsections:
Putting it all together, the regex for matching a valid month in a date is 0[1-9]|1[0-2].
Our date will be in mm-dd-yyyy format. By putting together the regexes formed in the preceding sections for months, days, and years, we get the regex for a valid date:
(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[0-1])-(19[7-9][0-9]|20[0-9]{2})
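The date regex can be tried out from the shell with grep -E. The following is a minimal sketch, not a definitive validator: the function name is_valid_date and the sample dates are hypothetical, and the ^ and $ anchors are an addition made here so that the whole input must match.

```shell
# The date regex from the text, anchored with ^ and $ (an assumption
# added here) so that the entire line must be a date.
date_regex='^(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[0-1])-(19[7-9][0-9]|20[0-9]{2})$'

# Hypothetical helper: succeeds if its argument matches the regex.
# -E enables extended regex syntax (needed for |, () and {2});
# -q suppresses output so only the exit status is used.
is_valid_date() {
    printf '%s\n' "$1" | grep -Eq "$date_regex"
}

for d in 01-01-1970 12-31-2099 13-01-2000 06-15-1969 02-31-2024; do
    if is_valid_date "$d"; then
        echo "$d: valid"
    else
        echo "$d: invalid"
    fi
done
```

Note that 02-31-2024 is reported as valid: the regex only checks that each field is in range and does not know how many days each month has.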
There is a nice website, http://regexr.com/, where you can also validate regular expressions. The following screenshot shows the matching of valid dates among the given input:
In Chapter 1, Beginning of Scripting Journey, we learned the naming rules for variables in shell. A valid variable name can contain only alphanumeric characters and underscores, and its first character can't be a digit.
Keeping these rules in mind, a valid shell variable regex can be written as follows:
^[_a-zA-Z][_a-zA-Z0-9]*$
Here, ^ (caret) matches the start of a line.
The regex [_a-zA-Z] matches _ or any uppercase or lowercase letter.
[_a-zA-Z0-9]* matches zero or more occurrences of _, any digit, or any uppercase or lowercase letter.
$ (dollar) matches the end of the line.
In character class format, we can write the regex as ^[_[:alpha:]][_[:alnum:]]*$.
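Both forms of the variable-name regex can be checked with grep. This is a minimal sketch under a few assumptions: the helper name matches_regex and the sample names are hypothetical, and LC_ALL=C is set so that the ranges and classes cover only ASCII letters and digits.

```shell
range_regex='^[_a-zA-Z][_a-zA-Z0-9]*$'
class_regex='^[_[:alpha:]][_[:alnum:]]*$'

# Hypothetical helper: succeeds if the name ($2) matches the regex ($1).
# LC_ALL=C is an assumption made here to restrict matching to ASCII.
matches_regex() {
    printf '%s\n' "$2" | LC_ALL=C grep -q "$1"
}

for name in my_var _tmp1 PATH 2fast my-var; do
    if matches_regex "$range_regex" "$name" && matches_regex "$class_regex" "$name"; then
        echo "$name: valid"
    else
        echo "$name: invalid"
    fi
done
```

Here my_var, _tmp1, and PATH are reported as valid, while 2fast (starts with a digit) and my-var (contains a hyphen) are reported as invalid.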
The following screenshot shows valid shell variable names matched using this regex: