Regular expressions in Python

Regular expressions are powerful and widely used for pattern matching in the cyber security domain, whether for parsing log files, Qualys or Nessus reports, or the output produced by Metasploit, NSE, or any other service-scanning or exploit script. Python supports regular expressions through the re module. The most important functions and methods that we will be using with the re module are explained as follows:

match()

This determines whether the regular expression matches at the beginning of the string. It is called as re.match(pattern, string, flags=0). The most commonly used flags are re.IGNORECASE (re.I), re.MULTILINE (re.M), and re.DOTALL (re.S). Multiple flags can be combined with the bitwise OR operator, |, as in re.M | re.I.
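As a minimal sketch of how match() and its flags behave, consider the following. The sample text and variable names are made up for illustration and are not taken from the book's script:

```python
import re

text = "Hello world\nsecond line"  # hypothetical sample input

# match() only succeeds if the pattern matches at position 0.
no_flag = re.match(r"hello", text)
print(no_flag)                      # None: the case differs

# re.I (re.IGNORECASE) makes the comparison case-insensitive.
m = re.match(r"hello", text, flags=re.I)
print(m.group())                    # Hello

# Flags are combined with |. re.M (re.MULTILINE) lets $ match
# at the end of each line, not only at the end of the string.
m2 = re.match(r"^hello.*$", text, flags=re.I | re.M)
print(m2.group())                   # Hello world
```

Without re.M, the last pattern would not match at all, because .* does not cross the newline and $ would only match at the very end of the string.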

search()

Unlike match(), search() does not look for a match only at the beginning of the string. Instead, it scans through the whole string and returns the first location where the given regex matches. It is called as re.search(pattern, string, flags=0).
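A short sketch of search() in use might look like this. The log line is an invented sample, and the pattern is only an illustration:

```python
import re

log_line = "Failed login from 10.0.0.5 port 22"  # made-up sample line

# search() scans the whole string and reports only the first match.
found = re.search(r"\d+", log_line)
if found:                            # search() returns None when nothing matches
    print(found.group())             # 10

# A pattern that occurs nowhere in the string yields None.
missing = re.search(r"root", log_line)
print(missing)                       # None
```

Checking the result before calling group() avoids an AttributeError when the pattern is not present.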

findall()

This searches the string for matches of the regex and returns all the non-overlapping matching substrings as a list.
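The following sketch shows what findall() returns, using invented sample strings. Note that when the pattern contains capturing groups, the list holds the group contents rather than the whole match:

```python
import re

report = "open ports: 22, 80, 443"   # made-up sample text

# Every non-overlapping match is returned as a string in a list.
ports = re.findall(r"\d+", report)
print(ports)                          # ['22', '80', '443']

# With two capturing groups, each list item is a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", "a=1 b=22")
print(pairs)                          # [('a', '1'), ('b', '22')]

# No match gives an empty list rather than None.
nothing = re.findall(r"\d+", "no digits here")
print(nothing)                        # []
```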

group()

If a match is found, then group() returns the string matched by the RE.

start()

If a match is found, then start() returns the starting position of the match.

end()

If a match is found, then end() returns the end position of the match.

span()

If a match is found, then span() returns a tuple containing the start and end positions of the match.
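group(), start(), end(), and span() are all methods of the match object that match() and search() return. A minimal sketch, assuming an invented input string:

```python
import re

# One non-digit followed by one digit, each in its own group.
m = re.search(r"(\D)(\d)", "tcp443 open")   # hypothetical input

if m:                                 # None would mean no match was found
    print(m.group())                  # p4   (the whole match)
    print(m.group(1), m.group(2))     # p 4  (the individual groups)
    print(m.start())                  # 2
    print(m.end())                    # 4
    print(m.span())                   # (2, 4)
```

The end position is exclusive, so the matched text is the slice of the input from start() up to, but not including, end().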

split()

This splits a string wherever the regex matches and returns the resulting pieces as a list.
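As an illustration, the sketch below splits an invented list of service names on commas or semicolons followed by optional whitespace:

```python
import re

services = "ssh, http;https,  smtp"   # made-up sample text

# Split on a comma or semicolon plus any whitespace that follows it.
fields = re.split(r"[,;]\s*", services)
print(fields)                          # ['ssh', 'http', 'https', 'smtp']

# maxsplit limits the number of splits that are performed.
head_tail = re.split(r"\s+", "a b c", maxsplit=1)
print(head_tail)                       # ['a', 'b c']
```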

sub()

This is used for string replacement. It replaces all the substrings wherever it finds a match and returns the resulting new string. If no match is found, the original string is returned unchanged.

subn()

This is used for string replacement. It replaces all the substrings wherever it finds a match. The return type is a tuple with the new string at index 0 and the number of replacements at index 1.
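The following sketch contrasts sub() and subn() on an invented line of text. The field names in the sample are hypothetical:

```python
import re

line = "user=admin pass=secret1 pass=secret2"   # made-up sample

# sub() replaces every match and returns the new string.
masked = re.sub(r"pass=\S+", "pass=****", line)
print(masked)             # user=admin pass=**** pass=****

# When nothing matches, the input comes back unchanged.
unchanged = re.sub(r"token=\S+", "token=****", line)
print(unchanged == line)  # True

# subn() returns a (new_string, number_of_replacements) tuple.
new_line, count = re.subn(r"pass=\S+", "pass=****", line)
print(count)              # 2
```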

We will now try to understand regular expressions with the help of the following snippet from the regular_expressions.py script:

The difference between match and search is that match only searches for the pattern at the beginning of the string, whereas search looks throughout the entire input string. The output produced with code lines 42 and 50 will illustrate this:

In the preceding screen, it can be seen that when the Hello input is passed, both match and search are able to locate the string. However, when the pattern passed is \d, which means any decimal digit, match is not able to locate it but search is. This is because the search method scans the whole string and not just the beginning.
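The behavior described above can be reproduced with a small sketch. This is not the regular_expressions.py script itself; the compare() helper and the sample input are assumptions made for illustration:

```python
import re

def compare(pattern, text):
    """Return what match() and search() each find, as strings or None."""
    m = re.match(pattern, text)
    s = re.search(pattern, text)
    return (m.group() if m else None, s.group() if s else None)

sample = "Hello 123"   # hypothetical input

# The string starts with Hello, so both functions succeed.
hello_result = compare(r"Hello", sample)
print(hello_result)    # ('Hello', 'Hello')

# The first digit is not at position 0, so only search() succeeds.
digit_result = compare(r"\d", sample)
print(digit_result)    # (None, '1')
```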

Again, it can be seen from the following screenshot that match did not return the grouping of digits and non-digits, but search did:

In the following output, the Reg keyword is searched, so both match and search return results:

 

Notice how findall(), in the following screenshot, is different from match and search:

These examples have shown how match() and search() operate differently, and how search() is the more flexible of the two because it is not restricted to the beginning of the string.

Let's take a look at a few important regular expressions in Python:

| Regex expression | Description |
|---|---|
| \d | This matches any single digit from zero to nine. |
| (\D\d) | This matches a non-digit (\D) immediately followed by a digit (\d), grouped together. Parentheses (()) are used for grouping. |
| .*string.* | This returns a match if the word string is found anywhere in the input, irrespective of what is before and after it. The .* notation means zero or more of any character (other than a newline, unless re.DOTALL is set). |
| ^ | The caret symbol means the pattern must match at the start of the string. |
| [a-zA-Z0-9] | [...] matches any single character that is placed inside the square brackets. [12345], for example, matches any one digit between one and five. [a-zA-Z0-9] matches any single alphanumeric character. |
| \w | \w matches alphanumeric characters and the underscore. For ASCII input, it is identical to [a-zA-Z0-9_]. |
| \W | \W is the negation of \w and matches any character that is neither alphanumeric nor an underscore. |
| \D | \D is the negation of \d and matches any character that isn't a digit. |
| [^a-z] | ^, when placed first inside [], acts as a negation. In this case, it means match any character besides the lowercase letters from a to z. |
| re{n} | This means match exactly n occurrences of the preceding expression, re. |
| re{n,} | This means match n or more occurrences of the preceding expression. |
| re{n,m} | This means match a minimum of n and a maximum of m occurrences of the preceding expression. |
| \s | This means match whitespace characters (space, tab, newline, and so on). |
| [Tt]est | This means match both Test and test. No pipe symbol is needed inside the brackets, where it would be treated as a literal character. |
| re* | This means match zero or more occurrences of the preceding expression. |
| re? | This means match zero or one occurrence of the preceding expression. |
| re+ | This means match one or more occurrences of the preceding expression. |
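To see how the quantifiers and character classes behave in practice, the sketch below runs a few of them through findall(). The input strings are invented for illustration:

```python
import re

# * allows zero or more b characters after a.
star = re.findall(r"ab*", "a ab abbb")
print(star)       # ['a', 'ab', 'abbb']

# ? allows zero or one b after a.
optional = re.findall(r"ab?", "a ab abbb")
print(optional)   # ['a', 'ab', 'ab']

# + requires at least one b after a.
plus = re.findall(r"ab+", "a ab abbb")
print(plus)       # ['ab', 'abbb']

# {2,3} matches between two and three digits, as many as possible.
ranged = re.findall(r"\d{2,3}", "1 22 333 4444")
print(ranged)     # ['22', '333', '444']

# A leading ^ inside [] negates the character class.
negated = re.findall(r"[^a-z]", "ab1 c")
print(negated)    # ['1', ' ']

# A character class picks one character from the listed set.
either = re.findall(r"[Tt]est", "Test test best")
print(either)     # ['Test', 'test']
```

Writing patterns as raw strings (with the r prefix) keeps Python from interpreting the backslashes before the re module sees them.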
