Catastrophic or exponential backtracking

Regular expression engines can be broadly categorized into two types:

  1. The Non-deterministic Finite Automation (NFA) engine
  2. The Deterministic Finite Automation (DFA) engine

The DFA engines do not evaluate each character more than once to find a match. On the other hand, the NFA engines support backtracking, which means that each character in the input can be evaluated multiple times by the regex engine. The Java regular expression engine is an NFA engine.

Regex engines backtrack or give back one position at a time to make various attempts to match a given pattern when using the greedy quantifier or trying alternations. Similarly, when using lazy quantifiers, the regex engine moves forward one position at a time to attempt matching.

Regex engines usually take less time to find a positive match in the given input as compared to returning a failure for a non-match. The NFA regex engines need to evaluate all the possible permutations before returning a failure.

For example, in a regular expression that uses nested repetition quantifiers, the regex engine backtracks excessively while matching a long input text. A catastrophic backtracking problem usually occurs when the regex engine fails to make a negative match towards the end of the string and after attempting far too many permutations.

As an example, check the following regex with nested quantifiers:

^(w+)*$ 

Suppose we test it against an input text that does not have a word character in the end, such as this input string:

abcdefghijklmno: 

We know that due to the presence of a non-word character (colon) at the end of the input, the match will fail. However, due to the presence of nested compound quantifiers, (w+)*, the regex engine backtracks excessively and makes a lot of attempts to match the input before giving up.

Excessive backtracking may also be caused by two or more alternatives that are mutually exclusive and can match the same string in the input. For example, having a regex pattern like this one to match the text between the %% tags:

%%(.|s)+%% 

This regex may also cause catastrophic backtracking for failed cases, such as the following input string with a missing closing tag:

%%        something    here    abcd     123 

The problem here is that the alternations in (.|s) are not mutually exclusive, as dot can also match the same whitespace that is matched by s, except the newline character.

Here is a complete program listing that demonstrates a dynamically building regex getting slower with every iteration of the loop and eventually causing catastrophic backtracking:

package example.regex; 
  
import java.util.regex.Matcher; 
import java.util.regex.Pattern; 
  
public class CatastropicBacktracking 
{ 
  public static void main(String[] args) 
  { 
    final int MAX = 30; 
     
    for (int i = 1; i < MAX; i++) 
    { 
      StringBuilder sb1 = new StringBuilder(i); 
      StringBuilder sb2 = new StringBuilder(i); 
       
      for (int j = i; j > 0; j--) 
      { 
        sb1.append('a'); 
        sb2.append("a?"); 
      } 
       
      sb2.append(sb1); 
  
      final Pattern p = Pattern.compile("^" + sb2.toString() + "$"); 
      Matcher m = p.matcher(sb1.toString()); 
  
      long start = System.nanoTime(); 
      m.matches(); 
      long end = System.nanoTime(); 
       
      System.out.printf("%s:: ( %sms ) :: Pattern <%s>, 
Input <%s>%n", i, (end - start)/1_000_000, sb2, sb1); } } }

When you compile and run the preceding program and look at the generated output, you will note an output as follows:

 

1:: ( 0ms ) :: Pattern <a?a>, Input <a> 
2:: ( 0ms ) :: Pattern <a?a?aa>, Input <aa> 
3:: ( 0ms ) :: Pattern <a?a?a?aaa>, Input <aaa> 
4:: ( 0ms ) :: Pattern <a?a?a?a?aaaa>, Input <aaaa> 
5:: ( 0ms ) :: Pattern <a?a?a?a?a?aaaaa>, Input <aaaaa> 
6:: ( 0ms ) :: Pattern <a?a?a?a?a?a?aaaaaa>, Input <aaaaaa> 
7:: ( 0ms ) :: Pattern <a?a?a?a?a?a?a?aaaaaaa>, Input <aaaaaaa> 
8:: ( 0ms ) :: Pattern <a?a?a?a?a?a?a?a?aaaaaaaa>, Input <aaaaaaaa> 
9:: ( 0ms ) :: Pattern <a?a?a?a?a?a?a?a?a?aaaaaaaaa>, Input <aaaaaaaaa> 
10:: ( 0ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?aaaaaaaaaa>, Input <aaaaaaaaaa> 
11:: ( 0ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaa>, Input <aaaaaaaaaaa> 
12:: ( 0ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaa>, Input <aaaaaaaaaaaa> 
13:: ( 10ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaa>, Input <aaaaaaaaaaaaa> 
14:: ( 1ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaa> 
15:: ( 15ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaa> 
16:: ( 18ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaa> 
17:: ( 29ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaa> 
18:: ( 22ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaa> 
19:: ( 51ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaa> 
20:: ( 97ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaa> 
21:: ( 188ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaa> 
22:: ( 441ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaa> 
23:: ( 1003ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaa> 
24:: ( 1549ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaaa> 
25:: ( 3010ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaaaa> 
26:: ( 5884ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaaaaa> 
27:: ( 12588ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaaaaaa> 
28:: ( 24765ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaaaaaaa> 
29:: ( 51679ms ) :: Pattern <a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa>, Input <aaaaaaaaaaaaaaaaaaaaaaaaaaaaa> 

Note how the execution time grows rapidly on higher values of the counter i, especially after 25.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.144.93.222