How to avoid catastrophic backtracking

Here are some tips to keep in mind while handling situations with catastrophic or excessive backtracking in your regex:

  • When you write regular expressions, make sure they fail fast without spending a lot of unnecessary steps in backtracking.
  • When using nested repetition operators or quantifiers, make sure that there is only one unique way to match the a string.
  • Make good judicious use of atomic groups and possessive quantifiers to avoid excessive backtracking.
  • You should avoid having too many optional matches that are not mutually exclusive in an alternation pattern.
  • Be very careful when using a free-flowing pattern such as .* or .+ in your regex. Wherever possible, use negated character classes for cutting down the backtracking steps and for better performance.
  • Avoid matching hugely sized text using a single regex. It is better to match smaller strings using your regex and call matcher.find() in a loop to get multiple matches. If needed, use another inner pattern to match and examine the matches found by the main pattern.

The regex with the nested quantifier that caused catastrophic backtracking is as follows:

^(w+)*$ 

We can make use of possessive quantifiers to disallow any backtracking, as follows:

^(w+)*+$ 

You will note a massive jump in this improved regex in any of the benchmarks or regex testing tools, as suggested earlier.

Also, in the alternation regex example, we found that this regex causes excessive backtracking for failed cases:

%%(.|s)%% 

It can be converted to the following regex to avoid excessive backtracking:

%%(S|s)+%% 

It is even better to avoid the group and use a character class, as follows:

%%[Ss]+%%  

Note the use of S instead of dot to make alternatives mutually exclusive.

A regex that can cause excessive backtracking is as follows:

^(?:.*:)+#$ 

In the preceding regex example, if we use a negated character class instead of .*, then we can avoid catastrophic backtracking:

^(?:[^:]+:)+#$ 

The regex engine doesn't backtrack excessively because the negated character class, [^:], matches any character except a colon instead of the dot that matches everything, including the colon.

Consider another example with this regex pattern that has the nested repetition operator, +:

%(?:[p-s]+|ye|wo)+% 

This regex pattern attempts to match a string that starts with the following conditions:

  • Must start with %
  • % must be followed by one or more alternations: letters p,q,r,s or the string ye or wo
  • Must end with another %

Now test this regex pattern against the input string, as follows:

%yeqpsrwospqr 

Obviously, the regex pattern is not going to match because the last % is missing. However, note that the starting % and all the following letters will match the regex pattern before the last %. Due to this, the regex engine will backtrack several times while making attempts to match the complete input before finally giving up.

When testing this regex on the regex101 website's debugger, it shows the following:

match 1 failed in 748 steps 

748 may be a quite a big number for the number of steps taken to fail the match for a small-sized input. Regex patterns such as this can slow down your application considerably. Some of them can even hang your code for many hours or days due to the catastrophic backtracking behavior.

Now, to prevent this catastrophic backtracking behavior, let's consider the two options the we recommended earlier:

  1. Use a possessive quantifier, as follows:
   %(?:[p-s]+|ye|wo)++% 

On testing the preceding pattern on the same site, we get the following in the debugger:

  match 1 failed in 33 steps 
  1. Use an atomic group, as follows:
  %(?>[p-s]+|ye|wo)+% 

On testing the preceding pattern on the same site, we get the following in the debugger:

  match 1 failed in 36 steps 

You can notice that by using any of the aforementioned techniques, we make the regex engine fail sooner and avoid the unnecessarily high number of backtracking steps.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.17.76.175