Atomic groups

An atomic group is a non-capturing group that throws away all the alternative positions remembered by any token inside the group when the matching process exits the group after the first match of the pattern inside the group. Thus, it avoids backtracking to attempt all the alternatives present in the group.

Here is the syntax:

(?>regex) 

Here, the regex may contain alternative patterns. On the other hand, a non-atomic group will allow backtracking; it will try to find the first match and then if the matching ahead fails, it will backtrack and try to find the next match in alternation, until a match for the entire expression is found or all the possibilities are exhausted.

To understand it better, let's take an example of a regular expression using a non-atomic group:

^foo(d|die|lish)$

The input string here is foodie.

It will match the starting pattern foo and then the first alternative d. It fails at this time because the end anchor, $, requires that we must be at the end of the input string, but we still have two characters, i and e, to be matched. Then, the engine attempts to match the second alternative die. This match operation succeeds, as the $ anchor asserts true since the input ends there and stops matching further with a successful match returned.

Even if we use a non-capturing group instead of a capturing group here to make it ^foo(?:d|die|lish)$, it will have the same effect while matching.

Now, take an example of the same regular expression using an atomic group:

^foo(?>d|die|lish)$ 

Note the use of ?> after ( to make it an atomic non-capturing group.

Let's see what happens when we apply the preceding regex against the same input string, that is, foodie.

It will match the starting pattern, foo, and then its first alternative, d. It fails because the $ anchor asserts false since the input does not end at food. However, because of the use of the atomic group, the regex engine gives up immediately and doesn't backtrack. Since the regex engine throws away all the alternative positions remembered inside the atomic group, it does not attempt to match the second alternative die, which would have been a successful match for a non-atomic group. Finally, this match operation fails with no match.

You need to remember a simple but important fact that the alternation tries its alternatives from left to right and always attempts to complete the match using the leftmost alternative. Therefore, whenever listing all the options in an alternation, it is good practice to place the longest matches first and then use the other alternatives to place shorter matches.

Using this principle, we can make some small changes to our atomic group to make it work.

Here is the working regex:

 

^foo(?>lish|die|d)$ 

We have the same input string, foodie.

Note that we have the same alternatives in this atomic group but with a different order. Since d is a prefix of die, we are placing the die alternative on the left-hand side of d so that the regex engine can first attempt to match foodie before attempting food.

Here is the full code listing running these examples:

 

package example.regex; 
class AtomicGroupExample
{
public static void main (String[] args)
{
final String input = "foodie";
// regex with non-atomic group
final String nonAtomicRegex = "foo(d|die|lish)";
// regex with an atomic group
final String atomicRegex = "foo(?>d|die|lish)";
// regex with an alternate atomic group with correct order
final String atomicRegexImproved = "foo(?>lish|die|d)";
// now execute all 3 regex against same input
System.out.printf("%s: %s%n",
nonAtomicRegex, input.matches(nonAtomicRegex));

System.out.printf("%s: %s%n",
atomicRegex, input.matches(atomicRegex));

System.out.printf("%s: %s%n",
atomicRegexImproved , input.matches(atomicRegexImproved));
}
}

After compiling and running the code, it will generate the following output:

foo(?:d|die|lish): true 
foo(?>d|die|lish): false
foo(?>lish|die|d): true
Since the atomic group prevents the regex engine from backtracking by exiting from the evaluation of all the alternatives inside the group, the atomic group usually provides a significant gain in performance while evaluating a largely sized text with multiple options in alternation.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.223.239.226