Capturing text from overlapping matches

Lookahead patterns are also very useful for situations where we want to match and capture text from overlapping matches.

Let's consider the following input string as an example:

thathathisthathathatis 

Suppose that we need to count the occurrences of the string that in this input, including all overlapping occurrences.

Note that a left-to-right scan finds three non-overlapping occurrences of that in the input string, but there are two additional overlapping occurrences that we also need to match and count. Here are the start-end positions of all five occurrences of the substring that:

Positions: 0-3, 3-6, 10-13, 13-16, 16-19

A simple search using the regex that will give us a match count of three. Each match consumes its four characters and the next search resumes after them, so the occurrences that start inside a previous match are missed. To find the overlapping matches, we need to use a lookahead pattern, because lookahead patterns are zero-length. They don't consume any characters; they only assert that the required text is present ahead, based on the pattern used inside the lookahead, and the current position doesn't change. So, the solution is to use a lookahead regex as follows:

(?=that) 
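Before the full listing, the zero-length claim can be checked directly by comparing what a Matcher reports for the plain pattern and for the lookahead. The following is a minimal sketch, not part of the original listings; the class name ZeroLengthLookahead and the helper firstMatch are hypothetical names introduced only for this illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthLookahead
{
    // Hypothetical helper: reports the first match as
    // "start,end,length of the matched text".
    static String firstMatch(String regex, String input)
    {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find())
        {
            return "no match";
        }
        return m.start() + "," + m.end() + "," + m.group().length();
    }

    public static void main(String[] args)
    {
        String input = "thathathisthathathatis";
        // The plain pattern consumes four characters, so end() is 4.
        System.out.println(firstMatch("that", input));     // 0,4,4
        // The lookahead consumes nothing: start() equals end()
        // and group() is the empty string.
        System.out.println(firstMatch("(?=that)", input)); // 0,0,0
    }
}
```

Because the lookahead match is empty, the matched text has to be recovered some other way, for example from the start position and the keyword length.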

Here is the full code that shows this regex in action:

package example.regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadOverlapping
{
    public static void main(String[] args)
    {
        final String kw = "that";
        // Wrap the keyword in a lookahead so that each match is zero-length
        final String regex = "(?=" + kw + ")";
        final String string = "thathathisthathathatis";
        final Pattern pattern = Pattern.compile(regex);
        final Matcher matcher = pattern.matcher(string);
        int count = 0;
        while (matcher.find())
        {
            // The match itself is empty, so derive the end from the keyword length
            System.out.printf("Start: %d End:%d%n",
                matcher.start(), matcher.start() + kw.length() - 1);
            count++;
        }
        System.out.printf("Match count: %d%n", count);
    }
}

Once we compile and run the preceding class, we will get the following output:

Start: 0 End:3 
Start: 3 End:6
Start: 10 End:13
Start: 13 End:16
Start: 16 End:19
Match count: 5

This output shows the start and end positions of every occurrence, including the overlapping ones, and, more importantly, the total count of matches, which is 5.
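To make the difference between the two approaches concrete, the counts for the plain regex and the lookahead regex can be compared side by side. This is a small sketch under stated assumptions: the class OverlapCountComparison and the helper countMatches are hypothetical names, and Pattern.quote is used here (the original listing does not use it) on the assumption that the keyword should be treated as literal text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlapCountComparison
{
    // Hypothetical helper: counts how many times find() succeeds.
    static int countMatches(String regex, String input)
    {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find())
        {
            count++;
        }
        return count;
    }

    public static void main(String[] args)
    {
        String kw = "that";
        String input = "thathathisthathathatis";
        // Assumption: the keyword is literal text, so quote it.
        String literal = Pattern.quote(kw);
        // Consuming pattern: overlapping occurrences are skipped.
        System.out.println(countMatches(literal, input));               // 3
        // Zero-length lookahead: every start position is tested.
        System.out.println(countMatches("(?=" + literal + ")", input)); // 5
    }
}
```

Quoting matters only if the keyword could contain regex metacharacters such as a dot or a parenthesis; for that it makes no difference.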

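Here is another code listing that finds all the three-character strings that have 'a' as the middle letter and the same word character before and after it. Inside the lookahead, the regex captures one word character in group 1, requires the letter 'a', and then uses the backreference \1 to require the captured character again. For example, bab, zaz, kak, dad, 5a5, and _a_ should be matched: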

package example.regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadOverlappingMatches
{
    public static void main(String[] args)
    {
        // Backslashes must be doubled inside a Java string literal
        final String regex = "(?=(\\w)a\\1)";
        final String string = "5a5akaktjzazbebbobabababsab";
        final Matcher matcher = Pattern.compile(regex)
            .matcher(string);
        int count = 0;
        while (matcher.find())
        {
            // The lookahead is zero-length, so compute the end position ourselves
            final int start = matcher.start();
            final int end = start + 2;
            System.out.printf("Start: %2d End:%2d %s%n",
                start, end, string.substring(start, end + 1));
            count++;
        }
        System.out.printf("Match count: %d%n", count);
    }
}

This code generates the following output:

Start:  0 End: 2 5a5
Start:  4 End: 6 kak
Start:  9 End:11 zaz
Start: 17 End:19 bab
Start: 19 End:21 bab
Start: 21 End:23 bab
Match count: 6
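The preceding listing rebuilds each matched string with substring, because the lookahead itself matches nothing. An alternative, sketched here as an illustration rather than as the book's own method, is to wrap the whole lookahead body in an extra capturing group and read the text with matcher.group(1). The class LookaheadCapture and the helper overlappingMatches are hypothetical names; note that the extra group shifts the backreference from \1 to \2:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCapture
{
    // Hypothetical helper: returns every overlapping three-character match.
    static List<String> overlappingMatches(String input)
    {
        // Group 1 is the whole three-character string;
        // group 2 is the word character repeated around 'a'.
        Matcher m = Pattern.compile("(?=((\\w)a\\2))").matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find())
        {
            // The overall match is empty, but the group keeps the text.
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args)
    {
        System.out.println(overlappingMatches("5a5akaktjzazbebbobabababsab"));
        // [5a5, kak, zaz, bab, bab, bab]
    }
}
```

With this variant, the length of the match does not have to be known in advance, which helps when the pattern inside the lookahead can match text of varying length.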