Non-capturing groups

As we've mentioned before, capturing content is not the only use of groups. There are cases when we want to use groups, but we're not interested in extracting the information; alternation would be a good example. That's why we have a way to create groups without capturing. Throughout this book, we've been using groups to create subexpressions, as can be seen in the following example:

>>> import re
>>> re.search("Españ(a|ol)", "Español")
<_sre.SRE_Match at 0x10e90b828>
>>> re.search("Españ(a|ol)", "Español").groups()
('ol',)

You can see that we've captured a group even though we're not interested in the content of the group. So, let's try it without capturing, but first we have to know the syntax, which is almost the same as in normal groups, (?:pattern). As you can see, we've only added ?:. Let's see the following example:

>>> re.search("Españ(?:a|ol)", "Español")
<_sre.SRE_Match at 0x10e912648>
>>> re.search("Españ(?:a|ol)", "Español").groups()
()

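After using the new syntax, we have the same functionality as before, but the engine no longer stores the text matched by that group, and the regex is easier to maintain because the group doesn't shift the numbering of the capturing groups around it. Note that a non-capturing group cannot be referenced, either through the match object or in a backreference.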
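The effect on group numbering is easier to see when the pattern also contains a group we do care about. The following is a minimal sketch, assuming a made-up sample string with a four-digit year after the word; only the standard re module is used:

```python
import re

text = "Español 2024"  # hypothetical sample input

# Capturing alternation: the alternation is group 1, the year is group 2.
capturing = re.search(r"Españ(a|ol) (\d{4})", text)
# Non-capturing alternation: only the year is captured, as group 1.
non_capturing = re.search(r"Españ(?:a|ol) (\d{4})", text)

print(capturing.groups())      # ('ol', '2024')
print(non_capturing.groups())  # ('2024',)
print(non_capturing.group(0))  # Español 2024
```

Both patterns match the same text; the only difference is which groups are stored and how they are numbered.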

Atomic groups

Atomic groups are a special case of non-capturing groups, and they're usually used to improve performance. An atomic group disables backtracking into itself: once the group has matched, the engine discards the alternatives it could still have tried inside it. With atomic groups you can avoid cases where trying every possibility or path in the pattern doesn't make sense. This concept is difficult to understand, so stay with me up to the end of the section.

The re module doesn't support atomic groups in Python versions before 3.11. So, in order to see an example, we're going to use the regex module: https://pypi.python.org/pypi/regex.

Imagine we have to look for an ID made up of one or more alphanumeric characters followed by a dash and by a digit:

>>> import regex
>>> data = "aaaaabbbbbaaaaccccccdddddaaa"
>>> regex.match(r"(\w+)-\d", data)

Let's see step by step what's happening here:

  1. The regex engine matches the first a.
  2. It then matches every character up to the end of the string.
  3. It fails because it doesn't find the dash.
  4. So, the engine backtracks: the group gives up its last character, and the engine looks for the dash again at that position.
  5. It fails again and starts the same process, giving up one more character each time.
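The steps above can be imitated with a toy model. This is only a sketch of the idea, not how the regex engine is actually implemented: the function name count_attempts is hypothetical, and it uses ASCII stand-ins for the alphanumeric and digit classes. It counts how many times the dash-and-digit tail is tried while the greedy run gives characters back.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")  # ASCII stand-in for \w
DIGITS = set(string.digits)                                    # ASCII stand-in for \d


def count_attempts(text):
    """Toy model: greedy word run, then a dash and a digit, anchored at index 0.

    Returns (matched, attempts), where attempts counts how many times the
    dash-and-digit tail was tried.
    """
    # Steps 1-2: the greedy run consumes every word character it can.
    end = 0
    while end < len(text) and text[end] in WORD_CHARS:
        end += 1
    attempts = 0
    # Steps 3-5: try the tail, then give back one character and try again.
    while end >= 1:
        attempts += 1
        if text[end:end + 1] == "-" and text[end + 1:end + 2] in DIGITS:
            return True, attempts
        end -= 1
    return False, attempts


data = "aaaaabbbbbaaaaccccccdddddaaa"
print(count_attempts(data))     # no match; attempts equals len(data)
print(count_attempts("abc-1"))  # (True, 1)
```

In this model a failing input costs one attempt per consumed character, while an input that really contains the dash and digit succeeds on the first attempt.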

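The engine tries this with every character that the group consumed. If you think about what we're doing, it doesn't make any sense to keep trying once you have failed the first time: a dash can never appear inside a run of alphanumeric characters, so giving characters back cannot help. And that's exactly what an atomic group is useful for. For example: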

>>> regex.match(r"(?>\w+)-\d", data)

Here we've added ?>, which indicates an atomic group, so once the regex engine fails to match, it doesn't keep trying with every character in the data.
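Atomic groups also change what can match, not only how quickly a match fails. The sketch below is a stand-in, not the (?>...) syntax itself: so that it runs with only the standard re module, it imitates an atomic group with a lookahead plus a backreference, (?=(\w+))\1, which is a common workaround. The short sample strings are made up for illustration.

```python
import re

data = "aaaaabbbbbaaaaccccccdddddaaa"

# Emulated atomic version of (\w+)-\d: the lookahead captures the whole run,
# \1 consumes exactly that run, and the engine does not go back into a
# lookahead that has already succeeded.
atomic_like = re.compile(r"(?=(\w+))\1-\d")

print(atomic_like.match(data))              # None
print(atomic_like.match("abc-1").group(0))  # abc-1

# A plain greedy group gives back the final "1" so that \d can match...
print(re.match(r"(\w+)\d", "abc1").group(1))  # abc
# ...while the atomic-like version keeps the whole run and therefore fails.
print(re.match(r"(?=(\w+))\1\d", "abc1"))     # None
```

The last two lines show the trade-off: because the group never gives characters back, a pattern that relies on backtracking to succeed stops matching once the group is made atomic.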
