Chapter 3. Grouping

Grouping is a powerful tool that allows you to perform operations such as:

  • Creating subexpressions to apply quantifiers. For instance, repeating a subexpression rather than a single character.
  • Limiting the scope of the alternation. Instead of alternating the whole expression, we can define exactly what has to be alternated.
  • Extracting information from the matched pattern. For example, extracting a date from lists of orders.
  • Using the extracted information again in the regex, which is probably the most useful property. One example would be to detect repeated words.

Throughout this chapter, we will explore groups, from the simplest to the most complex ones. We'll review some of the previous examples in order to bring clarity to how these operations work.

Introduction

We've already used groups in several examples throughout Chapter 2 Regular Expressions with Python. Grouping is accomplished through two metacharacters, the parentheses (). The simplest example of the use of parentheses would be building a subexpression. For example, imagine you have a list of products, the ID for each product being made up of two or three sequences of one digit followed by a dash and followed by one alphanumeric character, 1-a2-b:

>>>re.match(r"(d-w){2,3}", ur"1-a2-b")
<_sre.SRE_Match at 0x10f690738>

As you can see in the preceding example, the parentheses indicate to the regex engine that the pattern inside them has to be treated like a unit.

Let's see another example; in this case, we need to match whenever there is one or more ab followed by c:

>>>re.search(r"(ab)+c", ur"ababc")
<_sre.SRE_Match at 0x10f690a08>
>>>re.search(r"(ab)+c", ur"abbc")
None

So, you could use parentheses whenever you want to group meaningful subpatterns inside the main pattern.

Another simple example of their use is limiting the scope of alternation. For example, let's say we would like to write an expression to match if someone is from Spain. In Spanish, the country is spelled España and Spaniard is spelled Español. So, we want to match España and Español. The Spanish letter ñ can be confusing for non-Spanish speakers, so in order to avoid confusion we'll use Espana and Espanol instead of España and Español.

We can achieve it with the following alternation:

>>>re.search("Espana|ol", "Espanol")
<_sre.SRE_Match at 0x1043cfe68>
>>>re.search("Espana|ol", "Espana")
<_sre.SRE_Match at 0x1043cfed0>

The problem is that this also matches ol:

>>>re.search("Espana|ol", "ol")
<_sre.SRE_Match at 0x1043cfe00>

So, let's try character classes as in the following code:

>>>re.search("Espan[aol]", "Espanol")
<_sre.SRE_Match at 0x1043cf1d0>

>>>re.search("Espan[aol]", "Espana")
<_sre.SRE_Match at 0x1043cf850>

It works, but here we have another problem: It also matches "Espano" and "Espanl" that doesn't mean anything in Spanish:

>>>re.search("Espan[a|ol]", "Espano")
<_sre.SRE_Match at 0x1043cfb28>

The solution here is to use parentheses:

>>>re.search("Espan(a|ol)", "Espana")
<_sre.SRE_Match at 0x10439b648>

>>>re.search("Espan(a|ol)", "Espanol")
<_sre.SRE_Match at 0x10439b918>

>>>re.search("Espan(a|ol)", "Espan")
   None

>>>re.search("Espan(a|ol)", "Espano")
   None

>>>re.search("Espan(a|ol)", "ol")
   None

Let's see another key feature of grouping, capturing. Groups also capture the matched pattern, so you can use them later in several operations, such as sub or in the regex itself.

For example, imagine you have a list of products, the IDs of which are made up of digits representing the country of the product, a dash as a separator, and one or more alphanumeric characters as the ID in the DB. You're requested to extract the country codes:

>>>pattern = re.compile(r"(d+)-w+")
>>>it = pattern.finditer(r"1-a
20-baer
34-afcr")
>>>match = it.next()
>>>match.group(1)
'1'
>>>match = it.next()
>>>match.group(1)
'20'
>>>match = it.next()
>>>match.group(1)
'34'

In the preceding example, we've created a pattern to match the IDs, but we're only capturing a group made up of the country digits. Remember that when working with the group method, the index 0 returns the whole match, and the groups start at index 1.

Capturing groups give a huge range of possibilities due to which they can also be used with several operations, which we would discuss in the upcoming sections.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.36.71