Look behind

We could safely define look behind as the opposite operation to look ahead. It tries to match the subexpression passed as an argument against the text immediately before the current position. It is zero-width as well and, therefore, the text it matches won't be part of the result.

It is represented as an expression preceded by a question mark, a less-than sign, and an equals sign, ?<=, inside a parenthesis block: (?<=regex).

We could, for instance, use it in an example similar to the one we used in negative look ahead to find just the surname of someone named John McLane. To accomplish this, we could write a look behind like the following:

>>> pattern = re.compile(r'(?<=John\s)McLane')
>>> result = pattern.finditer("I would rather go out with John McLane than with John Smith or John Bon Jovi")
>>> for i in result:
...     print i.start(), i.end()
...
32 38

With the preceding look behind, we requested the regex engine to match only positions that are preceded by John and a whitespace character, and then to consume McLane as the result.
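The zero-width behavior can be checked by comparing the look behind with a plain pattern that consumes the first name. The following is only a minimal sketch, written for Python 3 (hence the print() function); the variable names are illustrative.

```python
import re

text = "I would rather go out with John McLane than with John Smith or John Bon Jovi"

# Plain pattern: "John " is consumed and becomes part of the match.
consuming = re.search(r'John\sMcLane', text)

# Look behind: "John " must precede the match but is not consumed.
look_behind = re.search(r'(?<=John\s)McLane', text)

print(consuming.group(), consuming.span())
print(look_behind.group(), look_behind.span())
```

Both patterns end at the same position, but only the plain pattern includes the first name in the matched text.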

In Python's re module, there is, however, a fundamental difference between how look ahead and look behind are implemented. Due to a number of deeply rooted technical reasons, the look behind mechanism is only able to match fixed-width patterns. If variable-width patterns in look behind are required, the regex module at https://pypi.python.org/pypi/regex can be leveraged instead of the standard Python re module.

Fixed-width patterns don't contain variable-length matchers such as the quantifiers we studied in Chapter 1, Introducing Regular Expressions. Other variable-length constructions such as back references aren't allowed either. Alternation is allowed but only if the alternatives have the same length. Again, these limitations are not present in the aforementioned regex module.

Let's see what happens if we use an alternation with alternatives of different lengths in a look behind:

>>> pattern = re.compile(r'(?<=(John|Jonathan)\s)McLane')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: look-behind requires fixed-width pattern

We've got an exception as look behind requires a fixed-width pattern. We will get a similar result if we try to apply quantifiers or other variable-length constructions.
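The restriction can be explored with a small sketch, written for Python 3, where the exception is exposed as re.error. It also shows one possible workaround that stays within the standard re module: placing one fixed-width look behind per alternative inside a non-capturing group. The helper name compiles and the sample sentence are made up for this illustration.

```python
import re

def compiles(pattern):
    """Return True if the re module accepts the pattern, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of the same length (4 characters each) are accepted.
same_width = compiles(r'(?<=(?:John|Jack)\s)McLane')

# Alternatives of different lengths are rejected.
different_width = compiles(r'(?<=(?:John|Jonathan)\s)McLane')

# A quantifier inside the look behind is rejected as well.
quantified = compiles(r'(?<=John\s+)McLane')

# Possible workaround: one fixed-width look behind per alternative.
workaround = re.compile(r'(?:(?<=John\s)|(?<=Jonathan\s))McLane')
matches = workaround.findall("John McLane, Jonathan McLane, Holly McLane")

print(same_width, different_width, quantified)
print(matches)
```

Each look behind in the workaround has a fixed width on its own, so the engine accepts the pattern even though the two alternatives differ in length.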

Now that we have learned different techniques to match ahead or behind without consuming characters and the different limitations we might have, we can try to write another example that embraces a few of the mechanisms that we have studied to solve a real-world problem.

Let's assume that we want to extract any Twitter username that is present in a tweet in order to create an automatic mood detection system. To write a regular expression that extracts them, we should start by identifying how a Twitter username is represented. If we browse Twitter's support site at https://support.twitter.com/articles/101299-why-can-t-i-register-certain-usernames, we might find the following description:

"A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."

For our development tests, we are going to use this Packt Publishing tweet:

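"Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers"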

The first thing we should be able to construct is a character set with all the characters that could potentially be used in a Twitter username. This could be any alphanumeric character or the underscore character, as we just found in the previous Twitter support article. Therefore, we could construct a character set similar to the following:

[\w_]

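This will represent all the characters that we want to extract from the username. Note that \w already includes the underscore, so listing it explicitly is redundant but harmless. Then, we will need to prepend a non-word boundary (\B) and the at symbol (@) that will be used to locate the usernames: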

/\B@[\w_]+/

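The reason behind using the non-word boundary is that we don't want to get confused with e-mails and so on. Because @ is not a word character, \B only matches in front of it when the previous character is not a word character either, or when the @ is at the start of the text. We are therefore looking only for an @ symbol that does not directly follow a letter, digit, or underscore, followed by one or more alphanumeric or underscore characters. The examples are as follows: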

  • @vromer0 is a valid username
  • iam@vromer0 is not a valid username as it should start with the @ symbol
  • @vromero.org is not a valid username as it contains an invalid character
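The three examples above can be run through the pattern to see how it behaves. This is a minimal sketch for Python 3; the names candidates and results are only illustrative. Note that the pattern, as written, does not reject the third example: it simply stops at the dot and returns the part before it.

```python
import re

pattern = re.compile(r'\B@[\w_]+')

candidates = ["@vromer0", "iam@vromer0", "@vromero.org"]

# Map every candidate to the list of matches found in it.
results = {candidate: pattern.findall(candidate) for candidate in candidates}

for candidate, found in results.items():
    print(candidate, found)
```

The second candidate yields no match because the @ directly follows a word character, so \B cannot match in front of it.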

If we use the regular expression we have at the moment, we will obtain the following result:

>>> pattern = re.compile(r'\B@[\w_]+')
>>> pattern.findall("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
['@HadoopNews']

However, we want to match just the username, without the preceding @ symbol. At this point, the look behind mechanism becomes useful. We can include the non-word boundary and the @ symbol in a look behind subexpression so that they don't become a part of the matched result:

>>> pattern = re.compile(r'(?<=\B@)[\w_]+')
>>> pattern.findall("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
['HadoopNews']

And now we have accomplished our goals.
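For reuse, the final pattern could be wrapped in a small function. The following is only a sketch for Python 3; the names USERNAME and extract_usernames, as well as the second sample sentence and the @some_user handle in it, are hypothetical and not part of the original example.

```python
import re

# Look behind keeps "@" out of the match while still requiring it.
USERNAME = re.compile(r'(?<=\B@)[\w_]+')

def extract_usernames(tweet):
    """Return every username mentioned in the tweet, without the leading @."""
    return USERNAME.findall(tweet)

tweet = ("Know your Big Data = 5 for $50 on eBooks and 40% off all eBooks "
         "until Friday #bigdata #hadoop @HadoopNews packtpub.com/bigdataoffers")
usernames = extract_usernames(tweet)

# Made-up sentence mixing an e-mail-like token with two mentions.
other = extract_usernames("mail me at iam@vromer0 or ping @vromer0 and @some_user")

print(usernames)
print(other)
```

The e-mail-like token is skipped because its @ follows a word character, while both standalone mentions are returned without the @ symbol.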

Negative look behind

The negative look behind mechanism has the very same nature as the look behind mechanism, but we will only have a valid result if the passed subexpression doesn't match.

It is represented as an expression preceded by a question mark, a less-than sign, and an exclamation mark, ?<!, inside a parenthesis block: (?<!regex).

It is worth remembering that negative look behind not only shares most of the characteristics of the look behind mechanism, but it also shares its limitations. The negative look behind mechanism is only able to match fixed-width patterns. This limitation has the same cause and implications as those we studied in the previous section.

We could put this into practice by trying to match any person surnamed Doe who is not named John with a regular expression like this: /(?<!John\s)Doe/. If we use it in Python's console, we will obtain the following result:

>>> pattern = re.compile(r'(?<!John\s)Doe')
>>> results = pattern.finditer("John Doe, Calvin Doe, Hobbes Doe")
>>> for result in results:
...     print result.start(), result.end()
...
17 20
29 32
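Because each negative look behind must be fixed width, a single look behind with alternatives of different lengths, such as John and Jonathan, would be rejected. One possible way around this within the standard re module is to chain several negative look behinds, each with its own fixed width. The following is a minimal sketch for Python 3; the extra name Jonathan Doe in the sample text is an addition made for this illustration.

```python
import re

text = "John Doe, Calvin Doe, Hobbes Doe, Jonathan Doe"

# Two consecutive negative look behinds: neither subexpression may
# match before "Doe" for the engine to accept the current position.
pattern = re.compile(r'(?<!John\s)(?<!Jonathan\s)Doe')

spans = [match.span() for match in pattern.finditer(text)]
for start, end in spans:
    print(start, end)
```

Only the Doe occurrences that follow Calvin and Hobbes are reported, since the first and last ones are excluded by one of the two look behinds.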