Chapter 5. Performance of Regular Expressions

Up to this point, we worried about learning how to leverage a feature or obtain a result without caring too much about how fast the process would be. Our only goals were correctness and readability.

In this chapter, we are going to steer towards a completely different concern—performance. However, we will find that often an improvement in performance will result in degradation of readability. When we are modifying something to make it faster, we are probably making it easier for the machine to understand, and therefore, we are probably compromising on human readability.

On December 4, 1974, Donald Knuth, the author of the famous book The Art of Computer Programming, wrote the paper Structured Programming with go-to statements. This well-known quote is extracted from the paper:

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

That said, we should be careful about what we optimize. Probably, for a regular expression used to validate an e-mail address of a form, we should have more interest in readability than in performance. On the other hand, if we are writing a regular expression to be used in batch processing of huge historical files, we should be more interested in the performance.

The most commonly used approach for optimization is to first write, then measure, and only then optimize that critical 3 percent. So, in this chapter, first we are going to learn how to measure and analyze the regular expressions, and then follow on with optimization techniques.

Benchmarking regular expressions with Python

In order to benchmark our regex, we're going to measure the time a regex takes to execute. It's important to test them with different inputs, because with small inputs almost every regex is fast enough. However, with longer ones it can be a completely different beast, as we will see later in the section Backtracking.

First, we're going to create a small function to help us with this task:

>>> from time import clock as now
>>> def test(f, *args, **kargs):
        start = now()
        f(*args, **kargs)
        print "The function %s lasted: %f" %(f.__name__, now() - start)

So, we can test a regex using the following code:

>>> def alternation(text):
       pat = re.compile('spa(in|niard)')
       pat.search(text)
>>> test(alternation, "spain")
The function alternation lasted: 0.000009

Python comes with a built-in profiler http://docs.python.org/2/library/profile.html that we can also use to measure the time and the number of calls, among other things:

>>> import cProfile
>>> cProfile.run("alternation('spaniard')")

You can see the output in the following screenshot:

Benchmarking regular expressions with Python

Profiling output

Let's see another useful technique that is going to help when you want to see what's going on under the hook of your regex. It's something that we've seen before in Chapter 2, Regular Expressions with Python, the flag DEBUG. Recall that it gives us information about how the pattern is compiled. For example:

>>> re.compile('(w+d+)+-dd', re.DEBUG)
max_repeat 1 4294967295
  subpattern 1
    max_repeat 1 4294967295
      in
        category category_word
    max_repeat 1 4294967295
      in
        category category_digit
literal 45
in
  category category_digit
in
  category category_digit

Here, we can see three max_repeat conditions from 1 to 4294967295, two of them inside another max_repeat. Think about them as nested loops, as you're probably thinking this is a bad smell. In fact, this will lead you to a catastrophic backtracking, something that we'll see later.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.68.18