Optimization recommendations

In the following sections, we will find a number of recommendations that could be applied to improve regular expressions.

The best tool will always be common sense, and common sense will need to be used even while following these recommendations. It has to be understood when the recommendation is applicable and when it is not. For instance, the recommendation don't be greedy cannot be used in all the cases.

Reuse compiled patterns

We have learned in Chapter 2, Regular Expressions with Python, that to use a regular expression we have to convert it from its string representation to a compiled form as RegexObject.

This compilation takes some time. If we are using the rest of the module operations instead of using the compile function to avoid the creation of the RegexObject, we should understand that the compilation is executed anyway and a number of compiled RegexObject are cached automatically.

However, when we are compiling, that cache won't back us. Every single compile execution will consume an amount of time that perhaps could be negligible for a single execution, but it's definitely relevant if many executions are performed.

Let's see the difference between reusing and not reusing the compiled patterns in the following example:

>>> def dontreuse():
        pattern = re.compile(r'foo')
        pattern.match("foo bar")
   
>>> def callonethousandtimes():
        for _ in range(1000):
            dontreuse()
   
>>> test(callonethousandtimes)
The function callonethousandtimes lasted: 0.001965

>>> pattern = re.compile(r'foo')
>>> def reuse():
        pattern.match("foo bar")
   
>>> def callonethousandtimes():
        for _ in range(1000):
            reuse()
   
>>> test(callonethousandtimes)
The function callonethousandtimes lasted: 0.000633
>>>

Extract common parts in alternation

Alternation is always a performance risk in regular expressions. When using them in a sort of NFA implementation, in Python, we should extract any common part outside of the alternation.

For instance, if we have /(HelloExtract common parts in alternationWorld|HelloExtract common parts in alternationContinent|HelloExtract common parts in alternationCountry,)/,we could easily extract HelloExtract common parts in alternation with the following expression: /HelloExtract common parts in alternation(World|Continent|Country)/. This would enable our engine to just check HelloExtract common parts in alternation once, and it will not go back to recheck for each possibility. In the following example, we can see the difference on execution:

>>> pattern = re.compile(r'/(HellosWorld|HellosContinent|HellosCountry)')
>>> def nonoptimized():
         pattern.match("HellosCountry")
   
>>> def callonethousandtimes():
         for _ in range(1000):
             nonoptimized()
   
>>> test(callonethousandtimes)
The function callonethousandtimes lasted: 0.000645

>>> pattern = re.compile(r'/Hellos(World|Continent|Country)')
>>> def optimized():
         pattern.match("HellosCountry")
   
>>> def callonethousandtimes():
         for _ in range(1000):
             optimized()
   
>>> test(callonethousandtimes)
The function callonethousandtimes lasted: 0.000543
>>>

Shortcut to alternation

Ordering in alternation is relevant, each of the different options present in the alternation will be checked one by one, from left to right. This can be used in favor of performance.

If we place the more likely options at the beginning of the alternation, more checks will mark the alternation as matched sooner.

For instance, we know that the more common colors of cars are white and black. If we are writing a regular expression to accept some colors, we should put white and black first as those are more likely to appear. We can frame the regex like this /(white|black|red|blue|green)/.

For the rest of the elements, if they have the very same odds of appearing, it could be favorable to put the shortest ones before the longer ones:

>>> pattern = re.compile(r'(white|black|red|blue|green)')
>>> def optimized():
         pattern.match("white")
   
>>> def callonethousandtimes():
         for _ in range(1000):
             optimized()
   
>>> test(callonethousandtimes)
The function callonethousandtimes lasted: 0.000667
>>>

>>> pattern = re.compile(r'(green|blue|red|black|white)')
>>> def nonoptimized():
         pattern.match("white")
   
>>> def callonethousandtimes():
         for _ in range(1000):
             nonoptimized()
   
>>> test(callonethousandtimes)
The function callonethousandtimes lasted: 0.000862
>>>

Use non-capturing groups when appropriate

Capturing groups will consume some time for each group defined in an expression. This time is not very important, but it is still relevant if we are executing a regular expression several times.

Sometimes, we use groups but we might not be interested in the result, for instance, when using alternation. If that is the case, we can save some execution time of the engine by marking that group as non-capturing, for example, (?:person|company).

Be specific

When the patterns we define are very specific, the engine can help us perform quick integrity checks before the actual pattern matching is executed.

For instance, if we pass the expression /w{15}/ to the engine to match it against the text hello, the engine could decide to check whether the input string is actually at least 15 characters long instead of matching the expression.

Don't be greedy

We've studied about quantifiers in Chapter 1, Introducing Regular Expressions, and we learned the difference between greedy and reluctant quantifiers. We also found that the quantifiers are greedy by default.

What does this mean in terms of performance? It means that the engine will always try to catch as many characters as possible, and then reduce the scope step-by-step until the matching is done. This could potentially make the regular expression slow if the match is typically short. Keep in mind, however, that this is only applicable if the match is usually short.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.146.37.250