Chapter 2. Regular Expressions with Python

In the previous chapter, we saw how regular expressions work in general. In this chapter, we walk you through all the operations Python provides for working with regular expressions and how Python deals with them.

To do so, we will look at the quirks of the language when dealing with regular expressions, the different types of strings, and the API it offers through the RegexObject and MatchObject classes. We will cover every operation these classes support in depth, with many examples, as well as some problems users commonly face. Lastly, we will see the small nuances and differences between Python and other regex engines and between Python 2 and Python 3.

A brief introduction

Since v1.5, Python has provided Perl-style regular expressions, with some subtle exceptions that we will see later. Both the patterns and the strings to be searched can be Unicode strings as well as 8-bit strings (ASCII).
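As a minimal sketch of the two string types, the following assumes Python 3, where str is the Unicode string type and bytes is the 8-bit string type. The variable names are only illustrative. It shows that a pattern and the string it searches must be of the same type:

```python
import re

# A Unicode (str) pattern searched against a Unicode string.
text_match = re.compile('是').match('是的')

# An 8-bit (bytes) pattern searched against a bytes string.
bytes_match = re.compile(b'foo').match(b'foo bar')

# Mixing the two types is rejected in Python 3 (assumption: Python 3 only;
# Python 2 was more permissive with 8-bit and Unicode strings).
try:
    re.compile('foo').match(b'foo bar')
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(text_match.group(), bytes_match.group(), mixed_ok)
```

Here both matches succeed, while the mixed call raises TypeError, so mixed_ok ends up False.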

Tip

Unicode is the universal encoding, with more than 110,000 characters and 100 scripts, designed to represent all the world's living scripts and even historic ones. You can think of it as a mapping between numbers, or code points as they are called, and characters. So, we can represent every character, no matter in what language, with one single number. For example, the character 是 is the number 26159, and it is represented as \u662f (hexadecimal) in Python.
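The mapping between a character and its code point can be checked directly in the interpreter. This is a small sketch assuming Python 3, using the built-in ord, chr, and hex functions. The variable names are only illustrative:

```python
# Assumes Python 3, where every str is a Unicode string.
char = '\u662f'              # the character 是, written as an escape sequence
code_point = ord(char)       # character -> code point (an integer)
hex_form = hex(code_point)   # the same number in hexadecimal
round_trip = chr(code_point) # code point -> character

print(code_point, hex_form, round_trip)
```

The code point is 26159, its hexadecimal form is 0x662f, and converting it back with chr gives the original character.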

Regular expressions are supported by the re module. So, as with all modules in Python, we only need to import it to start playing with them. To do so, start the Python interactive shell and enter the following line of code:

>>> import re

Once we have imported the module, we can start trying to match a pattern. To do so, we need to compile a pattern, transforming it into bytecode, as shown in the following line of code. This bytecode will be executed later by an engine written in C.

>>> pattern = re.compile(r'foo')

Tip

Bytecode is an intermediate language. It's the output generated by a compiler, which is later executed by an interpreter. The Java bytecode that is interpreted by the JVM is probably the best-known example.

Once we have the compiled pattern, we can try to match it against a string, as in the following code:

>>> pattern.match("foo bar")
<_sre.SRE_Match at 0x108acac60>

In the preceding example, we compiled a pattern and then checked whether the pattern matches the text foo bar.
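The same steps can be put together in a short script. This is only a sketch; the names hit and miss are illustrative. It also shows that match returns None when the pattern is not found at the beginning of the string:

```python
import re

# Compile the pattern once, then reuse the compiled object.
pattern = re.compile(r'foo')

# match() only succeeds if the pattern matches at the start of the string.
hit = pattern.match("foo bar")
miss = pattern.match("bar foo")

if hit is not None:
    # group() returns the matched text; span() returns its (start, end) positions.
    print(hit.group(), hit.span())
print(miss)
```

In the first call the match object covers the text foo at positions 0 to 3. In the second call the result is None, because foo does not appear at the start of the string.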

Working with Python and regular expressions on the command line makes it easy to perform quick tests. You just need to start the Python interpreter and import the re module, as we mentioned previously. However, if you prefer a GUI to test your regex, you can download one written in Python at the following link:

http://svn.python.org/view/*checkout*/python/trunk/Tools/scripts/redemo.py?content-type=text%2Fplain

There are also a number of online tools, such as the one at https://pythex.org/, as well as desktop programs such as RegexBuddy, which we will cover in Chapter 5, Performance of Regular Expressions.

At this point, it's preferable to use the interpreter to gain fluency with regular expressions and get direct feedback.
