Chapter 1. Introducing Regular Expressions

Regular expressions are text patterns that define the form a text string should have. Among other uses, they make it possible to do the following:

  • Check whether an input matches a given pattern; for example, we can check whether a value entered in an HTML form is a valid e-mail address
  • Look for occurrences of a pattern in a piece of text; for example, check whether either the word "color" or the word "colour" appears in a document with just one scan
  • Extract specific portions of a text; for example, extract the postal code from an address
  • Replace portions of text; for example, change any occurrence of "color" or "colour" to "red"
  • Split a larger text into smaller pieces; for example, split a text at every occurrence of the dot, comma, or newline characters
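
As a preview only, the five activities above could be sketched as follows with Python's standard re module, which is covered properly in the next chapter. The sample strings, the variable names, the deliberately simplistic e-mail pattern, and the five-digit postal code format are all assumptions made for illustration; real-world validation needs more care.

```python
import re

# All sample data below is invented for illustration.
sentence = "The color of the car and the colour of the bike"

# 1. Check whether an input matches a pattern.
#    This e-mail pattern is intentionally simplistic.
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
is_email = EMAIL.fullmatch("jane@example.com") is not None

# 2. Look for "color" or "colour" in a single scan.
found = re.search(r"colou?r", sentence) is not None

# 3. Extract a portion of text (assumes a five-digit postal code).
address = "221B Example Street, 90210 Springfield"
postal = re.search(r"\b\d{5}\b", address).group()

# 4. Replace every occurrence of "color" or "colour" with "red".
replaced = re.sub(r"colou?r", "red", sentence)

# 5. Split at every dot, comma, or newline character.
pieces = re.split(r"[.,\n]", "one,two.three\nfour")
```

Each call takes the pattern as its first argument; only the operation applied to the text changes.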

In this chapter, we are going to learn the basics of regular expressions from a language-agnostic point of view. By the end of the chapter, we will understand how regular expressions work, but we won't yet be able to execute a regular expression in Python; that is covered in the next chapter. For this reason, the examples in this chapter are approached from a theoretical point of view rather than executed in Python.

History, relevance, and purpose

Regular expressions are pervasive. They can be found everywhere, from the newest office suite or JavaScript framework to UNIX tools dating back to the 1970s. No modern programming language can be called complete until it supports regular expressions.

Although they are prevalent in languages and frameworks, regular expressions are not yet a staple of the modern coder's toolkit. One reason often given is their steep learning curve. Regular expressions can be difficult to master and very hard to read if they are not written with care.

As a result of this complexity, the following old chestnut is easy to find in Internet forums:

 

"Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

 
 --Jamie Zawinski, 1997

You'll find it at https://groups.google.com/forum/?hl=en#!msg/alt.religion.emacs/DR057Srw5-c/Co-2L2BKn7UJ.

Throughout this book, we'll learn best practices for writing regular expressions that make them much easier to read.

Even though regular expressions can be found in the latest and greatest programming languages today, and probably will be for many years to come, their history goes back to 1943, when the neurophysiologists Warren McCulloch and Walter Pitts published A logical calculus of the ideas immanent in nervous activity. This paper not only marked the beginning of regular expressions, but also proposed the first mathematical model of a neural network.

The next step was taken in 1956, this time by a mathematician. Stephen Kleene wrote the paper Representation of events in nerve nets and finite automata, where he coined the terms regular sets and regular expressions.

Twelve years later, in 1968, a legendary pioneer of computer science took Kleene's work and extended it, publishing his studies in the paper Regular Expression Search Algorithm. This engineer was Ken Thompson, known for the design and implementation of Unix, the B programming language, the UTF-8 encoding, and others.

Ken Thompson's work didn't end with writing a paper. He included support for these regular expressions in his version of QED. To search with a regular expression in QED, the following had to be written:

g/<regular expression>/p

In the preceding line of code, g means global search and p means print. If, instead of writing regular expression, we write the short form re, we get g/re/p, and therefore, the beginnings of the venerable UNIX command-line tool grep.
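
To make the behavior of g/re/p concrete, here is a minimal sketch in Python of the same idea: scan every line and keep those in which the pattern occurs. The function name grep and the sample lines are hypothetical, and Python's pattern syntax is not the same as QED's; this only illustrates the "global search, then print" behavior.

```python
import re

def grep(pattern, lines):
    """Return the lines in which the pattern occurs anywhere."""
    regex = re.compile(pattern)
    # "g": visit every line; keep a line if the pattern is found in it.
    return [line for line in lines if regex.search(line)]

# Invented sample input for illustration.
sample = ["the color red", "nothing to see here", "the colour blue"]

# "p": print each matching line.
for line in grep(r"colou?r", sample):
    print(line)
```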

The next outstanding milestones were the release of the first non-proprietary regex library by Henry Spencer and, later, the creation of the scripting language Perl by Larry Wall. Perl pushed regular expressions into the mainstream.

The Perl implementation went further and added many modifications to the original regular expression syntax, creating the so-called Perl flavor. Many later implementations in other languages and tools are based on the Perl flavor of regular expressions.

The IEEE, through its POSIX standard, has tried to standardize regular expression syntax and behavior and to give it better Unicode support. This is called the POSIX flavor of regular expressions.

Today, the standard Python module for regular expressions, re, supports only Perl-style regular expressions. There is an effort to write a new regex module with better POSIX-style support at https://pypi.python.org/pypi/regex. This new module is intended to eventually replace the implementation of Python's re module. In this book, we will learn how to leverage only the standard re module.

Tip

Regular expressions, regex, regexp, or regexen?

Henry Spencer referred to his famous library interchangeably as "regex" or "regexp". Wikipedia proposes regex or regexp as abbreviations. The famous Jargon File lists them as regexp, regex, and reg-ex.

However, even though there does not seem to be a strict convention for naming regular expressions, they are rooted in the mathematical field of formal languages, where being exact is everything. Most modern implementations support features that go beyond what regular expressions in the formal sense can express, and therefore they are not true regular expressions. Larry Wall, creator of the Perl language, used the term regexes or regexen for this reason.

In this book, we will use all the aforementioned terms interchangeably, as if they were perfect synonyms.
