Getting to know regular expressions

Strings that store data usually follow certain patterns, which can be leveraged to retrieve the actual data values in a unified fashion. For example, some location cells contain coordinates written in a distinctive way, as numbers followed by the symbols for degrees, minutes, and seconds. To extract those values, we could write custom Python code, but that would be verbose and time-consuming.

This problem, extracting values from text by defining a pattern, sounds general and useful in many situations. When a problem can be stated in such universal terms, it usually is universal, and someone has probably already solved it! Looking for an existing solution first is, by the way, a good approach to programming in general.

Indeed, there is a universal solution, called regular expressions, or regex. Regex is a special mini-language that defines patterns in a text to look for. It is language-agnostic, and there are implementations for most languages. Python, for example, has a built-in re library but, in this case, we don't even need to invoke it explicitly, as pandas has the corresponding built-in functions. 
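As a quick preview, here is a minimal sketch of the idea using Python's built-in re module directly. The location string is made up for illustration, and the real cells may be formatted differently, so treat the pattern as an assumption rather than the one we will end up using:

```python
import re

# Hypothetical location string; the real data may be formatted differently.
location = "48°51′24″N 2°21′03″E"

# One or more digits before each of the degree, minute, and second symbols,
# followed by a hemisphere letter. Parentheses mark the parts to retrieve.
pattern = r"(\d+)°(\d+)′(\d+)″([NSEW])"

# findall returns one tuple of captured parts per match.
coordinates = re.findall(pattern, location)
print(coordinates)
# [('48', '51', '24', 'N'), ('2', '21', '03', 'E')]
```

A single short pattern replaces what would otherwise be several lines of splitting and slicing.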

In order to use regex, we first need to define our pattern as a string, using its language. This language is relatively easy to write (at least for simple queries), but notoriously hard to read. Here is an example of a regex that detects emails in text:

(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
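To see that such a pattern is just an ordinary string handed to a regex engine, here is a small sketch that tries it with Python's re module. The period before the domain suffix is written as \. so that it is treated literally, and the sample strings are invented for illustration. Note that because of the ^ and $ anchors, the pattern checks whether the whole string is an email rather than searching inside a longer text:

```python
import re

# The email pattern as a raw string, so that the backslash
# reaches the regex engine unchanged.
email_pattern = r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)"

# Made-up sample strings, for illustration only.
candidates = ["jane.doe@example.com", "not an email", "bob@localhost"]

for text in candidates:
    # re.search returns a match object, or None if nothing matched.
    found = re.search(email_pattern, text) is not None
    print(text, "->", found)
```

Only the first string has both an @ sign and a dot-separated domain, so it should be the only one reported as a match.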

Don't worry, it will start making sense soon (also, we won't write regex this complex ourselves). Here are the basics:

Each rule below is followed by an example text, an example pattern, and the result of applying the pattern to the text.

Rule: Any character except the special ones (., ^, $, *, +, ?, |, \, (), [], {}) represents itself and has to match exactly. This includes white space as well.
Example text: Hello!
Example pattern: llo
Result: llo

Rule: The plus sign (+) means that the character before it appears one or more times consecutively.
Example text: Hello!
Example pattern: l+o
Result: llo

Rule: Similarly, the asterisk (*) means that the previous character can be repeated any number of times, or not appear at all.
Example text: Hello!
Example pattern: r*o
Result: o

Rule: Curly braces with one or two numbers in them specify the permissible number (or range) of repetitions for the character before them.
Example text: Hello!
Example pattern: l{2}o
Result: llo

Rule: Square brackets ([]) define a choice: any one of the characters listed inside them will match. A caret (^) as the first character within the brackets means anything apart from the characters that follow. The pipe symbol (|) means or, but it separates alternative sub-patterns outside of brackets; inside brackets, it is just a literal character.
Example text: many or menu
Example pattern: m[ae]n[^y]
Result: menu

Rule: Square brackets also support letter and number ranges: A-Z, a-z, and 0-9 match any uppercase letter, lowercase letter, or digit, respectively.
Example text: Hello!
Example pattern: [a-z]+
Result: ello

Rule: There is a handful of special sequences, such as \d for any digit, \D for any non-digit, \s for any type of white space (including tabs and newline characters), \S for any non-white space, \w for any alphanumeric character, \W for any non-alphanumeric character, and many more. A backslash before a special symbol (for example, a period) escapes it, so regex treats it as an actual, literal character.
Example text: Hello!
Example pattern: \w+!
Result: Hello!

Rule: The period (.) represents any character. It can be combined with a plus sign, an asterisk, or curly braces; inside square brackets, it is just a literal period.
Example text: Hello!
Example pattern: .+
Result: Hello!

Rule: Parentheses define a capture group, that is, which substrings (there could be more than one) to retrieve. Groups can be named; in pandas, this will return a dataframe with columns named after the group names.
Example text: name: Huckleberry Finn
Example pattern: (\w+\s+\w+)
Result: Huckleberry Finn

Rule: ^ and $ match the beginning and end of the string, respectively.
Example text: Hello!
Example pattern: He$
Result: no match
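The rules above are easy to check interactively. The following sketch runs the example patterns through Python's re module, with the backslashes written out explicitly and the character choice written as [ae]. The helper first_match and the group names first and last are our own, introduced only for this illustration; the pandas side of named groups is not shown here:

```python
import re

def first_match(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# (pattern, text) pairs following the examples above.
examples = [
    (r"llo", "Hello!"),              # literal characters
    (r"l+o", "Hello!"),              # one or more repetitions
    (r"r*o", "Hello!"),              # zero or more repetitions
    (r"l{2}o", "Hello!"),            # exact number of repetitions
    (r"m[ae]n[^y]", "many or menu"), # choice and negated choice
    (r"[a-z]+", "Hello!"),           # character range
    (r"\w+!", "Hello!"),             # special sequence plus a literal !
    (r".+", "Hello!"),               # any character
    (r"(\w+\s+\w+)", "name: Huckleberry Finn"),  # capture group
    (r"He$", "Hello!"),              # end-of-string anchor
]

for pattern, text in examples:
    print(f"{pattern!r:20} {text!r:26} -> {first_match(pattern, text)!r}")

# Named groups use the (?P<name>...) syntax; the names are hypothetical.
named = re.search(r"name:\s+(?P<first>\w+)\s+(?P<last>\w+)",
                  "name: Huckleberry Finn")
print(named.groupdict())
# {'first': 'Huckleberry', 'last': 'Finn'}
```

The last pattern in the list returns None, because the string does not end right after "He".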

 

Those are just a few of the main rules and symbols, but they should suffice for our goals. Combined, these rules can form formidable, complex patterns that are perhaps hard to read (as someone said, regex is meant to be written, not read), but extremely powerful. To learn more about regular expressions, take a look at this documentation (https://www.regular-expressions.info/). There are also quite a few free online editors that help you test your patterns. As regex has a number of minor differences between implementations, we recommend using editors with Python-flavored regex, like this one (http://pythex.org/). There are even regex games (https://alf.nu/RegexGolf)!

Now, let's try using it on the data we collected!
