Pattern Syntax

It’s beyond the scope of this book to teach regular expressions in their entirety. But it’s not necessary to learn anything close to the entirety of regular expressions to productively use them to process text. In the rest of this chapter, we’ll take a tour of the fundamentals of regular expressions—that 20 percent of functionality that allows you to perform 80 percent of the tasks you’ll encounter.

Regular expression syntax is, as we’ve discussed, famously terse. This terseness makes it relatively unintuitive. There’s sometimes no astoundingly logical reason why a particular special character represents a particular pattern, apart from that it hadn’t already been used for something else. But this terseness brings great power, so making the effort to remember the special characters and their roles is worth it in the long run.

Matching Characters

We’ve mentioned that, like wildcards, regular expressions use special characters to represent particular patterns. But more fundamentally, the first thing to note about them is that characters other than these special characters don’t do anything special at all: they just match themselves. So to write a regular expression that matched the string abc anywhere in the text, we’d simply write /abc/; to write one that matched hello, world, we’d write /hello, world/; and so on.

At the other extreme, we can match any character at all using a dot (.). So the regex /h.llo/ would match hello, hallo, hullo, h_llo, and so on. Almost any character can appear in that second position (by default, the dot doesn’t match a newline), provided there’s actually a character there.

Somewhere between these two extremes are character classes, which match specific lists of characters. So if we wanted to limit our earlier expression to only match hello, hallo, and hullo, we could use the pattern /h[eau]llo/. Similarly, to match both the British and US spellings of “initialise,” we could use the pattern /initiali[sz]e/.

We can also use ranges in character classes: so, for any lowercase letter we could use [a-z], and for any number we could use [0-9]. Ranges can be arbitrary, of course; we can limit ourselves to just the numbers 4–9 with [4-9] or just the letters f–t with [f-t].
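As a rough illustration of the patterns described so far, here is a small Ruby sketch. The constant names and sample strings are invented for this example, and it assumes Ruby 2.4 or later for Regexp#match?.

```ruby
# Literal characters match themselves, anywhere in the string.
LITERAL = /abc/
# A dot matches any single character (other than a newline, by default).
ANY_CHAR = /h.llo/
# A character class matches exactly one character from the list.
VOWEL_CLASS = /h[eau]llo/
# Ranges inside classes: a digit from 4 to 9, then a letter from f to t.
RANGE_CLASS = /[4-9][f-t]/

puts LITERAL.match?("xxabcxx")    # true
puts ANY_CHAR.match?("h_llo")     # true
puts ANY_CHAR.match?("hllo")      # false: nothing fills the second position
puts VOWEL_CLASS.match?("hullo")  # true
puts VOWEL_CLASS.match?("h_llo")  # false: _ is not in the class
puts RANGE_CLASS.match?("7k")     # true
puts RANGE_CLASS.match?("3k")     # false: 3 is outside 4-9
```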

As well as character classes, there are some shortcuts for common character types. There’s \d for any digit (equivalent to [0-9]), \w for any “word character” (equivalent to [a-zA-Z0-9_]), and \s for any whitespace character.

These also have negative forms, each of which is just the uppercase version of the same letter. So \D matches any character except a digit, \W matches any character except a word character, and \S matches any non-whitespace character.

You can make negative character classes yourself, too: simply start the class with a ^ character. So to match any character apart from vowels, we could use [^aeiou]; to match any character except uppercase letters, we could use [^A-Z].
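The shortcuts and negated classes can be tried out with String#scan, which returns every match as an array. This is only a sketch: the sample string and constant names are made up for illustration.

```ruby
SAMPLE = "Room 42, Floor_3!"

# \d picks out each digit.
DIGITS = SAMPLE.scan(/\d/)
# \W picks out everything that is not a letter, digit, or underscore.
NON_WORD = SAMPLE.scan(/\W/)
# \s picks out each whitespace character.
WHITESPACE = SAMPLE.scan(/\s/)
# A negated class: every character of "regex" that is not a lowercase vowel.
NON_VOWELS = "regex".scan(/[^aeiou]/)

p DIGITS      # ["4", "2", "3"]
p NON_WORD    # [" ", ",", " ", "!"]
p WHITESPACE  # [" ", " "]
p NON_VOWELS  # ["r", "g", "x"]
```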

Quantifiers

So far we’ve written patterns that match only single characters. What do we do if we want to match more than one instance of the same pattern? We could match two numbers in a row with /\d\d/, for example, but what if we want to match an arbitrary number? That’s where quantifiers come in: they let us say how many times a pattern can repeat while still matching.

The simplest quantifier is *. This matches zero or more instances of the preceding pattern. This often trips people up: zero or more is an important concept, since it matches, well, anything. /n*/ (“zero or more n characters”), to make things a bit more concrete, matches both “hand” and “spanner,” as you might expect (they contain one and two n’s, respectively). But it also matches “trowel.” And “beeblebrox.” And any other string of characters, regardless of whether any of them is an “n.”

* is more useful in the middle of a pattern, of course, where it does restrict the characters that can appear. For example, /hello\s*world/ will match “helloworld” and “hello world,” but not “hello, world.” (You’ll often see this \s* pattern, where whitespace is optional and is likely being ignored.)

Maybe we don’t want to allow for arbitrary repetitions; we just want a character or pattern to be optional. In that case, we can use ?. It matches zero or one occurrence of the preceding pattern, but no more than that. So to match both the US and British spellings of “tranquillity” (it’s spelled “tranquillity” in British English and “tranquility” in the US), we could write /tranquill?ity/; this matches both correct spellings, but not the obviously incorrect “tranquilllity.”

To go in the other direction, if we want to match one or more (but not zero) repetitions of a pattern, we can use the + quantifier. This ensures that the preceding pattern occurs at least once but allows for an arbitrary number of repetitions. If we wanted to match the exclamations “great,” “greeeat,” and “greeeeeeeeat,” then we could use /gre+at/. This lets us be as much like Tony the Tiger as we like, but doesn’t match the nonsensical “grat” like * would.

Finally, if we want to match specific numbers of repetitions, we can do that, too, by using braces. So, to match four repetitions, we use {4}; to match between one and four, we use {1,4}. This allows us to write, for example, /ste{1,2}p/, a pattern that would match both “step” and “steep,” but not “stp” or “steeep.” We can leave the second value blank, too, so to match five or more repetitions, we could write {5,}.
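The quantifier examples from this section can be checked with a short Ruby sketch. The constant names are invented here, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
ZERO_OR_MORE = /hello\s*world/   # * : zero or more whitespace characters
OPTIONAL     = /tranquill?ity/   # ? : the second l may appear once or not at all
ONE_OR_MORE  = /gre+at/          # + : at least one e
BOUNDED      = /ste{1,2}p/       # {1,2} : one or two e characters

puts(/n*/.match?("trowel"))                 # true: zero n characters still counts
puts ZERO_OR_MORE.match?("helloworld")      # true
puts ZERO_OR_MORE.match?("hello, world")    # false: the comma is not whitespace
puts OPTIONAL.match?("tranquility")         # true
puts OPTIONAL.match?("tranquilllity")       # false
puts ONE_OR_MORE.match?("greeeat")          # true
puts ONE_OR_MORE.match?("grat")             # false
puts BOUNDED.match?("steep")                # true
puts BOUNDED.match?("steeep")               # false
```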

Anchors

We’ve thus far only written patterns that can occur anywhere in a string. But in reality, we often want to test that the whole string matches a pattern (when validating input, for example). Or we might want to check that a string starts or ends in a certain way, or that a pattern occurs at the start or end of a line.

We can do this using anchors. Let’s imagine we have the regex /\d+/, which matches a sequence of one or more digits. It will match 12345, as we’d expect, but it would also match “abc12345def.” In other words, it’s checking that the string contains a sequence of one or more digits, not that the string is a sequence of one or more digits. But if we use \A to anchor to the start of the text and \z to anchor to the end, we can make sure the text consists entirely of digits: /\A\d+\z/.

If we’d just like to ensure that a line matches a particular format, we can use the ^ (start of line) and $ (end of line) anchors. It’s important to know that, unlike many other regular expression engines, Ruby doesn’t use these to anchor the start and the end of the string. Many people will write a regex like /^\d+$/ and expect it to match only if a string consists entirely of digits. However, it will also match the input:

not a number
12345
also not a number

If you’re using this kind of validation for security, then this can lead to vulnerabilities. If you’re expecting to match the whole text, use \A and \z instead.
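The difference between line anchors and string anchors can be seen in a short Ruby sketch. The constant names are invented for illustration, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
# Three lines of text, only the middle one made up of digits.
MULTILINE_INPUT = "not a number\n12345\nalso not a number"

LINE_ANCHORED   = /^\d+$/     # ^ and $ anchor to the start and end of any line
STRING_ANCHORED = /\A\d+\z/   # \A and \z anchor to the start and end of the string

puts LINE_ANCHORED.match?(MULTILINE_INPUT)    # true: the middle line satisfies it
puts STRING_ANCHORED.match?(MULTILINE_INPUT)  # false
puts STRING_ANCHORED.match?("12345")          # true
puts STRING_ANCHORED.match?("12345\n")        # false: \z means the very end
```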

Capture Groups

The final element of basic regex syntax is the ability to group segments of your match. Let’s say that you wanted to match a British telephone number. These are in the format of a five-digit area code and then a six-digit number. But you want to capture the area code and the phone number separately, because you want to distinguish between these two parts that make up the whole phone number.

To match a number in this format, you could write: /\d{5}\s*\d{6}/. Five digits, followed by optional whitespace, followed by six digits. To group the elements of the match, all you need to do is put them into parentheses: /(\d{5})\s*(\d{6})/. After matching these in Ruby, you can then access the two different elements individually. (We’ll look at exactly how to do that shortly.)

Grouping in this way results in numbered capture groups (the area code in 1, the phone number in 2). But we can also name them, with the ?<name> syntax. The previous example with named groups would look like this:

 
/(?<area_code>\d{5})\s*(?<number>\d{6})/

This can make the code after the match clearer, because it reveals our intentions. Accessing match[:area_code] and match[:number] is much more obvious than match[1] and match[2].
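As a brief preview, here is roughly what numbered and named access look like in Ruby. This is a minimal sketch: the split_phone helper, the constant names, and the sample number are all hypothetical.

```ruby
NUMBERED_PHONE = /(\d{5})\s*(\d{6})/
NAMED_PHONE    = /(?<area_code>\d{5})\s*(?<number>\d{6})/

SAMPLE_PHONE = "01632 960123" # made-up number, for illustration only

# Hypothetical helper: returns both parts as a hash, or nil if nothing matches.
def split_phone(text)
  match = NAMED_PHONE.match(text)
  return nil unless match

  { area_code: match[:area_code], number: match[:number] }
end

numbered = NUMBERED_PHONE.match(SAMPLE_PHONE)
puts numbered[0]  # 01632 960123 (index 0 is the whole match)
puts numbered[1]  # 01632
puts numbered[2]  # 960123

p split_phone(SAMPLE_PHONE)   # {:area_code=>"01632", :number=>"960123"} on Ruby 3.3 and earlier
p split_phone("no digits")    # nil
```

Index 0 always holds the entire matched text, which is why the numbered groups start at 1.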

A secondary feature of capture groups is the ability to perform alternation: that is, to match one string of characters or another. So to match two greetings, “hello there” and “hi there,” we can use the expression /(hello|hi) there/.

We can also use alternation in the whole pattern; for example, to match text that contains either “this” or “that,” we can write the pattern /this|that/, without the need for capture groups.
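Both forms of alternation can be sketched in a few lines of Ruby. The constant names are invented for this example, and Regexp#match? assumes Ruby 2.4 or later.

```ruby
GREETING = /(hello|hi) there/  # alternation limited to the group
EITHER   = /this|that/         # alternation across the whole pattern

puts GREETING.match("hi there")[1]     # hi (the group records which branch matched)
puts GREETING.match?("hello there")    # true
puts GREETING.match?("hey there")      # false
puts EITHER.match?("take that")        # true
puts EITHER.match?("those")            # false
```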

And with that, we’ve covered the fundamentals of regular expression syntax. By composing these special characters together in different ways, we can create limitless combinations of patterns that allow us to match anything we might find ourselves needing to. The next question, then, is a practical one: how does this actually translate to Ruby?
