A regular expression (also known as a regexp or regex)
describes a textual pattern. Ruby’s Regexp
class implements regular expressions, and both Regexp
and String
define pattern matching methods and
operators. Like most languages that support regular expressions, Ruby’s
Regexp
syntax follows closely (but
not precisely) the syntax of Perl 5.
Regular expression literals are delimited by forward slash characters:
/Ruby?/ # Matches the text "Rub" followed by an optional "y"
The closing slash character isn’t a true delimiter because a regular expression literal may be followed by one or more optional flag characters that specify additional information about how pattern matching is to be done. For example:
/ruby?/i    # Case-insensitive: matches "ruby" or "RUB", etc.
/./mu       # Matches Unicode characters in Multiline mode
The allowed modifier characters are shown in Table 9-1.
Table 9-1. Regular expression modifier characters
Like string literals delimited with %Q
, Ruby allows you to begin your regular
expressions with %r
followed by
a delimiter of your choice. This is useful when the pattern you are
describing contains a lot of forward slash characters that you don’t
want to escape:
%r|/|         # Matches a single slash character, no escape required
%r[</(.*)>]i  # Flag characters are allowed with this syntax, too
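As a small sketch of the point above, the following compares an escaped slash-delimited literal with two %r spellings of the same idea. The sample path and variable names are made up for illustration.

```ruby
# Three ways to write a pattern that matches the path "/usr/bin".
escaped = /\/usr\/bin/     # slashes must be escaped inside /.../
percent = %r{/usr/bin}     # no escaping needed with %r{...}
flagged = %r|/USR/bin|i    # flags follow the closing delimiter

text = "ls /usr/bin"       # hypothetical sample input
escaped_index = (text =~ escaped)
percent_index = (text =~ percent)
flagged_index = (text =~ flagged)
```

All three patterns should find the path at the same index in the sample text.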
Regular expression syntax gives special meaning to the
characters ()
, []
, {}
,
.
, ?
, +
,
*
, |
, ^
, and
$
. If you want to describe a
pattern that includes one of these characters literally, use a
backslash to escape it. If you want to describe a pattern that
includes a backslash, double the backslash:
/\(\)/   # Matches open and close parentheses
/\\/     # Matches a single backslash
Regular expression literals behave like double-quoted string
literals and can include escape characters such as \n, \t,
and (in Ruby 1.9) \u (see Table 3-1 in Chapter 3 for
a complete list of escape sequences):
money = /[$\u20AC\u{a3}\u{a5}]/   # Match dollar, euro, pound, or yen sign
Also like double-quoted string literals, Regexp literals allow
the interpolation of arbitrary Ruby expressions with the #{}
syntax:
prefix = ","
/#{prefix}\t/   # Matches a comma followed by an ASCII TAB character
Note that interpolation is done early, before the content of the
regular expression is parsed. This means that any special characters
in the interpolated expression become part of the regular expression
syntax. Interpolation is normally done anew each time a regular
expression literal is evaluated. If you use the o
modifier, however, this interpolation is
performed only once, the first time the literal is evaluated. The behavior
of the o
modifier is best
demonstrated by example:
[1,2].map{|x| /#{x}/}    # => [/1/, /2/]
[1,2].map{|x| /#{x}/o}   # => [/1/, /1/]
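The same contrast can be sketched with two small methods, which makes it clearer that the o modifier freezes the interpolated value at the literal's first evaluation. The method names here are invented for this example.

```ruby
# Hypothetical helpers: each wraps one regexp literal.
def literal_each_time(x)
  /#{x}/     # interpolated again on every call
end

def literal_once(x)
  /#{x}/o    # interpolated only the first time this literal is evaluated
end

each_time = [1, 2].map { |x| literal_each_time(x) }
once      = [1, 2].map { |x| literal_once(x) }
```

Because the second literal keeps whatever value it saw first, the o modifier is only safe when the interpolated expression never changes.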
As an alternative to regexp literals, you can also create regular
expressions with Regexp.new
, or its
synonym, Regexp.compile
:
Regexp.new("Ruby?")                              # /Ruby?/
Regexp.new("ruby?", Regexp::IGNORECASE)          # /ruby?/i
Regexp.compile(".", Regexp::MULTILINE, "u")      # /./mu
Use Regexp.escape to
escape special regular expression characters in a string
before passing it to the Regexp
constructor:
pattern = "[a-z]+"                # One or more letters
suffix = Regexp.escape("()")      # Treat these characters literally
r = Regexp.new(pattern + suffix)  # /[a-z]+\(\)/
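Escaping matters just as much when a string is interpolated into a literal, since interpolated text is parsed as regular expression syntax. The sketch below uses a made-up file extension to show the difference.

```ruby
ext = ".rb"   # hypothetical value; the "." is a regexp metacharacter

loose  = /main#{ext}\z/                  # "." acts as a wildcard
strict = /main#{Regexp.escape(ext)}\z/   # "." is matched literally

escaped_text = Regexp.escape(ext)        # the escaped form of the string
loose_hit    = ("mainXrb" =~ loose)      # wildcard accepts the "X"
strict_hit   = ("mainXrb" =~ strict)     # literal dot rejects it
exact_hit    = ("main.rb" =~ strict)
```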
In Ruby 1.9 (and 1.8.7), the factory method Regexp.union
creates a pattern that
is the “union” of any number of strings or Regexp
objects. (That is, the resulting
pattern matches any of the strings matched by its constituent
patterns.) Pass any number of arguments or a single array of strings
and patterns. This factory method is good for creating patterns that
match any word in a list of words. Strings passed to Regexp.union
are automatically escaped,
unlike those passed to new
and
compile
:
# Match any one of five language names.
pattern = Regexp.union("Ruby", "Perl", "Python", /Java(Script)?/)
# Match empty parens, brackets, or braces. Escaping is automatic:
Regexp.union("()", "[]", "{}")   # => /\(\)|\[\]|\{\}/
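A minimal sketch of the word-list use case follows. It passes a single array to Regexp.union and relies on the automatic escaping for "C++". The word list and sample sentence are assumptions for illustration, and String#scan is used only to collect the matches.

```ruby
words = ["Ruby", "C++", "Perl"]   # hypothetical word list
langs = Regexp.union(words)       # the "+" characters are escaped for us

found    = "I like C++ and Ruby".scan(langs)
cpp_only = ("C++" =~ langs)
no_match = ("Python" =~ langs)
```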
Many programming languages support regular expressions, using the syntax popularized by Perl. This book does not include a complete discussion of that syntax, but the following examples walk you through the elements of regular expression grammar. The tutorial is followed by Table 9-2, which summarizes the syntax. The tutorial’s focus is on Ruby 1.8 regular expression syntax, but some of the features available only in Ruby 1.9 are demonstrated as well. For book-length coverage of regular expressions, see Mastering Regular Expressions by Jeffrey E. F. Friedl (O’Reilly).
# Literal characters
/ruby/             # Match "ruby". Most characters simply match themselves.
/¥/                # Matches Yen sign. Multibyte characters are supported
                   # in Ruby 1.9 and Ruby 1.8.

# Character classes
/[Rr]uby/          # Match "Ruby" or "ruby"
/rub[ye]/          # Match "ruby" or "rube"
/[aeiou]/          # Match any one lowercase vowel
/[0-9]/            # Match any digit; same as /[0123456789]/
/[a-z]/            # Match any lowercase ASCII letter
/[A-Z]/            # Match any uppercase ASCII letter
/[a-zA-Z0-9]/      # Match any of the above
/[^aeiou]/         # Match anything other than a lowercase vowel
/[^0-9]/           # Match anything other than a digit

# Special character classes
/./                # Match any character except newline
/./m               # In multiline mode . matches newline, too
/\d/               # Match a digit: /[0-9]/
/\D/               # Match a nondigit: /[^0-9]/
/\s/               # Match a whitespace character: /[ \t\r\n\f]/
/\S/               # Match nonwhitespace: /[^ \t\r\n\f]/
/\w/               # Match a single word character: /[A-Za-z0-9_]/
/\W/               # Match a nonword character: /[^A-Za-z0-9_]/

# Repetition
/ruby?/            # Match "rub" or "ruby": the y is optional
/ruby*/            # Match "rub" plus 0 or more ys
/ruby+/            # Match "rub" plus 1 or more ys
/\d{3}/            # Match exactly 3 digits
/\d{3,}/           # Match 3 or more digits
/\d{3,5}/          # Match 3, 4, or 5 digits

# Nongreedy repetition: match the smallest number of repetitions
/<.*>/             # Greedy repetition: matches "<ruby>perl>"
/<.*?>/            # Nongreedy: matches "<ruby>" in "<ruby>perl>"
                   # Also nongreedy: ??, +?, and {n,m}?

# Grouping with parentheses
/\D\d+/            # No group: + repeats \d
/(\D\d)+/          # Grouped: + repeats \D\d pair
/([Rr]uby(, )?)+/  # Match "Ruby", "Ruby, ruby, ruby", etc.

# Backreferences: matching a previously matched group again
/([Rr])uby&\1ails/ # Match ruby&rails or Ruby&Rails
/(['"])[^\1]*\1/   # Single or double-quoted string
                   #   \1 matches whatever the 1st group matched
                   #   \2 matches whatever the 2nd group matched, etc.

# Named groups and backreferences in Ruby 1.9: match a 4-letter palindrome
/(?<first>\w)(?<second>\w)\k<second>\k<first>/
/(?'first'\w)(?'second'\w)\k'second'\k'first'/  # Alternate syntax

# Alternatives
/ruby|rube/        # Match "ruby" or "rube"
/rub(y|le)/        # Match "ruby" or "ruble"
/ruby(!+|\?)/      # "ruby" followed by one or more ! or one ?

# Anchors: specifying match position
/^Ruby/            # Match "Ruby" at the start of a string or internal line
/Ruby$/            # Match "Ruby" at the end of a string or line
/\ARuby/           # Match "Ruby" at the start of a string
/Ruby\Z/           # Match "Ruby" at the end of a string
/\bRuby\b/         # Match "Ruby" at a word boundary
/\brub\B/          # \B is nonword boundary:
                   #   match "rub" in "rube" and "ruby" but not alone
/Ruby(?=!)/        # Match "Ruby", if followed by an exclamation point
/Ruby(?!!)/        # Match "Ruby", if not followed by an exclamation point

# Special syntax with parentheses
/R(?#comment)/     # Matches "R". All the rest is a comment
/R(?i)uby/         # Case-insensitive while matching "uby"
/R(?i:uby)/        # Same thing
/rub(?:y|le)/      # Group only without creating \1 backreference

# The x option allows comments and ignores whitespace
/  # This is not a Ruby comment. It is a literal part
   # of the regular expression, but is ignored.
   R      # Match a single letter R
   (uby)+ # Followed by one or more "uby"s
   \      # Use backslash for a nonignored space
/x                 # Closing delimiter. Don't forget the x option!
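To see a few of these elements in action, the short sketch below runs three of the tutorial's patterns against sample strings chosen for illustration. It extracts the matched text by indexing a string with a Regexp and tests for a match with the =~ operator.

```ruby
# Greedy versus nongreedy repetition on the tutorial's sample text.
text      = "<ruby>perl>"
greedy    = text[/<.*>/]     # takes as much as possible
nongreedy = text[/<.*?>/]    # stops at the first ">"

# ^ matches after an internal newline; \A only at the start of the string.
lines         = "Ruby\nPerl"
line_anchor   = (lines =~ /^Perl/)
string_anchor = (lines =~ /\APerl/)

# Named groups and named backreferences: a 4-letter palindrome.
palindrome = /(?<first>\w)(?<second>\w)\k<second>\k<first>/
abba_index = ("abba" =~ palindrome)
abcd_index = ("abcd" =~ palindrome)
```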
Table 9-2 summarizes the syntax rules demonstrated by this code.
Table 9-2. Regular expression syntax
=~
is Ruby’s basic pattern-matching operator. One operand
must be a regular expression and one must be a string. (It is
implemented equivalently by both Regexp
and String
, so it doesn’t matter whether the
regular expression is on the left or the right.) The =~
operator checks its string operand to see
if it, or any substring, matches the pattern specified by the regular
expression. If a match is found, the operator returns the string index
at which the first match begins. Otherwise, it returns nil
:
pattern = /Ruby?/i      # Match "Rub" or "Ruby", case-insensitive
pattern =~ "backrub"    # Returns 4.
"rub ruby" =~ pattern   # 0
pattern =~ "r"          # nil
After using the =~
operator,
we may be interested in things other than the position at which the
matched text begins. After any successful (non-nil
) match, the global variable $~
holds a MatchData
object that contains complete information about the match:
"hello" =~ /e\w{2}/   # 1: Match an e followed by 2 word characters
$~.string             # "hello": the complete string
$~.to_s               # "ell": the portion that matched
$~.pre_match          # "h": the portion before the match
$~.post_match         # "o": the portion after the match
$~
is a special thread-local
and method-local variable. Two threads running concurrently will see
distinct values of this variable. And a method that uses the =~
operator does not alter the value of
$~
seen by the calling method.
We’ll have more to say about $~
and
related global variables later. An object-oriented alternative to this
magical and somewhat cryptic variable is Regexp.last_match
. Invoking this method with no arguments returns the same value as a
reference to $~
.
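The method-local behavior can be checked with a small sketch. The helper method name is invented for this example. The helper performs its own match, and the caller's view of the last match should be unaffected.

```ruby
# Hypothetical helper that performs its own pattern match.
def helper_match
  "xyz" =~ /y/   # sets $~ only within this method
  $~[0]
end

"hello" =~ /ll/
before      = $~[0]                   # match text seen by the caller
inner       = helper_match            # match text seen inside the helper
after       = $~[0]                   # the caller's $~ is unchanged
via_method  = Regexp.last_match[0]    # same information without the global
via_shortcut = Regexp.last_match(0)   # an index may be passed directly
```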
A MatchData
object is more powerful when the Regexp
that was matched contains
subexpressions in parentheses. In this case, the MatchData
object can tell us the text (and
the starting and ending offsets of that text) that matched each
subexpression:
# This is a pattern with three subpatterns
pattern = /(Ruby|Perl)(\s+)(rocks|sucks)!/
text = "Ruby rocks!"      # Text that matches the pattern
pattern =~ text           # => 0: pattern matches at the first character
data = Regexp.last_match  # => Get match details
data.size                 # => 4: MatchData objects behave like arrays
data[0]                   # => "Ruby rocks!": the complete matched text
data[1]                   # => "Ruby": text matching first subpattern
data[2]                   # => " ": text matching second subpattern
data[3]                   # => "rocks": text matching third subpattern
data[1,2]                 # => ["Ruby", " "]
data[1..3]                # => ["Ruby", " ", "rocks"]
data.values_at(1,3)       # => ["Ruby", "rocks"]: only selected indexes
data.captures             # => ["Ruby", " ", "rocks"]: only subpatterns
Regexp.last_match(3)      # => "rocks": same as Regexp.last_match[3]

# Start and end positions of matches
data.begin(0)             # => 0: start index of entire match
data.begin(2)             # => 4: start index of second subpattern
data.end(2)               # => 5: end index of second subpattern
data.offset(3)            # => [5,10]: start and end of third subpattern
In Ruby 1.9, if a pattern includes named captures, then
a MatchData
obtained
from that pattern can be used like a hash, with the names of capturing
groups (as strings or symbols) as keys. For example:
# Ruby 1.9 only
pattern = /(?<lang>Ruby|Perl) (?<ver>\d(\.\d)+) (?<review>rocks|sucks)!/
if (pattern =~ "Ruby 1.9.1 rocks!")
  $~[:lang]          # => "Ruby"
  $~[:ver]           # => "1.9.1"
  $~["review"]       # => "rocks"
  $~.offset(:ver)    # => [5,10] start and end offsets of version number
end
# Names of capturing groups and a map of group names to group numbers
pattern.names           # => ["lang", "ver", "review"]
pattern.named_captures  # => {"lang"=>[1],"ver"=>[2],"review"=>[3]}
In addition to the =~
operator, the Regexp
and String
classes also define a match
method. This method is like the match operator, except that
instead of returning the index at which a match is found, it returns
the MatchData
object, or nil
if no matching text is found. Use it
like this:
if data = pattern.match(text)   # Or: data = text.match(pattern)
  handle_match(data)
end
In Ruby 1.9, you can also associate a block with a call to
match
. If no match is found, the
block is ignored, and match
returns
nil
. If a match is found, however,
the MatchData
object is passed to the block, and the match
method returns whatever the block
returns. So in Ruby 1.9, this code can be more succinctly written like
this:
pattern.match(text) {|data| handle_match(data) }
Another change in Ruby 1.9 is that the match
methods optionally accept an integer
as the second argument to specify the starting position of the
search.
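Both Ruby 1.9 additions can be sketched together. The sample text is an assumption for illustration. The second argument moves the starting point of the search, and the block form returns the block's value or nil.

```ruby
text = "ruby rocks, ruby rules"   # hypothetical sample text

first  = /ruby/.match(text)       # search from the beginning
second = /ruby/.match(text, 1)    # search starting at index 1
first_start  = first.begin(0)
second_start = second.begin(0)

# With a block, match returns the block's value on success...
upcased = /r(\w+)/.match(text) { |m| m[1].upcase }
# ...and nil, without calling the block, on failure.
missing = /perl/.match(text) { |m| m[0] }
```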
Ruby adopts Perl’s regular expression syntax and, like Perl, sets special global variables
after each match. If you are a Perl programmer, you may find these
special variables useful. If you are not a Perl programmer, you
may find them unreadable! Table 9-3
summarizes these variables. The variables listed in the second
column are aliases that are available if you require 'English'
.
Table 9-3. Special global regular expression variables
Global | English | Alternative |
---|---|---|
$~ | $LAST_MATCH_INFO | Regexp.last_match |
$& | $MATCH | Regexp.last_match[0] |
$` | $PREMATCH | Regexp.last_match.pre_match |
$' | $POSTMATCH | Regexp.last_match.post_match |
$1 | none | Regexp.last_match[1] |
$2, etc. | none | Regexp.last_match[2], etc. |
$+ | $LAST_PAREN_MATCH | Regexp.last_match[-1] |
$~
is the most important of the variables listed in Table 9-3. All the others are derived from it.
If you set $~
to a MatchData
object, the values of the other
special globals change. The other global variables are read-only and
cannot be set directly. Finally, it is important to remember that $~
and the variables derived from it are
all thread-local and method-local. This means that two Ruby threads
can perform matches at the same time without interfering with each
other and it means that the value of these variables, as seen by
your code, will not change when your code calls a method that
performs a pattern match.
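As a rough sketch of how the derived variables follow $~, the code below saves a MatchData object, performs an unrelated match, and then restores the saved object by assigning to $~. The sample strings are made up, and the English aliases are included only to show that they track the same data.

```ruby
require "English"   # provides $MATCH, $PREMATCH, and the other aliases

"hello world" =~ /o (w)/
saved = $~                 # keep the MatchData for later

"other text" =~ /t/        # an unrelated match overwrites $~
overwritten = $&

$~ = saved                 # restoring $~ updates the derived variables
restored_match = $&
restored_group = $1
restored_pre   = $`
english_match  = $MATCH
english_pre    = $PREMATCH
```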
The String
class defines
a number of methods that accept Regexp
arguments. If you index a string
with a regular expression, then the portion of the string that
matches the pattern is returned. If the Regexp
is followed by an integer, then the
corresponding element of the MatchData
is returned:
"ruby123"[/\d+/]               # "123"
"ruby123"[/([a-z]+)(\d+)/,1]   # "ruby"
"ruby123"[/([a-z]+)(\d+)/,2]   # "123"
The slice
method
is a synonym for the string index operator []
. The slice!
variant returns the same value as
slice
but also has the side
effect of deleting the returned substring from the string:
r = "ruby123"
r.slice!(/\d+/)   # Returns "123", changes r to "ruby"
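The indexing forms and slice! can be compared side by side in a short sketch. It works on a copy made with dup so that the destructive call is safe even if string literals are frozen.

```ruby
s = "ruby123"

digits = s[/\d+/]                 # portion that matches the pattern
word   = s[/([a-z]+)(\d+)/, 1]    # text of the first group
sliced = s.slice(/\d+/)           # slice is a synonym for []

r = s.dup                         # mutable copy for the destructive call
removed = r.slice!(/\d+/)         # returns the match and deletes it from r
nothing = r.slice!(/\d+/)         # nil: no digits are left to remove
```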
The split
method splits a
string into an array of substrings, using a string or
regular expression as its delimiter:
s = "one, two, three"
s.split              # ["one,","two,","three"]: whitespace delimiter by default
s.split(", ")        # ["one","two","three"]: hardcoded delimiter
s.split(/\s*,\s*/)   # ["one","two","three"]: space is optional around comma
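The advantage of a pattern delimiter shows up when the spacing is irregular. The input string in this sketch is invented to make the difference visible.

```ruby
messy = "one ,two,   three"   # hypothetical input with uneven spacing

by_string = messy.split(", ")        # only an exact ", " is a delimiter
by_regexp = messy.split(/\s*,\s*/)   # comma with any surrounding whitespace
```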
The index
method searches a
string for a character, substring, or pattern, and returns the start
index. With a Regexp
argument, it
works much like the =~
operator,
but it also allows a second argument that specifies the character
position at which to begin the search. This allows you to find
matches other than the first:
text = "hello world"
pattern = /l/
first = text.index(pattern)       # 2: first match starts at char 2
n = Regexp.last_match.end(0)      # 3: end position of first match
second = text.index(pattern, n)   # 3: search again from there
last = text.rindex(pattern)       # 9: rindex searches backward from end
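The same technique extends to a loop that collects the start index of every match. This is only a sketch, and the method name match_positions is made up for this example. It restarts each search at the end of the previous match and steps forward by one character after an empty match so the loop always terminates.

```ruby
# Hypothetical helper: start indexes of all non-overlapping matches.
def match_positions(text, pattern)
  positions = []
  start = 0
  while (i = text.index(pattern, start))
    positions << i
    match_end = Regexp.last_match.end(0)
    # Resume after the match; advance by one if the match was empty.
    start = match_end > i ? match_end : i + 1
  end
  positions
end

l_positions = match_positions("hello world", /l/)
o_positions = match_positions("hello world", /o/)
z_positions = match_positions("hello world", /z/)
```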
Some of the most important String
methods that use regular
expressions are sub
(for
substitute) and gsub
(for global
substitute), and their in-place variants sub!
and gsub!
. All of these methods perform a search-and-replace
operation using a Regexp
pattern.
sub
and sub!
replace the first occurrence of the
pattern. gsub
and gsub!
replace all occurrences. sub
and gsub
return a new string, leaving the
original unmodified. sub!
and
gsub!
modify the string on which
they are called. If any modifications are made to the string, these
mutator methods return the modified string. If no modifications are
made, they return nil
(which
makes the methods suitable for use in if
statements and while
loops):
phone = gets                   # Read a phone number
phone.sub!(/#.*$/, "")         # Delete Ruby-style comments
phone.gsub!(/\D/, ' '=>'-')    # 1.9: remove non-digits but map space to hyphen
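The nil return value is what makes the mutator methods usable as conditions. The sketch below uses made-up input strings and works on copies made with dup so that the in-place methods are safe even if string literals are frozen.

```ruby
line = "555 1234 # home".dup            # hypothetical input line
stripped = line.sub!(/\s*#.*$/, "")     # a change was made: returns the string
again    = line.sub!(/\s*#.*$/, "")     # nothing left to change: returns nil

# Loop until gsub! reports that no replacement was made.
spaced = "a  b   c".dup
passes = 0
passes += 1 while spaced.gsub!(/  /, " ")

# Hash replacement: unlisted matches are replaced with the empty string.
mapped = "12 3x4".gsub(/\D/, " " => "-")
```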
These search-and-replace methods do not require the use of regular expressions; you can also use an ordinary string as the text to be replaced:
text.gsub!("rails", "Rails") # Change "rails" to "Rails" throughout
However, regular expressions really are more flexible. If you
want to capitalize “rails” without modifying “grails”, for example,
use a Regexp
:
text.gsub!(/\brails\b/, "Rails")   # Capitalize the word "Rails" throughout
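A quick comparison shows why the word boundary anchors matter here. The sample sentence is an assumption for illustration, and the non-mutating gsub is used so both results can be inspected.

```ruby
text = "rails and grails"   # hypothetical sample sentence

plain   = text.gsub("rails", "Rails")         # also rewrites part of "grails"
bounded = text.gsub(/\brails\b/, "Rails")     # only the standalone word
```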
The reason that the search-and-replace methods are covered in
this subsection on their own is that the replacement does not need
to be an ordinary string of text. (Replacement strings specified in
a hash must be ordinary strings, however.) Suppose you want a
replacement string that depends on the details of the match found.
The search-and-replace methods process the replacement string before
performing replacements. If the string contains a backslash followed
by a single digit, then that digit is used as an index into the
$~
object, and the text from the
MatchData
object is used in place
of the backslash and the digit. For example, if the string contains
the escape sequence