Regular Expressions

A regular expression (also known as a regexp or regex) describes a textual pattern. Ruby’s Regexp class[*] implements regular expressions, and both Regexp and String define pattern matching methods and operators. Like most languages that support regular expressions, Ruby’s Regexp syntax follows closely (but not precisely) the syntax of Perl 5.

Regexp Literals

Regular expression literals are delimited by forward slash characters:

/Ruby?/  # Matches the text "Rub" followed by an optional "y"

The closing slash character isn’t a true delimiter because a regular expression literal may be followed by one or more optional flag characters that specify additional information about how pattern matching is to be done. For example:

/ruby?/i  # Case-insensitive: matches "ruby" or "RUB", etc.
/./mu     # Matches Unicode characters in Multiline mode

The allowed modifier characters are shown in Table 9-1.

Table 9-1. Regular expression modifier characters

Modifier  Description
i         Ignore case when matching text.
m         The pattern is to be matched against multiline text, so treat newline as an ordinary character: allow . to match newlines.
x         Extended syntax: allow whitespace and comments in regexp.
o         Perform #{} interpolations only once, the first time the regexp literal is evaluated.
u,e,s,n   Interpret the regexp as Unicode (UTF-8), EUC, SJIS, or ASCII. If none of these modifiers is specified, the regular expression is assumed to use the source encoding.
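As a small illustration of the first three modifiers in Table 9-1, the following sketch matches a few made-up strings with and without each flag. The variable names and sample strings are assumptions chosen for the example, not part of any API:

```ruby
# i: ignore case when matching
ci_index = ("RUBY" =~ /ruby/i)         # => 0

# m: allow . to match a newline
plain_dot = ("a\nb" =~ /a.b/)          # => nil: . stops at the newline
multi_dot = ("a\nb" =~ /a.b/m)         # => 0

# x: whitespace and comments inside the pattern are ignored
phone = /
  \d{3}   # three digits
  -       # a literal hyphen
  \d{4}   # four digits
/x
phone_text = phone.match("call 555-1234")[0]   # => "555-1234"
```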

Like string literals delimited with %Q, Ruby allows you to begin your regular expressions with %r followed by a delimiter of your choice. This is useful when the pattern you are describing contains a lot of forward slash characters that you don’t want to escape:

%r|/|         # Matches a single slash character, no escape required
%r[</(.*)>]i  # Flag characters are allowed with this syntax, too

Regular expression syntax gives special meaning to the characters (), [], {}, ., ?, +, *, |, ^, and $. If you want to describe a pattern that includes one of these characters literally, use a backslash to escape it. If you want to describe a pattern that includes a backslash, double the backslash:

/\(\)/     # Matches open and close parentheses
/\\/       # Matches a single backslash

Regular expression literals behave like double-quoted string literals and can include escape characters such as \n, \t, and (in Ruby 1.9) \u (see Table 3-1 in Chapter 3 for a complete list of escape sequences):

money = /[$\u20AC\u{a3}\u{a5}]/ # match dollar, euro, pound, or yen sign

Also like double-quoted string literals, Regexp literals allow the interpolation of arbitrary Ruby expressions with the #{} syntax:

prefix = ","
/#{prefix}\t/   # Matches a comma followed by an ASCII TAB character

Note that interpolation is done early, before the content of the regular expression is parsed. This means that any special characters in the interpolated expression become part of the regular expression syntax. Interpolation is normally done anew each time a regular expression literal is evaluated. If you use the o modifier, however, this interpolation is only performed once, the first time the literal is evaluated. The behavior of the o modifier is best demonstrated by example:

[1,2].map{|x| /#{x}/}   # => [/1/, /2/]
[1,2].map{|x| /#{x}/o}  # => [/1/, /1/]

Regexp Factory Methods

As an alternative to regexp literals, you can also create regular expressions with Regexp.new, or its synonym, Regexp.compile:

Regexp.new("Ruby?")                          # /Ruby?/
Regexp.new("ruby?", Regexp::IGNORECASE)      # /ruby?/i
Regexp.compile(".", Regexp::MULTILINE, "u")  # /./mu

Use Regexp.escape to escape special regular expression characters in a string before passing them to the Regexp constructor:

pattern = "[a-z]+"                # One or more letters
suffix = Regexp.escape("()")      # Treat these characters literally
r = Regexp.new(pattern + suffix)  # /[a-z]+\(\)/
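The difference that escaping makes is easiest to see when the text to be matched contains a metacharacter. The sketch below uses a hypothetical user_text value and compares an unescaped pattern with escaped ones. It also shows that Regexp.escape can be combined with #{} interpolation in a literal:

```ruby
user_text = "1+1"   # hypothetical input containing a regexp metacharacter

unescaped = Regexp.new(user_text)                  # + acts as a quantifier
escaped   = Regexp.new(Regexp.escape(user_text))   # + is matched literally
inline    = /#{Regexp.escape(user_text)}=/         # escaping also works with #{}

unescaped_hit = ("1+1=2" =~ unescaped)   # => nil: the pattern wants "11", "111", ...
escaped_hit   = ("1+1=2" =~ escaped)     # => 0
inline_hit    = ("1+1=2" =~ inline)      # => 0
```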

In Ruby 1.9 (and 1.8.7), the factory method Regexp.union creates a pattern that is the “union” of any number of strings or Regexp objects. (That is, the resulting pattern matches any of the strings matched by its constituent patterns.) Pass any number of arguments or a single array of strings and patterns. This factory method is good for creating patterns that match any word in a list of words. Strings passed to Regexp.union are automatically escaped, unlike those passed to new and compile:

# Match any one of five language names.
pattern = Regexp.union("Ruby", "Perl", "Python", /Java(Script)?/)
# Match empty parens, brackets, or braces. Escaping is automatic:
Regexp.union("()", "[]", "{}")   # => /\(\)|\[\]|\{\}/
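To illustrate the word-list use case, here is a minimal sketch that builds one pattern from an array and scans a sentence with it. The names and sample sentence are invented for the example. The second pattern mixes a Regexp and a string to show that the flags of a Regexp argument are kept:

```ruby
names = ["C++", "C#", "Ruby"]       # hypothetical word list
langs = Regexp.union(names)         # a single array is accepted

# The + and # characters are escaped automatically, so they match literally
found = "I use C++ and Ruby, not C.".scan(langs)   # => ["C++", "Ruby"]

# Regexp arguments keep their own flags; strings are still escaped
mixed     = Regexp.union(/java/i, "c++")
mixed_hit = ("JAVA" =~ mixed)                      # => 0
```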

Regular Expression Syntax

Many programming languages support regular expressions, using the syntax popularized by Perl. This book does not include a complete discussion of that syntax, but the following examples walk you through the elements of regular expression grammar. The tutorial is followed by Table 9-2, which summarizes the syntax. The tutorial’s focus is on Ruby 1.8 regular expression syntax, but some of the features available only in Ruby 1.9 are demonstrated as well. For book-length coverage of regular expressions, see Mastering Regular Expressions by Jeffrey E. F. Friedl (O’Reilly).

# Literal characters
/ruby/             # Match "ruby". Most characters simply match themselves.
/¥/                # Matches Yen sign. Multibyte characters are supported
                   # in Ruby 1.9 and Ruby 1.8.

# Character classes
/[Rr]uby/          # Match "Ruby" or "ruby"
/rub[ye]/          # Match "ruby" or "rube"
/[aeiou]/          # Match any one lowercase vowel
/[0-9]/            # Match any digit; same as /[0123456789]/
/[a-z]/            # Match any lowercase ASCII letter
/[A-Z]/            # Match any uppercase ASCII letter
/[a-zA-Z0-9]/      # Match any of the above
/[^aeiou]/         # Match anything other than a lowercase vowel
/[^0-9]/           # Match anything other than a digit

# Special character classes
/./                # Match any character except newline
/./m               # In multiline mode . matches newline, too
/\d/               # Match a digit /[0-9]/
/\D/               # Match a nondigit: /[^0-9]/
/\s/               # Match a whitespace character: /[ \t\r\n\f]/
/\S/               # Match nonwhitespace: /[^ \t\r\n\f]/
/\w/               # Match a single word character: /[A-Za-z0-9_]/
/\W/               # Match a nonword character: /[^A-Za-z0-9_]/

# Repetition
/ruby?/            # Match "rub" or "ruby": the y is optional
/ruby*/            # Match "rub" plus 0 or more ys
/ruby+/            # Match "rub" plus 1 or more ys
/\d{3}/            # Match exactly 3 digits
/\d{3,}/           # Match 3 or more digits
/\d{3,5}/          # Match 3, 4, or 5 digits

# Nongreedy repetition: match the smallest number of repetitions
/<.*>/             # Greedy repetition: matches "<ruby>perl>"
/<.*?>/            # Nongreedy: matches "<ruby>" in "<ruby>perl>" 
                   # Also nongreedy: ??, +?, and {n,m}?

# Grouping with parentheses
/\D\d+/            # No group: + repeats \d
/(\D\d)+/          # Grouped: + repeats \D\d pair
/([Rr]uby(, )?)+/  # Match "Ruby", "Ruby, ruby, ruby", etc.

# Backreferences: matching a previously matched group again
/([Rr])uby&\1ails/ # Match ruby&rails or Ruby&Rails
/(['"])[^\1]*\1/   # Single or double-quoted string
                   #   \1 matches whatever the 1st group matched
                   #   \2 matches whatever the 2nd group matched, etc.

# Named groups and backreferences in Ruby 1.9: match a 4-letter palindrome
/(?<first>\w)(?<second>\w)\k<second>\k<first>/
/(?'first'\w)(?'second'\w)\k'second'\k'first'/ # Alternate syntax

# Alternatives
/ruby|rube/        # Match "ruby" or "rube"
/rub(y|le)/        # Match "ruby" or "ruble"
/ruby(!+|\?)/      # "ruby" followed by one or more ! or one ?

# Anchors: specifying match position
/^Ruby/            # Match "Ruby" at the start of a string or internal line
/Ruby$/            # Match "Ruby" at the end of a string or line
/\ARuby/           # Match "Ruby" at the start of a string
/Ruby\Z/           # Match "Ruby" at the end of a string
/\bRuby\b/         # Match "Ruby" at a word boundary
/\brub\B/          # \B is nonword boundary:
                   #   match "rub" in "rube" and "ruby" but not alone
/Ruby(?=!)/        # Match "Ruby", if followed by an exclamation point
/Ruby(?!!)/        # Match "Ruby", if not followed by an exclamation point

# Special syntax with parentheses
/R(?#comment)/     # Matches "R". All the rest is a comment
/R(?i)uby/         # Case-insensitive while matching "uby"
/R(?i:uby)/        # Same thing
/rub(?:y|le)/      # Group only without creating \1 backreference

# The x option allows comments and ignores whitespace
/  # This is not a Ruby comment. It is a literal part
   # of the regular expression, but is ignored.
   R      # Match a single letter R
   (uby)+ # Followed by one or more "uby"s
   \      # Use backslash for a nonignored space
/x                 # Closing delimiter. Don't forget the x option!

Table 9-2 summarizes the syntax rules demonstrated by this code.

Table 9-2. Regular expression syntax

Syntax    Matches
Character classes
.

Matches any single character except newline. Using m option allows it to match newline as well.

[...]

Matches any single character in brackets.

[^...]

Matches any single character not in brackets.

\w

Matches word characters.

\W

Matches nonword characters.

\s

Matches whitespace. Equivalent to [ \t\n\r\f].

\S

Matches nonwhitespace.

\d

Matches digits. Equivalent to [0-9].

\D

Matches nondigits.

Sequences, alternatives, groups, and references
ab

Matches expression a followed by expression b.

a | b

Matches either expression a or expression b.

( re )

Grouping: groups re into a single syntactic unit that can be used with *, +, ?, |, and so on. Also “captures” the text that matches re for later use.

(?: re )

Groups as with (), but does not capture the matched text.

(?< name > re )

Groups a subexpression and captures the text that matches re as with (), and also labels the subexpression with name. Ruby 1.9.

(?' name ' re )

A named capture, as above. Single quotes may optionally replace angle brackets around name. Ruby 1.9.

\1...\9

Matches the same text that matched the nth grouped subexpression.

\10...

Matches the same text that matched the nth grouped subexpression if there are that many previous subexpressions. Otherwise, matches the character with the specified octal encoding.

\k< name >

Matches the same text that matched the named capturing group name.

\g< n >

Matches group n again. n can be a group name or a group number. Contrast \g, which rematches or reexecutes the specified group, with an ordinary backreference that tries to match the same text that matched the first time. Ruby 1.9.

Repetition

By default, repetition is “greedy”—as many occurrences as possible are matched. For “reluctant” matching, follow a * , + , ? , or {} quantifier with a ? . This will match as few occurrences as possible while still allowing the rest of the expression to match. In Ruby 1.9, follow a quantifier with a + for “possessive” (nonbacktracking) behavior.

re *

Matches zero or more occurrences of re.

re +

Matches one or more occurrences of re.

re ?

Optional: matches zero or one occurrence of re.

re { n }

Matches exactly n occurrences of re.

re { n ,}

Matches n or more occurrences of re.

re { n , m }

Matches at least n and at most m occurrences of re.

Anchors

Anchors do not match characters but instead match the zero-width positions between characters, “anchoring” the match to a position at which a specific condition holds.

^

Matches beginning of line.

$

Matches end of line.

\A

Matches beginning of string.

\Z

Matches end of string. If string ends with a newline, it matches just before newline.

\z

Matches end of string.

\G

Matches point where last match finished.

\b

Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.

\B

Matches nonword boundaries.

(?= re )

Positive lookahead assertion: ensures that the following characters match re, but doesn’t include those characters in the matched text.

(?! re )

Negative lookahead assertion: ensures that the following characters do not match re.

(?<= re )

Positive lookbehind assertion: ensures that the preceding characters match re, but doesn’t include those characters in the matched text. Ruby 1.9.

(?<! re )

Negative lookbehind assertion: ensures that the preceding characters do not match re. Ruby 1.9.

Miscellaneous
(? onflags - offflags )

Doesn’t match anything, but turns on the flags specified by onflags, and turns off the flags specified by offflags. These two strings are combinations in any order of the modifier letters i, m, and x. Flag settings specified in this way take effect at the point that they appear in the expression and persist until the end of the expression, or until the end of the parenthesized group of which they are a part, or until overridden by another flag setting expression.

(? onflags - offflags : x )

Matches x, applying the specified flags to this subexpression only. This is a noncapturing group, like (?:...), with the addition of flags.

(?#...)

Comment: all text within parentheses is ignored.

(?> re )

Matches re independently of the rest of the expression, without considering whether the match causes the rest of the expression to fail to match. Useful to optimize certain complex regular expressions. The parentheses do not capture the matched text.
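A few rows of Table 9-2 are easier to compare when run side by side. The following sketch uses sample strings invented for the purpose. It contrasts greedy and reluctant repetition, uses a named backreference and both kinds of lookaround, and shows how the line anchors differ from the string anchors:

```ruby
tags = "<ruby>perl>"
greedy    = tags[/<.*>/]      # => "<ruby>perl>": as much as possible
reluctant = tags[/<.*?>/]     # => "<ruby>": as little as possible

# Named groups with \k backreferences: a 4-letter palindrome
palindrome = /(?<first>\w)(?<second>\w)\k<second>\k<first>/
pal = "xx abba yy"[palindrome]                    # => "abba"

# Lookahead and lookbehind assert context without consuming it
not_bang = ("Ruby! Ruby?" =~ /Ruby(?!!)/)         # => 6: the second "Ruby"
amount   = "cost: $42"[/(?<=\$)\d+/]              # => "42"

# ^ and $ work per line; \A, \Z, and \z work on the whole string
lines = "one\ntwo\n"
caret   = (lines =~ /^two/)     # => 4
start_a = (lines =~ /\Atwo/)    # => nil
big_z   = (lines =~ /two\Z/)    # => 4: \Z allows a final newline
small_z = (lines =~ /two\z/)    # => nil: \z does not
```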

Pattern Matching with Regular Expressions

=~ is Ruby’s basic pattern-matching operator. One operand must be a regular expression and one must be a string. (It is implemented equivalently by both Regexp and String, so it doesn’t matter whether the regular expression is on the left or the right.) The =~ operator checks its string operand to see if it, or any substring, matches the pattern specified by the regular expression. If a match is found, the operator returns the string index at which the first match begins. Otherwise, it returns nil:

pattern = /Ruby?/i      # Match "Rub" or "Ruby", case-insensitive
pattern =~ "backrub"    # Returns 4.
"rub ruby" =~ pattern   # 0
pattern =~ "r"          # nil

After using the =~ operator, we may be interested in things other than the position at which the matched text begins. After any successful (non-nil) match, the global variable $~ holds a MatchData object which contains complete information about the match:

"hello" =~ /e\w{2}/     # 1: Match an e followed by 2 word characters
$~.string               # "hello": the complete string
$~.to_s                 # "ell": the portion that matched
$~.pre_match            # "h": the portion before the match
$~.post_match           # "o": the portion after the match

$~ is a special thread-local and method-local variable. Two threads running concurrently will see distinct values of this variable. And a method that uses the =~ operator does not alter the value of $~ seen by the calling method. We’ll have more to say about $~ and related global variables later. An object-oriented alternative to this magical and somewhat cryptic variable is Regexp.last_match. Invoking this method with no arguments returns the same value as a reference to $~.
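The method-local behavior can be checked with a short experiment. In this sketch, inner_match is a hypothetical helper that performs its own match. The caller's $~ is the same before and after the call:

```ruby
def inner_match            # hypothetical helper that performs its own match
  "xyz" =~ /y/
  $~[0]                    # this method's own $~
end

"hello" =~ /e\w{2}/
outer_before = $~[0]                    # => "ell"
inner        = inner_match              # => "y"
outer_after  = $~[0]                    # => "ell": unchanged by the call
via_method   = Regexp.last_match[0]     # => "ell": same data as $~
```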

A MatchData object is more powerful when the Regexp that was matched contains subexpressions in parentheses. In this case, the MatchData object can tell us the text (and the starting and ending offsets of that text) that matched each subexpression:

# This is a pattern with three subpatterns
pattern = /(Ruby|Perl)(\s+)(rocks|sucks)!/
text = "Ruby\trocks!"     # Text that matches the pattern
pattern =~ text           # => 0: pattern matches at the first character
data = Regexp.last_match  # => Get match details
data.size                 # => 4: MatchData objects behave like arrays
data[0]                   # => "Ruby\trocks!": the complete matched text
data[1]                   # => "Ruby": text matching first subpattern
data[2]                   # => "\t": text matching second subpattern
data[3]                   # => "rocks": text matching third subpattern
data[1,2]                 # => ["Ruby", "\t"]
data[1..3]                # => ["Ruby", "\t", "rocks"]
data.values_at(1,3)       # => ["Ruby", "rocks"]: only selected indexes
data.captures             # => ["Ruby", "\t", "rocks"]: only subpatterns
Regexp.last_match(3)      # => "rocks": same as Regexp.last_match[3]

# Start and end positions of matches
data.begin(0)             # => 0: start index of entire match
data.begin(2)             # => 4: start index of second subpattern
data.end(2)               # => 5: end index of second subpattern
data.offset(3)            # => [5,10]: start and end of third subpattern

In Ruby 1.9, if a pattern includes named captures, then a MatchData obtained from that pattern can be used like a hash, with the names of capturing groups (as strings or symbols) as keys. For example:

# Ruby 1.9 only
pattern = /(?<lang>Ruby|Perl) (?<ver>\d(\.\d)+) (?<review>rocks|sucks)!/
if (pattern =~ "Ruby 1.9.1 rocks!")
  $~[:lang]            # => "Ruby"
  $~[:ver]             # => "1.9.1"
  $~["review"]         # => "rocks"
  $~.offset(:ver)      # => [5,10] start and end offsets of version number
end
# Names of capturing groups and a map of group names to group numbers
pattern.names          # => ["lang", "ver", "review"]
pattern.named_captures # => {"lang"=>[1],"ver"=>[2],"review"=>[3]}

In addition to the =~ operator, the Regexp and String classes also define a match method. This method is like the match operator, except that instead of returning the index at which a match is found, it returns the MatchData object, or nil if no matching text is found. Use it like this:

if data = pattern.match(text)  # Or: data = text.match(pattern)
  handle_match(data)
end

In Ruby 1.9, you can also associate a block with a call to match. If no match is found, the block is ignored, and match returns nil. If a match is found, however, the MatchData object is passed to the block, and the match method returns whatever the block returns. So in Ruby 1.9, this code can be more succinctly written like this:

pattern.match(text) {|data| handle_match(data) }

Another change in Ruby 1.9 is that the match methods optionally accept an integer as the second argument to specify the starting position of the search.
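Both Ruby 1.9 additions to match can be sketched with a sample string. The text and the version pattern below are assumptions made for illustration:

```ruby
text = "ruby 1.8 and ruby 1.9"
version = /ruby (\d\.\d)/

first = version.match(text)[1]        # => "1.8"
later = version.match(text, 5)[1]     # => "1.9": search starts at index 5

# With a block, match returns the block's value, or nil if nothing matches
as_float = version.match(text) { |m| m[1].to_f }     # => 1.8
no_match = version.match("perl") { |m| m[1] }        # => nil
```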

Global variables for match data

Ruby adopts Perl’s regular expression syntax and, like Perl, sets special global variables after each match. If you are a Perl programmer, you may find these special variables useful. If you are not a Perl programmer, you may find them unreadable! Table 9-3 summarizes these variables. The variables listed in the second column are aliases that are available if you require 'English'.

Table 9-3. Special global regular expression variables

Global    English             Alternative
$~        $LAST_MATCH_INFO    Regexp.last_match
$&        $MATCH              Regexp.last_match[0]
$`        $PREMATCH           Regexp.last_match.pre_match
$'        $POSTMATCH          Regexp.last_match.post_match
$1        none                Regexp.last_match[1]
$2, etc.  none                Regexp.last_match[2], etc.
$+        $LAST_PAREN_MATCH   Regexp.last_match[-1]

$~ is the most important of the variables listed in Table 9-3. All the others are derived from it. If you set $~ to a MatchData object, the values of the other special globals change. The other global variables are read-only and cannot be set directly. Finally, it is important to remember that $~ and the variables derived from it are all thread-local and method-local. This means that two Ruby threads can perform matches at the same time without interfering with each other and it means that the value of these variables, as seen by your code, will not change when your code calls a method that performs a pattern match.
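The relationship between $~ and the derived variables can be seen in a short sketch. The sample sentence is invented for the example. Saving $~ and assigning it back later restores the values of the other globals:

```ruby
"I love Ruby and Perl!" =~ /(Ruby) and (Perl)/

whole  = $&     # => "Ruby and Perl"
before = $`     # => "I love "
after  = $'     # => "!"
first  = $1     # => "Ruby"
last   = $+     # => "Perl": the last group that matched

saved = $~                   # keep the MatchData object
"something else" =~ /else/   # a new match overwrites all of the globals
clobbered = $&               # => "else"
$~ = saved                   # assigning $~ changes the derived variables
restored = $&                # => "Ruby and Perl"
```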

Pattern matching with strings

The String class defines a number of methods that accept Regexp arguments. If you index a string with a regular expression, then the portion of the string that matches the pattern is returned. If the Regexp is followed by an integer, then the corresponding element of the MatchData is returned:

"ruby123"[/\d+/]              # "123"
"ruby123"[/([a-z]+)(\d+)/,1]  # "ruby"
"ruby123"[/([a-z]+)(\d+)/,2]  # "123"

The slice method is a synonym for the string index operator []. The slice! variant returns the same value as slice but also has the side effect of deleting the returned substring from the string:

r = "ruby123"
r.slice!(/\d+/)  # Returns "123", changes r to "ruby"

The split method splits a string into an array of substrings, using a string or regular expression as its delimiter:

s = "one, two, three"
s.split            # ["one,","two,","three"]: whitespace delimiter by default
s.split(", ")      # ["one","two","three"]: hardcoded delimiter
s.split(/\s*,\s*/) # ["one","two","three"]: space is optional around comma

The index method searches a string for a character, substring, or pattern, and returns the start index. With a Regexp argument, it works much like the =~ operator, but it also allows a second argument that specifies the character position at which to begin the search. This allows you to find matches other than the first:

text = "hello world"
pattern = /l/
first = text.index(pattern)       # 2: first match starts at char 2
n = Regexp.last_match.end(0)      # 3: end position of first match
second = text.index(pattern, n)   # 3: search again from there
last = text.rindex(pattern)       # 9: rindex searches backward from end
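The same idea extends to a loop that collects the start index of every match. The helper below, match_positions, is a hypothetical name introduced for this sketch. It advances by at least one character after each match so that a pattern that matches the empty string cannot loop forever:

```ruby
# Hypothetical helper: start indexes of all non-overlapping matches
def match_positions(text, pattern)
  positions = []
  start = 0
  while (i = text.index(pattern, start))
    positions << i
    # Resume after this match, moving at least one character forward
    start = [Regexp.last_match.end(0), i + 1].max
  end
  positions
end

match_positions("hello world", /l/)   # => [2, 3, 9]
```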

Search and replace

Some of the most important String methods that use regular expressions are sub (for substitute) and gsub (for global substitute), and their in-place variants sub! and gsub!. All of these methods perform a search-and-replace operation using a Regexp pattern. sub and sub! replace the first occurrence of the pattern. gsub and gsub! replace all occurrences. sub and gsub return a new string, leaving the original unmodified. sub! and gsub! modify the string on which they are called. If any modifications are made to the string, these mutator methods return the modified string. If no modifications are made, they return nil (which makes the methods suitable for use in if statements and while loops):

phone = gets               # Read a phone number
phone.sub!(/#.*$/, "")     # Delete Ruby-style comments
phone.gsub!(/\D/,' '=>'-') # 1.9: remove non-digits but map space to hyphen
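The return values described above can be checked directly. This sketch uses made-up strings and calls dup so that the string being mutated is certain to be unfrozen. The last line uses the Ruby 1.9 hash form, in which matched text that is not a key in the hash is replaced with the empty string:

```ruby
s = "a-b-c".dup                   # dup so the string is mutable
first_pass  = s.gsub!(/-/, "+")   # => "a+b+c": the modified string
second_pass = s.gsub!(/-/, "+")   # => nil: nothing left to replace

copy = s.sub(/\+/, "")            # => "ab+c": s itself is unchanged

# Hash replacement: space maps to hyphen, other non-digits are removed
digits = "555 123-4567".gsub(/\D/, " " => "-")   # => "555-1234567"
```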

These search-and-replace methods do not require the use of regular expressions; you can also use an ordinary string as the text to be replaced:

text.gsub!("rails", "Rails")     # Change "rails" to "Rails" throughout

However, regular expressions really are more flexible. If you want to capitalize “rails” without modifying “grails”, for example, use a Regexp:

text.gsub!(/\brails\b/, "Rails") # Capitalize the word "Rails" throughout

The reason that the search-and-replace methods are covered in this subsection on their own is that the replacement does not need to be an ordinary string of text. (Replacement strings specified in a hash must be ordinary strings, however.) Suppose you want a replacement string that depends on the details of the match found. The search-and-replace methods process the replacement string before performing replacements. If the string contains a backslash followed by a single digit, then that digit is used as an index into the $~ object, and the text from the MatchData object is used in place of the backslash and the digit. For example, if the string contains the escape sequence \0, the entire matched text is used. If the replacement string contains \1, then the text that matches the first subexpression is used in the replacement. The following code does a case-insensitive search for the word “ruby” and puts HTML bold tags around it, preserving the word’s capitalization:

text.gsub(/\bruby\b/i, '<b>\0</b>')

Note that if you use a double-quoted replacement string, you must double the backslash character.

You might be tempted to try the same thing using normal double-quoted string interpolation:

text.gsub(/\bruby\b/i, "<b>#{$&}</b>")

This does not work, however, because in this case the interpolation is performed on the string literal before it is passed to gsub. This is before the pattern has been matched, so variables like $& are undefined or hold values from a previous match.
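The three cases discussed here can be compared in one short sketch: a single-quoted replacement, a double-quoted one with the backslash doubled, and the interpolation pitfall. The sample text and the deliberately unrelated earlier match are assumptions made for the example:

```ruby
text = "ruby and Ruby"

single  = text.gsub(/\bruby\b/i, '<b>\0</b>')    # \0 is the whole match
double  = text.gsub(/\bruby\b/i, "<b>\\0</b>")   # double-quoted: double the backslash
swapped = "John Smith".sub(/(\w+) (\w+)/, '\2, \1')   # => "Smith, John"

"zzz" =~ /z/                                     # an earlier, unrelated match
stale = text.gsub(/\bruby\b/i, "<b>#{$&}</b>")   # $& is interpolated before gsub runs

single   # => "<b>ruby</b> and <b>Ruby</b>"
stale    # => "<b>z</b> and <b>z</b>"
```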

In Ruby 1.9, you can refer to named capturing groups using the \k named backreference syntax:

# Strip pairs of quotes from a string
re = /(?<quote>['"])(?<body>[^'"]*)\k<quote>/
puts "These are 'quotes'".gsub(re, '\k<body>')

Replacement strings can also refer to text other than that matched by capturing groups. Use \&, \`, \', and \+ to substitute in the value of $&, $`, $', and $+.

Instead of using a static replacement string, the search-and-replace methods can also be called with a block of code that computes the replacement string dynamically. The argument to the block is the text that matched the pattern:

# Use consistent capitalization for the names of programming languages
text = "RUBY Java perl PyThOn"         # Text to modify
lang = /ruby|java|perl|python/i        # Pattern to match
text.gsub!(lang) {|l| l.capitalize }   # Fix capitalization

Within the block of code, you can use $~ and the related global variables listed earlier in Table 9-3:

pattern = /(['"])([^\1]*)\1/   # Single- or double-quoted string
text.gsub!(pattern) do
  if ($1 == '"')   # If it was a double-quoted string
    "'#$2'"        # replace with single-quoted
  else             # Otherwise, if single-quoted
    "\"#$2\""      # replace with double-quoted
  end
  end
end
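As a further sketch of the block form, the replacement below is computed from a captured group, which a static replacement string could not do. The temperature string is invented for the example:

```ruby
temps = "20C and 35C"

# $1 holds the digits captured by the current match; integer arithmetic
fahrenheit = temps.gsub(/(\d+)C/) { "#{$1.to_i * 9 / 5 + 32}F" }
# => "68F and 95F"

# The block argument is the matched text itself
shout = "ruby perl".gsub(/\w+/) { |word| word.upcase }   # => "RUBY PERL"
```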

Regular expression encoding

In Ruby 1.9, Regexp objects have an encoding method just like strings do. You can explicitly specify the encoding of a regular expression with modifiers: u for UTF-8, s for SJIS, e for EUC-JP, and n for none. You can also explicitly specify UTF-8 encoding by including a \u escape in the regular expression. If you don’t explicitly specify an encoding, then the source encoding is used. But if all the characters in the regexp are ASCII, then ASCII is used, even if the source encoding is some superset of ASCII.

Ruby 1.9 pattern-matching operations raise an exception if you attempt to match a pattern and a string that have incompatible encodings. The fixed_encoding? method returns true if a Regexp has an encoding other than ASCII. If fixed_encoding? returns false, then it is safe to use that pattern to match against any string whose encoding is ASCII or a superset of ASCII.
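These rules can be explored with a sketch like the following. The two bytes in sjis_text are assumed to be the Shift_JIS encoding of a single hiragana character, and the variable names are illustrative only. The rescue clause assumes that the exception raised for the mismatch is Encoding::CompatibilityError:

```ruby
plain   = /ruby/        # ASCII-only pattern with no modifier
unicode = /ruby/u       # the u modifier forces UTF-8
escaped = /\u20AC/      # a \u escape also forces UTF-8

plain.encoding            # => #<Encoding:US-ASCII>
plain.fixed_encoding?     # => false
unicode.encoding          # => #<Encoding:UTF-8>
unicode.fixed_encoding?   # => true

# A non-ASCII string in a different encoding (assumed sample bytes)
sjis_text = "\x82\xA0".dup.force_encoding("Shift_JIS")

plain_result = (plain =~ sjis_text)   # => nil: no match, but no error either

error_raised = begin
  unicode =~ sjis_text                # UTF-8 pattern against SJIS text
  false
rescue Encoding::CompatibilityError
  true
end
```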



[*] JavaScript programmers should note that the Ruby class has a lowercase e, unlike the JavaScript RegExp class.
