Chapter 17. Regular Expressions

One of the most exciting SQL features in Oracle Database 10g is support for regular expressions. Open source databases such as MySQL and PostgreSQL have supported regular expressions for years. Frankly, we were tired of having that thrown in our faces, of feeling one-upped by the open source competition. In the Oracle Database 10g release, Oracle has implemented regular expression support with a vengeance. We were surprised and delighted at what we found. You will be, too.

Elementary Regular Expression Syntax

Regular expressions provide a syntax to describe patterns in text. Being able to describe a passage of text gives you power, the power to manipulate that text in ways that would otherwise be impossible. For example, regular expressions enable you to easily perform an operation such as the following:

Find all apparent phone numbers in a free-text, comment column, and reformat those numbers, at least the U.S. and Canadian numbers, so that all appear in the form: (999) 999-9999.

By the end of this chapter, we’ll have shown you exactly how to implement this directive. For now, we want to familiarize you with some elementary regular expression syntax.

Matching a Single Character

The most basic regular expression metacharacter is probably the period (.). Use it to match any single character, except for the newline (usually chr(10)). For example, assume that users of our book’s example database have taken to entering employee address and phone number information into a free-form text column. Further suppose that you need to identify and extract the phone numbers from that column. All employees are in the States, so you can use the pattern ...-.... to search for the common phone number pattern of three digits, a hyphen, and four digits:

               SELECT emp_id, text
               FROM employee_comment
               WHERE REGEXP_LIKE(text,'...-....'),

    EMP_ID TEXT
---------- ------------------------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7499 1105 McConnell Court
           Cedar Lake MI 48812
           Home: 989-387-4321
           Cell: (237) 438-3333
      7521 Her address is: 12646 Bramell. Her city and state are Vestab
           urg, MI 48891
           Home phone is 387-1698
 . . . 
      7900 Cares for 100-year-old aunt during the day. Schedule only fo
           r evening and night shifts.
 . . .

REGEXP_LIKE is the regular expression equivalent to the LIKE predicate. You’ll notice that it looks like a function, and if you prefer to think of it that way, that’s fine, because in PL/SQL, REGEXP_LIKE is, in fact, a function, a Boolean function. However, in SQL, Oracle refers to REGEXP_LIKE as a predicate.

Whatever you choose to call it, REGEXP_LIKE searches the target string to see whether it contains an occurrence of the pattern you specify. In our example, the predicate searches the text column for the phone number pattern, and evaluates to TRUE whenever that pattern is found.

As you can see from our query’s output, our users have taken the words “free-form” to heart. Phone numbers have been entered using a variety of formats. The expression used in our query doesn’t catch every phone number. It also matches at least one string that is not a phone number: “100-year-old” in the comment for #7900. We’ll refine the expression as we present more of Oracle’s regular expression syntax.

Tip

Before running the examples in this chapter, we used the following SQL*Plus formatting commands: SET RECSEP OFF, COLUMN text FORMAT A60.

When using regular expressions in Oracle, you aren’t limited to detecting occurrences of a pattern. After all, once you’ve determined that a string contains text matching a pattern, you’re likely to want to do something with that text. For that Oracle provides the REGEXP_SUBSTR function. REGEXP_SUBSTR extracts text matching a regular expression:

               SELECT emp_id, REGEXP_SUBSTR(text,'...-....') text
               FROM employee_comment
               WHERE REGEXP_LIKE(text,'...-....'),

    EMP_ID TEXT
---------- ---------------
      7369 313-5351
      7499 989-387-
      7521 387-1698

Hmm . . . What’s up with 989-387-? We didn’t ask for a trailing hyphen, or did we? In a way, we did. Our pattern used a period (.) to ask for any four characters following a hyphen, and a hyphen is itself a character. The string 989-387- comes from the combined area code and phone number 989-387-4321. Keep reading! We’ll show you how to fix this.

Matching Any of a Set of Characters

Phone numbers are not made up of any characters; they are, well, numbers. Our previous regular expression is a bit too broad in this respect, as it allows for hyphens and other characters where we want to see only digits. Not to worry! There is a solution. You can use square-brackets to define a matching-list consisting of only the digits 0 through 9. The pattern [0123456789] will match any of the digits, 0 through 9, given in the list. Using this pattern, you can write a phone number expression that more narrowly focuses in on phone numbers:

[0123456789][0123456789][0123456789]-[0123456789][0123456789][0123456789][0123456789]

This pattern is rather awful though, in that it repeats the digits from 0 through 9 seven times. It’s a good thing we aren’t looking for area codes too. Maybe we should look for area codes, but for now we’ll just tell you that you can use a hyphen within square-brackets as a short-hand for enumerating each character in a range. Rather than write [0123456789], you can use the much simpler [0-9] to represent the same set of values. You can now shorten the phone number expression to:

[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

The following query shows the change in results from using this more refined expression:

               SELECT emp_id, 
                  REGEXP_SUBSTR(text,'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]') text
               FROM employee_comment
               WHERE REGEXP_LIKE(text,'[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'),

    EMP_ID TEXT
---------- -------------
      7369 313-5351
      7499 387-4321
      7521 387-1698

This query is much better. Notice 387-4321 in the second row of output. This is the correct phone number that was misrepresented as 989-387- by our query at the end of the preceding section.

You’re not limited to specifying a single range within square-brackets. You can specify any combination of ranges and single characters. For example, you can generate the same matching set as from [0-9] using the following expression:

[01-3456-9]

The matching set in this expression is composed of:

  • The single digit 0

  • The range of digits from 1 through 3

  • The single digits 4 and 5

  • The range of digits 6 through 9

Be aware that, under the default NLS_SORT=BINARY setting, any ranges you specify depend on the underlying character set encoding. It so happens that most, if not all, ASCII- and Unicode-based encodings map the digits 0 through 9 to contiguous code points. However, you can run into problems in this area. For example, the range [A-Z] encompasses all the uppercase Latin characters used in American English, but that range does not encompass other uppercase Latin characters, such as the Croatian alphabet’s LJ. You’ll also find that, while the letters A-Z are contiguous in ASCII, they are not contiguous in EBCDIC (an encoding with roots in IBM operating systems). In EBCDIC, you’ll find that the range [A-Z] encompasses not only the letters A through Z, but also the closing curly-brace (}), the backslash (), and some unassigned code points.

In addition to defining a range, you can negate a matching set. Do this by writing a caret (^) as the first character following the opening square-bracket. For example, use the expression [^0-9] to match all characters but the digits 0 through 9. Use [^a-zA-Z] to match characters other than the 26 letters used in American English.

Within square-brackets, the hyphen (-), caret (^), and other characters take on a special meaning. Table 17-1 tells you a bit more about working with these characters.

Table 17-1. Characters with special meaning inside square brackets

Character

Usage and example

^

Negates a matching set. To include the actual ^ character in a set, place it in any position but the first. For example: [0-9^] includes the digits 0 through 9, plus the caret character.

-

Separates the first and last characters in a range. To include the hyphen as itself, use it in either the first or last position, where it can’t possibly define a range. For example: [-0-9] and [0-9-] both include the digits 0-9 along with the hyphen. In a negation situation you can write the hyphen immediately following the caret. The expression [^-0-9] excludes the digits 0 through 9, as well as excluding the hyphen.

[

Within a square-bracket expression, an opening square-bracket simply represents itself. For example, [[0-9] includes the digits 0 through 9 as well as the [ character.

]

Closes a matching set. To include ] in a set, specify it as the first character following the opening square-bracket (e.g., []0-9]). When negating a set, you can specify ] immediately following the caret, as in: [^]0-9].

Bracket expressions seem simple at first, but can get a bit tricky to write because of the special characters described in Table 17-1. The treatment of those characters may seem arbitrary at first, but there is a logic that will begin to sink in as you gain more experience writing regular expressions.

Tip

In addition to the characters described in Table 17-1, the sequences [:. : :], [. .], and [= =] also have special meaning. We describe these in Section 17.3.

Matching Repeating Sequences

The period (.) matches a single character in a target string. So does a bracket expression, which is why we had to repeat [0-9] three times, and then again four times, in our phone number expression. Wouldn’t it be nice to more easily define elements of an expression that repeat? Sure it would, and to that end you have what are termed quantifiers. A quantifier is a special character, or sequence of characters that specify a valid range of occurrences for the immediately preceding element.

Getting back to our phone number example, rather than repeat each element in the phone number expression:

[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]

you can specify each element once followed by a quantifier giving the repeat count:

[0-9]{3}-[0-9]{4}

The first element, in this case, is the bracket expression [0-9]. The quantifier for the first element is {3}, which means that we want three repetitions of [0-9]. Similarly, the quantifier {4} for the third element specifies four repetitions of a digit. The second element is the hyphen, for which we didn’t specify a quantifier, so the default is to look for a single occurrence. Plugging this shorter and simpler expression into our query from the previous section gives the following, equivalent query:

SELECT emp_id, 
   REGEXP_SUBSTR(text,'[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text,'[0-9]{3}-[0-9]{4}'),

Oracle’s regular expression syntax supports the quantifiers shown in Table 17-2.

Table 17-2. Regular expression quantifiers

Quantifier

Description

*

Specifies that an element may occur zero or more times. For example, .* is the regular expression equivalent to the LIKE predicate’s % metacharacter.

+

Specifies that an element may occur one or more times.

?

Specifies that an element may occur zero or one times, but no more.

{ n }

Specifies that an element must occur n times.

{ n ,}

Specifies than an element must occur at least n times, but may occur more than n times.

{ n,m }

Specifies that an element must occur anywhere between n and m times, inclusive.

{0,m}

Is the same case really, as { n,m }, but this is a way to specify that an element occurs up to, but not more than, m times.

Defining Alternate Possibilities

Our users have not been at all consistent in how they’ve entered data into our free-form comment column. Some phone numbers have area codes. Some do not. And when it comes to separating digit groups, our users show no end of creativity. They’ve used hyphens, periods, and spaces as separators, have mixed those in a single phone number, and you’ll even find a few area codes enclosed within parentheses. What’s a SQL programmer to do?

Dealing with alternate possibilities is an area in which regular expressions really shine when compared to the rather lame LIKE predicate. Let’s begin with the problem of area codes. Some phone numbers have them, and others do not. One way to handle this is to simply write an expression for each case, and join those two expressions with the vertical-bar (|) operator:

[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}

This regular expression will find phone numbers in the following two forms: 999-999-9999 and 999-9999. The vertical bar defines an alternation, or a choice between two possibilities: one with an area code and one without.

Alternation appears to take care of the area code problem. Let’s turn to the issue of separators. You could try to take care of the different separators by adding even more alternations, with a different expression to cover each permutation of separators:

[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}.[0-9]{3}.[0-9]{4}|[0-9]{3} [0-9]{3} [0-9]{4}|[0-
9]{3}-[0-9]{3}.[0-9]{4}|[0-9]{3}.[0-9]{3}-[0-9]{4}|[0-9]{3} [0-9]{3}.[0-9]{4}| . . .

This gets messy fast. A combined area code and phone number contains two separator characters, and with three possible characters to choose from in each position, the number of permutations gets quickly out of hand. Since we’re dealing with single characters, a better approach might be to define a matching set of valid separators:

[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}|[0-9]{3}[-. ][0-9]{4}

This is much better. We’re back to a single alternation between two patterns. The matching set [-. ] matches any of the three separator characters that we are recognizing as valid.

Tip

Because the hyphen (-) comes first in the matching set [-. ], that hyphen is treated as a character in the set. On the other hand, in the matching set [0-9], the hyphen, because it is neither first nor last, is treated as a delimiter between the first (0) and last (9) characters in a range.

The following query shows the results of using this improved expression:

               SELECT emp_id, 
                  REGEXP_SUBSTR(
                     text,
                     '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}|[0-9]{3}[-. ][0-9]{4}')   
                     text
               FROM employee_comment
               WHERE REGEXP_LIKE(
                  text,
                  '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}|[0-9]{3}[-. ][0-9]{4}'),

    EMP_ID TEXT
---------- -------------
      7369 989 313-5351
      7499 989-387-4321
      7521 387-1698
      7566 989.387.4444
      7654 231-898-9823
      7698 388-1234
      7844 989-387.5359
      7876 453-9999

Order matters with alternation. Consider the following two regular expressions:

[0-9]{3}-[0-9]{3}|[0-9]{3}-[0-9]{3}-[0-9]{4}

[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}

Both of these expressions look for either a 999-999 or a 999-999-9999 pattern. The 999-999 is a purposeful deviation from our phone number pattern. The difference between the expressions lies solely in the order in which the alternation is written. Look at the difference in output:

               SELECT REGEXP_SUBSTR(text,'[0-9]{3}-[0-9]{3}|[0-9]{3}-[0-9]{3}-[0-9]{4}')
               FROM employee_comment
               WHERE emp_id = 7499;

REGEXP_SUBSTR
-------------
989-387

SELECT REGEXP_SUBSTR(text,'[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{3}')
               FROM employee_comment
               WHERE emp_id = 7499;

REGEXP_SUBSTR
-------------
989-387-4321

Why the difference in results? It’s because Oracle’s regular expression engine looks at the alternates in left-to-right order. For reference, here’s an example showing the target text:

               SELECT text
               FROM employee_comment
               WHERE emp_id = 7499;

TEXT
---------------------
1105 McConnell Court
Cedar Lake MI 48812
Home: 989-387-4321
Cell: (237) 438-3333

Both alternates begin to match at the same point in the target text. Whenever that happens, the regular expression engine looks at the alternates from left-to-right. If the left-most alternate matches, the regular expression engine is satisfied, and that match is returned from REGEXP_SUBSTR. When writing alternations, if you have any preference as to which alternation takes precedence, be sure to write that alternation first.

Subexpressions

We’re not quite done with phone numbers yet. One last issue to deal with is that of phone numbers with area codes enclosed in parentheses, in the form: (999) 999-9999. Here’s one way to think about this problem:

  • A phone number might have an area code:

    • The area code might be enclosed within parentheses.

    • The area code might not be enclosed within parentheses.

  • The phone number might not have an area code.

Essentially, what you have here is an alternation within an alternation. Working from the inside out, you can write the following expression to accommodate both area code possibilities:

[0-9]{3}|([0-9]{3})

The left side of this expression looks for three digits. The right side looks for three digits enclosed within parentheses. Why the backslash () characters? Those are there because otherwise the parentheses are treated as special characters, enclosing a subexpression.

A subexpression is a portion of a longer expression that you wish to treat as a discrete unit. For example, you might wish to apply a quantifier to a subexpression. Indeed, that’s a good way to indicate that area codes are optional:

([0-9]{3}|([0-9]{3}))?

The parentheses, the ones not preceded by characters, define a subexpression defining two alternate representations of an area code. The ? quantifier specifies that the preceding subexpression, the area code, occurs either once or not at all.

There’s one more housekeeping detail to take care of before filling out the non-optional portion of the phone number expression. If there’s an optional area code, it will be followed by some sort of separator. That separator will be a space if the area code is enclosed by parentheses, or it will be one of either a hyphen, period, or space:

([0-9]{3}[-. ]|([0-9]{3}) )?

Now, all that remains is to add the seven-digit phone number pattern onto the end of the preceding expression:

([0-9]{3}[-. ]|([0-9]{3}) )?[0-9]{3}[-. ][0-9]{4}

The following query shows the results of this latest iteration of the phone number pattern:

               SELECT emp_id, REGEXP_SUBSTR(text,
                  '([0-9]{3}[-. ]|([0-9]{3}) )?[0-9]{3}[-. ][0-9]{4}')
                  text
               FROM employee_comment;

    EMP_ID TEXT
---------- ---------------
      7369 989 313-5351
      7499 989-387-4321
      7521 387-1698
      7566 989.387.4444
      7654 231-898-9823
      7698 (989) 388-1234
      7782
      7788
      7839
      7844 989-387.5359
      7876 (231) 453-9999
      7900

In addition to dealing with tricky alternations, subexpressions are also essential when working with backreferences, something you’ll read more about later in Section 17.3.

Anchoring an Expression

Two metacharacters allow you to anchor an expression to either the beginning or end, or both, of the target string. When you anchor an expression to the beginning of the target string, you are specifying that no other characters may precede the expression. Likewise, when you anchor an expression to the end of a string, you are specifying that no characters are allowed to follow the expression.

Use a caret (^) to anchor an expression to the beginning of a string. Use a dollar sign ($) to anchor an expression to the end of a string. Here’s an example, using REGEXP_INSTR, to illustrate. The REGEXP_INSTR function returns the character position where a match for an expression is found:

               SELECT REGEXP_INSTR('123','[0-9]') unanchored,
                      REGEXP_INSTR('123','[0-9]$') anchored_end
               FROM dual;

UNANCHORED ANCHORED_END
---------- ------------
         1            3

Notice the difference in character positions returned by the two function calls. The expression in the second function call used a $ to search for a digit at the end of the string, and that’s exactly what the function found.

Tip

The ^ and $ are anchor characters only when used outside of a bracket expression.

You can combine the use of ^ and $ to write an expression encompassing the entire target string. The following example searches for comments containing only a phone number:

               SELECT emp_id, REGEXP_SUBSTR(text,
                  '^([0-9]{3}[-. ]|([0-9]{3}) )?[0-9]{3}[-. ][0-9]{4}$')
                  text
               FROM employee_comment
               WHERE REGEXP_LIKE(text, 
                  '^([0-9]{3}[-. ]|([0-9]{3}) )?[0-9]{3}[-. ][0-9]{4}$'),

    EMP_ID TEXT
---------- -------------
      7844 989-387.5359

There’s one thing to be careful of when using the anchoring characters. It’s possible to write expressions that can’t possibly match anything in the target text. For example, you could write $[0-9], which anchors the beginning of the expression to the end of the target string. You won’t find any digits after the string ends, so this expression will always fail. Watch for this sort of thing, and be careful about where you place your anchoring metacharacters.

Understanding Greediness

Greediness is an important concept to understand when writing regular expressions. We don’t mean greed of the Enron sort, but rather that each quantifier in a regular expression will always match as much text as possible. The results from this rule can sometimes be surprising.

For example, consider the following quotation:

Brighten the corner where you are.

Think about the problem of extracting the first word from this text. Many would look at the text, see that a word is a series of letters followed by a space, and would immediately translate that thought to an expression like '.* ', which matches any number of characters followed by space. That seems logical, doesn’t it? Yet look at the results:

               SELECT REGEXP_SUBSTR('Brighten the corner where you are',
                                    '.* ')
               FROM dual;

REGEXP_SUBSTR('BRIGHTENTHECORN
------------------------------
Brighten the corner where you

Is this the result you expected? If it is, then go to the head of the class and take a gold star, because you’re already way ahead of us. If this result surprises you, then think about what we asked for:

We asked for a series of characters, of any length, followed by a space.

Faced with this requirement, how would you satisfy it? It turns out there are three possible approaches you can take:

  • You could stop at the first space.

  • You could stop at the last space.

  • You could stop at some arbitrary space somewhere in between the first and the last.

Computers are not good at being arbitrary, and no one wants them to be arbitrary either, so that last option is out. Here’s how Oracle’s regular expression engine conceptually matches the expression '.* ' with our example text:

  1. The first element of the expression is a period, representing any character. The quantifier allows any number of occurrences, so the engine finds all the characters that it can, stopping only when it reaches the letter e at the end of the string. At this point, the engine has matched the entire string: “Brighten the corner where you are”.

  2. The expression calls for a space. An e is not a space, so the engine backs up one character to the r.

  3. An r is not a space, so the engine backs up another character to the a.

  4. An a is not a space, so the engine backs up again, finds the space following “you”, and returns “Brighten the corner where you”, including the trailing-space, as the result from REGEXP_SUBSTR.

The key point to take away from this section is that the regular expression engine doesn’t look at a pattern quite the same way you might. When you mentally apply the pattern '.* ' to a string, your tendency is probably going to be to stop the moment you have a match. You can do that easily, because your brain is a much better pattern-matching engine than a computer. You see the pattern as a whole, and you’ll tend to gravitate toward the interpretation that you had in mind when you wrote the expression. Regular expression engines look at patterns and strings a character at a time, moving back and forth, finding the longest match for each regular expression element.

Advanced Function Options

By now we’ve introduced you to all but one of Oracle’s regular expression functions. However, you’ve only seen them used in their simplest form. Their full syntax is:

REGEXP_INSTR(source_string, expression 
            [, position [, occurrence
            [, return_option 
            [, match_parameter]]]])

REGEXP_LIKE (source_string, expression 
            [, match_parameter])

REGEXP_REPLACE(source_string, expression
              [, replace_string
              [, position [, occurrence
              [, match_parameter]]]])

REGEXP_SUBSTR(source_string, expression
             [, position [, occurrence
             [, match_parameter]]])

If you’re familiar with Oracle’s string-manipulation functions, you’ll have no trouble discerning the purpose of the REGEXP functions:

REGEXP_INSTR

Returns the character position at which text can be found matching a regular expression in a target string.

REGEXP_LIKE

Discerns whether a given string contains text matching an expression. This is a Boolean function, returning TRUE or FALSE, or NULL.

REGEXP_REPLACE

Performs a regular expression search-and-replace operation, which you’ll learn about in Section 17.3.

REGEXP_SUBSTR

Extracts text matching a regular expression from a string.

The parameters to these functions are as follows:

source_string

The string to be searched.

expression

A regular expression describing the pattern of text that you seek.

replace_string

A string generating the replacement text to be used in a search-and-replace operation.

position

The character position within source_string at which to begin a search. This defaults to 1.

occurrence

The occurrence of the pattern you wish to locate. This defaults to 1, giving you the first possible match.

return_option

Valid only for REGEXP_INSTR, and determines whether the beginning or ending character position is returned for text matching a pattern. The default is 0, for the beginning. Use 1 to return the ending position.

match_parameter

A text string through which you may specify options to vary the behavior of the regular expression matching engine.

The match_parameter deserves a bit of explanation. It’s a character string that you build using one or more of the following letters: i, c, n, m. One use is to specify whether matching is case-sensitive. By default, your NLS_SORT setting drives this aspect of regular expression behavior. You can override that default on a call-by-call basis, using i to get case-insensitive matching, and c to get case-sensitive matching. Following is a simple demonstration that works with our example data set. When the i option is used, the match succeeds, even though the case really does not match:

            SELECT emp_id, text
            FROM employee_comment
            WHERE REGEXP_LIKE(text, 'CEDAR LAKE', 'c'),

no rows selected

SELECT emp_id, text
            FROM employee_comment
            WHERE REGEXP_LIKE(text, 'CEDAR LAKE', 'i'),

    EMP_ID TEXT
---------- ---------------------------------------------
      7499 1105 McConnell Court
           Cedar Lake MI 48812
           Home: (989) 387-4321
           Cell: (237) 438-3333
      7782 Academy Apartments, #138, Cedar Lake MI 48812
      7788 #1 Water Tower Lane
           Cedar Lake MI 48812

The n option is useful when working with data, such as ours, that contains embedded newline (usually chr(10)) characters. By default, the period (.) does not match newlines, which is why the following query brings back only the first line of comment text:

            SELECT emp_id, REGEXP_SUBSTR(text, '.*') text
            FROM employee_comment
            WHERE emp_id = 7698;

    EMP_ID TEXT
---------- ---------------
      7698 Marion Blake

Using 'n' for the match parameter, you can have the period match newlines, which in the case of this query, results in the entire comment being returned:

            SELECT emp_id, REGEXP_SUBSTR(text, '.*', 1, 1, 'n') text
            FROM employee_comment
            WHERE emp_id = 7698;

    EMP_ID TEXT
---------- -----------------------
      7698 Marion Blake
           N1375 Lakeshore Drive
           Vestaburg MI 48891
           (989) 388-1234

This example also illustrates the important point that specifying a match parameter forces you to also specify any preceding, optional parameters that you would otherwise skip.

The final match option is m, which changes the definition of line with respect to the ^ and $ metacharacters. By default, line means the entire target string, so an expression such as ^.*$, together with the n option, will bring back all characters in the target string:

            SELECT emp_id, REGEXP_SUBSTR(text, '^.*$', 1, 1, 'n') text
            FROM employee_comment
            WHERE emp_id = 7788;

    EMP_ID TEXT
---------- --------------------
      7788 #1 Water Tower Lane
           Cedar Lake MI 48812

The n option is necessary here, because otherwise the period would not match the newline. If the period did not match the newline, the expression would fail to match at that point, and it would be impossible to match all characters between the beginning and end of the string.

However, using the m option causes the definition of line to change from the entire target string, to any line within that string, where lines are delimited by newline characters. The following example removes n, replacing it with m:

            SELECT emp_id, REGEXP_SUBSTR(text, '^.*$', 1, 1, 'm') text
            FROM employee_comment
            WHERE emp_id = 7788;

    EMP_ID TEXT
---------- --------------------
      7788 #1 Water Tower Lane

You can specify multiple match parameters in any order. For example, 'in' means the same as 'ni‘. If you specify conflicting options, such as 'ic', the last option, 'c' in this case, is the one that takes precedence.

Tip

Try our last example in this section using 'mn' as the match parameter string. You’ll get back the entire target string. Why? Because of the greediness rule. When 'mn' is used, the first line is a match, but the entire string is a longer match.

Advanced Regular Expression Syntax

As you delve more deeply into using regular expressions, and especially if you work with multilingual data, there are some advanced aspects to regular expressions that you’ll find helpful.

Using Backreferences

You’ve already learned about subexpressions. A backreference is a way that you can reference the value of text matching a preceding subexpression. Think about the problem of searching specifically for 10-digit phone numbers that are inconsistent in their use of separator characters:

906.387-5359
989 313-5351

Each of these numbers uses a separator between the first two digit groups that differs from that used between the second two digit groups. What sort of expression would you write to detect this sort of mismatch? One with backreferences.

The key issue in the mismatched separator example we’ve just described is that to solve the problem, to identify phone-numbers with mismatched separators, you need the ability within an expression to refer to characters matched by earlier parts of that expression. In this case, you need the ability to refer to the first separator from the position of the second. Backreferences let you do this.

The following regular expression includes parentheses around the first bracket expression, which corresponds to the first separator. The parentheses define a subexpression:

[0-9]{3}([-. ])[0-9]{3}1[0-9]{4}

The subexpression ([-. ]) is the first subexpression, so you refer to it using the notation 1. Later in the expression, where you would otherwise expect to see a second occurrence of the bracket expression [-. ], you see 1, which references the value matched by the first subexpression.

The words matched by are critical here. The backreference 1 is not equivalent to [-. ]. Rather, the backreference is equivalent to that part of the text that is matched by [-. ]. If the first separator is a hyphen, then 1 will be equivalent to a hyphen. A specific value for 1 won’t be known until you execute a regular expression query, and even then it will change from row to row, as the regular expression is applied to one phone number after another.

The following query uses REGEXP_SUBSTR to extract the first phone number from the free-text, comment column. Only 10-digit phone numbers without parentheses around area codes are considered. That first phone number is then tested using NOT REGEXP_LIKE, to see whether both separators are the same.

               SELECT emp_id, text
               FROM employee_comment
               WHERE NOT REGEXP_LIKE(
                  REGEXP_SUBSTR(text, '[0-9]{3}[-. ][0-9]{3}[-. ][0-9]{4}'),
                  '[0-9]{3}([-. ])[0-9]{3}1[0-9]{4}'),

    EMP_ID TEXT
---------- --------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7844 989-387.5359

If the separators in a given phone number differ, then REGEXP_LIKE returns FALSE, causing NOT REGEXP_LIKE to return TRUE, with the result that the row containing the phone number is included in the result set.

Tip

If you try using WHERE NOT REGEXP_LIKE(text, '[0-9]{3}([-. ])[0-9]{3}1[0-9]{4}'), you’ll find many comments without phone numbers in that format. However, not all of those comments will contain phone numbers with mismatched separators.

Oracle allows you up to nine backreferences, numbered 1 through 9. These refer to the first up-to-nine subexpressions, counting from left to right.

Backreferences are particularly important in performing regular expression search-and-replace operations. In fact, most of our own backreference usage falls into this category.

Remember all those phone number searches from our earlier examples in this chapter? Remember how creative our users were when it came to formatting phone numbers? Imagine for a moment that you want to unify the phone number format across all those free-form text entries. You want all phone numbers to be in the form 999-9999, with a hyphen separating digit groups. If there’s an area code, you want that in parentheses, in the form (999) 999-9999. This is a hard problem, or it would be, if you had to write PL/SQL to solve it. Using regular expressions, it’s not such a hard problem at all. You can solve it with one UPDATE statement using a nested call to Oracle’s REGEXP_REPLACE function.

Begin by resurrecting the seven-digit portion of the phone number expression from Section 17.1.5:

[0-9]{3}[-. ][0-9]{4}

Enclose the pattern for each digit group within parentheses, making two subexpressions:

([0-9]{3})[-. ]([0-9]{4})

You can now reference the first three digits as 1, and the last four digits as 2. What’s more, you can use these two backreferences in the replacement text that you provide to REGEXP_REPLACE:

               SELECT emp_id, REGEXP_REPLACE(
                  text, '([0-9]{3})[-. ]([0-9]{4})', '1-2') text
               FROM employee_comment;

    EMP_ID TEXT
---------- ------------------------------------------
      7369 126 Varnum, Edmore MI 48829, 989 313-5351
      7499 90 McConnell Court
           Cedar Lake MI 48812
           Home: 989-387-4321
           Cell: (237) 438-3333
 . . .

The REGEXP_REPLACE function call in this statement finds each occurrence of the pattern matched by the second parameter, and replaces it with text from the third parameter. The backreferences are what make this operation truly exciting. Using backreferences, you can reference the text to be replaced from your replacement string, giving you great power to move and reformat text.

You’re not done though. Now it’s time to worry about that area code. Following is the area code pattern developed earlier in this chapter:

([0-9]{3}[-. ]|([0-9]{3}) )?

All the seven-digit phone numbers, whether following an area code or not, should fit the following pattern, which allows only a hyphen as a separator:

[0-9]{3}-[0-9]{4}

Put the two patterns together, and you have:

([0-9]{3}[-. ]|([0-9]{3}) )?[0-9]{3}-[0-9]{4}

For this particular replacement, you want to leave seven-digit phone numbers alone, because you’ve already fixed those. To this end, remove the ? to make the area code required:

([0-9]{3}[-. ]|([0-9]{3}) )[0-9]{3}-[0-9]{4}

Your next step is to put parentheses around each element of the phone number that you care about. There already are parentheses around the area code, but that subexpression also encompasses whatever separator character follows the area code. The following expression factors the separator out of the area code subexpression:

([0-9]{3}|([0-9]{3}))[-. ][0-9]{3}-[0-9]{4}

The semantics of this expression differ slightly from the previous. When the area code is enclosed in parentheses, this expression allows not only a space to follow, but also a period or a hyphen. We could work around this by approaching parenthetically enclosed area codes as a completely separate problem, to be solved using a separate UPDATE statement, but we’ve chosen to be flexible and loosen up our pattern just a bit to make our work easier. Sometimes you need to do that.

Next, put parentheses around the elements in the pattern that you wish to manipulate. The seven-digit phone number is ok as it is, so you can enclose that entire portion of the pattern, making it one subexpression:

([0-9]{3}|([0-9]{3}))[-. ]([0-9]{3}-[0-9]{4})

Dealing with the area code gets a bit tricky, and you’ll see why in a moment. For now, look at the expression so far, and realize that 1 refers to the area code, and 2 to the remainder of the phone number. Following is a new SQL query that feeds the results of the earlier REGEXP_REPLACE function, the one to fix seven-digit numbers, into a second REGEXP_REPLACE call, this time to correctly format area codes:

               SELECT emp_id, 
                  REGEXP_REPLACE(
                     REGEXP_REPLACE(
                        text, '([0-9]{3})[-. ]([0-9]{4})', '1-2'),
                     '([0-9]{3}|([0-9]{3}))[-. ]([0-9]{3}-[0-9]{4})',
                     '(1) 2') text
               FROM employee_comment;

    EMP_ID TEXT
---------- -----------------------------------------------
      7369 126 Varnum, Edmore MI 48829, (989) 313-5351
      7499 90 McConnell Court
           Cedar Lake MI 48812
           Home: (989) 387-4321
           Cell: ((237)) 438-3333
 . . .

We realize this query is becoming difficult to follow. Please stick with us, and study this query until you understand it. Notice the output, in particular the cell-phone number, which now reads ((237)) 438-3333. Oops! Doubled parentheses are not at all what you wanted to see around an area code. What happened?

The reason you see doubled parentheses is because of what 1 refers to. If you carefully study the regular expression feeding into the outermost call to REGEXP_REPLACE, you’ll see that the first subexpression includes any parentheses that may already be around an area code. The replacement text rebuilds the area code by using (1) to enclose it within parentheses. Take an area code already within parentheses, enclose it again, and you end up with results such as ((237)).

We fell into the pit we’ve just described while writing this chapter. We’re telling you the story now for a couple reasons. One, you wouldn’t otherwise easily understand why we went with the solution we’re about to show you. Two, we want you to know that the thought-process to a successful regular expression is rarely a straight line from problem to solution. You’ll often go down a path only to encounter a problem, forcing you to backtrack a bit and try again. Indeed, we backtracked and changed course at least three times while developing this example. Don’t be dissuaded if your first attempt at a regular expression solution doesn’t quite have the effect you’re after. Expect to do a certain amount of experimenting and testing whenever you write anything but the simplest of expressions.

The following expression is the key to resolving the doubled parentheses problem:

(([0-9]{3})|(([0-9]{3})))[-. ]([0-9]{3}-[0-9]{4})

In this expression, we’ve made each of the two alternate area code possibilities into its own subexpression. Both those subexpressions are nested within a larger subexpression. Subexpressions are numbered, from left to right, beginning at 1, based on the order in which each opening parenthesis is encountered. In this expression, 1 will be the area code inclusive of any parentheses that might be present. 2 corresponds to the area code without parentheses, if one is found. Otherwise, 2 will be NULL. Likewise, 3 corresponds to an area code enclosed within parentheses, but does not include those parentheses. The following query uses (23) to generate the area code in the correct format. It may seem odd to use both backreferences, but the technique works in this case, because one will always be NULL.

               SELECT emp_id, 
                  REGEXP_REPLACE(
                     REGEXP_REPLACE(
                        text, '([0-9]{3})[-. ]([0-9]{4})', '1-2'),
                     '(([0-9]{3})|(([0-9]{3})))[-. ]([0-9]{3}-[0-9]{4})',
                     '(23) 4') text
               FROM employee_comment;

    EMP_ID TEXT
---------- ---------------------------------------------
      7369 126 Varnum, Edmore MI 48829, (989) 313-5351
      7499 90 McConnell Court
           Cedar Lake MI 48812
           Home: (989) 387-4321
           Cell: (237) 438-3333
 . . .

Now that all the kinks have been worked out of the expression, it’s a relatively simple matter to use it in an UPDATE statement:

UPDATE employee_comment
SET text = 
   REGEXP_REPLACE(
      REGEXP_REPLACE(
         text, '([0-9]{3})[-. ]([0-9]{4})', '1-2'),
      '(([0-9]{3})|(([0-9]{3})))[-. ]([0-9]{3}-[0-9]{4})',
      '(23) 4'),

And there you have it. With one statement, and a little bit of regular expression magic, you can consistently format all those randomly formatted phone numbers in a free-format comment column.

Warning

Before unleashing a regular expression search-and-replace, follow our example from this section and test what you are doing by issuing a SELECT statement, and carefully verify that your regular expression is having the effect that you expected it to have. When everything looks good, then unleash your UPDATE.

Using Named Character Classes

Earlier, in our phone number examples, we used bracket expressions such as [0123456789] and [0-9] to create matching sets to match any of the digits, zero through nine. Oracle also supports named, character classes, which provide you with a handy and reliable way to create commonly used matching sets. For example, you can use [:digit:] within a bracket expression to represent the set of all digits. Instead of using [0-9] in the following expression:

([0-9]{3}[-. ]|([0-9]{3}) )?[0-9]{3}[-. ][0-9]{4}

You can use [[:digit:]] to match any character defined as a digit:

([[:digit:]]{3}[-. ]|([ [:digit:]]{3}) )?[ [:digit:]]{3}[-. ][ [:digit:]]{4}

Table 17-3 gives a list of valid, character class names, which you must always enclose within a bracket expression. You can include other characters within the same bracket expression. For example, you can write [[:digit:]A-F] to create a matching set of all digits plus the letters A-F.

Table 17-3. Supported character classes

Class

Description

[:alnum:]

Alphanumeric characters (same as [:alpha:] + [:digit:])

[:alpha:]

Alphabetic characters only

[:blank:]

Blankspace characters, such as space and tab.

[:cntrl:]

Nonprinting, or control characters

[:digit:]

Numeric digits

[:graph:]

Graphical characters (same as [:punct:] + [:upper:] + [:lower:] + [:digit:])

[:lower:]

Lowercase letters

[:print:]

Printable characters

[:punct:]

Punctuation characters

[:space:]

Whitespace characters, such as space, form-feed, newline, carriage return, horizontal tab, and vertical tab

[:upper:]

Uppercase letters

[:xdigit:]

Hexadecimal characters

There are pros and cons to using named character classes, but mostly pros:

  • You don’t need to worry about the underlying code points used to represent characters in whatever character set you are using. A matching set defined as [A-Za-z] might include characters other than those letters. A matching set defined as [[:alpha:]] will contain only letters.

  • You can easily accommodate characters from many languages. For example, [:digit:] matches not only the English 0-9, but also the Arabic-Indic ٠-٩. This is important in multilingual environments.

  • You don’t need to worry about inadvertently omitting a character from a matching set. Quick! If I want to match all punctuation, and use [.,;:!], have I forgotten anything? That worry goes away when using [[:punct:]].

  • You can more easily include characters that would otherwise be difficult to type from the keyboard. The [:cntrl:] class, for example, represents nonprinting, control characters.

The only downside we can think of to using named character classes, and you could argue that it’s not even a downside, is that if you do use a class such as [:digit:], and you really do care only about 0-9, you may prefer the results from [0-9] instead. In practice though, we have yet to regret using a named character class whenever one fits the problem at hand.

Specifying Collation Elements

Oracle has always been strong in its support of multilingual data, and that strength now extends to regular expressions. When working with languages other than English, you’ll sometimes encounter characters that appear to be two letters, because they are composed of two glyphs. For example, Spanish, at least old-Spanish, treats ch and ll as single letters.

Dealing with letters such as ch can sometimes be problematic. The following example shows one attempt to extract a word beginning with either the letter ch or ll from a string. We use the XSPANISH sort, because that sort works by the old rules under which ch and ll are each treated as one letter. We build our regular expression as follows:

(^| )

A word is preceded by either the beginning of a line, or by a space.

[chll]

We want our word to begin with either ch or ll. This bracket expression is our first attempt to define a matching set containing those two characters.

[^[:space:][:punct:]]+

The first letter of our word must be followed by one or more non-space, non-punctuation characters. We could use * instead + to include one-letter words in our search.

Using the expression we’ve just described, we specify 'i' as the match parameter to get a case-insensitive search. Look carefully at the results:

               ALTER SESSION SET NLS_SORT=XSPANISH;
               SELECT TRIM(REGEXP_SUBSTR(
                  'El caballo, Chico come la tortilla.',
                  '(^| )[chll][^[:space:][:punct:]]+',1,1,'i'))
               FROM dual;

caballo

This result isn’t what we want at all. The problem here is that our use of [chll] results in a matching set composed of three letters, c, h, and l, any of which is deemed a valid match. Hence, our query found caballo rather than Chico. Collation element syntax lets you deal with this situation. To treat a multicharacter collation element as a single letter, enclose it within [. and .]. Then you must enclose that within a bracket expression. The result is that [.ch.] is recognized as the single letter ch, and [.ll.] is recognized as the single letter ll:

               ALTER SESSION SET NLS_SORT=XSPANISH;
               SELECT TRIM(REGEXP_SUBSTR(
                  'El caballo, Chico come la tortilla.',
                  '(^| )[[.ch.][.ll.]][^[:space:][:punct:]]+',1,1,'i'))
               FROM dual;

Chico

Technically, any single character is a collation element. Thus, [a] and [[.a.]] are equivalent. In practice, you only need use collation element syntax when a collation element consists of multiple characters that linguistically represent one character.

Tip

You cannot arbitrarily put any two letters in a collation. For example, you cannot write [.jg.], because those are two, separate letters, which you cannot arbitrarily treat as a single letter.

Table 17-4 provides a list of collation elements recognized by Oracle. The elements in the table are valid only for the specified NLS_SORT settings.

Table 17-4. Collation elements

NLS_SORT

Multicharacter collation elements

XDANISH

aa AA Aaoe OE Oe

XSPANISH

ch CH Chll LL Ll

XHUNGARIAN

cs CS Csgy GY Gyly LY Lyny NY Nysz SZ Szty TY Tyzs ZS Zs

XCZECH

ch CH Ch

XCZECH_PUNCTUATION

ch CH Ch

XSLOVAK

dz DZ Dz

dž DŽ Dž

ch CH Ch

XCROATIAN

dž DŽ Dž

lj LJ Lj

nj Nj NJ

Defining Equivalence Classes

An equivalence class is a set of characters that would all be the same except for their case or the way they are accented. You can create such a class by using [= and =] to surround a letter when you wish to match all accented and unaccented versions of that letter. The resulting equivalence class reference must always be within a bracket expression.

For example:

               SELECT REGEXP_SUBSTR('eéëèÉËÈE' '[[=É=]]+')
               FROM dual;

eéëèÉËÈE

SELECT REGEXP_SUBSTR('eéëèÉËÈE', '[[=e=]]+') 
               FROM dual;

eéëèÉËÈE

It doesn’t matter which version of a letter you specify between the [= and =]. All equivalent accented and unaccented letters, whether uppercase or lowercase, will match.

Tip

NLS_SORT determines which characters are considered to be equivalent. Thus, equivalence is determined appropriately for whatever language you are using.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.235.219