Oracle’s regular expression support manifests itself in the form of three SQL functions and one predicate that you can use to search and manipulate text in any of Oracle’s supported text datatypes: VARCHAR2, CHAR, NVARCHAR2, NCHAR, CLOB, and NCLOB.
Regular expression support does not extend to LONG, because LONG is supported only for backward compatibility with existing code.
Following are the four functions you’ll use to work with regular expressions in Oracle:
Determines whether a specific column, variable, or text literal contains text matching a regular expression.
Locates, by character position, an occurrence of text matching a regular expression.
Replaces text matching a regular expression with new text that you specify. Your replacement text can include backreferences to values in the regular expression.
Extracts text matching a regular expression from a character column, variable, or text literal.
Of these, you’ve already seen REGEXP_LIKE in quite a few examples. REGEXP_LIKE is documented in the “Conditions” chapter of the Oracle Database 10g SQL Reference because in SQL it can only be used as a predicate in the WHERE and HAVING clauses of a query or DML statement. In PL/SQL, however, you can use REGEXP_LIKE as you would any other Boolean function:
DECLARE x Boolean; BEGIN x := REGEXP_LIKE( 'Does this string mention Oracle?', 'Oracle'), END; /
The remaining three functions work identically in SQL and PL/SQL. All four functions are fully described in Section 1.9 near the end of this book.
Oracle is notable for its Globalization Support in that it supports an exceedingly wide variety of character sets, languages, territories, and linguistic sorts. Regular expressions are no exception. The combination of character set, language, and territory is known as a locale. Oracle’s regular expression engine respects locale, and is configurable via NLS (National Language Support) parameter settings.
Following are some notable examples of the way in which regular expression locale support affects you:
The regular expression engine is character-based. The period (.) will always match a single character or, more strictly speaking, a single codepoint, regardless of how many bytes are used to represent that character in the underlying character set.
Character classes are sensitive to the underlying character set. For
example, if you’re using one of the Unicode
character sets, the class [:digit:]
will include
not only 0, 1, 2, through 9, but also the Arabic-Indic
٠, ١, ٢ through ٩, the Bengali ০, ১, ২ through ৯, and so forth.
NLS_SORT can affect how comparisons are performed. If NLS_SORT
considers two characters to be equivalent, then so does the regular
expression engine. For example, using the default sort of BINARY, the
expression 'resume
' will not match the text
'Résumé
‘. Change NLS_SORT to
GENERIC_BASELETTER, and the expression does match, because that sort
treats 'e
' and 'é
'
as the same letter and also ignores case.
Bracket expressions such as [A-z]
are affected by
the underlying character set and the sort order. For example:
[a-z]
includes A when using the case-insensitive
sort GERMAN_CI, but not when using GERMAN.
Given an ASCII-based character set and the BINARY sort order,
[A-z]
encompasses all letters, upper- and
lowercase. Given an EBCDIC character set and the BINARY sort order,
[A-z]
fails to be a valid expression, even failing
to compile, because in EBCDIC the binary representation of the letter
A
comes after that of the letter
z
.
If a regular expression is in one character set, and the text to be searched is in another, the regular expression will be converted to the character set of the text to be searched.
Your NLS_SORT setting affects whether case-sensitive matching is done by default. A sort such as SPANISH yields case-sensitive sorting. You can add the suffix _CI, as in SPANISH_CI, to linguistic sorts to get a case-insensitive sort. Use the suffix _AI for an accent-insensitive sort.
NLS_SORT also affects which accented and unaccented characters are
considered to be of the same class. For example, the expression
'na[[=i=]]ve
' will match both
'naive
' and 'naïve
'
when NLS_SORT is set to BINARY (the default sort for the AMERICAN
language), but not when NLS_SORT is set to GREEK.
NLS_SORT affects which collation elements are considered valid. For
example, [.ch.]
is recognized by Spanish sorting
rules (when NLS_SORT equals XSPANISH), but not by American sorting
rules.
Each of Oracle’s regular expression functions takes
an optional match_parameter
, which is a
character string that you can fill with one-character flags. This
string gives you control over the following aspects of regular
expression behavior:
NLS_SORT controls whether matching is case-sensitive by default, which it usually will be. You can override the default on a per-call basis.
By default, periods do not match newline characters (occurrences of
CHR(10)
on Unix systems) in the source text. You
can specify that periods match newlines.
By default, the source string that you are searching is considered
one long line, and the caret (^)
and dollar sign
($)
match only the beginning and ending of the
entire string.
You can specify that the source value is to be treated as many lines
delimited by newline characters. If you do so, then the
^
and $
match the beginning and
end of each line respectively.
The following example demonstrates the use of the
match_parameter
by performing a
case-insensitive search for doubled words. The
match_parameter
value in this case is
'i
‘. The two 1
parameters
preceding 'i
' in REGEXP_SUBSTR supply the default
values for starting position and occurrence. Those parameters need to
be specified in order to reach the
match_parameter
.
SELECT park_name, REGEXP_SUBSTR( description, '(^|[[:space:][:punct:]]+)([[:alpha:]]+)' || '([[:space:][:punct:]])+2' || '([[:space:][:punct:]]+|$)', 1,1,'i') duplicates FROM park WHERE REGEXP_LIKE(description, '(^|[[:space:][:punct:]]+)([[:alpha:]]+)' || '([[:space:][:punct:]])+2' || '([[:space:][:punct:]]+|$)', 'i'),
To specify multiple parameters, simply list them in one string. For
example, to request case-insensitive matching with periods matching
newline characters, specify 'in
' or
'ni
' as your
match_parameter
.
If you specify contradictory parameters, Oracle uses the last value
in the string. For example, 'ic
' is contradictory
because 'i
' asks for case-insensitivity, while
'c
' asks for the opposite. Oracle resolves this by
taking the last value in the string, in this case the
'c
‘.
If you specify parameters that are undefined, Oracle will return an
ORA-01760: illegal argument for function
error.
Oracle’s regular expression engine is of the traditional nondeterministic finite automata (traditional NFA) variety, the same type used in Perl, the .NET environment, and Java. With one exception, Oracle’s engine implements the syntax and behavior for extended regular expressions (EREs) as described in the POSIX standard. In addition, Oracle adds support for backreferences.
The regular expression syntax and behavior documented in the Open Group Base Specifications Issue 6, IEEE Standard 1003.1, 2003 Edition is the same as that for POSIX. You can view the Open Group specifications at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html
The one exception that stands between Oracle and full POSIX compliance is that Oracle does not attempt to determine the longest possible match for a pattern containing variations, as the standard requires. The following example demonstrates this very well:
SELECT REGEXP_SUBSTR('bbb','b|bb') FROM dual;
bSELECT REGEXP_SUBSTR('bbb','bb|b') FROM dual;
bb
These two statements differ only by the order in which the
alternatives are specified in the regular expression:
b|bb
versus bb|b
. The longest
possible match in either case is 'bb
', and
that’s the match POSIX requires for both cases.
However, Oracle’s regular expression engine takes
the first match it finds, which can be either 'b
'
or 'bb
', depending on the order in which the
alternatives are specified.
Like many regular expression engines, Oracle ignores the “longest possible match” rule, because the overhead of computing all possible permutations and then determining which is the longest can be excessive.
Perl has done a lot to popularize the use of regular expressions, and many regular expression engines (e.g., Java and PHP) follow Perl’s implementation closely. Many readers may have learned regular expressions using Perl or a Perl-like engine, so this brief section highlights the key differences between Perl’s and Oracle’s support for regular expressions.
Regular expressions are often written as string literals. When you move string literals from one language to another, you may encounter issues with the way that each language handles such literals.
For example, Perl enables you to to use x
followed by two hexadecimal digits to embed arbitrary byte codes
within a string. Perl also supports character sequences such as
for the newline (linefeed on Unix) character.
Thus, in Perl, you can write the following regular expression to
search for either a linefeed or a space:
/[ |x20]/
The issue is that this isn’t a regular expression
per se—it’s a Perl string. The backslash
sequences
and x20
have no
meaning to Perl’s regular expression engine, which,
in fact, never sees them. Those sequences are interpreted by Perl
itself. By the time the string gets to Perl’s
regular expression engine,
and
x20
have been replaced by the appropriate byte
codes.
Another issue you may encounter is Perl’s use of the
dollar sign ($
) to dereference a variable within a
string. In Perl, the expression /a$b/
searches for
the letter 'a
' followed by the contents of the
Perl variable named b
. Perl’s
regular expression never sees the '$b
', because
Perl substitutes the value of the variable before it passes the
string to the engine.
Neither SQL nor PL/SQL support the use of and
$
in the way that Perl does. Because Perl and
Oracle differ in their handling of string literals, you may not be
able to take a regular expression developed for Perl and simply drop
it into Oracle. Before attempting to move an expression in the form
of a string literal from Perl to Oracle, make sure that the
“expression”
doesn’t contain any characters that Perl itself
interprets.
Unlike Perl and many database products, Oracle treats an empty string as a NULL value. Thus, the following query, which attempts to match an empty string, brings back no data:
SELECT * FROM park WHERE REGEXP_LIKE(description,''),
In Oracle, the regular expression engine does not see an empty string; rather, it sees a NULL, or the complete absence of an expression with which to do any matching.
Oracle’s regular expression syntax is POSIX-compliant. Perl’s engine supports a number of operators, character classes, and so forth that are not defined as part of the POSIX standard. These are described in Table 1-2. Where possible, we also specify a POSIX equivalent that’s usable in Oracle.
The POSIX equivalents shown in Table 1-2 should work for the default locale (American_America.US7ASCII, with a BINARY sort). However, we have not yet been able to run exhaustive tests.
Perl operator |
Description / Oracle equivalent |
|
Matches any ASCII character. In Oracle, possibly use: ' |
|
A word character, defined as any alphanumeric character, including
underscore: |
|
Embeds arbitrary bytes in a regular expression. In Oracle, use the CHR function, but be aware that Oracle requires an expression to be composed of valid characters as defined by the underlying character set. |
|
Digits: |
|
Non-digits: |
|
Named properties, no POSIX equivalent |
|
Negated named properties, no POSIX equivalent |
|
Whitespace: |
|
Non-whitespace: |
|
Alphanumeric characters: |
|
Non-alphanumeric characters: |
|
Followed by a code point value, |
|
Perl supports a number of zero-width assertions. None are recognized by POSIX. |
Perl does not support the POSIX-standard [= =]
notation for defining an equivalence class. In addition, Perl does
not support the use of [. .]
to specify a
collation element.
Both Perl and Oracle support the POSIX-compliant caret
(^
) as the first character within a bracket
expression to mean all characters except those listed within the
expression. For example, you can write: [^A-Z]
to
match on any character but the uppercase letters.
Perl also supports the use of a caret in conjunction with a character
class name. For example, Perl allows you to write
[[:^digit:]]
to match on any character except for
one in the [:digit:]
class. You can get the same
effect in Oracle using the form: [^[:digit:]]
.
As we described in Section 1.6.7 in
Section 1.6,
quantifiers in a regular expression will match as many characters as
possible. For example, given a source string of
'123456
', the expression [0-9]+
will match the entire string of six digits.
Perl supports the addition of a question mark (?
)
to the end of a quantifier to make it non-greedy, or
lazy, in which case the quantifier matches the
minimum number of characters possible. For example, the expression
[0-9]+?
matches only the first digit of the string
'123456
‘.
The complete list of lazy quantifiers supported by Perl is:
*? , +? , ?? , and {}? |
POSIX, and by extension Oracle, does not support these quantifiers.
Perl supports a mechanism for adding experimental regular expression
features. Such features always take the form
(?...)
, in which the ellipses represent the
feature-specific syntax. Comments within expressions are one of the
so-called experimental features, and you can embed a comment in a
Perl regular expression as follows:
(?#area code)([[:digit:]]{3}[-.]|([[:digit:]]{3})) (?#local number)[[:digit:]]{3}[-.][[:digit:]]{4}
Oracle does not support Perl’s experimental feature syntax.
In a replacement string such as one you might use with
REGEXP_REPLACE, Perl supports the use of a dollar sign
($
) to indicate a backreference. For example, you
can use $1
to refer to the first subexpression.
Oracle supports only the backslash syntax 1
,
2
, and so forth.
POSIX and Perl differ somewhat in how they handle backslash
() characters:
in a bracket-listIn Perl, a in a bracket-list is treated as a
metacharacter. In Oracle, a
in a bracket-list
represents itself.
as the last character of an expressionUse as the last character of a regular
expression in Perl, and you get an error. Do the same thing in
Oracle, and the trailing
is silently ignored.
3.145.161.228