Regex Metacharacters, Modes, and Constructs

The metacharacters and metasequences shown here represent most available types of regular expression constructs and their most common syntax. However, syntax and availability vary by implementation.

Character representations

Many implementations provide shortcuts to represent some characters that may be difficult to input. (See MRE 114-117.)

Character shorthands

Most implementations have specific shorthands for the alert, backspace, escape character, form feed, newline, carriage return, horizontal tab, and vertical tab characters. For example, is often a shorthand for the newline character, which is usually LF (012 octal) but can sometimes be CR (15 octal) depending on the operating system. Confusingly, many implementations use  to mean both backspace and word boundary (between a “word” character and a non-word character). For these implementations,  means backspace in a character class (a set of possible characters to match in the string) and word boundary elsewhere.

Octal escape: um

Represents a character corresponding to a two- or three- octal digit number. For example, 1512 matches an ASCII CR/LF sequence.

Hex and Unicode escapes: xnum, x{num}, unum, Unum

Represents a character corresponding to a hexadecimal number. Four-digit and larger hex numbers can represent the range of Unicode characters. For example, x0Dx0A matches an ASCII CR/LF sequence.

Control characters: cchar

Corresponds to ASCII control characters encoded with values less than 32. To be safe, always use an uppercase char—some implementations do not handle lowercase representations. For example, cH matches Control-H, an ASCII backspace character.

Character classes and class-like constructs

Character classes are ways to define or specify a set of characters. A character class matches one character in the input string that is within the defined set. (See MRE 117-127.)

Normal classes: [...] and [^...]

Character classes, [...], and negated character classes, [^...], allow you to list the characters that you do or do not want to match. A character class always matches one character. The - (dash) indicates a range of characters. To include the dash in the list of characters, list it first or escape it. For example, [a-z] matches any lowercase ASCII letter.

Almost any character: dot (.)

Usually matches any character except a newline. The match mode can often be changed so that dot also matches newlines.

Class shorthands: w, d, s, W, D, S

Commonly provided shorthands for word character, digit, and space character classes. A word character is often all ASCII alphanumeric characters plus the underscore. However, the list of alphanumerics can include additional locale or Unicode alphanumerics, depending on the implementation. For example, d matches a single digit character and is usually equivalent to [0-9].

POSIX character class: [:alnum:]

POSIX defines several character classes that can be used only within regular expression character classes (see Table 1-1). For example, [:lower:], when written as [[:lower:]], is equivalent to [a-z] in the ASCII locale.

Unicode properties, scripts, and blocks: p{prop}, P{prop}

The Unicode standard defines classes of characters that have a particular property, belong to a script, or exist within a block. Properties are characteristics such as being a letter or a number (see Table 1-2). Scripts are systems of writing, such as Hebrew, Latin, or Han. Blocks are ranges of characters on the Unicode character map. Some implementations require that Unicode properties be prefixed with Is or In. For example, p{Ll} matches lowercase letters in any Unicode supported language, such as a or α.

Unicode combining character sequence: X

Matches a Unicode base character followed by any number of Unicode combining characters. This is a shorthand for P{M}p{M}. For example, X matches è as well as the two characters e‘.

Table 1-1. POSIX character classes

Class

Meaning

alnum

Letters and digits.

alpha

Letters.

blank

Space or tab only.

cntrl

Control characters.

digit

Decimal digits.

graph

Printing characters, excluding space.

lower

Lowercase letters.

print

Printing characters, including space.

punct

Printing characters, excluding letters and digits.

space

Whitespace.

upper

Uppercase letters.

xdigit

Hexadecimal digits.

Table 1-2. Standard Unicode properties (continued)

Property

Meaning

p{L}

Letters.

p{Ll}

Lowercase letters.

p{Lm}

Modifier letters.

p{Lo}

Letters, other. These have no case and are not considered modifiers.

p{Lt}

Titlecase letters.

p{Lu}

Uppercase letters.

p{C}

Control codes and characters not in other categories.

p{Cc}

ASCII and Latin-1 control characters.

p{Cf}

Non-visible formatting characters.

p{Cn}

Unassigned code points.

p{Co}

Private use, such as company logos.

p{Cs}

Surrogates.

p{M}

Marks meant to combine with base characters, such as accent marks.

p{Mc}

Modification characters that take up their own space. Examples include “vowel signs.”

p{Me}

Marks that enclose other characters, such as circles, squares, and diamonds.

p{Mn}

Characters that modify other characters, such as accents and umlauts.

p{N}

Numeric characters.

p{Nd}

Decimal digits in various scripts.

p{Nl}

Letters that are numbers, such as Roman numerals.

p{No}

Superscripts, symbols, or non-digit characters representing numbers.

p{P}

Punctuation.

p{Pc}

Connecting punctuation, such as an underscore.

p{Pd}

Dashes and hyphens.

p{Pe}

Closing punctuation complementing p{Ps}.

p{Pi}

Initial punctuation, such as opening quotes.

p{Pf}

Final punctuation, such as closing quotes.

p{Po}

Other punctuation marks.

p{Ps}

Opening punctuation, such as opening parentheses.

p{S}

Symbols.

p{Sc}

Currency.

p{Sk}

Combining characters represented as individual characters.

p{Sm}

Math symbols.

p{So}

Other symbols.

p{Z}

Separating characters with no visual representation.

p{Zl}

Line separators.

p{Zp}

Paragraph separators.

p{Zs}

Space characters.

Anchors and zero-width assertions

Anchors and “zero-width assertions” match positions in the input string. (See MRE 127-133.)

Start of line/string: ^, A

Matches at the beginning of the text being searched. In multiline mode, ^ matches after any newline. Some implementations support A, which only matches at the beginning of the text.

End of line/string: $, , z

$ matches at the end of a string. Some implementations also allow $ to match before a string-ending newline. If modified by multiline mode, $ matches before any newline as well. When supported,  matches the end of string or before a string-ending newline, regardless of match mode. Some implementations also provide z, which only matches the end of the string, regardless of newlines.

Start of match: G

In iterative matching, G matches the position where the previous match ended. Often, this spot is reset to the beginning of a string on a failed match.

Word boundary: , B, <, >

Word boundary metacharacters match a location where a word character is next to a non-word character.  often specifies a word boundary location, and B often specifies a not-word-boundary location. Some implementations provide separate metasequences for start- and end-of-word boundaries, often < and >.

Lookahead: (?=...), (?!...)
Lookbehind: (?<=...), (?<!...)

Lookaround constructs match a location in the text where the subpattern would match (lookahead), would not match (negative lookahead), would have finished matching (lookbehind), or would not have finished matching (negative lookbehind). For example, foo(?=bar) matches foo in foobar but not food. Implementations often limit lookbehind constructs to subpatterns with a predetermined length.

Comments and mode modifiers

Mode modifiers are a way to change how the regular expression engine interprets a regular expression. (See MRE 109-112, 133-135.)

Multiline mode: m

Changes the behavior of ^ and $ to match next to newlines within the input string.

Single-line mode: s

Changes the behavior of . (dot) to match all characters, including newlines, within the input string.

Case-insensitive mode: i

Treat as identical letters that differ only in case.

Free-spacing mode: x

Allows for whitespace and comments within a regular expression. The whitespace and comments (starting with # and extending to the end of the line) are ignored by the regular expression engine.

Mode modifiers: (?i), (?-i), (?mod:...)

Usually, mode modifiers may be set within a regular expression with (?mod) to turn modes on for the rest of the current subexpression; (?-mod) to turn modes off for the rest of the current subexpression; and (?mod:...) to turn modes on or off between the colon and the closing parentheses. For example, "use (?i:perl)" matches “use perl”, “use Perl”, “use PeRl”, etc.

Comments: (?#...) and #

In free-spacing mode, # indicates that the rest of the line is a comment. When supported, the comment span (?#...) can be embedded anywhere in a regular expression, regardless of mode. For example, .{0,80}(?#Field limit is 80 chars) allows you to make notes about why you wrote .{0,80}.

Literal-text span: Q...E

Escapes metacharacters between Q and E. For example, Q(.*)E is the same as (.*).

Grouping, capturing, conditionals, and control

This section covers the syntax for grouping subpatterns, capturing submatches, conditional submatches, and quantifying the number of times a subpattern matches. (See MRE 135-140.)

Capturing and grouping parentheses: (...) and 1, 2, ...

Parentheses perform two functions: grouping and capturing. Text matched by the subpattern within parentheses is captured for later use. Capturing parentheses are numbered by counting their opening parentheses from the left. If backreferences are available, the submatch can be referred to later in the same match with 1, 2, etc. The captured text is made available after a match by implementation-specific methods. For example, (w+)s+1 matches duplicate words, such as the the.

Grouping-only parentheses: (?:...)

Groups a subexpression, possibly for alternation or quantifiers, but does not capture the submatch. This is useful for efficiency and reusability. For example, (?:foobar) matches foobar, but does not save the match to a capture group.

Named capture: (?<name>...)

Performs capturing and grouping, with captured text later referenced by name. For example, Subject:(?<subject>.*) captures the text following Subject: to a capture group that can be referenced by the name subject.

Atomic grouping: (?>...)

Text matched within the group is never backtracked into, even if this leads to a match failure. For example, (?>[ab]*)ww matches aabbcc but not aabbaa.

Alternation: ...|...

Allows several subexpressions to be tested. Alternation’s low precedence sometimes causes subexpressions to be longer than intended, so use parentheses to specifically group what you want alternated. For example, (foo|bar) matches either of the words foo or bar.

Conditional: (?if then | else)

The if is implementation dependent, but generally is a reference to a captured subexpression or a lookaround. The then and else parts are both regular expression patterns. If the if part is true, the then is applied. Otherwise, else is applied. For example, (<)?foo(?(1)>|bar) matches <foo> and foobar.

Greedy quantifiers: *, +, ?, {num,num }

The greedy quantifiers determine how many times a construct may be applied. They attempt to match as many times as possible, but will backtrack and give up matches if necessary for the success of the overall match. For example, (ab)+ matches all of ababababab.

Lazy quantifiers: *?, +?, ??, {num,num }?

Lazy quantifiers control how many times a construct may be applied. However, unlike greedy quantifiers, they attempt to match as few times as possible. For example, (an)+? matches only an of banana.

Possessive Quantifiers: *+, ++, ?+, {num,num }+

Possessive quantifiers are like greedy quantifiers, except that they “lock in” their match, disallowing later backtracking to break up the sub-match. For example, (ab)++ab will not match ababababab.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.15.160.43