Elements

Let's look more closely at the elements that make up regular expressions.

Regexp Choice

image with no caption

A regexp choice contains one or more regexp sequences. The sequences are separated by the | (vertical bar) character. The choice matches if any of the sequences match. It attempts to match each of the sequences in order. So:

"into".match(/in|int/)

matches the in in into. It wouldn't match int because the match of in was successful.

Regexp Sequence

image with no caption

A regexp sequence contains one or more regexp factors. Each factor can optionally be followed by a quantifier that determines how many times the factor is allowed to appear. If there is no quantifier, then the factor will be matched one time.

Regexp Factor

image with no caption

A regexp factor can be a character, a parenthesized group, a character class, or an escape sequence. All characters are treated literally except for the control characters and the special characters:

 / [ ] ( ) { } ? + * | . ^ $

which must be escaped with a prefix if they are to be matched literally. When in doubt, any special character can be given a prefix to make it literal. The prefix does not make letters or digits literal.

An unescaped . matches any character except a line-ending character.

An unescaped ^ matches the beginning of the text when the lastIndex property is zero. It can also match line-ending characters when the m flag is specified.

An unescaped $ matches the end of the text. It can also match line-ending characters when the m flag is specified.

Regexp Escape

image with no caption

The backslash character indicates escapement in regexp factors as well as in strings, but in regexp factors, it works a little differently.

As in strings, f is the formfeed character, is the newline character, is the carriage return character, is the tab character, and u allows for specifying a Unicode character as a 16-bit hex constant. In regexp factors,  is not the backspace character.

d is the same as [0-9]. It matches a digit. D is the opposite: [^0-9].

s is the same as [f u000Bu0020u00A0u2028u2029]. This is a partial set of Unicode whitespace characters. S is the opposite: [^f u000Bu0020u00A0u2028u2029].

w is the same as [0-9A-Z_a-z]. W is the opposite: [^0-9A-Z_a-z]. This is supposed to represent the characters that appear in words. Unfortunately, the class it defines is useless for working with virtually any real language. If you need to match a class of letters, you must specify your own class.

A simple letter class is [A-Za-zu00C0-u1FFFu2800-uFFFD]. It includes all of Unicode's letters, but it also includes thousands of characters that are not letters. Unicode is large and complex. An exact letter class of the Basic Multilingual Plane is possible, but would be huge and inefficient. JavaScript's regular expressions provide extremely poor support for internationalization.

 was intended to be a word-boundary anchor that would make it easier to match text on word boundaries. Unfortunately, it uses w to find word boundaries, so it is completely useless for multilingual applications. This is not a good part.

1 is a reference to the text that was captured by group 1 so that it can be matched again. For example, you could search text for duplicated words with:

var doubled_words = /([A-Za-zu00C0-u1FFFu2800-uFFFD]+)s+1/gi;

doubled_words looks for occurrences of words (strings containing 1 or more letters) followed by whitespace followed by the same word.

2 is a reference to group 2, 3 is a reference to group 3, and so on.

Regexp Group

image with no caption

There are four kinds of groups:

Capturing

A capturing group is a regexp choice wrapped in parentheses. The characters that match the group will be captured. Every capture group is given a number. The first capturing ( in the regular expression is group 1. The second capturing ( in the regular expression is group 2.

Noncapturing

A noncapturing group has a (?: prefix. A noncapturing group simply matches; it does not capture the matched text. This has the advantage of slight faster performance. Noncapturing groups do not interfere with the numbering of capturing groups.

Positive lookahead

A positive lookahead group has a (?= prefix. It is like a noncapturing group except that after the group matches, the text is rewound to where the group started, effectively matching nothing. This is not a good part.

Negative lookahead

A negative lookahead group has a (?! prefix. It is like a positive lookahead group, except that it matches only if it fails to match. This is not a good part.

Regexp Class

image with no caption

A regexp class is a convenient way of specifying one of a set of characters. For example, if we wanted to match a vowel, we could write (?:a|e|i|o|u), but it is more conveniently written as the class [aeiou].

Classes provide two other conveniences. The first is that ranges of characters can be specified. So, the set of 32 ASCII special characters:

! " # $ % & ' ( ) * +, - . / :
; < = > ? @ [  ] ^ _ ` { | } ˜

could be written as:

(?:!|"|#|$|%|&|'|(|)|*|+|,|-|.|/|:|;|<|=|>|@|[|\|]|^|_|` |{|||}|˜)

but is slightly more nicely written as:

[!-/:-@[-`{-˜]

which includes the characters from ! through / and : through @ and [ through ` and { through ˜. It is still pretty nasty looking.

The other convenience is the complementing of a class. If the first character after the [ is ^, then the class excludes the specified characters.

So [^!-/:-@[-`{-˜] matches any character that is not one of the ASCII special characters.

Regexp Class Escape

image with no caption

The rules of escapement within a character class are slightly different than those for a regexp factor. [] is the backspace character. Here are the special characters that should be escaped in a character class:

- / [  ] ^

Regexp Quantifier

image with no caption

A regexp factor may have a regexp quantifier suffix that determines how many times the factor should match. A number wrapped in curly braces means that the factor should match that many times. So, /www/ matches the same as /w{3}/. {3,6} will match 3, 4, 5, or 6 times. {3,} will match 3 or more times.

? is the same as {0,1}. * is the same as {0,}. + is the same as {1,}.

Matching tends to be greedy, matching as many repetitions as possible up to the limit, if there is one. If the quantifier has an extra ? suffix, then matching tends to be lazy, attempting to match as few repetitions as possible. It is usually best to stick with the greedy matching.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.59.141.241