Appendix A. Regular Expressions

Regular expressions enable one to search for common patterns in text files. Many tools support regular-expression-based search and replace. Although there are small differences in syntax from one tool to the next, most of the syntax and the basic ideas are much the same.

Characters That Match Themselves

The first rule of regular expressions is that most normal characters match themselves. A “normal” character is a letter, a digit, or a space. Thus, the regular expression “Al Gore was elected president in the year 2000” matches exactly the string “Al Gore was elected president in the year 2000” and no others. Searching for that string will find all occurrences of that exact string in the searched document or documents. However, it will not find even slight variations, such as an extra space between the words Al and Gore. Table A.1 shows some more examples.

Table A.1. Characters That Match Themselves

| Pattern | Matches | Example |
|---|---|---|
| Foo | The string Foo | Foo |
| foo | The string foo | foo |
| foo bar | The string foo bar | foo bar |
| 2000 | The string 2000 | 2000 |
| ανερ | The string ανερ | ανερ |
| <!DOCTYPE | The string <!DOCTYPE | <!DOCTYPE |
| <p> | The string <p> | <p> |
| </p> | The string </p> | </p> |

To be specific, the following characters match themselves:

  • All ASCII letters A–Z and a–z

  • The ASCII digits 0–9

  • The space character

  • All non-ASCII characters: é, ç,... and so on

  • The following ASCII punctuation characters:

    • !

    • #

    • %

    • ,

    • /

    • :

    • ;

    • <

    • =

    • >

    • @

    • _

    • `

    • ~

Parentheses vary according to dialect. In some regular-expression dialects, parentheses match themselves, and in some they don’t. In jEdit, parentheses are metacharacters that do not match themselves.

Metacharacters

Other ASCII characters are reserved and must be escaped if you wish to match them. For noncontrol characters, you do this with a backslash. For example, to match a period, you use \. (a backslash followed by a period). This is also called a character representation. To match a question mark, you use \?. To match a left parenthesis, you use \(. To match a right parenthesis, you use \). To match a backslash, you type \\, and so forth. Table A.2 shows some more examples.

Table A.2. Metacharacters

| Pattern | Matches | Example |
|---|---|---|
| <!\[CDATA\[ | The string <![CDATA[ | <![CDATA[ |
| <\?xml-stylesheet | The string <?xml-stylesheet | <?xml-stylesheet |
| \(\) | The string () | () |
| 2\*\*3 | The string 2**3 | 2**3 |
| 2 \+ 2 = 4 | The string 2 + 2 = 4 | 2 + 2 = 4 |
| \$19\.95 | The string $19.95 | $19.95 |
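As a minimal sketch of escaping in practice, the following uses Python's re module, which is an assumption of this example (the appendix itself is tool-neutral and mostly refers to jEdit). The sample text is made up. It compares a hand-escaped pattern for $19.95 with one produced by re.escape, and shows what happens when the backslashes are left out.

```python
import re

# Sample text is made up for illustration.
text = "Sale price: $19.95 (was $29.95)"

# Hand-escaped pattern: the dollar sign and the period each get a backslash.
hand_escaped = re.compile(r"\$19\.95")

# re.escape() adds the backslashes for you when the literal comes from data.
generated = re.compile(re.escape("$19.95"))

hand_matches = hand_escaped.findall(text)
generated_matches = generated.findall(text)

# Without the backslashes, $ means "end of line", so nothing can follow it
# and the pattern never matches in the middle of the text.
unescaped_matches = re.findall(r"$19.95", text)

print(hand_matches, generated_matches, unescaped_matches)
```

Generating the escapes is usually the safer choice when the literal text comes from a file or from user input rather than from your own keyboard.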

There are also character representations for many control characters, such as these seven:

  • \r for carriage return

  • \n for line feed

  • \t for tab

  • \f for form feed

  • \a for alarm

  • \b for backspace (in many dialects only inside a character class, because elsewhere \b marks a word boundary)

  • \e for escape

However, we don’t use these a lot when refactoring HTML because these characters aren’t very important in HTML. Usually what you care about is whether there’s some whitespace, not exactly which character it is. The last four characters are not even legal in XML documents, including XHTML documents. The form feed, \f, is the only one of these that’s even remotely common. It would not be a bad idea to do a quick search for \f. If you find any, inspect the document where it appears to find out what purpose it serves. You can likely replace it with a br element, a p element, or a single space.

Wildcards

So far, we haven’t described anything you couldn’t find with a simple literal search. The power of regular expressions is that you can write strings that match several similar strings. For instance, you can write an expression that matches any start-tag, not just a particular start-tag. To do this, you need wildcards that can stand in for more than one character.

The first such wildcard is the period. It matches any single character except a line break. For example, the regular expression 200. matches 2000, 2001, 200Z, 200!, and many more strings. The regular expression a....b matches any six-character string that begins with a and ends with b, such as abbbbb, aaabbb, aDCEFb, a bc b, and many more. Table A.3 shows more examples.

Table A.3. The Period Wildcard

| Pattern | Matches | Example |
|---|---|---|
| Foo.Bar | Any string beginning with Foo, followed by a single character, followed by Bar, not containing any line breaks | FooZBar, FoozBar, Foo Bar, Foo9Bar |
| .Foo | Any four-character string whose last three letters are Foo | AFoo, fFoo, 9Foo, -Foo, " Foo" (with a leading space) |
| .... | Any four-character string not containing any line breaks | <em>, This, that, I am, Cat., 2008 |
| ..c.. | Any five-character string not containing any line breaks whose middle character is the letter c | Faced, a cat, abcde |

The only characters the period doesn’t match are the carriage return and the line feed. Because HTML does not usually consider line breaks to be significant and tags can extend across multiple lines, this is problematic. Some regular-expression dialects, including Perl’s, allow you to modify this behavior so that the period does match a line break. However, jEdit’s does not.

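To match any character, including a line break, you can use a character class such as [\s\S], which matches anything that is either whitespace or not whitespace. (In most dialects a period inside square brackets is only a literal period, so [.\s] matches just periods and whitespace.) More on this shortly.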

Of course, sometimes a period is just a period. If you want to match a literal period, escape it with a backslash. For example, the regular expression other\. would find sentences that end in the word other.
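The period's behavior around line breaks is easy to check. This is a small sketch in Python's re module (an assumed dialect; the re.DOTALL flag is Python's own switch and does not exist in jEdit), with made-up sample strings.

```python
import re

# The period stands in for any single character on the same line.
years = re.findall(r"200.", "2000 2001 200Z 200! 1999")

# A start-tag split across two lines (made-up sample).
two_lines = "<p\nid='c4'>"

# By default the period stops at a line break, so this finds nothing.
default_match = re.search(r"<p.*>", two_lines)

# re.DOTALL lets the period match line breaks too.
dotall_match = re.search(r"<p.*>", two_lines, re.DOTALL)

# [\s\S] (whitespace or nonwhitespace) matches any character without a flag.
class_match = re.search(r"<p[\s\S]*>", two_lines)

# An escaped period matches only a literal period.
literal = re.findall(r"other\.", "One way or the other. Another otherX")

print(years, default_match, dotall_match.group(0), literal)
```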

Quantifiers

A period or a literal character by itself always matches exactly one character. However, you can append a quantifier to it to indicate that the character may appear a variable number of times.

Zero or One: ?

A normal character suffixed with a question mark indicates that the character appears only optionally (zero times or once). For example, the regular expression a?b matches ab and b. The a is optional. The regular expression a?b?c? matches abc, ab, ac, bc, a, b, and c, as well as the empty string.

You can suffix a period with a question mark to indicate that any character may or may not appear. For example, the regular expression 200.? matches 200 as well as 2000, 2001, 200Z, and 200!.

Zero or More: *

An asterisk (*) suffix indicates that the preceding character appears zero or more times. For example, a*b matches ab, aaab, aaaaab, and b. However, it does not match abb or acb.

You can put an asterisk after a period to indicate that any number of any characters may appear. For example, a.*b matches ab, aaab, aaaaab, abb, acb, a123b, and "a quick brown fox jumped into the tub".

Unlike UNIX shell globs, the asterisk alone does not match anything. It must be suffixed to something else. For example, to list all the HTML files in the current working directory, you’d usually type something such as this:

$ ls *html

However, in most regular-expression dialects, the regular expression that matches all strings ending in html is .*html. *html without the initial period is a syntax error.
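The difference from shell globs can be seen in a few lines. This sketch assumes Python's re module, where a leading asterisk is rejected with re.error; the filenames are hypothetical.

```python
import re

# Hypothetical filenames, for illustration only.
filenames = ["index.html", "about.xhtml", "notes.txt", "html"]

# A glob-style pattern has nothing for the asterisk to repeat,
# so Python refuses to compile it.
try:
    re.compile(r"*html")
    glob_style_compiles = True
except re.error:
    glob_style_compiles = False

# The regular-expression spelling: any characters, then html.
ends_in_html = [name for name in filenames if re.fullmatch(r".*html", name)]

print(glob_style_compiles, ends_in_html)
```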

One or More: +

A plus sign (+) suffix indicates that a character appears one or more times. For example, a+b matches ab, aaab, and aaaaab. However, it does not match a single b, abb, or acb.

Of course, a plus sign after a period indicates that one or more of any character is required. The regular expression a.+b requires at least one character between the a and the b, so it matches aaab, aaaaab, abb, acb, and a123b but not a simple ab.

A Specific Number of Times: {}

You can specify that a character must appear a specified number of times using curly braces. For example, a{3} is the same as the pattern aaa. It stands for exactly three a’s in a row.

You can also specify a range of possible occurrences using a comma. A{3,5} allows three to five As in a row. That is, it matches AAA, AAAA, and AAAAA but not AA or AAAAAA.

You can omit the second, maximum value to indicate that at least a certain number of repetitions is required but more are allowed. For example, a{3,} matches aaa, aaaa, aaaaa, aaaaaa, and any larger sequence of a’s. Table A.4 shows some more examples.

Table A.4. Quantifiers

| Pattern | Matches | Example |
|---|---|---|
| </?p> | A p start tag or end tag | <p>, </p> |
| <br */> | A br start tag containing any number of spaces before the closing /> | <br/>, <br />, <br  /> |
| <p.*> | A complete p start tag, followed by all other text through the last > on the same line | <p>, <p id='c4'>, <p id='c4'>This is text</p> |
| <a+> | One or more a’s, but no other characters, in angle brackets | <a>, <aa>, <aaa>, <aaaa> |
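The quantifier examples in this section can be verified mechanically. The sketch below assumes Python's re module and uses re.fullmatch so that a pattern must account for the whole candidate string, which is how the prose examples are meant. The helper name matching is hypothetical.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

candidates = ["b", "ab", "aaab", "abb", "acb"]

optional = matching(r"a?b", candidates)      # zero or one a, then b
zero_or_more = matching(r"a*b", candidates)  # zero or more a's, then b
one_or_more = matching(r"a+b", candidates)   # one or more a's, then b
any_between = matching(r"a.+b", candidates)  # at least one character between a and b

# Counted repetition: three to five uppercase A's.
counted = matching(r"A{3,5}", ["AA", "AAA", "AAAA", "AAAAA", "AAAAAA"])

# Tolerate any number of spaces before the closing /> of a br tag.
br_tags = re.findall(r"<br */>", "<br/><br /><br  />")

print(optional, zero_or_more, one_or_more, any_between, counted, br_tags)
```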

Class Shorthands

Several backslash sequences match particular types of characters. For example, \d matches any digit: 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. \D matches any character that is not a digit. (Capitalization often reverses the sense of a pattern.) Other such classes include the following.

  • \s

    A whitespace character: space, tab, carriage return, or line feed. This is very important in HTML because usually in HTML these four characters are interchangeable. In many dialects, \s does not match the non-breaking space.

  • \S

    Any nonwhitespace character.

  • \w

    Any word character, that is, any letter or digit but not punctuation marks or spaces. However, the underscore, _, is considered to be a word character in most regular-expression dialects. Whether non-ASCII characters are considered individually or excluded as a group varies according to dialect.

  • \W

    Any nonword character.

  • \d

    Any digit from 0 to 9. In some dialects, this also matches non-ASCII digits such as the Japanese 1 and the Arabic 1.

  • \D

    Any character except 0 to 9; in some dialects, any character that is not a digit.
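Dialect differences in these shorthands are worth testing in whatever tool you use. The following sketch assumes Python 3's re module, where string patterns are Unicode-aware by default and the re.ASCII flag restricts the shorthands to ASCII; other tools, including jEdit, may behave differently. The sample text is made up.

```python
import re

# Sample text is made up for illustration.
text = "Room 101,\tfloor 3\nBuilding_7"

digit_runs = re.findall(r"\d+", text)   # runs of digits
word_runs = re.findall(r"\w+", text)    # runs of word characters; _ counts
fields = re.split(r"\s+", text)         # split on any run of whitespace
non_digit_runs = re.findall(r"\D+", "4x4 tires")

# Dialect check: a non-breaking space and the Arabic-Indic digit one.
nbsp = "\u00a0"
arabic_one = "\u0661"
unicode_space = re.fullmatch(r"\s", nbsp) is not None
ascii_space = re.fullmatch(r"\s", nbsp, re.ASCII) is not None
unicode_digit = re.fullmatch(r"\d", arabic_one) is not None
ascii_digit = re.fullmatch(r"\d", arabic_one, re.ASCII) is not None

print(digit_runs, word_runs, fields, non_digit_runs)
print(unicode_space, ascii_space, unicode_digit, ascii_digit)
```

In this dialect the default \s does match the non-breaking space, so a search that must treat it as ordinary text needs re.ASCII or an explicit class such as [ \t\r\n].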

Character Classes

Bracketed expressions enable you to define your own character classes that match some characters and not others. Just place the characters you want to match inside square brackets.

For example, suppose you want to search for all hexadecimal digits. You can easily enumerate those characters as [0123456789abcdefABCDEF]. This matches any one of those characters. Then, to match any potential hexadecimal number, which will include one or more of these characters, you suffix the brackets with a plus sign to form this regular expression:

[0123456789abcdefABCDEF]+

Of course, this also matches words composed purely of these 22 characters, such as Decaf and fed.

This is simple enough, but enumerating all the characters can be tedious. Sometimes what you want is a range. Use a hyphen between two characters to indicate all characters from one’s ASCII value to the other. For example, [a-z] matches any lowercase letter. [A-Z] matches any uppercase letter.

You can combine ranges. [a-zA-Z] matches any upper- or lowercase letter. For example, we can match hexadecimal numbers a little more simply as:

[0-9a-fA-F]+

You can negate a character set or range by placing a caret, ^, immediately after the opening bracket. For example, [^a-z] matches any character except a lowercase ASCII letter. [^a-zA-Z] matches any character except a lower- or uppercase ASCII letter.

Warning

Ranges are determined by character value, as measured in ASCII or Unicode. This works pretty much as you expect within any one obvious range. However, beware of ranges that cross script, case, or type boundaries, such as [a-Z], [0-F], or [A-Ω]. These almost certainly don’t do what you want or expect.

Table A.5 shows some more examples.

Table A.5. Character Classes (a.k.a. bracketed expressions)

| Pattern | Matches | Example |
|---|---|---|
| </[a-zA-Z1-6]+> | All HTML end-tags | </p>, </TABLE>, </Span>, </foo>, </h2> |
| </[a-z1-6]+> | All XHTML end-tags | </p>, </table>, </span>, </foo> |
| [a-zA-Z]+\s*=\s*"[^">]*" | Double-quoted attributes | id="c1", id = "c1" |
| <p.*> | A complete p start-tag, followed by all other text through the last > on the same line | <p>, <p id='c4'>, <p id='c4'>This is text</p> |
| ([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]+ | Domain name | example.com, www.example.de, server4.nbc.ge.com |
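Here is a brief sketch of character classes at work, assuming Python's re module. The helper name is_hex and the sample strings are made up for illustration.

```python
import re

def is_hex(s):
    """True when s consists only of hexadecimal digits."""
    return re.fullmatch(r"[0-9a-fA-F]+", s) is not None

# Decaf passes because every letter in it is also a hexadecimal digit.
hex_checks = [is_hex(s) for s in ["ff00AA", "7f", "Decaf", "xyz", ""]]

# A negated class: delete everything that is not an ASCII letter.
letters_only = re.sub(r"[^a-zA-Z]", "", "R2-D2 & C-3PO")

# End-tags in any mix of case, including h1 through h6.
end_tags = re.findall(r"</[a-zA-Z1-6]+>", "<h2>Title</h2><P>text</P><br/>")

print(hex_checks, letters_only, end_tags)
```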

Groups and Back References

You can group expressions inside parentheses and then use the repetition operators after the group. For example, suppose you wanted to find all runs of <br> tags. The regular expression (<br>)+ will match <br>, <br><br>, <br><br><br>, and so forth.

You can further combine the expressions. For example, (<br>\s*)+ will match all runs of <br> tags, even if they have whitespace in between them.

Even more powerfully, you can refer back to a group later in the expression. The first parenthesized match is \1. The second is \2, the third \3, and so forth. (If the groups nest, they are counted from the left parenthesis only.) For example, suppose you want to find all simple HTML elements in the form <foo>Blah Blah Blah</foo>. That is, you want to find all the elements without any attributes and that don’t contain any child elements. Furthermore, you really want to find all the elements from the beginning of the start-tag to the end of the end-tag.

We can start with the expression <[a-zA-Z]+> to find the start tags. We can use the expression </[a-zA-Z]+> to find the end-tags. However, we want only those pairs that match. So, first we put parentheses around the start-tag, like so:

<([a-zA-Z]+)>

Then we refer back to that in the end-tag expression as \1, that is, </\1>. If the start-tag was div, the end-tag will be div. If the start-tag was em, the end-tag will be em, and so forth:

<([a-zA-Z]+)></\1>

Finally, we need to put a character class in the middle that excludes less-than signs but allows line breaks. This will avoid nested child elements and some overly greedy matches:

<([a-zA-Z]+)>[^<]*</\1>
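The finished expression can be tried on a small fragment. This sketch assumes Python's re module, which writes the back reference as \1 inside the pattern; the HTML fragment is made up for illustration.

```python
import re

# A start-tag, text with no less-than signs, and an end-tag whose name
# must repeat whatever group 1 captured.
simple_element = re.compile(r"<([a-zA-Z]+)>[^<]*</\1>")

# Made-up fragment: one nested element, one mismatched pair,
# and one element whose text spans a line break.
html = "<div><em>Hello</em> <b>mismatch</i> <p>two\nlines</p></div>"

found = [m.group(0) for m in simple_element.finditer(html)]
names = [m.group(1) for m in simple_element.finditer(html)]

print(found)
print(names)
```

The outer div is skipped because it contains child elements, and the b/i pair is skipped because the end-tag name does not repeat the start-tag name.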

Even more important, you can use the back references \1, \2, and so on in replacement strings. For example, I was recently faced with this list:

<ul>
<li>marquee</li>
<li>basefont</li>
<li>bgsound</li>
<li>keygen</li>
<li>bgsound</li>
<li>spacer</li>
<li>wbr</li>
</ul>

I wanted to put the contents of each list item in a code element. Therefore, I searched for this:

<li>([a-z]+)</li>

I replaced it with this:

<li><code>\1</code></li>

This gave me the following:

<ul>
<li><code>marquee</code></li>
<li><code>basefont</code></li>
<li><code>bgsound</code></li>
<li><code>keygen</code></li>
<li><code>bgsound</code></li>
<li><code>spacer</code></li>
<li><code>wbr</code></li>
</ul>
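The same search and replace can be scripted. This is a sketch using Python's re.sub, an assumption of this example rather than the editor workflow described above; Python accepts \1 in the replacement string, while other tools may spell it differently. The list is shortened to three items.

```python
import re

# A shortened version of the list above.
items = """<ul>
<li>marquee</li>
<li>basefont</li>
<li>wbr</li>
</ul>"""

# \1 in the replacement inserts whatever the parentheses captured.
wrapped = re.sub(r"<li>([a-z]+)</li>", r"<li><code>\1</code></li>", items)

print(wrapped)
```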

As another example, suppose you have a table of species count data organized like this:

<tr> <td> Great Egret </td> <td> 7 </td> </tr>
<tr> <td> Redhead </td> <td> 1 </td> </tr>
<tr> <td> Mallard </td> <td> 56 </td> </tr>
<tr> <td> House Finch </td> <td> 3 </td> </tr>

Now suppose you decide to swap the columns so that the counts go on the left and the species on the right. You search for this:

(<td>.*</td>) (<td>.*</td>)

Then you simply replace it with the following:

\2\1

This very quickly turns the HTML into this:

<tr> <td> 7 </td><td> Great Egret </td> </tr>
<tr> <td> 1 </td><td> Redhead </td> </tr>
<tr> <td> 56 </td><td> Mallard </td> </tr>
<tr> <td> 3 </td><td> House Finch </td> </tr>
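The column swap can be reproduced outside an editor as well. The sketch below assumes Python's re.sub and uses two of the rows from the table above.

```python
import re

rows = """<tr> <td> Great Egret </td> <td> 7 </td> </tr>
<tr> <td> Redhead </td> <td> 1 </td> </tr>"""

# Group 1 is the first cell and group 2 the second; the replacement
# writes them back in the opposite order. The single space between the
# cells is part of the match but not of the replacement, so it disappears.
swapped = re.sub(r"(<td>.*</td>) (<td>.*</td>)", r"\2\1", rows)

print(swapped)
```

This works one row per line because the period does not cross line breaks; rows that share a line would need nongreedy quantifiers or a negated character class.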

Groups and back references are critical anytime you need to chop data apart and put it back together again in a slightly different order.

Whitespace

Matching whitespace is quite tricky but still quite important. Precisely because HTML does not consider whitespace to be hugely significant, it’s important to pay attention to it. Four whitespace characters are likely to appear in HTML documents:

  • The space itself

  • The carriage return, \r

  • The linefeed, \n

  • The tab, \t

The space character has no special representation in regular expressions. To match a space, you simply type a space. Just be careful that you type the right number of spaces, because it won’t usually be obvious if you’re trying to match two where one is called for or vice versa.

\n is particularly tricky. In some dialects, this represents the literal line feed character, ASCII 10. However, in others, including jEdit’s, it means any line break character including carriage return, line feed, and a carriage return-line feed pair. Finally, in still other dialects, it means the platform’s native line-terminating character. Thus, it can match a carriage return on the Mac, a line feed on UNIX, and a carriage return line feed pair on Windows.

This is quite troublesome for working with HTML because HTML documents are not platform-bound. You are likely to find all three line-ending conventions in your document collection, sometimes even in the same file. Consequently, we usually do one of several things instead:

  • Use (\r\n|\r|\n) to match all line breaks, regardless of type.

  • Use \s to match all whitespace, line breaks or otherwise.

  • Use ^ and $ to anchor the pattern to the beginning and/or end of a line.

Line breaks are usually not significant in HTML, so more often than not we use the second option.

You may encounter documents that include other characters such as a form feed or a vertical tab. These have no defined meaning in HTML and should usually be replaced with a single space.
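As a sketch of the first two options, the following assumes Python's re module, where \r and \n are the literal carriage return and line feed. The sample strings are made up.

```python
import re

# One string using all three line-ending conventions.
mixed = "one\r\ntwo\rthree\nfour"

# Try \r\n first so a Windows line ending counts as one break, not two.
line_break = re.compile(r"\r\n|\r|\n")

lines = line_break.split(mixed)
normalized = line_break.sub("\n", mixed)

# \s+ treats any run of spaces, tabs, and line breaks as one separator.
collapsed = re.sub(r"\s+", " ", "<p  class='x'\r\n\tid='y'>")

print(lines, repr(normalized), collapsed)
```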

Alternation: |

The vertical bar, |, allows you to choose between two possible values. For example, suppose you want to search for all years in the twentieth or twenty-first century: 1904, 1952, 1999, 2001, 2059, and so on. The basic rule is that the first two characters must be either 19 or 20. The second two characters must be digits. 19\d\d matches all years in the twentieth century.[1] 20\d\d matches all years in the twenty-first century. (19\d\d)|(20\d\d) matches both sets of years. We could also write this as (19|20)\d\d, that is, either 19 or 20 followed by two digits.
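The year example translates directly into a short test. This sketch assumes Python's re module; the list of years is made up, with two values outside the target centuries.

```python
import re

text = "1904 1952 1999 2001 2059 1850 2100"

year = re.compile(r"(19|20)\d\d")

# In Python, findall() would return only the captured group ("19" or "20"),
# so collect the whole match from each match object instead.
years = [m.group(0) for m in year.finditer(text)]

# The longer spelling finds the same years.
years_long = [m.group(0) for m in re.finditer(r"(19\d\d)|(20\d\d)", text)]

print(years)
```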

Alternation is also important for matching HTML tags. For example, suppose you want a general expression for matching all start tags. The problem you run into is that there are three ways an attribute can appear, and each has its own regular expression:

  • name=value

    [a-zA-Z]+\s*=\s*[^\s'">]+

  • name="value"

    [a-zA-Z]+\s*=\s*"[^">]*"

  • name='value'

    [a-zA-Z]+\s*=\s*'[^'>]*'

We can combine these regular expressions with an alternation, like so:

 <[a-zA-Z]+\s*(([a-zA-Z]+\s*=\s*[^\s'">]+|[a-zA-Z]+\s*=\s*"[^">]*"|[a-zA-Z]+\s*=\s*'[^'>]*')\s*)*>

This finds all single-quoted, double-quoted, and nonquoted attributes. (It also finds name=value parameters in URL query strings, which was not intended.)
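Long alternations are easier to read when assembled from named parts. The sketch below assumes Python's re module. The variable names are hypothetical, and the start_tag pattern places \s* after each attribute so that several whitespace-separated attributes can follow one another; treat that placement as an assumption of this sketch. It also uses (?:...), Python's notation for a group that is not numbered.

```python
import re

# The three attribute forms from the list above, sharing a name= prefix.
name_eq = r"[a-zA-Z]+\s*=\s*"
unquoted = name_eq + r'''[^\s'">]+'''
double_quoted = name_eq + r'"[^">]*"'
single_quoted = name_eq + r"'[^'>]*'"

# Join the alternatives with the vertical bar.
attribute = "|".join([unquoted, double_quoted, single_quoted])

tag = """<p id="c1" class='note' lang=en>"""
attributes = re.findall(attribute, tag)

# A start-tag: a name, then any number of attributes, each optionally
# followed by whitespace.
start_tag = re.compile(r"<[a-zA-Z]+\s*(?:(?:" + attribute + r")\s*)*>")
html = "Intro " + tag + "text<br> more</p>"
start_tags = [m.group(0) for m in start_tag.finditer(html)]

print(attributes)
print(start_tags)
```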

Greedy and Nongreedy Matches

By default, all matches are greedy. That is, they match the maximum length of text they can get away with. For example, suppose you have the following paragraph:

<p>
<q id='g1'>Take your seats,</q> said the guard. <q id='g2'>Going
by the train, sir?</q>
</p>

Now suppose you want to match all the q start tags, and consequently you use the regular expression <q.*>. In fact, this will find:

<q id='g1'>Take your seats,</q> said the guard. <q id='g2'>

The regular expression <q.*> matches everything from the first <q on the line to the last >. The only reason it stops there is that the period does not match line breaks. The match is said to be greedy.

You specify a nongreedy match that stops at the first opportunity by putting a question mark after the quantifier. Thus, if I had written the regular expression as <q.*?>, it would have stopped with the start-tag <q id='g1'>.

You can also use nongreedy matches with the other quantifiers, such as ? and +. For example, a+? will match at least one a, but then it will stop if it can. However, if this is part of a larger pattern, such as a*?b or a+?b, it will match as many a’s as it needs to get to the first b.
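The greedy and nongreedy behaviors can be compared side by side. This sketch assumes Python's re module and uses the first line of the sample paragraph; the negated character class at the end is a common alternative, not something the text above prescribes.

```python
import re

line = "<q id='g1'>Take your seats,</q> said the guard. <q id='g2'>Going"

# Greedy: runs from the first <q to the last > on the line.
greedy = re.search(r"<q.*>", line).group(0)

# Nongreedy: stops at the first > it can.
lazy = re.search(r"<q.*?>", line).group(0)
all_start_tags = re.findall(r"<q.*?>", line)

# A negated class gets the same result without a nongreedy quantifier.
negated = re.findall(r"<q[^>]*>", line)

# a+? takes one a when nothing forces more, but stretches to reach the b.
shortest = re.match(r"a+?", "aaab").group(0)
stretched = re.match(r"a+?b", "aaab").group(0)

print(greedy)
print(lazy, all_start_tags, shortest, stretched)
```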

Position

Several metacharacters anchor the regular expression to a particular location in the document without actually matching anything themselves. These include

  • ^

    The beginning of a line.

  • $

    The end of a line.

  • \b

    A word boundary, such as the position next to a space or line break.

  • \B

    Any location that is not a word boundary.

  • \A

    The beginning of the document.

  • \z

    The end of the document.

  • \Z

    The end of the document, unless the document ends with a line break. In this case, it is the position immediately before the final line break.

Because HTML is not very line-oriented, we tend not to use ^ and $ very much. However, \b and \B are quite useful, and \A, \Z, and \z sometimes are, too. For example, \bcat\b matches the word cat but does not match inside the words category, catheter, or abdicate. (Some GUI tools, including BBEdit but not jEdit, give you an option to only match entire words. This is essentially the same as putting \b before and after your expression.) Other possible uses include

  • \A\s*(<html|<HTML)

    Find all documents that start with <html or <HTML and thus don’t have a DOCTYPE declaration or a byte order mark.

  • \A\s*(<body|<BODY)

    Find all documents that start with <body or <BODY and thus don’t have a proper html root element.

  • </[hH][tT][mM][lL]>\s*\z

    Find all documents that end with </html> in various combinations of case, optionally followed by whitespace.
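These anchors can be sketched in Python's re module, with one dialect caveat: Python spells the absolute end of the text \Z, and older Python versions reject \z, so this example uses \Z where the patterns above use \z. The function names and sample documents are hypothetical.

```python
import re

words = "cat category catheter abdicate cat."

# \b on both sides: only the standalone word cat.
whole_words = re.findall(r"\bcat\b", words)

# \B on both sides: only cat buried inside a longer word (abdicate).
inside_words = re.findall(r"\Bcat\B", words)

def starts_with_html(doc):
    """True when the first non-whitespace text in doc is <html or <HTML."""
    return re.search(r"\A\s*(<html|<HTML)", doc) is not None

def ends_with_html(doc):
    """True when doc ends with </html> in any case, plus optional whitespace."""
    # Python's \Z is the absolute end of the string.
    return re.search(r"</[hH][tT][mM][lL]>\s*\Z", doc) is not None

print(whole_words, inside_words)
print(starts_with_html("\n<html><body/></html>\n"))
print(ends_with_html("<html></html><!-- trailing -->"))
```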

Table A.6 summarizes all of these patterns.

Table A.6. Regular-Expression Syntax

| Pattern | Matches |
|---|---|
| . | Any one character except a line break |
| ^ | Beginning of line |
| $ | End of line |
| c* | Zero or more c’s |
| c+ | One or more c’s |
| c? | Zero or one c |
| c*? | Zero or more c’s, as few as possible |
| c+? | One or more c’s, as few as possible |
| c?? | Zero or one c, as few as possible |
| c{count} | Exactly count c’s |
| c{count,} | At least count c’s |
| c{min,max} | At least min c’s and at most max c’s |
| [abc] | Any one of the characters between the brackets |
| [^abc] | Any one of the characters not between the brackets |
| [a-z] | Any one of the characters from a–z |
| [a-zA-Z] | Any one of the characters from a–z or A–Z |
| \A | Beginning of document |
| \z | End of document |
| \Z | End of document, but before trailing line break, if any |
| \b | Boundary of a word, that is, the beginning or end of a word |
| \B | Not the boundary of a word |
| \s | Any whitespace character (space, tab, carriage return, line feed) |
| \S | Any nonwhitespace character |
| \w | Any word character (letters, digits, and the underscore) |
| \W | Any nonword character |
| \d | Any digit (0–9) |
| \D | Any nondigit |
| (abc) | The characters a, b, and c in that order |
| \1, \2, ... | First matched pattern, second matched pattern, ... |

Note

For more information on regular expressions, including lots more examples, some advanced features I haven’t gone into here, and details about dialect variations, I recommend Mastering Regular Expressions, 3rd Edition, by Jeffrey E. F. Friedl (O’Reilly, 2006).



[1] Pedants beware: Because there was no year 0, 1900 is really in the nineteenth century and 2000 is the twentieth, but I’m going to ignore that.
