Chapter 9. String Manipulation

In Chapter 4 you looked at the String object, which is one of the native objects that JavaScript makes available to you. You saw a number of its properties and methods, including the following:

length — The length of the string in characters
charAt() and charCodeAt() — The methods for returning the character or character code at a certain position in the string
indexOf() and lastIndexOf() — The methods that allow you to search a string for the existence of another string and that return the character position of the string if found
substr() and substring() — The methods that return just a portion of a string
toUpperCase() and toLowerCase() — The methods that return a string converted to upper- or lowercase

In this chapter you'll look at four new methods of the String object, namely split(), match(), replace(), and search(). The last three, in particular, give you some very powerful text-manipulation functionality. However, to make full use of this functionality, you need to learn about a slightly more complex subject.

The methods split(), match(), replace(), and search() can all make use of regular expressions, something JavaScript wraps up in an object called the RegExp object. Regular expressions enable you to define a pattern of characters, which can be used for text searching or replacement. Say, for example, that you have a string in which you want to replace all single quotes enclosing text with double quotes. This may seem easy — just search the string for ' and replace it with " — but what if the string is Bob O'Hara said "Hello"? You would not want to replace the single-quote character in O'Hara. You can perform this text replacement without regular expressions, but it would take more than the two lines of code needed if you do use regular expressions.

Although split(), match(), replace(), and search() are at their most powerful with regular expressions, they can also be used with just plain text. You'll take a look at how they work in this simpler context first, to become familiar with the methods.

Additional String Methods

In this section you will take a look at the split(), replace(), search(), and match() methods, and see how they work without regular expressions.

The split() Method

The String object's split() method splits a single string into an array of substrings. Where the string is split is determined by the separation parameter that you pass to the method. This parameter is simply a character or text string.

For example, to split the string "A,B,C" so that you have an array populated with the letters between the commas, the code would be as follows:

var myString = "A,B,C";
var myTextArray = myString.split(',');

JavaScript creates an array with three elements. In the first element it puts everything from the start of the string myString up to the first comma. In the second element it puts everything from after the first comma to before the second comma. Finally, in the third element it puts everything from after the second comma to the end of the string. So, your array myTextArray will look like this:

A

B

C

If, however, your string were "A,B,C," JavaScript would split it into four elements, the last element containing everything from the last comma to the end of the string; in other words, the last string would be an empty string.

A

B

C

(an empty string)

This is something that can catch you off guard if you're not aware of it.
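The trailing-separator behavior described above can be checked with a short sketch. It assumes you run it in a JavaScript console, so console.log() stands in for the alert() calls used elsewhere in the chapter, and the variable names are only illustrative.

```javascript
var withoutTrailingComma = "A,B,C".split(",");
var withTrailingComma = "A,B,C,".split(",");

console.log(withoutTrailingComma.length);   // 3
console.log(withTrailingComma.length);      // 4

// The extra element is an empty string, not undefined
console.log(withTrailingComma[3] === "");   // true
```

If the empty final element is unwanted, one option is to check the last element and discard it before looping over the array.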

Try It Out: Reversing the Order of Text

Let's create a short example using the split() method, in which you reverse the lines written in a <textarea> element.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example 1</title>
<script language="JavaScript" type="text/javascript">
function splitAndReverseText(textAreaControl)
{

   var textToSplit = textAreaControl.value;
   var textArray = textToSplit.split('\n');
   var numberOfParts = 0;
   numberOfParts = textArray.length;
   var reversedString = "";
   var indexCount;
   for (indexCount = numberOfParts - 1; indexCount >= 0; indexCount--)
   {
      reversedString = reversedString + textArray[indexCount];
      if (indexCount > 0)
      {
         reversedString = reversedString + "\n";
      }
   }

   textAreaControl.value = reversedString;
}
</script>
</head>
<body>
<form name="form1">
<textarea rows="20" cols="40" name="textarea1" wrap="soft">Line 1
Line 2
Line 3
Line 4</textarea>
<br />
<input type="button" value="Reverse Line Order" name="buttonSplit"
   onclick="splitAndReverseText(document.form1.textarea1)" />
</form>
</body>
</html>

Save this as ch9_examp1.htm and load it into your browser. You should see the screen shown in Figure 9-1.

Figure 9-1

Clicking the Reverse Line Order button reverses the order of the lines, as shown in Figure 9-2.

Figure 9-2

Try changing the lines within the text area to test it further.

Although this example works in Internet Explorer (IE) as it is, an extra line gets inserted. If this troubles you, you can fix it by replacing each instance of \n with \r\n for IE.

The key to how this code works is the function splitAndReverseText(). This function is defined in the script block in the head of the page and is connected to the onclick event handler of the button further down the page.

<input type="button" value="Reverse Line Order" name="buttonSplit"
   onclick="splitAndReverseText(document.form1.textarea1)">

As you can see, you pass a reference of the text area that you want to reverse as a parameter to the function. By doing it this way, rather than just using a reference to the element itself inside the function, you make the function more generic, so you can use it with any textarea element.

Now, on with the function. You start by assigning the value of the text inside the textarea element to the textToSplit variable. You then split that string into an array of lines of text using the split() method of the String object and put the resulting array inside the textArray variable.

function splitAndReverseText(textAreaControl)
{
   var textToSplit = textAreaControl.value;
   var textArray = textToSplit.split('\n');

So what do you use as the separator to pass as a parameter for the split() method? Recall from Chapter 2 that the escape character \n is used for a new line. Another point to add to the confusion is that IE seems to need \r\n rather than \n.

You next define and initialize three more variables.

var numberOfParts = 0;
   numberOfParts = textArray.length;
   var reversedString = "";
   var indexCount;

Now that you have your array of strings, you next want to reverse them. You do this by building up a new string, adding each string from the array, starting with the last and working toward the first. You do this in the for loop, where instead of starting at 0 and working up as you usually do, you start at a number greater than 0 and decrement until you reach 0, at which point you stop looping.

for (indexCount = numberOfParts - 1; indexCount >= 0; indexCount--)
   {
      reversedString = reversedString + textArray[indexCount];
      if (indexCount > 0)
      {
         reversedString = reversedString + "\n";
      }
   }

Finally, you assign the text in the textarea element to the new string you've built.

textAreaControl.value = reversedString;
}
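To experiment with the reversing logic outside a web page, the loop can be lifted into a function that takes and returns a plain string. The sketch below is an illustration only: reverseLines() and reverseLinesShort() are hypothetical names, console.log() replaces the form controls, and the second function swaps the hand-written loop for the Array object's reverse() and join() methods.

```javascript
// Same loop as splitAndReverseText(), but working on a plain string
function reverseLines(text)
{
   var textArray = text.split("\n");
   var reversedString = "";
   var indexCount;
   for (indexCount = textArray.length - 1; indexCount >= 0; indexCount--)
   {
      reversedString = reversedString + textArray[indexCount];
      if (indexCount > 0)
      {
         reversedString = reversedString + "\n";
      }
   }
   return reversedString;
}

// Shorter alternative: split into lines, reverse the array, join it again
function reverseLinesShort(text)
{
   return text.split("\n").reverse().join("\n");
}

var sampleText = "Line 1\nLine 2\nLine 3\nLine 4";
console.log(reverseLines(sampleText));
console.log(reverseLines(sampleText) === reverseLinesShort(sampleText));  // true
```

Both versions assume that lines are separated by \n alone; text that uses \r\n would need the separator changed accordingly.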

After you've looked at regular expressions, you'll revisit the split() method.

The replace() Method

The replace() method searches a string for a substring. Where it finds a match, it replaces that substring with a replacement string that you specify. When you pass plain text as the search value, only the first occurrence is replaced.

Let's look at an example. Say you have a string with the word May in it, as shown in the following:

var myString = "The event will be in May, the 21st of June";

Now, say you want to replace May with June. You can use the replace() method like so:

var myCleanedUpString = myString.replace("May","June");

The value of myString will not be changed. Instead, the replace() method returns the value of myString but with May replaced with June. You assign this returned string to the variable myCleanedUpString, which will contain the corrected text.

"The event will be in June, the 21st of June"

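The following sketch, intended for a JavaScript console (console.log() is used here in place of alert()), shows that the original string is left untouched and that a plain-text search value replaces only the first occurrence. The variable repeatedMonths is a made-up example value.

```javascript
var myString = "The event will be in May, the 21st of June";
var myCleanedUpString = myString.replace("May", "June");

console.log(myCleanedUpString);  // The event will be in June, the 21st of June
console.log(myString);           // The event will be in May, the 21st of June

// With plain text as the search value, only the first occurrence changes
var repeatedMonths = "May, May, May";
console.log(repeatedMonths.replace("May", "June"));  // June, May, May
```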
The search() Method

The search() method enables you to search a string for a particular piece of text. If the text is found, the character position at which it was found is returned; otherwise −1 is returned. The method takes only one parameter, namely the text you want to search for.

When used with plain text, the search() method provides no real benefit over methods like indexOf(), which you've already seen. However, you'll see later that it's when you use regular expressions that the power of this method becomes apparent.

In the following example, you want to find out if the word Java is contained within the string called myString.

var myString = "Beginning JavaScript, Beginning Java, Professional JavaScript";
alert(myString.search("Java"));

The alert box that occurs will show the value 10, which is the character position of the J in the first occurrence of Java, as part of the word JavaScript.
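As a quick comparison, the sketch below runs search() and indexOf() side by side in a console (console.log() is an assumption here, replacing alert()). It also shows one difference worth knowing: search() treats its argument as a regular expression pattern, so a character such as the period has a special meaning. The string dottedText is a made-up example value.

```javascript
var myString = "Beginning JavaScript, Beginning Java, Professional JavaScript";

console.log(myString.search("Java"));     // 10
console.log(myString.indexOf("Java"));    // 10
console.log(myString.search("Python"));   // -1

// search() interprets "." as "any character", indexOf() as a literal period
var dottedText = "a.b";
console.log(dottedText.search("."));      // 0
console.log(dottedText.indexOf("."));     // 1
```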

The match() Method

The match() method is very similar to the search() method, except that instead of returning the position at which a match was found, it returns an array. Each element of the array contains the text of each match that is found.

Although you can use plain text with the match() method, it would be completely pointless to do so. For example, take a look at the following:

var myString = "1997, 1998, 1999, 2000, 2000, 2001, 2002";
var myMatchArray = myString.match("2000");
alert(myMatchArray.length);

This code results in myMatchArray holding a single element containing the value 2000, so the alert box shows 1. Given that you already know your search string is 2000, you can see it's been a pretty pointless exercise.

However, the match() method makes a lot more sense when you use it with regular expressions. Then you might search for all years in the twenty-first century — that is, those beginning with 2. In this case, your array would contain the values 2000, 2000, 2001, and 2002, which is much more useful information!
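The sketch below contrasts the two cases in a console (console.log() is assumed in place of alert()). The regular expression /2\d\d\d/g is only a preview of syntax explained later in the chapter: a 2 followed by three digits, matched globally. It is one possible pattern for this task, not necessarily the one the chapter goes on to use.

```javascript
var myString = "1997, 1998, 1999, 2000, 2000, 2001, 2002";

// Plain text: a single match is returned
var plainMatch = myString.match("2000");
console.log(plainMatch.length);   // 1
console.log(plainMatch[0]);       // 2000

// When nothing matches, match() returns null rather than an empty array
console.log(myString.match("1066"));   // null

// Regular expression preview: every year beginning with 2
var yearsFrom2000 = myString.match(/2\d\d\d/g);
console.log(yearsFrom2000.join(" "));  // 2000 2000 2001 2002
```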

Regular Expressions

Before you look at the split(), match(), search(), and replace() methods of the String object again, you need to look at regular expressions and the RegExp object. Regular expressions provide a means of defining a pattern of characters, which you can then use to split, search for, or replace characters in a string when they fit the defined pattern.

JavaScript's regular expression syntax borrows heavily from the regular expression syntax of Perl, another scripting language. The latest versions of languages, such as VBScript, have also incorporated regular expressions, as do lots of applications, such as Microsoft Word, in which the Find facility allows regular expressions to be used. The same is true for Dreamweaver. You'll find that your regular expression knowledge will prove useful even outside JavaScript.

Regular expressions in JavaScript are used through the RegExp object, which is a native JavaScript object, as are String, Array, and so on. There are two ways of creating a new RegExp object. The easier is with a regular expression literal, such as the following:

var myRegExp = /\b'|'\b/;

The forward slashes (/) mark the start and end of the regular expression. This is a special syntax that tells JavaScript that the code is a regular expression, much as quote marks define a string's start and end. Don't worry about the actual expression's syntax yet (the \b'|'\b); that will be explained in detail shortly.

Alternatively, you could use the RegExp object's constructor function RegExp() and type the following:

var myRegExp = new RegExp("\\b'|'\\b");

Either way of specifying a regular expression is fine, though the former method is a shorter, more efficient one for JavaScript to use and therefore is generally preferred. For much of the remainder of the chapter, you'll use the first method. The main reason for using the second method is that it allows the regular expression to be determined at runtime (as the code is executing and not when you are writing the code). This is useful if, for example, you want to base the regular expression on user input.

Once you get familiar with regular expressions, you will come back to the second way of defining them, using the RegExp() constructor. As you can see, the syntax of regular expressions is slightly different with the second method, so we'll return to this subject later.
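The difference in syntax comes from string escaping: inside a string literal, \b on its own means a backspace character, so the backslash has to be doubled before it reaches the RegExp() constructor. The console sketch below (console.log() and the variable names are assumptions for illustration; userText is a hypothetical stand-in for user input) compares the two forms.

```javascript
var literalRegExp = /\b'|'\b/;
var constructedRegExp = new RegExp("\\b'|'\\b");

// Both objects hold the same pattern text
console.log(literalRegExp.source);                                // \b'|'\b
console.log(literalRegExp.source === constructedRegExp.source);   // true

// The constructor lets the pattern be decided while the code runs
var userText = "Java";
var runtimeRegExp = new RegExp(userText, "g");
console.log("JavaScript and Java".match(runtimeRegExp).length);   // 2
```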

Although you'll be concentrating on the use of the RegExp object as a parameter for the String object's split(), replace(), match(), and search() methods, the RegExp object does have its own methods and properties. For example, the test() method enables you to test to see if the string passed to it as a parameter contains a pattern matching the one defined in the RegExp object. You'll see the test() method in use in an example shortly.

Simple Regular Expressions

Defining patterns of characters using regular expression syntax can get fairly complex. In this section you'll explore just the basics of regular expression patterns. The best way to do this is through examples.

Let's start by looking at an example in which you want to do a simple text replacement using the replace() method and a regular expression. Imagine you have the following string:

var myString = "Paul, Paula, Pauline, paul, Paul";

and you want to replace any occurrence of the name "Paul" with "Ringo."

Well, the pattern of text you need to look for is simply Paul. Representing this as a regular expression, you just have this:

var myRegExp = /Paul/;

As you saw earlier, the forward-slash characters mark the start and end of the regular expression. Now let's use this expression with the replace() method.

myString = myString.replace(myRegExp, "Ringo");

You can see that the replace() method takes two parameters: the RegExp object that defines the pattern to be searched and replaced, and the replacement text.

If you put this all together in an example, you have the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<script language="JavaScript" type="text/javascript">
  var myString = "Paul, Paula, Pauline, paul, Paul";
  var myRegExp = /Paul/;
  myString = myString.replace(myRegExp, "Ringo");
  alert(myString);
</script>
</body>
</html>

If you load this code into a browser, you will see the screen shown in Figure 9-3.

Figure 9-3

You can see that this has replaced the first occurrence of Paul in your string. But what if you wanted all the occurrences of Paul in the string to be replaced? The two at the far end of the string are still there, so what happened?

By default, the RegExp object looks only for the first matching pattern, in this case the first Paul, and then stops. This is a common and important behavior for RegExp objects. Regular expressions tend to start at one end of a string and look through the characters until the first complete match is found, then stop.

What you want is a global match, which is a search for all possible matches to be made and replaced. To help you out, the RegExp object has three attributes you can define. You can see these listed in the following table.

Attribute Character	Description
`g`	Global match. This looks for all matches of the pattern rather than stopping after the first match is found.
`i`	Pattern is case-insensitive. For example, `Paul` and `paul` are considered the same pattern of characters.
`m`	Multi-line flag. Only available in IE 5.5+ and NN 6+, this specifies that the special characters `^` and `$` can match the beginning and the end of lines as well as the beginning and end of the string. You'll learn about these characters later in the chapter.

If you change the RegExp object in the code to the following, a global case-insensitive match will be made.

var myRegExp = /Paul/gi;

Running the code now produces the result shown in Figure 9-4.

Figure 9-4

This looks as if it has all gone horribly wrong. The regular expression has matched the Paul substrings at the start and the end of the string, and the penultimate paul, just as you wanted. However, the Paul substrings inside Pauline and Paula have also been replaced.
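For reference, this console sketch prints the result of each variation discussed so far. It uses console.log() instead of alert(), which is an assumption made so the three results can be compared in one run.

```javascript
var myString = "Paul, Paula, Pauline, paul, Paul";

// No flags: only the first match is replaced
console.log(myString.replace(/Paul/, "Ringo"));
// Ringo, Paula, Pauline, paul, Paul

// Global: every Paul is replaced, but the lowercase paul is left alone
console.log(myString.replace(/Paul/g, "Ringo"));
// Ringo, Ringoa, Ringoine, paul, Ringo

// Global and case-insensitive
console.log(myString.replace(/Paul/gi, "Ringo"));
// Ringo, Ringoa, Ringoine, Ringo, Ringo
```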

The RegExp object has done its job correctly. You asked for all patterns of the characters Paul to be replaced and that's what you got. What you actually meant was for all occurrences of Paul, when it's a single word and not part of another word, such as Paula, to be replaced. The key to making regular expressions work is to define exactly the pattern of characters you mean, so that only that pattern can match and no other. So let's do that.

You want paul or Paul to be replaced.
You don't want it replaced when it's actually part of another word, as in Pauline.

How do you specify this second condition? How do you know when the word is joined to other characters, rather than just joined to spaces or punctuation or the start or end of the string?

To see how you can achieve the desired result with regular expressions, you need to enlist the help of regular expression special characters. You'll look at these in the next section, by the end of which you should be able to solve the problem.

Try It Out: Regular Expression Tester

Getting your regular expression syntax correct can take some thought and time, so in this exercise you'll create a simple regular expression tester to make life easier.

Type the following code in to your text editor and save it as ch9_examp2.htm:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Regular Expression Tester</title>
<style type="text/css">

body,td,th {
       font-family: Arial, Helvetica, sans-serif;

}

</style>

<script type="text/javascript">

function getRegExpFlags()
{
        var regExpFlags = '';
        if ( document.form1.chkGlobal.checked )
        {
                regExpFlags = 'g';
        }

        if ( document.form1.chkCaseInsensitive.checked )
        {
                regExpFlags += 'i';
        }

        if ( document.form1.chkMultiLine.checked )
        {
                regExpFlags += 'm';
        }

        return regExpFlags;

}

function doTest()
{
        var testRegExp = new RegExp(document.form1.txtRegularExpression.value,
 getRegExpFlags());
        if ( testRegExp.test(document.form1.txtTestString.value) )
        {
                document.form1.txtTestResult.value = "Match Found!";
        }
        else
        {
                document.form1.txtTestResult.value = "Match NOT Found!";
        }
}

function findMatches()
{
        var testRegExp = new RegExp(document.form1.txtRegularExpression.value,
getRegExpFlags());
        var myTestString = new String(document.form1.txtTestString.value)
        var matchArray = myTestString.match(testRegExp);

        document.form1.txtTestResult.value = matchArray.join('\n');

}

</script>

</head>

<body>

<form id="form1" name="form1" method="post" action="">
  <p>
    Regular Expression:<br />
<label>
      <input name="txtRegularExpression" type="text" id="txtRegularExpression"
size="100" value=""/>
      <br />
      Global
      <input name="chkGlobal" type="checkbox" id="chkGlobal" value="true" />
</label>

  Case Insensitive
  <label>
    <input name="chkCaseInsensitive" type="checkbox" id="chkCaseInsensitive"
value="true" />
  </label>

  Multi Line
  <label>
    <input name="chkMultiLine" type="checkbox" id="chkMultiLine" value="true" />
  </label>
  </p>
  <p>
    <label>
      Test Text:<br />
      <textarea name="txtTestString" id="txtTestString" cols="100"
rows="8"></textarea>
    </label>
  </p>
  <p>Result:<br />
    <textarea name="txtTestResult" id="txtTestResult" cols="100"
rows="8"></textarea>
  </p>
  <p>
    <label>
      <input type="button" name="cmdTest" id="cmdTest" value="TEST"
onclick="doTest();"/>
    </label>
    <label>
      <input type="button" name="cmdMatch" id="cmdMatch" value="MATCH"
onclick="findMatches();" />
    </label>
    <label>
      <input type="reset" name="cmdClearForm" id="cmdClearForm" value="Reset Form"
 />
    </label>
  </p>
  <p>&nbsp;</p>
</form>
</body>
</html>

Load the page into your browser, and you'll see the screen shown in Figure 9-5.

Figure 9-5

In the top box, you enter your regular expression. You can set the attributes such as global and case sensitivity by ticking the tick boxes. The text to test the regular expression against goes in the Test Text box, and the result is displayed in the Result box.

As a test, enter the regular expression \d{3}, which, as you'll discover shortly, matches three digits. Also tick the Global box so that all the matches will be found. Finally, enter ABC123DEF456GHI789 as your test text.

If you click the Test button, the code will test to see if there are any matches (that is, if the test text contains three numbers). The result, as you can see in Figure 9-6, is that a match is found.

Figure 9-6

Now to find all the matches, click the Match button, and this results in the screen shown in Figure 9-7.

Figure 9-7

Each match of your regular expression found in the Test Text box is put on a separate line in the Result box.

The buttons cmdTest and cmdMatch have their click events linked to the doTest() and findMatches() functions. Let's start by looking at what happens in the doTest() function.

First, the regular expression object is created.

var testRegExp = new RegExp(document.form1.txtRegularExpression.value,
                            getRegExpFlags());

The first parameter of the object constructor is your regular expression as contained in the txtRegularExpression text box. This is easy enough to access, but the second parameter contains the regular expression flags, and these are generated via the tick boxes in the form. To convert the tick boxes to the correct flags, the function getRegExpFlags() has been created, and the return value from this function provides the flags value for the regular expression's constructor. The function getRegExpFlags() is used by both the doTest() and findMatches() functions. The getRegExpFlags() function is fairly simple. It starts by declaring regExpFlags and setting it to an empty string.

var regExpFlags = '';

Then for each of the tick boxes, it checks to see if the tick box is ticked. If it is, the appropriate flag is added to regExpFlags as shown here for the global flag:

if ( document.form1.chkGlobal.checked )
{
       regExpFlags = 'g';
}

The same principle is used for the case-insensitive and multi-line flags.
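To try the flag-building logic without the form, the sketch below replaces the three tick boxes with three Boolean parameters. buildRegExpFlags() is a hypothetical helper written for this illustration, not part of the example page, and console.log() is assumed in place of the form's Result box.

```javascript
// The three parameters stand in for the checked state of the tick boxes
function buildRegExpFlags(isGlobal, isCaseInsensitive, isMultiLine)
{
   var regExpFlags = '';
   if (isGlobal)
   {
      regExpFlags += 'g';
   }
   if (isCaseInsensitive)
   {
      regExpFlags += 'i';
   }
   if (isMultiLine)
   {
      regExpFlags += 'm';
   }
   return regExpFlags;
}

console.log(buildRegExpFlags(true, true, false));    // gi
console.log(buildRegExpFlags(false, false, true));   // m

// The returned string is passed as the constructor's second parameter
var testRegExp = new RegExp("paul", buildRegExpFlags(true, true, false));
console.log("Paul, paul".match(testRegExp).length);  // 2
```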

Okay, back to the doTest() function. The regular expression object has been created and its flags have been set, so now you test to see if the regular expression matches anything in the Test Text box.

if ( testRegExp.test(document.form1.txtTestString.value) )
{
       document.form1.txtTestResult.value = "Match Found!";
}
else
{
       document.form1.txtTestResult.value = "Match NOT Found!";
}

If a match is found, "Match Found!" is written to the Results box; otherwise "Match NOT Found!" is written.

The regular expression object's test() method is used to do the actual testing for a match of the regular expression with the test string supplied as the method's only parameter. It returns true when a match is found or false when it's not. The global flag makes no difference here, because test() simply looks for the next match and returns true if one is found, and doTest() creates a fresh RegExp object every time it runs.
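A short console sketch of test() follows, with console.log() assumed in place of the form. The last three lines illustrate a detail that matters only when a RegExp object with the g flag is reused: the object remembers where its previous match ended in its lastIndex property, so a second call on the same object can give a different answer.

```javascript
var testText = "ABC123DEF456GHI789";

console.log(/\d{3}/.test(testText));   // true, three digits in a row exist
console.log(/\d{4}/.test(testText));   // false, there is no run of four digits

// A reused global RegExp object continues from where the last match ended
var globalRegExp = /\d{3}/g;
console.log(globalRegExp.test("ABC123"));   // true
console.log(globalRegExp.lastIndex);        // 6
console.log(globalRegExp.test("ABC123"));   // false, the search resumed at position 6
```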

Now let's look at the findMatches() function, which runs when the cmdMatch button is clicked. As with the doTest() function, the first line creates a new regular expression object with the regular expression entered in the Regular Expression text box in the form and the flags being set via the getRegExpFlags() function.

var testRegExp = new RegExp(document.form1.txtRegularExpression.value,
                            getRegExpFlags());

Next, a new String object is created, and you then use the String object's match() method to find the matches.

var myTestString = new String(document.form1.txtTestString.value)
var matchArray = myTestString.match(testRegExp);

The match() method returns an array in which each element holds one of the matches found. The variable matchArray is used to store the array. Note that if nothing matches, match() returns null instead of an array.

Finally, the match results are displayed in the Results box on the form:

document.form1.txtTestResult.value = matchArray.join('\n');

The Array object's join() method joins all the elements in an array and returns them as a single string. Each element is separated by the value you pass as the join() method's only parameter. Here \n, the newline character, has been passed, which means that when the string is displayed in the Result box, each match is on its own line.
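The match-and-join step can be tried in a console with the sketch below. joinMatches() is a hypothetical helper, not part of the example page. It adds a check for null, because match() returns null when nothing matches and calling join() on null would cause an error.

```javascript
function joinMatches(testString, testRegExp)
{
   var matchArray = testString.match(testRegExp);
   if (matchArray === null)
   {
      // Nothing matched, so there is nothing to join
      return "";
   }
   return matchArray.join('\n');
}

console.log(joinMatches("ABC123DEF456GHI789", /\d{3}/g));
// 123
// 456
// 789

console.log(joinMatches("no digits here", /\d{3}/g) === "");   // true
```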

Regular Expressions: Special Characters

You will be looking at three types of special characters in this section.

Text, Numbers, and Punctuation

The first group of special characters you'll look at contains the character class's special characters. Character class means digits, letters, and whitespace characters. The special characters are displayed in the following table.

Character Class	Characters It Matches	Example
`\d`	Any digit from 0 to 9	`\d\d` matches 72, but not aa or 7a
`\D`	Any character that is not a digit	`\D\D\D` matches abc, but not 123 or 8ef
`\w`	Any word character; that is, A–Z, a–z, 0–9, and the underscore character (_)	`\w\w\w\w` matches Ab_2, but not £$%* or Ab_@
`\W`	Any non-word character	`\W` matches @, but not a
`\s`	Any whitespace character	`\s` matches space, tab, return, formfeed, and vertical tab
`\S`	Any non-whitespace character	`\S` matches A, but not the tab character
`.`	Any single character other than the newline character (\n)	`.` matches a or 4 or @
`[...]`	Any one of the characters between the brackets	`[abc]` will match a or b or c, but nothing else; `[a-z]` will match any character in the range a to z
`[^...]`	Any one character, but not one of those inside the brackets	`[^abc]` will match any character except a or b or c; `[^a-z]` will match any character that is not in the range a to z

Note that uppercase and lowercase characters mean very different things, so you need to be extra careful with case when using regular expressions.
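Several of the table's examples can be verified in a console with the RegExp object's test() method. This is a minimal sketch; console.log() is assumed in place of alert(), and each pattern is tested against a whole sample string, so a result of true means the pattern matches somewhere in that string.

```javascript
console.log(/\d\d/.test("72"));         // true
console.log(/\d\d/.test("7a"));         // false, only one digit
console.log(/\D\D\D/.test("abc"));      // true
console.log(/\D\D\D/.test("8ef"));      // false, only two non-digits
console.log(/\w\w\w\w/.test("Ab_2"));   // true
console.log(/\w\w\w\w/.test("Ab_@"));   // false, @ is not a word character
console.log(/\s/.test("a\tb"));         // true, the tab is whitespace
console.log(/[abc]/.test("xyz"));       // false
console.log(/[^abc]/.test("abc"));      // false, every character is a, b, or c

// Case matters: \d and \D are opposites
console.log(/\d/.test("123"));          // true
console.log(/\D/.test("123"));          // false
```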

Let's look at an example. To match a telephone number in the format 1-800-888-5474, the regular expression would be as follows:

\d-\d\d\d-\d\d\d-\d\d\d\d

You can see that there's a lot of repetition of characters here, which makes the expression quite unwieldy. To make this simpler, regular expressions have a way of defining repetition. You'll see this a little later in the chapter, but first let's look at another example.
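Before moving on, the telephone number pattern can be tried in a console. In this sketch the variable name phoneRegExp and the use of console.log() are assumptions for illustration. The third line shows that the pattern, as written, also matches a number embedded in longer text.

```javascript
var phoneRegExp = /\d-\d\d\d-\d\d\d-\d\d\d\d/;

console.log(phoneRegExp.test("1-800-888-5474"));            // true
console.log(phoneRegExp.test("1-800-888-547"));             // false, last group is too short
console.log(phoneRegExp.test("Call 1-800-888-5474 now"));   // true
```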

Try It Out: Checking a Passphrase for Alphanumeric Characters

You'll use what you've learned so far about regular expressions in a full example in which you check that a passphrase contains only letters and numbers — that is, alphanumeric characters, not punctuation or symbols like @, %, and so on.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example 3</title>

<script type="text/javascript">
function regExpIs_valid(text)
{
   var myRegExp = /[^a-z\d ]/i;
   return !(myRegExp.test(text));
}
function butCheckValid_onclick()
{
   if (regExpIs_valid(document.form1.txtPhrase.value) == true)
   {
      alert("Your passphrase contains only valid characters");
   }
   else
   {
      alert("Your passphrase contains one or more invalid characters");
   }
}
</script>

</head>
<body>

<form name="form1">
Enter your passphrase:
<br />
<input type="text" name="txtPhrase">
<br />
<input type="button" value="Check Character Validity" name="butCheckValid"
   onclick="butCheckValid_onclick()">
</form>

</body>
</html>

Save the page as ch9_examp3.htm, and then load it into your browser. Type just letters, numbers, and spaces into the text box; click the Check Character Validity button, and you'll be told that the phrase contains valid characters. Try putting punctuation or special characters like @, ^, $, and so on into the text box, and you'll be informed that your passphrase is invalid.

Let's start by looking at the regExpIs_valid() function defined at the top of the script block in the head of the page. That does the validity checking of the passphrase using regular expressions.

function regExpIs_valid(text)
{
   var myRegExp = /[^a-z\d ]/i;
   return !(myRegExp.test(text));
}

The function takes just one parameter: the text you want to check for validity. You then declare a variable, myRegExp, and set it to a new regular expression, which implicitly creates a new RegExp object.

The regular expression itself is fairly simple, but first think about what pattern you are looking for. What you want to find out is whether your passphrase string contains any characters that are not letters between A and Z or between a and z, numbers between 0 and 9, or spaces. Let's see how this translates into a regular expression:

You use square brackets with the ^ symbol.
```
[^]
```
This means you want to match any character that is not one of the characters specified inside the square brackets.
You add a-z, which specifies any character in the range a through z.
```
[^a-z]
```
So far, your regular expression matches any character that is not between a and z. Note that, because you added the i to the end of the expression definition, you've made the pattern case-insensitive. So your regular expression actually matches any character not between A and Z or a and z.
Add \d to indicate any digit character, or any character between 0 and 9.
```
[^a-z\d]
```
Your expression matches any character that is not between a and z, A and Z, or 0 and 9. You decide that a space is valid, so you add that inside the square brackets.
```
[^a-z\d ]
```
Putting this all together, you have a regular expression that will match any character that is not a letter, a digit, or a space.
On the second and final line of the function, you use the RegExp object's test() method to return a value.
```
return !(myRegExp.test(text));
```

The test() method of the RegExp object checks the string passed as its parameter to see if the characters specified by the regular expression syntax match anything inside the string. If they do, true is returned; if not, false is returned. Your regular expression will match the first invalid character found, so if you get a result of true, you have an invalid passphrase. However, it's a bit illogical for an is_valid function to return true when it's invalid, so you reverse the result returned by adding the NOT operator (!).
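The function can be exercised on its own in a console, as in this sketch. The sample passphrases are invented for illustration and console.log() is assumed in place of alert(). Note the last case: an empty string contains no invalid characters, so the function reports it as valid.

```javascript
function regExpIs_valid(text)
{
   // Matches the first character that is not a letter, a digit, or a space
   var myRegExp = /[^a-z\d ]/i;
   return !(myRegExp.test(text));
}

console.log(regExpIs_valid("Open Sesame 123"));   // true
console.log(regExpIs_valid("Open_Sesame"));       // false, underscore is not allowed
console.log(regExpIs_valid("p@ssphrase"));        // false
console.log(regExpIs_valid(""));                  // true
```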

Previously you saw the two-line validity checker function using regular expressions. Just to show how much more coding is required to do the same thing without regular expressions, here is a second function that does the same thing as regExpIs_valid() but without regular expressions.

function is_valid(text)
{
   var isValid = true;
   var validChars = "abcdefghijklmnopqrstuvwxyz1234567890 ";
   var charIndex;
   for (charIndex = 0; charIndex < text.length;charIndex++)
   {
      if ( validChars.indexOf(text.charAt(charIndex).toLowerCase()) < 0)
      {
         isValid = false;
         break;
      }
   }
   return isValid;
}

This is probably as small as the non-regular expression version can be, and yet it's still 15 lines long. That's six times the amount of code for the regular expression version.

The principle of this function is similar to that of the regular expression version. You have a variable, validChars, which contains all the characters you consider to be valid. You then use the charAt() method in a for loop to get each character in the passphrase string and check whether it exists in your validChars string. If it doesn't, you know you have an invalid character.

In this example, the non-regular expression version of the function is 15 lines, but with a more complex problem you could find it takes 20 or 30 lines to do the same thing a regular expression can do in just a few.

Back to your actual code: The other function defined in the head of the page is butCheckValid_onclick(). As the name suggests, this is called when the butCheckValid button defined in the body of the page is clicked.

This function calls your regExpIs_valid() function in an if statement to check whether the passphrase entered by the user in the txtPhrase text box is valid. If it is, an alert box is used to inform the user.

function butCheckValid_onclick()
{
   if (regExpIs_valid(document.form1.txtPhrase.value) == true)
   {
      alert("Your passphrase contains valid characters");
   }

If it isn't, another alert box is used to let users know that their text was invalid.

else
   {
      alert("Your passphrase contains one or more invalid characters");
   }
}

Repetition Characters

Regular expressions include something called repetition characters, which are a means of specifying how many of the last item or character you want to match. This proves very useful, for example, if you want to specify a phone number that repeats a character a specific number of times. The following table lists some of the most common repetition characters and what they do.

Special Character	Meaning	Example
`{n}`	Match `n` of the previous item	`x{2}` matches xx
`{n,}`	Match `n` or more of the previous item	`x{2,}` matches xx, xxx, xxxx, xxxxx, and so on
`{n,m}`	Match at least `n` and at most `m` of the previous item	`x{2,4}` matches xx, xxx, and xxxx
`?`	Match the previous item zero or one time	`x?` matches nothing or x
`+`	Match the previous item one or more times	`x+` matches x, xx, xxx, xxxx, xxxxx, and so on
`*`	Match the previous item zero or more times	`x*` matches nothing, or `x`, `xx`, `xxx`, `xxxx`, and so on
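The rows of the table can be checked with a few quick calls to test(). This is only an illustrative sketch: each pattern is wrapped in the ^ and $ position characters (covered in the next section) so that the whole string has to match, and console.log() is used in place of alert().

```javascript
// ^ and $ force the pattern to cover the entire string
console.log(/^x{2}$/.test("xx"));       // true: exactly two
console.log(/^x{2}$/.test("xxx"));      // false: three is too many
console.log(/^x{2,}$/.test("xxxxx"));   // true: two or more
console.log(/^x{2,4}$/.test("xxxxx"));  // false: at most four allowed
console.log(/^x?$/.test(""));           // true: zero or one
console.log(/^x+$/.test(""));           // false: at least one required
console.log(/^x*$/.test(""));           // true: zero or more
```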

You saw earlier that to match a telephone number in the format 1-800-888-5474, the regular expression would be \d-\d\d\d-\d\d\d-\d\d\d\d. Let's see how this would be simplified with the use of the repetition characters.

The pattern you're looking for starts with one digit followed by a dash, so you need the following:

\d-

Next are three digits followed by a dash. This time you can use the repetition special characters: \d{3} will match exactly three occurrences of \d, which is the any-digit character.

\d-\d{3}-

Next, there are three digits followed by a dash again, so now your regular expression looks like this:

\d-\d{3}-\d{3}-

Finally, the last part of the expression is four digits, which is \d{4}.

\d-\d{3}-\d{3}-\d{4}

You'd declare this regular expression like this:

var myRegExp = /\d-\d{3}-\d{3}-\d{4}/;

Remember that the first / and last / tell JavaScript that what is in between those characters is a regular expression. JavaScript creates a RegExp object based on this regular expression.
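A short sketch of the phone-number pattern in use follows. The test strings other than 1-800-888-5474 are made up for illustration, and console.log() replaces alert() so the code can run outside a browser.

```javascript
var myRegExp = /\d-\d{3}-\d{3}-\d{4}/;

console.log(myRegExp.test("1-800-888-5474"));           // true
console.log(myRegExp.test("1-800-88-5474"));            // false: middle group has only two digits
// There are no position characters in the pattern, so it also matches inside longer text
console.log(myRegExp.test("Call 1-800-888-5474 now"));  // true
// The long-hand version without repetition characters behaves the same way
console.log(/\d-\d\d\d-\d\d\d-\d\d\d\d/.test("1-800-888-5474")); // true
```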

As another example, what if you have the string Paul Paula Pauline, and you want to replace Paul and Paula with George? To do this, you would need a regular expression that matches both Paul and Paula.

Let's break this down. You know you want the characters Paul, so your regular expression starts as

Paul

Now you also want to match Paula, but if you make your expression Paula, this will exclude a match on Paul. This is where the special character ? comes in. It enables you to specify that the previous character is optional — it must appear zero (not at all) or one time. So, the solution is

Paula?

which you'd declare as

var myRegExp = /Paula?/
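To see what this pattern does to the example string, here is a small sketch. It assumes the g flag is added so that every match is replaced, and it uses console.log() rather than alert().

```javascript
var myString = "Paul Paula Pauline";
// g is added so that all matches are replaced, not just the first
var myRegExp = /Paula?/g;
myString = myString.replace(myRegExp, "George");
console.log(myString); // George George Georgeine
```

Notice that the Paul at the start of Pauline is replaced as well. The pattern says nothing about what may come after the match; restricting where a match may occur is the job of the position characters.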

Position Characters

The third group of special characters you'll look at are those that enable you to specify either where the match should start or end or what will be on either side of the character pattern. For example, you might want your pattern to exist at the start or end of a string or line, or you might want it to be between two words. The following table lists some of the most common position characters and what they do.

Position Character	Description
`^`	The pattern must be at the start of the string, or if it's a multi-line string, then at the beginning of a line. For multi-line text (a string that contains carriage returns), you need to set the multi-line flag when defining the regular expression using `/myregex/m`. Note that this is only applicable to IE 5.5 and later and NN 6 and later.
`$`	The pattern must be at the end of the string, or if it's a multi-line string, then at the end of a line. For multi-line text (a string that contains carriage returns), you need to set the multi-line flag when defining the regular expression using `/myregex/m`. Note that this is only applicable to IE 5.5 and later and NN 6 and later.
`\b`	This matches a word boundary, which is essentially the point between a word character and a non-word character.
`\B`	This matches a position that's not a word boundary.

For example, if you wanted to make sure your pattern was at the start of a line, you would type the following:

^myPattern

This would match an occurrence of myPattern if it was at the beginning of a line.

To match the same pattern, but at the end of a line, you would type the following:

myPattern$

The word-boundary special characters \b and \B can cause confusion, because they do not match characters but the positions between characters.

Imagine you had the string "Hello world!, let's look at boundaries said 007." defined in the code as follows:

var myString = "Hello world!, let's look at boundaries said 007.";

To make the word boundaries (that is, the boundaries between the words) of this string stand out, let's convert them to the | character.

var myRegExp = /\b/g;
myString = myString.replace(myRegExp, "|");
alert(myString);

You've replaced all the word boundaries, \b, with a |, and your message box looks like the one in Figure 9-8.

Figure 9-8

You can see that the position between any word character (letters, numbers, or the underscore character) and any non-word character is a word boundary. You'll also notice that the boundary between the start or end of the string and a word character is considered to be a word boundary. The end of this string is a full stop. So the boundary between the full stop and the end of the string is a non-word boundary, and therefore no | has been inserted.
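For readers without the figure to hand, the following sketch reproduces the replacement and shows the resulting string as a comment. It uses console.log() instead of alert(); the variable name marked is introduced here purely for illustration.

```javascript
var myString = "Hello world!, let's look at boundaries said 007.";
// Insert a | at every word boundary
var marked = myString.replace(/\b/g, "|");
console.log(marked);
// |Hello| |world|!, |let|'|s| |look| |at| |boundaries| |said| |007|.
```

No | appears between the exclamation mark, the comma, and the space, because none of those positions lies next to a word character.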

If you change the regular expression in the example, so that it replaces non-word boundaries as follows:

var myRegExp = /\B/g;

you get the result shown in Figure 9-9.

Figure 9-9

Now the position between a letter, number, or underscore and another letter, number, or underscore is considered a non-word boundary and is replaced by an | in the example. However, what is slightly confusing is that the boundary between two non-word characters, such as an exclamation mark and a comma, is also considered a non-word boundary. If you think about it, it actually does make sense, but it's easy to forget when creating regular expressions.
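The same behavior is easier to follow on a very short string. The two strings below are invented for illustration and are not part of the chapter's example.

```javascript
// Non-word boundaries: between a and b, between ! and the comma,
// and between the comma and the space
console.log("ab!, c".replace(/\B/g, "|")); // a|b!|,| c

// The position between a trailing full stop and the end of the string
// is also a non-word boundary
console.log("a.".replace(/\B/g, "|"));     // a.|
```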

You'll remember this example from when you started looking at regular expressions:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<script language="JavaScript" type="text/javascript">

  var myString = "Paul, Paula, Pauline, paul, Paul";
  var myRegExp = /Paul/gi;
  myString = myString.replace(myRegExp, "Ringo");
  alert(myString);



</script>
</body>
</html>

You used this code to convert all instances of Paul or paul to Ringo.

However, you found that this code actually converts all instances of Paul to Ringo, even when the word Paul is inside another word.

One way to solve this problem would be to replace the string Paul only where it is followed by a non-word character. The special character for non-word characters is \W, so you need to alter the regular expression to the following:

var myRegExp = /Paul\W/gi;

This gives the result shown in Figure 9-10.

Figure 9-10

It's getting better, but it's still not what you want. Notice that the commas after the first Paul and after the lowercase paul have also been replaced because they matched the \W character. Also, you're still not replacing Paul at the very end of the string. That's because there is no character after the letter l in the last Paul. What is after the l in the last Paul? Nothing, just the boundary between a word character and a non-word character, and therein lies the answer. What you want as your regular expression is Paul followed by a word boundary. Let's alter the regular expression to cope with that by entering the following:

var myRegExp = /Paul\b/gi;

Now you get the result you want, as shown in Figure 9-11.

Figure 9-11

At last you've got it right, and this example is finished.

Covering All Eventualities

Perhaps the trickiest thing about a regular expression is making sure it covers all eventualities. In the previous example your regular expression works with the string as defined, but does it work with the following?

var myString = "Paul, Paula, Pauline, paul, Paul, JeanPaul";

Here the Paul substring in JeanPaul will be changed to Ringo. You really only want to convert the substring Paul where it is on its own, with a word boundary on either side. If you change your regular expression code to

var myRegExp = /\bPaul\b/gi;

you have your final answer and can be sure only Paul or paul will ever be matched.
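The progression from the first attempt to the final expression can be compared side by side. This is a sketch using console.log() in place of alert(); the expected strings are shown as comments.

```javascript
var myString = "Paul, Paula, Pauline, paul, Paul, JeanPaul";

// No position characters: every Paul is replaced, even inside other words
console.log(myString.replace(/Paul/gi, "Ringo"));
// Ringo, Ringoa, Ringoine, Ringo, Ringo, JeanRingo

// Word boundary after Paul only: Paula and Pauline are safe, JeanPaul is not
console.log(myString.replace(/Paul\b/gi, "Ringo"));
// Ringo, Paula, Pauline, Ringo, Ringo, JeanRingo

// Word boundary on both sides: only Paul or paul on its own is replaced
console.log(myString.replace(/\bPaul\b/gi, "Ringo"));
// Ringo, Paula, Pauline, Ringo, Ringo, JeanPaul
```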

Grouping Regular Expressions

The final topic under regular expressions, before you look at examples using the match(), replace(), and search() methods, is how you can group expressions. In fact, it's quite easy. If you want a number of expressions to be treated as a single group, you just enclose them in parentheses, for example, /(\d\d)/. Parentheses in regular expressions are special characters that group together character patterns and are not themselves part of the characters to be matched.

Why would you want to do this? Well, by grouping characters into patterns, you can use the special repetition characters to apply to the whole group of characters, rather than just one.

Let's take the following string defined in myString as an example:

var myString = "JavaScript, VBScript and Perl";

How could you match both JavaScript and VBScript using the same regular expression? The only thing they have in common is that they are whole words and they both end in Script. Well, an easy way would be to use parentheses to group the patterns Java and VB. Then you can use the ? special character to apply to each of these groups of characters to make the pattern match any word having zero or one instances of the characters Java or VB, and ending in Script.

var myRegExp = /\b(VB)?(Java)?Script\b/gi;

Breaking this expression down, you can see the pattern it requires is as follows:

A word boundary: \b
Zero or one instance of VB: (VB)?
Zero or one instance of Java: (Java)?
The characters Script: Script
A word boundary: \b

Putting these together, you get this:

var myString = "JavaScript, VBScript and Perl";
var myRegExp = /\b(VB)?(Java)?Script\b/gi;
myString = myString.replace(myRegExp, "xxxx");
alert(myString);

The output of this code is shown in Figure 9-12.

Figure 9-12

If you look back at the special repetition characters table, you'll see that they apply to the item preceding them. This can be a character, or, where they have been grouped by means of parentheses, the previous group of characters.

However, there is a potential problem with the regular expression you just defined. As well as matching VBScript and JavaScript, it also matches VBJavaScript. This is clearly not exactly what you meant.

To get around this you need to make use of both grouping and the special character |, which is the alternation character. It has an or-like meaning, similar to || in if statements, and will match the characters on either side of itself.

Let's think about the problem again. You want the pattern to match VBScript or JavaScript. Clearly they have the Script part in common. So what you want is a new word starting with Java or starting with VB; either way, it must end in Script.

First, you know that the word must start with a word boundary.

\b

Next you know that you want either VB or Java to be at the start of the word. You've just seen that in regular expressions | provides the "or" you need, so in regular expression syntax you want the following:

\b(VB|Java)

This matches the pattern VB or Java. Now you can just add the Script part.

\b(VB|Java)Script\b

Your final code looks like this:

var myString = "JavaScript, VBScript and Perl";
var myRegExp = /\b(VB|Java)Script\b/gi;
myString = myString.replace(myRegExp, "xxxx");
alert(myString);
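The difference between the two approaches shows up clearly with test(). In this sketch the variable names optionalGroups and alternation are made up for illustration, and the g flag is left off because test() is only being asked a yes-or-no question.

```javascript
var optionalGroups = /\b(VB)?(Java)?Script\b/i;
var alternation = /\b(VB|Java)Script\b/i;

// Both optional groups can match at once, so VBJavaScript slips through
console.log(optionalGroups.test("VBJavaScript")); // true
console.log(alternation.test("VBJavaScript"));    // false

// Both optional groups can also match nothing, so Script alone matches
console.log(optionalGroups.test("Script"));       // true
console.log(alternation.test("Script"));          // false

// The alternation version still matches the two words you want
console.log(alternation.test("JavaScript"));      // true
console.log(alternation.test("VBScript"));        // true
```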

Reusing Groups of Characters

You can reuse the pattern specified by a group of characters later on in the regular expression. To refer to a previous group of characters, you just type \ and a number indicating the order of the group. For example, the first group can be referred to as \1, the second as \2, and so on.

Let's look at an example. Say you have a list of numbers in a string, with each number separated by a comma. For whatever reason, you are not allowed to have two instances of the same number in a row, so although

009,007,001,002,004,003

would be okay, the following:

007,007,001,002,002,003

would not be valid, because you have 007 and 002 repeated after themselves.

How can you find instances of repeated digits and replace them with the word ERROR? You need to use the ability to refer to groups in regular expressions.

First, let's define the string as follows:

var myString  = "007,007,001,002,002,003,002,004";

Now you know you need to search for a series of one or more number characters. In regular expressions the \d specifies any digit character, and + means one or more of the previous character. So far, that gives you this regular expression:

\d+

You want to match a series of digits followed by a comma, so you just add the comma.

\d+,

This will match any series of digits followed by a comma, but how do you search for any series of digits followed by a comma, then followed again by the same series of digits? As the digits could be any digits, you can't add them directly into the expression like so:

\d+,007

This would not work with the 002 repeat. What you need to do is put the first series of digits in a group; then you can specify that you want to match that group of digits again. This can be done with \1, which says, "Match the characters found in the first group defined using parentheses." Put all this together, and you have the following:

(\d+),\1

This defines a group whose pattern of characters is one or more digit characters. This group must be followed by a comma and then by the same pattern of characters as in the first group. Put this into some JavaScript, and you have the following:

var myString  = "007,007,001,002,002,003,002,004";
var myRegExp = /(\d+),\1/g;
myString = myString.replace(myRegExp,"ERROR");
alert(myString);

The alert box will show this message:

ERROR,001,ERROR,003,002,004
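In keeping with the advice about covering all eventualities, it may be worth checking what happens when the numbers do not all have the same length. The sketch below first repeats the example and then tries a made-up string, "12,123", in which the second number merely starts with the first. Adding \b on each side is one possible refinement, offered here as an assumption rather than as part of the original example.

```javascript
var myString = "007,007,001,002,002,003,002,004";
console.log(myString.replace(/(\d+),\1/g, "ERROR"));
// ERROR,001,ERROR,003,002,004

// Without word boundaries, 12 followed by 123 looks like a repeat of 12
console.log("12,123".replace(/(\d+),\1/g, "ERROR"));      // ERROR3

// With word boundaries, only whole numbers are compared
console.log("12,123".replace(/\b(\d+),\1\b/g, "ERROR"));  // 12,123
```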

That completes your brief look at regular expression syntax. Because regular expressions can get a little complex, it's often a good idea to start simple and build them up slowly, as was done in the previous example. In fact, most regular expressions are just too hard to get right in one step — at least for us mere mortals without a brain the size of a planet.

If it's still looking a bit strange and confusing, don't panic. In the next sections, you'll be looking at the String object's split(), replace(), search(), and match() methods with plenty more examples of regular expression syntax.

The String Object — split(), replace(), search(), and match() Methods

The main functions making use of regular expressions are the String object's split(), replace(), search(), and match() methods. You've already seen their syntax, so you'll concentrate on their use with regular expressions and at the same time learn more about regular expression syntax and usage.

The split() Method

You've seen that the split() method enables us to split a string into various pieces, with the split being made at the character or characters specified as a parameter. The result of this method is an array with each element containing one of the split pieces. For example, the following string:

var myListString = "apple, banana, peach, orange"

could be split into an array in which each element contains a different fruit, like this:

var myFruitArray = myListString.split(", ");

How about if your string is this instead?

var myListString = "apple, 0.99, banana, 0.50, peach, 0.25, orange, 0.75";

The string could, for example, contain both the names and prices of the fruit. How could you split the string, but retrieve only the names of the fruit and not the prices? You could do it without regular expressions, but it would take many lines of code. With regular expressions you can use the same code and just amend the split() method's parameter.

Try It Out: Splitting the Fruit String

Let's create an example that solves the problem just described — it must split your string, but include only the fruit names, not the prices.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>

<script type="text/javascript">
var myListString = "apple, 0.99, banana, 0.50, peach, 0.25, orange, 0.75";
var theRegExp = /[^a-z]+/i;
var myFruitArray = myListString.split(theRegExp);
document.write(myFruitArray.join("<br />"));

</script>
</body>
</html>

Save the file as ch9_examp4.htm and load it in your browser. You should see the four fruits from your string written out to the page, with each fruit on a separate line.

Within the script block, first you have your string with fruit names and prices.

var myListString = "apple, 0.99, banana, 0.50, peach, 0.25, orange, 0.75";

How do you split it in such a way that only the fruit names are included? Your first thought might be to use the comma as the split() method's parameter, but of course that means you end up with the prices. What you have to ask is, "What is it that's between the items I want?" Or in other words, what is between the fruit names that you can use to define your split? The answer is that various characters are between the names of the fruit, such as a comma, a space, numbers, a full stop, more numbers, and finally another comma. What is it that these things have in common and makes them different from the fruit names that you want? What they have in common is that none of them are letters from a through z. If you say "Split the string at the point where there is a group of characters that are not between a and z," then you get the result you want. Now you know what you need to create your regular expression.

You know that what you want is not the letters a through z, so you start with this:

[^a-z]

The ^ says "Match any character that does not match those specified inside the square brackets." In this case you've specified a range of characters not to be matched — all the characters between a and z. As specified, this expression will match only one character, whereas you want to split wherever there is a single group of one or more characters that are not between a and z. To do this you need to add the + special repetition character, which says "Match one or more of the preceding character or group specified."

[^a-z]+

The final result is this:

var theRegExp = /[^a-z]+/i

The / and / characters mark the start and end of the regular expression whose RegExp object is stored as a reference in the variable theRegExp. You add the i on the end to make the match case-insensitive.

Don't panic if creating regular expressions seems like a frustrating and less-than-obvious process. At first, it takes a lot of trial and error to get it right, but as you get more experienced, you'll find creating them becomes much easier and will enable you to do things that without regular expressions would be either very awkward or virtually impossible.

In the next line of script you pass the RegExp object to the split() method, which uses it to decide where to split the string.

var myFruitArray = myListString.split(theRegExp);

After the split, the variable myFruitArray will contain an Array with each element containing the fruit name, as shown here:

Array Element Index	0	1	2	3
Element value	apple	banana	peach	orange

You then join the string together again using the Array object's join() method, which you saw in Chapter 4.

document.write(myFruitArray.join("<br />"));
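One detail is worth checking if you plan to use the array for anything other than display. Because the string ends with a price, the final run of non-letter characters also counts as a split point, and split() returns an empty string after it. The sketch below shows this and one possible way to drop the empty piece; the variable name fruitNames and the use of the Array object's filter() method are additions made for illustration, and console.log() replaces document.write().

```javascript
var myListString = "apple, 0.99, banana, 0.50, peach, 0.25, orange, 0.75";
var myFruitArray = myListString.split(/[^a-z]+/i);

// Four fruit names plus one empty string after the last price
console.log(myFruitArray.length); // 5

// Keep only the pieces that actually contain text
var fruitNames = myFruitArray.filter(function (item) {
   return item.length > 0;
});
console.log(fruitNames.join(", ")); // apple, banana, peach, orange
```

When the array is written to the page with join(), the empty element only produces a trailing line break, which is why it is easy to overlook.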

The replace() Method

You've already looked at the syntax and usage of the replace() method. However, something unique to the replace() method is its ability to replace text based on the groups matched in the regular expression. You do this using the $ sign and the group's number. Each group in a regular expression is given a number from 1 to 99; any groups greater than 99 are not accessible. Note that in earlier browsers, groups could only go from 1 to 9 (for example, in IE 5 or earlier or Netscape 4 and earlier). To refer to a group, you write $ followed by the group's position. For example, if you had the following:

var myRegExp = /(\d)(\W)/g;

then $1 refers to the group (\d), and $2 refers to the group (\W). You've also set the global flag g to ensure that all matching patterns are replaced, not just the first one.

You can see this more clearly in the next example. Say you have the following string:

var myString = "1999, 2000, 2001";

If you wanted to change this to "the year 1999, the year 2000, the year 2001", how could you do it with regular expressions?

First, you need to work out the pattern as a regular expression, in this case four digits.

var myRegExp = /\d{4}/g;

But given that the year is different every time, how can you substitute the year value into the replaced string?

Well, you change your regular expression so that it's inside a group, as follows:

var myRegExp = /(\d{4})/g;

Now you can use the group, which has group number 1, inside the replacement string like this:

myString = myString.replace(myRegExp, "the year $1");

The variable myString now contains the required string "the year 1999, the year 2000, the year 2001".
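Here is the year example as a runnable sketch, followed by a second, made-up example that uses two groups to show $1 and $2 working together. The string "Smith, John" and the variable name fullName are assumptions for illustration only.

```javascript
var myString = "1999, 2000, 2001";
myString = myString.replace(/(\d{4})/g, "the year $1");
console.log(myString); // the year 1999, the year 2000, the year 2001

// Two groups: swap "surname, first name" into "first name surname"
var fullName = "Smith, John";
console.log(fullName.replace(/(\w+), (\w+)/, "$2 $1")); // John Smith
```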

Let's look at another example in which you want to convert single quotes in text to double quotes. Your test string is this:

'Hello World' said Mr. O'Connerly.
He then said 'My Name is O'Connerly, yes that's right, O'Connerly'.

One problem that the test string makes clear is that you want to replace the single-quote mark with a double only where it is used in pairs around speech, not when it is acting as an apostrophe, such as in the word that's, or when it's part of someone's name, such as in O'Connerly.

Let's start by defining the regular expression. First you know that it must include a single quote, as shown in the following code:

var myRegExp = /'/;

However, as it is this would replace every single quote, which is not what you want.

Looking at the text, you should also notice that quotes are always at the start or end of a word, that is, at a boundary. On first glance it might be easy to assume that it would be a word boundary. However, don't forget that the ' is a non-word character, so the boundary will be between it and another non-word character, such as a space. So the boundary will be a non-word boundary or, in other words, \B.

Therefore, the character pattern you are looking for is either a non-word boundary followed by a single quote or a single quote followed by a non-word boundary. The key is the "or," for which you use | in regular expressions. This leaves your regular expression as the following:

var myRegExp = /\B'|'\B/g;

This will match the pattern on the left of the | or the character pattern on the right. You want to replace all the single quotes with double quotes, so the g has been added at the end, indicating that a global match should take place.
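The expression can be tried on the test string directly, without a page or a text area. This sketch uses console.log() in place of a form control. The last line uses a made-up phrase to show a limitation that the chapter's test string does not exercise: an apostrophe at the end of a plural possessive is treated like a closing quote.

```javascript
var myText = "'Hello World' said Mr. O'Connerly.\n" +
   "He then said 'My Name is O'Connerly, yes that's right, O'Connerly'.";
var myRegExp = /\B'|'\B/g;

console.log(myText.replace(myRegExp, '"'));
// "Hello World" said Mr. O'Connerly.
// He then said "My Name is O'Connerly, yes that's right, O'Connerly".

// The apostrophe after boys is followed by a space, so it is replaced too
console.log("the boys' books".replace(myRegExp, '"')); // the boys" books
```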

Try It Out: Replacing Single Quotes with Double Quotes

Let's look at an example using the regular expression just defined.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>example</title>
<script type="text/javascript">
function replaceQuote(textAreaControl)
{
   var myText = textAreaControl.value;
   var myRegExp = /\B'|'\B/g;
   myText = myText.replace(myRegExp,'"');
   textAreaControl.value = myText;
}
</script>
</head>
<body>
<form name="form1">
<textarea rows="20" cols="40" name="textarea1">
'Hello World' said Mr O'Connerly.
He then said 'My Name is O'Connerly, yes that's right, O'Connerly'.
</textarea>
<br>
<input type="button" value="Replace Single Quotes" name="buttonSplit"
   onclick="replaceQuote(document.form1.textarea1)">
</form>

</body>
</html>

Save the page as ch9_examp5.htm. Load the page into your browser and you should see what is shown in Figure 9-13.

Figure 9-13

Click the Replace Single Quotes button to see the single quotes in the text area replaced as in Figure 9-14.

Figure 9-14

Try entering your own text with single quotes into the text area and check the results.

You can see that by using regular expressions, you have completed a task in a couple of lines of simple code. Without regular expressions, it would probably take four or five times that amount.

Let's look first at the replaceQuote() function in the head of the page where all the action is.

function replaceQuote(textAreaControl)
{
   var myText = textAreaControl.value;
   var myRegExp = /\B'|'\B/g;
   myText = myText.replace(myRegExp,'"');
   textAreaControl.value = myText;
}

The function's parameter is the textarea object defined further down the page — this is the text area in which you want to replace the single quotes. You can see how the textarea object was passed in the button's tag definition.

<input type="button" value="Replace Single Quotes" name="buttonSplit"
   onclick="replaceQuote(document.form1.textarea1)">

In the onclick event handler, you call replaceQuote() and pass document.form1.textarea1 as the parameter — that is the textarea object.

Returning to the function, you get the value of the textarea on the first line and place it in the variable myText. Then you define your regular expression (as discussed previously), which matches any non-word boundary followed by a single quote or any single quote followed by a non-word boundary. For example, 'H will match, as will H', but O'R won't, because the quote is between two word boundaries. Don't forget that a word boundary is the position between the start or end of a word and a non-word character, such as a space or punctuation mark.

In the function's final two lines, you first use the replace() method to do the character pattern search and replace, and finally you set the textarea object's value to the changed string.

The search() Method

The search() method enables you to search a string for a pattern of characters. If the pattern is found, the character position at which it was found is returned, otherwise −1 is returned. The method takes only one parameter, the RegExp object you have created.

Although for basic searches the indexOf() method is fine, if you want more complex searches, such as a search for a pattern of any digits or one in which a word must be in between a certain boundary, then search() provides a much more powerful and flexible, but sometimes more complex, approach.

In the following example, you want to find out if the word Java is contained within the string. However, you want to look just for Java as a whole word, not part of another word such as JavaScript.

var myString = "Beginning JavaScript, Beginning Java 2, Professional JavaScript";
var myRegExp = /\bJava\b/i;
alert(myString.search(myRegExp));

First, you have defined your string, and then you've created your regular expression. You want to find the character pattern Java when it's on its own between two word boundaries. You've made your search case-insensitive by adding the i after the regular expression. Note that with the search() method, the g for global is not relevant, and its use has no effect.

On the final line, you output the position at which the search has located the pattern, in this case 32.
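To make the difference from indexOf() concrete, the sketch below runs both on the same string, and adds a search for a word that is not present. The Perl search is an invented extra case, and console.log() stands in for alert().

```javascript
var myString = "Beginning JavaScript, Beginning Java 2, Professional JavaScript";

// indexOf() finds the Java at the start of JavaScript
console.log(myString.indexOf("Java"));      // 10

// search() with word boundaries finds Java only as a whole word
console.log(myString.search(/\bJava\b/i));  // 32

// No match at all returns -1
console.log(myString.search(/\bPerl\b/i));  // -1
```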

The match() Method

For example, if you had the string

var myString = "The years were 1999, 2000 and 2001";

and wanted to extract the years from this string, you could do so using the match() method. To match each year, you are looking for four digits in between word boundaries. This requirement translates to the following regular expression:

var myRegExp = /\b\d{4}\b/g;

You want to match all the years so the g has been added to the end for a global search.

To do the match and store the results, you use the match() method and store the Array object it returns in a variable.

var resultsArray = myString.match(myRegExp);

To prove it has worked, let's use some code to output each item in the array. You've added an if statement to double-check that the results array actually contains an array. If no matches were made, the results array will contain null — doing if (resultsArray) will return true if the variable has a value and not null.

if (resultsArray)
{
  var indexCounter;
  for (indexCounter = 0; indexCounter < resultsArray.length; indexCounter++)
  {
     alert(resultsArray[indexCounter]);
  }
}

This would result in three alert boxes containing the numbers 1999, 2000, and 2001.
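The same steps can be run outside a browser with console.log() in place of alert(). The second string, which contains no years, and the variable name noMatches are made up to show the null result that the if statement guards against.

```javascript
var myString = "The years were 1999, 2000 and 2001";
var myRegExp = /\b\d{4}\b/g;

var resultsArray = myString.match(myRegExp);
console.log(resultsArray.length);     // 3
console.log(resultsArray.join(" "));  // 1999 2000 2001

// With no matches, match() returns null rather than an empty array
var noMatches = "No years here".match(myRegExp);
console.log(noMatches);               // null
```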

Try It Out: Splitting HTML

In the next example, you want to take a string of HTML and split it into its component parts. For example, you want HTML such as <p>Hello</p> to become an array, with the elements having the following contents:

<p>	Hello	</p>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>example 6</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script type="text/javascript">
function button1_onclick()
{
   var myString = "<table align=center><tr><td>";
   myString = myString + "Hello World</td></tr></table>";
   myString = myString +"<br><h2>Heading</h2>";
   var myRegExp = /<[^>\r\n]+>|[^<>\r\n]+/g;
   var resultsArray = myString.match(myRegExp);
   document.form1.textarea1.value = "";
   document.form1.textarea1.value = resultsArray.join("\r\n");
}
</script>
</head>
<body>
<form name="form1">
   <textarea rows="20" cols="40" name="textarea1"></textarea>
   <input type="button" value="Split HTML" name="button1"
      onclick="return button1_onclick();">
</form>

</body>
</html>

Save this file as ch9_examp6.htm. When you load the page into your browser and click the Split HTML button, a string of HTML is split, and each tag is placed on a separate line in the text area, as shown in Figure 9-15.

Figure 9-15

The function button1_onclick() defined at the top of the page fires when the Split HTML button is clicked. At the top, the following lines define the string of HTML that you want to split:

function button1_onclick()
{
   var myString = "<table align=center><tr><td>";
   myString = myString + "Hello World</td></tr></table>";
   myString = myString +"<br><h2>Heading</h2>";

Next you create your RegExp object and initialize it to your regular expression.

var myRegExp = /<[^>\r\n]+>|[^<>\r\n]+/g;

Let's break it down to see what pattern you're trying to match. First, note that the pattern is broken up by an alternation symbol: |. This means that you want the pattern on the left or the right of this symbol. You'll look at these patterns separately. On the left, you have the following:

The pattern must start with a <.
In [^>\r\n]+, you specify that you want one or more of any character except the > or a \r (carriage return) or a \n (linefeed).
> specifies that the pattern must end with a >.

On the right, you have only the following:

[^<>\r\n]+ specifies that the pattern is one or more of any character, so long as that character is not a <, >, \r, or \n. This will match plain text.

After the regular expression definition you have a g, which specifies that this is a global match.

So the <[^>\r\n]+> regular expression will match any start or close tags, such as <p> or </p>. The alternative pattern is [^<>\r\n]+, which will match any character pattern that is not an opening or closing tag.

In the following line, you assign the resultsArray variable to the Array object returned by the match() method:

var resultsArray = myString.match(myRegExp);

The remainder of the code deals with populating the text area with the split HTML. You use the Array object's join() method to join all the array's elements into one string with each element separated by a carriage return and linefeed (\r\n), so that each tag or piece of text goes on a separate line, as shown in the following:

   document.form1.textarea1.value = "";
   document.form1.textarea1.value = resultsArray.join("\r\n");
}
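To inspect the array that match() produces without loading the page, the same pattern can be run on its own. The following is a minimal sketch that assumes a plain JavaScript console (for example Node.js) rather than the form in ch9_examp6.htm; the splitHtml() helper name is made up for illustration.

```javascript
// Hypothetical helper (not part of ch9_examp6.htm): returns the tags and
// text runs found in a string of HTML.
function splitHtml(html)
{
   // Left of |: a complete tag. Right of |: a run of plain text.
   var myRegExp = /<[^>\r\n]+>|[^<>\r\n]+/g;
   // match() returns null when nothing matches, so fall back to an empty array
   return html.match(myRegExp) || [];
}

var myString = "<table align=center><tr><td>";
myString = myString + "Hello World</td></tr></table>";
myString = myString + "<br><h2>Heading</h2>";

var resultsArray = splitHtml(myString);

// One tag or piece of text per line, as in the text area
console.log(resultsArray.join("\r\n"));
```

Each tag and each piece of text ends up as its own array element, which is why join() can place them on separate lines.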

Using the RegExp Object's Constructor

So far you've been creating RegExp objects using the / and / characters to define the start and end of the regular expression, as shown in the following example:

var myRegExp = /[a-z]/;

Although this is the generally preferred method, it was briefly mentioned that a RegExp object can also be created by means of the RegExp() constructor. You might use the first way most of the time. However, there are occasions, as you'll see in the trivia quiz shortly, when the second way of creating a RegExp object is necessary (for example, when a regular expression is to be constructed from user input).

As an example, the preceding regular expression could equally well be defined as

var myRegExp = new RegExp("[a-z]");

Here you pass the regular expression as a string parameter to the RegExp() constructor function.

A very important difference when you are using this method is in how you use special regular expression characters, such as \b, that have a backward slash in front of them. The problem is that the backward slash indicates an escape character in JavaScript strings; for example, you may use \b, which means a backspace. To differentiate between \b meaning a backspace in a string and the \b special character in a regular expression, you have to put another backward slash in front of the regular expression special character. So \b becomes \\b when you mean the regular expression \b that matches a word boundary, rather than a backspace character.

For example, say you have defined your RegExp object using the following:

var myRegExp = /\b/;

To declare it using the RegExp() constructor, you would need to write this:

var myRegExp = new RegExp("\\b");

and not this:

var myRegExp = new RegExp("\b");

All special regular expression characters, such as \w, \b, \d, and so on, must have an extra \ in front when you create them using RegExp().
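The effect of the extra backslash is easy to check in a console. This is a small sketch; the word "cat" and the variable names are chosen only for illustration.

```javascript
// Literal form: the backslash is written once.
var literalRegExp = /\bcat\b/;

// Constructor form: each backslash is doubled inside the string.
var doubledRegExp = new RegExp("\\bcat\\b");

// Single backslashes: "\b" in a string is a backspace character, so this
// pattern looks for backspace + "cat" + backspace.
var singleRegExp = new RegExp("\bcat\b");

var sentence = "the cat sat";
console.log(literalRegExp.test(sentence));   // true
console.log(doubledRegExp.test(sentence));   // true
console.log(singleRegExp.test(sentence));    // false

// The doubled form produces exactly the same pattern text as the literal
console.log(doubledRegExp.source === literalRegExp.source);   // true
```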

When you defined regular expressions with the / and / method, you could add after the final / the special flags m, g, and i to indicate that the pattern matching should be multi-line, global, or case-insensitive, respectively. When using the RegExp() constructor, how can you do the same thing?

Easy. The optional second parameter of the RegExp() constructor takes the flags (m, g, and i) as a string. For example, this will do a global case-insensitive pattern match:

var myRegExp = new RegExp("hello\\b","gi");

You can specify just one of the flags if you wish, such as the following:

var myRegExp = new RegExp("hello\\b","i");

var myRegExp = new RegExp("hello\\b","g");
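The constructor form is most useful when part of the pattern is only known at run time, such as text typed by a user. The sketch below is one possible approach rather than the book's own code: escapeRegExp() and countWord() are hypothetical helper names, and the escaping step is an added assumption so that characters like . in the user's text are matched literally.

```javascript
// Hypothetical helper: puts a backslash in front of characters that have a
// special meaning in a regular expression.
function escapeRegExp(text)
{
   return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: counts whole-word, case-insensitive occurrences of
// word in text. Note the doubled backslashes in "\\b".
function countWord(text, word)
{
   var wordRegExp = new RegExp("\\b" + escapeRegExp(word) + "\\b", "gi");
   var matches = text.match(wordRegExp);
   return matches ? matches.length : 0;
}

console.log(countWord("Hello hello HELLO shell", "hello"));   // 3
console.log(countWord("Price: 1.50 or 1x50", "1.50"));        // 1
```

Without the escaping step, the . in 1.50 would act as a wildcard and the second call would also count 1x50.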

Try It Out: Form Validation Module

In this Try It Out, you'll create a set of useful JavaScript functions that use regular expressions to validate the following:

Telephone numbers
Postal codes
E-mail addresses

The validation only checks the format. So, for example, it can't check that the telephone number actually exists, only that it would be valid if it did.

First is the .js code file with the input validation code. Please note that the lines of code in the following block are too wide for the book — make sure each regular expression is contained on one line.

function isValidTelephoneNumber( telephoneNumber )
{
   var telRegExp = /^(\+\d{1,3} ?)?(\(\d{1,5}\)|\d{1,5}) ?\d{3,4} ?\d{0,7}( (x|xtn|ext|extn|pax|pbx|extension)?\.? ?\d{2,5})?$/i;
   return telRegExp.test( telephoneNumber );
}

function isValidPostalCode( postalCode )
{
   var pcodeRegExp = /^(\d{5}(-\d{4})?|([a-z][a-z]?\d\d?|[a-z]{2}\d[a-z]) ?\d[a-z][a-z])$/i;
   return pcodeRegExp.test( postalCode );
}

function isValidEmail( emailAddress )
{
   var emailRegExp = /^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$/i;
   return emailRegExp.test( emailAddress );
}

Save this as ch9_examp7_module.js.

To test the code, you need a simple page with a text box and three buttons that validate the telephone number, postal code, or e-mail address.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>example 7</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script type="text/javascript" src="ch9_examp7_module.js"></script>
</head>
<body>
<form name="form1">
  <p>
    <label>
      <input type="text" name="txtString" id="txtString" />
    </label>
  </p>
  <p>
    <label>
      <input type="button" name="cmdIsValidTelephoneNumber"
             id="cmdIsValidTelephoneNumber"
             value="Is Valid Telephone Number?"
             onclick="alert('Is valid is ' +
                      isValidTelephoneNumber( document.form1.txtString.value ))" />
      <input type="button" name="cmdIsValidPostalCode"
             id="cmdIsValidPostalCode"
             value="Is Valid Postal Code?"
             onclick="alert('Is valid is ' +
                      isValidPostalCode( document.form1.txtString.value ))" />
      <input type="button" name="cmdIsEmailValid" id="cmdIsEmailValid"
             value="Is Valid Email?"
             onclick="alert('Is valid is ' +
                      isValidEmail( document.form1.txtString.value ))" />
    </label>
  </p>
</form>

</body>
</html>

Save this as ch9_examp7.htm and load it into your browser, and you'll see a page with a text box and three buttons. Enter a valid telephone number (the example uses +1 (123) 123 4567), click the Is Valid Telephone Number button, and the screen shown in Figure 9-16 is displayed.

Figure 9-16

If you enter an invalid phone number, the result would be Is valid is false. This is pretty basic, but it's sufficient for testing your code.

The actual code is very simple, but the regular expressions are tricky to create, so let's look at those in depth starting with telephone number validation.

Telephone Number Validation

Telephone numbers are more of a challenge to validate. The problems are:

Phone numbers differ from country to country.
There are different ways of entering a valid number (for example, adding the national or international code or not).

For this regular expression, you need to specify more than just the valid characters; you also need to specify the format of the data. For example, all of the following are valid:

+1 (123) 123 4567

+1123123 456

+44 (123) 123 4567

+44 (123) 123 4567 ext 123

+44 20 7893 4567

The variations that our regular expression needs to deal with (optionally separated by spaces) are shown in the following table:

The international number	+ followed by one to three digits (optional)
The local area code	One to five digits, sometimes in parentheses (compulsory)
The actual subscriber number	Three or four digits, optionally followed by a space and up to seven more digits (compulsory)
An extension number	Two to five digits, optionally preceded by x, xtn, ext, extn, pax, pbx, or extension and a period (optional)

Obviously, there will be countries where this won't work, which is something you'd need to deal with based on where your customers and partners would be. The following regular expression is rather complex, and its length means it had to be split across two lines; make sure you type it in on one line.

^(\+\d{1,3} ?)?(\(\d{1,5}\)|\d{1,5}) ?\d{3,4} ?\d{0,7}
( (x|xtn|ext|extn|pax|pbx|extension)?\.? ?\d{2,5})?$

You will need to set the case-insensitive flag with this. Although this seems complex, if broken down, it's quite straightforward.

Let's start with the pattern that matches an international dialing code:

(\+\d{1,3} ?)?

So far, you're matching a plus sign (+) followed by one to three digits (\d{1,3}) and an optional space ( ?). Remember that since the + character is a special character, you add a \ character in front of it to specify that you mean an actual + character. The characters are wrapped inside parentheses to specify a group of characters. You allow an optional space and match this entire group of characters zero or one times, as indicated by the ? character after the closing parenthesis of the group.

Next is the pattern to match an area code:

(\(\d{1,5}\)|\d{1,5})

This pattern is contained in parentheses, which designate it as a group of characters, and matches either one to five digits in parentheses (\(\d{1,5}\)) or just one to five digits (\d{1,5}). Again, since the parenthesis characters are special characters in regular expression syntax and you want to match actual parentheses, you need the \ character in front of them. Also note the use of the pipe symbol (|), which means "OR" or "match either of these two patterns."

Next, let's match the subscriber number:

 ?\d{3,4} ?\d{0,7}

Note that there is a space before the first ? symbol: this space and question mark mean "match zero or one space." This is followed by three or four digits (\d{3,4}); although there are always three digits in the U.S., there are often four in the UK. Then there's another "zero or one space," and finally between zero and seven digits (\d{0,7}).

Finally, add the part to cope with an optional extension number:

( (x|xtn|ext|extn|pax|pbx|extension)?\.? ?\d{2,5})?

This group is optional, since its parentheses are followed by a question mark. The group itself checks for a space, optionally followed by x, xtn, ext, extn, pax, pbx, or extension, followed by zero or one periods (note the \ character, since . is a special character in regular expression syntax), followed by zero or one space, followed by between two and five digits. Putting these four patterns together, you can construct the entire regular expression, apart from the surrounding syntax. The regular expression starts with ^ and ends with $. The ^ character specifies that the pattern must be matched at the beginning of the string, and the $ character specifies that the pattern must be matched at the end of the string. This means that the string must match the pattern completely; it cannot contain any other characters before or after the pattern that is matched.

Therefore, with the regular expression explained, you can now add it to your JavaScript module ch9_examp7_module.js as follows:

function isValidTelephoneNumber( telephoneNumber )
{
               var telRegExp = /^(\+\d{1,3} ?)?
               (\(\d{1,5}\)|\d{1,5}) ?\d{3,4} ?\d{0,7}
               ( (x|xtn|ext|extn|pax|pbx|extension)?
               \.? ?\d{2,5})?$/i
               return telRegExp.test( telephoneNumber );
}

Note in this case that it is important to set the case-insensitive flag by adding an i on the end of the expression definition; otherwise, the regular expression could fail to match the ext parts. Please also note that the regular expression itself must be on one line in your code — it's shown in four lines here due to the page-width restrictions of this book.
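One way to keep a long pattern readable in your own code is to assemble it from the four sub-patterns described above. The following is a sketch of that idea, not the listing from the module: it assumes the extension quantifier is written {2,5} and the subscriber part as \d{3,4}, relies on the source property of each RegExp object to recover the pattern text, and uses variable names invented for this example.

```javascript
// The four sub-patterns from the walk-through, each as its own literal.
var intlCode   = /(\+\d{1,3} ?)?/;
var areaCode   = /(\(\d{1,5}\)|\d{1,5})/;
var subscriber = / ?\d{3,4} ?\d{0,7}/;
// Assumption: two to five extension digits, written {2,5}
var extension  = /( (x|xtn|ext|extn|pax|pbx|extension)?\.? ?\d{2,5})?/;

// source gives the pattern text without the surrounding slashes, so the
// pieces can be joined, anchored with ^ and $, and given the i flag.
var telRegExp = new RegExp(
   "^" + intlCode.source + areaCode.source +
   subscriber.source + extension.source + "$", "i");

function isValidTelephoneNumber( telephoneNumber )
{
   return telRegExp.test( telephoneNumber );
}

var samples = [ "+1 (123) 123 4567", "+1123123 456", "+44 (123) 123 4567",
                "+44 (123) 123 4567 ext 123", "+44 20 7893 4567",
                "not a number" ];
var i;
for (i = 0; i < samples.length; i++)
{
   console.log(samples[i] + " : " + isValidTelephoneNumber(samples[i]));
}
```

The five sample numbers listed earlier in this section all report true, and the last string reports false.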

Validating a Postal Code

We just about managed to check worldwide telephone numbers, but doing the same for postal codes would be something of a major challenge. Instead, you'll create a function that only checks for U.S. zip codes and UK postcodes. If you needed to check for other countries, the code would need modifying. You may find that checking more than one or two postal codes in one regular expression begins to get unmanageable, and it may well be easier to have an individual regular expression for each country's postal code you need to check. For this purpose though, let's combine the regular expression for the UK and the U.S.:

^(\d{5}(-\d{4})?|([a-z][a-z]?\d\d?|[a-z]{2}\d[a-z]) ?\d[a-z][a-z])$

This is actually in two parts: The first part checks for zip codes, and the second part checks UK postcodes. Start by looking at the zip code part.

Zip codes can be represented in one of two formats: as five digits (12345), or five digits followed by a dash and four digits (12345-1234). The zip code regular expression to match these is as follows:

\d{5}(-\d{4})?

This matches five digits, followed by an optional group that matches a dash, followed by four digits.

For a regular expression that covers UK postcodes, let's consider their various formats. UK postcode formats are one or two letters followed by either one or two digits, followed by an optional space, followed by a digit, and then two letters. Additionally, some central London postcodes look like this: SE2V 3ER, with a letter at the end of the first part. Currently, it is only some of those postcodes starting with SE, WC, and W, but that may change. Valid examples of UK postcodes include: CH3 9DR, PR29 1XX, M27 1AE, WC1V 2ER, and C27 3AH.

Based on this, the required pattern is as follows:

([a-z][a-z]?\d\d?|[a-z]{2}\d[a-z]) ?\d[a-z][a-z]

These two patterns are combined using the | character to "match one or the other" and grouped using parentheses. You then add the ^ character at the start and the $ character at the end of the pattern to be sure that the only information in the string is the postal code. Although postal codes should be uppercase, it is still valid for them to be lowercase, so you also set the case-insensitive option as follows when you use the regular expression:

^(\d{5}(-\d{4})?|([a-z][a-z]?\d\d?|[a-z]{2}\d[a-z]) ?\d[a-z][a-z])$

The following function needed for your validation module is much the same as it was with the previous example:

function isValidPostalCode( postalCode )
{
        var pcodeRegExp = /^(\d{5}(-\d{4})?|
        ([a-z][a-z]?\d\d?|[a-z]{2}\d[a-z]) ?\d[a-z][a-z])$/i
        return pcodeRegExp.test( postalCode );
}

Again please remember that the regular expression must be on one line in your code.
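A quick way to gain confidence in the pattern is to run it against the sample codes mentioned in this section. The sketch below assumes the London-style alternative is written [a-z]{2}\d[a-z]; the invalid samples are made up for illustration.

```javascript
function isValidPostalCode( postalCode )
{
   // Zip code first, then the two UK alternatives, all on one line
   var pcodeRegExp = /^(\d{5}(-\d{4})?|([a-z][a-z]?\d\d?|[a-z]{2}\d[a-z]) ?\d[a-z][a-z])$/i;
   return pcodeRegExp.test( postalCode );
}

var validCodes = [ "12345", "12345-1234", "CH3 9DR", "PR29 1XX",
                   "M27 1AE", "WC1V 2ER", "ch3 9dr" ];
// Made-up examples that break the format in different ways
var invalidCodes = [ "1234", "12345-12", "CH3 9D", "ABC 123" ];

var i;
for (i = 0; i < validCodes.length; i++)
{
   console.log(validCodes[i] + " : " + isValidPostalCode(validCodes[i]));
}
for (i = 0; i < invalidCodes.length; i++)
{
   console.log(invalidCodes[i] + " : " + isValidPostalCode(invalidCodes[i]));
}
```

Every entry in validCodes reports true and every entry in invalidCodes reports false; the lowercase ch3 9dr is accepted only because of the i flag.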

Validating an E-mail Address

Before working on a regular expression to match e-mail addresses, you need to look at the types of valid e-mail addresses you can have. For example:

Also, if you examine the SMTP RFC (http://www.ietf.org/rfc/rfc0821.txt), you can have the following:

[email protected]
"""Paul Wilton"""@somedomain.com

That's quite a list and contains many variations to cope with. It's best to start by breaking it down. First, there are a couple of things to note about the two immediately above. The latter two versions are exceptionally rare and are not provided for in the regular expression you'll create.

You need to break up the e-mail address into separate parts, and you will look at the part after the @ symbol, first.

Validating a Domain Name

Everything has become more complicated since Unicode domain names have been allowed. However, the e-mail RFC still doesn't allow these, so let's stick with the traditional definition of how a domain can be described using ASCII. A domain name consists of a dot-separated list of words, with the last word being between two and four characters long. It was often the case that if a two-letter country word was used, there would be at least two parts to the domain name before it: a grouping domain (.co, .ac, and so on) and a specific domain name. However, with the advent of the .tv names, this is no longer the case. You could make this very specific and provide for the allowed top-level domains (TLDs), but that would make the regular expression very large, and it would be more productive to perform a DNS lookup instead.

Each part of a domain name has certain rules it must follow. It can contain any letter or number or a hyphen, but it must start with a letter. The exception is that, at any point in the domain name, you can use a #, followed by a number, which represents the ASCII code for that letter, or in Unicode, the 16-bit Unicode value. Knowing this, let's begin to build up the regular expression, first with the name part, assuming that the case-insensitive flag will be set later in the code.

([a-z]|#\d+)([a-z0-9-]|#\d+)*([a-z0-9]|#\d+)

This breaks the domain into three parts. The RFC doesn't specify how many digits can be contained here, so neither will we. The first part must only contain an ASCII letter; the second must contain zero or more of a letter, number, or hyphen; and the third must contain either a letter or number. The top-level domain has more restrictions, as shown here:

[a-z]{2,4}

This restricts you to a two, three, or four letter top-level domain. So, putting it all together, with the periods you end up with this:

^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$

Again, the domain name is anchored at the beginning and end of the string. The first thing is to add an extra group to allow one or more name. portions and then anchor a two-to-four-letter domain name at the end in its own group. We have also made most of the wildcards lazy. Because much of the pattern is similar, it makes sense to do this; otherwise, it would require too much backtracking. However, you have left the second group with a "greedy" wildcard: It will match as much as it can, up until it reaches a character that does not match. Then it will only backtrack one position to attempt the third group match. This is more resource-efficient than a lazy match is in this case, because it could be constantly going forward to attempt the match. One backtrack per name is an acceptable amount of extra processing.
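The domain pattern can be tried on its own before it is combined with the rest of the e-mail expression. This is a small sketch; the domain names in the array are illustrative, and the case-insensitive flag is set as the text assumes.

```javascript
var domainRegExp = /^(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$/i;

function isValidDomain( domainName )
{
   return domainRegExp.test( domainName );
}

var domains = [ "somedomain.com",        // true: one name plus a three-letter TLD
                "mail.example.co.uk",    // true: several dot-separated names
                "-bad.com",              // false: a name must start with a letter
                "example",               // false: no top-level domain
                "example.museum" ];      // false: TLD longer than four letters

var i;
for (i = 0; i < domains.length; i++)
{
   console.log(domains[i] + " : " + isValidDomain(domains[i]));
}
```

The last entry shows the cost of the {2,4} restriction: longer top-level domains are rejected, so the upper limit would need raising if you have to accept them.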

Validating a Person's Address

You can now attempt to validate the part before the @ sign. The RFC specifies that it can contain any ASCII character with a code in the range from 33 to 126. You are assuming that you are matching against ASCII only, so you can assume that there are only 128 characters that the engine will match against. This being the case, it is simpler to just exclude the required values as follows:

[^<>()\[\],;:@"\x00-\x20\x7F]+

Using this, you're saying that you allow any number of characters, as long as none of them are those contained within the square brackets. The [, ], and \ characters have to be escaped. However, the RFC allows for other kinds of matches.

Validating the Complete Address

Now that you have seen all the previous sections, you can build up a regular expression for the entire e-mail address. First, here's everything up to and including the @ sign:

^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@

That was straightforward. Now for the domain name part.

^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@(([a-z]|#\d+?)([a-z0-9-]
|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$

We've had to put it on two lines to fit this book's page width, but in your code this must all be on one line.

Finally, let's create the function for the JavaScript module.

function isValidEmail( emailAddress )
{
        var emailRegExp = /^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+
@(([a-z]|#\d+?)([a-z0-9-]|
#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$/i
        return emailRegExp.test( emailAddress );
}

Please note the regular expression must all be on one line in your code.
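Written on a single line, the function can be exercised from a console. The addresses below are made-up samples used only to illustrate which formats the pattern accepts and rejects.

```javascript
function isValidEmail( emailAddress )
{
   // Local part, then @, then one or more "name." portions and a 2-4 letter TLD
   var emailRegExp = /^([^<>()\[\],;:@"\x00-\x20\x7F]|\\.)+@(([a-z]|#\d+?)([a-z0-9-]|#\d+?)*([a-z0-9]|#\d+?)\.)+([a-z]{2,4})$/i;
   return emailRegExp.test( emailAddress );
}

var addresses = [ "someone@example.com",             // true
                  "first.last@mail.example.co.uk",   // true
                  "someone@example",                 // false: no top-level domain
                  "two words@example.com",           // false: space in the name
                  "someone@@example.com",            // false: second @ sign
                  "@example.com" ];                  // false: nothing before the @

var i;
for (i = 0; i < addresses.length; i++)
{
   console.log(addresses[i] + " : " + isValidEmail(addresses[i]));
}
```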

With the module completed, let's take a look at the code to test the module.

First, the module is linked to the test page like this:

<script type="text/javascript" src="ch9_examp7_module.js"></script>

Then each of the three test buttons has its click events linked to the validation functions in the module as follows:

      <input type="button" name="cmdIsValidTelephoneNumber"
             id="cmdIsValidTelephoneNumber"
             value="Is Valid Telephone Number?"
             onclick="alert('Is valid is ' +
                      isValidTelephoneNumber( document.form1.txtString.value ))" />
      <input type="button" name="cmdIsValidPostalCode" id="cmdIsValidPostalCode"
             value="Is Valid Postal Code?"
             onclick="alert('Is valid is ' +
                      isValidPostalCode( document.form1.txtString.value ))" />
      <input type="button" name="cmdIsEmailValid" id="cmdIsEmailValid"
             value="Is Valid Email?"
             onclick="alert('Is valid is ' +
                      isValidEmail( document.form1.txtString.value ))" />

Taking the telephone validation test button as an example, an onclick event handler is added:

onclick="alert('Is valid is ' +
isValidTelephoneNumber( document.form1.txtString.value ))"

This shows an alert box returning the true or false value from the isValidTelephoneNumber() function in your validation module. In a non-test situation, you'd want a more user-friendly message. The other two test buttons work in the same way but just call different validation functions.
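As one illustration of a friendlier message, the true or false result can be turned into a sentence before it is shown. This is only a sketch: getValidationMessage() and isFiveDigits() are hypothetical names, the wording of the messages is invented, and any of the module's validation functions could be passed in place of the stand-in validator.

```javascript
// Hypothetical helper: turns a validation result into a message for the user.
// validator is any function that takes a string and returns true or false,
// such as isValidTelephoneNumber() from the module.
function getValidationMessage( fieldName, value, validator )
{
   if (value === "")
   {
      return "Please enter your " + fieldName + ".";
   }
   if (validator( value ))
   {
      return "Thank you, that " + fieldName + " looks fine.";
   }
   return "Sorry, \"" + value + "\" is not a recognized " + fieldName +
          ". Please check it and try again.";
}

// Stand-in validator so the sketch runs on its own
function isFiveDigits( value )
{
   return /^\d{5}$/.test( value );
}

console.log(getValidationMessage("zip code", "12345", isFiveDigits));
console.log(getValidationMessage("zip code", "12ab", isFiveDigits));
console.log(getValidationMessage("zip code", "", isFiveDigits));
```

In the test page, the returned string could be passed to alert() in place of the 'Is valid is ' text.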

Summary

In this chapter you've looked at some more advanced methods of the String object and how you can optimize their use with regular expressions.

To recap, the chapter covered the following points:

The split() method splits a single string into an array of strings. You pass a string or a regular expression to the method that determines where the split occurs.
The replace() method enables you to replace a pattern of characters with another pattern that you specify as a second parameter.
The search() method returns the character position of the first pattern matching the one given as a parameter.
The match() method matches patterns, returning the text of the matches in an array.
Regular expressions enable you to define a pattern of characters that you want to match. Using this pattern, you can perform splits, searches, text replacement, and matches on strings.
In JavaScript the regular expressions are in the form of a RegExp object. You can create a RegExp object using either myRegExp = /myRegularExpression/ or myRegExp = new RegExp("myRegularExpression"). The second form requires that certain special characters that normally have a single \ in front now have two.
The g and i characters at the end of a regular expression (as in, for example, myRegExp = /Pattern/gi;) ensure that a global and case-insensitive match is made.
As well as specifying actual characters, regular expressions have certain groups of special characters, which allow any of certain groups of characters, such as digits, words, or non-word characters, to be matched.
Special characters can also be used to specify pattern or character repetition. Additionally, you can specify what the pattern boundaries must be, for example at the beginning or end of the string, or next to a word or non-word boundary.
Finally, you can define groups of characters that can be used later in the regular expression or in the results of using the expression with the replace() method.
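The four methods recapped above can be compared side by side in a few lines. The strings and variable names here are made up for illustration.

```javascript
// split(): break a string into an array wherever the pattern matches
var fruits = "apple , banana,cherry".split(/\s*,\s*/);
console.log(fruits);         // [ 'apple', 'banana', 'cherry' ]

// replace(): reuse captured groups in the replacement with $1 and $2
var fullName = "Wilton, Paul".replace(/(\w+), (\w+)/, "$2 $1");
console.log(fullName);       // Paul Wilton

// search(): character position of the first match, or -1 if there is none
var yearPosition = "The year 2001".search(/\d+/);
console.log(yearPosition);   // 9

// match(): with the g flag, an array containing every match
var years = "1999, 2000 and 2001".match(/\d{4}/g);
console.log(years);          // [ '1999', '2000', '2001' ]
```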

In the next chapter, you'll take a look at using and manipulating dates and times using JavaScript, and time conversion between different world time zones. Also covered is how to create a timer that executes code at regular intervals after the page is loaded.

Exercise Questions

Suggested solutions to these questions can be found in Appendix A.

What problem does the following code solve?
```
var myString = "This sentence has has a fault and and we need to fix it."
var myRegExp = /(\w+) \1/g;
myString = myString.replace(myRegExp,"$1");
```
Now imagine that you change that code, so that you create the RegExp object like this:
```
var myRegExp = new RegExp("(\w+) \1");
```
Why would this not work, and how could you rectify the problem?
Write a regular expression that finds all of the occurrences of the word "a" in the following sentence and replaces them with "the":
"a dog walked in off a street and ordered a finest beer"
The sentence should become:
"the dog walked in off the street and ordered the finest beer"
Imagine you have a web site with a message board. Write a regular expression that would remove barred words. (You can make up your own words!)
