© Russ Ferguson and Keith Cirkel 2017
Russ Ferguson and Keith Cirkel, JavaScript Recipes, 10.1007/978-1-4302-6107-0_3

3. Working with Strings

Russ Ferguson (1) and Keith Cirkel (2)
(1) Ocean, New Jersey, USA
(2) London, UK

Understanding Unicode Strings

Caution
Unicode code point escape codes are an ES6 feature. Some browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to have a better understanding of Unicode strings.

Solution

JavaScript uses the Unicode standard for encoding characters. In JavaScript, Unicode characters can be expressed manually using the \uXXXX (Unicode code unit) format or the \u{XXXXX} (Unicode code point) format. The code point format only works in ES6-compatible browsers.

The Code

console.log('\u0068\u0065\u006c\u006c\u006f');
console.log('\u{68}\u{65}\u{6c}\u{6c}\u{6f}');
console.log('\u0061\u0041\u0062\u0042\u0063\u0043');
console.log('you can mix \u0075nicode escape codes');
Listing 3-1.
Understanding Unicode Strings
The output is:
hello
hello
aAbBcC
you can mix unicode escape codes

How It Works

JavaScript strings are actually a collection of 16-bit integers; the underlying language does not have a concept of characters like A, B or C like we do. It uses these 16-bit integers in combination with the Unicode UTF-16 encoding format, which tells JavaScript interpreters how to represent these 16-bit integers. Every unique integer has a unique character to go with it; for example, the integer 65 (0x41 in Hexadecimal) represents the character ‘A’ (Latin Capital Letter A), 66 (0x42) represents the letter ‘B’ (Latin Capital Letter B), and so on (for the curious, you can see the huge selection of Unicode character references at http://unicode-table.com/ ).
Unicode calls these 16-bit integers code units. Each code unit is usually written as four hexadecimal digits. When you type a character in a string (for example, 'A'), it is stored as its 16-bit integer; the exception is escape codes (which are covered a bit more later in this chapter). You can enter UTF-16 code units manually by using the \u escape code followed by four hexadecimal digits, and JavaScript will convert this into a UTF-16 code unit, just like the rest of the string. Consider the string "hello"; it is actually stored as the code units 0x0068, 0x0065, 0x006C, 0x006C, and 0x006F. As an aside, you can express the string as "\u0068\u0065\u006C\u006C\u006F" and JavaScript interpreters will represent it as "hello"; the two strings are identical in JavaScript.
The set of characters that spans from \u0000 to \uFFFF is called the Basic Multilingual Plane, or BMP. It covers the Latin alphabet (for example, \u0061 is the Latin Small Letter A, "a") and many, many others, including Japanese Katakana characters (for example, \u30C1 is the Katakana Letter Ti, "チ"). In fact, the BMP covers most languages’ alphabets and syllabaries, and includes special ranges such as the Unicode Private Use Area (\uE000 to \uF8FF), which is deliberately left unspecified so that users of Unicode can define it for themselves. There is another specific range worth mentioning inside the BMP, and that is the Surrogates range (\uD800 to \uDFFF).
To extend the Unicode range past the initial 65,536 code points that can be expressed in four hexadecimal digits, there are other planes (17 in total, including the BMP) that extend the code point range from 16 bits to 21 bits (up to 0x10FFFF). One of them is the SMP (Supplementary Multilingual Plane), which contains more esoteric blocks such as Byzantine Musical Symbols, Domino Tiles, and Alchemical Symbols, as well as the popular Emoticons range of characters. To access the extra characters on these planes in JavaScript, you have to use the Surrogates special range in the BMP.
Surrogates are special Unicode code units that on their own do not represent a character. When a High Surrogate (\uD800 to \uDBFF) is followed by a Low Surrogate (\uDC00 to \uDFFF), the resulting 32-bit sequence represents a single character; this is known as a Surrogate Pair. Each High Surrogate selects a block of 1,024 code points on one of the supplementary planes, such as the SMP, and the Low Surrogate picks one code point inside that block. When a Unicode-aware program such as a JavaScript interpreter sees a High Surrogate code unit, it knows to expect a Low Surrogate code unit afterward.
Take, for example, the Emoticon character “Open Book” (\uD83D\uDCD6): the High Surrogate, \uD83D, covers the SMP range 0x1F400 to 0x1F7FF, which includes Emoticons (which range from 0x1F600 to 0x1F64F). The Low Surrogate, \uDCD6, specifies the Open Book character inside that range (for further examples, \uD83D\uDCD7 is Green Book and \uD83D\uDCD8 is Blue Book). In ES6-compatible browsers, surrogate pairs can be defined using the Unicode code point escape code (\u{XXXXX}) as a single hex code; for example, \uD83D\uDCD6 becomes \u{1F4D6}.
Because Unicode has so many character sets, often overlapping in terms of appearance, it has a problem with what are known as “confusables.” Take, for example, the Greek Small Letter Alpha (α, \u03B1). In certain typefaces, especially when capitalized (Α, \u0391), this letter can look identical to the Latin letter A (\u0041 for capital A, \u0061 for lowercase a). There is a potential for phishing or spoofing attacks that use lookalike names unless your program validates them properly. As an example, if your name was “Alan” ('\u0041\u006C\u0061\u006E', all Latin characters), a would-be attacker could write this as “Αlаn” ('\u0391\u006C\u0430\u006E'), that is, a Greek Capital Letter Alpha, a Latin Small Letter L, a Cyrillic Small Letter A, and a Latin Small Letter N. Half of the characters in this string are different, but to the human eye the two are almost indiscernible.
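The relationship between a code point and its Surrogate Pair is plain arithmetic. The following is a minimal sketch of the standard UTF-16 calculation; the function name toSurrogatePair is made up for this illustration and is not a built-in.

```javascript
// Hypothetical helper: splits a supplementary code point (above 0xFFFF)
// into its UTF-16 High and Low Surrogate code units.
function toSurrogatePair(codePoint) {
    var offset = codePoint - 0x10000;      // leaves a 20-bit value
    var high = 0xD800 + (offset >> 10);    // top 10 bits select the High Surrogate
    var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits select the Low Surrogate
    return [high, low];
}

var pair = toSurrogatePair(0x1F4D6); // Open Book
console.log(pair[0].toString(16), pair[1].toString(16)); // d83d dcd6
console.log(String.fromCharCode(pair[0], pair[1]) === '\uD83D\uDCD6'); // true
```

Because the top 10 bits only change every 1,024 code points, every character from 0x1F400 to 0x1F7FF shares the High Surrogate \uD83D.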

Using Special Characters (Escape Codes) in Strings

Problem

You want to be able to express certain characters in a string, such as a newline, tab indentation, backslashes, or quotes.

Solution

JavaScript strings have a set of escape codes, denoted by the backslash (\). The character directly after the backslash determines the specific escape code. Earlier, you learned about the Unicode escape sequence, \u. Eight others are commonly used: \' (single quote), \" (double quote), \\ (backslash), \n (newline), \r (carriage return), \t (tab), \b (backspace), and \f (form feed).

The Code

console.log('this\nis\na\nmultiline\nstring');
console.log('\tthis string has tabs instead of spaces');
console.log('this string uses \'single quotes\', and includes escaped \'single quotes\' inside it');
Listing 3-2.
Using Special Characters in Strings
The output is:
this
is
a
multiline
string
      this string has tabs instead of spaces
this string uses 'single quotes', and includes escaped 'single quotes' inside it

How It Works

JavaScript strings are scanned for the escape character (the \ character). If a string contains a \ character, the interpreter expects the characters directly after it to form an escape code sequence. Escape codes are primarily useful for entering characters that you couldn't otherwise type into a string literal.
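To see that each escape code really is a single character, you can inspect its length and numeric value. This is a small illustrative sketch; the escapes object and its labels exist only for this example.

```javascript
// Each escape sequence produces exactly one code unit in the string.
var escapes = {
    'single quote': '\'',
    'double quote': '\"',
    'backslash': '\\',
    'newline': '\n',
    'carriage return': '\r',
    'tab': '\t',
    'backspace': '\b',
    'form feed': '\f'
};

for (var name in escapes) {
    // charCodeAt(0) reveals the numeric value behind each escape code.
    console.log(name + ': length ' + escapes[name].length +
        ', code unit ' + escapes[name].charCodeAt(0));
}
// The first line printed is "single quote: length 1, code unit 39".
```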

Comparing Two Strings for Equality

Problem

You want to determine if two strings have the same or different content.

Solution

As described in Chapter 2, the easiest way to compare two string values is with the equality operators.

The Code

if ('hello' === 'goodbye') {
    console.log('"hello" is equal to "goodbye"');
} else {
    console.log('"hello" is NOT equal to "goodbye"');
}
if ('A' === 'a') {
    console.log('"A" is equal to "a"');
} else {
    console.log('"A" is NOT equal to "a"');
}
if ('hello' === '\u0068\u0065\u006c\u006c\u006f') {
    console.log('"hello" is equal to "\u0068\u0065\u006c\u006c\u006f"');
} else {
    console.log('"hello" is NOT equal to "\u0068\u0065\u006c\u006c\u006f"');
}
if ('Alan' === '\u0391\u006C\u0430\u006E') {
    console.log('"Alan" is equal to "\u0391\u006C\u0430\u006E" ("Alan")');
} else {
    console.log('"Alan" is NOT equal to "\u0391\u006C\u0430\u006E" ("Alan")');
}
Listing 3-3.
Comparing Two Strings for Equality
The output is:
"hello" is NOT equal to "goodbye"
"A" is NOT equal to "a"
"hello" is equal to "hello"
"Alan" is NOT equal to "Alan" ("Alan")

How It Works

The equality operators are a built-in part of JavaScript and are low-level building blocks that can be used for all datatypes, including strings. More detail on each of the equality operators can be found in Chapter 2.
Strings that are literal values are compared for their content; the interpreter goes through each Unicode code unit in each string and compares the numeric values. If any Unicode code unit in the left hand operand differs from the corresponding code unit in the right hand operand, then the strings are not equal and the operation returns false. If all Unicode code units have the same value across the whole of both strings, the operation returns true. This concept is important to grasp because, as suggested in the first section of this chapter, a user could make two strings that look identical on screen but are nevertheless unequal.
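The comparison described above can be sketched by hand. The helper below is hypothetical (the === operator does this work for you); it reports where two strings first diverge, which can be handy when two strings look alike but compare as unequal.

```javascript
// Hypothetical helper: returns the index of the first code unit that
// differs between two strings, or -1 if the strings are identical.
function firstDifference(left, right) {
    var longest = Math.max(left.length, right.length);
    for (var i = 0; i < longest; i++) {
        // charCodeAt() returns NaN past the end of a string, and NaN is
        // never equal to anything, so a length mismatch is also reported.
        if (left.charCodeAt(i) !== right.charCodeAt(i)) {
            return i;
        }
    }
    return -1;
}

console.log(firstDifference('hello', '\u0068\u0065\u006c\u006c\u006f')); // -1
console.log(firstDifference('Alan', '\u0391\u006C\u0430\u006E'));         // 0
console.log(firstDifference('abc', 'abcd'));                              // 3
```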

Determining a String’s Length

Problem

You want to determine how long a string is, in characters.

Solution

Every string in JavaScript has a .length property, which can be used to ascertain how many characters make up the string.

The Code

console.log( '"hello" is ' + ( 'hello'.length ) + ' characters long' );
console.log( '"1234" is ' + ( '1234'.length ) + ' characters long' );
console.log( '"0061" is ' + ( '0061'.length ) + ' characters long' );
console.log( '"\u0061" is ' + ( '\u0061'.length ) + ' characters long' );
console.log( '"length" is ' + ( 'length'.length ) + ' characters long' );
console.log( '"Mixed\u0055nicode" is ' + ( 'Mixed\u0055nicode'.length ) + ' characters long' );
console.log( '"\uD83D\uDCD6" is ' + ( '\uD83D\uDCD6'.length ) + ' characters long' );
Listing 3-4.
Determining a String's Length
The output is:
"hello" is 5 characters long
"1234" is 4 characters long
"0061" is 4 characters long
"a" is 1 characters long
"length" is 6 characters long
"MixedUnicode" is 12 characters long
"📖" is 2 characters long

How It Works

The .length property is an “automatic” property: it is set when the string is created and is read-only. It cannot be assigned a value, but when accessed it returns the total number of characters in the string.
There is a small caveat with a string's length: it actually represents the number of Unicode code units, not the number of individual characters (or code points) inside the string. As established earlier, the Open Book Emoticon is actually composed of two Unicode code units, \uD83D and \uDCD6. The problem is that JavaScript's .length property only counts code units, and as such the Open Book Emoticon has a length of 2 ('\uD83D\uDCD6'.length === 2). Note that this behavior is the same in many other programming languages, such as Java, Perl, and C# (.NET). On the other hand, Python 3 and Ruby tend to count code points rather than code units, and so display the “correct” length.
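If you need the number of code points instead, one possible approach in an ES6 engine is to count the values produced by the string iterator. The sketch below assumes ES6 support; countCodePoints is a made-up name, not a built-in.

```javascript
// Hypothetical helper: counts code points rather than code units.
// Assumes an ES6 engine, because it relies on the string iterator.
function countCodePoints(str) {
    var count = 0;
    for (var ch of str) {
        count++; // one iteration per code point, even for surrogate pairs
    }
    return count;
}

var openBook = '\uD83D\uDCD6';
console.log(openBook.length);              // 2
console.log(countCodePoints(openBook));    // 1
console.log(countCodePoints('hello'));     // 5
```

Array.from(str).length gives the same count, at the cost of building a temporary array.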
If you concatenate large strings or perform this often, you may run into performance issues. Because strings are immutable, performance over time suffers from creating new strings based on the previous strings.

Concatenating Strings with the + Operator

Problem

You want to be able to combine two strings together, to form one string.

Solution

As briefly mentioned, you may use the + operator to concatenate two strings. With strings, it works differently than it does as the addition operator, which adds numbers together. If either operand is a string, the + operator converts the other operand to a string and joins the two, resulting in a new string.

The Code

console.log( 'hello ' + 'world' );
console.log( 'hello' + ' ' + 'world' );
console.log( 'strings' + ' ' + 'can' + ' ' + 'be' + ' ' + 'concatenated' + ' ' + 'multiple' + ' ' + 'times');
console.log('A' + '\uD83D');
console.log('A' + '\uD83D' + '\uDCD6');
Listing 3-5.
Concatenating Strings with the + Operator
The output is:
hello world
hello world
strings can be concatenated multiple times
A
A📖

How It Works

With a string operand, the addition operator (+) concatenates its operands as strings rather than adding the values together as numbers (described in Chapter 2). It is important to note that this operation does not modify its operands: any strings used in the operation remain unaffected.
Like most of JavaScript's string operations, string concatenation has an interesting interaction with surrogates. If you concatenate a string ending in a High Surrogate code unit with a string beginning with a Low Surrogate code unit, the combined string will feature the code point for that Surrogate Pair, as you may expect. For example, 'A\uD83D' + '\uDCD6' results in 'A\uD83D\uDCD6' (the Latin Capital Letter A followed by an Open Book Emoticon).
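The surrogate behavior can be checked directly. This short sketch uses illustrative variable names, and the codePointAt() call assumes an ES6 engine.

```javascript
var highPart = 'A\uD83D';   // ends in a lone High Surrogate
var lowPart = '\uDCD6';     // a lone Low Surrogate
var combined = highPart + lowPart;

console.log(combined === 'A\uD83D\uDCD6');        // true
console.log(combined.length);                     // 3
console.log(combined.codePointAt(1) === 0x1F4D6); // true (ES6 engines only)

// The operands are untouched by the concatenation.
console.log(highPart.length, lowPart.length);     // 2 1
```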

Getting a Single Character from a String

Problem

You want to retrieve a single character from a string, at a given index (position).

Solution

Strings can be treated similarly to arrays (discussed in Chapter 7), in that you can extract any index using bracket notation ([]). It returns either a single character string, or undefined if the index you supply is out of range (equal to or greater than the length of the string). The alternative is to use the method String.prototype.charAt(), which exists on all string objects and primitives. It takes one argument, which is a number index value that corresponds to the character you want to retrieve. It always returns a string: either a single character, or an empty string if the index you supply is out of range. An index that cannot be converted to a number is treated as 0.

The Code

console.log( 'abc'[0] );
console.log( 'abc'[1] );
console.log( 'abc'[2] );
console.log( 'abc'[4] );
console.log( 'abc'.charAt(0) );
console.log( 'abc'.charAt(1) );
console.log( 'abc'.charAt(2) );
console.log( 'abc'.charAt(4) );
Listing 3-6.
Getting a Single Character from a String
The output is:
a
b
c
undefined
a
b
c
(an empty string, '')

How It Works

Bracket notation is a specific piece of syntax in JavaScript that works on strings, arrays, and objects. Every string has a set of properties from 0 to one less than the length of the string, which can be accessed using this notation. For example, the string 'hello' has the properties 0 ('h'), 1 ('e'), 2 ('l'), 3 ('l') and 4 ('o'). You access each property by passing the appropriate numerical key in between the square brackets; for example, 'hello'[4] returns 'o'. Each property represents a UTF-16 code unit at the given position.
String.prototype.charAt() retrieves the UTF-16 code unit at the given index. The index is coerced into an integer (using the ToInteger function: section 7.1.4 in ES6, section 9.4 in ES5), meaning numerical strings will be coerced to a number value. If the resulting value is NaN, it is treated as 0, so the function returns the first character. The index starts from 0 for the first character, 1 for the second, and so on up to one less than the string's length.
Like most string functions, both the bracket notation and String.prototype.charAt() only deal with code units, not individual characters (or code points) inside of a string. This is fine for most strings, but you can run into problems. As established earlier, the Open Book Emoticon is actually composed of two Unicode code units, \uD83D and \uDCD6. This means the expression '\uD83D\uDCD6'[0] or '\uD83D\uDCD6'.charAt(0) returns '\uD83D', which is not a valid character on its own (it's a High Surrogate). This can be avoided in ES6 platforms by using String.prototype.codePointAt().
It should be pointed out that bracket notation for strings was first introduced in ES5 and so really old browsers that are ES3 compatible, such as Internet Explorer 7, can't actually use this syntax. But virtually no one uses non-ES5 compatible browsers today, so you should have no problems.
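The differences between the two approaches show up with out-of-range and non-integer indexes. The following sketch simply exercises both on a sample string.

```javascript
var word = 'abc';

// Out-of-range indexes: brackets give undefined, charAt() gives ''.
console.log(word[4] === undefined);   // true
console.log(word.charAt(4) === '');   // true

// charAt() coerces its argument to an integer; brackets look up a property key.
console.log(word.charAt('1'));        // b
console.log(word.charAt(1.9));        // b (the fraction is dropped)
console.log(word['1']);               // b (the key '1' exists on the string)
console.log(word[1.9] === undefined); // true (there is no key '1.9')
```

If you want a guaranteed string result that is safe to concatenate, charAt() is the more predictable choice.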

Creating a String of UTF Character Code Units with fromCharCode()

Problem

You want to be able to create a string using number-based character codes as input. String literals with Unicode escaping (e.g., \u0061) will not suffice.

Solution

Unicode escape characters, as mentioned, will only work when you hard-code strings. String.fromCharCode(), however, can be used to make strings programmatically using character codes. String.fromCharCode() takes numbers as arguments and turns them into UTF string characters. It takes unlimited arguments and returns a combined string of each character code.

The Code

console.log( String.fromCharCode(0x61) );
console.log( String.fromCharCode(0x61) + String.fromCharCode(0x61) );
console.log( String.fromCharCode(0x68, 0x65, 0x6C, 0x6C, 0x6F) );
console.log( String.fromCharCode(119, 111, 114, 108, 100) );
console.log( String.fromCharCode('0x61') ); // note this is a String of a hexadecimal
console.log( String.fromCharCode(0xD83D, 0xDCD6) ); // Open Book Emoticon
console.log( String.fromCharCode(0x10102) ); // Number > 16 bits
Listing 3-7.
Creating a String of UTF Character Code Units with fromCharCode()
The output is:
a
aa
hello
world
a
📖 (UTF character "Open Book")
Ă

How It Works

String.fromCharCode() can take any numerical value, even strings that contain numeric values, as the function coerces each argument to a number before using it. More specifically, each number is converted to a 16-bit integer, and any value that needs more than 16 bits is truncated to its lowest 16 bits, as shown in the code example, where 0x10102 (Aegean Check Mark, “𐄂”) is truncated to 0x0102 (Latin Capital Letter A with Breve, “Ă”).
As you'd expect, you can also use UTF-16 Surrogate Pairs with String.fromCharCode(), meaning the archetypal Open Book Emoticon renders successfully in the example. As also shown in the example, string concatenation works with the return values, since they are ordinary strings.
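The truncation can be verified with a bit mask, and because the function accepts any number of arguments it also combines well with Function.prototype.apply(). This is a brief sketch; the units array is made up for the example.

```javascript
// Values above 0xFFFF are truncated to their lowest 16 bits.
console.log(String.fromCharCode(0x10102) === String.fromCharCode(0x0102)); // true
console.log(String.fromCharCode(0x10102) === '\u0102');                    // true

// The same truncation can be written with a bit mask.
var truncated = 0x10102 & 0xFFFF;
console.log(truncated.toString(16)); // 102

// Building a string from an array of code units.
var units = [0x68, 0x65, 0x6C, 0x6C, 0x6F];
var built = String.fromCharCode.apply(null, units);
console.log(built); // hello
```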

Creating a String of UTF Code Points with fromCodePoint()

Caution
String.fromCodePoint() is an ES6 feature. Some browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to be able to create a string from a whole UTF code point (sometimes comprised of multiple code units).

Solution

String.fromCodePoint() works almost the same as String.fromCharCode(), with the exception that it can handle numbers (or hexadecimal numbers) larger than 16 bits. This makes it ideal for creating strings using the upper planes of Unicode, such as Emoticons. String.fromCodePoint() takes numbers as arguments and turns them into UTF string characters. It takes unlimited arguments and returns a combined string of each code point.

The Code

console.log( String.fromCodePoint(0x61) );
console.log( String.fromCodePoint(0x61) + String.fromCodePoint(0x61) );
console.log( String.fromCodePoint(0x68, 0x65, 0x6C, 0x6C, 0x6F) );
console.log( String.fromCodePoint(119, 111, 114, 108, 100) );
console.log( String.fromCodePoint('0x61') ); // note this is a String of a hexadecimal
console.log( String.fromCodePoint(0xD83D, 0xDCD6) ); // Open Book Emoticon
console.log( String.fromCodePoint(0x1F4D6) ); // Open Book Emoticon
console.log( String.fromCodePoint(0x10102) ); // Number > 16 bits
Listing 3-8.
Creating a String of UTF Code Points with fromCodePoint()
The output is:
a
aa
hello
world
a
📖
📖
𐄂

How It Works

String.fromCodePoint() can take any numerical value, even strings that contain numeric values, as the function coerces each argument to a number before using it. Unlike String.fromCharCode(), which truncates numbers to 16-bit integers (as you can see in the previous recipe, where 0x10102 is output as 0x0102), String.fromCodePoint() handles the full range of code points, and will actually throw a RangeError if it is given a value that is not a valid code point (a negative number, a non-integer, or a number greater than 0x10FFFF). While this is an inconsistency between the two methods, throwing an error is preferred; String.fromCharCode() does not, for legacy compatibility reasons.
You can also use UTF-16 Surrogate Pairs with String.fromCodePoint(), although there is little point in doing so when you can use the actual code points. It is important to be aware of this when passing multiple arguments, though, as it may trip you up: a High Surrogate argument followed by a Low Surrogate argument combines into a single character in the resulting string. As also shown in the example, string concatenation works with the return values, as they are ordinary strings.
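Because invalid input throws rather than truncates, code that accepts untrusted numbers may want to guard the call. The wrapper below is a hypothetical sketch (safeFromCodePoint is not a built-in) and assumes an ES6 engine.

```javascript
// Hypothetical wrapper: returns null instead of throwing for invalid input.
function safeFromCodePoint(value) {
    try {
        return String.fromCodePoint(value);
    } catch (e) {
        if (e instanceof RangeError) {
            return null;
        }
        throw e; // anything else is unexpected, so pass it on
    }
}

console.log(safeFromCodePoint(0x1F4D6) === '\uD83D\uDCD6'); // true
console.log(safeFromCodePoint(0x110000)); // null (above the last code point, 0x10FFFF)
console.log(safeFromCodePoint(-1));       // null
console.log(safeFromCodePoint(1.5));      // null (not an integer)
```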

Getting a Single Character’s UTF Code Unit from a String with charCodeAt()

Problem

You want to retrieve the code unit at a given point in a string.

Solution

String.prototype.charCodeAt() works similarly to String.prototype.charAt() discussed previously and it’s a method that exists on all string objects and primitives. It takes one argument, which is a number index value that corresponds to the character code unit you want to retrieve. It always returns a number, relating to the Unicode code unit, or NaN if the index you supply is out of range (longer than the length of the string).

The Code

console.log( 'a'.charCodeAt(0) ); // 97
console.log( 'aa'.charCodeAt(1) ); // 97
console.log( 'hello'.charCodeAt(4) ); // 111
console.log( 'abc'.charCodeAt(4) ); // NaN
console.log( '\uD83D\uDCD6'.charCodeAt(0) ); // 55357
console.log( '\uD83D\uDCD6'.charCodeAt(1) ); // 56534
console.log( '\u{1F4D6}'.charCodeAt(1) ); // 56534
Listing 3-9.
Getting a Single Character's UTF Code Unit from a String with charCodeAt()
The output is:
97
97
111
NaN
55357
56534
56534

How It Works

String.prototype.charCodeAt() retrieves the UTF-16 code unit at the given index. The index is coerced into an integer (using the ToInteger function: section 7.1.4 in ES6, section 9.4 in ES5), meaning numerical strings will be coerced to a number value. If the resulting value is NaN, it is treated as 0. If the number provided is negative, or is greater than or equal to the string's .length property, the return value will be NaN.
Like most string functions, it only deals with code units, not individual characters (or code points) inside of a string. This is fine for most strings, but you can run into problems. As established earlier, the Open Book Emoticon is actually composed of two Unicode code units, \uD83D and \uDCD6. This means the expression '\uD83D\uDCD6'.charCodeAt(0) returns 55357 (the decimal representation of 0xD83D), which is a High Surrogate and not a valid character on its own. This can be avoided in ES6 platforms by using String.prototype.codePointAt().
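One practical use of charCodeAt() is detecting whether a position holds half of a Surrogate Pair. The helper below is a hypothetical sketch that checks a code unit against the surrogate ranges given earlier in this chapter.

```javascript
// Hypothetical helper: classifies the code unit at a given index.
function describeCodeUnit(str, index) {
    var unit = str.charCodeAt(index);
    if (isNaN(unit)) {
        return 'out of range';
    }
    if (unit >= 0xD800 && unit <= 0xDBFF) {
        return 'high surrogate';
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
        return 'low surrogate';
    }
    return 'BMP character';
}

var bookText = 'a\uD83D\uDCD6';
console.log(describeCodeUnit(bookText, 0)); // BMP character
console.log(describeCodeUnit(bookText, 1)); // high surrogate
console.log(describeCodeUnit(bookText, 2)); // low surrogate
console.log(describeCodeUnit(bookText, 3)); // out of range
```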

Getting a Single Character’s UTF Code Point from a String with codePointAt()

Caution
String.prototype.codePointAt() is an ES6 feature. Older browsers still in use, such as Internet Explorer 11 and below, or older versions of Safari, do not support this feature.

Problem

You want to retrieve the code point at a given position in a string.

Solution

String.prototype.codePointAt() works similarly to String.prototype.charCodeAt() discussed previously, and it’s a method that exists on all string objects and primitives. It takes one argument, which is a number index value that corresponds to the code point (which may consist of multiple code units) you want to retrieve. It returns either a number representing the value of the code point, or undefined if the index you supply is out of range (equal to or greater than the length of the string).

The Code

console.log( 'a'.codePointAt(0) ); // 97
console.log( 'aa'.codePointAt(1) ); // 97
console.log( 'hello'.codePointAt(4) ); // 111
console.log( 'abc'.codePointAt(4) ); // undefined
console.log( '\uD83D\uDCD6'.codePointAt(0) ); // 128214
console.log( '\uD83D\uDCD6'.codePointAt(1) ); // 56534
console.log( '\u{1F4D6}'.codePointAt(1) ); // 56534
Listing 3-10.
Getting a Single Character’s UTF Code Point from a String with codePointAt()
The output is:
97
97
111
undefined
128214
56534
56534

How It Works

String.prototype.codePointAt() retrieves the UTF code point at the given index. The index is coerced into an integer (using the ToInteger function: section 7.1.4 in ES6), meaning numerical strings will be coerced to a number value. If the resulting value is NaN, it is treated as 0. If the index is out of range, the function returns undefined.
String.prototype.codePointAt() specifically deals with code points, as opposed to the norm, code units. This means characters such as the Open Book Emoticon will be returned as the full code point, in this case the number 128214 (0x1F4D6 in hex). Characters in the BMP (Basic Multilingual Plane), which are composed of single code units, return the same values as they do with String.prototype.charCodeAt(); for example, 'a'.codePointAt() returns 97 (0x61 in hex), just as 'a'.charCodeAt() does.
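Note that the index passed to codePointAt() is still a code unit index. The sketch below, which assumes an ES6 engine, shows what happens at each position and how a manual loop can skip the second half of a pair.

```javascript
var text = 'a\uD83D\uDCD6';

console.log(text.codePointAt(0)); // 97
console.log(text.codePointAt(1)); // 128214 (the whole pair, read from its High Surrogate)
console.log(text.codePointAt(2)); // 56534 (index 2 is the lone Low Surrogate half)
console.log(text.codePointAt(3)); // undefined

// A code point above 0xFFFF occupies two code units, so a manual loop has to
// skip an extra index after reading one.
var points = [];
for (var i = 0; i < text.length; i++) {
    var point = text.codePointAt(i);
    points.push(point.toString(16));
    if (point > 0xFFFF) {
        i++;
    }
}
console.log(points.join(' ')); // 61 1f4d6
```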

Iterating Over a String’s Code Units Using for...in

Problem

You want to iterate over all code units in a string.

Solution

With a for...in loop you can iterate over all of the keys in a string. Each key in a string represents a code unit, and so you can iterate over a string's code unit values by combining a for...in with bracket notation, as discussed earlier in this chapter.

The Code

var myString = 'abc';
for(var i in myString) {
    console.log('Character at position ' + i + ' is ' + myString[i]);
}
Listing 3-11.
Iterating Over a String's Code Units Using for...in
The output is:
Character at position 0 is a
Character at position 1 is b
Character at position 2 is c

How It Works

for...in loops will check the “enumerable keys” of an object or primitive value. In the case of a string, its enumerable properties are index keys for each code unit in the string, from 0 to one less than the string's length. This means that the left hand operand of the for...in statement (the variable) gets assigned to each index of the string, and the block is executed over and over, each time reassigning the variable, until every index has been visited. This makes for(var i in string) roughly equivalent to for(var i = 0; i < string.length; ++i), except that with for...in the variable holds each index as a string (for example '0') rather than a number. Because you only get the key values, you still need to extract the individual characters with bracket notation, for example myString[i].
Just like most string methods and behaviors, this does not deal well with code points. Using the for...in loop combined with String.prototype.codePointAt() will also give very undesirable results, because you’re iterating over code units and extracting code points. For this, use a for...of loop instead (discussed later in this chapter).
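The effect on a string containing a Surrogate Pair can be sketched as follows; the variable names are only illustrative.

```javascript
var sample = 'a\uD83D\uDCD6';
var keys = [];
var units = [];

for (var key in sample) {
    keys.push(key);   // keys are strings: '0', '1', '2'
    units.push(sample.charCodeAt(key).toString(16));
}

console.log(typeof keys[0]);  // string
console.log(keys.join(','));  // 0,1,2
console.log(units.join(' ')); // 61 d83d dcd6
```

The single Open Book character is visited twice, once for each of its surrogate halves.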

Iterating Over a String’s Code Points Using for...of

Caution
for...of, symbols, and iterators are ES6 features. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support these features. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to iterate over all code points in a string. A for...in loop will not work because it iterates over code units, giving you undesirable effects.

Solution

With a for...of loop you can iterate over all of the code points in a string. It uses the string’s (hidden) underlying String.prototype[Symbol.iterator] function, which iterates over each code point in a string, returning for each iteration a string that contains the code point at that position.

The Code

var myString = 'abc\uD83D\uDCD6';
for(var v of myString) {
    console.log(v);
}
Listing 3-12.
Iterating Over a String’s Code Points Using for...of
The output is:
"a"
"b"
"c"
"📖"

How It Works

for...of works surprisingly differently than for...in (described previously). for...in simply iterates over enumerable keys of the right hand operand (in this case a string), but for...of uses the right hand operand’s [Symbol.iterator] method to obtain an iterator and calls it for each iteration until it has no more iterations left. The [Symbol.iterator] property is a built-in symbol used by the language. You can read more about the built-in symbols, like [Symbol.iterator], in Chapter 17. String.prototype[Symbol.iterator] returns an iterator (discussed in Chapter 15), which will return each code point in sequence, one by one, until there are none left. It specifically returns each code point as a string value. If you wanted to retrieve the numerical code point, a simple call to v.codePointAt(0) on each iteration would suffice.
Because String.prototype[Symbol.iterator] returns a string value for each code point, and not an index referencing each point, the use cases for both for...of and for...in still exist, so do not discount either looping mechanism.
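If you need both the code point and its position, you can track the code unit index yourself while iterating. This is a minimal sketch assuming an ES6 engine; the variable names are made up for the example.

```javascript
var message = 'abc\uD83D\uDCD6';
var index = 0;       // code unit index, tracked by hand
var report = [];

for (var symbol of message) {
    report.push(index + ':' + symbol.codePointAt(0).toString(16));
    index += symbol.length;   // 1 for BMP characters, 2 for surrogate pairs
}

console.log(report.join(' '));         // 0:61 1:62 2:63 3:1f4d6
console.log(index === message.length); // true
```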

Repeating a String with repeat()

Caution
String.prototype.repeat() is an ES6 feature. Browsers such as Internet Explorer and Opera do not support this feature.

Problem

You want to be able to repeat a string.

Solution

In ES6-compatible JavaScript engines, String.prototype.repeat() is available. It takes one argument, which is a number representing the number of repetitions you want. If you pass a number less than 0 or pass Infinity, it will throw a RangeError. Otherwise, it returns a string, which is the result of repeating the given string.

The Code

console.log( 'a '.repeat(6) );
console.log( 'ab '.repeat(6) );
console.log( 'echo '.repeat(4) );
console.log( '\uD83D\uDCD6 '.repeat(2) );
console.log( 'return value will be empty'.repeat(0) );
Listing 3-13.
Repeating a String with repeat()
The output is:
a a a a a a
ab ab ab ab ab ab
echo echo echo echo
📖 📖
   (an empty string, '')

How It Works

String.prototype.repeat() repeats the string it is called on a given number of times. The argument passed to it is coerced into an integer (using the ToInteger function: section 7.1.4 in ES6), meaning numerical strings will be coerced to a number value and then truncated (the fractional part is removed). If the resulting value is NaN or 0, the return value is an empty string.
String.prototype.repeat(), like all string methods, does not mutate the original string value. Also, like other string methods, it works on code units, so you need to be aware of UTF-16 surrogates (described earlier in this chapter).
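The coercion rules and the RangeError can be seen in a few lines. The padLeft helper is a hypothetical example of a typical use, and the sketch assumes an ES6 engine.

```javascript
console.log('ab'.repeat(2.9));        // abab (2.9 is truncated to 2)
console.log('ab'.repeat('3'));        // ababab (the string '3' is coerced to 3)
console.log('ab'.repeat(NaN) === ''); // true

// Hypothetical helper: pads a value on the left to a given width.
function padLeft(value, width, fill) {
    var text = String(value);
    var missing = Math.max(0, width - text.length); // never pass repeat() a negative count
    return fill.repeat(missing) + text;
}
console.log(padLeft(42, 5, '0')); // 00042

try {
    'ab'.repeat(-1);
} catch (e) {
    console.log(e instanceof RangeError); // true
}
```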

Determining If a String Contains a Smaller String Using contains()

Caution
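String.prototype.contains() is an ES6 feature. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts. Note that this method was renamed String.prototype.includes() in the final ES6 specification; contains() only exists in some early implementations, so substitute includes() when running the code in this recipe on a current engine.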

Problem

You want to find out if a string contains another string (substring) anywhere within the parent string.

Solution

String.prototype.contains() is a method that exists on all string objects and primitives, used to search for a substring inside of a string value. It takes two arguments: the first is a string to search for (the substring or “needle” to the parent string’s “haystack”), the second is an optional argument that determines the start position to search from which defaults to 0. It returns a Boolean that indicates if the string was found (true) or not (false).

The Code

if ('abc'.contains('a')) {
    console.log('The string "abc" contains the letter a');
} else {
    console.log('The string "abc" does not contain the letter a');
}
if ('abc'.contains('d')) {
    console.log('The string "abc" contains the letter d');
} else {
    console.log('The string "abc" does not contain the letter d');
}
if ('abc'.contains('a', 1)) {
    console.log('The string "abc" contains the letter a past the first character');
} else {
    console.log('The string "abc" does not contain the letter a past the first character');
}
if ('Surprise!'.contains('!')) {
    console.log('The string "Surprise!" contains the letter !');
} else {
    console.log('The string "Surprise!" does not contain the letter !');
}
var greeting = 'Hello Jim, how are you';
if (greeting.contains('Jim')) {
    console.log('The string ' + greeting + ' contains the word Jim');
} else {
    console.log('The string ' + greeting + ' does not contain the word Jim');
}
if (greeting.contains('jim')) {
    console.log('The string ' + greeting + ' contains the word jim');
} else {
    console.log('The string ' + greeting + ' does not contain the word jim');
}
if (greeting.contains('Jim', greeting.length / 2)) {
    console.log('The string ' + greeting + ' contains the word Jim in the second half of the String');
} else if(greeting.contains('Jim')) {
    console.log('The string ' + greeting + ' contains the word Jim in the first half of the String');
} else {
    console.log('The string ' + greeting + ' does not contain the word Jim');
}
Listing 3-14.
Determining If a String Contains a Smaller String Using contains()
The output is:
The string "abc" contains the letter a
The string "abc" does not contain the letter d
The string "abc" does not contain the letter a past the first character
The string "Surprise!" contains the letter !
The string Hello Jim, how are you contains the word Jim
The string Hello Jim, how are you does not contain the word jim
The string Hello Jim, how are you contains the word Jim in the first half of the String

How It Works

String.prototype.contains() will search for the string given inside the parent string. The first argument passed to it is coerced into a string (using the ToString function; ES6 section 7.1.12, ES5 section 9.8). The second argument, the starting position, is coerced to an integer number (using the ToInteger function; ES6 section 7.1.4, ES5 section 9.4), meaning numerical strings will be coerced to a number value and then truncated (the fractional part is removed). Any NaN value will be converted to 0, so the search will begin at the start of the string.
String.prototype.contains() has some very predictable results, for example, if the starting search position is greater than the parent string’s length, then it will always return false. Similarly, if the substring to search for (the proverbial “needle”) is longer than the parent string (“haystack”) then the return value will also always be false. It is worth bearing this in mind, and perhaps executing these checks beforehand to optimize for performance. It also, like many string methods, uses code units not code points.
If you are using an older browser that’s not ES6 compliant, you can still achieve the same functionality as String.prototype.contains() by using String.prototype.indexOf(). For example, 'abcabc'.indexOf('b', 3) !== -1 is the same as 'abcabc'.contains('b', 3).
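A minimal sketch of that fallback follows. The helper name containsSubstring is hypothetical and not part of any specification. It assumes the native method is exposed as includes(), the name the final ES6 specification settled on, and falls back to indexOf() when it is missing.

```javascript
// Hypothetical helper: true if needle occurs in haystack at or after position.
function containsSubstring(haystack, needle, position) {
    if (typeof String.prototype.includes === 'function') {
        return String(haystack).includes(needle, position);
    }
    // ES5 fallback: indexOf() returns -1 when the needle is absent.
    return String(haystack).indexOf(needle, position) !== -1;
}

console.log(containsSubstring('abcabc', 'b', 3)); // true
console.log(containsSubstring('abcabc', 'a', 4)); // false
console.log(containsSubstring('abc', 'd'));       // false
```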

Determining If a String Starts with a Smaller String Using startsWith( )

Caution
String.prototype.startsWith() is an ES6 feature. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to find out if a string contains another string (substring), and you want to know that the parent string specifically starts with the substring.

Solution

String.prototype.startsWith() is a method that exists on all string objects and primitives, used to search for substrings that begin at a specific position inside a string. It takes two arguments: the first is a string to search for (the substring or the “needle” to the parent string’s “haystack”), the second is an optional argument that determines the start position to search from which defaults to 0. It returns a Boolean that indicates if the string was found (true) or not (false).

The Code

if ('abc'.startsWith('a')) {
    console.log('The string "abc" starts with the letter a');
} else {
    console.log('The string "abc" does not start with the letter a');
}
if ('abc'.startsWith('b')) {
    console.log('The string "abc" starts with the letter b');
} else {
    console.log('The string "abc" does not start with the letter b');
}
if ('abc'.startsWith('a', 1)) {
    console.log('The string "abc" starts with the letter a from the first character');
} else {
    console.log('The string "abc" does not start with the letter a from the first character');
}
if ('Surprise!'.startsWith('!')) {
    console.log('The string "Surprise!" starts with the letter !');
} else {
    console.log('The string "Surprise!" does not start with the letter !');
}
var greeting = 'Hello Jim, how are you';
if (greeting.startsWith('Jim')) {
    console.log('The string ' + greeting + ' starts with the word Jim');
} else {
    console.log('The string ' + greeting + ' does not start with the word Jim');
}
if (greeting.startsWith('Jim', 6)) {
    console.log('The string ' + greeting + ' starts with the word Jim from the letter 6');
} else {
    console.log('The string ' + greeting + ' does not start with the word Jim from the letter 6');
}
Listing 3-15.
Determining If a String Starts with Substring Using startsWith()
The output is:
The string "abc" starts with the letter a
The string "abc" does not start with the letter b
The string "abc" does not start with the letter a from the first character
The string "Surprise!" does not start with the letter !
The string Hello Jim, how are you does not start with the word Jim
The string Hello Jim, how are you starts with the word Jim from the letter 6

How It Works

String.prototype.startsWith() will search for the string given inside the parent string, expecting to see the string at the exact starting position you specify (or, if not specified, the start of the string). The first argument passed to it is coerced into a string (using the ToString function; ES6 section 7.1.12, ES5 section 9.8). The second argument, the starting position, is coerced to an integer number (using the ToInteger function; ES6 section 7.1.4, ES5 section 9.4), meaning numerical strings will be coerced to a number value and then truncated (the fractional part is removed). Any NaN value will be converted to 0, so the comparison will begin at the start of the string.
String.prototype.startsWith() has some very predictable results, for example, if the starting search position is greater than the parent string’s length, then it will always return false. Similarly if the substring to search for (the proverbial “needle”) is longer than the parent string (“haystack”), then the return value will also always be false. It is worth bearing this in mind, and perhaps executing these checks beforehand to optimize for performance. It also, like many string methods, uses code units not code points.
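The coercion rules can be seen in a short sketch. The helper startsWithAt is a hypothetical ES5-style fallback written for illustration; it compares a slice of the parent string with the needle rather than reproducing every step of the specification.

```javascript
var greeting = 'Hello Jim, how are you';
console.log(greeting.startsWith('Jim', 6));     // true
console.log(greeting.startsWith('Jim', '6'));   // true: '6' is coerced to 6
console.log(greeting.startsWith('Jim', 6.9));   // true: 6.9 is truncated to 6
console.log(greeting.startsWith('Hello', NaN)); // true: NaN becomes 0
console.log(greeting.startsWith('Jim', 100));   // false: position is past the end

// Hypothetical fallback: compare the slice that begins at position with needle.
function startsWithAt(haystack, needle, position) {
    var start = Math.max(0, Math.floor(Number(position)) || 0);
    return haystack.slice(start, start + needle.length) === needle;
}
console.log(startsWithAt(greeting, 'Jim', 6)); // true
console.log(startsWithAt(greeting, 'Jim'));    // false
```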

Determining If a String Ends with a Smaller String Using endsWith( )

Caution
String.prototype.endsWith() is an ES6 feature. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to find out if a string contains another string (substring), and also want to determine if the parent string specifically ends with the substring.

Solution

String.prototype.endsWith() is a method that exists on all string objects and primitives; it’s used to search for substrings that end at a specific position inside a string. It takes two arguments: the first is a string to search for (the substring or the “needle” to the parent string’s “haystack”), the second is an optional argument that determines the ending position to search from which defaults to the string’s length. It returns a Boolean that indicates if the string was found (true) or not (false).

The Code

if ('abc'.endsWith('a')) {
    console.log('The string "abc" ends with the letter a');
} else {
    console.log('The string "abc" does not end with the letter a');
}
if ('abc'.endsWith('c')) {
    console.log('The string "abc" ends with the letter c');
} else {
    console.log('The string "abc" does not end with the letter c');
}
if ('abc'.endsWith('a', 2)) {
    console.log('The string "abc" ends with the letter a from the second character');
} else {
    console.log('The string "abc" does not end with the letter a from the second character');
}
if ('Surprise!'.endsWith('!')) {
    console.log('The string "Surprise!" ends with the letter !');
} else {
    console.log('The string "Surprise!" does not end with the letter !');
}
var greeting = 'Hello Jim, how are you';
if (greeting.endsWith('Jim')) {
    console.log('The string ' + greeting + ' ends with the word Jim');
} else {
    console.log('The string ' + greeting + ' does not end with the word Jim');
}
if (greeting.endsWith('Jim', 13)) {
    console.log('The string ' + greeting + ' ends with the word Jim');
} else {
    console.log('The string ' + greeting + ' does not end with the word Jim');
}
Listing 3-16.
Determining If a String Ends with Substring Using endsWith()
The output is:
The string "abc" does not end with the letter a
The string "abc" ends with the letter c
The string "abc" does not end with the letter a from the second character
The string "Surprise!" ends with the letter !
The string Hello Jim, how are you does not end with the word Jim
The string Hello Jim, how are you does not end with the word Jim

How It Works

String.prototype.endsWith() will search for the string given inside the parent string, expecting to see the string finish at the exact ending position you specify (or, if not specified, the end of the string). The first argument passed to it is coerced into a string (using the ToString function; ES6 section 7.1.12, ES5 section 9.8). The second argument, the ending position, is coerced to an integer number (using the ToInteger function; ES6 section 7.1.4, ES5 section 9.4), meaning numerical strings will be coerced to a number value and then truncated (the fractional part is removed), and any NaN value will be converted to 0, which leaves an empty string to test. It is important to emphasize that the ending position is still counted from the start of the string: it effectively truncates the parent string to that many code units before the assertion is made. If you look closely at the examples, you’ll see how this works.
String.prototype.endsWith() has some very predictable results. For example, if the ending position is greater than the parent string’s length, it is clamped to that length and the method behaves as if no position had been given; if it is 0 or negative, only an empty substring can match. Similarly if the substring to search for (the proverbial “needle”) is longer than the parent string (“haystack”), then the return value will always be false. It is worth bearing this in mind, and perhaps executing these checks beforehand to optimize for performance. It also, like many string methods, uses code units not code points.
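The effect of the ending position is easiest to see with a few calls on the same string. This is only an illustrative sketch; the positions are chosen to match the greeting used in the listing.

```javascript
var greeting = 'Hello Jim, how are you';
// Only the first 9 code units ('Hello Jim') are examined.
console.log(greeting.endsWith('Jim', 9));   // true
// The first 13 code units are 'Hello Jim, ho'.
console.log(greeting.endsWith('Jim', 13));  // false
// A position beyond the end is clamped to the string's length.
console.log(greeting.endsWith('you', 100)); // true
// NaN becomes 0, so there is nothing left to match against.
console.log(greeting.endsWith('you', NaN)); // false
// Equivalent to truncating first and then testing the end.
console.log(greeting.slice(0, 9).endsWith('Jim')); // true
```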

Finding the Index of an Occurring Substring with indexOf( )

Problem

You want to find the index of the first occurring substring that is contained within the parent string.

Solution

String.prototype.indexOf() is a method that exists on all string objects and primitives, and will return the index of a substring that occurs in the parent string. It takes two arguments: the first is a string to search for (the substring or the “needle” to the parent string’s “haystack”), the second is an optional argument that determines the starting position to search from, which defaults to 0. It returns a whole number that indicates at which position in the string the substring exists; if the substring is not found, it returns -1.

The Code

console.log('letter a in "abc" is in position ' + 'abc'.indexOf('a') );
console.log('letter b in "abc" is in position ' + 'abc'.indexOf('b') );
var directory = '/var/www/javascriptrecipes.com/'
if (directory.indexOf('/') === 0) {
    console.log('directory is absolute');
} else if (directory.indexOf('../') === 0) {
    console.log('directory is relative');
}
var directory = '../var/www/javascriptrecipes.com/'
if (directory.indexOf('/') === 0) {
    console.log('directory is absolute');
} else if (directory.indexOf('../') === 0) {
    console.log('directory is relative');
}
var userAgent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/6.0)';
if (userAgent.indexOf('Trident') === -1) {
    console.log('user agent is not Internet Explorer');
} else {
    console.log('user agent is Internet Explorer');
}
var userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0';
if (userAgent.indexOf('Trident') === -1) {
    console.log('user agent is not Internet Explorer');
} else {
    console.log('user agent is Internet Explorer');
}
Listing 3-17.
Finding the Index of an Occurring Substring with indexOf()
The output is:
letter a in "abc" is in position 0
letter b in "abc" is in position 1
directory is absolute
directory is relative
user agent is Internet Explorer
user agent is not Internet Explorer

How It Works

String.prototype.indexOf() will search for the string given inside the parent string. The first argument passed to it is coerced into a string (using the ToString function; ES6 section 7.1.12, ES5 section 9.8). The second argument, the starting position, is coerced to an integer number (using the ToInteger function—section 7.1.4 in ES6, section 9.4 in ES5), meaning numerical strings will be coerced to a number value and then floored (the decimal place is removed) and any NaN values will be converted to 0. The search will begin at 0.
String.prototype.indexOf() will return -1 if the substring is not found at all in the parent string, and predictably, if the substring is longer than the parent string or the search position is greater than the parent string’s length. It is important to note that it may sometimes return 0 if the substring is at the start of a string. Be careful with this and with Boolean coercions that will convert 0 to false. Like many string methods, it also will search using code units not code points, meaning you can use it to find halves of Unicode Surrogate Pairs.
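The following sketch shows the 0 pitfall and one way to use the starting position to walk through every occurrence. The helper allIndexesOf is hypothetical, introduced only for this illustration.

```javascript
var path = '/var/www';
// Pitfall: a match at index 0 coerces to false.
if (path.indexOf('/')) {
    console.log('not printed, because indexOf() returned 0');
}
// Compare against -1 explicitly instead.
if (path.indexOf('/') !== -1) {
    console.log('path contains a slash');
}

// Hypothetical helper: collect the index of every occurrence of needle.
function allIndexesOf(haystack, needle) {
    var indexes = [];
    // Guard: an empty needle matches everywhere and would loop forever.
    if (needle === '') {
        return indexes;
    }
    var index = haystack.indexOf(needle);
    while (index !== -1) {
        indexes.push(index);
        index = haystack.indexOf(needle, index + 1);
    }
    return indexes;
}
console.log(allIndexesOf('/var/www/site/', '/')); // [0, 4, 8, 13]
```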

Finding the Index of the Last Occurrence of a Substring with lastIndexOf( )

Problem

You want to find the index of the last occurring substring that is contained within the parent string.

Solution

String.prototype.lastIndexOf() is a method that exists on all string objects and primitives and will return the index of a substring that occurs in the parent string. It takes two arguments: the first is a string to search for (the substring or the “needle” to the parent string’s “haystack”), the second is an optional argument that determines the starting position to search from which defaults to the length of the string. It returns a whole number that indicates at which position in the string the substring exists; if the substring is not found, it returns -1.

The Code

console.log('letter a in "abc" is in position ' + 'abc'.lastIndexOf('a') );
console.log('letter b in "abc" is in position ' + 'abc'.lastIndexOf('b') );
var directory = '../var/www/javascriptrecipes.com/'
if (directory.lastIndexOf('/') === directory.length - 1) {
    console.log('directory has a trailing slash');
} else {
    console.log('directory does not have a trailing slash');
}
var mySentence = 'Lorem ipsum dolor sit amet';
if (mySentence.lastIndexOf('.') === mySentence.length - 1) {
    console.log('sentences must end in a . mySentence ends in a .');
} else {
    console.log('sentences must end in a ., but mySentence does not');
}
Listing 3-18.
Finding the Index of the Last Occurrence of a Substring with lastIndexOf()
The output is:
letter a in "abc" is in position 0
letter b in "abc" is in position 1
directory has a trailing slash
sentences must end in a ., but mySentence does not

How It Works

String.prototype.lastIndexOf() is similar to String.prototype.indexOf(), except it searches the string backwards, for the last occurring substring’s index. The first argument passed to it is coerced into a string (using the ToString function; ES6 section 7.1.12, ES5 section 9.8). The second argument, the starting position, is coerced to a number and then truncated to an integer (the fractional part is removed). Unlike indexOf(), a NaN value is treated as positive infinity, so the search begins at the end of the string; negative values are treated as 0.
String.prototype.lastIndexOf() will return -1 if the substring is not found at all in the parent string, and also (predictably) if the substring is longer than the parent string or if no occurrence begins at or before the search position. It is also important to note that it may sometimes return 0 if the substring is at the start of a string. Be careful with this and with Boolean coercions that will convert 0 to false. Like many string methods, it also will search using code units not code points, meaning you can use it to find halves of Unicode Surrogate Pairs.
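As a brief sketch, lastIndexOf() is handy for pulling the final segment out of a path. The file path below is made up for the example.

```javascript
var file = '/var/www/archive.tar.gz';
var lastSlash = file.lastIndexOf('/');
var lastDot = file.lastIndexOf('.');
console.log(file.slice(lastSlash + 1)); // 'archive.tar.gz'
console.log(file.slice(lastDot + 1));   // 'gz'

// The second argument is the highest index at which a match may begin.
console.log(file.lastIndexOf('/', 7));   // 4
console.log(file.lastIndexOf('/', 0));   // 0
console.log(file.lastIndexOf('/', NaN)); // 8: NaN searches from the end
```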

Finding Many Matches of a Substring with match( )

Problem

You want to be able to determine how many times a particular substring occurs in a string, or you’d like to use a regular expression and determine how many occurrences of a regular expression are in a string.

Solution

String.prototype.match() is a method that exists on all string objects and primitives and will return an array of matches pertaining to the regular expression or string given to it as the first argument. If it finds no matches, it will return null.

The Code

console.log('/var/www/javascriptrecipes/'.match('/'));
console.log('/var/www/javascriptrecipes/'.match('/var/www/'));
console.log('/var/www/javascriptrecipes/'.match(/\//g));
console.log('/var/www/javascriptrecipes/'.match(/[^/]+/g));
console.log('There are ' + 'javascript'.match('a').length + ' letter "a"s in the word javascript');
Listing 3-19.
Finding Many Matches of a Substring with match() Using Firefox
The output is:
["/"]
["/var/www/"]
["/", "/", "/", "/"]
["var", "www", "javascriptrecipes"]
There are 1 letter "a"s in the word javascript

How It Works

String.prototype.match() actually uses regular expressions (discussed at length in Chapter 20) to search for matches inside of the parent String. If the first argument passed to it is not a regular expression, it will convert it to one (using the RegExp constructor function). This means simply using a string is mostly fine, as long as you don’t cross swords with regular expression syntax.
String.prototype.match() will return an array of all of the matched values as strings. The array will always contain strings and no other value. If no results are found, null is returned.
It is worth mentioning that just using a string as the first argument will not match all occurrences, because the regular expression created from it lacks the global flag. To match all occurrences of a substring in a string, you need to use a global regular expression. For example, 'abc' becomes /abc/g (the g flag means “global”). Once again, regular expressions are discussed in depth in Chapter 20.
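To count occurrences reliably you need both the global flag and a check for null. The sketch below wraps that in a hypothetical helper named countMatches.

```javascript
// Hypothetical helper: number of matches, treating null as zero.
function countMatches(haystack, pattern) {
    var matches = haystack.match(pattern);
    return matches === null ? 0 : matches.length;
}

console.log(countMatches('javascript', /a/g)); // 2
console.log(countMatches('javascript', /z/g)); // 0
// Without the g flag only the first match is returned.
console.log(countMatches('javascript', 'a'));  // 1
// A string argument becomes a pattern, so '.' matches any character.
console.log('a.b'.match('.')[0]);              // 'a'
```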

Replacing Parts of a String with replace( )

Problem

You want to be able to replace part of a string with another string.

Solution

String.prototype.replace() is a method that exists on all string objects and primitives and will replace parts of the parent string (“haystack”) with another string, using a string or regular expression to find the replacement part. The first argument is the “needle” string or regular expression, the second the string (or function) used to replace the matched parts. The resulting return value is always a string.

The Code

console.log('javascript'.replace('java', 'ecma'));
console.log('/var/www/javascriptrecipes/'.replace('/', '--'));
console.log('/var/www/javascriptrecipes/'.replace(/\//g, '--'));
console.log('/var/www/javascriptrecipes/'.replace('/var/www/', '/home/sites/'));
console.log('/var/www/javascriptrecipes/'.replace(/[^/]+/g, function (str) {
    return str.toUpperCase();
}));
Listing 3-20.
Replacing Parts of a String with replace()
The output is:
ecmascript
--var/www/javascriptrecipes/
--var--www--javascriptrecipes--
/home/sites/javascriptrecipes/
/VAR/WWW/JAVASCRIPTRECIPES/

How It Works

String.prototype.replace() accepts either a string or a regular expression (discussed at length in Chapter 20) as its first argument. Unlike String.prototype.match(), a string is not converted to a regular expression: it is matched literally, so characters such as . or * have no special meaning, and only the first occurrence is replaced.
The second argument can be a string or a function; both are treated specially. If it is a string, it uses $ as an escape character, as is common in many languages. This is useful for shortcuts based on complex regular expression pattern matching. $$ inserts a literal $, $& inserts the matched substring, $` inserts the portion of the string preceding the matched substring, and $' inserts the portion of the string following the matched substring. You can also use $n or $nn to insert the text captured by the nth capture group of the regular expression. If it is a function, the function is called for each match and its return value is used as the replacement. This can get very complicated, especially using the scope of regular expressions, and so once again, it’s covered in more detail in Chapter 20.
String.prototype.replace() will always return a string based on the original string, with the necessary replacements made. If the search substring (“needle”) is not found within the parent string, the return result will be a copy of the parent string.
It is worth mentioning that just using a string as the first argument will only replace the first occurrence. To replace all occurrences of a substring in a string, you need to convert it to a global regular expression. For example, 'abc' becomes /abc/g (the g flag means “global”). Once again, regular expressions are discussed in depth in Chapter 20.
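A few short calls illustrate the replacement patterns and the difference between a string needle and a global regular expression. The input strings are invented for the example.

```javascript
var name = 'Cirkel, Keith';
// $1 and $2 refer to the first and second capture groups.
console.log(name.replace(/(\w+), (\w+)/, '$2 $1')); // 'Keith Cirkel'
console.log('cost: 5'.replace('5', '$$5'));         // 'cost: $5'
console.log('abc'.replace('b', '[$&]'));            // 'a[b]c'
console.log('abc'.replace('b', '$`'));              // 'aac'
console.log('abc'.replace('b', "$'"));              // 'acc'

// A string needle is literal and replaces only the first occurrence.
console.log('a.b.c'.replace('.', '-'));   // 'a-b.c'
console.log('a.b.c'.replace(/\./g, '-')); // 'a-b-c'
// No match: the result equals the original string.
console.log('abc'.replace('x', 'y'));     // 'abc'
```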

Searching a String Using a Regular Expression with search( )

Problem

You want to be able to determine if a string contains more abstract strings using regular expressions (Chapter 20).

Solution

String.prototype.search() is a method that exists on all string objects and primitives. It is very similar to String.prototype.indexOf(), with the exception that it takes a regular expression rather than a string to determine a substring match. That means that it allows for much more complex string matching. It does not have a second argument (a dissimilarity with String.prototype.indexOf()). It returns the index of the first match, or -1 if no match was found.

The Code

console.log('the word javascript has the letter a in position ' + 'javascript'.search('a'));
console.log('the last word in "lorem ipsum dolor" is in position ' + 'lorem ipsum dolor'.search(/\w+$/));
var myName = 'Keith'
if (myName.search(/[^a-z]/i) !== -1) {
    console.log('variable myName contains non-alphabetical characters!');
} else {
    console.log('variable myName contains only alphabetical characters');
}
var myName = 'not a name'
if (myName.search(/[^a-z]/i) !== -1) {
    console.log('variable myName contains non-alphabetical characters!');
} else {
    console.log('variable myName contains only alphabetical characters');
}
Listing 3-21.
Searching a String Using a Regular Expression with search()
The output is:
the word javascript has the letter a in position 1
the last word in "lorem ipsum dolor" is in position 12
variable myName contains only alphabetical characters
variable myName contains non-alphabetical characters!

How It Works

String.prototype.search() uses regular expressions (discussed at length in Chapter 20) to search for matches inside of the parent string. If the first argument passed to it is not a regular expression, it will convert it to one (using the RegExp constructor function). This means simply using a string is mostly fine, as long as you don’t cross swords with regular expression syntax.
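The sketch below shows where a plain string does cross swords with regular expression syntax, alongside a hypothetical helper, matchesPattern, that turns the returned index into a Boolean.

```javascript
console.log('javascript'.search(/script/)); // 4
console.log('javascript'.search(/php/));    // -1

// '.' is converted to the pattern /./, which matches any character.
console.log('a.b'.search('.'));  // 0
console.log('a.b'.search(/\./)); // 1
console.log('a.b'.indexOf('.')); // 1

// Hypothetical helper: true if the pattern matches anywhere in haystack.
function matchesPattern(haystack, pattern) {
    return haystack.search(pattern) !== -1;
}
console.log(matchesPattern('Keith', /[^a-z]/i)); // false
```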

Getting a Substring from a String with slice( )

Problem

You want to be able to extract part of a string using its index.

Solution

String.prototype.slice() is a method that exists on all string objects and primitives and it’s used to extract a substring from a string, given a start and end index. The first argument is the starting index number, and the second is the ending index number, which is optional and defaults to the string’s length. It will always return a string based on the parent string.

The Code

console.log('abc'.slice(1));
console.log('Hello World'.slice(6));
console.log('It was the best of times'.slice(-5));
console.log('It was the best of times'.slice(7, -6));
Listing 3-22.
Getting a Substring from a String with slice()
The output is:
bc
World
times
the best of

How It Works

String.prototype.slice() coerces both its arguments into integer numbers (using the ToInteger function—section 7.1.4 in ES6, section 9.4 in ES5), meaning numerical strings will be coerced to a number value and then floored (the decimal place is removed) and any NaN values will be converted to 0.
String.prototype.slice() will return an empty string if both arguments are the same number, or if the first argument is greater than or equal to the parent string’s length. It also works on code units, and so using our archetypal Open Book character (\uD83D\uDCD8), '\uD83D\uDCD8'.slice(0, 1) would equal '\uD83D'. Either index can also be a negative number. If that is the case, it is counted back from the end of the string, that is, treated as stringLength + index.
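The following sketch checks the negative-index rule against its explicit equivalent and shows the empty-string cases. Variable names are illustrative.

```javascript
var text = 'It was the best of times';
console.log(text.slice(-5)); // 'times'
// A negative index is the same as length + index.
console.log(text.slice(-5) === text.slice(text.length - 5));       // true
console.log(text.slice(7, -6) === text.slice(7, text.length - 6)); // true

console.log(text.slice(3, 3)); // '' (start equals end)
console.log(text.slice(100));  // '' (start is past the end)
console.log(text.slice(5, 2)); // '' (slice() does not swap its arguments)

// Slicing works on code units, so a surrogate pair can be cut in half.
var book = '\uD83D\uDCD8';
console.log(book.slice(0, 1) === '\uD83D'); // true
```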

Splitting Strings with .split( )

Problem

You want to be able to split a string into an array of strings, given a specific substring (or regular expression) to split them by.

Solution

String.prototype.split() is a method that exists on all string objects and primitives and it’s used to dice up a string into an array of strings at a given split marker. The first argument is the string or regular expression to split the parent string by, and the second argument is the limit, the maximum number of strings to return. The second argument is optional and defaults to 9,007,199,254,740,991 (that’s nine quadrillion or so, an unsigned 53-bit integer’s maximum value). The return result is always an array.

The Code

console.log('abc'.split('a'));
console.log('/var/www/site/javascriptrecipes'.split('/'));
console.log('/var/www/site/javascriptrecipes'.split(/\//g));
console.log('/var/www/site/javascriptrecipes'.split('/', 2));
Listing 3-23.
Splitting Strings with .split()
The output is:
["", "bc"]
["", "var", "www", "site", "javascriptrecipes"]
["", "var", "www", "site", "javascriptrecipes"]
["", "var"]

How It Works

String.prototype.split() accepts either a string or a regular expression (discussed at length in Chapter 20) as the separator. A string separator is matched literally; it is not converted to a regular expression, so 'a.b'.split('.') results in ['a', 'b']. The second argument can be used to limit the number of strings returned. It defaults to 2^53-1 (9,007,199,254,740,991) and if you attempt to enter a larger value (e.g., 9,007,199,254,740,992), 2^53-1 will be used. It’s a hard cap on the number of splits that can occur in a string. Realistically this is a limit that you’re very unlikely to hit.
Using the limit integer will not affect the split marks in a string, that is to say that the split will still happen on the whole string, but the returned array will just be limited to the first n strings from the resulting split operation.
String.prototype.split() will then search for each occurrence that matches the string or regular expression, and will split at that index, skipping over the length of the match, meaning the actual contents of the separator are removed. For example, 'a-b-c'.split('-') results in ['a', 'b', 'c']. The dashes have been removed. You can sometimes mitigate this, depending on the string used, with regular expression capture groups (discussed more in Chapter 20). For example, 'a-b-c'.split(/(-)/) results in ['a', '-', 'b', '-', 'c']. If you absolutely must keep the extra characters you're splitting by then you are likely better off using String.prototype.match().
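The behaviors described above can be compared side by side in a short sketch. The match() pattern on the last line is just one possible way to keep the separators.

```javascript
console.log('a-b-c'.split('-'));    // ['a', 'b', 'c']
// A capture group keeps the separators in the result.
console.log('a-b-c'.split(/(-)/));  // ['a', '-', 'b', '-', 'c']
// The limit truncates the result to the first two strings.
console.log('a-b-c'.split('-', 2)); // ['a', 'b']
// A string separator is literal: '.' is a plain dot here.
console.log('a.b.c'.split('.'));    // ['a', 'b', 'c']
// An empty separator splits into individual code units.
console.log('abc'.split(''));       // ['a', 'b', 'c']
// match() alternative that keeps the separators as their own entries.
console.log('a-b-c'.match(/[^-]+|-/g)); // ['a', '-', 'b', '-', 'c']
```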

Changing String Case with toUpperCase( ) and toLowerCase( )

Problem

You want to be able to convert the case of a string to uppercase or lowercase.

Solution

The methods String.prototype.toUpperCase() and String.prototype.toLowerCase() are available to all string objects and primitives and will take an entire string and convert all letters to the respective case. The resulting string will have only one case of lettering (uppercase or lowercase, depending on the method).

The Code

console.log('lowercase'.toUpperCase());
console.log('UPPERCASE'.toLowerCase());
console.log('Hello'.toLowerCase());
console.log('#@!?'.toLowerCase());
console.log('The ligature \uFB02 has a length of ' + '\uFB02'.length);
console.log('The ligature \uFB02 capitalized has a length of ' + '\uFB02'.toUpperCase().length);
Listing 3-24.
Changing String Case with toUpperCase() and toLowerCase()
The output is:
LOWERCASE
uppercase
hello
#@!?
The ligature ﬂ has a length of 1
The ligature ﬂ capitalized has a length of 2

How It Works

JavaScript interpreters have a set of hard-coded tables for converting Unicode characters between upper- and lowercase values (the reference Unicode table can be found at http://unicode.org/Public/UCD/latest/ucd/UnicodeData.txt ). The methods take each Unicode code point and use a lookup table to see if there is an available conversion. If there is, then it will convert the case to the opposing case; if not, then it will leave the existing code point in place.
The intricacies of the case map are complex. Some scripts, such as Japanese Kanji, Hiragana, or Katakana do not have upper- and lowercase variants, and so cannot convert case with these methods. In fact only a handful of scripts have case at all, such as Latin, Greek, Armenian, and Cyrillic.
Some specific characters also can make things confusing. For example, both Greek Small Letter Final Sigma ('\u03C2' "ς") and Greek Small Letter Sigma ('\u03C3' "σ") capitalize as Greek Capital Letter Sigma ('\u03A3' "Σ"). The lowercase of Greek Capital Letter Sigma ('\u03A3' "Σ") is Greek Small Letter Sigma ('\u03C3' "σ"), meaning the conversion back and forth actually changes the character ('\u03C2'.toUpperCase().toLowerCase() returns '\u03C3').
Also, sometimes “decomposition” can occur, for example with a ligature (a combined set of characters for visual appeal). Latin Small Ligature FL ('\uFB02' "ﬂ") does not have an uppercase variant of its own, but instead is converted to its individual letters: Latin Capital Letter F ('\u0046') and Latin Capital Letter L ('\u004C'), meaning that '\uFB02'.length is 1, while '\uFB02'.toUpperCase().length is 2.
Some of these caveats can really trip developers up, so remember to be mindful of them when working with the case-swapping methods. For the most part you will be fine, but it is worth boundary testing for these edge cases, as they can present attack vectors for your application!
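A short boundary test for the two edge cases described above might look like the following sketch. The helper equalsIgnoreCase is hypothetical and shows how a naive case-insensitive comparison treats the ligature as equal to two separate letters.

```javascript
// Final sigma does not survive a round trip through uppercase.
var finalSigma = '\u03C2';
var roundTrip = finalSigma.toUpperCase().toLowerCase();
console.log(roundTrip === finalSigma); // false
console.log(roundTrip === '\u03C3');   // true

// Hypothetical helper: naive case-insensitive comparison.
function equalsIgnoreCase(a, b) {
    return a.toUpperCase() === b.toUpperCase();
}
// The one-character ligature uppercases to the two characters 'FL'.
console.log(equalsIgnoreCase('\uFB02', 'fl')); // true
console.log('\uFB02'.length === 'fl'.length);  // false
```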

Stripping Blank Spaces with trim( )

Problem

You want to be able to remove trailing and leading whitespace from a string.

Solution

It's a common need in programming to remove any whitespace from the beginning or end of a string, and String.prototype.trim() does exactly that. Available on all string objects and primitives, it takes no arguments, and the return value is always a string.

The Code

console.log(' hello '.trim());
console.log(' File start...\n...\nFile end... ');
console.log(' File start...\n...\nFile end... '.trim());
Listing 3-25.
Stripping Blank Spaces with trim()
The output is:
hello
File start...
...
File end...
File start...
...
File end...

How It Works

String.prototype.trim() simply looks at the start and end of a string, removing characters from each end until it reaches the first non-whitespace character. “Whitespace” is considered to be one of the following characters: \u0009 Tab, \u000B Vertical Tab, \u000C Form Feed, \u0020 Space, \u00A0 No-Break Space, \uFEFF Byte Order Mark, and any other Unicode space separator (category Zs), as well as the “Line Terminators” \u000A Line Feed, \u000D Carriage Return, \u2028 Line Separator, and \u2029 Paragraph Separator.
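As a quick sketch, the string below mixes several of those whitespace characters at both ends. The sample text is invented for the example.

```javascript
// Tab, no-break space, and space at the start; space, CR, and LF at the end.
var padded = '\t\u00A0 hello world \r\n';
var trimmed = padded.trim();
console.log(trimmed);        // 'hello world'
console.log(trimmed.length); // 11
// Whitespace inside the string is left alone.
console.log('a  b'.trim());  // 'a  b'
// The original value is not changed.
console.log(padded === trimmed); // false
```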

Determining If a String “Comes Before” Another with localeCompare( )

Caution
Although String.prototype.localeCompare() is part of the ES5 specification, the additional second and third arguments in this method come from the supplementary ECMAScript Internationalization API Specification (ECMA-402). Browsers such as Internet Explorer 10 and below, and Safari 7 and below do not support those arguments, so they may not be available in all browsers.

Problem

You want to check if a particular string should precede another, in terms of ordering within a particular locale. For example in English a “comes before” b, as a precedes b in the English alphabet.

Solution

Comparing two strings for their supposed ordering is quite complex and relies on various levels of information. String.prototype.localeCompare() is the method available to all string objects and primitives for comparing a given string with the parent string. The first argument is the string to compare the parent string against. The second is a string (or array of strings) representing a BCP-47 language tag, for example "en-US" for English as used in the United States or "en-GB" for English as used in Great Britain. This argument is optional and will default to the user’s locale. The final argument is an object of option properties. The return value will always be a number: negative if the parent string comes before the string argument, positive if the parent string comes after the string argument, or 0 if they are in the same position. Only the sign of the result is guaranteed, not the exact values -1 and 1, so it is safest to compare the result against 0.

The Code

// Only the sign of the result is guaranteed, so compare against 0.
if ('Mike'.localeCompare('John') < 0) {
    console.log('Mike comes before John');
} else if ('Mike'.localeCompare('John') > 0) {
    console.log('John comes before Mike');
} else {
    console.log('John and Mike are both in the same position');
}
if ('Jan'.localeCompare('John') < 0) {
    console.log('Jan comes before John');
} else if ('Jan'.localeCompare('John') > 0) {
    console.log('John comes before Jan');
} else {
    console.log('John and Jan are both in the same position');
}
if ('a'.localeCompare('á') < 0) {
    console.log('a comes before á');
} else if ('a'.localeCompare('á') > 0) {
    console.log('á comes before a');
} else {
    console.log('a and á are the same kind of letter');
}
if ('Michelle'.localeCompare('Michéllé') < 0) {
    console.log('Michelle comes before Michéllé');
} else if ('Michelle'.localeCompare('Michéllé') > 0) {
    console.log('Michéllé comes before Michelle');
} else {
    console.log('Michéllé and Michelle are both in the same position');
}
Listing 3-26.
Determining If a String “Comes Before” Another with localeCompare()
The output is:
John comes before Mike
Jan comes before John
a comes before á
Michelle comes before Michéllé

How It Works

Using String.prototype.localeCompare() can get very complex, because of the complexities of language. The options object adds to this, as it has various properties that change the behavior of the function. The localeMatcher property decides which matching algorithm to use and must be a string of either "lookup" or "best fit" (the default). The sensitivity property changes how similar characters are compared: the "base" value will compare “a”, “á”, and “A” as 0 (the same character); "accent" will compare “a” and “A” as 0 (the same), but “á” as different; and "case" compares “a” and “á” as 0 (the same) but “A” as different. Finally, "variant" (the default) compares “a”, “á”, and “A” as all different. The sensitivity you pick will depend entirely on your application. Other properties, such as ignorePunctuation, numeric, and caseFirst, are reasonably self-explanatory and will be left as an exercise for you to explore should you need to, especially as these options are part of a separate ECMAScript specification (the ECMAScript Internationalization API Specification, 1st Edition, which could warrant its own book).
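The sensitivity values can be explored with a short sketch. It assumes an environment that implements the Internationalization API (a current browser, or Node.js with its default ICU data), and the helper name sameUnder() is hypothetical.

```javascript
// Hypothetical helper: true when the two strings compare as equal
// under the given sensitivity, using the "en" locale.
function sameUnder(sensitivity, first, second) {
    return first.localeCompare(second, 'en', { sensitivity: sensitivity }) === 0;
}

var accented = '\u00E1'; // the letter a with an acute accent

console.log(sameUnder('base', 'a', accented));    // true
console.log(sameUnder('base', 'a', 'A'));         // true
console.log(sameUnder('accent', 'a', 'A'));       // true
console.log(sameUnder('accent', 'a', accented));  // false
console.log(sameUnder('case', 'a', accented));    // true
console.log(sameUnder('case', 'a', 'A'));         // false
console.log(sameUnder('variant', 'a', 'A'));      // false

// The numeric option compares runs of digits by value rather than
// character by character.
var files = ['file10', 'file2', 'file1'].sort(function (x, y) {
    return x.localeCompare(y, 'en', { numeric: true });
});
console.log(files.join(', ')); // file1, file2, file10
```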

Counting the Occurrences of a Substring

Problem

You want to be able to count the occurrences of a substring in a given parent string, but no methods exist for this natively in JavaScript.

Solution

There are many ways to solve this particular problem, some using regular expressions, others using String.prototype.split(), but the simplest and most effective (also the fastest) is to iterate over the string while using String.prototype.indexOf().

The Code

function findOccurrences(string, substring) {
  var occurrenceCount = 0, position = 0;
  while ((position = string.indexOf(substring, position)) !== -1)  {
      occurrenceCount++;
      position += substring.length;
  }
  return occurrenceCount;
}
console.log('javascript contains ' + findOccurrences('javascript', 'a') + ' letter "a"s');
console.log('antidisestablishmentarianism contains ' + findOccurrences('antidisestablishmentarianism', 's') + ' letter "s"s');
console.log('"echo echo echo" contains the word "echo" ' + findOccurrences('echo echo echo', 'echo') + ' times');
var sentence = 'once you, like, find that, like, you hear people saying like a lot, it gets really like, irritating';
console.log('That sentence had the word "like" ' + findOccurrences(sentence, 'like') + ' times');
var name = 'Robert';
if (findOccurrences(name.toLowerCase(), 'e') === 1) {
    console.log('Your name contains the letter E once, good for you!');
} else if(findOccurrences(name.toLowerCase(), 'e') > 1) {
    console.log('Wow, your name has a lot of Es in it!');
} else {
    console.log('Strange, the letter E is the most common vowel, but your name has none');
}
var name = 'Ebraheem';
if (findOccurrences(name.toLowerCase(), 'e') === 1) {
    console.log('Your name contains the letter E once, good for you!');
} else if(findOccurrences(name.toLowerCase(), 'e') > 1) {
    console.log('Wow, your name has a lot of Es in it!');
} else {
    console.log('Strange, the letter E is the most common vowel, but your name has none');
}
Listing 3-27.
Counting the Occurrences of a Substring
The output is:
javascript contains 2 letter "a"s
antidisestablishmentarianism contains 4 letter "s"s
"echo echo echo" contains the word "echo" 3 times
That sentence had the word "like" 4 times
Your name contains the letter E once, good for you!
Wow, your name has a lot of Es in it!

How It Works

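A high-level overview of this function is that it uses a loop to iterate through the string, each time finding the first occurrence after the previous one. String.prototype.indexOf() is given position as its second argument, so each search starts where the last one finished. After every match, position is moved forward by substring.length so that the same match is never counted twice (which also means overlapping matches are not counted). When indexOf() returns -1 there are no more occurrences, so the loop ends and the function returns the total tally. Note that an empty substring would never move position forward, so the function should not be called with one.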
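To illustrate how the step size affects the tally, here is a hedged variant of the same loop. The name countOccurrences() and its allowOverlap flag are additions for this sketch and are not part of the listing; it also guards against an empty substring.

```javascript
// Hypothetical variant of findOccurrences(); allowOverlap is an
// illustrative addition that makes the search advance one character
// at a time instead of skipping past each match.
function countOccurrences(string, substring, allowOverlap) {
    if (substring.length === 0) {
        return 0; // an empty substring matches everywhere and would loop forever
    }
    var count = 0,
        position = 0,
        step = allowOverlap ? 1 : substring.length;
    while ((position = string.indexOf(substring, position)) !== -1) {
        count++;
        position += step;
    }
    return count;
}

console.log(countOccurrences('aaaa', 'aa'));        // 2
console.log(countOccurrences('aaaa', 'aa', true));  // 3
console.log(countOccurrences('javascript', ''));    // 0
```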

Padding a String with a Custom Function

Caution
String.prototype.repeat(), default function parameters, and the use of let to declare local variables are ES6 features. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, only support some of these features. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to be able to “pad” a string to ensure it is of a given length. If it is too short, you want to fill the remainder with characters of your own choosing.

Solution

The solution here is to combine various string methods. The biggest part will be String.prototype.repeat(), along with the string’s .length property to determine how much additional padding needs to be added. We’ll make a function that takes the string to pad as the first argument, the desired length after padding as the second, the string to pad with as the third, and the desired direction to pad as the fourth and final argument. Negative numbers will pad left, positive numbers will pad right, and 0 will pad in both directions, favoring the right if the padding cannot be split evenly. The function will also cater to multi-character pad strings.

The Code

function pad(string, desiredLength = 0, padString = ' ', direction = -1) {
    var repetition = (desiredLength - string.length) / padString.length;
    if (repetition && direction > 0) {
        return string + padString.repeat(repetition);
    } else if (repetition && direction < 0) {
        return padString.repeat(repetition) + string;
    } else if (repetition && direction === 0) {
        let left = Math.floor(repetition/2),
            right = repetition - left;
        return padString.repeat(left) + string + padString.repeat(right);
    }
    return string;
}
console.log(pad('indent', 10));
console.log(pad('indent', 14));
console.log(pad('01', 12, '0'));
console.log('0x' + pad('61', 4, '0'));
console.log(pad('echo', 8, 'o', 1));
console.log(pad('trails off', 22, '.', 1));
console.log(pad('equal indent', 20, ' ', 0));
console.log(pad('double padded', 20, '-', 0));
Listing 3-28.
Padding a String with a Custom Function
The output is:
    indent
        indent
000000000001
0x0061
echooooo
trails off............
    equal indent (followed by four trailing spaces)
---double padded----

How It Works

As a high-level overview, this function first calculates how many times the pad string needs to be repeated. It then repeats it using String.prototype.repeat() on the side given by the desired direction. If the direction is “both”, it calculates the left side as a whole number and gives the right side whatever is left over; this is the most reliable way to reach the desired length, at the cost of sometimes having uneven padding (which favors the right). Let’s examine further to get an understanding of how it works:
function pad(string, desiredLength = 0, padString = ' ', direction = -1) {
pad has been declared as a function with four arguments (three of them are optional). Chapter 14 goes into a lot more detail about argument defaults and how to use them. This specific syntax requires an ES6-compatible interpreter; read Chapter 14 for the alternative ES5-compatible version.
    var repetition = (desiredLength - string.length) / padString.length;
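This line does a simple bit of math: it calculates the amount the given string is short by (desiredLength - string.length) and divides it by padString’s length. This way repetition becomes the number of times padString needs to repeat, rather than the number of characters that the given string is lacking. Two caveats are worth knowing: String.prototype.repeat() ignores any fractional part of repetition, and it throws a RangeError when given a count of -1 or less, so the function as written assumes that desiredLength is not smaller than string.length.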
    if (repetition && direction > 0) {
        return string + padString.repeat(repetition);
Here is the first block of the if statement (while not strictly one line, both lines are simple enough to explain as one). The if statement checks that repetition is a truthy value (i.e., not 0) and that the direction is a positive number (which, by our semantics, means pad to the right). repetition is checked because String.prototype.repeat() returns an empty string when given 0, so there is no point entering the block only to return the given string unchanged; we may as well save the method call. If repetition is 0, execution falls through every branch of the if statement until it hits the final line of the function: return string;. There it simply returns the given string, in other words a “no op”.
    } else if (repetition && direction < 0) {
        return padString.repeat(repetition) + string;
This block has the same fundamentals as the previous block, the only difference being that this block adds padding to the left of the string.
    } else if (repetition && direction === 0) {
        let left = Math.floor(repetition/2),
            right = repetition - left;
        return padString.repeat(left) + string + padString.repeat(right);
    }
This block is more complex than the others. This is the path taken when the direction is “both”; in other words, you should expect an equal amount of padding on the left and on the right. Ideally the result is exactly the desired length with even padding on each side, but that is not always possible (for example, when an odd number of repetitions is needed), and the string certainly shouldn’t end up longer than the desired length. Because of this, the first step is to create two variables, left and right, and set them to roughly half of the repetitions needed. More specifically, the left side is set to half (rounded down to the nearest whole number) and the right side takes the remaining padding (repetition - left). The return line should be self-explanatory by now. If you wanted your implementation to favor the left side, as opposed to the right, you could simply swap the left and right variable statements around.
    return string;
}
The last line is a fallback for when none of the if conditions match; the method should always return a string so that it is reliable for developers to use. We simply return the original string here because there is nothing else we can really do: the interpreter will only end up here if the repetition count is 0, or if the direction is not a value that can be compared with 0 (a non-numeric string, for example). We could check the type of direction early on (using typeof) and throw a TypeError if it is not a number, to ensure the API is followed accurately, but that is generally an uncommon practice in JavaScript. Feel free to add this to your own implementations.
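If you would like the function to tolerate strings that are already too long, one possible variant is sketched below. The name safePad() is hypothetical, and clamping the repetition count to a non-negative whole number is an assumption about the desired behavior rather than part of the listing.

```javascript
// A hedged variant of pad(); safePad is a hypothetical name.
function safePad(string, desiredLength = 0, padString = ' ', direction = -1) {
    // Whole repetitions that fit: never negative, and 0 for an empty padString.
    var shortfall = desiredLength - string.length,
        repetition = padString.length > 0
            ? Math.max(0, Math.floor(shortfall / padString.length))
            : 0;
    if (repetition === 0) {
        return string;
    }
    if (direction > 0) {
        return string + padString.repeat(repetition);
    }
    if (direction < 0) {
        return padString.repeat(repetition) + string;
    }
    if (direction === 0) {
        var left = Math.floor(repetition / 2),
            right = repetition - left;
        return padString.repeat(left) + string + padString.repeat(right);
    }
    return string; // direction could not be compared with 0
}

console.log(safePad('indent', 3));      // indent (no RangeError)
console.log(safePad('7', 4, 'ab'));     // ab7 (a second "ab" would overshoot)
console.log(safePad('mid', 8, '*', 0)); // **mid***
```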

Truncating a String with a Custom Function

Caution
Default function parameters are an ES6 feature. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts.

Problem

You want to be able to truncate a string to a particular length, while also ensuring word integrity.

Solution

The solution for this may seem quite simple—you simply slice the given string by the given amount—however for it to be a robust solution we need to be more intelligent, especially when considering the preservation of whole words.

The Code

function truncate(string, desiredLength, addendum = '\u2026') {
    if (string.length <= desiredLength) {
        return string;
    }
    return string.slice(0, string.lastIndexOf(' ', desiredLength - addendum.length)) + addendum;
}
console.log( truncate('The truncate function will shorten strings to the nearest word', 20) );
console.log( truncate('The addendum can be customized to be any desired substring', 35, ' (read more...)') );
Listing 3-29.
Truncating a String with a Custom Function
The output is:
The truncate…
The addendum can be (read more...)

How It Works

A high-level overview of this function is as follows: it takes the string and, using String.prototype.slice(), chops the end off. The ending index is the last space found at or before the truncation point, which is the desiredLength minus the length of the addendum. The returned string is then given an addendum, which defaults to the Horizontal Ellipsis character (\u2026). Looking at this code piece by piece provides more insight:
function truncate(string, desiredLength, addendum = '\u2026') {
truncate has been declared as a function with three arguments (one of them is optional). Chapter 14 goes into a lot more detail about argument defaults and how to use them. This specific syntax requires an ES6-compatible interpreter; read Chapter 14 for the alternative ES5-compatible version.
    if (string.length <= desiredLength) {
        return string;
    }
This if statement provides an early exit to the function if the given string’s length is under or the same as the desiredLength threshold; for example if string.length is 20 and the desiredLength is 20, there is no point truncating this string. It simply returns string here, making this particular path a “no op”.
    return string.slice(0, string.lastIndexOf(' ', desiredLength - addendum.length)) + addendum;
Here is where all of the logic lies. The solution in this particular implementation is simple, perhaps a little naive, and much more complex solutions exist for the same problem. This also generally only works with LTR (Left to Right) languages such as English or French. String.prototype.slice() is given two arguments: the first is 0 (the beginning of the string) and the second is more complex.
Here the function calls String.prototype.lastIndexOf() with a ' ' (space) character. On its own, that would find the last space character in the string, which is the index between the last and second-to-last words. However, it is also passed a second argument, desiredLength - addendum.length, which tells String.prototype.lastIndexOf() to start at that index and search backward toward the beginning of the string. This means that it actually finds the nearest space at or to the left of the cutoff point. This makes the function a little greedy, oftentimes coming in under the desiredLength, but for this use case it is better to be under than over. Note that if there is no space before the cutoff point, lastIndexOf() returns -1 and slice(0, -1) removes only the final character, so the function assumes the text contains a space within the desired length.
The line as a whole simply returns the newly sliced string with the addendum string concatenated onto the end. If you wanted to use this implementation but did not want anything concatenated on, you could simply pass an empty string ('') as the third argument.
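For text that may contain very long words, one possible adjustment is sketched here. The name safeTruncate() is hypothetical, and falling back to a hard cut when no space is found is an assumption about what you would want, not something the listing does.

```javascript
// A hedged variant of truncate(); safeTruncate is a hypothetical name.
function safeTruncate(string, desiredLength, addendum = '\u2026') {
    if (string.length <= desiredLength) {
        return string;
    }
    var limit = Math.max(0, desiredLength - addendum.length),
        cut = string.lastIndexOf(' ', limit);
    if (cut <= 0) {
        cut = limit; // no usable space found: fall back to a hard cut
    }
    return string.slice(0, cut) + addendum;
}

console.log(safeTruncate('The truncate function will shorten strings', 20)); // The truncate…
console.log(safeTruncate('Supercalifragilisticexpialidocious', 10));          // Supercali…
```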

Making a ROT13 Cypher with a Custom Function

Problem

You want a function that will convert text using the famous ROT13 cypher (a crude form of encryption on the Latin alphabet that moves every letter along by 13 places, e.g., a becomes n, b becomes o, and so on).

Solution

Using a combination of String.prototype.replace(), String.prototype.toLowerCase(), String.prototype.charCodeAt(), and String.fromCharCode(), we can create a custom function that will turn the first and only argument (the given string) into a string that has been “encoded” with ROT13. It always expects a string and always returns a string.

The Code

function rot13(string) {
    return string.replace(/[a-z]/ig, function (character) {
        var moveNumber = character.toLowerCase() < 'n' ? 13 : -13;
        character = character.charCodeAt(0) + moveNumber;
        return String.fromCharCode(character);
    });
}
console.log(rot13('hello'));
console.log(rot13('HELLO'));
console.log(rot13('uryyb'));
console.log(rot13('URYYB'));
console.log(rot13('This is a secret message'));
console.log(rot13('Guvf vf n frperg zrffntr'));
Listing 3-30.
Making a ROT13 Cypher with a Custom Function
The output is:
uryyb
URYYB
hello
HELLO
Guvf vf n frperg zrffntr
This is a secret message

How It Works

The core of this function uses String.prototype.replace() with a replacement function to process each Latin alphabet character, shifting its character code up or down by 13. A line-by-line look at this will give more insight.
function rot13(string) {
Here the function rot13 is declared and takes one argument, a string. For more on functions, read Chapter 14.
    return string.replace(/[a-z]/ig, function (character) {
This is the call to String.prototype.replace(), which, as discussed earlier in this chapter, can take a regular expression as the first argument and a function as the second. The first argument given is a reasonably simple regular expression: /[a-z]/ig. Splitting this regular expression into its component parts, the first bit, [a-z], says match any character between lowercase a and z, including a and z. The regular expression then has two flags: i and g. The i flag means “insensitive”, as in case insensitive, and means that the regular expression will now match both upper- and lowercase characters. The g flag means “global”; that is, it will match all occurrences rather than just the first. So in total this regular expression says “match every character from A to Z, regardless of case”. It will capture each character, one by one, and pass it to the second argument. For more on regular expressions, read Chapter 20.
The second argument is a function that itself has one argument, named character. String.prototype.replace() will execute this function for each match, passing it a set of arguments, the first of which is the matched string from the given regular expression. It will replace the matched text with whatever the function returns.
    var moveNumber = character.toLowerCase() < 'n' ? 13 : -13;
ROT13 moves Latin alphabet characters 13 places forward, taking into account that the letter after z loops back to a again, so the letter z moved 13 places forward becomes m. Rather than implement the looping literally in code, we cheat. If the character given comes before n, we add 13, but if it is n or later, we take away 13. This gives exactly the same effect as moving forward on an infinitely looping alphabet.
This works by comparing the lowercased character with the character n using the less-than operator (discussed in Chapter 2). As mentioned, all strings are effectively a series of 16-bit integers, and so comparing a one-character string (for example ‘a’, 0x61 or 97 in decimal) to another one-character string (in this case ‘n’, 0x6e or 110 in decimal) will compare the integer values. Because we are effectively comparing integers, ‘a’ < ‘n’ works and evaluates to true, the same as 97 < 110 does. However, capital letters have a completely different set of character codes: while ‘a’ is 0x61, ‘A’ is 0x41 (65 in decimal). Because capital letters have values 32 less than their lowercase counterparts, every capital letter would compare as less than ‘n’ and break our cheat for deciding which direction to move, so we use String.prototype.toLowerCase() to normalize any uppercase characters to fit within our pattern. This does not affect the final result because we only use it here to compare the integer value, not to return a lowercase string. The remainder of this line is simply a conditional operator to determine if moveNumber is +13 or -13. Read more on conditional operators in Chapter 2.
    character = character.charCodeAt(0) + moveNumber;
Here, String.prototype.charCodeAt(0) converts the given character into a character code. You could alternatively use String.prototype.codePointAt(0) for the same effect. We know because of our regular expression that we are only dealing with single characters from the Latin alphabet. The resulting character code then has moveNumber (-13 or +13) added to it. This is the line of code that actually does the ROT13 conversion: we now have a character code that is 13 places away from the original one.
    return String.fromCharCode(character);
Of course, the previous line converted the character variable into a number; it no longer holds a string character but a character code. So the last line cannot simply return character; it needs to convert the character code back into an actual string character. Enter String.fromCharCode().
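For comparison, the “infinitely looping alphabet” can also be written literally with the remainder operator. This is only an alternative sketch, and rot13Modulo() is a hypothetical name; it should produce the same results as the listing for the Latin letters a to z in either case.

```javascript
// Alternative sketch: wrap around the alphabet with % 26 instead of
// choosing between +13 and -13. rot13Modulo is a hypothetical name.
function rot13Modulo(string) {
    return string.replace(/[a-z]/ig, function (character) {
        var code = character.charCodeAt(0),
            base = code >= 97 ? 97 : 65; // 97 is 'a', 65 is 'A'
        // Position in the alphabet (0 to 25), moved 13 places, wrapped.
        return String.fromCharCode(base + (code - base + 13) % 26);
    });
}

console.log(rot13Modulo('hello'));            // uryyb
console.log(rot13Modulo('Hello, World!'));    // Uryyb, Jbeyq!
console.log(rot13Modulo(rot13Modulo('abc'))); // abc
```

Because 13 is exactly half of 26, applying the function twice returns the original text, which is why the same function both encodes and decodes.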

Calculating the Levenshtein Distance Between Two Strings with a Custom Function

Caution
Variables declared with let are an ES6 feature. Older browsers still in use, such as Internet Explorer 11 and below, or Safari 7 and below, do not support this feature. Check out http://kangax.github.io/es5-compat-table/es6/ for the current compatibility charts. If you would like to use this function in ES5-compatible browsers, simply replace every let with var.

Problem

You want to use the Levenshtein Distance algorithm to denote the numerical similarity between two words. Levenshtein distance is the algorithm used to predict spelling corrections or auto-correct, and counts the number of single character edits between two words, to get from one word to the other. For example, the Levenshtein distance between “cat” and “hat” is 1 (change the c to h), while “cat” and “care” is 2 (change the t to r, add the e). A naive implementation might just count the different letters, but we want the most efficient way to change this.

Solution

This code relies fairly little on the string’s prototype manipulation functions, and more on the mathematics behind strings. It utilizes for loops to iterate over what we call a “two-dimensional matrix”—an array of arrays. This is done to simulate the Levenshtein Distance algorithm. You could use this function, for example, in a Spell Checking engine. You have a spelling dictionary (literally a list of all available English words) and if a given word is not available in that list then you could find the closest word to it by calculating the Levenshtein Distance of each word in the dictionary, and taking the top-N (say, five) of those words to offer as suggestions for spelling corrections.

The Code

function lev(string1, string2) {
    var string1Length = string1.length,
        string2Length = string2.length,
        matrix = new Array(string1Length + 1);
    for (var i = 0; i <= string1Length; i++) {
        matrix[i] = new Array(string2Length + 1);
        matrix[i][0] = i;
    }
    for (let i = 0; i <= string2Length; i++) {
        matrix[0][i] = i;
    }
    for (let i = 1; i <= string1Length; i++) {
        for (let n = 1; n <= string2Length; n++) {
            let add = matrix[i - 1][n] + 1,
                remove = matrix[i][n - 1] + 1,
                change = matrix[i - 1][n - 1] + Number(string1.charAt(i - 1) !== string2.charAt(n - 1));
            matrix[i][n] = Math.min(add, remove, change);
        }
    }
    return matrix[string1Length][string2Length];
}
console.log('Distance between pea and part is ' + lev('pea', 'part'));
console.log('Distance between foo and four is ' + lev('foo', 'four'));
console.log('Distance between matrix and mattress is ' + lev('matrix', 'mattress'));
console.log('Distance between honey and money is ' + lev('honey', 'money'));
console.log('Distance between tape and hate is ' + lev('tape', 'hate'));
Listing 3-31.
Calculating the Levenshtein Distance Between Two Strings with a Custom Function
The output is:
Distance between pea and part is 3
Distance between foo and four is 2
Distance between matrix and mattress is 4
Distance between honey and money is 1
Distance between tape and hate is 2

How It Works

The Levenshtein algorithm defines a way to calculate the cheapest combination of operations that will change the first string into the second, or vice versa. The crux of the Levenshtein algorithm is the matrix, or grid. Table 3-1 shows an example of how the initial grid should look.
Table 3-1.
A Basic Levenshtein Grid Set Up to Convert “PEA” to “PART”
       ""   P    A    R    T
""     0    1    2    3    4
P      1
E      2
A      3
In Table 3-1, the initial row and column of numbers have been set up so that they count sequentially for each letter. In this case for PEA and PART the numbers (starting at 0) count to 3 for PEA and 4 for PART. These numbers represent the cost of converting an empty string to the respective word; for example, to convert an empty string to 'P' costs 1, to convert an empty string to 'PE' costs 2, and 'PEA' would cost 3. The same applies to the 'PART' word going across. The empty cells will be full of similar numbers, which we have the task of calculating.
For each empty cell, we need to fill in the cell with one of three values, whichever is the least. It is important to remember that each cell represents the strings up until that point, so cell C3 (the intersection of P and P) represents the change from 'P' to 'P', which of course is no change, meaning we can put a 0 here. Cell C4 (the intersection of the P in PART and the E in PEA) represents the change from 'PE' to 'P', which is one deletion, so we can put 1 here. For each string you can do three operations: remove letters, change letters, and add letters. Each operation costs 1 point, but each one has different rules about how that number is taken:
  • If you want to add a letter, you need to take the number left of the cell and add 1.
  • If you want to remove a letter, you need to take the number above the cell and add 1.
  • If you want to change a letter, you need to take the number above-left of the cell and add 1.
  • If you want to do nothing, simply take the number above-left of the cell.
Now that we have the rules for which numbers to put down, we can calculate each of the cells and fill out the whole table. For the example in Table 3-1, a filled out version will look like Table 3-2. To help work out how each change has been made, all new digits are suffixed with E (equal), A (add), R (remove), or C (change).
Table 3-2.
A Completed Basic Levenshtein Grid Converting “PEA” to “PART”
       ""   P    A    R    T
""     0    1    2    3    4
P      1    0E   1A   2A   3A
E      2    1R   1C   2C   3C
A      3    2R   1E   2C   3C
From here, we can simply take the bottom-right number, which is 3. This is the smallest number of operations needed to convert “PEA” to “PART”. One way to do this is to add the letter A to ‘pea’ to make ‘paea’, change the letter E to the letter R, turning ‘paea’ into ‘para’, and finally change the last letter A to the letter T, so that ‘para’ becomes ‘part’. Other sequences of three operations exist, but none is shorter.
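The numbers in the completed grid can be checked with a short sketch. The helper levGrid() is a hypothetical name; it applies the three rules described above and returns the whole grid rather than only the bottom-right cell.

```javascript
// Hypothetical helper that returns the full Levenshtein grid.
// Rows follow the letters of string1, columns the letters of string2.
function levGrid(string1, string2) {
    var grid = [];
    for (var row = 0; row <= string1.length; row++) {
        grid[row] = [row]; // first column: 0, 1, 2, ...
        for (var col = 1; col <= string2.length; col++) {
            if (row === 0) {
                grid[row][col] = col; // first row: 0, 1, 2, ...
                continue;
            }
            var cost = string1.charAt(row - 1) === string2.charAt(col - 1) ? 0 : 1;
            grid[row][col] = Math.min(
                grid[row][col - 1] + 1,        // left: add a letter
                grid[row - 1][col] + 1,        // above: remove a letter
                grid[row - 1][col - 1] + cost  // above-left: change, or do nothing
            );
        }
    }
    return grid;
}

levGrid('PEA', 'PART').forEach(function (row) {
    console.log(row.join(' '));
});
// 0 1 2 3 4
// 1 0 1 2 3
// 2 1 1 2 3
// 3 2 1 2 3
```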
Once you have understood how this grid works, you can decipher what happens in the function. It does essentially the same thing in code: a “2D matrix” (array of arrays) is set up, each cell is calculated, and finally the bottom-right cell is taken as the Levenshtein Distance. Let’s look closer, piece by piece, to understand this function:
var string1Length = string1.length,
        string2Length = string2.length,
        matrix = new Array(string1Length + 1);
Here we store the string lengths, for easier later reference, and also create the initial array with a length of the first string’s length + 1. This array will form our virtual Levenshtein grid, just like the tables demonstrated.
    for (var i = 0; i <= string1Length; i++) {
        matrix[i] = new Array(string2Length + 1);
        matrix[i][0] = i;
    }
This for loop counts from 0 up to the length of the first string, and for each iteration it adds a new array at the right position in matrix. It sets the first position of this new array to i, which counts up, so we have an array of arrays that looks something like [[0], [1], [2], [3]] (given our lev('pea', 'part') example). This has created the first column of our grid.
    for (let i = 0; i <= string2Length; i++) {
        matrix[0][i] = i;
    }
This for loop does the same thing for the first row of the grid, looping through the length of string2 and adding incremental numbers to the first array. Given our lev('pea', 'part') example, the array now looks something like [[0, 1, 2, 3, 4], [1], [2], [3]]. This effectively looks like an unfinished Levenshtein grid.
    for (let i = 1; i <= string1Length; i++) {
        for (let n = 1; n <= string2Length; n++) {
            let add = matrix[i - 1][n] + 1,
                remove = matrix[i][n - 1] + 1,
                change = matrix[i - 1][n - 1] + Number(string1.charAt(i - 1) !== string2.charAt(n - 1));
            matrix[i][n] = Math.min(add, remove, change);
        }
    }
This is the main meat of the function. It’s a nested for loop, as we need to enter each of the nested arrays and then look at each item in each of the sub-arrays. This simulates the human behavior of filling out the grid, cell by cell. Inside the innermost loop we calculate the three allowed rules, so we can determine which one is cheapest for this individual cell. The variable add is matrix[i - 1][n] + 1, that is, the number from the row above plus 1; remove is matrix[i][n - 1] + 1, that is, the number from the column to the left plus 1; and change is matrix[i - 1][n - 1] + Number(string1.charAt(i - 1) !== string2.charAt(n - 1)), that is, the above-left number plus 1 if the character needs changing, or 0 if it doesn’t. (Compared with the rules listed earlier, the names add and remove are swapped, which simply describes the conversion in the opposite direction; because only the smallest value is kept, the result is the same.) Here we use the Number constructor to convert a Boolean: if the characters are not equal the comparison returns true, which is converted to 1, meaning we want a change operation; otherwise it returns false, which is converted to 0, meaning we don’t want a change operation because the characters are equal.
The last line of this loop assigns this cell the lowest value of the three rules: add, remove, or change. It uses Math.min (discussed in Chapter 4), which will choose the smallest number out of a given set of numbers. This means that each cell is filled with the smallest number out of the given operations.
    return matrix[string1Length][string2Length];
The last line of the function once again copies how you could manually calculate the Levenshtein distance. It takes the bottom-right value out of its grid (matrix, the two-dimensional array).
As you can see, most of the complexity with the Levenshtein Distance algorithm actually lies within its concept, but once it’s understood, it is reasonably trivial to implement into any language, including JavaScript. While the algorithm itself mostly relies on the use of arrays (Chapter 7) and numbers and mathematics (Chapter 4), it certainly deserves its place in this chapter, having huge utility in comparing the closeness of strings.
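The spell-checking idea mentioned in the Solution could be sketched roughly as follows. The suggest() function and the tiny word list are hypothetical and purely illustrative; lev() is repeated in a compact form only so that the block runs on its own.

```javascript
// Compact restatement of lev() so this sketch is self-contained.
function lev(string1, string2) {
    var matrix = [], i, n;
    for (i = 0; i <= string1.length; i++) {
        matrix[i] = [i];
    }
    for (n = 0; n <= string2.length; n++) {
        matrix[0][n] = n;
    }
    for (i = 1; i <= string1.length; i++) {
        for (n = 1; n <= string2.length; n++) {
            matrix[i][n] = Math.min(
                matrix[i - 1][n] + 1,
                matrix[i][n - 1] + 1,
                matrix[i - 1][n - 1] + Number(string1.charAt(i - 1) !== string2.charAt(n - 1))
            );
        }
    }
    return matrix[string1.length][string2.length];
}

// Hypothetical helper: return the `limit` closest dictionary words.
function suggest(word, dictionary, limit) {
    return dictionary
        .map(function (candidate) {
            return { word: candidate, distance: lev(word, candidate) };
        })
        .sort(function (a, b) {
            return a.distance - b.distance; // smallest distance first
        })
        .slice(0, limit)
        .map(function (entry) {
            return entry.word;
        });
}

var suggestions = suggest('hony', ['money', 'honey', 'matrix', 'tape'], 2);
console.log(suggestions.join(', ')); // honey, money
```

A real dictionary would be far larger, so in practice you would probably want to skip candidates whose length differs too much from the given word before calculating the distance.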