Lexicographic comparison

Lexicographic comparison occurs when both operands are strings, and involves the character-by-character comparison of each string. Broadly, strings that are greater are those that would appear later in a dictionary. Therefore, banana would be lexicographically greater than apple.

As we discovered in Chapter 6, Primitive and Built-In Types, JavaScript uses UTF-16 to encode strings, so each code unit is a 16-bit integer. The UTF-16 code units from 65 (U+0041) to 122 (U+007A) represent the following characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz

Characters that appear later in this sequence are represented by larger UTF-16 integers. To compare any two code units, JavaScript simply compares their integer values. For the case of comparing B to A, this might look something like this:

const intA = 'A'.charCodeAt(0); // => 65
const intB = 'B'.charCodeAt(0); // => 66
intB > intA; // => true

To compare two strings, JavaScript goes code unit by code unit. At the first index where the code units differ, the string with the larger code unit is considered greater than the other, and the comparison stops there:

'AAA' < 'AAB'; // => true
'AAB' < 'AAC'; // => true

And if one operand is a prefix of the other, the shorter string will always be considered less than the longer one, as shown here:

'coff' < 'coffee'; // => true
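
The two rules above (the first differing code unit decides, and a prefix is always lesser) can be sketched as a small function. This is only an illustration of the behavior described here, not how any engine implements it; the name compareCodeUnits is hypothetical:

```javascript
// Hypothetical helper: mimics how `<` and `>` order two strings by code unit.
// Returns -1 if a < b, 1 if a > b, and 0 if the strings are equal.
function compareCodeUnits(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let i = 0; i < shared; i++) {
    const unitA = a.charCodeAt(i);
    const unitB = b.charCodeAt(i);
    if (unitA !== unitB) {
      // The first differing code unit decides the result.
      return unitA < unitB ? -1 : 1;
    }
  }
  // No difference within the shared length: the shorter string is lesser.
  if (a.length === b.length) return 0;
  return a.length < b.length ? -1 : 1;
}

compareCodeUnits('AAA', 'AAB'); // => -1
compareCodeUnits('coff', 'coffee'); // => -1
compareCodeUnits('banana', 'apple'); // => 1
```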

As you may have spotted, the lowercase English letters occupy higher UTF-16 integers than the uppercase letters. As a result, an uppercase letter is considered less than any lowercase letter and would, therefore, appear before it in a lexicographic ordering:

'A' < 'a'; // => true
'Z' < 'z'; // => true
'Adam' < 'adam'; // => true

You'll also notice that the code units from 91 to 96 are the punctuation characters [\]^_`. This will also affect our lexicographic comparisons:

'[' < ']'; // => true
'_' < 'a'; // => true
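
One place these rules become visible is the default behavior of Array.prototype.sort, which, when called without a comparator, orders strings by their UTF-16 code units. The following is a small sketch with made-up sample values:

```javascript
// Sample values are illustrative only.
const labels = ['apple', '_private', 'Zebra', '[x]', 'Adam'];

// Copy first, because sort() reorders the array in place.
const sorted = [...labels].sort();

sorted; // => ['Adam', 'Zebra', '[x]', '_private', 'apple']
```

Uppercase words come first, the punctuation that sits between the two cases comes next, and lowercase words come last.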

Unicode tends to be arranged in a way that any given language's characters will be naturally sorted lexicographically so that the first symbols in a language's alphabet are expressed by lower 16-bit integers than the later symbols. Here, we see, for example, the word for chicken ("ไก่") in Thai is lexicographically less than the word for egg ("ไข่") since the ก character appears before ข in the Thai alphabet:

'ไก่' < 'ไข่'; // => true ("chicken" comes before "egg")
'ก'.charCodeAt(0); // => 3585
'ข'.charCodeAt(0); // => 3586

The natural order of Unicode may not always yield a sensible lexicographic order. As we learned in the previous chapter, complex symbols can be expressed by combining multiple code units into combining character pairs, surrogate pairs (creating code points), or even grapheme clusters. This can create various difficulties. One example is the following case, where a given symbol, LATIN CAPITAL LETTER A WITH CIRCUMFLEX, can be expressed either via the lone Unicode code point U+00C2 or by combining the capital letter "A" (U+0041) with the COMBINING CIRCUMFLEX ACCENT (U+0302). Symbolically and semantically, these are identical:

'Â'; // => Â
'A\u0302'; // => Â

However, since U+00C2 (decimal: 194) is larger than U+0041 (decimal: 65), the single-code-point form will be considered greater in a lexicographic comparison, even though the two are symbolically and semantically identical:

'Â' > 'A\u0302'; // => true
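
One possible way to guard against this particular discrepancy is to bring both operands into the same Unicode normalization form with String.prototype.normalize before comparing them. The sketch below assumes NFC (the composed form) is acceptable for your data; the variable names are illustrative:

```javascript
const precomposed = '\u00C2'; // Â as a single code unit
const decomposed = 'A\u0302'; // A followed by a combining circumflex

precomposed === decomposed; // => false
precomposed > decomposed; // => true

// After normalizing both operands to NFC, they become the same string.
const left = precomposed.normalize('NFC');
const right = decomposed.normalize('NFC');

left === right; // => true
left > right; // => false
```

Normalization only removes differences between equivalent encodings of the same symbol; the result is still ordered by code unit.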

There are thousands of these potential discrepancies to watch out for, so if you ever find yourself needing to compare lexicographically, be mindful that JavaScript's greater-than and less-than operators will be limited by Unicode's inherent ordering.
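
Surrogate pairs are another source of surprises. A code point above U+FFFF is stored as two code units, the first of which lies between 0xD800 and 0xDBFF, so it compares as less than any single code unit above that range, even though its code point is larger. The sketch below contrasts the built-in operator with a hypothetical compareCodePoints helper that orders strings by whole code points instead:

```javascript
// Hypothetical helper: orders two strings by code point rather than code unit.
// Returns -1, 0, or 1.
function compareCodePoints(a, b) {
  // Array.from iterates a string by code point, keeping surrogate pairs together.
  const pointsA = Array.from(a, (ch) => ch.codePointAt(0));
  const pointsB = Array.from(b, (ch) => ch.codePointAt(0));
  const shared = Math.min(pointsA.length, pointsB.length);
  for (let i = 0; i < shared; i++) {
    if (pointsA[i] !== pointsB[i]) {
      return pointsA[i] < pointsB[i] ? -1 : 1;
    }
  }
  return Math.sign(pointsA.length - pointsB.length);
}

const emoji = '\u{1F600}'; // stored as the surrogate pair 0xD83D 0xDE00
const fullwidthA = '\uFF21'; // FULLWIDTH LATIN CAPITAL LETTER A, one code unit

emoji < fullwidthA; // => true, because 0xD83D < 0xFF21
compareCodePoints(emoji, fullwidthA); // => 1, because 0x1F600 > 0xFF21
```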
