String

The String type in JavaScript allows us to express sequences of characters. It is usually used to encapsulate words, sentences, lists, HTML, and many other forms of text-like content.

Strings are expressed by delimiting sequences of characters with either single quotes, double quotes, or backticks:

// Single quotes:
const name = 'Titanic';

// Double quotes:
const type = "Ship";

// Template literals (back-ticks):
const report = `
RMS Titanic was a British passenger liner that sank
in the North Atlantic Ocean in 1912 after the ship
struck an iceberg during her maiden voyage.
`;

Only backtick-delimited strings, known as template literals (or template strings), can occupy multiple lines. Single quote- or double quote-delimited strings can technically be spread across multiple lines as well, but only by escaping each invisible newline character with a backslash (\), which effectively removes the newline from the resulting string:

const a = "example of a \
string with escaped newline \
characters";

const b = "example of a string with escaped newline characters";

a === b; // => true

Nowadays, template literals are preferred as they retain newlines and allow us to interpolate arbitrary expressions, like so:

const nBreadLoaves = 4;
const breadLoafCost = 2.40;

`
I went to the market and bought ${nBreadLoaves} loaves of
bread and it cost me ${nBreadLoaves * breadLoafCost} euros.
`

Strings come with a number of curious challenges once your usage exceeds the simplest use cases. Under the surface, the humble string is masking a miraculous scale of complexity in the form of Unicode.

Unicode is an industry standard for the encoding, representation, and handling of text that's used in writing systems around the world. The Unicode standard contains over 130,000 characters, including all of your favorite emojis.

To step beneath the veneer of the String abstraction slightly, we can say that strings in JavaScript are really just an ordered sequence of 16-bit unsigned integers. Each of these integers is interpreted as a UTF-16 code unit. UTF-16 is a type of encoding for the Unicode character set. Using it, we are able to express over a million valid Unicode code points. This means that we can express emojis, many languages, and a myriad of Unicode oddities via our strings.
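As a rough illustration of the "sequence of 16-bit integers" idea, the sketch below reads the code units out of a short string and then rebuilds the string from those integers. The variable names are ours, chosen only for this example; charCodeAt() and String.fromCharCode() are standard String methods.

```javascript
const greeting = 'Hi';

// Collect the 16-bit code unit found at each index of the string.
const codeUnits = [];
for (let i = 0; i < greeting.length; i++) {
  codeUnits.push(greeting.charCodeAt(i));
}

codeUnits; // => [72, 105]

// Rebuild the original string from its code units.
const rebuilt = String.fromCharCode(...codeUnits);
rebuilt === greeting; // => true
```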

A Unicode code point is a number that identifies a single character (such as the letter B, a question mark, or a smiling emoji). We can express a code point by using one or more UTF-16 code units. Most code points that we use from day to day only need a single code unit. (Strictly speaking, the term Unicode scalar value covers every code point other than the surrogates themselves, however many code units it needs.) There are, however, quite a few Unicode code points that require a pair of code units (known as a surrogate pair). The panda emoji is an example of a character that requires such a surrogate pair.
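A minimal sketch of the panda emoji's surrogate pair, using the standard charCodeAt() and codePointAt() methods (the variable name panda is ours):

```javascript
const panda = '🐼';

panda.length;        // => 2 (two code units)
panda.charCodeAt(0); // => 55357 (first half of the surrogate pair)
panda.charCodeAt(1); // => 56380 (second half of the surrogate pair)

// codePointAt() reads the whole pair as a single code point.
panda.codePointAt(0); // => 128060
```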

Since UTF-16 only has 16 bits to work with, it has to use pairs of 16-bit integers to express some characters. Naturally, if we're using UTF-32 encoding (with 32 bits to play with), then we'd be able to express the panda emoji in a single 32-bit integer.

Calling charCodeAt() on each half of the panda emoji gives us its individual UTF-16 code units: 55357 and 56380 in decimal. Since there are so many possible values, it is simpler and more convenient to use hexadecimal digits to express them, so we can say that the panda emoji is expressed by the code units U+D83D and U+DC3C (Unicode hexadecimal values are conventionally prefixed with U+).
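To make the hexadecimal values concrete, here is a small sketch that prints the panda's code units in hexadecimal and then derives the same pair from the code point U+1F43C using the standard UTF-16 surrogate arithmetic. The helper name toSurrogatePair is hypothetical, introduced only for this illustration.

```javascript
const pandaEmoji = '🐼';

pandaEmoji.charCodeAt(0).toString(16); // => 'd83d'
pandaEmoji.charCodeAt(1).toString(16); // => 'dc3c'

// Hypothetical helper: split a code point above U+FFFF into its
// high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

toSurrogatePair(0x1F43C); // => [55357, 56380], that is, [0xD83D, 0xDC3C]
```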

In addition to surrogate pairs, there is another type of combination that's useful to know about. A combining code point (or combining character) attaches to the character that precedes it, augmenting it into a new symbol. Examples of this include traditional Latin characters that can be augmented with accents or other diacritics, such as the combining tilde (U+0303).
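As a sketch of the combining tilde in practice, the snippet below places U+0303 after a plain letter n. The normalize() call goes slightly beyond what this section covers and is included only to show that the combining form and the precomposed character are different sequences of code units.

```javascript
// 'n' followed by U+0303 COMBINING TILDE
const nWithTilde = 'n\u0303';

nWithTilde;        // renders as a single n with a tilde above it
nWithTilde.length; // => 2 (two code units, one visible symbol)

// Unicode also defines a precomposed n-with-tilde at U+00F1.
nWithTilde === '\u00F1';                  // => false
nWithTilde.normalize('NFC') === '\u00F1'; // => true
```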

A combining character such as the combining tilde can be expressed via a Unicode escape sequence (\u0303). The format of \uXXXX allows us to express Unicode code units between U+0000 and U+FFFF within JavaScript strings.

The range of Unicode between U+0000 and U+FFFF is known as the Basic Multilingual Plane (BMP) and includes the most commonly used everyday characters.

Our panda emoji, as we've already seen, is quite an obscure symbol. It does not exist on the BMP and is thus expressed by a surrogate pair of two UTF-16 code units. We can express these individually in JavaScript strings via two Unicode escape sequences, \uD83D and \uDC3C.
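A minimal sketch of the two escape sequences placed side by side (the variable name is ours):

```javascript
const pandaFromUnits = '\uD83D\uDC3C';

pandaFromUnits === '🐼'; // => true
pandaFromUnits.length;   // => 2
```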

More obscure and ancient symbols are found in the supplementary (or astral) planes between U+010000 and U+10FFFF. The escaping format of \uXXXX does not have enough slots for us to express these. Symbols within the astral planes require at least five hexadecimal digits to express, so we must use the more recently introduced escape sequence format of \u{X}. This provides up to six hexadecimal slots (\u{XXXXXX}) and can thus express over 1 million different code points. Using this type of escape sequence, we can express our panda emoji directly via its code point (U+1F43C).
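For comparison, a short sketch of the code point escape next to the surrogate pair form. String.fromCodePoint() is a standard method that builds the same string from the numeric code point; the variable name is ours.

```javascript
const pandaFromCodePoint = '\u{1F43C}';

pandaFromCodePoint === '\uD83D\uDC3C'; // => true
pandaFromCodePoint.length;             // => 2 (still stored as two code units)

String.fromCodePoint(0x1F43C) === pandaFromCodePoint; // => true
```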

The newer \u{X} escape sequence is really convenient and goes some way toward making Unicode less burdensome to use in JavaScript. But there is still a little more complexity to explore. Surrogate pairs and combining characters are examples where UTF-16 code units are combined to produce individual symbols. On top of this, there are longer sequences called grapheme clusters. These are used to express combinations of code points that can be combined to create an aggregate symbol.
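One commonly seen grapheme cluster is a family emoji, which is built from several person emojis glued together with the zero-width joiner (U+200D). The sketch below is only an illustration; how the cluster is drawn depends on the fonts available to the platform.

```javascript
// Man + zero-width joiner + woman + zero-width joiner + girl
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';

family;             // typically rendered as a single family emoji
family.length;      // => 8 (code units)
[...family].length; // => 5 (code points)
```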

Wow! Unicode is a pretty incredible feat of engineering, but it can make things complicated for us. The ability to combine Unicode in all of these ways (combining characters, surrogate pairs, and grapheme clusters) creates a challenge for us. JavaScript strings, as you may know, have a length property. This property returns the number of code units in a given string (that is, the number of 16-bit integers in the entire sequence). For most strings, this is straightforward:

'fox'.length;   // => 3
'12345'.length; // => 5

However, as we know, we are able to combine code units to create code points and we are also able to combine code points to create grapheme clusters. This means the length property, which is only concerned with the 16-bit code units, can give us unexpected results.
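A minimal sketch of such a result, using a smiling-face emoji (U+1F60A). The spread syntax iterates a string by code point rather than by code unit, which is why it reports a different count.

```javascript
const smile = '😊';

smile.length;      // => 2
'fox😊'.length;    // => 5
[...smile].length; // => 1 (iterating by code point)
```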

A smiling-face emoji such as 😊 is composed of two code units, so JavaScript correctly tells us that a string containing only that emoji has a length of 2. But this may not be what we expect or desire. It's even more challenging when we're dealing with grapheme clusters that may use a dozen different code units to express a single symbol.

Watch out when attempting to truncate or establish the width of a piece of text within a UI using only its length property. Because many Unicode symbols are expressed by multiple code units, using length alone is not reliable.
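One possible way to truncate more safely is sketched below. It assumes the runtime provides Intl.Segmenter (available in recent browsers and Node.js versions) and otherwise falls back to splitting by code point. The helper name truncateGraphemes is hypothetical, and this is a sketch rather than a complete solution for measuring rendered width.

```javascript
// Hypothetical helper: keep at most `limit` user-perceived characters.
function truncateGraphemes(text, limit) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    const graphemes = Array.from(segmenter.segment(text), (part) => part.segment);
    return graphemes.slice(0, limit).join('');
  }
  // Fallback: split by code point. This keeps surrogate pairs intact
  // but may still cut through a longer grapheme cluster.
  return Array.from(text).slice(0, limit).join('');
}

const label = 'fox🐼';

label.slice(0, 4);           // cuts the panda's surrogate pair in half
truncateGraphemes(label, 4); // => 'fox🐼'
truncateGraphemes(label, 3); // => 'fox'
```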

Throughout this section, we've explored the tricky domain of Unicode. With our new understanding of it, we're now far better equipped to work cleanly with strings in JavaScript. Excluding the complexity of Unicode, the behavior of strings in JavaScript is rather intuitive and shouldn't cause many headaches as long as we use them in a way that clearly communicates our intent.
