Character and Text Encoding on the Web

To a computer, a character is nothing more than a bit of code that it recognizes to display a particular glyph on a screen (or in print, or on other output media). The shapes of letters, numbers, and other symbols are meaningless to the computer; it only knows the code to display a particular glyph so that humans can understand it.

Character encoding or font encoding is a system that pairs each character in a given font with the computer code needed to display it. On the flip side, with text encoding, all of the text in any document is encoded by the computer, each letter having its own unique code “number.”
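
The pairing of characters with code numbers can be seen directly in most programming languages. The following is a minimal Python sketch, not part of the original text; the helper name code_points is made up for illustration. It relies only on the built-in ord and chr functions, which convert between a character and its Unicode code number.

```python
def code_points(text):
    """Return the numeric code for each character in text."""
    return [ord(ch) for ch in text]

# Each character has its own number, regardless of which font draws it.
for ch in "Az7":
    number = ord(ch)            # character -> code number
    assert chr(number) == ch    # code number -> the same character
    print(ch, number, f"U+{number:04X}")
```

The font never enters into this: the numbers identify the characters, and the font only supplies the glyph drawn for each number.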

There are a number of different standards that can be used to encode a computer file such as a font or text document. Like trying to understand a different human language, if a computer doesn’t understand a particular encoding “language”—or know which encoding system is being used—the text is likely to appear as gibberish.
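
The gibberish effect is easy to reproduce. The sketch below is an illustration only (the function name misread is an assumption, not something from the text): it writes text out as UTF-8 bytes and then reads those bytes back as if they were ISO 8859-1.

```python
def misread(text, written_as="utf-8", read_as="iso-8859-1"):
    """Encode text one way, then decode the same bytes as another encoding."""
    return text.encode(written_as).decode(read_as)

# Plain ASCII letters survive, but the accented letter turns into two wrong characters.
print(misread("caf\u00e9"))
print(misread("cafe"))
```

Only the accented letter is damaged, because the two encodings agree on the basic ASCII characters and disagree on everything else.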

As long as the computer program displaying the text understands the encoding used by the font and the encoding used by the text document—and all the glyphs needed are included in the font—then the text will display just as you wrote it. Even if the font and text file are encoded differently, the program will translate between them.

The Web has begun to standardize around a common encoding language called Unicode. If you stick to it, you are unlikely to have problems with gibberish text.

Check your font’s character set to make sure that it has everything you need

A character set is simply that—the set of all characters in a particular font in a particular order for its particular encoding scheme. This is referred to as the font repertoire or character range; every glyph is in order for the encoding scheme, which is why the term character set is often used synonymously with character encoding.

A font’s character set does not necessarily include every possible character that can be encoded. Some fonts even include alternatives to the expected character. This is, in fact, how a dingbat font works—each letter space contains the information used to display the dingbat glyph rather than a letter.

Gill Sans Character Set

The complete character set for Gill Sans (designed by Eric Gill) contains all of the usual suspects (A–Z, a–z, and 0–9) along with an assortment of common punctuation marks, symbols, and other marks. Many fonts also include characters used by languages other than English.


When designing for the Web, you must be sure that the fonts you choose contain all of the characters you need to display your text. If a font does not include a particular character in its repertoire, the result will depend on how the particular application handles nonexistent characters, but will not be the result you want. Generally, a placeholder glyph will be used, such as a rectangle with an “x” or a diamond with a “?” inside or just empty space.
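
One way to catch missing characters before they reach the browser is to compare your text against the font's repertoire. The Python sketch below assumes you already have the repertoire as a set of characters; FONT_REPERTOIRE is a made-up stand-in for a limited decorative font, and reading the repertoire out of a real font file would need a font library that is not shown here.

```python
import string

# Hypothetical repertoire: basic Latin letters, digits, space, and a little punctuation.
FONT_REPERTOIRE = set(string.ascii_letters + string.digits + " .,!?'-")

def missing_characters(text, repertoire):
    """Return characters in text that are not in the repertoire, in first-seen order."""
    missing = []
    for ch in text:
        if ch not in repertoire and ch not in missing:
            missing.append(ch)
    return missing

# Accented letters and the euro sign are outside this limited repertoire.
print(missing_characters("Caf\u00e9 special, 5\u20ac", FONT_REPERTOIRE))
```

An empty result means every character in the text has a glyph in the repertoire; anything returned would show up as a placeholder.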

Barrett Ironwork Character Set

Including uppercase styles for both upper- and lowercase glyphs as well as several dingbats, the character set for Barrett Ironwork has a more limited repertoire than Gill Sans.

If a particular character is not available in the font’s character set, it will generally be replaced with a rectangle.


Missing characters are rarely a problem with English, since most common fonts include a full repertoire of 26 English alphabetic characters in both upper- and lowercase, along with common punctuation. In fact, most professionally created fonts include the entire Latin character set used to display Western languages. It’s generally the more specialized and decorative fonts that will give you problems.

Bad Character Encoding

In the example to the left, a special separator character is not available in the repertoire of the font being used to display the message and is replaced by a diamond shape with a question mark inside.


Know your encoding: Unicode vs. ISO Latin-1

Literally hundreds of different character encoding systems have been developed over the years, most famously ASCII (American Standard Code for Information Interchange), Mac OS Roman (used in older versions of the Macintosh operating system), and multiple Windows encoding systems for different languages. Most are defined as either Western or non-Western, depending on the glyphs needed to display particular languages.
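
To see why the choice of encoding system matters, it can help to decode one and the same byte under several of them. This is a small Python sketch for illustration: cp1252 and mac_roman are Python's codec names for one of the Windows encodings and for Mac OS Roman, ISO 8859-1 is included for comparison, and the helper decode_under is an invented name.

```python
RAW = bytes([0xE9])  # a single byte with the value 233

def decode_under(raw, encodings):
    """Decode the same bytes under each named encoding."""
    return {name: raw.decode(name) for name in encodings}

results = decode_under(RAW, ["iso-8859-1", "cp1252", "mac_roman"])
for name, text in results.items():
    # Show the resulting character and its Unicode code point.
    print(f"{name:>10}: {text!r} (U+{ord(text):04X})")
```

The byte never changes; only the table used to interpret it does, and Mac OS Roman maps it to a different character than the other two.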

For the Web, the most common encoding system is the Unicode format. Unicode is considered superior to other encoding methods because its repertoire includes over 100,000 possible characters from a variety of languages. A font using Unicode can contain a complete character set for multiple languages in a single font file, although it probably won’t. Fonts for specific languages are more likely to give results that are acceptable to native speakers than fonts that try to cover many languages and scripts.

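The most common Unicode encoding in use today is UTF-8. The 8 refers to the size of its basic storage unit, 8 bits (one byte), not to a limit on the number of characters. UTF-8 is a variable-width encoding: each character is stored in one to four bytes, so a UTF-8 encoded file can contain any character in the Unicode repertoire. The 128 ASCII characters take a single byte each, which means plain English text looks the same in UTF-8 as it does in ASCII.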
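
UTF-8 is a variable-width encoding, storing each character in one to four bytes, and a quick way to see this is to count the bytes individual characters take. The following Python sketch is illustrative only; utf8_length is a made-up helper name.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))

# An ASCII letter, an accented letter, the euro sign, and an emoji.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_length(ch), "byte(s)")
```

The four sample characters take one, two, three, and four bytes respectively, which is how UTF-8 reaches the whole Unicode repertoire while keeping English text compact.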

UTF-16 can also encode the entire Unicode repertoire, using one or two 16-bit units per character, but it is rarely used for Web pages. It is not compatible with ASCII, and it usually produces larger files than UTF-8 for Western-language text.


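An older alternative for Western language encoding that you might still run into on the Web is ISO 8859-1 (or ISO Latin 1). Unlike UTF-8, it always uses exactly 8 bits (one byte) to store each character, which limits it to 256 characters at most: enough for most Western European languages, but nothing beyond them.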
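
The practical consequence of ISO 8859-1's small repertoire shows up as soon as text contains a character it does not define. The sketch below is a hedged illustration in Python (can_encode is an invented helper): it checks whether a string fits in a given encoding and compares how the two encodings store the same accented letter.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(can_encode("caf\u00e9", "iso-8859-1"))  # accented Latin letters fit
print(can_encode("5 \u20ac", "iso-8859-1"))   # the euro sign is not in ISO 8859-1
print(can_encode("5 \u20ac", "utf-8"))

# The same letter is one byte in ISO 8859-1 and two bytes in UTF-8.
print("\u00e9".encode("iso-8859-1"), "\u00e9".encode("utf-8"))
```

Because the stored bytes differ, a file saved in one of these encodings and declared as the other will display accented characters incorrectly.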

Most text editors and Web editing software allow you to switch your file encoding, but I wouldn’t. If you do start to see the wrong characters being displayed, it’s likely due to inconsistent encoding between files. My advice is to set all of your files to UTF-8 and then walk away.

Setting a Web page’s encoding

Although there are a wide variety of ways to encode a text file—HTML, CSS, and JavaScript files are all text files—encoding is only rarely an issue on the Web because most browsers automatically recognize the file’s encoding type and adjust accordingly. For the most part, character encoding happens automatically, and you will not need to change settings.

One place you will always need to specify the encoding being used is directly in the HTML of your document. To make sure the browser knows which encoding system you are using, in the head of your HTML, add the following line of code if you are using UTF-8:


    <meta http-equiv="Content-Type" content="text/html;charset=utf-8">

Choosing Your Text File’s Encoding

Web code editing software like Coda (panic.com) allows you to switch your file encoding. Unless you are setting your encoding for a specific language, it’s best to leave it set for UTF-8.


If you are using ISO Latin 1, the code is:

    <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">

If you are using another encoding system, then you will need to replace the charset value with the appropriate encoding value. As long as the charset is defined, the browser will be able to associate the right character with the right glyph in your chosen font.
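
If you want to confirm which encoding a page declares, the meta tag can be read back programmatically. The following Python sketch uses the standard library's html.parser module; the class name CharsetFinder and the function declared_charset are assumptions made for this example, and it only handles the http-equiv form of the tag shown here.

```python
from html.parser import HTMLParser

class CharsetFinder(HTMLParser):
    """Records the charset from the first Content-Type meta tag it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # The content value looks like "text/html;charset=utf-8".
        for part in (attributes.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value.strip():
                self.charset = value.strip().lower()

def declared_charset(html):
    """Return the declared charset of an HTML string, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset

page = '<head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"></head>'
print(declared_charset(page))
```

A result of None means the page declares nothing, and the browser is left to guess the encoding.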
						
