Characters

XML data is composed of a simple sequence of characters, including the text of the document and the markup that describes and structures this text. In order to store any text on a computer, it is necessary to specify what characters are available, and how to store them electronically.

Character sets

The term character set is used to describe a collection of characters that are related in some way. A character set is not a physical thing. It is only a concept, defined by a specification or standard. A typical character set will include letters, numbers and commonly used symbols:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789!"#$%&'()*+,-./:;<=>?@[]^_`{|}~

Encodings

Computers do not directly understand characters (or character sets). They are basically calculating machines (or 'number crunchers') that can store and manipulate numbers. In order to store text in a computer, a unique numeric value is therefore assigned to each character in a given character set, which usually includes letters, digits and punctuation marks. This mapping of characters to numeric values is termed an encoding scheme.

For example, the value 51 may be used to represent the digit '3', the value 33 may be used to represent the exclamation mark, '!', and the value 84 may be used to represent the letter 'T'. When a user presses the letter 'T' key on their keyboard, a signal is sent to the computer, which uses a look-up table to determine the value to be stored on disk or in memory:
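The look-up described above can be sketched in a few lines of Python. This is only an illustration: the table name KEY_TO_VALUE and the functions encode and decode are hypothetical, and the table holds just the three characters mentioned in the text.

```python
# Hypothetical look-up table for the three characters named in the text.
# The values happen to match ASCII, which Python's built-in ord() reports.
KEY_TO_VALUE = {"3": 51, "!": 33, "T": 84}


def encode(text):
    """Return the numeric value stored for each character in text."""
    return [KEY_TO_VALUE[ch] for ch in text]


def decode(values):
    """Reverse the look-up: map each stored value back to its character."""
    value_to_key = {value: key for key, value in KEY_TO_VALUE.items()}
    return "".join(value_to_key[v] for v in values)


stored = encode("T3!")
print(stored)          # [84, 51, 33]
print(decode(stored))  # T3!
```

Only the numbers are stored on disk or in memory. The characters reappear when the same table is applied in reverse.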



The actual shape of the character will vary depending on which font is used to display or print that character. A font table provides a suitable shape for each character in the character set (and in the simplest cases, each shape has a numerical position that matches the character value):



Confusing sets and encodings

It is important to recognize that an encoding is not conceptually the same as a character set, although a single standard may define both simultaneously. For example, there is no distinction between the two with ASCII (see below), which is both a character set that includes 'A' and '1' in its set, and an encoding that specifies that 'A' has a value of 65 and '1' has a value of 49. For other languages, such as Japanese, the same character set may be made available in several different encodings.
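The distinction can be observed with the codecs in Python's standard library. The sketch below assumes the three Japanese codecs shift_jis, euc_jp and iso2022_jp, which are commonly described as different encodings of the same underlying Japanese character set. The sample string is an arbitrary choice.

```python
# ASCII: the character set and the encoding are defined together.
print(list("A1".encode("ascii")))  # [65, 49]

# Japanese: the same characters, stored as different byte sequences
# depending on which encoding is chosen.
text = "日本語"  # arbitrary sample text
ENCODINGS = ["shift_jis", "euc_jp", "iso2022_jp"]

encoded = {name: text.encode(name) for name in ENCODINGS}
for name, data in encoded.items():
    # Print each byte sequence in hexadecimal for comparison.
    print(name, data.hex())

# Each byte sequence still decodes to the same characters,
# provided the matching encoding is used to read it.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

The stored bytes differ from one encoding to the next, yet the characters they represent are identical.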

Incompatibility

When information is transferred between two computers that use incompatible encoding schemes, the numeric values are preserved but the number-to-character mappings differ, and the text becomes unintelligible when presented to a user of the second system.

Assuming the incompatible representation schemes shown below (where every character on System B has a value one lower than on System A), the word 'TEXT' would be corrupted to 'UFYU' on transfer to System B. Clearly, there is a need for standards to prevent such problems:
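The corruption can be reproduced with a short Python sketch. The two tables are hypothetical: SYSTEM_A assigns each capital letter its ASCII value, and SYSTEM_B assigns each letter a value one lower, as assumed in the text. The function names store and display are also illustrative.

```python
import string

CHARS = string.ascii_uppercase

# Hypothetical encoding tables: character -> stored value.
SYSTEM_A = {ch: ord(ch) for ch in CHARS}      # 'T' is 84
SYSTEM_B = {ch: ord(ch) - 1 for ch in CHARS}  # 'T' is 83, 'U' is 84


def store(text, table):
    """Convert text to the numeric values used by the given system."""
    return [table[ch] for ch in text]


def display(values, table):
    """Convert stored values back to text using the given system's table."""
    reverse = {value: ch for ch, value in table.items()}
    # Values with no character on this system are shown as '?'.
    return "".join(reverse.get(v, "?") for v in values)


values = store("TEXT", SYSTEM_A)
print(values)                     # [84, 69, 88, 84]
print(display(values, SYSTEM_A))  # TEXT
print(display(values, SYSTEM_B))  # UFYU
```

The numbers transferred are identical in both cases. Only the table used to interpret them has changed.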


