The char Type: Characters and Small Integers

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

The `char` Type: Characters and Small Integers

It’s time to turn to the final integer type: char. As you probably suspect from its name, the char type is designed to store characters, such as letters and numeric digits. Now, whereas storing numbers is no big deal for computers, storing letters is another matter. Programming languages take the easy way out by using number codes for letters. Thus, the char type is another integer type. It’s guaranteed to be large enough to represent the entire range of basic symbols—all the letters, digits, punctuation, and the like—for the target computer system. In practice, many systems support fewer than 128 kinds of characters, so a single byte can represent the whole range. Therefore, although char is most often used to handle characters, you can also use it as an integer type that is typically smaller than short.

The most common symbol set in the United States is the ASCII character set, described in Appendix C, “The ASCII Character Set.” A numeric code (the ASCII code) represents each character in the set. For example, 65 is the code for the character A, and 77 is the code for the character M. For convenience, this book assumes ASCII code in its examples. However, a C++ implementation uses whatever code is native to its host system—for example, EBCDIC (pronounced “eb-se-dik”) on an IBM mainframe. Neither ASCII nor EBCDIC serve international needs that well, and C++ supports a wide-character type that can hold a larger range of values, such as are used by the international Unicode character set. You’ll learn about this wchar_t type later in this chapter.

Try the char type in Listing 3.5.

Listing 3.5. chartype.cpp

// chartype.cpp -- the char type
#include <iostream>
int main( )
{
    using namespace std;
    char ch;        // declare a char variable

    cout << "Enter a character: " << endl;
    cin >> ch;
    cout << "Hola! ";
    cout << "Thank you for the " << ch << " character." << endl;
    return 0;
}

Here’s the output from the program in Listing 3.5:

Enter a character:
M
Hola! Thank you for the M character.

The interesting thing is that you type an M, not the corresponding character code, 77. Also the program prints an M, not 77. Yet if you peer into memory, you find that 77 is the value stored in the ch variable. The magic, such as it is, lies not in the char type but in cin and cout. These worthy facilities make conversions on your behalf. On input, cin converts the keystroke input M to the value 77. On output, cout converts the value 77 to the displayed character M; cin and cout are guided by the type of variable. If you place the same value 77 into an int variable, cout displays it as 77. (That is, cout displays two 7 characters.) Listing 3.6 illustrates this point. It also shows how to write a character literal in C++: Enclose the character within two single quotation marks, as in 'M'. (Note that the example doesn’t use double quotation marks. C++ uses single quotation marks for a character and double quotation marks for a string. The cout object can handle either, but, as Chapter 4 discusses, the two are quite different from one another.) Finally, the program introduces a cout feature, the cout.put() function, which displays a single character.

Listing 3.6. morechar.cpp

// morechar.cpp -- the char type and int type contrasted
#include <iostream>
int main()
{
    using namespace std;
    char ch = 'M';       // assign ASCII code for M to ch
    int i = ch;          // store same code in an int
    cout << "The ASCII code for " << ch << " is " << i << endl;

    cout << "Add one to the character code:" << endl;
    ch = ch + 1;          // change character code in ch
    i = ch;               // save new character code in i
    cout << "The ASCII code for " << ch << " is " << i << endl;

    // using the cout.put() member function to display a char
    cout << "Displaying char ch using cout.put(ch): ";
    cout.put(ch);

    // using cout.put() to display a char constant
    cout.put('!'),

    cout << endl << "Done" << endl;
    return 0;
}

Here is the output from the program in Listing 3.6:

The ASCII code for M is 77
Add one to the character code:
The ASCII code for N is 78
Displaying char ch using cout.put(ch): N!
Done

Program Notes

In the program in Listing 3.6, the notation 'M' represents the numeric code for the M character, so initializing the char variable ch to 'M' sets ch to the value 77. The program then assigns the identical value to the int variable i, so both ch and i have the value 77. Next, cout displays ch as M and i as 77. As previously stated, a value’s type guides cout as it chooses how to display that value—just another example of smart objects.

Because ch is really an integer, you can apply integer operations to it, such as adding 1. This changes the value of ch to 78. The program then resets i to the new value. (Equivalently, you can simply add 1 to i.) Again, cout displays the char version of that value as a character and the int version as a number.

The fact that C++ represents characters as integers is a genuine convenience that makes it easy to manipulate character values. You don’t have to use awkward conversion functions to convert characters to ASCII and back.

Even digits entered via the keyboard are read as characters. Consider the following sequence:

char ch;
cin >> ch;

If you type 5 and Enter, this code reads the 5 character and stores the character code for the 5 character (53 in ASCII) in ch. Now consider this code:

int n;
cin >> n;

The same input results in the program reading the 5 character and running a routine converting the character to the corresponding numeric value of 5, which gets stored in n.

Finally, the program uses the cout.put() function to display both c and a character constant.

A Member Function: `cout.put()`

Just what is cout.put(), and why does it have a period in its name? The cout.put() function is your first example of an important C++ OOP concept, the member function. Remember that a class defines how to represent data and how to manipulate it. A member function belongs to a class and describes a method for manipulating class data. The ostream class, for example, has a put() member function that is designed to output characters. You can use a member function only with a particular object of that class, such as the cout object, in this case. To use a class member function with an object such as cout, you use a period to combine the object name (cout) with the function name (put()). The period is called the membership operator. The notation cout.put() means to use the class member function put() with the class object cout. You’ll learn about this in greater detail when you reach classes in Chapter 10, “Objects and Classes.” Now the only classes you have are the istream and ostream classes, and you can experiment with their member functions to get more comfortable with the concept.

The cout.put() member function provides an alternative to using the << operator to display a character. At this point you might wonder why there is any need for cout.put(). Much of the answer is historical. Before Release 2.0 of C++, cout would display character variables as characters but display character constants, such as 'M' and 'N', as numbers. The problem was that earlier versions of C++, like C, stored character constants as type int. That is, the code 77 for 'M' would be stored in a 16-bit or 32-bit unit. Meanwhile, char variables typically occupied 8 bits. A statement like the following copied 8 bits (the important 8 bits) from the constant 'M' to the variable ch:

char ch = 'M';

Unfortunately, this meant that, to cout, 'M' and ch looked quite different from one another, even though both held the same value. So a statement like the following would print the ASCII code for the $ character rather than simply display $:

cout << '$';

But the following would print the character, as desired:

cout.put('$'),

Now, after Release 2.0, C++ stores single-character constants as type char, not type int. Therefore, cout now correctly handles character constants.

The cin object has a couple different ways of reading characters from input. You can explore these by using a program that uses a loop to read several characters, so we’ll return to this topic when we cover loops in Chapter 5, “Loops and Relational Expressions.”

`char` Literals

You have several options for writing character literals in C++. The simplest choice for ordinary characters, such as letters, punctuation, and digits, is to enclose the character in single quotation marks. This notation stands for the numeric code for the character. For example, an ASCII system has the following correspondences:

• 'A' is 65, the ASCII code for A.

• 'a' is 97, the ASCII code for a.

• '5' is 53, the ASCII code for the digit 5.

• ' ' is 32, the ASCII code for the space character.

• '!' is 33, the ASCII code for the exclamation point.

Using this notation is better than using the numeric codes explicitly. It’s clearer, and it doesn’t assume a particular code. If a system uses EBCDIC, then 65 is not the code for A, but 'A' still represents the character.

There are some characters that you can’t enter into a program directly from the keyboard. For example, you can’t make the newline character part of a string by pressing the Enter key; instead, the program editor interprets that keystroke as a request for it to start a new line in your source code file. Other characters have difficulties because the C++ language imbues them with special significance. For example, the double quotation mark character delimits string literals, so you can’t just stick one in the middle of a string literal. C++ has special notations, called escape sequences, for several of these characters, as shown in Table 3.2. For example, a represents the alert character, which beeps your terminal’s speaker or rings its bell. The escape sequence represents a newline. And " represents the double quotation mark as an ordinary character instead of a string delimiter. You can use these notations in strings or in character constants, as in the following examples:

Table 3.2. C++ Escape Sequence Codes

char alarm = 'a';
cout << alarm << "Don't do that again!a ";
cout << "Ben "Buggsie" Hacker was here! ";

The last line produces the following output:

Ben "Buggsie" Hacker
was here!

Note that you treat an escape sequence, such as , just as a regular character, such as Q. That is, you enclose it in single quotes to create a character constant and don’t use single quotes when including it as part of a string.

The escape sequence concept dates back to when people communicated with computers using the teletype, an electromechanical typewriter-printer, and modern systems don’t always honor the complete set of escape sequences. For example, some systems remain silent for the alarm character.

The newline character provides an alternative to endl for inserting new lines into output. You can use the newline character in character constant notation (' ') or as character in a string (" "). All three of the following move the screen cursor to the beginning of the next line:

cout << endl;    // using the endl manipulator
cout << ' ';    // using a character constant
cout << " ";    // using a string

You can embed the newline character in a longer string; this is often more convenient than using endl. For example, the following two cout statements produce the same output:

cout << endl << endl << "What next?" << endl << "Enter a number:" << endl;
cout << " What next? Enter a number: ";

When you’re displaying a number, endl is a bit easier to type than " " or ' ', but when you’re displaying a string, ending the string with a newline character requires less typing:

cout << x << endl; // easier than cout << x << " ";
cout << "Dr. X. "; // easier than cout << "The Dr. X." << endl;

Finally, you can use escape sequences based on the octal or hexadecimal codes for a character. For example, Ctrl+Z has an ASCII code of 26, which is 032 in octal and 0x1a in hexadecimal. You can represent this character with either of the following escape sequences: 32 or x1a. You can make character constants out of these by enclosing them in single quotes, as in '32', and you can use them as parts of a string, as in "hix1a there".

Tip

When you have a choice between using a numeric escape sequence or a symbolic escape sequence, as in x8 versus , use the symbolic code. The numeric representation is tied to a particular code, such as ASCII, but the symbolic representation works with all codes and is more readable.

Listing 3.7 demonstrates a few escape sequences. It uses the alert character to get your attention, the newline character to advance the cursor (one small step for a cursor, one giant step for cursorkind), and the backspace character to back the cursor one space to the left. (Houdini once painted a picture of the Hudson River using only escape sequences; he was, of course, a great escape artist.)

Listing 3.7. bondini.cpp

// bondini.cpp -- using escape sequences
#include <iostream>
int main()
{
    using namespace std;
    cout << "aOperation "HyperHype" is now activated! ";
    cout << "Enter your agent code:________";
    long code;
    cin >> code;
    cout << "aYou entered " << code << "... ";
    cout << "aCode verified! Proceed with Plan Z3! ";
    return 0;
}

Note

Some systems might behave differently, displaying the as a small rectangle rather than backspacing, for example, or perhaps erasing while backspacing, perhaps ignoring a.

When you start the program in Listing 3.7, it puts the following text onscreen:

Operation "HyperHype" is now activated!
Enter your agent code:________

After printing the underscore characters, the program uses the backspace character to back up the cursor to the first underscore. You can then enter your secret code and continue. Here’s a complete run:

Operation "HyperHype" is now activated!
Enter your agent code:42007007
You entered 42007007...
Code verified! Proceed with Plan Z3!

Universal Character Names

C++ implementations support a basic source character set—that is, the set of characters you can use to write source code. It consists of the letters (uppercase and lowercase) and digits found on a standard U.S. keyboard, the symbols, such as { and =, used in the C language, and a scattering of other characters, such as the space character. Then there is a basic execution character set, which includes characters that can be processed during the execution of a program (for example, characters read from a file or displayed on screen). This adds a few more characters, such as backspace and alert. The C++ Standard also allows an implementation to offer extended source character sets and extended execution character sets. Furthermore, those additional characters that qualify as letters can be used as part of the name of an identifier. Thus, a German implementation might allow you to use umlauted vowels, and a French implementation might allow accented vowels. C++ has a mechanism for representing such international characters that is independent of any particular keyboard: the use of universal character names.

Using universal character names is similar to using escape sequences. A universal character name begins either with u or U. The u form is followed by 8 hexadecimal digits, and the U form by 16 hexadecimal digits. These digits represent the ISO 10646 code point for the character. (ISO 10646 is an international standard under development that provides numeric codes for a wide range of characters. See “Unicode and ISO 10646,” later in this chapter.)

If your implementation supports extended characters, you can use universal character names in identifiers, as character constants, and in strings. For example, consider the following code:

int ku00F6rper;
cout << "Let them eat gu00E2teau. ";

The ISO 10646 code point for ö is 00F6, and the code point for â is 00E2. Thus, this C++ code would set the variable name to körper and display the following output:

Let them eat gâteau.

If your system doesn’t support ISO 10646, it might display some other character for â or perhaps simply display the word gu00E2teau.

Actually, from the standpoint of readability, there’s not much point to using u00F6 as part of a variable name, but an implementation that included the ö character as part of an extended source character set probably would also allow you to type that character from the keyboard.

Note that C++ uses the term “universal code name,” not, say, “universal code.” That’s because a construction such as u00F6 should be considered a label meaning “the character whose Unicode code point is U-00F6.” A compliant C++ compiler will recognize this as representing the 'ö' character, but there is no requirement that internal coding be 00F6. Just as, in principle, the character 'T' can be represented internally by ASCII on one computer and by a different coding system on another computer, the 'u00F6' character can have different encodings on different systems. Your source code can use the same universal code name on all systems, and the compiler will then represent it by the appropriate internal code used on the particular system.

Unicode and ISO 10646

Unicode provides a solution to the representation of various character sets by providing a standard numbering system for a great number of characters and symbols, grouping them by type. For example, the ASCII code is incorporated as a subset of Unicode, so U.S. Latin characters such as A and Z have the same representation under both systems. But Unicode also incorporates other Latin characters, such as those used in European languages; characters from other alphabets, including Greek, Cyrillic, Hebrew, Cherokee, Arabic, Thai, and Bengali; and ideographs, such as those used for Chinese and Japanese. So far Unicode represents more than 109,000 symbols and more than 90 scripts, and it is still under development. If you want to know more, you can check the Unicode Consortium’s website, at www.unicode.org.

Unicode assigns a number, called a code point, for each of its characters. The typical notation for Unicode code points looks like this: U-222B. The U identifies this as a Unicode character, and the 222B is the hexadecimal number for the character—an integral sign, in this case.

The International Organization for Standardization (ISO) established a working group to develop ISO 10646, also a standard for coding multilingual text. The ISO 10646 group and the Unicode group have worked together since 1991 to keep their standards synchronized with one another.

`signed char` and `unsigned char`

Unlike int, char is not signed by default. Nor is it unsigned by default. The choice is left to the C++ implementation in order to allow the compiler developer to best fit the type to the hardware properties. If it is vital to you that char has a particular behavior, you can use signed char or unsigned char explicitly as types:

char fodo;              // may be signed, may be unsigned
unsigned char bar;      // definitely unsigned
signed char snark;      // definitely signed

These distinctions are particularly important if you use char as a numeric type. The unsigned char type typically represents the range 0 to 255, and signed char typically represents the range –128 to 127. For example, suppose you want to use a char variable to hold values as large as 200. That works on some systems but fails on others. You can, however, successfully use unsigned char for that purpose on any system. On the other hand, if you use a char variable to hold a standard ASCII character, it doesn’t really matter whether char is signed or unsigned, so you can simply use char.

For When You Need More: `wchar_t`

Programs might have to handle character sets that don’t fit within the confines of a single 8-bit byte (for example, the Japanese kanji system). C++ handles this in a couple ways. First, if a large set of characters is the basic character set for an implementation, a compiler vendor can define char as a 16-bit byte or larger. Second, an implementation can support both a small basic character set and a larger extended character set. The usual 8-bit char can represent the basic character set, and another type, called wchar_t (for wide character type), can represent the extended character set. The wchar_t type is an integer type with sufficient space to represent the largest extended character set used on the system. This type has the same size and sign properties as one of the other integer types, which is called the underlying type. The choice of underlying type depends on the implementation, so it could be unsigned short on one system and int on another.

The cin and cout family consider input and output as consisting of streams of chars, so they are not suitable for handling the wchar_t type. The iostream header file provides parallel facilities in the form of wcin and wcout for handling wchar_t streams. Also you can indicate a wide-character constant or string by preceding it with an L. The following code stores a wchar_t version of the letter P in the variable bob and displays a wchar_t version of the word tall:

wchar_t bob = L'P'; // a wide-character constant
wcout << L"tall" << endl; // outputting a wide-character string

On a system with a 2-byte wchar_t, this code stores each character in a 2-byte unit of memory. This book doesn’t use the wide-character type, but you should be aware of it, particularly if you become involved in international programming or in using Unicode or ISO 10646.

New C++11 Types: `char16_t` and `char32_t`

As the programming community gained more experience with Unicode, it became clear that the wchar_t type wasn’t enough. It turns out that encoding characters and strings of characters on a computer system is more complex than just using the Unicode numeric values (called code points). In particular, it’s useful, when encoding strings of characters, to have a type of definite size and signedness. But the sign and size of wchar_t can vary from one implementation to another. So C++11 introduces the types char16_t, which is unsigned and 16 bits, and char32_t, which is unsigned and 32 bits. C++11 uses the u prefix for char16_t character and string constants, as in u'C' and u"be good". Similarly, it uses the U prefix for char32_t constants, as in U'R' and U"dirty rat". The char16_t type is a natural match for universal character names of the form u00F6, and the char32_t type is a natural match for universal character names of the form U0000222B. The prefixes u and U are used to indicate character literals of types char16_t and char32_t, respectively:

char16_t ch1 = u'q'; // basic character in 16-bit form
char32_t ch2 = U'U0000222B'; // universal character name in 32-bit form

Like wchar_t, char16_t and char32_t each have an underlying type, which is one of the built-in integer types. But the underlying type can be different on one system from what it is on another.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for The char Type: Characters and Small Integers

Create new playlist

Sign In

Sign Up

The char Type: Characters and Small Integers

Program Notes

A Member Function: cout.put()

char Literals

Universal Character Names

signed char and unsigned char

For When You Need More: wchar_t