© Ray Lischner 2020
R. Lischner, Exploring C++20, https://doi.org/10.1007/978-1-4842-5961-0_17

17. Characters

Ray Lischner, Ellicott City, MD, USA

In Exploration 2, I introduced you to character literals in single quotes, such as '\n', to end a line of output, but I have not yet taken the time to explain these fundamental building blocks. Now is the time to explore characters in greater depth.

Character Type

The char type represents a single character. Internally, all computers represent characters as integers. The character set defines the mapping between characters and numeric values. Common character sets are ISO 8859-1 (also called Latin-1) and ISO 10646 (same as Unicode), but many, many other character sets are in wide use.

The C++ standard does not mandate any particular character set. The literal '4' represents the digit 4, but the actual value that the computer uses internally is up to the implementation. You should not assume any particular character set. For example, in ISO 8859-1 (Latin-1), '4' has the value 52, but in EBCDIC, it has the value 244.

Similarly, given a numeric value, you cannot assume anything about the character that value represents. If you know a char variable stores the value 169, the character may be 'z' (EBCDIC), '©' (Unicode), or 'Љ' (ISO 8859-5).

C++ does not try to hide the fact that a character is actually a number. You can compare char values with int values, assign a char to an int variable, or do arithmetic with chars. For example, C++ guarantees that any character set your compiler and library support represents digit characters with contiguous values, starting at '0'. Thus, for example, the following is true for all C++ implementations:
'0' + 7 == '7'

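The same holds for letters in the common ASCII-based character sets, where 'A' + 25 == 'Z' and 'q' - 'm' == 4, but the standard does not guarantee it: EBCDIC, for example, leaves gaps in the alphabet, so 'A' + 25 is not 'Z' there. C++ also makes no guarantees concerning the relative values of, say, 'A' and 'a'.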

Read Listing 17-1. What does the program do? (Hint: The get member function reads a single character from the stream. It does not skip over white space or treat any character specially. Extra hint: What happens if you subtract '0' from a character that you know to be a digit?)
  • _____________________________________________________________

  • _____________________________________________________________

  • _____________________________________________________________

  • _____________________________________________________________

import <iostream>;
int main()
{
  int value{};
  bool have_value{false};
  char ch{};
  while (std::cin.get(ch))
  {
    if (ch >= '0' and ch <= '9')
    {
      value = ch - '0';
      have_value = true;
      while (std::cin.get(ch) and ch >= '0' and ch <= '9')
        value = value * 10 + ch - '0';
    }
    if (ch == '\n')
    {
      if (have_value)
      {
        std::cout << value << '\n';
        have_value = false;
      }
    }
    else if (ch != ' ' and ch != '\t')
    {
      std::cout << '\a';
      have_value = false;
      while (std::cin.get(ch) and ch != '\n')
        /*empty*/;
    }
  }
}
Listing 17-1. Working and Playing with Characters

Briefly, this program reads numbers from the standard input and echoes the values to the standard output. If the program reads any invalid characters, it alerts the user (with \a, which I describe later in this Exploration), ignores the line of input, and discards the value. Leading and trailing blank and tab characters are allowed. The program prints the saved numeric value only after reaching the end of an input line. This means if a line contains more than one valid number, the program prints only the last value. I ignore the possibility of overflow, to keep the code simple.

The get function takes a character variable as an argument. It reads one character from the input stream, then stores the character in that variable. The get function does not skip over white space. When you use get as a loop condition, it returns true if it successfully reads a character and the program should keep reading. It returns false if no more input is available or some kind of input error occurred.

All the digit characters have contiguous values, so the inner loop tests to determine if a character is a digit character by comparing it to the values for '0' and '9'. If it is a digit, subtracting the value of '0' from it leaves you with an integer in the range 0 to 9.

The final loop reads characters and does nothing with them. The loop terminates when it reads a newline character or when no more input is available. In other words, the final loop reads and ignores the rest of the input line.

Programs that need to handle white space on their own (such as Listing 17-1) can use get, or you can tell the input stream not to skip over white space prior to reading a number or anything else. The next section discusses character I/O in more detail.

Character I/O

You just learned that the get function reads a single character without treating white space specially. You can do the same thing with a normal input operator, but you must use the std::noskipws manipulator. To restore the default behavior, use the std::skipws manipulator (declared in <ios>).
// Skip white space, then read two adjacent characters.
char left, right;
std::cin >> left >> std::noskipws >> right >> std::skipws;

After turning off the skipws flag, the input stream does not skip over leading white space characters. For instance, if you were to try to read an integer, and the stream is positioned at white space, the read would fail. If you were to try to read a string, the string would be empty, and the stream position would not advance. So you have to consider carefully whether to skip white space. Typically, you would do that only when reading individual characters.

Remember that an input stream uses the >> operator (Exploration 5), even for manipulators. Using >> for manipulators seems to break the mnemonic of transferring data to the right, but it follows the convention of always using >> with an input stream. If you forget, the compiler will remind you.

Write a program that reads the input stream one character at a time and echoes the input to the standard output stream verbatim. This is not a demonstration of how to copy streams but an example of working with characters. Compare your program with Listing 17-2.
import <iostream>;
int main()
{
  std::cin >> std::noskipws;
  char ch{};
  while (std::cin >> ch)
    std::cout << ch;
}
Listing 17-2. Echoing Input to Output, One Character at a Time

You can also use the get member function, in which case you don’t need the noskipws manipulator.

Let’s try something a little more challenging. Suppose you have to read a series of points. The points are defined by a pair of x, y coordinates, separated by a comma. White space is allowed before and after each number and around the comma. Read the points into a vector of x values and a vector of y values. Terminate the input loop if a point does not have a proper comma separator. Print the vector contents, one point per line. I know this is a bit dull, but the point is to experiment with character input. If you prefer, do something special with the data. Compare your result with Listing 17-3.
import <algorithm>;
import <iostream>;
import <limits>;
import <vector>;
int main()
{
  using intvec = std::vector<int>;
  intvec xs{}, ys{};        // store the x's and y's
  char sep{};
  // Loop while the input stream has an integer (x), a character (sep),
  // and another integer (y); then test that the separator is a comma.
  for (int x{},y{}; std::cin >> x >> sep and sep == ',' and std::cin >> y;)
  {
    xs.emplace_back(x);
    ys.emplace_back(y);
  }
  for (auto x{xs.begin()}, y{ys.begin()}; x != xs.end(); ++x, ++y)
    std::cout << *x << ',' << *y << '\n';
}
Listing 17-3. Reading and Writing Points

The first for loop is the key. The loop condition reads an integer and a character and tests to determine if the character is a comma, before reading a second integer. The loop terminates if the input is invalid or ill-formed or if the loop reaches the end-of-file. A more sophisticated program would distinguish between these two cases, but that’s a side issue for the moment.

The initialization part of a for loop can contain only one declaration, and every variable in that declaration shares the same type. So I had to move the definition of sep, which is a char, out of the loop header. Keeping x and y inside the header avoids conflict with the variables in the second for loop, which have the same names but are distinct variables. In the second loop, the x and y variables are iterators, not integers. The loop iterates over two vectors at the same time. A range-based for loop doesn’t help in this case, so the loop must use explicit iterators.

Newlines and Portability

You’ve probably noticed that Listing 17-3, and every other program I’ve presented so far, prints '\n' at the end of each line of output. We have done so without considering what this really means. Different environments have different conventions for end-of-line characters. UNIX-like systems, including modern macOS, use a line feed ('\x0a'); classic Mac OS used a carriage return ('\x0d'); DOS and Microsoft Windows use a combination of a carriage return, followed by a line feed ('\x0d\x0a'); and some operating systems don’t use line terminators but, instead, have record-oriented files, in which each line is a separate record.

In all these cases, the C++ I/O streams automatically convert a native line ending to a single '\n' character. When you print '\n' to an output stream, the library automatically converts it to a native line ending (or terminates the record).

In other words, you can write programs that use '\n' as a line ending and not concern yourself with native OS conventions. Your source code will be portable to all C++ environments.

Character Escapes

In addition to '\n', C++ offers several other escape sequences, such as '\t', for horizontal tab. Table 17-1 lists all the character escapes. Remember that you can use these escapes in character literals and string literals.
Table 17-1. Character Escape Sequences

Escape        Meaning
\a            Alert: ring a bell or otherwise signal the user
\b            Backspace
\f            Form feed
\n            Newline
\r            Carriage return
\t            Horizontal tab
\v            Vertical tab
\\            Literal \
\'            Literal '
\"            Literal "
\OOO          Octal (base 8) character value
\xXX . . .    Hexadecimal (base 16) character value

The last two items are the most interesting. A backslash followed by one to three octal digits (0 to 7) specifies the value of the character. Which character the value represents is up to the implementation.

Understanding all the caveats from the first section of this Exploration, there are times when you must specify an actual character value. The most common is '\0', which is the character with value zero, also called a null character, which you may utilize to initialize char variables. It has some other uses as well, especially when interfacing with C functions and the C standard library.

The final escape sequence (\x) lets you specify a character value in hexadecimal. Typically, you would use two hexadecimal digits, because this is all that fits in the typical, 8-bit char. (The purpose of longer \x escapes is for wide characters, the subject of Exploration 59.)

The next Exploration continues your understanding of characters by examining how C++ classifies characters according to letter, digit, punctuation, and so on.
