CHAPTER 6

Applications with Strings and Text

In the last chapter you were introduced to arrays and you saw how using arrays of numerical values could make many programming tasks much easier. In this chapter you'll extend your knowledge of arrays by exploring how you can use arrays of characters. You'll frequently have a need to work with a text string as a single entity. As you'll see, C doesn't provide you with a string data type as some other languages do. Instead, C uses an array of elements of type char to store a string.

In this chapter I'll show you how you can create and work with variables that store strings, and how the standard library functions can greatly simplify the processing of strings.

You'll learn the following:

  • How you can create string variables
  • How to join two or more strings together to form a single string
  • How you compare strings
  • How to use arrays of strings
  • How you work with wide character strings
  • What library functions are available to handle strings and how you can apply them
  • How to write a simple password-protection program

What Is a String?

You've already seen examples of string constants—quite frequently in fact. A string constant is a sequence of characters or symbols between a pair of double-quote characters. Anything between a pair of double quotes is interpreted by the compiler as a string, including any special characters and embedded spaces. Every time you've displayed a message using printf(), you've defined the message as a string constant. Examples of strings used in this way appear in the following statements:

printf("This is a string.");
printf("This is on\ntwo lines!");
printf("For \" you write \\\".");

These three example strings are shown in Figure 6-1. The decimal values of the character codes that will be stored in memory are shown below the characters.

Figure 6-1. Examples of strings in memory

The first string is a straightforward sequence of letters followed by a period. The printf() function will output this string as the following:

This is a string.

The second string has a newline character, \n, embedded in it so the string will be displayed over two lines:

This is on
two lines!

The third string may seem a little confusing, but the output from printf() should make it clearer:

For " you write \".

You must write a double quote within a string as the escape sequence \" because the compiler will interpret an explicit " as the end of the string. You must also use the escape sequence \\ when you want to include a backslash in a string because a backslash in a string always signals to the compiler the start of an escape sequence.

As Figure 6-1 shows, a special character with the code value 0 is added to the end of each string to mark where it ends. This character is known as the null character (not to be confused with NULL, which you'll see later), and you write it as \0.


Note Because a string in C is always terminated by a \0 character, the number of bytes a string occupies is always one greater than the number of characters in the string.


There's nothing to prevent you from adding a \0 character to the end of a string yourself, but if you do, you'll simply end up with two of them. You can see how the null character works with a simple example. Have a look at the following program:

/* Program 6.1 Displaying a string */
#include <stdio.h>

int main(void)
{
  printf("The character \0 is used to terminate a string.");
  return 0;
}

If you compile and run this program, you'll get this output:


The character

It's probably not quite what you expected: only the first part of the string has been displayed. The output ends after the first two words because the printf() function stops outputting the string when it reaches the first null character, \0. Even though there's another \0 at the end of the string, it will never be reached. The first \0 that's found always marks the end of the string.

String- and Text-Handling Methods

Unlike some other programming languages, C has no specific provision within its syntax for variables that store strings, and because there are no string variables, C has no special operators for processing strings. This is not a problem, though, because you're quite well-equipped to handle strings with the tools you have at your disposal already.

As I said at the beginning of this chapter, you use an array of type char to hold strings. This is the simplest form of string variable. You could declare a char array variable as follows:

char saying[20];

The variable saying that you've declared in this statement can accommodate a string that has up to 19 characters, because you must allow one element for the termination character. Of course, you can also use this array to store 20 characters that aren't a string.


Caution Remember that you must always declare the dimension of an array that you intend to use to store a string as at least one greater than the number of characters that you want to allow the string to have because the compiler will automatically add \0 to the end of a string constant.


You could also initialize the preceding string variable in the following declaration:

char saying[] = "This is a string.";

Here you haven't explicitly defined the array dimension. The compiler will assign a value to the dimension sufficient to hold the initializing string constant. In this case it will be 18, which corresponds to 17 elements for the characters in the string, plus an extra one for the terminating \0. You could, of course, have put a value for the dimension yourself, but if you leave it for the compiler to do, you can be sure it will be correct.

You could also initialize just part of an array of elements of type char with a string, for example:

char str[40] = "To be";

Here, the compiler will initialize the first five elements from str[0] to str[4] with the characters of the specified string in sequence, and str[5] will contain the null value '\0'. Of course, space is allocated for all 40 elements of the array, and they're all available to use in any way you want.

Initializing a char array and declaring it as constant is a good way of handling standard messages:

const char message[] = "The end of the world is nigh";

Because you've declared message as const, it's protected from being modified explicitly within the program. Any attempt to do so will result in an error message from the compiler. This technique for defining standard messages is particularly useful if they're used in various places within a program. It prevents accidental modification of such constants in other parts of your program. Of course, if you do need to be able to change the message, then you shouldn't specify the array as const.

When you want to refer to the string stored in an array, you just use the array name by itself. For instance, if you want to output the string stored in message using the printf() function, you could write this:

printf("\nThe message is: %s", message);

The %s specification is for outputting a null-terminated string. At the position where the %s appears in the first argument, the printf() function will output successive characters from the message array until it finds the '\0' character. Of course, an array with elements of type char behaves in exactly the same way as an array of elements of any other type, so you use it in exactly the same way. Only the special string handling functions are sensitive to the '\0' character, so outside of that there really is nothing special about an array that holds a string.

The main disadvantage of using char arrays to hold a variety of different strings is the potentially wasted memory. Because arrays are, by definition, of a fixed length, you have to declare each array that you intend to use to store strings with its dimension set to accommodate the maximum string length you're likely to want to process. In most circumstances, your typical string length will be somewhat less than the maximum, so you end up wasting memory. Because you normally use your arrays here to store strings of different lengths, getting the length of a string is important, especially if you want to add to it. Let's look at how you do this using an example.

Operations with Strings

The code in the previous example is designed to show you the mechanism for finding the length of a string, but you never have to write such code in practice. As you'll see very soon, the strlen() function in the standard library will determine the length of a null-terminated string for you. So now that you know how to find the lengths of strings, how can you manipulate them?

Unfortunately you can't use the assignment operator to copy a string in the way you do with int or double variables. To achieve the equivalent of an arithmetic assignment with strings, one string has to be copied element by element to the other. In fact, performing any operation on string variables is very different from the arithmetic operations with numeric variables you've seen so far. Let's look at some common operations that you might want to perform with strings and how you would achieve them.

Appending a String

Joining one string to the end of another is a common requirement. For instance, you might want to assemble a single message from two or more strings. You might define the error messages in a program as a few basic text strings to which you append one of a variety of strings to make the message specific to a particular error. Let's see how this works in the context of an example.

Arrays of Strings

It may have occurred to you by now that you could use a two-dimensional array of elements of type char to store strings, where each row is used to hold a separate string. In this way you could arrange to store a whole bunch of strings and refer to any of them through a single variable name, as in this example:

char sayings[3][32] = {
                        "Manners maketh man.",
                        "Many hands make light work.",
                        "Too many cooks spoil the broth."
                      };

This creates an array of three rows of 32 characters. The strings between the braces will be assigned in sequence to the three rows of the array, sayings[0], sayings[1], and sayings[2]. Note that you don't need braces around each string. The compiler can deduce that each string is intended to initialize one row of the array. The last dimension is specified to be 32, which is just sufficient to accommodate the longest string, including its terminating \0 character. The first dimension specifies the number of strings.

When you're referring to an element of the array—sayings[i][j], for instance—the first index, i, identifies a row in the array, and the second index, j, identifies a character within a row. When you want to refer to a complete row containing one of the strings, you just use a single index value between square brackets. For instance, sayings[1] refers to the second string in the array, "Many hands make light work.".

Although you must specify the last dimension in an array of strings, you can leave it to the compiler to figure out how many strings there are:

char sayings[][32] = {
                        "Manners maketh man.",
                        "Many hands make light work.",
                        "Too many cooks spoil the broth."
                      };

I've omitted the value for the size of the first dimension in the array here so the compiler will deduce this from the initializers between braces. Because you have three initializing strings, the compiler will make the first array dimension 3. Of course, you must still make sure that the last dimension is large enough to accommodate the longest string, including its terminating null character.

You could output the three sayings with the following code:

for(int i = 0 ; i<3 ; i++)
  printf("\n%s", sayings[i]);

You reference a row of the array using a single index in the expression sayings[i]. This effectively accesses the one-dimensional array that is at index position i in the sayings array.

You could change the last example to use a two-dimensional array.

String Library Functions

Now that you've struggled through the previous examples, laboriously copying strings from one variable to another, it's time to reveal that there's a standard library for string functions that can take care of all these little chores. Still, at least you know what's going on when you use the library functions.

The string functions are declared in the <string.h> header file, so you'll need to put

#include <string.h>

at the beginning of your program if you want to use them. The library actually contains quite a lot of functions, and your compiler may provide an even more extensive range of string library capabilities than is required by the C standard. I'll discuss just a few of the essential functions to demonstrate the basic idea and leave you to explore the rest on your own.

Copying Strings Using a Library Function

First, let's return to the process of copying the string stored in one array to another, which is the string equivalent of an assignment operation. The while loop mechanism you carefully created to do this must still be fresh in your mind. Well, you can do the same thing with this statement:

strcpy(string1, string2);

The arguments to the strcpy() function are char array names. What the function actually does is copy the string specified by the second argument to the string specified by the first argument, so in the preceding example string2 will be copied to string1, replacing what was previously stored in string1. The copy operation will include the terminating '\0'. It's your responsibility to ensure that the array string1 has sufficient space to accommodate string2. The function strcpy() has no way of checking the sizes of the arrays, so if it goes wrong it's all your fault. Obviously, the sizeof operator is important because you'll most likely check that everything is as it should be:

if(sizeof(string2) <= sizeof (string1))
  strcpy(string1, string2);

You execute the strcpy() operation only if the size of the string2 array is less than or equal to the size of the string1 array.

You have another function available, strncpy(), that will copy the first n characters of one string to another. The first argument is the destination string, the second argument is the source string, and the third argument is an integer of type size_t that specifies the number of characters to be copied. Here's an example of how this works:

char destination[] = "This string will be replaced";
char source[] = "This string will be copied in part";
size_t n = 26;                    /* Number of characters to be copied */
strncpy(destination, source, n);

After executing these statements, the first 26 elements of destination will contain "This string will be copied", because that corresponds to the first 26 characters from source. Because source is longer than 26 characters, no '\0' is written after the last character copied, so the rest of destination is unchanged and the array as a whole holds the string "This string will be copieded". If source has fewer than 26 characters, the function will add '\0' characters to make up the count to 26.

Note that when the length of the source string is greater than or equal to the number of characters to be copied, no additional '\0' character is added to the destination string by the strncpy() function. This means that the destination string may not have a terminating null character in such cases, which can cause major problems with further operations with the destination string. In the preceding example, you would follow the call with the statement destination[n] = '\0'; to end up with just the 26 copied characters as a string.

Determining String Length Using a Library Function

To find out the length of a string you have the function strlen(), which returns the length of a string as an integer of type size_t. To find the length of a string in Program 6.3 you wrote this:

while (str2[count2])
  count2++;

Instead of this rigmarole, you could simply write this:

count2 = strlen(str2);

Now the counting and searching that's necessary to find the end of the string is performed by the function, so you no longer have to worry about it. Note that it returns the length of the string excluding the '\0', which is generally the most convenient result. It also returns the value as size_t which corresponds to an unsigned integer type, so you may want to declare the variable to hold the result as size_t as well. If you don't, you may get warning messages from your compiler.

Just to remind you, type size_t is a type that is defined in the standard library header file <stddef.h>. This is also the type returned by the operator sizeof. The type size_t will be defined to be one of the unsigned integer types you have seen, often unsigned int or unsigned long. The reason for implementing things this way is code portability. The type returned by sizeof and the strlen() function, among others, can vary from one C implementation to another. It's up to the compiler writer to decide what it should be. Defining the type to be size_t and defining size_t in a header file enables you to accommodate such implementation dependencies in your code very easily. As long as you define count2 in the preceding example as type size_t, you have code that will work in every standard C implementation, even though the definition of size_t may vary from one implementation to another.

So for the most portable code, you should write the following:

size_t count2 = 0;
count2 = strlen(str2);

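As long as you have #include directives for <string.h> and <stddef.h>, this code will compile with any ISO/IEC standard C compiler. Strictly speaking, <string.h> defines size_t itself, so the directive for <stddef.h> is optional here.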

Joining Strings Using a Library Function

In Program 6.3, you copied the second string onto the end of the first using the following rather complicated looking code:

count2 = 0;
while(str2[count2])
  str1[count1++] = str2[count2++];
str1[count1] = '\0';

Well, the string library gives a slight simplification here, too. You could use a function that joins one string to the end of another. You could achieve the same result as the preceding fragment with the following exceedingly simple statement:

strcat(str1, str2);              /* Copy str2 to the end of str1 */

This function copies str2 to the end of str1. The strcat() function is so called because it performs string catenation; in other words it joins one string onto the end of another. As well as appending str2 to str1, the strcat() function also returns str1.

If you only want to append part of the source string to the destination string, you can use the strncat() function. This requires a third argument of type size_t that indicates the number of characters to be copied, for instance

strncat(str1, str2, 5);    /* Copy 1st 5 characters of str2 to the end of str1 */

As with all the operations that involve copying one string to another, it's up to you to ensure that the destination array is sufficiently large to accommodate what's being copied to it. This function and others will happily overwrite whatever lies beyond the end of your destination array if you get it wrong.

All these string functions return the destination string. This allows you to use the value returned in another string operation, for example

size_t length = 0;
length = strlen(strncat(str1, str2, 5));

Here the strncat() function copies five characters from str2 to the end of str1. The function returns the array str1, so this is passed as an argument to the strlen() function. This will then return the length of the new version of str1 with the five characters from str2 appended.

Comparing Strings

The string library also provides functions for comparing strings and deciding whether one string is greater than or less than another. It may sound a bit odd applying such terms as "greater than" and "less than" to strings, but the result is produced quite simply. Successive corresponding characters of the two strings are compared based on the numerical value of their character codes. This mechanism is illustrated graphically in Figure 6-2, in which the character codes are shown as hexadecimal values.

Figure 6-2. Comparing two strings

If two strings are identical, then of course they're equal. The first pair of corresponding characters that are different in two strings determines whether the first string is less than or greater than the second. So, for example, if the character code for the character in the first string is less than the character code for the character in the second string, the first string is less than the second. This mechanism for comparison generally corresponds to what you expect when you're arranging strings in alphabetical order.

The function strcmp(str1, str2) compares two strings. It returns a value of type int that is less than, equal to, or greater than 0, corresponding to whether str1 is less than, equal to, or greater than str2. You can express the comparison illustrated in Figure 6-2 in the following code fragment:

char str1[] = "The quick brown fox";
char str2[] = "The quick black fox";
if(strcmp(str1, str2) < 0)
  printf("str1 is less than str2");

The printf() statement will execute only if the strcmp() function returns a negative integer. This will be when the strcmp() function finds a pair of corresponding characters in the two strings that do not match and the character code in str1 is less than the character code in str2.

The strncmp() function compares up to n characters of the two strings. The first two arguments are the same as for the strcmp() function and the number of characters to be compared is specified by a third argument that's an integer of type size_t. This function would be useful if you were processing strings with a prefix of ten characters, say, that represented a part number or a sequence number. You could use the strncmp() function to compare just the first ten characters of two strings to determine which should come first:

if(strncmp(str1, str2, 10) <= 0)
  printf("\n%s\n%s", str1, str2);
else
  printf("\n%s\n%s", str2, str1);

These statements output strings str1 and str2 arranged in ascending sequence according to the first ten characters in the strings.

Let's try comparing strings in a working example.

Searching a String

The <string.h> header file declares several string-searching functions, but before I get into these, we'll take a peek at the subject of the next chapter, namely pointers. You'll need an appreciation of the basics of this in order to understand how to use the string-searching functions.

The Idea of a Pointer

As you'll learn in detail in the next chapter, C provides a remarkably useful type of variable called a pointer. A pointer is a variable that contains an address—that is, it contains a reference to another location in memory that can contain a value. You already used an address when you used the function scanf(). A pointer with the name pNumber is defined by the second of the following two statements:

int Number = 25;
int *pNumber = &Number;

Figure 6-3 illustrates what happens when these two statements are executed.

Figure 6-3. An example of a pointer

You declare a variable, Number, with the value 25, and a pointer, pNumber, which contains the address of Number. You can now use the variable pNumber in the expression *pNumber to obtain the value contained in Number. The * is the dereference operator and its effect is to access the data stored at the address specified by a pointer.

The main reason for introducing this idea here is that the functions I'll discuss in the following sections return pointers, so you could be a bit confused by them if there was no explanation here at all. If you end up confused anyway, don't worry—all will be illuminated in the next chapter.

Searching a String for a Character

The strchr() function searches a given string for a specified character. The first argument to the function is the string to be searched (which will be the address of a char array), and the second argument is the character that you're looking for. The function will search the string starting at the beginning and return a pointer to the first position in the string where the character is found. This is the address of this position in memory and is of type char* described as "pointer to char." So to store the value that's returned you must create a variable that can store an address of a character. If the character isn't found, the function will return a special value NULL, which is the equivalent of 0 for a pointer and represents a pointer that doesn't point to anything.

You can use the strchr() function like this:

char str[] = "The quick brown fox";  /* The string to be searched        */
char c = 'q';                        /* The character we are looking for */
char *pGot_char = NULL;              /* Pointer initialized to zero      */
pGot_char = strchr(str, c);          /* Stores address where c is found  */

You define the character that you're looking for by the variable c of type char. Because the strchr() function expects the second argument to be of type int, the compiler will convert the value of c to this type before passing it to the function.

You could just as well define c as type int like this:

int c = 'q';    /* Initialize with character code for q */

Functions are often implemented so that a character is passed as an argument of type int because it's simpler to work with type int than type char.

Figure 6-4 illustrates the result of this search using the strchr() function.

Figure 6-4. Searching for a character

The address of the first character in the string is given by the array name str. Because 'q' appears as the fifth character in the string, its address will be str + 4, an offset of 4 bytes from the first character. Thus, the variable pGot_char will contain the address str + 4.

Using the variable name pGot_char in an expression will access the address. If you want to access the character that's stored at that address too, then you must dereference the pointer. To do this, you precede the pointer variable name with the dereference operator *, for example:

printf("Character found was %c.", *pGot_char);

I'll go into more detail on using the dereferencing operator further in the next chapter.

Of course, in general it's always possible that the character you're searching for might not be found in the string, so you should take care that you don't attempt to dereference a NULL pointer.

If you do try to dereference a NULL pointer, your program will most likely crash. This is very easy to avoid with an if statement, like this:

if(pGot_char != NULL)
  printf("Character found was %c.", *pGot_char);

Now you only execute the printf() statement when the variable pGot_char isn't NULL.

The strrchr() function is very similar in operation to the strchr() function, except that it searches for the character starting from the end of the string. Thus, it will return the address of the last occurrence of the character in the string, or NULL if the character isn't found.

Searching a String for a Substring

The strstr() function is probably the most useful of all the searching functions declared in <string.h>. It searches one string for the first occurrence of a substring and returns a pointer to the position in the first string where the substring is found. If it doesn't find a match, it returns NULL. So if the value returned here isn't NULL, you can be sure that the searching function that you're using has found an occurrence of what it was searching for. The first argument to the function is the string that is to be searched, and the second argument is the substring you're looking for.

Here is an example of how you might use the strstr() function:

char text[] = "Every dog has his day";
char word[] = "dog";
char *pFound = NULL;
pFound = strstr(text, word);

This searches text for the first occurrence of the string stored in word. Because the string "dog" appears starting at the seventh character in text, pFound will be set to the address text + 6. The search is case sensitive, so if you search the text string for "Dog", it won't be found.

Analyzing and Transforming Strings

If you need to examine the internal contents of a string, you can use the set of standard library functions that are declared in the <ctype.h> header file that I introduced in Chapter 3. These provide you with a very flexible range of analytical functions that enable you to test what kind of character you have. They also have the advantage that they're independent of the character code on the computer you're using. Just to remind you, Table 6-1 shows the functions that will test for various categories of characters.

Table 6-1. Character Classification Functions

Function     Tests For
islower()    Lowercase letter
isupper()    Uppercase letter
isalpha()    Uppercase or lowercase letter
isalnum()    Uppercase or lowercase letter or a digit
iscntrl()    Control character
isprint()    Any printing character including space
isgraph()    Any printing character except space
isdigit()    Decimal digit ('0' to '9')
isxdigit()   Hexadecimal digit ('0' to '9', 'A' to 'F', 'a' to 'f')
isblank()    Standard blank characters (space, '\t')
isspace()    Whitespace character (space, '\n', '\t', '\v', '\r', '\f')
ispunct()    Printing character for which isspace() and isalnum() return false

The argument to a function is the character to be tested. All these functions return a nonzero value of type int if the character is within the set that's being tested for; otherwise, they return 0. Of course, these return values convert to true and false respectively so you can use them as Boolean values. Let's see how you can use these functions for testing the characters in a string.

Converting Characters

You've already seen that the standard library also includes two conversion functions that you get access to through <ctype.h>. The toupper() function converts from lowercase to uppercase, and the tolower() function does the reverse. Both functions return either the converted character or the same character for characters that are already in the correct case. You can therefore convert a string to uppercase using this statement:

for(int i = 0 ; (buffer[i] = toupper(buffer[i])) != '\0' ; i++);

This loop will convert the entire string to uppercase by stepping through the string one character at a time, converting lowercase to uppercase and leaving uppercase characters unchanged. The loop stops when it reaches the string termination character '\0'. This sort of pattern in which everything is done inside the loop control expressions is quite common in C.

Let's try a working example that applies these functions to a string.

Converting Strings to Numerical Values

The <stdlib.h> header file declares functions that you can use to convert a string to a numerical value. Each of the functions in Table 6-2 requires an argument that's a pointer to a string or an array of type char that contains a string that's a representation of a numerical value.

Table 6-2. Functions That Convert Strings to Numerical Values

Function   Returns
atof()     A value of type double that is produced from the string argument
atoi()     A value of type int that is produced from the string argument
atol()     A value of type long that is produced from the string argument
atoll()    A value of type long long that is produced from the string argument

These functions are very easy to use, for example

char value_str[] = "98.4";
double value = 0;
value = atof(value_str);          /* Convert string to floating-point */

The value_str array contains a string representation of a value of type double. You pass the array name as the argument to the atof() function to convert it to type double. You use the other three functions in a similar way.

These functions are particularly useful when you need to read numerical input in the format of a string. This can happen when the sequence of the data input is uncertain, so you need to analyze the string in order to determine what it contains. Once you've figured out what kind of numerical value the string represents, you can use the appropriate library function to convert it.

Working with Wide Character Strings

Working with wide character strings is just as easy as working with the strings you have been using up to now. You store a wide character string in an array of elements of type wchar_t and a wide character string constant just needs the L modifier in front of it. Thus you can declare and initialize a wide character string like this:

wchar_t proverb[] = L"A nod is as good as a wink to a blind horse.";

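As you saw back in Chapter 2, a wchar_t character occupies 2 bytes with the compiler used there. The proverb string contains 44 characters plus the terminating null, so the string will occupy 90 bytes. The size of wchar_t is implementation-defined, though, and on many systems it is 4 bytes, in which case the same string occupies 180 bytes.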
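Because the size of wchar_t varies between compilers, it is safer to compute the storage a wide string needs than to assume a figure. Here is a small sketch of that arithmetic; the helper name wide_string_bytes() is an assumption made for illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: number of bytes needed to store ws,  */
/* including the terminating L'\0' character.                */
size_t wide_string_bytes(const wchar_t *ws)
{
  return (wcslen(ws) + 1) * sizeof(wchar_t);
}
```

For an array such as proverb, the expression sizeof proverb gives the same result, because the array was sized by its initializer.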

If you want to write the proverb string to the screen using printf(), you must use the %ls format specifier rather than the %s that you use for an ordinary string of type char. (Some compilers, notably Microsoft's, also accept %S for this purpose, but %ls is the specifier that the C standard defines.) If you use %s, the printf() function will assume the string consists of single-byte characters, so the output will not be correct. Thus the following statement will output the wide character string correctly:

printf("The proverb is: %ls", proverb);

Operations on Wide Character Strings

The <wchar.h> header file declares a range of functions for operating on wide character strings that parallel the functions you have been working with that apply to ordinary strings. Table 6-3 shows the functions declared in <wchar.h> that are the wide character equivalents to the string functions I have already discussed in this chapter.

Table 6-3. Functions That Operate on Wide Character Strings

Function Description
wcslen(const wchar_t* ws) Returns a value of type size_t that is the length of the wide character string ws that you pass as the argument. The length excludes the terminating L'\0' character.
wcscpy(wchar_t* destination, const wchar_t* source) Copies the wide character string source to the wide character string destination. The function returns destination.
wcsncpy(wchar_t* destination, const wchar_t* source, size_t n) Copies up to n characters from the wide character string source to the wide character string destination. If source contains fewer than n characters, destination is padded with L'\0' characters. The function returns destination.
wcscat(wchar_t* ws1, const wchar_t* ws2) Appends a copy of ws2 to ws1. The first character of ws2 overwrites the terminating null at the end of ws1. The function returns ws1.
wcscmp(const wchar_t* ws1, const wchar_t* ws2) Compares the wide character string pointed to by ws1 with the wide character string pointed to by ws2 and returns a value of type int that is less than, equal to, or greater than 0 if the string ws1 is less than, equal to, or greater than the string ws2.
wcsncmp(const wchar_t* ws1, const wchar_t* ws2, size_t n) Compares up to n characters from the wide character string pointed to by ws1 with the wide character string pointed to by ws2. The function returns a value of type int that is less than, equal to, or greater than 0 if the string of up to n characters from ws1 is less than, equal to, or greater than the string of up to n characters from ws2.
wcschr(const wchar_t* ws, wchar_t wc) Returns a pointer to the first occurrence of the wide character, wc, in the wide character string pointed to by ws. If wc is not found in ws, the NULL pointer value is returned.
wcsstr(const wchar_t* ws1, const wchar_t* ws2) Returns a pointer to the first occurrence of the wide character string ws2 in the wide character string ws1. If ws2 is not found in ws1, the NULL pointer value is returned.

As you see from the descriptions, all these functions work in essentially the same way as the string functions you have already seen. Where the const keyword appears in the specification of the type of argument you can supply to a function, it implies that the argument will not be modified by the function. This forces the compiler to check that the function does not attempt to change such arguments. You'll see more on this in Chapter 7 when you explore how you create your own functions in more detail.
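To make the parallel with the ordinary string functions concrete, here is a minimal sketch that combines wcslen(), wcscpy(), and wcscat() to join two wide strings with a space between them, checking first that the result fits. The helper name join_wide() and its parameters are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper: builds "first second" in dest, which can hold */
/* n wide characters. Returns false if the result would not fit.     */
bool join_wide(wchar_t *dest, size_t n, const wchar_t *first, const wchar_t *second)
{
  /* +1 for the separating space, +1 for the terminating L'\0' */
  if(wcslen(first) + wcslen(second) + 2 > n)
    return false;

  wcscpy(dest, first);        /* Copy the first string          */
  wcscat(dest, L" ");         /* Append a single space          */
  wcscat(dest, second);       /* Append the second string       */
  return true;
}
```

After a successful call, you can search the result with wcschr() or wcsstr(), or compare it with wcscmp(), in just the way you would use strchr(), strstr(), and strcmp() with a char array.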

The <wchar.h> header also declares the fgetws() function that reads a wide character string from a stream such as stdin, which by default corresponds to the keyboard. You must supply three arguments to the fgetws() function, just like the fgets() function you use for reading single-byte strings:

  • The first argument is a pointer to an array of wchar_t elements that is to store the string.
  • The second argument is a value n of type int that is the maximum number of characters that can be stored in the array, including the terminating null.
  • The third argument is the stream from which the data is to be read, which will be stdin when you are reading a string from the keyboard.

The function reads up to n-1 characters from the stream and stores them in the array with an L'\0' appended. Reading a newline before n-1 characters have been read signals the end of input; the newline itself is stored in the array. The function returns a pointer to the array containing the string, or NULL if an error occurs or the end of the stream is reached before any characters are read.
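As a sketch of how fgetws() might be used in practice, the following helper reads one line, checks for the NULL return, and removes the newline that fgetws() stores. The name read_wide_line() is hypothetical; only fgetws() and wcschr() come from the standard library.

```c
#include <stdbool.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: reads one line from stream into ws, which can */
/* hold n wide characters, and strips the trailing newline if present. */
/* Returns false if nothing could be read (end of input or an error).  */
bool read_wide_line(wchar_t *ws, int n, FILE *stream)
{
  if(fgetws(ws, n, stream) == NULL)
    return false;

  wchar_t *newline = wcschr(ws, L'\n');  /* fgetws() keeps the newline */
  if(newline != NULL)
    *newline = L'\0';                    /* Overwrite it with the terminator */
  return true;
}
```

To read from the keyboard you would pass stdin as the third argument, for example read_wide_line(line, 100, stdin) with line declared as an array of 100 wchar_t elements.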

Testing and Converting Wide Characters

The <wctype.h> header declares functions to test for specific subsets of wide characters, analogous to the functions you have seen for characters of type char. These are shown in Table 6-4.

Table 6-4. Wide Character Classification Functions

Function Tests For
iswlower() Lowercase letter
iswupper() Uppercase letter
iswalpha() Uppercase or lowercase letter
iswalnum() Uppercase or lowercase letter or a decimal digit
iswcntrl() Control character
iswprint() Any printing character including space
iswgraph() Any printing character except space
iswdigit() Decimal digit (L'0' to L'9')
iswxdigit() Hexadecimal digit (L'0' to L'9', L'A' to L'F', L'a' to L'f')
iswblank() Standard blank characters (space, L'\t')
iswspace() Whitespace character (space, L'\n', L'\t', L'\v', L'\r', L'\f')
iswpunct() Printing character for which iswspace() and iswalnum() return false

You also have the case-conversion functions, towlower() and towupper(), that return the lowercase or uppercase equivalent of the wchar_t argument.
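Here is a short sketch showing a classification function and a conversion function working together on a wide string. The helper name wide_to_upper() is an assumption for illustration; iswalpha() and towupper() are standard functions declared in <wctype.h>.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: converts ws to uppercase in place and */
/* returns the number of letters that the string contains.    */
size_t wide_to_upper(wchar_t *ws)
{
  size_t letters = 0;
  for(size_t i = 0 ; ws[i] != L'\0' ; ++i)
  {
    if(iswalpha((wint_t)ws[i]))                  /* Count letters of either case */
      ++letters;
    ws[i] = (wchar_t)towupper((wint_t)ws[i]);    /* Non-letters are returned unchanged */
  }
  return letters;
}
```

Applied to an array initialized with L"a Wink 2", the function changes the contents to L"A WINK 2" and returns 5.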

You can see some of the wide character functions in action with a wide character version of Program 6.9.

Designing a Program

You've almost come to the end of this chapter. All that remains is to go through a larger example to use some of what you've learned so far.

The Problem

You are going to develop a program that will read a paragraph of text of an arbitrary length that is entered from the keyboard, and determine the frequency with which each word in the text occurs, ignoring case. The paragraph length won't be completely arbitrary, as you'll have to specify some limit for the array size within the program, but you can make the array that holds the text as large as you want.

The Analysis

To read the paragraph from the keyboard, you need to be able to read input lines of arbitrary length and assemble them into a single string that will ultimately contain the entire paragraph. You don't want lines truncated either, so fgets() looks like a good candidate for the input operation. If you define a symbol at the beginning of the code that specifies the array size to store the paragraph, you will be able to change the capacity of the program by changing the definition of the symbol.

The text will contain punctuation, so you will have to deal with that somehow if you are to be able to separate one word from another. It would be easy to extract the words from the text if each word were separated from the next by one or more spaces. You can arrange for this by replacing every character that cannot appear in a word with a space. This removes all the punctuation and any other odd characters that are lying around in the text. You don't need to retain the original text, but if you did you could just make a copy before eliminating the punctuation.

Separating out the words will be simple. All you need to do is extract each successive sequence of characters that are not spaces as a word. You can store the words in another array. Since you want to count word occurrences, ignoring case, you can store each word as lowercase. As you find a new word, you'll have to compare it with all the existing words you have found to see if it occurs previously. You'll only store it in the array if it is not already there. To record the number of occurrences of each word, you'll need another array to store the word counts. This array will need to accommodate as many counts as the number of words you have provided for in the program.

The Solution

This section outlines the steps you'll take to solve the problem. The program boils down to a simple sequence of steps that are more or less independent of one another. At the moment, the approach to implementing the program will be constrained by what you have learned up to now, and by the time you get to Chapter 9 you'll be able to implement this much more efficiently.

Step 1

The first step is to read the paragraph from the keyboard. As this is an arbitrary number of input lines it will be necessary to involve an indefinite loop. Let's first define the variables that we'll be using to code up the input mechanism:

/* Program 6.10 Analyzing text */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>

#define TEXTLEN  10000      /* Maximum length of text            */
#define BUFFERSIZE 100      /* Input buffer size                  */

int main(void)
{
  char text[TEXTLEN+1] = "";      /* Starts as an empty string   */
  char buffer[BUFFERSIZE];
  char endstr[] = "*\n";          /* Signals end of input        */

  printf("Enter text on an arbitrary number of lines.");
  printf("\nEnter a line containing just an asterisk to end input:\n\n");

  /* Read an arbitrary number of lines of text */
  while(true)
  {
    /* A string containing an asterisk followed by newline */
    /* signals end of input                                */
    if(!strcmp(fgets(buffer, BUFFERSIZE, stdin), endstr))
      break;

    /* Check if we have space for latest input */
    if(strlen(text)+strlen(buffer)+1 > TEXTLEN)
      {
        printf("Maximum capacity for text exceeded. Terminating program.");
        return 1;
      }
    strcat(text, buffer);
  }

  /* Plus the rest of the program code ... */

  return 0;
}

You can compile and run this code as it stands if you like. The symbols TEXTLEN and BUFFERSIZE specify the capacity of the text array and the buffer array respectively. The text array will store the entire paragraph, and the buffer array stores a line of input. You need some way for the user to tell the program when they have finished entering text. As the initial prompt for input indicates, entering a single asterisk on a line will do this. The single asterisk input will be read by the fgets() function as the string "*\n" because the function stores the newline character that arises when the Enter key is pressed. The endstr array stores the string that marks the end of the input so you can compare each input line with this array.

The entire input process takes place within the indefinite while loop that follows the prompt for input. A line of input is read in the if statement:

if(!strcmp(fgets(buffer, BUFFERSIZE, stdin), endstr))
      break;

The fgets() function reads a maximum of BUFFERSIZE-1 characters from stdin. If the user enters a line longer than this, it won't really matter. The characters that are in excess of BUFFERSIZE-1 will be left in the input stream and will be read on the next loop iteration. You can check that this works by setting BUFFERSIZE at 10, say, and entering lines longer than ten characters.

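Because the fgets() function returns a pointer to the string that you pass as the first argument, you can use fgets() as the argument to the strcmp() function to compare the string that was read with endstr. Thus, the if statement not only reads a line of input, it also checks whether the end of the input has been signaled by the user. Note that fgets() returns NULL if the end of the input is reached or a read error occurs, and passing NULL to strcmp() is an error, so this compact form relies on the user always ending the input with the asterisk line.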

Before you append the new line of input to what's already stored in text, you check that there is still sufficient free space in text to accommodate the additional line. To append the new line, just use the strcat() library function to concatenate the string stored in buffer with the existing string in text.
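The input loop can also be written so that it copes with fgets() returning NULL, which happens at the end of the input or after a read error. The following is a minimal sketch under that assumption, not the program's actual code: the function name read_paragraph() and its parameters are hypothetical, and it uses a function with pointer parameters, which the book covers in later chapters.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads lines from stream into text, which can hold */
/* capacity characters including the terminating null. Reading stops at a */
/* line containing just an asterisk or at the end of the input.           */
/* Returns false if the text would exceed the capacity.                   */
bool read_paragraph(FILE *stream, char *text, size_t capacity)
{
  char buffer[100];                      /* Holds one chunk of input */
  text[0] = '\0';                        /* Start with an empty string */

  while(fgets(buffer, sizeof buffer, stream) != NULL)
  {
    if(strcmp(buffer, "*\n") == 0)       /* End-of-input marker */
      break;

    /* Check there is room for the new chunk plus the terminating null */
    if(strlen(text) + strlen(buffer) + 1 > capacity)
      return false;

    strcat(text, buffer);
  }
  return true;
}
```

In the program you would call this as read_paragraph(stdin, text, sizeof text). Testing the return value of fgets() in the loop condition means that strcmp() is never handed a NULL pointer.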

Here's an example of output that results from executing this input operation:


Enter text on an arbitrary number of lines.
Enter a line containing just an asterisk to end input:

Mary had a little lamb,
Its feet were black as soot,
And into Mary's bread and jam,
His sooty foot he put.
*

Step 2

Now that you have read all the input text, you can replace the punctuation and any newline characters recorded by the fgets() function by spaces. The following code goes immediately before the return statement at the end of the previous version of main():

/* Replace everything except alphanumeric and single quote characters by spaces */
  for(int i = 0 ; i < strlen(text) ; i++)
  {
    if(text[i] == quote || isalnum(text[i]))
      continue;
    text[i] = space;
  }

The loop iterates over the characters in the string stored in the text array. We are assuming that words can only contain letters, digits, and single-quote characters, so anything that is not in this set is replaced by a space character. The isalnum() function, which returns true for a character that is a letter or a digit, is declared in the <ctype.h> header file, so you must add an #include directive for this header to the program. You also need to add declarations for the variables quote and space, following the declaration for endstr:

const char space = ' ';
const char quote = '\'';

You could, of course, use character literals directly in the code, but defining variables like this helps to make the code a little more readable.
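The same replacement step can be sketched as a small function. This version differs from the loop above in two ways that are my own choices rather than the chapter's: it tests for the terminating null instead of calling strlen() on every iteration, and it casts each character to unsigned char before passing it to isalnum(), because the behavior of the <ctype.h> functions is undefined for negative values other than EOF. The name blank_non_word_chars() is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: replaces every character in text that is not */
/* a letter, a digit, or a single quote with a space.                */
void blank_non_word_chars(char *text)
{
  for(size_t i = 0 ; text[i] != '\0' ; ++i)
  {
    if(text[i] != '\'' && !isalnum((unsigned char)text[i]))
      text[i] = ' ';
  }
}
```

Given the string "Begob, ma'am!\nOK 2", the function leaves "Begob  ma'am  OK 2" in the array: the comma, the exclamation mark, and the newline each become a space, and the length of the string is unchanged.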

Step 3

The next step is to extract the words from the text array and store them in another array. You can first add a couple more definitions for symbols that relate to the array you will use to store the words. These go immediately after the definition for BUFFERSIZE:

#define MAXWORDS    500      /* Maximum number of different words */
#define WORDLEN      15      /* Maximum word length                */

You can now add the declarations for the additional arrays and working storage that you'll need for extracting the words from the text, and you can put these after the existing declarations at the beginning of main():

char words[MAXWORDS][WORDLEN+1];
  int nword[MAXWORDS];            /* Number of word occurrences */
  char word[WORDLEN+1];            /* Stores a single word        */
  int wordlen = 0;                /* Length of a word            */
  int wordcount = 0;              /* Number of words stored      */

The words array stores up to MAXWORDS word strings of length WORDLEN, excluding the terminating null. The nword array holds counts of the number of occurrences of the corresponding words in the words array. Each time you find a new word, you'll store it in the next available position in the words array and set the element in the nword array that is at the same index position to 1. When you find a word that you have found and stored previously in words, you just need to increment the corresponding element in the nword array.

You'll extract words from the text array in another indefinite while loop because you don't know in advance how many words there are. There is quite a lot of code in this loop so we'll put it together incrementally. Here's the initial loop contents:

/* Find unique words and store in words array */
  int index = 0;
  while(true)
  {
    /* Ignore any leading spaces before a word */
    while(text[index] == space)
      ++index;

    /* If we are at the end of text, we are done */
    if(text[index] == '\0')
      break;

    /* Extract a word */
    wordlen = 0;          /* Reset word length */
    while(text[index] == quote || isalnum(text[index]))
    {
      /* Check if word is too long */
      if(wordlen == WORDLEN)
      {
        printf("Maximum word length exceeded. Terminating program.");
        return 1;
      }
      word[wordlen++] = tolower(text[index++]);  /* Copy as lowercase      */
    }
    word[wordlen] = '\0';                        /* Add string terminator */
  }

This code follows the existing code in main(), immediately before the return statement at the end.

The index variable records the current character position in the text array. The first operation within the outer loop is to move past any spaces that are there so that index refers to the first character of a word. You do this in the inner while loop that just increments index as long as the current character is a space.

It's possible that the end of the string in text has been reached, so you check for this next. If the current character at position index is '\0', you exit the loop because all words must have been extracted.

Extracting a word just involves copying any character that is alphanumeric or a single quote. The first character that is not one of these marks the end of a word. You copy the characters that make up the word into the word array in another while loop, after converting each character to lowercase using the tolower() function from the standard library. Before storing a character in word, you check that the size of the array will not be exceeded. After the copying process, you just have to append a terminating null to the characters in the word array.
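The word-extraction step can be sketched on its own as follows. This is an illustration under stated assumptions rather than the program's code: the name next_word() and its parameters are hypothetical, the text is assumed to have had its punctuation replaced by spaces already, and an overlong word is truncated here instead of terminating the program.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: starting at position *index in text, skips any  */
/* spaces and copies the next word into word as lowercase. At most      */
/* maxlen characters are stored, followed by a terminating null.        */
/* On return *index is the position just past the word.                 */
/* Returns false when there are no more words.                          */
bool next_word(const char *text, size_t *index, char *word, size_t maxlen)
{
  size_t i = *index;
  size_t len = 0;

  while(text[i] == ' ')                  /* Ignore leading spaces */
    ++i;

  /* A word consists of letters, digits, and single quotes */
  while(text[i] == '\'' || isalnum((unsigned char)text[i]))
  {
    if(len < maxlen)                     /* Store only what fits */
      word[len++] = (char)tolower((unsigned char)text[i]);
    ++i;
  }
  word[len] = '\0';                      /* Add string terminator */

  *index = i;
  return len > 0;
}
```

Calling the function repeatedly with the same index variable walks through the text one word at a time, so a loop of the form while(next_word(text, &index, word, WORDLEN)) visits every word exactly once.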

The next operation to be carried out in the loop is to see whether the word you have just extracted already exists in the words array. The following code does this and goes immediately before the closing brace for the while loop in the previous code fragment:

/* Check for word already stored */
    bool isnew = true;
    for(int i = 0 ; i< wordcount ; i++)
      if(strcmp(word, words[i]) == 0)
      {
        ++nword[i];
        isnew = false;
        break;
      }

The isnew variable records whether the word is present and is first initialized to indicate that the latest word you have extracted is indeed a new word. Within the for loop you compare word with successive strings in the words array using the strcmp() library function that compares two strings. The function returns 0 if the strings are identical; as soon as this occurs you set isnew to false, increment the corresponding element in the nword array, and exit the for loop.

The last operation within the indefinite loop that extracts words from text is to store the latest word in the words array, but only if it is new, of course. The following code does this:

if(isnew)
    {
      /* Check if we have space for another word */
      if(wordcount >= MAXWORDS)
      {
        printf("\nMaximum word count exceeded. Terminating program.");
        return 1;
      }

      strcpy(words[wordcount], word);    /* Store the new word  */
      nword[wordcount++] = 1;            /* Set its count to 1  */
    }

This code also goes after the previous code fragment, but before the closing brace in the indefinite while loop. If the isnew indicator is true, you have a new word to store, but first you verify that there is still space in the words array. The strcpy() function copies the string in word to the element of the words array selected by wordcount. You then set the value of the corresponding element of the nword array that holds the count of the number of times a word has been found in the text.
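The search and the store can be combined into one operation on the two parallel arrays. The following sketch shows one way to do that; the function name record_word() and its return convention are assumptions for illustration, and it assumes that word contains no more than WORDLEN characters.

```c
#include <string.h>

#define MAXWORDS    500      /* Maximum number of different words */
#define WORDLEN      15      /* Maximum word length                */

/* Hypothetical helper: counts one occurrence of word. If the word is  */
/* already in words, its count in nword is incremented; otherwise it   */
/* is stored in the next free position with a count of 1.              */
/* Returns the index of the word, or -1 if the words array is full.    */
int record_word(char words[][WORDLEN+1], int nword[], int *wordcount, const char *word)
{
  /* Check for word already stored */
  for(int i = 0 ; i < *wordcount ; ++i)
    if(strcmp(word, words[i]) == 0)
    {
      ++nword[i];
      return i;
    }

  /* Check if we have space for another word */
  if(*wordcount >= MAXWORDS)
    return -1;

  strcpy(words[*wordcount], word);       /* Store the new word */
  nword[*wordcount] = 1;                 /* Set its count to 1 */
  return (*wordcount)++;                 /* Return its index, then count it */
}
```

Because the function returns as soon as it finds a match, there is no need for a separate flag such as isnew; reaching the code after the loop already means that the word is new.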

Step 4

The last code fragment that you need will output the words and their frequencies of occurrence. Following is a complete listing of the program with the additional code from steps 3 and 4 highlighted in bold font:

/* Program 6.10 Analyzing text */
#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <ctype.h>

#define TEXTLEN  10000      /* Maximum length of text            */
#define BUFFERSIZE 100      /* Input buffer size                  */
#define MAXWORDS    500      /* Maximum number of different words */
#define WORDLEN      15      /* Maximum word length                */

int main(void)
{
  char text[TEXTLEN+1] = "";      /* Starts as an empty string   */
  char buffer[BUFFERSIZE];
  char endstr[] = "*\n";          /* Signals end of input        */

  const char space = ' ';
  const char quote = '\'';

  char words[MAXWORDS][WORDLEN+1];
  int nword[MAXWORDS];            /* Number of word occurrences */
  char word[WORDLEN+1];            /* Stores a single word        */
  int wordlen = 0;                /* Length of a word            */
  int wordcount = 0;              /* Number of words stored      */

  printf("Enter text on an arbitrary number of lines.");
  printf("\nEnter a line containing just an asterisk to end input:\n\n");

  /* Read an arbitrary number of lines of text */
  while(true)
  {
    /* A string containing an asterisk followed by newline */
    /* signals end of input                                */
    if(!strcmp(fgets(buffer, BUFFERSIZE, stdin), endstr))
      break;

    /* Check if we have space for latest input */
    if(strlen(text)+strlen(buffer)+1 > TEXTLEN)
      {
        printf("Maximum capacity for text exceeded. Terminating program.");
        return 1;
      }
    strcat(text, buffer);
  }

  /* Replace everything except alphanumeric and single quote characters by spaces */
  for(int i = 0 ; i < strlen(text) ; i++)
  {
    if(text[i] == quote || isalnum(text[i]))
      continue;
    text[i] = space;
  }
  /* Find unique words and store in words array */
  int index = 0;
  while(true)
  {
    /* Ignore any leading spaces before a word */
    while(text[index] == space)
      ++index;

    /* If we are at the end of text, we are done */
    if(text[index] == '\0')
      break;

    /* Extract a word */
    wordlen = 0;          /* Reset word length */
    while(text[index] == quote || isalnum(text[index]))
    {
      /* Check if word is too long */
      if(wordlen == WORDLEN)
      {
        printf("Maximum word length exceeded. Terminating program.");
        return 1;
      }
      word[wordlen++] = tolower(text[index++]);  /* Copy as lowercase      */
    }
    word[wordlen] = '\0';                        /* Add string terminator */

    /* Check for word already stored */
    bool isnew = true;
    for(int i = 0 ; i< wordcount ; i++)
      if(strcmp(word, words[i]) == 0)
      {
        ++nword[i];
        isnew = false;
        break;
      }

    if(isnew)
    {
      /* Check if we have space for another word */
      if(wordcount >= MAXWORDS)
      {
        printf("\nMaximum word count exceeded. Terminating program.");
        return 1;
      }

      strcpy(words[wordcount], word);    /* Store the new word  */
      nword[wordcount++] = 1;            /* Set its count to 1  */
    }
  }
  /* Output the words and frequencies */
  for(int i = 0 ; i<wordcount ; i++)
  {
    if( !(i%3) )                         /* Three words to a line */
      printf("\n");
    printf("  %-15s%5d", words[i], nword[i]);
  }

  return 0;
}

The seven lines highlighted in bold output the words and corresponding frequencies. This is very easily done in a for loop that iterates over the number of words. The loop code arranges for three words plus frequencies to be output per line by writing a newline character to stdout if the current value of i is a multiple of 3. The expression i%3 will be zero when i is a multiple of 3, and this value maps to the bool value false, so the expression !(i%3) will be true.
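To see exactly what the format string produces for one entry, you can write the result to a character array with the standard snprintf() function instead of sending it to stdout. This is only a sketch for experimenting with the format; the helper name format_entry() is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: formats one word and its count into dest, which */
/* can hold n characters, using the same format string as the program.  */
/* Returns the number of characters the complete entry requires.        */
int format_entry(char *dest, size_t n, const char *word, int count)
{
  /* Two spaces, the word left-justified in a field of 15 characters, */
  /* then the count right-justified in a field of 5 characters.       */
  return snprintf(dest, n, "  %-15s%5d", word, count);
}
```

Each entry is therefore 22 characters wide for words of up to 15 characters, which is what makes the three columns line up in the output.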

The program ends up as a main() function of around 100 lines. Once you have learned more of the C language, you would organize this program very differently, with the code segmented into several much shorter functions. By Chapter 9 you'll be in a position to do this, and I would encourage you to revisit this example when you reach the end of Chapter 9. Here's a sample of output from the complete program:

Enter text on an arbitrary number of lines.
Enter a line containing just an asterisk to end input:

When I makes tea I makes tea, as old mother Grogan said.
And when I makes water I makes water.
Begob, ma'am, says Mrs Cahill, God send you don't make them in the same pot.
*

  when               2  i                  4  makes              4
  tea                2  as                 1  old                1
  mother             1  grogan             1  said               1
  and                1  water              2  begob              1
  ma'am              1  says               1  mrs                1
  cahill             1  god                1  send               1
  you                1  don't              1  make               1
  them               1  in                 1  the                1
  same               1  pot                1

Summary

In this chapter, you applied the techniques you acquired in earlier chapters to the general problem of dealing with character strings. Strings present a different, and perhaps more difficult, problem than numeric data types.

Most of the chapter dealt with handling strings using arrays, but I also mentioned pointers. These will provide you with even more flexibility in dealing with strings, and many other things besides, as you'll discover as soon as you move on to the next chapter.

Exercises

The following exercises enable you to try out what you've learned in this chapter. If you get stuck, look back over the chapter for help. If you're still stuck, you can download the solutions from the Source Code/Downloads section of the Apress web site (http://www.apress.com), but that really should be a last resort.

Exercise 6-1. Write a program that will prompt for and read a positive integer less than 1000 from the keyboard, and then create and output a string that is the value of the integer in words. For example, if 941 is entered, the program will create the string "Nine hundred and forty one".

Exercise 6-2. Write a program that will allow a list of words to be entered separated by commas, and then extract the words and output them one to a line, removing any leading or trailing spaces. For example, if the input is

John  ,  Jack ,    Jill

then the output will be


John
Jack
Jill

Exercise 6-3. Write a program that will output a randomly chosen thought for the day from a set of at least five thoughts of your own choosing.

Exercise 6-4. A palindrome is a phrase that reads the same backward as forward, ignoring whitespace and punctuation. For example, "Madam, I'm Adam" and "Are we not drawn onward, we few? Drawn onward to new era?" are palindromes. Write a program that will determine whether a string entered from the keyboard is a palindrome.
