© Ray Lischner 2020
R. Lischner, Exploring C++20, https://doi.org/10.1007/978-1-4842-5961-0_19

19. Case-Folding

Ray Lischner, Ellicott City, MD, USA

Picking up where we left off in Exploration 18, the next step in improving the word-counting program is to make it ignore case differences when counting. For example, the program should count The just as it does the. This is a classic problem in computer programming. C++ offers some rudimentary help but lacks some important fundamental pieces. This Exploration takes a closer look at this deceptively tricky issue.

Simple Cases

Western European languages have long made use of capital (or majuscule) letters and minuscule letters. The more familiar terms—uppercase and lowercase—arise from the early days of typesetting, when the type slugs for majuscule letters were kept in the upper cases of large racks containing all the slugs used to make a printing plate. Beneath them were the cases, or boxes, that stored the minuscule letter slugs.

In the <locale> header, C++ declares the isupper and islower functions. They take a character as the first argument and a locale as the second argument. The return value is a bool: isupper returns true if the character is uppercase in that locale, and islower returns true if it is lowercase. Each function returns false for any other character, including characters that are not letters.
std::isupper('A', std::locale{"en_US.latin1"}) == true
std::islower('A', std::locale{"en_US.latin1"}) == false
std::isupper('Æ', std::locale{"en_US.latin1"}) == true
std::islower('Æ', std::locale{"en_US.latin1"}) == false
std::islower('½', std::locale{"en_US.latin1"}) == false
std::isupper('½', std::locale{"en_US.latin1"}) == false

The <locale> header also declares two functions to convert case: toupper converts lowercase to uppercase. If its character argument is not a lowercase letter, toupper returns the character as is. Similarly, tolower converts to lowercase, if the character in question is an uppercase letter. Just like the category testing functions, the second argument is a locale object.
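As a minimal sketch of these two functions in use, the following helpers convert a whole string one character at a time. The names to_upper_copy and to_lower_copy are hypothetical (they are not part of the standard library or of this book's listings), and the sketch uses #include rather than import only because not every compiler supports importing standard headers yet.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: return a copy of text with every character passed
// through std::toupper for the given locale. Characters that are not
// lowercase letters in that locale are copied unchanged.
std::string to_upper_copy(std::string const& text, std::locale const& loc)
{
  std::string result{};
  for (char ch : text)
    result.push_back(std::toupper(ch, loc));
  return result;
}

// Hypothetical helper: the same idea, using std::tolower.
std::string to_lower_copy(std::string const& text, std::locale const& loc)
{
  std::string result{};
  for (char ch : text)
    result.push_back(std::tolower(ch, loc));
  return result;
}
```

Calling to_upper_copy("Hello, World 42", std::locale::classic()) should leave the punctuation, space, and digits alone and change only the letters.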

Now you can modify the word-counting program to fold uppercase to lowercase and count all words in lowercase. Modify your program from Exploration 18, or start with Listing 18-4. If you have difficulty, take a look at Listing 19-1.
import <iostream>;
import <locale>;
import <map>;
import <string>;
int main()
{
  using count_map = std::map<std::string, int>;
  std::locale native{""};     // get the native locale
  std::cin.imbue(native);     // interpret the input and output according to
  std::cout.imbue(native);    // the native locale
  count_map counts{};
  std::string word{};
  // Read words from the standard input and count the number of times
  // each word occurs.
  while (std::cin >> word)
  {
    // Make a copy of word, keeping only alphanumeric characters
    // and folding each one to lowercase.
    std::string copy{};
    for (char ch : word)
      if (std::isalnum(ch, native))
        copy.push_back(std::tolower(ch, native));
    // The "word" might be all punctuation, so the copy would be empty.
    // Don't count empty strings.
    if (not copy.empty())
      ++counts[copy];
  }
  // For each word/count pair, print the word & count on one line.
  for (auto pair : counts)
    std::cout << pair.first << ' ' << pair.second << '\n';
}
Listing 19-1. Folding Uppercase to Lowercase Prior to Counting Words

That was easy. So what’s the problem?

Harder Cases

Some of you—especially German readers—already know the problem. Several languages have letter combinations that do not map easily between uppercase and lowercase, or one character maps to two characters. The German Eszett, ß, is a lowercase letter; when you convert it to uppercase, you get two characters: SS. Thus, if your input file contains “ESSEN” and “eßen”, you want them to map to the same word, so they’re counted together, but that just isn’t feasible with C++. The way the program currently works, it maps “ESSEN” to “essen”, which it counts as a different word from “eßen”. A naïve solution would be to map “essen” to “eßen”, but not all uses of ss are equivalent to ß.
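Part of the difficulty is visible in the function's interface: toupper takes one character and returns one character, so no locale can make it return a two-character result such as SS. The sketch below checks that at compile time and adds a hypothetical helper, uppercased_size, whose name and purpose are purely illustrative.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <type_traits>

// std::toupper returns a single character of the same type as its argument.
static_assert(std::is_same_v<decltype(std::toupper('a', std::locale{})), char>);
static_assert(std::is_same_v<decltype(std::toupper(L'a', std::locale{})), wchar_t>);

// Hypothetical helper: uppercase text one character at a time and report
// the size of the result. Because each input character yields exactly one
// output character, the size can never differ from text.size().
std::size_t uppercased_size(std::string const& text, std::locale const& loc)
{
  std::string result{};
  for (char ch : text)
    result.push_back(std::toupper(ch, loc));
  return result.size();
}
```

Whatever locale you pass, the result has as many characters as the input, which is why a one-to-two mapping cannot be expressed this way.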

Greek readers are familiar with another kind of problem. Greek has two forms of lowercase sigma: use ς at the end of a word and σ elsewhere. Our simple program maps Σ (uppercase sigma) to σ, so some words written in all uppercase will not convert to a form that matches their lowercase versions.

Sometimes, accents are lost during conversion. Mapping é to uppercase usually yields É but may also yield E. Mapping uppercase to lowercase has fewer problems, in that É maps to é, but what if that E (which maps to e) really means É, and you want it to map to é? The program has no way of knowing the writer’s intentions, so all it can do is map the letters it receives.

Some character sets are more problematic than others. For example, ISO 8859-1 has a lowercase ÿ but not an uppercase equivalent (Ÿ). Windows-1252, on the other hand, extends ISO 8859-1, and one of the new code points is Ÿ.

Tip

Code point is a fancy way of saying “numeric value that represents a character.” Although most programmers don’t use code point in everyday speech, those programmers who work closely with character-set issues use it all the time, so you may as well get used to it. Mainstream programmers should become more accustomed to using this phrase.

In other words, converting case is impossible to do correctly using only the standard C++ library.

If you know your alphabet is one that C++ handles correctly, then go ahead and use toupper and tolower. For example, if you are writing a command-line interpreter, within which you have full control over the commands, and you decide that the user should be able to enter commands in any case, simply make sure the commands map correctly from one case to another. This is easy to do, as all character sets can map the 26 letters of the Roman alphabet without any problems.
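A minimal sketch of that command-line scenario follows, assuming the commands use only the 26 unaccented letters. The names fold_command and is_quit, and the command quit itself, are hypothetical. The sketch uses the classic ("C") locale so that the folding does not depend on the user's environment.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: fold a command name to lowercase in the classic
// locale, which handles the letters A-Z and a-z the same way everywhere.
std::string fold_command(std::string const& command)
{
  std::locale const& classic = std::locale::classic();
  std::string result{};
  for (char ch : command)
    result.push_back(std::tolower(ch, classic));
  return result;
}

// Hypothetical command test: accept "quit" in any mixture of cases.
bool is_quit(std::string const& input)
{
  return fold_command(input) == "quit";
}
```

Because you control the list of commands, you can guarantee that every command name folds correctly.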

On the other hand, if your program accepts input from the user and you want to map that input to uppercase or lowercase, you cannot and must not use standard C++. For example, if you are writing a word processor, and you decide you need to implement some case-folding functions, you must write or acquire a library outside the standard to implement the case-folding logic correctly. Most likely, you would need a library of character and string functions to implement your word processor. Case-folding would simply be one small part of this hypothetical library. (See this book’s website for some links to non-hypothetical libraries that can help you.)

What about our simple program? It isn’t always practical to implement complete, correct handling of cases and characters when you just want to count a few words. The case-handling code would dwarf the word-counting code.

In this case (pun intended), you must accept the fact that your program will sometimes produce incorrect results. Our poor little program will never recognize that “ESSEN” and “eßen” are the same word but in different cases. You can work around some of the multiple mappings (such as with Greek sigma) by mapping to uppercase, then to lowercase. On the other hand, this can introduce problems with some accented characters. And I still have not touched upon the issue of whether “naïve” is the same word as “naive”. In some locales, the diacritics are significant, which would cause “naïve” and “naive” to be interpreted as two different words. In other locales, they are the same word and should be counted together.
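The uppercase-then-lowercase workaround can be sketched as follows. The function name fold_case is hypothetical. Whether it actually merges the two forms of sigma depends on the locale: the sketch assumes a locale in which both ς and σ map to Σ, and Σ maps back to σ.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: fold each character to uppercase and then back to
// lowercase, so that lowercase variants sharing one uppercase letter end
// up as the same lowercase letter (if the locale maps them that way).
std::string fold_case(std::string const& word, std::locale const& loc)
{
  std::string result{};
  for (char ch : word)
    result.push_back(std::tolower(std::toupper(ch, loc), loc));
  return result;
}
```

For plain unaccented letters, the result is the same as calling tolower alone, so the extra step costs a little time but does no harm.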

In some character sets, accented characters can be composed from a separate non-accented character followed by the desired accent. For example, you may be able to write “nai”, then a combining diaeresis (¨), then “ve”, which displays the same as “naïve”.
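To a program, the two spellings are different sequences of code points. The sketch below assumes UTF-8 encoding, which this Exploration does not otherwise require, and spells out the bytes with escapes so that the result does not depend on how the source file is saved. The names precomposed, decomposed, and same_spelling are hypothetical.

```cpp
#include <string>

// Assumption: both strings hold UTF-8 bytes.
// U+00EF (i with diaeresis) is the two bytes C3 AF.
std::string const precomposed{"na\xC3\xAFve"};
// A plain i followed by U+0308 (combining diaeresis), the two bytes CC 88.
std::string const decomposed{"nai\xCC\x88ve"};

// Hypothetical helper: compare two spellings the way std::map compares
// its std::string keys, that is, byte by byte.
bool same_spelling(std::string const& a, std::string const& b)
{
  return a == b;
}
```

The word-counting program would count these as two different words, even though they look identical when printed.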

I hope by now you are completely scared away from manipulating cases and characters. Far too many naïve programmers become entangled in this web or, worse, simply write bad code. I was tempted to wait until much later in the book before throwing all this at you, but I know that many readers will want to improve the word-counting program by ignoring case, so I decided to tackle the problem early.

Well, now you know better.

That doesn’t mean you can’t keep working on the word-counting program. The next Exploration returns to the realm of the realistic and feasible, as I finally show you how to write your own functions.
