© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
M. Kalin, Modern C Up and Running, https://doi.org/10.1007/978-1-4842-8676-0_8

8. Miscellaneous Topics

Martin Kalin, Chicago, IL, USA

8.1 Overview

This chapter introduces libraries and topics not seen so far, but it also extends and refines the coverage of earlier material. For example, the flexible library function system, for quick multiprocessing, is introduced; the input function scanf is examined more closely.

The chapter begins with regular expressions, a language designed for pattern matching, which makes the language well suited for verifying input. Indeed, professional data validation relies on regular expressions as a base level. The chapter then moves to assertions, which allow the programmer to express and enforce constraints in a program. A section on locales and internationalization follows. Short code examples and full programs get into the details.

WebAssembly is a language designed for high-performance web modules, for example, ones that do serious number crunching. C is among the earliest languages (the others are C++ and Rust) to compile into WebAssembly. The section on WebAssembly goes into detail with a full code example.

A signal is a low-level but still powerful way for one process to communicate with another, and C has an API for generating and handling signals. The section on signals is code oriented as usual.

The chapter ends with a section on building static and dynamic libraries in C. It is no surprise that a client written in C can consume a library written in the same language, but almost every modern language can interoperate with C. This section underscores the point by having a Python client consume a C library built from scratch.

8.2 Regular Expressions

The regular expression language, or regex for short, is used to match strings against patterns and even for editing strings. Users of command-line utilities such as grep (named for the ed editor command g/re/p: globally search for a regular expression and print) or rename already have experience with regex. In web and other applications, regex verification of user input is best practice; modern programming languages typically support regex. The first code example prompts a user for an employee ID and then checks whether the entered string matches a pattern that validates IDs.
#include <stdio.h>
#include <regex.h>
#define MaxBuffer 64
int main(void) {
  char input[MaxBuffer];
  char error[MaxBuffer + 1]; /* null terminator */
  printf("Employee Id: ");
  scanf("%7s", input); /* read only 7 chars */
  const char* regex = "^[A-Z]{2}[1-9]{3}[a-k]{2}$"; /* regex as a string */
  regex_t regex_comp;
  int flag;
  if ((flag = regcomp(&regex_comp, regex, REG_EXTENDED)) != 0) { /* nonzero signals an error */
    regerror(flag, &regex_comp, error, MaxBuffer);
    fprintf(stderr, "Error compiling '%s': %s\n", regex, error);
    return 1;
  }
  if (REG_NOMATCH == regexec(&regex_comp, input, 0, NULL, 0)) /* match? */
    fprintf(stderr, "\n%s is an invalid employee ID.\n", input);
  else
    fprintf(stderr, "\n%s is a valid employee ID.\n", input);
  regfree(&regex_comp); /* good idea to clean up */
  return 0;
}
Listing 8-1

A regex to check an employee ID

The empId program (see Listing 8-1) prompts the user for an employee ID and then reads the entered ID using scanf:
scanf("%7s", input); /* read only 7 chars */

The 7 in the format string %7s ensures that no more than seven characters are scanned into the buffer named input, which has room for 64 in any case.

The program then compiles a regex pattern given as a string. This pattern is the most complicated part of the program and so deserves careful analysis. The pattern consists of three parts, and each part consists of a set and a count. For now, ignore the start character ^ and the end character $; these are covered shortly.

The first set/count pair is
[A-Z]{2}
The square brackets represent a set, a collection of nonduplicate items in which order does not matter. For example, the set
[1234]
is the same as the set
[2143]

In the empId program, the members of the first set are the uppercase letters A,B,…,Z. These letters could be enumerated in the square brackets and in any order, a tedious undertaking. The regex language thus has a shortcut: [A-Z] means the uppercase letters A through Z.

Immediately after the set [A-Z] comes the count (quantifier) of how many characters from the set are required. The count occurs in braces:
[A-Z]{2} /* exactly 2 letters from the set A-Z */
The count can be flexible. For example, the count in
[A-Z]{2,4} /* 2 to 4 letters from the set A-Z */

allows two to four letters from the set.

The second part of the pattern requires exactly three decimal digits from the set [1-9]:
[1-9]{3} /* 3 digits, 1 through 9 */
The third part of the pattern requires two lowercase letters, but in the range of a through k:
[a-k]{2} /* 2 letters, a through k */
Here is a summary of other quantifier options:
[A-Z]?   /* zero or one from the set */
[A-Z]*   /* zero or more from the set */
[A-Z]+   /* one or more from the set */
The employee ID is supposed to begin with an uppercase letter and end with a lowercase letter. There should not be any other characters, including whitespace, flanking the employee ID on either side. To express this requirement, the regex expression uses anchors: the hat character ^ is the left anchor, and the dollar-sign character $ is the right anchor. Without these anchors, an employee ID such as
foobarAB123bb9876

would pass muster because the substring AB123bb matches the pattern without the anchors. The anchored expression requires that the ID start with an uppercase letter and end with a lowercase one.

The employee ID pattern as a string is compiled using the library function regcomp, which fills in a regex_t instance if successful. The compiled pattern is then used in matches. The last argument to regcomp is REG_EXTENDED, which selects the POSIX extended regular expression syntax; in this syntax, the braces of a count such as {2} and the quantifiers ? and + need no backslash escapes. There is also a C library, PCRE (see www.pcre.org/), that supports Perl syntax and features; Perl-style syntax has become the de facto standard for regex.

Once the pattern is compiled, it can be used in a call to regexec, which matches the pattern against an input string. The call takes five arguments:
if (REG_NOMATCH == regexec(&regex_comp, /* compiled pattern */
                           input,       /* input string */
                           0,           /* zero capture groups */
                           NULL,        /* no capture array */
                           0))          /* no special flags */

The first two arguments are the address of the compiled pattern and the string to test against the pattern, which in this case is the user input. The next two arguments, 0 and NULL, are for capture groups: parts of the string to be tested can be captured for later reference. In this example, the capture option is not needed; hence, the number of capture groups is 0, and then there is NULL instead of an array in which to save the captures. A later example illustrates captures. The last argument consists of optional integer flags that adjust matching, for example, REG_NOTBOL, which stops the ^ anchor from matching at the start of the string. (A flag to ignore case when matching letters, REG_ICASE, is given to regcomp rather than to regexec.) In this example, there are no flags, which 0 represents.

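The empId program works as advertised. For example, it accepts AQ431af as an employee ID but rejects AQ431mf (m is not between a and k, inclusive) and AQ444k7 (ends with a digit, not a letter). One caveat: because scanf reads at most seven characters, a longer entry such as AQ444kk7 is truncated to AQ444kk before the match is attempted, so the trailing 7 never reaches the regex.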

A first experience with regex syntax may seem daunting, but a rhetorical question puts the challenge into perspective: Would it be easier to learn regex, or to write a program from scratch that does what the empId example requires? Regular expressions are not always intuitive, but they make up for this shortcoming with their power and flexibility.
#include <stdio.h>
#include <unistd.h>
#include <regex.h>
#define MaxBuffer 128
#define GroupCount  4 /* entire expression counts as one group by default */
int main(void) {
  char error[MaxBuffer + 1];
  char* inputs[ ] = {"AABC123dd95", "Az4321jb81", "QQ987ii4",
                     "QQ98ii4", "YTE987ef4", "ARNQ999kk6", NULL};
  const char* regex = "^([A-Z]{2,4})([1-9]{3})([a-k]{2})[0-9]+$";
  regex_t regex_comp;
  int flag;
  if ((flag = regcomp(&regex_comp, regex, REG_EXTENDED)) != 0) { /* nonzero signals an error */
    regerror(flag, &regex_comp, error, MaxBuffer);
    printf("Regex error compiling '%s': %s\n", regex, error);
    return 1;
  }
  unsigned i = 0, j;
  while (inputs[i]) { /* iterate over the inputs */
    regmatch_t groups[GroupCount]; /* for extracting substrings */
    if (REG_NOMATCH == regexec(&regex_comp, inputs[i], GroupCount, groups, 0))
      fprintf(stderr, "\n%s is not a valid employee ID.\n", inputs[i]);
    else {
      fprintf(stdout, "\nValid employee ID. %i parts follow:\n", GroupCount);
      fflush(stdout); /* flush buffered output before the unbuffered write calls */
      for (j = 0; j < GroupCount; j++) {
        if (groups[j].rm_so < 0) break;
        write(1, inputs[i] + groups[j].rm_so, groups[j].rm_eo - groups[j].rm_so);
        write(1, "\n", 1);
      }
      printf("-----\n");
    }
    i++; /* loop counter */
  }
  regfree(&regex_comp); /* good idea to clean up */
  return 0;
}
Listing 8-2

A revised version of the empId program

The empId2 program (see Listing 8-2) adds features to the original empId program. The new features can be summarized as follows:
  • An employee ID may start out with between two and four letters. In the fictitious company for which the employees work, the number of starting letters is a security code: two letters is low-security, three is middle-security, and four is high-security clearance.

  • An employee ID must end with one or more decimal digits.

  • The empId2 program introduces groups, the three parenthesized expressions, in order to parse the employee ID.

The revised regex expression is
^([A-Z]{2,4})([1-9]{3})([a-k]{2})[0-9]+$ /* [0-9]+ means 1 or more decimal digits */

The anchors remain, but the end requirement for one or more decimal digits is new. The other major change is the use of parenthesized subexpressions, each of which represents a group that is captured for later analysis.

The major change in the rest of the code has to do with group captures. The code declares an array:
regmatch_t groups[GroupCount]; /* for extracting substrings */
The value of GroupCount is four, one more than the number of parenthesized subexpressions (in this case, three) in the regex. The reason is that the portion of the string matched by the entire regex counts as a group as well, in fact the first: its offsets are stored in groups[0]. The regmatch_t type is
typedef struct {
   regoff_t rm_so; /* start offset */
   regoff_t rm_eo; /* end offset */
} regmatch_t;

The two offsets indicate where, in the string to be matched, the different groups begin and end. The groups array, in the current example, has four elements of this type. For the first string to be matched, AABC123dd95, the start index (rm_so in the structure) for the first subexpression is 0, and the end index (rm_eo) is 4, immediately beyond the last character C in the first subexpression.

Given the regmatch_t, it is straightforward to print the captured groups in valid employee IDs. Indeed, the easy way is to use the low-level I/O API. Here is the relevant statement:
write(1,                                  /* stdout */
      inputs[i] + groups[j].rm_so,        /* start */
      groups[j].rm_eo - groups[j].rm_so); /* length */
The first argument to write is, of course, the standard output. The second argument takes the base address of a test string (for instance, inputs[0] is the string AABC123dd95) and adds the start offset (rm_so, which is 0, 4, or 7). The third argument to write is the captured part’s length: the end index (one beyond the end of the part) minus the start index. The output for parsing the first two candidate IDs is
Valid employee ID. 4 parts follow:
AABC123dd95
AABC
123
dd
Az4321jb81 is not a valid employee ID.

The POSIX regex library covers the basics but does not include newer features such as lookaheads. Such features make some pattern matching easier to write or more efficient, although the same matching can typically be done without them. The previously mentioned PCRE (Perl Compatible Regular Expressions) library is an option for such newer features.

8.3 Assertions

An assertion checks whether a program satisfies a condition at a specified point in its execution. There are three traditional types of assertion that can be used to check a program module such as a C block:
  • An assertion expressing a precondition, which must hold at the start of a block

  • An assertion expressing a postcondition, which must hold at the end of a block

  • An assertion expressing an invariant, which must hold throughout a block

C implements assertions with the assert macro, which takes an arbitrary boolean expression as its argument. If the expression evaluates to true (nonzero), the program continues execution; otherwise, the program aborts with an explanatory error message.
#include <stdio.h>
#include <regex.h>
#include <assert.h>
#define MaxBuffer 64
#define MaxTries 3
unsigned check_id(const char* id, regex_t* regex) {
  return REG_NOMATCH != regexec(regex, id, 0, NULL, 0);
}
int main(void) {
  const char* regex_s = "^[A-Z]{2,4}[1-9]{3}[a-k]{2}[0-1]?$";
  regex_t regex_c;
  if (regcomp(&regex_c, regex_s, REG_EXTENDED) != 0) { /* nonzero signals an error */
    fprintf(stderr, "Bad regex. Exiting.\n");
    return 1;
  }
  char id[MaxBuffer];
  unsigned tries = 0, flag = 0;
  assert(0 == tries);            /* precondition */
  do {
    assert(tries < MaxTries);    /* invariant */
    printf("Employee Id: ");
    scanf("%10s", id);
    if (check_id(id, &regex_c)) {
      flag = 1;
      break;
    }
    tries++;
  } while (tries < MaxTries);
  assert(tries <= MaxTries);        /* postcondition */
  regfree(&regex_c); /* clean up */
  if (flag) printf("%s verified.\n", id);
  else printf("%s not verified.\n", id);
  return 0;
}
Listing 8-3

Using assertions to track login attempts

The verifyEmp program (see Listing 8-3) builds on the earlier empId program, in particular by using a regex to verify an employee’s ID. The regex itself has changed a little in order to show more aspects of the language:
^[A-Z]{2,4}[1-9]{3}[a-k]{2}[0-1]?$ /* new part is: [0-1]? */

This pattern allows the starting uppercase letters to be between two and four in number and makes a single ending digit (either 0 or 1) optional. The function check_id takes two arguments, the ID to verify and the compiled regex; the function returns either true, if the candidate ID matches the regex, or false otherwise.

The program uses a do while loop to prompt the user for an employee ID. Of interest now is that the employee is to get no more than MaxTries chances to enter the ID. Similar approaches are used for login/password combinations, of course. The loop condition is
while (tries < MaxTries)
where tries is updated on each attempt and MaxTries is a macro defined as 3. If this condition were changed to
while (tries < MaxTries + 1)
and the user failed to provide a valid ID, the program would abort, and the error message from the failed assertion would be
empId3: empId3.c:24: main: Assertion 'tries < 3' failed.
The 24 represents line 24 in the source code, the assertion immediately after the do:
assert(tries < MaxTries); /* invariant */
The verifyEmp program has three assertions, each with a different test:
  • The precondition occurs immediately before the loop starts. It checks that, at this point, the value of tries is zero. If tries were not initialized at all, then, as a stack-based variable, its value would be indeterminate and possibly greater than MaxTries already. The precondition is evaluated exactly once, as it occurs before the loop.

  • The postcondition occurs immediately after the loop ends. It checks that, at this point, tries is less than or equal to the value of MaxTries. There are two possibilities:
    • Suppose that the candidate ID is verified in any one of the three allowed attempts. Even if success comes at the third and final attempt, the value of tries is only 2 and so still less than MaxTries, which is 3.

    • Suppose that the candidate ID fails three times. Control then exits the loop because of the loop test that the value of tries be strictly less than the value of MaxTries: both tries and MaxTries now have a value of 3. The loop test has done its job, and so the program should continue to run normally. The postcondition thus must allow tries to be less than or equal to the value of constant MaxTries.

  • The invariant occurs immediately inside the loop, which is the only place that tries changes after its initialization to zero. On each iteration, tries is incremented by 1. If the candidate ID is verified, then the break statement, rather than the loop test, is what moves control beyond the loop. If tries is incremented to 3, then the loop condition, not the break statement, should cause control to exit the loop. Accordingly, the invariant checks that tries is always less than MaxTries.

The syntax of assertions is easy in C, but the reasoning behind assertion tests and assertion placement can be complicated. Even a program as relatively simple as verifyEmp confirms the point. The complication arises because assertions articulate reasoning about program correctness—and determining what makes a program correct is notoriously hard.

C has a convenient way to turn assertions off without commenting out the assert statements or deleting them from the source code. In a file with assertions, simply define the macro NDEBUG before the #include of assert.h:
#define NDEBUG /* turns off assertions; must precede #include <assert.h> */

As code development moves from testing to production, it is common to turn assertions off.

8.4 Locales and i18n

Date, currency, and other information should be formatted in a locale-aware way as part of i18n programming, where i18n abbreviates internationalization. (The skeptic should count the letters between the i and the n.) Consider, for example, this large number formatted in a way familiar to North Americans:
1,234,567,891.234
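In Germany or Italy, the expected format would be the following, with the roles of the comma and the period reversed (Norway conventionally groups digits with spaces instead):
1.234.567.891,234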
Locale information is available as part of the environment of a local system. When a C program begins execution, the program inherits environment variables about the locale and other features, but the C library does not act on them automatically: every program starts in the minimal C locale, and library functions stay in that locale until told otherwise. Accordingly, a locale-aware program needs to do some initialization. Before looking at this initialization in code, it will be useful to consider how a C program can get environment information in general.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
extern char** environ; /* declaration */
int main(void) {
  int i = 0;
  while (environ[i]) printf("%s\n", environ[i++]);
  printf("Locale: %s\n", getenv("LANG")); /* en_US.UTF-8 */
  char cmd[32];
  strcpy(cmd, "locale -a");
  int status = system(cmd);
  printf("\n%s exited with %i\n", cmd, status);
  return 0;
}
Listing 8-4

How to get information about the program environment

The environ program (see Listing 8-4) shows two ways to access environment information. The first way uses the extern variable environ, an array of strings each with a key=value format. Here, for example, are two entries from my desktop system: the first key/value pair provides information about the terminal, and the second identifies the shell program.
TERM=xterm
SHELL=/bin/bash

The library function getenv takes a single argument, a key such as TERM or SHELL as a string, and returns the matching value or NULL if the key is not set. The printf call illustrates with the key LANG, which gives a standard abbreviation (en_US for English in the United States) together with the character encoding scheme, in this case UTF-8 (Unicode Transformation Format-8). UTF-8 encodes each Unicode code point as a sequence of one to four 8-bit bytes.

The last part of the environ program introduces the versatile system function. This function takes a single string argument, which represents a shell command, that is, a command that can be given at the command line. The system function starts another process and then blocks until the started process terminates. The int value returned by the system function encodes the termination status of the process in question; on Unix-like systems, the macro WEXITSTATUS from the header sys/wait.h extracts the exit code. In this example, the command is locale -a, a utility that (with the -a flag) lists all of the locales available on the system. (The locale utility is available on Unix-like systems and on Windows through Cygwin.)

A given system supports some locales, but not others. The system administrator is responsible for installing and otherwise managing locale information. At the command line, or through the environ program shown previously, a listing of locales would look something like this:
C
C.UTF-8
en_AG.utf8
en_AU.utf8
...

The string en_AG.utf8 represents English in Antigua and Barbuda, whereas en_AU.utf8 represents English in Australia. The first two entries, C and C.UTF-8, represent the minimal default locale in effect when a program starts. In the setlocale function, investigated shortly, entries such as C.UTF-8 can be used as an argument.

Here is the declaration for the setlocale function:
char* setlocale(int category, const char* locale);
If the second argument is NULL, the function acts as a getter or query: the function returns a string that represents the current locale. If the second argument is not NULL, the function acts as a setter by setting the locale represented by the second argument, a string; in this case, the function returns NULL if the request cannot be honored. (The empty string as the second argument selects the user's preferred locale, as specified by environment variables such as LANG; it does not select the C locale.) Furthermore, the string returned from setlocale should be treated as opaque, because its format is implementation-defined. This string is useful chiefly as a second argument to setlocale. A typical use of the string would be as follows:
  1. Retrieve the current locale, and save it as a string.

  2. Set the locale to something new, and perform whatever application logic is appropriate.

  3. Restore the saved locale by using the string from step 1 as the second argument to setlocale.

The next code example illustrates.
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
  setlocale(LC_ALL, ""); /* set current locale for library functions */
  char* prev_locale = setlocale(LC_ALL, NULL); /* with NULL, a getter, not a setter */
  char* saved_locale = strdup(prev_locale);    /* get a separate copy */
  if (NULL == saved_locale) {                  /* verify the copying */
    perror(NULL); /* out of memory */
    return 1;
  }
  const struct lconv* loc = localeconv(); /* get ptr to current locale struct */
  printf("Currency symbol: %s\n", loc->currency_symbol);
  setlocale(LC_ALL, "en_GB.utf8"); /* English in Great Britain */
  loc = localeconv();
  printf("Currency symbol: %s\n", loc->currency_symbol);
  setlocale(LC_ALL, saved_locale); /* restored saved locale */
  /*...*/
  free(saved_locale); /* strdup allocates the copy with malloc */
  return 0;
}
Listing 8-5

Introducing the setlocale function

The localeBasics program (see Listing 8-5) opens with two calls to library function setlocale, but the calls are quite different. The first call has the empty string, hence non-NULL, as its second argument:
setlocale(LC_ALL, ""); /* set current locale for library functions */

The integer macro LC_ALL represents all of the locale categories, and the empty string represents the user's default locale, as specified by the environment. Because the second argument is a string, even though empty, this call to setlocale acts as a setter rather than a getter of information.

The immediately following call to the setlocale function acts as a getter:
char* prev_locale = setlocale(LC_ALL, NULL); /* with NULL as 2nd arg, a getter */

The program then uses the strdup function (string duplicate) to make an altogether separate copy of this string just in case there are further calls to setlocale. Note that setlocale returns a pointer to a string, not a copy of this string.

The program ends by resetting the locale to the saved_locale. The save/restore pattern is common in locale-aware programs.

In the middle, the localeBasics program calls the library function localeconv to get a pointer to a structure that contains the numeric and monetary formatting information for the current locale. This structure is displayed shortly. For now, the pointer loc is used to access the currency symbol, first for the United States and then for Great Britain. The output is
Currency symbol: $ /* default locale, en_US */
Currency symbol: £ /* en_GB */
At the end, the program resets the locale to the original one:
setlocale(LC_ALL, saved_locale); /* restored saved locale */
Recall that saved_locale is a string copy of the original locale and so not NULL. This call to setlocale is therefore a setter, which restores the locale back to the original setting.
struct lconv {
   char *decimal_point;
   char *thousands_sep;
   char *grouping;
   char *int_curr_symbol;
   char *currency_symbol;
   char *mon_decimal_point;
   char *mon_thousands_sep;
   char *mon_grouping;
   char *positive_sign;
   char *negative_sign;
   char int_frac_digits;
   char frac_digits;
   char p_cs_precedes;
   char p_sep_by_space;
   char n_cs_precedes;
   char n_sep_by_space;
   char p_sign_posn;
   char n_sign_posn;
};
Listing 8-6

The lconv structure with locale information

Numeric and monetary locale information is stored in a structure of type struct lconv (see Listing 8-6), and the library function localeconv returns a pointer to a typically static instance of this structure. The 18 fields shown contain locale-specific information. In Canada, for example, the decimal_point is the period symbol, whereas in Germany, the decimal_point is the comma symbol.
Table 8-1

Argument categories for setlocale

Category      Meaning
LC_ALL        All of the below
LC_COLLATE    String collation (sort order; also regex ranges)
LC_CTYPE      Character classification and conversion (also regex)
LC_MESSAGES   Localizable natural-language messages
LC_MONETARY   Currency formatting
LC_NUMERIC    Number formatting
LC_TIME       Time and date formatting

Locale information is extensive, and the connections among its pieces may not be evident. Accordingly, the information is divided into categories, with a macro to name each category (see Table 8-1); the fields in the lconv structure, for instance, fall under LC_NUMERIC and LC_MONETARY. The categories make it easier to set related pieces of locale information.

A typical call to function setlocale uses the LC_ALL category as the first argument:
setlocale(LC_ALL, ""); /* set all categories to default locale */
For fine-tuning, however, a specific category could be used instead as the first argument:
setlocale(LC_MONETARY, "en_GB.utf-8"); /* monetary category for Great Britain */
The next code example puts the LC_MONETARY category to use. The program first sets all locale categories (LC_ALL) to local settings. The program then resets LC_MONETARY only to get locale-specific currency information from six English-speaking regions around the world.
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
int main(void) {
  setlocale(LC_ALL, ""); /* set all categories to default locale */
  char* regions[ ] = {"en_AU.utf-8", "en_CA.utf-8", "en_GB.utf-8",
                      "en_US.utf-8", "en_NZ.utf-8", "en_ZM.utf-8", NULL};
  int i = 0;
  while (regions[i]) {
    setlocale(LC_MONETARY, regions[i]); /* change the locale */
    const struct lconv* loc = localeconv();
    printf("Region: %s  Currency symbol: %s  International currency symbol: %s\n",
           regions[i], loc->currency_symbol, loc->int_curr_symbol);
    i++;
  }
  return 0;
}
Listing 8-7

Using the category LC_MONETARY

The locMonetary program (see Listing 8-7) initializes the array regions to standard codes for six English-speaking regions around the world. For each of these regions, the LC_MONETARY category is set before the currency_symbol and the int_curr_symbol (international currency symbol) are printed in a while loop. The localeconv library function is called to get a pointer to the lconv structure that stores the desired information.
Region: en_AU.utf-8  Currency symbol: $  International currency symbol: AUD
Region: en_CA.utf-8  Currency symbol: $  International currency symbol: CAD
Region: en_GB.utf-8  Currency symbol: £  International currency symbol: GBP
Region: en_US.utf-8  Currency symbol: $  International currency symbol: USD
Region: en_NZ.utf-8  Currency symbol: $  International currency symbol: NZD
Region: en_ZM.utf-8  Currency symbol: K  International currency symbol: ZMK
Listing 8-8

Output from the locMonetary program

The output from the locMonetary program (see Listing 8-8) shows the region, currency symbol, and international currency acronym for the six regions.

8.5 C and WebAssembly

WebAssembly is a language well-suited for compute-bound tasks (e.g., number crunching) executed in a browser. All rumors to the contrary, the WebAssembly language is not meant to replace JavaScript, but rather to supplement JavaScript by providing better performance on CPU-intensive tasks that JavaScript otherwise might perform. JavaScript remains the glue that ties together HTML pages and WebAssembly modules:
HTML pages<--->JavaScript<--->WebAssembly modules

WebAssembly has an advantage over other web artifacts when it comes to downloading. For example, a browser fetches HTML pages, CSS stylesheets, and JavaScript code as text, an inefficiency that WebAssembly addresses: a WebAssembly module has a compact binary format, which speeds up downloading.

After a WebAssembly program is downloaded to a browser, the just-in-time (JIT) compiler in the browser’s virtual machine translates the binary WebAssembly code into fast, platform-specific machine code. Here is a summary depiction:
            download  +-------+ translate
wasm module---------->|browser|----------->fast machine code
                      +-------+

JavaScript code embedded in an HTML page can call functions delivered in WebAssembly modules.

WebAssembly has a development language known as the text format language, which has a Lisp-like syntax for writing programs on a virtual stack-based machine. However, code from higher-level programming languages (including C) can be translated into WebAssembly. Although the list of languages that can be translated into WebAssembly is growing, the original ones were C, C++, and Rust, three languages suited for systems programming and high-performance applications programming. These three languages share two features that promote fast execution: explicit data typing and no garbage collector.

When it comes to high-performance web code, WebAssembly is not the only game in town. For example, asm.js is a JavaScript dialect designed, like WebAssembly, to approach native speed. The asm.js dialect invites optimization because the code mimics the explicit data types in the three aforementioned languages. Here is an example with C and then asm.js. The sample function in C is
int f(int n) {  /** C **/
  return n + 1;
}
Both the parameter n and the returned value are explicitly typed as int. The equivalent function in asm.js would be
function f(n) { /** asm.js **/
  n = n | 0;
  return (n + 1) | 0;
}
JavaScript, in general, does not have explicit data types, but a bitwise-OR operation in JavaScript yields an integer value. This explains the otherwise pointless bitwise-OR operation:
n = n | 0; /* bitwise-OR of n and zero */

The bitwise-OR of n and zero evaluates to n, but the purpose here is to signal that n holds an integer value. The return statement repeats this optimizing trick. Among the JavaScript dialects, TypeScript stands out for adopting explicit data types, which makes this language attractive for compilation into WebAssembly.

Almost any discussion of the WebAssembly language covers near-native speed as one of the language’s major design goals. The native speed is that of the compiled systems languages C, C++, and Rust; hence, these three languages were also the originally designated candidates for compilation into WebAssembly.

8.5.1 A C into WebAssembly Example

A production-grade example would have WebAssembly code perform a heavy compute-bound task such as generating large cryptographic key pairs or using such pairs for encryption and decryption. A simpler example fits the bill as a stand-in that is easy to follow. There is number crunching, but of the routine sort.

Consider the function hstone (for hailstone), which takes a positive integer as an argument. The function is defined as follows:
               3N + 1 if N is odd
hstone(N) =
               N/2 if N is even

For example, hstone(12) returns 6, whereas hstone(11) returns 34. If N is odd, then 3N+1 is even; but if N is even, then N/2 could be either even (e.g., 4/2 = 2) or odd (e.g., 6/2 = 3).

The hstone function can be used iteratively by passing the returned value as the next argument. The result is a hailstone sequence such as this one, which starts with 24 as the original argument, the returned value 12 as the next argument, and so on:
24,12,6,3,10,5,16,8,4,2,1,4,2,1,...

It takes ten calls for the sequence to converge to 1, at which point the sequence of 4,2,1 repeats indefinitely: (3x1)+1 is 4, which is halved to yield 2, which is halved to yield 1, and so on. The Wikipedia page ( https://en.wikipedia.org/wiki/Collatz_conjecture ) goes into technical detail on the hailstone function, including a clarification of the name hailstone.

Note that powers of two (2^N) converge quickly to 1, requiring just N divisions by two to reach 1. For example, 32 (2^5) has a convergence length of five, and 64 (2^6) has a convergence length of six. A hailstone sequence converges to 1 if and only if the sequence generates a power of two. At issue, therefore, is whether a hailstone sequence inevitably generates a power of two.

The Collatz conjecture is that a hailstone sequence converges to 1 no matter what the initial argument N > 0 happens to be. No one has found a counterexample to the Collatz conjecture, nor has anyone come up with a proof to elevate the conjecture to a theorem. The conjecture, simple as it is to test with a program, remains a profoundly challenging problem in mathematics. My hstone example generates hailstone sequences and counts the number of steps required for a sequence to hit the first 1.

8.5.2 The Emscripten Toolchain

The systems languages, including C, require specialized toolchains to translate source code into WebAssembly. Emscripten is a pioneering and excellent option, one built upon the well-known LLVM (Low-Level Virtual Machine) compiler infrastructure. Emscripten can be installed following the instructions at https://emscripten.org/docs/getting_started/downloads.html .

To begin, consider this version of a C hstone program (see Listing 8-9) with two functions, the familiar entry point main and hstone, which main invokes repeatedly.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
int hstone(int n) {
  int len = 0;
  while (1) {
    if (1 == n) break; /* halt on 1 */
    if (0 == (n & 1)) n = n / 2; /* if n is even */
    else n = (3 * n) + 1; /* if n is odd */
    len++; /* increment counter */
  }
  return len;
}
#define HowMany 8
int main() {
  srand(time(NULL)); /* seed random number generator */
  int i;
  puts("  Num  Steps to 1");
  for (i = 0; i < HowMany; i++) {
    int num = rand() % 100 + 1; /* + 1 to avoid zero */
    printf("%4i %7i\n", num, hstone(num));
  }
  return 0;
}
Listing 8-9

The hstoneCL program with main

On a sample run, the hstoneCL program (with CL for command line) had this output:
Num   Steps to 1
64        6
40        8
86       30
16        4
30       18
47      104
12        9
60       19
The hstoneCL program can be webified, with no changes whatsoever to the source code, by using the Emscripten toolchain, which can do the following:
  • Compile the C source into a WebAssembly module.

  • Generate a test HTML page with calls to asm.js code that, in turn, invokes the hstone function through a call to main.

However, the WebAssembly module does not require the main function because JavaScript could invoke the hstone function directly. The hstone program can be simplified by dropping the main function in the hstoneCL version.

The hstoneWA revision (see Listing 8-10) drops main and adds the directive EMSCRIPTEN_KEEPALIVE to the hstone function. This directive informs the compiler that the C function named hstone should be exposed, under the same name, as a WebAssembly function.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <emscripten/emscripten.h>
int EMSCRIPTEN_KEEPALIVE hstone(int n) {
  int len = 0;
  while (1) {
    if (1 == n) break; /* halt on 1 */
    if (0 == (n & 1)) n = n / 2; /* if n is even */
    else n = (3 * n) + 1; /* if n is odd */
    len++; /* increment counter */
  }
  return len;
}
Listing 8-10

The revised hstone code

As noted earlier, the Emscripten toolchain can be used not only to compile C code into WebAssembly but also to generate an appropriate HTML page together with JavaScript glue that links the WebAssembly module with the HTML page. To understand the details, however, it is useful to generate only the WebAssembly module and to craft, by hand, the HTML page and some JavaScript test calls.

With the Emscripten toolchain installed, the C function hstone in the file hstoneWA.c can be compiled into WebAssembly from the command line as follows:
% emcc hstoneWA.c --no-entry -o hstone.wasm

The flag --no-entry indicates that the file hstoneWA.c does not contain the function main, and the -o flag stands for output: the resulting WebAssembly file is named hstone.wasm. On my desktop machine, this file is a trim 662 bytes in size.

For testing, the next requirement is an HTML page that, when downloaded to a browser, fetches the WebAssembly module. A production-grade version of the HTML page would include embedded JavaScript calls to appropriate WebAssembly functions. A handcrafted version of the HTML page reveals details that otherwise remain hidden. Here is an HTML page that downloads and prepares the WebAssembly module stored in the hstone.wasm file:
<!doctype html>
<html>
  <head>
    <meta charset="utf-8"/>
    <script>
      fetch('hstone.wasm').then(response =>     <!-- Line 1 -->
      response.arrayBuffer()                    <!-- Line 2 -->
      ).then(bytes =>                           <!-- Line 3 -->
      WebAssembly.instantiate(bytes, {imports: {}})    <!-- Line 4 -->
      ).then(results => {                       <!-- Line 5 -->
      window.hstone = results.instance.exports.hstone; <!-- Line 6 -->
});
    </script>
  </head>
  <body/>
</html>

The script element in the preceding HTML page can be clarified line by line. The fetch call in Line 1 uses the web Fetch API to get the WebAssembly module from the web server that hosts this HTML page. When the HTTP response arrives, the WebAssembly module does so as a sequence of bytes, which Line 2 stores in an ArrayBuffer. These bytes make up the WebAssembly module, the contents of the file hstone.wasm. This module has no imports from other WebAssembly modules, as indicated at the end of Line 4.

At the start of Line 4, the WebAssembly module is instantiated. A WebAssembly module is akin to a nonstatic class with nonstatic members in an object-oriented language such as Java. The module contains variables, functions, and various support artifacts; but the module must be instantiated to be called from JavaScript.

The script’s Line 6 exports the original C function hstone under the same name. This WebAssembly function is available now to any JavaScript code, as a session in the browser’s JavaScript console confirms. Here is part of my test session in Chrome’s JavaScript console:
> hstone(27)      ## invoke hstone by name
< 111             ## output
> hstone(7)       ## again
< 16              ## output

The outputs are the steps required to reach 1 from the input (e.g., hstone(27) requires 111 steps to reach 1).

WebAssembly now has a more concise API for fetching and instantiating a module: the WebAssembly.instantiateStreaming function reduces the preceding script to only the fetch and instantiate operations. The longer version shown previously has the benefit of exhibiting details, in particular the representation of a WebAssembly module as a byte array that gets instantiated as an object with exported functions.

Emscripten comes with a test server, which can be invoked as follows to host the handcrafted HTML file hstone.html and the WebAssembly file hstone.wasm:
% emrun --no_browser --port 7777 .

The flag --no_browser means that a user manually opens a browser such as Firefox or Chrome. The request URL from the browser is then localhost:7777/hstone.html. If all goes well, the browser’s JavaScript console can be used to confirm, as shown previously, that the WebAssembly module is available for use.

8.5.3 WebAssembly and Code Reuse

The EMSCRIPTEN_KEEPALIVE directive is the straightforward way to have the Emscripten compiler produce a WebAssembly module that exports any C function of interest to the JavaScript glue embedded in an HTML page. A customized HTML document, with whatever handcrafted JavaScript is appropriate, can call the functions exported from the WebAssembly module. Hats off to Emscripten for this clean approach.

Web programmers are unlikely to write WebAssembly in its own text format language, as compiling from some high-level language, such as C or Rust, is far too attractive an option. Compiler writers, by contrast, might find it productive to work at the fine-grained level that the text format language provides.

Much has been made of WebAssembly’s goal of achieving near-native speed. But as the JIT compilers for JavaScript continue to improve, and as dialects well-suited for optimization (e.g., TypeScript) emerge and evolve, it may be that JavaScript also achieves near-native speed. Would this imply that WebAssembly is wasted effort? I think not.

WebAssembly addresses another traditional goal in computing: code reuse. As even the short hstone example illustrates, code in a suitable language, such as C, translates readily into a WebAssembly module, which plays well with JavaScript code, the glue that connects a range of technologies used on the Web. WebAssembly is thus an inviting way to reuse legacy code and to broaden the use of new code. For example, a high-performance program for image processing, written originally as a desktop application, might also be useful in a web application. WebAssembly then becomes an attractive path to reuse. (For new web modules that are compute bound, WebAssembly is a sound choice.) My hunch is that WebAssembly will thrive as much for reuse as for performance.

8.6 Signals

A signal interrupts an executing program (process) to notify it of some exceptional event:
                                 interrupt +---------+
signal from outside the program----------->| process |
              /                            +---------+
  e.g., Control-C from the keyboard

Signals have integer values as identifiers, with symbolic constants such as SIGKILL for ease of reference. When interrupted through a signal, a process may be able to ignore the interruption or else handle it in some program-appropriate way. However, some signals cannot be ignored, in particular SIGKILL (terminate) and SIGSTOP (pause).

Operating system routines regularly use signals to notify a process of an exceptional condition. For example, if a process runs out of memory, an operating system routine uses a signal as notification. Programs designed to handle signals typically do so in one of two ways:
  • The program requests that the signal be ignored. Recall the basicFork program (see Listing 7-1), which included this call to the signal function:

signal(SIGCHLD, SIG_IGN); /** prevent child from becoming a permanent zombie **/
The call requests that the SIGCHLD signal, which the system sends to a parent process when a child terminates, be ignored. The motive is to prevent the child from becoming a permanent zombie process: with SIGCHLD ignored, the system does not keep a terminated child around waiting for the parent to collect its exit status.
  • The program provides a signal handler as a callback function automatically invoked when a specified signal occurs. For example, the SIGINT (interrupt) signal can be sent to a process by hitting Control-C in the terminal window from which the program is launched. Perhaps a user hits Control-C by accident: the program might handle the signal by asking the user to confirm that the running program should be stopped.

At the core of the signal library is the legacy signal function, but best practice now favors the newer sigaction function. The signal function may behave differently across platforms and even operating system versions. The forthcoming code example uses the better-behaved sigaction function, introduced as a POSIX replacement for signal.
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#define MaxLoops 500
void cntrlC_handler(int signum) { /** callback function: int arg, void return **/
  fprintf(stderr, "\n\tHandling signal %i\n", signum);
  int ans = 1;
  printf("Sure you want to exit (1 = yes, 0 = no)? ");
  scanf("%i", &ans);
  if (1 == ans) exit(EXIT_SUCCESS);
}
int main() {
  /** Set up a signal handler. **/
  struct sigaction current;
  sigemptyset(&current.sa_mask);       /* clear the signal set */
  current.sa_flags = 0;                /* enables setting sa_handler, not sa_sigaction */
  current.sa_handler = cntrlC_handler; /* specify a handler */
  sigaction(SIGINT, &current, NULL);   /* control-C is a SIGINT */
  int i;
  for (i = 0; i < MaxLoops; i++) {
    printf("Counting sheep %i...\n", i + 1);
    sleep(1);
  }
  return 0;
}
Listing 8-11

A signal-handling program

The signals program (see Listing 8-11) introduces the basic signal API. Here is an overview of how the program handles SIGINT and why the program does so:
  • The main function has a tiresome loop that prints integer values 1 through MaxLoops, currently set at 500. After printing each value, the program sleeps for a second. A user will be inclined to terminate this program from the command line with a Control-C.

  • At the start of main, a signal handler is registered for SIGINT, which a Control-C from the keyboard can generate. A program’s default response to a SIGINT is termination.

  • The signal handler cntrlC_handler can have any name but should return void and take a single int argument, which is the signal number. (The integer value for SIGINT happens to be 2.) This signal handler prompts the user for confirmation: if the user confirms, the program exits; otherwise, the program continues as before.

To register a signal handler using the sigaction function, a program first uses an instance of the struct sigaction type to set relevant information. In this example, the signal set in the sa_mask field is emptied first: the field’s address is passed to the library function sigemptyset. The sa_mask set names any additional signals to be blocked while the handler runs. Because current is an uninitialized local variable, emptying the set ensures that no signals are blocked by accident.

Two different callback types can be registered with the sigaction function: one takes a single argument (the signal number), and the other takes three arguments (the signal number and pointers to two different structures that contain pertinent information about the current process state with respect to signals). The initialization
current.sa_flags = 0; /* current is a struct sigaction instance */
is a setup for using the simpler of the two callbacks:
current.sa_handler = cntrlC_handler; /* cntrlC_handler is the 1-argument callback */

If the three-argument callback were used instead, it would be assigned to the sa_sigaction field, and the sa_flags field would include the SA_SIGINFO flag to request the extra signal information.

The sigaction function, which sets the desired signal-handling action, takes three arguments:
sigaction(SIGINT, &current, NULL);
The first argument is the signal number, in this case SIGINT. The second argument is a pointer to the new signal-handling action, and the last argument is a pointer to the previous action, which can be saved with a non-NULL pointer for later retrieval. In this example, the old action is not saved: the third argument is NULL. Each action is specified by setting a field in an instance of the struct sigaction type.
% ./signals                              ## on Windows, drop ./
Counting sheep 1...
Counting sheep 2...
^C                                         ## 1st Control-C
        Handling signal 2
Sure you want to exit (1 = yes, 0 = no)? 0 ## resume execution
Counting sheep 3...
Counting sheep 4...
^C                                         ## 2nd Control-C
        Handling signal 2
Sure you want to exit (1 = yes, 0 = no)? 1 ## terminate
%
Listing 8-12

A sample run of the signals program

A sample run (see Listing 8-12) of the signals program confirms that the signal handling works as expected. As the loop starts, there is a Control-C from the user, and then a user response of 0, which means continue. The looping thus goes on. After a second Control-C and a user response of 1, which means terminate, the program ends.

Signals are a powerful, widely used mechanism not only for user/program interaction but also for interprocess communication. For example, the kill function
int kill(pid_t pid, int signum)

can be used by one process to terminate another process or group of processes. If the first argument to function kill is greater than zero, this argument is treated as the pid of the targeted process; if the argument is zero, the argument identifies the group of processes to which the signal sender belongs. The graceful shutdown of a multiprocessing application such as a web server could be accomplished by killing a group of processes. The second argument to kill is either a standard signal number (e.g., SIGTERM terminates a process but can be blocked, whereas SIGKILL terminates a process and cannot be blocked) or 0, which makes the call to kill a query about whether the pid in the first argument is indeed valid.

The older signal function is still used widely and dominates in legacy code. It is worth repeating that the sigaction replacement is the preferred way forward.

8.7 Software Libraries

Software libraries are a long-standing, easy, and sensible way to reuse code and to extend C by providing new functionalities. This section explains how to build such libraries from scratch and to make them easily available to clients. Although the two sample libraries target Linux, the steps for creating, publishing, and using these libraries apply in essentials to other Unix-like systems.

There are two sample clients (one in C, the other in Python) to access the libraries. It is no surprise that a C client can access a library written in C, but the Python client underscores that a library written in C can serve clients from other languages.

Computer systems in general and Linux in particular have two types of library:
  • A static library (library archive) is baked into a statically compiled client (e.g., one in C or Rust) during the link phase of the compilation process. In effect, each client gets its own copy of the library. A significant downside of a static library comes to the fore if the library needs revision, for example, to fix a bug—each library client now must be linked to the revised static library. A dynamic library, described next, avoids this shortcoming.

  • A dynamic (shared) library is flagged during the link phase of a statically compiled client program, but the client program and the library code remain otherwise unconnected until runtime—the library code is not baked into the client. At runtime, the system’s dynamic loader connects a shared library with an executing client, regardless of whether the client comes from a statically compiled language such as C or a dynamically compiled language such as Python. As a result, a dynamic library can be updated without thereby inconveniencing clients. Finally, multiple clients can share a single copy of a dynamic library.

In general, dynamic libraries are preferred over static ones, although there is a cost in complexity and performance. Here is a first look at how a library of either type is created and published:
  1. The source code for the library is compiled into one or more object modules, which can be packaged as a library and linked to executable clients.

  2. The object modules are packaged into a single file. For a static library, the standard extension is .a for “archive.” For a dynamic library, the extension is .so for “shared object.” The two sample libraries, which have the same functionality, are published as the files libprimes.a (static) and libshprimes.so (dynamic). The prefix lib is standard for both types of library.

  3. The library file is copied to a standard directory so that client programs, without fuss, can access the library. A typical location for the library, whether static or dynamic, is /usr/lib or /usr/local/lib; other locations are possible.

Detailed steps for building and publishing each type of library are coming shortly. First, however, the C functions in the two libraries should be introduced.

8.7.1 The Library Functions

The two sample libraries are built from the same five C functions, four of which are extern and, therefore, accessible to client programs. The fifth function, which is a utility for one of the other four, is static and thus accessible only to the four extern functions defined in the same file. The library functions are elementary and deal, in various ways, with prime numbers. All of the functions expect unsigned (nonnegative) integer values as arguments:
  • The is_prime function tests whether its single argument is a prime.

  • The are_coprimes function checks whether its two arguments have a greatest common divisor (gcd) of 1, which defines co-primes.

  • The prime_factors function lists the prime factors of its argument.

  • The goldbach function expects an even integer value of 4 or more, listing whichever two primes sum to this argument; there may be multiple summing pairs. The function is named after the 18th-century mathematician Christian Goldbach, whose conjecture that every even integer greater than two is the sum of two primes remains one of the oldest unsolved problems in number theory.

The static utility function gcd, which the are_coprimes function invokes, resides in the deployed library files, but this function is not accessible outside of its containing file; hence, a library client cannot directly invoke the gcd function.

8.7.2 Library Source Code and Header File

The header file primes.h provides declarations for the four extern functions in each library. Such a header file also serves as input for utilities (e.g., the Rust bindgen utility) that enable clients in other languages to access a C library. Here is the primes.h header file:
/** header file primes.h: function declarations **/
extern unsigned is_prime(unsigned);
extern void prime_factors(unsigned);
extern unsigned are_coprimes(unsigned, unsigned);
extern void goldbach(unsigned);
As usual, these declarations serve as an interface by specifying the invocation syntax for each function. For client convenience, the text file primes.h could be stored in a directory on the C compiler’s search path. Typical locations are /usr/include and /usr/local/include.
#include <stdio.h>
#include <math.h>
extern unsigned is_prime(unsigned n) {
  if (n <= 3) return n > 1;            /* 2 and 3 are prime */
  if (0 == (n % 2) || 0 == (n % 3)) return 0; /* multiples of 2 or 3 aren't */
  /* check that n is not a multiple of other values < n */
  unsigned i;
  for (i = 5; (i * i) <= n; i += 6)
    if (0 == (n % i) || 0 == (n % (i + 2))) return 0; /* not prime */
  return 1; /* a prime other than 2 or 3 */
}
extern void prime_factors(unsigned n) {
  if (0 == n) return; /* zero has no prime factorization and would loop forever below */
  /* list 2s in n's prime factorization */
  while (0 == (n % 2)) {
    printf("%i ", 2);
    n /= 2;
  }
  /* 2s are done, the divisor is now odd */
  unsigned i;
  for (i = 3; i <= sqrt(n); i += 2) {
    while (0 == (n % i)) {
      printf("%u ", i);
      n /= i;
    }
  }
  /* one more prime factor? */
  if (n > 2) printf("%u", n);
}
/* utility function: greatest common divisor */
static unsigned gcd(unsigned n1, unsigned n2) {
  while (n1 != 0) {
    unsigned n3 = n1;
    n1 = n2 % n1;
    n2 = n3;
  }
  return n2;
}
extern unsigned are_coprimes(unsigned n1, unsigned n2) {
  return 1 == gcd(n1, n2);
}
extern void goldbach(unsigned n) {
  /* input errors */
  if ((n <= 2) || ((n & 0x01) > 0)) {
    printf("Number must be > 2 and even: %u is not.\n", n);
    return;
  }
  /* two simple cases: 4 and 6 */
  if ((4 == n) || (6 == n)) {
    printf("%u = %u + %u\n", n, n / 2, n / 2);
    return;
  }
  /* for n >= 8: multiple possibilities for many */
  unsigned i;
  for (i = 3; i <= (n / 2); i++) { /* <= so that a pair of equal primes (e.g., 5 + 5) is listed */
    if (is_prime(i) && is_prime(n - i)) {
      printf("%u = %u + %u\n", n, i, n - i);
      /* if one pair is enough, replace this with break */
    }
  }
}
Listing 8-13

The library functions

The five functions (see Listing 8-13) serve as grist for the library mill. The two libraries derive from exactly the same source code, and the header file primes.h is the C interface for both libraries.

8.7.3 Steps for Building the Libraries

The steps for building and then publishing a static and a dynamic library differ in a few details. Only three steps are required for the static library and just two more for the dynamic library. The additional steps in building the dynamic library reflect the added flexibility of the dynamic approach.

The library source file primes.c is compiled into an object module. Here is the command, with the percent sign again as the system prompt and with double sharp signs to introduce my comments:
% gcc -c primes.c ## step 1 static
This produces the binary file primes.o, the object module. The flag -c means compile only. The next step is to archive the object module(s) by using the Linux ar utility:
% ar -cvq libprimes.a primes.o ## step 2 static

The three flags -cvq are short for “create,” “verbose,” and “quick append” in case new files must be added to an archive. The prefix lib is standard, but the library name is arbitrary. Of course, the file name for a library must be unique to avoid conflicts.

The archive is ready to be published:
% sudo cp libprimes.a /usr/local/lib ## step 3 static

The static library is now accessible to clients, examples of which are forthcoming. (The sudo is included to ensure the correct access rights for copying a file into /usr/local/lib.)

The dynamic library also requires one or more object modules for packaging:
% gcc primes.c -c -fpic ## step 1 dynamic

The added flag -fpic directs the compiler to generate position-independent code, which is a binary module that need not be loaded into a fixed memory location. Such flexibility is critical in a system of multiple dynamic libraries. The resulting object module is slightly larger than the one generated for the static library.

Here is the command to create the single library file from the object module(s):
% gcc -shared -Wl,-soname,libshprimes.so -o libshprimes.so.1 primes.o ## step 2 dynamic

The flag -shared indicates that the library is shared (dynamic) rather than static. The -Wl flag introduces a comma-separated list of options that the compiler passes along to the linker, the first of which sets the dynamic library’s soname. The soname specifies the library’s logical name (libshprimes.so), whereas the -o flag specifies the library’s physical file name (libshprimes.so.1). The goal is to keep the logical name constant while allowing the physical file name to change with new versions. In this example, the 1 at the end of the physical file name libshprimes.so.1 represents the first version of the library. The logical and physical file names could be the same, but best practice is to have separate names. A client accesses the library through its logical name (in this case, libshprimes.so), as clarified shortly.

The next step is to make the shared library easily accessible to clients by copying it to the appropriate directory, for example, /usr/local/lib again:
% sudo cp libshprimes.so.1 /usr/local/lib ## step 3 dynamic
A symbolic link is now set up between the shared library’s logical name (libshprimes.so) and its full physical file name (/usr/local/lib/libshprimes.so.1). Here is the command with /usr/local/lib as the working directory:
% sudo ln --symbolic libshprimes.so.1 libshprimes.so ## step 4 dynamic

The logical name libshprimes.so should not change, but the target of the symbolic link (libshprimes.so.1) can be updated as needed for new library implementations that fix bugs, boost performance, and so on.

The final step (a precautionary one) is to invoke the ldconfig utility, which configures the system’s dynamic loader. This configuration ensures that the loader will find the newly published library:
% sudo ldconfig ## step 5 dynamic

The dynamic library is now ready for clients, including the two sample ones that follow.

8.7.4 A Sample C Client

The sample C client is the program tester, whose source code begins with two #include directives:
#include <stdio.h>   /* standard input/output functions */
#include <primes.h>  /* my library functions */

Both header files are to be found on the compiler’s search path (in the case of primes.h, the directory /usr/local/include). Without this #include, the compiler would complain as usual about missing declarations for functions such as is_prime and prime_factors. By the way, the source code for the tester program need not change at all to test each of the two libraries.

By contrast, the source file for the library (primes.c) opens with these #include directives:
#include <stdio.h>
#include <math.h>

The header file math.h is required because the library function prime_factors calls the mathematics function sqrt from the standard library libm.so.

For reference, Listing 8-14 is the source code for the tester program.
#include <stdio.h>
#include <primes.h>
int main() {
  /* is_prime */
  printf("\nis_prime\n");
  unsigned i, count = 0, n = 1000;
  for (i = 1; i <= n; i++) {
    if (is_prime(i)) {
      count++;
      if (1 == (i % 100)) printf("Sample prime ending in 1: %u\n", i);
    }
  }
  printf("%u primes in range of 1 to a thousand.\n", count);
  /* prime_factors */
  printf("\nprime_factors\n");
  printf("prime factors of 12: ");
  prime_factors(12);
  printf("\n");
  printf("prime factors of 13: ");
  prime_factors(13);
  printf("\n");
  printf("prime factors of 876,512,779: ");
  prime_factors(876512779);
  printf("\n");
  /* are_coprimes */
  printf("\nare_coprime\n");
  printf("Are %i and %i coprime? %s\n",
         21, 22, are_coprimes(21, 22) ? "yes" : "no");
  printf("Are %i and %i coprime? %s\n",
         21, 24, are_coprimes(21, 24) ? "yes" : "no");
  /* goldbach */
  printf("\ngoldbach\n");
  goldbach(11);  /* error */
  goldbach(4);   /* small one */
  goldbach(6);   /* another */
  for (i = 100; i <= 150; i += 2) goldbach(i);
  return 0;
}
Listing 8-14

A sample C client

In compiling tester.c into an executable, the tricky part is the order of the link flags. Recall that the two sample libraries begin with the prefix lib and each has the usual extension: .a for the static library libprimes.a and .so for the dynamic library libshprimes.so. In a link specification, the prefix lib and the extension fall away. A link flag begins with -l (lowercase L), and a compilation command may contain arbitrarily many link flags. Here is the full compilation command for the tester program, using the dynamic library as the example:
% gcc -o tester tester.c -lshprimes -lm

The first link flag identifies the library libshprimes.so, and the second link flag identifies the standard mathematics library libm.so.

The linker is lazy, which means that the order of the link flags matters. For example, reversing the order of the link specifications generates an error during the link phase of the compilation:
% gcc -o tester tester.c -lm -lshprimes ## DANGER!
The flag that links to libm.so comes first, but no function from this library is invoked explicitly in the tester program; hence, the linker does not link to the libm.so library. The call to the sqrt library function occurs only in the prime_factors function from the libshprimes.so library. The resulting error in compiling the tester program is
primes.c: undefined reference to 'sqrt'
Accordingly, the order of the link flags should notify the linker that the sqrt function is needed:
% gcc -o tester tester.c -lshprimes -lm ## -lshprimes 1st

The linker picks up the call to the library function sqrt in the libshprimes.so library and, therefore, does the appropriate link to the mathematics library libm.so. There is a more complicated option for linking that supports either link-flag order; in this case, however, it is just as easy to arrange the link flags appropriately.

Here is some output from a run of the tester client:
is_prime
Sample prime ending in 1: 101
Sample prime ending in 1: 401
...
168 primes in range of 1 to a thousand.
prime_factors
prime factors of 12: 2 2 3
prime factors of 13: 13
prime factors of 876,512,779: 211 4154089
are_coprime
Are 21 and 22 coprime? yes
Are 21 and 24 coprime? no
goldbach
Number must be > 2 and even: 11 is not.
4 = 2 + 2
6 = 3 + 3
...
32 = 3 + 29
32 = 13 + 19
...
100 = 3 + 97
100 = 11 + 89
...

For the goldbach function, even a relatively small even value (e.g., 18) may have multiple pairs of primes that sum to it (in this case, 5 + 13 and 7 + 11). Such multiple prime pairs are among the factors that complicate an attempted proof of Goldbach’s conjecture.

8.7.5 A Sample Python Client

Python, unlike C, is not a statically compiled language, which means that the sample Python client must access the dynamic rather than the static version of the primes library. To do so, Python has various modules (standard and third party) that support a foreign function interface (FFI), which allows a program written in one language to invoke functions written in another. Python ctypes is a standard and relatively simple FFI that enables Python code to call C functions.

Any FFI has challenges because the interfacing languages are unlikely to have exactly the same data types. For example, the primes library uses the C type unsigned int, which Python does not have; the ctypes FFI maps a C unsigned int to a Python int. Of the four extern C functions published in the primes library, two behave better in Python with explicit ctypes configuration.

The C functions prime_factors and goldbach are declared with void in place of a return type, but ctypes by default assumes that a C function returns an int. When called from Python code, the two C functions then appear to return an arbitrary (hence, meaningless) integer value. However, ctypes can be configured to have the functions return None (Python’s null value) instead. Here is the configuration for the prime_factors function:
primes.prime_factors.restype = None

A similar statement handles the goldbach function.

The following interactive session (in Python3) shows that the interface between a Python client and the primes library is straightforward:
>>> from ctypes import cdll
>>> primes = cdll.LoadLibrary("libshprimes.so") ## logical name
>>> primes.is_prime(13)
1
>>> primes.is_prime(12)
0
>>> primes.are_coprimes(8, 24)
0
>>> primes.are_coprimes(8, 25)
1
>>> primes.prime_factors.restype = None
>>> primes.goldbach.restype = None
>>> primes.prime_factors(72)
2 2 2 3 3
>>> primes.goldbach(32)
32 = 3 + 29
32 = 13 + 19

The functions in the primes library use only a simple data type, unsigned int. If this C library used complicated types such as structures, and if pointers to structures were passed to and returned from library functions, then an FFI more powerful than ctypes might be better for a smooth interface between Python and C. Nonetheless, the ctypes example shows that a Python client can use a library written in C. Indeed, the popular NumPy library for scientific computing is written largely in C and then exposed through a high-level Python API.

8.8 What’s Next?

This is a small book about a big language—not big in size, but in its impact throughout computing. C is a very small language with easy access to an expanse of standard and third-party libraries. As the libraries get better, C gets better.

C has quirks and presents challenges. Perhaps the greatest challenge is memory leakage: heap storage that the program either allocates explicitly or obtains indirectly through library functions must be freed explicitly, and it is easy to allocate—and then forget to deallocate. Better APIs and tools such as valgrind ( https://valgrind.org ) address this challenge. The OpenSSL API illustrates best practices: the API includes a family of free functions that do whatever nested deallocation might be required. C brings the programmer close to the machine, an intimacy that requires particular discipline in code that uses dynamic storage.

Despite its age, C has the look and feel of a modern language with an emphatic separation of concerns: an interface describes, in particular, the invocation syntax of functions; an implementation defines the functions by providing the appropriate operational detail. Once published, an interface should remain unchanged, as it represents a contract with programmers; by contrast, an implementation can change to fix bugs, boost performance, and so on.

The standard C library functions are minimalist in design and, therefore, a guide for programmers. Recall the write function, which requires three arguments: where to write, what to write, and how many bytes to write. There are no formatting flags or data-type specifications. If these are needed, there are higher-level I/O functions at hand.

C can interact with virtually every other programming language. Is it nonetheless possible that C might lose its role as the lingua franca in programming? What would replace C? Its position as the dominant systems language, but one suited for applications as well, makes C the natural language to play this role. Are the standard system libraries, let alone the operating system kernel, to be rewritten in some other language? C combines two features that make it an ideal systems language: C has a high-level syntax that promotes the writing of clear, modular code; but C remains close to the metal, which promotes efficiency.

What, then, is next? The code examples are available from GitHub ( https://github.com/mkalin/cbook.git ). They are short enough to explore, to tweak, and to improve.
