There is more to comparing strings than strcmp() or even strncmp(). Linux provides several general string-matching functions that make your programming tasks simpler. We start with the simple tasks and then cover the more complex ones.
Chapter 14 explains how to glob file names using the glob() function, but people used to globbing capabilities sometimes wish to apply them to other sorts of strings. The fnmatch() function allows you to apply globbing rules to arbitrary strings:

    #include <fnmatch.h>

    int fnmatch(const char *pattern, const char *string, int flags);

The pattern is a standard glob expression with four special characters, modified by the flags argument:
| * | Matches any string, including an empty one. |
| ? | Matches exactly one character, any character. |
| [ | Starts a list of characters to match, or, if the next character is ! (the GNU C library also accepts ^), a list of characters not to match. The list ends at the closing ]. |
| \ | Causes the next character to be interpreted literally instead of as a special character. |
The flags argument affects some details of the glob, and is mostly there to be useful for globbing against file names. If you are not globbing file names, you probably want to set flags to 0.
FNM_NOESCAPE
Treat \ as an ordinary character, not a special character.
FNM_PATHNAME
Do not match / characters in string with a *, ?, or even a [/] sequence in pattern; match it only with a literal, nonspecial /.
FNM_PERIOD
A leading . character in string is matched only by a literal . character in pattern, never by *, ?, or a bracketed list. A . character is leading if it is the first character in string, or if FNM_PATHNAME is set and the . character in string directly follows a /.
fnmatch() returns zero if the pattern matches the string, FNM_NOMATCH if the pattern does not match the string, or some other unspecified value if an error occurs.
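The three-way return value and the effect of the flags can be seen in a short sketch. The helper name glob_matches() is our own invention for illustration, not part of any library; it only folds the fnmatch() return value into 1, 0, or -1.

```c
#include <fnmatch.h>
#include <stdio.h>

/* Hypothetical helper (not a library function): returns 1 if string
 * matches pattern, 0 if it does not, and -1 if fnmatch() reports an
 * error. */
int glob_matches(const char *pattern, const char *string, int flags)
{
    int rc = fnmatch(pattern, string, flags);

    if (rc == 0)
        return 1;
    if (rc == FNM_NOMATCH)
        return 0;

    /* any other value is an unspecified error */
    fprintf(stderr, "fnmatch: error matching pattern %s\n", pattern);
    return -1;
}
```

With flags set to 0, glob_matches("*.c", "src/match.c", 0) reports a match because * is free to match the /. Passing FNM_PATHNAME makes the same call fail, and glob_matches("*", ".profile", FNM_PERIOD) fails because the leading . must be matched literally.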
An example of using fnmatch() is provided in the example program on pages 315-317 in Chapter 14, where it is used as part of a simple reimplementation of the find command.
Regular expressions, as used in sed, awk, grep, vi, and countless other Unix programs through the years, have become a major part of the Unix programming environment. They are also available for use within C programs. This section explains how to use them and then presents a simple file parser using these functions.
Regular expressions have two flavors: basic regular expressions (BREs) and extended regular expressions (EREs). They correspond (roughly) to the grep and egrep commands. Both forms of regular expressions are explained in the grep man page, in the POSIX.2 standard [IEEE, 1993], in A Practical Guide to Red Hat Linux 8 [Sobell, 2002], and in other places, so we do not describe their syntax here, only the function interface that allows you to use regular expressions from within your programs.
POSIX specifies four functions to provide regular expression handling:

    #include <regex.h>

    int regcomp(regex_t *preg, const char *regex, int cflags);
    int regexec(const regex_t *preg, const char *string, size_t nmatch,
                regmatch_t pmatch[], int eflags);
    void regfree(regex_t *preg);
    size_t regerror(int errcode, const regex_t *preg, char *errbuf,
                    size_t errbuf_size);

Before you can compare a string to a regular expression, you need to compile it with the regcomp() function. The regex_t *preg holds all the state for the regular expression. You need one regex_t for each regular expression that you wish to have available concurrently. The regex_t structure has only one member on which you should rely: re_nsub, which specifies the number of parenthesized subexpressions in the regular expression. Consider the rest of the structure opaque.
The cflags argument determines many things about how the regular expression regex is interpreted. It may be zero, or it may be the bitwise OR of any of the following four items:
REG_EXTENDED
If set, use ERE syntax instead of BRE syntax.
REG_ICASE
If set, do not differentiate between upper- and lowercase.
REG_NOSUB
If set, do not keep track of substrings. The regexec() function then ignores the nmatch and pmatch arguments.
REG_NEWLINE
If REG_NEWLINE is not set, the newline character is treated essentially the same as any other character. The ^ and $ characters match only the beginning and end of the entire string, not adjacent newline characters. If REG_NEWLINE is set, you get the same behavior as you do with grep, sed, and other standard system tools; ^ anchors both to the beginning of a string and to the character after a newline (technically, it matches zero-length strings following a newline character); $ anchors to the end of the string and to newline characters (technically, it matches a zero-length string preceding the newline character); and . does not match a newline character.
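The differences among these flags are easiest to see by compiling the same expression with and without them. The following is a minimal sketch; regex_matches() is a hypothetical helper of our own, and it always adds REG_NOSUB because it reports only whether a match occurred.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): compiles regex with
 * the given cflags and tests string against it. Returns 1 on a match,
 * 0 on no match, and -1 if compilation or matching fails. */
int regex_matches(const char *regex, const char *string, int cflags)
{
    regex_t re;
    int rc;

    /* REG_NOSUB: only a yes/no answer is needed here */
    if (regcomp(&re, regex, cflags | REG_NOSUB))
        return -1;

    rc = regexec(&re, string, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

For the string "one\ntwo\n", the expression ^two$ does not match when cflags is 0, because ^ and $ anchor only to the ends of the whole string; it matches once REG_NEWLINE is added. Conversely, one.two matches "one\ntwo" only while REG_NEWLINE is not set.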
A typical invocation looks like this:

    if ((rerr = regcomp(&p, "(^(.*[^\\])#.*$)|(^[^#]+$)",
                        REG_EXTENDED|REG_NEWLINE))) {
        /* compilation failed: the expression is badly formed or
         * memory ran out (REG_NOMATCH comes only from regexec()) */
    }

This ERE finds lines of a file that are not commented out, or that are, at most, partially commented out, by # characters not prefixed with \ characters. This kind of regular expression might be useful as part of a simple parser for an application's configuration file.
Even if you are compiling an expression that you know is good, you should still check for errors. regcomp() returns zero for a successful compilation and a nonzero error code for an error. Most errors involve invalid regular expressions of one sort or another, but another possible error is running out of memory. See page 562 for a description of the regerror() function.

    #include <regex.h>

    int regexec(const regex_t *preg, const char *string, size_t nmatch,
                regmatch_t pmatch[], int eflags);

The regexec() function tests a string against a compiled regular expression. The eflags argument may be zero, or it may be the bitwise OR of any of the following symbols:
REG_NOTBOL
If set, the first character of the string does not match a ^ character. Any character following a newline character still matches ^ as long as REG_NEWLINE was set in the call to regcomp().
REG_NOTEOL
If set, the final character of the string does not match a $ character. Any character preceding a newline character still matches $ as long as REG_NEWLINE was set in the call to regcomp().
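One common use for REG_NOTBOL is scanning a buffer in pieces, where the start of each later piece is not really the start of a line. The following is a rough sketch of that idea; count_matches() is a hypothetical helper of our own that counts non-overlapping matches by calling regexec() again on the remainder of the string after each match.

```c
#include <regex.h>

/* Hypothetical helper (not a library function): counts the
 * non-overlapping matches of regex in string. Returns -1 if the
 * expression does not compile. */
int count_matches(const char *regex, const char *string, int cflags)
{
    regex_t re;
    regmatch_t m;
    const char *p = string;
    int eflags = 0;
    int count = 0;

    /* the match offsets are needed, so REG_NOSUB must be off */
    if (regcomp(&re, regex, cflags & ~REG_NOSUB))
        return -1;

    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == 0) {
            /* empty match at p: step one byte to make progress */
            if (*p == '\0')
                break;
            p++;
        } else {
            p += m.rm_eo;       /* resume just past this match */
        }
        /* p no longer points at the true start of the string */
        eflags = REG_NOTBOL;
    }

    regfree(&re);
    return count;
}
```

Without REG_NOTBOL on the later calls, count_matches("^ab", "ababab", 0) would report three matches, because every remainder would look like the beginning of the string; with it, only the first one counts. This sketch does not try to handle a remainder that begins immediately after a newline.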
An array of regmatch_t structures is used to represent the location of subexpressions in the regular expression:

    #include <regex.h>

    typedef struct {
        regoff_t rm_so; /* byte index within string of start of match */
        regoff_t rm_eo; /* byte index within string just past end of match */
    } regmatch_t;

The first regmatch_t element describes the entire string that was matched; note that any newline, including a trailing newline, is included in this entire string, regardless of whether REG_NEWLINE is set or not.
The following array elements describe the parenthesized subexpressions, ordered by the location of each opening parenthesis in the regular expression. (In C code, element i is equivalent to the replacement expression \i in sed or awk.) Subexpressions that do not match have a value of -1 in their regmatch_t.rm_so member.
This code matches a string against a regular expression with subexpressions, and prints out all the subexpressions that match:

     1: /* match.c */
     2:
     3: #include <alloca.h>
     4: #include <sys/types.h>
     5: #include <regex.h>
     6: #include <stdlib.h>
     7: #include <string.h>
     8: #include <stdio.h>
     9:
    10: void do_regerror(int errcode, const regex_t *preg) {
    11:     char *errbuf;
    12:     size_t errbuf_size;
    13:
    14:     errbuf_size = regerror(errcode, preg, NULL, 0);
    15:     errbuf = alloca(errbuf_size);
    16:     if (!errbuf) {
    17:         perror("alloca");
    18:         return;
    19:     }
    20:
    21:     regerror(errcode, preg, errbuf, errbuf_size);
    22:     fprintf(stderr, "%s\n", errbuf);
    23: }
    24:
    25: int main() {
    26:
    27:     regex_t p;
    28:     regmatch_t *pmatch;
    29:     int rerr;
    30:     char *regex = "(^(.*[^\\])#.*$)|(^[^#]+$)";
    31:     char string[BUFSIZ+1];
    32:     int i;
    33:
    34:     if ((rerr = regcomp(&p, regex, REG_EXTENDED | REG_NEWLINE))) {
    35:         do_regerror(rerr, &p);
    36:         exit(1);    /* p cannot be used after a failed compile */
    37:     }
    38:
    39:     pmatch = alloca(sizeof(regmatch_t) * (p.re_nsub+1));
    40:     if (!pmatch) {
    41:         perror("alloca");
    42:         exit(1);
    43:     }
    44:
    45:     printf("Enter a string: ");
    46:     if (!fgets(string, sizeof(string), stdin)) {
    47:         regfree(&p);
    48:         exit(1);    /* end of file or read error */
    49:     }
    50:
    51:     if ((rerr = regexec(&p, string, p.re_nsub+1, pmatch, 0))) {
    52:         if (rerr == REG_NOMATCH) {
    53:             /* regerror can handle this case, but in most cases
    54:              * it is handled specially
    55:              */
    56:             printf("String did not match %s\n", regex);
    57:         } else {
    58:             do_regerror(rerr, &p);
    59:         }
    60:     } else {
    61:         /* match succeeded */
    62:         printf("String matched regular expression %s\n", regex);
    63:         for (i = 0; i <= (int) p.re_nsub; i++) {
    64:             /* print the matching portion(s) of the string */
    65:             if (pmatch[i].rm_so != -1) {
    66:                 char *submatch;
    67:                 size_t matchlen = pmatch[i].rm_eo - pmatch[i].rm_so;
    68:                 submatch = malloc(matchlen+1);
    69:                 if (!submatch) {
    70:                     perror("malloc");
    71:                     exit(1);
    72:                 }
    73:                 strncpy(submatch, string+pmatch[i].rm_so,
    74:                         matchlen);
    75:                 submatch[matchlen] = '\0';
    76:                 printf("matched subexpression %d: %s\n", i,
    77:                        submatch);
    78:                 free(submatch);
    79:             } else {
    80:                 printf("no match for subexpression %d\n", i);
    81:             }
    82:         }
    83:     }
    84:     regfree(&p);    /* release the compiled expression */
    85:     exit(0);
    86: }

In the sample regular expression given in match.c, there are three subexpressions: The first is an entire line containing text followed by a comment character, the second is the text in that line that precedes the comment character, and the third is an entire line containing no comment character. For a line containing text followed by a comment character, pmatch[1] and pmatch[2] describe matches and pmatch[3] has rm_so set to -1; for a line with no comment characters, pmatch[3] describes the match and pmatch[1] and pmatch[2] have rm_so set to -1; and a line whose only comment character is its first character matches neither alternative, so regexec() returns REG_NOMATCH.
Whenever you are done with a compiled regular expression, you need to free it to avoid a memory leak. You must use the regfree() function to free it, not the free() function:

    #include <regex.h>

    void regfree(regex_t *preg);

The POSIX standard does not explicitly specify whether you need to use regfree() each time you call regcomp() or only after the final time you call regcomp() on one regex_t structure. Therefore, call regfree() on your regex_t structures between uses to avoid memory leaks.
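As a sketch of that advice, the following hypothetical helper (first_matching_pattern() is our own name, used only for illustration) reuses a single regex_t for several expressions in turn and calls regfree() after each use, before the next regcomp().

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): returns the index of
 * the first of the n patterns that matches string, -1 if none match,
 * or -2 if a pattern fails to compile. */
int first_matching_pattern(const char *const patterns[], size_t n,
                           const char *string)
{
    regex_t re;     /* one regex_t, reused for every pattern */
    size_t i;
    int rc;

    for (i = 0; i < n; i++) {
        /* POSIX leaves re undefined after a failed regcomp(), so
         * this sketch does not call regfree() on that path */
        if (regcomp(&re, patterns[i], REG_EXTENDED | REG_NOSUB))
            return -2;

        rc = regexec(&re, string, 0, NULL, 0);

        /* free before compiling the next pattern into re */
        regfree(&re);

        if (rc == 0)
            return (int) i;
    }

    return -1;
}
```

Every successful regcomp() here is paired with exactly one regfree(), whichever way the loop exits, so nothing leaks regardless of how the library interprets repeated compilation into the same structure.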
Whenever you get a nonzero return code from regcomp() or regexec(), the regerror() function can provide a detailed message explaining what went wrong. It writes as much as possible of an error message into a buffer and returns the size of the total message. Because you do not know beforehand how big the error message might be, you first ask for its size, then allocate the buffer, and then use the buffer, as demonstrated in our sample code below. Because that kind of error handling gets old fast, and because you need to include that error handling code at least twice (once after regcomp() and once after regexec()), we recommend that you write your own wrapper around regerror(), as shown on line 10 of match.c.
Grep is a popular utility, specified by POSIX, which provides regular expression searching in text files. Here is a simple (not POSIX-compliant) version of grep implemented using the standard regular expression functions:

    /* grep.c */

    #include <alloca.h>
    #include <ctype.h>
    #include <popt.h>
    #include <regex.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MODE_REGEXP             1
    #define MODE_EXTENDED           2
    #define MODE_FIXED              3

    void do_regerror(int errcode, const regex_t *preg) {
        char *errbuf;
        size_t errbuf_size;

        errbuf_size = regerror(errcode, preg, NULL, 0);
        errbuf = alloca(errbuf_size);
        if (!errbuf) {
            perror("alloca");
            return;
        }

        regerror(errcode, preg, errbuf, errbuf_size);
        fprintf(stderr, "%s\n", errbuf);
    }

    int scanFile(FILE * f, int mode, const void * pattern,
                 int ignoreCase, const char * fileName,
                 int * maxCountPtr) {
        long lineLength;
        char * line;
        int match;
        int rc;
        char * chptr;
        char * prefix = "";

        if (fileName) {
            prefix = alloca(strlen(fileName) + 4);
            sprintf(prefix, "%s: ", fileName);
        }

        lineLength = sysconf(_SC_LINE_MAX);
        line = alloca(lineLength);

        while (fgets(line, lineLength, f) && (*maxCountPtr)) {
            /* if we don't have a final '\n' we didn't get the
               whole line */
            if (line[strlen(line) - 1] != '\n') {
                fprintf(stderr, "%sline too long\n", prefix);
                return 1;
            }

            if (mode == MODE_FIXED) {
                if (ignoreCase) {
                    /* the cast keeps tolower() well defined for
                       characters with the high bit set */
                    for (chptr = line; *chptr; chptr++)
                        *chptr = tolower((unsigned char) *chptr);
                }
                match = (strstr(line, pattern) != NULL);
            } else {
                match = 0;
                rc = regexec(pattern, line, 0, NULL, 0);
                if (!rc)
                    match = 1;
                else if (rc != REG_NOMATCH)
                    do_regerror(rc, pattern);
            }

            if (match) {
                printf("%s%s", prefix, line);
                if (*maxCountPtr > 0)
                    (*maxCountPtr)--;
            }
        }

        return 0;
    }

    int main(int argc, const char ** argv) {
        const char * pattern = NULL;
        regex_t regPattern;
        const void * finalPattern = NULL;
        int mode = MODE_REGEXP;
        int ignoreCase = 0;
        int maxCount = -1;
        int rc;
        int regFlags;
        const char ** files;
        poptContext optCon;
        FILE * f;
        char * chptr;
        struct poptOption optionsTable[] = {
            { "extended-regexp", 'E', POPT_ARG_VAL,
              &mode, MODE_EXTENDED,
              "pattern for match is an extended regular "
              "expression", NULL },
            { "fixed-strings", 'F', POPT_ARG_VAL,
              &mode, MODE_FIXED,
              "pattern for match is a basic string (not a "
              "regular expression)", NULL },
            { "basic-regexp", 'G', POPT_ARG_VAL,
              &mode, MODE_REGEXP,
              "pattern for match is a basic regular expression",
              NULL },
            { "ignore-case", 'i', POPT_ARG_NONE, &ignoreCase, 0,
              "perform case insensitive search", NULL },
            { "max-count", 'm', POPT_ARG_INT, &maxCount, 0,
              "terminate after N matches", "N" },
            { "regexp", 'e', POPT_ARG_STRING, &pattern, 0,
              "regular expression to search for", "pattern" },
            POPT_AUTOHELP
            { NULL, '\0', POPT_ARG_NONE, NULL, 0, NULL, NULL }
        };

        optCon = poptGetContext("grep", argc, argv, optionsTable, 0);
        poptSetOtherOptionHelp(optCon, "<pattern> <file list>");

        if ((rc = poptGetNextOpt(optCon)) < -1) {
            /* an error occurred during option processing */
            fprintf(stderr, "%s: %s\n",
                    poptBadOption(optCon, POPT_BADOPTION_NOALIAS),
                    poptStrerror(rc));
            return 1;
        }

        files = poptGetArgs(optCon);
        /* if we weren't given a pattern it must be the first
           leftover */
        if (!files && !pattern) {
            poptPrintUsage(optCon, stdout, 0);
            return 1;
        }

        if (!pattern) {
            pattern = files[0];
            files++;
        }

        regFlags = REG_NEWLINE | REG_NOSUB;
        if (ignoreCase) {
            regFlags |= REG_ICASE;
            /* convert the pattern to lower case; this doesn't matter
               if we're ignoring the case in a regular expression, but
               it lets strstr() handle -i properly */
            chptr = alloca(strlen(pattern) + 1);
            strcpy(chptr, pattern);
            pattern = chptr;

            while (*chptr) {
                *chptr = tolower((unsigned char) *chptr);
                chptr++;
            }
        }

        switch (mode) {
        case MODE_EXTENDED:
            regFlags |= REG_EXTENDED;
            /* fall through: EREs are compiled like BREs */
        case MODE_REGEXP:
            if ((rc = regcomp(&regPattern, pattern, regFlags))) {
                do_regerror(rc, &regPattern);
                return 1;
            }
            finalPattern = &regPattern;
            break;

        case MODE_FIXED:
            finalPattern = pattern;
            break;
        }

        /* files is NULL when the pattern came from -e and no file
           names were given */
        if (!files || !*files) {
            rc = scanFile(stdin, mode, finalPattern, ignoreCase, NULL,
                          &maxCount);
        } else if (!files[1]) {
            /* this is handled separately because the file name should
               not be printed */
            if (!(f = fopen(*files, "r"))) {
                perror(*files);
                rc = 1;
            } else {
                rc = scanFile(f, mode, finalPattern, ignoreCase, NULL,
                              &maxCount);
                fclose(f);
            }
        } else {
            rc = 0;

            while (*files) {
                if (!(f = fopen(*files, "r"))) {
                    perror(*files);
                    rc = 1;
                } else {
                    rc |= scanFile(f, mode, finalPattern, ignoreCase,
                                   *files, &maxCount);
                    fclose(f);
                }
                files++;
                if (!maxCount) break;
            }
        }

        if (mode != MODE_FIXED)
            regfree(&regPattern);
        poptFreeContext(optCon);

        return rc;
    }
