Chapter 10. Boston Commons

Can you fault such common sense?

They Might Be Giants

The comm (common) utility will read two files and report the lines of text that are common to both and the lines that are unique to each. These are set operations where the common lines are the intersection of the two files and the unique lines are the difference. If you are familiar with databases, you might also consider these as types of join operations. In this chapter, you will learn:

  • How to manually iterate the lines of a filehandle

  • How to match on combinations of possibilities using a tuple

  • How to compare two strings

  • How to use std::cmp::Ordering

  • How to create an enum variant that contains a value

How comm Works

I’ll start with part of the manual page for the BSD comm:

COMM(1)                   BSD General Commands Manual                  COMM(1)

NAME
     comm -- select or reject lines common to two files

SYNOPSIS
     comm [-123i] file1 file2

DESCRIPTION
     The comm utility reads file1 and file2, which should be sorted lexically,
     and produces three text columns as output: lines only in file1; lines
     only in file2; and lines in both files.

     The filename ''-'' means the standard input.

     The following options are available:

     -1      Suppress printing of column 1.

     -2      Suppress printing of column 2.

     -3      Suppress printing of column 3.

     -i      Case insensitive comparison of lines.

     Each column will have a number of tab characters prepended to it equal to
     the number of lower numbered columns that are being printed.  For exam-
     ple, if column number two is being suppressed, lines printed in column
     number one will not have any tabs preceding them, and lines printed in
     column number three will have one.

     The comm utility assumes that the files are lexically sorted; all charac-
     ters participate in line comparisons.

It might help to see some examples:

EXAMPLES
     Assuming a	file named example.txt with the	following contents:

	   a
	   b
	   c
	   d

     Show lines	only in	example.txt, lines only	in stdin and common lines:

	   $ echo -e "B
c" | comm example.txt -
		   B
	   a
	   b
			   c
	   d

     Show only common lines doing case insensitive comparisons:

	   $ echo -e "B
c" | comm -1 -2 -i example.txt	-
	   b
	   c

The GNU version has some additional options:

$ comm --help
Usage: comm [OPTION]... FILE1 FILE2
Compare sorted files FILE1 and FILE2 line by line.

When FILE1 or FILE2 (not both) is -, read standard input.

With no options, produce three-column output.  Column one contains
lines unique to FILE1, column two contains lines unique to FILE2,
and column three contains lines common to both files.

  -1              suppress column 1 (lines unique to FILE1)
  -2              suppress column 2 (lines unique to FILE2)
  -3              suppress column 3 (lines that appear in both files)

  --check-order     check that the input is correctly sorted, even
                      if all input lines are pairable
  --nocheck-order   do not check that the input is correctly sorted
  --output-delimiter=STR  separate columns with STR
  --total           output a summary
  -z, --zero-terminated    line delimiter is NUL, not newline
      --help     display this help and exit
      --version  output version information and exit

Note, comparisons honor the rules specified by 'LC_COLLATE'.

Examples:
  comm -12 file1 file2  Print only lines present in both file1 and file2.
  comm -3 file1 file2  Print lines in file1 not in file2, and vice versa.

At this point, you may be wondering exactly why you’d use this. Consider that you have a list of cities where your favorite band played on their last tour:

$ cd 10_commr/tests/inputs/
$ cat cities1.txt
Jackson
Denton
Cincinnati
Boston
Santa Fe
Tucson

Another file lists the cities on their current tour:

$ cat cities2.txt
San Francisco
Denver
Ypsilanti
Denton
Cincinnati
Boston

You can use comm to find which cities occur in both sets by suppressing columns 1 (the lines unique to the first file) and 2 (the lines unique to the second file) and only show column 3 (the lines common to both files). This is like an inner join in SQL where only data that occurs in both inputs is shown. Note that both files need to be sorted first:

$ comm -12 <(sort cities1.txt) <(sort cities2.txt)
Boston
Cincinnati
Denton

If you wanted the cities they only played on the first tour, you could suppress columns 2 and 3:

$ comm -23 <(sort cities1.txt) <(sort cities2.txt)
Jackson
Santa Fe
Tucson

Finally, if you wanted the cities they only played on the second tour, you could suppress columns 1 and 3:

$ comm -13 <(sort cities1.txt) <(sort cities2.txt)
Denver
San Francisco
Ypsilanti

The first or second file can be STDIN as denoted by the filename “-”:

$ sort cities2.txt | comm -12 <(sort cities1.txt) -
Boston
Cincinnati
Denton

As with the GNU comm, only one of the inputs may be “-” with the challenge program. Note that the comm can perform case-insensitive comparisons when the -i flag is present. For instance, I can put the first tour cities in lowercase:

$ cat cities1_lower.txt
jackson
denton
cincinnati
boston
santa fe
tucson

and the second tour cities in uppercase:

$ cat cities2_upper.txt
SAN FRANCISCO
DENVER
YPSILANTI
DENTON
CINCINNATI
BOSTON

Then I can use the -i flag to find the cities in common:

$ comm -i -12 <(sort cities1_lower.txt) <(sort cities2_upper.txt)
boston
cincinnati
denton
Note

I know the tour cities example is a trivial one, so I’ll give you another example drawn from my experience in bioinformatics, which is the intersection of computer science and biology. Given a file of protein sequences, I can run an analysis that will group similar sequences into clusters. I can then use comm to compare the clustered proteins to the original list and find the proteins that failed to cluster. There may be something unique to these unclustered proteins that bears further analysis.

This is as much as the challenge program is expected to implement. One change from the original tool is that I wanted to add an optional output column delimiter that defaults to a tab character, which is the normal output from comm.

Getting Started

The program in this chapter will be called commr (pronounced comm-ar) for a Rust version of comm. I suggest you use cargo new commr to start, then add the following dependencies to your Cargo.toml file:

[dependencies]
clap = "2.33"

[dev-dependencies]
assert_cmd = "1"
predicates = "1"
rand = "0.8"

Copy my 10_commr/tests directory into your project, and then run cargo test to run the tests which should all fail.

Defining the Arguments

No surprises here, but I suggest the following for your src/main.rs:

fn main() {
    if let Err(e) = commr::get_args().and_then(commr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}

You can start src/lib.rs with the following definition for Config:

use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

#[derive(Debug)]
pub struct Config {
    file1: String, 1
    file2: String, 2
    suppress_col1: bool, 3
    suppress_col2: bool, 4
    suppress_col3: bool, 5
    insensitive: bool, 6
    delimiter: String, 7
}
1

The first input filename is a String.

2

The second input filename is a String.

3

A Boolean for whether to suppress the first column of output.

4

A Boolean for whether to suppress the second column of output.

5

A Boolean for whether to suppress the third column of output.

6

A Boolean for whether to perform case-insensitive comparisons.

7

The output column delimiter, which will default to a tab.

You can start your get_args like so:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("commr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust comm")
        // What goes here?
        .get_matches();

        Ok(Config {
        file1: ...
        file2: ...
        suppress_col1: ...
        suppress_col2: ...
        suppress_col3: ...
        insensitive: ...
        delimiter: ...
    })
}

Start your run function by printing the config:

pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", config);
    Ok(())
}

Your program should be able to produce the following usage:

cargo run -- -h
commr 0.1.0
Ken Youens-Clark <[email protected]>
Rust comm

USAGE:
    commr [FLAGS] [OPTIONS] <FILE1> <FILE2>

FLAGS:
    -h, --help       Prints help information
    -i               Case insensitive comparison of lines
    -1               Suppress printing of column 1
    -2               Suppress printing of column 2
    -3               Suppress printing of column 3
    -V, --version    Prints version information

OPTIONS:
    -d, --output-delimiter <DELIM>    Output delimiter

ARGS:
    <FILE1>    Input file 1
    <FILE2>    Input file 2

If you run your program with no arguments, it should fail with a message that the two file arguments are required:

$ cargo run
error: The following required arguments were not provided:
    <FILE1>
    <FILE2>

USAGE:
    commr [FLAGS] [OPTIONS] <FILE1> <FILE2>

For more information try --help

If you supply two positional arguments, you should get the following output:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt
Config {
    file1: "tests/inputs/file1.txt", 1
    file2: "tests/inputs/file2.txt",
    suppress_col1: false, 2
    suppress_col2: false,
    suppress_col3: false,
    insensitive: false,
    delimiter: "	",
}
1

The two positional arguments are parsed into file1 and file2.

2

All the rest of the values use defaults which are false for the Booleans and the tab character for delimiter.

Verify that you can set all the other arguments as well:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt -123 -d , -i
Config {
    file1: "tests/inputs/file1.txt",
    file2: "tests/inputs/file2.txt",
    suppress_col1: true, 1
    suppress_col2: true,
    suppress_col3: true,
    insensitive: true, 2
    delimiter: ",", 3
}
1

The -123 sets each of the suppress values to true.

2

The -i sets insensitive to true.

3

The -d option sets the delimiter to a comma (,).

Stop reading and make your program match the preceding output. Come back when you’re done.

Following is how I defined the arguments in my get_args. I don’t have much to comment on here since it’s so similar to previous programs:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("commr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust comm")
        .arg(
            Arg::with_name("file1")
                .value_name("FILE1")
                .help("Input file 1")
                .takes_value(true)
                .required(true),
        )
        .arg(
            Arg::with_name("file2")
                .value_name("FILE2")
                .help("Input file 2")
                .takes_value(true)
                .required(true),
        )
        .arg(
            Arg::with_name("suppress_col1")
                .short("1")
                .value_name("COL1")
                .takes_value(false)
                .help("Suppress printing of column 1"),
        )
        .arg(
            Arg::with_name("suppress_col2")
                .short("2")
                .value_name("COL2")
                .takes_value(false)
                .help("Suppress printing of column 2"),
        )
        .arg(
            Arg::with_name("suppress_col3")
                .short("3")
                .value_name("COL3")
                .takes_value(false)
                .help("Suppress printing of column 3"),
        )
        .arg(
            Arg::with_name("insensitive")
                .short("i")
                .value_name("INSENSITIVE")
                .takes_value(false)
                .help("Case insensitive comparison of lines"),
        )
        .arg(
            Arg::with_name("delimiter")
                .short("d")
                .long("output-delimiter")
                .value_name("DELIM")
                .help("Output delimiter")
                .takes_value(true),
        )
        .get_matches();

    Ok(Config {
        file1: matches.value_of("file1").unwrap().to_string(),
        file2: matches.value_of("file2").unwrap().to_string(),
        suppress_col1: matches.is_present("suppress_col1"),
        suppress_col2: matches.is_present("suppress_col2"),
        suppress_col3: matches.is_present("suppress_col3"),
        insensitive: matches.is_present("insensitive"),
        delimiter: matches.value_of("delimiter").unwrap_or("	").to_string(), 1
    })
}
1

Use Option::unwrap_or to unwrap the given value or use a default (a tab), then convert the value to a String.

At this point, your program should pass cargo test dies_no_args.

Validating and Opening the Input Files

The first order of business will be checking and opening the input files. I suggest a modification of the open function used in several previous chapters:

fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(
            File::open(filename)
                .map_err(|e| format!("{}: {}", filename, e))?, 1
        ))),
    }
}
1

Incorporate the filename into the error message.

This will require you to expand your imports accordingly:

use clap::{App, Arg};
use std::{
    error::Error,
    fs::File,
    io::{self, BufRead, BufReader},
};

As noted earlier, only one of the inputs is allowed to be “-” for STDIN. You can use the following code for your run that will check the filenames and then open the files:

pub fn run(config: Config) -> MyResult<()> {
    let filename1 = &config.file1;
    let filename2 = &config.file2;

    if filename1.as_str() == "-" && filename2.as_str() == "-" { 1
        return Err(From::from("Both input files cannot be STDIN ("-")"));
    }

    let _file1 = open(&filename1)?; 2
    let _file2 = open(&filename2)?;
    println!("Opened {} and {}", filename1, filename2); 3

    Ok(())
}
1

Check that both of the filenames are not “-”.

2

Attempt to open the two input files.

3

Print a message so you know what happened.

Your program should reject two STDIN arguments:

$ cargo run -- - -
Both input files cannot be STDIN ("-")

It should be able to print the following for two good input files:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt
Opened tests/inputs/file1.txt and tests/inputs/file2.txt

It should reject a bad file for either argument:

$ cargo run -- tests/inputs/file1.txt blargh
blargh: No such file or directory (os error 2)

You should pass all the tests for cargo test dies:

running 4 tests
test dies_both_stdin ... ok
test dies_no_args ... ok
test dies_bad_file1 ... ok
test dies_bad_file2 ... ok

Processing the Files

Now your program has validated all the arguments and opened the input files, either of which may be STDIN. Next, you need to iterate over the lines from each file to compare them. You can use BufRead::lines for this as it is not necessary to preserve line endings. Start simple, perhaps using the empty.txt file (which is empty) and file1.txt which should print all the lines from file1.txt:

$ cd tests/inputs/
$ comm file1.txt empty.txt
a
b
c
d

Ensure that you get the same output (but now in column 2) if you reverse the argument order:

$ comm empty.txt file1.txt
	a
	b
	c
	d

Next, try with file1.txt and file2.txt. Note that the following is the output from BSD comm and is the expected behavior for the challenge program:

$ comm file1.txt file2.txt
	B
a
b
		c
d

GNU comm uses a different ordering for which lines to show first when they are not equal:

$ comm file1.txt file2.txt
a
b
	B
		c
d

This is one of those programs that was always rather mysterious to me until I wrote my own version. I suggest you start by trying to read a line from each file. The documentation for BufRead::lines notes that it will return a None when it reaches the end of the file. Starting with the empty file as one of the arguments will force you to deal with having an uneven number of lines where you will have to advance one of the filehandles while the other stays the same. Then when you use two nonempty files, you’ll have to consider how to read the files until you have matching lines and moving them independently, otherwise.

I’ll see you on the flip side after you’ve written your solution.

Solution

Adding to the last version I showed, I decided to create iterators for the lines of each of the files that would incorporate a closure to handle case-insensitive comparisons:

pub fn run(config: Config) -> MyResult<()> {
    let filename1 = &config.file1;
    let filename2 = &config.file2;

    if filename1.as_str() == "-" && filename2.as_str() == "-" {
        return Err(From::from("Both input files cannot be STDIN ("-")"));
    }

    let case = |line: String| { 1
        if config.insensitive {
            line.to_lowercase()
        } else {
            line
        }
    };

    let mut lines1 =
        open(filename1)?.lines().filter_map(Result::ok).map(case); 2
    let mut lines2 =
        open(filename2)?.lines().filter_map(Result::ok).map(case);

    let line1 = lines1.next(); 3
    let line2 = lines2.next();
    println!("line1 = {:?}", line1); 4
    println!("line2 = {:?}", line2);

    Ok(())
}
1

Create a closure to lowercase each line of text when config.insensitive is true.

2

Open the files, create a lines iterator that removes errors, and then map the lines through the case closure.

3

Use Iterator::next to get the first line from each filehandle.

4

Print the first two values.

Note

In the preceding code, I used the function Result::ok rather than writing a closure |line| line.ok(). They both accomplish the same thing, but the first is shorter.

As I suggested, I’ll start with one of the files being empty:

$ cd ../..
$ cargo run -- tests/inputs/file1.txt tests/inputs/empty.txt
line1 = Some("a")
line2 = None

That led me to think about how I can move through the lines of each iterator based on the four different combinations of Some(line) and None that I can get from two iterators. In the following code, I place the possibilities inside a tuple, which is a “finite heterogeneous sequence” surrounded by parentheses:

    let mut line1 = lines1.next(); 1
    let mut line2 = lines2.next();

    loop {
        match (&line1, &line2) { 2
            (Some(_val1), Some(_val2)) => { 3
                line1 = lines1.next();
                line2 = lines2.next();
            }
            (Some(_val1), None) => { 4
                line1 = lines1.next();
            }
            (None, Some(_val2)) => { 5
                line2 = lines2.next();
            }
            (None, None) => break, 6
        };
    }
1

Make the line variables mutable.

2

Compare all possible combinations of the two line variables for two variants.

3

When both are Some value, it’s possible to ask for the next line of each.

4

When there is only the first value, only ask for the next line from the first file.

5

Do the same for the second file.

6

When there are no lines left in either file, break from the loop.

Next, I must compare the two values when I have both. When I only have a value from the first or second file, I should print those values in the first or second column, respectively. When they are the same, I need to print the value in column 3. When the first value is less than the second, I should print the first value in column 1; otherwise, I should print the second value in column 2. To understand this last point, consider the following two input files which I’ll place side by side so you can imagine how the code will read the lines:

$ cat tests/inputs/file1.txt          $ cat tests/inputs/file2.txt
a                                     B
b                                     c
c
d

To help you see the output from BSD comm, I will pipe the output into Perl to replace the tab characters with the string ---> to make it clearer which columns are being printed:

$ comm tests/inputs/file1.txt tests/inputs/file2.txt | perl -pe "s/	/--->/g"
--->B
a
b
--->--->c
d

Now imagine your code reads the first line from each input and has a from file1.txt and B from file2.txt. They are not equal, so the question is which to print. The goal is to mimic BSD comm, so I know that the B should come first and be printed in the second column. When I compare a and B, I find that a is greater than B when they are ordered by their code point or numerical value. To help you see this, I’ve included a program in util/ascii that will show you a range of the ASCII table starting at the first printable character. Note that a has a value of 97 while B is 66:

 33: !	 52: 4	 71: G	 90: Z	109: m
 34: "	 53: 5	 72: H	 91: [	110: n
 35: #	 54: 6	 73: I	 92: 	111: o
 36: $	 55: 7	 74: J	 93: ]	112: p
 37: %	 56: 8	 75: K	 94: ^	113: q
 38: &	 57: 9	 76: L	 95: _	114: r
 39: '	 58: :	 77: M	 96: `	115: s
 40: (	 59: ;	 78: N	 97: a	116: t
 41: )	 60: <	 79: O	 98: b	117: u
 42: *	 61: =	 80: P	 99: c	118: v
 43: +	 62: >	 81: Q	100: d	119: w
 44: ,	 63: ?	 82: R	101: e	120: x
 45: -	 64: @	 83: S	102: f	121: y
 46: .	 65: A	 84: T	103: g	122: z
 47: /	 66: B	 85: U	104: h	123: {
 48: 0	 67: C	 86: V	105: i	124: |
 49: 1	 68: D	 87: W	106: j	125: }
 50: 2	 69: E	 88: X	107: k	126: ~
 51: 3	 70: F	 89: Y	108: l	127: DEL

To mimic BSD comm, I should print the lower value (B) first and draw another value from that file for the next iteration. The GNU version does the opposite. Note you should add use std::cmp::Ordering::* to your imports for this code:

    let mut line1 = lines1.next();
    let mut line2 = lines2.next();

    loop {
        match (&line1, &line2) {
            (Some(val1), Some(val2)) => match val1.cmp(val2) { 1
                Equal => { 2
                    println!("{}", val1);
                    line1 = lines1.next();
                    line2 = lines2.next();
                }
                Less => { 3
                    println!("{}", val1);
                    line1 = lines1.next();
                }
                _ => { 4
                    println!("{}", val2);
                    line2 = lines2.next();
                }
            },
            (Some(val1), None) => {
                println!("{}", val1); 5
                line1 = lines1.next();
            }
            (None, Some(val2)) => {
                println!("{}", val2); 6
                line2 = lines2.next();
            }
            (None, None) => break,
        };
    }
1

Use std::cmp to compare the first value to the second. This will return a variant from std::cmp::Ordering.

2

When the two values are equal, print the first and get values from each of the files.

3

When the value from the first file is less than the value from the second file, print the first and request the next value from the first file.

4

Otherwise, print the value from the second file and request the next value from the second file.

5

When there is only a value from the first file, print it and continue requesting values from the first file.

6

When there is only a value from the second file, print it and continue requesting values from the second file.

If I run this code using a nonempty and empty file, it works:

$ cargo run -- tests/inputs/file1.txt tests/inputs/empty.txt
a
b
c
d

If I use file1.txt and file2.txt, it’s not far from the expected output:

$ cargo run -- tests/inputs/file1.txt tests/inputs/file2.txt
B
a
b
c
d

I decided to create an enum called Column to represent in which column I should print a value. Each variant holds a String, and you can place the following at the top of src/lib.rs near your Config declaration. Be sure to add use crate::Column::* to your import so you can reference Col1 instead of Column::Col1:

enum Column {
    Col1(String),
    Col2(String),
    Col3(String),
}

Next, I chose to create a closure to handle the printing of the output. This also means I must figure out what goes in the first two columns based on the configuration:

    let default_col1 = if config.suppress_col1 { 1
        ""
    } else {
        &config.delimiter
    };

    let default_col2 = if config.suppress_col2 { 2
        ""
    } else {
        &config.delimiter
    };

    let printer = |col: Column| {
        let out = match col {
            Col1(val) => { 3
                if config.suppress_col1 {
                    "".to_string()
                } else {
                    val
                }
            }
            Col2(val) => format!( 4
                "{}{}",
                default_col1,
                if config.suppress_col2 { "" } else { &val },
            ),
            Col3(val) => format!( 5
                "{}{}{}",
                default_col1,
                default_col2,
                if config.suppress_col3 { "" } else { &val },
            ),
        };

        if !out.trim().is_empty() { 6
            println!("{}", out);
        }
    };
1

When suppressing column 1, use the empty string; otherwise use the output delimiter.

2

Do the same for column 2.

3

Given the text for column 1, decide whether to print or suppress the output.

4

Given the text for column 2, use the default value for column 1 and either print or suppress this column.

5

Given the text for column 3, use the default values for columns 1 and 2 and either print or suppress this column.

6

Only print nonempty lines of output.

Here is how I can use this idea:

    let mut line1 = lines1.next(); 1
    let mut line2 = lines2.next();

    loop {
        match (&line1, &line2) {
            (Some(val1), Some(val2)) => match val1.cmp(val2) {
                Equal => {
                    printer(Col3(val1.to_string())); 2
                    line1 = lines1.next();
                    line2 = lines2.next();
                }
                Less => {
                    printer(Col1(val1.to_string())); 3
                    line1 = lines1.next();
                }
                _ => {
                    printer(Col2(val2.to_string())); 4
                    line2 = lines2.next();
                }
            },
            (Some(val1), None) => {
                printer(Col1(val1.to_string())); 5
                line1 = lines1.next();
            }
            (None, Some(val2)) => {
                printer(Col2(val2.to_string())); 6
                line2 = lines2.next();
            }
            (None, None) => break,
        };
    }
1

Draw the initial values from the two input files.

2

When the values are the same, print one of them in column 3.

3

When the first value is less than the second, print the first value in column 1.

4

Otherwise, print the second value in column 2.

5

When there is only a value from the first file, print it in column 1.

6

When there is only a value from the second file, print it in column 2.

I like having the option to change the output delimiter from a tab to something more visible:

$ cargo run -- -d="--->" tests/inputs/file1.txt tests/inputs/file2.txt
--->B
a
b
--->--->c
d

With these changes, all the tests pass.

Going Further

  • Alter the program to show the GNU output and add those options.

  • As I noted in the chapter introduction, comm performs basic join operations on two files which is similar to the join program. Read the manual page for that program and use your experience from writing commr to write joinr.

Summary

Like I said, comm was always black magic to me, and I had to look up the manual page every time to remember what the flags meant. Now that I’ve written my own version, I understand it much better and quite love the elegance of how it works. Consider what you learned:

  • You can manually iterate the lines of a filehandle by calling Iterator::next on BufRead::lines.

  • It’s possible match on combinations of possibilities by grouping them into a tuple.

  • You can use std::cmp to compare one value to another. The result is a variant of std::cmp::Ordering.

  • You create an enum called Column where the variants can hold a string value like Col1(String), which is really handy.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset