There’s only one everything
The Might Be Giants
The uniq
program (pronounced unique) will find the distinct strings in lines of text either from a file or STDIN
.
Among its many uses, it is often employed to count how many times each unique string is found.
I think you will find it challenging to create a program that mimics the output from the original tool.
Here are some of the skills you will learn:
How to write to a file or STDOUT
(pronounced standard out)
How and why to write a closure to capture a variable
About the don’t repeat yourself (DRY) concept
About the Write
trait and the write!
and writeln!
macros
How and why to use temporary files
How to indicate the lifetime of a variable
How to write tests that pass arguments on the command line or through STDIN
Following is part of the manual page for the BSD version of uniq
.
The challenge program in this chapter will only implement the reading of a file or STDIN
, writing to a file or STDOUT
, and counting the lines for the -c
flag:
UNIQ(1) BSD General Commands Manual UNIQ(1) NAME uniq -- report or filter out repeated lines in a file SYNOPSIS uniq [-c | -d | -u] [-i] [-f num] [-s chars] [input_file [output_file]] DESCRIPTION The uniq utility reads the specified input_file comparing adjacent lines, and writes a copy of each unique input line to the output_file. If input_file is a single dash ('-') or absent, the standard input is read. If output_file is absent, standard output is used for output. The second and succeeding copies of identical adjacent input lines are not written. Repeated lines in the input will not be detected if they are not adja- cent, so it may be necessary to sort the files first. The following options are available: -c Precede each output line with the count of the number of times the line occurred in the input, followed by a single space. -d Only output lines that are repeated in the input. -f num Ignore the first num fields in each input line when doing compar- isons. A field is a string of non-blank characters separated from adjacent fields by blanks. Field numbers are one based, i.e., the first field is field one. -s chars Ignore the first chars characters in each input line when doing comparisons. If specified in conjunction with the -f option, the first chars characters after the first num fields will be ignored. Character numbers are one based, i.e., the first char- acter is character one. -u Only output lines that are not repeated in the input. -i Case insensitive comparison of lines.
In the 06_uniq directory of the Git repository, you will find various input files I’ll use for testing.
To start, note that uniq
will print nothing when given an empty file:
$ uniq tests/inputs/empty.txt
Given a file with just one line, the one line will be printed:
$ uniq tests/inputs/one.txt a
It will also print the number of times a line occurs before the line when run with the -c
option.
The count is right-justified in a field 4 characters wide and is followed by a single space and then the line:
$ uniq -c tests/inputs/one.txt 1 a
Given input that contains two duplicate lines, only one line will be emitted:
$ cat tests/inputs/two.txt a a $ uniq tests/inputs/two.txt a
The line is shown to have a count of 2:
$ uniq -c tests/inputs/two.txt 2 a
A longer input file shows that uniq
only considers the lines in order and not globally as the line one shows several times in this input file:
$ cat tests/inputs/three.txt a a b b a c c c a d d d d
When counting, uniq
starts over at 1 each time it sees a new string:
$ uniq -c tests/inputs/three.txt 2 a 2 b 1 a 3 c 1 a 4 d
If you wanted actually unique values, you must first sort
the input.
Note in the following command that uniq
will read STDIN
by default:
$ sort tests/inputs/three.txt | uniq -c 4 a 2 b 3 c 4 d
The file tests/inputs/skip.txt contains a blank line in the input acts just like any other value and so it will reset the counter:
$ uniq -c tests/inputs/skip.txt 1 a 1 1 a 1 b
If you study the usage closely, you’ll see a very subtle indication of how to write the output to a file.
Notice how input
and output
in the following are grouped inside square brackets to indicate that they are optional as a pair.
That is, if you provide input
, then you may also optionally provide output
:
usage: uniq [-c | -d | -u] [-i] [-f fields] [-s chars] [input [output]]
For example, I can count tests/inputs/two.txt and place the output into out:
$ uniq -c tests/inputs/two.txt out $ cat out 2 a
With no positional arguments, uniq
will read from STDIN
by default:
$ cat tests/inputs/two.txt | uniq -c 2 a
If you want to read from STDIN
and indicate the output filename, you must use a dash (“-” for the input filename:
$ cat tests/inputs/two.txt | uniq -c - out $ cat out 2 a
The GNU version works basically the same while also providing many more options:
$ uniq --help Usage: uniq [OPTION]... [INPUT [OUTPUT]] Filter adjacent matching lines from INPUT (or standard input), writing to OUTPUT (or standard output). With no options, matching lines are merged to the first occurrence. Mandatory arguments to long options are mandatory for short options too. -c, --count prefix lines by the number of occurrences -d, --repeated only print duplicate lines, one for each group -D, --all-repeated[=METHOD] print all duplicate lines groups can be delimited with an empty line METHOD={none(default),prepend,separate} -f, --skip-fields=N avoid comparing the first N fields --group[=METHOD] show all items, separating groups with an empty line METHOD={separate(default),prepend,append,both} -i, --ignore-case ignore differences in case when comparing -s, --skip-chars=N avoid comparing the first N characters -u, --unique only print unique lines -z, --zero-terminated end lines with 0 byte, not newline -w, --check-chars=N compare no more than N characters in lines --help display this help and exit --version output version information and exit A field is a run of blanks (usually spaces and/or TABs), then non-blank characters. Fields are skipped before chars. Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.
As you can see, both the BSD and GNU versions have many more options, but this is as much as the challenge program is expected to implement.
Just so there are no surprises, this program will be called uniqr
for a Rust version of uniq
.
Start by running cargo new uniqr
to create a new binary package, then modify your Cargo.toml to add the following dependencies:
[dependencies] clap = "2.33" [dev-dependencies] assert_cmd = "1" predicates = "1" tempfile = "3" rand = "0.8"
Copy my 06_uniqr/tests directory into your projects, and then run cargo test
to ensure the program compiles and that the tests run and fail.
If you would like to copy my normal module structure, then modify your src/main.rs to the following:
fn main() { if let Err(e) = uniqr::get_args().and_then(uniqr::run) { eprintln!("{}", e); std::process::exit(1); } }
You can start your src/lib.rs like previous chapters:
use clap::{App, Arg}; use std::error::Error; type MyResult<T> = Result<T, Box<dyn Error>>; #[derive(Debug)] pub struct Config { in_file: String,out_file: Option<String>,
count: bool,
}
The program will either be read from a given in_file
or STDIN
.
The output will either be written to a given out_file
or STDOUT
.
The count
is a Boolean for whether to print the counts of each line.
Here is an outline for get_args
:
pub fn get_args() -> MyResult<Config> { let matches = App::new("uniq") .version("0.1.0") .author("Ken Youens-Clark <[email protected]>") .about("Rust uniq") // What goes here? .get_matches(); Ok(Config { in_file: ... out_file: ... count: ... }) }
I suggest you start your run
by printing the config
:
pub fn run(config: Config) -> MyResult<()> { println!("{:?}", config); Ok(()) }
Your program should be able to produce the following usage:
$ cargo run -- -h uniqr 0.1.0 Ken Youens-Clark <[email protected]> Rust uniq USAGE: uniqr [FLAGS] [ARGS] FLAGS: -c, --count Show counts-h, --help Prints help information -V, --version Prints version information ARGS: <FILE> Input file [default: -]
<FILE> Output file
The -c|--count
flag is optional.
The input file is the first positional argument and defaults to a dash.
The output file is the second positional argument and is optional.
By default the program will read from STDIN
which can be represented using a dash:
$ cargo run Config { in_file: "-", out_file: None, count: false }
The first positional argument should be interpreted as the in_file
, the second positional argument as the out_file
.
Note that clap
can handle options either before or after positional arguments:
$ cargo run -- tests/inputs/one.txt out --count Config { in_file: "tests/inputs/one.txt", out_file: Some("out"), count: true }
I am trying to mimic the original versions as much as possible, but I do not like optional positional parameters. In my opinion, it would be better to have an -o|--output
option that defaults to STDOUT
and have only one optional positional argument for the input file which defaults to STDIN
.
I assume you are an upright and moral person who figured out how to write that on your own, so I will now share my solution to this:
pub fn get_args() -> MyResult<Config> { let matches = App::new("uniq") .version("0.1.0") .author("Ken Youens-Clark <[email protected]>") .about("Rust uniq") .arg( Arg::with_name("in_file") .value_name("FILE") .help("Input file") .default_value("-"), ) .arg( Arg::with_name("out_file") .value_name("FILE") .help("Output file"), ) .arg( Arg::with_name("count") .value_name("COUNT") .help("Show counts") .short("c") .long("count") .takes_value(false), ) .get_matches(); Ok(Config { in_file: matches.value_of("in_file").map(str::to_string).unwrap(),out_file: matches.value_of("out_file").map(String::from),
count: matches.is_present("count"),
}) }
Convert the in_file
argument to a String
.
Convert the out_file
argument to an Option<String>
.
The count
is either present or not, so convert to a bool
.
In the preceding code, ArgMatches::value_of
method returns Option<&str>
.
To convert in_file
to a String
, I’m calling Option::map
which would normally take a closure, which I first mentioned in Chapter 5.
Instead, I’m using two existing functions that will convert the &str
to a String
: str::to_string
, and String::from
, which could also be written as From::from
because Rust infers the String
type.
Remember, functions are first-class objects that can be passed as arguments to other functions.
Because I have defined the in_file
argument to have a default value, it should never be None
; therefore, I can use Option::map
to convert Some(&str)
to Some(String)
and do nothing with a None
.
I similarly do this for the out_file
but leave it as an Option
since it is not required.
The test suite in tests/cli.rs is fairly large containing 78 tests. I intended to test the program under the following conditions:
Input file as the only positional argument, check STDOUT
Input file as a positional argument with --count
option, check STDOUT
Input from STDIN
with no positional arguments, check STDOUT
Input from STDIN
with --count
and no positional arguments, check STDOUT
Input and output files as positional arguments, check output file
Input and output files as positional arguments with --count
, check output file
Input from STDIN
and output files as positional arguments with --count
, check output file
Given how large and complicated the tests became, I incorporated a bit more structure in the code. I encourage you to read the tests which start off like so:
use assert_cmd::Command; use predicates::prelude::*; use rand::{distributions::Alphanumeric, Rng}; use std::fs; use tempfile::NamedTempFile;type TestResult = Result<(), Box<dyn std::error::Error>>; struct Test<'a> {
input: &'a str, out: &'a str, out_count: &'a str, }
This is used to create temporary output files.
A struct to define the input files and expected output values with and without the counts.
Note the use of 'a
to denote the lifetime of the values.
I want to define structs with &str
values, and the Rust compiler would like to know exactly how long the values are expected to stick around relative to each other.
The fact that all the values have the same lifetime 'a
means that they are all expected to live as long as each other.
If you remove the lifetime annotations and run the tests, you’ll see similar warnings from the compiler as shown in the previous section along with an suggestion of exactly how to fix it:
error[E0106]: missing lifetime specifier --> tests/cli.rs:8:12 | 8 | input: &str, | ^ expected named lifetime parameter | help: consider introducing a named lifetime parameter | 7 | struct Test<'a> { 8 | input: &'a str,
Next, I define the constant values I need for testing:
const PRG: &str = "uniqr";const EMPTY: Test = Test { input: "tests/inputs/empty.txt",
out: "tests/inputs/empty.txt.out",
out_count: "tests/inputs/empty.txt.c.out",
};
The name of the program being tested
The location of the input file for this test
The location of the output file without the counts
The location of the output file with the counts
After the declaration of EMPTY
, there are many more Test
structures followed by several helper functions.
The run
function will use the Test.input
as an input file and will compare STDOUT
to the contents of the Test.out
file:
fn run(test: &Test) -> TestResult {let expected = fs::read_to_string(test.out)?;
Command::cargo_bin(PRG)?
.arg(test.input) .assert() .success() .stdout(expected); Ok(()) }
The function accepts a Test
and returns a TestResult
.
Try to read the expected output file.
Try to run the program with the input file as an argument, verify it ran successfully, and compare the STDOUT
to the expected value.
The next helper works very similarly but this time tests for the counting:
fn run_count(test: &Test) -> TestResult { let expected = fs::read_to_string(test.out_count)?;Command::cargo_bin(PRG)? .args(&[test.input, "-c"])
.assert() .success() .stdout(expected); Ok(()) }
Read the Test.out_count
file for the expected output.
Pass both the Test.input
value and the flag -c
to count the lines.
The next function will pass the input as STDIN
:
fn run_stdin(test: &Test) -> TestResult { let input = fs::read_to_string(test.input)?;let expected = fs::read_to_string(test.out)?;
Command::cargo_bin(PRG)?
.write_stdin(input) .assert() .success() .stdout(expected); Ok(()) }
Try to read the Test.input
file.
Try to read the Test.out
file.
Pass the input
through STDIN
and verify that STDOUT
is the expected value.
The next function tests both reading from STDIN
and counting the lines:
fn run_stdin_count(test: &Test) -> TestResult { let input = fs::read_to_string(test.input)?; let expected = fs::read_to_string(test.out_count)?; Command::cargo_bin(PRG)?.arg("--count") .write_stdin(input) .assert() .success() .stdout(expected); Ok(()) }
Run the program with the long --count
flag and feed the input to STDIN
, verify that STDOUT
is correct.
The next function checks that the program accepts both the input and output files as positional arguments.
This is somewhat more interesting as I needed to use temporary files in the testing because, as you have seen repeatedly, Rust will run the tests in parallel.
If I were to use the same dummy filename like blargh to write all the output files, the tests would overwrite each other’s output.
To get around this, I use the tempfile::NamedTempFile
to get a dynamically generated temporary filename that will automatically be removed when I finish:
fn run_outfile(test: &Test) -> TestResult { let expected = fs::read_to_string(test.out)?; let outfile = NamedTempFile::new()?;let outpath = &outfile.path().to_str().unwrap();
Command::cargo_bin(PRG)?
.args(&[test.input, outpath]) .assert() .success() .stdout(""); let contents = fs::read_to_string(&outpath)?;
assert_eq!(&expected, &contents);
Ok(()) }
Try to get a named temporary file.
Get the path
to the file.
Run the program with the input and output filenames as arguments, verify there is nothing in STDOUT
.
Try to read the output file.
Check that the contents of the output file match the expected value.
The next two functions are variations on what I’ve already shown, adding in the --count
flag and finally asking the program to read from STDIN
when the input filename is a dash.
The rest of the module calls these helpers using the various structs to run all the tests.
It was tedious to write all this, but it’s very important to test every single aspect of a program with all possible combinations of the arguments.
You can perhaps see now why I didn’t want to add any more functionality to the challenge program.
One interesting aspect of implementing open-source programs like uniq
is that you can sometimes find the source code for those programs.
While looking for the original test suite, I found a Perl program used to test the GNU version.
I used several of those examples in the tests/inputs/t[1-6].txt input files, as you can see in 06_uniqr/mk-outs.sh:
$ cat mk-outs.sh #!/usr/bin/env bash ROOT="tests/inputs" OUT_DIR="tests/expected" [[ ! -d "$OUT_DIR" ]] && mkdir -p "$OUT_DIR" # Cf https://github.com/coreutils/coreutils/blob/master/tests/misc/uniq.pl echo -ne "a a " > $ROOT/t1.txtecho -ne "a a" > $ROOT/t2.txt
echo -ne "a b" > $ROOT/t3.txt
echo -ne "a a b" > $ROOT/t4.txt
echo -ne "b a a " > $ROOT/t5.txt
echo -ne "a b c " > $ROOT/t6.txt
for FILE in $ROOT/*.txt; do BASENAME=$(basename "$FILE") uniq $FILE > ${OUT_DIR}/${BASENAME}.out uniq -c $FILE > ${OUT_DIR}/${BASENAME}.c.out uniq < $FILE > ${OUT_DIR}/${BASENAME}.stdin.out uniq -c < $FILE > ${OUT_DIR}/${BASENAME}.stdin.c.out done
Two lines each ending with a newline
No trailing newline on last line
Two different lines, no trailing newline
Two lines the same, last is different with no trailing newline
Two different values with newlines on each
Three different values with newlines on each
I found it surprisingly challenging to exactly match the output from uniq
, but I guess I’m a better person for the effort now.
I would suggest start in src/lib.rs by reading the input file, and again it makes sense to use the open
function from previous chapters:
fn open(filename: &str) -> MyResult<Box<dyn BufRead>> { match filename { "-" => Ok(Box::new(BufReader::new(io::stdin()))), _ => Ok(Box::new(BufReader::new(File::open(filename)?))), } }
Be sure you expand your imports to include the following:
use clap::{App, Arg}; use std::{error::Error, fs::File, io::{self, BufRead, BufReader}, };
You can borrow quite a bit of code from Chapter 3 that reads lines of text from an input file or STDIN
while preserving the line endings:
pub fn run(config: Config) -> MyResult<()> { let mut file = open(&config.in_file) .map_err(|e| format!("{}: {}", config.in_file, e))?;let mut line = String::new();
loop {
let bytes = file.read_line(&mut line)?;
if bytes == 0 {
break; } print!("{}", line);
line.clear();
} Ok(()) }
Read either STDIN
if the input file is a dash or open the given filename. Create an informative error message when this fails.
Create a new, empty mutable String
buffer to hold each line.
Create an infinite loop.
Read a line of text while preserving the line endings.
If no bytes were read, break out of the loop.
Print the line buffer.
Clear the line buffer.
Run your program with an input file to ensure it works:
$ cargo run -- tests/inputs/one.txt a
It should also work for reading STDIN
:
$ cargo run -- - < tests/inputs/one.txt a
Try to get your program to iterate the lines of input and count each unique run of lines, then try to get your program to print the lines with and without the counts.
Look at how open
works, and try to copy those ideas to use File::create
to write output to a file or STDOUT
.
Remember that you can run just a subset of tests with a command like cargo test empty
to run all the tests with the string empty in the name.
I know you can do this.
I’ll step you through how I arrived at a solution. Your version may be different, but it’s fine as long as it passes the test suite. Building on the last version I showed, I decided to create two additional mutable variables to hold the last line of text and the running count. For now, I will always print the count to make sure it’s working correctly:
pub fn run(config: Config) -> MyResult<()> { let mut file = open(&config.in_file) .map_err(|e| format!("{}: {}", config.in_file, e))?; let mut line = String::new(); let mut last = String::new();let mut count: u64 = 0;
loop { let bytes = file.read_line(&mut line)?; if bytes == 0 { break; } if line.trim_end() != last.trim_end() {
if count > 0 {
print!("{:>4} {}", count, last);
} last = line.clone();
count = 0;
} count += 1;
line.clear(); } if count > 0 {
print!("{:>4} {}", count, last); } Ok(()) }
Create a mutable variable to hold the last line of text.
Create a mutable variable to hold the count.
Compare the current line to the last time, both trimmed of any possible trailing whitespace.
Only print the output when count
is greater than 0.
Print the count
right-justified in a column 4 characters wide followed by a space and the line
value.
Set the last
variable to a copy of the current line
.
Reset the counter to 0.
Increment the counter by 1.
Handle the last line of the file.
If I run cargo test
, this will pass a goodly number of tests.
This code is clunky, though.
I don’t like having to check if count > 0
twice as it violates the don’t repeat yourself (DRY) principle, so I’m thinking that needs to be put into a function.
I’m also always printing the count, but I should only print this when config.count
is true
—another condition that could go into this function.
I decided to write this as a closure inside the run
function to close around the config.count
value:
let print = |count: &u64, line: &String| {if count > &0 {
if config.count {
print!("{:>4} {}", &count, &line);
} else { print!("{}", &line);
} }; };
The print
closure will accept count
and line
values.
Only print if the count
is greater than 0.
Check if the config.count
value is true
.
Use the print!
macro to print the count
and line
to STDOUT
.
Otherwise, print the line
to STDOUT
.
I can update the rest of the function to use this closure:
loop { let bytes = file.read_line(&mut line)?; if bytes == 0 { break; } if line.trim_end() != last.trim_end() { print(&count, &last); last = line.clone(); count = 0; } count += 1; line.clear(); } print(&count, &last);
With these changes, my program will pass several more tests.
If I look at the failed tests, they all contain outfile because I’m failing to write a named output file.
I can add this last feature with two changes.
First, I would like to open the output file in the same way as I open the input file by either creating a named output file or by using STDOUT
.
You’ll need to add use std::io::Write
for the following code, which you can place just just after the file
variable:
let mut out_file: Box<dyn Write> = match &config.out_file {Some(out_name) => Box::new(File::create(&out_name)?),
_ => Box::new(io::stdout()),
};
The mutable out_file
will be a boxed value that implements the std::io::Write
trait.
When config.out_file
is Some
filename, use File::create
to try to create the file.
Otherwise, use std::io::stdout
.
If you follow the documentation for both File::create
and io::stdout
, you’ll see both have a Traits section showing the various traits they implement.
Both show that they implement Write
, and so they satisfy the type requirement Box<dyn Write>
which says that the value inside the Box
must implement this trait.
The second change I need to make is to use out_file
for the output.
I will replace the print!
macro with write!
to write the output to a stream like a filehandle or STDOUT
.
The first argument to write!
must be a mutable value that implements the Write
trait.
The documentation shows that write!
will return a std::io::Result
because it might fail.
As such, I changed my print
closure to return MyResult
.
Here is the final version of my run
function that passes all the tests:
pub fn run(config: Config) -> MyResult<()> { let mut file = open(&config.in_file) .map_err(|e| format!("{}: {}", config.in_file, e))?;let mut out_file: Box<dyn Write> = match &config.out_file {
Some(out_name) => Box::new(File::create(&out_name)?), _ => Box::new(io::stdout()), }; let mut print = |count: &u64, line: &String| -> MyResult<()> {
if count > &0 { if config.count { write!(out_file, "{:>4} {}", &count, &line)?; } else { write!(out_file, "{}", &line)?; } }; Ok(()) }; let mut line = String::new(); let mut last = String::new(); let mut count: u64 = 0; loop { let bytes = file.read_line(&mut line)?; if bytes == 0 { break; } if line.trim_end() != last.trim_end() { print(&count, &last)?;
last = line.clone(); count = 0; } count += 1; line.clear(); } print(&count, &last)?;
Ok(()) }
Open either STDIN
or the given input filename.
Open either STDOUT
or the given output filename.
Create a mutable print
closure to format the output.
Use the print
closure to possibly print output. Use ?
to propagate potential errors.
Handle the last line of the file.
I would think it’s highly unlikely that you would independently write something exactly like my solution. As long as your solution passes the tests, it’s perfectly acceptable. Part of what I like about writing with tests is that there is an objective determination when a program meets some level of specifications. Louis Srygley says, “Without requirements or design, programming is the art of adding bugs to an empty text file.” I would say that tests are the requirements made incarnate. Without tests, you simply have no way to know when a change to your program strays from the requirements or breaks the design.
In closing, I would share that I explored a couple of other ways to write this algorithm, one of which read all the lines of the input file into a vector and looked at pairs of lines using Vec::windows
.
This was interesting but could fail if the size of the input file exceeded the available memory on my machine.
The solution presented here will only ever allocate memory for the current and last lines and so should scale to any size file.
As usual, the BSD and GNU versions of uniq
both have many more features than I chose to include in the challenge.
I would encourage you to add all the features you would like to have in your version.
Consider reading the original source code and tests to see what you can borrow.
Be sure to add tests for each feature, and always run the entire test suite to verify that all previous features still work.
In my mind, uniq
is closely tied with sort
as I often use them together.
Consider implementing your own version of sort
, at least to the point of sorting values lexicographically (in dictionary order) or numerically.
I hope you found this challenge as interesting to write as I did. Let’s review some of the things you learned:
You can now open a new file for writing or use STDOUT
.
The idea of DRY is that you move any duplicated code into a single abstraction like a function or a closure.
You’ve seen that a closure can be used to capture values from the enclosing scope.
You learned about the Write
trait and how values that implement this trait can be used with the write!
and writeln!
macros.
Sometimes you will find you need to write a temporary file. You’ve now used a Rust crate that helps find a unique filename and open a filehandle for you.
The Rust compiler may sometimes require you to indicate the lifetime of a variable which is how long it lives in relation to other variables.
1 The name closure refers to this capturing or closing around a lexically scoped binding.