I’m a mess since you cut me out but Chucky’s arm keeps me company
They Might Be Giants
The cut tool will excise text from a file or STDIN.
The selected text could be some range of bytes or characters or might be fields denoted by a delimiter.
In Chapter 4 (headr), you learned to select a contiguous range of characters or bytes, but this challenge goes further as the selections can be noncontiguous.
Additionally, the output of the program can rearrange the selections, as in the string “3,1,5-7” which should select the third, first, and fifth through seventh bytes, characters, or fields.
This will also be the first time you will deal with delimited text where some special character like a comma or a tab creates field boundaries.
The challenge program will capture the spirit of the original tools but will not strive for complete fidelity as I will suggest a few changes which I feel are improvements.
What you will learn:

How to read and write delimited text files using the csv module
How to dereference a value using *
More on using Iterator::filter_map to combine filter and map operations
First, let’s review a portion of the manual page for the BSD version of cut:
CUT(1)                    BSD General Commands Manual                   CUT(1)

NAME
     cut -- cut out selected portions of each line of a file

SYNOPSIS
     cut -b list [-n] [file ...]
     cut -c list [file ...]
     cut -f list [-d delim] [-s] [file ...]

DESCRIPTION
     The cut utility cuts out selected portions of each line (as specified by
     list) from each file and writes them to the standard output.  If no file
     arguments are specified, or a file argument is a single dash ('-'), cut
     reads from the standard input.  The items specified by list can be in
     terms of column position or in terms of fields delimited by a special
     character.  Column numbering starts from 1.

     The list option argument is a comma or whitespace separated set of num-
     bers and/or number ranges.  Number ranges consist of a number, a dash
     ('-'), and a second number and select the fields or columns from the
     first number to the second, inclusive.  Numbers or number ranges may be
     preceded by a dash, which selects all fields or columns from 1 to the
     last number.  Numbers or number ranges may be followed by a dash, which
     selects all fields or columns from the last number to the end of the
     line.  Numbers and number ranges may be repeated, overlapping, and in
     any order.  If a field or column is specified multiple times, it will
     appear only once in the output.  It is not an error to select fields or
     columns not present in the input line.

     The options are as follows:

     -b list    The list specifies byte positions.

     -c list    The list specifies character positions.

     -d delim   Use delim as the field delimiter character instead of the
                tab character.

     -f list    The list specifies fields, separated in the input by the
                field delimiter character (see the -d option.)  Output
                fields are separated by a single occurrence of the field
                delimiter character.

     -n         Do not split multi-byte characters.  Characters will only
                be output if at least one byte is selected, and, after a
                prefix of zero or more unselected bytes, the rest of the
                bytes that form the character are selected.

     -s         Suppress lines with no field delimiter characters.  Unless
                specified, lines with no delimiters are passed through
                unmodified.
As usual, the GNU version offers both short and long flags and several other features:
NAME
       cut - remove sections from each line of files

SYNOPSIS
       cut OPTION... [FILE]...

DESCRIPTION
       Print selected parts of lines from each FILE to standard output.

       Mandatory arguments to long options are mandatory for short options
       too.

       -b, --bytes=LIST
              select only these bytes

       -c, --characters=LIST
              select only these characters

       -d, --delimiter=DELIM
              use DELIM instead of TAB for field delimiter

       -f, --fields=LIST
              select only these fields; also print any line that contains
              no delimiter character, unless the -s option is specified

       -n     with -b: don't split multibyte characters

       --complement
              complement the set of selected bytes, characters or fields

       -s, --only-delimited
              do not print lines not containing delimiters

       --output-delimiter=STRING
              use STRING as the output delimiter the default is to use the
              input delimiter

       --help display this help and exit

       --version
              output version information and exit

       Use one, and only one of -b, -c or -f.  Each LIST is made up of one
       range, or many ranges separated by commas.  Selected input is written
       in the same order that it is read, and is written exactly once.
       Each range is one of:

       N      N'th byte, character or field, counted from 1

       N-     from N'th byte, character or field, to end of line

       N-M    from N'th to M'th (included) byte, character or field

       -M     from first to M'th (included) byte, character or field

       With no FILE, or when FILE is -, read standard input.
Both tools implement the selection ranges in similar ways where numbers can be selected individually, in closed ranges like “1-3”, or in half-open ranges like “-3” to indicate 1 to 3 or “5-” to indicate 5 to the end. Additionally, the original tools will not allow a field to be repeated in the output and will rearrange them in ascending order. The challenge program will instead allow only for a comma-separated list of either single numbers or bounded ranges like “2-4” and will use the selections in the given order to create the output.
I’ll show you some examples of how the original tools work to the extent that the challenge program should implement. I will use the files found in the 08_cutr/tests/inputs directory. First, consider a file with columns of information each in a fixed number of characters or so-called fixed-width text:
$ cd 08_cutr/tests/inputs
$ cat books.txt
Author              Year Title
Émile Zola          1865 La Confession de Claude
Samuel Beckett      1952 Waiting for Godot
Jules Verne         1870 20,000 Leagues Under the Sea
The Author column takes the first 20 characters:
$ cut -c 1-20 books.txt
Author
Émile Zola
Samuel Beckett
Jules Verne
The publication Year column occupies the next 5 characters:
$ cut -c 21-25 books.txt
Year
1865
1952
1870
The Title column occupies the last 30 characters. Note here that I intentionally request a larger range than exists to show that this is not considered an error:
$ cut -c 26-70 books.txt
Title
La Confession de Claude
Waiting for Godot
20,000 Leagues Under the Sea
I find it annoying that I cannot use this tool to rearrange the output such as by requesting the range for the Title followed by that for the Author:
$ cut -c 26-55,1-20 books.txt
Author              Title
Émile Zola          La Confession de Claude
Samuel Beckett      Waiting for Godot
Jules Verne         20,000 Leagues Under the Sea
I can grab just the first character like so:
$ cut -c 1 books.txt
A
É
S
J
As you’ve seen in previous chapters, bytes and characters are not always interchangeable. For instance, the “É” in “Émile Zola” is a Unicode character that is composed of two bytes, so asking for just one will result in an invalid character that is represented with the Unicode replacement character:
$ cut -b 1 books.txt
A
�
S
J
In my experience, fixed-width data files are less common than those where the columns of data are delimited with a character like a comma or a tab to show the boundaries of the data. Consider the same data in the file books.tsv where the file extension .tsv stands for tab-separated values:
$ cat books.tsv
Author	Year	Title
Émile Zola	1865	La Confession de Claude
Samuel Beckett	1952	Waiting for Godot
Jules Verne	1870	20,000 Leagues Under the Sea
By default, cut will assume the tab character is the field delimiter, so I can use the -f option to select, for instance, the publication year in the second column and the title in the third column like so:
$ cut -f 2,3 books.tsv
Year	Title
1865	La Confession de Claude
1952	Waiting for Godot
1870	20,000 Leagues Under the Sea
The comma is another common delimiter, and such files often have the extension .csv for comma-separated values (CSV). Following is the same data as a CSV file:
$ cat books.csv
Author,Year,Title
Émile Zola,1865,La Confession de Claude
Samuel Beckett,1952,Waiting for Godot
Jules Verne,1870,"20,000 Leagues Under the Sea"
To parse a CSV file, I must indicate the delimiter using the -d option. Note that I’m still unable to reorder the fields in the output as I indicate “2,1” for the second column followed by the first, but I get the columns back in their original order:
$ cut -d , -f 2,1 books.csv
Author,Year
Émile Zola,1865
Samuel Beckett,1952
Jules Verne,1870
You may have noticed that the third title contains a comma in 20,000, and so the title has been enclosed in quotes to indicate that the comma is not the field delimiter. This is a way to escape the delimiter or to tell the parser to ignore it. Unfortunately, neither the BSD nor the GNU version recognizes this, and both will truncate the title prematurely:
$ cut -d , -f 1,3 books.csv
Author,Title
Émile Zola,La Confession de Claude
Samuel Beckett,Waiting for Godot
Jules Verne,"20
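This naive behavior is easy to reproduce with str::split, which is essentially what cut does. A small sketch (naive_fields is a hypothetical helper, not part of the challenge code):

```rust
// A naive field splitter that, like cut, ignores CSV quoting entirely.
// `naive_fields` is a made-up name for illustration only.
fn naive_fields(line: &str) -> Vec<String> {
    line.split(',').map(String::from).collect()
}

fn main() {
    let line = "Jules Verne,1870,\"20,000 Leagues Under the Sea\"";
    // The quoted title is broken at its embedded comma, just like cut's output
    println!("{:?}", naive_fields(line));
}
```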
Noninteger values for any of the options that accept a list are rejected:
$ cut -f foo,bar books.tsv
cut: [-cf] list: illegal list value
Nonexistent files are handled in the course of processing, printing a message to STDERR that the file does not exist:
$ cut -c 1 books.txt blargh movies1.csv
A
É
S
J
cut: blargh: No such file or directory
t
T
L
Finally, the program will read STDIN by default or if the given input filename is the dash (-):
$ cat books.tsv | cut -f 2
Year
1865
1952
1870
The challenge program is expected to implement just this much of the original with the following changes:
Ranges must indicate both start and stop values (inclusive)
Output columns should be in the order specified by the user
Ranges may include repeated values
The parsing of delimited text files should respect escaped delimiters
The name of the challenge program should be cutr (pronounced cutter, I think) for a Rust version of cut.
I recommend you begin with cargo new cutr and then copy the 08_cutr/tests directory into your project.
My solution will involve the following crates which you should add to your Cargo.toml:
[dependencies]
clap = "2.33"
csv = "1"
regex = "1"

[dev-dependencies]
assert_cmd = "1"
predicates = "1"
rand = "0.8"
Run cargo test to download the dependencies and run the tests, all of which should fail.
I recommend the following structure for your src/main.rs:
fn main() {
    if let Err(e) = cutr::get_args().and_then(cutr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}
Following is how I started my src/lib.rs. I want to highlight that I’m creating another enum as in Chapter 7, but this time the variants can hold a value. The value in this case will be another type alias I’m creating called PositionList, which is a Vec<usize>:
use crate::Extract::*;
use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;
type PositionList = Vec<usize>;

#[derive(Debug)]
pub enum Extract {
    Fields(PositionList),
    Bytes(PositionList),
    Chars(PositionList),
}

#[derive(Debug)]
pub struct Config {
    files: Vec<String>,
    delimiter: u8,
    extract: Extract,
}
This allows me to use Fields(...) instead of Extract::Fields(...).
A PositionList is a vector of positive integer values.
Define an enum to hold the variants for extracting fields, bytes, or characters.
The files will be a vector of strings.
The delimiter should be a single byte.
The extract field will hold one of the Extract variants.
I decided to represent a range selection of something like “3,1,5-7” as [3, 1, 5, 6, 7].
Well, I actually subtract 1 from each value because I will be dealing with 0-offsets, but the point is that I thought it easiest to explicitly list all the positions in the order in which they will be selected.
You may prefer to handle this differently.
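As a sketch of that representation, a hypothetical helper (expand_list here, with none of the error handling the real parse_pos needs) could expand such a list into explicit zero-based positions:

```rust
// Hypothetical sketch, not the book's parse_pos: expand "3,1,5-7" into
// explicit zero-based positions, ignoring all error handling.
fn expand_list(list: &str) -> Vec<usize> {
    let mut positions = vec![];
    for part in list.split(',') {
        match part.split_once('-') {
            // "5-7" becomes the inclusive range 5..=7
            Some((lo, hi)) => {
                let lo: usize = lo.parse().unwrap();
                let hi: usize = hi.parse().unwrap();
                positions.extend(lo..=hi);
            }
            // "3" is a single position
            None => positions.push(part.parse().unwrap()),
        }
    }
    // Shift to zero-based offsets for indexing
    positions.into_iter().map(|n| n - 1).collect()
}

fn main() {
    println!("{:?}", expand_list("3,1,5-7"));
}
```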
You can start your get_args with the following:
pub fn get_args() -> MyResult<Config> {
    let matches = App::new("cutr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust cut")
        // What goes here?
        .get_matches();

    Ok(Config {
        files: ...
        delimiter: ...
        extract: ...
    })
}
Begin your run by printing the config:
pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", &config);
    Ok(())
}
See if you can get your program to print the following usage:
$ cargo run -- --help
cutr 0.1.0
Ken Youens-Clark <[email protected]>
Rust cut

USAGE:
    cutr [OPTIONS] <FILE>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -b, --bytes <BYTES>        Selected bytes
    -c, --chars <CHARS>        Selected characters
    -d, --delim <DELIMITER>    Field delimiter [default: ]
    -f, --fields <FIELDS>      Selected fields

ARGS:
    <FILE>...    Input file(s) [default: -]
I wrote a function called parse_pos that works like the parse_positive_int function from Chapter 4. Here is how you might start it:
fn parse_pos(range: &str) -> MyResult<PositionList> {
    unimplemented!();
}
I have, of course, written a unit test for you. Add the following to your src/lib.rs:
#[cfg(test)]
mod tests {
    use super::parse_pos;

    #[test]
    fn test_parse_pos() {
        assert!(parse_pos("").is_err());

        let res = parse_pos("0");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"0\"",);

        let res = parse_pos("a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a\"",);

        let res = parse_pos("1,a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a\"",);

        let res = parse_pos("2-1");
        assert!(res.is_err());
        assert_eq!(
            res.unwrap_err().to_string(),
            "First number in range (2) must be lower than second number (1)"
        );

        let res = parse_pos("1");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0]);

        let res = parse_pos("1,3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0, 2]);

        let res = parse_pos("1-3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0, 1, 2]);

        let res = parse_pos("1,7,3-5");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0, 6, 2, 3, 4]);
    }
}
At this point, I expect you can read the above code well enough to understand exactly how the function should work. I recommend you stop reading at this point and write the code that will pass this test.
After cargo test test_parse_pos passes, your program should reject an invalid range and print an error message:
$ cargo run -- -f foo,bar tests/inputs/books.tsv
illegal list value: "foo"
It should also reject invalid ranges:
$ cargo run -- -f 3-2 tests/inputs/books.tsv
First number in range (3) must be lower than second number (2)
When given valid arguments, your program should be able to display a structure like so:
$ cargo run -- -f 1 -d , tests/inputs/movies1.csv
Config {
    files: [
        "tests/inputs/movies1.csv",
    ],
    delimiter: 44,
    extract: Fields(
        [
            0,
        ],
    ),
}
The positional argument goes into files.
The -d value of a comma has a byte value of 44.
The -f value of 1 creates the Extract::Fields([0]) variant.
When parsing a TSV file, use the tab as the default delimiter, which has a byte value of 9:
$ cargo run -- -f 2-3 tests/inputs/movies1.tsv
Config {
    files: [
        "tests/inputs/movies1.tsv",
    ],
    delimiter: 9,
    extract: Fields(
        [
            1,
            2,
        ],
    ),
}
Note that the options for -f|--fields, -b|--bytes, and -c|--chars are all mutually exclusive and should be rejected:
$ cargo run -- -f 1 -b 8-9 tests/inputs/movies1.tsv
error: The argument '--fields <FIELDS>' cannot be used with '--bytes <BYTES>'

USAGE:
    cutr <FILE>... --bytes <BYTES> --delim <DELIMITER> --fields <FIELDS>
Try to get just this much of your program working before you proceed.
You should be able to pass cargo test dies:
running 8 tests
test dies_chars_bytes ... ok
test dies_chars_bytes_fields ... ok
test dies_chars_fields ... ok
test dies_bytes_fields ... ok
test dies_not_enough_args ... ok
test dies_bad_digit_field ... ok
test dies_bad_digit_bytes ... ok
test dies_bad_digit_chars ... ok
I assume you wrote a passing parse_pos function, so compare your version to mine:
fn parse_pos(range: &str) -> MyResult<PositionList> {
    let mut fields: Vec<usize> = vec![];
    let range_re = Regex::new(r"(\d+)-(\d+)").unwrap();
    for val in range.split(',') {
        if let Some(cap) = range_re.captures(val) {
            let n1: &usize = &cap[1].parse()?;
            let n2: &usize = &cap[2].parse()?;
            if n1 < n2 {
                for n in *n1..=*n2 {
                    fields.push(n);
                }
            } else {
                return Err(From::from(format!(
                    "First number in range ({}) must be \
                    lower than second number ({})",
                    n1, n2
                )));
            }
        } else {
            match val.parse() {
                Ok(n) if n > 0 => fields.push(n),
                _ => {
                    return Err(From::from(format!(
                        "illegal list value: \"{}\"",
                        val
                    )))
                }
            }
        }
    }

    // Subtract one for field indexes
    Ok(fields.into_iter().map(|i| i - 1).collect())
}
Create a mutable vector to hold all the positions.
Create a regular expression to capture two numbers separated by a dash.
Split the range values on a comma.
See if this part of the list matches the regex.
Convert the two captured numbers to usize integer values.
If the first number is less than the second, iterate through the range and add the values to fields.
Use the * operator to dereference the two number values.
Return an error about an invalid range.
If it’s possible to convert the value to a usize integer, add it to the list or else throw an error.
Return the given list with all the values adjusted down by 1.
In the regular expression, I use r"" to denote a raw string so that Rust won’t try to interpret the string. For instance, you’ve seen that Rust will interpret \n as a newline. Without this, the compiler would complain that \d is an unknown character escape:
error: unknown character escape: `d`
   --> src/lib.rs:155:34
    |
155 |     let range_re = Regex::new("(\d+)-(\d+)").unwrap();
    |                                 ^ unknown character escape
    |
    = help: for more information, visit
      <https://static.rust-lang.org/doc/master/reference.html#literals>
I would like to highlight two new pieces of syntax in the preceding code.
First, I used parentheses in the regular expression (\d+)-(\d+) to indicate one or more digits followed by a dash followed by one or more digits, as shown in Figure 8-1.
If the regular expression matches the given string, then I can use Regex::captures to extract the digits from the string. Note that they are available in 1-based counting, so the contents of the first capturing parentheses are available in position 1 of the captures.
Because the captured values matched digit characters, they should be parsable as usize values.
The second piece of syntax is the * operator in for n in *n1..=*n2.
If you remove these and try to compile the code, you will see the following error:
error[E0277]: the trait bound `&usize: Step` is not satisfied
   --> src/lib.rs:165:34
    |
165 |                 for n in n1..=n2 {
    |                          ^^^^^^^ the trait `Step` is not
    |                                  implemented for `&usize`
This is one case where the compiler’s message is a bit cryptic and does not include the solution.
The problem is that n1 and n2 are &usize references.
A reference is a pointer to a piece of memory, not a copy of the value, and so the pointer must be dereferenced to use the underlying value.
There are many times when Rust silently dereferences values, but this is one time when the * operator is required.
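Here is a minimal, standalone illustration of the same point (expand is a made-up name for this sketch):

```rust
// Range endpoints must be owned usize values, so references
// are dereferenced with `*` before building the range.
fn expand(n1: &usize, n2: &usize) -> Vec<usize> {
    // Writing `n1..=n2` here fails: `Step` is not implemented for `&usize`
    (*n1..=*n2).collect()
}

fn main() {
    println!("{:?}", expand(&1, &3));
}
```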
Here is how I incorporate these ideas into my get_args. First, I define all the arguments:
pub fn get_args() -> MyResult<Config> {
    let matches = App::new("cutr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust cut")
        .arg(
            Arg::with_name("files")
                .value_name("FILE")
                .help("Input file(s)")
                .required(true)
                .default_value("-")
                .min_values(1),
        )
        .arg(
            Arg::with_name("delimiter")
                .value_name("DELIMITER")
                .help("Field delimiter")
                .short("d")
                .long("delim")
                .default_value("\t"),
        )
        .arg(
            Arg::with_name("fields")
                .value_name("FIELDS")
                .help("Selected fields")
                .short("f")
                .long("fields")
                .conflicts_with_all(&["chars", "bytes"]),
        )
        .arg(
            Arg::with_name("bytes")
                .value_name("BYTES")
                .help("Selected bytes")
                .short("b")
                .long("bytes")
                .conflicts_with_all(&["fields", "chars"]),
        )
        .arg(
            Arg::with_name("chars")
                .value_name("CHARS")
                .help("Selected characters")
                .short("c")
                .long("chars")
                .conflicts_with_all(&["fields", "bytes"]),
        )
        .get_matches();
The required files option accepts multiple values and defaults to the dash.
The delimiter uses the tab as the default value.
The fields option conflicts with chars and bytes.
The bytes option conflicts with fields and chars.
The chars option conflicts with fields and bytes.
Next, I convert the delimiter to bytes and verify that there is only one:
    let delimiter = matches.value_of("delimiter").unwrap_or("\t");
    let delim_bytes = delimiter.as_bytes();
    if delim_bytes.len() > 1 {
        return Err(From::from(format!(
            "--delim \"{}\" must be a single byte",
            delimiter
        )));
    }
I use the parse_pos function to handle all the optional list values:
    let fields = matches.value_of("fields").map(parse_pos).transpose()?;
    let bytes = matches.value_of("bytes").map(parse_pos).transpose()?;
    let chars = matches.value_of("chars").map(parse_pos).transpose()?;
I’m introducing Option::transpose here, which “transposes an Option of a Result into a Result of an Option.”
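A small standalone sketch of the flip (maybe_parse is a hypothetical name, standing in for the parse_pos calls above):

```rust
use std::num::ParseIntError;

// Option::map over a fallible parse yields Option<Result<..>>;
// transpose flips it to Result<Option<..>> so `?` can be applied.
fn maybe_parse(val: Option<&str>) -> Result<Option<usize>, ParseIntError> {
    val.map(|v| v.parse::<usize>()).transpose()
}

fn main() {
    println!("{:?}", maybe_parse(Some("3")));
    println!("{:?}", maybe_parse(None));
}
```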
I then figure out which Extract variant to create. I should never trigger the else clause in this code, but it’s good to have:
    let extract = if let Some(field_pos) = fields {
        Fields(field_pos)
    } else if let Some(byte_pos) = bytes {
        Bytes(byte_pos)
    } else if let Some(char_pos) = chars {
        Chars(char_pos)
    } else {
        return Err(From::from("Must have --fields, --bytes, or --chars"));
    };
If the code makes it to this point, then I appear to have valid arguments that I can return:
    Ok(Config {
        files: matches.values_of_lossy("files").unwrap(),
        delimiter: delim_bytes[0],
        extract,
    })
}
Next, you will need to figure out how you will use this information to extract the desired bits from the inputs.
In Chapters 4 (headr) and 5 (wcr), you learned how to process lines, bytes, and characters in a file. You should draw on those programs to help you select characters and bytes in this challenge.
One difference is that line endings need not be preserved, so you may use BufRead::lines to read the lines of input text.
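A quick sketch of that behavior, using std::io::Cursor as a stand-in filehandle (collect_lines is a made-up helper):

```rust
use std::io::{BufRead, Cursor};

// BufRead::lines drops the trailing \n or \r\n from each line, which is
// fine here because the challenge builds its own output lines anyway.
fn collect_lines(text: &str) -> Vec<String> {
    Cursor::new(text).lines().map(|line| line.unwrap()).collect()
}

fn main() {
    println!("{:?}", collect_lines("one\r\ntwo\nthree"));
}
```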
To start, you might consider bringing in the open function used before to help open each file:
fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}
You can expand your run to handle good and bad files:
pub fn run(config: Config) -> MyResult<()> {
    for filename in &config.files {
        match open(filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(_file) => println!("Opened {}", filename),
        }
    }
    Ok(())
}
This will require some more imports:
use crate::Extract::*;
use clap::{App, Arg};
use regex::Regex;
use std::{
    error::Error,
    fs::File,
    io::{self, BufRead, BufReader},
};
Now consider how you might extract characters from each line of the filehandle.
I wrote a function called extract_chars that will return a new string composed of the characters at the given index positions:
fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    unimplemented!();
}
I’m using &[usize] instead of &PositionList because of the suggestion from cargo clippy:
warning: writing `&Vec<_>` instead of `&[_]` involves one more reference
and cannot be used with non-Vec-based slices
   --> src/lib.rs:214:40
    |
214 | fn extract_chars(line: &str, char_pos: &PositionList) -> String {
    |                                        ^^^^^^^^^^^^^
    |
    = note: `#[warn(clippy::ptr_arg)]` on by default
    = help: for further information visit
      https://rust-lang.github.io/rust-clippy/master/index.html#ptr_arg
Here is a test you can add to your tests module to help you see how the function might work:
#[test]
fn test_extract_chars() {
    assert_eq!(extract_chars("", &[0]), "".to_string());
    assert_eq!(extract_chars("ábc", &[0]), "á".to_string());
    assert_eq!(extract_chars("ábc", &[0, 2]), "ác".to_string());
    assert_eq!(extract_chars("ábc", &[0, 1, 2]), "ábc".to_string());
    assert_eq!(extract_chars("ábc", &[2, 1]), "cb".to_string());
    assert_eq!(extract_chars("ábc", &[0, 1, 4]), "áb".to_string());
}
I also wrote a similar extract_bytes function to parse out bytes:
fn extract_bytes(line: &str, byte_pos: &[usize]) -> String {
    unimplemented!();
}
Here is the test:
#[test]
fn test_extract_bytes() {
    assert_eq!(extract_bytes("ábc", &[0]), "�".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1]), "á".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1, 2]), "áb".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1, 2, 3]), "ábc".to_string());
    assert_eq!(extract_bytes("ábc", &[3, 2]), "cb".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1, 5]), "á".to_string());
}
Once you have written these two functions so that they pass the given test, iterate the lines of input text and use the preceding functions to extract and print the desired characters or bytes from each line.
You should be able to pass all but the following when you run cargo test:
failures:
    csv_f1
    csv_f1_2
    csv_f1_3
    csv_f2
    csv_f2_3
    csv_f3
    tsv_f1
    tsv_f1_2
    tsv_f1_3
    tsv_f2
    tsv_f2_3
    tsv_f3
To pass the final tests, you will need to learn how to parse delimited text files.
As stated earlier, this file format uses some delimiter like a comma or tab to indicate the boundaries of a field.
Sometimes the delimiting character may also be part of the data, in which case the field should be enclosed in quotes to escape the delimiter.
The easiest way to properly parse delimited text is to use something like the csv module.
I highly recommend that you first read the tutorial.
Next, consider the following example that shows how you can use this module to parse delimited data.
If you would like to compile and run this code, start a new project and use the following for src/main.rs.
Be sure to add the csv dependency to your Cargo.toml and copy the books.csv file into the project.
use csv::StringRecord;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let mut reader = csv::ReaderBuilder::new()
        .delimiter(b',')
        .from_reader(File::open("books.csv")?);

    fmt(reader.headers()?);
    for record in reader.records() {
        fmt(&record?);
    }

    Ok(())
}

fn fmt(rec: &StringRecord) {
    println!(
        "{}",
        rec.into_iter()
            .map(|v| format!("{:20}", v))
            .collect::<Vec<String>>()
            .join("")
    )
}
Use csv::ReaderBuilder to parse a file.
The delimiter must be a single u8 byte.
The from_reader method accepts a value that implements the Read trait.
The Reader::headers method will return the column names in the first row as a StringRecord.
The Reader::records method provides access to an iterator over StringRecord values.
The field values in a StringRecord can be iterated.
Use Iterator::map to format the values into a field 20 characters wide.
Collect the strings into a vector and join them into a new string for printing.
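The width formatting can be seen in isolation with a small sketch (columnar is a hypothetical helper, not part of the example program):

```rust
// The {:20} format left-aligns each value in a 20-character column,
// padding with spaces on the right.
fn columnar(vals: &[&str]) -> String {
    vals.iter()
        .map(|v| format!("{:20}", v))
        .collect::<Vec<String>>()
        .join("")
}

fn main() {
    println!("{}", columnar(&["Year", "Title"]));
}
```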
If I run this program, I will see the following output:
$ cargo run
Author              Year                Title
Émile Zola          1865                La Confession de Claude
Samuel Beckett      1952                Waiting for Godot
Jules Verne         1870                20,000 Leagues Under the Sea
Coming back to the challenge program, think about how you might use some of these ideas to write a function like extract_fields that accepts a StringRecord and pulls out the fields found in the PositionList:
fn extract_fields(record: &StringRecord, field_pos: &[usize]) -> Vec<String> {
    unimplemented!();
}
Here is a test you could use:
#[test]
fn test_extract_fields() {
    let rec = StringRecord::from(vec!["Captain", "Sham", "12345"]);
    assert_eq!(extract_fields(&rec, &[0]), &["Captain"]);
    assert_eq!(extract_fields(&rec, &[1]), &["Sham"]);
    assert_eq!(extract_fields(&rec, &[0, 2]), &["Captain", "12345"]);
    assert_eq!(extract_fields(&rec, &[0, 3]), &["Captain"]);
    assert_eq!(extract_fields(&rec, &[1, 0]), &["Sham", "Captain"]);
}
I think that might be enough to help you find a solution. This is a challenging program, so don’t give up too quickly. Fear is the mind-killer.
I’ll show you my solution, starting with how I select the characters:
fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<_> = line.chars().collect();
    char_pos.iter().filter_map(|i| chars.get(*i)).collect()
}
Break the line into a vector of characters. The Vec type annotation is required by Rust because Iterator::collect can return many different types of collections.
Use Iterator::filter_map with Vec::get to select valid character positions and collect them into a new string.
The filter_map function, as you might imagine, combines the operations of filter and map.
The closure uses chars.get(*i) in an attempt to select the character at the given index.
This might fail if the user has requested positions beyond the end of the string, but a failure to select a character should not generate an error.
Vec::get will return an Option<&char>, and filter_map will skip all the None values and unwrap the Some values.
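A tiny standalone example of the same pattern (parse_numbers is a made-up helper, unrelated to the challenge code):

```rust
// filter_map applies a closure returning Option and keeps only the
// Some values, already unwrapped: a fused filter + map.
fn parse_numbers(vals: &[&str]) -> Vec<i32> {
    vals.iter().filter_map(|v| v.parse().ok()).collect()
}

fn main() {
    // "two" fails to parse and is silently skipped
    println!("{:?}", parse_numbers(&["1", "two", "3"]));
}
```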
Here is a longer way to write this:
fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<char> = line.chars().collect();
    char_pos
        .iter()
        .map(|i| chars.get(*i))
        .filter(|v| v.is_some())
        .map(|v| v.unwrap())
        .collect::<String>()
}
Try to get the characters at the given index positions.
Filter out the None values.
Unwrap the Some values.
Collect the filtered characters into a String.
In the preceding code, I use *i to dereference the value, similar to earlier in the chapter. If I remove the *, the compiler would complain thusly:
error[E0277]: the type `[char]` cannot be indexed by `&usize`
   --> src/lib.rs:227:35
    |
227 |         .filter_map(|i| chars.get(i))
    |                                   ^ slice indices are of type
    |                                     `usize` or ranges of `usize`
    |
    = help: the trait `SliceIndex<[char]>` is not implemented for `&usize`
The error message is vague on how to fix this, but the problem is that i is a &usize and I need a usize. The dereference operator * removes the & reference, hence the name.
The selection of bytes is very similar, but I have to deal with the fact that bytes must be explicitly cloned.
As with extract_chars, the goal is to return a new string, but there is a potential problem if the byte selection breaks Unicode characters and so produces an invalid UTF-8 string:
fn extract_bytes(line: &str, byte_pos: &[usize]) -> String {
    let bytes = line.as_bytes();
    let selected: Vec<u8> = byte_pos
        .iter()
        .filter_map(|i| bytes.get(*i))
        .cloned()
        .collect();
    String::from_utf8_lossy(&selected).into_owned()
}
Break the line into a vector of bytes.
Use filter_map to select bytes at the wanted positions.
Clone the resulting Vec<&u8> into a Vec<u8> to remove the references.
Use String::from_utf8_lossy to generate a string from possibly invalid bytes.
You may wonder why I used Iterator::cloned in the preceding code. Let me show you the error message if I remove it:
error[E0277]: a value of type `Vec<u8>` cannot be built from an iterator
over elements of type `&u8`
   --> src/lib.rs:215:10
    |
215 |         .collect();
    |          ^^^^^^^ value of type `Vec<u8>` cannot be built from
    |                  `std::iter::Iterator<Item=&u8>`
    |
    = help: the trait `FromIterator<&u8>` is not implemented for `Vec<u8>`
The filter_map will produce a Vec<&u8>, which is a vector of references to u8 values, but String::from_utf8_lossy expects &[u8], a slice of bytes. As the Iterator::cloned documentation notes, this method “Creates an iterator which clones all of its elements. This is useful when you have an iterator over &T, but you need an iterator over T.”
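A minimal sketch of what cloned does (to_owned_bytes is a hypothetical helper for illustration):

```rust
// Iterator::cloned turns an iterator over &u8 into one over u8,
// which is what collecting into Vec<u8> requires.
fn to_owned_bytes(refs: Vec<&u8>) -> Vec<u8> {
    refs.into_iter().cloned().collect()
}

fn main() {
    let data = [104u8, 105];
    let refs: Vec<&u8> = data.iter().collect();
    println!("{:?}", to_owned_bytes(refs));
}
```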
Finally, here is one way to extract the fields from a csv::StringRecord:
fn extract_fields(record: &StringRecord, field_pos: &[usize]) -> Vec<String> {
    field_pos
        .iter()
        .filter_map(|i| record.get(*i))
        .map(|v| v.to_string())
        .collect()
}
Use csv::StringRecord::get to try to get the field for the index position.
Use Iterator::map to turn &str values into String values.
Collect the results into a Vec<String>.
I would like to show you another way to write this function so that it will return a Vec<&str>, which will be slightly more memory efficient as it will not have to make copies of the strings. The tradeoff is that I must indicate the lifetimes. First, let me naïvely try to write it like so:
// This will not compile
fn extract_fields(record: &StringRecord, field_pos: &[usize]) -> Vec<&str> {
    field_pos.iter().filter_map(|i| record.get(*i)).collect()
}
If I try to compile this, the Rust compiler will complain about lifetimes and will suggest the following changes:
```
help: consider introducing a named lifetime parameter
    |
162 | fn extract_fields<'a>(record: &'a StringRecord, field_pos: &'a [usize]) -> Vec<&'a str> {
```
I will change the function definition to the proposed version:
```rust
fn extract_fields<'a>(
    record: &'a StringRecord,
    field_pos: &'a [usize],
) -> Vec<&'a str> {
    field_pos.iter().filter_map(|i| record.get(*i)).collect()
}
```
Indicate the same lifetime 'a for all the values. I have removed the step that converts each value to a String.
Both versions will pass the unit test. The latter version is slightly more efficient and shorter but also has more cognitive overhead for the reader. Choose whichever version you feel you’ll be able to understand six weeks from now.
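The lifetime relationship can be demonstrated without the csv crate. In this std-only analog I sketched, the returned &str values borrow from the slice of Strings, so the two must share the lifetime 'a:

```rust
// A hypothetical std-only analog of extract_fields: the returned &str
// values borrow from `values`, so both are annotated with lifetime 'a.
fn select<'a>(values: &'a [String], pos: &[usize]) -> Vec<&'a str> {
    pos.iter()
        .filter_map(|&i| values.get(i))
        .map(String::as_str)
        .collect()
}

fn main() {
    let row: Vec<String> =
        vec!["Author".into(), "Year".into(), "Title".into()];
    // Select the third and first fields, in that order
    assert_eq!(select(&row, &[2, 0]), vec!["Title", "Author"]);
}
```

Note that only the borrow on values actually needs the named lifetime here; the compiler's suggestion annotates both parameters, which also compiles.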
Here is my final run function that incorporates all these ideas and passes all the tests:
```rust
pub fn run(config: Config) -> MyResult<()> {
    for filename in &config.files {
        match open(filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => match &config.extract {
                Fields(field_pos) => {
                    let mut reader = ReaderBuilder::new()
                        .delimiter(config.delimiter)
                        .has_headers(false)
                        .from_reader(file);

                    let mut wtr = WriterBuilder::new()
                        .delimiter(config.delimiter)
                        .from_writer(io::stdout());

                    for record in reader.records() {
                        let record = record?;
                        wtr.write_record(extract_fields(&record, field_pos))?;
                    }
                }
                Bytes(byte_pos) => {
                    for line in file.lines() {
                        println!("{}", extract_bytes(&line?, byte_pos));
                    }
                }
                Chars(char_pos) => {
                    for line in file.lines() {
                        println!("{}", extract_chars(&line?, char_pos));
                    }
                }
            },
        }
    }
    Ok(())
}
```
If the user has requested fields from a delimited file, use csv::ReaderBuilder to create a mutable reader using the given delimiter, and do not attempt to parse a header row. Use csv::WriterBuilder to write the output to STDOUT using the input delimiter. Iterate through the records and write the extracted fields to the output.
For byte and character selections, iterate the lines of text and print the extracted bytes or extracted characters.
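The extract_chars function called above was written earlier in the chapter. For reference, one possible implementation, a sketch that mirrors extract_bytes but indexes characters rather than bytes, might look like this:

```rust
// A sketch of extract_chars: collect the chars first so that positions
// index whole characters, not bytes (an assumption mirroring the
// extract_bytes approach shown in this chapter).
fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<char> = line.chars().collect();
    char_pos.iter().filter_map(|i| chars.get(*i)).collect()
}

fn main() {
    // 'É' is one char but two bytes; char indexing handles it cleanly
    assert_eq!(extract_chars("Émile", &[0, 1]), "Ém");
    // Out-of-range positions are dropped, as with extract_bytes
    assert_eq!(extract_chars("hi", &[5]), "");
}
```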
I use csv::WriterBuilder to correctly escape delimiters enclosed in fields. None of the tests require this, so you may have found a simpler way to write the output that passes the tests; you will shortly see why I did this.
In the preceding code, you may be curious why I ignore any possible headers in the delimited files.
By default, the csv::Reader will attempt to parse the first row for the column names, but I don’t need to do anything special with these values in this program. If I used the default behavior, I would have to handle the headers separately from the rest of the records; in this context, it’s easier to treat the first row like any other record.
This program passes all the tests and seems to work well for all the test input files. Because I’m using the csv module to parse delimited text files and write the output, this program will correctly handle delimited text, unlike the original cut programs.
I’ll use tests/inputs/books.csv again to demonstrate that cutr will correctly select a field containing the delimiter and will create output that properly escapes the delimiter:
```
$ cargo run -- -d , -f 1,3 tests/inputs/books.csv
Author,Title
Émile Zola,La Confession de Claude
Samuel Beckett,Waiting for Godot
Jules Verne,"20,000 Leagues Under the Sea"
```
These choices make cutr unsuitable as a direct replacement for cut, as many users may count on the behavior of the original tool. As Ralph Waldo Emerson said, “A foolish consistency is the hobgoblin of little minds.” I don’t believe all these tools need to mimic the originals, especially when this seems to be such an improvement.
Here are some ideas for taking the program further:

- Make cutr parse delimited files exactly like the original tools, and make the “correct” parsing of delimited files an option.
- Implement partial ranges like -3 to mean 1-3, or 5- to mean 5 to the end. Be aware that trying to run cargo run -- -f -3 tests/inputs/books.tsv will cause clap to interpret -3 as an option. Use -f=-3 instead.
- Currently the --delimiter for parsing input delimited files is also used for the output delimiter. Add an option to change this, but have it default to the input delimiter.
- Add an output filename option that defaults to STDOUT.
- Check out xsv, a “fast CSV command line toolkit written in Rust.”
Lift your gaze upon the knowledge you gained in this exercise:
You’ve learned how to dereference a value using the * operator. Sometimes the compiler messages indicate that this is the solution, but other times you must infer the syntax when, for instance, you have a &usize but need a usize. Remember that the * essentially removes the &.
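A minimal sketch of my own showing that &usize-to-usize situation:

```rust
fn main() {
    let positions: Vec<usize> = vec![1, 2, 3];
    // iter() yields &usize, so the closure receives references;
    // dereferencing with * produces the usize values themselves
    let doubled: Vec<usize> = positions.iter().map(|i| *i * 2).collect();
    assert_eq!(doubled, vec![2, 4, 6]);
}
```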
Iterator::filter_map combines filter and map for more concise code. You used it with get, an idea that works both for selecting positions from a vector and fields from a StringRecord; positions that don’t exist fail to produce a value and so are removed from the results.
You compared how to return a String versus a &str from a function, the latter of which required indicating lifetimes.
You can now parse and create delimited text using the csv module. While we only looked at files delimited by commas and tab characters, there are many other delimiters in the wild. CSV files are among the most common data formats, so these patterns will likely prove very useful.