Chapter 8. Shave and a Haircut

I’m a mess since you cut me out but Chucky’s arm keeps me company

They Might Be Giants

The cut tool will excise text from a file or STDIN. The selected text could be some range of bytes or characters or might be fields denoted by a delimiter. In Chapter 4 (headr), you learned to select a contiguous range of characters or bytes, but this challenge goes further as the selections can be noncontiguous. Additionally, the output of the program can rearrange the selections, as in the string “3,1,5-7” which should select the third, first, and fifth through seventh bytes, characters, or fields. This will also be the first time you will deal with delimited text where some special character like a comma or a tab creates field boundaries. The challenge program will capture the spirit of the original tools but will not strive for complete fidelity as I will suggest a few changes which I feel are improvements.

What you will learn:

  • How to read and write delimited text files using the csv crate

  • How to dereference a value using *

  • More on using Iterator::filter_map to combine filter and map operations

How cut Works

First, let’s review a portion of the manual page for the BSD version of cut:

CUT(1)                    BSD General Commands Manual                   CUT(1)

NAME
     cut -- cut out selected portions of each line of a file

SYNOPSIS
     cut -b list [-n] [file ...]
     cut -c list [file ...]
     cut -f list [-d delim] [-s] [file ...]

DESCRIPTION
     The cut utility cuts out selected portions of each line (as specified by
     list) from each file and writes them to the standard output.  If no file
     arguments are specified, or a file argument is a single dash ('-'), cut
     reads from the standard input.  The items specified by list can be in
     terms of column position or in terms of fields delimited by a special
     character.  Column numbering starts from 1.

     The list option argument is a comma or whitespace separated set of num-
     bers and/or number ranges.  Number ranges consist of a number, a dash
     ('-'), and a second number and select the fields or columns from the
     first number to the second, inclusive.  Numbers or number ranges may be
     preceded by a dash, which selects all fields or columns from 1 to the
     last number.  Numbers or number ranges may be followed by a dash, which
     selects all fields or columns from the last number to the end of the
     line.  Numbers and number ranges may be repeated, overlapping, and in any
     order.  If a field or column is specified multiple times, it will appear
     only once in the output.  It is not an error to select fields or columns
     not present in the input line.

     The options are as follows:

     -b list
             The list specifies byte positions.

     -c list
             The list specifies character positions.

     -d delim
             Use delim as the field delimiter character instead of the tab
             character.

     -f list
             The list specifies fields, separated in the input by the field
             delimiter character (see the -d option.)  Output fields are sepa-
             rated by a single occurrence of the field delimiter character.

     -n      Do not split multi-byte characters.  Characters will only be out-
             put if at least one byte is selected, and, after a prefix of zero
             or more unselected bytes, the rest of the bytes that form the
             character are selected.

     -s      Suppress lines with no field delimiter characters.  Unless speci-
             fied, lines with no delimiters are passed through unmodified.

As usual, the GNU version offers both short and long flags and several other features:

NAME
       cut - remove sections from each line of files

SYNOPSIS
       cut OPTION... [FILE]...

DESCRIPTION
       Print selected parts of lines from each FILE to standard output.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.

       -b, --bytes=LIST
              select only these bytes

       -c, --characters=LIST
              select only these characters

       -d, --delimiter=DELIM
              use DELIM instead of TAB for field delimiter

       -f, --fields=LIST
              select only these fields;  also print any line that contains  no
              delimiter character, unless the -s option is specified

       -n     with -b: don't split multibyte characters

       --complement
              complement the set of selected bytes, characters or fields

       -s, --only-delimited
              do not print lines not containing delimiters

       --output-delimiter=STRING
              use  STRING  as  the  output delimiter the default is to use the
              input delimiter

       --help display this help and exit

       --version
              output version information and exit

       Use one, and only one of -b, -c or -f.  Each LIST is  made  up  of  one
       range,  or  many ranges separated by commas.  Selected input is written
       in the same order that it is read, and is written exactly  once.   Each
       range is one of:

       N      N'th byte, character or field, counted from 1

       N-     from N'th byte, character or field, to end of line

       N-M    from N'th to M'th (included) byte, character or field

       -M     from first to M'th (included) byte, character or field

       With no FILE, or when FILE is -, read standard input.

Both tools implement the selection ranges in similar ways where numbers can be selected individually, in closed ranges like “1-3”, or in half-open ranges like “-3” to indicate 1 to 3 or “5-” to indicate 5 to the end. Additionally, the original tools will not allow a field to be repeated in the output and will rearrange them in ascending order. The challenge program will instead allow only for a comma-separated list of either single numbers or bounded ranges like “2-4” and will use the selections in the given order to create the output.

I’ll show you some examples of how the original tools work to the extent that the challenge program should implement. I will use the files found in the 08_cutr/tests/inputs directory. First, consider a file with columns of information each in a fixed number of characters or so-called fixed-width text:

$ cd 08_cutr/tests/inputs
$ cat books.txt
Author              Year Title
Émile Zola          1865 La Confession de Claude
Samuel Beckett      1952 Waiting for Godot
Jules Verne         1870 20,000 Leagues Under the Sea

The Author column takes the first 20 characters:

$ cut -c 1-20 books.txt
Author
Émile Zola
Samuel Beckett
Jules Verne

The publication Year column occupies the next 5 characters:

$ cut -c 21-25 books.txt
Year
1865
1952
1870

The Title column occupies the last 30 characters. Note here that I intentionally request a larger range than exists to show that this is not considered an error:

$ cut -c 26-70 books.txt
Title
La Confession de Claude
Waiting for Godot
20,000 Leagues Under the Sea

I find it annoying that I cannot use this tool to rearrange the output such as by requesting the range for the Title followed by that for the Author:

$ cut -c 26-55,1-20 books.txt
Author              Title
Émile Zola          La Confession de Claude
Samuel Beckett      Waiting for Godot
Jules Verne         20,000 Leagues Under the Sea

I can grab just the first character like so:

$ cut -c 1 books.txt
A
É
S
J

As you’ve seen in previous chapters, bytes and characters are not always interchangeable. For instance, the “É” in “Émile Zola” is a Unicode character that is composed of two bytes, so asking for just one will result in an invalid character that is represented with the Unicode replacement character:

$ cut -b 1 books.txt
A
�
S
J
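To see the difference between bytes and characters in Rust itself, here is a minimal sketch that uses only the standard library. The helper name first_byte_lossy is my own invention for this illustration and is not part of the challenge program:

```rust
/// Return the first byte of a string as (possibly lossy) UTF-8 text.
/// `first_byte_lossy` is a hypothetical helper used only for this illustration.
fn first_byte_lossy(text: &str) -> String {
    let bytes = text.as_bytes();
    // Take at most one byte; an empty string yields an empty slice.
    let end = bytes.len().min(1);
    String::from_utf8_lossy(&bytes[..end]).into_owned()
}

fn main() {
    // "\u{c9}" is the precomposed character É, which UTF-8 encodes as two bytes.
    let name = "\u{c9}mile";
    println!("bytes = {}", name.len());
    println!("chars = {}", name.chars().count());
    // The first character is intact, but the first byte alone is not valid UTF-8,
    // so the lossy conversion substitutes the Unicode replacement character.
    println!("first char = {:?}", name.chars().next());
    println!("first byte = {}", first_byte_lossy(name));
}
```

The string has six bytes but only five characters, which is why cut -b 1 and cut -c 1 disagree on the second line of books.txt.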

In my experience, fixed-width data files are less common than those where the columns of data are delimited with a character like a comma or a tab to show the boundaries of the data. Consider the same data in the file books.tsv where the file extension .tsv stands for tab-separated values:

$ cat books.tsv
Author	Year	Title
Émile Zola	1865	La Confession de Claude
Samuel Beckett	1952	Waiting for Godot
Jules Verne	1870	20,000 Leagues Under the Sea

By default, cut will assume the tab character is the field delimiter, so I can use the -f option to select, for instance, the publication year in the second column and the title in the third column like so:

$ cut -f 2,3 books.tsv
Year	Title
1865	La Confession de Claude
1952	Waiting for Godot
1870	20,000 Leagues Under the Sea

The comma is another common delimiter, and such files often have the extension .csv for comma-separated values (CSV). Following is the same data as a CSV file:

$ cat books.csv
Author,Year,Title
Émile Zola,1865,La Confession de Claude
Samuel Beckett,1952,Waiting for Godot
Jules Verne,1870,"20,000 Leagues Under the Sea"

To parse a CSV file, I must indicate the delimiter using the -d option. Note that I’m still unable to reorder the fields in the output as I indicate “2,1” for the second column followed by the first, but I get the columns back in their original order:

$ cut -d , -f 2,1 books.csv
Author,Year
Émile Zola,1865
Samuel Beckett,1952
Jules Verne,1870

You may have noticed that the third title contains a comma in 20,000 and so the title has been enclosed in quotes to indicate that the comma is not the field delimiter. This is a way to escape the delimiter or to tell the parser to ignore it. Unfortunately, both the BSD and GNU versions don’t recognize this and will truncate the title prematurely:

$ cut -d , -f 1,3 books.csv
Author,Title
Émile Zola,La Confession de Claude
Samuel Beckett,Waiting for Godot
Jules Verne,"20
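The truncated title is what you would expect from any parser that splits on every delimiter without regard to quotes. The following sketch reproduces the behavior with a plain str::split; the function naive_field is a hypothetical name for illustration and only approximates what cut does internally:

```rust
/// Split a line on every comma, ignoring quotes, and return one field.
/// `naive_field` is a hypothetical helper for illustration only.
fn naive_field(line: &str, index: usize) -> Option<&str> {
    line.split(',').nth(index)
}

fn main() {
    let line = r#"Jules Verne,1870,"20,000 Leagues Under the Sea""#;
    // The comma inside the quoted title is treated as a delimiter,
    // so the third field stops right after the opening quote and "20".
    println!("{:?}", naive_field(line, 2));
    // The rest of the title lands in a fourth field that should not exist.
    println!("{:?}", naive_field(line, 3));
}
```

A parser that understands quoting would instead return the whole title as the third field.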

Noninteger values for any of the options that accept a list are rejected:

$ cut -f foo,bar books.tsv
cut: [-cf] list: illegal list value

Nonexistent files are handled in the course of processing, printing a message to STDERR that the file does not exist:

$ cut -c 1 books.txt blargh movies1.csv
A
É
S
J
cut: blargh: No such file or directory
t
T
L

Finally, the program will read STDIN by default or if the given input filename is the dash (-):

$ cat books.tsv | cut -f 2
Year
1865
1952
1870

The challenge program is expected to implement just this much of the original with the following changes:

  1. Ranges must indicate both start and stop values (inclusive)

  2. Output columns should be in the order specified by the user

  3. Ranges may include repeated values

  4. The parsing of delimited text files should respect escaped delimiters

Getting Started

The name of the challenge program should be cutr (pronounced cutter, I think) for a Rust version of cut. I recommend you begin with cargo new cutr and then copy the 08_cutr/tests directory into your project. My solution will involve the following crates which you should add to your Cargo.toml:

[dependencies]
clap = "2.33"
csv = "1"
regex = "1"

[dev-dependencies]
assert_cmd = "1"
predicates = "1"
rand = "0.8"

Run cargo test to download the dependencies and run the tests, all of which should fail.

Defining the Arguments

I recommend the following structure for your src/main.rs:

fn main() {
    if let Err(e) = cutr::get_args().and_then(cutr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}

Following is how I started my src/lib.rs. I want to highlight that I’m creating another enum as in Chapter 7, but this time the variants can hold a value. The value in this case will be another type alias I’m creating called PositionList, which is a Vec<usize>:

use crate::Extract::*; 1
use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;
type PositionList = Vec<usize>;  2

#[derive(Debug)] 3
pub enum Extract {
    Fields(PositionList),
    Bytes(PositionList),
    Chars(PositionList),
}

#[derive(Debug)]
pub struct Config {
    files: Vec<String>, 4
    delimiter: u8, 5
    extract: Extract, 6
}
1

This allows me to use Fields(...) instead of Extract::Fields(...).

2

A PositionList is a vector of unsigned integer (usize) values.

3

Define an enum to hold the variants for extracting fields, bytes, or characters.

4

The files will be a vector of strings.

5

The delimiter should be a single byte.

6

The extract field will hold one of the Extract variants.
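If enum variants that hold values are new to you, the following sketch shows how a match expression can reach the value inside each variant. The names Selection and describe are hypothetical and exist only for this illustration; the shape mirrors the Extract enum above:

```rust
// Hypothetical names for illustration; the shape mirrors the Extract enum.
#[derive(Debug)]
enum Selection {
    Fields(Vec<usize>),
    Bytes(Vec<usize>),
    Chars(Vec<usize>),
}

/// Describe a selection by matching on the variant and borrowing its value.
fn describe(selection: &Selection) -> String {
    match selection {
        Selection::Fields(pos) => format!("{} field(s)", pos.len()),
        Selection::Bytes(pos) => format!("{} byte(s)", pos.len()),
        Selection::Chars(pos) => format!("{} character(s)", pos.len()),
    }
}

fn main() {
    let selection = Selection::Fields(vec![0, 2]);
    println!("{:?} selects {}", selection, describe(&selection));
}
```

Because every variant must be handled, the compiler will point out any kind of extraction the match forgets.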

I decided to represent a range selection of something like “3,1,5-7” as [3, 1, 5, 6, 7]. Well, I actually subtract 1 from each value because I will be dealing with 0-offsets, but the point is that I thought it easiest to explicitly list all the positions in the order in which they will be selected. You may prefer to handle this differently.
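To make the representation concrete without giving away the parsing, here is a small sketch that expands ranges I have already written out by hand. The helper expand is hypothetical, and it assumes every range starts at 1 or higher so that the subtraction cannot underflow:

```rust
use std::ops::RangeInclusive;

/// Expand 1-based inclusive ranges into 0-based positions, keeping the given order.
/// `expand` is a hypothetical helper; it assumes every range starts at 1 or higher.
fn expand(ranges: &[RangeInclusive<usize>]) -> Vec<usize> {
    ranges
        .iter()
        // A RangeInclusive is itself an iterator, so clone it in order to consume it.
        .flat_map(|range| range.clone())
        // Convert 1-based positions to 0-based indexes.
        .map(|n| n - 1)
        .collect()
}

fn main() {
    // The selection "3,1,5-7" written out as explicit ranges.
    let positions = expand(&[3..=3, 1..=1, 5..=7]);
    println!("{:?}", positions); // prints [2, 0, 4, 5, 6]
}
```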

You can start your get_args with the following:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("cutr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust cut")
        // What goes here?
        .get_matches();

    Ok(Config {
        files: ...
        delimiter: ...
        extract: ...
    })
}

Begin your run by printing the config:

pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", &config);
    Ok(())
}

See if you can get your program to print the following usage:

$ cargo run -- --help
cutr 0.1.0
Ken Youens-Clark <[email protected]>
Rust cut

USAGE:
    cutr [OPTIONS] <FILE>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -b, --bytes <BYTES>        Selected bytes
    -c, --chars <CHARS>        Selected characters
    -d, --delim <DELIMITER>    Field delimiter [default: 	]
    -f, --fields <FIELDS>      Selected fields

ARGS:
    <FILE>...    Input file(s) [default: -]

I wrote a function called parse_pos that works like the parse_positive_int function from Chapter 4. Here is how you might start it:

fn parse_pos(range: &str) -> MyResult<PositionList> { 1
    unimplemented!();
}
1

The function accepts a &str and might return a PositionList.

I have, of course, written a unit test for you. Add the following to your src/lib.rs:

#[cfg(test)]
mod tests {
    use super::parse_pos;

    #[test]
    fn test_parse_pos() {
        assert!(parse_pos("").is_err());

        let res = parse_pos("0");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"0\"",);

        let res = parse_pos("a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a\"",);

        let res = parse_pos("1,a");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "illegal list value: \"a\"",);

        let res = parse_pos("2-1");
        assert!(res.is_err());
        assert_eq!(
            res.unwrap_err().to_string(),
            "First number in range (2) must be lower than second number (1)"
        );

        let res = parse_pos("1");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0]);

        let res = parse_pos("1,3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0, 2]);

        let res = parse_pos("1-3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0, 1, 2]);

        let res = parse_pos("1,7,3-5");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), vec![0, 6, 2, 3, 4]);
    }
}

At this point, I expect you can read the above code well enough to understand exactly how the function should work. I recommend you stop reading at this point and write the code that will pass this test.

After cargo test test_parse_pos passes, your program should reject an invalid list value and print an error message:

$ cargo run -- -f foo,bar tests/inputs/books.tsv
illegal list value: "foo"

It should also reject invalid ranges:

$ cargo run -- -f 3-2 tests/inputs/books.tsv
First number in range (3) must be lower than second number (2)

When given valid arguments, your program should be able to display a structure like so:

$ cargo run -- -f 1 -d , tests/inputs/movies1.csv
Config {
    files: [
        "tests/inputs/movies1.csv", 1
    ],
    delimiter: 44, 2
    extract: Fields( 3
        [
            0,
        ],
    ),
}
1

The positional argument goes into files.

2

The -d value of a comma has a byte value of 44.

3

The -f value of 1 creates the Extract::Fields([0]) variant.

When parsing a TSV file, use the tab as the default delimiter, which has a byte value of 9:

$ cargo run -- -f 2-3 tests/inputs/movies1.tsv
Config {
    files: [
        "tests/inputs/movies1.tsv",
    ],
    delimiter: 9,
    extract: Fields(
        [
            1,
            2,
        ],
    ),
}

Note that the options for -f|--fields, -b|--bytes, and -c|--chars are all mutually exclusive, so any attempt to combine them should be rejected:

$ cargo run -- -f 1 -b 8-9 tests/inputs/movies1.tsv
error: The argument '--fields <FIELDS>' cannot be used with '--bytes <BYTES>'

USAGE:
    cutr <FILE>... --bytes <BYTES> --delim <DELIMITER> --fields <FIELDS>

Try to get just this much of your program working before you proceed. You should be able to pass cargo test dies:

running 8 tests
test dies_chars_bytes ... ok
test dies_chars_bytes_fields ... ok
test dies_chars_fields ... ok
test dies_bytes_fields ... ok
test dies_not_enough_args ... ok
test dies_bad_digit_field ... ok
test dies_bad_digit_bytes ... ok
test dies_bad_digit_chars ... ok

Parsing the Position List

I assume you wrote a passing parse_pos function, so compare your version to mine:

fn parse_pos(range: &str) -> MyResult<PositionList> {
    let mut fields: Vec<usize> = vec![]; 1
    let range_re = Regex::new(r"(\d+)-(\d+)").unwrap(); 2
    for val in range.split(',') { 3
        if let Some(cap) = range_re.captures(val) { 4
            let n1: &usize = &cap[1].parse()?; 5
            let n2: &usize = &cap[2].parse()?;

            if n1 < n2 { 6
                for n in *n1..=*n2 { 7
                    fields.push(n);
                }
            } else {
                return Err(From::from(format!( 8
                    "First number in range ({}) \
                    must be lower than second number ({})",
                    n1, n2
                )));
            }
        } else {
            match val.parse() { 9
                Ok(n) if n > 0 => fields.push(n),
                _ => {
                    return Err(From::from(format!(
                        "illegal list value: \"{}\"",
                        val
                    )))
                }
            }
        }
    }

    // Subtract one for field indexes
    Ok(fields.into_iter().map(|i| i - 1).collect()) 10
}
1

Create a mutable vector to hold all the positions.

2

Create a regular expression to capture two numbers separated by a dash.

3

Split the range values on a comma.

4

See if this part of the list matches the regex.

5

Convert the two captured numbers to usize integer values.

6

If the first number is less than the second, iterate through the range and add the values to fields.

7

Use the * operator to dereference the two number values.

8

Return an error about an invalid range.

9

If the value can be converted to a usize greater than 0, add it to the list; otherwise, return an error.

10

Return the given list with all the values adjusted down by 1.

Note

In the regular expression, I use r"" to denote a raw string so that Rust won’t try to interpret backslash escapes in the string. For instance, you’ve seen that Rust will interpret \n as a newline. Without the raw string, the compiler would complain that \d is an unknown character escape:

error: unknown character escape: `d`
   --> src/lib.rs:155:34
    |
155 |     let range_re = Regex::new("(\d+)-(\d+)").unwrap();
    |                                  ^ unknown character escape
    |
    = help: for more information, visit <https://static.rust-lang.org
      /doc/master/reference.html#literals>

I would like to highlight two new pieces of syntax in the preceding code. First, I used parentheses in the regular expression (\d+)-(\d+) to capture one or more digits, followed by a dash, followed by another capture of one or more digits, as shown in Figure 8-1. If the regular expression matches the given string, then I can use Regex::captures to extract the digits from the string. Note that the capture groups are numbered starting from 1, so the contents of the first capturing parentheses are available in position 1 of the captures. Because the captured values matched digit characters, they should be parsable as usize values.

Figure 8-1. The parentheses in the regular expression will capture the values they surround

The second piece of syntax is the * operator in for n in *n1..=*n2. If you remove these and try to compile the code, you will see the following error:

error[E0277]: the trait bound `&usize: Step` is not satisfied
   --> src/lib.rs:165:34
    |
165 |                         for n in n1..=n2 {
    |                                  ^^^^^^^ the trait `Step` is not
    |                                          implemented for `&usize`

This is one case where the compiler’s message is a bit cryptic and does not include the solution. The problem is that n1 and n2 are &usize references. A reference is a pointer to a piece of memory, not a copy of the value, and so the pointer must be dereferenced to use the underlying value. There are many times when Rust silently dereferences values, but this is one time when the * operator is required.
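The following sketch isolates the same situation in a few lines. The function sum_between is a hypothetical helper for illustration; the point is only that a range needs owned usize values, while a comparison works directly on the references:

```rust
/// Sum the inclusive range between two borrowed bounds.
/// `sum_between` is a hypothetical helper for illustration.
fn sum_between(start: &usize, end: &usize) -> usize {
    // `start..=end` would be a range of `&usize`, which cannot be iterated;
    // `*start..=*end` copies the underlying values out of the references.
    (*start..=*end).sum()
}

fn main() {
    let (n1, n2) = (2, 4);
    println!("{}", sum_between(&n1, &n2)); // prints 9

    // Comparisons such as `<` work on references without an explicit `*`.
    let is_lower = &n1 < &n2;
    println!("{}", is_lower); // prints true
}
```

This is why n1 < n2 compiles in parse_pos without any * while the range expression does not.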

Here is how I incorporate these ideas into my get_args. First, I define all the arguments:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("cutr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust cut")
        .arg(
            Arg::with_name("files") 1
                .value_name("FILE")
                .help("Input file(s)")
                .required(true)
                .default_value("-")
                .min_values(1),
        )
        .arg(
            Arg::with_name("delimiter") 2
                .value_name("DELIMITER")
                .help("Field delimiter")
                .short("d")
                .long("delim")
                .default_value("\t"),
        )
        .arg(
            Arg::with_name("fields") 3
                .value_name("FIELDS")
                .help("Selected fields")
                .short("f")
                .long("fields")
                .conflicts_with_all(&["chars", "bytes"]),
        )
        .arg(
            Arg::with_name("bytes") 4
                .value_name("BYTES")
                .help("Selected bytes")
                .short("b")
                .long("bytes")
                .conflicts_with_all(&["fields", "chars"]),
        )
        .arg(
            Arg::with_name("chars") 5
                .value_name("CHARS")
                .help("Selected characters")
                .short("c")
                .long("chars")
                .conflicts_with_all(&["fields", "bytes"]),
        )
        .get_matches();
1

The required files accepts multiple values and defaults to the dash.

2

The delimiter uses the tab as the default value.

3

The fields option conflicts with chars and bytes.

4

The bytes option conflicts with fields and chars.

5

The chars option conflicts with fields and bytes.

Next, I convert the delimiter to bytes and verify that there is only one:

    let delimiter = matches.value_of("delimiter").unwrap_or("\t");
    let delim_bytes = delimiter.as_bytes();
    if delim_bytes.len() != 1 {
        return Err(From::from(format!(
            "--delim \"{}\" must be a single byte",
            delimiter
        )));
    }
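If you want to check the byte values for yourself, the following sketch expresses the same single-byte rule with a slice pattern. The function single_byte is a hypothetical helper for illustration and is not how my get_args is written:

```rust
/// Return the only byte of `delim`, or an error message if it is not exactly one byte.
/// `single_byte` is a hypothetical helper for illustration.
fn single_byte(delim: &str) -> Result<u8, String> {
    match delim.as_bytes() {
        // Exactly one byte: copy it out of the slice.
        [byte] => Ok(*byte),
        // Zero bytes or more than one byte.
        _ => Err(format!("--delim \"{}\" must be a single byte", delim)),
    }
}

fn main() {
    println!("{:?}", single_byte(","));      // a comma is byte 44
    println!("{:?}", single_byte("\t"));     // a tab is byte 9
    println!("{:?}", single_byte(",,"));     // two bytes are rejected
    println!("{:?}", single_byte("\u{e9}")); // é is one character but two bytes
}
```

Note that a single non-ASCII character such as é is rejected because the limit is on bytes, not characters.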

I use the parse_pos function to handle all the optional list values:

    let fields = matches.value_of("fields").map(parse_pos).transpose()?;
    let bytes = matches.value_of("bytes").map(parse_pos).transpose()?;
    let chars = matches.value_of("chars").map(parse_pos).transpose()?;
Note

I’m introducing Option::transpose here that “transposes an Option of a Result into a Result of an Option.”
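Here is a minimal sketch of Option::transpose in isolation, parsing a single optional number instead of a position list. The function parse_optional is a hypothetical name for illustration:

```rust
use std::num::ParseIntError;

/// Parse an optional string into an optional number.
/// `parse_optional` is a hypothetical helper for illustration.
fn parse_optional(value: Option<&str>) -> Result<Option<usize>, ParseIntError> {
    // map produces Option<Result<usize, _>>;
    // transpose flips it into Result<Option<usize>, _> so that ? can be used on it.
    value.map(|v| v.parse::<usize>()).transpose()
}

fn main() {
    println!("{:?}", parse_optional(None));      // no value given: Ok(None)
    println!("{:?}", parse_optional(Some("3"))); // valid value: Ok(Some(3))
    println!("{:?}", parse_optional(Some("x"))); // invalid value: an Err
}
```

A missing option is therefore not an error, while a present but unparsable option is.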

I then figure out which Extract variant to create. I should never trigger the else clause in this code, but it’s good to have:

    let extract = if let Some(field_pos) = fields {
        Fields(field_pos)
    } else if let Some(byte_pos) = bytes {
        Bytes(byte_pos)
    } else if let Some(char_pos) = chars {
        Chars(char_pos)
    } else {
        return Err(From::from("Must have --fields, --bytes, or --chars"));
    };

If the code makes it to this point, then I appear to have valid arguments that I can return:

    Ok(Config {
        files: matches.values_of_lossy("files").unwrap(),
        delimiter: delim_bytes[0],
        extract,
    })
}

Next, you will need to figure out how you will use this information to extract the desired bits from the inputs.

Extracting Characters or Bytes

In Chapters 4 (headr) and 5 (wcr), you learned how to process lines, bytes, and characters in a file. You should draw on those programs to help you select characters and bytes in this challenge. One difference is that line endings need not be preserved, so you may use BufRead::lines to read the lines of input text.
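As a reminder of how BufRead::lines treats line endings, here is a short sketch that reads from an in-memory Cursor rather than a file. The function read_lines is a hypothetical helper for illustration:

```rust
use std::io::{BufRead, Cursor};

/// Collect the lines of any buffered reader, without their line endings.
/// `read_lines` is a hypothetical helper for illustration.
fn read_lines(reader: impl BufRead) -> std::io::Result<Vec<String>> {
    reader.lines().collect()
}

fn main() -> std::io::Result<()> {
    // Cursor stands in for a file so the example needs no input on disk.
    let input = Cursor::new("Author\tYear\r\nJules Verne\t1870\n");
    for line in read_lines(input)? {
        // Neither the "\r\n" nor the "\n" ending appears in the output.
        println!("{:?}", line);
    }
    Ok(())
}
```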

To start, you might consider bringing in the open function used before to help open each file:

fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}

You can expand your run to handle good and bad files:

pub fn run(config: Config) -> MyResult<()> {
    for filename in &config.files {
        match open(filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(_file) => println!("Opened {}", filename),
        }
    }
    Ok(())
}

This will require some more imports:

use crate::Extract::*;
use clap::{App, Arg};
use regex::Regex;
use std::{
    error::Error,
    fs::File,
    io::{self, BufRead, BufReader},
};

Now consider how you might extract characters from each line of the filehandle. I wrote a function called extract_chars that will return a new string composed of the characters at the given index positions:

fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    unimplemented!();
}
Tip

I’m using &[usize] instead of &PositionList because of the suggestion from cargo clippy:

warning: writing `&Vec<_>` instead of `&[_]` involves one more reference
and cannot be used with non-Vec-based slices
   --> src/lib.rs:214:40
    |
214 | fn extract_chars(line: &str, char_pos: &PositionList) -> String {
    |                                        ^^^^^^^^^^^^^
    |
    = note: `#[warn(clippy::ptr_arg)]` on by default
    = help: for further information visit
    https://rust-lang.github.io/rust-clippy/master/index.html#ptr_arg
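The practical benefit of the slice type is that callers are not forced to own a Vec. The following sketch uses a hypothetical function, count_valid, to show the kinds of arguments a &[usize] parameter accepts:

```rust
/// Count the positions that fall inside a line of the given length.
/// `count_valid` is a hypothetical helper for illustration.
fn count_valid(positions: &[usize], len: usize) -> usize {
    positions.iter().filter(|&&pos| pos < len).count()
}

fn main() {
    let from_vec: Vec<usize> = vec![0, 2, 9];
    let from_array: [usize; 3] = [0, 2, 9];

    // A &Vec<usize> coerces to &[usize] automatically.
    println!("{}", count_valid(&from_vec, 3)); // prints 2
    // Arrays and sub-slices work too, which a &Vec<usize> parameter would not allow.
    println!("{}", count_valid(&from_array, 3)); // prints 2
    println!("{}", count_valid(&from_vec[1..], 3)); // prints 1
}
```

This is also why the tests can pass array literals like &[0, 2] directly.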

Here is a test you can add to your tests module to help you see how the function might work:

#[test]
fn test_extract_chars() {
    assert_eq!(extract_chars("", &[0]), "".to_string());
    assert_eq!(extract_chars("ábc", &[0]), "á".to_string());
    assert_eq!(extract_chars("ábc", &[0, 2]), "ác".to_string());
    assert_eq!(extract_chars("ábc", &[0, 1, 2]), "ábc".to_string());
    assert_eq!(extract_chars("ábc", &[2, 1]), "cb".to_string());
    assert_eq!(extract_chars("ábc", &[0, 1, 4]), "áb".to_string());
}

I also wrote a similar extract_bytes function to parse out bytes:

fn extract_bytes(line: &str, byte_pos: &[usize]) -> String {
    unimplemented!();
}

Here is the test:

#[test]
fn test_extract_bytes() {
    assert_eq!(extract_bytes("ábc", &[0]), "�".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1]), "á".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1, 2]), "áb".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1, 2, 3]), "ábc".to_string());
    assert_eq!(extract_bytes("ábc", &[3, 2]), "cb".to_string());
    assert_eq!(extract_bytes("ábc", &[0, 1, 5]), "á".to_string());
}

Once you have written these two functions so that they pass the given tests, iterate the lines of input text and use the preceding functions to extract and print the desired characters or bytes from each line. You should be able to pass all but the following when you run cargo test:

failures:
    csv_f1
    csv_f1_2
    csv_f1_3
    csv_f2
    csv_f2_3
    csv_f3
    tsv_f1
    tsv_f1_2
    tsv_f1_3
    tsv_f2
    tsv_f2_3
    tsv_f3

Parsing Delimited Text Files

To pass the final tests, you will need to learn how to parse delimited text files. As stated earlier, this file format uses some delimiter like a comma or tab to indicate the boundaries of a field. Sometimes the delimiting character may also be part of the data, in which case the field should be enclosed in quotes to escape the delimiter. The easiest way to properly parse delimited text is to use something like the csv crate. I highly recommend that you first read the tutorial.

Next, consider the following example that shows how you can use this crate to parse delimited data. If you would like to compile and run this code, start a new project and use the following for src/main.rs. Be sure to add the csv dependency to your Cargo.toml and copy the books.csv file into the project.

use csv::StringRecord;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let mut reader = csv::ReaderBuilder::new() 1
        .delimiter(b',') 2
        .from_reader(File::open("books.csv")?); 3

    fmt(reader.headers()?); 4
    for record in reader.records() { 5
        fmt(&record?);
    }

    Ok(())
}

fn fmt(rec: &StringRecord) {
    println!(
        "{}",
        rec.into_iter() 6
            .map(|v| format!("{:20}", v)) 7
            .collect::<Vec<String>>() 8
            .join("")
    )
}
1

Use csv::ReaderBuilder to parse a file.

2

The delimiter must be a single u8 byte.

3

The from_reader method accepts a value that implements the Read trait.

4

The Reader::headers will return the column names in the first row as a StringRecord.

5

The Reader::records method provides access to an iterator over StringRecord values.

6

The field values in a StringRecord can be iterated.

7

Use Iterator::map to format the values into a field 20 characters wide.

8

Collect the strings into a vector and join them into a new string for printing.

If I run this program, I will see the following output:

$ cargo run
Author              Year                Title
Émile Zola          1865                La Confession de Claude
Samuel Beckett      1952                Waiting for Godot
Jules Verne         1870                20,000 Leagues Under the Sea

Coming back to the challenge program, think about how you might use some of these ideas to write a function like extract_fields that accepts a StringRecord and pulls out the fields found in the PositionList:

fn extract_fields(record: &StringRecord, field_pos: &[usize]) -> Vec<String> {
    unimplemented!();
}

Here is a test you could use:

#[test]
fn test_extract_fields() {
    let rec = StringRecord::from(vec!["Captain", "Sham", "12345"]);
    assert_eq!(extract_fields(&rec, &[0]), &["Captain"]);
    assert_eq!(extract_fields(&rec, &[1]), &["Sham"]);
    assert_eq!(extract_fields(&rec, &[0, 2]), &["Captain", "12345"]);
    assert_eq!(extract_fields(&rec, &[0, 3]), &["Captain"]);
    assert_eq!(extract_fields(&rec, &[1, 0]), &["Sham", "Captain"]);
}

I think that might be enough to help you find a solution. This is a challenging program, so don’t give up too quickly. Fear is the mind-killer.

Solution

I’ll show you my solution, starting with how I select the characters:

fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<_> = line.chars().collect(); 1
    char_pos.iter().filter_map(|i| chars.get(*i)).collect() 2
}
1

Break the line into a vector of characters. The Vec type annotation is required by Rust because Iterator::collect can return many different types of collections.

2

Use Iterator::filter_map with Vec::get to select valid character positions and collect them into a new string.

The filter_map function, as you might imagine, combines the operations of filter and map. The closure uses chars.get(*i) in an attempt to select the character at the given index. This might fail if the user has requested positions beyond the end of the string, but a failure to select a character should not generate an error. Vec::get will return an Option<&char>, and filter_map will skip all the None values and unwrap the Some values. Here is a longer way to write this:

fn extract_chars(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<char> = line.chars().collect();
    char_pos
        .iter()
        .map(|i| chars.get(*i)) // 1
        .filter(|v| v.is_some()) // 2
        .map(|v| v.unwrap()) // 3
        .collect::<String>() // 4
}
1

Try to get the characters at the given index positions.

2

Filter out the None values.

3

Unwrap the Some values.

4

Collect the filtered characters into a String.

In the preceding code, I use *i to dereference the value similar to earlier in the chapter. If I remove the *, the compiler would complain thusly:

error[E0277]: the type `[char]` cannot be indexed by `&usize`
   --> src/lib.rs:227:35
    |
227 |         .filter_map(|i| chars.get(i))
    |                                   ^ slice indices are of type
    |                                     `usize` or ranges of `usize`
    |
    = help: the trait `SliceIndex<[char]>` is not implemented for `&usize`

The error message is vague on how to fix this, but the problem is that i is a &usize and I need a usize. The dereference operator * removes the & reference, hence the name.
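There are other ways to arrive at a usize inside the closure. The following sketch shows two alternatives I believe are equivalent to using *i; the function names extract_chars_pattern and extract_chars_copied are hypothetical and exist only to keep the variants apart:

```rust
/// Same behavior as extract_chars, but the closure pattern `|&i|`
/// destructures the `&usize` so that `i` is a plain `usize`.
fn extract_chars_pattern(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<char> = line.chars().collect();
    char_pos.iter().filter_map(|&i| chars.get(i)).collect()
}

/// Another option: Iterator::copied turns the `&usize` items into
/// `usize` values before they reach the closure.
fn extract_chars_copied(line: &str, char_pos: &[usize]) -> String {
    let chars: Vec<char> = line.chars().collect();
    char_pos
        .iter()
        .copied()
        .filter_map(|i| chars.get(i))
        .collect()
}
```

With either version, positions beyond the end of the line are silently skipped, and the output follows the order of the requested positions rather than the order of the characters in the line.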

The selection of bytes is very similar, but the selected bytes are references (&u8) that must be explicitly cloned into u8 values. As with extract_chars, the goal is to return a new string, but there is a potential problem if the byte selection breaks Unicode characters and so produces an invalid UTF-8 string:

fn extract_bytes(line: &str, byte_pos: &[usize]) -> String {
    let bytes = line.as_bytes(); // 1
    let selected: Vec<u8> = byte_pos
        .iter()
        .filter_map(|i| bytes.get(*i)) // 2
        .cloned() // 3
        .collect();
    String::from_utf8_lossy(&selected).into_owned() // 4
}
1

View the line as a slice of bytes.

2

Use filter_map to select bytes at the wanted positions.

3

Use Iterator::cloned to turn the &u8 references into u8 values so they can be collected into a Vec<u8>.

4

Use String::from_utf8_lossy to generate a string from possibly invalid bytes.

You may wonder why I used Iterator::cloned in the preceding code. Let me show you the error message if I remove it:

error[E0277]: a value of type `Vec<u8>` cannot be built from
an iterator over elements of type `&u8`
   --> src/lib.rs:215:10
    |
215 |         .collect();
    |          ^^^^^^^ value of type `Vec<u8>` cannot be built from
    |                  `std::iter::Iterator<Item=&u8>`
    |
    = help: the trait `FromIterator<&u8>` is not implemented for `Vec<u8>`

The filter_map will produce an iterator over &u8 values, which are references to the bytes, so collecting them directly could only create a Vec<&u8>. I need a Vec<u8> because String::from_utf8_lossy expects &[u8], a slice of bytes. As the Iterator::cloned documentation notes, this method “Creates an iterator which clones all of its elements. This is useful when you have an iterator over &T, but you need an iterator over T.”
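To see what happens when a byte selection splits a multibyte character, here is a sketch of the same function that is meant to behave identically. As an assumption on my part, it swaps in Iterator::copied for Iterator::cloned, which is possible because u8 is a Copy type:

```rust
fn extract_bytes(line: &str, byte_pos: &[usize]) -> String {
    let bytes = line.as_bytes();
    let selected: Vec<u8> = byte_pos
        .iter()
        .filter_map(|i| bytes.get(*i)) // skip positions past the end of the line
        .copied() // u8 is Copy, so copied() does the same job as cloned() here
        .collect();
    // Invalid UTF-8 sequences are replaced with U+FFFD instead of causing an error.
    String::from_utf8_lossy(&selected).into_owned()
}
```

The character "á" occupies two bytes in UTF-8. Selecting positions 0 and 1 from "ábc" returns the whole character, while selecting only position 0 returns the Unicode replacement character U+FFFD because a lone leading byte is not valid UTF-8.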

Finally, here is one way to extract the fields from a csv::StringRecord:

fn extract_fields(record: &StringRecord, field_pos: &[usize]) -> Vec<String> {
    field_pos
        .iter()
        .filter_map(|i| record.get(*i)) // 1
        .map(|v| v.to_string()) // 2
        .collect() // 3
}
1

Use csv::StringRecord::get to try to get the field for the index position.

2

Use Iterator::map to turn &str values into String values.

3

Collect the results into a Vec<String>.

I would like to show you another way to write this function so that it will return a Vec<&str> which will be slightly more memory efficient as it will not have to make copies of the strings. The tradeoff is that I must indicate the lifetimes. First, let me naïvely try to write it like so:

// This will not compile
fn extract_fields(record: &StringRecord, field_pos: &[usize]) -> Vec<&str> {
    field_pos.iter().filter_map(|i| record.get(*i)).collect()
}

If I try to compile this, the Rust compiler will complain about lifetimes and will suggest the following changes:

help: consider introducing a named lifetime parameter
    |
162 | fn extract_fields<'a>(record: &'a StringRecord, field_pos: &'a [usize])
-> Vec<&'a str> {

I will change the function definition to the proposed version:

fn extract_fields<'a>( // 1
    record: &'a StringRecord,
    field_pos: &'a [usize],
) -> Vec<&'a str> {
    field_pos.iter().filter_map(|i| record.get(*i)).collect() // 2
}
1

Indicate the same lifetime 'a for all the values.

2

I have removed the step to convert each value to a String.

Both versions will pass the unit test. The latter version is slightly more efficient and shorter but also has more cognitive overhead for the reader. Choose whichever version you feel you’ll be able to understand six weeks from now.
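Although the compiler proposes the same lifetime for both arguments, the returned string slices borrow only from the record. The following sketch illustrates this with a slice of String values as a stand-in for csv::StringRecord, so that it compiles without the csv crate; the name extract_fields_borrowed is hypothetical. I believe the same signature would also work with a StringRecord, since StringRecord::get returns a &str borrowed from the record:

```rust
/// Stand-in for extract_fields that borrows from a slice of Strings
/// instead of a csv::StringRecord. Only `record` carries the lifetime 'a
/// because the returned &str values borrow from it alone.
fn extract_fields_borrowed<'a>(
    record: &'a [String],
    field_pos: &[usize],
) -> Vec<&'a str> {
    field_pos
        .iter()
        .filter_map(|i| record.get(*i)) // Option<&'a String>
        .map(|s| s.as_str()) // &'a String -> &'a str
        .collect()
}
```

Because field_pos has no connection to 'a, the caller can drop the list of positions while still holding on to the extracted fields.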

Here is my final run function that incorporates all these ideas and will pass all the tests:

pub fn run(config: Config) -> MyResult<()> {
    for filename in &config.files {
        match open(filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => match &config.extract {
                Fields(field_pos) => {
                    let mut reader = ReaderBuilder::new() // 1
                        .delimiter(config.delimiter)
                        .has_headers(false)
                        .from_reader(file);

                    let mut wtr = WriterBuilder::new() // 2
                        .delimiter(config.delimiter)
                        .from_writer(io::stdout());

                    for record in reader.records() { // 3
                        let record = record?;
                        wtr.write_record(extract_fields( // 4
                            &record, field_pos,
                        ))?;
                    }
                }
                Bytes(byte_pos) => {
                    for line in file.lines() { // 5
                        println!("{}", extract_bytes(&line?, byte_pos));
                    }
                }
                Chars(char_pos) => {
                    for line in file.lines() { // 6
                        println!("{}", extract_chars(&line?, char_pos));
                    }
                }
            },
        }
    }
    Ok(())
}
1

If the user has requested fields from a delimited file, use csv::ReaderBuilder to create a mutable reader using the given delimiter. Do not attempt to parse a header row.

2

Use csv::WriterBuilder to write the output to STDOUT using the input delimiter.

3

Iterate through the records.

4

Write the extracted fields to the output.

5

Iterate the lines of text and print the extracted bytes.

6

Iterate the lines of text and print the extracted characters.

Note

I use csv::WriterBuilder to correctly escape enclosed delimiters in fields. None of the tests require this, so you may have found a simpler way to write the output that passes the tests. You will shortly see why I did this.

In the preceding code, you may be curious why I ignore any possible headers in the delimited files. By default, the csv::Reader will attempt to parse the first row for the column names, but I don’t need to do anything special with these values in this program. If I used this default behavior, I would have to handle the headers separately from the rest of the records. In this context, it’s easier to treat the first row like any other record.

This program passes all the tests and seems to work pretty well for all the testing input files. Because I’m using the csv crate to parse delimited text files and write the output, this program will correctly handle fields that contain the delimiter, unlike the original cut programs. I’ll use tests/inputs/books.csv again to demonstrate that cutr will correctly select a field containing the delimiter and will create output that properly escapes the delimiter:

$ cargo run -- -d , -f 1,3 tests/inputs/books.csv
Author,Title
Émile Zola,La Confession de Claude
Samuel Beckett,Waiting for Godot
Jules Verne,"20,000 Leagues Under the Sea"
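To illustrate why a plain join is not enough here, the following is a simplified sketch of the kind of quoting a CSV writer performs. It is not the csv crate's implementation; the functions quote_field and join_quoted are hypothetical, and the rule shown (quote a field that contains the delimiter, a double quote, or a line break, and double any embedded quotes) is only the common CSV convention:

```rust
/// Wrap a field in double quotes when it contains the delimiter, a double
/// quote, or a line break, and double any embedded quotes.
/// This is a simplified approximation, not the csv crate's own logic.
fn quote_field(field: &str, delimiter: &str) -> String {
    let needs_quotes = field.contains(delimiter)
        || field.contains('"')
        || field.contains('\n')
        || field.contains('\r');
    if needs_quotes {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Join the fields with the delimiter, quoting each field as needed.
fn join_quoted(fields: &[&str], delimiter: &str) -> String {
    fields
        .iter()
        .map(|f| quote_field(f, delimiter))
        .collect::<Vec<_>>()
        .join(delimiter)
}
```

Joining the fields "Jules Verne" and "20,000 Leagues Under the Sea" with a bare comma would yield a line that parses as three fields, while join_quoted keeps the title intact by enclosing it in quotes.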

These choices make cutr unsuitable as a direct replacement for cut, as many users may rely on the behavior of the original tool. As Ralph Waldo Emerson said, “A foolish consistency is the hobgoblin of little minds.” I don’t believe all these tools need to mimic the original tools, especially when this seems to be such an improvement.

Going Further

  • Make cutr parse delimited files exactly like the original tools and have the “correct” parsing of delimited files be an option.

  • Implement the partial ranges like -3 to mean 1-3 or 5- to mean 5 to the end. Be aware that trying to run cargo run -- -f -3 tests/inputs/books.tsv will cause clap to interpret -3 as an option. Use -f=-3 instead.

  • Currently the --delimiter for parsing input delimited files is also used for the output delimiter. Add an option to change this but have it default to the input delimiter.

  • Add an output filename option that defaults to STDOUT.

  • Check out xsv, a “fast CSV command line toolkit written in Rust.”

Summary

Lift your gaze upon the knowledge you gained in this exercise:

  • You’ve learned how to dereference a value using the *. Sometimes the compiler messages indicate that this is the solution, but other times you must infer this syntax when, for instance, you have &usize but need usize. Remember that the * essentially removes the &.

  • The Iterator::filter_map combines filter and map for more concise code. You used this with a get method, which works both for selecting positions from a vector and for selecting fields from a StringRecord. Any lookup that fails returns None and so is removed from the results.

  • You compared how to return a String versus a &str from a function, the latter of which required indicating lifetimes.

  • You can now parse and create delimited text using the csv crate. While we only looked at files delimited by commas and tab characters, there are many other delimiters in the wild. CSV files are some of the most common data formats, so these patterns will likely prove very useful.
