Chapter 4. Head Aches

Stand on your own head for a change, give me some skin to call my own

They Might Be Giants

The challenge in this chapter is to implement the head program, which will print the first few lines or bytes of one or more files. This is a good way to peek at the contents of a regular text file and is often a much better choice than cat. When faced with a directory of something like output files from some process, this is a great way to quickly scan for potential problems.

In this exercise, you will learn:

  • How to create optional command-line arguments that accept values

  • How to parse a string into a number

  • How to write and run a unit test

  • How to use a guard with a match arm

  • How to convert between types using From, Into, and as

  • How to use take on an iterator or a filehandle

  • How to preserve line endings while reading a filehandle

  • How to read bytes from a filehandle

How head Works

You should keep in mind that there are many implementations of the original AT&T Unix operating system, such as BSD (Berkeley Software Distribution), SunOS/Solaris, HP-UX, and Linux. Most of these operating systems have some version of a head program that will default to showing the first ten lines of one or more files. Most will probably have options -n to control the number of lines shown and -c to instead show some number of bytes. The BSD version has only these two options, which I can see via man head:

HEAD(1)                   BSD General Commands Manual                  HEAD(1)

NAME
     head -- display first lines of a file

SYNOPSIS
     head [-n count | -c bytes] [file ...]

DESCRIPTION
     This filter displays the first count lines or bytes of each of the speci-
     fied files, or of the standard input if no files are specified.  If count
     is omitted it defaults to 10.

     If more than a single file is specified, each file is preceded by a
     header consisting of the string ''==> XXX <=='' where ''XXX'' is the name
     of the file.

EXIT STATUS
     The head utility exits 0 on success, and >0 if an error occurs.

SEE ALSO
     tail(1)

HISTORY
     The head command appeared in PWB UNIX.

BSD                              June 6, 1993                              BSD

With the GNU version, I can run head --help to read the usage:

Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -c, --bytes=[-]K         print the first K bytes of each file;
                             with the leading '-', print all but the last
                             K bytes of each file
  -n, --lines=[-]K         print the first K lines instead of the first 10;
                             with the leading '-', print all but the last
                             K lines of each file
  -q, --quiet, --silent    never print headers giving file names
  -v, --verbose            always print headers giving file names
      --help     display this help and exit
      --version  output version information and exit

K may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.

Note the ability with the GNU version to specify -n and -c with negative numbers and using suffixes like K, M, etc., which I will not implement. In both versions, the files are optional positional arguments that will read STDIN by default or when a filename is “-”. The -n and -c are optional arguments that take integer values.

To demonstrate some examples using head, I’ll use the files found in 04_headr/tests/inputs. Given an empty file, there is no output, which you can verify with head tests/inputs/empty.txt. By default, head will print the first 10 lines of a file. If a file has fewer than 10 lines, it will print all the lines. You can see this using tests/inputs/three.txt, which has 3 lines:

$ cd 04_headr
$ head tests/inputs/three.txt
Three
lines,
four words.

The -n option allows you to control how many lines are shown. For instance, I can choose only 2 lines with the following command:

$ head -n 2 tests/inputs/three.txt
Three
lines,

The -c option shows only the given number of bytes from a file, for instance, just the first 4 bytes:

$ head -c 4 tests/inputs/three.txt
Thre

Oddly, the GNU version will allow you to provide both -n and -c and defaults to showing bytes. The BSD version will reject both arguments:

$ head -n 1 -c 2 tests/inputs/one.txt
head: can't combine line and byte counts

Any value for -n or -c that is not a positive integer will generate an error that will halt the program, and the error will echo back the illegal value:

$ head -n 0 tests/inputs/one.txt
head: illegal line count -- 0
$ head -c foo tests/inputs/one.txt
head: illegal byte count -- foo

When there are multiple arguments, head adds a header and inserts a blank line between each file:

$ head -n 1 tests/inputs/*.txt
==> tests/inputs/empty.txt <==

==> tests/inputs/one.txt <==
Öne line, four words.

==> tests/inputs/three.txt <==
Three

==> tests/inputs/two.txt <==
Two lines.

With no file arguments, head will read from STDIN:

$ cat tests/inputs/three.txt | head -n 2
Three
lines,

As with cat in Chapter 3, any nonexistent or unreadable file is skipped with a warning printed to STDERR. In the following command, I will use blargh as a nonexistent file and will create an unreadable file called cant-touch-this:

$ touch cant-touch-this && chmod 000 cant-touch-this
$ head blargh cant-touch-this tests/inputs/one.txt
head: blargh: No such file or directory
head: cant-touch-this: Permission denied
==> tests/inputs/one.txt <==
Öne line, four words.

This will be as much as the challenge program is expected to recreate.

Getting Started

You might have anticipated that the program I want you to write will be called headr (pronounced head-er). Start by running cargo new headr and copy my 04_headr/tests directory into your project directory. Add the following dependencies to your Cargo.toml:

[dependencies]
clap = "2.33"

[dev-dependencies]
assert_cmd = "1"
predicates = "1"
rand = "0.8"

I propose you again split your source code so that src/main.rs looks like this:

fn main() {
    if let Err(e) = headr::get_args().and_then(headr::run) {
        eprintln!("{}", e);
        std::process::exit(1);
    }
}

Begin your src/lib.rs by bringing in clap and the Error trait and declaring MyResult, which you can copy from the source code in Chapter 3:

use clap::{App, Arg};
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

The program will have three parameters that can be represented with a Config struct:

#[derive(Debug)]
pub struct Config {
    files: Vec<String>, 1
    lines: usize, 2
    bytes: Option<usize>, 3
}
1

The files will be a vector of strings.

2

The number of lines to print will be of the type usize.

3

The bytes will be an optional usize.

The primitive usize is the “pointer-sized unsigned integer type,” and its size varies from 4 bytes on a 32-bit operating system to 8 bytes on a 64-bit. The choice of usize is somewhat arbitrary as I just want to store some sort of positive integer. I could also use a u32 (unsigned 32-bit integer) or a u64 (unsigned 64-bit integer), but I definitely want an unsigned type as it will only represent positive integer values. I would need to use a signed integer like i32 or i64 to represent positive or negative numbers, which would be needed if I wanted to allow negative values as the GNU version does.

The lines and bytes will be used in a couple of functions, one of which expects a usize and the other u64. This will provide an opportunity later to discuss how to convert between types. Your program should use 10 as the default value for lines, but the bytes will be an Option, which I first introduced in Chapter 2. This means that bytes will either be Some(n), where n is a usize, if the user provides a valid value or None if they do not.

You can start your get_args function with the following outline. You need to add the code to parse the arguments and return a Config struct:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust head")
        ... // what goes here?
        .get_matches();

    Ok(Config {
        files: ...
        lines: ...
        bytes: ...
    })
}
Tip

All the command-line arguments for this program are optional because files will default to “-”, lines will default to 10, and bytes can be left out. The optional arguments in Chapter 3 were flags, but here lines and bytes will need Arg::takes_value set to true.

You can start off with a run function that prints the configuration:

pub fn run(config: Config) -> MyResult<()> {
    println!("{:#?}", config); 1
    Ok(()) 2
}
1

Pretty-print the config. You could also use dbg!(config).

2

Return a successful result.

Parsing Strings into Numbers

All the values that clap returns will be strings, but you will need to convert lines and bytes to integers when present. I will show you how to use str::parse for this. This function will return a Result that will be an Err when the provided value cannot be parsed into a number or an Ok containing the converted number. I will write a function called parse_positive_int that attempts to parse a string value into a positive usize value. You can add this to your src/lib.rs:

fn parse_positive_int(val: &str) -> MyResult<usize> { 1
    unimplemented!(); 2
}
1

This function accepts a &str and will either return a positive usize or an error.

2

The unimplemented! macro “indicates unimplemented code by panicking with a message of not implemented.”

In the spirit of test-driven development, I will add a unit test for this function. I would recommend adding this just after the function it’s testing:

#[test]
fn test_parse_positive_int() {
    // 3 is an OK integer
    let res = parse_positive_int("3");
    assert!(res.is_ok());
    assert_eq!(res.unwrap(), 3);

    // Any string is an error
    let res = parse_positive_int("foo");
    assert!(res.is_err());
    assert_eq!(res.unwrap_err().to_string(), "foo".to_string());

    // A zero is an error
    let res = parse_positive_int("0");
    assert!(res.is_err());
    assert_eq!(res.unwrap_err().to_string(), "0".to_string());
}

Run cargo test parse_positive_int and verify that, indeed, the test fails. Stop reading now and write a version of the function that passes this test. I’ll wait here until you finish.

TIME PASSES.
AUTHOR GETS A CUP OF TEA AND CONSIDERS HIS LIFE CHOICES.
AUTHOR RETURNS TO THE NARRATIVE.

How did that go? Swell, I bet! Here is the function I wrote that passes the preceding tests:

fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse() { 1
        Ok(n) if n > 0 => Ok(n), 2
        _ => Err(From::from(val)), 3
    }
}
1

Attempt to parse the given value. Rust infers the usize type from the return type.

2

If the parse succeeds and the parsed value n is greater than 0, return it as an Ok variant.

3

For any other outcome, return an Err with the given value.

I’ve used match several times so far, but this is the first time I’m showing that match arms can include a guard, which is an additional check after the pattern match. I don’t know about you, but I think that’s pretty sweet.
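To see a guard working alongside ordinary patterns, here is a minimal sketch. The describe function is a hypothetical helper invented for illustration and is not part of headr. It parses into a signed integer so that negative values can be told apart from non-numeric input:

```rust
// Hypothetical helper (not part of headr) showing how a guard refines a match.
fn describe(val: &str) -> &'static str {
    match val.parse::<i64>() {
        // The guard is checked only after the pattern `Ok(n)` has matched.
        Ok(n) if n > 0 => "positive",
        // A literal pattern needs no guard.
        Ok(0) => "zero",
        // Any remaining Ok value must be negative.
        Ok(_) => "negative",
        // The string could not be parsed as an integer at all.
        Err(_) => "not a number",
    }
}
```

The arms are tried from top to bottom, so a value that matches the pattern but fails the guard simply falls through to the next arm.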

Converting Strings into Errors

When I’m unable to parse a given string value into a positive integer, I want to return the original string so it can be included in an error message. To do this in the preceding function, I used the redundantly named From::from function to turn the input &str value into an Error. Consider this version where I try to put the unparsable string directly into the Err:

fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse() {
        Ok(n) if n > 0 => Ok(n),
        _ => Err(val), // This will not compile
    }
}

If I try to compile this, I get the following error:

error[E0308]: mismatched types
  --> src/lib.rs:75:18
   |
75 |         _ => Err(val), // This will not compile
   |                  ^^^
   |                  |
   |                  expected struct `Box`, found `&str`
   |                  help: store this in the heap by calling `Box::new`:
   |                  `Box::new(val)`
   |
   = note: expected struct `Box<dyn std::error::Error>`
           found reference `&str`
   = note: for more on the distinction between the stack and the heap,
   read https://doc.rust-lang.org/book/ch15-01-box.html,
   https://doc.rust-lang.org/rust-by-example/std/box.html,
   and https://doc.rust-lang.org/std/boxed/index.html

The problem is that I am expected to return a MyResult which is defined as either an Ok<T> for any kind of type T or something that implements the Error trait and which is stored in a Box:

type MyResult<T> = Result<T, Box<dyn Error>>;

In the preceding code, &str neither implements Error nor lives in a Box. I can try to fix this according to the suggestions by changing this to Err(Box::new(val)). Unfortunately, this still won’t compile as I still haven’t satisfied the Error trait:

error[E0277]: the trait bound `str: std::error::Error` is not satisfied
  --> src/lib.rs:75:18
   |
75 |         _ => Err(Box::new(val)), // This will not compile
   |                  ^^^^^^^^^^^^^ the trait `std::error::Error` is not
   |                                implemented for `str`
   |
   = note: required because of the requirements on the impl of
     `std::error::Error` for `&str`
   = note: required for the cast to the object type `dyn std::error::Error`

Enter the std::convert::From trait, which helps convert from one type to another. For example, the documentation shows how to convert from a str to a String:

let string = "hello".to_string();
let other_string = String::from("hello");
assert_eq!(string, other_string);

In my case, I can convert &str into an Error in several ways using both std::convert::From and std::convert::Into. As the documentation states:

The From is also very useful when performing error handling. When constructing a function that is capable of failing, the return type will generally be of the form Result<T, E>. The From trait simplifies error handling by allowing a function to return a single error type that encapsulate multiple error types.

Figure 4-1 shows several equivalent ways to write this, none of which are preferable.

Figure 4-1. Alternate ways to convert a &str to an Error using From and Into traits
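The figure itself is not reproduced here. As a rough sketch of the idea, and not necessarily the same spellings the figure shows, the following functions each turn a &str into a boxed Error. The function names are made up for illustration:

```rust
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

// Each function rejects its input to show one spelling of the conversion.
// The target type Box<dyn Error> is inferred from the return type.
fn with_from(val: &str) -> MyResult<usize> {
    Err(From::from(val))
}

// Into is the mirror image of From, called as a method on the value.
fn with_into(val: &str) -> MyResult<usize> {
    Err(val.into())
}

// The same Into conversion written as a plain function call.
fn with_into_fn(val: &str) -> MyResult<usize> {
    Err(Into::into(val))
}

// Naming the target type explicitly instead of relying on inference.
fn with_box_from(val: &str) -> MyResult<usize> {
    Err(<Box<dyn Error>>::from(val))
}
```

All four produce an error whose message is the original string, so the choice between them is a matter of taste.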

Now that you have a way to convert a string to a number, integrate it into your get_args. See if you can get your program to print a usage like the following. Note that I use the short and long names from the GNU version:

$ cargo run -- -h
headr 0.1.0
Ken Youens-Clark <[email protected]>
Rust head

USAGE:
    headr [OPTIONS] <FILE>...

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c, --bytes <BYTES>    Number of bytes
    -n, --lines <LINES>    Number of lines [default: 10]

ARGS:
    <FILE>...    Input file(s) [default: -]

Run the program with no inputs and verify the defaults are correctly set:

$ cargo run
Config {
    files: [ 1
        "-",
    ],
    lines: 10, 2
    bytes: None, 3
}
1

The files should default to the filename “-”.

2

The number of lines should default to 10.

3

The bytes should be None.

Run the program with arguments and ensure they are correctly parsed:

$ cargo run -- -n 3 tests/inputs/one.txt
Config {
    files: [
        "tests/inputs/one.txt", 1
    ],
    lines: 3, 2
    bytes: None, 3
}
1

The positional argument tests/inputs/one.txt is parsed as one of the files.

2

The -n option for lines sets this to 3.

3

The -c option for bytes defaults to None.

If I provide more than one positional argument, they will all go into the files, and the -c argument will go into bytes. In the following command, I’m again relying on the bash shell to expand the file glob *.txt into all the files ending in .txt. PowerShell users should refer to the equivalent use of Get-ChildItem shown in Chapter 3:

$ cargo run -- -c 4 tests/inputs/*.txt
Config {
    files: [
        "tests/inputs/empty.txt", 1
        "tests/inputs/one.txt",
        "tests/inputs/three.txt",
        "tests/inputs/two.txt",
    ],
    lines: 10, 2
    bytes: Some( 3
        4,
    ),
}
1

There are four files ending in .txt.

2

The lines is still set to the default value of 10.

3

The -c 4 results in the bytes now being Some(4).

Any value for -n or -c that cannot be parsed into a positive integer should cause the program to halt with an error:

$ cargo run -- -n blarg tests/inputs/one.txt
illegal line count -- blarg
$ cargo run -- -c 0 tests/inputs/one.txt
illegal byte count -- 0

The program should disallow -n and -c to be present together:

$ cargo run -- -n 1 -c 1 tests/inputs/one.txt
error: The argument '--lines <LINES>' cannot be used with '--bytes <BYTES>'

Just parsing and validating the arguments is a challenge, but I know you can do it. Be sure to consult the clap documentation as you figure this out. I recommend you not move forward until your program can pass all the tests included with cargo test dies:

running 3 tests
test dies_bad_lines ... ok
test dies_bad_bytes ... ok
test dies_bytes_and_lines ... ok

Defining the Arguments

Following is how I defined the arguments for clap. Note that the two options for lines and bytes will take values. This is different from the flags implemented in Chapter 3 that are used as Boolean values:

    let matches = App::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust head")
        .arg(
            Arg::with_name("lines") 1
                .short("n")
                .long("lines")
                .value_name("LINES")
                .help("Number of lines")
                .default_value("10"),
        )
        .arg(
            Arg::with_name("bytes") 2
                .short("c")
                .long("bytes")
                .value_name("BYTES")
                .takes_value(true)
                .conflicts_with("lines")
                .help("Number of bytes"),
        )
        .arg(
            Arg::with_name("files") 3
                .value_name("FILE")
                .help("Input file(s)")
                .required(true)
                .default_value("-")
                .min_values(1),
        )
        .get_matches();
1

The lines option takes a value and defaults to “10.”

2

The bytes option takes a value, and it conflicts with the lines parameter so that they are mutually exclusive.

3

The files parameter is positional, required, takes one or more values, and defaults to “-”.

Note

The Arg::value_name will be printed in the usage documentation, so be sure to choose a descriptive name. Don’t confuse this with the Arg::with_name that uniquely defines the name of the argument for accessing within your code.

Following is how I can use parse_positive_int inside get_args to validate lines and bytes. When the function returns an Err variant, I use ? to propagate the error to main and end the program; otherwise, I return the Config:

pub fn get_args() -> MyResult<Config> {
    let matches = App::new("headr")... // Same as before

    let lines = matches
        .value_of("lines") 1
        .map(parse_positive_int) 2
        .transpose() 3
        .map_err(|e| format!("illegal line count -- {}", e))?; 4

    let bytes = matches 5
        .value_of("bytes")
        .map(parse_positive_int)
        .transpose()
        .map_err(|e| format!("illegal byte count -- {}", e))?;

    Ok(Config {
        files: matches.values_of_lossy("files").unwrap(), 6
        lines: lines.unwrap(), 7
        bytes 8
    })
}
1

ArgMatches::value_of returns an Option<&str>.

2

Use Option::map to unpack a &str from Some and send it to parse_positive_int.

3

The result of Option::map will be an Option<Result>, and Option::transpose will turn this into a Result<Option>.

4

In the event of an Err, create an informative error message. Use ? to propagate an Err or unpack the Ok value.

5

Do the same for bytes.

6

The files option should have at least one value and so should be safe to call Option::unwrap.

7

The lines has a default value and is safe to unwrap.

8

The bytes should be left as an Option. Use the struct field init shorthand since the name of the field is the same as the variable.

In the preceding code, I could have written the Config with every key/value pair like so:

Ok(Config {
    files: matches.values_of_lossy("files").unwrap(),
    lines: lines.unwrap(),
    bytes: bytes,
})

Clippy will suggest the following:

$ cargo clippy
warning: redundant field names in struct initialization
  --> src/lib.rs:61:9
   |
61 |         bytes: bytes,
   |         ^^^^^^^^^^^^ help: replace it with: `bytes`
   |
   = note: `#[warn(clippy::redundant_field_names)]` on by default
   = help: for further information visit https://rust-lang.github.io/
     rust-clippy/master/index.html#redundant_field_names

It’s quite a bit of work to validate all the user input, but now I have some assurance that I can proceed with good data.
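The map, transpose, and map_err chain can be tried without clap by feeding it an Option<&str> directly. This is a minimal sketch: parse_count and its kind parameter are hypothetical names introduced for illustration, and parse_positive_int is repeated so the block stands alone:

```rust
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse::<usize>() {
        Ok(n) if n > 0 => Ok(n),
        _ => Err(From::from(val)),
    }
}

// Hypothetical helper: validate an optional argument value.
// `kind` is "line" or "byte" and only affects the error message.
fn parse_count(val: Option<&str>, kind: &str) -> MyResult<Option<usize>> {
    let count = val
        .map(parse_positive_int) // Option<Result<usize, _>>
        .transpose() // Result<Option<usize>, _>
        .map_err(|e| format!("illegal {} count -- {}", kind, e))?;
    Ok(count)
}
```

A missing value passes through as Ok(None), a valid value becomes Ok(Some(n)), and anything else becomes an error that echoes the bad input.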

Processing the Input Files

This challenge program should handle the input files just as in Chapter 3, so I suggest you bring in the open function from there:

fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
    match filename {
        "-" => Ok(Box::new(BufReader::new(io::stdin()))),
        _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
    }
}

Be sure to add all the required imports:

use clap::{App, Arg};
use std::error::Error;
use std::fs::File;
use std::io::{self, BufRead, BufReader, Read};

Expand your run function to try opening the files, printing errors as you encounter them:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files { 1
        match open(&filename) { 2
            Err(err) => eprintln!("{}: {}", filename, err), 3
            Ok(_file) => println!("Opened {}", filename), 4
        }
    }
    Ok(())
}
1

Iterate through each of the filenames.

2

Attempt to open the filename.

3

Print errors to STDERR.

4

Print a message that the file was successfully opened.

Run your program with a good file and a bad file to ensure it seems to work:

$ cargo run -- blargh tests/inputs/one.txt
blargh: No such file or directory (os error 2)
Opened tests/inputs/one.txt

Next, try to solve reading the lines and then bytes of a given file, then try to add the headers separating multiple file arguments. Look closely at the error output from head when handling invalid files. Notice that readable files have a header first and then the file output, but invalid files only print an error. Additionally, there is an extra blank line separating the output for the valid files:

$ head -n 1 tests/inputs/one.txt blargh tests/inputs/two.txt
==> tests/inputs/one.txt <==
Öne line, four words.
head: blargh: No such file or directory

==> tests/inputs/two.txt <==
Two lines.

I’ve specifically designed some challenging inputs for you to consider. To see what you face, use the file command to report file type information:

$ file tests/inputs/*.txt
tests/inputs/empty.txt: empty 1
tests/inputs/one.txt:   UTF-8 Unicode text 2
tests/inputs/three.txt: ASCII text, with CRLF, LF line terminators 3
tests/inputs/two.txt:   ASCII text 4
1

This is an empty file just to ensure your program doesn’t fall over.

2

This file contains Unicode as I put an umlaut over the O in Öne to force you to consider the differences between bytes and characters.

3

This file has Windows-style line endings.

4

This file has Unix-style line endings.

Tip

On Windows, the newline is the combination of the carriage return and the line feed, often shown as CRLF or \r\n. On Unix platforms, only the line feed is used, so LF or \n. These line endings must be preserved in the output from your program, so you will have to find a way to read the lines in a file without removing the line endings.

Reading Bytes versus Characters

I want to explain the difference between reading bytes and characters from a file. In the early 1960s, the American Standard Code for Information Interchange (ASCII, pronounced as-key) table of 128 characters represented all possible text elements in computing. It only takes seven bits (2^7 = 128) to represent each character, so the notions of byte and character were interchangeable.

Since the creation of Unicode (Universal Coded Character Set) to represent all the writing systems of the world (and even emojis), some characters may require up to four bytes. The Unicode standard defines several ways to encode characters including UTF-8 (Unicode Transformation Format using 8 bits). As I noted, the file tests/inputs/one.txt begins with the character Ö which is two bytes long in UTF-8. If you want head to show you this one character, you must request two bytes:

$ head -c 2 tests/inputs/one.txt
Ö

If I ask head to select just the first byte from this file, I get the byte value 195, which is not a valid UTF-8 string. The output is a special character that indicates a problem converting a character into Unicode:

$ head -c 1 tests/inputs/one.txt
�

The challenge program is expected to recreate this behavior. This is a challenging program to write, but you should be able to use std::io, std::fs::File, and std::io::BufReader to figure out how to read bytes and lines from each of the files. I’ve included a full set of tests in tests/cli.rs that you should have copied into your source tree. Be sure to run cargo test frequently to check your progress. Do your best to pass all the tests before looking at my solution.
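The gap between bytes and characters can be checked on a string in memory before touching any files. The two helpers below are hypothetical names used only for this sketch. The \u{d6} escape spells the character Ö so the example does not depend on how this text is encoded:

```rust
// Hypothetical helper: report (bytes, characters) for a string.
fn byte_and_char_counts(text: &str) -> (usize, usize) {
    (text.len(), text.chars().count())
}

// Hypothetical helper: turn the first `n` bytes of `text` into a string,
// replacing any incomplete UTF-8 sequence with U+FFFD.
fn lossy_prefix(text: &str, n: usize) -> String {
    // Taking from the byte iterator cannot run past the end of the input.
    let bytes: Vec<u8> = text.bytes().take(n).collect();
    String::from_utf8_lossy(&bytes).into_owned()
}
```

For the three-character string "\u{d6}ne", the first helper reports four bytes, and asking the second for one byte yields only the replacement character.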

Solution

I was really surprised by how much I learned by writing this program. What I expected to be a rather simple program proved to be very challenging. I’d like to step you through how I arrived at my solution, starting with how I read a file line-by-line.

Reading a File Line-by-line

To start, I will modify some code from Chapter 3 for reading the lines from a file:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(file) => {
                for line in file.lines().take(config.lines) { 1
                    println!("{}", line?); 2
                }
            }
        }
    }
    Ok(())
}
1

Take the desired number of lines from the filehandle.

2

Print the line to the console.

I think this is a really fun solution because it uses the Iterator::take method to select the number of lines from config.lines. I can run the program to select one line from a file that contains three, and it appears to work grandly:

$ cargo run -- -n 1 tests/inputs/three.txt
Three

If I run cargo test, the program will pass several tests, which seems pretty good for having only implemented a small portion of the specs. It’s failing all the tests starting with three which use the Windows-encoded input file. To fix this problem, I have a confession to make.

Preserving Line Endings While Reading a File

It pains me to tell you this, dear reader, but I lied to you in Chapter 3. The catr program I showed does not completely replicate the original program because it uses BufRead::lines to read the input files. The documentation for that function says “Each string returned will not have a newline byte (the 0xA byte) or CRLF (0xD, 0xA bytes) at the end.” I hope you’ll forgive me because I wanted to show you how easy it can be to read the lines of a file, but you should be aware that the catr program replaces Windows CRLF line endings with Unix-style newlines.

To fix this, I must instead use BufRead::read_line, which says “This function will read bytes from the underlying stream until the newline delimiter (the 0xA byte) or EOF is found. Once found, all bytes up to, and including, the delimiter (if found) will be appended to buf.” Following is a version that will preserve the original line endings. With these changes, the program will pass more tests than it fails:

pub fn run(config: Config) -> MyResult<()> {
    for filename in config.files {
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(mut file) => { 1
                let mut line = String::new(); 2
                for _ in 0..config.lines { 3
                    let bytes = file.read_line(&mut line)?; 4
                    if bytes == 0 { 5
                        break;
                    }
                    print!("{}", line); 6
                    line.clear(); 7
                }
            }
        };
    }
    Ok(())
}
1

Accept the filehandle as a mut (mutable) value.

2

Use String::new to create a new, empty mutable string buffer to hold each line.

3

Use for to iterate through a std::ops::Range to count up from 0 to the requested number of lines. The variable name _ indicates I do not intend to use it.

4

Use BufRead::read_line to read the next line.

5

The filehandle will return 0 bytes when it reaches the end, so break out of the loop.

6

Print the line including the original line ending.

7

Use String::clear to empty the line buffer.

If I run cargo test at this point, I’m passing almost all the tests for reading lines and failing all those for reading bytes and handling multiple files.
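The difference between the two reading styles is easy to see on an in-memory input. The following sketch uses two hypothetical helpers that accept anything implementing BufRead, such as a std::io::Cursor wrapped around a string:

```rust
use std::io::{self, BufRead};

// Hypothetical helper: collect the first `n` lines with BufRead::lines,
// which strips the trailing "\n" or "\r\n" from each line.
fn head_stripped<R: BufRead>(reader: R, n: usize) -> io::Result<Vec<String>> {
    reader.lines().take(n).collect()
}

// Hypothetical helper: collect the first `n` lines with BufRead::read_line,
// which keeps whatever line ending the input used.
fn head_preserved<R: BufRead>(mut reader: R, n: usize) -> io::Result<String> {
    let mut out = String::new();
    for _ in 0..n {
        // read_line appends to the buffer and returns 0 at end of input.
        if reader.read_line(&mut out)? == 0 {
            break;
        }
    }
    Ok(out)
}
```

Given the input "Three\r\nlines,\r\nfour words.\n" and a request for two lines, the first helper returns the bare strings Three and lines, while the second returns "Three\r\nlines,\r\n" with the carriage returns intact.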

Reading Bytes from a File

Next, I’ll handle reading bytes from a file. After I attempt to open the file, I check to see if the config.bytes is Some number of bytes; otherwise, I’ll use the preceding code that reads lines:

for filename in config.files {
    match open(&filename) {
        Err(err) => eprintln!("{}: {}", filename, err),
        Ok(mut file) => {
            if let Some(num_bytes) = config.bytes { 1
                let mut handle = file.take(num_bytes as u64); 2
                let mut buffer = vec![0; num_bytes]; 3
                let n = handle.read(&mut buffer)?; 4
                print!("{}", String::from_utf8_lossy(&buffer[..n])); 5
            } else {
                ... // Read lines as before
            }
        }
    };
}
1

Use pattern matching to check if config.bytes is Some number of bytes to read.

2

Use take to read the requested number of bytes.

3

Create a mutable buffer of a fixed length num_bytes filled with zeros to hold the bytes read from the file.

4

Read the desired number of bytes from the filehandle into the buffer. The value n will report the number of bytes that were actually read, which may be fewer than the number requested.

5

Convert the bytes, which may not be valid UTF-8, into a string. Note the range operation to select only the bytes actually read.

Tip

The take method from the std::io::Read trait expects its argument to be the type u64, but I have a usize. I cast or convert the value using the as keyword.
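For comparison, here is a small sketch of the conversion tools the standard library offers. The function names are invented for illustration. The as keyword always succeeds but silently truncates when the target type is too small, From is only available for conversions that cannot lose information, and TryFrom reports failure instead of truncating:

```rust
use std::convert::TryFrom;

// `as` casts never fail; usize to u64 is lossless on 32- and 64-bit targets.
fn cast_to_u64(n: usize) -> u64 {
    n as u64
}

// `as` to a narrower type keeps only the low bits, so 300 becomes 44.
fn cast_to_u8(n: u64) -> u8 {
    n as u8
}

// From exists only for lossless conversions such as u32 to u64.
fn widen(n: u32) -> u64 {
    u64::from(n)
}

// TryFrom returns an error when the value does not fit in the target type.
fn narrow(n: u64) -> Option<u8> {
    u8::try_from(n).ok()
}
```

There is no From implementation taking a usize to a u64 because the size of usize depends on the platform, which is one reason an as cast is a common choice for that conversion.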

This was perhaps the hardest part of the program for me. Once I figured out how to read only a few bytes, I had to figure out how to convert them to text. If I take only part of a multibyte character, the conversion to a String will fail because strings in Rust must be valid UTF-8. I was happy to find String::from_utf8_lossy, which will quietly convert invalid UTF-8 sequences to the unknown or replacement character:

$ cargo run -- -c 1 tests/inputs/one.txt
�
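The same idea can be exercised on any reader, including an in-memory std::io::Cursor. The sketch below is a variation on the code above and not the version used in the chapter: read_head_bytes is a hypothetical name, and it calls read_to_end on the limited handle because a single read call is allowed to return fewer bytes than requested even when more are available:

```rust
use std::io::{self, Read};

// Hypothetical helper: read at most `num_bytes` bytes from any reader
// and convert them to text, replacing invalid UTF-8 with U+FFFD.
fn read_head_bytes<R: Read>(reader: R, num_bytes: usize) -> io::Result<String> {
    let mut buffer = Vec::new();
    // `take` caps the reader; `read_to_end` keeps reading until the cap or EOF.
    reader.take(num_bytes as u64).read_to_end(&mut buffer)?;
    Ok(String::from_utf8_lossy(&buffer).into_owned())
}
```

Because the buffer grows only as bytes arrive, an empty input yields an empty string and a large byte count does not allocate memory up front.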

Let me show you the first way I tried to read the bytes from a file. I decided to read the entire file into a string, view that as a slice of bytes, and use a range to select the first num_bytes.

let mut contents = String::new(); 1
file.read_to_string(&mut contents)?; // Danger here 2
let bytes = contents.as_bytes(); 3
print!("{}", String::from_utf8_lossy(&bytes[..num_bytes])); // More danger 4
1

Create a new string buffer to hold the contents of the file.

2

Read the entire file contents into the string buffer.

3

Use str::as_bytes to convert the contents into bytes (u8 or unsigned 8-bit integers).

4

Use String::from_utf8_lossy to turn a slice of the bytes into a string.

Warning

I show you this approach so that you know how to read a file into a string; however, this can be a very dangerous thing to do if the file’s size exceeds the amount of memory on your machine. In general, this is a terrible idea unless you are positive that a file is small.

Another serious problem with the preceding code is that it assumes the slice operation bytes[..num_bytes] will succeed. If you use this code with an empty file, for instance, you’ll be asking for bytes that don’t exist. This will cause your program to panic and exit immediately with an error message:

$ cargo run -- -c 1 tests/inputs/empty.txt
thread 'main' panicked at 'range end index 1 out of range for slice of
length 0', src/lib.rs:80:50
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Tip

Rust can prevent you from making all sorts of egregious errors, but it can’t stop you from doing stupid things. There are still plenty of ways for you to shoot yourself in the foot.
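If you do need to slice a buffer that might be shorter than the requested length, one way to avoid the panic is to clamp the end of the range or to use the slice's get method, which returns an Option. This is a sketch with hypothetical function names, not code from the program:

```rust
/// Return at most the first `num_bytes` of `bytes` without panicking.
fn first_n(bytes: &[u8], num_bytes: usize) -> &[u8] {
    // Clamp the end of the range to the length of the slice.
    &bytes[..num_bytes.min(bytes.len())]
}

/// Return exactly the first `num_bytes`, or None if there are too few.
fn try_first_n(bytes: &[u8], num_bytes: usize) -> Option<&[u8]> {
    bytes.get(..num_bytes)
}
```

With an empty input, first_n(b"", 1) returns an empty slice and try_first_n(b"", 1) returns None, so the caller decides what to do instead of the program crashing.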

Following is perhaps the shortest way to read the desired number of bytes from a file:

let bytes: Result<Vec<_>, _> = file.bytes().take(num_bytes).collect();
print!("{}", String::from_utf8_lossy(&bytes?));

In the preceding code, the type annotation Result<Vec<_>, _> is necessary as the compiler infers the type of bytes as a slice, which has an unknown size. I must indicate I want a Vec, which is a smart pointer to heap-allocated memory. The underscores (_) here indicate partial type annotation, which basically instructs the compiler to infer the types. Without this, the compiler complains thusly:

   Compiling headr v0.1.0 (/Users/kyclark/work/sysprog-rust/playground/headr)
error[E0277]: the size for values of type `[u8]` cannot be known at compilation time
  --> src/lib.rs:95:58
   |
95 |                     print!("{}", String::from_utf8_lossy(&bytes?));
   |                                                          ^^^^^^^ doesn't
   |                                        have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `[u8]`
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature
Note

You’ve now seen that the underscore _ serves various different functions. As the prefix or name of a variable, it shows the compiler you don’t want to use the value. In a match arm, it is the wildcard for handling any case. When used in a type annotation, it tells the compiler to infer the type.
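The three roles of the underscore can be seen side by side in a short sketch. The functions here are made up for illustration:

```rust
/// `_` as a wildcard match arm that handles every remaining case.
fn describe(n: i32) -> &'static str {
    match n {
        0 => "zero",
        _ => "nonzero",
    }
}

/// `_` as a variable prefix and as a placeholder in a type annotation.
fn lengths(inputs: &[&str]) -> Vec<usize> {
    // The leading underscore tells the compiler this value is
    // intentionally unused, so no warning is issued.
    let _count = inputs.len();

    // `Vec<_>` asks the compiler to infer the element type (usize here).
    let lens: Vec<_> = inputs.iter().map(|s| s.len()).collect();
    lens
}
```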

You can also indicate the type information on the righthand side of the expression using the turbofish operator (::<>). Often it’s a matter of style whether you indicate the type on the lefthand or righthand side, but later you will see examples where the turbofish is required for some expressions:

let bytes = file.bytes().take(num_bytes).collect::<Result<Vec<_>, _>>();
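To compare the two styles directly, here is a sketch with both wrapped in hypothetical functions that accept any Read implementation. Both should produce the same result. Note that the standard library documentation warns that Read::bytes can be inefficient on an unbuffered reader, so with a real file you would likely want a BufReader around it:

```rust
use std::io::{self, Read};

/// Type annotation on the lefthand side.
fn first_bytes_annotated(reader: impl Read, n: usize) -> io::Result<Vec<u8>> {
    let bytes: Result<Vec<_>, _> = reader.bytes().take(n).collect();
    bytes
}

/// The same operation using the turbofish on the righthand side.
fn first_bytes_turbofish(reader: impl Read, n: usize) -> io::Result<Vec<u8>> {
    reader.bytes().take(n).collect::<Result<Vec<_>, _>>()
}
```

Here take is Iterator::take, so it accepts a usize directly and no cast to u64 is needed.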

The unknown character produced by String::from_utf8_lossy (b'\xef\xbf\xbd') is not exactly the same output produced by BSD head (b'\xc3'), making this somewhat difficult to test. If you look at the run helper function in tests/cli.rs, you’ll see that I read the expected value (the output from head) and use the same function to convert what could be invalid UTF-8 so that I can compare the two outputs. The run_stdin function works similarly:

fn run(args: &[&str], expected_file: &str) -> TestResult {
    // Extra work here due to lossy UTF
    let mut file = File::open(expected_file)?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    let expected = String::from_utf8_lossy(&buffer); 1

    Command::cargo_bin(PRG)?
        .args(args)
        .assert()
        .success()
        .stdout(predicate::eq(&expected.as_bytes() as &[u8])); 2

    Ok(())
}
1

Handle any invalid UTF-8 in the expected_file.

2

Compare the output and expected values as a slice of bytes ([u8]).
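The reason this works is that passing both sides through the same lossy conversion puts them in the same form. A minimal sketch of that idea, using a hypothetical normalize_lossy function that is not part of tests/cli.rs:

```rust
/// Convert possibly invalid UTF-8 to valid UTF-8 bytes, replacing any
/// invalid sequence with the bytes of U+FFFD.
fn normalize_lossy(raw: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(raw).into_owned().into_bytes()
}
```

The lone byte 0xC3 becomes the three bytes 0xEF 0xBF 0xBD, while text that is already valid UTF-8 passes through unchanged, so applying the function a second time has no further effect.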

Printing the File Separators

The last piece to handle is the separators between multiple files. As noted before, valid files have a header that puts the filename inside ==> and <== markers. Files after the first have an additional newline at the beginning to visually separate the output. This means I will need to know the number of the file that I’m handling, which I can get by using the Iterator::enumerate method. Following is the final version of my run function that will pass all the tests:

pub fn run(config: Config) -> MyResult<()> {
    let num_files = config.files.len(); 1

    for (file_num, filename) in config.files.iter().enumerate() { 2
        match File::open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
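            Ok(file) => {
                // Wrap the File in a BufReader (assumes std::io::BufReader
                // and BufRead are imported) so that read_line is available
                let mut file = BufReader::new(file);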
                if num_files > 1 { 3
                    println!(
                        "{}==> {} <==",
                        if file_num > 0 { "\n" } else { "" }, 4
                        filename
                    );
                }

                if let Some(num_bytes) = config.bytes {
                    let mut handle = file.take(num_bytes as u64);
                    let mut buffer = vec![0; num_bytes];
                    let n = handle.read(&mut buffer)?;
                    print!("{}", String::from_utf8_lossy(&buffer[..n]));
                } else {
                    let mut line = String::new();
                    for _ in 0..config.lines {
                        let bytes = file.read_line(&mut line)?;
                        if bytes == 0 {
                            break;
                        }
                        print!("{}", line);
                        line.clear();
                    }
                }
            }
        };
    }

    Ok(())
}
1

Use the Vec::len method to get the number of files.

2

Use the Iterator::enumerate method to track both the file number and filenames.

3

Only print headers when there are multiple files.

4

Print a newline when the file_num is greater than 0, which indicates any file after the first.
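One way to make the header logic easy to unit test is to move it into its own function that returns a string. This is only a sketch, and format_header is a hypothetical name that does not appear in the program above:

```rust
/// Build the header for a file. Files after the first get a leading
/// newline to separate them from the previous file's output.
fn format_header(file_num: usize, filename: &str) -> String {
    format!(
        "{}==> {} <==",
        if file_num > 0 { "\n" } else { "" },
        filename
    )
}
```

The run function could then call println!("{}", format_header(file_num, filename)) inside the num_files > 1 check.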

Going Further

  • Implement the multiplier suffixes of the GNU version so that, for instance, -c=1K means print the first 1024 bytes of the file. Be sure to add and run tests.

  • Implement the negative number options from the GNU version where -n=-3 means you should print all but the last three lines of the file. As always, create tests to ensure your program is correct.

  • Add an option for selecting characters.

  • Add the file with the Windows line endings to the tests in Chapter 3. Edit the mk-outs.sh for that program to incorporate this file, and then expand the tests and program to ensure that line endings are preserved.

Summary

This chapter dove into some fairly sticky subjects such as converting types like a &str to a usize, a String to an Error, and a usize to a u64. I feel like it took me quite a while to understand the differences between &str and String and why I need to use From::from to create the Err part of MyResult. If you still feel confused, just know that you won’t always. I think if you keep reading the docs and writing more code, it will eventually come to you.

Here are some things you accomplished in this exercise:

  • You learned to create optional parameters that can take values. Previously, the options were flags.

  • You saw that all command-line arguments are strings. You used the str::parse method to attempt the conversion of a string like “3” into the number 3.

  • You learned how to write and run a unit test for an individual function.

  • You learned to convert types using the as keyword or with traits like From and Into.

  • You found that _ as the name or prefix of a value is a way to indicate to the compiler that you don’t intend to use a variable. When used in a type annotation, it tells the compiler to infer the type.

  • You learned that a match arm can incorporate an additional Boolean condition called a guard.

  • You learned how to use BufRead::read_line to preserve line endings while reading a filehandle.

  • You found that the take method works on both iterators and filehandles to limit the number of elements you select.

  • You learned to indicate type information on the lefthand side of an assignment or on the righthand side using the turbofish.

1 EOF is an acronym for end of file.
