Stand on your own head for a change, give me some skin to call my own
They Might Be Giants
The challenge in this chapter is to implement the head
program, which will print the first few lines or bytes of one or more files.
This is a good way to peek at the contents of a regular text file and is often a much better choice than cat
.
When faced with a directory of something like output files from some process, this is a great way to quickly scan for potential problems.
In this exercise, you will learn:
How to create optional command-line arguments that accept values
How to parse a string into a number
How to write and run a unit test
How to use a guard with a match
arm
How to convert between types using From
, Into
, and as
How to use take
on an iterator or a filehandle
How to preserve line endings while reading a filehandle
How to read bytes from a filehandle
You should keep in mind that there are many implementations of the original AT&T Unix operating system, such as BSD (Berkeley Software Distribution), SunOS/Solaris, HP-UX, and Linux.
Most of these operating systems have some version of a head
program that will default to showing the first ten lines of one or more files.
Most will probably have options -n
to control the number of lines shown and -c
to instead show some number of bytes.
The BSD version has only these two options, which I can see via man head
:
    HEAD(1)                   BSD General Commands Manual                  HEAD(1)

    NAME
         head -- display first lines of a file

    SYNOPSIS
         head [-n count | -c bytes] [file ...]

    DESCRIPTION
         This filter displays the first count lines or bytes of each of the
         specified files, or of the standard input if no files are specified.
         If count is omitted it defaults to 10.

         If more than a single file is specified, each file is preceded by a
         header consisting of the string ''==> XXX <=='' where ''XXX'' is the
         name of the file.

    EXIT STATUS
         The head utility exits 0 on success, and >0 if an error occurs.

    SEE ALSO
         tail(1)

    HISTORY
         The head command appeared in PWB UNIX.

    BSD                              June 6, 1993                              BSD
With the GNU version, I can run head --help
to read the usage:
    Usage: head [OPTION]... [FILE]...
    Print the first 10 lines of each FILE to standard output.
    With more than one FILE, precede each with a header giving the file name.
    With no FILE, or when FILE is -, read standard input.

    Mandatory arguments to long options are mandatory for short options too.
      -c, --bytes=[-]K         print the first K bytes of each file;
                                 with the leading '-', print all but the last
                                 K bytes of each file
      -n, --lines=[-]K         print the first K lines instead of the first 10;
                                 with the leading '-', print all but the last
                                 K lines of each file
      -q, --quiet, --silent    never print headers giving file names
      -v, --verbose            always print headers giving file names
          --help     display this help and exit
          --version  output version information and exit

    K may have a multiplier suffix:
    b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
    GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
Note the ability with the GNU version to specify -n
and -c
with negative numbers and using suffixes like K, M, etc., which I will not implement.
In both versions, the files are optional positional arguments that will read STDIN
by default or when a filename is “-”.
The -n
and -c
options are optional arguments that take integer values.
To demonstrate some examples using head
, I’ll use the files found in 04_headr/tests/inputs.
Given an empty file, there is no output, which you can verify with head tests/inputs/empty.txt
.
By default, head
will print the first 10 lines of a file.
If a file has fewer than 10 lines, it will print all the lines.
You can see this using tests/inputs/three.txt, which has 3 lines:
    $ cd 04_headr
    $ head tests/inputs/three.txt
    Three
    lines,
    four words.
The -n
option allows you to control how many lines are shown.
For instance, I can choose only 2 lines with the following command:
    $ head -n 2 tests/inputs/three.txt
    Three
    lines,
The -c
option shows only the given number of bytes from a file, for instance, just the first 4 bytes:
    $ head -c 4 tests/inputs/three.txt
    Thre
Oddly, the GNU version will allow you to provide both -n
and -c
and defaults to showing bytes.
The BSD version will reject both arguments:
    $ head -n 1 -c 2 tests/inputs/one.txt
    head: can't combine line and byte counts
Any value for -n
or -c
that is not a positive integer will generate an error that will halt the program, and the error will echo back the illegal value:
    $ head -n 0 tests/inputs/one.txt
    head: illegal line count -- 0
    $ head -c foo tests/inputs/one.txt
    head: illegal byte count -- foo
When there are multiple arguments, head
adds a header and inserts a blank line between each file:
    $ head -n 1 tests/inputs/*.txt
    ==> tests/inputs/empty.txt <==

    ==> tests/inputs/one.txt <==
    Öne line, four words.

    ==> tests/inputs/three.txt <==
    Three

    ==> tests/inputs/two.txt <==
    Two lines.
With no file arguments, head
will read from STDIN
:
    $ cat tests/inputs/three.txt | head -n 2
    Three
    lines,
As with cat
in Chapter 3, any nonexistent or unreadable file is skipped with a warning printed to STDERR
.
In the following command, I will use blargh as a nonexistent file and will create an unreadable file called cant-touch-this:
    $ touch cant-touch-this && chmod 000 cant-touch-this
    $ head blargh cant-touch-this tests/inputs/one.txt
    head: blargh: No such file or directory
    head: cant-touch-this: Permission denied
    ==> tests/inputs/one.txt <==
    Öne line, four words.
This will be as much as the challenge program is expected to recreate.
You might have anticipated that the program I want you to write will be called headr
(pronounced head-er).
Start by running cargo new headr
and copy my 04_headr/tests directory into your project directory.
Add the following dependencies to your Cargo.toml:
    [dependencies]
    clap = "2.33"

    [dev-dependencies]
    assert_cmd = "1"
    predicates = "1"
    rand = "0.8"
I propose you again split your source code so that src/main.rs looks like this:
    fn main() {
        if let Err(e) = headr::get_args().and_then(headr::run) {
            eprintln!("{}", e);
            std::process::exit(1);
        }
    }
Begin your src/lib.rs by bringing in clap
and the Error
trait and declaring MyResult
, which you can copy from the source code in Chapter 3:
    use clap::{App, Arg};
    use std::error::Error;

    type MyResult<T> = Result<T, Box<dyn Error>>;
The program will have three parameters that can be represented with a Config
struct:
    #[derive(Debug)]
    pub struct Config {
        files: Vec<String>,
        lines: usize,
        bytes: Option<usize>,
    }
The files
will be a vector of strings.
The number of lines
to print will be of the type usize
.
The bytes
will be an optional usize
.
The primitive usize
is the “pointer-sized unsigned integer type,” and its size varies from 4 bytes on a 32-bit operating system to 8 bytes on a 64-bit.
The choice of usize
is somewhat arbitrary as I just want to store some sort of positive integer.
I could also use a u32
(unsigned 32-bit integer) or a u64
(unsigned 64-bit integer), but I definitely want an unsigned type as it will only represent positive integer values.
I would need to use a signed integer like i32
or i64
to represent positive or negative numbers, which would be needed if I wanted to allow negative values as the GNU version does.
The lines
and bytes
will be used in a couple of functions, one of which expects a usize
and the other u64
.
This will provide an opportunity later to discuss how to convert between types.
Your program should use 10
as the default value for lines
, but the bytes
will be an Option
, which I first introduced in Chapter 2.
This means that bytes
will either be Some<usize>
if the user provides a valid value or None
if they do not.
You can start your get_args
function with the following outline.
You need to add the code to parse the arguments and return a Config
struct:
    pub fn get_args() -> MyResult<Config> {
        let matches = App::new("headr")
            .version("0.1.0")
            .author("Ken Youens-Clark <[email protected]>")
            .about("Rust head")
            ... // what goes here?
            .get_matches();

        Ok(Config {
            files: ...
            lines: ...
            bytes: ...
        })
    }
All the command-line arguments for this program are optional because files
will default to “-”, lines
will default to 10, and bytes
can be left out. The optional arguments in Chapter 3 were flags, but here lines
and bytes
will need Arg::takes_value
set to true
.
You can start off with a run
function that prints the configuration:
    pub fn run(config: Config) -> MyResult<()> {
        println!("{:#?}", config);
        Ok(())
    }
All the values that clap
returns will be strings, but you will need to convert lines
and bytes
to integers when present.
I will show you how to use str::parse
for this.
This function will return a Result
that will be an Err
when the provided value cannot be parsed into a number or an Ok
containing the converted number.
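As a minimal sketch of how str::parse behaves on its own, consider the following. The helper name parse_usize is hypothetical and only for illustration; it does not check that the number is positive, so it is not a solution to the function described next:

```rust
use std::num::ParseIntError;

// Hypothetical helper (not part of headr): parse a string into a usize.
// Rust infers the target type of `parse` from the function's return type.
fn parse_usize(val: &str) -> Result<usize, ParseIntError> {
    val.parse()
}

fn main() {
    // A valid unsigned integer yields an Ok variant
    println!("{:?}", parse_usize("42"));
    // A negative number cannot be represented by an unsigned type
    println!("{:?}", parse_usize("-1"));
    // Arbitrary text is rejected as well
    println!("{:?}", parse_usize("foo"));
}
```

Note that a plain parse accepts 0, so any additional range check has to be layered on top of the Result it returns.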
I will write a function called parse_positive_int
that attempts to parse a string value into a positive usize
value.
You can add this to your src/lib.rs:
    fn parse_positive_int(val: &str) -> MyResult<usize> {
        unimplemented!();
    }
This function accepts a &str
and will either return a positive usize
or an error.
The unimplemented!
macro “indicates unimplemented code by panicking with a message of not implemented.”
In the spirit of test-driven development, I will add a unit test for this function. I would recommend adding this just after the function it’s testing:
    #[test]
    fn test_parse_positive_int() {
        // 3 is an OK integer
        let res = parse_positive_int("3");
        assert!(res.is_ok());
        assert_eq!(res.unwrap(), 3);

        // Any string is an error
        let res = parse_positive_int("foo");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "foo".to_string());

        // A zero is an error
        let res = parse_positive_int("0");
        assert!(res.is_err());
        assert_eq!(res.unwrap_err().to_string(), "0".to_string());
    }
Run cargo test parse_positive_int
and verify that, indeed, the test fails.
Stop reading now and write a version of the function that passes this test.
I’ll wait here until you finish.
TIME PASSES. AUTHOR GETS A CUP OF TEA AND CONSIDERS HIS LIFE CHOICES. AUTHOR RETURNS TO THE NARRATIVE.
How did that go? Swell, I bet! Here is the function I wrote that passes the preceding tests:
    fn parse_positive_int(val: &str) -> MyResult<usize> {
        match val.parse() {
            Ok(n) if n > 0 => Ok(n),
            _ => Err(From::from(val)),
        }
    }
Attempt to parse the given value. Rust infers the usize
type from the return type.
If the parse succeeds and the parsed value n
is greater than 0, return it as an Ok
variant.
For any other outcome, return an Err
with the given value.
I’ve used match
several times so far, but this is the first time I’m showing that match
arms can include a guard, which is an additional check after the pattern match.
I don’t know about you, but I think that’s pretty sweet.
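To show guards outside of the parsing function, here is a small sketch. The describe function and its categories are made up for this illustration and are not part of the challenge program:

```rust
// Hypothetical example: classify an optional count using match guards.
fn describe(count: Option<i64>) -> &'static str {
    match count {
        // The guard `if n < 0` is checked only after `Some(n)` matches
        Some(n) if n < 0 => "negative",
        // A literal pattern needs no guard
        Some(0) => "zero",
        // Any remaining Some value must be greater than zero
        Some(_) => "positive",
        None => "missing",
    }
}

fn main() {
    for count in [Some(-3), Some(0), Some(10), None] {
        println!("{:?} is {}", count, describe(count));
    }
}
```

The arms are tried from top to bottom, so a guarded arm that fails its check simply falls through to the arms below it.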
When I’m unable to parse a given string value into a positive integer, I want to return the original string so it can be included in an error message.
To do this in the preceding function, I used the redundantly named From::from
function to turn the input &str
value into an Error
.
Consider this version where I try to put the unparsable string directly into the Err
:
    fn parse_positive_int(val: &str) -> MyResult<usize> {
        match val.parse() {
            Ok(n) if n > 0 => Ok(n),
            _ => Err(val), // This will not compile
        }
    }
If I try to compile this, I get the following error:
    error[E0308]: mismatched types
      --> src/lib.rs:75:18
       |
    75 |         _ => Err(val), // This will not compile
       |                  ^^^
       |                  |
       |                  expected struct `Box`, found `&str`
       |                  help: store this in the heap by calling `Box::new`:
       |                  `Box::new(val)`
       |
       = note: expected struct `Box<dyn std::error::Error>`
               found reference `&str`
       = note: for more on the distinction between the stack and the heap,
               read https://doc.rust-lang.org/book/ch15-01-box.html,
               https://doc.rust-lang.org/rust-by-example/std/box.html, and
               https://doc.rust-lang.org/std/boxed/index.html
The problem is that I am expected to return a MyResult
which is defined as either an Ok<T>
for any kind of type T
or something that implements the Error
trait and which is stored in a Box
:
type MyResult<T> = Result<T, Box<dyn Error>>;
In the preceding code, &str
neither implements Error
nor lives in a Box
.
I can try to fix this according to the suggestions by changing this to Err(Box::new(val))
.
Unfortunately, this still won’t compile as I still haven’t satisfied the Error
trait:
    error[E0277]: the trait bound `str: std::error::Error` is not satisfied
      --> src/lib.rs:75:18
       |
    75 |         _ => Err(Box::new(val)), // This will not compile
       |                  ^^^^^^^^^^^^^ the trait `std::error::Error` is not
       |                  implemented for `str`
       |
       = note: required because of the requirements on the impl of
               `std::error::Error` for `&str`
       = note: required for the cast to the object type `dyn std::error::Error`
Enter the std::convert::From
trait, which helps convert from one type to another.
For example, the documentation shows how to convert from a str
to a String
:
    let string = "hello".to_string();
    let other_string = String::from("hello");
    assert_eq!(string, other_string);
In my case, I can convert &str
into an Error
in several ways using both std::convert::From
and std::convert::Into
.
As the documentation states:
The From is also very useful when performing error handling. When constructing a function that is capable of failing, the return type will generally be of the form Result<T, E>. The From trait simplifies error handling by allowing a function to return a single error type that encapsulate multiple error types.
Figure 4-1 shows several equivalent ways to write this, none of which are preferable.
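As a rough sketch of what such equivalent conversions can look like in code, consider the following. The function names are hypothetical and chosen only to label each spelling; they are not taken from the figure:

```rust
use std::error::Error;

type MyResult<T> = Result<T, Box<dyn Error>>;

// Each hypothetical function below turns the &str into a Box<dyn Error>.

fn with_from(val: &str) -> MyResult<usize> {
    // Let the compiler pick the From implementation from the return type
    Err(From::from(val))
}

fn with_into_method(val: &str) -> MyResult<usize> {
    // Into is the mirror image of From, called on the value itself
    Err(val.into())
}

fn with_into_function(val: &str) -> MyResult<usize> {
    // The same conversion written as a function call
    Err(Into::into(val))
}

fn with_explicit_type(val: &str) -> MyResult<usize> {
    // Name the target type explicitly
    Err(Box::<dyn Error>::from(val))
}

fn main() {
    for f in [with_from, with_into_method, with_into_function, with_explicit_type] {
        println!("{}", f("foo").unwrap_err());
    }
}
```

All four rely on the standard library's conversion from &str to Box<dyn Error>, so the resulting error prints the original string.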
Figure 4-1. Converting a &str to an Error using From and Into traits

Now that you have a way to convert a string to a number, integrate it into your get_args.
See if you can get your program to print a usage like the following.
Note that I use the short and long names from the GNU version:
    $ cargo run -- -h
    headr 0.1.0
    Ken Youens-Clark <[email protected]>
    Rust head

    USAGE:
        headr [OPTIONS] <FILE>...

    FLAGS:
        -h, --help       Prints help information
        -V, --version    Prints version information

    OPTIONS:
        -c, --bytes <BYTES>    Number of bytes
        -n, --lines <LINES>    Number of lines [default: 10]

    ARGS:
        <FILE>...    Input file(s) [default: -]
Run the program with no inputs and verify the defaults are correctly set:
    $ cargo run
    Config {
        files: [
            "-",
        ],
        lines: 10,
        bytes: None,
    }
The files
should default to the filename “-”.
The number of lines
should default to 10.
The bytes
should be None
.
Run the program with arguments and ensure they are correctly parsed:
    $ cargo run -- -n 3 tests/inputs/one.txt
    Config {
        files: [
            "tests/inputs/one.txt",
        ],
        lines: 3,
        bytes: None,
    }
The positional argument tests/inputs/one.txt is parsed as one of the files
.
The -n
option for lines
sets this to 3.
The -c
option for bytes
defaults to None
.
If I provide more than one positional argument, they will all go into the files
, and the -c
argument will go into bytes
.
In the following command, I’m again relying on the bash
shell to expand the file glob *.txt into all the files ending in .txt.
PowerShell users should refer to the equivalent use of Get-ChildItem
shown in Chapter 3:
    $ cargo run -- -c 4 tests/inputs/*.txt
    Config {
        files: [
            "tests/inputs/empty.txt",
            "tests/inputs/one.txt",
            "tests/inputs/three.txt",
            "tests/inputs/two.txt",
        ],
        lines: 10,
        bytes: Some(
            4,
        ),
    }
There are four files ending in .txt.
The lines
is still set to the default value of 10
.
The -c 4
results in the bytes
now being Some(4)
.
Any value for -n
or -c
that cannot be parsed into a positive integer should cause the program to halt with an error:
    $ cargo run -- -n blarg tests/inputs/one.txt
    illegal line count -- blarg
    $ cargo run -- -c 0 tests/inputs/one.txt
    illegal byte count -- 0
The program should disallow -n
and -c
to be present together:
    $ cargo run -- -n 1 -c 1 tests/inputs/one.txt
    error: The argument '--lines <LINES>' cannot be used with '--bytes <BYTES>'
Just parsing and validating the arguments is a challenge, but I know you can do it.
Be sure to consult the clap
documentation as you figure this out.
I recommend you not move forward until your program can pass all the tests included with cargo test dies
:
    running 3 tests
    test dies_bad_lines ... ok
    test dies_bad_bytes ... ok
    test dies_bytes_and_lines ... ok
Following is how I defined the arguments for clap
.
Note that the two options for lines
and bytes
will take values.
This is different from the flags implemented in Chapter 3 that are used as Boolean values:
    let matches = App::new("headr")
        .version("0.1.0")
        .author("Ken Youens-Clark <[email protected]>")
        .about("Rust head")
        .arg(
            Arg::with_name("lines")
                .short("n")
                .long("lines")
                .value_name("LINES")
                .help("Number of lines")
                .default_value("10"),
        )
        .arg(
            Arg::with_name("bytes")
                .short("c")
                .long("bytes")
                .value_name("BYTES")
                .takes_value(true)
                .conflicts_with("lines")
                .help("Number of bytes"),
        )
        .arg(
            Arg::with_name("files")
                .value_name("FILE")
                .help("Input file(s)")
                .required(true)
                .default_value("-")
                .min_values(1),
        )
        .get_matches();
The lines
option takes a value and defaults to “10.”
The bytes
option takes a value, and it conflicts with the lines
parameter so that they are mutually exclusive.
The files
parameter is positional, required, takes one or more values, and defaults to “-”.
The Arg::value_name
will be printed in the usage documentation, so be sure to choose a descriptive name. Don’t confuse this with the Arg::with_name
that uniquely defines the name of the argument for accessing within your code.
Following is how I can use parse_positive_int
inside get_args
to validate lines
and bytes
.
When the function returns an Err
variant, I use ?
to propagate the error to main
and end the program; otherwise, I return the Config
:
    pub fn get_args() -> MyResult<Config> {
        let matches = App::new("headr")... // Same as before

        let lines = matches
            .value_of("lines")
            .map(parse_positive_int)
            .transpose()
            .map_err(|e| format!("illegal line count -- {}", e))?;

        let bytes = matches
            .value_of("bytes")
            .map(parse_positive_int)
            .transpose()
            .map_err(|e| format!("illegal byte count -- {}", e))?;

        Ok(Config {
            files: matches.values_of_lossy("files").unwrap(),
            lines: lines.unwrap(),
            bytes,
        })
    }
ArgMatches.value_of
returns an Option<&str>
.
Use Option::map
to unpack a &str
from Some
and send it to parse_positive_int
.
The result of Option::map will be an Option<Result>, and Option::transpose will turn this into a Result<Option>.
In the event of an Err
, create an informative error message. Use ?
to propagate an Err
or unpack the Ok
value.
Do the same for bytes
.
The files
option should have at least one value and so should be safe to call Option::unwrap
.
The lines
has a default value and is safe to unwrap.
The bytes
should be left as an Option
. Use the struct
field init shorthand since the name of the field is the same as the variable.
In the preceding code, I could have written the Config
with every key/value pair like so:
    Ok(Config {
        files: matches.values_of_lossy("files").unwrap(),
        lines: lines.unwrap(),
        bytes: bytes,
    })
Clippy will suggest the following:
    $ cargo clippy
    warning: redundant field names in struct initialization
      --> src/lib.rs:61:9
       |
    61 |         bytes: bytes,
       |         ^^^^^^^^^^^^ help: replace it with: `bytes`
       |
       = note: `#[warn(clippy::redundant_field_names)]` on by default
       = help: for further information visit https://rust-lang.github.io/
       rust-clippy/master/index.html#redundant_field_names
It’s quite a bit of work to validate all the user input, but now I have some assurance that I can proceed with good data.
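To see the Option::map and Option::transpose steps from get_args in isolation, here is a small sketch that does not depend on clap. The names parse_positive and validate are hypothetical, and I use a plain String as the error type to keep the example short:

```rust
// Hypothetical stand-in for the chapter's parser; errors are plain Strings.
fn parse_positive(val: &str) -> Result<usize, String> {
    match val.parse::<usize>() {
        Ok(n) if n > 0 => Ok(n),
        _ => Err(val.to_string()),
    }
}

// Validate an optional argument value, such as the result of value_of.
fn validate(arg: Option<&str>) -> Result<Option<usize>, String> {
    arg.map(parse_positive) // Option<Result<usize, String>>
        .transpose() // Result<Option<usize>, String>
        .map_err(|e| format!("illegal byte count -- {}", e))
}

fn main() {
    println!("{:?}", validate(None));
    println!("{:?}", validate(Some("4")));
    println!("{:?}", validate(Some("0")));
}
```

A missing argument stays None without ever calling the parser, which is why the bytes value can remain an Option after the ? operator has removed the Result layer.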
This challenge program should handle the input files just as in Chapter 3, so I suggest you bring in the open
function from there:
    fn open(filename: &str) -> MyResult<Box<dyn BufRead>> {
        match filename {
            "-" => Ok(Box::new(BufReader::new(io::stdin()))),
            _ => Ok(Box::new(BufReader::new(File::open(filename)?))),
        }
    }
Be sure to add all the required imports:
    use clap::{App, Arg};
    use std::error::Error;
    use std::fs::File;
    use std::io::{self, BufRead, BufReader, Read};
Expand your run
function to try opening the files, printing errors as you encounter them:
    pub fn run(config: Config) -> MyResult<()> {
        for filename in config.files {
            match open(&filename) {
                Err(err) => eprintln!("{}: {}", filename, err),
                Ok(_file) => println!("Opened {}", filename),
            }
        }
        Ok(())
    }
Iterate through each of the filenames.
Attempt to open the filename.
Print errors to STDERR
.
Print a message that the file was successfully opened.
Run your program with a good file and a bad file to ensure it seems to work:
    $ cargo run -- blargh tests/inputs/one.txt
    blargh: No such file or directory (os error 2)
    Opened tests/inputs/one.txt
Next, try to solve reading the lines and then bytes of a given file, then try to add the headers separating multiple file arguments.
Look closely at the error output from head
when handling invalid files.
Notice that readable files have a header first and then the file output, but invalid files only print an error.
Additionally, there is an extra blank line separating the output for the valid files:
    $ head -n 1 tests/inputs/one.txt blargh tests/inputs/two.txt
    ==> tests/inputs/one.txt <==
    Öne line, four words.
    head: blargh: No such file or directory

    ==> tests/inputs/two.txt <==
    Two lines.
I’ve specifically designed some challenging inputs for you to consider.
To see what you face, use the file
command to report file type information:
    $ file tests/inputs/*.txt
    tests/inputs/empty.txt: empty
    tests/inputs/one.txt:   UTF-8 Unicode text
    tests/inputs/three.txt: ASCII text, with CRLF, LF line terminators
    tests/inputs/two.txt:   ASCII text
This is an empty file just to ensure your program doesn’t fall over.
This file contains Unicode as I put an umlaut over the O in Öne to force you to consider the differences between bytes and characters.
This file has Windows-style line endings.
This file has Unix-style line endings.
On Windows, the newline is the combination of the carriage return and the line feed, often shown as CRLF or \r\n. On Unix platforms, only the line feed is used, so LF or \n. These line endings must be preserved in the output from your program, so you will have to find a way to read the lines in a file without removing the line endings.
I want to explain the difference between reading bytes and characters from a file. In the early 1960s, the American Standard Code for Information Interchange (ASCII, pronounced as-key) table of 128 characters represented all possible text elements in computing. It only takes seven bits (2^7 = 128) to represent each character, so the notions of byte and character were interchangeable.
Since the creation of Unicode (Universal Coded Character Set) to represent all the writing systems of the world (and even emojis), some characters may require up to four bytes.
The Unicode standard defines several ways to encode characters including the UTF-8 (Unicode Transformation Format using 8 bits).
As I noted, the file tests/inputs/one.txt begins with the character Ö, which is two bytes long in UTF-8.
If you want head
to show you this one character, you must request two bytes:
    $ head -c 2 tests/inputs/one.txt
    Ö
If I ask head
to select just the first byte from this file, I get the byte value 195
, which is not a valid UTF-8 string.
The output is a special character that indicates a problem converting a character into Unicode:
    $ head -c 1 tests/inputs/one.txt
    �
The challenge program is expected to recreate this behavior.
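The difference between bytes and characters can be checked directly in Rust. The following is a minimal sketch using a string literal in place of the input file; the lossy helper is a hypothetical name used only for illustration:

```rust
// Hypothetical helper: decode raw bytes as text, replacing any invalid
// UTF-8 sequence with the replacement character U+FFFD.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The first word of tests/inputs/one.txt
    let text = "Öne";

    // str::len counts bytes, while chars() iterates over characters
    println!("{} bytes, {} chars", text.len(), text.chars().count());

    // The raw bytes; the first two belong to the single character Ö
    println!("{:?}", text.as_bytes());

    // Two bytes form a complete character, but one byte alone does not
    println!("{}", lossy(&text.as_bytes()[..2]));
    println!("{}", lossy(&text.as_bytes()[..1]));
}
```

The first byte on its own has the value 195, which matches what head -c 1 selects from the file.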
This is a challenging program to write, but you should be able to use std::io
, std::fs::File
, and std::io::BufReader
to figure out how to read bytes and lines from each of the files.
I’ve included a full set of tests in tests/cli.rs that you should have copied into your source tree.
Be sure to run cargo test
frequently to check your progress.
Do your best to pass all the tests before looking at my solution.
I was really surprised by how much I learned by writing this program. What I expected to be a rather simple program proved to be very challenging. I’d like to step you through how I arrived at my solution, starting with how I read a file line-by-line.
To start, I will modify some code from Chapter 3 for reading the lines from a file:
    pub fn run(config: Config) -> MyResult<()> {
        for filename in config.files {
            match open(&filename) {
                Err(err) => eprintln!("{}: {}", filename, err),
                Ok(file) => {
                    for line in file.lines().take(config.lines) {
                        println!("{}", line?);
                    }
                }
            }
        }
        Ok(())
    }
I think this is a really fun solution because it uses the Iterator::take
method to select the number of lines from config.lines
.
I can run the program to select one line from a file that contains three, and it appears to work grandly:
    $ cargo run -- -n 1 tests/inputs/three.txt
    Three
If I run cargo test
, the program will pass several tests, which seems pretty good for having only implemented a small portion of the specs.
It’s failing all the tests starting with three which use the Windows-encoded input file.
To fix this problem, I have a confession to make.
It pains me to tell you this, dear reader, but I lied to you in Chapter 3.
The catr
program I showed does not completely replicate the original program because it uses BufRead::lines
to read the input files.
The documentation for that function says “Each string returned will not have a newline byte (the 0xA
byte) or CRLF (0xD
, 0xA
bytes) at the end.”
I hope you’ll forgive me because I wanted to show you how easy it can be to read the lines of a file, but you should be aware that the catr
program replaces Windows CRLF line endings with Unix-style newlines.
To fix this, I must instead use BufRead::read_line
, which says “This function will read bytes from the underlying stream until the newline delimiter (the 0xA
byte) or EOF is found. Once found, all bytes up to, and including, the delimiter (if found) will be appended to buf
.”
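To make the difference between the two methods concrete, here is a small sketch that reads the same in-memory text both ways. I use std::io::Cursor in place of a real file so the example is self-contained, and the function names are hypothetical:

```rust
use std::io::{self, BufRead, Cursor};

// Collect lines with BufRead::lines, which strips the line endings.
fn stripped_lines(input: &str) -> io::Result<Vec<String>> {
    Cursor::new(input).lines().collect()
}

// Collect lines with BufRead::read_line, which keeps the line endings.
fn raw_lines(input: &str) -> io::Result<Vec<String>> {
    let mut reader = Cursor::new(input);
    let mut lines = Vec::new();
    loop {
        let mut line = String::new();
        // Zero bytes read means the end of the input was reached
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        lines.push(line);
    }
    Ok(lines)
}

fn main() -> io::Result<()> {
    // Mixed CRLF and LF endings, as in tests/inputs/three.txt
    let input = "Three\r\nlines,\nfour words.\n";
    println!("{:?}", stripped_lines(input)?);
    println!("{:?}", raw_lines(input)?);
    Ok(())
}
```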
Following is a version that will preserve the original line endings.
With these changes, the program will pass more tests than it fails:
    pub fn run(config: Config) -> MyResult<()> {
        for filename in config.files {
            // Use the open function so the handle implements BufRead
            match open(&filename) {
                Err(err) => eprintln!("{}: {}", filename, err),
                Ok(mut file) => {
                    let mut line = String::new();
                    for _ in 0..config.lines {
                        let bytes = file.read_line(&mut line)?;
                        if bytes == 0 {
                            break;
                        }
                        print!("{}", line);
                        line.clear();
                    }
                }
            };
        }
        Ok(())
    }
Accept the filehandle as a mut
(mutable) value.
Use String::new
to create a new, empty mutable string buffer to hold each line.
Use for
to iterate through a std::ops::Range
to count up from 0 to the requested number of lines. The variable name _
indicates I do not intend to use it.
Use BufRead::read_line
to read the next line.
The filehandle will return 0 bytes when it reaches the end, so break
out of the loop.
Print the line including the original line ending.
Use String::clear
to empty the line buffer.
If I run cargo test
at this point, I’m passing almost all the tests for reading lines and failing all those for reading bytes and handling multiple files.
Next, I’ll handle reading bytes from a file.
After I attempt to open the file, I check to see if the config.bytes
is Some
number of bytes; otherwise, I’ll use the preceding code that reads lines:
    for filename in config.files {
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(mut file) => {
                if let Some(num_bytes) = config.bytes {
                    let mut handle = file.take(num_bytes as u64);
                    let mut buffer = vec![0; num_bytes];
                    let n = handle.read(&mut buffer)?;
                    print!("{}", String::from_utf8_lossy(&buffer[..n]));
                } else {
                    ... // Read lines as before
                }
            }
        };
    }
Use pattern matching to check if config.bytes
is Some
number of bytes to read.
Use take
to read the requested number of bytes.
Create a mutable buffer of a fixed length num_bytes
filled with zeros to hold the bytes read from the file.
Read the desired number of bytes from the filehandle into the buffer. The value n
will report the number of bytes that were actually read, which may be fewer than the number requested.
Convert the bytes into a string that may not be valid UTF-8. Note the range operation to select only the bytes actually read.
The take
method from the std::io::Read
trait expects its argument to be the type u64
, but I have a usize
. I cast or convert the value using the as
keyword.
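The following sketch compares the as keyword with the From and Into traits for integer conversions. The function names are hypothetical, and I also include TryFrom, which this chapter does not otherwise use, as one way to detect values that do not fit. As far as I know, the standard library does not implement From<usize> for u64, which is one reason to reach for as in the code above:

```rust
use std::convert::TryFrom;

// From works when the conversion can never lose information.
fn widen_from(n: u32) -> u64 {
    u64::from(n)
}

// Into is the same conversion, driven by the expected return type.
fn widen_into(n: u32) -> u64 {
    n.into()
}

// The `as` keyword always compiles for integer types.
fn cast(n: usize) -> u64 {
    n as u64
}

// Casting to a smaller type with `as` silently truncates the value.
fn truncating(n: u64) -> u8 {
    n as u8
}

// TryFrom reports a value that is out of range instead of truncating.
fn checked(n: u64) -> Option<u8> {
    u8::try_from(n).ok()
}

fn main() {
    println!("{} {} {}", widen_from(7), widen_into(7), cast(7));
    println!("{} {:?}", truncating(300), checked(300));
}
```

Casting a usize to a u64 is harmless on common 32-bit and 64-bit platforms because the target is at least as wide, but the truncating function shows why as deserves care in the other direction.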
This was perhaps the hardest part of the program for me.
Once I figured out how to read only a few bytes, I had to figure out how to convert them to text.
If I take only part of a multibyte character, the result will fail because strings in Rust must be valid UTF-8.
I was happy to find String::from_utf8_lossy
that will quietly convert invalid UTF-8 sequences to the unknown or replacement character:
    $ cargo run -- -c 1 tests/inputs/one.txt
    �
Let me show you the first way I tried to read the bytes from a file.
I decided to read the entire file into a string, convert that into a vector of bytes, and use a slice to select the first num_bytes
.
    let mut contents = String::new();
    file.read_to_string(&mut contents)?; // Danger here
    let bytes = contents.as_bytes();
    print!("{}", String::from_utf8_lossy(&bytes[..num_bytes])); // More danger
Create a new string buffer to hold the contents of the file.
Read the entire file contents into the string buffer.
Use str::as_bytes
to convert the contents into bytes (u8
or unsigned 8-bit integers).
Use String::from_utf8_lossy
to turn a slice of the bytes
into a string.
I show you this approach so that you know how to read a file into a string; however, this can be a very dangerous thing to do if the file’s size exceeds the amount of memory on your machine. In general, this is a terrible idea unless you are positive that a file is small.
Another serious problem with the preceding code is that it assumes the slice operation bytes[..num_bytes]
will succeed.
If you use this code with an empty file, for instance, you’ll be asking for bytes that don’t exist.
This will cause your program to panic
and exit immediately with an error message:
    $ cargo run -- -c 1 tests/inputs/empty.txt
    thread 'main' panicked at 'range end index 1 out of range for slice of length 0', src/lib.rs:80:50
    note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Rust can prevent you from making all sorts of egregious errors, but it can’t stop you from doing stupid things. There are still plenty of ways for you to shoot yourself in the foot.
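If the bytes are already in memory, one way to avoid the panic is to ask for the range with the slice get method, which returns an Option instead of panicking. This is only a sketch with a hypothetical function name, and it does nothing about the memory concern of reading a whole file:

```rust
// Hypothetical helper: return at most num_bytes from the front of a slice.
fn head_bytes(bytes: &[u8], num_bytes: usize) -> &[u8] {
    // `get` returns None when the range extends past the end of the slice,
    // in which case the whole slice is returned instead.
    bytes.get(..num_bytes).unwrap_or(bytes)
}

fn main() {
    // More bytes available than requested
    println!("{:?}", head_bytes(b"Three", 4));
    // An empty input no longer panics
    println!("{:?}", head_bytes(b"", 1));
}
```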
Following is perhaps the shortest way to read the desired number of bytes from a file:
    let bytes: Result<Vec<_>, _> = file.bytes().take(num_bytes).collect();
    print!("{}", String::from_utf8_lossy(&bytes?));
In the preceding code, the type annotation Result<Vec<_>, _>
is necessary as the compiler infers the type of bytes
as a slice, which has an unknown size.
I must indicate I want a Vec
, which is a smart pointer to heap-allocated memory.
The underscores (_
) here indicate partial type annotation, which basically instructs the compiler to infer the types.
Without this, the compiler complains thusly:
       Compiling headr v0.1.0 (/Users/kyclark/work/sysprog-rust/playground/headr)
    error[E0277]: the size for values of type `[u8]` cannot be known at
    compilation time
      --> src/lib.rs:95:58
       |
    95 |                     print!("{}", String::from_utf8_lossy(&bytes?));
       |                                                          ^^^^^^^ doesn't
       |                                                          have a size known at compile-time
       |
       = help: the trait `Sized` is not implemented for `[u8]`
       = note: all local variables must have a statically known size
       = help: unsized locals are gated as an unstable feature
You’ve now seen that the underscore _
serves various different functions. As the prefix or name of a variable, it shows the compiler you don’t want to use the value. In a match
arm, it is the wildcard for handling any case. When used in a type annotation, it tells the compiler to infer the type.
You can also indicate the type information on the righthand side of the expression using the turbofish operator (::<>
).
Often it’s a matter of style whether you indicate the type on the lefthand or righthand side, but later you will see examples where the turbofish is required for some expressions:
let bytes = file.bytes().take(num_bytes).collect::<Result<Vec<_>, _>>();
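Here are the two spellings side by side in a self-contained sketch. I read from a std::io::Cursor over an in-memory byte slice rather than a file, and the function names are hypothetical:

```rust
use std::io::{self, Cursor, Read};

// Type annotation on the lefthand side of the assignment.
fn first_bytes_annotated(input: &[u8], n: usize) -> io::Result<Vec<u8>> {
    let bytes: Result<Vec<_>, _> = Cursor::new(input).bytes().take(n).collect();
    bytes
}

// The turbofish on the righthand side supplies the same information.
fn first_bytes_turbofish(input: &[u8], n: usize) -> io::Result<Vec<u8>> {
    Cursor::new(input)
        .bytes()
        .take(n)
        .collect::<Result<Vec<_>, _>>()
}

fn main() -> io::Result<()> {
    let annotated = first_bytes_annotated(b"Three", 4)?;
    let turbofish = first_bytes_turbofish(b"Three", 4)?;
    println!("{}", String::from_utf8_lossy(&annotated));
    println!("{}", String::from_utf8_lossy(&turbofish));
    Ok(())
}
```

Because take stops early when the input runs out, both versions return an empty vector for an empty input rather than panicking.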
The unknown character produced by String::from_utf8_lossy (b'\xef\xbf\xbd') is not exactly the same output produced by BSD head (b'\xc3'), making this somewhat difficult to test.
If you look at the run
helper function in tests/cli.rs, you’ll see that I read the expected value (the output from head
) and use the same function to convert what could be invalid UTF-8 so that I can compare the two outputs.
The run_stdin
function works similarly:
fn run(args: &[&str], expected_file: &str) -> TestResult {
    // Extra work here due to lossy UTF
    let mut file = File::open(expected_file)?;
    let mut buffer = Vec::new();
    file.read_to_end(&mut buffer)?;
    let expected = String::from_utf8_lossy(&buffer);

    Command::cargo_bin(PRG)?
        .args(args)
        .assert()
        .success()
        .stdout(predicate::eq(&expected.as_bytes() as &[u8]));

    Ok(())
}
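To see why both sides of the comparison need the same conversion, consider this small sketch. The helper lossy_bytes is hypothetical, and the two-byte input is an assumed stand-in for a multibyte character that was cut in half:

```rust
// Hypothetical helper: decode lossily, then return the resulting UTF-8 bytes.
fn lossy_bytes(raw: &[u8]) -> Vec<u8> {
    String::from_utf8_lossy(raw).as_bytes().to_vec()
}

fn main() {
    // 0xC3 starts a two-byte UTF-8 sequence, so on its own it is invalid.
    let raw = [b'a', 0xC3];
    let converted = lossy_bytes(&raw);
    println!("{:x?} -> {:x?}", raw, converted);

    // The raw bytes and the converted bytes differ...
    assert_ne!(raw.to_vec(), converted);
    // ...but converting an already converted value changes nothing, so two
    // outputs that both pass through the function can be compared safely.
    assert_eq!(lossy_bytes(&converted), converted);
}
```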
The last piece to handle is the separators between multiple files.
As noted before, valid files have a header that puts the filename inside ==> and <== markers.
Files after the first have an additional newline at the beginning to visually separate the output.
This means I will need to know the number of the file that I’m handling, which I can get by using the Iterator::enumerate method.
Following is the final version of my run function that will pass all the tests:
pub fn run(config: Config) -> MyResult<()> {
    let num_files = config.files.len();

    for (file_num, filename) in config.files.iter().enumerate() {
        // open is the chapter's helper that returns a boxed BufRead;
        // the handle must be mutable in order to read from it.
        match open(&filename) {
            Err(err) => eprintln!("{}: {}", filename, err),
            Ok(mut file) => {
                if num_files > 1 {
                    println!(
                        "{}==> {} <==",
                        if file_num > 0 { "\n" } else { "" },
                        filename
                    );
                }

                if let Some(num_bytes) = config.bytes {
                    let mut handle = file.take(num_bytes as u64);
                    let mut buffer = vec![0; num_bytes];
                    let n = handle.read(&mut buffer)?;
                    print!("{}", String::from_utf8_lossy(&buffer[..n]));
                } else {
                    let mut line = String::new();
                    for _ in 0..config.lines {
                        let bytes = file.read_line(&mut line)?;
                        if bytes == 0 {
                            break;
                        }
                        print!("{}", line);
                        line.clear();
                    }
                }
            }
        };
    }

    Ok(())
}
Use the Vec::len method to get the number of files.
Use the Iterator::enumerate method to track both the file number and filenames.
Only print headers when there are multiple files.
Print a newline when the file_num is greater than 0; a file_num of 0 indicates the first file.
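The header logic can be isolated into a small sketch that is easy to test on its own. The header function and the filenames here are hypothetical and are not part of the chapter's program:

```rust
// Hypothetical helper: build the header for the file at position `file_num`.
fn header(file_num: usize, filename: &str) -> String {
    // Every file after the first gets a leading newline as a separator.
    format!("{}==> {} <==", if file_num > 0 { "\n" } else { "" }, filename)
}

fn main() {
    let files = ["one.txt", "two.txt"]; // assumed filenames
    let num_files = files.len();

    for (file_num, filename) in files.iter().enumerate() {
        // Only print headers when there are multiple files.
        if num_files > 1 {
            println!("{}", header(file_num, filename));
        }
        println!("(contents of {})", filename);
    }
}
```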
Implement the multiplier suffixes of the GNU version so that, for instance, -c=1K means print the first 1024 bytes of the file. Be sure to add and run tests.
Implement the negative number options from the GNU version where -n=-3 means you should print all but the last three lines of the file. As always, create tests to ensure your program is correct.
Add an option for selecting characters.
Add the file with the Windows line endings to the tests in Chapter 3. Edit the mk-outs.sh for that program to incorporate this file, and then expand the tests and program to ensure that line endings are preserved.
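As a possible starting point for the first of these challenges, here is a rough sketch of parsing a byte count with a multiplier suffix. The function name parse_byte_count and its error message are my own assumptions, and only the binary suffixes K, M, and G are handled. The GNU version accepts more suffixes than this:

```rust
// Hypothetical helper, not part of the chapter's solution.
// Handles only K (1024), M (1024^2), and G (1024^3).
fn parse_byte_count(val: &str) -> Result<usize, String> {
    // Split off a trailing multiplier, if there is one.
    let (digits, multiplier): (&str, usize) = if let Some(d) = val.strip_suffix('K') {
        (d, 1024)
    } else if let Some(d) = val.strip_suffix('M') {
        (d, 1024 * 1024)
    } else if let Some(d) = val.strip_suffix('G') {
        (d, 1024 * 1024 * 1024)
    } else {
        (val, 1)
    };

    digits
        .parse::<usize>()
        .ok()
        // checked_mul returns None on overflow instead of wrapping.
        .and_then(|n| n.checked_mul(multiplier))
        .ok_or_else(|| format!("invalid byte count: {}", val))
}

fn main() {
    for val in ["10", "1K", "2M", "K", "1X"] {
        match parse_byte_count(val) {
            Ok(n) => println!("{} -> {} bytes", val, n),
            Err(e) => eprintln!("{}", e),
        }
    }
}
```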
This chapter dove into some fairly sticky subjects such as converting types like a &str to a usize, a String to an Error, and a usize to a u64.
I feel like it took me quite a while to understand the differences between &str and String and why I need to use From::from to create the Err part of MyResult.
If you still feel confused, just know that you won’t always.
I think if you keep reading the docs and writing more code, it will eventually come to you.
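If it helps to see these conversions together, here is a compact sketch. I am assuming that MyResult is an alias for a Result whose error is a Box<dyn Error>, and the function name parse_positive_int is used for illustration:

```rust
use std::error::Error;

// Assumed definition of the result alias.
type MyResult<T> = Result<T, Box<dyn Error>>;

// &str -> usize with str::parse; a match guard rejects zero.
fn parse_positive_int(val: &str) -> MyResult<usize> {
    match val.parse() {
        Ok(n) if n > 0 => Ok(n),
        // From::from turns the &str into a Box<dyn Error>.
        _ => Err(From::from(val)),
    }
}

fn main() {
    let n = parse_positive_int("3").unwrap();

    // usize -> u64 with the as keyword.
    let as_u64 = n as u64;

    // &str -> String with From and with Into.
    let from: String = String::from("head");
    let into: String = "head".into();

    println!("{} {} {}", as_u64, from, into);

    // A failed parse carries the offending string as the error message.
    if let Err(e) = parse_positive_int("foo") {
        eprintln!("{}", e);
    }
}
```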
Here are some things you accomplished in this exercise:
You learned to create optional parameters that can take values. Previously, the options were flags.
You saw that all command-line arguments are strings. You used the str::parse method to attempt the conversion of a string like “3” into the number 3.
You learned how to write and run a unit test for an individual function.
You learned to convert types using the as keyword or with traits like From and Into.
You found that _ as the name or prefix of a variable is a way to indicate to the compiler that you don’t intend to use it. When used in a type annotation, it tells the compiler to infer the type.
You learned that a match arm can incorporate an additional Boolean condition called a guard.
You learned how to use BufRead::read_line to preserve line endings while reading a filehandle.
You found that the take method works on both iterators and filehandles to limit the number of elements you select.
You learned to indicate type information on the lefthand side of an assignment or on the righthand side using the turbofish.
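As a brief recap of the line-handling points, the following sketch contrasts BufRead::read_line, which keeps line endings, with BufRead::lines, which strips them. The helper head_lines and the sample input are assumptions made for illustration:

```rust
use std::io::{self, BufRead, Cursor};

// Hypothetical helper: return the first `num_lines` lines with endings intact.
fn head_lines(mut reader: impl BufRead, num_lines: usize) -> io::Result<String> {
    let mut out = String::new();
    let mut line = String::new();
    for _ in 0..num_lines {
        // read_line appends to the buffer and keeps the trailing "\n" or "\r\n".
        let bytes = reader.read_line(&mut line)?;
        if bytes == 0 {
            break; // end of input
        }
        out.push_str(&line);
        line.clear();
    }
    Ok(out)
}

fn main() -> io::Result<()> {
    let input = "one\r\ntwo\nthree\n";

    // The Windows and Unix line endings both survive.
    let kept = head_lines(Cursor::new(input), 2)?;
    assert_eq!(kept, "one\r\ntwo\n");

    // By contrast, lines removes the endings; take limits the iterator.
    let stripped: Vec<String> = Cursor::new(input)
        .lines()
        .take(2)
        .collect::<Result<_, _>>()?;
    assert_eq!(stripped, ["one", "two"]);

    Ok(())
}
```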
1 EOF is an acronym for end of file.