Performance testing with Criterion

We can now be fairly confident that our naive HashMap behaves like the standard library's for the properties we've checked. But how does its runtime performance stack up? To answer that, we'll write a benchmark. Rather than use the unstable Bencher subsystem available only in the nightly channel, we'll use Jorge Aparicio's criterion, inspired by the Haskell tool of the same name by Bryan O'Sullivan. criterion works on stable Rust and does statistically valid sampling of runs. All Rust benchmark code lives under the top-level benches/ directory, and criterion benchmarks are no different. Open benches/naive.rs and give it this preamble:

#[macro_use]
extern crate criterion;
extern crate naive_hashmap;
extern crate rand;

use criterion::{Criterion, Fun};
use rand::{Rng, SeedableRng, XorShiftRng};
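
Before any of this will run, Cargo needs to know that benches/naive.rs supplies its own harness. A minimal sketch of the relevant Cargo.toml entries follows; the version numbers here are assumptions based on the criterion and rand APIs used in this chapter, so pin whatever your project actually uses:

[dev-dependencies]
criterion = "0.2"
rand = "0.4"

[[bench]]
name = "naive"
harness = false

The harness = false line keeps cargo bench from wrapping the file in the default libtest harness, leaving the main function that criterion generates in control.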

This benchmark incorporates pseudorandomness to produce an interesting run. Much as with unit tests, handcrafted datasets in benchmarks tend to reflect the implicit biases of their author, which harms the benchmark. Unless, of course, a handcrafted dataset is exactly what's called for; benchmarking programs well is a non-trivial amount of labor. The Rng we use is XorShift, a pseudorandom generator known for its speed rather than its cryptographic security. That suits our purposes here:

fn insert_and_lookup_naive(mut n: u64) {
    let mut rng: XorShiftRng = SeedableRng::from_seed([1981, 1986, 2003, 2011]);
    let mut hash_map = naive_hashmap::HashMap::new();

    while n != 0 {
        let key = rng.gen::<u8>();
        if rng.gen::<bool>() {
            let value = rng.gen::<u32>();
            hash_map.insert(key, value);
        } else {
            hash_map.get(&key);
        }
        n -= 1;
    }
}

The first benchmark, named insert_and_lookup_naive, performs pseudorandom insertions and lookups against the naive HashMap. The careful reader will note that XorShiftRng is given the same seed on every benchmark run. This is important: while we want to avoid handcrafting a benchmark dataset, we do want the dataset to be identical from run to run, else the benchmark comparisons have no basis. That noted, the rest of the benchmark shouldn't be much of a surprise: generate random actions, apply them, and so forth. As we're interested in the timings of the standard library HashMap too, we have a benchmark for that as well:

fn insert_and_lookup_standard(mut n: u64) {
    let mut rng: XorShiftRng = SeedableRng::from_seed([1981, 1986, 2003, 2011]);
    let mut hash_map = ::std::collections::HashMap::new();

    while n != 0 {
        let key = rng.gen::<u8>();
        if rng.gen::<bool>() {
            let value = rng.gen::<u32>();
            hash_map.insert(key, value);
        } else {
            hash_map.get(&key);
        }
        n -= 1;
    }
}

Criterion offers, as of this writing, a method for comparing two functions and a method for running one function over parameterized inputs, but not both at once. That's an issue for us here, as we'd like to compare two functions over many inputs. To that end, this benchmark relies on a small macro called insert_lookup!:

macro_rules! insert_lookup {
    ($fn:ident, $s:expr) => {
        fn $fn(c: &mut Criterion) {
            let naive = Fun::new("naive", |b, i| b.iter(|| insert_and_lookup_naive(*i)));
            let standard = Fun::new("standard", |b, i| b.iter(|| insert_and_lookup_standard(*i)));

            let functions = vec![naive, standard];

            c.bench_functions(&format!("HashMap/{}", $s), functions, &$s);
        }
    };
}

insert_lookup!(insert_lookup_100000, 100_000);
insert_lookup!(insert_lookup_10000, 10_000);
insert_lookup!(insert_lookup_1000, 1_000);
insert_lookup!(insert_lookup_100, 100);
insert_lookup!(insert_lookup_10, 10);
insert_lookup!(insert_lookup_1, 1);

The meat here is that we create two Funs for comparison, called naive and standard, and then use Criterion::bench_functions to run a comparison between the two. In the invocations of the macro, we evaluate insert_and_lookup_* with inputs from 1 to 100_000, each input being the total number of operations, inserts or lookups, to perform against the standard and naive HashMaps. Finally, we need the criterion group and main function in place:

criterion_group!{
    name = benches;
    config = Criterion::default();
    targets = insert_lookup_100000, insert_lookup_10000,
              insert_lookup_1000, insert_lookup_100,
              insert_lookup_10, insert_lookup_1
}
criterion_main!(benches);
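
For reference, here is roughly what a single invocation of the macro expands to, sketched by hand rather than taken from compiler output:

fn insert_lookup_100(c: &mut Criterion) {
    // One named Fun per implementation, so criterion reports them
    // side by side under the same benchmark ID.
    let naive = Fun::new("naive", |b, i| b.iter(|| insert_and_lookup_naive(*i)));
    let standard = Fun::new("standard", |b, i| b.iter(|| insert_and_lookup_standard(*i)));

    // Both functions are measured against the same input: 100 operations.
    c.bench_functions(&format!("HashMap/{}", 100), vec![naive, standard], &100);
}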

Running criterion benchmarks is no different from running Rust's built-in benchmarks:

> cargo bench
   Compiling naive_hashmap v0.1.0 (file:///home/blt/projects/us/troutwine/concurrency_in_rust/external_projects/naive_hashmap)
    Finished release [optimized] target(s) in 5.26 secs
     Running target/release/deps/naive_hashmap-fe6fcaf7cf309bb8

running 2 tests
test test::get_what_you_give ... ignored
test test::sut_vs_genuine_article ... ignored

test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 0 filtered out

     Running target/release/deps/naive_interpreter-26c60c76389fd26d

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

     Running target/release/deps/naive-96230b57fe0068b7
HashMap/100000/naive    time:   [15.807 ms 15.840 ms 15.879 ms]
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  10 (10.00%) high severe
HashMap/100000/standard time:   [2.9067 ms 2.9102 ms 2.9135 ms]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  5 (5.00%) low mild
  1 (1.00%) high mild

HashMap/10000/naive     time:   [1.5475 ms 1.5481 ms 1.5486 ms]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  6 (6.00%) high severe
HashMap/10000/standard  time:   [310.60 us 310.81 us 311.03 us]
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe

And so forth. If gnuplot is installed on your benchmark system, criterion will also helpfully produce gnuplot graphs of your runtimes; installing it is highly recommended.
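
One usage note: like the built-in test harness, criterion accepts a filter string after the -- separator, which is handy once the full suite becomes slow to run. Assuming your criterion version supports filtering, the following would run only the 1,000-operation comparison:

> cargo bench -- HashMap/1000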
