Let's put the crossbeam-epoch Treiber stack through the wringer to get an idea of the performance of this approach. Key areas of interest for us will be:
- Push/pop cycles per second
- Memory behavior, particularly the high-water mark
- Cache behavior
We'll run our programs on x86 and ARM, like we did in previous chapters. Our exercise program, similar to the hazard pointer program from the previous section, is as follows:
extern crate crossbeam;
#[macro_use]
extern crate lazy_static;
extern crate num_cpus;
extern crate quantiles;

use crossbeam::sync::TreiberStack;
use quantiles::ckms::CKMS;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::{thread, time};

lazy_static! {
    static ref WORKERS: AtomicUsize = AtomicUsize::new(0);
    static ref COUNT: AtomicUsize = AtomicUsize::new(0);
}

static MAX_I: u32 = 67_108_864; // 2 ** 26

fn main() {
    let stk: Arc<TreiberStack<(u64, u64, u64)>> = Arc::new(TreiberStack::new());

    let mut jhs = Vec::new();

    let cpus = num_cpus::get();
    WORKERS.store(cpus, Ordering::Release);

    for _ in 0..cpus {
        let stk = Arc::clone(&stk);
        jhs.push(thread::spawn(move || {
            for i in 0..MAX_I {
                stk.push((i as u64, i as u64, i as u64));
                stk.pop();
                COUNT.fetch_add(1, Ordering::Relaxed);
            }
            WORKERS.fetch_sub(1, Ordering::Relaxed)
        }))
    }

    let one_second = time::Duration::from_millis(1_000);
    let mut iter = 0;
    let mut cycles: CKMS<u32> = CKMS::new(0.001);
    while WORKERS.load(Ordering::Relaxed) != 0 {
        let count = COUNT.swap(0, Ordering::Relaxed);
        cycles.insert((count / cpus) as u32);
        println!(
            "CYCLES PER SECOND({}): 25th: {} 50th: {} 75th: {} 90th: {} max: {}",
            iter,
            cycles.query(0.25).unwrap().1,
            cycles.query(0.50).unwrap().1,
            cycles.query(0.75).unwrap().1,
            cycles.query(0.90).unwrap().1,
            cycles.query(1.0).unwrap().1
        );
        thread::sleep(one_second);
        iter += 1;
    }

    for jh in jhs {
        jh.join().unwrap();
    }
}
We have a number of worker threads equal to the total number of CPUs in the target machine, each doing one push onto and then an immediate pop from the stack. A COUNT is kept and the main thread swaps that value for 0 every second or so, inserting the sampled value into a quantile-estimation structure (discussed in more detail in Chapter 4, Sync and Send – the Foundation of Rust Concurrency) and printing out a summary of the recorded cycles per second, scaled to the number of CPUs. Each worker cycles up to MAX_I, which is arbitrarily set to the smallish value of 2**26. When a worker is finished cycling, it decrements WORKERS and exits. Once WORKERS hits zero, the main loop also exits.
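If the quantile structure is unfamiliar, here is a minimal, standalone sketch of the CKMS type from the quantiles crate, used the same way the exercise program uses it; the error bound of 0.001 matches the program above, and the sample values are made up purely for illustration:

extern crate quantiles;

use quantiles::ckms::CKMS;

fn main() {
    // An approximate quantile estimator with a 0.001 error bound, as in the
    // exercise program.
    let mut cycles: CKMS<u32> = CKMS::new(0.001);
    for sample in &[100u32, 200, 300, 400, 500] {
        cycles.insert(*sample);
    }
    // query returns Option<(rank, value)>; we take the value, as above.
    println!("50th: {}", cycles.query(0.50).unwrap().1);
    println!("90th: {}", cycles.query(0.90).unwrap().1);
}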
On my x86 machine, it takes approximately 38 seconds for this program to exit, with this output:
CYCLES PER SECOND(0): 25th: 0 50th: 0 75th: 0 90th: 0 max: 0
CYCLES PER SECOND(1): 25th: 0 50th: 0 75th: 1739270 90th: 1739270 max: 1739270
...
CYCLES PER SECOND(37): 25th: 1738976 50th: 1739528 75th: 1740474 90th: 1757650 max: 1759459
CYCLES PER SECOND(38): 25th: 1738868 50th: 1739528 75th: 1740452 90th: 1757650 max: 1759459
Compare this to the x86 hazard implementation, which takes 58 total seconds. The x86 perf run is as follows:
> perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/epoch_stack > /dev/null

 Performance counter stats for 'target/release/epoch_stack':

     148830.337380      task-clock (msec)         #    3.916 CPUs utilized
             1,043      context-switches          #    0.007 K/sec
             1,039      page-faults               #    0.007 K/sec
   429,969,505,981      cycles                    #    2.889 GHz
   161,901,052,886      instructions              #    0.38  insn per cycle
    27,531,697,676      branches                  #  184.987 M/sec
       627,050,474      branch-misses             #    2.28% of all branches
    11,885,394,803      cache-references          #   79.859 M/sec
         1,772,308      cache-misses              #    0.015 % of all cache refs

      38.004548310 seconds time elapsed
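A rough back-of-the-envelope figure, not something perf reports directly: the run shows roughly four CPUs utilized, and four workers each performing 2**26 push/pop cycles comes to 268,435,456 cycles in total, so the 161,901,052,886 retired instructions work out to roughly 600 instructions per push/pop pair, loop and counter bookkeeping included.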
The total number of executed instructions is much, much lower in the epoch-based approach, which squares well with the analysis in the opening discussion of this section. Even with only a single hazard pointer in play, epoch-based approaches do less work. On my ARM machine, it takes approximately 72 seconds for the program to run to completion, with this output:
CYCLES PER SECOND(0): 25th: 0 50th: 0 75th: 0 90th: 0 max: 0
CYCLES PER SECOND(1): 25th: 0 50th: 0 75th: 921743 90th: 921743 max: 921743
...
CYCLES PER SECOND(71): 25th: 921908 50th: 922326 75th: 922737 90th: 923235 max: 924084
CYCLES PER SECOND(72): 25th: 921908 50th: 922333 75th: 922751 90th: 923235 max: 924084
Compare this to the ARM hazard-pointer implementation, which takes 463 total seconds! The ARM perf run is as follows:
> perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/epoch_stack > /dev/null

 Performance counter stats for 'target/release/epoch_stack':

     304880.898057      task-clock (msec)         #    3.959 CPUs utilized
                 0      context-switches          #    0.000 K/sec
               248      page-faults               #    0.001 K/sec
   364,139,169,537      cycles                    #    1.194 GHz
   215,754,991,724      instructions              #    0.59  insn per cycle
    28,019,208,815      branches                  #   91.902 M/sec
     3,525,977,068      branch-misses             #   12.58% of all branches
   105,165,886,677      cache-references          #  344.941 M/sec
     1,450,538,471      cache-misses              #    1.379 % of all cache refs

      77.014340901 seconds time elapsed
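The same back-of-the-envelope division gives roughly 800 instructions per push/pop pair on ARM (215,754,991,724 instructions over the same 268,435,456 cycles), and the cache behavior is notably worse here: 1.379% of cache references miss on ARM versus 0.015% on x86.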
Again, drastically fewer instructions are executed compared to the hazard-pointer implementation on the same processor architecture. In passing, it's worth noting that the memory use of the epoch_stack program was lower than that of the hazard-pointer stack on both x86 and ARM, though neither was a memory hog, consuming only a few kilobytes each. Had one of our epoch threads slept for a long while (say, a second or so) before leaving its epoch, memory use would have grown during execution. The reader is encouraged to wreak havoc on their own system.
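For those inclined to take that suggestion, here is a minimal sketch of the failure mode. It is written against the standalone crossbeam-epoch crate rather than the epoch machinery bundled inside the crossbeam version used above, so it demonstrates the mechanism in isolation rather than interfering with TreiberStack's own collector: one thread pins an epoch and sleeps, and while its guard is alive the garbage deferred by another thread cannot be reclaimed.

extern crate crossbeam_epoch;

use crossbeam_epoch as epoch;
use std::{thread, time};

fn main() {
    // One thread pins the current epoch and then sleeps while still holding
    // its guard. The global epoch cannot advance past a live guard.
    let sleeper = thread::spawn(|| {
        let _guard = epoch::pin();
        thread::sleep(time::Duration::from_millis(1_000));
        // _guard is dropped here; reclamation can catch up again.
    });

    // Meanwhile, another thread keeps retiring allocations. Every deferred
    // drop queues up behind the sleeper's epoch, so memory grows for the
    // full second rather than being reclaimed promptly.
    let producer = thread::spawn(|| {
        for i in 0..1_000_000u64 {
            let guard = epoch::pin();
            let big = Box::new([i; 32]);
            guard.defer(move || drop(big));
        }
    });

    sleeper.join().unwrap();
    producer.join().unwrap();
}

Watching this process in top, or under the memory profiling tools from earlier chapters, should show a high-water mark that tracks the length of the sleep.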