Let's put the conc Treiber stack through the wringer to get an idea of the performance of this approach. Our key areas of interest will be:
- Push/pop cycles per second
- Memory behavior, particularly the high-water mark
- Cache behavior
We'll run our programs on x86 and ARM, as in previous chapters. One thing to note before we proceed is that conc—at least as of version 0.5—requires the nightly channel to compile. If you're jumping around in the book, please read the section just before this one for the full details.
Because we have to build on an old version of nightly, we have to wrap our static AtomicUsize values in lazy_static!, a technique you'll see in older Rust code but one we usually avoid in this book. With that in mind, here is our exercise program:
extern crate conc;
#[macro_use]
extern crate lazy_static;
extern crate num_cpus;
extern crate quantiles;

use conc::sync::Treiber;
use quantiles::ckms::CKMS;
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::{thread, time};

lazy_static! {
    static ref WORKERS: AtomicUsize = AtomicUsize::new(0);
    static ref COUNT: AtomicUsize = AtomicUsize::new(0);
}

static MAX_I: u32 = 67_108_864; // 2 ** 26

fn main() {
    let stk: Arc<Treiber<(u64, u64, u64)>> = Arc::new(Treiber::new());

    let mut jhs = Vec::new();

    let cpus = num_cpus::get();
    WORKERS.store(cpus, Ordering::Release);

    for _ in 0..cpus {
        let stk = Arc::clone(&stk);
        jhs.push(thread::spawn(move || {
            for i in 0..MAX_I {
                stk.push((i as u64, i as u64, i as u64));
                stk.pop();
                COUNT.fetch_add(1, Ordering::Relaxed);
            }
            WORKERS.fetch_sub(1, Ordering::Relaxed)
        }))
    }

    let one_second = time::Duration::from_millis(1_000);
    let mut iter = 0;
    let mut cycles: CKMS<u32> = CKMS::new(0.001);
    while WORKERS.load(Ordering::Relaxed) != 0 {
        let count = COUNT.swap(0, Ordering::Relaxed);
        cycles.insert((count / cpus) as u32);
        println!(
            "CYCLES PER SECOND({}): 25th: {} 50th: {} 75th: {} 90th: {} max: {}",
            iter,
            cycles.query(0.25).unwrap().1,
            cycles.query(0.50).unwrap().1,
            cycles.query(0.75).unwrap().1,
            cycles.query(0.90).unwrap().1,
            cycles.query(1.0).unwrap().1
        );
        thread::sleep(one_second);
        iter += 1;
    }

    for jh in jhs {
        jh.join().unwrap();
    }
}
We have a number of worker threads equal to the total number of CPUs in the target machine, each doing one push and then an immediate pop on the stack. A COUNT is kept, and the main thread swaps that value for 0 every second or so, tossing the result into a quantile-estimation structure—discussed in more detail in Chapter 4, Sync and Send – the Foundation of Rust Concurrency—and printing out a summary of the recorded cycles per second, scaled to the number of CPUs. The worker threads cycle up through MAX_I, which is arbitrarily set to the smallish value of 2**26. When a worker finishes cycling, it decrements WORKERS and exits. Once WORKERS hits zero, the main loop exits as well.
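As a quick refresher on the quantile-estimation structure, here is a minimal, standalone sketch of the two CKMS calls the exercise program relies on—insert and query—using the same quantiles crate; the error bound matches the program above, but the sample values are invented purely for illustration:

extern crate quantiles;

use quantiles::ckms::CKMS;

fn main() {
    // Same 0.1% error bound as the exercise program above.
    let mut cycles: CKMS<u32> = CKMS::new(0.001);

    // Invented per-second cycle counts, just to have something to summarize.
    for count in &[100_000u32, 110_000, 120_000, 130_000, 140_000] {
        cycles.insert(*count);
    }

    // query returns Option<(usize, T)>: an approximate rank plus the
    // estimated value at the requested quantile.
    let (_rank, median) = cycles.query(0.5).unwrap();
    println!("estimated median cycles per second: {}", median);
}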
On my x86 machine, it takes approximately 58 seconds for this program to exit, with this output:
CYCLES PER SECOND(0): 25th: 0 50th: 0 75th: 0 90th: 0 max: 0
CYCLES PER SECOND(1): 25th: 0 50th: 0 75th: 1124493 90th: 1124493 max: 1124493
...
CYCLES PER SECOND(56): 25th: 1139055 50th: 1141656 75th: 1143781 90th: 1144324 max: 1145284
CYCLES PER SECOND(57): 25th: 1139097 50th: 1141656 75th: 1143792 90th: 1144324 max: 1145284
CYCLES PER SECOND(58): 25th: 1139097 50th: 1141809 75th: 1143792 90th: 1144398 max: 1152384
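Those steady-state figures square with the total work done. Assuming a four-CPU machine—which is what the 3.906 CPUs utilized figure in the perf run below suggests—the workers perform 4 × 67,108,864 = 268,435,456 push/pop cycles in total; spread over roughly 58 seconds, that is about 4.6 million cycles per second machine-wide, or roughly 1.16 million per core, right in line with the quantiles above.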
The x86 perf run is as follows:
> perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/conc_stack > /dev/null

 Performance counter stats for 'target/release/conc_stack':

     230503.932161      task-clock (msec)         #    3.906 CPUs utilized
               943      context-switches          #    0.004 K/sec
             9,140      page-faults               #    0.040 K/sec
   665,734,124,408      cycles                    #    2.888 GHz
   529,230,473,047      instructions              #    0.79  insn per cycle
    99,146,219,517      branches                  #  430.128 M/sec
     1,140,398,326      branch-misses             #    1.15% of all branches
    12,759,284,944      cache-references          #   55.354 M/sec
       124,487,844      cache-misses              #    0.976 % of all cache refs

      59.006842741 seconds time elapsed
The memory-allocation behavior of this program is very favorable as well.
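That claim is easy enough to spot-check yourself. One option—assuming a Linux system with GNU time installed, rather than anything specific to this book's toolchain—is to look at the maximum resident set size:

> /usr/bin/time -v target/release/conc_stack > /dev/null

The Maximum resident set size line in the resulting report is the high-water mark we care about.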
On my ARM machine, it takes approximately 460 seconds for the program to run to completion, with this output:
CYCLES PER SECOND(0): 25th: 0 50th: 0 75th: 0 90th: 0 max: 0
CYCLES PER SECOND(1): 25th: 0 50th: 0 75th: 150477 90th: 150477 max: 150477
...
CYCLES PER SECOND(462): 25th: 137736 50th: 150371 75th: 150928 90th: 151129 max: 151381
CYCLES PER SECOND(463): 25th: 137721 50th: 150370 75th: 150928 90th: 151129 max: 151381
This is about 7.5 times slower than the x86 machine, even though both machines have the same number of cores; the x86 machine simply has a faster clock and a faster memory bus than my ARM machine does. Here is the ARM perf run:
> perf stat --event task-clock,context-switches,page-faults,cycles,instructions,branches,branch-misses,cache-references,cache-misses target/release/conc_stack > /dev/null

 Performance counter stats for 'target/release/conc_stack':

    1882745.172714      task-clock (msec)         #    3.955 CPUs utilized
                 0      context-switches          #    0.000 K/sec
             3,212      page-faults               #    0.002 K/sec
 2,109,536,434,019      cycles                    #    1.120 GHz
 1,457,344,087,067      instructions              #    0.69  insn per cycle
   264,210,403,390      branches                  #  140.333 M/sec
    36,052,283,984      branch-misses             #   13.65% of all branches
   642,854,259,575      cache-references          #  341.445 M/sec
     2,401,725,425      cache-misses              #    0.374 % of all cache refs

     476.095973090 seconds time elapsed
The fact that the branch-miss rate on ARM—13.65% of all branches, versus 1.15% on x86—is roughly 12 percentage points higher is an interesting result.
Before moving on, remember to switch your default toolchain back to stable:
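> rustup default stable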