Quick and Dirty Benchmarking

Ryan James Spencer

In the past I've advocated for criterion as the default benchmarking harness in Rust projects, but it isn't always the fastest to run in a development loop when you want quick feedback on optimisations. First, let's make our release builds incremental by specifying the following in your project's Cargo.toml file.

[profile.release]
incremental = true

Next, let's set up a baseline benchmark. I've made a template here that you can dump into a pre-existing module or give its own module. We use cargo watch here, but it could just as well be any other tool that does the same job, such as entr. This benchmarking suite is bundled only with nightly, as it comes from the libtest crate.

cargo +nightly watch -x bench
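In case that template isn't to hand, a minimal sketch of a nightly libtest benchmark might look something like the following; the baseline name and the summing workload are placeholders for your own code:

// These two lines need to live at the crate root (e.g. lib.rs or main.rs).
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn baseline(b: &mut Bencher) {
    // Build a modest, fixed input outside of the timed closure.
    let input: Vec<u64> = (0..10_000).collect();
    b.iter(|| {
        // black_box keeps the optimiser from eliding the work entirely.
        test::black_box(input.iter().sum::<u64>())
    });
}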

Remember, the aim here is to get a really fast local feedback loop, not to produce rigorous, publishable results. We want to know whether changes are general speedups or slowdowns without having to wait for excessive periods of time. Granted, the runtime of the code the benchmark is executing plays a big part in this, regardless of the choice of harness. To keep it in check, focus on reducing the following (a rough, hand-rolled sketch applying both points follows the list):

  • Number of iterations - as a rule of thumb, try to pick somewhere between 5 and 100 iterations depending on the code under inspection. You want some confidence in an average between runs, but you also don't want to spend too long honing that average.

  • Size of input - try to pick input sizes that are neither trivial nor massive; you want to ensure the code is properly exercised while also keeping the time to complete a benchmark down.
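To make both of those knobs concrete, here's a rough, hand-rolled sketch using std::time::Instant instead of the nightly harness; the iteration count, input size, and summing workload are all placeholder choices:

use std::hint::black_box;
use std::time::Instant;

fn main() {
    // A handful of iterations over a modest input keeps the loop snappy.
    const ITERATIONS: u32 = 50;
    let input: Vec<u64> = (0..10_000).collect();

    let start = Instant::now();
    for _ in 0..ITERATIONS {
        // Stop the optimiser from throwing the work away.
        black_box(input.iter().sum::<u64>());
    }
    println!("average per iteration: {:?}", start.elapsed() / ITERATIONS);
}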

It's worth stressing again that this isn't about building rigorous benchmarks for comparisons to other projects but to build benchmarks that help you understand the general trend of whether or not your changes are making improvements or regressing.

Alternative Approaches

Sometimes a benchmark like the above may be a bit awkward given the way the code is laid out, and if you have a binary, or can bake the logic into one, it may be fine to record the respective wall times across invocations of the process instead. From scratch, let's build out a tester binary for us to run. First, we'll pull in structopt so we can easily switch between the changes we want to experiment with. As structopt is a thin veneer over clap, there's no real advantage to either; it might just help you get results faster.

use std::path::PathBuf;
use structopt::StructOpt;

#[derive(Debug, StructOpt)]
#[structopt(name = "cli", about = "Benchmark harness for X.")]
struct Opt {
    #[structopt(parse(from_os_str))]
    input: PathBuf,

    #[structopt(long = "x1")]
    x1: bool,

    #[structopt(long = "x2")]
    x2: bool,
}

<snip>

fn main() {
    let opt = Opt::from_args();
    if opt.x1 {
        example1(&opt.input).expect("[cli] example1 failure");
    } else if opt.x2 {
        example2(&opt.input).expect("[cli] example2 failure");
    } else {
        baseline(&opt.input).expect("[cli] example failure");
    }
}

We use an input file above, but we could just as easily take input from anywhere: embedded in the program or even from stdin, for example. We are going to run the program using hyperfine, which wraps up criterion-style statistics for whole commands into a neat bundle and is infinitely more useful for comparing wall time averages than manually invoking time multiple times and performing the aggregations yourself:

; cargo build --release
; hyperfine "cli test.in" "cli --x1 test.in" "cli --x2 test.in"

Which gives us some nice output and a summary of the fastest variant in relation to the others:

; hyperfine "target/release/cli test.in" "target/release/cli --x1 test.c
sv" "target/release/cli --x2 test.in"
Benchmark #1: target/release/cli test.in
  Time (mean ± σ):       7.7 ms ±   0.5 ms    [User: 6.5 ms, System: 1.2 ms]
  Range (min … max):     7.1 ms …  11.2 ms    385 runs

Benchmark #2: target/release/cli --x1 test.in
  Time (mean ± σ):      16.3 ms ±   0.7 ms    [User: 13.9 ms, System: 3.7 ms]
  Range (min … max):    15.2 ms …  20.7 ms    171 runs

Benchmark #3: target/release/cli --x2 test.in
  Time (mean ± σ):      18.0 ms ±   2.3 ms    [User: 58.9 ms, System: 2.9 ms]
  Range (min … max):    14.7 ms …  29.4 ms    155 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  'target/release/cli test.in' ran
    2.13 ± 0.17 times faster than 'target/release/cli --x1 test.in'
    2.35 ± 0.34 times faster than 'target/release/cli --x2 test.in'

In my messy use of hyperfine above, it notes that it detected outliers and recommends I consider running the benchmark on a quieter system, which is a good suggestion and one that shouldn't simply be ignored, especially if the changes you are making produce rather minimal gains or losses in performance.
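hyperfine exposes both of those knobs directly; for example, a few throwaway runs before measurement can take the edge off cold caches:

; hyperfine --warmup 3 "target/release/cli test.in" "target/release/cli --x1 test.in" "target/release/cli --x2 test.in"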

You don't need to rig up an explicit benchmark harness program for the purposes of this, either. I've had luck using on-hand binaries from previous builds and newer binaries simply renamed or at different locations on a filesystem to compare relative performance. If a build from three months ago felt a lot faster and I can easily do a build of the latest version off my main branch, I can chuck those into hyperfine, too.
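As a sketch, with an older build stashed under a made-up name like cli-old sitting next to a fresh build:

; hyperfine "./cli-old test.in" "./target/release/cli test.in"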

One last recommendation: it can be handy to rig up profiling tools in scripts to get numbers across changes. With a lot of tooling you are likely to get somewhat unstable numbers across runs on a given target, so if you want something rock-solid across runs, you might consider chucking valgrind into a script and comparing the output across commands, such as the following:

#!/bin/sh -eux

cargo build --release
COMMAND="target/release/cli test.in"

valgrind --tool=cachegrind "$COMMAND" 2>&1 | rg '^=='
valgrind --tool=cachegrind "$COMMAND" --x1 2>&1 | rg '^=='
valgrind --tool=cachegrind "$COMMAND" --x2 2>&1 | rg '^=='

The rg '^==' and stream redirection make sure we only see output from valgrind and not from our tool (unless our tool emits lines starting with two equal signs). Cachegrind has an I refs field, which stands for the instruction references recorded. valgrind runs your program in a sandbox where it can do its checking of various actions, hence the numbers should not change depending on noisy neighbors. If you want something more direct from, say, PMCs (performance monitoring counters), you could plug in perf stat -ad, or rig up a flamegraph to be generated and reloaded into a browser or preview tool each time you make a change.
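For example, assuming perf and the cargo-flamegraph subcommand are installed, something along these lines could be dropped into the same sort of script:

; perf stat -ad target/release/cli test.in
; cargo flamegraph --bin cli -- test.in

cargo flamegraph writes out a flamegraph.svg you can keep reloading in a browser tab between changes.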