Haskell

Test your benchmarks!

There are lies, damn lies and benchmarks.
Old Jungle saying

testbench is a new library designed to make it easier to write comparison benchmarks, ensure that they return the correct value and thus help prevent unintentional bias in benchmarks.

Motivation

About a year ago, I was working on some Haskell code that I wanted to compare to existing implementations. In Haskell, we of course have the wonderful criterion library to write benchmarks with, but whilst I’ve found it really helpful before to help me tell whether a particular function has been improving in performance as I work on it, I felt that it was a bit clunky for directly comparing implementations against each other (there used to be a [bcompare] function, but it hasn’t existed since version 1.0.0.0 which came out in August 2014).

When I tried looking at how others have approached this problem, I found that they did so by just directly using the bench and bgroup functions. From my point of view, there are two problems with this approach:

  1. There is a lot of duplication required with this: you would typically have something along the lines of:
    [ bench "f1" $ nf f1 a
    , bench "f2" $ nf f2 a
    ...
    ]

    Because of this duplication, it is too easy to have benchmarks nominally comparing two (or more) functions/values, but accidentally end up comparing apples to oranges (e.g. using whnf instead of nf).

  2. The output generated by criterion – especially as of version 1.0.0.0 – is rather verbose and tends not to lend itself well to directly comparing results to multiple benchmarks. I personally find myself starting to get swamped looking at the terminal output if there’s more than a few benchmarks, and the HTML report is even worse.As I said above, it’s great when I’m directly looking at just how one function compares as I tweak it, but not when I’m wanting to compare multiple functions.

Whilst I kept looking at existing comparison benchmarks, I even came across an example where a comparison ended up nominally showing that f1 was faster than f2… except that the result of f1 was a value with an O(1) implementation of [rnf], whereas f2 has an O(n) definition. I don’t know if this is intentional (I think it probably wasn’t) and even if this is rectified f1 was still faster… but the difference in runtimes – whilst minor in comparison to performance between the two functions – is non-negligible.

This to me demonstrated the desirability of not only having a wrapper around criterion to reduce the verbosity of comparison benchmarks, but to only be able to produce unit tests to ensure criteria are satisfied.

It’s taken me longer than I wished to produce a syntax that I was both happy with and would actually work (with lots of fighting against GHC in the form of “Why won’t you accept this? Oh, wait, now I get it; that makes sense… but can’t you accept it anyway? Pretty please?”), but I’ve now finally gotten it to a usable form and am hence releasing it.

testbench is now available on Hackage with the source on GitHub.

Example

As extremely simple and contrived examples, consider the following:

main :: IO ()
main = testBench $ do

  -- Monomorphic comparisons
  compareFunc "List length"
              (\n -> length (replicate n ()) == n)
              (testWith (@? "Not as long as specified")
               <> benchNormalForm)
              (mapM_ (\n -> comp ("len == " ++ show n) n) [1..5])

  -- Polymorphic comparisons.
  --
  -- Currently it isn't possible to use a Proxy as the argument to the
  -- function, so we're using 'undefined' to specify the type.
  compareFuncConstraint (Proxy :: Proxy (CUnion Eq Num))
                        "Number type equality"
                        (join (==) . (0`asTypeOf`))
                        (baseline "Integer"  (undefined :: Integer)
                         <> benchNormalForm)
                        $ do comp "Int"      (undefined :: Int)
                             comp "Rational" (undefined :: Rational)
                             comp "Float"    (undefined :: Float)
                             comp "Double"   (undefined :: Double)

When this is run, the result on the console is:

Cases: 9  Tried: 9  Errors: 0  Failures: 0
                          Mean    MeanLB    MeanUB    Stddev  StddevLB  StddevUB  OutlierVariance
List length
  len == 1            22.15 ns  21.86 ns  22.88 ns  1.505 ns  742.2 ps  2.826 ns              83%
  len == 2            22.64 ns  22.49 ns  22.87 ns  602.0 ps  449.5 ps  825.7 ps              43%
  len == 3            23.39 ns  23.16 ns  23.78 ns  1.057 ns  632.6 ps  1.553 ns              68%
  len == 4            23.70 ns  23.51 ns  23.95 ns  773.3 ps  567.9 ps  1.050 ns              53%
  len == 5            24.14 ns  23.96 ns  24.71 ns  962.4 ps  307.5 ps  1.886 ns              63%
Number type equality
  Integer             12.59 ns  12.48 ns  12.80 ns  538.0 ps  312.4 ps  944.2 ps              67%
  Int                 12.79 ns  12.69 ns  12.98 ns  463.6 ps  320.0 ps  665.2 ps              59%
  Rational            12.77 ns  12.67 ns  12.93 ns  395.1 ps  290.0 ps  535.9 ps              51%
  Float               13.13 ns  12.88 ns  13.42 ns  869.7 ps  667.3 ps  1.212 ns              83%
  Double              12.74 ns  12.57 ns  13.02 ns  704.6 ps  456.5 ps  1.047 ns              78%

You can see on the top line we’ve had nine tests (run using HUnit):

  • From the first group we’ve specified that all five values must return True.
  • From the second group, we’ve specified that all inputs must return the same value as for the Integer case.

Since all the tests passed, the benchmarks are run. The output for these is a tabular format to make it easier to do vertical comparisons (though in this case the variances are all high so we should take them with a grain of salt).

Caveats

Whilst I’m quite pleased with the API for defining the actual tests/benchmarks (subject to what GHC will let me write), there’s still scope for more functionality (e.g. support for IO-based benchmarks).

However, the default output (as soon above) isn’t configurable. It’s possible to get the individual tests and benchmarks out to feed them explicitly to HUnit and criterion respectively, but if you’re after this particular output then you have to wait until all the benchmarks are complete before the results are printed. There is no support for saving results to file (either as a CSV of all the results or an HTML report), or even to control how the benchmarks are run (minimum time spent on each benchmark, etc.) or any other option currently offered by criterion.

If there is enough interest I can look at adding these in; but this satisfies my itch for now whilst getting this library out there for people to start trying out.

Standard