There are lies, damn lies and benchmarks.
— Old Jungle saying
testbench is a new library designed to make it easier to write comparison benchmarks, ensure that they return the correct value and thus help prevent unintentional bias in benchmarks.
About a year ago, I was working on some Haskell code that I wanted to compare to existing implementations. In Haskell, we of course have the wonderful criterion library to write benchmarks with, but whilst I’ve found it really helpful before to help me tell whether a particular function has been improving in performance as I work on it, I felt that it was a bit clunky for directly comparing implementations against each other (there used to be a
[bcompare] function, but it hasn’t existed since version 126.96.36.199 which came out in August 2014).
When I tried looking at how others have approached this problem, I found that they did so by just directly using the
bgroup functions. From my point of view, there are two problems with this approach:
- There is a lot of duplication required with this: you would typically have something along the lines of:
[ bench "f1" $ nf f1 a , bench "f2" $ nf f2 a ... ]
Because of this duplication, it is too easy to have benchmarks nominally comparing two (or more) functions/values, but accidentally end up comparing apples to oranges (e.g. using
- The output generated by criterion – especially as of version 188.8.131.52 – is rather verbose and tends not to lend itself well to directly comparing results to multiple benchmarks. I personally find myself starting to get swamped looking at the terminal output if there’s more than a few benchmarks, and the HTML report is even worse.As I said above, it’s great when I’m directly looking at just how one function compares as I tweak it, but not when I’m wanting to compare multiple functions.
Whilst I kept looking at existing comparison benchmarks, I even came across an example where a comparison ended up nominally showing that
f1 was faster than
f2… except that the result of
f1 was a value with an
O(1) implementation of
f2 has an
O(n) definition. I don’t know if this is intentional (I think it probably wasn’t) and even if this is rectified
f1 was still faster… but the difference in runtimes – whilst minor in comparison to performance between the two functions – is non-negligible.
This to me demonstrated the desirability of not only having a wrapper around criterion to reduce the verbosity of comparison benchmarks, but to only be able to produce unit tests to ensure criteria are satisfied.
It’s taken me longer than I wished to produce a syntax that I was both happy with and would actually work (with lots of fighting against GHC in the form of “Why won’t you accept this? Oh, wait, now I get it; that makes sense… but can’t you accept it anyway? Pretty please?”), but I’ve now finally gotten it to a usable form and am hence releasing it.
As extremely simple and contrived examples, consider the following:
main :: IO () main = testBench $ do -- Monomorphic comparisons compareFunc "List length" (\n -> length (replicate n ()) == n) (testWith (@? "Not as long as specified") <> benchNormalForm) (mapM_ (\n -> comp ("len == " ++ show n) n) [1..5]) -- Polymorphic comparisons. -- -- Currently it isn't possible to use a Proxy as the argument to the -- function, so we're using 'undefined' to specify the type. compareFuncConstraint (Proxy :: Proxy (CUnion Eq Num)) "Number type equality" (join (==) . (0`asTypeOf`)) (baseline "Integer" (undefined :: Integer) <> benchNormalForm) $ do comp "Int" (undefined :: Int) comp "Rational" (undefined :: Rational) comp "Float" (undefined :: Float) comp "Double" (undefined :: Double)
When this is run, the result on the console is:
Cases: 9 Tried: 9 Errors: 0 Failures: 0 Mean MeanLB MeanUB Stddev StddevLB StddevUB OutlierVariance List length len == 1 22.15 ns 21.86 ns 22.88 ns 1.505 ns 742.2 ps 2.826 ns 83% len == 2 22.64 ns 22.49 ns 22.87 ns 602.0 ps 449.5 ps 825.7 ps 43% len == 3 23.39 ns 23.16 ns 23.78 ns 1.057 ns 632.6 ps 1.553 ns 68% len == 4 23.70 ns 23.51 ns 23.95 ns 773.3 ps 567.9 ps 1.050 ns 53% len == 5 24.14 ns 23.96 ns 24.71 ns 962.4 ps 307.5 ps 1.886 ns 63% Number type equality Integer 12.59 ns 12.48 ns 12.80 ns 538.0 ps 312.4 ps 944.2 ps 67% Int 12.79 ns 12.69 ns 12.98 ns 463.6 ps 320.0 ps 665.2 ps 59% Rational 12.77 ns 12.67 ns 12.93 ns 395.1 ps 290.0 ps 535.9 ps 51% Float 13.13 ns 12.88 ns 13.42 ns 869.7 ps 667.3 ps 1.212 ns 83% Double 12.74 ns 12.57 ns 13.02 ns 704.6 ps 456.5 ps 1.047 ns 78%
You can see on the top line we’ve had nine tests (run using HUnit):
- From the first group we’ve specified that all five values must return
- From the second group, we’ve specified that all inputs must return the same value as for the
Since all the tests passed, the benchmarks are run. The output for these is a tabular format to make it easier to do vertical comparisons (though in this case the variances are all high so we should take them with a grain of salt).
Whilst I’m quite pleased with the API for defining the actual tests/benchmarks (subject to what GHC will let me write), there’s still scope for more functionality (e.g. support for
However, the default output (as soon above) isn’t configurable. It’s possible to get the individual tests and benchmarks out to feed them explicitly to HUnit and criterion respectively, but if you’re after this particular output then you have to wait until all the benchmarks are complete before the results are printed. There is no support for saving results to file (either as a CSV of all the results or an HTML report), or even to control how the benchmarks are run (minimum time spent on each benchmark, etc.) or any other option currently offered by criterion.
If there is enough interest I can look at adding these in; but this satisfies my itch for now whilst getting this library out there for people to start trying out.