Test your benchmarks!

There are lies, damn lies and benchmarks.
– Old Jungle saying

testbench is a new library designed to make it easier to write comparison benchmarks, to ensure that the functions being compared return the correct results, and thus to help prevent unintentional bias in benchmarks.


About a year ago, I was working on some Haskell code that I wanted to compare against existing implementations. In Haskell we of course have the wonderful criterion library for writing benchmarks, and whilst I’ve found it really useful for telling whether a particular function is improving in performance as I work on it, I felt it was a bit clunky for directly comparing implementations against each other (there used to be a bcompare function, but it hasn’t existed since the release that came out in August 2014).

When I tried looking at how others have approached this problem, I found that they did so by just directly using the bench and bgroup functions. From my point of view, there are two problems with this approach:

  1. There is a lot of duplication required with this approach: you would typically have something along the lines of:
    [ bench "f1" $ nf f1 a
    , bench "f2" $ nf f2 a
    ]

    Because of this duplication, it is all too easy to have benchmarks nominally comparing two (or more) functions/values but accidentally end up comparing apples to oranges (e.g. using whnf for one and nf for the other).

  2. The output generated by criterion – especially in recent versions – is rather verbose and does not lend itself well to directly comparing the results of multiple benchmarks. I personally start to feel swamped looking at the terminal output if there are more than a few benchmarks, and the HTML report is even worse. As I said above, it’s great when I’m looking directly at how just one function fares as I tweak it, but not when I want to compare multiple functions.

Whilst looking through existing comparison benchmarks, I even came across an example where the comparison nominally showed that f1 was faster than f2… except that the result of f1 was a value with an O(1) implementation of rnf, whereas the result of f2 had an O(n) definition. I don’t know if this was intentional (I suspect it wasn’t), and even once that is rectified f1 is still faster… but the difference in rnf runtimes – whilst minor compared to the performance gap between the two functions – is non-negligible.
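To illustrate the kind of bias described above, here is a minimal sketch using deepseq. The types and instances here are hypothetical (not taken from the benchmarks in question); they just show how two NFData instances can make nf measure very different amounts of work:

```haskell
import Control.DeepSeq

-- Hypothetical result types: both wrap a list, but their NFData
-- instances do very different amounts of work under 'rnf'.
newtype Cheap = Cheap [Int]
newtype Full  = Full  [Int]

instance NFData Cheap where
  rnf (Cheap _) = ()        -- O(1): the list is never forced

instance NFData Full where
  rnf (Full xs) = rnf xs    -- O(n): every element is forced
```

A benchmark using nf on a function returning Cheap pays almost nothing for normalisation, whereas one returning Full pays for traversing the entire result, so any timing comparison between the two is skewed.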

This demonstrated to me the desirability of not only having a wrapper around criterion to reduce the verbosity of comparison benchmarks, but also of being able to produce unit tests to ensure that the comparison criteria are satisfied.

It’s taken me longer than I wished to produce a syntax that I was both happy with and would actually work (with lots of fighting against GHC in the form of “Why won’t you accept this? Oh, wait, now I get it; that makes sense… but can’t you accept it anyway? Pretty please?”), but I’ve now finally gotten it to a usable form and am hence releasing it.

testbench is now available on Hackage with the source on GitHub.


As extremely simple and contrived examples, consider the following:

main :: IO ()
main = testBench $ do

  -- Monomorphic comparisons
  compareFunc "List length"
              (\n -> length (replicate n ()) == n)
              (testWith (@? "Not as long as specified")
               <> benchNormalForm)
              (mapM_ (\n -> comp ("len == " ++ show n) n) [1..5])

  -- Polymorphic comparisons.
  -- Currently it isn't possible to use a Proxy as the argument to the
  -- function, so we're using 'undefined' to specify the type.
  compareFuncConstraint (Proxy :: Proxy (CUnion Eq Num))
                        "Number type equality"
                        (join (==) . (0`asTypeOf`))
                        (baseline "Integer"  (undefined :: Integer)
                         <> benchNormalForm)
                        $ do comp "Int"      (undefined :: Int)
                             comp "Rational" (undefined :: Rational)
                             comp "Float"    (undefined :: Float)
                             comp "Double"   (undefined :: Double)

When this is run, the result on the console is:

Cases: 9  Tried: 9  Errors: 0  Failures: 0
                          Mean    MeanLB    MeanUB    Stddev  StddevLB  StddevUB  OutlierVariance
List length
  len == 1            22.15 ns  21.86 ns  22.88 ns  1.505 ns  742.2 ps  2.826 ns              83%
  len == 2            22.64 ns  22.49 ns  22.87 ns  602.0 ps  449.5 ps  825.7 ps              43%
  len == 3            23.39 ns  23.16 ns  23.78 ns  1.057 ns  632.6 ps  1.553 ns              68%
  len == 4            23.70 ns  23.51 ns  23.95 ns  773.3 ps  567.9 ps  1.050 ns              53%
  len == 5            24.14 ns  23.96 ns  24.71 ns  962.4 ps  307.5 ps  1.886 ns              63%
Number type equality
  Integer             12.59 ns  12.48 ns  12.80 ns  538.0 ps  312.4 ps  944.2 ps              67%
  Int                 12.79 ns  12.69 ns  12.98 ns  463.6 ps  320.0 ps  665.2 ps              59%
  Rational            12.77 ns  12.67 ns  12.93 ns  395.1 ps  290.0 ps  535.9 ps              51%
  Float               13.13 ns  12.88 ns  13.42 ns  869.7 ps  667.3 ps  1.212 ns              83%
  Double              12.74 ns  12.57 ns  13.02 ns  704.6 ps  456.5 ps  1.047 ns              78%

You can see on the top line that nine tests were run (using HUnit):

  • From the first group we’ve specified that all five values must return True.
  • From the second group, we’ve specified that all inputs must return the same value as for the Integer case.

Since all the tests passed, the benchmarks are then run. Their output is in a tabular format to make vertical comparisons easier (though in this case the variances are all high, so the results should be taken with a grain of salt).
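The two groups of checks are conceptually just equality assertions. As a rough plain-Haskell sketch (this is not testbench’s actual machinery, and the helper names are hypothetical):

```haskell
-- Sketch of the properties the nine HUnit tests above verify.

-- First group: each of the five values must return True.
listLengthHolds :: Int -> Bool
listLengthHolds n = length (replicate n ()) == n

-- The function being compared in the second group,
-- join (==) . (0 `asTypeOf`), just checks that a zero of the
-- given type is equal to itself.
eqZero :: (Eq a, Num a) => a -> Bool
eqZero x = let z = 0 `asTypeOf` x in z == z

-- The 'baseline "Integer"' option means every other type's result
-- must match the Integer result.
matchesBaseline :: Bool
matchesBaseline =
  all (== eqZero (0 :: Integer))
      [ eqZero (0 :: Int)
      , eqZero (0 :: Rational)
      , eqZero (0 :: Float)
      , eqZero (0 :: Double)
      ]
```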


Whilst I’m quite pleased with the API for defining the actual tests/benchmarks (subject to what GHC will let me write), there’s still scope for more functionality (e.g. support for IO-based benchmarks).

However, the default output (as seen above) isn’t configurable. It’s possible to get the individual tests and benchmarks out and feed them explicitly to HUnit and criterion respectively, but if you’re after this particular output then you have to wait until all the benchmarks are complete before the results are printed. There is no support yet for saving results to file (either as a CSV of all the results or as an HTML report), or for controlling how the benchmarks are run (minimum time spent on each benchmark, etc.) or any other option currently offered by criterion.

If there is enough interest I can look at adding these in; but this satisfies my itch for now whilst getting this library out there for people to start trying out.


2 thoughts on “Test your benchmarks!”

  1. Harendra says:

    This is awesome! I was looking for something like this. Can this be made part of Criterion itself? That way it will be easier to discover it and more people will benefit from it.

    • Ivan Miljenovic says:

I’m not sure if Bryan would want to add something like this to criterion; it would mean adding support for HUnit (and I personally think that without the HUnit integration testbench wouldn’t be as useful).

      My actual preference/wish would be for criterion to be split up into more of a low-level benchmarking library with no reporting (to file or the terminal) capabilities, as I had to add in some workarounds to prevent them from coming up. Not to mention that this way testbench wouldn’t transitively depend on hastache, etc.
