There are lies, damn lies and benchmarks.
— Old Jungle saying
testbench is a new library designed to make it easier to write comparison benchmarks, ensure that they return the correct value and thus help prevent unintentional bias in benchmarks.
About a year ago, I was working on some Haskell code that I wanted to compare to existing implementations. In Haskell, we of course have the wonderful criterion library to write benchmarks with, but whilst I’ve found it really helpful before to help me tell whether a particular function has been improving in performance as I work on it, I felt that it was a bit clunky for directly comparing implementations against each other (there used to be a [bcompare]
function, but it hasn’t existed since version 1.0.0.0 which came out in August 2014).
When I tried looking at how others have approached this problem, I found that they did so by just directly using the bench
and bgroup
functions. From my point of view, there are two problems with this approach:
[ bench "f1" $ nf f1 a
, bench "f2" $ nf f2 a
...
]
Because of this duplication, it is too easy to have benchmarks nominally comparing two (or more) functions/values, but accidentally end up comparing apples to oranges (e.g. using whnf
instead of nf
).
Whilst I kept looking at existing comparison benchmarks, I even came across an example where a comparison ended up nominally showing that f1
was faster than f2
… except that the result of f1
was a value with an O(1)
implementation of [rnf]
, whereas f2
has an O(n)
definition. I don’t know if this is intentional (I think it probably wasn’t) and even if this is rectified f1
was still faster… but the difference in runtimes – whilst minor in comparison to performance between the two functions – is non-negligible.
This to me demonstrated the desirability of not only having a wrapper around criterion to reduce the verbosity of comparison benchmarks, but to only be able to produce unit tests to ensure criteria are satisfied.
It’s taken me longer than I wished to produce a syntax that I was both happy with and would actually work (with lots of fighting against GHC in the form of “Why won’t you accept this? Oh, wait, now I get it; that makes sense… but can’t you accept it anyway? Pretty please?”), but I’ve now finally gotten it to a usable form and am hence releasing it.
testbench is now available on Hackage with the source on GitHub.
As extremely simple and contrived examples, consider the following:
main :: IO ()
main = testBench $ do
-- Monomorphic comparisons
compareFunc "List length"
(\n -> length (replicate n ()) == n)
(testWith (@? "Not as long as specified")
<> benchNormalForm)
(mapM_ (\n -> comp ("len == " ++ show n) n) [1..5])
-- Polymorphic comparisons.
--
-- Currently it isn't possible to use a Proxy as the argument to the
-- function, so we're using 'undefined' to specify the type.
compareFuncConstraint (Proxy :: Proxy (CUnion Eq Num))
"Number type equality"
(join (==) . (0`asTypeOf`))
(baseline "Integer" (undefined :: Integer)
<> benchNormalForm)
$ do comp "Int" (undefined :: Int)
comp "Rational" (undefined :: Rational)
comp "Float" (undefined :: Float)
comp "Double" (undefined :: Double)
When this is run, the result on the console is:
Cases: 9 Tried: 9 Errors: 0 Failures: 0
Mean MeanLB MeanUB Stddev StddevLB StddevUB OutlierVariance
List length
len == 1 22.15 ns 21.86 ns 22.88 ns 1.505 ns 742.2 ps 2.826 ns 83%
len == 2 22.64 ns 22.49 ns 22.87 ns 602.0 ps 449.5 ps 825.7 ps 43%
len == 3 23.39 ns 23.16 ns 23.78 ns 1.057 ns 632.6 ps 1.553 ns 68%
len == 4 23.70 ns 23.51 ns 23.95 ns 773.3 ps 567.9 ps 1.050 ns 53%
len == 5 24.14 ns 23.96 ns 24.71 ns 962.4 ps 307.5 ps 1.886 ns 63%
Number type equality
Integer 12.59 ns 12.48 ns 12.80 ns 538.0 ps 312.4 ps 944.2 ps 67%
Int 12.79 ns 12.69 ns 12.98 ns 463.6 ps 320.0 ps 665.2 ps 59%
Rational 12.77 ns 12.67 ns 12.93 ns 395.1 ps 290.0 ps 535.9 ps 51%
Float 13.13 ns 12.88 ns 13.42 ns 869.7 ps 667.3 ps 1.212 ns 83%
Double 12.74 ns 12.57 ns 13.02 ns 704.6 ps 456.5 ps 1.047 ns 78%
You can see on the top line we’ve had nine tests (run using HUnit):
True
.Integer
case.Since all the tests passed, the benchmarks are run. The output for these is a tabular format to make it easier to do vertical comparisons (though in this case the variances are all high so we should take them with a grain of salt).
Whilst I’m quite pleased with the API for defining the actual tests/benchmarks (subject to what GHC will let me write), there’s still scope for more functionality (e.g. support for IO
-based benchmarks).
However, the default output (as soon above) isn’t configurable. It’s possible to get the individual tests and benchmarks out to feed them explicitly to HUnit and criterion respectively, but if you’re after this particular output then you have to wait until all the benchmarks are complete before the results are printed. There is no support for saving results to file (either as a CSV of all the results or an HTML report), or even to control how the benchmarks are run (minimum time spent on each benchmark, etc.) or any other option currently offered by criterion.
If there is enough interest I can look at adding these in; but this satisfies my itch for now whilst getting this library out there for people to start trying out.
There are quite a few libraries that try and provide more convenient ways of dealing with monad transformers (typically using those defined in transformers so as to avoid re-defining them all the time and to provide inter-compatibility): the old standard of mtl, the type-family variant found in monads-tf, the more ambitious layers package and Roman Cheplyaka’s monad-classes work.
However, I found that none of the libraries I could find really satisfied me. Even layers
and monad-classes
– that aim to simplify/remove the quadratic instance problem still require a “catch-all” default instance for all other monad transformers. Ideally for me, if I want to define a new transformer class, then I should only need to define instances for transformers that directly implement it’s functionality.
As such, I’m pleased to announce the first (alpha-level) release of my new library: monad-levels.
Originally, all I wanted was to be able to lift operations in a base monad up through any transformers I might stack on top of it.
We already have MonadIO; I just need to generalise it to work on any monad, right?
Except that I didn’t want to just lift up a single monad up through the stack: I wanted to be able to convert a function on my base monad up to whatever set of transformers I had stacked up on top of it. So I resigned myself to writing out instances for every existing transformer in the transformers
library.
As I started doing so though, I noticed a common pattern: for each method in the instance, I would be using a combination of the following operations (using StateT
as an example):
wrap
: apply the monad transformer (e.g. m (a,s) → StateT s m a
)unwrap
: remove the monad transformer (e.g. StateT s m a → m (a,s)
)addInternal
: adds the internal per-transformer specific state (e.g. m a → m (a,s)
)In particular, wrap
is used everywhere, unwrap
is used when lowering existing monads down so that they can (eventually) be used in the base monad and addInternal
is used when lifting monadic values.
Thus, if I define this as a type class for monad transformers, then I could use the wonderful DefaultSignatures
extension to simplify defining all the instances, and what’s more such a class would be re-usable.
Generally, the definition of unwrap
and addInternal
require information from within the scope of the transformer (e.g. the s
parameter within StateT
); as such, wrap
ends up being a continuation function. I thus came up with the following class:
class (Monad m) => MonadLevel m where type LowerMonad m :: * -> * type InnerValue m a :: * -- A continuation-based approach for how to lift/lower a monadic value. wrap :: ( (m a -> LowerMonad m (InnerValue m a) -- unwrap -> (LowerMonad m a -> LowerMonad m (InnerValue m a)) -- addInternal -> LowerMonad m (InnerValue m a) ) -> m a
(Note that I’m not using MonadTrans
for this as I also wanted to be able to use this with newtype wrappers.)
So I define this class, use DefaultSignatures
and my instances – whilst still needing to be explicitly defined – become much simpler (and in many/most cases empty)!
Whilst I was looking to see if any existing libraries had something similar (layers
came the closest, but it uses multiple classes and requires being able to specify function inverses when using it), I came across Roman Cheplyaka’s blog post on how monad-control
uses closed type families to automatically recursively lift monads down to a monad that satisfies the required constraint. I became intrigued with this, and wondered if it would be possible to achieve this for any constraint (more specifically something of kind (* → *) → Constraint
) rather than using something that was almost identical for every possible monad transformer class.
So I wrote a prototype that made it seem as if this would indeed work (note that I used the term “lower” rather than “lift”, as I saw it as lowering operations on the overall monadic stack down to where the constraint would be satisfied):
data Nat = Zero | Suc Nat class SatisfyConstraint (n :: Nat) (m :: * -> *) (c :: (* -> *) -> Constraint) where _lower :: Proxy c -> Proxy n -> (forall m'. (c m') => m' a) -> m a instance (ConstraintSatisfied c m ~ True, c m) => SatisfyConstraint Zero m c where _lower _ _ m = m instance (MonadLevel m, SatisfyConstraint n (LowerMonad m) c) => SatisfyConstraint (Suc n) m c where _lower _ _ m = wrap (\ _unwrap addI -> addI (_lower (Proxy :: Proxy c) (Proxy :: Proxy n) m))
(This is a simplified snippet: for more information – including where the ConstraintSatisfied
definition comes from – see here.)
With this, you also get liftBase
for free! However, if all I wanted was a function just to lift a value in the base monad up the stack, then I could have used a much simpler definition. For this to actually be useful, I have to be able to write (semi-)arbitrary functions and lift/lower them as well.
I could just go back to my original plan and use MonadLevel
combined with DefaultSignatures
and not bother with this automatic lifting/lowering business… but I’ve already started, and in for a penny in for a pound. So full steam ahead!
It took a while to sort out it would work (dealing with State
and Reader
was easy; having to extend how this worked for Cont
took quite a while and then even more for Writer
) but monad-levels
is now able to deal with arbitrary monadic functions.
Well… I say arbitrary…
To be able to deal with functions, you first need to use the provided sub-language to be able to specify the type of the function. For example, a basic function of type m a → m a
is specified as Func MonadicValue (MkVarFnFrom MonadicValue))
(or more simply as just MkVarFn MonadicValue
, using the inbuilt simplification that most such functions will return a value of type m a
); something more complicated like CallCC
becomes MkVarFn (Func (Func ValueOnly (MonadicOther b)) MonadicValue)
.
This language of lower-able functions is used to be able to know how to convert arguments and results up and down the monadic stack.
I’m finally releasing this library after being able to successfully replicate all the existing monad transformer classes in mtl
(with the exception of the deprecated MonadError
class). As an example, here is the equivalent to MonadCont
:
import Control.Monad.Levels import Control.Monad.Levels.Constraints import Control.Monad.Trans.Cont (ContT (..)) import qualified Control.Monad.Trans.Cont as C import Control.Monad.Trans.List (ListT) -- | A simple class just to match up with the 'ContT' monad -- transformer. class (MonadLevel m) => IsCont m where -- Defined just to have it based upon the constraint _callCC :: CallCC m a b instance (MonadTower m) => IsCont (ContT r m) where _callCC = C.callCC instance ValidConstraint IsCont where type ConstraintSatisfied IsCont m = IsContT m type family IsContT m where IsContT (ContT r m) = True IsContT m = False -- | Represents monad stacks that can successfully pass 'callCC' down -- to a 'ContT' transformer. type HasCont m a b = SatisfyConstraintF IsCont m a (ContFn b) -- | This corresponds to @CallCC@ in @transformers@. type ContFn b = MkVarFn (Func (Func ValueOnly (MonadicOther b)) MonadicValue) -- This is defined solely as an extra check on 'ContFn' matching the -- type of 'C.callCC'. type CallCC m a b = VarFunction (ContFn b) m a -- Not using CallCC here to avoid having to export it. -- | @callCC@ (call-with-current-continuation) calls a function with -- the current continuation as its argument. callCC :: forall m a b. (HasCont m a b) => ((a -> m b) -> m a) -> m a callCC = lowerSat c vf m a _callCC where c :: Proxy IsCont c = Proxy vf :: Proxy (ContFn b) vf = Proxy m :: Proxy m m = Proxy a :: Proxy a a = Proxy -- By default, ListT doesn't allow arbitrary constraints through; -- with this definition it is now possible to use 'callCC' on @ListT (ContT r m) a@. instance (MonadTower m) => ConstraintPassThrough IsCont (ListT m) True
One thing that should be obvious is that the constraint is a tad more complicated than that required for MonadCont
. Specifically, it requires the a
and b
parameters as well; this is because not all instances of MonadLevel
allow dealing with arbitrary other monadic values (that is, we’re dealing with m a
over all, but we also need to consider m b
in this case). In practice, however, the only existing monad transformer with this constraint is ContT
itself, and you can’t pass through a call to callCC
from one ContT
transformer to another (as there’s no way to distinguish between the two).
(Something that might not be obvious is that the interaction between StateT
– both lazy and strict – and how I’ve defined callCC
differs from how it’s defined in mtl
. Hence why I started this thread on Haskell Cafe.)
But, any monad transformer in the transformers
library that is an instance of MonadCont
also satisfies the requirements for the HasCont
constraint, and furthermore just by making it an instance of MonadLevel
any new transformer (including a newtype wrapper over a monadic stack) will also automatically satisfy the constraint!
There are two main sources of problems currently with monad-levels
.
I have no idea how it compares speed- and memory-wise to mtl
; as it uses a lot of type families, explicit dictionary passing, etc. I expect it to be slower, but I haven’t compared it or investigated if there’s anywhere I can improve it.
I’m not sure of all the names (e.g. MkVarFn
and MkVarFnFrom
for dealing with variadic functions probably could be improved); not to mention that there’s probably also room for improvement in terms of what is exported (e.g. should the actual type classes for dealing with variadic arguments be fully exported in case people think of more possible argument types?).
It could also do with a lot more documentation.
These, however, are for the most part just a matter of time (though it might be that the performance one should actually belong to the next category).
The biggest noticeable problem is one of discovery: if you look at mtl
, it’s obvious to tell when a transformer is an instance of a particular class; in contrast, with monad-levels
there’s no obvious way of looking at Haddock documentation to tell whether or not this is the case. The best you can do for a specific constraint c
and monad m
(without trying it in ghci) is that if it’s MonadLevel_
definition has AllowOtherValues m ~ True
and DefaultAllowConstraints m ~ True
(both of which are the defaults) and the latter hasn’t been overriden with instance ConstraintPassThrough c m ~ False
then it is allowed. (Assuming that the constraint and its functions are sound, and something screwy hasn’t been done like having the monad actually being a loop.)
Something that might also be a problem for some is the complexity: lots of language extensions are used, not to mention using a lot of things like Proxy
and explicit dictionary passing.
As part of this, this means things like type errors sometimes being difficult to resolve due to the large usage of associated types and constraint kinds. Furthermore, as you probably saw in the HasCont
definition shown above, you typically need to use ScopedTypeVariables
with proxies.
Whilst it most definitely isn’t perfect, I think monad-levels
is now at a usable state. As such, I’d appreciate any attempts people make at using it and giving me any feedback you might have.
This is also the first time I’ve used git and Github for my own project. I missed the simplicity and discoverability of darcs, but magit for Emacs makes using it a bit easier, and in-place branches and re-basing turned out to be quite nice.
I realise that this might not be exactly a much-sought-after library, but I’ve been using this as a test-bed for various ideas I’ve been having for a “normal” graph library. I’m going to discuss various aspects of the design of this library and some ideas I’ve had for an extensible graph library.
The standard definition of a graph is as follows:
a graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines, which are 2-element subsets of V (i.e., an edge is related with two vertices, and the relation is represented as unordered pair of the vertices with respect to the particular edge).
However, this definition is rather limiting: by assuming that an edge is comprised of a two-element subset of the vertices, we make it harder for us to consider it computationally: in practice we don’t have a data-structure that represents a two-element set. In practice, we instead tend to use something like (Node,Node)
. However, this isn’t ideal:
Using this representation implicitly makes graphs directed rather than undirected, unless you do a lot more bookkeeping by checking both elements of the tuple. In practice this may not be too much of a problem, as people want directed graphs, but this doesn’t make it perfect.
The directionality of an edge is now part of its definition rather than being a property of the edge: that is, whether the edge is meant to be directed or not should be a value that can be determined from the edge; currently all edges are directed and undirected graphs are simulated with two inverse edges.
Multiple edges become difficult to handle: if you want to delete an edge between node n1
and node n2
and there are three edges there, how do you know which one to delete? In practice, the answer seems to be “all of them”.
A more preferable definition was stated by W. T. Tutte in a 1961 paper:
A graph G consists of a set E(G) of edges and a (disjoint) set V(G) of vertices, together with a relation of incidence which associates with each edge two vertices, not necessarily distinct, called its ends. An edge is a loop if its ends coincide and a link otherwise.
Note that no indication is given of E being a set of two-sets: instead, a mapping exists between every e ∈ E to the two endpoints of that edge.
A planar graph is a graph that can be drawn on a specified surface (usually a plane or a sphere) such that no edges intersect/cross except at their endpoints.
When considering a planar graph programmatically, we also want to take into account their embedding (i.e. where all the edges adjacent to a node are in relation to each other). As such, just using an approach of identifying edges solely by their end points fails completely if there are multiple edges. As such, using a unique identifier for each edge is preferable.
But the difficulty of endpoint identification (i.e. distinguishing between (n1,n2)
) remains. As such, several implementations of planar graphs use two identifiers for each edge. More about this later.
As I said earlier, I’ve been using the development of this library as a way to experiment with various approaches to how to design a graph library, all of which I intend to use in a non-planar graph library.
Most existing graph libraries (e.g. Data.Graph and fgl) use a type alias on Int
values to represent vertices/nodes. Furthermore, when considering how to create a graph, it requires that you:
Explicitly come up with new node identifier for each new node;
Make sure you don’t re-use an existing identifier.
Whilst this isn’t a big problem if considering a bulk creation of a graph (i.e. you have some arbitrary [a]
representing the nodes and edges represented as [(a,a)]
, in which case a zip
-based solution can be used to assign node identifiers, etc. though it would be a tad messy), it isn’t ideal for adding new nodes on after the fact and it is also open to abuse.
As such, planar-graph
does not permit users to create node identifiers: the constructor isn’t exported, it isn’t an instance of Enum
or Bounded
, etc. Instead, you provide the label for the new node you want to add and the function returns the updated graph and the identifier for the new node. When the node identifiers are changed for some reason (e.g. merging two graphs), a function is returned that allows you to update the values of any node identifiers you’ve been storing elsewhere.
Show
and Read
instances are available and leak some of the internals out, but you have to really be persistent to try and abuse them to create your own identifier values: you need to explicitly call read
on the String
to get it to parse as the result isn’t valid Haskell code (as the instances exist solely for debugging).
As intimated earlier, each edge is actually represented by two half-edges: an edge from n1
to n2
is “stored” twice: one half-edge n1 -> n2
and its inverse n2 -> n1
(this also includes loops). Each half-edge has its unique identifier (which is abstract, just as with nodes) and mapping function exists that lets you determine a half-edge’s inverse.
Most graph implementations are something like a newtyped version of:
type Graph = Map Node [Node]
where each node has information on its adjacent nodes. Or, if we consider a graph with labels, we have:
type LabelledGraph n e = Map Node (n, [(Node,e)])
Instead, the definition of planar-graph looks more like (just considering the labelled version):
type PlanarGraph n e = ( Map Node (n, [Edge]) , Map Edge (Node, e, Node, Edge) )
where a mapping exists between a half-edge identifier and the corresponding node that it comes from, the label of that half-edge, the node that it is going to and its inverse half-edge. This definition matches the mathematical ones stated earlier much more closely.
Now, this half-edge usage might be a requirement for a planar graph data structure, but it is also viable for non-planar graphs. First of all, if we wished to allow multiple edges between two nodes, then the traditional representation must be altered slightly:
type LabelledGraph' n e = Map Node (n, [(Node,[e])])
Each edge representation now keeps a list of edge labels, one per edge between the two nodes. Extra bookkeeping is required about what to do when that list becomes empty (and in fact previous fgl versions had an implementation where this list wasn’t considered at all and thus multiple edges would silently fail).
Also, consider how fgl-style graphs are implemented:
type FGLGraph n e = Map Node ([(Node,e)], n, [(Node,e)])
Here, each node stores not only the outgoing edges but also the incoming edges for efficiency reasons (otherwise we’d need to traverse all edges in the graph to determine what the incoming edges are). This also leads to possible data corruption issues as each edge label is stored twice.
However, with our half-edge implementation, neither of these points need any change: each multiple edge has its own unique identifier, and to obtain the incoming edges we just determine the inverse of all the outgoing edges (though technically this point isn’t quite valid when considering directed graphs, as planar-graph treats them differently; see the next section).
Most graph implementations conflate the structure of the graph (i.e. which nodes and edges there are) with the information that graph is representing. One example is the question of graph orientation: in fgl, a graph can be considered to be undirected if each edge is represented twice; however, it is quite possible that such a graph is not undirected but just happens to have each directed edge having an inverse (e.g. some kind of flow algorithm).
Whilst it is not fully formalised as yet, in planar-graph the orientation of a graph is dictated by its half-edge labels: in my use case that prompted the development of this library, I had a need for mixed-orientation of a graph: one half-edge pairing might have had a fixed direction on one half-edge whilst its inverse had a label of “InverseEdge
“; other pairings might both have some kind of partial edge label.
But the actual edge identifiers didn’t change: I could apply a mapping function to transform all edge labels to ()
and thus make the graph “undirected”, but I didn’t need to change the actual structure of the graph to do so.
I believe this is a much more useful way of considering graphs, where the information that the graph represents can be found in the node and edge labels, not the identifiers.
I needed to be able to encode my planar graphs using various binary encodings (e.g. PLANAR_CODE
, described in Appendix A here). Now, I could have written custom encoding functions from a PlanarGraph
to a ByteString
for every possible encoding; however, since they all follow the same basic underlying structure, I decide to utilise an intermediary list-based representation.
However, I then realised that this representation could also be used for Show
and Read
instances for the graphs as well as pretty-printing functions. The definition of the representation is:
[( node index , node label , [( edge index , node index that this edge points to , edge label , inverse edge index )] )]
For Show
, Read
, etc. this is basically a raw dump of the graph (which means it is technically open to abuse as the internals of the abstract identifiers are accessible this way, but I had to draw the line somewhere); the deserialise
function that is utilised by Read
also ended up being useful for an internal function rather than manually trying to construct a graph!
For encoding/decoding of binary representations, a re-numbering of the identifiers according to a breadth-first traversal is first undertaken (as many require that the identifiers be 0 ... n-1
and for decoding the order of edges for each node is important) and then the same structure is used. A class is then used to convert the graph into this representation and then convert it to the encoding of your choice.
However, not all of the encodings require that the graph be planar: whilst a breadth-first traversal doesn’t make as much sense for non-planar graphs, the same framework could be used for other graph types.
I haven’t even started to prototype this, as some of the ideas I’m listing below I only started to mentally flesh out over the past few days. However, I think that it should work and will provide a useful approach to dealing with graphs in Haskell.
The root of my idea is that we often have different properties that we want graphs to hold: should it be a directed graph? What should the identifier types be? Is there any way to automatically determine all nodes with a certain label (e.g. for graph colouring)? What kind should the graph have?
The current methods of dealing with such a thing is to have a graph implementation, and just live with it. This “solution” clearly has problems, not least of which is that if you try to do anything else you have to re-implement everything yourself.
However, consider this: we have a method of generalising monads via monad transformers: why not do the same thing with graphs?
Now, I’m not the first person to think of this; Edward Kmett has already released a library that has this kind of characteristic (though his classes differ from how I’m planning on structure/distinguish them by having them more implementation-based than property-based IMO).
What my plans entail is this:
Define a GraphTransformer
class. Not only will graph transformers be instances of this class, but so will the actual graph types (with identities for the different lift/unlift functions):
class (Graph (SubGraph gt)) => GraphTransformer gt where type SubGraph gt :: * .....
The Graph
class then requires that the instance type also be an instance of GraphTransformer
, and has default definitions of all methods using the lift/unlift functions. These default definitions will resolve down to f = f
definitions for actual graph types, but for transformers will just apply the function down the stack.
This class only defines “getters”: e.g. determine the size of the graph, get all its nodes or edges, etc.
Other classes are defined as sub-classes of Graph
, again using lift/unlift functions from GraphTransformer
for default definitions of the methods.
Most classes (except for where it’s necessary, e.g. a class defining mapping functions) will assume that the graph is of kind *
.
Most transformers will assume that the underlying graph is of kind * -> * -> *
(i.e. that you can specify node and edge labels) so that you can make the transformer an instance of any class that requires kind * -> * -> *
, but it should be possible to make transformers that take in a graph of kind *
.
Because of the default definitions using lift/unlift functions, most class instances for the graph transformers will be of the form instance Graph (MyTransformer ExistingGraph a b)
; this means that if you want to newtype your graph transformer stack, writing the instances will be trivial (albeit rather repetitive and boring).
As such, if a transformer only effects one small aspect of graph manipulation (e.g. a transformer that keeps a Map
of node labels to node IDs so you can more efficiently look up all nodes that have a particular label), then you only need to provide explicit definitions for those classes and methods (in this case, adding and deleting nodes and looking up node identifiers based upon labels rather than filtering on the entire list of nodes).
However, this does mean that any kind of unique operation you can think of (e.g. in the example above the ability to find all identifiers for nodes that have a particular label), you will need to create a new class and make appropriate instances for underlying types (if possible) and existing transformers (so that if you put extra transformers on the stack you can still use the improved definitions for this transformer).
Usage of the serialisation and encoding functionality for all graphs. This will provide Show
/Read
instances for graphs, pretty-printing, easy Binary
instances (using the serialised form) and any available encodings specified.
The actual method of this may change from what I’ve described above, as whilst a breadth-first traversal of a planar graph is unique up to the first edge chosen, for non-planar graphs this isn’t the case. However, for encodings that don’t assume a planar graph this shouldn’t be a problem.
Whilst the provided underlying graph types might use abstract node identifiers, it will not be required for instances of Graph
to do so (and a transformer will be provided to let you specify your own type, that under-the-hood maps to the generated abstract identifiers). However, I can’t see a way around having edges using some kind of abstract edge identifier, as it isn’t as common to attach a unique label to each edge.
Generally, transformers will utilise the node and edge labels of the underlying graph stack to store extra metadata (e.g. directionality of the edges); this is why the transformers will typically require that the underlying type is of kind * -> * -> *
. However, the question then arises: should users be aware of these underlying type transformations? For example, should this information leak out with a Show
instance?
My current thinking is that it shouldn’t: the output from Show
, prettyPrint
, etc. should be as if there were no transformers being used. The main counter-example I can think of is to have some kind of indicator whether each listed half-edge is the “real” one or not, especially when using a transformer that makes it a directed edge (though in this case it can be solved by only listing the “primary” half-edges and not their inverses; this again works for most graph types but not planar ones, as all half-edges need to be listed for the graph to be re-created due to the embedding of edge orders).
Assuming this all works (I plan on starting playing with it next week), I think this approach will do quite well. As you’re writing your code, if you use a newtype/type alias for your graph type, you can just add or remove a transformer from your stack without affecting usage (in most cases: some transformers might change which classes a stack can be used in; e.g. a transformer that lets you specify node identifiers will require using a different class for adding nodes than one that generates and returns identifiers for you). Then at the end if you want to try and tweak performance, you can always go and write custom transformers (or indeed your own graph type if you want to merge all the functionality in to the actual type without using transformers) without having to change your code.
If this all works as I think/hope it will…
I decided to take the opportunity to replace the current explicit String
-based (well, actually Doc
-based) mangling used to create Dot graphs for use with Graphviz from vacuum with usage of my graphviz library. I’ve just sent Conrad a pull request on his GitHub repo, and I decided that this would make a suitable “intro” tutorial on how to use graphviz
.
First of all, have a look at the current implementation of the GHC.Vacuum.Pretty.Dot
module. If you read through it, it’s pretty straight-forward: convert a graph in an adjacency-list format into Dot code by mapping a transformation function over it, then attach the required header and footer.
Note though that this way, the layout/printing aspects are mixed in with the actual conversion part: rather than separating the creation of the Dot code from how it actually appears, it’s all done together.
There’s also a mistake in there that probably isn’t obvious: the “arrowname=onormal
” part of the definition of gStyle
is completely useless: there is no such attribute as “arrowname”; what is probably meant there is “arrowhead” or “arrowtail”.
Let’s now consider the implementation using version 2999.12.*
of graphviz
(the version is important because even whilst doing this I spotted some changes I’m likely to make in the next version for usability purposes):
{-# LANGUAGE OverloadedStrings #-} module GHC.Vacuum.Pretty.Dot ( graphToDot ,graphToDotParams ,vacuumParams ) where import Data.GraphViz hiding (graphToDot) import Data.GraphViz.Attributes.Complete( Attribute(RankDir, Splines, FontName) , RankDir(FromLeft), EdgeType(SplineEdges)) import Control.Arrow(second) ------------------------------------------------ graphToDot :: (Ord a) => [(a, [a])] -> DotGraph a graphToDot = graphToDotParams vacuumParams graphToDotParams :: (Ord a, Ord cl) => GraphvizParams a () () cl l -> [(a, [a])] -> DotGraph a graphToDotParams params nes = graphElemsToDot params ns es where ns = map (second $ const ()) nes es = concatMap mkEs nes mkEs (f,ts) = map (\t -> (f,t,())) ts ------------------------------------------------ vacuumParams :: GraphvizParams a () () () () vacuumParams = defaultParams { globalAttributes = gStyle } gStyle :: [GlobalAttributes] gStyle = [ GraphAttrs [RankDir FromLeft, Splines SplineEdges, FontName "courier"] , NodeAttrs [textLabel "\\N", shape PlainText, fontColor Blue] , EdgeAttrs [color Black, style dotted] ]
(The OverloadedStrings
extension is needed for the FontName
attribute.)
First of all, note that there is no mention or concept of the overall printing/structure of the Dot code: this is all done behind the scenes. It’s also simpler this way to choose custom attributes: Don Stewart’s vacuum-cairo package ends up copying all of this and extra functions from vacuum just to have different attributes; here, you merely need to provide a custom GraphvizParams
value!
So let’s have a look more at what’s being done here. In graphToDotParams
, the provided adjacency list representation [(a,[a])]
is converted to explicit node and edge lists; the addition of ()
to each node/edge is because in many cases you would have some additional label attached to each node/edge, but for vacuum we don’t. There is a slight possible error in this, in that there may be nodes present in an edge list but not specified directly (e.g. [(1,[2])]
doesn’t specify the “2” node). However, Graphviz doesn’t require explicit listing of every node if it’s also present in an edge, and we’re not specifying custom attributes for each node, so it doesn’t matter. The actual grunt work of converting these node and edge lists is then done by graphElemsToDot
in graphviz.
The type signature of graphToDotParams
has been left loose enough so that if someone wants to specify clusters, it is possible to do so. However, by default, graphToDot
uses the specified vacuumParams
which have no clusters, no specific attributes for each node or edge but does have top-level global attributes. Rather than using String
s, we have a list of GlobalAttributes
, with one entry for each of top-level graph, node and edge attributes (the latter two applying to every node/edge respectively). I’ve just converted over the attributes specified in the original (though dropping off the useless “arrowname” one). Some of these attributes have more user-friendly wrappers that are re-exported by Data.GraphViz
; the other three need to be explicitly imported from the complete list of attributes (for these cases I prefer to do explicit named imports rather than importing the entire module so I know which actual attributes I’m using). I am adding more attributes to the “user-friendly” module all the time; RankDir
will probably make it’s way over there for the next release, with a better name and documentation (and thus not requiring any more imports).
Now, you might be wondering how I’ve managed to avoid a (a -> String)
or similar function like the original implementation had. That’s because the actual conversion uses the PrintDot class (which is going to have a nicer export location in the next version of graphviz). As such, as long as a type has an instance – and ones like String
, Int
, etc. all do – then it will be printed when the actual Dot code is created from the DotGraph a
value.
So how to actually use this? In the original source, there’s a commented out function to produce a png
image file. This is achieved by saving the Dot code to a file, then explicitly calling the dot
command and saving the output as an image. Here’s the version using graphviz
:
{-# LANGUAGE ScopedTypeVariables #-} import GHC.Vacuum.Pretty.Dot import Data.GraphViz.Exception graphToDotPng :: FilePath -> [(String,[String])] -> IO Bool graphToDotPng fpre g = handle (\(e::GraphvizException) -> return False) $ addExtension (runGraphviz (graphToDot g)) Png fpre >> return True
(Note: The exception-handling stuff is just used to provide the same IO Bool
result as the original.)
I hope you’ve seen how convenient graphviz can be rather than manually trying to produce Dot code and calling the Graphviz tools to visualise it. There are still some cludgy spots in the API (e.g. I would be more tempted now to have the graph to visualise be the last parameter; at the time I was considering more about using different image outputs for the same graph), so I appreciate people telling me how the API can be improved (including which attributes are commonly used).
Now, one of the packages I maintain is graphviz which suffices at what it does: use Graphviz to visualise graphs using static images. Despite its various problems, I keep using Graphviz because – unlike most of the flashier graph visualisation programs that I’ve found – it doesn’t require a fancy GUI just to convert a pre-existing graph into an image (admittedly, others may have library versions, but most seem to be written in Java and Python, which do not seem as useful in terms of writing Haskell bindings). However, one thing that Graphviz cannot do is let you dynamically visualise graphs, which is especially useful for extremely large graphs (e.g. call graphs).
One of the visualisation toolkits that Adam talked about was the JavaScript InfoVis Toolkit, which seemed quite nice in how you can dynamically interact with the graphs. The graphs are represented using JSON, and the format looks relatively straightforward.
So here’s my possibly crazy idea: does it make sense to create a companion library for graphviz to convert its DotRepr values into JIT-compatible JSON, possibly with some extensions to assist with the visualisation? We already have various libraries for interacting with JavaScript and JSON on HackageDB, so it may be possible to abstract most of the pain of visualising and interacting with graphs on the web into our preferred language. I’m not quite sure how to deal with incompatible/differing attribute values for Dot vs JIT’s JSON, but is this type of avenue worth considering? Such a conversion library would save having to doubly-convert graphs (in case you want static image versions of the visualisations as well).
So, how crazy am I?
For the last several months, a small group of people in Canberra including myself have been preparing a bid for LCA 2013. This is not just to give us more time to make the conference the most awesome, froody LCA you’ve ever been to. No – 2013 is also the centenary of the founding of Canberra as the nation’s capital. It’s a very significant year for us and we’d all be thrilled if we could show the attendees of LCA our great city and Canberrans the great work the FOSS community does to improve everyone’s lives.
So we’re really stoked that the bidding process is going to be opened early, and I think it’ll lead to a really interesting competition that will result, whoever wins, in the best LCA ever!
If you’re interested in getting involved, join the mailing list!
The overall thing I want to do with these generic graph classes is to be able to deal with the vast majority of graph-like data structures in as common a way as possible. Note that I say graph-like: I’m distinguishing here between data structures that match the mathematic definition of a graph (that is, a collection of distinguished objects, where pairs of these objects may be connected in some fashion) from what is usually considered as a graph data structure: the difference mainly arises in that we have notions of expected operations on graph data structures that may not be applicable on our graph-like data types. These operations can either be ones that are forbidden (e.g. adding a node to a static type) or partially forbidden (e.g. adding a cycle to a tree).
As such, the classes as they currently stand are mainly informational: what can we determine from this graph-like type? Do we know specific properties about it (e.g. is it an instance of the class that specifies whether or not the graph is meant to be directed)? There will, of course, be classes for graph manipulation, but I see those as secondary components: for example, it doesn’t make sense to consider using standard graph manipulation functions to add or delete values from a PackageIndex
as we can’t arbitrarily add values to it.
Such a collection of classes will by necessity be subject to compromise: it is not possible to have a fully-featured set of classes that comprehensively covers every single possible type of graph-like data structure whilst also being small and easy enough to use. After all, there’s no point in writing such classes if no-one uses them because they’re too difficult!
In my previous post, I said that the best way of dealing with labels is similar to the way that FGL currently does: force all graph-like types to have both node and edge labels (but not require types to have kind * -> * -> *
like FGL does). A few people objected, notably Sjoerd Visscher said that labels should be optional for both nodes and edges, and ideally be part of the overall node and edge types.
In theory, this solution is great (and I actually worked for a while trying to get something like it to work). However, as I stated in the comments, it fails one notable requirement: we now have to specialise functions on graphs to whether or not the graph has labels or not, and if so which ones. Specifically, if the behaviour of a function may change depending upon whether or not labels are present, such a solution may require four implementations:
Probably the best example I can think of for this is from my graphviz library: consider the preview
function as it is currently defined for FGL graphs:
preview :: (Ord el, Graph gr, Labellable nl, Labellable el) => gr nl el -> IO () preview g = ign $ forkIO (ign $ runGraphvizCanvas' dg Xlib) where dg = setDirectedness graphToDot params g params = nonClusteredParams { fmtNode = \ (_,l) -> [toLabel l] , fmtEdge = \ (_, _, l) -> [toLabel l] } ign = (>> return ())
This is a relatively simple function, that just sets some defaults for the main functions in graphviz. To change this to my proposed layout of compulsory labels mainly requires changes to the type signature (the only implementation change would be the way edges are defined). But with optional labels, then either four variants of this function will be required or else the user will have to specify how to distinguish the node/edge identifiers from the labels (if they exist); this latter solution is not satisfactory as the whole point of this function is to provide defaults to quickly visualise a graph, and as such should not take any other parameters apart from the graph itself.
If an “isInstanceOf
” function was available (to determine whether or not the graph type is an instance of the appropriate label classes without needing to specify them as explicit type constraints), then this wouldn’t be a problem: implementers of functions would just need to take into account the four possible label groupings in their code. But as it stands, the implementation of having optional labels breaks the simplicity requirement that I’m taking into account when writing these classes.
Note that I would actually prefer to have distinct/abstract node and edge types that optionally contain labels: for the planar graph library that I’m working on all operations on edges are done via unique identifiers rather than a data-type isomorphic to a tuple of node identifiers (so as to avoid problems with multiple edges). However, for most graph types such explicit differentiation between edges won’t be required, and in general it will be simpler to both instantiate and use classes when a more simple edge type is used rather than requiring in effect a new data type for each graph (as required when using data families).
One thing I’m still not sure about: how shall I deal with the naming of functions when I have both labelled and unlabelled variants of them? Should I take the FGL route of prepending “lab
” to them (e.g. nodes
vs labNodes
)? I’m not sure I like this solution, as I want to try and shift focus to making the labelled versions the defaults (or at least not as clumsy to use): does it make sense to adopt a policy of priming functions to distinguish between labelled and unlabelled (e.g. nodes
vs nodes'
)? Or should some other naming policy be used?
One of the first things I’ve been considering how to deal with in the new graph classes that I’m working on is how to deal with node and edge labels in graphs. My point of view is that graphs contain two separate but related types of information:
As an example, consider graph colouring: we have the actual structure of the graph and then the colours attached to individual vertices (or edges, depending how you’re doing the colouring). Another example is a flow graph, where the distances/weights are not an actual “physical” part of the graph structure yet nevertheless form an important part of the overall graph.
Yet there are times when the extra labelling/information is an inherent part of the structure: either we are concerning ourselves solely with some graph structural problem (e.g. connected components) or – more commonly when programming – the information about the structure is embedded within the structure (for example, Cabal’s PackageIndex
type: this is simplistically equivalent to an unlabelled graph with PackageIndexID
as the node type).
As such, I’ve come up with at least three different ways of dealing with graph labels:
()
).
Something along the lines of the first two options is very tempting: there is no requirement to force graphs that don’t have or need labels to pretend to have them just to fit the constraints of some class. Furthermore, different graph types can thus be more specific in terms of which graph classes they are instances of.
However, there is a problem here: duplication. Let us consider a simplified set of graph classes that fit the second criteria:
class Graph g where type Node g nodes :: g -> [Node g] edges :: g -> [Edge g] type Edge g = (Node g, Node g) class (Graph g) => LabelledGraph g where type NLabel g type ELabel g labNodes :: g -> [(Node g, NLabel g)] labEdges :: g -> [(Edge g, ELabel g)]
So if some graph type wants to be an instance of LabelledGraph
, it must specify two ways of getting all of the nodes available (admittedly, it will probably have something along the lines of nodes = map fst labNodes
, but wouldn’t it be nice if this could be done automatically?).
But OK, writing a set of classes and then instances for those classes is a one-off cost. Let’s say we accept that cost: the problems don’t stop there. First of all, consider something as simple as adding a node to the graph. There is no way (in general) that the two classifications (labelled and unlabelled) can share in the slightest a method to add a node, etc. Furthermore, this segregation would spread to other aspects of using a graph: almost all algorithms/functions on graphs would thus need to be duplicated (if possible). Since one of the main criteria I have for designing this library is that it should be possible to use graphviz to visualise the PackageIndex
type, this kind of split is not something I think would be beneficial.
As such, the only real viable choice is to enforce usage of labels for all graphs. This might be to the detriment of graphs without labels, but I’m planning on adding various functions that let you ignore labels (e.g. a variant of addNode
that uses mempty
for the graph label, which means it’s usable by graphs that have ()
as the label type). The distinction between nodes
and labNodes
above could also be made automatic, with only the latter being a class method and the former being a top-level function.
This solution isn’t perfect: to ensure it works for all suitable graph types, it has to be kind *
. But this means that no Functor
-like mapping ability will be available, at least without really ugly type signatures (which the current experimental definition uses) at least until superclass constraints become available (or possibly some kind of kind polymorphism, no pun intended). However, this is still the best available solution that I can come up with at this stage.
tests
.
However, several package maintainers seem to be unaware of a minor subtlety of how Cabal parses dependencies.
Let us consider a simple example: we have a package foo
which is primarily a library but also contains a testing executable which uses QuickCheck. The relevant parts of the .cabal
file look something like this:
... Flag test Description: Build the test suite, including an executable to run it. Default: False Library Build-Depends: base == 4.*, containers == 0.3.* Exposed-Modules: Data.Foo Executable foo-tester if flag(test) Buildable: True else Buildable: False Main-Is: FooTester.hs Build-Depends: QuickCheck >= 2.1 && < 2.1.2
So, we have an optional testing executable called foo-tester
and bonus points for defaulting the testing of this executable to false
.
However, this doesn’t quite behave as expected: if we try to build it as-is without enabling the test
flag, then Cabal will still make foo
depend upon QuickCheck
. Why? Because the dependency is not optional (I’m not saying that this behaviour is correct, just that this is how Cabal acts). This became noticeable when QuickCheck-2.2 came out, I upgraded to it and then ghc-pkg check
complained that some packages were now broken.
I’ve pointed out the correct way of doing this to individual maintainers in the past when I noticed it in their packages; now I’m doing it in this blog post in the hope that maintainers of all affected packages will remedy this. To ensure that testing dependencies are only considered when we are indeed building the testing executable, just shift it inside the if-statement:
... Executable foo-tester if flag(test) Buildable: True Build-Depends: QuickCheck >= 2.1 && < 2.1.2 else Buildable: False Main-Is: FooTester.hs
Now QuickCheck
will only be brought in when you’re building tests.
This doesn’t also apply to testing executables, but to any conditional dependencies. See for example how I have testing modules built and exported in graphviz’s .cabal
file.
People that took the survey were asked four things:
fgl
” or “inductive-graphs
“;Several people had some questions/comments regarding the survey both in the survey itself and on the Haskell Reddit. Here are some responses:
Because we couldn’t think of any better names. The former is what the library is already called, the latter describes exactly what the library is about (implementing and using graphs in an inductive fashion). Any other names such as “boxes-and-arrows
” are, we feel, rather silly and don’t make sense. We did ask, but didn’t hear any other names that were relevant.
I don’t want to go through the whole thing all over again, but I’ll summarise. This isn’t a completely new library; it’s just a new implementation (e.g. same as going from QuickCheck-1 to QuickCheck-2; the point of the library is the same, the concepts are the same, the implementation is different and the APIs are incompatible). As for the API incompatibility, that’s what version numbers are for.
Agreed. However, the package name “fgl” already exists, and I don’t believe in gratuitous proliferation of package names on Hackage as its hard enough to navigate as it is. Most people in the Haskell community already know that “fgl” is a graph library, etc. Also see the response to the previous question.
Because when we announced our plans, there was a number of vocal people that complained about our “usurpation” (my word, not theirs) of the FGL name for our own library.
Say what you like about the name of the package (I for one agree that it isn’t an ideal name, especially in the world of “modern” Haskell with Hackage, etc.), I think the module namespace is exactly right. And unless we decide to skip the “Data
” prefix and just have a stand-alone Graph.*
namespace, there isn’t a better namespace available for a package that defines and uses graphs in an inductive fashion. Again, this is a case of “You don’t like it? Fine: pick a better name and if it truly is better we’ll use it.”.
This isn’t going to happen for several reasons (and these are just the ones I could think of whilst writing this blog post):
5.x
series that get closer and closer to the new API and various migration guides/tutorials.Vertex
rather than Node
).We might have some compatibility APIs to help with the transition process (e.g. the noNodes
function is going to be replaced with order
, which is the proper terminology, but we might define noNodes
as an alias), but these will probably be in a different module and it will still not be possible to have code that will work with both the 5.x
series of FGL and the new library.
Here is the initial overall results from the survey:
If we only consider the 40 people who claimed to write code dealing with graphs, only 16 (≈ 40%) of them preferred FGL; as such, actual usage of fgl or other graph libraries does not seem to change the overall opinion of the community (if my vague recollection of how to do statistics is correct, and this is indeed a representative sample of the community).
For those that want it, a sanitised (in that the email addresses and names have been removed) copy of the results is available (I would have hosted them on wordpress.com with the blog, but it doesn’t allow text files to be uploaded, and I don’t see the point of creating a full blown word processor document – since spreadsheets can’t be uploaded – just for some CSV data).
Well…. there isn’t one. A 60% preference is too close to even for me to categorically say that the Haskell community prefers one name over another. As such, unless someone presents a really good reason otherwise we’re going to stick with FGL (due to inertia if nothing else).
After all this debate, I’d like to point out that I’m more and more agreeing that “inductive-graphs” would make a much better library name. However, as I’ve stated previously (including above), I would prefer to use the “fgl” name somehow if nothing else because it’s already there (so a few years from now when – hopefully – the new graph libraries are available and widely used, we don’t have a useless library sitting around confusing people, especially when it used to be present in GHC’s extralibs and the Haskell Platform).
However, I just thought of two possible solutions which may be satisfactory to all involved:
What about if we call the library we’re working on “inductive-graphs”, but then create a meta-library called “FGL” that isn’t just limited to inductive graphs? That is, once I’ve worked out my generic graph API library, then we take a large subset of the modules defined in the libraries contained within this figure and re-export them from the FGL library. Such a library would be analogous to how the Haskell Platform is an initial starting point of the libraries available on Hackage: an all-in-one subset of the available graph libraries in one overall API if you don’t know which libraries to use specifically, and you can then pare down your dependencies to what you actually use.
Another alternative (which I find less attractive) is that we make the FGL library contain the generic graph API; this way the library still exists, but then it is completely different from the 5.x
series. I’m mainly suggesting this just to provide another alternative; I don’t think it really makes sense or is viable.