«Insert Name Here»

14 July 2010

Data-Oriented Hierarchies

Filed under: Graphs,Haskell — Ivan Miljenovic @ 12:24 AM

In the Haskell community, there are several topics of discussion that keep coming up over and over again in terms of dealing with the hierarchies in our code. Some of these topics are:

  • Fixing the FunctorApplicativeMonad class hierarchy (however you want to structure it);
  • The best way to define and use monad transformers;
  • Making Functor more relevant; taken to the extreme by the "Caleskell" definitions used by lambdabot on IRC, where it seems almost everything can be expressed in terms of fmap.

Now, I think this kind of discussion is an indication of good health in the Haskell community where we are doing our best to determine what the optimal solution to these problems are (rather than just giving up or being dictated to by a single individual). However, something I've come to realise recently is that in my understanding these discussions are mainly oriented at what the best way to abstract how we write code rather than how we use the data structures that make up the code. Hence, the topic of this blog post.

My Goal

What I want to discuss here is the concept of how we can best define class hierarchies that let us easily interchange our data structures. The purpose of this is that currently, if I write some code using a list as my underlying data structure and then decide that a Sequence would be a better fit because I do a lot of appends, I have to re-write every single bit of my code that knows about that particular data structure. However, I would much prefer to just have to change a few top-level type signatures and maybe some list-specific items in my code and then the magic of type classes would take care of the rest.

Avoiding Duplication

The main focus of when such a hierarchy would be useful is when writing libraries: duplication is avoided by having to write a list-specific, a Sequence-specific and a Set-specific version of a function (e.g. to test if the data structure in question has at least two of the provided values). More than that: often times we are constrained in terms of how we use libraries by what data-type the library author preferred at the time of writing. A library function may require and then return a list, whereas we're using Sets everywhere else. If there is no pressing reason to use a list rather than a Set, then why should it?

Is such a hierarchy already available?

There are some previous attempts at something like this, including (but not limited to):

  • Functor + Foldable + Traversable; this approach can't deal with structures such as Sets as they require an extra restriction on the parametric type.
  • Edison can cope with Set, etc. and has a nice hierarchy between the individual sub-classes (if anything it has too many sub-classes), but is used by very few packages and has what I consider to be a few warts, such as explicitly re-exporting the data types in question in new modules, and some methods (such as strict) that really belong elsewhere.
  • collections seemed to have been another attempt at this, but never seemed to have built on any version of GHC since 6.8.
  • When you only want to consider structures with a linear structure, ListLike is available. However, it seems to be possibly over-busy.
  • Even more specialised than ListLike is IsString, the point of which is to be able to use string literals in Haskell code to define Bytestrings, etc.

The closest viable class/library to my ideal listed above would be a cross between Edison and ListLike; the former has an actual class hierarchy (to avoid duplication, etc.;) whereas the latter seems to be used more in actual practice.

My point here about a class hierarchy is this: in most aspects, any sequence (or "ListLike" data structure) can be considered a really inefficient generic collection/set: you still want to have a function to test for membership, you want to be able to add values, to know how many there are, etc. As such, definitions should be as high up in the hierarchy as possible to let functions that use them be as generic as possible in terms of their type signatures.

The Joker in the deck

There is one conflicting issue in any such hierarchy: mapping.

Ideally, we wouldn't want to require that instances of these types have kind * -> * (so that we can for instance [pun not intended] make Bytestring an instance of these classes with a "value type" of Word8). However, as soon as we do that we can no longer specify a map function nicely.

ListLike gets around this by defining a map function that doesn't constrain the data structure type. This means that it's possible to write map succ with a type of ByteString -> [Word8]. Whilst this might be handy at times, it also provides possible type-matching problems if your overall definition when using them doesn't force them to be the same (e.g.: print $ map (*2) [1,2,3,4]), and that in essence this definition of map does a complete fold over the data structure, whereas there may be more efficient versions if we can somehow specify at the type level that it must be constrained to the same data structure.

However, the problem is that technically [Int] and [Char] are two completely separate data types; as such, any map between them will require going from one type to another (since we're not assuming kind * -> * here). It is possible to get around this, but it's pretty ugly:

{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies, FlexibleInstances #-}

class Collection c a | c -> a where
  cons :: a -> c -> c

class (Collection (c a) a) => MappableCollection c a where
  cmap :: (MappableCollection c b) => (a -> b) -> c a -> c b

instance Collection [a] a where
  cons = (:)

instance MappableCollection [] a where
  cmap = map

In essence, the whole point of the MappableCollection class is to force the Collection instance back into having to kind * -> *. It might be better just having Collection use ListLike's rigidMap and then leave "normal" mapping up to Functor or RFunctor (which works better with the whole "class hierarchy" concept). It's just a shame that there's no way of having mapping work regardless of the kind of the data type.

So, what are you going to do about this?

I'm going to make a stab at yet-another-collection-class-hierarchy this weekend at AusHac. I'm not sure how far we'll get, but I'll see.

Graph Hierarchies

My interest in data structure hierarchies came out my frustration at the lack of a common reference point for graph data types. Developing a base hierarchy is going to be my main focus at AusHac (the collections classes are aimed at being used within this graph library). My current plans look something like this (note that this doesn't include extra packages providing specific instances, such as "vector-graph" or something):

That is, the actual "graph" library will also cater for other graph-like data structures, such as Cabal's PackageIndex type. FGL (both the old and the "new" version, whatever it'll be called) will then extend these classes to provide the notion of inductive graphs; anything that isn't directly related to the notion of inductive graphs will be shifted down to this notion of "generic graphs".

In terms of terminology, to ease the transition I'm probably going to stick to current FGL-nomenclature for the most part (unless there's something horribly wrong/bad about it). So we're still going to talk about Nodes rather than Vertices, etc.

Old FGL

As I intimated in the extended announcement for fgl-5.4.2.3, apart from bug-fixes we're not going to work on the current 5.4 branch. The 5.5 branch will be developed so as to use the generic graph classes once I've got them sorted out, and then that will probably be the end of it.

New FGL

Now, this has become a rather hot topic: should a rewrite of FGL still be called FGL? I've covered this earlier, but I have now created a survey to try and find out what the community thinks it should be called (I did want an "other" option in the first drop-down menu, but Google Docs wouldn't let me :( ).

About these ads

6 Comments »

  1. Note: Foldable can deal with Set. It just can’t change the set in any way.

    Comment by Edward Kmett — 16 July 2010 @ 5:24 AM | Reply

    • Foldable can; Functor and Traversable can’t.

      Comment by Ivan Miljenovic — 16 July 2010 @ 12:52 PM | Reply

  2. [...] to change which data structure you want to use in your code, with minimal code change. See his blog post for more [...]

    Pingback by Data.Random » Blog Archive » AusHac2010 Day 2 progress — 17 July 2010 @ 6:15 PM | Reply

  3. I don’t like the CFunctor approach either, but I think it’s the best solution to the problem posed by containers like StorableVector while maintaining support for things like ByteString.

    I think a better approach to the Container class is to break it up so it’s less monolithic. The functions “empty” and “null” should be in one class (or possibly two), “size” should be in a “Sizeable” class, etc. Then you can define type classes for different families of containers that call in the appropriate class constraints, e.g. Container would require Emptyable and Sizeable, NonEmptyContainer wouldn’t require Emptyable, and so on. I think this would be more usable. Contexts would still be very short because low-level functions would only require the one or two of these that they really need, and most client code would use Container, etc. and pull in all the constraints that way. In this vein I think Monoid should be avoided entirely.

    In particular, this will allow support for container types that cannot be empty, such as

    data SafeList a = SLast a | SCons a (SafeList a)

    One problem with previous attempts at solving this issue is the one-size-fits-all approach a monolithic container class requires. IMO, breaking up the restrictions into small chunks will be more flexible and more usable.

    Comment by John Lato — 21 July 2010 @ 8:15 PM | Reply

    • Are you commenting on the API design of container-classes? If so, you could have given me a couple of hours: I was just about to write a blog post about that! ;-)

      My problem with having multiple such typeclasses is that it can increase type signature constraints (at least until type-class aliases are available). Furthermore, no one API will be able to cover everything…

      Finally, what does fold over non-empty containers look like? And is such a list really used or are you just trying to come up with possible counter-examples?

      Comment by Ivan Miljenovic — 21 July 2010 @ 8:48 PM | Reply

  4. [...] Ivan Miljenovic @ 10:34 PM During AusHac, I worked on the container hierarchy I discussed in my previous post, which culminated in the initial release of container-classes. I had initially (and naively) [...]

    Pingback by Working out the container-classes API « «Insert Name Here» — 21 July 2010 @ 10:34 PM | Reply


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Rubric Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: