sets

Xavier Leroy (xavier@Theory.Stanford.EDU)
Thu, 6 May 1993 10:22:36 +0200 (MET DST)

Subject: sets
To: caml-redistribution@margaux
Date: Thu, 6 May 1993 10:22:36 +0200 (MET DST)

This is a very long message about sets. Read the abstract and skip
it, if it does not interest you.

Abstract:
Sets are a very useful data structure, but are very difficult to
implement. The basic problem is that ML polymorphism is not adequate
to a proper representation of sets. It's not a problem about modules,
but about polymorphism. We need a more powerful type system to get
really efficient sets.

As Xavier pointed out, a naive implementation based on lists is very
easy to implement. Its drawback is evidently efficiency (O(n^2) for
basic operations union and intersection). But as far as I know, it is
the only one which is absolutely compatible with ML polymorphism: you
get a truly polymorphic empty set and a truly polymorphic function to
add an element to a set.

If we agree that a library implementing sets must be very efficient or
useless, then the problem becomes very hard. The reason is that there
exists many clever ways to represent sets, each one adapted to a particular
usage of sets and to different underlying type. (I guess we want to
implement finite and homogeneous sets over any ML type.) Consider for
instance sets based on an enumerated type: if this type has less than,
say 32 or 31 constant constructors, you have to represent sets over
this type as integers so that union and intersection cost only one
machine instruction. If the enumerated type has more than 32 constant
constructors, you must represent sets on this type as strings
(considered as bit vectors) in order to get a still very fast
implementation of union and intersection. If the underlying type gets
non constant constructors, things become even worse: it can be the
case that you have to stick to the representation of sets as string,
if there are not many elements of the underlying type used in the sets
manipulated by the program, otherwise you must adopt a tree-like
representation (balanced tree or even lists in some cases). You may
choose even more complex representation, such that a mixt of bit
vector and balanced trees. Well, my claim is that efficient set
implementation requires various representations of sets coexisting at
the same time in the program, either a single implementation for each
basic universe (underlying type), or even better a dynamically
modified scheme of representation for the sets of a given universe.

This is very difficult to handle in ML, since data representation
should be uniform to be compatible with polymorphism (and we want
polymorphic sets, don't we?). My own opinion is that this is not a
question of modules, but a problem of typechecking system of the core
language. This is analoguous to I/Os and printing. There exists some
situations where you need to write functions which are able to run
differently according to the type of their arguments or results
(polymorphic write and read as in Pascal). The problem is clear when
you imagine to offer a polymorphic map_set function in the set
library: map_set is the higher-order function on sets, equivalent to
map on lists. Thus map_set has type:
For all 'a, 'b. ('a -> 'b) -> 'a set -> 'b set.
The ``object oriented'' approach of Damien or Didier is the trivial
solution that comes in mind in the first 5 minutes when you imagine to
implement sets. Since it took 5 minutes of Damien's or Didier's time
to be found, it's a very clever idea (I'm not flaming).
Thus, this method works very fine when you design functions which
get as arguments every sets involved in the problem: they dynamically asked to
these arguments ``Hey guy, what's your real implementation?'', get the
answer and work accordingly. This method (not in the OO sense!)
completely fails for map_set: you have to ask to the function what can
be the representation of a set built with elements of its co-domain
type! We don't want to add this ``method'' to each closure, for the
sole reason that we want to implement sets.

Well you can say: I don't mind map_set. You can, but that's not a good
way to deal with the problem: the problem is a real one, it will not desappear
like that. What about singleton : 'a -> 'a set ? or empty : 'a set ?
There are other examples of this scheme in the field of primitives
over ``maps'' (in the set litterature sense).

We need a type-system to treat this cases, and in particular
polymorphic printing (for user's convenience and debugging) and input
(I want a safe function read : unit -> 'a).

Catherine Dubois and I have are working since now 1.5 year about set
implementation in Caml. She knows very well set implementation, and I
know the Caml system. Thus, we got very efficient schemes to represent
sets, and all was fine, until we faced the map_set problem. We then
turned to the design of a new type-system for Caml integrating parametric
polymorphism and a new one which allows the IMPLEMENTATION of map_set.
Thanks to this new type-system we can write the map_set primitive.
Without it, even the most dirty tricks (including OO features), those
which cannot be told in manuals, failed to implement the map_set
function (providing you keep it its natural and required type).
For the time being, we are writing a paper (article or research report) on
this type-system. Wait and see ...

Pierre Weis
----------------------------------------------------------------------------
Formel Project
INRIA, BP 105, F-78153 Le Chesnay Cedex (France)
E-mail: Pierre.Weis@inria.fr
Telephone: +33 1 39 63 55 98
----------------------------------------------------------------------------