Re: sets

Xavier Leroy (xavier@Theory.Stanford.EDU)
Thu, 6 May 1993 12:08:25 -0700 (PDT)

From: Xavier Leroy <xavier@Theory.Stanford.EDU>
Message-Id: <9305061908.AA04126@Tamuz.Stanford.EDU>
Subject: Re: sets
To: caml-list@margaux
Date: Thu, 6 May 1993 12:08:25 -0700 (PDT)

In reply to Pierre Weis' message:

> As Xavier pointed out, a naive implementation based on lists is very
> easy to implement. Its drawback is evidently efficiency (O(n^2) for
> basic operations union and intersection). But as far as I know, it is
> the only one which is absolutely compatible with ML polymorphism: you
> get a truly polymorphic empty set and a truly polymorphic function to
> add an element to a set.

I disagree with this point. The list-based implementation of sets
assumes that you're either satisfied with the existing polymorphic
equality primitive, or able to define a reasonable polymorphic
equality function incorporating user-defined equality functions at
some types. In this case, it is easy to extend the polymorphic
equality function to a polymorphic comparison function (just do some
kind of lexicographic ordering over records and sums).
Then, we can have balanced tree implementations of sets, with
operations in O(log n) or O(n log n) instead of O(n) and O(n^2). This
implementation would be as "truly polymorphic" as the one based on
lists.

The problem with both lists and trees is: we don't know how to build a
good polymorphic equality/ordering function.

> If we agree that a library implementing sets must be very efficient or
> useless, then the problem becomes very hard. The reason is that there
> exists many clever ways to represent sets, each one adapted to a particular
> usage of sets and to different underlying type.

You set out to do something very ambitious here: define a polymorphic
data structure with non-uniform representations. Sets are not the only
data structure that could benefit from this technique: for instance,
arrays of characters and arrays of floating-point numbers could be
represented more compactly than generic arrays. But I still need to be
convinced we need to go that far. Generic arrays are a little bit
wasteful, but this seems tolerable in practice. Similarly, I think
generic sets implemented as balanced trees would also be tolerable in
many situations. Granted, the union of two binary trees is never going
to be as fast as an "or" operations between two integers. Do we really
need this efficiency? In SETL, the answer is certainly "yes", because
sets are the main data structure. But this is not so in ML.

- Xavier