Re: Data structures in ocaml

From: skaller (skaller@maxtal.com.au)
Date: Sun Oct 10 1999 - 10:14:45 MET DST


Date: Sun, 10 Oct 1999 18:14:45 +1000
From: skaller <skaller@maxtal.com.au>
To: Pierre Weis <Pierre.Weis@inria.fr>
Subject: Re: Data structures in ocaml

Pierre Weis wrote:

> 10 years ago, I designed and implemented for Caml the keyword
> ``any''. Any was an expression, with polymorphic type scheme
> For all 'a. 'a.
>
> The typing rule for ``any'' was regular except that the type variable
> instance of any's type scheme must be instanciated at the end of
> type-checking (since there is no Caml value of type 'a).
>
> The semantics was that (any : t) was any value of type t.

        A don't care value. This slightly different to 'initial',
which is more like 'not any'. :-)
 
> This feature is conceivably useful to initialize variables (I mean
> references) or vectors.

        Perhaps (but I don't see how it is always possible to
provide an 'any' instance of an object of class type?)

> > There are a whole host of possible run time errors in ocaml,
> > which represent a lack of expressiveness of the type system.
>
> Exactly, that's our dream to have a more expresive type system that
> would prevent those run time errors.

        I share your dream: but add to it the need to do things
efficiently, as well, and to support sharing of commonality
(that is, reuse).
 
> > > This is also due to:
> > >
> > > -- a design decision: no uninitialized entities exist in Caml. This is
> > > consistent to the fact that in Caml you DEFINE identifiers instead of
> > > DECLARING them.
> >
> > I understand that this decision was made, and why, but it does not
> > sit so well with the procedural aspect of ocaml.
>
> That's your opinion, which is interesting.

        A lot of work is done in C++, concerning initialisation
of class objects. As you may know, C++ allows the data of a class
to be initialised componentwise, and then also permits assignment
to the values in the constructor body. I have conjectured that
it is always possible to write the code with an empty constructor
body, that is, entirely functionally. I have not seen a contrary
example, but there are plenty of examples where it is very
messy and inconvenient not to leave the value uninitialised,
and assign to it later in the constructor body.
 
> The system simply initializes vectors with the value supplied by the
> user. If it did it with a special implicit value, the lack of
> efficiency would be the same.

        Accepted.
 
> > I do not care that efficiency is lost while debugging,
> > and the spurious test can be turned off with the -unsafe option
> > just like bounds checking.
>
> I do not understand: the efficiency is lost anyway as soon as the
> vectors have to be initialized, since this initialization is an O(n)
> operation.

        The issue here isn't just initialisation, but also testing
for the special 'initialised' value. If I use 'a option,
I must write code (ugly) to match the case which always
occurs, and presumably that takes time:

        match x.(i) with Some x' -> x' | None -> assert false

is surely slower (and unquestionably uglier) than

        x.(i)

where the latter throws if x.(i) is uninitialised OR past the end of the
array.
I can turn off the bound checking with -unsafe, and if I could turn off
the
uninitialised checking as well, then the cost of an access would be the
same as for a fixed length array.

>Also if you don't perform the test at array access, you
> will have a very difficult time to debug the program, since the
> ``null'' value could be a valid value of the type stored into the
> array, and the program may crash long time after the erroneous access
> or even worse continue with this erroneous value and give irrelevant
> results ...

        The idea is that the system check each access, but the checks
be elided if -unsafe is specified. The result is cleaner than either
using a dummy value (which is actually unsafe, since reading the dumyy
_doesn't_ throw an exception when it should), or using 'a option.

 
> You're certainly right. However you did not give the benchmark that
> proves that Caml lists are way too inefficient to implement the Python
> lists! Can you please give us the specification you need, that is, the
> kind of primitives you want to implement efficiently, and the runtime
> figures you got with the Python implementation (i.e. a benchmark) ?

        Sure. but note, it is not necessary to give a benchmark:
python lists support constant time random access, ocaml ones
are O(n), and that is enough. Indexing is common. Concatenation is
also common and must be fast. Here are the results from an appending
test:

Viper (ocaml) 2.0.1
Concatenate python list (ocaml Varray) 317 secs
Concatenate python tuple (ocaml list) 794 secs

Python (C)
Concatenate python list (ocaml Varray) 54.33 secs
Concatenate python tuple (ocaml list) 48.98 secs

The test code is:
-----------------------------
from time import clock
from sys import version
print version

loops = 10000
x0 = [1,2,3,4]
y0 = (1,2,3,4)

x = []
y = ()

start = clock()
for i in xrange(loops):
  x = x + x0
print 'Concatenate python list (ocaml Varray)', clock() - start,' secs'

start = clock()
for i in xrange(loops):
  y = y + y0
print 'Concatenate python tuple (ocaml list)', clock() - start, ' secs'
------------------------------------

> lists. (Your list datatype should look like:
>
> type 'a plist =
> | Empty
> | Cons of 'a evect;;
>
> isn't it ?)

        No, I just use a variable length array:

        type varray_t = { used: int; data: 'a array }

 
> > I think my point is this: when building a library module,
> > it is useful to have extra code to handle the magic
> > initial value, to find bugs. When the library is thought
> > to be correct, it can be recompiled with the -unsafe option.
> > The result is then fast, although it may crash the program
> > if there is still a bug -- but if there is a bug, the program
> > is bugged _anyhow_, and all that is lost is some extra
> > diagnostic information, which can sometimes be recovered
> > by rerunning the program on the same data, but with the
> > program re-built with the debugging version of the library.
>
> There is a philosophical problem here, since one of the fundamental
> idea of Caml-like languages is just to
> 1) prevents runtime errors as much as possible via static analyses.
> 2) give you some meaningful diagnostic in case of error (something
> different from ``bus error'')

        I agree with this philosophy, but would temper it with the
condition that not too much efficiency is lost, and that not too
much expressiveness is lost.
 
> We very often chose to jeopardise efficiency in favor of safety and good error
> diagnostic. For instance, we considered several times to remove the
> -unsafe option. If we had implemented a good array bound checkers, we
> certainly had removed it ...

        Yes, and I am sure each time you make this kind of compromise
you are asking for some research to show a way to get safety
without loss of expressiveness or efficiency.
 
> > The point is that not much safety is lost, since only that
> > module is compiled with -unsafe, and only when it is
> > well enough tested and analysed to have reasonable confidence
> > in its correctness.
>
> You're right. That's always what we claim to the clients of our
> programs: ``it is well enough tested and analysed to have reasonable
> confidence in its correctness.'' Unfortunately, we also have bug
> reports ...

        This will be the case until better languages are
designed where the compiler can prove correctness in more cases.
It seems to be a good way forward is to consider functors
as a programming technique, since a functor instantiating
an instance of an abstraction guarrantees the instance
is a correct instance of the abstraction.
 
> You're right. We had often the same kind of discussions and the same
> kind of arguments with people from other programming groups (C, Perl,
> Scheme, Ada, ...): very often they replace ``uninitialised arrays'' by
> ``strongly typed'' or ``pattern matching exaustivity''; and have
> exactly the same arguments ``sometime inappropriate'', ``_disservice_
> to the programmer'', ``core dump easier to trace'', ``simpler code if
> all those lifegards are omitted'', ``I don't need those static checks
> since I careful test and analyze my code''.
>
> Once more, it is a matter of philosophical opinions: in Caml the
> language ensures some invariants. This is a strong and convenient
> property, that you may like or dislike. And unfortunately this
> property has a price ...

        I believe I share much the same philosophy. But when I
see unnatural code, or inefficiency, imposed by the current
implementation consequences of the philosophy, I question
both the philosophy and the implementation consquences.

        In this case I ask, if a well principled solution
is not to be found from the underlying category theory,
which would not change the philosophy.

        In this particular case, it would probably be correct to
permit a value to be temporarily uninitialised, provided
the compiler could prove it was initialised before use.

        Since this isn't always the case, the question
is whether to ban the construction, or resort to a run time
error. In the similar case of array bounds, ocaml takes
the latter approach. Therefore, doing so with uninitialised
values is within the current philosophy.

        So I think it is simply a question of judgement of utility.
The only opinion I offer on that is that I have ONE case where
there would be some benefit.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:26 MET