Re: Data structures in ocaml

From: skaller (skaller@maxtal.com.au)
Date: Fri Oct 08 1999 - 20:05:46 MET DST


Date: Sat, 09 Oct 1999 04:05:46 +1000
From: skaller <skaller@maxtal.com.au>
To: Pierre Weis <Pierre.Weis@inria.fr>
Subject: Re: Data structures in ocaml

Pierre Weis wrote:
>
> > Some comments on data structures.
> > Please let me know if I miss something, which is possible.
> >
> > 1. It should be possible to create an uninitialised array.
> > [It can be done for string]
>
> No, it should not.

        It is _necessary_ for efficiency, to represent
procedural code cleanly, and it is possible that it can
be done in a _well typed_ manner. See my proposal for
an explicit categorical initial.

> This is an easy consequence of the main theorem of Caml type-checking:
> no well-typed Caml program can ever go wrong.

        I do not understand this (obviously simplified) assertion.
As stated, it is clearly not the case. Ocaml programs can
easily go wrong in many ways: infinite loops, core dumps
due to stack overflows, exceptions due to invalid function
arguments ....

        Perhaps you mean that the type system is sound, but that is
a relative thing: the soundness is useful only within
the context of it's expressive power, and like most
conventional type systems that is definitely limited in practice.

        There are a whole host of possible run time errors in ocaml,
which represent a lack of expressiveness of the type system.

> This is also due to:
>
> -- a design decision: no uninitialized entities exist in Caml. This is
> consistent to the fact that in Caml you DEFINE identifiers instead of
> DECLARING them.

        I understand that this decision was made, and why, but it does not
sit so well with the procedural aspect of ocaml.

> -- an implementation necessity, since the garbage collector cannot be
> faced with some unknown ill-formed data.

        This only follows from an overly 'strict' interpretation of
'uninitialised'. A pragmatic implementation might well initialise
boxed values for the reason you state, but that need not mean
the client programmer must. The system can surely do this
more efficiently, if it must be done for internal implementation
reasons.
 
> A way to mimick uninitialised array inside the Caml language framework
> could be to systematically use option encapsulation, initialising the
> array with None values, and storing Some x instead of x
> afterward.

        Yes. So the idea is to make the compiler do that,
for all values.

> This is not acceptable in practice since array accesses
> would systematically need destructuring the result.

        Precisely.
 
> There are other possibilities that cannot be written into the
> language, since they locally violates the correct memory management,
> but that an implementation may provide. For instance, the run-time
> system may define a special ``null'' (allocated) value that is used by
> an external primitive to fill uninitialised arrays; then, the array
> access primitive would test at each access if the value read is
> identical (==) to ``null'' or not; if so, then the array access fails
> and an error is raised, otherwise the value fetched is returned. This
> means a loss of efficiency due to a spurious test for each array
> access, and this is not desirable.

        I do not care that efficiency is lost while debugging,
and the spurious test can be turned off with the -unsafe option
just like bounds checking.

> > 2. Arrays are mutable, but not variable length.
>
> This would require an extra indirection for each vector access, which
> is undesirable.

        you are missing the point: I am NOT suggesting
changing the existing Array module, but considering the problem
of implementing a new module for a variable length one.
In this case the 'undesirability' of the extra indirection
is completely irrelevant: it is necessary IF you in fact
want variable length arrays, and I MUST have them
or I cannot implement Python lists efficiently (Python uses
variable length arrays for lists, using realloc to reallocate them).

>Furthermore, variable length vectors can be defined
> by an easy abstraction based on the basic static length vectors.

        No they can't. I know, I'm trying to do it.
It is not at all 'easy', but grossly obscures code clarity.
To see this, just consider that your code below is, in fact, wrong:

> type 'a evect = ('a array) ref;;
>
> let make n x = ref (Array.make n x);;
>
> let give_room v n =
> let s = !v in
> let l = Array.length s in
> if n >= l then
> let new_s = Array.make n s.(0) in

        What happens if Array.length s = 0??

        You see, it is NOT possible to implement the
'give_room' function. Instead, you must provide special
code in EVERY function of the module to do the
reallocation, and you must put dummy values at the end
of the array, which will not be collected. This code
must be special purpose for each function, to find
a suitable dummy value: if there is no such value,
the array can be set to zero length: it CAN be done,
but it is a mess, it is inefficent, and it can
waste a lot of storage, particularly if,
as in my implementation, I chose to leave 'old' values
lying around in the unused part instead of resetting
them all to a single (shared) dummy value.
        
> Anyway, if you want to append two chunks of arrays with no
> intermediate allocation, you just have to write a special purpose
> function, something like:
>
> let append_two_chunks v1 start1 end1 v2 start2 end2 =
> let result = Array.make ...
> Array.blit v1 start1 result 0 ...
> Array.blit v2 start2 result start1 ...
> result;;

        Yes, and this requires getting a dummy value to initialise
the whole array with, followed by immediately overwriting
that dummy value with the actual data, and thus is twice
as slow as C.
 
> (SMOP: you can also generalize the previous function to an arbitrary
> list of chunks...)

I have. And it is even messier to find the dummy value then;
I need a recursive function scanning the list of arrays to
be concatenated just to find the dummy value,
and if all the arguments are zero length, you have to
handle it as a special case.

----------------------

I think my point is this: when building a library module,
it is useful to have extra code to handle the magic
initial value, to find bugs. When the library is thought
to be correct, it can be recompiled with the -unsafe option.
The result is then fast, although it may crash the program
if there is still a bug -- but if there is a bug, the program
is bugged _anyhow_, and all that is lost is some extra
diagnostic information, which can sometimes be recovered
by rerunning the program on the same data, but with the
program re-built with the debugging version of the library.

The point is that not much safety is lost, since only that
module is compiled with -unsafe, and only when it is
well enough tested and analysed to have reasonable confidence
in its correctness.

The alternative is to write code three times longer
than would otherwise be required, which is MORE likely
to be bugged, even though the bug will not be from an
uninitialised value.

The tradeoff is: catching a specific kind of
error automatically compared with
missing a dynamic error which is more likely
to occur.

In the variable length array case there is no issue:
IF I fail to set an array element to the correct value,
THEN accessing it is an error. It does not matter whether
the value is the wrong value, or an uninitialised one,
except perhaps that a core dump from an uninitialised one
would be easier to trace! [Especially if the code were simpler]

In other words, the decision to force initialisation of values
even when it is inappropriate is some times a _disservice_
to the programmer. I recommend giving the programmer a choice.
I think this is viable, since the choice can be made
on a per compilation unit basis.

-- 
John Skaller, mailto:skaller@maxtal.com.au
1/10 Toxteth Rd Glebe NSW 2037 Australia
homepage: http://www.maxtal.com.au/~skaller
downloads: http://www.triode.net.au/~skaller



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:26 MET