Re: Data structures in ocaml

From: Pierre Weis (Pierre.Weis@inria.fr)
Date: Sun Oct 10 1999 - 01:52:09 MET DST


From: Pierre Weis <Pierre.Weis@inria.fr>
Message-Id: <199910092352.BAA00909@pauillac.inria.fr>
Subject: Re: Data structures in ocaml
To: skaller@maxtal.com.au (skaller)
Date: Sun, 10 Oct 1999 01:52:09 +0200 (MET DST)
In-Reply-To: <37FE327A.ED47D428@maxtal.com.au> from "skaller" at Oct 9, 99 04:05:46 am

> It is _necessary_ for efficiency, to represent
> procedural code cleanly, and it is possible that it can
> be done in a _well typed_ manner. See my proposal for
> an explicit categorical initial.

10 years ago, I designed and implemented for Caml the keyword
``any''. Any was an expression, with polymorphic type scheme
For all 'a. 'a.

The typing rule for ``any'' was regular except that the type variable
instance of any's type scheme must be instanciated at the end of
type-checking (since there is no Caml value of type 'a).

The semantics was that (any : t) was any value of type t.

For instance:

-- (any : int) was 0
-- (any : string) was ""
...
-- (any : t) where t was a sum type was the first constant constuctor if
any ot the first functional constructor C of ty with (any : ty) as
argument. (Hence (any : string list) was [].)
-- (any : t) where t was a record type was the list of field with the
value (any : ty) in each field (ty according to the label argument).
-- (any : t1 -> t2) was (fun _ -> raise Bottom)

This is well-typed and semantically sound.

This feature is conceivably useful to initialize variables (I mean
references) or vectors.

> > This is an easy consequence of the main theorem of Caml type-checking:
> > no well-typed Caml program can ever go wrong.
>
> I do not understand this (obviously simplified) assertion.
> As stated, it is clearly not the case. Ocaml programs can
> easily go wrong in many ways: infinite loops, core dumps
> due to stack overflows, exceptions due to invalid function
> arguments ...

You're perfectly right.

> Perhaps you mean that the type system is sound, but that is
> a relative thing: the soundness is useful only within
> the context of it's expressive power, and like most
> conventional type systems that is definitely limited in practice.

You're perfectly right.

> There are a whole host of possible run time errors in ocaml,
> which represent a lack of expressiveness of the type system.

Exactly, that's our dream to have a more expresive type system that
would prevent those run time errors.

> > This is also due to:
> >
> > -- a design decision: no uninitialized entities exist in Caml. This is
> > consistent to the fact that in Caml you DEFINE identifiers instead of
> > DECLARING them.
>
> I understand that this decision was made, and why, but it does not
> sit so well with the procedural aspect of ocaml.

That's your opinion, which is interesting.

> > -- an implementation necessity, since the garbage collector cannot be
> > faced with some unknown ill-formed data.
>
> This only follows from an overly 'strict' interpretation of
> 'uninitialised'. A pragmatic implementation might well initialise
> boxed values for the reason you state, but that need not mean
> the client programmer must. The system can surely do this
> more efficiently, if it must be done for internal implementation
> reasons.

The system simply initializes vectors with the value supplied by the
user. If it did it with a special implicit value, the lack of
efficiency would be the same.

> > There are other possibilities that cannot be written into the
> > language, since they locally violates the correct memory management,
> > but that an implementation may provide. For instance, the run-time
> > system may define a special ``null'' (allocated) value that is used by
> > an external primitive to fill uninitialised arrays; then, the array
> > access primitive would test at each access if the value read is
> > identical (==) to ``null'' or not; if so, then the array access fails
> > and an error is raised, otherwise the value fetched is returned. This
> > means a loss of efficiency due to a spurious test for each array
> > access, and this is not desirable.
>
> I do not care that efficiency is lost while debugging,
> and the spurious test can be turned off with the -unsafe option
> just like bounds checking.

I do not understand: the efficiency is lost anyway as soon as the
vectors have to be initialized, since this initialization is an O(n)
operation. Also if you don't perform the test at array access, you
will have a very difficult time to debug the program, since the
``null'' value could be a valid value of the type stored into the
array, and the program may crash long time after the erroneous access
or even worse continue with this erroneous value and give irrelevant
results ...

> > > 2. Arrays are mutable, but not variable length.
> >
> > This would require an extra indirection for each vector access, which
> > is undesirable.
>
> you are missing the point: I am NOT suggesting
> changing the existing Array module, but considering the problem
> of implementing a new module for a variable length one.

You're right, I misunderstood the phrase ``2. Arrays are mutable, but not
variable length.''. I did not catch it meant ``I'm considering of
implementing a new module for a variable length one.''

> In this case the 'undesirability' of the extra indirection
> is completely irrelevant: it is necessary IF you in fact
> want variable length arrays, and I MUST have them
> or I cannot implement Python lists efficiently (Python uses
> variable length arrays for lists, using realloc to reallocate them).

You're certainly right. However you did not give the benchmark that
proves that Caml lists are way too inefficient to implement the Python
lists! Can you please give us the specification you need, that is, the
kind of primitives you want to implement efficiently, and the runtime
figures you got with the Python implementation (i.e. a benchmark) ?

> >Furthermore, variable length vectors can be defined
> > by an easy abstraction based on the basic static length vectors.
>
> No they can't. I know, I'm trying to do it.
> It is not at all 'easy', but grossly obscures code clarity.
> To see this, just consider that your code below is, in fact, wrong:
>
> > type 'a evect = ('a array) ref;;
> >
> > let make n x = ref (Array.make n x);;
> >
> > let give_room v n =
> > let s = !v in
> > let l = Array.length s in
> > if n >= l then
> > let new_s = Array.make n s.(0) in
>
> What happens if Array.length s = 0??

Invalid array access. I implicitely supposed that the vectors were not
empty. If they can, just change the module as

type 'a evect = {init : 'a; mutable contents : 'a array};;

let make n x = {init = x; contents = Array.make n x};;

let give_room v n =
  let s = v.contents in
  let l = Array.length s in
  if n >= l then
   let new_s = Array.make n v.init in
   ...

Anyway, if you use those vectors to implement lists, the variable length
vectors you need are not empty, and you can use the original type I
posted, since you do not use these vectors in the case of empty
lists. (Your list datatype should look like:

type 'a plist =
   | Empty
   | Cons of 'a evect;;

isn't it ?)

> You see, it is NOT possible to implement the
> 'give_room' function.

I don't understand.

> Instead, you must provide special
> code in EVERY function of the module to do the
> reallocation, and you must put dummy values at the end
> of the array, which will not be collected. This code
> must be special purpose for each function, to find
> a suitable dummy value: if there is no such value,
> the array can be set to zero length: it CAN be done,
> but it is a mess, it is inefficent, and it can
> waste a lot of storage, particularly if,
> as in my implementation, I chose to leave 'old' values
> lying around in the unused part instead of resetting
> them all to a single (shared) dummy value.

If you have this problem, you can use option type (as I mentioned in
my previous message) in the module implementing variable length
vectors. ``Uninitialized'' elements will be None values that are not
garbbage collected anyway. Option manipulation will be local to the
set and get functions, and completely transparent to the clients of
the module.

If you have those ``dummy'' values in vectors, you may also try weak
pointers.

Also there are zillion of extensible vectors implementations, for
instance those based on a default value (which appears very often in
the vector and is never stored in the vector at all), with hash tables
(sparse vectors), trees of vectors of a fixed length, vectors of
vectors ...

> > Anyway, if you want to append two chunks of arrays with no
> > intermediate allocation, you just have to write a special purpose
> > function, something like:
> >
> > let append_two_chunks v1 start1 end1 v2 start2 end2 =
> > let result = Array.make ...
> > Array.blit v1 start1 result 0 ...
> > Array.blit v2 start2 result start1 ...
> > result;;
>
> Yes, and this requires getting a dummy value to initialise
> the whole array with, followed by immediately overwriting
> that dummy value with the actual data, and thus is twice
> as slow as C.

If you do not want to pay for the security to be sure that all values
are always valid, I guess you must write this function directly in C ...

> > (SMOP: you can also generalize the previous function to an arbitrary
> > list of chunks...)
>
> I have. And it is even messier to find the dummy value then;
> I need a recursive function scanning the list of arrays to
> be concatenated just to find the dummy value,
> and if all the arguments are zero length, you have to
> handle it as a special case.

Anyway you have to write such a function to compute the length of the
resulting array: I don't think it is difficult to also return an
initialisation value option with the total length of the vector...

> ----------------------
>
> I think my point is this: when building a library module,
> it is useful to have extra code to handle the magic
> initial value, to find bugs. When the library is thought
> to be correct, it can be recompiled with the -unsafe option.
> The result is then fast, although it may crash the program
> if there is still a bug -- but if there is a bug, the program
> is bugged _anyhow_, and all that is lost is some extra
> diagnostic information, which can sometimes be recovered
> by rerunning the program on the same data, but with the
> program re-built with the debugging version of the library.

There is a philosophical problem here, since one of the fundamental
idea of Caml-like languages is just to
 1) prevents runtime errors as much as possible via static analyses.
 2) give you some meaningful diagnostic in case of error (something
different from ``bus error'')

We very often chose to jeopardise efficiency in favor of safety and good error
diagnostic. For instance, we considered several times to remove the
-unsafe option. If we had implemented a good array bound checkers, we
certainly had removed it ...

> The point is that not much safety is lost, since only that
> module is compiled with -unsafe, and only when it is
> well enough tested and analysed to have reasonable confidence
> in its correctness.

You're right. That's always what we claim to the clients of our
programs: ``it is well enough tested and analysed to have reasonable
confidence in its correctness.'' Unfortunately, we also have bug
reports ...

> The alternative is to write code three times longer
> than would otherwise be required, which is MORE likely
> to be bugged, even though the bug will not be from an
> uninitialised value.
>
> The tradeoff is: catching a specific kind of
> error automatically compared with
> missing a dynamic error which is more likely
> to occur.
>
> In the variable length array case there is no issue:
> IF I fail to set an array element to the correct value,
> THEN accessing it is an error. It does not matter whether
> the value is the wrong value, or an uninitialised one,
> except perhaps that a core dump from an uninitialised one
> would be easier to trace! [Especially if the code were simpler]
>
> In other words, the decision to force initialisation of values
> even when it is inappropriate is some times a _disservice_
> to the programmer. I recommend giving the programmer a choice.
> I think this is viable, since the choice can be made
> on a per compilation unit basis.
>
> John Skaller, mailto:skaller@maxtal.com.au
> 1/10 Toxteth Rd Glebe NSW 2037 Australia
> homepage: http://www.maxtal.com.au/~skaller
> downloads: http://www.triode.net.au/~skaller

You're right. We had often the same kind of discussions and the same
kind of arguments with people from other programming groups (C, Perl,
Scheme, Ada, ...): very often they replace ``uninitialised arrays'' by
``strongly typed'' or ``pattern matching exaustivity''; and have
exactly the same arguments ``sometime inappropriate'', ``_disservice_
to the programmer'', ``core dump easier to trace'', ``simpler code if
all those lifegards are omitted'', ``I don't need those static checks
since I careful test and analyze my code''.

Once more, it is a matter of philosophical opinions: in Caml the
language ensures some invariants. This is a strong and convenient
property, that you may like or dislike. And unfortunately this
property has a price ...

Pierre Weis

INRIA, Projet Cristal, Pierre.Weis@inria.fr, http://cristal.inria.fr/~weis/



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:26 MET