Version française
Home     About     Download     Resources     Contact us    
Browse thread
SafeUnmarshal: questions/problems/timings
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Hendrik Tews <tews@c...>
Subject: Re: [Caml-list] SafeUnmarshal: questions/problems/timings
Grégoire Henry <gregoire.henry@pps.jussieu.fr> writes:

   Hendrik wrote:

   > 1. I made some measurements that suggest that the
   >    unmarshal has quadratic complexity in the data size, see

   Therefore, if your data contains shared parts but no cycle, then it
   should be close to linear. Currently, mapping shared blocks to their
   type is represented by an association list: accessing such type
   information could explain this quadratic behaviour.

That would explain something: An association list hardly scales
to several megabyte of data.

   Could you please send us more information about the shape of your data
   (sharing or not, cycles or not, and wether sharing increase linearly
   or not with the size)?

The data contains sharing and cycles. I have no idea about the
height of the cycles. It might well be the case that the amount
of shared data grows faster than the overal size. But almost all
shared data has type string * int * int.

The data are abstract syntax trees of C++ programs (produced by a
modified elsa/elkhound parser). Its type is defined with about 40
recursive variant and record types. Its monomorphic, no aliasing
or other fancy features. I would say the type is bigger, but not
more complex than "int list". I would be really surprised if the
stronly connected components are necessary for my application.

I am working on a release of the modified elsa parser (hopefully
early next week). Then you can examine the data and its type
yourself.

Maybe you can follow the approach of the Marshal module, where
the user can choose (via the flags) between different algorithms?

   > 2.	    Objective Caml version 3.09.3+dev0+ty1
   >
   >     # SafeUnmarshal.copy [^nativeint^] 4;;
   >     Segmentation fault

   See your fifth question :)

I don't quite understand. My point is that even if nativeint is
not supported yet, it should not end in a segmentation fault,
should it?

   My hope for reducing the complexity is to make some "pre-processing"
   on the value when the unmarshaller reallocates it in the memory.
   In this case, the Check.check function can't operate on "standard"
   values, and should not be exported.

   Yet I don't know if the "preprocessor" would be tightly bound to the
   unmarshaller or have a chance to be exported too.

   We can keep exporting a quadratic check for "standard" values.

My application scenario is as follows: I have this big C++
program (a C++ parser) that eventually generates an ocaml value
(the abstract syntax tree of a C++ program, possibly > 10MB). I
would then like to enter the clean world of ocaml to
manipulate/process those abstract syntax trees. 

If everything goes wrong I would first like to check the
integrety of the abstract syntax tree. If the integrety check
fails I need to know which node and which of its fields contain
the bad data (otherwise I can't fix the problem). To locate the
bad data I currently invoke check on the top node, and if it
fails I call it on the child nodes and descend into one for which
the check fails. That is, I invoke check very often, but always
on the same unmodified data.

It would be nice if the interface of SafeUnmarshal permits this
kind of use. What's necessary is 
- either a fast unmarshal function with the ability to output
  some error trace if requested
- or a check function that is fast as long as one works on the
  same unmodified data.


   >    I would strongly suggest to replace the catch all cases
   >
   >     | _ -> false
   >
   >    in check.ml with some more precise code (it took me several
   >    hours of bug hunting until I found out that the above line
   >    made my unmarshal fail without even looking at the
   >    nativeints).

   Something like printing the type clash (looking for ty_xx, found ty_yy)?

   Or pretty-printing part of the value that have been checked before
   failure?

No, we probably have some misunderstanding here. I meant the
following: Instead of 

      | _ -> false

enumerate all possible constructors and carefully distinguish
those cases for check should return false from those cases that
have not been implemented yet and which should therefore trigger
an assertion. (Because: your original code delivered false on a
nativeint. If it delivered an assertion instead I would have
immediately known that nativeints are not supported. Instead I
spent several hours to search for the inconsistency in my data.)


I tell you how to reproduce those unmarshals that run for hours
when I made the release of the modified elsa parser. I would be
very much interested in trying a new version of SafeUnmarshal,
especially one where the association list problem has been fixed
(I still have a problem in one ast of 4.7MB, it's only that check
runs from more than 2 hours on it, so I can't track it)

Bye,

Hendrik


PS. I assume that Michel Mouny and Grégoire Henry are subscribed
to the caml list, so I don't cc them.