Version française
Home     About     Download     Resources     Contact us    
Browse thread
Marshalling data format deteriorates compressibility
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Markus Mottl <markus.mottl@g...>
Subject: Marshalling data format deteriorates compressibility

we have just found out that under certain circumstances the current
binary format used for encoding marshalled OCaml-values badly affects
compressibility of the data by a wide class of compression algorithms
(especially Lempel-Ziv based ones).

Before we write out datastructures to disk we usually hashcons them to
maximize sharing.  This, of course, leads to much smaller file sizes.
Suprisingly, however, compressing the hashconsed files using e.g. gzip
leads to very significantly (e.g. three times!) larger files than
gzipping marshalled datastructures that have not been hashconsed.

We finally found out what causes the problem: OCaml represents
pointers to shared data values using relative offsets instead of
absolute positions within the marshalled data.  This means that e.g.
an array containing pointers to these values will be represented by a
sequence of increasing relative offsets, which essentially renders it
almost incompressible to the usual compression algorithms.

As it seems, the current marshalling algorithm uses this relative
addressing approach to save space: relative offsets are encoded with
variable length (this assumes some degree of locality), which is not
possible with absolute addressing.  Unfortunately, this does not take
compression algorithms into account, which may greatly benefit from
repeating patterns of pointers.

Could one of the maintainers please tell us what would be the least
intrusive way of adding a new marshalling format?  If this data format
were added to the distribution, it could be turned on using a new flag
in the Marshal-module.  A different magic number might be used for
chosing the right unmarshalling function.

Of course, we would be even more grateful if there were an
implementation of the alternative encoding in the next release... ;-)


Markus Mottl