Date: 2008-02-29 (00:54)
From: Brian Hurt <bhurt@s...>
Subject: Re: [Caml-list] Long-term storage of values

On Thu, 28 Feb 2008, Dario Teixeira wrote:

> Hi,
> Suppose I have a value of type Story.t, fairly complex in its definition.
> I wish to store this value in a DB (like Postgresql) for posterity.
> At the moment, I am storing in the DB the marshalled representation
> of the data; whenever I need to use it again in the Ocaml programme
> I simply fetch it from the DB and unmarshal it.

The following is just my opinion, not that of my employer.

You're making two mistakes.

Mistake #1: treating a database as a dumb object store.  This is a really 
popular idea right now- Hibernate does this, as does Ruby on Rails, and a 
number of other ORM packages take effectively the same approach.  On the other 
hand, dynamically typed languages are also really popular.

A database is an incredibly powerful tool, used correctly.  Used 
correctly, databases let you handle huge amounts of data shared between 
multiple different clients with great flexibility and good performance. 
Used incorrectly, they tend to be bloated, slow pigs.  There are a lot of 
things databases aren't good at- multidimensional data, for example, or 
recursive ("tree-structured") data.  Databases have some significant 
limitations.  Every single element in a relation (aka table) has to be 
exactly the same type- no superclasses, no variant types.  Worse yet, SQL 
isn't even Turing-complete.  It's the world's oldest, most popular DSL.

So "used correctly" is tricky to define, because relational databases are 
a paradigm, not unlike functional programming or object oriented 
programming.  But the trick is that you're designing, and coding, to the 
database, and you can't hide that or ignore it.  Some things are easy: 
databases are really good at filtering, joining, some simple mapping and 
aggregation.  The first few "levels" of data handling should be done in 
SQL in the database- you should never be sucking whole tables down.  If 
you do try to hide the essential nature of the database, you'll run right 
into the meatgrinder of its limitations.  Used correctly, you get the 
advantages and avoid the disadvantages.

So, mistake number one: either use the database, and structure your data 
(at that layer) to take advantage of it, or don't use a database.
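
As a rough illustration of "structuring your data at that layer", here is a minimal OCaml sketch.  The thread never shows the definition of Story.t, so the `story` record, the two row types and the table layout they imply ("story" and "paragraph" tables) are all assumptions made up for this example.  The point is only that the value is split into rows the database can filter, join and aggregate on, instead of being stored as one opaque blob.

```ocaml
(* Hypothetical stand-in for Story.t; the real definition is not in the thread. *)
type story = {
  id : int;
  title : string;
  author : string;
  paragraphs : string list;
}

(* One row of an assumed "story" table: (id, title, author). *)
type story_row = int * string * string

(* One row of an assumed "paragraph" table: (story_id, position, body). *)
type paragraph_row = int * int * string

(* Split a story into rows; the position column preserves paragraph order. *)
let to_rows (s : story) : story_row * paragraph_row list =
  let paragraph_rows = List.mapi (fun i body -> (s.id, i, body)) s.paragraphs in
  ((s.id, s.title, s.author), paragraph_rows)

(* Rebuild the in-memory value from rows fetched back, in any order. *)
let of_rows ((id, title, author) : story_row) (rows : paragraph_row list) : story =
  let ordered = List.sort (fun (_, a, _) (_, b, _) -> compare a b) rows in
  { id; title; author; paragraphs = List.map (fun (_, _, body) -> body) ordered }
```

With a layout along these lines, a question like "which stories did this author write?" can be answered by the database itself, without unmarshalling every stored value in the program.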

Mistake number two: file formats (and this includes marshalled data 
structures) are wire protocols, and need to be designed to be as abstract 
as possible- to reveal as little about the internal structure of the 
program as possible (preferably none at all).

This is an idea that gets reinvented time after time, and it always ends 
in tears and recriminations: have some magic protocol that allows programs 
to communicate directly- just have program X call a function or pass an 
object to program Y directly, and have the protocol handle all the mucking 
about with serializing/deserializing data, converting function calls into 
request/response messages, etc.  Sun RPC, COM, CORBA, OLE, XML-RPC, 
and SOAP are the implementations that spring to mind.  Object 
serialization hits the exact same problem: it doesn't matter whether 
program X and Y are communicating via TCP/IP sockets, files, or 
quantum-tachyon entanglement.

Sooner or later (and generally sooner), it'll happen: program X will ask 
to call some function, or pass some type of data, that program Y doesn't have 
any knowledge of.  It may be because X is a newer version of the 
program/protocol, and the function/data type has been added.  It may be 
because X is an older version, and the function/data type has since been 
removed.  In any case, the first time this happens is when the tears and 
recriminations start.

Versioning simply makes it more painfully obvious that you're shackled to 
the past.  You want to get rid of that pesky function?  You can't, because 
older versions of the protocol require it to be there.  Don't need a piece 
of data anymore?  Tough, older versions of the protocol still require it. 
The best thing versioning gives you is the ability to error out early, and 
make a more sensible error message ("Sorry, but protocol support >= 2.14 
is required!"), but it doesn't solve the problem.
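
For what it's worth, the "error out early" part is cheap to do.  Below is a minimal OCaml sketch, assuming the data starts with a textual "major.minor" version header.  The header format, the function names and the minimum version (2, 14) are illustrative assumptions taken from the example message above, not from any real protocol.

```ocaml
(* Protocol versions as (major, minor) pairs; the numbers are illustrative. *)
type version = int * int

let minimum_supported : version = (2, 14)

(* Parse "2.14" into (2, 14); None if the header is malformed. *)
let parse_version (s : string) : version option =
  match String.split_on_char '.' s with
  | [major; minor] ->
    (match int_of_string_opt major, int_of_string_opt minor with
     | Some ma, Some mi -> Some (ma, mi)
     | _ -> None)
  | _ -> None

(* Reject data that is too old before trying to decode anything else.
   Tuples compare lexicographically, so (2, 9) sorts below (2, 14). *)
let check_version (header : string) : (version, string) result =
  match parse_version header with
  | None -> Error (Printf.sprintf "Unreadable protocol version %S" header)
  | Some v when compare v minimum_supported < 0 ->
    let (ma, mi) = minimum_supported in
    Error (Printf.sprintf "Sorry, but protocol support >= %d.%d is required!" ma mi)
  | Some v -> Ok v
```

Note that this only improves the error message.  The code that understands version 2.14 data still has to exist somewhere, which is the real cost.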

The best solution I've found is to be aware that, when you're 
communicating with the outside world, you're implementing a *protocol*. 
And that protocol should be, as I said, as abstract as possible and reveal 
as little about the structure of the program as possible.  So I can change 
the program enormously, even reimplement it from scratch in a different 
language, without great difficulty.  Consider SMTP, HTTP, and YAML as 
examples of protocols or generic file formats done right.
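
To make that a bit more concrete, here is a minimal OCaml sketch of an explicit wire format with encode and decode functions sitting between it and the internal type.  Everything in it is an assumption for illustration: the `story` record is a hypothetical stand-in for Story.t, and the "key: value" line format is made up (it is not YAML or any existing format).  Values must not contain newlines or leading/trailing whitespace for the round trip to hold.

```ocaml
(* Hypothetical internal type; free to change between program versions. *)
type story = { title : string; author : string; body : string }

(* Wire format (assumed for this sketch): one "key: value" pair per line. *)
let encode (s : story) : string =
  String.concat "\n"
    [ "title: " ^ s.title; "author: " ^ s.author; "body: " ^ s.body ]

(* Split a line at its first ':'; None if the line has no separator. *)
let parse_line (line : string) : (string * string) option =
  match String.index_opt line ':' with
  | None -> None
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let value =
      String.trim (String.sub line (i + 1) (String.length line - i - 1)) in
    Some (key, value)

(* Wire fields -> internal value.  Unknown keys are ignored, and a
   missing "author" falls back to a default instead of failing. *)
let decode (text : string) : (story, string) result =
  let fields = List.filter_map parse_line (String.split_on_char '\n' text) in
  match List.assoc_opt "title" fields, List.assoc_opt "body" fields with
  | Some title, Some body ->
    let author =
      Option.value (List.assoc_opt "author" fields) ~default:"unknown" in
    Ok { title; author; body }
  | None, _ -> Error "missing required field: title"
  | _, None -> Error "missing required field: body"
```

Because the decoder looks fields up by name, a newer writer can add a field (say "rating: 5") without breaking an older reader, and the record itself can be reshaped freely as long as encode and decode are kept in step.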

Note that you can do protocol design, and then implement it in CORBA or 
XML.  A sure sign that you've done this is the existence of a "translation 
layer" - comments like "OK, now we translate the XML data structure into 
our internal data structure" and such like.  The successful projects I've 
seen that used these technologies did this (or got lucky and grew into 

So that's mistake number two: you're communicating between different 
versions of the program with an ill-defined (at best) and not 
generic protocol/file format.

Fix these two problems, and I'm willing to bet most of the rest of the 
problems go away too.