Serialisation of PXP DTDs
Date: -- (:)
From: Mauricio Fernandez <mfp@a...>
Subject: Re: [Caml-list] Re: Serialisation of PXP DTDs
On Sun, Oct 26, 2008 at 02:15:18PM -0400, Markus Mottl wrote:
> On Sat, Oct 25, 2008 at 2:58 PM, Mauricio Fernandez <mfp@acm.org> wrote:
> > Unfortunately, growing sum types is far from being the only protocol extension
> > of interest. There's a trivial extension which, I suspect, will be at
> > least as common in practice, namely adding new fields to a record (or new
> > elements to a tuple). bin-prot is unable to handle it adequately --- a
> > self-describing format like the one I'm working on is required.
(...) 
> With records / tuples it is exactly the other way round: you could, in
> principle, read both in the old implementation, which just needs to drop
> new, unknown fields, whereas the new implementation requires these fields
> and hence cannot parse old protocols.

This (having old consumers ignore extra fields) is what bin-prot doesn't
support because records/tuples aren't self-delimited. It can only be done at
the outermost level if you prepend the length of the message, and breaks as
soon as you have a nested record or tuple type.  In my format, records and
tuples are self-delimited, so this is supported trivially.
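A minimal sketch of why self-delimiting matters for nested records. It assumes a toy wire format invented for illustration (a one-byte payload length followed by one byte per int field); this is neither bin-prot's encoding nor the actual format discussed here, and all names are hypothetical.

```ocaml
(* Toy wire format (an assumption for illustration only): a record is
   [payload length : 1 byte] followed by its fields; each int field is
   a single byte in 0..255. *)

type inner_v1 = { a : int }                       (* old definition *)
type outer_v1 = { inner : inner_v1; tail : int }  (* old definition *)

(* New producer: the nested record has gained a field [b]. *)
let encode_inner_v2 ~a ~b =
  let payload = Printf.sprintf "%c%c" (Char.chr a) (Char.chr b) in
  Printf.sprintf "%c%s" (Char.chr (String.length payload)) payload

let encode_outer_v2 ~a ~b ~tail =
  let payload = encode_inner_v2 ~a ~b ^ String.make 1 (Char.chr tail) in
  Printf.sprintf "%c%s" (Char.chr (String.length payload)) payload

(* Old consumer: reads the fields it knows, then jumps to the end of the
   record using the length prefix, skipping whatever was appended later. *)
let decode_inner_v1 s pos =
  let len = Char.code s.[pos] in
  let a = Char.code s.[pos + 1] in
  ({ a }, pos + 1 + len)

let decode_outer_v1 s pos =
  let len = Char.code s.[pos] in
  let inner, next = decode_inner_v1 s (pos + 1) in
  let tail = Char.code s.[next] in
  ({ inner; tail }, pos + 1 + len)
```

Without the inner length prefix, the old reader would take the byte holding b for the outer record's tail field, which is the nested-record failure described above.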

Note that it is possible for a new implementation to read old protocols by
specifying default values for missing fields. This basically amounts to
turning newly added fields into generalized option types (the only difference being
whether the 'a option -> 'a conversion is controlled at the level of the type
definition or distributed throughout the code). New code has to cope with the
possibility that the fields might be None, that's all. Old code never sees
those fields and works unmodified.
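As a rough illustration of the opposite direction (new code reading old data), here is a sketch under the same kind of assumed toy framing: a one-byte payload length, then one byte per int field. The type and function names are hypothetical.

```ocaml
(* New definition: field [b] was added after the first release, so data
   written by old producers does not contain it. *)
type record_v2 = { a : int; b : int option }

(* The length prefix tells the new reader whether [b] is present. *)
let decode_v2 s =
  let len = Char.code s.[0] in
  let a = Char.code s.[1] in
  let b = if len >= 2 then Some (Char.code s.[2]) else None in
  { a; b }

(* The 'a option -> 'a conversion done at the use site; a generated
   decoder could just as well apply the default itself. *)
let b_or ~default r =
  match r.b with
  | Some b -> b
  | None -> default
```

Whether the default is applied inside the decoder or at each use site is exactly the choice mentioned above; the data on the wire is the same either way.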

> I don't see how any approach could "handle" the respective unsolvable
> case.  If a receiver doesn't know how to handle a tag, or if it
> requires data that is not there, you'll be stuck.

The former case is indeed unsolvable if the reader is to operate with that
field in a specific (not polymorphic) way. (It can still do things involving
only other fields, though.) In the second case, however, the receiver has got
the advantage of hindsight: it knows that the extra data might not be present,
and the code can cope with that.

> Note, too, that even if you created an implementation which allows
> handling extended records in old protocols, this would undoubtedly come
> at a pretty hefty cost.  The only efficient way to do that would be to
> exchange protocols and generate code at runtime to translate quickly
> between protocols.  I don't think it's worth it.

? I haven't optimized the generated code yet, but I'm seeing only a 25% drop
in decoding speed compared to Marshal in my preliminary tests. Extra fields
aren't even decoded, just saved in encoded form and appended to the output
when serializing again.
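The "keep unknown fields in encoded form" idea can be sketched as follows. This again assumes an invented toy framing (one-byte payload length, one byte per int field), not the real generated code, and the names are hypothetical.

```ocaml
(* Old reader's view of the record: the one field it understands, plus
   the raw bytes of any trailing fields it does not. *)
type record_v1 = { a : int; unknown : string }

(* Decode [a]; keep the rest of the payload untouched, without parsing it. *)
let decode s =
  let len = Char.code s.[0] in
  { a = Char.code s.[1]; unknown = String.sub s 2 (len - 1) }

(* Re-encode: known field first, then the preserved bytes verbatim. *)
let encode r =
  let payload = String.make 1 (Char.chr r.a) ^ r.unknown in
  String.make 1 (Char.chr (String.length payload)) ^ payload
```

An old node can thus modify a and forward the message without destroying fields added by newer producers, and the per-record cost is a substring copy rather than a full decode.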

> > You might argue that this extension is subsumed by the ability to grow sum types,
> > since you can go from
> >
> >    type record = { a : int } with bin_io
> >    type msg = A of record
> >
> > to
> >
> >    type record1 = { a : int } with bin_io
> >    type record2 = { a' : int; b : int } with bin_io
> >    type msg = A of record1 | B of record2
> >
> > (Note how special care has to be taken to tag the record --- "explicit
> > tagging" in ASN.1 parlance.)
> 
> This is surely a clean way to extend protocols without losing backward
> compatibility.

It's bothersome for the programmer (picture 
  type msg = ... | F of record6  
  and record6 = { a''''' : int; b'''' : int; c''' : float; d'' : foo; e' : bar; f : baz }),
and arguably worse than extending the record directly, because, as you said
above, the receiver will not know how to handle the "B" tag, even though it
would be perfectly able to decode the subset of the record it understands.
It's safe only in one direction (new code can read old data).
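To make the one-directional safety concrete, here is a small sketch of an old consumer compiled against the first msg definition. The encoding is an assumption for illustration (a tag character followed by one byte per int field) and is not bin-prot's actual representation.

```ocaml
type record1 = { a : int }
type msg_v1 = A of record1   (* the old consumer only knows tag A *)

(* Toy encoding (assumed): tag character, then one byte per int field. *)
let decode_msg_v1 s =
  match s.[0] with
  | 'A' -> Ok (A { a = Char.code s.[1] })
  | tag -> Error (Printf.sprintf "unknown tag %C" tag)
```

A message tagged B is rejected outright, although its first field is the same int the old code already knows how to read.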

> > My design lifts that restriction and allows an old consumer to read the data
> > from a new producer when new fields have been added to a record or a tuple.
> 
> I'd probably bet that simply putting a protocol translator in front of
> some old application you don't want to / cannot recompile would be
> about as efficient. 

It's not always a matter of not recompiling the application, but rather of not
having recompiled it *yet*: in a system with multiple nodes, it is hard to
migrate them all to the updated code atomically...  Putting a protocol
translator in front of the old code is just as hard as updating it: it also
means that all exchanges have to stop while the protocol translators are put
in place --- hardly any advantage over just migrating to updated code.

> > AFAICS the ability to process data not understood in full requires the use of
> > a self-describing format like the one I'm working on.
> 
> I'd go for the protocol translator.  Especially if two protocols share
> a lot of structure, it should be trivial to define translations.
> Another very reasonable approach, which does not diminish performance,
> would be to exchange protocol versions.  Assuming that one side is
> always more recent than the other, they should be able to support old
> protocols directly.

Protocol negotiation is not always possible. Consider the case of data stored
on disk (or on any dummy server that only knows about files, not protocols)
and accessed directly without an intermediate translation layer.

-- 
Mauricio Fernandez  -   http://eigenclass.org