Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007238OCamlstandard librarypublic2016-04-22 05:552018-06-14 11:04
Assigned To 
PlatformOSOS Version
Product Version 
Target VersionFixed in Version4.08.0+dev 
Summary0007238: Marshal.to_channel does not clean the channel internal buffers when it fails
DescriptionThe semantic of Marshal.to_channel can lead to subtle bugs, particularly, when used with channels that are UDP sockets. The problem is due to the fact that to_channel fails when the message size is bigger than the UDP limit but, when it fails, it leaves part of the message (the prefix) in the internal data structures of the channel.

Leaving the prefix of the message in the buffer is a problem because subsequent calls to to_channel will append messages to the buffer that will be sent. In practice, this causes corruption of messages and prevents correct unmarshalling on the receiver side.

Internally, “Marshal.to_channel” works in the following way:
   a) While marshaling the components of message, to_channel makes several calls to the function caml_putblock [1] (through the sequence of invocations: to_channel [3] -> caml_output_value [2] -> caml_output_val [2] -> caml_really_putblock [1] -> caml_putblock [1])
   b) On each invocation caml_putblock appends the input data to an internal buffer (channel->buff).
   c) This buffer is flushed when (1) flush is explicitly called on the channel or (2) inside caml_putblock when new contents cannot be appended to the buffer because it would reach its maximum capacity. Flushing the buffer consists in calling the write system call on the socket file descriptor. The size of the internal buffer is larger than the UDP packet limit
   d) When caml_putblock tries to flush the internal buffer (in a case where it is filled beyond the UDP limit) the call to the write system call fails (as expected) and it fails with the error EMSGSIZE. Upon such error, an exception (UnixError) is thrown, which will be handled by an exception handler but crucially two things happen (1) the internal buffer is not cleared (i.e., it retains the prefix of the message) and (2) nobody tries to resend the suffix of the message that was discarded.

Given the above, it is hard to correctly use the to_channel function on UDP sockets without explicitly checking before invoking to_channel that the message size fits within the content limit of UDP packets. To complicate matters further, the message size is not obvious to the callee given that the message sent will include marshaling metadata, not just user data.

Chapar [4] is an example of an application that is affected by this problem.

    [1] [^]
    [2] [^]
    [3] [^]
    [4] [^]
TagsNo tags attached.
Attached Filesdiff file icon diff-retry-on-EMSGSIZE.diff [^] (599 bytes) 2016-04-27 18:40 [Show Content]

- Relationships

-  Notes
gasche (developer)
2016-04-23 18:04

Thanks for the very detailed analysis. Do you have an idea of whether it is possible to fix this by just changing the caml_{really_,}putblock implementations, or maybe even a patch proposal?
pfonseca (reporter)
2016-04-26 00:44

Ideally, it would be desirable to ensure atomicity of the marshaling operation, even in the presence of errors. Either marshaling should succeed -- the entire object gets written to the channel -- or it fails -- nothing gets written to the channel buffers and nothing gets sent.

I'm not sure that it would be possible to solve the problem at the caml_{really_,}putblock level. The reason is that, because multiple calls are made for each object marshaled, ensuring marshaling atomicity would require somehow reverting the effects of the calls of previous caml_{really_,}putblock invocations. This would be difficult to achieve in the general case because, at least in some contexts, caml_putblock could have already written the data to the channel (e.g., file) which would be hard to revert.

This problem seems nontrivial to solve and I don't have a patch for it.
gasche (developer)
2016-04-27 00:09

So I looked a bit more into the issue, but I know very little about these low-level details so I may be confused.

- I don't really understand what is the actual level of portability of this problem and the solution discussed. Most documentation I find seems Linux-specific.

- It seems that when you create an UDP socket under Linux, you can "disable MTU discovery", which means that the kernel will segment the packets for you instead of failing at writing time, using the IP_MTU_DISCOVER set socket option (this is clearly said in my "man 7 udp", less clearly in "man 7 ip"). This would be a (non-portable?) solution to work-around the issue on your machine (write a small C stub that does that at UDP creation time), and one potential solution to this issue in general would be for the Unix module to disable this whenever it creates an UDP socket.

- Otherwise we may try to be clever and just divide the sending size in half whenever we get a EMSGSIZE error. In caml_write_fd in unix.c there is already logic to retry on some errors, this could fit naturally in there. There are performance considerations to consider, however: do we keep the write size found by this trial-and-error process for future calls on the same socket, or do we redo the discovery each time? An alternative would be to implement this "find the right packet size" at the level of the caml_really_putblock function, that will do several writes in a row (so there is a local place to store the packet size we found, yet we can share the good values over several calls).

- We could try to get the packet size *before* writing (getsockopt IP_MTU?), but that sounds like adding an extra system call in the hot path for a corner case. (Or can we test at this point and do it only for UDP sockets? Do we need this for other sockets than UDP? It seems that we can't do it at socket-creation time, as the MTU value may evolve over time.)

- Finally, it seems to be possible to recover the correct writeable length in case of EMSGSIZE error by (on Linux) by using the IP_RECVERR mechanism (again, found in "man 7 ip"); we could have an error path in this case that retries with just the right value. But I don't know whether this is Linux-specific or would also work on other systems.
dim (developer)
2016-04-27 11:23

Wouldn't it be more reliable to marshal to a temporary buffer and then handle sending data manually? You get one extra memory copy, but on the other hand this copy is probably a lot faster than the marshaling itself, so I don't think it's worth trying to avoid it in the end.

If you'd prefer to marshal to bigarrays instead of OCaml strings, there is a module [Bigstring_marshal] in core_kernel: [^]

You could easily extract what you want from this module if you didn't want to depend on core_kernel
dim (developer)
2016-04-27 11:24

And actually you can avoid the extra copy if you use file descriptors directly instead of going through output channels
gasche (developer)
2016-04-27 12:56
edited on: 2016-04-27 12:56

dim: This seems an excellent workaround for this case, but don't you think that it is a bug that the code does not currently work? If we provide the ability to use UDP sockets, I think we should make sure that writing to them actually works, even when seen as out_channel.

dim (developer)
2016-04-27 14:56

Well, I would say it is the correct solution rather than a workaround.

Marshaling to an out_channel that is backed by an UDP socket seems dodgy anyway; there is no guarantee about how data will be split. For instance what should be the expected behavior if the marshaled data end up spanning over multiple packets? If one packet is lost you'll get a segfault on the other side.

At best we can indeed try to ensure that marshaled data always fit in a single packet and always fail otherwise. But if that's hard to do in a portable way, I tend to think we should just document that it is not supported.
gasche (developer)
2016-04-27 16:10

I suppose that the marshalling format has a length field that would let you know that some packets were dropped. But, in any case, wouldn't the problem also occur with plain old output_string/print_{string,endline}? Marshalling is arguably an edge case, but not being able to use print_* at all because the buffer is too large for its own good seems a bit drastic.
dim (developer)
2016-04-27 18:40

Indeed. I reread you previous message, basically you shouldn't need to do a discovery to find the max packet size, it should be 65507 (the maximum theoretical size of a UDP packet).

I added a patch to do this. pfonseca, could you try and see if it's enough for your use case?
gasche (developer)
2016-04-28 04:28

dim: it is unclear to me whether this limit is portable or specific to Linux systems. See for example [^]
or [^]

One compromise would be to move this retry-logic in the loop (while (n > 1)), and set

  n = min(n/2, MAX_UDP_SIZE)

This should work as your proposed patch in the common case, yet handle situations where the system requires smaller messages (with a small degradation in performance).

P.S.: there is a small typo in your patch, (errno = EMSGSIZE) instead of (errno == EMSGSIZE).
dim (developer)
2016-04-28 09:34

True, it doesn't even seem to work on OSX.

The more I think about this, the more I think we shouldn't try to add magic and give the programmer the fake impression that it works. If you are marshalling through UDP, you must really make sure that the message fits in one UDP packet, as packets can be lost or arrive in a different order (using the length in the marshal header is not enough). If you want to send bigger values you need some logic to ensure you properly reconstruct the marshaled buffer first.

For other output functions, at the end of the day we don't know whether the programmer expect the data to be sent as one packet or not, and if you are using UDP, there is a good chance you expect exactly that. So the current error seems right to me; it indicates you need to think about what you are doing.

However, it's indeed annoying that we are leaving half a marshaled message in the output buffer, if possible we should try to discard it before raising the error.
xleroy (administrator)
2016-06-27 15:16

Re-reading this discussion, I'm on dim's side: UDP is for transmitting packets, out_channel/in_channel are abstractions for streams, and the two models (stream vs packets) don't mix at all. It is much safer indeed to marshal to a string, then "packetize" the string explicitly, following whatever conventions your UDP protocol has for splitting large messages into packets.

This said, we may have the problem described here (the marshaler not cleaning its internal buffers) in other error cases, e.g. when a write to disk fails because there is no space left. This still needs to be investigated.
gasche (developer)
2016-06-27 16:00

> UDP is for transmitting packets, out_channel/in_channel are abstractions for streams, and the two models (stream vs packets) don't mix at all.

Should we not fail when trying to convert an UDP socket to a out_channel then? Because currently we pretend to be able to do it and it works most of the time, and eats user data only sometimes... If we think the feature is disfunctional by design, should we not disable it?
xleroy (administrator)
2017-01-31 19:52

> Should we not fail when trying to convert an UDP socket to a out_channel then?

Quite possibly. It would take a bit of work, though. Under Unix we could use getsockopt(SO_TYPE) to detect sockets that are not SOCK_STREAM. Under Win32 we might be able to do the same, or, alternatively, we could use flags from 'struct filedescr'.
gasche (developer)
2017-05-29 18:04
edited on: 2017-05-29 18:05

I was reminded of this bug again this morning as I read pfonseca's paper who mentions it (as one of the bugs found against formally verified distributed algorithm implementations): [^]
  An empirical study on the correctness of formally verified distributed systems
  Pedro Fonseca, Kaiyuan Zhang, Xi Wang, Arvind Krishnamurthy

gasche (developer)
2017-05-29 18:07

The paper writes:

> In practice, this bug may have a significant
> impact on verified systems reliability because of the number
> of such systems that use the OCaml library and may suffer
> from similar problems.

I think this is correct and that we should fail on UDP->out_channel conversion as discussed, so that people realize the issue when testing.
xleroy (administrator)
2018-06-09 09:53

> I think this is correct and that we should fail on UDP->out_channel conversion as discussed, so that people realize the issue when testing.

See proposal at [^]
xleroy (administrator)
2018-06-14 11:04

The check on file descriptor -> buffered I/O channel is now in trunk, will be in release 4.08.0.

Re-reading the discussion for this issue, I don't think other fixes are needed. It is true that output_value can fail in the middle of writing its data to an out_channel, e.g. if the disk is full, but I don't see any sensible way to make output_value more atomic w.r.t. such I/O errors.

- Issue History
Date Modified Username Field Change
2016-04-22 05:55 pfonseca New Issue
2016-04-22 17:29 doligez Status new => acknowledged
2016-04-22 17:29 doligez Target Version => 4.03.1+dev
2016-04-23 18:04 gasche Note Added: 0015841
2016-04-26 00:44 pfonseca Note Added: 0015854
2016-04-27 00:09 gasche Note Added: 0015868
2016-04-27 11:23 dim Note Added: 0015872
2016-04-27 11:24 dim Note Added: 0015873
2016-04-27 12:56 gasche Note Added: 0015875
2016-04-27 12:56 gasche Note Edited: 0015875 View Revisions
2016-04-27 14:56 dim Note Added: 0015878
2016-04-27 16:10 gasche Note Added: 0015879
2016-04-27 18:40 dim Note Added: 0015880
2016-04-27 18:40 dim File Added: diff-retry-on-EMSGSIZE.diff
2016-04-28 04:28 gasche Note Added: 0015882
2016-04-28 09:34 dim Note Added: 0015883
2016-06-27 15:16 xleroy Note Added: 0016002
2016-06-27 16:00 gasche Note Added: 0016006
2017-01-31 19:52 xleroy Note Added: 0017221
2017-02-16 14:00 doligez Target Version 4.03.1+dev => undecided
2017-02-23 16:28 doligez Target Version undecided =>
2017-02-23 16:43 doligez Category OCaml standard library => standard library
2017-05-29 18:04 gasche Note Added: 0017829
2017-05-29 18:05 gasche Note Edited: 0017829 View Revisions
2017-05-29 18:07 gasche Note Added: 0017830
2018-06-09 09:53 xleroy Note Added: 0019168
2018-06-14 11:04 xleroy Note Added: 0019188
2018-06-14 11:04 xleroy Status acknowledged => resolved
2018-06-14 11:04 xleroy Resolution open => fixed
2018-06-14 11:04 xleroy Fixed in Version => 4.08.0+dev

Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker