Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Marshal.to_channel does not clean the channel internal buffers when it fails #7238

Closed
vicuna opened this issue Apr 22, 2016 · 18 comments
Closed

Comments

@vicuna
Copy link

vicuna commented Apr 22, 2016

Original bug ID: 7238
Reporter: pfonseca
Status: resolved (set by @xavierleroy on 2018-06-14T09:04:24Z)
Resolution: fixed
Priority: normal
Severity: minor
Fixed in version: 4.08.0+dev/beta1/beta2
Category: standard library
Monitored by: @gasche @diml @oandrieu

Bug description

The semantic of Marshal.to_channel can lead to subtle bugs, particularly, when used with channels that are UDP sockets. The problem is due to the fact that to_channel fails when the message size is bigger than the UDP limit but, when it fails, it leaves part of the message (the prefix) in the internal data structures of the channel.

Leaving the prefix of the message in the buffer is a problem because subsequent calls to to_channel will append messages to the buffer that will be sent. In practice, this causes corruption of messages and prevents correct unmarshalling on the receiver side.

Internally, “Marshal.to_channel” works in the following way:
a) While marshaling the components of message, to_channel makes several calls to the function caml_putblock [1] (through the sequence of invocations: to_channel [3] -> caml_output_value [2] -> caml_output_val [2] -> caml_really_putblock [1] -> caml_putblock [1])
b) On each invocation caml_putblock appends the input data to an internal buffer (channel->buff).
c) This buffer is flushed when (1) flush is explicitly called on the channel or (2) inside caml_putblock when new contents cannot be appended to the buffer because it would reach its maximum capacity. Flushing the buffer consists in calling the write system call on the socket file descriptor. The size of the internal buffer is larger than the UDP packet limit
d) When caml_putblock tries to flush the internal buffer (in a case where it is filled beyond the UDP limit) the call to the write system call fails (as expected) and it fails with the error EMSGSIZE. Upon such error, an exception (UnixError) is thrown, which will be handled by an exception handler but crucially two things happen (1) the internal buffer is not cleared (i.e., it retains the prefix of the message) and (2) nobody tries to resend the suffix of the message that was discarded.

Given the above, it is hard to correctly use the to_channel function on UDP sockets without explicitly checking before invoking to_channel that the message size fits within the content limit of UDP packets. To complicate matters further, the message size is not obvious to the callee given that the message sent will include marshaling metadata, not just user data.

Chapar [4] is an example of an application that is affected by this problem.

References:
[1] https://github.com/ocaml/ocaml/blob/trunk/byterun/io.c
[2] https://github.com/ocaml/ocaml/blob/trunk/byterun/extern.c
[3] https://github.com/ocaml/ocaml/blob/trunk/stdlib/marshal.ml
[4] coq-community/chapar#2

File attachments

@vicuna
Copy link
Author

vicuna commented Apr 23, 2016

Comment author: @gasche

Thanks for the very detailed analysis. Do you have an idea of whether it is possible to fix this by just changing the caml_{really_,}putblock implementations, or maybe even a patch proposal?

@vicuna
Copy link
Author

vicuna commented Apr 25, 2016

Comment author: pfonseca

Ideally, it would be desirable to ensure atomicity of the marshaling operation, even in the presence of errors. Either marshaling should succeed -- the entire object gets written to the channel -- or it fails -- nothing gets written to the channel buffers and nothing gets sent.

I'm not sure that it would be possible to solve the problem at the caml_{really_,}putblock level. The reason is that, because multiple calls are made for each object marshaled, ensuring marshaling atomicity would require somehow reverting the effects of the calls of previous caml_{really_,}putblock invocations. This would be difficult to achieve in the general case because, at least in some contexts, caml_putblock could have already written the data to the channel (e.g., file) which would be hard to revert.

This problem seems nontrivial to solve and I don't have a patch for it.

@vicuna
Copy link
Author

vicuna commented Apr 26, 2016

Comment author: @gasche

So I looked a bit more into the issue, but I know very little about these low-level details so I may be confused.

  • I don't really understand what is the actual level of portability of this problem and the solution discussed. Most documentation I find seems Linux-specific.

  • It seems that when you create an UDP socket under Linux, you can "disable MTU discovery", which means that the kernel will segment the packets for you instead of failing at writing time, using the IP_MTU_DISCOVER set socket option (this is clearly said in my "man 7 udp", less clearly in "man 7 ip"). This would be a (non-portable?) solution to work-around the issue on your machine (write a small C stub that does that at UDP creation time), and one potential solution to this issue in general would be for the Unix module to disable this whenever it creates an UDP socket.

  • Otherwise we may try to be clever and just divide the sending size in half whenever we get a EMSGSIZE error. In caml_write_fd in unix.c there is already logic to retry on some errors, this could fit naturally in there. There are performance considerations to consider, however: do we keep the write size found by this trial-and-error process for future calls on the same socket, or do we redo the discovery each time? An alternative would be to implement this "find the right packet size" at the level of the caml_really_putblock function, that will do several writes in a row (so there is a local place to store the packet size we found, yet we can share the good values over several calls).

  • We could try to get the packet size before writing (getsockopt IP_MTU?), but that sounds like adding an extra system call in the hot path for a corner case. (Or can we test at this point and do it only for UDP sockets? Do we need this for other sockets than UDP? It seems that we can't do it at socket-creation time, as the MTU value may evolve over time.)

  • Finally, it seems to be possible to recover the correct writeable length in case of EMSGSIZE error by (on Linux) by using the IP_RECVERR mechanism (again, found in "man 7 ip"); we could have an error path in this case that retries with just the right value. But I don't know whether this is Linux-specific or would also work on other systems.

@vicuna
Copy link
Author

vicuna commented Apr 27, 2016

Comment author: @diml

Wouldn't it be more reliable to marshal to a temporary buffer and then handle sending data manually? You get one extra memory copy, but on the other hand this copy is probably a lot faster than the marshaling itself, so I don't think it's worth trying to avoid it in the end.

If you'd prefer to marshal to bigarrays instead of OCaml strings, there is a module [Bigstring_marshal] in core_kernel:

https://github.com/janestreet/core_kernel/blob/master/src/bigstring_marshal.mli

You could easily extract what you want from this module if you didn't want to depend on core_kernel

@vicuna
Copy link
Author

vicuna commented Apr 27, 2016

Comment author: @diml

And actually you can avoid the extra copy if you use file descriptors directly instead of going through output channels

@vicuna
Copy link
Author

vicuna commented Apr 27, 2016

Comment author: @gasche

dim: This seems an excellent workaround for this case, but don't you think that it is a bug that the code does not currently work? If we provide the ability to use UDP sockets, I think we should make sure that writing to them actually works, even when seen as out_channel.

@vicuna
Copy link
Author

vicuna commented Apr 27, 2016

Comment author: @diml

Well, I would say it is the correct solution rather than a workaround.

Marshaling to an out_channel that is backed by an UDP socket seems dodgy anyway; there is no guarantee about how data will be split. For instance what should be the expected behavior if the marshaled data end up spanning over multiple packets? If one packet is lost you'll get a segfault on the other side.

At best we can indeed try to ensure that marshaled data always fit in a single packet and always fail otherwise. But if that's hard to do in a portable way, I tend to think we should just document that it is not supported.

@vicuna
Copy link
Author

vicuna commented Apr 27, 2016

Comment author: @gasche

I suppose that the marshalling format has a length field that would let you know that some packets were dropped. But, in any case, wouldn't the problem also occur with plain old output_string/print_{string,endline}? Marshalling is arguably an edge case, but not being able to use print_* at all because the buffer is too large for its own good seems a bit drastic.

@vicuna
Copy link
Author

vicuna commented Apr 27, 2016

Comment author: @diml

Indeed. I reread you previous message, basically you shouldn't need to do a discovery to find the max packet size, it should be 65507 (the maximum theoretical size of a UDP packet).

I added a patch to do this. pfonseca, could you try and see if it's enough for your use case?

@vicuna
Copy link
Author

vicuna commented Apr 28, 2016

Comment author: @gasche

dim: it is unclear to me whether this limit is portable or specific to Linux systems. See for example

https://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/472149
or
https://www.dsprelated.com/showthread/comp.dsp/44620-1.php

One compromise would be to move this retry-logic in the loop (while (n > 1)), and set

n = min(n/2, MAX_UDP_SIZE)

This should work as your proposed patch in the common case, yet handle situations where the system requires smaller messages (with a small degradation in performance).

P.S.: there is a small typo in your patch, (errno = EMSGSIZE) instead of (errno == EMSGSIZE).

@vicuna
Copy link
Author

vicuna commented Apr 28, 2016

Comment author: @diml

True, it doesn't even seem to work on OSX.

The more I think about this, the more I think we shouldn't try to add magic and give the programmer the fake impression that it works. If you are marshalling through UDP, you must really make sure that the message fits in one UDP packet, as packets can be lost or arrive in a different order (using the length in the marshal header is not enough). If you want to send bigger values you need some logic to ensure you properly reconstruct the marshaled buffer first.

For other output functions, at the end of the day we don't know whether the programmer expect the data to be sent as one packet or not, and if you are using UDP, there is a good chance you expect exactly that. So the current error seems right to me; it indicates you need to think about what you are doing.

However, it's indeed annoying that we are leaving half a marshaled message in the output buffer, if possible we should try to discard it before raising the error.

@vicuna
Copy link
Author

vicuna commented Jun 27, 2016

Comment author: @xavierleroy

Re-reading this discussion, I'm on dim's side: UDP is for transmitting packets, out_channel/in_channel are abstractions for streams, and the two models (stream vs packets) don't mix at all. It is much safer indeed to marshal to a string, then "packetize" the string explicitly, following whatever conventions your UDP protocol has for splitting large messages into packets.

This said, we may have the problem described here (the marshaler not cleaning its internal buffers) in other error cases, e.g. when a write to disk fails because there is no space left. This still needs to be investigated.

@vicuna
Copy link
Author

vicuna commented Jun 27, 2016

Comment author: @gasche

UDP is for transmitting packets, out_channel/in_channel are abstractions for streams, and the two models (stream vs packets) don't mix at all.

Should we not fail when trying to convert an UDP socket to a out_channel then? Because currently we pretend to be able to do it and it works most of the time, and eats user data only sometimes... If we think the feature is disfunctional by design, should we not disable it?

@vicuna
Copy link
Author

vicuna commented Jan 31, 2017

Comment author: @xavierleroy

Should we not fail when trying to convert an UDP socket to a out_channel then?

Quite possibly. It would take a bit of work, though. Under Unix we could use getsockopt(SO_TYPE) to detect sockets that are not SOCK_STREAM. Under Win32 we might be able to do the same, or, alternatively, we could use flags from 'struct filedescr'.

@vicuna
Copy link
Author

vicuna commented May 29, 2017

Comment author: @gasche

I was reminded of this bug again this morning as I read pfonseca's paper who mentions it (as one of the bugs found against formally verified distributed algorithm implementations):

https://homes.cs.washington.edu/~pfonseca/papers/eurosys2017-dsbugs.pdf
An empirical study on the correctness of formally verified distributed systems
Pedro Fonseca, Kaiyuan Zhang, Xi Wang, Arvind Krishnamurthy
EuroSys’17

@vicuna
Copy link
Author

vicuna commented May 29, 2017

Comment author: @gasche

The paper writes:

In practice, this bug may have a significant
impact on verified systems reliability because of the number
of such systems that use the OCaml library and may suffer
from similar problems.

I think this is correct and that we should fail on UDP->out_channel conversion as discussed, so that people realize the issue when testing.

@vicuna
Copy link
Author

vicuna commented Jun 9, 2018

Comment author: @xavierleroy

I think this is correct and that we should fail on UDP->out_channel conversion as discussed, so that people realize the issue when testing.

See proposal at #1825

@vicuna
Copy link
Author

vicuna commented Jun 14, 2018

Comment author: @xavierleroy

The check on file descriptor -> buffered I/O channel is now in trunk, will be in release 4.08.0.

Re-reading the discussion for this issue, I don't think other fixes are needed. It is true that output_value can fail in the middle of writing its data to an out_channel, e.g. if the disk is full, but I don't see any sensible way to make output_value more atomic w.r.t. such I/O errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant