Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005421OCamlOCaml generalpublic2011-12-12 16:382013-08-31 12:44
Reporterhw 
Assigned Toprotz 
PrioritynoneSeveritymajorReproducibilityN/A
StatusclosedResolutionfixed 
Platformx86_64OSCentOSOS Version5.5
Product Version3.12.1 
Target VersionFixed in Version3.13.0+dev 
Summary0005421: fd leak in Unix.ml?
DescriptionI have an OCaml process that is leaking fds in multiples of 6. In the logs I have:

Unix.Unix_error(ENOMEM, "fork", "")
Raised by primitive operation at file "unix.ml", line 865, characters 8-14
Called from file "unix.ml", line 882, characters 2-141
[...]

Looking at otherlibs/unix/unix.ml in the source distribution, this corresponds to a call to open_process_full (line 882).

let open_process_full cmd env =
  let (in_read, in_write) = pipe() in
  let (out_read, out_write) = pipe() in
  let (err_read, err_write) = pipe() in
  let inchan = in_channel_of_descr in_read in
  let outchan = out_channel_of_descr out_write in
  let errchan = in_channel_of_descr err_read in
  open_proc_full cmd env (Process_full(inchan, outchan, errchan))
                 out_read in_write err_write [in_read; out_write; err_read];
  close out_read;
  close in_write;
  close err_write;
  (inchan, outchan, errchan)

open_process_full in turn calls open_proc_full, which calls fork() (line 865). Matching up the timestamps with an strace:

19188 12:39:29 pipe([223, 224]) = 0
19188 12:39:29 pipe([225, 226]) = 0
19188 12:39:29 pipe([227, 228]) = 0
19188 12:39:29 futex(0x9b0174, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x9b0170, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1} <unfinished ...>
19188 12:39:29 <... futex resumed> ) = 1
19188 12:39:29 lseek(223, 0, SEEK_CUR <unfinished ...>
19188 12:39:29 <... lseek resumed> ) = -1 ESPIPE (Illegal seek)
19188 12:39:29 futex(0x9b0174, FUTEX_WAIT_PRIVATE, 116925, NULL <unfinished ...>
19188 12:39:29 <... futex resumed> ) = 0
19188 12:39:29 futex(0x9b0140, FUTEX_WAKE_PRIVATE, 1) = 0
19188 12:39:29 lseek(226, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
19188 12:39:29 lseek(227, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
19188 12:39:29 fcntl(223, F_GETFD) = 0
19188 12:39:29 fcntl(223, F_SETFD, FD_CLOEXEC) = 0
19188 12:39:29 fcntl(226, F_GETFD) = 0
19188 12:39:29 fcntl(226, F_SETFD, FD_CLOEXEC) = 0
19188 12:39:29 fcntl(227, F_GETFD) = 0
19188 12:39:29 fcntl(227, F_SETFD, FD_CLOEXEC) = 0
19188 12:39:29 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x447169d0) = -1 ENOMEM (Cannot allocate memory)

clone() is failing in one of the dozen or so ways documented in 'man 2 clone'. This exception propagates out of open_process_full, leaving the six fds from the three pipe() calls hanging around. As far as I can tell these fds are not cleaned up by any garbage collection process.

fds 223-228 are still open a few hours later. Soon the process will hit ulimit -n.
TagsNo tags attached.
Attached Filespatch file icon unix_proc_fdleak.patch [^] (3,299 bytes) 2011-12-21 04:57 [Show Content]

- Relationships

-  Notes
(0006243)
protz (manager)
2011-12-12 17:08

Hi,

So if I understand correctly, clone is failing because there's no memory left on the device. That's a pretty bad situation already. Can you suggest a way and/or a patch to handle this gracefully?

Thanks,

jonathan
(0006244)
hw (reporter)
2011-12-12 18:37
edited on: 2011-12-12 18:38

From 'man 2 clone':

ENOMEM Cannot allocate sufficient memory to allocate a task structure for the child, or to copy those parts of the caller’s context that need to be copied.

I'm not sure how clone() is implemented, but I've consulted a Unix veteran and he says the call can attempt to allocate up to the same amount of memory as the parent process. In my case the process has 645 MB resident, and there is ~ 64 MB unused on the machine. So the situation is bad but not terrible ;-) In my uninformed opinion a failure of clone() doesn't necessarily indicate your process is shafted beyond repair, so it would be nice if the error was recoverable.

So the fd leak. I'm not an expert in Unix programming and I understand forking is a particularly tricky matter. There are probably conventions in the standard library, but my intuition is that each time an fd is allocated it should be wrapped in a try..finally that closes the fd should anything fail. NB pipe() can also fail (e.g. with EMFILE/ENFILE). Unless of course I'm missing some other mechanism that is cleaning up these fds (but not in the case of my program).

(0006254)
till (reporter)
2011-12-12 22:49

Fork/Exec is indeed very tricky and should be done in C (you just need a level of low-level control on the forked side that ocaml cannot provide).
This specific issue could probably be fixed easily but I believe that the fork/exec code in:
http://caml.inria.fr/mantis/view.php?id=5266 [^]
is resilient to this problem.
(0006271)
protz (manager)
2011-12-13 17:53

Hi Till,

You patch is pretty involved and it's going to require a lot of work to make this happen (if we want to, which I'm in no position to declare :)). I suggest we try to fix this issue in a simpler way.

Hw: you wrote « my intuition is that each time an fd is allocated it should be wrapped in a try..finally that closes the fd should anything fail ». Could you possibly write a patch to the Unix module that fixes this? You'd be in a good position to make sure the patch is correct (I could write but I don't have your setup and I'm unsure how to trigger ENOMEM without bringing my work machine down to its knees :)).

I think it's a matter of simply wrapping the call to open_proc_full in a try..with block, cleaning up the file descriptors, and then re-raising the exception if there was a failure. There may be other places in the code of the Unix module that want similar cleanups.

Cheers,

jonathan
(0006426)
till (reporter)
2011-12-21 04:58

I've uploaded a patch that should take care of the fd leak. It is also careful to handle the cases where pipe fails (e.g. running out of available fds.)
(0006429)
protz (manager)
2011-12-21 11:11

The patch looks fine. The only concern I have is that you're sometimes using a mix of close_in, close_out, close. In some places (but not all) where you could use close_in or close_out, you chose to use close, which indeed makes things easier. Reading the code, it turns close_in does close the underlying fd, but is there any strong reason why you're not using close everywhere?

Apart from that, no concerns :).

Thanks,

jonathan
(0006451)
protz (manager)
2011-12-21 14:41

After chatting with Damien, what you're doing (closing the descr, not the channel, as in open_process_full) is definitely safe.
(0006715)
protz (manager)
2012-01-18 10:28

Fixed in svn r12038
(0006720)
hw (reporter)
2012-01-18 10:54

Nice one, thanks folks.

- Issue History
Date Modified Username Field Change
2011-12-12 16:38 hw New Issue
2011-12-12 17:08 protz Note Added: 0006243
2011-12-12 18:37 hw Note Added: 0006244
2011-12-12 18:38 hw Note Edited: 0006244 View Revisions
2011-12-12 22:49 till Note Added: 0006254
2011-12-13 17:53 protz Note Added: 0006271
2011-12-20 08:55 xleroy Status new => feedback
2011-12-21 04:57 till File Added: unix_proc_fdleak.patch
2011-12-21 04:58 till Note Added: 0006426
2011-12-21 11:11 protz Note Added: 0006429
2011-12-21 11:22 protz Assigned To => protz
2011-12-21 11:22 protz Status feedback => assigned
2011-12-21 14:41 protz Note Added: 0006451
2012-01-18 10:28 protz Note Added: 0006715
2012-01-18 10:28 protz Status assigned => resolved
2012-01-18 10:28 protz Fixed in Version => 3.13.0+dev
2012-01-18 10:28 protz Resolution open => fixed
2012-01-18 10:54 hw Note Added: 0006720
2013-08-31 12:44 xleroy Status resolved => closed


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker