Mantis Bug Tracker

ID: 0005371
Project: OCaml
Category: OCamlbuild (the tool)
View Status: public
Date Submitted: 2011-10-05 23:44
Last Update: 2014-07-30 23:10
Reporter: glondu
Assigned To:
Priority: normal
Severity: minor
Reproducibility: always
Status: confirmed
Resolution: open
Platform:
OS:
OS Version:
Product Version: 3.12.0
Target Version: after-4.02.0
Fixed in Version:
Summary: 0005371: questionable reasoning in job control code
Description: ocamlbuild recently started failing with dash 0.5.7, which is /bin/sh on Debian, when the ocamldep command writes to stderr. This was triggered by a patch that has been reverted in Debian (see [1]).

[1] http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=642922

However, after looking more closely at what was going on inside dash and ocamlbuild, I believe ocamlbuild is genuinely faulty: in ocamlbuild_executor.ml, around line 205, it assumes that when a read from a file descriptor that select reported as ready returns 0 bytes, the corresponding job is finished; it then closes the job's file descriptors and waits for it. I don't understand the reasoning behind this, and it is blatantly wrong with dash 0.5.7: ocamlbuild closes the stderr of the job (in this particular case, ocamldep) while it is still running, and when the job writes to its stderr, it receives SIGPIPE and is reported to its parent (ocamlbuild) as failed.

I think ocamlbuild should not schedule termination of a job based only on reading 0 bytes from its file descriptor. A possible fix could be to install a SIGCHLD handler to schedule finished jobs instead of doing it after reading 0 bytes. Another one (IMHO better) would be to provide non-blocking variants of the Unix.close_process* functions (that would not close fds if the process is not dead), and use them in ocamlbuild. Making them an actual part of the Unix module would be a valuable addition, too.
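
The SIGCHLD idea above could look roughly like the following sketch. It is not taken from ocamlbuild: the names child_exited, install_sigchld_handler and reap_finished are hypothetical, and it assumes a POSIX system and the unix library. The handler only sets a flag; the main loop then reaps every exited child without blocking.

```ocaml
(* Hypothetical sketch, not ocamlbuild code. Requires the unix library. *)

(* Set by the SIGCHLD handler, cleared by the main loop before reaping. *)
let child_exited = ref false

let install_sigchld_handler () =
  Sys.set_signal Sys.sigchld
    (Sys.Signal_handle (fun _ -> child_exited := true))

(* Collect (pid, status) for every child that has already exited.
   WNOHANG makes waitpid return (0, _) instead of blocking while
   children are still running. *)
let reap_finished () =
  let rec loop acc =
    match Unix.waitpid [Unix.WNOHANG] (-1) with
    | (0, _) -> List.rev acc
    | (pid, status) -> loop ((pid, status) :: acc)
    | exception Unix.Unix_error (Unix.ECHILD, _, _) -> List.rev acc
  in
  child_exited := false;
  loop []
```

A scheduler built this way would decide that a job is finished from the list returned by reap_finished, not from what a read on the job's pipe returned.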
Tags: patch
Attached Files: ocamlbuild.patch (1,079 bytes) 2011-10-06 11:50

Notes
(0006155)
edwin (reporter)
2011-10-06 11:56

The read doesn't really return 0, but ocamlbuild_executor.ml treats every Unix error as if read had returned 0, and from there glondu's analysis stands.
In this case it gets an EAGAIN (because the subprocess is still alive and hasn't written anything to stderr yet).
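
To make the distinction concrete, here is a minimal sketch of a read wrapper that keeps EAGAIN separate from end-of-file. It is not the attached patch; the type read_result and the function read_nonblocking are hypothetical names, and it assumes a POSIX system, the unix library, and OCaml 4.02 or later.

```ocaml
(* Hypothetical sketch, not the attached patch. Requires the unix library. *)
type read_result =
  | Data of int   (* n > 0 bytes were read into the buffer *)
  | Eof           (* read returned 0: every writer has closed the pipe *)
  | Not_ready     (* EAGAIN/EWOULDBLOCK: writer still alive, nothing to read yet *)

let read_nonblocking fd buf =
  match Unix.read fd buf 0 (Bytes.length buf) with
  | 0 -> Eof
  | n -> Data n
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EWOULDBLOCK), _, _) ->
    Not_ready
```

Only Eof says anything about the writer having gone away; Not_ready just means the caller should go back to select.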

Here is a proposed patch:
http://caml.inria.fr/mantis/file_download.php?file_id=503&type=bug

It would be nice if EINTR were handled in some way too, though.
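
One common way to handle EINTR is to retry the interrupted call. The helper below is a sketch of that idea only, with a hypothetical name (retry_on_eintr); it is not part of the patch and assumes the unix library.

```ocaml
(* Hypothetical helper. Requires the unix library. *)
(* Call [f x] again for as long as it fails with EINTR;
   any other exception propagates unchanged. *)
let rec retry_on_eintr f x =
  try f x
  with Unix.Unix_error (Unix.EINTR, _, _) -> retry_on_eintr f x
```

For example, a read could be wrapped as retry_on_eintr (fun () -> Unix.read fd buf 0 (Bytes.length buf)) ().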

Thanks to Erkki Seppälä/flux on #ocaml for the idea of switching back to blocking mode.
(0006156)
glondu (reporter)
2011-10-06 15:33

I think this is not enough.

Actually, all ocamldep jobs start by closing their standard output (since it is redirected), which triggers job termination. Termination consists of waiting for the remaining output of the job, which, for jobs that redirect their standard output, means waiting for the whole job. In theory this means that ocamldep jobs are not parallelized, and that seems to be what happens in practice.

I retract my initial suggestion about using a non-blocking close_process_full as a better way to fix that: a job would (theoretically) stay "active" as long as there is no output from other jobs.

A neater, non-intrusive approach could be to use signalfd... but this is Linux-specific :-( A portable way could be to schedule job termination from a SIGCHLD handler. But {open,close}_process_full seem to be a bad interface for that, since they don't expose the PID of the child process, and the logic implemented in close_ is not the one we want.
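
As an illustration of an interface that does expose the PID, the sketch below spawns a job with Unix.create_process and checks for its exit with a non-blocking waitpid. The names spawn and poll_exit are hypothetical; this is not ocamlbuild code, and it assumes a POSIX system and the unix library.

```ocaml
(* Hypothetical sketch, not ocamlbuild code. Requires the unix library. *)

(* Start [prog] with its stderr connected to a pipe. Unlike
   open_process_full, create_process returns the child's PID. *)
let spawn prog args =
  let (err_r, err_w) = Unix.pipe () in
  (* Keep the original pipe ends out of the child; its stderr is a
     separate duplicate of err_w made by create_process. *)
  Unix.set_close_on_exec err_r;
  Unix.set_close_on_exec err_w;
  let pid = Unix.create_process prog args Unix.stdin Unix.stdout err_w in
  Unix.close err_w;  (* the parent keeps only the read end *)
  (pid, err_r)

(* [Some status] once the child has exited, [None] while it is running. *)
let poll_exit pid =
  match Unix.waitpid [Unix.WNOHANG] pid with
  | (0, _) -> None
  | (_, status) -> Some status
```

With the PID at hand, the read end can stay open until poll_exit returns Some status, so the job is never left writing to a closed stderr.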
(0009865)
meyer (developer)
2013-07-26 10:46

This should be retriaged ...
(0011433)
edwin (reporter)
2014-05-12 20:18

FWIW ocamlbuild returns with 'signal -8' for commands that print warnings/errors to stderr (like ocamlfind, menhir).
I'm not sure if this is the same bug, but seems related, see:
http://www.freebsd.org/cgi/query-pr.cgi?pr=189710
https://github.com/the-lambda-church/merlin/issues/193#issuecomment-42867072
(0011518)
gasche (developer)
2014-05-18 18:10

Thanks to SSH access to a FreeBSD box provided by Edwin, I could reproduce the issue and verify that the patch above (submitted again... three years ago) makes ocamlbuild work again on FreeBSD. I included the patch in 4.02 and trunk.

The patch only fixes the symptoms, though: as I understand it, the underlying issue is not yet solved. I'm leaving the PR open for this reason.

Issue History
Date Modified Username Field Change
2011-10-05 23:44 glondu New Issue
2011-10-06 11:50 edwin File Added: ocamlbuild.patch
2011-10-06 11:56 edwin Note Added: 0006155
2011-10-06 15:33 glondu Note Added: 0006156
2011-11-16 14:13 xclerc Status new => assigned
2011-11-16 14:13 xclerc Assigned To => xclerc
2012-02-02 15:17 protz Category OCamlbuild => OCamlbuild (the tool)
2012-07-06 16:05 doligez Target Version => 4.01.0+dev
2012-07-31 13:36 doligez Target Version 4.01.0+dev => 4.00.1+dev
2012-09-07 12:55 frisch Target Version 4.00.1+dev => 4.00.2+dev
2013-06-16 18:33 gasche Target Version 4.00.2+dev => 4.02.0+dev
2013-06-16 21:25 gasche Severity crash => minor
2013-07-12 18:15 doligez Target Version 4.02.0+dev => 4.01.1+dev
2013-07-26 10:46 meyer Note Added: 0009865
2013-10-09 14:29 doligez Tag Attached: patch
2014-05-12 20:18 edwin Note Added: 0011433
2014-05-18 18:10 gasche Note Added: 0011518
2014-05-18 18:10 gasche Assigned To xclerc =>
2014-05-18 18:10 gasche Status assigned => confirmed
2014-05-25 20:20 doligez Target Version 4.01.1+dev => 4.02.0+dev
2014-07-30 23:10 doligez Target Version 4.02.0+dev => after-4.02.0