Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cygwin32 in bytecode: problem with Bigarray and Unix.fork #7484

Closed
vicuna opened this issue Feb 14, 2017 · 8 comments
Closed

Cygwin32 in bytecode: problem with Bigarray and Unix.fork #7484

vicuna opened this issue Feb 14, 2017 · 8 comments
Assignees
Milestone

Comments

@vicuna
Copy link

vicuna commented Feb 14, 2017

Original bug ID: 7484
Reporter: @xavierleroy
Assigned to: @dra27
Status: resolved (set by @xavierleroy on 2017-03-24T17:53:39Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: Cygwin 32 bits
Version: 4.05.0 +dev/beta1/beta2/beta3/rc1
Target version: 4.06.0 +dev/beta1/beta2/rc1
Fixed in version: 4.05.0 +dev/beta1/beta2/beta3/rc1
Category: otherlibs
Monitored by: @alainfrisch

Bug description

A program

  • compiled to bytecode (ocamlc)
  • linked with unix.cma and bigarray.cma
  • calling Unix.fork()
    results in a Cygwin 32 hard error

ocamlrun 4692 child_info_fork::abort: unable to map C:\builds\workspace\precheck-cygwin-32\otherlibs\bigarray\dllbigarray.so, Win32 error 998

On my Windows PC I can also get error 1114 instead.

For reference, error 998 = Invalid access to memory location and error
1114 = A dynamic link library (DLL) initialization routine failed.

Everything works fine if:

  • program is compiled to native code (ocamlopt)
  • program doesn't link with bigarray.cma
  • program doesn't call Unix.fork.

Steps to reproduce

echo "let _ = Unix.fork()" > tfork.ml
ocamlc -o tfork.exe unix.cma bigarray.cma tfork.ml
./tfork.exe

@vicuna
Copy link
Author

vicuna commented Feb 15, 2017

Comment author: @dra27

The problem was introduced by commit 5ed7200 (#997). This introduced a dependency of the Bigarray C code on the Unix C code because mmap functions use uerror.

It's not at all clear to me why this only affects Cygwin-32's fork, but I've seen variations of this issue on Cygwin 2.5.2-2.7.0 (the tests work on older versions, but the repro case here also fails).

It's tempting to solve the problem by moving the mmap functions from Bigarray stubs to Unix at once (and avoiding the ref dance), but I'm certain of the implications of that.

@vicuna
Copy link
Author

vicuna commented Feb 15, 2017

Comment author: @diml

Currently we have:

  • Bigarray depends on Unix (for Unix.file_descr)
  • mmap stubs depend on bigarray stubs (for caml_ba_alloc and maybe other functions)

So if we move the mmap stubs to Unix, we'll need to move the bigarrray stubs to Unix as well, or the stdlib. Moving them to Unix seems a bit weird. I guess we can move them now to the stdlib, this will need to be done eventually. There will still be a problem because the bigarray finaliser needs to call a function from the mmap stubs. We can add a C reference until we provide a more customizable API, if we want to.

BTW I don't really understand what the problem is. Do you know why this dependency on the C code causes fork to fail? Lwt unix stubs depend on the Unix C code as well, and AFAIR you can use Unix.fork from a Lwt program under cygwin.

@vicuna
Copy link
Author

vicuna commented Feb 15, 2017

Comment author: @xavierleroy

@dra: great detective work, thanks!

I still need to understand better. At first I thought you found a circular dependency between dllunix and dllbigarray (the Unix and Bigarray stub DLLs). That would be worrying indeed. I see why dllbigarray depends on dllunix (because as you say mmap functions use uerror) but I don't see where dllunix depends on dllbigarray. The Caml part of Unix depends on the Caml and C parts of Bigarray, but I didn't see the dependence of the C part of Unix on the C part of Bigarray.

@vicuna
Copy link
Author

vicuna commented Feb 15, 2017

Comment author: @dra27

I had initially wondered if the function pointer in Unix (the map_file_impl ref) might create issues for fork but I don't think it is (I also tried a version where the ref is not set, which made no difference) - and, as you say, that's all in OCaml-land.

I want to do a little more testing, but I may see if anyone on the Cygwin mailing list has thoughts on how to debug it further.

@vicuna
Copy link
Author

vicuna commented Feb 15, 2017

Comment author: @dra27

OK, I appear to be seeing Win32 error 998 on Windows 7, but Win32 error 1114 on Windows 10 (same version of Cygwin in each case).

Incidentally, this:

try
Unix.fork () |> Printf.eprintf "%d\n%!"
with _ ->
Unix.fork () |> Printf.eprintf "%d\n%!"

reliably "works".

@vicuna
Copy link
Author

vicuna commented Feb 22, 2017

Comment author: @dra27

OK, I have made some progress. The problem is that during its forking process, Cygwin intentionally loads dllbigarray.so before dllunix.so. During flexdll_relocate (which occurs before anything, including Cygwin, has initialised), an attempt is made to find the "uerror" symbol. If I have interpreted what is going on correctly, at this stage the FlexDLL loaded unit linked list has been copied over by fork but at this stage the entry for dllunix.so points to a data section has not been yet loaded and it gets an access violation (I think).

What is really weird is that x64 definitely loads the DLLs in the same wrong order but it definitely works. I have not yet got the bottom of why there is this difference between x86 and x64.

Cygwin is changing the load order for a reason - I haven't looked properly, but I think this is a simple fail case for two DLLs which appear not to dependent on each other. If I compile with a custom Cygwin DLL which does not change the load order, this all works.

I'd really like to understand why this works on x64 but not x86, but in the meantime:

a) I'm going to investigate patching Cygwin1.dll further to see if this can be turned into a feature request (it's not strictly a bug). I think the feature we want is to have stable sort for DLLs which don't appear to be dependent on each other (i.e. which have been dlopen'd).
b) I have a solution in FlexDLL which involves deferring symbol resolution until after fork has completed, but it's not pretty and it requires fork to be invoked via Unix.fork (i.e. this would completely break if other C code chose to fork).

@vicuna
Copy link
Author

vicuna commented Feb 25, 2017

Comment author: @dra27

OK, I have no idea whether it will be accepted, but patch sent to the Cygwin team: https://cygwin.com/ml/cygwin-patches/2017-q1/msg00039.html

The issue is present in x64 as well, but it only comes into play if DLLs load at their preferred base address. Interestingly, noting @doligez's early attempt to clear the CI, this means that using Cygwin's rebaseall script will break a working build but definitely not fix a broken build. I think that by default x86 gcc tries to guess unique base addresses for DLLs because rebasing is very unreliable on x86 but x64 gcc does not because rebasing usually works with the massive address space on offer. In any event, dllunix.so and dllbigarray.so consistently have identical base addresses on my Cygwin64 installation, but if I use rebase -b to give dllunix.so a unique and available base addresses, I get the same fork problem. When the DLLs do not load at their preferred address, there is a ghost "preloaded" image of the DLL already in memory as part of Cygwin's dynamic rebasing process which FlexDLL merrily reads which is why this works sometimes...

The problem itself is because Cygwin uses an unstable topological sort of DLLs based on dependencies which causes the list of dlopen'd DLLs to be reversed on each call to fork - this means that even calls to fork work, but not odd calls!

It's not technically a bug, because FlexDLL is operating prior to Cygwin startup (in other words, Cygwin only has to care about Windows load order which is being satisfied).

If the patch is accepted, I hope that for OCaml we can simply that the Cygwin ports require Cygwin >= 2.7.1. If it's not possible to get the patch in, I will reveal my ugly patch to FlexDLL...

@vicuna
Copy link
Author

vicuna commented Mar 24, 2017

Comment author: @xavierleroy

For the record: the dllunix / dllbigarray dependency was removed, in 4.05 by reverting to the 4.04 implementation of those library, in trunk by switching to a different implementation. So, I believe this Cygwin issue no longer shows up in the particular repro case I gave.

In the meantime, I see that @dra's patch was accepted by the Cygwin maintainers, so the Cygwin DLL problem will go away. Thanks @dra for all the analysis and patching work!

In the end, the problem described in this PR was fixed in two ways, so let me close it.

@vicuna vicuna closed this as completed Mar 24, 2017
@vicuna vicuna added this to the 4.06.0 milestone Mar 14, 2019
@vicuna vicuna added the bug label Mar 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants