OCaml SIGSEGV in invert_pointer_at() (OCaml Garbage Collector / Compaction) #7431

Closed
vicuna opened this issue Dec 10, 2016 · 35 comments

vicuna commented Dec 10, 2016

Original bug ID: 7431
Reporter: alexmarkley
Assigned to: @mshinwell
Status: closed (set by @mshinwell on 2016-12-22T07:16:05Z)
Resolution: not a bug
Priority: normal
Severity: crash
Platform: x86_64
OS: Linux
OS Version: Fedora 25
Version: 4.04.0
Category: runtime system and C interface

Bug description

When running the Unison File Synchronizer (a project written in OCaml: https://www.cis.upenn.edu/~bcpierce/unison/index.html ) against a large replica (1TB), I am encountering a showstopping segfault every single time.

  • I have tried multiple versions of Unison, including stable versions which were working fine for me in the past and newer beta versions.

  • I initially tried the official Fedora builds of OCaml (4.02.3-3) and when I was having no success with those, I removed them from my system and I built/installed OCaml 4.04.0 myself.

  • I finally got a really good backtrace (included in this report) running OCaml 4.04 and Unison git master. As you can see, the segmentation fault occurred within the OCaml heap compaction portion of the garbage collection routine.

  • It is worth noting that I never had this problem in earlier releases of Fedora, even with the same or earlier versions of OCaml and Unison. (I'm not sure what this implies, except that perhaps the bug is actually being triggered by a lower-level system component, like GCC or a system library.)

Related bug reports:

https://bugzilla.redhat.com/show_bug.cgi?id=1401759

bcpierce00/unison#48

Steps to reproduce

NOTE: These steps may only successfully reproduce the issue if the client is running Fedora 25 on x86_64, and if both OCaml and Unison were built on that machine.

  1. Create a large, complicated dataset on the server for Unison to synchronize. Ideally this will be over 1TB in size and require over 2 hours to transfer.

  2. Perform a synchronization between the client and the server, requiring the majority of the data to be transferred from the server to the client. (This mimics initial synchronization of a new hub/spoke node.)

  3. Observe the client fails to synchronize the entire dataset. Client is terminated with SIGSEGV.

Additional information

===SNIP===
/home/alex/Temp/galculator-2.1.3/intltool-extract.in has already been transferred
/home/alex/Temp/galculator-2.1.3/intltool-merge.in has already been transferred
/home/alex/Temp/galculator-2.1.3/intltool-update.in has already been transferred
33% 100:25 ETA
Program received signal SIGSEGV, Segmentation fault.
0x00000000004ec76b in invert_pointer_at (p=p@entry=0x7fffd38c7b28) at compact.c:90
90 compact.c: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (process 19298):
#0 0x00000000004ec76b in invert_pointer_at (p=p@entry=0x7fffd38c7b28) at compact.c:90
val = 140736742586384
hp = 0x7461705f77617264
q = 140736742586416
#1 0x00000000004ec90c in do_compaction () at compact.c:228
q =
i =
sz = 6
t =
infixes =
p = 0x7fffd38c7b10
ch = 0x7fffbf2fc000 "\363\273M"
chend = 0x7ffff09f1000 ""
#2 0x00000000004ecdea in caml_compact_heap () at compact.c:426
target_wsz =
live =
#3 0x00000000004ed24a in caml_compact_heap_maybe () at compact.c:547
fw =
fp = 170.748871
#4 0x00000000004daf4a in caml_major_collection_slice (howmuch=howmuch@entry=-1) at major_gc.c:785
p = 0.0043600637275738388
dp =
filt_p = 0.0043600637275738388
spend =
computed_work = 1522479
i =
#5 0x00000000004dbedf in caml_gc_dispatch () at minor_gc.c:463
trigger =
#6 0x00000000004dbf77 in caml_check_urgent_gc (extra_root=) at minor_gc.c:482
caml__frame = 0x0
caml__roots_extra_root = {next = 0x0, ntables = 1, nitems = 1, tables = {0x7fffffffd758, 0x7fffffffd870, 0x4dc96a <caml_alloc_shr+170>, 0x22, 0x7fff9d02f6b0}}
#7 0x00000000004dcfe5 in caml_alloc_string (len=65497) at alloc.c:103
result =
offset_index =
wosize = 8188
#8 0x000000000047205c in camlBytearray__sub_1422 () at /root/unison-git/src/bytearray.ml:63
No locals.
#9 0x0000000000447812 in camlTransfer__receiveRec_1568 () at /root/unison-git/src/transfer.ml:295
No locals.
#10 0x0000000000427cef in camlCopy__decompr_2936 () at /root/unison-git/src/transfer.ml:304
No locals.
#11 0x0000000000426bca in camlCopy__fun_3367 () at /root/unison-git/src/copy.ml:401
No locals.
#12 0x000000000046cc11 in camlUtil__convertUnixErrorsToExn_1955 () at /root/unison-git/src/ubase/util.ml:170
No locals.
#13 0x000000000043f46a in camlRemote__processStream_2291 () at /root/unison-git/src/remote.ml:664
No locals.
#14 0x000000000043fe26 in camlRemote__fun_4468 () at /root/unison-git/src/remote.ml:732
No locals.
#15 0x0000000000464e4d in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#16 0x000000000046510e in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#17 0x000000000048d101 in camlList__iter_1252 () at list.ml:77
No locals.
#18 0x0000000000464b2e in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#19 0x000000000046182e in camlLwt_unix_impl__fun_2430 () at /root/unison-git/src/lwt/generic/lwt_unix_impl.ml:153
No locals.
#20 0x000000000048d101 in camlList__iter_1252 () at list.ml:77
No locals.
#21 0x0000000000461671 in camlLwt_unix_impl__run_1579 () at /root/unison-git/src/lwt/generic/lwt_unix_impl.ml:148
No locals.
#22 0x000000000040e80a in camlUitext__doTransport_1863 () at /root/unison-git/src/uitext.ml:490
No locals.
#23 0x000000000040f84e in camlUitext__doit_1922 () at /root/unison-git/src/uitext.ml:556
No locals.
#24 0x0000000000410034 in camlUitext__synchronizeOnce_1968 () at /root/unison-git/src/uitext.ml:718
No locals.
#25 0x000000000041094a in camlUitext__loop_2237 () at /root/unison-git/src/uitext.ml:788
No locals.
#26 0x0000000000410b4d in camlUitext__synchronizeUntilDone_2242 () at /root/unison-git/src/uitext.ml:810
No locals.
#27 0x0000000000410df7 in camlUitext__start_2249 () at /root/unison-git/src/uitext.ml:870
No locals.
#28 0x00000000004085fa in camlMain__Body_1550 () at /root/unison-git/src/main.ml:241
No locals.
#29 0x0000000000407a93 in camlLinktext__entry () at /root/unison-git/src/linktext.ml:19
No locals.
#30 0x0000000000404369 in caml_program ()
No symbol table info available.
#31 0x00000000004ef12e in caml_start_program ()
No symbol table info available.
#32 0x00000000004ef475 in caml_main (argv=0x7fffffffdca8) at startup.c:145
exe_name =
proc_self_exe = "/usr/local/bin/unison", '\000' <repeats 234 times>
res =
tos = 0 '\000'
#33 0x0000000000403c5c in main (argc=, argv=) at main.c:37
No locals.
(gdb)
===SNIP===

vicuna commented Dec 10, 2016

Comment author: @gasche

Wow, thanks for the reproduction work. Have you contacted the Unison maintainers and Richard Jones, who maintains the OCaml package for Fedora, and pointed them at this issue report? They may have additional insights. (I'm no GC expert, but my impression is that a segfault in compaction can easily be caused by memory corruption anywhere else in the process, not necessarily by the GC code itself; this GC phase simply touches a lot of memory.)
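
One way to look for the kind of corruption described above is to check whether a supposed pointer is really text. As a sketch (the helper name `word_as_ascii` is hypothetical and not part of any tool mentioned here), the following C function decodes a 64-bit word into the bytes it occupies in memory on a little-endian machine such as x86_64. Applied to the `hp` value from frame #0 of the backtrace above, `0x7461705f77617264`, it yields the printable string "draw_pat", which may suggest that string data ended up where a heap pointer was expected. That is only a hint, not a diagnosis.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a 64-bit word into the 8 bytes it occupies
   in memory on a little-endian machine, least significant byte first.
   `out` must have room for 9 chars; the result is NUL-terminated.
   Non-printable bytes are written as '.'.
   Returns 1 if every byte is printable ASCII, 0 otherwise. */
int word_as_ascii(uint64_t word, char out[9])
{
    int all_printable = 1;
    for (size_t i = 0; i < 8; i++) {
        unsigned char byte = (unsigned char)(word >> (8 * i));
        if (byte < 0x20 || byte > 0x7e) {
            all_printable = 0;
            byte = '.';
        }
        out[i] = (char)byte;
    }
    out[8] = '\0';
    return all_printable;
}
```

For comparison, an ordinary address from the same frame, such as `0x7fffd38c7b28`, contains non-printable bytes, so the function returns 0 for it.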

vicuna commented Dec 10, 2016

Comment author: alexmarkley

gasche, thanks for responding! Actually you're the first person to respond directly to any of my bug reports.

Since my troubleshooting has led me deeper and deeper, I am thinking of adjusting my Red Hat Bugzilla issue so that it is a report against the ocaml package instead of the unison package. After all, the ocaml package is probably used by quite a few more people, so I would imagine it will get more exposure that way.

Also, I am quite aware how memory-intensive this kind of operation must be. In fact, since I'm running this on a brand-new laptop, I have been worried that this may actually be pointing to a hardware problem. However, so far, tools like memtest86+ have not reported any issues!

My other sneaking suspicion is that the compiler is producing broken code again. Fedora 25 ships with GCC 6.2.1, which is another major update to the compiler, and I'm not confident the dust has fully settled from all of this:

https://fedoraproject.org/wiki/Changes/GCC6

http://www.phoronix.com/scan.php?page=news_item&px=Fedora-GCC6-Rebuild-Results

I am currently in the process of building OCaml 4.04.0 with all optimizations disabled. Keeping my fingers crossed that this will "magically" fix the problem.

As always, any insight is greatly appreciated.

vicuna commented Dec 10, 2016

Comment author: alexmarkley

I rebuilt OCaml 4.04.0 with all GCC optimizations disabled. (My methodology for doing so can be discussed in more depth if that is of interest.)

I was able to reproduce the segfault again, and interestingly it landed in a different spot:

[alex@obsidian ~]$ gdb /usr/local/bin/unison
GNU gdb (GDB) Fedora 7.12-29.fc25
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/local/bin/unison...done.
(gdb) run -batch -prefer ssh://alex@elbmin//home/alex alexHome
Starting program: /usr/local/bin/unison -batch -prefer ssh://alex@elbmin//home/alex alexHome
Unison 2.50.0 (ocaml 4.04.0): Contacting server...
Detaching after fork from child process 9140.
alex@elbmin's password:
Connected [//elbmin.malexmedia.net//home/alex -> //obsidian.malexmedia.net//home/alex]
Looking for changes
Waiting for changes from server
Reconciling changes
error tmprequest
Error in digesting /home/alex/tmprequest:
/home/alex/tmprequest: Permission denied
<---- new dir .bitcoin
local : absent
elbmin.ma... : new dir modified on 2016-12-06 at 7:53:14 size 110501785182 rwx------
<---- new dir .config
local : absent
elbmin.ma... : new dir modified on 2016-12-05 at 23:09:03 size 3326278 rwx------
<---- new dir .dbus-keyrings
local : absent
elbmin.ma... : new dir modified on 2015-11-04 at 20:21:26 size 71 rwx------
<---- new dir .dvdcss
local : absent
elbmin.ma... : new dir modified on 2014-05-05 at 21:01:28 size 9335 rwxr-x---
<---- new dir .gimp-2.8
===SNIP===
Shortcut: copied /home/alex/Documents/Temp/loresque/resources/crate3d-02.swiv140.png from local file /home/alex/.unison.Documents.7f78c97b4bc1aea68cc67a71dd2bde7f.unison.tmp/Temp/loresque_win32/resources/crate3d-02.swiv140.png
Shortcut: copied /home/alex/Documents/Temp/loresque/resources/crate3d-02.swiv142.png from local file /home/alex/.unison.Documents.7f78c97b4bc1aea68cc67a71dd2bde7f.unison.tmp/Temp/loresque_win32/resources/crate3d-02.swiv142.png
Shortcut: copied /home/alex/Documents/Temp/loresque/resources/crate3d-02.swiv144.png from local file /home/alex/.unison.Documents.7f78c97b4bc1aea68cc67a71dd2bde7f.unison.tmp/Temp/loresque_win32/resources/crate3d-02.swiv144.png
Shortcut: copied /home/alex/Documents/Temp/loresque/resources/crate3d-02.swiv146.png from local file /home/alex/.unison.Documents.7f78c97b4bc1aea68cc67a71dd2bde7f.unison.tmp/Temp/loresque_win32/resources/crate3d-02.swiv146.png
Shortcut: copied /home/alex/Documents/Temp/loresque/resources/crate3d-02.swiv148.png from local file /home/alex/.unison.Documents.7f78c97b4bc1aea68cc67a71dd2bde7f.unison.tmp/Temp/loresque_win32/resources/crate3d-02.swiv148.png
66% 65:19 ETA
Program received signal SIGSEGV, Segmentation fault.
0x000000000048d0f6 in camlList__iter_1252 () at list.ml:75
75 list.ml: No such file or directory.
(gdb) thread apply all bt full

Thread 1 (process 9136):
#0 0x000000000048d0f6 in camlList__iter_1252 () at list.ml:75
No locals.
#1 0x0000000000464b2e in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#2 0x00000000004646db in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#3 0x00000000004647a8 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#4 0x0000000000464e4d in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#5 0x000000000046510e in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#6 0x000000000048d101 in camlList__iter_1252 () at list.ml:77
No locals.
#7 0x0000000000464b2e in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#8 0x0000000000460f92 in camlLwt_unix_impl__restart_threads_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#9 0x0000000000461661 in camlLwt_unix_impl__run_1579 () at /root/unison-git/src/lwt/generic/lwt_unix_impl.ml:147
No locals.
#10 0x000000000040e80a in camlUitext__doTransport_1863 () at /root/unison-git/src/uitext.ml:490
No locals.
#11 0x000000000040f84e in camlUitext__doit_1922 () at /root/unison-git/src/uitext.ml:556
No locals.
#12 0x0000000000410034 in camlUitext__synchronizeOnce_1968 () at /root/unison-git/src/uitext.ml:718
No locals.
#13 0x000000000041094a in camlUitext__loop_2237 () at /root/unison-git/src/uitext.ml:788
No locals.
#14 0x0000000000410b4d in camlUitext__synchronizeUntilDone_2242 () at /root/unison-git/src/uitext.ml:810
No locals.
#15 0x0000000000410df7 in camlUitext__start_2249 () at /root/unison-git/src/uitext.ml:870
No locals.
#16 0x00000000004085fa in camlMain__Body_1550 () at /root/unison-git/src/main.ml:241
No locals.
#17 0x0000000000407a93 in camlLinktext__entry () at /root/unison-git/src/linktext.ml:19
No locals.
#18 0x0000000000404369 in caml_program ()
No symbol table info available.
#19 0x00000000004ef12e in caml_start_program ()
No symbol table info available.
#20 0x00000000004ef475 in caml_main (argv=0x7fffffffdc98) at startup.c:145
exe_name =
proc_self_exe = "/usr/local/bin/unison", '\000' <repeats 234 times>
res =
tos = 0 '\000'
#21 0x0000000000403c5c in main (argc=, argv=) at main.c:37
No locals.
(gdb)

vicuna commented Dec 12, 2016

Comment author: @mshinwell

Can you confirm that, when you disabled GCC optimisation for building OCaml, you did not remove the "-fno-strict-aliasing -fwrapv" compiler flags? These are needed to ensure correct compilation (and the former may be even more important for GCC 6).

In the OCaml source tree built without GCC optimisation, please enter the testsuite/ directory and run "make all". Do any of the tests fail?

vicuna commented Dec 12, 2016

Comment author: alexmarkley

@mshinwell, thanks for your attention to this issue!

My methodology for removing GCC optimization was to remove -O flags only. The following files and variables were modified:

  • configure (bytecccompopts)
  • asmrun/Makefile (CFLAGS)
  • otherlibs/Makefile (OPTCOMPFLAGS)
  • otherlibs/dynlink/Makefile (OPTCOMPFLAGS)
  • otherlibs/systhreads/Makefile (OPTCOMPFLAGS)
  • otherlibs/systhreads/Makefile.nt (OPTCOMPFLAGS)
  • stdlib/Makefile.shared (OPTCOMPFLAGS)

I ran the test suite as you suggested. Both with GCC optimizations and without, the results were exactly the same:

Summary:
635 tests passed
12 tests skipped
0 tests failed
0 unexpected errors
647 tests considered

List of skipped tests:
tests/lib-dynlink-csharp
tests/unwind
tests/asmcomp/static_float_array_flambda_opaque
tests/asmcomp/is_static_flambda
tests/asmcomp/unrolling_flambda
tests/lib-bigarray-2
tests/asmcomp/static_float_array_flambda
tests/asmcomp/unrolling_flambda2
tests/manual-intf-c

vicuna commented Dec 12, 2016

Comment author: alexmarkley

Also of note for this issue, one of the Unison developers suggested that the issue may be a stack overflow: bcpierce00/unison#48 (comment)

However, I carefully tested for and eliminated that possibility using a sequence of tests described here: bcpierce00/unison#48 (comment)

I am still open to any suggestion for troubleshooting steps and/or feedback on my troubleshooting so far.

Steps I am currently pursuing:

  • See if I can get Unison running via a non-native compilation option. (I'm very unfamiliar with OCaml, so this will require some research on my part.)

  • Repurpose some computer hardware and attempt to reproduce this issue on another computer. (I have run every hardware diagnostic I can think of, so I am fairly confident there are no hardware issues with my new laptop, but reproducing this issue on separate hardware would be one way to be sure.)

vicuna commented Dec 12, 2016

Comment author: @gasche

I don't have experience building Unison, but its makefiles appear to have a NATIVE variable that you could set to 'false' to get a bytecode executable. DEBUGGING=true might also help diagnose the problem, although it is mostly useful for generating OCaml stack traces with OCAMLRUNPARAM=b; I'm not sure it makes a difference when you have a core dump.

vicuna commented Dec 12, 2016

Comment author: @mshinwell

I'm making a few investigations into this problem, but in the meantime, I think it would be useful to try to rule out the C compiler change.

Can you try to build an old version of GCC and use that? I believe 4.4.7 is OK for OCaml compilation, but I'm not sure if it will build on your system. However, try to stick to a 4.x version if you can.

This should be fairly easy. You need the development packages for GMP and MPFR, both of which should be available as Fedora packages. Then download the source and create a build directory, which must be outside the extracted source tree. From the build directory, I think you run:

../gcc-4.4.7/configure --prefix=/path/to/install
make bootstrap
make install

Then just add /path/to/install to your path and rebuild OCaml and Unison. Does it still fail?

vicuna commented Dec 12, 2016

Comment author: @mshinwell

I suspect this is not the source of the problem, but it's quite hard to think about so just in case:

In src/bytearray_stubs.c line 32, in the Unison tree, there is this code:

char *src = String_val(s) + Int_val(i);

I think this is wrong. [i] is the offset into a string, which might be large; in particular, it might be so large that the result of the "(int)" cast inside "Int_val" might be a negative number (since on your platform "int" is probably only 32 bits wide, and I imagine your platform is 64 bit). It should be "Long_val" instead.
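
To illustrate the truncation described above, here is a minimal sketch assuming a platform where `int` is 32 bits and OCaml values are 64 bits. The `EX_` macros are simplified stand-ins written for this example and modelled on the runtime's `Val_long`, `Long_val` and `Int_val`; they are not the actual runtime headers or the Unison code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: a 64-bit OCaml value, where an integer n is stored
   tagged as 2n + 1. */
typedef int64_t ex_value;

/* Simplified stand-ins for Val_long, Long_val and Int_val. */
#define EX_VAL_LONG(n) ((ex_value)(((uint64_t)(n) << 1) + 1))
#define EX_LONG_VAL(v) ((v) >> 1)
#define EX_INT_VAL(v)  ((int)EX_LONG_VAL(v))

/* Recover the offset at full width: nothing is lost. */
int64_t offset_via_long_val(int64_t offset)
{
    return EX_LONG_VAL(EX_VAL_LONG(offset));
}

/* Recover the offset through a cast to int: offsets that do not fit in
   32 bits are converted in an implementation-defined way. */
int64_t offset_via_int_val(int64_t offset)
{
    return (int64_t)EX_INT_VAL(EX_VAL_LONG(offset));
}
```

With an offset of 3000000000, `offset_via_long_val` returns the offset unchanged, whereas `offset_via_int_val` typically returns a negative number on GCC and Clang, because converting an out-of-range value to `int` wraps modulo 2^32 on those compilers. Small offsets come back the same from both functions.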

vicuna commented Dec 13, 2016

Comment author: alexmarkley

@gasche, thanks for the find in the Unison Makefile. I'll give that a try if/when I move on to attempting non-native OCaml execution.

@mshinwell, regarding Int_val vs. Long_val, that's a great catch. I tried rebuilding Unison with Long_val on line 32 of src/bytearray_stubs.c and unfortunately it did not fix the problem. (Regardless, if you think Long_val is the more appropriate way to go, I will probably open a pull request against the Unison project and see if they accept it.)

Regarding building an older compiler, GCC 4.4.7 would not compile for me. However, GCC 5.4.0 DID compile okay, so I got that installed in an isolated corner of my system and I am in the process of attempting to reproduce the bug on Fedora 25 on GCC 5.

In addition to all of this, I did manage to build another system with enough disk space to test this issue. It's another x86_64 machine, but for my first test, it was running Ubuntu 16.04.

Interestingly, I was able to successfully run the initial Unison synchronization (the entire 1TB+ replica) to the Ubuntu machine with OCaml 4.04.0 and Unison git master. I was able to do this with GCC 5.4 (included with Ubuntu 16.04) AND GCC 6.2 (from the toolchain test PPA: https://launchpad.net/~ubuntu-toolchain-r/+archive/ubuntu/test?field.series_filter=xenial ).

So I think I was able to confirm that nothing about the data set itself (nor the size of the data set) was causing the Unison codebase to segfault.

My next step is to more thoroughly examine the GCC bug option by attempting to eliminate the issue on my laptop running Fedora 25 by using GCC 5.4.

In parallel, I am going to continue to attempt to reproduce the issue on the new desktop by installing Fedora 25 and using GCC 6.2, which would reduce or eliminate the likelihood of a hardware bug.

vicuna commented Dec 13, 2016

Comment author: @mshinwell

I do think Long_val is correct in that circumstance, yes.

vicuna commented Dec 14, 2016

Comment author: alexmarkley

Another update here:

I was able to reproduce the bug on completely separate hardware. A desktop machine running Fedora 25 segfaulted while synchronizing the replica this afternoon.

Also, I did build GCC 5 on my laptop and used that to build OCaml 4.04.0 (and then used that to build Unison git master). Unfortunately, this did not make the problem go away.

What does this mean? Well, while I'm more sure than ever now that I'm dealing with a software bug (as opposed to a hardware bug), I'm now starting to think that the issue is deeper than I previously thought.

ldd says that the native unison binary requires the following shared libraries:

linux-vdso.so.1 (0x00007ffd31d90000)
libutil.so.1 => /lib64/libutil.so.1 (0x00007f6501a73000)
libm.so.6 => /lib64/libm.so.6 (0x00007f650176a000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f6501566000)
libc.so.6 => /lib64/libc.so.6 (0x00007f65011a0000)
/lib64/ld-linux-x86-64.so.2 (0x000055d772c81000)

Presumably the issue could be:

  • A bug in one or more of these libraries?
  • A compiler bug causing one or more of these libraries to misbehave, introducing memory corruption.
  • A kernel bug.
  • Intel is out to get me.
  • Faeries.

As always, feedback is welcome.

vicuna commented Dec 15, 2016

Comment author: @mshinwell

You said you'd run this successfully on earlier versions of Fedora. Can you tell us the most recent version of Fedora on which this works? (e.g. 24 maybe?)

vicuna commented Dec 15, 2016

Comment author: @chambart

If it is a GC or binding bug, it might be possible to narrow the search space by using the debug runtime.
Set the environment variable OCAMLPARAM=runtime-variant=d,_ before compiling

vicuna commented Dec 15, 2016

Comment author: @gasche

(That is, compiling the compiler distribution with the --with-debug-runtime configure option.)

vicuna commented Dec 15, 2016

Comment author: @mshinwell

You need to do both what @gasche and @chambart said, if I remember correctly.

vicuna commented Dec 15, 2016

Comment author: alexmarkley

@mshinwell, I'm not sure which was the most recent version of Fedora on which I ran Unison in client mode. (The other client host I use a lot is Mac OS X.) It was probably not any earlier than Fedora 22.

I expect I will continue using my repurposed desktop for testing, attempt to reproduce the issue on Fedora 24, and work backward from there.

The other factor here that might muddy the water a bit is that my home directory replica has been growing rapidly for the past couple of years. If this issue only crops up when Unison is running for multiple hours, I'm not confident that I would have been affected by it in the past.

@chambart & @gasche, thank you for letting me know about this option. I was not aware of this, so I will definitely add this to my list.

I was also thinking of engaging Valgrind as another deep-dive debugging tool. It can be really helpful in getting early warnings when things start to go off the rails. The disadvantage for me is that I don't know anything about the OCaml language or runtime internals, so the results might be hard to interpret.

The other option I'm pursuing at this moment is running OCaml & Unison in 32-bit mode on my laptop. Fedora supports installing i686 binaries from the package manager side-by-side with 64-bit binaries, so it was pretty easy to install all of the prerequisite libraries and get OCaml to build in i686 mode.

My rationale here is that the i686 instruction set is pretty different from the x86_64 instruction set, so if the problem is still reproducible in i686 mode, that may indicate a real code bug (as opposed to a compiler regression).

vicuna commented Dec 15, 2016

Comment author: alexmarkley

Quick update: I was able to reproduce the issue in i686 mode. It produced the following backtrace:

===SNIP===
Shortcut: copied /home/alex/Library/Application Support/Skype/malexmedia/dc.db-journal from local file /home/alex/.unison.Library.7f78c97b4bc1aea68cc67a71dd2bde7f.unison.tmp/Application Support/Skype/malexmedia/config.lck
53% 152:48 ETA
Program received signal SIGSEGV, Segmentation fault.
0x08101977 in caml_ba_finalize ()
(gdb) thread apply all bt full

Thread 1 (process 9392):
#0 0x08101977 in caml_ba_finalize ()
No symbol table info available.
#1 0x0810d4cb in sweep_slice (work=1723451766, work@entry=2147483647) at major_gc.c:554
final_fun =
hp = 0xee5425b4 "\377\030"
hd =
#2 0x0810e4ea in caml_finish_major_cycle () at major_gc.c:822
No locals.
#3 0x0811f97b in caml_compact_heap_maybe () at compact.c:539
fw =
fp = 1000000
#4 0x0810e46c in caml_major_collection_slice (howmuch=-1) at major_gc.c:785
p =
dp =
filt_p =
spend =
computed_work = 1341742
i =
#5 0x0810f18d in caml_gc_dispatch () at minor_gc.c:463
trigger = 0xf7cfa000
#6 0x0810c0c6 in caml_garbage_collection () at signals_asm.c:78
No locals.
#7 0x081214e6 in caml_system.code_begin ()
No symbol table info available.
#8 0x080a183b in camlLwt__catch_rec_1248 () at /root/unison-git/src/lwt/lwt.ml:103
No locals.
#9 0x080abbf8 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:177
No locals.
#10 0x080abc0b in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:178
No locals.
#11 0x080abbe1 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:176
No locals.
#12 0x080abbe1 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:176
No locals.
#13 0x080abbe1 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:176
No locals.
#14 0x080abbe1 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:176
No locals.
#15 0x080abc0b in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:178
No locals.
#16 0x080abc0b in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:178
No locals.
#17 0x080abc0b in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:178
No locals.
#18 0x080abbe1 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:176
No locals.
#19 0x080abc0b in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:178
No locals.
#20 0x080abbe1 in camlMyMap__mapi_1340 () at /root/unison-git/src/ubase/myMap.ml:176
No locals.
#21 0x080621c0 in camlFiles__fun_3934 () at /root/unison-git/src/files.ml:551
No locals.
#22 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#23 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#24 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#25 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#26 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#27 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#28 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#29 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#30 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#31 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#32 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#33 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#34 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#35 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#36 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#37 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#38 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#39 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#40 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#41 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#42 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#43 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#44 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#45 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#46 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#47 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#48 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#49 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#50 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#51 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#52 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#53 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#54 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#55 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#56 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#57 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#58 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#59 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#60 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#61 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#62 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#63 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#64 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#65 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#66 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#67 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#68 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#69 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#70 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#71 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#72 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#73 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#74 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#75 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#76 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#77 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#78 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#79 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#80 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#81 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#82 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#83 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#84 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#85 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#86 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#87 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#88 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#89 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#90 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#91 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#92 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#93 0x080a0df4 in camlLwt_util__run_in_region_1_1283 () at /root/unison-git/src/lwt/lwt.ml:109
No locals.
#94 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#95 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#96 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#97 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#98 0x080a0d7e in camlLwt_util__leave_region_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#99 0x080a0e23 in camlLwt_util__fun_1442 () at /root/unison-git/src/lwt/lwt_util.ml:75
No locals.
#100 0x080a14a5 in camlLwt__apply_1225 () at /root/unison-git/src/lwt/lwt.ml:75
No locals.
#101 0x080a1755 in camlLwt__fun_1451 () at /root/unison-git/src/lwt/lwt.ml:94
No locals.
#102 0x080c4f39 in camlList__iter_1252 () at list.ml:77
No locals.
#103 0x080a1160 in camlLwt__restart_1211 () at /root/unison-git/src/lwt/lwt.ml:31
No locals.
#104 0x0809dcde in camlLwt_unix_impl__restart_threads_1278 () at /root/unison-git/src/lwt/lwt.ml:83
No locals.
#105 0x0809e397 in camlLwt_unix_impl__run_1579 () at /root/unison-git/src/lwt/generic/lwt_unix_impl.ml:147
No locals.
#106 0x08054e8d in camlUitext__doTransport_1863 () at /root/unison-git/src/uitext.ml:490
No locals.
#107 0x08055bd8 in camlUitext__doit_1922 () at /root/unison-git/src/uitext.ml:556
No locals.
#108 0x080562b9 in camlUitext__synchronizeOnce_1968 () at /root/unison-git/src/uitext.ml:718
No locals.
#109 0x08056b41 in camlUitext__loop_2237 () at /root/unison-git/src/uitext.ml:788
No locals.
#110 0x08056d26 in camlUitext__synchronizeUntilDone_2242 () at /root/unison-git/src/uitext.ml:810
No locals.
#111 0x08056f6e in camlUitext__start_2249 () at /root/unison-git/src/uitext.ml:870
No locals.
#112 0x0804fa34 in camlMain__Body_1550 () at /root/unison-git/src/main.ml:241
---Type <return> to continue, or q <return> to quit---
No locals.
#113 0x0804ef8d in camlLinktext__entry () at /root/unison-git/src/linktext.ml:19
No locals.
#114 0x0804b00c in caml_program ()
No symbol table info available.
#115 0x08121629 in caml_start_program ()
No symbol table info available.
#116 0x08121945 in caml_main (argv=0xffffcdf4) at startup.c:145
exe_name =
proc_self_exe = "/usr/local/bin/unison", '\000' <repeats 234 times>
res =
tos = -9 '\367'
#117 0x0804aa79 in main (argc=2, argv=0xffffcdf4) at main.c:37
No locals.
(gdb)
===SNIP===

I am also now working on running Unison with the OCaml debug runtime under Valgrind to see if I can get more info.

@vicuna

vicuna commented Dec 15, 2016

Comment author: @gasche

I'm not super-familiar with Valgrind's precise guarantees, but I expect that one limitation of it for OCaml programs is that it probably can't detect memory reads/writes that violate OCaml's memory model (in terms of GC expectations etc.) but occur within GC-owned memory.

(A tool that could be interesting to use is an overflow-sanitizer, in case the issue here is overflow-related.)

@vicuna

vicuna commented Dec 15, 2016

Comment author: @mshinwell

Yeah, I'm unsure valgrind will be of much help here. I think there may be a lot of spurious warnings generated as well, from what I remember last time I tried to use it.

@vicuna

vicuna commented Dec 16, 2016

Comment author: alexmarkley

So initially I tried reproducing the problem running the OCaml debug runtime inside of valgrind. However, the combined performance hit from both worked out to something like a factor of ten, and the ETA of the experiment was over 24 hours.

So instead I broke them up and ran Unison with the OCaml debug runtime separately from the Valgrind experiment.

The OCaml debug runtime finished already, and I have uploaded the output to this issue. ( https://caml.inria.fr/mantis/file_download.php?file_id=1648&type=bug )

Unison's output contains a lot of sensitive/personal information, so I redacted the output quite a lot. Most everything that remains is the debug runtime's chatter.

I was concerned about the exception at the end -- I've never seen that before. But I couldn't tell if it terminated the program or not. It looks like Unison might have finished successfully.

Regardless, the segmentation fault did not occur, which escalates this issue to my very favorite kind of bug: https://en.wikipedia.org/wiki/Heisenbug

Regarding Valgrind, which is still running, I have had some success with it in the past detecting early causes or symptoms of memory corruption. Obviously the limitations @gasche mentions are salient -- if you are scribbling on your own memory, nobody cares but you.

However, if we go on the assumption that the OCaml heap-related functions are well debugged and probably bug-free (especially likely, since the segfault isn't happening everywhere), it seems reasonable to expect that the root cause of the bug is memory corruption happening earlier, forming garbage input or garbage state, causing the later code to go crazy.

I've seen a really subtle, off-by-one-byte overflow issue go unnoticed in a codebase for YEARS because it so rarely caused any issues. But then on one platform it would cause a stack corruption that would crash the program (of course many many calls later) so we had to track it down and fix it.

@mshinwell, regarding Valgrind's spurious errors, that often depends on third-party code. I have seen cases where graphically-enabled code gets lots of warnings due to passing buffers back and forth with kernel drivers (usually deep in third-party libraries). Usually those warnings/errors can be suppressed so you can focus on the issue(s) at hand.

@vicuna

vicuna commented Dec 16, 2016

Comment author: alexmarkley

Valgrind finished, and I'm optimistic that it may have uncovered the root cause of the segfaults.

I've uploaded the full transcript (personal information redacted as described before) here: https://caml.inria.fr/mantis/file_download.php?file_id=1649&type=bug

However, everything is explained pretty clearly in the error summary at the bottom:

==20579== ERROR SUMMARY: 528 errors from 3 contexts (suppressed: 0 from 0)
==20579==
==20579== 36 errors in context 1 of 3:
==20579== Invalid read of size 1
==20579== at 0x4E474D: read8u (intern.c:78)
==20579== by 0x4E474D: intern_rec (intern.c:356)
==20579== by 0x4E5412: input_val_from_block.isra.2 (intern.c:811)
==20579== by 0x4E57A4: caml_input_value_from_block (intern.c:839)
==20579== by 0x43E7EA: camlRemote__fun_4028 (remote.ml:454)
==20579== by 0x43F816: camlRemote__fun_4404 (remote.ml:464)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x464B2D: camlLwt__restart_1211 (lwt.ml:31)
==20579== by 0x46182D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x461670: camlLwt_unix_impl__run_1579 (lwt_unix_impl.ml:148)
==20579== Address 0x6d0cf87 is 23 bytes inside a block of size 51 free'd
==20579== at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==20579== by 0x4DB9E5: caml_empty_minor_heap (minor_gc.c:381)
==20579== by 0x4DBE3A: caml_gc_dispatch (minor_gc.c:438)
==20579== by 0x4DCE29: caml_alloc_small (alloc.c:65)
==20579== by 0x4E51B8: intern_alloc.part.1 (intern.c:591)
==20579== by 0x4E5408: intern_alloc (intern.c:567)
==20579== by 0x4E5408: input_val_from_block.isra.2 (intern.c:809)
==20579== by 0x4E57A4: caml_input_value_from_block (intern.c:839)
==20579== by 0x43E7EA: camlRemote__fun_4028 (remote.ml:454)
==20579== by 0x43F816: camlRemote__fun_4404 (remote.ml:464)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== Block was alloc'd at
==20579== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==20579== by 0x4CEA55: caml_ba_alloc (in /usr/local/bin/unison)
==20579== by 0x4CEBCF: caml_ba_create (in /usr/local/bin/unison)
==20579== by 0x47E877: camlBigarray__create_1527 (bigarray.ml:143)
==20579== by 0x43EAD6: camlRemote__fun_4051 (bytearray.ml:24)
==20579== by 0x43F779: camlRemote__fun_4312 (remote.ml:706)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x464B2D: camlLwt__restart_1211 (lwt.ml:31)
==20579== by 0x46182D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579==
==20579==
==20579== 150 errors in context 2 of 3:
==20579== Invalid read of size 1
==20579== at 0x4C344D8: memmove (vg_replace_strmem.c:1252)
==20579== by 0x4E4852: readblock (intern.c:134)
==20579== by 0x4E4852: intern_rec (intern.c:403)
==20579== by 0x4E5412: input_val_from_block.isra.2 (intern.c:811)
==20579== by 0x4E57A4: caml_input_value_from_block (intern.c:839)
==20579== by 0x43E7EA: camlRemote__fun_4028 (remote.ml:454)
==20579== by 0x43F816: camlRemote__fun_4404 (remote.ml:464)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x464B2D: camlLwt__restart_1211 (lwt.ml:31)
==20579== by 0x46182D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== Address 0x6d0cf89 is 25 bytes inside a block of size 51 free'd
==20579== at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==20579== by 0x4DB9E5: caml_empty_minor_heap (minor_gc.c:381)
==20579== by 0x4DBE3A: caml_gc_dispatch (minor_gc.c:438)
==20579== by 0x4DCE29: caml_alloc_small (alloc.c:65)
==20579== by 0x4E51B8: intern_alloc.part.1 (intern.c:591)
==20579== by 0x4E5408: intern_alloc (intern.c:567)
==20579== by 0x4E5408: input_val_from_block.isra.2 (intern.c:809)
==20579== by 0x4E57A4: caml_input_value_from_block (intern.c:839)
==20579== by 0x43E7EA: camlRemote__fun_4028 (remote.ml:454)
==20579== by 0x43F816: camlRemote__fun_4404 (remote.ml:464)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== Block was alloc'd at
==20579== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==20579== by 0x4CEA55: caml_ba_alloc (in /usr/local/bin/unison)
==20579== by 0x4CEBCF: caml_ba_create (in /usr/local/bin/unison)
==20579== by 0x47E877: camlBigarray__create_1527 (bigarray.ml:143)
==20579== by 0x43EAD6: camlRemote__fun_4051 (bytearray.ml:24)
==20579== by 0x43F779: camlRemote__fun_4312 (remote.ml:706)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x464B2D: camlLwt__restart_1211 (lwt.ml:31)
==20579== by 0x46182D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579==
==20579==
==20579== 342 errors in context 3 of 3:
==20579== Invalid read of size 1
==20579== at 0x4C344E6: memmove (vg_replace_strmem.c:1252)
==20579== by 0x4E4852: readblock (intern.c:134)
==20579== by 0x4E4852: intern_rec (intern.c:403)
==20579== by 0x4E5412: input_val_from_block.isra.2 (intern.c:811)
==20579== by 0x4E57A4: caml_input_value_from_block (intern.c:839)
==20579== by 0x43E7EA: camlRemote__fun_4028 (remote.ml:454)
==20579== by 0x43F816: camlRemote__fun_4404 (remote.ml:464)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x464B2D: camlLwt__restart_1211 (lwt.ml:31)
==20579== by 0x46182D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== Address 0x6d0cf8b is 27 bytes inside a block of size 51 free'd
==20579== at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==20579== by 0x4DB9E5: caml_empty_minor_heap (minor_gc.c:381)
==20579== by 0x4DBE3A: caml_gc_dispatch (minor_gc.c:438)
==20579== by 0x4DCE29: caml_alloc_small (alloc.c:65)
==20579== by 0x4E51B8: intern_alloc.part.1 (intern.c:591)
==20579== by 0x4E5408: intern_alloc (intern.c:567)
==20579== by 0x4E5408: input_val_from_block.isra.2 (intern.c:809)
==20579== by 0x4E57A4: caml_input_value_from_block (intern.c:839)
==20579== by 0x43E7EA: camlRemote__fun_4028 (remote.ml:454)
==20579== by 0x43F816: camlRemote__fun_4404 (remote.ml:464)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== Block was alloc'd at
==20579== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==20579== by 0x4CEA55: caml_ba_alloc (in /usr/local/bin/unison)
==20579== by 0x4CEBCF: caml_ba_create (in /usr/local/bin/unison)
==20579== by 0x47E877: camlBigarray__create_1527 (bigarray.ml:143)
==20579== by 0x43EAD6: camlRemote__fun_4051 (bytearray.ml:24)
==20579== by 0x43F779: camlRemote__fun_4312 (remote.ml:706)
==20579== by 0x464E4C: camlLwt__apply_1225 (lwt.ml:75)
==20579== by 0x46510D: camlLwt__fun_1451 (lwt.ml:94)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579== by 0x464B2D: camlLwt__restart_1211 (lwt.ml:31)
==20579== by 0x46182D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==20579== by 0x48D100: camlList__iter_1252 (list.ml:77)
==20579==
==20579== ERROR SUMMARY: 528 errors from 3 contexts (suppressed: 0 from 0)

I'm still analysing these errors to determine what exactly they indicate. I initially thought they were simple off-by-one read overflows, but now I'm wondering if these are read-after-free errors.

To make the backtraces clearer I'm going to rebuild OCaml without optimization (using the method described above) and reproduce these error messages.

The documentation here: ( http://valgrind.org/docs/manual/quick-start.html ) and here: ( http://valgrind.org/docs/manual/mc-manual.html#mc-manual.errormsgs ) should be especially helpful.

This is all so great! I'm very optimistic that this may be a simple bug after all.

@vicuna

vicuna commented Dec 17, 2016

Comment author: alexmarkley

The unoptimized run finished. I've uploaded the (redacted) transcript here: https://caml.inria.fr/mantis/file_download.php?file_id=1650&type=bug

The relevant portion (the error summary) is here:

==27863== ERROR SUMMARY: 448 errors from 3 contexts (suppressed: 0 from 0)
==27863==
==27863== 30 errors in context 1 of 3:
==27863== Invalid read of size 1
==27863== at 0x4ECEF5: read8u (intern.c:78)
==27863== by 0x4ED741: intern_rec (intern.c:356)
==27863== by 0x4EE909: input_val_from_block (intern.c:811)
==27863== by 0x4EE9D3: caml_input_value_from_block (intern.c:839)
==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)
==27863== by 0x43E74A: camlRemote__fun_4028 (remote.ml:454)
==27863== by 0x43F776: camlRemote__fun_4404 (remote.ml:464)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== by 0x46506D: camlLwt__fun_1451 (lwt.ml:94)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863== by 0x464A8D: camlLwt__restart_1211 (lwt.ml:31)
==27863== by 0x46178D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==27863== Address 0x627bb47 is 23 bytes inside a block of size 55 free'd
==27863== at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==27863== by 0x4CEE04: caml_ba_finalize (in /usr/local/bin/unison)
==27863== by 0x4E0EF0: caml_empty_minor_heap (minor_gc.c:381)
==27863== by 0x4E10CB: caml_gc_dispatch (minor_gc.c:438)
==27863== by 0x4E26F4: caml_alloc_small (alloc.c:65)
==27863== by 0x4EE0AD: intern_alloc (intern.c:591)
==27863== by 0x4EE8FD: input_val_from_block (intern.c:809)
==27863== by 0x4EE9D3: caml_input_value_from_block (intern.c:839)
==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)
==27863== by 0x43E74A: camlRemote__fun_4028 (remote.ml:454)
==27863== by 0x43F776: camlRemote__fun_4404 (remote.ml:464)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== Block was alloc'd at
==27863== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==27863== by 0x4CDA61: caml_ba_alloc (in /usr/local/bin/unison)
==27863== by 0x4CDD5A: caml_ba_create (in /usr/local/bin/unison)
==27863== by 0x47E7D7: camlBigarray__create_1527 (bigarray.ml:143)
==27863== by 0x43EA36: camlRemote__fun_4051 (bytearray.ml:24)
==27863== by 0x43F6D9: camlRemote__fun_4312 (remote.ml:706)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== by 0x46506D: camlLwt__fun_1451 (lwt.ml:94)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863== by 0x464A8D: camlLwt__restart_1211 (lwt.ml:31)
==27863== by 0x46178D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863==
==27863==
==27863== 127 errors in context 2 of 3:
==27863== Invalid read of size 1
==27863== at 0x4C344D8: memmove (vg_replace_strmem.c:1252)
==27863== by 0x4ED17E: readblock (intern.c:134)
==27863== by 0x4EDA7B: intern_rec (intern.c:403)
==27863== by 0x4EE909: input_val_from_block (intern.c:811)
==27863== by 0x4EE9D3: caml_input_value_from_block (intern.c:839)
==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)
==27863== by 0x43E74A: camlRemote__fun_4028 (remote.ml:454)
==27863== by 0x43F776: camlRemote__fun_4404 (remote.ml:464)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== by 0x46506D: camlLwt__fun_1451 (lwt.ml:94)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863== by 0x464A8D: camlLwt__restart_1211 (lwt.ml:31)
==27863== Address 0x627bb49 is 25 bytes inside a block of size 55 free'd
==27863== at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==27863== by 0x4CEE04: caml_ba_finalize (in /usr/local/bin/unison)
==27863== by 0x4E0EF0: caml_empty_minor_heap (minor_gc.c:381)
==27863== by 0x4E10CB: caml_gc_dispatch (minor_gc.c:438)
==27863== by 0x4E26F4: caml_alloc_small (alloc.c:65)
==27863== by 0x4EE0AD: intern_alloc (intern.c:591)
==27863== by 0x4EE8FD: input_val_from_block (intern.c:809)
==27863== by 0x4EE9D3: caml_input_value_from_block (intern.c:839)
==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)
==27863== by 0x43E74A: camlRemote__fun_4028 (remote.ml:454)
==27863== by 0x43F776: camlRemote__fun_4404 (remote.ml:464)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== Block was alloc'd at
==27863== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==27863== by 0x4CDA61: caml_ba_alloc (in /usr/local/bin/unison)
==27863== by 0x4CDD5A: caml_ba_create (in /usr/local/bin/unison)
==27863== by 0x47E7D7: camlBigarray__create_1527 (bigarray.ml:143)
==27863== by 0x43EA36: camlRemote__fun_4051 (bytearray.ml:24)
==27863== by 0x43F6D9: camlRemote__fun_4312 (remote.ml:706)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== by 0x46506D: camlLwt__fun_1451 (lwt.ml:94)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863== by 0x464A8D: camlLwt__restart_1211 (lwt.ml:31)
==27863== by 0x46178D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863==
==27863==
==27863== 291 errors in context 3 of 3:
==27863== Invalid read of size 1
==27863== at 0x4C344E6: memmove (vg_replace_strmem.c:1252)
==27863== by 0x4ED17E: readblock (intern.c:134)
==27863== by 0x4EDA7B: intern_rec (intern.c:403)
==27863== by 0x4EE909: input_val_from_block (intern.c:811)
==27863== by 0x4EE9D3: caml_input_value_from_block (intern.c:839)
==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)
==27863== by 0x43E74A: camlRemote__fun_4028 (remote.ml:454)
==27863== by 0x43F776: camlRemote__fun_4404 (remote.ml:464)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== by 0x46506D: camlLwt__fun_1451 (lwt.ml:94)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863== by 0x464A8D: camlLwt__restart_1211 (lwt.ml:31)
==27863== Address 0x627bb4b is 27 bytes inside a block of size 55 free'd
==27863== at 0x4C2ED4A: free (vg_replace_malloc.c:530)
==27863== by 0x4CEE04: caml_ba_finalize (in /usr/local/bin/unison)
==27863== by 0x4E0EF0: caml_empty_minor_heap (minor_gc.c:381)
==27863== by 0x4E10CB: caml_gc_dispatch (minor_gc.c:438)
==27863== by 0x4E26F4: caml_alloc_small (alloc.c:65)
==27863== by 0x4EE0AD: intern_alloc (intern.c:591)
==27863== by 0x4EE8FD: input_val_from_block (intern.c:809)
==27863== by 0x4EE9D3: caml_input_value_from_block (intern.c:839)
==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)
==27863== by 0x43E74A: camlRemote__fun_4028 (remote.ml:454)
==27863== by 0x43F776: camlRemote__fun_4404 (remote.ml:464)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== Block was alloc'd at
==27863== at 0x4C2DB9D: malloc (vg_replace_malloc.c:299)
==27863== by 0x4CDA61: caml_ba_alloc (in /usr/local/bin/unison)
==27863== by 0x4CDD5A: caml_ba_create (in /usr/local/bin/unison)
==27863== by 0x47E7D7: camlBigarray__create_1527 (bigarray.ml:143)
==27863== by 0x43EA36: camlRemote__fun_4051 (bytearray.ml:24)
==27863== by 0x43F6D9: camlRemote__fun_4312 (remote.ml:706)
==27863== by 0x464DAC: camlLwt__apply_1225 (lwt.ml:75)
==27863== by 0x46506D: camlLwt__fun_1451 (lwt.ml:94)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863== by 0x464A8D: camlLwt__restart_1211 (lwt.ml:31)
==27863== by 0x46178D: camlLwt_unix_impl__fun_2430 (lwt_unix_impl.ml:153)
==27863== by 0x48D060: camlList__iter_1252 (list.ml:77)
==27863==
==27863== ERROR SUMMARY: 448 errors from 3 contexts (suppressed: 0 from 0)

@vicuna

vicuna commented Dec 17, 2016

Comment author: alexmarkley

Based on the Valgrind documentation, it definitely looks like all three of these issues are read-after-free errors.

@vicuna

vicuna commented Dec 17, 2016

Comment author: alexmarkley

Taking a quick look at these backtraces, one common denominator I see is this:

==27863== by 0x4DA418: ml_unmarshal_from_bigarray (bytearray_stubs.c:25)

It jumps out at me because it's within the Unison codebase. Is it possible that Unison is doing something funky or deprecated with the OCaml internals API that is causing this issue?

I might need someone who knows OCaml internals to look at this -- I'm going to be pretty useless from here on out.

@vicuna

vicuna commented Dec 17, 2016

Comment author: @gasche

There was a change to the OCaml implementation of marshaling in
4.03.0 that made it possible to (un)marshal blocks larger than 4 GiB -- see
#224. Before this change, attempting to marshal larger blocks would
fail; after this change, the marshaling code uses a header format for
such large blocks that is different from the legacy one.

I looked at Unison (un)marshalling code (included for reference at the
end of this message), and I don't see anything obviously odd -- but
then I know little about bigarray representation, and Jérôme Vouillon
(who wrote this code in 2009) obviously knows more.

On the other hand, there seems to be logic in Unison that checks the
format of marshaled data, and may not be aware of the new marshaling
format:
https://github.com/bcpierce00/unison/blob/63d0a58/src/remote.ml#L429-L439

So it is theoretically possible that this mismatch could be a cause of
the issue (also, the new codepaths for large packets are less often
tested, so there may be a bug there for Unison's mode of use). There
are two things that I don't understand, however:

  • If Unison used larger-than-4 GiB marshals, it would fail before
    4.03.0. There could be logic in Unison to retry with smaller chunks
    in case of failure, which means that the new codepath would start
    being used with 4.03, but I haven't found any such logic while
    looking (lightly) at the codebase.

  • The code handling the marshaling format seems rather defensive and
    mostly fails if it sees something unexpected. If the
    larger-than-4 GiB marshaling happened, I would expect such a failure
    instead of undefined behavior.

So it seems unlikely to me that this is the cause of the issue, but we
can still ask Jérôme if that rings a bell.

CAMLprim value ml_marshal_to_bigarray(value v, value flags)
{
  char *buf;
  long len;
  output_value_to_malloc(v, flags, &buf, &len);
  return alloc_bigarray(BIGARRAY_UINT8 | BIGARRAY_C_LAYOUT | BIGARRAY_MANAGED,
                        1, buf, &len);
}

#define Array_data(a, i) (((char *) a->data) + Long_val(i))

CAMLprim value ml_unmarshal_from_bigarray(value b, value ofs)
{
  struct caml_bigarray *b_arr = Bigarray_val(b);
  return input_value_from_block (Array_data (b_arr, ofs),
                                 b_arr->dim[0] - Long_val(ofs));
}
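
To make the header-format remark above concrete, here is a minimal, self-contained sketch of how a reader could tell the two marshaling formats apart. The magic numbers and header sizes are assumptions based on OCaml's caml/intext.h (4.03 and later) and should be checked against the runtime in use. The names classify_marshal_header and marshal_header_size are hypothetical; they are not part of Unison or of the OCaml runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, as found in OCaml's caml/intext.h (4.03+). */
#define MAGIC_SMALL 0x8495A6BEu  /* legacy header, 20 bytes */
#define MAGIC_BIG   0x8495A6BFu  /* header for large data, 32 bytes */

enum marshal_header { HEADER_INVALID, HEADER_SMALL, HEADER_BIG };

/* Marshaled data stores its header fields in big-endian order. */
static uint32_t read_be32(const unsigned char *p)
{
  assert(p != NULL);
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
       | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: classify a buffer by its leading magic number. */
enum marshal_header classify_marshal_header(const unsigned char *buf,
                                            size_t len)
{
  if (buf == NULL || len < 4) return HEADER_INVALID;
  switch (read_be32(buf)) {
  case MAGIC_SMALL: return HEADER_SMALL;
  case MAGIC_BIG:   return HEADER_BIG;
  default:          return HEADER_INVALID;
  }
}

/* Hypothetical helper: number of header bytes that precede the payload. */
size_t marshal_header_size(enum marshal_header kind)
{
  switch (kind) {
  case HEADER_SMALL: return 20;
  case HEADER_BIG:   return 32;
  default:           return 0;
  }
}
```

Code that assumes a fixed 20-byte header would correspond to the HEADER_SMALL case only; rejecting anything else, as the defensive check linked above appears to do, would produce a clean failure rather than memory corruption.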

@vicuna

vicuna commented Dec 19, 2016

Comment author: @mshinwell

Thanks for the valgrind output, which was very helpful.

I think this may be a bug in Unison. I conjecture that it unmarshals data from a bigarray which, immediately after the call to [Bytearray.unmarshal], is dead in the OCaml code (in the liveness sense). The C variable in bytearray_stubs.c, function ml_unmarshal_from_bigarray, that holds the bigarray value is not registered as a root, despite the fact that data is being read from memory managed by that bigarray. As such, if an allocation triggers a GC during unmarshalling, as we can see happened in the valgrind trace above, the memory being unmarshalled from could presumably be freed or overwritten before unmarshalling finishes.

Try this:


diff --git a/src/bytearray_stubs.c b/src/bytearray_stubs.c
index ec1ed65..8c7c681 100644
--- a/src/bytearray_stubs.c
+++ b/src/bytearray_stubs.c
@@ -21,9 +21,12 @@ CAMLprim value ml_marshal_to_bigarray(value v, value flags)

 CAMLprim value ml_unmarshal_from_bigarray(value b, value ofs)
 {
+  CAMLparam1(b); /* Holds [b] live until unmarshalling completes. */
+  value result;
   struct caml_bigarray *b_arr = Bigarray_val(b);
-  return input_value_from_block (Array_data (b_arr, ofs),
-                                 b_arr->dim[0] - Long_val(ofs));
+  result = input_value_from_block (Array_data (b_arr, ofs),
+                                   b_arr->dim[0] - Long_val(ofs));
+  CAMLreturn(result);
 }

 CAMLprim value ml_blit_string_to_bigarray

I suspect this has always been a bug, but one that was previously unlikely to manifest since bigarrays were only finalised by the major GC. #92 changed that. However, I believe that patch is in 4.03, so I'm not sure why it wasn't seen there. Quite possibly just chance.
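
The lifetime problem conjectured in this comment can be modelled without the OCaml runtime. The following is only a toy sketch: every name in it (toy_bigarray, toy_minor_gc, toy_unmarshal and so on) is hypothetical, the "collector" is a few lines of C, and nothing here is the real runtime API. It shows a reader copying from a buffer in two steps with a collection in between; unless the owning object has been registered as a root, its finaliser frees the buffer part-way through the read.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a bigarray: a malloc'd buffer owned by a collected object. */
struct toy_bigarray {
  unsigned char *data;  /* NULL once the finaliser has run */
  size_t len;
};

/* Toy root set: objects listed here survive a collection. */
#define MAX_ROOTS 8
static struct toy_bigarray *toy_roots[MAX_ROOTS];
static size_t toy_n_roots = 0;

static void toy_push_root(struct toy_bigarray *b)
{
  assert(toy_n_roots < MAX_ROOTS);
  toy_roots[toy_n_roots++] = b;
}

static void toy_pop_root(void)
{
  assert(toy_n_roots > 0);
  toy_n_roots--;
}

/* Toy collection: finalise [b] (free its buffer) unless it is rooted. */
static void toy_minor_gc(struct toy_bigarray *b)
{
  for (size_t i = 0; i < toy_n_roots; i++)
    if (toy_roots[i] == b) return;
  free(b->data);
  b->data = NULL;
}

/* Allocate a toy bigarray filled with the bytes 0, 1, 2, ... */
struct toy_bigarray toy_create(size_t len)
{
  struct toy_bigarray b = { malloc(len), len };
  assert(b.data != NULL);
  for (size_t i = 0; i < len; i++) b.data[i] = (unsigned char)i;
  return b;
}

/* Copy [b] into [out] in two halves, with a collection in between (standing
   in for the allocation that unmarshalling performs). Returns 1 if the whole
   buffer was copied, 0 if the source was finalised part-way through. */
int toy_unmarshal(struct toy_bigarray *b, unsigned char *out, int register_root)
{
  size_t half = b->len / 2;
  int complete = 0;

  if (b->data == NULL) return 0;
  if (register_root) toy_push_root(b);   /* plays the role of CAMLparam1(b) */

  memcpy(out, b->data, half);
  toy_minor_gc(b);                       /* collection triggered mid-read */
  if (b->data != NULL) {                 /* the real stub has no such check */
    memcpy(out + half, b->data + half, b->len - half);
    complete = 1;
  }

  if (register_root) toy_pop_root();     /* plays the role of CAMLreturn */
  return complete;
}
```

Calling toy_unmarshal with register_root set to 0 models the unpatched stub: the second half of the read would touch freed memory, which is what Valgrind reports as an invalid read inside a free'd block. With register_root set to 1, the buffer survives until the read completes.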

@vicuna

vicuna commented Dec 19, 2016

Comment author: alexmarkley

@mshinwell, thanks for the patch! This sounds like a reasonable theory.

I have a couple of experiments running right now, but as soon as they are done I will try this patch.

@vicuna

vicuna commented Dec 19, 2016

Comment author: alexmarkley

@mshinwell, I got the following:

ocamlopt: bytearray_stubs.c ---> bytearray_stubs.o
ocamlopt -g -I lwt -I ubase -I system -I fsmonitor -I fsmonitor/linux -I fsmonitor/windows -I system/generic -I lwt/generic -ccopt "-o "/root/unison-git/src/bytearray_stubs.o -c /root/unison-git/src/bytearray_stubs.c
/root/unison-git/src/bytearray_stubs.c: In function ‘ml_unmarshal_from_bigarray’:
/root/unison-git/src/bytearray_stubs.c:24:3: warning: implicit declaration of function ‘CAMLparam1’ [-Wimplicit-function-declaration]
CAMLparam1(b); /* Holds [b] live until unmarshalling completes. */
^~~~~~~~~~
/root/unison-git/src/bytearray_stubs.c:29:3: warning: implicit declaration of function ‘CAMLreturn’ [-Wimplicit-function-declaration]
CAMLreturn(result);
^~~~~~~~~~
/root/unison-git/src/bytearray_stubs.c:30:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
Linking unison
ocamlopt -verbose -g -I lwt -I ubase -I system -I fsmonitor -I fsmonitor/linux -I fsmonitor/windows -I system/generic -I lwt/generic -o unison unix.cmxa str.cmxa bigarray.cmxa ubase/rx.cmx unicode_tables.cmx unicode.cmx bytearray.cmx system/system_generic.cmx system/generic/system_impl.cmx system.cmx ubase/projectInfo.cmx ubase/myMap.cmx ubase/safelist.cmx ubase/util.cmx ubase/uarg.cmx ubase/prefs.cmx ubase/trace.cmx ubase/proplist.cmx lwt/pqueue.cmx lwt/lwt.cmx lwt/lwt_util.cmx lwt/generic/lwt_unix_impl.cmx lwt/lwt_unix.cmx uutil.cmx case.cmx pred.cmx fileutil.cmx name.cmx path.cmx fspath.cmx fs.cmx fingerprint.cmx abort.cmx osx.cmx external.cmx fswatch.cmx props.cmx fileinfo.cmx os.cmx lock.cmx clroot.cmx common.cmx tree.cmx checksum.cmx terminal.cmx transfer.cmx xferhint.cmx remote.cmx globals.cmx fswatchold.cmx fpcache.cmx update.cmx copy.cmx stasher.cmx files.cmx sortri.cmx recon.cmx transport.cmx strings.cmx uicommon.cmx uitext.cmx test.cmx main.cmx linktext.cmx osxsupport.o pty.o bytearray_stubs.o -cclib -lutil

  • as -o '/tmp/camlstartupddfc5b.o' '/tmp/camlstartup8a035d.s'
  • gcc -o 'unison' '-Llwt' '-Lubase' '-Lsystem' '-Lfsmonitor' '-Lfsmonitor/linux' '-Lfsmonitor/windows' '-Lsystem/generic' '-Llwt/generic' '-L/usr/local/lib/ocaml' '/tmp/camlstartupddfc5b.o' '/usr/local/lib/ocaml/std_exit.o' 'linktext.o' 'main.o' 'test.o' 'uitext.o' 'uicommon.o' 'strings.o' 'transport.o' 'recon.o' 'sortri.o' 'files.o' 'stasher.o' 'copy.o' 'update.o' 'fpcache.o' 'fswatchold.o' 'globals.o' 'remote.o' 'xferhint.o' 'transfer.o' 'terminal.o' 'checksum.o' 'tree.o' 'common.o' 'clroot.o' 'lock.o' 'os.o' 'fileinfo.o' 'props.o' 'fswatch.o' 'external.o' 'osx.o' 'abort.o' 'fingerprint.o' 'fs.o' 'fspath.o' 'path.o' 'name.o' 'fileutil.o' 'pred.o' 'case.o' 'uutil.o' 'lwt/lwt_unix.o' 'lwt/generic/lwt_unix_impl.o' 'lwt/lwt_util.o' 'lwt/lwt.o' 'lwt/pqueue.o' 'ubase/proplist.o' 'ubase/trace.o' 'ubase/prefs.o' 'ubase/uarg.o' 'ubase/util.o' 'ubase/safelist.o' 'ubase/myMap.o' 'ubase/projectInfo.o' 'system.o' 'system/generic/system_impl.o' 'system/system_generic.o' 'bytearray.o' 'unicode.o' 'unicode_tables.o' 'ubase/rx.o' '/usr/local/lib/ocaml/bigarray.a' '/usr/local/lib/ocaml/str.a' '/usr/local/lib/ocaml/unix.a' '/usr/local/lib/ocaml/stdlib.a' '-lbigarray' '-lcamlstr' '-lunix' 'osxsupport.o' 'pty.o' 'bytearray_stubs.o' '-lutil' '/usr/local/lib/ocaml/libasmrun.a' -lm -ldl
    bytearray_stubs.o: In function `ml_unmarshal_from_bigarray':
    /root/unison-git/src/bytearray_stubs.c:24: undefined reference to `CAMLparam1'
    /root/unison-git/src/bytearray_stubs.c:29: undefined reference to `CAMLreturn'
    collect2: error: ld returned 1 exit status
    File "caml_startup", line 1:
    Error: Error during linking
    Makefile.OCaml:437: recipe for target 'unison' failed
    make[1]: *** [unison] Error 2
    make[1]: Leaving directory '/root/unison-git/src'
    Makefile:6: recipe for target 'text' failed
    make: *** [text] Error 2

I'm investigating now, but if you have any pointers on how to get this to correctly build & link, that would be great.

vicuna commented Dec 19, 2016

Comment author: alexmarkley

Never mind! A quick grep through the ocaml codebase revealed the symbols as macros defined in caml/memory.h.

Testing now...
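
For anyone hitting the same link error: a minimal sketch of why a macro can surface as an "undefined reference", assuming the stub file was simply missing `#include <caml/memory.h>`. The `DEMO_*` macros and `demo_stub` below are hypothetical stand-ins, not the real OCaml runtime definitions, so the snippet compiles without the OCaml headers.

```c
/* Stand-in for a header such as caml/memory.h.
   DEMO_PARAM1 and DEMO_RETURN are HYPOTHETICAL names used only for
   illustration; the real CAMLparam1/CAMLreturn do more than this. */
#define DEMO_PARAM1(x)  long demo_saved_root = (x)  /* expands to a declaration */
#define DEMO_RETURN(r)  return (r)                  /* expands to a statement */

/* With the macros visible, the preprocessor rewrites both lines before
   compilation, so no symbol called DEMO_PARAM1 ever reaches the linker.
   If the #defines were missing, a permissive C compiler would treat
   DEMO_PARAM1(v) as a call to an undeclared function, emit only a
   warning, and leave an unresolved symbol for the linker to reject. */
long demo_stub(long v)
{
    DEMO_PARAM1(v);
    DEMO_RETURN(demo_saved_root + 1);
}
```

In the real stub the equivalent fix would be to include `caml/memory.h` (alongside `caml/mlvalues.h`) at the top of `bytearray_stubs.c`, so that the two macros expand instead of being compiled as function calls.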

vicuna commented Dec 20, 2016

Comment author: alexmarkley

@mshinwell, this patch does not solve this issue.

I will see if I can produce another valgrind trace with the patch in place, but for the moment I am operating under the assumption that the patch did not affect this issue.

vicuna commented Dec 20, 2016

Comment author: @mshinwell

Another valgrind trace taken with that patch applied would be useful, thanks.

vicuna commented Dec 21, 2016

Comment author: alexmarkley

@mshinwell, I was attempting to do precisely that, but now I am somewhat confused.

My first attempt to reproduce the problem with Valgrind + the patch failed with a segfault and complaints about a stack overflow (a completely different issue).

So then I bumped up my stack limit and re-ran the process -- but this time it ran to completion and produced this:

==13604== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==13604== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

So now I'm wondering if my initial report of "not fixed" was a false positive due to a stack overflow. I'm going to need to back up and re-run an experiment or two to reassess.

vicuna commented Dec 22, 2016

Comment author: alexmarkley

@mshinwell, I can confirm that the proposed patch fixes the issue for me!

Thanks everybody.

vicuna commented Dec 22, 2016

Comment author: @mshinwell

OK, I'm going to close this issue then. Please submit the two patches to Unison if you could. Thanks for your help debugging this issue.
