Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007457OCamlOCaml runtime systempublic2017-01-11 08:442017-01-12 11:47
Assigned Toxleroy 
PlatformIntel x86-64 (amd64)OSUbuntu 16.04.1 LTSOS Version16.04.1
Product Version4.04.0 
Target Version4.04.1+devFixed in Version 
Summary0007457: crash in free() logic due to what appears like pointer error in 4.03.0 & 4.04.0
Description(1) I have a rather multithreaded, networked system with many C libs glued in. It's been running correctly on ocaml 4.02.3 for many months.

(2) The failure is of a unit-test runner, so some interprocess comms, lots of crypto, but no actual off-machine networking.

(3) This failure is very, very reproducible. It happens with 4.03.0, 4.04.0 (both byte & native). IT DOES NOT happen with 4.02.3.

If you have any suggestions for how I might help narrow down (e.g., if you have a pointer to where I can get access to the source-code repository, so I could do bisection-search in the code-repo, I'm happy to do that). I have the reproduction process almost-automated, so while bisection will be time-consuming, it will be labor-free.
Steps To ReproduceBuild and run my unit-test. I realize this is a bit vague, and I'm going to work to narrow this down, but I figured I should report it, since it might be known, and also, since it only shows up with ocaml > 4.02.3, that might point to where the problem arises.

Also, I ran both with and without valgrind, on both 4.02.3 and 4.04.0. On 4.02.3, I get no significant valgrind errors (so it's not as if there's a latent error, but not crashing). With 4.04.0, of course I get a failure, and then I run with valgrind to get better crash info.
Additional InformationHere's some information I got from valgrind at the point where the failure occurs. There are many more messages, but there's no point in my sending these in, I suspect, until I can simplify the test. This output came from a run with a native executable.

.==11710== Invalid read of size 1
==11710== at 0x527A6B1: pthread_mutex_destroy (pthread_mutex_destroy.c:30)
==11710== by 0xFD42E0: caml_thread_reinitialize (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0x709C945: fork (fork.c:211)
==11710== by 0xFD5C18: unix_fork (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0xF2315A: camlUnix__system_1846 (
==11710== by 0x7E2890: camlTest_util__reset_unit_2674 (
==11710== by 0x7E2921: camlTest_util__setup_unit_dir_2888 (
==11710== by 0x80118B: camlPeer0_test_utils__master_test_inner_69265 (
==11710== by 0x8260CB: camlOUnitRunner__fun_2459 (
==11710== by 0x81B97F: camlOUnitTest__section_ctxt_1435 (
==11710== by 0x825F70: camlOUnitRunner__run_one_test_1382 (
==11710== by 0x826281: camlOUnitRunner__iter_1582 (
==11710== Address 0x7de6110 is 16 bytes inside a block of size 40 free'd
==11710== at 0x4C2EDEB: free (in /usr/lib/valgrind/
==11710== by 0xFE6358: caml_finalize_channel (io.c:398)
==11710== by 0xFE0645: caml_empty_minor_heap (minor_gc.c:381)
==11710== by 0xFE0AB5: caml_gc_dispatch (minor_gc.c:438)
==11710== by 0xFE1C40: caml_alloc_string (alloc.c:100)
==11710== by 0xF3B025: camlBuffer__resize_1232 (
==11710== by 0xF3B15B: camlBuffer__add_substring_1242 (
==11710== by 0xF5A314: camlFormat__format_pp_token_1355 (
==11710== by 0xF5A905: camlFormat__advance_loop_1394 (
==11710== by 0xF5A969: camlFormat__advance_left_1400 (
==11710== by 0x82524D: camlOUnitDiff__fun_1812 (
==11710== by 0xF35C2C: camlSet__iter_1430 (
==11710== Block was alloc'd at
==11710== at 0x4C2DB8F: malloc (in /usr/lib/valgrind/
==11710== by 0xFD3AC2: st_mutex_create (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0xFD3DE7: caml_io_mutex_lock (in /home/chet/Hack/IL/infraledger/src/il-replica/_build/peer0_unit_tests.native)
==11710== by 0xFE748D: caml_ml_output_bytes (io.c:660)
==11710== by 0xF276FE: camlPervasives__output_1206 (
==11710== by 0x962B51: camlTFramedTransport__fun_2149 (
==11710== by 0x911DCD: camlInvoker__fun_3330 (
==11710== by 0x7BFAB5: camlReplicaDB__thunk_10433 (
==11710== by 0xD67DEC: camlCore_kernel__Exn__protectx_3197 (
==11710== by 0x7C02EB: camlReplicaDB__fun_62944 (
==11710== by 0xD67DEC: camlCore_kernel__Exn__protectx_3197 (
==11710== by 0xD67DEC: camlCore_kernel__Exn__protectx_3197 (
TagsNo tags attached.
Attached Files? file icon foo [^] (214,639 bytes) 2017-01-11 08:48 [Show Content]
patch file icon 001-harden-io-mutex-free.patch [^] (542 bytes) 2017-01-11 13:57 [Show Content]

- Relationships

-  Notes
chetmurthy (reporter)
2017-01-11 08:48

I just uploaded a file containing the backtraces and such from a run without valgrind.
yallop (developer)
2017-01-11 09:34

The OCaml sources are available on GitHub: [^]
dra (developer)
2017-01-11 10:21

Te merge-base of 4.03 and trunk 2449d2fc7daf802cdb09fe6440cfb6965005c710 and b12f7430423d0fb115a0ee551c2009e6094f (which should be the commit on trunk where 4.02.x is fully merged into trunk) might be good one-off tests to hone on where bisection is needed.
xleroy (administrator)
2017-01-11 14:04

Hello Chet,

Thanks for the very precise valgrind trace, it was really informative.

Based on this trace, I attached a plausible fix for this issue, see file 001-harden-io-mutex-free.patch. I hope it applies cleanly to 4.03 and 4.04. Feel free to test again.

Plausible explanation: since 4.03, buffered IO channels can remain in the list of opened channels even after they have been "half-closed" by finalization. Those channels will be considered by caml_thread_reinitialize when a Unix.fork() is performed in a multithreaded program. That's where pthread_mutex_destroy() is called on a mutex that has already been destroyed as part of the half-closing.

Also: there may be a logic error in your program whereas it forgets to close (or at least flush) a out_channel before the channel becomes unreachable. This you can check for by running your program with run-time warnings enabled.
chetmurthy (reporter)
2017-01-11 19:29

Xavier, I tried your patch just now with ocaml version 4.03.0, and my test passes without problem. Obviously with this sort of thing, one must be careful b/c so much depends on determinism, so I'd like to leave this issue open, but I'll revisit in a week, and if it hasn't recurred, I suspect we can close this issue.

Thank you so much for so quickly coming up with a diagnosis and fix! I didn't even get a chance to bisect the ocaml repository!

BTW, if you let me know when this patch gets checked-into Git, I'd be happy to test my stuff against the git repo.
xleroy (administrator)
2017-01-12 11:46

I pushed the tentative fix to trunk, commit [trunk 2011591]. Will probably push it to 4.04 bugfix branch later.

- Issue History
Date Modified Username Field Change
2017-01-11 08:44 chetmurthy New Issue
2017-01-11 08:48 chetmurthy File Added: foo
2017-01-11 08:48 chetmurthy Note Added: 0017144
2017-01-11 09:34 yallop Note Added: 0017145
2017-01-11 10:21 dra Note Added: 0017149
2017-01-11 13:57 xleroy File Added: 001-harden-io-mutex-free.patch
2017-01-11 14:04 xleroy Note Added: 0017152
2017-01-11 14:04 xleroy Status new => feedback
2017-01-11 14:04 xleroy Target Version => 4.04.1+dev
2017-01-11 19:29 chetmurthy Note Added: 0017153
2017-01-11 19:29 chetmurthy Status feedback => new
2017-01-12 11:46 xleroy Note Added: 0017156
2017-01-12 11:47 xleroy Assigned To => xleroy
2017-01-12 11:47 xleroy Status new => assigned

Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker