Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weak hash of serialized closures #5942

Closed
vicuna opened this issue Mar 11, 2013 · 9 comments
Closed

Weak hash of serialized closures #5942

vicuna opened this issue Mar 11, 2013 · 9 comments

Comments

@vicuna
Copy link

vicuna commented Mar 11, 2013

Original bug ID: 5942
Reporter: bvaugon
Status: confirmed (set by @damiendoligez on 2017-10-05T15:45:39Z)
Resolution: open
Priority: normal
Severity: major
Version: 4.00.1
Category: runtime system and C interface
Tags: patch

Bug description

Usually, when you serialize a closure with one program, it is impossible to unserialize it with a different program, and you expect to obtain a runtime exception like: Failure("input_value: unknown code module FAE2E4BE7A3AE0091CF3043126B2CC65")

But, when two programs differs only by their data segments (see the attached example), it is possible to marshal a closure with the first program and to unmarshal it with the second program. Obviously, if you try to call the invalid unserialized closue, execution results in the famous "segmentation fault".

This bug is reproducible with ocamlc and ocamlopt.

The problem is that the hash sum stored in the marshalled closure is only computed with the code segment and not with the data segment.

I wrote a patch to fix it.

Steps to reproduce

With attached x.ml and y.ml, run:

ocamlopt x.ml -o x
ocamlopt y.ml -o y
./x
Do not import "data", ok
./y
Segmentation fault (core dumped)

Additional information

Fixed by the attached patch: ocaml-4.00.1-data-marsh.diff.

Remark: dynlink is ok because the entire dynlinked files are hashed.
So, I just modify byterun and asmrun.

File attachments

@vicuna
Copy link
Author

vicuna commented Mar 11, 2013

Comment author: @gasche

I took the liberty to upload the patch independently from the archive, for eventual reviewers that would be more comfortable having a look directly inside Mantis.

I could reproduce the bug with 3.12.1, 32 bits, but not 4.00.1 64 bits: as the data fields should be of the same size (or at least aligned), on a 64 bits machine you need to shorten the list in y.ml to two integers rather than four to observe the segfault.

@vicuna
Copy link
Author

vicuna commented Jul 22, 2013

Comment author: @alainfrisch

Important, but too risky to merge just before a release, I'd say.

@vicuna
Copy link
Author

vicuna commented Nov 30, 2015

Comment author: @alainfrisch

I've started to review the code.

  • In byterun/startup.c, function caml_main: if I'm not wrong, the patch has the effect of loading the DATA section twice (once to compute the digest, and once to deserialize the value). Instead, one could read once into caml_data and use caml_input_value_from_block to deserialize directly from this block.

  • caml_startup_code (i.e. "ocamlc -custom") should be adapted as well. It receives the data block as an argument, so this should be quite easy.

Benoit: would you be as kind as to submit a Github PR rebased on trunk (including changes for the remarks above), trying to add a test to the suite (preferably one that would fail before including on 64-bit)?

@vicuna
Copy link
Author

vicuna commented Dec 4, 2015

Comment author: bvaugon

Thanks for your remarks.

I just submit a PR on github (see #330). This PR is for the trunk and contains tests for 32 and 64 bits.

  • The double reading of the DATA segment is now fixed.

  • However, the caml_startup_code function from byterun/startup.c seems to already set the caml_data and caml_data_size variables in the original patch from mantis. So I didn't change that code in the GitHub PR. Maybe I missed something from your second remark... ?

@vicuna
Copy link
Author

vicuna commented Dec 4, 2015

Comment author: @alainfrisch

However, the caml_startup_code function from byterun/startup.c seems to already set the caml_data and caml_data_size variables in the original patch from mantis.

Indeed, sorry!

@vicuna
Copy link
Author

vicuna commented Dec 8, 2016

Comment author: @mshinwell

Downgrading to high priority.

@bvaugon Have you made any further progress on this problem? #330 and #332 have both stalled.

@vicuna
Copy link
Author

vicuna commented Feb 16, 2017

Comment author: @xavierleroy

For reference: latest proposal at #332

@vicuna
Copy link
Author

vicuna commented Oct 9, 2017

Comment author: @xavierleroy

This report and the associated Github pull requests have been dormant for a long time. What should we do? I still believe that my proposal is the most reasonable: use an MD5 checksum of the file denoted by Sys.executable_name, computed at run-time the first time we need it. MD5 is fast, don't worry.

xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 6, 2019
This tests marshaling and unmarshaling of function closures
between two programs that have the same code area but differ
in their data areas.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 6, 2019
This addresses issue ocaml#5942 in the case of bytecode executables.

The checksum associated with the main code fragment covers both
the bytecode and the global data, as read off the bytecode executable.

This way, programs that differ only by their data are considered
incompatible with respect to marshaled function closures.

Also: refactor code between the two bytecode startup functions
caml_main and caml_startup_code_exn.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 8, 2019
This addresses issue ocaml#5942 in the case of bytecode executables.

The checksum associated with the main code fragment covers both
the bytecode and the global data, as read off the bytecode executable.

This way, programs that differ only by their data are considered
incompatible with respect to marshaled function closures.

Also: refactor code between the two bytecode startup functions
caml_main and caml_startup_code_exn.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 8, 2019
This addresses issue ocaml#5942 in the case of native-code compilation.

The checksum associated with the main code fragment is the digest of
the executable file.  A fortiori it includes the code and the initial
global data of the executable. It also includes less desirable stuff
such as debug info, causing the checksum to vary more than absolutely
necessary, but this is still safe.

The checksum could be computed at program start-up, but this might be
too costly for a rarely-used feature (marshaling of code pointers).
Instead, we wait until the checksum is needed for the first time
to read the executable file and compute its MD5 digest.
To avoid problems with relative paths and chdir(), the executable
file is opened at program start-up and the corresponding file descriptor
is kept for later, on-demand computation of the MD5 digest.

The in-memory representation of code fragments was extended and cleaned
up to
1- include this "lazy computation of digest from file contents" method,
2- factor out the code that computes the digests on demand
   (function `caml_digest_code_fragment` in runtime/intern.c)

The previous "lazy computation of digest from code area" method
remains available and is still used in some cases relative to
the toplevel.  It may not be safe to digest only the code and
not associated data.  On the other hand it is not obvious what
data is associated with that code.  This needs further review.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 14, 2019
This addresses issue ocaml#5942 in the case of native-code compilation.

The checksum associated with the main code fragment is the digest of
the executable file.  A fortiori it includes the code and the initial
global data of the executable. It also includes less desirable stuff
such as debug info, causing the checksum to vary more than absolutely
necessary, but this is still safe.

The checksum could be computed at program start-up, but this might be
too costly for a rarely-used feature (marshaling of code pointers).
Instead, we wait until the checksum is needed for the first time
to read the executable file and compute its MD5 digest.
To avoid problems with relative paths and chdir(), the executable
file is opened at program start-up and the corresponding file descriptor
is kept for later, on-demand computation of the MD5 digest.

The in-memory representation of code fragments was extended and cleaned
up to
1- include this "lazy computation of digest from file contents" method,
2- factor out the code that computes the digests on demand
   (function `caml_digest_code_fragment` in runtime/intern.c)

The previous "lazy computation of digest from code area" method
remains available and is still used in some cases relative to
the toplevel.  It may not be safe to digest only the code and
not associated data.  On the other hand it is not obvious what
data is associated with that code.  This needs further review.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 14, 2019
This tests marshaling and unmarshaling of function closures
between two programs that have the same code area but differ
in their data areas.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 14, 2019
This addresses issue ocaml#5942 in the case of bytecode executables.

The checksum associated with the main code fragment covers both
the bytecode and the global data, as read off the bytecode executable.

This way, programs that differ only by their data are considered
incompatible with respect to marshaled function closures.

Also: refactor code between the two bytecode startup functions
caml_main and caml_startup_code_exn.
xavierleroy added a commit to xavierleroy/ocaml that referenced this issue Apr 14, 2019
This addresses issue ocaml#5942 in the case of native-code compilation.

The checksum associated with the main code fragment is the digest of
the executable file.  A fortiori it includes the code and the initial
global data of the executable. It also includes less desirable stuff
such as debug info, causing the checksum to vary more than absolutely
necessary, but this is still safe.

The checksum could be computed at program start-up, but this might be
too costly for a rarely-used feature (marshaling of code pointers).
Instead, we wait until the checksum is needed for the first time
to read the executable file and compute its MD5 digest.
To avoid problems with relative paths and chdir(), the executable
file is opened at program start-up and the corresponding file descriptor
is kept for later, on-demand computation of the MD5 digest.

The in-memory representation of code fragments was extended and cleaned
up to
1- include this "lazy computation of digest from file contents" method,
2- factor out the code that computes the digests on demand
   (function `caml_digest_code_fragment` in runtime/intern.c)

The previous "lazy computation of digest from code area" method
remains available and is still used in some cases relative to
the toplevel.  It may not be safe to digest only the code and
not associated data.  On the other hand it is not obvious what
data is associated with that code.  This needs further review.
xavierleroy added a commit that referenced this issue Apr 19, 2019
This tests marshaling and unmarshaling of function closures
between two programs that have the same code area but differ
in their data areas.

The test is disabled on ARM because in ARM Thumb2, code pointers have
the low bit set, thus look like integers during marshaling of
closures.  This is wrong and causes the "closures" test to fail.
The issue should go away once we have a "parsable" in-memory format for
function closures.  In the meantime let's just disable this test on ARM.
xavierleroy added a commit that referenced this issue Apr 19, 2019
This addresses issue #5942 in the case of bytecode executables.

The checksum associated with the main code fragment covers both
the bytecode and the global data, as read off the bytecode executable.

This way, programs that differ only by their data are considered
incompatible with respect to marshaled function closures.

Also: refactor code between the two bytecode startup functions
caml_main and caml_startup_code_exn.
xavierleroy added a commit that referenced this issue Apr 19, 2019
This addresses issue #5942 in the case of native-code compilation.

The checksum associated with the main code fragment is the digest of
the executable file AND of the bytes in the code area.  A fortiori it
includes the code and the initial global data of the executable. It
also includes less desirable stuff such as debug info, causing the
checksum to vary more than absolutely necessary, but this is still
safe.

Besides the executable file, we also include the bytes in the code
area because it could be that the executable file is just a loader for
a shared object that contains the actual OCaml native code.  In this
case, hashing the executable file provides no useful identifier for
the OCaml code, and it would be a regression compared with the old
behavior (of hashing the bytes in the code area).

For this reason, let's hash BOTH the executable file AND the bytes
in the code area.  This avoids the regression in the weird case above,
while giving a hash that accounts for both code and data in the common case.

The checksum could be computed at program start-up, but this might be
too costly for a rarely-used feature (marshaling of code pointers).
Instead, we wait until the checksum is needed for the first time
to read the executable file and compute its MD5 digest.
To avoid problems with relative paths and chdir(), the executable
file is opened at program start-up and the corresponding file descriptor
is kept for later, on-demand computation of the MD5 digest.

The in-memory representation of code fragments was extended and cleaned
up to
1- include this "lazy computation of digest from file contents" method,
2- factor out the code that computes the digests on demand
   (function `caml_digest_code_fragment` in runtime/intern.c)

The previous "lazy computation of digest from code area" method
remains available and is still used in some cases relative to
the toplevel.  It may not be safe to digest only the code and
not associated data.  On the other hand it is not obvious what
data is associated with that code.  This needs further review.
@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant