Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCaml 4.04, 4.05-rc1 incompatible with snapshot binutils, when built with -fPIC, results in crashes #7585

Closed
vicuna opened this issue Jul 14, 2017 · 19 comments

Comments

@vicuna
Copy link

vicuna commented Jul 14, 2017

Original bug ID: 7585
Reporter: xnox
Assigned to: @mshinwell
Status: resolved (set by @mshinwell on 2017-08-07T15:31:32Z)
Resolution: duplicate
Priority: high
Severity: block
Platform: arm64
OS: Ubuntu
OS Version: 17.10/devel
Version: 4.05.0
Target version: 4.06.0 +dev/beta1/beta2/rc1
Fixed in version: 4.06.0 +dev/beta1/beta2/rc1
Category: back end (clambda to assembly)
Has duplicate: #7602
Monitored by: @gasche @glondu "Richard Jones"

Bug description

OCaml 4.04, 4.05-rc1 incompatible with snapshot binutils, when built with -fPIC, results in crashes

When configuring and building ocaml with -fPIC option to configure, it is miscompiled and results in test-suite crashes, and subsequent miscompilation of other ocaml packages.

This is on arm64 only.

See attached buildlog

Steps to reproduce

  • Use a recent snapshot of binutils trunk, e.g. 2.28.90.20170704-0ubuntu1
  • Use a recent snapshot of ocaml, e.g. 4.05-rc1
  • Use arm64 platform
  • Run configure with -fPIC option
  • Observe reloaction errors in the build log
  • Observe failing test suite

Additional information

binutils on arm64 has been improved to be more strict with PIC code, which may be resulting in these errors / bugs. Please see the below mailing list thread.

https://sourceware.org/ml/binutils/2017-06/msg00226.html

File attachments

@vicuna
Copy link
Author

vicuna commented Jul 18, 2017

Comment author: Jiong Wang

I have received an bug report on this. Quick look at the build log, it looks to me -fPIC was not specified when building the object unix.o?

The R_AARCH64_ADR_PREL_PG_HI21 relocation should have not been generated by compiler if -fPIC was specified.

@vicuna
Copy link
Author

vicuna commented Jul 18, 2017

Comment author: xnox

Both unix.o and unix.pic.o are built with -fPIC

/ocaml-4.05.0rc1$ ./debian/rules build | grep 'unix.c'
gcc -O2 -fno-strict-aliasing -fwrapv -D_FILE_OFFSET_BITS=64 -D_REENTRANT -fPIC -DCAML_NAME_SPACE -Wall -c unix.c
gcc -O2 -fno-strict-aliasing -fwrapv -D_FILE_OFFSET_BITS=64 -D_REENTRANT -fPIC -DCAML_NAME_SPACE -Wall -fPIC -c -ounix.pic.o unix.c

Both objects appear to contain something like that:

/ocaml-4.05.0rc1$ readelf -W -a byterun/unix.o byterun/unix.pic.o | grep R_AARCH64_ADR_PREL_PG
0000000000000274 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 8
0000000000000278 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 0
0000000000000330 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 10
000000000000039c 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 18
00000000000004e4 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 20
0000000000000274 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 8
0000000000000278 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 0
0000000000000330 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 10
000000000000039c 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 18
00000000000004e4 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 20

Is this then a compiler bug? binutils bug? unix.c bug?

@vicuna
Copy link
Author

vicuna commented Jul 18, 2017

Comment author: xnox

binutils 2.28.90.20170718-0ubuntu1
gcc (Ubuntu/Linaro 6.4.0-1ubuntu2) 6.4.0 20170704

@vicuna
Copy link
Author

vicuna commented Jul 18, 2017

Comment author: Jiong Wang

Hi xnox,

Those are PC relative relocations against local section symbol which is fine under PIC, as they will be fully resolved during static linking. (And you could see what linker is complaining is with different symbol names)

What Binutils 2.29 AArch64 linker tightened is actually using pc relative relocation against global symbol defined in the same object with the relocated instructions as it is possible they will be overridden that the instruction sequences won't be position independent.

The linker would expect those symbol are marked as .hidden explicitly if they won't be overridden.

After searching ocaml source code tree, I suspect

asmcomp/arm64/emit.mlp

let emit_load_symbol_addr dst s =
if (not !Clflags.dlcode) || Compilenv.symbol_in_current_unit s then begin
adrp {emit_reg dst}, {emit_symbol s}\n;
add {emit_reg dst}, {emit_reg dst}, #:lo12:{emit_symbol s}\n
end else begin
adrp {emit_reg dst}, :got:{emit_symbol s}\n;
ldr {emit_reg dst}, [{emit_reg dst}, #:got_lo12:{emit_symbol s}]\n
end

(there are several other places using adrp, the issue may caused by those parts as well)

I feel for PIC, you want to always use the sequences in the "else" part. I am not familiar with ocaml, I may be wrong.

@vicuna
Copy link
Author

vicuna commented Jul 26, 2017

Comment author: infinity0

In the interests of maybe pushing things along a bit, I have written a straw-man patch. It gets rid of the relocation errors, and the build succeeds, but the tests segfault as soon as they start running.

I'm totally unfamiliar with arm64 so I have no idea what the patch is actually doing, but the general idea was to take the existing PIC/non-PIC code examples in asmrun/arm64.S (the {ADDR,LOAD,STORE}GLOBAL macros), and apply this to the parts of the code in emit.mlp that did stuff using the "adrp" instruction.

If someone more familiar with arm64 assembly could take a look and maybe spot tweaks that could fix the segfaults, that would be great! If not, we might have to temporarily disable ocamlopt support in Debian (and probably Ubuntu) arm64.

@vicuna
Copy link
Author

vicuna commented Jul 26, 2017

Comment author: Jiong Wang

Hi Infinity0,

One thing looks suspicious is

adrp {emit_reg reg_tmp1}, :got:{emit_symbol_offset s ofs}\n;

it is emit_symbol_offset, so it will be in the format "symbol+offset", this will
work when using ADRP instruction to get the address directly, but I feel the
offset part will lost when using it with :got: directive.

The other thing is I assume those assembly auto-generated are supposed to be
referenced inside single module and won't be referenced by external code, so
could you please keep existing code as is while just generating one extra line, something like
.hidden {emit_symbols/label THE-NAME}\n before those ADRP instructions?

I feel it will work and can reduce object size.

@vicuna
Copy link
Author

vicuna commented Jul 27, 2017

Comment author: infinity0

"the patch" = arm64_ocaml_pic_straw-man-2.patch

I've done as you said and cut down the patch to only emit extra .hidden lines, in 4 places.There are 5 uses of "adrp" in that file; if we add it to the "else" clause of emit_load_symbol_addr then we get "undefined reference", so the patch omits this.

unix.a(unix.o): In function camlUnix__302': :(.data+0x24a8): undefined reference to caml_sys_exit'

(It seems it's necessary to do a clean-rebuild every time I edit emit.mlp, which is extremely frustrating. "make opt" by itself seems not to work correctly - for example if I add .hidden into the "else" clause, I don't get these "undefined reference" error, until after I do "make clean" then build it again.)

Anyway, the patch makes the relocation errors go away, and there are no segfaults, but the failing test remains (same as xnox reported, tests/lib-dynlink-native/main). The diff between the expected and actual output is this:

--- reference 2017-07-22 18:39:47.788532745 +0000
+++ result 2017-07-27 12:28:16.616658517 +0000
@@ -1,30 +1,7 @@
Loading plugin.so
-Registering module Plugin
-COUCOU
+Dynlink error: error loading shared library: /home/infinity0/ocaml/testsuite/tests/lib-dynlink-native/plugin.so: undefined symbol: camlRandom
Loading plugin2.so
-Registering module Plugin2
-1
-2
-6
-1
-XXX
+Dynlink error: error loading shared library: /home/infinity0/ocaml/testsuite/tests/lib-dynlink-native/plugin2.so: undefined symbol: camlPervasives
Loading plugin_thread.so
-Registering module Plugin_thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Callback from plugin2
-Callback from plugin
+Dynlink error: error loading shared library: /home/infinity0/ocaml/testsuite/tests/lib-dynlink-native/plugin_thread.so: undefined symbol: camlPervasives
Callback from main

This looks similar to the "undefined reference" errors I described above, so then I tried amending the patch to wrap the .hidden lines like this:

( if Compilenv.symbol_in_current_unit s then .hidden {emit_symbol s}\n );

I could only do this 3/4 of the time, the other one is a emit_label and I couldn't find the corresponding "in_current_unit" function for that.

But unfortunately this does not fix things, and the failure remains.

@vicuna
Copy link
Author

vicuna commented Jul 28, 2017

Comment author: Jiong Wang

Hi Infinity0,

I have setup the Ocaml environment on AArch64, and am able to reproduce all
above issues.

For your first patch arm64_ocaml_pic_straw-man-1.patch:

I don't know why it doesn't work. The reason looks like GOT based indirect
access now is generated for some non-global symbol as well. For these
symbols, the content of in GOT table are initialized to the address we have
at static link stage which somehow doesn't work. There is no any other
runtime relocation generated. For global symbol, we typically will generate
R_AARCH64_GLOB_DAT, which allows dynamic linker to fill the final runtime
address of that symbol in the GOT table.

So, I modified your patch in two places:

1. As described at comment 0018139, I have changed

    ` adrp {emit_reg reg_tmp1}, :got:{emit_symbol_offset s ofs}\n`;

  into:

    ` adrp {emit_reg reg_tmp1}, :got:{emit_symbol s}\n`;

  Then I added the offset at the second part of the address formation,
  something like:

      (* second line of LOADGLOBAL from arm64.S *)
      if !Clflags.pic_code then
        `[{emit_reg r}, {emit_pure_offset ofs}]`

2. I have added one extra line before all place of "adrp...:got:..".

  `.global {emit_symbol s}`;

After these two change, the builds are OK and make tests are OK.

As this patch is using GOT based indirect addressing for all pic code, there
is sligly codesize increase.

For your second patch arm64_ocaml_pic_straw-man-2.patch:

I feel this is the correct approach.

The reason for the lib-dynlink-native test errors is because some symbols
defined in ocamlopt.opt (an executable) is reference by extern plugin.

Marking all symbol as ".hidden" is overkilling.

Actually for a global symbol defined in executable, it won't be overridden, so
linker won't generate warning if it's used by pc-relative relocation under pic
mode.

I think what we really want is your arm64_ocaml_pic_straw-man-2.patch, but
changing

let emit_load_symbol_addr dst s =
if (not !Clflags.dlcode) || Compilenv.symbol_in_current_unit s then begin
+    `  .hidden {emit_symbol s}\n`;

into something like

let emit_load_symbol_addr dst s =
if (not !Clflags.dlcode) || Compilenv.symbol_in_current_unit s then begin
   if SYMBOL_IS_DEFINED_IN_CODE_GOES_TO_SO_LIBRARY then begin
   +    `  .hidden {emit_symbol s}\n`;
   end

However, it seems we can't know this in Ocaml after some quick in the code base.

So, I have tries another approach:

1. apply arm64_ocaml_pic_straw-man-2.patch,
2. tighen the function "symbol_in_current_unit", by removing "name = prefix".
   I think symbol with "__" will absolutely only referenced within the same
   unit.

   In asmcomp/compilenv.ml:

   let symbol_in_current_unit name =
     let prefix = "caml" ^ current_unit.ui_symbol in
     -  name = prefix ||

This approach builds OK and make tests OK as well without the overhead of a few direct address access are turned into indirect access. But I noticed the generated ocamlopt.opt is actually smaller than the original code built by old binutils. I guess it's result of using ".hidden" on most of the symbols.

@vicuna
Copy link
Author

vicuna commented Jul 29, 2017

Comment author: infinity0

Hey, thank you that worked! I can confirm your amendment to -2.patch works and all the tests pass. (I didn't check -1.patch yet). I'll upload this to Debian for now and we'll see how it does compiling other ocaml packages.

Actually, the reason why it works is a bit subtle and I don't understand it fully yet:

On a clean 4.05.0 Debian build without .hidden:

| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'camlPervasives$'
| 645: 0000000000a28460 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives
| 36865: 0000000000a28460 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives

(The symbol appears twice, once in .dynsym and once in .symtab, the latter is lost after stripping.)

After applying .hidden (which fixes the relocation warnings), but before applying the "- name = prefix ||" amendment:

| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'camlPervasives$'
| 35784: 00000000009d7740 0 NOTYPE LOCAL DEFAULT 24 camlPervasives
| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'caml[[:alnum:]]*$'
| # in fact 144/148 of these are LOCAL

After applying the amendment:

| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'camlPervasives$'
| 1983: 00000000009d9910 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives
| 41417: 00000000009d9910 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives
| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'caml[[:alnum:]]*$'
| # all are GLOBAL

These symbols ('caml[[:alnum:]]*$') are all GLOBAL on my x86-64 system with ocaml 4.05.0, and they are also all GLOBAL with Debian ocaml-nox_4.05.0-4_arm64.deb [1] which was compiled against binutils 2.28 [2]:

[1] http://snapshot.debian.org/archive/debian/20170722T040820Z/pool/main/o/ocaml/ocaml-nox_4.05.0-4_arm64.deb
[2] https://buildd.debian.org/status/fetch.php?pkg=ocaml&arch=arm64&ver=4.05.0-4&stamp=1500666144&raw=0

So it seems our original extra .hidden annotations, somehow causes some other symbols (that are not marked HIDDEN) to change from GLOBAL to LOCAL, and your later amendment reverts this. This is very surprising to me.

Also I think it is not correct to say that "symbol with "__" will absolutely only referenced within the same unit". For example:

ocaml$ readelf -Ws ./testsuite/tests/lib-dynlink-native/plugin.so | grep camlPervasives
14: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND camlPervasives__print_endline_1310
104: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND camlPervasives__print_endline_1310
ocaml$ readelf -Ws ./ocamlopt.opt | grep camlPervasives__print_endline_1310
8656: 000000000061d450 92 FUNC GLOBAL DEFAULT 13 camlPervasives__print_endline_1310
51689: 000000000061d450 92 FUNC GLOBAL DEFAULT 13 camlPervasives__print_endline_1310

However the tests pass fine, so perhaps the statement is true for all the cases where the code path goes through symbol_in_current_unit.

Anyway, my comments are only for informational purposes to understand the patch better. For now, it seems the patch works so I am satisfied :) I'll proceed with updating OCaml in Debian with it.

Hopefully an OCaml developer can confirm soon that the patch is fully correct. To clarify, it is arm64_ocaml_pic_straw-man-2.patch plus this:

--- a/asmcomp/compilenv.ml
+++ b/asmcomp/compilenv.ml
@@ -161,7 +161,6 @@

let symbol_in_current_unit name =
let prefix = "caml" ^ current_unit.ui_symbol in

  • name = prefix ||
    (let lp = String.length prefix in
    String.length name >= 2 + lp
    && String.sub name 0 lp = prefix

@vicuna
Copy link
Author

vicuna commented Jul 29, 2017

Comment author: @gasche

Would one of you two mind submitting the final version of the patch as github pull request on https://github.com/ocaml/ocaml/, with a link to the current issue? This provides a better interface for review, and it makes it easier to ping the people that should look at this issue.

@vicuna
Copy link
Author

vicuna commented Jul 30, 2017

Comment author: infinity0

Filed at #1268. I'd be grateful if Jiong Wang could take a look, in case I made any mistakes in explaining it.

@vicuna
Copy link
Author

vicuna commented Jul 30, 2017

Comment author: @xavierleroy

Thanks to @xnox and @infinity0 and @jiong Wang for the report, the discussion, and the Github pull request.

Before proceedings further, I'd like to be absolutely sure that this is not a bug in the development version of binutils. When something works with all released version of a software package and suddenly breaks with the development version, it is more often than not a bug in the software package itself, not in its users. So, please, convince me that this is not a binutils bug.

@vicuna
Copy link
Author

vicuna commented Jul 31, 2017

Comment author: Jiong Wang

Hi Xleroy,

The binutils commit that caused this trouble is this one:

https://sourceware.org/ml/binutils/2017-06/msg00171.html

For pc-relative relocation, it is naturally position independent if the
distance to the destination won't change in any case. Previously, I was
thinking for a shared library, if pc-relative relocation is against an symbol
that is defined just inside this .so, then the distance should be fixed, so
we should allow this when generating PIC shared library. And compiler or
assembly writers can utilize this to avoid indirect address access through GOT
table.

However, when fixing some linker bug related with COPY relocation:

https://sourceware.org/ml/binutils/2017-06/msg00164.html
https://sourceware.org/bugzilla/show_bug.cgi?id=21532

I found my above assumption actually is not correctly. The reason is an
global symbol defined in the .so may be overridden by the COPY symbol created
for executable. In general, if an executable is accessing one global data
defined in the .so, it will require the linker to create on local copy for that
executable, and then all access that global data in this process, in both the
executable and the .so will be redirected to that copy of the global data.

So, if in .so, we are allowing pc-relative relocation against global symbol
defined in the .so, then it will be fully resolved with the distance to that
symbol from the relocated instruction. But the reference of that symbol from
executable will be redirected to the COPY version, therefore there will be
inconsistent.

I later found Binutils actually have an macro SYMBOL_REFERENCES_LOCAL which
exactly tells whether the reference to this symbol will always reference the
symbol in the same object. So I committed that Binutils patch to replace some
unreliable manual checks with SYMBOL_REFERENCES_LOCAL. And this change of
course has tighten the requirements and narrowed the scenarios where you can
using pc-relative relocation against global symbols in shared library.

There was similar issue reported by GLIBC community like this one

https://sourceware.org/ml/binutils/2017-06/msg00226.html

It later proves to be AArch64 GLIBC issue where GLIBC generic framework has
mechanisms to encourage the use of .hidden symbol and a few targets has migrated
to that while AArch64 hasn't.

For this Ocaml case, I understand the purpose of using pc-relative addressing,
and I do agree we should use them whenever we can to avoid indirect accessing
through GOT.

But I feel it is only safe to use pc-relative addressing when that symbol in
shard library won't be access by executable. Then, if one global symbol qualify
this, the Ocaml AArch64 backend need to generate a ".hidden" directive to tell
linker that there is guarantee on this.

As global symbol defined in executable won't be overridden, so there is only
need to do this for global symbols defined in shared library. But my quick
search of the code base shows it is impossible to tell if an symbol in defined
in shared library or executable, so I had proposed to change
symbol_in_current_unit with the assumption of symbols with "" in the name will
always supposed to be internal symbol and won't be access by external objects.
If the assumption is correct, I think the proposed change can kill the hiding
bug on inconsistent access when the symbol is overridden although it also kills
a few extra opportunities of using efficient pc-relative access as there will be
cases that symbol without "
" in the name are still not accessed by any
external objects.

@vicuna
Copy link
Author

vicuna commented Aug 7, 2017

Comment author: @mshinwell

Let's take this to the github pull request.

@vicuna vicuna closed this as completed Aug 7, 2017
@vicuna
Copy link
Author

vicuna commented Aug 24, 2017

Comment author: Richard Jones

Since option 2 (#1268) isn't working, I tried to reconstruct option 1 as described in comment #7585#c18143 above. However I wasn't able to work it out exactly. I will attach the patch I tried (which fails in the tests).

Do you happen to have the working option 1?

@vicuna
Copy link
Author

vicuna commented Aug 24, 2017

Comment author: Richard Jones

Ignore the previous comment, I think I've got it working now. I'll update the pull request comment thread.

@vicuna
Copy link
Author

vicuna commented Sep 13, 2017

Comment author: @xavierleroy

Fixed by this pull request: #1330. Will be in 4.06.

@vicuna
Copy link
Author

vicuna commented Oct 25, 2017

Comment author: @gasche

Apparently this fix is instrumental in letting people build and use Coq on their Android phone. Congratulations :-)

@vicuna
Copy link
Author

vicuna commented Oct 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants