Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007585OCamlback end (clambda to assembly)public2017-07-14 11:442017-09-13 19:40
Reporterxnox 
Assigned Toshinwell 
PriorityhighSeverityblockReproducibilityalways
StatusresolvedResolutionduplicate 
Platformarm64OSUbuntuOS Version17.10/devel
Product Version4.05.0 
Target Version4.06.0+devFixed in Version4.06.0+dev 
Summary0007585: OCaml 4.04, 4.05-rc1 incompatible with snapshot binutils, when built with -fPIC, results in crashes
DescriptionOCaml 4.04, 4.05-rc1 incompatible with snapshot binutils, when built with -fPIC, results in crashes

When configuring and building ocaml with -fPIC option to configure, it is miscompiled and results in test-suite crashes, and subsequent miscompilation of other ocaml packages.

This is on arm64 only.

See attached buildlog
Steps To Reproduce* Use a recent snapshot of binutils trunk, e.g. 2.28.90.20170704-0ubuntu1
* Use a recent snapshot of ocaml, e.g. 4.05-rc1
* Use arm64 platform
* Run configure with -fPIC option
* Observe reloaction errors in the build log
* Observe failing test suite

Additional Informationbinutils on arm64 has been improved to be more strict with PIC code, which may be resulting in these errors / bugs. Please see the below mailing list thread.


https://sourceware.org/ml/binutils/2017-06/msg00226.html [^]
TagsNo tags attached.
Attached Files? file icon ocaml_4.05.0~rc1-1_arm64.build.txt.xz [^] (37,164 bytes) 2017-07-14 11:44
patch file icon arm64_ocaml_pic_straw-man-1.patch [^] (3,887 bytes) 2017-07-26 11:01 [Show Content]
patch file icon arm64_ocaml_pic_straw-man-2.patch [^] (1,494 bytes) 2017-07-27 13:35 [Show Content]
patch file icon arm64-ocaml-pic-option-1.patch [^] (3,926 bytes) 2017-08-24 10:49 [Show Content]

- Relationships
has duplicate 0007602resolvedgasche Test suite crashes on aarch64 

-  Notes
(0018089)
Jiong Wang (reporter)
2017-07-18 15:23

I have received an bug report on this. Quick look at the build log, it looks to me -fPIC was not specified when building the object unix.o?

The R_AARCH64_ADR_PREL_PG_HI21 relocation should have not been generated by compiler if -fPIC was specified.
(0018093)
xnox (reporter)
2017-07-18 16:57

Both unix.o and unix.pic.o are built with -fPIC

~/ocaml-4.05.0~rc1$ ./debian/rules build | grep 'unix.c'
gcc -O2 -fno-strict-aliasing -fwrapv -D_FILE_OFFSET_BITS=64 -D_REENTRANT -fPIC -DCAML_NAME_SPACE -Wall -c unix.c
gcc -O2 -fno-strict-aliasing -fwrapv -D_FILE_OFFSET_BITS=64 -D_REENTRANT -fPIC -DCAML_NAME_SPACE -Wall -fPIC -c -ounix.pic.o unix.c

Both objects appear to contain something like that:

~/ocaml-4.05.0~rc1$ readelf -W -a byterun/unix.o byterun/unix.pic.o | grep R_AARCH64_ADR_PREL_PG
0000000000000274 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 8
0000000000000278 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 0
0000000000000330 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 10
000000000000039c 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 18
00000000000004e4 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 20
0000000000000274 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 8
0000000000000278 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 0
0000000000000330 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 10
000000000000039c 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 18
00000000000004e4 0000000600000113 R_AARCH64_ADR_PREL_PG_ 0000000000000000 .rodata.str1.8 + 20

Is this then a compiler bug? binutils bug? unix.c bug?
(0018094)
xnox (reporter)
2017-07-18 16:58

binutils 2.28.90.20170718-0ubuntu1
gcc (Ubuntu/Linaro 6.4.0-1ubuntu2) 6.4.0 20170704
(0018095)
Jiong Wang (reporter)
2017-07-18 18:51
edited on: 2017-07-18 18:59

Hi xnox,

  Those are PC relative relocations against local section symbol which is fine under PIC, as they will be fully resolved during static linking. (And you could see what linker is complaining is with different symbol names)

  What Binutils 2.29 AArch64 linker tightened is actually using pc relative relocation against global symbol defined in the same object with the relocated instructions as it is possible they will be overridden that the instruction sequences won't be position independent.

  The linker would expect those symbol are marked as .hidden explicitly if they won't be overridden.
 
  After searching ocaml source code tree, I suspect

asmcomp/arm64/emit.mlp

let emit_load_symbol_addr dst s =
  if (not !Clflags.dlcode) || Compilenv.symbol_in_current_unit s then begin
    ` adrp {emit_reg dst}, {emit_symbol s}\n`;
    ` add {emit_reg dst}, {emit_reg dst}, #:lo12:{emit_symbol s}\n`
  end else begin
    ` adrp {emit_reg dst}, :got:{emit_symbol s}\n`;
    ` ldr {emit_reg dst}, [{emit_reg dst}, #:got_lo12:{emit_symbol s}]\n`
  end

(there are several other places using adrp, the issue may caused by those parts as well)

I feel for PIC, you want to always use the sequences in the "else" part. I am not familiar with ocaml, I may be wrong.

(0018138)
infinity0 (reporter)
2017-07-26 11:06
edited on: 2017-07-26 11:08

In the interests of maybe pushing things along a bit, I have written a straw-man patch. It gets rid of the relocation errors, and the build succeeds, but the tests segfault as soon as they start running.

I'm totally unfamiliar with arm64 so I have no idea what the patch is actually doing, but the general idea was to take the existing PIC/non-PIC code examples in asmrun/arm64.S (the {ADDR,LOAD,STORE}GLOBAL macros), and apply this to the parts of the code in emit.mlp that did stuff using the "adrp" instruction.

If someone more familiar with arm64 assembly could take a look and maybe spot tweaks that could fix the segfaults, that would be great! If not, we might have to temporarily disable ocamlopt support in Debian (and probably Ubuntu) arm64.

(0018139)
Jiong Wang (reporter)
2017-07-26 18:12

Hi Infinity0,

One thing looks suspicious is

` adrp {emit_reg reg_tmp1}, :got:{emit_symbol_offset s ofs}\n`;

it is emit_symbol_offset, so it will be in the format "symbol+offset", this will
work when using ADRP instruction to get the address directly, but I feel the
offset part will lost when using it with :got: directive.

The other thing is I assume those assembly auto-generated are supposed to be
referenced inside single module and won't be referenced by external code, so
could you please keep existing code as is while just generating one extra line, something like
`.hidden {emit_symbols/label THE-NAME}\n` before those ADRP instructions?

I feel it will work and can reduce object size.
(0018141)
infinity0 (reporter)
2017-07-27 14:30
edited on: 2017-07-27 14:31

"the patch" = arm64_ocaml_pic_straw-man-2.patch

I've done as you said and cut down the patch to only emit extra .hidden lines, in 4 places.There are 5 uses of "adrp" in that file; if we add it to the "else" clause of emit_load_symbol_addr then we get "undefined reference", so the patch omits this.

unix.a(unix.o): In function `camlUnix__302':
:(.data+0x24a8): undefined reference to `caml_sys_exit'

(It seems it's necessary to do a clean-rebuild every time I edit emit.mlp, which is extremely frustrating. "make opt" by itself seems not to work correctly - for example if I add .hidden into the "else" clause, I don't get these "undefined reference" error, until after I do "make clean" then build it again.)

Anyway, the patch makes the relocation errors go away, and there are no segfaults, but the failing test remains (same as xnox reported, tests/lib-dynlink-native/main). The diff between the expected and actual output is this:

--- reference 2017-07-22 18:39:47.788532745 +0000
+++ result 2017-07-27 12:28:16.616658517 +0000
@@ -1,30 +1,7 @@
 Loading plugin.so
-Registering module Plugin
-COUCOU
+Dynlink error: error loading shared library: /home/infinity0/ocaml/testsuite/tests/lib-dynlink-native/plugin.so: undefined symbol: camlRandom
 Loading plugin2.so
-Registering module Plugin2
-1
-2
-6
-1
-XXX
+Dynlink error: error loading shared library: /home/infinity0/ocaml/testsuite/tests/lib-dynlink-native/plugin2.so: undefined symbol: camlPervasives
 Loading plugin_thread.so
-Registering module Plugin_thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Thread
-Callback from plugin2
-Callback from plugin
+Dynlink error: error loading shared library: /home/infinity0/ocaml/testsuite/tests/lib-dynlink-native/plugin_thread.so: undefined symbol: camlPervasives
 Callback from main

This looks similar to the "undefined reference" errors I described above, so then I tried amending the patch to wrap the .hidden lines like this:

( if Compilenv.symbol_in_current_unit s then ` .hidden {emit_symbol s}\n` );

I could only do this 3/4 of the time, the other one is a emit_label and I couldn't find the corresponding "in_current_unit" function for that.

But unfortunately this does not fix things, and the failure remains.

(0018143)
Jiong Wang (reporter)
2017-07-28 19:00

Hi Infinity0,

  I have setup the Ocaml environment on AArch64, and am able to reproduce all
above issues.

For your first patch arm64_ocaml_pic_straw-man-1.patch:
===

  I don't know why it doesn't work. The reason looks like GOT based indirect
  access now is generated for some non-global symbol as well. For these
  symbols, the content of in GOT table are initialized to the address we have
  at static link stage which somehow doesn't work. There is no any other
  runtime relocation generated. For global symbol, we typically will generate
  R_AARCH64_GLOB_DAT, which allows dynamic linker to fill the final runtime
  address of that symbol in the GOT table.

  So, I modified your patch in two places:

    1. As described at comment 0018139, I have changed

        ` adrp {emit_reg reg_tmp1}, :got:{emit_symbol_offset s ofs}\n`;

      into:

        ` adrp {emit_reg reg_tmp1}, :got:{emit_symbol s}\n`;

      Then I added the offset at the second part of the address formation,
      something like:

          (* second line of LOADGLOBAL from arm64.S *)
          if !Clflags.pic_code then
            `[{emit_reg r}, {emit_pure_offset ofs}]`

    2. I have added one extra line before all place of "adrp...:got:..".

      `.global {emit_symbol s}`;

   After these two change, the builds are OK and make tests are OK.

   As this patch is using GOT based indirect addressing for all pic code, there
   is sligly codesize increase.

For your second patch arm64_ocaml_pic_straw-man-2.patch:
===

  I feel this is the correct approach.

  The reason for the lib-dynlink-native test errors is because some symbols
  defined in ocamlopt.opt (an executable) is reference by extern plugin.

  Marking all symbol as ".hidden" is overkilling.

  Actually for a global symbol defined in executable, it won't be overridden, so
  linker won't generate warning if it's used by pc-relative relocation under pic
  mode.

  I think what we really want is your arm64_ocaml_pic_straw-man-2.patch, but
  changing

    let emit_load_symbol_addr dst s =
    if (not !Clflags.dlcode) || Compilenv.symbol_in_current_unit s then begin
    + ` .hidden {emit_symbol s}\n`;

  into something like

    let emit_load_symbol_addr dst s =
    if (not !Clflags.dlcode) || Compilenv.symbol_in_current_unit s then begin
       if SYMBOL_IS_DEFINED_IN_CODE_GOES_TO_SO_LIBRARY then begin
       + ` .hidden {emit_symbol s}\n`;
       end

  However, it seems we can't know this in Ocaml after some quick in the code base.

  So, I have tries another approach:

    1. apply arm64_ocaml_pic_straw-man-2.patch,
    2. tighen the function "symbol_in_current_unit", by removing "name = prefix".
       I think symbol with "__" will absolutely only referenced within the same
       unit.

       In asmcomp/compilenv.ml:

       let symbol_in_current_unit name =
         let prefix = "caml" ^ current_unit.ui_symbol in
         - name = prefix ||

   This approach builds OK and make tests OK as well without the overhead of a few direct address access are turned into indirect access. But I noticed the generated ocamlopt.opt is actually smaller than the original code built by old binutils. I guess it's result of using ".hidden" on most of the symbols.
(0018144)
infinity0 (reporter)
2017-07-29 10:38

Hey, thank you that worked! I can confirm your amendment to -2.patch works and all the tests pass. (I didn't check -1.patch yet). I'll upload this to Debian for now and we'll see how it does compiling other ocaml packages.

Actually, the reason why it works is a bit subtle and I don't understand it fully yet:

On a clean 4.05.0 Debian build without .hidden:

| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'camlPervasives$'
| 645: 0000000000a28460 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives
| 36865: 0000000000a28460 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives

(The symbol appears twice, once in .dynsym and once in .symtab, the latter is lost after stripping.)

After applying .hidden (which fixes the relocation warnings), but before applying the "- name = prefix ||" amendment:

| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'camlPervasives$'
| 35784: 00000000009d7740 0 NOTYPE LOCAL DEFAULT 24 camlPervasives
| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'caml[[:alnum:]]*$'
| # in fact 144/148 of these are LOCAL

After applying the amendment:

| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'camlPervasives$'
| 1983: 00000000009d9910 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives
| 41417: 00000000009d9910 0 NOTYPE GLOBAL DEFAULT 24 camlPervasives
| ocaml$ readelf -Ws ./ocamlopt.opt | grep 'caml[[:alnum:]]*$'
| # all are GLOBAL

These symbols ('caml[[:alnum:]]*$') are all GLOBAL on my x86-64 system with ocaml 4.05.0, and they are also all GLOBAL with Debian ocaml-nox_4.05.0-4_arm64.deb [1] which was compiled against binutils 2.28 [2]:

[1] http://snapshot.debian.org/archive/debian/20170722T040820Z/pool/main/o/ocaml/ocaml-nox_4.05.0-4_arm64.deb [^]
[2] https://buildd.debian.org/status/fetch.php?pkg=ocaml&arch=arm64&ver=4.05.0-4&stamp=1500666144&raw=0 [^]

So it seems our original extra .hidden annotations, somehow causes some other symbols (that are *not* marked HIDDEN) to change from GLOBAL to LOCAL, and your later amendment reverts this. This is very surprising to me.

Also I think it is not correct to say that "symbol with "__" will absolutely only referenced within the same unit". For example:

ocaml$ readelf -Ws ./testsuite/tests/lib-dynlink-native/plugin.so | grep camlPervasives
    14: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND camlPervasives__print_endline_1310
   104: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND camlPervasives__print_endline_1310
ocaml$ readelf -Ws ./ocamlopt.opt | grep camlPervasives__print_endline_1310
  8656: 000000000061d450 92 FUNC GLOBAL DEFAULT 13 camlPervasives__print_endline_1310
 51689: 000000000061d450 92 FUNC GLOBAL DEFAULT 13 camlPervasives__print_endline_1310

However the tests pass fine, so perhaps the statement is true for all the cases where the code path goes through symbol_in_current_unit.

Anyway, my comments are only for informational purposes to understand the patch better. For now, it seems the patch works so I am satisfied :) I'll proceed with updating OCaml in Debian with it.

Hopefully an OCaml developer can confirm soon that the patch is fully correct. To clarify, it is arm64_ocaml_pic_straw-man-2.patch plus this:

--- a/asmcomp/compilenv.ml
+++ b/asmcomp/compilenv.ml
@@ -161,7 +161,6 @@
 
 let symbol_in_current_unit name =
   let prefix = "caml" ^ current_unit.ui_symbol in
- name = prefix ||
   (let lp = String.length prefix in
    String.length name >= 2 + lp
    && String.sub name 0 lp = prefix
(0018145)
gasche (developer)
2017-07-29 18:50

Would one of you two mind submitting the final version of the patch as github pull request on <https://github.com/ocaml/ocaml/>, [^] with a link to the current issue? This provides a better interface for review, and it makes it easier to ping the people that should look at this issue.
(0018146)
infinity0 (reporter)
2017-07-30 14:29

Filed at https://github.com/ocaml/ocaml/pull/1268. [^] I'd be grateful if Jiong Wang could take a look, in case I made any mistakes in explaining it.
(0018147)
xleroy (administrator)
2017-07-30 19:04

Thanks to @xnox and @infinity0 and @Jiong Wang for the report, the discussion, and the Github pull request.

Before proceedings further, I'd like to be absolutely sure that this is not a bug in the development version of binutils. When something works with all released version of a software package and suddenly breaks with the development version, it is more often than not a bug in the software package itself, not in its users. So, please, convince me that this is not a binutils bug.
(0018150)
Jiong Wang (reporter)
2017-07-31 22:50

Hi Xleroy,

  The binutils commit that caused this trouble is this one:

    https://sourceware.org/ml/binutils/2017-06/msg00171.html [^]

  For pc-relative relocation, it is naturally position independent if the
distance to the destination won't change in any case. Previously, I was
thinking for a shared library, if pc-relative relocation is against an symbol
that is defined just inside this .so, then the distance should be fixed, so
we should allow this when generating PIC shared library. And compiler or
assembly writers can utilize this to avoid indirect address access through GOT
table.

  However, when fixing some linker bug related with COPY relocation:

    https://sourceware.org/ml/binutils/2017-06/msg00164.html [^]
    https://sourceware.org/bugzilla/show_bug.cgi?id=21532 [^]

  I found my above assumption actually is not correctly. The reason is an
global symbol defined in the .so may be overridden by the COPY symbol created
for executable. In general, if an executable is accessing one global data
defined in the .so, it will require the linker to create on local copy for that
executable, and then all access that global data in this process, in both the
executable and the .so will be redirected to that copy of the global data.

  So, if in .so, we are allowing pc-relative relocation against global symbol
defined in the .so, then it will be fully resolved with the distance to that
symbol from the relocated instruction. But the reference of that symbol from
executable will be redirected to the COPY version, therefore there will be
inconsistent.

  I later found Binutils actually have an macro SYMBOL_REFERENCES_LOCAL which
exactly tells whether the reference to this symbol will always reference the
symbol in the same object. So I committed that Binutils patch to replace some
unreliable manual checks with SYMBOL_REFERENCES_LOCAL. And this change of
course has tighten the requirements and narrowed the scenarios where you can
using pc-relative relocation against global symbols in shared library.

  There was similar issue reported by GLIBC community like this one

    https://sourceware.org/ml/binutils/2017-06/msg00226.html [^]

  It later proves to be AArch64 GLIBC issue where GLIBC generic framework has
mechanisms to encourage the use of .hidden symbol and a few targets has migrated
to that while AArch64 hasn't.

  For this Ocaml case, I understand the purpose of using pc-relative addressing,
and I do agree we should use them whenever we can to avoid indirect accessing
through GOT.

  But I feel it is only safe to use pc-relative addressing when that symbol in
shard library won't be access by executable. Then, if one global symbol qualify
this, the Ocaml AArch64 backend need to generate a ".hidden" directive to tell
linker that there is guarantee on this.

As global symbol defined in executable won't be overridden, so there is only
need to do this for global symbols defined in shared library. But my quick
search of the code base shows it is impossible to tell if an symbol in defined
in shared library or executable, so I had proposed to change
symbol_in_current_unit with the assumption of symbols with "__" in the name will
always supposed to be internal symbol and won't be access by external objects.
If the assumption is correct, I think the proposed change can kill the hiding
bug on inconsistent access when the symbol is overridden although it also kills
a few extra opportunities of using efficient pc-relative access as there will be
cases that symbol without "__" in the name are still not accessed by any
external objects.
(0018164)
shinwell (developer)
2017-08-07 17:31

Let's take this to the github pull request.
(0018195)
Richard Jones (reporter)
2017-08-24 10:49

Since option 2 (https://github.com/ocaml/ocaml/pull/1268 [^]) isn't working, I tried to reconstruct option 1 as described in comment https://caml.inria.fr/mantis/view.php?id=7585#c18143 [^] above. However I wasn't able to work it out exactly. I will attach the patch I tried (which fails in the tests).

Do you happen to have the working option 1?
(0018196)
Richard Jones (reporter)
2017-08-24 11:36

Ignore the previous comment, I think I've got it working now. I'll update the pull request comment thread.
(0018250)
xleroy (administrator)
2017-09-13 19:40

Fixed by this pull request: https://github.com/ocaml/ocaml/pull/1330. [^] Will be in 4.06.

- Issue History
Date Modified Username Field Change
2017-07-14 11:44 xnox New Issue
2017-07-14 11:44 xnox File Added: ocaml_4.05.0~rc1-1_arm64.build.txt.xz
2017-07-18 15:23 Jiong Wang Note Added: 0018089
2017-07-18 16:57 xnox Note Added: 0018093
2017-07-18 16:58 xnox Note Added: 0018094
2017-07-18 18:51 Jiong Wang Note Added: 0018095
2017-07-18 18:59 Jiong Wang Note Edited: 0018095 View Revisions
2017-07-26 11:01 infinity0 File Added: arm64_ocaml_pic_straw-man-1.patch
2017-07-26 11:06 infinity0 Note Added: 0018138
2017-07-26 11:08 infinity0 Note Edited: 0018138 View Revisions
2017-07-26 18:12 Jiong Wang Note Added: 0018139
2017-07-27 13:35 infinity0 File Added: arm64_ocaml_pic_straw-man-2.patch
2017-07-27 14:30 infinity0 Note Added: 0018141
2017-07-27 14:31 infinity0 Note Edited: 0018141 View Revisions
2017-07-28 19:00 Jiong Wang Note Added: 0018143
2017-07-29 10:38 infinity0 Note Added: 0018144
2017-07-29 18:50 gasche Note Added: 0018145
2017-07-30 14:29 infinity0 Note Added: 0018146
2017-07-30 19:04 xleroy Note Added: 0018147
2017-07-30 19:04 xleroy Status new => acknowledged
2017-07-30 19:04 xleroy Target Version => 4.06.0+dev
2017-07-31 22:50 Jiong Wang Note Added: 0018150
2017-08-05 15:00 gasche Relationship added has duplicate 0007602
2017-08-07 17:31 shinwell Note Added: 0018164
2017-08-07 17:31 shinwell Status acknowledged => resolved
2017-08-07 17:31 shinwell Resolution open => duplicate
2017-08-07 17:31 shinwell Assigned To => shinwell
2017-08-24 10:49 Richard Jones Note Added: 0018195
2017-08-24 10:49 Richard Jones File Added: arm64-ocaml-pic-option-1.patch
2017-08-24 11:36 Richard Jones Note Added: 0018196
2017-09-13 19:40 xleroy Note Added: 0018250
2017-09-13 19:40 xleroy Fixed in Version => 4.06.0+dev


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker