Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006468OCamlOCaml generalpublic2014-06-24 14:242015-05-05 16:17
Reporterwhitequark 
Assigned Towhitequark 
PrioritynormalSeverityfeatureReproducibilityalways
StatusresolvedResolutionfixed 
PlatformOSOS Version
Product Version4.02.0+beta1 / +rc1 
Target VersionFixed in Version4.03.0+dev 
Summary0006468: Backtraces in toplevel
DescriptionI just realized there appears to be no way to get backtraces in toplevel. This makes it much less useful. I see there is a patch for 3.11; what's holding its integration?

https://github.com/jaked/ocaml-backtrace-patch/blob/master/patch-3.11.2 [^]
Tagspatch
Attached Filespatch file icon backtrace.patch [^] (50,838 bytes) 2015-01-25 21:04 [Show Content]

- Relationships
has duplicate 0006317resolvedgasche ocamldebug should be able to work with Dynlink code 
related to 0006760resolvedwhitequark Closure marshalling in toplevel is broken since at least 4.00 

-  Notes
(0012897)
whitequark (developer)
2014-12-20 13:03
edited on: 2014-12-20 13:07

Reminder sent to: gasche

@gasche, I've reworked the patch to apply on trunk.

$ ocaml
        OCaml version 4.03.0+dev5-2014-10-15

# let f () = raise Not_found;;
val f : unit -> 'a = <fun>
# let g () = f (); 1;;
val g : unit -> int = <fun>
# g ();;
Exception: Not_found.
Raised at file "//toplevel//", line 1, characters 17-26
Called from file "//toplevel//", line 1, characters 11-15
#

The patch is somewhat complex, so I will explain in detail how it works. Logically, it consists out of four parts:

  * First, the runtime representation of the debug information in backtrace.c is changed, so that new chunks of it, corresponding to newly loaded (and unloaded) code can be added and removed.
  * Second, the startup process is changed so that the debug information, if present in the executable, is loaded at startup and not lazily. This affects startup.c and io.c.
  * Third, the bytecode compiler and the toplevel are changed so as to generate and register the debug information for toplevel phrases. This affects bytecomp/emitcode.ml, bytecomp/meta.ml, toplevel/toploop.ml and toplevel/topdirs.ml. This also requires a bootstrap, as new primitives were added.
  * Fourth, the Dynlink module was extended to also register debug information. This affects otherlibs/dynlink/dynlink.ml.

The changes to the runtime representation of the debug information are as follows. In trunk, debug information is represented by an array of events, represented as C structures. In the original patch, it's a list of event lists, represented as OCaml structures. I implemented a middle ground between these approaches: there is a global linked list of all registered debug information, but every entry is a custom block containing a C structure. This allows to keep the fast backtrace extraction code that uses binary search intact.

The change to the startup process is necessary, as caml_stash_backtrace walks the stack by looking at every successive value and checking whether it points into a known range of bytecode. Thus, either the debug information could be loaded at startup, or the code would have to handle the "main" bytecode range specially. I think the former is cleaner.

The original patch also exported a variant of caml_input_val function that allows to allocate the space for the blocks from the old generation right away. I'm not sure why was this done--I don't see a reason for it--and the compiler can bootstrap itself just fine without it when built with -g (i.e. stressing the relevant code), so I assume it's not necessary.

The rest of the changes are straightforward.

(0012964)
frisch (developer)
2014-12-23 16:12

FWIW, it's only a recent addition that debug info is loaded only once in bytecode. Previously, it was loaded everytime it was requested. And this was really slow: on one big application, the order of magnitude was around 1s to load the debug info. Now that this is done at most once, this is fine. But I'm wondering about doing it non-lazily, since one would have to pay this price at startup, which might be sub-optimal e.g. for command-line applications.
(0012965)
whitequark (developer)
2014-12-23 16:22

I had a variant of patch which did it lazily (when Printexc.record_backtrace true was called), but it added quite a bit of complexity for unclear benefit. It also had awkward semantics in presence of dynlinking.

What about not compiling command-line applications with debug info when this is critical?
(0012966)
frisch (developer)
2014-12-23 16:26

> What about not compiling command-line applications with debug info when this is critical?

I think we always want backtraces to get useful information in case of unexpected errors. I can imagine large application with short-lived processes for which raw performance doesn't really matter but for which a delay on startup will be noticed.
(0012968)
whitequark (developer)
2014-12-23 16:29

But this wouldn't work for *unexpected* errors. If the error is unexpected, then the backtrace will not be computed at all for absence of debugging information, and if you always enable backtraces via Printexc or read the debugging information on first raised exception, the same delay will be present.
(0012969)
gasche (developer)
2014-12-23 16:57

I think the style Alain has in mind is to use raise_notraice if and only if he can prove that the exception will be caught, and always enable debug information. There are two cases of "unexpected errors", the ones where you didn't use _notrace (typically out-of-bounds error), which get traces, and the ones where the proof was or became wrong -- then you're out of luck as you say.
(0012971)
whitequark (developer)
2014-12-23 17:08

@gasche, actually 6317 is not a complete duplicate. The reason ocamldebug can't set breakpoints on dynlinked and otherwise loaded code is that the runtime keeps a duplicate array of all bytecode so that the byte that has been in the place of breakpoint could be restored. Of course, this array does not cover dynlinked code.
(0012972)
gasche (developer)
2014-12-23 17:13

Does it seem unreasonable to ask for a patch that also solves PR#6317, then?
(0012973)
whitequark (developer)
2014-12-23 17:15

It would be unrelated to this issue. The way PR#6317 should be solved is to move the tracking of replaced bytecode to ocamldebug itself; it is not the runtime's job, and it is even easier to implement in ocamldebug.
(0012974)
gasche (developer)
2014-12-23 17:50

I only did a very quick review of the patch so far (I'm not yet in holidays!):

1) the linked-list-traversal logic is duplicated many times; if I understand correctly, for any program counter there is at most one chunk of debuginfo that matches it. You could thus encapsulate almost all the linked-list logic in a single function

  struct debug_info *caml_find_debuginfo(code_t pc)

that returns a debuginfo di such that (di->start <= pc && pc < di->end), or NULL if none exists.

2) The user-interface of control from the toplevel is a bit unclear to me. If I understand correctly:
- the toplevel always consider "-debug" is passed by default (why?)
- but there is a toplevel directive ("#debug true;;", "#debug false;;") to set it explicitly (when do we expect it to be used?)

3) The first diff chunk after
  @@ -200,7 +210,14 @@ let print_out_exception ppf exn outv =
hurts my OCaml sensibility: there should really be a begin..end bracketing between the "then" and the "match".
(0012977)
whitequark (developer)
2014-12-23 18:05

re 2: yes, #debug is probably not useful.

Regarding -g (Clflags.debug) always being enabled: I can't imagine a reason someone would want to explicitly disable generation of debug information in toplevel, and the performance hit is certainly insignificant here. Do you have other opinion?
(0012987)
gasche (developer)
2014-12-25 22:29

I thought of the following approach to re-enable lazy loading: in caml_add_debug_info, store a dummy description of the debug_info chunk in the global list, with just the start and stop position available (events is NULL). In the caml_find_debuginfo function I suggested above, force the actual computation of the debuginfo information on the chunk that corresponds to the program counter.

This would mean that reading the debuginfo in the file is done eagerly, but the rest is done lazily (there is a "relocation" step that is currently done eagerly and could be delayed or done completely on the fly). We could think of even delaying reading the serialized debuginfo by changing the storage format to serialize the stop position in addition to the start position.

Is that the approach you had tried and that was too complex? I haven't implemented it so it's a game of guessing, but I don't think it would make the code too difficult to follow -- the idea is that the structure of the chunk list does not change when the debuginfo is forced.
(0012988)
whitequark (developer)
2014-12-25 22:30

I was thinking of hardcoding a comparison with caml_start_code. I like your suggestion much better and will implement it.
(0012999)
whitequark (developer)
2014-12-29 05:37

I've updated the patch:

  * #debug toplevel directive was removed;
  * toplevel now sets Clflags.debug earlier, so that $ ocaml t.ml also returns proper backtraces, ditto for ocaml init;
  * find_debuginfo was factored out;
  * debug information from the main executable is loaded lazily; debug information from elsewhere is still loaded eagerly, as I do not anticipate this to cause a slowdown.
(0013004)
gasche (developer)
2015-01-03 20:19

(Sorry for the long time before feedback. The happy ignorance of vacation. I probably won't be able to do much in the following month for other reasons.)

I like the general shape of the patch, but here are a few nitpicking comments.

There is a comment in bytecomp/emitcode.mli that is made incorrect by your patch.

I think "lazy_loaded" is a bad name choice (I am not able to tell from the name alone whether a 1 value suggests the lazy loading has already happened.) I would suggest just "loaded", or "already_read" (it's strange that the function using to make a debuginfo "loaded" would be "read_debug_info"; "read" is also reasonable but ambiguous).

There is a consistency problem with the name "debuginfo" that I unfortunately suggested (it is more natural to me): the existing code uses "debug_info" all over the place. I think you should use "debug_info" as well (in find_debug_info), and maybe in the Debug_info_val macro as well (one may or may not like the fact that the two underscores have a different status here, but that's already used in Unsigned_long_val for example).

Finally, I think you could change event_for_location so that you use an early exit (if (di == NULL)\n\treturn NULL;) rather than a long conditional. This removes the "short case last" issue, but more importantly it allows to go back to the indentation level of the previous implementation, which should make for a more readable diff.
(0013010)
whitequark (developer)
2015-01-06 01:49

Addressed.
(0013174)
gasche (developer)
2015-01-24 18:11

Applying the patch to the current trunk and running the testsuite gives me one test failure in testsuite/tests/typing-objects. The failure message is rather worrying:

$ diff Tests.ml.{result,reference}
303,304c303,304
< # Exception: Invalid_argument "output_value: abstract value (outside heap)".
< # Exception: Invalid_argument "output_value: abstract value (outside heap)".
---
> # - : int * int * int * int * int = (6, 7, 8, 33, 33)
> # - : int * int * int * int * int * int * int = (9, 10, 10, 11, 11, 33, 33)

This might be related to other stuff that happened in trunk recently rather than this patch, or maybe a bootstrap error on my part. Can you reproduce the failure on your machine?
(0013175)
gasche (developer)
2015-01-25 14:09

I can reliably reproduce this failure -- it seems to be a problem of the patch.

Minimized test case, run in the toplevel:

(good) with 4.02.1 or trunk before the patch:

# (Marshal.from_string (Marshal.to_string (object method m = 33 end) [Marshal.Closures]) 0);;
- : 'a = <poly>

(bad) with the patch:

# Marshal.from_string (Marshal.to_string (object method m = 33 end) [Marshal.Closures]) 0;;
Exception: Invalid_argument "output_value: abstract value (outside heap)".
(0013176)
whitequark (developer)
2015-01-25 14:20

Ok. I know why the failure happens and will fix soon.
(0013180)
whitequark (developer)
2015-01-25 19:44

After investigation, I admit that I did not, in fact, know the cause of the failure. What puzzles me is the fact that the following code:

    Marshal.to_string (fun () -> ()) [Marshal.Closures];;

fails in toplevel with the same message since 4.00.1. My guess is that the following code:

    Marshal.to_string (object method m = 33 end) [Marshal.Closures]

triggers the same bug, but only after my patch exposes it somehow.

Since the code you highlight works with both ocamlc and ocamlc -g, but reliably fails in toplevel, I'm going to say that I happened to expose a latent bug in extern.c. I'm not sure which one though, and why is it failing intermittently.
(0013182)
whitequark (developer)
2015-01-25 21:04

After applying the patch for 6760 and adjusting the test outputs for the backtraces that appear, all tests pass.
(0013185)
gasche (developer)
2015-01-25 21:50

I find it disturbing that your patch would provoke a change in difference between closures and object marshalling, and would like to understand better what is happening here. I'll try to get opinions from developpers that actually know about these things (here and PR#6760) before considering merge.
(0013186)
whitequark (developer)
2015-01-25 21:59

Yes, I would like to understand that better too. I think there might be more bugs hidden in code that is triggered by object serialization.
(0013223)
jacques-henri.jourdan (manager)
2015-02-04 17:48

Now I understand this: when using this patch, the compiler is instructed to
produce debug information. In particular, in translclass.ml, when not
compiling with debug information, the compiler tries to use predefined
implementations (from CamlinternalOO). Thus, with this patch, the object
contains a code pointer to some code that has been just compiled, while
without the patch it does contain a pointer to a closure from the standard
library. This closure triggers 0006760, which explains why it is magically
appearing.
(0013224)
gasche (developer)
2015-02-04 17:55
edited on: 2015-02-04 17:55

Aha! Congratulations for the mystery solving.

Jacques-Henri, do you have any specific review/comment on 0006760 itself? Now that I understand the issue, I think we're ready to merge both.

(0013236)
gasche (developer)
2015-02-05 13:19

Now that we also have the patch for PR#6760, I'm wondering why we have this duplication of two different tables that basically index information by code fragment intervals. Could we not store the (delayed) ev_info information (the fragment-local set of debug events) in a new field of the caml_code_fragments table?
(0013238)
whitequark (developer)
2015-02-05 15:42

@gasche, not easily, since the caml_code_fragments table is managed in C but caml_debug_info is managed by OCaml GC. It would require a major refactoring and I am not sure what is the value.
(0013255)
gasche (developer)
2015-02-08 15:14

I applied the patch as-is. You changed some tests in the testsuite (exceptions raised from toplevel tests, to add a backtrace), but on my machine the tests are run *without* OCAMLRUNPARAM=b so I reverted those changes to have the test pass.

It may be more robust to decide once and forall whether we want to set OCAMLRUNPARAM=b to run the toplevel tests. Do you always have b=1 in your OCAMLRUNPARAM environment?
(0013256)
whitequark (developer)
2015-02-08 16:15
edited on: 2015-02-08 16:15

Yeah; and as described in http://caml.inria.fr/mantis/view.php?id=6728 [^] it is my view that OCAMLRUNPARAM=b should be the default.

Now that I think about it, didn't I always enable backtraces for the toplevel in this patch? This seems odd...

(0013258)
whitequark (developer)
2015-02-09 07:36
edited on: 2015-02-09 07:36

Ah right--the toplevel does not currently (try to) enable backtraces by default, it only does that for -g. I was confused momentarily.

In my opinion it should. There is little performance reasons not to.

What do you think?

(0013807)
frisch (developer)
2015-05-05 16:17

Commit 16077 fixes a bug introduced with this change: caml_init_debug_info needs to be called from caml_startup_code as well to support 'ocamlc -custom'. (The commit messages wrongly mentions caml_init_code_fragments.)

- Issue History
Date Modified Username Field Change
2014-06-24 14:24 whitequark New Issue
2014-07-16 10:21 doligez Severity minor => feature
2014-07-16 10:21 doligez Status new => acknowledged
2014-07-16 10:21 doligez Tag Attached: patch
2014-12-20 12:35 whitequark Assigned To => whitequark
2014-12-20 12:35 whitequark Status acknowledged => assigned
2014-12-20 12:35 whitequark File Added: backtrace.patch
2014-12-20 13:03 whitequark Note Added: 0012897
2014-12-20 13:03 whitequark File Deleted: backtrace.patch
2014-12-20 13:04 whitequark File Added: backtrace.patch
2014-12-20 13:06 whitequark File Deleted: backtrace.patch
2014-12-20 13:06 whitequark File Added: backtrace.patch
2014-12-20 13:07 whitequark Note Edited: 0012897 View Revisions
2014-12-23 16:12 frisch Note Added: 0012964
2014-12-23 16:22 whitequark Note Added: 0012965
2014-12-23 16:26 frisch Note Added: 0012966
2014-12-23 16:29 whitequark Note Added: 0012968
2014-12-23 16:57 gasche Note Added: 0012969
2014-12-23 17:02 gasche Relationship added related to 0006137
2014-12-23 17:03 gasche Note Added: 0012970
2014-12-23 17:04 gasche Note Deleted: 0012970
2014-12-23 17:05 gasche Relationship deleted related to 0006137
2014-12-23 17:05 gasche Relationship added has duplicate 0006317
2014-12-23 17:08 whitequark Note Added: 0012971
2014-12-23 17:13 gasche Note Added: 0012972
2014-12-23 17:15 whitequark Note Added: 0012973
2014-12-23 17:50 gasche Note Added: 0012974
2014-12-23 18:05 whitequark Note Added: 0012977
2014-12-25 22:29 gasche Note Added: 0012987
2014-12-25 22:30 whitequark Note Added: 0012988
2014-12-29 05:34 whitequark File Deleted: backtrace.patch
2014-12-29 05:34 whitequark File Added: backtraces.patch
2014-12-29 05:37 whitequark Note Added: 0012999
2015-01-03 20:19 gasche Note Added: 0013004
2015-01-06 01:49 whitequark File Deleted: backtraces.patch
2015-01-06 01:49 whitequark File Added: backtrace.patch
2015-01-06 01:49 whitequark Note Added: 0013010
2015-01-24 18:11 gasche Note Added: 0013174
2015-01-25 14:09 gasche Note Added: 0013175
2015-01-25 14:20 whitequark Note Added: 0013176
2015-01-25 17:28 gasche Note Added: 0013178
2015-01-25 17:28 gasche Note Edited: 0013178 View Revisions
2015-01-25 17:28 gasche Note Deleted: 0013178
2015-01-25 19:44 whitequark Note Added: 0013180
2015-01-25 20:14 whitequark Relationship added related to 0006760
2015-01-25 21:04 whitequark File Deleted: backtrace.patch
2015-01-25 21:04 whitequark File Added: backtrace.patch
2015-01-25 21:04 whitequark Note Added: 0013182
2015-01-25 21:50 gasche Note Added: 0013185
2015-01-25 21:59 whitequark Note Added: 0013186
2015-02-04 17:48 jacques-henri.jourdan Note Added: 0013223
2015-02-04 17:55 gasche Note Added: 0013224
2015-02-04 17:55 gasche Note Edited: 0013224 View Revisions
2015-02-05 13:19 gasche Note Added: 0013236
2015-02-05 15:42 whitequark Note Added: 0013238
2015-02-08 15:14 gasche Note Added: 0013255
2015-02-08 15:15 gasche Status assigned => resolved
2015-02-08 15:15 gasche Fixed in Version => 4.03.0+dev
2015-02-08 15:15 gasche Resolution open => fixed
2015-02-08 16:15 whitequark Note Added: 0013256
2015-02-08 16:15 whitequark Note Edited: 0013256 View Revisions
2015-02-09 07:36 whitequark Note Added: 0013258
2015-02-09 07:36 whitequark Note Edited: 0013258 View Revisions
2015-05-05 16:17 frisch Note Added: 0013807


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker