Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backtraces in toplevel #6468

Closed
vicuna opened this issue Jun 24, 2014 · 31 comments
Closed

Backtraces in toplevel #6468

vicuna opened this issue Jun 24, 2014 · 31 comments

Comments

@vicuna
Copy link

vicuna commented Jun 24, 2014

Original bug ID: 6468
Reporter: @whitequark
Assigned to: @whitequark
Status: closed (set by @xavierleroy on 2016-12-07T10:49:08Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 4.02.0+beta1 / +rc1
Fixed in version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Tags: patch
Has duplicate: #6317
Related to: #6760
Monitored by: @gasche @hcarty

Bug description

I just realized there appears to be no way to get backtraces in toplevel. This makes it much less useful. I see there is a patch for 3.11; what's holding its integration?

https://github.com/jaked/ocaml-backtrace-patch/blob/master/patch-3.11.2

File attachments

@vicuna
Copy link
Author

vicuna commented Dec 20, 2014

Comment author: @whitequark

@gasche, I've reworked the patch to apply on trunk.

$ ocaml
OCaml version 4.03.0+dev5-2014-10-15

let f () = raise Not_found;;

val f : unit -> 'a =

let g () = f (); 1;;

val g : unit -> int =

g ();;

Exception: Not_found.
Raised at file "//toplevel//", line 1, characters 17-26
Called from file "//toplevel//", line 1, characters 11-15

The patch is somewhat complex, so I will explain in detail how it works. Logically, it consists out of four parts:

  • First, the runtime representation of the debug information in backtrace.c is changed, so that new chunks of it, corresponding to newly loaded (and unloaded) code can be added and removed.
  • Second, the startup process is changed so that the debug information, if present in the executable, is loaded at startup and not lazily. This affects startup.c and io.c.
  • Third, the bytecode compiler and the toplevel are changed so as to generate and register the debug information for toplevel phrases. This affects bytecomp/emitcode.ml, bytecomp/meta.ml, toplevel/toploop.ml and toplevel/topdirs.ml. This also requires a bootstrap, as new primitives were added.
  • Fourth, the Dynlink module was extended to also register debug information. This affects otherlibs/dynlink/dynlink.ml.

The changes to the runtime representation of the debug information are as follows. In trunk, debug information is represented by an array of events, represented as C structures. In the original patch, it's a list of event lists, represented as OCaml structures. I implemented a middle ground between these approaches: there is a global linked list of all registered debug information, but every entry is a custom block containing a C structure. This allows to keep the fast backtrace extraction code that uses binary search intact.

The change to the startup process is necessary, as caml_stash_backtrace walks the stack by looking at every successive value and checking whether it points into a known range of bytecode. Thus, either the debug information could be loaded at startup, or the code would have to handle the "main" bytecode range specially. I think the former is cleaner.

The original patch also exported a variant of caml_input_val function that allows to allocate the space for the blocks from the old generation right away. I'm not sure why was this done--I don't see a reason for it--and the compiler can bootstrap itself just fine without it when built with -g (i.e. stressing the relevant code), so I assume it's not necessary.

The rest of the changes are straightforward.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @alainfrisch

FWIW, it's only a recent addition that debug info is loaded only once in bytecode. Previously, it was loaded everytime it was requested. And this was really slow: on one big application, the order of magnitude was around 1s to load the debug info. Now that this is done at most once, this is fine. But I'm wondering about doing it non-lazily, since one would have to pay this price at startup, which might be sub-optimal e.g. for command-line applications.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @whitequark

I had a variant of patch which did it lazily (when Printexc.record_backtrace true was called), but it added quite a bit of complexity for unclear benefit. It also had awkward semantics in presence of dynlinking.

What about not compiling command-line applications with debug info when this is critical?

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @alainfrisch

What about not compiling command-line applications with debug info when this is critical?

I think we always want backtraces to get useful information in case of unexpected errors. I can imagine large application with short-lived processes for which raw performance doesn't really matter but for which a delay on startup will be noticed.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @whitequark

But this wouldn't work for unexpected errors. If the error is unexpected, then the backtrace will not be computed at all for absence of debugging information, and if you always enable backtraces via Printexc or read the debugging information on first raised exception, the same delay will be present.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @gasche

I think the style Alain has in mind is to use raise_notraice if and only if he can prove that the exception will be caught, and always enable debug information. There are two cases of "unexpected errors", the ones where you didn't use _notrace (typically out-of-bounds error), which get traces, and the ones where the proof was or became wrong -- then you're out of luck as you say.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @whitequark

@gasche, actually 6317 is not a complete duplicate. The reason ocamldebug can't set breakpoints on dynlinked and otherwise loaded code is that the runtime keeps a duplicate array of all bytecode so that the byte that has been in the place of breakpoint could be restored. Of course, this array does not cover dynlinked code.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @gasche

Does it seem unreasonable to ask for a patch that also solves #6317, then?

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @whitequark

It would be unrelated to this issue. The way #6317 should be solved is to move the tracking of replaced bytecode to ocamldebug itself; it is not the runtime's job, and it is even easier to implement in ocamldebug.

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @gasche

I only did a very quick review of the patch so far (I'm not yet in holidays!):

  1. the linked-list-traversal logic is duplicated many times; if I understand correctly, for any program counter there is at most one chunk of debuginfo that matches it. You could thus encapsulate almost all the linked-list logic in a single function

struct debug_info *caml_find_debuginfo(code_t pc)

that returns a debuginfo di such that (di->start <= pc && pc < di->end), or NULL if none exists.

  1. The user-interface of control from the toplevel is a bit unclear to me. If I understand correctly:
  • the toplevel always consider "-debug" is passed by default (why?)
  • but there is a toplevel directive ("#debug true;;", "#debug false;;") to set it explicitly (when do we expect it to be used?)
  1. The first diff chunk after
    @@ -200,7 +210,14 @@ let print_out_exception ppf exn outv =
    hurts my OCaml sensibility: there should really be a begin..end bracketing between the "then" and the "match".

@vicuna
Copy link
Author

vicuna commented Dec 23, 2014

Comment author: @whitequark

re 2: yes, #debug is probably not useful.

Regarding -g (Clflags.debug) always being enabled: I can't imagine a reason someone would want to explicitly disable generation of debug information in toplevel, and the performance hit is certainly insignificant here. Do you have other opinion?

@vicuna
Copy link
Author

vicuna commented Dec 25, 2014

Comment author: @gasche

I thought of the following approach to re-enable lazy loading: in caml_add_debug_info, store a dummy description of the debug_info chunk in the global list, with just the start and stop position available (events is NULL). In the caml_find_debuginfo function I suggested above, force the actual computation of the debuginfo information on the chunk that corresponds to the program counter.

This would mean that reading the debuginfo in the file is done eagerly, but the rest is done lazily (there is a "relocation" step that is currently done eagerly and could be delayed or done completely on the fly). We could think of even delaying reading the serialized debuginfo by changing the storage format to serialize the stop position in addition to the start position.

Is that the approach you had tried and that was too complex? I haven't implemented it so it's a game of guessing, but I don't think it would make the code too difficult to follow -- the idea is that the structure of the chunk list does not change when the debuginfo is forced.

@vicuna
Copy link
Author

vicuna commented Dec 25, 2014

Comment author: @whitequark

I was thinking of hardcoding a comparison with caml_start_code. I like your suggestion much better and will implement it.

@vicuna
Copy link
Author

vicuna commented Dec 29, 2014

Comment author: @whitequark

I've updated the patch:

  • #debug toplevel directive was removed;
  • toplevel now sets Clflags.debug earlier, so that $ ocaml t.ml also returns proper backtraces, ditto for ocaml init;
  • find_debuginfo was factored out;
  • debug information from the main executable is loaded lazily; debug information from elsewhere is still loaded eagerly, as I do not anticipate this to cause a slowdown.

@vicuna
Copy link
Author

vicuna commented Jan 3, 2015

Comment author: @gasche

(Sorry for the long time before feedback. The happy ignorance of vacation. I probably won't be able to do much in the following month for other reasons.)

I like the general shape of the patch, but here are a few nitpicking comments.

There is a comment in bytecomp/emitcode.mli that is made incorrect by your patch.

I think "lazy_loaded" is a bad name choice (I am not able to tell from the name alone whether a 1 value suggests the lazy loading has already happened.) I would suggest just "loaded", or "already_read" (it's strange that the function using to make a debuginfo "loaded" would be "read_debug_info"; "read" is also reasonable but ambiguous).

There is a consistency problem with the name "debuginfo" that I unfortunately suggested (it is more natural to me): the existing code uses "debug_info" all over the place. I think you should use "debug_info" as well (in find_debug_info), and maybe in the Debug_info_val macro as well (one may or may not like the fact that the two underscores have a different status here, but that's already used in Unsigned_long_val for example).

Finally, I think you could change event_for_location so that you use an early exit (if (di == NULL)\n\treturn NULL;) rather than a long conditional. This removes the "short case last" issue, but more importantly it allows to go back to the indentation level of the previous implementation, which should make for a more readable diff.

@vicuna
Copy link
Author

vicuna commented Jan 6, 2015

Comment author: @whitequark

Addressed.

@vicuna
Copy link
Author

vicuna commented Jan 24, 2015

Comment author: @gasche

Applying the patch to the current trunk and running the testsuite gives me one test failure in testsuite/tests/typing-objects. The failure message is rather worrying:

$ diff Tests.ml.{result,reference}
303,304c303,304
< # Exception: Invalid_argument "output_value: abstract value (outside heap)".
< # Exception: Invalid_argument "output_value: abstract value (outside heap)".

- : int * int * int * int * int = (6, 7, 8, 33, 33)

- : int * int * int * int * int * int * int = (9, 10, 10, 11, 11, 33, 33)

This might be related to other stuff that happened in trunk recently rather than this patch, or maybe a bootstrap error on my part. Can you reproduce the failure on your machine?

@vicuna
Copy link
Author

vicuna commented Jan 25, 2015

Comment author: @gasche

I can reliably reproduce this failure -- it seems to be a problem of the patch.

Minimized test case, run in the toplevel:

(good) with 4.02.1 or trunk before the patch:

(Marshal.from_string (Marshal.to_string (object method m = 33 end) [Marshal.Closures]) 0);;

  • : 'a =

(bad) with the patch:

Marshal.from_string (Marshal.to_string (object method m = 33 end) [Marshal.Closures]) 0;;

Exception: Invalid_argument "output_value: abstract value (outside heap)".

@vicuna
Copy link
Author

vicuna commented Jan 25, 2015

Comment author: @whitequark

Ok. I know why the failure happens and will fix soon.

@vicuna
Copy link
Author

vicuna commented Jan 25, 2015

Comment author: @whitequark

After investigation, I admit that I did not, in fact, know the cause of the failure. What puzzles me is the fact that the following code:

Marshal.to_string (fun () -> ()) [Marshal.Closures];;

fails in toplevel with the same message since 4.00.1. My guess is that the following code:

Marshal.to_string (object method m = 33 end) [Marshal.Closures]

triggers the same bug, but only after my patch exposes it somehow.

Since the code you highlight works with both ocamlc and ocamlc -g, but reliably fails in toplevel, I'm going to say that I happened to expose a latent bug in extern.c. I'm not sure which one though, and why is it failing intermittently.

@vicuna
Copy link
Author

vicuna commented Jan 25, 2015

Comment author: @whitequark

After applying the patch for 6760 and adjusting the test outputs for the backtraces that appear, all tests pass.

@vicuna
Copy link
Author

vicuna commented Jan 25, 2015

Comment author: @gasche

I find it disturbing that your patch would provoke a change in difference between closures and object marshalling, and would like to understand better what is happening here. I'll try to get opinions from developpers that actually know about these things (here and #6760) before considering merge.

@vicuna
Copy link
Author

vicuna commented Jan 25, 2015

Comment author: @whitequark

Yes, I would like to understand that better too. I think there might be more bugs hidden in code that is triggered by object serialization.

@vicuna
Copy link
Author

vicuna commented Feb 4, 2015

Comment author: @jhjourdan

Now I understand this: when using this patch, the compiler is instructed to
produce debug information. In particular, in translclass.ml, when not
compiling with debug information, the compiler tries to use predefined
implementations (from CamlinternalOO). Thus, with this patch, the object
contains a code pointer to some code that has been just compiled, while
without the patch it does contain a pointer to a closure from the standard
library. This closure triggers #6760, which explains why it is magically
appearing.

@vicuna
Copy link
Author

vicuna commented Feb 4, 2015

Comment author: @gasche

Aha! Congratulations for the mystery solving.

Jacques-Henri, do you have any specific review/comment on #6760 itself? Now that I understand the issue, I think we're ready to merge both.

@vicuna
Copy link
Author

vicuna commented Feb 5, 2015

Comment author: @gasche

Now that we also have the patch for #6760, I'm wondering why we have this duplication of two different tables that basically index information by code fragment intervals. Could we not store the (delayed) ev_info information (the fragment-local set of debug events) in a new field of the caml_code_fragments table?

@vicuna
Copy link
Author

vicuna commented Feb 5, 2015

Comment author: @whitequark

@gasche, not easily, since the caml_code_fragments table is managed in C but caml_debug_info is managed by OCaml GC. It would require a major refactoring and I am not sure what is the value.

@vicuna
Copy link
Author

vicuna commented Feb 8, 2015

Comment author: @gasche

I applied the patch as-is. You changed some tests in the testsuite (exceptions raised from toplevel tests, to add a backtrace), but on my machine the tests are run without OCAMLRUNPARAM=b so I reverted those changes to have the test pass.

It may be more robust to decide once and forall whether we want to set OCAMLRUNPARAM=b to run the toplevel tests. Do you always have b=1 in your OCAMLRUNPARAM environment?

@vicuna
Copy link
Author

vicuna commented Feb 8, 2015

Comment author: @whitequark

Yeah; and as described in #6728 it is my view that OCAMLRUNPARAM=b should be the default.

Now that I think about it, didn't I always enable backtraces for the toplevel in this patch? This seems odd...

@vicuna
Copy link
Author

vicuna commented Feb 9, 2015

Comment author: @whitequark

Ah right--the toplevel does not currently (try to) enable backtraces by default, it only does that for -g. I was confused momentarily.

In my opinion it should. There is little performance reasons not to.

What do you think?

@vicuna
Copy link
Author

vicuna commented May 5, 2015

Comment author: @alainfrisch

Commit 16077 fixes a bug introduced with this change: caml_init_debug_info needs to be called from caml_startup_code as well to support 'ocamlc -custom'. (The commit messages wrongly mentions caml_init_code_fragments.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant