New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unnecessary calls to caml_modify #6019
Comments
Comment author: @xavierleroy Yes, caml_modify is too slow, and we should do something about this. Thanks for raising the issue on the bug tracker. Before inlining a fast path, as you do, I think we need to rewrite caml_modify so that fast paths (there are several possible) are clearly identified. I suspect most of the cost of caml_modify comes primarily from calls to Is_in_heap, which can be avoided in several cases, and not from the call to caml_modify instead. To be investigated... |
Comment author: @gasche For the record, I modified the proposed example as such to test the performance hit on non-integer modifications. time_it "addr field" (fun () -> time_it "addr field with dynamic check" (fun () -> The results on my amd64 machine are the following: int field took 0.116000 so this particular microbenchmark must be relativized in that it may speed up integer modifications by 50%, but also slow down the other modifications by 30%. It probably wouldn't be appropriate to blindly inject the corresponding code in all modification places. |
Comment author: bnigito Would you mind running that benchmark again ensuring the resulting set_dynamic code is inlined? I believe that is the only reason you would see a difference of such magnitude (our standard compilation tools already have aggressive inlining options enabled, so I sometimes assume it). Here's what I see: Certainly it is a tradeoff, not in all modification sites but only those with a polymorphic value (if this is known/possible at compilation). If it were to slow down polymorphic modifications by 30% I would not have suggested it, but if it's in the range of 3-5%, I think there is a case to consider it. I also believe the microbenchmark might be understating the benefit, as the an unnecessary caml_modify call could also be evicting useful cache lines, etc. thanks for looking into it. |
Comment author: @xavierleroy The SVN trunk (r13723) reimplements caml_modify() to reduce overheads and take advantage of several fast paths. On the microbenchmark of this PR, the performance improvement is significant:
int field took 0.060003 0.056003 In particular, the performance improvement resulting from open-coding the dynamic check is much more modest. Given that the dynamic check is not always a win, I'd rather not pursue this direction. |
Comment author: yminsky When trying to build the latest trunk on MacOSX, I get the following error. I'm guessing this is related to the present changes. rm minor_gc.d.c stderrmemory.d.c: In function ‘caml_modify’: |
Comment author: yminsky Hmm. I tried to replicate this outside of OPAM, but the checkout of the specific revision builds without incident, and the current tip breaks in a different place (scanf related). When tip settles down, I'll try again. |
Comment author: @xavierleroy Indeed there was a missing cast, but it only shows up when building the runtime system in debug mode. Fixed in commit r13727. |
Comment author: yminsky I suppose this bug is unrelated? stderrFile "scanf.ml", line 640, characters 14-75: |
Comment author: @xavierleroy Yes it is unrelated. Just look at the committers. |
Comment author: yminsky I've tested out the new version, and it's clearly and significantly better, but I think an update of Brian's original benchmark shows that there's still significant room to improve. In particular, the dynamic check still seems like a real win. Here's the benchmark results I see: First, with the old compiler:
And with the new compiler:
In both cases, it looks to me like the dynamic check wins, and by a decent margin. The fact that the poly-field version with the dynamic check does better by a bit I suspect is just some noise, more indicating that the two are very close than that there's a real advantage. That said, for the int, the poly check seems to be more than 3 times better. I suspect this win comes from avoiding the C call to caml_modify. My benchmark code is below: module Int_field = struct module Poly_field = struct let set_dynamic t x = type obj = Obj of int let time_it desc f = let obj0 = Obj 0 let () = let times = 200_000_000 in time_it "poly field with int" (fun () -> time_it "poly field with obj" (fun () -> time_it "poly field with int, dynamic check" (fun () -> time_it "poly field with obj, dynamic check" (fun () -> |
Comment author: @damiendoligez Note to future visitors: there's a bug in Yaron's latest benchmark: it runs the exact same code for cases 3 and 5, which explains the very close performance numbers. |
Original bug ID: 6019
Reporter: bnigito
Status: resolved (set by @xavierleroy on 2013-06-01T07:59:09Z)
Resolution: suspended
Priority: normal
Severity: tweak
Version: 4.00.1
Category: back end (clambda to assembly)
Monitored by: @gasche @diml @ygrek bobot @jmeber yminsky @yakobowski @alainfrisch
Bug description
For some mutation heavy code, the c-call to caml_modify is extremely high on our profiles (exhibiting almost a full order of magnitude difference in tight loops). This is in some sense a duplicate of 0005592, but it highlights the fact that writes could be made cheaper even absent the inlining improvements by a emitting a dynamic check that would skip the C-call to caml_modify.
We've added something like this:
let unsafe_set t i obj =
let old_obj = unsafe_get t i in
if Obj.is_int old_obj && Obj.is_int obj then
(* It is OK to skip [caml_modify] if both the old and new values are integers. *)
Array.unsafe_set (Obj.magic (t : t) : int array) i (Obj.obj obj : int)
else if not (old_obj == obj) then
Array.unsafe_set t i obj
around wrappers of arrays that often hold immediate values with good results. It seems like it makes more sense to perform at the level of code generation. For the included test I've demonstrated it for a polymorphic mutable field, but doing this sort of dynamic check in our code feels somewhat impractical.
Steps to reproduce
$ ocamlopt.opt -inline 20 -nodynlink -g -o z z.ml
$ ./z
int field took 0.120980
poly field took 1.021845
poly field with dynamic check took 0.240963
File attachments
The text was updated successfully, but these errors were encountered: