Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array.init of a float array needlessly initialize array with (f 0) #6065

Closed
vicuna opened this issue Jul 5, 2013 · 10 comments
Closed

Array.init of a float array needlessly initialize array with (f 0) #6065

vicuna opened this issue Jul 5, 2013 · 10 comments

Comments

@vicuna
Copy link

vicuna commented Jul 5, 2013

Original bug ID: 6065
Reporter: @rixed
Status: acknowledged (set by @xavierleroy on 2013-07-06T08:34:18Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.00.1
Target version: later
Category: standard library
Monitored by: @gasche @jmeber

Bug description

I'm using small arrays of float to store vectors and caml_make_vect is high in CPU consumption. Looking at the code it seams to me that in many cases (Array.init, Array.map, etc, actually all but Array.make) caml_make_vect could let the array uninitialized (since it's a Double_array_tag block) because the code of Array.init, map, etc, is going to properly initialize the array.

So I propose to introduce caml_init_vect, which will be the same as caml_make_vect without the Store_double_field() calls that copy init value into
the float array. Of course both caml_init_vect and caml_make_vect could be wrappers around a third one taking an additional parameter such as "need_init".

@vicuna
Copy link
Author

vicuna commented Jul 5, 2013

Comment author: @jhjourdan

What architecture are you using ?

Are you sure the initialization is the actual bottleneck (would it be the allocation ?) ? Do you have some practical benchmark showing this ?

If necessary, I would rather optimize caml_make_vect: this is not normal that a simple memset-like loop is a CPU bottleneck. The actual initialization by the Ocaml code (map, init) should use at least comparable time.

@vicuna
Copy link
Author

vicuna commented Jul 6, 2013

Comment author: @rixed

I use everyone's amd64 PC.
I haven't checked if this initialization is the bottleneck. It's certainly a waste of CPU (and, for bigger arrays, of cache bandwidth) anyway.
But now that you ask for it I had a look to the annotated caml_make_vect as reported by perf (ocaml compiled with stack frame pointers),
and it looks like this initialization is indeed the bottleneck, which is surprising:

  • 12.33% mlrocket.opt mlrocket.opt [.] caml_make_vect
    • caml_make_vect
      • 93.43% camlArray__init_1019
      • 5.76% camlGeom_path__bezier_1910

Annotated caml_make_vect at hottest point:

     :        CAMLprim value caml_make_vect(value len, value init)
2.26 :          64fb9d:       shl    $0x3,%r12
0.00 :          64fba1:       mov    $0x0,%eax
     :            d = Double_val(init);
     :            wsize = size * Double_wosize;
     :            if (wsize > Max_wosize) caml_invalid_argument("Array.make");
     :            res = caml_alloc(wsize, Double_array_tag);
     :            for (i = 0; i < size; i++) {
     :              Store_double_field(res, i, d);
2.36 :          64fba6:       mov    -0x68(%rbp),%rdx
2.26 :          64fbaa:       mov    %rbx,(%rdx,%rax,1)

12.05 : 64fbae: add $0x8,%rax
: && Tag_val(init) == Double_tag) {
: d = Double_val(init);
: wsize = size * Double_wosize;
: if (wsize > Max_wosize) caml_invalid_argument("Array.make");
: res = caml_alloc(wsize, Double_array_tag);
: for (i = 0; i < size; i++) {
2.51 : 64fbb2: cmp %r12,%rax
0.00 : 64fbb5: jne 64fba6 <caml_make_vect+0xff>
0.00 : 64fbb7: jmpq 64fc89 <caml_make_vect+0x1e2>

I tried to implement the suggested patch to measure the gain but I failed to compile the compiler with new runtime ; I tried make core/coreboot/bootstrap and other recipes mentioned here and there to no avail.

@vicuna
Copy link
Author

vicuna commented Jul 6, 2013

Comment author: @xavierleroy

Interesting suggestion, thanks. Some thoughts:

  • Array operations greatly benefit from being type-specialized, because polymorphic array operations involve run-time tag tests. In this respect, the polymorphic Array.map will always be less efficient than a version specialized to type "float array".

  • What is the typical size of your arrays? Probably not that small, otherwise the initialization loop would not take so much time.

  • There is a significant fixed cost in caml_make_vect, namely the Is_in_value_area test. This, I'm afraid, we can't get rid of. But you're right that we could avoid the double initialization for Array.map-like operations. And also avoid triggering a minor GC in the case of a large array whose first element points to the young generation.

  • For non-float arrays, the first initialization must still be performed, to satisfy the GC invariants. However, we can initialize with a safe constant, say, Val_unit, and avoid calling caml_initialize() at all.

  • The alternate caml_make_vect (let's call it caml_alloc_vect) should still initialize the first element of the array with the given init value. That would preserve the code structure of Array.map et al.

@vicuna
Copy link
Author

vicuna commented Jul 17, 2013

Comment author: @rixed

Thank you for the suggestions. Not sure If I will be able to specialize my Array.map since the float type comes from a functor argument, though.

Typical size of my arrays are 2 to 4 items (so, quite small).

Also, what's the procedure to change the runtime in recent ocaml source distribution? Maybe I should open another feature request to have this documented in the README?

@vicuna
Copy link
Author

vicuna commented Jul 17, 2013

Comment author: @jhjourdan

I am a bit surprised by your performance problems in the case of that small arrays.

Could you please provide some repro case ?

@vicuna
Copy link
Author

vicuna commented Jul 17, 2013

Comment author: @rixed

The actual program is a small demonstration of a vector graphic library ; see for yourself:

opam switch testperf --alias-of 4.02.0dev+fp
opam repo add testperf https://git.gitorious.org/ocalme/opam-repository.git
opam install mlrockets
perf record somewhere/in/opam/testperf/mlrockets/mlrockets.opt

should compile and run the program and record it's runtime performances.

@vicuna
Copy link
Author

vicuna commented Jul 19, 2013

Comment author: @jhjourdan

I get the following:

Building world of radius 150....
Gravity = 0.1
Building ground...
Randomizing ground...
Randomizing ground...
Building rocket 1...
Adding 706 stars...
Erreur de segmentation (core dumped)

Using gdb, the backtrace is:

#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7205546 in generic_run () from /usr/lib64/dri/nouveau_dri.so
#2 0x00007ffff730e424 in nv50_push_vbo () from /usr/lib64/dri/nouveau_dri.so
#3 0x00007ffff730a64b in nv50_draw_vbo () from /usr/lib64/dri/nouveau_dri.so
#4 0x00007ffff711f879 in st_draw_vbo () from /usr/lib64/dri/nouveau_dri.so
#5 0x00007ffff70f8a54 in vbo_draw_arrays () from /usr/lib64/dri/nouveau_dri.so
#6 0x00000000005c5489 in gl_render ()
#7 0x000000000047a16d in camlGlop_base__fun_4814 ()
#8 0x000000000058b2a5 in camlList__iter_1061 () at list.ml:73
#9 0x0000000000471123 in camlGame__fun_3453 () at game.ml:90
#10 0x000000000047c200 in camlGlop_view__aux_2867 () at glop_view.ml:87
#11 0x000000000058b2a5 in camlList__iter_1061 () at list.ml:73
#12 0x000000000047c4a0 in camlGlop_view__next_frame_2945 () at glop_view.ml:172
#13 0x000000000057c44f in camlBricabrac__forever_1030 () at bricabrac.ml:23
#14 0x00000000004706b7 in camlMain__entry () at main.ml:20
#15 0x000000000046dd3d in caml_program ()
#16 0x00000000005e170a in caml_start_program ()
#17 0x0000000000000000 in ?? ()

So it seems it is trying to draw something, but my openGL implementation does not like this.

Is there a way to run it without rendering ?

@vicuna
Copy link
Author

vicuna commented Jul 19, 2013

Comment author: @jhjourdan

I have finally been able to run it.

I did the following experiment : I duplicated the initialization loop in caml_make_vect (so the second instance of the loop does nothing interesting, it just takes time). After recompiling everything, it seems like the second instance takes much less time than the first. For me, it precisely means the first one is almost always a cache miss, while the second one is a cache hit.

I precise the assembly code generated for both loops are very similar.

My conclusion is that if we remove the loop (or replace it by the initialization of the first field), the performance won't be improved a lot, because of the cache misses (they will appear in Array.map or whatever anyway).

@vicuna
Copy link
Author

vicuna commented Jul 19, 2013

Comment author: @rixed

Sorry for the segfault, you probably figured out how to disable rendering.

Yes my use case may not benefit a lot from tis change, but for larger arrays the first initialization may load the same cache lines several times, entailing more latency than necessary.

@vicuna vicuna added the stdlib label Mar 14, 2019
@xavierleroy
Copy link
Contributor

The Float.Array module that was added in 4.08 provides float-specialized arrays, with functions that avoid double initialization when possible. It looks like a good answer to the original request.

As I mentioned in my comment above, similar optimizations might be possible for generic array operations such as Array.init and Array.map. But that's significant work for unclear benefits.

Closing this report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants