Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006065OCamlOCaml standard librarypublic2013-07-05 17:092013-07-19 17:40
Reporterrixed 
Assigned To 
PrioritynormalSeverityminorReproducibilityN/A
StatusacknowledgedResolutionopen 
PlatformOSOS Version
Product Version4.00.1 
Target VersionlaterFixed in Version 
Summary0006065: Array.init of a float array needlessly initialize array with (f 0)
DescriptionI'm using small arrays of float to store vectors and caml_make_vect is high in CPU consumption. Looking at the code it seams to me that in many cases (Array.init, Array.map, etc, actually all but Array.make) caml_make_vect could let the array uninitialized (since it's a Double_array_tag block) because the code of Array.init, map, etc, is going to properly initialize the array.

So I propose to introduce caml_init_vect, which will be the same as caml_make_vect without the Store_double_field() calls that copy init value into
the float array. Of course both caml_init_vect and caml_make_vect could be wrappers around a third one taking an additional parameter such as "need_init".
TagsNo tags attached.
Attached Files

- Relationships

-  Notes
(0009709)
jacques-henri.jourdan (manager)
2013-07-05 23:37

What architecture are you using ?

Are you sure the initialization is the actual bottleneck (would it be the allocation ?) ? Do you have some practical benchmark showing this ?

If necessary, I would rather optimize caml_make_vect: this is not normal that a simple memset-like loop is a CPU bottleneck. The actual initialization by the Ocaml code (map, init) should use at least comparable time.
(0009711)
rixed (reporter)
2013-07-06 07:10

I use everyone's amd64 PC.
I haven't checked if this initialization is the bottleneck. It's certainly a waste of CPU (and, for bigger arrays, of cache bandwidth) anyway.
But now that you ask for it I had a look to the annotated caml_make_vect as reported by perf (ocaml compiled with stack frame pointers),
and it looks like this initialization is indeed the bottleneck, which is surprising:

- 12.33% mlrocket.opt mlrocket.opt [.] caml_make_vect
   - caml_make_vect
      + 93.43% camlArray__init_1019
      + 5.76% camlGeom_path__bezier_1910

Annotated caml_make_vect at hottest point:

         : CAMLprim value caml_make_vect(value len, value init)
    2.26 : 64fb9d: shl $0x3,%r12
    0.00 : 64fba1: mov $0x0,%eax
         : d = Double_val(init);
         : wsize = size * Double_wosize;
         : if (wsize > Max_wosize) caml_invalid_argument("Array.make");
         : res = caml_alloc(wsize, Double_array_tag);
         : for (i = 0; i < size; i++) {
         : Store_double_field(res, i, d);
    2.36 : 64fba6: mov -0x68(%rbp),%rdx
    2.26 : 64fbaa: mov %rbx,(%rdx,%rax,1)
   12.05 : 64fbae: add $0x8,%rax
         : && Tag_val(init) == Double_tag) {
         : d = Double_val(init);
         : wsize = size * Double_wosize;
         : if (wsize > Max_wosize) caml_invalid_argument("Array.make");
         : res = caml_alloc(wsize, Double_array_tag);
         : for (i = 0; i < size; i++) {
    2.51 : 64fbb2: cmp %r12,%rax
    0.00 : 64fbb5: jne 64fba6 <caml_make_vect+0xff>
    0.00 : 64fbb7: jmpq 64fc89 <caml_make_vect+0x1e2>

I tried to implement the suggested patch to measure the gain but I failed to compile the compiler with new runtime ; I tried make core/coreboot/bootstrap and other recipes mentioned here and there to no avail.
(0009712)
xleroy (administrator)
2013-07-06 10:34

Interesting suggestion, thanks. Some thoughts:

- Array operations greatly benefit from being type-specialized, because polymorphic array operations involve run-time tag tests. In this respect, the polymorphic Array.map will always be less efficient than a version specialized to type "float array".

- What is the typical size of your arrays? Probably not that small, otherwise the initialization loop would not take so much time.

- There is a significant fixed cost in caml_make_vect, namely the Is_in_value_area test. This, I'm afraid, we can't get rid of. But you're right that we could avoid the double initialization for Array.map-like operations. And also avoid triggering a minor GC in the case of a large array whose first element points to the young generation.

- For non-float arrays, the first initialization must still be performed, to satisfy the GC invariants. However, we can initialize with a safe constant, say, Val_unit, and avoid calling caml_initialize() at all.

- The alternate caml_make_vect (let's call it caml_alloc_vect) should still initialize the *first* element of the array with the given init value. That would preserve the code structure of Array.map et al.
(0009802)
rixed (reporter)
2013-07-17 17:00

Thank you for the suggestions. Not sure If I will be able to specialize my Array.map since the float type comes from a functor argument, though.

Typical size of my arrays are 2 to 4 items (so, quite small).

Also, what's the procedure to change the runtime in recent ocaml source distribution? Maybe I should open another feature request to have this documented in the README?
(0009805)
jacques-henri.jourdan (manager)
2013-07-17 18:26

I am a bit surprised by your performance problems in the case of that small arrays.

Could you please provide some repro case ?
(0009806)
rixed (reporter)
2013-07-17 20:10

The actual program is a small demonstration of a vector graphic library ; see for yourself:

opam switch testperf --alias-of 4.02.0dev+fp
opam repo add testperf https://git.gitorious.org/ocalme/opam-repository.git [^]
opam install mlrockets
perf record somewhere/in/opam/testperf/mlrockets/mlrockets.opt

should compile and run the program and record it's runtime performances.
(0009813)
jacques-henri.jourdan (manager)
2013-07-19 16:17

I get the following:

Building world of radius 150....
    Gravity = 0.1
    Building ground...
        Randomizing ground...
        Randomizing ground...
    Building rocket 1...
    Adding 706 stars...
Erreur de segmentation (core dumped)

Using gdb, the backtrace is:

#0 0x0000000000000000 in ?? ()
#1 0x00007ffff7205546 in generic_run () from /usr/lib64/dri/nouveau_dri.so
#2 0x00007ffff730e424 in nv50_push_vbo () from /usr/lib64/dri/nouveau_dri.so
0000003 0x00007ffff730a64b in nv50_draw_vbo () from /usr/lib64/dri/nouveau_dri.so
0000004 0x00007ffff711f879 in st_draw_vbo () from /usr/lib64/dri/nouveau_dri.so
0000005 0x00007ffff70f8a54 in vbo_draw_arrays () from /usr/lib64/dri/nouveau_dri.so
0000006 0x00000000005c5489 in gl_render ()
0000007 0x000000000047a16d in camlGlop_base__fun_4814 ()
0000008 0x000000000058b2a5 in camlList__iter_1061 () at list.ml:73
0000009 0x0000000000471123 in camlGame__fun_3453 () at game.ml:90
0000010 0x000000000047c200 in camlGlop_view__aux_2867 () at glop_view.ml:87
0000011 0x000000000058b2a5 in camlList__iter_1061 () at list.ml:73
0000012 0x000000000047c4a0 in camlGlop_view__next_frame_2945 () at glop_view.ml:172
0000013 0x000000000057c44f in camlBricabrac__forever_1030 () at bricabrac.ml:23
0000014 0x00000000004706b7 in camlMain__entry () at main.ml:20
0000015 0x000000000046dd3d in caml_program ()
0000016 0x00000000005e170a in caml_start_program ()
0000017 0x0000000000000000 in ?? ()

So it seems it is trying to draw something, but my openGL implementation does not like this.

Is there a way to run it without rendering ?
(0009814)
jacques-henri.jourdan (manager)
2013-07-19 17:08

I have finally been able to run it.

I did the following experiment : I duplicated the initialization loop in caml_make_vect (so the second instance of the loop does nothing interesting, it just takes time). After recompiling everything, it seems like the second instance takes much less time than the first. For me, it precisely means the first one is almost always a cache miss, while the second one is a cache hit.

I precise the assembly code generated for both loops are very similar.

My conclusion is that if we remove the loop (or replace it by the initialization of the first field), the performance won't be improved a lot, because of the cache misses (they will appear in Array.map or whatever anyway).
(0009815)
rixed (reporter)
2013-07-19 17:40

Sorry for the segfault, you probably figured out how to disable rendering.

Yes my use case may not benefit a lot from tis change, but for larger arrays the first initialization may load the same cache lines several times, entailing more latency than necessary.

- Issue History
Date Modified Username Field Change
2013-07-05 17:09 rixed New Issue
2013-07-05 17:57 gasche Note Added: 0009705
2013-07-05 17:58 gasche Note Deleted: 0009705
2013-07-05 23:37 jacques-henri.jourdan Note Added: 0009709
2013-07-06 07:10 rixed Note Added: 0009711
2013-07-06 10:34 xleroy Note Added: 0009712
2013-07-06 10:34 xleroy Status new => acknowledged
2013-07-06 10:34 xleroy Target Version => later
2013-07-17 17:00 rixed Note Added: 0009802
2013-07-17 18:26 jacques-henri.jourdan Note Added: 0009805
2013-07-17 20:10 rixed Note Added: 0009806
2013-07-19 16:17 jacques-henri.jourdan Note Added: 0009813
2013-07-19 17:08 jacques-henri.jourdan Note Added: 0009814
2013-07-19 17:40 rixed Note Added: 0009815


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker