Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005872OCamlOCaml standard librarypublic2013-01-04 16:392013-07-11 23:49
Reportergerd 
Assigned To 
PrioritynormalSeverityminorReproducibilityalways
StatusresolvedResolutionfixed 
PlatformOSOS Version
Product Version4.00.1 
Target Version4.01.0+devFixed in Version4.01.0+dev 
Summary0005872: Performance: Buffer.add_char is not inlined
DescriptionAccording to ocamlobjinfo, Buffer.add_char is not marked for being inlined. IMHO, this is a performance bug, because this function makes only sense if it can be inlined.

Proposed fix: change [let resize] into [let rec resize], i.e. make this function artificially recursive, as this prevents [resize] from being inlined into [add_char] (which is probably the cause for the problem).
TagsNo tags attached.
Attached Files

- Relationships
related to 0005873confirmed Feature: control inlining better 

-  Notes
(0008708)
frisch (developer)
2013-01-08 13:51

> because this function makes only sense if it can be inlined.

While I can imagine that inlining could make a difference here, I don't understand this statement. Can you elaborate on why Buffer.add_char is useless if it is not inlined?
(0008714)
gerd (reporter)
2013-01-08 15:51

Maybe not totally useless, but my understanding is that algorithms are sometimes easier to write when you treat every char separately - finally caling Buffer.add_char when you are done with a character. The alternative is a different style where you try to treat several characters at once (e.g. find first a substring, then add all at once to the buffer). Well, of course my statement makes only sense when this alternative exists. You can then view add_char as the function that enables the simpler style hopefully without impact on the performance.

Regarding the performance, at least, I recently had to rewrite an algorithm into the alternate style because Buffer.add_char was too slow (or better: it was important for me to maximize the performance, and the choice of style made a difference). My guess is that this would not have been necessary if add_char was inlined (in which case it compiles to around 15 assembly instructions in the "fast path").

Btw, if I compile buffer.ml with -inline 3, add_char is marked as inline function. It is just slightly too large. (Maybe some hint that the default inlining limit is too low.) I just checked again: A simple loop adding 100M chars to a buffer is 10% faster if add_char is inlined (for amd64).
(0008715)
frisch (developer)
2013-01-08 15:53

So an alternative fix would be to compile the standard library with a larger -inline factor, right?
(0008716)
gerd (reporter)
2013-01-08 15:55

Yes.
(0008717)
frisch (developer)
2013-01-08 16:42

So bumping -inline for the standard library seems more robust to me than trying to work around the inlining heuristic by adding a spurious "rec". We would need to assess the impact of this change, though. There is an ongoing project to collect realistic piece of code in order to create a serious benchmarking suite for OCaml code (in order to study optimizations to the compiler, runtime system, standard library). Cf: http://gallium.inria.fr/~scherer/gagallium/we-need-a-representative-benchmark-suite/ [^]

Hopefully such a suite will make it easier to make a decision like changing the default inlining factor (or for the stdlib only).
(0008720)
gerd (reporter)
2013-01-08 17:40

Just for this case, increasing -inline is ok.

But thinking a bit more general, one of the problems we face here is that the inlining heuristics is very imprecise. So far I understand, the metrics is on the size of an intermediate code representation of the inlining candidate. What is totally ignored is the cost of the function call (if inlining is not done), which is (also) a function of the number of registers the caller needs to save to the stack. Illustrated at the example of Buffer.add_char, this is not negligible: if you e.g. need to save 5 registers, you need 10 more instructions, and this is a lot when the called function consists only of 15 instructions. I have no good idea how you can guess the calling costs based on the intermediate code, but maybe one can use something like the number of variables in the innermost scope as estimate. My thinking is that this matters more when CPUs get more registers. But anyway, "calibrating" something like this (or even inlining in general) is probably difficult.
(0009007)
xleroy (administrator)
2013-03-24 15:42

Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because it would cause the fast path of add_char to be inlined, while keeping the slow path "offline", and that's what's best wrt performance.
(0009760)
doligez (administrator)
2013-07-11 23:49

>Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because...

Except that I've just tried and it doesn't work: add_char is still not inlined.

OTOH, compiling buffer.ml with -inline 3 does work, and it doesn't inline resize, so the slow path stays offline.

I have added the -inline 3 to Compflags (rev 13888 in branch 4.01).

- Issue History
Date Modified Username Field Change
2013-01-04 16:39 gerd New Issue
2013-01-08 13:51 frisch Note Added: 0008708
2013-01-08 15:51 gerd Note Added: 0008714
2013-01-08 15:53 frisch Note Added: 0008715
2013-01-08 15:55 gerd Note Added: 0008716
2013-01-08 16:42 frisch Note Added: 0008717
2013-01-08 17:40 gerd Note Added: 0008720
2013-03-24 15:42 xleroy Note Added: 0009007
2013-03-24 15:42 xleroy Status new => acknowledged
2013-06-28 18:14 doligez Relationship added related to 0005873
2013-07-11 23:49 doligez Note Added: 0009760
2013-07-11 23:49 doligez Status acknowledged => resolved
2013-07-11 23:49 doligez Resolution open => fixed
2013-07-11 23:49 doligez Fixed in Version => 4.01.0+dev
2013-07-11 23:49 doligez Target Version => 4.01.0+dev


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker