|Anonymous | Login | Signup for a new account||2017-10-24 13:34 CEST|
|Main | My View | View Issues | Change Log | Roadmap|
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0005872||OCaml||standard library||public||2013-01-04 16:39||2015-12-11 19:20|
|Target Version||4.01.0+dev||Fixed in Version||4.01.0+dev|
|Summary||0005872: Performance: Buffer.add_char is not inlined|
|Description||According to ocamlobjinfo, Buffer.add_char is not marked for being inlined. IMHO, this is a performance bug, because this function makes only sense if it can be inlined.|
Proposed fix: change [let resize] into [let rec resize], i.e. make this function artificially recursive, as this prevents [resize] from being inlined into [add_char] (which is probably the cause for the problem).
|Tags||No tags attached.|
> because this function makes only sense if it can be inlined.
While I can imagine that inlining could make a difference here, I don't understand this statement. Can you elaborate on why Buffer.add_char is useless if it is not inlined?
Maybe not totally useless, but my understanding is that algorithms are sometimes easier to write when you treat every char separately - finally caling Buffer.add_char when you are done with a character. The alternative is a different style where you try to treat several characters at once (e.g. find first a substring, then add all at once to the buffer). Well, of course my statement makes only sense when this alternative exists. You can then view add_char as the function that enables the simpler style hopefully without impact on the performance.
Regarding the performance, at least, I recently had to rewrite an algorithm into the alternate style because Buffer.add_char was too slow (or better: it was important for me to maximize the performance, and the choice of style made a difference). My guess is that this would not have been necessary if add_char was inlined (in which case it compiles to around 15 assembly instructions in the "fast path").
Btw, if I compile buffer.ml with -inline 3, add_char is marked as inline function. It is just slightly too large. (Maybe some hint that the default inlining limit is too low.) I just checked again: A simple loop adding 100M chars to a buffer is 10% faster if add_char is inlined (for amd64).
|So an alternative fix would be to compile the standard library with a larger -inline factor, right?|
So bumping -inline for the standard library seems more robust to me than trying to work around the inlining heuristic by adding a spurious "rec". We would need to assess the impact of this change, though. There is an ongoing project to collect realistic piece of code in order to create a serious benchmarking suite for OCaml code (in order to study optimizations to the compiler, runtime system, standard library). Cf: http://gallium.inria.fr/~scherer/gagallium/we-need-a-representative-benchmark-suite/ [^]
Hopefully such a suite will make it easier to make a decision like changing the default inlining factor (or for the stdlib only).
Just for this case, increasing -inline is ok.
But thinking a bit more general, one of the problems we face here is that the inlining heuristics is very imprecise. So far I understand, the metrics is on the size of an intermediate code representation of the inlining candidate. What is totally ignored is the cost of the function call (if inlining is not done), which is (also) a function of the number of registers the caller needs to save to the stack. Illustrated at the example of Buffer.add_char, this is not negligible: if you e.g. need to save 5 registers, you need 10 more instructions, and this is a lot when the called function consists only of 15 instructions. I have no good idea how you can guess the calling costs based on the intermediate code, but maybe one can use something like the number of variables in the innermost scope as estimate. My thinking is that this matters more when CPUs get more registers. But anyway, "calibrating" something like this (or even inlining in general) is probably difficult.
Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because it would cause the fast path of add_char to be inlined, while keeping the slow path "offline", and that's what's best wrt performance.
>Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because...
Except that I've just tried and it doesn't work: add_char is still not inlined.
OTOH, compiling buffer.ml with -inline 3 does work, and it doesn't inline resize, so the slow path stays offline.
I have added the -inline 3 to Compflags (rev 13888 in branch 4.01).
|2013-01-04 16:39||gerd||New Issue|
|2013-01-08 13:51||frisch||Note Added: 0008708|
|2013-01-08 15:51||gerd||Note Added: 0008714|
|2013-01-08 15:53||frisch||Note Added: 0008715|
|2013-01-08 15:55||gerd||Note Added: 0008716|
|2013-01-08 16:42||frisch||Note Added: 0008717|
|2013-01-08 17:40||gerd||Note Added: 0008720|
|2013-03-24 15:42||xleroy||Note Added: 0009007|
|2013-03-24 15:42||xleroy||Status||new => acknowledged|
|2013-06-28 18:14||doligez||Relationship added||related to 0005873|
|2013-07-11 23:49||doligez||Note Added: 0009760|
|2013-07-11 23:49||doligez||Status||acknowledged => resolved|
|2013-07-11 23:49||doligez||Resolution||open => fixed|
|2013-07-11 23:49||doligez||Fixed in Version||=> 4.01.0+dev|
|2013-07-11 23:49||doligez||Target Version||=> 4.01.0+dev|
|2015-12-11 19:20||xleroy||Status||resolved => closed|
|2017-02-23 16:43||doligez||Category||OCaml standard library => standard library|
|Copyright © 2000 - 2011 MantisBT Group|