Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance: Buffer.add_char is not inlined #5872

Closed
vicuna opened this issue Jan 4, 2013 · 8 comments
Closed

Performance: Buffer.add_char is not inlined #5872

vicuna opened this issue Jan 4, 2013 · 8 comments
Milestone

Comments

@vicuna
Copy link

vicuna commented Jan 4, 2013

Original bug ID: 5872
Reporter: gerd
Status: closed (set by @xavierleroy on 2015-12-11T18:20:58Z)
Resolution: fixed
Priority: normal
Severity: minor
Version: 4.00.1
Target version: 4.01.0+dev
Fixed in version: 4.01.0+dev
Category: standard library
Related to: #5873
Monitored by: @gasche @lefessan @yakobowski @alainfrisch

Bug description

According to ocamlobjinfo, Buffer.add_char is not marked for being inlined. IMHO, this is a performance bug, because this function makes only sense if it can be inlined.

Proposed fix: change [let resize] into [let rec resize], i.e. make this function artificially recursive, as this prevents [resize] from being inlined into [add_char] (which is probably the cause for the problem).

@vicuna
Copy link
Author

vicuna commented Jan 8, 2013

Comment author: @alainfrisch

because this function makes only sense if it can be inlined.

While I can imagine that inlining could make a difference here, I don't understand this statement. Can you elaborate on why Buffer.add_char is useless if it is not inlined?

@vicuna
Copy link
Author

vicuna commented Jan 8, 2013

Comment author: gerd

Maybe not totally useless, but my understanding is that algorithms are sometimes easier to write when you treat every char separately - finally caling Buffer.add_char when you are done with a character. The alternative is a different style where you try to treat several characters at once (e.g. find first a substring, then add all at once to the buffer). Well, of course my statement makes only sense when this alternative exists. You can then view add_char as the function that enables the simpler style hopefully without impact on the performance.

Regarding the performance, at least, I recently had to rewrite an algorithm into the alternate style because Buffer.add_char was too slow (or better: it was important for me to maximize the performance, and the choice of style made a difference). My guess is that this would not have been necessary if add_char was inlined (in which case it compiles to around 15 assembly instructions in the "fast path").

Btw, if I compile buffer.ml with -inline 3, add_char is marked as inline function. It is just slightly too large. (Maybe some hint that the default inlining limit is too low.) I just checked again: A simple loop adding 100M chars to a buffer is 10% faster if add_char is inlined (for amd64).

@vicuna
Copy link
Author

vicuna commented Jan 8, 2013

Comment author: @alainfrisch

So an alternative fix would be to compile the standard library with a larger -inline factor, right?

@vicuna
Copy link
Author

vicuna commented Jan 8, 2013

Comment author: gerd

Yes.

@vicuna
Copy link
Author

vicuna commented Jan 8, 2013

Comment author: @alainfrisch

So bumping -inline for the standard library seems more robust to me than trying to work around the inlining heuristic by adding a spurious "rec". We would need to assess the impact of this change, though. There is an ongoing project to collect realistic piece of code in order to create a serious benchmarking suite for OCaml code (in order to study optimizations to the compiler, runtime system, standard library). Cf: http://gallium.inria.fr/~scherer/gagallium/we-need-a-representative-benchmark-suite/

Hopefully such a suite will make it easier to make a decision like changing the default inlining factor (or for the stdlib only).

@vicuna
Copy link
Author

vicuna commented Jan 8, 2013

Comment author: gerd

Just for this case, increasing -inline is ok.

But thinking a bit more general, one of the problems we face here is that the inlining heuristics is very imprecise. So far I understand, the metrics is on the size of an intermediate code representation of the inlining candidate. What is totally ignored is the cost of the function call (if inlining is not done), which is (also) a function of the number of registers the caller needs to save to the stack. Illustrated at the example of Buffer.add_char, this is not negligible: if you e.g. need to save 5 registers, you need 10 more instructions, and this is a lot when the called function consists only of 15 instructions. I have no good idea how you can guess the calling costs based on the intermediate code, but maybe one can use something like the number of variables in the innermost scope as estimate. My thinking is that this matters more when CPUs get more registers. But anyway, "calibrating" something like this (or even inlining in general) is probably difficult.

@vicuna
Copy link
Author

vicuna commented Mar 24, 2013

Comment author: @xavierleroy

Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because it would cause the fast path of add_char to be inlined, while keeping the slow path "offline", and that's what's best wrt performance.

@vicuna
Copy link
Author

vicuna commented Jul 11, 2013

Comment author: @damiendoligez

Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because...

Except that I've just tried and it doesn't work: add_char is still not inlined.

OTOH, compiling buffer.ml with -inline 3 does work, and it doesn't inline resize, so the slow path stays offline.

I have added the -inline 3 to Compflags (rev 13888 in branch 4.01).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant