New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance: Buffer.add_char is not inlined #5872
Comments
Comment author: @alainfrisch
While I can imagine that inlining could make a difference here, I don't understand this statement. Can you elaborate on why Buffer.add_char is useless if it is not inlined? |
Comment author: gerd Maybe not totally useless, but my understanding is that algorithms are sometimes easier to write when you treat every char separately - finally caling Buffer.add_char when you are done with a character. The alternative is a different style where you try to treat several characters at once (e.g. find first a substring, then add all at once to the buffer). Well, of course my statement makes only sense when this alternative exists. You can then view add_char as the function that enables the simpler style hopefully without impact on the performance. Regarding the performance, at least, I recently had to rewrite an algorithm into the alternate style because Buffer.add_char was too slow (or better: it was important for me to maximize the performance, and the choice of style made a difference). My guess is that this would not have been necessary if add_char was inlined (in which case it compiles to around 15 assembly instructions in the "fast path"). Btw, if I compile buffer.ml with -inline 3, add_char is marked as inline function. It is just slightly too large. (Maybe some hint that the default inlining limit is too low.) I just checked again: A simple loop adding 100M chars to a buffer is 10% faster if add_char is inlined (for amd64). |
Comment author: @alainfrisch So an alternative fix would be to compile the standard library with a larger -inline factor, right? |
Comment author: gerd Yes. |
Comment author: @alainfrisch So bumping -inline for the standard library seems more robust to me than trying to work around the inlining heuristic by adding a spurious "rec". We would need to assess the impact of this change, though. There is an ongoing project to collect realistic piece of code in order to create a serious benchmarking suite for OCaml code (in order to study optimizations to the compiler, runtime system, standard library). Cf: http://gallium.inria.fr/~scherer/gagallium/we-need-a-representative-benchmark-suite/ Hopefully such a suite will make it easier to make a decision like changing the default inlining factor (or for the stdlib only). |
Comment author: gerd Just for this case, increasing -inline is ok. But thinking a bit more general, one of the problems we face here is that the inlining heuristics is very imprecise. So far I understand, the metrics is on the size of an intermediate code representation of the inlining candidate. What is totally ignored is the cost of the function call (if inlining is not done), which is (also) a function of the number of registers the caller needs to save to the stack. Illustrated at the example of Buffer.add_char, this is not negligible: if you e.g. need to save 5 registers, you need 10 more instructions, and this is a lot when the called function consists only of 15 instructions. I have no good idea how you can guess the calling costs based on the intermediate code, but maybe one can use something like the number of variables in the innermost scope as estimate. My thinking is that this matters more when CPUs get more registers. But anyway, "calibrating" something like this (or even inlining in general) is probably difficult. |
Comment author: @xavierleroy Even if Gerd's suggestion (let -> let rec) is a bit of a hack, I like it because it would cause the fast path of add_char to be inlined, while keeping the slow path "offline", and that's what's best wrt performance. |
Comment author: @damiendoligez
Except that I've just tried and it doesn't work: add_char is still not inlined. OTOH, compiling buffer.ml with -inline 3 does work, and it doesn't inline resize, so the slow path stays offline. I have added the -inline 3 to Compflags (rev 13888 in branch 4.01). |
Original bug ID: 5872
Reporter: gerd
Status: closed (set by @xavierleroy on 2015-12-11T18:20:58Z)
Resolution: fixed
Priority: normal
Severity: minor
Version: 4.00.1
Target version: 4.01.0+dev
Fixed in version: 4.01.0+dev
Category: standard library
Related to: #5873
Monitored by: @gasche @lefessan @yakobowski @alainfrisch
Bug description
According to ocamlobjinfo, Buffer.add_char is not marked for being inlined. IMHO, this is a performance bug, because this function makes only sense if it can be inlined.
Proposed fix: change [let resize] into [let rec resize], i.e. make this function artificially recursive, as this prevents [resize] from being inlined into [add_char] (which is probably the cause for the problem).
The text was updated successfully, but these errors were encountered: