Dramatic performance increase when compiling with annotations #6030

Closed
vicuna opened this issue Jun 4, 2013 · 15 comments

vicuna commented Jun 4, 2013

Original bug ID: 6030
Reporter: gmelquiond
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:19:29Z)
Resolution: fixed
Priority: normal
Severity: major
Version: 4.00.1
Fixed in version: 4.01.0+dev
Category: typing

Bug description

When compiling the core library of Why3 (lib/why3/why3.cma) on my computer, it takes about 28 seconds. What is interesting is the ratio between user time and system time: actual compilation takes 7 seconds, while the kernel wastes 21 seconds outputting files.

A look at the top four entries of the kernel profile shows:

49.56% ocamlc.opt [kernel.kallsyms] [k] aes_encrypt
 7.17% ocamlc.opt [kernel.kallsyms] [k] memcpy
 4.08% ocamlc.opt [kernel.kallsyms] [k] crypto_xor
 1.82% ocamlc.opt [kernel.kallsyms] [k] crypto_cbc_encrypt

(As you can guess from the profile, the disk is encrypted.)

If we now take a look at the compilation of a single file, e.g. src/core/term.cmo, strace shows that ocamlc does 47,000 writes to src/core/term.annot. In some cases, it performs one system call per two characters!

The culprits are the print_ident_annot and print_info functions from typing/stypes.ml. They contain format strings such as "@.)@.", which are definitely a recipe for disaster.
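
As a rough illustration of the issue (a minimal sketch under assumptions, not the actual code from typing/stypes.ml; the annots argument and both function names are invented): every "@." directive flushes the formatter and its underlying out_channel, so each annotation turns into several tiny write(2) calls, whereas "@\n" only queues a newline in the buffer.

    (* Flushing variant: "@." is equivalent to pp_print_newline, so each
       annotation causes several small writes to the .annot channel. *)
    let print_annots_flushing ppf annots =
      List.iter (fun a -> Format.fprintf ppf "(@.%s@.)@." a) annots

    (* Buffered variant: "@\n" forces a newline without flushing; the data
       accumulates in the channel buffer and one flush at the end suffices. *)
    let print_annots_buffered ppf annots =
      List.iter (fun a -> Format.fprintf ppf "(@\n%s@\n)@\n" a) annots;
      Format.pp_print_flush ppf ()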

With the attached patch, overall compilation speed is multiplied by 3.5, system time is divided by 12, and the kernel profile now looks a lot saner:

6.60% ocamlc.opt ocamlc.opt [.] mark_slice
5.21% ocamlc.opt ocamlc.opt [.] caml_page_table_lookup
3.74% ocamlc.opt ocamlc.opt [.] caml_oldify_one
2.48% ocamlc.opt [kernel.kallsyms] [k] aes_encrypt

Please apply.

File attachments

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

You should try the trunk version. It has already been optimized a lot, after observing (as you did) how slow the previous version was.

Commits:

http://caml.inria.fr/cgi-bin/viewvc.cgi?view=revision&revision=13070

http://caml.inria.fr/cgi-bin/viewvc.cgi?view=revision&revision=13071

vicuna commented Jun 4, 2013

Comment author: @gasche

Alain has already made those changes in trunk, commits 13070 and 13071. Can you confirm that compiling trunk as-is solves your problem?

vicuna commented Jun 4, 2013

Comment author: gmelquiond

This is a bit better with trunk, but nowhere near as good as with my patch. The profile with trunk looks like:

35.36% ocamlc.opt [kernel.kallsyms] [k] aes_encrypt
 4.84% ocamlc.opt [kernel.kallsyms] [k] memcpy
 3.80% ocamlc.opt ocamlc.opt [.] mark_slice
 2.98% ocamlc.opt [kernel.kallsyms] [k] crypto_xor

System time is at 8 seconds, while total time is at 14 seconds. A decent compiler should never waste 60% of its time doing I/O. (With my patch, system time is at 1.5 seconds.)

For instance, if we take a look again at the compilation of src/core/term.cmo in Why3, we see that:

  • ocaml-4.00.01 performs 47,000 calls to write,
  • ocaml-trunk performs 15,000 calls to write,
  • my patch performs only 18 calls to write.

vicuna commented Jun 4, 2013

Comment author: @gasche

The only difference I found (by code inspection) that may be relevant to performance is that Alain's code has a flush in the [Ti_expr] case of [print_info], after the opening parenthesis, while Guillaume's patch only uses "@\n" (newline with no flush). If this hypothesis holds on some long file, we can validate the change.

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

I'm working on it.

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

Commit 13742 on trunk should fix the problem. When compiling typing/typecore.cmo (with -annot), the number of writes is reduced from 29,494 to 39.

Can you try it and report how it affects performance on your example?

vicuna commented Jun 4, 2013

Comment author: gmelquiond

Note that this is not the only flush in Alain's code; there is also one hidden behind pp_print_newline. Also, I don't know the internals of Format, but the fact that there is no final flush for annotations in trunk worries me a bit. Could annotations be abruptly truncated once the compiler terminates? (If not, then the explicit flush I put in my patch is redundant.)

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

Could annotations be abruptly truncated once the compiler terminates?

Annotations are collected and all dumped at once (Stypes.dump), so I don't see how this could happen, unless you kill the process explicitly.

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

Note that this is not the only flush in Alain's code; there is also one hidden behind pp_print_newline.

The commit also gets rid of this one. All uses of Format are now on Format.str_formatter (no syscall).
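
For readers unfamiliar with that pattern, a small sketch of what it amounts to (an illustration under assumptions, not the actual Stypes.dump code; dump_annotations, filename and annots are invented names): everything is pretty-printed into Format.str_formatter, which only appends to an in-memory buffer, and the resulting string is written to the .annot file in one go.

    (* Accumulate all annotation text in memory, then emit it with a single
       buffered output_string; no syscall happens during formatting. *)
    let dump_annotations filename annots =
      let ppf = Format.str_formatter in
      List.iter (fun a -> Format.fprintf ppf "(@\n%s@\n)@\n" a) annots;
      let text = Format.flush_str_formatter () in
      let oc = open_out filename in
      output_string oc text;
      close_out oc   (* close_out flushes the channel buffer to disk *)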

vicuna commented Jun 4, 2013

Comment author: gmelquiond

I understand that they are all dumped at once, but I don't see where the final half-filled buffer is flushed to disk in the code.

Anyway, as far as I am concerned, the number of system calls for writing annotations is now optimal, so the bug can be marked as fixed.

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

I understand that they are all dumped at once, but I don't see where the final half-filled buffer is flushed to disk in the code.

All open out_channels are flushed when the process exits (Pervasives.flush_all is the default exit_function in Pervasives).
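
A tiny sketch of the behaviour being described (illustrative only; the file name and contents are made up): a channel that is never explicitly flushed or closed still gets its pending buffer written out on normal termination, because flush_all runs as the default exit function.

    let () =
      let oc = open_out "term.annot" in
      output_string oc "buffered annotation data, never explicitly flushed";
      (* No flush or close_out here: the flush_all registered as the default
         exit function writes the pending buffer when the process exits
         normally.  Killing the process would bypass it. *)
      exit 0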

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

Guillaume, thanks for the detailed report and quick feedback.

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

Actually, it would be useful to compare the total time between the trunk and your patch (which is much less invasive than my attempt to reduce the use of Format and of format strings).

vicuna commented Jun 4, 2013

Comment author: gmelquiond

I have measured again how long it takes to compile why3.cma. There is no doubt about it: your version is so much better than mine. In fact, the huge timing difference on the actual compilation time (20% overhead) kind of frightens me about the performance of Format.

Trunk:

5.30user 1.80system 0:08.72elapsed 81%CPU
5.30user 1.80system 0:08.71elapsed 81%CPU
5.29user 1.78system 0:08.66elapsed 81%CPU
5.25user 1.96system 0:08.84elapsed 81%CPU
5.35user 1.75system 0:08.71elapsed 81%CPU

Trunk + reverting stypes.ml + applying my patch (without the final flush):

6.36user 1.79system 0:09.75elapsed 83%CPU
6.34user 1.78system 0:09.72elapsed 83%CPU
6.25user 1.80system 0:09.68elapsed 83%CPU
6.18user 1.92system 0:09.72elapsed 83%CPU
6.34user 1.82system 0:09.77elapsed 83%CPU

vicuna commented Jun 4, 2013

Comment author: @alainfrisch

Thanks again for the results. They justify the extra effort put into de-Formatting this piece of code. And indeed, there are reasons to be concerned with the performance of Format and format strings. See:

http://www.lexifi.com/blog/note-about-performance-printf-and-format

Hopefully, #6017 will remove some overhead related to parsing format strings at runtime (but not the overhead related to Format itself, I guess).
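
For anyone who wants to get a feel for that overhead, a crude micro-benchmark sketch (the workload, iteration count and timing harness are arbitrary choices, not taken from the issue or the linked article) comparing Format-based printing with plain Buffer operations:

    (* Format the same million items through Format and through Buffer and
       report the CPU time spent in each case. *)
    let time label f =
      let t0 = Sys.time () in
      f ();
      Printf.printf "%s: %.3fs\n" label (Sys.time () -. t0)

    let () =
      let n = 1_000_000 in
      time "Format" (fun () ->
        let ppf = Format.str_formatter in
        for i = 1 to n do Format.fprintf ppf "(%d)@\n" i done;
        ignore (Format.flush_str_formatter ()));
      time "Buffer" (fun () ->
        let b = Buffer.create (16 * n) in
        for i = 1 to n do
          Buffer.add_char b '(';
          Buffer.add_string b (string_of_int i);
          Buffer.add_string b ")\n"
        done;
        ignore (Buffer.contents b))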
