
Avoiding page table using contiguous heap #6101

Closed
vicuna opened this issue Jul 29, 2013 · 11 comments

Comments

@vicuna

vicuna commented Jul 29, 2013

Original bug ID: 6101
Reporter: @jhjourdan
Assigned to: @damiendoligez
Status: assigned (set by @mshinwell on 2016-12-08T07:23:52Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.02.0+dev
Category: runtime system and C interface
Tags: patch
Monitored by: @gasche @diml @ygrek @jmeber @hcarty @alainfrisch

Bug description

Lookups in the OCaml page hash table are clearly a bottleneck for GC performance on 64-bit machines. I propose a patch to solve this issue.

Here is what it does:

  • At initialization of the runtime, we use mmap to allocate a big (1 TB) chunk of virtual memory, with MAP_NORESERVE and PROT_NONE.
  • When memory is needed (or freed) by the major heap, we use a simple buddy allocator together with mmap.
  • We replace page table lookups with comparisons of the pointer against the bounds of the big chunk.

For now, this is a proof-of-concept implementation that only works on Linux. I know it is possible to do the same at least on Windows, and I suppose (without having tested) it is possible on Mac OS X too.

The typical performance gain is 7–8% (benchmarked on Coq and CompCert compilation), which is substantial given that this bottleneck affects a large class of programs.

I would be happy to get comments from GC/runtime experts. In particular, the patch changes the behavior of the following internals:

  • The GC increment is modified so that the memory allocated by the buddy allocator is a power of two most of the time. This helps keep the useful mmapped regions of memory contiguous (the kernel refuses to manage more than 2^16 intervals of mmapped memory with different properties).
  • Is_in_heap now returns true when given the address of a chunk header. AFAIK, there is no reason it would ever be called on a chunk header.
  • Between a call to caml_block_alloc and the subsequent call to caml_add_to_heap, the memory chunk in question will answer true to Is_in_heap. AFAIK, the only place this could cause problems is in the deserializer, but I can't see why it would.
  • It is possible to limit the virtual address space of a process using setrlimit (with RLIMIT_AS). Obviously, with this patch, the virtual address space will exceed 1 TB even if the program uses little physical memory, which defeats that limit even though it is the only practical way to cap the memory a process uses. This patch does not handle that case, but if the initial MAP_NORESERVE mmap fails, we could try allocating memory contiguously at the end of the virtual address space, falling back to the page table method if that is not possible.

File attachments

@vicuna

vicuna commented Jul 29, 2013

Comment author: gerd

Wow, that's a radical idea. I guess the virtual size of the process (as reported by ps) is then 1TB. This will lead to misunderstandings. But otherwise... we should figure out whether there are hardware requirements, or whether any 64 bit machine will do. To my knowledge, this depends on the type of CPU page table (I guess the 2^16 limit you observe is actually a CPU limitation).

Btw, the 1TB should be configurable - there ARE machines with that much RAM.

I guess another downside of this approach is that you'll get a sigsegv at any time when the machine really runs out of memory. So that should be handled somehow.

@vicuna

vicuna commented Jul 29, 2013

Comment author: gerd

This comment is only loosely connected to your patch. Actually, I'm wondering whether anybody has ever tried to improve the performance of the page table. Something in my gut says that there are chances to make it better, because a hash table is not a good data structure for small amounts of data. Just for information, CPUs implement their page tables very differently. There is a very fast but small associative lookup buffer ("TLB"), and only if the entry cannot be found there, the CPU consults the main page table (which is AFAIK a B tree with fixed height).

Translated to our situation, we could try to add a small cache of fixed size (maybe 4 or 8 entries, and it must be 128-byte aligned) which is searched linearly (by comparing the address with the boundaries of each memory region). This is a lot cheaper than a multiplication (for the hash function). Only if nothing is found there is the real page table consulted. The entries in the cache will need some weight, and the lightest entry is replaced (continuously updating the weight without writing too often might be a challenge in this approach - btw. weight=usecount would be over-interpretation).

But anyway, this is just following my intuition that the current path isn't maxed out, and I think we should do it before doing something that is much more complex.

@vicuna

vicuna commented Jul 29, 2013

Comment author: @jhjourdan

Wow, that's a radical idea. I guess the virtual size of the process (as reported by ps) is then 1TB. This will lead to misunderstandings. But otherwise... we should figure out whether there are hardware requirements, or whether any 64 bit machine will do. To my knowledge, this depends on the type of CPU page table (I guess the 2^16 limit you observe is actually a CPU limitation).

There is indeed a hardware requirement: on x86-64, we have a limit of 64 TB on the virtual memory space (typically 32 TB in user space and 32 TB in kernel space). I do not know about the other 64-bit architectures, but x86-64 would be a good start.

The 2^16 limit I observe is a limit of the Linux kernel, which tries to keep its data structures small. Allocating virtual memory space is lazy: initially, it does not touch the processor's data structures (for 1 TB of memory, that would be 1 GB of page tables).

Anyway, all that could be easily tested by the configure script.

Btw, the 1TB should be configurable - there ARE machines with that much RAM.

Indeed. But I am not sure the OCaml GC would be able to manage that much memory in a single process (imagine the length of one GC cycle...).

I guess another downside of this approach is that you'll get a sigsegv at any time when the machine really runs out of memory. So that should be handled somehow.

The behavior will not change; this is already the case (though it is not a SIGSEGV, but the OOM killer).

@vicuna

vicuna commented Jul 29, 2013

Comment author: @jhjourdan

Gerd, concerning your second comment: I already tried a bit (only a bit, so feel free to try yourself) to do various optimizations on the hash table. My conclusion is that the main bottleneck is not the multiplication (which is VERY cheap on modern processors), but rather its non-locality (and an artificial TLB will not work here).

So the solution would indeed be to use a B-tree for this page table, as is already done in the 32-bit case. However, it would have to be a three-level array, given the size of the memory space. I suppose this option was tested before the hash table was implemented, and I can imagine it being slower because of bad cache locality. Xavier, what do you think?

@vicuna

vicuna commented Jul 29, 2013

Comment author: gerd

I have some time tomorrow, so maybe I'll just try my idea for very small n (even n=1). If I remember correctly, a multiplication still needs something in the range of 20-40 cycles, and a linear search of a few cache entries should be faster than a multiplication plus a likely cache miss.

Of course, this all depends on data locality. If there isn't any locality we can exploit, the current hash table could even be the best choice.

@vicuna

vicuna commented Aug 28, 2013

Comment author: @damiendoligez

I don't think integer multiplication is 20 cycles. 20 years ago on the 88000 it was 4 cycles.
We should ask someone who knows for sure.

@vicuna

vicuna commented Aug 28, 2013

Comment author: @chambart

According to recent documentation multiplication latency is ~ 3 cycles. http://www.agner.org/optimize/instruction_tables.pdf

This is a bit unrelated, but note that we could reduce the cost of non-locality a bit by using huge pages when possible. On recent Linux, it is possible to tell the kernel that some pages should preferably be allocated as huge pages by using
madvise(addr, Page_size, MADV_HUGEPAGE);

The main constraint is that pages must be aligned on the huge page size (2 MB on Linux/AMD64), but with your allocation method this shouldn't be a problem. In some of my tests this gave ~5% speed improvement.

@vicuna

vicuna commented Sep 7, 2013

Comment author: gerd

Yep, I found 5-8 cycles in some AMD docs for a MUL on 64-bit data (the doc is a few years old, so maybe it is even a bit faster now - the PDF posted by chambart includes only 32-bit multiplications). But that should be taken with some care - due to pipelined and out-of-order execution you cannot just add up instruction latencies (actual code can be faster or slower than predicted).

In the meantime I've had this idea: we could special-case the first heap chunk. The user would get a new parameter for setting the size of the first chunk, and in the page table lookup function the first chunk would always be checked first (i.e. before consulting the hash table). If the user has an idea how much RAM is actually needed, the size parameter can be set accordingly, and all (or most) lookups would hit the first chunk only, giving a speed gain. Of course, this makes the "standard case" a bit slower, though maybe unnoticeably so (two extra comparisons).

@vicuna

vicuna commented Dec 8, 2016

Comment author: @mshinwell

OCaml now supports huge pages (OCAMLRUNPARAM 'H' option).

As for the contiguous heap: this isn't needed if the page table test can be removed or avoided most of the time. The no-naked-pointers configure option can achieve this. Using it does place some restrictions on where out-of-heap pointers may point, although I don't think they are onerous. It is worth noting that this "no naked pointers" setup is the only option for multicore OCaml -- there is no page table in that world at all.

@doligez Are you ok to close this? (I know you had an implementation of contiguous heap as well.)

@github-actions

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label Apr 26, 2021
@jhjourdan

The current plan is to get rid of the page table by using the no-naked-pointer mode. This PR is no longer relevant.
