|Anonymous | Login | Signup for a new account||2014-03-11 01:06 CET|
|Main | My View | View Issues | Change Log | Roadmap|
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0006101||OCaml||OCaml runtime system||public||2013-07-29 13:57||2014-01-17 16:24|
|Target Version||4.02.0+dev||Fixed in Version|
|Summary||0006101: Avoiding page table using contiguous heap|
|Description||Lookups in the OCaml page hashtable is clearly a bottleneck in the GC performance on 64 bits machines. I propose a patch to solve this issue.|
Here is what it does:
- At initialization of the runtime, we use mmap to allocate a big (1TB) chunk of virtual memory, with MAP_NORESERVE and PROT_NONE.
- When some memory is needed (or freed) by the major heap, we use a simple buddy allocator together with mmap.
- We replace lookups to the page table by comparisons of the pointer with the address of the big chunk.
For now, this is a proof-of-concept implementation that only works on Linux. I know it is possible to do the same at least on Windows. I suppose (without having tested) it is possible on MacosX too.
The typical performance gain is 7% - 8% (benchmarked on Coq and Compcert compilation), which is substantial given that this bottleneck exists for a large class of programs.
I would be happy to get comments from GC/runtime experts. Especially, the patch change the behavior of the following internals:
- The GC increment is modified, so that the memory allocated by the buddy allocator is a power of two most of the time. This helps the useful mmaped regions of the memory to be contiguous (the kernel refuses to have more than 2^16 intervals of mmaped memory with different properties).
- Is_in_heap know returns true if given the address of a chunk header. AFAIK, there is no reason it would be called on a block header.
- Between a call to caml_block_alloc and a subsequent call to caml_add_to_heap, the considered memory chunk will answer true to a call to Is_in_heap. AFAIK, the only place it can cause problems is in the deserializer, but I can't see why it could be.
- It is possible to limit, using setrlimit (with RLIMIT_AS), the size of the virtual memory space of a program. Obviously, with this patch, the virtual address space will be larger than 1TB, even if the program uses little physical memory. The problem is that it is the only actual way to limit the amount of memory a process uses. This is not included in this patch, but what we could do in the case the initial MAP_NORESERVE mmap fails is trying allocating memory in a contiguous manner at the end of the virtual memory space, switching back to the page table method if not possible.
|Attached Files||contiguous.patch [^] (13,070 bytes) 2013-07-29 13:57 [Show Content]|
Wow, that's a radical idea. I guess the virtual size of the process (as reported by ps) is then 1TB. This will lead to misunderstandings. But otherwise... we should figure out whether there are hardware requirements, or whether any 64 bit machine will do. To my knowledge, this depends on the type of CPU page table (I guess the 2^16 limit you observe is actually a CPU limitation).
Btw, the 1TB should be configurable - there ARE machines with that much RAM.
I guess another downside of this approach is that you'll get a sigsegv at any time when the machine really runs out of memory. So that should be handled somehow.
This comment is only loosely connected to your patch. Actually, I'm wondering whether anybody has ever tried to improve the performance of the page table. Something in my gut says that there are chances to make it better, because a hash table is not a good data structure for small amounts of data. Just for information, CPUs implement their page tables very differently. There is a very fast but small associative lookup buffer ("TLB"), and only if the entry cannot be found there, the CPU consults the main page table (which is AFAIK a B tree with fixed height).
Translated to our situation, we could try to add a small cache with fixed size (maybe 4 or 8, and this must be 128-aligned) which is linearly searched (by comparing the address with the boundaries of the memory region). This is a lot cheaper than a multiplication (for the hash function). Only if nothing is found, the real page table is consulted. The entries in the cache will need some weight, and the lightest entry is replaced (continously updating the weight without writing too often might be a challenge in this approach - btw. weight=usecount would be overinterpretation).
But anyway, this is just following my intuition that the current path isn't maxed out, and I think we should do it before doing something that is much more complex.
> Wow, that's a radical idea. I guess the virtual size of the process (as reported by ps) is then 1TB. This will lead to misunderstandings. But otherwise... we should figure out whether there are hardware requirements, or whether any 64 bit machine will do. To my knowledge, this depends on the type of CPU page table (I guess the 2^16 limit you observe is actually a CPU limitation).
There is indeed an hardware requirement: on x86-64, we have a limit of 64TB on the virtual memory space (typically 32TB in user space and 32TB in kernel space). I do not know the other 64bits architectures, but x86-64 would be a good start.
The 2^16 limit I observe is a limit of the Linux kernel. It tries to keep it data structures small. Allocating virtual memory space is lazy : in the first place, it does not touch the microprocessor data structures (for 1TB of memory, it would be 1GB of page tables).
Anyway, all that could be easily tested by the configure script.
> Btw, the 1TB should be configurable - there ARE machines with that much RAM.
Indeed. But I am not sure the Ocaml GC would be able to manage that much memory in a single process (imagine the length of one GC cycle...).
> I guess another downside of this approach is that you'll get a sigsegv at any time when the machine really runs out of memory. So that should be handled somehow.
the behavior will not change, this is already the case (though it is not a sigsegv, but an oom killer).
Gerd: concerning yous second comment: I already tried a bit (only a bit, so that you can try if you want) to do various optimizations in the hash table. My conclusion is that the main bottleneck is not the multiplication (which is VERY cheap on modern processors), but rather its non-locality (and an artificial TLB will not work here).
So the solution would indeed to use a B-tree for this page table, as this is already done in the 32 bits case. However, this should be a triple level array, given the size of the memory space. I suppose this option has been tested before implementing the hash table, and I can imagine this to be slower because of the bad cache locality. Xavier, what do you think ?
I've some time tomorrow, so maybe I just try my idea for very small n (even n=1). When I remember correctly, a multiplication still needs something in the range of 20-40 cycles, and a linear search of a few cache entries should be faster than a multiplication plus a likely cache miss.
Of course, this all depends on data locality. If there isn't any locality we can exploit, the current hash table could even be the best choice.
I don't think integer multiplication is 20 cycles. 20 years ago on the 88000 it was 4 cycles.
We should ask someone who knows for sure.
edited on: 2013-08-29 13:14
According to recent documentation multiplication latency is ~ 3 cycles. http://www.agner.org/optimize/instruction_tables.pdf [^]
This is a bit unrelated, but notice that we could reduce a bit the cost of non locality by using huge pages when possible. On recent Linux, it is possible to tell the kernel that some pages should probably be allocated as huge pages by using
madvise(addr, Page_size, MADV_HUGEPAGE);
The main constraint is that pages must be aligned on huge page size (2MB on Linux AMD64) but with your allocation method this shouldn't be a problem. On some of my test this had ~ 5% speed improvements.
Yep, I found in some AMD docs 5-8 cycles for a MUL on 64 bit data (the doc is a few years old, maybe it is even a bit faster now - the pdf posted by chambart included only 32 bit multiplications). But that should be taken with some care - due to the pipelined and out-of-order execution you cannot just add up latencies of instructions (which can actually be faster or slower than predicted).
In the meantime I've had this idea: we could special-case the first heap chunk. The user would get a new parameter for setting the size of the first chunk, and in the page table lookup function the first chunk is always checked beforehand (i.e. before checking with the hash table). If the user has an idea how much RAM is actually needed, the size parameter can be set accordingly, and all (or most) of the lookups would be for the first chunk only, and we would get a speed gain. Of course, this would make the "standard case" a bit slower, and maybe this is even unnoticeable (two extra comparsions).
|2013-07-29 13:57||jacques-henri.jourdan||New Issue|
|2013-07-29 13:57||jacques-henri.jourdan||File Added: contiguous.patch|
|2013-07-29 22:39||gerd||Note Added: 0009997|
|2013-07-29 23:52||gerd||Note Added: 0009998|
|2013-07-30 00:02||jacques-henri.jourdan||Note Added: 0009999|
|2013-07-30 00:20||jacques-henri.jourdan||Note Added: 0010000|
|2013-07-30 00:49||gerd||Note Added: 0010003|
|2013-08-28 11:49||doligez||Note Added: 0010248|
|2013-08-28 11:49||doligez||Status||new => acknowledged|
|2013-08-28 17:22||chambart||Note Added: 0010254|
|2013-08-29 13:14||chambart||Note Edited: 0010254||View Revisions|
|2013-09-08 01:28||gerd||Note Added: 0010325|
|2014-01-17 16:24||doligez||Tag Attached: patch|
|Copyright © 2000 - 2011 MantisBT Group|