Version française
Home     About     Download     Resources     Contact us    
Browse thread
caml_oldify_local_roots takes 50% of the total runtime
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Robert Roessler <roessler@r...>
Subject: Re: [Caml-list] caml_oldify_local_roots takes 50% of the total runtime
Brian Hurt wrote:
> With respect to 4146 (minor heap size adjusts to memory size)- I'm not 
> sure this is a good idea.  32K is small enough to fit into L1 cache with 
> space left over on pretty much all systems these days (64K L1 cache 
> seems to be standard).  Having the minor heap small enough to fit into 
> L1 cache, and thus having the minor heap live in L1 cache, seems to me 
> to be a big advantage.  If for no other reason than it makes minor 
> collections a lot faster- the collector doesn't have to fault memory 
> into cache, it's already there.
> 
> If anything, I'd be inclined to add more generations.  I haven't done 
> any timings, but my first guess for reasonable values would be: Gen1 
> (youngest): 32K (fits into L1 cache), Gen2: 256K (fits into L2 cache), 
> Gen3: 8M (fits into memory), Gen4 (oldest): as large as necessary.

I have not yet looked at the details of OCaml's memory management, but 
in my commercial Smalltalk-80 implementation, I used a simple scheme:

Twin 32K "young" spaces... the allocations come from one of these 
(like OCaml's, using trivial and fast code) until it is exhausted. 
Then all of the reachable [young] objects are copied to the other 
[empty] "young" space.  Eventually, a more or less conventional 
mark-and-sweep is performed on the "old" space.

Adjustable parameters control 1) how many times an object ping-pongs 
between the two "young" spaces before it is promoted to the "old" 
space (4 was a reasonable number), and 2) what size an allocation 
needs to be before it is created as "old" (this somewhat controls the 
consumption of the "young" spaces at the expense of some improper 
"old" promotions).

This approach allowed an 8MHz 286 to perform at close to Xerox 
"Dolphin" speeds, while using [typically] less than 1% of total time 
for memory management.

Humorously, the 32K was chosen not because it would fit into cache 
(there wasn't any), but because it would allow both "young" spaces to 
fit into a single x86 64K "segment" - remember those? :)

Robert Roessler
robertr@rftp.com
http://www.rftp.com