[Caml-list] Bug? Printf, %X and negative numbers
Date: -- (:)
From: Brian Hurt <brian.hurt@q...>
Subject: Re: [Caml-list] Bug? Printf, %X and negative numbers
On Tue, 1 Apr 2003, Ville-Pertti Keinonen wrote:

> 
> > I was thinking of just a bitmask.  You are always allocating into the
> > minor heap.  You just define the low 1/32nd of the minor heap a 
> > bitmask of
> > what is a pointer and what isn't.  This would probably slow down
> > allocation- not by much, would be my prediction, but it would slow down
> > allocation.  You may gain some of that cost back by not needing to do 
> > the
> > shifts and ors we currently do to deal with the low bit.  I can't 
> > predict
> > what the overall performance delta would be- not even if it will be
> > positive or negative.
> 
> I suspect that would be quite slow, since you have to do one bit 
> operation for each word of each allocation (optimizing them is 
> difficult because you don't know the alignment).

For structures 32 words or less, you can do it like (using x86 assembly 
language here):
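	movl	heap_top, %eax		/* %eax = current allocation pointer */
	movl	%eax, %ecx
	andl	$31, %ecx		/* %ecx = low 5 bits, used as the shift count */
	movl	%eax, %ebx
	shrw	$5, %bx			/* low 16 bits / 32, high 16 bits kept; partial register stall- oh well */
	movl	$bit_pattern, %esi	/* pointer/non-pointer bits for this structure */
	xorl	%edi, %edi
	shldl	%cl, %esi, %edi		/* %edi = pattern bits that spill past bit 31 */
	shll	%cl, %esi		/* %esi = pattern shifted into position */
	orl	%esi, (%ebx)		/* merge into the bitmask */
	orl	%edi, 4(%ebx)		/* merge the spill-over into the next mask word */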

11 instructions, no branches, only one partial register stall (which can
probably be removed at the cost of four more instructions), and only one
'complex' instruction (the shld) which might take more than a clock cycle.
Some rearrangement could probably give us 2, maybe 3 instructions issued
per clock.  We're probably looking at single digit clock cycles here.
Note that I'm assuming the minor heap is 64K (removing the partial stall
would also remove this assumption).
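
For anyone who would rather read C than assembly, here is a rough sketch of
the same idea.  It is not what any OCaml runtime does; the names
(pointer_mask, mark_block, is_pointer) and the layout (a 64K heap of 32-bit
words, indexed by word rather than by address) are assumptions made up for
the example.  The 64-bit shift stands in for the shld/shl pair.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a 64 KB minor heap of 32-bit words, with one bitmask
   bit per heap word (so the bitmask is 1/32nd the size of the heap). */
#define HEAP_WORDS 16384u                 /* 64 KB / 4 bytes per word */
#define MASK_WORDS (HEAP_WORDS / 32u)

/* One extra word so the spill-over store below always has a target. */
static uint32_t pointer_mask[MASK_WORDS + 1];

/* Record which words of a freshly allocated block hold pointers.
   word_index: index of the block's first word within the heap.
   pattern:    bit i set means word i of the block is a pointer.
   Only blocks of 32 words or less fit in a single pattern word. */
static void mark_block(uint32_t word_index, uint32_t pattern)
{
    assert(word_index < HEAP_WORDS);
    uint32_t slot  = word_index / 32u;    /* which mask word */
    uint32_t shift = word_index % 32u;    /* bit position inside it */
    uint64_t wide  = (uint64_t)pattern << shift;

    pointer_mask[slot]     |= (uint32_t)wide;          /* low half */
    pointer_mask[slot + 1] |= (uint32_t)(wide >> 32);  /* spill-over */
}

/* What a collector would ask while scanning the heap. */
static int is_pointer(uint32_t word_index)
{
    assert(word_index < HEAP_WORDS);
    return (int)((pointer_mask[word_index / 32u] >> (word_index % 32u)) & 1u);
}
```

Like the assembly, this has no branches and touches at most two mask words
per allocation, no matter where the block starts.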

On the other side of this equation, we have the overhead of dealing with 
the low bit being set.  This does have a performance cost.  How much of 
one depends upon what you're doing with the ints.  But we're not talking 
about more than an instruction or two per access.  With few accesses per 
allocation the bitmask is probably a loss; with more accesses per 
allocation, it's probably a win.  But either way, we're talking about 
single digit clock cycles.
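
To make the "instruction or two per access" concrete, here is a small C
sketch of arithmetic on ints stored as 2n+1, which is how the low tag bit
works.  The helper names (tag, untag, tagged_add and so on) are invented for
this example and are not the runtime's own macros; overflow is ignored.

```c
#include <assert.h>
#include <stdint.h>

typedef intptr_t value;

/* The integer n is stored as 2n + 1, so the low bit is always set. */
static value tag(intptr_t n)
{
    return (value)(((uintptr_t)n << 1) | 1u);
}

/* Assumes >> on a negative signed value is an arithmetic shift, which is
   implementation-defined in C but true of the usual compilers. */
static intptr_t untag(value v)
{
    assert(v & 1);                 /* must carry the integer tag */
    return v >> 1;
}

/* (2x+1) + (2y+1) - 1 = 2(x+y) + 1: one extra subtract. */
static value tagged_add(value a, value b) { return a + b - 1; }

/* (2x+1) - (2y+1) + 1 = 2(x-y) + 1: one extra add. */
static value tagged_sub(value a, value b) { return a - b + 1; }

/* (2x) * y + 1 = 2(xy) + 1: clear one tag, shift the other away, re-tag. */
static value tagged_mul(value a, value b) { return (a - 1) * (b >> 1) + 1; }
```

Addition and subtraction pay one extra instruction each; multiplication
pays a few.  That is the per-access cost being weighed against the
per-allocation cost of the bitmask.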

By way of comparison, on the P4 a branch mispredict costs 20-28 clock 
cycles, a cache miss costs 100-300 clock cycles, and a page fault millions 
of clock cycles.

Brian

