Severe loss of performance due to new signal handling
Date: -- (:)
From: Brian Hurt <bhurt@s...>
Subject: Re: [Caml-list] Severe loss of performance due to new signal handling


On Mon, 20 Mar 2006, Markus Mottl wrote:

> On 3/20/06, Robert Roessler <roessler@rftp.com> wrote:
>>
>> At the risk of being "irrelevant", I wanted to nail down exactly what
>> assertion is being made here: are we talking about directly executing
>> in assembly code the relevant x86[-64]/ppc/whatever instructions for
>> "read-and-clear", or going through OS-dependent access routines like
>> Windows' InterlockedExchange()?
>
>
> We are talking of the assembly code.  See file byterun/signals_machdep.h,
> which contains the corresponding macros.

OK, poking around a little bit in byterun, I'm seeing this piece of code:

   for (signal_number = 0; signal_number < NSIG; signal_number++) {
     Read_and_clear(signal_state, caml_pending_signals[signal_number]);
     if (signal_state) caml_execute_signal(signal_number, 0);
   }

with Read_and_clear being defined as:

#if defined(__GNUC__) && defined(__i386__)

#define Read_and_clear(dst,src) \
   asm("xorl %0, %0; xchgl %0, %1" \
       : "=r" (dst), "=m" (src) \
       : "m" (src))


xchgl is the atomic operation (this is always atomic when referencing a 
memory location, regardless of the presence or absence of a lock prefix).

Apropos of nothing, a better definition of that macro would be:

#define Read_and_clear(dst,src) \
    asm volatile ("xchgl	%0, %1" \
        : "=r" (dst), "+m" (src) \
        : "0" (0))

as this gives gcc the choice of how to move 0 into the register (using an 
xor will still be a popular choice, but it'll occasionally do a movl 
depending upon instruction scheduling choices).
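
For comparison, the same read-and-clear step can be sketched without inline assembly. This is only an illustration, not the code in byterun: it assumes a compiler with C11 <stdatomic.h>, and the names pending_flag, read_and_clear and read_and_clear_demo are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for one entry of caml_pending_signals. */
static atomic_long pending_flag;

/* Atomically fetch the current value and leave 0 behind, so a signal
   that arrives between the "read" and the "clear" cannot be lost.
   On x86 a compiler will typically emit an xchg for this. */
static long read_and_clear(atomic_long *flag)
{
    return atomic_exchange(flag, 0);
}

/* Small self-check: the first read sees the flag, the second sees 0. */
static void read_and_clear_demo(void)
{
    atomic_store(&pending_flag, 1);
    assert(read_and_clear(&pending_flag) == 1);
    assert(read_and_clear(&pending_flag) == 0);
}
```

The point is the same as with the "0" input constraint above: the compiler picks how to materialise the zero, and the only atomic step is the exchange itself.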

Some more poking around tells me that NSIG is defined on Linux to be 64.

I think the problem is not doing an atomic operation, but doing 64 of 
them on every check.  I'd be inclined to move to a bitset implementation, 
which would let you replace 64 atomic instructions with 2 (two 32-bit 
words cover all 64 signals).

On the x86, you can use the lock bts instruction to set the bit.  
Something like:

#if defined(__GNUC__) && defined(__i386__)

     typedef unsigned long sigword_t;

#define Read_and_clear(dst,src) \
    asm volatile ("xchgl	%0, %1" \
        : "=r" (dst), "+m" (src) \
        : "0" (0))

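/* "Ir": an immediate only for bit numbers 0-31 (bts takes an immediate
   modulo the operand size), otherwise a register, which can index past
   the first word.  "memory" because the bit may live beyond *sigflags. */
#define Set_sigflag(sigflags, NR) \
    asm volatile ("lock btsl %1, %0" \
        : "+m" (*(sigflags)) \
        : "Ir" (NR) \
        : "cc", "memory")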

...

#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))

#define NR_SIGWORDS ((NSIG + SIGWORD_BITS - 1)/SIGWORD_BITS)

   extern sigword_t caml_pending_signals[NR_SIGWORDS];

   for (i = 0; i < NR_SIGWORDS; i++) {
       sigword_t temp;
       int j;

       Read_and_clear(temp, caml_pending_signals[i]);
       for (j = 0; temp != 0; j++) {
           if ((temp & 1ul) != 0) {
               caml_execute_signal((i * SIGWORD_BITS) + j, 0);
           }
           temp >>= 1;
       }
   }


This is somewhat more code, but i, j, and temp would all end up in 
registers, and it'd be two atomic instructions, not 64.
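
To make the bitset idea concrete without per-CPU assembly, here is a rough, self-contained sketch using C11 atomics. It is not a patch against byterun and rests on several assumptions: a compiler with <stdatomic.h>, lock-free atomic_ulong on the target, and a signal count of 64 as quoted above. The names SKETCH_NSIG, pending, set_sigflag, drain_pending, seen and record_signal are all hypothetical; record_signal merely stands in for caml_execute_signal.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_NSIG 64  /* assumption: the Linux NSIG value quoted above */
#define SIGWORD_BITS (CHAR_BIT * sizeof(unsigned long))
#define NR_SIGWORDS ((SKETCH_NSIG + SIGWORD_BITS - 1) / SIGWORD_BITS)

/* One bit per signal instead of one word per signal. */
static atomic_ulong pending[NR_SIGWORDS];

/* Signal-handler side: atomically set the bit for signo.
   (The assert is for the sketch only; a real handler would not use it.) */
static void set_sigflag(int signo)
{
    assert(signo >= 0 && signo < SKETCH_NSIG);
    atomic_fetch_or(&pending[signo / SIGWORD_BITS],
                    1ul << (signo % SIGWORD_BITS));
}

/* Polling side: one atomic exchange per word, then an ordinary scan of
   the private copy.  Returns the number of signals handed to handler. */
static int drain_pending(void (*handler)(int signo))
{
    int count = 0;
    for (size_t i = 0; i < NR_SIGWORDS; i++) {
        unsigned long temp = atomic_exchange(&pending[i], 0ul);
        for (int j = 0; temp != 0; j++, temp >>= 1) {
            if (temp & 1ul) {
                handler((int)(i * SIGWORD_BITS) + j);
                count++;
            }
        }
    }
    return count;
}

/* Hypothetical stand-in for caml_execute_signal: just records the call. */
static int seen[SKETCH_NSIG];
static void record_signal(int signo) { seen[signo]++; }
```

Calling set_sigflag(2) and set_sigflag(40) and then drain_pending(record_signal) should deliver exactly those two signals, and a second drain should find nothing. With 32-bit words that costs two atomic exchanges per poll; with 64-bit words, one.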

The x86 assembly code I can dash off from the top of my head.  Similar 
bits of assembly can be written for other CPUs; I just have to go dig out 
the right books.

Brian