Severe loss of performance due to new signal handling
| Date: | -- (:) |
| From: | Brian Hurt <bhurt@s...> |
| Subject: | Re: [Caml-list] Severe loss of performance due to new signal handling |
On Mon, 20 Mar 2006, Markus Mottl wrote:
> On 3/20/06, Robert Roessler <roessler@rftp.com> wrote:
>>
>> At the risk of being "irrelevant", I wanted to nail down exactly what
>> assertion is being made here: are we talking about directly executing
>> in assembly code the relevant x86[-64]/ppc/whatever instructions for
>> "read-and-clear", or going through OS-dependent access routines like
>> Windows' InterlockedExchange()?
>
>
> We are talking of the assembly code. See file byterun/signals_machdep.h,
> which contains the corresponding macros.
OK, poking around a little bit in byterun, I'm seeing this piece of code:
for (signal_number = 0; signal_number < NSIG; signal_number++) {
  Read_and_clear(signal_state, caml_pending_signals[signal_number]);
  if (signal_state) caml_execute_signal(signal_number, 0);
}
with Read_and_clear being defined as:
#if defined(__GNUC__) && defined(__i386__)
#define Read_and_clear(dst,src) \
  asm("xorl %0, %0; xchgl %0, %1" \
      : "=r" (dst), "=m" (src) \
      : "m" (src))
xchgl is the atomic operation (this is always atomic when referencing a
memory location, regardless of the presence or absence of a lock prefix).
Apropos of nothing, a better definition of that macro would be:
#define Read_and_clear(dst,src) \
  asm volatile ("xchgl %0, %1" \
                : "=r" (dst), "+m" (src) \
                : "0" (0))
as this gives gcc the choice of how to move 0 into the register (using an
xor will still be a popular choice, but it'll occasionally do a movl
depending upon instruction scheduling choices).
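
For comparison, here is a minimal sketch of the same read-and-clear step
written with C11 <stdatomic.h> rather than inline assembly. This is not what
byterun/signals_machdep.h does, and it assumes a compiler with C11 atomics;
pending_flag and read_and_clear are hypothetical names used only for
illustration. The idea is the same as the "0" (0) constraint above: let the
compiler decide how to materialize the zero and which instruction to emit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one slot of caml_pending_signals. */
static atomic_int pending_flag;

/* Atomically fetch the current flag value and leave 0 behind.
   The compiler chooses the instruction sequence for the target. */
static int read_and_clear(atomic_int *flag)
{
    assert(flag != NULL);
    return atomic_exchange(flag, 0);
}
```

A caller would write signal_state = read_and_clear(&pending_flag); and test
signal_state exactly as the loop above does.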
Some more poking around tells me that NSIG is defined on Linux to be 64.
I think the problem is not doing an atomic operation, but doing 64 of
them. I'd be inclined to move to a bitset implementation, allowing you
to replace 64 atomic instructions with 2.
On the x86, you can use the lock bts instruction to set the bit. Some
implementation like:
#include <limits.h>   /* CHAR_BIT */

#if defined(__GNUC__) && defined(__i386__)
typedef unsigned long sigword_t;
#define Read_and_clear(dst,src) \
  asm volatile ("xchgl %0, %1" \
                : "=r" (dst), "+m" (src) \
                : "0" (0))
#define Set_sigflag(sigflags, NR) \
  asm volatile ("lock bts %1, %0" \
                : "+m" (*sigflags) \
                : "rN" (NR) \
                : "cc")
...
#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))
#define NR_SIGWORDS ((NSIG + SIGWORD_BITS - 1)/SIGWORD_BITS)
extern sigword_t caml_pending_signals[NR_SIGWORDS];
for (i = 0; i < NR_SIGWORDS; i++) {
  sigword_t temp;
  int j;
  /* Grab and clear a whole word of pending flags in one atomic exchange. */
  Read_and_clear(temp, caml_pending_signals[i]);
  /* Walk the word bit by bit; stop as soon as no set bits remain. */
  for (j = 0; temp != 0; j++) {
    if ((temp & 1ul) != 0) {
      caml_execute_signal((i * SIGWORD_BITS) + j, 0);
    }
    temp >>= 1;
  }
}
This is somewhat more code, but i, j, and temp would all end up in
registers, and it'd be two atomic instructions, not 64.
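
To make the bitset idea concrete without tying it to x86, here is a rough,
portable sketch using C11 atomics: atomic_fetch_or plays the role of lock bts
and atomic_exchange plays the role of xchgl. It assumes a C11 compiler.
DEMO_NSIG, pending, seen, record_signal, set_sigflag and process_pending are
all made-up names for illustration, and record_signal merely stands in for
caml_execute_signal.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_NSIG 64   /* assumption: the Linux NSIG value quoted above */
typedef unsigned long sigword_t;
#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))
#define NR_SIGWORDS ((DEMO_NSIG + SIGWORD_BITS - 1) / SIGWORD_BITS)

/* One bit per signal, packed into as few words as possible. */
static _Atomic sigword_t pending[NR_SIGWORDS];

/* Hypothetical handler: just records which signal numbers were seen. */
static int seen[DEMO_NSIG];
static void record_signal(int signo) { seen[signo] = 1; }

/* Mark signal nr as pending with a single atomic OR. */
static void set_sigflag(int nr)
{
    assert(nr >= 0 && nr < DEMO_NSIG);
    atomic_fetch_or(&pending[nr / SIGWORD_BITS], 1ul << (nr % SIGWORD_BITS));
}

/* Drain every pending signal, calling handle(signo) for each one.
   Costs one atomic exchange per word. Returns the number handled. */
static int process_pending(void (*handle)(int))
{
    int count = 0;
    for (size_t i = 0; i < NR_SIGWORDS; i++) {
        sigword_t temp = atomic_exchange(&pending[i], 0);
        for (int j = 0; temp != 0; j++, temp >>= 1) {
            if (temp & 1ul) {
                handle((int)(i * SIGWORD_BITS) + j);
                count++;
            }
        }
    }
    return count;
}
```

With a 32-bit sigword_t this performs two exchanges per poll; with a 64-bit
one, a single exchange covers all 64 signals.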
The x86 assembly code I can dash off from the top of my head. Similar
bits of assembly can be written for other CPUs; I just have to go dig out
the right books.
Brian