Severe loss of performance due to new signal handling
Date: 2006-03-21 (04:03)
From: Brian Hurt <bhurt@s...>
Subject: Re: [Caml-list] Severe loss of performance due to new signal handling

On Mon, 20 Mar 2006, Markus Mottl wrote:

> On 3/20/06, Robert Roessler <roessler@rftp.com> wrote:
>>
>> At the risk of being "irrelevant", I wanted to nail down exactly what
>> assertion is being made here: are we talking about directly executing
>> in assembly code the relevant x86[-64]/ppc/whatever instructions for
>> "read-and-clear", or going through OS-dependent access routines like
>> Windows' InterlockedExchange()?
>
> We are talking of the assembly code. See file byterun/signals_machdep.h,
> which contains the corresponding macros.

OK, poking around a little bit in byterun, I'm seeing this piece of code:

    for (signal_number = 0; signal_number < NSIG; signal_number++) {
      Read_and_clear(signal_state, caml_pending_signals[signal_number]);
      if (signal_state) caml_execute_signal(signal_number, 0);
    }

with Read_and_clear being defined as:

    #if defined(__GNUC__) && defined(__i386__)
    #define Read_and_clear(dst,src) \
      asm("xorl %0, %0; xchgl %0, %1" \
          : "=r" (dst), "=m" (src) \
          : "m" (src))

xchgl is the atomic operation (this is always atomic when referencing a
memory location, regardless of the presence or absence of a lock prefix).

Apropos of nothing, a better definition of that macro would be:

    #define Read_and_clear(dst,src) \
      asm volatile ("xchgl %0, %1" \
                    : "=r" (dst), "+m" (src) \
                    : "0" (0))

as this gives gcc the choice of how to move 0 into the register (using an
xor will still be a popular choice, but it'll occasionally do a movl
depending upon instruction scheduling choices).

Some more poking around tells me that NSIG is defined on Linux to be 64.
I think the problem is not doing an atomic operation, but doing 64 of
them. I'd be inclined to move to a bitset implementation, allowing you to
replace 64 atomic instructions with 2. On the x86, you can use the
lock bts instruction to set the bit.

Some implementation like:

    #if defined(__GNUC__) && defined(__i386__)

    typedef unsigned long sigword_t;

    #define Read_and_clear(dst,src) \
      asm volatile ("xchgl %0, %1" \
                    : "=r" (dst), "+m" (src) \
                    : "0" (0))

    #define Set_sigflag(sigflags, NR) \
      asm volatile ("lock bts %1, %0" \
                    : "+m" (*sigflags) \
                    : "rN" (NR) \
                    : "cc")

    ...

    /* CHAR_BIT comes from <limits.h> */
    #define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))
    #define NR_SIGWORDS ((NSIG + SIGWORD_BITS - 1)/SIGWORD_BITS)

    extern sigword_t caml_pending_signals[NR_SIGWORDS];

    for (i = 0; i < NR_SIGWORDS; i++) {
      sigword_t temp;
      int j;

      /* One atomic exchange fetches and clears a whole word of flags. */
      Read_and_clear(temp, caml_pending_signals[i]);
      for (j = 0; temp != 0; j++) {
        if ((temp & 1ul) != 0) {
          caml_execute_signal((i * SIGWORD_BITS) + j, 0);
        }
        temp >>= 1;
      }
    }

This is somewhat more code, but i, j, and temp would all end up in
registers, and it'd be two atomic instructions, not 64. The x86 assembly
code I can dash off from the top of my head. Similar bits of assembly can
be written for other CPUs- I just have to go dig out the right books.

Brian
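
For comparison, the same bitset idea can be sketched without per-CPU assembly by using C11 <stdatomic.h>. This is only an illustration, not code from the OCaml runtime: the names pending_signals, set_sigflag and drain_pending are hypothetical, and it assumes a C11 compiler on which atomic unsigned long is lock-free (which is what makes atomic_fetch_or usable from a signal handler). atomic_fetch_or plays the role of lock bts, and atomic_exchange plays the role of Read_and_clear.

```c
#include <limits.h>
#include <signal.h>
#include <stdatomic.h>
#include <stddef.h>

#ifndef NSIG
#define NSIG 65   /* assumption: fallback for when <signal.h> hides NSIG */
#endif

typedef unsigned long sigword_t;

#define SIGWORD_BITS (CHAR_BIT * sizeof(sigword_t))
#define NR_SIGWORDS  ((NSIG + SIGWORD_BITS - 1) / SIGWORD_BITS)

/* Hypothetical stand-in for caml_pending_signals: one bit per signal. */
static _Atomic sigword_t pending_signals[NR_SIGWORDS];

/* Mark signal `signo` as pending (the job of "lock bts" above).
   Out-of-range signal numbers are ignored. */
static void set_sigflag(int signo)
{
    if (signo < 0 || signo >= NSIG)
        return;
    atomic_fetch_or(&pending_signals[signo / SIGWORD_BITS],
                    (sigword_t)1 << (signo % SIGWORD_BITS));
}

/* Atomically fetch and clear every word of flags, one exchange per word,
   and write the pending signal numbers to `out` in increasing order.
   `out` must have room for NSIG entries. Returns how many were written. */
static int drain_pending(int *out)
{
    int count = 0;
    for (size_t i = 0; i < NR_SIGWORDS; i++) {
        sigword_t temp = atomic_exchange(&pending_signals[i], 0);
        for (int j = 0; temp != 0; j++, temp >>= 1) {
            if ((temp & 1ul) != 0)
                out[count++] = (int)(i * SIGWORD_BITS) + j;
        }
    }
    return count;
}
```

A caller would invoke set_sigflag(signo) from the signal handler and, at a safe point, call drain_pending and run the handler for each number returned. The number of atomic exchanges per poll is NR_SIGWORDS rather than NSIG, which is the saving argued for above.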