Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007724OCamlruntime system and C interfacepublic2018-02-12 16:552018-03-16 13:58
Reportersbleazard 
Assigned To 
PrioritylowSeverityfeatureReproducibilityalways
StatusacknowledgedResolutionopen 
PlatformOSOS Version
Product Version 
Target VersionFixed in Version 
Summary0007724: Performance improvements when printing integers stored in floats
DescriptionWhen converting numbers to strings string_of_float has a significant performance impact on the X86 architecture when the number is an INT. Detecting INTs and using string_of_int significantly improves performance. Thus, using

let f2s f =
  let i = int_of_float f in
  let f1 = float_of_int i in
  if f1 = f then string_of_int i
  else string_of_float f

Has a cost of around 6.5% for floats but results in around 4x performance improvement for INTs. Here are comparisons of straight string_of_float and f2s for 1,000,000 conversions on a 64bit X86 machine:

string_of_float

1,552,011,783 cycles
2,958,207,469 instructions

f2s

  363,014,703 cycles
  945,099,124 instructions

Similar improvements would be expected on other architectures.
TagsNo tags attached.
Attached Files

- Relationships

-  Notes
(0018871)
xclerc (reporter)
2018-02-13 14:54
edited on: 2018-02-13 15:07

We can substantially lower the overhead by doing
the check in C (thus avoiding to tag/untag the
integer value) with something along the lines of:

--- a/byterun/floats.c
+++ b/byterun/floats.c
@@ -95,11 +95,17 @@ CAMLprim value caml_format_float(value fmt, value arg)
 {
   value res;
   double d = Double_val(arg);
+ intnat i;
 
 #ifdef HAS_BROKEN_PRINTF
   if (isfinite(d)) {
 #endif
- res = caml_alloc_sprintf(String_val(fmt), d);
+ i = (intnat) d;
+ if (d == (double) i) {
+ res = caml_alloc_sprintf("%ld", i); /* TODO: should use ARCH_INTNAT_PRINTF_FORMAT */
+ } else {
+ res = caml_alloc_sprintf(String_val(fmt), d);
+ }
 #ifdef HAS_BROKEN_PRINTF
   } else {
     if (isnan(d)) {


With a dummy test program generating and converting
numbers, I get:

old implementation with "really float" values:

    24,129,906,778 cycles
    37,797,606,083 instructions

new implementation with "really float" values:

    24,079,087,367 cycles
    37,844,377,075 instructions

old implementation with "actually int" values:

    12,010,671,909 cycles
    23,996,041,476 instructions

new implementation with "actually int" values:

     4,777,066,875 cycles
    10,938,940,262 instructions

(0018873)
gasche (developer)
2018-02-13 17:21

I think that with either proposal the output changes: (string_of_float 1.) returns "1.", not "1". Can you add the additional dot in your measurements?

(Also, if I read the cycle numbers correct, the patch also decreased the cycle count (but not the instruction count) on "really float" values, how is this possible?)
(0018874)
xclerc (reporter)
2018-02-13 23:33
edited on: 2018-02-13 23:53

> I think that with either proposal the output changes: (string_of_float 1.) returns "1.", not "1". Can you add the additional dot in your measurements?

I beg to disagree; the patch to `caml_format_float` applying
changes only to the C side, the returned result goes through
`valid_float_lexem` that will add the missing dot (because
the libc would not -- `printf("%.12g", 1.)` outputs "1").

> (Also, if I read the cycle numbers correct, the patch also decreased the cycle count (but not the instruction count) on "really float" values, how is this possible?)

I will not go as far as to say it did not surprise me, but
given that modern CPUs have multiple (yet non-uniform)
ALUs, my understanding is that we should not expect to
observe a constant IPC (instructions per cycle).

(0018881)
xleroy (administrator)
2018-02-17 18:02

The patch to caml_format_float proposed by @xclerc is incorrect: the required format (e.g. "%5.2f") is not honored in the it-is-an-integer path.

The patch proposed by @sbleazard is probably correct. One could wonder why this optimization is not done by the C standard library functions, if it gives such impressive speedups. On the other hand, sprintf has to cope with all sorts of FP formats, so the optimization would be harder to exploit.

Speaking of the C standard library: which version is used for the timings? Performance of e.g. glibc and MSVC-CRT differ widely...
(0018889)
xclerc (reporter)
2018-02-19 10:33

> The patch to caml_format_float proposed by @xclerc is incorrect: the required format (e.g. "%5.2f") is not honored in the it-is-an-integer path.

Of course; the patch was not meant to be a drop-in replacement for
`caml_format_float` in general, but only for `caml_format_float` when
used to implement `string_of_float` (where the passed format is
always "%.12g").


> Speaking of the C standard library: which version is used for the timings? Performance of e.g. glibc and MSVC-CRT differ widely...

I used glibc (2.17).
(0018938)
sbleazard (reporter)
2018-03-11 14:59

The flowing re-implements string_of_float in C and adding NaN/Infinity support

CAMLprim value caml_string_of_float_fast_int(value arg)
{
  CAMLparam1(arg);
  CAMLlocal1(res);
  double d = Double_val(arg);
  long long int l;

  switch (fpclassify(d)) {
    case FP_NAN:
      res = caml_copy_string("nan");
      break;
    case FP_INFINITE:
      if (d > 0)
        res = caml_copy_string("inf");
      else
        res = caml_copy_string("-inf");
      break;
    default:
      l = (long long int)d;
      if (d == (double)l)
        res = caml_alloc_sprintf("%lld.", l);
      else
        res = caml_alloc_sprintf("%.12g", d);
      break;
  }
  CAMLreturn(res);
}
(0018939)
sbleazard (reporter)
2018-03-11 15:13

Bench testing caml_string_of_float_fast_int with Core_bench gives the following results for 1000 conversions bench tested over 10s:


?????????????????????????????????????????????????????????????????
? Name ? Time/Run ? mWd/Run ? Percentage ?
?????????????????????????????????????????????????????????????????
? FP to string ? 204.37us ? 7.00kw ? 97.43% ?
? FP to string fast int ? 207.26us ? 2.00kw ? 98.81% ?
? FP to string/w int ? 209.76us ? 9.00kw ? 100.00% ?
? FP to string fast int/w int ? 80.39us ? 2.00kw ? 38.32% ?
?????????????????????????????????????????????????????????????????
- "FP to string": standard string_of_float with a float value (1.1)
- "FP to string fast int": string_of_float_fast_int with a float value (1.1)
- "FP to string/w int": string_of_float with an integer in a float (11.)
- "FP to string fast int/w int": string_of_float_fast_int with an integer in a float (11.)

Note that in the case of a floating point number the cost of the test is negligable (1.5%) in the case of small floats but string_of_float_fast_int is 6% faster with larger (11 digits before the decimal) floats.

Same test but using 11111111111.1 instead of 1.1:
?????????????????????????????????????????????????????????????????
? Name ? Time/Run ? mWd/Run ? Percentage ?
?????????????????????????????????????????????????????????????????
? FP to string ? 374.76us ? 8.00kw ? 100.00% ?
? FP to string fast int ? 353.65us ? 3.00kw ? 94.37% ?
? FP to string/w int ? 212.71us ? 9.00kw ? 56.76% ?
? FP to string fast int/w int ? 77.20us ? 2.00kw ? 20.60% ?
?????????????????????????????????????????????????????????????????

I suspect valid_float_lexem is adding a significant cost
(0018940)
sbleazard (reporter)
2018-03-11 15:22

Mantis does not display boxes well, here's another run with the results in ascii:

  Name Time/Run mWd/Run Percentage
 ----------------------------- ---------- --------- ------------
  FP to string 207.81us 7.00kw 96.46%
  FP to string fast int 209.73us 2.00kw 97.35%
  FP to string/w int 215.44us 9.00kw 100.00%
  FP to string fast int/w int 79.46us 2.00kw 36.88%


Same test but using 11111111111.1 instead of 1.1

  Name Time/Run mWd/Run Percentage
 ----------------------------- ---------- --------- ------------
  FP to string 371.93us 8.00kw 100.00%
  FP to string fast int 356.03us 3.00kw 95.72%
  FP to string/w int 214.85us 9.00kw 57.76%
  FP to string fast int/w int 78.59us 2.00kw 21.13%
(0018950)
frisch (developer)
2018-03-16 13:58

FWIW, we use a similar approach for speeding up printing of floats, also supporting floats which can be printed with few decimal digits. I was also surprised by the gain of the approach (in the fast path case), esp. for the MSVC libc, but also glibc to a lesser extent.

Here is our current code, which implements printing with full precision:

static char format_buffer[100] = "(";

CAMLprim value caml_lexem_format_float(value v_paren, value v_arg)
{
  const int shift = 5;
  double d = Double_val(v_arg);
  char *p = &format_buffer[1], *pp;
  long k;
  int zeroes, n;

  if (isfinite(d)) {
    /* Fast path when the double can be printed with a small number
       of decimal digits.

       We multiply by a bit more than 100000 to handle numbers
       such as -0.00121
    */
    long i = (long)(d * (double) 100000.00001);
    if ((double) i / (double) 100000. == d) {
      int neg = 0;
      if (i == 0) {
        if (1/d > 0) return caml_copy_string("0.");
        else if (Bool_val(v_paren)) return caml_copy_string("(-0.)");
        else return caml_copy_string("-0.");
      }

      pp = p;
      /* Print the '-' sign if needed */
      if (i < 0) {
        neg = 1;
        *(pp++) = '-';
        i = -i;
        /* if i = -2^32 (resp. 64), then -i = i... */
        if (i < 0) goto fallback;
      }

      /* Count trailing 0s */
      k = i;
      zeroes = 0;
      while (k % 10 == 0 && zeroes < shift) { k /= 10; zeroes++; }
      /* Remember the part without trailing 0s */
      i = k;
      /* Count actual digits to be printed */
      n = 0;
      while (k > 0) { k /= 10; n++; }
      /* At least (shift+1) digits, including trailing 0s (to force at least one digit before '.') */
      
if (n + zeroes <= shift) n = shift + 1 - zeroes;

      /* Move to the end, write the final \0 (and ')' if needed) */
      pp += n;
      n -= (shift - zeroes);
      if (neg && Bool_val(v_paren)) {
        *(pp + 1)= ')';
        *(pp + 2) = 0;
        p--; /* This will return the string starting at '(' */
      } else {
        *(pp + 1)= 0;
      }

      /* Write the fractional part */
      while (zeroes < shift) {
        zeroes++;
        *(pp--) = '0' + (i % 10);
        i /= 10;
      }
      *(pp--) = '.';
      /* Write the integral part */
      while (n > 1) {
        *(pp--) = '0' + (i % 10);
        i /= 10;
        n--;
      }
      *(pp--) = '0' + i;

      return caml_copy_string(p);
    }
    else {
    fallback:
      sprintf(p, "%.15g", d);
      if (strtod(p, NULL) != d)
        sprintf(p, "%.17g", d);
      /* Add '.' if needed */
      for( ; ; p++) {
        if (*p == 0) { *p = '.'; *(++p) = 0; break;}
        else if ((*p == '.') || (*p == 'e')) {break;}
      }
      if (format_buffer[1] == '-' && Bool_val(v_paren)) {
        /* Add ')' if needed (the initial '(' is already there) */
        while (*p != 0) { p++; };
        *p = ')'; *(++p) = 0;
        return caml_copy_string(format_buffer);
      } else
        return caml_copy_string(&format_buffer[1]);
    }
  } else {
    /* Could use pre-allocated values for nan/infinity */
    if (isnan(d)) return caml_copy_string("nan");
    else if (d > 0) return caml_copy_string("infinity");
    else return caml_copy_string("neg_infinity");
  }
}



Other possible ways to optimize such printing code:

  - Version with an unboxed float argument.
  - Avoid the allocation of the result, by letting the caller pass a target buffer.

- Issue History
Date Modified Username Field Change
2018-02-12 16:55 sbleazard New Issue
2018-02-13 14:54 xclerc Note Added: 0018871
2018-02-13 15:07 xclerc Note Edited: 0018871 View Revisions
2018-02-13 15:07 xclerc Note Edited: 0018871 View Revisions
2018-02-13 17:21 gasche Note Added: 0018873
2018-02-13 23:33 xclerc Note Added: 0018874
2018-02-13 23:53 xclerc Note Edited: 0018874 View Revisions
2018-02-17 18:02 xleroy Note Added: 0018881
2018-02-17 18:02 xleroy Status new => acknowledged
2018-02-19 10:33 xclerc Note Added: 0018889
2018-03-11 14:59 sbleazard Note Added: 0018938
2018-03-11 15:13 sbleazard Note Added: 0018939
2018-03-11 15:22 sbleazard Note Added: 0018940
2018-03-16 13:58 frisch Note Added: 0018950


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker