Performance improvements when printing integers stored in floats #7724

vicuna · 2018-02-12T15:55:30Z

Original bug ID: 7724
Reporter: sbleazard
Status: acknowledged (set by @xavierleroy on 2018-02-17T17:02:46Z)
Resolution: open
Priority: low
Severity: feature
Category: runtime system and C interface
Monitored by: sbleazard @nojb @hhugo @gasche @hcarty

Bug description

When converting numbers to strings string_of_float has a significant performance impact on the X86 architecture when the number is an INT. Detecting INTs and using string_of_int significantly improves performance. Thus, using

let f2s f = 
  let i = int_of_float f in
  let f1 = float_of_int i in
  if f1 = f then string_of_int i 
  else string_of_float f

Has a cost of around 6.5% for floats but results in around 4x performance improvement for INTs. Here are comparisons of straight string_of_float and f2s for 1,000,000 conversions on a 64bit X86 machine:

string_of_float

1,552,011,783      cycles
2,958,207,469      instructions

f2s

  363,014,703      cycles                   
  945,099,124      instructions

Similar improvements would be expected on other architectures.

The text was updated successfully, but these errors were encountered:

vicuna · 2018-02-13T13:54:07Z

Comment author: @xclerc

We can substantially lower the overhead by doing
the check in C (thus avoiding to tag/untag the
integer value) with something along the lines of:

--- a/byterun/floats.c
+++ b/byterun/floats.c
@@ -95,11 +95,17 @@ CAMLprim value caml_format_float(value fmt, value arg)
 {
   value res;
   double d = Double_val(arg);
+  intnat i;
 
 #ifdef HAS_BROKEN_PRINTF
   if (isfinite(d)) {
 #endif
-    res = caml_alloc_sprintf(String_val(fmt), d);
+    i = (intnat) d;
+    if (d == (double) i) {
+      res = caml_alloc_sprintf("%ld", i); /* TODO: should use ARCH_INTNAT_PRINTF_FORMAT */
+    } else {
+      res = caml_alloc_sprintf(String_val(fmt), d);
+    }
 #ifdef HAS_BROKEN_PRINTF
   } else {
     if (isnan(d)) {

With a dummy test program generating and converting
numbers, I get:

old implementation with "really float" values:

    24,129,906,778      cycles
    37,797,606,083      instructions

new implementation with "really float" values:

    24,079,087,367      cycles
    37,844,377,075      instructions

old implementation with "actually int" values:

    12,010,671,909      cycles
    23,996,041,476      instructions

new implementation with "actually int" values:

     4,777,066,875      cycles
    10,938,940,262      instructions

vicuna · 2018-02-13T16:21:18Z

Comment author: @gasche

I think that with either proposal the output changes: (string_of_float 1.) returns "1.", not "1". Can you add the additional dot in your measurements?

(Also, if I read the cycle numbers correct, the patch also decreased the cycle count (but not the instruction count) on "really float" values, how is this possible?)

vicuna · 2018-02-13T22:33:30Z

Comment author: @xclerc

I think that with either proposal the output changes: (string_of_float 1.) returns "1.", not "1". Can you add the additional dot in your measurements?

I beg to disagree; the patch to caml_format_float applying
changes only to the C side, the returned result goes through
valid_float_lexem that will add the missing dot (because
the libc would not -- printf("%.12g", 1.) outputs "1").

(Also, if I read the cycle numbers correct, the patch also decreased the cycle count (but not the instruction count) on "really float" values, how is this possible?)

I will not go as far as to say it did not surprise me, but
given that modern CPUs have multiple (yet non-uniform)
ALUs, my understanding is that we should not expect to
observe a constant IPC (instructions per cycle).

vicuna · 2018-02-17T17:02:36Z

Comment author: @xavierleroy

The patch to caml_format_float proposed by @xclerc is incorrect: the required format (e.g. "%5.2f") is not honored in the it-is-an-integer path.

The patch proposed by @sbleazard is probably correct. One could wonder why this optimization is not done by the C standard library functions, if it gives such impressive speedups. On the other hand, sprintf has to cope with all sorts of FP formats, so the optimization would be harder to exploit.

Speaking of the C standard library: which version is used for the timings? Performance of e.g. glibc and MSVC-CRT differ widely...

vicuna · 2018-02-19T09:33:50Z

Comment author: @xclerc

The patch to caml_format_float proposed by @xclerc is incorrect: the required format (e.g. "%5.2f") is not honored in the it-is-an-integer path.

Of course; the patch was not meant to be a drop-in replacement for
caml_format_float in general, but only for caml_format_float when
used to implement string_of_float (where the passed format is
always "%.12g").

Speaking of the C standard library: which version is used for the timings? Performance of e.g. glibc and MSVC-CRT differ widely...

I used glibc (2.17).

vicuna · 2018-03-11T13:59:15Z

Comment author: sbleazard

The flowing re-implements string_of_float in C and adding NaN/Infinity support

CAMLprim value caml_string_of_float_fast_int(value arg)
{
  CAMLparam1(arg);
  CAMLlocal1(res);
  double d = Double_val(arg);
  long long int l;

  switch (fpclassify(d)) {
    case FP_NAN:
      res = caml_copy_string("nan");
      break;
    case FP_INFINITE:
      if (d > 0)
        res = caml_copy_string("inf");
      else
        res = caml_copy_string("-inf");
      break;
    default:
      l = (long long int)d;
      if (d == (double)l)
        res = caml_alloc_sprintf("%lld.", l);
      else
        res = caml_alloc_sprintf("%.12g", d);
      break;
  }
  CAMLreturn(res);
}

vicuna · 2018-03-11T14:13:33Z

Comment author: sbleazard

Bench testing caml_string_of_float_fast_int with Core_bench gives the following results for 1000 conversions bench tested over 10s:

?????????????????????????????????????????????????????????????????
? Name                        ? Time/Run ? mWd/Run ? Percentage ?
?????????????????????????????????????????????????????????????????
? FP to string                ? 204.37us ?  7.00kw ?     97.43% ?
? FP to string fast int       ? 207.26us ?  2.00kw ?     98.81% ?
? FP to string/w int          ? 209.76us ?  9.00kw ?    100.00% ?
? FP to string fast int/w int ?  80.39us ?  2.00kw ?     38.32% ?
?????????????????????????????????????????????????????????????????
- "FP to string": standard string_of_float with a float value (1.1)
- "FP to string fast int": string_of_float_fast_int  with a float value (1.1)
- "FP to string/w int": string_of_float with an integer in a float (11.)
- "FP to string fast int/w int": string_of_float_fast_int with an integer in a float (11.)

Note that in the case of a floating point number the cost of the test is negligable (1.5%) in the case of small floats but string_of_float_fast_int is 6% faster with larger (11 digits before the decimal) floats.

Same test but using 11111111111.1 instead of 1.1:

?????????????????????????????????????????????????????????????????
? Name                        ? Time/Run ? mWd/Run ? Percentage ?
?????????????????????????????????????????????????????????????????
? FP to string                ? 374.76us ?  8.00kw ?    100.00% ?
? FP to string fast int       ? 353.65us ?  3.00kw ?     94.37% ?
? FP to string/w int          ? 212.71us ?  9.00kw ?     56.76% ?
? FP to string fast int/w int ?  77.20us ?  2.00kw ?     20.60% ?
?????????????????????????????????????????????????????????????????

I suspect valid_float_lexem is adding a significant cost

vicuna · 2018-03-11T14:22:33Z

Comment author: sbleazard

Mantis does not display boxes well, here's another run with the results in ascii:

  Name                          Time/Run   mWd/Run   Percentage
 ----------------------------- ---------- --------- ------------
  FP to string                  207.81us    7.00kw       96.46%
  FP to string fast int         209.73us    2.00kw       97.35%
  FP to string/w int            215.44us    9.00kw      100.00%
  FP to string fast int/w int    79.46us    2.00kw       36.88%

Same test but using 11111111111.1 instead of 1.1

  Name                          Time/Run   mWd/Run   Percentage
 ----------------------------- ---------- --------- ------------
  FP to string                  371.93us    8.00kw      100.00%
  FP to string fast int         356.03us    3.00kw       95.72%
  FP to string/w int            214.85us    9.00kw       57.76%
  FP to string fast int/w int    78.59us    2.00kw       21.13%

vicuna · 2018-03-16T12:58:11Z

Comment author: @alainfrisch

FWIW, we use a similar approach for speeding up printing of floats, also supporting floats which can be printed with few decimal digits. I was also surprised by the gain of the approach (in the fast path case), esp. for the MSVC libc, but also glibc to a lesser extent.

Here is our current code, which implements printing with full precision:

static char format_buffer[100] = "(";

CAMLprim value caml_lexem_format_float(value v_paren, value v_arg)
{
  const int shift = 5;
  double d = Double_val(v_arg);
  char *p = &format_buffer[1], *pp;
  long k;
  int zeroes, n;

  if (isfinite(d)) {
    /* Fast path when the double can be printed with a small number
       of decimal digits.

       We multiply by a bit more than 100000 to handle numbers
       such as -0.00121
    */
    long i = (long)(d * (double) 100000.00001);
    if ((double) i / (double) 100000. == d) {
      int neg = 0;
      if (i == 0) {
        if (1/d > 0) return caml_copy_string("0.");
        else if (Bool_val(v_paren)) return caml_copy_string("(-0.)");
        else return caml_copy_string("-0.");
      }

      pp = p;
      /* Print the '-' sign if needed */
      if (i < 0) {
        neg = 1;
        *(pp++) = '-';
        i = -i;
        /* if i = -2^32 (resp. 64), then -i = i... */
        if (i < 0) goto fallback;
      }

      /* Count trailing 0s */
      k = i;
      zeroes = 0;
      while (k % 10 == 0 && zeroes < shift) { k /= 10; zeroes++; }
      /* Remember the part without trailing 0s */
      i = k;
      /* Count actual digits to be printed */
      n = 0;
      while (k > 0) { k /= 10; n++; }
      /* At least (shift+1) digits, including trailing 0s (to force at least one digit before '.') */
      if (n + zeroes <= shift) n = shift + 1 - zeroes;

      /* Move to the end, write the final \0 (and ')' if needed) */
      pp += n;
      n -= (shift - zeroes);
      if (neg && Bool_val(v_paren)) {
        *(pp + 1)= ')';
        *(pp + 2) = 0;
        p--; /* This will return the string starting at '(' */
      } else {
        *(pp + 1)= 0;
      }

      /* Write the fractional part */
      while (zeroes < shift) {
        zeroes++;
        *(pp--) = '0' + (i % 10);
        i /= 10;
      }
      *(pp--) = '.';
      /* Write the integral part */
      while (n > 1) {
        *(pp--) = '0' + (i % 10);
        i /= 10;
        n--;
      }
      *(pp--) = '0' + i;

      return caml_copy_string(p);
    }
    else {
    fallback:
      sprintf(p, "%.15g", d);
      if (strtod(p, NULL) != d)
        sprintf(p, "%.17g", d);
      /* Add '.' if needed */
      for( ; ; p++) {
        if (*p == 0) { *p = '.'; *(++p) = 0; break;}
        else if ((*p == '.') || (*p == 'e')) {break;}
      }
      if (format_buffer[1] == '-' && Bool_val(v_paren)) {
        /* Add ')' if needed (the initial '(' is already there) */
        while (*p != 0) { p++; };
        *p = ')'; *(++p) = 0;
        return caml_copy_string(format_buffer);
      } else
        return caml_copy_string(&format_buffer[1]);
    }
  } else {
    /* Could use pre-allocated values for nan/infinity */
    if (isnan(d)) return caml_copy_string("nan");
    else if (d > 0) return caml_copy_string("infinity");
    else return caml_copy_string("neg_infinity");
  }
}

Other possible ways to optimize such printing code:

Version with an unboxed float argument.
Avoid the allocation of the result, by letting the caller pass a target buffer.

github-actions · 2020-05-07T04:17:02Z

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

gasche · 2021-01-05T07:21:51Z

I have the impression that this optimization could make a sizeable difference for many user programs. (Excepting that numbers are floats is unfortunately common these days thanks to Javascript and JSON.) There are several proposals in this thread, I like this one (caml_string_of_float_fast_int) which has fairly simple code and both seems correct and solves the reported efficiency. Shall I create a PR?

(@alainfrisch's code may be faster and appropriate in other cases as well, it is worth measuring, and there are new algorithms proposed from time to time to print floating-point numbers. But I'm less excited about having to review all this code and maintain our own float-printing implementation in the runtime.)

alainfrisch · 2021-01-05T08:23:18Z

There are several proposals in this thread, I like this one (caml_string_of_float_fast_int) which has fairly simple code and both seems correct and solves the reported efficiency.

Not tested, but I seems it changes the result of string_of_float (-0.).

gasche · 2021-01-05T08:54:11Z

Ouch! I have not actually reviewed this code yet; so "seems correct" was a bit of an exaggeration.

vicuna added stdlib feature-wish labels Mar 14, 2019

github-actions bot added the Stale label May 7, 2020

github-actions bot closed this as completed Jun 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements when printing integers stored in floats #7724

Performance improvements when printing integers stored in floats #7724

vicuna commented Feb 12, 2018 •

edited by yallop

vicuna commented Feb 13, 2018 •

edited by yallop

vicuna commented Feb 13, 2018

vicuna commented Feb 13, 2018

vicuna commented Feb 17, 2018

vicuna commented Feb 19, 2018

vicuna commented Mar 11, 2018 •

edited by yallop

vicuna commented Mar 11, 2018 •

edited by yallop

vicuna commented Mar 11, 2018 •

edited by yallop

vicuna commented Mar 16, 2018 •

edited by yallop

github-actions bot commented May 7, 2020

gasche commented Jan 5, 2021

alainfrisch commented Jan 5, 2021

gasche commented Jan 5, 2021

Performance improvements when printing integers stored in floats #7724

Performance improvements when printing integers stored in floats #7724

Comments

vicuna commented Feb 12, 2018 • edited by yallop

Bug description

vicuna commented Feb 13, 2018 • edited by yallop

vicuna commented Feb 13, 2018

vicuna commented Feb 13, 2018

vicuna commented Feb 17, 2018

vicuna commented Feb 19, 2018

vicuna commented Mar 11, 2018 • edited by yallop

vicuna commented Mar 11, 2018 • edited by yallop

vicuna commented Mar 11, 2018 • edited by yallop

vicuna commented Mar 16, 2018 • edited by yallop

github-actions bot commented May 7, 2020

gasche commented Jan 5, 2021

alainfrisch commented Jan 5, 2021

gasche commented Jan 5, 2021

vicuna commented Feb 12, 2018 •

edited by yallop

vicuna commented Feb 13, 2018 •

edited by yallop

vicuna commented Mar 11, 2018 •

edited by yallop

vicuna commented Mar 11, 2018 •

edited by yallop

vicuna commented Mar 11, 2018 •

edited by yallop

vicuna commented Mar 16, 2018 •

edited by yallop