Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements when printing integers stored in floats #7724

Closed
vicuna opened this issue Feb 12, 2018 · 13 comments
Closed

Performance improvements when printing integers stored in floats #7724

vicuna opened this issue Feb 12, 2018 · 13 comments

Comments

@vicuna
Copy link

vicuna commented Feb 12, 2018

Original bug ID: 7724
Reporter: sbleazard
Status: acknowledged (set by @xavierleroy on 2018-02-17T17:02:46Z)
Resolution: open
Priority: low
Severity: feature
Category: runtime system and C interface
Monitored by: sbleazard @nojb @hhugo @gasche @hcarty

Bug description

When converting numbers to strings string_of_float has a significant performance impact on the X86 architecture when the number is an INT. Detecting INTs and using string_of_int significantly improves performance. Thus, using

let f2s f = 
  let i = int_of_float f in
  let f1 = float_of_int i in
  if f1 = f then string_of_int i 
  else string_of_float f 

Has a cost of around 6.5% for floats but results in around 4x performance improvement for INTs. Here are comparisons of straight string_of_float and f2s for 1,000,000 conversions on a 64bit X86 machine:

string_of_float

1,552,011,783      cycles
2,958,207,469      instructions

f2s

  363,014,703      cycles                   
  945,099,124      instructions

Similar improvements would be expected on other architectures.

@vicuna
Copy link
Author

vicuna commented Feb 13, 2018

Comment author: @xclerc

We can substantially lower the overhead by doing
the check in C (thus avoiding to tag/untag the
integer value) with something along the lines of:

--- a/byterun/floats.c
+++ b/byterun/floats.c
@@ -95,11 +95,17 @@ CAMLprim value caml_format_float(value fmt, value arg)
 {
   value res;
   double d = Double_val(arg);
+  intnat i;
 
 #ifdef HAS_BROKEN_PRINTF
   if (isfinite(d)) {
 #endif
-    res = caml_alloc_sprintf(String_val(fmt), d);
+    i = (intnat) d;
+    if (d == (double) i) {
+      res = caml_alloc_sprintf("%ld", i); /* TODO: should use ARCH_INTNAT_PRINTF_FORMAT */
+    } else {
+      res = caml_alloc_sprintf(String_val(fmt), d);
+    }
 #ifdef HAS_BROKEN_PRINTF
   } else {
     if (isnan(d)) {

With a dummy test program generating and converting
numbers, I get:

old implementation with "really float" values:

    24,129,906,778      cycles
    37,797,606,083      instructions

new implementation with "really float" values:

    24,079,087,367      cycles
    37,844,377,075      instructions

old implementation with "actually int" values:

    12,010,671,909      cycles
    23,996,041,476      instructions

new implementation with "actually int" values:

     4,777,066,875      cycles
    10,938,940,262      instructions

@vicuna
Copy link
Author

vicuna commented Feb 13, 2018

Comment author: @gasche

I think that with either proposal the output changes: (string_of_float 1.) returns "1.", not "1". Can you add the additional dot in your measurements?

(Also, if I read the cycle numbers correct, the patch also decreased the cycle count (but not the instruction count) on "really float" values, how is this possible?)

@vicuna
Copy link
Author

vicuna commented Feb 13, 2018

Comment author: @xclerc

I think that with either proposal the output changes: (string_of_float 1.) returns "1.", not "1". Can you add the additional dot in your measurements?

I beg to disagree; the patch to caml_format_float applying
changes only to the C side, the returned result goes through
valid_float_lexem that will add the missing dot (because
the libc would not -- printf("%.12g", 1.) outputs "1").

(Also, if I read the cycle numbers correct, the patch also decreased the cycle count (but not the instruction count) on "really float" values, how is this possible?)

I will not go as far as to say it did not surprise me, but
given that modern CPUs have multiple (yet non-uniform)
ALUs, my understanding is that we should not expect to
observe a constant IPC (instructions per cycle).

@vicuna
Copy link
Author

vicuna commented Feb 17, 2018

Comment author: @xavierleroy

The patch to caml_format_float proposed by @xclerc is incorrect: the required format (e.g. "%5.2f") is not honored in the it-is-an-integer path.

The patch proposed by @sbleazard is probably correct. One could wonder why this optimization is not done by the C standard library functions, if it gives such impressive speedups. On the other hand, sprintf has to cope with all sorts of FP formats, so the optimization would be harder to exploit.

Speaking of the C standard library: which version is used for the timings? Performance of e.g. glibc and MSVC-CRT differ widely...

@vicuna
Copy link
Author

vicuna commented Feb 19, 2018

Comment author: @xclerc

The patch to caml_format_float proposed by @xclerc is incorrect: the required format (e.g. "%5.2f") is not honored in the it-is-an-integer path.

Of course; the patch was not meant to be a drop-in replacement for
caml_format_float in general, but only for caml_format_float when
used to implement string_of_float (where the passed format is
always "%.12g").

Speaking of the C standard library: which version is used for the timings? Performance of e.g. glibc and MSVC-CRT differ widely...

I used glibc (2.17).

@vicuna
Copy link
Author

vicuna commented Mar 11, 2018

Comment author: sbleazard

The flowing re-implements string_of_float in C and adding NaN/Infinity support

CAMLprim value caml_string_of_float_fast_int(value arg)
{
  CAMLparam1(arg);
  CAMLlocal1(res);
  double d = Double_val(arg);
  long long int l;

  switch (fpclassify(d)) {
    case FP_NAN:
      res = caml_copy_string("nan");
      break;
    case FP_INFINITE:
      if (d > 0)
        res = caml_copy_string("inf");
      else
        res = caml_copy_string("-inf");
      break;
    default:
      l = (long long int)d;
      if (d == (double)l)
        res = caml_alloc_sprintf("%lld.", l);
      else
        res = caml_alloc_sprintf("%.12g", d);
      break;
  }
  CAMLreturn(res);
}

@vicuna
Copy link
Author

vicuna commented Mar 11, 2018

Comment author: sbleazard

Bench testing caml_string_of_float_fast_int with Core_bench gives the following results for 1000 conversions bench tested over 10s:

?????????????????????????????????????????????????????????????????
? Name                        ? Time/Run ? mWd/Run ? Percentage ?
?????????????????????????????????????????????????????????????????
? FP to string                ? 204.37us ?  7.00kw ?     97.43% ?
? FP to string fast int       ? 207.26us ?  2.00kw ?     98.81% ?
? FP to string/w int          ? 209.76us ?  9.00kw ?    100.00% ?
? FP to string fast int/w int ?  80.39us ?  2.00kw ?     38.32% ?
?????????????????????????????????????????????????????????????????
- "FP to string": standard string_of_float with a float value (1.1)
- "FP to string fast int": string_of_float_fast_int  with a float value (1.1)
- "FP to string/w int": string_of_float with an integer in a float (11.)
- "FP to string fast int/w int": string_of_float_fast_int with an integer in a float (11.)

Note that in the case of a floating point number the cost of the test is negligable (1.5%) in the case of small floats but string_of_float_fast_int is 6% faster with larger (11 digits before the decimal) floats.

Same test but using 11111111111.1 instead of 1.1:

?????????????????????????????????????????????????????????????????
? Name                        ? Time/Run ? mWd/Run ? Percentage ?
?????????????????????????????????????????????????????????????????
? FP to string                ? 374.76us ?  8.00kw ?    100.00% ?
? FP to string fast int       ? 353.65us ?  3.00kw ?     94.37% ?
? FP to string/w int          ? 212.71us ?  9.00kw ?     56.76% ?
? FP to string fast int/w int ?  77.20us ?  2.00kw ?     20.60% ?
?????????????????????????????????????????????????????????????????

I suspect valid_float_lexem is adding a significant cost

@vicuna
Copy link
Author

vicuna commented Mar 11, 2018

Comment author: sbleazard

Mantis does not display boxes well, here's another run with the results in ascii:

  Name                          Time/Run   mWd/Run   Percentage
 ----------------------------- ---------- --------- ------------
  FP to string                  207.81us    7.00kw       96.46%
  FP to string fast int         209.73us    2.00kw       97.35%
  FP to string/w int            215.44us    9.00kw      100.00%
  FP to string fast int/w int    79.46us    2.00kw       36.88%

Same test but using 11111111111.1 instead of 1.1

  Name                          Time/Run   mWd/Run   Percentage
 ----------------------------- ---------- --------- ------------
  FP to string                  371.93us    8.00kw      100.00%
  FP to string fast int         356.03us    3.00kw       95.72%
  FP to string/w int            214.85us    9.00kw       57.76%
  FP to string fast int/w int    78.59us    2.00kw       21.13%

@vicuna
Copy link
Author

vicuna commented Mar 16, 2018

Comment author: @alainfrisch

FWIW, we use a similar approach for speeding up printing of floats, also supporting floats which can be printed with few decimal digits. I was also surprised by the gain of the approach (in the fast path case), esp. for the MSVC libc, but also glibc to a lesser extent.

Here is our current code, which implements printing with full precision:

static char format_buffer[100] = "(";

CAMLprim value caml_lexem_format_float(value v_paren, value v_arg)
{
  const int shift = 5;
  double d = Double_val(v_arg);
  char *p = &format_buffer[1], *pp;
  long k;
  int zeroes, n;

  if (isfinite(d)) {
    /* Fast path when the double can be printed with a small number
       of decimal digits.

       We multiply by a bit more than 100000 to handle numbers
       such as -0.00121
    */
    long i = (long)(d * (double) 100000.00001);
    if ((double) i / (double) 100000. == d) {
      int neg = 0;
      if (i == 0) {
        if (1/d > 0) return caml_copy_string("0.");
        else if (Bool_val(v_paren)) return caml_copy_string("(-0.)");
        else return caml_copy_string("-0.");
      }

      pp = p;
      /* Print the '-' sign if needed */
      if (i < 0) {
        neg = 1;
        *(pp++) = '-';
        i = -i;
        /* if i = -2^32 (resp. 64), then -i = i... */
        if (i < 0) goto fallback;
      }

      /* Count trailing 0s */
      k = i;
      zeroes = 0;
      while (k % 10 == 0 && zeroes < shift) { k /= 10; zeroes++; }
      /* Remember the part without trailing 0s */
      i = k;
      /* Count actual digits to be printed */
      n = 0;
      while (k > 0) { k /= 10; n++; }
      /* At least (shift+1) digits, including trailing 0s (to force at least one digit before '.') */
      if (n + zeroes <= shift) n = shift + 1 - zeroes;

      /* Move to the end, write the final \0 (and ')' if needed) */
      pp += n;
      n -= (shift - zeroes);
      if (neg && Bool_val(v_paren)) {
        *(pp + 1)= ')';
        *(pp + 2) = 0;
        p--; /* This will return the string starting at '(' */
      } else {
        *(pp + 1)= 0;
      }

      /* Write the fractional part */
      while (zeroes < shift) {
        zeroes++;
        *(pp--) = '0' + (i % 10);
        i /= 10;
      }
      *(pp--) = '.';
      /* Write the integral part */
      while (n > 1) {
        *(pp--) = '0' + (i % 10);
        i /= 10;
        n--;
      }
      *(pp--) = '0' + i;

      return caml_copy_string(p);
    }
    else {
    fallback:
      sprintf(p, "%.15g", d);
      if (strtod(p, NULL) != d)
        sprintf(p, "%.17g", d);
      /* Add '.' if needed */
      for( ; ; p++) {
        if (*p == 0) { *p = '.'; *(++p) = 0; break;}
        else if ((*p == '.') || (*p == 'e')) {break;}
      }
      if (format_buffer[1] == '-' && Bool_val(v_paren)) {
        /* Add ')' if needed (the initial '(' is already there) */
        while (*p != 0) { p++; };
        *p = ')'; *(++p) = 0;
        return caml_copy_string(format_buffer);
      } else
        return caml_copy_string(&format_buffer[1]);
    }
  } else {
    /* Could use pre-allocated values for nan/infinity */
    if (isnan(d)) return caml_copy_string("nan");
    else if (d > 0) return caml_copy_string("infinity");
    else return caml_copy_string("neg_infinity");
  }
}

Other possible ways to optimize such printing code:

  • Version with an unboxed float argument.
  • Avoid the allocation of the result, by letting the caller pass a target buffer.

@github-actions
Copy link

github-actions bot commented May 7, 2020

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label May 7, 2020
@github-actions github-actions bot closed this as completed Jun 8, 2020
@gasche
Copy link
Member

gasche commented Jan 5, 2021

I have the impression that this optimization could make a sizeable difference for many user programs. (Excepting that numbers are floats is unfortunately common these days thanks to Javascript and JSON.) There are several proposals in this thread, I like this one (caml_string_of_float_fast_int) which has fairly simple code and both seems correct and solves the reported efficiency. Shall I create a PR?

(@alainfrisch's code may be faster and appropriate in other cases as well, it is worth measuring, and there are new algorithms proposed from time to time to print floating-point numbers. But I'm less excited about having to review all this code and maintain our own float-printing implementation in the runtime.)

@alainfrisch
Copy link
Contributor

There are several proposals in this thread, I like this one (caml_string_of_float_fast_int) which has fairly simple code and both seems correct and solves the reported efficiency.

Not tested, but I seems it changes the result of string_of_float (-0.).

@gasche
Copy link
Member

gasche commented Jan 5, 2021

Ouch! I have not actually reviewed this code yet; so "seems correct" was a bit of an exaggeration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants