Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005443OCamlOCaml generalpublic2011-12-23 16:372013-08-31 12:46
Reporterjohnwhitington 
Assigned Tolefessan 
PrioritynormalSeverityfeatureReproducibilityhave not tried
StatusclosedResolutionsuspended 
PlatformOSOS Version
Product Version 
Target VersionFixed in Version 
Summary0005443: Writing an integer to a file as a string requires allocation
DescriptionHallo. Happy Christmas.

One might expect that flattening a data structure, (such as this one from CamlPDF) to a file would not require data allocation.

type pdfobject =
  | Null
  | Boolean of bool
  | Integer of int
  | Real of float
  | String of string
  | Name of string
  | Array of pdfobject list
  | Dictionary of (string * pdfobject) list
  | Stream of (pdfobject * stream) ref
  | Indirect of int

(This is the main PDF datatype. and writing the valid PDF file mostly just requires flattening this out).

However, to write an integer, boolean or float value to a file requires making a string from it first. Even writing to standard output using Pervasives.print_int is defined as

let print_int i = output_string stdout (string_of_int i)

string_of_int not only allocates a string (obviously), but it has to parse the format string every time, because it's defined in Pervasives as

let string_of_int n =
  format_int "%d" n

Pervasives provides:

output_char
output_string

I claim we should add:

output_int_as_string
output_float_as_string
output_bool_as_string

To start with, these could be defined in terms of the existing code, with efficient versions replacing them later.

For example, in CamlPDF, I got a 20% speedup in writing large (20Mb) PDF files when I wrote my own function for writing integers out, which avoids parse_format and alloc_string:

let s = String.make 20 'X'

let rec chars_of_int p = function
  | 0 -> p - 1
  | n ->
      String.unsafe_set s p (let ch : char = Obj.magic (n mod 10 + 48) in ch);
      chars_of_int (p + 1) (n / 10)

let rec output_int_as_string output = function
  | 0 -> output '0'
  | n when n = min_int ->
      String.iter output (Pervasives.string_of_int min_int)
  | n when n < 0 ->
     output '-';
     output_int_as_string output ~-n
  | n ->
     let endpos = chars_of_int 0 n in
       for x = endpos downto 0 do
         output (String.unsafe_get s x)
       done

I'm not proposing a patch based on the code above at this stage (because it's not mature, and I don't know how to write output_float_as_string anyway, and I don't know how all this interacts with the C functions used to do these conversions now). But I think it highlights an interesting issue.

Could someone comment on the feasiblity of completing the implementation of these three functions in a way that doesn't cause allocation?

Are there any simpler steps we can take? Just getting rid of the format parsing in string_of_int might be a first step...

Thanks,

John Whitington
Coherent Graphics Ltd
TagsNo tags attached.
Attached Files

- Relationships
related to 0004688resolvedfrisch Special floating-point values aren't converted to strings correctly under Windows 

-  Notes
(0006522)
gasche (developer)
2011-12-23 17:48
edited on: 2011-12-23 17:48

> Could someone comment on the feasiblity of completing the implementation
> of these three functions in a way that doesn't cause allocation?

I have no idea whether the OCaml developpers would accept such a change (why not, if you get tangible performance improvements in an real-world application), but technically that looks extremely feasible.

In fact, you can just write your own functions, using a C binding to stdlib.h printf (fprintf(..,"%d",n) and fprintf(..,"%f",x)). You can look at `caml_ml_output_char` in byterun/io.c for inspiration of how to get the underlying C channel from an OCaml channel, lock it etc.

I suggest you write such a small C binding, test it on your application, report on the performance improvements and submit the code here. If the caml-team decides to make the change, it will be easy to integrate your code directly. And if they don't, you can use your C library in your application.

(0006558)
doligez (administrator)
2011-12-29 23:20
edited on: 2011-12-30 16:25

For int and bool, it's easy enough to write such functions in OCaml (even without Obj.magic).

For float, a C binding might be your best bet, but looking at 0004688 I think there will be troublesome corner cases.

(0006567)
johnwhitington (reporter)
2011-12-30 16:53

Thanks for these notes: I'll have a go at writing this soon.

There is also the converse issue - can we lex an integer from file without allocation? At the moment, my lexer looks through the characters in a channel until it has identified what the file is, reading them into a resusable Buffer.t until the token has ended - the integer, float etc. can then by read by calling float_of_string or int_of_string on the result of Buffer.contents (which, of course, allocates a string which we will then throw away). The buffer is then cleared, and lexing continues.

So, perhaps we also add Buffer.int_of_contents and Buffer.float_of_contents?

My thought is that if I write these two patches, it should be possible to write simple data structures like the one above to file entirely without allocation, and lex/parse them with no more allocation than that inherent in actually building the final parsed value. That seems like something to aim for, anyway.
(0006574)
meyer (developer)
2011-12-30 22:19
edited on: 2011-12-30 22:38

I assume your lexer is operating on big volumes of data.

I also assume that for portability issues - e.g. port to F# - you can't use ocamllex so your lexer is handwritten.

Number of tokens is always proportional to size of the data you are processing, so I assume you want to improve the constant factor, hence these ideas and the idea below.

What you really want is to optimize the a single allocation and copying in the Buffer.contents.

I see the only way you could *possibly* improve the performance over the Buffer module (caveat: I wouldn't do it that way if that wasn't a critical thing):

- do the lexing and keep the array pair of offsets as integers to your string for each token, denoting start and beginning.
- Handwrite a function based on libc atoi and atof (strings in Caml are linear blocks so you can easily do it)
- to separate lexemes you start from end of the array of offsets, and keep going to beginning of the array adding \0 at the end of lexeme
- each time you use atoi or atof and convert it to Caml values and construct the needed data

This is probably the most you can get for the price of re-implementing parts of stdlib and resorting to C.

For your questions regarding adding a couple of functions to Buffer I would say:
- you will not avoid Buffer allocation because you would need to extract the string, maybe hybrid approach with the above would work (using Obj.magic and casting the abstract data type to the concrete one defined in buffer.ml)
- I think this is very specific case when you really rely on performance, and stdlib try to be lean and generic, so probably adding them would not benefit most of the users and instead will add an additional maintenance baggage. That's said, I don't know what do others think, that's my initial thought.

Wojciech

(0006730)
lefessan (developer)
2012-01-19 14:00

I put this bug report as "suspended", as it is unlikely that it will be fixed soon, although not impossible. The purpose of "stdlib" is not to provide all the functions that a "standard library" would provide, but only the ones needed by the compiler and the distribution. Such functions would probably be welcome in either Jane Street Core or Batteries. If it is possible to have a clean and consistent set of such functions (write_int, write_float, write_bool, etc.), maybe it would be worth adding a complete module for them.
(0006734)
gasche (developer)
2012-01-19 15:35
edited on: 2012-01-19 15:36

I think there are two separate things being discussed here:

1. Changing output_{int,bool,float} to not allocate, which seems quite reasonable and simple except maybe for floats. If think johnwhitington could and should -- because I hope they could get included -- provide patches for those.

2. The more hazy discussion on lexing, Buffer.* etc: it looks more difficult, not very well defined, and I'd agree with Fabrice that the standard library is maybe not the best place for adding such things now, given the "low maintenance" goal that the OCaml devs have explicitely stated.

(0006803)
johnwhitington (reporter)
2012-01-26 10:28

I've done some preliminary investigations here.

It looks like I'll have to try gasche's original suggestion of linking to C, because my pure ocaml code for writing integers, whilst faster in native code, is slower in bytecode:

Writing 40Mb of this data structure:

type data =
  | Int of int
  | Float of float
  | Bool of bool
  | Ident of string
  | Array of data list

like this:

let rec data_to_channel ch = function
  | Int i -> fast_output_int ch i
  | Float f -> fast_output_float ch f
  | Bool b -> fast_output_bool ch b
  | Ident i -> output_string ch i
  | Array l ->
      output_string ch "[ ";
      List.iter
        (function x -> data_to_channel ch x; output_string ch " ")
        l;
      output_string ch "]"

(for some value of fast_output_(int|float|bool), we get:


Native, ocaml pervasives: 1.79s
Native, mine: 1.44s

Bytecode, ocaml pervasives: 12.77s
Bytecode, mine: 16.4s

(Where 'mine' is the output_string_as_int from the original bug report, the obvious output_bool_as_string and no change to the float code).

So, I'm going to try a C version of output_int_as_string as suggested above, and hopefully that will be clearly faster on both native and bytecode. Then, I'll move on to the possibly thornier issue of floats.

- Issue History
Date Modified Username Field Change
2011-12-23 16:37 johnwhitington New Issue
2011-12-23 17:48 gasche Note Added: 0006522
2011-12-23 17:48 gasche Note Edited: 0006522 View Revisions
2011-12-29 23:20 doligez Note Added: 0006558
2011-12-29 23:21 doligez Note Edited: 0006558 View Revisions
2011-12-29 23:21 doligez Note Edited: 0006558 View Revisions
2011-12-29 23:21 doligez Note Edited: 0006558 View Revisions
2011-12-29 23:21 doligez Relationship added related to 0004688
2011-12-30 00:13 gasche Note Edited: 0006558 View Revisions
2011-12-30 16:25 doligez Note Edited: 0006558 View Revisions
2011-12-30 16:53 johnwhitington Note Added: 0006567
2011-12-30 22:19 meyer Note Added: 0006574
2011-12-30 22:19 meyer Note Edited: 0006574 View Revisions
2011-12-30 22:38 meyer Note Edited: 0006574 View Revisions
2012-01-19 14:00 lefessan Note Added: 0006730
2012-01-19 14:00 lefessan Status new => resolved
2012-01-19 14:00 lefessan Resolution open => suspended
2012-01-19 14:00 lefessan Assigned To => lefessan
2012-01-19 15:35 gasche Note Added: 0006734
2012-01-19 15:36 gasche Note Edited: 0006734 View Revisions
2012-01-26 10:28 johnwhitington Note Added: 0006803
2013-08-31 12:46 xleroy Status resolved => closed


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker