New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Writing an integer to a file as a string requires allocation #5443
Comments
Comment author: @gasche
I have no idea whether the OCaml developpers would accept such a change (why not, if you get tangible performance improvements in an real-world application), but technically that looks extremely feasible. In fact, you can just write your own functions, using a C binding to stdlib.h printf (fprintf(..,"%d",n) and fprintf(..,"%f",x)). You can look at I suggest you write such a small C binding, test it on your application, report on the performance improvements and submit the code here. If the caml-team decides to make the change, it will be easy to integrate your code directly. And if they don't, you can use your C library in your application. |
Comment author: @damiendoligez For int and bool, it's easy enough to write such functions in OCaml (even without Obj.magic). For float, a C binding might be your best bet, but looking at #4688 I think there will be troublesome corner cases. |
Comment author: @johnwhitington Thanks for these notes: I'll have a go at writing this soon. There is also the converse issue - can we lex an integer from file without allocation? At the moment, my lexer looks through the characters in a channel until it has identified what the file is, reading them into a resusable Buffer.t until the token has ended - the integer, float etc. can then by read by calling float_of_string or int_of_string on the result of Buffer.contents (which, of course, allocates a string which we will then throw away). The buffer is then cleared, and lexing continues. So, perhaps we also add Buffer.int_of_contents and Buffer.float_of_contents? My thought is that if I write these two patches, it should be possible to write simple data structures like the one above to file entirely without allocation, and lex/parse them with no more allocation than that inherent in actually building the final parsed value. That seems like something to aim for, anyway. |
Comment author: meyer I assume your lexer is operating on big volumes of data. I also assume that for portability issues - e.g. port to F# - you can't use ocamllex so your lexer is handwritten. Number of tokens is always proportional to size of the data you are processing, so I assume you want to improve the constant factor, hence these ideas and the idea below. What you really want is to optimize the a single allocation and copying in the Buffer.contents. I see the only way you could possibly improve the performance over the Buffer module (caveat: I wouldn't do it that way if that wasn't a critical thing):
This is probably the most you can get for the price of re-implementing parts of stdlib and resorting to C. For your questions regarding adding a couple of functions to Buffer I would say:
Wojciech |
Comment author: @lefessan I put this bug report as "suspended", as it is unlikely that it will be fixed soon, although not impossible. The purpose of "stdlib" is not to provide all the functions that a "standard library" would provide, but only the ones needed by the compiler and the distribution. Such functions would probably be welcome in either Jane Street Core or Batteries. If it is possible to have a clean and consistent set of such functions (write_int, write_float, write_bool, etc.), maybe it would be worth adding a complete module for them. |
Comment author: @gasche I think there are two separate things being discussed here:
|
Comment author: @johnwhitington I've done some preliminary investigations here. It looks like I'll have to try gasche's original suggestion of linking to C, because my pure ocaml code for writing integers, whilst faster in native code, is slower in bytecode: Writing 40Mb of this data structure: type data = like this: let rec data_to_channel ch = function (for some value of fast_output_(int|float|bool), we get: Native, ocaml pervasives: 1.79s Bytecode, ocaml pervasives: 12.77s (Where 'mine' is the output_string_as_int from the original bug report, the obvious output_bool_as_string and no change to the float code). So, I'm going to try a C version of output_int_as_string as suggested above, and hopefully that will be clearly faster on both native and bytecode. Then, I'll move on to the possibly thornier issue of floats. |
Original bug ID: 5443
Reporter: @johnwhitington
Assigned to: @lefessan
Status: closed (set by @xavierleroy on 2013-08-31T10:46:11Z)
Resolution: suspended
Priority: normal
Severity: feature
Category: ~DO NOT USE (was: OCaml general)
Related to: #4688
Monitored by: @protz pilki @ygrek @damiendoligez
Bug description
Hallo. Happy Christmas.
One might expect that flattening a data structure, (such as this one from CamlPDF) to a file would not require data allocation.
type pdfobject =
| Null
| Boolean of bool
| Integer of int
| Real of float
| String of string
| Name of string
| Array of pdfobject list
| Dictionary of (string * pdfobject) list
| Stream of (pdfobject * stream) ref
| Indirect of int
(This is the main PDF datatype. and writing the valid PDF file mostly just requires flattening this out).
However, to write an integer, boolean or float value to a file requires making a string from it first. Even writing to standard output using Pervasives.print_int is defined as
let print_int i = output_string stdout (string_of_int i)
string_of_int not only allocates a string (obviously), but it has to parse the format string every time, because it's defined in Pervasives as
let string_of_int n =
format_int "%d" n
Pervasives provides:
output_char
output_string
I claim we should add:
output_int_as_string
output_float_as_string
output_bool_as_string
To start with, these could be defined in terms of the existing code, with efficient versions replacing them later.
For example, in CamlPDF, I got a 20% speedup in writing large (20Mb) PDF files when I wrote my own function for writing integers out, which avoids parse_format and alloc_string:
let s = String.make 20 'X'
let rec chars_of_int p = function
| 0 -> p - 1
| n ->
String.unsafe_set s p (let ch : char = Obj.magic (n mod 10 + 48) in ch);
chars_of_int (p + 1) (n / 10)
let rec output_int_as_string output = function
| 0 -> output '0'
| n when n = min_int ->
String.iter output (Pervasives.string_of_int min_int)
| n when n < 0 ->
output '-';
output_int_as_string output ~-n
| n ->
let endpos = chars_of_int 0 n in
for x = endpos downto 0 do
output (String.unsafe_get s x)
done
I'm not proposing a patch based on the code above at this stage (because it's not mature, and I don't know how to write output_float_as_string anyway, and I don't know how all this interacts with the C functions used to do these conversions now). But I think it highlights an interesting issue.
Could someone comment on the feasiblity of completing the implementation of these three functions in a way that doesn't cause allocation?
Are there any simpler steps we can take? Just getting rid of the format parsing in string_of_int might be a first step...
Thanks,
John Whitington
Coherent Graphics Ltd
The text was updated successfully, but these errors were encountered: