Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing an integer to a file as a string requires allocation #5443

Closed
vicuna opened this issue Dec 23, 2011 · 7 comments
Closed

Writing an integer to a file as a string requires allocation #5443

vicuna opened this issue Dec 23, 2011 · 7 comments

Comments

@vicuna
Copy link

vicuna commented Dec 23, 2011

Original bug ID: 5443
Reporter: @johnwhitington
Assigned to: @lefessan
Status: closed (set by @xavierleroy on 2013-08-31T10:46:11Z)
Resolution: suspended
Priority: normal
Severity: feature
Category: ~DO NOT USE (was: OCaml general)
Related to: #4688
Monitored by: @protz pilki @ygrek @damiendoligez

Bug description

Hallo. Happy Christmas.

One might expect that flattening a data structure, (such as this one from CamlPDF) to a file would not require data allocation.

type pdfobject =
| Null
| Boolean of bool
| Integer of int
| Real of float
| String of string
| Name of string
| Array of pdfobject list
| Dictionary of (string * pdfobject) list
| Stream of (pdfobject * stream) ref
| Indirect of int

(This is the main PDF datatype. and writing the valid PDF file mostly just requires flattening this out).

However, to write an integer, boolean or float value to a file requires making a string from it first. Even writing to standard output using Pervasives.print_int is defined as

let print_int i = output_string stdout (string_of_int i)

string_of_int not only allocates a string (obviously), but it has to parse the format string every time, because it's defined in Pervasives as

let string_of_int n =
format_int "%d" n

Pervasives provides:

output_char
output_string

I claim we should add:

output_int_as_string
output_float_as_string
output_bool_as_string

To start with, these could be defined in terms of the existing code, with efficient versions replacing them later.

For example, in CamlPDF, I got a 20% speedup in writing large (20Mb) PDF files when I wrote my own function for writing integers out, which avoids parse_format and alloc_string:

let s = String.make 20 'X'

let rec chars_of_int p = function
| 0 -> p - 1
| n ->
String.unsafe_set s p (let ch : char = Obj.magic (n mod 10 + 48) in ch);
chars_of_int (p + 1) (n / 10)

let rec output_int_as_string output = function
| 0 -> output '0'
| n when n = min_int ->
String.iter output (Pervasives.string_of_int min_int)
| n when n < 0 ->
output '-';
output_int_as_string output ~-n
| n ->
let endpos = chars_of_int 0 n in
for x = endpos downto 0 do
output (String.unsafe_get s x)
done

I'm not proposing a patch based on the code above at this stage (because it's not mature, and I don't know how to write output_float_as_string anyway, and I don't know how all this interacts with the C functions used to do these conversions now). But I think it highlights an interesting issue.

Could someone comment on the feasiblity of completing the implementation of these three functions in a way that doesn't cause allocation?

Are there any simpler steps we can take? Just getting rid of the format parsing in string_of_int might be a first step...

Thanks,

John Whitington
Coherent Graphics Ltd

@vicuna
Copy link
Author

vicuna commented Dec 23, 2011

Comment author: @gasche

Could someone comment on the feasiblity of completing the implementation
of these three functions in a way that doesn't cause allocation?

I have no idea whether the OCaml developpers would accept such a change (why not, if you get tangible performance improvements in an real-world application), but technically that looks extremely feasible.

In fact, you can just write your own functions, using a C binding to stdlib.h printf (fprintf(..,"%d",n) and fprintf(..,"%f",x)). You can look at caml_ml_output_char in byterun/io.c for inspiration of how to get the underlying C channel from an OCaml channel, lock it etc.

I suggest you write such a small C binding, test it on your application, report on the performance improvements and submit the code here. If the caml-team decides to make the change, it will be easy to integrate your code directly. And if they don't, you can use your C library in your application.

@vicuna
Copy link
Author

vicuna commented Dec 29, 2011

Comment author: @damiendoligez

For int and bool, it's easy enough to write such functions in OCaml (even without Obj.magic).

For float, a C binding might be your best bet, but looking at #4688 I think there will be troublesome corner cases.

@vicuna
Copy link
Author

vicuna commented Dec 30, 2011

Comment author: @johnwhitington

Thanks for these notes: I'll have a go at writing this soon.

There is also the converse issue - can we lex an integer from file without allocation? At the moment, my lexer looks through the characters in a channel until it has identified what the file is, reading them into a resusable Buffer.t until the token has ended - the integer, float etc. can then by read by calling float_of_string or int_of_string on the result of Buffer.contents (which, of course, allocates a string which we will then throw away). The buffer is then cleared, and lexing continues.

So, perhaps we also add Buffer.int_of_contents and Buffer.float_of_contents?

My thought is that if I write these two patches, it should be possible to write simple data structures like the one above to file entirely without allocation, and lex/parse them with no more allocation than that inherent in actually building the final parsed value. That seems like something to aim for, anyway.

@vicuna
Copy link
Author

vicuna commented Dec 30, 2011

Comment author: meyer

I assume your lexer is operating on big volumes of data.

I also assume that for portability issues - e.g. port to F# - you can't use ocamllex so your lexer is handwritten.

Number of tokens is always proportional to size of the data you are processing, so I assume you want to improve the constant factor, hence these ideas and the idea below.

What you really want is to optimize the a single allocation and copying in the Buffer.contents.

I see the only way you could possibly improve the performance over the Buffer module (caveat: I wouldn't do it that way if that wasn't a critical thing):

  • do the lexing and keep the array pair of offsets as integers to your string for each token, denoting start and beginning.
  • Handwrite a function based on libc atoi and atof (strings in Caml are linear blocks so you can easily do it)
  • to separate lexemes you start from end of the array of offsets, and keep going to beginning of the array adding \0 at the end of lexeme
  • each time you use atoi or atof and convert it to Caml values and construct the needed data

This is probably the most you can get for the price of re-implementing parts of stdlib and resorting to C.

For your questions regarding adding a couple of functions to Buffer I would say:

  • you will not avoid Buffer allocation because you would need to extract the string, maybe hybrid approach with the above would work (using Obj.magic and casting the abstract data type to the concrete one defined in buffer.ml)
  • I think this is very specific case when you really rely on performance, and stdlib try to be lean and generic, so probably adding them would not benefit most of the users and instead will add an additional maintenance baggage. That's said, I don't know what do others think, that's my initial thought.

Wojciech

@vicuna
Copy link
Author

vicuna commented Jan 19, 2012

Comment author: @lefessan

I put this bug report as "suspended", as it is unlikely that it will be fixed soon, although not impossible. The purpose of "stdlib" is not to provide all the functions that a "standard library" would provide, but only the ones needed by the compiler and the distribution. Such functions would probably be welcome in either Jane Street Core or Batteries. If it is possible to have a clean and consistent set of such functions (write_int, write_float, write_bool, etc.), maybe it would be worth adding a complete module for them.

@vicuna
Copy link
Author

vicuna commented Jan 19, 2012

Comment author: @gasche

I think there are two separate things being discussed here:

  1. Changing output_{int,bool,float} to not allocate, which seems quite reasonable and simple except maybe for floats. If think johnwhitington could and should -- because I hope they could get included -- provide patches for those.

  2. The more hazy discussion on lexing, Buffer.* etc: it looks more difficult, not very well defined, and I'd agree with Fabrice that the standard library is maybe not the best place for adding such things now, given the "low maintenance" goal that the OCaml devs have explicitely stated.

@vicuna
Copy link
Author

vicuna commented Jan 26, 2012

Comment author: @johnwhitington

I've done some preliminary investigations here.

It looks like I'll have to try gasche's original suggestion of linking to C, because my pure ocaml code for writing integers, whilst faster in native code, is slower in bytecode:

Writing 40Mb of this data structure:

type data =
| Int of int
| Float of float
| Bool of bool
| Ident of string
| Array of data list

like this:

let rec data_to_channel ch = function
| Int i -> fast_output_int ch i
| Float f -> fast_output_float ch f
| Bool b -> fast_output_bool ch b
| Ident i -> output_string ch i
| Array l ->
output_string ch "[ ";
List.iter
(function x -> data_to_channel ch x; output_string ch " ")
l;
output_string ch "]"

(for some value of fast_output_(int|float|bool), we get:

Native, ocaml pervasives: 1.79s
Native, mine: 1.44s

Bytecode, ocaml pervasives: 12.77s
Bytecode, mine: 16.4s

(Where 'mine' is the output_string_as_int from the original bug report, the obvious output_bool_as_string and no change to the float code).

So, I'm going to try a C version of output_int_as_string as suggested above, and hopefully that will be clearly faster on both native and bytecode. Then, I'll move on to the possibly thornier issue of floats.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants