Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more efficient Digest.to_hex #6116

Closed
vicuna opened this issue Aug 4, 2013 · 7 comments
Closed

more efficient Digest.to_hex #6116

vicuna opened this issue Aug 4, 2013 · 7 comments
Assignees

Comments

@vicuna
Copy link

vicuna commented Aug 4, 2013

Original bug ID: 6116
Reporter: @ygrek
Assigned to: @gasche
Status: closed (set by @xavierleroy on 2015-12-11T18:21:27Z)
Resolution: fixed
Priority: normal
Severity: minor
Version: 4.00.1
Fixed in version: 4.01.1+dev
Category: standard library

Bug description

Digest.to_hex allocates too much. And sometimes it can be a hot spot. Compare (200K calls) :

Digest.to_hex : allocated 4.9GB, heap 0B, collect 0 362 2531, elapsed 1.525 sec
md5_hex_string : allocated 9.2MB, heap 0B, collect 0 0 4, elapsed 0.023 sec

Patch against 4.01 attached

Steps to reproduce

(*
Compile with ocamlbuild -lflags unix.cmxa md5.native
Measure module from http://ygrek.org.ua/p/code/measure.ml
*)

let hex n = Char.chr (n + if n < 10 then Char.code '0' else (Char.code 'a' - 10))

let md5_hex_string (md5:Digest.t) =
let md5 = (Obj.magic md5 : string) in
let s = String.create 32 in
for i = 0 to 15 do
let x = Char.code (String.unsafe_get md5 i) in
String.unsafe_set s (i2) (hex (x lsr 4));
String.unsafe_set s (i
2+1) (hex (x land 0x0f));
done;
s

let () =
(*
let n = 1024 in
let s = String.create n in
for i = 0 to pred n do s.[i] <- Char.chr (Random.int 256) done;
*)
let s = "make it run, make it right, make it fast" in
let md5 = Digest.string s in
let run f = for i = 1 to 200_000 do f () done in
assert (Digest.to_hex md5 = md5_hex_string md5);
Measure.show " Digest.to_hex" run (fun () -> Digest.to_hex md5);
Measure.show "md5_hex_string" run (fun () -> md5_hex_string md5);
()

File attachments

@vicuna
Copy link
Author

vicuna commented Aug 4, 2013

Comment author: @lefessan

Indeed, the implementation is not as efficient as it should be. I would change your code a little, to get rid of Obj.magic first (a Digest.t is already a string), and to get the same errors than the former version:

let hex n = Char.unsafe_chr (n + if n < 10 then Char.code '0' else (Char.code 'a' - 10))

let md5_hex (md5:Digest.t) =
if String.length md5 < 16 then invalid_arg "index out of bounds";
let s = String.create 32 in
for i = 0 to 15 do
let x = Char.code (String.unsafe_get md5 i) in
String.unsafe_set s (i2) (hex (x lsr 4));
String.unsafe_set s (i
2+1) (hex (x land 0x0f));
done;
s

Do you think this new version is correct ? could still be optimized ?

@vicuna
Copy link
Author

vicuna commented Aug 4, 2013

Comment author: @gasche

Fabrice, the patch attached by ygrek has no Obj.magic, it is only the measurement code that contains a copy of the function with Obj.magic (indeed, useless in that case).

I do think it's better to include the length test, in case an user would pass an invalid string (unsafe_get used unsafely is bad). A comment saying that "index out of bounds" is what the precedent implementation (in fact the String.get in d.[i]) raised and was preserved for backward-compatibility would be helpful.

ygrek, out of curiosity, in which kind of applications does this become a bottleneck?

@vicuna
Copy link
Author

vicuna commented Aug 4, 2013

Comment author: @lefessan

@gabriel: Indeed, I just thought the patch contained the same code. I also changed "hex n" to use Char.unsafe_chr instead if Char.chr, as it is known that the value is between 0 and 15.

ygrek, out of curiosity, in which kind of applications does this become a bottleneck?

Just my example: in ocp-build, for every rule to be executed, I compute a digest of what the rule would do, to check whether it has changed since the last execution, these digests are then saved in a file for the next run. For debug reason, I prefer to save them with Digest.to_hex, so that I can easily grep and lookup these digests in debug logs...

@vicuna
Copy link
Author

vicuna commented Aug 4, 2013

Comment author: @ygrek

My bad, I was completely sure that Digest.t is abstract and didn't bother to check, sorry. I have updated the patch, but using String.get instead of hardcoding exception string (it doesn't influence speed). Char.unsafe_chr contribution is more noticeable.

Gabriel, in my case it is a bottleneck when computing hashes for strings (to create index on text columns) while generating csv to load infile into mysql database. In this case no cryptographic strength is required and actual md5 computation is fast, but outputting to hex is not, hence the patch.

@vicuna
Copy link
Author

vicuna commented Aug 4, 2013

Comment author: @gasche

Committed in trunk. Thanks!

If you have time, would you consider adding tests in the testsuite (lib-digest) to exercise Digest.to_hex?

(It could be interesting to add this to version/4.01 as well, but I'm a bit reluctant to do so, in particular in absence of tests. I'll try to ask Damien his opinion.)

@vicuna
Copy link
Author

vicuna commented Aug 5, 2013

Comment author: @ygrek

patch for testsuite attached, used vectors from wikipedia page and http://www.nsrl.nist.gov/testdata/

@vicuna
Copy link
Author

vicuna commented Aug 5, 2013

Comment author: @gasche

Thanks! Committed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants