New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Guarantee that Set.add doesn't allocate and returns the original set if the added element is already in the set #6645
Comments
Comment author: @damiendoligez Just note that strictly speaking it doesn't guarantee anything because the specification of == is very weak. |
Comment author: @alainfrisch So, do you think we shouldn't write down on the function documentation that its result is physically equal to its argument when the element is already in the set? This property could be very useful to avoid double lookup. |
Comment author: @alainfrisch
It is indeed the case today that storing a value in a data structure and loading it back doesn't necessarily return something physically equal to the original value. This happens when a float is stored in an array or in a unboxed-float record. But here, we only rely on the fact that: function returns something physically equal to its argument (plus usual data flow rules: if the "then" branch is chosen, the result of "if-then-else" is physically equal to the one of the "then" branch; or: the value seen be a function's caller is physically equal to the value computed by the function's body). Do you envision cases where the suggested implementation of Set.add would indeed not satisfy the stated invariant? (e.g. with js_of_ocaml for instance) Apart from this question of whether to mention the would-be invariant in the documentation, do you agree with the proposal? |
Comment author: @damiendoligez I agree with the proposal and I agree with fully documenting the invariant. I was just being pedant. |
Comment author: @alainfrisch Thanks! I've applied this to trunk (rev 15966), documenting the fact that the result is physically equal to the argument (but not the fact that the function doesn't allocate). |
Comment author: @xavierleroy Preservation of sharing is important for some applications (e.g. static analysis -- one of the reasons that Astree scales is that it is very careful with sharing). However, this is a slippery slope: you did it for Set.add, but what about other Set operations? (union, etc) What about Map operations? Preserving sharing in Map.merge is especially sensitive for static analysis applications. |
Comment author: @alainfrisch Actually, this is only a first step. LexiFi's version of set.ml indeed has similar optimizations for other operations. For instance, our version of Set.union tries to benefit from physical equality and preserve it: =============== let rec union s1 s2 = if s1 == s2 then s1 (* LEXIFI *) else match (s1, s2) with (Empty, t2) -> t2 | (t1, Empty) -> t1 | (Node(l1, v1, r1, h1), Node(l2, v2, r2, h2)) -> if h1 >= h2 then if h2 = 1 then add v2 s1 else begin (* BEGIN LEXIFI *) let (l2, _, r2) = split v1 s2 in let ll = union l1 l2 in let rr = union r1 r2 in if ll == l1 && rr == r1 then s1 else join ll v1 rr (* END LEXIFI *) end else if h1 = 1 then add v1 s2 else begin (* BEGIN LEXIFI *) let (l1, _, r1) = split v2 s1 in let ll = union l1 l2 in let rr = union r1 r2 in if ll == l2 && rr == r2 then s2 else join ll v2 rr (* END LEXIFI *) end =============== The reason I've started with Set.add is that it's easier to specify what the optimization does and there is a direct use case (avoid Set.mem for the typical memoization pattern). With Set.union as above, the performance depends on how the two sets have been built (whether they share internal nodes), and it's difficult to reason on that, so the optimization is more like a "best effort" than a guaranteed invariant. What's your opinion on adding such optimizations e.g. for Set.union? For Map.merge, we also have a specialized version:
which is more efficient than Map.merge (although it currently doesn't do anything special w.r.t. physical equality yet): =============== let rec union f s1 s2 = match (s1, s2) with | (Empty, s) | (s, Empty) -> s | (Node (l1, v1, d1, r1, h1), Node (l2, v2, d2, r2, h2)) -> if h1 >= h2 then let (l2, d2, r2) = split v1 s2 in let l = union f l1 l2 and r = union f r1 r2 in let d = match d2 with | None -> d1 | Some d2 -> f v1 d1 d2 in join l v1 d r else let (l1, d1, r1) = split v2 s1 in let l = union f l1 l2 and r = union f r1 r2 in let d = match d1 with | None -> d2 | Some d1 -> f v2 d1 d2 in join l v2 d r =============== |
Original bug ID: 6645
Reporter: @alainfrisch
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2016-12-07T10:49:02Z)
Resolution: fixed
Priority: low
Severity: tweak
Target version: 4.02.2+dev / +rc1
Fixed in version: 4.03.0+dev / +beta1
Category: standard library
Tags: patch
Monitored by: @ygrek @jmeber @hcarty
Bug description
I suggest to replace the implementation of Set.add with:
This improves performance by not allocating when the added element is already in the set, but equally importantly, it guarantees that the result is physically equal to the original set in that case. This makes it possible to implement the common pattern "add if not already present and do something different in the two cases" with a single traversal.
The text was updated successfully, but these errors were encountered: