This site is updated infrequently. For up-to-date information, please visit the new OCaml website at ocaml.org.

Re: [Caml-list] More efficient implementation of intersection of sets?
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
 Date: 2008-04-04 (17:27) From: Brian Hurt Subject: Re: [Caml-list] More efficient implementation of intersection of sets?
```sasha mal wrote:

>>I'm thinking of an alternative approach which inserts keys one at a time.
>>
>>
>
>
>
>>Basically, we go through both trees simultaneously, starting with smallest
>>
>>
>
>
>
>>nodes of both trees. If two nodes with the same key are seen, the are inserted
>>
>>
>
>
>
>>into a new tree that will contain the intersection; then the pair of nodes with
>>
>>
>
>
>
>>next larger keys is taken. Otherwise, if, say, key1<key2, then we search for
>>
>>
>
>
>
>>the smallest key k>=key2 inside the first tree. If k=key2, we add k into the
>>
>>
>
>
>
>>new tree and proceed to the next pair. If k>key2, we proceed with the pair
>>
>>
>
>
>
>>(k, key2) instead of (key1, key2). For key1>key2, proceed analogously, i.e.
>>
>>
>
>
>
>>search for the smallest key k>=key1 in the second tree. If at some point of
>>
>>
>
>
>
>>time no "larger" key is found in one of the two original trees, the new tree
>>
>>
>
>
>
>>where we insert elements to contains exactly the intersection.
>>
>>
>
>
>
>[Small corrections.]
>
>
>
>By the way, the current implemented procedure always splits the second tree, regardless of the sizes of the trees. Imagine that we have two trees of very different sizes (e.g. with 5 and 10^5 elements). What tree should be split then?
>
>
>
Sorry for the late response:

The length-5 tree should be the one which is split, for two reasons.
Reason number one, it makes it the most likely that one half or the
other will be empty.  This means that it's much more likely that I'll be
able to skip large subtrees of the million-element set without
inspection.  The second reason is that split (the function) is O((log
N)^2) in cost, this is obviously cheaper if N=5 than if N=1 million.

Looking at this algorithm over the last couple of days, I've come to
conclusion that it's O(N) normal-case, but O(N (log N)^2) worst case,
and O(log N) best-cast, where N = the size of the larger set.  But I'm
not sure of it.  It's certainly an interesting approach.

Brian

```