Re: [Camllist] More efficient implementation of intersection of sets?

From:  Brian Hurt <bhurt@j...> 
Subject:  Re: [Camllist] More efficient implementation of intersection of sets? 
sasha mal wrote:

> I'm thinking of an alternative approach which inserts keys one at a
> time. Basically, we go through both trees simultaneously, starting
> with the smallest nodes of both trees. If two nodes with the same key
> are seen, they are inserted into a new tree that will contain the
> intersection; then the pair of nodes with the next larger keys is
> taken. Otherwise, if, say, key1 < key2, then we search for the
> smallest key k >= key2 inside the first tree. If k = key2, we add k
> into the new tree and proceed to the next pair. If k > key2, we
> proceed with the pair (k, key2) instead of (key1, key2). For
> key1 > key2, proceed analogously, i.e. search for the smallest key
> k >= key1 in the second tree. If at some point no "larger" key is
> found in one of the two original trees, the new tree where we insert
> elements contains exactly the intersection.
>
> [Small corrections.]
>
> By the way, the currently implemented procedure always splits the
> second tree, regardless of the sizes of the trees. Imagine that we
> have two trees of very different sizes (e.g. with 5 and 10^5
> elements). Which tree should be split then?

Sorry for the late response: the 5-element tree should be the one that
is split, for two reasons. Reason number one: it makes it most likely
that one half or the other will be empty, which means it's much more
likely that I'll be able to skip large subtrees of the million-element
set without inspection. The second reason is that split (the function)
is O((log N)^2) in cost; this is obviously cheaper if N = 5 than if
N = 1 million.

Looking at this algorithm over the last couple of days, I've come to
the conclusion that it's O(N) in the normal case, O(N (log N)^2) in the
worst case, and O(log N) in the best case, where N is the size of the
larger set. But I'm not sure of it. It's certainly an interesting
approach.

Brian
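For readers following along, here is a minimal sketch of the split-based intersection being discussed. It uses a plain (unbalanced) binary search tree over ints for brevity; the real Set module in the OCaml standard library uses balanced AVL trees and a rebalancing join, which this sketch deliberately omits, and the names (`split`, `inter`, `concat`) are just illustrative:

```ocaml
(* Unbalanced BST sketch; a real implementation would rebalance. *)
type tree = Empty | Node of tree * int * tree

let rec add x = function
  | Empty -> Node (Empty, x, Empty)
  | Node (l, v, r) as t ->
      if x < v then Node (add x l, v, r)
      else if x > v then Node (l, v, add x r)
      else t

(* split k t = (l, present, r): elements of t below k, whether k is in
   t, and elements of t above k. Cost is bounded by the depth of t. *)
let rec split k = function
  | Empty -> (Empty, false, Empty)
  | Node (l, v, r) ->
      if k < v then
        let ll, pres, rl = split k l in (ll, pres, Node (rl, v, r))
      else if k > v then
        let lr, pres, rr = split k r in (Node (l, v, lr), pres, rr)
      else (l, true, r)

(* concat a b: join two trees where every key of a is below every key
   of b (true for the two halves produced by inter below). *)
let rec concat a b = match a with
  | Empty -> b
  | Node (l, v, r) -> Node (l, v, concat r b)

(* Intersect by splitting t2 around the root of t1. When a half of the
   split comes back Empty, the whole corresponding subtree of t1 is
   skipped without inspection -- the effect Brian describes above. *)
let rec inter t1 t2 = match t1, t2 with
  | Empty, _ | _, Empty -> Empty
  | Node (l1, v, r1), _ ->
      let l2, present, r2 = split v t2 in
      let li = inter l1 l2 and ri = inter r1 r2 in
      if present then Node (li, v, ri) else concat li ri

let of_list xs = List.fold_left (fun t x -> add x t) Empty xs
let rec to_list = function
  | Empty -> []
  | Node (l, v, r) -> to_list l @ (v :: to_list r)
```

Note that `inter` always splits its second argument, which is exactly the behaviour sasha mal questions; swapping the arguments when the first tree is much larger (or recursing on whichever side is smaller) is the optimisation under discussion.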