Browse thread
Weak hash table for attaching extra data to an object
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: | 2007-08-16 (16:17) |
From: | Richard Jones <rich@a...> |
Subject: | Some observations and measurements with using Weak Hashtable to annotate a tree (was: Re: [Caml-list] Weak hash table for attaching extra data to an object) |
On Wed, Aug 15, 2007 at 08:04:55PM +0100, Richard Jones wrote: > I should have said :-) I'm trying to annotate the <img> & <applet> > nodes in a typical HTML document, so they are both sparse (compared to > all nodes), but there are many of them. I did some measurements using my WeakMetadata module to annotate a large random HTML document tree. Here are some observations and results, in no particular order. Code: http://annexia.org/tmp/weakMetadata.ml http://annexia.org/tmp/weakMetadata.mli http://annexia.org/tmp/test_weakmetadata.ml Firstly, it works, but I found a few programming pitfalls. In one case I was accidentally copying my tree nodes when annotating them, so I was annotating the copies rather than the nodes themselves. The code was along these lines: | Element (a,b,c,d) -> annotate (a,b,c,d) instead of: | Element ((_,_,_,_) as e) -> annotate e Obviously experienced OCaml programmers wouldn't make this mistake, right?! Another programming problem was that I ended up having to assign a unique identifier to each node which was used to do the hash (otherwise the hash function would have to do a costly recursive descent to determine if two nodes are equal). This is a peculiarity of the Weak hash table itself, namely that physical equality cannot be used. Another programming problem is that the choice of the initial size of the hash table determines performance factors in rather non-obvious ways. Allocating the hash table with 'create 13' causes buckets to have long chains, but that creates much lower memory overhead. Allocating the hash table with 'create 100003' gives short chains and high memory overhead. (Exact numbers below). I wrote a program to create a document (tree) with approximately 560,000 elements and approximately 840,000 text nodes. I annotated every element in the document to get some idea of the overhead. As mentioned before, choice of initial hash table size affects the length of the bucket chain: Initial hash table size: 13 Median chain length per bucket: 33 Overhead per element annotated: 21 bytes (on a 64 bit machine) Initial hash table size: 100003 Median chain length per bucket: 6 Overhead per element annotated: 40 bytes (on a 64 bit machine) Approximately 210,000 elements were <img> or <input> elements (15% of the total nodes) and I am particularly interested in just annotating these nodes. The overhead for annotating sparse nodes like this is similar (again with chain length dependent a lot on initial choice of hash table size): Initial hash table size: 13 Median chain length per bucket: 27 Overhead per element annotated: 26 bytes (on a 64 bit machine) Initial hash table size: 100003 Median chain length per bucket: 3 Overhead per element annotated: 70 bytes (on a 64 bit machine) You have to manually compact the hash table in order to actually regain the memory used by your metadata annotations. (Alternately, wait for the hash table as a whole to become unreachable, or reuse the hash table which causes old annotations to be overwritten). Internally, the Weak module stores a linked list of weak arrays (caml_weak_list_head). The GC has a separate phase to deal with weak arrays, and as far as I can tell from the code the amount of work it will do in each slice is limited. The GC also has special code to remove unreachable weak arrays, so this list does not just grow without bound. However I'm still unclear on the exact O(..) for using huge numbers of weak arrays, which is what the weak hash table does once you start annotating large numbers of nodes. Rich. -- Richard Jones Red Hat