<?xml version="1.0" encoding="ISO-8859-1"?>

<!DOCTYPE message PUBLIC
  "-//MLarc//DTD MLarc output files//EN"
  "../../mlarc.dtd"[
  <!ATTLIST message
    listname CDATA #REQUIRED
    title CDATA #REQUIRED
  >
]>

  <?xml-stylesheet href="../../mlarc.xsl" type="text/xsl"?>


<message 
  url="2003/10/af98552ca0ffbadca25a0ba23600da76"
  from="Ram Bhamidipaty &lt;ramb@s...&gt;"
  author="Ram Bhamidipaty"
  date="2003-10-02T12:41:52"
  subject="[Caml-list] what is the functional way to solve this problem?"
  prev="2003/10/6c8c4edb0ef7498c45177fd1da1ac595"
  next="2003/10/9a4a4be0cfe7c019ae4f42aaf4c47b30"
  next-in-thread="2003/10/01f8985812016e242ddd8d73bc92ef59"
  prev-thread="2003/10/261f050dccf87201bc913df1d656bf10"
  next-thread="2003/10/9a4a4be0cfe7c019ae4f42aaf4c47b30"
  root="../../"
  period="month"
  listname="caml-list"
  title="Archives of the Caml mailing list">

<thread subject="[Caml-list] what is the functional way to solve this problem?">
<msg 
  url="2003/10/af98552ca0ffbadca25a0ba23600da76"
  from="Ram Bhamidipaty &lt;ramb@s...&gt;"
  author="Ram Bhamidipaty"
  date="2003-10-02T12:41:52"
  subject="[Caml-list] what is the functional way to solve this problem?">
<msg 
  url="2003/10/01f8985812016e242ddd8d73bc92ef59"
  from="Michal Moskal &lt;malekith@p...&gt;"
  author="Michal Moskal"
  date="2003-10-02T14:55:22"
  subject="Re: [Caml-list] what is the functional way to solve this problem?">
<msg 
  url="2003/10/7f88e86b9d298b383d73b6b58f84f6fb"
  from="Pierre Weis &lt;weis@p...&gt;"
  author="Pierre Weis"
  date="2003-10-08T14:48:11"
  subject="Re: [Caml-list] what is the functional way to solve this problem?">
<msg 
  url="2003/10/899d22452db6890e8647fbc2d53d9ee4"
  from="Michal Moskal &lt;malekith@p...&gt;"
  author="Michal Moskal"
  date="2003-10-08T16:10:14"
  subject="Re: [Caml-list] what is the functional way to solve this problem?">
<msg 
  url="2003/10/f0d7a9095f1a80f5d23897d58bf50d50"
  from="Pierre Weis &lt;weis@p...&gt;"
  author="Pierre Weis"
  date="2003-10-08T20:34:41"
  subject="Re: [Caml-list] what is the functional way to solve this problem?">
<msg 
  url="2003/10/0ee5858b073f26c4cbcaef3bc64f2a02"
  from="Michal Moskal &lt;malekith@p...&gt;"
  author="Michal Moskal"
  date="2003-10-08T21:25:11"
  subject="scanf (Re: [Caml-list] what is the functional way to solve this problem?)">
</msg>
</msg>
</msg>
</msg>
</msg>
</msg>
</thread>

<contents>
Someone suggested that my previous question about dynamically resizing
arrays hinted that my solution may be going in a non-functional
direction. That might be true.

So here is the problem I am trying to solve. I would like to solve it
in a "functional way".

I want to create an in-memory representation of some file system data.
The input data file has three different types of lines:

1. starts with "R": R 0 &lt;dirname&gt;

2. starts with "D": D &lt;dir_num&gt; &lt;parent_dir_num&gt; &lt;name&gt;
   The &lt;dir_num&gt; values associated with the directories are
   assigned sequentially, starting at 1.

3. starts with "F": F &lt;file_size&gt; &lt;name&gt;

The first line will always be: "R 0 &lt;dir_name&gt;" where 0 is the
directory number for the top level directory. The purpose of this
line is to provide the starting point for the data in the rest of the
file.

The D lines indicate directories. The dir_num is an integer that
uniquely identifies this particular directory. The parent_dir_num
integer is used to locate this directory relative to the other
directories in the data file.

The F lines indicate data for a single file. The majority of the
lines in the file should be F lines. A file listed on an F line
is in the directory indicated by the closest D line that came
earlier in the file.
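
A minimal OCaml sketch of the three line types described above. The
names line, split_first and parse_line are hypothetical. It assumes
fields are separated by single spaces and that only the trailing name
may contain spaces; a line with too few fields raises Not_found.

```ocaml
(* One constructor per kind of input line. *)
type line =
  | Root of int * string        (* R: dir_num (always 0), name *)
  | Dir of int * int * string   (* D: dir_num, parent_dir_num, name *)
  | File of int * string        (* F: file_size, name *)

(* Split [s] at its first space: "12 a b" gives ("12", "a b").
   Raises Not_found when [s] contains no space. *)
let split_first s =
  let i = String.index s ' ' in
  (String.sub s 0 i, String.sub s (i + 1) (String.length s - i - 1))

(* Turn one line of the data file into a [line] value. *)
let parse_line s =
  let tag, rest = split_first s in
  match tag with
  | "R" ->
    let n, name = split_first rest in
    Root (int_of_string n, name)
  | "D" ->
    let n, rest = split_first rest in
    let p, name = split_first rest in
    Dir (int_of_string n, int_of_string p, name)
  | "F" ->
    let size, name = split_first rest in
    File (int_of_string size, name)
  | _ -> failwith ("unrecognised line: " ^ s)
```

For example, parse_line "D 2 1 bin" gives Dir (2, 1, "bin").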

Once all the data is read in I want to output two things: a list of the
100 largest files, and a list of the 100 directories that contain the
largest fraction of the total disk space used by all the files in the
data file. In the second case the size of a directory does not include
its sub-directories. Eventually I also want to categorize the data by
user-id.

I have already written a python application that can read this data file and
generate the data I am looking for. I was not happy with the python solution
because it was not very fast. Even with using a heap to store the top 100
largest files I was not able to create a python solution that could beat
a "grep | sort" pipeline (on unix of course).

In the python solution the limiting factor was putting the individual
files into a heap and repeatedly calling delete_min on the heap to
remove the smallest files. Even though the unix pipe based solution
ended up sorting _all_ the files and the python solution was handling
a smaller set of data, the unix pipe solution was still much faster.
There is a thread in google about this experiment. Do a google groups
search for group:comp.lang.python + bhamidipaty + "looking for speed-up ideas".

The bottom line for the python solution was that the grep+sort
pipeline took about 8 seconds and the fastest I could get python was
around 16 seconds. Of course the unix pipe solution would not be able
to do the other analysis that I wanted.

My goal of implementing this in OCaml is to beat the grep+sort
combination. If I can create a solution that can output all the
data I want - in one pass - AND still be faster than the grep+sort
partial solution - _that_ would be cool!
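
The bounded top-N idea mentioned earlier (hold at most N files and drop
the smallest once the collection overflows) could be written without
mutation using the standard Set module in place of a heap. This is only
a sketch and has not been timed against grep+sort. FileSet, add_top and
top are hypothetical names, and a file is assumed to be a
(size, dir_num, name) triple so that distinct files never compare equal.
The element count is carried next to the set because Set's cardinal
walks the whole set.

```ocaml
(* Files as (size, dir_num, name), ordered by size first. *)
module FileSet = Set.Make (struct
  type t = int * int * string
  let compare = compare
end)

(* Add one file to a (count, set) pair holding at most [n] files.
   When the set is already full, the smallest element is evicted. *)
let add_top n (count, set) file =
  if FileSet.mem file set then (count, set)
  else
    let set = FileSet.add file set in
    if count >= n
    then (count, FileSet.remove (FileSet.min_elt set) set)
    else (count + 1, set)

(* The [n] largest files of [files], largest first, in one pass. *)
let top n files =
  let _, set = List.fold_left (add_top n) (0, FileSet.empty) files in
  List.rev (FileSet.elements set)
```

With this, top 100 files would give the 100 largest entries of a list
called files.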

Having said all that - I wanted to use an array to hold the data for
each directory. I hoped that using an array would be faster than a
hash table since I know that the directory numbers are assigned
sequentially.
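
For comparison, a sketch of a purely functional alternative to the
array: a Map keyed by directory number, filled by a fold that remembers
the most recent directory line. The names line, IntMap, step and
dir_totals are hypothetical, and the line type is repeated here so the
block stands alone. It assumes that files listed before the first D
line belong to the top level directory. A map also tolerates gaps in
the numbering, such as the missing directory 3 in the example data
below. Whether it is faster than an array or hash table would have to
be measured.

```ocaml
type line =
  | Root of int * string        (* R: dir_num (always 0), name *)
  | Dir of int * int * string   (* D: dir_num, parent_dir_num, name *)
  | File of int * string        (* F: file_size, name *)

module IntMap = Map.Make (struct
  type t = int
  let compare = compare
end)

(* Fold state: the directory currently being filled, and the bytes
   seen so far in each directory (sub-directories not included). *)
let step (current, totals) line =
  match line with
  | Root (n, _) | Dir (n, _, _) -> (n, totals)
  | File (size, _) ->
    let sofar = try IntMap.find current totals with Not_found -> 0 in
    (current, IntMap.add current (sofar + size) totals)

(* Map from directory number to the total size of its own files. *)
let dir_totals lines =
  snd (List.fold_left step (0, IntMap.empty) lines)
```

Directories that contain no files simply do not appear in the result.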

Thanks for reading this rather long message and thank you for any
advice.
-Ram

---------------------------------
An example of the data file:

R 0 /usr
D 1 0 local
F 4095165 f1
D 2 1 bin
F 189408 f2
F 189445 f3
D 4 1 etc
F 3956 f4
D 5 1 info
F 2613 f5
F 50111 f6
D 6 1 lib
F 610422 f7
D 7 1 man
F 82097 f8
---------------------------------



</contents>

</message>

