Browse thread
[Caml-list] what is the functional way to solve this problem?
- Ram Bhamidipaty
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | Ram Bhamidipaty <ramb@s...> |
| Subject: | [Caml-list] what is the functional way to solve this problem? |
Someone suggested that my previous question about dynamically resizing arrays hinted that my solution may be going in a non-functional direction. That might be true. So here is the problem I am trying to solve. I would like to solve it in a "functional way". I want to create an in-memory representation of some file system data. The input data file has three different types of lines in the file: 1. starts with "R": R 0 <dirname> 2. starts with "D": D <dir_num> <parent_dir_num> <name> The <dir_num> associated with each directory are sequentially assigned starting at 1. 3. starts with "F": F <file_size> <name> The first line will always be: "R 0 <dir_name>" where 0 is the directory number for the top level directory. This purpose of this line is provide the starting point for the data in the rest of the file. The D lines indicate directories. The dir_num is an integer that uniquely identifies this particular directory. The parent_dir_num integer is used to locate this directory relative to the other directories in the data file. The F lines indicate data for a single file. The majority of the lines in the file should be F lines. A file listed on an F line is in the directory indicated by the closest D line that came earlier in the file. Once all the data is read in I want to output: A list of the top 100 largest files, A list of the top 100 directories that contain the largest fraction of the total disk space used by all the files in the data file - and in this case the file size for a directory does not include sub-directories. Eventually I want to also categorize the data by user-id as well. I have already written a python application that can read this data file and generate the data I am looking for. I was not happy with the python solution because it was not very fast. Even with using a heap to store the top 100 largest files I was not able to create a python solution that could beat a "grep | sort" pipeline (on unix of course). In the python solution the limiting factor was putting the individual files into a heap and repeatedly calling delete_min on the heap to remove the smallest files. Even though the unix pipe based solution ended up sorting _all_ the files and the python solution was handling a smaller set of data the unix pipe solution was still much faster. There is a thread in google about this experiment. Do a google groups search for group:comp.lang.python + bhamidipaty + "looking for speed-up ideas". The bottom line for the python solution was that the grep+sort pipeline took about 8 seconds and the fastest I could get python was around 16 seconds. Of course the unix pipe solution would not be able to do the other analysis that I wanted. My goal of implementing this in OCaml is to beat the grep+sort combination. If I can create a solution that can output all the date I want - in one pass - AND still be faster than the grep+sort partial solution - _that_ would be cool! Having said all that - I wanted to use an array to hold the data for each directory. I hoped that using an array would be faster than a hash table since I know that the directory numbers are assigned sequentially. Thanks for reading this rather long message and thank you for any advice. -Ram --------------------------------- An example of the data file: R 0 /usr D 1 0 local F 4095165 f1 D 2 1 bin F 189408 f2 F 189445 f3 D 4 1 etc F 3956 f4 D 5 1 info F 2613 f5 F 50111 f6 D 6 1 lib F 610422 f7 D 7 1 man F 82097 f8 --------------------------------- ------------------- To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners