Composite

by the Camls 'R Us team

Winner of the first prize at the ICFP'99 programming contest


Introduction
Structure of the program
The first local optimizer
The second local optimizer
The global synthesizer
Integrating the three optimizers
Developing a test suite
Experimental results
Conclusions and random thoughts
Further reading


Introduction

The challenge of the 1999 ICFP programming contest was to develop an optimizer for a small programming language used to describe non-player characters in interactive fiction games. The goal of the optimizer was to reduce as much as possible the size of the programs describing non-player characters, while preserving their semantics. The challenge task was posted on the Web on Thursday, Sept 2nd at 11pm French time; participants were given 72 hours, until Sunday, Sept 5th at 11pm, to complete the challenge, using the language of their choice.

Our entry, code-named ``Composite'', was written by a team of five persons:

Pascal Cuoq PhD student, INRIA Rocquencourt
Damien Doligez Researcher, INRIA Rocquencourt
Xavier Leroy Researcher, INRIA Rocquencourt and Trusted Logic S.A.
Luc Maranget Researcher, INRIA Rocquencourt
Alan Schmitt PhD student, INRIA Rocquencourt

The program is written entirely in Objective Caml version 2.02+4 (the current development version at the time of the contest), using the Unix library for accessing OS services.

Structure of the program

As its name implies, ``Composite'' is the outcome of no less than four independent attempts at solving the challenge. Two of those attempts are local optimizers, operating by local size-reducing transformations on the input programs. The other two are global synthesizers that attempt to synthesize from the ground up a program equivalent to the input program, but smaller.

The four optimizers were developed largely independently on Friday, Saturday and Sunday morning. The sub-teams shared basic pieces of code such as parsers and pretty-printers, as well as test suites and equivalence testers. On Sunday afternoon, one of the global synthesizers was very far from working and was abandoned, leaving three working optimizers that each had their own strengths and weaknesses. No optimizer was consistently better than the others on our tests, so we decided to combine them in a single entry, and spent a busy Sunday evening stitching them together.

The first local optimizer

The first local optimizer was developed by Pascal Cuoq and Alan Schmitt. The lexer, parser and pretty printer were completed within 90 man-minutes. The rest was a set of independent optimizations, each quickly written as soon as the powerful drugs from coffee and chocolate induced their images in our feverish minds.

This optimizer performs the following transformations on the input program:

This optimizer was completed on Friday evening and submitted as a ``lightning entry'' under the team name ``Bat 8''. It ranked 6th in the preliminary ranking, and 3rd among the lightning entries, right behind the entries that were to receive the second prize and the jury's prize.

The second local optimizer

The second local optimizer was developed independently by Xavier Leroy. It performs the following transformations:

The global synthesizer

The global synthesizer was written by Damien Doligez. It works by computing a denotational semantics (so to speak) for the input program, then rebuilding an equivalent program by a divide-and-conquer algorithm with backtracking.

More precisely, the program is first transformed into a list of (boolean expression, decision) pairs, where the boolean expression over the variables is true if and only if the input program terminates with the given decision. For instance, the input program

((1 (CASE (VAR "y")
      ((ARM (100) (DECISION 2 "a"))
       (ARM (101) (DECISION _ "b"))
       (ARM (102) (DECISION _ "c")))
      (DECISION _ "d")))
 (2 (CASE (VAR "y")
      ((ARM (100) (DECISION _ "a"))
       (ARM (101) (DECISION _ "b"))
       (ARM (102) (DECISION _ "e")))
      (DECISION 2 "d"))))
is transformed into
  ((state = 1 /\ y = 100) \/ (state = 2 /\ y = 100),
        DECISION 2 "a")
  ((state = 1 /\ y = 101) \/ (state = 2 /\ y = 101),
        DECISION _ "b")
  (state = 1 /\ y = 102,
        DECISION _ "c")
  ((state = 1 /\ y != 100 /\ y != 101 /\ y != 102) \/
   (state = 2 /\ y != 100 /\ y != 101 /\ y != 102),
        DECISION _ "d")
  (state = 2 /\ y = 102,
        DECISION _ "e")
Note the change from DECISION _ "a" to DECISION 2 "a", and also the change from DECISION 2 "d" to DECISION _ "d". These changes are made to minimize the number of distinct DECISION statements.
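In OCaml terms, the flattening step can be sketched as follows. All names and types below are illustrative assumptions, not the submission's actual code: a decision tree is traversed with a path condition that accumulates the equalities and disequalities established so far, and pairs sharing a decision are then merged by Or-ing their conditions.

```ocaml
(* Illustrative sketch of the flattening step; names and types are
   assumptions, not the actual code of the submission. *)

type cond =
  | True
  | Eq of string * int          (* variable = constant *)
  | Neq of string * int         (* variable <> constant *)
  | And of cond * cond
  | Or of cond * cond

type tree =
  | Decision of string                          (* leaf *)
  | Case of string * (int * tree) list * tree   (* CASE var, arms, default *)

(* Flatten a tree under path condition [c] into (condition, decision)
   pairs: each arm adds an equality, the default branch adds the
   negation of every tested constant. *)
let rec flatten c = function
  | Decision d -> [ (c, d) ]
  | Case (v, arms, default) ->
      let arm_pairs =
        List.concat_map (fun (k, t) -> flatten (And (c, Eq (v, k))) t) arms
      in
      let not_tested =
        List.fold_left (fun acc (k, _) -> And (acc, Neq (v, k))) c arms
      in
      arm_pairs @ flatten not_tested default

(* Merge pairs sharing a decision by Or-ing their conditions,
   as in the example above. *)
let merge pairs =
  List.fold_left
    (fun acc (c, d) ->
       match List.assoc_opt d acc with
       | None -> (d, c) :: acc
       | Some c' -> (d, Or (c', c)) :: List.remove_assoc d acc)
    [] pairs

(* State 1 of the example above: *)
let state1 =
  Case ("y",
        [ (100, Decision "a"); (101, Decision "b"); (102, Decision "c") ],
        Decision "d")
```

Flattening each state under its own state = i condition and then merging across states yields the clause list shown above.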

The boolean expressions are then simplified with rules similar to the ones used by the second local optimizer for IF statements (elimination of redundant tests and factoring of common subexpressions using distributivity).

Then begins the synthesis phase. The program picks an "interesting" variable and a partition of all the values of interest for this variable in the current context. It generates a CASE statement on this variable and the lists of pairs (condition, action) corresponding to each branch of the CASE. Each condition is the conjunction of the original condition and the assertion that the variable is in the subset for this case. The conditions are simplified and unsatisfiable clauses are eliminated.

An example is in order. Assume we pick the variable y and the partition {{100}, {101}, {102}, N \ {100,101,102}} in our previous example. We get the following program skeleton:

  (CASE (VAR "y")
    ((ARM (100) hole_1)
     (ARM (101) hole_2)
     (ARM (102) hole_3))
    hole_4)

The four holes are filled by recursively calling the synthesis function on the four lists of pairs obtained by conjoining the conditions of the original list respectively with y = 100, y = 101, y = 102, and y != 100 /\ y != 101 /\ y != 102. Those lists, after simplification, are:

  hole_1: (state = 1 \/ state = 2, DECISION 2 "a")

  hole_2: (state = 1 \/ state = 2, DECISION _ "b")

  hole_3: (state = 1, DECISION _ "c")
          (state = 2, DECISION _ "e")

  hole_4: (state = 1 \/ state = 2, DECISION _ "d")

For hole_1, hole_2, hole_4, we are in the ground case: only one clause remains, so there is no need to test anything, and the resulting program is the corresponding DECISION statement.

For hole_3, the recursive call will produce a CASE statement on state, which is transformed on the fly into an IF statement using code borrowed from the second local optimizer.
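That on-the-fly rewrite is a small one: a CASE with a single tested constant becomes an IF on the corresponding equality. A hypothetical sketch (the type and names are illustrative, not the submission's code):

```ocaml
(* Illustrative statement type; If (v, k, s1, s2) reads as
   "if v = k then s1 else s2". *)
type stmt =
  | Decision of string
  | Case of string * (int * stmt) list * stmt
  | If of string * int * stmt * stmt

(* Rewrite a single-arm CASE into an IF; leave anything else alone. *)
let case_to_if = function
  | Case (v, [ (k, s) ], default) -> If (v, k, s, default)
  | s -> s
```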

The resulting program is:

  (CASE (VAR "y")
    ((ARM (100) (DECISION 2 "a"))
     (ARM (101) (DECISION _ "b"))
     (ARM (102) (IF (EQUALS (VAR "state") 1) (DECISION _ "c") (DECISION _ "e"))))
    (DECISION _ "d"))

Now we need to explain how the variable and partition are chosen. This is done by three heuristics controlled by a fourth heuristic and a timer. In addition, we use backtracking: at each step, several possible choices are tried and the best one (in terms of result size) is retained.

The controlling heuristic computes an approximate cost for each partition generated by the first two heuristics and retains a number of the best ones (3 or 1, depending on the state of the timer) to try in addition to the skeleton generated by the third heuristic.
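Abstracting away the heuristics and the actual program construction, the backtracking control structure -- keep the k best-looking candidates at each step and retain the cheapest fully synthesized result -- can be sketched generically. Everything here is an illustrative assumption: expand stands for the candidate-generating heuristics, ground for the single-clause ground case, and a synthesized program is reduced to its cost.

```ocaml
(* Generic sketch of the backtracking divide-and-conquer loop.
   [ground p] returns [Some cost] when [p] is a ground case;
   [expand p] returns candidate (skeleton cost, subproblems) pairs
   (assumed non-empty for non-ground problems); [k] is the branching
   factor, lowered from 3 to 1 when the timer expires. *)
let rec best_of ~k ~expand ~ground problem =
  match ground problem with
  | Some cost -> cost
  | None ->
      (* cost of one candidate = its skeleton plus its synthesized holes *)
      let try_candidate (skeleton_cost, subproblems) =
        List.fold_left
          (fun acc sub -> acc + best_of ~k ~expand ~ground sub)
          skeleton_cost subproblems
      in
      (* keep only the k best-looking candidates, then backtrack *)
      let kept =
        expand problem
        |> List.sort (fun (c1, _) (c2, _) -> compare c1 c2)
        |> List.filteri (fun i _ -> i < k)
      in
      List.fold_left (fun best cand -> min best (try_candidate cand))
        max_int kept
```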

Integrating the three optimizers

The two local optimizers are rather fast and well-behaved. (Even after iterating them a few times for good measure, they only take a few seconds on typical inputs.) Thus, we decided to run them first.

The global synthesizer is exponential by nature (because several alternatives are tried at each point of the divide-and-conquer strategy), so it's an ideal candidate for eating up whatever remains of the 30 minutes of running time allocated to the programs. We thus run the global synthesizer after the local optimizers, using the smaller of their two outputs as its input.

In an attempt to help the global synthesizer produce some output in a reasonable amount of time, we start a 10-minute timer at the beginning of the synthesis phase. When this timer expires, the branching factor of the divide-and-conquer algorithm is reduced from 3 to 1 (that is, the algorithm no longer considers the 3 best candidate tests at each step, but picks only the best candidate).

The output of the global synthesizer can often benefit from peephole optimizations, so we re-run the local optimizers on the output of the synthesizer for a final cleanup.
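The resulting pipeline can be sketched as follows, with programs crudely represented as strings and local1, local2, synthesize standing in for the three optimizers (all names are illustrative; the real driver also manages the 10-minute timer described above):

```ocaml
(* Illustrative sketch of the Composite driver; a program's size is
   just its length here. *)
let size (p : string) = String.length p

let smallest = function
  | [] -> invalid_arg "smallest"
  | p :: ps ->
      List.fold_left (fun best q -> if size q < size best then q else best) p ps

let composite ~local1 ~local2 ~synthesize input =
  let l1 = local1 input in
  let l2 = local2 input in
  (* feed the smallest result so far to the synthesizer *)
  let pre = smallest [ input; l1; l2 ] in
  let s = synthesize pre in
  (* peephole cleanup: re-run the local optimizers on its output *)
  let post = local2 (local1 s) in
  (* never emit anything larger than what we already have *)
  smallest [ input; l1; l2; s; post ]
```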

According to the contest rules, size (of the output programs) matters a lot. But implicit in the rules is that correctness and timeliness matter even more. We guessed correctly that contestants who produced wrong output, or failed to produce any output at all within the time limits, would be eliminated immediately. We also knew for a fact that our program would be buggy, given the short development time and small test suite that we had. Thus, we made every possible effort to ensure that our program would always produce an output semantically equivalent to the input program within the time limits. This was achieved by standard defensive programming techniques, chief among them a random-testing phase that checks each optimizer's output for equivalence with its input and discards any output that fails the check.
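One such safeguard, the random differential test, can be sketched as follows. Here eval and random_env are illustrative stand-ins for a contest-language interpreter and a random environment generator; the idea is simply to run both programs on randomly drawn variable environments and reject the optimized program at the first disagreement:

```ocaml
(* Sketch of a random differential test: [eval p env] runs program [p]
   in environment [env]; [random_env ()] draws a fresh random
   environment.  Returns true iff the two programs agree on all
   [trials] random environments. *)
let equivalent_on_random_tests ~eval ~random_env ~trials p1 p2 =
  let rec loop i =
    i = 0
    || (let env = random_env () in
        eval p1 env = eval p2 env && loop (i - 1))
  in
  loop trials
```

Such a test can only refute equivalence, never prove it, but a disagreement on any random environment is enough to discard a buggy optimizer's output.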

The integration of the various optimizers in a common framework went relatively smoothly. Due to a certain lack of synchronization at the beginning of the development, we ended up with two slightly different data types for abstract syntax trees, each accompanied by its own parser, pretty-printer, and equivalence tester. Rather than attempting a hazardous last-minute rewrite of the optimizers, we just wrote trivial conversion functions between the two syntax trees, and applied them on entrance and exit to the optimizers. We didn't even have to think about where to put the conversions: the OCaml type-checker told us. The OCaml module system was also helpful in ensuring that we didn't get any name clashes between the various parts.

Developing a test suite

Having tests other than the one example provided by the organizers (the Busker character) proved important, both for evaluating optimizations and for catching bugs. Hand-written tests are useful for testing borderline cases and verifying that optimization ideas work as advertised, but machine-generated tests also proved very useful.

We thus wrote a code pessimizer: its input language is an extension of the contest language, while its output is the contest language itself. The one extension is ML-style pattern matching on tuples, which the pessimizer dumbly compiles left-to-right using CASE only. We know that other compilation techniques can yield much smaller output, so these examples offer ample opportunities for optimization.
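The dumb left-to-right compilation scheme can be sketched as follows (all names are illustrative assumptions; the real output syntax is the contest language). Each CASE on the leftmost variable specializes the remaining rows, and wildcard rows are duplicated into every branch -- which is exactly what makes the output blow up:

```ocaml
(* Hypothetical sketch of left-to-right compilation of tuple patterns
   into nested CASEs. *)
type pat = Const of int | Wild

type code =
  | Decision of string
  | Case of string * (int * code) list * code   (* var, arms, default *)

(* [rows] pairs a list of patterns (one per remaining variable in
   [vars]) with the action to take when they all match. *)
let rec compile vars rows default =
  match vars, rows with
  | _, [] -> default
  | [], (_, act) :: _ -> act               (* first remaining row matches *)
  | v :: vs, _ ->
      (* constants tested on [v] by some row *)
      let consts =
        List.sort_uniq compare
          (List.filter_map
             (function Const k :: _, _ -> Some k | _ -> None) rows)
      in
      (* rows still relevant when v = k: matching constants and wildcards *)
      let specialize k =
        List.filter_map
          (function
            | (Const k' :: ps, act) when k' = k -> Some (ps, act)
            | (Wild :: ps, act) -> Some (ps, act)
            | _ -> None)
          rows
      in
      (* rows still relevant in the default branch: wildcards only *)
      let default_rows =
        List.filter_map
          (function Wild :: ps, act -> Some (ps, act) | _ -> None) rows
      in
      if consts = [] then
        compile vs
          (List.map (function (_ :: ps, act) -> (ps, act) | r -> r) rows)
          default
      else
        Case (v,
              List.map (fun k -> (k, compile vs (specialize k) default)) consts,
              compile vs default_rows default)
```

On even a two-row, two-column input, the wildcard row is compiled twice (once in the arm, once in the default branch), so the duplication compounds with each column -- hence the size explosion on the triangles below.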

Consider for instance the following pattern-matching triangle:

 (0
   (CASE
      ((VAR "x1") (VAR "x2")(VAR "x3") (VAR "x4") (VAR "x5")
       (VAR "x6") (VAR "x7") (VAR "x8"))
  (
   (ARM ((1) (1) (1) (1) (1) (1) (1) (1) )
        (DECISION 1 "FOO"))
   (ARM ((2) (2) (2) (2) (2) (2) (2) (2) )
        (DECISION 2 "FOO"))
   (ARM ( _ (1) (1) (1) (1) (1) (1) (1) )
        (DECISION 3 "FOO"))
   (ARM ( _ (2) (2) (2) (2) (2) (2) (2) )
        (DECISION 2 "FOO"))
   (ARM ( _  _ (2) (2) (2) (2) (2) (2) )
        (DECISION 4 "FOO"))
   (ARM (_   _  _  (1) (1) (1) (1) (1) )
        (DECISION 5 "FOO"))
   (ARM (_   _  _  (2) (2) (2) (2) (2) )
        (DECISION 6 "FOO"))
   (ARM (_  _  _    _ (1) (1) (1) (1) )
        (DECISION 7 "FOO"))
   (ARM (_  _  _    _ (2) (2) (2) (2) )
        (DECISION 8 "FOO"))
   (ARM (_  _  _    _  _ (1) (1) (1) )
        (DECISION 9 "FOO"))
   (ARM (_  _  _    _  _ (2) (2) (2) )
        (DECISION 10 "FOO"))

  )
  (DECISION _ "DEFAULT"))
 )
)

On this input, the pessimizer produces a program of size 16089, which the local optimization phases reduce to 815. The global synthesizer then outputs a program of size 139. Note that it would be difficult to write such a test by hand.

Other examples of interest are pattern-matching expressions that cannot be compiled properly by using CASE only:

(
  (0
    (CASE (
        (VAR "x1") (VAR "x2") (VAR "x3") (VAR "x4") (VAR "x5") (VAR "x6")
        (VAR "y1") (VAR "y2") (VAR "y3") (VAR "y4") (VAR "y5") (VAR "y6")
        (VAR "z1") (VAR "z2") (VAR "z3") (VAR "z4") (VAR "z5") (VAR "z6")
        (VAR "k1") (VAR "k2") (VAR "k3")
        ) (
(ARM ((1)(1)(1)(1)(1)(1) _  _  _  _  _  _  _  _  _  _  _  _  _  _  _ )
        (DECISION 1 "FOO"))
(ARM ((2) _  _  _  _  _ (1)(1)(1)(1)(1) _  _  _  _  _  _  _  _  _  _ )
        (DECISION 2 "FOO"))
(ARM ( _ (2) _  _  _  _ (2) _  _  _  _ (1)(1)(1)(1) _  _  _  _  _  _ )
        (DECISION 3 "FOO"))
(ARM ( _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _  _ (1)(1)(1) _  _  _ )
        (DECISION 4 "FOO"))
(ARM ( _  _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _ (2) _  _ (1)(1) _ )
        (DECISION 5 "FOO"))
(ARM ( _  _  _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _ (2) _ (2) _  _ )
        (DECISION 6 "FOO"))
(ARM ( _  _  _  _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _ (2) _ (2) (1))
        (DECISION 7 "FOO"))
        )
     (DECISION _ "DEFAULT"))
        )
)

This example is pessimized into a program of size 318639, which the local optimization phases reduce to 79457. The global synthesizer then fails to optimize the example any further in the allocated 30 minutes. This is a very demanding example...

Experimental results

The contest judges were kind enough to provide contestants with the set of input programs they used to rank the entries. Below are the space savings that our optimizers achieved on those inputs.

Caption: for each test input, we show the initial size, the sizes after the two local optimization passes (Local 1 and Local 2), after the global synthesizer pass (Synthesizer), and after the re-run of the local optimizers on the output of the synthesizer (Synth-Local 1 and Synth-Local 2). The last column is the best space saving achieved, given as a percentage of the input size.

Hand-written characters

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
busker 576 360 351 358 358 351 -39.1%
cat 302 274 265 298 298 265 -12.3%
jarboe 1066 298 178 **/174 **/174 **/174 -83.3%/-83.6%
parrot 380 356 356 **/444 **/442 **/356 -6.3%

The first set of tests is composed of four characters from an actual interactive fiction game. They were written by hand in a relatively natural style. One of them (the busker) was given as an example to all participants. On all those tests, the second local optimizer achieved the best results. The synthesizer produced incorrect output on jarboe and parrot, as denoted by the ** entries in the table. These incorrect results were caught by the random test phase and discarded. After the contest, we fixed the bug in the synthesizer and obtained the results shown on the right of the / sign. The corrected synthesizer slightly improves the results on the jarboe test.

If-converted characters

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
ifbusker 645 601 351 358 358 351 -45.6%
ifcat 361 355 280 298 298 280 -22.4%
ifjarboe 1616 1616 1286 **/174 **/174 **/174 -20.4%/-89.2%
ifparrot 514 506 364 **/444 **/442 **/356 -29.1%/-30.7%

The next set of tests consists of variants of the four hand-written characters discussed above, with CASE statements replaced by cascades of IF tests. On ifbusker and ifcat, the second local optimizer does a reasonable job. In particular, it achieves the same output size for ifbusker as for the original busker, thus undoing the ``obfuscation'' performed by the judges on the original character.

On ifjarboe, the local optimizers did OK, but the synthesizer would have done much better (also succeeding in undoing the obfuscation), if not for its stupid bug.

The ifparrot test shows an interesting pattern that occurs in several of the tests below: the synthesizer alone doesn't do better than the local optimizers, but re-running the local optimizers on the output of the synthesizer does. Figuratively, the synthesizer acts as a nutcracker: it cracks enough of the nutshell to allow the local optimizers to get at the fruit.

Pessimized characters

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
pessbusker 1892 1704 1328 358 358 358 -81.1%
pesscat 799 769 525 268 265 265 -66.8%
pessjarboe 3153 3059 2824 **/174 **/174 **/174 -10.4%/-94.4%
pessparrot 3291 2458 1607 **/444 **/442 **/356 -51.1%/-89.1%

The third set of tests is also derived from the four hand-written characters, with a number of ``pessimizing'' transformations applied, making them much larger and much harder to optimize. But that doesn't fool the synthesizer, which is, in essence, able to undo the pessimization and recover the same results as for the original characters. The stupid bug in the submission still bites us in two of the four tests.

Instruction recognizers

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
alpha0 30 28 28 28 28 28 -6.7%
alpha1 24 22 22 22 22 22 -8.3%
alpha2 8976 8392 7035 3311 3185 3160 -64.8%
mips0 3842 3652 3443 3402 3374 3370 -12.3%
pentium0 8 3 3 3 3 3 -62.5%
pentium1 8 3 3 3 3 3 -62.5%
pentium2 8 3 3 3 3 3 -62.5%
pentium3 356 313 211 148 148 148 -58.4%
pentium4 382 319 313 122 122 122 -68.1%
pentium5 335336 311326 123700 --- --- --- -63.1%
sparc0 24 22 22 22 22 22 -8.3%
sparc1 76 67 52 40 40 40 -47.4%
sparc2 30 28 28 27 22 22 -26.7%
sparc3 14946 13432 7096 2864 2859 2486 -83.4%

The fourth set of tests comes from a totally different application area. It consists of instruction recognizers for subsets of various processors. They seem to be generated from a high-level processor description.

On all but the smallest tests, the global synthesizer simply kicks ass, delivering results at least twice as small as those of the local optimizers. The pentium5 test is very large and causes the global synthesizer to time out, but fortunately the second local optimizer had already done a decent job.

Processor simulations

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
sim-intel.rtl0 115339 2214 2214 1391 1386 1095 -99.1%
sim-sparc.rtl0 15370 15370 15370 --- --- --- 0%
sim-sparc.rtl1 1804 1804 1804 305 303 231 -87.2%
sim-sparc.rtl2 1219 1219 1219 245 240 160 -86.9%
xs.rtl0 952 952 265 270 265 265 -72.2%
xs.rtl0.mc 3413 2458 1005 1203 1203 1005 -70.6%

The last set of tests also comes from processor descriptions, although we didn't figure out exactly what they do. At any rate, this is a mixed bunch: sim-sparc.rtl0 is essentially impervious to our optimizations (the local optimizers do nothing and the synthesizer times out), while sim-intel.rtl0 is reduced by a factor of 100. Notice again the nutcracker effect of the synthesizer, which not only consistently beats the local optimizers, but also allows them to do an even better job on its output.

Conclusions and random thoughts

First and foremost, let us emphasize that we had a lot of fun working on this programming challenge. Even Leroy, who in the past pooh-poohed the ICFP contest as ``a pissing contest'' and didn't deign to look at this year's challenge until Friday night, ended up spending the whole weekend riveted to his home machine. Thanks a lot to the organizers for coming up with such an interesting topic.

Now is the time to exercise the ``unlimited bragging rights'' promised in the contest announcement.

Functional languages in general, and the Caml language in particular, were a very good match for the problem. The language features we relied on a lot include:

In addition to those language features, several features of the Objective Caml implementation also played an important role:

It is interesting to note that 4 of the 8 finalists were using the Caml garbage collector, virtual machine or native-code compiler in one form or another (two OCaml entries, one OLabl entry and one Moscow ML entry).

The topic of the contest was also a very good match for our expertise in compilers and theorem provers. The other two prizes also went to experts in the field of compilation of functional programming languages. But compiler skills alone -- useful as they are to acquaint oneself with program transformations and the importance of preserving semantics -- weren't enough. Compiler writers tend to settle for sub-optimal transformations with reasonable running time, mostly effective against human-written input, such as our two local optimizers. It takes someone familiar with theorem proving and decision procedures for various logics to come up with something as ambitious (some would say crazy) as Doligez's synthesizer.

Finally, a word should be said about our highly parallel development process. The organizers warned against large teams as unwieldy and hard to synchronize within the 72-hour time limit. This is true in general, but perhaps less so for the topic of this contest, which has two unusual features as far as programming projects go. First, the input and output of the program, and the relation between them, are perfectly defined. Second, two solutions can always be combined into a third solution that is at least as good as either of the two. (This is not quite right because of the time limits; but disregarding this, a mere shell script suffices to run N optimizers sequentially and select the smallest output.) That's how we were able to harness the programming power of five people and end up with a program that handles different styles of input well (hand-written, hand-pessimized, and machine-generated).

Further reading

The sources for Composite, as submitted for the contest.
The statement of the contest.
INRIA's press release (in French).
More info on Caml.
The report on Ariane flight 501.


{Pascal.Cuoq,Damien.Doligez,Xavier.Leroy,Luc.Maranget,Alan.Schmitt}@inria.fr, 1999/11/02