Composite

by the Camls 'R Us team

Winner of the first prize at the ICFP'99 programming contest


Introduction
Structure of the program
The first local optimizer
The second local optimizer
The global synthesizer
Integrating the three optimizers
Developing a test suite
Experimental results
Conclusions and random thoughts
Further reading


Introduction

The challenge of the 1999 ICFP programming contest was to develop an optimizer for a small programming language used to describe non-player characters in interactive fiction games. The goal of the optimizer was to reduce as much as possible the size of the programs describing non-player characters, while preserving their semantics. The challenge task was posted on the Web on Thursday, Sept 2nd at 11pm French time; participants were given 72 hours, until Sunday, Sept 5th at 11pm, to complete the challenge, using the language of their choice.

Our entry, code-named ``Composite'', was written by a team of five persons:

Pascal Cuoq PhD student, INRIA Rocquencourt
Damien Doligez Researcher, INRIA Rocquencourt
Xavier Leroy Researcher, INRIA Rocquencourt and Trusted Logic S.A.
Luc Maranget Researcher, INRIA Rocquencourt
Alan Schmitt PhD student, INRIA Rocquencourt

The program is written entirely in Objective Caml version 2.02+4 (the current development version at the time of the contest), using the Unix library for accessing OS services.

Structure of the program

As its name implies, ``Composite'' is the outcome of no less than four independent attempts at solving the challenge. Two of those attempts are local optimizers, operating by local size-reducing transformations on the input programs. The other two are global synthesizers that attempt to synthesize from the ground up a program equivalent to the input program, but smaller.

The four optimizers were developed largely independently on Friday, Saturday and Sunday morning. The sub-teams shared basic pieces of code such as parsers and pretty-printers, as well as test suites and equivalence testers. On Sunday afternoon, one of the global synthesizers was very far from working and was abandoned, leaving three working optimizers that each had their own strengths and weaknesses. No optimizer was consistently better than the others on our tests, so we decided to combine them in a single entry, and spent a busy Sunday evening stitching them together.

The first local optimizer

The first local optimizer was developed by Pascal Cuoq and Alan Schmitt. The lexer, parser and pretty printer were completed within 90 man-minutes. The rest was a set of independent optimizations, each quickly written as soon as the powerful drugs from coffee and chocolate induced their images in our feverish minds.

This optimizer performs the following transformations on the input program:

This optimizer was completed on Friday evening and submitted as a ``lightning entry'' under the team name ``Bat 8''. It ranked 6th in the preliminary ranking, and 3rd among the lightning entries, right behind the entries that were to receive the second prize and the jury's prize.

The second local optimizer

The second local optimizer was developed independently by Xavier Leroy. It performs the following transformations:

The global synthesizer

The global synthesizer was written by Damien Doligez. It works by computing a denotational semantics (so to speak) for the input program, then rebuilding an equivalent program by a divide-and-conquer algorithm with backtracking.

More precisely, the program is first transformed into a list of (boolean expression, decision) pairs, where the boolean expression over the variables is true if and only if the input program terminates with the given decision. For instance, the input program

((1 (CASE (VAR "y")
      ((ARM (100) (DECISION 2 "a"))
       (ARM (101) (DECISION _ "b"))
       (ARM (102) (DECISION _ "c")))
      (DECISION _ "d")))
 (2 (CASE (VAR "y")
      ((ARM (100) (DECISION _ "a"))
       (ARM (101) (DECISION _ "b"))
       (ARM (102) (DECISION _ "e")))
      (DECISION 2 "d"))))
is transformed into
  ((state = 1 /\ y = 100) \/ (state = 2 /\ y = 100),
        DECISION 2 "a")
  ((state = 1 /\ y = 101) \/ (state = 2 /\ y = 101),
        DECISION _ "b")
  (state = 1 /\ y = 102,
        DECISION _ "c")
  ((state = 1 /\ y != 100 /\ y != 101 /\ y != 102) \/
   (state = 2 /\ y != 100 /\ y != 101 /\ y != 102),
        DECISION _ "d")
  (state = 2 /\ y = 102,
        DECISION _ "e")
Note the change from DECISION _ "a" to DECISION 2 "a", and also the change from DECISION 2 "d" to DECISION _ "d". These changes are made to minimize the number of distinct DECISION statements.
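In OCaml terms, the flattening step can be sketched as follows. All names and types below are illustrative assumptions, not the submission's actual code: a decision tree is traversed with a path condition that accumulates the equalities and disequalities established so far, and pairs sharing a decision are then merged by Or-ing their conditions.

```ocaml
(* Illustrative sketch of the flattening step; names and types are
   assumptions, not the actual code of the submission. *)

type cond =
  | True
  | Eq of string * int          (* variable = constant *)
  | Neq of string * int         (* variable <> constant *)
  | And of cond * cond
  | Or of cond * cond

type tree =
  | Decision of string                          (* leaf *)
  | Case of string * (int * tree) list * tree   (* CASE var, arms, default *)

(* Flatten a tree under path condition [c] into (condition, decision)
   pairs: each arm adds an equality, the default branch adds the
   negation of every tested constant. *)
let rec flatten c = function
  | Decision d -> [ (c, d) ]
  | Case (v, arms, default) ->
      let arm_pairs =
        List.concat_map (fun (k, t) -> flatten (And (c, Eq (v, k))) t) arms
      in
      let not_tested =
        List.fold_left (fun acc (k, _) -> And (acc, Neq (v, k))) c arms
      in
      arm_pairs @ flatten not_tested default

(* Merge pairs sharing a decision by Or-ing their conditions,
   as in the example above. *)
let merge pairs =
  List.fold_left
    (fun acc (c, d) ->
       match List.assoc_opt d acc with
       | None -> (d, c) :: acc
       | Some c' -> (d, Or (c', c)) :: List.remove_assoc d acc)
    [] pairs

(* State 1 of the example above: *)
let state1 =
  Case ("y",
        [ (100, Decision "a"); (101, Decision "b"); (102, Decision "c") ],
        Decision "d")
```

Flattening each state under its own state = i condition and then merging across states yields the clause list shown above.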

The boolean expressions are then simplified with rules similar to the ones used by the second local optimizer for IF statements (elimination of redundant tests and factoring of common subexpressions using distributivity).

Then begins the synthesis phase. The program picks an "interesting" variable and a partition of all the values of interest for this variable in the current context. It generates a CASE statement on this variable and the lists of pairs (condition, action) corresponding to each branch of the CASE. Each condition is the conjunction of the original condition and the assertion that the variable is in the subset for this case. The conditions are simplified and unsatisfiable clauses are eliminated.

An example is in order. Assume we pick the variable y and the partition {{100}, {101}, {102}, N \ {100,101,102}} in our previous example. We get the following program skeleton:

  (CASE (VAR "y")
    ((ARM (100) hole_1)
     (ARM (101) hole_2)
     (ARM (102) hole_3))
    hole_4)

The four holes are filled by recursively calling the synthesis function on the four lists of pairs obtained by conjoining the conditions of the original list respectively with y = 100, y = 101, y = 102, and y != 100 /\ y != 101 /\ y != 102. Those lists, after simplification, are:

  hole_1: (state = 1 \/ state = 2, DECISION 2 "a")

  hole_2: (state = 1 \/ state = 2, DECISION _ "b")

  hole_3: (state = 1, DECISION _ "c")
          (state = 2, DECISION _ "e")

  hole_4: (state = 1 \/ state = 2, DECISION _ "d")

For hole_1, hole_2, hole_4, we are in the ground case: only one clause remains, so there is no need to test anything, and the resulting program is the corresponding DECISION statement.

For hole_3, the recursive call will produce a CASE statement on state, which is transformed on the fly into an IF statement using code borrowed from the second local optimizer.
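That on-the-fly rewrite is a small one: a CASE with a single tested constant becomes an IF on the corresponding equality. A hypothetical sketch (the type and names are illustrative, not the submission's code):

```ocaml
(* Illustrative statement type; If (v, k, s1, s2) reads as
   "if v = k then s1 else s2". *)
type stmt =
  | Decision of string
  | Case of string * (int * stmt) list * stmt
  | If of string * int * stmt * stmt

(* Rewrite a single-arm CASE into an IF; leave anything else alone. *)
let case_to_if = function
  | Case (v, [ (k, s) ], default) -> If (v, k, s, default)
  | s -> s
```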

The resulting program is:

  (CASE (VAR "y")
    ((ARM (100) (DECISION 2 "a"))
     (ARM (101) (DECISION _ "b"))
     (ARM (102) (IF (EQUALS (VAR "state") 1) (DECISION _ "c") (DECISION _ "e"))))
    (DECISION _ "d"))

Now we need to explain how the variable and partition are chosen. This is done by three heuristics controlled by a fourth heuristic and a timer. In addition, we use backtracking: at each step, several possible choices are tried and the best one (in terms of result size) is retained.

The controlling heuristic computes an approximate cost for each partition generated by the first two heuristics and retains a number of the best ones (3 or 1, depending on the state of the timer) to try in addition to the skeleton generated by the third heuristic.
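Abstracting away the heuristics and the actual program construction, the backtracking control structure -- keep the k best-looking candidates at each step and retain the cheapest fully synthesized result -- can be sketched generically. Everything here is an illustrative assumption: expand stands for the candidate-generating heuristics, ground for the single-clause ground case, and a synthesized program is reduced to its cost.

```ocaml
(* Generic sketch of the backtracking divide-and-conquer loop.
   [ground p] returns [Some cost] when [p] is a ground case;
   [expand p] returns candidate (skeleton cost, subproblems) pairs
   (assumed non-empty for non-ground problems); [k] is the branching
   factor, lowered from 3 to 1 when the timer expires. *)
let rec best_of ~k ~expand ~ground problem =
  match ground problem with
  | Some cost -> cost
  | None ->
      (* cost of one candidate = its skeleton plus its synthesized holes *)
      let try_candidate (skeleton_cost, subproblems) =
        List.fold_left
          (fun acc sub -> acc + best_of ~k ~expand ~ground sub)
          skeleton_cost subproblems
      in
      (* keep only the k best-looking candidates, then backtrack *)
      let kept =
        expand problem
        |> List.sort (fun (c1, _) (c2, _) -> compare c1 c2)
        |> List.filteri (fun i _ -> i < k)
      in
      List.fold_left (fun best cand -> min best (try_candidate cand))
        max_int kept
```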

Integrating the three optimizers

The two local optimizers are rather fast and well-behaved. (Even after iterating them a few times for good measure, they only take a few seconds on typical inputs.) Thus, we decided to run them first.

The global synthesizer is exponential by nature (because several alternatives are tried at each point of the divide-and-conquer strategy), so it's an ideal candidate for eating up whatever remains of the 30 minutes of running time allocated to the programs. We thus run the global synthesizer after the local optimizers, using the smaller of their two outputs as its input.

In an attempt to help the global synthesizer produce some output in a reasonable amount of time, we start a 10-minute timer at the beginning of the synthesis phase. When this timer expires, the branching factor of the divide-and-conquer algorithm is reduced from 3 to 1 (that is, the algorithm no longer considers the 3 best candidate tests at each step, but picks only the best candidate).

The output of the global synthesizer can often benefit from peephole optimizations, so we re-run the local optimizers on the output of the synthesizer for a final cleanup.
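The resulting pipeline can be sketched as follows, with programs crudely represented as strings and local1, local2, synthesize standing in for the three optimizers (all names are illustrative; the real driver also manages the 10-minute timer described above):

```ocaml
(* Illustrative sketch of the Composite driver; a program's size is
   just its length here. *)
let size (p : string) = String.length p

let smallest = function
  | [] -> invalid_arg "smallest"
  | p :: ps ->
      List.fold_left (fun best q -> if size q < size best then q else best) p ps

let composite ~local1 ~local2 ~synthesize input =
  let l1 = local1 input in
  let l2 = local2 input in
  (* feed the smallest result so far to the synthesizer *)
  let pre = smallest [ input; l1; l2 ] in
  let s = synthesize pre in
  (* peephole cleanup: re-run the local optimizers on its output *)
  let post = local2 (local1 s) in
  (* never emit anything larger than what we already have *)
  smallest [ input; l1; l2; s; post ]
```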

According to the contest rules, size (of the output programs) matters a lot. But implicit in the rules is that correctness and timeliness matter even more. We guessed correctly that contestants who produced wrong output, or failed to produce any output at all within the time limits, would be eliminated immediately. We also knew for a fact that our program would be buggy, given the short development time and small test suite that we had. Thus, we made every possible effort to ensure that our program would always produce an output semantically equivalent to the input program within the time limits. This was achieved by standard defensive programming techniques, chief among them a random-testing phase that checks each optimizer's output for equivalence with its input and discards any output that fails the check.
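One such safeguard, the random differential test, can be sketched as follows. Here eval and random_env are illustrative stand-ins for a contest-language interpreter and a random environment generator; the idea is simply to run both programs on randomly drawn variable environments and reject the optimized program at the first disagreement:

```ocaml
(* Sketch of a random differential test: [eval p env] runs program [p]
   in environment [env]; [random_env ()] draws a fresh random
   environment.  Returns true iff the two programs agree on all
   [trials] random environments. *)
let equivalent_on_random_tests ~eval ~random_env ~trials p1 p2 =
  let rec loop i =
    i = 0
    || (let env = random_env () in
        eval p1 env = eval p2 env && loop (i - 1))
  in
  loop trials
```

Such a test can only refute equivalence, never prove it, but a disagreement on any random environment is enough to discard a buggy optimizer's output.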

The integration of the various optimizers in a common framework went relatively smoothly. Due to a certain lack of synchronization at the beginning of the development, we ended up with two slightly different data types for abstract syntax trees, each accompanied by its own parser, pretty-printer, and equivalence tester. Rather than attempting a hazardous last-minute rewrite of the optimizers, we just wrote trivial conversion functions between the two syntax trees, and applied them on entrance and exit to the optimizers. We didn't even have to think about where to put the conversions: the OCaml type-checker told us. The OCaml module system was also helpful in ensuring that we didn't get any name clashes between the various parts.

Developing a test suite

Having tests other than the one example provided by the organizers (the Busker character) proved important, both for evaluating optimizations and for catching bugs. Hand-written tests are useful for testing borderline cases and verifying that optimization ideas work as advertised, but machine-generated tests also proved very useful.

We thus wrote a code pessimizer: its input language is an extension of the contest language, while its output is the contest language itself. The one extension is ML-style pattern matching on tuples, which the pessimizer dumbly compiles left-to-right using CASE only. We know that other compilation techniques can yield much smaller output, so these examples offer ample opportunities for optimization.
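The dumb left-to-right compilation scheme can be sketched as follows (all names are illustrative assumptions; the real output syntax is the contest language). Each CASE on the leftmost variable specializes the remaining rows, and wildcard rows are duplicated into every branch -- which is exactly what makes the output blow up:

```ocaml
(* Hypothetical sketch of left-to-right compilation of tuple patterns
   into nested CASEs. *)
type pat = Const of int | Wild

type code =
  | Decision of string
  | Case of string * (int * code) list * code   (* var, arms, default *)

(* [rows] pairs a list of patterns (one per remaining variable in
   [vars]) with the action to take when they all match. *)
let rec compile vars rows default =
  match vars, rows with
  | _, [] -> default
  | [], (_, act) :: _ -> act               (* first remaining row matches *)
  | v :: vs, _ ->
      (* constants tested on [v] by some row *)
      let consts =
        List.sort_uniq compare
          (List.filter_map
             (function Const k :: _, _ -> Some k | _ -> None) rows)
      in
      (* rows still relevant when v = k: matching constants and wildcards *)
      let specialize k =
        List.filter_map
          (function
            | (Const k' :: ps, act) when k' = k -> Some (ps, act)
            | (Wild :: ps, act) -> Some (ps, act)
            | _ -> None)
          rows
      in
      (* rows still relevant in the default branch: wildcards only *)
      let default_rows =
        List.filter_map
          (function Wild :: ps, act -> Some (ps, act) | _ -> None) rows
      in
      if consts = [] then
        compile vs
          (List.map (function (_ :: ps, act) -> (ps, act) | r -> r) rows)
          default
      else
        Case (v,
              List.map (fun k -> (k, compile vs (specialize k) default)) consts,
              compile vs default_rows default)
```

On even a two-row, two-column input, the wildcard row is compiled twice (once in the arm, once in the default branch), so the duplication compounds with each column -- hence the size explosion on the triangles below.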

Consider for instance the following pattern-matching triangle:

 (0
   (CASE
      ((VAR "x1") (VAR "x2")(VAR "x3") (VAR "x4") (VAR "x5")
       (VAR "x6") (VAR "x7") (VAR "x8"))
  (
   (ARM ((1) (1) (1) (1) (1) (1) (1) (1) )
        (DECISION 1 "FOO"))
   (ARM ((2) (2) (2) (2) (2) (2) (2) (2) )
        (DECISION 2 "FOO"))
   (ARM ( _ (1) (1) (1) (1) (1) (1) (1) )
        (DECISION 3 "FOO"))
   (ARM ( _ (2) (2) (2) (2) (2) (2) (2) )
        (DECISION 2 "FOO"))
   (ARM ( _  _ (2) (2) (2) (2) (2) (2) )
        (DECISION 4 "FOO"))
   (ARM (_   _  _  (1) (1) (1) (1) (1) )
        (DECISION 5 "FOO"))
   (ARM (_   _  _  (2) (2) (2) (2) (2) )
        (DECISION 6 "FOO"))
   (ARM (_  _  _    _ (1) (1) (1) (1) )
        (DECISION 7 "FOO"))
   (ARM (_  _  _    _ (2) (2) (2) (2) )
        (DECISION 8 "FOO"))
   (ARM (_  _  _    _  _ (1) (1) (1) )
        (DECISION 9 "FOO"))
   (ARM (_  _  _    _  _ (2) (2) (2) )
        (DECISION 10 "FOO"))

  )
  (DECISION _ "DEFAULT"))
 )
)

On this input, the pessimizer produces a program of size 16089, which the local optimization phases reduce to 815. The global synthesizer then outputs a program of size 139. Note that it would be difficult to write such a test by hand.

Other examples of interest are pattern-matching expressions that cannot be compiled properly by using CASE only:

(
  (0
    (CASE (
        (VAR "x1") (VAR "x2") (VAR "x3") (VAR "x4") (VAR "x5") (VAR "x6")
        (VAR "y1") (VAR "y2") (VAR "y3") (VAR "y4") (VAR "y5") (VAR "y6")
        (VAR "z1") (VAR "z2") (VAR "z3") (VAR "z4") (VAR "z5") (VAR "z6")
        (VAR "k1") (VAR "k2") (VAR "k3")
        ) (
(ARM ((1)(1)(1)(1)(1)(1) _  _  _  _  _  _  _  _  _  _  _  _  _  _  _ )
        (DECISION 1 "FOO"))
(ARM ((2) _  _  _  _  _ (1)(1)(1)(1)(1) _  _  _  _  _  _  _  _  _  _ )
        (DECISION 2 "FOO"))
(ARM ( _ (2) _  _  _  _ (2) _  _  _  _ (1)(1)(1)(1) _  _  _  _  _  _ )
        (DECISION 3 "FOO"))
(ARM ( _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _  _ (1)(1)(1) _  _  _ )
        (DECISION 4 "FOO"))
(ARM ( _  _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _ (2) _  _ (1)(1) _ )
        (DECISION 5 "FOO"))
(ARM ( _  _  _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _ (2) _ (2) _  _ )
        (DECISION 6 "FOO"))
(ARM ( _  _  _  _  _ (2) _  _  _  _ (2) _  _  _ (2) _  _ (2) _ (2) (1))
        (DECISION 7 "FOO"))
        )
     (DECISION _ "DEFAULT"))
        )
)

This example is pessimized into a program of size 318639, which the local optimization phases reduce to 79457. The global synthesizer then fails to optimize the example any further in the allocated 30 minutes. This is a very demanding example...

Experimental results

The contest judges were kind enough to provide contestants with the set of input programs they used to rank the entries. Below are the space savings that our optimizers achieved on those inputs.

Caption: for each test input, we show the initial size, the sizes after the two local optimization passes (Local 1 and Local 2), after the global synthesizer pass (Synthesizer), and after the re-run of the local optimizers on the output of the synthesizer (Synth-Local 1 and Synth-Local 2). The last column is the best space saving achieved, given as a percentage of the input size.

Hand-written characters

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
busker 576 360 351 358 358 351 -39.1%
cat 302 274 265 298 298 265 -12.3%
jarboe 1066 298 178 **/174 **/174 **/174 -83.3%/-83.6%
parrot 380 356 356 **/444 **/442 **/356 -6.3%

The first set of tests is composed of four characters from an actual interactive fiction game. They were written by hand in a relatively natural style. One of them (the busker) was given as an example to all participants. On all those tests, the second local optimizer achieved the best results. The synthesizer produced incorrect output on jarboe and parrot, as denoted by the ** entries in the table. These incorrect results were caught by the random test phase and discarded. After the contest, we fixed the bug in the synthesizer and obtained the results shown on the right of the / sign. The corrected synthesizer slightly improves the results on the jarboe test.

If-converted characters

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
ifbusker 645 601 351 358 358 351 -45.6%
ifcat 361 355 280 298 298 280 -22.4%
ifjarboe 1616 1616 1286 **/174 **/174 **/174 -20.4%/-89.2%
ifparrot 514 506 364 **/444 **/442 **/356 -29.1%/-30.7%

The next set of tests consists of variants of the four hand-written characters discussed above, with CASE statements replaced by cascades of IF tests. On ifbusker and ifcat, the second local optimizer does a reasonable job. In particular, it achieves the same output size for ifbusker as for the original busker, thus undoing the ``obfuscation'' performed by the judges on the original character.

On ifjarboe, the local optimizers did OK, but the synthesizer would have done much better (also succeeding in undoing the obfuscation), if not for its stupid bug.

The ifparrot test shows an interesting pattern that occurs in several of the tests below: the synthesizer alone doesn't do better than the local optimizers, but re-running the local optimizers on the output of the synthesizer does. Figuratively, the synthesizer acts as a nutcracker: it cracks enough of the nutshell to allow the local optimizers to get at the fruit.

Pessimized characters

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
pessbusker 1892 1704 1328 358 358 358 -81.1%
pesscat 799 769 525 268 265 265 -66.8%
pessjarboe 3153 3059 2824 **/174 **/174 **/174 -10.4%/-94.4%
pessparrot 3291 2458 1607 **/444 **/442 **/356 -51.1%/-89.1%

The third set of tests is also derived from the four hand-written characters, with a number of ``pessimizing'' transformations applied, making them much larger and much harder to optimize. But that doesn't fool the synthesizer, which is, in essence, able to undo the pessimization and recover the same results as for the original characters. The stupid bug in the submission still bites us in two of the four tests.

Instruction recognizers

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
alpha0 30 28 28 28 28 28 -6.7%
alpha1 24 22 22 22 22 22 -8.3%
alpha2 8976 8392 7035 3311 3185 3160 -64.8%
mips0 3842 3652 3443 3402 3374 3370 -12.3%
pentium0 8 3 3 3 3 3 -62.5%
pentium1 8 3 3 3 3 3 -62.5%
pentium2 8 3 3 3 3 3 -62.5%
pentium3 356 313 211 148 148 148 -58.4%
pentium4 382 319 313 122 122 122 -68.1%
pentium5 335336 311326 123700 --- --- --- -63.1%
sparc0 24 22 22 22 22 22 -8.3%
sparc1 76 67 52 40 40 40 -47.4%
sparc2 30 28 28 27 22 22 -26.7%
sparc3 14946 13432 7096 2864 2859 2486 -83.4%

The fourth set of tests comes from a totally different application area. It consists of instruction recognizers for subsets of various processors. They seem to be generated from a high-level processor description.

On all but the smallest tests, the global synthesizer simply kicks ass, delivering results at least twice as small as those of the local optimizers. The pentium5 test is very large and causes the global synthesizer to time out, but fortunately the second local optimizer had already done a decent job.

Processor simulations

Input name Initial size Local 1 Local 2 Synthesizer Synth-Local 1 Synth-Local 2 Space savings
sim-intel.rtl0 115339 2214 2214 1391 1386 1095 -99.1%
sim-sparc.rtl0 15370 15370 15370 --- --- --- 0%
sim-sparc.rtl1 1804 1804 1804 305 303 231 -87.2%
sim-sparc.rtl2 1219 1219 1219 245 240 160 -86.9%
xs.rtl0 952 952 265 270 265 265 -72.2%
xs.rtl0.mc 3413 2458 1005 1203 1203 1005 -70.6%

The last set of tests also comes from processor descriptions, although we didn't figure out exactly what they do. At any rate, this is a mixed bunch: sim-sparc.rtl0 is essentially impervious to our optimizations (the local optimizers do nothing and the synthesizer times out), while sim-intel.rtl0 is reduced by a factor of 100. Notice again the nutcracker effect of the synthesizer, which not only consistently beats the local optimizers, but also allows them to do an even better job on its output.

Conclusions and random thoughts

First and foremost, let us emphasize that we had a lot of fun working on this programming challenge. Even Leroy, who in the past pooh-poohed the ICFP contest as ``a pissing contest'' and didn't deign to look at this year's challenge until Friday night, ended up spending the whole weekend riveted to his home machine. Thanks a lot to the organizers for coming up with such an interesting topic.

Now is the time to exercise the ``unlimited bragging rights'' promised in the contest announcement.

Functional languages in general, and the Caml language in particular, were a very good match for the problem. The language features we relied on a lot include:

In addition to those language features, several features of the Objective Caml implementation also played an important role:

It is interesting to note that 4 of the 8 finalists were using the Caml garbage collector, virtual machine or native-code compiler in one form or another (two OCaml entries, one OLabl entry and one Moscow ML entry).

The topic of the contest was also a very good match for our expertise in compilers and theorem provers. The other two prizes also went to experts in the field of compilation of functional programming languages. But compiler skills alone -- useful as they are to acquaint oneself with program transformations and the importance of preserving semantics -- weren't enough. Compiler writers tend to settle for sub-optimal transformations with reasonable running time, mostly effective against human-written input, such as our two local optimizers. It takes someone familiar with theorem proving and decision procedures for various logics to come up with something as ambitious (some would say crazy) as Doligez's synthesizer.

Finally, a word should be said about our highly parallel development process. The organizers warned against large teams as unwieldy and hard to synchronize within the 72-hour time limit. This is true in general, but perhaps less so for the topic of this contest, which has two unusual features as far as programming projects go. First, the input and output of the program, and the relation between them, are perfectly defined. Second, two solutions can always be combined into a third solution that is at least as good as either of the two. (This is not quite right because of the time limits; but disregarding this, a mere shell script suffices to run N optimizers sequentially and select the smallest output.) That's how we were able to harness the programming power of five people and end up with a program that handles different styles of input well (hand-written, hand-pessimized, and machine-generated).

Further reading

The sources for Composite, as submitted for the contest.
The statement of the contest.
INRIA's press release (in French).
More info on Caml.
The report on Ariane flight 501.


{Pascal.Cuoq,Damien.Doligez,Xavier.Leroy,Luc.Maranget,Alan.Schmitt}@inria.fr, 1999/11/02