Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006269OCamlOCaml backend (code generation)public2013-12-13 12:452014-03-21 17:44
Reporterbvaugon 
Assigned Tomaranget 
PrioritynormalSeverityfeatureReproducibilityalways
StatusresolvedResolutionfixed 
PlatformOSOS Version
Product Version4.01.0 
Target VersionFixed in Version4.02.0+dev 
Summary0006269: [PATCH] Optimization of pattern-matching on strings
DescriptionI wrote a patch to improve performances of pattern-matching on strings
in native code.

###

For example, with this optimization, the following code:

match s with
| "let" -> 0
| "match" -> 1
| "with" -> 2
| "if" -> 3
| "then" -> 4
| "else" -> 5
| "constraints" -> 6
| "module" -> 7
| "type" -> 8
| "struct" -> 9
| "sig" -> 10
| "end" -> 11
| _ -> -1

becomes about:
 * 34 times faster on a "Intel(R) Atom(TM) i686 CPU Z520 @ 1.33GHz"
 * 39.5 times faster on a "Intel(R) Core(TM)2 Duo i686 CPU E8400 @ 3.00GHz"
 * 29.5 times faster on a "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz"

###

In order to be uniform, this optimization also impacts simple
comparisons with string constants. Consequently, this kind of code:

if s = "Hello World" then ... else ...

is also a bit more efficient.

###

This optimization is performed at the translation of clambda
to C-- because it generates low-level instructions unrepresentable in
higher intermediate languages.

###

The size of the generated code remains proportional to the size of the
source code. However, it may become a bit smaller or bigger than it is
currently. For example, with this optimization, the bootstraped
ocamlopt.opt is 1.0015 times bigger, and the preceding match with 13
cases is 224 bytes smaller.

###

To replace the current implementation of pattern-matching on strings,
people usually uses a hash table. Indeed, Hashtbl access is in O(1),
while current string pattern-matching is in O(n) and the new one is in
O(log(n)) where n is the number of cases. With this optimized string
pattern-matching, it is no longer necessary to resort to Hashtbl
unless the number of cases is over 17000 (on an "Intel(R) Core(TM)
i7-3770 CPU @ 3.40GHz")!
Tagspatch
Attached Filesdiff file icon ocaml-4.01.0-strmatch.diff [^] (30,047 bytes) 2013-12-13 12:45 [Show Content]
diff file icon ocaml-4.01.0-strmatch-v2.diff [^] (29,873 bytes) 2013-12-16 00:28 [Show Content]
diff file icon ocaml-4.01.0-strmatch-v3.diff [^] (16,629 bytes) 2014-03-18 16:43 [Show Content]

- Relationships

-  Notes
(0010705)
frisch (developer)
2013-12-13 14:22

I'm very much in favor of such a patch.

Benoit: can you say a few words about the algorithm to dispatch on strings? Does it apply only when matching on a single string, or also on more complex patterns involving strings?

A minor comment: code such as

let lt_pats =
  List.filter (fun (pat, _) -> Array.length pat <  med_sz) pats
and ge_pats =
  List.filter (fun (pat, _) -> Array.length pat >= med_sz) pats in


can be simplified using List.partition.
(0010707)
gasche (developer)
2013-12-13 16:38

I haven't reviewed the code carefully, but here are a few words about what I understood from the algorithm.

(0) it collects all the constant strings in a pattern and matches on all of them at once

(1) Benoît first turns strings into array of machine words, and does the matching on them (he uses (Cload Word) to ask cmm to read a full word from a given offset in the matched string)

(2) A combination of two strategies is used:

  (2a): we look in the constant word-arrays for a good "discriminant", which is an index smaller than the minimum length of all arrays, which has the biggest number of groups of word-arrays having the same value at this index, and outputs a binary tree of dichotomic tests on the scrutinee word at this index

  (2b): if no such discriminant exists (all such indices have been checked already, which happens if you match eg. "foo" "foobar" "foobaz" after the "foo" prefix has been tested), it does a tree of dichomotic tests on the *size* of each constant string

  at the leaves of the test of either strategy, the set of possible matches has been reduced, and (2) starts again



I have a couple remarks from looking quickly at the code:
1. isn't del_structured_constant just a List.filter?
2. when you use it, you seem to assume that there is no sharing of constant string labels, is that a robust assumption?
3. extract_distinct could use a hashtable to store quotient sets, that would be clearer and obviously almost-free from potential quadratic behaviors
4. I think the strategy of looping on (2) terminates, but it's not obvious; the whole algorithm obviously deserves an explanation comment, and this point should be explained carefully
5. Is there a risk of confusion between two strings that have the same approximated length in term of machine words, but not string length, with one of them padded with zeroes?
(0010708)
frisch (developer)
2013-12-13 17:07
edited on: 2013-12-13 18:56

> which has the biggest number of groups of word-arrays having the same value at this index

So, each index defines a partition of candidates, and the strategy is to select the index in order to maximize the number of classes in this partition? Intuitively, it would be better to find the index which minimizes the maximum size of a class in the partition, no?

(0010709)
bvaugon (reporter)
2013-12-13 18:59

Thank you for reviewing the code and comments.

###

> Benoit: can you say a few words about the algorithm to dispatch on
  strings? Does it apply only when matching on a single string, or
  also on more complex patterns involving strings?

In fact, this optimization is performed after the pattern-matching are
compiled, that's why it is also interesting when strings are inside
deep patterns. In this case, the initial deep pattern-matching is
decomposed in multiple simple pattern-matchings, that may contain
pattern-matchings on strings. For example, the code:

match l with
  | [ "a"; "b"; "c" ] -> 0
  | [ "a"; "c"; "c" ] -> 1
  | [ "ab" ] -> 2
  | _ -> -1

is near than 3.1 times faster with this optimization.

###

> let lt_pats =
    List.filter (fun (pat, _) -> Array.length pat < med_sz) pats
  and ge_pats =
    List.filter (fun (pat, _) -> Array.length pat >= med_sz) pats in
  can be simplified using List.partition.

Indeed, it might be better to call List.partition instead of calling
twice List.filter (even if it does not change the complexity). Thanks
for the remark.

###

> isn't del_structured_constant just a List.filter?

Semantically, yes. For efficiency, my version is a bit faster since it
uses the fact that the key appears at most one time.

###

> when you use it, you seem to assume that there is no sharing of
  constant string labels, is that a robust assumption?

This assumption is statically verified, so if a problem appears later,
it will be statically detected. In practice, as the clambda generator
is currently written, this assumption is always true. However, it
shows that the allocation of global strings is performed a bit too
early in the compilation chain.

###

> extract_distinct could use a hashtable to store quotient sets, that
  would be clearer and obviously almost-free from potential quadratic
  behaviors

No because the "set of cases" changes at each call. Indeed, the
selection of the most discriminant columns and its extraction reduce
the set of strings to be tested after. The quadratic behaviors is in
the size of strings, not in the number of cases, so it may not be a
problem in practice. If this is still a problem, we might choose a
more naïve implementation by testing strings from the lowest index to
the highest.

###

> I think the strategy of looping on (2) terminates, but it's not
  obvious; the whole algorithm obviously deserves an explanation
  comment, and this point should be explained carefully

Indeed, it terminates. In fact, all string patterns are distinct at
this step of the algorithm, then, if they have the same length, it
exists a cell which is different and the list "checked" grows, and if
they do not have the same length, either we check a cell of index
lower than min_sz and the list "checked" grows, or we discriminate on
the length of the string and the number of cases decreases.

It can be added as a comment, if you want.

###

> Is there a risk of confusion between two strings that have the same
  approximated length in term of machine words, but not string length,
  with one of them padded with zeroes?

Obviously not. According to the memory representation of strings, an
offset on the exact size of the string is stored in the last byte of
the last word, and the last word of the string is checked as the
others.

###

> So, each index defines a partition of candidates, and the strategy
  is to select the index in order to maximize the number of classes in
  this partition? Intuitively, it would be better to find the index
  which minimizes the maximum size of classes in the partition, no?

Indeed, to reduce the size of the generated code, it is interesting to
start with the less discriminant column. In terms of performances, I
think it depends on cases. Selecting on the most discriminant column
is the fastest way to reduce the number of cases. However, if we think
that the "default case" is the most frequent case, it might be
interesting to discriminate from the less discriminant column to the
most one. It would be interesting to benchmark this behaviour.
(0010710)
frisch (developer)
2013-12-13 19:23

> Indeed, to reduce the size of the generated code, it is interesting to
> start with the less discriminant column.

I don't think that my proposal can be described as "the less discriminant" column. What I propose to minimize is not the number of classes in the partition, but the size of the largest class in it. This strategy should minimize the maximum number of comparisons required at runtime.

Let's consider the following example:

 a000
 a001
 a010
 a011
 a100
 a101
 b110
 c111

According to your criterion (assuming it operates on characters, not words), one should first dispatch on the first character (it gives a partition with 3 classes, while other choices give partitions with only 2 classes), but when it is 'a', we still have 5 candidates, and we need 3 more dispatches to conclude.

It seems better to start with any of the last three characters, because they all give partitions whose largest class has 4 elements (and 2 more comparisons are then enough), while the first character gives a partition whose largest class has 5 elements. In general, finding the optimal strategy (which minimizes the depth of the decision tree) is probably difficult, but this seems to be a good heuristics.


Also, I'm wondering whether it wouldn't be appropriate to dispatch first on the size: since it is usually quite small, the dispatch could be compiled as a table-based switch.
(0010711)
gasche (developer)
2013-12-13 22:54

> Also, I'm wondering whether it wouldn't be appropriate to dispatch
> first on the size: since it is usually quite small, the dispatch
> could be compiled as a table-based switch.

In fact you could want to see "testing character at offset i" and
"checking the size" as two heuristics to build a discriminator on the
given strings, and use whichever is "best" (for the notion of 'best'
you want) at any point.


> > extract_distinct could use a hashtable to store quotient sets, that
> > would be clearer and obviously almost-free from potential quadratic
> > behaviors
>
> No because the "set of cases" changes at each call. Indeed, the
> selection of the most discriminant columns and its extraction reduce
> the set of strings to be tested after.

I was thinking of using a hashtable internally to each
extract_distinct call from search_most_discriminant, not across
several cases. Something like (untested):

  and extract_distinct pats i =
    let table = Hashtbl.create (Array.length pats) in
    let handle ((pat, _) as data) = Hashtbl.add table pat.(i) data in
    List.iter handle pats;
    table

> The quadratic behaviors is in the size of strings, not in the number
> of cases, so it may not be a problem in practice.

My understanding of the previous implementation is that the List.assoc
business is linear in the number of equivalence classes (and such for
each pattern/clause). If the N patterns in the 'pats' list all have
a distict word in index i, the time spent in List.assoc
(fully traversing the 'acc' list) is ?(N*N).
(0010714)
bvaugon (reporter)
2013-12-16 00:25
edited on: 2013-12-16 00:30

> I don't think that my proposal can be described as "the less
> discriminant" column. What I propose to minimize is not the number
> of classes in the partition, but the size of the largest class in
> it. This strategy should minimize the maximum number of comparisons
> required at runtime.

Ah ok, sorry. Well, I compared and benchmarked 4 strategies:
 1/ most discriminant first
 2/ least discriminant first
 3/ smallest size of the biggest class in partition first
 4/ order of the string

It was particulary hard to find examples giving significant
differencies in performances depending on the strategie.
Surprisingly, on all my tests, it is the strategy 2/ (least
significant first) that gives the best performances. I think it is
probably because it generates a smaller code and the branch predictor
of the processor works better. Anyway, all stategies generates codes
which differ at most by 6% in speed on my tests.

It is also this strategie (2/) which generates the fastest code (just
1% faster than your strategie (3/)) on your example:

match s with
| "--------xxxxxxxxbbbbbbbbbbbbbbbbbbbbbbbb" -> 0
| "--------xxxxxxxxbbbbbbbbbbbbbbbbcccccccc" -> 1
| "--------xxxxxxxxbbbbbbbbccccccccbbbbbbbb" -> 2
| "--------xxxxxxxxbbbbbbbbcccccccccccccccc" -> 3
| "--------xxxxxxxxccccccccbbbbbbbbbbbbbbbb" -> 4
| "--------xxxxxxxxccccccccbbbbbbbbcccccccc" -> 5
| "--------yyyyyyyyccccccccccccccccbbbbbbbb" -> 6
| "--------zzzzzzzzcccccccccccccccccccccccc" -> 7
| _ -> -1


Please find attached the new version, using the strategie (2/), using
one List.partition instead of two List.filter, and using a Hashtbl
instead of assocs as suggested correctly by Gabriel (sorry, I was
wrong in computing complexity) unlike it was necessary to make a hash
table of list rather than using Hashtbl.find_all to more easily
calculate the number of class.

(0010727)
frisch (developer)
2013-12-16 17:42

Did you try a strategy based on always dispatching on the size (either with a decision tree, or with a Cswitch) first?
(0010728)
gasche (developer)
2013-12-16 18:05
edited on: 2013-12-16 18:12

I discussed this proposal with Luc. He points out that his paper on compiling pattern-matching with decision trees tries a lot of different heuristics and comes to the conclusion that it is indeed the "least discriminant choice" that is better. He also points out that the Switch module (which provides a functor) should be used instead of building the dichomotic test sequences by hand -- see how the call_switcher function is used in in bytecomp/matching.ml, and transl_switch in cmmgen.

I now have an intuitive understanding of why the least discriminant choice is better; it strongly depends on the existence of a fallback case if none of the constants are matched. Indeed, what happens if the N string constants were all the same, except on the first character? The least-discriminant match would recommend testing the second character, third characters, etc., before the first one. This seems useless but is necessary to check that we're not in the "none of the constants matched" case anyway. After we made all those non-discriminating checks, we check the first character that makes a difference.

Note that if you test the first character first, you duplicate a lot of code, as all the tests for all the other characters are done N times (in each branch you still need to check for the failure case). In Luc's pattern-matching paper, this was avoided of by hashconsing the decision trees (to share identical test tails). But the least-discriminating strategy was still best, and his argument is that locally (if you only look at the level of one test), it is the one that produces the less code (less branches, smaller tree), and empirically this property extends in a glutton way.

Now let's be clear about the fact that:
- the few percent of difference in running time between the several solutions do *not* matter, as the general gain is much larger
- the existence of a worst-case blow-up in code size *does* matter, but Luc remarked that even in presence of those unshared tails, the code was still linear in the sum of the size of the strings (which is what you get with the current compilation if you count the data segment as well)
- the ability to factorize code using the dreaded "zyva" function *does* matter, and should be investigated

I also wonder whether we couldn't actually do that in the typedtree/lambda-code level, by using the %caml_string_get{32,64} primitives. I'm not sure this would be better, though, in particular I don't know how this would handle the extraneous padding bits at the end of the string (I'm still not sure how the present patch handles this either, though).

(0010952)
doligez (administrator)
2014-02-19 17:13

> comes to the conclusion that it is indeed the "least discriminant choice" that is better

Possibly irrelevant, but this reminds me of the Paige & Tarjan partition-refinement algorithm (IIRC, for computing bisimulations on automata).
(0011050)
maranget (manager)
2014-03-18 16:27

I have reviewed the patch and well I have rewritten strmatch.ml...

I have somehow simplified the algorithm by first discriminating on
string lengths (in words) and then on string contents.
The algorithm is thus simplified as all strings matched together have the
same length. It still uses the "less discrimating column" heuristics,
on string contents only.


I have also made some effort to use code defined in cmmgen that is passed
over to strmatch as a functor argument. Those are
1. A function that outputs the switch for sizes, using the standard switch compiler
2. A function that emits the code to retrieve string length (in words)

The rest of the v2 patch has not changed.
(0011077)
maranget (manager)
2014-03-21 17:39

I am to include the patch in trunk.

Given the interaction with the new feature of shared constants,
string matchings are now expressed with a new lambda node.
The compiler then carries over this new construct down to C-- generation,
when it is compiled by followin the initial idea of Benoi^t.

Bytecode compilation has also improved, from N string tests to
identify a non matching string w.r.t. N string patterns, to about log N
string tests, by the means of a static dichotomic search.

- Issue History
Date Modified Username Field Change
2013-12-13 12:45 bvaugon New Issue
2013-12-13 12:45 bvaugon File Added: ocaml-4.01.0-strmatch.diff
2013-12-13 14:22 frisch Note Added: 0010705
2013-12-13 16:38 gasche Note Added: 0010707
2013-12-13 17:07 frisch Note Added: 0010708
2013-12-13 17:07 frisch Note Edited: 0010708 View Revisions
2013-12-13 18:56 frisch Note Edited: 0010708 View Revisions
2013-12-13 18:59 bvaugon Note Added: 0010709
2013-12-13 19:23 frisch Note Added: 0010710
2013-12-13 22:54 gasche Note Added: 0010711
2013-12-16 00:25 bvaugon Note Added: 0010714
2013-12-16 00:28 bvaugon File Added: ocaml-4.01.0-strmatch-v2.diff
2013-12-16 00:30 bvaugon Note Edited: 0010714 View Revisions
2013-12-16 17:42 frisch Note Added: 0010727
2013-12-16 18:05 gasche Note Added: 0010728
2013-12-16 18:12 gasche Note Edited: 0010728 View Revisions
2014-02-19 16:59 doligez Tag Attached: patch
2014-02-19 17:13 doligez Note Added: 0010952
2014-03-14 18:10 gasche Status new => acknowledged
2014-03-18 16:27 maranget Note Added: 0011050
2014-03-18 16:28 maranget File Added: ocaml-4.01.0-strmatch-v3.diff
2014-03-18 16:40 maranget File Deleted: ocaml-4.01.0-strmatch-v3.diff
2014-03-18 16:40 maranget File Added: ocaml-4.01.0-strmatch-v3.diff
2014-03-18 16:42 maranget File Deleted: ocaml-4.01.0-strmatch-v3.diff
2014-03-18 16:43 maranget File Added: ocaml-4.01.0-strmatch-v3.diff
2014-03-21 17:39 maranget Note Added: 0011077
2014-03-21 17:44 maranget Assigned To => maranget
2014-03-21 17:44 maranget Status acknowledged => resolved
2014-03-21 17:44 maranget Resolution open => fixed
2014-03-21 17:44 maranget Fixed in Version => 4.02.0+dev


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker