
[PATCH] Optimization of pattern-matching on strings #6269

Closed
vicuna opened this issue Dec 13, 2013 · 12 comments

Comments


vicuna commented Dec 13, 2013

Original bug ID: 6269
Reporter: bvaugon
Assigned to: @maranget
Status: closed (set by @xavierleroy on 2015-12-11T18:25:57Z)
Resolution: fixed
Priority: normal
Severity: feature
Version: 4.01.0
Fixed in version: 4.02.0+dev
Category: back end (clambda to assembly)
Tags: patch
Monitored by: @gasche @ygrek @jmeber @hcarty

Bug description

I wrote a patch to improve the performance of pattern-matching on strings
in native code.

For example, with this optimization, the following code:

match s with
| "let" -> 0
| "match" -> 1
| "with" -> 2
| "if" -> 3
| "then" -> 4
| "else" -> 5
| "constraints" -> 6
| "module" -> 7
| "type" -> 8
| "struct" -> 9
| "sig" -> 10
| "end" -> 11
| _ -> -1

becomes about:

  • 34 times faster on an "Intel(R) Atom(TM) i686 CPU Z520 @ 1.33GHz"
  • 39.5 times faster on an "Intel(R) Core(TM)2 Duo i686 CPU E8400 @ 3.00GHz"
  • 29.5 times faster on an "Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz"

In order to be uniform, this optimization also impacts simple
comparisons with string constants. Consequently, this kind of code:

if s = "Hello World" then ... else ...

is also a bit more efficient.

This optimization is performed at the translation of clambda
to C-- because it generates low-level instructions unrepresentable in
higher intermediate languages.

The size of the generated code remains proportional to the size of the
source code. However, it may become a bit smaller or bigger than it is
currently. For example, with this optimization, the bootstrapped
ocamlopt.opt is 1.0015 times bigger, and the preceding match with 13
cases is 224 bytes smaller.

Rather than use the current implementation of pattern-matching on
strings, people usually use a hash table. Indeed, Hashtbl access is
O(1), while the current string pattern-matching is O(n) and the new
one is O(log n), where n is the number of cases. With this optimized
string pattern-matching, it is no longer necessary to resort to
Hashtbl unless the number of cases exceeds 17000 (on an "Intel(R)
Core(TM) i7-3770 CPU @ 3.40GHz")!
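
As a point of comparison, the Hashtbl-based dispatch mentioned above can be sketched as follows (a minimal, hypothetical example; the keyword set and the `lookup` name are illustrative, not part of the patch):

```ocaml
(* Minimal sketch of the Hashtbl-based dispatch people use instead of a
   string match: one O(1) lookup per query, regardless of the number of
   cases. The keyword set below is illustrative. *)
let keyword_table : (string, int) Hashtbl.t = Hashtbl.create 16

let () =
  List.iter (fun (k, v) -> Hashtbl.add keyword_table k v)
    [ "let", 0; "match", 1; "with", 2; "if", 3; "then", 4; "else", 5 ]

(* Mirror the match's default case by returning -1 on a miss. *)
let lookup s =
  try Hashtbl.find keyword_table s with Not_found -> -1
```

The point of the patch is that a plain `match` now reaches comparable dispatch cost without the table setup and hashing overhead.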

File attachments


vicuna commented Dec 13, 2013

Comment author: @alainfrisch

I'm very much in favor of such a patch.

Benoit: can you say a few words about the algorithm to dispatch on strings? Does it apply only when matching on a single string, or also on more complex patterns involving strings?

A minor comment: code such as

let lt_pats =
  List.filter (fun (pat, _) -> Array.length pat <  med_sz) pats
and ge_pats =
  List.filter (fun (pat, _) -> Array.length pat >= med_sz) pats in

can be simplified using List.partition.
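
For instance (a hypothetical standalone version of that snippet; `med_sz` and `pats` stand in for the variables in the patch), the two traversals collapse into one:

```ocaml
(* A single List.partition traversal replaces the two List.filter
   calls: the first result list holds elements satisfying the
   predicate, the second holds the rest. *)
let split_by_size med_sz pats =
  List.partition (fun (pat, _) -> Array.length pat < med_sz) pats
```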


vicuna commented Dec 13, 2013

Comment author: @gasche

I haven't reviewed the code carefully, but here are a few words about what I understood from the algorithm.

(0) it collects all the constant strings in a pattern and matches on all of them at once

(1) Benoît first turns strings into arrays of machine words, and does the matching on them (he uses (Cload Word) to ask cmm to read a full word at a given offset in the matched string)

(2) A combination of two strategies is used:

(2a): we look in the constant word-arrays for a good "discriminant": an index smaller than the minimum length of all arrays that induces the largest number of groups of word-arrays sharing the same value at that index; we then output a binary tree of dichotomic tests on the scrutinee word at this index

(2b): if no such discriminant exists (all such indices have been checked already, which happens if you match e.g. "foo" "foobar" "foobaz" after the "foo" prefix has been tested), it does a tree of dichotomic tests on the size of each constant string

At the leaves of either strategy's tests, the set of possible matches has been reduced, and (2) starts again.
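
Strategy (2a) can be sketched roughly as follows (hypothetical helper names; the real patch works on the clambda/cmm representation, not plain int arrays):

```ocaml
(* Count the distinct word values at column i across all patterns; a
   higher count means a more discriminating column in the sense of (2a). *)
let count_distinct (pats : int array list) (i : int) : int =
  let tbl = Hashtbl.create 8 in
  List.iter
    (fun pat ->
      if not (Hashtbl.mem tbl pat.(i)) then Hashtbl.add tbl pat.(i) ())
    pats;
  Hashtbl.length tbl

(* Pick the column (below the minimal length) inducing the most classes. *)
let most_discriminating pats min_sz =
  let best = ref 0 and best_count = ref (-1) in
  for i = 0 to min_sz - 1 do
    let c = count_distinct pats i in
    if c > !best_count then (best := i; best_count := c)
  done;
  !best
```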

I have a couple of remarks from looking quickly at the code:

  1. isn't del_structured_constant just a List.filter?
  2. when you use it, you seem to assume that there is no sharing of constant string labels, is that a robust assumption?
  3. extract_distinct could use a hashtable to store quotient sets, that would be clearer and obviously almost-free from potential quadratic behaviors
  4. I think the strategy of looping on (2) terminates, but it's not obvious; the whole algorithm obviously deserves an explanation comment, and this point should be explained carefully
  5. Is there a risk of confusion between two strings that have the same approximated length in terms of machine words, but not the same string length, with one of them padded with zeroes?


vicuna commented Dec 13, 2013

Comment author: @alainfrisch

which has the biggest number of groups of word-arrays having the same value at this index

So, each index defines a partition of candidates, and the strategy is to select the index in order to maximize the number of classes in this partition? Intuitively, it would be better to find the index which minimizes the maximum size of a class in the partition, no?


vicuna commented Dec 13, 2013

Comment author: bvaugon

Thank you for reviewing the code and comments.

Benoit: can you say a few words about the algorithm to dispatch on
strings? Does it apply only when matching on a single string, or
also on more complex patterns involving strings?

In fact, this optimization is performed after pattern matching is
compiled, which is why it also helps when strings occur inside deep
patterns. In this case, the initial deep pattern-matching is
decomposed into multiple simple pattern-matchings, which may contain
pattern-matchings on strings. For example, the code:

match l with
| [ "a"; "b"; "c" ] -> 0
| [ "a"; "c"; "c" ] -> 1
| [ "ab" ] -> 2
| _ -> -1

is about 3.1 times faster with this optimization.

let lt_pats =
  List.filter (fun (pat, _) -> Array.length pat <  med_sz) pats
and ge_pats =
  List.filter (fun (pat, _) -> Array.length pat >= med_sz) pats in

can be simplified using List.partition.

Indeed, it is better to call List.partition instead of calling
List.filter twice (even if it does not change the complexity). Thanks
for the remark.

isn't del_structured_constant just a List.filter?

Semantically, yes. For efficiency, my version is a bit faster since it
exploits the fact that the key appears at most once.

when you use it, you seem to assume that there is no sharing of
constant string labels, is that a robust assumption?

This assumption is statically verified, so if a problem appears later,
it will be statically detected. In practice, as the clambda generator
is currently written, this assumption is always true. However, it
shows that the allocation of global strings is performed a bit too
early in the compilation chain.

extract_distinct could use a hashtable to store quotient sets, that
would be clearer and obviously almost-free from potential quadratic
behaviors

No, because the "set of cases" changes at each call. Indeed, selecting
the most discriminating column and extracting it reduces the set of
strings to be tested afterwards. The quadratic behavior is in the size
of the strings, not in the number of cases, so it may not be a problem
in practice. If this is still a problem, we might choose a more naïve
implementation that tests strings from the lowest index to the
highest.

I think the strategy of looping on (2) terminates, but it's not
obvious; the whole algorithm obviously deserves an explanation
comment, and this point should be explained carefully

Indeed, it terminates. In fact, all string patterns are distinct at
this step of the algorithm; then, if they have the same length, there
exists a cell which differs and the list "checked" grows; and if they
do not have the same length, either we check a cell of index lower
than min_sz and the list "checked" grows, or we discriminate on the
length of the string and the number of cases decreases.

It can be added as a comment, if you want.

Is there a risk of confusion between two strings that have the same
approximated length in terms of machine words, but not string length,
with one of them padded with zeroes?

Obviously not. Given the memory representation of strings, an offset
encoding the exact size of the string is stored in the last byte of
the last word, and the last word of the string is compared like the
others.
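
Concretely, the word/padding arithmetic of that representation can be sketched like this (a hedged illustration of the classic OCaml string layout, not code from the patch):

```ocaml
(* With a word size of ws bytes, a string of byte length len occupies
   len / ws + 1 data words, and the last byte of the last word stores
   the padding count ws - 1 - (len mod ws).  Two strings with the same
   word count but different byte lengths thus differ in the last word,
   which is why comparing whole words cannot confuse them. *)
let words_and_pad ~ws len = (len / ws + 1, ws - 1 - (len mod ws))
```

So on a 64-bit machine, a 3-byte string occupies one data word whose final byte is 4, a 7-byte string occupies one word with final byte 0, and an 8-byte string spills into a second word.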

So, each index defines a partition of candidates, and the strategy
is to select the index in order to maximize the number of classes in
this partition? Intuitively, it would be better to find the index
which minimizes the maximum size of classes in the partition, no?

Indeed, to reduce the size of the generated code, it is interesting to
start with the least discriminating column. In terms of performance, I
think it depends on the cases. Selecting the most discriminating
column is the fastest way to reduce the number of cases. However, if
we think that the "default case" is the most frequent one, it might be
interesting to discriminate from the least discriminating column to
the most discriminating one. It would be interesting to benchmark this
behaviour.


vicuna commented Dec 13, 2013

Comment author: @alainfrisch

Indeed, to reduce the size of the generated code, it is interesting to
start with the least discriminating column.

I don't think that my proposal can be described as "the least discriminating" column. What I propose to minimize is not the number of classes in the partition, but the size of the largest class in it. This strategy should minimize the maximum number of comparisons required at runtime.

Let's consider the following example:

a000
a001
a010
a011
a100
a101
b110
c111

According to your criterion (assuming it operates on characters, not words), one should first dispatch on the first character (it gives a partition with 3 classes, while other choices give partitions with only 2 classes), but when it is 'a', we still have 5 candidates, and we need 3 more dispatches to conclude.

It seems better to start with any of the last three characters, because they all give partitions whose largest class has 4 elements (and 2 more comparisons are then enough), while the first character gives a partition whose largest class has 5 elements. In general, finding the optimal strategy (which minimizes the depth of the decision tree) is probably difficult, but this seems to be a good heuristic.
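
The proposed criterion can be computed per index like this (a hypothetical helper, operating on characters as in the example above rather than on machine words):

```ocaml
(* Size of the largest equivalence class induced by character index i;
   the proposal is to pick the index minimizing this value, bounding
   the number of candidates surviving any single dispatch. *)
let largest_class (cands : string list) (i : int) : int =
  let tbl = Hashtbl.create 8 in
  List.iter
    (fun s ->
      let n = try Hashtbl.find tbl s.[i] with Not_found -> 0 in
      Hashtbl.replace tbl s.[i] (n + 1))
    cands;
  Hashtbl.fold (fun _ n acc -> max n acc) tbl 0
```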

Also, I'm wondering whether it wouldn't be appropriate to dispatch first on the size: since it is usually quite small, the dispatch could be compiled as a table-based switch.


vicuna commented Dec 13, 2013

Comment author: @gasche

Also, I'm wondering whether it wouldn't be appropriate to dispatch
first on the size: since it is usually quite small, the dispatch
could be compiled as a table-based switch.

In fact you could want to see "testing character at offset i" and
"checking the size" as two heuristics to build a discriminator on the
given strings, and use whichever is "best" (for the notion of 'best'
you want) at any point.

extract_distinct could use a hashtable to store quotient sets, that
would be clearer and obviously almost-free from potential quadratic
behaviors

No because the "set of cases" changes at each call. Indeed, the
selection of the most discriminant columns and its extraction reduce
the set of strings to be tested after.

I was thinking of using a hashtable internally in each
extract_distinct call from search_most_discriminant, not across
several cases. Something like (untested):

and extract_distinct pats i =
  let table = Hashtbl.create (List.length pats) in
  let handle ((pat, _) as data) = Hashtbl.add table pat.(i) data in
  List.iter handle pats;
  table

The quadratic behavior is in the size of the strings, not in the
number of cases, so it may not be a problem in practice.

My understanding of the previous implementation is that the List.assoc
business is linear in the number of equivalence classes (and so for
each pattern/clause). If the N patterns in the 'pats' list all have
a distinct word at index i, the time spent in List.assoc
(fully traversing the 'acc' list) is Θ(N²).


vicuna commented Dec 15, 2013

Comment author: bvaugon

I don't think that my proposal can be described as "the less
discriminant" column. What I propose to minimize is not the number
of classes in the partition, but the size of the largest class in
it. This strategy should minimize the maximum number of comparisons
required at runtime.

Ah, OK, sorry. Well, I compared and benchmarked 4 strategies:
1/ most discriminating first
2/ least discriminating first
3/ smallest size of the biggest class in the partition first
4/ order of the string

It was particularly hard to find examples giving significant
differences in performance depending on the strategy.
Surprisingly, on all my tests, it is strategy 2/ (least
discriminating first) that gives the best performance. I think it is
probably because it generates smaller code and the branch predictor
of the processor works better. Anyway, all strategies generate code
which differs at most by 6% in speed on my tests.

It is also this strategy (2/) which generates the fastest code (just
1% faster than your strategy (3/)) on your example:

match s with
| "--------xxxxxxxxbbbbbbbbbbbbbbbbbbbbbbbb" -> 0
| "--------xxxxxxxxbbbbbbbbbbbbbbbbcccccccc" -> 1
| "--------xxxxxxxxbbbbbbbbccccccccbbbbbbbb" -> 2
| "--------xxxxxxxxbbbbbbbbcccccccccccccccc" -> 3
| "--------xxxxxxxxccccccccbbbbbbbbbbbbbbbb" -> 4
| "--------xxxxxxxxccccccccbbbbbbbbcccccccc" -> 5
| "--------yyyyyyyyccccccccccccccccbbbbbbbb" -> 6
| "--------zzzzzzzzcccccccccccccccccccccccc" -> 7
| _ -> -1

Please find attached the new version, using strategy (2/), using one
List.partition instead of two List.filter calls, and using a Hashtbl
instead of assoc lists as correctly suggested by Gabriel (sorry, I was
wrong in computing the complexity), although it was necessary to build
a hash table of lists rather than use Hashtbl.find_all, to more easily
compute the number of classes.


vicuna commented Dec 16, 2013

Comment author: @alainfrisch

Did you try a strategy based on always dispatching on the size (either with a decision tree, or with a Cswitch) first?


vicuna commented Dec 16, 2013

Comment author: @gasche

I discussed this proposal with Luc. He points out that his paper on compiling pattern-matching with decision trees tries a lot of different heuristics and comes to the conclusion that it is indeed the "least discriminant choice" that is better. He also points out that the Switch module (which provides a functor) should be used instead of building the dichotomic test sequences by hand -- see how the call_switcher function is used in bytecomp/matching.ml, and transl_switch in cmmgen.

I now have an intuitive understanding of why the least discriminant choice is better; it strongly depends on the existence of a fallback case if none of the constants are matched. Indeed, what happens if the N string constants were all the same, except on the first character? The least-discriminant match would recommend testing the second character, the third character, etc., before the first one. This seems useless but is necessary anyway, to check that we're not in the "none of the constants matched" case. After we have made all those non-discriminating checks, we check the first character that makes a difference.

Note that if you test the first character first, you duplicate a lot of code, as all the tests for all the other characters are done N times (in each branch you still need to check for the failure case). In Luc's pattern-matching paper, this was avoided by hash-consing the decision trees (to share identical test tails). But the least-discriminating strategy was still best, and his argument is that locally (if you only look at the level of one test), it is the one that produces the least code (fewer branches, smaller tree), and empirically this property extends greedily.

Now let's be clear about the fact that:

  • the few percent of difference in running time between the several solutions do not matter, as the general gain is much larger
  • the existence of a worst-case blow-up in code size does matter, but Luc remarked that even in the presence of those unshared tails, the code was still linear in the sum of the sizes of the strings (which is what you get with the current compilation if you count the data segment as well)
  • the ability to factorize code using the dreaded "zyva" function does matter, and should be investigated

I also wonder whether we couldn't actually do that in the typedtree/lambda-code level, by using the %caml_string_get{32,64} primitives. I'm not sure this would be better, though, in particular I don't know how this would handle the extraneous padding bits at the end of the string (I'm still not sure how the present patch handles this either, though).


vicuna commented Feb 19, 2014

Comment author: @damiendoligez

comes to the conclusion that it is indeed the "least discriminant choice" that is better

Possibly irrelevant, but this reminds me of the Paige & Tarjan partition-refinement algorithm (IIRC, for computing bisimulations on automata).


vicuna commented Mar 18, 2014

Comment author: @maranget

I have reviewed the patch and, well, I have rewritten strmatch.ml...

I have somewhat simplified the algorithm by first discriminating on
string lengths (in words) and then on string contents. The algorithm
is thus simplified, as all strings matched together have the same
length. It still uses the "least discriminating column" heuristic, on
string contents only.

I have also made some effort to use code defined in cmmgen that is
passed over to strmatch as a functor argument. These are:

  1. A function that outputs the switch for sizes, using the standard switch compiler
  2. A function that emits the code to retrieve string length (in words)

The rest of the v2 patch has not changed.
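
The functor plumbing described above might look roughly like this (all names here are illustrative, not the actual strmatch.ml interface):

```ocaml
(* Hypothetical sketch: strmatch is parameterized over the two cmmgen
   operations listed above, so it never builds that C-- code itself. *)
module type Cmm_ops = sig
  type expr
  (* 1. compile a switch over string sizes with the standard switch compiler *)
  val size_switch :
    scrutinee:expr -> cases:(int * expr) list -> default:expr -> expr
  (* 2. emit code retrieving the length of a string, in words *)
  val word_length : expr -> expr
end

module Make (C : Cmm_ops) = struct
  (* the string-match compiler would call C.size_switch and
     C.word_length here; only a trivial stub is shown *)
  let length_of = C.word_length
end
```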


vicuna commented Mar 21, 2014

Comment author: @maranget

I am about to include the patch in trunk.

Given the interaction with the new feature of shared constants,
string matches are now expressed with a new lambda node. The compiler
then carries this new construct down to C-- generation, where it is
compiled following Benoît's initial idea.

Bytecode compilation has also improved, from N string tests to
identify a non-matching string w.r.t. N string patterns to about log N
string tests, by means of a static dichotomic search.
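
The bytecode-side improvement amounts to a binary search over the sorted constants; a minimal model (hypothetical, with the search done at run time here rather than baked into generated code):

```ocaml
(* The string constants are sorted once; matching a scrutinee then
   takes about log N string comparisons instead of N. *)
let dichotomic_find (cases : (string * int) array) (s : string) : int =
  let rec go lo hi =
    if lo > hi then -1                       (* default case *)
    else
      let mid = (lo + hi) / 2 in
      let k, v = cases.(mid) in
      let c = compare s k in
      if c = 0 then v
      else if c < 0 then go lo (mid - 1)
      else go (mid + 1) hi
  in
  go 0 (Array.length cases - 1)
```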
