Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve compilation of common idioms on strings #6426

Closed
vicuna opened this issue May 16, 2014 · 8 comments
Closed

Improve compilation of common idioms on strings #6426

vicuna opened this issue May 16, 2014 · 8 comments

Comments

@vicuna
Copy link

vicuna commented May 16, 2014

Original bug ID: 6426
Reporter: @alainfrisch
Assigned to: @damiendoligez
Status: acknowledged (set by @damiendoligez on 2014-06-02T18:27:34Z)
Resolution: open
Priority: low
Severity: feature
Category: back end (clambda to assembly)
Tags: patch
Monitored by: @diml @yallop

Bug description

  • It is quite common to build a string by n-ary concatenation: "foo" ^ bar ^ ":" ^ baz. Currently, this allocates and copies intermediate strings. It should be quite easy to detect this pattern and generate some code which allocates a buffer of the correct size and blit each string to it directly. This could be extended to the case where some of the concatenated strings are obtained by String.sub, Bytes.to_string, etc. This operations could be fused with the blit.

  • Less important, but it's also quite common to check prefixes/suffixes with code such as (String.length s <= 4 && String.sub s 0 4 = "foo:"). We could expose a function String.sub_equal to avoid the copy in this case, and maybe have the compiler generate a call to it automatically from the code above.

  • I haven't checked, but if the common test s = "" is compiled down to a generic string comparison, it would be worth changing that to String.length s = 0.

File attachments

@vicuna
Copy link
Author

vicuna commented May 16, 2014

Comment author: @gasche

Not opposed to clever optimizations of strings, but note that to avoid extra allocations you should use String.concat instead of (^):
"foo" ^ bar ^ ":" ^ baz
becomes
String.concat "" ["foo"; bar; ":"; baz]
and only allocates once.

@vicuna
Copy link
Author

vicuna commented May 16, 2014

Comment author: @alainfrisch

Well, String.concat itself will allocate a little bit (a couple of references and closures) and will do useless calls to concatenate the empty separator. This could be counter-productive for small numbers of (small) strings.

We could also expose series of optimized functions:

val concat3: string -> string -> string
val concat4: string -> string -> string
...

Anyway, it seems it is a place where the compiler could easily do the optimization for the user, so that he doesn't have to think whether it's better to use ^ or a different function.

@vicuna
Copy link
Author

vicuna commented May 16, 2014

Comment author: @alainfrisch

Note: as long as strings are not really immutable, one must be careful to preserve the semantics if one optimizes f () ^ g () ^ h (), since the evaluation of each function could modify the result of other functions. In practice, it would be fine to restrict the optimization to the case where concatenated strings are pure.

@vicuna
Copy link
Author

vicuna commented Jun 3, 2014

Comment author: @alainfrisch

A very quick micro-benchmark (Linux 64) to compare three ways of checking string emptiness:

let f1 x = (x = "")
let f2 x = (match x with "" -> true | _ -> false)
let f3 x = String.length x = 0

f1: 2.92s
f2: 1.18s
f3: 0.59s

This shows that even the optimized compilation of string pattern matching might be optimized for the empty string.

@vicuna
Copy link
Author

vicuna commented Jun 3, 2014

Comment author: @alainfrisch

The attached patch empty_strcmp.diff optimizes the comparison against the empty string. It works also when such a comparison arises after inlining (even when the original comparison is the generic comparison). The comparison is reduced to comparing the string length to 0. This still induces two memory accesses due to the way the string length is encoded. It would be possible to perform a single memory access on the first word of the string block.

@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label May 13, 2020
@gasche gasche removed the Stale label May 13, 2020
@gasche
Copy link
Member

gasche commented May 13, 2020

I believe that the issue is still relevant, but no one has worked on it.

Note: GHC has generic support for type-preserving rewrite rules (REWRITE pragmas), which can be used to do this kind of things, and also defined by library authors.

@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants