Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

caml_alloc_string allocates empty strings #7145

Closed
vicuna opened this issue Feb 11, 2016 · 13 comments
Closed

caml_alloc_string allocates empty strings #7145

vicuna opened this issue Feb 11, 2016 · 13 comments

Comments

@vicuna
Copy link

vicuna commented Feb 11, 2016

Original bug ID: 7145
Reporter: @mmottl
Status: acknowledged (set by @damiendoligez on 2016-02-24T15:51:32Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.02.3
Target version: later
Category: runtime system and C interface
Monitored by: @bobzhang @diml @jmeber @mmottl

Bug description

caml_alloc_string in byterun/alloc.c does not check whether the length of the requested string is zero and hence always allocates a fresh value even for empty strings. As a consequence, physical equality does not hold in e.g. String.copy "" == ""

This does not happen with e.g. empty arrays. The caml_atom_table apparently contains an entry for string tags, too. Not sure, but it may be safe to tweak that case, which could be beneficial for bindings that allocate a lot of empty strings (for optional / empty fields).

@vicuna
Copy link
Author

vicuna commented Apr 21, 2016

Comment author: junsli

physical equality does not hold for any strings that = holds.

# "" == "";;
- : bool = false
# "1" == "1";;
- : bool = false

Maybe I don't fully understand the problem. But I do feel the inconsistency here:

# [|1|] == [|1|];;
- : bool = false
# [||] == [||];;
- : bool = true

@vicuna
Copy link
Author

vicuna commented Apr 21, 2016

Comment author: @mmottl

Interesting, I thought that strings behave like arrays in that respect. I think they should. It's also worth noting that the "string" type has recently been made "pure", i.e. it can only be manipulated in purely functional ways. The above behavior seems even more odd in that case.

@vicuna
Copy link
Author

vicuna commented Jun 27, 2016

Comment author: @xavierleroy

Strings and arrays have different representations. The empty array is a block of size 0 words, which therefore must be allocated statically outside the heap. The empty string is a block of size 1 word. So, the caml_atom_table trick cannot be used for the empty string.

As to re-sharing of string literals in general, it could happen in the future, but not before -safe-string becomes the default.

@vicuna
Copy link
Author

vicuna commented Jun 27, 2016

Comment author: @mmottl

Ah, I see, I wasn't aware that unlike arrays empty string blocks have non-zero size. Looking forward to -safe-string becoming the default...

@vicuna
Copy link
Author

vicuna commented Jul 13, 2016

Comment author: @alainfrisch

I don't understand what would prevent us from sharing empty strings (at least in caml_alloc_string). Even without -safe-string, the empty string is immutable.

@vicuna
Copy link
Author

vicuna commented Jan 4, 2017

Comment author: @bobzhang

as a short-cut, can we have empty in string module?

@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label Mar 24, 2021
@mmottl
Copy link
Contributor

mmottl commented Mar 24, 2021

I don’t know if there have been any changes to the handling of string constants. Strings have been immutable for a long time now so it would seem worthwhile to reconsider sharing them.

@gasche
Copy link
Member

gasche commented Mar 24, 2021

Looking at this as well (hi @mmottl!):

  • strings are now immutable by default indeed (in any case, the string of length 0 cannot actually be mutated)
  • we have a String.empty constant that can be used to enforce sharing in performance-aware or sharing-aware code

No strong opinion on whether the feature still matters.

@lthls
Copy link
Contributor

lthls commented Mar 24, 2021

Keep in mind that even for empty arrays the sharing is not 100% reliable: Array.make 0 0 == [||] will return true for bytecode but false in native mode.
The situation would be similar for strings: if we want to be sure that there can only be one empty string, we would need to define a C symbol pointing to the unique empty string, but that means that every empty string coming from OCaml code would either fail to be shared the same way, or need to be translated to a C call (and that's expensive).

There is actually already a solution for constants that absolutely need to be shared (namely, predefined exceptions). This works by reserving symbols for these constants on the C side, but making the OCaml linker generate the code for them.
But I don't think we should extend this trick for empty strings or arrays.

@xavierleroy
Copy link
Contributor

Nowadays, ocamlopt shares identical string literals within one compilation unit, and ocamlc does not. This is the same behavior as with immutable structured constants (e.g. [1;2;3]).

I think ocamlc could do better and also share within compilation units. I agree with @lthls that sharing across compilation units is a pain and not something we should strive for.

I'll go one step further. Immutable data structures offer more possibilities for the compiler to share them, but this is strictly an optimization: sharing should not be mandated in any way, and user code should not rely on it.

I move to close this issue.

@mmottl
Copy link
Contributor

mmottl commented Mar 24, 2021

There is certainly no need to make the compilation process unnecessarily more complex to share across compilation units. It's good to know that String.empty will be available in the next release.

May I suggest also adding String.is_empty and Bytes.is_empty? String comparison is much less efficient than testing for String.length str = 0.

Maybe String.empty can also be exploited in C-stubs, including the OCaml runtime, to avoid unnecessary allocations? E.g. the PostgreSQL bindings preallocate an empty string to return for NULL values. It's frequent enough to make a difference.

I can't close this issue myself since my report was copied from Mantis. Please feel free to do so.

@github-actions github-actions bot removed the Stale label Mar 26, 2021
@damiendoligez
Copy link
Member

I agree with Xavier: this is not a problem, as physical equality is never guaranteed for immutable structures. And if we want to share strings, we should probably strive to share more than just the empty string.

I'm closing the issue. If anyone wants to work on sharing in the compiler, feel free to open a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants