
Make string literal more friendly to utf8 #7481

Closed
vicuna opened this issue Feb 13, 2017 · 5 comments

Comments


vicuna commented Feb 13, 2017

Original bug ID: 7481
Reporter: @bobzhang
Assigned to: @Octachron
Status: resolved (set by @Octachron on 2017-02-24T14:57:58Z)
Resolution: suspended
Priority: normal
Severity: feature
Target version: later
Category: ~DO NOT USE (was: OCaml general)
Monitored by: @dbuenzli

Bug description

These days, most programming languages adopt UTF-8 as their source encoding. OCaml is almost there, except that in OCaml, strings are sometimes used to encode binary data, for example in a lexer generator:

Lexing.lex_base =
"\000\000\246\255\247\255\248\255\249\255\250\255\251\255\252\255
\058\000\133\000\255\255"

Actually, if we provided sugar for bytes literals, maybe
b"\000\000\246\255\247\255\248\255\249\255\250\255\251\255\252\255
\058\000\133\000\255\255"

then we would not need to use strings to encode binary data.
I opened this ticket mostly for discussion.
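For context, a sketch of how such a generated table is consumed: the string literal holds raw bytes (each `\246`-style escape is a byte value in decimal), read back as 16-bit integers. The `get16` helper below is an illustrative reimplementation of what the ocamllex runtime does, not its actual code.

```ocaml
(* A shortened lexer table embedded as a string literal: each escape
   like \246 is a raw byte value in decimal, not text. *)
let lex_base = "\000\000\246\255\058\000\133\000\255\255"

(* Read entry i as a signed little-endian 16-bit integer, the way
   the ocamllex runtime consumes its tables (illustrative sketch). *)
let get16 (s : string) (i : int) : int =
  let b0 = Char.code s.[2 * i] and b1 = Char.code s.[2 * i + 1] in
  let v = b0 lor (b1 lsl 8) in
  if v land 0x8000 <> 0 then v - 0x10000 else v

let () =
  assert (get16 lex_base 0 = 0);    (* \000\000 *)
  assert (get16 lex_base 1 = -10);  (* \246\255 = 0xFFF6 *)
  assert (get16 lex_base 2 = 58)    (* \058\000 *)
```

The byte sequences involved are arbitrary, so such a table is not (and need not be) valid UTF-8 text.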


vicuna commented Feb 24, 2017

Comment author: @Octachron

I am not really sure if I follow your thoughts here.

String literals are already independent of encodings.

Using utf-8 encoding for source files already works: I have utf-8 encoded mathematical symbols in my comments/documentations all the time.

It is true that string literals can also be used for non-text data, but I don't see how that impedes the use of string literals for text data.

Moreover, your proposed syntax for byte literals cannot work as such, and having mutable literals sounds dubious.

All in all, given the mismatch between the ticket title, your introductory paragraph, and your proposal for byte literals, I am a bit lost, and I would be very grateful if you could make precise the question you are asking.


vicuna commented Feb 24, 2017

Comment author: @bobzhang

Currently a string literal is a byte array, which is fine. I wish there were a guarantee that such a byte array is a valid sequence of UTF-8 code points (not an arbitrary byte array), as in Go.

To keep such an invariant, we could perform a check at the syntactic level and make sure that all functions generating strings preserve UTF-8 validity.

However, strings are also used in lexer generators as byte arrays, which should really be the use case for bytes.

Hope it helps.
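A byte-level sketch of the proposed validity check (the function name `utf_8_valid` is hypothetical; since OCaml 4.14 the standard library provides `String.is_valid_utf_8` for exactly this purpose):

```ocaml
(* Check that s is a valid UTF-8 byte sequence: no truncated or
   overlong sequences, no surrogates, nothing above U+10FFFF. *)
let utf_8_valid (s : string) : bool =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  let cont i = i < n && byte i land 0xC0 = 0x80 in
  let rec go i =
    if i >= n then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)                         (* ASCII *)
      else if b < 0xC2 then false                         (* stray continuation / overlong *)
      else if b < 0xE0 then cont (i + 1) && go (i + 2)    (* 2-byte sequence *)
      else if b < 0xF0 then                               (* 3-byte sequence *)
        cont (i + 1) && cont (i + 2)
        && (b <> 0xE0 || byte (i + 1) >= 0xA0)            (* no overlong *)
        && (b <> 0xED || byte (i + 1) < 0xA0)             (* no surrogates *)
        && go (i + 3)
      else if b < 0xF5 then                               (* 4-byte sequence *)
        cont (i + 1) && cont (i + 2) && cont (i + 3)
        && (b <> 0xF0 || byte (i + 1) >= 0x90)            (* no overlong *)
        && (b <> 0xF4 || byte (i + 1) < 0x90)             (* <= U+10FFFF *)
        && go (i + 4)
      else false
  in
  go 0
```

On the two kinds of literals above, `utf_8_valid "caf\xc3\xa9"` holds while `utf_8_valid "\246\255"` (raw lexer-table bytes) does not.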


vicuna commented Feb 24, 2017

Comment author: @Octachron

Thank you for clarifying your thoughts; it helped me.

As far as I can tell, Go does not offer the invariant that strings are valid UTF-8 encoded text: https://golang.org/pkg/builtin/#string.

This is not surprising, since the subset of UTF-8 encoded texts is not stable under slicing: "\226\136\128" is a valid UTF-8 encoded text, but none of its proper substrings satisfies this invariant. Consequently, ensuring that all string literals are valid UTF-8 code point sequences is not enough to guarantee that this property is satisfied by all strings.
String indexing (i.e. s.[k]) is also of questionable utility for UTF-8 encoded text.
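The slicing point can be checked directly with the standard library's `String.is_valid_utf_8` (available since OCaml 4.14, so a later addition relative to this discussion):

```ocaml
(* "\226\136\128" encodes U+2200 (FOR ALL) in three bytes: the whole
   string is valid UTF-8, but no proper non-empty substring of it is. *)
let s = "\226\136\128"

let () =
  assert (String.is_valid_utf_8 s);
  assert (not (String.is_valid_utf_8 (String.sub s 0 2)));
  assert (not (String.is_valid_utf_8 (String.sub s 1 2)))
```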

Regardless of these issues, I am not really sure what the benefit would be of having this guarantee enforced directly by the compiler, rather than by an external library, possibly coupled with a ppx literal checker.

Could you comment on this point?


vicuna commented Feb 24, 2017

Comment author: @bobzhang

Thanks for your info, you are correct that there is no such invariant in Go.
Note that string indexing would be okay if we provided
String.utf8_get : string -> int -> Uchar.t; note that in Go, the for loop is UTF-8-aware.
About the ppx literal checker: I think it would be nice to have such basic functionality in the core of the language (ppx is too heavy for me).
I agree with you in general, I will do more research before sending an email, feel free to close this issue.
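One possible reading of the proposed `String.utf8_get`, sketched here as a free function on top of OCaml ≥ 4.14's `String.get_utf_8_uchar` (the name `utf8_get` and the scalar-based indexing are assumptions, not an actual API): reaching the n-th scalar requires decoding from the start of the string.

```ocaml
(* Hypothetical utf8_get: the n-th Unicode scalar value of s.
   Indexing is by scalar, so each lookup decodes from the front: O(n). *)
let utf8_get (s : string) (n : int) : Uchar.t =
  let rec go i k =
    if i >= String.length s then invalid_arg "utf8_get"
    else
      let d = String.get_utf_8_uchar s i in
      if k = 0 then Uchar.utf_decode_uchar d
      else go (i + Uchar.utf_decode_length d) (k - 1)
  in
  go 0 n

let () =
  assert (Uchar.to_int (utf8_get "caf\xc3\xa9" 3) = 0xE9)  (* U+00E9, é *)
```

The linear cost of each lookup is exactly the loss of random access raised in the next comment.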


vicuna commented Feb 24, 2017

Comment author: @Octachron

String indexing is okay for byte-level access; however, isolated uchar values do not have a graphical or linguistic interpretation. For text values, the more natural option would be to index characters or glyphs. However, both of these options break random access.
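The distinction can be made concrete with a combining sequence, using OCaml ≥ 4.14's decoding API (grapheme counting would need an external segmentation library such as uuseg, so only bytes and scalars are counted here):

```ocaml
(* "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: one glyph,
   two scalar values, three bytes. *)
let s = "e\xcc\x81"

(* Count Unicode scalar values by decoding the string front to back. *)
let uchar_count s =
  let rec go i acc =
    if i >= String.length s then acc
    else go (i + Uchar.utf_decode_length (String.get_utf_8_uchar s i)) (acc + 1)
  in
  go 0 0

let () =
  assert (String.length s = 3);  (* byte indexing sees 3 positions *)
  assert (uchar_count s = 2)     (* scalar indexing sees 2, the reader sees 1 glyph *)
```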

Note that for this use case, it seems possible to use a ppx literal checker as an external tool that checks that all literals are valid (and normalized) UTF-8 but does not pipe the resulting abstract tree to the compiler. Also, the ppx utility ppx_utf8_lit by Daniel Bünzli can already check UTF-8 validity and normalization.

I am temporarily closing this issue, but please do not hesitate to reopen it (or another one) once you are done with your research.
