Make string literal more friendly to utf8 #7481
Comments
Comment author: @Octachron I am not really sure that I follow your thoughts here. String literals are already independent of encodings, and using UTF-8 encoding for source files already works: I have UTF-8 encoded mathematical symbols in my comments and documentation all the time. It is true that string literals can also be used for non-text data, but I don't see how that impedes the use of string literals for text data. Similarly, your proposed syntax for byte literals cannot work as written, and having mutable literals sounds dubious. All in all, with the mismatch between the ticket title, your introductory paragraph, and your proposition for byte literals, I am a bit lost, and I would be very grateful if you could make the question you are asking more precise. |
Comment author: @bobzhang Currently a string literal is a byte array, which is fine. I wish there were a guarantee that such a byte array is a valid UTF-8 sequence (not an arbitrary byte array), as in Go. To keep such an invariant, we could do a check at the syntactic level and make sure all functions generating strings produce valid UTF-8. However, strings are also used in lexer generators as byte arrays, which really should be the use case for bytes. Hope this helps. |
Comment author: @Octachron Thank you for clarifying your thoughts, it helped me. As far as I can tell, Go does not offer the invariant that strings are valid UTF-8 encoded text: https://golang.org/pkg/builtin/#string. This is not surprising, since the subset of UTF-8 encoded text is not stable under slicing: "\226\136\128" is a valid UTF-8 encoded text, but none of its non-empty proper substrings satisfies this invariant. Consequently, ensuring that all string literals are valid UTF-8 sequences is not enough to guarantee that this property is satisfied by all strings. These issues aside, I am not really sure what the benefits would be. Could you comment on this point? |
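The slicing point above can be checked directly. This is a minimal sketch, assuming OCaml >= 4.14, which provides String.is_valid_utf_8 in the standard library; "\226\136\128" is the UTF-8 encoding of U+2200 (FOR ALL).

```ocaml
(* Validity of a UTF-8 string is not inherited by its substrings:
   slicing a multi-byte sequence leaves truncated or orphaned bytes. *)
let () =
  let s = "\226\136\128" in
  assert (String.is_valid_utf_8 s);
  (* "\226" alone is a lead byte with no continuation bytes. *)
  assert (not (String.is_valid_utf_8 (String.sub s 0 1)));
  (* "\226\136" is a truncated three-byte sequence. *)
  assert (not (String.is_valid_utf_8 (String.sub s 0 2)));
  (* "\136\128" is a pair of continuation bytes with no lead byte. *)
  assert (not (String.is_valid_utf_8 (String.sub s 1 2)));
  print_endline "all checks passed"
```

So a syntactic check on literals alone cannot establish the invariant for all strings produced at run time.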
Comment author: @bobzhang Thanks for your info, you are correct that there is no such invariant in Go. |
Comment author: @Octachron String indexing is okay for byte-level access; however, isolated uchar values do not have a graphical or linguistic interpretation. For text values, the more natural option would be to index characters or glyphs. However, both of these options break random access. Note that for this use case, it seems possible to use a ppx literal checker as an external tool that checks that all literals are valid (and normalized) UTF-8 but does not pipe the resulting abstract tree to the compiler. Also, the ppx utility ppx_utf8_lit by Daniel Bünzli can already check UTF-8 validity and normalization. I am temporarily closing this issue, but please do not hesitate to reopen it (or another one) once you are done with your research. |
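The byte-index vs. scalar-value distinction above can be illustrated with a short sketch, assuming OCaml >= 4.14 (String.get_utf_8_uchar and the Uchar.utf_decode API); fold_utf_8 and count_scalars are hypothetical helpers, not stdlib functions.

```ocaml
(* Byte indexing is O(1), but counting or indexing Unicode scalar
   values requires a sequential decode of the whole prefix. *)
let fold_utf_8 f acc s =
  let rec go acc i =
    if i >= String.length s then acc
    else
      let d = String.get_utf_8_uchar s i in
      go (f acc (Uchar.utf_decode_uchar d)) (i + Uchar.utf_decode_length d)
  in
  go acc 0

let count_scalars s = fold_utf_8 (fun n _ -> n + 1) 0 s

let () =
  let s = "\226\136\128x" in      (* U+2200 followed by 'x' *)
  assert (String.length s = 4);   (* 4 bytes ... *)
  assert (count_scalars s = 2)    (* ... but only 2 scalar values *)
```

And even scalar values are not the end of the story: a user-perceived character (grapheme cluster) may span several scalar values, which is why glyph indexing cannot be random access either.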
Original bug ID: 7481
Reporter: @bobzhang
Assigned to: @Octachron
Status: resolved (set by @Octachron on 2017-02-24T14:57:58Z)
Resolution: suspended
Priority: normal
Severity: feature
Target version: later
Category: ~DO NOT USE (was: OCaml general)
Monitored by: @dbuenzli
Bug description
These days, most programming languages adopt UTF-8 as their source encoding. OCaml is almost there, except that in OCaml, strings are sometimes used to encode binary data, for example in lexer generators:
Lexing.lex_base =
"\000\000\246\255\247\255\248\255\249\255\250\255\251\255\252\255
\058\000\133\000\255\255"
Actually, if we provided some sugar for bytes, maybe something like
b"\000\000\246\255\247\255\248\255\249\255\250\255\251\255\252\255
\058\000\133\000\255\255"
then we wouldn't need to use strings to encode binary data.
I filed this ticket mostly for discussion.
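The b"..." syntax proposed above does not exist in OCaml. A minimal sketch of a workaround available today, assuming nothing beyond the standard library, is to keep the table as a string literal and convert it once at module initialization (the lex_base name mirrors the generated-lexer field; the trailing backslash is OCaml's string line continuation, as used in generated lexers):

```ocaml
(* One-time copy of the literal into a mutable byte buffer. *)
let lex_base : Bytes.t =
  Bytes.of_string
    "\000\000\246\255\247\255\248\255\249\255\250\255\251\255\252\255\
     \058\000\133\000\255\255"

let () =
  assert (Bytes.length lex_base = 22);
  assert (Bytes.get lex_base 2 = '\246')
```

This costs an extra copy and a heap allocation per table, which is exactly what a b"..." literal could avoid; it also shows why a *mutable* literal is dubious: two occurrences of the same literal would have to denote distinct buffers.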