no more identifiers with accented characters #5348

vicuna · 2011-08-25T10:46:11Z

Original bug ID: 5348
Reporter: poirriez
Status: resolved (set by @damiendoligez on 2017-03-03T14:44:53Z)
Resolution: not a bug
Priority: normal
Severity: feature
Version: 3.12.0
Category: ~DO NOT USE (was: OCaml general)
Related to: #6694
Monitored by: poirriez mehdi furuse @glondu @dbuenzli

Bug description

I just updated to 3.12.0 and, on Mac OS X 10.6.8, accented characters are no longer usable in identifiers:

$ ocaml
Objective Caml version 3.12.0

let carr?? x = x*x;;

Error: Illegal character (?)

The ?? was é

And under emacs:

    Objective Caml version 3.12.0

Characters 9-10:

let carré n = n * n;;
^
Error: Illegal character (\251)

Vincent

vicuna · 2011-08-25T12:44:42Z

Comment author: poirriez

same with ocaml 3.12.1

vicuna · 2011-09-05T09:24:01Z

Comment author: @xclerc

Well, I am not sure it is an OCaml issue.
I can locally reproduce the problem by setting encoding
of the terminal to UTF8. When switching back to ISO Latin 1
(which is the encoding actually used by OCaml), everything is
fine.

You can change the terminal encoding in the "Advanced" pane
of the "Settings" page of "Terminal.app" preferences.
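
The mismatch can be seen at the byte level. The following is only a sketch: `is_latin1_ident_byte` is a hypothetical helper that approximates the set of bytes a Latin-1 lexer accepts inside identifiers (an assumption, not the compiler's actual code). With a UTF-8 terminal, é arrives as two bytes; the first one happens to be a Latin-1 letter, the second one is not, which is consistent with the error being reported one character after the expected position.

```ocaml
(* "é" written with explicit byte escapes, so that this example does not
   depend on the encoding of the file it is stored in. *)
let e_acute_utf8 = "\xC3\xA9"   (* UTF-8: two bytes *)
let e_acute_latin1 = "\xE9"     (* ISO Latin 1: one byte *)

(* Hypothetical helper: approximates the bytes a Latin-1 lexer would
   accept inside an identifier (ASCII letters, digits, '_', '\'' and
   the accented letters of ISO Latin 1). *)
let is_latin1_ident_byte c =
  match c with
  | 'a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '\'' -> true
  | '\192'..'\214' | '\216'..'\246' | '\248'..'\255' -> true
  | _ -> false

(* Print each byte of the UTF-8 form and whether it would be accepted. *)
let () =
  String.iter
    (fun c ->
      Printf.printf "0x%02X accepted: %b\n" (Char.code c)
        (is_latin1_ident_byte c))
    e_acute_utf8
```

Under these assumptions the byte 0xC3 is accepted (it is Ã in Latin 1) and 0xA9 is rejected, whereas the single Latin-1 byte 0xE9 is accepted.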

vicuna · 2011-10-23T12:53:17Z

Comment author: gerd

Maybe the real issue is: ocaml should support UTF8 identifiers (instead of Latin1). Almost the whole world switched to UTF-8 in the meantime (e.g. many Linux distros use it as default now), and it becomes more and more painful that ocaml is so old-fashioned.

The full solution is complicated to implement, though: ocamllex would have to be changed so it can deal with multi-byte encodings. But as a tiny step in this direction, ocaml could at least allow UTF-8 as the external encoding but keep Latin1 as the internal encoding. That would mean a recoding step for every identifier that is read or written.
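
To make the recoding step concrete, here is a rough sketch of the "read" direction. The function name `latin1_of_utf8` and the exception `Not_latin1` are made up for illustration; nothing like this exists in the compiler as far as this thread shows. It only handles code points up to U+00FF, which is all that Latin1 can represent.

```ocaml
(* Raised with the byte offset of the first sequence that cannot be
   represented in Latin1 (or is not valid at that position). *)
exception Not_latin1 of int

(* Hypothetical recoding step: UTF-8 external form -> Latin1 internal
   form. ASCII bytes are copied; two-byte sequences starting with 0xC2
   or 0xC3 encode U+0080..U+00FF and become a single Latin1 byte. *)
let latin1_of_utf8 (s : string) : string =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec loop i =
    if i < n then begin
      let b0 = Char.code s.[i] in
      if b0 < 0x80 then begin
        Buffer.add_char buf s.[i];
        loop (i + 1)
      end else if (b0 = 0xC2 || b0 = 0xC3)
                  && i + 1 < n
                  && Char.code s.[i + 1] land 0xC0 = 0x80 then begin
        let cp =
          ((b0 land 0x1F) lsl 6) lor (Char.code s.[i + 1] land 0x3F) in
        Buffer.add_char buf (Char.chr cp);
        loop (i + 2)
      end else
        raise (Not_latin1 i)
    end
  in
  loop 0;
  Buffer.contents buf
```

For example, `latin1_of_utf8 "carr\xC3\xA9"` yields `"carr\xE9"`, while an identifier containing a character outside Latin1 raises `Not_latin1`.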

vicuna · 2012-01-30T16:38:28Z

Comment author: @damiendoligez

The real solution is to disallow accented letters in identifiers and accept only ASCII letters, but of course we cannot do that without breaking some existing programs.

vicuna · 2012-01-31T21:46:28Z

Comment author: @alainfrisch

Damien: What about a warning that reports accented letters in identifiers as a deprecated feature?

Gerd: It would be weird to accept UTF8 identifiers, while still parsing string literals as sequences of bytes (i.e. if we assume the source code to be utf8-encoded, String.length on a string literal would not return the length of the literal seen as a sequence of Unicode code points, unless you also change the semantics of strings, but I don't think you propose that).
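
A small illustration of the String.length point, assuming the literal is stored as UTF-8 (the bytes are written as escapes here so the example does not depend on the file encoding). `utf8_length` is a hypothetical helper that counts code points by skipping continuation bytes; it assumes well-formed input and does no validation.

```ocaml
(* "carré" as UTF-8 bytes: é is the two-byte sequence C3 A9. *)
let literal = "carr\xC3\xA9"

(* String.length counts bytes, not code points. *)
let byte_length = String.length literal

(* Hypothetical helper: count code points in well-formed UTF-8 by
   counting every byte that is not a continuation byte (10xxxxxx). *)
let utf8_length s =
  let count = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr count)
    s;
  !count

let () =
  Printf.printf "%d bytes, %d code points\n"
    byte_length (utf8_length literal)
```

Here the literal is 6 bytes long but contains 5 code points.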

vicuna · 2012-01-31T23:27:26Z

Comment author: gerd

Alain: I agree that a complete solution would also include a Unicode version of "string", maybe called ustring, and with literals like U"xyz". I don't think we should redefine "string", because there is also a need for byte arrays, and we would run into endless compatibility problems. So, having both string and ustring would be the ideal world. I see that there are currently not enough resources for getting there, and the question is how many elements we can nevertheless implement. That could also mean only deprecating accented letters for now.

Btw, for ustring we won't need that much, given that we accept that ocaml only provides basic Unicode support (string literals, one possible representation (ustring = int array), basic input/output, ocamllex), and leave the rest (alternate representations, character classes, transformations, ...) to add-on libraries like Camomile.

But anyway, I hope there is at least consensus that Unicode support is essential nowadays. The world is changing, and it has become irrelevant that Latin1 is still sufficient for most languages.

vicuna · 2012-02-01T10:28:38Z

Comment author: @damiendoligez

Alain: I would like such a warning but I'm not sure we have a consensus among OCaml developers at this point.

vicuna · 2012-02-06T16:45:10Z

Comment author: @zoggy

+1 for Gerd's proposition on Unicode support in ocaml distro ;-)

vicuna · 2012-07-10T11:57:58Z

Comment author: @damiendoligez

What we have here are several feature wishes:

1. Deprecate Latin-1 accents in source code and add the corresponding warning
2. Support Unicode strings
3. Have a separate type for mutable byte arrays

vicuna · 2012-07-10T12:54:01Z

Comment author: @ygrek

Concerning point 3: what about adding a module Bytes equal to the current String, and providing a compiler switch that exposes only read-only access to the String module? This way interested people can start migrating some code right now.

vicuna · 2012-08-16T14:43:24Z

Comment author: @dbuenzli

In my opinion, a first good step would be to

1. Mandate that sources should be UTF-8 encoded.
2. Disallow non-ASCII identifiers (which just implies banning any octet
   greater than 0x7F outside string literals and comments).
3. Indicate that an OCaml string is just a sequence of octets
   and that if it needs to be interpreted as text it should
   be understood as UTF-8 encoded text (for better library
   interoperability).
4. Deprecate Latin-1 functionality (uppercase, lowercase) from
   the String module and indicate that all the functions of the
   String module operate octet-wise.
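
As an illustration of the last point (String functions operating octet-wise), here is a short sketch. It assumes OCaml 4.03 or later, where `String.uppercase_ascii` is available; that function postdates this comment and is used here only to show what octet-wise, ASCII-only case conversion looks like on a UTF-8 encoded string.

```ocaml
(* "carré" as UTF-8 bytes, written with escapes so the example does not
   depend on the encoding of this file. *)
let name = "carr\xC3\xA9"

(* ASCII-only uppercasing: bytes outside 'a'..'z' are left untouched,
   so the two bytes encoding é pass through unchanged and the result is
   still well-formed UTF-8. *)
let upper = String.uppercase_ascii name

let () =
  String.iter (fun c -> Printf.printf "%02X " (Char.code c)) upper;
  print_newline ()
```

The four ASCII letters are uppercased and the bytes C3 A9 are preserved, so no Latin-1 interpretation is applied to the string.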

The main advantage of this is that it allows writing UTF-8
string literals and documentation (by passing -charset utf-8 to
ocamldoc). If one needs to work with a unicode string type it's
easy to invoke a suitable of_utf8 on the UTF-8 string literals to
get a kind of U"bla" notation, at the cost of a small runtime
penalty.
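
A sketch of what such an of_utf8 could look like, assuming the "ustring = int array" representation mentioned earlier in the thread. Both the type name and the function are hypothetical, and the decoder is deliberately minimal: it rejects truncated sequences and stray continuation bytes, but does not reject overlong encodings or surrogates as a real Unicode library would.

```ocaml
(* Hypothetical representation: one int per Unicode code point. *)
type ustring = int array

(* Hypothetical decoder from a UTF-8 encoded string literal. *)
let of_utf8 (s : string) : ustring =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* Payload of the continuation byte at index i; fails if the input is
     truncated or the byte is not of the form 10xxxxxx. *)
  let cont i =
    if i < n && byte i land 0xC0 = 0x80 then byte i land 0x3F
    else invalid_arg "of_utf8: malformed input"
  in
  let rec decode i acc =
    if i >= n then Array.of_list (List.rev acc)
    else
      let b = byte i in
      if b < 0x80 then decode (i + 1) (b :: acc)
      else if b land 0xE0 = 0xC0 then
        decode (i + 2) ((((b land 0x1F) lsl 6) lor cont (i + 1)) :: acc)
      else if b land 0xF0 = 0xE0 then
        decode (i + 3)
          ((((b land 0x0F) lsl 12)
            lor (cont (i + 1) lsl 6)
            lor cont (i + 2)) :: acc)
      else if b land 0xF8 = 0xF0 then
        decode (i + 4)
          ((((b land 0x07) lsl 18)
            lor (cont (i + 1) lsl 12)
            lor (cont (i + 2) lsl 6)
            lor cont (i + 3)) :: acc)
      else invalid_arg "of_utf8: malformed input"
  in
  decode 0 []
```

With this, `of_utf8 "carr\xC3\xA9"` plays the role of a U"carré" literal: the decoding happens once at run time, which is the small penalty mentioned above.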

To me UTF-8 identifiers are more a curse than a benefit because
of keyboard and normalization issues (é can be encoded in more
than one way, and I'm not sure the dev team wants to bring in
that whole issue into the compiler). If however some think that's
a good idea note that the Unicode standard has recommendations
for what a unicode identifier should be made of
http://unicode.org/reports/tr31/

I have written many programs that deal with unicode by just UTF-8
encoding my sources without any problem. I think that this brings
in more unicode compatibility without having to include a unicode
library in the distribution.

Regarding having unicode string support in the distribution. I'd
rather have nothing than a toy Unicode implementation in the
standard library. I also have written many programs that didn't
need a full blown unicode library and only needed to just
blindly pass around and concatenate UTF-8 encoded strings.

vicuna · 2013-10-09T12:23:06Z

Comment author: @damiendoligez

Note: 4.01.0 adds a warning signaling Latin1 characters in identifiers as a deprecated feature.

vicuna · 2017-03-03T14:42:46Z

Comment author: @damiendoligez

Note: 4.02.0 added a type for mutable byte arrays.
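
For reference, a minimal sketch of how that type is used, assuming OCaml 4.02 or later. The value names are made up for illustration; only `Bytes.of_string`, `Bytes.set` and `Bytes.to_string` come from the standard library.

```ocaml
(* An ordinary string, treated as immutable. *)
let original = "carre"

(* Bytes.of_string copies the contents into a fresh mutable buffer. *)
let buf = Bytes.of_string original

(* In-place modification goes through the Bytes module. *)
let () = Bytes.set buf 0 'C'

(* Bytes.to_string copies again, so later changes to buf do not affect
   the resulting string. *)
let modified = Bytes.to_string buf

let () = Printf.printf "%s -> %s\n" original modified
```

The original string is left unchanged, since each conversion makes a copy.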

vicuna · 2017-03-03T14:44:53Z

Comment author: @damiendoligez

What's left of the discussion is a wish for support for unicode strings. Unicode support is best left to external libraries and we already have several such libraries.

vicuna closed this as completed Mar 3, 2017

This was referenced Mar 14, 2019

Identifiers in Unicode #6692

Closed

Do not implicitly use ISO-8859-1 in Char.uppercase/lowercase and derived functions #6694

Closed

vicuna added the feature-wish label Mar 20, 2019
