Identifiers in Unicode #6692

Closed
vicuna opened this issue Dec 6, 2014 · 23 comments

Comments

vicuna commented Dec 6, 2014

Original bug ID: 6692
Reporter: @whitequark
Status: closed (set by @gasche on 2014-12-13T20:58:57Z)
Resolution: suspended
Priority: normal
Severity: feature
Category: ~DO NOT USE (was: OCaml general)
Related to: #6695 #6697 #6704
Monitored by: lelf @hcarty @dbuenzli @yakobowski

Bug description

Any modern language obviously should handle source code in non-latin scripts.

I think it would be possible to change only OCaml's lexer to properly parse UTF-8, possibly behind a command-line flag, and without, yet, any stdlib API changes.

I volunteer to write a patch if there is some consensus on how to bootstrap it--embedding sedlex being the simplest solution in my view.

vicuna commented Dec 6, 2014

Comment author: @gasche

On the feature proposal itself: I remember we mentioned this at the OCaml User Meeting back in 2008. Xavier Leroy thought it was a reasonable idea in principle, and that we could simply follow what Java standardized as reasonable Unicode identifiers instead of arguing again about a new spec (preserving, of course, OCaml's case distinction, which extends to Unicode).

vicuna commented Dec 6, 2014

Comment author: @whitequark

The problem with case-distinction is that many languages don't have it. For example, Japanese and Hebrew.

I propose to allow using a sigil instead of first capital letter, e.g. @???.@?.

Additionally, I'm not sure what is the status of characters outside of BMP in the Java spec.

Otherwise it should be fine.

vicuna commented Dec 6, 2014

Comment author: @whitequark

Argh, Mantis transforms Unicode chars into question marks.

vicuna commented Dec 6, 2014

Comment author: @gasche

After a bit of digging:

Neither document precisely specifies how to distinguish capitalized from uncapitalized identifiers. This is discussed in section 4.2 of chapter 4, Character Properties, of the main Unicode standard:
http://www.unicode.org/versions/Unicode7.0.0/ch04.pdf

(There are at least three different notions of what being lowercase or uppercase means. The only good news is that it's vastly easier than mapping a word to uppercase or lowercase, which we don't need.)

I think the best choice would be to find a reasonable document in production somewhere that has a case distinction, and follow it to the letter -- the bigger risk is interminable discussion. If this does not work, I would suggest considering as capitalized the valid identifiers that start with a character of the General Category of uppercase or titlecase letters (Lu, Lt)¹.

¹: general categories are a partition of the letter space, so they're guaranteed to preserve disjointness of the capitalized and non-capitalized namespaces. It's better to define the set of capitalized identifiers by restriction (than the set of non-capitalized by restriction) because capitalized identifiers (variant constructors and module names) are less frequent than non-capitalized (everything else) in OCaml.
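
The fallback rule above (an identifier is capitalized when its first character has General Category Lu or Lt) can be sketched in a few lines. This is only an illustration in Python, using the standard `unicodedata` module as a stand-in for whatever character tables the compiler would embed; `is_capitalized` is a hypothetical name, and checking that the rest of the string is a valid identifier is left out.

```python
import unicodedata

# General categories treated as "capitalized" under the rule sketched above.
CAPITAL_CATEGORIES = {"Lu", "Lt"}


def is_capitalized(identifier: str) -> bool:
    """Return True if the first character is an uppercase or titlecase letter."""
    if not identifier:
        raise ValueError("empty identifier")
    return unicodedata.category(identifier[0]) in CAPITAL_CATEGORIES


# U+00C9 is Lu, U+00E9 is Ll, U+01C5 is Lt, U+65E5 is Lo (no case at all).
for name in ["\u00c9l\u00e9phant", "\u00e9lan", "\u01c5ungla", "\u65e5\u672c\u8a9e"]:
    print(name, is_capitalized(name))
```

Because every character belongs to exactly one general category, an identifier is either in the capitalized namespace or not. Scripts without case fall into the non-capitalized namespace under this rule.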

vicuna commented Dec 6, 2014

Comment author: @dbuenzli

Very bad idea. If you want to do that correctly you will have to include a good deal of the Unicode machinery into the compiler (e.g. normalisation, see http://www.unicode.org/reports/tr31/tr31-21.html#normalization_and_case). Also bear in mind that the set of characters is not closed, so you will also have to commit the language to a particular Unicode version, and people will certainly ask you to upgrade when new characters are introduced. That is one more burden for the maintainers: there's a new version of Unicode each year, in June.

I really see no benefit, only trouble, in having Unicode identifiers, especially since there are too many unreasonable programmers out there and too many arrows and odd characters to choose from. I don't see how this makes the language more "modern", oh emoji identifiers maybe... The underlying natural language of programming is English (whether you like it or not) and is perfectly served by the beautiful set of ASCII characters. And if you absolutely want your poop emoji identifier this can be tackled by an overlay at the IDE level.

Something much more useful would be for the compiler to check the UTF-8 validity of the sources when instructed that they are encoded as such. This would allow us to guarantee the UTF-8 validity of the string literals, which can't be guaranteed now because it depends on the editor you are actually using, so formally you have to recheck them for UTF-8 validity at runtime.
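
To illustrate the kind of check meant here, a strict UTF-8 decode over the raw source bytes is enough to reject a file before lexing. The following is a minimal sketch in Python rather than anything from the OCaml compiler; `first_invalid_utf8_offset` is a hypothetical helper name.

```python
from typing import Optional


def first_invalid_utf8_offset(source: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        # Strict decoding raises on any ill-formed sequence.
        source.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return err.start
    return None


valid = 'let s = "caf\u00e9"'.encode("utf-8")
latin1 = 'let s = "caf\u00e9"'.encode("latin-1")  # same text saved in Latin-1

print(first_invalid_utf8_offset(valid))   # None
print(first_invalid_utf8_offset(latin1))  # offset of the lone 0xE9 byte
```

Once the whole source passes this check, a string literal written without escape sequences is valid UTF-8 by construction. Escape sequences that inject arbitrary bytes would still need a separate check.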

See also #5348#c7948

vicuna commented Dec 6, 2014

Comment author: @whitequark

"The underlying natural language of programming is English"

Exactly. And there is nothing appropriate about this.

vicuna commented Dec 6, 2014

Comment author: @dbuenzli

Why not? Would you prefer, for example, that science had no lingua franca?

Anyway the thing is that it seems that you have a poor understanding of what your proposal would really entail technically. It's not just about bringing in sedlex.

What happens if one writes an identifier with an é in decomposed form in one source and another calls that identifier with an é in pre-composed form in another source? To resolve these issues you need Unicode normalisation in the compiler.

You could mandate that the sources themselves must be in a given normal form but then that would put all the burden on the users of ensuring their editor saves in the right normal form (I don't have any idea in which form (if any) emacs saves my buffers, and don't even want to know).
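
The decomposed versus pre-composed problem is easy to reproduce. The sketch below uses Python's standard `unicodedata` module purely as an illustration; `same_identifier` is a hypothetical helper, and the choice of NFC as the comparison form is an assumption, not something the thread settles.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT


def same_identifier(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two identifiers after bringing both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The two spellings render identically but are different code point sequences.
print(precomposed == decomposed)                 # False
print(len(precomposed), len(decomposed))         # 4 5
print(same_identifier(precomposed, decomposed))  # True
```

Without the normalisation step, a definition written one way and a use written the other way would be treated as two unrelated names.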

vicuna commented Dec 6, 2014

Comment author: @whitequark

I'm not interested in an opinion on using non-latin1 characters from someone who doesn't even know any language using non-latin1 script.

Yes, I'm aware that I will have to bring UTF-8 parsing, normalization, and to some degree the character database into the compiler. This is actually a good argument for bringing the UChar module into the stdlib, although I will not suggest that it will be exposed as part of this issue.

vicuna commented Dec 6, 2014

Comment author: @dbuenzli

Whitequark:

  1. So apparently you know me well. I'm surprised: do you have access to information about me? I would kindly suggest you don't even try to imagine what I may or may not know, or who I may actually be. FWIW I did take Russian classes, and while I'm certainly not up to a conversational level I do know my Cyrillic alphabet well, a byproduct of having travelled a lot in Central Asia and ex-Soviet countries; it's always useful to know that a pectopah is a restaurant.

  2. Since you claim you were aware of the complexities, why didn't you expose them in your initial request? I suggest you reread it, as it is strikingly naive.

Anyway, I'm glad that we both agree on the actual implications.

vicuna commented Dec 6, 2014

Comment author: @whitequark

The reason is very simple. The discussion I am interested in is the form of the interface, as this is something that changes the surface language quite a bit, and the ocamlc-specific implementation details, as this is something I am unfamiliar with.

The minutiae of the Unicode support are mostly fixed, familiar to me, and not interesting in the context of this discussion as long as no visible interface changes are made.

vicuna commented Dec 6, 2014

Comment author: @dbuenzli

One problem is that you entirely side stepped the issue of knowing whether we actually want Unicode identifiers. That's what you wrote:

"Any modern language obviously should handle source code in non-latin scripts."

"Modern", "obvious": that's not argumentation, that's just someone trying to quickly push through their own wants. Further, you tell us that you don't want to discuss this with someone who may only know languages written in the latin1 alphabet (impolitely and wrongly suggesting that this is my case along the way). Sorry, but you lost a lot of credibility in that discussion.

The starting point should be a real cost/benefit analysis of having Unicode identifiers, which includes both their actual utility and problems in (the) practice (of programming), their interface, their implementation and the maintenance entailed. And yes, this wholly includes the minutiae of Unicode support whether you are interested or not.

vicuna commented Dec 7, 2014

Comment author: dario

whitequark: I see where you're coming from, and I empathise with the sentiment. Moreover, there might be a few use cases where localised identifiers would be of use. Perhaps you are writing an accounting application for the Russian market and you want the function names to match the legal terminology, for instance.

However, let's be practical: if I were writing such a localised application, I would nevertheless stay away from Unicode, using only a pure ASCII romanisation for identifiers. Daniel has already brought up a few reasons why, and I can add a couple more:

  • all the visually identical characters are just a headache waiting to happen;
  • you may find yourself needing to edit your source-code while traveling and having access only to international QWERTY keyboards (it has happened to me);
  • you never know if a foreigner will join your project (in fact, this makes the stronger argument that besides using only ASCII, you should also use English for your identifiers).
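
The first point can be made concrete with a Latin and a Cyrillic letter that render the same. The sketch below is a rough heuristic in Python, not a real confusable check: it takes the first word of each character's Unicode name as a stand-in for its script, which is an assumption made for brevity, since the standard library does not expose the Script property. `scripts_used` and `is_mixed_script` are hypothetical names.

```python
import unicodedata


def scripts_used(identifier: str) -> set:
    """Approximate the scripts in an identifier from Unicode character names."""
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "LATIN SMALL LETTER A" -> "LATIN"
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(identifier: str) -> bool:
    """Flag identifiers whose letters come from more than one script."""
    return len(scripts_used(identifier)) > 1


latin = "paypal"
spoofed = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A

print(latin == spoofed)          # False, although they look the same
print(is_mixed_script(latin))    # False
print(is_mixed_script(spoofed))  # True
```

A production check would rely on the Unicode Script property and the consortium's confusables data rather than on character names.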

Therefore, while the notion of Unicode identifiers seems in theory very hip and inclusive and kumbaya, I am deeply sceptical about its real world usefulness. Moreover, bear in mind that this is not a zero-cost addition to the compiler -- no extra feature ever is. Instead, it represents yet another layer that must be maintained.

To summarise, I would agree with Daniel that the disadvantages of this feature far outweigh its advantages.

vicuna commented Dec 8, 2014

Comment author: @alainfrisch

"Yes, I'm aware that I will have to bring UTF-8 parsing, normalization, and to some degree the character database into the compiler."

Even if you're not interested in my opinion, let me say it: I can see some uses of having the compiler check that string literals are valid utf-8, but I've a strong preference for restricting identifiers to ASCII letters. I don't think it's a good idea to bring all the complexity related to Unicode identifiers and additional headaches related to encoding issues of the file system and the terminal.

vicuna commented Dec 9, 2014

Comment author: @damiendoligez

We haven't heard from Jacques yet, but I'd say don't write your patch yet.

vicuna commented Dec 12, 2014

Comment author: @whitequark

@dbuenzli: You talk about credibility and act like cursory knowledge of Russian gives you the authority to outright dismiss most existing cultures.

@Dario, @Frisch, @dbuenzli: Of course, there is a maintenance cost. However, as @gasche mentioned, there is some interest among the maintainers to bring this in. So I will wait for a decision from those who do maintain the compiler.

Visually similar characters as well as filesystem normalization issues are not a significant practical problem. They are not only solvable, but also solvable with negligible compile-time cost, or none at all for ASCII-only sources.

To not be unsubstantiated, I have implemented an alternative frontend to the compiler using the -pp option: https://github.com/whitequark/ocaml-m17n

Some highlights include:

  • At its core, OCaml cares very little about the encoding of identifiers. In fact, the minimal possible change is one line in ocamldep, although there are a few strange places in the typechecker (which in my opinion should be fixed regardless of whether non-ASCII is desirable).

  • The frontend is not large or complex. The current implementation comprises about 1KLOC, which is actually slightly less than Lexing+Lexer (which it effectively replaces), and that's including the facilities for disambiguating visually similar and misnormalized identifiers, and wrappers for incompatible compiler-libs machinery.

  • The frontend depends on Uutf, Uunf, Uucp and Sedlex, which are essential to its operation, and Gen, which can be rather easily replaced with an abstraction over Lexing.lexbuf. (It will even make the whole thing a bit simpler.)

  • Uucp, and, transitively, Uunf and Sedlex, depend on the versioned Unicode tables. However, the 1-year release cycle of Unicode maps well to the 1-year release cycle of OCaml, and regenerating the tables from consortium-provided definitions is, arguably, not a large maintenance burden.

Currently it is not very optimized. In particular, I did not add fast paths for ASCII-only identifiers. However, I did not add any operations with supralinear complexity.
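
As a rough illustration of what an ASCII fast path and a check for misnormalized identifiers could look like, here is a sketch in Python. It is not taken from ocaml-m17n; `canonical` and `misnormalized_groups` are hypothetical names, and NFC is assumed as the canonical form.

```python
import unicodedata
from collections import defaultdict


def canonical(identifier: str) -> str:
    """Return the NFC form, skipping the work for pure-ASCII identifiers."""
    # Fast path: ASCII strings are already in NFC, so nothing needs to be done.
    if identifier.isascii():
        return identifier
    return unicodedata.normalize("NFC", identifier)


def misnormalized_groups(identifiers):
    """Group identifiers that differ only by normalization form."""
    groups = defaultdict(set)
    for ident in identifiers:
        groups[canonical(ident)].add(ident)
    # Keep only the canonical forms reached by more than one spelling.
    return {key: spellings for key, spellings in groups.items() if len(spellings) > 1}


idents = ["caf\u00e9", "cafe\u0301", "cafe", "x"]
print(misnormalized_groups(idents))
```

Both functions do a linear amount of work per identifier, and ASCII-only sources never reach the normalizer at all.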

See also #125.

vicuna commented Dec 12, 2014

Comment author: @whitequark

Oh, and it does not currently guarantee the validity of string literals if they include escape sequences. I'm not opposed to adding that, although in my view this is completely useless. Not only is invalid UTF-8 in string literals in source code not a problem anyone has in practice (the editors will loudly complain, for one), but the lack of a type that enforces the invariant also means that you eliminate only the rarest possible source of invalid UTF-8 sequences.

vicuna commented Dec 12, 2014

Comment author: @gasche

I'm not happy with the overly heated tone of the discussion here. I usually silently mumble about Daniel's rudeness, but in this case the wrongs are more than shared. Please stop making comments on the persons rather than the code.

(Incidentally, @Frisch is indeed a maintainer of the compiler, and he's probably the one most invested in the frontend part, which is the most relevant to the present discussion.)

As a first step, I would be interested in helping (in my spare time) with whatever low-invasiveness changes to the compiler can improve, or be extracted from, whitequark's alternative frontend. I don't particularly expect this frontend to live a long and useful life (it is mostly an impressive pile of hacks independently waiting to be "upstreamed" somewhere or go back to sleep), but it seems a good way to look at small, incremental, non-controversial changes to the compiler internals first.

vicuna commented Dec 12, 2014

Comment author: @whitequark

I agree. I also know that @Frisch is one of the maintainers; I meant, but did not say explicitly enough, that I defer the decision to him.

I don't quite agree on the "pile of hacks". While the interfaces presented by compiler-libs are less than stellar (understandable, given they were internal for a few decades before the 4.02 release) and there is some fragile code there, the changes are not that invasive. Given that:

  • the problem with expunging,
  • the treatment of Latin-1 in filenames,
  • the treatment of ASCII in typechecker

are fixed, the rest only uses interfaces that are defined and, in some cases (Toploop), have been exported for a long time. Thus it could exist as a usable separate package indefinitely--OCaml's lexer barely changes--although it is still my view that this should be upstreamed.

vicuna commented Dec 12, 2014

Comment author: @dbuenzli

@whitequark When did I say that my limited knowledge of Russian gave me any kind of authority on the matter? I was just responding to you pretending I didn't know anything beyond languages written in Latin script.

What I did say, however (since it seems you are not very apt at understanding me), is that I was fine with the idea of English being the lingua franca of computing, and I never pretended in that discussion that this may not be culturally biased.

Stop misrepresenting what I say or what I may know.

vicuna commented Dec 12, 2014

Comment author: @whitequark

We can stop this discussion on this:

I do not think that forcing everyone to use only ASCII is a good way, or a way at all, to ensure computing has a lingua franca, or that having the ability to use non-English identifiers will destroy that property. In fact, if it somehow does, then it was not worth having it in the first place.

I also do not think computing needs a lingua franca in the first place. Or necessarily currently has: transliterated identifiers and comments in a local language, as it commonly happens, hardly allow for a deep understanding.

vicuna commented Dec 13, 2014

Comment author: @dbuenzli

Interesting to 1) stop a discussion that never started 2) decide when the discussion should stop.

And by the way, it's funny that you actually accuse me of dismissing most cultures while you were the one to rule out of the discussion anybody whose culture may be based on the sole knowledge of the Latin script. What an entertaining psychological twist.

Just a hint: I work on Unicode not because I'm paid to do it (I never was, except for part of uucp this summer), but because I'm actually interested in being able to represent, from an HCI point of view, the world's scripts and their idiosyncrasies in our computing artefacts.

vicuna commented Dec 13, 2014

Comment author: @whitequark

"I only know X, and because of that I'm certainly able to decide whether non-X is desirable!" Talk about entertaining.

vicuna commented Dec 13, 2014

Comment author: @gasche

Enough. We have enough work with the related issues to let you both vent steam elsewhere.
