New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768
Comments
Comment author: @damiendoligez This has been proposed before and I think the consensus was that it's not a job for the compiler, but rather for some development environment like TypeRex. OTOH, it's hard for TypeRex to get the info it needs if the file doesn't compile (and hence doesn't generate a typed AST). |
Comment author: @protz I would like to point out that the error-reporting mechanism of clang has been widely praised in every single presentation/report I've seen about clang; and that's one of the reasons why people are switching from gcc to clang. Moreover, the patch is really small and self-contained, I think this is a low-cost / big-win kind of bug. |
Comment author: meyer I think with the patch that Gabriel is proposing has my positive vote. The patch itself is not too destructive and improves a lot user experience so it's a step in the right direction. From the side note there are other things in Clang toolchain worth to look at, also Mlton has nice schemes for typing problems. |
Comment author: @alainfrisch I haven't looked at the implementation, but in general, I believe that improvements to error messages should go to the compiler, not an external tool, even if technically possible. |
Comment author: pilki I confirm that some are using Clang exclusively for error reporting. Also, doing this kind of things in the type checker itself allows some clever things:
|
Comment author: @alainfrisch This is probably not necessary, but there are simple ways to improve performance:
|
Comment author: @gasche I had tested that matrix allocation is not the performance bottleneck (by measuring the time of the allocation alone), so I intentionally neglected the first optimization. I have not experimented with the two others, and I agree this may be worth it (but I'm not sure how to test them reliably). I'm ready to experiment with this afterwards if the feature gets accepted. |
Comment author: @gasche I have slightly modified the patch to take into account:
I have also attached a second patch providing a testsuite for the edit_distance function. |
Comment author: @alainfrisch Thanks! I've applied the patch to the trunk (commit 13018). |
Comment author: @gasche Thanks Alain, I will be happy to make typos in the next version of OCaml... Maybe you could also apply the patch that adds the test to the testsuite? It's not world-changing but could provide regression test against very imaginary future changes to the edit function (eg. for performance reasons). |
Comment author: @alainfrisch Ok, I've committed the test to the trunk as well (commit 13021). |
Original bug ID: 5768
Reporter: @gasche
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:08:16Z)
Resolution: fixed
Priority: normal
Severity: feature
Target version: 4.01.0+dev
Fixed in version: 4.01.0+dev
Category: ~DO NOT USE (was: OCaml general)
Monitored by: meyer @ygrek @hcarty @alainfrisch
Bug description
The Clang compiler uses spell-checking techniques ( http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker ) :
t.c:4:13: error: no member named 'st_blocksize' in 'struct stat'; did you mean 'st_blksize'?
I think this feature is cool and could be useful as well in the OCaml compiler. I have attached a patch that implements this.
Some examples:
mods;;
Error: Unbound value mods
Did you mean mod or modf?
List.lengt;;
Error: Unbound value List.lengt
Did you mean length?
open List;;
lengt;;
Error: Unbound value lengt
Did you mean length?
{content = 1; foo = 2};;
Error: Unbound record field label content
Did you mean contents?
(fun (x : int lis) -> x);;
Error: Unbound type constructor lis
Did you mean list?
Lis.lengt;;
Error: Unbound module Lis
Did you mean List?
Would you be ready to integrate such a feature? If so, which technical changes do you think would be necessary for the patch to be acceptable?
Additional details
The patch was designed to be non-invasive, and only modifies the final error reporting function in typetexp.ml; it does not modify the semantics of the type-checker at all. I still needed to convey the current environment to the point of error reporting, so the Unbound_foo exceptions have changed to take an additional parameter.
Is it ok to change the exception definitions? (May user code be affected by such a change?). If this is unacceptable, it would be possible to change the patch to use an ugly global variable.
The last example (Lis.lengt) exhibits a slight downside of the feature: while there are two typos in the complete path, it will only report the first one. This is not due to the patch itself, but to the fact that it is grafted onto the error reporting function, after the "unbound name narrowing" of narrow_unbound_lid_error has taken place, rather than at the exception raising site. I think this is the right choice.
The heuristics of the spell-checking (when is another word considered a good suggestion?) can be fine-tuned. Currently, the patch uses the Damerau-Levensthein distance ( http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance : an edit distance taking swapping of adjacent letters into account ), with a low cutoff designed to detect only obvious typos: it does no correction for identifiers of size below 2, look at distance 1 at most for below 4, distance 2 for sizes below 6 and distance 3 for larger identifiers. It returns the set of words found in the environment with the minimal distance, if it is smaller than the distance limit.
I have done some performance testing. For the real-world cases I could try the suggestion is instantaneous. I have included in the patch a test generator used to creates code with 1000 and 500_000 identifiers unqualified in the environment. With 1000 identifiers (to put it in perspective, Pervasives exports around 150 identifiers and the modules of the OCaml compiler itself have at most around 500 identifiers in the global scope), spell-checking takes 0.12s with ocamlc and 0.015s with ocamlc.opt. For 500_000 identifiers it takes 5s with ocamlc and 0.8s with ocamlc.opt. Given that this computation is only done in the case of an Unbound_foo error, I think performance is not an issue -- if it was considered to be, I could add a flag to disable the feature. The spell-checking runs after the preliminary "Error: Unbound ..." has been flushed, so even in the case of a particularly long delay the user would have the choice to interrupt the computation (as there is nothing done afterwards); I have tested that this works in the toplevel as well (without exiting the toplevel).
The patch currently doesn't handle the "Unbound type variable" error, as it relies on a different form of environment and would therefore make the code a bit more complex. I'm ready to add this as a separate patch afterwards.
The patch also does not rely on any form of type information, only names bound in the environment. One could dream of suggestions refined by type-checking, but this is outside the scope of the current, modest implementation.
File attachments
The text was updated successfully, but these errors were encountered: