Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768

Closed
vicuna opened this issue Oct 1, 2012 · 11 comments
Assignees
Milestone

Comments

@vicuna
Copy link

vicuna commented Oct 1, 2012

Original bug ID: 5768
Reporter: @gasche
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:08:16Z)
Resolution: fixed
Priority: normal
Severity: feature
Target version: 4.01.0+dev
Fixed in version: 4.01.0+dev
Category: ~DO NOT USE (was: OCaml general)
Monitored by: meyer @ygrek @hcarty @alainfrisch

Bug description

The Clang compiler uses spell-checking techniques ( http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker ) :

t.c:4:13: error: no member named 'st_blocksize' in 'struct stat'; did you mean 'st_blksize'?

I think this feature is cool and could be useful as well in the OCaml compiler. I have attached a patch that implements this.

Some examples:

mods;;

Error: Unbound value mods
Did you mean mod or modf?

List.lengt;;

Error: Unbound value List.lengt
Did you mean length?

open List;;

lengt;;

Error: Unbound value lengt
Did you mean length?

{content = 1; foo = 2};;

Error: Unbound record field label content
Did you mean contents?

(fun (x : int lis) -> x);;

Error: Unbound type constructor lis
Did you mean list?

Lis.lengt;;

Error: Unbound module Lis
Did you mean List?

Would you be ready to integrate such a feature? If so, which technical changes do you think would be necessary for the patch to be acceptable?

Additional details

The patch was designed to be non-invasive, and only modifies the final error reporting function in typetexp.ml; it does not modify the semantics of the type-checker at all. I still needed to convey the current environment to the point of error reporting, so the Unbound_foo exceptions have changed to take an additional parameter.
Is it ok to change the exception definitions? (May user code be affected by such a change?). If this is unacceptable, it would be possible to change the patch to use an ugly global variable.

The last example (Lis.lengt) exhibits a slight downside of the feature: while there are two typos in the complete path, it will only report the first one. This is not due to the patch itself, but to the fact that it is grafted onto the error reporting function, after the "unbound name narrowing" of narrow_unbound_lid_error has taken place, rather than at the exception raising site. I think this is the right choice.

The heuristics of the spell-checking (when is another word considered a good suggestion?) can be fine-tuned. Currently, the patch uses the Damerau-Levensthein distance ( http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance : an edit distance taking swapping of adjacent letters into account ), with a low cutoff designed to detect only obvious typos: it does no correction for identifiers of size below 2, look at distance 1 at most for below 4, distance 2 for sizes below 6 and distance 3 for larger identifiers. It returns the set of words found in the environment with the minimal distance, if it is smaller than the distance limit.

I have done some performance testing. For the real-world cases I could try the suggestion is instantaneous. I have included in the patch a test generator used to creates code with 1000 and 500_000 identifiers unqualified in the environment. With 1000 identifiers (to put it in perspective, Pervasives exports around 150 identifiers and the modules of the OCaml compiler itself have at most around 500 identifiers in the global scope), spell-checking takes 0.12s with ocamlc and 0.015s with ocamlc.opt. For 500_000 identifiers it takes 5s with ocamlc and 0.8s with ocamlc.opt. Given that this computation is only done in the case of an Unbound_foo error, I think performance is not an issue -- if it was considered to be, I could add a flag to disable the feature. The spell-checking runs after the preliminary "Error: Unbound ..." has been flushed, so even in the case of a particularly long delay the user would have the choice to interrupt the computation (as there is nothing done afterwards); I have tested that this works in the toplevel as well (without exiting the toplevel).

The patch currently doesn't handle the "Unbound type variable" error, as it relies on a different form of environment and would therefore make the code a bit more complex. I'm ready to add this as a separate patch afterwards.

The patch also does not rely on any form of type information, only names bound in the environment. One could dream of suggestions refined by type-checking, but this is outside the scope of the current, modest implementation.

File attachments

@vicuna
Copy link
Author

vicuna commented Oct 1, 2012

Comment author: @damiendoligez

This has been proposed before and I think the consensus was that it's not a job for the compiler, but rather for some development environment like TypeRex. OTOH, it's hard for TypeRex to get the info it needs if the file doesn't compile (and hence doesn't generate a typed AST).

@vicuna
Copy link
Author

vicuna commented Oct 1, 2012

Comment author: @protz

I would like to point out that the error-reporting mechanism of clang has been widely praised in every single presentation/report I've seen about clang; and that's one of the reasons why people are switching from gcc to clang. Moreover, the patch is really small and self-contained, I think this is a low-cost / big-win kind of bug.

@vicuna
Copy link
Author

vicuna commented Oct 1, 2012

Comment author: meyer

I think with the patch that Gabriel is proposing has my positive vote. The patch itself is not too destructive and improves a lot user experience so it's a step in the right direction.

From the side note there are other things in Clang toolchain worth to look at, also Mlton has nice schemes for typing problems.

@vicuna
Copy link
Author

vicuna commented Oct 1, 2012

Comment author: @alainfrisch

I haven't looked at the implementation, but in general, I believe that improvements to error messages should go to the compiler, not an external tool, even if technically possible.

@vicuna
Copy link
Author

vicuna commented Oct 1, 2012

Comment author: pilki

I confirm that some are using Clang exclusively for error reporting.

Also, doing this kind of things in the type checker itself allows some clever things:

  • type aware suggestion: if c is a record with a field foobar, and there is a variable foocar, when the user types c.foolar, you can suggest only foobar. Same for a module, or even more clever things.
  • if there is only one suggestion, the type checking can keep going after replacing the identifier with the suggested one to report more errors in one pass.

@vicuna
Copy link
Author

vicuna commented Oct 2, 2012

Comment author: @alainfrisch

This is probably not necessary, but there are simple ways to improve performance:

  • Do not allocate a matrix: keeping the last two rows (and the current one) is enough. For repeated distance calculation with a fixed target, these rows can be allocated once and for all. (If the function returns a cutoff + 1 instead of None, one can arrange so that the lookup for suggestions does not allocate at all.)

  • Drop the longest common suffix and the longest common prefix from the two strings.

  • Use the current best distance as a cutoff.

@vicuna
Copy link
Author

vicuna commented Oct 2, 2012

Comment author: @gasche

I had tested that matrix allocation is not the performance bottleneck (by measuring the time of the allocation alone), so I intentionally neglected the first optimization. I have not experimented with the two others, and I agree this may be worth it (but I'm not sure how to test them reliably). I'm ready to experiment with this afterwards if the feature gets accepted.

@vicuna
Copy link
Author

vicuna commented Oct 3, 2012

Comment author: @gasche

I have slightly modified the patch to take into account:

  • an error in the handling of cutoff reported by 'yoch'
  • some protection against overflows if "max_int" is used as cutoff
  • some style comments by Damien

I have also attached a second patch providing a testsuite for the edit_distance function.

@vicuna
Copy link
Author

vicuna commented Oct 16, 2012

Comment author: @alainfrisch

Thanks!

I've applied the patch to the trunk (commit 13018).

@vicuna
Copy link
Author

vicuna commented Oct 16, 2012

Comment author: @gasche

Thanks Alain, I will be happy to make typos in the next version of OCaml...

Maybe you could also apply the patch that adds the test to the testsuite? It's not world-changing but could provide regression test against very imaginary future changes to the edit function (eg. for performance reasons).

@vicuna
Copy link
Author

vicuna commented Oct 16, 2012

Comment author: @alainfrisch

Ok, I've committed the test to the trunk as well (commit 13021).

@vicuna vicuna closed this as completed Dec 11, 2015
@vicuna vicuna added this to the 4.01.0 milestone Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants