[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768

vicuna · 2012-10-01T10:21:53Z

Original bug ID: 5768
Reporter: @gasche
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2015-12-11T18:08:16Z)
Resolution: fixed
Priority: normal
Severity: feature
Target version: 4.01.0+dev
Fixed in version: 4.01.0+dev
Category: ~DO NOT USE (was: OCaml general)
Monitored by: meyer @ygrek @hcarty @alainfrisch

Bug description

The Clang compiler uses spell-checking techniques ( http://blog.llvm.org/2010/04/amazing-feats-of-clang-error-recovery.html#spell_checker ) :

t.c:4:13: error: no member named 'st_blocksize' in 'struct stat'; did you mean 'st_blksize'?

I think this feature is cool and could be useful as well in the OCaml compiler. I have attached a patch that implements this.

Some examples:

mods;;

Error: Unbound value mods
Did you mean mod or modf?

List.lengt;;

Error: Unbound value List.lengt
Did you mean length?

open List;;

lengt;;

Error: Unbound value lengt
Did you mean length?

{content = 1; foo = 2};;

Error: Unbound record field label content
Did you mean contents?

(fun (x : int lis) -> x);;

Error: Unbound type constructor lis
Did you mean list?

Lis.lengt;;

Error: Unbound module Lis
Did you mean List?

Would you be ready to integrate such a feature? If so, which technical changes do you think would be necessary for the patch to be acceptable?

Additional details

The patch was designed to be non-invasive, and only modifies the final error reporting function in typetexp.ml; it does not modify the semantics of the type-checker at all. I still needed to convey the current environment to the point of error reporting, so the Unbound_foo exceptions have changed to take an additional parameter.
Is it ok to change the exception definitions? (May user code be affected by such a change?). If this is unacceptable, it would be possible to change the patch to use an ugly global variable.

The last example (Lis.lengt) exhibits a slight downside of the feature: while there are two typos in the complete path, it will only report the first one. This is not due to the patch itself, but to the fact that it is grafted onto the error reporting function, after the "unbound name narrowing" of narrow_unbound_lid_error has taken place, rather than at the exception raising site. I think this is the right choice.

The heuristics of the spell-checking (when is another word considered a good suggestion?) can be fine-tuned. Currently, the patch uses the Damerau-Levensthein distance ( http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance : an edit distance taking swapping of adjacent letters into account ), with a low cutoff designed to detect only obvious typos: it does no correction for identifiers of size below 2, look at distance 1 at most for below 4, distance 2 for sizes below 6 and distance 3 for larger identifiers. It returns the set of words found in the environment with the minimal distance, if it is smaller than the distance limit.

I have done some performance testing. For the real-world cases I could try the suggestion is instantaneous. I have included in the patch a test generator used to creates code with 1000 and 500_000 identifiers unqualified in the environment. With 1000 identifiers (to put it in perspective, Pervasives exports around 150 identifiers and the modules of the OCaml compiler itself have at most around 500 identifiers in the global scope), spell-checking takes 0.12s with ocamlc and 0.015s with ocamlc.opt. For 500_000 identifiers it takes 5s with ocamlc and 0.8s with ocamlc.opt. Given that this computation is only done in the case of an Unbound_foo error, I think performance is not an issue -- if it was considered to be, I could add a flag to disable the feature. The spell-checking runs after the preliminary "Error: Unbound ..." has been flushed, so even in the case of a particularly long delay the user would have the choice to interrupt the computation (as there is nothing done afterwards); I have tested that this works in the toplevel as well (without exiting the toplevel).

The patch currently doesn't handle the "Unbound type variable" error, as it relies on a different form of environment and would therefore make the code a bit more complex. I'm ready to add this as a separate patch afterwards.

The patch also does not rely on any form of type information, only names bound in the environment. One could dream of suggestions refined by type-checking, but this is outside the scope of the current, modest implementation.

File attachments

vicuna · 2012-10-01T14:02:19Z

Comment author: @damiendoligez

This has been proposed before and I think the consensus was that it's not a job for the compiler, but rather for some development environment like TypeRex. OTOH, it's hard for TypeRex to get the info it needs if the file doesn't compile (and hence doesn't generate a typed AST).

vicuna · 2012-10-01T14:38:59Z

Comment author: @protz

I would like to point out that the error-reporting mechanism of clang has been widely praised in every single presentation/report I've seen about clang; and that's one of the reasons why people are switching from gcc to clang. Moreover, the patch is really small and self-contained, I think this is a low-cost / big-win kind of bug.

vicuna · 2012-10-01T17:11:26Z

Comment author: meyer

I think with the patch that Gabriel is proposing has my positive vote. The patch itself is not too destructive and improves a lot user experience so it's a step in the right direction.

From the side note there are other things in Clang toolchain worth to look at, also Mlton has nice schemes for typing problems.

vicuna · 2012-10-01T17:35:41Z

Comment author: @alainfrisch

I haven't looked at the implementation, but in general, I believe that improvements to error messages should go to the compiler, not an external tool, even if technically possible.

vicuna · 2012-10-01T20:47:53Z

Comment author: pilki

I confirm that some are using Clang exclusively for error reporting.

Also, doing this kind of things in the type checker itself allows some clever things:

type aware suggestion: if c is a record with a field foobar, and there is a variable foocar, when the user types c.foolar, you can suggest only foobar. Same for a module, or even more clever things.
if there is only one suggestion, the type checking can keep going after replacing the identifier with the suggested one to report more errors in one pass.

vicuna · 2012-10-02T13:29:24Z

Comment author: @alainfrisch

This is probably not necessary, but there are simple ways to improve performance:

Do not allocate a matrix: keeping the last two rows (and the current one) is enough. For repeated distance calculation with a fixed target, these rows can be allocated once and for all. (If the function returns a cutoff + 1 instead of None, one can arrange so that the lookup for suggestions does not allocate at all.)
Drop the longest common suffix and the longest common prefix from the two strings.
Use the current best distance as a cutoff.

vicuna · 2012-10-02T14:02:25Z

Comment author: @gasche

I had tested that matrix allocation is not the performance bottleneck (by measuring the time of the allocation alone), so I intentionally neglected the first optimization. I have not experimented with the two others, and I agree this may be worth it (but I'm not sure how to test them reliably). I'm ready to experiment with this afterwards if the feature gets accepted.

vicuna · 2012-10-03T14:01:29Z

Comment author: @gasche

I have slightly modified the patch to take into account:

an error in the handling of cutoff reported by 'yoch'
some protection against overflows if "max_int" is used as cutoff
some style comments by Damien

I have also attached a second patch providing a testsuite for the edit_distance function.

vicuna · 2012-10-16T13:55:23Z

Comment author: @alainfrisch

Thanks!

I've applied the patch to the trunk (commit 13018).

vicuna · 2012-10-16T14:16:21Z

Comment author: @gasche

Thanks Alain, I will be happy to make typos in the next version of OCaml...

Maybe you could also apply the patch that adds the test to the testsuite? It's not world-changing but could provide regression test against very imaginary future changes to the edit function (eg. for performance reasons).

vicuna · 2012-10-16T17:12:08Z

Comment author: @alainfrisch

Ok, I've committed the test to the trunk as well (commit 13021).

vicuna closed this as completed Dec 11, 2015

vicuna added this to the 4.01.0 milestone Mar 14, 2019

vicuna assigned alainfrisch Mar 14, 2019

vicuna added the feature-wish label Mar 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768

[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 2, 2012

vicuna commented Oct 2, 2012

vicuna commented Oct 3, 2012

vicuna commented Oct 16, 2012

vicuna commented Oct 16, 2012

vicuna commented Oct 16, 2012

[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768

[patch] On "unbound identifier" errors, use spell-checking to suggest names present in the environment #5768

Comments

vicuna commented Oct 1, 2012

Bug description

mods;;

List.lengt;;

open List;;

lengt;;

{content = 1; foo = 2};;

(fun (x : int lis) -> x);;

Lis.lengt;;

Additional details

File attachments

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 1, 2012

vicuna commented Oct 2, 2012

vicuna commented Oct 2, 2012

vicuna commented Oct 3, 2012

vicuna commented Oct 16, 2012

vicuna commented Oct 16, 2012

vicuna commented Oct 16, 2012