Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support docstrings in the OCaml parser #6811

Closed
vicuna opened this issue Mar 12, 2015 · 29 comments
Closed

Support docstrings in the OCaml parser #6811

vicuna opened this issue Mar 12, 2015 · 29 comments
Assignees
Milestone

Comments

@vicuna
Copy link

vicuna commented Mar 12, 2015

Original bug ID: 6811
Reporter: @alainfrisch
Assigned to: @alainfrisch
Status: closed (set by @alainfrisch on 2015-06-30T08:30:13Z)
Resolution: won't fix
Priority: normal
Severity: feature
Target version: 4.03.0+dev / +beta1
Category: ~DO NOT USE (was: OCaml general)
Monitored by: @trefis @diml @jmeber @hcarty @dbuenzli

Bug description

This is a proposal to push documentation strings (currently extracted from comments by ocamldoc) into the actual language grammar.

The idea is to turn those documentation strings into attributes attached to specific nodes in the Parsetree, so that the rest of the toolchain can easily extract them. In particular, external documentation generators such as codoc could get the docstrings from .cmti/.cmt/.cmi files.

The branch at:

http://caml.inria.fr/cgi-bin/viewvc.cgi/ocaml/branches/doc2attr/

implements the following:

  • A new -strict-doc command-line flag, which enables a new mode.

  • In this mode, comments starting with (** or (*** are recognized by the lexer and kept in the stream of tokens on which the grammar is defined. In other words, these comments are no longer comments, since they contribute to the grammar.

  • (**....) is treated as being equivalent to [@@@ocaml.doc "...."] (or more exactly [@@@ocaml.doc {|doc|....|doc|}] to keep an hint of the original syntax), i.e. this creates a floating attribute.

  • (**....*) is treated as being equivalent to [@@ocaml.doc "...."], i.e. this creates an item-level attribute.

  • Item-level attributes (including (**....*)) are allowed after each record field declaration before and/or after the semi-colon that follows it:

    type t = { x : int (** X ); y : int }
    type t = {
    x : int; (
    * X *)
    y : int;
    }

  • A new syntax is introduced for sum types:

    type t = [ A | B ]

    type t = [
    | A
    | B
    ]

    When using this syntax, item-level attributes are allowed after each constructor (before the next token "|" or "]"), so:

    type t = [ A (** Doc for A ) | (* Doc for B ) ]
    (
    * Doc for t *)

  • There is currently no short notation for docstrings on constructors that sum type that use the "legacy" syntax for sum types (without brackets). It is of course possible to use standard attributes:

    type t = A [@doc "Doc for A"] | B ...

    and it would be possible to introduce another form of docstring syntax such as (| ...*), but I don't see the benefits compared to adding brackets to the declaration and use the more usual ( ... *).

  • Compared to ocamldoc rules, this proposal only supports postfix placement for docstrings on value/type/... declarations.

The interesting bits of the patch (i.e. everything except the plumbing for the command-line option):

http://caml.inria.fr/cgi-bin/viewvc.cgi/ocaml/branches/doc2attr/parsing/parser.mly?r1=15904&r2=HEAD

http://caml.inria.fr/cgi-bin/viewvc.cgi/ocaml/branches/doc2attr/parsing/lexer.mll?r1=15904&r2=HEAD

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

For those already using the postfix placement for their ocamldoc comments, switching to this system essentially amounts to:

  • Switching from (...) to (...*) for floating comments.
  • Switching from "type t = A | B" to "type t = [A | B]" for sum types with doc comments on constructors.

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @dbuenzli

While I see the rationale for the (** and (*** notations w.r.t. the nice mapping to equivalent @@, @@@, being someone who writes comment postfix and a lot of floating comments for library background documentation and sample code, I don't really like it, for the following reasons.

It worries me that I lose one column, it may not seem much but often you are already indented in a submodule, so it gives you a narrower text block for writing your doc string. It will also make the patches for switching to this system quite heavy and non-automatisable with the current tools (ocp-indent indents, it doesn't line-wrap).

Currently for val constructs the start of the doc string perfectly aligns with the identifier, which from a page design layout perspective, makes it a quieter an unbusy layout:

val my_function
(** The documentation bla bla.
bli bla blo *)

(You can't see the alignement here as no fixed-width font is being used)

leading to more pleasant source reading and scanning. (This property doesn't hold for module and type constructs but those definitions are less widespread, so the rhythm breaks are less perceptible).

Couldn't we rather find another two character indicator for floating comments ? I'd propose (> (was tempted by (? but found it too busy in practice).

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

Thanks Daniel for your comments. Do you have a general feeling about going into such a direction (probably with some temporary legacy mode to support ocamldoc rules)?

I vaguely follow your reasoning for floating comments, but how is this related to documenting "val" declaration (these are not floating comments)?

As far as I know, floating comments are used for two different kinds of things:

  • Identifying sections (** {2 My Section} *) or providing a short summary for the whole unit.

    ---> These are typically one liners, and shouldn't suffer too much from (**...)

  • Providing a kind of general introduction to a piece of code (typically at the beginning of a signature).

    ---> These are typically quite long, so what about starting the text on the next line, as in:

    (***
      This module blabla
      blabalablabla.
     *)
    

    I've also seen this style (I assume that ocamldoc deals with it correctly):

    (***
     * This module blabla
     * blabalablabla.
     *)
    

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

I see in some of your code that you combine the section and "introduction text" into a single comment:

(** Streaming XML IO.

A well-formed sequence of {{:#TYPEsignal}signals} represents an
{{:http://www.w3.org/TR/REC-xml}XML} document tree traversal in
... *)

Is there a difference in terms of HTML output with:

(** Streaming XML IO. *)

(**
A well-formed sequence of {{:#TYPEsignal}signals} represents an
{{:http://www.w3.org/TR/REC-xml}XML} document tree traversal in
...
*)

or is it more a matter of readability of the source code?

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @dbuenzli

Sorry for the confusion about val, I'm not up-to date with the different attribute forms and placement rules. I now realize that vals would still be commented with (** which makes me entirely happy; what you suggest regarding *** seems reasonable.

Personally I have no special feeling about this in the sense that from a documentation writing perspective almost nothing change and as long as the documentation gets nicely output and accessed everything is fine by my book... So I'll leave it up to the more knowledgable documentation tooling and compiler front-end hackers to comment on the implications.

One thing I dislike though is the new command line flag, I would either want to avoid it altogether (what would be the implications be, non output of floating comments ?) or if we absolutely need to maintain multiple modes I'd rather have a flag to enable the non-strict behaviour. The rationale here is that this new mode doesn't break anything, it doesn't break compilation and it doesn't break ocamldoc. From a long term perspective it will enable more transitions to strict mode: when you'll see that the doc doesn't get output, you'll have two choices either put the flag in your build system or fix your sources, given the state of our build systems, programmers will prefer to fix their sources...

From a language perspective I do like the unification of the regular variant syntax with the polyvar one.

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

Actually, the new mode breaks existing code. The point is that (...) and (...) comments are now considered as real tokens that need to be eaten by the parser, otherwise they result in parse errors. And they are recognized only at very specific places. For instance, (**...) on a constructor (with the "old" syntax for sum types) isn't allowed. Idem for (**...*) at the beginning of a file.

Perhaps there could be some nice error recovery to automatically turn a parse error caused by an unexpected (...) or (...*) token into a warning (and resume parsing by ignoring it). I don't know if this is possible with ocamlyacc. If this require adding explicit error cases in all grammar entries, this doesn't seem realistic but maybe there are other ways.

If this is accepted, it would make sense to adapt ocamldoc to support the new rules so that people can start using the new mode without having to switch to codoc or another new tool.

It would also make sense to provide some support for "legacy" rules in codoc, by adding a pass in the compiler (enabled when -strict-doc is disabled) that turns ocamldoc comments (with their less rigid rules) into the same attributes. This will allow to start using codoc on code bases that did not switch to the new strict mode. Ideally, this legacy pass would be implemented in a way which is not intrusive in the parser itself, i.e. more like #51 (without parsing the content of comments) than like #149 .

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @dbuenzli

RE difference in HTML output, I don't exactly know but I certainly prefer the first form from a readability point of view. ocamldoc always had strange behaviour regarding html output.

For example:

(** The module synopsis for M.

This is the brief about module M.*)

module M : sig
end

output this:

module M : sig ... end
The module synopsis for M

but this:

module M : sig
(** The module synopsis for M.

This is the brief about module M*)

end

outputs this:

module M : sig ... end

(synopsis is missing). This is certainly a bug but I failed to report it after all this years.

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @dbuenzli

Ah, didn't get the token thing. I'm not sure it's a good thing that comments somehow can now make parsing fail.

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

The thing is that they are really not comments any more. Comment start are to be defined by a regexp like

"(*" [^'*']

I'm not sure that this is bad enough to justify choosing another syntax.

Section "12.4.4 Error handling" of the OCaml manual is quite promising about providing automatic error recovery that would discard badly placed (**...*) tokens (with a warning).

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @dbuenzli

Especially when you are playing with code/moving things around, I don't think you want to be bothered by documentation placement rules. This could render the system more annoying than useful.

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

I'm looking at how to turn a parse error caused by (**...*) into a warning (and simply drop this token).

Currently, I managed to handle the case where (...*) itself causes the parse error, as in "let x = ( ) 1 + 2" (since (**...) is not allowed after "let x =". The case of "let x = 1 (** *) + 2" is more difficult, since the error is detected only at the "+" token.

At worst, one could always detect the condition and restart completely parsing in legacy mode.

@vicuna
Copy link
Author

vicuna commented Mar 12, 2015

Comment author: @alainfrisch

Ok, so I've committed the following logic: in case of a syntax error in strict-doc mode, remember the location of the last doc token and try again parsing in legacy mode; if this succeeds, report a warning on the last doc token. This means that one get at most one warning per unit. I also put in the commit some info about how to do something more fine-grained, to ignore doc tokens that cause an immediate error. And one could enrich the yacc grammar here and there with explicit rules to support ill-located doc tokens.

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @alainfrisch

I've committed to the branch some iterative logic to report all ill-placed documentation comments (at least it returns a set of warnings such that removing all reported doc comments produces a source file which is valid w.r.t. strict placement rules; and it tries to return a minimal set satisfying this property).

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @dbuenzli

Btw. an unexpected side effect of having documentation in the language is that they become part of the API as far as the compilers are concerned, I'm not sure this is a good idea (See yallop/ocaml-ctypes#283, for an example of that). I don't know how cmi hashes are derived but I think it would be wise to apply them only API definitions.

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @alainfrisch

Btw. an unexpected side effect of having documentation in the language is that > they become part of the API as far as the compilers are concerned

This fact is also shared by #149 .

One could certainly "clean up" the signature stored in .cmi files to remove doc attributes from them. Robust documentation tools would need to read .cmti files anyway. But still, keeping docs in .cmi files has some useful applications: degraded mode when .cmti files are not available; and perhaps more importantly in the long term, allowing to show the documentation in the toplevel (e.g. with a "#doc" directive, or as an addition to the existing "#show" directive).

What's wrong with keeping documentation comments in .cmi files? Are you concerned that changing documentation text only can force recompiling more files?

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @dbuenzli

What's wrong with keeping documentation comments in .cmi files? Are you concerned that changing documentation text only can force recompiling more files?

Didn't think about that. But you can't document the same interface twice differently (see the issue I mentioned). I don't think that linkers should be concerned about what documentation you have in your interface (that idea already cost more than an hour of investigations on sunday, not counting Jeremy's time who was clever enough to think about that).

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @lpw25

This problem is shared with attributes in general. Ideally, the cmi digest would be calculated with the attributes stripped from the interface, but that would mean marshaling the interface twice. Doing this would also allow locations to always be stored in cmi files (without needing "-keep-locs").

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @dbuenzli

@lpw25 As usual I have not idea of the low-level details (which doesn't mean they are not important). But I'll provocatively ask are we living in a nominal or structural type system ? I have the feeling that it boils down to this.

@vicuna
Copy link
Author

vicuna commented Mar 16, 2015

Comment author: @dbuenzli

I have the feeling that it boils down to this.

Well, leaving cmx files aside...

@vicuna
Copy link
Author

vicuna commented Mar 17, 2015

Comment author: @dbuenzli

Hum, that's actually not the right question. But still I really don't like this idea that your documentation define your interface.

@vicuna
Copy link
Author

vicuna commented Mar 17, 2015

Comment author: @alainfrisch

Attributes on external declarations slightly affect how the type-checker behaves. Currently, this is the case for the ocaml.deprecated attribute, but more are considered (e.g. an ocaml.since attribute). And the toplevel could very well be extended with a directive to display doc.

I'm not shocked in principle by the fact that a .cmi interface contains more than just the module signature (and that its hash includes extra information).

Do you really have cases where you compile two variants of foo.mli with different content from the documentation and expect compatible .cmi files?

@vicuna
Copy link
Author

vicuna commented Mar 17, 2015

Comment author: @dbuenzli

Do you really have cases where you compile two variants of foo.mli with different content from the documentation and expect compatible .cmi files?

Not at the moment, but I see myself doing more and more backend linking tricks (especially because of jsoo) and I don't think it's a good idea to render this pattern brittle to comments in your file, especially if this file may be duplicated in different repos.

Hashes are supposed to provide us with a way to detect binary compatibility, I fail to see how this is related to documentation.

@vicuna
Copy link
Author

vicuna commented Mar 17, 2015

Comment author: @alainfrisch

binary compatibility

What is binary compatibility?

One could argue for instance that the hash should identify signatures up to equivalence (with identity coercion) in the module type system. Then the hash should be same for "type t = int type s = string" and "type s = string type t = int" (reordering); or for "type t = int type s = t" and "type t = int type s = int" (expanding aliases). This wouldn't affect type soundness.

The current definition of the hash is much stronger than that, and I have never found a case where this is problematic (even with -keep-locs).

Can you explain in more details what caused yallop/ocaml-ctypes#283 ?

@vicuna
Copy link
Author

vicuna commented Mar 17, 2015

Comment author: @dbuenzli

What is binary compatibility?

Good question, I'm not sure I have the answer, but at least I can tell for sure that I don't find documentation to be part of it. When the compiler tells me that two modules make different assumptions about an interface I don't have the feeling that it should be concerned by documentation...

Can you explain in more details what caused [...]

The library has different implementations of foreign.ml which are selected according to the context (e.g. one for multithreaded programs and another for single threaded programs). When you compile things you compile them against a single cmi, if you hide cmx files at that moment this allows you to link against any of the backends.

So Jeremy had two copies of foreign.mli in his repo and those where out of sync on a single comment (see the fix yallop/ocaml-ctypes@9928d4f) which resulted in a different .cmi hash used by the different backends. But he would still install a single cmi file to compile your file against assuming both backend had the same cmi hash. So when you wanted to use one or the other backend it would fail.

Now we can argue that he shouldn't have a copy of the file and rather have a single mli file and used symlinks or use a build system that doesn't depend on the structure of your file hierarchies. But I still think that making the system sensitive to these kind of things makes the whole system less usable; in trying to check interface assumptions the compilers should only be concerned about what concerns them to perform their job and I don't feel that documentation is part of that.

@vicuna
Copy link
Author

vicuna commented Mar 17, 2015

Comment author: @lpw25

Attributes on external declarations slightly affect how the type-checker behaves. Currently, this is the case for the ocaml.deprecated attribute, but more are considered (e.g. an ocaml.since attribute).

I think that such attributes should be included directly in the interface. For example, there should be a since field in Types.value_description containing the information from the ocaml.since attribute. Attributes should just be the syntax for adding these things, not their internal representation.

@vicuna
Copy link
Author

vicuna commented Mar 18, 2015

Comment author: @damiendoligez

dbuenzli wrote:

From a language perspective I do like the unification of the regular variant syntax with the polyvar one.

I don't think it's a good idea. When beginners start writing

type t = [ A of int | `B of string ]

you will have to explain to them that a variant type definition and a polymorphic variant type definitions are two completely different things that happen to use the same syntax. We have too much of that in the language already.

On the other hand, if you change these brackets into parentheses, you get a nice parallel with the handling of attributes in products:

type t = int * (int [@attr foo])
type t = (int * int) [@attr foo]

@vicuna
Copy link
Author

vicuna commented Mar 18, 2015

Comment author: @damiendoligez

@Frisch:

It would also make sense to provide some support for "legacy" rules in codoc, by [...]

Don't use the conditional, compatibility mode is a must if you want a smooth transition from ocamldoc.

Also, what do you make of doc comments in .ml files? A function's documentation is usually right before the function's definition.

@vicuna
Copy link
Author

vicuna commented Mar 18, 2015

Comment author: @alainfrisch

Don't use the conditional, compatibility mode is a must if you want a smooth transition from ocamldoc.

Yes, but this can be done outside the compiler, e.g. with a -pp flag, or with some support in codoc that emulates the current behavior of ocamldoc. And since ocamldoc won't die anytime soon anyway (at least I hope), one could imagine that each project switch at some point from ocamldoc to codoc (and rewrite the doc in their mli at this point), which is another way to address the transition.

Also, what do you make of doc comments in .ml files?
A function's documentation is usually right before the function's
definition.

Personally I don't care about a tool that generates nice HTML for implementation code. My guess is that the community is largely uninterested in such a feature (although you'll always find people for who this would be a nice addition), but I can of course be completely wrong.

@vicuna
Copy link
Author

vicuna commented Jun 30, 2015

Comment author: @alainfrisch

Support for parsing comments in the main parser (using ocamldoc rules) has been contributed by lpw25.

@vicuna vicuna closed this as completed Jun 30, 2015
@vicuna vicuna added the wontfix label Mar 14, 2019
@vicuna vicuna added this to the 4.03.0 milestone Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants