Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autofocusing pattern on record field to simplify matching trees with meta-data #5667

Closed
vicuna opened this issue Jun 29, 2012 · 29 comments
Closed

Comments

@vicuna
Copy link

vicuna commented Jun 29, 2012

Original bug ID: 5667
Reporter: @alainfrisch
Status: acknowledged (set by @damiendoligez on 2012-07-17T12:35:57Z)
Resolution: open
Priority: normal
Severity: feature
Target version: later
Category: typing
Monitored by: @bobzhang tgazagna @lefessan @diml @ygrek lavi @jmeber "Julien Signoles" @hcarty @Chris00 @yakobowski

Bug description

A common idiom in OCaml when defining an AST is:

  type expr = {
       loc: ...;
       desc: expr_desc;
  } and desc =
    | Foo of ...
    | Bar of ...

i.e. wrap each node in a record type to attach some meta-data (like the location). With such a type definition, it becomes quickly cumbersome to write non-trivial patterns (e.g. to implement optimizations, or AST rewriting). This has even been used as an argument against extending some of the intermediate languages in OCaml itself with location information (to make locations available to gdb, IIRC).

One might want to look at "views" or "active patterns" as generic solutions to such problems. Here, I'd like to propose a little patch which implements the following ad hoc strategy: when a constructor pattern [p] is type-checked in a context where a record type is expected (this information is now propagated, as it is required by GADTs), and the record type has a single field [l] whose type is a variant, then the pattern is automatically wrapped into a pattern {l = p}. There are several ways to improve the cases covered by the strategy (e.g. consider only fields which are variant types that contain the specified constructor), and one could prefer to mark explicitly the record type declaration so as to enable automatically this "auto-focusing" hack (by specifying the field explicitly).

As an example, here is how a pattern matching can be simplified by this hack:

(* before *)

      | {pmod_desc = Pmod_apply(
         {pmod_desc = Pmod_apply(
          {pmod_desc = Pmod_apply(
           {pmod_desc = Pmod_ident {txt = Lident "IFDEF"}},
           {pmod_desc = Pmod_ident {txt = Lident sym}}
          )},
          body_def)},
         body_not_def)}


(* after *)

      | Pmod_apply
           (Pmod_apply
              (Pmod_apply(Pmod_ident (Lident "IFDEF"),Pmod_ident (Lident sym)),
               body_def), body_not_def)

The patch is quite ad hoc, but I believe it covers a lot of concrete cases. I don't expect the patch to be integrated in its current form, but hopefully this proposal will trigger some interesting discussions.

File attachments

@vicuna
Copy link
Author

vicuna commented Jun 30, 2012

Comment author: @garrigue

This makes sense (I suppose that many people already had this idea).
However, this would be a policy switch for ocaml.
Namely, in ocaml dynamic semantics never depend on the type system.
This is true for GADTs too: they only change the typing, not the dynamic behavior.
What you propose means that you automatically insert a field access, with the choice
of the field depending on its type.
This is a major change.

In some way, a mechanism of pattern macros would be less intrusive.

@vicuna
Copy link
Author

vicuna commented Jun 30, 2012

Comment author: @gasche

I was just going to mention the mechanism of pattern synonyms. When Conor McBride came to Paris a few months ago, he mentioned it as an improvement he suggested for Haskell, but that would also make sense for OCaml.

http://hackage.haskell.org/trac/haskell-prime/wiki/PatternSynonyms

The syntax he demonstrated was of the form:

pattern p x y z = Foo (x, [y], z)

(In his mind, p could then be used in both patterns and expressions, but using it only for patterns make it simpler.)

In this case, one could write:

pattern apply (e1, e2) = {pmod_desc = Pmod_apply(e1, e2); ...}

I think there is also value in Alain's proposition of type-directed code inference (here pattern inference), but maybe it should be a more syntactically explicit.

@vicuna
Copy link
Author

vicuna commented Jun 30, 2012

Comment author: @protz

FWIW, I often wished I was able to abstract on patterns. The code I'm writing right often performs the same set of patterns:

match t with
| A (B t')

I'd be happy to abstract that as:

pattern p (t) = A (B t)

and write

match t with
| p t' ->

@vicuna
Copy link
Author

vicuna commented Jul 2, 2012

Comment author: @alainfrisch

Jacques: the semantics of the language already depend on the type system. For instance in

module A = struct include (M : sig end) let () = print_endline "AAA" end

the semantics depends on the signature constraint. One can argue that there is some kind of elaboration semantics, but then it's hard to conclude that the semantics doesn't depend on the type system.

If we mark explicitly the record declaration with the field on which "auto-focusing" is enabled, and the elaboration keeps this mark "at runtime", the dynamic semantics is defined without further reference to type inference (a constructor pattern applied to a record value "reduces" to the same pattern applied to the marked field).

@vicuna
Copy link
Author

vicuna commented Jul 9, 2012

Comment author: @garrigue

Oops, I wrote an answer that was never sent.
At the module level it is clear that signature have a semantical meaning.
But in the core language type constraints are not supposed to change the dynamical behaviour.
Of course, you might come up with a proper dynamic semantics for focusing, not depending on typing :-)
And one could argue that if we add type witnesses this will change this a lot.

At the very least, I would suggest an explicit syntax:

 | Pmod_apply
       ({Pmod_apply
          ({Pmod_apply({Pmod_ident {Lident "IFDEF"}},{Pmod_ident {Lident sym}}),
           body_def)}, body_not_def)

I.e., this should be an extension of the record syntax, not of pattern matching in general.
And I agree that in this case, it does not depend on type inference (at least if all the record
fields have a concrete type, not just a type variable).

@vicuna
Copy link
Author

vicuna commented Jul 17, 2012

Comment author: @damiendoligez

Alain, the common idiom that you describe is probably not the best way to add location annotations.
It is probably better to add a location field to every constructor, just to avoid doubling the depth of every pattern.

Then you have to define a "location_of" function for every type, but we have good syntactic tools to do this kind of thing.

@vicuna
Copy link
Author

vicuna commented Jul 17, 2012

Comment author: @alainfrisch

Damien: the solution with intermediate record nodes to store extra information is wide-spread, and is used everywhere in the OCaml code base itself, as you know well. Adding extra arguments to all constructors requires to add extra wildcard patterns to many places, and is not very nice conceptually (if a concept applies to all kinds of nodes, it should be placed at the level of the node, not under all constructors).

Jacques: (I think you miss a closing } before body_def. And {...} around the whole pattern.) Yes, an explicit syntax might be worth the little syntactic overhead, I'm not sure (the extra {...} make the pattern somehow less readable). More important to me would be to make it explicit on the record definition site which field is subject to this trick.

@vicuna
Copy link
Author

vicuna commented Jul 17, 2012

Comment author: @gasche

Note that the "patterns synonyms" feature would be able to handle both style (factor our the record layer, or factor out the additional location field); if applied consistently they would even allow to change from one representation to the other without having to change the pattern-matching sites.

(I've mentioned this feature idea to Luc, and he seemed quite interested in it. I hope we can come up with a reasonable proposal some time after the release.)

@vicuna
Copy link
Author

vicuna commented Jul 17, 2012

Comment author: @alainfrisch

Gabriel: I'm eager to hear more about a proposal on pattern synonyms! Could you give an idea on how a declaration like:

pattern apply (e1, e2) = {pmod_desc = Pmod_apply(e1, e2); _}

would be used?

Something like

match e with
| apply (e1, e2) -> ...
| ...

?

This looks very nice. It'd be a little bit tedious to define an alias for each constructor, but it works. One also needs to be careful about constant alias (with no capture variable), in order to distinguish them from capture variables. (Maybe aliases should be uppercase identifiers?)

One could also allow calling arbitrary function and matching the result. This is a much more complex approach, of course. (The idea would be that a pattern like:

f p -> ...

would call function f and match its result with p.)

@vicuna
Copy link
Author

vicuna commented Jul 17, 2012

Comment author: @gasche

Yes, something along those lines -- I was hoping to smuggle a currified syntax for patterns, and the use site may get some syntactic noise to avoid ambiguities, but that's the idea.

Your latter proposal of applying a function (looks like view patterns) makes me suspicious. The problem with such proposals (also internalized "when" for example) is that the interaction between the pattern matching compilation and the fact of executing arbitrary user code are a very messy problem. For once, it's hard to specify if the function will be called at all, in which order and how many times. Even more problematic, the compiler may check the presence of some head constructor, then execute your function that would mutate the data by changing the head constructor, and then the optimized parameter access code would provoke a segfault.
I am planning for a minimal proposal rather than an ambitious and possibly problematic proposal.

You are right that using the pattern synonyms in the present use case (locations) involve some boilerplate. You have to choose between defining patterns for each use case, as in your "apply" example, or defining a more generic pattern

pattern desc p = {pmod_desc = p; pmod_loc = _loc}

match e with
| desc (Apply(e1, e2)) -> ...

which provides a much lesser syntactic gain (we could have essentially the same thing by choosing a shorter field name) in exchange for less boilerplate. Note, as an independent side remark, the implicit introduction of a "_loc" identifier that captures the innermost location, in the spirit of Camlp4 quotations.

I think having some boilerplate is a reasonable price to pay for the use of a relatively low-tech solution; it's certainly less sophisticated than type-based pattern inference as you suggest. Hopefully that mean we have less ways to get it wrong, an maybe it could be useful in more different situations -- but I readily admit that this nested record example is so far my only practical example to motivate the feature.

(The difficulties will be, I suppose, with whether we choose to expose it through module boundaries; conceptually we'd like to be able both to keep them abstract and publish only their "type" (matched type plus environment effect) or to publish their code to allow for more performant compilation. In practice I'm not sure the current pattern compilation techniques would handle abstract patterns, so they may be always-concrete for now.)

@vicuna
Copy link
Author

vicuna commented Jul 17, 2012

Comment author: @lpw25

As mentioned in the comments on feature request #5528, "inline records for constructor arguments" could also be a solution to the specific problem of location fields. If the inline record constructors were allowed to share fields, perhaps with an explicit syntax like:

type expr =
| Foo of { external loc: ...; ... }
| Bar of { external loc: ...; ... }

These could be pattern matched as Foo {...} and Bar {...}, but also have their loc fields accessed directly as e.loc.

@vicuna
Copy link
Author

vicuna commented Jul 18, 2012

Comment author: @alainfrisch

Gabriel: your arguments about side effects while running code during pattern matching already apply to lazy patterns. But it's certainly to be taken into account.

Patterns aliases are like syntactic macros; they could almost be implemented as a syntax extension (e.g. using the new -ppx flag :-)). For instance, an AST filter could detect (and remove) a declaration like:

let alias {pmod_desc = Pmod_apply(e1, e2); _} = Apply (e1, e2)

and use it to rewrite any pattern:

Apply (p1, p2)

into {pmod_desc = Pmod_apply(p1, p2); _}

A deeper support in the compiler is required to get better scoping, exporting to interfaces, etc. If we do that, this is no longer a really low-tech approach; at least, it's going to be more intrusive in the language definition than my proposal.

@vicuna
Copy link
Author

vicuna commented Jul 18, 2012

Comment author: @alainfrisch

Another variant of my proposal, committed to branch autofocus. It introduces an explicit syntax on the record declaration:

type t = {
...;
match x : s;
....;
}

(only one field can be marked as 'match', not checked for now)

When the type-checker encounters a pattern, with an expected type referencing such a record declaration with a 'match' field, and the pattern under consideration cannot be a record pattern (not a record, variable, alias, constraint, or-pattern), it focuses on the marked field.

This is more explicit than the original proposal, and also slightly more general since this is not restricted to sum types.

@vicuna
Copy link
Author

vicuna commented Jul 19, 2012

Comment author: @lefessan

@alain: Does the order of unification matters for your patch ?

If yes, shouldn't you add the restriction that 's' is not a polymorphic variable ?

Otherwise, does your patch correctly computes the correct type for both 'f1' and 'f2' in this example ?

type 'a ref = { match mutable content : 'a }

let f1 x =
if true then
match x with None | Some _ -> None
else x.content
let f2 x =
if true then x.content
else
match x with None | Some _ -> None

@vicuna
Copy link
Author

vicuna commented Jul 20, 2012

Comment author: @alainfrisch

@fabrice: yes, it does, unless you compile with -principal. The patch relies on the same type propagation technology as used for GADTs.

When type-checking without -principal, f1 is rejected and f2 is accepted. In -principal mode, both as rejected, as expected. Moreover, if you add the constraint (x : 'a option ref) on the argument, all four cases are accepted.

@vicuna
Copy link
Author

vicuna commented Aug 2, 2012

Comment author: @xavierleroy

Changing target version, as this is a new language feature with deep implications.

@vicuna
Copy link
Author

vicuna commented Sep 15, 2012

Comment author: @alainfrisch

Jacques: one more point about your proposal of an explicit syntax. I guess that the proposed syntax will cause conflicts in the yacc grammar, or will require significant effort. In particular, something starting with "{A1.A2.A3." could be either (1) a normal record pattern with a qualified first label, (2) a "focused" record pattern with a sub-pattern being qualified constructor name, or (3) a "record override" construction ("A1.A2.A3." being also a prefix of several kinds of valid expressions).

@vicuna
Copy link
Author

vicuna commented Sep 15, 2012

Comment author: @garrigue

Alain: Yes, I know that there is already a lot of things in record syntax.
However, this can be solved in a later phase.
The real question should be semantics.
Could there be a conflict with punning?

By the way, my understanding was that focusing was only for patterns,
so I would think that record override is irrelevant here.

@vicuna
Copy link
Author

vicuna commented Sep 15, 2012

Comment author: @alainfrisch

The real question should be semantics.

I don't see any real issue with semantics. For instance, in the variant of the proposal with the explicit 'match' annotation on the type declaration: when a record value is matched against a pattern of one of the authorized forms, matching continues with the designated field and the same pattern.

record override is irrelevant here.

Indeed!

@vicuna
Copy link
Author

vicuna commented Apr 4, 2013

Comment author: @alainfrisch

Making it easier to pattern match on Parsetree fragments will become more and more important as we move towards -ppx based preprocessors instead of Camlp4. Of course, the proposal is not specific to the OCaml Parsetree itself and would also benefit to other AST encodings using a similar representation.

Solutions which have been discussed in the thread:

  1. Original proposal: automatically focus on a record field (many possible variants: with/without an explicit annotation on the target field in the record type declaration; only for constructor patterns; etc).

  2. Jacques: use a new syntax on the record pattern to make the field implicit while keeping it explicit that we rely on this new feature.

  3. Gabriel: use pattern synonyms.

  4. Leo: use inline records (with one location field in each constructor).

I still believe that 1 is the less intrusive change to the compiler and syntax, and the most direct way to provide a good experience to people writing patterns on Parsetree fragments. I don't think it is a very deep philosophical switch on the semantical side, especially if we pick the variant with an explicit annotation on target field on the record type declaration and if we restrict focusing to the case where the pattern is a constructor. I also ok with Jacques' variant, but I'd then prefer a very light syntax. 3 and 4 require much more pervasive changes to the language definition and compiler and affect significantly the way the ASTs have to be defined (for 3, we need to define synonyms for each constructor, which mean picking new names, etc). 4 in particular does not scale well if many "satellite" (locations + attributes + ...) fields have to be added to each constructor. Moreover, doing 1 or 2 does not prevent from adding 3 or 4 to the language later.

For 2, I'd suggest to use a single-character prefix operator (for instance, "&") in front of the constructor pattern, as in:

  | &Pmod_apply
       (&Pmod_apply
          (&Pmod_apply(&Pmod_ident (&Lident "IFDEF"),&Pmod_ident (&Lident sym)),
           body_def), body_not_def)

Since we're making good progress on the proposal "-ppx + extension_points", it's a good time to resurrect this discussion... Please comment!

@vicuna
Copy link
Author

vicuna commented Apr 4, 2013

Comment author: @lpw25

  1. Leo: use inline records (with one location field in each constructor).

4 in particular does not scale well if many "satellite" (locations + attributes + ...) fields have to be added to each constructor

I don't see how it would not scale:

type expr =
| Foo of { foo_arg: t;
other_foo_arg: s;
external loc: Loc.t;
external attribute: Atrr.t }
| Bar of { bar_arg: r;
external loc: Loc.t;
external attribute: Attr.t; }

I think that these "tagged records" are a much more direct solution to the
problem, with a much better data layout than using "auto-focusing patterns"
with ordinary records.

The problem with data types like Parsetree.module_expr, which have different
"variants" that share some fields, is not just that pattern matching is a bit
verbose, but that they are not in the most appropriate data layout. Ordinary
variants would provide the best data layout, but then the shared fields can
only be accessed through unnecessarily expensive pattern matching.

"Tagged records" solve this problem by combining the best data layout (like
variants) with quick access to shared fields (like records).

@vicuna
Copy link
Author

vicuna commented Apr 4, 2013

Comment author: @gasche

Just a remark (I'm not weighting in for inline records, altough I do like this approach): I'm not sure those "external" qualifiers would be necessary for runtime performance. With GADTs I've already seen cases where access to fields went without runtime tag test. I wouldn't be surprised to hear that this case (access to the first argument present in all constructors) already generates optimal code. Otherwise it could probably be recognized and optimized, provided there is a convincing argument that this is actually important for performances.

@vicuna
Copy link
Author

vicuna commented Apr 4, 2013

Comment author: @alainfrisch

Leo: are you concerned with the runtime performance and memory usage? Because otherwise I don't see why "tagged record" is a more appropriate data layout. I'd argue quite the contrary, as it didn't make it explicit that all constructors share some common fields. Splitting the common part in a dedicated record and restricting the variable part to a inner sum type is a better separation of concerns than flattening everything so that it fits in a single memory block.

What's the "external" qualifier in your example? If it is supposed to tell the compiler that all constructors share the common part, it's kind of weird having to duplicate the field definition in all constructors and have the compiler check that they all match. If we start adding annotations to type definitions in order to help the compiler pick a good runtime representation, I'd rather do the opposite and mark the "description" field within the enclosing record as being "inlined".

But I'm not concerned with performance at all here, and the current Parsetree representation definition just seems more logical to me and it allows us to write cleaner code (in my opinion). For instance, we can expose a "maker" function which only takes the "description" and optional arguments for the other fields. And a generic "map" function does not have to copy all satellite fields for each constructors. The same technique also allows better code factorization of type definitions (as in "'a loc"). How would you deal with 'a loc with inline records? You'd have to extract its location field in the enclosing data structure, turning something like:

Record of (Longident.t loc * expression) list

into

Record of (Longident.t * Location.t * expression) list

or rather introduce an extra record:

Record of record_field list
...
and record_field =
{
label: Longident.t;
label_loc: Location.t;
content: expression;
}

But then pattern matching is not really made simpler, and you cannot pass the "Longident.t loc" as a single object, thus polluting a lot of code. (Applying the same inlining trick to "Longident.t loc" does not work because inner nodes don't have location; and this would not work for "string loc".)

Don't get me wrong: I'd love to get inline records (that's why I've proposed them!), including in the definition of OCaml's parsetree, but only to name constructor arguments, not to merge location and attribute fields in all constructors. I also agree that in some cases tweaking the type definition for performance reasons can be useful and that inline records can be a good tool, but this is a different topic.

@vicuna
Copy link
Author

vicuna commented Apr 5, 2013

Comment author: @lpw25

Leo: are you concerned with the runtime performance and memory usage?

It annoys me that I have to choose between easy access to shared fields and the data layout that seems most suitable. This probably comes under the category of "premature optimisation", but it is frustrating nonetheless.

What's the "external" qualifier in your example? If it is supposed to tell the compiler that all constructors share the common part, it's kind of weird having to duplicate the field definition in all constructors and have the compiler check that they all match.

I admit that the duplication of field names is not ideal. I chose that syntax as it was the smallest change from ordinary variant constructors. Maybe something like:

type t =
Foo of { foo: int }
| Bar of { bar: int }
with { shared: int }

would be better (since we are going to move all type-conv extensions over to ppx :)).

If we start adding annotations to type definitions in order to help the compiler pick a good runtime representation, I'd rather do the opposite and mark the "description" field within the enclosing record as being "inlined".

This is probably a matter of taste.

For instance, we can expose a "maker" function which only takes the "description" and optional arguments for the other fields

Indeed, there is always a choice between combining data into a single data structure and being able to represent a "partial" data structure. This is probably orthogonal to the issue of tagged records: if some shared fields have no default do you put them in a separate sub-record or leave them inlined in a single data structure?

How would you deal with 'a loc with inline records?

I'm afraid I don't really understand your point here. Why would I not just use

Record of (Longident.t loc * expression) list

@vicuna
Copy link
Author

vicuna commented Apr 6, 2013

Comment author: @lpw25

For instance, we can expose a "maker" function which only takes the "description" and optional arguments for the other fields

It occurs to me that rather than providing both of:

mkExpr ?loc ?attr descr

mkLet other args

we probably want to provide:

mkLet ?loc ?attr other args

which can easily be implemented with tagged records.

@vicuna
Copy link
Author

vicuna commented Apr 8, 2013

Comment author: @alainfrisch

This probably comes under the category of "premature optimisation"

I believe this is indeed the case!

Why would I not just use
Record of (Longident.t loc * expression) list

The whole discussion is about simplifying patterns like:

Record [{txt=Lident "x"},ex; {txt=Lident "y"},ey] -> ...

into:

Record [Lident "x",ex; Lident "y",ex] -> ...

(or maybe even: Record ["x",ex; "y",ex] -> ... if we also support auto-focusing on a constructor)

or:

Record l when List.exists (function {txt=Lident "x"} -> true | _ -> false) -> ...

into:

Record l when List.exists (function Lident "x" -> true | _ -> false) -> ...

Worst: I tend to enable warning 9 (labels not bound in a record pattern), which makes the version with records even more tedious:

Record [{txt=Lident "x";},ex; {txt=Lident "y";},ey] -> ...

we probably want to provide:
mkLet ?loc ?attr other args

It really depends. While it is nice to have such builder function (and this is precisely what the new Ast_helper module in the extension_points branch does), it is also useful to have access to mkExpr. Consider for instance OCaml's parser.mly: it defines a standard "mkexp" function which sets the location (using symbol_rloc()), which is then reused directly in all rules creating expressions. In extension_points, there is also a mkexp_attrs which processes "inline" attributes and extension node notions. How would that look if the expression type was defined directly as a sum type with shared arguments?

Another example would be the implementation of a term algebra with hash-consing. Each node would have a unique id which has to be assigned automatically if the "term description" cannot be found in a memoization table. We need a way to manipulate the term description (without the unique id), if only to pass it to the builder function is charge of the hash-consing.

@vicuna
Copy link
Author

vicuna commented Apr 23, 2014

Comment author: @alainfrisch

I've attached a new patch (autofocus2.diff) against the current trunk, making use of attributes to mark fields that support autofocusing.

type t = {desc [@Focus] : [A | B]; loc : int};;

let f (x : t) = match x with A -> true | B -> false;;

The patch shows that the feature could easily be implemented in user-land if we decide to add some generic type-checking hooks (#6378).

@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label Apr 26, 2021
@gasche
Copy link
Member

gasche commented Apr 26, 2021

We had this discussion almost ten years ago, but I still believe that pattern synonyms would be nicer than the proposed feature of using type information to guess an implicit (runtime-relevant) coercion. Since then I have worked more on the implementation of pattern-matching in OCaml, so maybe it would be easier to try to implement it. (But then we all still have many things to do!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants