New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A new format implementation based on GADTs #6017
Comments
Comment author: @gasche While I have not yet reviewed the implementation in detail, I have The change can be summarized as this: instead of being kept as Below is a more detailed high-level comment. The current way OCaml handles format strings can be roughly summarized
The duplicated parsing, once at type-checking time and once at The change by Benoît turns format handling into the following picture:
I believe this change is highly desirable: it rationalizes the runtime However, due to the large number of features in the OCaml format |
Comment author: @yallop I like this idea a lot. If I'm not mistaken, format could now be written entirely as a library, without special compiler support, using ppx or camlp4 to turn the strings into a GADT tree. (Of course, we'll want to keep format around for backwards compatibility). |
Comment author: @alainfrisch This is an impressive piece of work, which I'd like to see integrated at some point. It probably deserves to be tested in the wild, on large code bases. I know there is some reluctance against using recent additions to the language within the core compiler or stdlib. What about integrating the patch on the trunk after the branch for the next release has been created? This would make it easy to test it, while providing a full release cycle before it is officially release, and a release branch for bug fix releases, if required. |
Comment author: @johnwhitington This looks great. What is the effect on the readability of type errors? An improvement? Worse? The same? It's rather hard to explain to students those complicated format4 / format6 types... |
Comment author: bvaugon I confirm that typing errors with format types are difficult to understand (and to explain to students). Unfortunately, this patch does not change anything because the inferred types are the same, and the unification produces the same typing errors. However, I have a suggestion to try to improve this problem about error messages containing format types, and I would like some outside opinions on this potential issue. What do you think to modify the type pretty-printer to display format types in this way: "%d%s%c" format rather than: (int -> string -> char -> 'a,' b, 'c,' d, 'd, 'a) format6 We then get this kind of error message: This expression has type "%d%s%c" format but an expression was expected of type "%d%d%c" format. Unfortunately, all type parameters are not necessarily instantiated when generating the error message. These cases would require the display of format types like: "%d%?%?" format or maybe: "%d%['a]%['c]" format to name type parameters. Does someone have an opinion or a better idea ? |
Comment author: @gasche I'd rather not mix a discussion about a change to the runtime with new feature ideas about printf/scanf/format. My gut feeling would be to first discuss and handle what can be done purely as an internal change, with no modification to the observable behavior. Then, once we know for sure the status of the change, we can act on it and have room for discussion of observable changes based on this experience. If you think there is value in having these other discussions right now rather than in a few weeks/months, I would suggest a separate bugtracker issue. |
Comment author: bobot That's a great work. I was wondering if the author can think of ways for improving Format.ikfprintf, Format.ifprintf, Printf.ifprintf: let debug = ref false let dprintf fmt = Such a function is useful for printing conditionaly debugging information. With the old implementation the cost is quite high since even when debug is false the format string is parsed. With the new implementation that should be faster but there is some allocation. Currently the best dprintf version that I know, consists to use specialize version without ifprintf: let dprintf0 fmt = let dprintf1 fmt x1 = let dprintf2 fmt x1 x2 = ... But the user must select the correct version. Is it possible to reduce the cost of ifprintf? A specialized version of make_printf: (a, b, c, c, c, unit) CamlinternalFormatBasics.format6 -> a that construct nothing except closure? By adding the number of remaining argument in format6 for faster application for small arities? Once again thank you for this great job, |
Comment author: @gasche I'm not sure there is some allocation: the generated GADTs are constant immutable values, and there should therefore be pre-allocated by the compilers. |
Comment author: bvaugon The GADT representing the format is not allocated by printing |
Comment author: bvaugon Oh, sorry, to write an ifprintf which allocates nothing (as in type ('a, 'b) iprintf_acc = let rec fun_of_iprintf_acc : type a . (a, unit) iprintf_acc -> a = let rec iprintf_acc_succ : type a b c . let rec make_iprintf fmt = make_iprintf_aux IPrintf_arg_0 fmt let ifprintf _ fmt = make_iprintf fmt However, if this code is integrated into the standard library, it may |
Comment author: bobot Nice, you still need to look at all the formatter but at least it does just that. I tried by adding the size of the formatter in format6 (your format6 is format6_aux): The nice point is that ifprintf and printf, fprintf, ... can be implemented without any closure but you need one function as big as make_printf by arity. ... let app4 : type a1 a2 a3 a4 b c d . let make_print2 : type a b c d . (all the code for just a few case is uploaded as special_case_for_small_arities.ml) I can't think of a generic version of app that doesn't use a closure, except if the inliner can create all the version by itself with the following definitions: let app_gen f1 f2 x2 x1 o = ... f1 o ... f2 x2 o ... let app3 x3 x2 x1 o = app_gen (app1 x3) (app2 x3) x2 x1 o |
Comment author: bobot Just another question about your patch: Do you only need the type acc to start printing only when the application is complete? Is it specified in the documentation or is it for respecting the old behavior? |
Comment author: @gasche If the question is "do you really need acc?", the answer is 'yes'. As it is the interpretation of the GADT values that does the consumption of the arguments one after the other, you need the accumulator to delay all printing actions until all arguments have been consumed. Otherwise constructing the closure let print_id = Printf.printf "%s: %d\n" Conf.i18n_id_name would start printing stuff before the remaining %d argument is given. This is clearly not the correct semantics (because it would be printed only once, instead of each time |
Comment author: bobot My second question was: Is this behavior specified somewhere in the documentation or is it only for respecting the old behavior (which must be done!)? Gabriel, in another world where it is not specified if printf can start printing before complete application (it is also a possible semantics), you will write let print_id id = Printf.printf "%s: %d\n" Conf.i18n_id_name id and in this other world Benoît will implement a make_printf which process directly the argument and which is easier to inline. So its easier in this world to make printf as fast as using directly the output functions: http://www.lexifi.com/blog/note-about-performance-printf-and-format But we are not in this other world, so Benoît created the acc type, and we should hope that the improvement done in the inliner will be greater in this world than in the other. I really like the patch proposed by Benoît and it already give speed-up and simplification, that's why I want to know if we can push it further. |
Comment author: @gasche I am strongly opposed to any change of the semantics that would amount to printing starting before complete application. This is certain to break existing programs¹, and I don't think a patch doing that should be accepted. ¹: and not even some hitting some bug or misbehavior of the current implementation and not noticing it; some perfectly reasonable user programs. There is in fact a similar problem with invalid_arg failures due to runtime-invalid format, that is done "wrong" in the patch. I'm writing a longer (too long) message about my first forays at patch review, and will write about this more than anyone (except maybe Benoît) wants to read :-) Regarding performances, Benoît had made early measurements on a much simpler prototype, that shown that the performance it when compared to "using directly the output functions" was reasonable (around 10%-30%, iirc., compared to the several-times slowdown observed by Alain). Benoît, do you think you could update those small speed experiments for the current patch? I would even be in favor of considering reverting some of the changes made by Alain on performance grounds, turning Format.printf into direct printers in the compiler code. But that will be evaluated and discussed with him in due time. On a related note, replacing formats with direct printing function may result in large increase in code size. Benoît's expansion into a GADT preserves the compact code size we had with the previous one -- as opposed to some other translation schemes that produce large code closures. It is unclear that always inlining to the native functions will actually improve performances (I've often seen excessive -inline tunings results in slowdown) -- but that will be left to the tricky inliner heuristics to decide. |
Comment author: @gasche I started to look at the details of the patch, and here is some early feedback. I have not done any
There is code specific to Scanf or Format, that is present in scanf.ml and format.ml The split between CamlinternalFormatBasics in pervasives.ml on one side, and camlinternalFormat.ml
All the code needed to implement these is in pervasives.ml, all the rest is in
The code added to pervasives.ml is really long (850 lines, added to a file that was previously 460 Here is how each of the format-related export of pervasives.ml contributes to code size under the
In the previous implementation, formats are stored as strings until they are passed to the relevant One other possibility (that was used in some previous prototypes of Benoît, using functions instead I tried to evaluate the possibility of reverting to this style, for the purpose of getting rid of
So it appears the (, ) compilation scheme may allow to get rid of that Besides, there are arguable correctness benefit in keeping the format-string around
Regarding the request of François, I considered moving the generated GADT value from the structure String_literal("foo", to something more like Arg((Some(String_literal "foo"), String_tag, String_arg(padding)), This changes makes the GADT type more complicated (but isomorphic to the previous one), and it is let rec ifprintf = function The problem is that it means a very large change to the current patch, which makes everything else Driving the point further: I think there are other ways to do what François want (preprocessing,
The proposed patches makes changes to the way format errors are handled that are problematic. (First: some nonsensical formats where previously rejected at runtime and are now rejected at (1) The error messages are different from those of the previous implementation, for the worse: (2) The error message may happen before all the format arguments have been readed. This is precisely
I believe that for Benoît's quite ambitious change to be acceptable, it should guarantee maximal ²: Benoît considered embedding static typing information about which formats are valid for all |
Comment author: bobot
type ('a, 'b, 'c, 'd, 'e, 'f) format6 = let ifprintf : type a b c d . (* make_printf or Benoît make_ifprintf *)
Why not only at toplevel? For make_printf, string_of_format is used on a whole format6. Another function need that? |
Comment author: @gasche
|
Comment author: @gasche I wrote a small format fuzzer and quickly found out that string_of_format does not preserve the format strings. Here are some examples:
I did make the necessary changes to fix the first of these (patch to be uploaded), but that amounts to polluting the GADT description of formats with details that are irrelevant to the semantics, and only meant to enable faithful re-printing of the format string. I suspect the right design choice is to move back to a (, ) pair, as I previously proposed. Benoît, what do you think? |
Comment author: bvaugon I thought that changing the semantics of programs which are fundamentally Printf.printf "%d%[a-z]%d" 1 or Printf.printf "%d\n%!%[a-c]%d" 1 "foo" 3 was not a problem, because, with the new verion, the program break However, it is very simple to reaquire the old behaviour of this kind === About:
It is not a bug, it is completely normal. Indeed, %L and %N are About:
Can be fixed by adding the case : As a consequence, string_of_format "@" will equal "%@" but it is not a === About make_printf versions wich do not allocate intermediate closures print_string "Hello "; from: printf "Hello %d\n%!" n;; |
Comment author: @gasche Next week I will consult the people that are even grumpier than me about those backward changes, but my gut feeling is that it's better to have as few of them as possible. Make this patch a purely internal change with no observable difference¹, and let's discuss improvements to the behavior as separate step. It's much easier to discuss and review a behavior change when it is not embedded in a thousands-of-line diff. Besides, not having behavior changes makes it easier to find bugs: currently my fuzzer stops 99 times out of 100 on minor difference you introduced in error message reporting, and this risks hiding the 1/100 bug it may spot. ¹: I told before I thought some behavior changes, such as failing statically instead of at runtime, were acceptable. I'm not sure where to draw the line between what's ok and what's not, and it would be easier to review and discuss the patch if there was no such line to draw. |
Comment author: bvaugon Remarks that, with the old version, you have : Printf.printf "%{%d%d%}" "%d%d";;%i%i- : unit = () So, "%d" is transformed into "%i". The hypothesis that formats are literally pretty-printed was already wrong. |
Comment author: bvaugon I think that the GADT definition of formats should verify the two properties :
If you add a constructor for "%b" and an other for "%B" (idem for all others), you increase the code size and break the second property. The solution (, ) is interresting because it reduce the size of CamlinternalFormatBasics, but it break the first property. |
Comment author: @gasche The first property would still hold of the GADT part, which is indeed the one that contains the structured information. The property is helpful inasmuch as it helps ensure the correctness of the typecore translation and later of the printf/scanf/format functions. With the pair representation, transcore is still trivial to audit for correctness, and printf/scanf/format that immediately take the second part are still as correct as before. And furthermore, the pair representation guarantees that string_of_format behaves as it previously did, which I still insist is an important property. You think it is about the semantics of format strings, but that is wrong. If people use string_of_format, it is precisely to have a non-semantic representation and play dirty tricks with it. We don't know, and we don't want to know, what changing string_of_format may provoke there. I think we should do the change to the pair representation. |
Comment author: bvaugon Disadvantages of the (, ) representation are:
Advantages are:
If everybody think that the performance costs are acceptable and that nobody will touch the format strings embeded in formats (which can't be abstracted, but which are hidden in CamlinternalFormatBasics), I agree. Do you want to implement it yourself or I do it ? If I implement it, do I remake clean patches from original 4.00.1 ? I think that error messages about invalid formats obtained at compile time are acceptable. They are already much better than before. However, it is difficult to obtain good error messages at runtime (typically by embedding the whole format in messages) when "%[" or "%_" are used in formats passing to printing functions without reducing performances of all prints. Is it so important ? |
Comment author: @gasche
I wouldn't be able to work on that before next week (beginning of June).
As you prefer. So far I applied your patches (to the last commit of Damien closing the 4.00 branch, in fact) and worked/reviewed on top of that.
I think it's good to have, and included it in my working copy. If you start from 4.00.1 again, you should integrate it in the result.
That's rather orthogonal to the change being discussed now and I think we can discuss it later. |
Comment author: @alainfrisch Before doing the change, I'd suggest to check first if string_of_format is actually used in real code. I don't remember every using it. |
Comment author: bvaugon I agree that a further discussion about the utility of the pair (, ) representation would be useful. It would be necessary to look into the influence on the sensitive tradeoff and the implicit drawbacks on the overall performances and style. |
Comment author: bobot FWIW string_of_format is not used in Coq, Frama-C, Why3, Why2, janest-core, lwt and all the other ml files in my home. |
Comment author: @diml I know one place were [string_of_format] is used: ocaml-gettext. In gettext strings are converted by looking into a table mapping original strings to their translations, so obviously one needs the exact string as it appears in the source code. |
Comment author: bvaugon I wrote a new version using pairs (Gadt, "string") to encode formats as suggested by Gabriel (see howto-apply-new-formats-PAIR-patches.txt, ocaml-4.00.1_new-formats_PAIR_patch-1.diff and ocaml-4.00.1_new-formats_PAIR_patch-2.diff) and package the old version including the fix-incompatible-formats-raising-time-patch.diff (see howto-apply-new-formats-NOPAIR-patches.txt, ocaml-4.00.1_new-formats_NOPAIR_patch-1.diff and ocaml-4.00.1_new-formats_NOPAIR_patch-2.diff) to facilitate comparisons of the two versions. So, you can choose the one you prefer. |
Comment author: berenger Since I maintain a lazy logger (it's called dolog in OPAM), I am very interested by this feature. Especially, I am interested in all the functions to ignore some material when conditionally printing. |
Comment author: @gasche Merged in trunk. There is still a small incompatibility introduced by this patch when you use %(..%), but hopefully it should be ironed out before the release. |
Comment author: pboutill I wrote in my project (https://github.com/Kappa-Dev/KaSim) the function : let tag_if_debug s = hoping that "tag_if_debug "%a" Pp.map my_huge_map" costs nothing when not in mode debug. |
Comment author: bvaugon Remarks that in your example, the callback function Pp.map won't be Consequently, in no-debug mode, the cost to pay with this program is However, your example suggests that it is the reading of your If yes, it might be done by implementing, in the library, a specialised |
Comment author: @gasche
This is what I had in mind when asking Pierre to reply to this PR. I think we should try to see if this does not duplicate too much code (my guess it that it wouldn't), without imposing hard-limits such as "no allocation at all" but trying to do something reasonable. Pierre: indeed as Benoît points out I was wrong when I said that the %a and %t functions would be evaluated in any case -- I'll put that on the lack of lunch. They aren't in the i(k)fprintf case. |
Comment author: bobot If the call of tag_if_debug is in a hot path of your program. I advice you to not use a variadic function. Do as in #6017#c9322 Add a version of tag_if_debug for each arity and don't use ifprintf, it removes the cost of the intermediate closure. |
Original bug ID: 6017
Reporter: bvaugon
Assigned to: @gasche
Status: closed (set by @xavierleroy on 2017-02-16T14:14:38Z)
Resolution: fixed
Priority: high
Severity: feature
Version: 4.00.1
Target version: 4.02.0+dev
Fixed in version: 4.02.0+dev
Category: standard library
Tags: patch
Related to: #5949 #6286 #6297 #7034
Monitored by: ibnfirnas @gasche @lpw25 mcclurmc @diml "Pascal Cuoq" bobot @chambart @jmeber @hcarty @yakobowski @alainfrisch
Bug description
I wrote a new implementation of formats unsing GADTs.
It contains modifications of Printf, Format and Scanf standard library
modules, and OCaml typer adaptations (principally in typing/typecore.ml).
In this new implementation, formats are compiled into GADTs trees by
the compiler and are "evaluated" by the printing and scanning functions.
The main advantages are :
and the OCaml compiler!) to ensure compatibility.
This new version is "mostly compatible" with the old-one. Differencies are :
"%1.1s", "%#s", "%0c" (I had to slightly adapt the testsuite,
principally to comment absud tests which no longer compile, so, obviously,
I did not remove any test which still compile).
with an "invalid Format-like syntax" but a "valid Printf-like syntax"
are now accepted by Format printing functions and literally printed.
It is a small disadvantage because it is slightly less convenient to
learn Format "formatting info" syntax by practice, sorry.
Since the OCaml standard library and the OCaml compilers use formats, the
bootstrap is a bit complicated.
I attach two patches (ocaml-4.00.1_new-formats_patch-{1,2}.diff) and
instructions to apply them into "howto-apply-new-formats-patches.txt".
This .txt file can be used (with a slight adaptation from your config.)
as an installation shell-script.
Please do not hesitate to send me any comments or questions about this code.
Benoît Vaugon.
File attachments
The text was updated successfully, but these errors were encountered: