Mantis Bug Tracker

ID: 0006925
Project: OCaml
Category: runtime system and C interface
View Status: public
Date Submitted: 2015-07-09 14:08
Last Update: 2017-10-06 09:55
Assigned To: dra
Platform:
OS:
OS Version:
Product Version: 4.02.2
Target Version: later
Fixed in Version:
Summary: 0006925: Garbage console output on Windows with UTF-8 console in caml_partial_flush and caml_putblock
Description: Roll your eyes and prepare for another bug in the Microsoft C runtime!

The Windows API function WriteConsole (see [^]) uses the word "characters" confusingly in its description of nNumberOfCharsToWrite and lpNumberOfCharsWritten. In Windows API speak, "chars" typically means "bytes" for the ANSI version (WriteConsoleA) and UCS2-ish characters (i.e. byte-length / 2) for the Unicode-ish version (WriteConsoleW).

However, it appears (at least on Windows 7, Windows Server 2012 and the latest public build of Windows 10) that lpNumberOfCharsWritten takes encoding into account too. So for the call WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), "\xe2\x86\x95", 3, &dwWritten, NULL), dwWritten will be 1 if the console is set to UTF-8 encoding (it will be 3 for the default cp850).
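
The call above can be wrapped in a small probe. This is a minimal sketch and not the attached broken-write.c; the function name probe_write_console is hypothetical, and on non-Windows builds it simply returns -1 so that the file still compiles.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: writes the 3-byte UTF-8 encoding of U+2195 to the
   console with WriteConsoleA and returns the count reported in dwWritten.
   Returns -1 if the call fails or if this is not a Windows build. */
static long probe_write_console(void)
{
#ifdef _WIN32
    static const char arrow[] = "\xe2\x86\x95";
    DWORD dwWritten = 0;
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

    if (hOut == NULL || hOut == INVALID_HANDLE_VALUE)
        return -1;
    /* Fails if stdout is redirected to a file or pipe rather than a console. */
    if (!WriteConsoleA(hOut, arrow, 3, &dwWritten, NULL))
        return -1;
    return (long)dwWritten;
#else
    return -1;
#endif
}
```

On an affected system with the console set to code page 65001, the description above suggests the returned value would be 1 rather than 3.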

Contrary to popular opinion, the Windows Console has actually supported UTF-8 for 15 years, so this isn't anything new!

Where this comes back to the C runtime and thus to OCaml is that it means that the C runtime function write returns the wrong number when writing to a console (it clearly returns the effective dwWritten from WriteConsole)... which means that Printf.printf and related functions using output_string (and thus eventually caml_putblock) keep repeating characters from the string.
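
The repetition can be reproduced without a Windows console by simulating the miscount. The sketch below is not the OCaml runtime code: faulty_write and put_block are hypothetical stand-ins for write and for a generic "write until done" loop. It assumes that a UTF-8 lead byte together with its continuation bytes is counted as one character, and that a stray continuation byte is counted as one character on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes "sent to the console" by the simulated write, kept for inspection. */
static unsigned char sent[64];
static size_t sent_len = 0;

/* Simulated faulty write: sends every byte it is given, but returns a count
   of "characters" instead of bytes (see the assumptions in the lead-in). */
static int faulty_write(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    int chars = 0;

    while (i < len) {
        size_t n = 1;                      /* bytes in this "character" */
        if (buf[i] >= 0xF0) n = 4;
        else if (buf[i] >= 0xE0) n = 3;
        else if (buf[i] >= 0xC0) n = 2;
        if (n > len - i) n = len - i;      /* truncated sequence */
        i += n;
        chars++;
    }
    for (i = 0; i < len && sent_len < sizeof sent; i++)
        sent[sent_len++] = buf[i];
    return chars;
}

/* Generic output loop that trusts the return value as a byte count.
   Returns the number of calls made to faulty_write. */
static size_t put_block(const unsigned char *buf, size_t len)
{
    size_t calls = 0;

    assert(buf != NULL);
    while (len > 0) {
        int written = faulty_write(buf, len);
        if (written <= 0)
            break;
        buf += written;                    /* advances by characters, not bytes */
        len -= (size_t)written;
        calls++;
    }
    return calls;
}
```

Calling put_block on the three bytes "\xe2\x86\x95" makes two calls under these assumptions: the first sends all three bytes and reports 1, so the loop sends the trailing two bytes a second time, for five bytes in total.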

I don't think this issue should affect other kinds of I/O (e.g. file I/O) because OCaml doesn't expose the Windows extensions which allow you to enable UTF-8 and UTF-16 encoding on file handles. It's possible that a C stub which enabled them could cause the same effect, but I haven't investigated that.
Steps To Reproduce: In order to see the issue, you must be using a UTF-8 enabled Command Prompt. This is achieved by starting cmd and running chcp 65001. You must also select a Unicode font from the Font tab of the Properties dialog for the Command Prompt - either Consolas or Lucida Console. If you leave the default "Raster Fonts" option, you won't see the problem.

From an OCaml top-level, simply execute Printf.printf "\xe2\x86\x95" and you will see three characters (â†•) rather than just the one expected.

Curiously, C's printf function is not affected by the issue. If you compile the attached broken-write.c using i686-w64-mingw32-gcc -o broken-write.exe broken-write.c and run it in a Unicode-enabled console, then you'll see printf correctly output just ?, followed by a C demonstration of what's going wrong in caml_putblock, which outputs ???.
Additional Information: There is something else going on in the runtime which affects the character encoding, because if the program [Printf.printf "\xe2\x86\x95"] is instead compiled using ocamlc/ocamlopt then the output is the more expected ???, but I haven't managed to trace precisely what's going on in the runtime to cause the mistranslation to â†•. Some kind of code page translation is going on in write but I can't see why or where it gets set. In opam (which is where I've hit this), I was seeing this change in behaviour a long way through execution - i.e. I was getting ??? at the start of the program and then suddenly (but consistently) Printf.printf started displaying â†•.

It's possible to detect the code-page using GetConsoleCP and GetConsoleOutputCP and at least realise that the problem may occur.
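
A minimal sketch of such a check, assuming the only code page of interest is 65001 (CP_UTF8). The helper names are hypothetical, and on non-Windows builds both helpers simply return 0.

```c
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if the console output code page is UTF-8 (65001), else 0.
   Always 0 when not built for Windows. */
static int console_output_is_utf8(void)
{
#ifdef _WIN32
    return GetConsoleOutputCP() == CP_UTF8;
#else
    return 0;
#endif
}

/* Same check for the console input code page. */
static int console_input_is_utf8(void)
{
#ifdef _WIN32
    return GetConsoleCP() == CP_UTF8;
#else
    return 0;
#endif
}
```

A program could call console_output_is_utf8() before writing and, if it returns 1, treat the count returned by write with suspicion.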
Tags: No tags attached.
Attached Files: broken-write.c [^] (782 bytes) 2015-07-09 14:09

- Relationships
related to 0006521 (closed): String.escaped returns strange results in Mac OS X + LANG=ja_JP.UTF-8

-  Notes
dra (developer)
2015-07-09 14:12

Sadly I seem to have hit encoding problems with Mantis as well! The erroneous console output in the toplevel (â†•) is displaying correctly in the bug report. The ??? displayed elsewhere should be a U+2195 (Up Down Arrow) symbol followed by two U+FFFD (Replacement Character) symbols.
xleroy (administrator)
2015-07-25 17:46

That sounds like a bug in Microsoft's CRT library. The MSDN page for _write ( [^]), Visual Studio 2015 edition, says "If successful, _write returns the number of BYTES actually written." Note the word "bytes": it's not talking about the number of logical characters actually written after UTF-8 decoding.

So, I don't see OCaml at fault here. Could you please report this to Microsoft instead?
dra (developer)
2015-07-25 18:23

Hah - got an email address? :o) The first sentence is "Roll your eyes and prepare for another bug in the Microsoft C runtime!"

It's a question of perspective, though (similar to the argument over working around erroneous floating point conversions in an older PR...). I've handed Printf.printf *in OCaml* 3 bytes to send to the console - I can demonstrate that it's actively sent 5.

It's obviously been known for a while - [^]

I can do some experimentation to see if more recent C runtimes have fixed the problem (msvcrxxx.dll, etc.) - but if my understanding is correct that would allow the MSVC ports to be fixed simply by compiling with a more recent Visual Studio, but leave the mingw ports permanently screwed because they always link with msvcrt.dll?

The "fix" would be to modify caml_putblock to identify the corner-case and convert the length accordingly - my request is more that if a patch is given, will you countenance merging it? In the particular application where I've hit this (OPAM), I've simply wrapped the WriteConsole function directly and bypassed caml_putblock completely. UTF-8 is not exactly going to go away (even the Console sub-system has received quite a bit of Redmond-attention in Windows 10)...
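
A minimal sketch of what such a length conversion might look like. It is not a patch to caml_putblock, and the function name chars_to_bytes is hypothetical. It assumes the faulty count treats each UTF-8 sequence as one character and each stray continuation byte as one character; if the console actually counts UTF-16 code units, 4-byte sequences would need different handling.

```c
#include <assert.h>
#include <stddef.h>

/* Given a buffer of len bytes and a count of "characters" reported as
   written, return how many bytes of the buffer those characters span. */
static size_t chars_to_bytes(const unsigned char *buf, size_t len,
                             size_t nchars)
{
    size_t i = 0;

    assert(buf != NULL);
    while (i < len && nchars > 0) {
        size_t n = 1;                      /* bytes in this "character" */
        if (buf[i] >= 0xF0) n = 4;
        else if (buf[i] >= 0xE0) n = 3;
        else if (buf[i] >= 0xC0) n = 2;
        if (n > len - i) n = len - i;      /* truncated sequence at the end */
        i += n;
        nchars--;
    }
    return i;
}
```

With this, a reported count of 1 for the buffer "\xe2\x86\x95" maps back to 3 bytes, so a caller would advance past the whole sequence instead of resending its last two bytes.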
xleroy (administrator)
2015-07-25 19:21

> Hah - got an email address?

No, and I'm tempted to add "it's not my problem". More constructively, at the bottom of MSDN pages I see questions and comments by users, so there must be a way. Also, it could be worth taking this issue up with the Mingw people -- they have worked around CRT issues in the past.

> The first sentence is "Roll your eyes and prepare for another bug in the Microsoft C runtime!"

Right, and my first reaction is "Then it's not an OCaml problem".

> If a patch is given, will you countenance merging it?

My first reaction is that we cannot work around every bug in Microsoft's code -- it's just not sustainable. This said, we've made exceptions in the past (against my gut feeling) if the patch was *really* simple. I doubt it would be simple in this case, as it sounds like a big mess of Microsoft's left hand and right hand ignoring each other.
nevor (reporter)
2015-07-27 15:05

I tried to reproduce the bug and there doesn't actually seem to be any bug, or at least not as explained in the report.

1) I compiled broken-write using visual C version 15 and it correctly displays that 3 bytes were written, even in a UTF-8 console.

2) I've done the ocaml top level quick test with the suggested sprintf and I get the correct 3 latin characters in a Latin console (which is expected) and the correct 1 unicode character in a UTF-8 console (which is expected).

It is to be noted that cmd might start by default with latin code page activated (this can be checked and changed with the command), the current code page is also displayed in the properties of the target console (right click on title bar).

I don't know why "WriteConsoleA" is mentioned since it's the "write" function that is used in the ocaml runtime.
dra (developer)
2015-07-27 15:30

That's interesting - I'll download Visual Studio 2015 and see too. Certainly both mingw64-gcc and Microsoft C 11 (Visual Studio 2012) are behaving as I reported (with the need to change the example to have all the variables in C89 position at the top of the block). Perhaps the CRT in Visual Studio 2015 has fixed this - maybe this is also the improvement for the toplevel.

In order to write to the console, WriteConsoleA or WriteConsoleW is the ultimate function being called. I had a quick look at the source code for the CRT in VS 2012 which suggests that it's called via the WriteFile API (which hints that the bug may be Windows, rather than the CRT, but I haven't had a chance to investigate further).

I'm guessing from your comment about being able to see the Code Page in Properties that you're running on Windows 10?
nevor (reporter)
2015-07-27 15:42

Yes running Windows 10
xleroy (administrator)
2015-11-15 17:06

So, any news on this issue? If Visual Studio 2015 makes it go away, hurrah for Visual Studio 2015!
shinwell (developer)
2016-12-08 10:28

@dra Can you try to decide what to do with this?
dra (developer)
2017-10-05 22:15

The Unicode changes for 4.06.0 caused me to revisit this PR today. I discovered that the behaviour changed since 4.03.0 and so did some further investigation.

The difference in behaviour between the toplevel and a compiled program for the test Printf.printf "\xe2\x86\x95" was eliminated by e60a2db8 which was the fix for PR#6521. The toplevel calls String.escaped which means that the C runtime function setlocale was called in the toplevel, but not in a compiled program - if the compiled program included, say `let _f = String.escaped "foo" in Printf ...` then it behaved the same as the toplevel.

This reduces the problem to displaying too many extra characters, but at least the correct ones.

I can confirm @nevor's finding that Windows 10 (certainly from Windows 10 1511, possibly even earlier) does not exhibit this problem - the C program included in this PR works - i.e. WriteConsole now returns the number of *bytes* written. Windows Server 2012 (i.e. Windows 8) is definitely broken - I don't have access right now to a Windows Server 2012 R2 box or a vanilla Windows 10 RTM, but I would lay odds this was quietly fixed by Microsoft either as part of the overhaul of the console in Windows 10, or slightly later in preparation for the Linux Subsystem.

I'm on the cusp of agreeing that we should close this as "won't fix" (or at least "won't workaround"), but I'm going to hold off for a bit until the Windows Unicode work for 4.06.0 is completed.
dra (developer)
2017-10-05 23:30

Just for the record, Windows 10 1507 (i.e. 10240; RTM) also works correctly.

- Issue History
Date Modified Username Field Change
2015-07-09 14:08 dra New Issue
2015-07-09 14:09 dra File Added: broken-write.c
2015-07-09 14:12 dra Note Added: 0014195
2015-07-22 16:11 doligez Status new => acknowledged
2015-07-22 16:11 doligez Target Version => 4.03.0+dev / +beta1
2015-07-25 17:46 xleroy Note Added: 0014268
2015-07-25 17:46 xleroy Status acknowledged => feedback
2015-07-25 18:23 dra Note Added: 0014271
2015-07-25 18:23 dra Status feedback => new
2015-07-25 19:21 xleroy Note Added: 0014273
2015-07-27 15:05 nevor Note Added: 0014282
2015-07-27 15:30 dra Note Added: 0014284
2015-07-27 15:42 nevor Note Added: 0014285
2015-11-15 17:06 xleroy Note Added: 0014684
2015-11-15 17:06 xleroy Status new => feedback
2015-12-11 16:11 frisch Severity major => minor
2015-12-11 16:11 frisch Target Version 4.03.0+dev / +beta1 => later
2016-12-08 10:28 shinwell Note Added: 0016831
2016-12-08 10:28 shinwell Assigned To => dra
2016-12-08 10:28 shinwell Status feedback => assigned
2017-02-23 16:43 doligez Category OCaml runtime system => runtime system
2017-03-03 17:45 doligez Category runtime system => runtime system and C interface
2017-10-05 22:06 dra Relationship added related to 0006521
2017-10-05 22:15 dra Note Added: 0018491
2017-10-05 23:30 dra Note Added: 0018493
