Garbage console output on Windows with UTF-8 console in caml_partial_flush and caml_putblock #6925

Closed
vicuna opened this issue Jul 9, 2015 · 12 comments


vicuna commented Jul 9, 2015

Original bug ID: 6925
Reporter: @dra27
Assigned to: @dra27
Status: assigned (set by @mshinwell on 2016-12-08T09:28:42Z)
Resolution: open
Priority: normal
Severity: minor
Version: 4.02.2
Target version: later
Category: runtime system and C interface
Related to: #6521
Monitored by: @nojb @ygrek

Bug description

Roll your eyes and prepare for another bug in the Microsoft C runtime!

The Windows API function WriteConsole (see https://msdn.microsoft.com/en-us/library/windows/desktop/ms687401) uses the word "characters" confusingly in its description of nNumberOfCharsToWrite and lpNumberOfCharsWritten. In Windows API speak, "chars" typically means "bytes" for the ANSI version (WriteConsoleA) and UCS2-ish characters (i.e. byte-length / 2) for the Unicode-ish version (WriteConsoleW).

However, it appears (at least on Windows 7, Windows Server 2012 and the latest public build of Windows 10) that lpNumberOfCharsWritten takes encoding into account too. So for the call WriteConsoleA(GetStdHandle(STD_OUTPUT_HANDLE), "\xe2\x86\x95", 3, &dwWritten, NULL), dwWritten will be 1 if the console is set to UTF-8 encoding (it will be 3 for the default cp850).

Contrary to popular opinion, the Windows Console has actually supported UTF-8 for 15 years, so this isn't anything new!

Where this comes back to the C runtime and thus to OCaml is that it means that the C runtime function write returns the wrong number when writing to a console (it clearly returns the effective dwWritten from WriteConsole)... which means that Printf.printf and related functions using output_string (and thus eventually caml_putblock) keep repeating characters from the string.
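The repetition can be modelled without a Windows console. The sketch below is not the OCaml runtime code: miscounting_write is a hypothetical stand-in for the affected write, and its rule for counting characters is an assumption chosen for illustration. flush_block is a simplified retry loop that trusts the return value as a byte count.

```c
#include <stddef.h>

/* Bytes that actually reached the simulated console. */
static unsigned char console[64];
static size_t console_len;

/* Hypothetical stand-in for the affected write(): it sends every byte it is
   given, but reports how many characters were decoded rather than how many
   bytes were consumed. The decoding rule is an assumption. */
static int miscounting_write(const unsigned char *buf, int len)
{
    int chars = 0, i = 0;
    while (i < len) {
        int n = 1;                       /* ASCII or stray byte: one char */
        if (buf[i] >= 0xF0) n = 4;
        else if (buf[i] >= 0xE0) n = 3;
        else if (buf[i] >= 0xC0) n = 2;
        if (i + n > len) n = 1;          /* truncated sequence */
        for (int k = 0; k < n && console_len < sizeof console; k++)
            console[console_len++] = buf[i + k];
        i += n;
        chars++;
    }
    return chars;
}

/* Retry loop in the style of a partial-write flush: treat the return value
   as a byte count and resend whatever appears to be left over. */
static size_t flush_block(const char *data, int len)
{
    const unsigned char *p = (const unsigned char *)data;
    console_len = 0;
    while (len > 0) {
        int written = miscounting_write(p, len);
        if (written <= 0) break;
        p += written;
        len -= written;
    }
    return console_len;
}
```

Under these assumptions, flush_block("\xe2\x86\x95", 3) pushes 5 bytes to the simulated console: the full 3-byte sequence, then the trailing 2 bytes again as stray continuation bytes. Pure ASCII input is unaffected because bytes and characters coincide.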

I don't think this issue should affect other kinds of I/O (e.g. file I/O) because OCaml doesn't expose the Windows extensions which allow you to enable UTF-8 and UTF-16 encoding on file handles. It's possible that a C stub which enabled them could cause the same effect, but I haven't investigated that.

Steps to reproduce

In order to see the issue, you must be using a UTF-8 enabled Command Prompt. This is achieved by starting cmd and running chcp 65001. You must also select a Unicode font from the Font tab of the Properties dialog for the Command Prompt - either Consolas or Lucida Console. If you leave the default "Raster Fonts" option, you won't see the problem.

From an OCaml top-level, simply execute Printf.printf "\xe2\x86\x95" and you will see three characters (↕) rather than just the one expected.

Curiously, C's printf function is not affected by the issue. If you compile the attached broken-write.c using i686-w64-mingw32-gcc -o broken-write.exe broken-write.c and run it in a Unicode-enabled console, then you'll see printf correctly output just ? and a C demonstration of what's going wrong in caml_putblock which outputs ???
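The attachment is not reproduced in this thread. As a rough stand-in (not the original broken-write.c), the check it performs can be sketched as below: write the three UTF-8 bytes to file descriptor 1 and look at what the C runtime reports. The names raw_write and reported_written are made up for this sketch.

```c
#ifdef _WIN32
#include <io.h>
#define raw_write _write      /* Microsoft CRT spelling */
#else
#include <unistd.h>
#define raw_write write       /* POSIX spelling */
#endif

/* Write len bytes of s to stdout (fd 1) and return the count that the
   C runtime reports as written. */
static int reported_written(const char *s, unsigned len)
{
    return (int)raw_write(1, s, len);
}
```

A caller would compare reported_written("\xe2\x86\x95", 3) against 3. A smaller value on a UTF-8 console is the symptom described in this report.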

Additional information

There is something else going on in the runtime which affects the character encoding, because if the program [Printf.printf "\xe2\x86\x95"] is instead compiled using ocamlc/ocamlopt then the output is the more expected ???, but I haven't managed to trace precisely what's going on in the runtime to cause the mistranslation to ↕. Some kind of code page translation is going on in write but I can't see why or where it gets set. In opam (which is where I've hit this), I was seeing this change in behaviour a long way through execution - i.e. I was getting ??? at the start of the program and then suddenly (but consistently) Printf.printf started displaying ↕.

It's possible to detect the code-page using GetConsoleCP and GetConsoleOutputCP and at least realise that the problem may occur.
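A minimal sketch of that detection, assuming the only risky case is the UTF-8 code page selected by chcp 65001. The helper names are invented here, and the non-Windows branch exists only so the fragment compiles elsewhere.

```c
#ifdef _WIN32
#include <windows.h>
/* Code page the console currently uses for output. */
static unsigned console_output_code_page(void)
{
    return GetConsoleOutputCP();
}
#else
/* No console code page outside Windows; 0 means "not applicable". */
static unsigned console_output_code_page(void)
{
    return 0;
}
#endif

#define UTF8_CODE_PAGE 65001u  /* the value set by chcp 65001 */

/* Nonzero when a console write count might be in characters, not bytes. */
static int write_count_may_be_wrong(unsigned code_page)
{
    return code_page == UTF8_CODE_PAGE;
}
```

A program could call write_count_may_be_wrong(console_output_code_page()) before writing to the console and pick a different output path when it returns nonzero.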

File attachments

vicuna commented Jul 9, 2015

Comment author: @dra27

Sadly I seem to have hit encoding problems with Mantis as well! The erroneous console output in the toplevel (↕) is displaying correctly in the bug report. The ??? displayed elsewhere should be a U+2195 (Up Down Arrow) symbol followed by two U+FFFD (Replacement Character)

vicuna commented Jul 25, 2015

Comment author: @xavierleroy

That sounds like a bug in Microsoft's CRT library. The MSDN page for _write (https://msdn.microsoft.com/en-us/library/1570wh78.aspx), Visual Studio 2015 edition, says "If successful, _write returns the number of BYTES actually written." Note the word "bytes": it's not talking about the number of logical characters actually written after UTF-8 decoding.

So, I don't see OCaml at fault here. Could you please report this to Microsoft instead?

vicuna commented Jul 25, 2015

Comment author: @dra27

Hah - got an email address? :o) The first sentence is "Roll your eyes and prepare for another bug in the Microsoft C runtime!"

It's a question of perspective, though (similar to the argument over working around erroneous floating point conversions in an older PR...). I've handed Printf.printf in OCaml 3 bytes to send to the console - I can demonstrate that it's actively sent 5.

It's obviously been known for a while - https://social.msdn.microsoft.com/Forums/vstudio/en-US/e4b91f49-6f60-4ffe-887a-e18e39250905/possible-bugs-in-writefile-and-crt-unicode-issues?forum=vcgeneral

I can do some experimentation to see if more recent C runtimes have fixed the problem (msvcrxxx.dll, etc.) - but if my understanding is correct that would allow the MSVC ports to be fixed simply by compiling with a more recent Visual Studio, but leave the mingw ports permanently screwed because they always link with msvcrt.dll?

The "fix" would be to modify caml_putblock to identify the corner-case and convert the length accordingly - my request is more that if a patch is given, will you countenance merging it? In the particular application where I've hit this (OPAM), I've simply wrapped the WriteConsole function directly and bypassed caml_putblock completely. UTF-8 is not exactly going to go away (even the Console sub-system has received quite a bit of Redmond-attention in Windows 10)...
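To make "convert the length accordingly" concrete, here is a sketch of the conversion step only. It is not a patch against caml_putblock: chars_to_bytes and utf8_seq_len are hypothetical helpers, and the sketch assumes the console counts one character per UTF-8 sequence and one per stray byte.

```c
/* Length in bytes of the UTF-8 sequence introduced by lead byte b
   (1 for ASCII and for stray continuation bytes). */
static int utf8_seq_len(unsigned char b)
{
    if (b >= 0xF0) return 4;
    if (b >= 0xE0) return 3;
    if (b >= 0xC0) return 2;
    return 1;
}

/* Hypothetical helper: translate a count of characters reported by the
   console back into the number of bytes of buf that they cover. */
static int chars_to_bytes(const char *buf, int len, int nchars)
{
    int i = 0;
    while (nchars > 0 && i < len) {
        int n = utf8_seq_len((unsigned char)buf[i]);
        if (i + n > len) n = 1;   /* truncated sequence at end of buffer */
        i += n;
        nchars--;
    }
    return i;
}
```

A flush loop that detected the corner case would advance its buffer by chars_to_bytes(buf, len, written) instead of by written. The hard part, which this sketch does not address, is knowing reliably when the returned count is in characters.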

vicuna commented Jul 25, 2015

Comment author: @xavierleroy

> Hah - got an email address?

No, and I'm tempted to add "it's not my problem". More constructively, at the bottom of MSDN pages I see questions and comments by users, so there must be a way. Also, it could be worth taking this issue up with the Mingw people -- they have worked around CRT issues in the past.

> The first sentence is "Roll your eyes and prepare for another bug in the Microsoft C runtime!"

Right, and my first reaction is "Then it's not an OCaml problem".

> If a patch is given, will you countenance merging it?

My first reaction is that we cannot work around every bug in Microsoft's code -- it's just not sustainable. This said, we've made exceptions in the past (against my gut feeling) if the patch was really simple. I doubt it would be simple in this case, as it sounds like a big mess of Microsoft's left hand and right hand ignoring each other.

vicuna commented Jul 27, 2015

Comment author: nevor

I tried to reproduce the bug and there doesn't actually seem to be one, or at least not as explained in the report.

  1. I compiled broken-write using visual C version 15 and it correctly displays that 3 bytes were written, even in a UTF-8 console.

  2. I've done the ocaml top level quick test with the suggested Printf.printf and I get the correct 3 Latin characters in a Latin console (which is expected) and the correct 1 Unicode character in a UTF-8 console (which is expected).

Note that cmd might start with a Latin code page active by default (this can be checked and changed with the chcp.com command). The current code page is also displayed in the properties of the target console (right click on title bar).

I don't know why "WriteConsoleA" is mentioned since it's the "write" function that is used in the ocaml runtime.

vicuna commented Jul 27, 2015

Comment author: @dra27

That's interesting - I'll download Visual Studio 2015 and see too. Certainly both mingw64-gcc and Microsoft C 11 (Visual Studio 2012) are behaving as I reported (with the need to change the example to have all the variables in C89 position at the top of the block). Perhaps the CRT in Visual Studio 2015 has fixed this - maybe this is also the improvement for the toplevel.

In order to write to the console, WriteConsoleA or WriteConsoleW is the ultimate function being called. I had a quick look at the source code for the CRT in VS 2012 which suggests that it's called via the WriteFile API (which hints that the bug may be Windows, rather than the CRT, but I haven't had a chance to investigate further).

I'm guessing from your comment about being able to see the Code Page in Properties that you're running on Windows 10?

vicuna commented Jul 27, 2015

Comment author: nevor

Yes running Windows 10

vicuna commented Nov 15, 2015

Comment author: @xavierleroy

So, any news on this issue? If Visual Studio 2015 makes it go away, hurrah for Visual Studio 2015!

vicuna commented Dec 8, 2016

Comment author: @mshinwell

@dra27 Can you try to decide what to do with this?

vicuna commented Oct 5, 2017

Comment author: @dra27

The Unicode changes for 4.06.0 caused me to revisit this PR today. I discovered that the behaviour changed since 4.03.0 and so did some further investigation.

The difference in behaviour between the toplevel and a compiled program for the test Printf.printf "\xe2\x86\x95" was eliminated by e60a2db which was the fix for #6521. The toplevel calls String.escaped which means that the C runtime function setlocale was called in the toplevel, but not in a compiled program - if the compiled program included, say let _f = String.escaped "foo" in Printf ... then it behaved the same as the toplevel.

This reduces the problem to displaying too many extra characters, but at least the correct ones.

I can confirm @Nevor's finding that Windows 10 (certainly from Windows 10 1511, possibly even earlier) does not exhibit this problem - the C program included in this PR works - i.e. WriteConsole now returns the number of bytes written. Windows Server 2012 (i.e. Windows 8) is definitely broken - I don't have access right now to a Windows Server 2012 R2 box or a vanilla Windows 10 RTM, but I would lay odds this was quietly fixed by Microsoft either as part of the overhaul of the console in Windows 10, or slightly later in preparation for the Linux Subsystem.

I'm on the cusp of agreeing that we should close this as "won't fix" (or at least "won't workaround"), but I'm going to hold off for a bit until the Windows Unicode work for 4.06.0 is completed.

vicuna commented Oct 5, 2017

Comment author: @dra27

Just for the record, Windows 10 1507 (i.e. 10240; RTM) also works correctly.

@github-actions

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.
