Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0003771OCamlOCaml windowspublic2005-08-26 00:072014-12-11 19:02
Reporteradministrator 
Assigned To 
PrioritynormalSeverityminorReproducibilityalways
StatusacknowledgedResolutionopen 
PlatformOSOS Version
Product Version 
Target Version4.02.2+devFixed in Version 
Summary0003771: Reading Unicode filenames fails on Windows
DescriptionFull_Name: spiralvoice
Version: 3.08.4
OS: Windows/MinGW
Submission from: p5481eb87.dip.t-dialin.net (84.129.235.135)


Hi,

in otherlibs\win32unix\windir.c the functions win_findfirst and win_findnext
use WIN32_FIND_DATA which is not Unicode aware:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/win32_find_data_str.asp [^]

Tagspatch
Attached Filespatch file icon ocaml_unicode_mod_20080421.patch [^] (43,304 bytes) 2012-09-11 09:55 [Show Content]

- Relationships
parent of 0003786acknowledged Addition to bug 0003771 
parent of 0003789acknowledged Addition to bug 0003771 
related to 0006695new Do not treat paths as encoded in ISO-8859-1 
Not all the children of this issue are yet resolved or closed.

-  Notes
(0007832)
ygrek (reporter)
2012-07-30 12:07

Here is an old update to mldonkey's patch which does not rely on iconv and works(ed) with msvc : http://ygrek.org.ua/p/ocaml_unicode.html [^]
It worked ok in 2009, but YMMV today.
(0008058)
doligez (administrator)
2012-09-11 09:55

Uploaded the patch referenced by ygrek.
(0009671)
doligez (administrator)
2013-07-02 16:06

This is such a huge patch :-(
(0009794)
ygrek (reporter)
2013-07-17 06:46

It is unavoidable, because it is exactly that all runtime functions have to be tweaked.. Do you have any recommendations to enhance the patch and/or to make it more acceptable?
(0011831)
ygrek (reporter)
2014-07-16 08:54

see also https://github.com/mfp/win32unixw [^]
(0012630)
johnwhitington (reporter)
2014-12-03 16:57

Just a quick note to say I ran into this problem with a customer today, with our command line tools. I had to ask them to write a script to rename the file, process it, and then rename it back, which isn't ideal.

What are the barriers to having this patch included? I would be happy to test it, though I've never built OCaml from scratch on Windows before, so it might take some time.
(0012649)
frisch (developer)
2014-12-04 11:32

Can someone explain what the patch does? Does it have the effect of interpreting all filenames (in stdlib/unix) as utf8-encoded strings? Can this break existing code? There seems to be some fallback into interpreting paths differently when not valid utf; what's the rationale? etc...
(0012695)
ygrek (reporter)
2014-12-07 08:47
edited on: 2014-12-07 08:48

Windows api has two variants of every function that deals with filenames, e.g. RenameFileA and RenameFileW, the first one treats filename as encoded in 1-byte current "locale", the second accepts UTF16. The patch switches (maybe not all places) OCaml runtime to use the second family of functions (through MS specific (IIUC) crt wrappers). The fallback is there to specifically preserve backwards compatibility - if the string is not utf8 - it is treated as 1-byte encoded, otherwise as utf8.

Probably we could remove this fallback and enable it unconditionally based on env variable, or vice-versa - leave the current mode a default and switch to utf8 only on request (function in Sys?). As for backwards compatibility - the breakage may occur if the code assumes 1-byte encoded filenames and builds filenames from other sources other than the readdir (which in utf8 mode would return utf8 strings). For simple programs - which read the filenames from filesystem and then use those filenames - there would be no breakage when switching to utf8.

I personally would vote for switching default to utf8 and leaving a knob for old code. Let me know what you think, I am not developing on windows anymore, but I am willing to spend time on this patch to see it integrated.

(0012696)
ygrek (reporter)
2014-12-07 08:50

BTW the first version of this patch (iconv-dependent) is used in mldonkey, the second one was used in commercial windows application. I wonder how unison people deal with this problem..
(0012697)
frisch (developer)
2014-12-07 11:35

Thanks ygrek for the useful background information. The patch would easily break existing code. For instance, in LexiFi application, we store filenames (currently encoded in the current "locale") in databases, we retrieve them from file pickers (using .Net dialogs and mapping .Net strings into OCaml Latin-1 strings, failing on non Latin-1 ones), etc. It would make sense to switch all filename-related strings to utf8 internally, but this requires some non-trivial migration path. I assume most existing deployed OCaml Windows applications would suffer from the same problem.

The fallback is a little bit dangerous. Interpreting OCaml strings in the current locales when not valid utf8 is probably harmless in practice; but there would still the problem that strings returned by Sys/Unix module (e.g. from readdir) would now be encoded in utf8, and application need to be adapated to deal with that (e.g. if they put that in a GUI).

A global mode switch might be preferable (perhaps in addition to the fallback when utf8 mode is enabled).

It would indeed be interesting to hear about how e.g. Unison deal currently with the problem.
(0012723)
ygrek (reporter)
2014-12-09 05:47

Unison has separate windows specific wrapper that uses Unicode-family of winapi, see https://github.com/nitrous-io/unison/blob/master/system/system_win_stubs.c [^] (not official repo)
(0012726)
frisch (developer)
2014-12-09 10:02

I'm convinced about the need for such a change.

One general question: which of the 6 Windows ports should be impacted? Clearly msvc32/64. Also mingw32/64, I guess. What about Cygwin32/64?

About the patch itself:

 - There are two copies of u8tou16 (one for the core runtime system, one for win32unix). Couldn't they be merged?

 - I'd propose to define shared helpers so that code such as:

  char * temp=String_val(name);
  WCHAR * wtemp;
  int retcode;
  if(is_valid_utf8(temp))
    wtemp = utf8_to_utf16(temp);
  else
    wtemp = ansi_to_utf16(temp);

  can be written as:

  WCHAR *wtemp = to_u16(String_val(name));

  and:

  tempo = utf16_to_utf8(fileinfo.cFileName);
  valname = copy_string(tempo);
  free(tempo);

  becomes:

  valname = caml_string_u8_of_u16(fileinfo.cFileName);

 
   These helpers will also help for checking e.g. a global flag that control the desired behavior.
  

 - One should use functions such as caml_stat_free, etc.

 - The patch conflicts with changes related to the global runtime lock (e.g. in caml_sys_rename). Actually, the new code looks like:

  char * p_old;
  char * p_new;
  int ret;
  p_old = caml_strdup(String_val(oldname));
  p_new = caml_strdup(String_val(newname));
  caml_enter_blocking_section();
  ret = rename(p_old, p_new);
  caml_leave_blocking_section();
  caml_stat_free(p_new);
  caml_stat_free(p_old);

  which, in essence, is closer to what needs to be done on Windows (i.e. copy the string, and finally free the copy). This should give an opportunity for more sharing between the Windows and non-Windows versions (defining a macro/typedef that expand either to "char*" or "WCHAR*" depending on the system, and functions to map between OCaml string values and that type).

 - Is there a point to have a specific macro UTF16? Shouldn't it be equivalent to e.g. _WIN32? The code under UTF16 uses the Windows API, it couldn't be used under non-Windows systems, and I don't think that we should give the option to disable the UTF16 stuff for Windows port at configure time.

 - Is it relevant to check e.g. for #ifdef HAS_GETCWD in the Windows/UTF16 code path?
(0012727)
ygrek (reporter)
2014-12-09 10:20

I believe cygwin doesn't need this change - it should take care of that as it pretends to be unix.

Agreed on the helpers.
Will need to check whether mingw port has _WIN32 defined and cygwin doesn't.
It will take me some time to setup windows environment, hold on.
(0012728)
frisch (developer)
2014-12-09 10:22
edited on: 2014-12-09 10:28

> I believe cygwin doesn't need this change - it should take care of that as it pretends to be unix.

That was my guess, but as far as I understand, Unix doesn't say anything about the encoding of filenames (they are just sequences of bytes with some disallowed bytes), so I wonder how they define the mapping.

[edit] Found some information here: https://cygwin.com/cygwin-ug-net/using-specialnames.html [^] It seems that the mapping depends on the current locale. This seems ugly, but this is a quite independent problem from the current proposal.




> It will take me some time to setup windows environment, hold on.

No problem, and thanks for your effort!

(0012729)
dbuenzli (reporter)
2014-12-09 10:36

I didn't have a close look but the UTF bits look dodgy, e.g. UTF8_LENGTH tells us that an UTF-8 sequence can be 5 to 6 bytes long which is wrong, 4 bytes is the max. Also UNICODE_VALID seems to exclude non-characters but I don't think it should (usually you want to deal with the full range of Unicode scalar values).
(0012730)
frisch (developer)
2014-12-09 13:27

Couldn't we use MultiByteToWideChar and WideCharToMultiByte to convert between WCHAR and UTF-8 ?
(0012734)
ygrek (reporter)
2014-12-09 17:40

that's what the patch does actually
(0012755)
frisch (developer)
2014-12-10 10:22

Ah yes, sorry, I was confused by the beginning of u8tou16 (and dbuenzli's comment); this code is only used to validate UTF-8. Cannot this be achieved with MultiByteToWideChar (with dwflags = MB_ERR_INVALID_CHARS)? The doc says: << Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. >> so I assume that, except for older Windows, the function would perform a full validity check (I don't know exactly what it does for XP).
(0012761)
frisch (developer)
2014-12-11 14:14

What about environment variables? It's quite common to pass filenames/paths in environment variables. Shouldn't we switch to GetEnvironmentVariableW and co?
(0012762)
whitequark (developer)
2014-12-11 14:22

Same concern applies for WinMainW.

- Issue History
Date Modified Username Field Change
2005-11-18 10:14 administrator New Issue
2008-01-22 12:08 doligez Relationship added parent of 0003786
2012-07-11 17:01 doligez Relationship added parent of 0003789
2012-07-11 17:04 doligez Target Version => 4.01.0+dev
2012-07-11 17:04 doligez Description Updated View Revisions
2012-07-29 18:02 frisch Category OCaml general => OCaml windows
2012-07-30 12:07 ygrek Note Added: 0007832
2012-07-31 13:37 doligez Target Version 4.01.0+dev => 4.00.1+dev
2012-09-11 09:55 doligez File Added: ocaml_unicode_mod_20080421.patch
2012-09-11 09:55 doligez Note Added: 0008058
2012-09-11 09:55 doligez Target Version 4.00.1+dev => 4.00.2+dev
2013-07-02 16:06 doligez Note Added: 0009671
2013-07-02 16:06 doligez Target Version 4.00.2+dev => 4.01.0+dev
2013-07-16 16:33 frisch Target Version 4.01.0+dev => later
2013-07-17 06:46 ygrek Note Added: 0009794
2013-07-24 11:42 doligez Target Version later => 4.02.0+dev
2013-09-03 16:11 doligez Tag Attached: patch
2014-07-16 08:54 ygrek Note Added: 0011831
2014-07-21 23:12 doligez Target Version 4.02.0+dev => 4.02.1+dev
2014-09-04 00:25 doligez Target Version 4.02.1+dev => undecided
2014-09-18 13:45 doligez Target Version undecided => 4.02.2+dev
2014-12-03 16:57 johnwhitington Note Added: 0012630
2014-12-04 11:32 frisch Note Added: 0012649
2014-12-07 08:47 ygrek Note Added: 0012695
2014-12-07 08:48 ygrek Note Edited: 0012695 View Revisions
2014-12-07 08:50 ygrek Note Added: 0012696
2014-12-07 11:35 frisch Note Added: 0012697
2014-12-08 10:45 frisch Relationship added related to 0006695
2014-12-09 05:47 ygrek Note Added: 0012723
2014-12-09 10:02 frisch Note Added: 0012726
2014-12-09 10:20 ygrek Note Added: 0012727
2014-12-09 10:22 frisch Note Added: 0012728
2014-12-09 10:28 frisch Note Edited: 0012728 View Revisions
2014-12-09 10:36 dbuenzli Note Added: 0012729
2014-12-09 13:27 frisch Note Added: 0012730
2014-12-09 17:40 ygrek Note Added: 0012734
2014-12-10 10:22 frisch Note Added: 0012755
2014-12-11 14:14 frisch Note Added: 0012761
2014-12-11 14:22 whitequark Note Added: 0012762


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker