Anonymous | Login | Signup for a new account | 2018-04-26 09:24 CEST | ![]() |
Main | My View | View Issues | Change Log | Roadmap |
View Issue Details [ Jump to Notes ] | [ Issue History ] [ Print ] | |||||||||||
ID | Project | Category | View Status | Date Submitted | Last Update | |||||||
0003771 | OCaml | platform support (windows, cross-compilation, etc) | public | 2005-08-26 00:07 | 2017-09-20 16:05 | |||||||
Reporter | administrator | |||||||||||
Assigned To | dra | |||||||||||
Priority | normal | Severity | major | Reproducibility | always | |||||||
Status | resolved | Resolution | fixed | |||||||||
Platform | OS | OS Version | ||||||||||
Product Version | ||||||||||||
Target Version | 4.06.0 +dev/beta1/beta2/rc1 | Fixed in Version | 4.06.0 +dev/beta1/beta2/rc1 | |||||||||
Summary | 0003771: Reading Unicode filenames fails on Windows | |||||||||||
Description | Full_Name: spiralvoice Version: 3.08.4 OS: Windows/MinGW Submission from: p5481eb87.dip.t-dialin.net (84.129.235.135) Hi, in otherlibs\win32unix\windir.c the functions win_findfirst and win_findnext use WIN32_FIND_DATA which is not Unicode aware: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/win32_find_data_str.asp [^] | |||||||||||
Tags | patch | |||||||||||
Attached Files | ![]() ![]() | |||||||||||
![]() |
||||||||||||||||
|
![]() |
|
(0007832) ygrek (reporter) 2012-07-30 12:07 |
Here is an old update to mldonkey's patch which does not rely on iconv and works(ed) with msvc : http://ygrek.org.ua/p/ocaml_unicode.html [^] It worked ok in 2009, but YMMV today. |
(0008058) doligez (administrator) 2012-09-11 09:55 |
Uploaded the patch referenced by ygrek. |
(0009671) doligez (administrator) 2013-07-02 16:06 |
This is such a huge patch :-( |
(0009794) ygrek (reporter) 2013-07-17 06:46 |
It is unavoidable, because it is exactly that all runtime functions have to be tweaked.. Do you have any recommendations to enhance the patch and/or to make it more acceptable? |
(0011831) ygrek (reporter) 2014-07-16 08:54 |
see also https://github.com/mfp/win32unixw [^] |
(0012630) johnwhitington (reporter) 2014-12-03 16:57 |
Just a quick note to say I ran into this problem with a customer today, with our command line tools. I had to ask them to write a script to rename the file, process it, and then rename it back, which isn't ideal. What are the barriers to having this patch included? I would be happy to test it, though I've never built OCaml from scratch on Windows before, so it might take some time. |
(0012649) frisch (developer) 2014-12-04 11:32 |
Can someone explain what the patch does? Does it have the effect of interpreting all filenames (in stdlib/unix) as utf8-encoded strings? Can this break existing code? There seems to be some fallback into interpreting paths differently when not valid utf; what's the rationale? etc... |
(0012695) ygrek (reporter) 2014-12-07 08:47 edited on: 2014-12-07 08:48 |
Windows api has two variants of every function that deals with filenames, e.g. RenameFileA and RenameFileW, the first one treats filename as encoded in 1-byte current "locale", the second accepts UTF16. The patch switches (maybe not all places) OCaml runtime to use the second family of functions (through MS specific (IIUC) crt wrappers). The fallback is there to specifically preserve backwards compatibility - if the string is not utf8 - it is treated as 1-byte encoded, otherwise as utf8. Probably we could remove this fallback and enable it unconditionally based on env variable, or vice-versa - leave the current mode a default and switch to utf8 only on request (function in Sys?). As for backwards compatibility - the breakage may occur if the code assumes 1-byte encoded filenames and builds filenames from other sources other than the readdir (which in utf8 mode would return utf8 strings). For simple programs - which read the filenames from filesystem and then use those filenames - there would be no breakage when switching to utf8. I personally would vote for switching default to utf8 and leaving a knob for old code. Let me know what you think, I am not developing on windows anymore, but I am willing to spend time on this patch to see it integrated. |
(0012696) ygrek (reporter) 2014-12-07 08:50 |
BTW the first version of this patch (iconv-dependent) is used in mldonkey, the second one was used in commercial windows application. I wonder how unison people deal with this problem.. |
(0012697) frisch (developer) 2014-12-07 11:35 |
Thanks ygrek for the useful background information. The patch would easily break existing code. For instance, in LexiFi application, we store filenames (currently encoded in the current "locale") in databases, we retrieve them from file pickers (using .Net dialogs and mapping .Net strings into OCaml Latin-1 strings, failing on non Latin-1 ones), etc. It would make sense to switch all filename-related strings to utf8 internally, but this requires some non-trivial migration path. I assume most existing deployed OCaml Windows applications would suffer from the same problem. The fallback is a little bit dangerous. Interpreting OCaml strings in the current locales when not valid utf8 is probably harmless in practice; but there would still the problem that strings returned by Sys/Unix module (e.g. from readdir) would now be encoded in utf8, and application need to be adapated to deal with that (e.g. if they put that in a GUI). A global mode switch might be preferable (perhaps in addition to the fallback when utf8 mode is enabled). It would indeed be interesting to hear about how e.g. Unison deal currently with the problem. |
(0012723) ygrek (reporter) 2014-12-09 05:47 |
Unison has separate windows specific wrapper that uses Unicode-family of winapi, see https://github.com/nitrous-io/unison/blob/master/system/system_win_stubs.c [^] (not official repo) |
(0012726) frisch (developer) 2014-12-09 10:02 |
I'm convinced about the need for such a change. One general question: which of the 6 Windows ports should be impacted? Clearly msvc32/64. Also mingw32/64, I guess. What about Cygwin32/64? About the patch itself: - There are two copies of u8tou16 (one for the core runtime system, one for win32unix). Couldn't they be merged? - I'd propose to define shared helpers so that code such as: char * temp=String_val(name); WCHAR * wtemp; int retcode; if(is_valid_utf8(temp)) wtemp = utf8_to_utf16(temp); else wtemp = ansi_to_utf16(temp); can be written as: WCHAR *wtemp = to_u16(String_val(name)); and: tempo = utf16_to_utf8(fileinfo.cFileName); valname = copy_string(tempo); free(tempo); becomes: valname = caml_string_u8_of_u16(fileinfo.cFileName); These helpers will also help for checking e.g. a global flag that control the desired behavior. - One should use functions such as caml_stat_free, etc. - The patch conflicts with changes related to the global runtime lock (e.g. in caml_sys_rename). Actually, the new code looks like: char * p_old; char * p_new; int ret; p_old = caml_strdup(String_val(oldname)); p_new = caml_strdup(String_val(newname)); caml_enter_blocking_section(); ret = rename(p_old, p_new); caml_leave_blocking_section(); caml_stat_free(p_new); caml_stat_free(p_old); which, in essence, is closer to what needs to be done on Windows (i.e. copy the string, and finally free the copy). This should give an opportunity for more sharing between the Windows and non-Windows versions (defining a macro/typedef that expand either to "char*" or "WCHAR*" depending on the system, and functions to map between OCaml string values and that type). - Is there a point to have a specific macro UTF16? Shouldn't it be equivalent to e.g. _WIN32? The code under UTF16 uses the Windows API, it couldn't be used under non-Windows systems, and I don't think that we should give the option to disable the UTF16 stuff for Windows port at configure time. - Is it relevant to check e.g. for #ifdef HAS_GETCWD in the Windows/UTF16 code path? |
(0012727) ygrek (reporter) 2014-12-09 10:20 |
I believe cygwin doesn't need this change - it should take care of that as it pretends to be unix. Agreed on the helpers. Will need to check whether mingw port has _WIN32 defined and cygwin doesn't. It will take me some time to setup windows environment, hold on. |
(0012728) frisch (developer) 2014-12-09 10:22 edited on: 2014-12-09 10:28 |
> I believe cygwin doesn't need this change - it should take care of that as it pretends to be unix. That was my guess, but as far as I understand, Unix doesn't say anything about the encoding of filenames (they are just sequences of bytes with some disallowed bytes), so I wonder how they define the mapping. [edit] Found some information here: https://cygwin.com/cygwin-ug-net/using-specialnames.html [^] It seems that the mapping depends on the current locale. This seems ugly, but this is a quite independent problem from the current proposal. > It will take me some time to setup windows environment, hold on. No problem, and thanks for your effort! |
(0012729) dbuenzli (reporter) 2014-12-09 10:36 |
I didn't have a close look but the UTF bits look dodgy, e.g. UTF8_LENGTH tells us that an UTF-8 sequence can be 5 to 6 bytes long which is wrong, 4 bytes is the max. Also UNICODE_VALID seems to exclude non-characters but I don't think it should (usually you want to deal with the full range of Unicode scalar values). |
(0012730) frisch (developer) 2014-12-09 13:27 |
Couldn't we use MultiByteToWideChar and WideCharToMultiByte to convert between WCHAR and UTF-8 ? |
(0012734) ygrek (reporter) 2014-12-09 17:40 |
that's what the patch does actually |
(0012755) frisch (developer) 2014-12-10 10:22 |
Ah yes, sorry, I was confused by the beginning of u8tou16 (and dbuenzli's comment); this code is only used to validate UTF-8. Cannot this be achieved with MultiByteToWideChar (with dwflags = MB_ERR_INVALID_CHARS)? The doc says: << Starting with Windows Vista, this function fully conforms with the Unicode 4.1 specification for UTF-8 and UTF-16. >> so I assume that, except for older Windows, the function would perform a full validity check (I don't know exactly what it does for XP). |
(0012761) frisch (developer) 2014-12-11 14:14 |
What about environment variables? It's quite common to pass filenames/paths in environment variables. Shouldn't we switch to GetEnvironmentVariableW and co? |
(0012762) whitequark (developer) 2014-12-11 14:22 |
Same concern applies for WinMainW. |
(0013423) ygrek (reporter) 2015-03-10 09:07 |
https://github.com/ocaml/ocaml/pull/153 [^] addresses Alain's comments. TODO remaining : exposing runtime switch and use winapi for utf8 validation |
(0017305) xleroy (administrator) 2017-02-17 15:42 |
Tentatively shooting for release 4.06. PR#153 on Github is a first cut but needs some work. Any help welcome. |
(0017805) cfranchini (reporter) 2017-05-18 17:49 |
Hi, after some work. I rebase and rewrite the patch to be applied on current repository. I attached to that ticket a pack.tag.gz which contain 3 patches. It's the same patch but applied to three different commits. * 4.04.patch patch for branch 4.04 apply on commit 34400b16acd1e9dd932d065f76278330f335def2 * 4.05.patch patch for branch 4.05 apply on commit dafb25944630a7f42d2099f887e71bf43cebc52c * trunk.patch patch for branch trunk apply on commit 472ee2c0ff40ce1daa5afb27cf2077e2df83590e I also add my test folder utf16-test which contains different tests for those commits. I hope it would help. |
(0017852) frisch (developer) 2017-06-08 10:57 edited on: 2017-06-08 10:57 |
Thanks for putting some energy into it. (I think that you can safely focus on working with trunk. Such a large change is unlikely to be integrated on a release branch.) I did not look at the updated patch closely, but it seems to me that several functions still need to be switched to the Unicode-aware variants. In particular, all accesses to the environment (getenv/putenv/environ), and functions to create process (execv and friend, and CreateProcess). Also, we probably need to do something for the command-line arguments of the current process. Can you confirm? |
(0017866) nojebar (developer) 2017-06-12 07:16 |
I opened a new PR at https://github.com/ocaml/ocaml/pull/1200 [^] with cfranchini's trunk patch with the intention of pushing this through and getting it to a mergeable state. So far: - Fixed several issues arising from the rebase/rewrite operation to the latest trunk - Switched to using WinAPI for UTF8 validation and also fixed a couple of bugs in the existing use of WinAPI for UTF-16->UTF-8 conversion. - Integrated cfranchini's tests into the OCaml testsuite. - Added wrappers for Sys.getenv and Sys.command Apart from completing the patch (wrapping environment, process creation functions), one key point to think about next: how to minimize/cleanup the use of #ifdefs and make the UTF8 conversion a runtime switch. |
(0017867) dra (developer) 2017-06-12 09:56 |
Regarding the runtime switch - especially with things implemented as #define's at the moment, this is going to be a fair amount of work resulting in quite a lot of code duplication (at least some functions will need to be compiled twice) and we'd be doing this to maintain a totally broken way of handling filenames from before. I'm toying with the proposal of making switching to UTF-8 a blocking part on Windows of upgrading to this version of OCaml, so not providing compatibility with the old behaviour (obviously, we could provide documentation and guidance on what you'd need to do!). So, rather than having a runtime switch, could we consider having some kind of new compile time action which is required - i.e. some way of declaring to the compiler that you are aware of the implications of the UTF-8 change and which triggers a warning (or possibly even error?) if you use any of the affected functions? That could be a function you need to call (e.g. a UW IMAP-style Sys.i_accept_the_risk ()), or a module/linker flags which must also be included (e.g. -cclib -lwin32utf8 or win32utf8.cmo), or an actual flag to ocamlc/ocamlopt (e.g. -win32-utf8). We could then in a few versions time relax this requirement (it's like the reverse of a deprecation warning). Windows programs which want to compile both with this and with old versions will certainly have to worry about the two different modes, so I'm not necessarily seeing the downside of embracing this in the compiler itself? Programs which don't use filename-related functions at all would be unaffected. |
(0017868) nojebar (developer) 2017-06-12 11:04 |
I agree that there isn't an obvious way to do the runtime switch without overly complicating or duplicating the code. It would be nice to reach a consensus on whether a compile-time or compiler switch is a good solution before refactoring this part of the code. |
(0017869) dra (developer) 2017-06-12 11:06 |
Alain - thoughts? |
(0017870) dbuenzli (reporter) 2017-06-12 11:10 |
What behaviour does exactly change with the fix ? Isn't it the case that before you could simply not deal with filenames beyond ASCII ? If that is the case I'm not sure you need a compat story on that. |
(0017872) nojebar (developer) 2017-06-12 11:19 |
Currently ("before" this patch) OCaml strings given to Unix/Sys functions are passed verbatim to the ANSI Windows API, which means they are interpreted in the current Windows codepage (typically Latin-1 in Western Europe). This means that you cannot deal with filenames outside the Latin-1 (not ASCII) range. This patch fixes this problem by translating on the OCaml side between UTF-8 and UTF-16 and using the UTF-16 Windows API. The problem is that this change can break code that uses Latin-1 strings (non ASCII Latin-1 strings are not valid UTF-8) if it is not adapted to re-encode those strings to UTF-8 before passing them to OCaml library functions. |
(0017877) dbuenzli (reporter) 2017-06-13 20:02 |
I see. It seems to me that the sanest way is to simply provide the functionality under an ocaml configure option rather than try to fiddle with this at the ocaml runtime level or worse to add more flags to `ocamlc`. In general it seems better if we can avoid to put the burden on build systems for this. It guess it should be easy to ifdef the function that you interpose between the syscalls to choose between treating the result of `String_val` as Latin-1 or UTF-8. The default could be kept to Latin-1 for one or two versions major versions and then switched to UTF-8. Meanwhile the UTF-8 variant could be published under another opam switch name say +win-utf-8. |
(0017879) nojebar (developer) 2017-06-14 17:50 |
A little update on this patch. After discussion with Alain, we settled on the following approach. - Switch the runtime system, under Windows, to use Unicode API. OCaml strings passed to the stdlib will be interpreted, either as 1) UTF-8 (in "modern" mode), or 2) current codepage (in "legacy" mode). In both cases they are re-encoded into wide strings using Windows API and then passed to the Unicode API. - Same thing going in the other direction: UTF-16 strings returned by the Unicode API are re-encoded as 1) UTF-8 in "modern" mode or 2) current codepage in "legacy" mode. - The switch between "modern" and "legacy" can be modified at runtime by calling Sys.enable_windows_unicode_runtime. - In fact, there will be a third mode, "transitional" which is like "modern" but falls back to re-encoding from current codepage if the OCaml string if not valid UTF-8. This mode comes with a runtime warning when passing invalid UTF-8. - For those bits of code that are compiled under both Linux and Windows a very small compatibility header is necessary. |
(0017881) dbuenzli (reporter) 2017-06-14 18:54 |
TBH I'm not sure I see the advantage of being able to do this at runtime. I think it's going to give us a longer compatibility story and more erratic behaviour for everyone. If we do it statically packages could gradually assert they need the ocaml compiler with UTF-8 and refuse to install in Latin-1. With the runtime switch everyone out there is basically left guessing, from end user to program and package devs. And if you want to be robust you need to check which variant you are actually running which puts the work on the program code. |
(0017900) nojebar (developer) 2017-06-15 14:48 |
To be clear, if we agree on the basic design of switching unconditionally to the Unicode API and doing the UTF-8 or CODEPAGE conversion on the OCaml side (the first two points in my previous note), the (runtime) switch comes for free. We can discuss how to expose it; it may very well be the case that a static switch is the best for opam-delivered compilers. Having said that, I think one argument for having a runtime switch was that it can help incremental migration for large codebases where it is not feasible to do it all at once due to the need to preserve backwards compatibility (this is the case at LexiFi), as well as making it easier to test (for example, the current testsuite in the PR is testing both the "legacy" and "modern" codepaths). |
(0017901) dbuenzli (reporter) 2017-06-15 15:03 |
I still don't see any compelling argument to expose that to *programs* at the API level via Sys.enable_windows_unicode_runtime. If you need easier testing you could do this via an env var and/or compiling against a `-runtime-variant`. |
(0017902) nojebar (developer) 2017-06-15 15:04 |
Yes, I agree. Sys.enable_windows_unicode_runtime is just a proof of concept. |
(0018294) xleroy (administrator) 2017-09-20 16:05 |
Unicode filename support was added to trunk and will be in release 4.06. Feel free to test and reopen this MPR if the issue is still there. |
![]() |
|||
Date Modified | Username | Field | Change |
2005-11-18 10:14 | administrator | New Issue | |
2008-01-22 12:08 | doligez | Relationship added | parent of 0003786 |
2012-07-11 17:01 | doligez | Relationship added | parent of 0003789 |
2012-07-11 17:04 | doligez | Target Version | => 4.01.0+dev |
2012-07-11 17:04 | doligez | Description Updated | View Revisions |
2012-07-29 18:02 | frisch | Category | OCaml general => OCaml windows |
2012-07-30 12:07 | ygrek | Note Added: 0007832 | |
2012-07-31 13:37 | doligez | Target Version | 4.01.0+dev => 4.00.1+dev |
2012-09-11 09:55 | doligez | File Added: ocaml_unicode_mod_20080421.patch | |
2012-09-11 09:55 | doligez | Note Added: 0008058 | |
2012-09-11 09:55 | doligez | Target Version | 4.00.1+dev => 4.00.2+dev |
2013-07-02 16:06 | doligez | Note Added: 0009671 | |
2013-07-02 16:06 | doligez | Target Version | 4.00.2+dev => 4.01.0+dev |
2013-07-16 16:33 | frisch | Target Version | 4.01.0+dev => later |
2013-07-17 06:46 | ygrek | Note Added: 0009794 | |
2013-07-24 11:42 | doligez | Target Version | later => 4.02.0+dev |
2013-09-03 16:11 | doligez | Tag Attached: patch | |
2014-07-16 08:54 | ygrek | Note Added: 0011831 | |
2014-07-21 23:12 | doligez | Target Version | 4.02.0+dev => 4.02.1+dev |
2014-09-04 00:25 | doligez | Target Version | 4.02.1+dev => undecided |
2014-09-18 13:45 | doligez | Target Version | undecided => 4.02.2+dev / +rc1 |
2014-12-03 16:57 | johnwhitington | Note Added: 0012630 | |
2014-12-04 11:32 | frisch | Note Added: 0012649 | |
2014-12-07 08:47 | ygrek | Note Added: 0012695 | |
2014-12-07 08:48 | ygrek | Note Edited: 0012695 | View Revisions |
2014-12-07 08:50 | ygrek | Note Added: 0012696 | |
2014-12-07 11:35 | frisch | Note Added: 0012697 | |
2014-12-08 10:45 | frisch | Relationship added | related to 0006695 |
2014-12-09 05:47 | ygrek | Note Added: 0012723 | |
2014-12-09 10:02 | frisch | Note Added: 0012726 | |
2014-12-09 10:20 | ygrek | Note Added: 0012727 | |
2014-12-09 10:22 | frisch | Note Added: 0012728 | |
2014-12-09 10:28 | frisch | Note Edited: 0012728 | View Revisions |
2014-12-09 10:36 | dbuenzli | Note Added: 0012729 | |
2014-12-09 13:27 | frisch | Note Added: 0012730 | |
2014-12-09 17:40 | ygrek | Note Added: 0012734 | |
2014-12-10 10:22 | frisch | Note Added: 0012755 | |
2014-12-11 14:14 | frisch | Note Added: 0012761 | |
2014-12-11 14:22 | whitequark | Note Added: 0012762 | |
2015-01-15 18:51 | doligez | Target Version | 4.02.2+dev / +rc1 => 4.02.3+dev |
2015-03-10 09:07 | ygrek | Note Added: 0013423 | |
2015-07-15 16:34 | doligez | Target Version | 4.02.3+dev => 4.03.0+dev / +beta1 |
2016-01-27 09:32 | frisch | Severity | minor => major |
2016-01-27 09:32 | frisch | Target Version | 4.03.0+dev / +beta1 => later |
2016-02-08 18:34 | doligez | Target Version | later => 4.03.1+dev |
2017-02-16 14:00 | doligez | Target Version | 4.03.1+dev => undecided |
2017-02-17 15:42 | xleroy | Note Added: 0017305 | |
2017-02-17 15:42 | xleroy | Target Version | undecided => 4.06.0 +dev/beta1/beta2/rc1 |
2017-02-23 16:46 | doligez | Category | OCaml windows => platform support (windows, etc) |
2017-02-23 17:16 | doligez | Category | platform support (windows, etc) => platform support (windows, cross-compilation, etc) |
2017-03-10 11:22 | shinwell | Assigned To | => dra |
2017-03-10 11:22 | shinwell | Status | acknowledged => assigned |
2017-05-18 17:46 | cfranchini | File Added: pack.tar.gz | |
2017-05-18 17:49 | cfranchini | Note Added: 0017805 | |
2017-06-08 10:57 | frisch | Note Added: 0017852 | |
2017-06-08 10:57 | frisch | Note Edited: 0017852 | View Revisions |
2017-06-12 07:16 | nojebar | Note Added: 0017866 | |
2017-06-12 09:56 | dra | Note Added: 0017867 | |
2017-06-12 11:04 | nojebar | Note Added: 0017868 | |
2017-06-12 11:06 | dra | Note Added: 0017869 | |
2017-06-12 11:10 | dbuenzli | Note Added: 0017870 | |
2017-06-12 11:19 | nojebar | Note Added: 0017872 | |
2017-06-13 20:02 | dbuenzli | Note Added: 0017877 | |
2017-06-14 17:50 | nojebar | Note Added: 0017879 | |
2017-06-14 18:54 | dbuenzli | Note Added: 0017881 | |
2017-06-15 14:48 | nojebar | Note Added: 0017900 | |
2017-06-15 15:03 | dbuenzli | Note Added: 0017901 | |
2017-06-15 15:04 | nojebar | Note Added: 0017902 | |
2017-09-20 16:05 | xleroy | Note Added: 0018294 | |
2017-09-20 16:05 | xleroy | Status | assigned => resolved |
2017-09-20 16:05 | xleroy | Resolution | open => fixed |
2017-09-20 16:05 | xleroy | Fixed in Version | => 4.06.0 +dev/beta1/beta2/rc1 |
Copyright © 2000 - 2011 MantisBT Group |