String.escaped returns strange results in Mac OS X + LANG=ja_JP.UTF-8 #6521
Comments
Comment author: furuse The spec of setlocale I found is here: http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html
Comment author: @mshinwell Jun, is this a new bug in 4.02?
Comment author: @dbuenzli The bug is also present before 4.02. Reading the documentation of String.escaped, and given the encoding of OCaml's strings, I expect String.escaped to escape only the unprintable characters of ISO-8859-1 (i.e. the gray unlabelled boxes here: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout), that is, those that do not satisfy (0x20 <= c && c <= 0x7E || 0xA0 <= c && c <= 0xFF). So the current behaviour with LANG=ja_JP.UTF-8 seems fine to me; it's just the behaviour with LANG=C that is not. Now, if historically (what did the original authors expect?) and statistically the function has given us the behaviour of LANG=C, then I also suggest hard-coding the check to ASCII printable characters. Example with my own locale (the right behaviour to me) and then with LANG=C, on OS X 10.9.4 with 4.01.0:
let s = String.escaped "\233\171\152";;
val s : string = "?\152"
Char.code s.[0];;
Char.code s.[1];;
let s = String.escaped "\233\171\152";;
val s : string = "\233\171\152"
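The two candidate notions of "printable" discussed in this comment can be written as plain byte-range checks. This is only an illustrative sketch: `is_printable_latin1` and `is_printable_ascii` are hypothetical names, not functions from the standard library or the runtime.

```ocaml
(* Hypothetical helpers, for illustration only. *)

(* Printable in ISO-8859-1: the range quoted in the comment above. *)
let is_printable_latin1 c =
  let n = Char.code c in
  (0x20 <= n && n <= 0x7E) || (0xA0 <= n && n <= 0xFF)

(* Printable in ASCII only. *)
let is_printable_ascii c =
  let n = Char.code c in
  0x20 <= n && n <= 0x7E

(* Classify each byte of the example string under both predicates. *)
let () =
  String.iter
    (fun c ->
      Printf.printf "%d: latin1=%b ascii=%b\n"
        (Char.code c) (is_printable_latin1 c) (is_printable_ascii c))
    "\233\171\152"
```

Under the ISO-8859-1 check, bytes 233 and 171 count as printable and 152 does not; under the ASCII check, none of the three bytes is printable.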
Comment author: @dbuenzli Well, actually, given the doc: "Return a copy of the argument, with special characters represented by escape sequences, following the lexical conventions of OCaml." We can say that both answers in the example above are in fact valid, in the sense that, interpreted by an OCaml compiler, both strings denote the same sequence of bytes.
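The claim that both results denote the same bytes can be checked with `Scanf.unescaped`, which interprets OCaml escape sequences. In this sketch, `latin1_style` is an assumption about what the first result contains (raw bytes 233 and 171 followed by the four characters `\152`); the transcript above does not show it legibly.

```ocaml
(* Assumed shape of the result under the non-C locale:
   two raw bytes (0xE9, 0xAB) followed by the escape sequence \152. *)
let latin1_style = "\233\171\\152"

(* Result under LANG=C: every byte written as a decimal escape. *)
let ascii_style = "\\233\\171\\152"

(* Both forms decode to the same three-byte string. *)
let () =
  assert (Scanf.unescaped latin1_style = "\233\171\152");
  assert (Scanf.unescaped ascii_style = "\233\171\152");
  assert (String.length (Scanf.unescaped ascii_style) = 3)
```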
Comment author: furuse Yes, both outputs are valid... I think we should choose one of them. It is very confusing that the OCaml runtime is affected by LANG. I prefer escaping everything that is not ASCII-printable, since
We could choose to quote only non-printable ISO-8859-1 characters, but in that case I would like to have escaped_to_ASCII too.
Comment author: @dbuenzli Yes to everything (even you being selfish). If we choose one, I'd also be absolutely in favour of escaping all the non-ASCII-printable characters. UTF-8 compatibility of the returned string is the argument. There's no real point against that if we want a forward-looking solution.
Comment author: @alainfrisch I'm also in favor of the change. FWIW, we already have it in LexiFi's version, to avoid introducing different behaviors between platforms for such a basic function.
Comment author: @damiendoligez Strings are supposed to be encoding-agnostic, and certainly not officially ISO-8859-1. Definitely change it to escape all non-ASCII-printable.
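A locale-independent escaping function along the lines agreed on here might look like the following. This is a minimal sketch, not the code committed to trunk: `escape_ascii` is a hypothetical name, and the set of named escapes is chosen to follow OCaml's lexical conventions.

```ocaml
(* Hypothetical sketch: escape every byte that is not printable ASCII,
   without consulting the C locale. *)
let escape_ascii s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string b "\\\""
      | '\\' -> Buffer.add_string b "\\\\"
      | '\n' -> Buffer.add_string b "\\n"
      | '\t' -> Buffer.add_string b "\\t"
      | '\r' -> Buffer.add_string b "\\r"
      | '\b' -> Buffer.add_string b "\\b"
      (* Remaining printable ASCII is copied unchanged. *)
      | ' ' .. '~' -> Buffer.add_char b c
      (* Everything else, including all bytes >= 0x80, becomes \ddd. *)
      | _ -> Buffer.add_string b (Printf.sprintf "\\%03d" (Char.code c)))
    s;
  Buffer.contents b
```

Because the output contains only bytes in the range 0x20 to 0x7E, it is valid ASCII and therefore also valid UTF-8, whatever LANG is set to.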
Comment author: @damiendoligez This will be an incompatible change, so I'm pushing it back to 4.03.
Comment author: @damiendoligez Fixed in trunk (commit 15901). Note that I also changed
Original bug ID: 6521
Reporter: furuse
Status: closed (set by @damiendoligez on 2015-03-11T19:15:48Z)
Resolution: fixed
Priority: normal
Severity: major
Version: 4.02.0+beta1 / +rc1
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: runtime system and C interface
Tags: junior_job
Related to: #6925
Monitored by: @gasche
Bug description
On Mac OS X, with LANG=ja_JP.UTF-8, String.escaped does not escape some characters >= 0x80. It seems that ISO-8859-1 printable characters are left unescaped in this setting. See janestreet/sexplib#11 for details. String.escaped is LANG-dependent, and in ja_JP.UTF-8 (and probably in other UTF-8 locales too) its results are not valid UTF-8. This is strange even given that OCaml's strings are, unofficially, ISO-8859-1 rather than UTF-8.
The documentation comment of String.escaped does not clearly state which characters are escaped. I long thought it escapes the characters that are not printable ASCII, but apparently that is not the case in the above setting. The function internally calls caml_is_printable(), which uses setlocale(LC_CTYPE, ""). I am not an i18n guru, but the spec of setlocale says:
"C" Same as POSIX.
"" : Specifies an implementation-dependent native environment. For XSI-conformant systems, this corresponds to the value of the associated environment variables, LC_* and LANG; see the XBD specification, Locale and the XBD specification, Environment Variables .
It seems that the behaviour of isprint() is implementation-dependent when LC_CTYPE is set with "". This might explain what we see on Mac OS X with LANG=ja_JP.UTF-8.
I propose the following: