Mantis Bug Tracker

ID: 0006521
Project: OCaml
Category: OCaml runtime system
View Status: public
Date Submitted: 2014-08-26 04:08
Last Update: 2014-09-15 15:58
Reporter: furuse
Assigned To:
Priority: normal
Severity: major
Reproducibility: always
Status: acknowledged
Resolution: open
Platform:
OS:
OS Version:
Product Version: 4.02.0+beta1 / +rc1
Target Version: 4.03.0+dev
Fixed in Version:
Summary: 0006521: String.escaped returns strange results in Mac OS X + LANG=ja_JP.UTF-8
Description: On Mac OS X, if LANG=ja_JP.UTF-8, String.escaped does not escape some characters >= 0x80. It seems that ISO-8859-1 printable chars are not escaped in this setting. See https://github.com/janestreet/sexplib/issues/11 for details. String.escaped is LANG dependent, and in ja_JP.UTF-8 (and probably in other UTF-8 locales too) its results are not valid UTF-8. This is strange even given that OCaml's string is unofficially in ISO-8859-1 rather than UTF-8.

The doc comment of String.escaped does not clearly state which chars are escaped. I long believed that it escapes everything except ASCII printable chars, but apparently it does not in the above setting. The function internally calls caml_is_printable(), which uses setlocale(LC_CTYPE, ""). I am not an i18n guru, but the spec of setlocale says:

---

"C" Same as POSIX.

"" : Specifies an implementation-dependent native environment. For XSI-conformant systems, this corresponds to the value of the associated environment variables, LC_* and LANG; see the XBD specification, Locale and the XBD specification, Environment Variables .

---

It seems that isprint() is implementation dependent when the locale has been set with setlocale(LC_CTYPE, ""). This might explain what we see in Mac OS X + LANG=ja_JP.UTF-8.

I propose the following:

* Clearly document what String.escaped returns. I think many believe that it returns strings that contain only ASCII printables.
* Change setlocale(LC_CTYPE, "") in caml_is_printable to setlocale(LC_CTYPE, "C") so that it becomes implementation independent.
* Or, simply hard-code the ASCII printable check (0x20 <= c && c <= 0x7E).
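
To make the last proposal concrete, here is a minimal OCaml sketch of an escaping function that hard-codes the ASCII printable range and therefore never consults the locale. The names is_ascii_printable and escaped_ascii are hypothetical and are not part of the standard library; the set of named escapes is an assumption modelled on OCaml's lexical conventions, not the runtime's actual code.

```ocaml
(* Hypothetical helper: true only for the ASCII printable range 0x20..0x7E. *)
let is_ascii_printable c = ' ' <= c && c <= '~'

(* Hypothetical locale-independent variant of String.escaped.
   Every byte outside 0x20..0x7E becomes a decimal escape \ddd. *)
let escaped_ascii s =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | '\n' -> Buffer.add_string buf "\\n"
      | '\t' -> Buffer.add_string buf "\\t"
      | '\r' -> Buffer.add_string buf "\\r"
      | '\b' -> Buffer.add_string buf "\\b"
      | c when is_ascii_printable c -> Buffer.add_char buf c
      | c -> Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c)))
    s;
  Buffer.contents buf

(* Prints \233\171\152 regardless of LANG. *)
let () = print_endline (escaped_ascii "\233\171\152")
```

Because the result contains only bytes in 0x20..0x7E, it is valid ASCII and hence also valid UTF-8 and ISO-8859-1.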
Tags: junior_job

Notes
(0012044)
furuse (reporter)
2014-08-26 04:10

The spec of setlocale I found is here: http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html
(0012045)
shinwell (developer)
2014-08-26 09:48

Jun, is this a new bug in 4.02?
(0012046)
dbuenzli (reporter)
2014-08-26 11:25
edited on: 2014-08-26 11:26

The bug is also present before 4.02.

Reading the documentation of String.escaped, and given the encoding of OCaml's strings, I expect String.escaped to escape only the unprintable characters of ISO-8859-1 (i.e. the gray unlabelled boxes here: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout). That is, those that are not in (0x20 <= c && c <= 0x7E || 0xA0 <= c && c <= 0xFF).
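
For comparison, the two candidate predicates can be written out in OCaml. This is only a sketch; the names is_ascii_printable and is_latin1_printable are hypothetical and do not exist in the standard library or the runtime.

```ocaml
(* Hypothetical predicate: ASCII printable range only. *)
let is_ascii_printable c =
  let n = Char.code c in
  0x20 <= n && n <= 0x7E

(* Hypothetical predicate: ISO-8859-1 printable characters,
   i.e. the ASCII printables plus 0xA0..0xFF. *)
let is_latin1_printable c =
  let n = Char.code c in
  (0x20 <= n && n <= 0x7E) || (0xA0 <= n && n <= 0xFF)

(* Show how each predicate classifies the bytes 233, 171 and 152. *)
let () =
  List.iter
    (fun c ->
      Printf.printf "%d: ascii=%b latin1=%b\n"
        (Char.code c) (is_ascii_printable c) (is_latin1_printable c))
    [ '\233'; '\171'; '\152' ]
```

Under the ISO-8859-1 predicate, bytes 233 (0xE9) and 171 (0xAB) count as printable and byte 152 (0x98) does not; under the ASCII predicate, none of the three is printable.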

So the current behaviour with LANG=ja_JP.UTF-8 seems fine to me; it's just the behaviour with LANG=C that is not. Now, if historically (what did the original authors expect?) and statistically the function has given us the behaviour of LANG=C, then I also suggest hard-coding the check to ASCII printable characters.

Example with my own locale (right behaviour to me) and then LANG=C on osx 10.9.4 with 4.01.0

> echo $LANG
fr_CH.UTF-8
> ocaml
        OCaml version 4.01.0

# let s = String.escaped "\233\171\152";;
val s : string = "?\\152"
# Char.code s.[0];;
- : int = 233
# Char.code s.[1];;
- : int = 171

> export LANG=C
> ocaml
        OCaml version 4.01.0

# let s = String.escaped "\233\171\152";;
val s : string = "\\233\\171\\152"

(0012047)
dbuenzli (reporter)
2014-08-26 11:30

Well actually given the doc:

Return a copy of the argument, with special characters represented by escape sequences, following the lexical conventions of OCaml.

We can say that both answers in the example above are in fact valid, in the sense that interpreted by an OCaml compiler both strings denote the same sequence of bytes.
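
This equivalence can be checked with Scanf.unescaped, which decodes escape sequences following OCaml's lexical conventions. A small sketch, where the two input strings are reconstructed by hand from the example outputs (raw bytes 233 and 171 followed by the four characters \152, versus the fully escaped form):

```ocaml
(* Output seen under the UTF-8 locale: two raw bytes, then the escape \152. *)
let locale_dependent = "\233\171" ^ "\\152"

(* Output seen under LANG=C: three decimal escapes, ASCII only. *)
let ascii_only = "\\233\\171\\152"

(* Both decode back to the same three bytes. *)
let () =
  assert (Scanf.unescaped locale_dependent = "\233\171\152");
  assert (Scanf.unescaped ascii_only = "\233\171\152")
```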
(0012048)
furuse (reporter)
2014-08-26 12:26
edited on: 2014-08-26 12:27

Yes, both outputs are valid... I think we should choose one of them. It is very confusing that the OCaml runtime is affected by LANG.

I prefer escaping everything except ASCII printables, since

* I have not checked it thoroughly, but it seems to be the behaviour on Linux with any LANG, which many have long been used to.
* Escaping everything except ASCII printables makes the result valid also as UTF-8 and in other encodings.
* Many use OCaml strings to store UTF-8 data, whether or not they know that they are officially in ISO-8859-1. Escaping everything except ASCII printables is meaningful to them too.
* I am selfish and I live in Asia :-)

We could choose to escape only non-printable ISO-8859-1 characters, but in that case I would like to have escaped_to_ASCII too.

(0012049)
dbuenzli (reporter)
2014-08-26 12:35

Yes to everything (even you being selfish).

If we choose one, I'd also be absolutely in favour of escaping all characters other than the ASCII printable ones. UTF-8 compatibility of the returned string is *the* argument. There's no real point against that if we want a forward-looking solution.
(0012053)
frisch (developer)
2014-08-27 14:14

I'm also in favor of the change. FWIW, we already have it in LexiFi's version, to avoid introducing different behaviors between platforms for such a basic function.
(0012054)
doligez (administrator)
2014-08-28 17:42

Strings are supposed to be encoding-agnostic, and certainly not officially iso-8859-1.

Definitely change it to escape all non-ascii-printable.
(0012137)
doligez (administrator)
2014-09-15 15:58

This will be an incompatible change, so I'm pushing it back to 4.03.

Issue History
Date Modified Username Field Change
2014-08-26 04:08 furuse New Issue
2014-08-26 04:10 furuse Note Added: 0012044
2014-08-26 09:48 shinwell Note Added: 0012045
2014-08-26 11:25 dbuenzli Note Added: 0012046
2014-08-26 11:26 dbuenzli Note Edited: 0012046
2014-08-26 11:30 dbuenzli Note Added: 0012047
2014-08-26 12:26 furuse Note Added: 0012048
2014-08-26 12:27 furuse Note Edited: 0012048
2014-08-26 12:27 furuse Note Edited: 0012048
2014-08-26 12:35 dbuenzli Note Added: 0012049
2014-08-26 14:43 shinwell Status new => acknowledged
2014-08-27 14:14 frisch Note Added: 0012053
2014-08-28 17:42 doligez Note Added: 0012054
2014-08-28 17:42 doligez Tag Attached: junior_job
2014-08-28 17:43 doligez Target Version => 4.02.1+dev
2014-09-04 00:25 doligez Target Version 4.02.1+dev => undecided
2014-09-15 15:58 doligez Note Added: 0012137
2014-09-15 15:58 doligez Target Version undecided => 4.03.0+dev

