String.escaped returns strange results in Mac OS X + LANG=ja_JP.UTF-8 #6521

vicuna · 2014-08-26T02:08:59Z

Original bug ID: 6521
Reporter: furuse
Status: closed (set by @damiendoligez on 2015-03-11T19:15:48Z)
Resolution: fixed
Priority: normal
Severity: major
Version: 4.02.0+beta1 / +rc1
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: runtime system and C interface
Tags: junior_job
Related to: #6925
Monitored by: @gasche

Bug description

On Mac OS X with LANG=ja_JP.UTF-8, String.escaped does not escape some characters >= 0x80. It seems that ISO-8859-1 printable chars are not escaped in this setting. See janestreet/sexplib#11 for details. String.escaped is LANG-dependent, and in ja_JP.UTF-8 (and probably in other UTF-8 locales too) its results are not valid UTF-8. This is strange even given that OCaml's strings are unofficially in ISO-8859-1 rather than UTF-8.

The documentation comment of String.escaped does not clearly state which chars are escaped. For a long time I thought it escaped every char that is not ASCII printable, but apparently it does not in the above setting. The function internally calls caml_is_printable(), which uses setlocale(LC_CTYPE, ""). I am not an i18n guru, but the spec of setlocale says:

"C" Same as POSIX.

"" : Specifies an implementation-dependent native environment. For XSI-conformant systems, this corresponds to the value of the associated environment variables, LC_* and LANG; see the XBD specification, Locale and the XBD specification, Environment Variables .

It seems that isprint() is implementation-dependent if LC_CTYPE is set to "". This might explain what we see on Mac OS X + LANG=ja_JP.UTF-8.

I propose the following:

Clearly document what String.escaped returns. I think many believe that it returns strings that contain only ASCII printables.
Change setlocale(LC_CTYPE, "") in caml_is_printable to setlocale(LC_CTYPE, "C") so that it becomes implementation-independent.
Or simply hard-code the ASCII printable check (0x20 <= c && c <= 0x7E).

vicuna · 2014-08-26T02:10:35Z

Comment author: furuse

The spec of setlocale I found is here: http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html

vicuna · 2014-08-26T07:48:11Z

Comment author: @mshinwell

Jun, is this a new bug in 4.02?

vicuna · 2014-08-26T09:25:13Z

Comment author: @dbuenzli

The bug is also present before 4.02.

Reading the documentation of String.escaped, and given the encoding of OCaml's strings, I expect String.escaped to escape only the unprintable characters of ISO-8859-1 (i.e. the gray unlabelled boxes here: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout). That is, those that are not in (0x20 <= c && c <= 0x7E || 0xA0 <= c && c <= 0xFF).

So the current behaviour with LANG=ja_JP.UTF-8 seems fine to me; it's the behaviour with LANG=C that is not. Now, if historically (what did the original authors expect?) and statistically the function has given us the behaviour of LANG=C, then I also suggest hard-coding the check to ASCII printable characters.
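To make the two candidate rules concrete, here is a minimal sketch that applies both to the three bytes used in the example that follows. The names ascii_printable and latin1_printable are hypothetical helpers written for illustration; neither is the runtime's caml_is_printable.

```ocaml
(* Illustrative predicates only; neither is the runtime's implementation. *)

(* Printable ASCII: 0x20 <= c && c <= 0x7E. *)
let ascii_printable (ch : char) : bool =
  let c = Char.code ch in
  0x20 <= c && c <= 0x7E

(* Printable ISO-8859-1: the ASCII range plus 0xA0..0xFF. *)
let latin1_printable (ch : char) : bool =
  let c = Char.code ch in
  (0x20 <= c && c <= 0x7E) || (0xA0 <= c && c <= 0xFF)

let () =
  (* Print, for each byte, whether each rule would leave it unescaped. *)
  String.iter
    (fun ch ->
      Printf.printf "%3d  ascii=%b  latin1=%b\n"
        (Char.code ch) (ascii_printable ch) (latin1_printable ch))
    "\233\171\152"
```

Under the ISO-8859-1 rule only byte 152 falls outside the printable ranges, while under the ASCII rule all three bytes do.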

Example with my own locale (right behaviour to me) and then LANG=C on osx 10.9.4 with 4.01.0

$ echo $LANG
fr_CH.UTF-8
$ ocaml
OCaml version 4.01.0

# let s = String.escaped "\233\171\152";;
val s : string = "?\152"
# Char.code s.[0];;
- : int = 233
# Char.code s.[1];;
- : int = 171

$ export LANG=C
$ ocaml
OCaml version 4.01.0

# let s = String.escaped "\233\171\152";;
val s : string = "\233\171\152"

vicuna · 2014-08-26T09:30:41Z

Comment author: @dbuenzli

Well actually given the doc:

Return a copy of the argument, with special characters represented by escape sequences, following the lexical conventions of OCaml.

We can say that both answers in the example above are in fact valid, in the sense that interpreted by an OCaml compiler both strings denote the same sequence of bytes.
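One way to check this claim is to decode both forms with Scanf.unescaped from the standard library, which reverses OCaml's lexical escapes. The sketch below writes the two outputs as OCaml literals; it assumes the partly escaped form keeps bytes 233 and 171 raw, as the Char.code results above indicate.

```ocaml
let () =
  let original = "\233\171\152" in
  (* Form seen under the UTF-8 locale: bytes 233 and 171 raw, 152 escaped. *)
  let partly_escaped = "\233\171\\152" in
  (* Form seen under LANG=C: all three bytes written as decimal escapes. *)
  let fully_escaped = "\\233\\171\\152" in
  (* Both forms decode back to the same three bytes. *)
  assert (Scanf.unescaped partly_escaped = original);
  assert (Scanf.unescaped fully_escaped = original);
  print_endline "both forms decode to the same bytes"
```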

vicuna · 2014-08-26T10:26:08Z

Comment author: furuse

Yes, both outputs are valid... but I think we should choose one of them. It is very confusing that the OCaml runtime is affected by LANG.

I prefer escaping everything that is not an ASCII printable, since

I have not checked it thoroughly, but it seems to be the behaviour on Linux with any LANG, which many have long been used to.
Escaping everything that is not an ASCII printable makes the result valid as UTF-8 and in other encodings too.
Many use OCaml strings to store UTF-8 data, whether or not they know that strings are officially in ISO-8859-1. Escaping everything that is not an ASCII printable is meaningful to them too.
I am selfish and I live in Asia :-)

We could choose to escape only the non-printable ISO-8859-1 characters, but in that case I would like to have escaped_to_ASCII too.
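A function along the lines of the escaped_to_ASCII mentioned above could be sketched as follows. The name escaped_to_ascii and the choice of three-digit decimal escapes are assumptions made for illustration; this is not code from the standard library or from the eventual fix.

```ocaml
(* Hypothetical sketch: escape every byte that is not printable ASCII,
   following OCaml's lexical conventions, independently of the locale. *)
let escaped_to_ascii (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      match c with
      | '"' -> Buffer.add_string buf "\\\""
      | '\\' -> Buffer.add_string buf "\\\\"
      | '\n' -> Buffer.add_string buf "\\n"
      | '\t' -> Buffer.add_string buf "\\t"
      | '\r' -> Buffer.add_string buf "\\r"
      | '\b' -> Buffer.add_string buf "\\b"
      (* Remaining printable ASCII (0x20..0x7E) is copied unchanged. *)
      | ' ' .. '~' -> Buffer.add_char buf c
      (* Everything else becomes a three-digit decimal escape. *)
      | _ -> Buffer.add_string buf (Printf.sprintf "\\%03d" (Char.code c)))
    s;
  Buffer.contents buf

let () =
  (* The result contains only ASCII printables, so it is also valid UTF-8. *)
  print_endline (escaped_to_ascii "\233\171\152")
```

Because the check is hard-coded, the result does not depend on LANG or on the platform's isprint().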

vicuna · 2014-08-26T10:35:42Z

Comment author: @dbuenzli

Yes to everything (even you being selfish).

If we choose one, I'd also be absolutely in favour of escaping all characters that are not ASCII printable. UTF-8 compatibility of the returned string is the argument. There's no real point against that if we want a forward-looking solution.

vicuna · 2014-08-27T12:14:54Z

Comment author: @alainfrisch

I'm also in favor of the change. FWIW, we already have it in LexiFi's version, to avoid introducing different behaviors between platforms for such a basic function.

vicuna · 2014-08-28T15:42:43Z

Comment author: @damiendoligez

Strings are supposed to be encoding-agnostic, and certainly not officially iso-8859-1.

Definitely change it to escape all non-ascii-printable.

vicuna · 2014-09-15T13:58:01Z

Comment author: @damiendoligez

This will be an incompatible change, so I'm pushing it back to 4.03.

vicuna · 2015-03-11T19:15:48Z

Comment author: @damiendoligez

Fixed in trunk (commit 15901).

Note that I also changed Bytes.escaped and Char.escaped.
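As a quick check of the behaviour after the fix, the following sketch exercises all three functions on bytes >= 0x80. It assumes OCaml 4.03 or later; on earlier versions the assertions may fail depending on the locale.

```ocaml
let () =
  (* Assumes OCaml >= 4.03, where non-ASCII bytes are always escaped. *)
  let input = "\233\171\152" in
  assert (String.escaped input = "\\233\\171\\152");
  assert (Bytes.to_string (Bytes.escaped (Bytes.of_string input))
          = "\\233\\171\\152");
  assert (Char.escaped '\233' = "\\233");
  print_endline "ok"
```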

vicuna closed this as completed Mar 11, 2015

vicuna added stdlib newcomer-job labels Mar 14, 2019

vicuna added this to the 4.03.0 milestone Mar 14, 2019

vicuna mentioned this issue Mar 14, 2019

Garbage console output on Windows with UTF-8 console in caml_partial_flush and caml_putblock #6925

Closed

vicuna added the bug label Mar 20, 2019

pi8027 mentioned this issue Aug 12, 2019

Extraction generates invalid Haskell code when strings contain unicode characters coq/coq#7870

Open
