The accented characters in strings are automatically uppercased #5732

vicuna · 2012-08-17T09:25:33Z

Original bug ID: 5732
Reporter: Ted
Assigned to: @protz
Status: closed (set by @xavierleroy on 2016-12-07T10:37:03Z)
Resolution: not a bug
Priority: normal
Severity: minor
Platform: Laptop
OS: Debian Unstable
OS Version: 3.2.0-3-amd64
Version: 3.12.1
Category: ~DO NOT USE (was: OCaml general)
Child of: #6694

Bug description

(I have also reproduced this bug with OCaml 3.10.)

A short example is worth a long speech:

$ ocaml
Objective Caml version 3.12.1

# "Ô, mon brûlant zéphyr doré";;

- : string = "\195\148, mon br\195\187lant z\195\169phyr dor\195\169"

# String.lowercase "Ô, mon brûlant zéphyr doré";;

- : string = "\227\148, mon br\227\187lant z\227\169phyr dor\227\169"

# String.uppercase "Ô, mon brûlant zéphyr doré";;

- : string = "\195\148, MON BR\195\187LANT Z\195\169PHYR DOR\195\169"

I don't know whether the encoding problem is normal, but I am pretty sure that this behaviour is not: String.uppercase leaves the accented letters unchanged, which would mean that the system has already transformed the letter "é" into "É", and so on. This happens for many accented letters:

# String.uppercase "éèàâôû?ãõëäöÿçùò?" = "éèàâôû?ãõëäöÿçùò?";;

- : bool = true

but, quite surprisingly, not for every one of them:

# String.uppercase "?" = "?";;

- : bool = false

# String.uppercase "?" = "?";;

- : bool = false

The problem persists without my usual alias (ocaml="rlwrap ocaml") and outside my usual shell (zsh), and it also occurs in code compiled with ocamlc or ocamlopt.

vicuna · 2012-08-17T09:28:33Z

Comment author: Ted

The two characters I have found for which the problem does not appear are:

http://fr.wikipedia.org/wiki/%E1%BA%80 (there is no English Wikipedia page for this one)
http://en.wikipedia.org/wiki/%E1%BA%BC

vicuna · 2012-08-17T09:34:50Z

Comment author: @protz

From what OCaml prints, your Ô character uses two bytes, so I guess you're inputting UTF-8. OCaml still lives in the previous millennium and is not UTF-8-aware, so I assume these uppercase and lowercase routines only work properly on Latin-1-encoded strings, unfortunately :).

I suggest you take a look at the Batteries project. It has a BatUTF8 module that provides some utf8 handling routines. If you need more advanced routines, Camomile is the Unicode library for OCaml.

vicuna · 2012-08-17T09:35:52Z

Comment author: @protz

OCaml version 4.00.0

# String.length "Ô";;

- : int = 2

(If you get the same results on your machine, then you're inputting utf8).
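
The same check can be made independent of the terminal or editor encoding by writing the bytes out explicitly. The following is only a sketch: `o_circumflex_utf8` and `byte_codes` are illustrative names that do not come from this thread, and `List.init` assumes OCaml 4.06 or later.

```ocaml
(* "Ô" (U+00D4) spelled with explicit byte escapes, so the example does not
   depend on how this source file is encoded. In UTF-8 it is the two bytes
   195 148, matching the "\195\148" printed in the bug description. *)
let o_circumflex_utf8 = "\195\148"

(* Hypothetical helper: the byte values of a string, in order. *)
let byte_codes (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

let () =
  (* String.length counts bytes, not characters. *)
  Printf.printf "length = %d\n" (String.length o_circumflex_utf8);
  List.iter (Printf.printf "%d ") (byte_codes o_circumflex_utf8);
  print_newline ()
```

A length of 2 for a single visible character is the sign that the input is a multi-byte encoding such as UTF-8 rather than Latin-1, where Ô would be the single byte 212.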

vicuna · 2012-08-17T09:45:27Z

Comment author: Ted

It looks like I am inputting UTF-8 then. It does not surprise me that there are such encoding problems, but I really do not get why I got things like:

# String.lowercase "é";;

- : string = "\227\169"

# "é";;

- : string = "\195\169"

Couldn't String.lowercase just ignore accented letters when it does not recognize them? As I do not need to actually print anything, the strange output does not bother me much, but the strange behaviour of String.lowercase does.

vicuna · 2012-08-17T10:34:42Z

Comment author: @dbuenzli

But it does recognize them: the String module interprets strings as Latin-1 encoded.

The behaviour is correct: in Latin-1, \195\169 is the sequence Ã©, which String.lowercase correctly maps to \227\169, the sequence ã©.

Consult the table on this page http://en.wikipedia.org/wiki/ISO_8859-1
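
To make the byte-level mapping concrete, here is a minimal sketch of a Latin-1 case conversion written by hand. The function names are illustrative and this is not the standard library's source. It relies only on the layout of the ISO 8859-1 table, where each uppercase letter sits 32 code points below its lowercase form.

```ocaml
(* Latin-1 uppercase letters are A-Z, 192-214 and 216-222; 215 is the
   multiplication sign and has no case. Adding 32 gives the lowercase form. *)
let latin1_lowercase_char (c : char) : char =
  match c with
  | 'A' .. 'Z' | '\192' .. '\214' | '\216' .. '\222' ->
      Char.chr (Char.code c + 32)
  | _ -> c

(* Latin-1 lowercase letters with an uppercase form are a-z, 224-246 and
   248-254; 247 is the division sign. Subtracting 32 gives the uppercase. *)
let latin1_uppercase_char (c : char) : char =
  match c with
  | 'a' .. 'z' | '\224' .. '\246' | '\248' .. '\254' ->
      Char.chr (Char.code c - 32)
  | _ -> c

let latin1_lowercase (s : string) : string = String.map latin1_lowercase_char s
let latin1_uppercase (s : string) : string = String.map latin1_uppercase_char s

let show (s : string) : unit =
  String.iter (fun c -> Printf.printf "%d " (Char.code c)) s;
  print_newline ()

let () =
  (* UTF-8 "é" is bytes 195 169, read as the Latin-1 pair "Ã©". *)
  show (latin1_lowercase "\195\169");
  (* 195 is already an uppercase Latin-1 letter, so nothing changes. *)
  show (latin1_uppercase "\195\169");
  (* Bytes E1 BA 80, taken from the first Wikipedia URL above: the lead
     byte 225 is a lowercase Latin-1 letter, so uppercasing alters it. *)
  show (latin1_uppercase "\225\186\128")
```

Under this reading, the accented letters whose UTF-8 encoding starts with byte 195 look "already uppercase" to a Latin-1 routine, while characters whose encoding starts with byte 225 do not, which is consistent with the two exceptions reported earlier in the thread.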

vicuna · 2012-08-17T11:06:19Z

Comment author: Ted

Aah, I get it. Well, sorry for the "wrong" bug report, then.

vicuna closed this as completed Dec 7, 2016

vicuna mentioned this issue Mar 14, 2019

Do not implicitly use ISO-8859-1 in Char.uppercase/lowercase and derived functions #6694

Closed

vicuna added the bug label Mar 20, 2019
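
For readers arriving here later: the behaviour Ted asked for, a case conversion that leaves unrecognized bytes alone, can be sketched with the ASCII-only functions of the standard library. This assumes OCaml 4.03 or later, where `String.lowercase_ascii` and `String.uppercase_ascii` are available. The names `verse`, `lowered` and `uppered` are illustrative.

```ocaml
(* The verse from the bug description, as UTF-8 bytes written explicitly. *)
let verse = "\195\148, mon br\195\187lant z\195\169phyr dor\195\169"

(* Only the ASCII letters A-Z and a-z are converted; every byte >= 128,
   including the UTF-8 bytes of the accented letters, is left untouched. *)
let lowered = String.lowercase_ascii verse
let uppered = String.uppercase_ascii verse

let () =
  (* %S prints each string in OCaml's escaped syntax. *)
  Printf.printf "%S\n%S\n" lowered uppered
```

This does not uppercase the accented letters themselves. Doing that on UTF-8 text still requires a Unicode-aware library such as the ones suggested earlier in the thread.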
