Mantis Bug Tracker

View Issue Details
ID: 0005732
Project: OCaml
Category: OCaml general
View Status: public
Date Submitted: 2012-08-17 11:25
Last Update: 2014-12-07 17:39
Reporter: Ted
Assigned To: protz
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: no change required
Platform: Laptop
OS: Debian Unstable
OS Version: 3.2.0-3-amd64
Product Version: 3.12.1
Target Version:
Fixed in Version:
Summary: 0005732: The accented characters in strings are automatically uppercased
Description: (I have also reproduced this bug with version 3.10 of OCaml.)

A short example is worth a long speech:

$ ocaml
        Objective Caml version 3.12.1
# "Ô, mon brûlant zéphyr doré";;
- : string = "\195\148, mon br\195\187lant z\195\169phyr dor\195\169"
# String.lowercase "Ô, mon brûlant zéphyr doré";;
- : string = "\227\148, mon br\227\187lant z\227\169phyr dor\227\169"
# String.uppercase "Ô, mon brûlant zéphyr doré";;
- : string = "\195\148, MON BR\195\187LANT Z\195\169PHYR DOR\195\169"

I don't know whether the encoding problem is normal, but I am pretty sure that this behaviour is not: String.uppercase does nothing, which means that the system automatically transforms the letter "é" into "É", etc. This bug is present for many accented letters:

# String.uppercase "éèàâôû?ãõëäöÿçùò?" = "éèàâôû?ãõëäöÿçùò?";;
- : bool = true

but, quite surprisingly, not for all of them:

# String.uppercase "?" = "?";;
- : bool = false
# String.uppercase "?" = "?";;
- : bool = false

This problem happens even when I do not use my usual alias (ocaml="rlwrap ocaml") or my usual shell (zsh), and the bug also occurs when compiling OCaml code with ocamlc or ocamlopt.
Tags: No tags attached.
Attached Files:

- Relationships
child of 0006694 (new): Do not implicitly use ISO-8859-1 in Char.uppercase/lowercase and derived functions

-  Notes
(0007950)
Ted (reporter)
2012-08-17 11:28

The two characters I have found for which the problem does not appear are these:

http://fr.wikipedia.org/wiki/%E1%BA%80 (no article on the English Wikipedia)
http://en.wikipedia.org/wiki/%E1%BA%BC
(0007951)
protz (manager)
2012-08-17 11:34

From what OCaml prints, your Ô character uses two bytes, so I guess you're inputting UTF-8. OCaml still lives in the previous millennium and is not UTF-8-compatible, so I assume these uppercase and lowercase routines only work properly on Latin-1-encoded strings, unfortunately :).

I suggest you take a look at the Batteries project. It has a BatUTF8 module that provides some utf8 handling routines. If you need more advanced routines, Camomile is the Unicode library for OCaml.
(0007952)
protz (manager)
2012-08-17 11:35

OCaml version 4.00.0

# String.length "Ô";;
- : int = 2

(If you get the same results on your machine, then you're inputting utf8).
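
The length check above can be extended into a small byte-versus-code-point comparison. The following is only a sketch: it assumes the string is valid UTF-8, and `utf8_length` is an illustrative name rather than a standard library function.

```ocaml
(* Count code points in a string ASSUMED to be valid UTF-8: every byte that
   is not a continuation byte (bit pattern 10xxxxxx) starts a new code point.
   [utf8_length] is an illustrative name, not a standard library function. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

let () =
  (* "\195\148" are the two bytes the toplevel printed for "Ô". *)
  let o_circumflex = "\195\148" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length o_circumflex) (utf8_length o_circumflex)
```

If the two counts differ for a string typed at the prompt, the input contains multi-byte sequences, which is consistent with a UTF-8 terminal.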
(0007953)
Ted (reporter)
2012-08-17 11:45

It looks like I am inputting UTF-8, then. It does not surprise me that there are such encoding problems, but I really do not understand why I get things like:

# String.lowercase "é";;
- : string = "\227\169"
# "é";;
- : string = "\195\169"

Couldn't String.lowercase just ignore accented letters when it does not recognize them? As I do not need to actually print anything, the strange output does not bother me much, but the strange behaviour of String.lowercase does.
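
For readers who want the "leave non-ASCII bytes alone" behaviour asked about here: OCaml releases from 4.03 onwards provide `String.lowercase_ascii`, which only touches the ASCII letters. A minimal sketch, assuming OCaml 4.03 or later (it is not available in the 3.12.1 used in this report):

```ocaml
(* Requires OCaml >= 4.03 for String.lowercase_ascii.
   "Z\195\169PHYR" is the UTF-8 byte sequence of "ZéPHYR". *)
let lowered = String.lowercase_ascii "Z\195\169PHYR"

let () =
  (* Only 'Z', 'P', 'H', 'Y', 'R' change; bytes \195 and \169 are untouched. *)
  assert (lowered = "z\195\169phyr");
  print_endline (String.escaped lowered)
```

Note that this leaves accented capitals such as "É" unchanged as well, since their bytes are outside ASCII; real Unicode case mapping still needs a dedicated library such as the ones suggested earlier in the thread.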
(0007954)
dbuenzli (reporter)
2012-08-17 12:34
edited on: 2012-08-17 12:35

But it does recognize them: the String module interprets strings as Latin-1 encoded.

The behaviour is correct: in Latin-1, \195\169 is the sequence Ã©, which String.lowercase correctly maps to \227\169, the sequence ã©.

Consult the table on this page: http://en.wikipedia.org/wiki/ISO_8859-1
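
To make the byte-level effect concrete, here is a hand-written sketch of a Latin-1 case mapping applied byte by byte. It is an illustration under stated assumptions, not the standard library's source: the names `latin1_lowercase` and `latin1_uppercase` are hypothetical, and the letter ranges are read off the ISO 8859-1 table.

```ocaml
(* Byte-wise case mapping under the assumption that every byte is one
   ISO 8859-1 character. Hypothetical helper names, for illustration only. *)
let latin1_lowercase_char (c : char) : char =
  let code = Char.code c in
  if (code >= 0x41 && code <= 0x5A)      (* 'A'..'Z' *)
  || (code >= 0xC0 && code <= 0xD6)      (* Latin-1 capitals before 0xD7 *)
  || (code >= 0xD8 && code <= 0xDE)      (* Latin-1 capitals after 0xD7 *)
  then Char.chr (code + 32)
  else c

let latin1_uppercase_char (c : char) : char =
  let code = Char.code c in
  if (code >= 0x61 && code <= 0x7A)      (* 'a'..'z' *)
  || (code >= 0xE0 && code <= 0xF6)      (* Latin-1 small letters before 0xF7 *)
  || (code >= 0xF8 && code <= 0xFE)      (* Latin-1 small letters after 0xF7 *)
  then Char.chr (code - 32)
  else c

let latin1_lowercase (s : string) : string = String.map latin1_lowercase_char s
let latin1_uppercase (s : string) : string = String.map latin1_uppercase_char s

let () =
  (* UTF-8 "é" is \195\169: byte 195 is a Latin-1 capital, so lowercasing
     turns it into 227, while uppercasing leaves both bytes unchanged. *)
  assert (latin1_lowercase "\195\169" = "\227\169");
  assert (latin1_uppercase "\195\169" = "\195\169");
  (* %E1%BA%80 from the first note is \225\186\128: byte 225 is a Latin-1
     small letter, so uppercasing changes it to 193. *)
  assert (latin1_uppercase "\225\186\128" = "\193\186\128")
```

Under this reading, uppercasing appears to "do nothing" on most accented UTF-8 letters because their lead byte \195 is already a Latin-1 capital, while characters whose lead byte is \225 are altered.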

(0007955)
Ted (reporter)
2012-08-17 13:06

Aah, I get it. Well, sorry for the "wrong" bug report, then.

- Issue History
Date Modified | Username | Field | Change
2012-08-17 11:25 | Ted | New Issue |
2012-08-17 11:28 | Ted | Note Added: 0007950 |
2012-08-17 11:34 | protz | Note Added: 0007951 |
2012-08-17 11:35 | protz | Note Added: 0007952 |
2012-08-17 11:35 | protz | Status | new => resolved
2012-08-17 11:35 | protz | Resolution | open => no change required
2012-08-17 11:35 | protz | Assigned To | => protz
2012-08-17 11:45 | Ted | Note Added: 0007953 |
2012-08-17 12:34 | dbuenzli | Note Added: 0007954 |
2012-08-17 12:35 | dbuenzli | Note Edited: 0007954 |
2012-08-17 13:06 | Ted | Note Added: 0007955 |
2014-12-07 17:39 | gasche | Relationship added | child of 0006694

