Mantis Bug Tracker

ID: 0006694
Project: OCaml
Category: standard library
View Status: public
Date Submitted: 2014-12-07 16:46
Last Update: 2016-12-07 11:46
Assigned To: gasche
Platform, OS, OS Version, Product Version, Target Version: (empty)
Fixed in Version: 4.03.0+dev / +beta1
Summary: 0006694: Do not implicitly use ISO-8859-1 in Char.uppercase/lowercase and derived functions
Description: Many strings that OCaml manipulates today--paths, UI labels, HTML text, database results, user input, and basically almost anything--are encoded in UTF-8. Using OCaml's case conversion functions breaks UTF-8 sequences. (This can be seen in e.g. [^] and [^])

I think it is worth considering whether these functions could be converted to only work on ASCII characters with codes <128, which would leave the rest of UTF-8 sequences intact.
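
To illustrate the breakage described above, here is a minimal sketch. `latin1_lowercase` is a hypothetical helper that mimics the Latin-1 rule of the legacy `Char.lowercase` (it is an assumption for illustration, not the stdlib source); `String.lowercase_ascii` is the ASCII-only function that later shipped in OCaml 4.03.

```ocaml
(* Hypothetical helper: lowercases ASCII letters and the Latin-1
   uppercase range, roughly as the legacy Char.lowercase did. *)
let latin1_lowercase c =
  if (c >= 'A' && c <= 'Z')
  || (c >= '\192' && c <= '\214')
  || (c >= '\216' && c <= '\222')
  then Char.chr (Char.code c + 32)
  else c

(* "CAFÉ" encoded in UTF-8: É (U+00C9) is the two bytes C3 89. *)
let cafe = "CAF\xc3\x89"

(* The Latin-1 rule rewrites the lead byte C3 to E3, which announces a
   three-byte sequence that never arrives: the result is not valid UTF-8. *)
let broken = String.map latin1_lowercase cafe

(* The ASCII-only function leaves every byte >= 128 untouched. *)
let intact = String.lowercase_ascii cafe

let () = Printf.printf "%S %S\n" broken intact
```

Note that the ASCII-only version keeps the string valid but does not lowercase É; full Unicode case mapping needs a library such as Uucp.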
Tags: No tags attached.
Attached Files

- Relationships
related to 0005348 (resolved): no more identifiers with accented characters
parent of 0005732 (closed, protz): The accented characters in strings are automatically uppercased
related to 0006695 (closed, whitequark): Do not treat paths as encoded in ISO-8859-1

-  Notes
whitequark (developer)
2014-12-07 16:54

I actually think it would make sense to deprecate these functions entirely--say, in favor of Uucp--which is small and self-contained enough that there is hardly any reason to avoid it as a dependency.

(My personal view is that Uutf, Uunf and Uucp should be integrated in the stdlib and used in the compiler, but it sadly seems unlikely to happen.)
doligez (administrator)
2014-12-09 23:28

They are used quite a lot in the compiler and tools, so we can't outright deprecate them.

As for turning them into ASCII-only, I'm in favor, but be aware that it'll silently break some existing code.

A (rather heavyweight) way out would be to deprecate them and add new functions for ASCII-only capitalization.
whitequark (developer)
2014-12-09 23:36
edited on: 2014-12-09 23:38

They're actually not used a lot in the compiler and tools (I've just grepped them). Deprecating them and adding an ASCII-only alternative, which is my preferred solution, would change fewer than 50 lines, not counting the Bytes/String changes themselves. Of course, just removing them is impossible, given the significance of case in OCaml.

Given that the world is almost entirely UTF-8 based today, I'd argue that the code was probably already broken. Turning them ASCII-only is an acceptable solution, but I think that deprecating them will nudge people trying to use the API in the right direction.

And wasn't there a discussion about changing string to bytes in the signature of the indexing operator (which suffers from much the same fate)? That would be an excellent counterpart; indexing bytes makes sense, UTF-8 strings, less so.
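
As a small illustration of why indexing fits bytes better than UTF-8 text, the sketch below indexes a UTF-8 encoded string. `utf8_length` is a hypothetical helper that assumes its input is valid UTF-8; it is not a stdlib function.

```ocaml
(* "été" encoded in UTF-8: each é (U+00E9) takes the two bytes C3 A9. *)
let s = "\xc3\xa9t\xc3\xa9"

(* String.length counts bytes, and s.[i] returns the i-th byte. *)
let byte_length = String.length s
let first_byte = s.[0]   (* the lead byte C3, not a whole character *)

(* Hypothetical helper: counts scalar values by skipping continuation
   bytes (10xxxxxx). Only meaningful on valid UTF-8 input. *)
let utf8_length str =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) str;
  !n

let () =
  Printf.printf "bytes: %d, scalar values: %d\n" byte_length (utf8_length s)
```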

lpw25 (developer)
2014-12-09 23:54

> indexing bytes makes sense, UTF-8 strings, less so.

I think it is worth bearing in mind that the `string` type represents a string of ASCII (well latin1 but that looks like being deprecated at some point) characters. Not a string of UTF-8 characters.

If you want to use UTF-8 strings they really need to have a different type to be used safely. For example, we allow string literals to be pattern matched. This is not remotely okay for a UTF-8 encoded string as it ignores all the various normalization issues.

For now the only UTF-8 related guarantee that should be associated with the `string` type should be that you can pass a string literal containing UTF-8 encoded characters to a function which will convert it into a proper unicode type. Without this guarantee all the unicode libraries become a bit useless.

So before we go around deprecating all ASCII based functionality from the `String` module, it is worth bearing in mind that the `string` type which the module operates on can probably never mean UTF-8 encoded string.
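
The normalization problem mentioned above can be sketched as follows. The two values below both render as "é" but differ byte for byte, so a string pattern matches only one of them. `describe` is a hypothetical function used only for illustration.

```ocaml
(* Precomposed form (NFC): U+00E9, encoded as C3 A9. *)
let nfc = "\xc3\xa9"

(* Decomposed form (NFD): U+0065 followed by U+0301 (combining acute),
   encoded as 65 CC 81. *)
let nfd = "e\xcc\x81"

(* String patterns compare bytes, so only the NFC spelling matches. *)
let describe = function
  | "\xc3\xa9" -> "matched"
  | _ -> "not matched"

let () = Printf.printf "%s / %s\n" (describe nfc) (describe nfd)
```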
whitequark (developer)
2014-12-09 23:59

My point is that many `string`s in OCaml already are implicitly UTF-8, from filenames to network protocols to user input. That OCaml does not or cannot enforce the relevant invariants does not change the fact that much code operates under incorrect assumptions.

Pattern matching should be extended to handle UTF-8 properly as well, and the fact that you can't do it now is an argument either for more UTF-8 support in the compiler or for typechecker/codegen plugins (probably the former).
whitequark (developer)
2014-12-10 00:01

(Normalization can e.g. be handled by {nfd||nfd} for the relatively rare case where you want a literal with non-canonical characters.)
lpw25 (developer)
2014-12-10 00:03
edited on: 2014-12-10 00:06

To make my point clearer: what I mean is that there is too much functionality associated with the `string` type which cannot be implemented for UTF-8 encoded strings for us to gradually change `string` to be UTF-8 encoded. What is required instead is a clean break (e.g. a `text` type and `Text` module). This means there is not much to be gained from deprecating functionality from the `String` module because it could not be implemented with UTF-8.

whitequark (developer)
2014-12-10 00:05

Ok. I agree. Then my proposal changes to deprecating anything non-ASCII from String in some way or another.
lpw25 (developer)
2014-12-10 00:05

> Pattern matching should be extended to handle UTF-8 properly as well

It would be very complicated for little gain.
lpw25 (developer)
2014-12-10 00:06

> Then my proposal changes to deprecating anything non-ASCII from String in some way or another.

Fully agree
whitequark (developer)
2014-12-10 00:09

What I mean is it should be extended to cover the hypothetical text type, but perform a byte comparison with the literal. The majority of data exists in NFC and the need to normalize the data in order to match over it can be a documented quirk.

It is even possible to make Text.of_bytes and Text.concat (and others) auto-normalize to NFC by default, although there are some downsides.
lpw25 (developer)
2014-12-10 00:13

I think lack of normalization on matching would be sufficiently risky for it to be better to just not support it. I also think that pattern matching on strings is normally a bad idea anyway.

(I also think that pattern matching on floats is pretty dubious for similar reasons, but it is probably too late to do anything about it now).
whitequark (developer)
2014-12-10 00:15

Perhaps it would be sufficient to support matching on bytes (and thus sidestep the question of normalization).
gasche (administrator)
2014-12-21 12:56

whitequark's patch to deprecate String.uppercase and friends in favor of String.uppercase_ascii has been merged in trunk. This is the most conservative route (no change to existing functions), and some people would have wished the conveniently-named functions to get the saner behaviour by default.

If you think the somewhat larger scope of this issue (than having ascii-only stdlib functions) deserves more discussion, tell me and I'll reopen the issue.
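
The conservative route described in this note (keep the old function unchanged, mark it deprecated, add an `_ascii` variant) can be sketched roughly as below. `Text_case` is a hypothetical module written for illustration; the actual declarations in the standard library may differ, and the Latin-1 body is an assumption about the legacy behaviour.

```ocaml
module Text_case : sig
  (* Old name: behaviour unchanged, but callers get a deprecation warning. *)
  val uppercase : string -> string
    [@@ocaml.deprecated "Use uppercase_ascii instead."]

  (* New name: only touches the ASCII letters 'a'..'z'. *)
  val uppercase_ascii : string -> string
end = struct
  (* Assumed legacy rule: ASCII plus the Latin-1 lowercase range. *)
  let uppercase =
    String.map (fun c ->
      if (c >= 'a' && c <= 'z')
      || (c >= '\224' && c <= '\246')
      || (c >= '\248' && c <= '\254')
      then Char.chr (Char.code c - 32)
      else c)

  let uppercase_ascii = String.map Char.uppercase_ascii
end

(* "naïve" in UTF-8: ï (U+00EF) is the two bytes C3 AF, left untouched. *)
let shout = Text_case.uppercase_ascii "na\xc3\xafve"
```

Calling `Text_case.uppercase` from client code still compiles, but the compiler reports a deprecation warning, which is the nudge discussed earlier in the thread.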

- Issue History
Date Modified Username Field Change
2014-12-07 16:46 whitequark New Issue
2014-12-07 16:54 whitequark Note Added: 0012702
2014-12-07 17:39 gasche Relationship added related to 0005348
2014-12-07 17:39 gasche Relationship added parent of 0005732
2014-12-07 17:40 gasche Relationship added related to 0006695
2014-12-09 23:28 doligez Note Added: 0012737
2014-12-09 23:28 doligez Status new => feedback
2014-12-09 23:36 whitequark Note Added: 0012738
2014-12-09 23:36 whitequark Status feedback => new
2014-12-09 23:37 whitequark Note Edited: 0012738
2014-12-09 23:38 whitequark Note Edited: 0012738
2014-12-09 23:54 lpw25 Note Added: 0012742
2014-12-09 23:59 whitequark Note Added: 0012743
2014-12-10 00:01 whitequark Note Added: 0012744
2014-12-10 00:03 lpw25 Note Added: 0012745
2014-12-10 00:05 whitequark Note Added: 0012746
2014-12-10 00:05 lpw25 Note Added: 0012747
2014-12-10 00:06 lpw25 Note Edited: 0012745
2014-12-10 00:06 lpw25 Note Added: 0012748
2014-12-10 00:09 whitequark Note Added: 0012749
2014-12-10 00:13 lpw25 Note Added: 0012750
2014-12-10 00:15 whitequark Note Added: 0012751
2014-12-21 12:56 gasche Note Added: 0012905
2014-12-21 12:56 gasche Status new => resolved
2014-12-21 12:56 gasche Resolution open => fixed
2014-12-21 12:56 gasche Assigned To => gasche
2014-12-21 12:57 gasche Fixed in Version => 4.03.0+dev / +beta1
2016-12-07 11:46 xleroy Status resolved => closed
2017-02-23 16:43 doligez Category OCaml standard library => standard library
