Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not implicitly use ISO-8859-1 in Char.uppercase/lowercase and derived functions #6694

Closed
vicuna opened this issue Dec 7, 2014 · 14 comments
Assignees

Comments

@vicuna
Copy link

vicuna commented Dec 7, 2014

Original bug ID: 6694
Reporter: @whitequark
Assigned to: @gasche
Status: closed (set by @xavierleroy on 2016-12-07T10:46:59Z)
Resolution: fixed
Priority: normal
Severity: minor
Fixed in version: 4.03.0+dev / +beta1
Category: standard library
Related to: #5348 #6695
Parent of: #5732
Monitored by: @gasche @diml @hcarty

Bug description

Many strings that OCaml manipulates today--paths, UI labels, HTML text, database results, user input, and basically almost anything--are encoded in UTF-8. Using OCaml's casefolding breaks UTF-8 sequences. (This can be seen in e.g. #5732 and #5348)

I think it is worth considering whether these functions could be converted to only work on ASCII characters with codes <128, which would leave the rest of UTF-8 sequences intact.

@vicuna
Copy link
Author

vicuna commented Dec 7, 2014

Comment author: @whitequark

I actually think it would make sense to deprecate these functions entirely--say, in favor of Uucp--which is small and self-contained enough that there is hardly any reason to avoid it as a dependency.

(My personal view is that Uutf, Uunf and Uucp should be integrated in the stdlib and used in the compiler, but it sadly seems unlikely to happen.)

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @damiendoligez

They are used quite a lot in the compiler and tools, so we can't outright deprecate them.

As for turning them into ASCII-only, I'm in favor, but be aware that it'll silently break some existing code.

A (rather heavyweight) way out would be to deprecate them and add new functions for ASCII-only capitalization.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @whitequark

They're actually not used a lot in the compiler and tools (I've just grepped them). Deprecating them and adding an ASCII-only alternative, which is my preferred solution, would change less than 50 lines, not counting the Bytes/String changes itself. Of course, just removing them is impossible, given the significance of case in OCaml.

Given that the world is almost entirely UTF-8 based today, I'd argue that the code was probably already broken. Turning them ASCII-only is an acceptable solution, but I think that deprecating them will nudge people trying to use the API in the right direction.

And wasn't the indexing operator (which suffers from much the same fate) discussed to have string changed to bytes in its signature? That would be an excellent counterpart; indexing bytes makes sense, UTF-8 strings, less so.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @lpw25

indexing bytes makes sense, UTF-8 strings, less so.

I think it is worth bearing in mind that the string type represents a string of ASCII (well latin1 but that looks like being deprecated at some point) characters. Not a string of UTF-8 characters.

If you want to use UTF-8 strings they really need to have a different type to be used safely. For example, we allow string literals to be pattern matched. This is not remotely okay for a UTF-8 encoded string as it ignores all the various normalization issues.

For now the only UTF-8 related guarantee that should be associated with the string type should be that you can pass a string literal containing UTF-8 encoded characters to a function which will convert it into a proper unicode type. Without this guarantee all the unicode libraries become a bit useless.

So before we go around deprecating all ASCII based functionality from the String module, it is worth bearing in mind that the string type which the module operates on can probably never mean UTF-8 encoded string.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @whitequark

My point is that many strings in the OCaml already are implicitly UTF-8, from filenames to network protocols to user input. That OCaml does not or can not enforce the relevant invariants does not change the fact that much code operates under incorrect assumptions.

Pattern matching should be extended to handle UTF-8 properly as well, and the fact that you can't do it now is an argument either for more UTF-8 support in the compiler or typechecker/codegen plugins (probably former).

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @whitequark

(Normalization can e.g. be handled by {nfd||nfd} for the relatively rare case where you want a literal with non-canonical characters.)

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @lpw25

To make my point clearer. What I mean is that there is too much functionality associated with the string type which cannot be implemented for UTF-8 encoded strings for us to gradually change string to be UTF-8 encoded. What is required instead is a clean break (e.g. a text type and Text module). This means there is not much to be gained from deprecating functionality from the String module because it could not be implemented with UTF-8.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @whitequark

Ok. I agree. Then my proposal changes to deprecating anything non-ASCII from String in some way or another.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @lpw25

Pattern matching should be extended to handle UTF-8 properly as well

It would be very complicated for little gain.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @lpw25

Then my proposal changes to deprecating anything non-ASCII from String in some way or another.

Fully agree

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @whitequark

What I mean is it should be extended to cover the hypothetical text type, but perform a byte comparison with the literal. The majority of data exists in NFC and the need to normalize the data in order to match over it can be a documented quirk.

It is even possible to make Text.of_bytes and Text.concat (and others) auto-normalize to NFC by default, although there are some downsides.

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @lpw25

I think lack of normalization on matching would be sufficiently risky for it to better to just not support it. I also think that pattern matching on strings is normally a bad idea anyway.

(I also think that pattern matching on floats is pretty dubious for similar reasons, but it is probably too late to do anything about it now).

@vicuna
Copy link
Author

vicuna commented Dec 9, 2014

Comment author: @whitequark

Perhaps it would be sufficient to support matching on bytes (and thus sidestep the question of normalization).

@vicuna
Copy link
Author

vicuna commented Dec 21, 2014

Comment author: @gasche

whitequark's patch to deprecate String.uppercase and friends in favor of String.uppercase_ascii has been merged in trunk. This is the most conservative route (no change to existing functions), and some people would have wished the conveniently-named functions to get the saner behaviour by default.

If you think the somewhat larger scope of this issue (than having ascii-only stdlib functions) deserves more discussion, tell me and I'll reopen the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants