New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not implicitly use ISO-8859-1 in Char.uppercase/lowercase and derived functions #6694
Comments
Comment author: @whitequark I actually think it would make sense to deprecate these functions entirely--say, in favor of Uucp--which is small and self-contained enough that there is hardly any reason to avoid it as a dependency. (My personal view is that Uutf, Uunf and Uucp should be integrated in the stdlib and used in the compiler, but it sadly seems unlikely to happen.) |
Comment author: @damiendoligez They are used quite a lot in the compiler and tools, so we can't outright deprecate them. As for turning them into ASCII-only, I'm in favor, but be aware that it'll silently break some existing code. A (rather heavyweight) way out would be to deprecate them and add new functions for ASCII-only capitalization. |
Comment author: @whitequark They're actually not used a lot in the compiler and tools (I've just grepped them). Deprecating them and adding an ASCII-only alternative, which is my preferred solution, would change less than 50 lines, not counting the Bytes/String changes itself. Of course, just removing them is impossible, given the significance of case in OCaml. Given that the world is almost entirely UTF-8 based today, I'd argue that the code was probably already broken. Turning them ASCII-only is an acceptable solution, but I think that deprecating them will nudge people trying to use the API in the right direction. And wasn't the indexing operator (which suffers from much the same fate) discussed to have string changed to bytes in its signature? That would be an excellent counterpart; indexing bytes makes sense, UTF-8 strings, less so. |
Comment author: @lpw25
I think it is worth bearing in mind that the If you want to use UTF-8 strings they really need to have a different type to be used safely. For example, we allow string literals to be pattern matched. This is not remotely okay for a UTF-8 encoded string as it ignores all the various normalization issues. For now the only UTF-8 related guarantee that should be associated with the So before we go around deprecating all ASCII based functionality from the |
Comment author: @whitequark My point is that many Pattern matching should be extended to handle UTF-8 properly as well, and the fact that you can't do it now is an argument either for more UTF-8 support in the compiler or typechecker/codegen plugins (probably former). |
Comment author: @whitequark (Normalization can e.g. be handled by {nfd||nfd} for the relatively rare case where you want a literal with non-canonical characters.) |
Comment author: @lpw25 To make my point clearer. What I mean is that there is too much functionality associated with the |
Comment author: @whitequark Ok. I agree. Then my proposal changes to deprecating anything non-ASCII from String in some way or another. |
Comment author: @lpw25
It would be very complicated for little gain. |
Comment author: @lpw25
Fully agree |
Comment author: @whitequark What I mean is it should be extended to cover the hypothetical text type, but perform a byte comparison with the literal. The majority of data exists in NFC and the need to normalize the data in order to match over it can be a documented quirk. It is even possible to make Text.of_bytes and Text.concat (and others) auto-normalize to NFC by default, although there are some downsides. |
Comment author: @lpw25 I think lack of normalization on matching would be sufficiently risky for it to better to just not support it. I also think that pattern matching on strings is normally a bad idea anyway. (I also think that pattern matching on floats is pretty dubious for similar reasons, but it is probably too late to do anything about it now). |
Comment author: @whitequark Perhaps it would be sufficient to support matching on bytes (and thus sidestep the question of normalization). |
Comment author: @gasche whitequark's patch to deprecate String.uppercase and friends in favor of String.uppercase_ascii has been merged in trunk. This is the most conservative route (no change to existing functions), and some people would have wished the conveniently-named functions to get the saner behaviour by default. If you think the somewhat larger scope of this issue (than having ascii-only stdlib functions) deserves more discussion, tell me and I'll reopen the issue. |
Original bug ID: 6694
Reporter: @whitequark
Assigned to: @gasche
Status: closed (set by @xavierleroy on 2016-12-07T10:46:59Z)
Resolution: fixed
Priority: normal
Severity: minor
Fixed in version: 4.03.0+dev / +beta1
Category: standard library
Related to: #5348 #6695
Parent of: #5732
Monitored by: @gasche @diml @hcarty
Bug description
Many strings that OCaml manipulates today--paths, UI labels, HTML text, database results, user input, and basically almost anything--are encoded in UTF-8. Using OCaml's casefolding breaks UTF-8 sequences. (This can be seen in e.g. #5732 and #5348)
I think it is worth considering whether these functions could be converted to only work on ASCII characters with codes <128, which would leave the rest of UTF-8 sequences intact.
The text was updated successfully, but these errors were encountered: