Storing UTF-8 in plain strings
| From: | Richard Jones <rich@a...> |
| Subject: | Re: [Caml-list] Storing UTF-8 in plain strings |
On Wed, Aug 12, 2009 at 10:36:56AM -0700, Dario Teixeira wrote:
> Hi,
>
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data. I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings. Here are the
> assumptions I am making:
>
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
>
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.

Needless to say, don't use String.uppercase, String.lowercase,
String.capitalize, String.uncapitalize, Char.uppercase or
Char.lowercase. These all assume ISO-8859-1.

I've written a number of applications which used UTF-8 extensively,
including one which worked entirely in Japanese, and I've never had a
problem. Just avoid the bad String/Char functions. Use either a
database or a module like Ulex/Camomile. You'll be fine.

Rich.

-- 
Richard Jones
Red Hat
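
To illustrate why String.length is on the "avoid" list, here is a minimal
sketch using only the standard library. The helper name utf8_length is
hypothetical (it is not part of the String module, Ulex or Camomile), and it
assumes the input has already been validated as UTF-8, as in the first
assumption quoted above. It counts code points by skipping continuation
bytes, whereas String.length counts bytes.

```ocaml
(* Hypothetical helper: count the code points in a string that is
   ASSUMED to be valid UTF-8. Every byte that is not a continuation
   byte (bit pattern 10xxxxxx) starts a new code point. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter
    (fun c -> if (Char.code c) land 0xC0 <> 0x80 then incr n)
    s;
  !n

let () =
  (* "café", with é written out as its two UTF-8 bytes C3 A9 *)
  let s = "caf\xc3\xa9" in
  Printf.printf "bytes: %d, code points: %d\n"
    (String.length s) (utf8_length s)
  (* prints "bytes: 5, code points: 4" *)
```

Operations that treat the string as an opaque byte sequence, such as
concatenating two valid UTF-8 strings, do not have this problem, since they
never split a multi-byte sequence.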