Storing UTF-8 in plain strings
From: Richard Jones <rich@a...>
Subject: Re: [Caml-list] Storing UTF-8 in plain strings
On Wed, Aug 12, 2009 at 10:36:56AM -0700, Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.

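Needless to say, don't use String.uppercase, String.lowercase,
String.capitalize, String.uncapitalize, Char.uppercase or
Char.lowercase.  These all assume ISO-8859-1 and work one byte at a
time, so they can rewrite bytes in the middle of a multi-byte UTF-8
sequence.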
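To make the failure concrete, here is a minimal sketch.  Since the
Latin-1 functions are deprecated in newer compilers, latin1_uppercase
below is a hypothetical stand-in written for this example; it assumes
the byte ranges that ISO-8859-1 case mapping implies and is not the
standard library's code.  The input is U+65E5 encoded as UTF-8.

```ocaml
(* Hypothetical stand-in for a Latin-1 uppercase function: it shifts
   a-z and the accented ranges 0xE0-0xF6 and 0xF8-0xFE down by 32. *)
let latin1_uppercase c =
  if (c >= 'a' && c <= 'z')
  || (c >= '\224' && c <= '\246')
  || (c >= '\248' && c <= '\254')
  then Char.chr (Char.code c - 32)
  else c

(* U+65E5 as UTF-8: the three bytes E6 97 A5. *)
let nichi = "\xE6\x97\xA5"

(* The lead byte E6 falls in the Latin-1 lowercase range and becomes C6,
   so the result is no longer well-formed UTF-8. *)
let mangled = String.map latin1_uppercase nichi

(* String.uppercase_ascii only touches a-z, so the bytes survive. *)
let safe = String.uppercase_ascii nichi

let () =
  assert (mangled = "\xC6\x97\xA5");
  assert (safe = nichi)
```

The ASCII-only variant is safe here because every byte of a multi-byte
UTF-8 sequence has its high bit set, so none of them is in a-z.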

I've written a number of applications which used UTF-8 extensively,
including one which worked entirely in Japanese, and I've never had a
problem.  Just avoid the bad String/Char functions.  Use either a
database or a module like Ulex/Camomile.  You'll be fine.
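As an illustration of why String.length is on the list of functions to
avoid, here is a small hand-rolled sketch.  utf8_length is a
hypothetical helper, not part of Ulex or Camomile, and it assumes the
string has already been validated as UTF-8 (for example by the lexer).
It counts code points by skipping continuation bytes, which all have
the bit pattern 10xxxxxx.

```ocaml
(* Hypothetical helper: count code points in an already-validated
   UTF-8 string by counting every byte that is not a continuation
   byte (continuation bytes match 10xxxxxx). *)
let utf8_length s =
  let n = ref 0 in
  String.iter (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n) s;
  !n

(* U+65E5 U+672C as UTF-8: two code points, six bytes. *)
let nihon = "\xE6\x97\xA5\xE6\x9C\xAC"

let () =
  assert (String.length nihon = 6);   (* bytes *)
  assert (utf8_length nihon = 2)      (* code points *)
```

For anything beyond counting, such as slicing or case mapping, a
dedicated Unicode library is the more robust choice.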

Rich.

-- 
Richard Jones
Red Hat