Version française
Home     About     Download     Resources     Contact us    
Browse thread
Storing UTF-8 in plain strings
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Michael Ekstrand <michael+ocaml@e...>
Subject: Re: Storing UTF-8 in plain strings
Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
> 
> - String concatenation is allowed.
> 
> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127.  There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
> 
> 
> So, can someone find any problems with this reasoning?  (Thanks in advance!)

It looks good to me.  Further, much of the functionality you're
forbidding yourself in String is provided by Extlib's UTF8 module
without additional dependencies if you do need it at some place in your
program.

You can also go ahead and use buffers, sprintf, etc., so long as
everything is valid UTF-8.  They won't care; they'll only see a sequence
of bytes.

- Michael