Storing UTF-8 in plain strings
From: Edgar Friendly <thelema314@g...>
Subject: Re: [Caml-list] Storing UTF-8 in plain strings
Dario Teixeira wrote:
> Hi,
> 
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data.  I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings.  Here are the
> assumptions I am making:
> 
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
> 
This is the weakest assumption of the four - Ulex might reject only some
kinds of malformed input and let others through without raising
MalFormed.  I'm no expert on Ulex, though...
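
If you don't want to depend on how strict Ulex is, one option is to check
the input yourself before parsing.  Below is a minimal sketch of a strict
validator that follows the well-formed byte ranges from the Unicode
standard (no overlong forms, no surrogates, nothing above U+10FFFF).  The
name is_valid_utf8 is made up for this example; it is not part of Ulex or
Extlib.

```ocaml
(* Hypothetical helper: true iff [s] is well-formed UTF-8. *)
let is_valid_utf8 (s : string) : bool =
  let n = String.length s in
  let byte i = Char.code s.[i] in
  (* true if byte [i] exists and lies in [lo, hi] *)
  let in_range i lo hi =
    i < n && (let b = byte i in b >= lo && b <= hi) in
  (* continuation bytes are 0b10xx_xxxx *)
  let cont i = in_range i 0x80 0xBF in
  let rec go i =
    if i >= n then true
    else
      let b = byte i in
      if b < 0x80 then go (i + 1)                      (* ASCII *)
      else if b >= 0xC2 && b <= 0xDF then              (* 2 bytes *)
        cont (i + 1) && go (i + 2)
      else if b = 0xE0 then                            (* no overlong *)
        in_range (i + 1) 0xA0 0xBF && cont (i + 2) && go (i + 3)
      else if b = 0xED then                            (* no surrogates *)
        in_range (i + 1) 0x80 0x9F && cont (i + 2) && go (i + 3)
      else if b >= 0xE1 && b <= 0xEF then              (* other 3 bytes *)
        cont (i + 1) && cont (i + 2) && go (i + 3)
      else if b = 0xF0 then                            (* no overlong *)
        in_range (i + 1) 0x90 0xBF && cont (i + 2) && cont (i + 3)
        && go (i + 4)
      else if b >= 0xF1 && b <= 0xF3 then              (* other 4 bytes *)
        cont (i + 1) && cont (i + 2) && cont (i + 3) && go (i + 4)
      else if b = 0xF4 then                            (* <= U+10FFFF *)
        in_range (i + 1) 0x80 0x8F && cont (i + 2) && cont (i + 3)
        && go (i + 4)
      else false                  (* 0x80..0xC1 and 0xF5..0xFF never lead *)
  in
  go 0
```

Running the whole source through a check like this first means the later
steps don't rely on which errors Ulex happens to catch.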

> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
> 
Yes, those functions work on bytes, not on characters.
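
To illustrate the difference, here is a small sketch that counts code
points by skipping continuation bytes.  It assumes the string is already
valid UTF-8, and utf8_length is just a name I picked for the example.

```ocaml
(* Count code points in a string assumed to be valid UTF-8.
   Every byte that is not a continuation byte (0b10xx_xxxx)
   starts a new code point. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter
    (fun c -> if Char.code c land 0xC0 <> 0x80 then incr n)
    s;
  !n

let () =
  let s = "caf\xc3\xa9" in   (* "café": the 'é' takes two bytes *)
  Printf.printf "bytes = %d, code points = %d\n"
    (String.length s) (utf8_length s)
  (* prints "bytes = 5, code points = 4" *)
```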

> - String concatenation is allowed.
> 
Yes, two valid UTF-8 strings concatenate into another valid UTF-8 string.

> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127.  There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
>
Yes, you can split on low bytes (below 0x80).  Every byte of a multibyte
character has its high bit set: the lead byte is 0b11xx xxxx and each
continuation byte is 0b10xx xxxx.
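
A quick sketch of the same idea, using the standard library's
String.split_on_char (OCaml 4.04 or later) as a stand-in for Extlib's
String.nsplit.  The helper names are made up for the example.

```ocaml
(* Continuation bytes have the form 0b10xx_xxxx. *)
let is_continuation c = Char.code c land 0xC0 = 0x80

(* A piece cut at a character boundary never begins with a
   continuation byte. *)
let starts_on_boundary s = s = "" || not (is_continuation s.[0])

let () =
  (* "naïve" and "日本", each followed by a newline *)
  let text = "na\xc3\xafve\n\xe6\x97\xa5\xe6\x9c\xac\n" in
  let lines = String.split_on_char '\n' text in
  (* no piece starts in the middle of a multibyte sequence *)
  assert (List.for_all starts_on_boundary lines);
  (* joining the pieces gives back the original bytes *)
  assert (String.concat "\n" lines = text)
```

Since 0x0a can never appear inside a multibyte sequence, every piece is
itself valid UTF-8 whenever the input was.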

E