Storing UTF-8 in plain strings
| Date: | -- (:) |
| From: | Edgar Friendly <thelema314@g...> |
| Subject: | Re: [Caml-list] Storing UTF-8 in plain strings |
Dario Teixeira wrote:
> Hi,
>
> I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
> on plain strings for processing and storing data. I *think* I can get away
> with using only the String module to handle this variable-length encoding
> as long as I am careful with the way I treat these strings. Here are the
> assumptions I am making:
>
> - If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
>   I can therefore assume in subsequent steps that the source is compliant.
>
This is the weakest assumption of the four: Ulex could parse and only
raise MalFormed on some errors. I'm no expert on Ulex, though...

> - It is forbidden to use String.get, String.sub, String.length, or other
>   functions where awareness of variable-length encoding is required.
>
Yes, those functions work on bytes, not on characters.

> - String concatenation is allowed.
>
Yes, two valid UTF-8 strings concatenate into another valid UTF-8 string.

> - Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
>   because in a multi-byte sequence all bytes have a value > 127. There is
>   therefore no chance of splitting a multi-byte sequence down the middle.
>
Yes, you can split on low bytes: multibyte characters start with
0b11xx xxxx and continue with 0b10xx xxxx.

E
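To make the byte-level points above concrete, here is a minimal sketch using only the standard library. The names classify, utf8_length and sample are made up for illustration, and the stdlib's String.split_on_char stands in for Extlib's String.nsplit. It assumes the input is already valid UTF-8 and does no validation of its own.

```ocaml
(* Role of a single byte inside a UTF-8 encoded string. *)
type byte_kind = Ascii | Lead | Continuation

let classify (c : char) : byte_kind =
  let b = Char.code c in
  if b land 0x80 = 0 then Ascii                (* 0b0xxx xxxx *)
  else if b land 0xC0 = 0x80 then Continuation (* 0b10xx xxxx *)
  else Lead                                    (* 0b11xx xxxx *)

(* Count characters by skipping continuation bytes.
   Assumes s is valid UTF-8; nothing is checked here. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if classify c <> Continuation then incr n) s;
  !n

(* Two lines; each contains one two-byte character
   (0xC3 0xA9 and 0xC3 0xAF respectively). *)
let sample = "caf\xc3\xa9\nna\xc3\xafve"

(* Splitting on the newline byte (0x0a) cannot cut a multi-byte
   sequence, because every byte of such a sequence is > 127. *)
let lines = String.split_on_char '\n' sample

let () =
  Printf.printf "bytes: %d, characters: %d, lines: %d\n"
    (String.length sample) (utf8_length sample) (List.length lines)
```

String.length reports bytes here, which is why it is on the "forbidden" list, while the split and a later String.concat both leave every multi-byte sequence intact.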