Storing UTF-8 in plain strings
From: Dario Teixeira <darioteixeira@y...>
Subject: Storing UTF-8 in plain strings
Hi,

I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
on plain strings for processing and storing data.  I *think* I can get away
with using only the String module to handle this variable-length encoding
as long as I am careful with the way I treat these strings.  Here are the
assumptions I am making:

- If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
  I can therefore assume in subsequent steps that the source is compliant.

- I must not use String.get, String.sub, String.length, or other functions
  that index or count bytes, since they know nothing about the
  variable-length encoding.

- String concatenation is allowed.

- Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
  because in a multi-byte sequence all bytes have a value > 127.  There is
  therefore no chance of splitting a multi-byte sequence down the middle.
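
To make the last three assumptions concrete, here is a minimal sketch.  It
uses the standard library's String.split_on_char (OCaml 4.04 or later) as a
stand-in for Extlib's String.nsplit, and the helper names utf8_length and
split_lines are hypothetical, introduced only for illustration.  It assumes
the input has already been validated as UTF-8.

```ocaml
(* Count code points in a string assumed to be valid UTF-8.
   Continuation bytes have the bit pattern 10xxxxxx, so every byte
   that does NOT match that pattern starts a new code point. *)
let utf8_length (s : string) : int =
  let n = ref 0 in
  String.iter (fun c -> if (Char.code c) land 0xC0 <> 0x80 then incr n) s;
  !n

(* Split on the newline byte 0x0a.  Every byte of a multi-byte UTF-8
   sequence is > 127, so this can never cut a sequence in the middle. *)
let split_lines (s : string) : string list =
  String.split_on_char '\n' s

(* "café", newline, "日本", newline, written as raw UTF-8 bytes. *)
let sample = "caf\xc3\xa9\n\xe6\x97\xa5\xe6\x9c\xac\n"

let lines = split_lines sample

let () =
  List.iter
    (fun l ->
      Printf.printf "bytes=%d code_points=%d\n"
        (String.length l) (utf8_length l))
    lines
```

String.length reports 5 for "café" while the code point count is 4, which
is why the byte-oriented functions need care.  Splitting and re-joining with
String.concat "\n" gives back the original bytes unchanged.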


So, can someone find any problems with this reasoning?  (Thanks in advance!)

Best regards,
Dario Teixeira

P.S. And yes, I am aware that there are excellent libraries for handling
     UTF-8 (like the Rope module in Batteries).