Storing UTF-8 in plain strings
| From: | Dario Teixeira <darioteixeira@y...> |
| Subject: | Storing UTF-8 in plain strings |
Hi,
I'm using Ulex + Menhir to parse UTF-8 encoded source code, and I'm relying
on plain strings for processing and storing data. I *think* I can get away
with using only the String module to handle this variable-length encoding
as long as I am careful with the way I treat these strings. Here are the
assumptions I am making:
- If the source is invalid UTF-8 in any way, Ulex will raise Utf8.MalFormed.
I can therefore assume in subsequent steps that the source is compliant.
- I must not use String.get, String.sub, String.length, or any other
function that indexes or counts bytes, since a character may span
several bytes in a variable-length encoding.
- String concatenation is allowed.
- Using Extlib's String.nsplit is okay if the separator is a newline (0x0a),
because every byte of a multi-byte sequence has a value > 127 (its high
bit is set). A byte equal to 0x0a can therefore only be a real newline,
and there is no chance of splitting a multi-byte sequence down the middle.
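
To make the assumptions above concrete, here is a minimal sketch using only
the standard library. The helper names (seq_len, is_cont, well_formed) are
made up for illustration, and String.split_on_char stands in for Extlib's
String.nsplit so that the example has no dependencies. The check is
structural only: it verifies lead and continuation bytes but does not reject
every overlong form or surrogate code point, so it is not a substitute for
the validation Ulex performs.

```ocaml
(* Length of the sequence introduced by lead byte [c], or 0 if [c]
   cannot start a sequence. Hypothetical helper, not part of any library. *)
let seq_len c =
  let b = Char.code c in
  if b < 0x80 then 1
  else if b >= 0xC2 && b <= 0xDF then 2
  else if b >= 0xE0 && b <= 0xEF then 3
  else if b >= 0xF0 && b <= 0xF4 then 4
  else 0

(* Continuation bytes have the bit pattern 10xxxxxx. *)
let is_cont c = Char.code c land 0xC0 = 0x80

(* Structural check: each lead byte is followed by the right number
   of continuation bytes. *)
let well_formed s =
  let n = String.length s in
  let rec conts i k =
    k = 0 || (i < n && is_cont s.[i] && conts (i + 1) (k - 1))
  in
  let rec go i =
    if i >= n then true
    else
      let len = seq_len s.[i] in
      len > 0 && conts (i + 1) (len - 1) && go (i + len)
  in
  go 0

(* Three lines containing 2-byte and 3-byte sequences. *)
let source = "caf\xC3\xA9\n\xE2\x82\xAC 5\nna\xC3\xAFve"

(* Splitting on the single byte 0x0a never lands inside a sequence. *)
let lines = String.split_on_char '\n' source

let () =
  assert (well_formed source);
  assert (List.for_all well_formed lines);
  (* Concatenating well-formed strings gives a well-formed string. *)
  assert (well_formed (String.concat "\n" lines));
  (* Byte-indexed slicing can cut a sequence in half. *)
  assert (not (well_formed (String.sub source 0 4)))
```

The last assertion shows why String.sub is on the forbidden list: the first
four bytes of the source end on the lead byte 0xC3 with its continuation
byte missing.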
So, can someone find any problems with this reasoning? (Thanks in advance!)
Best regards,
Dario Teixeira
P.S. And yes, I am aware that there are excellent libraries for handling
UTF-8 (like the Rope module in Batteries).